# Moneyball - Baseball Dataset

URL: https://www.openml.org/d/41021

## Content

1) [Data preprocessing](#dataproc)

2) [Model training and evaluation](#train) 
    
2.a) [Linear regression](#linear)

2.b) [Lasso Regression](#lasso)

2.c) [kNN](#knn)

2.d) [Random Forest](#rf)

---

In [None]:
# Basic imports
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# models for linear regression
from sklearn.linear_model import LinearRegression
from sklearn import linear_model

# models for Lasso regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# statistic tools
from sklearn import metrics
from statistics import stdev

# preprocessing
from sklearn import preprocessing

---

# 1) Data preprocessing

In [None]:
input_file = 'baseball.csv'
df_raw = pd.read_csv(input_file,  sep = ',', header = 0)
df_raw

# Description of data columns

RS ... Runs Scored

RA ... Runs Allowed

***RD ... Run differential (actually difference)***

W ... Wins

OBP ... On-Base Percentage

SLG ... Slugging Percentage

BA ... Batting Average

Playoffs (binary)

RankSeason

RankPlayoffs

G ... Games Played

OOBP ... Opponent On-Base Percentage

OSLG ... Opponent Slugging Percentage

In [None]:
col_dict = {'RS':  'Runs Scored', 
            'RA':  'Runs Allowed',
            'RD':  'Run differential (actually difference)',
            'W':  'Wins',
            'OBP':  'On-Base Percentage',
            'SLG':  'Slugging Percentage',
            'BA':  'Batting Average',
            'Playoffs': 'playoffs reached (binary)',
            'RankSeason': 'season rank',
            'RankPlayoffs': 'playoff rank',
            'G':  'Games Played',
            'OOBP':  'Opponent On-Base Percentage',
            'OSLG':  'Opponent Slugging Percentage'
           }

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)
            
def add_RD(df):
    # Run difference, computed column-wise (modifies df in place)
    df['RD'] = df['RS'] - df['RA']

# First look at the data

In [None]:
display_all(df_raw.tail().transpose())
print('#'*40)
display('Some more info')
print('#'*40)
df_raw.info()  # info() prints its summary itself and returns None

# Preprocessing for random forest

In [None]:
# Split into train and test
def split_simple(df, n): 
    '''n... number to split at'''
    return df[:n].copy(), df[n:].copy()

In [None]:
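df_prep = df_raw  # note: this is an alias, so the RD column is added to df_raw as well
add_RD(df_prep) # add run difference (RD = RS - RA)
display_all(df_prep.tail().transpose())
df_prep.info()  # info() prints its summary itself and returns None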

In [None]:
cols_to_drop = ['Team', 'League', 'Year', 'RankSeason', 'RankPlayoffs', 'Playoffs']
df_prep = df_prep.drop(cols_to_drop, axis=1)

# Fix missing values and type
df_prep.replace("?",0, inplace=True)
#df_prep = df_prep[df_prep.OOBP != 0]
df_prep[['OOBP','OSLG']] = df_prep[['OOBP','OSLG']].astype(float)

In [None]:
display(df_prep.columns.values)
display(df_prep.index)

In [None]:
display(df_prep)

In [None]:
df_rf = df_prep

# Bootstrapping:

Bootstrapping: Selecting rows from a dataset to generate a new dataset of the same size by picking WITH replacement.

Example:

    > DS = [1,2,3,4]
    > could turn into 
    > DS_bootstrapped = [3,2,4,4]
    
Consequences:

- Instances (rows) of the original set can end up duplicated (multiple times) in the resulting dataset.
- Some instances are left out entirely (on average about 1/3, more precisely about 36.8 % for large datasets) --> "Out-Of-Bag Dataset" (=OOB Dataset)

## Using the OOB Dataset

The OOB dataset was not used to construct the tree, so we can actually use it to test our tree and gain some insight into the error measure of the tree.
This error is called the "Out-Of-Bag Error" (OOB error).
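
As a minimal sketch of the idea (using plain NumPy on row indices rather than the baseball data; the sample size and seed are arbitrary assumptions), the following draws one bootstrap sample and determines which rows end up out-of-bag. For large datasets the OOB share tends towards 1/e ≈ 0.368.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

n_samples = 1000                     # hypothetical dataset size
indices = np.arange(n_samples)

# Draw n_samples row indices WITH replacement: this is one bootstrap sample.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Rows that were never drawn form the out-of-bag (OOB) set for this sample.
oob_mask = ~np.isin(indices, bootstrap_idx)
oob_idx = indices[oob_mask]

oob_fraction = oob_idx.size / n_samples
print(f"Unique rows in the bootstrap sample: {np.unique(bootstrap_idx).size}")
print(f"OOB rows: {oob_idx.size} ({oob_fraction:.1%})")
```

In a random forest, each tree gets its own bootstrap sample, so each tree also has its own OOB set on which it can be evaluated.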

# Preprocessing LinReg

In [None]:
# lmplot creates its own figure; recent seaborn versions require x, y and data as keyword arguments
for feature in ["RS", "RA", "OBP", "SLG", "BA"]:
    sns.lmplot(x=feature, y="W", data=df_prep)

df_lin = df_prep


# Preprocessing LassoReg

No special preprocessing is needed for Lasso regression; it reuses the features prepared for linear regression.

# Preprocessing kNN

In [None]:
from sklearn.impute import SimpleImputer

# Work on a copy so that df_raw itself is not modified in place
df_knn = df_raw.copy()

# Impute the missing values ('?') in the OOBP and OSLG columns with the constant 0.3, which serves as an approximate mean value
imputerQ1 = SimpleImputer(missing_values='?', strategy='constant', fill_value=0.3)
imputerQ1 = imputerQ1.fit(df_knn[['OOBP', 'OSLG']])
df_knn[['OOBP', 'OSLG']] = imputerQ1.transform(df_knn[['OOBP', 'OSLG']])

# Impute the missing values in the RankSeason column with 6: the idea is that every team that missed the playoffs ranked worse than rank 5
imputerQ2 = SimpleImputer(missing_values='?', strategy='constant', fill_value=6)
imputerQ2 = imputerQ2.fit(df_knn[['RankSeason']])
df_knn[['RankSeason']] = imputerQ2.transform(df_knn[['RankSeason']])

# Encode the league as a number (NL -> 1, AL -> 0)
df_knn['League'] = df_knn['League'].map({'NL': 1, 'AL': 0})

# Drop columns that are of no interest for the model
df_knn = df_knn.drop(['RankPlayoffs', 'Team', 'Year'], axis=1)


In [None]:

display(df_knn)

---
# 2) Model training and evaluation
---

# a) Linear Regression

In [None]:
Y = df_lin[['W']]
X = df_lin[['RS','RA','OBP','SLG','BA']]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 0)

In [None]:
linreg = LinearRegression()  # `normalize` was removed in scikit-learn 1.2; it does not change OLS predictions
linreg.fit(X_train,Y_train)

In [None]:
print("Coefficients: ", linreg.coef_)
score = linreg.score(X_test,Y_test)
print("Linear regression model score (R^2): ",score)
Y_lin_pred = linreg.predict(X_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_lin_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_lin_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_lin_pred)))

In [None]:
sns.distplot(Y_test)
sns.distplot(Y_lin_pred, color="red")

In [None]:
sns.distplot(Y_test-Y_lin_pred)

# b) Lasso Regression

In [None]:
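# `normalize` was removed from Lasso in scikit-learn 1.2; standardize in a pipeline instead (the best alpha may differ from normalize=True)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lasso = make_pipeline(StandardScaler(), Lasso())
parameters = {'lasso__alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,2,5,10,20,30,35,40,45,50,55,100]}
lasso_regressor = GridSearchCV(lasso,parameters,scoring = 'neg_mean_squared_error',cv = 5)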

In [None]:
lasso_regressor.fit(X_train,Y_train)
print("Lasso regression model score (negative MSE on the test set): ", lasso_regressor.score(X_test, Y_test))
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

In [None]:
Y_lasso_pred = lasso_regressor.predict(X_test)

In [None]:
sns.distplot(Y_test)
sns.distplot(Y_lasso_pred, color="red")

In [None]:
# The predictions were made for X_test, so they must be compared with Y_test (not Y_train)
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_lasso_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_lasso_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_lasso_pred)))

# c) kNN

In [None]:
train_knn, test_knn = train_test_split(df_knn, test_size=0.3)

x_train_knn = train_knn.drop('W', axis=1)
y_train_knn = train_knn['W']

x_test_knn = test_knn.drop('W', axis=1)
y_test_knn = test_knn['W']


In [None]:
#Scaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))

# Fit the scaler on the training data only ...
x_train_knn_scaled = scaler.fit_transform(x_train_knn)
x_train_knn = pd.DataFrame(x_train_knn_scaled)

# ... and reuse the fitted scaler for the test data, so that no test statistics leak into the scaling
x_test_knn_scaled = scaler.transform(x_test_knn)
x_test_knn = pd.DataFrame(x_test_knn_scaled)

In [None]:
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt

<a id='rf'></a>

In [None]:
rmse_val_knn = [] # to store rmse values for different k
for k in range(1, 26):  # k = 1..25
    model = neighbors.KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train_knn, y_train_knn)
    pred = model.predict(x_test_knn)
    error = sqrt(mean_squared_error(y_test_knn, pred))
    rmse_val_knn.append(error)
    print("RMSE for k={}: {}".format(k, error))
    print("R^2 for k={}: {}\n".format(k, model.score(x_test_knn, y_test_knn)))

In [None]:
plt.figure(figsize=(15,8))
plt.plot(range(1,26), rmse_val_knn, color='blue', linestyle='dashed', marker='o',
        markerfacecolor='red', markersize=5)
plt.title('RMSE vs. k-Value')
plt.xlabel('k')
plt.ylabel('RMSE')

## Optimizing kNN-search for optimal k-Value via Gridsearch

In [None]:
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors': range(1, 26)}  # k = 1..25, the same range as in the manual loop

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=100)
model.fit(x_train_knn, y_train_knn)
print("Best k-Value is: ", model.best_params_['n_neighbors'])

In [None]:
model_cv = neighbors.KNeighborsRegressor(n_neighbors=model.best_params_['n_neighbors'])
model_cv.fit(x_train_knn, y_train_knn)
pred_cv = model_cv.predict(x_test_knn)  # predict with the refitted kNN model, not the grid-search object
sns.distplot(y_test_knn)
sns.distplot(pred_cv, color='red')


In [None]:
sns.distplot(y_test_knn-pred_cv)

# d) Random Forest

In [None]:
# Imports for RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split
from IPython.display import display

In [None]:
import math
def rmse(x,y): 
    return math.sqrt(((x-y)**2).mean())

def print_score(m, X_train, X_valid, y_train, y_valid, score=None):
    '''Report train/validation RMSE plus the value of m.score().'''
    # m.score() returns R² for a plain regressor, but the configured metric for a
    # search object such as GridSearchCV; infer the metric if it is not given.
    if score is None:
        score = getattr(m, 'scoring', None)
    res = {
        'RMS(train)': rmse(m.predict(X_train), y_train),
        'RMS(valid)': rmse(m.predict(X_valid), y_valid)}
    if score=='neg_mean_squared_error':
        # negated MSE -> RMSE
        res['RMSE from score'] = [np.sqrt(-m.score(X_train, y_train)), np.sqrt(-m.score(X_valid, y_valid))]
    elif score=='pos_mean_squared_error':
        res['RMSE from score'] = [np.sqrt(m.score(X_train, y_train)), np.sqrt(m.score(X_valid, y_valid))]
    else:
        res['Model_Score=r²'] = [m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res['oob_score_'] = m.oob_score_
    display(res)
    return res

# Feature importance
from prettytable import PrettyTable as PT # pip install prettytable
def print_RF_featureImportance(rf, X):
    table = PT()
    table.field_names = ['Feature', 'Score', 'Comment']
    for name, score in zip(X.columns.values, rf.feature_importances_):
        print(f"{name}: {score:.5f}\t\t... {col_dict[name]}")
        table.add_row([name, round(score, ndigits=4), col_dict[name]])
    print(table)

def print_GridSearchResult(grid):
    print(grid.best_params_)
    print(grid.best_estimator_)

In [None]:
# Split for random forest
rnd_state = 42
ratio = 0.2 # test/num_samples
#####
num_instances, _ = df_rf.shape
print(f"From {num_instances} using {num_instances*ratio:.0f} for testing and {num_instances*(1-ratio):.0f} for training. Ratio = {ratio*100:.2f}%")
X, y = (df_rf.drop(['W', 'RD'], axis=1), df_rf.W)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = ratio, random_state = rnd_state)
display(X_test)

In [None]:
before = 0

In [None]:
# Simple training of RFRegressor
n_cores = 4
rf_W = RandomForestRegressor(n_jobs=n_cores)
# Team and League were dropped and the '?' placeholders converted during preprocessing, so the fit should no longer fail on string values
rf_W.fit(X_train, y_train)
print("Before:")
display(before)
print("Now:")
before = print_score(rf_W, X_train, X_test, y_train, y_test)


In [None]:
print_RF_featureImportance(rf_W, X_train)

In [None]:
rf_W_prediction = rf_W.predict(X_test)

In [None]:
sns.distplot(y_test)
sns.distplot(rf_W_prediction, color="red")

In [None]:
sns.distplot(y_test-rf_W_prediction)

In [None]:
n_cores = 4
number_of_trees = 1000 # default = 100
rf = RandomForestRegressor(n_jobs=n_cores, n_estimators=number_of_trees, bootstrap=True) #, verbose=1)

rf.fit(X_train, y_train)
print("Before:")
display(before)
print("Now:")
before = print_score(rf, X_train, X_test, y_train, y_test)
print()
print("Feature importance")
print_RF_featureImportance(rf, X_train)
rf_RD = rf

In [None]:
rfRD_prediction = rf_RD.predict(X_test)

In [None]:
sns.distplot(y_test)
sns.distplot(rfRD_prediction, color="red")

In [None]:
sns.distplot(y_test-rfRD_prediction)

# Optimize Hyperparameters via GridSearch

Instead of tuning the hyperparameters by hand, we let `GridSearchCV` search over them.

## Notes on the RandomForestRegressor from scikit-learn
-----
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets. To
reduce memory consumption, the complexity and size of the trees should be
controlled by setting those parameter values.
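
To illustrate the note above, here is a small sketch that compares the total number of tree nodes in an unrestricted forest with one limited via `max_depth` and `min_samples_leaf`. It runs on synthetic data from `make_regression` (an assumption made to keep the example self-contained, not the baseball dataset), and the helper `total_nodes` as well as the chosen limits are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the baseball dataset).
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

def total_nodes(forest):
    """Sum the node counts of all trees in a fitted forest (hypothetical helper)."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)

# Default settings: trees are grown until the leaves are (nearly) pure.
unrestricted = RandomForestRegressor(n_estimators=20, random_state=0)
unrestricted.fit(X_demo, y_demo)

# Restricted trees: limited depth and a minimum number of samples per leaf.
restricted = RandomForestRegressor(
    n_estimators=20, max_depth=5, min_samples_leaf=5, random_state=0
)
restricted.fit(X_demo, y_demo)

nodes_unrestricted = total_nodes(unrestricted)
nodes_restricted = total_nodes(restricted)
print(f"Nodes (unrestricted): {nodes_unrestricted}")
print(f"Nodes (restricted):   {nodes_restricted}")
```

Smaller trees need less memory and train faster, but they may also fit the data less closely, so these limits are worth including in a hyperparameter search.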

## Number of variables/features per tree --> 'max_features'

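A good starting point might be *the square root of the number of features presented to the tree*; then test some values below and above that starting point. This heuristic is most common for classification; for regression, larger values up to all features are also common.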

## Number of trees in the forest --> 'n_estimators'

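The more the merrier: more trees generally give more stable predictions, with diminishing returns and longer training times.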
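
A rough way to see the effect of `n_estimators` is to track the out-of-bag R² (`oob_score_`) for forests of different sizes. The sketch below again uses synthetic data from `make_regression` as an assumption, and the tree counts are arbitrary; the actual numbers on the baseball data will differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the baseball dataset).
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

oob_scores = {}
for n_trees in [25, 100, 400]:
    forest = RandomForestRegressor(
        n_estimators=n_trees, bootstrap=True, oob_score=True, random_state=0
    )
    forest.fit(X_demo, y_demo)
    # oob_score_ is the R² computed only on rows each tree did not see during training
    oob_scores[n_trees] = forest.oob_score_
    print(f"n_estimators={n_trees:4d}  OOB R^2={forest.oob_score_:.4f}")
```

Once the OOB score stops improving noticeably, adding more trees mainly costs training time.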

In [None]:
from numpy import sqrt
num_features = X.shape[1]
print(num_features)
sqrt_num_features = int(round(sqrt(num_features)))  # max_features expects an integer count
sqrt_num_features

In [None]:
from sklearn.model_selection import GridSearchCV
n_cores = 4
# but since we dont have that many features...we are just gonna brute force it :D
param_grid = [
    {
        'n_estimators': [3, 10, 30, 100, 1000], 'max_features': list(range(1, num_features + 1))
    }
#,{'bootstrap': [False], 'n_estimators': [3, 30, 100, 1000], 'max_features': [2, 3, 4]},
]
k = 10
forest_reg = RandomForestRegressor(n_jobs=n_cores)
grid_search = GridSearchCV(forest_reg, param_grid, n_jobs=n_cores , cv=k, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)


In [None]:
print_GridSearchResult(grid_search)
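print(grid_search.scorer_)  # the scorer is a callable expecting (estimator, X, y), so print it instead of calling it without arguments
scores = grid_search.score(X_test, y_test)  # negative MSE on the test set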
print_score(grid_search, X_train, X_test, y_train, y_test)

---
# Save model and DF
---

In [None]:
# Dump model
import joblib
import os

os.makedirs('tmp', exist_ok=True)
joblib.dump(rf_RD, "tmp/rf_RD.pkl")
# To load the model
# my_model_loaded = joblib.load("tmp/rf_RD.pkl")

In [None]:
import os
os.makedirs('tmp', exist_ok=True)
# Save the raw DataFrame and read it back from the same path
df_raw.to_feather('tmp/raw')
df_raw = pd.read_feather('tmp/raw')