# Data preprocessing and pipelines
We explore the performance of several regression models on a real-world dataset, [MoneyBall](https://www.openml.org/d/41021). See the description on OpenML for more information. In short, this dataset captures season-level performance statistics for baseball teams. The regression task is to accurately predict the number of runs scored (`RS`) and to understand which factors are the most important.

In [2]:
# General imports
%matplotlib inline
import pandas as pd
import openml as oml
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from tqdm.notebook import tqdm

In [3]:
# Download MoneyBall data from OpenML
moneyball = oml.datasets.get_dataset(41021)
# Get the pandas dataframe (default)
X, y, _, attribute_names = moneyball.get_data(target=moneyball.default_target_attribute)

For this first quick analysis, we simply impute the missing values using the median. Removing all instances with missing values is not really an option, because some features are missing for many instances: we would have to discard a lot of data.
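To see why dropping rows is costly, it can help to compare the share of missing values per column with the number of rows that would survive `dropna`. The sketch below uses a small hand-typed frame (`toy`, a hypothetical stand-in) rather than the full MoneyBall data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Small hand-typed stand-in for the real data (illustrative values only)
toy = pd.DataFrame({
    "W": [81, 94, 93, 69, 61],
    "RankSeason": [np.nan, 4.0, 5.0, np.nan, np.nan],
    "RankPlayoffs": [np.nan, 5.0, 4.0, np.nan, np.nan],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)

# Number of rows that would remain if every incomplete row were dropped
rows_kept = len(toy.dropna())
print(f"{rows_kept} of {len(toy)} rows have no missing values")
```

Running the same two calls on `X` would show how much of the real dataset a row-wise deletion would discard.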

## Impute Missing Values

In [4]:
X.head()

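| | Team | League | Year | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 |
| 1 | ATL | NL | 2012 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 |
| 2 | BAL | AL | 2012 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 |
| 3 | BOS | AL | 2012 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 |
| 4 | CHC | NL | 2012 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 |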


In [11]:
y

0       734
1       700
2       712
3       734
4       613
       ... 
1227    705
1228    706
1229    878
1230    774
1231    599
Name: RS, Length: 1232, dtype: int64

In [5]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [6]:
imputer.fit_transform(X[["RankSeason", "RankPlayoffs"]])

array([[3., 3.],
       [4., 5.],
       [5., 4.],
       ...,
       [1., 2.],
       [3., 3.],
       [3., 3.]])
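The imputer learns one median per column during `fit` and reuses it in `transform`. A minimal sketch on a made-up array (`ranks`, not the MoneyBall columns) shows where the learned values are stored:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up two-column array with missing entries (illustrative values only)
ranks = np.array([
    [np.nan, np.nan],
    [4.0, 5.0],
    [5.0, 4.0],
    [1.0, 2.0],
    [np.nan, 3.0],
])

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(ranks)

# Per-column medians learned during fit (NaNs are ignored when computing them)
print(imputer.statistics_)
# Array with every NaN replaced by the median of its column
print(filled)
```

Because the medians are stored on the fitted object, placing the imputer inside a pipeline lets cross-validation recompute them on each training fold only.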

## Exercise 1: Build a pipeline

Implement a function `build_pipeline` that does the following:
- Impute missing values in the numerical features by replacing NaNs with the feature median.
- Encode the categorical features using one-hot encoding.
- If the argument `scaling` is `True`, also scale the numerical features using standard scaling.
- Attach the given regression model to the end of the pipeline.

In [7]:
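def build_pipeline(regressor, categorical, scaling=False):
    # One-hot encode the categorical features; ignore categories unseen during fit
    cat_pipe = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
    # Impute missing numerical values with the feature median, as specified above
    num_steps = [SimpleImputer(strategy='median')]
    if scaling:
        # Optionally standardize the numerical features after imputation
        num_steps.append(StandardScaler())
    num_pipe = make_pipeline(*num_steps)
    # Columns not listed as categorical are treated as numerical (remainder)
    transform = make_column_transformer((cat_pipe, categorical), remainder=num_pipe)
    # Give a name to the regressor so that we can tune it more easily
    return Pipeline(steps=[('preprocess', transform), ('reg', regressor)])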
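Naming the final step `reg` means its hyperparameters can be addressed as `reg__<parameter>` when tuning. The following is a minimal sketch of that idea, assuming synthetic data and a simplified two-step pipeline rather than `build_pipeline` itself; the grid values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only to make the example runnable
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Simplified pipeline with the same step names as above
pipe = Pipeline(steps=[("preprocess", StandardScaler()), ("reg", Ridge())])

# "<step name>__<parameter>" addresses a parameter of a named pipeline step
param_grid = {"reg__alpha": [0.01, 1.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
```

With `make_pipeline` the step name would instead be derived from the class name (for example `ridge__alpha`), which changes whenever the regressor is swapped.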

## Exercise 2: Test the pipeline
Test the pipeline by evaluating linear regression (without scaling) on the dataset, using 5-fold cross-validation and the $R^2$ score.

In [9]:
### Model solution
categorical = ["Team","League"]
regressor = LinearRegression()
pipe = build_pipeline(regressor, categorical)
scores = cross_val_score(pipe, X, y)
print("Cross-validated R^2 score for {}: {:.2f}".format(regressor.__class__.__name__, scores.mean()))

Cross-validated R^2 score for LinearRegression: 0.92
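The call above relies on defaults: in recent scikit-learn versions (0.22 and later, as far as I know), `cross_val_score` uses 5 folds, and for regressors the default score is $R^2$. A small sketch on synthetic data, assuming those defaults, makes both settings explicit and checks that the results agree:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to make the example runnable
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Relying on the defaults (number of folds and scoring left unspecified)
default_scores = cross_val_score(LinearRegression(), X_toy, y_toy)

# Spelling out the same setup explicitly
explicit_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")

print(default_scores.mean(), explicit_scores.mean())
```

Writing `cv=5, scoring="r2"` explicitly keeps the evaluation stable if the defaults ever change.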


## Exercise 3: A first benchmark
Evaluate the following algorithms in their default settings, both with and without scaling, and interpret the results:  
- Linear regression
- Ridge
- Lasso
- SVM (RBF kernel)
- Random forest
- Gradient boosting

In [10]:
### Model solution
models = [LinearRegression(), Ridge(), Lasso(), RandomForestRegressor(), GradientBoostingRegressor(), SVR()]
for m in tqdm(models):  # tqdm displays a progress bar for the loop
    pipe = build_pipeline(m,categorical)
    scores = cross_val_score(pipe, X, y)
    print("R^2 score for {}: {:.2f}".format(m.__class__.__name__, scores.mean()))
    pipe = build_pipeline(m,categorical, scaling=True)
    scores = cross_val_score(pipe, X, y)
    print("R^2 score for {} (scaled): {:.2f}".format(m.__class__.__name__, scores.mean()))

R^2 score for LinearRegression: 0.92
R^2 score for LinearRegression (scaled): 0.92
R^2 score for Ridge: 0.80
R^2 score for Ridge (scaled): 0.92
R^2 score for Lasso: 0.81
R^2 score for Lasso (scaled): 0.92
R^2 score for RandomForestRegressor: 0.89
R^2 score for RandomForestRegressor (scaled): 0.89
R^2 score for GradientBoostingRegressor: 0.91
R^2 score for GradientBoostingRegressor (scaled): 0.91
R^2 score for SVR: -0.46
R^2 score for SVR (scaled): 0.27
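One plausible reading of the Ridge and Lasso gap is that their penalty acts on the raw coefficients: a feature with a very small numeric range needs a large coefficient, which the penalty then shrinks heavily. Tree-based models split on thresholds and are largely insensitive to feature scale, which is consistent with their unchanged scores. The sketch below illustrates the Ridge effect on synthetic data; it is an assumption about the mechanism, not a verified diagnosis of the MoneyBall results.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One synthetic feature with a very small spread and a target that depends
# on it strongly, so the true coefficient is large (3000).
x = rng.normal(loc=0.3, scale=0.02, size=(200, 1))
y_toy = 3000 * x[:, 0] + rng.normal(scale=5.0, size=200)

# Without scaling, the default L2 penalty (alpha=1) shrinks the large coefficient
r2_raw = cross_val_score(Ridge(), x, y_toy).mean()

# After standard scaling, the same penalty is small relative to the data
r2_scaled = cross_val_score(make_pipeline(StandardScaler(), Ridge()), x, y_toy).mean()

print(f"Ridge R^2 raw: {r2_raw:.2f}, scaled: {r2_scaled:.2f}")
```

The scaled pipeline should score clearly higher here, mirroring the direction of the Ridge and Lasso results above.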
