### Comparing Aggregate Models for Regression

This try-it focuses on ensemble models in a regression setting. Just as you previously combined individual classification estimators into an ensemble, your goal here is to explore ensembles of regression models. As in the earlier assignment, you will use scikit-learn, building the ensemble with the `VotingRegressor`.


#### Dataset and Task

Below, a dataset containing census information on individuals and their hourly wage is loaded using the `fetch_openml` function. [OpenML](https://www.openml.org/) is another repository for datasets. Your task is to use ensemble methods to predict the `WAGE` column of the data. Your ensemble should, at the very least, consider the following models:

- `LinearRegression` -- perhaps you even want the `TransformedTargetRegressor` here.
- `KNeighborsRegressor`
- `DecisionTreeRegressor`
- `Ridge`
- `SVR`

Tune the `VotingRegressor` to optimize prediction performance, and determine whether the wisdom of the crowd outperformed each of the individual models in this setting. Report your findings and discuss their interpretability. Is there a way to determine which features mattered in predicting wages?
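
One way to approach the tuning step is a grid search over the ensemble's `weights` together with a few hyperparameters of its members. The sketch below is a minimal illustration rather than a prescribed solution: it runs on synthetic data from `make_regression` instead of the wage data, and the estimator names (`ridge`, `knn`, `tree`), the grid values, and the scoring choice are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the notebook itself uses the OpenML wage data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voter = VotingRegressor(estimators=[
    ("ridge", Ridge()),
    ("knn", KNeighborsRegressor()),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

# Ensemble weights plus member hyperparameters, addressed with the
# "<estimator name>__<parameter>" convention. Grid values are illustrative.
param_grid = {
    "weights": [None, [2, 1, 1], [1, 2, 1], [1, 1, 2]],
    "ridge__alpha": [0.1, 1.0, 10.0],
    "tree__max_depth": [3, 5],
}

search = GridSearchCV(voter, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

Passing `weights=None` keeps the plain average as one of the candidates, so the search can also report that unequal weighting did not help.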

In [46]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotly.figure_factory import create_table

from sklearn.datasets import fetch_openml

from sklearn.pipeline import Pipeline

from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

from sklearn.preprocessing import StandardScaler, LabelEncoder, QuantileTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

warnings.filterwarnings("ignore")

In [47]:
df_survey = fetch_openml(data_id=534, as_frame=True).frame

In [48]:
df_survey.head()

|   | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.1 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.0 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.5 | 35 | White | Other | Other | Married |


In [49]:
df_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    int64   
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    int64   
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    int64   
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB


In [50]:
encoders = []  # fitted encoders, kept in column order so codes can be mapped back to labels

def label_encoder(cat_columns, df):
    """Integer-encode each categorical column of df in place."""
    for c in cat_columns:
        le = LabelEncoder().fit(df[c].astype(str))
        encoders.append(le)
        df[c] = le.transform(df[c].astype(str))

cat_columns = df_survey.select_dtypes(exclude=np.number).columns.to_list()

df_encd = df_survey.copy()
label_encoder(cat_columns,df_encd)
df_encd.head()

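|   | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 0 | 21 | 1 | 5.1 | 35 | 0 | 2 | 1 | 0 |
| 1 | 9 | 0 | 0 | 42 | 1 | 4.95 | 57 | 2 | 2 | 1 | 0 |
| 2 | 12 | 0 | 1 | 1 | 1 | 6.67 | 19 | 2 | 2 | 1 | 1 |
| 3 | 12 | 0 | 1 | 4 | 1 | 4.0 | 22 | 2 | 2 | 2 | 1 |
| 4 | 12 | 0 | 1 | 17 | 1 | 7.5 | 35 | 2 | 2 | 2 | 0 |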


In [51]:
df = df_encd.copy()

In [52]:
X = df.drop('WAGE', axis=1)
y = df['WAGE']

In [53]:
# Train/test split; a fixed seed keeps the model comparison reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [54]:
# Returns adjusted R2; X and y should be the data on which r2 was computed
def get_adj_r2(X, y, r2):
    n, p = len(y), X.shape[1]
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
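
The adjusted R2 follows the formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples the R2 was computed on and p is the number of features. A small worked example makes the size of the penalty concrete. The function name `adjusted_r2` and the input numbers are hypothetical: 107 rows is roughly what a 20% hold-out of 534 rows yields, and 10 is the number of feature columns once `WAGE` is dropped.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical inputs: R^2 = 0.30 measured on 107 rows with 10 features.
example = adjusted_r2(0.30, n_samples=107, n_features=10)
print(round(example, 4))  # 1 - 0.7 * 106 / 96 = 0.2271
```

With few rows and many features the correction is noticeable: here it lowers the score from 0.30 to about 0.23.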

In [55]:
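# Row labels for the results table; metrics are appended below in this same order
models = ['VotingRegressor','LinearRegression','TransformedTargetRegressor', 'Ridge', 'KNeighborsRegressor', 'DecisionTreeRegressor', 'SVR']
msa = []         # mean absolute error (MAE) per model
mse = []         # mean squared error (MSE) per model
r2_squared = []  # R2 per model
adj_r2 = []      # adjusted R2 per model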

In [56]:
transformer = QuantileTransformer(output_distribution='normal')
sc = StandardScaler()
lr = LinearRegression()
ttr = TransformedTargetRegressor(regressor=lr, transformer=transformer)
ridge = Ridge()
knn = KNeighborsRegressor(n_neighbors=5, weights='uniform')
dt = DecisionTreeRegressor(max_depth=4)
svr = SVR()

In [57]:
pipe_models = [ ('p_lr', Pipeline([('sc', sc), ('LinearRegression', lr) ])),
                ('p_ttr', Pipeline([('sc', sc), ('TransformedTargetRegressor', ttr) ])),
                ('p_ridge', Pipeline([('sc', sc), ('Ridge', ridge) ])),
                ('p_knn', Pipeline([('sc', sc), ('KNeighborsRegressor', knn) ])),
                ('p_dt', Pipeline([('DecisionTreeRegressor', dt) ])),
                ('p_svr', Pipeline([('sc', sc), ('SVR', svr) ])),
]

In [58]:
ereg = VotingRegressor(estimators=pipe_models)
ereg.fit(X_train, y_train)
y_pred = ereg.predict(X_test)

msa.append( mean_absolute_error(y_test, y_pred))
mse.append( mean_squared_error(y_test, y_pred))
r2_squared.append( r2_score(y_test, y_pred))
adj_r2.append(get_adj_r2(X_test, y_test, r2_score(y_test, y_pred)))

In [59]:
# Fit and score each individual pipeline on the same split as the ensemble
for name, pipe in pipe_models:
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    msa.append( mean_absolute_error(y_test, y_pred))
    mse.append( mean_squared_error(y_test, y_pred))
    r2_squared.append( r2_score(y_test, y_pred))
    # R2 is measured on the test split, so the adjustment uses the test-set size
    adj_r2.append(get_adj_r2(X_test, y_test, r2_score(y_test, y_pred)))
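
With only 534 rows, scores from a single train/test split can shift noticeably from one split to another. As a complementary check, the same comparison could be run with cross-validation. The sketch below is a minimal example under stated assumptions: it uses synthetic data from `make_regression`, only two of the candidate models, and 5 folds, all chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in X and y from the notebook).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    "Ridge": make_pipeline(StandardScaler(), Ridge()),
    "KNeighborsRegressor": make_pipeline(StandardScaler(), KNeighborsRegressor()),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so MAE is reported negated; flip the sign back.
    scores = cross_val_score(estimator, X_demo, y_demo,
                             scoring="neg_mean_absolute_error", cv=5)
    cv_mae[name] = -scores.mean()

# Lower MAE is better, so list the best model first.
for name, value in sorted(cv_mae.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
```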

In [60]:
pd_metric = pd.DataFrame.from_dict({'Model':models,'MAE':msa, 'MSE':mse, 'R2':r2_squared, 'Adj R2':adj_r2})
pd_metric.set_index('Model', inplace=True)

In [62]:
create_table(pd_metric.round(4).sort_values(by='Adj R2', ascending=False), index_title='Model',index=True)
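
On the question of which features mattered: a `VotingRegressor` exposes no single set of coefficients, but a model-agnostic technique such as permutation importance can be applied to the fitted ensemble. The sketch below shows one possible approach on synthetic data, not results for the wage data. The two-member ensemble, the variable names, and the `n_repeats` value are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 6 features, of which only 2 carry signal.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, n_informative=2,
                                 noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
]).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and record the drop in R2.
result = permutation_importance(ensemble, X_te, y_te, scoring="r2",
                                n_repeats=10, random_state=0)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Applied to the notebook's own objects, the call would take the fitted `ereg` with `X_test` and `y_test`, and the indices would map to `X.columns`. Note that importances for label-encoded columns describe the column as a whole, not individual categories.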