## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and the balance has hopefully grown substantially by the time you retire.
- These are a little different from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [11]:
from math import sqrt

import numpy as np
import pandas as pd

from sklearn import svm
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    AdaBoostClassifier, AdaBoostRegressor, BaggingClassifier, BaggingRegressor,
    GradientBoostingClassifier, RandomForestClassifier, RandomForestRegressor,
)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import f1_score, mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


In [2]:
df = pd.read_csv('401ksubs.csv')

In [3]:
df.info

<bound method DataFrame.info of       e401k     inc  marr  male  age  fsize   nettfa  p401k  pira      incsq  \
0         0  13.170     0     0   40      1    4.575      0     1   173.4489   
1         1  61.230     0     1   35      1  154.000      1     0  3749.1130   
2         0  12.858     1     0   44      2    0.000      0     0   165.3282   
3         0  98.880     1     1   44      2   21.800      0     0  9777.2540   
4         0  22.614     0     0   53      1   18.450      0     0   511.3930   
...     ...     ...   ...   ...  ...    ...      ...    ...   ...        ...   
9270      0  58.428     1     0   33      4   -1.200      0     0  3413.8310   
9271      0  24.546     0     1   37      3    2.000      0     0   602.5061   
9272      0  38.550     1     0   33      3  -13.600      0     1  1486.1020   
9273      0  34.410     1     0   57      3    3.550      0     0  1184.0480   
9274      0  25.608     0     1   49      1    1.800      0     0   655.7697   

      a

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

 Two variables that would be helpful to have are: 1. the person's education level; 2. whether or not he/she has children. (Age is already available in the data as `age`.)

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

 This would be unethical because it could unintentionally lead to discrimination, which is not only unethical, but also potentially illegal.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

We would reasonably not use `incsq` (income squared), because it is computed directly from the target `inc` and would leak the answer into the model. Per the note above, we also set aside `e401k`, `p401k`, and `pira`. `agesq` (age squared), by contrast, is derived only from `age` and remains a legitimate feature.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

The two variables already created for us are `incsq` (income squared) and `agesq` (age squared). SMEs may have added them so that a linear model can capture curved relationships, for example an effect of age or income that grows, flattens, or reverses at higher values rather than changing at a constant rate.
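
To make the squared-term idea concrete, here is a minimal sketch on synthetic data (the hump-shaped relationship and every value below are assumptions for illustration, not taken from `401ksubs.csv`). It compares a linear model that sees only `age` with one that also sees `age ** 2`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical ages and a hump-shaped outcome that peaks near age 45
age = rng.uniform(25, 65, size=500)
y = -0.05 * (age - 45) ** 2 + 50 + rng.normal(scale=2.0, size=500)

# Feature matrices: age alone vs. age plus its square
X_linear = age.reshape(-1, 1)
X_squared = np.column_stack([age, age ** 2])

# R^2 on the same data for each feature set
r2_linear = LinearRegression().fit(X_linear, y).score(X_linear, y)
r2_squared = LinearRegression().fit(X_squared, y).score(X_squared, y)

print(f"R^2 with age only:       {r2_linear:.3f}")
print(f"R^2 with age and age^2:  {r2_squared:.3f}")
```

The model is still linear in its coefficients; the squared column simply gives it a way to bend.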

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

 The description of income (`inc`) should not be squared, and it should mention that the values are in thousands.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

-  Linear Regression:  An appropriate tactic for predicting one's income because its predictions and coefficients can be interpreted fairly easily.
-  Ridge Regression:  An appropriate tactic because its predictions and coefficients are still easy to interpret, and its coefficients are regularized (shrunk toward zero), which can improve predictive performance when the unregularized model overfits.
-  Lasso Regression:  An appropriate tactic for the same reasons as Linear Regression and Ridge Regression. Its L1 penalty can shrink some coefficients exactly to zero, effectively performing feature selection, which may improve predictive performance.
-  ElasticNet Regression:  An appropriate tactic that combines the Lasso (L1) and Ridge (L2) penalties.
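
The difference between the Ridge and Lasso penalties can be sketched on synthetic data. Everything below (the `make_regression` data and the `alpha` values) is an illustrative assumption rather than a tuned setup for this lab:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Ridge shrinks coefficients but leaves them nonzero; Lasso can zero some out
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Coefficient norm, OLS vs Ridge:",
      np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
print("Zeroed coefficients, Ridge vs Lasso:", n_zero_ridge, n_zero_lasso)
```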

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
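
One way to organize the seven requested regressors is a dictionary of estimators that all share a single train/test split. The sketch below runs on synthetic data so that it is self-contained; `SEED`, the synthetic `X` and `y`, and the default hyperparameters are assumptions, and the lab's own features and target would be swapped in.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in for the lab's feature matrix and target
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.0]) + rng.normal(scale=1.0, size=300)

# One split, reused for every model so that scores are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

models = {
    "linear regression": LinearRegression(),
    # Distance-based models are scaled inside a pipeline to avoid leakage
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "decision tree": DecisionTreeRegressor(random_state=SEED),
    "bagged trees": BaggingRegressor(DecisionTreeRegressor(), random_state=SEED),
    "random forest": RandomForestRegressor(random_state=SEED),
    "AdaBoost": AdaBoostRegressor(random_state=SEED),
    "support vector regressor": make_pipeline(StandardScaler(), SVR()),
}

# Fit each model on the training split and record its test-set R^2
test_r2 = {name: model.fit(X_train, y_train).score(X_test, y_test)
           for name, model in models.items()}

for name, score in test_r2.items():
    print(f"{name:>26}: {score:.3f}")
```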

In [4]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [5]:
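features = ['marr', 'male', 'agesq', 'fsize', 'nettfa']

X = df[features]
# The stated problem is to predict `inc`; this cell uses `incsq` as the target,
# so the scores recorded below are for squared income. Use df['inc'] to model income itself.
y = df['incsq']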

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [9]:
seed = 42  # np.random.seed() returns None, so keep the integer itself for random_state
np.random.seed(seed)

In [16]:
# Initialize the estimators

reg1 = LinearRegression()
reg2 = Lasso()
reg3 = ElasticNet()
reg4 = Ridge()

# Initialize the dictionary of hyperparameters
param1={}
param1['reg'] = [reg1]

param2={}
param2['reg'] = [reg2]
param2['reg__alpha']=[0.1,1,10,100.0,200.0,450.0,600.0]

param3={}
param3['reg'] = [reg3]
param3['reg__alpha']=[0.1,1,10,100.0,200.0,450.0,600.0]
param3['reg__l1_ratio'] = [0.25,0.5,0.75,0.9,1.0]

param4={}
param4['reg'] = [reg4]
param4['reg__alpha']=[0.1,1,10,100.0,200.0,450.0,600.0]


params = [param1, param2, param3, param4]


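# Scale the continuous columns. ColumnTransformer defaults to remainder='drop',
# so 'marr' and 'male' are dropped here; pass remainder='passthrough' to keep them.
preprocess = ColumnTransformer([
    ('sc', StandardScaler(), ['agesq', 'fsize', 'nettfa'])
])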
                        
pipe_rg = Pipeline([
    ('sc', preprocess),
    ('reg', reg1)
])

In [17]:
%%time
# Train the randomized search models
rg1 = RandomizedSearchCV(pipe_rg, params, cv=3, n_jobs=-1, random_state =42).fit(X_train, y_train)

CPU times: user 99.2 ms, sys: 157 ms, total: 256 ms
Wall time: 4.03 s


In [18]:
rg1.best_estimator_

In [19]:
print('Best CV score:',rg1.best_score_)
print('Best Model parameters:',rg1.best_params_)
print('Train score:',rg1.score(X_train,y_train))
print('Test score:',rg1.score(X_test,y_test))

Best CV score: 0.1931090408056314
Best Model parameters: {'reg__alpha': 10, 'reg': Ridge(alpha=10)}
Train score: 0.19466813587557918
Test score: 0.11620898305324212


##### 9. What is bootstrapping?

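Bootstrapping is random sampling with replacement from the observed data, typically drawing resamples of the same size as the original dataset. Repeating this many times approximates how a statistic or model would vary across different samples.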
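
As a minimal sketch of the idea, the snippet below bootstraps the mean of a small array. The `incomes` values are a handful of rounded, illustrative numbers, and `n_boot = 1000` is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sample of incomes in $1000s (not the full dataset)
incomes = np.array([13.2, 61.2, 12.9, 98.9, 22.6, 58.4, 24.5, 38.6, 34.4, 25.6])

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw a resample of the same size as the original, WITH replacement
    resample = rng.choice(incomes, size=incomes.size, replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means approximates the standard error of the mean
boot_se = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {incomes.mean():.2f}")
print(f"Bootstrap SE: {boot_se:.2f}, 95% percentile interval: ({ci_low:.2f}, {ci_high:.2f})")
```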

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A single decision tree is one model fit once to the full training set. A set of bagged decision trees is an ensemble: many trees are each fit on a different bootstrap sample of the training data (drawn with replacement), and their predictions are combined by averaging for regression or by majority vote for classification. Because individual deep trees are high-variance, combining many of them reduces variance relative to a single tree.
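
The variance reduction can be sketched on a noisy synthetic problem. The sine-shaped data, the noise level, and `n_estimators=100` below are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic one-feature problem: a sine curve plus noise
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One fully grown tree, fit once on the whole training set
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# 100 trees, each fit on its own bootstrap sample; predictions are averaged
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          random_state=42).fit(X_train, y_train)

tree_r2 = tree.score(X_test, y_test)
bagged_r2 = bagged.score(X_test, y_test)

print(f"Single tree test R^2:  {tree_r2:.3f}")
print(f"Bagged trees test R^2: {bagged_r2:.3f}")
```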

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

 A random forest is an extension of bagging: in addition to fitting each tree on a bootstrap sample, it considers only a random subset of the features at each split.
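
In scikit-learn, this difference shows up in the `max_features` setting of the underlying trees. The sketch below uses synthetic data, and `max_features=3` is an arbitrary illustrative value:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with 8 features
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=42)

# Bagging: bootstrap samples, and every split may consider all 8 features
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                          random_state=42).fit(X, y)

# Random forest: bootstrap samples, and every split considers 3 random features
forest = RandomForestRegressor(n_estimators=50, max_features=3,
                               random_state=42).fit(X, y)

# max_features_ is the number of candidate features per split in a fitted tree
bagged_split_features = bagged.estimators_[0].max_features_
forest_split_features = forest.estimators_[0].max_features_

print("Candidate features per split (bagged tree):", bagged_split_features)
print("Candidate features per split (forest tree):", forest_split_features)
```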

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

In a set of bagged trees, every tree can split on the same few strong features, so the trees tend to be highly correlated and averaging them removes less variance. By restricting each split to a random subset of features, a random forest decorrelates its trees, so averaging reduces variance further, usually at the cost of only a small increase in bias. As a result, a random forest often generalizes better than a set of bagged decision trees.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.
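
A minimal sketch of this evaluation step follows. The helper name `rmse`, the synthetic data, and the two example models are assumptions; the same loop would be run over whichever fitted models the lab produces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def rmse(model, X, y):
    """Root mean squared error of a fitted model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


rng = np.random.default_rng(42)

# Synthetic stand-in data: a linear signal with unit-variance noise
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=42),
}

# Store (training RMSE, testing RMSE) for each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (rmse(model, X_train, y_train), rmse(model, X_test, y_test))

for name, (train_rmse, test_rmse) in results.items():
    print(f"{name:>18}: train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

A training RMSE far below the testing RMSE, as with the fully grown tree here, is the usual sign of overfitting.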

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [None]:
import copy

# Initialize the estimators
clf1 = RandomForestClassifier(random_state = seed)
clf2 = SVC(random_state =seed)
clf3_0 = LogisticRegression(solver = 'saga',random_state =seed)
clf3_1 = LogisticRegression(random_state =seed)
clf4 = DecisionTreeClassifier(random_state =seed)
clf5 = KNeighborsClassifier()
clf6 = MultinomialNB()
clf7 = GradientBoostingClassifier(random_state =seed)


# Initialize the dictionary of hyperparameters
# `pre` is not defined anywhere in this notebook; an empty dict is assumed here
# (i.e. no shared preprocessing hyperparameters to search over).
pre = {}

# Random Forest
param1 = copy.deepcopy(pre)
param1['classifier'] = [clf1]
param1['classifier__n_estimators'] = [10, 50, 100, 250]
param1['classifier__max_depth'] = [5, 10, 20]

# SVC
param2 = copy.deepcopy(pre)
param2['classifier'] = [clf2]
param2['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2, 10e5]
param2['classifier__gamma'] = [0.0001, 0.001, 0.01, 0.1, 1, 10, 'auto']

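# Logistic Regression
# scikit-learn >= 1.2 spells the unpenalized option None; the string 'none' was removed in 1.4.
# The 'elasticnet' penalty additionally needs an l1_ratio value to be supplied.
param3_0 = copy.deepcopy(pre)
param3_0['classifier'] = [clf3_0]
param3_0['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
param3_0['classifier__penalty'] = [None, 'l1', 'l2', 'elasticnet']

# Logistic Regression (l2 only)
# penalty=None is not supported by the 'liblinear' solver, so that combination is invalid.
param3_1 = copy.deepcopy(pre)
param3_1['classifier'] = [clf3_1]
param3_1['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
param3_1['classifier__solver'] = ['newton-cg', 'lbfgs', 'liblinear']
param3_1['classifier__penalty'] = [None, 'l2']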

# Decision Tree
param4 = copy.deepcopy(pre)
param4['classifier'] = [clf4]
param4['classifier__max_depth'] = [5,15,25,None]
param4['classifier__min_samples_split'] = [2,5,10]
param4['classifier__min_samples_leaf'] =[1,2,3,4,5]
param4['classifier__ccp_alpha'] =[0,10**-1, 10**0, 10**1, 10**2]

# KNN
param5 = copy.deepcopy(pre)
param5['classifier'] = [clf5]
param5['classifier__n_neighbors'] = range(1, 51, 5)
param5['classifier__weights']= ['uniform', 'distance']
param5['classifier__metric']= ['euclidean', 'manhattan','minkowski']

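# Multinomial Naive Bayes
# MultinomialNB requires non-negative features; `nettfa` contains negative values,
# so it must be excluded or rescaled to a non-negative range for this estimator.
param6 = copy.deepcopy(pre)
param6['classifier'] = [clf6]
param6['classifier__alpha'] = [10**0, 10**1, 10**2]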

# Gradient Boosting
param7 = copy.deepcopy(pre)
param7['classifier'] = [clf7]
param7['classifier__n_estimators'] = [10, 50, 100, 250]
param7['classifier__max_depth'] = [5, 10, 20]

params = [param1, param2, param3_0, param4, param5, param6, param7]


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?
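
To see how f1-score ties false positives and false negatives together, here is a small worked sketch. The `y_true` and `y_pred` lists are made-up labels (1 = eligible for a 401(k), 0 = not eligible), not model output from this lab:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # lowered by false positives
recall = tp / (tp + fn)     # lowered by false negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# scikit-learn's f1_score computes the same quantity directly
f1_sklearn = f1_score(y_true, y_pred)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Here a false positive is someone predicted to be eligible who is not, and a false negative is an eligible person the model misses; because f1 is a harmonic mean, it drops when either kind of error grows.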

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.
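
A minimal sketch of this comparison is shown below on synthetic data. The `make_classification` settings and the two example classifiers are assumptions; the lab's fitted classifiers and its own split would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 10% label noise
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=42),
}

# Store (training f1, testing f1) for each model
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (
        f1_score(y_train, model.predict(X_train)),
        f1_score(y_test, model.predict(X_test)),
    )

for name, (train_f1, test_f1) in scores.items():
    print(f"{name:>20}: train f1 = {train_f1:.3f}, test f1 = {test_f1:.3f}")
```

A large gap between training and testing f1-score, such as a fully grown tree that scores perfectly on its training data, points to overfitting.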

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.