## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.

Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and, hopefully, grows substantially by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [69]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('401ksubs.csv')

In [3]:
data.head()

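|   | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |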


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

1. Proportion of income being deposited into 401k/IRA monthly
2. Duration that 401k/IRA account has been opened

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Using race as a predictor means that advertising decisions would differ by race. Because income levels differ across racial groups, the model could systematically steer advertising away from some groups, leaving them with less exposure to retirement savings accounts and widening existing racial divides.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

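`incsq`, since it is computed directly from `inc` (the target), so using it would leak the answer into the features. Per the note above, `e401k`, `p401k`, and `pira` are also off limits when predicting `inc`.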

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` and `agesq` were created by squaring the income and age variables. A plausible reason is that squared terms let a linear model capture a curved (quadratic) relationship, for example an outcome that rises with age and then levels off or declines, which a single linear term cannot represent.
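
One way to see what a squared term can add to a linear model is to fit the same regression with and without it. The sketch below uses synthetic data (an assumption made purely for illustration, not the 401(k) dataset) in which the target rises and then falls with age.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: a target that peaks around age 45.
rng = np.random.default_rng(0)
age = rng.uniform(25, 65, size=500)
y = -0.05 * (age - 45) ** 2 + 60 + rng.normal(0, 1, size=500)

# Same model class, two feature sets: age alone vs. age plus age squared.
X_linear = age.reshape(-1, 1)
X_quadratic = np.column_stack([age, age ** 2])

r2_linear = LinearRegression().fit(X_linear, y).score(X_linear, y)
r2_quadratic = LinearRegression().fit(X_quadratic, y).score(X_quadratic, y)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quadratic:.3f}")
```

With the squared column included, the linear model can follow the curve that the age-only model misses.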

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

inc is incorrectly described as inc^2. The correct description should be 'income'

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

    Possible Models
    - a multiple linear regression model - Appropriate, since we are predicting a continuous variable from multiple features
    - a k-nearest neighbors model - Appropriate, since KNN can be used for regression as well as classification: the prediction is the average target of the k nearest training points. Features should be scaled first because KNN is distance-based
    - a decision tree, random forest, a set of bagged decision trees, AdaBoost - Appropriate, since decision trees and the ensemble methods built on them (bagging or boosting) can predict continuous variables
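
As a quick check on how KNN behaves with a continuous target, here is a minimal sketch on a tiny made-up dataset (the numbers are illustrative assumptions). `KNeighborsRegressor` predicts the mean target of the nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny illustrative dataset: one feature, continuous target.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_toy, y_toy)

# The 3 nearest neighbors of 2.5 are x = 2, 3 and 1,
# so the prediction is the mean of their targets: (12 + 14 + 10) / 3.
pred = knn_reg.predict(np.array([[2.5]]))
print(pred)
```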

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [65]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.svm import LinearSVR 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

In [53]:
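# Features exclude e401k, p401k, and pira, as required by the note above.
X = data[['agesq', 'marr', 'male', 'fsize', 'nettfa']]
# The target here is incsq (income squared), so the RMSE values reported later are in squared-income units.
y = data['incsq']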

In [54]:
X_train, X_cv, y_train, y_cv = train_test_split(X,
                                                y,
                                                test_size=0.33,
                                                random_state=42)

In [58]:
lr = LinearRegression()
ss = StandardScaler()
pipe_lr = make_pipeline(ss, lr)

In [61]:
knn = KNeighborsRegressor()
ss = StandardScaler()
pipe_knn = make_pipeline(ss, knn)

In [62]:
dtree = DecisionTreeRegressor()
bag = BaggingRegressor(random_state=42)
ada = AdaBoostRegressor(n_estimators=50, learning_rate=1, random_state=0)
svr = LinearSVR(max_iter=20000)

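# Note: this list holds the unscaled `knn` rather than `pipe_knn`, and no random forest is included.
models = [pipe_lr, knn, dtree, bag, ada, svr]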

for model in models:
    model.fit(X_train, y_train)
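
For reference, one way to cover all seven requested model types with a shared split is sketched below. It runs on synthetic stand-in data from `make_regression` (an assumption, so that the block is self-contained), and names such as `demo_models` are hypothetical. KNN and the support vector regressor are wrapped in scaling pipelines because both are sensitive to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# One entry per model type, keyed by a readable name.
demo_models = {
    "linear regression": make_pipeline(StandardScaler(), LinearRegression()),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "bagged trees": BaggingRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "adaboost": AdaBoostRegressor(random_state=42),
    "svr": make_pipeline(StandardScaler(), SVR()),
}

# Every model is fit on the same training split so the comparison is fair.
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
```

Keying the models by name also gives shorter labels than `str(model)` when the metrics are later collected into a DataFrame.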



##### 9. What is bootstrapping?

Bootstrapping is the process of repeatedly drawing random samples, with replacement, from the observed data. Each resample is typically the same size as the original sample.
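
A minimal sketch of bootstrapping with NumPy follows. It resamples the five `inc` values shown in the `data.head()` output and records the mean of each resample; the choice of 1,000 resamples is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# The five `inc` values from the data.head() output.
sample = np.array([13.17, 61.23, 12.858, 98.88, 22.614])

# Each bootstrap resample draws n values from the sample, with replacement,
# so some observations appear more than once and others not at all.
n_resamples = 1000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_resamples)
])

# The spread of the resampled means estimates the variability of the sample mean.
print(boot_means.mean(), boot_means.std())
```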

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A single decision tree is fit once on the full training sample, and its predictions come from that one tree. A set of bagged decision trees fits many trees, each on a different bootstrapped sample of the training data, and aggregates their predictions (the average for regression, a majority vote for classification).

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Both methods fit each tree on a bootstrapped sample. Bagged decision trees consider every predictor at every split, while a random forest considers only a random subset of the predictors at each split.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

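Because strong predictors tend to be chosen near the top of every bagged tree, the bagged trees are highly correlated with one another, and averaging correlated trees removes less variance. By restricting each split to a random subset of predictors, a random forest decorrelates its trees, so averaging them reduces variance (and overfitting) further, usually at the cost of only a small increase in bias.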
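
In scikit-learn terms, the contrast between the two ensembles can be sketched as follows, again on synthetic data (an assumption). The `max_features=2` setting is an arbitrary illustrative choice; it is the parameter that limits how many predictors each split may consider.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# Bagging: each tree sees a bootstrap sample but may split on any feature.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)

# Random forest: bootstrap samples plus a random subset of features at each split.
forest = RandomForestRegressor(n_estimators=50, max_features=2, random_state=0)

bagged.fit(X_demo, y_demo)
forest.fit(X_demo, y_demo)

# Inspect the per-tree setting that distinguishes the two ensembles.
print(bagged.estimators_[0].max_features)  # None: all features considered
print(forest.estimators_[0].max_features)  # 2: random subset per split
```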

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [27]:
from sklearn.metrics import mean_squared_error
from collections import defaultdict

In [72]:
metrics = defaultdict(dict)
for model in models:
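    # RMSE = sqrt(MSE); np.sqrt avoids the `squared` argument, which was deprecated and later removed in newer scikit-learn releases.
    metrics[str(model)]['train_RMSE'] = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    metrics[str(model)]['validation_RMSE'] = np.sqrt(mean_squared_error(y_cv, model.predict(X_cv)))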

In [73]:
pd.DataFrame(metrics)

|   | Pipeline(steps=[('standardscaler', StandardScaler()), ('linearregression', LinearRegression())]) | KNeighborsRegressor() | DecisionTreeRegressor() | BaggingRegressor(random_state=42) | AdaBoostRegressor(learning_rate=1, random_state=0) | LinearSVR(max_iter=20000) |
|---|---|---|---|---|---|---|
| train_RMSE | 2564.663189 | 2212.996962 | 199.747154 | 1160.147125 | 2899.88111 | 2684.370777 |
| validation_RMSE | 2752.034885 | 2795.617755 | 3716.69454 | 2806.075159 | 3091.153445 | 2898.776959 |


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

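There is evidence of overfitting in the single decision tree (training RMSE of about 200 versus testing RMSE of about 3,717) and in the bagged decision trees (about 1,160 versus about 2,806): for both, the training RMSE is much lower than the testing RMSE.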
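
One simple way to make the train/test gap explicit is the ratio of validation RMSE to training RMSE. The sketch below copies the RMSE values from the results table above; the short row labels and the `ratio` column are hypothetical names added for readability.

```python
import pandas as pd

# RMSE values copied from the results table above.
rmse_table = pd.DataFrame(
    {
        "train_RMSE": [2564.663189, 2212.996962, 199.747154,
                       1160.147125, 2899.88111, 2684.370777],
        "validation_RMSE": [2752.034885, 2795.617755, 3716.69454,
                            2806.075159, 3091.153445, 2898.776959],
    },
    index=["linear regression", "knn", "decision tree",
           "bagged trees", "adaboost", "linear svr"],
)

# A ratio near 1 means similar error on both sets; a large ratio suggests overfitting.
rmse_table["ratio"] = rmse_table["validation_RMSE"] / rmse_table["train_RMSE"]
print(rmse_table.sort_values("ratio", ascending=False))
```

A ratio is only a rough diagnostic: a model can have a ratio near 1 and still be poor if both errors are high.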

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

Because only eligible people can participate, `p401k = 1` implies `e401k = 1`. Including `p401k` therefore leaks the target into the features: the model leans on that one variable rather than learning from the others, and its performance will look better than it would for prospective customers, whose participation status cannot be known before their eligibility is.
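
The relationship between the two variables can be checked with a cross-tabulation. The sketch below uses a small hypothetical frame (made-up rows, not the real data) constructed so that participation requires eligibility.

```python
import pandas as pd

# Hypothetical toy rows (not the real data): participation requires eligibility.
toy = pd.DataFrame({
    "e401k": [0, 0, 0, 1, 1, 1, 1, 0],
    "p401k": [0, 0, 0, 1, 1, 0, 1, 0],
})

# Rows are p401k, columns are e401k.
table = pd.crosstab(toy["p401k"], toy["e401k"])
print(table)

# Among participants, the share who are eligible.
share_eligible_given_participant = toy.loc[toy["p401k"] == 1, "e401k"].mean()
print(share_eligible_given_participant)
```

If the same check on the real data shows an empty cell for participants who are ineligible, then `p401k` alone identifies a subset of the positive class with certainty.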

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

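    Possible models
    - a logistic regression model, a decision tree, a set of bagged decision trees, a random forest, an Adaboost model - Appropriate for this classification problem since the data is labelled and the target is binary

    - a k-nearest neighbors model - Appropriate, since KNN is a supervised method that classifies a point by majority vote among its k nearest labelled neighbors. Features should be scaled first because KNN is distance-based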

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

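In [81]:
# Define the features and target before splitting.
# Caution: X still includes the target column `e401k`, so every model can read the
# label directly from its inputs; this is consistent with the near-perfect scores below.
X = data.drop(['p401k'], axis='columns')
y = data['e401k']

In [82]:
X_train, X_cv, y_train, y_cv = train_test_split(X,
                                                y,
                                                test_size=0.33,
                                                random_state=42)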

In [83]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [92]:
lr = LogisticRegression()
ss = StandardScaler()
pipe_lr = make_pipeline(ss, lr)

knn = KNeighborsClassifier()
ss = StandardScaler()
pipe_knn = make_pipeline(ss, knn)

dtree = DecisionTreeClassifier()
bag = BaggingClassifier(random_state=42)
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=0)
svc = LinearSVC(max_iter=20000)
rf = RandomForestClassifier(n_estimators=100)

models = [pipe_lr, pipe_knn, dtree, bag, ada, svc, rf]

for model in models:
    model.fit(X_train, y_train)
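
When building the feature matrix for a classifier, the target column has to be excluded from the features, or the model can simply read the label. A minimal sketch on a hypothetical toy frame (made-up rows that borrow a few of the lab's column names) is shown below; the `stratify` argument is an optional extra that keeps class proportions similar across the split.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame (made-up rows, not the real data).
toy = pd.DataFrame({
    "e401k": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "p401k": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    "inc":   [13.2, 61.2, 12.9, 98.9, 22.6, 45.0, 18.3, 70.1, 25.4, 55.5, 30.0, 64.8],
    "age":   [40, 35, 44, 44, 53, 38, 29, 47, 51, 33, 60, 42],
})

target = "e401k"
# Drop the target itself, plus p401k, which can only equal 1 for eligible people.
X_clf = toy.drop(columns=[target, "p401k"])
y_clf = toy[target]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_clf, y_clf, test_size=0.33, random_state=42, stratify=y_clf
)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)
print(list(X_clf.columns))
```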



## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

In [101]:
from sklearn.metrics import confusion_matrix, f1_score

In [111]:
metrics = defaultdict(dict)
for model in models:
    tn, fp, fn, tp = confusion_matrix(y_cv, model.predict(X_cv)).ravel()
    metrics[str(model)]['false_positive'] = fp
    metrics[str(model)]['false_negative'] = fn

In [112]:
fp_fn_metrics = pd.DataFrame(metrics)
fp_fn_metrics

|   | Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())]) | Pipeline(steps=[('standardscaler', StandardScaler()), ('kneighborsclassifier', KNeighborsClassifier())]) | DecisionTreeClassifier() | BaggingClassifier(random_state=42) | AdaBoostClassifier(learning_rate=1, random_state=0) | LinearSVC(max_iter=20000) | RandomForestClassifier() |
|---|---|---|---|---|---|---|---|
| false_positive | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| false_negative | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We would rather minimize false negatives, since our goal is to maximize the take-up rate among people who are eligible for a 401(k), and so we want to miss as few eligible people as possible. The marginal cost of advertising to someone who is ineligible is less than the marginal cost of not advertising to someone who is eligible.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

We would optimize for sensitivity (recall), TP / (TP + FN). Higher sensitivity means that a smaller proportion of eligible people are wrongly predicted to be ineligible.
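
Sensitivity and specificity can both be read off a confusion matrix, which makes it easy to see which one responds to false negatives. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Sensitivity (recall) falls as false negatives rise.
sensitivity = tp / (tp + fn)
# Specificity falls as false positives rise.
specificity = tn / (tn + fp)

print(sensitivity, specificity)
# scikit-learn's recall_score is the same quantity as sensitivity.
print(recall_score(y_true, y_pred))
```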

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

f1-score is appropriate here because it is the harmonic mean of precision and recall. Precision penalizes false positives and recall penalizes false negatives, so a high f1-score requires keeping both types of error low.
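
To make the definition concrete, the sketch below computes the f1-score by hand from precision and recall and compares it with scikit-learn's `f1_score`. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # hurt by false positives
recall = tp / (tp + fn)     # hurt by false negatives

# Harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(precision, recall, f1_manual, f1_sklearn)
```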

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [117]:
metrics = defaultdict(dict)
for model in models:
    f1score = f1_score(y_cv, model.predict(X_cv))
    metrics[str(model)]['f1score'] = f1score

In [118]:
metrics = pd.concat([fp_fn_metrics, pd.DataFrame(metrics)])
metrics

|   | Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())]) | Pipeline(steps=[('standardscaler', StandardScaler()), ('kneighborsclassifier', KNeighborsClassifier())]) | DecisionTreeClassifier() | BaggingClassifier(random_state=42) | AdaBoostClassifier(learning_rate=1, random_state=0) | LinearSVC(max_iter=20000) | RandomForestClassifier() |
|---|---|---|---|---|---|---|---|
| false_positive | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| false_negative | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| f1score | 1.0 | 0.998308 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
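
Checking for overfitting requires the f1-score on the training data as well as on the held-out data. A minimal sketch of that comparison follows, using synthetic data from `make_classification` (an assumption) and an unpruned decision tree, which typically memorizes its training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (an assumption; not the 401(k) dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Score the same fitted model on both splits.
train_f1 = f1_score(y_tr, tree.predict(X_tr))
test_f1 = f1_score(y_te, tree.predict(X_te))

# A large drop from training to testing f1-score is evidence of overfitting.
print(f"train f1: {train_f1:.3f}, test f1: {test_f1:.3f}")
```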


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.