## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and, ideally, grows substantially by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

In [2]:
df401k = pd.read_csv('401ksubs.csv')

In [3]:
df401k.head()

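| Unnamed: 0 | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |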


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

A) state of residency/address, citizenship status, employment title/status

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

A) Data on race and ethnicity are often biased, and using them can lead to unfair outcomes that perpetuate unequal opportunity for minority groups. For example, crime statistics are generated from police reports rather than from objective measures of criminal activity, and a large body of research suggests this leads to overpolicing of lower socioeconomic groups and underpolicing of higher socioeconomic groups.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

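`incsq`, because it is simply `inc` squared: including it leaks the target into the features, so the model would look far better than it really is. `inc` itself is excluded for the same reason, since it is the target. Per the note above, `e401k`, `p401k`, and `pira` are also off limits.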
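To make the leakage concrete, here is a minimal check on the five rows shown in `df401k.head()` above. It is only an illustrative sketch: the arrays are copied by hand from that output, and the name `recovered` is introduced here.

```python
import numpy as np

# Values copied from the first five rows of df401k.head() shown earlier.
inc = np.array([13.17, 61.23, 12.858, 98.88, 22.614])
incsq = np.array([173.4489, 3749.113, 165.3282, 9777.254, 511.393])

# The square root of incsq gives back inc (up to rounding), so a model
# that is given incsq has effectively been handed the answer.
recovered = np.sqrt(incsq)
print(np.allclose(recovered, inc, atol=1e-3))
```

Any feature that is a deterministic function of the target behaves this way, which is why it has to be dropped before modeling.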

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` and `agesq`. Squared terms let a linear model capture curvature: income tends to rise with age, level off near peak earning potential, and then decline as one approaches retirement.
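The effect of a squared term can be sketched on made-up data. The example below assumes a synthetic income curve that peaks around age 50 (this shape and all numbers are invented for illustration, not taken from the 401(k) data) and compares a linear fit on age alone with a fit on age plus age squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic (made-up) data: income rises with age, peaks near 50, then falls.
age = rng.uniform(25, 65, size=500)
income = 60 - 0.05 * (age - 50) ** 2 + rng.normal(0, 2, size=500)

X_linear = age.reshape(-1, 1)                    # age only
X_quadratic = np.column_stack([age, age ** 2])   # age and its square

r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)
r2_quadratic = LinearRegression().fit(X_quadratic, income).score(X_quadratic, income)

print(f"R^2 with age only:        {r2_linear:.3f}")
print(f"R^2 with age and age^2:   {r2_quadratic:.3f}")
```

The model with the squared term can bend to follow the peak, while the model with age alone can only draw a straight line through it.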

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

inc and age are described as inc^2 and age^2 respectively, and inc is not listed as being in thousands, as it should be. 

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

1) linear regression<br>
2) knn<br>
3) decision tree <br>
4) random forest<br>
5) Support vector <br>
6) Generalized linear


##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [4]:
import numpy as np 
import pandas as pd

In [226]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR

In [496]:
from sklearn.model_selection import train_test_split, cross_val_score

In [7]:
from sklearn.preprocessing import StandardScaler

In [125]:
from sklearn.metrics import mean_squared_error, r2_score


In [50]:
type(df401k)

pandas.core.frame.DataFrame

In [344]:
X = df401k[['marr', 'nettfa', 'age', 'fsize']]
y = df401k['inc']

In [345]:
type(X)

pandas.core.frame.DataFrame

In [346]:
# In-class work: Jacob's skeleton of a MasterRegressor class (unfinished, left commented out).
# class MasterRegressor:
#
#     def __init__(self, estimators, df):
#         self.estimators = estimators
#         self.df = df
#
#     def feature_and_split(self, predictor_list, target, random_state=42):
#         # train_test_split returns X_train, X_test, y_train, y_test, in that order.
#         # random_state could instead be drawn at random, e.g. np.random.randint(1, 100).
#         return train_test_split(self.df[predictor_list], target,
#                                 random_state=random_state)
#
#     def regression_estimators(self):
#         pass
#
#     def output_table(self):
#         pass
#
#     def output_plot(self):
#         pass
        

In [471]:
# Workflow: train/test split -> scale -> fit -> score -> predict.
def regression_machine(estimator, X, y, scoring='RMSE'):
    """Fit one regressor class and report train R^2, test R^2, and test RMSE.

    `estimator` is an estimator class (e.g. LinearRegression), not an instance.
    `scoring` is accepted so the calls below keep working, but it is unused.
    """
    # NOTE: X and y are reassigned here, so the arguments passed in are ignored
    # and every model is fit on the same features and the same split.
    X = df401k[['marr', 'nettfa', 'agesq', 'fsize']]
    y = df401k['inc']  # 1-D Series; a one-column DataFrame triggers shape warnings
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Fit the scaler on the training set only, then apply it to both sets.
    ss = StandardScaler()
    X_train_scaled = ss.fit_transform(X_train)
    X_test_scaled = ss.transform(X_test)

    est = estimator()  # instantiate with default hyperparameters
    est.fit(X_train_scaled, y_train)
    train_score = est.score(X_train_scaled, y_train)  # training R^2
    r_sq = est.score(X_test_scaled, y_test)           # testing R^2
    preds = est.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(train_score)
    return ["our R Squared is {} and our RMSE is {}".format(r_sq, rmse)]
    

In [472]:
lr = LinearRegression

In [473]:
regression_machine(lr, X, y, scoring = 'RMSE')

0.27360782564992414




['our R Squared is 0.2231004994082149 and our RMSE is 21.522219737647593']

In [474]:
dtr = DecisionTreeRegressor

In [475]:
regression_machine(dtr, X, y, scoring = 'RMSE')

0.906781132232467




['our R Squared is -0.16773376998251055 and our RMSE is 26.38618238012311']

In [476]:
knn = KNeighborsRegressor

In [477]:
regression_machine(knn, X, y, scoring = 'RMSE')

0.5132616213246797




['our R Squared is 0.26304313169213467 and our RMSE is 20.961660143627835']

In [478]:
br = BaggingRegressor

In [479]:
regression_machine(BaggingRegressor, X, y, scoring = 'RMSE')

0.7884529465876988




['our R Squared is 0.1494520621151737 and our RMSE is 22.519255998726212']

In [480]:
rfr = RandomForestRegressor

In [481]:
regression_machine(rfr, X, y, scoring = 'RMSE')

0.7940014536995262




['our R Squared is 0.15141247127292246 and our RMSE is 22.493288959472867']

In [482]:
abr = AdaBoostRegressor

In [483]:
regression_machine(abr, X, y, scoring = 'RMSE')

0.0958968974608777




['our R Squared is 0.07593023852802527 and our RMSE is 23.472374164893914']

In [560]:
svm = SVR

In [561]:
regression_machine(SVR, X, y, scoring = 'RMSE')



0.30682772242663736


['our R Squared is 0.290803834865079 and our RMSE is 20.56306409836328']
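The lab prompt suggests a pipeline and asks for RMSE on both the training and testing data. A minimal sketch of that comparison follows. It runs on synthetic stand-in data from `make_regression` so that it is self-contained; the names `X_demo`, `y_demo`, `models`, `rmse`, and `results` are introduced here and are not part of the notebook above. To apply it to the lab data, the synthetic arrays would be swapped for the real features and `inc`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "bagged trees": BaggingRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
    "adaboost": AdaBoostRegressor(random_state=42),
    "svr": SVR(),
}


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


results = {}
for name, model in models.items():
    # The pipeline refits the scaler on the training data for every model.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    results[name] = (rmse(y_train, pipe.predict(X_train)),
                     rmse(y_test, pipe.predict(X_test)))

for name, (train_rmse, test_rmse) in results.items():
    print(f"{name:18s} train RMSE {train_rmse:8.2f}   test RMSE {test_rmse:8.2f}")
```

Because every model shares one split and one scoring function, a large gap between a model's training and testing RMSE can be read directly as a sign of overfitting.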

##### 9. What is bootstrapping?

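Bootstrapping is random sampling with replacement from the observed data, typically drawing resamples of the same size as the original sample. Repeating this many times approximates the sampling distribution of a statistic (such as the sample mean), which lets us estimate its variability when making inferences about the population.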
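As a small illustration, the sketch below bootstraps the mean of the five `inc` values shown in `df401k.head()` above. Five observations are far too few for a serious estimate; the point is only to show the resampling mechanics. The names `sample`, `n_resamples`, and `boot_means` are introduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Income values (in $1000s) copied from the five rows of df401k.head() above.
sample = np.array([13.17, 61.23, 12.858, 98.88, 22.614])

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Each resample has the same size as the original and is drawn WITH replacement,
    # so some observations repeat and others are left out.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means approximates the standard error of the mean.
print(f"sample mean:              {sample.mean():.2f}")
print(f"bootstrap standard error: {boot_means.std(ddof=1):.2f}")
```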

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.
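Questions 10 to 12 can be illustrated in code. In broad terms, a single decision tree is one model fit once on the training set; bagging fits many trees, each on a bootstrap resample, and averages their predictions; a random forest does the same but also restricts each split to a random subset of features, which decorrelates the trees and tends to reduce variance further. The sketch below uses synthetic data from `make_regression`, and the names `tree`, `bagged`, and `forest` are introduced here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

# One tree, fit once on the full training set.
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Bagging: 100 trees, each fit on a bootstrap resample of the training rows.
# Every split may still consider every feature.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          random_state=42).fit(X_train, y_train)

# Random forest: also 100 bootstrapped trees, but each split only considers
# a random subset of the features (here, sqrt of the feature count).
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                               random_state=42).fit(X_train, y_train)

for name, model in [("single tree", tree), ("bagged trees", bagged), ("random forest", forest)]:
    print(f"{name:14s} test R^2: {model.score(X_test, y_test):.3f}")
```

Which ensemble scores higher depends on the data; the structural difference is the `max_features` restriction at each split.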

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

A) With the features hard-coded in `regression_machine`, every model did fairly poorly.
No model reached a testing R² above 0.30, and every testing RMSE was above 20. Because `inc` is measured in thousands of dollars,
the typical prediction error was more than $20,000. Only testing RMSE was computed; the training fit is reported as R².

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

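Our DecisionTreeRegressor, BaggingRegressor, and RandomForestRegressor models were all heavily overfit: training R² was about 0.91, 0.79, and 0.79 respectively, while testing R² was about -0.17, 0.15, and 0.15.
Our KNN model was overfit to a smaller degree (about 0.51 training versus 0.26 testing).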

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

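SVR did best, with the lowest testing RMSE (~20.56) and the highest testing R² (~0.29).
Its training R² (~0.31) is close to its testing R², so it also shows little overfitting.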

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1) Engineer more features, such as interaction terms.
2) Tune hyperparameters with a grid search, and try regularized linear models (Lasso and Ridge).
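A grid search over the chosen model's hyperparameters could look like the sketch below. It assumes SVR as the final model and tunes `C` and `epsilon` over a small, arbitrary grid on synthetic data; the step names `scale` and `svr`, the grid values, and the variable names are all illustrative choices rather than anything from the notebook above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
param_grid = {"svr__C": [0.1, 1, 10, 100], "svr__epsilon": [0.1, 1.0]}

# scikit-learn maximizes scores, so RMSE is exposed as its negative.
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the scaling step.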

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

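A) `p401k` leaks the target: only people who are eligible for a 401(k) can participate in one, so `p401k = 1` implies `e401k = 1`.
A model that leans on it would look accurate on this data but would not generalize to prospective customers, for whom participation data may not be available.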

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

1) logistic regression <br>
2) knn<br>
3) decision tree<br>
4) random forest <br>
5) support vector<br>
6) variations of Naive Bayes, such as Bernoulli NB, Poisson NB, multivariate NB, etc.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [497]:
from sklearn.linear_model import LogisticRegression

In [659]:
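def classification_machine(estimator, X, y):
    """Fit one estimator class to predict `e401k` and report train and test scores.

    `estimator` is an estimator class, not an instance. `.score` returns accuracy
    for classifiers and R^2 for regressors, so the "r^2" label in the message
    below is only literally true for the regressors.
    """
    # NOTE: X and y are reassigned here, so the arguments passed in are ignored.
    X = df401k[['inc', 'incsq', 'nettfa', 'age', 'agesq']]
    y = df401k['e401k']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

    # Fit the scaler on the training set only, then apply it to both sets.
    ss = StandardScaler()
    X_train_scaled = ss.fit_transform(X_train)
    X_test_scaled = ss.transform(X_test)

    est = estimator()  # instantiate with default hyperparameters
    est.fit(X_train_scaled, y_train)
    train_score = est.score(X_train_scaled, y_train)
    r_sq = est.score(X_test_scaled, y_test)
    print(train_score, r_sq)

    return ["our training data has an r^2 of {} while our testing data has an r^2 of {}".format(train_score, r_sq)]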

In [660]:
logreg = LogisticRegression

In [661]:
classification_machine(logreg, X, y)

0.6560646900269542 0.663611859838275




['our training data has an r^2 of 0.6560646900269542 while our testing data has an r^2 of 0.663611859838275']

In [662]:
# NOTE: this and the remaining models below are regressors, so their scores are R^2, not accuracy.
knn = KNeighborsRegressor

In [663]:
classification_machine(KNeighborsRegressor, X, y)

0.32344300484277655 0.007990362105283588




['our training data has an r^2 of 0.32344300484277655 while our testing data has an r^2 of 0.007990362105283588']

In [664]:
dtr = DecisionTreeRegressor

In [665]:
classification_machine(dtr, X, y)

1.0 -0.7474854478542001




['our training data has an r^2 of 1.0 while our testing data has an r^2 of -0.7474854478542001']

In [666]:
br = BaggingRegressor

In [667]:
classification_machine(br, X, y)



0.8145068376763568 -0.017054662795869424


['our training data has an r^2 of 0.8145068376763568 while our testing data has an r^2 of -0.017054662795869424']

In [668]:
rfr = RandomForestRegressor

In [669]:
classification_machine(rfr, X, y)



0.8177730996054677 -0.04232633950608244


['our training data has an r^2 of 0.8177730996054677 while our testing data has an r^2 of -0.04232633950608244']

In [670]:
adabr = AdaBoostRegressor

In [671]:
classification_machine(adabr, X, y)

0.13059259619614472 0.11574775874452081




['our training data has an r^2 of 0.13059259619614472 while our testing data has an r^2 of 0.11574775874452081']

In [672]:
svm = SVR

In [673]:
classification_machine(svm, X, y)



0.005604288134886337 -0.004901611750223633


['our training data has an r^2 of 0.005604288134886337 while our testing data has an r^2 of -0.004901611750223633']
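Apart from logistic regression, the models above are regressors, so their scores are R² values rather than classification metrics. A sketch of the same comparison using the classifier counterparts, scored with F1 on both the training and testing data, is shown below. It runs on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `classifiers`, and `f1_results` are introduced here, and the numbers it prints say nothing about the 401(k) data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, n_informative=3,
                                     random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

classifiers = {
    "logistic regression": LogisticRegression(),
    "knn": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "bagged trees": BaggingClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
    "svc": SVC(),
}

f1_results = {}
for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    # F1 on the training data and on the held-out testing data.
    f1_results[name] = (f1_score(y_train, pipe.predict(X_train)),
                        f1_score(y_test, pipe.predict(X_test)))

for name, (train_f1, test_f1) in f1_results.items():
    print(f"{name:20s} train F1 {train_f1:.3f}   test F1 {test_f1:.3f}")
```

With classifiers, `.score` would report accuracy, and the train/test F1 gap gives a direct read on overfitting for each model.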

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False positives - people determined to be eligible for a 401(k) when they are not eligible.<br>
False negatives - people determined to be ineligible for a 401(k) when they are eligible.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

Minimize false positives. We do not want to accidentally tell employees who are legally ineligible that they are eligible: the repercussions of a false positive are higher, because it could lead to frustration from the employees
and possibly even lawsuits.


##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

Specificity (or precision). Both penalize false positives; sensitivity/recall would instead target false negatives.
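The relationship between the two error types and the candidate metrics can be worked through on a toy example. The labels below are made up purely for illustration (1 = eligible, 0 = not eligible), and the variable names are introduced here.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels, invented for illustration: 1 = eligible, 0 = not eligible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)                  # falls as false positives rise
recall = recall_score(y_true, y_pred)         # tp / (tp + fn), falls as false negatives rise
precision = precision_score(y_true, y_pred)   # tp / (tp + fp), falls as false positives rise
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"specificity={specificity:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
```

Here there is one false positive and one false negative, so specificity is 5/6 while recall, precision, and F1 are all 0.75.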

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

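F1-score is the harmonic mean of precision and recall, so it balances the two kinds of error. That suits a goal of reaching as many eligible 401(k) clients as possible while also minimizing mistaken targeting, which maximizes revenue.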

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

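F1-score was not computed; the scores above come from each estimator's `.score` method, which is accuracy for logistic regression and R² for the regressors. On that basis, every model except logistic regression scored poorly on the testing data (training performance is irrelevant if it doesn't generalize).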

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

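Yes; the worst offenders are DecisionTreeRegressor (1.0 training versus about -0.75 testing), BaggingRegressor (about 0.81 versus -0.02), and RandomForestRegressor (about 0.82 versus -0.04). Note that these are R² scores rather than F1-scores.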

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

LogisticRegression is the clear winner here. Its training accuracy (~0.66) and testing accuracy (~0.66) are nearly identical, so it is correct about two-thirds of the time
with very little overfitting, and the result should generalize well.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1) Tuning hyperparameters, including regularization strength <br>
2) Engineering features that make better predictions, such as interactions.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Regression - The strongest predictor besides `inc` itself would be `incsq`; I excluded both because I assumed using the latter is cheating,
since it's simply the square of income. Of the features I used, `nettfa`, `agesq`, and `fsize` seemed to have the most predictive power.<br>
Classification - We do not want to overfit our data, as discussed in question 17: results must be generalizable, as measured by our test scores.