## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the balance has hopefully grown substantially by the time you retire.
- These differ from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd

In [2]:
ira_df = pd.read_csv('./401ksubs.csv')
ira_df.head()

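|   | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |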


In [3]:
ira_df.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

In [4]:
ira_df.shape

(9275, 11)

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

What kind of job the person has, how many years they have been at that job, and their education level.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

That would be unethical for a number of reasons. It could lead to minorities not getting the same chances to save for retirement or not being offered the same deals. There is also the worry about what else the data could be used for later.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

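We would not use `incsq`. Income squared is computed directly from income, so including it would leak the target and effectively give the model the answer.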
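
The relationship between `inc` and `incsq` can be checked directly. The sketch below uses only the five rows printed by `head()` earlier in this notebook, retyped by hand into a small frame called `sample` (a name introduced here for illustration), so it does not prove anything about the full dataset.

```python
import numpy as np
import pandas as pd

# First five rows of `inc` and `incsq`, copied from the head() output above.
sample = pd.DataFrame({
    "inc":   [13.17, 61.23, 12.858, 98.88, 22.614],
    "incsq": [173.4489, 3749.113, 165.3282, 9777.254, 511.393],
})

# If incsq is simply inc squared, the two agree up to rounding in the file.
is_square = np.allclose(sample["inc"] ** 2, sample["incsq"], rtol=1e-4)
print(is_square)
```

On the real data, the same comparison could be run on `ira_df['inc']` and `ira_df['incsq']`.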

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

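Income squared (`incsq`) and age squared (`agesq`). These terms may have been added because the experts expected a quadratic, rather than strictly linear, relationship between income or age and some of the other variables.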
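
To illustrate why a squared term can help, here is a minimal sketch on synthetic data. The hump-shaped relationship between `age` and `y` is invented purely for illustration and is not taken from the 401(k) dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(25, 64, size=500)

# Hypothetical outcome that rises and then falls with age (made up for this demo).
y = -0.05 * (age - 45) ** 2 + 40 + rng.normal(0, 2, size=500)

X_linear = age.reshape(-1, 1)              # age only
X_quad = np.column_stack([age, age ** 2])  # age plus age squared

# Training R^2 with and without the engineered squared feature.
r2_linear = LinearRegression().fit(X_linear, y).score(X_linear, y)
r2_quad = LinearRegression().fit(X_quad, y).score(X_quad, y)
print(round(r2_linear, 3), round(r2_quad, 3))
```

A linear model can only bend to follow the curve once the squared column is available, which is one plausible reason the columns were precomputed.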

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

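The entries for income squared and age squared are listed under the same names as income and age, even though those columns are clearly not squared. The correct names would presumably be `incsq` and `agesq`, matching the columns in the dataframe.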

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

- Linear regression
- k-nearest neighbors
- Decision tree
- Bagged trees
- Random forest
- AdaBoost
- Support vector regressor
- Neural network

They all have their flaws, but each is appropriate for answering this problem.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [28]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, f1_score

In [6]:
ira_df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [7]:
X = ira_df.drop(columns = ['e401k','p401k', 'pira', 'inc', 'incsq'])
y = ira_df['inc']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2021)

In [9]:
stan = StandardScaler()
X_train_sc = stan.fit_transform(X_train)
X_test_sc = stan.transform(X_test)

In [10]:
linear_reg = LinearRegression()
linear_reg.fit(X_train_sc, y_train)

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_sc, y_train)

cart_reg = DecisionTreeRegressor()
cart_reg.fit(X_train_sc, y_train)

bagged_reg = BaggingRegressor()
bagged_reg.fit(X_train_sc, y_train)

random_forest_reg = RandomForestRegressor()
random_forest_reg.fit(X_train_sc, y_train)

adaboost_reg = AdaBoostRegressor()
adaboost_reg.fit(X_train_sc, y_train)

support_vector_reg = SVR()
support_vector_reg.fit(X_train_sc, y_train)

SVR()

##### 9. What is bootstrapping?

Bootstrapping is random sampling with replacement from the original data, typically producing samples of the same size as the original.
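
A minimal sketch of bootstrapping with NumPy. The five values are the `inc` entries from the `head()` output above; the variable names and the choice of 1,000 resamples are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2021)
data = np.array([13.17, 61.23, 12.858, 98.88, 22.614])  # `inc` values from head()

# One bootstrap sample: same size as the original, drawn with replacement,
# so some values can repeat and others can be left out.
boot = rng.choice(data, size=data.size, replace=True)

# Repeating the draw many times approximates the sampling distribution of the mean.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])
print(boot)
print(boot_means.std())
```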

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

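A single decision tree is fit once, on the original training data. With bagged decision trees, we draw many bootstrapped samples, fit a separate decision tree on each one, and aggregate their predictions (averaging for regression, voting for classification).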
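
The procedure can be sketched by hand for the regression case. This uses synthetic data from `make_regression` and an arbitrary choice of 25 trees; it is meant to show the mechanics, not to replace `BaggingRegressor`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement, same size as the data.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Bagged prediction for regression = average of the individual tree predictions.
all_preds = np.stack([tree.predict(X_demo) for tree in trees])
bagged_pred = all_preds.mean(axis=0)
print(all_preds.shape, bagged_pred.shape)
```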

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

A random forest is a set of bagged decision trees with one change: each split considers only a random subset of the features, rather than all of them.
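
In scikit-learn, the size of that feature subset is controlled by the `max_features` parameter. A small sketch on synthetic data, where the values 50 and 2 are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Synthetic stand-in data with 6 features (not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

# Bagging: each tree is fit on a bootstrap sample and may split on any feature.
bagged = BaggingRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Random forest: also bootstrapped, but each split looks at only 2 randomly chosen features.
forest = RandomForestRegressor(n_estimators=50, max_features=2, random_state=0).fit(X_demo, y_demo)

print(len(bagged.estimators_), len(forest.estimators_), forest.max_features)
```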

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

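A random forest can have lower variance than plain bagged trees. Bagged trees tend to be highly correlated because they all split on the same strong features; restricting the features available at each split decorrelates the trees, so averaging them reduces variance further.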

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [12]:
pred1 = linear_reg.predict(X_train_sc)
pred2 = linear_reg.predict(X_test_sc)
pred3 = knn_reg.predict(X_train_sc)
pred4 = knn_reg.predict(X_test_sc)
pred5 = cart_reg.predict(X_train_sc)
pred6 = cart_reg.predict(X_test_sc)
pred7 = bagged_reg.predict(X_train_sc)
pred8 = bagged_reg.predict(X_test_sc)
pred9 = random_forest_reg.predict(X_train_sc)
pred10 = random_forest_reg.predict(X_test_sc)
pred11 = adaboost_reg.predict(X_train_sc)
pred12 = adaboost_reg.predict(X_test_sc)
pred13 = support_vector_reg.predict(X_train_sc)
pred14 = support_vector_reg.predict(X_test_sc)

In [13]:
rmse1 = mean_squared_error(y_train, pred1) ** 0.5
rmse2 = mean_squared_error(y_test, pred2) ** 0.5
rmse3 = mean_squared_error(y_train, pred3) ** 0.5
rmse4 = mean_squared_error(y_test, pred4) ** 0.5
rmse5 = mean_squared_error(y_train, pred5) ** 0.5
rmse6 = mean_squared_error(y_test, pred6) ** 0.5
rmse7 = mean_squared_error(y_train, pred7) ** 0.5
rmse8 = mean_squared_error(y_test, pred8) ** 0.5
rmse9 = mean_squared_error(y_train, pred9) ** 0.5
rmse10 = mean_squared_error(y_test, pred10) ** 0.5
rmse11 = mean_squared_error(y_train, pred11) ** 0.5
rmse12 = mean_squared_error(y_test, pred12) ** 0.5
rmse13 = mean_squared_error(y_train, pred13) ** 0.5
rmse14 = mean_squared_error(y_test, pred14) ** 0.5

In [14]:
print(rmse1, rmse2)
print(rmse3, rmse4)
print(rmse5, rmse6)
print(rmse7, rmse8)
print(rmse9, rmse10)
print(rmse11, rmse12)
print(rmse13, rmse14)

20.422880918305054 19.8552653812956
16.52185880246019 20.19355529565361
2.3701607409728966 27.00883040520602
8.952670505448932 20.818655921667233
7.713964409001512 20.098638680533046
22.262280417521406 22.37018945237668
20.00429474367526 19.76742188668183
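
The fourteen `pred`/`rmse` variables above could be collapsed into a loop. The sketch below is one possible layout, shown on synthetic data with only three of the seven models to keep it short; the names `models`, `rows`, and `rmse_table` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=2021)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2021)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=2021),
}

rows = []
for name, model in models.items():
    # The pipeline scales inside the fit, so the scaler only ever sees training data.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(Xtr, ytr)
    rows.append({
        "model": name,
        "train_rmse": mean_squared_error(ytr, pipe.predict(Xtr)) ** 0.5,
        "test_rmse": mean_squared_error(yte, pipe.predict(Xte)) ** 0.5,
    })

rmse_table = pd.DataFrame(rows).set_index("model")
print(rmse_table)
```

Keeping train and test RMSE side by side in one table makes the overfitting comparison in the next question easier to read.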


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

The k-nearest neighbors, decision tree, bagged trees, and random forest models all show overfitting. In all four cases, the testing error is much higher than the training error.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

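I would pick the support vector regressor. It has the lowest testing RMSE (about 19.77) and essentially no gap between training and testing error. However, linear regression is very close on both counts (testing RMSE of about 19.86), so it is really a toss-up.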

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would tune the hyperparameters. I would also experiment with feature selection.
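
One way hyperparameter tuning might look for the support vector regressor is a small cross-validated grid search. This is a sketch on synthetic data; the grid values are arbitrary starting points, not recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=2021)

# make_pipeline names each step after its lowercased class, hence the "svr__" prefix.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    "svr__C": [0.1, 1, 10],
    "svr__epsilon": [0.1, 1.0],
}

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```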

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

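Anyone who participates in a 401(k) must be eligible for one, so `p401k` nearly gives away the answer for those rows. The model may end up relying on that one feature and turn out overfit.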
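
A cross-tabulation makes this relationship visible. The sketch below uses only the `e401k` and `p401k` values from the five rows printed by `head()` earlier, so it is far too small to confirm the pattern; on the full data, the same check could be run against `ira_df`.

```python
import pandas as pd

# e401k and p401k for the first five rows, copied from the head() output above.
sample = pd.DataFrame({
    "e401k": [0, 1, 0, 0, 0],
    "p401k": [0, 1, 0, 0, 0],
})

# Rows are participation, columns are eligibility.
print(pd.crosstab(sample["p401k"], sample["e401k"]))

# Count participants who are marked as not eligible; leakage implies this is zero.
participants_not_eligible = int(((sample["p401k"] == 1) & (sample["e401k"] == 0)).sum())
print(participants_not_eligible)
```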

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

logistic regression 
knn 
decision tree 
bagged trees 
random forest 
gradient boosting
adaboost
svm
they are all appropriate for the problem

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [35]:
X = ira_df.drop(columns = ['e401k','p401k', 'incsq', 'agesq'])
y = ira_df['e401k']

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2021)

In [37]:
stan = StandardScaler()
X_train_sc = stan.fit_transform(X_train)
X_test_sc = stan.transform(X_test)

In [38]:
log_reg = LogisticRegression()
log_reg.fit(X_train_sc, y_train)

knn_class = KNeighborsClassifier()
knn_class.fit(X_train_sc, y_train)

cart_class = DecisionTreeClassifier()
cart_class.fit(X_train_sc, y_train)

bagged_class = BaggingClassifier()
bagged_class.fit(X_train_sc, y_train)

random_forest_class = RandomForestClassifier()
random_forest_class.fit(X_train_sc, y_train)

adaboost_class = AdaBoostClassifier()
adaboost_class.fit(X_train_sc, y_train)

support_vector_class = SVC()
support_vector_class.fit(X_train_sc, y_train)

SVC()

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False positives are when we predict someone is eligible for a 401(k) but they are not. False negatives are when we predict someone is not eligible but they are.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We want to minimize false negatives. A false negative is a potential client we will never know exists. False positives waste some of our time but do not cost us customers.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

We would optimize recall.
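
One common way to trade false negatives for false positives is to lower the probability threshold used to call someone "eligible." A minimal sketch on synthetic data, where the 0.3 threshold is an arbitrary value chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic stand-in data (not the 401(k) dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.6, 0.4], random_state=2021
)
clf = LogisticRegression().fit(X_demo, y_demo)

# Predicted probability of the positive class.
proba = clf.predict_proba(X_demo)[:, 1]

# Default cutoff of 0.5 versus a lower cutoff that flags more people as positive.
pred_default = (proba >= 0.5).astype(int)
pred_low = (proba >= 0.3).astype(int)

recall_default = recall_score(y_demo, pred_default)
recall_low = recall_score(y_demo, pred_low)
print(recall_default, recall_low)
```

Lowering the threshold can never reduce recall, because every case flagged at 0.5 is still flagged at 0.3; the cost is more false positives.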

##### 23. Suppose that instead of optimizing for the metric in problem 22, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

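F1-score is the harmonic mean of precision and recall, so it is high only when both false positives and false negatives are kept low. That makes it a reasonable choice when we want a balanced model rather than one that favors either type of error.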
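
The calculation can be worked through by hand on a small example. The ten labels below are invented purely for illustration and have nothing to do with the 401(k) data.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 1 = eligible, 0 = not eligible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of those predicted eligible, how many are
recall = tp / (tp + fn)     # of those truly eligible, how many we found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.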

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [40]:
pred1 = log_reg.predict(X_train_sc)
pred2 = log_reg.predict(X_test_sc)
pred3 = knn_class.predict(X_train_sc)
pred4 = knn_class.predict(X_test_sc)
pred5 = cart_class.predict(X_train_sc)
pred6 = cart_class.predict(X_test_sc)
pred7 = bagged_class.predict(X_train_sc)
pred8 = bagged_class.predict(X_test_sc)
pred9 = random_forest_class.predict(X_train_sc)
pred10 = random_forest_class.predict(X_test_sc)
pred11 = adaboost_class.predict(X_train_sc)
pred12 = adaboost_class.predict(X_test_sc)
pred13 = support_vector_class.predict(X_train_sc)
pred14 = support_vector_class.predict(X_test_sc)

In [41]:
f1 = f1_score(y_train, pred1) 
f2 = f1_score(y_test, pred2) 
f3 = f1_score(y_train, pred3) 
f4 = f1_score(y_test, pred4) 
f5 = f1_score(y_train, pred5)
f6 = f1_score(y_test, pred6) 
f7 = f1_score(y_train, pred7)
f8 = f1_score(y_test, pred8) 
f9 = f1_score(y_train, pred9) 
f10 = f1_score(y_test, pred10) 
f11 = f1_score(y_train, pred11) 
f12 = f1_score(y_test, pred12) 
f13 = f1_score(y_train, pred13)
f14 = f1_score(y_test, pred14)

In [42]:
print(f1, f2)
print(f3, f4)
print(f5, f6)
print(f7, f8)
print(f9, f10)
print(f11, f12)
print(f13, f14)

0.3797768810823641 0.3949661181026138
0.6538748137108793 0.5070842654735273
1.0 0.4796195652173913
0.9670716675471034 0.49647611589663276
1.0 0.5583145221971406
0.5557474573018615 0.5975975975975976
0.4700665188470067 0.4817777777777778


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

The k-nearest neighbors, decision tree, bagged trees, and random forest models are overfit.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would use the AdaBoost model. It has the highest testing F1 score (about 0.60) and shows no sign of overfitting, since its training score is not higher than its testing score.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would tune over hyperparameters and try to do more feature engineering.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

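For regression, the best model is the support vector regressor (testing RMSE of about 19.8), with linear regression nearly tied. For classification, the best model is AdaBoost (testing F1 of about 0.60). Both results come from models fit with default hyperparameters, so they are baselines rather than final answers.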