## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts is invested, and ideally the balance is much larger by the time you retire.
- These differ from regular bank accounts in that they carry certain tax benefits. Employers also frequently match the money you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r'./401ksubs.csv')

In [3]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

1. Length of employment
2. Industry of career
3. Household income

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

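Using `race` as a feature could make the model discriminate: people would be targeted for, or excluded from, financial products because of their race. That can reinforce existing inequities and may conflict with anti-discrimination rules.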

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

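`incsq`, because it is income squared and would leak the target directly into the features. Net total financial assets (`nettfa`) is also questionable, since it varies with lifestyle expenditure, family inheritance, and so on, rather than with earnings alone.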
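
A quick check on `incsq` illustrates the kind of leakage to look for when choosing features for `inc`. This minimal sketch uses only the five rows shown in the `df.head()` output above (it does not load the full file) and tests whether the square root of `incsq` reproduces `inc`:

```python
import math

# (inc, incsq) pairs copied from the first five rows of df.head() above.
rows = [
    (13.17, 173.4489),
    (61.23, 3749.113),
    (12.858, 165.3282),
    (98.88, 9777.254),
    (22.614, 511.393),
]

# If incsq is inc squared, its square root should reproduce inc
# up to the rounding used in the file.
max_gap = max(abs(math.sqrt(incsq) - inc) for inc, incsq in rows)
print(f"largest |sqrt(incsq) - inc| over the five rows: {max_gap:.6f}")
```

If the gap is negligible on the full data as well, any model given `incsq` can recover income almost exactly, which would make its scores look better than they deserve.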

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

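The square of income (`incsq`) and the square of age (`agesq`). Income and age probably do not relate linearly to the outcomes of interest. For example, as we age we gain work experience, so pay may rise at a changing rate rather than a constant one. A squared term lets a linear model capture that curvature.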
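
To illustrate why a squared term can help a linear model, here is a small sketch on made-up data. The ages and the income curve are hypothetical, not taken from the dataset. It fits a straight line and a quadratic with `numpy.polyfit` and compares R2:

```python
import numpy as np

# Hypothetical, noise-free data: income rises, peaks at age 45, then declines.
age = np.arange(25, 66)
income = 60.0 - 0.05 * (age - 45) ** 2


def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Degree 1 uses age only; degree 2 is equivalent to adding an age**2 feature.
r2_linear = r_squared(income, np.polyval(np.polyfit(age, income, 1), age))
r2_quadratic = r_squared(income, np.polyval(np.polyfit(age, income, 2), age))

print(f"R2 with age only:       {r2_linear:.3f}")
print(f"R2 with age and age**2: {r2_quadratic:.3f}")
```

Because the made-up curve is symmetric around 45, the straight line explains almost none of the variation, while the quadratic fit explains essentially all of it.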

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

`inc` and `age`. The dictionary describes them as inc^2 and age^2, but they should be described as income and age, respectively.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

1. Linear regression
2. K-Nearest Neighbors
3. Decision Tree
4. Random Forest
5. Bagging
6. AdaBoost
7. Support Vector Machine

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an AdaBoost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
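
As one possible way to follow the two hints above, the sketch below reuses a single seeded split for every model and wraps each estimator in a pipeline with its scaler. The data here is synthetic (`make_regression`), and the two candidate models are only examples; the lab itself would use the 401(k) dataframe and the full list of models.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lab's dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# One split, reused for every model, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
}

cv_scores = {}
for name, estimator in candidates.items():
    # The scaler is refit inside each CV fold, so no fold is scaled with
    # statistics computed from its own validation rows.
    pipe = make_pipeline(StandardScaler(), estimator)
    cv_scores[name] = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring="r2"
    ).mean()
    print(f"{name}: mean CV R2 = {cv_scores[name]:.3f}")
```

Putting the scaler inside the pipeline is a design choice: it keeps preprocessing and model together, so cross-validation treats them as one unit.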

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor
from sklearn.svm import SVR

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

In [5]:
df_reg = df.drop(['e401k','p401k','pira'], axis = 1)

In [6]:
X = df_reg.drop('inc', axis=1)
y = df_reg['inc']

In [7]:
X.head()

Unnamed: 0,marr,male,age,fsize,nettfa,incsq,agesq
0,0,0,40,1,4.575,173.4489,1600
1,0,1,35,1,154.0,3749.113,1225
2,1,0,44,2,0.0,165.3282,1936
3,1,1,44,2,21.8,9777.254,1936
4,0,0,53,1,18.45,511.393,2809


In [8]:
y.head()

0    13.170
1    61.230
2    12.858
3    98.880
4    22.614
Name: inc, dtype: float64

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [10]:
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)

In [11]:
models = {"LinearRegression": LinearRegression(),
          "KNN": KNeighborsRegressor(),
          "Decision Tree": DecisionTreeRegressor(),
          "Bagging": BaggingRegressor(),
          "Random Forest": RandomForestRegressor(),
          "AdaBoost": AdaBoostRegressor(),
          "SVR": SVR()}

In [12]:
models.items()

dict_items([('LinearRegression', LinearRegression()), ('KNN', KNeighborsRegressor()), ('Decision Tree', DecisionTreeRegressor()), ('Bagging', BaggingRegressor()), ('Random Forest', RandomForestRegressor()), ('AdaBoost', AdaBoostRegressor()), ('SVR', SVR())])

In [13]:
for name, model in models.items():
    model.fit(Z_train, y_train)
    # cross_val_score refits a clone of the model on each fold, so both numbers
    # below are 5-fold CV means computed within the named split.
    train_score = cross_val_score(model, Z_train, y_train, scoring = "r2", cv = 5).mean()
    test_score = cross_val_score(model, Z_test, y_test, scoring = "r2", cv = 5).mean()
    print(f"---{name}--- \nR2 score of Train set: {train_score} \nR2 score of Test set: {test_score} \n ")

---LinearRegression--- 
R2 score of Train set: 0.8951791853322378 
R2 score of Test set: 0.8941077606223509 
 
---KNN--- 
R2 score of Train set: 0.9654103685260373 
R2 score of Test set: 0.9211934922978221 
 
---Decision Tree--- 
R2 score of Train set: 0.9997888466035896 
R2 score of Test set: 0.9975105015652275 
 
---Bagging--- 
R2 score of Train set: 0.9999124835611981 
R2 score of Test set: 0.9986522472211818 
 
---Random Forest--- 
R2 score of Train set: 0.9999284474906768 
R2 score of Test set: 0.9985988769555669 
 
---AdaBoost--- 
R2 score of Train set: 0.9918038946697598 
R2 score of Test set: 0.9896879148462917 
 
---SVR--- 
R2 score of Train set: 0.8760943519890946 
R2 score of Test set: 0.7442422513666415 
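
The loop above calls `cross_val_score` on `Z_test`, which trains fresh copies of each model on folds of the test set rather than scoring the model that was fit on the training set. A more conventional pattern, sketched here on synthetic data (an assumption, not the lab's dataset), cross-validates on the training split and then scores the single fitted model on the untouched test split:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data.
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=42)

# Cross-validated estimate, using training rows only.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()

# Fit once on the full training split, then score that same model on each split.
model.fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"CV R2 (train only): {cv_r2:.3f}")
print(f"Train R2:           {train_r2:.3f}")
print(f"Test R2:            {test_r2:.3f}")
```

With this setup, a large drop from train R2 to test R2 points to overfitting of the model that would actually be deployed.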
 


##### 9. What is bootstrapping?

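Bootstrapping is random sampling with replacement. We draw a new sample of the same size as the original data by picking observations at random, so some rows appear more than once and others not at all.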
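
A minimal sketch of one bootstrap draw, using only the standard library. The five values are the `inc` entries from the `df.head()` output earlier, and the seed is arbitrary:

```python
import random

random.seed(42)

# The five `inc` values shown in df.head() earlier.
data = [13.17, 61.23, 12.858, 98.88, 22.614]

# Draw a bootstrap sample: same size as the original, sampled with replacement.
indices = [random.randrange(len(data)) for _ in range(len(data))]
bootstrap_sample = [data[i] for i in indices]

# Rows that were never drawn are "out-of-bag" for this sample.
drawn = set(indices)
out_of_bag = [data[i] for i in range(len(data)) if i not in drawn]

print("bootstrap sample:", bootstrap_sample)
print("out-of-bag rows: ", out_of_bag)
```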

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

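A single decision tree is fit once on the training data. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training data, and combines their predictions (averaging for regression, voting for classification). This lets us leverage the insight of many models instead of depending on one.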
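
The following sketch spells out bagging by hand rather than using `BaggingRegressor`, to make the two steps (bootstrap, then aggregate) visible. The data is synthetic and the number of trees is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the lab's dataset).
X, y = make_regression(n_samples=200, n_features=4, noise=20.0, random_state=1)
rng = np.random.default_rng(42)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap: draw n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate: the bagged prediction is the average of the trees' predictions.
per_tree = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
bagged_pred = per_tree.mean(axis=0)

print("per-tree prediction array shape:", per_tree.shape)
print("mean spread across trees:", per_tree.std(axis=0).mean())
```

The spread across trees is what averaging smooths out: each tree sees a slightly different sample and so makes slightly different predictions.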

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In a random forest, each split considers only a random subset of the features, and the best split among that subset is used at that node. In bagging, every feature is considered when splitting a node.
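
In scikit-learn, this difference shows up as the `max_features` setting of the underlying trees. The sketch below uses synthetic data with nine features. Note that it sets `max_features="sqrt"` explicitly, since in recent scikit-learn versions `RandomForestRegressor` defaults to considering all features at each split:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 9 features.
X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

# Bagged trees: every tree may use all 9 features at every split.
bagged = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=20, random_state=42
).fit(X, y)

# Random forest: each split considers only a random subset of the features.
forest = RandomForestRegressor(
    n_estimators=20, max_features="sqrt", random_state=42
).fit(X, y)

# max_features_ is the number of features each fitted tree examines per split.
features_per_split_bagged = bagged.estimators_[0].max_features_
features_per_split_forest = forest.estimators_[0].max_features_

print("bagged trees, features per split: ", features_per_split_bagged)
print("random forest, features per split:", features_per_split_forest)
```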

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

Because of the random feature selection, the trees in a random forest are less correlated with each other than the trees in regular bagging. Averaging less-correlated trees reduces variance more, which often gives better predictive performance for a small increase in bias.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [14]:
for name, model in models.items():
    model.fit(Z_train, y_train)
    rmse_train = (-cross_val_score(model, Z_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)).mean()
    rmse_test = (-cross_val_score(model, Z_test, y_test, scoring = "neg_root_mean_squared_error", cv = 5)).mean()
    print(f"---{name}--- \nRMSE of Train set: {rmse_train} \nRMSE of Test set: {rmse_test} \n ")

---LinearRegression--- 
RMSE of Train set: 7.753478372886714 
RMSE of Test set: 7.938431324781509 
 
---KNN--- 
RMSE of Train set: 4.440078319668359 
RMSE of Test set: 6.828906749097665 
 
---Decision Tree--- 
RMSE of Train set: 0.32550372864798344 
RMSE of Test set: 0.6241561937111604 
 
---Bagging--- 
RMSE of Train set: 0.20589915429530184 
RMSE of Test set: 0.6998526206969825 
 
---Random Forest--- 
RMSE of Train set: 0.16665559697811178 
RMSE of Test set: 0.6533214049524008 
 
---AdaBoost--- 
RMSE of Train set: 2.2415736685759406 
RMSE of Test set: 2.356519047816822 
 
---SVR--- 
RMSE of Train set: 8.435997024337572 
RMSE of Test set: 12.349973321956108 
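
For reference when reading these numbers, RMSE is the square root of the mean squared residual. The sketch below computes it by hand on four hypothetical incomes and predictions (not drawn from the dataset):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical incomes and predictions.
actual = [20.0, 35.0, 50.0, 80.0]
predicted = [22.0, 33.0, 50.0, 76.0]

# Residuals are -2, 2, 0, 4, so the mean squared error is 24 / 4 = 6.
error = rmse(actual, predicted)
print(f"RMSE = {error:.3f}")
```

RMSE is in the same units as the target. Assuming `inc` is recorded in thousands of dollars, as the data dictionary indicates, an RMSE of 7.9 would mean a typical error of roughly $7,900.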
 


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

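KNN (RMSE 4.44 train vs. 6.83 test) and SVR (8.44 vs. 12.35) show the clearest gaps. The tree-based models (decision tree, bagging, random forest) also have test RMSE roughly two to four times their training RMSE, although both values are small in absolute terms.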

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

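All of the models fit well. If I had to pick one, it would be the bagging model: it has the highest test R2 of all (0.9987) and little gap between its train and test R2 (0.9999 vs. 0.9987). One caveat is that scores this close to 1 are likely inflated, because `incsq`, the square of the target, is among the features.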

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

Tune hyperparameters such as `max_depth`, `min_samples_split`, `min_samples_leaf`, `ccp_alpha`, and `max_samples`.
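
One common way to do this tuning is a grid search with cross-validation. The sketch below is an illustration only: it uses synthetic data, a single decision tree, and a small grid chosen arbitrarily.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the lab's dataset).
X, y = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=0)

# A small, arbitrary grid over two of the hyperparameters named above.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # the scorer is negated RMSE, so flip the sign

print("best parameters:", best_params)
print(f"best CV RMSE:    {best_rmse:.3f}")
```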

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

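`p401k` and `e401k` are highly correlated: anyone who participates in a 401(k) must be eligible for one, so `p401k` nearly gives away the target. The model would also be of little use for new prospects, whose participation we would not know without already knowing their eligibility.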
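
A quick check of how closely `p401k` tracks `e401k`, using only the five rows printed in the `df.head()` output earlier (far too few to draw conclusions, but enough to show the check):

```python
# (e401k, p401k) pairs copied from the first five rows of df.head() above.
rows = [(0, 0), (1, 1), (0, 0), (0, 0), (0, 0)]

# Participation requires eligibility, so p401k == 1 should imply e401k == 1.
participants_all_eligible = all(e == 1 for e, p in rows if p == 1)

# Share of rows where the feature simply equals the target.
agreement = sum(e == p for e, p in rows) / len(rows)

print("every participant is eligible:", participants_all_eligible)
print("share of rows where p401k == e401k:", agreement)
```

On the full dataframe, the same comparison could be made with `pd.crosstab(df['e401k'], df['p401k'])`.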

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

1. Logistic regression
2. K-Nearest Neighbors
3. Decision Tree
4. Random Forest
5. Bagging
6. AdaBoost
7. Support Vector Machine

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an AdaBoost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.svm import SVC

In [16]:
df_class = df.drop('p401k', axis = 1)

In [17]:
X = df_class.drop('e401k', axis=1)
y = df_class['e401k']

In [18]:
X.head()

Unnamed: 0,inc,marr,male,age,fsize,nettfa,pira,incsq,agesq
0,13.17,0,0,40,1,4.575,1,173.4489,1600
1,61.23,0,1,35,1,154.0,0,3749.113,1225
2,12.858,1,0,44,2,0.0,0,165.3282,1936
3,98.88,1,1,44,2,21.8,0,9777.254,1936
4,22.614,0,0,53,1,18.45,0,511.393,2809


In [19]:
y.head()

0    0
1    1
2    0
3    0
4    0
Name: e401k, dtype: int64

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [21]:
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)

In [22]:
models = {"LogisticRegression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Decision Tree": DecisionTreeClassifier(),
          "Bagging": BaggingClassifier(),
          "Random Forest": RandomForestClassifier(),
          "AdaBoost": AdaBoostClassifier(),
          "SVC": SVC()}

In [23]:
models.items()

dict_items([('LogisticRegression', LogisticRegression()), ('KNN', KNeighborsClassifier()), ('Decision Tree', DecisionTreeClassifier()), ('Bagging', BaggingClassifier()), ('Random Forest', RandomForestClassifier()), ('AdaBoost', AdaBoostClassifier()), ('SVC', SVC())])

In [24]:
for name, model in models.items():
    model.fit(Z_train, y_train)
    train_score = cross_val_score(model, Z_train, y_train, cv = 5).mean()
    test_score = cross_val_score(model, Z_test, y_test, cv = 5).mean()
    print(f"---{name}--- \nAccuracy score of Train set: {train_score} \nAccuracy score of Test set: {test_score} \n ")

---LogisticRegression--- 
Accuracy score of Train set: 0.6529649595687331 
Accuracy score of Test set: 0.6619946091644205 
 
---KNN--- 
Accuracy score of Train set: 0.6243935309973045 
Accuracy score of Test set: 0.6053908355795148 
 
---Decision Tree--- 
Accuracy score of Train set: 0.5967654986522911 
Accuracy score of Test set: 0.6010781671159029 
 
---Bagging--- 
Accuracy score of Train set: 0.6440700808625337 
Accuracy score of Test set: 0.6415094339622641 
 
---Random Forest--- 
Accuracy score of Train set: 0.666711590296496 
Accuracy score of Test set: 0.6533692722371967 
 
---AdaBoost--- 
Accuracy score of Train set: 0.6820754716981133 
Accuracy score of Test set: 0.6706199460916442 
 
---SVC--- 
Accuracy score of Train set: 0.6675202156334231 
Accuracy score of Test set: 0.660377358490566 
 


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False Positives: People who are not eligible for a 401(k) but whom the model predicts to be eligible.

False Negatives: People who are eligible for a 401(k) but whom the model predicts to be not eligible.
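
To make these definitions concrete, the sketch below counts each outcome on a small set of hypothetical labels (1 = eligible, 0 = not eligible):

```python
# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # eligible, predicted eligible
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not eligible, predicted eligible
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # eligible, predicted not eligible
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not eligible, predicted not eligible

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```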

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

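Since we want to identify potential customers, we would rather minimise false negatives so that our target pool is wider. A false negative is an eligible person we never contact, while a false positive costs only a wasted advertisement.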

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

Recall
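
Recall is TP / (TP + FN). One way to raise it is to lower the probability threshold at which the model predicts "eligible". The sketch below uses hypothetical labels and predicted probabilities:

```python
def recall_at(threshold, y_true, scores):
    """Recall = TP / (TP + FN) when predicting 1 for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return tp / (tp + fn)


# Hypothetical true labels and predicted probabilities of eligibility.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1]

recall_default = recall_at(0.5, y_true, scores)  # 2 of the 4 positives found
recall_lowered = recall_at(0.3, y_true, scores)  # all 4 positives found

print(f"recall at 0.5: {recall_default}")
print(f"recall at 0.3: {recall_lowered}")
```

The lower threshold also flags more ineligible people in this example (two instead of one), which is the precision cost of higher recall.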

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

The f1-score is the harmonic mean of precision and recall, so it is high only when both are reasonably high. This lets us balance reaching a wide pool of eligible people (recall) against focusing our resources on people who really are eligible (precision).
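
The sketch below shows how the harmonic mean penalises a lopsided model. The confusion counts are hypothetical:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts.
balanced = f1_from_counts(tp=60, fp=40, fn=40)   # precision 0.6, recall 0.6
lopsided = f1_from_counts(tp=95, fp=380, fn=5)   # precision 0.2, recall 0.95

print(f"balanced model F1: {balanced:.3f}")
print(f"lopsided model F1: {lopsided:.3f}")
```

The lopsided model has an arithmetic mean of precision and recall of 0.575, yet its F1 is only about 0.33, because the harmonic mean is pulled toward the weaker of the two.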

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [25]:
for name, model in models.items():
    model.fit(Z_train, y_train)
    train_score = cross_val_score(model, Z_train, y_train, scoring = "f1", cv = 5).mean()
    test_score = cross_val_score(model, Z_test, y_test, scoring = "f1", cv = 5).mean()
    print(f"---{name}--- \nF1 score of Train set: {train_score} \nF1 score of Test set: {test_score} \n ")

---LogisticRegression--- 
F1 score of Train set: 0.4702174143023689 
F1 score of Test set: 0.47270402506606224 
 
---KNN--- 
F1 score of Train set: 0.4756248874073982 
F1 score of Test set: 0.4411204372142021 
 
---Decision Tree--- 
F1 score of Train set: 0.4909181665411609 
F1 score of Test set: 0.48338055335661545 
 
---Bagging--- 
F1 score of Train set: 0.49319003365361436 
F1 score of Test set: 0.4405925702272321 
 
---Random Forest--- 
F1 score of Train set: 0.5216029531061317 
F1 score of Test set: 0.5002385925327397 
 
---AdaBoost--- 
F1 score of Train set: 0.5559118153035453 
F1 score of Test set: 0.5433568364114914 
 
---SVC--- 
F1 score of Train set: 0.44212906477520064 
F1 score of Test set: 0.426987518192245 
 


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

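Bagging (0.493 train vs. 0.441 test) and KNN (0.476 vs. 0.441) show the largest drops. The other models' train and test f1-scores are within about 0.02 of each other.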

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

AdaBoost. It has the highest test accuracy (0.671) and the best test f1-score (0.543), with only a small gap between its train and test scores.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

Tune the hyperparameters `n_estimators` (how many weak learners to fit) and `learning_rate` (how much each weak learner contributes).
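
A minimal sketch of that tuning with `GridSearchCV`, scored on f1. The data is synthetic (`make_classification`) and the grid values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the lab's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Arbitrary illustrative grid over the two hyperparameters named above.
param_grid = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.1, 0.5, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_

print("best parameters:", best_params)
print(f"best CV f1:      {best_f1:.3f}")
```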

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.