<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

_author The arbitrary and capricious heart of data science_

---

### Let us begin...

Recall the "data science process."
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

In [34]:
import pandas                as pd
import numpy                 as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing   import StandardScaler
from sklearn.linear_model    import LinearRegression, LogisticRegression
from sklearn.neighbors       import KNeighborsRegressor
from sklearn.tree            import DecisionTreeRegressor
from sklearn.ensemble        import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics         import mean_squared_error, f1_score, classification_report

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [23]:
df = pd.read_csv('401ksubs.csv')

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

1) Other investments of any kind (brokerage accounts, real estate, etc.), since money held elsewhere may reduce what someone puts into a 401(k) or IRA. \
2) Indexed universal life insurance, since some people choose it over a 401(k) or IRA because it can also offer tax advantages.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Using race as a predictor risks biased conclusions: the model would target or exclude people for financial products based on race, which is discriminatory and can reinforce existing inequities.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

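Family size (`fsize`) and marital status (`marr`) might not tell us much about income. \
On the other hand, they are more likely to be related to a household's spending. \
Per the note above, `e401k`, `p401k`, and `pira` are off limits, and `incsq` would leak the target because it is computed directly from `inc`.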

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

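The dataset already contains `agesq` (age squared) and `incsq` (income squared). SMEs likely added them because they expect age and income to have nonlinear effects; for example, income tends to rise with age, peak, and then level off or decline, and a squared term lets a linear model capture that curvature.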
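To illustrate why a squared term can help a linear model, here is a minimal sketch on synthetic data. The concave age/income relationship, the coefficients, and the helper `r_squared` are all assumptions made up for the example, not values taken from the 401(k) dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(25, 64, size=500)
# Hypothetical concave income profile: rises, peaks around age 50, then declines
inc = 10 + 2.0 * age - 0.02 * age**2 + rng.normal(0, 2, size=500)

def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1 - residuals.var() / target.var()

r2_linear = r_squared([age], inc)
r2_quadratic = r_squared([age, age**2], inc)
print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quadratic:.3f}")
```

Because the second model nests the first, its in-sample fit cannot be worse; the interesting part is how large the gap is when the true relationship is curved.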

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

Actually, I noticed two variable descriptions that look wrong: \
`age` is described as age^2 and `inc` as inc^2, although the values themselves look fine.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

| Regression | Details |
| --- | --- |
| Linear Regression | Easy to interpret, explain, and evaluate coefficients |
| Ridge Regression | Improve predictive power by dealing with multicollinearity |
| Lasso Regression | L1 penalty; can shrink coefficients to exactly zero, so it is more aggressive than Ridge and doubles as feature selection |
| ElasticNet Regression | Combination of Lasso & Ridge Regression |

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [24]:
# Check for features
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

#### Set X and y --> train_test_split --> Scale --> Fitting --> Cross Validation Score evaluation

In [25]:
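# Set X and y (e401k, p401k, and pira are left out, per the note above)
X = df[['marr', 'male', 'fsize', 'nettfa', 'agesq']]
# The target used here is incsq (income squared) rather than inc itself
y = df['incsq']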

# Train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

# Scale X_train and X_test
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

# Instantiate the models (default hyperparameters)
lr = LinearRegression()
knn= KNeighborsRegressor()
dt = DecisionTreeRegressor()
bt = BaggingRegressor()
rf = RandomForestRegressor()
ad = AdaBoostRegressor()

# Fitting: linear regression and kNN use the scaled features; the tree-based models use the unscaled ones
lr.fit(X_train_sc, y_train)
knn.fit(X_train_sc, y_train)
dt.fit(X_train, y_train)
bt.fit(X_train, y_train)
rf.fit(X_train, y_train)
ad.fit(X_train, y_train)

In [27]:
# Mean 5-fold cross-validated R^2 on the training set
print(f'Score of Linear Regression:     {cross_val_score(lr, X_train_sc, y_train).mean()}')
print(f'Score of kNN Regression:        {cross_val_score(knn, X_train_sc, y_train).mean()}')
print(f'Score of Decision Trees:       {cross_val_score(dt, X_train, y_train).mean()}')
print(f'Score of Bagged Decision Trees: {cross_val_score(bt, X_train, y_train).mean()}')
print(f'Score of Random Forest:         {cross_val_score(rf, X_train, y_train).mean()}')
print(f'Score of Adaboost:             {cross_val_score(ad, X_train, y_train).mean()}')

Score of Linear Regression:     0.24783806903154507
Score of kNN Regression:        0.21144319227510655
Score of Decision Trees:       -0.24837547506006383
Score of Bagged Decision Trees: 0.20115907813635459
Score of Random Forest:         0.23963076605358635
Score of Adaboost:             -0.2737728166661988


##### 9. What is bootstrapping?

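In machine learning, bootstrapping means repeatedly drawing random samples from our source data with replacement.
Each bootstrap sample has the same number of rows as the original data, but some rows appear more than once and others are left out.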
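A minimal sketch of drawing one bootstrap sample, using only the standard library. The helper name `bootstrap_sample` and the ten stand-in row indices are illustrative assumptions.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
rows = list(range(10))                    # stand-in for ten row indices
sample = bootstrap_sample(rows, rng)

print(sample)                             # same length as rows, typically with repeats
print(sorted(set(rows) - set(sample)))    # rows left out of this sample ("out-of-bag")
```

Calling `bootstrap_sample` again gives a different sample; bagging fits one model per sample.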

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

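A single decision tree is fit once on the full training set. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training set, and averages their predictions (or takes a majority vote for classification). \
Averaging over many trees reduces the variance of a single deep tree.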

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Both fit many trees on bootstrapped samples. A random forest additionally restricts each split to a random subset of the available predictors, whereas bagged trees may consider every predictor at every split.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

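<u>Evolution of Random Forest: Decision Trees --> Bagged Decision Trees --> Random Forest</u> \
A random forest is an enhanced version of bagged decision trees. Bagged trees tend to be highly correlated because every tree can split on the same strong predictors. By restricting each split to a random subset of features, a random forest decorrelates the trees, so averaging them reduces variance more, usually at the cost of only a small increase in bias.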
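In scikit-learn terms, the difference comes down to how many features each split may consider. The sketch below uses synthetic data from `make_regression` and arbitrary settings (`n_estimators=50`, `max_features=3`); these are assumptions for illustration, not the settings used in this lab.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Bagging: each tree sees a bootstrap sample and may split on any of the 8 features.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)

# Random forest: also bootstraps rows, but each split considers only a random
# subset of features (here 3 of 8), which decorrelates the trees.
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=42)

bagged.fit(X, y)
forest.fit(X, y)
print(len(bagged.estimators_), len(forest.estimators_))
print(forest.estimators_[0].max_features)
```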


## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [28]:
import warnings
warnings.filterwarnings('ignore')

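# Predict on X_train and X_test and report RMSE for each model
# Caution: dt, bt, rf, and ad were fit on the unscaled X_train, yet the calls below
# pass the scaled matrices, so their scores reflect that mismatch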
y_lrtr= lr.predict(X_train_sc)
print(f'Linear Regression Train Score:     {mean_squared_error(y_train, y_lrtr, squared=False)}')
y_lrtt= lr.predict(X_test_sc)
print(f'Linear Regression Test Score:      {mean_squared_error(y_test, y_lrtt, squared=False)}')
print('') # =======================================================================================#
y_knntr= knn.predict(X_train_sc)
print(f'kNN Train Score:                   {mean_squared_error(y_train, y_knntr, squared=False)}')
y_knntt= knn.predict(X_test_sc) 
print(f'kNN Test Score:                    {mean_squared_error(y_test, y_knntt, squared=False)}')
print('') # =======================================================================================#
y_dttr= dt.predict(X_train_sc)
print(f'Decision Trees Train Score:        {mean_squared_error(y_train, y_dttr, squared=False)}')
y_dttt= dt.predict(X_test_sc)
print(f'Decision Trees Test Score:         {mean_squared_error(y_test, y_dttt, squared=False)}')
print('') # =======================================================================================#
y_bttr= bt.predict(X_train_sc)
print(f'Bagged Decision Trees Train Score: {mean_squared_error(y_train, y_bttr, squared=False)}')
y_bttt= bt.predict(X_test_sc)
print(f'Bagged Decision Trees Test Score:  {mean_squared_error(y_test, y_bttt, squared=False)}')
print('') # =======================================================================================#
y_rftr= rf.predict(X_train_sc)
print(f'Random Forest Train Score:         {mean_squared_error(y_train, y_rftr, squared=False)}')
y_rftt= rf.predict(X_test_sc)
print(f'Random Forest Test Score:          {mean_squared_error(y_test, y_rftt, squared=False)}')
print('') # =======================================================================================#
y_adtr= ad.predict(X_train_sc)
print(f'Adaboost Train Score:              {mean_squared_error(y_train, y_adtr, squared=False)}')
y_adtt= ad.predict(X_test_sc)
print(f'Adaboost Test Score:               {mean_squared_error(y_test, y_adtt, squared=False)}')

Linear Regression Train Score:     2577.458304052816
Linear Regression Test Score:      2771.707049080118

kNN Train Score:                   2149.5263402618393
kNN Test Score:                    2696.482628491796

Decision Trees Train Score:        3392.8471168314013
Decision Trees Test Score:         3438.9915313023803

Bagged Decision Trees Train Score: 3056.308784682495
Bagged Decision Trees Test Score:  3123.889431801379

Random Forest Train Score:         3039.0477405417614
Random Forest Test Score:          3110.5548520442835

Adaboost Train Score:              2821.690140101233
Adaboost Test Score:               2909.166356005474
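A compact way to run this kind of train/test RMSE comparison is to wrap any scaling inside a pipeline, so every model always receives the same representation it was fit on. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` instead of the 401(k) dataframe, only three of the six models, and a hypothetical `rmse` helper built on `np.sqrt` rather than the `squared=False` argument.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale-sensitive models get a scaler in their pipeline; the tree does not need one.
models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "kNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (rmse(y_train, model.predict(X_train)),
                    rmse(y_test, model.predict(X_test)))
    print(f"{name:<18} train RMSE: {scores[name][0]:8.2f}  test RMSE: {scores[name][1]:8.2f}")
```

Because each entry in `models` is fit and scored on the same raw `X_train` and `X_test`, there is no way to hand a model scaled data when it was fit on unscaled data, or the reverse.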


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

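Overfitting shows up as a training RMSE that is much lower than the testing RMSE. \
By that measure kNN shows the clearest sign (train ≈ 2150 vs. test ≈ 2696); for the other models the testing RMSE is only slightly higher than the training RMSE (roughly 1-8%). The tree-based scores should be read with caution, because those models were fit on unscaled features but scored on scaled ones.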

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would pick the linear regression model: its train and test RMSEs are close (no strong sign of overfitting), its train-test gap is smaller than kNN's, and, most of all, it is easy to explain to a non-technical audience.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1) Grid search over hyperparameters (for example, regularization strength) \
2) Polynomial and interaction features beyond the existing squared terms
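Both ideas can be combined in one search. The sketch below is only an illustration on synthetic data: it swaps in `Ridge` (a regularized linear regression) so there is a hyperparameter to tune, and the step names, grid values, and scoring choice are assumptions rather than anything fitted in this lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),  # squared and interaction terms
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Search over polynomial degree and regularization strength together.
param_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rmse = -search.best_score_  # the scorer is negated RMSE, so flip the sign
print(search.best_params_)
print(f"Best cross-validated RMSE: {best_rmse:.2f}")
```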

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

Anyone with `p401k == 1` participates in a 401(k), so they must also be eligible (`e401k == 1`). \
Including it is almost the same as putting the target variable among the predictors: it leaks the answer, and the model would not generalize to prospects whose participation we do not know.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

-  Logistic Regression:  An appropriate tactic, especially since its coefficients can be interpreted.
-  KNearest Neighbors:  An appropriate tactic, as it can be used for Classification purposes.
-  Decision Trees:  An appropriate tactic, as it can be used for Classification purposes.
-  Bagged Decision Trees:  An appropriate tactic, as it can be used for Classification purposes.
-  Random Forest:  An appropriate tactic, as it can be used for Classification purposes.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [46]:
# Check for features
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [79]:
# Set X and y
X = df[['marr', 'male', 'age', 'fsize', 'nettfa', 'pira', 'incsq', 'agesq']]
y = df['e401k']

# Train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

# Scale X_train and X_test
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

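# Instantiate the models
# Note: apart from LogisticRegression these are regressors, so they predict
# continuous values rather than class labels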
logr = LogisticRegression()
knn  = KNeighborsRegressor()
dt   = DecisionTreeRegressor()
bt   = BaggingRegressor()
rf   = RandomForestRegressor()
ad   = AdaBoostRegressor()

# Fitting
logr.fit(X_train_sc, y_train)
knn.fit(X_train_sc, y_train)
dt.fit(X_train, y_train)
bt.fit(X_train, y_train)
rf.fit(X_train, y_train)
ad.fit(X_train, y_train)

In [84]:
# Mean 5-fold cross-validation score on the training set
# (accuracy for logr, R^2 for the regressors)
print(f'Score of Logistic Regression:   {cross_val_score(logr, X_train_sc, y_train).mean()}')
print(f'Score of kNN Regression:       {cross_val_score(knn, X_train_sc, y_train).mean()}')
print(f'Score of Decision Trees:       {cross_val_score(dt, X_train, y_train).mean()}')
print(f'Score of Bagged Decision Trees: {cross_val_score(bt, X_train, y_train).mean()}')
print(f'Score of Random Forest:         {cross_val_score(rf, X_train, y_train).mean()}')
print(f'Score of Adaboost:              {cross_val_score(ad, X_train, y_train).mean()}')

Score of Logistic Regression:   0.6434732310336564
Score of kNN Regression:       -0.041498108325827586
Score of Decision Trees:       -0.6907686811561575
Score of Bagged Decision Trees: 0.0033380539367735772
Score of Random Forest:         0.07533154408726442
Score of Adaboost:              0.11666540033396107


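- Only logistic regression is a classifier here, so its score is accuracy; the other five models are regressors, so their scores are $R^2$ and are not directly comparable with it.
- The $R^2$ values for kNN, decision tree, and bagged decision trees are near or below zero, which indicates a poor fit.
- Logistic regression is the best choice in this case.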

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

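Predicted eligible, actually eligible (`e401k == 1`) --> TP \
Predicted not eligible, actually not eligible (`e401k == 0`) --> TN \
Predicted eligible, actually not eligible (`e401k == 0`) --> FP \
Predicted not eligible, actually eligible (`e401k == 1`) --> FN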
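A small worked example makes the four outcomes concrete. The eight labels below are made up for illustration (1 = eligible for a 401(k), 0 = not eligible).

```python
# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # eligible, predicted eligible
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # not eligible, predicted not eligible
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # not eligible, predicted eligible
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # eligible, predicted not eligible

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```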

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

Minimizing false negatives (Type II errors) is probably better than minimizing false positives (Type I errors) in this case.

A false negative is an eligible person whom the model labels as not eligible, so a potential customer is never identified.

A false positive only means some outreach is spent on a person who turns out not to be eligible, which is usually the cheaper mistake.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

The false negative rate (miss rate) is the probability that an actual positive will be missed by the model.

$FNR = \frac{FN}{FN+TP}$

Sensitivity (recall) is $\frac{TP}{TP+FN} = 1 - FNR$, so maximizing sensitivity \
minimizes the false negative rate and reduces Type II errors.
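One common way to trade false negatives for false positives is to lower the probability threshold used to call someone "eligible." The sketch below uses made-up labels and predicted probabilities; `recall_at` is a hypothetical helper, not part of any library.

```python
# Hypothetical true labels and predicted probabilities of being eligible
actual = [1, 1, 1, 1, 0, 0, 0, 0]
proba  = [0.9, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1]

def recall_at(threshold):
    """Sensitivity (recall) when predicting 1 for probabilities >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in proba]
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp / (tp + fn)

for threshold in (0.5, 0.3):
    recall = recall_at(threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, FNR={1 - recall:.2f}")
# threshold=0.5: recall=0.50, FNR=0.50
# threshold=0.3: recall=1.00, FNR=0.00
```

Lowering the threshold catches every eligible person in this toy example, but it also turns two of the four ineligible people into false positives, which is the cost of optimizing for sensitivity alone.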


##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

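The F1 score is the harmonic mean of precision and recall, and it can be written in terms of true positives, false positives, and false negatives:

$F_1 = \frac{2TP}{2TP+FP+FN}$

Because FP and FN both appear in the denominator with equal weight, a model only scores well if it keeps both kinds of error low, which is what we want when balancing them.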
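As a quick check that the count-based formula agrees with the harmonic mean of precision and recall, here is a worked example with made-up counts (`tp`, `fp`, and `fn` are illustrative values, not results from this lab).

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 3, 1, 2

precision = tp / (tp + fp)            # 0.75
recall = tp / (tp + fn)               # 0.60

f1_harmonic = 2 * precision * recall / (precision + recall)
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"F1 (harmonic mean)={f1_harmonic:.4f}  F1 (counts)={f1_counts:.4f}")
```

Both expressions give 6/9, about 0.667, which sits closer to the lower of precision and recall than a plain average would.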

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [96]:
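# Predict on the train and test sets and compute F1
# The regressors return continuous values, so their predictions are rounded to 0/1 first
# Caution: bt, rf, and ad were fit on the unscaled X_train but are given scaled data here,
# and the decision tree (dt) is not evaluated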
y_logr_train = logr.predict(X_train_sc)
y_logr_test = logr.predict(X_test_sc)
print(f'F1 LogisticRegression train:    {f1_score(y_train, y_logr_train)}')
print(f'F1 LogisticRegression test:     {f1_score(y_test, y_logr_test)}')
print('') # =======================================================================================#
y_knn_train = knn.predict(X_train_sc)
y_knn_test  = knn.predict(X_test_sc)
print(f'F1 kNN Regression train:        {f1_score(y_train, y_knn_train.round())}')
print(f'F1 kNN Regression test:         {f1_score(y_test, y_knn_test.round())}')
print('') # =======================================================================================#
y_bt_train = bt.predict(X_train_sc)
y_bt_test = bt.predict(X_test_sc)
print(f'F1 Bagged Decision Trees train: {f1_score(y_train, y_bt_train.round())}')
print(f'F1 Bagged Decision Trees test:  {f1_score(y_test, y_bt_test.round())}')
print('') # =======================================================================================#
y_rf_train = rf.predict(X_train_sc)
y_rf_test = rf.predict(X_test_sc)
print(f'F1 Random Forest train:         {f1_score(y_train, y_rf_train.round())}')
print(f'F1 Random Forest test:          {f1_score(y_test, y_rf_test.round())}')
print('') # =======================================================================================#
y_ad_train = ad.predict(X_train_sc)
y_ad_test = ad.predict(X_test_sc)
print(f'F1 Adaboost train:              {f1_score(y_train, y_ad_train.round())}')
print(f'F1 Adaboost test:               {f1_score(y_test, y_ad_test.round())}')
print('') # =======================================================================================#

F1 LogisticRegression train:    0.3342303552206674
F1 LogisticRegression test:     0.3450531479967293

F1 kNN Regression train:        0.6562998405103669
F1 kNN Regression test:         0.510353227771011

F1 Bagged Decision Trees train: 0.4474650991917708
F1 Bagged Decision Trees test:  0.43197026022304835

F1 Random Forest train:         0.06662087912087912
F1 Random Forest test:          0.07135362014690451

F1 Adaboost train:              0.006531204644412192
F1 Adaboost test:               0.00887902330743618
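For comparison, here is a sketch of the same kind of F1 evaluation using scikit-learn's classifier counterparts, which return class labels directly so no rounding is needed. Synthetic data from `make_classification` stands in for the 401(k) dataframe, and every setting shown is an assumption for illustration, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# Scaling lives inside the pipelines, so every model is fit and scored on the same inputs.
classifiers = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bagged Trees": BaggingClassifier(DecisionTreeClassifier(), random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

f1_scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    f1_scores[name] = (f1_score(y_train, clf.predict(X_train)),
                       f1_score(y_test, clf.predict(X_test)))
    print(f"F1 {name:<20} train: {f1_scores[name][0]:.3f}  test: {f1_scores[name][1]:.3f}")
```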



##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

Only one model shows clear overfitting: k-nearest neighbors, whose F1 score drops from about 0.66 on the training data to about 0.51 on the testing data.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

Based on F1 score, I would pick k-nearest neighbors: even though it overfits, it has the highest testing F1 score (about 0.51).

An F1 of 1 indicates perfect precision and recall, and kNN's F1 is the closest to 1 among the models.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

- I would love to try stacking models.
- SMOTE might help too, if the two classes turn out to be imbalanced.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

My final answers for this case are:
- Regression: Linear Regression
- Classification: K-nearest neighbors