<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

---

### Let us begin...

Recall the `data science process`.
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

#### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. 

#### When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.neighbors import KNeighborsRegressor,KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor,RandomForestRegressor,AdaBoostRegressor,BaggingClassifier,RandomForestClassifier, ExtraTreesClassifier,AdaBoostClassifier
from sklearn.metrics import mean_squared_error,root_mean_squared_error,accuracy_score,precision_score, recall_score,f1_score
from sklearn.model_selection import train_test_split, cross_val_score,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

In [6]:
df = pd.read_csv('401ksubs.csv')

In [7]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9275 entries, 0 to 9274
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   e401k   9275 non-null   int64  
 1   inc     9275 non-null   float64
 2   marr    9275 non-null   int64  
 3   male    9275 non-null   int64  
 4   age     9275 non-null   int64  
 5   fsize   9275 non-null   int64  
 6   nettfa  9275 non-null   float64
 7   p401k   9275 non-null   int64  
 8   pira    9275 non-null   int64  
 9   incsq   9275 non-null   float64
 10  agesq   9275 non-null   int64  
dtypes: float64(3), int64(8)
memory usage: 797.2 KB


In [9]:
df.describe()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
count,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0
mean,0.392129,39.254641,0.628571,0.20442,41.080216,2.885067,19.071675,0.276226,0.25434,2121.192483,1793.652722
std,0.488252,24.090002,0.483213,0.403299,10.299517,1.525835,63.963838,0.447154,0.435513,3001.469424,895.648841
min,0.0,10.008,0.0,0.0,25.0,1.0,-502.302,0.0,0.0,100.1601,625.0
25%,0.0,21.66,0.0,0.0,33.0,2.0,-0.5,0.0,0.0,469.1556,1089.0
50%,0.0,33.288,1.0,0.0,40.0,3.0,2.0,0.0,0.0,1108.091,1600.0
75%,1.0,50.16,1.0,0.0,48.0,4.0,18.4495,1.0,1.0,2516.0255,2304.0
max,1.0,199.041,1.0,1.0,64.0,13.0,1536.798,1.0,1.0,39617.32,4096.0


In [10]:
df.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

In [11]:
df.corr()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
e401k,1.0,0.268178,0.080843,-0.027641,0.031526,0.012015,0.14395,0.76917,0.118643,0.206618,0.017526
inc,0.268178,1.0,0.362008,-0.069871,0.105638,0.11017,0.376586,0.270833,0.364354,0.940161,0.087305
marr,0.080843,0.362008,1.0,-0.36395,0.059047,0.564814,0.075039,0.085636,0.116925,0.28006,0.0545
male,-0.027641,-0.069871,-0.36395,1.0,-0.120297,-0.320678,-0.018132,-0.024949,-0.036361,-0.053715,-0.116235
age,0.031526,0.105638,0.059047,-0.120297,1.0,-0.030536,0.203906,0.025977,0.238557,0.097584,0.992619
fsize,0.012015,0.11017,0.564814,-0.320678,-0.030536,1.0,-0.031506,0.014296,-0.043629,0.07957,-0.055924
nettfa,0.14395,0.376586,0.075039,-0.018132,0.203906,-0.031506,1.0,0.187392,0.345917,0.407568,0.203703
p401k,0.76917,0.270833,0.085636,-0.024949,0.025977,0.014296,0.187392,1.0,0.153033,0.222113,0.01574
pira,0.118643,0.364354,0.116925,-0.036361,0.238557,-0.043629,0.345917,0.153033,1.0,0.322805,0.233543
incsq,0.206618,0.940161,0.28006,-0.053715,0.097584,0.07957,0.407568,0.222113,0.322805,1.0,0.082991


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

__Answer:__ I think that the household debt and career are going to be helpful to have. 

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

__Answer:__ Putting race into the model is an unethical decision because it might producesbinaccurate estimates and misleadsconclusions. Moreover, it could be viewed aw generating data for purposes that harm people or communities of color.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

__Answer:__ When attempting to predicting income, incsq should not be used since it is the result from squaring the income.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs (Subject Matter Experts) might have done this!

__Answer:__ incsqr and age are variables that have been created through feature engineering. 
Expert hope that the polynomial features might expose interactions or relationship between each features.
This is because income and age may not have straight relations with other variables.
 

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

__Answer:__ In the data dictionary, inc and age should be described as income and age respectively.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

__Answer:__
- a multiple linear regression model: This tactic seems appropriate since these variable might have linear relatiionship between each others.
- a k-nearest neighbors model: Apart from classification,this tactic can also be used for regression tasks by calculating the average or weighted average of the target values of the nearest neighbors.
- a decision tree: It is effective but based on only one tree. This might leads to Overfitting.
- a set of bagged decision trees: bagging typically results in improved accuracy over prediction using a single tree.
- a random forest: : Random forest leverages an ensemble of decision trees, resulting in highly accurate predictions. By aggregating the outputs of multiple trees, it reduces the risk of overfitting and provides robust results
- an Adaboost model: By combining weak classifiers and focusing more on harder-to-classify instances, AdaBoost ensures a high level of prediction precision.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [25]:
reggression_models = {'linear' : LinearRegression(),
         'knn' : KNeighborsRegressor(),
         'tree' : DecisionTreeRegressor(min_samples_split=5,min_samples_leaf=20,random_state=42),
         'bagging' : BaggingRegressor(n_estimators=100,random_state=42),
         'forest' : RandomForestRegressor(min_samples_split=30,random_state=42),
         'adaboost' : AdaBoostRegressor(n_estimators=70,learning_rate=0.1,loss='exponential',random_state=42)}

In [26]:
X = df.drop(columns=['inc','e401k','p401k','pira','incsq'])
y = df[['inc']]
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.fit_transform(X_test)

In [27]:
# create a FUNCTION to score each models
def scoring_models(X_train, X_test, y_train, y_test, models):
    for name,model in reggression_models.items():
        model.fit(X_train, y_train.values.ravel())
        score_train = model.score(X_train, y_train)
        score_test = model.score(X_test, y_test)
        print(f'Model {name} : Train Score is {score_train:.4f} and Test Score is {score_test:.4f}')

In [28]:
scoring_models(X_train_sc, X_test_sc, y_train, y_test, reggression_models)

Model linear : Train Score is 0.2926 and Test Score is 0.2772
Model knn : Train Score is 0.5262 and Test Score is 0.3086
Model tree : Train Score is 0.4817 and Test Score is 0.3486
Model bagging : Train Score is 0.8960 and Test Score is 0.3052
Model forest : Train Score is 0.5682 and Test Score is 0.3759
Model adaboost : Train Score is 0.3542 and Test Score is 0.3225


##### 9. What is bootstrapping?

__Answer:__ Bootstrapping is random sampling with replacement. Instead of building one model on our original sample, we will now build one model on each $B$ sub-samples of size $n$ from sample with replacement.

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

__Answer:__ 
While decision tree make decisions based on only one tree, the bagged decision tree uses bootstrapped populations on multiple trees which is useful when we want to lower the variance.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

__Answer:__ While bagged decision trees uses bootstrapped populations and aggregate the results, Random forest adds a layer of complexity by randomly selecting some features in every split of the tree branch. This prevents the trees from being overly reliant on some features all the time.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

__Answer:__ The advantage of Random forests compared to bagged decision tree model is that it is the superior of bagged decision tree model. Random forests tries to avoid the over-reliance on specific features whcich make it generalize better with lower avriance but higher bias. 

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [38]:
# create a FUNCTION to evaluate each models
def evaluating_models(X_train, X_test, y_train, y_test, models):
    for name,model in reggression_models.items():
        model.fit(X_train, y_train.values.ravel())
        pred_train = model.predict(X_train)
        pred_test = model.predict(X_test)
        rmse_train = root_mean_squared_error(y_train, pred_train)
        rmse_test = root_mean_squared_error(y_test, pred_test)
        print(f'Model {name} : Training RMSE is {rmse_train:.4f} and Testing RMSE is {rmse_test:.4f}')

In [39]:
evaluating_models(X_train_sc, X_test_sc, y_train, y_test, reggression_models)

Model linear : Training RMSE is 20.1642 and Testing RMSE is 20.8648
Model knn : Training RMSE is 16.5021 and Testing RMSE is 20.4062
Model tree : Training RMSE is 17.2588 and Testing RMSE is 19.8079
Model bagging : Training RMSE is 7.7328 and Testing RMSE is 20.4566
Model forest : Training RMSE is 15.7542 and Testing RMSE is 19.3882
Model adaboost : Training RMSE is 19.2659 and Testing RMSE is 20.2003


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

__Answer:__ Overfitting is exist if the RMSE on the training set is lower than on the test set. Based on the result, all model are ovrfitting.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

__Answer:__ I would pick Adaboost based on the r-squared and Testing RMSE that quit close to Trianing RMSE.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

__Answer:__ I think that feature engineering and Gridsearch would help.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

__Answer:__ The disadvantage of using p401k in the model is that p401k is that while we are trying to predict whether or not someone is eligible for a 401k and p401k is a column that mentions whther or not someone is participate a 401k, this would overfit our model.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

__Answer:__ 
- logistic regression model : It is simple, interpretable, and effective for binary classification, especially with linearly separable data.
- k-nearest neighbors model : It is non-parametric and easy to understandcapturing local data patterns 
- decision tree : It is intuitive, interpretable and able to capturing non-linear relationships.
- set of bagged decision trees : It reduces variance and increases stability by averaging multiple decision trees trained on different data samples.
- random forest : It improves accuracy and robustness by combining the predictions of multiple decision trees and reducing overfitting.
- Adaboost model: It boosts performance by sequentially focusing on misclassified examples, improving the accuracy of weak classifiers.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [131]:
classification_models = {'logistic' : LogisticRegression(),
         'knn' : KNeighborsClassifier(),
         'tree' : DecisionTreeClassifier(min_samples_split=5,min_samples_leaf=5,random_state=42),
         'bagging' : BaggingClassifier(n_estimators=20,random_state=42),
         'forest' : RandomForestClassifier(min_samples_split=5,random_state=42),
         'adaboost' : AdaBoostClassifier(random_state=42)}

In [133]:
X = df.drop(columns=['e401k','p401k'])
y = df[['e401k']]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,test_size=0.20, random_state=42)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.fit_transform(X_test)

## Step 5: Evaluate the model. (Part 2: Classfication Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

__Answer:__ 
- False Positive is when we predict that this person would be eligible for a 401 but in reality that person is not.
- False Negative is when we predict that this person would not be eligible for a 401 but in reality that person is.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

__Answer:__ In this specific case, I would rather minimize False Negative. The objective of this plan is to promote a tax-advantaged retirement savings. Then, people with false negative  will loss a opportunity to invest and gain earnings which is against the goal of 401k plan

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

__Answer:__ To minimize False Negative, we would optimize recall.

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

__Answer:__  The F1 score combines precision and recall into a single number that tells us how well a model is performing overall. It finds a balanced middle ground between precision (how accurate our positive predictions are) and recall.

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [162]:
# create a FUNCTION to accuracy score each models
def accuracy_scoring_models(X_train, X_test, y_train, y_test, models):
    for name,model in classification_models.items():
        model.fit(X_train, y_train.values.ravel())
        pred_train = model.predict(X_train)
        pred_test = model.predict(X_test)
        accuracy_train = accuracy_score(y_train['e401k'], pred_train)
        accuracy_test = accuracy_score(y_test['e401k'], pred_test)
        print(f'Model {name} : Training Accuracy Score is {accuracy_train:.4f} and Testing Accuracy Score is {accuracy_test:.4f}')

In [164]:
accuracy_scoring_models(X_train_sc, X_test_sc, y_train, y_test, classification_models)

Model logistic : Training Accuracy Score is 0.6526 and Testing Accuracy Score is 0.6674
Model knn : Training Accuracy Score is 0.7547 and Testing Accuracy Score is 0.6356
Model tree : Training Accuracy Score is 0.8310 and Testing Accuracy Score is 0.5887
Model bagging : Training Accuracy Score is 0.9939 and Testing Accuracy Score is 0.6501
Model forest : Training Accuracy Score is 0.9869 and Testing Accuracy Score is 0.6593
Model adaboost : Training Accuracy Score is 0.6854 and Testing Accuracy Score is 0.6879




In [166]:
# create a FUNCTION to F1_scoring each models
def f1_scoring_models(X_train, X_test, y_train, y_test, models):
    for name,model in classification_models.items():
        model.fit(X_train, y_train.values.ravel())
        pred_test = model.predict(X_test)
        f1_test = f1_score(y_test['e401k'], pred_test)
        print(f'Model {name} : Testing F1-Score is {f1_test:.4f}')

In [168]:
f1_scoring_models(X_train_sc, X_test_sc, y_train, y_test, classification_models)

Model logistic : Testing F1-Score is 0.4854
Model knn : Testing F1-Score is 0.4894
Model tree : Testing F1-Score is 0.4984
Model bagging : Testing F1-Score is 0.5348
Model forest : Testing F1-Score is 0.5447
Model adaboost : Testing F1-Score is 0.5965




##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

__Answer:__  There are evidence of overfitting in Logistic Regression and AdaBoost.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

__Answer:__ AdaBoost seems to have the strongest performance amoungst all the models.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

__Answer:__  Same as the regression model, I think that feature engineering and Gridsearch would help.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

__Answer:__ 
For the Regression model, Adaboost is the best model based on the r-squared and Testing RMSE that quit close to Trianing RMSE. On the other hand, AdaBoost seems to have the strongest performance to predict whether or not one is eligible for a 401k amount all the models.