## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the account holds a lot more money by the time you retire.
- These differ from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv('401ksubs.csv')

In [9]:
df.head(3)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

**Answer** 
* 1. How many years the person has worked.
* 2. Highest education level.
* 3. Whether the person has any children.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

* **Answer** I don't think someone's race or color should be a predictor. If two people are identical in every other respect, the model should treat them the same way.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

* **Answer** `incsq`: this feature is just income squared, so it is basically the target itself.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

* **Answer**: `incsq` and `agesq`. My guess is that SMEs created these to magnify the difference between two incomes or two ages, which also lets a linear model capture a curved (non-linear) relationship with those variables.
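
One possible reason for engineering a squared term is that it lets a linear model fit a curved relationship. A minimal sketch on synthetic data follows; the income-by-age formula and all variable names are assumptions made for illustration, not taken from the 401(k) dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: "income" peaks around age 50
age = rng.uniform(25, 65, size=500)
income = 60 - 0.05 * (age - 50) ** 2 + rng.normal(0, 2, size=500)

# Model 1 uses age only; model 2 uses age plus age squared
X_linear = age.reshape(-1, 1)
X_squared = np.column_stack([age, age ** 2])

r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)
r2_squared = LinearRegression().fit(X_squared, income).score(X_squared, income)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_squared:.3f}")
```

The model with the squared term should fit this curved data noticeably better, since a straight line in `age` alone cannot bend.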


##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

* **Answer** I think there are two errors, in `inc` and `age`. Their descriptions say they are the squares of income and age, but I think they should be plain income and age, without squaring. The unit for `inc` should also be $1000s.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor,ExtraTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR

**Answer:**
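* 1) Linear Regression: Yes. The coefficients are easy to interpret.
* 2) KNN: No. The model has no coefficients or feature importances, so it can't tell us which features matter.
* 3) Decision Tree & Extra Tree: Yes. A single tree's splits can be read directly.
* 4) Bagging: Partly. Each tree can be inspected, but the averaged ensemble is harder to interpret than a single tree.
* 5) Random Forest: Partly. Feature importances are available, but the ensemble itself is hard to read.
* 6) AdaBoost: Partly. Feature importances are available, but the ensemble itself is hard to read.
* 7) SVR: Partly. Only a linear kernel yields coefficients; the default RBF kernel does not.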

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(["e401k", "p401k", "pira", "inc", "incsq"], axis=1)
y = df['inc']

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=3)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


# OLS
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# knn
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

# decision tree
dc = DecisionTreeRegressor()
dc.fit(X_train, y_train)

# bagging
bg = BaggingRegressor()
bg.fit(X_train, y_train)

# random forest
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# ada boost
adb = AdaBoostRegressor()
adb.fit(X_train, y_train)

# SVR
svr = SVR()
svr.fit(X_train, y_train)


SVR()
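
The pipeline approach suggested in the prompt could be sketched roughly as follows. This is a sketch under assumptions: it uses synthetic stand-in data from `make_regression` rather than the lab's dataframe, only three of the seven models, and hypothetical names (`X_demo`, `candidates`, `test_rmse`).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in the lab's X and y)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3)

candidates = {
    "linear regression": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "random forest": RandomForestRegressor(random_state=3),
}

test_rmse = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline, so it is fit on the training data only
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipe.fit(X_tr, y_tr)
    test_rmse[name] = mean_squared_error(y_te, pipe.predict(X_te)) ** 0.5

for name, score in test_rmse.items():
    print(f"{name}: test RMSE = {score:.2f}")
```

Because every model shares the same split and the same preprocessing step, the test scores are directly comparable.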

##### 9. What is bootstrapping?

**Answer**:
> Bootstrapping is a resampling method. We repeatedly draw samples from the data we have, with replacement and each the same size as the original, and compute a statistic (or build a model) on each sample. From the spread of the results across all the samples, we can estimate the accuracy of the statistic, such as its standard error.
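
A minimal sketch of bootstrapping the standard error of a mean, assuming a hypothetical sample drawn from a normal distribution (the sample, its size, and the number of resamples are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of 200 observations
sample = rng.normal(loc=50, scale=10, size=200)

# Resample WITH replacement, same size as the original, many times
n_boot = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# The spread of the bootstrap means estimates the standard error of the mean
boot_se = boot_means.std(ddof=1)
formula_se = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"bootstrap SE: {boot_se:.3f}")
print(f"formula SE:   {formula_se:.3f}")
```

For the mean, the bootstrap estimate should land close to the textbook formula `s / sqrt(n)`; the value of the bootstrap is that the same recipe works for statistics that have no such formula.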

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

**Answer**:
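> A single decision tree is fit once on the full training set and suffers from high variance. A set of bagged decision trees fits n trees, each on a different bootstrapped resample of the training data, and combines their predictions (averaging for regression, voting for classification), which gives a relatively smaller variance.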
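
To make the contrast concrete, here is a sketch of bagging done by hand and compared with a single tree. The noisy sine-wave data and the choice of 50 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic noisy data (an assumption made for illustration)
X_tr = rng.uniform(0, 6, size=(400, 1))
y_tr = np.sin(X_tr[:, 0]) + rng.normal(0, 0.5, size=400)
X_te = rng.uniform(0, 6, size=(400, 1))
y_te = np.sin(X_te[:, 0]) + rng.normal(0, 0.5, size=400)

# One fully grown tree, fit once on the full training set
single = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
single_mse = mean_squared_error(y_te, single.predict(X_te))

# Bagging by hand: one tree per bootstrap resample, then average the predictions
n_trees = 50
all_preds = []
for i in range(n_trees):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # sample rows with replacement
    tree = DecisionTreeRegressor(random_state=i).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))
bagged_mse = mean_squared_error(y_te, np.mean(all_preds, axis=0))

print(f"single tree test MSE:  {single_mse:.3f}")
print(f"bagged trees test MSE: {bagged_mse:.3f}")
```

Each individual tree still overfits its own resample; averaging is what smooths out the noise.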

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

**Answer**:
> In a random forest, we build a number of decision trees on bootstrapped training samples, as we did in bagging, but at each split a random sample of m predictors is chosen as split candidates from the full set of p predictors. Bagged trees consider all p predictors at every split.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

**Answer**:
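> For example, if there is one very strong predictor in our training dataset and we build a number of decision trees on bootstrapped training samples, most of those bagged trees will split on that predictor first and so will be highly correlated. Averaging many highly correlated quantities doesn't reduce the variance much. A random forest restricts each split to a random subset of predictors, which decorrelates the trees, so averaging them reduces variance more.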
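
The correlation argument can be illustrated by checking which feature each tree uses for its first split. This sketch assumes synthetic data with one dominant feature, and it uses `RandomForestRegressor` with `max_features=None` as a stand-in for bagged trees, since that setting considers every feature at every split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data (an assumption): feature 0 is a much stronger predictor than the rest
X_demo = rng.normal(size=(500, 4))
y_demo = (10 * X_demo[:, 0] + X_demo[:, 1] + X_demo[:, 2] + X_demo[:, 3]
          + rng.normal(size=500))

def share_of_roots_on_feature_0(forest):
    """Fraction of trees whose first (root) split uses feature 0."""
    roots = np.array([tree.tree_.feature[0] for tree in forest.estimators_])
    return float(np.mean(roots == 0))

# max_features=None considers every feature at every split, which is bagging
bagged = RandomForestRegressor(n_estimators=100, max_features=None, random_state=1)
bagged.fit(X_demo, y_demo)

# max_features=1 considers a single randomly chosen feature at each split
forest = RandomForestRegressor(n_estimators=100, max_features=1, random_state=1)
forest.fit(X_demo, y_demo)

bagged_share = share_of_roots_on_feature_0(bagged)
forest_share = share_of_roots_on_feature_0(forest)
print(f"bagged trees splitting on feature 0 first:  {bagged_share:.2f}")
print(f"random forest splitting on feature 0 first: {forest_share:.2f}")
```

When nearly every bagged tree starts with the same split, the trees look alike; the random forest is forced to start from different features, which makes its trees more diverse.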

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [12]:
from sklearn.metrics import mean_squared_error

# rmse for training
linreg_rmse_train = mean_squared_error(y_train, linreg.predict(X_train)) ** (1/2)
knn_rmse_train = mean_squared_error(y_train, knn.predict(X_train)) ** (1/2)
dc_rmse_train = mean_squared_error(y_train, dc.predict(X_train)) ** (1/2)
bg_rmse_train = mean_squared_error(y_train, bg.predict(X_train)) ** (1/2)
rf_rmse_train = mean_squared_error(y_train, rf.predict(X_train)) ** (1/2)
adb_rmse_train = mean_squared_error(y_train, adb.predict(X_train)) ** (1/2)
svr_rmse_train = mean_squared_error(y_train, svr.predict(X_train)) ** (1/2)


# rmse for testing
linreg_pred = linreg.predict(X_test)
knn_pred = knn.predict(X_test)
dc_pred = dc.predict(X_test)
bg_pred = bg.predict(X_test)
rf_pred = rf.predict(X_test)
adb_pred = adb.predict(X_test)
svr_pred = svr.predict(X_test)

linreg_rmse_test = mean_squared_error(y_test, linreg_pred) ** (1/2)
knn_rmse_test = mean_squared_error(y_test, knn_pred) ** (1/2)
dc_rmse_test = mean_squared_error(y_test, dc_pred) ** (1/2)
bg_rmse_test = mean_squared_error(y_test, bg_pred) ** (1/2)
rf_rmse_test = mean_squared_error(y_test, rf_pred) ** (1/2)
adb_rmse_test = mean_squared_error(y_test, adb_pred) ** (1/2)
svr_rmse_test = mean_squared_error(y_test, svr_pred) ** (1/2)

rmse = {'Methods':['linear regression','knn','Decision Tree','Bagging Decision Tree','Random Forest','Ada Boost','SVR'],
        'RMSE of Training':[linreg_rmse_train, knn_rmse_train, dc_rmse_train, bg_rmse_train, rf_rmse_train, adb_rmse_train, svr_rmse_train],
        'RMSE of Testing': [linreg_rmse_test, knn_rmse_test, dc_rmse_test, bg_rmse_test, rf_rmse_test, adb_rmse_test, svr_rmse_test]}

rmse = pd.DataFrame(rmse)

rmse

Unnamed: 0,Methods,RMSE of Training,RMSE of Testing
0,linear regression,20.238046,20.581316
1,knn,16.43783,20.191135
2,Decision Tree,2.351666,26.155874
3,Bagging Decision Tree,8.686642,20.957117
4,Random Forest,7.696809,20.329514
5,Ada Boost,24.456444,24.621227
6,SVR,19.900798,20.45483


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [15]:
rmse.loc[rmse['RMSE of Testing'] > rmse['RMSE of Training'],:]

Unnamed: 0,Methods,RMSE of Training,RMSE of Testing
0,linear regression,20.238046,20.581316
1,knn,16.43783,20.191135
2,Decision Tree,2.351666,26.155874
3,Bagging Decision Tree,8.686642,20.957117
4,Random Forest,7.696809,20.329514
5,Ada Boost,24.456444,24.621227
6,SVR,19.900798,20.45483


**Answer**:
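> Test RMSE is higher than training RMSE for every model, so all of them overfit to some degree. The gap is severe for the decision tree, the bagged trees, and the random forest, moderate for KNN, and negligible for linear regression, AdaBoost, and SVR.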
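
A simple comparison of testing and training RMSE flags every model, so it can help to look at the size of the gap instead. The sketch below recomputes it from the RMSE values reported in the table for question 13; the `Test / Train` column name is a hypothetical label.

```python
import pandas as pd

# RMSE values copied from the results table for question 13
rmse_table = pd.DataFrame({
    "Methods": ["linear regression", "knn", "Decision Tree", "Bagging Decision Tree",
                "Random Forest", "Ada Boost", "SVR"],
    "RMSE of Training": [20.238046, 16.43783, 2.351666, 8.686642,
                         7.696809, 24.456444, 19.900798],
    "RMSE of Testing": [20.581316, 20.191135, 26.155874, 20.957117,
                        20.329514, 24.621227, 20.45483],
})

# A ratio near 1 means train and test errors agree; a large ratio signals overfitting
rmse_table["Test / Train"] = rmse_table["RMSE of Testing"] / rmse_table["RMSE of Training"]

print(rmse_table.sort_values("Test / Train", ascending=False).round(2))
```

Sorting by the ratio puts the decision tree at the top, with linear regression, AdaBoost, and SVR all close to 1.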

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [18]:
rmse.sort_values('RMSE of Testing')

Unnamed: 0,Methods,RMSE of Training,RMSE of Testing
1,knn,16.43783,20.191135
4,Random Forest,7.696809,20.329514
6,SVR,19.900798,20.45483
0,linear regression,20.238046,20.581316
3,Bagging Decision Tree,8.686642,20.957117
5,Ada Boost,24.456444,24.621227
2,Decision Tree,2.351666,26.155874


**Answer**:
> I would choose linear regression. It has a relatively low RMSE on the testing data, it barely overfits, and linear regression models can be easily interpreted.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer**:
> 1. I would try polynomial features
> 2. I would transform the target (Y), for example with a log transform.
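
Both ideas can be combined in one scikit-learn estimator. The sketch below is one possible setup, not a tuned model: it assumes synthetic stand-in data with a positive, right-skewed target, and a log transform (`np.log1p`) as the choice of target transformation.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)

# Synthetic stand-in data (an assumption): a positive, right-skewed target
X_demo = rng.normal(size=(300, 3))
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] ** 2
                + rng.normal(0, 0.1, size=300))

# Polynomial features: squares and pairwise interactions of the scaled inputs
regressor = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# Target transform: fit on log(1 + y), then map predictions back with exp(.) - 1
model = TransformedTargetRegressor(
    regressor=regressor, func=np.log1p, inverse_func=np.expm1
)
model.fit(X_demo, y_demo)

preds = model.predict(X_demo)
print(preds[:5])
```

Wrapping the transform in `TransformedTargetRegressor` means predictions come back on the original scale, so RMSE stays comparable with the other models.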

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

**Answer**:
> We're trying to figure out why someone is eligible for a 401(k). If someone already participates in a 401(k) (which `p401k` tells us), then he/she must be eligible. This variable would make the model's scores better, but it gives away the answer and doesn't help us figure out why.
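
One way to check that participation implies eligibility is a cross-tabulation. The sketch below uses a tiny hypothetical dataframe that borrows the lab's column names; on the real data the equivalent call would be `pd.crosstab(df['p401k'], df['e401k'])`.

```python
import pandas as pd

# Tiny hypothetical sample using the lab's column names
demo = pd.DataFrame({
    "e401k": [1, 1, 1, 0, 0, 1, 0, 1],
    "p401k": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rows: participation, columns: eligibility
print(pd.crosstab(demo["p401k"], demo["e401k"]))

# Participants who are NOT eligible; empty if participation implies eligibility
impossible = demo[(demo["p401k"] == 1) & (demo["e401k"] == 0)]
print(len(impossible))
```

An empty `impossible` frame means `p401k == 1` fully determines the target for those rows, which is the leakage concern described in the answer.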

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

**Answer**:
- 1) Logistic regression: yes.
- 2) KNN: yes
- 3) Decision Tree: yes.
- 4) Bagged Decision Tree: yes
- 5) Random Forest: yes
- 6) AdaBoost & XGBoost: yes
- 7) SVM: yes

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

In [29]:
X = df.drop(["e401k", "p401k"], axis=1)
y = df["e401k"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 3)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [34]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

knn_cat = KNeighborsClassifier()
knn_cat.fit(X_train, y_train)

dc_cat = DecisionTreeClassifier()
dc_cat.fit(X_train, y_train)

bg_cat = BaggingClassifier()
bg_cat.fit(X_train, y_train)

rf_cat = RandomForestClassifier()
rf_cat.fit(X_train, y_train)

ad_cat = AdaBoostClassifier()
ad_cat.fit(X_train, y_train)

sv_cat = SVC()
sv_cat.fit(X_train, y_train)

SVC()

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

**Answer**:
- False Positive: The true value is negative (not eligible for a 401(k)), but we predict it's positive (eligible).
- False Negative: The true value is positive (eligible for a 401(k)), but we predict it's negative (not eligible).
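
These two error types can be counted with scikit-learn's `confusion_matrix`. The labels below are hypothetical, chosen only to make the counts easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"false positives (predicted eligible, actually not): {fp}")
print(f"false negatives (predicted not eligible, actually eligible): {fn}")
```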

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

**Answer**:
- I would prefer to minimize false negatives, because I want more eligible people to get their 401(k). I'm a nice guy and I know a 401(k) is good for people.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

**Answer**:
- To minimize false negatives, we need to maximize recall, TP / (TP + FN).
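
As a quick check of the formula, recall can be computed by hand and compared with scikit-learn's `recall_score`. The labels are hypothetical.

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = eligible, 0 = not eligible
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count true positives and false negatives directly
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

recall_by_hand = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print(recall_by_hand, recall_sklearn)
```

Every false negative that is removed moves recall closer to 1, which is why recall is the metric to watch here.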

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

**Answer**
> Because the F1 score is the harmonic mean of precision and recall, it is only high if both recall and precision are high.
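
A small worked example shows why the harmonic mean punishes an imbalance between the two. The precision and recall values are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A hypothetical model with perfect precision but poor recall
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1(precision, recall)

print(f"arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1 (harmonic):   {harmonic_mean:.2f}")
```

The arithmetic mean (0.55) makes this model look mediocre, while F1 (about 0.18) reflects that it misses most of the positive cases.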

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [31]:
from sklearn.metrics import f1_score

In [36]:
# For training
logreg_train = f1_score(y_train, logreg.predict(X_train))
knn_cat_train  = f1_score(y_train, knn_cat.predict(X_train))
dc_cat_train = f1_score(y_train, dc_cat.predict(X_train))
bg_cat_train = f1_score(y_train, bg_cat.predict(X_train))
rf_cat_train = f1_score(y_train, rf_cat.predict(X_train))
ad_cat_train = f1_score(y_train, ad_cat.predict(X_train))
sv_cat_train = f1_score(y_train, sv_cat.predict(X_train))

# For testing
logreg_f1_test = f1_score(y_test, logreg.predict(X_test))
knn_cat_test  = f1_score(y_test, knn_cat.predict(X_test))
dc_cat_test = f1_score(y_test, dc_cat.predict(X_test))
bg_cat_test = f1_score(y_test, bg_cat.predict(X_test))
rf_cat_test = f1_score(y_test, rf_cat.predict(X_test))
ad_cat_test = f1_score(y_test, ad_cat.predict(X_test))
sv_cat_test = f1_score(y_test, sv_cat.predict(X_test))

result_cat = pd.DataFrame({
    'Methods':['Logistic Regression','KNN','Decision Tree','Bagging','Random Forest','AdaBoost','SVM'],
    'F1 Training':[logreg_train,knn_cat_train ,dc_cat_train,bg_cat_train,rf_cat_train,ad_cat_train,sv_cat_train],
    'F1 Testing':[logreg_f1_test,knn_cat_test,dc_cat_test,bg_cat_test,rf_cat_test,ad_cat_test,sv_cat_test]
})

result_cat

Unnamed: 0,Methods,F1 Training,F1 Testing
0,Logistic Regression,0.474569,0.485089
1,KNN,0.649651,0.4994
2,Decision Tree,1.0,0.490463
3,Bagging,0.965826,0.476433
4,Random Forest,1.0,0.53146
5,AdaBoost,0.573137,0.557598
6,SVM,0.46165,0.458969


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

**Answer**:
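Yes, the models listed below score lower on the testing data than on the training data, which suggests overfitting. The gap is largest for the decision tree, bagging, and random forest, and negligible for SVM.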

In [38]:
result_cat.loc[result_cat['F1 Testing'] < result_cat['F1 Training']]

Unnamed: 0,Methods,F1 Training,F1 Testing
1,KNN,0.649651,0.4994
2,Decision Tree,1.0,0.490463
3,Bagging,0.965826,0.476433
4,Random Forest,1.0,0.53146
5,AdaBoost,0.573137,0.557598
6,SVM,0.46165,0.458969


##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

**Answer**:
> I prefer to use AdaBoost. First, among the models that overfit, it has the second-smallest gap between training and testing F1. Second, it has the highest testing F1 score.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer**:
> 1) Using GridSearch to find the best hyperparameters.

> 2) Using polynomial features.
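
A grid search over AdaBoost hyperparameters might look like the sketch below. It assumes synthetic stand-in data from `make_classification`, and the grid values are arbitrary choices for illustration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; swap in the lab's X and y)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3)

# Small, arbitrary grid chosen for illustration
param_grid = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.1, 0.5, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=3),
    param_grid,
    scoring="f1",  # same metric used to compare the classifiers in this lab
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```

The search only ever sees the training split, so the held-out test set remains a fair final check.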

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [138]:
X_regression = df.drop(["e401k", "p401k", "pira", "inc", "incsq"], axis=1)
print(X_regression.columns)
print(linreg.coef_)

Index(['marr', 'male', 'age', 'fsize', 'nettfa', 'agesq'], dtype='object')
[ 10.24595981   1.33508397  32.8927121   -3.39214151   7.99810811
 -32.61084542]


**Answer**:
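> `age` has the largest coefficient in absolute value (with `agesq` close behind), so it looks like the feature that best predicts one's income. Because the features were standardized before fitting, a one-standard-deviation increase in age (not a one-year increase) is associated with a 32.89 increase in predicted income, holding the other features fixed. One hesitation: the `age` and `agesq` coefficients are nearly equal and opposite (32.89 and -32.61), which is what we would expect from two highly correlated features, so neither should be read in isolation.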
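
Coefficients fit on standardized features can be converted back to "per original unit" by dividing by each feature's standard deviation. The sketch below demonstrates this on synthetic stand-in data; the two features and their scales are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic stand-in data (an assumption): two features on very different scales
X_raw = np.column_stack([rng.uniform(25, 65, 300), rng.uniform(0, 1, 300)])
y_demo = 0.8 * X_raw[:, 0] + 5.0 * X_raw[:, 1] + rng.normal(0, 1, 300)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
scaled_model = LinearRegression().fit(X_scaled, y_demo)

# Coefficient per one standard deviation -> coefficient per original unit
coef_per_unit = scaled_model.coef_ / scaler.scale_

# The same model fit directly on the raw features, for comparison
raw_model = LinearRegression().fit(X_raw, y_demo)

print("per standard deviation:", scaled_model.coef_)
print("per original unit:     ", coef_per_unit)
print("fit on raw features:   ", raw_model.coef_)
```

The rescaled coefficients match the ones from the raw-feature fit, which confirms that standardizing changes only the units of the coefficients, not the model's predictions.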

---

In [140]:
X_classification = df.drop(["e401k", "p401k"], axis=1)
print(X_classification.columns)
print(ad_cat.feature_importances_)

Index(['inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'pira', 'incsq',
       'agesq'],
      dtype='object')
[0.14 0.02 0.02 0.04 0.04 0.56 0.04 0.06 0.08]


**Answer**:
> Since `nettfa` has the largest importance score (0.56), it's the feature that best predicts whether or not one is eligible for a 401(k).