## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [17]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

from sklearn.neighbors import KNeighborsRegressor

from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor

from sklearn.svm import SVR

from sklearn.metrics import mean_squared_error, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split

In [2]:
df = pd.read_csv('401ksubs.csv')
df

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.170,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.230,0,1,35,1,154.000,1,0,3749.1130,1225
2,0,12.858,1,0,44,2,0.000,0,0,165.3282,1936
3,0,98.880,1,1,44,2,21.800,0,0,9777.2540,1936
4,0,22.614,0,0,53,1,18.450,0,0,511.3930,2809
...,...,...,...,...,...,...,...,...,...,...,...
9270,0,58.428,1,0,33,4,-1.200,0,0,3413.8310,1089
9271,0,24.546,0,1,37,3,2.000,0,0,602.5061,1369
9272,0,38.550,1,0,33,3,-13.600,0,1,1486.1020,1089
9273,0,34.410,1,0,57,3,3.550,0,0,1184.0480,3249


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

That would be racial bias. Targeting or excluding people from financial products based on race gives them unequal opportunities, which is unethical.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) in our dataset would we reasonably not use? Why?

In [3]:
# answer:
# I would not use `inc` itself (it is the target) or `incsq`, which is
# computed directly from income and would leak the target.
# `e401k`, `p401k`, and `pira` could plausibly relate to income, but we are
# told to pretend we do not have them.
# That leaves marriage, sex, age, family size, and net total financial
# assets, all of which may have an effect on income.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

In [4]:
# Income and age have been squared (`incsq`, `agesq`). Squared terms let a
# linear model capture curved relationships, for example an effect of age
# that grows and then levels off or reverses later in life.
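
One common reason to add squared terms is that they let a linear model fit a curved relationship. The sketch below illustrates this on synthetic data; the data-generating formula and variable names are assumptions made for illustration and are not taken from the 401(k) dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): an outcome that rises with age and then
# falls, i.e. a quadratic relationship plus noise.
rng = np.random.default_rng(42)
age = rng.uniform(25, 65, size=500)
y = -0.05 * (age - 45) ** 2 + 60 + rng.normal(0, 1, size=500)

X_linear = age.reshape(-1, 1)               # age only
X_squared = np.column_stack([age, age ** 2])  # age and age squared

r2_linear = LinearRegression().fit(X_linear, y).score(X_linear, y)
r2_squared = LinearRegression().fit(X_squared, y).score(X_squared, y)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_squared:.3f}")
```

With the squared column included, the same `LinearRegression` can follow the curve that a straight line in `age` misses.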

##### 6. Looking at the data dictionary, two variable descriptions appear to be errors. What are these errors, and what do you think the correct value would be, looking at the data?

In [5]:
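# The descriptions for `inc` and `age` read as if those columns were squared,
# but the values in those columns are not squared. Looking at the data, they
# appear to be plain income and age in years.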

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all models/modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6).

In [6]:
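# Linear regression, regularized linear regression (e.g. LASSO), KNN,
# decision trees, bagged trees, random forests, AdaBoost, and support vector regression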

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above. You will be asked to evaluate your models later in Step 5:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [7]:
features = ['marr','male', 'age', 'fsize', 'nettfa']
X = df[features]
y = df['inc']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state =42, 
                                                    test_size = 0.2)

lr = LinearRegression()
knn = KNeighborsRegressor()
dt = DecisionTreeRegressor()
bag = BaggingRegressor()
rf = RandomForestRegressor()
adboost = AdaBoostRegressor()
svr = SVR()

est = [lr, knn, dt, bag, rf, adboost, svr]

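mse = mean_squared_error

# Fit each estimator inside the same scaling pipeline and print its test RMSE.
# np.sqrt(MSE) is used instead of `squared=False`, which newer scikit-learn
# releases no longer accept.
for model in est:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('estimator', model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(np.sqrt(mse(y_test, y_pred)))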
    


21.363413714273026
20.185058402131137
27.2474278240084
20.78811812170192
20.33174009622726
21.588168038344634
20.467288297315545
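
The tree-based estimators above are created without a `random_state`, so their scores can shift a little every time they are refit, even though the train/test split is seeded. A minimal sketch of seeding the estimator itself, using synthetic data from `make_regression` (an assumption for illustration, not the lab data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical; stands in for the lab's X_train / y_train).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Two separate fits with the same seed produce the same forest.
preds_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo).predict(X_demo)
preds_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo).predict(X_demo)

print(np.array_equal(preds_a, preds_b))
```

Passing `random_state` to `DecisionTreeRegressor`, `BaggingRegressor`, `RandomForestRegressor`, and `AdaBoostRegressor` in the same way should make repeated runs of the comparison loop directly comparable.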


##### 9. What is bootstrapping?

In [8]:
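# Bootstrapping is random resampling from the observed data with replacement,
# typically drawing new samples of the same size as the original data.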
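
A small sketch of bootstrapping with NumPy. The five values are the first five `inc` values shown in the dataframe preview above; the seed and the number of resamples are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# First five `inc` values from the dataframe preview.
data = np.array([13.170, 61.230, 12.858, 98.880, 22.614])

# One bootstrap sample: same size as the data, drawn with replacement,
# so some values can repeat and others can be left out.
boot = rng.choice(data, size=len(data), replace=True)
print(boot)

# Repeating the resampling many times gives a distribution of the statistic
# of interest (here, the mean).
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
])
print(boot_means.std())
```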

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A decision tree:
- takes a dataset consisting of $X$ and $Y$ data,
- finds rules based on our $X$ data that partition (split) our data into smaller datasets such that
- by the bottom of the tree, the values of $Y$ in each "leaf node" are as "pure" as possible.

Decision trees have some limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns (a.k.a. they overfit their training sets).

Bagging (bootstrap aggregating):
- mitigates this problem by fitting many trees, each on a different bootstrap sample (random rows drawn with replacement) of the training set, and
- aggregates their predictions (an average for regression, a majority vote for classification), which reduces variance compared with a single deep tree.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

    A set of bagged trees randomizes the rows: each tree is fit on a different bootstrap sample of the data.
    A random forest randomizes the rows and the columns: each tree is also limited to a random subset of the features at each split.

    Random forests differ from bagged decision trees in only one way: they use a modified tree learning algorithm that selects, at each split in the learning process, a **random subset of the features**. This process is sometimes called the *random subspace method*.

    The reason for doing this is the correlation of the trees fit on ordinary bootstrap samples: if one or a few features are very strong predictors for the response variable (target output), these features will be used in many/all of the bagged decision trees, causing them to become correlated. By selecting a random subset of features at each split, we counter this correlation between base trees, strengthening the overall model.
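
The difference can be made concrete in scikit-learn. This is a sketch on synthetic data (an assumption for illustration); the choice of `max_features="sqrt"` and 50 trees is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical; not the 401(k) dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=9,
                                 noise=5.0, random_state=0)

# Bagged trees: each tree sees a bootstrap sample of the rows and may
# consider every feature at every split.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                          random_state=0)
bagged.fit(X_demo, y_demo)

# Random forest: bootstrap rows plus a random subset of features per split.
forest = RandomForestRegressor(n_estimators=50, max_features="sqrt",
                               random_state=0)
forest.fit(X_demo, y_demo)

# Number of features each individual tree may consider at a split.
print(bagged.estimators_[0].max_features_)
print(forest.estimators_[0].max_features_)
```

In recent scikit-learn releases `RandomForestRegressor` considers all features at each split by default, so a default regression forest behaves much like bagged trees unless `max_features` is lowered as shown here.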

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

    By "de-correlating" our trees from one another, we reduce the variance of the averaged prediction, usually with little increase in bias.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [9]:
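# Refit each pipeline and report RMSE on the testing data.
# np.sqrt(MSE) replaces `squared=False`, which newer scikit-learn
# releases no longer accept.
for model in est:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('estimator', model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f'{model} RMSE for the testing data:')
    print(np.sqrt(mse(y_test, y_pred)))
    print('')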


LinearRegression() RMSE for the testing data:
21.363413714273026

KNeighborsRegressor() RMSE for the testing data:
20.185058402131137

DecisionTreeRegressor() RMSE for the testing data:
27.21956660713901

BaggingRegressor() RMSE for the testing data:
20.838286895492555

RandomForestRegressor() RMSE for the testing data:
20.211332260682823

AdaBoostRegressor() RMSE for the testing data:
23.543041422081203

SVR() RMSE for the testing data:
20.467288297315545



In [10]:
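# Refit each pipeline and report RMSE on the training data.
# np.sqrt(MSE) replaces `squared=False`, which newer scikit-learn
# releases no longer accept.
for model in est:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('estimator', model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_train)
    print(f'{model} RMSE for the training data:')
    print(np.sqrt(mse(y_train, y_pred)))
    print(' ')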


LinearRegression() RMSE for the training data:
20.497257487705625
 
KNeighborsRegressor() RMSE for the training data:
16.483877636883296
 
DecisionTreeRegressor() RMSE for the training data:
2.2638130048030134
 
BaggingRegressor() RMSE for the training data:
8.85885824420104
 
RandomForestRegressor() RMSE for the training data:
7.701698802675366
 
AdaBoostRegressor() RMSE for the training data:
22.548451357765014
 
SVR() RMSE for the training data:
19.778944149412197
 


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

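    Yes, there is certainly some overfitting going on.
    Gap between testing and training RMSE (test minus train):
        - KNN: about 3.7 (16.5 train vs. 20.2 test)
        - Decision tree: about 25 (2.3 train vs. 27.2 test), which is severely overfit
        - Bagging: about 12 (8.9 train vs. 20.8 test)
        - Random forest: about 12.5 (7.7 train vs. 20.2 test)

    AdaBoost has a gap of only about 1, but its RMSE is the highest of all models on the training data and the second highest on the testing data, so it is likely underfitting and not a good model for this problem.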

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

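    Linear regression. Its training and testing RMSE are close (about 20.5 vs. 21.4), so it shows little overfitting, and its coefficients are directly interpretable, which suits a question about which features best predict income. Several other models have a slightly lower testing RMSE, but they are harder to interpret.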

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [11]:
# I would run a grid search (GridSearchCV) over each model's hyperparameters to find better settings.
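
A minimal sketch of what such a grid search could look like for the KNN pipeline. The synthetic data, the step name `knn`, and the parameter grid are assumptions chosen for illustration; with the lab data, `X_train` and `y_train` would be used instead.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (hypothetical).
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

pipe = Pipeline([
    ("sc", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(-grid.best_score_)  # cross-validated RMSE of the best setting
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation fold, which keeps the validation folds from influencing the scaling.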

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [12]:
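# For a targeted marketing plan, it makes little sense to seek
# out people who already participate in a 401(k). They are already
# investing in their own plan and are unlikely to switch.
# Also, anyone who participates in a 401(k) must be eligible for one,
# so `p401k` leaks the target `e401k`.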

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6).

In [13]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier

from sklearn.svm import SVC


##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above. You will be asked to evaluate your models later in Step 5:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [19]:
features = ['marr','male', 'age', 'fsize', 'nettfa', 'inc']
X = df[features]
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state =42, 
                                                    test_size = 0.2)

logreg = LogisticRegression()
knnc = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
bagc = BaggingClassifier()
rfc = RandomForestClassifier()
adboostc = AdaBoostClassifier()
svc = SVC()

est = [logreg, knnc, dtc, bagc, rfc, adboostc, svc]

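mse = mean_squared_error

# For 0/1 labels, MSE equals the misclassification rate, so the values
# printed below are the square root of each model's error rate.
# np.sqrt(MSE) replaces `squared=False`, which newer scikit-learn
# releases no longer accept.
for model in est:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('estimator', model)])
    pipe.fit(X_train, y_train)
    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)

    print(f'{np.sqrt(mse(y_train, y_pred_train))}: train: {model}')
    print(f'{np.sqrt(mse(y_test, y_pred_test))}: test: {model}')
    print()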


0.5936546774100089: train: LogisticRegression()
0.5785938899503439: test: LogisticRegression()

0.5008079725173384: train: KNeighborsClassifier()
0.6018838350906229: test: KNeighborsClassifier()

0.0: train: DecisionTreeClassifier()
0.6337327912250197: test: DecisionTreeClassifier()

0.1509181245690859: train: BaggingClassifier()
0.6054558773412049: test: BaggingClassifier()

0.0: train: RandomForestClassifier()
0.5928595784706245: test: RandomForestClassifier()

0.5578404644945482: train: AdaBoostClassifier()
0.5620524574802326: test: AdaBoostClassifier()

0.5678967675066257: train: SVC()
0.5692005080187833: test: SVC()



## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

In [15]:
# A false positive is someone the model predicts is eligible who is not actually eligible.
# A false negative is someone the model predicts is ineligible who does qualify for a 401(k).

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

    Minimizing false negatives is better for the company financially: a false negative is an eligible person we never contact, which is a missed customer, while a false positive only costs some wasted outreach. Ideally, both would be low.

##### 22. Suppose we wanted to optimize for (minimize) the answer you provided in problem 21. Which metric would we optimize (maximize) in this case?

    To minimize false negatives without being as concerned about false positives, we would want to maximize sensitivity (recall).
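
A short sketch of how sensitivity relates to false negatives, using made-up labels (hypothetical, not model output from this lab) where 1 means "eligible":

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Sensitivity (recall) = TP / (TP + FN): fewer false negatives means
# higher sensitivity.
sensitivity = tp / (tp + fn)

print(f"false positives: {fp}, false negatives: {fn}")
print(f"sensitivity: {sensitivity:.2f}")
print(f"recall_score: {recall_score(y_true, y_pred):.2f}")
```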

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

    The F1 score is the harmonic mean of precision and recall (sensitivity), so it is high only when both false positives and false negatives are low.
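
To make the definition concrete, the sketch below computes F1 by hand from precision and recall and compares it with `f1_score`. The labels are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = eligible, 0 = not eligible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"F1 (manual):   {f1_manual:.3f}")
print(f"F1 (sklearn):  {f1_score(y_true, y_pred):.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot score well on F1 by doing well on only one of precision or recall.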

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [20]:
for model in est:
    pipe = Pipeline([
        ('sc', StandardScaler()), 
        ('estimator', model) ])
    pipe.fit(X_train, y_train)
    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)
    
    print(f'{f1_score(y_train, y_pred_train)}: train: {model}')
    print(f'{f1_score(y_test, y_pred_test)}: test: {model}')
    print()


0.3745515426931356: train: LogisticRegression()
0.4057416267942584: test: LogisticRegression()

0.6516938049784766: train: KNeighborsClassifier()
0.4947368421052631: test: KNeighborsClassifier()

1.0: train: DecisionTreeClassifier()
0.4917582417582418: test: DecisionTreeClassifier()

0.9690467815687653: train: BaggingClassifier()
0.46993670886075944: test: BaggingClassifier()

1.0: train: RandomForestClassifier()
0.5182370820668692: test: RandomForestClassifier()

0.5639282341831916: train: AdaBoostClassifier()
0.5593984962406015: test: AdaBoostClassifier()

0.4495054060271451: train: SVC()
0.44300278035217794: test: SVC()



##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

    There is evidence of overfitting. KNN, DecisionTree, BaggingClassifier, and RandomForest are all overfit. Remarkably, though, Random Forest has a perfect F1 score (1.0) on the training data and still has the second-highest F1 score on the testing data, at about 0.52. Not bad for being incredibly overfit.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

    AdaBoostClassifier. Its training and testing F1 scores are nearly identical (about 0.56 each), so it shows no sign of overfitting, and it has the highest testing F1 score. This indicates that it is the most effective of the models compared here.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

    Grid search other hyperparameters and determine which models have the least variance and bias affecting their predictions.
    Examine the coefficients (or, for tree-based models, the feature importances) to see how much of an effect each variable has, and use that to decide how to tweak the model further.
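
For a tree-based ensemble such as AdaBoost there are no coefficients, but scikit-learn exposes `feature_importances_` on the fitted model. A sketch on synthetic data; the data and the feature names are hypothetical and only illustrate the pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic data (hypothetical; not the 401(k) dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # made-up names

ada = AdaBoostClassifier(random_state=0).fit(X_demo, y_demo)

# Rank features from most to least important according to the fitted model.
ranked = sorted(zip(feature_names, ada.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```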

## Step 6: Answer the problem. [BONUS] 

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.