<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

_author The arbitrary and capricious heart of data science_

---

### Let us begin...

Recall the "data science process."
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and, hopefully, grows into a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd
import numpy as np

from math import sqrt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier
from sklearn.metrics import mean_squared_error, f1_score
from sklearn import svm

import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('./401ksubs.csv')
df.head(2)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225


In [3]:
df.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

In [4]:
df.shape

(9275, 11)

In [5]:
df.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

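>1. Employer type or industry, since 401(k) plans are offered through employers (age is already in the data).
>2. Whether or not they have a degree.
>3. Whether or not they have children.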

In [6]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [7]:
df[['marr', 'male', 'agesq', 'fsize', 'nettfa']]

Unnamed: 0,marr,male,agesq,fsize,nettfa
0,0,0,1600,1,4.575
1,0,1,1225,1,154.000
2,1,0,1936,2,0.000
3,1,1,1936,2,21.800
4,0,0,2809,1,18.450
...,...,...,...,...,...
9270,1,0,1089,4,-1.200
9271,0,1,1369,3,2.000
9272,1,0,1089,3,-13.600
9273,1,0,3249,3,3.550


##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

It would be unethical because targeting customers by race could lead to discrimination, even unintentionally, and it is potentially illegal.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [8]:
df.head(3)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936


##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

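The two variables already created are `incsq` (income squared) and `agesq` (age squared). SMEs may have added them so that a linear model can capture non-linear relationships; for example, income may rise with age and then level off, which a straight line in `age` alone cannot represent. Squaring also widens the gap between low and high values, which may make differences in income and age more pronounced when deciding who qualifies.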

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

Looking at the data dictionary, income (`inc`) appears to be described as income squared. The description also does not mention that income is (I'm assuming) in $1000s, which it does mention for net total financial assets (`nettfa`). The description for `inc` should not be squared, and it should state that the values are in $1000s.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

   > `Linear Regression`: Appropriate for predicting one's income because its predictions and coefficients are easy to interpret.
   
   > `Ridge Regression`: Appropriate. The coefficients remain easy to interpret, and regularization shrinks them, which can improve the predictive performance of the model.
    
   > `Lasso Regression`: Appropriate for the same reasons as Linear Regression and Ridge Regression. Its penalty can also shrink some coefficients all the way to zero, which acts as feature selection and can improve the predictive performance of the model.
    
   > `ElasticNet Regression`: Appropriate. It combines the penalties of Lasso Regression and Ridge Regression.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
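
One possible shape for such a pipeline is sketched below. It is only an illustration under stated assumptions: the data is synthetic (generated with `make_regression`, not read from `401ksubs.csv`), the model list is shortened, and the `models` and `results` names are my own.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the lab data (assumption: 5 numeric features).
X, y = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=42)

# One split, reused for every model so the comparison is fair.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "k-nearest neighbors": KNeighborsRegressor(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    # The scaler lives inside the pipeline, so it is fit on training rows only.
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipe.fit(X_train, y_train)
    train_rmse = sqrt(mean_squared_error(y_train, pipe.predict(X_train)))
    test_rmse = sqrt(mean_squared_error(y_test, pipe.predict(X_test)))
    results[name] = (train_rmse, test_rmse)
    print(f"{name:20s} train RMSE {train_rmse:8.2f}   test RMSE {test_rmse:8.2f}")
```

Looping over a dictionary keeps the split, the scaling, and the metric identical across models, which is the point of the recommendation above.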

##### 9. What is bootstrapping?

- Bootstrapping is random resampling with replacement.
- We bootstrap when fitting bagged decision trees so that we can fit multiple decision trees on slightly different sets of data. Bagged decision trees tend to outperform single decision trees.
- Bootstrapping can also be used to conduct hypothesis tests and generate confidence intervals directly from resampled data.
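
The last bullet can be sketched in a few lines of plain Python. This is a minimal percentile bootstrap under my own assumptions: the `bootstrap_ci` helper is hypothetical, and the incomes are made up for illustration rather than taken from `401ksubs.csv`.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for `stat` of `values`."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations WITH replacement from the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical incomes in $1000s (not from the lab data).
incomes = [13.2, 61.2, 12.9, 98.9, 22.6, 15.0, 37.2, 45.5, 28.1, 52.3]
low, high = bootstrap_ci(incomes)
print(f"95% bootstrap CI for the mean income: ({low:.1f}, {high:.1f})")
```

Bagging uses the same resampling step, except that each resample is used to fit a tree instead of to compute a mean.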

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A single decision tree is fit once, on the full training set. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training data, and then aggregates their predictions (averaging for regression, majority vote for classification). Each bagged tree still considers all features when splitting a node; choosing the best split from a random subset of the features is what distinguishes a random forest from bagging.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

`Bagging (Bootstrap Aggregation)` fits each tree on a bootstrap sample of the data and considers every feature at every split. It is used when our goal is to reduce the variance of a decision tree.

`Random Forest` is an extension of bagging. It takes one extra step: in addition to fitting each tree on a random subset of the data, it considers only a random selection of features at each split rather than all of them. A collection of many such randomized trees is called a Random Forest.
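
The difference can be seen directly in scikit-learn. The sketch below uses synthetic data from `make_regression` (not the lab data) and sets `max_features="sqrt"` explicitly; as far as I know, `RandomForestRegressor()` with default settings considers all features at each split, which makes it behave much like bagging.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 300 rows, 8 features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Bagging: each tree sees a bootstrap sample and may split on ANY feature.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=42)
bagged.fit(X, y)

# Random forest: each tree also sees a bootstrap sample, but every split
# only considers a random subset of the features (sqrt(8) rounds down to 2).
forest = RandomForestRegressor(n_estimators=25, max_features="sqrt", random_state=42)
forest.fit(X, y)

print("features available per split, bagged tree:", bagged.estimators_[0].max_features_)
print("features available per split, forest tree:", forest.estimators_[0].max_features_)
```

Passing the base tree positionally avoids depending on the keyword name, which has changed between scikit-learn releases.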



##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

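A random forest model might be superior to a set of bagged decision trees because restricting each split to a random subset of features makes the trees less correlated with one another. Averaging less-correlated trees gives lower variance at the cost of slightly greater bias, which should improve the model overall.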
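
A standard result makes the variance argument concrete. It is stated here under the simplifying assumption that each of the $B$ trees $T_b$ has the same prediction variance $\sigma^2$ and the same pairwise correlation $\rho$ with every other tree:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. Lowering $\rho$, which is what random feature selection aims to do, shrinks the first term, and bagging alone cannot do that.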

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.
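
For reference, RMSE is the square root of the mean squared residual, so it is in the same units as `inc`. A minimal sketch of the calculation, using made-up numbers rather than the lab data (the helper name `rmse` is my own):

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical incomes and predictions (not from 401ksubs.csv).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 35.0]
print(rmse(actual, predicted))  # sqrt((4 + 4 + 9 + 25) / 4) = sqrt(10.5)
```

This is the same quantity that `sqrt(mean_squared_error(...))` computes in the cells below.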

In [9]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [148]:
features = ['marr', 'male', 'agesq', 'fsize', 'nettfa']

X = df[features]
y = df['inc']

In [149]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [150]:
ss = StandardScaler()

In [151]:
ss.fit(X_train)

In [152]:
X_train_sc = ss.transform(X_train)
X_train_sc

array([[-1.29516184,  1.96752343, -1.20265512, -1.23978916, -0.37605834],
       [ 0.77210428, -0.50825316, -1.07629993,  0.73260698, -0.11831972],
       [-1.29516184, -0.50825316,  0.06089676, -1.23978916, -0.3531968 ],
       ...,
       [-1.29516184, -0.50825316, -1.31998493,  0.0751416 , -0.2989491 ],
       [-1.29516184,  1.96752343, -1.20265512,  0.73260698, -0.34690407],
       [ 0.77210428, -0.50825316, -0.12863602,  0.0751416 ,  2.75674683]])

In [153]:
X_test_sc = ss.transform(X_test)
X_test_sc 

array([[ 0.77210428, -0.50825316,  0.06089676,  0.73260698, -0.21241625],
       [ 0.77210428, -0.50825316, -0.86984458,  0.0751416 , -0.2989491 ],
       [-1.29516184,  1.96752343, -0.64308214, -1.23978916, -0.07575852],
       ...,
       [-1.29516184, -0.50825316,  2.45261996, -0.58232378, -0.28732459],
       [-1.29516184,  1.96752343, -0.2200179 , -1.23978916, -0.2989491 ],
       [ 0.77210428, -0.50825316, -0.86984458, -0.58232378, -0.13930584]])

**1.  Linear Regression**

In [154]:
lr = LinearRegression()

In [155]:
lr.fit(X_train_sc, y_train)

In [156]:
cross_val_score(lr, X_train_sc, y_train).mean()

0.26830116546660115

In [157]:
lr.score(X_train_sc, y_train)

0.2756613256103081

In [158]:
lr.score(X_test_sc, y_test)

0.22712873344830842

**Using RMSE**

In [159]:
lr_predics_train = lr.predict(X_train_sc)

In [160]:
lr_rms_train = sqrt(mean_squared_error(y_train, lr_predics_train))

In [161]:
lr_rms_train

20.40722185735052

In [162]:
lr_predics_test = lr.predict(X_test_sc)

In [163]:
lr_rms_test = sqrt(mean_squared_error(y_test, lr_predics_test))

In [164]:
lr_rms_test

21.466350728165125

**2. K-Nearest Neighbors**

In [165]:
knn = KNeighborsRegressor()

In [166]:
knn.fit(X_train_sc,y_train)

In [167]:
cross_val_score(knn,X_train_sc,y_train).mean()

0.287233091422582

In [168]:
knn.score(X_train_sc,y_train)

0.5267766834764802

In [169]:
knn.score(X_test_sc,y_test)

0.3134384327718984

**Using RMSE**

In [170]:
knn_predics_train = knn.predict(X_train_sc)

In [171]:
knn_rms_train = sqrt(mean_squared_error(y_train, knn_predics_train))

In [172]:
knn_rms_train

16.49476438701818

In [173]:
knn_predics_test = knn.predict(X_test_sc)

In [174]:
knn_rms_test = sqrt(mean_squared_error(y_test, knn_predics_test))

In [175]:
knn_rms_test

20.232259389714613

**3. Decision Tree**

In [204]:
X_train.columns

Index(['marr', 'male', 'agesq', 'fsize', 'nettfa'], dtype='object')

In [205]:
dt = DecisionTreeRegressor()

In [214]:
dt.fit(X_train_sc,y_train)

In [215]:
cross_val_score(dt,X_train_sc,y_train)

array([-0.07326514, -0.18882531, -0.18639912, -0.23977466, -0.29674907])

In [219]:
dt.score(X_train_sc,y_train)

0.9908140061424364

In [220]:
dt.score(X_test_sc,y_test)

-0.26088527273515183

**Using RMSE**

In [182]:
dt_predics_train = dt.predict(X_train_sc)

In [183]:
dt_rms_train = sqrt(mean_squared_error(y_train, dt_predics_train))

In [184]:
dt_rms_train 

2.2981381078557472

In [185]:
dt_predics_test = dt.predict(X_test_sc)

In [186]:
dt_rms_test = sqrt(mean_squared_error(y_test, dt_predics_test))

In [187]:
dt_rms_test

26.88477844093206

**4. Bagged Decision Tree**

In [188]:
bdt = BaggingRegressor()

In [189]:
bdt.fit(X_train_sc,y_train)

In [190]:
cross_val_score(bdt,X_train_sc,y_train).mean()

0.25970473721412884

In [221]:
bdt.score(X_train_sc,y_train)

0.8654503471029847

In [222]:
bdt.score(X_test_sc,y_test)

0.25447506791203023

**Using RMSE**

In [193]:
bdt_predics_train = bdt.predict(X_train_sc)

In [194]:
bdt_rms_train = sqrt(mean_squared_error(y_train, bdt_predics_train))

In [195]:
bdt_rms_train

8.795374575494954

In [196]:
bdt_predics_test = bdt.predict(X_test_sc)

In [197]:
bdt_rms_test = sqrt(mean_squared_error(y_test, bdt_predics_test))

In [198]:
bdt_rms_test

21.08316103413898

**5.  Random Forests**

In [199]:
rf = RandomForestRegressor()

In [200]:
rf.fit(X_train_sc,y_train)

In [201]:
cross_val_score(rf,X_train_sc,y_train).mean()

0.30188273162724355

In [202]:
rf.score(X_train_sc,y_train)

0.8964850370675724

In [203]:
rf.score(X_test_sc,y_test)

0.2981744506950985

**Using RMSE**

In [65]:
rf_predics_train = rf.predict(X_train_sc)

In [66]:
rf_rms_train = sqrt(mean_squared_error(y_train, rf_predics_train))

In [67]:
rf_rms_train

7.66144989413659

In [68]:
rf_predics_test = rf.predict(X_test_sc)

In [69]:
rf_rms_test = sqrt(mean_squared_error(y_test, rf_predics_test))

In [70]:
rf_rms_test

20.385321300978728

**6. AdaBoost**

In [71]:
ada = AdaBoostRegressor()

In [72]:
ada.fit(X_train_sc,y_train)

In [73]:
cross_val_score(ada,X_train_sc,y_train).mean()

0.1127230586225032

In [74]:
ada.score(X_train_sc,y_train)

0.2339690990080423

In [75]:
ada.score(X_test_sc,y_test)

0.1864368440522175

**Using RMSE**

In [76]:
ada_predics_train = ada.predict(X_train_sc)

In [77]:
ada_rms_train = sqrt(mean_squared_error(y_train, ada_predics_train))

In [78]:
ada_rms_train

20.986315301232867

In [79]:
ada_predics_test = ada.predict(X_test_sc)

In [80]:
ada_rms_test = sqrt(mean_squared_error(y_test, ada_predics_test))

In [81]:
ada_rms_test

22.024206794726254

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

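Based on the training RMSE and the testing RMSE, there is evidence of overfitting in every model to some degree. The Decision Tree (2.30 training vs. 26.88 testing), Bagged Decision Trees (8.80 vs. 21.08), and Random Forests (7.66 vs. 20.39) models are severely overfit, and K-Nearest Neighbors (16.49 vs. 20.23) is moderately overfit. Only the Linear Regression (20.41 vs. 21.47) and AdaBoost (20.99 vs. 22.02) models show just slight overfitting.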

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would pick the Linear Regression model. Although it did not have the best RMSE on the testing data, the gap between its RMSE on the training data and on the testing data is among the smallest of all the models, and its coefficients can be read directly, which suits a question about which features best predict income.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

Try turning the age column into a categorical one by binning it into ranges, then running the binned column through `get_dummies` so that each age range gets its own indicator column.
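
A rough sketch of that idea with pandas is below. The bin edges and labels are hypothetical choices for illustration, and the five ages are made up rather than read from `401ksubs.csv`.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.DataFrame({"age": [25, 35, 45, 55, 64]})

# Hypothetical ten-year ranges; each interval includes its right edge.
bin_edges = [24, 34, 44, 54, 64]
bin_labels = ["25-34", "35-44", "45-54", "55-64"]

# pd.cut assigns each age to a labelled range.
ages["age_group"] = pd.cut(ages["age"], bins=bin_edges, labels=bin_labels)

# get_dummies turns the ranges into one indicator column per range.
age_dummies = pd.get_dummies(ages["age_group"], prefix="age")
print(age_dummies)
```

When feeding the indicator columns to a linear model, one of them would normally be dropped (for example with `drop_first=True`) to avoid perfect collinearity with the intercept.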

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.



In [246]:
features = ['incsq', 'marr', 'male', 'agesq', 'fsize', 'nettfa', 'pira']

X = df[features]
y = df['e401k']

In [247]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [248]:
ss = StandardScaler()

In [249]:
ss.fit(X_train)

In [250]:
X_train_sc = ss.transform(X_train)

In [251]:
X_test_sc = ss.transform(X_test)

**1. Logistic Regression**

In [252]:
logreg = LogisticRegression()

In [253]:
logreg.fit(X_train_sc, y_train)

In [254]:
cross_val_score(logreg, X_train_sc, y_train).mean()

0.6364292826627663

In [255]:
logreg.score(X_train_sc, y_train)

0.6367165037377803

In [232]:
logreg.score(X_test_sc, y_test)

0.648124191461837

**2. K Nearest Neighbors**

In [256]:
knn = KNeighborsClassifier()

In [257]:
knn.fit(X_train_sc, y_train)

In [258]:
cross_val_score(knn, X_train_sc, y_train).mean()

0.6266526603700306

In [259]:
knn.score(X_train_sc, y_train)

0.7514376078205866

In [260]:
knn.score(X_test_sc, y_test)

0.64381198792583

**3. Decision Tree**

In [261]:
dt = DecisionTreeClassifier()

In [262]:
dt.fit(X_train_sc, y_train)

In [263]:
cross_val_score(dt, X_train_sc, y_train).mean()

0.5981898204384508

In [264]:
dt.score(X_train_sc, y_train)

1.0

In [265]:
dt.score(X_test_sc, y_test)

0.592496765847348

**4. Bagged Decision Tree**

In [266]:
bdt = BaggingClassifier()

In [267]:
bdt.fit(X_train_sc, y_train)

In [268]:
cross_val_score(bdt, X_train_sc, y_train).mean()

0.6411719014683888

In [269]:
bdt.score(X_train_sc, y_train)

0.9785796434732605

In [270]:
bdt.score(X_test_sc, y_test)

0.653730056058646

**5. Random Forests**

In [271]:
rf = RandomForestClassifier()

In [272]:
rf.fit(X_train_sc, y_train)

In [273]:
cross_val_score(rf, X_train_sc, y_train).mean()

0.6671948982374378

In [274]:
rf.score(X_train_sc, y_train)

1.0

In [275]:
rf.score(X_test_sc, y_test)

0.6601983613626563

**6. AdaBoost**

In [276]:
ada = AdaBoostClassifier()

In [277]:
ada.fit(X_train_sc, y_train)

In [278]:
cross_val_score(ada, X_train_sc, y_train).mean()

0.6795571076790864

In [279]:
ada.score(X_train_sc, y_train)

0.6927832087406556

In [280]:
ada.score(X_test_sc, y_test)

0.685640362225097

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

Our target variable is whether or not someone is eligible for a 401k, and `p401k` records whether or not someone participates in one. Anyone who participates must be eligible, so for those people `p401k` is almost the same as the target itself. Including `p401k` in my model would be close to training the model with the target variable included: the scores would look inflated, and the model would be of little use for identifying potential customers whose participation is not already known.

In [299]:
df.groupby(["e401k","p401k"])["e401k"].describe()

e401k,p401k,count,mean,std,min,25%,50%,75%,max
0,0,5638.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1075.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
1,1,2562.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
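
To see how strong this leakage is, the counts in the `groupby` output above are enough to score a naive rule that predicts `e401k = 1` exactly when `p401k = 1`. A small sketch (the variable names are my own):

```python
# Row counts copied from the groupby output: (e401k, p401k) -> count.
counts = {(0, 0): 5638, (1, 0): 1075, (1, 1): 2562}

# Naive rule: predict "eligible" exactly when the person participates.
tp = counts.get((1, 1), 0)  # eligible, predicted eligible
fp = counts.get((0, 1), 0)  # not eligible, predicted eligible (never occurs)
fn = counts.get((1, 0), 0)  # eligible, predicted not eligible
tn = counts.get((0, 0), 0)  # not eligible, predicted not eligible

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
f1 = 2 * tp / (2 * tp + fp + fn)
print(f"rows = {total}, accuracy = {accuracy:.3f}, f1 = {f1:.3f}")
```

This works out to an accuracy of about 0.88 and an f1-score of about 0.83 from a single column, compared with roughly 0.69 and 0.56 on the testing data for the best model in this lab.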


##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

-  Logistic Regression:  An appropriate tactic, especially since its coefficients can be interpreted.
-  K-Nearest Neighbors:  An appropriate tactic, as it can be used for Classification purposes.
-  Decision Trees:  An appropriate tactic, as it can be used for Classification purposes.
-  Bagged Decision Trees:  An appropriate tactic, as it can be used for Classification purposes.
-  Random Forest:  An appropriate tactic, as it can be used for Classification purposes.
-  AdaBoost:  An appropriate tactic, as it can be used for Classification purposes.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False Positives: Someone that the model predicts is eligible for a 401(k), but actually is not.
False Negatives: Someone that the model predicts is not eligible for a 401(k), but actually is.
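
These definitions can be checked by counting outcomes directly. The sketch below uses made-up labels (1 = eligible) and a hypothetical helper named `confusion_counts`:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["tp"] += 1  # predicted eligible, actually eligible
        elif predicted == positive:
            counts["fp"] += 1  # predicted eligible, actually not eligible
        elif actual == positive:
            counts["fn"] += 1  # predicted not eligible, actually eligible
        else:
            counts["tn"] += 1  # predicted not eligible, actually not eligible
    return counts


# Made-up labels for illustration (1 = eligible for a 401(k)).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]
print(confusion_counts(y_true, y_pred))
```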

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

I am going to assume that the cost to the financial services company that I am working for is greater if they advertise/offer a 401k to someone who is not actually eligible for one than if they did not advertise/offer a 401k to someone who is eligible.
Under this assumption, we would rather minimize False Positives.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

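If we wanted to minimize False Positives, we should maximize the Specificity metric, TN / (TN + FP): the share of people who are not eligible that the model correctly labels as not eligible.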
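
One common way to trade false positives for false negatives is to raise the probability threshold for predicting the positive class. The sketch below uses made-up probabilities and a hypothetical `specificity` helper; it is not tied to any model fit in this lab.

```python
def specificity(y_true, y_prob, threshold):
    """TN / (TN + FP), predicting positive when probability >= threshold."""
    tn = fp = 0
    for actual, prob in zip(y_true, y_prob):
        if actual == 0:
            if prob >= threshold:
                fp += 1  # a negative wrongly flagged as eligible
            else:
                tn += 1  # a negative correctly left alone
    if tn + fp == 0:
        raise ValueError("specificity is undefined without actual negatives")
    return tn / (tn + fp)


# Made-up labels and predicted probabilities of being eligible.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.10, 0.35, 0.55, 0.65, 0.40, 0.60, 0.80, 0.90]

for threshold in (0.5, 0.7):
    print(f"threshold {threshold}: specificity {specificity(y_true, y_prob, threshold)}")
```

In this toy example the stricter threshold removes both false positives, but it also turns the eligible person scored 0.60 into a false negative, so the gain in specificity is paid for in recall.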

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

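The F1 score might be an appropriate metric to use here because it is the harmonic mean of the model's precision and recall, so it is only high when both false positives and false negatives are kept low. It can also be more informative than accuracy when the classes are imbalanced; here about 61% of people (5,638 of 9,275) are not eligible.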
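
A short calculation shows how accuracy and F1 can disagree. The counts below are hypothetical, chosen to describe a cautious model that rarely predicts "eligible"; the helper name `precision_recall_f1` is my own.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: few positive predictions, many missed positives.
tp, fp, fn, tn = 20, 10, 80, 190

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, "
      f"recall = {recall:.2f}, f1 = {f1:.2f}")
```

The accuracy looks respectable because most people are true negatives, while the low recall drags F1 down. The Logistic Regression results in this lab show a similar pattern (accuracy near 0.65, f1-score near 0.31).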

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

**1. Logistic Regression**

In [281]:
logreg_predics_train = logreg.predict(X_train_sc)

In [282]:
logreg_f1_train = f1_score(y_train, logreg_predics_train)

In [283]:
logreg_f1_train

0.29825048597611775

In [284]:
logreg_predics_test = logreg.predict(X_test_sc)

In [285]:
logreg_f1_test = f1_score(y_test, logreg_predics_test)

In [286]:
logreg_f1_test

0.31197301854974707

**2. Decision Tree**

In [287]:
dt_predics_train = dt.predict(X_train_sc)

In [288]:
dt_f1_train = f1_score(y_train, dt_predics_train)

In [289]:
dt_f1_train

1.0

In [290]:
dt_predics_test = dt.predict(X_test_sc)

In [291]:
dt_f1_test = f1_score(y_test, dt_predics_test)

In [292]:
dt_f1_test

0.48105436573311366

**3. Bagged Decision Tree**

In [129]:
bdt_predics_train = bdt.predict(X_train_sc)

In [130]:
bdt_f1_train = f1_score(y_train, bdt_predics_train)

In [131]:
bdt_f1_train

0.9675847854599962

In [132]:
bdt_predics_test = bdt.predict(X_test_sc)

In [133]:
bdt_f1_test = f1_score(y_test, bdt_predics_test)

In [134]:
bdt_f1_test

0.47630922693266836

**4. Random Forests**

In [135]:
rf_predics_train = rf.predict(X_train_sc)

In [136]:
rf_f1_train = f1_score(y_train, rf_predics_train)

In [137]:
rf_f1_train

0.9998174849425077

In [138]:
rf_predics_test = rf.predict(X_test_sc)

In [139]:
rf_f1_test = f1_score(y_test, rf_predics_test)

In [140]:
rf_f1_test

0.5324357405140759

**5. AdaBoost**

In [141]:
ada_predics_train = ada.predict(X_train_sc)

In [142]:
ada_f1_train = f1_score(y_train, ada_predics_train)

In [143]:
ada_f1_train

0.569066344020972

In [144]:
ada_predics_test = ada.predict(X_test_sc)

In [145]:
ada_f1_test = f1_score(y_test, ada_predics_test)

In [146]:
ada_f1_test

0.5552165954850519

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

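Based on the training f1-scores and the testing f1-scores, there is evidence that the Decision Tree (1.00 training vs. 0.48 testing), Bagged Decision Trees (0.97 vs. 0.48), and Random Forests (1.00 vs. 0.53) models are overfit. K Nearest Neighbors also looks overfit, judging by its accuracy (0.75 training vs. 0.64 testing). The AdaBoost model shows only slight evidence of being overfit (0.57 vs. 0.56), and Logistic Regression shows none (0.30 vs. 0.31).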

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

It would be the AdaBoost model; it has the strongest testing f1-score (0.56) and testing accuracy (0.69), and it only shows evidence of being ever-so-slightly overfit.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would spend more time on feature creation, in particular, passing my features through polynomial features.
I would gridsearch on my models to see how much tweaking the models' parameters could improve their performances.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

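`Regression model`: I would use the `Linear Regression model`. Its main limitation is a weak fit: the testing R² is only about 0.23 and the testing RMSE is about 21.5, so the features used explain little of the variation in income.

`Classification model`: I would use the `AdaBoost model`. Its testing accuracy is about 0.69 and its testing f1-score is about 0.56, which is only a modest improvement over always predicting "not eligible" (about 61% of the data).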
