## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

In [150]:
# Import libraries
import pandas as pd
import numpy as np # linear algebra
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier, BaggingRegressor, BaggingClassifier
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score
from warnings import simplefilter

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.svm import SVR, SVC
from sklearn.metrics import confusion_matrix, f1_score

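# `category` must be a Warning subclass, not 0. FutureWarning is assumed here as the intended target.
simplefilter("ignore", category=FutureWarning)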

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [137]:
class data_explorer:
    # Loads a CSV into a DataFrame and provides helpers for initial exploration
    def __init__(self, path):
        # Load the data and cache some basic metadata
        self.dframe = pd.read_csv(path)
        self.dtypes = self.dframe.dtypes
        self.shape = self.dframe.shape
        # Fraction of null values in each column
        self.nulls = self.dframe.isnull().mean()

    def explore(self):
        # Let's check out the shape of our DataFrame
        print(f'DataFrame has {self.dframe.shape[1]} columns and {self.dframe.shape[0]} rows.')

        # Look for null values and data types, column by column.
        # The counter is initialized outside the loop so it covers every column.
        null_cols = 0
        for i, (null_frac, dtype) in enumerate(zip(self.nulls, self.dtypes), start=1):
            if null_frac != 0:
                print(f'Column {i} is {null_frac:.1%} null')
                null_cols += 1
            print(f'Column {i} is {dtype}')
        if null_cols == 0:
            print('DataFrame has zero null values.')

    # Drop selected columns (in place by default; otherwise return a new DataFrame)
    def col_drop(self, drop_list, inplace=True):
        if inplace:
            self.dframe.drop(columns=drop_list, inplace=True)
            return None
        return self.dframe.drop(columns=drop_list)

    # Check out why an object column isn't numeric
    def num_check(self, column_name):
        print(self.dframe[self.dframe[column_name].str.isnumeric() == False])

    # Returns a column sorted in descending order by default
    def view_asc(self, column_name, asc=False):
        return self.dframe[column_name].sort_values(ascending=asc)

    # Set every value equal to `s` in the selected column to NaN (other values are cast to int)
    def col_nan(self, column_name, s):
        self.dframe[column_name] = self.dframe[column_name].apply(
            lambda x: np.nan if x == s else int(x)
        )

    # Return the shape the DataFrame would have if we dropped all rows containing NaN
    def s_if_drop(self):
        return self.dframe.dropna().shape

    # Fill every NaN value in the DataFrame with 0
    def fill_na(self):
        self.dframe.fillna(value=0, inplace=True)

    # Build a new DataFrame with just the columns we desire
    def df_builder(self, column_list):
        return self.dframe[column_list].copy()
    

In [138]:
df_o = data_explorer('401ksubs.csv')

In [139]:
df_o.view_asc('inc')

1107    199.041
531     192.990
8361    191.715
1455    180.858
1354    179.373
         ...   
2166     10.044
2220     10.035
3392     10.032
6600     10.008
4621     10.008
Name: inc, Length: 9275, dtype: float64

In [140]:
# All datatypes are as expected and there are zero null values in the DataFrame
df_o.explore()
df = df_o.dframe

DataFrame has 11 columns and 9275 rows.
Column 1 is int64
Column 2 is float64
Column 3 is int64
Column 4 is int64
Column 5 is int64
Column 6 is int64
Column 7 is float64
Column 8 is int64
Column 9 is int64
Column 10 is float64
Column 11 is int64
DataFrame has zero null values.


In [141]:
df.head()

| |e401k|inc|marr|male|age|fsize|nettfa|p401k|pira|incsq|agesq|
|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|13.17|0|0|40|1|4.575|0|1|173.4489|1600|
|1|1|61.23|0|1|35|1|154.0|1|0|3749.113|1225|
|2|0|12.858|1|0|44|2|0.0|0|0|165.3282|1936|
|3|0|98.88|1|1|44|2|21.8|0|0|9777.254|1936|
|4|0|22.614|0|0|53|1|18.45|0|0|511.393|2809|


Data Dictionary  
Contains data from 401ksubs.dta  
  obs:  9,275                            
 vars:  11  
 4 Sep 2001 13:50  

|variable name|type|variable label|
|:--|---|---|
|e401k|byte|=1 if eligible for 401(k)|
|inc|float|inc^2|
|marr|byte|=1 if married|
|male|byte|=1 if male respondent|
|age|byte|age^2|
|fsize|byte|family size|
|nettfa|float|net total fin. assets, $1000|
|p401k|byte|=1 if participate in 401(k)|
|pira|byte|=1 if have IRA|
|incsq|float|inc^2|
|agesq|int|age^2|


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

***Answer:*** Location and employment status would both be valuable to have.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

***Answer:*** This would be racist: it judges a potential customer by their race rather than by their financial record, and it assumes that people of one race are more likely to behave badly than people of another.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

***Answer:*** I don't think we should use the `male` column, for the same reason I wouldn't want to include race: it is inherently sexist.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

***Answer:*** Income squared (`incsq`) and age squared (`agesq`). Both income and age are important variables, and a squared term lets a linear model capture a curved relationship, for example an effect of age that rises and then levels off. But since we are trying to predict income, we shouldn't use `incsq` as a feature: it is computed directly from the target, so including it would cause data leakage.
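
The leakage concern can be checked on the five rows printed by `df.head()` earlier. The sketch below copies those `(inc, incsq)` pairs by hand and assumes, as the column name suggests, that `incsq` is simply `inc` squared; if so, a square root recovers the target almost exactly.

```python
import math

# (inc, incsq) pairs copied from the five rows of df.head() shown earlier
rows = [
    (13.17, 173.4489),
    (61.23, 3749.113),
    (12.858, 165.3282),
    (98.88, 9777.254),
    (22.614, 511.393),
]

# If incsq is inc squared, its square root gives the target back.
recovered = [math.sqrt(incsq) for _, incsq in rows]
max_error = max(abs(r - inc) for r, (inc, _) in zip(recovered, rows))

for (inc, incsq), r in zip(rows, recovered):
    print(f"inc={inc:8.3f}  sqrt(incsq)={r:8.3f}")
print(f"largest absolute difference: {max_error:.6f}")
```

Any model given `incsq` could score almost perfectly without learning anything useful about income, which is why it is left out of the regression features.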

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

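***Answer:*** The descriptions for `age` and `inc` are wrong: they are listed as `age^2` and `inc^2`. These columns are not squared, so they should not be described that way. Judging by the values, `age` is presumably age in years and `inc` is presumably annual income in thousands of dollars.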

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Friday morning of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

- Linear Regression
- Ridge
- LASSO
- KNN
- Decision Trees

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

In [166]:

class analyzer:
    def __init__(self, df, features, target):
        self.df = df
        self.X = self.df[features]
        self.y = self.df[target]
             
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=0.3, random_state=42)
        
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        
        # Scale our data.
        # Relabeling scaled data as "Z" is common.
        sc = StandardScaler()
        self.Z_train = sc.fit_transform(self.X_train)
        self.Z_test = sc.transform(self.X_test)
        
    def go_reg(self):
        # Instantiate models
        lr = LinearRegression()
        knn = KNeighborsRegressor()
        dt = DecisionTreeRegressor()
        bag = BaggingRegressor()
        rf = RandomForestRegressor()
        ada = AdaBoostRegressor()
        sv = SVR()
        
        lr.fit(self.Z_train, self.y_train)
        knn.fit(self.Z_train, self.y_train)
        dt.fit(self.Z_train, self.y_train)
        bag.fit(self.Z_train, self.y_train)
        rf.fit(self.Z_train, self.y_train)
        ada.fit(self.Z_train, self.y_train)
        sv.fit(self.Z_train, self.y_train)
        
        print('---------------------------------')
        print(f'LR Train Score {round(lr.score(self.Z_train, self.y_train), 2)}')
        print(f'LR Test Score {round(lr.score(self.Z_test, self.y_test), 2)}')
        print(' ')
        print(f'KNN Train Score {round(knn.score(self.Z_train, self.y_train), 2)}')
        print(f'KNN Test Score {round(knn.score(self.Z_test, self.y_test), 2)}')
        print(' ')
        print(f'DT Train Score {round(dt.score(self.Z_train, self.y_train), 2)}')
        print(f'DT Test Score {round(dt.score(self.Z_test, self.y_test), 2)}')
        print(' ')
        print(f'Bag Train Score {round(bag.score(self.Z_train, self.y_train), 2)}')
        print(f'Bag Test Score {round(bag.score(self.Z_test, self.y_test), 2)}')
        print(' ')
        print(f'RF Train Score {round(rf.score(self.Z_train, self.y_train), 2)}')
        print(f'RF Test Score {round(rf.score(self.Z_test, self.y_test), 2)}')
        print(' ')
        print(f'ADA Train Score {round(ada.score(self.Z_train, self.y_train), 2)}')
        print(f'ADA Test Score {round(ada.score(self.Z_test, self.y_test), 2)}')
        print(' ')
        print(f'SV Train Score {round(sv.score(self.Z_train, self.y_train), 2)}')
        print(f'SV Test Score {round(sv.score(self.Z_test, self.y_test), 2)}')
        
    def go_class(self):
        lg = LogisticRegression()
        knn = KNeighborsClassifier()
        dt = DecisionTreeClassifier()
        bag = BaggingClassifier()
        rf = RandomForestClassifier()
        ada = AdaBoostClassifier()
        sv = SVC()
        
        lg.fit(self.Z_train, self.y_train)
        knn.fit(self.Z_train, self.y_train)
        dt.fit(self.Z_train, self.y_train)
        bag.fit(self.Z_train, self.y_train)
        rf.fit(self.Z_train, self.y_train)
        ada.fit(self.Z_train, self.y_train)
        sv.fit(self.Z_train, self.y_train)        
        
        print(' ')
        print('-----------------------------')
        # Logistic Regression
        lg_pred_train = lg.predict(self.Z_train)
        lg_pred_test = lg.predict(self.Z_test)
        print(' ')
        print(f'LG F1 Train Score is {f1_score(self.y_train, lg_pred_train)}')
        print(f'LG F1 Test Score is {f1_score(self.y_test, lg_pred_test)}')
        
        # KNN
        knn_pred_train = knn.predict(self.Z_train)
        knn_pred_test = knn.predict(self.Z_test)
        print(' ')
        print(f'KNN F1 Train Score is {f1_score(self.y_train, knn_pred_train)}')
        print(f'KNN F1 Test Score is {f1_score(self.y_test, knn_pred_test)}')

        # Decision Tree
        dt_pred_train = dt.predict(self.Z_train)
        dt_pred_test = dt.predict(self.Z_test)
        print(' ')
        print(f'DT F1 Train Score is {f1_score(self.y_train, dt_pred_train)}')
        print(f'DT F1 Test Score is {f1_score(self.y_test, dt_pred_test)}')
        
        # Bagged
        bag_pred_train = bag.predict(self.Z_train)
        bag_pred_test = bag.predict(self.Z_test)
        print(' ')
        print(f'Bagged F1 Train Score is {f1_score(self.y_train, bag_pred_train)}')
        print(f'Bagged F1 Test Score is {f1_score(self.y_test, bag_pred_test)}')

        # Random Forest
        rf_pred_train = rf.predict(self.Z_train)
        rf_pred_test = rf.predict(self.Z_test)
        print(' ')
        print(f'RF F1 Train Score is {f1_score(self.y_train, rf_pred_train)}')
        print(f'RF F1 Test Score is {f1_score(self.y_test, rf_pred_test)}')

        # Adaboost
        ada_pred_train = ada.predict(self.Z_train)
        ada_pred_test = ada.predict(self.Z_test)
        print(' ')
        print(f'Adaboost F1 Train Score is {f1_score(self.y_train, ada_pred_train)}')
        print(f'Adaboost F1 Test Score is {f1_score(self.y_test, ada_pred_test)}')

        # Support Vector Classifier
        sv_pred_train = sv.predict(self.Z_train)
        sv_pred_test = sv.predict(self.Z_test)
        print(' ')
        print(f'SV F1 Train Score is {f1_score(self.y_train, sv_pred_train)}')
        print(f'SV F1 Test Score is {f1_score(self.y_test, sv_pred_test)}')
        print('--------------------------------')
        
        
        
        
        
    def RMSE_Calc(self):
        
        lr = LinearRegression()
        knn = KNeighborsRegressor()
        dt = DecisionTreeRegressor()
        bag = BaggingRegressor()
        rf = RandomForestRegressor()
        ada = AdaBoostRegressor()
        sv = SVR()
        
        lr.fit(self.Z_train, self.y_train)
        knn.fit(self.Z_train, self.y_train)
        dt.fit(self.Z_train, self.y_train)
        bag.fit(self.Z_train, self.y_train)
        rf.fit(self.Z_train, self.y_train)
        ada.fit(self.Z_train, self.y_train)
        sv.fit(self.Z_train, self.y_train)
        print(' ')
        print('-----------------------------')
        # Linear Regression
        lr_pred_train = lr.predict(self.Z_train)
        lr_pred_test = lr.predict(self.Z_test)
        print(' ')
        print(f'LR RMSE Train Score is {mean_squared_error(self.y_train, lr_pred_train)**(0.5)}')
        print(f'LR RMSE Test Score is {mean_squared_error(self.y_test, lr_pred_test)**(0.5)}')

        # KNN
        knn_pred_train = knn.predict(self.Z_train)
        knn_pred_test = knn.predict(self.Z_test)
        print(' ')
        print(f'KNN RMSE Train Score is {mean_squared_error(self.y_train, knn_pred_train)**(0.5)}')
        print(f'KNN RMSE Test Score is {mean_squared_error(self.y_test, knn_pred_test)**(0.5)}')

        # Decision Tree
        dt_pred_train = dt.predict(self.Z_train)
        dt_pred_test = dt.predict(self.Z_test)
        print(' ')
        print(f'DT RMSE Train Score is {mean_squared_error(self.y_train, dt_pred_train)**(0.5)}')
        print(f'DT RMSE Test Score is {mean_squared_error(self.y_test, dt_pred_test)**(0.5)}')
        
        # Bagged
        bag_pred_train = bag.predict(self.Z_train)
        bag_pred_test = bag.predict(self.Z_test)
        print(' ')
        print(f'Bagged RMSE Train Score is {mean_squared_error(self.y_train, bag_pred_train)**(0.5)}')
        print(f'Bagged RMSE Test Score is {mean_squared_error(self.y_test, bag_pred_test)**(0.5)}')

        # Random Forest
        rf_pred_train = rf.predict(self.Z_train)
        rf_pred_test = rf.predict(self.Z_test)
        print(' ')
        print(f'RF RMSE Train Score is {mean_squared_error(self.y_train, rf_pred_train)**(0.5)}')
        print(f'RF RMSE Test Score is {mean_squared_error(self.y_test, rf_pred_test)**(0.5)}')

        # Adaboost
        ada_pred_train = ada.predict(self.Z_train)
        ada_pred_test = ada.predict(self.Z_test)
        print(' ')
        print(f'Adaboost RMSE Train Score is {mean_squared_error(self.y_train, ada_pred_train)**(0.5)}')
        print(f'Adaboost RMSE Test Score is {mean_squared_error(self.y_test, ada_pred_test)**(0.5)}')

        # Support Vector Regressor (the "SVC" label printed below is a misnomer; `sv` is an SVR)
        sv_pred_train = sv.predict(self.Z_train)
        sv_pred_test = sv.predict(self.Z_test)
        print(' ')
        print(f'SVC RMSE Train Score is {mean_squared_error(self.y_train, sv_pred_train)**(0.5)}')
        print(f'SVC RMSE Test Score is {mean_squared_error(self.y_test, sv_pred_test)**(0.5)}')
        print('--------------------------------')
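
The class above repeats the same instantiate, fit, and print steps for every model, and `go_reg` and `RMSE_Calc` each fit their own unseeded copies, so the two sets of scores come from different fitted models. A more compact pattern is sketched below. It is not the code used to produce the outputs in this notebook: the data is synthetic (`make_regression` stands in for the 401(k) data), the model list is shortened, and the `random_state` values are assumptions added for repeatability.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the 401(k) dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# One dictionary replaces the repeated instantiate/fit/print blocks.
models = {
    "LR": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "DT": DecisionTreeRegressor(random_state=42),
    "RF": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is fit on the training data only
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    # R^2 and RMSE are computed from the same fitted model
    results[name] = {
        "train_r2": pipe.score(X_train, y_train),
        "test_r2": pipe.score(X_test, y_test),
        "train_rmse": mean_squared_error(y_train, pipe.predict(X_train)) ** 0.5,
        "test_rmse": mean_squared_error(y_test, pipe.predict(X_test)) ** 0.5,
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 2) for k, v in scores.items()})
```

Adding a model then means adding one dictionary entry, and every metric for a model is guaranteed to describe the same fit.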



In [159]:
features = ['male', 'age', 'fsize', 'nettfa', 'marr']

In [160]:
obj = analyzer(df, features, 'inc')

In [161]:
# Every model performs poorly on the test set. Linear regression and AdaBoost underfit (high bias),
# while the decision tree, bagged trees, and random forest overfit (high variance).
obj.go_reg()

---------------------------------
LR Train Score 0.28
LR Test Score 0.23
 
KNN Train Score 0.52
KNN Test Score 0.31
 
DT Train Score 0.99
DT Test Score -0.26
 
Bag Train Score 0.86
Bag Test Score 0.27
 
RF Train Score 0.9
RF Test Score 0.31
 
ADA Train Score 0.29
ADA Test Score 0.26
 
SV Train Score 0.31
SV Test Score 0.31


##### 9. What is bootstrapping?

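**Answer:** Bootstrapping is random resampling with replacement: we repeatedly draw new samples, usually the same size as the original data, from the observed data itself.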
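
A minimal sketch of resampling with replacement, using only the standard library. The income values are made up for illustration, and `bootstrap_sample` is a hypothetical helper name.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical incomes in $1000s (made-up values, not from the dataset)
incomes = [13.2, 61.2, 12.9, 98.9, 22.6, 45.0, 30.5, 75.3]

def bootstrap_sample(data):
    """Draw len(data) observations from data, with replacement."""
    return random.choices(data, k=len(data))

# Resample many times and record the mean of each bootstrap sample
boot_means = []
for _ in range(1000):
    sample = bootstrap_sample(incomes)
    boot_means.append(sum(sample) / len(sample))

one_sample = bootstrap_sample(incomes)
print("original:        ", incomes)
print("one resample:    ", one_sample)
print("mean of original:", round(sum(incomes) / len(incomes), 2))
print("range of bootstrap means:", round(min(boot_means), 2), "to", round(max(boot_means), 2))
```

Because sampling is done with replacement, a single resample typically repeats some observations and omits others, which is what makes each bootstrap sample different.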

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

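**Answer:** A single decision tree is fit once on the full training set and tends to overfit. Bagging reduces this problem by fitting many trees, each on a different bootstrap sample of the training data, and then averaging their predictions (or taking a majority vote for classification).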
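
The fit-on-bootstrap-samples-then-average idea can be sketched in a few lines. To keep the example short, a 1-nearest-neighbor predictor is swapped in for a deep decision tree (both memorize their training data); the data points and helper names are made up for illustration.

```python
import random

random.seed(0)

# Hypothetical 1-D training data: (age, income in $1000s), made up for illustration
train = [(25, 20.0), (30, 28.0), (35, 41.0), (40, 39.0), (45, 55.0), (50, 52.0), (55, 60.0)]

def fit_1nn(data):
    """Return a predictor that copies the y of the closest x (a stand-in for a deep tree)."""
    def predict(x):
        nearest = min(data, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def fit_bagged(data, n_models=25):
    """Fit one base model per bootstrap sample and average their predictions."""
    models = [fit_1nn(random.choices(data, k=len(data))) for _ in range(n_models)]
    def predict(x):
        return sum(m(x) for m in models) / len(models)
    return predict

single = fit_1nn(train)
bagged = fit_bagged(train)

print("single model at age 42:", single(42))
print("bagged model at age 42:", round(bagged(42), 2))
```

The single model returns exactly one training value, while the bagged prediction blends the values chosen by models that each saw a slightly different sample.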



##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

**Answer:** The two methods differ in only one way. At each split, a random forest considers only a random subset of the features, whereas bagged decision trees consider every feature at every split.
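
The difference can be illustrated with the regression feature list used earlier in this notebook. The sketch below only shows which features would be candidates at a split; it does not build a tree. The square-root rule for the subset size is one common choice and is an assumption here, and `candidate_features` is a hypothetical helper.

```python
import math
import random

random.seed(1)

# Feature names taken from the regression feature list used in this notebook
features = ['male', 'age', 'fsize', 'nettfa', 'marr']

def candidate_features(all_features, bagging=False):
    """Return the features a tree may consider at one split."""
    if bagging:
        # Bagged trees: every feature is available at every split
        return list(all_features)
    # Random forest: only a random subset is available at each split.
    # The square-root rule used here is one common choice (an assumption).
    k = max(1, int(math.sqrt(len(all_features))))
    return random.sample(all_features, k)

for split in range(3):
    print(f"split {split}: bagging -> {candidate_features(features, bagging=True)}")
    print(f"split {split}: forest  -> {candidate_features(features)}")
```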

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

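**Answer:** Bagged decision trees still tend to overfit, meaning they have low bias and high variance. Because every tree can split on the same dominant features, the trees are strongly correlated, so averaging them removes less variance. A random forest restricts each split to a random subset of features, which decorrelates the trees: the forest will have slightly higher bias than bagged decision trees, but its variance will be lower.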
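
A standard result makes the variance argument concrete. As an idealization, assume $B$ tree predictions $T_b(x)$ that each have variance $\sigma^2$ and share a pairwise correlation $\rho$. The variance of their average is then:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. Lowering $\rho$, which is what choosing random feature subsets aims to do, shrinks the first.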

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [162]:
obj.RMSE_Calc()

 
-----------------------------
 
LR RMSE Train Score is 20.25386912005579
LR RMSE Test Score is 21.638157117706292
 
KNN RMSE Train Score is 16.50221997028363
KNN RMSE Test Score is 20.362884240446483
 
DT RMSE Train Score is 2.066312589899166
DT RMSE Test Score is 27.52552798980154
 
Bagged RMSE Train Score is 8.859451747126272
Bagged RMSE Test Score is 21.100026829043387
 
RF RMSE Train Score is 7.633770631503993
RF RMSE Test Score is 20.39824502403272
 
Adaboost RMSE Train Score is 21.956798373463457
Adaboost RMSE Test Score is 23.112219825463253
 
SVC RMSE Train Score is 19.76896191032628
SVC RMSE Test Score is 20.385095627961324
--------------------------------
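
For reference, RMSE is the square root of the average squared residual, so it is expressed in the same units as the target. A minimal standard-library sketch with made-up values (the `rmse` helper is hypothetical and mirrors `mean_squared_error(...) ** 0.5` used above):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical incomes and predictions in $1000s (made-up values)
y_true = [20.0, 35.0, 50.0, 80.0]
y_pred = [25.0, 30.0, 55.0, 70.0]

# Squared errors are 25, 25, 25, 100, so the mean is 43.75
print(rmse(y_true, y_pred))
```

If `inc` is measured in thousands of dollars, as its values suggest, a test RMSE near 20 means the predictions are typically off by roughly $20,000.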


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

The following are overfit:    
- Decision Tree
- Bagging
- Random Forest

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

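**Answer:** I would use the support vector regressor (labeled "SVC" in the output above, although the model is an `SVR`). Its test RMSE (about 20.4) is among the lowest, and it is close to its training RMSE (about 19.8), so there is little sign of overfitting. Adaboost or a regular old LinearRegression would also work, since neither overfits, but both have a higher test RMSE.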

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** As always, more data would be nice to have. I would also grid search over the hyperparameters of each of these models to find the best settings.
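
A minimal sketch of what such a grid search could look like, using the `Pipeline` and `GridSearchCV` classes imported at the top of this notebook. The data is synthetic, and the choice of KNN and the grid values are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the 401(k) dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ("sc", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Step name + double underscore + parameter name addresses a step's hyperparameter
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

# scikit-learn maximizes scores, so RMSE is exposed as a negated value
gs = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
gs.fit(X_train, y_train)

print("best params:", gs.best_params_)
print("cross-validated RMSE:", round(-gs.best_score_, 2))
print("test RMSE:", round(-gs.score(X_test, y_test), 2))
```

Putting the scaler inside the pipeline means it is re-fit on each cross-validation training fold, so the validation folds do not influence the scaling.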

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

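**Answer:** We are trying to predict whether someone is eligible for a 401(k). If they participate in a 401(k), then they are obviously eligible, so `p401k` gives away the answer for those rows instead of helping the model learn anything useful.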

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

All of the below are appropriate:  
- logistic regression 
- k-nearest neighbors
- decision tree
- bagged decision trees
- random forest
- Adaboost model
- support vector classifier (SVC)

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [163]:
features = ['marr', 'male', 'age', 'fsize', 'nettfa', 'agesq', 'inc', 'incsq', 'pira']
obj = analyzer(df, features, 'e401k')

In [164]:
obj.go_class()

 
-----------------------------
 
LG F1 Train Score is 0.48493277700509974
LG F1 Test Score is 0.471169686985173
 
KNN F1 Train Score is 0.6611779607346423
KNN F1 Test Score is 0.48469643753135977
 
DT F1 Train Score is 1.0
DT F1 Test Score is 0.4688950789229341
 
Bagged F1 Train Score is 0.9693104822638502
Bagged F1 Test Score is 0.49376299376299376
 
RF F1 Train Score is 1.0
RF F1 Test Score is 0.5219586067642605
 
Adaboost F1 Train Score is 0.5742279020234291
Adaboost F1 Test Score is 0.5538150581101566
 
SV F1 Train Score is 0.4769853313100658
SV F1 Test Score is 0.45542168674698796
--------------------------------


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

**Answer:**  
False positive: Predicting someone is eligible but they are not   
False negative: Predicting someone is not eligible but they are
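
These definitions can be counted directly from paired labels. The sketch below uses made-up labels (1 = eligible, 0 = not eligible) and also computes sensitivity, the share of truly eligible people the model catches.

```python
# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible (made-up values)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted eligible, is eligible
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted eligible, is not
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted not eligible, but is
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted not eligible, is not

# Sensitivity (recall): fraction of truly eligible people predicted eligible
sensitivity = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity={sensitivity}")
```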

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

**Answer:** Definitely minimize false negatives. You want as much business as possible, so people showing up and turning out not to be eligible is better than eligible people not showing up at all.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

**Answer:** Sensitivity

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

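**Answer:** F1 score is the harmonic mean of precision and sensitivity (recall), not of sensitivity and specificity. False positives lower precision and false negatives lower recall, so F1 is high only when both kinds of error are kept in check.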
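
By its standard definition, F1 is the harmonic mean of precision and recall (sensitivity). A small worked sketch with made-up counts shows how it is computed and how it differs from a plain average; `f1_from_counts` is a hypothetical helper, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, computed from confusion counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)  # of those predicted eligible, how many are
    recall = tp / (tp + fn)     # of those truly eligible, how many were found
    return 2 * precision * recall / (precision + recall)

# Made-up counts: precision = 30/40 = 0.75, recall = 30/60 = 0.5
tp, fp, fn = 30, 10, 30
f1 = f1_from_counts(tp, fp, fn)
arithmetic_mean = (0.75 + 0.5) / 2

print("F1:", f1)
print("arithmetic mean of precision and recall:", arithmetic_mean)
```

The harmonic mean sits below the arithmetic mean whenever precision and recall differ, so a model cannot earn a high F1 by excelling at one and neglecting the other. Note that true negatives do not enter the formula at all.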

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [165]:
# I already used this metric above, but just for kicks here it is again:
obj.go_class()

 
-----------------------------
 
LG F1 Train Score is 0.48493277700509974
LG F1 Test Score is 0.471169686985173
 
KNN F1 Train Score is 0.6611779607346423
KNN F1 Test Score is 0.48469643753135977
 
DT F1 Train Score is 1.0
DT F1 Test Score is 0.47708138447146864
 
Bagged F1 Train Score is 0.9704472843450479
Bagged F1 Test Score is 0.4726704841228527
 
RF F1 Train Score is 1.0
RF F1 Test Score is 0.5276381909547738
 
Adaboost F1 Train Score is 0.5742279020234291
Adaboost F1 Test Score is 0.5538150581101566
 
SV F1 Train Score is 0.4769853313100658
SV F1 Test Score is 0.45542168674698796
--------------------------------


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

**Answer:** 
The following are overfit:
- decision tree
- bagged
- random forest

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

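**Answer:** Adaboost has the best test F1 score (about 0.55), and its training score (about 0.57) is close to it, so it shows little overfitting. I would use that.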

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** I would grid search over the Adaboost hyperparameters to find the best model. More data and additional features, such as employment status, would also help.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.