## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and, ideally, grows substantially by the time you retire.
- These accounts differ from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge, ElasticNet 
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import (
    BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier,
    AdaBoostRegressor, AdaBoostClassifier, ExtraTreesRegressor, ExtraTreesClassifier,
    GradientBoostingRegressor, GradientBoostingClassifier,
)
from sklearn.svm import SVC, SVR
from sklearn.metrics import mean_squared_error, f1_score, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [2]:
df = pd.read_csv('./401ksubs.csv')

In [3]:
# Check shape
df.shape

(9275, 11)

In [4]:
# Check first 5 rows
df.head()

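| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |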


In [5]:
# Check last 5 rows
df.tail()

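| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 9270 | 0 | 58.428 | 1 | 0 | 33 | 4 | -1.2 | 0 | 0 | 3413.831 | 1089 |
| 9271 | 0 | 24.546 | 0 | 1 | 37 | 3 | 2.0 | 0 | 0 | 602.5061 | 1369 |
| 9272 | 0 | 38.55 | 1 | 0 | 33 | 3 | -13.6 | 0 | 1 | 1486.102 | 1089 |
| 9273 | 0 | 34.41 | 1 | 0 | 57 | 3 | 3.55 | 0 | 0 | 1184.048 | 3249 |
| 9274 | 0 | 25.608 | 0 | 1 | 49 | 1 | 1.8 | 0 | 0 | 655.7697 | 2401 |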


In [6]:
# Check for missing values
df.isna().sum().sum()

0

In [7]:
# Index 8304 is a duplicate of 8172
df[df.duplicated(keep=False)]

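| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 8172 | 0 | 13.44 | 0 | 0 | 42 | 4 | 0.0 | 0 | 0 | 180.6336 | 1764 |
| 8304 | 0 | 13.44 | 0 | 0 | 42 | 4 | 0.0 | 0 | 0 | 180.6336 | 1764 |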


In [8]:
# Drop duplicate row
df.drop(8304, axis=0, inplace=True)
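
Dropping by a hard-coded index label works here, but it relies on the row labels never changing. As a minimal sketch of an alternative, `DataFrame.drop_duplicates` removes repeated rows without naming them. The toy frame below is made up for illustration; only the duplicated `inc`, `age`, and `fsize` values are copied from the output above.

```python
import pandas as pd

# Toy frame (hypothetical): the last two rows are identical,
# mimicking index 8172 and its duplicate 8304.
toy = pd.DataFrame(
    {"inc": [13.17, 13.44, 13.44], "age": [40, 42, 42], "fsize": [1, 4, 4]},
    index=[0, 8172, 8304],
)

# keep="first" retains the first occurrence and drops later copies.
deduped = toy.drop_duplicates(keep="first")
print(deduped.index.tolist())
```

On the real dataframe the equivalent call would be `df.drop_duplicates(keep='first')`, which should leave the same rows as dropping index 8304 by hand.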

In [9]:
# Check data types
df.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

In [10]:
# Check fsize value counts, will need to OHE for modeling
df['fsize'].value_counts()

fsize
2     2199
1     2017
4     1989
3     1829
5      816
6      268
7       95
8       38
10       7
9        7
12       4
11       3
13       2
Name: count, dtype: int64

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

1. Whether or not the person has a college degree
2. Whether or not the person has children
3. Variables that indicate financial stability:
   - a) how much money is in one's savings account
   - b) how much money is in one's checking account (American terminology)
   - c) one's credit score
   - d) whether one owns or rents a home (and, similarly, a car), which may indicate that someone is able and willing to save money
   - e) whether the person has invested in stocks, bonds, or certificates of deposit (CDs), which would also help predict eligibility for a 401(k)

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Putting race into a model is likely an unethical thing to do. In this particular scenario, using race as a feature means using race to decide who gets targeted for a particular product or service. It is unacceptable to discriminate on the basis of race, even if race by itself wouldn't immediately disqualify somebody from being targeted.

### Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

We wouldn't use `inc` to predict income, because there is nothing to predict if we already possess the value we want. Similarly, `incsq` is just `inc` squared, so having access to it would reveal `inc` directly. Per the note in Step 1, we also set aside `e401k`, `p401k`, and `pira`.
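
To make the leakage concrete, here is a small check that uses only the five `inc` and `incsq` values shown in the `df.head()` output above. It is a sketch, not part of the original lab.

```python
import numpy as np

# First five (inc, incsq) pairs copied from the df.head() output.
inc = np.array([13.17, 61.23, 12.858, 98.88, 22.614])
incsq = np.array([173.4489, 3749.113, 165.3282, 9777.254, 511.393])

# Taking the square root of incsq recovers inc up to rounding.
recovered = np.sqrt(incsq)
max_abs_error = np.max(np.abs(recovered - inc))
print(max_abs_error)
```

Because the error is only rounding noise, any model given `incsq` could reproduce `inc` almost perfectly without learning anything useful.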

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` and `agesq` appear to have already been feature engineered. Assuming that this was done by subject-matter experts, it's likely that they found some quadratic (squared) relationship between income and eligibility for a 401(k). (The same goes for age and eligibility for a 401(k).) Perhaps the likelihood of being eligible changes nonlinearly with age or income, for example rising faster at higher incomes, so these terms account for that nonlinearity.
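
As an illustration of why a squared term can help, the sketch below fits a linear regression with and without an age-squared column. The data are synthetic and made up for this example (an outcome that rises and then falls with age); they are not drawn from the 401(k) dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic (made-up) data: the outcome peaks around age 45.
age = rng.uniform(25, 65, size=500)
y = -0.05 * (age - 45) ** 2 + rng.normal(0, 1, size=500)

X_linear = age.reshape(-1, 1)              # age only
X_quad = np.column_stack([age, age ** 2])  # age and age squared

r2_linear = LinearRegression().fit(X_linear, y).score(X_linear, y)
r2_quad = LinearRegression().fit(X_quad, y).score(X_quad, y)
print(round(r2_linear, 3), round(r2_quad, 3))
```

A straight line cannot follow the curve, while the model with the squared column can, which is one plausible reason for shipping `agesq` and `incsq` with the data.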

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

There appear to be two errors. I'm not sure if this is deliberate. 'inc' is defined as inc^2 and age is defined as age^2. While these are correct for incsq and agesq, inc should refer to one's income (not sure if it's household or individual, but probably individual) and age should refer to one's age. The data dictionary also does not mention that the income is in 1000s, although this is mentioned for net total financial assets (nettfa).

### Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

- Linear Regression: Appropriate for predicting income, and its coefficients are fairly easy to interpret: a one-unit increase in a feature changes the predicted income by that feature's coefficient, holding the other features constant.
- Ridge Regression: Similar to non-regularized linear regression, but with a penalty proportional to the sum of the squared coefficients, which shrinks them toward zero.
- Lasso Regression: Similar to non-regularized linear regression, but with a penalty proportional to the sum of the absolute values of the coefficients, which can shrink some of them to exactly zero.
- ElasticNet Regression: Combines the Lasso and Ridge penalties.
- k-nearest neighbors model: Can predict income, but provides no coefficients or feature importances, so it does not tell us which features matter.
- A decision tree: A single tree is interpretable. Humans can visualize and understand a tree, whether they're machine learning experts or laypeople.
- A set of bagged decision trees: A bootstrap aggregate of many decision trees. Harder to interpret than a single decision tree.
- A random forest: A bootstrap aggregate of many decision trees in which each split considers only a random subset of the features. Harder to interpret than a single decision tree.
- A set of randomized trees: Similar to a random forest, but instead of searching for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. Harder to interpret than a single decision tree.
- An AdaBoost model: A meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, with the weights of instances adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases. The base estimator can be chosen, but if it is a decision tree regressor, the ensemble is not easy to interpret.
- A gradient boosted model: Gradient boosting is a generic algorithm for finding approximate solutions to the additive modeling problem, while AdaBoost can be seen as a special case with a particular loss function. Harder to interpret than a single decision tree.
- A support vector regressor: The fitted `.support_vectors_` can be inspected, though they are hard to interpret. With a linear kernel, you can inspect `.coef_`.
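
The practical difference between the Ridge and Lasso penalties can be seen on a small example. The sketch below uses synthetic, made-up data in which only the first of five features matters; the `alpha` value of 0.5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic (made-up) data: only the first of five features affects y.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Lasso (L1) can set irrelevant coefficients to exactly zero;
# Ridge (L2) only shrinks them toward zero.
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```

For a question like "what features best predict income," the zeroed-out Lasso coefficients give a rough form of feature selection that Ridge does not.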

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
- a multiple linear regression model
- a k-nearest neighbors model
- a decision tree
- a set of bagged decision trees
- a random forest
- an Adaboost model
- a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [11]:
# Define custom scorer for regression models that prints mean cross validated RMSE score and train/test RMSE/R2 score
def custom_scorer(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    print(f'For the model {model}:', '\n')
    print(f'The best parameters as chosen by GridSearchCV is {model.best_params_}')
    print(f'The mean cross-validated RMSE score of the best_estimator is {-model.best_score_}')
    print(f'The train R2 score is {r2_score(y_train, model.predict(X_train))}')
    print(f'The train RMSE score is {-model.score(X_train, y_train)}')
    print(f'The test R2 score is {r2_score(y_test, model.predict(X_test))}')
    print(f'The test RMSE score is {-model.score(X_test, y_test)}')
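
The minus signs in `custom_scorer` exist because `neg_root_mean_squared_error` reports the RMSE multiplied by -1, so that a larger score is always better. The sketch below checks this on synthetic, made-up data with a small `Ridge` grid; the model and grid are placeholders, not the ones used in this lab.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic (made-up) regression data.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=100)

grid = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)

# grid.score uses the same scorer, so it returns the NEGATIVE RMSE.
neg_rmse = grid.score(X, y)
rmse = np.sqrt(mean_squared_error(y, grid.predict(X)))
print(neg_rmse, rmse)
```

Negating `grid.score(...)` and `grid.best_score_` therefore gives RMSE values on the usual positive scale.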

In [12]:
# Drop columns to create feature matrix
X = df.drop(columns = ['e401k', 'p401k', 'pira', 'inc', 'incsq'])
# Create target matrix
y = df['inc']
# Train/test (80/20) split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = .2,
                                                    random_state = 123)

In [13]:
# Create 2 preprocessors, one for regression models that require feature scaling and
# another one for regression models that do not require feature scaling

# Scale numeric columns and OHE categorical columns
ct_ss = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ['age', 'nettfa', 'agesq']),
        ("ohe", OneHotEncoder(drop='first'), ['marr', 'male', 'fsize'])
    ],
    n_jobs = -1
)

# OHE categorical columns
ct_no_ss = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(drop='first'), ['marr', 'male', 'fsize'])
    ],
    n_jobs = -1,
    remainder='passthrough'
)

In [14]:
# For linear regression model
lr = make_pipeline(ct_no_ss, LinearRegression())
print(f'The mean cross-validated RMSE score is {-np.mean(cross_val_score(lr, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1))}')

lr.fit(X_train, y_train)
lr_y_train_pred = lr.predict(X_train)
lr_y_test_pred = lr.predict(X_test)

print(f'The train R2 score is {lr.score(X_train, y_train)}')
# Take the square root of the MSE to get the RMSE; this avoids the `squared` argument,
# which is deprecated in recent scikit-learn releases
print(f'The train RMSE score is {np.sqrt(mean_squared_error(y_train, lr_y_train_pred))}')
print(f'The test R2 score is {lr.score(X_test, y_test)}')
print(f'The test RMSE score is {np.sqrt(mean_squared_error(y_test, lr_y_test_pred))}')

The mean cross-validated RMSE score is 20.314919217166963
The train R2 score is 0.2969602497608077
The train RMSE score is 20.237568545903724
The test R2 score is 0.2586329552548243
The test RMSE score is 20.575900557907673


In [15]:
# For LASSO linear regression model
lasso_alphas = np.logspace(-3, -2, 100)
lasso_r = make_pipeline(ct_ss, Lasso(random_state=42))
param_grid_lasso_r = {"lasso__alpha": lasso_alphas,
                     }
grid_lasso_r = GridSearchCV(lasso_r, param_grid_lasso_r, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_lasso_r, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['age',
                                                                          'nettfa',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                    

In [16]:
# For Ridge linear regression model
ridge_alphas = np.logspace(-1, 0, 100)
ridge_r = make_pipeline(ct_ss, Ridge(random_state=42))
param_grid_ridge_r = {"ridge__alpha": ridge_alphas,
                     }
grid_ridge_r = GridSearchCV(ridge_r, param_grid_ridge_r, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_ridge_r, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['age',
                                                                          'nettfa',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                    

In [17]:
# For ElasticNet linear regression model
enet_alphas = np.logspace(-3, -2, 100)
enet_ratio = np.linspace(0.5, 0.99, 5)
enet = make_pipeline(ct_ss, ElasticNet(random_state=42))
param_grid_enet = {"elasticnet__alpha": enet_alphas,
                   "elasticnet__l1_ratio": enet_ratio,
                  }
grid_enet = GridSearchCV(enet, param_grid_enet, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_enet, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['age',
                                                                          'nettfa',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                    

In [18]:
# For KNN regression model
knnr = make_pipeline(ct_ss, KNeighborsRegressor(n_jobs=-1))
param_grid_knnr = {"kneighborsregressor__n_neighbors": [2,3,4,5,6,7,8,9,10],
                   "kneighborsregressor__p": [1,2],
                  }
grid_knnr = GridSearchCV(knnr, param_grid_knnr, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_knnr, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['age',
                                                                          'nettfa',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                    

In [19]:
# For Decision Tree regression model
dtr = make_pipeline(ct_no_ss, DecisionTreeRegressor())
param_grid_dtr = {"decisiontreeregressor__criterion": ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                  "decisiontreeregressor__max_depth": [2,3,4,5,6],
                 }
grid_dtr = GridSearchCV(dtr, param_grid_dtr, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_dtr, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize'])])),
                                       ('decisiontreeregressor',
                                        DecisionTreeRegressor())]),
             param_grid={'decisiontreeregressor__criterion': ['squared_error',
                                                              'friedman_mse',
                              

In [20]:
# For Bagged Decision Tree regression model
br = make_pipeline(ct_no_ss, BaggingRegressor(n_jobs=-1, random_state=42))
param_grid_br = {"baggingregressor__n_estimators": [10, 15, 20, 25],
                }
grid_br = GridSearchCV(br, param_grid_br, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_br, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize'])])),
                                       ('baggingregressor',
                                        BaggingRegressor(n_jobs=-1,
                                                         random_state=42))]),
             param_grid={'baggingregressor__n_estimators': [10, 15, 20, 25]},
             scoring='neg_root_mean_

In [21]:
# For Extra Trees regression model
etr = make_pipeline(ct_no_ss, ExtraTreesRegressor(n_jobs=-1, random_state=42))
param_grid_etr = {"extratreesregressor__n_estimators": [75, 100, 125, 150],
                  "extratreesregressor__criterion": ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                  "extratreesregressor__max_depth": [2,3,4],
                 }
grid_etr = GridSearchCV(etr, param_grid_etr, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_etr, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize'])])),
                                       ('extratreesregressor',
                                        ExtraTreesRegressor(n_jobs=-1,
                                                            random_state=42))]),
             param_grid={'extratreesregressor__criterion': ['squared_error',
                            

In [22]:
# For Random Forest regression model
rfr = make_pipeline(ct_no_ss, RandomForestRegressor(n_jobs=-1, random_state=42))
param_grid_rfr = {"randomforestregressor__criterion": ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                  "randomforestregressor__max_depth": [2,3,4],
                  "randomforestregressor__n_estimators": [75, 100, 125, 150],
                 }
grid_rfr = GridSearchCV(rfr, param_grid_rfr, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_rfr, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize'])])),
                                       ('randomforestregressor',
                                        RandomForestRegressor(n_jobs=-1,
                                                              random_state=42))]),
             param_grid={'randomforestregressor__criterion': ['squared_error',
                    

In [23]:
# For AdaBoost regression model
adar = make_pipeline(ct_no_ss, AdaBoostRegressor())
param_grid_adar = {"adaboostregressor__n_estimators": [75, 100, 125, 150],
                   "adaboostregressor__loss": ['linear', 'square', 'exponential'],
                  }
grid_adar = GridSearchCV(adar, param_grid_adar, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_adar, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize'])])),
                                       ('adaboostregressor',
                                        AdaBoostRegressor())]),
             param_grid={'adaboostregressor__loss': ['linear', 'square',
                                                     'exponential'],
                         'adaboostregressor__n_estima

In [24]:
# For Gradient Boosted regression model
gbr = make_pipeline(ct_no_ss, GradientBoostingRegressor(random_state=42))
param_grid_gbr = {"gradientboostingregressor__loss": ['squared_error', 'absolute_error', 'huber', 'quantile'],
                  "gradientboostingregressor__n_estimators": [75, 100, 125, 150],
                  "gradientboostingregressor__criterion": ['friedman_mse', 'squared_error'],
                  "gradientboostingregressor__max_depth": [2, 3],
                 }
grid_gbr = GridSearchCV(gbr, param_grid_gbr, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_gbr, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize'])])),
                                       ('gradientboostingregressor',
                                        GradientBoostingRegressor(random_state=42))]),
             param_grid={'gradientboostingregressor__criterion': ['friedman_mse',
                                                                  'squared_error']

In [25]:
# For Support Vector regression model
svr = make_pipeline(ct_ss, SVR())
param_grid_svr = {"svr__kernel": ['linear', 'poly', 'rbf', 'sigmoid'],
                  "svr__degree": [2,3,4,],
                  "svr__gamma": ['scale', 'auto'],
                 }
grid_svr = GridSearchCV(svr, param_grid_svr, scoring='neg_root_mean_squared_error', cv=5)
custom_scorer(grid_svr, X_train, y_train, X_test, y_test)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['age',
                                                                          'nettfa',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                    

##### 9. What is bootstrapping?

Bootstrapping is a method of sampling with replacement. We usually use it in order to simulate many different samples or to empirically estimate the sampling distribution of a statistic.
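
As a minimal sketch of the idea, the code below bootstraps the mean of a synthetic, made-up sample and compares the spread of the bootstrap means with the textbook standard error. The sample size and number of resamples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic (made-up) sample standing in for one observed dataset.
sample = rng.normal(loc=40, scale=20, size=200)

# Resample WITH replacement, same size as the original, many times.
n_boot = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# The spread of the bootstrap means approximates the standard error of the mean.
boot_se = boot_means.std(ddof=1)
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)
print(boot_se, analytic_se)
```

The two numbers should be close, which is what "empirically estimating the sampling distribution" means in practice.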

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

With a set of bagged decision trees, we bootstrap (repeatedly take a random sample of the rows, with replacement) several samples from the dataset, grow one decision tree on each bootstrapped sample, and then aggregate the trees' predictions. A set of bagged decision trees is an ensemble method: averaging many high-variance trees reduces the variance of the model. With one decision tree, we use only the original sample, grow exactly one tree, and no aggregation of predictions occurs.
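
The contrast can be sketched by hand, without `BaggingRegressor`. The example below uses synthetic, made-up data (a noisy sine curve) and 50 bootstrap samples, both arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic (made-up) noisy data.
X_train = rng.uniform(0, 6, size=(300, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.5, size=300)
X_test = rng.uniform(0, 6, size=(300, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(0, 0.5, size=300)

# One fully grown tree on the original sample.
single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
single_mse = np.mean((y_test - single.predict(X_test)) ** 2)

# Bagging: one tree per bootstrap sample, then average the predictions.
preds = []
for i in range(50):
    rows = rng.integers(0, len(X_train), size=len(X_train))  # with replacement
    tree = DecisionTreeRegressor(random_state=i).fit(X_train[rows], y_train[rows])
    preds.append(tree.predict(X_test))
bagged_mse = np.mean((y_test - np.mean(preds, axis=0)) ** 2)

print(single_mse, bagged_mse)
```

Averaging smooths out the noise that each individual tree memorizes, so the bagged test error comes out lower than the single tree's on this data.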

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

The difference between a set of bagged decision trees and a random forest is that in a random forest, only a random subset of the features is considered at each split when building each tree (the best split is chosen from among `max_features` randomly selected features). In bagged decision trees, every feature is considered as a "candidate" for splitting at each node in the individual decision trees.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

A random forest randomly selects which features are considered at each split. This makes the individual decision trees less correlated with one another than the trees in a bagged decision trees model, where every feature is considered at every split. Aggregating these more diverse trees decreases the variance of our predictions, although sometimes at the cost of a slight increase in bias. Thus, a random forest usually has less variance (less overfitting) than a set of bagged decision trees.
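
The decorrelation effect can be measured directly. The sketch below uses synthetic, made-up data with one dominant feature and compares the average pairwise correlation between individual trees' predictions when every feature is considered at each split (`max_features=1.0`, which behaves like plain bagging) and when only two are. The helper `mean_tree_correlation` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic (made-up) data with one dominant feature among ten.
X = rng.normal(size=(400, 10))
y = 5 * X[:, 0] + X[:, 1:].sum(axis=1) + rng.normal(0, 1, size=400)
X_train, X_test, y_train = X[:300], X[300:], y[:300]

def mean_tree_correlation(max_features):
    """Average pairwise correlation between individual trees' test predictions."""
    forest = RandomForestRegressor(
        n_estimators=50, max_features=max_features, random_state=0
    ).fit(X_train, y_train)
    tree_preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(tree_preds)
    upper = corr[np.triu_indices_from(corr, k=1)]
    return upper.mean()

corr_all = mean_tree_correlation(1.0)  # every feature considered at each split
corr_few = mean_tree_correlation(2)    # two random features per split
print(corr_all, corr_few)
```

Lower correlation between trees is what lets averaging remove more variance, at the price of each individual tree being somewhat weaker.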

### Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

|               Model              | Training RMSE | Cross-validated RMSE | Testing RMSE |
|:--------------------------------:|:-------------:|:--------------------:|:------------:|
|         Linear Regression        |    20.2815    |        20.3562       |    20.4585   |
|      Lasso Linear Regression     |    20.2639    |        20.3458       |    20.4515   |
|      Ridge Linear Regression     |    20.2614    |        20.3458       |    20.4483   |
|   ElasticNet Linear Regression   |    20.2628    |        20.3456       |    20.4515   |
|  k-Nearest Neighbors Regression  |    17.7471    |        19.6855       |    19.5486   |
|     Decision Tree Regression     |    18.4067    |        18.8583       |    18.6100   |
| Bagged Decision Trees Regression |     8.1597    |        20.5272       |    20.2592   |
|      Extra Trees Regression      |    20.7467    |        20.7777       |    20.4931   |
|     Random Forest Regression     |    18.5145    |        18.8169       |    18.6040   |
|        AdaBoost Regression       |    22.7004    |        21.8013       |    23.1081   |
|    Gradient Boosted Regression   |    17.9361    |        18.6004       |    18.2975   |
|     Support Vector Regression    |    19.8861    |        20.0725       |    19.6709   |

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

Comparing the training RMSE with the testing RMSE, the models showing evidence of overfitting are, from the largest gap to the smallest:

1. Bagged Decision Trees Regression: Difference in RMSE of about 12.1
2. k-Nearest Neighbors Regression: Difference in RMSE of about 1.8
3. AdaBoost Regression: Difference in RMSE of about 0.4
4. The rest have differences in RMSE of less than 0.4 (Gradient Boosted Regression is the largest of these, at about 0.36)
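
The ranking can be reproduced directly from the table in question 13. The sketch below copies the training and testing RMSE values from that table and sorts the models by the gap between them; the variable names are only illustrative.

```python
# (training RMSE, testing RMSE) copied from the table in question 13
rmse_scores = {
    "Linear Regression": (20.2815, 20.4585),
    "Lasso Linear Regression": (20.2639, 20.4515),
    "Ridge Linear Regression": (20.2614, 20.4483),
    "ElasticNet Linear Regression": (20.2628, 20.4515),
    "k-Nearest Neighbors Regression": (17.7471, 19.5486),
    "Decision Tree Regression": (18.4067, 18.6100),
    "Bagged Decision Trees Regression": (8.1597, 20.2592),
    "Extra Trees Regression": (20.7467, 20.4931),
    "Random Forest Regression": (18.5145, 18.6040),
    "AdaBoost Regression": (22.7004, 23.1081),
    "Gradient Boosted Regression": (17.9361, 18.2975),
    "Support Vector Regression": (19.8861, 19.6709),
}

# A positive gap means the model does worse on unseen data than on training data.
gaps = {name: round(test - train, 4) for name, (train, test) in rmse_scores.items()}
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for name, gap in ranked:
    print(f"{name:<34}{gap:>9.4f}")
```

A negative gap (testing RMSE below training RMSE) shows no sign of overfitting on this particular split.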


##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

When I have a series of models from which I can pick, I usually do the following:

1. List out all of the models. Any model that cannot solve my problem should be removed. My problem statement is "What features best predict one's income?", so my goal isn't just to come up with the best predictions. I want the model that performs best on a metric of my choice (in this case, RMSE) and that also tells me which features drive its predictions, so that I can answer the problem statement.
2. In this case, the model with the best test RMSE is the **Gradient Boosted Regression**, although its train/test RMSE gap shows that it is slightly more overfit than the Random Forest Regression. Regarding interpretability of my selected model:
    - Individual decision trees intrinsically perform feature selection by choosing split points. This information can be used to measure the importance of each feature. The basic idea is that the more often a feature is used in the split points of a tree, and the more those splits reduce impurity, the more important that feature is. This notion of importance can be extended to decision tree ensembles by averaging the impurity-based feature importance of each tree. A tree ensemble has no regression coefficients, so I extract these feature importances for my selected model in the next code cell. The output is displayed at the very end of the notebook when answering the final question.
3. Sometimes, model selection becomes a judgment call with no perfect guide to make the final decision.
    - Do you have time to tune the models to try and eke out better performance?
    - Is one model substantially better at solving the problem you wanted to solve?
    - Do you need something understandable by a lay audience? (e.g., linear regression is more common and more easily understood than AdaBoost or Support Vector Machines)
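
The idea of averaging per-tree importances across an ensemble can be sketched as follows. The three importance dictionaries are hypothetical, and this shows only the basic idea; library implementations may weight or normalize differently.

```python
from statistics import mean

# Hypothetical impurity-based importances from three trees in an ensemble
# (each tree's importances sum to 1).
tree_importances = [
    {"nettfa": 0.70, "marr": 0.20, "age": 0.10},
    {"nettfa": 0.60, "marr": 0.30, "age": 0.10},
    {"nettfa": 0.80, "marr": 0.10, "age": 0.10},
]

def ensemble_importance(per_tree):
    """Average each feature's importance over all trees in the ensemble."""
    features = per_tree[0].keys()
    return {f: mean(tree[f] for tree in per_tree) for f in features}

averaged = ensemble_importance(tree_importances)
most_important = max(averaged, key=averaged.get)
```

The averaged values still sum to 1, so they describe relative importance only. They say nothing about the direction or size of a feature's effect on the prediction.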

In [26]:
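# Get impurity-based feature importances for our chosen regression model
# (tree ensembles expose feature_importances_ rather than coefficients)
grid_gbr.fit(X_train, y_train)
# Refit the pipeline with the best hyperparameters found by the grid search
gbr.set_params(**grid_gbr.best_params_).fit(X_train, y_train)
feature_values = gbr['gradientboostingregressor'].feature_importances_
feature_names = gbr.named_steps.columntransformer.get_feature_names_out()
# One row with one column per feature (variable name kept for the later cell that prints it)
regression_coefficients = pd.DataFrame(feature_values, index=feature_names).T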

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. I would look at higher-order terms. For example, since we have `incsq` and `agesq` automatically created for us, there may be other higher-order terms that could be predictive. I would explore these to see if we can make better predictions.
2. I would also consider interaction terms between variables, for example between `age` and `nettfa` (income is the target here, so it cannot be part of a feature).
3. I would consider transforming my target variable. Income is often skewed, which means there will usually be a handful of very high incomes that might skew any linear model. I would consider transforming income (likely using log) so that income is "un-skewed".
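
To illustrate the third idea, the sketch below applies a log transform to a small, made-up list of incomes containing one very large value. The data and the mean-to-median ratio used as a crude skew indicator are assumptions chosen for illustration.

```python
import math
from statistics import mean, median

incomes = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 400.0]  # hypothetical incomes
log_incomes = [math.log(value) for value in incomes]

def mean_to_median_ratio(values):
    """Crude skew indicator: far above 1 means a long right tail."""
    return mean(values) / median(values)

raw_ratio = mean_to_median_ratio(incomes)      # pulled up by the outlier
log_ratio = mean_to_median_ratio(log_incomes)  # much closer to 1

# A model fit on log(income) predicts on the log scale, so predictions
# must be mapped back with exp before computing RMSE in original units.
recovered = [math.exp(value) for value in log_incomes]
```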

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

Our target variable is whether or not someone is eligible for a 401k, and `p401k` records whether or not someone currently participates in a 401k, so it leaks information about the target. Every person with a 401k must by definition be eligible for one, although a person without a 401k may or may not be eligible. Including `p401k` in my model would be almost like training the model with the target variable included as a feature. This data leakage lets the model learn something it otherwise would not know, which can make its scores look better than they should and invalidates the estimated performance of the model being constructed.
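
One way to see the leakage is to count how the two columns co-occur. The rows below are made up for illustration; on the real data the same count could be taken from the `p401k` and `e401k` columns.

```python
from collections import Counter

# Hypothetical (p401k, e401k) pairs: participation first, eligibility second
rows = [(1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (1, 1), (0, 1), (0, 0)]
counts = Counter(rows)

# Participating without being eligible should never happen
impossible = counts[(1, 0)]

# Among participants, the target is fully determined by the feature
participants = [eligible for participates, eligible in rows if participates == 1]
share_eligible_among_participants = sum(participants) / len(participants)
```

Whenever `p401k` is 1 the target is already known, so a model can score well on those rows without learning anything useful from the other features.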


##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

- A logistic regression model: Yes, we can predict whether or not one is eligible for a 401(k)
- A k-nearest neighbors model: Yes, we can predict whether or not one is eligible for a 401(k)
- A Naive Bayes model: Yes, we can predict whether or not one is eligible for a 401(k)
- A decision tree: Yes, we can predict whether or not one is eligible for a 401(k)
- A set of bagged decision trees: Yes, we can predict whether or not one is eligible for a 401(k)
- A random forest: Yes, we can predict whether or not one is eligible for a 401(k)
- A set of randomized trees: Yes, we can predict whether or not one is eligible for a 401(k)
- An AdaBoost model: Yes, we can predict whether or not one is eligible for a 401(k)
- A gradient boosted model: Yes, we can predict whether or not one is eligible for a 401(k)
- A support vector classifier: Yes, we can predict whether or not one is eligible for a 401(k)


##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
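
The reason for fixing a random seed can be shown with a small stand-in for a train/test split. The helper below is hypothetical and is not how `train_test_split` is implemented; it only demonstrates that the same seed always produces the same partition, so every model is compared on identical rows.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row positions with a fixed seed and cut off a test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Two calls with the same seed give the same split
train_a, test_a = split_indices(10, 0.2, seed=123)
train_b, test_b = split_indices(10, 0.2, seed=123)
```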

In [27]:
# Define custom scorer for classification models that prints mean cross validated F1 score and train/test F1 score
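def custom_scorer_classifier(model, X_train1, y_train1, X_test1, y_test1):
    # `model` is expected to be a GridSearchCV built with scoring='f1',
    # so best_score_ and score() below all report F1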
    model.fit(X_train1, y_train1)
    print(f'For the model {model}:', '\n')
    print(f'The best parameters as chosen by GridSearchCV is {model.best_params_}')
    print(f'The mean cross-validated F1 score of the best_estimator is {model.best_score_}')
    print(f'The train F1 score is {model.score(X_train1, y_train1)}')
    print(f'The test F1 score is {model.score(X_test1, y_test1)}')

In [28]:
# Drop columns to create feature matrix
X1 = df.drop(columns = ['e401k', 'p401k'])
# Create target vector
y1 = df['e401k']
# Train/test (80/20) split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1,
                                                        y1,
                                                        test_size = .2,
                                                        random_state = 123)

In [29]:
# Create 2 preprocessors, one for classification models that require feature scaling and
# another one for classification models that do not require feature scaling

# Scale numeric columns and OHE categorical columns
ct_ss1 = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ['inc', 'age', 'nettfa', 'incsq', 'agesq']),
        ("ohe", OneHotEncoder(drop='first'), ['marr', 'male', 'fsize', 'pira'])
    ],
    n_jobs = -1
)
# OHE categorical columns
ct_no_ss1 = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(drop='first'), ['marr', 'male', 'fsize', 'pira'])
    ],
    n_jobs = -1,
    remainder='passthrough'
)

In [30]:
# For logistic regression model
log_reg = make_pipeline(ct_ss1, LogisticRegression(max_iter=1000)) 
param_grid_log_reg  = {"logisticregression__penalty": ['l1', 'l2'],
                        "logisticregression__C": [0.01, 0.05, 0.1, 0.5, 1, 5, 10],
                        "logisticregression__solver": ['liblinear'],
                       }
grid_log_reg  = GridSearchCV(log_reg, param_grid_log_reg, scoring='f1', cv=5)
custom_scorer_classifier(grid_log_reg, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['inc',
                                                                          'age',
                                                                          'nettfa',
                                                                          'incsq',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                    

In [31]:
# For KNN classification model
knn_class = make_pipeline(ct_ss1, KNeighborsClassifier(n_jobs=-1))
param_grid_knn_class = {"kneighborsclassifier__n_neighbors": [2,3,4,5,6,7,8,9,10],
                        "kneighborsclassifier__algorithm": ['auto', 'ball_tree', 'kd_tree'],
                        "kneighborsclassifier__p": [1, 2],
                       }
grid_knn_class = GridSearchCV(knn_class, param_grid_knn_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_knn_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['inc',
                                                                          'age',
                                                                          'nettfa',
                                                                          'incsq',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                    

In [32]:
# For Gaussian Naive Bayes classification model
gnb_class = make_pipeline(ct_no_ss1, GaussianNB())
# No hyperparameters are tuned; GridSearchCV is used only for the 5-fold CV score
param_grid_gnb_class = {}
grid_gnb_class = GridSearchCV(gnb_class, param_grid_gnb_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_gnb_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('gaussiannb', GaussianNB())]),
             param_grid={}, scoring='f1'): 

The best parameters as chosen by GridSearchCV is {}
The mean cross-validated F1 score of the best_estimator is 0.416731947

In [33]:
# For Decision Tree classification model
dt_class = make_pipeline(ct_no_ss1, DecisionTreeClassifier(random_state=42))
param_grid_dt_class = {"decisiontreeclassifier__criterion": ['gini', 'entropy', 'log_loss'],
                       "decisiontreeclassifier__max_depth": [2,3,4,5,6],
                      }
grid_dt_class = GridSearchCV(dt_class, param_grid_dt_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_dt_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('decisiontreeclassifier',
                                        DecisionTreeClassifier(random_state=42))]),
             param_grid={'decisiontreeclassifier__criterion': ['gini',
                 

In [34]:
# For Bagged Decision Tree classification model
bag_class = make_pipeline(ct_no_ss1, BaggingClassifier(n_jobs=-1, random_state=42))
param_grid_bag_class = {"baggingclassifier__n_estimators": [10, 15, 20, 25],
                       }
grid_bag_class = GridSearchCV(bag_class, param_grid_bag_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_bag_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('baggingclassifier',
                                        BaggingClassifier(n_jobs=-1,
                                                          random_state=42))]),
             param_grid={'bag

In [35]:
# For Extra Trees classification model
et_class = make_pipeline(ct_no_ss1, ExtraTreesClassifier(n_jobs=-1, random_state=42))
param_grid_et_class = {"extratreesclassifier__n_estimators": [75, 100, 125, 150],
                       "extratreesclassifier__criterion": ['gini', 'entropy', 'log_loss'],
                       "extratreesclassifier__max_depth": [2,3,4,5],
                      }
grid_et_class = GridSearchCV(et_class, param_grid_et_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_et_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('extratreesclassifier',
                                        ExtraTreesClassifier(n_jobs=-1,
                                                             random_state=42))]),
             param_g

In [36]:
# For Random Forest classification model
rf_class = make_pipeline(ct_no_ss1, RandomForestClassifier(n_jobs=-1, random_state=42))
param_grid_rf_class = {"randomforestclassifier__n_estimators": [75, 100, 125, 150],
                       "randomforestclassifier__criterion": ['gini', 'entropy', 'log_loss'],
                       "randomforestclassifier__max_depth": [2,3,4,5,6,7,8],
                      }
grid_rf_class = GridSearchCV(rf_class, param_grid_rf_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_rf_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('randomforestclassifier',
                                        RandomForestClassifier(n_jobs=-1,
                                                               random_state=42))]),
             p

In [37]:
# For AdaBoost classification model
ada_class = make_pipeline(ct_no_ss1, AdaBoostClassifier(random_state=42))
param_grid_ada_class = {"adaboostclassifier__n_estimators": [50, 100, 150],
                        "adaboostclassifier__algorithm": ['SAMME', 'SAMME.R'],
                       }
grid_ada_class = GridSearchCV(ada_class, param_grid_ada_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_ada_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('adaboostclassifier',
                                        AdaBoostClassifier(random_state=42))]),
             param_grid={'adaboostclassifier__algorithm': ['SAMME', 'SAMME.R'],
                

In [38]:
# For Gradient Boosted classification model
gb_class = make_pipeline(ct_no_ss1, GradientBoostingClassifier(random_state=42))
param_grid_gb_class = {"gradientboostingclassifier__loss": ['log_loss', 'exponential'],
                       "gradientboostingclassifier__n_estimators": [75, 100, 125, 150],
                       "gradientboostingclassifier__criterion": ['friedman_mse', 'squared_error'],
                       "gradientboostingclassifier__max_depth": [2, 3],
                      }
grid_gb_class = GridSearchCV(gb_class, param_grid_gb_class, scoring='f1', cv=5)
custom_scorer_classifier(grid_gb_class, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['marr',
                                                                          'male',
                                                                          'fsize',
                                                                          'pira'])])),
                                       ('gradientboostingclassifier',
                                        GradientBoostingClassifier(random_state=42))]),
             param_grid={'gradientboostingclassifier__criterion': ['friedman_mse

In [39]:
# For Support Vector classification model
svc = make_pipeline(ct_ss1, SVC(random_state=42))
param_grid_svc = {"svc__kernel": ['linear', 'poly', 'rbf', 'sigmoid'],
                  "svc__degree": [2,3,4],
                  "svc__gamma": ['scale', 'auto'],
                 }
grid_svc = GridSearchCV(svc, param_grid_svc, scoring='f1', cv=5)
custom_scorer_classifier(grid_svc, X_train1, y_train1, X_test1, y_test1)

For the model GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=-1,
                                                          transformers=[('scale',
                                                                         StandardScaler(),
                                                                         ['inc',
                                                                          'age',
                                                                          'nettfa',
                                                                          'incsq',
                                                                          'agesq']),
                                                                        ('ohe',
                                                                         OneHotEncoder(drop='first'),
                                                                    

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False Positives: People who are not eligible for a 401(k) but whom the model predicts are eligible.

False Negatives: People who are eligible for a 401(k) but whom the model predicts are not eligible.
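
These definitions can be checked by counting outcomes by hand. The labels and predictions below are made up, and `confusion_counts` is a hypothetical helper written for this illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted eligible, actually eligible
        elif predicted == positive:
            fp += 1   # predicted eligible, actually not eligible
        elif actual == positive:
            fn += 1   # predicted not eligible, actually eligible
        else:
            tn += 1   # predicted not eligible, actually not eligible
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical e401k values
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical model predictions
counts = confusion_counts(y_true, y_pred)
```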

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

I am going to assume that advertising/offering a 401k to someone who is not actually eligible for one costs the financial services company I work for more than failing to advertise/offer a 401k to someone who is eligible. Under this assumption, we would rather minimize false positives.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

If we wanted to minimize false positives, we should optimize the precision metric or the specificity (1 - false positive rate) metric.
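
Both metrics depend on the false positive count, which a short sketch makes visible. The counts are hypothetical and the two functions are simple helpers written for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def specificity(tn, fp):
    """Share of actual negatives that are correctly predicted negative."""
    return tn / (tn + fp)

# Hypothetical baseline model: 30 TP, 10 FP, 40 TN
baseline = (precision(30, 10), specificity(40, 10))

# Hypothetical stricter model: 5 of the false positives become true negatives
fewer_fp = (precision(30, 5), specificity(45, 5))
```

With fewer false positives, both precision and specificity rise, which is why either one is a reasonable target under the assumption made in question 21.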


##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

The F1 score might be an appropriate metric to use here because it considers both the model's precision and its recall. The F1 score is the harmonic mean of the two, so it is high only when both are high: a model cannot score well by sacrificing one for the other, which is what we want when false positives and false negatives both matter.
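
The effect of the harmonic mean is easiest to see with numbers. In this sketch the precision and recall values are made up, and `f1_from` is a hypothetical helper rather than a library function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

lopsided = f1_from(0.9, 0.1)    # high precision, very low recall
balanced = f1_from(0.5, 0.5)    # same arithmetic mean, but balanced
arithmetic_mean_lopsided = (0.9 + 0.1) / 2
```

Both pairs have an arithmetic mean of 0.5, but the lopsided pair gets an F1 of only 0.18, so the weak recall is not hidden by the strong precision.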

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

|                 Model                | Training F1 Score | Cross-validated F1 Score | Testing F1 Score |
|:------------------------------------:|:-----------------:|:------------------------:|:----------------:|
|          Logistic Regression         |       0.4817      |          0.4824          |      0.4621      |
|         Gaussian Naive Bayes         |       0.4404      |          0.4342          |      0.4103      |
|  k-Nearest Neighbors Classification  |       0.5913      |          0.4796          |      0.4612      |
|     Decision Tree Classification     |       0.5699      |          0.5622          |      0.5377      |
| Bagged Decision Trees Classification |       0.9953      |          0.5205          |      0.5060      |
|      Extra Trees Classification      |       0.2961      |          0.2777          |      0.2642      |
|     Random Forest Classification     |       0.6026      |          0.5504          |      0.5278      |
|        AdaBoost Classification       |       0.5696      |          0.5635          |      0.5404      |
|    Gradient Boosted Classification   |       0.5858      |          0.5589          |      0.5437      |
|     Support Vector Classification    |       0.4649      |          0.4682          |      0.4639      |

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

Comparing the training f1-score with the testing f1-score, the models showing evidence of overfitting are, from the largest gap to the smallest:

1. Bagged Decision Trees Classification: Difference in f1-score of about 0.49
2. k-Nearest Neighbors Classification: Difference in f1-score of about 0.13
3. Random Forest Classification: Difference in f1-score of about 0.075
4. The rest have differences in f1-score of about 0.04 or less
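
These gaps can be recomputed from the table in question 24. The sketch copies the training and testing F1 scores from that table; since a higher F1 is better, the gap is training minus testing.

```python
# (training F1, testing F1) copied from the table in question 24
f1_scores = {
    "Logistic Regression": (0.4817, 0.4621),
    "Gaussian Naive Bayes": (0.4404, 0.4103),
    "k-Nearest Neighbors Classification": (0.5913, 0.4612),
    "Decision Tree Classification": (0.5699, 0.5377),
    "Bagged Decision Trees Classification": (0.9953, 0.5060),
    "Extra Trees Classification": (0.2961, 0.2642),
    "Random Forest Classification": (0.6026, 0.5278),
    "AdaBoost Classification": (0.5696, 0.5404),
    "Gradient Boosted Classification": (0.5858, 0.5437),
    "Support Vector Classification": (0.4649, 0.4639),
}

# A large positive gap means the model scores much better on data it has seen.
f1_gaps = {name: round(train - test, 4) for name, (train, test) in f1_scores.items()}
f1_ranked = sorted(f1_gaps.items(), key=lambda item: item[1], reverse=True)
best_on_test = max(f1_scores, key=lambda name: f1_scores[name][1])

for name, gap in f1_ranked:
    print(f"{name:<38}{gap:>8.4f}")
```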

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

When I have a series of models from which I can pick, I usually do the following:

1. List out all of the models. Any model that cannot solve my problem should be removed. The problem statement is "Predict whether or not one is eligible for a 401k." I want to find the model that performs the best based on a metric of my choice, in this case the f1-score.
2. The model with the best test f1-score is the Gradient Boosted Classification, although its train/test f1-score gap shows that it is slightly more overfit than the AdaBoost Classification, whose test f1-score is nearly as high (0.5404 vs. 0.5437). Therefore, I have decided to pick the **AdaBoost Classification** instead. Regarding interpretability of my selected model (although not a requirement):
    - For decision tree ensembles, we can average the impurity-based feature importance of each tree to determine the importance of features. The code below shows that this can be done, as it was for the regression problem.
3. Sometimes, model selection becomes a judgment call with no perfect guide to make the final decision.
    - Do you have time to tune the models to try and eke out better performance?
    - Is one model substantially better at solving the problem you wanted to solve?
    - Do you need something understandable by a lay audience?

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.


The recommendations I made for the regression problem still apply to the classification problem:

1. I would look at higher-order terms. For example, since we have `incsq` and `agesq` automatically created for us, there may be other higher-order terms that could be predictive. I would explore these to see if we can make better predictions.
2. I would also consider interaction terms between variables, for example between `inc` and `nettfa`.
3. I would consider transforming skewed features. Income is often skewed, which means there will usually be a handful of very high incomes that might skew any linear model. I would consider transforming income (likely using log) so that income is "un-skewed".

In [40]:
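# Get impurity-based feature importances for our chosen classification model
# (tree ensembles expose feature_importances_ rather than coefficients)
grid_ada_class.fit(X_train1, y_train1)
# Refit the pipeline with the best hyperparameters found by the grid search
ada_class.set_params(**grid_ada_class.best_params_).fit(X_train1, y_train1)
feature_values_class = ada_class['adaboostclassifier'].feature_importances_
feature_names_class = ada_class.named_steps.columntransformer.get_feature_names_out()
# One row with one column per feature (variable name kept for the later cell that prints it)
classification_coefficients = pd.DataFrame(feature_values_class, index=feature_names_class).T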

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [41]:
# Print regression coefficients
regression_coefficients

|   | `ohe__marr_1` | `ohe__male_1` | `ohe__fsize_2` | `ohe__fsize_3` | `ohe__fsize_4` | `ohe__fsize_5` | `ohe__fsize_6` | `ohe__fsize_7` | `ohe__fsize_8` | `ohe__fsize_9` | `ohe__fsize_10` | `ohe__fsize_11` | `ohe__fsize_12` | `ohe__fsize_13` | `remainder__age` | `remainder__nettfa` | `remainder__agesq` |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 0.213725 | 0.010224 | 0.00543 | 0.000457 | 0.003655 | 0.001046 | 0.001562 | 0.001359 | 0.000555 | 0.000607 | 0.000401 | 0.0 | 0.0 | 0.0 | 0.026327 | 0.705554 | 0.029098 |


`nettfa` (net total financial assets) is by far the most important predictor of income, with a feature importance of 0.7055. Being married (`ohe__marr_1`) is the second most important feature, with an importance of 0.2137. These values are impurity-based feature importances, not regression coefficients: they sum to 1 across all features and measure how much each feature contributes to the model's splits. They therefore cannot be read as the change in `inc` per unit change in a feature, and they do not show the direction of the relationship, which is a limitation of this answer.
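
The importances printed above can be ranked and checked with a few lines of plain Python. The values are copied from the output; the variable names are only illustrative.

```python
# Feature importances copied from the regression output above
regression_importances = {
    "ohe__marr_1": 0.213725, "ohe__male_1": 0.010224,
    "ohe__fsize_2": 0.00543, "ohe__fsize_3": 0.000457,
    "ohe__fsize_4": 0.003655, "ohe__fsize_5": 0.001046,
    "ohe__fsize_6": 0.001562, "ohe__fsize_7": 0.001359,
    "ohe__fsize_8": 0.000555, "ohe__fsize_9": 0.000607,
    "ohe__fsize_10": 0.000401, "ohe__fsize_11": 0.0,
    "ohe__fsize_12": 0.0, "ohe__fsize_13": 0.0,
    "remainder__age": 0.026327, "remainder__nettfa": 0.705554,
    "remainder__agesq": 0.029098,
}

# Importances are shares of a whole, so they should add up to about 1
total = sum(regression_importances.values())

# Sort from most to least important and keep the top three
top_three = sorted(regression_importances.items(),
                   key=lambda item: item[1], reverse=True)[:3]
```

Because the values sum to 1, `nettfa` alone accounts for roughly 71% of the total importance and marital status for roughly 21%.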

In [43]:
# Print classification coefficients
classification_coefficients

|   | `ohe__marr_1` | `ohe__male_1` | `ohe__fsize_2` | `ohe__fsize_3` | `ohe__fsize_4` | `ohe__fsize_5` | `ohe__fsize_6` | `ohe__fsize_7` | `ohe__fsize_8` | `ohe__fsize_9` | `ohe__fsize_10` | `ohe__fsize_11` | `ohe__fsize_12` | `ohe__fsize_13` | `ohe__pira_1` | `remainder__inc` | `remainder__age` | `remainder__nettfa` | `remainder__incsq` | `remainder__agesq` |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 0.011853 | 0.007381 | 0.0 | 0.0 | 0.010278 | 0.0 | 0.0 | 0.009903 | 0.036893 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.074807 | 0.090443 | 0.039604 | 0.328524 | 0.294227 | 0.096085 |


We can predict whether or not someone is eligible for a 401k with a test f1-score of 0.5404. The feature importances are not necessary for the problem statement, but I wanted to show that they can be computed. The most important feature used for splitting is `nettfa`, followed by `incsq`.
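
Since `fsize` is spread across many one-hot columns, it can help to add the importances back up per original feature. The sketch below does this for the classification output above (columns with an importance of 0.0 are omitted). Summing one-hot importances this way is a common convention rather than something the notebook itself does, and `source_feature` is a hypothetical helper.

```python
from collections import defaultdict

# Non-zero feature importances copied from the classification output above
classification_importances = {
    "ohe__marr_1": 0.011853, "ohe__male_1": 0.007381,
    "ohe__fsize_4": 0.010278, "ohe__fsize_7": 0.009903,
    "ohe__fsize_8": 0.036893, "ohe__pira_1": 0.074807,
    "remainder__inc": 0.090443, "remainder__age": 0.039604,
    "remainder__nettfa": 0.328524, "remainder__incsq": 0.294227,
    "remainder__agesq": 0.096085,
}

def source_feature(column):
    """Map 'ohe__fsize_8' -> 'fsize' and 'remainder__inc' -> 'inc'."""
    name = column.split("__", 1)[1]
    if column.startswith("ohe__"):
        name = name.rsplit("_", 1)[0]  # drop the one-hot category suffix
    return name

grouped = defaultdict(float)
for column, value in classification_importances.items():
    grouped[source_feature(column)] += value

ranked_features = sorted(grouped, key=grouped.get, reverse=True)
```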