## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# Read the data
df = pd.read_csv('./401ksubs.csv')

In [3]:
# First five rows of the data
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [4]:
# size of the DataFrame
df.shape

(9275, 11)

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

**Answer:**
- The following additional variables would be helpful to have in this dataset:
 - 1) the amount the employer matches each month
 - 2) the person's credit history, for example a credit score reflecting how reliably they repay their obligations
 - 3) the person's assets, such as a car, a house, shares, or other investments

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

**Answer:**
- Using race would mean deciding whom to target with financial products based on a protected characteristic. That raises serious questions about the ethics of both the data and the decision process, and it would be unacceptable for someone to be targeted or excluded because of their race.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

 **Answer:** 
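- As the data and the data dictionary show, `incsq` is calculated directly from `inc`, so it is redundant with the target. Using it to predict `inc` would leak the answer into the model, so it should be dropped from the features (in addition to `e401k`, `p401k`, and `pira`, which the note above rules out).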

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

**Answer:** 

- The variables `incsq` and `agesq` were created by squaring `inc` and `age`. A 401(k) is a retirement account, so subject-matter experts would expect its relationship with age and income to matter. Squared terms let a linear model capture a relationship that is not a straight line, for example one that changes as a person gets closer to retirement.


##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

**Answer:** 
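- The descriptions of `age` and `inc` appear to be in error: the dictionary describes them as `age^2` and `inc^2`, respectively. Presumably they should be described as plain age and income, with the squared descriptions belonging to `agesq` and `incsq`.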


## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

**Answer:**
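- A k-nearest neighbors model can predict income, but it is a poor fit for this question: it predicts from distances between observations and gives no direct measure of how much each feature matters. In contrast, linear regression reports a coefficient for each feature, and the decision tree, random forest, AdaBoost, extremely randomized trees, and XGBoost models report feature importances based on how much each feature's splits reduce the error. These are better suited to identifying which features best predict income.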
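
As a rough illustration of how feature influence can be read off a fitted model, the sketch below compares linear regression coefficients with random forest feature importances. It runs on synthetic data from `make_regression`; the feature names `f0` to `f3` and the generated values are assumptions for illustration, not the lab's dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 4 features, only 2 of which drive the target
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names
Xi, yi = make_regression(n_samples=300, n_features=4, n_informative=2,
                         noise=5.0, random_state=42)

lr = LinearRegression().fit(Xi, yi)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(Xi, yi)

# Coefficients are signed and depend on feature scale;
# importances are non-negative and sum to 1
influence = pd.DataFrame(
    {'lr_coefficient': lr.coef_, 'rf_importance': rf.feature_importances_},
    index=feature_names,
)
print(influence.round(3))
```

A k-nearest neighbors model exposes neither attribute, which is why it is harder to use for a "which features matter" question.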

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score

In [6]:
# Standardize the data prior to modeling
# (note: this scales every column, including the targets, using the full dataset)
df_ds = pd.DataFrame(StandardScaler().fit_transform(df), columns = df.columns)

In [7]:
# describe() confirms that each column now has a standard deviation of about 1
df_ds.describe()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
count,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0
mean,-4.896473e-16,1.470761e-16,6.391294e-16,1.298751e-16,1.286063e-16,7.338604e-16,4.878996e-17,-2.06948e-15,-8.523879e-16,7.714703000000001e-17,-2.350441e-16
std,1.000054,1.000054,1.000054,1.000054,1.000054,1.000054,1.000054,1.000054,1.000054,1.000054,1.000054
min,-0.803173,-1.214123,-1.300887,-0.5068978,-1.561343,-1.2355,-8.151509,-0.6177763,-0.5840318,-0.673384,-1.304882
25%,-0.803173,-0.7304105,-1.300887,-0.5068978,-0.7845661,-0.5800856,-0.3059968,-0.6177763,-0.5840318,-0.550439,-0.7867935
50%,-0.803173,-0.2476946,0.7687061,-0.5068978,-0.1048859,0.07532845,-0.2669101,-0.6177763,-0.5840318,-0.3375534,-0.2162267
75%,1.245062,0.4527167,0.7687061,-0.5068978,0.6718915,0.7307425,-0.009727507,1.618709,1.712236,0.1315537,0.5698381
max,1.245062,6.633249,0.7687061,1.972784,2.225446,6.629469,23.72916,1.618709,1.712236,12.49326,2.57073
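
The cell above scales every column, including the targets, using statistics from the full dataset before the split. A common alternative is to fit the scaler on the training rows only, for example inside a pipeline. The sketch below shows that pattern on synthetic data; the generated values and the relationship between the columns are assumptions for illustration, not the lab's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (hypothetical values, illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'age': rng.integers(25, 65, size=200),
    'fsize': rng.integers(1, 6, size=200),
})
demo['inc'] = 0.5 * demo['age'] + 2.0 * demo['fsize'] + rng.normal(0, 1, size=200)

X_demo = demo[['age', 'fsize']]
y_demo = demo['inc']
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fitting the pipeline fits the scaler on the training rows only;
# the target is left in its original units
scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
scaled_lr.fit(Xd_train, yd_train)

# The scaler's learned means come from the training split, not the full data
train_means = scaled_lr.named_steps['standardscaler'].mean_
print(train_means)
```

With this setup the test rows never influence the scaling, and an RMSE computed on the predictions is in the target's original units.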


In [8]:
# Assign the features (X) and the target (y) for the regression problem
X = df_ds.drop(columns = ['e401k', 'p401k', 'pira', 'inc', 'incsq'])
y = df_ds['inc']

In [9]:
# Train-Test-Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 42)

In [10]:
l_reg = LinearRegression()
l_reg.fit(X_train, y_train)

LinearRegression()

In [11]:
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)

RandomForestRegressor()
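
Only two of the seven requested model types are fit above. One possible way to fit all of them on the same train/test split is to loop over a dictionary of estimators, as sketched below. The data comes from `make_regression` as a stand-in, and the dictionary keys and the reduced `n_estimators` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the lab's dataset)
Xr, yr = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.3, random_state=42
)

regressors = {
    'linear regression': LinearRegression(),
    'knn': KNeighborsRegressor(),
    'decision tree': DecisionTreeRegressor(random_state=42),
    'bagged trees': BaggingRegressor(random_state=42),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=42),
    'adaboost': AdaBoostRegressor(random_state=42),
    'svr': SVR(),
}

# Every model is fit on the same training split so the comparison is fair
fitted_regressors = {}
for name, model in regressors.items():
    fitted_regressors[name] = model.fit(Xr_train, yr_train)
    print(f'fitted: {name}')
```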

##### 9. What is bootstrapping?

**Answer:** 
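- Bootstrapping is resampling with replacement from an observed sample, usually drawing as many observations as the original sample contains. Repeating this many times simulates different samples and approximates the sampling distribution of a statistic.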
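
A minimal sketch of bootstrapping with NumPy follows. The sample is randomly generated, so its values (and the choice of 1,000 resamples) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed sample (illustrative values only)
sample = rng.normal(loc=50, scale=10, size=100)

# Draw 1,000 bootstrap resamples: same size as the sample, with replacement,
# and record the mean of each one
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# The spread of the bootstrap means approximates the standard error of the mean
print(f'sample mean: {sample.mean():.2f}')
print(f'bootstrap standard error: {boot_means.std(ddof=1):.2f}')
print(f'95% percentile interval: {np.percentile(boot_means, [2.5, 97.5])}')
```

Bagging applies the same resampling idea to model fitting: each tree is trained on its own bootstrap resample of the training rows.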

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

**Answer:** 
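- A decision tree is a single model fit once to the training data; a fully grown tree has low bias but high variance. A set of bagged decision trees fits many trees, each to a different bootstrap sample of the training data, and averages their predictions (or takes a majority vote for classification). This aggregation reduces the variance of a single decision tree.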

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

**Answer:** 
- In a random forest, each split in each tree considers only a random subset of the features, and the best split among that subset is used. In a set of bagged decision trees, every feature is considered for splitting at every node. Both fit each tree to a bootstrap sample of the training data.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

**Answer:** 

- Because a random forest chooses among a random subset of features at each split, its trees are less correlated with one another than bagged trees, which tend to split on the same strong features. Averaging less correlated trees reduces the variance of the predictions further, so a random forest typically has lower variance than a set of bagged decision trees.
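
The difference between the two ensembles can be seen in how many features each tree is allowed to consider at a split. The sketch below assumes synthetic data with 8 features and an arbitrary `max_features=3` for the forest; both choices are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 8 features
Xb, yb = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

# Bagged trees: each tree sees a bootstrap sample of the rows,
# and every split considers all 8 features
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=42)
bagged.fit(Xb, yb)

# Random forest: also bootstraps the rows, but each split considers
# only a random subset of 3 features
forest = RandomForestRegressor(n_estimators=25, max_features=3, random_state=42)
forest.fit(Xb, yb)

# max_features_ is the number of features each fitted tree considers per split
print('bagged tree:', bagged.estimators_[0].max_features_)
print('forest tree:', forest.estimators_[0].max_features_)
```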


## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [12]:
# Calculate the RMSE for train in model l_reg
trainRMSE = mean_squared_error(y_true = y_train, y_pred = l_reg.predict(X_train))
trainRMSE_final = trainRMSE**0.5
print(f'The training RMSE is {trainRMSE_final}')

The training RMSE is 0.8288486281787126


In [13]:
# Calculate the RMSE for test in model l_reg
testRMSE = mean_squared_error(y_true = y_test, y_pred = l_reg.predict(X_test))
testRMSE_final = testRMSE **0.5
print(f'The test RMSE is {testRMSE_final}')

The test RMSE is 0.8771575950761487


In [14]:
# Calculate the RMSE for train in model random_forest
trainRMSE = mean_squared_error(y_true = y_train, y_pred = random_forest.predict(X_train))
trainRMSE_final = trainRMSE**0.5
print(f'The training RMSE is {trainRMSE_final}')

The training RMSE is 0.3158180481579853


In [15]:
# Calculate the RMSE for test in model random_forest
testRMSE = mean_squared_error(y_true = y_test, y_pred = random_forest.predict(X_test))
testRMSE_final = testRMSE **0.5
print(f'The test RMSE is {testRMSE_final}')

The test RMSE is 0.8510195042574749


**Note**
- With default hyperparameters, the random forest's training RMSE (about 0.32) is far lower than its testing RMSE (about 0.85), a gap of roughly 0.5.
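
The train and test RMSE calculations above repeat the same steps for each model. One way to reduce the repetition is a small helper plus a summary table, sketched here. The helper name `rmse`, the `gap` column, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Synthetic stand-in data (not the lab's dataset)
Xe, ye = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=42)
Xe_train, Xe_test, ye_train, ye_test = train_test_split(
    Xe, ye, test_size=0.3, random_state=42
)

rows = []
for name, model in [
    ('linear regression', LinearRegression()),
    ('random forest', RandomForestRegressor(n_estimators=50, random_state=42)),
]:
    model.fit(Xe_train, ye_train)
    train_rmse = rmse(model, Xe_train, ye_train)
    test_rmse = rmse(model, Xe_test, ye_test)
    # A large positive gap (test much worse than train) suggests overfitting
    rows.append({'model': name, 'train_rmse': train_rmse,
                 'test_rmse': test_rmse, 'gap': test_rmse - train_rmse})

rmse_table = pd.DataFrame(rows).set_index('model')
print(rmse_table.round(3))
```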

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

**Answer:**

Of the two models fit for this assignment, the random forest shows far more overfitting than the linear regression. Its training RMSE (about 0.32) is much lower than its testing RMSE (about 0.85), which means it generalizes poorly to the unseen 30% of the data. The linear regression's training and testing RMSE (about 0.83 and 0.88) are close.


##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

**Answer:** 

- For this problem, I would pick the model with the smallest difference between training and testing RMSE: the smaller the gap, the less the model overfits. Linear regression has a gap of about 0.05, whereas the random forest has a gap of about 0.5, so I would pick linear regression. Its testing RMSE (about 0.88) is only slightly higher than the random forest's (about 0.85).


##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** 
- a) Do feature engineering and transform features based on their distributions. For example, take the log of a right-skewed feature to bring its distribution closer to normal.
- b) Using the default hyperparameters as a baseline, pick the models that show the least overfitting.
- c) Rank the models by how well they predict.
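
The log transformation mentioned in (a) might look like the following sketch. The feature is drawn from a log-normal distribution, so its name and values are assumptions for illustration. `np.log1p` computes log(1 + x) and requires x > -1, so it suits non-negative features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed feature (log-normal draws, illustration only)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=2000),
                   name='skewed_feature')

# log1p = log(1 + x); safe at x = 0, undefined for x <= -1
logged = np.log1p(skewed)

# Skewness near 0 indicates a roughly symmetric distribution
print(f'skewness before: {skewed.skew():.2f}')
print(f'skewness after:  {logged.skew():.2f}')
```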

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

**Answer:** 

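- `p401k` records whether someone participates in a 401(k), and a person can only participate if they are eligible. Including it would therefore bias the model toward predicting eligibility from information that already reveals the answer, and it would not be available for new potential customers.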

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

**Answer:**
- Logistic regression, k-nearest neighbors, decision tree, Naive Bayes, random forest, extremely randomized trees, and AdaBoost models can all be used to predict whether or not one is eligible for a 401(k).

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [16]:
# Assigning variable X and y, Idea taken from https://stackoverflow.com
X = df_ds.drop(columns = ['e401k', 'p401k'])
# e401k was standardized above, so positive values correspond to the original 1 (eligible)
y = (df_ds['e401k'] > 0).astype(int)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 42)

In [18]:
# Fit a k-nearest neighbors classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [19]:
# Fit an AdaBoost classifier
adaboost = AdaBoostClassifier()
adaboost.fit(X_train, y_train)

AdaBoostClassifier()
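
Only two of the seven requested classifiers are fit above. The sketch below shows one way all seven could be fit on the same split. It uses `make_classification` as stand-in data; the dictionary keys, `max_iter=1000`, and the reduced `n_estimators` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the lab's dataset)
Xc, yc = make_classification(n_samples=300, n_features=8, n_informative=4,
                             random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.3, random_state=42, stratify=yc
)

classifiers = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'decision tree': DecisionTreeClassifier(random_state=42),
    'bagged trees': BaggingClassifier(random_state=42),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'adaboost': AdaBoostClassifier(random_state=42),
    'svc': SVC(),
}

# fit() returns the estimator itself, so the dict holds the fitted models
fitted_classifiers = {
    name: clf.fit(Xc_train, yc_train) for name, clf in classifiers.items()
}
print(sorted(fitted_classifiers))
```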

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

**Answer:**
- A false positive is a person we predict to be eligible for a 401(k) who is actually not eligible.
- A false negative is a person we predict to be not eligible for a 401(k) who is actually eligible.


##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

**Answer:** 

- We would rather minimize false positives: the number of ineligible people predicted to be eligible for a 401(k) should be as low as possible. Acting on a false positive means spending resources on someone who cannot open the account, which is a financial risk for the company.


##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

**Answer:** 

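- We would maximize specificity, the share of people who are truly not eligible that the model correctly classifies as not eligible. Higher specificity means fewer false positives.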
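
Specificity is TN / (TN + FP), so it rises as false positives fall. scikit-learn has no dedicated specificity function that this sketch relies on; instead it derives the value from `confusion_matrix`. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = eligible, 0 = not eligible)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

specificity = tn / (tn + fp)  # share of true negatives correctly identified
precision = tp / (tp + fp)    # share of predicted positives that are correct

print(f'tn={tn}, fp={fp}, fn={fn}, tp={tp}')
print(f'specificity = {specificity:.3f}, precision = {precision:.3f}')
```

Precision is shown alongside because it is also driven down by false positives, so either metric could serve when false positives are the main concern.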


##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

**Answer:** 
- Because the F1 score counts both false positives and false negatives in its denominator: F1 = 2TP / (2TP + FP + FN). If either kind of error increases while the number of true positives stays constant, the F1 score decreases, so it penalizes both errors rather than favoring one.
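
To make the role of the denominator concrete, the sketch below computes F1 by hand as 2TP / (2TP + FP + FN) and checks it against `f1_score`. The labels and predictions are invented for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical labels and predictions (1 = eligible, 0 = not eligible)
y_true_f1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_f1 = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_f1, y_pred_f1))
tp_count = sum(t == 1 and p == 1 for t, p in pairs)  # correctly predicted eligible
fp_count = sum(t == 0 and p == 1 for t, p in pairs)  # predicted eligible, actually not
fn_count = sum(t == 1 and p == 0 for t, p in pairs)  # predicted not eligible, actually eligible

# Both error types sit in the denominator, so either one lowers the score
f1_by_hand = 2 * tp_count / (2 * tp_count + fp_count + fn_count)
f1_sklearn = f1_score(y_true_f1, y_pred_f1)

print(f'by hand: {f1_by_hand:.4f}, sklearn: {f1_sklearn:.4f}')
```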

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [20]:
# f1_score for knn model for train
train_f1_knn = f1_score(y_true = y_train, y_pred = knn.predict(X_train))
print (f"The f1_score of train using model knn: {train_f1_knn}")

The f1_score of train using model knn: 0.6620194338825518


In [21]:
# f1_score for knn model for test
test_f1_knn = f1_score(y_true = y_test, y_pred = knn.predict(X_test))
print (f"The f1_score of test using model knn: {test_f1_knn}")

The f1_score of test using model knn: 0.4842105263157894


In [22]:
# f1_score for adaboost model for train
train_f1_adaboost = f1_score(y_true = y_train, y_pred = adaboost.predict(X_train))
print (f"The f1_score of train using model adaboost: {train_f1_adaboost}")

The f1_score of train using model adaboost: 0.5742279020234291


In [23]:
# f1_score for adaboost model for test
test_f1_adaboost = f1_score(y_true = y_test, y_pred = adaboost.predict(X_test))
print (f"The f1_score of test using model adaboost: {test_f1_adaboost}")

The f1_score of test using model adaboost: 0.5538150581101566


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

**Answer:**
- The knn model's f1-score is much higher on the training data (about 0.66) than on the testing data (about 0.48), which is evidence of overfitting. AdaBoost's training f1-score is only about 0.02 higher than its testing f1-score, so it shows little overfitting.


##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

**Answer:**
- The primary thing I look at is which of the models gives the closest scores between train and test. Here, I would pick AdaBoost: its training and testing f1-scores are close, and its testing f1-score (about 0.55) is also higher than knn's (about 0.48).

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** 
- Seek recommendations on feature engineering that could push the model toward optimal performance.

- Look critically at how 401(k) eligibility is classified before using the whole dataset, since it is a critical outcome, to minimize ineligible people being wrongly placed in the eligible category.

- Search for the optimal hyperparameters using GridSearch rather than relying on the default values.
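
A hyperparameter search of the kind described in the last point could be sketched with `GridSearchCV` as follows. The grid values, the 3-fold cross-validation, and the synthetic data are assumptions chosen to keep the example small, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the lab's dataset)
Xg, yg = make_classification(n_samples=300, n_features=8, n_informative=4,
                             random_state=42)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    Xg, yg, test_size=0.3, random_state=42, stratify=yg
)

# Small illustrative grid; a real search would likely cover more values
param_grid = {
    'n_estimators': [25, 50],
    'learning_rate': [0.5, 1.0],
}

# scoring='f1' makes the search optimize the same metric used in the lab
grid = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid,
                    scoring='f1', cv=3)
grid.fit(Xg_train, yg_train)

print('best parameters:', grid.best_params_)
print(f'best cross-validated f1: {grid.best_score_:.3f}')
print(f'test f1 of the refit best model: {grid.score(Xg_test, yg_test):.3f}')
```

The search uses only the training split; the test split is held back for a final check of the selected model.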