## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully grows into a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [2]:
import pandas as pd
df= pd.read_csv('401ksubs.csv')
df.head()

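| Unnamed: 0 | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |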


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

Salary and employment status, to see whether our clients match a certain demographic and to build a focused target group.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

We would be discriminating against people based on the color of their skin or the ethnicity they were born into, circumstances that are beyond their control and that should play no role in deciding who is offered a 401(k) or an IRA. Even if the data showed which races or ethnicities contribute more to a 401(k), the goal of this project is not to track how ethnic groups perform with 401(k)s and IRAs.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

I would avoid using `male`: while there is a bias toward men being paid more, it is not a focus I want my study to have. We also cannot use `e401k`, `p401k`, or `pira` (see the note above), and `incsq` has to be left out because it is computed directly from `inc`. I would like my features to be family size, marital status, and age. The other features do not relate to income as clearly as these do.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` and `agesq` are the squares of `inc` and `age`. It is possible that SMEs created them so that a linear or logistic regression can capture a curved (quadratic) relationship, for example an effect of age that rises and then levels off, rather than a straight-line one.
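A quick check, sketched on the five rows shown by `df.head()` above, that the engineered columns really are the squares of `inc` and `age`. The small `sample` frame is retyped here only so the snippet runs on its own; against the real data one would use `df` directly.

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head(), retyped so the snippet is self-contained.
sample = pd.DataFrame({
    "inc":   [13.17, 61.23, 12.858, 98.88, 22.614],
    "age":   [40, 35, 44, 44, 53],
    "incsq": [173.4489, 3749.113, 165.3282, 9777.254, 511.393],
    "agesq": [1600, 1225, 1936, 1936, 2809],
})

# Use a small relative tolerance because the displayed values are rounded.
inc_matches = np.isclose(sample["inc"] ** 2, sample["incsq"], rtol=1e-4)
age_matches = np.isclose(sample["age"] ** 2, sample["agesq"])

print(inc_matches.all(), age_matches.all())
```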

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

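`age` and `inc`: the dictionary describes them as squared values, but their values in the data are not squared. The descriptions should presumably be plain age and plain income, with the squared descriptions belonging only to `agesq` and `incsq`.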

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

Linear Regression (appropriate: a simple, interpretable baseline whose coefficients show how each feature relates to y)

Lasso and Ridge (appropriate: regularized linear regressions that shrink coefficients to curb overfitting; lasso can also drop weak features entirely)

Decision tree (appropriate: good for seeing the influence of our features on the predicted value, though a single tree overfits easily)

Random forest (appropriate: even better, since it is an improvement on a single decision tree)

AdaBoost (appropriate: another tree-based ensemble that can capture nonlinear relationships between our features and y)

SVR (appropriate: can capture nonlinear relationships between our features and y through its kernel, provided the features are scaled)

KNN (predicts from the most similar observations; usable once the features are scaled, but it says little about which features matter, so it is not something we need for this exercise)

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
X = df[['marr','male','age','fsize','nettfa']]
y = df['inc']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3,random_state=42)

In [6]:
lr = LinearRegression()
knn = KNeighborsRegressor()
dt = DecisionTreeRegressor()
br = BaggingRegressor()
rf = RandomForestRegressor()
abr = AdaBoostRegressor()
sv = SVR()
sc = StandardScaler()

In [7]:
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

# Linear Regression

In [8]:
lr.fit(Z_train,y_train)
print(lr.score(Z_train, y_train))
print(lr.score(Z_test, y_test))

0.28033947967052086
0.22503604417368395


# KNeighbors Regressor

In [9]:
knn.fit(Z_train,y_train)
print(knn.score(Z_train, y_train))
print(knn.score(Z_test, y_test))

0.522682105392541
0.31464566209244826


# Decision Tree

In [10]:
dt.fit(Z_train,y_train)
print(dt.score(Z_train, y_train))
print(dt.score(Z_test, y_test))

0.9925096221925963
-0.20687065239757407


# Bagging Regressor

In [11]:
br.fit(Z_train,y_train)
print(br.score(Z_train, y_train))
print(br.score(Z_test, y_test))

0.8630995279264374
0.26580431378839753


# Random Forest Regressor

In [12]:
rf.fit(Z_train,y_train)
print(rf.score(Z_train, y_train))
print(rf.score(Z_test, y_test))

0.8983693388182759
0.3127336075152697


# AdaBoost Regressor

In [13]:
abr.fit(Z_train,y_train)
print(abr.score(Z_train, y_train))
print(abr.score(Z_test, y_test))

0.1874974934539816
0.15252287926389663


# Support Vector Regressor

In [14]:
sv.fit(Z_train,y_train)
print(sv.score(Z_train, y_train))
print(sv.score(Z_test, y_test))

0.3143864229115487
0.31219319246121724
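The repeated fit-and-score cells above could be collapsed into a loop over pipelines, as the prompt hints. This is only a sketch: it runs on synthetic stand-in data from `make_regression` (an assumption, so the snippet is self-contained), and the names `X_demo`, `models`, and `scores` are illustrative rather than part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption); swap in the lab's X and y.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "svr": SVR(),
}

scores = {}
for name, model in models.items():
    # Each pipeline fits its own scaler on the training split only.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    scores[name] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Wrapping the scaler and the model together means the same train/test split is reused for every model and the scaler can never be fit on test rows by accident.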


##### 9. What is bootstrapping?

Bootstrapping is random resampling with replacement from our sample, usually drawing resamples of the same size as the original. This is used to create multiple datasets from our one sample.
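A minimal sketch of bootstrapping with NumPy, using the five `inc` values from `df.head()` as a tiny stand-in sample. The number of resamples (1,000) and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([13.17, 61.23, 12.858, 98.88, 22.614])  # `inc` values from df.head()

# Draw 1,000 bootstrap resamples: same size as the data, sampled with replacement.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])

# The spread of the resampled means estimates the uncertainty of the sample mean.
print("sample mean:", data.mean())
print("bootstrap std. error of the mean:", boot_means.std())
```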

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A decision tree is a single tree fit once on the whole training set. A set of bagged decision trees fits many trees, each on a different bootstrap resample of the training rows, and then aggregates their predictions (the average for regression, a majority vote for classification).

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

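Bagged decision trees bootstrap the rows of our data, but every tree may consider every feature at every split. A random forest also bootstraps the rows, and in addition considers only a random subset of the features at each split, which makes the trees less similar to one another.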
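The difference between the two ensembles can be seen by inspecting the fitted trees. This is a sketch on synthetic data with 9 features (an assumption); `max_features="sqrt"` is set explicitly on the forest so that the feature subsampling is visible.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=9, noise=5.0, random_state=0)

# Bagging: each tree (a DecisionTreeRegressor by default) sees a bootstrap
# resample of the rows and may split on any of the 9 features.
bag = BaggingRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Random forest: bootstraps rows as well, but each split only considers a
# random subset of the features (here sqrt(9) = 3).
forest = RandomForestRegressor(
    n_estimators=25, max_features="sqrt", random_state=0
).fit(X_demo, y_demo)

print("features considered per split (bagging):", bag.estimators_[0].max_features_)
print("features considered per split (forest): ", forest.estimators_[0].max_features_)
```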

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

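A random forest may be superior because each split only considers a random subset of the variables, so the individual trees are less correlated with one another. Averaging less-correlated trees lowers the variance of the ensemble, which helps when the model is overfit, usually at the cost of only a small increase in bias.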

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [19]:
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(model, X_train, X_test, y_train, y_test):
    # Fit the scaler on the training split only, then apply it to both splits.
    Z_train = sc.fit_transform(X_train)
    Z_test = sc.transform(X_test)
    model.fit(Z_train, y_train)
    # Take the square root of the MSE explicitly; this avoids the `squared`
    # argument, which is deprecated in newer scikit-learn releases.
    rmse_train = np.sqrt(mean_squared_error(y_true=y_train, y_pred=model.predict(Z_train)))
    rmse_test = np.sqrt(mean_squared_error(y_true=y_test, y_pred=model.predict(Z_test)))
    print('rmse train score of', model, ':', rmse_train)
    print('rmse test score of', model, ':', rmse_test)
rmse(lr, X_train, X_test, y_train, y_test)
rmse(knn, X_train, X_test, y_train, y_test)  
rmse(dt, X_train, X_test, y_train, y_test)  
rmse(br, X_train, X_test, y_train, y_test)  
rmse(rf, X_train, X_test, y_train, y_test)  
rmse(abr, X_train, X_test, y_train, y_test)  
rmse(sv, X_train, X_test, y_train, y_test)  

    

rmse train score of LinearRegression() : 20.253869120055786
rmse test score of LinearRegression() : 21.638157117706292
rmse train score of KNeighborsRegressor() : 16.494836617084168
rmse test score of KNeighborsRegressor() : 20.34872024463971
rmse train score of DecisionTreeRegressor() : 2.066312589899166
rmse test score of DecisionTreeRegressor() : 27.40390236206897
rmse train score of BaggingRegressor() : 8.696788957141678
rmse test score of BaggingRegressor() : 21.002162653446636
rmse train score of RandomForestRegressor() : 7.635883869745225
rmse test score of RandomForestRegressor() : 20.318322807306206
rmse train score of AdaBoostRegressor() : 23.564332150306146
rmse test score of AdaBoostRegressor() : 24.59343918316724
rmse train score of SVR() : 19.768961910326304
rmse test score of SVR() : 20.38509562796136
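For reference, RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same units as `inc`. A small worked example on made-up numbers, computed by hand and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3.
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_by_hand, rmse_sklearn)  # both equal sqrt(5/3), roughly 1.291
```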


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

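DecisionTreeRegressor (train RMSE 2.07 vs. test 27.40), RandomForestRegressor (7.64 vs. 20.32), and BaggingRegressor (8.70 vs. 21.00) all do far better on the training data than on the testing data, which is evidence of overfitting. KNeighborsRegressor (16.49 vs. 20.35) shows it to a lesser degree.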

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

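Support Vector Regressor. Its test RMSE (20.39) is essentially tied with the best of my models (random forest at 20.32, KNN at 20.35), and its training and testing scores are nearly identical (R² of 0.314 vs. 0.312), so it shows the least overfitting.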

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

For the Support Vector Regressor, the parameters I can tune are the kernel (including the `degree` of a polynomial kernel) and the penalty parameter `C`.
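One way such tuning could be organized is a cross-validated grid search. The sketch below uses synthetic data (an assumption), and the grid values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption); swap in the lab's X_train and y_train.
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)

# make_pipeline names the SVR step "svr", hence the "svr__" prefixes below.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    "svr__kernel": ["rbf", "poly"],
    "svr__C": [1, 10, 100],
    "svr__degree": [2, 3],  # only used by the polynomial kernel
}

search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```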

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

`p401k` marks people who already participate in a 401k. Anyone who participates must be eligible, so including it would largely give away the answer we are trying to predict. We would also be predicting eligibility for members who already have a 401k, and these members are not people we are interested in.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

- a logistic regression model
- a k-nearest neighbors model
- a decision tree
- a set of bagged decision trees
- a random forest
- an Adaboost model
- a support vector classifier

These are all appropriate, since the target is binary: eligible for a 401k (1) or not eligible for a 401k (0). Each of these models can classify a client as eligible or not based on the features the client has or lacks.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [21]:
X = df.drop(columns=['e401k','p401k'])
y = df['e401k']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3,random_state=42)

In [22]:
logreg = LogisticRegression()
knc = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
bc = BaggingClassifier()
rfc = RandomForestClassifier()
svc = SVC()
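sc = StandardScaler()
# Note: Z_train and Z_test are not recomputed after the new split above, so the
# .score() cells that follow still use the earlier scaled regression features.
# To score on this section's features, re-run:
#   Z_train = sc.fit_transform(X_train); Z_test = sc.transform(X_test)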

# Logistic Regression

In [23]:
logreg.fit(Z_train,y_train)
print(logreg.score(Z_train, y_train))
print(logreg.score(Z_test, y_test))

0.6324707332101047
0.6338483650736615


# KNeighbors Classifier

In [24]:
knc.fit(Z_train,y_train)
print(knc.score(Z_train, y_train))
print(knc.score(Z_test, y_test))

0.7456869993838571
0.6295364714337046


# Decision Tree Classifier

In [25]:
dtc.fit(Z_train,y_train)
print(dtc.score(Z_train, y_train))
print(dtc.score(Z_test, y_test))

0.9935304990757856
0.5713259072942868


# Bagging Classifier

In [26]:
bc.fit(Z_train,y_train)
print(bc.score(Z_train, y_train))
print(bc.score(Z_test, y_test))

0.9644177449168208
0.6072583542939274


# Random Forest Classifier

In [27]:
rfc.fit(Z_train,y_train)
print(rfc.score(Z_train, y_train))
print(rfc.score(Z_test, y_test))

0.9935304990757856
0.6209126841537909


# Support Vector Classifier

In [28]:
svc.fit(Z_train,y_train)
print(svc.score(Z_train, y_train))
print(svc.score(Z_test, y_test))

0.6551139864448552
0.6604383758533956
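Question 19 also lists an AdaBoost model, which is not among the classifiers fit above. A minimal sketch of how it could be fit, shown on synthetic data from `make_classification` (an assumption) so that no results are implied for the 401k data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); swap in the lab's X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# AdaBoost's default base learners are shallow trees, which are insensitive
# to feature scale, so no scaler is used here.
abc = AdaBoostClassifier(n_estimators=50, random_state=42)
abc.fit(X_tr, y_tr)

train_acc = abc.score(X_tr, y_tr)
test_acc = abc.score(X_te, y_te)
print(train_acc, test_acc)
```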


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

FP = Those who are not eligible for a 401k but are predicted as eligible

FN = Those who are eligible for a 401k but are predicted as not eligible
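These counts can be read off a confusion matrix. The labels below are made up purely for illustration (1 = eligible, 0 = not eligible).

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration): 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```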

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We would minimize false positives, so that clients who are not eligible do not cause a financial hiccup. False negatives, by contrast, can probably get a second opinion and prove to a human that they are eligible.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

We would optimize specificity, TN / (TN + FP), since raising it means fewer people who are not eligible get classified as positive.
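scikit-learn has no dedicated specificity function, but specificity is simply recall computed for the negative class. A short sketch on made-up labels, checked against the definition TN / (TN + FP):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (made up for illustration): 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

# Specificity from the definition: true negatives over all actual negatives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity_by_hand = tn / (tn + fp)

# The same quantity, obtained as recall with the negative class as the target.
specificity = recall_score(y_true, y_pred, pos_label=0)

print(specificity_by_hand, specificity)
```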

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

F1-score is the harmonic mean of precision and recall, so a high F1-score means both are reasonably high, which in turn means our FP and FN are both fairly low.
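A small worked example of how F1 combines precision and recall, again on made-up labels. Here precision is 3/5 and recall is 3/4, so F1 is their harmonic mean, 2/3.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (made up for illustration): 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4

# Harmonic mean of precision and recall.
f1_by_hand = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1_by_hand, f1)
```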

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [30]:
from sklearn.metrics import f1_score
def f_1(model, X_train, X_test, y_train, y_test):
    Z_train = sc.fit_transform(X_train)
    Z_test = sc.transform(X_test)
    model.fit(Z_train,y_train)
    f1_train = f1_score(y_true=y_train,y_pred=model.predict(Z_train))
    f1_test = f1_score(y_true=y_test,y_pred=model.predict(Z_test))
    print('f1 train score of', model , ':', f1_train)
    print('f1 test score of', model , ':', f1_test)
f_1(logreg, X_train, X_test, y_train, y_test)
f_1(knc, X_train, X_test, y_train, y_test)
f_1(dtc, X_train, X_test, y_train, y_test)
f_1(bc, X_train, X_test, y_train, y_test)
f_1(rfc, X_train, X_test, y_train, y_test)
f_1(svc, X_train, X_test, y_train, y_test)



f1 train score of LogisticRegression() : 0.48493277700509974
f1 test score of LogisticRegression() : 0.471169686985173
f1 train score of KNeighborsClassifier() : 0.6611779607346423
f1 test score of KNeighborsClassifier() : 0.48469643753135977
f1 train score of DecisionTreeClassifier() : 1.0
f1 test score of DecisionTreeClassifier() : 0.4658298465829847
f1 train score of BaggingClassifier() : 0.9705414012738853
f1 test score of BaggingClassifier() : 0.4966785896780787
f1 train score of RandomForestClassifier() : 1.0
f1 test score of RandomForestClassifier() : 0.5289672544080604
f1 train score of SVC() : 0.4769853313100658
f1 test score of SVC() : 0.45542168674698796


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

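Yes. The decision tree and random forest both score a perfect 1.0 on the training data but only about 0.47 and 0.53 on the testing data, and the bagging classifier drops from 0.97 to 0.50, so all three are overfit. KNN overfits to a lesser degree (0.66 vs. 0.48). Logistic regression and SVC score about the same on both splits.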

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

Support vector classifier, as its test accuracy (0.66) is the best of the models.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

SVC, just like the support vector regressor, has parameters that can be adjusted, such as the kernel degree and the penalty parameter `C`. I could also engineer or select different features.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.