<a href="https://colab.research.google.com/github/connorpheraty/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/Connor_Heraty_DS_Unit_2_Sprint_Challenge_3_Classification_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 _Lambda School Data Science Unit 2_
 
 # Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [0]:
import pandas as pd

columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

In [42]:
df.head(10)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [0]:
# Feature Engineering
df['net-capital-gain'] = df['capital-gain'] - df['capital-loss']

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [0]:
# Split into X,y
y = df['income'] == '>50K'
X = df.drop(columns='income')

In [61]:
# Raw value counts for income > 50K
y.value_counts(normalize=True)

False    0.75919
True     0.24081
Name: income, dtype: float64

In [0]:
majority_class = y.mode()[0]
y_pred = [majority_class] * len(y)

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [63]:
# Majority class baseline accuracy is about 0.759 (75.9%)
from sklearn.metrics import accuracy_score
accuracy_score(y, y_pred)

0.7591904425539756
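The same baseline can also be read directly from pandas, since the majority class baseline accuracy is just the relative frequency of the most common class. A minimal sketch on a toy target (the `y_toy` series is hypothetical, not the census data):

```python
import pandas as pd

# Toy boolean target: 3 of 4 rows are False (hypothetical data)
y_toy = pd.Series([False, False, False, True])

# Always predicting the mode is right exactly as often as the mode occurs
baseline_accuracy = y_toy.value_counts(normalize=True).max()
print(baseline_accuracy)  # 0.75
```

Applied to the notebook's `y`, the same expression should match the scikit-learn result above.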

What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

In [64]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y, y_pred)

0.5
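The result follows without running any code: a majority class baseline gives every row the same score, so it cannot rank positives above negatives, and the ROC curve collapses to the diagonal. A small sketch with made-up labels illustrates this (the arrays are assumptions for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels with both classes present (hypothetical data)
y_true = np.array([0, 0, 0, 1])

# A constant prediction carries no ranking information
constant_scores = np.zeros(len(y_true))

auc = roc_auc_score(y_true, constant_scores)
print(auc)  # 0.5
```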

In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [0]:
# train_test_split function for splitting dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

In [0]:
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=42, stratify=y_train)

In [51]:
# View shapes of newly created data sets
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape

((18233, 15), (7815, 15), (6513, 15), (18233,), (7815,), (6513,))
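The two successive splits leave roughly 56% of the rows for training, 24% for validation, and 20% for testing (0.8 × 0.7, 0.8 × 0.3, and 0.2). The sketch below reproduces the arithmetic on a toy array; the data and the shortened variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with a 75/25 class balance (hypothetical)
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0] * 750 + [1] * 250)

# First split: hold out 20% as the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Second split: carve 30% of the remaining rows out as a validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.3, random_state=42, stratify=y_tr)

fractions = {
    'train': len(X_tr) / len(X_toy),
    'val': len(X_va) / len(X_toy),
    'test': len(X_te) / len(X_toy),
}
print(fractions)
```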

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**

In [52]:
! pip install category_encoders



In [0]:
import warnings
import category_encoders as ce
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler

In [54]:
# Examine stats for numerical data
df.describe(include='number')

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,net-capital-gain
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456,990.345014
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429,7408.986951
min,17.0,12285.0,1.0,0.0,0.0,1.0,-4356.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0,0.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0,0.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0,0.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0,99999.0


In [55]:
# Note: native-country is a high cardinality categorical variable
df.describe(exclude='number')

,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income
count,32561,32561,32561,32561,32561,32561,32561,32561,32561
unique,9,16,7,15,6,5,2,42,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,22696,10501,14976,4140,13193,27816,21790,29170,24720


In [0]:
# Make Pipeline
# OneHotEncoder for low cardinality categorical columns
# BinaryEncoder for high cardinality categorical columns

pipeline = make_pipeline(
    ce.OneHotEncoder(cols=['relationship', 'race', 'sex','marital-status','workclass', 'occupation','education'], use_cat_names=True),
    ce.BinaryEncoder(cols=['native-country']),
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)
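The pipeline above passes every column, including the encoded dummy columns, through `StandardScaler`. One possible alternative, sketched here with scikit-learn only, is a `ColumnTransformer` that scales the numeric columns and one-hot encodes the categorical ones separately. This is not the notebook's method; the toy frame, its column subset, and the synthetic target are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric columns and one categorical column (hypothetical data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age': rng.integers(17, 90, size=n),
    'hours-per-week': rng.integers(1, 99, size=n),
    'sex': rng.choice(['Male', 'Female'], size=n),
})
# Synthetic target loosely tied to age so the model has something to learn
toy_y = (toy['age'] + rng.normal(0, 10, size=n)) > 45

# Scale only the numeric columns; one-hot encode only the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'hours-per-week']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])

toy_pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
toy_scores = cross_val_score(toy_pipeline, toy, toy_y, scoring='accuracy', cv=5)
print('Mean accuracy:', toy_scores.mean())
```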

In [57]:
# Cross-validate with training data
scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=10, 
                         n_jobs=-1, verbose=10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.5s finished


In [58]:
# The model scores higher than the majority class baseline of 0.7591

print('Cross-Validation Accuracy Scores:', scores)
print('Average:', scores.mean())

Cross-Validation Accuracy Scores: [0.86082192 0.84539474 0.8425672  0.85298958 0.84640702 0.85353812
 0.85134394 0.84421284 0.85189248 0.85024685]
Average: 0.8499414679883456


## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

In [24]:
features = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
target = 'income'

preprocessor = make_pipeline(ce.BaseNEncoder(), SimpleImputer(), StandardScaler())
income_X = preprocessor.fit_transform(df[features])
income_X = pd.DataFrame(income_X)
income_y = df[target]

income_X.head()

,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,0.030671,0.0,-0.025404,-0.475527,-2.539217,0.443459,-1.063611,-0.116092,-0.400559,-0.767994,...,0.148453,-0.21666,-0.035429,0.0,-0.090411,-0.154175,-0.202778,-0.27986,-0.245085,0.253631
1,0.837109,0.0,-0.025404,-0.475527,0.393822,-2.255001,-1.008707,-0.116092,-0.400559,-0.767994,...,-0.14592,-0.21666,-2.222153,0.0,-0.090411,-0.154175,-0.202778,-0.27986,-0.245085,0.253631
2,-0.042642,0.0,-0.025404,-0.475527,0.393822,0.443459,0.245079,-0.116092,-0.400559,-0.767994,...,-0.14592,-0.21666,-0.035429,0.0,-0.090411,-0.154175,-0.202778,-0.27986,-0.245085,0.253631
3,1.057047,0.0,-0.025404,-0.475527,0.393822,0.443459,0.425801,-0.116092,-0.400559,-0.767994,...,-0.14592,-0.21666,-0.035429,0.0,-0.090411,-0.154175,-0.202778,-0.27986,-0.245085,0.253631
4,-0.775768,0.0,-0.025404,-0.475527,0.393822,0.443459,1.408176,-0.116092,-0.400559,-0.767994,...,-0.14592,-0.21666,-0.035429,0.0,-0.090411,-0.154175,-0.202778,-0.27986,4.080225,-3.942743


In [22]:
# The second XGBClassifier (max_depth=5, n_estimators=200) has the highest cross-validation score, but it may be overfitting
models = [RandomForestClassifier(max_depth=None, n_estimators=100, n_jobs=-1, random_state=42),
          XGBClassifier(max_depth=3, n_estimators=100, n_jobs=-1, random_state=42),
          XGBClassifier(max_depth=5, n_estimators=200, n_jobs=-1, random_state=42)]

for model in models:
    print(model, '\n')
    score = cross_val_score(model, income_X, income_y, scoring='accuracy', cv=5).mean()
    print('Cross-Validation Accuracy:', score, '\n', '\n')

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False) 

Cross-Validation Accuracy: 0.854580712889096 
 

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=-1, nthread=None, objective='binary:logistic',
       random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1) 

Cross-Validation Accuracy: 0.8639785209395988 
 

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, ga
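Note that the cells above call `preprocessor.fit_transform` on the full DataFrame before cross-validating, so every validation fold has already influenced the encoder and scaler. One way to avoid this, sketched below, is to put the preprocessing and the model in a single pipeline so `cross_val_score` refits both on each training fold. The sketch swaps in scikit-learn's `OrdinalEncoder` for the notebook's `BaseNEncoder`, and the toy data is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical features (hypothetical data)
rng = np.random.default_rng(42)
n = 300
toy_X = pd.DataFrame({
    'workclass': rng.choice(['Private', 'State-gov', 'Self-emp-inc'], size=n),
    'education': rng.choice(['HS-grad', 'Bachelors', 'Masters'], size=n),
})
# Synthetic target: mostly positive for rows with education beyond HS-grad
toy_y = ((toy_X['education'] != 'HS-grad') & (rng.random(n) > 0.2)).astype(int)

# Encoder and model live in one estimator, so each CV fold fits its own encoder
leak_free = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

fold_scores = cross_val_score(leak_free, toy_X, toy_y, scoring='accuracy', cv=5)
print('Cross-Validation Accuracy:', fold_scores.mean())
```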

### Train Set

In [0]:
# Incomplete
features = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
target = 'income'

X = df[features]
y = df[[target]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=42, stratify=y_train)

In [0]:
features = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
target = 'income'

# Fit the preprocessor on the training rows only
preprocessor = make_pipeline(ce.BaseNEncoder(), SimpleImputer(), StandardScaler())
income_X = preprocessor.fit_transform(X_train[features])
income_train_X = pd.DataFrame(income_X)
# Use the training labels so X and y cover the same rows; encode as 0/1 for XGBoost
income_train_y = (y_train[target] == '>50K').astype(int)

income_train_X.head()

In [0]:
from sklearn.model_selection import cross_val_predict

models = [RandomForestClassifier(max_depth=None, n_estimators=200, n_jobs=-1, random_state=42),
          XGBClassifier(max_depth=3, n_estimators=100, n_jobs=-1, random_state=42),
          XGBClassifier(max_depth=5, n_estimators=200, n_jobs=-1, random_state=42)]

for model in models:
    print(model, '\n')
    # Out-of-fold predictions for each training row
    y_pred = cross_val_predict(model, income_train_X, income_train_y, cv=5)
    score = cross_val_score(model, income_train_X, income_train_y, scoring='accuracy', cv=5).mean()
    print('Cross-Validation Accuracy:', score, '\n', '\n')

## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

In [0]:
# Assign matrix values to variables
true_negative  = 85
false_positive = 58
false_negative = 8
true_positive  = 36

# Assign actual and predicted values to variables
actual_negative = 85 + 58
actual_positive = 8 + 36
predicted_negative = 85 + 8
predicted_positive = 58 + 36

Calculate accuracy

In [0]:
# Accuracy Score
accuracy = (true_positive + true_negative) / (predicted_negative + predicted_positive)
print('Accuracy: ', accuracy)

Calculate precision

In [0]:
# Precision Score
precision = true_positive / predicted_positive
print('Precision: ', precision)

Calculate recall

In [0]:
# Recall score
recall = true_positive / actual_positive
print('Recall: ', recall)

Calculate F1 score and false positive rate

In [0]:
# F1 Score
f1 = 2 * precision * recall / (precision + recall) 
print('F1 score: ', f1)

# False Positive Rate
false_positive_rate = false_positive / (false_positive + true_negative)
print('False Positive Rate: ', false_positive_rate)
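As a cross-check on the hand calculations, the four cell counts can be expanded into label arrays and passed to scikit-learn's metric functions. This is only a verification sketch; the array names are hypothetical, and the counts come from the confusion matrix above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Cell counts from the confusion matrix above
tn, fp, fn, tp = 85, 58, 8, 36

# Rebuild label arrays: actual negatives first, then actual positives
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Confirm the arrays reproduce the original matrix (order: tn, fp, fn, tp)
assert confusion_matrix(y_true, y_hat).ravel().tolist() == [tn, fp, fn, tp]

check_accuracy = accuracy_score(y_true, y_hat)
check_precision = precision_score(y_true, y_hat)
check_recall = recall_score(y_true, y_hat)
check_f1 = f1_score(y_true, y_hat)
check_fpr = fp / (fp + tn)  # scikit-learn has no direct FPR scorer

print('Accuracy: ', check_accuracy)
print('Precision: ', check_precision)
print('Recall: ', check_recall)
print('F1 score: ', check_f1)
print('False Positive Rate: ', check_fpr)
```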

## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.
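
One way to approach the hyperparameter part is a cross-validated grid search over a pipeline. The sketch below is a minimal example on synthetic data from `make_classification`; the grid values for `C` are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the training set)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# make_pipeline names each step after its lowercased class name,
# so the regularization strength is addressed as 'logisticregression__C'
search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={'logisticregression__C': [0.01, 0.1, 1.0, 10.0]},
    scoring='accuracy',
    cv=5,
)
search.fit(X_toy, y_toy)

print('Best params:', search.best_params_)
print('Best cross-validation accuracy:', search.best_score_)
```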

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 
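
The workflow this asks for can be sketched as: compare candidates by cross-validation on the train set, refit the winner on the whole train set, then score it once on the held-out test set. The example below uses synthetic data and two scikit-learn ensembles as stand-ins; the candidate names and settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the census features and target
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(random_state=42),
}

# Model selection uses the train set only
cv_means = {
    name: cross_val_score(model, X_tr, y_tr, scoring='accuracy', cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the best candidate on the full train set, then evaluate once on the test set
best_model = candidates[best_name].fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print('Best model by CV:', best_name)
print('Test accuracy:', test_accuracy)
```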

### Part 4
Calculate F1 score and False Positive Rate. 