_Lambda School Data Science Unit 2_

# Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [1]:
import pandas as pd

columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()
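
The `.str.strip()` call is needed because fields in `adult.data` are separated by a comma followed by a space, so every string value is read with a leading space. As a sketch of an alternative, pandas can drop that space for all columns at load time with `skipinitialspace=True`. The two rows below are made up for illustration and use only a subset of the columns.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data
raw = "39, State-gov, Male, <=50K\n52, Private, Female, >50K\n"
cols = ['age', 'workclass', 'sex', 'income']

# Default parsing keeps the space that follows each comma
default = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# skipinitialspace=True removes it for every column while reading
clean = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                    skipinitialspace=True)

print(repr(default.loc[0, 'sex']))  # ' Male'
print(repr(clean.loc[0, 'sex']))    # 'Male'
```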

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [2]:
X = df.drop(columns='income')
y = df['income']
X.shape, y.shape

((32561, 14), (32561,))

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [3]:
y.value_counts(normalize=True)

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64

A majority class baseline would score 75.9% accuracy by always predicting that a person's income is at most $50K (`<=50K`).
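
The same baseline can also be computed with scikit-learn. The sketch below assumes a toy target with a 75/25 class split standing in for the `income` column; `X_toy` and `y_toy` are hypothetical names, and the features are ignored by the dummy model.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy target with a 75/25 split, standing in for the income column
y_toy = pd.Series(['<=50K'] * 75 + ['>50K'] * 25)
X_toy = pd.DataFrame({'placeholder': range(len(y_toy))})  # ignored by the model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))

# The accuracy equals the share of the majority class
majority_share = y_toy.value_counts(normalize=True).max()
print(baseline_accuracy, majority_share)
```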

What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

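A majority class baseline would get a ROC AUC score of 0.5. It assigns every observation the same score, so it cannot rank positives above negatives any better than chance.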
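
A quick check of that answer, as a sketch on made-up labels (75 negatives, 25 positives): giving every observation the same score leaves only the diagonal ROC curve, whose area is 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels: 75 negatives and 25 positives
y_true = np.array([0] * 75 + [1] * 25)

# A majority class baseline gives every observation the same score
constant_scores = np.zeros(len(y_true))

baseline_auc = roc_auc_score(y_true, constant_scores)
print(baseline_auc)  # 0.5
```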

In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26048, 14), (6513, 14), (26048,), (6513,))

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**

## Data Dictionary
- Built data dictionary to prepare for model building

In [18]:
data_dict = pd.read_excel('Sprint_Data_Dict.xlsx')
data_dict

Unnamed: 0,Attribute,Dtype,Unique Values,Processing (Regression),Processing (Trees),Notes
0,age,int,73,Standardization,,
1,workclass,object,9,OneHot,Ordinal,? Represents unkown
2,fnlwgt,int,18440,Standardization,,
3,education,object,16,Hash,Hash,Consider creating new feature with reduced cat...
4,education-num,int,16,Standardization,,Implied ordering is warranted
5,marital-status,object,7,OneHot,Ordinal,Ordering is meaningless
6,occupation,object,15,Hash,Ordinal,Consider creating new features with reduced ca...
7,relationship,object,6,OneHot,Ordinal,
8,race,object,5,OneHot,Ordinal,Consider grouping
9,sex,object,2,Switch to binary int,Switch to binary int,


## Prepare and Make Features

In [19]:
import numpy as np

In [20]:
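def make_features(X):
    X = X.copy()
    # String values are read with a leading space (' Female'), so strip before mapping
    X['sex'] = X['sex'].str.strip().map({'Female': 0, 'Male': 1})
    # Interaction between age and years of education
    X['Age*Education'] = X['age'] * X['education-num']
    # 1 if the person is both White and from the United States, else 0
    X['White_American'] = np.where( (X['race'].str.strip() == 'White') & (X['native-country'].str.strip() == 'United-States'), 1, 0)
    return X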

In [21]:
X_train = make_features(X_train)
X_test = make_features(X_test)

## Prepare Target

In [22]:
def prepare_target(y):
    y = y.replace({'<=50K':0,'>50K':1})
    return y

In [23]:
y_train = prepare_target(y_train)
y_test = prepare_target(y_test)

## Processing

In [24]:
import category_encoders as ce

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler

In [25]:
ordinal_features = []
one_hot_features = ['workclass','marital-status','relationship','race']
hash_features = ['education','native-country']
num_features = ['age','fnlwgt','education-num','Age*Education','capital-gain','capital-loss','hours-per-week']

ordinal_processor = make_pipeline(
    ce.OrdinalEncoder()
)

one_hot_processor = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True)
)

hash_processor = make_pipeline(
    ce.HashingEncoder(n_components=8)
)

num_processor = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)

# Columns not listed below ('occupation', 'sex', 'White_American') are dropped,
# because make_column_transformer defaults to remainder='drop'.
preprocess = make_column_transformer(
#   (ordinal_processor, ordinal_features),  # not used
    (one_hot_processor, one_hot_features),
    (hash_processor, hash_features),
    (num_processor, num_features)
)
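
In the `preprocess` transformer, `occupation`, `sex`, and `White_American` do not appear in any feature list, so they never reach the model. The sketch below illustrates that default on a made-up three-row frame, using only scikit-learn; `remainder='passthrough'` is one way to keep unlisted columns, shown here as an option rather than as what this notebook does.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Made-up frame; 'sex' is deliberately left out of the transformer lists
toy = pd.DataFrame({
    'age': [25, 40, 60],
    'hours-per-week': [40, 50, 20],
    'sex': [0, 1, 1],
})

# Default remainder='drop': unlisted columns are discarded
drop_rest = make_column_transformer(
    (StandardScaler(), ['age', 'hours-per-week'])
)

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = make_column_transformer(
    (StandardScaler(), ['age', 'hours-per-week']),
    remainder='passthrough'
)

dropped_shape = drop_rest.fit_transform(toy).shape
kept_shape = keep_rest.fit_transform(toy).shape
print(dropped_shape, kept_shape)  # (3, 2) (3, 3)
```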

## Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    preprocess,
    LogisticRegression(solver='lbfgs', class_weight=None, max_iter=1000)
)

scoring = ['accuracy', 'f1','roc_auc']

In [27]:
from sklearn.model_selection import cross_val_score, cross_validate

# Note: switched to cross_validate to report several score types at once
# scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=3)

scores = cross_validate(model, X_train, y_train, return_train_score=True, scoring=scoring, cv=5)

In [28]:
print('Train Cross-Validation Mean Accuracy Score:', scores['test_accuracy'].mean())
print('Train Cross-Validation Mean F1 Score:', scores['test_f1'].mean())
print('Train Cross-Validation Mean ROC AUC Score:', scores['test_roc_auc'].mean())

Train Cross-Validation Mean Accuracy Score: 0.8457464767350471
Train Cross-Validation Mean F1 Score: 0.6429026994600042
Train Cross-Validation Mean ROC AUC Score: 0.8990414011712866
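
The `scores` object returned by `cross_validate` is a dictionary of per-fold arrays. As a sketch on synthetic data (`X_demo` and `y_demo` are hypothetical names, generated with `make_classification`), the keys follow the pattern `test_<metric>` for validation folds and, with `return_train_score=True`, `train_<metric>` for the folds used in fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    scoring=['accuracy', 'roc_auc'], return_train_score=True, cv=5
)

# Timing entries plus one train/test array per metric
print(sorted(results))

# One value per fold; a large train/validation gap suggests overfitting
print(results['train_accuracy'].mean(), results['test_accuracy'].mean())
```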


## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

## Random Forest

In [29]:
from sklearn.ensemble import RandomForestClassifier

grow_forest = make_pipeline(
    preprocess,
    RandomForestClassifier(max_depth=8, n_estimators=100)
)

scores = cross_validate(grow_forest, X_train, y_train, return_train_score=True, scoring=scoring, cv=3)

print('Train Cross-Validation Mean Accuracy Score:', scores['test_accuracy'].mean())
print('Train Cross-Validation Mean F1 Score:', scores['test_f1'].mean())
print('Train Cross-Validation Mean ROC AUC Score:', scores['test_roc_auc'].mean())

Train Cross-Validation Mean Accuracy Score: 0.8538852526163878
Train Cross-Validation Mean F1 Score: 0.6411671126154076
Train Cross-Validation Mean ROC AUC Score: 0.9076054723816313


## Gradient Boosting

In [30]:
import os
# Workaround for the duplicate OpenMP runtime error that xgboost can trigger in some environments
os.environ['KMP_DUPLICATE_LIB_OK']='True'
from xgboost import XGBClassifier

In [31]:
grad_booster = make_pipeline(
    preprocess,
    XGBClassifier(n_estimators=100)
)

scores = cross_validate(grad_booster, X_train, y_train, return_train_score=True, scoring=scoring, cv=5)

print('Train Cross-Validation Mean Accuracy Score:', scores['test_accuracy'].mean())
print('Train Cross-Validation Mean F1 Score:', scores['test_f1'].mean())
print('Train Cross-Validation Mean ROC AUC Score:', scores['test_roc_auc'].mean())

Train Cross-Validation Mean Accuracy Score: 0.8614867299288955
Train Cross-Validation Mean F1 Score: 0.6726656981590479
Train Cross-Validation Mean ROC AUC Score: 0.9165097821734779


## Gradient Boosting Classifier on Test Set

In [32]:
X_train

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Age*Education,White_American
5514,33,Local-gov,198183,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,50,United-States,429,1
19777,36,Private,86459,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,50,United-States,396,1
10781,58,Self-emp-not-inc,203039,9th,5,Separated,Craft-repair,Not-in-family,White,Male,0,0,40,United-States,290,1
32240,21,Private,180190,Assoc-voc,11,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,46,United-States,231,1
9876,27,Private,279872,Some-college,10,Divorced,Other-service,Not-in-family,White,Male,0,0,40,United-States,270,1
5455,44,Private,175485,Bachelors,13,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,12,United-States,572,1
8615,33,Private,67006,10th,6,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,0,45,United-States,198,1
29805,62,Self-emp-not-inc,75478,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,806,1
15081,20,Private,374116,HS-grad,9,Never-married,Prof-specialty,Other-relative,White,Female,0,0,40,United-States,180,1
17203,33,Private,23871,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,0,30,United-States,297,1


In [33]:
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

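# Fit the preprocessing on the training data only, then apply it to the test data
X_train = preprocess.fit_transform(X_train, y_train)
X_test = preprocess.transform(X_test)

# Refit the best cross-validated model (gradient boosting) on the full training set
booster = XGBClassifier(n_estimators=100)
booster.fit(X_train, y_train)

# Hard class predictions for accuracy and F1; positive-class probabilities for ROC AUC
y_pred = booster.predict(X_test)
y_pred_proba = booster.predict_proba(X_test)[:,1]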

print('Test Accuracy Score:', accuracy_score(y_test, y_pred) )
print('Test F1 Score:', f1_score(y_test, y_pred))
print('Test ROC AUC Score:', roc_auc_score(y_test, y_pred_proba))

conf_mat = pd.DataFrame(confusion_matrix(y_test, y_pred), 
             columns=['Predicted Negative', 'Predicted Positive'], 
             index=['Actual Negative', 'Actual Positive'])

# False positive rate = FP / (FP + TN), taken from the "Actual Negative" row
fpr = conf_mat.iat[0,1] / (conf_mat.iat[0,1] + conf_mat.iat[0,0])

print('False Positive Rate:', fpr)
print(conf_mat)

Test Accuracy Score: 0.8673422385997236
Test F1 Score: 0.6909871244635193
Test ROC AUC Score: 0.9189037777750874
False Positive Rate: 0.05240793201133145
                 Predicted Negative  Predicted Positive
Actual Negative                4683                 259
Actual Positive                 605                 966
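
As a cross-check, the reported test metrics can be recomputed from the four counts in this confusion matrix alone. This is a sketch that hard-codes the counts from the output; it relies on scikit-learn's layout for binary labels, where `ravel()` yields TN, FP, FN, TP in that order.

```python
import numpy as np

# Counts copied from the test-set confusion matrix
# (rows = actual, columns = predicted)
cm = np.array([[4683, 259],
               [605, 966]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)
print('False Positive Rate:', fpr)
```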


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

Calculate accuracy

Accuracy = Total Correct Predictions / Total Predictions

In [34]:
(36 + 85) / (85+58+8+36)

0.6470588235294118

Calculate precision

Precision = Correctly Predicted Positives / Total Positive Predictions

In [35]:
36 / ( 36+ 58)

0.3829787234042553

Calculate recall

Recall = Correctly Predicted Positives / Total Actual Positives

In [36]:
36 / (8 + 36)

0.8181818181818182
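
The F1 score and false positive rate for the same confusion matrix follow from the same four counts. A minimal sketch, with the counts hard-coded from the table in this part:

```python
# Counts from the Part 4 confusion matrix
tn, fp = 85, 58   # actual negatives: predicted negative, predicted positive
fn, tp = 8, 36    # actual positives: predicted negative, predicted positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# False positive rate: share of actual negatives predicted positive
fpr = fp / (fp + tn)

print('F1:', f1)
print('False Positive Rate:', fpr)
```

Here F1 works out to 72/138 (about 0.52) and the false positive rate to 58/143 (about 0.41).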

## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 

### Part 4
Calculate F1 score and False Positive Rate. 