 _Lambda School Data Science Unit 2_
 
 # Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [0]:
import pandas as pd

columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

In [113]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

In [114]:
df.head(10)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [0]:
df['income'] = df['income'].astype('str')

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [0]:
X = df.drop(columns='income')
y = df['income'] == '>50K'

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [117]:
df['income'].value_counts(normalize=True)

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64

In [0]:
majority_class = y.mode()[0]
y_pred = [majority_class]*len(y)

In [119]:
from sklearn.metrics import accuracy_score

accuracy_score(y, y_pred)

0.7591904425539756

In [0]:
## Majority class baseline accuracy: about 0.759
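The same baseline can also be obtained from scikit-learn's `DummyClassifier`. The sketch below is only an illustration on made-up labels (`toy_X` and `toy_y` are hypothetical names, not part of the challenge data), so it does not reproduce the 0.759 figure above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (assumed for illustration): three negatives, one positive
toy_y = np.array([False, False, False, True])
toy_X = np.zeros((len(toy_y), 1))  # the dummy model ignores the features

# 'most_frequent' always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X, toy_y)

# score() returns accuracy: 3 of the 4 labels are the majority class
baseline_accuracy = baseline.score(toy_X, toy_y)
print(baseline_accuracy)  # 0.75
```

On the census data, fitting the same estimator on `X` and `y` should give the majority class share as its accuracy.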

What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

In [121]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y, y_pred)

0.5

In [0]:
## The ROC AUC score would be 0.5: no better than random guessing, and no worse either.
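One way to see why a constant prediction gives 0.5 is to read ROC AUC as the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as half. The helper below, `roc_auc_by_pairs`, is a hypothetical pure-Python sketch of that definition on toy labels; it is not how scikit-learn computes the score internally.

```python
def roc_auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        raise ValueError('Both classes must be present to compute ROC AUC')
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [False, False, False, True]

# A majority class baseline gives every row the same score, so every pair is a tie
constant_auc = roc_auc_by_pairs(toy_labels, [0, 0, 0, 0])
print(constant_auc)  # 0.5

# Scores that rank the positive above all negatives give a perfect AUC
perfect_auc = roc_auc_by_pairs(toy_labels, [0.1, 0.2, 0.3, 0.9])
print(perfect_auc)  # 1.0
```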

In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=42, stratify=y
)

In [124]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26048, 14), (6513, 14), (26048,), (6513,))

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**

In [125]:
!pip install category_encoders



In [126]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import category_encoders as ce

X_train.select_dtypes(exclude='number').nunique().sort_values()

sex                2
race               5
relationship       6
marital-status     7
workclass          9
occupation        15
education         16
native-country    42
dtype: int64

In [127]:
pipeline1 = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

scores = cross_val_score(pipeline1, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1, verbose=10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    6.9s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   14.4s finished


In [128]:
scores.mean()

0.8507752447956076
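The instructions ask for scaled numeric features and encoded categorical features. A possible alternative, sketched here on a small synthetic frame, is to route each column type through its own transformer with scikit-learn's `ColumnTransformer`, so that only the numeric columns are standardized. The `toy` data, its target rule, and all `toy_*` names are assumptions made for illustration; the score it prints says nothing about the census data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the census data (assumed, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'hours-per-week': rng.integers(10, 60, size=n),
    'sex': rng.choice(['Male', 'Female'], size=n),
    'workclass': rng.choice(['Private', 'State-gov', 'Self-emp-not-inc'], size=n),
})
toy_target = (toy['age'] + toy['hours-per-week'] + rng.normal(0, 10, size=n)) > 80

numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Scale numeric columns and one-hot encode categorical columns separately
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

toy_pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# The preprocessing is refit inside each fold, so no fold sees the others' statistics
toy_scores = cross_val_score(toy_pipeline, toy, toy_target, scoring='accuracy', cv=5)
print(toy_scores.mean())
```

Whether this beats scaling the one-hot columns as well would need to be checked with cross-validation on `X_train`.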

## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

In [129]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler


pipeline2 = make_pipeline(
    ce.OrdinalEncoder(),
    MinMaxScaler(),
    RandomForestClassifier(max_depth=3, n_estimators=100, n_jobs=-1, random_state=42)
)


scores = cross_val_score(pipeline2, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1, verbose=10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    8.0s finished


In [130]:
scores.mean()

0.8364171469430387

## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid">36</td>
  </tr>
</table>

In [0]:
true_negative  = 85
false_positive = 58
false_negative = 8
true_positive  = 36

actual_negative = true_negative + false_positive
actual_positive = false_negative + true_positive

predicted_negative = true_negative + false_negative
predicted_positive = false_positive + true_positive

Calculate accuracy

In [132]:
accuracy = (true_positive + true_negative) / (predicted_negative + predicted_positive)  # correct predictions / all predictions
accuracy

0.6470588235294118

Calculate precision

In [133]:
precision = true_positive / predicted_positive 
precision

0.3829787234042553

Calculate recall

In [134]:
recall = true_positive / actual_positive 
recall

0.8181818181818182

## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 

### Part 4
Calculate F1 score and False Positive Rate. 
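A minimal sketch of both metrics, using the counts from the Part 4 confusion matrix and the standard definitions: F1 is the harmonic mean of precision and recall, and False Positive Rate is the share of actual negatives that were predicted positive.

```python
# Counts from the Part 4 confusion matrix
true_negative = 85
false_positive = 58
false_negative = 8
true_positive = 36

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# False Positive Rate: false positives / all actual negatives
false_positive_rate = false_positive / (false_positive + true_negative)

print(round(f1, 4))                   # 0.5217
print(round(false_positive_rate, 4))  # 0.4056
```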

In [0]:
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier


columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

In [164]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [165]:
df.select_dtypes(exclude='number').nunique().sort_values()

sex                2
income             2
race               5
relationship       6
marital-status     7
workclass          9
occupation        15
education         16
native-country    42
dtype: int64

In [0]:
df = df.drop(columns='education')

In [0]:

def make_features(X):
    # Bucket age to the nearest multiple of 10
    X = X.copy()
    X['age_round_10'] = X['age'].round(-1)
    return X

df = make_features(df)
  

In [0]:
df = df.drop(columns='age')

In [0]:
X = df.drop(columns='income')
y = df['income'] == '>50K'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=42, stratify=y
)

In [170]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((26048, 13), (26048,), (6513, 13), (6513,))

In [171]:
X_train.head()

,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,age_round_10
15738,Private,37210,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,30
27985,Private,101950,14,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,45,United-States,40
30673,?,122244,9,Never-married,?,Not-in-family,White,Female,0,0,28,United-States,20
9505,Local-gov,24763,10,Divorced,Transport-moving,Unmarried,White,Male,6849,0,40,United-States,40
26417,Private,113936,13,Never-married,Prof-specialty,Own-child,White,Male,0,0,40,United-States,20


In [172]:
preprocessor = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler()
)

# StandardScaler returns a NumPy array, so restore the column names from X_train
df_X = preprocessor.fit_transform(X_train)
df_X = pd.DataFrame(df_X, columns=X_train.columns)

df_X.describe()



,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,age_round_10
count,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0
mean,2.958151e-16,-8.352246000000001e-17,7.501592e-16,-1.148114e-16,-1.43055e-16,7.295748e-17,2.691413e-15,-1.4274e-15,1.881109e-16,-5.852518e-16,-3.354668e-16,-2.532794e-15,1.370721e-15
std,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019,1.000019
min,-0.5636901,-1.684387,-3.514975,-0.7812256,-1.50991,-0.9063975,-0.3451857,-0.7028558,-0.1453187,-0.2173618,-3.179475,-0.2575341,-1.322975
25%,-0.5636901,-0.680587,-0.417965,-0.7812256,-0.6743843,-0.9063975,-0.3451857,-0.7028558,-0.1453187,-0.2173618,-0.03871413,-0.2575341,-0.6086865
50%,-0.5636901,-0.1085065,-0.03083872,0.08688431,-0.1173674,-0.1937121,-0.3451857,-0.7028558,-0.1453187,-0.2173618,-0.03871413,-0.2575341,0.1056022
75%,-0.0255556,0.4497835,0.7434139,0.08688431,0.7181579,0.5189733,-0.3451857,1.422767,-0.1453187,-0.2173618,0.3639475,-0.2575341,0.8198908
max,3.741386,12.2711,2.291919,4.427434,2.389208,2.65703,5.31067,1.422767,13.55503,10.53884,4.712693,7.256144,3.677045


In [0]:
models = [LogisticRegression(solver='lbfgs', max_iter=1000),
          DecisionTreeClassifier(max_depth=3),
          DecisionTreeClassifier(max_depth=None),
          RandomForestClassifier(max_depth=3, n_estimators=100, n_jobs=-1, random_state=42),
          RandomForestClassifier(max_depth=None, n_estimators=100, n_jobs=-1, random_state=42),
          XGBClassifier(max_depth=3, n_estimators=100, n_jobs=-1, random_state=42)]



In [176]:
for model in models:
    print(model, '\n')
    score = cross_val_score(model, df_X, y_train, scoring='accuracy', cv=5).mean()
    print('Cross_Validation Accuracy:', score, '\n', '\n')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False) 

Cross_Validation Accuracy: 0.8371468471997197 
 

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best') 

Cross_Validation Accuracy: 0.8429052698912887 
 

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0

In [177]:
from sklearn.metrics import accuracy_score

pipe = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    XGBClassifier(max_depth=3, n_estimators=100, n_jobs=-1, random_state=42)
)

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
accuracy_score(y_test,y_pred)



0.8691847075080608

## Got an accuracy score of about 0.87 on the test set