<a href="https://colab.research.google.com/github/bundickm/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/DS_Unit_2_Sprint_Challenge_3_Classification_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science Unit 2_

# Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [0]:
#given code block
import pandas as pd

columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [121]:
print('Starting Shape:', df.shape)

#split into features and target
X = df.drop('income',axis='columns')
y = df['income']

#Check it worked
print('Ending Shapes:', X.shape,y.shape)

Starting Shape: (32561, 15)
Ending Shapes: (32561, 14) (32561,)


What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [122]:
#75.919% Accuracy just guessing the majority class
y.value_counts(normalize=True)

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64
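As a cross-check on the figure above, here is a minimal sketch of the majority class baseline in plain Python. The helper name `majority_baseline_accuracy` is hypothetical, and the class counts (24720 and 7841) are the ones implied by the proportions above over 32561 rows. scikit-learn's `DummyClassifier(strategy='most_frequent')` would be another way to get the same number.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Class counts implied by the value_counts output (assumed, not re-downloaded).
labels = ['<=50K'] * 24720 + ['>50K'] * 7841

majority_label, baseline_accuracy = majority_baseline_accuracy(labels)
print(majority_label, round(baseline_accuracy, 5))
```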

What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

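 **Answer**: A majority class baseline (like the one above) gives every observation the same score, so it cannot rank positives above negatives. Its ROC AUC is 0.5, the same as random guessing.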
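To make the 0.5 concrete, here is a small sketch that computes ROC AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The function `roc_auc` and the toy labels are hypothetical and written only for illustration; in practice `sklearn.metrics.roc_auc_score` would do this job.

```python
def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy labels (an assumption for illustration): 1 marks the positive class.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]

# A majority class baseline scores every row identically.
constant_scores = [0.0] * len(y_true)
baseline_auc = roc_auc(y_true, constant_scores)

# For contrast, scores that separate the classes perfectly.
perfect_auc = roc_auc(y_true, [float(t) for t in y_true])

print(baseline_auc, perfect_auc)
```

Because every positive ties with every negative under constant scores, each pair contributes one half, which is where the 0.5 comes from.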

In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [124]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=42, stratify=y)

#prove the split worked
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26048, 14), (6513, 14), (26048,), (6513,))
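The shapes above follow from simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the train set takes the remainder, which matches the output shown. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 32561 rows with a 20% hold-out.
n_train, n_test = split_sizes(32561, 0.2)
print(n_train, n_test)
```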

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**

In [10]:
#some quick exploring
X.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


In [11]:
#some quick exploring
X.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [14]:
#some quick exploring
X.describe(exclude='number')

| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| count | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States |
| freq | 22696 | 10501 | 14976 | 4140 | 13193 | 27816 | 21790 | 29170 |


In [19]:
#confirming no nulls
X.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
dtype: int64

In [0]:
#all of the necessary imports

# !pip install category_encoders

import category_encoders as ce
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [0]:
#wrap the CSV loading in a function so the data is easy to reload after a
#mistake or a restart of the environment
def prep_csv():
  columns = ['age', 
             'workclass', 
             'fnlwgt', 
             'education', 
             'education-num', 
             'marital-status', 
             'occupation', 
             'relationship', 
             'race', 
             'sex', 
             'capital-gain', 
             'capital-loss', 
             'hours-per-week', 
             'native-country', 
             'income']
  df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

  df['income'] = df['income'].str.strip()
  return df

In [0]:
def feature_engineer(df):
  #prepping for stretch goals
  return df

In [108]:
#get our csv
df = prep_csv()

#prep our train and test variables
X = df.drop('income',axis='columns')
y = (df['income'] == '<=50K') #50k or less is True
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=42, stratify=y)

#feature engineer #prepping for stretch goals
# X_train = feature_engineer(X_train)

#make a pipeline
pipe = make_pipeline(
  #one-hot the low cardinality categoricals, ordinal for the rest
  ce.OneHotEncoder(use_cat_names=True, 
                   cols=['workclass','marital-status',
                         'relationship','race','sex']),
  ce.OrdinalEncoder(cols=['education','occupation','native-country']),
  MinMaxScaler(),
  LogisticRegression(solver='lbfgs'))

score = cross_val_score(pipe, X_train, y_train, cv=10,
                        scoring='accuracy', n_jobs=-1).mean()

#print accuracy score, it should beat 0.75919 or something went wrong
print('Accuracy:', score)

Accuracy: 0.843481032687589


## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

In [107]:
#get our csv
df = prep_csv()

#prep our train and test variables
X = df.drop('income',axis='columns')
y = (df['income'] == '<=50K') #50k or less is True
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=42, stratify=y)

#feature engineer #prepping for stretch goals
# X_train = feature_engineer(X_train)

#make a pipeline
pipe = make_pipeline(
  #one-hot the low cardinality categoricals, ordinal for the rest
  ce.OneHotEncoder(use_cat_names=True, 
                   cols=['workclass','marital-status',
                         'relationship','race','sex']),
  ce.OrdinalEncoder(cols=['education','occupation','native-country']),
  MinMaxScaler(),
  RandomForestClassifier(n_estimators=100, max_depth=5))

score = cross_val_score(pipe, X_train, y_train, cv=10,
                        scoring='accuracy', n_jobs=-1).mean()

#print accuracy score, it should beat 0.75919 or something went wrong
print('Accuracy:', score)

Accuracy: 0.8487796727085097


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

Calculate accuracy
\begin{align}Accuracy = \frac{\text{True Positives + True Negatives}}{\text{Total Number of Predictions}}\end{align}

In [7]:
true_pos = 36
true_neg = 85
total_pred = 85+58+8+36

accuracy = (true_pos + true_neg)/total_pred

print('Accuracy:',accuracy)

Accuracy: 0.6470588235294118


Calculate precision
\begin{align}Precision = \frac{\text{True Positives}}{\text{True Positives + False Positives}}\end{align}

In [8]:
false_pos = 58

precision = true_pos/(true_pos + false_pos)

print('Precision:',precision)

Precision: 0.3829787234042553


Calculate recall
\begin{align}Recall = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}\end{align}

In [9]:
false_neg = 8

recall = true_pos/(true_pos + false_neg)

print('Recall:', recall)

Recall: 0.8181818181818182
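The same four counts from the confusion matrix also give the two bonus metrics, F1 score and False Positive Rate. This is a minimal sketch using the standard definitions (F1 as the harmonic mean of precision and recall, FPR as false positives over all actual negatives). The variables are redefined here so the cell runs on its own.

```python
# Counts read off the confusion matrix above.
true_pos = 36
true_neg = 85
false_pos = 58
false_neg = 8

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# False Positive Rate: share of actual negatives predicted positive.
false_pos_rate = false_pos / (false_pos + true_neg)

print('F1 Score:', f1)
print('False Positive Rate:', false_pos_rate)
```

Equivalently, F1 can be written as `2*TP / (2*TP + FP + FN)`, which here is 72/138, and the False Positive Rate is 58/143.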


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 

### Part 4
Calculate F1 score and False Positive Rate. 

In [0]:
#all of the necessary imports

# !pip install category_encoders

import category_encoders as ce
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [0]:
#wrap the CSV loading in a function so the data is easy to reload after a
#mistake or a restart of the environment
def prep_csv():
  columns = ['age', 
             'workclass', 
             'fnlwgt', 
             'education', 
             'education-num', 
             'marital-status', 
             'occupation', 
             'relationship', 
             'race', 
             'sex', 
             'capital-gain', 
             'capital-loss', 
             'hours-per-week', 
             'native-country', 
             'income']
  df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

  df['income'] = df['income'].str.strip()
  return df

In [0]:
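def feature_engineer(df):
  #work on a copy so the caller's DataFrame (a slice from train_test_split)
  #is not modified in place, which avoids pandas' SettingWithCopyWarning
  df = df.copy()
  #flag two occupations with a high share of >50K earners
  df['busi_prof'] = ((df['occupation'].str.contains('Exec-managerial')) |
                     (df['occupation'].str.contains('Prof-specialty')))
  #merge the two tiny no-income workclass categories into one
  df['workclass'] = df['workclass'].str.strip().replace({'Without-pay':'No_income',
                                                         'Never-worked':'No_income'})
  #net capital change
  df['net_gain'] = df['capital-gain'] - df['capital-loss']
  return df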

In [127]:
#get our csv
df = prep_csv()

#prep our train and test variables
X = df.drop(['income', 'fnlwgt'],axis='columns')
y = (df['income'] == '<=50K') #50k or less is True
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=42, stratify=y)

#feature engineer
X_train = feature_engineer(X_train)

#make a pipeline
pipe = make_pipeline(
  #one-hot the low cardinality categoricals, ordinal for the rest
  ce.OneHotEncoder(use_cat_names=True, 
                   cols=['workclass','marital-status',
                         'relationship','race','sex','busi_prof']),
  ce.OrdinalEncoder(cols=['education','occupation','native-country']),
  MinMaxScaler(),
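  #boosting was the best model by a small margin
  #(class_weight is not an XGBClassifier parameter, so it is dropped here;
  # scale_pos_weight is XGBoost's option for reweighting classes)
  XGBClassifier(n_estimators=100, max_depth=5))

accuracy = cross_val_score(pipe, X_train, y_train, cv=10,
                        scoring='accuracy', n_jobs=-1).mean()
#y is True for <=50K, so this F1 score is for the <=50K class
f1 = cross_val_score(pipe, X_train, y_train, cv=10,
                        scoring='f1', n_jobs=-1).mean()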

#print accuracy score, it should beat 0.75919 or something went wrong
print('Accuracy:', accuracy)
print('F1 Score:', f1)

Accuracy: 0.8713918821339594
F1 Score: 0.917883371098607


## With the Test Data

In [116]:
#get our csv
df = prep_csv()

#prep our train and test variables
X = df.drop(['income', 'fnlwgt'],axis='columns')
y = (df['income'] == '<=50K') #50k or less is True
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=42, stratify=y)

#feature engineer
X_test = feature_engineer(X_test)

#make a pipeline
pipe = make_pipeline(
  #one-hot the low cardinality categoricals, ordinal for the rest
  ce.OneHotEncoder(use_cat_names=True, 
                   cols=['workclass','marital-status',
                         'relationship','race','sex','busi_prof']),
  ce.OrdinalEncoder(cols=['education','occupation','native-country']),
  MinMaxScaler(),
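  #class_weight is not an XGBClassifier parameter, so it is dropped here;
  #scale_pos_weight is XGBoost's option for reweighting classes
  XGBClassifier(n_estimators=100, max_depth=5))

#note: this cross-validates within the 20% test set; it does not score a
#model that was fit on the train set
accuracy = cross_val_score(pipe, X_test, y_test, cv=10,
                        scoring='accuracy', n_jobs=-1).mean()
#y is True for <=50K, so this F1 score is for the <=50K class
f1 = cross_val_score(pipe, X_test, y_test, cv=10,
                        scoring='f1', n_jobs=-1).mean()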

#print accuracy score, it should beat 0.75919 or something went wrong
print('Accuracy:', accuracy)
print('F1 Score:', f1)

Accuracy: 0.8661138242323677
F1 Score: 0.9145301015362666
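The scores above come from cross-validating inside the test set. The bonus task asks for something slightly different: refit on the train set, then score once on the held-out test set. Here is a minimal sketch of that pattern. It uses synthetic data from `make_classification` and a plain logistic regression as stand-ins (both are assumptions for illustration, not the census data or the XGBoost pipeline), so the numbers it prints say nothing about this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 1000 rows, imbalanced classes.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.25, 0.75], random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

demo_pipe = make_pipeline(MinMaxScaler(),
                          LogisticRegression(solver='lbfgs', max_iter=1000))

# Fit on the train set only, then predict the untouched test set once.
demo_pipe.fit(X_tr, y_tr)
y_pred = demo_pipe.predict(X_te)

test_accuracy = accuracy_score(y_te, y_pred)
test_f1 = f1_score(y_te, y_pred)
print('Test accuracy:', test_accuracy)
print('Test F1 score:', test_f1)
```

In this notebook the equivalent would be to apply `feature_engineer` to both `X_train` and `X_test`, call `pipe.fit(X_train, y_train)`, and then score the predictions for `X_test` against `y_test`.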


## Feature Engineering Work
Everything below here is basically just a digital scratch pad

In [55]:
X.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


In [62]:
df.groupby('occupation')['income'].value_counts(normalize=True)

occupation          income
 ?                  <=50K     0.896365
                    >50K      0.103635
 Adm-clerical       <=50K     0.865517
                    >50K      0.134483
 Armed-Forces       <=50K     0.888889
                    >50K      0.111111
 Craft-repair       <=50K     0.773359
                    >50K      0.226641
 Exec-managerial    <=50K     0.515986
                    >50K      0.484014
 Farming-fishing    <=50K     0.884306
                    >50K      0.115694
 Handlers-cleaners  <=50K     0.937226
                    >50K      0.062774
 Machine-op-inspct  <=50K     0.875125
                    >50K      0.124875
 Other-service      <=50K     0.958422
                    >50K      0.041578
 Priv-house-serv    <=50K     0.993289
                    >50K      0.006711
 Prof-specialty     <=50K     0.550966
                    >50K      0.449034
 Protective-serv    <=50K     0.674884
                    >50K      0.325116
 Sales              <=50K     0.73068

In [0]:
df = prep_csv()
df['busi_prof'] = ((df['occupation'].str.contains('Exec-managerial')) |
                   (df['occupation'].str.contains('Prof-specialty')))

In [73]:
df['income'] = (df['income'] == '<=50K')
df.corr()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | income | busi_prof |
|---|---|---|---|---|---|---|---|---|
| age | 1.0 | -0.076646 | 0.036527 | 0.077674 | 0.057775 | 0.068756 | -0.234037 | 0.11721 |
| fnlwgt | -0.076646 | 1.0 | -0.043195 | 0.000432 | -0.010252 | -0.018768 | 0.009463 | -0.027052 |
| education-num | 0.036527 | -0.043195 | 1.0 | 0.12263 | 0.079923 | 0.148123 | -0.335154 | 0.47448 |
| capital-gain | 0.077674 | 0.000432 | 0.12263 | 1.0 | -0.031615 | 0.078409 | -0.223329 | 0.111544 |
| capital-loss | 0.057775 | -0.010252 | 0.079923 | -0.031615 | 1.0 | 0.054256 | -0.150526 | 0.072275 |
| hours-per-week | 0.068756 | -0.018768 | 0.148123 | 0.078409 | 0.054256 | 1.0 | -0.229689 | 0.152224 |
| income | -0.234037 | 0.009463 | -0.335154 | -0.223329 | -0.150526 | -0.229689 | 1.0 | -0.306207 |
| busi_prof | 0.11721 | -0.027052 | 0.47448 | 0.111544 | 0.072275 | 0.152224 | -0.306207 | 1.0 |


In [68]:
df['busi_prof'].value_counts()

False    24355
True      8206
Name: busi_prof, dtype: int64

In [77]:
df.groupby('workclass')['income'].value_counts(normalize=True)

workclass          income
 ?                 True      0.895969
                   False     0.104031
 Federal-gov       True      0.613542
                   False     0.386458
 Local-gov         True      0.705208
                   False     0.294792
 Never-worked      True      1.000000
 Private           True      0.781327
                   False     0.218673
 Self-emp-inc      False     0.557348
                   True      0.442652
 Self-emp-not-inc  True      0.715073
                   False     0.284927
 State-gov         True      0.728043
                   False     0.271957
 Without-pay       True      1.000000
Name: income, dtype: float64

In [82]:
df['workclass'] = df['workclass'].str.strip().replace({'Without-pay':'No_income',
                                                      'Never-worked':'No_income'})
df['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
No_income              21
Name: workclass, dtype: int64

In [93]:
df.groupby(['education','marital-status'])['income'].value_counts(normalize=True)

education      marital-status          income
 10th           Divorced               <=50K     0.983333
                                       >50K      0.016667
                Married-civ-spouse     <=50K     0.842407
                                       >50K      0.157593
                Married-spouse-absent  <=50K     1.000000
                Never-married          <=50K     0.991690
                                       >50K      0.008310
                Separated              <=50K     0.979592
                                       >50K      0.020408
                Widowed                <=50K     0.974359
                                       >50K      0.025641
 11th           Divorced               <=50K     0.938462
                                       >50K      0.061538
                Married-civ-spouse     <=50K     0.875706
                                       >50K      0.124294
                Married-spouse-absent  <=50K     1.000000
                Never-marr

In [94]:
df['high_val_ed_and_home'] = ((df['education'].str.contains('Some-college') & df['marital-status'].str.contains('Married-civ-spouse')) |
                              (df['education'].str.contains('Prof-school') & df['marital-status'].str.contains('Divorced')) |
                              (df['education'].str.contains('Prof-school') & df['marital-status'].str.contains('Never-married')))
df['high_val_ed_and_home'].value_counts()

False    29595
True      2966
Name: high_val_ed_and_home, dtype: int64

In [106]:
df.groupby(['race','occupation'])['income'].value_counts()

race                 occupation          income
 Amer-Indian-Eskimo   ?                  <=50K       23
                                         >50K         2
                      Adm-clerical       <=50K       28
                                         >50K         3
                      Armed-Forces       <=50K        1
                      Craft-repair       <=50K       38
                                         >50K         6
                      Exec-managerial    <=50K       27
                                         >50K         3
                      Farming-fishing    <=50K       10
                      Handlers-cleaners  <=50K       22
                      Machine-op-inspct  <=50K       19
                      Other-service      <=50K       31
                                         >50K         2
                      Prof-specialty     <=50K       22
                                         >50K        11
                      Protective-serv    <=50K        6


In [0]:
df['net_gain'] = df['capital-gain'] - df['capital-loss']