<a href="https://colab.research.google.com/github/Nolanole/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/DS_Unit_2_Sprint_Challenge_3_Classification_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 _Lambda School Data Science Unit 2_
 
 # Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [0]:
import pandas as pd

columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [0]:
target = 'income'
X = df.drop(columns=target)
y = df[target]

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [46]:
y_vals_norm = y.value_counts(normalize=True)
y_vals_norm

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64

In [47]:
# value_counts() sorts descending, so the first entry is the majority class share
print(y_vals_norm.iloc[0])

0.7591904425539756


### A majority class baseline that always predicts income <=$50k would score about 75.9% accuracy.
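The same baseline can also be computed with scikit-learn. The sketch below assumes a small toy label array (`y_toy` and `X_toy` are hypothetical stand-ins, not the census data) and uses `DummyClassifier(strategy='most_frequent')`, which always predicts the most common class seen during fitting.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels with a 3:1 class imbalance (not the census data)
y_toy = np.array(['<=50K'] * 75 + ['>50K'] * 25)
# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```

On the real data, fitting the same estimator on `X` and `y` should reproduce the majority class share found with `value_counts(normalize=True)`.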

What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

In [0]:
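# A majority class baseline predicts the negative class (<=50K, encoded as 0) for every row
y_pred_numeric = [0] * len(y)
# Encode the target as 0/1 so that '>50K' is the positive class
y_numeric = y.replace({'<=50K':0, '>50K':1})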

In [39]:
from sklearn.metrics import roc_auc_score
print('ROC-AUC Score for majority class baseline: ' + str(roc_auc_score(y_numeric, y_pred_numeric)))

ROC-AUC Score for majority class baseline: 0.5
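The 0.5 result follows from what ROC AUC measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A constant prediction ties every pair, so the score is 0.5 regardless of class balance. The sketch below illustrates this with a hypothetical helper, `pairwise_auc`, written in plain Python for illustration; it is not how scikit-learn computes the metric internally.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Hypothetical toy labels: three negatives, one positive
y_true_toy = [0, 0, 0, 1]

# A majority class baseline gives every row the same score, so every pair is a tie
constant_auc = pairwise_auc(y_true_toy, [0, 0, 0, 0])
# A model that ranks the positive above all negatives gets every pair right
perfect_auc = pairwise_auc(y_true_toy, [0.1, 0.2, 0.3, 0.9])

print(constant_auc, perfect_auc)
```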


In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**
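One way to scale only the numeric features while encoding the categorical ones is scikit-learn's `ColumnTransformer`. The sketch below is a minimal illustration on a hypothetical two-column toy frame (`toy`, `toy_y`), not the census data. It uses scikit-learn's own `OneHotEncoder` rather than `category_encoders`, so treat it as one possible arrangement rather than the required approach.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature
toy = pd.DataFrame({
    'age': np.tile(np.arange(20, 70), 4),
    'sex': np.tile(['Male', 'Female'], 100),
})
# Toy target that depends only on age, so the model has something to learn
toy_y = np.where(toy['age'] > 40, '>50K', '<=50K')

# Scale the numeric column and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])

toy_pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# The preprocessing is refit inside each fold, which avoids leaking validation data
toy_scores = cross_val_score(toy_pipeline, toy, toy_y, scoring='accuracy', cv=5)
print('Mean cross-validation accuracy:', toy_scores.mean())
```

Putting the preprocessing inside the pipeline means the scaler and encoder are fit only on the training portion of each fold.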

In [0]:
!pip install category_encoders

In [21]:
%matplotlib inline
import warnings
import category_encoders as ce
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

#Make LogisticRegression pipeline:

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# Cross-validate with training data
scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1, verbose=10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   14.0s finished


In [25]:
print('Logistic Regression Cross-Validation Accuracy scores: ', scores)
print('Logistic regression Cross-Validation Average Accuracy Score: ', scores.mean())

Logistic Regression Cross-Validation Accuracy scores:  [0.8452975  0.85105566 0.84683301 0.85412668 0.84683301 0.85143954
 0.84952015 0.85143954 0.84754224 0.86059908]
Logistic regression Cross-Validation Average Accuracy Score:  0.8504686426610766


## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

In [27]:
from sklearn.ensemble import RandomForestClassifier

#Make RandomForestClassifier pipeline:

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    # StandardScaler() omitted: tree-based models are insensitive to feature scaling
    RandomForestClassifier(max_depth=5, n_estimators=100)
)

#Cross-validate with training data
scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1, verbose=10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   13.1s finished


In [28]:
print('Random Forest Cross-Validation Accuracy scores: ', scores)
print('Random Forest Cross-Validation Average Accuracy Score: ', scores.mean())

Random Forest Cross-Validation Accuracy scores:  [0.83570058 0.82341651 0.84222649 0.84069098 0.83723608 0.8318618
 0.83608445 0.83723608 0.83640553 0.85138249]
Random Forest Cross-Validation Average Accuracy Score:  0.8372240993481166


In [29]:
from xgboost import XGBClassifier

#Make XGBoost pipeline:

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    # StandardScaler() omitted: tree-based models are insensitive to feature scaling
    XGBClassifier(max_depth=5, n_estimators=100)
)

#Cross-validate with training data
scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1, verbose=10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   16.3s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   32.7s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.4min finished


In [30]:
print('XGBoost Cross-Validation Accuracy scores: ', scores)
print('XGBoost Cross-Validation Average Accuracy Score: ', scores.mean())

XGBoost Cross-Validation Accuracy scores:  [0.8621881  0.85834933 0.86833013 0.87792706 0.86756238 0.87332054
 0.87101727 0.86833013 0.87096774 0.87519201]
XGBoost Cross-Validation Average Accuracy Score:  0.8693184706239627


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

In [0]:
true_pos = 36
true_neg = 85
false_pos = 58
false_neg = 8
actual_pos = true_pos + false_neg
actual_neg = true_neg + false_pos
pred_pos = true_pos + false_pos
pred_neg = true_neg + false_neg

Calculate accuracy

In [32]:
accuracy = (true_pos + true_neg) / (pred_neg + pred_pos)
print(accuracy)

0.6470588235294118


Calculate precision

In [33]:
precision = true_pos / pred_pos
print(precision)

0.3829787234042553


Calculate recall

In [34]:
recall = true_pos / actual_pos
print(recall)

0.8181818181818182


Calculate F1 score

In [35]:
# F1 is the harmonic mean of precision and recall
f1 = 2*((precision*recall)/(precision+recall))
print(f1)

0.5217391304347826


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 
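In the runs above, XGBoost had the highest mean cross-validation accuracy (about 0.869, versus 0.850 for logistic regression and 0.837 for random forest). The refit-and-evaluate step can be sketched as follows. This is an illustration on synthetic data from `make_classification`, with scikit-learn's `GradientBoostingClassifier` standing in for the XGBoost pipeline; the names `X_demo`, `y_demo`, and `final_model` are hypothetical, and the printed score says nothing about the census data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the census data)
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Refit the chosen model on the full training set
final_model = GradientBoostingClassifier(random_state=0)
final_model.fit(X_tr, y_tr)

# Evaluate once on the held-out test set, which was never used for model selection
test_accuracy = accuracy_score(y_te, final_model.predict(X_te))
print('Test accuracy:', test_accuracy)
```

With the notebook's own objects, the equivalent would be fitting the chosen pipeline on `X_train, y_train` and scoring it on `X_test, y_test`.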

### Part 4
Calculate F1 score and False Positive Rate. 
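Both metrics follow directly from the Part 4 confusion matrix. F1 is the harmonic mean of precision and recall, and the False Positive Rate is the share of actual negatives that were predicted positive, FP / (FP + TN). A short sketch using the counts from the table:

```python
# Counts taken from the Part 4 confusion matrix
true_pos = 36
true_neg = 85
false_pos = 58
false_neg = 8

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# False Positive Rate: fraction of actual negatives predicted as positive
false_positive_rate = false_pos / (false_pos + true_neg)

print(round(f1, 4))                   # 0.5217
print(round(false_positive_rate, 4))  # 0.4056
```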