Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](http://archive.is/DelgE), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.


## Stretch Goals

### Reading

- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] Stacking Ensemble. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [None]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

# Imports

In [None]:
#Imports
%matplotlib inline
from pandas_profiling import ProfileReport

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import plot_confusion_matrix, classification_report, accuracy_score

from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder

import numpy as np
from scipy.stats import randint, uniform
import matplotlib.pyplot as plt



# Train-Validate-Test Split

In [None]:
#Split the labeled data into train and validation sets (the test set is provided separately)
train, val  = train_test_split(train, train_size=0.8, test_size=0.2, stratify=train['status_group'], random_state=42)
train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))

# Baseline

In [None]:
train['status_group'].value_counts()

functional                 25807
non functional             18259
functional needs repair     3454
Name: status_group, dtype: int64

In [None]:
baseline = (25807 / (25807 + 18259 + 3454))

print('Our naive baseline is: ', baseline)

Our naive baseline is:  0.5430765993265994
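The baseline above hard-codes the class counts. As a minimal sketch, the same number can be derived from the counts directly, so it stays correct if the split changes. The `counts` dict below just restates the `value_counts()` output shown earlier; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {
    'functional': 25807,
    'non functional': 18259,
    'functional needs repair': 3454,
}

# The majority-class baseline always predicts the most frequent class,
# so its accuracy equals that class's share of the rows
total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline = counts[majority_class] / total

print('Majority class:', majority_class)
print('Our naive baseline is: ', baseline)

# With the DataFrame at hand, the equivalent one-liner would be:
# train['status_group'].value_counts(normalize=True).max()
```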


# Wrangle Data Function

In [None]:
#Create a function to wrangle our data by fixing the null island values in the geo data as well as 
#remove any redundant columns
def wrangle_data(X):
  """Function for wrangling all of our data sets in the same manner"""

  #Prevent a SettingWithCopyWarning
  X = X.copy()

  #Replace the near-0 values in latitude with 0's for easier handling
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  #Now we will replace the 0's with null values for easier handling
  cols_with_zeros = ['longitude', 'latitude', 'construction_year',
                     'gps_height', 'population']
  for col in cols_with_zeros:
    X[col] = X[col].replace(0, np.nan)
    X[col+'_MISSING'] = X[col].isnull()

  #Drop redundant or non-useful columns
  red_cols = ['quantity_group', 'payment_type', 'recorded_by', 'id']
  X = X.drop(columns=red_cols)

  #Convert date_recorded to a datetime object
  X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format = True)

  #Extract the date elements and then drop the original date/time column
  X['year_recorded'] = X['date_recorded'].dt.year
  X['month_recorded'] = X['date_recorded'].dt.month
  X['day_recorded'] = X['date_recorded'].dt.day
  X = X.drop(columns='date_recorded')

  #Engineering new features
  X['years'] = X['year_recorded'] - X['construction_year']
  X['years_MISSING'] = X['years'].isnull()

  #Return the wrangled dataframe
  return X
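To see what the zero-to-missing step in `wrangle_data` does, here is a small sketch on a toy frame. The values are made up for illustration and are not taken from the waterpumps data; only the two columns needed for the demonstration are included.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({
    'latitude': [-2e-08, -6.5, -3.2],
    'population': [0, 120, 0],
})

# Treat the near-zero latitude sentinel as 0, then treat 0 as missing
df['latitude'] = df['latitude'].replace(-2e-08, 0)
for col in ['latitude', 'population']:
    df[col] = df[col].replace(0, np.nan)
    # Keep a boolean flag so the model can still see which rows were missing
    df[col + '_MISSING'] = df[col].isnull()

print(df)
```

The `_MISSING` flags matter because the pipeline later imputes the mean, which would otherwise hide which rows had no real measurement.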


In [None]:
#Create a function that will select our features and control for high cardinality

def create_features(X):
  """Return the numeric features plus the low-cardinality categorical features of X"""
  
  #Create our target
  target = 'status_group'

  #Strip out the target column (use the dataframe passed in, not the global train)
  train_features = X.drop(columns=[target])

  #Get a list of the numeric features
  numeric_features = train_features.select_dtypes(include='number').columns.tolist()

  #Get a series with the cardinality of the nonnumeric features
  cardinality = train_features.select_dtypes(exclude='number').nunique()

  #Create a list of all nonnumeric features with a cardinality of 50 or less
  categorical_features = cardinality[cardinality <= 50].index.tolist()

  #Combine the lists
  features = numeric_features + categorical_features

  return features
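The cardinality filter can be checked on a toy frame. This is a sketch only: `select_features` and its `max_cardinality` parameter are hypothetical names introduced here so the threshold can be lowered for a small example, and the data is made up.

```python
import pandas as pd

def select_features(X, target, max_cardinality=50):
    """Hypothetical helper: numeric columns plus low-cardinality categoricals."""
    candidates = X.drop(columns=[target])
    numeric = candidates.select_dtypes(include='number').columns.tolist()
    cardinality = candidates.select_dtypes(exclude='number').nunique()
    categorical = cardinality[cardinality <= max_cardinality].index.tolist()
    return numeric + categorical

# Toy data (illustrative only)
toy = pd.DataFrame({
    'amount': [10.0, 0.0, 25.0, 5.0],
    'basin': ['Rufiji', 'Pangani', 'Rufiji', 'Pangani'],   # 2 unique values
    'wpt_name': ['a', 'b', 'c', 'd'],                      # 4 unique values
    'status_group': ['functional', 'non functional',
                     'functional', 'functional'],
})

# With a threshold of 3, wpt_name is dropped as too high-cardinality
print(select_features(toy, target='status_group', max_cardinality=3))
```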



In [None]:
#Create our features and perform the final split on our data into Feature matrices and target vectors
target = 'status_group'
features = create_features(train)

X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test

X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_val.shape

((47520, 40), (47520,), (11880, 40), (11880,), (14358, 40), (11880,))

# Create our pipeline

In [None]:
#Make our pipeline

model = make_pipeline(
    FunctionTransformer(wrangle_data, validate=False),
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(criterion='entropy', max_depth=50, n_estimators=200, min_samples_leaf=1, random_state=42)
)

In [None]:
%%time
#Fit the data to our model (the %%time cell magic must be the first line of the cell)

model.fit(X_train, y_train)
print('training accuracy:', model.score(X_train, y_train))
print('validation accuracy:', model.score(X_val, y_val))

training accuracy: 0.9999789562289563
validation accuracy: 0.8132154882154882
CPU times: user 40 s, sys: 470 ms, total: 40.4 s
Wall time: 40.5 s


# Confusion matrix

In [None]:
y_pred = model.predict(X_val)
print('Validation Accuracy: ', accuracy_score(y_val, y_pred))

Validation Accuracy:  0.8132154882154882


In [None]:
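#Create our confusion matrix on our validation data
#Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions,
#ConfusionMatrixDisplay.from_estimator accepts the same arguments used here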

plt.rcParams['figure.dpi'] = 125

plot_confusion_matrix(model, X_val, y_val, values_format='.0f', xticks_rotation='vertical')


<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f3700342c50>
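The plot shows raw counts; per-class precision and recall can be read off the same matrix. Below is a minimal sketch in plain Python on toy labels (made up for illustration, not the notebook's validation results). Rows are true classes and columns are predicted classes, which matches the layout scikit-learn uses.

```python
from collections import Counter

# Toy labels for illustration only
y_true = ['functional', 'functional', 'non functional', 'non functional',
          'functional needs repair', 'functional']
y_pred = ['functional', 'non functional', 'non functional', 'non functional',
          'functional', 'functional']

labels = sorted(set(y_true))
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true classes, columns are predicted classes
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

def recall(label):
    """Correct predictions for a class divided by its row total (actual count)."""
    i = labels.index(label)
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0

def precision(label):
    """Correct predictions for a class divided by its column total (predicted count)."""
    i = labels.index(label)
    col_total = sum(row[i] for row in matrix)
    return matrix[i][i] / col_total if col_total else 0.0

for label in labels:
    print(label, 'precision:', precision(label), 'recall:', recall(label))
```

On the real validation set, `classification_report(y_val, y_pred)` (already imported above) reports the same quantities per class.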

# Stretch Goals

Our stretch goals are:

- Use a grid search to determine the best hyperparameters for our model
- Stacking ensemble
- More categorical encoding
- Create visuals to share on Slack

In [None]:
#Create a new pipeline to work with on grid search
model = make_pipeline(
    FunctionTransformer(wrangle_data, validate=False),
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    RandomForestClassifier(criterion='entropy', random_state=42)
)

#Previously fixed hyperparameters, kept here for reference:
#max_depth=50, n_estimators=200, min_samples_leaf=1

In [None]:
#Define the parameter grid for our grid search
#(each list holds the candidate values to try for that hyperparameter)
param_dist = {
    'randomforestclassifier__n_estimators': [300],
    'randomforestclassifier__max_depth': [20], 
    'randomforestclassifier__max_features': [.25],
    'randomforestclassifier__min_samples_split': [2],
    'randomforestclassifier__min_samples_leaf': [2]
}

In [None]:
search = GridSearchCV(
    model,
    param_grid=param_dist,
    cv=3,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)
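Before launching a search it helps to estimate how much work it implies. As a rough sketch: a grid search tries every combination of the listed values, so the number of candidates is the product of the list lengths, and the number of fits is that times the number of folds (plus one final refit on the full training data, which is the `GridSearchCV` default). The dict below restates `param_dist` from above.

```python
from math import prod

# Same grid as param_dist above
param_grid = {
    'randomforestclassifier__n_estimators': [300],
    'randomforestclassifier__max_depth': [20],
    'randomforestclassifier__max_features': [.25],
    'randomforestclassifier__min_samples_split': [2],
    'randomforestclassifier__min_samples_leaf': [2],
}
cv = 3

# One candidate per combination of values, one fit per candidate per fold
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

print(f'{n_candidates} candidates x {cv} folds = {n_fits} fits')
```

For the single-valued grid above this gives one candidate and three fits; a log reporting more candidates would indicate that the search was run with a larger grid.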

In [None]:
search.fit(X_train, y_train)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  7.1min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 10.7min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('functiontransformer',
                                        FunctionTransformer(accept_sparse=False,
                                                            check_inverse=True,
                                                            func=<function wrangle_data at 0x7f36ffdf79d8>,
                                                            inv_kw_args=None,
                                                            inverse_func=None,
                                                            kw_args=None,
                                                            validate=False)),
                                       ('ordinalencoder',
                                        OrdinalEncoder(cols=None,
                                                       drop_invariant=False,
                                                       handle_missing='value'

In [None]:
print('training accuracy:', search.score(X_train, y_train))
print('validation accuracy:', search.score(X_val, y_val))

training accuracy: 0.9400042087542088
validation accuracy: 0.8131313131313131


In [None]:
print(search.best_estimator_)

Pipeline(memory=None,
         steps=[('functiontransformer',
                 FunctionTransformer(accept_sparse=False, check_inverse=True,
                                     func=<function wrangle_data at 0x7f36ffdf79d8>,
                                     inv_kw_args=None, inverse_func=None,
                                     kw_args=None, validate=False)),
                ('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'wpt_name',
                                      'basin', 'subvillage', 'region', 'lga',
                                      'ward', 'public_meeting',...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='entropy',
                                        max_depth=20, max_features=0.25,
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
         