<a href="https://colab.research.google.com/github/DavidVollendroff/DS-Unit-2-Kaggle-Challenge/blob/master/module4/LS_DS_224_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment


In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

- [X] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.

- [X] Plot a confusion matrix for your Tanzania Waterpumps model.


In [0]:
# Will need to define a function to replace erroneous values with NaN values

import numpy as np # 

def data_cleaner(X):
  X = X.copy()
  X['latitude'] = X['latitude'].replace(-2e-08, 0) # remove non-sense values
  
  # useless columns
  X.drop(columns=['id', 'recorded_by', 'num_private', 'quantity_group'])
  
  # Will leave as a stretch goal zero-wrangling of greater complexity than this
  cols_with_zeros = ['longitude', 'latitude']
  for col in cols_with_zeros:
      X[col] = X[col].replace(0, np.nan)
  
  return X

In [0]:
train = data_cleaner(train)
test = data_cleaner(test)

# Initial Feature Selection

#establish the 'label'
target = 'status_group'

features = train.columns.tolist()

removed_features = ['id',
                   'recorded_by',
                   'num_private',
                   'quantity_group',
                   'status_group']

for item in removed_features:
  if item in features:
    features.remove(item)

In [0]:
# features matrix and target vector 
X_train = train[features]
y_train = train[target]
X_test = test[features]

In [0]:
# import all necessary functions for sklearn pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [0]:
# create pipeline
pipe = make_pipeline(ce.OrdinalEncoder(),
                         SimpleImputer(strategy='mean'),
                         RandomForestClassifier(max_depth=20, max_features=5, n_estimators=1200))
pipe.fit(X_train, y_train);

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
split_train, val = train_test_split(train)
X_val = val[features]
y_val = val[target]
X_split = split_train[features]
y_split = split_train[target]
y_pred = pipe.predict(X_val)

confusion_matrix(y_val, y_pred)

In [0]:
from sklearn.utils.multiclass import unique_labels
unique_labels(y_val)

In [0]:
import seaborn as sns
def confusion_plot(y_true, y_pred):
  labels = unique_labels(y_val)
  columns = [f'Predicted {label}' for label in labels]
  index = [f'Actual {label}' for label in labels]
  table = pd.DataFrame(confusion_matrix(y_true, y_pred),
                       columns=columns,
                       index=index)
  return sns.heatmap(table, annot=True, fmt='d', cmap='viridis')

confusion_plot(y_val, y_pred)
  

- [X] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).

In [0]:
!pip install scikit-optimize
!pip install scikit-learn==0.20.3 # newer versions incompatible with skopt

In [0]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

bayes_pipe = make_pipeline(ce.OrdinalEncoder(),
                         SimpleImputer(strategy='median'),
                         RandomForestClassifier())



In [0]:
param_distributions = {
    'randomforestclassifier__n_estimators': Integer(500, 2000), 
    'randomforestclassifier__max_depth': Integer(20, 40), 
    'randomforestclassifier__max_features': Integer(1, len(features)+1), 
}

In [25]:
search = BayesSearchCV(bayes_pipe,
                       param_distributions,
                       n_iter=5,
                       verbose=10,
                       n_jobs=-1,
                       cv=3)
search.fit(X_train, y_train)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  5.3min finished
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  5.3min remaining:    0.0s


Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  3.7min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  3.7min finished


Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 11.7min finished
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 11.7min remaining:    0.0s


Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 10.2min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 10.2min finished


Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 11.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 11.5min finished


BayesSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ordinalencoder', OrdinalEncoder(cols=None, drop_invariant=False, handle_missing='value',
        handle_unknown='value', mapping=None, return_df=True, verbose=0)), ('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('random...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_iter=5, n_jobs=-1, n_points=1,
       optimizer_kwargs=None, pre_dispatch='2*n_jobs', random_state=None,
       refit=True, return_train_score=False, scoring=None,
       search_spaces={'randomforestclassifier__n_estimators': Integer(low=500, high=2000), 'randomforestclassifier__max_depth': Integer(low=20, high=40), 'randomforestclassifier__max_features': Integer(low=1, high=37)},
       verbose=10)

In [26]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy {:.1f}%'.format(100*search.best_score_))

Best hyperparameters {'randomforestclassifier__max_depth': 24, 'randomforestclassifier__max_features': 3, 'randomforestclassifier__n_estimators': 1725}
Cross-validation Accuracy 80.7%


In [0]:
best_model = search.best_estimator_

- [X] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_

In [0]:
# create pipeline
pipe = make_pipeline(ce.OrdinalEncoder(),
                         SimpleImputer(strategy='mean'),
                         RandomForestClassifier(max_depth=20, max_features=5, n_estimators=1200))
pipe.fit(X_train, y_train);

In [0]:
my_predictions = pipe.predict(X_test)
sample_submission['status_group'] = my_predictions
sample_submission.to_csv('davidvollendroff_submission.csv', index=False)

In [0]:
from google.colab import files
files.download('davidvollendroff_submission.csv')

- [X] Commit your notebook to your fork of the GitHub repo.

- [X] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.