# Cross-Validation

- Do **k-fold cross-validation** with an independent test set
- Use scikit-learn for **hyperparameter optimization**

In [16]:
%%capture

import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [17]:
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd

# I. Wrangle Data

In [18]:
def wrangle(fm_path, tv_path=None):
  # Import CSV files
  if tv_path:
    df = pd.merge(pd.read_csv(fm_path, 
                              na_values=[0, -2.000000e-08],
                              parse_dates=['construction_year', 'date_recorded']), 
                  pd.read_csv(tv_path)).set_index('id')
  else:
    df = pd.read_csv(fm_path, na_values=[0, -2.000000e-08],
                     parse_dates=['construction_year', 'date_recorded']).set_index('id')

  # Feature engineering (Credit: Mena and Keila)
  df['pump_age'] = df['date_recorded'].dt.year - df['construction_year'].dt.year

  # Drop constant and repeated columns
  df.drop(columns=['recorded_by', 'extraction_type_group', 'quantity_group',
                   'construction_year', 'date_recorded'], 
          inplace=True)
  
  # Drop columns with high % of NaN values
  df.dropna(axis=1, thresh=len(df)*.6, inplace=True)

  return df

df = wrangle(fm_path=DATA_PATH+'waterpumps/train_features.csv',
             tv_path=DATA_PATH+'waterpumps/train_labels.csv')

X_test = wrangle(fm_path=DATA_PATH+'waterpumps/test_features.csv')

# II. Split Data

## Split Target Vector (TV) from Feature Matrix (FM)

In [19]:
target = 'status_group'
y_train = df[target]
X_train = df.drop(columns=target)

## Training-Validation Split

- Since we're doing k-fold CV, there's no need for a separate validation set: each fold takes a turn as the validation data.
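
To make the idea concrete, here is a minimal sketch of how k-fold CV reuses the training data for validation. It runs on synthetic data from `make_classification` (an assumption, standing in for the water pump data), and the `_demo` names are made up for illustration. Note that when `cv` is left at its default, `cross_val_score` uses 5 folds, stratified for classifiers; the sketch passes a plain `KFold` so the fold sizes are easy to read.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption: 500 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# Five folds: every row lands in the validation part of exactly one fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(X_demo)]

# One accuracy score per fold, each from a model trained on the other four folds
scores_demo = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=kfold
)

print('Validation rows per fold:', fold_sizes)
print('Accuracy per fold:', scores_demo)
```

Because every row is validated once, the mean of the fold scores plays the role a single validation score would otherwise play.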

# III. Establish Baseline

This is a **classification** problem, so our baseline is the **accuracy** we would get by always predicting the majority class.

In [20]:
print('Baseline Accuracy:', y_train.value_counts(normalize=True).max())

Baseline Accuracy: 0.5430899510092763
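
As a cross-check on the majority-class idea, the same number can be obtained from scikit-learn's `DummyClassifier`. The sketch below uses made-up toy labels (an assumption, not the real `status_group` counts) purely to show that the two calculations agree.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels (assumption: invented class names and counts for illustration)
y_toy = pd.Series(['class_a'] * 6 + ['class_b'] * 3 + ['class_c'] * 1)
X_toy = pd.DataFrame({'placeholder': range(len(y_toy))})

# A classifier that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# The shortcut used in the notebook: share of the most common class
majority_share = y_toy.value_counts(normalize=True).max()

print('DummyClassifier accuracy:', baseline_acc)
print('Majority class share:', majority_share)
```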


# IV. Build Models

- `DecisionTreeClassifier`
- `RandomForestClassifier`

In [21]:
model_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(random_state=42)
)

In [22]:
model_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=25, random_state=42)
)

**Check cross-validation scores**

In [23]:
cv_scores_dt = cross_val_score(model_dt, X_train, y_train, n_jobs=-1)

In [None]:
cv_scores_rf = cross_val_score(model_rf, X_train, y_train, n_jobs=-1)

In [25]:
print('CV score DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV score DecisionTreeClassifier
[0.70505051 0.69545455 0.71574074 0.703367   0.70940315]
Mean CV accuracy score: 0.7058031886051921
STD CV accuracy score: 0.006712832474797317


In [26]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.80218855 0.80042088 0.80328283 0.80294613 0.79914134]
Mean CV accuracy score: 0.8015959451404354
STD CV accuracy score: 0.001576427451542507


# V. Tune Model

- What are important hyperparameters for `RandomForestClassifier`?
  - `max_depth`: 5-35
  - `n_estimators`: 25-100
  - imputation strategy

**`GridSearchCV`:** Very thorough because it tries every combination in the grid, but it can take a long time.

In [28]:
estimator = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=42)
)

params = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': [25, 50, 75, 100],
    'randomforestclassifier__max_depth': range(5, 36, 5)
}

model_gs = GridSearchCV(
    estimator,
    param_grid=params,
    cv=5,
    n_jobs=-1,
    verbose=1
)

model_gs.fit(X_train, y_train);

Fitting 5 folds for each of 56 candidates, totalling 280 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed: 20.3min finished


In [29]:
model_gs.best_params_

{'randomforestclassifier__max_depth': 20,
 'randomforestclassifier__n_estimators': 100,
 'simpleimputer__strategy': 'mean'}

In [32]:
model_gs.best_score_

0.8090203292855032
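
The cost of the grid search follows directly from the size of the grid. This small sketch recomputes the candidate count by hand with `itertools.product`, using the same parameter values as `params` above; the names `grid`, `candidates`, and `n_folds` are introduced here for illustration only.

```python
from itertools import product

# Same values as the `params` dictionary used for the grid search
grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': [25, 50, 75, 100],
    'randomforestclassifier__max_depth': range(5, 36, 5),
}

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*grid.values()))
n_folds = 5

print(len(candidates), 'candidates,', len(candidates) * n_folds, 'fits')
# 56 candidates, 280 fits
```

Each extra value for any hyperparameter multiplies the total, which is why exhaustive grids get slow quickly.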

**`RandomizedSearchCV`:** Quicker because it samples only `n_iter` combinations. It may miss the best combination, but the result is usually good enough.

In [27]:
model_rs = RandomizedSearchCV(
    estimator,
    param_distributions=params,
    n_iter=3,
    cv=5,
    n_jobs=-1,
    verbose=1
)

#model_rs.fit(X_train, y_train);
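
Since the fit above is commented out, here is a small self-contained sketch of a randomized search that does run. It is not the notebook's search: it uses synthetic data, a bare `RandomForestClassifier` instead of the pipeline, and `scipy.stats.randint` distributions in place of the fixed lists (all assumptions for illustration). Setting `random_state` on the search makes the sampled candidates reproducible.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real data (assumption: 300 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# randint(low, high) samples integers from low up to, but not including, high
param_distributions = {
    'n_estimators': randint(25, 101),  # 25 to 100
    'max_depth': randint(5, 36),       # 5 to 35
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=3,          # only 3 sampled candidates, so 3 x 5 = 15 fits
    cv=5,
    random_state=42,   # reproducible sampling of candidates
)
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Best CV accuracy:', search.best_score_)
```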

# Make Submission

In [31]:
y_pred = model_gs.predict(X_test)

In [43]:
submission = pd.DataFrame({'status_group': y_pred}, index=X_test.index)

In [45]:
submission.to_csv('2021-02-17_submission.csv')

# VI. Communicate Results

**Showing Feature Importance**

Plot the feature importances for our `RandomForestClassifier` model.
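
One possible way to get at the importances is sketched below. It is an illustration on synthetic data with a simplified pipeline (both assumptions), not the notebook's fitted model; the point is the pattern of pulling the forest out of the pipeline with `named_steps` and pairing `feature_importances_` with column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (assumption: 6 made-up feature names)
feature_names = [f'feature_{i}' for i in range(6)]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_estimators=25, random_state=42)
)
pipeline.fit(X_demo, y_demo)

# Pull the fitted forest out of the pipeline and label its importances
forest = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()

print(importances)
```

In the notebook, the fitted pipeline would presumably be `model_gs.best_estimator_` and the labels `X_train.columns`, assuming no transformer step drops or reorders columns. With matplotlib installed, `importances.tail(10).plot(kind='barh')` would draw the ten largest values as a horizontal bar chart.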