BloomTech Data Science

*Unit 2, Sprint 2, Module 3*

---
<p style="padding: 10px; border: 2px solid red;">
    <b>Before you start:</b> Today is the day you should submit the dataset for your Unit 2 Build Week project. You can review the guidelines and make your submission in the Build Week course for your cohort on Canvas.</p>

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/main/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*
    
    #Connect to remote data
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    %cd /content/drive/My Drive/Kaggle

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Hyperparameter Tuning

This sprint, the module projects will focus on creating and improving a model for the Tanazania Water Pump dataset. Your goal is to create a model to predict whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Use `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for best performing classifier.
- **Task 8:** Print out best score and params for model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [3]:
# Import Block
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [4]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, 
                                  na_values=[0, -2.000000e-08]),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path, 
                         na_values=[0, -2.000000e-08],
                         index_col='id')
        
    # Build new Pump Age feature
    df['pump_age'] = pd.to_datetime(df['date_recorded']).dt.year - pd.to_datetime(df['construction_year']).dt.year
    df.drop(columns=['date_recorded'], inplace=True)

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)

    # Drop duplicate columns
    #automatic method is too unreliable
    #dupe_cols = [col for col in df.head(50).T.duplicated().index
    #             if df.head(50).T.duplicated()[col]]
    dupe_cols = ['subvillage', 'region', 'extraction_type_group', 'payment', 
                 'quality_group', 'quantity_group', 'source_type', 'waterpoint_type_group']
    df.drop(columns=dupe_cols, inplace=True)    

    # Drop HCCCs
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    return df

**Task 1:** Using the above `wrangle` function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

In [5]:
df = wrangle("train_features.csv", "train_labels.csv")
X_test = wrangle("test_features.csv")

# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [6]:
X = df.drop(columns=['status_group'])
y = df['status_group']

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Figure out what is the majority class in `y_train` and what percentage of your training observations it represents.

In [7]:
model_dum = DummyClassifier(strategy='prior').fit(X, y)
baseline_acc = accuracy_score(y, model_dum.predict(X))
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5429828068772491


# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer fot missing values.
- a `DecisionTreeClassifier` Predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task. 

In [8]:
clf_dt = make_pipeline(OrdinalEncoder(), SimpleImputer(), DecisionTreeClassifier(random_state=42))

**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer fot missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task. 

In [9]:
clf_rf = make_pipeline(OrdinalEncoder(), SimpleImputer(), RandomForestClassifier(random_state=42, n_jobs=-1))

# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.

In [10]:
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores_dt = cross_val_score(clf_dt, X, y, cv=kfold_cv, scoring='accuracy')
cv_scores_rf = cross_val_score(clf_rf, X, y, cv=kfold_cv, scoring='accuracy')

In [11]:
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74526515 0.74368687 0.75252525 0.75284091 0.75260444]
Mean CV accuracy score: 0.7493845245042235
STD CV accuracy score: 0.004040077793211837


In [12]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.79261364 0.79987374 0.79997896 0.80155724 0.80300958]
Mean CV accuracy score: 0.7994066289893924
STD CV accuracy score: 0.0035859963134000335


# VI. Tune Model

**Task 7:** Choose the best performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation. 

In [36]:
model_rf = make_pipeline(OrdinalEncoder(), SimpleImputer(), RandomForestClassifier(random_state=42, n_jobs=-1))

param_rngs = {"simpleimputer__strategy": ['mean', 'median'],
              "randomforestclassifier__max_depth": np.arange(30, 40, 1),
              "randomforestclassifier__max_features": np.linspace(0.4, 0.5, 20),
              "randomforestclassifier__n_estimators": np.arange(65, 75, 1),
              "randomforestclassifier__min_samples_split": np.arange(4, 16, 1),
              "randomforestclassifier__min_samples_leaf": np.arange(1, 8, 1),
              "randomforestclassifier__max_leaf_nodes": np.arange(2750, 3250, 50),
              "randomforestclassifier__criterion": ["gini", "entropy"]}

model_cv = RandomizedSearchCV(model_rf, param_distributions=param_rngs, n_iter=100, cv=5, n_jobs=-1, random_state=42, verbose=1, scoring='accuracy')

#model_cv = GridSearchCV(model_rf, param_grid=param_rngs, cv=5, n_jobs=-1, random_state=42, verbose=1, scoring='accuracy')

model_cv.fit(X, y)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder()),
                                             ('simpleimputer', SimpleImputer()),
                                             ('randomforestclassifier',
                                              RandomForestClassifier(n_jobs=-1,
                                                                     random_state=42))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'randomforestclassifier__criterion': ['gini',
                                                                              'entropy'],
                                        'randomforestclassifier__max_depth': array([30, 31, 32, 33...
                                        'randomforestclassifier__max_leaf_nodes': array([2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400]),
                                        'randomfores

**Task 8:** Print out the best score and best params for `model`.

In [38]:
best_score = model_cv.best_score_
#best_score = accuracy_score(y, model_cv.predict(X))
best_params = model_cv.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.8061196008100934
Best params for `model`: {'simpleimputer__strategy': 'median', 'randomforestclassifier__n_estimators': 65, 'randomforestclassifier__min_samples_split': 7, 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__max_leaf_nodes': 2900, 'randomforestclassifier__max_features': 0.42500000000000004, 'randomforestclassifier__max_depth': 32, 'randomforestclassifier__criterion': 'entropy'}


# Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submissions follows the same formatting. 

In [39]:
submission = pd.DataFrame(data=model_cv.predict(X_test), index=X_test.index)
submission.columns = ['status_group']

# generate CSV
submission.to_csv('submission_jd_3.csv')
# download
if 'google.colab' in sys.modules:
    from google.colab import files
    files.download("submission_jd_3.csv")