Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---
<p style="padding: 10px; border: 2px solid red;">
    <b>Before you start:</b> Today is the day you should submit the dataset for your Unit 2 Build Week project. You can review the guidelines and make your submission in the Build Week course for your cohort on Canvas.</p>

In [None]:
   !pip install category_encoders==2.*
   !pip install pandas-profiling==2.*

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, validation_curve 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

pd.options.display.max_rows = 100

# Module Project: Hyperparameter Tuning

This sprint, the module projects will focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model to predict whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Use the `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for best performing classifier.
- **Task 8:** Print out best score and params for model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [14]:
train = pd.merge(pd.read_csv('train_features.csv', na_values=[0, -2.00000e-08], parse_dates=['date_recorded']), pd.read_csv('train_labels.csv', na_values=[0, -2.00000e-08]))
test = pd.read_csv('test_features.csv', na_values=[0, -2.00000e-08], parse_dates=['date_recorded'])

def wrangle(df):
    
    # Set id as index
    df.set_index('id', inplace=True)

    # Drop constant columns
    df.drop(columns= 'recorded_by', inplace=True)

    # # Drop duplicate columns
    #df.drop(columns= 'quantity group')

    # Create age feature
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
    df.drop(columns='date_recorded', inplace=True)

    # Drop high-cardinality categorical columns (HCCCs)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Strip differing words between the two datasets
    df.drop(columns= ['waterpoint_type_group', 'payment_type'], inplace=True)
    
    # Drop duplicate columns (compared on the first 15 rows) and account for the difference in X_test
    is_dupe = df.head(15).T.duplicated()
    dupe_cols = is_dupe[is_dupe].index
    df.drop(columns=dupe_cols, inplace=True)

    # Drop num_private since 98% of its values are missing
    df.drop(columns='num_private', inplace=True)

    return df



**Task 1:** Use the `wrangle` function above to load `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

In [15]:
df = wrangle(train)
X_test = wrangle(test)

# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [16]:
# Split
target = 'status_group'

# Split the wrangled DataFrame `df`, as the task asks
y = df[target]
X = df.drop(columns=target)

In [17]:
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.2,random_state=42)

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Find the majority class in `y_train` and the percentage of your training observations it represents.

In [18]:
baseline_acc = y_train.value_counts(normalize = True).max()
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5440867003367004
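
The baseline above is the accuracy a model would get by always predicting the majority class. As a minimal sketch of that idea, the cell below uses a small made-up target vector (`y_toy` and `X_toy` are hypothetical stand-ins, not the pump data) and checks the hand-computed share against sklearn's `DummyClassifier`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy target vector (hypothetical values, not the real pump labels)
y_toy = pd.Series(['functional'] * 6
                  + ['non functional'] * 3
                  + ['functional needs repair'])
# Placeholder feature matrix; the dummy model ignores the features
X_toy = pd.DataFrame({'placeholder': range(len(y_toy))})

# Majority class and the share of observations it represents
majority_class = y_toy.value_counts().idxmax()
majority_share = y_toy.value_counts(normalize=True).max()

# A model that always predicts the majority class reaches the same accuracy
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
dummy_acc = dummy.score(X_toy, y_toy)

print('Majority class:', majority_class)
print('Majority share:', majority_share)
print('Dummy accuracy:', dummy_acc)
```

Any real classifier should beat this number to be worth keeping.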


# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task. 

In [34]:
clf_dt = make_pipeline(
            OrdinalEncoder(),
            SimpleImputer(),
            DecisionTreeClassifier(random_state=42))


**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task. 

In [35]:
clf_rf = make_pipeline(
            OrdinalEncoder(),
            SimpleImputer(),
            RandomForestClassifier(random_state=42))



# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.

In [36]:
cv_scores_dt = cross_val_score(clf_dt, X, y, cv=5, n_jobs=-1)
cv_scores_rf = cross_val_score(clf_rf, X, y, cv=5, n_jobs=-1)

In [37]:
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74726431 0.75073653 0.74705387 0.75178872 0.74558081]
Mean CV accuracy score: 0.7484848484848484
STD CV accuracy score: 0.0023645933317487924


In [38]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.79776936 0.79829545 0.79808502 0.80439815 0.79724327]
Mean CV accuracy score: 0.7991582491582492
STD CV accuracy score: 0.0026438213425150395
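
To make the k-fold procedure concrete, here is a rough sketch of what `cross_val_score` does for a classifier when `cv=5`: split the data into five stratified folds, fit a fresh copy of the model on four of them, and score it on the fifth. The data is synthetic (`X_toy` and `y_toy` come from `make_classification` and are assumptions for illustration only), and a plain `DecisionTreeClassifier` stands in for the full pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the pump dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)

# Manual loop: fit on the training folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_model = clone(clf).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[val_idx], y_toy[val_idx]))
manual_scores = np.array(manual_scores)

# The one-line helper should produce the same five accuracy scores
auto_scores = cross_val_score(clf, X_toy, y_toy, cv=5)

print('Manual:', manual_scores)
print('Helper:', auto_scores)
```

Because every observation is used for validation exactly once, the mean of the five scores is a steadier estimate than a single train-validation split.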


# VI. Tune Model

**Task 7:** Choose the best performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation. 
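
One way to see which hyperparameters a pipeline exposes is to list the keys of `get_params()`. The sketch below builds a reduced pipeline with only sklearn steps (the `OrdinalEncoder` step is left out so the example has no extra dependency; `pipe` is a hypothetical name) and prints the `step__parameter` names that a search grid can reference.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Reduced pipeline for illustration; make_pipeline names each step
# after its lowercased class name
pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))

# Keys of the form '<step>__<parameter>' are what a search grid refers to
tunable = sorted(name for name in pipe.get_params() if '__' in name)
for name in tunable:
    print(name)
```

Names such as `simpleimputer__strategy` and `randomforestclassifier__max_depth` appear in this list, which is why they work as keys in a parameter grid.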

In [39]:
param_grid = {
    'simpleimputer__strategy':['mean','median'],
    'randomforestclassifier__max_depth': range(12, 20, 1),
    'randomforestclassifier__n_estimators': range(65, 75, 1)}

In [40]:
model = RandomizedSearchCV(
    clf_rf,
    param_distributions = param_grid,
    n_iter = 40,
    cv = 5,
    n_jobs = -1,
    verbose = 1
)

model.fit(X,y)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed: 13.2min
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed: 13.5min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            fill_value=None,


**Task 8:** Print out the best score and best params for `model`.

In [41]:
best_score = model.best_score_
best_params = model.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.8042087542087542
Best params for `model`: {'simpleimputer__strategy': 'mean', 'randomforestclassifier__n_estimators': 66, 'randomforestclassifier__max_depth': 19}
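
Beyond the single best score, the fitted search object keeps every candidate in `cv_results_`, which can help show how close the runners-up were. The following is a small sketch on synthetic data with a deliberately tiny search (`search`, `X_toy`, `y_toy`, and the parameter ranges are assumptions for illustration, not the settings used above).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary data standing in for the pump dataset
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=42)

# Small search so the example runs quickly
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'max_depth': range(2, 8),
                         'n_estimators': range(5, 15)},
    n_iter=5,
    cv=3,
    random_state=42)
search.fit(X_toy, y_toy)

# One row per sampled candidate, best candidate first
results = (pd.DataFrame(search.cv_results_)
           .sort_values('rank_test_score')
           [['rank_test_score', 'mean_test_score', 'std_test_score', 'params']])
print(results.head())
```

If the top few candidates differ by less than their `std_test_score`, the gap between them may be noise rather than a real improvement.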


# VII. Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting. 
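
A format check like the one described can be done by comparing column names and ids before uploading. The sketch below uses two small in-memory DataFrames in place of the real files; the `id` and `status_group` column layout of `sample` is an assumption about `sample_submission.csv`, and all values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the contents of sample_submission.csv
sample = pd.DataFrame({'id': [101, 102, 103],
                       'status_group': ['functional'] * 3})

# Hypothetical predictions, indexed by id like X_test
preds = pd.DataFrame(
    {'status_group': ['functional', 'non functional', 'functional']},
    index=pd.Index([101, 102, 103], name='id'))

# reset_index() mirrors the columns that to_csv() writes by default
written = preds.reset_index()

same_columns = list(written.columns) == list(sample.columns)
same_ids = written['id'].tolist() == sample['id'].tolist()

print('Columns match:', same_columns)
print('Ids match:', same_ids)
```

With the real files, `sample` would come from `pd.read_csv('sample_submission.csv')` and `preds` would be the `submission` DataFrame.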

In [43]:
y_pred = model.predict(X_test)
submission = pd.DataFrame({'status_group':y_pred}, index=X_test.index)
datestamp = pd.Timestamp.now().strftime('%Y-%m-%d_%H%M_')
submission.to_csv(f'{datestamp}submission_Luke_Sislowski.csv')

In [44]:
# Download CSV

from google.colab import files
files.download(f'{datestamp}submission_Luke_Sislowski.csv')
