Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---
<p style="padding: 10px; border: 2px solid red;">
    <b>Before you start:</b> Today is the day you should submit the dataset for your Unit 2 Build Week project. You can review the guidelines and make your submission in the Build Week course for your cohort on Canvas.</p>

In [1]:
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, validation_curve # k-fold CV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # Hyperparameter tuning
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Module Project: Hyperparameter Tuning

This sprint, the module projects will focus on creating and improving a model for the Tanazania Water Pump dataset. Your goal is to create a model to predict whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Use `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for best performing classifier.
- **Task 8:** Print out best score and params for model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [9]:
# Use wrangle function to import training and test data, and clean
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, 
                                  na_values=[0, -2.000000e-08]),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path, 
                         na_values=[0, -2.000000e-08],
                         index_col='id')

    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    
    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)
    
    # Create age feature
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
    df.drop(columns=['date_recorded','construction_year'], inplace=True)
    
    # Drop HCCCs
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns
    dupe_cols = [col for col in df.head(15).T.duplicated().index
                 if df.head(15).T.duplicated()[col]]
    df.drop(columns=dupe_cols, inplace=True)             


    return df

# Using the above wrangle function to read train_features.csv and train_labels.csv into the DataFrame
df = wrangle(fm_path= 'train_features.csv',
             tv_path= 'train_labels.csv')

# test_features.csv into the DataFrame X_test
X_test = wrangle(fm_path= 'test_features.csv')

**Task 1:** Using the above `wrangle` function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [5]:
# Split your DataFrame df into a feature matrix X and the target vector y. You want to predict 'status_group'
target = 'status_group'
y = df[target]
X = df.drop(columns=target)

In [7]:
# Using a randomized split, divide X and y into a training set (X_train, y_train) and a validation set (X_val, y_val)
X_train,y_val,y_train,y_val = train_test_split(X,y,test_size=.2,random_state=42)

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Figure out what is the majority class in `y_train` and what percentage of your training observations it represents.

In [8]:
# Figure out what is the majority class in y_train and what percentage of your training observations it represents
baseline_Acc = y_train.value_counts(normalize=True).max()
print('Baseline Accuracy Score:',baseline_Acc)

Baseline Accuracy Score: 0.5425489938182296


# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer fot missing values.
- a `DecisionTreeClassifier` Predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task. 

In [11]:
# Build a Pipeline named clf_dt
clf_dt = make_pipeline(OrdinalEncoder(),
                        SimpleImputer(strategy='mean'),
                        DecisionTreeClassifier(random_state=42))

clf_dt.fit(X_train,y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'source_type', 'source_class',
                                      'waterpoint_type'],
                                mapping=[{'col': 'basin',
                                          'data_typ...
                                          'data_type': dtype('O'),
                                          'mapping': groundwater    1
surface        2
unknown        3
NaN           -2
dtype: int64},
                           

**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer fot missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task. 

In [12]:
# Task 5: Build a Pipeline named clf_rf
clf_rf = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(strategy='mean'),
                         RandomForestClassifier(n_jobs=-1,
                                                random_state=42))

clf_rf.fit(X_train,y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'source_type', 'source_class',
                                      'waterpoint_type'],
                                mapping=[{'col': 'basin',
                                          'data_typ...
                                          'data_type': dtype('O'),
                                          'mapping': groundwater    1
surface        2
unknown        3
NaN           -2
dtype: int64},
                           

# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.

In [14]:
# Evaluate the performance of both of your classifiers using k-fold cross-validation
cv_scores_dt = cross_val_score(clf_dt,X,y,cv=5) 
cv_scores_rf = cross_val_score(clf_rf,X,y,cv=5, n_jobs=-1)

In [15]:
# Print Results of DecisionTreeClassifier
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74978956 0.75210438 0.74694865 0.75484007 0.7415553 ]
Mean CV accuracy score: 0.7490475916519008
STD CV accuracy score: 0.004560421783713259


In [16]:
# Print Results of RandomForestClassifier
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.79955808 0.79945286 0.79987374 0.80576599 0.80016837]
Mean CV accuracy score: 0.8009638082568997
STD CV accuracy score: 0.002414166074363791


# VI. Tune Model

**Task 7:** Choose the best performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation. 

In [18]:
# hoose the best performing of your two models and tune its hyperparameters using a RandomizedSearchCV named model

clf = make_pipeline(OrdinalEncoder(),
                    SimpleImputer(),
                    RandomForestClassifier(random_state=42,n_jobs=-1))

param_grid = {'simpleimputer__strategy':['meadian','mean'],
             'randomforestclassifier__max_depth':range(5,35,5),
             'randomforestclassifier__n_estimators':range(25,200,5),
             'randomforestclassifier__max_samples':np.arange(0.2,1,0.1),
             'randomforestclassifier__max_features':['sqrt','log2']}

model = RandomizedSearchCV(clf,param_distributions = param_grid,
                               n_iter=400,
                               n_jobs=-1,
                               verbose=1)
model.fit(X,y)

Fitting 5 folds for each of 400 candidates, totalling 2000 fits


        nan 0.80407832 0.80616169        nan        nan        nan
        nan 0.78835829 0.79328265 0.80433085        nan 0.80104787
        nan 0.75710763 0.79340887 0.75618168 0.80397312        nan
        nan 0.79959589 0.80416251 0.79963798        nan 0.80066906
 0.75584498 0.8027315         nan 0.75546616        nan 0.80277356
        nan 0.75374058        nan        nan 0.79324051 0.79239881
 0.72385777        nan        nan        nan        nan 0.79500827
 0.79338783        nan        nan 0.72415239 0.7228687         nan
 0.72263725 0.79791238 0.80239479 0.75597124 0.7567499         nan
 0.72299497 0.78393904        nan        nan        nan 0.75561354
 0.80397303 0.78524374        nan        nan        nan        nan
        nan 0.7938508  0.80483592 0.7852858  0.7215008  0.72337378
 0.80456236 0.80443606 0.8052147         nan        nan 0.80197385
        nan        nan        nan        nan        nan        nan
 0.756813   0.78503333 0.80435184        nan        nan       

RandomizedSearchCV(estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder()),
                                             ('simpleimputer', SimpleImputer()),
                                             ('randomforestclassifier',
                                              RandomForestClassifier(n_jobs=-1,
                                                                     random_state=42))]),
                   n_iter=400, n_jobs=-1,
                   param_distributions={'randomforestclassifier__max_depth': range(5, 35, 5),
                                        'randomforestclassifier__max_features': ['sqrt',
                                                                                 'log2'],
                                        'randomforestclassifier__max_samples': array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'randomforestclassifier__n_estimators': range(25, 200, 5),
      

**Task 8:** Print out the best score and best params for `model`.

In [19]:
# Assign the Best Score 
best_score = model.best_score_
best_params = model.best_params_

# Print out the best score and best params for model
print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.8061616905666155
Best params for `model`: {'simpleimputer__strategy': 'mean', 'randomforestclassifier__n_estimators': 125, 'randomforestclassifier__max_samples': 0.6000000000000001, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__max_depth': 25}


# Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submissions follows the same formatting. 

In [None]:
X_test = X_test[X_train.columns]
y_pred = model_rfgs.predict(X_test)
submission = pd.DataFrame({'status_group':y_pred}, index=X_test.index)
datestamp = pd.Timestamp.now().strftime('%Y-%m-%d_%H%M_')
submission.to_csv(f'{datestamp}submission.csv')