BloomTech Data Science

*Unit 2, Sprint 2, Module 3*

---

# Module Project: Hyperparameter Tuning
This week, the module projects will focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model that predicts whether a water pump is functional, non-functional, or functional but in need of repair.

## Directions

The tasks for this project are as follows:

- **Task 1:** Use the `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for best performing classifier.
- **Task 8:** Print out best score and params for model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `ydata-profiling`
- `sklearn`

# I. Wrangle Data

In [3]:
%%capture
!pip install category_encoders==2.*

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.impute import SimpleImputer
from category_encoders import OrdinalEncoder
from sklearn.pipeline import make_pipeline

In [7]:
# Mount your Google Drive in Colab.
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [8]:
# Change your working directory if you have saved (or plan to save) your Kaggle dataset on Google Drive.
%cd /content/gdrive/My Drive/Kaggle
# Update the path above to the Drive folder that contains the dataset and/or the Kaggle API token JSON file.

/content/gdrive/My Drive/Kaggle


In [9]:
# Download your Kaggle Dataset, if you haven't already done so.
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
!kaggle competitions download -c bloomtech-water-pump-challenge

Downloading bloomtech-water-pump-challenge.zip to /content/gdrive/My Drive/Kaggle
  0% 0.00/4.18M [00:00<?, ?B/s]
100% 4.18M/4.18M [00:00<00:00, 109MB/s]


In [10]:
# Unzip your Kaggle dataset, if you haven't already done so.
!unzip \*.zip  && rm *.zip

Archive:  bloomtech-water-pump-challenge.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: sample_submission.csv   
  inflating: test_features.csv       
  inflating: train_features.csv      
  inflating: train_labels.csv        


In [11]:
# List all files in your Kaggle folder on your google drive.
!ls

2024-01-08_2015_submission.csv	new_submission.csv     train_features.csv
kaggle.json			sample_submission.csv  train_labels.csv
model_rf_rs_80			test_features.csv      Untitled0.ipynb


In [12]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path,
                                  na_values=[0, -2.000000e-08]),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path,
                         na_values=[0, -2.000000e-08],
                         index_col='id')

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)

    # Drop high-cardinality categorical columns (HCCCs)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns (compared on the first 100 rows only)
    dupe_mask = df.head(100).T.duplicated()
    dupe_cols = dupe_mask[dupe_mask].index
    df.drop(columns=dupe_cols, inplace=True)

    return df
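
To see what the `na_values` argument in `wrangle` does, here is a minimal sketch on an in-memory CSV. The column names and values are made up for illustration; only the `na_values` and `index_col` arguments mirror the function above.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the zeros stand in for unrecorded measurements.
raw = io.StringIO(
    "id,gps_height,population\n"
    "1,0,10\n"
    "2,1500,0\n"
    "3,-5,25\n"
)

# Same idea as in `wrangle`: treat 0 as a missing-value marker while reading.
toy = pd.read_csv(raw, na_values=[0], index_col="id")

print(toy)               # zeros are shown as NaN
print(toy.isna().sum())  # number of missing values per column
```

Because the zeros become `NaN`, a later `SimpleImputer` step can fill them in instead of the model treating them as real measurements.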

**Task 1:** Use the above `wrangle` function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

In [13]:
df = wrangle(fm_path='train_features.csv',
             tv_path='train_labels.csv')
X_test = wrangle(fm_path='test_features.csv')

# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [14]:
X = df.drop(columns= 'status_group')
y = df['status_group']

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Find the majority class in `y` and the percentage of your training observations that it represents.

In [15]:
baseline_acc = y.value_counts(normalize= True).max()
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5429828068772491
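
The baseline is simply the accuracy of a model that always predicts the majority class. The sketch below checks this on a small made-up label vector (the counts are illustrative, not taken from the dataset).

```python
import pandas as pd

# Hypothetical labels: 5 functional, 3 non functional, 2 needing repair.
y_toy = pd.Series(
    ["functional"] * 5
    + ["non functional"] * 3
    + ["functional needs repair"] * 2
)

# Share of the most common class, computed the same way as in the cell above.
baseline_acc = y_toy.value_counts(normalize=True).max()

# Accuracy of a "model" that predicts the majority class for every row.
majority_class = y_toy.value_counts().idxmax()
naive_acc = (y_toy == majority_class).mean()

print(majority_class, baseline_acc, naive_acc)
```

Any classifier worth keeping should beat this number.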


# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task.

In [16]:
clf_dt = make_pipeline(OrdinalEncoder(),
                       SimpleImputer(),
                       DecisionTreeClassifier(random_state= 42))
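
`make_pipeline` names each step after its lowercased class name, and those names become the prefixes used to address hyperparameters (`step__parameter`). A minimal sketch, leaving out the encoder so that it depends only on `sklearn`:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Two-step pipeline; the OrdinalEncoder is omitted here for brevity.
pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=42))

# Step names are generated automatically from the class names.
step_names = list(pipe.named_steps)
print(step_names)

# Hyperparameters are exposed as '<step name>__<parameter name>'.
params = pipe.get_params()
print("simpleimputer__strategy" in params)
print(params["decisiontreeclassifier__max_depth"])
```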

**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task.

In [25]:
clf_rf = make_pipeline(OrdinalEncoder(),
                       SimpleImputer(),
                       RandomForestClassifier(n_estimators = 25, random_state = 42))

clf_rf

# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.

In [21]:
k_fold_cv = KFold(n_splits=5, shuffle=True, random_state=42)
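
As a rough illustration of what this splitter does, the sketch below applies the same `KFold` settings to ten made-up rows. Each row lands in the validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical observations with a single feature.
X_toy = np.arange(10).reshape(-1, 1)

k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

val_folds = []
for train_idx, val_idx in k_fold.split(X_toy):
    val_folds.append(val_idx)
    print("train:", train_idx, "validate:", val_idx)

# Together, the validation folds cover every row exactly once.
all_val = np.sort(np.concatenate(val_folds))
print(all_val)
```

Fixing `random_state` means both classifiers are scored on identical folds, which makes their scores directly comparable.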

In [22]:
cv_scores_dt = cross_val_score(clf_dt, X, y, cv=k_fold_cv, n_jobs=-1)
cv_scores_rf = cross_val_score(clf_rf, X, y, cv=k_fold_cv, n_jobs=-1)

In [23]:
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74621212 0.74821128 0.74926347 0.75473485 0.74723771]
Mean CV accuracy score: 0.7491318863155388
STD CV accuracy score: 0.002978957242934609


In [24]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.79156145 0.79713805 0.79387626 0.79724327 0.79616963]
Mean CV accuracy score: 0.7951977308423956
STD CV accuracy score: 0.0021846035397917497
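
One informal way to read these results is to compare the gap between the two mean scores with the fold-to-fold spread. The sketch below uses the scores printed above; it is a rough sanity check, not a formal statistical test.

```python
import numpy as np

# Fold scores copied from the outputs above.
cv_scores_dt = np.array([0.74621212, 0.74821128, 0.74926347, 0.75473485, 0.74723771])
cv_scores_rf = np.array([0.79156145, 0.79713805, 0.79387626, 0.79724327, 0.79616963])

# Difference in mean accuracy versus the larger of the two standard deviations.
gap = cv_scores_rf.mean() - cv_scores_dt.mean()
spread = max(cv_scores_dt.std(), cv_scores_rf.std())

# Both models were scored on the same folds, so a per-fold comparison is fair.
rf_wins_every_fold = bool((cv_scores_rf > cv_scores_dt).all())

print("gap:", gap, "spread:", spread, "RF better on every fold:", rf_wins_every_fold)
```

The random forest leads on every fold by a margin much larger than the spread, so it is the model to tune in the next section.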


# VI. Tune Model

**Task 7:** Choose the best performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation.

In [28]:
param_grid = {'simpleimputer__strategy': ['mean', 'median'],
              'randomforestclassifier__max_depth': range(5, 40, 5),
              'randomforestclassifier__n_estimators': range(25, 125, 25)}

model = RandomizedSearchCV(clf_rf,
                           param_distributions=param_grid,
                           n_jobs= -1,
                           cv = 5,
                           verbose = 1,
                           n_iter= 25)

model.fit(X, y)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
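
The grid above contains more combinations than the search actually tries. The sketch below counts them and draws 25 candidates with `ParameterSampler`, the same sampling idea that `RandomizedSearchCV` relies on. The `random_state=42` value is an assumption added here for reproducibility; the search above does not set one, so its sampled candidates can differ between runs.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {'simpleimputer__strategy': ['mean', 'median'],
              'randomforestclassifier__max_depth': range(5, 40, 5),
              'randomforestclassifier__n_estimators': range(25, 125, 25)}

# Full grid: 2 strategies x 7 depths x 4 forest sizes.
n_combinations = len(ParameterGrid(param_grid))

# Draw 25 candidates; with list-like values they are sampled without replacement.
sampled = list(ParameterSampler(param_grid, n_iter=25, random_state=42))
n_distinct = len({tuple(sorted(p.items())) for p in sampled})

print(n_combinations, len(sampled), n_distinct)
```

With 5-fold cross-validation, 25 candidates give the 125 fits reported in the output above.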


**Task 8:** Print out the best score and best params for `model`.

In [29]:
best_score = model.best_score_
best_params = model.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.80422563484294
Best params for `model`: {'simpleimputer__strategy': 'median', 'randomforestclassifier__n_estimators': 100, 'randomforestclassifier__max_depth': 20}
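
`best_score_` is the highest mean cross-validated score among the sampled candidates, and by default the search refits the winning pipeline on all of the data, which is why the search object can be used directly for prediction. A minimal sketch on synthetic data (the dataset, grid, and `n_iter` below are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Small synthetic classification problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))
params = {
    "simpleimputer__strategy": ["mean", "median"],
    "randomforestclassifier__max_depth": [2, 4, 6],
    "randomforestclassifier__n_estimators": [10, 20],
}

search = RandomizedSearchCV(pipe, param_distributions=params,
                            n_iter=4, cv=3, random_state=42)
search.fit(X_toy, y_toy)

# One mean score per sampled candidate; the best one is reported as best_score_.
mean_scores = search.cv_results_["mean_test_score"]
print(search.best_score_ == mean_scores.max())

# The refit best pipeline is what `search.predict` delegates to.
print(search.best_estimator_.predict(X_toy[:3]))
```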


# VII. Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site.

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting.
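
A quick programmatic check can catch formatting mismatches before uploading. The sketch below uses a hypothetical helper, `same_format`, and made-up frames; it assumes the sample file has an `id` index and a single `status_group` column, as described in Task 9.

```python
import pandas as pd


def same_format(submission, sample):
    """Return True if both frames share column names and index name."""
    return (list(submission.columns) == list(sample.columns)
            and submission.index.name == sample.index.name)


# Stand-in for the sample submission file (assumed layout).
sample = pd.DataFrame({"status_group": ["functional", "functional"]},
                      index=pd.Index([1, 2], name="id"))

# A correctly named submission and one with a mistyped column name.
good = pd.DataFrame({"status_group": ["non functional", "functional"]},
                    index=pd.Index([1, 2], name="id"))
bad = good.rename(columns={"status_group": "status_results"})

print(same_format(good, sample))
print(same_format(bad, sample))
```

In practice, `sample` would come from `pd.read_csv('sample_submission.csv', index_col='id')`, assuming the file uses that index column.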

In [30]:
y_preds = model.predict(X_test)

# The column must be named 'status_group' to match the required submission format.
submission = pd.DataFrame({'status_group': y_preds}, index=X_test.index)
submission

id,status_group
37098,non functional
14530,functional
62607,functional
46053,functional
47083,functional
...,...
26092,functional
919,non functional
47444,non functional
61128,non functional


In [31]:
submission.to_csv('new_submission.csv')

In [32]:
ls

2024-01-08_2015_submission.csv  new_submission.csv     train_features.csv
kaggle.json                     sample_submission.csv  train_labels.csv
model_rf_rs_80                  test_features.csv      Untitled0.ipynb
