## Capstone - NYC Tree Census - Hyperparameter Tuning

### Table of contents
1. [Background](#Background)
     -   1.1 [Data Source](#Data-Source)
     -   1.2 [Objective](#Objective)
     

## 1. Loading and Preparing Data

#### 1.1 Load Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import (cross_val_score, train_test_split)

*Successfully loaded all the required libraries*

#### 1.2 Load the cleaned data 

In [2]:
df = pd.read_csv('encoded_data_health.csv')

*Successfully loaded the cleaned and encoded file*

In [3]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,tree_dbh,health,latitude,longitude,x_sp,y_sp,problem_count,curb_loc_OffsetFromCurb,curb_loc_OnCurb,...,trnk_light_No,trnk_light_Yes,trnk_other_No,trnk_other_Yes,brch_light_No,brch_light_Yes,brch_shoe_No,brch_shoe_Yes,brch_other_No,brch_other_Yes
0,0,3,Fair,40.723092,-73.844215,1027431.148,202756.7687,0,0,1,...,1,0,1,0,1,0,1,0,1,0
1,1,21,Fair,40.794111,-73.818679,1034455.701,228644.8374,1,0,1,...,1,0,1,0,1,0,1,0,1,0


In [4]:
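# drop the leftover index column; axis is passed by keyword because the positional form is deprecated
df = df.drop('Unnamed: 0', axis=1)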


In [5]:
df.head(2)

Unnamed: 0,tree_dbh,health,latitude,longitude,x_sp,y_sp,problem_count,curb_loc_OffsetFromCurb,curb_loc_OnCurb,steward_1or2,...,trnk_light_No,trnk_light_Yes,trnk_other_No,trnk_other_Yes,brch_light_No,brch_light_Yes,brch_shoe_No,brch_shoe_Yes,brch_other_No,brch_other_Yes
0,3,Fair,40.723092,-73.844215,1027431.148,202756.7687,0,0,1,0,...,1,0,1,0,1,0,1,0,1,0
1,21,Fair,40.794111,-73.818679,1034455.701,228644.8374,1,0,1,0,...,1,0,1,0,1,0,1,0,1,0
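
The preview above still lists an `Unnamed: 0` column after the drop. Columns with this name usually come from a DataFrame index that was written to CSV and then read back as data. A minimal sketch, assuming a toy CSV rather than the census file, of removing every such column in one step:

```python
import io

import pandas as pd

# Toy CSV standing in for the encoded file: the blank first header cell is a
# saved index, which pandas labels "Unnamed: 0" on load.
csv_text = ",tree_dbh,health\n0,3,Fair\n1,21,Fair\n"
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.columns.tolist())

# Keep only the columns whose name does not start with "Unnamed".
demo = demo.loc[:, ~demo.columns.str.startswith('Unnamed')]
print(demo.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv`, or writing the file with `to_csv(..., index=False)`, avoids the extra column in the first place. Any leftover index column that stays in `df` would otherwise be passed to the model as a feature.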


## Setting variables

In [6]:
# setting X and y variables
y = df['health'].values
X = df.drop('health', axis=1).values

## Random oversampling using imblearn

In [7]:
# import libraries
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros = RandomOverSampler(random_state=42)

# resample the predictors and the target variable so every class matches the majority count
X_ros, y_ros = ros.fit_resample(X, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

Original dataset shape Counter({'Good': 521205, 'Fair': 94874, 'Poor': 26309})
Resample dataset shape Counter({'Fair': 521205, 'Good': 521205, 'Poor': 521205})


In [8]:
# train test split
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X_ros, y_ros, test_size=0.25, random_state=42)

print(X_train_rs.shape, y_train_rs.shape)
print(X_test_rs.shape, y_test_rs.shape)

(1172711, 36) (1172711,)
(390904, 36) (390904,)
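
The cells above oversample the full dataset before splitting, so copies of the same minority-class row can land in both the training and the test set, which tends to inflate test scores. A minimal sketch of the alternative order (split first, then oversample the training part only) follows. It runs on toy data, and `oversample_to_majority` is a hypothetical NumPy stand-in for `RandomOverSampler`, used here only to keep the example free of extra dependencies.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split


def oversample_to_majority(X, y, seed=42):
    """Hypothetical helper: duplicate minority-class rows at random until
    every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # keep every original row, then draw the shortfall with replacement
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    picked = np.concatenate(picked)
    return X[picked], y[picked]


# Toy data standing in for the tree census features and labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.choice(['Good', 'Fair', 'Poor'], size=1000, p=[0.8, 0.15, 0.05])

# 1) split first, keeping the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

# 2) oversample the training part only; the test part keeps its real distribution
X_tr_bal, y_tr_bal = oversample_to_majority(X_tr, y_tr)

print('Train before:    ', Counter(y_tr))
print('Train after:     ', Counter(y_tr_bal))
print('Test (untouched):', Counter(y_te))
```

With this order the test metrics describe performance on the original, imbalanced class distribution, so they are not directly comparable to the scores reported in this notebook.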


## Random Forest

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix

rf_rs = RandomForestClassifier(random_state=42)
rf_rs.fit(X_train_rs, y_train_rs)
y_pred_rs = rf_rs.predict(X_test_rs)

# accuracy scores
print('Training Set Accuracy Score: ', rf_rs.score(X_train_rs, y_train_rs))
print('Test Set Accuracy Score: ', rf_rs.score(X_test_rs, y_test_rs))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test_rs, y_pred_rs))

Training Set Accuracy Score:  0.9999735655246689
Test Set Accuracy Score:  0.9596115670343613
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.91      0.99      0.95    130385
        Good       0.99      0.89      0.94    129681
        Poor       0.98      1.00      0.99    130838

    accuracy                           0.96    390904
   macro avg       0.96      0.96      0.96    390904
weighted avg       0.96      0.96      0.96    390904
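
`confusion_matrix` is imported above but not used. The report shows a recall of 0.89 for `Good` against 0.99 and 1.00 for the other two classes, and a confusion matrix would show which classes the missed `Good` trees are assigned to. A minimal sketch on synthetic three-class data (the data, the `rf_demo` name and the small forest size are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the tree health labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, rf_demo.predict(X_te), labels=[0, 1, 2])
print(cm)
```

In the notebook itself the equivalent call would be `confusion_matrix(y_test_rs, y_pred_rs)`, with rows and columns in the sorted label order `Fair`, `Good`, `Poor`.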



## Cross Validation 

In [10]:
# Calculate the accuracy score for each of the 5 cross-validation folds.
from sklearn.model_selection import cross_val_score
ac_cv = cross_val_score(estimator=rf_rs, X = X_train_rs, y = y_train_rs, cv=5)
print("scores for each fold")
for val in ac_cv:
    print(val)

scores for each fold
0.9463211436708834
0.9468922410485116
0.9459457154795303
0.945770906703277
0.9460992061123381


## Random Search 

In [11]:
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint

# Create the random grid (one value per hyperparameter, so only one candidate is evaluated)
random_grid = {'n_estimators': [1000],
 'min_samples_split': [10],
 'min_samples_leaf': [1],
 'max_features': ['auto'],
 'max_depth': [80],
 'bootstrap': [False]}


In [12]:
# Use the random grid to search for best hyperparameters

rf_random = RandomizedSearchCV(estimator = rf_rs, param_distributions = random_grid, cv = 5, verbose=2, random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train_rs, y_train_rs)



Fitting 5 folds for each of 1 candidates, totalling 5 fits


3 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\parij\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\parij\AppData\Roaming\Python\Python39\site-packages\sklearn\ensemble\_forest.py", line 450, in fit
    trees = Parallel(
  File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 1046, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(

KeyboardInterrupt: 

In [None]:
rf_random.best_score_
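
The grid above lists one value per hyperparameter, which is why the log reports "Fitting 5 folds for each of 1 candidates", and the run ended in a `KeyboardInterrupt` before a best score was available. A minimal sketch of a randomized search that actually samples several combinations is shown below. The data, the value ranges and the `demo_` names are assumptions chosen so the example runs in seconds; they are not tuned for the census data. `'auto'` is replaced by `'sqrt'`, which selects the same number of features for a classifier and is the spelling accepted by recent scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic data so the sketch runs quickly.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42)

# Several values per hyperparameter (illustrative only).
demo_grid = {
    'n_estimators': [20, 50],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 10],
    'max_features': ['sqrt', None],
    'bootstrap': [True, False],
}

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=demo_grid,
    n_iter=4,          # number of combinations sampled from the grid
    cv=3,
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

On the full oversampled training set (about 1.17 million rows) each fit of a 1000-tree forest is expensive, so searching on a subsample or with fewer trees first may be a practical way to narrow the ranges.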

## Grid SearchCV

In [13]:
# defining parameters (one value each, so the grid contains a single combination)
params = {'n_estimators': [1000],
 'min_samples_split': [10],
 'min_samples_leaf': [1],
 'max_features': ['auto'],
 'max_depth': [80],
 'bootstrap': [False]}

In [14]:
# Find the best hyperparameters and the corresponding cross-validated accuracy.
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(estimator = rf_rs, param_grid = params, cv=5)
clf.fit(X_train_rs, y_train_rs)

print(clf.score(X_train_rs, y_train_rs))  # training-set accuracy of the refitted best model
print(clf.best_params_)                   # best hyperparameter combination
print(clf.best_score_)                    # mean cross-validated accuracy of that combination

0.9981282686015566
{'bootstrap': False, 'max_depth': 80, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 1000}
0.9467225939055737
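
The three values printed above are, in order, the training-set accuracy of the refitted best model, the best parameters, and the mean cross-validated accuracy. None of them uses the held-out test set. A minimal sketch of a grid search over more than one combination, followed by a test-set evaluation, assuming synthetic data and illustrative `demo_` names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the oversampled census features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Two values for each of two hyperparameters gives four combinations.
demo_params = {'max_depth': [5, None], 'min_samples_split': [2, 10]}

demo_clf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=42),
    param_grid=demo_params,
    cv=3,
)
demo_clf.fit(X_tr, y_tr)

print('Best parameters:      ', demo_clf.best_params_)
print('Mean CV accuracy:     ', demo_clf.best_score_)
print('Training set accuracy:', demo_clf.score(X_tr, y_tr))
print('Test set accuracy:    ', demo_clf.score(X_te, y_te))
```

In the notebook the matching final step would be `clf.score(X_test_rs, y_test_rs)`, which allows a like-for-like comparison with the 0.9596 test accuracy of the untuned forest.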
