### Machine Learning

Note: The following models are commented out (wrapped in """xxx""") because they took a long time (more than 40 minutes) to run. Uncomment them if you'd like to run them on your machine.
- SVM
- Random Forest
- GradientBoost
- XGBoost classifier 

In [2]:
# wip, refactor as we firm model choices
# Import libraries for pre-processing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

### Import files

In [3]:
# Import the files
df_test = pd.read_csv("https://raw.githubusercontent.com/AngShengJun/dsi14P4/master/assets/working/df_test_weather_cleaned.csv")
df_train = pd.read_csv("https://raw.githubusercontent.com/AngShengJun/dsi14P4/master/assets/working/df_train_weather_cleaned.csv")

### Prep

In [13]:
# Make a copy of test
df_testcopy = df_test.copy()

In [14]:
# Review the 1st line
df_testcopy.head(1)

Unnamed: 0.1,Unnamed: 0,id,date,lat,long,tavg,stnpress,dewpt,precip,windspeed,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,0,1,2008-06-11,41.95469,-87.800991,75.0,29.31,55.5,0.0,9.15,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Drop unnecessary col
df_testcopy.drop(['id','Unnamed: 0'],axis=1,inplace=True)
# Review
df_testcopy.head(1)

Unnamed: 0,date,lat,long,tavg,stnpress,dewpt,precip,windspeed,daylight,wk,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,2008-06-11,41.95469,-87.800991,75.0,29.31,55.5,0.0,9.15,15.166667,23,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Make a copy of train
df_traincopy = df_train.copy()

In [17]:
# Review the 1st line
df_traincopy.head(1)

Unnamed: 0.1,Unnamed: 0,date,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,0,2007-05-29,41.95469,-87.800991,0,1,22,2007,75.5,29.415,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Drop unnecessary col
df_traincopy.drop(['Unnamed: 0'],axis=1,inplace=True)
# Review
df_traincopy.head(1)

Unnamed: 0,date,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,dewpt,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,2007-05-29,41.95469,-87.800991,0,1,22,2007,75.5,29.415,58.5,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Sanity check
print(df_traincopy.shape)
print(df_testcopy.shape)

(8610, 156)
(116293, 154)


In [20]:
# Drop date column from test and train
df_traincopy.drop(['date'],axis=1,inplace=True)
df_testcopy.drop(['date'],axis=1,inplace=True)

In [21]:
# trc - TrainComplete set
X_trc = df_traincopy.drop(['num_mos','wnv'],axis=1)
y_trc = df_traincopy['wnv']

### Train-Validate-Split

In [26]:
# Train-validate-split
X_train,X_val,y_train,y_val = train_test_split(X_trc,y_trc,test_size=0.3,random_state=42, stratify=y_trc)

In [27]:
print(y_train.value_counts(normalize=True))

0    0.946906
1    0.053094
Name: wnv, dtype: float64


The positive class is wnv (class 1) and the negative class is no wnv (class 0). The classes are highly imbalanced: only about 5% of the training observations are positive. With such a distribution, many classification algorithms have low predictive accuracy for the infrequent class, which here is the positive class we are interested in.

We will use SMOTE (Synthetic Minority Oversampling TEchnique) to mitigate class imbalance. SMOTE synthesizes new elements for the minority class based on existing observations. A minority class observation is picked at random and its k nearest neighbors are computed. Synthetic points are then added between this observation and its neighbors.
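The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the imbalanced-learn implementation: the function name `smote_like_samples`, its parameters, and the toy array `X_minority` are all assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances between minority observations
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # randomly pick a minority row
        nb = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                      # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points (corners of the unit square)
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(X_minority, n_new=6, k=2)
print(new_rows.shape)  # (6, 2)
```

Every synthetic row lies on a line segment between two existing minority points, so no new row falls outside the region the minority class already occupies.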

### Resampling

In [28]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
# fit_resample replaces the deprecated fit_sample method
X_sm, y_sm = smote.fit_resample(X_train, y_train)

In [29]:
# Review class balance after SMOTE application
print(y_sm.value_counts(normalize=True))

1    0.5
0    0.5
Name: wnv, dtype: float64


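On the SMOTE-balanced training set, the baseline accuracy is 0.5, and a model needs to perform better than this. The validation set keeps the original class imbalance, so always predicting the negative class would already score about 0.947 there.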
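The baseline figure is simply the accuracy of always predicting the most common class. A minimal sketch follows, using hypothetical label lists rather than the project data; `majority_baseline` is a helper name introduced here for illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical label lists: balanced (as after SMOTE) vs. roughly 95/5 imbalanced
balanced = [0] * 50 + [1] * 50
imbalanced = [0] * 95 + [1] * 5

print(majority_baseline(balanced))    # 0.5
print(majority_baseline(imbalanced))  # 0.95
```

The same model accuracy therefore means very different things depending on whether it is measured on balanced or imbalanced data.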

### Classification Models

Model workflow:
- pipeline for standard scaler (transformer) and classifier model (estimator), where relevant.
- gridsearch for best model parameters.
- metrics for evaluation: F1 score and roc_auc since we are dealing with imbalanced class distribution.
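For classifiers, `GridSearchCV` selects parameters by accuracy unless a `scoring` argument is given. One possible way to make the search itself use F1 and ROC AUC is sketched below. It runs on synthetic data from `make_classification`, not on the project data, and the names `X_demo`, `y_demo`, `demo_pipe`, and `demo_gs` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the project data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.9, 0.1], random_state=42)

# Same scaler + estimator structure as the workflow above
demo_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lor', LogisticRegression(solver='lbfgs', random_state=42)),
])

# Track both metrics during cross-validation; pick the best model by ROC AUC
demo_gs = GridSearchCV(demo_pipe,
                       param_grid={'lor__C': [0.1, 1.0, 10.0]},
                       scoring={'f1': 'f1', 'roc_auc': 'roc_auc'},
                       refit='roc_auc',
                       cv=5)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)
print(round(demo_gs.best_score_, 4))
```

With multi-metric scoring, the per-metric results are available in `demo_gs.cv_results_` under keys such as `mean_test_f1` and `mean_test_roc_auc`.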

### Logistic Regression Model

In [30]:
pipe1 = Pipeline([
    ('ss', StandardScaler()),
    ('lor', LogisticRegression(solver='lbfgs',random_state=42)),
])

In [31]:
pipe1.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('lor',
   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                      intercept_scaling=1, l1_ratio=None, max_iter=100,
                      multi_class='auto', n_jobs=None, penalty='l2',
                      random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                      warm_start=False))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'lor': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'lor__C': 1.0,
 'lor__class_weight': None,
 'lor__dual': False,
 '

In [32]:
# Define the pipe parameters
pipe1_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'lor__max_iter': [100,200,300]}

In [33]:
# Instantiate GridSearch
gs1 = GridSearchCV(pipe1,
                   param_grid=pipe1_params,
                   cv=10)
# Fit GridSearch to the cleaned training data.
gs1.fit(X_sm, y_sm)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('lor',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
                

In [34]:
# Check the results of the grid search
# Google Colab reports hitting the max_iter limit, while Jupyter runs fine with the provided params
print(f"Best parameters: {gs1.best_params_}")
print(f"Best score: {gs1.best_score_}")

Best parameters: {'lor__max_iter': 100, 'ss__with_mean': True, 'ss__with_std': True}
Best score: 0.9676858103700475


In [35]:
# Save model
model1 = gs1.best_estimator_

In [25]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model1.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model1.score(X_val, y_val)}")

Accuracy on train set: 0.9680217277028211
Accuracy on validate set: 0.9450251645373596


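The model's accuracy is higher than the 0.5 baseline of the balanced training set, so modelling helps with classification. Validation accuracy is about 2 percentage points lower than training accuracy, which suggests slight overfitting. Because the validation set keeps the original class imbalance (about 95% negative), accuracy alone says little here; the confusion matrix and ROC AUC are more informative.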

In [26]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs1.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

| | pred no wnv | pred wnv |
|---|---|---|
| Actual no wnv | 2441 | 5 |
| Actual wnv | 137 | 0 |


In [27]:
# Flatten the 2x2 confusion matrix into a 1-D array
# and save the TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

In [28]:
# Summary of metrics for log reg model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.0
Precision: 0.0
F1: nan
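The `nan` arises because precision and sensitivity are both zero, so the F1 formula divides zero by zero. A minimal sketch of a guarded version is shown below; the helpers `safe_div` and `summary_metrics` are names introduced for this example, and returning 0.0 for an undefined ratio is a convention chosen here, not something the notebook prescribes.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def summary_metrics(tn, fp, fn, tp):
    sens = safe_div(tp, tp + fn)   # recall on the positive class
    prec = safe_div(tp, tp + fp)
    f1 = safe_div(2 * prec * sens, prec + sens)
    return sens, prec, f1

# Counts taken from the logistic regression confusion matrix above
sens, prec, f1 = summary_metrics(tn=2441, fp=5, fn=137, tp=0)
print(f"Sensitivity: {round(sens, 4)}")
print(f"Precision: {round(prec, 4)}")
print(f"F1: {round(f1, 4)}")
```

Reporting 0.0 instead of `nan` keeps the metric comparable across models while still making clear that no positive case was identified.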




In [29]:
pred_proba = [i[1] for i in gs1.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

| | validate_values | pred_probs |
|---|---|---|
| 5602 | 0 | 0.19544 |
| 1012 | 0 | 0.000978 |
| 7561 | 0 | 0.032611 |
| 7887 | 0 | 0.001325 |
| 4422 | 0 | 0.054452 |


In [30]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.7896222642658057

### SVM

Note: The SVM code is commented out because it takes significant time to run (more than 30 minutes).

In [31]:
"""pipe2 = Pipeline([
    ('ss', StandardScaler()),
    ('svm', SVC())
])"""

In [32]:
"""pipe2.get_params()"""

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('svm',
   SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
       max_iter=-1, probability=False, random_state=None, shrinking=True,
       tol=0.001, verbose=False))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'svm': SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'svm__C': 1.0,
 'svm__break_ties': False,
 'svm__cache_size': 200,
 'svm__class_weight': None,
 'svm__coef0': 0.0,
 'svm__decision_function_shape': 'ovr',
 'svm__degree': 3,
 'svm__gamma': 'scale',
 'svm__kernel': 'rbf

In [33]:
# Define the pipe parameters
"""pipe2_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'svm__C': [1,10],
                'svm__gamma': ['scale','auto'],
                'svm__kernel': ['rbf','linear','poly']
               }"""

In [None]:
# Instantiate GridSearch
"""gs2 = GridSearchCV(pipe2,
                  param_grid=pipe2_params,
                  cv=10)

# Fit gs2
gs2.fit(X_sm, y_sm)"""

In [None]:
# Check the results of the grid search

"""print(f"Best parameters: {gs2.best_params_}")
print(f"Best score: {gs2.best_score_}")"""

In [None]:
# Save model
"""model2 = gs2.best_estimator_"""

In [None]:
# Score model on train set and validate set
"""print(f"Accuracy on train set: {model2.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model2.score(X_val, y_val)}")"""

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
"""preds = gs2.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df"""

In [None]:
# Flatten the 2x2 confusion matrix into a 1-D array
# and save the TN/FP/FN/TP values.
"""tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()"""

In [None]:
# Summary of metrics for SVM model
"""sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")"""

In [None]:
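"""# SVC() above is built with probability=False, so predict_proba is unavailable.
# decision_function scores rank the samples and are sufficient for ROC AUC.
pred_proba = gs2.decision_function(X_val)

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()"""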

In [None]:
# Calculate ROC AUC.
"""roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])"""

### KNN Classifier

In [59]:
pipe3 = Pipeline([
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [60]:
pipe3.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('knn',
   KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                        metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                        weights='uniform'))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'knn': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                      weights='uniform'),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform'}

In [61]:
# Define the pipe parameters
pipe3_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'knn__n_neighbors' : [3,5,7],
                'knn__metric': ['euclidean','manhattan']}

In [62]:
# Instantiate GridSearch
gs3 = GridSearchCV(pipe3,
                   param_grid=pipe3_params,
                   cv=10)
# Fit GridSearch to the cleaned training data.
gs3.fit(X_sm, y_sm)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('knn',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                                             n_neighbors=5, p=2,
                                                             weights='uniform'))],
                                verbose=False),
             i

In [63]:
# Check the results of the grid search

print(f"Best parameters: {gs3.best_params_}")
print(f"Best score: {gs3.best_score_}")

Best parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 5, 'ss__with_mean': True, 'ss__with_std': True}
Best score: 0.965929585225729


In [64]:
# Save model
model3 = gs3.best_estimator_

In [65]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model3.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model3.score(X_val, y_val)}")

Accuracy on train set: 0.9724899246539338
Accuracy on validate set: 0.9337979094076655


Validation accuracy is about 4 percentage points lower than training accuracy, which suggests the model is overfitted.

In [66]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs3.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

| | pred no wnv | pred wnv |
|---|---|---|
| Actual no wnv | 2403 | 43 |
| Actual wnv | 128 | 9 |


In [67]:
# Flatten the 2x2 confusion matrix into a 1-D array
# and save the TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

In [68]:
# Summary of metrics for KNN model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.0657
Precision: 0.1731
F1: 0.0952


In [76]:
pred_proba = [i[1] for i in gs3.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

| | validate_values | pred_probs |
|---|---|---|
| 5602 | 0 | 0.2 |
| 1012 | 0 | 0.0 |
| 7561 | 0 | 0.0 |
| 7887 | 0 | 0.0 |
| 4422 | 0 | 0.0 |


In [77]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.7214639124803791

### Decision Tree

In [155]:
gs4 = GridSearchCV(estimator=DecisionTreeClassifier(),
                   param_grid={'max_depth': [7, 9, 11],
                               'min_samples_split': [10, 15, 20],
                               'min_samples_leaf': [2, 3, 4],
                               'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10]},
                   cv=5,
                   verbose=2)

In [156]:
# Fit GridSearch to the cleaned training data.
gs4.fit(X_sm,y_sm)

Fitting 5 folds for each of 162 candidates, totalling 810 fits
[CV]  ccp_

[Parallel(n_jobs=1)]: Done 810 out of 810 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10],
          

In [157]:
# Check the results of the grid search

print(f"Best parameters: {gs4.best_params_}")
print(f"Best score: {gs4.best_score_}")

Best parameters: {'ccp_alpha': 0, 'max_depth': 11, 'min_samples_leaf': 2, 'min_samples_split': 10}
Best score: 0.8593840154508632


In [158]:
# Save model
model4 = gs4.best_estimator_

In [159]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model4.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model4.score(X_val, y_val)}")

Accuracy on train set: 0.8975819169441037
Accuracy on validate set: 0.8141695702671312


The model is overfitted, with validate accuracy about 8 percentage points below train accuracy (0.81 vs 0.90). The accuracy is lower than that of all the previous models.

In [160]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs4.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2014,432
Actual wnv,48,89


In [161]:
# Flatten the 2x2 confusion matrix into a 1-D array and save the TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

In [162]:
# Summary of metrics for decision tree model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.6496
Precision: 0.1708
F1: 0.2705
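
As a cross-check, the three metrics can be recomputed directly from the confusion-matrix counts shown above (TN=2014, FP=432, FN=48, TP=89). This is a minimal sketch; the helper name `summarise_confusion` is hypothetical and not part of the notebook.

```python
def summarise_confusion(tn, fp, fn, tp):
    """Return sensitivity, precision and F1 for the positive class (wnv)."""
    sens = tp / (tp + fn)   # share of actual wnv rows that were caught
    prec = tp / (tp + fp)   # share of predicted wnv rows that were correct
    f1 = 2 * prec * sens / (prec + sens)
    return sens, prec, f1

# Counts copied from the decision tree confusion matrix above
dt_sens, dt_prec, dt_f1 = summarise_confusion(tn=2014, fp=432, fn=48, tp=89)
print(f"Sensitivity: {dt_sens:.4f}, Precision: {dt_prec:.4f}, F1: {dt_f1:.4f}")
```

The same helper could be reused for each model instead of repeating the three formulas in every section.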


In [163]:
pred_proba = [i[1] for i in gs4.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.716418
1012,0,0.0
7561,0,0.176471
7887,0,0.86181
4422,0,0.0


In [164]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.8036776861970385
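
The ROC AUC can be read as the probability that a randomly chosen WNV-positive row receives a higher predicted probability than a randomly chosen negative row. The sketch below computes it pair by pair on toy data; the function name `pairwise_auc` and the toy values are assumptions for illustration, not taken from the validate set.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels and predicted probabilities
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_y, toy_scores)
print(toy_auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the toy AUC is 0.75. The pairwise loop is quadratic and only meant to show the definition; `roc_auc_score` remains the practical choice.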

### Random Forest + SMOTE oversampling & random undersampling

Note: Code for Random Forest commented out, as it takes significant time to run (more than 30 mins).

For the Random Forest Classifier, we explore the combined effect of oversampling and undersampling. First oversample the minority class with SMOTE to about a 1:10 ratio (minority:majority), then randomly undersample the majority class to achieve about a 1:2 ratio.
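
To make the two ratios concrete, the sketch below works out approximate class counts after each step. It assumes the float form of `sampling_strategy` means minority count divided by majority count; the function name and the starting counts are hypothetical.

```python
def resampled_counts(n_majority, n_minority, over_ratio=0.1, under_ratio=0.5):
    """Approximate class counts after oversampling, then undersampling.

    Both ratios are minority / majority. n_minority is shown only for
    context: oversampling replaces it with a new, larger count.
    """
    # Step 1: grow the minority class until minority / majority == over_ratio
    n_minority_over = int(over_ratio * n_majority)
    # Step 2: shrink the majority class until minority / majority == under_ratio
    n_majority_under = int(n_minority_over / under_ratio)
    return n_majority_under, n_minority_over

# Hypothetical starting counts, for illustration only
demo_majority, demo_minority = resampled_counts(n_majority=8000, n_minority=400)
print(demo_majority, demo_minority)
```

With these assumed counts the minority class grows from 400 to 800 rows (1:10), and the majority class is then cut from 8000 to 1600 rows (1:2).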

In [46]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Use imblearn's Pipeline here: sklearn's Pipeline rejects samplers such as SMOTE with
# "TypeError: All intermediate steps should be transformers and implement fit and transform"
from imblearn.pipeline import Pipeline as ImbPipeline

# define pipeline

rf_pipeline = ImbPipeline([
    ('over', SMOTE()),
    ('under', RandomUnderSampler()),
    ('rf', RandomForestClassifier())
])

In [None]:
# Define dictionary of hyperparameters.
"""pipeline_params = {
    'over__k_neighbors' : [1,2,3,4,5,6,7,8,9,10],
    'over__sampling_strategy' : [0.1],
    'under__sampling_strategy' : [0.5],
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [4, 6, 10, 12],
    'rf__random_state': [13]
}"""

In [None]:
# Instantiate our GridSearchCV object.
"""rf_gs = GridSearchCV(rf_pipeline, # What is the model we want to fit?
                                 pipeline_params, # What is the dictionary of hyperparameters?
                                 cv=5, # What number of folds in CV will we use?
                                 verbose=1,
                                 scoring='roc_auc')"""

In [None]:
# Fit the GridSearchCV object to the data.
"""rf_gs.fit(X_train, y_train)"""

In [None]:
# Save the best model
"""rf_gs_best = rf_gs.best_estimator_"""

In [None]:
# Checking the scores - best_estimator_.score() returns accuracy; scoring='roc_auc' only affects the CV scores and rf_gs.score()
"""print(f"Accuracy on train set: {rf_gs_best.score(X_train, y_train)}")
print(f"Accuracy on validate set: {rf_gs_best.score(X_val, y_val)}")"""
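
The difference between the two `score` calls can be checked on a small synthetic problem. This is a sketch under stated assumptions: the data comes from `make_classification`, not the WNV dataset, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       {"max_depth": [2, 3]}, cv=3, scoring="roc_auc")
demo_gs.fit(X_demo, y_demo)
demo_best = demo_gs.best_estimator_

# The refitted estimator uses the classifier's default score: mean accuracy
acc_from_estimator = demo_best.score(X_demo, y_demo)
# The search object applies the scorer passed in `scoring`
auc_from_search = demo_gs.score(X_demo, y_demo)

print(f"best_estimator_.score: {acc_from_estimator:.4f} (accuracy)")
print(f"GridSearchCV.score:    {auc_from_search:.4f} (ROC AUC)")
```

So printing `rf_gs_best.score(...)` under an "Accuracy" label is correct, while `rf_gs.best_score_` would be the cross-validated ROC AUC.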

In [None]:
"""prediction_rf = rf_gs_best.predict(X_val)"""

In [None]:
# Confusion matrix
"""cm = confusion_matrix(y_val, prediction_rf)
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = cm.ravel()"""

In [None]:
# Summary of metrics for random forest model
"""sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")"""

In [None]:
"""pred_proba = [i[1] for i in rf_gs_best.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()"""

In [None]:
# Calculate ROC AUC.
"""roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])"""

### Gradient Boosting Classifier

Note: Code for GradientBoostingClassifier commented out, takes significant time to run (more than 30 mins).

In [125]:
# Build upon the hyperparameters used in the Decision Tree model
# Keep the learning rate for Gradient Boosting in a low range; usually 0.1 to 0.2
"""gs5 = GridSearchCV(estimator = GradientBoostingClassifier(random_state=42),\
                   param_grid = {'learning_rate' : [0.1,0.2],\
                                 'n_estimators' : [100,125],
                                 'min_samples_split': [10,15],\
                                 'min_samples_leaf': [2,3],\
                                 'max_depth': [9,11],\
                                 'ccp_alpha': [0,0.1]},\
                   cv = 5,\
                   verbose = 2)"""
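
Before starting a long search, the number of fits can be worked out from the grid itself. The sketch below copies the grid from the cell above and multiplies the number of options per hyperparameter; the variable names are for illustration only.

```python
from math import prod

# Same grid as in the GridSearchCV call above
gb_param_grid = {
    'learning_rate': [0.1, 0.2],
    'n_estimators': [100, 125],
    'min_samples_split': [10, 15],
    'min_samples_leaf': [2, 3],
    'max_depth': [9, 11],
    'ccp_alpha': [0, 0.1],
}
cv_folds = 5

# Every combination is one candidate; each candidate is fitted once per fold
n_candidates = prod(len(values) for values in gb_param_grid.values())
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

At several seconds per fit the total quickly reaches tens of minutes when the fits run one at a time. `GridSearchCV` accepts `n_jobs=-1` to spread the fits over the available cores, which may shorten the run.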

In [126]:
# CAUTION: Takes a long time to run (about 45 mins; see the elapsed time in the log below)
# Fit GridSearch to the cleaned training data.
"""gs5.fit(X_sm,y_sm)"""

Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.1s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.0s remaining:    0.0s


[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.3s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.7s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.7s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.9s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_sampl

[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.0s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.3s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=  10.0s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=  10.7s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_sam

[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=  10.3s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=  10.6s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=   9.4s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=   9.1s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, 

[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.0s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.7s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.0s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.6s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_sampl

[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   8.5s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   8.1s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   8.6s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   8.6s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, 

[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   7.7s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   7.2s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   7.5s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   7.8s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_

[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   7.3s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   6.3s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   7.2s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   6.9s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0

[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   9.0s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   7.4s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   7.6s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   8.2s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_

[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.8s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.6s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.1s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.3s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0

[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   8.2s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   8.2s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   7.7s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.2s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0

[Parallel(n_jobs=1)]: Done 320 out of 320 | elapsed: 44.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_c...
                 

In [129]:
# Check the results of the grid search

"""print(f"Best parameters: {gs5.best_params_}")
print(f"Best score: {gs5.best_score_}")"""

Best parameters: {'ccp_alpha': 0, 'learning_rate': 0.2, 'max_depth': 11, 'min_samples_leaf': 2, 'min_samples_split': 15, 'n_estimators': 125}
Best score: 0.9461207576635291


In [130]:
# Save model
"""model5 = gs5.best_estimator_"""

In [131]:
# Score model on train set and validate set
"""print(f"Accuracy on train set: {model5.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model5.score(X_val, y_val)}")"""

Accuracy on train set: 0.9916768880322411
Accuracy on validate set: 0.9097948122338366


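The model is overfitted, with validate accuracy about 8 percentage points below train accuracy (0.91 vs 0.99). Validate accuracy is higher than the Decision Tree's (0.81), but the metrics below show sensitivity of only 0.2263 (vs 0.6496 for the Decision Tree), so most WNV-positive cases are missed.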

In [132]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
"""preds = gs5.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df"""

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2319,127
Actual wnv,106,31


In [133]:
# return nparray as a 1-D array.
"""confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()"""

In [134]:
# Summary of metrics for gradient boosting model
"""sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")"""

Sensitivity: 0.2263
Precision: 0.1962
F1: 0.2102
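
Because the classes are heavily imbalanced, accuracy alone can mislead. The sketch below uses only the counts from the two confusion matrices in this notebook to compare each model's accuracy with a trivial baseline that always predicts "no wnv"; the dictionary layout is an illustrative choice.

```python
# Validate-set counts copied from the confusion matrices above
confusion_counts = {
    "Decision Tree": {"tn": 2014, "fp": 432, "fn": 48, "tp": 89},
    "Gradient Boosting": {"tn": 2319, "fp": 127, "fn": 106, "tp": 31},
}

summary = {}
for name, c in confusion_counts.items():
    total = sum(c.values())
    summary[name] = {
        "accuracy": (c["tn"] + c["tp"]) / total,
        "sensitivity": c["tp"] / (c["tp"] + c["fn"]),
        # Accuracy of always predicting the majority class ("no wnv")
        "baseline": (c["tn"] + c["fp"]) / total,
    }
    print(name, {k: round(v, 4) for k, v in summary[name].items()})
```

The always-negative baseline scores about 0.947 on this validate set, above both models, which is why sensitivity and ROC AUC are the more informative metrics here.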


In [135]:
"""pred_proba = [i[1] for i in gs5.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()"""

Unnamed: 0,validate_values,pred_probs
5602,0,0.247488
1012,0,0.02939
7561,0,0.061821
7887,0,0.040529
4422,0,0.008983


In [136]:
# Calculate ROC AUC.
"""roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])"""

0.8466496768148205

### XGBoost Classifier

Note: Code for XGBoost Classifier commented out, takes significant time to run (more than 30 mins).

XGBoost classifier is not explicitly covered in class; we explore the capabilities of XGBoost in this section. XGBoost implements parallel processing and should run faster than GBM.
Update: the grid search took 40 mins, slightly faster than Gradient Boosting (about 44 mins), though the difference is likely due to the different params used.

In [139]:
"""from xgboost import XGBClassifier"""

In [140]:
"""gsX = GridSearchCV(estimator = XGBClassifier(random_state=42),\
                   param_grid = {'max_depth': [9,11],\
                                 'learning_rate' : [0.1],\
                                 'n_estimators' : [100,125],\
                                 'objective' : ['binary:logistic'],\
                                 'gamma': [0.5,1],\
                                 'min_child_weight': [1,5],\
                                 'subsample': [0.5,1.0],\
                                 'colsample_bytree': [0.5,1.0] },\
                   cv = 5,\
                   verbose = 2)"""

Notes on params:
- binary:logistic: logistic regression for binary classification; returns the predicted probability (not the class).
- gamma (default=0): the minimum loss reduction required to make a split. Larger values make the model more conservative.
- min_child_weight (default=1): the minimum sum of weights of all observations required in a child. Used to control overfitting; higher values prevent the model from learning relations that might be highly specific to the particular sample selected for a tree.
- subsample (default=1): the fraction of observations randomly sampled for each tree. Lower values make the algorithm more conservative and help prevent overfitting. Typical values: 0.5 to 1.
- colsample_bytree (default=1): the fraction of columns randomly sampled for each tree. Typical values: 0.5 to 1.
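
The `binary:logistic` note can be illustrated in plain Python: a raw score is passed through the logistic function to give a probability, and a class label only appears once a cut-off is applied. This is a sketch of the idea, not XGBoost's internal code; the function name, the raw scores, and the 0.5 cut-off are assumptions.

```python
import math

def margin_to_probability(margin):
    """Logistic (sigmoid) link: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical raw scores, for illustration only
for margin in (-2.0, 0.0, 2.0):
    prob = margin_to_probability(margin)
    label = int(prob > 0.5)  # assumed 0.5 cut-off to turn a probability into a class
    print(f"margin={margin:+.1f} -> probability={prob:.4f} -> class={label}")
```

For an imbalanced target such as WNV, the probabilities can also be compared against a lower cut-off to trade precision for sensitivity.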

In [141]:
# CAUTION: Takes a long time to run (about 40 mins)
# Fit GridSearch to the cleaned training data.
"""gsX.fit(X_sm,y_sm)"""


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   3.9s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.8s remaining:    0.0s


[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.0s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   3.9s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.1s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.2s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.3s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   4.8s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   6.2s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.9s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.9s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_r

[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   3.9s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.1s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.1s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.8s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.0s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   2.9s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   5.9s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.6s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.4s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_d

[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.7s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.5s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.5s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.8s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.5s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   8.6s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=  11.6s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   8.3s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   8.1s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_r

[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.0s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   8.0s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   8.0s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.2s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.5s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.3s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=  10.9s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   9.3s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   8.5s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_d

[Parallel(n_jobs=1)]: Done 320 out of 320 | elapsed: 33.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=42, reg_alpha=0, reg_lambda=1,
                                     scale_p...lent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'colsample_bytree': [0.5, 1.0], 'gamma': [0.5, 1],
                         'learning_rate': [0.1], 'max_depth': [9, 11],
                         'min_child_weight': [1, 5],

In [142]:
# Check the results of the grid search

"""print(f"Best parameters: {gsX.best_params_}")
print(f"Best score: {gsX.best_score_}")"""

Best parameters: {'colsample_bytree': 0.5, 'gamma': 0.5, 'learning_rate': 0.1, 'max_depth': 11, 'min_child_weight': 1, 'n_estimators': 125, 'objective': 'binary:logistic', 'subsample': 1.0}
Best score: 0.941651954026695


In [143]:
# Save model
"""modelX = gsX.best_estimator_"""

In [144]:
# Score model on train set and validate set
"""print(f"Accuracy on train set: {modelX.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {modelX.score(X_val, y_val)}")"""

Accuracy on train set: 0.9805502015069213
Accuracy on validate set: 0.9063104916763454


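The model overfits: validate accuracy (0.906) is about 7 percentage points below train accuracy (0.981). Its validate accuracy is lower than that of the LogReg, SVM, KNN and GradientBoost models, though higher than the Decision Tree and Random Forest models.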

In [145]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
"""preds = gsX.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df"""

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2308,138
Actual wnv,104,33


In [146]:
# Flatten the confusion matrix into a 1-D array and save the TN/FP/FN/TP values.
"""tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()"""

In [147]:
# Summary of metrics for XGBoost model
"""sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")"""

Sensitivity: 0.2409
Precision: 0.193
F1: 0.2143
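
Because the cells above are commented out, the reported metrics can be cross-checked directly from the confusion matrix counts shown earlier (TN=2308, FP=138, FN=104, TP=33). The following is a minimal sketch in plain Python. The variable names mirror the notebook, and `baseline_acc` (the accuracy of always predicting "no wnv") is an added illustration, not part of the original analysis.

```python
# Counts copied from the XGBoost confusion matrix on the validate set
tn, fp, fn, tp = 2308, 138, 104, 33

total = tn + fp + fn + tp
acc = (tp + tn) / total            # overall accuracy
sens = tp / (tp + fn)              # sensitivity (recall) for the wnv class
prec = tp / (tp + fp)              # precision for the wnv class
f1 = 2 * (prec * sens) / (prec + sens)

# Illustration only: accuracy of a model that always predicts "no wnv"
baseline_acc = (tn + fp) / total

print(f"Accuracy: {round(acc, 4)}")
print(f"Sensitivity: {round(sens, 4)}")
print(f"Precision: {round(prec, 4)}")
print(f"F1: {round(f1, 4)}")
print(f"Always-negative baseline accuracy: {round(baseline_acc, 4)}")
```

The always-negative baseline comes out near 0.947, above this model's 0.906 accuracy, which suggests that sensitivity, precision, F1 and ROC AUC are more informative than accuracy alone on this imbalanced validate set.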


In [148]:
# Predicted probability of the positive class (wnv) for each validate row
"""pred_proba = gsX.predict_proba(X_val)[:, 1]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()"""

Unnamed: 0,validate_values,pred_probs
5602,0,0.162352
1012,0,0.178396
7561,0,0.093803
7887,0,0.051679
4422,0,0.014565


In [149]:
# Calculate ROC AUC.
"""roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])"""

0.8480671556720042
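
ROC AUC can be read as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one. The sketch below computes that pairwise quantity in plain Python on a small made-up sample. The labels and probabilities are hypothetical, not taken from the validate set, and `pairwise_auc` is an illustrative helper rather than a library function. On the same inputs it should agree with `roc_auc_score`.

```python
def pairwise_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties in predicted probability count as half a correct pair.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data for illustration only
toy_labels = [0, 0, 1, 0, 1, 0]
toy_probs = [0.05, 0.16, 0.40, 0.45, 0.70, 0.09]
print(pairwise_auc(toy_labels, toy_probs))
```

Read this way, the validate-set value of about 0.848 means the model ranks a wnv case above a non-wnv case in roughly 85% of such pairs, even though its sensitivity at the default 0.5 threshold is low.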

### Summary of Model Metrics 

In [8]:
# Summary of model scores in a DataFrame
# F1 is undefined for LogReg (sensitivity and precision are both 0), so use np.nan
summary_df = pd.DataFrame({'accuracy(val)' : [0.945,  0.918, 0.934, 0.814, 0.852, 0.909, 0.906],
                           'sensitivity' :   [0,      0.226, 0.066, 0.650, 0.524, 0.226, 0.241],
                           'precision' :     [0,      0.226, 0.173, 0.171, 0.200, 0.196, 0.193],
                           'F1' :            [np.nan, 0.226, 0.095, 0.271, 0.290, 0.210, 0.214],
                           'roc_auc' :       [0.791,  0.795, 0.721, 0.804, 0.698, 0.847, 0.848]})
# Transpose dataframe
summary_dft = summary_df.T
# Rename columns
summary_dft.columns = ['LogReg','SVM', 'KNN', 'DT', 'RF(Smote O&U)', 'GBc', 'XGBc']
summary_dft

Unnamed: 0,LogReg,SVM,KNN,DT,RF(Smote O&U),GBc,XGBc
accuracy(val),0.945,0.918,0.934,0.814,0.852,0.909,0.906
sensitivity,0.0,0.226,0.066,0.65,0.524,0.226,0.241
precision,0.0,0.226,0.173,0.171,0.2,0.196,0.193
F1,,0.226,0.095,0.271,0.29,0.21,0.214
roc_auc,0.791,0.795,0.721,0.804,0.698,0.847,0.848


We pick XXX as the best model, based on F1 score and roc_auc. Next, generate the predicted probabilities on the test set for the Kaggle submission.
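
One way to make the comparison explicit is to rank the models in the summary table on the two chosen metrics. The sketch below copies the F1 and roc_auc values from the table into plain Python dictionaries. The `rank_by` helper is hypothetical and only sorts the figures; it does not settle the final choice.

```python
import math

# Values copied from the summary table above (F1 is undefined for LogReg)
f1_scores = {'LogReg': math.nan, 'SVM': 0.226, 'KNN': 0.095, 'DT': 0.271,
             'RF(Smote O&U)': 0.290, 'GBc': 0.210, 'XGBc': 0.214}
roc_auc = {'LogReg': 0.791, 'SVM': 0.795, 'KNN': 0.721, 'DT': 0.804,
           'RF(Smote O&U)': 0.698, 'GBc': 0.847, 'XGBc': 0.848}

def rank_by(scores):
    """Return model names sorted from best to worst, skipping NaN scores."""
    valid = {name: s for name, s in scores.items() if not math.isnan(s)}
    return sorted(valid, key=valid.get, reverse=True)

print("By F1:     ", rank_by(f1_scores))
print("By roc_auc:", rank_by(roc_auc))
```

The two rankings disagree: the top model by F1 has the lowest roc_auc, so the selection involves a trade-off between the two metrics.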

In [39]:
# instantiate the best model with the best hyperparams
best_model = model1

### Model Evaluation

In [4]:
# making a copy of the train_kaggle df
X_train_kaggle = df_train.copy().drop(['num_mos','wnv', 'date'],axis=1)
y_train_kaggle = df_train.copy()['wnv']

X_test_kaggle = df_test.copy().drop(['id','date'],axis=1)

In [5]:
#checking shape
X_train_kaggle.shape

(8610, 154)

In [6]:
#checking shape
X_test_kaggle.shape

(116293, 154)

In [36]:
# Scale variables (fit the scaler on train only, then apply it to test)
# Pending confirmation of best model selection
ss = StandardScaler()
X_train_kaggle_ss = ss.fit_transform(X_train_kaggle)
X_test_kaggle_ss = ss.transform(X_test_kaggle)

In [37]:
# Resampling using SMOTE to oversample the minority class
# (fit_resample replaces the deprecated fit_sample in imbalanced-learn)
X_train_kaggle_sm, y_train_kaggle_sm = smote.fit_resample(X_train_kaggle_ss, y_train_kaggle)

In [40]:
# Fit the best model on the scaled, resampled training data
best_model.fit(X_train_kaggle_sm, y_train_kaggle_sm)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('ss',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lor',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=42,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [41]:
# Score model on the resampled train set
print(f"Accuracy on train set: {best_model.score(X_train_kaggle_sm,y_train_kaggle_sm)}")

Accuracy on train set: 0.8027106586532564


In [None]:
# predicting kaggle output
predict_kaggle = best_model.predict(X_test_kaggle_ss)

In [None]:
# probability prediction
predict_proba_kaggle = best_model.predict_proba(X_test_kaggle_ss)

In [None]:
#Saving an output CSV file for submission
# output = pd.DataFrame({'Id': df_test['id'], 'WnvPresent': predict_proba_kaggle[:,1]})
# output.to_csv('my_submission_DSI.csv', index=False)