### Machine Learning

In [1]:
# Import libraries for pre-processing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

### Import files

In [2]:
# Import the files
df_test = pd.read_csv("https://raw.githubusercontent.com/AngShengJun/dsi14P4/master/assets/working/df_test_weather_cleaned.csv")
df_train = pd.read_csv("https://raw.githubusercontent.com/AngShengJun/dsi14P4/master/assets/working/df_train_weather_cleaned.csv")

In [3]:
df_test.head(1)

Unnamed: 0.1,Unnamed: 0,id,date,lat,long,wk,yr,tavg,stnpress,dewpt,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,0,1,2008-06-11,41.95469,-87.800991,23,2008,75.0,29.31,55.5,...,0,0,0,0,0,0,0,0,0,0


In [4]:
df_train.head(1)

Unnamed: 0.1,Unnamed: 0,date,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,0,2007-05-29,41.95469,-87.800991,0,1,22,2007,75.5,29.415,...,0,0,0,0,0,0,0,0,0,0


### Prep

In [5]:
# Make a copy of test
df_testcopy = df_test.copy()

In [6]:
# Drop unnecessary col
df_testcopy.drop(['id','Unnamed: 0'],axis=1,inplace=True)
# Review
df_testcopy.head(1)

Unnamed: 0,date,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,2008-06-11,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Make a copy of train
df_traincopy = df_train.copy()

In [8]:
# Drop unnecessary col
df_traincopy.drop(['Unnamed: 0'],axis=1,inplace=True)
# Review
df_traincopy.head(1)

Unnamed: 0,date,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,dewpt,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,2007-05-29,41.95469,-87.800991,0,1,22,2007,75.5,29.415,58.5,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Sanity check
print(df_traincopy.shape)
print(df_testcopy.shape)

(8610, 156)
(116293, 154)


In [10]:
# Drop date column from test and train
df_traincopy.drop(['date'],axis=1,inplace=True)
df_testcopy.drop(['date'],axis=1,inplace=True)

In [11]:
# trc - TrainComplete set
X_trc = df_traincopy.drop(['num_mos','wnv'],axis=1)
y_trc = df_traincopy['wnv']

In [12]:
df_traincopy.head(1)

Unnamed: 0,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,dewpt,precip,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,0,1,22,2007,75.5,29.415,58.5,0.0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
df_testcopy.head(1)

Unnamed: 0,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,daylight,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0


### Train-Validate-Split

In [14]:
# Train-validate-split
X_train,X_val,y_train,y_val = train_test_split(X_trc,y_trc,test_size=0.3,random_state=42, stratify=y_trc)

In [15]:
print(y_train.value_counts(normalize=True))

0    0.946906
1    0.053094
Name: wnv, dtype: float64


The positive class is wnv (class 1) and the negative class is no wnv (class 0). The classes are highly imbalanced: only about 5.3% of the training observations are positive. With a distribution this skewed, many classification algorithms predict the infrequent class poorly, and the infrequent class here is the positive class we are interested in.

We will use SMOTE (Synthetic Minority Oversampling TEchnique) to mitigate class imbalance. SMOTE synthesizes new minority-class observations from existing ones. A minority-class observation is picked at random and its k nearest minority-class neighbors are computed. Synthetic points are then placed on the line segments between this observation and its neighbors.
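The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration only, not the imbalanced-learn implementation; the function name `smote_sketch`, its parameters and the toy points are assumptions made up for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority observations.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                  # random minority observation
        j = neighbours[i, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Toy minority class: the four corners of the unit square (hypothetical data).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)
```

Every synthetic row lies on a segment between two existing minority points, so no new point falls outside the region already covered by the minority class.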

### Resampling

In [17]:
from imblearn.over_sampling import SMOTE

In [18]:

smote = SMOTE(sampling_strategy='minority', random_state=42)
# fit_resample replaces the deprecated fit_sample in newer imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(X_train, y_train)

In [19]:
# Review class balance after SMOTE application
print(y_sm.value_counts(normalize=True))

1    0.5
0    0.5
Name: wnv, dtype: float64


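After resampling, the classes are balanced, so the baseline accuracy on the resampled training set is 0.5. A model needs to perform better than this. The validate set is not resampled, so its baseline (always predicting no wnv) stays at about 0.947.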
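To make the baseline concrete, the sketch below computes the accuracy of always predicting the more frequent class. The helper name `majority_baseline` and the counts are illustrative assumptions, chosen only to mirror the class proportions printed earlier.

```python
def majority_baseline(n_negative, n_positive):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(n_negative, n_positive) / (n_negative + n_positive)

# Balanced classes, as after SMOTE resampling of the training set.
balanced = majority_baseline(500, 500)

# Hypothetical unresampled split with roughly the 94.7% / 5.3% class mix printed above.
imbalanced = majority_baseline(947, 53)

print(f"Balanced baseline: {balanced}")
print(f"Imbalanced baseline: {imbalanced}")
```

The comparison matters because a score that looks strong against 0.5 can still be no better than the majority-class guess on data that keeps the original class mix.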

### Classification Models

Model workflow:
- pipeline for standard scaler (transformer) and classifier model (estimator), where relevant.
- gridsearch for best model parameters.
- metrics for evaluation: F1 score and roc_auc since we are dealing with imbalanced class distribution.
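A minimal sketch of this workflow on synthetic data, assuming one wants the grid search itself to be scored on F1 and ROC AUC rather than on the default accuracy. The data from `make_classification`, the grid values and the `demo_` names are assumptions for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 90% / 10%); not the notebook's dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.9, 0.1], random_state=42)

# Transformer followed by estimator, as in the workflow above.
demo_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lor', LogisticRegression(solver='lbfgs', random_state=42)),
])

demo_gs = GridSearchCV(
    demo_pipe,
    param_grid={'lor__C': [0.1, 1.0, 10.0]},
    scoring=['f1', 'roc_auc'],   # evaluate both metrics for every candidate
    refit='roc_auc',             # select and refit the best candidate by ROC AUC
    cv=5,
)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)
print(demo_gs.cv_results_['mean_test_f1'])
print(demo_gs.cv_results_['mean_test_roc_auc'])
```

With several scorers, `refit` has to name the metric used to choose `best_estimator_`; ROC AUC is used here only as one plausible choice.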

### Logistic Regression Model

In [20]:
pipe1 = Pipeline([
    ('ss', StandardScaler()),
    ('lor', LogisticRegression(solver='lbfgs',random_state=42)),
])

In [21]:
pipe1.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('lor',
   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                      intercept_scaling=1, l1_ratio=None, max_iter=100,
                      multi_class='auto', n_jobs=None, penalty='l2',
                      random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                      warm_start=False))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'lor': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'lor__C': 1.0,
 'lor__class_weight': None,
 'lor__dual': False,
 '

In [22]:
# Define the pipe parameters
pipe1_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'lor__max_iter': [100,200,300]}

In [23]:
# Instantiate GridSearchCV (default scoring for a classifier is accuracy)
gs1 = GridSearchCV(pipe1,
                   param_grid=pipe1_params,
                   cv=10)
# Fit GridSearch to the SMOTE-resampled training data.
gs1.fit(X_sm, y_sm)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('lor',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
                

In [24]:
# Check the results of the grid search
# Google Colab warns about the max_iter limit, while Jupyter runs fine with the provided params
print(f"Best parameters: {gs1.best_params_}")
print(f"Best score: {gs1.best_score_}")

Best parameters: {'lor__max_iter': 100, 'ss__with_mean': True, 'ss__with_std': True}
Best score: 0.9675104487875108


In [25]:
# Save model
model1 = gs1.best_estimator_

In [26]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model1.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model1.score(X_val, y_val)}")

Accuracy on train set: 0.9682845628175925
Accuracy on validate set: 0.9450251645373596


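The cross-validated accuracy on the resampled training data (about 0.968) is well above the 0.5 baseline. On the validate set, however, the accuracy of about 0.945 is slightly below the roughly 0.947 obtained by always predicting the majority class (no wnv), so accuracy alone does not show that the model helps with classification. Validate accuracy is also about 2 percentage points lower than train accuracy, which suggests some overfitting.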

In [27]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs1.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2441,5
Actual wnv,137,0


In [28]:
# Flatten the 2x2 confusion matrix to 1-D and save the TN/FP/FN/TP values.
tn, fp, fn, tp = cm.ravel()

In [29]:
# Summary of metrics for log reg model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.0
Precision: 0.0
F1: nan
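F1 is nan here because the confusion matrix has no true positives: sensitivity and precision are both 0, so the F1 formula divides 0 by 0. One possible guard is sketched below. The helper name `safe_metrics` is hypothetical, and returning 0.0 for an undefined ratio is a convention assumed for this example.

```python
def safe_metrics(tn, fp, fn, tp):
    """Return (sensitivity, precision, f1), using 0.0 whenever a denominator is zero."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * prec * sens / (prec + sens) if (prec + sens) else 0.0
    return sens, prec, f1

# Counts from the logistic regression confusion matrix above.
sens, prec, f1 = safe_metrics(tn=2441, fp=5, fn=137, tp=0)
print(f"Sensitivity: {round(sens, 4)}")
print(f"Precision: {round(prec, 4)}")
print(f"F1: {round(f1, 4)}")
```

Reporting 0.0 instead of nan keeps the metric comparable across models, at the cost of hiding the fact that the ratio was undefined.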




In [30]:
# Probability of the positive class (column 1 of predict_proba)
pred_proba = gs1.predict_proba(X_val)[:, 1]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.191102
1012,0,0.000979
7561,0,0.03421
7887,0,0.001251
4422,0,0.056057


In [31]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.7899445541954391
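An ROC AUC of about 0.79 can coexist with zero sensitivity because ROC AUC is threshold-free: it measures how often a positive example receives a higher predicted probability than a negative one, regardless of the 0.5 cut-off used by `predict`. The sketch below spells out that pairwise interpretation; the function name and the toy labels and scores are assumptions for illustration.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, made up for illustration.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc_pairwise(y_toy, s_toy)
print(auc_toy)
```

In the toy data three of the four positive/negative pairs are ordered correctly, even though only one positive scores above 0.5.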

### SVM

For SVM, fitting the model takes far too long (more than 3 hours) for grid search. Given that better-performing and more efficient models were available, the team discussed and ultimately decided to drop SVM. The code block is commented out but preserved for learning purposes.

In [32]:
"""pipe2 = Pipeline([
    ('ss', StandardScaler()),
    ('svm', SVC(probability=True, random_state=42))
])"""

In [33]:
"""pipe2.get_params()"""

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('svm',
   SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
       max_iter=-1, probability=True, random_state=42, shrinking=True, tol=0.001,
       verbose=False))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'svm': SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
     max_iter=-1, probability=True, random_state=42, shrinking=True, tol=0.001,
     verbose=False),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'svm__C': 1.0,
 'svm__break_ties': False,
 'svm__cache_size': 200,
 'svm__class_weight': None,
 'svm__coef0': 0.0,
 'svm__decision_function_shape': 'ovr',
 'svm__degree': 3,
 'svm__gamma': 'scale',
 'svm__kernel': 'rbf',
 's

In [34]:
# Define the pipe parameters
"""pipe2_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'svm__C': [1,10],
                'svm__gamma': ['scale','auto'],
                'svm__kernel': ['rbf','linear','poly']
               }"""

In [35]:
# Initiate Gridsearch
"""gs2 = GridSearchCV(pipe2,
                  param_grid=pipe2_params,
                  cv=10,
                  verbose=2)

# Fit gs2
gs2.fit(X_sm, y_sm)"""

KeyboardInterrupt: 
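To get a feel for the size of this search, the sketch below counts the fits implied by the grid in `pipe2_params` and `cv=10`. The names `svm_grid` and `cv_folds` are introduced only for this example. According to the scikit-learn documentation, `probability=True` additionally runs an internal 5-fold cross-validation inside each `SVC` fit, which is not included in the count.

```python
from itertools import product

# Same grid values as pipe2_params above.
svm_grid = {
    'ss__with_mean': [True],
    'ss__with_std': [True],
    'svm__C': [1, 10],
    'svm__gamma': ['scale', 'auto'],
    'svm__kernel': ['rbf', 'linear', 'poly'],
}
cv_folds = 10

# Every combination of grid values is one candidate; each candidate is fitted once per fold.
n_candidates = len(list(product(*svm_grid.values())))
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Dropping `probability=True`, reducing `cv`, or searching fewer kernels would each shrink the run time, at the cost of a coarser search.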

In [None]:
# Check the results of the grid search

"""print(f"Best parameters: {gs2.best_params_}")
print(f"Best score: {gs2.best_score_}")"""

In [None]:
# Save model
"""model2 = gs2.best_estimator_"""

In [None]:
# Score model on train set and validate set
"""print(f"Accuracy on train set: {model2.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model2.score(X_val, y_val)}")"""

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
"""preds = gs2.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df"""

In [None]:
# return nparray as a 1-D array.
"""confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()"""

In [None]:
# Summary of metrics for SVM model
"""sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")"""

In [None]:
"""pred_proba = [i[1] for i in gs2.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()"""

In [None]:
# Calculate ROC AUC.
"""roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])"""

### KNN Classifier

In [None]:
pipe3 = Pipeline([
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [37]:
pipe3.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('knn',
   KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                        metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                        weights='uniform'))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'knn': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                      weights='uniform'),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform'}

In [38]:
# Define the pipe parameters
pipe3_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'knn__n_neighbors' : [3,5,7],
                'knn__metric': ['euclidean','manhattan']}

In [39]:
# Instantiate GridSearchCV (default scoring for a classifier is accuracy)
gs3 = GridSearchCV(pipe3,
                   param_grid=pipe3_params,
                   cv=10)
# Fit GridSearch to the SMOTE-resampled training data.
gs3.fit(X_sm, y_sm)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('knn',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                                             n_neighbors=5, p=2,
                                                             weights='uniform'))],
                                verbose=False),
             i

In [40]:
# Check the results of the grid search

print(f"Best parameters: {gs3.best_params_}")
print(f"Best score: {gs3.best_score_}")

Best parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 3, 'ss__with_mean': True, 'ss__with_std': True}
Best score: 0.9648784901559605


In [41]:
# Save model
model3 = gs3.best_estimator_

In [42]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model3.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model3.score(X_val, y_val)}")

Accuracy on train set: 0.9757315577361135
Accuracy on validate set: 0.9330236159504453


The model appears overfitted: validate accuracy is about 4 percentage points lower than train accuracy.

In [43]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs3.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2398,48
Actual wnv,125,12


In [44]:
# Flatten the 2x2 confusion matrix to 1-D and save the TN/FP/FN/TP values.
tn, fp, fn, tp = cm.ravel()

In [45]:
# Summary of metrics for KNN model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.0876
Precision: 0.2
F1: 0.1218


In [46]:
# Probability of the positive class (column 1 of predict_proba)
pred_proba = gs3.predict_proba(X_val)[:, 1]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.333333
1012,0,0.0
7561,0,0.0
7887,0,0.0
4422,0,0.0


In [47]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.6720491074359449

### Decision Tree

In [49]:
# Grid of 3 x 3 x 3 x 6 = 162 candidates, each fitted on 5 folds
gs4 = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                   param_grid={'max_depth': [7, 9, 11],
                               'min_samples_split': [10, 15, 20],
                               'min_samples_leaf': [2, 3, 4],
                               'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10]},
                   cv=5,
                   verbose=2)

In [50]:
# Fit GridSearch to the cleaned training data.
gs4.fit(X_sm,y_sm)

Fitting 5 folds for each of 162 candidates, totalling 810 fits
[CV] ccp_alpha=0, max_depth=7, min_samples_leaf=2, min_samples_split=10 
[CV]  ccp_alpha=0, max_depth=7, min_samples_leaf=2, min_samples_split=10, total=   0.1s
[CV] ccp_alpha=0, max_depth=7, min_samples_leaf=2, min_samples_split=10 
[CV]  ccp_alpha=0, max_depth=7, min_samples_leaf=2, min_samples_split=10, total=   0.1s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[CV]  ccp_alpha=10, max_depth=11, min_samples_leaf=2, min_samples_split=10, total=   0.1s
[CV] ccp_alpha=10, max_depth=11, min_samples_leaf=2, min_samples_split=15 
[CV]  ccp_alpha=10, max_depth=11, min_samples_leaf=2, min_samples_split=15, total=   0.1s
[CV] ccp_alpha=10, max_depth=11, min_samples_leaf=2, min_samples_split=15 
[CV]  ccp_alpha=10, max_depth=11, min_samples_leaf=2, min_samples_split=15, total=   0.1s
[CV] ccp_alpha=10, max_depth=11, min_samples_leaf=2, min_samples_split=15 
[CV]  ccp_

[Parallel(n_jobs=1)]: Done 810 out of 810 | elapsed:  1.3min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10],
            

In [51]:
# Check the results of the grid search

print(f"Best parameters: {gs4.best_params_}")
print(f"Best score: {gs4.best_score_}")

Best parameters: {'ccp_alpha': 0, 'max_depth': 11, 'min_samples_leaf': 2, 'min_samples_split': 10}
Best score: 0.8652537541705009


In [52]:
# Save model
model4 = gs4.best_estimator_

In [53]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model4.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model4.score(X_val, y_val)}")

Accuracy on train set: 0.8993341510425793
Accuracy on validate set: 0.7986837011227255


The model is overfitted: validate accuracy is about 10 percentage points lower than train accuracy. Its accuracy is also lower than that of all previous models.

In [54]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs4.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,1971,475
Actual wnv,45,92


In [55]:
# confusion_matrix returns a 2x2 array; ravel() flattens it to 1-D
# in the order TN, FP, FN, TP, which we unpack and save.
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

In [56]:
# Summary of metrics for decision tree model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.6715
Precision: 0.1623
F1: 0.2614
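
As a cross-check, the metrics above and the validate accuracy reported earlier can all be recovered from the four confusion-matrix counts alone. The sketch below is a minimal illustration that hard-codes the counts from the decision tree confusion matrix (TN=1971, FP=475, FN=45, TP=92); the helper name `metrics_from_counts` is hypothetical and not part of the notebook. It also computes specificity, which shows that the accuracy figure is driven mostly by the majority (no wnv) class.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),           # recall on the wnv class
        "specificity": tn / (tn + fp),           # recall on the no-wnv class
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),       # equivalent to 2*P*R/(P+R)
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

# Counts copied from the decision tree confusion matrix above,
# in the order produced by confusion_matrix(...).ravel(): tn, fp, fn, tp.
dt_metrics = metrics_from_counts(tn=1971, fp=475, fn=45, tp=92)
for name, value in dt_metrics.items():
    print(f"{name}: {round(value, 4)}")
```

The accuracy computed this way agrees with the validate accuracy printed by `model4.score(X_val, y_val)`.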


In [57]:
pred_proba = [i[1] for i in gs4.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.0
1012,0,0.0
7561,0,0.347826
7887,0,0.0
4422,0,0.0


In [58]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.7740299968367841
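
ROC AUC can be read as the probability that a randomly chosen positive (wnv) observation receives a higher predicted probability than a randomly chosen negative one, with ties counted as half. The following is a minimal pure-Python sketch of that pairwise definition; the function name `pairwise_auc` and the toy labels and scores are made up for illustration and are not taken from the validate set. Tie handling matters for a decision tree, because many observations share the same leaf probability (for example 0.0).

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 4 negatives, 2 positives.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.0, 0.0, 0.35, 0.6, 0.35, 0.9]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

This quadratic loop is only meant to show what the number means; `roc_auc_score` should be preferred on real data.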

### Random Forest + SMOTE oversampling & random undersampling

For the Random Forest Classifier, we explore the combined effect of SMOTE oversampling and random undersampling. First, oversample the minority class with SMOTE to about a 1:10 minority-to-majority ratio; then randomly undersample the majority class to reach about a 1:2 ratio.
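
To make the two ratios concrete, the sketch below works through the class-count arithmetic for a float `sampling_strategy`, which imbalanced-learn interprets as the desired minority-to-majority ratio after resampling. This is an illustration under stated assumptions: the starting class counts are hypothetical (not the actual train set counts), the helper `resampled_counts` is not part of the notebook, and the rounding with `int()` reflects my understanding of the library's behaviour rather than a guarantee.

```python
def resampled_counts(n_majority, n_minority, over_ratio=0.1, under_ratio=0.5):
    """Return (majority, minority) counts after oversampling then undersampling."""
    if n_minority / n_majority >= over_ratio:
        # Sketch-level guard: oversampling cannot lower the minority share.
        raise ValueError("minority class is already at or above over_ratio")
    # Step 1 (SMOTE): grow the minority until minority / majority ~= over_ratio.
    n_minority_over = int(over_ratio * n_majority)
    # Step 2 (RandomUnderSampler): shrink the majority until
    # minority / majority ~= under_ratio.
    n_majority_under = int(n_minority_over / under_ratio)
    return n_majority_under, n_minority_over

# Hypothetical starting point: 10,000 majority rows and 500 minority rows (1:20).
maj_after, min_after = resampled_counts(n_majority=10_000, n_minority=500)
print(f"majority: {maj_after}, minority: {min_after}, ratio: {min_after / maj_after}")
```

With these made-up counts the minority class doubles to 1,000 synthetic-plus-real rows and the majority class is cut to 2,000, so the model trains on far fewer majority rows than the original data contained.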

In [97]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# imblearn's Pipeline supports sampler steps; it shadows the sklearn Pipeline imported earlier
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold

# define pipeline

rf_pipeline = Pipeline([
    ('over', SMOTE(random_state=42)),
    ('under', RandomUnderSampler(random_state=42)),
    ('rf', RandomForestClassifier())
])

In [98]:
# Define dictionary of hyperparameters.
pipeline_params = {
    'over__k_neighbors' : [1,2,3,4,5,6,7,8,9,10],
    'over__sampling_strategy' : [0.1],
    'under__sampling_strategy' : [0.5],
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [4, 6, 10, 12],
    'rf__random_state': [42]
}

In [99]:
# Instantiate our GridSearchCV object.
rf_gs = GridSearchCV(rf_pipeline,      # the pipeline we want to fit
                     pipeline_params,  # the dictionary of hyperparameters
                     cv=5,             # number of folds used in CV
                     verbose=1,
                     scoring='roc_auc')

In [100]:
# Fit the GridSearchCV object to the data.
rf_gs.fit(X_train, y_train)

Fitting 5 folds for each of 120 candidates, totalling 600 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:  3.3min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('over',
                                        SMOTE(k_neighbors=5, n_jobs=None,
                                              random_state=42,
                                              sampling_strategy='auto')),
                                       ('under',
                                        RandomUnderSampler(random_state=42,
                                                           replacement=False,
                                                           sampling_strategy='auto')),
                                       ('rf',
                                        RandomForestClassifier(bootstrap=True,
                                                               ccp_alpha=0.0,
                                                               class_weight=None,
                                                               criterion='gini',
    

In [101]:
# Save the best model
rf_gs_best = rf_gs.best_estimator_

In [102]:
# Check the scores. Note: best_estimator_.score() returns accuracy;
# the scoring='roc_auc' setting only applies to rf_gs.score() and the CV results.
print(f"Accuracy on train set: {rf_gs_best.score(X_train, y_train)}")
print(f"Accuracy on validate set: {rf_gs_best.score(X_val, y_val)}")

Accuracy on train set: 0.8614567778330845
Accuracy on validate set: 0.8559814169570267
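
The `scoring` argument of `GridSearchCV` controls the metric used during cross-validation and by the search object's own `.score()` method, while calling `.score()` on `best_estimator_` falls back to the classifier default, which is accuracy. The sketch below demonstrates the difference on a small synthetic dataset; everything suffixed with `_demo` is hypothetical and unrelated to the notebook's data, and the tiny grid is chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Small imbalanced synthetic dataset (about 10% positives).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.9, 0.1], random_state=42)

gs_demo = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=42),
                       {'max_depth': [2, 4]},
                       cv=3,
                       scoring='roc_auc')
gs_demo.fit(X_demo, y_demo)

best_demo = gs_demo.best_estimator_
acc_demo = best_demo.score(X_demo, y_demo)  # estimator default: accuracy
auc_demo = gs_demo.score(X_demo, y_demo)    # uses scoring='roc_auc'

print(f"best_estimator_.score (accuracy): {acc_demo}")
print(f"GridSearchCV.score (roc_auc):     {auc_demo}")
```

To report ROC AUC directly, either call `.score()` on the fitted search object or pass predicted probabilities to `roc_auc_score`.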


In [103]:
prediction_rf = rf_gs_best.predict(X_val)

In [104]:
# Confusion matrix
cm = confusion_matrix(y_val, prediction_rf)
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,prediction_rf).ravel()

In [105]:
# Summary of metrics for random forest model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.6569
Precision: 0.2169
F1: 0.3261


In [106]:
pred_proba = [i[1] for i in rf_gs_best.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.501987
1012,0,0.285213
7561,0,0.344337
7887,0,0.489848
4422,0,0.235456


In [107]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.861510823570137

### Gradient Boosting Classifier

In [59]:
# Build upon the hyperparameters used in the Decision Tree model.
# Keep the GB learning rate in a low range; typically 0.1 to 0.2.
gs5 = GridSearchCV(estimator=GradientBoostingClassifier(random_state=42),
                   param_grid={'learning_rate': [0.1, 0.2],
                               'n_estimators': [100, 125],
                               'min_samples_split': [10, 15],
                               'min_samples_leaf': [2, 3],
                               'max_depth': [9, 11],
                               'ccp_alpha': [0, 0.1]},
                   cv=5,
                   verbose=2)
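
Because every combination in the grid is fitted once per fold, the cost of the search grows multiplicatively with each added hyperparameter value. The sketch below counts the candidates and fits implied by the grid above; the variable names (`gb_param_grid`, `gb_cv_folds`) are introduced here for illustration only.

```python
from math import prod

# Same grid as passed to gs5 above.
gb_param_grid = {'learning_rate': [0.1, 0.2],
                 'n_estimators': [100, 125],
                 'min_samples_split': [10, 15],
                 'min_samples_leaf': [2, 3],
                 'max_depth': [9, 11],
                 'ccp_alpha': [0, 0.1]}
gb_cv_folds = 5

# One candidate per combination of values; one fit per candidate per fold.
n_candidates = prod(len(values) for values in gb_param_grid.values())
n_fits = n_candidates * gb_cv_folds

print(f"{n_candidates} candidates x {gb_cv_folds} folds = {n_fits} fits")
```

Adding a third value to any one of the six lists would raise the total from 320 to 480 fits, which is worth keeping in mind given how long each gradient boosting fit takes.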

In [60]:
# CAUTION: Takes a long time to run (about 45 mins)
# Fit GridSearch to the cleaned training data.
gs5.fit(X_sm,y_sm)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.4s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.3s remaining:    0.0s


[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.2s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.2s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.1s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.2s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_sampl

[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.3s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.5s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.0s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   9.3s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_sam

[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=   8.9s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=   9.0s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=   9.1s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=100, total=   8.9s
[CV] ccp_alpha=0, learning_rate=0.1, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.1, max_depth=11, 

[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.7s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.7s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.6s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=125, total=   9.6s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=9, min_samples_leaf=3, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=9, min_sampl

[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   9.3s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   9.2s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   8.9s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=100, total=   8.7s
[CV] ccp_alpha=0, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0, learning_rate=0.2, max_depth=11, 

[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   8.9s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   8.7s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   9.4s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=125, total=   8.5s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=9, min_samples_leaf=3, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_

[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   8.0s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   7.4s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   8.3s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=100, total=   7.6s
[CV] ccp_alpha=0.1, learning_rate=0.1, max_depth=11, min_samples_leaf=2, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0

[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   9.4s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   8.6s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   9.5s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=10, n_estimators=125, total=   8.6s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=9, min_samples_leaf=2, min_samples_split=15, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_

[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.9s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   6.9s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.1s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=100, total=   7.9s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=2, min_samples_split=10, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0

[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   8.3s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   8.5s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   8.7s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125, total=   8.6s
[CV] ccp_alpha=0.1, learning_rate=0.2, max_depth=11, min_samples_leaf=3, min_samples_split=15, n_estimators=125 
[CV]  ccp_alpha=0.1, learning_rate=0

[Parallel(n_jobs=1)]: Done 320 out of 320 | elapsed: 45.8min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_c...
                 

In [61]:
# Check the results of the grid search

print(f"Best parameters: {gs5.best_params_}")
print(f"Best score: {gs5.best_score_}")

Best parameters: {'ccp_alpha': 0, 'learning_rate': 0.2, 'max_depth': 11, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 125}
Best score: 0.9480478543730803


In [62]:
# Save model
model5 = gs5.best_estimator_

In [63]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model5.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model5.score(X_val, y_val)}")

Accuracy on train set: 0.9918521114420886
Accuracy on validate set: 0.9097948122338366


The model is overfitted: validate accuracy (0.910) is about 8 percentage points lower than train accuracy (0.992). Validate accuracy is, however, higher than that of the Decision Tree (0.799) and Random Forest (0.856) models above.

In [64]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs5.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2319,127
Actual wnv,106,31


In [65]:
# confusion_matrix returns a 2x2 array; ravel() flattens it to 1-D
# in the order TN, FP, FN, TP, which we unpack and save.
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

In [66]:
# Summary of metrics for gradient boosting model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.2263
Precision: 0.1962
F1: 0.2102


In [67]:
pred_proba = [i[1] for i in gs5.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.139649
1012,0,0.026367
7561,0,0.280211
7887,0,0.018758
4422,0,0.006356


In [68]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.8373823492548538

### XGBoost Classifier

Note: The code for the XGBoost Classifier is commented out because it takes significant time to run (more than 30 mins).

XGBoost was not explicitly covered in class; we explore its capabilities in this section. XGBoost implements parallel processing and should run faster than GBM.
Update: the grid search took 40 mins, slightly faster than GB, though the difference is likely due to the different parameters used.

In [70]:
from xgboost import XGBClassifier

In [71]:
gsX = GridSearchCV(estimator=XGBClassifier(random_state=42),
                   param_grid={'max_depth': [9, 11],
                               'learning_rate': [0.1],
                               'n_estimators': [100, 125],
                               'objective': ['binary:logistic'],
                               'gamma': [0.5, 1],
                               'min_child_weight': [1, 5],
                               'subsample': [0.5, 1.0],
                               'colsample_bytree': [0.5, 1.0]},
                   cv=5,
                   verbose=2)

Notes on params:
- objective = binary:logistic: logistic regression for binary classification; returns the predicted probability (not the class).
- gamma (default=0): the minimum loss reduction required to make a split; larger values make the model more conservative.
- min_child_weight (default=1): the minimum sum of instance weights required in a child node. Used to control overfitting; higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.
- subsample (default=1): the fraction of observations randomly sampled for each tree. Lower values make the algorithm more conservative and help prevent overfitting. Typical values: 0.5 to 1.
- colsample_bytree (default=1): the fraction of columns randomly sampled for each tree. Typical values: 0.5 to 1.
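
To show how gamma and min_child_weight act on a single candidate split, the sketch below applies the split-gain formula described in the XGBoost documentation ("Introduction to Boosted Trees"). It is a simplified illustration, not XGBoost's actual implementation: the function name `evaluate_split` is hypothetical, the gradient and hessian sums are made-up numbers, and `reg_lambda=1.0` is assumed to mirror the library's default L2 regularisation. For the binary:logistic objective, each observation contributes a gradient of (p - y) and a hessian of p(1 - p), and G and H below are the sums of these over the rows that fall in each child.

```python
def evaluate_split(g_left, h_left, g_right, h_right,
                   gamma=0.0, min_child_weight=1.0, reg_lambda=1.0):
    """Return (keep_split, gain) for one candidate split.

    g_* and h_* are sums of gradients and hessians in the left/right child.
    """
    # min_child_weight: each child needs enough total hessian ("weight").
    if h_left < min_child_weight or h_right < min_child_weight:
        return False, None

    def leaf_score(g, h):
        return g * g / (h + reg_lambda)

    # Improvement from splitting the parent into two children,
    # minus the fixed penalty gamma for adding a leaf.
    gain = 0.5 * (leaf_score(g_left, h_left)
                  + leaf_score(g_right, h_right)
                  - leaf_score(g_left + g_right, h_left + h_right)) - gamma
    return gain > 0, gain

# Made-up gradient/hessian sums for one candidate split.
stats = dict(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=1.5)

keep_low_gamma, gain_low_gamma = evaluate_split(**stats, gamma=1.0)
keep_high_gamma, gain_high_gamma = evaluate_split(**stats, gamma=5.0)
keep_high_mcw, _ = evaluate_split(**stats, gamma=1.0, min_child_weight=5.0)

print(keep_low_gamma, round(gain_low_gamma, 4))    # split kept
print(keep_high_gamma, round(gain_high_gamma, 4))  # rejected: gain below gamma
print(keep_high_mcw)                               # rejected: child weight too small
```

Raising either parameter removes splits that would otherwise be made, which is why both are described as making the model more conservative.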

In [72]:
# CAUTION: Takes a long time to run (about 40 mins)
# Fit GridSearch to the cleaned training data.
gsX.fit(X_sm,y_sm)


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   3.9s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.8s remaining:    0.0s


[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.0s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.0s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.2s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.1s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.3s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   5.0s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   5.8s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.4s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.5s
[CV] colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=0.5, learning_r

[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.0s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.4s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   4.3s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.9s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   4.1s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   3.9s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   6.4s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   4.8s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   5.2s
[CV] colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=0.5, gamma=1, learning_rate=0.1, max_d

[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   8.7s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   8.0s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   8.2s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.7s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.7s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5, total=   9.0s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.

[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=  12.0s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   9.5s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   8.6s
[CV] colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=0.5, learning_r

[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.4s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.8s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5, total=   7.9s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.9s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   6.3s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=100, objective=binary:logistic, subsample=1.0, total=   5.9s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9, min_child_weight=5, n_estimators=125, objective=binary:logistic, subsample=0.5 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=9

[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=0.5, total=  12.4s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   9.2s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0, total=   8.8s
[CV] colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_depth=11, min_child_weight=1, n_estimators=125, objective=binary:logistic, subsample=1.0 
[CV]  colsample_bytree=1.0, gamma=1, learning_rate=0.1, max_d

[Parallel(n_jobs=1)]: Done 320 out of 320 | elapsed: 36.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=42, reg_alpha=0, reg_lambda=1,
                                     scale_p...lent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'colsample_bytree': [0.5, 1.0], 'gamma': [0.5, 1],
                         'learning_rate': [0.1], 'max_depth': [9, 11],
                         'min_child_weight': [1, 5],

In [73]:
# Check the results of the grid search

print(f"Best parameters: {gsX.best_params_}")
print(f"Best score: {gsX.best_score_}")

Best parameters: {'colsample_bytree': 0.5, 'gamma': 0.5, 'learning_rate': 0.1, 'max_depth': 11, 'min_child_weight': 1, 'n_estimators': 125, 'objective': 'binary:logistic', 'subsample': 1.0}
Best score: 0.9427908448030502
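
The `320 out of 320` fits reported in the log follow from the grid size multiplied by the number of folds. The sketch below recomputes that count with the standard library only. The printed `param_grid` is truncated, so the `n_estimators`, `objective` and `subsample` lists here are read off the `[CV]` log lines and are an assumption about the full grid.

```python
from itertools import product

# Grid values: the first five keys come from the printed param_grid;
# the last three are inferred from the [CV] log lines (assumption).
param_grid = {
    "colsample_bytree": [0.5, 1.0],
    "gamma": [0.5, 1],
    "learning_rate": [0.1],
    "max_depth": [9, 11],
    "min_child_weight": [1, 5],
    "n_estimators": [100, 125],
    "objective": ["binary:logistic"],
    "subsample": [0.5, 1.0],
}
cv_folds = 5  # GridSearchCV(cv=5, ...)

# Every combination of values is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per fold.
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

Because the count grows multiplicatively, adding even one more value to a single hyperparameter list would noticeably extend the 36-minute run.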


In [74]:
# Save model
modelX = gsX.best_estimator_

In [75]:
# Score model on train set and validate set
print(f"Accuracy on train set: {modelX.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {modelX.score(X_val, y_val)}")

Accuracy on train set: 0.9803749780970737
Accuracy on validate set: 0.908246225319396


The model is overfitted: validation accuracy (0.908) is about 7 percentage points below train accuracy (0.980). Validation accuracy is also lower than that of the logistic regression, KNN and gradient boosting models (see the summary table).

In [76]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gsX.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2309,137
Actual wnv,100,37


In [77]:
# Flatten the 2x2 confusion matrix into a 1-D array
# and unpack it as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

In [78]:
# Summary of metrics for XGBoost model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.2701
Precision: 0.2126
F1: 0.2379
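
As a cross-check, the metrics above can be recomputed directly from the four confusion-matrix counts. The sketch also adds one number that is not in the notebook: the accuracy of a hypothetical classifier that always predicts "no wnv", which helps explain why accuracy alone is a weak metric on this imbalanced validation set.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 2309, 137, 100, 37
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on the wnv class
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

# Hypothetical baseline: always predict the majority class (no wnv).
baseline_accuracy = (tn + fp) / total

print(f"Accuracy: {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1: {f1:.4f}")
print(f"Always-negative baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline scores higher on accuracy than the XGBoost model while detecting no wnv cases at all, which is why sensitivity, precision, F1 and roc_auc are reported alongside accuracy.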


In [79]:
pred_proba = [i[1] for i in gsX.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.206533
1012,0,0.101871
7561,0,0.185707
7887,0,0.132376
4422,0,0.014373


In [80]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.8491817416786531
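
ROC AUC can be read as the probability that a randomly chosen positive sample receives a higher predicted probability than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way on a tiny made-up example. The function name `pairwise_auc` and the toy data are illustrative assumptions, not part of the notebook, and the quadratic loop is only meant for small inputs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (made up for illustration).
y_toy = [0, 0, 1, 0, 1]
s_toy = [0.10, 0.40, 0.35, 0.20, 0.80]
toy_auc = pairwise_auc(y_toy, s_toy)
print(toy_auc)
```

Because this metric depends only on the ranking of the predicted probabilities, it is unaffected by the 0.5 decision threshold that drives the sensitivity and precision figures.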

### Summary of Model Metrics 

In [114]:
# Summary of Model scores in Dataframe
summary_df = pd.DataFrame({'accuracy(val)' : [0.945, 0.933, 0.799, 0.856, 0.910, 0.908],
                           'sensitivity' :   [0,     0.088, 0.672, 0.657, 0.226, 0.270],
                           'precision' :     [0,     0.200, 0.162, 0.217, 0.196, 0.213],
                           # F1 is undefined for LogReg (precision and sensitivity are both 0)
                           'F1' :            [np.nan, 0.122, 0.261, 0.326, 0.210, 0.238],
                           'roc_auc' :       [0.791, 0.672, 0.774, 0.862, 0.837, 0.849]})
# Transpose dataframe
summary_dft = summary_df.T
# Rename columns
summary_dft.columns = ['LogReg', 'KNN', 'DT', 'RF(Smote O&U)', 'GBc', 'XGBc']
summary_dft

Unnamed: 0,LogReg,KNN,DT,RF(Smote O&U),GBc,XGBc
accuracy(val),0.945,0.933,0.799,0.856,0.91,0.908
sensitivity,0.0,0.088,0.672,0.657,0.226,0.27
precision,0.0,0.2,0.162,0.217,0.196,0.213
F1,,0.122,0.261,0.326,0.21,0.238
roc_auc,0.791,0.672,0.774,0.862,0.837,0.849


Two models stood out based on F1 score and roc_auc: the Random Forest classifier and the XGBoost classifier. XGBoost was trained with SMOTE oversampling, while Random Forest used a combination of oversampling and undersampling. For research purposes, we chose these two as best-model candidates and will submit both to Kaggle to study them further.

Note that the accuracy(val) entry for Random Forest is actually the roc_auc on the validate set, since its grid search was set to optimize for that metric. Next, generate the predicted probabilities on the test set for Kaggle submission.
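
One way to make the F1 and roc_auc comparison explicit is to rank the models on each metric and add the ranks. This is only a sketch of such a comparison using the values from the summary table, not necessarily how the team made its choice. The rank-sum rule and the treatment of the undefined LogReg F1 as worst are assumptions.

```python
import math

# F1 and roc_auc values copied from the summary table.
scores = {
    "LogReg":        {"F1": math.nan, "roc_auc": 0.791},
    "KNN":           {"F1": 0.122, "roc_auc": 0.672},
    "DT":            {"F1": 0.261, "roc_auc": 0.774},
    "RF(Smote O&U)": {"F1": 0.326, "roc_auc": 0.862},
    "GBc":           {"F1": 0.210, "roc_auc": 0.837},
    "XGBc":          {"F1": 0.238, "roc_auc": 0.849},
}

def rank_models(metric):
    """Rank models from best (1) to worst; an undefined score ranks last."""
    def sort_key(model):
        value = scores[model][metric]
        return -math.inf if math.isnan(value) else value
    ordered = sorted(scores, key=sort_key, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}

f1_rank = rank_models("F1")
auc_rank = rank_models("roc_auc")

# Lower rank sum means better overall standing on the two metrics.
rank_sum = {model: f1_rank[model] + auc_rank[model] for model in scores}
top_two = sorted(rank_sum, key=rank_sum.get)[:2]
print(top_two)
```

Under this rule the Random Forest and XGBoost models come out on top, in line with the two candidates chosen above.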

### Best Model 1 (XGBoost) Evaluation

In [81]:
# instantiate the best model with the best hyperparams
best_model = modelX

In [82]:
# Check train set headers
df_traincopy.head(1)

Unnamed: 0,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,dewpt,precip,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,0,1,22,2007,75.5,29.415,58.5,0.0,...,0,0,0,0,0,0,0,0,0,0


In [83]:
# Check test set headers
df_testcopy.head(1)

Unnamed: 0,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,daylight,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0


In [84]:
# making a copy of the train_kaggle df
X_train_kaggle = df_traincopy.copy().drop(['num_mos','wnv'],axis=1)
y_train_kaggle = df_traincopy.copy()['wnv']

X_test_kaggle = df_testcopy.copy()

In [85]:
X_train_kaggle.columns.difference(X_test_kaggle.columns)

Index([], dtype='object')

In [86]:
X_test_kaggle.columns.difference(X_train_kaggle.columns)

Index([], dtype='object')
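
The two `columns.difference` checks confirm that train and test share the same set of column names, but they do not check column order. Depending on the estimator and input type, a positional mismatch can either raise an error or silently assign features to the wrong columns. The sketch below covers both checks at once. The helper name `compare_columns` and the toy column lists are hypothetical.

```python
def compare_columns(train_cols, test_cols):
    """Report name mismatches and whether the column order is identical."""
    train_cols, test_cols = list(train_cols), list(test_cols)
    train_set, test_set = set(train_cols), set(test_cols)
    return {
        "only_in_train": [c for c in train_cols if c not in test_set],
        "only_in_test": [c for c in test_cols if c not in train_set],
        "same_order": train_cols == test_cols,
    }

# Toy example: same names, different order.
report = compare_columns(["lat", "long", "wk"], ["long", "lat", "wk"])
print(report)
```

In the notebook this could be called as `compare_columns(X_train_kaggle.columns, X_test_kaggle.columns)`. If the order differs, `X_test_kaggle = X_test_kaggle[X_train_kaggle.columns]` reorders the test columns to match.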

In [87]:
#checking shape
X_train_kaggle.shape

(8610, 153)

In [88]:
X_train_kaggle.head()

Unnamed: 0,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,daylight,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,22,2007,75.5,29.415,58.5,0.0,5.8,15.6,...,0,0,0,0,0,0,0,0,0,0
1,41.95469,-87.800991,22,2007,75.5,29.415,58.5,0.0,5.8,15.6,...,0,0,0,0,0,0,0,0,0,0
2,41.994991,-87.769279,22,2007,75.5,29.415,58.5,0.0,5.8,15.6,...,0,0,0,0,0,0,0,0,0,0
3,41.974089,-87.824812,22,2007,75.5,29.415,58.5,0.0,5.8,15.6,...,0,0,0,0,0,0,0,0,0,0
4,41.974089,-87.824812,22,2007,75.5,29.415,58.5,0.0,5.8,15.6,...,0,0,0,0,0,0,0,0,0,0


In [89]:
X_test_kaggle.head()

Unnamed: 0,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,daylight,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0
1,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0
2,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0
3,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0
4,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0


In [90]:
#checking shape
X_test_kaggle.shape

(116293, 153)

In [None]:
# Scale variables
# Commented out (since best model is based on decision trees, scaling not required)
#ss = StandardScaler()
#X_train_kaggle_ss = ss.fit_transform(X_train_kaggle)
#X_test_kaggle_ss = ss.transform(X_test_kaggle)

In [91]:
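# Resample using SMOTE to oversample the minority class
# (fit_resample replaces the older fit_sample alias, which newer imbalanced-learn releases removed)
X_train_kaggle_sm, y_train_kaggle_sm = smote.fit_resample(X_train_kaggle, y_train_kaggle)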
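
SMOTE creates each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that core step in plain Python. It is a simplified illustration, not imbalanced-learn's implementation, and the function name `smote_like_sample` and the toy points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random(42)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    # Pick one of the k nearest neighbours at random.
    neighbour = rng.choice(others[:k])
    # Interpolate: base + u * (neighbour - base), with u drawn from [0, 1).
    u = rng.random()
    return [a + u * (b - a) for a, b in zip(base, neighbour)]

# Toy minority-class points in two dimensions (made up for illustration).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because synthetic points lie on line segments between real minority samples, interpolated values of one-hot columns such as the `trap_*` dummies can be fractional rather than 0 or 1.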

In [92]:
# Refit the best model on the full SMOTE-resampled training set

best_model.fit(X_train_kaggle_sm, y_train_kaggle_sm)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.5,
              learning_rate=0.1, max_delta_step=0, max_depth=11,
              min_child_weight=1, missing=None, n_estimators=125, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1.0, verbosity=1)

In [93]:
# Score model on the resampled train set
print(f"Accuracy on train set: {best_model.score(X_train_kaggle_sm,y_train_kaggle_sm)}")

Accuracy on train set: 0.9737519931313627


In [94]:
# predicting kaggle output
predict_kaggle = best_model.predict(X_test_kaggle)

In [95]:
# probability prediction
predict_proba_kaggle = best_model.predict_proba(X_test_kaggle)

In [96]:
#Saving an output CSV file for submission
output = pd.DataFrame({'Id': df_test['id'], 'WnvPresent': predict_proba_kaggle[:,1]})
output.to_csv('my_submission_DSI.csv', index=False)

### Best Model 2 Random Forest Evaluation

In [108]:
# try random forest for kaggle submission
rand_model = rf_gs_best

In [109]:
rand_model.fit(X_train_kaggle, y_train_kaggle)

Pipeline(memory=None,
         steps=[('over',
                 SMOTE(k_neighbors=5, n_jobs=None, random_state=42,
                       sampling_strategy=0.1)),
                ('under',
                 RandomUnderSampler(random_state=42, replacement=False,
                                    sampling_strategy=0.5)),
                ('rf',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=10, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=50, n_jobs=None,


In [110]:
# Pipeline.score returns the final estimator's default score, which is accuracy
print(f"Accuracy on train set: {rand_model.score(X_train_kaggle,y_train_kaggle)}")

Accuracy on train set: 0.8816492450638792


In [111]:
# predicting kaggle output
predict_kaggle = rand_model.predict(X_test_kaggle)

In [112]:
# probability prediction
predict_proba_kaggle = rand_model.predict_proba(X_test_kaggle)

In [113]:
#Saving an output CSV file for submission
output2 = pd.DataFrame({'Id': df_test['id'], 'WnvPresent': predict_proba_kaggle[:,1]})
output2.to_csv('my_submission_DSI2.csv', index=False)

Kaggle submission scores:
- XGBoost (SMOTE oversampling): 0.74579
- RandomForest (SMOTE oversampling and undersampling): 0.73774

The team agreed to explore combined oversampling and undersampling (SMOTE) with the XGBoost classifier in a standalone notebook (4_13). The discussion of pesticide spraying insights and recommendations is in the next section of the current notebook.

Afternote: XGBoost with oversampling and undersampling (SMOTE) scored 0.75410 on Kaggle.
This result supports the team's hypothesis that combining oversampling and undersampling can boost model performance, and the exercise gave the team valuable insights into model building and hyper-parameter tuning.
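
To see what the combined resampling does to the class balance, the sketch below works through the class counts for the settings shown in the Random Forest pipeline (`sampling_strategy=0.1` for SMOTE, then `sampling_strategy=0.5` for `RandomUnderSampler`). It assumes that a float `sampling_strategy` is the target ratio of minority to majority samples after that step. The starting class counts are hypothetical, and the helper `over_then_under` is illustrative only.

```python
def over_then_under(n_major, n_minor, over_ratio, under_ratio):
    """Approximate class counts after oversampling, then undersampling."""
    # Oversampling: grow the minority class to over_ratio * majority.
    n_minor_after = max(n_minor, int(over_ratio * n_major))
    # Undersampling: shrink the majority so that minority / majority = under_ratio.
    n_major_after = min(n_major, int(n_minor_after / under_ratio))
    return n_major_after, n_minor_after

# Hypothetical starting counts: 10,000 negatives and 500 positives.
n_major_after, n_minor_after = over_then_under(10_000, 500, over_ratio=0.1, under_ratio=0.5)
print(n_major_after, n_minor_after)
```

With these settings the minority class is only doubled by synthetic samples, and most of the rebalancing comes from discarding majority samples, in contrast to plain SMOTE, which by default oversamples the minority class up to parity with the majority.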

### Conclusion

#### Recommendations:

Pesticide deployment: To improve the cost-effectiveness of pesticide deployment, the proposed recommendations rely on timing and coverage area:

1. Spraying should be focused on June and July (periods of high rainfall) and, as a start, targeted at the regions around traps with high wnv presence (see presentation slides).

2. Going forward, deployment should be tailored to rainfall patterns; mosquito populations generally spike 2 weeks after heavy rainfall.

3. In the longer term, the classifier model could be used to identify areas for targeted spraying as new data on wnv clusters and weather become available. Further details are in the presentation slides (cost-benefit analysis).

4. Data on the type of pesticide and the cost per spray could be collected to provide more precise cost-benefit spraying recommendations.

The model could be further improved by including data on residential areas, schools, and nursing homes. Such data would provide insights into:
- 1) the types and general state of the residential areas, to better inform alternative mosquito control programs through public outreach campaigns;
- 2) potential mosquito breeding sites (e.g. work areas where rainwater may pool unnoticed), for early prevention efforts;
- 3) areas with higher-risk populations (i.e. children and older folks), which could influence spraying times.

#### Future work
A combination of oversampling and undersampling techniques could be implemented for the various classifiers to further boost classification performance (F1 and roc_auc).

Other observations:
Initially, the team tried to predict the number of mosquitoes and then classify wnv presence using the predicted numbers. However, the accuracy of the predicted numbers was so low that the team dropped this approach and classified wnv presence directly instead. This may suggest that the number of mosquitoes is not a crucial factor in determining the presence of wnv.

The team also encountered hidden errors attributed to subtle differences between Google Colab and Jupyter notebooks. These were manageable overall, with team members covering for each other to ensure the code ran smoothly in the final GitHub submission.