# Data Science 2 Seminar paper
## Business/project evaluation stage

### Premise
What is the effect of personal travel on CoViD-19 infection rates in the tri-state area?
I will look into the following possible factors:
- open vs closed borders  
- open vs closed stores 
- vacation times

Business traffic across the borders, i.e. commuters, was never entirely shut down, so I won't look into it.

The hypothesis is that all three factors have an impact on infection rates. If that can be shown, I will try to build a prediction model for future holidays.

### Preliminary plan of action
* Define area of relevance based on travel/shopping/commute
* Evaluate infection response/delay windows to apply to infection timelines
* Construct an infection spike timeline for the border region.
  * Combined and state-based.
  * Classify days as rising infection or not
* Construct independent/combined timelines for borders' state, stores' state and vacation times. 
* Test correlation to CoViD-19 statistics in the involved countries/overall
* Classify spike events as border/store/vacation
* See if a prediction can be made
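
The delay-window and correlation steps of this plan could be sketched roughly as follows. This is only an illustration on synthetic data: the series names `border_open` and `rate_change`, the helper `lagged_correlation`, and the built-in 7-day delay are assumptions made for the example, not properties of the project data.

```python
import numpy as np
import pandas as pd


def lagged_correlation(indicator: pd.Series, rate: pd.Series, max_lag: int) -> pd.Series:
    """Pearson correlation between the indicator shifted by k days and the rate, for k = 0..max_lag."""
    return pd.Series(
        {lag: indicator.shift(lag).corr(rate) for lag in range(max_lag + 1)},
        name="correlation",
    )


rng = np.random.default_rng(0)
days = pd.date_range("2020-03-01", periods=200, freq="D")

# Synthetic binary indicator: 1.0 = border open, 0.0 = border closed (hypothetical).
border_open = pd.Series(rng.integers(0, 2, size=len(days)), index=days).astype(float)

# Synthetic rate change that reacts to the border state 7 days earlier, plus noise.
true_lag = 7
rate_change = 1.0 + 0.1 * border_open.shift(true_lag) + rng.normal(0, 0.01, len(days))

correlations = lagged_correlation(border_open, rate_change, max_lag=14)
best_lag = int(correlations.idxmax())
print(correlations.round(2))
print("lag with the highest correlation:", best_lag)
```

Scanning a range of lags like this is one way to estimate a response window before the border, store, and vacation timelines are compared with the infection timeline.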



### Data evaluation

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from datetime import datetime as dt, timedelta

from utility.helpers import *
import utility.init as util

plt.rcParams['figure.figsize'] = [12, 6]
plt.rcParams['figure.dpi'] = 100 # e.g. 200 looks really fine, but is slower


## Importing data

In [2]:
emr_df = pd.read_csv(util.emr_infection_data)
de_ref_df = pd.read_csv(util.de_reference_data)
nl_ref_df = pd.read_csv(util.nl_reference_data)
be_ref_df = pd.read_csv(util.be_reference_data)

# add date-type columns
emr_df = addDateTypeColumn(emr_df,'XDate')
de_ref_df = addDateTypeColumn(de_ref_df,'XDate')
nl_ref_df = addDateTypeColumn(nl_ref_df,'XDate')
be_ref_df = addDateTypeColumn(be_ref_df,'XDate')

In [3]:
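def lastRowsValue(window):
    # Used with rolling(2): window[0] is the previous row's value
    # (or the current row's value while the window holds only one row).
    return window[0]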
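
To make clear what this helper does when combined with `rolling(2, min_periods=1)`, here is a minimal sketch on made-up values. It suggests that the construction behaves like `shift(1)`, except that the first row keeps its own value instead of becoming `NaN`. The name `last_rows_value` and the sample numbers are illustrative only, and the comparison assumes a series without missing values.

```python
import pandas as pd


def last_rows_value(window):
    # window[0] is the oldest value inside the rolling window
    return window[0]


s = pd.Series([1.10, 1.08, 1.07, 1.05])  # made-up rate-change values

# Same construction as in the notebook: a window of 2 rows, at least 1 required.
via_rolling = s.rolling(2, min_periods=1).apply(last_rows_value, raw=True)

# shift(1) leaves NaN in the first row; filling it with the row's own value
# reproduces the min_periods=1 behaviour.
via_shift = s.shift(1).fillna(s)

print(via_rolling.tolist())
print(via_rolling.equals(via_shift))
```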

In [4]:
gbtr_df = emr_df.loc[emr_df.Province_Id == 30]

gbtr_df = pd.merge(gbtr_df, de_ref_df, on='Date', how='outer')
gbtr_df['NDRC_SL_Yesterday'] = gbtr_df['N_Day_Rate_Change_Sliding_Window'].rolling(2, min_periods=1).apply(lastRowsValue, raw=True )

gbtr_df = gbtr_df.loc[gbtr_df.NDRC_SL_Yesterday.notna() & gbtr_df.N_Day_Rate_Change_Sliding_Window.notna() ,['NDRC_SL_Yesterday','N_Day_Rate_Change_Sliding_Window','Date','OffDayFactor']]
gbtr_df

| | NDRC_SL_Yesterday | N_Day_Rate_Change_Sliding_Window | Date | OffDayFactor |
|---|---|---|---|---|
| 9 | 1.093183 | 1.075902 | 2020-03-24 | 10.0 |
| 10 | 1.075902 | 1.074059 | 2020-03-25 | 11.0 |
| 11 | 1.074059 | 1.067411 | 2020-03-26 | 12.0 |
| 12 | 1.067411 | 1.071119 | 2020-03-27 | 13.0 |
| 13 | 1.071119 | 1.041553 | 2020-03-28 | 14.0 |
| ... | ... | ... | ... | ... |
| 603 | 1.370667 | 1.361200 | 2021-11-11 | 3.0 |
| 604 | 1.361200 | 1.435401 | 2021-11-12 | 3.0 |
| 605 | 1.435401 | 1.528716 | 2021-11-13 | 2.0 |
| 606 | 1.528716 | 1.551841 | 2021-11-14 | 1.0 |


# Gradient Boosting Regressor

In [11]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X = gbtr_df.loc[:,['Date','NDRC_SL_Yesterday','OffDayFactor']]
y = gbtr_df.N_Day_Rate_Change_Sliding_Window

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

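# Note: loss='lad' was renamed to 'absolute_error' in scikit-learn 1.0 and removed in 1.2.
gbtReg = GradientBoostingRegressor(max_depth=2, subsample=0.1, n_estimators=50, learning_rate=0.05, loss='lad', random_state=0)

gbtReg.fit(X_train, y_train)
gbtReg.score(X_test, y_test)  # coefficient of determination (R^2) on the held-out 40 %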

0.4462742951798032
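
The value above is the coefficient of determination (R²) that `score` returns for regressors. One way to put such a number in context is a naive persistence baseline that predicts today's rate change to be equal to yesterday's. The sketch below shows the idea on a synthetic series. The column names `today` and `yesterday` are placeholders, and the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(1)

# Synthetic stand-in for a slowly drifting rate-change series around 1.0.
today = pd.Series(1.0 + np.cumsum(rng.normal(0, 0.01, 300)))
demo = pd.DataFrame({"today": today, "yesterday": today.shift(1)}).dropna()

# Persistence baseline: the prediction for each day is simply the previous day's value.
baseline_r2 = r2_score(demo["today"], demo["yesterday"])
baseline_mae = mean_absolute_error(demo["today"], demo["yesterday"])

print("baseline R^2: %.3f" % baseline_r2)
print("baseline MAE: %.4f" % baseline_mae)
```

Applied to the notebook's data, the same two calls on `N_Day_Rate_Change_Sliding_Window` and `NDRC_SL_Yesterday` would give a reference point for the models that follow.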

## Cross-validation

In [12]:
from numpy import mean
from numpy import std

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

gbtRegModel = GradientBoostingRegressor(max_depth=2, subsample=0.1, n_estimators=50, learning_rate=0.05, loss='lad', random_state=1)
crossVal = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# 'neg_mean_absolute_error' yields the negated MAE per fold, so the printed mean is negative.
n_scores = cross_val_score(gbtRegModel, X, y, scoring='neg_mean_absolute_error', cv=crossVal, n_jobs=-1, error_score='raise')

print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

MAE: -0.068 (0.010)
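
The negative sign comes from scikit-learn's convention that higher scores are better, which is why the scorer is called `neg_mean_absolute_error`. A minimal sketch of flipping the sign before reporting is given below. It uses a synthetic regression problem and default loss settings, so the variable names and numbers are illustrative and not the notebook's.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data, only to have something to cross-validate on.
X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=5.0, random_state=0)

model = GradientBoostingRegressor(n_estimators=20, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

# Each entry is the negated MAE of one fold, so every value is <= 0.
neg_mae = cross_val_score(model, X_demo, y_demo, scoring="neg_mean_absolute_error", cv=cv)

# Flip the sign to report the error as a positive number.
mae = -neg_mae
print("MAE: %.3f (%.3f)" % (mae.mean(), mae.std()))
```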


## Grid search

In [7]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2, 3, 4],
    # 'max_features': [2, 3],
    'subsample'    : [0.1],
    # 'min_samples_leaf': [3, 4, 5],
    # 'min_samples_split': [8, 10, 12],
    'n_estimators': [50],
    'learning_rate': [0.05]
}

gbr = GradientBoostingRegressor(loss='lad', random_state=1)
cv = RepeatedKFold(n_splits=20, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator = gbr, param_grid = param_grid, 
                          cv = cv, n_jobs = -1, verbose = 2, scoring='neg_mean_absolute_error')
grid_search.fit(X, y)


Fitting 60 folds for each of 3 candidates, totalling 180 fits


GridSearchCV(cv=RepeatedKFold(n_repeats=3, n_splits=20, random_state=1),
             estimator=GradientBoostingRegressor(loss='lad', random_state=1),
             n_jobs=-1,
             param_grid={'learning_rate': [0.05], 'max_depth': [2, 3, 4],
                         'n_estimators': [50], 'subsample': [0.1]},
             scoring='neg_mean_absolute_error', verbose=2)

In [8]:
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n",grid_search.best_estimator_)
print("\n The best score across ALL searched params:\n",grid_search.best_score_)
print("\n The best parameters across ALL searched params:\n",grid_search.best_params_)

 Results from Grid Search 

 The best estimator across ALL searched params:
 GradientBoostingRegressor(learning_rate=0.05, loss='lad', max_depth=2,
                          n_estimators=50, random_state=1, subsample=0.1)

 The best score across ALL searched params:
 -0.06766992741685841

 The best parameters across ALL searched params:
 {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 50, 'subsample': 0.1}
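
Besides the single best candidate, the fitted search object also keeps the scores of every candidate in `cv_results_`, which helps to see how close the alternatives were. The sketch below shows one way to tabulate them. It runs on synthetic data with a smaller setup than the notebook's, so `X_demo`, `y_demo`, and `search` are stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, only to have something to search on.
X_demo, y_demo = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 3, 4]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Comparing `mean_test_score` with `std_test_score` across rows indicates whether the winning depth is clearly better or within the fold-to-fold noise.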


# Ridge Regression

In [9]:
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

X = gbtr_df.loc[:,['NDRC_SL_Yesterday','OffDayFactor']]
y = gbtr_df.N_Day_Rate_Change_Sliding_Window

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.4, random_state=1)

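# Kernel ridge regression with scikit-learn's defaults (linear kernel, alpha=1.0).
# The variable name gbtReg is reused from the gradient boosting cell.
gbtReg = KernelRidge()
gbtReg.fit(X_train, y_train)
gbtReg.score(X_test, y_test)  # R^2 on the held-out 40 %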

0.4549932891636499

# SVM

In [10]:
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

regr = make_pipeline(StandardScaler(), SVR(kernel='rbf',C=1.0, epsilon=0.2))
regr.fit(X_train, y_train)
regr.score(X_test, y_test)

0.3796438105215253
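
One setting that may deserve a second look here is `epsilon`: SVR ignores residuals smaller than `epsilon`, and the pipeline scales the features but not the target. If the target only varies by a few tenths, as the rate-change values in the table above seem to, a tube of 0.2 can swallow most of the signal. The sketch below illustrates the effect on a synthetic target confined to roughly 0.85 to 1.15. The data and the helper `count_support_vectors` are assumptions for the illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))

# Synthetic target that stays within about 0.85..1.15 (narrower than 2 * 0.2).
y_demo = 1.0 + 0.1 * np.tanh(X_demo[:, 0]) + rng.uniform(-0.05, 0.05, size=200)


def count_support_vectors(epsilon):
    """Fit the same pipeline as in the notebook and count the SVR's support vectors."""
    pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=epsilon))
    pipe.fit(X_demo, y_demo)
    return len(pipe.named_steps["svr"].support_)


n_wide = count_support_vectors(0.2)    # tube wider than the target's spread
n_tight = count_support_vectors(0.01)  # tube narrower than the noise

print("support vectors with epsilon=0.2: ", n_wide)
print("support vectors with epsilon=0.01:", n_tight)
```

A very small number of support vectors with the wide tube means the model is close to a constant prediction, so trying smaller `epsilon` values, or scaling the target, might be worth a grid search.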

# Forecaster