# Time Series Forecasting
![time-series](https://miro.medium.com/max/654/0*_n2EpyRNWP-EchQ9.png)

## Not your everyday dataset

#### A normal machine learning dataset is a collection of observations

Each row is treated as independent of the others.

#### A time series adds an explicit order dependence between observations

Each observation is tied to a point in time, so the sequence of the rows carries information.

For example, the following dataset captures daily energy consumption:

![](https://miro.medium.com/max/367/0*82pE15_BCQTYfMFe.png)

### Data preprocessing to generate a multivariate dataset
![](https://miro.medium.com/max/498/1*RvqYTi5Gow5SheiPCMLMVQ.png)

![](https://miro.medium.com/max/577/1*dd1t4Jc0HmkC6uP-BDdPNg.png)
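
One common way to do this kind of preprocessing is to turn a single series into rows of lagged values plus a target. The sketch below assumes a made-up daily consumption series; the helper name `make_lag_frame` and the choice of three lags are illustrative only, not taken from the figures above.

```python
import pandas as pd

# Hypothetical daily energy readings, made up for illustration
series = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 15.0, 14.0],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
    name="consumption",
)

def make_lag_frame(s, n_lags):
    """Build a frame with n_lags lagged copies of s as features and s as target."""
    cols = {f"lag_{k}": s.shift(k) for k in range(n_lags, 0, -1)}
    frame = pd.DataFrame(cols)
    frame["target"] = s
    # The first n_lags rows lack a full history, so drop them
    return frame.dropna()

lagged = make_lag_frame(series, n_lags=3)
print(lagged)
```

Each remaining row can now be fed to an ordinary regression algorithm: the lag columns are the inputs and `target` is the value to predict.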

## Pandemic response challenge

The Pandemic Response Challenge is a **$500K**, four-month challenge that focuses on the development of data-driven AI systems to predict COVID-19 infection rates and prescribe Intervention Plans (IPs) that regional governments, communities, and organizations can implement to minimize harm when reopening their economies.

https://www.xprize.org/challenge/pandemicresponse

https://github.com/OxCGRT/covid-policy-tracker

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Main source for the training data
DATA_URL = 'https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv'

df = pd.read_csv(DATA_URL, 
                 parse_dates=['Date'],
                 encoding="ISO-8859-1",
                 dtype={"RegionName": str,
                        "RegionCode": str},
                 on_bad_lines="skip")  # replaces error_bad_lines=False, which was removed in pandas 2.0

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98840 entries, 0 to 98839
Data columns (total 49 columns):
CountryName                              98840 non-null object
CountryCode                              98840 non-null object
RegionName                               33888 non-null object
RegionCode                               33888 non-null object
Jurisdiction                             98840 non-null object
Date                                     98840 non-null datetime64[ns]
C1_School closing                        94443 non-null float64
C1_Flag                                  71478 non-null float64
C2_Workplace closing                     93934 non-null float64
C2_Flag                                  64997 non-null float64
C3_Cancel public events                  93930 non-null float64
C3_Flag                                  70499 non-null float64
C4_Restrictions on gatherings            93927 non-null float64
C4_Flag                                  65665 non-null f

In [4]:
df.head()

| | CountryName | CountryCode | RegionName | RegionCode | Jurisdiction | Date | C1_School closing | C1_Flag | C2_Workplace closing | C2_Flag | ... | StringencyIndex | StringencyIndexForDisplay | StringencyLegacyIndex | StringencyLegacyIndexForDisplay | GovernmentResponseIndex | GovernmentResponseIndexForDisplay | ContainmentHealthIndex | ContainmentHealthIndexForDisplay | EconomicSupportIndex | EconomicSupportIndexForDisplay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | NaN | NaN | NAT_TOTAL | 2020-01-01 | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Aruba | ABW | NaN | NaN | NAT_TOTAL | 2020-01-02 | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Aruba | ABW | NaN | NaN | NAT_TOTAL | 2020-01-03 | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Aruba | ABW | NaN | NaN | NAT_TOTAL | 2020-01-04 | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Aruba | ABW | NaN | NaN | NAT_TOTAL | 2020-01-05 | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
# For testing, restrict training data to that before a hypothetical predictor submission date
HYPOTHETICAL_SUBMISSION_DATE = np.datetime64("2020-11-15")
df = df[df.Date <= HYPOTHETICAL_SUBMISSION_DATE].copy()  # copy so later column assignments do not act on a view

In [6]:
# Add GeoID column that combines CountryName and RegionName for easier manipulation of data
df['GeoID'] = df['CountryName'] + '__' + df['RegionName'].astype(str)

In [7]:
# Add new cases column
df['NewCases'] = df.groupby('GeoID').ConfirmedCases.diff().fillna(0)
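
To see what the grouped `diff` does, here is a small sketch on a made-up frame with two regions. The values are invented; only the column names `GeoID`, `ConfirmedCases` and `NewCases` come from the notebook.

```python
import pandas as pd

# Made-up cumulative case counts for two hypothetical regions
toy = pd.DataFrame({
    "GeoID": ["A", "A", "A", "B", "B", "B"],
    "ConfirmedCases": [1.0, 4.0, 9.0, 100.0, 100.0, 130.0],
})

# diff() is computed inside each GeoID, so the first row of every region
# has no predecessor and becomes NaN, which fillna(0) turns into 0
toy["NewCases"] = toy.groupby("GeoID").ConfirmedCases.diff().fillna(0)
print(toy)
```

Without the `groupby`, the first row of region B would be differenced against the last row of region A and produce a spurious jump.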

In [8]:
# Keep only columns of interest
id_cols = ['CountryName',
           'RegionName',
           'GeoID',
           'Date']
cases_col = ['NewCases']
npi_cols = ['C1_School closing',
            'C2_Workplace closing',
            'C3_Cancel public events',
            'C4_Restrictions on gatherings',
            'C5_Close public transport',
            'C6_Stay at home requirements',
            'C7_Restrictions on internal movement',
            'C8_International travel controls',
            'H1_Public information campaigns',
            'H2_Testing policy',
            'H3_Contact tracing',
            'H6_Facial Coverings']
df = df[id_cols + cases_col + npi_cols]

In [9]:
# Fill any missing case values by interpolation and setting NaNs to 0
# transform keeps the original row index, so df.update aligns correctly
df.update(df.groupby('GeoID').NewCases.transform(
    lambda group: group.interpolate()).fillna(0))

In [10]:
# Fill any missing NPIs by assuming they are the same as previous day
for npi_col in npi_cols:
    df.update(df.groupby('GeoID')[npi_col].ffill().fillna(0))
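
The following sketch shows the effect of a grouped forward fill on made-up data. The values are invented; the column name is one of the NPI columns used above.

```python
import numpy as np
import pandas as pd

# Made-up NPI levels with gaps for two hypothetical regions
toy = pd.DataFrame({
    "GeoID": ["A", "A", "A", "B", "B", "B"],
    "C1_School closing": [1.0, np.nan, 3.0, np.nan, 2.0, np.nan],
})

# ffill() carries the last known value forward within each region only;
# a gap at the start of a region has nothing to copy and falls back to 0
filled = toy.groupby("GeoID")["C1_School closing"].ffill().fillna(0)
print(filled.tolist())
```

Note that the first row of region B becomes 0 rather than inheriting the last value of region A.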

In [11]:
df.head()

| | CountryName | RegionName | GeoID | Date | NewCases | C1_School closing | C2_Workplace closing | C3_Cancel public events | C4_Restrictions on gatherings | C5_Close public transport | C6_Stay at home requirements | C7_Restrictions on internal movement | C8_International travel controls | H1_Public information campaigns | H2_Testing policy | H3_Contact tracing | H6_Facial Coverings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | NaN | Aruba__nan | 2020-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Aruba | NaN | Aruba__nan | 2020-01-02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Aruba | NaN | Aruba__nan | 2020-01-03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Aruba | NaN | Aruba__nan | 2020-01-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Aruba | NaN | Aruba__nan | 2020-01-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Reframe multivariate dataset for regression algorithms

In [12]:
# Set number of past days to use to make predictions
nb_lookback_days = 30

# Create training data across all countries for predicting one day ahead
X_cols = cases_col + npi_cols
y_col = cases_col
X_samples = []
y_samples = []
geo_ids = df.GeoID.unique()
for g in geo_ids:
    gdf = df[df.GeoID == g]
    all_case_data = np.array(gdf[cases_col])
    all_npi_data = np.array(gdf[npi_cols])

    # Create one sample for each day where we have enough data
    # Each sample consists of cases and npis for previous nb_lookback_days
    nb_total_days = len(gdf)
    for d in range(nb_lookback_days, nb_total_days - 1):
        X_cases = all_case_data[d-nb_lookback_days:d]

        # Take negative of npis to support positive weight constraints
        X_npis = -all_npi_data[d - nb_lookback_days:d]

        # Flatten all input data into one feature vector per sample
        X_sample = np.concatenate([X_cases.flatten(),
                                   X_npis.flatten()])
        # Note: the window ends at day d - 1 and the target is day d + 1, so day d is skipped
        y_sample = all_case_data[d + 1]
        X_samples.append(X_sample)
        y_samples.append(y_sample)

X_samples = np.array(X_samples)
y_samples = np.array(y_samples).flatten()

In [13]:
X_samples

array([[ 0.,  0.,  0., ..., -0., -0., -0.],
       [ 0.,  0.,  0., ..., -0., -0., -0.],
       [ 0.,  0.,  0., ..., -0., -0., -0.],
       ...,
       [15., 19., 20., ..., -1., -1., -3.],
       [19., 20., 24., ..., -1., -1., -3.],
       [20., 24., 11., ..., -1., -1., -3.]])

In [14]:
y_samples

array([ 0.,  0.,  0., ..., 69., 21., 43.])
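
To make the layout of one flattened sample concrete, here is a toy version of the same indexing with 3 lookback days and 2 NPI columns. The array contents are made up; the slicing mirrors the loop above.

```python
import numpy as np

nb_lookback_days = 3  # shortened from 30 for illustration

# Made-up data for one region: 6 days, 1 case column, 2 NPI columns
all_case_data = np.arange(6, dtype=float).reshape(-1, 1)
all_npi_data = np.arange(12, dtype=float).reshape(6, 2)

d = 3
X_cases = all_case_data[d - nb_lookback_days:d]   # days 0..2, shape (3, 1)
X_npis = -all_npi_data[d - nb_lookback_days:d]    # days 0..2, shape (3, 2)

# Cases come first, then NPIs ordered day by day (row-major flatten)
X_sample = np.concatenate([X_cases.flatten(), X_npis.flatten()])
y_sample = all_case_data[d + 1]

print(X_sample.shape)  # one value per day per column
print(X_sample, y_sample)
```

With the notebook's settings (30 lookback days, 1 case column, 12 NPI columns) the same layout gives 30 + 30 * 12 = 390 features per sample.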

In [15]:
# Helpful function to compute mae
def mae(pred, true):
    return np.mean(np.abs(pred - true))
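
As a quick check of the metric, the sketch below applies the same `mae` definition to made-up predictions, before and after clipping negative predictions to zero.

```python
import numpy as np

def mae(pred, true):
    return np.mean(np.abs(pred - true))

# Made-up predictions and ground truth
pred = np.array([-5.0, 10.0, 20.0])
true = np.array([0.0, 12.0, 17.0])

raw_error = mae(pred, true)                      # (5 + 2 + 3) / 3
clipped_error = mae(np.maximum(pred, 0), true)   # (0 + 2 + 3) / 3
print(raw_error, clipped_error)
```

Because case counts cannot be negative, clipping at zero can only reduce the error on such targets.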

In [16]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_samples,
                                                    y_samples,
                                                    test_size=0.2,
                                                    random_state=301)
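
A random split mixes past and future days of the same region between the train and test sets. Since the observations are ordered in time, a date-based split is a possible alternative. The sketch below is an assumption, not what this notebook does: it uses a made-up frame and a hypothetical cutoff date.

```python
import pandas as pd

# Made-up daily data for two hypothetical regions
toy = pd.DataFrame({
    "GeoID": ["A"] * 5 + ["B"] * 5,
    "Date": list(pd.date_range("2020-01-01", periods=5, freq="D")) * 2,
    "NewCases": [0.0, 1.0, 3.0, 2.0, 5.0, 10.0, 12.0, 9.0, 15.0, 20.0],
})

# Hypothetical cutoff: everything before it is training data
cutoff = pd.Timestamp("2020-01-04")
train = toy[toy.Date < cutoff]
test = toy[toy.Date >= cutoff]

print(len(train), len(test))
```

With this kind of split, every test row is later in time than every training row, which is closer to how a forecast is actually used.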

In [20]:
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression()

def evaluate_model(model):
    
    # Fit model
    model.fit(X_train, y_train)

    # Evaluate model
    train_preds = model.predict(X_train)
    train_preds = np.maximum(train_preds, 0) # Don't predict negative cases
    print('Train MAE:', mae(train_preds, y_train))

    test_preds = model.predict(X_test)
    test_preds = np.maximum(test_preds, 0) # Don't predict negative cases
    print('Test MAE:', mae(test_preds, y_test))
    
    return model

lr_model = evaluate_model(linear_regressor)

Train MAE: 219.2665194208216
Test MAE: 221.1032432064812


In [21]:
from xgboost import XGBRegressor

xgb_regressor = XGBRegressor()
xgb_model = evaluate_model(xgb_regressor)

Train MAE: 98.72512128438963
Test MAE: 197.3748955766229


## Inspect the learned coefficients to see which features the model relies on

In [22]:
def inspect_feature_importance(model):
    x_col_names = []
    for d in range(-nb_lookback_days, 0):
        x_col_names.append('Day ' + str(d) + ' ' + cases_col[0])
    for d in range(-nb_lookback_days, 0):  # NPIs cover the same lookback window as cases
        for col_name in npi_cols:
            x_col_names.append('Day ' + str(d) + ' ' + col_name)

    # View non-zero coefficients
    for (col, coeff) in zip(x_col_names, list(model.coef_)):
        if coeff != 0.:
            print(col, coeff)
    print('Intercept', model.intercept_)

# Assumed call: only the linear model exposes coef_ and intercept_
inspect_feature_importance(lr_model)

Day -30 NewCases 0.052336551289072344
Day -29 NewCases -0.027304104603490642
Day -28 NewCases 0.007862332817523021
Day -27 NewCases 0.13716002944047395
Day -26 NewCases -0.03699212001813895
Day -25 NewCases -0.0736424393552543
Day -24 NewCases -0.1456312484103433
Day -23 NewCases 0.0004444418978870948
Day -22 NewCases -0.07272763061176787
Day -21 NewCases -0.04264309175575846
Day -20 NewCases 0.18911559455804355
Day -19 NewCases 0.028784497691777752
Day -18 NewCases -0.1213719337321206
Day -17 NewCases -0.0491368817495996
Day -16 NewCases -0.11728237872639026
Day -15 NewCases -0.021624479775793178
Day -14 NewCases 0.10899809064344079
Day -13 NewCases 0.14524562247032907
Day -12 NewCases -0.051094841501884036
Day -11 NewCases -0.0636338719622939
Day -10 NewCases -0.1286611239725321
Day -9 NewCases -0.05991985121905671
Day -8 NewCases -0.032995935920346775
Day -7 NewCases 0.03682901365113726
Day -6 NewCases 0.3303778524822517
Day -5 NewCases 0.15968533408203
Day -4 NewCases 0.23351791527


## Prophylax

https://github.com/ShahNewazKhan/prophylax

## Sktime

## Fbprophet & Neural Prophet

