# An exponential-smoothing forecast of fatalities caused by the COVID-19 pandemic

**Approach**<br>
As most countries are still in the early stages of the epidemic, we will use a simple technique, namely a [damped trend method](https://otexts.com/fpp2/holt.html), to extrapolate fatalities per million for each country. The underlying model extrapolates exponential trends intuitively, much as a person would by hand on a chart, and it is based on four smoothing parameters:
- Level-smoothing factor $\alpha$<br>
- Slope-smoothing or trend factor $\beta$<br>
- Slope-damping factor $\phi$, typically between 0.8 (resilience) and 1 (no damping: fatalities increase indefinitely)<br>
- Seasonality factor $\gamma$ (we have assumed no seasonality in our forecasts)<br>

The smoothing parameters will be learned from the dataset and used to forecast fatalities per million, mainly in countries where data is still limited. In particular, we hope to learn the damping factor $\phi$, which is of paramount importance to [***flatten the curve***](https://www.livescience.com/coronavirus-flatten-the-curve.html).
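
The recursions behind a multiplicative damped trend can be sketched in a few lines. The function below is only an illustration and not the `statsmodels` implementation used later in this notebook: the name `damped_mul_holt` and the initialisation (level set to the first observation, growth factor set to the ratio of the first two observations) are assumptions made for this example.

```python
import numpy as np

def damped_mul_holt(y, alpha, beta, phi, horizon):
    """Multiplicative damped-trend smoothing (illustrative sketch).

    level_t = alpha * y_t + (1 - alpha) * level_{t-1} * trend_{t-1} ** phi
    trend_t = beta * level_t / level_{t-1} + (1 - beta) * trend_{t-1} ** phi
    forecast_{t+h} = level_t * trend_t ** (phi + phi**2 + ... + phi**h)
    """
    y = np.asarray(y, dtype=float)
    # Assumed initial state: first observation and first growth ratio
    level, trend = y[0], y[1] / y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * prev_level * trend ** phi
        trend = beta * level / prev_level + (1 - beta) * trend ** phi
    # Damped exponents: phi, phi + phi^2, ..., phi + ... + phi^horizon
    exponents = np.cumsum(phi ** np.arange(1, horizon + 1))
    return level * trend ** exponents

# Made-up series that doubles every day
doubling = [1, 2, 4, 8]
undamped = damped_mul_holt(doubling, alpha=1, beta=1, phi=1, horizon=3)
damped = damped_mul_holt(doubling, alpha=1, beta=1, phi=0.5, horizon=3)
print(undamped)
print(damped)
```

With $\phi=1$ the doubling simply continues, whereas with $\phi=0.5$ the forecast keeps rising but levels off.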

**Important notes and disclaimers:** 
- We will not use the number of confirmed cases directly. We will use it to compute case fatality rates.
- **Our proposed approach only yields acceptable $R^{2}$ scores with KNN regression.** We hope that with more data, ideally within a couple of weeks, we will be able to use more advanced regression techniques and better understand COVID-19 risk factors. Our model parameters ($\alpha$, $\beta$ and $\phi$) are very sensitive, i.e. there is no unique combination of parameters that best fits the historical fatalities curve for a given country. This is a major issue, as we can only learn from robust parameters, but **we will keep working on this methodological issue.**
- At this stage, the main purpose of this notebook is therefore to help whoever may want to explore similar forecasting techniques.

In [None]:
import numpy as np 
import pandas as pd 

import warnings
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import dates
import datetime as dt

from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import DistanceMetric, KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from functools import reduce
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.IndexSlice

## Data

### Load dataset

In [None]:
df = pd.read_csv('../input/covid19-global-forecasting-week-2/train.csv', index_col=0)

In [None]:
# Fix the country label for Taiwan and drop the Diamond Princess cruise ship
df['Country_Region'] = df['Country_Region'].replace('Taiwan*', 'Taiwan')
df = df[df['Country_Region'] != 'Diamond Princess']

In [None]:
# Country Codes
df_codes = (pd.read_csv('../input/my-covid19-dataset/iso-country-codes.csv')
            .rename(columns={'Country Name': 'Country_Region', 'Alpha-3 code': 'Country Code'})
            .loc[:,['Country_Region', 'Country Code']])
df = (pd.merge(df, df_codes, how='left', on='Country_Region'))

In [None]:
# Locations
def location(state, country):
    if isinstance(state, str) and state != country:
        return country + ' - ' + state
    else:
        return country

# Timeline 
df['Date'] = df['Date'].apply(lambda x: (dt.datetime.strptime(x, '%Y-%m-%d')))
df['Location'] = df[['Province_State', 'Country_Region']].apply(
    lambda row: location(row['Province_State'], row['Country_Region']), axis=1)

t_start = df['Date'].unique()[0]
t_end = df['Date'].unique()[-1]

print('Number of Locations: ' + str(len(df['Location'].unique())))
print('Dates: from ' + np.datetime_as_string(t_start, unit='D') + ' to ' + 
                                  np.datetime_as_string(t_end, unit='D') + '\n')

## Adjustments

### Population adjustment (normalisation)

#### Overseas territories
Most overseas territories will be ignored. We will focus on major countries, as well as relevant states and provinces in China, the USA, Canada and Australia.

In [None]:
lst_out = ['Cayman Islands', 'Curacao', 'Faroe Islands', 'French Guiana', 'French Polynesia', 'Guadeloupe', 
           'Mayotte', 'Martinique', 'Reunion', 'Saint Barthelemy', 'St Martin', 'Aruba', 'Channel Islands', 
           'Gibraltar', 'Montserrat', 'Diamond Princess', 'From Diamond Princess', 'Puerto Rico', 
           'Virgin Islands', 'Guam']

df_loc = (df.loc[[((c not in lst_out) and (p not in lst_out)) 
                 for (c, p) in zip(df['Province_State'], df['Country_Region'])], 
                ['Location','Province_State', 'Country_Region','Country Code']]
          .drop_duplicates())

#### Provinces in China, USA, Canada and Australia
**Source**: Population figures for China, the USA, Canada and Australia: http://www.citypopulation.de/

In [None]:
df_pop = pd.read_csv('../input/my-covid19-dataset/citypopulation-de/population.csv')

In [None]:
df_loc = pd.merge(df_loc, df_pop, how='left', on=['Province_State', 'Country_Region'])

#### Other locations
**Source:** 2020 Population Estimates, https://population.un.org/wpp/Download/Standard/Population/

In [None]:
# Population estimate as of July 2020 published by the UN (in '000 people)
df_pop = (pd.read_csv('../input/my-covid19-dataset/un-org/population-2020.csv')
          .rename(columns={'ISO 3166-1 alpha code': 'Country Code'}))

In [None]:
df_pop = df_pop.loc[~df_pop['Country Code'].isin(['CHN', 'USA', 'CAN', 'AUS']), ['Country Code', 'Population']]

In [None]:
df_pop = pd.merge(df_loc, df_pop, how='left', on='Country Code', suffixes=('_P/S','_C/R'))

In [None]:
# Population ('000) of the State if available, of the Country otherwise
df_pop['Population'] = (df_pop[['Population_P/S','Population_C/R']]
                        .apply(lambda x: x[1] if np.isnan(x[0]) else x[0], axis=1))

In [None]:
df_pop = df_pop.loc[:,['Location','Population']].set_index('Location', verify_integrity=True)

#### Fatalities per Million

In [None]:
df = pd.merge(df, df_pop, how='left', on='Location')

In [None]:
df['Fatalities per Million'] = df['Fatalities'] / df['Population'] * 1000
df['Confirmed Cases per Million'] = df['ConfirmedCases'] / df['Population'] * 1000
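
As a quick sanity check of the units: populations are expressed in thousands, so dividing a count by the population and multiplying by 1,000 yields a rate per million. The helper name `per_million` and the figures below are made up for illustration.

```python
def per_million(count, population_thousands):
    """Rate per million inhabitants, with the population given in thousands."""
    return count / population_thousands * 1000

# Hypothetical location: 50 fatalities, 10 million inhabitants (10,000 thousand)
rate = per_million(50, 10_000)
print(rate)
```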

### Time Adjustment

In [None]:
# Day count since first confirmed case (resp. first fatality)
df['Day Count Confirmed'] = (df['ConfirmedCases']>0).groupby(df['Location']).cumsum().astype('int')
df['Day Count Fatalities'] = (df['Fatalities']>0).groupby(df['Location']).cumsum().astype('int')
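
The day counts above rely on a cumulative sum of a boolean flag within each location. The toy example below, with made-up counts for two hypothetical locations, shows the mechanics; it casts the flag to integers before grouping, a small deviation from the cell above that should give the same counts.

```python
import pandas as pd

# Made-up cumulative fatality counts for two hypothetical locations
toy = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Fatalities': [0, 0, 1, 3, 0, 2, 2, 5],
})

# Running count of days with at least one cumulative fatality, per location
toy['Day Count Fatalities'] = ((toy['Fatalities'] > 0).astype(int)
                               .groupby(toy['Location']).cumsum())
print(toy)
```

Days before the first fatality keep a count of 0, and the counter restarts for each location.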

In [None]:
# We correct a few inconsistencies in the dataset to make sure that day counts are strictly monotonic
# (the tuple (location, day count) will be used as index)
df['Fatalities'] = df.loc[df['Day Count Fatalities']>0, 'Fatalities'].apply(lambda x: max(x,1))
df['Day Count Fatalities'] = (df['Fatalities']>0).groupby(df['Location']).cumsum().astype('int')

### Locations of interest
In some locations, the number of fatalities remains very limited and progresses very slowly. We believe that these locations do not yield meaningful data on which to build our model, so we exclude them from our training set.

In [None]:
# Daily increase in confirmed cases (resp. fatalities), as a fraction of the current cumulative total
df['New Cases'] = df['ConfirmedCases'].groupby(df['Location']).diff() / df['ConfirmedCases']
df['New Fatalities'] = df['Fatalities'].groupby(df['Location']).diff() / df['Fatalities']

# Case Fatality Rate (i.e. ratio of confirmed fatalities to confirmed cases)
# (may help identify outliers, i.e. countries where actual cases may be particularly underestimated)
df['Case Fatality Rate'] = df['Fatalities'] / df['ConfirmedCases']

In [None]:
#################################################################################################################
# Locations where the number of fatalities remains very limited and progresses very slowly are ignored 
df_all = df

df = df[(df['Day Count Fatalities']>0) & 
        ((df['Fatalities per Million']>1) | # at least 1 fatality per million
        (df['New Fatalities']>1/7))] # at least doubling every week
#################################################################################################################

### Testing Capacity
**Source:** Number of Tests per Million People, https://ourworldindata.org/coronavirus-testing-source-data#

In [None]:
df_testing = (pd.read_csv('../input/my-covid19-dataset/ourworldindata/tests/tests-vs-confirmed-cases-covid-19.csv')
              .loc[:,['Entity', 'Total COVID-19 tests']]
              .rename(columns={'Entity':'Location', 'Total COVID-19 tests': 'Tests'}))

In [None]:
df = pd.merge(df, df_testing, how='left', on='Location')

In [None]:
# Number of confirmed cases divided by number of tests (where the number of tests was unknown, we have made 
# the (strong) assumption that only infected people were tested)
df['Tests per Million'] = df['Tests'] / df['Population'] * 1000
df['Tests per Million'].fillna(df['Confirmed Cases per Million'], inplace=True)
df.drop(columns=['Province_State', 'Country_Region', 'Tests', 'Population'], inplace=True)

# 'Confirmed Rate' is defined as the proportion of confirmed cases in the tested population
df['Confirmed Rate'] = df['Confirmed Cases per Million'] / df['Tests per Million']

#### Mass testing capacity

##### Proportion of confirmed cases among tested population

In [None]:
df_plt = df.set_index(['Location','Day Count Fatalities'], verify_integrity=True)

mask = (~df_plt.index.get_level_values(0).duplicated(keep='last')) & (df_plt['Confirmed Rate']<1)
df_plt = df_plt.loc[mask, ['Fatalities per Million','Tests per Million','Confirmed Rate']].reset_index()

df_plt.plot(x='Day Count Fatalities', y='Confirmed Rate', c='Tests per Million', 
            kind='scatter', colormap='coolwarm_r', sharex=False, figsize=(17.5,7.5))

# Annotations
x = df_plt['Day Count Fatalities'].values
y = df_plt['Confirmed Rate'].values
z = df_plt['Location'].values

for i, txt in enumerate(z):
    if y[i] > .25:
        plt.text(x[i]+.005, y[i]+.005, txt, rotation=0, rotation_mode='anchor')

plt.title('Proportion of confirmed cases among tested population\n (Color = Number of Tests per Million)')
plt.xlabel('Number of days since first fatality')
plt.ylabel('')
plt.show()

##### Proportion of fatalities among confirmed cases

In [None]:
df_plt = df.set_index(['Location','Day Count Fatalities'], verify_integrity=True)

mask = ~df_plt.index.get_level_values(0).duplicated(keep='last')
df_plt = df_plt.loc[mask, ['Fatalities per Million','Tests per Million','Case Fatality Rate']].reset_index()

df_plt.plot(x='Day Count Fatalities', y='Case Fatality Rate', c='Tests per Million', 
            kind='scatter', colormap='coolwarm_r', sharex=False, figsize=(17.5,7.5))

# Annotations
x = df_plt['Day Count Fatalities'].values
y = df_plt['Case Fatality Rate'].values
z = df_plt['Location'].values

for i, txt in enumerate(z): # annotate outliers (case fatality rate above 5%)
    if (y[i]>.05) and (y[i]<.25):
        plt.text(x[i]+.005, y[i]+.005, txt, rotation=0, rotation_mode='anchor')

plt.ylim((0,.25))
plt.title('Proportion of fatalities among confirmed cases\n (Color = Number of Tests per Million)')
plt.xlabel('Number of days since first fatality')
plt.ylabel('')
plt.show()

**Notes:**
- High case fatality rates in the early days seem to indicate a strong bias towards testing and treating infected people as a priority. Lower levels may indicate better anticipation, especially where the number of tests per million is high.

We will use `Case Fatality Rate` instead of `Confirmed Rate`, which seems less reliable (some values are above 100%, for instance).

#### Mean Fatality Rate

In [None]:
# Compute key statistics... 
df_frate = (df.loc[:,['Day Count Fatalities', 'Case Fatality Rate']].groupby('Day Count Fatalities')
            .agg(['count', 'mean', 'std']))

y_count = df_frate.loc[:,('Case Fatality Rate', 'count')]
y_mean = df_frate.loc[:,('Case Fatality Rate', 'mean')]
y_std  = df_frate.loc[:,('Case Fatality Rate', 'std')]

# ... and plot them
plt.plot(df_frate.index, y_mean, c='w')
plt.fill_between(df_frate.index, y_mean - y_std, y_mean + y_std, alpha=.5)
plt.ylim(bottom=0)
plt.title('Mean Case Fatality Rate (% of confirmed cases, since first fatality)')
plt.show()

plt.bar(df_frate.index, y_count, color='grey')
plt.title('Number of countries')
plt.gca().set_xlabel('Number of days since first fatality')
plt.show()

***Note:*** *The fatality rate seems to reach a normative rate roughly 25 days after the first fatality (top chart). This hypothesis cannot be tested due to limited information thereafter (bottom chart).*

In [None]:
# Fatality rate 30 days after the first fatality
df_plt = df[df['Day Count Fatalities']==30]
df_plt.plot(x='Day Count Confirmed', y='Case Fatality Rate', c='Fatalities per Million', 
            kind='scatter', colormap='coolwarm', sharex=False, figsize=(10,7.5))

# Annotations
x = df_plt['Day Count Confirmed'].values
y = df_plt['Case Fatality Rate'].values
z = df_plt['Location'].values

for i, txt in enumerate(z):
    # fatality rates are expected to be close to 2%
    # (the number of confirmed cases is probably underestimated otherwise)
    plt.text(x[i]+.005, y[i]+.005, txt, rotation=0, rotation_mode='anchor')

plt.title('Case Fatality Rate 30 days after the first fatality')
plt.xlabel('Number of days since first confirmed case')
plt.show()

#### High-level consistency check

In [None]:
# We extract the most recent data available for each location and plot correlations
mask = (~df.set_index(['Location','Date'], verify_integrity=True)
        .index.get_level_values(0)
        .duplicated(keep='last'))
df_plt = (df[mask].drop(columns=['Date','Country Code',
                                 'Day Count Confirmed','ConfirmedCases','Confirmed Cases per Million',
                                 'Confirmed Rate','Fatalities'])
          .set_index(['Location'], verify_integrity=True))

In [None]:
# Correlation matrix
df_plt.corr().style.background_gradient(cmap='Reds').set_precision(2)

**Note:** 
Not surprisingly, we observe a linear correlation (i) between the number of confirmed cases and the number of fatalities (our dependent variables); and (ii) between the number of confirmed cases and the number of tests per million.

In [None]:
pd.plotting.scatter_matrix(df_plt, figsize=(15,15))
plt.show()

### Case Fatality Rates
We will focus on fatalities per million and case fatality rates, more reliable than raw numbers of confirmed cases.

In [None]:
# Last available data
df = df.set_index(['Location','Day Count Fatalities'], verify_integrity=True).sort_index()
mask = ~df.index.get_level_values(0).duplicated(keep='last')

In [None]:
# Standardized case fatality rates
df_cfr = df.loc[mask,:].reset_index(level=1, drop=True).loc[:,['Country Code', 'Case Fatality Rate']]
df_cfr[['Case Fatality Rate']] = df_cfr[['Case Fatality Rate']].apply(lambda x: (x-x.min()) / (x.max()-x.min()))
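
The standardisation used here is a min-max rescaling to the interval [0, 1]. A minimal sketch on made-up values follows (the helper name `min_max` is hypothetical):

```python
import numpy as np

def min_max(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Hypothetical case fatality rates for four locations
scaled = min_max([0.01, 0.02, 0.05, 0.09])
print(scaled)
```

Note that a column with identical values would lead to a division by zero, and that a single extreme value compresses all other values towards 0.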

## Exponential smoothing
We use exponential smoothing with damping to predict the evolution of the pandemic.<br>
For the sake of illustration, we first forecast the epidemic in a given state. Then we compute parameters for all locations.

**Documentation:** https://www.statsmodels.org/dev/generated/statsmodels.tsa.holtwinters.ExponentialSmoothing.html

### Illustration of the method

In [None]:
# Selected state, and forecasting period in days
state = 'US - New York'
fperiod = 90

In [None]:
# Fatalities per million for the selected state
fatalities = df.loc[idx[state, :], 'Fatalities per Million'].reset_index(drop=True)

In [None]:
# Guess parameters
init_alpha = .3 #.5
init_beta = .7 #.1
init_phi = .8
initial_level = fatalities[0]
initial_slope = fatalities[1] / fatalities[0]
start_params = [init_alpha, init_beta, initial_level, initial_slope, init_phi]

# Search for best parameters
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    fmodel = (ExponentialSmoothing(fatalities, trend='mul', damped=True, seasonal=None)
              .fit(start_params=start_params, remove_bias=True, use_basinhopping=True))
    fcast = fmodel.forecast(fperiod)

In [None]:
# Visualisation 
plt.figure(figsize=(10,5))
fcast.plot(style='--', marker='', color='green', legend=True, label='Forecast')
fmodel.fittedvalues.plot(style='--', marker='', color='blue', legend=True, label='Smoothed')
fatalities.plot(linestyle='', marker='.', color='red', legend=True, label='Actual values')

keys = ['smoothing_level', 'smoothing_slope', 'damping_slope']
alpha, beta, phi = list(map(fmodel.model.params.get, keys))
txt_params = ('Exponential smoothing with parameters:\n\n\t' + r'$\alpha=${}'.format(alpha) + '\n\t' + 
              r'$\beta=${}'.format(beta) + '\n\t' + r'$\phi=${}'.format(phi))

plt.gcf().text(1, 0.5, txt_params)
plt.title(r'Fatalities per Million in {}'.format(state) + '\n' + '(multiplicative damped trend)')
plt.xlabel('Day Count Fatalities')
plt.show()

### Parameters in all locations

In [None]:
# Define placeholders for states of interests
df_params = pd.DataFrame(index=df.index.unique(level='Location'), 
                         columns=['Alpha', 'Beta', 'Phi' ,'Forecast per Million'])
keys = ['smoothing_level', 'smoothing_slope', 'damping_slope']

# Loop through all locations
for state in df.index.get_level_values(level=0).unique():
    fatalities = df.loc[idx[state, :], 'Fatalities per Million'].reset_index(drop=True)
    
    # At least two data points are required to run the model
    if len(fatalities.dropna().index) < 2: continue
    
    # Get optimal parameters
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        fmodel = (ExponentialSmoothing(fatalities, trend='mul', damped=True, seasonal=None)
                  .fit(remove_bias=True, smoothing_seasonal=0))
        _, beta, phi = list(map(fmodel.model.params.get, keys))
        
        # Re-run optimization if forecast is not constant
        if not ((beta==0) and (phi==0)):
            
            # Guess parameters
            init_alpha = .3 #.5
            init_beta = .7 #.1
            init_phi = .8
            initial_level = fatalities[0]
            initial_slope = fatalities[1] / fatalities[0]
            start_params = [init_alpha, init_beta, initial_level, initial_slope, init_phi]

            # Search for best parameters
            fmodel = (ExponentialSmoothing(fatalities, trend='mul', damped=True, seasonal=None)
                      .fit(start_params=start_params, remove_bias=True, use_basinhopping=True))
            
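        # Consistency check: re-run where fatalities exceed 2% of the entire population
        # (forecasts are expressed per million, so 2% corresponds to 0.02 * 1e6; the check
        # uses the forecast of the current location rather than the one from the illustration)
        if fmodel.forecast(fperiod).iloc[-1] > .02*1e6: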
            
            # Guess alpha and beta subject to phi=0.80
            fmodel = (ExponentialSmoothing(fatalities, trend='mul', damped=True, seasonal=None)
                      .fit(damping_slope=.8, remove_bias=True, use_basinhopping=True))
    
    # Save model parameters
    df_params.loc[state, ['Alpha', 'Beta', 'Phi']] = list(map(fmodel.model.params.get, keys))
    
    # Save forecast (cumulated number of fatalities at the end of the [90]-day period)
    df_params.loc[state, 'Forecast per Million'] = fmodel.forecast(fperiod).iloc[-1]

In [None]:
# Convert data to numeric values
df_params = df_params.apply(pd.to_numeric, errors='ignore')

In [None]:
# Training set (list of countries with interpretable results)
mask = (
    (pd.isnull(df_params['Phi'])) | # no solution found
    ((df_params['Beta']==0) & (df_params['Phi']==1)) | # no damping (fatalities increase indefinitely)
    (df_params['Phi']==0) # dummy forecast (fatality counts remain constant)
)
        
idx_data = df_params.index[~mask]

In [None]:
# Parameters
df_plt = df_params.dropna().reset_index()
s, x, y, z = zip(*df_plt[['Location', 'Alpha', 'Beta', 'Phi']].values)

In [None]:
# Plot smoothing level vs smoothing slope
plt.figure(figsize=(10,10))
plt.scatter(x=x, y=y, marker='.')

for i, txt in enumerate(s):
    if not ((x[i] in [0,1]) or (y[i] in [0,1]) or (abs(x[i]-y[i])<1e-2)): # annotate non-naive model parameters
        plt.text(x[i]-.02, y[i]+.02, i, rotation=0, rotation_mode='anchor', fontsize=8)

plt.title(r'Model parameters: x-axis$=\alpha$, y-axis$=\beta$')
plt.xlabel(r'Smoothing level ($\alpha$)')
plt.ylabel(r'Smoothing slope ($\beta$)')
plt.tight_layout()
plt.show()

**Note:** For any $\alpha$ between 0 and 1, the weights attached to the observations decrease exponentially as we go back in time, hence the name `exponential smoothing`. If $\alpha$ is small (i.e., close to 0), more weight is given to observations from the more distant past. If $\alpha$ is large (i.e., close to 1), more weight is given to the more recent observations. For the extreme case where $\alpha$=1, the forecasts are equal to the naïve forecasts.

A very small value of $\beta$ means that the slope hardly changes over time.


**Source:** https://otexts.com/fpp2/expsmooth.html
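
To make the note on $\alpha$ concrete: in simple exponential smoothing, the observation $j$ steps back receives the weight $\alpha(1-\alpha)^j$. The short sketch below (the helper name `smoothing_weights` is ours) prints these weights for a small and a large $\alpha$.

```python
import numpy as np

def smoothing_weights(alpha, n_obs):
    """Weights alpha * (1 - alpha)**j attached to the observation j steps back."""
    lags = np.arange(n_obs)
    return alpha * (1 - alpha) ** lags

for alpha in (0.2, 0.8):
    print(alpha, np.round(smoothing_weights(alpha, 5), 4))
```

With $\alpha=0.8$ almost all of the weight sits on the latest observations, while with $\alpha=0.2$ it is spread further into the past.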

In [None]:
# Plot smoothing slope vs damping factor
plt.figure(figsize=(10,10))
sc = plt.scatter(x=y, y=z, c=x, vmin=0, vmax=1, marker='.', cmap='Blues')
plt.colorbar(sc)

for i, txt in enumerate(s):
    if y[i]*(1-z[i])>1e-2: # annotate 'nicest' model parameters (i.e. s-shaped forecasts)
        plt.text(y[i]-.02, z[i]+.01, i, rotation=0, rotation_mode='anchor', fontsize=8)

t = np.arange(.01, 1., .01)
plt.plot(t, 1-1e-2/t, 'r--')
plt.ylim((0,1))
plt.xlim((0,1))        
        
plt.title(r'Model parameters: x-axis$=\beta$, y-axis$=\phi$ and $color=\alpha$')
plt.xlabel(r'$\beta$')
plt.ylabel(r'$\phi$')
plt.tight_layout()
plt.show()

**Note:** We had to set the damping factor $\phi$ to 0.8 manually in order to ensure that all forecasts are actually feasible, i.e. no locations where the predicted number of fatalities exceeds 2% of the entire population (see below).

##### Locations where the number of fatalities remains flat

In [None]:
df_plt.loc[(df_plt['Beta']==0)&(df_plt['Phi']==0)]

##### Locations where the number of fatalities increases indefinitely

In [None]:
df_plt.loc[(df_plt['Beta']==0)&(df_plt['Phi']==1)]

In some cases, the number of fatalities is predicted to increase indefinitely. This typically happens when $\beta$ (the slope-smoothing factor) is equal to 0 and $\phi$ (the slope-damping factor) is equal to 1.<br>
To improve our forecasts, we need to learn parameters $\alpha$, $\beta$ and $\phi$ from country-specific features describing how countries are exposed to the virus and how their respective populations may be infected.<br>

Given the very limited amount of data at our disposal, we will use very simple models, such as a linear regression or even a mere kNN interpolation.
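
The role of $\phi$ in the flat and indefinitely increasing cases above can be illustrated with the long-run limit of a multiplicative damped trend: for $\phi<1$, the forecast $\ell \cdot b^{\phi+\phi^2+\dots+\phi^h}$ converges to $\ell \cdot b^{\phi/(1-\phi)}$. The function below is a sketch with a hypothetical name and made-up inputs, where `level` and `trend` stand for the last smoothed level and growth factor.

```python
import math

def long_run_forecast(level, trend, phi):
    """Limit of level * trend**(phi + phi**2 + ... + phi**h) as h grows."""
    if phi >= 1:
        # No damping: the forecast grows without bound when trend > 1
        if trend > 1:
            return math.inf
        return level if trend == 1 else 0.0
    return level * trend ** (phi / (1 - phi))

# Made-up state: level of 8 fatalities per million, doubling every day
for phi in (0, 0.5, 0.8, 1):
    print(phi, long_run_forecast(8, 2, phi))
```

A value of $\phi=0$ freezes the forecast at the last level, and the plateau rises quickly as $\phi$ approaches 1, which is why the estimated damping factor matters so much for the forecasts.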

##### Infeasible solutions (locations where the predicted number of fatalities exceeds 1% of the entire population...)

In [None]:
df_plt.loc[df_plt['Forecast per Million']>.01*1e6]

#####  Non-naive model parameters

In [None]:
df_plt.loc[[not ((a in [0,1]) or 
                 (b in [0,1]) or 
                 (abs(a-b)<1e-2)) for (a,b) in df_plt[['Alpha','Beta']].values]]

### Forecast
We show below the predicted number of fatalities by the end of the forecast period.

In [None]:
df_plt = pd.merge(df_plt, df_pop, how='left', on='Location')
df_plt['Forecast'] = df_plt['Forecast per Million'] * df_plt['Population'] / 1000

In [None]:
df_plt.sort_values(by='Forecast').iloc[100:].plot(
    x='Location', y='Forecast', kind='barh', fontsize=18, legend=False)
plt.gcf().set_size_inches(30, 50)
plt.gca().set_xscale('log')
plt.grid(color='grey', linestyle='--', linewidth=.5)
plt.ylabel(None)
txt = 'Predicted number of fatalities within the next {} days (logarithmic scale)'
plt.title(txt.format(fperiod), fontsize=24)
plt.show()

In [None]:
df_plt.sort_values(by='Forecast', ascending=False).iloc[:10,:].plot(
    x='Location', y='Forecast', kind='bar', legend=False)
txt = 'Locations with the highest predicted number of fatalities within the next {} days'
plt.title(txt.format(fperiod))
plt.gcf().set_size_inches(12.5, 7.5)
plt.xticks(rotation=45)
plt.xlabel(None)
plt.show()

**Note:** It is essential to note that these trends are estimated before confinement and social distancing measures have taken full effect in some countries. We recall that the effect of confinement is controlled in our model by the damping factor $\phi$.

## Additional features

### Pairwise distances between countries
Some countries have not yet recorded any fatalities due to the virus and for our method to produce meaningful results, we need to predict roughly when they will record their first fatality. To do so, we will use pairwise distances between countries and number of days since first confirmed case.

In [None]:
# Position of each location on the Earth (latitude and longitude, converted to radians below)
df_r = pd.read_csv('../input/my-covid19-dataset/latlong.csv').set_index('Location', verify_integrity=True).loc[:,['Lat', 'Long']]

In [None]:
# Pairwise distances between countries (in km, 6371 is the Earth's radius in km)
R = 6371
hs = DistanceMetric.get_metric('haversine')
df_dist = pd.DataFrame(data=hs.pairwise(np.radians(df_r)) * R, index=df_r.index, columns=df_r.index)

# standardisation
df_dist /= df_dist.values.max()
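
For reference, the haversine (great-circle) distance can be written out for a single pair of points. The NumPy sketch below is independent of scikit-learn; the function name and the test points are chosen for illustration only.

```python
import numpy as np

EARTH_RADIUS_KM = 6371

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# A quarter of the equator: from (0, 0) to (0, 90 degrees East)
quarter = haversine_km(0, 0, 0, 90)
print(round(quarter, 1))
```

The result should be close to a quarter of the Earth's circumference, i.e. $\pi R / 2$.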

### World Development Indicators
**Sources:**
* The World Bank: https://databank.worldbank.org/source/world-development-indicators#
* OECD: https://stats.oecd.org/Index.aspx?ThemeTreeId=9

**Note:** Values are age-standardized

In [None]:
df_wdi = (pd.read_csv('../input/my-covid19-dataset/world-bank/world-development-indicators.csv')
          .dropna(subset=['Country Code','Series Name'], how='all')
          .set_index(['Country Code', 'Series Name'], verify_integrity=True)
          .drop(columns=['Country Name', 'Series Code'])
          .replace({'..': np.nan})
          .dropna(how='all')
          .astype(float))

In [None]:
# Take the latest data available (2018 figures in most cases)
def last_available_data(row):
    res = [d for d in row if not np.isnan(d)]
    return float(res[-1])

df_wdi = (df_wdi
          .apply(lambda row: last_available_data(row), axis=1)
          .reset_index()
          .pivot(index='Country Code', columns='Series Name', values=0))
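
As a possible alternative to the row-wise helper above, the last available value of each row can also be obtained by forward-filling along the columns and keeping the last one. The sketch below uses made-up values and hypothetical country codes; it is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator values by year (NaN = not reported)
toy = pd.DataFrame(
    {'2016': [1.0, 4.0, np.nan],
     '2017': [2.0, np.nan, np.nan],
     '2018': [np.nan, np.nan, 7.0]},
    index=['AAA', 'BBB', 'CCC'])

# Forward-fill along each row, then keep the last column
latest = toy.ffill(axis=1).iloc[:, -1]
print(latest)
```

A row without any value would simply yield `NaN` here, whereas the list-based helper assumes that at least one value is present.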

In [None]:
# Fill in missing values with world values if available, or global medians otherwise
wdi_default = df_wdi.fillna(df_wdi.median()).loc['WLD',:]
df_wdi = df_wdi.fillna(wdi_default).reset_index()

In [None]:
# We use Singapore as a proxy for Taiwan for macroeconomic and demographic data
# (Note: fatality figures are not available for Hong Kong and Macao on a stand-alone basis)
new_row = df_wdi.loc[df_wdi['Country Code']=='SGP'].copy(deep=True)
new_row.loc[:,'Country Code'] = 'TWN'
df_wdi = pd.concat([df_wdi, new_row], ignore_index=True)

### Risk factors and preexisting health conditions
**Sources:**<br>
- NCD RisC: http://www.ncdrisc.org/data-downloads.html<br>
- Global Health Data Exchange: http://ghdx.healthdata.org/gbd-results-tool
 - Global Burden of Disease Collaborative Network.
 - Global Burden of Disease Study 2017 (GBD 2017) Results.
 - Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2018.

In [None]:
# Risk factors: obesity (2016 figures, averaged between women and men)
df_ncd1 = (pd.read_csv('../input/my-covid19-dataset/ncd-risc/obesity/NCD_RisC_Lancet_2017_BMI_age_standardised_country.csv')
              .loc[:,['ISO', 'Sex', 'Prevalence of BMI>=30 kg/m2 (obesity)']]
              .rename(columns={'ISO': 'Country Code'})
              .groupby('Country Code')
              .mean()
              .reset_index())

In [None]:
# Risk factors: blood pressure (2015 figures, averaged between women and men)
df_ncd2 = (pd.read_csv('../input/my-covid19-dataset/ncd-risc/blood-pressure/NCD_RisC_Lancet_2016_BP_age_standardised_countries.csv')
               .loc[:,['ISO', 'Sex', 'Prevalence of raised blood pressure']]
               .rename(columns={'ISO': 'Country Code'})
               .groupby('Country Code')
               .mean()
               .reset_index())

In [None]:
# Preexisting health conditions: cancer prevalence, cardiovascular diseases, chronic respiratory conditions, 
# diabetes and kidney diseases (2017 figures, age-standardized)
df_ihme = pd.read_csv('../input/my-covid19-dataset/ihme-gdb/IHME-GBD_2017_DATA-8e93cebf-1.csv')
mask = [cause_name in ['Neoplasms','Cardiovascular diseases',
                       'Chronic respiratory diseases','Diabetes and kidney diseases'] 
        for cause_name in df_ihme['cause_name']]

df_ihme = (df_ihme
           .loc[mask, ['location_name','cause_name','val']]
           .pivot(index='location_name', columns='cause_name', values='val')
           # causes only become columns after the pivot, so the renaming has to happen here
           .rename(columns={'Neoplasms': 'Cancer prevalence'})
           .reset_index()
           .rename(columns={'location_name': 'Location'})
           .merge(df_codes, how='left', left_on='Location', right_on='Country_Region')
           .drop(columns=['Location','Country_Region']))

## Full dataset

In [None]:
# Additional country-specific demographic features are merged together
lst_df = [df_wdi, df_ncd1, df_ncd2, df_ihme]
df_feat = reduce(lambda df_left, df_right: pd.merge(df_left, df_right, how='left', on='Country Code'), lst_df)
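
The `reduce` call chains left joins from the first table to the last. The toy example below, with made-up tables and hypothetical country codes, shows what this implies for keys missing from some of the tables.

```python
from functools import reduce
import pandas as pd

# Three hypothetical feature tables sharing a 'Country Code' key
a = pd.DataFrame({'Country Code': ['AAA', 'BBB'], 'f1': [1.0, 2.0]})
b = pd.DataFrame({'Country Code': ['AAA'], 'f2': [10.0]})
c = pd.DataFrame({'Country Code': ['BBB', 'CCC'], 'f3': [5.0, 6.0]})

merged = reduce(lambda left, right: pd.merge(left, right, how='left', on='Country Code'),
                [a, b, c])
print(merged)
```

Only the keys of the first table survive: codes that appear solely in later tables are dropped, and features missing for a kept code show up as `NaN`.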

In [None]:
# Last, we add (location-specific, standardized) case fatality rates
df_feat = (df_cfr.reset_index()
          .merge(df_feat, how='left', on='Country Code').drop(columns=['Country Code']))

In [None]:
# Features are standardized
df_feat = df_feat.set_index('Location', verify_integrity=True).apply(lambda x: (x-x.min()) / (x.max()-x.min()))

# We replace missing values for obesity, diabetes, pressure and testing by medians
# (this is certainly too simplistic)
df_feat.fillna(df_feat.median(), inplace=True)

In [None]:
# Full set of features
#df_feat.loc[df_feat.isnull().any(axis=1)]
df_feat.describe().T

In [None]:
# Model parameters
#df_params.loc[df_params.isnull().any(axis=1)]
df_params.describe()

## Other possible features (food for thought...)

Other ex-ante (i.e. before the outbreak) features that could be tested to explain the level $\alpha$ and the slope $\beta$:<br>
* Share of the population with respiratory allergies (pollen)<br>

Other ex-post (i.e. after the outbreak) features that could be tested to explain the damping factor $\phi$:<br>
* Day count since beginning of quarantine<br>
* Proportion of workers who are able to work remotely (easier self-quarantine when confinement becomes necessary)<br>

More speculative indicators:
* 'Intensity' of quarantine (how restrictive is the quarantine)<br>
* Median age of confirmed cases of COVID-19 (this may explain why Germany is doing much better than France for instance)<br>
* Number of fatalities from previous coronavirus epidemics (some countries were particularly well prepared as they learned their lesson from previous outbreaks)<br>
* Proportion of elderly people in retirement homes (where the virus can spread very quickly once it is inside)<br>
* Number of gatherings of more than \[1,000\] people since \[January\] (it is believed that the virus spread so well in Italy and Spain (resp. in France and South Korea) due to sports events (resp. religious gatherings))<br>
* Education rate (better educated people may better understand self-quarantine and confinement instructions and be more inclined to follow them)<br>
* Political regime (strong political regimes may impose more stringent confinement and surveillance measures)<br>
* Religiosity and frequency of religious gatherings, especially among elderly people (Italy, Spain, Iran)<br>

## Fatalities by country over a 90-day period

### Interpolation
We use the parameters $\alpha$, $\beta$ and $\phi$ of locations where exponential smoothing produced meaningful results. In the absence of state-specific demographic figures, we use median values to estimate country-wide parameters.

In [None]:
# Data for locations with meaningful model parameters 
X_data = (df_params
          .loc[idx_data, ['Alpha', 'Beta', 'Phi']]
          .merge(df_feat, how='left', left_index=True, right_index=True))

# Parameters to predict
idx_pred = [loc for loc in df_params.index if loc not in idx_data]

X_pred = (df_params
          .loc[idx_pred, ['Alpha', 'Beta', 'Phi']]
          .merge(df_feat, how='left', left_index=True, right_index=True))

# Full dataset
X = X_data.iloc[:, 3:].values # standardised features
y = X_data.iloc[:, :3].values # model parameters

In [None]:
# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=99)

#### Nearest Neighbors Regression

In [None]:
# Cross validation to guess k=n_neighbors (one output at a time)
lbl = [r'$\alpha$', r'$\beta$']
c = ['green', 'blue']
rng_k = range(1,15)

for i in range(2):
    scores = []
    
    for k in rng_k:
        rgr = KNeighborsRegressor(n_neighbors=k, weights='distance')
        rgr.fit(X_train, y_train[:,i])
        scores.append(rgr.score(X_test, y_test[:,i]))
    
    plt.plot(rng_k, scores, label=lbl[i], color=c[i], linestyle='dashed', marker='.', markerfacecolor='grey')

plt.title('Coefficient of determination of the prediction')
plt.xlabel('Number of neighbors')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Cross validation to guess k=n_neighbors (multi-output) and metric ('minkowski' with p=3)
rng_k = range(1,15)
lst_weights = ['distance', 'uniform']
lst_metrics = ['minkowski', 'chebyshev']
rng_p = range(1,4)

def rgr_plot(w, m, p):
    scores = []
    
    for k in rng_k:
        rgr = KNeighborsRegressor(n_neighbors=k, weights=w, metric=m, p=p)
        rgr.fit(X_train, y_train)
        y_pred = rgr.predict(X_test)
        score = r2_score(y_test, y_pred, multioutput='uniform_average')
        scores.append(score)
    
    label = w + ' - ' + m + ' - ' + str(p)
    plt.plot(rng_k, scores, linestyle='dashed', marker='.', label=label)

In [None]:
for w in lst_weights:       
    for m in lst_metrics:
        if m=='minkowski':
            for p in rng_p:
                rgr_plot(w, m, p)
        elif m=='chebyshev':
            rgr_plot(w, m, p=-1)

plt.title('Coefficient of determination of the prediction')
plt.xlabel('Number of neighbors')
plt.legend()
plt.show()

In [None]:
# Prediction
n_neighbors = 10
knn = KNeighborsRegressor(n_neighbors=n_neighbors, weights='uniform', metric='minkowski', p=2)
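
The curves above score each candidate on a single train/test split, which can be noisy when only a few locations are available. As a hedged alternative, the sketch below runs a k-fold grid search over the same hyperparameters with scikit-learn's `GridSearchCV`. The data (`X_demo`, `y_demo`) is synthetic and purely illustrative; it stands in for the notebook's standardised features and smoothing parameters and is not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the standardised features and the three
# smoothing parameters (assumption: the real notebook data is not used here)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = np.column_stack([
    X_demo[:, 0] + 0.1 * rng.normal(size=60),
    X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=60),
    0.9 + 0.05 * np.tanh(X_demo[:, 2]),
])

# Same search space as the plots above (Minkowski metric only)
param_grid = {
    'n_neighbors': list(range(1, 15)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2, 3],  # Minkowski exponent
}

search = GridSearchCV(
    KNeighborsRegressor(metric='minkowski'),
    param_grid,
    scoring='r2',  # R2 averaged uniformly over the three outputs
    cv=KFold(n_splits=5, shuffle=True, random_state=99),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Mean cross-validated R2: {:.2f}'.format(search.best_score_))
```

Averaging the score over five folds makes the choice of `n_neighbors` less dependent on which locations happen to fall in the test set.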

#### Radius Neighbors Regression

In [None]:
# Cross validation to guess best radius
rng_r = [t*.25 for t in range(1,7)]

def rgr_plot(w, m, p):
    scores = []
    
    for r in rng_r:
        rgr = RadiusNeighborsRegressor(radius=r, weights=w, metric=m, p=p)
        rgr.fit(X_train, y_train)
        
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            y_pred = rgr.predict(X_test)

        if np.any(np.isnan(y_pred)):
            score = np.nan
        else:
            score = r2_score(y_test, y_pred, multioutput='uniform_average')
        scores.append(score)
    
    label = w + ' - ' + m + ' - ' + str(p)
    plt.plot(rng_r, scores, linestyle='dashed', marker='.', label=label)

In [None]:
for w in lst_weights:       
    for m in lst_metrics:
        if m=='minkowski':
            for p in rng_p:
                rgr_plot(w, m, p)
        elif m=='chebyshev':
            rgr_plot(w, m, p=-1)

plt.title('Coefficient of determination of the prediction')
plt.xlabel('Radius')
plt.legend()
plt.show()

In [None]:
# Prediction
radius = 1.0
rnn = RadiusNeighborsRegressor(radius=radius, weights='uniform', metric='minkowski', p=2)

#### Linear Regression

In [None]:
lnr = LinearRegression(copy_X=True, fit_intercept=False)

In [None]:
lnr.fit(X_train,y_train)
y_pred = lnr.predict(X_test)
score = r2_score(y_test, y_pred, multioutput='uniform_average')

print('Coefficient of determination of the linear regression: {:.2%}'.format(score))

**Note:** The score is extremely disappointing, which may be due to high bias. Let's try regularisation (ridge regression) and higher-degree (polynomial) regression.
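
One way to check the high-bias hypothesis is to compare training and test scores: a low $R^{2}$ on both usually indicates underfitting, whereas a high training score with a low test score points to variance instead. The sketch below illustrates this on synthetic quadratic data using NumPy only; the data, the helper `r2` and the interleaved split are assumptions made for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np

def r2(y_true, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic, purely illustrative data: a quadratic relationship
x = np.linspace(-1, 1, 100)
y = x ** 2
train, test = np.arange(0, 100, 2), np.arange(1, 100, 2)  # interleaved split

scores = {}
for degree in (1, 2):
    fit = np.poly1d(np.polyfit(x[train], y[train], deg=degree))
    scores[degree] = (r2(y[train], fit(x[train])), r2(y[test], fit(x[test])))
    print('degree {}: train R2 = {:.2f}, test R2 = {:.2f}'.format(degree, *scores[degree]))
```

Here the straight line scores poorly on both sets, while the degree-2 fit recovers the relationship, which is the pattern one would expect if a linear model is too simple for the data.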

#### Ridge Regression

In [None]:
rdg = RidgeCV(alphas=[10**n for n in range(-4,4)])

In [None]:
rdg.fit(X_train,y_train)
y_pred = rdg.predict(X_test)
score = r2_score(y_test, y_pred, multioutput='uniform_average')

print('Coefficient of determination of the ridge regression: {:.2%}'.format(score))

#### Polynomial Regression

##### Cross-Validation

In [None]:
# Cross validation to guess d=degree of the polynomial
scores = []
rng_d = range(1,5)

for d in rng_d:
    pln = Pipeline([('poly', PolynomialFeatures(degree=d)), 
                    ('linear', LinearRegression(fit_intercept=True))]) 
    pln.fit(X_train, y_train)
    y_pred = pln.predict(X_test)
    score = r2_score(y_test, y_pred, multioutput='uniform_average')
    scores.append(score)

plt.plot(rng_d, scores, color='red', linestyle='dashed', marker='.', markerfacecolor='grey')
plt.title('Coefficient of determination of the prediction (Linear)')
plt.xlabel('Degree')
plt.show()

In [None]:
# Cross validation to guess d=degree of the polynomial (with regularisation)
scores = []
rng_d = range(1,6)

for d in rng_d:
    pln = Pipeline([('poly', PolynomialFeatures(degree=d)), 
                    ('ridge', Ridge(alpha=1e1, copy_X=True, fit_intercept=True))]) 
    pln.fit(X_train, y_train)
    y_pred = pln.predict(X_test)
    score = r2_score(y_test, y_pred, multioutput='uniform_average')
    scores.append(score)

plt.plot(rng_d, scores, color='red', linestyle='dashed', marker='.', markerfacecolor='grey')
plt.title('Coefficient of determination of the prediction (Ridge)')
plt.xlabel('Degree')
plt.show()

##### Fitted model

In [None]:
pln = Pipeline([('poly', PolynomialFeatures(degree=2)), 
                ('ridge', Ridge(alpha=1e1, copy_X=True, fit_intercept=False))])

In [None]:
pln.fit(X_train,y_train)
y_pred = pln.predict(X_test)
score = r2_score(y_test, y_pred, multioutput='uniform_average')

print('Coefficient of determination of the polynomial regression: {:.2%}'.format(score))

### Update of model parameters

In [None]:
rgr = knn # KNN Regressor
#rgr = rnn # Radius Neighbors Regressor
#rgr = lnr # Linear Regressor
#rgr = rdg # Ridge
#rgr = pln # Polynomial Regression
#rgr = lgr # Logistic Regression

In [None]:
# Fit with all data available
rgr.fit(X, y)

# Prediction based on selected regressor
y_pred = rgr.predict(X_pred.iloc[:, 3:].values)
X_pred.loc[:,['Alpha', 'Beta','Phi']] = y_pred

In [None]:
# Update model parameters
df_RGR = (pd.concat([X_data, X_pred]).loc[:,['Alpha','Beta','Phi']].reset_index()
          .merge(df_params.reset_index(), how='left', on='Location', suffixes=('_RGR',''))
          .drop_duplicates().set_index('Location', verify_integrity=True))

df_RGR.loc[idx_pred, ['Alpha','Beta','Phi']] = (df_RGR.loc[idx_pred, ['Alpha_RGR','Beta_RGR','Phi_RGR']]
                                                .apply(pd.to_numeric).values)

df_params = df_RGR.drop(columns=['Alpha_RGR','Beta_RGR','Phi_RGR'])

### Forecast
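
For reference, the multiplicative damped trend recursion behind the forecasts in this section can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the statsmodels implementation: the function name `damped_mul_forecast` and the initial state (first observation as level, first ratio as trend) are hypothetical choices, and statsmodels may initialise and correct the fit differently (e.g. with `remove_bias=True`).

```python
def damped_mul_forecast(y, alpha, beta, phi, horizon):
    """Damped multiplicative trend forecast with fixed smoothing parameters."""
    # Simple initial state (an assumption made for this sketch)
    level, trend = y[0], y[1] / y[0]

    # Update level and trend with each new observation
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * prev_level * trend ** phi
        trend = beta * level / prev_level + (1 - beta) * trend ** phi

    # h-step forecast: level * trend ** (phi + phi^2 + ... + phi^h)
    forecasts, exponent = [], 0.0
    for h in range(1, horizon + 1):
        exponent += phi ** h
        forecasts.append(level * trend ** exponent)
    return forecasts

# Hypothetical series doubling every day
history = [1, 2, 4, 8, 16]
print(damped_mul_forecast(history, alpha=1, beta=1, phi=1, horizon=3))    # no damping
print(damped_mul_forecast(history, alpha=1, beta=1, phi=0.5, horizon=3))  # damped
```

With $\phi=1$ the trend is extrapolated indefinitely, whereas with $\phi<1$ the exponent converges to $\phi/(1-\phi)$ and the forecast flattens out; with $\phi=0$ the forecast is constant.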

In [None]:
# Interpolation using customized parameters
fperiod = 90 # forecasting period in days
df = (df_all
      .loc[df_all['Day Count Fatalities']>0]
      .set_index(['Location','Day Count Fatalities'], verify_integrity=True)
      .sort_index(level=[0, 1], ascending=[1, 1]))

##### For a specific location

In [None]:
# Pick a state at random
state = np.random.choice(idx_pred, 1)[0]
print(state)

In [None]:
# Historical curve
mask = (df.index.get_level_values(0)==state) & (df.index.get_level_values(1)>0)
fatalities = df.loc[mask, 'Fatalities'].reset_index(drop=True)

# Exponential smoothing parameters
alpha, beta, phi = df_params.loc[idx[state], ['Alpha', 'Beta', 'Phi']].values

if len(fatalities)>2:
    
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        fmodel = (ExponentialSmoothing(fatalities, trend='mul', damped=True, seasonal=None)
                  .fit(smoothing_level=alpha, smoothing_slope=beta, damping_slope=phi, 
                       remove_bias=True, use_basinhopping=True))
        fcast = fmodel.forecast(fperiod)

In [None]:
# Visualisation 
if len(fatalities)>2:
    
    plt.figure(figsize=(10,5))
    fcast.plot(style='--', marker='', color='green', legend=True, label='Forecast')
    fmodel.fittedvalues.plot(style='--', marker='', color='blue', legend=True, label='Smoothed')
    fatalities.plot(linestyle='', marker='.', color='red', legend=True, label='Actual values')

    keys = ['smoothing_level', 'smoothing_slope', 'damping_slope']
    alpha, beta, phi = list(map(fmodel.model.params.get, keys))
    txt_params = ('Exponential smoothing with parameters:\n\n\t' + r'$\alpha=${}'.format(alpha) + '\n\t' + 
                  r'$\beta=${}'.format(beta) + '\n\t' + r'$\phi=${}'.format(phi))

    plt.gcf().text(1, 0.5, txt_params)
    plt.title(r'Fatalities in {}'.format(state) + '\n' + '(multiplicative damped trend)')
    plt.xlabel('Day Count Fatalities')
    plt.show()

##### For all locations

In [None]:
# Loop through all locations
lst_states = df.index.get_level_values(0).unique()
lst_params = df_params.index.unique()

for state in lst_states:

    # Historical curve
    mask = (df.index.get_level_values(0)==state) & (df.index.get_level_values(1)>0)
    fatalities = df.loc[mask, 'Fatalities'].reset_index(drop=True)

    # Exponential smoothing parameters 
    alpha = beta = phi = 0 # constant forecast by default
    if state in lst_params:
        alpha, beta, phi = df_params.loc[idx[state], ['Alpha', 'Beta', 'Phi']].values

    # At least two data points are needed for exponential smoothing
    if len(fatalities)>1:

        # Deal with a few inconsistencies
        if (np.min(fatalities)==0) or (not fatalities.is_monotonic_increasing):
            print('Inconsistent data identified for {}. Please check.'.format(state))
            fatalities = fatalities.apply(lambda x: max(x,1)) # at least one fatality
            # (must be strictly positive when using multiplicative trend)
        
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            fmodel = (ExponentialSmoothing(fatalities, trend='mul', damped=True, seasonal=None)
                      .fit(smoothing_level=alpha, smoothing_slope=beta, damping_slope=phi, 
                           remove_bias=True, use_basinhopping=True))
            fcast = fmodel.forecast(fperiod)
            
        # Add forecast to the dataset
        date_start = df.loc[idx[state,:], 'Date'].iloc[-1] + dt.timedelta(days=1)
        rng_dt = pd.date_range(start=date_start, periods=fperiod)
        
        arrays = [[state]*len(fcast.index), fcast.index]
        idx_fcast = pd.MultiIndex.from_arrays(arrays, names=('Location', 'Day Count Fatalities'))

        df = pd.concat([df, pd.DataFrame(index=idx_fcast, 
                                         data={'Fatalities': fcast.values, 'Date': rng_dt})], sort=True)

## Predictions

In [None]:
df_train = pd.read_csv('../input/covid19-global-forecasting-week-2/train.csv', index_col=0)
df_test = pd.read_csv('../input/covid19-global-forecasting-week-2/test.csv', index_col=0)
df_submit = pd.read_csv('../input/covid19-global-forecasting-week-2/submission.csv', index_col=0)

In [None]:
df_train['Country_Region'] = df_train['Country_Region'].replace('Taiwan*', 'Taiwan')
df_test['Country_Region'] = df_test['Country_Region'].replace('Taiwan*', 'Taiwan')

In [None]:
df_train['Date'] = pd.to_datetime(df_train['Date'], format='%Y-%m-%d')
df_train['Location'] = (df_train[['Province_State','Country_Region']]
                        .apply(lambda row: location(row['Province_State'], row['Country_Region']), axis=1))

df_test['Date'] = pd.to_datetime(df_test['Date'], format='%Y-%m-%d')
df_test['Location'] = (df_test[['Province_State','Country_Region']]
                       .apply(lambda row: location(row['Province_State'], row['Country_Region']), axis=1))

##### Fatalities

In [None]:
# Fill in the test dataset
df_test = (df_test.reset_index()
           .merge(df_train.reset_index().loc[:,['Location','Date','ConfirmedCases','Fatalities']], 
                  how='left', on=['Location','Date'])
           .merge(df.reset_index().loc[:,['Location','Date','Fatalities']], 
                  how='left', on=['Location','Date'], suffixes=('_Train','_Forecast'))
           .set_index('ForecastId', verify_integrity=True))

In [None]:
# There is some overlap between the train and test timelines
df_test['Fatalities'] = df_test['Fatalities_Train'].fillna(df_test['Fatalities_Forecast'])

df_test.drop(columns=['Fatalities_Train','Fatalities_Forecast'], inplace=True)

##### Locations with no fatalities or with a negligible number of fatalities

In [None]:
# Locations with day counts less than 1
lst_states = df_test.loc[df_test['Fatalities'].isnull(),'Location'].unique()

# Latest data available
dt_latest = df_train['Date'].max()

df_flat = (df_test
          .loc[df_test['Date']==dt_latest]
          .set_index('Location', verify_integrity=True)
          .loc[idx[lst_states]]
          .sort_values('Fatalities', ascending=True))

# Plot 
df_flat.loc[df_flat['Fatalities']>0, 'Fatalities'].plot(figsize=(10,7.5), kind='barh')
plt.title('Fatalities as of {} in locations with day counts less than 1:'.format(dt_latest))
plt.ylabel('')
plt.show()

In [None]:
# Flat forecast
for state in lst_states:
    mask = (df_test['Location']==state) & (df_test['Fatalities'].isnull())
    df_test.loc[mask, ['ConfirmedCases','Fatalities']] = df_flat.loc[idx[state], 
                                                                     ['ConfirmedCases','Fatalities']].values

##### Confirmed Cases
We simply divide forecast fatalities by the latest known case fatality rate to get a proxy for confirmed cases. Ideally, we would use a normative case fatality rate, but this would give misleading results, as the number of confirmed cases is likely underestimated in many countries.
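
As a worked illustration of this proxy, the sketch below uses made-up figures for a single location; the helper name `estimate_confirmed_cases` and all numbers are hypothetical.

```python
def estimate_confirmed_cases(fatalities, case_fatality_rate):
    """Proxy for confirmed cases: fatalities / CFR (0 when the CFR is 0)."""
    if case_fatality_rate == 0:
        return 0
    return round(fatalities / case_fatality_rate)

# Hypothetical latest known data for a location
latest_fatalities, latest_cases = 50, 2000
cfr = latest_fatalities / latest_cases  # 2.5%

# A forecast of 120 fatalities implies 120 / 0.025 confirmed cases
print(estimate_confirmed_cases(120, cfr))
```

Because the case fatality rate is held constant, forecast confirmed cases simply scale with forecast fatalities.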

In [None]:
# Get latest case fatality rates
df_train['Case Fatality Rate'] = (df_train[['Fatalities','ConfirmedCases']]
                                  .apply(lambda x: 0 if x['ConfirmedCases']==0 
                                         else x['Fatalities']/x['ConfirmedCases'], axis=1))

df_train = df_train.set_index(['Location','Date'], verify_integrity=True).sort_index()
mask = ~df_train.index.get_level_values(0).duplicated(keep='last')

df_cfr = df_train.loc[mask,:].reset_index(level=1, drop=True).loc[:,['Case Fatality Rate']]

In [None]:
# Estimate confirmed cases based on fatalities and case fatality rates
mask = df_test['ConfirmedCases'].isnull()
df_test.loc[mask,'ConfirmedCases'] = (df_test
                                      .reset_index()
                                      .merge(df_cfr.reset_index(), how='left', on='Location')
                                      .set_index('ForecastId', verify_integrity=True)
                                      .loc[mask,['Fatalities','Case Fatality Rate']]
                                      .apply(lambda x: 0 if x[1]==0 else x[0]/x[1], axis=1))

In [None]:
# Round float values to the nearest integer
df_test['ConfirmedCases'] = df_test['ConfirmedCases'].round().astype('int')
df_test['Fatalities'] = df_test['Fatalities'].round().astype('int')

##### Submission

In [None]:
# Reset
df_test = df_test.reset_index().loc[:,['ForecastId','ConfirmedCases','Fatalities']]
df_submit = df_submit.reset_index().drop(columns=['ConfirmedCases','Fatalities'])

In [None]:
# Update
df_submit = df_submit.merge(df_test, how='left', on='ForecastId').set_index('ForecastId', verify_integrity=True)

In [None]:
# Submit
df_submit.to_csv('submission.csv', index=True)