We will use the daily spreadsheet from the European Centre for Disease Prevention and Control (ECDC), which lists new cases and deaths per country per day.

In [15]:
!wget -N https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx

--2020-05-24 15:23:52--  https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx
Resolving www.ecdc.europa.eu (www.ecdc.europa.eu)... 13.227.223.83, 13.227.223.78, 13.227.223.89, ...
Connecting to www.ecdc.europa.eu (www.ecdc.europa.eu)|13.227.223.83|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘COVID-19-geographic-disbtribution-worldwide.xlsx’ not modified on server. Omitting download.



Import pandas and NumPy for feature engineering and calculations, and render plots inline.

In [16]:
import pandas as pd
import numpy  as np

%matplotlib inline

We read our dataframe directly from the downloaded Excel file and look at the first 10 rows to check the format. By default pandas parses the string `NA` as a missing value, which blanked out the `geoId` of Namibia (__NA__), so we turn off the default missing-value markers and treat only empty cells as missing.

In [17]:
df = pd.read_excel('COVID-19-geographic-disbtribution-worldwide.xlsx', keep_default_na=False, na_values='')
df.head(10)

,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
0,2020-05-24,24,5,2020,782,11,Afghanistan,AF,AFG,37172386.0,Asia
1,2020-05-23,23,5,2020,540,12,Afghanistan,AF,AFG,37172386.0,Asia
2,2020-05-22,22,5,2020,531,6,Afghanistan,AF,AFG,37172386.0,Asia
3,2020-05-21,21,5,2020,492,9,Afghanistan,AF,AFG,37172386.0,Asia
4,2020-05-20,20,5,2020,581,5,Afghanistan,AF,AFG,37172386.0,Asia
5,2020-05-19,19,5,2020,408,4,Afghanistan,AF,AFG,37172386.0,Asia
6,2020-05-18,18,5,2020,262,1,Afghanistan,AF,AFG,37172386.0,Asia
7,2020-05-17,17,5,2020,0,0,Afghanistan,AF,AFG,37172386.0,Asia
8,2020-05-16,16,5,2020,1063,32,Afghanistan,AF,AFG,37172386.0,Asia
9,2020-05-15,15,5,2020,113,6,Afghanistan,AF,AFG,37172386.0,Asia
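
The effect of the two missing-value options can be seen on a tiny example. This is a minimal sketch that uses `pd.read_csv` on an in-memory string instead of the Excel file (an assumption made to keep it self-contained); `read_csv` accepts the same `keep_default_na` and `na_values` parameters.

```python
import io
import pandas as pd

# Toy data: Namibia's geoId is the literal string "NA"; the second row has an empty cases cell.
csv_text = "geoId,cases\nNA,3\nAF,\n"

# Default parsing: "NA" is one of pandas' built-in missing-value markers.
default = pd.read_csv(io.StringIO(csv_text))

# Same call as in the notebook: only empty cells count as missing.
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

print(default['geoId'].isna().tolist())  # the "NA" row is read as missing
print(strict['geoId'].tolist())          # the "NA" string survives
print(strict['cases'].isna().tolist())   # the empty cell is still missing
```

With the default settings the Namibia row loses its identifier, while the second call keeps it and still marks the empty cell as missing.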


As a last check of our source dataframe, we count the non-missing values per column.

In [18]:
df.count()

dateRep                    19037
day                        19037
month                      19037
year                       19037
cases                      19037
deaths                     19037
countriesAndTerritories    19037
geoId                      19037
countryterritoryCode       18845
popData2018                18781
continentExp               19037
dtype: int64

We pivot to a wide format with one row per date and one column per measure and country.

In [19]:
df_geo = df.pivot(index='dateRep', columns='geoId', values=['cases', 'deaths'])
df_geo

,cases,cases,cases,cases,cases,cases,cases,cases,cases,cases,...,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths
geoId,AD,AE,AF,AG,AI,AL,AM,AO,AR,AT,...,VC,VE,VG,VI,VN,XK,YE,ZA,ZM,ZW
dateRep,,,,,,,,,,,,,,,,,,,,,
2019-12-31,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-01,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-02,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-03,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-04,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-05-20,0.0,873.0,581.0,0.0,0.0,1.0,218.0,2.0,438.0,78.0,...,0.0,0.0,0.0,0.0,0.0,0.0,8.0,26.0,0.0,0.0
2020-05-21,1.0,941.0,492.0,0.0,0.0,15.0,230.0,2.0,474.0,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,27.0,0.0,0.0
2020-05-22,0.0,894.0,531.0,0.0,0.0,5.0,335.0,6.0,648.0,57.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,30.0,0.0,0.0
2020-05-23,0.0,994.0,540.0,0.0,0.0,12.0,322.0,2.0,718.0,29.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,0.0,0.0
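
To make the shape of the pivoted frame concrete, here is a small sketch on made-up numbers (the values are illustrative, not taken from the ECDC file). It shows that passing a list to `values` produces two-level columns, with the measure on the first level and `geoId` on the second.

```python
import pandas as pd

# Long format: one row per (date, country), as in the source spreadsheet.
long_df = pd.DataFrame({
    'dateRep': pd.to_datetime(['2020-05-01', '2020-05-01', '2020-05-02', '2020-05-02']),
    'geoId': ['AF', 'AT', 'AF', 'AT'],
    'cases': [10, 20, 30, 40],
    'deaths': [1, 2, 3, 4],
})

# Wide format: one row per date, one column per (measure, country).
wide = long_df.pivot(index='dateRep', columns='geoId', values=['cases', 'deaths'])
print(wide)

# A single series can be selected by measure and then by country.
at_cases = wide['cases']['AT']
print(at_cases.tolist())
```

Note that `pivot` raises a `ValueError` if the same date and `geoId` combination occurs more than once, so it also acts as a check for duplicate rows.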


For predictions later on we need extra rows in our dataframe. One way to get them is to reindex with a larger date range, so we extend the current range by one year (365 days) and check the latest date.

In [20]:
new_index = pd.date_range(df_geo.index.min(), df_geo.index.max() + pd.Timedelta('365 days'))
df_geo = df_geo.reindex(new_index)
df_geo

,cases,cases,cases,cases,cases,cases,cases,cases,cases,cases,...,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths
geoId,AD,AE,AF,AG,AI,AL,AM,AO,AR,AT,...,VC,VE,VG,VI,VN,XK,YE,ZA,ZM,ZW
2019-12-31,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-01,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-02,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-03,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-04,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-05-20,,,,,,,,,,,...,,,,,,,,,,
2021-05-21,,,,,,,,,,,...,,,,,,,,,,
2021-05-22,,,,,,,,,,,...,,,,,,,,,,
2021-05-23,,,,,,,,,,,...,,,,,,,,,,
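
The reindexing step can be illustrated on a short series. The sketch below uses made-up values and a 5-day extension (both assumptions chosen for brevity) to show that existing rows are kept and the new dates are filled with NaN.

```python
import pandas as pd

# Three observed days.
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range('2020-05-01', periods=3))

# Extend the index past the last observation, as done for df_geo above.
longer = pd.date_range(ts.index.min(), ts.index.max() + pd.Timedelta('5 days'))
extended = ts.reindex(longer)

print(extended)
```

The empty rows act as placeholders that a fitted model can later fill with predictions.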


Most algorithms take numerical data as inputs for a model, so we add a column representing the date as days since the earliest date in the dataframe.

In [21]:
df_geo['daynum'] = (df_geo.index - df_geo.index.min()).days
df_geo['daynum'].describe()

count    511.000000
mean     255.000000
std      147.657261
min        0.000000
25%      127.500000
50%      255.000000
75%      382.500000
max      510.000000
Name: daynum, dtype: float64
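
As a small illustration of the day numbering, the sketch below builds the same kind of column on a five-day index and converts a day number back to a calendar date. The value `92` is only an example input; reading fitted parameters such as `mu` this way assumes they are expressed in the same day units.

```python
import pandas as pd

idx = pd.date_range('2019-12-31', '2020-01-04')

# Days elapsed since the first date in the index.
daynum = (idx - idx.min()).days
print(list(daynum))

# Going the other way: turn a day number back into a date.
as_date = idx.min() + pd.to_timedelta(92, unit='D')
print(as_date.date())
```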

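Plotting every country can open more figures than Matplotlib's default limit, so we switch off the warning about too many open figures.

In [22]: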
import matplotlib as mpl
mpl.rc('figure', max_open_warning = 0)

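For each country we smooth the daily cases with a 7-day rolling mean and keep the days on which the smoothed value exceeds one case per million inhabitants. If more than 10 such days remain, we search for the `progress` value (the fraction of the eventual total already reported, bounded between 0.1 and 2) that makes `-log(-log(scaled cumulative cases))` closest to a straight line in `daynum`. The slope and intercept of that line give `beta` and `mu`, and the fitted curve yields the predicted daily cases together with the `start`, `peak` and `floor` dates.

In [27]: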
countries = np.sort(df['geoId'].unique())
#countries = ['JP', 'RU', 'US', 'BR', 'AT', 'CH', 'DE', 'IT', 'ES', 'PT', 'FR', 'SE', 'NO', 'DK', 'BE', 'NL', 'NZ']

showplots = False

df_out = pd.DataFrame({
    'cname': np.nan,
    'iso3': np.nan,
    'ccont': np.nan,
    'popdata': np.nan,
    'res': np.nan,
    'progress': np.nan,
    'final': np.nan,
    'start': np.nan,
    'peak': np.nan,
    'floor': np.nan,
    'beta': np.nan,
    'mu': np.nan},
    index=countries)

measure  = 'cases'
pmeasure = 'pcases'
smeasure = 'scases'

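def fitres(progress):
    # Residual of a straight-line fit to the linearized cumulative curve
    # for a trial value of progress (passed by the optimizer as a 1-element array).
    # Reads df_pred and numcases from the cell and stores the latest fit in
    # the global `fit` so the loop below can reuse its coefficients.
    global df_pred, fit
    
    df_pred['scaled'] = df_pred['cumul'] / numcases * progress[0]

    if len(df_pred) > 10:
        # Double-log transform; only defined for scaled values below 1.
        df_pred['linear'] = - np.log(- np.log(df_pred[df_pred['scaled'] < 1]['scaled']))
        fit = np.polyfit(x=df_pred['daynum'], y=df_pred['linear'], deg=1, full=True)
        # fit[1] holds the sum of squared residuals of the linear fit.
        return fit[1][0]
    else:
        return np.nan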

from scipy.optimize import minimize

for country in countries:
    df_geo[(smeasure, country)] = df_geo[measure][country].rolling(7).mean()
    df_pred = pd.DataFrame(
        {'daynum':df_geo['daynum'], measure:df_geo[smeasure][country]})
    
    cname   = df[df['geoId'] == country]['countriesAndTerritories'].iloc[0]
    iso3    = df[df['geoId'] == country]['countryterritoryCode'].iloc[0]
    ccont   = df[df['geoId'] == country]['continentExp'].iloc[0]
    popdata = df[df['geoId'] == country]['popData2018'].iloc[0]

    mincases = popdata / 1e6
    numcases = df_pred[measure].sum()
    
    # Copy the filtered rows so that adding columns below does not act on a view.
    df_pred = df_pred[df_pred[measure] > mincases].copy()
    
    if len(df_pred) > 10:
        df_pred['cumul'] = df_pred[measure].cumsum()
        
        optim = minimize(fitres, [0.8], method='SLSQP', bounds=[(0.1, 2)])
        progress = optim.x[0]
        
        df_geo[(pmeasure, country)] = np.exp(- np.exp(- np.polyval(
            fit[0], df_geo['daynum']))) * numcases / progress
        df_geo[(pmeasure, country)] = df_geo[(pmeasure, country)] - df_geo[(pmeasure, country)].shift()
        
        slope = fit[0][0]
        intercept = fit[0][1]
        beta = 1 / slope
        mu = - intercept * beta
        
        peak = df_geo[(df_geo[(pmeasure, country)] > df_geo[(pmeasure, country)].shift(-1))].index.min()
        floor = df_geo[(df_geo[(pmeasure, country)] < (popdata / 1e6)) & (
            df_geo[(pmeasure, country)].index > peak)].index.min()
        start = df_geo[(df_geo[(pmeasure, country)] > (popdata / 1e6)) & (
            df_geo[(pmeasure, country)].index < peak)].index.min()
        final = df_geo[pmeasure][country].cumsum().max()
        
        df_out.loc[country] = [cname, iso3, ccont, popdata, optim.fun, progress, final, start.date(), peak.date(), floor.date(), beta, mu]
        print('{} Res {:6.3f} at {:3.0f}% of {:7.0f} start {} peak {} floor {} beta {:6.3f} mu {:7.3f}'.format(
            country, optim.fun, progress * 100, final, start.date(), peak.date(), floor.date(), beta, mu))
        
        if showplots:
            df_geo[[(measure, country), (smeasure, country), (pmeasure, country)]].plot(
                figsize=(16, 9), grid=True)
            df_geo[[(measure, country), (smeasure, country), (pmeasure, country)]].cumsum().plot(
                figsize=(16, 9), grid=True)
    else:
        df_out.loc[country] = [cname, iso3, ccont, popdata, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]

AD Res  1.284 at  10% of    6994 start 2020-01-01 peak 2020-07-25 floor NaT beta 93.711 mu 206.924
AE Res  0.160 at  64% of   40407 start 2020-03-23 peak 2020-05-08 floor 2020-09-07 beta 23.484 mu 128.396
AF Res  0.033 at  10% of   81844 start 2020-04-18 peak 2020-06-29 floor 2020-12-10 beta 41.447 mu 180.373
AG Res  0.523 at  10% of     234 start 2020-03-10 peak 2020-06-17 floor 2021-01-20 beta 58.528 mu 168.206
AL Res  0.178 at  88% of    1077 start 2020-03-16 peak 2020-04-14 floor 2020-06-11 beta 20.150 mu 104.542
AM Res  0.251 at  10% of   52813 start 2020-03-07 peak 2020-07-20 floor NaT beta 66.227 mu 201.124
AR Res  0.209 at  10% of   93077 start 2020-04-03 peak 2020-07-19 floor 2021-03-05 beta 66.529 mu 200.788
AT Res  3.678 at  10% of  161725 start 2020-02-17 peak 2020-06-29 floor NaT beta 65.730 mu 180.667
AU Res  0.185 at 107% of    6638 start 2020-03-17 peak 2020-03-29 floor 2020-04-23 beta  7.065 mu  88.062
AW Res  0.314 at  10% of     479 start 2020-02-18 peak 2020-07-12 f

IT Res  0.074 at  96% of  237067 start 2020-02-29 peak 2020-04-02 floor 2020-07-03 beta 16.820 mu  92.334
JE Res  1.265 at  10% of    2887 start 2020-02-07 peak 2020-07-13 floor NaT beta 75.638 mu 194.185
JM Res  0.458 at  10% of    5178 start 2020-04-09 peak 2020-06-20 floor 2020-11-27 beta 43.121 mu 171.661
JO Res  0.613 at  10% of    5600 start 2020-05-21 peak 2020-09-16 floor 2021-03-09 beta 123.706 mu 259.562
JP Res  0.029 at 108% of   15173 start 2020-04-04 peak 2020-04-17 floor 2020-05-12 beta 10.054 mu 107.582
JPG11668 Res  0.300 at  74% of     951 start 2020-01-30 peak 2020-02-20 floor 2020-05-15 beta  8.041 mu  50.164
KG Res  0.464 at  10% of   12937 start 2020-04-01 peak 2020-07-08 floor 2021-02-04 beta 60.134 mu 189.749
KM Res  0.073 at  10% of     370 start 2020-05-13 peak 2020-06-17 floor 2020-08-24 beta 23.666 mu 168.270
KN Res  0.274 at  90% of      13 start 2020-03-24 peak 2020-04-07 floor 2020-05-06 beta  8.701 mu  97.168
KR Res  1.140 at  10% of  111077 start 2020-02

UK Res  0.059 at  82% of  303264 start 2020-03-14 peak 2020-04-22 floor 2020-08-08 beta 19.745 mu 112.991
US Res  0.370 at  87% of 1794664 start 2020-03-14 peak 2020-04-21 floor 2020-08-06 beta 18.701 mu 111.461
UY Res  0.779 at  93% of     763 start 2020-03-13 peak 2020-04-07 floor 2020-05-22 beta 18.952 mu  97.488
UZ Res  1.124 at  94% of    3064 start 2020-04-08 peak 2020-04-24 floor 2020-05-19 beta 16.081 mu 114.548
VA Res  0.323 at  10% of     105 start 2020-01-01 peak 2020-07-28 floor NaT beta 94.932 mu 209.100
VC Res  1.014 at  10% of     164 start 2020-03-25 peak 2020-07-07 floor 2021-01-31 beta 68.190 mu 188.198
VG Res  0.102 at  10% of      59 start 2020-03-27 peak 2020-07-12 floor 2021-02-23 beta 66.821 mu 193.935
VI Res  0.099 at  10% of     376 start 2020-03-04 peak 2020-07-17 floor 2021-05-12 beta 78.980 mu 198.022
XK Res  0.820 at  90% of    1089 start 2020-03-23 peak 2020-04-18 floor 2020-06-16 beta 16.349 mu 108.877
ZA Res  0.132 at  10% of  180995 start 2020-04-04 pea
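
The fitting idea in the loop above can be tried on synthetic data. If cumulative cases follow a curve of the form `total * exp(-exp(-(daynum - mu) / beta))` (a Gompertz-type curve, which is what the prediction formula in the cell evaluates), then `-log(-log(cumul / total))` is a straight line with slope `1 / beta` and intercept `-mu / beta`. The sketch below is not the notebook's exact procedure: it replaces the SLSQP optimizer with a plain grid search over `progress` so that it needs only NumPy, and the values of `true_total`, `true_beta` and `true_mu` are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic epidemic (assumed parameters, not fitted to any country).
true_total, true_beta, true_mu = 10_000.0, 20.0, 100.0
daynum = np.arange(60, 131)
cumul = true_total * np.exp(-np.exp(-(daynum - true_mu) / true_beta))
numcases = cumul[-1]  # cases reported so far, about 80% of true_total

def residual(progress):
    """Return (sum of squared errors, line coefficients) for a trial progress."""
    scaled = cumul / numcases * progress
    linear = -np.log(-np.log(scaled))
    coeffs, res, *_ = np.polyfit(daynum, linear, deg=1, full=True)
    return res[0], coeffs

# Grid search stands in for scipy.optimize.minimize in this sketch.
candidates = np.linspace(0.10, 0.99, 90)
best = min(candidates, key=lambda p: residual(p)[0])

_, (slope, intercept) = residual(best)
beta = 1 / slope
mu = -intercept * beta
final = numcases / best  # implied eventual total

print(round(best, 2), round(beta, 1), round(mu, 1), round(final))
```

Only the correct `progress` turns the transformed curve into a straight line, which is why the residual of the linear fit can serve as the objective. On real data the residual never reaches zero, and a fit that ends at the lower bound of the search range is a sign that the epidemic is too early in its course for this estimate to be meaningful.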

Assign a name to the index and inspect the output frame.

In [32]:
df_out.index.name = 'iso2'
df_out

,cname,iso3,ccont,popdata,res,progress,final,start,peak,floor,beta,mu
iso2,,,,,,,,,,,,
AD,Andorra,AND,Europe,77006.0,1.284250,0.100000,6993.858717,2020-01-01,2020-07-25,NaT,93.710840,206.924010
AE,United_Arab_Emirates,ARE,Asia,9630959.0,0.160093,0.641707,40407.385360,2020-03-23,2020-05-08,2020-09-07,23.483577,128.395905
AF,Afghanistan,AFG,Asia,37172386.0,0.032539,0.100000,81844.075854,2020-04-18,2020-06-29,2020-12-10,41.446778,180.372884
AG,Antigua_and_Barbuda,ATG,America,96286.0,0.523321,0.100000,233.605104,2020-03-10,2020-06-17,2021-01-20,58.527760,168.206073
AI,Anguilla,,America,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
XK,Kosovo,XKX,Europe,1845300.0,0.820401,0.896290,1089.092170,2020-03-23,2020-04-18,2020-06-16,16.349267,108.877204
YE,Yemen,YEM,Asia,28498687.0,,,,,,,,
ZA,South_Africa,ZAF,Africa,57779622.0,0.131581,0.100000,180994.738310,2020-04-04,2020-07-20,2021-03-19,61.786144,201.436806
ZM,Zambia,ZMB,Africa,17351822.0,0.014329,0.989473,838.830234,2020-05-11,2020-05-17,2020-05-28,4.631684,137.358212


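Write out the values per country, discarding countries with progress below 0.101. These are the fits where the optimizer stopped at its lower bound of 0.1, so the estimate is not informative. Countries without a fit have a missing `progress` and are dropped as well, because a comparison with NaN is false.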

In [33]:
df_out[df_out['progress'] > 0.101].to_csv("zzprogress.csv")
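
The filter can be checked on a toy frame. This sketch uses a handful of made-up `progress` values (an assumption for illustration) and shows that both the bound-hitting fit and the row without a fit are removed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {'progress': [0.100, 0.642, np.nan, 0.896]},
    index=['AD', 'AE', 'AI', 'XK'])

# Rows at the lower bound (0.1) and rows with NaN both fail the comparison.
kept = toy[toy['progress'] > 0.101]
print(list(kept.index))
```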

Keep exploring! Stay home, wash your hands, keep your distance.