We will use the daily spreadsheet from the European Centre for Disease Prevention and Control (ECDC), which lists new cases and deaths per country per day.

In [1]:
!wget -N https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx

--2020-06-11 11:36:09--  https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx
Resolving www.ecdc.europa.eu (www.ecdc.europa.eu)... 13.227.209.16, 13.227.209.118, 13.227.209.26, ...
Connecting to www.ecdc.europa.eu (www.ecdc.europa.eu)|13.227.209.16|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘COVID-19-geographic-disbtribution-worldwide.xlsx’ not modified on server. Omitting download.



Import pandas and NumPy for feature engineering and calculations, and render plots inline.

In [2]:
import pandas as pd
import numpy  as np

%matplotlib inline

We read our dataframe directly from the downloaded Excel file and look at the first 10 rows to check the format. By default pandas parses the string `NA` as a missing value, which blanked out the `geoId` of Namibia (__NA__), so we disable the default missing-value markers and treat only empty cells as missing.

In [3]:
df = pd.read_excel('COVID-19-geographic-disbtribution-worldwide.xlsx', keep_default_na=False, na_values='')
df.head(10)

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
0,2020-06-10,10,6,2020,542,15,Afghanistan,AF,AFG,37172386.0,Asia
1,2020-06-09,9,6,2020,575,12,Afghanistan,AF,AFG,37172386.0,Asia
2,2020-06-08,8,6,2020,791,30,Afghanistan,AF,AFG,37172386.0,Asia
3,2020-06-07,7,6,2020,582,18,Afghanistan,AF,AFG,37172386.0,Asia
4,2020-06-06,6,6,2020,915,9,Afghanistan,AF,AFG,37172386.0,Asia
5,2020-06-05,5,6,2020,787,6,Afghanistan,AF,AFG,37172386.0,Asia
6,2020-06-04,4,6,2020,758,24,Afghanistan,AF,AFG,37172386.0,Asia
7,2020-06-03,3,6,2020,759,5,Afghanistan,AF,AFG,37172386.0,Asia
8,2020-06-02,2,6,2020,545,8,Afghanistan,AF,AFG,37172386.0,Asia
9,2020-06-01,1,6,2020,680,8,Afghanistan,AF,AFG,37172386.0,Asia
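The effect of `keep_default_na=False` can be seen on a small example. This is a minimal sketch that uses `pd.read_csv` on inline text as a stand-in for the Excel file; the three rows are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample: Namibia's geoId is the literal string "NA", the last row is empty.
csv_text = "geoId,cases\nNA,3\nAF,5\n,7\n"

# Default parsing: both "NA" and the empty field become missing values.
default = pd.read_csv(io.StringIO(csv_text))

# Only the empty string counts as missing, so "NA" survives as text.
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

print(default['geoId'].isna().sum(), strict['geoId'].isna().sum())
```

With the stricter settings the Namibia row keeps its identifier, while truly empty cells are still flagged as missing.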


A final check of our source dataframe: count the non-missing values per column.

In [4]:
df.count()

dateRep                    22592
day                        22592
month                      22592
year                       22592
cases                      22592
deaths                     22592
countriesAndTerritories    22592
geoId                      22592
countryterritoryCode       22332
popData2018                22251
continentExp               22592
dtype: int64

We pivot to a wide format with one row per date and, for each measure, one column per country.

In [5]:
df_geo = df.pivot(index='dateRep', columns='geoId', values=['cases', 'deaths'])
df_geo

Unnamed: 0_level_0,cases,cases,cases,cases,cases,cases,cases,cases,cases,cases,...,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths
geoId,AD,AE,AF,AG,AI,AL,AM,AO,AR,AT,...,VC,VE,VG,VI,VN,XK,YE,ZA,ZM,ZW
dateRep,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2019-12-31,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-01,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-02,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-03,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-04,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-06-06,0.0,624.0,915.0,0.0,0.0,15.0,596.0,0.0,1769.0,62.0,...,0.0,0.0,0.0,0.0,0.0,0.0,8.0,60.0,0.0,0.0
2020-06-07,0.0,626.0,582.0,0.0,0.0,20.0,547.0,0.0,983.0,19.0,...,0.0,2.0,0.0,0.0,0.0,0.0,1.0,44.0,0.0,0.0
2020-06-08,0.0,540.0,791.0,0.0,0.0,14.0,766.0,5.0,774.0,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46.0,0.0,0.0
2020-06-09,0.0,568.0,575.0,0.0,0.0,17.0,195.0,1.0,826.0,21.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,82.0,3.0,0.0
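To make the resulting layout concrete, here is a small sketch of the same `pivot` call on a made-up four-row frame (two dates, two countries). The values are illustrative only.

```python
import pandas as pd

# Made-up long-format data: one row per country per day.
toy = pd.DataFrame({
    'dateRep': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'geoId': ['AT', 'NL', 'AT', 'NL'],
    'cases': [1, 2, 3, 4],
    'deaths': [0, 0, 1, 0],
})

# Wide format: dates as index, a (measure, country) MultiIndex as columns.
wide = toy.pivot(index='dateRep', columns='geoId', values=['cases', 'deaths'])

# A single country series is selected with a (measure, geoId) tuple.
print(wide[('cases', 'NL')].tolist())
```

Countries that have no row for a given date end up with `NaN` in that cell, which is why the early dates in the output above are mostly empty.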


For predictions later on we need extra rows in our dataframe. One way to get them is to reindex with a longer date range, so we take the current range, extend it by one year (365 days) and check the latest date.

In [6]:
new_index = pd.date_range(df_geo.index.min(), df_geo.index.max() + pd.Timedelta('365 days'))
df_geo = df_geo.reindex(new_index)
df_geo

Unnamed: 0_level_0,cases,cases,cases,cases,cases,cases,cases,cases,cases,cases,...,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths
geoId,AD,AE,AF,AG,AI,AL,AM,AO,AR,AT,...,VC,VE,VG,VI,VN,XK,YE,ZA,ZM,ZW
2019-12-31,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-01,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-02,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-03,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-04,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-06-06,,,,,,,,,,,...,,,,,,,,,,
2021-06-07,,,,,,,,,,,...,,,,,,,,,,
2021-06-08,,,,,,,,,,,...,,,,,,,,,,
2021-06-09,,,,,,,,,,,...,,,,,,,,,,
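The reindexing step can be sketched on a short series. The dates and the five-day extension below are arbitrary choices for illustration.

```python
import pandas as pd

# Three observed days.
s = pd.Series([1.0, 2.0, 3.0], index=pd.date_range('2020-06-07', periods=3))

# Extend the index by five days; the new rows have no data yet.
longer = pd.date_range(s.index.min(), s.index.max() + pd.Timedelta('5 days'))
extended = s.reindex(longer)

print(len(extended), extended.isna().sum())
```

Existing values are kept at their dates and the added dates are filled with `NaN`, leaving room for predicted values.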


Most algorithms take numerical data as inputs for a model, so we add a column representing the date as days since the earliest date in the dataframe.

In [7]:
df_geo['daynum'] = (df_geo.index - df_geo.index.min()).days
df_geo['daynum'].describe()

count    528.00000
mean     263.50000
std      152.56474
min        0.00000
25%      131.75000
50%      263.50000
75%      395.25000
max      527.00000
Name: daynum, dtype: float64
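As a quick sanity check of the day numbering, the sketch below applies the same subtraction to a short date range and to the two dates seen in the data (2019-12-31 as day zero, 2020-06-10 as the latest report).

```python
import pandas as pd

# Subtracting the earliest date gives a TimedeltaIndex; .days turns it into integers.
idx = pd.date_range('2019-12-31', '2020-01-03')
daynum = (idx - idx.min()).days
print(list(daynum))

# Day number of the latest report in the source data.
latest = (pd.Timestamp('2020-06-10') - pd.Timestamp('2019-12-31')).days
print(latest, latest + 365)
```

The second line of output ties back to the `describe()` result above: the last day number after the 365-day extension matches the reported maximum.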

Suppress the matplotlib warning about too many open figures, which would otherwise appear when plotting many countries with `showplots = True`.

In [8]:
import matplotlib as mpl
mpl.rc('figure', max_open_warning = 0)

We run the fit for a selection of countries or simply for all countries found in the input. Full documentation of the approach is found in the `Gumbelpivot` notebook.

In [9]:
# Select countries to fit.
countries = np.sort(df['geoId'].unique())
#countries = ['US', 'UK', 'BR', 'CH', 'DE', 'IT', 'ES', 'PT', 'FR', 'SE',
#             'NO', 'DK', 'BE', 'NL', 'NZ', 'CN', 'JP', 'RU', 'AT']

# Choose whether to output plots per country.
showplots = False

# Create an output dataframe.
df_out = pd.DataFrame({
    'cname':np.nan,
    'iso3':np.nan,
    'ccont':np.nan,
    'popdata':np.nan,
    'rsquared':np.nan,
    'progress':np.nan,
    'final':np.nan,
    'start':np.nan,
    'peak':np.nan,
    'floor':np.nan,
    'beta':np.nan,
    'mu':np.nan,
    'maxcur':np.nan},
    index=countries)

# Choose measure to fit and variables to store predicted and smoothed measures.
measure  = 'cases'
smeasure = 'scases'
pmeasure = 'pcases'

def gumpdf(x, beta, mu):
    """Return PDF value according to Gumbel"""
    expon = - ((x - mu) / beta)
    return(np.exp(expon) * np.exp(- (np.exp(expon))) / beta)

def gumcdf(x, beta, mu):
    """Return CDF value according to Gumbel"""
    expon = - ((x - mu) / beta)
    return(np.exp(- (np.exp(expon))))

from scipy.stats import linregress

# Run the fitting approach for all countries.
for country in countries:
    df_geo[(smeasure, country)] = df_geo[measure][country].rolling(7).mean()
    df_pred = pd.DataFrame(
        {'daynum':df_geo['daynum'], measure:df_geo[smeasure][country]})
    
    # Extract country parameters from the original dataset.
    cname   = df[df['geoId'] == country]['countriesAndTerritories'].iloc[0]
    iso3    = df[df['geoId'] == country]['countryterritoryCode'].iloc[0]
    ccont   = df[df['geoId'] == country]['continentExp'].iloc[0]
    popdata = df[df['geoId'] == country]['popData2018'].iloc[0]

    # We will only use measures above one in a million.
    mincases = popdata / 1e6
    
    # Clean up source data and prepare for fitting
    df_pred['cumul'] = df_pred[measure].cumsum()
    df_pred['gumdiv'] = df_pred[measure] / df_pred['cumul']
    # Copy after filtering so later column assignments act on an independent frame.
    df_pred = df_pred[(df_pred['gumdiv'] > 0)].copy()
    df_pred['linear'] = np.log(df_pred['gumdiv'])
    df_pred = df_pred[(df_pred['linear'] < -2) &
                      (df_pred['linear'] > -5) &
                      (df_pred[measure] > mincases)].copy()
    
    # Start fitting only if more than 9 measures left
    if len(df_pred) > 9:
        # Pass x and y explicitly rather than relying on a two-column frame.
        slope, intercept, rvalue, pvalue, stderr = linregress(df_pred['daynum'], df_pred['linear'])
        rsquared = rvalue ** 2
        
        # Calculate Gumbel beta and mu from our linear fit parameters
        beta = - 1 / slope
        mu = beta * (intercept + np.log(beta))

        # Find the final number of cases by scaling back to the original data
        df_pred['pgumb'] = gumpdf(df_pred['daynum'], beta, mu)
        df_pred['scale'] = df_pred[measure] / df_pred['pgumb']
        final = df_pred['scale'].mean()

        # Create predicted measures by calculating the scaled Gumbel PDF
        df_geo[(pmeasure, country)] = gumpdf(df_geo['daynum'], beta, mu) * final

        # Progress is current measure ratio to final
        progress = df_geo[measure][country].sum() / final
        
        # Determine peak, floor, start and final from the predicted daily series.
        peak = df_geo[(df_geo[(pmeasure, country)] >
                       df_geo[(pmeasure, country)].shift(-1))].index.min()
        floor = df_geo[(df_geo[(pmeasure, country)] < (popdata / 1e6)) &
                       (df_geo[(pmeasure, country)].index > peak)].index.min()
        start = df_geo[(df_geo[(pmeasure, country)] > (popdata / 1e6)) &
                       (df_geo[(pmeasure, country)].index < peak)].index.min()
        final = df_geo[pmeasure][country].sum()
        
        # Maximum current infected seems a good measure for outbreak intensity, to be scaled by population.
        maxcur = df_geo[pmeasure][country].rolling(14).sum().max()
        
        # Create an output record and log results.
        df_out.loc[country] = [cname,
                               iso3,
                               ccont,
                               popdata,
                               rsquared,
                               progress,
                               final,
                               start.date(),
                               peak.date(),
                               floor.date(),
                               beta,
                               mu,
                               maxcur]
        print('{}: R2 {:5.3f} at {:6.2f}% of {:8.0f} start {} peak {} floor {} beta {:5.2f} mu {:3.0f}'.format(
            country,
            rsquared,
            progress * 100,
            final,
            start.date(),
            peak.date(),
            floor.date(),
            beta,
            mu))
        
        # Show cumulative and derived results.
        if showplots:
            df_geo[[(measure, country), (smeasure, country), (pmeasure, country)]].plot(
                figsize=(16, 9), grid=True)
            df_geo[[(measure, country), (smeasure, country), (pmeasure, country)]].cumsum().plot(
                figsize=(16, 9), grid=True)
    else:
        df_out.loc[country] = [cname, iso3, ccont, popdata, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]

AD: R2 0.405 at  29.15% of     2922 start 2020-01-24 peak 2020-05-11 floor 2021-04-09 beta 50.03 mu 132
AE: R2 0.949 at  57.81% of    69025 start 2020-03-15 peak 2020-05-23 floor 2020-11-26 beta 35.00 mu 144
AF: R2 0.693 at   1.53% of  1279815 start 2020-04-13 peak 2020-10-26 floor NaT beta 94.31 mu 300
AG: R2 0.640 at  94.50% of       28 start 2020-03-21 peak 2020-04-05 floor 2020-05-08 beta  9.76 mu  96
AL: R2 0.765 at  78.87% of     1647 start 2020-03-09 peak 2020-04-26 floor 2020-07-26 beta 32.50 mu 117
AM: R2 0.105 at   0.01% of 17471853 start NaT peak NaT floor NaT beta 220.53 mu 646
AR: R2 0.159 at   0.34% of  4097069 start 2020-03-31 peak 2021-03-16 floor NaT beta 158.43 mu 441
AT: R2 0.990 at 107.64% of    15702 start 2020-03-13 peak 2020-03-29 floor 2020-05-12 beta  8.25 mu  89
AU: R2 0.992 at 107.33% of     6771 start 2020-03-18 peak 2020-03-29 floor 2020-04-22 beta  6.39 mu  89
AW: R2 0.860 at 190.33% of       53 start 2020-03-29 peak 2020-04-09 floor 2020-05-07 beta  6.28 

  mu = beta * (intercept + np.log(beta))


DK: R2 0.908 at  92.64% of    12954 start 2020-02-28 peak 2020-04-11 floor 2020-07-28 beta 23.68 mu 102
DM: R2 0.201 at   5.10% of      334 start 2020-03-01 peak 2020-08-23 floor NaT beta 101.85 mu 236
DO: R2 0.925 at  55.50% of    36782 start 2020-03-12 peak 2020-05-23 floor 2020-11-16 beta 39.51 mu 144
DZ: R2 0.893 at  64.23% of    16163 start 2020-03-29 peak 2020-05-14 floor 2020-08-05 beta 36.49 mu 135
EC: R2 0.770 at  79.83% of    55013 start 2020-03-12 peak 2020-05-02 floor 2020-09-09 beta 27.30 mu 123
EE: R2 0.953 at 109.12% of     1784 start 2020-03-08 peak 2020-04-02 floor 2020-06-05 beta 13.86 mu  93
EG: R2 0.490 at   2.19% of  1512061 start 2020-04-12 peak 2020-10-27 floor NaT beta 101.13 mu 301
EL: R2 0.810 at  89.62% of     3412 start 2020-03-08 peak 2020-04-03 floor 2020-05-25 beta 18.40 mu  94
ES: R2 0.991 at 102.94% of   235054 start 2020-03-07 peak 2020-04-01 floor 2020-06-13 beta 12.09 mu  92
FI: R2 0.975 at  93.75% of     7493 start 2020-03-10 peak 2020-04-16 floor 2

NO: R2 0.969 at 103.84% of     8247 start 2020-03-04 peak 2020-03-29 floor 2020-06-01 beta 13.52 mu  89
NP: R2 0.063 at    nan% of        0 start NaT peak NaT floor NaT beta -171.55 mu nan
NZ: R2 0.994 at 105.06% of     1098 start 2020-03-23 peak 2020-04-01 floor 2020-04-22 beta  5.81 mu  92
OM: R2 0.402 at   3.19% of   539683 start 2020-03-19 peak 2020-09-30 floor NaT beta 86.88 mu 274
PA: R2 0.659 at  50.05% of    34421 start 2020-02-28 peak 2020-05-27 floor 2021-01-17 beta 45.08 mu 148
PE: R2 0.927 at  30.01% of   678415 start 2020-03-18 peak 2020-06-24 floor 2021-04-06 beta 46.69 mu 176
PF: R2 0.581 at 105.40% of       57 start 2020-03-13 peak 2020-03-31 floor 2020-05-05 beta 12.47 mu  91
PH: R2 0.431 at  29.27% of    78094 start 2020-03-30 peak 2020-06-26 floor 2020-11-28 beta 67.63 mu 178
PK: R2 0.685 at   5.94% of  1851571 start 2020-04-07 peak 2020-09-07 floor NaT beta 81.60 mu 251
PL: R2 0.901 at  71.65% of    38463 start 2020-03-15 peak 2020-05-06 floor 2020-08-26 beta 32.55 
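The fit relies on the ratio of the Gumbel PDF to its CDF: that ratio equals exp(-(x - mu) / beta) / beta, so its logarithm is a straight line in x with slope -1 / beta and intercept mu / beta - log(beta). This is where `beta = -1 / slope` and `mu = beta * (intercept + log(beta))` come from. The sketch below checks this on synthetic data; the parameter values are made up, and `np.polyfit` stands in for `linregress` to keep the example free of extra dependencies.

```python
import numpy as np

def gumpdf(x, beta, mu):
    """Gumbel PDF, algebraically the same as the function in the cell above."""
    z = (x - mu) / beta
    return np.exp(-z - np.exp(-z)) / beta

# Hypothetical outbreak: 10,000 total cases, beta = 12 days, mode at day 90.
true_beta, true_mu, total = 12.0, 90.0, 10_000.0
days = np.arange(0, 301, dtype=float)
daily = total * gumpdf(days, true_beta, true_mu)
cumul = np.cumsum(daily)

# Keep days with a positive ratio, then apply the same (-5, -2) log window.
ok = daily > 0
linear = np.log(daily[ok] / cumul[ok])
x = days[ok]
window = (linear > -5) & (linear < -2)

# Straight-line fit of log(daily / cumulative) against the day number.
slope, intercept = np.polyfit(x[window], linear[window], 1)
beta_hat = -1 / slope
mu_hat = beta_hat * (intercept + np.log(beta_hat))
print(round(beta_hat, 1), round(mu_hat, 1))
```

The recovered values are close to, but not exactly, the true ones, because a daily cumulative sum only approximates the continuous CDF. A positive fitted slope gives a negative `beta`, whose logarithm is undefined; that is consistent with the `nan` value of `mu` for NP in the log above.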

Set the index name and inspect the output frame.

In [21]:
df_out.index.name = 'iso2'
df_out

iso2,cname,iso3,ccont,popdata,rsquared,progress,final,start,peak,floor,beta,mu,maxcur
AD,Andorra,AND,Europe,77006.0,0.397933,3.094300e-01,2.752759e+03,2020-01-25,2020-05-08,2021-03-21,47.824809,128.945389,2.954637e+02
AE,United_Arab_Emirates,ARE,Asia,9630959.0,0.943981,5.509543e-01,7.043628e+04,2020-03-15,2020-05-24,2020-11-29,35.424389,144.629654,1.017452e+04
AF,Afghanistan,AFG,Asia,37172386.0,0.663340,9.100570e-03,1.960625e+06,2020-04-13,2020-11-15,NaT,101.119873,320.035634,1.137559e+05
AG,Antigua_and_Barbuda,ATG,America,96286.0,0.640030,9.449804e-01,2.751380e+01,2020-03-21,2020-04-05,2020-05-08,9.759322,96.059650,1.337899e+01
AI,Anguilla,,America,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
XK,Kosovo,XKX,Europe,1845300.0,0.790147,8.660156e-01,1.318683e+03,2020-03-16,2020-04-22,2020-07-10,23.251146,112.790402,2.877463e+02
YE,Yemen,YEM,Asia,28498687.0,,,,,,,,,
ZA,South_Africa,ZAF,Africa,57779622.0,0.001757,6.782260e-57,2.565004e+12,NaT,NaT,NaT,2444.901100,12051.807132,1.213256e+12
ZM,Zambia,ZMB,Africa,17351822.0,0.838265,8.881079e-01,1.299392e+03,2020-05-04,2020-05-16,2020-06-05,11.065241,136.796739,5.673963e+02
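The `start`, `peak` and `floor` columns are read off the predicted daily series. The sketch below repeats that logic on a synthetic curve; the parameters and the threshold of 5 cases per day are made-up values standing in for a fitted country and its one-per-million level.

```python
import numpy as np
import pandas as pd

def gumpdf(x, beta, mu):
    """Gumbel PDF, algebraically the same as the function in the fitting cell."""
    z = (x - mu) / beta
    return np.exp(-z - np.exp(-z)) / beta

# Hypothetical fitted parameters and threshold.
final, beta, mu, threshold = 15_000.0, 8.0, 90.0, 5.0
idx = pd.date_range('2019-12-31', periods=301)
daynum = np.arange(len(idx), dtype=float)
pred = pd.Series(final * gumpdf(daynum, beta, mu), index=idx)

# Peak: first day whose predicted value exceeds the next day's value.
peak = pred[pred > pred.shift(-1)].index.min()
# Start: first day above the threshold before the peak.
start = pred[(pred > threshold) & (pred.index < peak)].index.min()
# Floor: first day below the threshold after the peak.
floor = pred[(pred < threshold) & (pred.index > peak)].index.min()

print(start.date(), peak.date(), floor.date())
```

Because the Gumbel PDF has its mode at `mu`, the peak lands on day 90 here. If the curve never drops below the threshold within the extended date range, the selection is empty and `floor` becomes `NaT`, as seen for several countries in the table.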


Write out the values per country, keeping only countries with progress above 1%. Countries without a fit have a missing progress value and are dropped as well.

In [22]:
df_out[df_out['progress'] > 0.01].to_csv("zzprogress.csv")
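A short sketch of how this filter treats unfitted countries, using a made-up three-row frame: comparisons against `NaN` evaluate to `False`, so those rows do not pass the mask.

```python
import numpy as np
import pandas as pd

# Made-up progress values: one regular fit, one below 1%, one country without a fit.
toy_out = pd.DataFrame({'progress': [0.5, 0.005, np.nan]}, index=['AT', 'AM', 'AI'])

# Same boolean mask as in the cell above.
kept = toy_out[toy_out['progress'] > 0.01]
print(list(kept.index))
```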

Keep exploring! Stay home, wash your hands, keep your distance.