We use the daily spreadsheet published by the European Centre for Disease Prevention and Control (ECDC), which lists new cases and deaths per country per day.

In [15]:
!wget -N https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx

--2020-06-06 16:30:24--  https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.xlsx
Resolving www.ecdc.europa.eu (www.ecdc.europa.eu)... 13.227.209.16, 13.227.209.26, 13.227.209.121, ...
Connecting to www.ecdc.europa.eu (www.ecdc.europa.eu)|13.227.209.16|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘COVID-19-geographic-disbtribution-worldwide.xlsx’ not modified on server. Omitting download.



Import pandas and NumPy for feature engineering and calculations, and render plots inline.

In [16]:
import pandas as pd
import numpy  as np

%matplotlib inline

We read our dataframe directly from the downloaded Excel file and look at the first 10 rows to check the format. By default pandas treats the string `NA` as a missing value, which blanked the `geoId` of Namibia (__NA__), so we disable the default missing-value markers and treat only empty cells as missing.

In [17]:
df = pd.read_excel('COVID-19-geographic-disbtribution-worldwide.xlsx', keep_default_na=False, na_values='')
df.head(10)

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
0,2020-06-06,6,6,2020,915,9,Afghanistan,AF,AFG,37172386.0,Asia
1,2020-06-05,5,6,2020,787,6,Afghanistan,AF,AFG,37172386.0,Asia
2,2020-06-04,4,6,2020,758,24,Afghanistan,AF,AFG,37172386.0,Asia
3,2020-06-03,3,6,2020,759,5,Afghanistan,AF,AFG,37172386.0,Asia
4,2020-06-02,2,6,2020,545,8,Afghanistan,AF,AFG,37172386.0,Asia
5,2020-06-01,1,6,2020,680,8,Afghanistan,AF,AFG,37172386.0,Asia
6,2020-05-31,31,5,2020,866,3,Afghanistan,AF,AFG,37172386.0,Asia
7,2020-05-30,30,5,2020,623,11,Afghanistan,AF,AFG,37172386.0,Asia
8,2020-05-29,29,5,2020,580,8,Afghanistan,AF,AFG,37172386.0,Asia
9,2020-05-28,28,5,2020,625,7,Afghanistan,AF,AFG,37172386.0,Asia
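
The effect of the default missing-value markers can be reproduced on a tiny table. This is a minimal sketch with made-up rows; it uses `read_csv` on an in-memory string so that it runs without the spreadsheet, on the assumption that `read_excel` interprets `keep_default_na` and `na_values` the same way.

```python
import io
import pandas as pd

# Made-up sample rows; 'NA' is the geoId of Namibia.
csv_text = "geoId,cases\nNA,3\nAF,5\n"

# Default parsing: the string 'NA' is one of pandas' default missing-value markers.
df_default = pd.read_csv(io.StringIO(csv_text))

# Only empty cells count as missing, so 'NA' survives as a string.
df_fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

print(df_default['geoId'].isna().sum())  # number of geoId values lost by default
print(df_fixed['geoId'].tolist())
```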


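As a last check of the source dataframe we count the non-null values per column. `countryterritoryCode` and `popData2018` have fewer entries than the other columns, so some rows lack an ISO3 code or a population figure.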

In [18]:
df.count()

dateRep                    21756
day                        21756
month                      21756
year                       21756
cases                      21756
deaths                     21756
countriesAndTerritories    21756
geoId                      21756
countryterritoryCode       21512
popData2018                21435
continentExp               21756
dtype: int64
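
Because `count()` reports non-null values per column, a lower count can be traced back to specific territories. The following is a minimal sketch on illustrative rows, not on values read from the spreadsheet; the name `sample` is made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the second territory has no population figure.
sample = pd.DataFrame({
    'geoId': ['AF', 'AF', 'AI', 'AI'],
    'cases': [10, 12, 0, 1],
    'popData2018': [37172386.0, 37172386.0, np.nan, np.nan],
})

# Non-null count per column, as in the check above.
counts = sample.count()

# Territories with at least one missing population value.
missing_pop = sample.loc[sample['popData2018'].isna(), 'geoId'].unique().tolist()
print(counts['popData2018'], missing_pop)
```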

We pivot to a wide format with one row per date and one column per measure and country.

In [19]:
df_geo = df.pivot(index='dateRep', columns='geoId', values=['cases', 'deaths'])
df_geo

Unnamed: 0_level_0,cases,cases,cases,cases,cases,cases,cases,cases,cases,cases,...,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths
geoId,AD,AE,AF,AG,AI,AL,AM,AO,AR,AT,...,VC,VE,VG,VI,VN,XK,YE,ZA,ZM,ZW
dateRep
2019-12-31,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-01,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-02,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-03,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-04,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-06-02,1.0,635.0,545.0,0.0,0.0,6.0,210.0,0.0,564.0,21.0,...,0.0,3.0,0.0,0.0,0.0,0.0,4.0,22.0,0.0,0.0
2020-06-03,79.0,596.0,759.0,0.0,0.0,21.0,517.0,0.0,904.0,11.0,...,0.0,1.0,0.0,0.0,0.0,0.0,3.0,50.0,0.0,0.0
2020-06-04,7.0,571.0,758.0,1.0,0.0,20.0,515.0,0.0,949.0,31.0,...,0.0,2.0,0.0,0.0,0.0,0.0,1.0,37.0,0.0,0.0
2020-06-05,1.0,659.0,787.0,0.0,0.0,13.0,697.0,0.0,0.0,36.0,...,0.0,0.0,0.0,0.0,0.0,0.0,15.0,56.0,0.0,0.0
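
The shape of the pivoted frame is easier to see on a few rows. A minimal sketch with partly made-up data (the frame `long_df` is hypothetical): passing a list of value columns produces two-level columns, and combinations that do not occur in the input become `NaN`.

```python
import pandas as pd

# Partly made-up long-format rows: one row per date and country.
long_df = pd.DataFrame({
    'dateRep': pd.to_datetime(['2020-06-01', '2020-06-02', '2020-06-02']),
    'geoId': ['AF', 'AF', 'AD'],
    'cases': [680, 545, 1],
    'deaths': [8, 8, 0],
})

# A list of value columns yields two-level columns: (measure, geoId).
wide = long_df.pivot(index='dateRep', columns='geoId', values=['cases', 'deaths'])

# Select one measure for one country; AD has no row for 2020-06-01, so it is NaN there.
print(wide['cases']['AD'])
print(wide[('deaths', 'AF')].tolist())
```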


For the predictions later on we need extra rows in our dataframe. One way to add them is to reindex with a longer date range, so we extend the current range by one year (365 days) and check the latest date.

In [20]:
new_index = pd.date_range(df_geo.index.min(), df_geo.index.max() + pd.Timedelta('365 days'))
df_geo = df_geo.reindex(new_index)
df_geo

Unnamed: 0_level_0,cases,cases,cases,cases,cases,cases,cases,cases,cases,cases,...,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths,deaths
geoId,AD,AE,AF,AG,AI,AL,AM,AO,AR,AT,...,VC,VE,VG,VI,VN,XK,YE,ZA,ZM,ZW
2019-12-31,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-01,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-02,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-03,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
2020-01-04,,0.0,0.0,,,,0.0,,,0.0,...,,,,,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-06-02,,,,,,,,,,,...,,,,,,,,,,
2021-06-03,,,,,,,,,,,...,,,,,,,,,,
2021-06-04,,,,,,,,,,,...,,,,,,,,,,
2021-06-05,,,,,,,,,,,...,,,,,,,,,,
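
Reindexing keeps the existing rows and fills the new dates with `NaN`. A minimal sketch on a made-up three-day frame, extended by one week instead of one year to keep the output short:

```python
import pandas as pd

# Made-up three-day frame.
short = pd.DataFrame({'cases': [1.0, 2.0, 3.0]},
                     index=pd.date_range('2020-06-03', periods=3))

# Extend the index by one week; rows for the new dates are filled with NaN.
new_index = pd.date_range(short.index.min(), short.index.max() + pd.Timedelta('7 days'))
extended = short.reindex(new_index)

print(len(extended))                    # total number of rows after extension
print(extended['cases'].notna().sum())  # rows that still hold observed data
```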


Most algorithms take numerical data as inputs for a model, so we add a column representing the date as days since the earliest date in the dataframe.

In [21]:
df_geo['daynum'] = (df_geo.index - df_geo.index.min()).days
df_geo['daynum'].describe()

count    524.000000
mean     261.500000
std      151.410039
min        0.000000
25%      130.750000
50%      261.500000
75%      392.250000
max      523.000000
Name: daynum, dtype: float64
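
Day numbers also map back to calendar dates, which helps when reading fitted parameters expressed in days. A minimal sketch, assuming the same first date as the dataframe above and an end date chosen only for illustration; the value 92 is the rounded `mu` reported for ES in the log further down.

```python
import pandas as pd

# Same first date as the dataframe above; the end date is chosen for illustration.
index = pd.date_range('2019-12-31', '2020-04-30')

# Days elapsed since the first date, one integer per row.
daynum = (index - index.min()).days

# A day number maps back to a calendar date by adding it to the first date.
mu_date = index.min() + pd.Timedelta(days=92)
print(daynum[:3].tolist(), mu_date.date())
```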

Suppress the warning that matplotlib raises when many figures are open, which happens when analyzing many countries with `showplots = True`.

In [22]:
import matplotlib as mpl
mpl.rc('figure', max_open_warning = 0)

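We run the fit either for a selection of countries or, by uncommenting the first `countries` line, for all countries found in the input; the log and output frame shown below come from a run over all countries. Full documentation of the approach is found in the `Gumbelpivot` notebook.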

In [23]:
# Select countries to fit.
#countries = np.sort(df['geoId'].unique())
countries = ['US', 'UK', 'BR', 'CH', 'DE', 'IT', 'ES', 'PT', 'FR', 'SE',
             'NO', 'DK', 'BE', 'NL', 'NZ', 'CN', 'JP', 'RU', 'AT']

# Choose whether to output plots per country.
showplots = True

# Create an output dataframe.
df_out = pd.DataFrame({
    'cname':np.nan,
    'iso3':np.nan,
    'ccont':np.nan,
    'popdata':np.nan,
    'rsquared':np.nan,
    'progress':np.nan,
    'final':np.nan,
    'start':np.nan,
    'peak':np.nan,
    'floor':np.nan,
    'beta':np.nan,
    'mu':np.nan,
    'maxcur':np.nan},
    index=countries)

# Choose measure to fit and variables to store predicted and smoothed measures.
measure  = 'cases'
smeasure = 'scases'
pmeasure = 'pcases'

def gumpdf(x, beta, mu):
    """Return PDF value according to Gumbel"""
    expon = - ((x - mu) / beta)
    return(np.exp(expon) * np.exp(- (np.exp(expon))) / beta)

def gumcdf(x, beta, mu):
    """Return CDF value according to Gumbel"""
    expon = - ((x - mu) / beta)
    return(np.exp(- (np.exp(expon))))

from scipy.stats import linregress

# Run the fitting approach for all countries.
for country in countries:
    df_geo[(smeasure, country)] = df_geo[measure][country].rolling(7).mean()
    df_pred = pd.DataFrame(
        {'daynum':df_geo['daynum'], measure:df_geo[smeasure][country]})
    
    # Extract country parameters from the original dataset.
    cname   = df[df['geoId'] == country]['countriesAndTerritories'].iloc[0]
    iso3    = df[df['geoId'] == country]['countryterritoryCode'].iloc[0]
    ccont   = df[df['geoId'] == country]['continentExp'].iloc[0]
    popdata = df[df['geoId'] == country]['popData2018'].iloc[0]

    # We will only use measures above one in a million.
    mincases = popdata / 1e6
    
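    # Clean up source data and prepare for fitting. The ratio of the daily
    # measure to its running total approximates Gumbel PDF / CDF, whose
    # logarithm is linear in the day number.
    df_pred['cumul'] = df_pred[measure].cumsum()
    df_pred['gumdiv'] = df_pred[measure] / df_pred['cumul']
    # Copy the filtered rows so that adding columns does not write to a slice.
    df_pred = df_pred[(df_pred['gumdiv'] > 0)].copy()
    df_pred['linear'] = np.log(df_pred['gumdiv'])
    df_pred = df_pred[(df_pred['linear'] < -2) &
                      (df_pred['linear'] > -5) &
                      (df_pred[measure] > mincases)].copy()
    
    # Start fitting only if more than 9 measures left
    if len(df_pred) > 9:
        # Pass x and y explicitly rather than a two-column frame.
        slope, intercept, rvalue, pvalue, stderr = linregress(df_pred['daynum'], df_pred['linear'])
        rsquared = rvalue ** 2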
        
        # Calculate Gumbel beta and mu from our linear fit parameters
        beta = - 1 / slope
        mu = beta * (intercept + np.log(beta))

        # Find the final number of cases by scaling back to the original data
        df_pred['pgumb'] = gumpdf(df_pred['daynum'], beta, mu)
        df_pred['scale'] = df_pred[measure] / df_pred['pgumb']
        final = df_pred['scale'].mean()

        # Create predicted measures by calculating the scaled Gumbel PDF
        df_geo[(pmeasure, country)] = gumpdf(df_geo['daynum'], beta, mu) * final

        # Progress is the ratio of the measure reported so far to the predicted final total
        progress = df_geo[measure][country].sum() / final
        
        # Determine peak, floor and start numerically from the predicted curve,
        # then recompute final as the sum of the predicted daily values.
        peak = df_geo[(df_geo[(pmeasure, country)] >
                       df_geo[(pmeasure, country)].shift(-1))].index.min()
        floor = df_geo[(df_geo[(pmeasure, country)] < mincases) &
                       (df_geo[(pmeasure, country)].index > peak)].index.min()
        start = df_geo[(df_geo[(pmeasure, country)] > mincases) &
                       (df_geo[(pmeasure, country)].index < peak)].index.min()
        final = df_geo[pmeasure][country].sum()
        
        # The maximum 14-day rolling sum of predicted cases stands in for the peak number
        # of currently infected, a measure of outbreak intensity to be scaled by population.
        maxcur = df_geo[pmeasure][country].rolling(14).sum().max()
        
        # Create an output record and log results.
        df_out.loc[country] = [cname,
                               iso3,
                               ccont,
                               popdata,
                               rsquared,
                               progress,
                               final,
                               start.date(),
                               peak.date(),
                               floor.date(),
                               beta,
                               mu,
                               maxcur]
        print('{}: R2 {:5.3f} at {:6.2f}% of {:8.0f} start {} peak {} floor {} beta {:5.2f} mu {:3.0f}'.format(
            country,
            rsquared,
            progress * 100,
            final,
            start.date(),
            peak.date(),
            floor.date(),
            beta,
            mu))
        
        # Show cumulative and derived results.
        if showplots:
            df_geo[[(measure, country), (smeasure, country), (pmeasure, country)]].plot(
                figsize=(16, 9), grid=True)
            df_geo[[(measure, country), (smeasure, country), (pmeasure, country)]].cumsum().plot(
                figsize=(16, 9), grid=True)
    else:
        # Too few usable measures: keep the country metadata and leave the nine fit results empty.
        df_out.loc[country] = [cname, iso3, ccont, popdata] + [np.nan] * 9

AD: R2 0.394 at  37.24% of     2288 start 2020-01-30 peak 2020-04-29 floor 2021-01-27 beta 41.52 mu 120
AE: R2 0.938 at  52.23% of    72069 start 2020-03-14 peak 2020-05-24 floor 2020-12-02 beta 35.89 mu 145
AF: R2 0.627 at   0.53% of  2993102 start 2020-04-13 peak 2020-12-06 floor NaT beta 107.96 mu 341
AG: R2 0.640 at  94.50% of       28 start 2020-03-21 peak 2020-04-05 floor 2020-05-08 beta  9.76 mu  96
AL: R2 0.792 at  79.83% of     1518 start 2020-03-10 peak 2020-04-23 floor 2020-07-16 beta 29.97 mu 114
AM: R2 0.114 at   0.05% of  7758785 start NaT peak NaT floor NaT beta 194.24 mu 558
AR: R2 0.170 at   0.68% of  2095400 start 2020-03-31 peak 2021-01-22 floor NaT beta 140.79 mu 388
AT: R2 0.990 at 107.01% of    15702 start 2020-03-13 peak 2020-03-29 floor 2020-05-12 beta  8.25 mu  89
AU: R2 0.992 at 107.09% of     6771 start 2020-03-18 peak 2020-03-29 floor 2020-04-22 beta  6.39 mu  89
AW: R2 0.860 at 190.33% of       53 start 2020-03-29 peak 2020-04-09 floor 2020-05-07 beta  6.28

  mu = beta * (intercept + np.log(beta))


DK: R2 0.908 at  91.67% of    12954 start 2020-02-28 peak 2020-04-11 floor 2020-07-28 beta 23.68 mu 102
DM: R2 0.170 at   5.40% of      317 start 2020-02-28 peak 2020-08-15 floor NaT beta 98.44 mu 228
DO: R2 0.921 at  53.43% of    35009 start 2020-03-12 peak 2020-05-21 floor 2020-11-08 beta 38.43 mu 142
DZ: R2 0.876 at  57.05% of    17413 start 2020-03-29 peak 2020-05-17 floor 2020-08-13 beta 38.36 mu 138
EC: R2 0.764 at  79.97% of    51986 start 2020-03-12 peak 2020-04-30 floor 2020-09-01 beta 25.98 mu 121
EE: R2 0.953 at 107.05% of     1784 start 2020-03-08 peak 2020-04-02 floor 2020-06-05 beta 13.86 mu  93
EG: R2 0.471 at   2.52% of  1136608 start 2020-04-12 peak 2020-10-12 floor NaT beta 95.88 mu 286
EL: R2 0.916 at 102.41% of     2868 start 2020-03-11 peak 2020-04-01 floor 2020-05-12 beta 14.46 mu  92
ES: R2 0.991 at 102.52% of   235054 start 2020-03-07 peak 2020-04-01 floor 2020-06-13 beta 12.09 mu  92
FI: R2 0.975 at  92.63% of     7493 start 2020-03-10 peak 2020-04-16 floor 202

PE: R2 0.921 at  23.90% of   783451 start 2020-03-18 peak 2020-06-29 floor 2021-04-29 beta 48.75 mu 181
PF: R2 0.581 at 105.40% of       57 start 2020-03-13 peak 2020-03-31 floor 2020-05-05 beta 12.47 mu  91
PH: R2 0.481 at  36.16% of    56933 start 2020-03-29 peak 2020-06-09 floor 2020-10-10 beta 58.47 mu 161
PK: R2 0.754 at   9.31% of   994459 start 2020-04-07 peak 2020-08-11 floor 2021-06-06 beta 71.17 mu 224
PL: R2 0.917 at  71.53% of    35536 start 2020-03-16 peak 2020-05-04 floor 2020-08-16 beta 30.58 mu 125
PR: R2 0.418 at  33.38% of    13812 start 2020-03-07 peak 2020-06-17 floor 2021-02-18 beta 56.72 mu 169
PS: R2 0.617 at  74.53% of      863 start 2020-04-02 peak 2020-04-30 floor 2020-06-16 beta 24.41 mu 121
PT: R2 0.848 at  86.21% of    39401 start 2020-02-29 peak 2020-04-14 floor 2020-08-11 beta 23.31 mu 105
PY: R2 0.486 at  58.05% of     1873 start 2020-04-08 peak 2020-05-15 floor 2020-07-18 beta 31.16 mu 136
QA: R2 0.887 at  17.03% of   383398 start 2020-03-06 peak 2020-0
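
The regression in the cell above relies on a property of the Gumbel distribution: the logarithm of the PDF divided by the CDF is a straight line in x, with slope -1/beta and intercept mu/beta - log(beta). The sketch below checks this on exact Gumbel values with made-up parameters (beta 12, mu 92). The notebook applies the same relation to smoothed daily cases divided by their running total, which only approximates PDF / CDF, so real fits recover the parameters less precisely.

```python
import numpy as np
from scipy.stats import linregress

def gumpdf(x, beta, mu):
    """Gumbel PDF, as defined in the cell above."""
    expon = -((x - mu) / beta)
    return np.exp(expon) * np.exp(-np.exp(expon)) / beta

def gumcdf(x, beta, mu):
    """Gumbel CDF, as defined in the cell above."""
    expon = -((x - mu) / beta)
    return np.exp(-np.exp(expon))

# Made-up parameters for illustration.
true_beta, true_mu = 12.0, 92.0
x = np.arange(60, 130)

# log(PDF / CDF) = -(x - mu) / beta - log(beta), a straight line in x.
linear = np.log(gumpdf(x, true_beta, true_mu) / gumcdf(x, true_beta, true_mu))
fit = linregress(x, linear)

# Invert slope and intercept the same way the fitting loop does.
beta = -1 / fit.slope
mu = beta * (fit.intercept + np.log(beta))
print(round(beta, 3), round(mu, 3))
```

A positive slope would make `beta` negative and `np.log(beta)` invalid, which is one way the fit can fail on noisy data.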

Name the index of the output frame and check the result.

In [24]:
df_out.index.name = 'iso2'
df_out

iso2,cname,iso3,ccont,popdata,rsquared,progress,final,start,peak,floor,beta,mu,maxcur
AD,Andorra,AND,Europe,77006.0,0.393810,3.723743e-01,2.287884e+03,2020-01-30,2020-04-29,2021-01-27,41.516729,119.719097,2.824928e+02
AE,United_Arab_Emirates,ARE,Asia,9630959.0,0.938451,5.222900e-01,7.206916e+04,2020-03-14,2020-05-24,2020-12-02,35.887228,145.412286,1.027831e+04
AF,Afghanistan,AFG,Asia,37172386.0,0.626952,5.273760e-03,2.993102e+06,2020-04-13,2020-12-06,NaT,107.964402,340.590473,1.714644e+05
AG,Antigua_and_Barbuda,ATG,America,96286.0,0.640030,9.449804e-01,2.751380e+01,2020-03-21,2020-04-05,2020-05-08,9.759322,96.059650,1.337899e+01
AI,Anguilla,,America,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
XK,Kosovo,XKX,Europe,1845300.0,0.795165,8.981845e-01,1.271454e+03,2020-03-16,2020-04-21,2020-07-06,22.216414,111.737792,2.899435e+02
YE,Yemen,YEM,Asia,28498687.0,,,,,,,,,
ZA,South_Africa,ZAF,Africa,57779622.0,0.002601,1.138563e-44,1.507299e+12,NaT,NaT,NaT,1921.103033,9031.470185,6.908257e+11
ZM,Zambia,ZMB,Africa,17351822.0,0.838265,8.550155e-01,1.299392e+03,2020-05-04,2020-05-16,2020-06-05,11.065241,136.796739,5.673963e+02


Write out the values per country, discarding countries with progress of 1% or less. Countries without a fit have no progress value and are dropped as well.

In [25]:
df_out[df_out['progress'] > 0.01].to_csv("zzprogress.csv")
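
The filter also removes rows where `progress` is missing, because a comparison with `NaN` is false. A minimal sketch on an illustrative frame (the name `results` and its three rows are made up for the example):

```python
import numpy as np
import pandas as pd

# Illustrative results: one fitted country, one with tiny progress, one without a fit.
results = pd.DataFrame(
    {'progress': [0.52, 0.005, np.nan]},
    index=pd.Index(['AE', 'AF', 'YE'], name='iso2'))

# NaN > 0.01 evaluates to False, so countries without a fit are filtered out too.
kept = results[results['progress'] > 0.01]

# to_csv without a path returns the CSV text instead of writing a file.
csv_text = kept.to_csv()
print(csv_text)
```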

Keep exploring! Stay home, wash your hands, keep your distance.