# Get and Preprocess data from Estonian Open Data Platform

**Index:**
1. [Data acquisition](#Data-acquisition)
2. [Investigating Data Structure](#Investigating-data-structure)
3. [Converting Data](#Converting-dataset)
4. [Data analysis](#Data-analysis)
5. [Get additional Data](#Get-additional-data)
6. [Export Final Data](#Export-Final-Data)

### Data acquisition 

Information about the COVID-19 open data is available on the official Terviseamet website:

[Koroonaviirus SARS-CoV-2 testide avaandmete kirjeldus](https://www.terviseamet.ee/et/koroonaviirus/avaandmed)

The URL of the JSON file in which all results are published is listed there under Testide avaandmete andmestruktuuri kirjeldus > Avaandmete lingid (description of the test open data structure > open data links).

In [1]:
url = r'https://opendata.digilugu.ee/opendata_covid19_test_results.json' # JSON document

With the [requests](https://requests.readthedocs.io/en/master/) library, the file can be downloaded and parsed into the variable ```d```.

In [2]:
import requests 

In [3]:
r = requests.get(url)
d = r.json()
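A slightly more defensive variant of the download could look like the sketch below. It assumes that a timeout and an explicit HTTP status check are wanted; the function name `fetch_results` and the 30 second default are illustrative choices and not part of the original notebook.

```python
import requests

def fetch_results(url, timeout=30):
    """Download a JSON document and return (parsed data, Last-Modified header)."""
    response = requests.get(url, timeout=timeout)  # do not wait forever for a reply
    response.raise_for_status()                    # raise an exception on 4xx/5xx answers
    return response.json(), response.headers.get('Last-Modified')

# Usage (needs network access):
# d, last_modified = fetch_results(url)
```

Returning the header together with the data keeps the two values from the same response in one place.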

The [list](https://www.w3schools.com/python/python_lists.asp) ```d``` now contains all the results from the COVID-19 testing. The number of tests performed can be checked with ```len()```.

In [4]:
print(len(d))

192066


The headers of the ```requests``` response include the ```Last-Modified``` date of the file.

In [5]:
print(r.headers['Last-Modified'])

Tue, 22 Sep 2020 07:27:36 GMT


The derived results will be collected in the [dictionary](https://www.w3schools.com/python/python_dictionaries.asp) ```dc```.

In [6]:
dc = {}

In [7]:
#from datetime import datetime
from dateutil import tz
from dateutil.parser import parse
tzone = tz.gettz('Europe/Tallinn')

In [8]:
dc['totalTested'] = len(d)
dc['lastUpdate'] = parse(r.headers['Last-Modified']).astimezone(tzone).strftime('%d.%m.%Y %H:%M:%S')
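For comparison, the same conversion can be sketched with the standard library only. This assumes Python 3.9 or newer (for `zoneinfo`) and an available time zone database; the header value is the one printed above.

```python
from email.utils import parsedate_to_datetime
from zoneinfo import ZoneInfo  # Python 3.9+

header_value = 'Tue, 22 Sep 2020 07:27:36 GMT'           # value printed above
parsed = parsedate_to_datetime(header_value)              # timezone-aware datetime in UTC
local = parsed.astimezone(ZoneInfo('Europe/Tallinn'))     # convert to Estonian local time
last_update = local.strftime('%d.%m.%Y %H:%M:%S')
print(last_update)  # 22.09.2020 10:27:36
```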

### Investigating data structure

We can access the individual entries of the list ```d``` by index, for example the first one with ```d[0]``` or the last one with ```d[-1]```. Each list item is a dictionary.

In [9]:
d[30]

{'id': '221ff8539c2ad9eed0c02935f716307def6ad3fb518e37cf45b3b40394d0e9f7',
 'Gender': 'M',
 'AgeGroup': '25-29',
 'Country': 'Eesti',
 'County': 'Viljandi maakond',
 'ResultValue': 'N',
 'StatisticsDate': '2020-03-13',
 'ResultTime': '2020-03-12T18:00:00+02:00',
 'AnalysisInsertTime': '2020-03-13T15:23:00+02:00'}

The fields in the resulting dictionary are described on the [official website](https://www.terviseamet.ee/et/koroonaviirus/avaandmed). The fields can be accessed individually:

In [10]:
d[30]['AgeGroup']

'25-29'
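Before converting anything, the structure of such a list of dictionaries can also be inspected with plain Python. The sketch below uses three made-up records that only mimic the fields shown above (the ids are shortened placeholders, not real data).

```python
from collections import Counter

# Three made-up records that mimic the structure shown above.
records = [
    {'id': 'a1', 'Gender': 'M', 'AgeGroup': '25-29', 'ResultValue': 'N'},
    {'id': 'b2', 'Gender': 'N', 'AgeGroup': '20-24', 'ResultValue': 'P'},
    {'id': 'c3', 'Gender': 'M', 'AgeGroup': '25-29', 'ResultValue': 'N'},
]

# Collect every field name that occurs in any record.
field_names = sorted({key for record in records for key in record})
# Count how often each test result occurs.
result_counts = Counter(record['ResultValue'] for record in records)

print(field_names)    # ['AgeGroup', 'Gender', 'ResultValue', 'id']
print(result_counts)  # Counter({'N': 2, 'P': 1})
```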

### Converting dataset

The [pandas](https://pandas.pydata.org/) library makes the data easier to work with and provides functions for further data analysis.

In [11]:
import pandas as pd

Convert the list of dictionaries to a pandas dataframe and display rows 10 to 14.

In [12]:
df = pd.DataFrame(d)
df[10:15]

| | id | Gender | AgeGroup | Country | County | ResultValue | StatisticsDate | ResultTime | AnalysisInsertTime |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 7148777bb25abfcf6211575c5b4c3c64e10d9f759adf9b... | M | 45-49 | Eesti | Võru maakond | N | 2020-03-11 | 2020-03-10T19:29:00+02:00 | 2020-03-11T13:22:53+02:00 |
| 11 | ded14387a0af0168c1bc7c1bfa2ccd1a91f47eb2cb7560... | N | 20-24 | Eesti | Pärnu maakond | N | 2020-03-11 | 2020-03-03T00:00:00+02:00 | 2020-03-11T08:49:21+02:00 |
| 12 | d16f4fdef48d0ba64d75073e45d7b62ab3ec0afaf95b2b... | M | 55-59 | Eesti | Viljandi maakond | N | 2020-03-11 | 2020-03-11T14:03:00+02:00 | 2020-03-11T17:32:21+02:00 |
| 13 | 7bedaddb6a0ae949ca2f86b396f7c8c4e93416046659ee... | M | 60-64 | Eesti | Harju maakond | N | 2020-03-11 | 2020-03-09T21:49:00+02:00 | 2020-03-11T10:12:54+02:00 |
| 14 | 7d8c578b972490df562b9cb72c2afce24b9ee47b53bbdf... | M | 15-19 | Eesti | Tartu maakond | N | 2020-03-11 | 2020-03-03T00:00:00+02:00 | 2020-03-11T08:48:48+02:00 |


#### Data cleaning

In [13]:
from aglearn import remap as rm # custom library that holds the remapping dictionaries

The ```ResultTime``` and ```AnalysisInsertTime``` columns are not needed for now and are dropped.

Each county (maakond) is assigned an identification code, because county names have a long form ("Tartu maakond") and a short form ("Tartumaa") that can get mixed up. The lookup dictionary is stored in the custom ```aglearn``` library.

The ```AgeGroup``` labels are adapted slightly (leading zeros).

The text in ```StatisticsDate``` is converted into datetime objects.

In [14]:
df = df.drop(['ResultTime', 'AnalysisInsertTime'], axis=1) #, 'id'
df['MKOOD'] = df['County'].map(rm.MNIMI_MKOOD)
df['AgeGroup'] = df['AgeGroup'].map(rm.VANUSER_STR)
df['StatisticsDate'] = pd.to_datetime(df['StatisticsDate'])
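Since the ```aglearn``` dictionaries are not shown here, the sketch below illustrates the mapping step with hypothetical stand-ins. The county codes 79 and 37 are taken from the output of this notebook; the raw age-group label `'5-9'` and the dictionary names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for rm.MNIMI_MKOOD and rm.VANUSER_STR.
COUNTY_TO_CODE = {'Tartu maakond': 79, 'Harju maakond': 37}
AGE_GROUP_LABELS = {'5-9': '05-09', '10-14': '10-14'}

sample = pd.DataFrame({
    'County': ['Tartu maakond', 'Harju maakond', ''],
    'AgeGroup': ['5-9', '10-14', 'Tundmatu'],
})

# Series.map looks up every value in the dictionary;
# values without an entry become NaN.
sample['MKOOD'] = sample['County'].map(COUNTY_TO_CODE)
sample['AgeGroup'] = sample['AgeGroup'].map(AGE_GROUP_LABELS)
print(sample)
```

Because unmapped values turn into NaN, an integer code column is stored as float as soon as a single county name is missing from the dictionary.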

In [15]:
df.head()

| | id | Gender | AgeGroup | Country | County | ResultValue | StatisticsDate | MKOOD |
|---|---|---|---|---|---|---|---|---|
| 0 | 95013b64dd5ff18548a92eb5375d9c4a1881467390fed4... | M | 10-14 | Eesti | Tartu maakond | N | 2020-03-10 | 79 |
| 1 | 71fab95aa66a3976b9d9f2868482192fc2bb77ac07d680... | M | 05-09 | Eesti | Tartu maakond | N | 2020-03-10 | 79 |
| 2 | e474cb8d21136013c9c90877592ee8d6b20d1bd72ef48a... | M | 20-24 | Eesti | Harju maakond | N | 2020-03-10 | 37 |
| 3 | 86a33c6965a464b3c8b754795d99b3fccab5e8349827dc... | M | 35-39 | Eesti | Tartu maakond | N | 2020-03-10 | 79 |
| 4 | 70fb213dfac6252426170b79224d399c6e613fbca07d54... | N | 15-19 | Eesti | Viljandi maakond | N | 2020-03-10 | 84 |


#### Export data

In [16]:
df.to_csv(r'data/covid_digilugu_cleaned.csv')
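When such a file is read back later, the column types are not restored automatically. The following sketch shows one possible round trip; it uses an in-memory buffer instead of the real file and a two-row made-up dataframe.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    'AgeGroup': ['05-09', '10-14'],
    'StatisticsDate': pd.to_datetime(['2020-03-10', '2020-03-11']),
})

buffer = io.StringIO()   # in-memory stand-in for the CSV file
frame.to_csv(buffer)     # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index, parse_dates turns the text back into datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=['StatisticsDate'])
print(restored.dtypes)
```

Without `index_col=0`, the saved index would show up as an extra column named `Unnamed: 0`.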

### Data analysis

In [17]:
dtFormat = '%d.%m.%Y'
dc['firstTest'] = df.StatisticsDate.min().strftime(dtFormat)
dc['lastTest'] = df.StatisticsDate.max().strftime(dtFormat)

#### Positive vs Negative

The function ```value_counts()``` can be used to summarize the respective columns.

In [18]:
df.ResultValue.value_counts()

N    189090
P      2976
Name: ResultValue, dtype: int64

In [19]:
dc['totalPositive'] = int(df.ResultValue.value_counts()['P'])
dc['totalNegative'] = int(df.ResultValue.value_counts()['N'])

In [20]:
dc['percPositive'] = round(dc['totalPositive']/dc['totalTested'],4)
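With the totals above, the share works out to 2976 / 192066 ≈ 0.0155. As a small illustration on made-up data, ```value_counts()``` can also return such fractions directly through its `normalize` argument:

```python
import pandas as pd

results = pd.Series(['N', 'N', 'P', 'N'], name='ResultValue')  # made-up sample

shares = results.value_counts(normalize=True)  # fractions instead of counts
# .get() returns the default when a label (here 'P') does not occur at all.
share_positive = round(float(shares.get('P', 0.0)), 4)
print(share_positive)  # 0.25
```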

#### Values Last day

In [21]:
from datetime import datetime as dt

In [22]:
res = df[df.StatisticsDate == dt.strptime(dc['lastTest'],dtFormat)].ResultValue.value_counts()
res

N    2826
P      36
Name: ResultValue, dtype: int64

In [23]:
if 'P' in res: # in case there are no positive results :) 
    dc['prevDayConfirmed'] = int(res['P'])
else:
    dc['prevDayConfirmed'] = 0
dc['prevDayTests'] = int(res['N'] + dc['prevDayConfirmed'])

#### Timeseries Estonia

For further statistics, the ```ResultValue``` column is one-hot encoded and the result is joined with the dataframe.

In [24]:
df1h = pd.get_dummies(df[['ResultValue']], prefix=['Results'])
df1h = df1h.rename(columns={'Results_N' : 'negativeTests', 'Results_P' : 'confirmedCases'})
df1h = df.join(df1h)
df1h[-10:]

| | id | Gender | AgeGroup | Country | County | ResultValue | StatisticsDate | MKOOD | negativeTests | confirmedCases |
|---|---|---|---|---|---|---|---|---|---|---|
| 192056 | bd0bd344193869f4df3b6db1f085bddef090ebabd8c93e... | N | 30-34 | Eesti | Ida-Viru maakond | N | 2020-09-21 | 45.0 | 1 | 0 |
| 192057 | 760e702adadb0b1e3623a1ad9cf55df0a4d96da6016a1b... | M | 40-44 | Eesti | Harju maakond | N | 2020-09-21 | 37.0 | 1 | 0 |
| 192058 | 87f6a89eeaf4e16146fd241147bd64144f557e29653a8c... | M | 25-29 | Eesti | Jõgeva maakond | N | 2020-09-21 | 50.0 | 1 | 0 |
| 192059 | bde06d2da915f7ad17f0b27046039f24ddf965914788df... | M | 30-34 | Eesti | Harju maakond | N | 2020-09-21 | 37.0 | 1 | 0 |
| 192060 | 7ffd6b173676249bfd6355f0c3ee102af524f64aef4873... | M | 10-14 | Eesti | Harju maakond | N | 2020-09-21 | 37.0 | 1 | 0 |
| 192061 | ac92a7c570a781a0e8af230ce7b73a74f64a2c73fe4411... | M | 05-09 | Eesti | Harju maakond | N | 2020-09-21 | 37.0 | 1 | 0 |
| 192062 | d37458fb85913f0ad30fc709e022e529d676dd8e35e6ba... | N | Tundmatu | Tundmatu | | N | 2020-09-21 | | 1 | 0 |
| 192063 | 2df4df9c2b696c3ec8a81119fde5476143f60126942541... | M | 50-54 | Eesti | Harju maakond | N | 2020-09-21 | 37.0 | 1 | 0 |
| 192064 | 31a04c545de0563b0b81b98fe47899e0f5b4eab9483053... | N | 10-14 | Eesti | Viljandi maakond | N | 2020-09-21 | 84.0 | 1 | 0 |
| 192065 | c826c1a844e9b33334c6f24cd15e6847cfb9879db04b63... | N | 15-19 | Eesti | Lääne-Viru maakond | N | 2020-09-21 | 60.0 | 1 | 0 |
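The encoding step can be tried on a tiny made-up sample, as sketched below. Recent pandas versions return boolean columns from ```get_dummies()``` by default, so the sketch passes `dtype=int` to get the 0/1 values shown above; that argument is an addition and not part of the original cell.

```python
import pandas as pd

sample = pd.DataFrame({'ResultValue': ['N', 'P', 'N']})  # made-up sample

# One column per distinct value; dtype=int gives 0/1 instead of False/True.
onehot = pd.get_dummies(sample[['ResultValue']], prefix=['Results'], dtype=int)
onehot = onehot.rename(columns={'Results_N': 'negativeTests',
                                'Results_P': 'confirmedCases'})
print(sample.join(onehot))
```

Summing these indicator columns per day then yields the daily number of negative tests and confirmed cases.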


In [25]:
newDateRange = pd.date_range(start=dt.strptime(dc['firstTest'], dtFormat), end=dt.strptime(dc['lastTest'], dtFormat), freq='1D')

In [26]:
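dfts = df1h.groupby(['StatisticsDate'])[['negativeTests', 'confirmedCases']].sum() # daily sums of the one-hot columns
dfts['testsPerDay'] = df1h.groupby(['StatisticsDate']).count().values[:,1] # non-null Gender entries per day; rows with a missing Gender are not counted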
dfts = dfts.reindex(newDateRange)
dfts = dfts.fillna(0)
dfts[-10:]

| | negativeTests | confirmedCases | testsPerDay |
|---|---|---|---|
| 2020-09-12 | 1573.0 | 21.0 | 1594.0 |
| 2020-09-13 | 1329.0 | 21.0 | 1350.0 |
| 2020-09-14 | 2211.0 | 23.0 | 2234.0 |
| 2020-09-15 | 2388.0 | 36.0 | 2424.0 |
| 2020-09-16 | 2937.0 | 22.0 | 2959.0 |
| 2020-09-17 | 2139.0 | 36.0 | 2175.0 |
| 2020-09-18 | 3133.0 | 59.0 | 3192.0 |
| 2020-09-19 | 1764.0 | 49.0 | 1813.0 |
| 2020-09-20 | 1554.0 | 18.0 | 1572.0 |
| 2020-09-21 | 2826.0 | 36.0 | 2862.0 |
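The reindexing step guarantees one row per calendar day, even for days without any test. A minimal sketch on made-up data, using the `fill_value` argument of ```reindex()``` instead of a separate ```fillna()``` call:

```python
import pandas as pd

sample = pd.DataFrame({
    'StatisticsDate': pd.to_datetime(['2020-03-10', '2020-03-10', '2020-03-12']),
    'confirmedCases': [1, 0, 1],
})  # made-up sample without an entry for 2020-03-11

daily = sample.groupby('StatisticsDate')[['confirmedCases']].sum()
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq='D')
daily = daily.reindex(full_range, fill_value=0)  # insert missing days as zeros
print(daily)
```

Filling with `fill_value=0` during the reindex keeps the integer dtype, whereas ```reindex()``` followed by ```fillna(0)``` produces floats as in the output above.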


Cumulative Sums

In [27]:
dfts['cumulativeNegative'] = dfts['negativeTests'].cumsum()
dfts['cumulativePositive'] = dfts['confirmedCases'].cumsum()
dfts['testsPerformed'] = dfts['testsPerDay'].cumsum()
dfts['activeCases'] = dfts['confirmedCases'].rolling(14, min_periods=1).sum()
dfts[-10:]

| | negativeTests | confirmedCases | testsPerDay | cumulativeNegative | cumulativePositive | testsPerformed | activeCases |
|---|---|---|---|---|---|---|---|
| 2020-09-12 | 1573.0 | 21.0 | 1594.0 | 168809.0 | 2676.0 | 171481.0 | 303.0 |
| 2020-09-13 | 1329.0 | 21.0 | 1350.0 | 170138.0 | 2697.0 | 172831.0 | 322.0 |
| 2020-09-14 | 2211.0 | 23.0 | 2234.0 | 172349.0 | 2720.0 | 175065.0 | 325.0 |
| 2020-09-15 | 2388.0 | 36.0 | 2424.0 | 174737.0 | 2756.0 | 177489.0 | 341.0 |
| 2020-09-16 | 2937.0 | 22.0 | 2959.0 | 177674.0 | 2778.0 | 180448.0 | 337.0 |
| 2020-09-17 | 2139.0 | 36.0 | 2175.0 | 179813.0 | 2814.0 | 182623.0 | 358.0 |
| 2020-09-18 | 3133.0 | 59.0 | 3192.0 | 182946.0 | 2873.0 | 185815.0 | 383.0 |
| 2020-09-19 | 1764.0 | 49.0 | 1813.0 | 184710.0 | 2922.0 | 187628.0 | 407.0 |
| 2020-09-20 | 1554.0 | 18.0 | 1572.0 | 186264.0 | 2940.0 | 189200.0 | 409.0 |
| 2020-09-21 | 2826.0 | 36.0 | 2862.0 | 189090.0 | 2976.0 | 192062.0 | 413.0 |
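The difference between the cumulative sum and the rolling-window sum is easiest to see on a short made-up series. The sketch uses a window of 3 instead of 14 so the numbers can be checked by hand.

```python
import pandas as pd

cases = pd.Series([1, 0, 2, 3, 0])  # made-up daily case counts

cumulative = cases.cumsum()                          # running total since the first day
window_sum = cases.rolling(3, min_periods=1).sum()   # total of the last 3 days only

print(cumulative.tolist())  # [1, 1, 3, 6, 6]
print(window_sum.tolist())  # [1.0, 1.0, 3.0, 5.0, 5.0]
```

`min_periods=1` lets the first days, for which fewer than 3 values exist, still produce a sum instead of NaN.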


Percentages 

In [None]:
dfts['positiveTestsPerc'] = (dfts['confirmedCases'] / dfts['testsPerDay']).round(4)
dfts['positiveTestsPercCum'] = (dfts['cumulativePositive'] / dfts['testsPerformed']).round(4)
dfts.loc[dfts.index[-1], 'lastFeature'] = 1
dfts = dfts.reset_index().rename(columns={'index':'StatisticsDate'})
dfts
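Days without any tests lead to a division of zero by zero in these ratios. The sketch below, on made-up numbers, shows that pandas returns NaN in that case; whether to keep the NaN or replace it with 0 is a design choice that the notebook does not state.

```python
import pandas as pd

confirmed = pd.Series([2.0, 0.0])   # made-up: second day has no tests at all
tests = pd.Series([100.0, 0.0])

share = (confirmed / tests).round(4)   # 0/0 yields NaN, no exception is raised
print(share.tolist())            # [0.02, nan]
print(share.fillna(0).tolist())  # [0.02, 0.0]
```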

#### Timeseries Maakond

In [29]:
counties = list(df1h['County'].unique())
counties.remove('')
counties

['Tartu maakond',
 'Harju maakond',
 'Viljandi maakond',
 'Valga maakond',
 'Võru maakond',
 'Pärnu maakond',
 'Jõgeva maakond',
 'Lääne maakond',
 'Saare maakond',
 'Lääne-Viru maakond',
 'Põlva maakond',
 'Ida-Viru maakond',
 'Rapla maakond',
 'Hiiu maakond',
 'Järva maakond']

In [None]:
frames = [] # one timeseries dataframe per county
for county in counties:
    dftsm0 = df1h.loc[df1h['County'] == county] # select the rows of one county
    dftsm = dftsm0.groupby(['StatisticsDate'])[['negativeTests', 'confirmedCases']].sum() # daily sums for this county
    dftsm['testsPerDay'] = dftsm0.groupby(['StatisticsDate']).count().values[:,1]
    dftsm = dftsm.reindex(newDateRange)
    dftsm = dftsm.fillna(0)
    dftsm['cumulativeNegative'] = dftsm['negativeTests'].cumsum()
    dftsm['cumulativePositive'] = dftsm['confirmedCases'].cumsum()
    dftsm['testsPerformed'] = dftsm['testsPerDay'].cumsum()
    dftsm['activeCases'] = dftsm['confirmedCases'].rolling(14, min_periods=1).sum()
    dftsm.loc[dftsm.index[-1], 'lastFeature'] = 1
    dftsm['County'] = county
    #dftsm['MKOOD'] = rm.MNIMI_MKOOD[county]
    frames.append(dftsm)
dftsm_all = pd.concat(frames) # DataFrame.append was removed in pandas 2.0
dftsm_all['positiveTestsPerc'] = (dftsm_all['confirmedCases'] / dftsm_all['testsPerDay']).round(4)
dftsm_all['positiveTestsPercCum'] = (dftsm_all['cumulativePositive'] / dftsm_all['testsPerformed']).round(4)
dftsm_all = dftsm_all.reset_index().rename(columns={'index':'StatisticsDate'})
dftsm_all
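As an alternative to looping over the counties, the per-county sums can be sketched with a single ```groupby()``` on two keys. This is only an illustration on made-up data; unlike the loop, it does not reindex each county to the full date range.

```python
import pandas as pd

sample = pd.DataFrame({
    'County': ['Tartu maakond', 'Tartu maakond', 'Harju maakond'],
    'StatisticsDate': pd.to_datetime(['2020-03-10', '2020-03-11', '2020-03-10']),
    'confirmedCases': [1, 2, 5],
})  # made-up sample

# Daily sums per county, indexed by (County, StatisticsDate).
per_county = sample.groupby(['County', 'StatisticsDate'])['confirmedCases'].sum()
# Running total that restarts for every county.
cumulative = per_county.groupby(level='County').cumsum()
print(cumulative)
```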

#### New Cases in the Last 14 days
This value gives a rough estimate of the number of active cases. However, it ignores hospitalized cases, which may take significantly longer to recover, so the number should be interpreted with care.

In [31]:
from datetime import timedelta as td

In [32]:
val14d = dfts.loc[dfts['StatisticsDate'] > dfts['StatisticsDate'].max() - td(days=14)]['confirmedCases'].sum()
dc['sumLast14D'] = val14d
print('New Cases in last 14d: {}'.format(val14d))

New Cases in last 14d: 413.0
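The date filter keeps exactly the 14 most recent days, because the comparison is strict (`>`). A quick check on a made-up series with one case per day:

```python
from datetime import timedelta
import pandas as pd

ts = pd.DataFrame({
    'StatisticsDate': pd.date_range('2020-09-01', periods=21, freq='D'),
    'confirmedCases': 1,   # made-up: one case on every day
})

cutoff = ts['StatisticsDate'].max() - timedelta(days=14)
last_14_days = ts.loc[ts['StatisticsDate'] > cutoff]   # strictly after the cutoff
print(len(last_14_days), last_14_days['confirmedCases'].sum())  # 14 14
```

This matches the last value of the 14-day rolling sum, which is why ```sumLast14D``` equals the final ```activeCases``` entry (413).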


### Export Final Data

The dictionary ```dc``` stores all the derived statistics.

In [33]:
dc

{'totalTested': 192066,
 'lastUpdate': '22.09.2020 10:27:36',
 'firstTest': '05.02.2020',
 'lastTest': '21.09.2020',
 'totalPositive': 2976,
 'totalNegative': 189090,
 'percPositive': 0.0155,
 'prevDayConfirmed': 36,
 'prevDayTests': 2862,
 'sumLast14D': 413.0}

In [34]:
import json

In [35]:
with open(r'data/cov_stats_eesti.json', 'w') as f:
    json.dump(dc, f, indent=4)
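The ```int()``` conversions earlier in the notebook matter here, because the ```json``` module cannot serialize NumPy integers such as `numpy.int64`. A possible helper for cleaning a dictionary before export is sketched below; the name `to_builtin` is an illustrative choice.

```python
import json
import numpy as np

def to_builtin(value):
    """Convert NumPy scalars to plain Python numbers; leave other values untouched."""
    return value.item() if isinstance(value, np.generic) else value

stats = {'totalPositive': np.int64(2976), 'lastTest': '21.09.2020'}
clean = {key: to_builtin(value) for key, value in stats.items()}
print(json.dumps(clean))  # {"totalPositive": 2976, "lastTest": "21.09.2020"}
```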

The dataframe ```dfts``` contains the timeseries for the whole of Estonia.

In [None]:
dfts

In [37]:
#dfts = dfts.reset_index()
#dfts = dfts.rename(column={})
dfts.to_csv(r'data/cov_ts_eesti.csv', index=False)

The dataframe ```dftsm_all``` contains the timeseries for each county.

In [None]:
dftsm_all

In [39]:
dftsm_all.to_csv(r'data/ts_maakond.csv', index=False)