# Get and Preprocess data from Estonian Open Data Platform

**Index:**
1. [Data acquisition](#Data-acquisition)
2. [Investigating Data Structure](#Investigating-data-structure)
3. [Converting Data](#Converting-dataset)
4. [Data analysis](#Data-analysis)
5. [Get additional Data](#Get-additional-data)
6. [Export Final Data](#Export-final-values)

### Data acquisition 

Information on the COVID-19 open data is available on the official Terviseamet (Estonian Health Board) website:

[Koroonaviirus SARS-CoV-2 testide avaandmete kirjeldus](https://www.terviseamet.ee/et/koroonaviirus/avaandmed)

The section *Testide avaandmete andmestruktuuri kirjeldus > Avaandmete lingid* (description of the test open data structure > open data links) contains the URL of the JSON file in which all test results are published.

In [1]:
url = r'https://opendata.digilugu.ee/opendata_covid19_test_results.json' # JSON document

With the [requests](https://requests.readthedocs.io/en/master/) library, the file can be downloaded and parsed into the variable ```d```.

In [2]:
import requests 

In [3]:
r = requests.get(url)
d = r.json()

The [list](https://www.w3schools.com/python/python_lists.asp) ```d``` now contains all results of the COVID-19 testing. The number of tests performed can be checked with ```len()```.

In [4]:
print(len(d))

76590


The headers of the ```requests``` response include the ```Last-Modified``` date of the file.

In [5]:
print(r.headers['Last-Modified'])

Mon, 25 May 2020 07:52:11 GMT


The derived results will be collected in the [dictionary](https://www.w3schools.com/python/python_dictionaries.asp) ```dc```.

In [6]:
dc = {}

In [7]:
#from datetime import datetime
from dateutil import tz
from dateutil.parser import parse
tzone = tz.gettz('Europe/Tallinn')

In [8]:
dc['totalTested'] = len(d)
dc['lastUpdate'] = parse(r.headers['Last-Modified']).astimezone(tzone).strftime('%d.%m.%Y %H:%M:%S')
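
The conversion of the ```Last-Modified``` header into local time can also be sketched with the standard library only. This is a minimal sketch, assuming Python 3.9 or newer with time zone data available on the system; the helper name `to_local_stamp` is hypothetical and not part of the notebook.

```python
from email.utils import parsedate_to_datetime
from zoneinfo import ZoneInfo


def to_local_stamp(http_date: str, zone: str = "Europe/Tallinn") -> str:
    """Convert an HTTP date header value (given in GMT) to a local timestamp string."""
    utc_time = parsedate_to_datetime(http_date)        # timezone-aware datetime in UTC
    local_time = utc_time.astimezone(ZoneInfo(zone))   # shift to the requested time zone
    return local_time.strftime("%d.%m.%Y %H:%M:%S")


# Header value taken from the response shown in this notebook
print(to_local_stamp("Mon, 25 May 2020 07:52:11 GMT"))
```

Tallinn is three hours ahead of GMT in May (summer time), so the header value above corresponds to 10:52:11 local time.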

### Investigating data structure

We can access the individual entries of the list ```d```, for example the first one with ```d[0]``` or the last one with ```d[-1]```. Each list item is a dictionary.

In [9]:
d[20]

{'id': '191b343ea1cbb67fbeb1f8440d80acfa1cc1df1c66e2921706935381ba67a765',
 'Gender': 'N',
 'AgeGroup': '35-39',
 'Country': 'Eesti',
 'County': 'Tartu maakond',
 'ResultValue': 'N',
 'StatisticsDate': '2020-03-11',
 'ResultTime': '2020-03-09T22:00:00+02:00',
 'AnalysisInsertTime': '2020-03-11T10:13:21+02:00'}

The fields in the resulting dictionary are described on the [official website](https://www.terviseamet.ee/et/koroonaviirus/avaandmed). The fields can be accessed individually:

In [10]:
d[20]['AgeGroup']

'35-39'
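
Before converting anything, a list of dictionaries like this can be explored with plain Python. The following is a small sketch on hypothetical sample records that only reuse the field names of the open data; the values are made up for illustration.

```python
from collections import Counter

# Hypothetical sample records with the same field names as the open data entries
sample = [
    {"Gender": "N", "AgeGroup": "35-39", "County": "Tartu maakond", "ResultValue": "N"},
    {"Gender": "M", "AgeGroup": "10-14", "County": "Tartu maakond", "ResultValue": "N"},
    {"Gender": "M", "AgeGroup": "20-24", "County": "Harju maakond", "ResultValue": "P"},
]

field_names = sorted(sample[0].keys())                            # which fields does an entry have?
results = Counter(entry["ResultValue"] for entry in sample)       # how often does each result occur?
per_county = Counter(entry["County"] for entry in sample)         # how many entries per county?

print(field_names)
print(results)
print(per_county)
```

For larger summaries, the conversion to a pandas dataframe is usually more convenient than counting by hand.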

### Converting dataset

The [pandas](https://pandas.pydata.org/) library can be used to work with the data in a more accessible, tabular form. It also provides functions for further data analysis.

In [11]:
import pandas as pd

Convert the list of dictionaries to a pandas dataframe and display the first 5 rows.

In [12]:
df = pd.DataFrame(d)
df.head()

| | id | Gender | AgeGroup | Country | County | ResultValue | StatisticsDate | ResultTime | AnalysisInsertTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 95013b64dd5ff18548a92eb5375d9c4a1881467390fed4... | M | 10-14 | Eesti | Tartu maakond | N | 2020-03-10 | 2020-03-06T18:44:00+02:00 | 2020-03-10T16:01:55+02:00 |
| 1 | 71fab95aa66a3976b9d9f2868482192fc2bb77ac07d680... | M | 5-9 | Eesti | Tartu maakond | N | 2020-03-10 | 2020-03-06T13:28:00+02:00 | 2020-03-10T16:05:53+02:00 |
| 2 | e474cb8d21136013c9c90877592ee8d6b20d1bd72ef48a... | M | 20-24 | Eesti | Harju maakond | N | 2020-03-10 | 2020-03-05T00:00:00+02:00 | 2020-03-10T15:53:52+02:00 |
| 3 | 86a33c6965a464b3c8b754795d99b3fccab5e8349827dc... | M | 35-39 | Eesti | Tartu maakond | N | 2020-03-10 | 2020-03-05T00:00:00+02:00 | 2020-03-10T15:50:53+02:00 |
| 4 | 70fb213dfac6252426170b79224d399c6e613fbca07d54... | N | 15-19 | Eesti | Viljandi maakond | N | 2020-03-10 | 2020-03-06T18:46:00+02:00 | 2020-03-10T15:59:21+02:00 |


#### Data cleaning

In [13]:
from aglearn import remap as rm # custom library holding the remapping dictionaries

The ```ResultTime``` and ```AnalysisInsertTime``` columns are not needed for now and are dropped, together with ```id```.

Counties (maakond) are identified by a code, because county names have a long form ("Tartu maakond") and a short form ("Tartumaa") that might get mixed up. The mapping dictionary is stored in the custom ```aglearn``` library. 

The ```AgeGroup``` labels are slightly adapted (leading zeros).

The text in ```StatisticsDate``` is converted into datetime objects.

In [14]:
df = df.drop(['ResultTime', 'AnalysisInsertTime', 'id'], axis=1)
df['MKOOD'] = df['County'].map(rm.MNIMI_MKOOD)
df['AgeGroup'] = df['AgeGroup'].map(rm.VANUSER_STR)
df['StatisticsDate'] = pd.to_datetime(df['StatisticsDate'])

In [15]:
df.head()

| | Gender | AgeGroup | Country | County | ResultValue | StatisticsDate | MKOOD |
|---|---|---|---|---|---|---|---|
| 0 | M | 10-14 | Eesti | Tartu maakond | N | 2020-03-10 | 79 |
| 1 | M | 05-09 | Eesti | Tartu maakond | N | 2020-03-10 | 79 |
| 2 | M | 20-24 | Eesti | Harju maakond | N | 2020-03-10 | 37 |
| 3 | M | 35-39 | Eesti | Tartu maakond | N | 2020-03-10 | 79 |
| 4 | N | 15-19 | Eesti | Viljandi maakond | N | 2020-03-10 | 84 |
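
The contents of the ```aglearn``` remapping dictionaries are not shown in this notebook. As a rough illustration of the two remapping steps, the sketch below uses a hypothetical excerpt `COUNTY_CODES` (codes copied from the ```MKOOD``` column of the cleaned dataframe; whether the real dictionary stores them as integers or strings is an assumption) and a hypothetical helper `pad_age_group` that adds the leading zeros.

```python
import re

# Hypothetical excerpt of a county-name-to-code mapping; the real dictionary
# (rm.MNIMI_MKOOD) lives in the custom aglearn library and is not reproduced here.
COUNTY_CODES = {"Tartu maakond": 79, "Harju maakond": 37, "Viljandi maakond": 84}


def pad_age_group(age_group: str) -> str:
    """Pad every number in an age group label to two digits, e.g. '5-9' -> '05-09'."""
    return re.sub(r"\d+", lambda m: m.group().zfill(2), age_group)


print(pad_age_group("5-9"))
print(COUNTY_CODES.get("Harju maakond"))
print(COUNTY_CODES.get("Tartumaa"))  # short spelling is not a key, so this yields None
```

Padded labels such as `05-09` sort correctly as text, which is convenient for grouped tables and charts.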


#### Export data

In [16]:
df.to_csv(r'covid_digilugu_cleaned.csv')

### Data analysis

In [17]:
dtFormat = '%d.%m.%Y'
dc['firstTest'] = df.StatisticsDate.min().strftime(dtFormat)
dc['lastTest'] = df.StatisticsDate.max().strftime(dtFormat)

#### Positive vs Negative

The function ```value_counts()``` can be used to summarize the respective columns.

In [18]:
df.ResultValue.value_counts()

N    74766
P     1824
Name: ResultValue, dtype: int64

In [19]:
dc['totalPositive'] = int(df.ResultValue.value_counts()['P'])
dc['totalNegative'] = int(df.ResultValue.value_counts()['N'])

In [20]:
dc['percPositive'] = round(dc['totalPositive']/dc['totalTested'],4)
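
As a quick check of the arithmetic, the share of positive tests can be recomputed by hand from the two totals reported in this notebook. The variable names below are chosen for illustration only.

```python
total_tested = 76590    # number of entries in the downloaded list
total_positive = 1824   # count of 'P' in ResultValue

share_positive = round(total_positive / total_tested, 4)

print(share_positive)                     # fraction, rounded to four decimals
print(f"{share_positive:.2%}")            # the same value formatted as a percentage
print(total_tested - total_positive)      # implied number of negative tests
```

The stored value is a fraction between 0 and 1, so any consumer that displays it as a percentage needs to multiply by 100 or use percent formatting.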

#### Values Last day

In [21]:
from datetime import datetime as dt

In [22]:
res = df[df.StatisticsDate == dt.strptime(dc['lastTest'],dtFormat)].ResultValue.value_counts()
res

N    811
P      1
Name: ResultValue, dtype: int64

In [23]:
# .get() falls back to 0 when a result type does not occur on the last day
dc['prevDayConfirmed'] = int(res.get('P', 0)) # in case there are no positive results :)
dc['prevDayTests'] = int(res.get('N', 0)) + dc['prevDayConfirmed']

#### Timeseries Estonia

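For further statistics, the ```ResultValue``` column is one-hot encoded: each result type becomes its own 0/1 column, so that daily sums yield the number of negative and positive tests. The result is joined with the dataframe.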

In [24]:
df1h = pd.get_dummies(df[['ResultValue']], prefix=['Results'])
df1h = df1h.rename(columns={'Results_N' : 'negativeTests', 'Results_P' : 'confirmedCases'})
df1h = df.join(df1h)
df1h[-10:]

| | Gender | AgeGroup | Country | County | ResultValue | StatisticsDate | MKOOD | negativeTests | confirmedCases |
|---|---|---|---|---|---|---|---|---|---|
| 76580 | M | 55-59 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
| 76581 | M | 70-74 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
| 76582 | M | 25-29 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
| 76583 | N | 50-54 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
| 76584 | M | 40-44 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
| 76585 | M | 70-74 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
| 76586 | N | 35-39 | Eesti | Saare maakond | N | 2020-05-24 | 74 | 1 | 0 |
| 76587 | N | 30-34 | Eesti | Lääne-Viru maakond | N | 2020-05-24 | 60 | 1 | 0 |
| 76588 | N | 25-29 | Eesti | Lääne-Viru maakond | N | 2020-05-24 | 60 | 1 | 0 |
| 76589 | M | 00-04 | Eesti | Harju maakond | N | 2020-05-24 | 37 | 1 | 0 |
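
The one-hot encoding step can be tried out on a toy dataframe. This is a minimal sketch; the toy data is made up, and `dtype=int` is passed explicitly because recent pandas versions otherwise return boolean columns.

```python
import pandas as pd

# Toy data with the two result codes used in the open data
toy = pd.DataFrame({"ResultValue": ["N", "P", "N"]})

# One column per result code, 1 where the row has that code and 0 otherwise
dummies = pd.get_dummies(toy[["ResultValue"]], prefix=["Results"], dtype=int)
dummies = dummies.rename(columns={"Results_N": "negativeTests", "Results_P": "confirmedCases"})

encoded = toy.join(dummies)  # keep the original column next to the encoded ones
print(encoded)
```

Because every row carries exactly one result, the two encoded columns always add up to 1 per row, and their column sums equal the counts from ```value_counts()```.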


In [25]:
newDateRange = pd.date_range(start=dt.strptime(dc['firstTest'], dtFormat), end=dt.strptime(dc['lastTest'], dtFormat), freq='1D')

In [26]:
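# sum only the one-hot columns; recent pandas versions no longer silently drop text columns
dfts = df1h.groupby('StatisticsDate')[['negativeTests', 'confirmedCases']].sum()
dfts['testsPerDay'] = df1h.groupby('StatisticsDate').size() # number of rows (tests) per day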
dfts = dfts.reindex(newDateRange)
dfts = dfts.fillna(0)
dfts

| | negativeTests | confirmedCases | testsPerDay |
|---|---|---|---|
| 2020-02-05 | 1.0 | 0.0 | 1.0 |
| 2020-02-06 | 1.0 | 0.0 | 1.0 |
| 2020-02-07 | 0.0 | 0.0 | 0.0 |
| 2020-02-08 | 0.0 | 0.0 | 0.0 |
| 2020-02-09 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... |
| 2020-05-20 | 1269.0 | 6.0 | 1275.0 |
| 2020-05-21 | 869.0 | 7.0 | 876.0 |
| 2020-05-22 | 636.0 | 14.0 | 650.0 |
| 2020-05-23 | 527.0 | 2.0 | 529.0 |
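
Days without any test do not appear in the grouped result, which is why the dataframe is reindexed to a continuous date range and the gaps are filled with 0. A minimal sketch of that pattern on made-up toy data:

```python
import pandas as pd

# Toy data: tests on 5, 6 and 9 February, nothing on the two days in between
toy = pd.DataFrame({
    "StatisticsDate": pd.to_datetime(["2020-02-05", "2020-02-06", "2020-02-09", "2020-02-09"]),
    "confirmedCases": [0, 0, 1, 0],
})

daily = toy.groupby("StatisticsDate")[["confirmedCases"]].sum()
daily["testsPerDay"] = toy.groupby("StatisticsDate").size()

# Continuous daily index from the first to the last date
full_range = pd.date_range(start=daily.index.min(), end=daily.index.max(), freq="D")
daily = daily.reindex(full_range).fillna(0)  # missing days become rows of zeros
print(daily)
```

Reindexing introduces missing values first, so the columns turn into floats; this matches the `1.0` and `0.0` values in the notebook output.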


Cumulative Sums

In [27]:
dfts['cumulativeNegative'] = dfts['negativeTests'].cumsum()
dfts['cumulativePositive'] = dfts['confirmedCases'].cumsum()
dfts['testsPerformed'] = dfts['testsPerDay'].cumsum()
dfts

| | negativeTests | confirmedCases | testsPerDay | cumulativeNegative | cumulativePositive | testsPerformed |
|---|---|---|---|---|---|---|
| 2020-02-05 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 2020-02-06 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 2.0 |
| 2020-02-07 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 |
| 2020-02-08 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 |
| 2020-02-09 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-05-20 | 1269.0 | 6.0 | 1275.0 | 71923.0 | 1800.0 | 73723.0 |
| 2020-05-21 | 869.0 | 7.0 | 876.0 | 72792.0 | 1807.0 | 74599.0 |
| 2020-05-22 | 636.0 | 14.0 | 650.0 | 73428.0 | 1821.0 | 75249.0 |
| 2020-05-23 | 527.0 | 2.0 | 529.0 | 73955.0 | 1823.0 | 75778.0 |


Percentages 

In [28]:
dfts['positiveTestsPerc'] = (dfts['confirmedCases' ]/dfts['testsPerDay']).round(4)
dfts['positiveTestsPercCum'] = (dfts['cumulativePositive' ]/dfts['testsPerformed']).round(4)
dfts.loc[dfts.index[-1], 'lastFeature'] = 1
dfts = dfts.reset_index().rename(columns={'index':'StatisticsDate'})
dfts

| | StatisticsDate | negativeTests | confirmedCases | testsPerDay | cumulativeNegative | cumulativePositive | testsPerformed | positiveTestsPerc | positiveTestsPercCum | lastFeature |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-05 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0000 | 0.0000 | |
| 1 | 2020-02-06 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 2.0 | 0.0000 | 0.0000 | |
| 2 | 2020-02-07 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | | 0.0000 | |
| 3 | 2020-02-08 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | | 0.0000 | |
| 4 | 2020-02-09 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | | 0.0000 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 105 | 2020-05-20 | 1269.0 | 6.0 | 1275.0 | 71923.0 | 1800.0 | 73723.0 | 0.0047 | 0.0244 | |
| 106 | 2020-05-21 | 869.0 | 7.0 | 876.0 | 72792.0 | 1807.0 | 74599.0 | 0.0080 | 0.0242 | |
| 107 | 2020-05-22 | 636.0 | 14.0 | 650.0 | 73428.0 | 1821.0 | 75249.0 | 0.0215 | 0.0242 | |
| 108 | 2020-05-23 | 527.0 | 2.0 | 529.0 | 73955.0 | 1823.0 | 75778.0 | 0.0038 | 0.0241 | |
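
The empty ```positiveTestsPerc``` cells on days without tests come from dividing 0 by 0, which pandas represents as a missing value (NaN) rather than raising an error. A small sketch on toy values; the extra column `positiveTestsPercFilled` is hypothetical and only shows one possible way to handle the gap.

```python
import pandas as pd

# One day without tests and one day with 6 positives out of 1275 tests
toy = pd.DataFrame({"confirmedCases": [0.0, 6.0], "testsPerDay": [0.0, 1275.0]})

toy["positiveTestsPerc"] = (toy["confirmedCases"] / toy["testsPerDay"]).round(4)

# Optional: replace the undefined share with 0 if a downstream consumer cannot handle NaN
toy["positiveTestsPercFilled"] = toy["positiveTestsPerc"].fillna(0)
print(toy)
```

Whether an undefined share should be shown as a gap or as 0 depends on the chart or table that consumes the data; keeping NaN preserves the distinction between "no positives" and "no tests".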


#### Timeseries Maakond

In [29]:
counties = [c for c in df1h['County'].unique() if c] # skip entries with an empty county name
counties

['Tartu maakond',
 'Harju maakond',
 'Viljandi maakond',
 'Valga maakond',
 'Võru maakond',
 'Pärnu maakond',
 'Jõgeva maakond',
 'Lääne maakond',
 'Saare maakond',
 'Lääne-Viru maakond',
 'Põlva maakond',
 'Ida-Viru maakond',
 'Rapla maakond',
 'Hiiu maakond',
 'Järva maakond']

In [30]:
frames = [] # one timeseries dataframe per county
for county in counties:
    dftsm0 = df1h.loc[df1h['County'] == county] # select the rows of one county
    dftsm = dftsm0.groupby('StatisticsDate')[['negativeTests', 'confirmedCases']].sum() # daily sums for this county
    dftsm['testsPerDay'] = dftsm0.groupby('StatisticsDate').size() # number of tests per day
    dftsm = dftsm.reindex(newDateRange)
    dftsm = dftsm.fillna(0)
    dftsm['cumulativeNegative'] = dftsm['negativeTests'].cumsum()
    dftsm['cumulativePositive'] = dftsm['confirmedCases'].cumsum()
    dftsm['testsPerformed'] = dftsm['testsPerDay'].cumsum()
    dftsm.loc[dftsm.index[-1], 'lastFeature'] = 1
    dftsm['County'] = county
    #dftsm['MKOOD'] = rm.MNIMI_MKOOD[county]
    frames.append(dftsm)
dftsm_all = pd.concat(frames) # DataFrame.append was removed in pandas 2.0
dftsm_all['positiveTestsPerc'] = (dftsm_all['confirmedCases' ]/dftsm_all['testsPerDay']).round(4)
dftsm_all['positiveTestsPercCum'] = (dftsm_all['cumulativePositive' ]/dftsm_all['testsPerformed']).round(4)
dftsm_all = dftsm_all.reset_index().rename(columns={'index':'StatisticsDate'})
dftsm_all

| | StatisticsDate | negativeTests | confirmedCases | testsPerDay | cumulativeNegative | cumulativePositive | testsPerformed | lastFeature | County | positiveTestsPerc | positiveTestsPercCum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 1 | 2020-02-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 2 | 2020-02-07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 3 | 2020-02-08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 4 | 2020-02-09 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1645 | 2020-05-20 | 24.0 | 0.0 | 24.0 | 1929.0 | 13.0 | 1942.0 | | Järva maakond | 0.0 | 0.0067 |
| 1646 | 2020-05-21 | 9.0 | 0.0 | 9.0 | 1938.0 | 13.0 | 1951.0 | | Järva maakond | 0.0 | 0.0067 |
| 1647 | 2020-05-22 | 20.0 | 0.0 | 20.0 | 1958.0 | 13.0 | 1971.0 | | Järva maakond | 0.0 | 0.0066 |
| 1648 | 2020-05-23 | 16.0 | 0.0 | 16.0 | 1974.0 | 13.0 | 1987.0 | | Järva maakond | 0.0 | 0.0065 |


#### New Cases in the Last 14 days
This value can be used as a rough estimate of the number of active cases. However, it neglects hospitalized cases, which may take significantly longer to recover. The number should therefore be interpreted with care.

In [31]:
from datetime import timedelta as td

In [32]:
val14d = dfts.loc[dfts['StatisticsDate'] > dfts['StatisticsDate'].max() - td(days=14)]['confirmedCases'].sum()
dc['sumLast14D'] = val14d
print('New Cases in last 14d: {}'.format(val14d))

New Cases in last 14d: 83.0
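
The window logic can be checked with the standard library alone. In this sketch the daily counts are made up; the point is that a strict `>` comparison against "last day minus 14 days" selects exactly 14 calendar days, including the last day itself.

```python
from datetime import date, timedelta

# Hypothetical daily counts of confirmed cases for 1 to 24 May, keyed by date
daily_cases = {date(2020, 5, 1) + timedelta(days=i): i % 3 for i in range(24)}

last_day = max(daily_cases)
window_start = last_day - timedelta(days=14)       # exclusive lower bound
in_window = [d for d in daily_cases if d > window_start]

days_in_window = len(in_window)
cases_in_window = sum(daily_cases[d] for d in in_window)
print(min(in_window), last_day, days_in_window, cases_in_window)
```

Using `>=` instead would include a 15th day, so the comparison operator decides whether the result is a 14-day or a 15-day sum.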


### Get additional data

#### Webscraping Terviseamet

Some data is not available in the open data, but can be acquired from the website. 
The library [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) can be used to scrape the data.

(Hint: If BeautifulSoup is not installed in the Python environment, consult this [article](https://pro.arcgis.com/en/pro-app/arcpy/get-started/what-is-conda.htm) on how to clone the default environment and add packages.)

In [33]:
from bs4 import BeautifulSoup
import re

In [34]:
url2 = r'https://www.terviseamet.ee/et/koroonaviirus/koroonakaart'

Download and parse the website.

In [35]:
r2 = requests.get(url2)
soup = BeautifulSoup(r2.text, features="html.parser")

Values that need to be extracted and their translation.

In [36]:
toScrp = {"KINNITATUD SURMAD": "deceasedCases",
        "HAIGLAST VÄLJAKIRJUTATUD": "recoveredCases",
        "HAIGLARAVIL": "hospitalisedCases" }

In [37]:
for scrp in toScrp.keys():
    res = soup.find(string=re.compile(scrp)) # look for the label text (string= replaces the older text= argument)
    exstr = res.find_parent('div').text # extract the text of the parent container
    dc[toScrp[scrp]] = int(re.findall(r"\n([0-9]{1,4})\n", exstr)[0]) # extract the case number with a regular expression
    print("{} ({}): {}".format(scrp, toScrp[scrp], dc[toScrp[scrp]]))

KINNITATUD SURMAD (deceasedCases): 65
HAIGLAST VÄLJAKIRJUTATUD (recoveredCases): 318
HAIGLARAVIL (hospitalisedCases): 39
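
The regular expression used for the extraction can be tested in isolation. In this sketch, `container_text` is a hypothetical example of the text inside a parent container; the real page layout may differ and can change at any time. The helper `first_count` is likewise hypothetical and returns `None` instead of raising when no number is found.

```python
import re

# Hypothetical text of a parent <div>: a number on its own line next to a label
container_text = "\n65\nKINNITATUD SURMAD\n"


def first_count(text: str):
    """Return the first 1-4 digit number that stands alone on a line, or None."""
    matches = re.findall(r"\n([0-9]{1,4})\n", text)
    return int(matches[0]) if matches else None


print(first_count(container_text))
print(first_count("KINNITATUD SURMAD"))  # no standalone number in this text
```

Returning `None` makes a changed page layout visible to the caller, whereas indexing an empty match list would stop the notebook with an `IndexError`.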


### Export Final Data

The dictionary ```dc``` stores all derived statistics.

In [38]:
dc

{'totalTested': 76590,
 'lastUpdate': '25.05.2020 10:52:11',
 'firstTest': '05.02.2020',
 'lastTest': '24.05.2020',
 'totalPositive': 1824,
 'totalNegative': 74766,
 'percPositive': 0.0238,
 'prevDayConfirmed': 1,
 'prevDayTests': 812,
 'sumLast14D': 83.0,
 'deceasedCases': 65,
 'recoveredCases': 318,
 'hospitalisedCases': 39}

In [41]:
import json

In [42]:
with open(r'data/cov_stats_eesti.json', 'w') as f:
    json.dump(dc, f, indent=4)
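
A quick way to verify such an export is to read the file back and compare it with the original dictionary. The sketch below uses a temporary directory and a hypothetical file name, with a few values copied from ```dc```; it is not part of the notebook's pipeline.

```python
import json
import os
import tempfile

# A few of the derived statistics from this notebook
stats = {"totalTested": 76590, "totalPositive": 1824, "lastUpdate": "25.05.2020 10:52:11"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cov_stats_example.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored == stats)
```

Note that `open()` does not create missing folders, so a target directory such as `data/` has to exist before the file is written.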

The dataframe ```dfts``` contains the timeseries for the whole of Estonia.

In [43]:
dfts

| | StatisticsDate | negativeTests | confirmedCases | testsPerDay | cumulativeNegative | cumulativePositive | testsPerformed | positiveTestsPerc | positiveTestsPercCum | lastFeature |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-05 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0000 | 0.0000 | |
| 1 | 2020-02-06 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 2.0 | 0.0000 | 0.0000 | |
| 2 | 2020-02-07 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | | 0.0000 | |
| 3 | 2020-02-08 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | | 0.0000 | |
| 4 | 2020-02-09 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | | 0.0000 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 105 | 2020-05-20 | 1269.0 | 6.0 | 1275.0 | 71923.0 | 1800.0 | 73723.0 | 0.0047 | 0.0244 | |
| 106 | 2020-05-21 | 869.0 | 7.0 | 876.0 | 72792.0 | 1807.0 | 74599.0 | 0.0080 | 0.0242 | |
| 107 | 2020-05-22 | 636.0 | 14.0 | 650.0 | 73428.0 | 1821.0 | 75249.0 | 0.0215 | 0.0242 | |
| 108 | 2020-05-23 | 527.0 | 2.0 | 529.0 | 73955.0 | 1823.0 | 75778.0 | 0.0038 | 0.0241 | |


In [44]:
#dfts = dfts.reset_index()
#dfts = dfts.rename(column={})
dfts.to_csv(r'data/cov_ts_eesti.csv', index=False)

The dataframe ```dftsm_all``` contains the timeseries for each county.

In [45]:
dftsm_all

| | StatisticsDate | negativeTests | confirmedCases | testsPerDay | cumulativeNegative | cumulativePositive | testsPerformed | lastFeature | County | positiveTestsPerc | positiveTestsPercCum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 1 | 2020-02-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 2 | 2020-02-07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 3 | 2020-02-08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| 4 | 2020-02-09 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | Tartu maakond | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1645 | 2020-05-20 | 24.0 | 0.0 | 24.0 | 1929.0 | 13.0 | 1942.0 | | Järva maakond | 0.0 | 0.0067 |
| 1646 | 2020-05-21 | 9.0 | 0.0 | 9.0 | 1938.0 | 13.0 | 1951.0 | | Järva maakond | 0.0 | 0.0067 |
| 1647 | 2020-05-22 | 20.0 | 0.0 | 20.0 | 1958.0 | 13.0 | 1971.0 | | Järva maakond | 0.0 | 0.0066 |
| 1648 | 2020-05-23 | 16.0 | 0.0 | 16.0 | 1974.0 | 13.0 | 1987.0 | | Järva maakond | 0.0 | 0.0065 |


In [46]:
dftsm_all.to_csv(r'data/ts_maakond.csv', index=False)