# Get and Preprocess data from Estonian Open Data Platform

**Index:**
1. [Data acquisition](#Data-acquisition)
2. [Investigation Data Structure](#Investigating-data-structure)
3. [Converting Data](#Converting-dataset)
4. [Data analysis](#Data-analysis)
5. [Get additional Data](#Get-additional-data)
6. [Export Final Data](#Export-final-values)

### Data acquisition 

In [48]:
import requests
import json
import pandas as pd
from datetime import datetime as dt
from time import sleep
from aglearn import lprint
from aglearn import remap as rm

Information on the COVID-19 open data can be found on the official Terviseamet website

[Koroonaviirus SARS-CoV-2 testide avaandmete kirjeldus](https://www.terviseamet.ee/et/koroonaviirus/avaandmed)

In the section Testide avaandmete andmestruktuuri kirjeldus > Avaandmete lingid is the URL to the JSON file in which all results are published.

In [49]:
url = r'https://opendata.digilugu.ee/opendata_covid19_test_results.json' # JSON document

With the [requests](https://requests.readthedocs.io/en/master/) library the file can be downloaded and read into variable ```d```.

In [50]:
import requests 

In [51]:
r = requests.get(url)
d = r.json()

The [list](https://www.w3schools.com/python/python_lists.asp) ```d``` now contains all the results from the COVID-19 testing. The amount of tests performed can be checked with ```len()```.

In [52]:
print(len(d))

51185


In the header data for the ```requests``` response is the ```Last-Modified``` date.

In [53]:
print(r.headers['Last-Modified'])

Wed, 29 Apr 2020 07:55:51 GMT


The derived results will be collected in the [dictionary](https://www.w3schools.com/python/python_dictionaries.asp) ```dc```.

In [54]:
dc = {}

In [55]:
#from datetime import datetime
from dateutil import tz
from dateutil.parser import parse
tzone = tz.gettz('Europe/Tallinn')

In [56]:
dc['totalTested'] = len(d)
dc['lastUpdate'] = parse(r.headers['Last-Modified']).astimezone(tzone).strftime('%d.%m.%Y %H:%M:%S')

### Investigating data structure

We can access the indivudal entries of the list ```d```. The first one with ```d[0]``` or the last one with ```d[-1]```. The list items are of type dictionary.

In [57]:
d[20]

{'id': '191b343ea1cbb67fbeb1f8440d80acfa1cc1df1c66e2921706935381ba67a765',
 'Gender': 'N',
 'AgeGroup': '35-39',
 'Country': 'Eesti',
 'County': 'Tartu maakond',
 'ResultValue': 'N',
 'StatisticsDate': '2020-03-11',
 'ResultTime': '2020-03-09T22:00:00+02:00',
 'AnalysisInsertTime': '2020-03-11T10:13:21+02:00'}

The fields in the resulting dictionary are described on the [official website](https://www.terviseamet.ee/et/koroonaviirus/avaandmed). The fields can be accessed individually:

In [58]:
d[20]['AgeGroup']

'35-39'

### Converting dataset

The [pandas](https://pandas.pydata.org/) library can be used to get the data in a more accessible way. It provides functions for further data analysis.

In [59]:
import pandas as pd

Convert the list with dictionary items to pandas dataframe. Display the first 5 rows.

In [60]:
df = pd.DataFrame(d)
df.head()

Unnamed: 0,id,Gender,AgeGroup,Country,County,ResultValue,StatisticsDate,ResultTime,AnalysisInsertTime
0,95013b64dd5ff18548a92eb5375d9c4a1881467390fed4...,M,10-14,Eesti,Tartu maakond,N,2020-03-10,2020-03-06T18:44:00+02:00,2020-03-10T16:01:55+02:00
1,71fab95aa66a3976b9d9f2868482192fc2bb77ac07d680...,M,5-9,Eesti,Tartu maakond,N,2020-03-10,2020-03-06T13:28:00+02:00,2020-03-10T16:05:53+02:00
2,e474cb8d21136013c9c90877592ee8d6b20d1bd72ef48a...,M,20-24,Eesti,Harju maakond,N,2020-03-10,2020-03-05T00:00:00+02:00,2020-03-10T15:53:52+02:00
3,86a33c6965a464b3c8b754795d99b3fccab5e8349827dc...,M,35-39,Eesti,Tartu maakond,N,2020-03-10,2020-03-05T00:00:00+02:00,2020-03-10T15:50:53+02:00
4,70fb213dfac6252426170b79224d399c6e613fbca07d54...,N,15-19,Eesti,Viljandi maakond,N,2020-03-10,2020-03-06T18:46:00+02:00,2020-03-10T15:59:21+02:00


#### Data cleaning

In [61]:
from aglearn import remap as rm # class has 

The ```ResultTime``` and ```AnalysisInsertTime``` are not of importance right now.

For the Maakonds an identification code is used, as the Counties have a long "Tartu maakond" and short "Tartumaa" way of spelling, which might get mixed up. The dictionary is saved in the custom ```aglearn``` library. 

Slightly adapt the Agegroup (leading zeros).

Transform the text in StatisticsDate into datetime objects.

In [62]:
df = df.drop(['ResultTime', 'AnalysisInsertTime', 'id'], axis=1)
df['MKOOD'] = df['County'].map(rm.MNIMI_MKOOD)
df['AgeGroup'] = df['AgeGroup'].map(rm.VANUSER_STR)
df['StatisticsDate'] = pd.to_datetime(df['StatisticsDate'])

In [63]:
df.head()

Unnamed: 0,Gender,AgeGroup,Country,County,ResultValue,StatisticsDate,MKOOD
0,M,10-14,Eesti,Tartu maakond,N,2020-03-10,79
1,M,05-09,Eesti,Tartu maakond,N,2020-03-10,79
2,M,20-24,Eesti,Harju maakond,N,2020-03-10,37
3,M,35-39,Eesti,Tartu maakond,N,2020-03-10,79
4,N,15-19,Eesti,Viljandi maakond,N,2020-03-10,84


#### Export data

In [64]:
df.to_csv(r'covid_digilugu_cleaned.csv')

### Data analysis

In [65]:
dtFormat = '%d.%m.%Y'
dc['firstTest'] = df.StatisticsDate.min().strftime(dtFormat)
dc['lastTest'] = df.StatisticsDate.max().strftime(dtFormat)

#### Positive vs Negative

The function ```value_counts()``` can be used to summarize the respective columns.

In [66]:
df.ResultValue.value_counts()

N    49519
P     1666
Name: ResultValue, dtype: int64

In [67]:
dc['totalPositive'] = int(df.ResultValue.value_counts()['P'])
dc['totalNegative'] = int(df.ResultValue.value_counts()['N'])

In [68]:
dc['percPositive'] = round(dc['totalPositive']/dc['totalTested'],4)

#### Values Last day

In [69]:
res = df[df.StatisticsDate == dc['lastTest']].ResultValue.value_counts()
res

N    1668
P       9
Name: ResultValue, dtype: int64

In [70]:
dc['prevDayTests'] = int(res['N'] + res['P'])
dc['prevDayConfirmed'] = int(res['P'])

#### Timeseries Estonia

For further statistics a one-hot encoding has to be applied on the dataset. The result is joined with the dataframe.

In [71]:
df1h = pd.get_dummies(df[['ResultValue']], prefix=['Results'])
df1h = df1h.rename(columns={'Results_N' : 'negativeTests', 'Results_P' : 'confirmedCases'})
df1h = df.join(df1h)
df1h[-10:]

Unnamed: 0,Gender,AgeGroup,Country,County,ResultValue,StatisticsDate,MKOOD,negativeTests,confirmedCases
51175,N,40-44,Eesti,Harju maakond,N,2020-04-28,37,1,0
51176,M,30-34,Eesti,Ida-Viru maakond,N,2020-04-28,45,1,0
51177,N,25-29,Eesti,Tartu maakond,N,2020-04-28,79,1,0
51178,M,45-49,Eesti,Ida-Viru maakond,N,2020-04-28,45,1,0
51179,M,35-39,Eesti,Valga maakond,N,2020-04-28,81,1,0
51180,M,30-34,Eesti,Lääne-Viru maakond,N,2020-04-28,60,1,0
51181,N,60-64,Eesti,Harju maakond,N,2020-04-28,37,1,0
51182,N,35-39,Eesti,Harju maakond,N,2020-04-28,37,1,0
51183,M,30-34,Eesti,Harju maakond,N,2020-04-28,37,1,0
51184,M,75-79,Eesti,Tartu maakond,N,2020-04-28,79,1,0


In [72]:
newDateRange = pd.date_range(start=dt.strptime(dc['firstTest'], dtFormat), end=dt.strptime(dc['lastTest'], dtFormat), freq='1D')

In [73]:
dfts = df1h.groupby(['StatisticsDate']).sum()
dfts['testsPerDay'] = df1h.groupby(['StatisticsDate']).count().values[:,1]
dfts = dfts.reindex(newDateRange)
dfts = dfts.fillna(0)
dfts

Unnamed: 0,negativeTests,confirmedCases,testsPerDay
2020-02-05,1.0,0.0,1.0
2020-02-06,1.0,0.0,1.0
2020-02-07,0.0,0.0,0.0
2020-02-08,0.0,0.0,0.0
2020-02-09,0.0,0.0,0.0
...,...,...,...
2020-04-24,1023.0,29.0,1052.0
2020-04-25,594.0,9.0,603.0
2020-04-26,466.0,4.0,470.0
2020-04-27,1101.0,10.0,1111.0


Cumulative Sums

In [74]:
dfts['cumulativeNegative'] = dfts['negativeTests'].cumsum()
dfts['cumulativePositive'] = dfts['confirmedCases'].cumsum()
dfts['testsPerformed'] = dfts['testsPerDay'].cumsum()
dfts

Unnamed: 0,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed
2020-02-05,1.0,0.0,1.0,1.0,0.0,1.0
2020-02-06,1.0,0.0,1.0,2.0,0.0,2.0
2020-02-07,0.0,0.0,0.0,2.0,0.0,2.0
2020-02-08,0.0,0.0,0.0,2.0,0.0,2.0
2020-02-09,0.0,0.0,0.0,2.0,0.0,2.0
...,...,...,...,...,...,...
2020-04-24,1023.0,29.0,1052.0,45690.0,1634.0,47324.0
2020-04-25,594.0,9.0,603.0,46284.0,1643.0,47927.0
2020-04-26,466.0,4.0,470.0,46750.0,1647.0,48397.0
2020-04-27,1101.0,10.0,1111.0,47851.0,1657.0,49508.0


Percentages 

In [75]:
dfts['positiveTestsPerc'] = (dfts['confirmedCases' ]/dfts['testsPerDay']).round(4)
dfts['positiveTestsPercCum'] = (dfts['cumulativePositive' ]/dfts['testsPerformed']).round(4)
dfts.loc[dfts.index[-1], 'lastFeature'] = 1
dfts = dfts.reset_index().rename(columns={'index':'StatisticsDate'})
dfts

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,positiveTestsPerc,positiveTestsPercCum,lastFeature
0,2020-02-05,1.0,0.0,1.0,1.0,0.0,1.0,0.0000,0.0000,
1,2020-02-06,1.0,0.0,1.0,2.0,0.0,2.0,0.0000,0.0000,
2,2020-02-07,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
3,2020-02-08,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
4,2020-02-09,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
...,...,...,...,...,...,...,...,...,...,...
79,2020-04-24,1023.0,29.0,1052.0,45690.0,1634.0,47324.0,0.0276,0.0345,
80,2020-04-25,594.0,9.0,603.0,46284.0,1643.0,47927.0,0.0149,0.0343,
81,2020-04-26,466.0,4.0,470.0,46750.0,1647.0,48397.0,0.0085,0.0340,
82,2020-04-27,1101.0,10.0,1111.0,47851.0,1657.0,49508.0,0.0090,0.0335,


#### Timeseries Maakond

In [76]:
counties = list(df1h['County'].unique())
counties.remove('')
counties

['Tartu maakond',
 'Harju maakond',
 'Viljandi maakond',
 'Valga maakond',
 'Võru maakond',
 'Pärnu maakond',
 'Jõgeva maakond',
 'Lääne maakond',
 'Saare maakond',
 'Lääne-Viru maakond',
 'Põlva maakond',
 'Ida-Viru maakond',
 'Rapla maakond',
 'Hiiu maakond',
 'Järva maakond']

In [77]:
dftsm = df1h.loc[df1h['County'] == county]
dftsm = dftsm.groupby(['StatisticsDate']).sum() # group by date and county
dftsm['testsPerDay'] = df1h.loc[df1h['County'] == county].groupby(['StatisticsDate']).count().values[:,1]

In [78]:
i = 0
for county in counties:
    dftsm0 = df1h.loc[df1h['County'] == county] # select a subset 
    dftsm = dftsm0.groupby(['StatisticsDate']).sum() # group by date and county
    dftsm['testsPerDay'] = dftsm0.groupby(['StatisticsDate']).count().values[:,1]
    dftsm = dftsm.reindex(newDateRange)
    dftsm = dftsm.fillna(0)
    dftsm['cumulativeNegative'] = dftsm['negativeTests'].cumsum()
    dftsm['cumulativePositive'] = dftsm['confirmedCases'].cumsum()
    dftsm['testsPerformed'] = dftsm['testsPerDay'].cumsum()
    dftsm.loc[dftsm.index[-1], 'lastFeature'] = 1
    dftsm['County'] = county
    #dftsm['MKOOD'] = rm.MNIMI_MKOOD[county]
    if i == 0:
        dftsm_all = dftsm
        i += 1
    else:
        dftsm_all = dftsm_all.append(dftsm)
dftsm_all['positiveTestsPerc'] = (dftsm_all['confirmedCases' ]/dftsm_all['testsPerDay']).round(4)
dftsm_all['positiveTestsPercCum'] = (dftsm_all['cumulativePositive' ]/dftsm_all['testsPerformed']).round(4)
dftsm_all = dftsm_all.reset_index().rename(columns={'index':'StatisticsDate'})
dftsm_all

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,lastFeature,County,positiveTestsPerc,positiveTestsPercCum
0,2020-02-05,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
1,2020-02-06,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
2,2020-02-07,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
3,2020-02-08,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
4,2020-02-09,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
...,...,...,...,...,...,...,...,...,...,...,...
1255,2020-04-24,29.0,0.0,29.0,1176.0,12.0,1188.0,,Järva maakond,0.0,0.0101
1256,2020-04-25,13.0,0.0,13.0,1189.0,12.0,1201.0,,Järva maakond,0.0,0.0100
1257,2020-04-26,9.0,0.0,9.0,1198.0,12.0,1210.0,,Järva maakond,0.0,0.0099
1258,2020-04-27,55.0,0.0,55.0,1253.0,12.0,1265.0,,Järva maakond,0.0,0.0095


### Get additional data

#### Webscraping Terviseamet

Some data is not available in the open data, but can be acquired from the website. 
The library [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) can be utilized to scrape the data. 

In [79]:
from bs4 import BeautifulSoup
import re

In [80]:
url2 = r'https://www.terviseamet.ee/et/koroonaviirus/koroonakaart'

Download and parse the website.

In [81]:
r2 = requests.get(url2)
soup = BeautifulSoup(r2.text, features="html.parser")

Values that need to be extracted and their translation.

In [82]:
toScrp = {"KINNITATUD SURMAD": "deceasedCases",
        "HAIGLAST VÄLJAKIRJUTATUD": "hospitalisedCases",
        "HAIGLARAVIL": "recoveredCases" }

In [83]:
for scrp in toScrp.keys():
    print(scrp)
    res = soup.find(text=re.compile(scrp)) # looks for the string
    exstr = res.find_parent('div').text #extracts the text from the parent container
    dc[toScrp[scrp]] = int(re.findall(r"\n([0-9]{1,4})\n", exstr)[0]) # extracts the case number with regular expression

KINNITATUD SURMAD
HAIGLAST VÄLJAKIRJUTATUD
HAIGLARAVIL


### Export Final Data

In the dictionary ```dc``` all the derived statistics are stored.

In [84]:
dc

{'totalTested': 51185,
 'lastUpdate': '29.04.2020 10:55:51',
 'firstTest': '05.02.2020',
 'lastTest': '28.04.2020',
 'totalPositive': 1666,
 'totalNegative': 49519,
 'percPositive': 0.0325,
 'prevDayTests': 1677,
 'prevDayConfirmed': 9,
 'deceasedCases': 50,
 'hospitalisedCases': 236,
 'recoveredCases': 89}

In [85]:
with open(r'data/cov_stats_eesti.json', 'w') as f:
    json.dump(dc, f, indent=4)

The dataframe ```dfts``` contains the timeseries for whole Estonia.

In [86]:
dfts

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,positiveTestsPerc,positiveTestsPercCum,lastFeature
0,2020-02-05,1.0,0.0,1.0,1.0,0.0,1.0,0.0000,0.0000,
1,2020-02-06,1.0,0.0,1.0,2.0,0.0,2.0,0.0000,0.0000,
2,2020-02-07,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
3,2020-02-08,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
4,2020-02-09,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
...,...,...,...,...,...,...,...,...,...,...
79,2020-04-24,1023.0,29.0,1052.0,45690.0,1634.0,47324.0,0.0276,0.0345,
80,2020-04-25,594.0,9.0,603.0,46284.0,1643.0,47927.0,0.0149,0.0343,
81,2020-04-26,466.0,4.0,470.0,46750.0,1647.0,48397.0,0.0085,0.0340,
82,2020-04-27,1101.0,10.0,1111.0,47851.0,1657.0,49508.0,0.0090,0.0335,


In [87]:
#dfts = dfts.reset_index()
#dfts = dfts.rename(column={})
dfts.to_csv(r'data/cov_ts_eesti.csv', index=False)

The dataframe ```dftsm_all``` contains the timeseries for each County.

In [88]:
dftsm_all

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,lastFeature,County,positiveTestsPerc,positiveTestsPercCum
0,2020-02-05,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
1,2020-02-06,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
2,2020-02-07,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
3,2020-02-08,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
4,2020-02-09,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
...,...,...,...,...,...,...,...,...,...,...,...
1255,2020-04-24,29.0,0.0,29.0,1176.0,12.0,1188.0,,Järva maakond,0.0,0.0101
1256,2020-04-25,13.0,0.0,13.0,1189.0,12.0,1201.0,,Järva maakond,0.0,0.0100
1257,2020-04-26,9.0,0.0,9.0,1198.0,12.0,1210.0,,Järva maakond,0.0,0.0099
1258,2020-04-27,55.0,0.0,55.0,1253.0,12.0,1265.0,,Järva maakond,0.0,0.0095


In [89]:
dftsm_all.to_csv(r'data/ts_maakond.csv', index=False)