# Get and preprocess data from the Estonian Open Data Platform

**Index:**
1. [Data acquisition](#Data-acquisition)
2. [Investigating Data Structure](#Investigating-data-structure)
3. [Converting Data](#Converting-dataset)
4. [Data analysis](#Data-analysis)
5. [Get additional Data](#Get-additional-data)
6. [Export Final Data](#Export-Final-Data)

### Data acquisition 

In [1]:
import requests
import json
import pandas as pd
from datetime import datetime as dt
from time import sleep
from aglearn import remap as rm

Information on the COVID-19 open data is available on the official website of Terviseamet (the Estonian Health Board):

[Koroonaviirus SARS-CoV-2 testide avaandmete kirjeldus](https://www.terviseamet.ee/et/koroonaviirus/avaandmed)

The URL of the JSON file in which all results are published is listed in the section Testide avaandmete andmestruktuuri kirjeldus > Avaandmete lingid.

In [2]:
url = r'https://opendata.digilugu.ee/opendata_covid19_test_results.json' # JSON document

With the [requests](https://requests.readthedocs.io/en/master/) library, the file can be downloaded and parsed into the variable ```d```.

In [3]:
import requests 

In [4]:
r = requests.get(url)
d = r.json()

The [list](https://www.w3schools.com/python/python_lists.asp) ```d``` now contains all the results from the COVID-19 testing. The number of tests performed can be checked with ```len()```.

In [5]:
print(len(d))

55206


The ```Last-Modified``` date is available in the headers of the ```requests``` response.

In [6]:
print(r.headers['Last-Modified'])

Sun, 03 May 2020 07:54:39 GMT


The derived results will be collected in the [dictionary](https://www.w3schools.com/python/python_dictionaries.asp) ```dc```.

In [7]:
dc = {}

In [8]:
#from datetime import datetime
from dateutil import tz
from dateutil.parser import parse
tzone = tz.gettz('Europe/Tallinn')

In [9]:
dc['totalTested'] = len(d)
dc['lastUpdate'] = parse(r.headers['Last-Modified']).astimezone(tzone).strftime('%d.%m.%Y %H:%M:%S')
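As a cross-check of the conversion above, the same ```Last-Modified``` string can be parsed with the standard library alone. This is a minimal sketch: the helper name ```format_last_modified``` is hypothetical, and a fixed UTC+3 offset (Estonian summer time) stands in for the ```Europe/Tallinn``` zone used in the notebook, so it is only valid for summer dates.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: fixed UTC+3 offset (EEST); the notebook uses tz.gettz('Europe/Tallinn')
EEST = timezone(timedelta(hours=3), 'EEST')

def format_last_modified(http_date, tzinfo=EEST):
    """Parse an HTTP date header and format it in the given time zone."""
    parsed = parsedate_to_datetime(http_date)  # timezone-aware datetime (GMT)
    return parsed.astimezone(tzinfo).strftime('%d.%m.%Y %H:%M:%S')

last_update = format_last_modified('Sun, 03 May 2020 07:54:39 GMT')
print(last_update)
```

For the header value shown above this gives the same local time as the ```dateutil``` based cell.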

### Investigating data structure

We can access the individual entries of the list ```d```, for example the first one with ```d[0]``` or the last one with ```d[-1]```. Each list item is a dictionary.

In [10]:
d[20]

{'id': '191b343ea1cbb67fbeb1f8440d80acfa1cc1df1c66e2921706935381ba67a765',
 'Gender': 'N',
 'AgeGroup': '35-39',
 'Country': 'Eesti',
 'County': 'Tartu maakond',
 'ResultValue': 'N',
 'StatisticsDate': '2020-03-11',
 'ResultTime': '2020-03-09T22:00:00+02:00',
 'AnalysisInsertTime': '2020-03-11T10:13:21+02:00'}

The fields in the resulting dictionary are described on the [official website](https://www.terviseamet.ee/et/koroonaviirus/avaandmed). The fields can be accessed individually:

In [11]:
d[20]['AgeGroup']

'35-39'
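Because each entry is a plain dictionary, simple summaries are possible without any additional library. The sketch below counts the values of one field with ```collections.Counter```; the three shortened records are taken from the dataset shown in this notebook, and the helper name ```count_field``` is hypothetical.

```python
from collections import Counter

# Shortened sample records (only the fields needed for the illustration)
sample = [
    {'Gender': 'M', 'AgeGroup': '10-14', 'County': 'Tartu maakond', 'ResultValue': 'N'},
    {'Gender': 'M', 'AgeGroup': '20-24', 'County': 'Harju maakond', 'ResultValue': 'N'},
    {'Gender': 'N', 'AgeGroup': '15-19', 'County': 'Viljandi maakond', 'ResultValue': 'N'},
]

def count_field(records, field):
    """Count how often each value of `field` occurs in a list of dictionaries."""
    return Counter(record.get(field) for record in records)

gender_counts = count_field(sample, 'Gender')
print(gender_counts)
```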

### Converting dataset

The [pandas](https://pandas.pydata.org/) library can be used to get the data in a more accessible way. It provides functions for further data analysis.

In [12]:
import pandas as pd

Convert the list of dictionaries to a pandas DataFrame and display the first 5 rows.

In [13]:
df = pd.DataFrame(d)
df.head()

Unnamed: 0,id,Gender,AgeGroup,Country,County,ResultValue,StatisticsDate,ResultTime,AnalysisInsertTime
0,95013b64dd5ff18548a92eb5375d9c4a1881467390fed4...,M,10-14,Eesti,Tartu maakond,N,2020-03-10,2020-03-06T18:44:00+02:00,2020-03-10T16:01:55+02:00
1,71fab95aa66a3976b9d9f2868482192fc2bb77ac07d680...,M,5-9,Eesti,Tartu maakond,N,2020-03-10,2020-03-06T13:28:00+02:00,2020-03-10T16:05:53+02:00
2,e474cb8d21136013c9c90877592ee8d6b20d1bd72ef48a...,M,20-24,Eesti,Harju maakond,N,2020-03-10,2020-03-05T00:00:00+02:00,2020-03-10T15:53:52+02:00
3,86a33c6965a464b3c8b754795d99b3fccab5e8349827dc...,M,35-39,Eesti,Tartu maakond,N,2020-03-10,2020-03-05T00:00:00+02:00,2020-03-10T15:50:53+02:00
4,70fb213dfac6252426170b79224d399c6e613fbca07d54...,N,15-19,Eesti,Viljandi maakond,N,2020-03-10,2020-03-06T18:46:00+02:00,2020-03-10T15:59:21+02:00


#### Data cleaning

In [14]:
from aglearn import remap as rm # custom library that holds the lookup dictionaries

The ```ResultTime``` and ```AnalysisInsertTime``` columns are not needed for now, so they are dropped together with ```id```.

The counties (maakonnad) are identified by a code, because county names have a long form ("Tartu maakond") and a short form ("Tartumaa") that are easily mixed up. The lookup dictionary is stored in the custom ```aglearn``` library. 

The age groups are adapted slightly (leading zeros are added).

The text in ```StatisticsDate``` is converted into datetime objects.

In [15]:
df = df.drop(['ResultTime', 'AnalysisInsertTime', 'id'], axis=1)
df['MKOOD'] = df['County'].map(rm.MNIMI_MKOOD)
df['AgeGroup'] = df['AgeGroup'].map(rm.VANUSER_STR)
df['StatisticsDate'] = pd.to_datetime(df['StatisticsDate'])

In [16]:
df.head()

Unnamed: 0,Gender,AgeGroup,Country,County,ResultValue,StatisticsDate,MKOOD
0,M,10-14,Eesti,Tartu maakond,N,2020-03-10,79
1,M,05-09,Eesti,Tartu maakond,N,2020-03-10,79
2,M,20-24,Eesti,Harju maakond,N,2020-03-10,37
3,M,35-39,Eesti,Tartu maakond,N,2020-03-10,79
4,N,15-19,Eesti,Viljandi maakond,N,2020-03-10,84
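The contents of ```rm.VANUSER_STR``` are not shown here, since the dictionary lives in the custom ```aglearn``` library. As an assumption about what such a mapping does, the following sketch pads every number in an age-group label to two digits. The function name ```pad_age_group``` is hypothetical, and a label such as ```'85+'``` is only a guessed format.

```python
import re

def pad_age_group(label):
    """Pad each number in an age-group label with a leading zero ('5-9' -> '05-09')."""
    return re.sub(r'\d+', lambda m: m.group().zfill(2), label)

padded = [pad_age_group(g) for g in ['0-4', '5-9', '10-14', '85+']]
print(padded)
```

Zero-padded labels sort correctly as text, which keeps the age groups in order in tables and charts.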


#### Export data

In [17]:
df.to_csv(r'covid_digilugu_cleaned.csv')

### Data analysis

In [18]:
dtFormat = '%d.%m.%Y'
dc['firstTest'] = df.StatisticsDate.min().strftime(dtFormat)
dc['lastTest'] = df.StatisticsDate.max().strftime(dtFormat)

#### Positive vs Negative

The function ```value_counts()``` can be used to summarize the respective columns.

In [19]:
df.ResultValue.value_counts()

N    53506
P     1700
Name: ResultValue, dtype: int64

In [20]:
dc['totalPositive'] = int(df.ResultValue.value_counts()['P'])
dc['totalNegative'] = int(df.ResultValue.value_counts()['N'])

In [21]:
dc['percPositive'] = round(dc['totalPositive']/dc['totalTested'],4)
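The share of positive tests is a simple ratio. A worked check with the counts printed above, assuming a hypothetical helper ```positive_share``` that also guards against an empty dataset:

```python
def positive_share(positive, total, digits=4):
    """Return positive/total rounded to `digits`, or None if there are no tests."""
    if total == 0:
        return None
    return round(positive / total, digits)

share = positive_share(1700, 55206)
print(share)  # 1700 / 55206 = 0.03079..., rounded to 4 digits
```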

#### Values Last day

In [28]:
res = df[df.StatisticsDate == dt.strptime(dc['lastTest'],dtFormat)].ResultValue.value_counts()
res

N    742
P      1
Name: ResultValue, dtype: int64

In [30]:
dc['prevDayConfirmed'] = int(res.get('P', 0)) # 0 in case there are no positive results :)
dc['prevDayTests'] = int(res.sum()) # negative plus positive results of the last day

#### Timeseries Estonia

For further statistics, the ```ResultValue``` column has to be one-hot encoded. The result is joined with the dataframe.

In [31]:
df1h = pd.get_dummies(df[['ResultValue']], prefix=['Results'], dtype=int) # dtype=int keeps 0/1 columns in newer pandas versions
df1h = df1h.rename(columns={'Results_N' : 'negativeTests', 'Results_P' : 'confirmedCases'})
df1h = df.join(df1h)
df1h[-10:]

Unnamed: 0,Gender,AgeGroup,Country,County,ResultValue,StatisticsDate,MKOOD,negativeTests,confirmedCases
55196,N,15-19,Eesti,Põlva maakond,N,2020-05-02,64,1,0
55197,N,55-59,Eesti,Valga maakond,N,2020-05-02,81,1,0
55198,N,45-49,Eesti,Harju maakond,N,2020-05-02,37,1,0
55199,N,40-44,Eesti,Pärnu maakond,N,2020-05-02,68,1,0
55200,M,65-69,Eesti,Põlva maakond,N,2020-05-02,64,1,0
55201,M,30-34,Eesti,Harju maakond,N,2020-05-02,37,1,0
55202,N,40-44,Eesti,Rapla maakond,N,2020-05-02,71,1,0
55203,M,25-29,Eesti,Lääne maakond,N,2020-05-02,56,1,0
55204,M,30-34,Eesti,Harju maakond,N,2020-05-02,37,1,0
55205,M,45-49,Eesti,Tartu maakond,N,2020-05-02,79,1,0
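
To illustrate what the one-hot encoding does, here is a small self-contained sketch with four made-up rows. ```dtype=int``` is passed as a precaution, because newer pandas versions return boolean columns by default.

```python
import pandas as pd

# Made-up rows for illustration; only the ResultValue column matters here
toy = pd.DataFrame({'ResultValue': ['N', 'P', 'N', 'N']})

onehot = pd.get_dummies(toy[['ResultValue']], prefix=['Results'], dtype=int)
onehot = onehot.rename(columns={'Results_N': 'negativeTests',
                                'Results_P': 'confirmedCases'})
toy = toy.join(onehot)
print(toy)
```

Each row carries a 1 in exactly one of the two new columns, so summing a column counts the tests with that result.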


In [32]:
newDateRange = pd.date_range(start=dt.strptime(dc['firstTest'], dtFormat), end=dt.strptime(dc['lastTest'], dtFormat), freq='1D')

In [33]:
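grouped = df1h.groupby('StatisticsDate')
dfts = grouped[['negativeTests', 'confirmedCases']].sum() # sum only the one-hot columns
dfts['testsPerDay'] = grouped.size() # number of test records per day
dfts = dfts.reindex(newDateRange) # insert the days without tests
dfts = dfts.fillna(0)
dfts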

Unnamed: 0,negativeTests,confirmedCases,testsPerDay
2020-02-05,1.0,0.0,1.0
2020-02-06,1.0,0.0,1.0
2020-02-07,0.0,0.0,0.0
2020-02-08,0.0,0.0,0.0
2020-02-09,0.0,0.0,0.0
...,...,...,...
2020-04-28,1662.0,9.0,1671.0
2020-04-29,1538.0,23.0,1561.0
2020-04-30,1025.0,5.0,1030.0
2020-05-01,692.0,5.0,697.0
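
The effect of ```reindex``` and ```fillna``` is easiest to see on a tiny example. The sketch below uses made-up records for two dates with a gap in between; days without any test appear as rows filled with 0.

```python
import pandas as pd

# Made-up test records: two on 5 Feb, one on 8 Feb, nothing in between
toy = pd.DataFrame({
    'StatisticsDate': pd.to_datetime(['2020-02-05', '2020-02-05', '2020-02-08']),
    'confirmedCases': [0, 1, 0],
})

grouped = toy.groupby('StatisticsDate')
daily = grouped[['confirmedCases']].sum()
daily['testsPerDay'] = grouped.size()

# Continuous daily index from the first to the last test date
full_range = pd.date_range(start=daily.index.min(), end=daily.index.max(), freq='1D')
daily = daily.reindex(full_range).fillna(0)
print(daily)
```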


Cumulative Sums

In [34]:
dfts['cumulativeNegative'] = dfts['negativeTests'].cumsum()
dfts['cumulativePositive'] = dfts['confirmedCases'].cumsum()
dfts['testsPerformed'] = dfts['testsPerDay'].cumsum()
dfts

Unnamed: 0,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed
2020-02-05,1.0,0.0,1.0,1.0,0.0,1.0
2020-02-06,1.0,0.0,1.0,2.0,0.0,2.0
2020-02-07,0.0,0.0,0.0,2.0,0.0,2.0
2020-02-08,0.0,0.0,0.0,2.0,0.0,2.0
2020-02-09,0.0,0.0,0.0,2.0,0.0,2.0
...,...,...,...,...,...,...
2020-04-28,1662.0,9.0,1671.0,49509.0,1666.0,51175.0
2020-04-29,1538.0,23.0,1561.0,51047.0,1689.0,52736.0
2020-04-30,1025.0,5.0,1030.0,52072.0,1694.0,53766.0
2020-05-01,692.0,5.0,697.0,52764.0,1699.0,54463.0


Percentages 

In [35]:
dfts['positiveTestsPerc'] = (dfts['confirmedCases' ]/dfts['testsPerDay']).round(4)
dfts['positiveTestsPercCum'] = (dfts['cumulativePositive' ]/dfts['testsPerformed']).round(4)
dfts.loc[dfts.index[-1], 'lastFeature'] = 1
dfts = dfts.reset_index().rename(columns={'index':'StatisticsDate'})
dfts

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,positiveTestsPerc,positiveTestsPercCum,lastFeature
0,2020-02-05,1.0,0.0,1.0,1.0,0.0,1.0,0.0000,0.0000,
1,2020-02-06,1.0,0.0,1.0,2.0,0.0,2.0,0.0000,0.0000,
2,2020-02-07,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
3,2020-02-08,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
4,2020-02-09,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
...,...,...,...,...,...,...,...,...,...,...
83,2020-04-28,1662.0,9.0,1671.0,49509.0,1666.0,51175.0,0.0054,0.0326,
84,2020-04-29,1538.0,23.0,1561.0,51047.0,1689.0,52736.0,0.0147,0.0320,
85,2020-04-30,1025.0,5.0,1030.0,52072.0,1694.0,53766.0,0.0049,0.0315,
86,2020-05-01,692.0,5.0,697.0,52764.0,1699.0,54463.0,0.0072,0.0312,
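
The empty entries in ```positiveTestsPerc``` belong to days without tests: dividing 0 by 0 yields ```NaN``` in pandas. A short sketch with made-up rows shows this behaviour and, as one possible alternative that the notebook does not use, how such days could be set to 0 instead.

```python
import pandas as pd

toy = pd.DataFrame({'confirmedCases': [0.0, 0.0, 9.0],
                    'testsPerDay': [1.0, 0.0, 1671.0]})

# 0/0 gives NaN, so days without tests have no percentage
toy['positiveTestsPerc'] = (toy['confirmedCases'] / toy['testsPerDay']).round(4)

# Optional alternative: treat days without tests as a share of 0
toy['positiveTestsPercFilled'] = toy['positiveTestsPerc'].fillna(0)
print(toy)
```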


#### Timeseries Maakond

In [38]:
counties = [c for c in df1h['County'].unique() if c] # skip records with an empty county name
counties

['Tartu maakond',
 'Harju maakond',
 'Viljandi maakond',
 'Valga maakond',
 'Võru maakond',
 'Pärnu maakond',
 'Jõgeva maakond',
 'Lääne maakond',
 'Saare maakond',
 'Lääne-Viru maakond',
 'Põlva maakond',
 'Ida-Viru maakond',
 'Rapla maakond',
 'Hiiu maakond',
 'Järva maakond']

In [39]:
frames = []
for county in counties:
    dftsm0 = df1h.loc[df1h['County'] == county] # select the records of one county
    grouped = dftsm0.groupby('StatisticsDate') # group the county records by date
    dftsm = grouped[['negativeTests', 'confirmedCases']].sum() # daily sums of the one-hot columns
    dftsm['testsPerDay'] = grouped.size() # number of test records per day
    dftsm = dftsm.reindex(newDateRange) # insert the days without tests
    dftsm = dftsm.fillna(0)
    dftsm['cumulativeNegative'] = dftsm['negativeTests'].cumsum()
    dftsm['cumulativePositive'] = dftsm['confirmedCases'].cumsum()
    dftsm['testsPerformed'] = dftsm['testsPerDay'].cumsum()
    dftsm.loc[dftsm.index[-1], 'lastFeature'] = 1
    dftsm['County'] = county
    #dftsm['MKOOD'] = rm.MNIMI_MKOOD[county]
    frames.append(dftsm)
dftsm_all = pd.concat(frames) # DataFrame.append was removed in pandas 2.0
dftsm_all['positiveTestsPerc'] = (dftsm_all['confirmedCases' ]/dftsm_all['testsPerDay']).round(4)
dftsm_all['positiveTestsPercCum'] = (dftsm_all['cumulativePositive' ]/dftsm_all['testsPerformed']).round(4)
dftsm_all = dftsm_all.reset_index().rename(columns={'index':'StatisticsDate'})
dftsm_all

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,lastFeature,County,positiveTestsPerc,positiveTestsPercCum
0,2020-02-05,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
1,2020-02-06,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
2,2020-02-07,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
3,2020-02-08,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
4,2020-02-09,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
...,...,...,...,...,...,...,...,...,...,...,...
1315,2020-04-28,21.0,0.0,21.0,1273.0,12.0,1285.0,,Järva maakond,0.0000,0.0093
1316,2020-04-29,125.0,1.0,126.0,1398.0,13.0,1411.0,,Järva maakond,0.0079,0.0092
1317,2020-04-30,15.0,0.0,15.0,1413.0,13.0,1426.0,,Järva maakond,0.0000,0.0091
1318,2020-05-01,32.0,0.0,32.0,1445.0,13.0,1458.0,,Järva maakond,0.0000,0.0089
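
The per-county time series can also be built without an explicit loop. The following is only a sketch of an alternative, not the notebook's method: it groups by county and date at once and reindexes against the full county-by-date grid built with ```pd.MultiIndex.from_product```. The data are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'County': ['Tartu maakond', 'Tartu maakond', 'Harju maakond'],
    'StatisticsDate': pd.to_datetime(['2020-02-05', '2020-02-07', '2020-02-06']),
    'negativeTests': [1, 1, 0],
    'confirmedCases': [0, 0, 1],
})
dates = pd.date_range('2020-02-05', '2020-02-07', freq='1D')

grouped = toy.groupby(['County', 'StatisticsDate'])
daily = grouped[['negativeTests', 'confirmedCases']].sum()
daily['testsPerDay'] = grouped.size()

# Full grid of every county on every day; missing combinations become 0
grid = pd.MultiIndex.from_product([toy['County'].unique(), dates],
                                  names=['County', 'StatisticsDate'])
daily = daily.reindex(grid, fill_value=0)

# Cumulative sums restart for each county
daily['cumulativePositive'] = daily.groupby(level='County')['confirmedCases'].cumsum()
daily['testsPerformed'] = daily.groupby(level='County')['testsPerDay'].cumsum()
daily = daily.reset_index()
print(daily)
```

Compared with the loop, this keeps all counties in one dataframe from the start, at the cost of a slightly less obvious reindexing step.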


### Get additional data

#### Webscraping Terviseamet

Some data are not available in the open data but can be acquired from the website. 
The library [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) can be used to scrape the data.

(Hint: If BeautifulSoup is not installed in the Python environment, consult this [article](https://pro.arcgis.com/en/pro-app/arcpy/get-started/what-is-conda.htm) on how to clone the default environment and add packages.)

In [40]:
from bs4 import BeautifulSoup
import re

In [41]:
url2 = r'https://www.terviseamet.ee/et/koroonaviirus/koroonakaart'

Download and parse the website.

In [42]:
r2 = requests.get(url2)
soup = BeautifulSoup(r2.text, features="html.parser")
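
The extraction step that follows relies on a regular expression that picks a number standing on its own line. The sketch below demonstrates that pattern on a made-up text fragment; the real page markup is not reproduced here, so the fragment and the helper name ```extract_count``` are assumptions.

```python
import re

# Made-up text as it might come out of a parent <div>:
# label and number on separate lines
fragment = "\nKINNITATUD SURMAD\n55\n"

def extract_count(text):
    """Return the first 1-4 digit number that stands alone on a line, or None."""
    matches = re.findall(r"\n([0-9]{1,4})\n", text)
    return int(matches[0]) if matches else None

count = extract_count(fragment)
print(count)
```

Returning ```None``` when nothing matches avoids an ```IndexError``` if the page layout changes.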

Values that need to be extracted and the English keys under which they are stored ("KINNITATUD SURMAD": confirmed deaths, "HAIGLAST VÄLJAKIRJUTATUD": discharged from hospital, "HAIGLARAVIL": in hospital care).

In [43]:
toScrp = {"KINNITATUD SURMAD": "deceasedCases",
        "HAIGLAST VÄLJAKIRJUTATUD": "recoveredCases",
        "HAIGLARAVIL": "hospitalisedCases" }

In [47]:
for scrp in toScrp.keys():
    print(scrp)
    res = soup.find(string=re.compile(scrp)) # looks for the text node that contains the label
    exstr = res.find_parent('div').text # extracts the text from the parent container
    dc[toScrp[scrp]] = int(re.findall(r"\n([0-9]{1,4})\n", exstr)[0]) # extracts the case number with a regular expression

KINNITATUD SURMAD
HAIGLAST VÄLJAKIRJUTATUD
HAIGLARAVIL


### Export Final Data

In the dictionary ```dc``` all the derived statistics are stored.

In [48]:
dc

{'totalTested': 55206,
 'lastUpdate': '03.05.2020 10:54:39',
 'firstTest': '05.02.2020',
 'lastTest': '02.05.2020',
 'totalPositive': 1700,
 'totalNegative': 53506,
 'percPositive': 0.0308,
 'prevDayConfirmed': 1,
 'prevDayTests': 743,
 'deceasedCases': 55,
 'recoveredCases': 247,
 'hospitalisedCases': 75}

In [49]:
with open(r'data/cov_stats_eesti.json', 'w') as f:
    json.dump(dc, f, indent=4)
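
Note that ```open()``` fails if the ```data``` folder does not exist yet. A minimal sketch, assuming a hypothetical helper ```export_stats```, that creates the folder first; the example writes into a temporary directory so that it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path

def export_stats(stats, path):
    """Write a dictionary as indented JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', encoding='utf-8') as f:
        json.dump(stats, f, indent=4, ensure_ascii=False)
    return path

# Usage: write a small dictionary and read it back
with tempfile.TemporaryDirectory() as tmp:
    out = export_stats({'totalTested': 55206}, Path(tmp) / 'data' / 'cov_stats_eesti.json')
    reloaded = json.loads(out.read_text(encoding='utf-8'))
print(reloaded)
```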

The dataframe ```dfts``` contains the timeseries for the whole of Estonia.

In [50]:
dfts

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,positiveTestsPerc,positiveTestsPercCum,lastFeature
0,2020-02-05,1.0,0.0,1.0,1.0,0.0,1.0,0.0000,0.0000,
1,2020-02-06,1.0,0.0,1.0,2.0,0.0,2.0,0.0000,0.0000,
2,2020-02-07,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
3,2020-02-08,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
4,2020-02-09,0.0,0.0,0.0,2.0,0.0,2.0,,0.0000,
...,...,...,...,...,...,...,...,...,...,...
83,2020-04-28,1662.0,9.0,1671.0,49509.0,1666.0,51175.0,0.0054,0.0326,
84,2020-04-29,1538.0,23.0,1561.0,51047.0,1689.0,52736.0,0.0147,0.0320,
85,2020-04-30,1025.0,5.0,1030.0,52072.0,1694.0,53766.0,0.0049,0.0315,
86,2020-05-01,692.0,5.0,697.0,52764.0,1699.0,54463.0,0.0072,0.0312,


In [51]:
#dfts = dfts.reset_index()
#dfts = dfts.rename(column={})
dfts.to_csv(r'data/cov_ts_eesti.csv', index=False)

The dataframe ```dftsm_all``` contains the timeseries for each county.

In [52]:
dftsm_all

Unnamed: 0,StatisticsDate,negativeTests,confirmedCases,testsPerDay,cumulativeNegative,cumulativePositive,testsPerformed,lastFeature,County,positiveTestsPerc,positiveTestsPercCum
0,2020-02-05,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
1,2020-02-06,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
2,2020-02-07,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
3,2020-02-08,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
4,2020-02-09,0.0,0.0,0.0,0.0,0.0,0.0,,Tartu maakond,,
...,...,...,...,...,...,...,...,...,...,...,...
1315,2020-04-28,21.0,0.0,21.0,1273.0,12.0,1285.0,,Järva maakond,0.0000,0.0093
1316,2020-04-29,125.0,1.0,126.0,1398.0,13.0,1411.0,,Järva maakond,0.0079,0.0092
1317,2020-04-30,15.0,0.0,15.0,1413.0,13.0,1426.0,,Järva maakond,0.0000,0.0091
1318,2020-05-01,32.0,0.0,32.0,1445.0,13.0,1458.0,,Järva maakond,0.0000,0.0089


In [53]:
dftsm_all.to_csv(r'data/ts_maakond.csv', index=False)