# Covid-19 data download & processing
---
This Python script downloads an up-to-date **Covid-19** dataset and exports the data to the data folder.

The data comes from the **R**obert **K**och **I**nstitut (RKI) and is downloaded via [ArcGIS Hub](https://hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6?page=15976).

*The script was created with Python 3.7.6 (64-bit kernel).*

In [1]:
import pandas as pd
import math

import io               # file operations
import json

import ssl              # secure client-server connection
import requests         # HTTP requests

In [2]:
# Uncomment next 2 lines to install jsonmerge
#import sys
#!{sys.executable} -m pip install jsonmerge
from jsonmerge import Merger

In [3]:
sourceURL = 'https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/RKI_COVID19/FeatureServer/0/query?'
objectIdsQuery = 'where=1%3D1&returnIdsOnly=true&f=json'
dataSetQuery = 'where=ObjectId+BETWEEN+0+AND+0' # placeholder only; the real query is built dynamically later
dataQuery = '&outSR=4326&outFields=IdBundesland,Bundesland,Landkreis,Altersgruppe,AnzahlFall,AnzahlTodesfall,ObjectId,Meldedatum,IdLandkreis,Datenstand,NeuerFall,NeuerTodesfall,Refdatum,NeuGenesen,AnzahlGenesen,IstErkrankungsbeginn&f=json'
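
The query strings above are written by hand, with percent-encoding already applied. As a minimal sketch, the same parameters could be assembled with `urllib.parse.urlencode` from the standard library. The helper name `build_query` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

def build_query(where, out_fields=None, ids_only=False):
    """Assemble an ArcGIS-style query string (hypothetical helper)."""
    params = {'where': where}
    if ids_only:
        params['returnIdsOnly'] = 'true'
    if out_fields:
        params['outSR'] = 4326
        params['outFields'] = ','.join(out_fields)
    params['f'] = 'json'
    # safe=',' keeps the commas of the field list unescaped
    return urlencode(params, safe=',')

ids_query = build_query('1=1', ids_only=True)
print(ids_query)
```

With these inputs the result matches `objectIdsQuery`, and a `where` clause such as `'ObjectId BETWEEN 0 AND 0'` is encoded with `+` for spaces, as in `dataSetQuery`.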

## Requesting which features (ObjectIds) are available

In [4]:
objectIdsRequest = requests.get(sourceURL + objectIdsQuery)
objectIdsRequest.status_code

200

In [5]:
objectIds = json.loads(objectIdsRequest.text)

numOfObjectIds = len(objectIds['objectIds'])

objectIdStart = objectIds['objectIds'][0]
objectIdEnd = objectIds['objectIds'][numOfObjectIds - 1]
print(f'Range of ObjectIds: [{objectIdStart}, {objectIdEnd}]')

Range of ObjectIds: [31424304, 31619256]


## Requesting Features

In [6]:
dataRequest = requests.get(sourceURL + 'where=1%3D1' + dataQuery)
dataRequest.status_code

200

In [7]:
data = json.loads(dataRequest.text)
maxApiRequest = len(data['features'])

neededRequests = math.ceil(numOfObjectIds / maxApiRequest)

print(f'The download will require {neededRequests - 1} more requests due to the server limit of {maxApiRequest} features/request.')

The download will require 38 more requests due to the server limit of 5000 features/request.
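
The request count follows from a ceiling division. The notebook does not print `numOfObjectIds`, so the worked example below derives the total from the printed id range, assuming the ObjectIds are contiguous.

```python
import math

object_id_start, object_id_end = 31424304, 31619256  # range printed by the notebook
max_api_request = 5000                               # server limit reported above

num_ids = object_id_end - object_id_start + 1        # assumption: contiguous ids
needed_requests = math.ceil(num_ids / max_api_request)

# total ids, total requests, requests after the initial download
print(num_ids, needed_requests, needed_requests - 1)
```

Under that assumption the range holds 194953 ids, which gives 39 requests in total and 38 after the initial one.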


In [8]:
# JSON merger with a custom rule: append the 'features' lists instead of overwriting them
jsonMergeSchema = {"properties":{"features":{"mergeStrategy":"append"}}}
dataMerger = Merger(jsonMergeSchema)

In [9]:
i = 0
rangeLowerEnd = data['features'][maxApiRequest - 1]['attributes']['ObjectId'] + 1
rangeUpperEnd = rangeLowerEnd + maxApiRequest

while (i < neededRequests - 1): # neededRequests - 1 because of initial download
    dataSetQuery = f'where=ObjectId+BETWEEN+{rangeLowerEnd}+AND+{rangeUpperEnd}'
    temp_sourceURL = sourceURL + dataSetQuery + dataQuery
    print(i, f'Pulling ObjectIds: [{rangeLowerEnd}, {rangeUpperEnd}]')

    temp_dataRequest = requests.get(temp_sourceURL)
    if temp_dataRequest.status_code != 200: # stop when a request fails
        print(f'Error in request: {temp_dataRequest.status_code}')
        break
    temp_data = json.loads(temp_dataRequest.text)

    # append new data to already downloaded one
    data = dataMerger.merge(data, temp_data)

    temp_dataLength = len(data['features'])
    t_le = data['features'][0]['attributes']['ObjectId']
    t_ue = data['features'][temp_dataLength - 1]['attributes']['ObjectId']
    print(f'Total collected features: {temp_dataLength} From ObjectIds: [{t_le}, {t_ue}]')

    rangeLowerEnd = rangeUpperEnd + 1
    rangeUpperEnd += maxApiRequest + 1
    if (rangeUpperEnd > objectIdEnd):
        rangeUpperEnd = objectIdEnd
    i += 1

print('Done')

0 Pulling ObjectIds: [31429304, 31434304]
Total collected features: 10000 From ObjectIds: [31424304, 31434303]
1 Pulling ObjectIds: [31434305, 31439305]
Total collected features: 15000 From ObjectIds: [31424304, 31439304]
2 Pulling ObjectIds: [31439306, 31444306]
Total collected features: 20000 From ObjectIds: [31424304, 31444305]
3 Pulling ObjectIds: [31444307, 31449307]
Total collected features: 25000 From ObjectIds: [31424304, 31449306]
4 Pulling ObjectIds: [31449308, 31454308]
Total collected features: 30000 From ObjectIds: [31424304, 31454307]
5 Pulling ObjectIds: [31454309, 31459309]
Total collected features: 35000 From ObjectIds: [31424304, 31459308]
6 Pulling ObjectIds: [31459310, 31464310]
Total collected features: 40000 From ObjectIds: [31424304, 31464309]
7 Pulling ObjectIds: [31464311, 31469311]
Total collected features: 45000 From ObjectIds: [31424304, 31469310]
8 Pulling ObjectIds: [31469312, 31474312]
Total collected features: 50000 From ObjectIds: [31424304, 31474311]
9
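
SQL `BETWEEN` is inclusive on both ends, so a window from `rangeLowerEnd` to `rangeLowerEnd + maxApiRequest` spans `maxApiRequest + 1` ids, while the server returns at most `maxApiRequest` features. In the log above, the first pull requests ids up to 31434304 but collects only up to 31434303, and the next pull starts at 31434305, so one id per window appears to be skipped. A minimal sketch of gap-free inclusive windows follows. The helper `id_windows` is hypothetical, and the ids are assumed to be contiguous.

```python
def id_windows(first_id, last_id, page_size):
    """Yield inclusive (lower, upper) ranges covering at most page_size ids each."""
    lower = first_id
    while lower <= last_id:
        # "- 1" because BETWEEN includes both endpoints
        upper = min(lower + page_size - 1, last_id)
        yield lower, upper
        lower = upper + 1

windows = list(id_windows(31424304, 31619256, 5000))
print(len(windows), windows[0], windows[1], windows[-1])
```

Each `(lower, upper)` pair can be formatted into `where=ObjectId+BETWEEN+{lower}+AND+{upper}` in the same way as in the loop above.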

In [10]:
print('Entries: ', len(data['features']))
print('Structure: ', data['features'][0])
print('Latest data: ', data['features'][0]['attributes']['Datenstand'])

Entries:  194916
Structure:  {'attributes': {'IdBundesland': 1, 'Bundesland': 'Schleswig-Holstein', 'Landkreis': 'SK Flensburg', 'Altersgruppe': 'A00-A04', 'AnzahlFall': 1, 'AnzahlTodesfall': 0, 'ObjectId': 31424304, 'Meldedatum': 1598227200000, 'IdLandkreis': '01001', 'Datenstand': '31.08.2020, 00:00 Uhr', 'NeuerFall': 0, 'NeuerTodesfall': -9, 'Refdatum': 1598227200000, 'NeuGenesen': -9, 'AnzahlGenesen': 0, 'IstErkrankungsbeginn': 0}}
Latest data:  31.08.2020, 00:00 Uhr
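
`Meldedatum` and `Refdatum` arrive as large integers. They look like Unix timestamps in milliseconds, which is an assumption based on their magnitude and not stated in the notebook. Under that assumption, a value can be turned into a calendar date with the standard library alone; `epoch_ms_to_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def epoch_ms_to_date(ms):
    """Convert milliseconds since 1970-01-01 UTC to a date (assumed encoding)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()

meldedatum = epoch_ms_to_date(1598227200000)  # value from the first feature above
print(meldedatum)  # 2020-08-24
```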


In [11]:
def display_n(df,n): 
    with pd.option_context('display.max_rows',n*2):
        display(df)

In [12]:
dfx = pd.DataFrame.from_dict(data['features'])
display_n(dfx, 2)

,attributes
0,"{'IdBundesland': 1, 'Bundesland': 'Schleswig-H..."
1,"{'IdBundesland': 1, 'Bundesland': 'Schleswig-H..."
...,...
194914,"{'IdBundesland': 16, 'Bundesland': 'Thüringen'..."
194915,"{'IdBundesland': 16, 'Bundesland': 'Thüringen'..."


In [13]:
# turn the 'attributes' column into separate columns, one per field
for rowid in data['fields']:
    dfx[rowid['name']] = dfx.apply(lambda row: row.loc['attributes'][rowid['name']], axis=1)
dfx = dfx.drop(['attributes'], axis=1)
display_n(dfx, 2)

,IdBundesland,Bundesland,Landkreis,Altersgruppe,AnzahlFall,AnzahlTodesfall,ObjectId,Meldedatum,IdLandkreis,Datenstand,NeuerFall,NeuerTodesfall,Refdatum,NeuGenesen,AnzahlGenesen,IstErkrankungsbeginn
0,1,Schleswig-Holstein,SK Flensburg,A00-A04,1,0,31424304,1598227200000,01001,"31.08.2020, 00:00 Uhr",0,-9,1598227200000,-9,0,0
1,1,Schleswig-Holstein,SK Flensburg,A05-A14,1,0,31424305,1597449600000,01001,"31.08.2020, 00:00 Uhr",0,-9,1597449600000,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194914,16,Thüringen,LK Altenburger Land,A80+,2,0,31619255,1590624000000,16077,"31.08.2020, 00:00 Uhr",0,-9,1590624000000,0,2,0
194915,16,Thüringen,LK Altenburger Land,A80+,1,0,31619256,1591660800000,16077,"31.08.2020, 00:00 Uhr",0,-9,1591660800000,0,1,0
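
The row-wise `apply` in the cell above calls a Python lambda once per row for every field, which can be slow for roughly 195,000 rows. A possible alternative is to build the frame directly from the list of attribute dictionaries. The sketch below uses two shortened records from the output above, and the variable names `features` and `flat` are illustrative only.

```python
import pandas as pd

# Two records shaped like data['features'], reduced to three fields
features = [
    {'attributes': {'Landkreis': 'SK Flensburg', 'Altersgruppe': 'A00-A04', 'AnzahlFall': 1}},
    {'attributes': {'Landkreis': 'SK Flensburg', 'Altersgruppe': 'A05-A14', 'AnzahlFall': 1}},
]

# Each inner dictionary becomes one row; its keys become the columns
flat = pd.DataFrame([feature['attributes'] for feature in features])
print(flat.columns.tolist())
```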


In [14]:
data['features'][0]['attributes']['Landkreis']

'SK Flensburg'

In [15]:
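# select several columns with a list (double brackets); tuple-style selection fails on recent pandas versions
frameByLK = dfx.groupby(['Landkreis', 'IdLandkreis'])[['AnzahlFall', 'AnzahlTodesfall', 'AnzahlGenesen']].sum().reset_index().set_index('Landkreis')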

In [16]:
frameByLK

Landkreis,IdLandkreis,AnzahlFall,AnzahlTodesfall,AnzahlGenesen
LK Ahrweiler,07131,289,2,257
LK Aichach-Friedberg,09771,442,20,400
LK Alb-Donau-Kreis,08425,779,26,684
LK Altenburger Land,16077,94,4,75
LK Altenkirchen,07132,208,11,187
...,...,...,...,...
SK Worms,07319,266,8,239
SK Wuppertal,05124,1402,86,1182
SK Würzburg,09663,525,52,446
SK Zweibrücken,07320,47,1,45


In [17]:
frameByLK.to_csv('frameByLK.csv', index=True, encoding='utf-8')

In [18]:
# sum the counts per district and reference date, then accumulate them over time within each district
dfx_slim = dfx.groupby(['IdLandkreis', 'Refdatum'])[['AnzahlFall', 'AnzahlTodesfall', 'AnzahlGenesen']].sum().groupby(level=[0]).cumsum()
dfx_slim['acute'] = dfx_slim['AnzahlFall'] - (dfx_slim['AnzahlTodesfall'] + dfx_slim['AnzahlGenesen'])
display_n(dfx_slim, 3)

IdLandkreis,Refdatum,AnzahlFall,AnzahlTodesfall,AnzahlGenesen,acute
01001,1583798400000,1,0,1,0
01001,1583884800000,2,0,2,0
01001,1583971200000,3,0,3,0
...,...,...,...,...,...
16077,1598313600000,82,4,75,3
16077,1598572800000,92,4,75,13
16077,1598659200000,94,4,75,15


In [19]:
dfx_slim.to_csv('timeFrameByLK.csv', index=True, encoding='utf-8')