# Data Processing
This notebook downloads the latest COVID-19 data from the [EU data portal](https://opendata.ecdc.europa.eu/covid19/casedistribution/json/ "EU data portal"), then cleans, transforms, and sorts it and adds derived values, producing a dataset that can be used to create timeline visualizations.

The optional saving / loading steps can be omitted, if you are only interested in the final output. They can, however, be useful if you do not want to redo every step (like downloading the 'raw' data) every time.
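
The save / load pattern can also be folded into one helper. The following is only a sketch: `load_or_fetch` and `fake_fetch` are hypothetical names that do not appear elsewhere in this notebook, and a temporary directory stands in for the `data` folders used in later cells.

```python
import json
import os
import tempfile

def load_or_fetch(cache_path, fetch):
    """Return cached JSON from cache_path if it exists; otherwise call fetch() and cache the result."""
    if os.path.exists(cache_path):
        with open(cache_path) as cache_file:
            return json.load(cache_file)
    data = fetch()
    # Create the target folder if needed, then write the cache:
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, 'w') as cache_file:
        json.dump(data, cache_file, indent=2)
    return data

# Demonstration with a stand-in fetch function instead of a real download:
calls = []

def fake_fetch():
    calls.append(1)
    return {"records": []}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "input", "covid_19_raw.json")
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(path, fake_fetch)  # reads the cache, no second fetch

print(len(calls))
```

With a helper like this, a step only does its expensive work when no saved file exists, at the cost of having to delete the file to force a refresh.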

## 0. Setup

This cell will always need to be run.

In [1]:
import os
import json

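# Relative paths below resolve against the current working directory, which is
# typically this notebook's directory when it is run in Jupyter.
# Note: os.chdir(os.getcwd()) changes nothing; start the notebook from its own directory:
os.chdir(os.getcwd())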

## 1. Get latest raw data from EU

Download the latest COVID-19 data from the EU data portal.

In [2]:
import urllib.request

print('Step 1: Hacking into the EU mainframe...')

covid_19_data_url = "https://opendata.ecdc.europa.eu/covid19/casedistribution/json/"

with urllib.request.urlopen(covid_19_data_url) as input_file:
    # The response body is bytes, so we decode it as UTF-8 text before parsing the JSON.
    # covid_19_dict is a notebook-level variable, so later cells can use it without `global`:
    covid_19_dict = json.loads(input_file.read().decode('utf-8'))

Step 1: Hacking into the EU mainframe...
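
To make the later steps easier to follow, here is a sketch of the structure they rely on: a top-level `records` list whose items are flat dictionaries. The field names are a subset of the ones later cells read or delete; the country name and every value are made up for illustration, and the real payload may contain further fields.

```python
# Hypothetical sample mirroring fields used later in this notebook; all values are invented.
sample_covid_19_dict = {
    "records": [
        {
            "dateRep": "14/12/2020",
            "day": "14",
            "month": "12",
            "year": "2020",
            "cases": 100,
            "deaths": 2,
            "countriesAndTerritories": "Exampleland",
            "popData2019": 1000000,
            "continentExp": "Europe",
        },
    ]
}

first_record = sample_covid_19_dict["records"][0]
# List the field names of one record:
print(sorted(first_record))
```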


## Optional - Save 1.
Save the output of step 1 in `/data/input` for later use

In [3]:
output_file_path = '../data/input/covid_19_raw.json'

# We save the fetched data to file:
with open(output_file_path, 'w') as output_file:
    json.dump(covid_19_dict, output_file, indent=2)

## Optional - Load 1.
Load the data needed for step 2, if the data from step 1 has already been saved in `/data/input`

In [4]:
input_file_path = '../data/input/covid_19_raw.json'

with open(input_file_path) as input_file:
    # Make input available to the next cell (notebook-level variables are shared between cells):
    covid_19_dict = json.load(input_file)

## 2. Calculate deaths and cases per capita

Creating a value for deaths and cases per capita for every record using the included deaths, cases, and 2019 population data values.

In [5]:
import copy
print('Step 2: Calculating deaths and cases per capita...')

def insert_per_cap_columns(covid_19_dict):
    # We clone the dict to keep our function PURE:
    covid_19_dict_copy = copy.deepcopy(covid_19_dict)

    covid_19_records = covid_19_dict_copy['records']

    for record in covid_19_records:
        # Skip cases in international waters etc:
        if record['popData2019'] is None:
            continue

        # Insertion of new 'columns':
        record['deaths_per_cap'] = record['deaths'] / record['popData2019']
        record['cases_per_cap'] = record['cases'] / record['popData2019']

    return covid_19_dict_copy

# Overwrite our data with data with the new 'columns':
covid_19_dict = insert_per_cap_columns(covid_19_dict)

Step 2: Calculating deaths and cases per capita...
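
A small worked example of the per-capita calculation on toy data. This is a sketch rather than the cell above: `add_per_cap` is a hypothetical helper, the records are invented, and unlike the original it also skips a population of `0` to avoid dividing by zero.

```python
import copy

def add_per_cap(records):
    """Return a copy of records with per-capita fields added where the population is known."""
    records_copy = copy.deepcopy(records)
    for record in records_copy:
        population = record.get('popData2019')
        if not population:  # skips None (and 0, an assumption beyond the original cell)
            continue
        record['deaths_per_cap'] = record['deaths'] / population
        record['cases_per_cap'] = record['cases'] / population
    return records_copy

toy_records = [
    {'countriesAndTerritories': 'A', 'cases': 50, 'deaths': 5, 'popData2019': 1000},
    {'countriesAndTerritories': 'Ship', 'cases': 3, 'deaths': 0, 'popData2019': None},
]
result = add_per_cap(toy_records)

print(result[0]['cases_per_cap'])     # 0.05
print('cases_per_cap' in result[1])   # False
```

Because the function works on a deep copy, `toy_records` itself is left without the new fields.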


## Optional - Save 2.
Save the output of step 2 in `/data/output` for later use

In [6]:
output_file_path = '../data/output/covid_19_2.json'

with open(output_file_path, 'w') as output_file:
    json.dump(covid_19_dict, output_file, indent=2)

## Optional - Load 2.
Load the data needed for step 3, if the data from step 2 has already been saved in `/data/output`

In [7]:
input_file_path = '../data/output/covid_19_2.json'

with open(input_file_path) as input_file:
    # Make input available to the next cell (notebook-level variables are shared between cells):
    covid_19_dict = json.load(input_file)

## 3. Calculate max values
The maximum values for cases, deaths, cases per capita, and deaths per capita are calculated here; the visualizer uses them as scale limits. So that a few extreme countries do not stretch the scale, the `remove_amt` countries with the highest values are excluded from this calculation, but they stay in the dataset. As a result, those countries are simply shown at the top of the scale (e.g. the strongest color) on their worst dates.

In [8]:
print('Step 3: Calculating max covid-19 cases and deaths...')

def get_max_vals(covid_19_records):
  def get_max_value(variable):
    # The variable used for calculating the max value is the variable we need to sort with to remove the top countries:
    sorted_filtered = remove_top_countries(covid_19_records, sort_by=variable)
    # The last element holds the highest value after sorting (and removing a few):
    return sorted_filtered[-1][variable]

  # Returns the highest value for a given variable, regardless of which country 'has' this value:
  return {
      "cases": get_max_value('cases'),
      "deaths": get_max_value('deaths'),
      "cases_per_cap": get_max_value('cases_per_cap'),
      "deaths_per_cap": get_max_value('deaths_per_cap')
  }

# Takes in which variable to sort by, which dictates which countries get removed from the calculation of maxes:
def remove_top_countries(covid_19_records, sort_by):
  # The amount of countries to remove, to not stretch the scale too much:
  remove_amt = 3

  # There aren't numbers for all records with these two variables:
  if sort_by == 'cases_per_cap' or sort_by == 'deaths_per_cap':
    # So we filter off the records that do not contain a value for this variable:
    covid_19_records = filter(
        lambda record: record.get(sort_by), covid_19_records)

  # Sorts by the given variable:
  sorted_list = sorted(covid_19_records, key=lambda record: record[sort_by])

  # Removes the countries with the top numbers:
  for i in range(remove_amt):
    # Get name of country with highest value for the variable:
    highest_num_country = sorted_list[-1]['countriesAndTerritories']
    # Filter out the given country:
    sorted_list = [
        record for record in sorted_list if record['countriesAndTerritories'] != highest_num_country]
  return sorted_list



covid_19_records = covid_19_dict['records']

# The max values are not necessarily all for the same country, since they are just used to set the color limits in the visualizer
# Creation of a dictionary with only the max values for every variable:
covid_19_max_vals_dict = get_max_vals(covid_19_records)

Step 3: Calculating max covid-19 cases and deaths...
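
The effect of excluding the top countries is easiest to see on toy data. The sketch below is a compact restatement, not the cell above: `capped_max` is a hypothetical helper, the records are invented, and it drops missing or zero values for every variable, whereas the original only filters the per-capita ones.

```python
def capped_max(records, variable, remove_amt=1):
    """Max of `variable` after dropping the `remove_amt` countries that hold the highest values."""
    remaining = [r for r in records if r.get(variable)]
    for _ in range(remove_amt):
        if not remaining:
            break
        # Find the country that currently holds the highest value and drop all of its records:
        top_country = max(remaining, key=lambda r: r[variable])['countriesAndTerritories']
        remaining = [r for r in remaining if r['countriesAndTerritories'] != top_country]
    return max((r[variable] for r in remaining), default=None)

toy = [
    {'countriesAndTerritories': 'A', 'cases': 900},
    {'countriesAndTerritories': 'A', 'cases': 700},
    {'countriesAndTerritories': 'B', 'cases': 300},
    {'countriesAndTerritories': 'C', 'cases': 120},
]

print(capped_max(toy, 'cases'))                 # 300
print(capped_max(toy, 'cases', remove_amt=2))   # 120
```

Note that every record of a removed country is dropped, not only its single worst day, which is why country A's 700 does not become the limit.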


## Optional - Save 3.
Save the output of step 3 in `/data/output` for later use

In [9]:
output_file_path = '../data/output/covid_19_3.json'

with open(output_file_path, 'w') as output_file:
  json.dump(covid_19_max_vals_dict, output_file, indent=2)

## Optional - Load 2.
Load the data needed for step 4, if it has already been saved in `/data/output`. Step 4 works on the output of step 2, because step 3 only produces the separate max values.

In [10]:
input_file_path = '../data/output/covid_19_2.json'

with open(input_file_path) as input_file:
    # Make input available to the next cell (notebook-level variables are shared between cells):
    covid_19_dict = json.load(input_file)

## 4. Transform data into a better format for the visualiser

The flat list of records is regrouped by date: each date gets one entry that holds the records of every geographic area for that day.

In [11]:
print('Step 4: Transforming the data to make it easier to work with...')

def transform_data(covid_19_records):
    # An array for the finished data is created:
    data_correct_format = []

    for date in unique_dates(covid_19_records):
        # The template for a record:
        date_dict = {'date': date, 'data': []}

        # Go through original data, and put it under the corresponding date in the new format:
        for entry in covid_19_records:
            if entry['dateRep'] == date:
                date_dict['data'].append(entry)

        # The entire dictionary is then appended to the list:
        data_correct_format.append(date_dict)
    return data_correct_format

def unique_dates(covid_19_records):
    # The list containing every date in the dataset:
    used_dates = []

    for entry in covid_19_records:
        # Add the date to the list, if it is not already there:
        date = entry['dateRep']
        if date not in used_dates:
            used_dates.append(date)
        
    return used_dates

# Records will now be a list of dates (with data for every country):
covid_19_dict['records'] = transform_data(covid_19_dict['records'])

Step 4: Transforming the data to make it easier to work with...
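
The same grouping can be done in a single pass over the records. This is a sketch of an alternative, not the approach used above: `group_by_date` is a hypothetical name, the records are invented, and it relies on dictionaries preserving insertion order (Python 3.7+), so dates come out in order of first appearance.

```python
def group_by_date(records):
    """Group flat records into [{'date': ..., 'data': [...]}, ...] in one pass."""
    grouped = {}
    for entry in records:
        # Create the list for a date on first sight, then append to it:
        grouped.setdefault(entry['dateRep'], []).append(entry)
    return [{'date': date, 'data': data} for date, data in grouped.items()]

toy_records = [
    {'dateRep': '02/01/2020', 'countriesAndTerritories': 'A', 'cases': 4},
    {'dateRep': '01/01/2020', 'countriesAndTerritories': 'A', 'cases': 1},
    {'dateRep': '02/01/2020', 'countriesAndTerritories': 'B', 'cases': 7},
]
by_date = group_by_date(toy_records)

print([d['date'] for d in by_date])
print(len(by_date[0]['data']))
```

The nested-loop version scans all records once per date, so the single-pass variant may matter as the dataset grows.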


## Optional - Save 4.
Save the output of step 4 in `/data/output` for later use

In [12]:
output_file_path = '../data/output/covid_19_4.json'

with open(output_file_path, 'w') as output_file:
    json.dump(covid_19_dict, output_file, indent=2)

## Optional - Load 4 & 3.
Load the data needed for step 5, if the data from step 4 & 3 have already been saved in `/data/output`

In [13]:
max_vals_input_file_path = "../data/output/covid_19_3.json"
covid_19_input_file_path = "../data/output/covid_19_4.json"

with open(max_vals_input_file_path) as max_vals_input_file, open(covid_19_input_file_path) as covid_19_input_file:
    # Make input available to the next cell (notebook-level variables are shared between cells):
    covid_19_dict = json.load(covid_19_input_file)
    covid_19_max_vals_dict = json.load(max_vals_input_file)

## 5. Sorting and filtering data

The max values and the transformed records are combined, and fields the visualizer does not need are removed. The list of records is ordered by date, from earliest to latest.

In [14]:
print('Step 5: Sorting and filtering the data for the visualization...')

# Turns a DD/MM/YYYY date into a sortable integer (YYYYMMDD); relies on zero-padded day and month:
def compare_function(e):
  split_date = e["date"].split("/")
  return int(split_date[2] + split_date[1] + split_date[0])

# The records are sorted by date:
covid_19_dict["records"].sort(key=compare_function)

# The data are combined, so a visualizer can access both the data and the limits for the (e.g.) color ranges:
output_dict = {
    "max_vals": covid_19_max_vals_dict,
    "records": covid_19_dict['records']
}

# Clean out unnecessary variables (pop with a default, so a record missing a field does not raise KeyError):
unneeded_fields = (
    'dateRep', 'day', 'month', 'year', 'popData2019', 'continentExp',
    'Cumulative_number_for_14_days_of_COVID-19_cases_per_100000',
)
for entry in output_dict['records']:
  for data_point in entry['data']:
    for field in unneeded_fields:
      data_point.pop(field, None)

# Save the file:
output_file_path = '../data/output/covid_19_final.json'

with open(output_file_path, 'w') as output_file:
  json.dump(output_dict, output_file, indent=2)


Step 5: Sorting and filtering the data for the visualization...
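
As an alternative to building a `YYYYMMDD` integer by hand, the dates could be parsed with the standard library. This is only a sketch: `date_key` is a hypothetical name and the entries are invented, but `datetime.strptime` with `%d/%m/%Y` matches the `DD/MM/YYYY` format described above.

```python
from datetime import datetime

def date_key(entry):
    """Parse the entry's DD/MM/YYYY date string into a datetime for sorting."""
    return datetime.strptime(entry['date'], '%d/%m/%Y')

toy = [{'date': '01/02/2020'}, {'date': '31/12/2019'}, {'date': '15/01/2020'}]
ordered = sorted(toy, key=date_key)

print([e['date'] for e in ordered])  # ['31/12/2019', '15/01/2020', '01/02/2020']
```

Parsing also rejects strings that are not valid calendar dates, which the string-concatenation key would silently accept.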
