In [1]:
# DEVELOPMENT OF DATA PRODUCTS - 18697
# US5 - DATA AGGREGATION

This Notebook is part of the Development of Data Products project. Its functional objective is to provide the end user with data analysis and visualization for comparing daily and cumulative recorded cases of confirmed, deceased, and recovered patients. In addition, the Stringency Index is included to compare how different governments have reacted to the pandemic in terms of restrictions and regulations.

# Index:

1. [Imported Libraries and Scripts](#import-libraries-scripts)
2. [Reading and Preprocessing Data Sources](#read-data)
    1. [Daily Cases Data](#daily-data)
    2. [Cumulative Cases Data](#cumulative-data)
    3. [Government Response Data](#si-data)
3. [Merging Data Sources](#merge-data)
    1. [Aggregate and Merge Cumulative Cases Data](#merge-cumulative-data)
    2. [Merge Government Response Data](#merge-si-data)
    3. [Including Continent Variable](#add-continent-data)
    4. [Merge Daily Cases Data](#merge-daily-data)
4. [Merging Data Test](#test-data)

## 1 Imported Libraries and Scripts <a class="anchor" id="import-libraries-scripts"></a>

Some of the code functionality is implemented in dedicated Python functions stored in an external file, which is imported into the current Notebook.

In [2]:
# Libraries
import os

import pandas as pd
import numpy as np

import time
import random

# for regular expressions
import re

# for dates and timestamps handling
from datetime import datetime

In [3]:
# Scripts
from scripts import utils

## 2 Reading and Preprocessing Data Sources <a class="anchor" id="read-data"></a>

The collected data sources are under "DDP-unibz-project-18697/ProjectDataSources" inside the following directories:
    
    - csse_covid_19_data/csse_covid_19_daily_reports/ --> Daily data
    - csse_covid_19_data/csse_covid_19_time_series/ --> Cumulative data: recovered, deaths and confirmed cases
    - covid-policy-tracker/timeseries/ --> Stringency Index (Government response Indicator)
    
Note that the sources encode dates differently: the daily report file names use **Month-Day-Year** (for example `01-25-2020.csv`) and the cumulative time-series columns use **Month/Day/Year** (for example `1/22/20`), whereas the government response data uses **Day/Month/Year**.
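
Because the date formats differ, each source needs its own parsing rule before the tables can share a timestamp column. The following is a minimal sketch, assuming daily file names such as `01-25-2020.csv` and cumulative column labels such as `8/20/22`, as displayed in this notebook. The helper names are hypothetical and are not part of `utils.py`.

```python
from datetime import datetime


def parse_daily_file_date(file_path: str) -> datetime:
    """Parse the date from a daily report path such as '.../01-25-2020.csv'."""
    stem = file_path.split("/")[-1].split(".")[0]
    return datetime.strptime(stem, "%m-%d-%Y")


def parse_cumulative_column(label: str) -> datetime:
    """Parse a cumulative time-series column label such as '1/22/20'."""
    return datetime.strptime(label, "%m/%d/%y")


print(parse_daily_file_date("01-25-2020.csv").date())
print(parse_cumulative_column("8/20/22").date())
```

Passing an explicit format avoids relying on automatic inference, which can silently swap day and month for ambiguous dates.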

In addition to reading the CSV files, an initial data check is performed to determine:

    - Whether the columns have the proper data types for further data manipulation
    - How many rows and columns contain null or unavailable data
    - Which percentage of the total data is missing or unknown
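
The checks listed above can be sketched as follows. This is an illustration only: `summarize_missing` is a hypothetical name, and the actual `utils.initial_dataframe_check` may be implemented differently. The percentage is assumed to be the share of null cells over all cells, which is consistent with the summaries shown in this notebook.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return row/column counts and missing-value statistics for df."""
    null_mask = df.isna()
    total_cells = df.shape[0] * df.shape[1]
    # Share of null cells over all cells, expressed as a percentage.
    pct_null = 100 * null_mask.to_numpy().sum() / total_cells if total_cells else 0.0
    summary = {
        "# Rows": df.shape[0],
        "# Columns": df.shape[1],
        "# Rows with NAs": int(null_mask.any(axis=1).sum()),
        "# Columns with NAs": int(null_mask.any(axis=0).sum()),
        "% Null Values in Dataframe": round(float(pct_null), 3),
    }
    return pd.Series(summary, name="Values").to_frame()


# Small made-up example with two missing cells.
example = pd.DataFrame({
    "Country": ["Australia", "China", "Kiribati"],
    "Confirmed": [4.0, 10.0, 0.0],
    "Deaths": [np.nan, np.nan, 0.0],
})
print(example.dtypes)
print(summarize_missing(example))
```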
    
Moreover, the imported data sources contain columns that are not relevant to the functional objective of the project; these are dropped using a simple Python function.

In previous product development stages, only part of the data was merged successfully, because of mismatches and inconsistencies in column names and in the countries listed per data source. It is convenient to tackle this problem at the dataframe generation stage, so that no issues arise when aggregating and selecting variables for visualizations.

Using functions from the utils.py Python file, the collected data sources are cleaned and preprocessed at an early stage: column names are harmonized, the list of countries is filtered for all data sources, and the corresponding aggregations are computed.

Daily data has to be processed file by file: each data file must be read and preprocessed before being merged into a larger dataframe that is compatible with the other data sources, namely the cumulative and the Stringency Index datasets.

### A) Daily Cases Data <a class="anchor" id="daily-data"></a>

This section starts by reading the data of a single day from its CSV file. The data is read and cleaned, and initial statistics are displayed to the user.

The daily data is then preprocessed by removing, renaming and aggregating the required country data, following an approach similar to the one used for the cumulative and stringency data.

To collect all the daily data, a **FOR loop** reads each of the CSV files, preprocesses it, and stacks the result into one large merged dataframe.
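
The read, preprocess and stack pattern can be sketched as below. This is a simplified illustration under stated assumptions: `stack_daily_frames` and the in-memory CSV content are hypothetical, and the `preprocess` callable stands in for the `utils` functions used in this notebook. The sketch collects the per-file dataframes in a list and concatenates them once at the end.

```python
import io

import pandas as pd


def stack_daily_frames(sources, preprocess):
    """Read each source, preprocess it, and concatenate the results once."""
    frames = []
    for name, handle in sources:
        frame = preprocess(pd.read_csv(handle))
        # Parse the MM-DD-YYYY stem of the file name into a timestamp column.
        frame["Timestamps"] = pd.to_datetime(name.split(".")[0], format="%m-%d-%Y")
        frames.append(frame)
    # A single concat avoids re-copying the growing dataframe on every iteration.
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two daily report files (made-up content).
sources = [
    ("01-24-2020.csv", io.StringIO("Country,Confirmed\nChina,5\nJapan,1\n")),
    ("01-25-2020.csv", io.StringIO("Country,Confirmed\nChina,10\nJapan,2\n")),
]
stacked = stack_daily_frames(
    sources,
    preprocess=lambda df: df.rename(columns={"Confirmed": "Daily Confirmed"}),
)
print(stacked)
```

Concatenating once is a design choice that usually scales better than calling `pd.concat` inside the loop, since each in-loop call copies all rows collected so far.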

#### Preprocess, filter and aggregate countries

In [4]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_daily_reports/01-25-2020.csv"

daily_df = utils.read_data(file_path)

daily_df = utils.drop_columns(daily_df, file_path,
                              data_source="daily")

# date column is added after preprocessing
date = file_path.split("/")[-1].split(".")[0]

# reordering and renaming columns
daily_df.rename(columns={"Confirmed": "Daily Confirmed",
                         "Deaths": "Daily Deaths",
                         "Recovered": "Daily Recovered"}, inplace=True)

utils.initial_dataframe_check(daily_df)

Removed 2 columns from dataframe


Unnamed: 0,Values
# Rows,72.0
# Columns,4.0
# Rows with NAs,42.0
# Columns with NAs,3.0
% Null Values in Dataframe,27.083


In [5]:
print("How many countries in the list ? ", len(np.unique(daily_df["Country"])))

How many countries in the list ?  25


Remove the countries that are not common to all data sources and rename the remaining ones for consistency.

In [6]:
daily_df = utils.check_daily_data_countries(daily_df, file_path)

print("How many countries in the list ? ", len(np.unique(daily_df["Country"])))

How many countries in the list ?  17


The list of countries has been reduced after preprocessing.
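
A renaming and filtering step of this kind can be sketched as follows. The mapping, the reference set and the function name `harmonize_countries` are hypothetical placeholders; the real rules live in `utils.check_daily_data_countries` and may differ.

```python
import pandas as pd

# Hypothetical mapping and reference list, for illustration only.
RENAME_MAP = {"Mainland China": "China", "US": "United States"}
REFERENCE_COUNTRIES = {"China", "United States", "Japan"}


def harmonize_countries(df: pd.DataFrame, country_column: str = "Country") -> pd.DataFrame:
    """Rename country labels, then keep only rows found in the reference list."""
    out = df.copy()
    out[country_column] = out[country_column].replace(RENAME_MAP)
    keep = out[country_column].isin(REFERENCE_COUNTRIES)
    return out.loc[keep].reset_index(drop=True)


raw = pd.DataFrame({
    "Country": ["Mainland China", "Mainland China", "US", "Macau", "Japan"],
    "Daily Confirmed": [6.0, 4.0, 2.0, 2.0, 2.0],
})
clean = harmonize_countries(raw)
print(clean["Country"].nunique())
```

Renaming before filtering matters here: a label that only matches the reference list after renaming would otherwise be dropped.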

In [7]:
aggregate_countries = utils.get_countries_split_by_regions(daily_df, 
                                                           country_column="Country")
if len(aggregate_countries) > 0:
    daily_df = utils.country_aggregation_dataframe(daily_df, aggregate_countries,
                                                   data_type="daily")
  
    print(utils.initial_dataframe_check(daily_df))
    
else:
    print("There are no countries with data split by regions.")

There are 4 countries where data needs to be aggregated.
                            Values
# Rows                      17.000
# Columns                    4.000
# Rows with NAs              9.000
# Columns with NAs           2.000
% Null Values in Dataframe  26.471


In [8]:
daily_df.head(10)

Unnamed: 0,Country,Daily Confirmed,Daily Deaths,Daily Recovered
0,Australia,4.0,,
1,China,10.0,,
2,France,3.0,,
3,Japan,2.0,,
4,Kiribati,0.0,0.0,0.0
5,Malaysia,0.0,0.0,0.0
6,Nepal,1.0,,
7,New Zealand,0.0,0.0,0.0
8,Singapore,3.0,,
9,South Korea,2.0,,


#### Include the rest of the countries and fill the dataframe with NULL values

In [9]:
file_path = "../ProjectDataSources/Country_Names_List.csv"

# read the reference file once and reuse it for both columns
countries_ref_df = utils.read_data(file_path)
list_full_countries = countries_ref_df["Country"].values
continents = countries_ref_df["Continent"]

In [10]:
countries_to_merge = utils.get_list_inner_outer_join(np.unique(daily_df["Country"]), 
                                                     list_full_countries, 
                                                     operation="outer")

if len(countries_to_merge) > 0:
    to_merge_df = pd.DataFrame(np.nan, index=np.arange(len(countries_to_merge)), 
                           columns=daily_df.columns)
    to_merge_df["Country"] = countries_to_merge
    
    daily_df = pd.concat([daily_df, to_merge_df], ignore_index=True).sort_values("Country", ascending=True)
    daily_df = daily_df.reset_index(drop=True)
    
    print(utils.initial_dataframe_check(daily_df))
    
else:
    print("All countries are already on dataset.")

                             Values
# Rows                      176.000
# Columns                     4.000
# Rows with NAs             168.000
# Columns with NAs            3.000
% Null Values in Dataframe   70.312
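
As an alternative view of this step, adding the missing countries with empty values can also be expressed as a reindex on the country column. This is a sketch, assuming each country appears only once after aggregation; `add_missing_countries` is a hypothetical helper and not a function from `utils.py`.

```python
import pandas as pd


def add_missing_countries(df, all_countries, country_column="Country"):
    """Return df with one row per reference country, NaN where data is absent."""
    full_index = pd.Index(
        sorted(set(df[country_column]) | set(all_countries)),
        name=country_column,
    )
    # reindex inserts NaN rows for labels that are not present in df.
    return df.set_index(country_column).reindex(full_index).reset_index()


daily = pd.DataFrame({
    "Country": ["China", "Japan"],
    "Daily Confirmed": [10.0, 2.0],
})
result = add_missing_countries(daily, ["China", "Japan", "Zambia", "Australia"])
print(result)
```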


#### Inserting timestamps and continent columns

In [11]:
daily_df["Continent"] = continents
daily_df["Timestamps"] = pd.to_datetime(date)

columns_reordered = ["Country", "Continent", "Timestamps", "Daily Confirmed",
                     "Daily Deaths", "Daily Recovered"]
daily_df = daily_df[columns_reordered]

In [12]:
daily_df.tail(10)

Unnamed: 0,Country,Continent,Timestamps,Daily Confirmed,Daily Deaths,Daily Recovered
166,United Kingdom,Europe,2020-01-25,0.0,0.0,0.0
167,United States,North America,2020-01-25,2.0,0.0,0.0
168,Uruguay,South America,2020-01-25,,,
169,Uzbekistan,Asia,2020-01-25,,,
170,Vanuatu,Oceania,2020-01-25,,,
171,Venezuela,South America,2020-01-25,,,
172,Vietnam,Asia,2020-01-25,2.0,,
173,Yemen,Asia,2020-01-25,,,
174,Zambia,Africa,2020-01-25,,,
175,Zimbabwe,Africa,2020-01-25,,,
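
The `Continent` assignment in this step lines up values by row index, which gives the right result as long as the reference file lists the countries in the same order as the sorted dataframe. If that ordering cannot be guaranteed, a lookup by country name is a more defensive option. The sketch below uses made-up data to illustrate it.

```python
import pandas as pd

# Made-up excerpt of a reference table, deliberately in a different order.
reference = pd.DataFrame({
    "Country": ["Zambia", "Australia", "Japan"],
    "Continent": ["Africa", "Oceania", "Asia"],
})
daily = pd.DataFrame({
    "Country": ["Australia", "Japan", "Zambia"],
    "Daily Confirmed": [4.0, 2.0, None],
})

# Build a Country -> Continent lookup and apply it by name, not by position.
continent_by_country = reference.set_index("Country")["Continent"]
daily["Continent"] = daily["Country"].map(continent_by_country)
print(daily)
```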


#### Running flow for concatenating all daily data CSV files

In [13]:
daily_data_path = "../ProjectDataSources/csse_covid_19_data/" + \
                  "csse_covid_19_daily_reports/"

daily_files_txt = "../ProjectDataSources/timestamps_ordered.txt"
full_countries_file = "../ProjectDataSources/Country_Names_List.csv"

daily_files = pd.read_csv(daily_files_txt, sep=" ", header=None, 
                          names=["File"])


All CSV files are read, preprocessed and concatenated into a single dataframe containing all available timestamps. The merging process for daily data takes around **5 minutes** to complete.

In [66]:
merged_daily_df = pd.DataFrame()

for file in np.squeeze(daily_files.values):
    file_path = daily_data_path + file
    
    # reading data
    daily_df = utils.read_data(file_path)
    
    # dropping non-relevant columns
    daily_df = utils.drop_columns(daily_df, file_path,
                                  data_source="daily")
    
    # date column is added after preprocessing
    date = file_path.split("/")[-1].split(".")[0]
    
    # reordering and renaming columns
    daily_df.rename(columns={"Confirmed": "Daily Confirmed Cases",
                             "Deaths": "Daily Death Cases",
                             "Recovered": "Daily Recovered Cases"}, inplace=True)
    
    # renaming and removing countries which are not common for all data sources
    daily_df = utils.check_daily_data_countries(daily_df, file_path)
    
    # aggregating by sum the countries which are split by regions or provinces
    aggregate_countries = utils.get_countries_split_by_regions(daily_df, 
                                                               country_column="Country")
    if len(aggregate_countries) > 0:
        daily_df = utils.country_aggregation_dataframe(daily_df, aggregate_countries,
                                                       data_type="daily")
    
    # Error when reading this date file - need to drop a row for China
    if file == "01-24-2020.csv":
        drop_row = daily_df.loc[daily_df["Country"]=="China"].head(1)
        daily_df = daily_df.drop(drop_row.index).reset_index(drop=True)
    
    # reading reference file with full list of countries and continents
    list_full_countries = utils.read_data(full_countries_file)["Country"].values
    continents = utils.read_data(full_countries_file)["Continent"]   
    
    # checking if some countries are not part of the dataframe
    countries_to_merge = utils.get_list_inner_outer_join(np.unique(daily_df["Country"]), 
                                                         list_full_countries, 
                                                         operation="outer")
    
    # adding the countries in case they are not present in the dataframe
    if len(countries_to_merge) > 0:
        to_merge_df = pd.DataFrame(np.nan, index=np.arange(len(countries_to_merge)), 
                                   columns=daily_df.columns)
        to_merge_df["Country"] = countries_to_merge
    
        daily_df = pd.concat([daily_df, to_merge_df], ignore_index=True).sort_values("Country", ascending=True)
        daily_df = daily_df.reset_index(drop=True)
    
    # including continent and timestamps with the proper format
    daily_df["Continent"] = continents
    daily_df["Timestamps"] = pd.to_datetime(date)

    # reordering columns
    columns_reordered = ["Country", "Continent", "Timestamps", "Daily Confirmed Cases",
                         "Daily Death Cases", "Daily Recovered Cases"]
    daily_df = daily_df[columns_reordered]
    
    # merging dataframe per each daily file
    merged_daily_df = pd.concat([merged_daily_df, daily_df], ignore_index=True)

Removed 2 columns from dataframe
There are 3 countries where data needs to be aggregated.
...
Removed 8 columns from dataframe
There are 11 countries where data needs to be aggregated.
...
Removed 10 columns from dataframe
There are 25 countries where data needs to be aggregated.

In [67]:
utils.initial_dataframe_check(merged_daily_df)

Unnamed: 0,Values
# Rows,165792.0
# Columns,6.0
# Rows with NAs,65611.0
# Columns with NAs,3.0
% Null Values in Dataframe,8.151


In [68]:
merged_daily_df.tail(10)

Unnamed: 0,Country,Continent,Timestamps,Daily Confirmed Cases,Daily Death Cases,Daily Recovered Cases
165782,United Kingdom,Europe,2022-08-20,23675485.0,187731.0,0.0
165783,United States,North America,2022-08-20,93634408.0,1041141.0,0.0
165784,Uruguay,South America,2022-08-20,975264.0,7429.0,
165785,Uzbekistan,Asia,2022-08-20,243654.0,1637.0,
165786,Vanuatu,Oceania,2022-08-20,11770.0,14.0,
165787,Venezuela,South America,2022-08-20,541322.0,5786.0,
165788,Vietnam,Asia,2022-08-20,11382258.0,43104.0,
165789,Yemen,Asia,2022-08-20,11915.0,2153.0,
165790,Zambia,Africa,2022-08-20,332264.0,4016.0,
165791,Zimbabwe,Africa,2022-08-20,256616.0,5592.0,


### B) Cumulative Cases Data <a class="anchor" id="cumulative-data"></a>

Cumulative data is composed of three different time series:

    - global confirmed cases
    - global death cases
    - global recovered cases

The three cumulative data sources contain the same number of countries, so the data can be cleaned and aggregated in advance.

#### Confirmed Cases

Read the data, drop non-relevant columns and rename the country column.

In [69]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

confirmed_df = utils.read_data(file_path)
confirmed_df = utils.drop_columns(confirmed_df, file_path,
                                  data_source="cumulative")

utils.initial_dataframe_check(confirmed_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,285.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


For consistency across all data sources, countries that are present in one data source but not in the others have to be removed. Also, some countries have to be renamed so that the same name is used in all data sources.

In [70]:
confirmed_df = utils.country_list_formatting(confirmed_df, 
                                             data_source="cumulative")

There have been 23 countries removed from the dataset.


Check whether there are countries with data split by regions or provinces. If so, aggregate the data per country by adding up all the cases, so that the number of cases is the national total and not a regional one.
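
The per-country aggregation described here can be sketched with a group-by sum. The function name `aggregate_by_country` and the sample values are hypothetical, and `utils.country_aggregation_dataframe` may work differently. Using `min_count=1` is an assumption of this sketch: it keeps a total as missing when every regional value is missing, instead of reporting zero.

```python
import numpy as np
import pandas as pd


def aggregate_by_country(df, country_column="Country"):
    """Sum numeric columns per country; stay NaN when all regions are NaN."""
    return (
        df.groupby(country_column, as_index=False)
          .sum(numeric_only=True, min_count=1)
    )


# Made-up regional rows: Australia is split into two regions.
regional = pd.DataFrame({
    "Country": ["Australia", "Australia", "Japan"],
    "Confirmed": [3.0, 1.0, 2.0],
    "Deaths": [np.nan, np.nan, 0.0],
})
national = aggregate_by_country(regional)
print(national)
```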

In [71]:
aggregate_countries = utils.get_countries_split_by_regions(confirmed_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    confirmed_df = utils.country_aggregation_dataframe(confirmed_df, aggregate_countries,
                                                       data_type="cumulative")
  
    print(utils.initial_dataframe_check(confirmed_df))
    
else:
    print("There are no countries with data split by regions.")

There are 8 countries where data needs to be aggregated.
                            Values
# Rows                       176.0
# Columns                    943.0
# Rows with NAs                0.0
# Columns with NAs             0.0
% Null Values in Dataframe     0.0


In [72]:
confirmed_df.tail(10)

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
166,United Kingdom,0,0,0,0,0,0,0,0,0,...,23634568,23634821,23634821,23634821,23634821,23634960,23675470,23675485,23675485,23675485
167,United States,1,1,2,2,5,5,5,6,6,...,92748361,92844516,92927176,92934599,93052775,93140225,93278240,93403640,93500945,93634408
168,Uruguay,0,0,0,0,0,0,0,0,0,...,973420,973420,973420,973420,975264,975264,975264,975264,975264,975264
169,Uzbekistan,0,0,0,0,0,0,0,0,0,...,243460,243487,243509,243537,243562,243586,243605,243623,243638,243654
170,Vanuatu,0,0,0,0,0,0,0,0,0,...,11731,11734,11746,11746,11753,11753,11753,11753,11770,11770
171,Venezuela,0,0,0,0,0,0,0,0,0,...,539205,539205,540102,540222,540222,540681,540796,540796,540977,541322
172,Vietnam,0,2,2,2,2,2,2,2,2,...,11360348,11362540,11364355,11365784,11367479,11370462,11373276,11376571,11379554,11382258
173,Yemen,0,0,0,0,0,0,0,0,0,...,11903,11903,11903,11903,11903,11903,11903,11914,11915,11915
174,Zambia,0,0,0,0,0,0,0,0,0,...,331568,331925,331925,331925,332014,332014,332264,332264,332264,332264
175,Zimbabwe,0,0,0,0,0,0,0,0,0,...,256513,256522,256522,256539,256544,256561,256565,256579,256596,256616


#### Death Cases

The cumulative death cases data follows the same preprocessing steps as the confirmed cases data for filtering countries and aggregating data if needed.

In [73]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

deaths_df = utils.read_data(file_path)
deaths_df = utils.drop_columns(deaths_df, file_path,
                               data_source="cumulative")

utils.initial_dataframe_check(deaths_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,285.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [74]:
deaths_df = utils.country_list_formatting(deaths_df, 
                                          data_source="cumulative")

There have been 23 countries removed from the dataset.


In [75]:
aggregate_countries = utils.get_countries_split_by_regions(deaths_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    deaths_df = utils.country_aggregation_dataframe(deaths_df, aggregate_countries,
                                                    data_type="cumulative")
  
    print(utils.initial_dataframe_check(deaths_df))
    
else:
    print("There are no countries with data split by regions.")

There are 8 countries where data needs to be aggregated.
                            Values
# Rows                       176.0
# Columns                    943.0
# Rows with NAs                0.0
# Columns with NAs             0.0
% Null Values in Dataframe     0.0


In [76]:
deaths_df.tail(10)

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
166,United Kingdom,0,0,0,0,0,0,0,0,0,...,186798,186798,186798,186798,186798,186798,187730,187731,187731,187731
167,United States,0,0,0,0,0,0,0,0,0,...,1036444,1037090,1037117,1037121,1037554,1038023,1039116,1039746,1040355,1041141
168,Uruguay,0,0,0,0,0,0,0,0,0,...,7423,7423,7423,7423,7429,7429,7429,7429,7429,7429
169,Uzbekistan,0,0,0,0,0,0,0,0,0,...,1637,1637,1637,1637,1637,1637,1637,1637,1637,1637
170,Vanuatu,0,0,0,0,0,0,0,0,0,...,14,14,14,14,14,14,14,14,14,14
171,Venezuela,0,0,0,0,0,0,0,0,0,...,5770,5770,5775,5778,5778,5779,5781,5781,5784,5786
172,Vietnam,0,0,0,0,0,0,0,0,0,...,43095,43096,43097,43098,43098,43100,43103,43103,43103,43104
173,Yemen,0,0,0,0,0,0,0,0,0,...,2152,2152,2152,2152,2152,2152,2152,2152,2153,2153
174,Zambia,0,0,0,0,0,0,0,0,0,...,4016,4016,4016,4016,4016,4016,4016,4016,4016,4016
175,Zimbabwe,0,0,0,0,0,0,0,0,0,...,5587,5587,5587,5588,5588,5588,5588,5589,5589,5592


#### Recovered Cases

The cumulative recovered cases data follows the same preprocessing steps as the confirmed cases data for filtering countries and aggregating data if needed.

In [77]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

recovered_df = utils.read_data(file_path)
recovered_df = utils.drop_columns(recovered_df, file_path,
                                  data_source="cumulative")

utils.initial_dataframe_check(recovered_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,270.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [78]:
recovered_df = utils.country_list_formatting(recovered_df, 
                                             data_source="cumulative")

There have been 23 countries removed from the dataset.


In [79]:
aggregate_countries = utils.get_countries_split_by_regions(recovered_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    recovered_df = utils.country_aggregation_dataframe(recovered_df, aggregate_countries,
                                                       data_type="cumulative")
  
    print(utils.initial_dataframe_check(recovered_df))
    
else:
    print("There are no countries with data split by regions.")

There are 7 countries where data needs to be aggregated.
                            Values
# Rows                       176.0
# Columns                    943.0
# Rows with NAs                0.0
# Columns with NAs             0.0
% Null Values in Dataframe     0.0


In [80]:
recovered_df.tail(10)

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
166,United Kingdom,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
167,United States,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
168,Uruguay,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
169,Uzbekistan,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
170,Vanuatu,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
171,Venezuela,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
172,Vietnam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
173,Yemen,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
174,Zambia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
175,Zimbabwe,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In the previous user story, the recovered cases dataset had fewer observations than the confirmed and deaths datasets. After aggregating per country, all three datasets have the same number of rows and the same number of countries.

In [81]:
# Country Names
print("How many countries in the confirmed cases list ? ",len(np.unique(confirmed_df["Country"])))
print("How many countries in the deaths cases list ? ",len(np.unique(deaths_df["Country"])))
print("How many countries in the recovered cases list ? ",len(np.unique(recovered_df["Country"])))

How many countries in the confirmed cases list ?  176
How many countries in the deaths cases list ?  176
How many countries in the recovered cases list ?  176


### C) Government Response Data <a class="anchor" id="si-data"></a>

The Stringency Index (SI) data follows the same preprocessing steps as the cumulative cases data, but it is aggregated per country differently: for SI, the mean of the regional values is taken instead of the sum. In addition, some countries in the SI data are not present in the cumulative data and vice versa, so every country that appears in only one of the two sources is dropped.
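
The difference between the two aggregation rules can be illustrated on a toy table. This is a minimal sketch, not the implementation of `utils.country_aggregation_dataframe`; the data and the variable names are made up for illustration.

```python
import pandas as pd

# Toy wide-format table (hypothetical values): Australia is split into two regional rows.
toy = pd.DataFrame({
    "Country": ["Australia", "Australia", "Italy"],
    "1/22/20": [10.0, 30.0, 50.0],
    "1/23/20": [20.0, 40.0, 60.0],
})

# Cumulative counts: regional rows are added up per country.
summed = toy.groupby("Country", as_index=False).sum()

# Stringency Index: regional rows are averaged per country.
averaged = toy.groupby("Country", as_index=False).mean()

print(summed)
print(averaged)
```

Note that `mean()` skips missing values by default, which matters for the SI data because it contains NAs.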

In [82]:
file_path = "../ProjectDataSources/covid-policy-tracker/" + \
            "timeseries/stringency_index_avg.csv"

stringency_df = utils.read_data(file_path)
stringency_df = utils.drop_columns(stringency_df, file_path,
                                   data_source="stringency_index")

utils.initial_dataframe_check(stringency_df)

Removed 4 columns from dataframe


Unnamed: 0,Values
# Rows,263.0
# Columns,970.0
# Rows with NAs,263.0
# Columns with NAs,969.0
% Null Values in Dataframe,7.953


In [83]:
number_unique_countries = len(np.unique(stringency_df["Country"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  187


The country list of the Stringency Index dataset is smaller than the ones from the CSSE data source for daily and cumulative cases. The country lists are intersected in later steps for data consistency.

In [84]:
countries_list_cumulative = confirmed_df["Country"].values
countries_list_si = stringency_df["Country"].values

utils.get_list_inner_outer_join(countries_list_cumulative, countries_list_si, 
                                operation="outer")

['Aruba',
 'Bermuda',
 'Cabo Verde',
 'Cape Verde',
 'Faeroe Islands',
 'Greenland',
 'Guam',
 'Hong Kong',
 'Kyrgyz Republic',
 'Kyrgyzstan',
 'Macao',
 'Palestine',
 'Puerto Rico',
 'Slovak Republic',
 'Slovakia',
 'Turkmenistan',
 'United States Virgin Islands']

Some of the listed countries appear twice under different names (for example, 'Cabo Verde' and 'Cape Verde'), so the formatting function must assign the same name across all data sources. The remaining countries in this list appear only in the SI data and are removed.
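
The renaming idea can be sketched as follows. This is not the code of `utils.country_list_formatting`; the function name `harmonize_country_names`, the `NAME_FIXES` mapping, and the choice of which spelling is treated as canonical are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mapping from alternative spellings to one canonical name.
# Which spelling the project actually keeps is an assumption here.
NAME_FIXES = {
    "Cape Verde": "Cabo Verde",
    "Kyrgyz Republic": "Kyrgyzstan",
    "Slovak Republic": "Slovakia",
}

def harmonize_country_names(df, country_column="Country", keep=None):
    """Rename known aliases, then optionally keep only the countries listed in `keep`."""
    out = df.copy()
    out[country_column] = out[country_column].replace(NAME_FIXES)
    if keep is not None:
        out = out[out[country_column].isin(keep)]
    return out.reset_index(drop=True)

# Toy SI table: one alias, one SI-only territory, one country present everywhere.
toy_si = pd.DataFrame({
    "Country": ["Slovak Republic", "Aruba", "Italy"],
    "01Jan2020": [0.0, 0.0, 0.0],
})
cleaned = harmonize_country_names(toy_si, keep=["Slovakia", "Italy"])
print(cleaned)
```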

In [85]:
stringency_df = utils.country_list_formatting(stringency_df, 
                                              data_source="stringency_index")

There have been 11 countries removed from the dataset.


In [86]:
aggregate_countries = utils.get_countries_split_by_regions(stringency_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    stringency_df = utils.country_aggregation_dataframe(stringency_df, aggregate_countries,
                                                        data_type="stringency_index")
  
    print(utils.initial_dataframe_check(stringency_df))
    
else:
    print("There are no countries with data split by regions.")

There are 4 countries where data needs to be aggregated.
                             Values
# Rows                      176.000
# Columns                   970.000
# Rows with NAs             176.000
# Columns with NAs          969.000
% Null Values in Dataframe    4.869


In [87]:
# Country Names
print("How many countries in the confirmed cases list ? ",len(np.unique(confirmed_df["Country"])))
print("How many countries in the deaths cases list ? ",len(np.unique(deaths_df["Country"])))
print("How many countries in the recovered cases list ? ",len(np.unique(recovered_df["Country"])))
print("How many countries in the stringency index list ? ",len(np.unique(stringency_df["Country"])))

How many countries in the confirmed cases list ?  176
How many countries in the deaths cases list ?  176
How many countries in the recovered cases list ?  176
How many countries in the stringency index list ?  176


Now, cumulative and SI data sources have the same number of countries. They are ready to be merged as one dataframe.

## 3 Merging Data Sources  <a class="anchor" id="merge-data"></a>

Some of the collected data sources store the data as tables in which each timestamp (a day) has its own column. For later aggregation and visualization, it is more useful to have a single column for timestamps and a separate column holding the numerical value of each data source: Daily data, Cumulative data, or the Stringency Index.

With Pandas built-in functions, this manipulation is straightforward. Note that for a few countries the collected data is more granular and is split by regions or overseas territories. For simplicity, every country must have only one row covering all timestamps, so a sum aggregation is applied to these rows.

The functions to merge and aggregate data are part of the scripts folder.

### A) Aggregate and Merge Cumulative Cases Data <a class="anchor" id="merge-cumulative-data"></a>

For a successful merge, each of the dataframes must contain the same set of countries. This is verified with a function that computes the intersection of the country lists of the datasets involved; the size of each intersection should match the number of countries in each dataset.
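
The inner and outer comparisons can be expressed with plain Python sets. The sketch below is an assumption about what such a helper might look like, not the source of `utils.get_list_inner_outer_join`; the name `inner_outer_join` is hypothetical. Here "outer" is taken to mean the values that appear in exactly one of the two lists, which is consistent with the outputs shown in this notebook.

```python
def inner_outer_join(list_a, list_b, operation="inner"):
    """Return the sorted common values ("inner") or the values found in only one list ("outer")."""
    set_a, set_b = set(list_a), set(list_b)
    if operation == "inner":
        result = set_a & set_b   # intersection
    elif operation == "outer":
        result = set_a ^ set_b   # symmetric difference
    else:
        raise ValueError("operation must be 'inner' or 'outer'")
    return sorted(result)

common = inner_outer_join(["Italy", "Spain"], ["Italy", "Aruba"], operation="inner")
only_one_side = inner_outer_join(["Italy", "Spain"], ["Italy", "Aruba"], operation="outer")
print(common, only_one_side)
```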

In [88]:
intersection_1 = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                 np.unique(deaths_df["Country"]),
                                                 operation="inner")

intersection_2 = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                 np.unique(recovered_df["Country"]),
                                                 operation="inner")

intersection_3 = utils.get_list_inner_outer_join(np.unique(deaths_df["Country"]),
                                                 np.unique(recovered_df["Country"]),
                                                 operation="inner")

In [89]:
print(len(intersection_1))
print(len(intersection_2))
print(len(intersection_3))

176
176
176


It is now clear that each cumulative dataset contains the same countries. The next step is to check whether the data per country is summarized in a single observation, or whether the data source still splits some countries by regions or overseas territories.

#### Merging Data

Each cumulative dataset is melted, that is, reshaped from wide to long format so that a single column contains all measurements, and then merged with the other cumulative datasets.
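
The effect of melting is easiest to see on a small table. The values below are made up; only `pd.melt` and the column names follow the notebook.

```python
import pandas as pd

# Toy wide-format table: one row per country, one column per day.
toy_wide = pd.DataFrame({
    "Country": ["Italy", "Spain"],
    "1/22/20": [0, 0],
    "1/23/20": [2, 1],
})

# Long format: one row per (country, day) pair.
toy_long = pd.melt(toy_wide,
                   id_vars="Country",
                   var_name="Timestamps",
                   value_name="Cumulative Confirmed Cases")
print(toy_long)
```

The result has one row per country and day, so 2 countries and 2 days give 4 rows and 3 columns.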

In [90]:
confirmed_df_melt = pd.melt(confirmed_df, 
                            id_vars="Country", 
                            value_vars=list(confirmed_df.columns[1:]),
                            var_name="Timestamps", 
                            value_name="Cumulative Confirmed Cases")

utils.initial_dataframe_check(confirmed_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [91]:
deaths_df_melt = pd.melt(deaths_df, 
                         id_vars="Country", 
                         value_vars=list(deaths_df.columns[1:]),
                         var_name="Timestamps", 
                         value_name="Cumulative Death Cases")

utils.initial_dataframe_check(deaths_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [92]:
recovered_df_melt = pd.melt(recovered_df, 
                            id_vars="Country", 
                            value_vars=list(recovered_df.columns[1:]),
                            var_name="Timestamps", 
                            value_name="Cumulative Recovered Cases")

utils.initial_dataframe_check(recovered_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


Because each of the stacked dataframes shares the same country and timestamp columns, with identical values in the same row order, the merging process is straightforward.

In [93]:
merged_df = pd.concat([confirmed_df_melt, deaths_df_melt, recovered_df_melt], 
                      axis=1, join='inner')

# drop column duplicates
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()]

print("Number of rows     : ", merged_df.shape[0])
print("Number of columns  : ", merged_df.shape[1])

Number of rows     :  165792
Number of columns  :  5
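
The column-wise concatenation used here aligns rows by index, so it is only correct when all melted dataframes list countries and timestamps in the same order. A key-based merge does not depend on row order. The following is a sketch with made-up data, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

confirmed = pd.DataFrame({
    "Country": ["Italy", "Spain"],
    "Timestamps": ["1/22/20", "1/22/20"],
    "Cumulative Confirmed Cases": [3, 5],
})

# Same keys, deliberately listed in a different row order.
deaths = pd.DataFrame({
    "Country": ["Spain", "Italy"],
    "Timestamps": ["1/22/20", "1/22/20"],
    "Cumulative Death Cases": [1, 0],
})

# Match rows on the key columns; validate raises if a key pair is duplicated on either side.
merged = confirmed.merge(deaths,
                         on=["Country", "Timestamps"],
                         how="inner",
                         validate="one_to_one")
print(merged)
```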


In [94]:
merged_df.tail(10)

Unnamed: 0,Country,Timestamps,Cumulative Confirmed Cases,Cumulative Death Cases,Cumulative Recovered Cases
165782,United Kingdom,8/20/22,23675485,187731,0
165783,United States,8/20/22,93634408,1041141,0
165784,Uruguay,8/20/22,975264,7429,0
165785,Uzbekistan,8/20/22,243654,1637,0
165786,Vanuatu,8/20/22,11770,14,0
165787,Venezuela,8/20/22,541322,5786,0
165788,Vietnam,8/20/22,11382258,43104,0
165789,Yemen,8/20/22,11915,2153,0
165790,Zambia,8/20/22,332264,4016,0
165791,Zimbabwe,8/20/22,256616,5592,0


### B) Merge Government Response Data <a class="anchor" id="merge-si-data"></a>

To merge the Stringency Index dataset into the cumulative data, the country names must match between the two datasets.

The countries now match in both data sources: the inner intersection contains all 176 countries and the outer intersection list is empty.

In [95]:
intersection_inner = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                     np.unique(stringency_df["Country"]),
                                                     operation="inner")

print(len(intersection_inner))

176


In [96]:
intersection_outer = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                     np.unique(stringency_df["Country"]),
                                                     operation="outer")

print(len(intersection_outer))

0


To merge the data sources, both must contain the same timestamps. The Stringency Index has more timestamps, as its columns start at the beginning of 2020, even for days on which no data was collected or available.

The following tasks must be completed:

    - Have the same timestamps in both data sources
    - Use the same naming convention for timestamps

In [97]:
print("Number of timestamps in cumulative data : ", len(confirmed_df.columns[1:]))
print("Number of timestamps in SI data         : ", len(stringency_df.columns[1:]))

Number of timestamps in cumulative data :  942
Number of timestamps in SI data         :  969


In [98]:
print("Cumulative data columns : ")
print(confirmed_df.columns[1:10])

print("-"*70)

print("SI Index data columns : ")
print(stringency_df.columns[1:10])

Cumulative data columns : 
Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20'],
      dtype='object')
----------------------------------------------------------------------
SI Index data columns : 
Index(['01Jan2020', '02Jan2020', '03Jan2020', '04Jan2020', '05Jan2020',
       '06Jan2020', '07Jan2020', '08Jan2020', '09Jan2020'],
      dtype='object')


First, the timestamp naming convention must match between the cumulative and SI data sources. The following function converts the timestamp strings of the stringency dataset into the format used in the cumulative data.

In [99]:
updated_timestamps = utils.formatting_timestamp_string(stringency_df, 
                                                       country_column="Country")

stringency_df.columns = updated_timestamps

In [100]:
stringency_df.columns[1:6]

Index(['1/1/20', '1/2/20', '1/3/20', '1/4/20', '1/5/20'], dtype='object')
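
The conversion of a single label can be sketched with the standard library. This is an illustration, not the body of `utils.formatting_timestamp_string`; the function name `si_to_csse_label` is hypothetical, and parsing `%b` assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

def si_to_csse_label(label):
    """Convert e.g. '01Jan2020' to '1/1/20' (month/day/two-digit year, no zero padding)."""
    parsed = datetime.strptime(label, "%d%b%Y")
    return f"{parsed.month}/{parsed.day}/{parsed.year % 100:02d}"

labels = ["01Jan2020", "22Jan2020", "26Aug2022"]
converted = [si_to_csse_label(label) for label in labels]
print(converted)
```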

Next, to have the same timestamp columns in both data sources, the outer intersection of the column names is computed, and the resulting columns are dropped from the stringency index dataset.

In [101]:
columns_to_drop = utils.get_list_inner_outer_join(stringency_df.columns, 
                                                  confirmed_df.columns, 
                                                  operation="outer")

columns_to_drop

['1/1/20',
 '1/10/20',
 '1/11/20',
 '1/12/20',
 '1/13/20',
 '1/14/20',
 '1/15/20',
 '1/16/20',
 '1/17/20',
 '1/18/20',
 '1/19/20',
 '1/2/20',
 '1/20/20',
 '1/21/20',
 '1/3/20',
 '1/4/20',
 '1/5/20',
 '1/6/20',
 '1/7/20',
 '1/8/20',
 '1/9/20',
 '8/21/22',
 '8/22/22',
 '8/23/22',
 '8/24/22',
 '8/25/22',
 '8/26/22']

In [102]:
stringency_df = stringency_df.drop(columns_to_drop, axis=1)

utils.initial_dataframe_check(stringency_df)

Unnamed: 0,Values
# Rows,176.0
# Columns,943.0
# Rows with NAs,106.0
# Columns with NAs,942.0
% Null Values in Dataframe,2.238


In [103]:
utils.get_list_inner_outer_join(stringency_df.columns, 
                                confirmed_df.columns, 
                                operation="outer")

[]

Now the same timestamp columns with the same names are in both data sources. The stringency index dataset can be merged with the cumulative data.
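
The same alignment can also be done with pandas `Index` set operations. The sketch below uses made-up frames and hypothetical variable names, and is meant as an alternative illustration only.

```python
import pandas as pd

cumulative = pd.DataFrame({"Country": ["Italy"], "1/22/20": [0], "1/23/20": [2]})
si = pd.DataFrame({"Country": ["Italy"], "1/21/20": [0.0], "1/22/20": [0.0],
                   "1/23/20": [5.56], "1/24/20": [5.56]})

# Columns that exist only in the SI table (returned in sorted order).
extra_columns = si.columns.difference(cumulative.columns)

# Drop them and reorder the remaining columns to match the cumulative table.
si_aligned = si.drop(columns=extra_columns)[list(cumulative.columns)]
print(list(extra_columns))
print(list(si_aligned.columns))
```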

In [104]:
stringency_df.head()

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11
1,Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11
2,Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
3,Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.56,5.56,5.56,5.56,5.56,5.56,5.56,5.56,5.56,5.56
4,Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,


In [105]:
stringency_df_melt = pd.melt(stringency_df, 
                             id_vars="Country", 
                             value_vars=list(stringency_df.columns[1:]),
                             var_name="Timestamps", 
                             value_name="SI Index")

utils.initial_dataframe_check(stringency_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,3714.0
# Columns with NAs,1.0
% Null Values in Dataframe,0.747


In [106]:
merged_df = pd.concat([merged_df, stringency_df_melt], 
                      axis=1, join='inner')

# drop column duplicates
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()]

print("Number of rows     : ", merged_df.shape[0])
print("Number of columns  : ", merged_df.shape[1])

Number of rows     :  165792
Number of columns  :  6


### C) Including Continent Variable <a class="anchor" id="add-continent-data"></a>

The SI and cumulative data have now been merged. The merged dataframe can be extended with a column stating which continent each country belongs to. This enables comparisons not only per country, but also per continent.

An external CSV file was created from the countries in the merged dataframe, with an added column for the continent. For this product, the following continent convention is used:

    - Europe
    - Asia
    - Africa
    - Oceania
    - North America
    - Central America
    - South America

Reading CSV file and extracting continents column.

In [107]:
file_path = "../ProjectDataSources/Country_Names_List.csv"

continents = utils.read_data(file_path)["Continent"]

Inserting continents column in the data sources.

In [108]:
# Cumulative Data
confirmed_df.insert(1,"Continent", continents)
deaths_df.insert(1,"Continent", continents)
recovered_df.insert(1,"Continent", continents)

# Stringency Index Data
stringency_df.insert(1,"Continent", continents)
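
Inserting the continents by position assumes that the CSV lists the countries in exactly the same order as each dataframe. A lookup by country name avoids that assumption. The sketch below uses made-up data and assumes the CSV also has a `Country` column, which is not confirmed by the notebook.

```python
import pandas as pd

# Hypothetical lookup table, standing in for Country_Names_List.csv.
lookup = pd.DataFrame({
    "Country": ["Spain", "Italy", "Egypt"],
    "Continent": ["Europe", "Europe", "Africa"],
})

# Data table whose rows are in a different order than the lookup table.
data = pd.DataFrame({"Country": ["Egypt", "Italy", "Spain"], "1/22/20": [0, 0, 0]})

# Map each country name to its continent, independent of row order.
continent_by_country = lookup.set_index("Country")["Continent"]
data.insert(1, "Continent", data["Country"].map(continent_by_country))

# Countries without a match end up as NaN and can be listed for review.
missing = data.loc[data["Continent"].isna(), "Country"].tolist()
print(data)
print(missing)
```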

Stacking and merging data sources.

In [109]:
confirmed_df_melt = pd.melt(confirmed_df, 
                            id_vars=["Country", "Continent"],
                            value_vars=list(confirmed_df.columns[2:]),
                            var_name="Timestamps", 
                            value_name="Cumulative Confirmed Cases")

In [110]:
deaths_df_melt = pd.melt(deaths_df, 
                         id_vars=["Country", "Continent"],
                         value_vars=list(deaths_df.columns[2:]),
                         var_name="Timestamps", 
                         value_name="Cumulative Death Cases")

In [111]:
recovered_df_melt = pd.melt(recovered_df, 
                            id_vars=["Country", "Continent"], 
                            value_vars=list(recovered_df.columns[2:]),
                            var_name="Timestamps", 
                            value_name="Cumulative Recovered Cases")

In [112]:
stringency_df_melt = pd.melt(stringency_df, 
                             id_vars=["Country", "Continent"], 
                             value_vars=list(stringency_df.columns[2:]),
                             var_name="Timestamps", 
                             value_name="SI Index")

### D) Merge Daily Cases Data <a class="anchor" id="merge-daily-data"></a>

Building updated merged dataframe including daily data.

In [113]:
merged_df = pd.concat([merged_daily_df, confirmed_df_melt, deaths_df_melt, 
                       recovered_df_melt, stringency_df_melt], 
                      axis=1, join='inner')

# drop column duplicates
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()]

print("Number of rows     : ", merged_df.shape[0])
print("Number of columns  : ", merged_df.shape[1])

Number of rows     :  165792
Number of columns  :  10


In [140]:
utils.initial_dataframe_check(merged_df)

Unnamed: 0,Values
# Rows,165792.0
# Columns,10.0
# Rows with NAs,67140.0
# Columns with NAs,4.0
% Null Values in Dataframe,5.114


In [141]:
merged_df.tail(15)

Unnamed: 0,Country,Continent,Timestamps,Daily Confirmed Cases,Daily Death Cases,Daily Recovered Cases,Cumulative Confirmed Cases,Cumulative Death Cases,Cumulative Recovered Cases,SI Index
165777,Tunisia,Africa,2022-08-20,1141773.0,29209.0,,1141773,29209,0,
165778,Turkey,Asia,2022-08-20,16671848.0,100400.0,,16671848,100400,0,13.89
165779,Uganda,Africa,2022-08-20,169396.0,3628.0,,169396,3628,0,13.89
165780,Ukraine,Europe,2022-08-20,5313322.0,116551.0,0.0,5313322,116551,0,
165781,United Arab Emirates,Asia,2022-08-20,1009116.0,2341.0,,1009116,2341,0,
165782,United Kingdom,Europe,2022-08-20,23675485.0,187731.0,0.0,23675485,187731,0,
165783,United States,North America,2022-08-20,93634408.0,1041141.0,0.0,93634408,1041141,0,
165784,Uruguay,South America,2022-08-20,975264.0,7429.0,,975264,7429,0,
165785,Uzbekistan,Asia,2022-08-20,243654.0,1637.0,,243654,1637,0,
165786,Vanuatu,Oceania,2022-08-20,11770.0,14.0,,11770,14,0,13.89


Note that around **5%** of the values in the merged dataframe are missing.
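
To see where the missing values are concentrated, the share of nulls can be computed per column. This is a sketch on made-up data; whether `utils.initial_dataframe_check` computes its percentage as nulls over all cells is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Italy", "Spain", "Egypt", "Peru"],
    "Daily Recovered Cases": [1.0, np.nan, np.nan, 4.0],
    "SI Index": [10.0, 20.0, np.nan, 40.0],
})

# Percentage of missing values in each column.
null_share_per_column = toy.isna().mean().mul(100).round(2)

# Percentage of missing values over all cells of the dataframe.
overall_null_share = round(toy.isna().to_numpy().mean() * 100, 2)

print(null_share_per_column)
print(overall_null_share)
```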

## 4 Merging Data Test  <a class="anchor" id="test-data"></a>

After all data sources are merged as one dataframe, querying can be done by grouping countries and timestamps.

#### Example 1: Get cumulative death cases data for South America in summer 2021 (June and July).

In [148]:
summer_dates = merged_df.loc[(merged_df["Timestamps"]>="2021-06-01") &
                             (merged_df["Timestamps"]<="2021-07-31"), "Timestamps"].values
summer_dates

array(['2021-06-01T00:00:00.000000000', '2021-06-01T00:00:00.000000000',
       '2021-06-01T00:00:00.000000000', ...,
       '2021-07-31T00:00:00.000000000', '2021-07-31T00:00:00.000000000',
       '2021-07-31T00:00:00.000000000'], dtype='datetime64[ns]')

In [149]:
merged_df.loc[(merged_df["Continent"]=="South America") &
              (merged_df["Timestamps"].isin(summer_dates))]
               

Unnamed: 0,Country,Continent,Timestamps,Daily Confirmed Cases,Daily Death Cases,Daily Recovered Cases,Cumulative Confirmed Cases,Cumulative Death Cases,Cumulative Recovered Cases,SI Index
87301,Argentina,South America,2021-06-01,3817139.0,78733.0,3381337.0,3817139,78733,3381337,81.48
87314,Bolivia,South America,2021-06-01,374718.0,14639.0,297580.0,374718,14639,297580,27.78
87317,Brazil,South America,2021-06-01,16636801.0,465578.0,14694950.0,16636801,465578,14694950,66.20
87328,Chile,South America,2021-06-01,1389357.0,29344.0,1315860.0,1389357,29344,1315860,84.72
87330,Colombia,South America,2021-06-01,3432422.0,89297.0,3193406.0,3432422,89297,3193406,62.96
87344,Ecuador,South America,2021-06-01,427690.0,20620.0,375151.0,427690,20620,375151,60.19
87363,Guyana,South America,2021-06-01,17114.0,391.0,14879.0,17114,391,14879,60.19
87419,Paraguay,South America,2021-06-01,358244.0,9293.0,294994.0,358244,9293,294994,49.07
87420,Peru,South America,2021-06-01,1955469.0,184021.0,1905433.0,1955469,184021,1905433,75.93
87445,Suriname,South America,2021-06-01,15128.0,313.0,11877.0,15128,313,11877,81.48


#### Example 2: Get all recovered cases per continent during October 2020.

In [158]:
start_date = "2020-10-01"
end_date   = "2020-10-31"

In [159]:
subset_df = merged_df.loc[(merged_df["Timestamps"] >= start_date) &
                          (merged_df["Timestamps"] <= end_date)]

In [160]:
# Select the column before aggregating so that only numeric data is summed
subset_df.groupby("Continent")["Cumulative Recovered Cases"].sum()

Continent
Africa              45673503
Asia               328248417
Central America     13393039
Europe              88114722
North America      122200793
Oceania                77484
South America      240722215
Name: Cumulative Recovered Cases, dtype: int64
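
Since the column is cumulative, summing it over every day of the month adds each country's running total once per day. If the goal is the number of recoveries recorded within the period, one possible alternative is to take the difference between the last and the first value per country. The sketch below uses made-up numbers and hypothetical variable names; it is an illustration, not the notebook's method.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Italy", "Italy", "Egypt", "Egypt"],
    "Continent": ["Europe", "Europe", "Africa", "Africa"],
    "Timestamps": pd.to_datetime(["2020-10-01", "2020-10-31",
                                  "2020-10-01", "2020-10-31"]),
    "Cumulative Recovered Cases": [100, 150, 40, 70],
})

# Keep the rows inside the period, ordered by date.
in_october = toy["Timestamps"].between("2020-10-01", "2020-10-31")
october = toy.loc[in_october].sort_values("Timestamps")

# First and last cumulative value per country inside the period.
per_country = (october
               .groupby(["Continent", "Country"])["Cumulative Recovered Cases"]
               .agg(["first", "last"]))
per_country["recovered_in_period"] = per_country["last"] - per_country["first"]

# Sum the per-country increases within each continent.
per_continent = per_country.groupby(level="Continent")["recovered_in_period"].sum()
print(per_continent)
```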

#### Example 3: Compare Government Response in North African countries over one week (2020-08-16 to 2020-08-22).

In [164]:
start_date = "2020-08-16"
end_date   = "2020-08-22"

In [165]:
subset_countries = ["Egypt","Libia","Algeria","Tunisia","Morocco"]

In [169]:
merged_df.loc[(merged_df["Timestamps"] >= start_date) &
              (merged_df["Timestamps"] <= end_date) &
              (merged_df["Country"].isin(subset_countries)), 
              ["Country","Timestamps","SI Index"]]

Unnamed: 0,Country,Timestamps,SI Index
36434,Algeria,2020-08-16,68.52
36481,Egypt,2020-08-16,62.96
36540,Morocco,2020-08-16,73.15
36593,Tunisia,2020-08-16,24.07
36610,Algeria,2020-08-17,68.52
36657,Egypt,2020-08-17,62.96
36716,Morocco,2020-08-17,73.15
36769,Tunisia,2020-08-17,24.07
36786,Algeria,2020-08-18,79.63
36833,Egypt,2020-08-18,62.96
