In [1]:
# DEVELOPMENT OF DATA PRODUCTS - 18697
# US4 - DATA CONSISTENCY

This notebook is part of the Development of Data Products project. Its functional objective is to provide the end user with data analysis and visualization comparing daily and cumulative recorded cases for confirmed, deceased, and recovered patients. In addition, the Stringency Index is included to compare how different governments have reacted to the pandemic in terms of restrictions and regulations.

# Index:

1. [Imported Libraries and Scripts](#import-libraries-scripts)
2. [Reading and Preprocessing Data Sources](#read-data)
    1. [Daily Cases Data](#daily-data)
    2. [Cumulative Cases Data](#cumulative-data)
    3. [Government Response Data](#si-data)
3. [Merging Data Sources](#merge-data)
    1. [Aggregate and Merge Cumulative Cases Data](#merge-cumulative-data)
    2. [Merge Government Response Data](#merge-si-data)
    3. [Including Continent Variable](#add-continent-data)
4. [Merging Data Test](#test-data)

## 1 Imported Libraries and Scripts <a class="anchor" id="import-libraries-scripts"></a>

Part of the functionality lives in dedicated Python functions stored in an external file, which is imported into this notebook.

In [2]:
# Libraries
import os

import pandas as pd
import numpy as np

import time
import random

# for regular expressions
import re

# for dates and timestamps handling
from datetime import datetime

In [3]:
# Scripts
from scripts import utils

## 2 Reading and Preprocessing Data Sources <a class="anchor" id="read-data"></a>

The collected data sources are under "DDP-unibz-project-18697/ProjectDataSources" inside the following directories:
    
    - csse_covid_19_data/csse_covid_19_daily_reports/ --> Daily data
    - csse_covid_19_data/csse_covid_19_time_series/ --> Cumulative data, recovered, deaths and confirmed cases
    - covid-policy-tracker/timeseries/ --> Stringency Index (Government response Indicator)
    
Note that the date formats differ between sources: the daily report files are named in **Month-Day-Year** format (e.g. `03-28-2020.csv`), the cumulative time series use **Month/Day/Year** column labels (e.g. `1/22/20`), and the government response data uses labels such as `01Jan2020`.

In addition to reading the CSV files, an initial data check is performed to verify:

    - That columns have the proper data types for further data manipulation
    - How many rows and columns contain null or unavailable data
    - Which percentage of the total data is missing or unknown
    
Moreover, the imported data sources contain columns that are not relevant to the functional objective of the project; these are dropped using a simple Python function.
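The initial data check described above can be sketched roughly as follows. This is only an illustration: `summarize_missing` is a hypothetical name, and the actual `utils.initial_dataframe_check` may be implemented differently.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: summarize shape and missing values of a dataframe."""
    null_mask = df.isna()
    n_cells = df.shape[0] * df.shape[1]
    # Share of null cells over all cells, as a percentage
    pct_null = round(100 * null_mask.to_numpy().sum() / n_cells, 3) if n_cells else 0.0
    summary = {
        "# Rows": df.shape[0],
        "# Columns": df.shape[1],
        "# Rows with NAs": int(null_mask.any(axis=1).sum()),
        "# Columns with NAs": int(null_mask.any(axis=0).sum()),
        "% Null Values in Dataframe": pct_null,
    }
    return pd.DataFrame({"Values": pd.Series(summary, dtype=float)})


# Toy data, not taken from the project sources
example = pd.DataFrame({"Country": ["A", "B", "C"],
                        "Confirmed": [1.0, np.nan, 3.0]})
report = summarize_missing(example)
print(report)
```

Data types can be inspected separately with `df.dtypes`.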

In previous product development stages, only part of the data was merged successfully. The rest failed because of mismatches and inconsistencies in column names and in the countries listed per data source. It is convenient to tackle this problem in the dataframe generation stage, so that no issues arise when aggregating and selecting variables for visualizations.

Using functions from the utils.py Python file, the collected data sources are cleaned and preprocessed at an early stage: consistent column names are used, the list of countries is filtered for all data sources, and the corresponding aggregations are computed.

### A) Daily Cases Data <a class="anchor" id="daily-data"></a>

The data for a particular day is read from a CSV file. It is then cleaned, and initial statistics are displayed to the user.

Daily data is not covered in the current commit, as the date of the file being read must be taken into account during preprocessing. This is not the case for cumulative and stringency index data, which are easier to manipulate.

In [4]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_daily_reports/03-28-2020.csv"

daily_df = utils.read_data(file_path)
daily_df = utils.drop_columns(daily_df, file_path,
                              data_source="daily")

utils.initial_dataframe_check(daily_df)

Removed 8 columns from dataframe


Unnamed: 0,Values
# Rows,3461.0
# Columns,4.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [5]:
daily_df.tail()

Unnamed: 0,Country,Confirmed,Deaths,Recovered
3456,Vietnam,174,0,21
3457,West Bank and Gaza,98,1,18
3458,Winter Olympics 2022,0,0,0
3459,Zambia,28,0,0
3460,Zimbabwe,7,1,0


In [6]:
number_unique_countries = len(np.unique(daily_df["Country"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  185


The country names are reviewed in later stages.

### B) Cumulative Cases Data <a class="anchor" id="cumulative-data"></a>

Cumulative data is composed of three different timeseries:

    - global confirmed cases
    - global deaths cases
    - global recovered cases

The cumulative data has the same number of countries for the three data sources, so the data can be cleaned and aggregated in advance.

#### Confirmed Cases

Read data, drop non-relevant columns and rename countries column.

In [7]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

confirmed_df = utils.read_data(file_path)
confirmed_df = utils.drop_columns(confirmed_df, file_path,
                                  data_source="cumulative")

utils.initial_dataframe_check(confirmed_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,285.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


For consistency between all data sources, countries that are present in one data source but not in the others have to be removed. Also, some countries have to be renamed so that the same name is used in all data sources.

In [8]:
confirmed_df = utils.country_list_formatting(confirmed_df, 
                                             data_source="cumulative")

There have been 23 countries removed from the dataset.


Check if there are countries with data split by regions or provinces. If that is the case, aggregate the data per country by adding up all the cases, so that the number of cases is the total national number and not a regional one.

In [9]:
aggregate_countries = utils.get_countries_split_by_regions(confirmed_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    confirmed_df = utils.country_aggregation_dataframe(confirmed_df, aggregate_countries,
                                                       data_type="cumulative")
  
    print(utils.initial_dataframe_check(confirmed_df))
    
else:
    print("There are no countries with data split by regions.")

There are 8 countries where data needs to be aggregated.
                            Values
# Rows                       176.0
# Columns                    943.0
# Rows with NAs                0.0
# Columns with NAs             0.0
% Null Values in Dataframe     0.0
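The per-country aggregation can be sketched with a pandas `groupby`. The function below is a hypothetical stand-in for `utils.country_aggregation_dataframe`, whose real implementation is not shown in this notebook; the data is a toy example.

```python
import pandas as pd


def aggregate_per_country(df, country_column="Country", how="sum"):
    """Hypothetical helper: collapse regional rows into one row per country.

    how="sum" suits case counts; how="mean" suits an index such as the
    Stringency Index (pandas skips NaN values when computing the mean).
    """
    grouped = df.groupby(country_column, as_index=False, sort=True)
    if how == "sum":
        return grouped.sum()
    if how == "mean":
        return grouped.mean()
    raise ValueError(f"Unsupported aggregation: {how}")


# Toy data: two regional rows for one country, one row for another
regional = pd.DataFrame({
    "Country": ["Australia", "Australia", "Fiji"],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 1],
})
national = aggregate_per_country(regional, how="sum")
print(national)
```

Because `sort=True`, the result is also ordered alphabetically by country name.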


In [10]:
confirmed_df.tail(20)

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
156,Thailand,4,4,5,6,8,8,14,14,14,...,4616512,4618652,4620425,4622088,4623596,4626057,4628200,4630310,4632212,4634180
157,Timor-Leste,0,0,0,0,0,0,0,0,0,...,23072,23074,23074,23074,23086,23095,23100,23108,23114,23114
158,Togo,0,0,0,0,0,0,0,0,0,...,38236,38273,38273,38285,38295,38303,38330,38337,38348,38355
159,Tonga,0,0,0,0,0,0,0,0,0,...,13405,13405,13405,13405,13405,14135,14135,14135,14135,14135
160,Trinidad and Tobago,0,0,0,0,0,0,0,0,0,...,174159,174552,174896,175098,175273,175494,175856,176107,176468,176821
161,Tunisia,0,0,0,0,0,0,0,0,0,...,1139241,1139241,1139241,1139241,1139241,1141135,1141334,1141487,1141773,1141773
162,Turkey,0,0,0,0,0,0,0,0,0,...,16295817,16295817,16295817,16295817,16528070,16671848,16671848,16671848,16671848,16671848
163,Uganda,0,0,0,0,0,0,0,0,0,...,169396,169396,169396,169396,169396,169396,169396,169396,169396,169396
164,Ukraine,0,0,0,0,0,0,0,0,0,...,5303833,5304149,5304634,5305063,5305455,5305875,5306219,5306713,5312730,5313322
165,United Arab Emirates,0,0,0,0,0,0,0,4,4,...,1002306,1003129,1003929,1004751,1005543,1006318,1007039,1007742,1008435,1009116


#### Death Cases

The cumulative death cases data follows the same preprocessing steps as the confirmed cases: countries are filtered and data is aggregated if needed.

In [11]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

deaths_df = utils.read_data(file_path)
deaths_df = utils.drop_columns(deaths_df, file_path,
                               data_source="cumulative")

utils.initial_dataframe_check(deaths_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,285.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [12]:
deaths_df = utils.country_list_formatting(deaths_df, 
                                          data_source="cumulative")

There have been 23 countries removed from the dataset.


In [13]:
aggregate_countries = utils.get_countries_split_by_regions(deaths_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    deaths_df = utils.country_aggregation_dataframe(deaths_df, aggregate_countries,
                                                    data_type="cumulative")
  
    print(utils.initial_dataframe_check(deaths_df))
    
else:
    print("There are no countries with data split by regions.")

There are 8 countries where data needs to be aggregated.
                            Values
# Rows                       176.0
# Columns                    943.0
# Rows with NAs                0.0
# Columns with NAs             0.0
% Null Values in Dataframe     0.0


In [14]:
deaths_df.tail(20)

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
156,Thailand,0,0,0,0,0,0,0,0,0,...,31763,31798,31828,31858,31887,31915,31944,31971,32000,32027
157,Timor-Leste,0,0,0,0,0,0,0,0,0,...,135,135,135,135,135,135,135,135,135,135
158,Togo,0,0,0,0,0,0,0,0,0,...,281,281,281,281,281,281,281,281,281,282
159,Tonga,0,0,0,0,0,0,0,0,0,...,12,12,12,12,12,12,12,12,12,12
160,Trinidad and Tobago,0,0,0,0,0,0,0,0,0,...,4070,4071,4075,4079,4079,4080,4084,4089,4092,4095
161,Tunisia,0,0,0,0,0,0,0,0,0,...,29153,29153,29153,29153,29153,29189,29202,29206,29209,29209
162,Turkey,0,0,0,0,0,0,0,0,0,...,99678,99678,99678,99678,100058,100400,100400,100400,100400,100400
163,Uganda,0,0,0,0,0,0,0,0,0,...,3628,3628,3628,3628,3628,3628,3628,3628,3628,3628
164,Ukraine,0,0,0,0,0,0,0,0,0,...,116505,116506,116508,116508,116508,116510,116511,116511,116549,116551
165,United Arab Emirates,0,0,0,0,0,0,0,0,0,...,2339,2339,2339,2339,2339,2339,2340,2341,2341,2341


#### Recovered Cases

The cumulative recovered cases data follows the same preprocessing steps as the confirmed cases: countries are filtered and data is aggregated if needed.

In [15]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

recovered_df = utils.read_data(file_path)
recovered_df = utils.drop_columns(recovered_df, file_path,
                                  data_source="cumulative")

utils.initial_dataframe_check(recovered_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,270.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [16]:
recovered_df = utils.country_list_formatting(recovered_df, 
                                             data_source="cumulative")

There have been 23 countries removed from the dataset.


In [17]:
aggregate_countries = utils.get_countries_split_by_regions(recovered_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    recovered_df = utils.country_aggregation_dataframe(recovered_df, aggregate_countries,
                                                       data_type="cumulative")
  
    print(utils.initial_dataframe_check(recovered_df))
    
else:
    print("There are no countries with data split by regions.")

There are 7 countries where data needs to be aggregated.
                            Values
# Rows                       176.0
# Columns                    943.0
# Rows with NAs                0.0
# Columns with NAs             0.0
% Null Values in Dataframe     0.0


In [18]:
recovered_df.tail(20)

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
156,Thailand,2,2,3,3,6,6,6,6,7,...,0,0,0,0,0,0,0,0,0,0
157,Timor-Leste,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
158,Togo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159,Tonga,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
160,Trinidad and Tobago,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161,Tunisia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
162,Turkey,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
163,Uganda,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
164,Ukraine,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
165,United Arab Emirates,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In the previous user story, the recovered cases dataset had fewer observations than the confirmed and deaths data. After aggregating per country, the datasets have the same number of rows and the same number of countries.

In [19]:
# Country Names
print("How many countries in the confirmed cases list ? ",len(np.unique(confirmed_df["Country"])))
print("How many countries in the deaths cases list ? ",len(np.unique(deaths_df["Country"])))
print("How many countries in the recovered cases list ? ",len(np.unique(recovered_df["Country"])))

How many countries in the confirmed cases list ?  176
How many countries in the deaths cases list ?  176
How many countries in the recovered cases list ?  176


### C) Government Response Data <a class="anchor" id="si-data"></a>

The Stringency Index (SI) data follows the same preprocessing steps as the cumulative cases data, but it differs in the way the data is aggregated per country: for SI, the mean value is used. Also, some countries in the SI data are not present in the cumulative data and vice versa, so all countries outside the intersection are left out.

In [20]:
file_path = "../ProjectDataSources/covid-policy-tracker/" + \
            "timeseries/stringency_index_avg.csv"

stringency_df = utils.read_data(file_path)
stringency_df = utils.drop_columns(stringency_df, file_path,
                                   data_source="stringency_index")

utils.initial_dataframe_check(stringency_df)

Removed 4 columns from dataframe


Unnamed: 0,Values
# Rows,263.0
# Columns,970.0
# Rows with NAs,263.0
# Columns with NAs,969.0
% Null Values in Dataframe,7.953


In [21]:
number_unique_countries = len(np.unique(stringency_df["Country"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  187


The country list of the stringency index dataset does not match the ones from the CSSE data source used for daily and cumulative cases. The country lists are intersected in the following steps for data consistency.

In [22]:
countries_list_cumulative = confirmed_df["Country"].values
countries_list_si = stringency_df["Country"].values

utils.get_list_inner_outer_join(countries_list_cumulative, countries_list_si, 
                                operation="outer")

['Aruba',
 'Bermuda',
 'Cabo Verde',
 'Cape Verde',
 'Faeroe Islands',
 'Greenland',
 'Guam',
 'Hong Kong',
 'Kyrgyz Republic',
 'Kyrgyzstan',
 'Macao',
 'Palestine',
 'Puerto Rico',
 'Slovak Republic',
 'Slovakia',
 'Turkmenistan',
 'United States Virgin Islands']

Some of the listed countries appear twice under different names (for example, 'Slovak Republic' and 'Slovakia'), so the corresponding Python function has to ensure that the same name is used in all data sources. The remaining countries in this list are exclusive to the SI data and are removed.
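One way to compute such inner and outer joins of two country lists is with Python sets. The sketch below assumes that "outer" means the names found in exactly one of the two lists; `list_join` is a hypothetical name, not the actual `utils.get_list_inner_outer_join`.

```python
def list_join(left, right, operation="inner"):
    """Hypothetical helper: sorted inner or outer join of two lists of names.

    inner -> names present in both lists (set intersection)
    outer -> names present in exactly one list (symmetric difference)
    """
    left_set, right_set = set(left), set(right)
    if operation == "inner":
        result = left_set & right_set
    elif operation == "outer":
        result = left_set ^ right_set
    else:
        raise ValueError(f"Unsupported operation: {operation}")
    return sorted(result)


# Toy lists illustrating a naming mismatch between two sources
cumulative_names = ["Italy", "Slovakia"]
si_names = ["Italy", "Slovak Republic"]

common = list_join(cumulative_names, si_names, operation="inner")
mismatched = list_join(cumulative_names, si_names, operation="outer")
print(common)
print(mismatched)
```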

In [23]:
stringency_df = utils.country_list_formatting(stringency_df, 
                                              data_source="stringency_index")

There have been 11 countries removed from the dataset.


In [24]:
aggregate_countries = utils.get_countries_split_by_regions(stringency_df,
                                                           country_column="Country")

if len(aggregate_countries) > 0:
    
    stringency_df = utils.country_aggregation_dataframe(stringency_df, aggregate_countries,
                                                        data_type="stringency_index")
  
    print(utils.initial_dataframe_check(stringency_df))
    
else:
    print("There are no countries with data split by regions.")

There are 4 countries where data needs to be aggregated.
                             Values
# Rows                      176.000
# Columns                   970.000
# Rows with NAs             176.000
# Columns with NAs          969.000
% Null Values in Dataframe    4.869


In [25]:
# Country Names
print("How many countries in the confirmed cases list ? ",len(np.unique(confirmed_df["Country"])))
print("How many countries in the deaths cases list ? ",len(np.unique(deaths_df["Country"])))
print("How many countries in the recovered cases list ? ",len(np.unique(recovered_df["Country"])))
print("How many countries in the stringency index list ? ",len(np.unique(stringency_df["Country"])))

How many countries in the confirmed cases list ?  176
How many countries in the deaths cases list ?  176
How many countries in the recovered cases list ?  176
How many countries in the stringency index list ?  176


Now, cumulative and SI data sources have the same number of countries. They are ready to be merged as one dataframe.

## 3 Merging Data Sources  <a class="anchor" id="merge-data"></a>

Some of the collected data sources store the data as tables in which each timestamp (a day) has a separate column. For later aggregation and visualization, it is useful to have a dedicated column for timestamps and to assign the numerical values to another column that represents the data source: Daily data, Cumulative data or the Stringency Index.

With Pandas built-in functions, the required manipulation is straightforward. Note that for a few countries the collected data is more granular and is split by regions or overseas territories. For simplicity, every country must have only one row for all timestamps, hence a sum aggregation is applied.

The functions to merge and aggregate data are part of the scripts folder.

### A) Aggregate and Merge Cumulative Cases Data <a class="anchor" id="merge-cumulative-data"></a>

For a successful merge, each of the dataframes must contain the same countries. This is verified with a function that computes the intersection of the involved datasets; the sizes of the intersections should match.

In [26]:
intersection_1 = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                 np.unique(deaths_df["Country"]),
                                                 operation="inner")

intersection_2 = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                 np.unique(recovered_df["Country"]),
                                                 operation="inner")

intersection_3 = utils.get_list_inner_outer_join(np.unique(deaths_df["Country"]),
                                                 np.unique(recovered_df["Country"]),
                                                 operation="inner")

In [27]:
print(len(intersection_1))
print(len(intersection_2))
print(len(intersection_3))

176
176
176


The same number of countries is contained in each cumulative dataset. Because regional data was already aggregated per country during preprocessing, each country is summarized in a single observation.

#### Merging Data

Each cumulative dataset is melted, i.e. transformed so that a single column contains all measurements, and then merged with the other cumulative datasets.

In [28]:
confirmed_df_melt = pd.melt(confirmed_df, 
                            id_vars="Country", 
                            value_vars=list(confirmed_df.columns[1:]),
                            var_name="Timestamps", 
                            value_name="Confirmed Cases")

utils.initial_dataframe_check(confirmed_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0
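As a small illustration of what `pd.melt` does here, the toy example below (invented values, not project data) turns a wide table with one column per day into a long table with one row per country and day. The same logic explains the row count above: 176 countries times 942 timestamp columns gives 165792 rows.

```python
import pandas as pd

# Wide format: one row per country, one column per timestamp
wide = pd.DataFrame({
    "Country": ["Fiji", "Tonga"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
})

# Long format: one row per (country, timestamp) pair
long = pd.melt(wide,
               id_vars="Country",
               var_name="Timestamps",
               value_name="Confirmed Cases")
print(long)
```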


In [29]:
deaths_df_melt = pd.melt(deaths_df, 
                         id_vars="Country", 
                         value_vars=list(deaths_df.columns[1:]),
                         var_name="Timestamps", 
                         value_name="Death Cases")

utils.initial_dataframe_check(deaths_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [30]:
recovered_df_melt = pd.melt(recovered_df, 
                            id_vars="Country", 
                            value_vars=list(recovered_df.columns[1:]),
                            var_name="Timestamps", 
                            value_name="Recovered Cases")

utils.initial_dataframe_check(recovered_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


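Because the stacked dataframes share the same country and timestamp column names and values, the merging process is straightforward. Note that concatenating along `axis=1` aligns rows by index position, so it relies on all dataframes listing the countries in the same order.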

In [31]:
merged_df = pd.concat([confirmed_df_melt, deaths_df_melt, recovered_df_melt], 
                      axis=1, join='inner')

# drop column duplicates
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()]

print("Number of rows     : ", merged_df.shape[0])
print("Number of columns  : ", merged_df.shape[1])

Number of rows     :  165792
Number of columns  :  5
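An alternative that does not depend on row order is a key-based merge on the country and timestamp columns. The following is a minimal sketch on toy data, not the approach used in this notebook; the `validate` argument makes pandas raise an error if a key pair is duplicated.

```python
from functools import reduce

import pandas as pd

keys = ["Country", "Timestamps"]

# Toy long-format tables; the second one lists countries in a different order
confirmed = pd.DataFrame({"Country": ["Fiji", "Tonga"],
                          "Timestamps": ["1/22/20", "1/22/20"],
                          "Confirmed Cases": [5, 0]})
deaths = pd.DataFrame({"Country": ["Tonga", "Fiji"],
                       "Timestamps": ["1/22/20", "1/22/20"],
                       "Death Cases": [0, 1]})
recovered = pd.DataFrame({"Country": ["Fiji", "Tonga"],
                          "Timestamps": ["1/22/20", "1/22/20"],
                          "Recovered Cases": [2, 0]})

# Merge pairwise on the key columns instead of relying on index alignment
merged = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how="inner",
                                 validate="one_to_one"),
    [confirmed, deaths, recovered],
)
print(merged)
```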


In [32]:
merged_df.tail(20)

Unnamed: 0,Country,Timestamps,Confirmed Cases,Death Cases,Recovered Cases
165772,Thailand,8/20/22,4634180,32027,0
165773,Timor-Leste,8/20/22,23114,135,0
165774,Togo,8/20/22,38355,282,0
165775,Tonga,8/20/22,14135,12,0
165776,Trinidad and Tobago,8/20/22,176821,4095,0
165777,Tunisia,8/20/22,1141773,29209,0
165778,Turkey,8/20/22,16671848,100400,0
165779,Uganda,8/20/22,169396,3628,0
165780,Ukraine,8/20/22,5313322,116551,0
165781,United Arab Emirates,8/20/22,1009116,2341,0


### B) Merge Government Response Data <a class="anchor" id="merge-si-data"></a>

To merge the Stringency Index dataset into the cumulative data, the country names must match between the two datasets.

After the preprocessing above, the countries match in both data sources: the inner join contains all 176 countries and the outer join is empty.

In [33]:
intersection_inner = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                     np.unique(stringency_df["Country"]),
                                                     operation="inner")

print(len(intersection_inner))

176


In [34]:
intersection_outer = utils.get_list_inner_outer_join(np.unique(confirmed_df["Country"]),
                                                     np.unique(stringency_df["Country"]),
                                                     operation="outer")

print(len(intersection_outer))

0


To merge the data sources, both must contain the same timestamps. The Stringency Index has more timestamps, as it starts at the beginning of the year 2020, even where no data was collected or available.

The following tasks must be completed:

    - Have the same number of timestamps for both data sources
    - Have the same naming convention for timestamps

In [35]:
print("Number of timestamps in cumulative data : ", len(confirmed_df.columns[1:]))
print("Number of timestamps in SI data         : ", len(stringency_df.columns[1:]))

Number of timestamps in cumulative data :  942
Number of timestamps in SI data         :  969


In [36]:
print("Cumulative data columns : ")
print(confirmed_df.columns[1:10])

print("-"*70)

print("SI Index data columns : ")
print(stringency_df.columns[1:10])

Cumulative data columns : 
Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20'],
      dtype='object')
----------------------------------------------------------------------
SI Index data columns : 
Index(['01Jan2020', '02Jan2020', '03Jan2020', '04Jan2020', '05Jan2020',
       '06Jan2020', '07Jan2020', '08Jan2020', '09Jan2020'],
      dtype='object')


First, the timestamp naming convention must match for both the cumulative and the SI data sources. The following function transforms the timestamp strings of the stringency dataset into the format used in the cumulative data.

In [37]:
updated_timestamps = utils.formatting_timestamp_string(stringency_df, 
                                                       country_column="Country")

stringency_df.columns = updated_timestamps

In [38]:
stringency_df.columns[1:6]

Index(['1/1/20', '1/2/20', '1/3/20', '1/4/20', '1/5/20'], dtype='object')
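The conversion between the two label formats can be sketched with the standard library. `si_to_cumulative_timestamp` is a hypothetical name; the actual `utils.formatting_timestamp_string` may work differently. The `%b` directive is locale-dependent, so the sketch assumes English month abbreviations.

```python
from datetime import datetime


def si_to_cumulative_timestamp(label: str) -> str:
    """Hypothetical helper: convert e.g. '01Jan2020' to '1/1/20'."""
    parsed = datetime.strptime(label, "%d%b%Y")
    # Month and day without zero padding, two-digit year
    return f"{parsed.month}/{parsed.day}/{parsed.year % 100:02d}"


si_labels = ["01Jan2020", "22Jan2020", "26Aug2022"]
converted = [si_to_cumulative_timestamp(label) for label in si_labels]
print(converted)
```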

Next, to obtain the same timestamp columns in both data sources, an outer join is computed between their column lists, and the resulting columns, which appear in only one of the sources, are dropped from the stringency index dataset.

In [39]:
columns_to_drop = utils.get_list_inner_outer_join(stringency_df.columns, 
                                                  confirmed_df.columns, 
                                                  operation="outer")

columns_to_drop

['1/1/20',
 '1/10/20',
 '1/11/20',
 '1/12/20',
 '1/13/20',
 '1/14/20',
 '1/15/20',
 '1/16/20',
 '1/17/20',
 '1/18/20',
 '1/19/20',
 '1/2/20',
 '1/20/20',
 '1/21/20',
 '1/3/20',
 '1/4/20',
 '1/5/20',
 '1/6/20',
 '1/7/20',
 '1/8/20',
 '1/9/20',
 '8/21/22',
 '8/22/22',
 '8/23/22',
 '8/24/22',
 '8/25/22',
 '8/26/22']

In [40]:
stringency_df = stringency_df.drop(columns_to_drop, axis=1)

utils.initial_dataframe_check(stringency_df)

Unnamed: 0,Values
# Rows,176.0
# Columns,943.0
# Rows with NAs,106.0
# Columns with NAs,942.0
% Null Values in Dataframe,2.238


In [41]:
utils.get_list_inner_outer_join(stringency_df.columns, 
                                confirmed_df.columns, 
                                operation="outer")

[]

Now the same timestamp columns with the same names are in both data sources. The stringency index dataset can be merged with the cumulative data.

In [42]:
stringency_df.head()

Unnamed: 0,Country,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11
1,Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11,11.11
2,Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
3,Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.56,5.56,5.56,5.56,5.56,5.56,5.56,5.56,5.56,5.56
4,Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,


In [43]:
stringency_df_melt = pd.melt(stringency_df, 
                             id_vars="Country", 
                             value_vars=list(stringency_df.columns[1:]),
                             var_name="Timestamps", 
                             value_name="SI Index")

utils.initial_dataframe_check(stringency_df_melt)

Unnamed: 0,Values
# Rows,165792.0
# Columns,3.0
# Rows with NAs,3714.0
# Columns with NAs,1.0
% Null Values in Dataframe,0.747


In [44]:
merged_df = pd.concat([merged_df, stringency_df_melt], 
                      axis=1, join='inner')

# drop column duplicates
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()]

print("Number of rows     : ", merged_df.shape[0])
print("Number of columns  : ", merged_df.shape[1])

Number of rows     :  165792
Number of columns  :  6


### C) Including Continent Variable <a class="anchor" id="add-continent-data"></a>

Both the SI and the cumulative data have been merged. The merged dataframe can be extended with a column indicating the continent each country belongs to. This enables future comparisons not only per country, but also per continent.

An external CSV file is written using the countries from the merged file, with an added column for the continent. For this product, the following continent convention is used:

    - Europe
    - Asia
    - Africa
    - Oceania
    - North America
    - Central America
    - South America

Reading CSV file and extracting continents column.

In [45]:
file_path = "../ProjectDataSources/Country_Names_List.csv"

continents = utils.read_data(file_path)["Continent"]

Inserting continents column in the data sources.

In [46]:
# Cumulative Data
confirmed_df.insert(1,"Continent", continents)
deaths_df.insert(1,"Continent", continents)
recovered_df.insert(1,"Continent", continents)

# Stringency Index Data
stringency_df.insert(1,"Continent", continents)
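Inserting the continents column directly assumes that the CSV file lists the countries in exactly the same order as each dataframe. A more defensive variant maps continents by country name. The sketch below uses toy data and assumes the lookup table also has a `Country` column, which is not confirmed by this notebook.

```python
import pandas as pd

# Hypothetical lookup table; the "Country" column is an assumption
continent_lookup = pd.DataFrame({
    "Country": ["Tonga", "Fiji", "Togo"],
    "Continent": ["Oceania", "Oceania", "Africa"],
})

# Toy cases table with countries in a different order than the lookup
cases = pd.DataFrame({
    "Country": ["Fiji", "Togo", "Tonga"],
    "1/22/20": [0, 0, 0],
})

# Map each country name to its continent, independent of row order
mapping = continent_lookup.set_index("Country")["Continent"]
cases.insert(1, "Continent", cases["Country"].map(mapping))

# Countries missing from the lookup table would show up as NaN
unmapped = cases.loc[cases["Continent"].isna(), "Country"].tolist()
print(cases)
print("Unmapped countries:", unmapped)
```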

Stacking and merging data sources.

In [47]:
confirmed_df_melt = pd.melt(confirmed_df, 
                            id_vars=["Country", "Continent"],
                            value_vars=list(confirmed_df.columns[2:]),
                            var_name="Timestamps", 
                            value_name="Confirmed Cases")

In [48]:
deaths_df_melt = pd.melt(deaths_df, 
                         id_vars=["Country", "Continent"],
                         value_vars=list(deaths_df.columns[2:]),
                         var_name="Timestamps", 
                         value_name="Death Cases")

In [49]:
recovered_df_melt = pd.melt(recovered_df, 
                            id_vars=["Country", "Continent"], 
                            value_vars=list(recovered_df.columns[2:]),
                            var_name="Timestamps", 
                            value_name="Recovered Cases")

In [50]:
stringency_df_melt = pd.melt(stringency_df, 
                             id_vars=["Country", "Continent"], 
                             value_vars=list(stringency_df.columns[2:]),
                             var_name="Timestamps", 
                             value_name="SI Index")

Building updated merged dataframe.

In [51]:
merged_df = pd.concat([confirmed_df_melt, deaths_df_melt, 
                       recovered_df_melt, stringency_df_melt], 
                      axis=1, join='inner')

# drop column duplicates
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()]

print("Number of rows     : ", merged_df.shape[0])
print("Number of columns  : ", merged_df.shape[1])

Number of rows     :  165792
Number of columns  :  7


In [52]:
merged_df.tail(20)

Unnamed: 0,Country,Continent,Timestamps,Confirmed Cases,Death Cases,Recovered Cases,SI Index
165772,Thailand,Asia,8/20/22,4634180,32027,0,29.63
165773,Timor-Leste,Asia,8/20/22,23114,135,0,
165774,Togo,Africa,8/20/22,38355,282,0,35.61
165775,Tonga,Oceania,8/20/22,14135,12,0,48.61
165776,Trinidad and Tobago,Central America,8/20/22,176821,4095,0,11.11
165777,Tunisia,Africa,8/20/22,1141773,29209,0,
165778,Turkey,Asia,8/20/22,16671848,100400,0,13.89
165779,Uganda,Africa,8/20/22,169396,3628,0,13.89
165780,Ukraine,Europe,8/20/22,5313322,116551,0,
165781,United Arab Emirates,Asia,8/20/22,1009116,2341,0,


## 4 Merging Data Test  <a class="anchor" id="test-data"></a>

After matching timestamps and country names, a merged dataset including cumulative and stringency index data is obtained. The next step is to subset the data as an example, to check that it has been properly preprocessed and merged.

#### Example 1: Get data for Oceania on two specific dates.

In [70]:
merged_df.loc[(merged_df["Continent"]=="Oceania") &
              (merged_df["Timestamps"].isin(["5/10/21", "10/10/21"]))]

Unnamed: 0,Country,Continent,Timestamps,Confirmed Cases,Death Cases,Recovered Cases,SI Index
83430,Australia,Oceania,5/10/21,29938,910,23460,48.096667
83479,Ethiopia,Oceania,5/10/21,263120,3897,211493,60.19
83508,Kenya,Oceania,5/10/21,163620,2907,112298,22.22
83538,Netherlands,Oceania,5/10/21,1598260,17602,26310,22.22
83546,Panama,Oceania,5/10/21,367908,6277,357353,50.0
83565,Slovenia,Oceania,5/10/21,246231,4299,232983,25.0
83583,Tonga,Oceania,5/10/21,0,0,0,47.22
83594,Vanuatu,Oceania,5/10/21,4,1,3,22.22
110358,Australia,Oceania,10/10/21,129567,1448,0,64.713333
110407,Ethiopia,Oceania,10/10/21,354476,5990,0,88.89


In [79]:
# Count the rows where the country name matches by position in both datasets;
# a low count means the row order differs between the two datasets
(confirmed_df["Country"] == stringency_df["Country"]).sum()

51

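#### NOTE: This query exposes an error in the merge: countries such as Ethiopia, Kenya, the Netherlands, Panama and Slovenia are listed under Oceania, and only 51 country names match by position between the two datasets. It seems that countries are not entirely ordered alphabetically in some data sources, like the cumulative data. The Python function "countries_list_formatting" has to be fixed by adding a line that sorts the values by country name.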
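
The fix described in the note can be sketched on toy data as follows. This is only an illustration: the helper name `sort_by_country` is hypothetical and stands in for the extra sorting line in "countries_list_formatting", whose actual body is not shown in this Notebook.

```python
import pandas as pd


def sort_by_country(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: order rows by country name and rebuild the index."""
    return df.sort_values("Country").reset_index(drop=True)


# Toy frames holding the same countries in a different row order.
cases = pd.DataFrame({"Country": ["Fiji", "Australia", "Tonga"],
                      "Confirmed Cases": [140, 29938, 0]})
stringency = pd.DataFrame({"Country": ["Australia", "Fiji", "Tonga"],
                           "SI Index": [44.91, 60.19, 47.22]})

# Before sorting, only some rows line up by position.
matches_before = int((cases["Country"] == stringency["Country"]).sum())

cases_sorted = sort_by_country(cases)
stringency_sorted = sort_by_country(stringency)

# After sorting, every row lines up, so a positional concat pairs the right rows.
matches_after = int((cases_sorted["Country"] == stringency_sorted["Country"]).sum())
print(matches_before, matches_after)
```

Resetting the index after sorting matters here, because `pd.concat` with `axis=1` aligns on the index and would otherwise undo the new row order.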

#### Fixing Example 1: Get data for Oceania on two specific dates.

In [53]:
(confirmed_df["Country"] == stringency_df["Country"]).sum()

176

In [54]:
merged_df.loc[(merged_df["Continent"]=="Oceania") &
              (merged_df["Timestamps"].isin(["5/10/21", "10/10/21"]))]

Unnamed: 0,Country,Continent,Timestamps,Confirmed Cases,Death Cases,Recovered Cases,SI Index
83430,Australia,Oceania,5/10/21,29938,910,121,44.91
83479,Fiji,Oceania,5/10/21,140,3,101,60.19
83508,Kiribati,Oceania,5/10/21,0,0,0,22.22
83538,New Zealand,Oceania,5/10/21,2644,28,2591,22.22
83546,Papua New Guinea,Oceania,5/10/21,12086,121,10599,50.0
83565,Solomon Islands,Oceania,5/10/21,20,0,20,25.0
83583,Tonga,Oceania,5/10/21,0,0,0,47.22
83594,Vanuatu,Oceania,5/10/21,4,1,3,22.22
110358,Australia,Oceania,10/10/21,129567,1448,0,74.54
110407,Fiji,Oceania,10/10/21,51499,653,0,88.89
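
As an alternative to relying on a shared row order, the long-format dataframes could be joined on their key columns. The sketch below uses toy data and is not the approach taken in this Notebook; the names `confirmed_long`, `stringency_long` and `merged_on_keys` are assumptions made for illustration.

```python
import pandas as pd

# Toy long-format frames whose rows are deliberately in a different order.
confirmed_long = pd.DataFrame({
    "Country": ["Fiji", "Australia"],
    "Timestamps": ["5/10/21", "5/10/21"],
    "Confirmed Cases": [140, 29938],
})
stringency_long = pd.DataFrame({
    "Country": ["Australia", "Fiji"],
    "Timestamps": ["5/10/21", "5/10/21"],
    "SI Index": [44.91, 60.19],
})

# Join on the keys instead of the row position; validate raises an error
# if a (Country, Timestamps) pair appears more than once on either side.
merged_on_keys = confirmed_long.merge(
    stringency_long,
    on=["Country", "Timestamps"],
    how="inner",
    validate="one_to_one",
)
print(merged_on_keys)
```

With a key-based join, a difference in row order between the sources cannot pair one country's cases with another country's Stringency Index.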


#### Example 2: Get data for Italy in December 2021.

In [74]:
start_date = "12/1/21"
end_date   = "12/31/21"

In [76]:
merged_df.loc[(merged_df["Country"]=="Italy") &
              ((merged_df["Timestamps"] >= start_date) &
               (merged_df["Timestamps"] <= end_date))][:5]

Unnamed: 0,Country,Continent,Timestamps,Confirmed Cases,Death Cases,Recovered Cases,SI Index
55518,Italy,Europe,12/2/20,1641610,57045,823335,82.41
55694,Italy,Europe,12/3/20,1664829,58038,846809,82.41
56926,Italy,Europe,12/10/20,1787147,62626,1027994,80.56
57102,Italy,Europe,12/11/20,1805873,63387,1052163,80.56
57278,Italy,Europe,12/12/20,1825775,64036,1076891,80.56


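#### NOTE: It is clear that the timestamp handling has a problem: the query returns rows from December 2020 instead of December 2021. Because the timestamps are of string datatype, they are compared character by character and not chronologically, so selecting data between two dates is unreliable. Using a datetime format for the timestamps makes this kind of query work.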
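
Why the string comparison selects the wrong rows can be checked on a single value. The sketch below takes one timestamp from the output above ("12/2/20") and compares it against the query bounds, first as strings and then as datetimes; the variable names are chosen for illustration only.

```python
import pandas as pd

start_date, end_date = "12/1/21", "12/31/21"
sample = "12/2/20"  # a December 2020 date that should NOT match December 2021

# Strings are compared character by character: "12/2..." sorts between
# "12/1..." and "12/3...", so the year at the end is never reached.
matches_as_string = start_date <= sample <= end_date

# Parsed as datetimes, the comparison is chronological.
fmt = "%m/%d/%y"
matches_as_date = (pd.to_datetime(start_date, format=fmt)
                   <= pd.to_datetime(sample, format=fmt)
                   <= pd.to_datetime(end_date, format=fmt))

print(matches_as_string, matches_as_date)
```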

#### Fixing Example 2: Get data for Italy in December 2021.

In [78]:
# parse the month/day/year strings (e.g. "8/20/22") into datetime values
merged_df["Timestamps"] = pd.to_datetime(merged_df["Timestamps"], format="%m/%d/%y")

In [79]:
merged_df["Timestamps"][:5]

0   2020-01-22
1   2020-01-22
2   2020-01-22
3   2020-01-22
4   2020-01-22
Name: Timestamps, dtype: datetime64[ns]

In [69]:
start_date = "2021-12-01"
end_date   = "2021-12-31"

In [80]:
merged_df.loc[(merged_df["Country"]=="Italy") &
              ((merged_df["Timestamps"] >= start_date) &
               (merged_df["Timestamps"] <= end_date))]

Unnamed: 0,Country,Continent,Timestamps,Confirmed Cases,Death Cases,Recovered Cases,SI Index
119582,Italy,Europe,2021-12-01,5043620,133931,0,49.71
119758,Italy,Europe,2021-12-02,5060430,134003,0,49.69
119934,Italy,Europe,2021-12-03,5077445,134077,0,49.68
120110,Italy,Europe,2021-12-04,5094072,134152,0,49.67
120286,Italy,Europe,2021-12-05,5109082,134195,0,49.67
120462,Italy,Europe,2021-12-06,5118576,134287,0,49.66
120638,Italy,Europe,2021-12-07,5134318,134386,0,49.64
120814,Italy,Europe,2021-12-08,5152264,134472,0,49.64
120990,Italy,Europe,2021-12-09,5164780,134551,0,49.62
121166,Italy,Europe,2021-12-10,5185270,134669,0,49.61


In [81]:
utils.initial_dataframe_check(merged_df)

Unnamed: 0,Values
# Rows,165792.0
# Columns,7.0
# Rows with NAs,3714.0
# Columns with NAs,1.0
% Null Values in Dataframe,0.32


With the timestamps stored as datetime values, the query now returns the data for the whole of December 2021 for Italy.
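
Once the column is of datetime type, the same kind of date-range selection can also be written with `Series.between`, which includes both bounds by default. This is a minimal sketch on toy data; the frame `toy` and the names `in_december_2021` and `italy_dec` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with timestamps already parsed as datetimes.
toy = pd.DataFrame({
    "Country": ["Italy", "Italy", "Italy", "Fiji"],
    "Timestamps": pd.to_datetime(
        ["12/2/20", "12/1/21", "12/31/21", "12/15/21"], format="%m/%d/%y"
    ),
})

# Boolean mask for dates inside December 2021 (both bounds included).
in_december_2021 = toy["Timestamps"].between("2021-12-01", "2021-12-31")

# Combine the date mask with the country filter.
italy_dec = toy.loc[(toy["Country"] == "Italy") & in_december_2021]
print(italy_dec)
```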