In [1]:
# DEVELOPMENT OF DATA PRODUCTS - 18697
# US2 - DATA CLEANING

This Notebook is part of the Development of Data Products project. Its functional objective is to provide the end user with data analysis and visualizations that compare daily and cumulative recorded cases for confirmed, deceased, and recovered patients. In addition, the Stringency Index is included to compare how different governments have reacted to the pandemic in terms of restrictions and regulations.

# Index:

1. [Imported Libraries and Scripts](#import-libraries-scripts)
2. [Reading Data Sources](#read-data)
    1. [Daily Cases Data](#daily-data)
    2. [Cumulative Cases Data](#cumulative-data)
    3. [Government Response Data](#si-data)

## 1 Imported Libraries and Scripts <a class="anchor" id="import-libraries-scripts"></a>

Some of the code functionality is implemented in dedicated Python functions stored in an external file, which is imported into this Notebook.

In [2]:
# Libraries
import os

import pandas as pd
import numpy as np

import time
import random

In [3]:
# Scripts
from scripts import utils

## 2 Reading Data Sources <a class="anchor" id="read-data"></a>

The collected data sources are under "DDP-unibz-project-18697/ProjectDataSources" inside the following directories:
    
    - csse_covid_19_data/csse_covid_19_daily_reports/ --> Daily data
    - csse_covid_19_data/csse_covid_19_time_series/ --> Cumulative data, recovered, deaths and confirmed cases
    - covid-policy-tracker/timeseries/ --> Stringency Index (Government response Indicator)
    
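It is important to mention that the three sources encode dates differently: daily report files are named in the format **MM-DD-YYYY** (e.g. `03-10-2022.csv`), the cumulative time series use **M/D/YY** column names (e.g. `1/22/20`), and the government response table uses **DDMonYYYY** column names (e.g. `01Jan2020`).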
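
As a minimal sketch of how these date labels could be normalized later on, the snippet below parses one sample label of each kind with `pandas.to_datetime` and an explicit `format` string. The sample values are copied from the file and column names shown in this Notebook; the variable names are illustrative and not part of `scripts/utils`.

```python
import pandas as pd

# Sample labels taken from the data sources used in this Notebook
daily_file_label = "03-10-2022"   # daily report file name (without .csv)
cumulative_label = "8/20/22"      # column name in the CSSE time series
stringency_label = "26Aug2022"    # column name in the stringency index file

# An explicit format avoids any day/month ambiguity while parsing
daily_date = pd.to_datetime(daily_file_label, format="%m-%d-%Y")
cumulative_date = pd.to_datetime(cumulative_label, format="%m/%d/%y")
stringency_date = pd.to_datetime(stringency_label, format="%d%b%Y")

print(daily_date.date(), cumulative_date.date(), stringency_date.date())
```

Note that `%b` matches abbreviated month names according to the active locale, so this assumes an English locale.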

In addition to reading the CSV files, an initial data check is performed to determine:

    - Whether columns have the proper data types for further data manipulation
    - How many rows and columns contain null or unavailable data
    - Which percentage of the total data is missing or unknown
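
The checks listed above could be computed along the following lines. This is only a sketch: `summarize_missing` is a hypothetical helper written for illustration and is not the actual `utils.initial_dataframe_check`, whose implementation lives in the external script.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical summary of shape and missing values (illustration only)."""
    null_mask = df.isna()
    total_cells = df.shape[0] * df.shape[1]
    pct_null = 100 * null_mask.to_numpy().sum() / total_cells if total_cells else 0.0
    summary = {
        "# Rows": df.shape[0],
        "# Columns": df.shape[1],
        "# Rows with NAs": int(null_mask.any(axis=1).sum()),
        "# Columns with NAs": int(null_mask.any(axis=0).sum()),
        "% Null Values in Dataframe": round(pct_null, 3),
    }
    return pd.DataFrame({"Values": pd.Series(summary, dtype=float)})


# Toy data, not taken from the project files
toy_df = pd.DataFrame({
    "Country_Region": ["A", "B", "C"],
    "Confirmed": [10, 20, np.nan],
    "Recovered": [np.nan, np.nan, np.nan],
})
toy_summary = summarize_missing(toy_df)
print(toy_summary)
print(toy_df.dtypes)  # data types are inspected separately
```

A column that is entirely empty, like `Recovered` in the toy frame, makes every row count as a row with NAs, which is worth keeping in mind when reading the summaries further down.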
    
Moreover, the imported data sources contain columns that are not relevant to the functional objective of the project; these are removed using a simple Python function.
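
A column filter of this kind might look like the sketch below. The function name `drop_irrelevant_columns` and the `COLUMNS_TO_DROP` mapping are assumptions made for illustration; the real `utils.drop_columns` is defined in the external script and may differ. The column lists are inferred from the column names and counts shown in the outputs of this Notebook.

```python
import pandas as pd

# Assumed mapping (illustration only), inferred from the outputs in this Notebook
COLUMNS_TO_DROP = {
    "daily": ["FIPS", "Admin2", "Province_State", "Last_Update", "Lat", "Long_",
              "Active", "Combined_Key", "Incident_Rate", "Case_Fatality_Ratio"],
    "cumulative": ["Province/State", "Lat", "Long"],
}


def drop_irrelevant_columns(df: pd.DataFrame, data_source: str) -> pd.DataFrame:
    """Hypothetical helper: drop the columns configured for a data source."""
    to_drop = [c for c in COLUMNS_TO_DROP[data_source] if c in df.columns]
    print(f"Removed {len(to_drop)} columns from dataframe")
    return df.drop(columns=to_drop)


# Toy frame with the cumulative layout (values are made up)
toy_cumulative = pd.DataFrame({
    "Province/State": [None], "Country/Region": ["A"],
    "Lat": [0.0], "Long": [0.0], "1/22/20": [0],
})
trimmed = drop_irrelevant_columns(toy_cumulative, data_source="cumulative")
print(list(trimmed.columns))
```

Keeping the column lists in one mapping keyed by data source means a single function can serve all tables, which matches how `data_source` is passed in the cells of this Notebook.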

### A) Daily Cases Data <a class="anchor" id="daily-data"></a>

Reading the data for a particular day from a CSV file.

In [4]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_daily_reports/03-10-2022.csv"

daily_df = utils.read_data(file_path)

utils.initial_dataframe_check(daily_df)

Unnamed: 0,Values
# Rows,4012.0
# Columns,14.0
# Rows with NAs,4012.0
# Columns with NAs,9.0
% Null Values in Dataframe,17.8


In [5]:
daily_df.columns

# Admin2 -- USA County Name
# Province_State -- region of the selected country

# What is Incident_Rate?       - cases per 100,000 persons
# What is Case_Fatality_Ratio? - number recorded deaths / number cases

Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incident_Rate', 'Case_Fatality_Ratio'],
      dtype='object')

Deleting irrelevant columns from the data and checking the shape and null-value information.

In [6]:
daily_df = utils.drop_columns(daily_df, data_source="daily")

utils.initial_dataframe_check(daily_df)

Removed 10 columns from dataframe


Unnamed: 0,Values
# Rows,4012.0
# Columns,4.0
# Rows with NAs,4010.0
# Columns with NAs,1.0
% Null Values in Dataframe,24.988


In [7]:
daily_df.head()

Unnamed: 0,Country_Region,Confirmed,Deaths,Recovered
0,Afghanistan,175893,7639,
1,Albania,272479,3484,
2,Algeria,265366,6861,
3,Andorra,38794,152,
4,Angola,98855,1900,


In [8]:
number_unique_countries = len(np.unique(daily_df["Country_Region"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  199


The country names will be reviewed in later stages.

### B) Cumulative Cases Data <a class="anchor" id="cumulative-data"></a>

Cumulative data is composed of three different time series:

    - global confirmed cases
    - global death cases
    - global recovered cases

The same steps used for the daily data are applied to each of them.

#### Confirmed Cases

In [9]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

confirmed_df = utils.read_data(file_path)

utils.initial_dataframe_check(confirmed_df)

Unnamed: 0,Values
# Rows,285.0
# Columns,946.0
# Rows with NAs,198.0
# Columns with NAs,3.0
% Null Values in Dataframe,0.074


In [11]:
confirmed_df.columns

# Lat -- latitude of the specified country/region
# Long -- longitude of the specified country/region

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '8/11/22', '8/12/22', '8/13/22', '8/14/22', '8/15/22', '8/16/22',
       '8/17/22', '8/18/22', '8/19/22', '8/20/22'],
      dtype='object', length=946)

In [12]:
confirmed_df = utils.drop_columns(confirmed_df, data_source="cumulative")

utils.initial_dataframe_check(confirmed_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,285.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [13]:
confirmed_df.head()

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,188506,188704,188820,189045,189343,189477,189710,190010,190254,190435
1,Albania,0,0,0,0,0,0,0,0,0,...,320086,320781,321345,321804,322125,322837,323282,323829,325241,325736
2,Algeria,0,0,0,0,0,0,0,0,0,...,268718,268866,269008,269141,269269,269381,269473,269556,269650,269731
3,Andorra,0,0,0,0,0,0,0,0,0,...,45899,45899,45899,45899,45899,45899,45975,45975,45975,45975
4,Angola,0,0,0,0,0,0,0,0,0,...,102636,102636,102636,102636,102636,102636,102636,102636,102636,102636


#### Death Cases

In [14]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

deaths_df = utils.read_data(file_path)

utils.initial_dataframe_check(deaths_df)

Unnamed: 0,Values
# Rows,285.0
# Columns,946.0
# Rows with NAs,198.0
# Columns with NAs,3.0
% Null Values in Dataframe,0.074


In [15]:
deaths_df = utils.drop_columns(deaths_df, data_source="cumulative")

utils.initial_dataframe_check(deaths_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,285.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [16]:
deaths_df.head()

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,7755,7755,7758,7758,7759,7759,7759,7759,7759,7759
1,Albania,0,0,0,0,0,0,0,0,0,...,3568,3569,3570,3571,3571,3573,3574,3574,3575,3576
2,Algeria,0,0,0,0,0,0,0,0,0,...,6878,6878,6878,6878,6878,6878,6878,6878,6878,6878
3,Andorra,0,0,0,0,0,0,0,0,0,...,154,154,154,154,154,154,154,154,154,154
4,Angola,0,0,0,0,0,0,0,0,0,...,1917,1917,1917,1917,1917,1917,1917,1917,1917,1917


#### Recovered Cases

In [17]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

recovered_df = utils.read_data(file_path)

utils.initial_dataframe_check(recovered_df)

Unnamed: 0,Values
# Rows,270.0
# Columns,946.0
# Rows with NAs,198.0
# Columns with NAs,3.0
% Null Values in Dataframe,0.078


In [18]:
recovered_df = utils.drop_columns(recovered_df, data_source="cumulative")

utils.initial_dataframe_check(recovered_df)

Removed 3 columns from dataframe


Unnamed: 0,Values
# Rows,270.0
# Columns,943.0
# Rows with NAs,0.0
# Columns with NAs,0.0
% Null Values in Dataframe,0.0


In [19]:
recovered_df.head()

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Albania,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Algeria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Andorra,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Angola,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


There are fewer observations in the recovered cases dataset than in the deaths or confirmed cases datasets. Does this have to do with the number of listed countries?

In [20]:
# Country Names
print("How many countries in the confirmed cases list ? ", len(np.unique(confirmed_df["Country/Region"])))
print("How many countries in the deaths cases list ? ", len(np.unique(deaths_df["Country/Region"])))
print("How many countries in the recovered cases list ? ", len(np.unique(recovered_df["Country/Region"])))

How many countries in the confirmed cases list ?  199
How many countries in the deaths cases list ?  199
How many countries in the recovered cases list ?  199


The difference is not explained by the number of countries: each of the three data sources lists 199 unique countries.
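
One way to investigate where the row difference comes from is to count the rows per country in each table and compare the counts. The sketch below uses made-up toy frames rather than the project data, and the helper name `rows_per_country_diff` is hypothetical.

```python
import pandas as pd


def rows_per_country_diff(left: pd.DataFrame, right: pd.DataFrame,
                          column: str = "Country/Region") -> pd.Series:
    """Hypothetical helper: countries whose row counts differ between two tables."""
    left_counts = left[column].value_counts()
    right_counts = right[column].value_counts()
    # Countries missing from one table are treated as having zero rows there
    diff = left_counts.sub(right_counts, fill_value=0)
    return diff[diff != 0].astype(int).sort_index()


# Toy data: country "B" is split into two rows in one table only
toy_confirmed = pd.DataFrame({"Country/Region": ["A", "B", "B", "C"]})
toy_recovered = pd.DataFrame({"Country/Region": ["A", "B", "C"]})

row_diff = rows_per_country_diff(toy_confirmed, toy_recovered)
print(row_diff)
```

A non-empty result would mean that the same country is listed with a different number of rows in the two tables, for example because of a different level of regional detail.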

### C) Government Response Data <a class="anchor" id="si-data"></a>

In [21]:
file_path = "../ProjectDataSources/covid-policy-tracker/" + \
            "timeseries/stringency_index_avg.csv"

stringency_df = utils.read_data(file_path)

utils.initial_dataframe_check(stringency_df)

Unnamed: 0,Values
# Rows,263.0
# Columns,974.0
# Rows with NAs,263.0
# Columns with NAs,971.0
% Null Values in Dataframe,8.067


In [22]:
stringency_df.columns

# Jurisdiction and code columns have to be deleted from the dataframe

Index(['country_code', 'country_name', 'region_code', 'region_name',
       'jurisdiction', '01Jan2020', '02Jan2020', '03Jan2020', '04Jan2020',
       '05Jan2020',
       ...
       '17Aug2022', '18Aug2022', '19Aug2022', '20Aug2022', '21Aug2022',
       '22Aug2022', '23Aug2022', '24Aug2022', '25Aug2022', '26Aug2022'],
      dtype='object', length=974)

In [23]:
stringency_df = utils.drop_columns(stringency_df, data_source="stringency_index")

utils.initial_dataframe_check(stringency_df)

Removed 4 columns from dataframe


Unnamed: 0,Values
# Rows,263.0
# Columns,970.0
# Rows with NAs,263.0
# Columns with NAs,969.0
% Null Values in Dataframe,7.953


In [24]:
number_unique_countries = len(np.unique(stringency_df["country_name"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  187


The country list of the stringency index dataset (187 countries) is smaller than the one from the CSSE data source used for daily and cumulative cases (199 countries). The country lists will be intersected in later steps for data consistency.
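
The intersection mentioned above could be sketched as follows. The country names below are toy values and `common_countries` is a hypothetical helper; different spellings of the same country across sources would still need to be reviewed before intersecting.

```python
import pandas as pd


def common_countries(*name_columns: pd.Series) -> list:
    """Hypothetical helper: sorted country names present in every given column."""
    name_sets = [set(col.dropna().unique()) for col in name_columns]
    return sorted(set.intersection(*name_sets))


# Toy data, not the real country lists
toy_csse = pd.DataFrame({"Country/Region": ["A", "B", "C", "D"]})
toy_stringency = pd.DataFrame({"country_name": ["B", "C", "E"]})

shared = common_countries(toy_csse["Country/Region"], toy_stringency["country_name"])
print(shared)

# Keep only the rows of each table whose country is in the shared list
toy_csse_common = toy_csse[toy_csse["Country/Region"].isin(shared)]
toy_stringency_common = toy_stringency[toy_stringency["country_name"].isin(shared)]
```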