In [1]:
# DEVELOPMENT OF DATA PRODUCTS - 18697
# US1 - DATA COLLECTION

This notebook is part of the Development of Data Products project. Its functional objective is to give the end user data analysis and visualizations that compare daily and cumulative recorded cases for confirmed, deceased, and recovered patients. The Stringency Index is also included, to compare how different governments have reacted to the pandemic in terms of restrictions and regulations.

# Index:

1. [Imported Libraries](#import-libraries)
2. [Reading Data Sources](#read-data)
    1. [Daily Cases Data](#daily-data)
    2. [Cumulative Cases Data](#cumulative-data)
    3. [Government Response Data](#si-data)

## 1 Imported Libraries <a class="anchor" id="import-libraries"></a>

In [2]:
# Libraries
import os

import pandas as pd
import numpy as np

import time
import random

## 2 Reading Data Sources <a class="anchor" id="read-data"></a>

The collected data sources are stored under "DDP-unibz-project-18697/ProjectDataSources", in the following directories:
    
    - csse_covid_19_data/csse_covid_19_daily_reports/ --> Daily data
    - csse_covid_19_data/csse_covid_19_time_series/ --> Cumulative data: recovered, deaths and confirmed cases
    - covid-policy-tracker/timeseries/ --> Stringency Index (government response indicator)
    
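Note that the three sources encode dates differently: daily report file names use **MM-DD-YYYY** (e.g. `03-10-2022.csv`), the column headers of the cumulative time series use **M/D/YY** (e.g. `1/22/20`), and the column headers of the government response data use **DDMonYYYY** (e.g. `01Jan2020`).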
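Judging from the file name and column headers that appear later in this notebook, the three date encodings could be brought to a common type with explicit format strings. This is a minimal sketch, not part of the original pipeline. The variable names are illustrative, and the `%b` directive assumes an English locale for month abbreviations.

```python
import pandas as pd

# Example strings copied from the three sources shown in this notebook.
daily_file_date = "03-10-2022"   # daily report file name (without ".csv")
cumulative_column = "1/22/20"    # column header in the CSSE time series
stringency_column = "01Jan2020"  # column header in the stringency index file

# Explicit format strings avoid ambiguity between day-first and month-first.
daily_ts = pd.to_datetime(daily_file_date, format="%m-%d-%Y")
cumulative_ts = pd.to_datetime(cumulative_column, format="%m/%d/%y")
stringency_ts = pd.to_datetime(stringency_column, format="%d%b%Y")

print(daily_ts.date(), cumulative_ts.date(), stringency_ts.date())
```

Once all three are `Timestamp` objects, they can be compared or used as join keys regardless of how each source spelled the date.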

### A) Daily Cases Data <a class="anchor" id="daily-data"></a>

Read the data for one particular day from its CSV file.

In [9]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_daily_reports/03-10-2022.csv"

daily_df = pd.read_csv(file_path)

print("Number of rows    : ", daily_df.shape[0])
print("Number of columns : ", daily_df.shape[1])

Number of rows    :  4012
Number of columns :  14


In [10]:
daily_df.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2022-03-11 04:20:50,33.93911,67.709953,175893,7639,,,Afghanistan,451.837904,4.342981
1,,,,Albania,2022-03-11 04:20:50,41.1533,20.1683,272479,3484,,,Albania,9468.309125,1.278631
2,,,,Algeria,2022-03-11 04:20:50,28.0339,1.6596,265366,6861,,,Algeria,605.153223,2.585486
3,,,,Andorra,2022-03-11 04:20:50,42.5063,1.5218,38794,152,,,Andorra,50209.020902,0.391813
4,,,,Angola,2022-03-11 04:20:50,-11.2027,17.8739,98855,1900,,,Angola,300.77951,1.922007


In [11]:
daily_df.columns

# Admin2 -- USA county name
# Province_State -- region of the selected country

# What is Incident_Rate?       - cases per 100,000 persons
# What is Case_Fatality_Ratio? - number of recorded deaths / number of cases (as a percentage)

Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incident_Rate', 'Case_Fatality_Ratio'],
      dtype='object')
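The Afghanistan row printed by `daily_df.head()` allows a quick sanity check of what `Case_Fatality_Ratio` represents. The sketch below copies the three values by hand from that row and recomputes the ratio as deaths divided by confirmed cases, expressed as a percentage. The variable names are illustrative.

```python
# Values copied from the Afghanistan row of daily_df.head() above.
confirmed = 175893
deaths = 7639
reported_cfr = 4.342981

# Case fatality ratio as a percentage: deaths divided by confirmed cases.
computed_cfr = deaths / confirmed * 100

print(round(computed_cfr, 6), reported_cfr)
```

The recomputed value agrees with the reported column to within rounding, which suggests that `Case_Fatality_Ratio` is the deaths-to-cases percentage and `Incident_Rate` is the population-normalized case count.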

Some of these columns will probably need to be dropped; that decision is deferred until the data is about to be merged.

In [12]:
number_unique_countries = len(np.unique(daily_df["Country_Region"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  199
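As an alternative to `np.unique`, the same count can be obtained with pandas' own `Series.nunique()`, which skips missing values by default. The toy frame below is a made-up stand-in for `daily_df`, so the sketch runs without the project data.

```python
import pandas as pd

# Made-up stand-in for daily_df: three rows, two distinct countries, one gap.
daily_demo = pd.DataFrame(
    {"Country_Region": ["Albania", "Albania", "Andorra", None]}
)

# nunique() ignores missing values unless dropna=False is passed.
number_unique_demo = daily_demo["Country_Region"].nunique()
print("How many countries in the list ? ", number_unique_demo)
```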


The country names will also be reviewed in later stages.

### B) Cumulative Cases Data <a class="anchor" id="cumulative-data"></a>

Cumulative data is composed of three different time series:

    - global confirmed cases
    - global deaths
    - global recovered cases

The same commands used for the daily data are applied to each of them.
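Since the read-and-print-shape steps are repeated for every file, they could be wrapped in a small helper. The function name `read_and_describe` is hypothetical and not part of the original notebook. The temporary CSV is a tiny stand-in so the sketch runs without the project data.

```python
import os
import tempfile

import pandas as pd


def read_and_describe(csv_path):
    """Read a CSV file, print its shape, and return the DataFrame."""
    df = pd.read_csv(csv_path)
    print("Number of rows    : ", df.shape[0])
    print("Number of columns : ", df.shape[1])
    return df


# Tiny stand-in file with the same layout idea as the CSSE time series.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_time_series.csv")
    with open(demo_path, "w") as handle:
        handle.write("Country/Region,1/22/20,1/23/20\nAndorra,0,0\nAngola,0,0\n")
    demo_df = read_and_describe(demo_path)
```

With the real files, the same helper would be called once per time series path instead of repeating the three statements in each cell.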

In [13]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

confirmed_df = pd.read_csv(file_path)

print("Number of rows    : ", confirmed_df.shape[0])
print("Number of columns : ", confirmed_df.shape[1])

Number of rows    :  285
Number of columns :  946


In [14]:
confirmed_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,188506,188704,188820,189045,189343,189477,189710,190010,190254,190435
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,320086,320781,321345,321804,322125,322837,323282,323829,325241,325736
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,268718,268866,269008,269141,269269,269381,269473,269556,269650,269731
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,45899,45899,45899,45899,45899,45899,45975,45975,45975,45975
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,102636,102636,102636,102636,102636,102636,102636,102636,102636,102636


In [16]:
confirmed_df.columns

# Lat -- latitude of the specified country/region
# Long -- longitude of the specified country/region

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '8/11/22', '8/12/22', '8/13/22', '8/14/22', '8/15/22', '8/16/22',
       '8/17/22', '8/18/22', '8/19/22', '8/20/22'],
      dtype='object', length=946)

In [17]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

deaths_df = pd.read_csv(file_path)

print("Number of rows    : ", deaths_df.shape[0])
print("Number of columns : ", deaths_df.shape[1])

Number of rows    :  285
Number of columns :  946


In [18]:
deaths_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,7755,7755,7758,7758,7759,7759,7759,7759,7759,7759
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,3568,3569,3570,3571,3571,3573,3574,3574,3575,3576
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,6878,6878,6878,6878,6878,6878,6878,6878,6878,6878
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,154,154,154,154,154,154,154,154,154,154
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1917,1917,1917,1917,1917,1917,1917,1917,1917,1917


In [19]:
file_path = "../ProjectDataSources/csse_covid_19_data/" + \
            "csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

recovered_df = pd.read_csv(file_path)

print("Number of rows    : ", recovered_df.shape[0])
print("Number of columns : ", recovered_df.shape[1])

Number of rows    :  270
Number of columns :  946


In [20]:
recovered_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/11/22,8/12/22,8/13/22,8/14/22,8/15/22,8/16/22,8/17/22,8/18/22,8/19/22,8/20/22
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The recovered cases dataset has fewer rows (270) than the deaths or confirmed cases datasets (285), while the number of columns is the same. Does this have to do with the number of listed countries?

In [21]:
# Country Names
print("How many countries in the confirmed cases list ? ",len(np.unique(confirmed_df["Country/Region"])))
print("How many countries in the deaths cases list ? ",len(np.unique(deaths_df["Country/Region"])))
print("How many countries in the recovered cases list ? ",len(np.unique(recovered_df["Country/Region"])))

How many countries in the confirmed cases list ?  199
How many countries in the deaths cases list ?  199
How many countries in the recovered cases list ?  199


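It has nothing to do with the countries: the same number of countries (199) is listed in all three data sources. The difference in row count must therefore come from how rows are split within countries, presumably through the `Province/State` column.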
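One way to locate where the row counts diverge is to count rows per country in each table and subtract. This is a minimal sketch on made-up toy frames standing in for `confirmed_df` and `recovered_df`. The country and region names are placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for confirmed_df and recovered_df (values are made up).
confirmed_demo = pd.DataFrame({
    "Province/State": ["Region A", "Region B", None],
    "Country/Region": ["Country X", "Country X", "Country Y"],
})
recovered_demo = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Country X", "Country Y"],
})

# Rows per country in each table.
confirmed_rows = confirmed_demo["Country/Region"].value_counts()
recovered_rows = recovered_demo["Country/Region"].value_counts()

# Non-zero values mark countries whose row count differs between the tables.
row_difference = confirmed_rows.sub(recovered_rows, fill_value=0)
print(row_difference[row_difference != 0])
```

Applied to the real tables, the non-zero entries would point to the countries that are broken down by region in one file but not in the other.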

### C) Government Response Data <a class="anchor" id="si-data"></a>

In [22]:
file_path = "../ProjectDataSources/covid-policy-tracker/" + \
            "timeseries/stringency_index_avg.csv"

stringency_df = pd.read_csv(file_path)

print("Number of rows    : ", stringency_df.shape[0])
print("Number of columns : ", stringency_df.shape[1])

Number of rows    :  263
Number of columns :  975


In [23]:
stringency_df.head()

Unnamed: 0.1,Unnamed: 0,country_code,country_name,region_code,region_name,jurisdiction,01Jan2020,02Jan2020,03Jan2020,04Jan2020,...,17Aug2022,18Aug2022,19Aug2022,20Aug2022,21Aug2022,22Aug2022,23Aug2022,24Aug2022,25Aug2022,26Aug2022
0,1,ABW,Aruba,,,NAT_TOTAL,,,,,...,,,,,,,,,,
1,2,AFG,Afghanistan,,,NAT_TOTAL,,,,,...,11.11,11.11,11.11,11.11,11.11,,,,,
2,3,AGO,Angola,,,NAT_TOTAL,,,,,...,,,,,,,,,,
3,4,ALB,Albania,,,NAT_TOTAL,,,,,...,11.11,11.11,11.11,11.11,11.11,11.11,11.11,,,
4,5,AND,Andorra,,,NAT_TOTAL,,,,,...,5.56,5.56,5.56,5.56,5.56,5.56,5.56,,,


The government response CSV file needs to be read again, this time using the first (unnamed) column as the index so that it is not loaded as an unnecessary data column.

In [24]:
stringency_df = pd.read_csv(file_path, index_col=[0])

print("Number of rows    : ", stringency_df.shape[0])
print("Number of columns : ", stringency_df.shape[1])

Number of rows    :  263
Number of columns :  974
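The effect of `index_col` can be reproduced on a small in-memory CSV. The text below is a made-up miniature of the stringency file, with an unnamed first column holding a previously saved row index.

```python
import io

import pandas as pd

# Toy CSV whose first column is an unnamed, previously saved row index.
csv_text = ",country_code,country_name\n1,ABW,Aruba\n2,AFG,Afghanistan\n"

# Default read: the unnamed column is loaded as a regular data column.
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))
print(list(with_index.columns))
```

The second read has one column fewer, matching the drop from 975 to 974 columns seen for the real file.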


In [25]:
stringency_df.head()

Unnamed: 0,country_code,country_name,region_code,region_name,jurisdiction,01Jan2020,02Jan2020,03Jan2020,04Jan2020,05Jan2020,...,17Aug2022,18Aug2022,19Aug2022,20Aug2022,21Aug2022,22Aug2022,23Aug2022,24Aug2022,25Aug2022,26Aug2022
1,ABW,Aruba,,,NAT_TOTAL,,,,,,...,,,,,,,,,,
2,AFG,Afghanistan,,,NAT_TOTAL,,,,,,...,11.11,11.11,11.11,11.11,11.11,,,,,
3,AGO,Angola,,,NAT_TOTAL,,,,,,...,,,,,,,,,,
4,ALB,Albania,,,NAT_TOTAL,,,,,,...,11.11,11.11,11.11,11.11,11.11,11.11,11.11,,,
5,AND,Andorra,,,NAT_TOTAL,,,,,,...,5.56,5.56,5.56,5.56,5.56,5.56,5.56,,,


In [26]:
number_unique_countries = len(np.unique(stringency_df["country_name"]))
print("How many countries in the list ? ", number_unique_countries)

How many countries in the list ?  187


The country list of the stringency index dataset (187 countries) is smaller than that of the CSSE data source used for the daily and cumulative cases (199 countries). The country lists will be intersected in later steps for data consistency.
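The intersection could be sketched with plain Python sets, as shown below. The toy frames are made-up stand-ins for the CSSE and stringency tables, and "Country Z" is a placeholder. The later notebooks may handle this differently, for instance after harmonizing country spellings.

```python
import pandas as pd

# Made-up stand-ins for the CSSE and stringency index tables.
csse_demo = pd.DataFrame({"Country/Region": ["Albania", "Andorra", "Country Z"]})
stringency_demo = pd.DataFrame({"country_name": ["Albania", "Andorra", "Aruba"]})

csse_countries = set(csse_demo["Country/Region"].dropna())
stringency_countries = set(stringency_demo["country_name"].dropna())

# Countries present in both sources, and the leftovers on each side.
common_countries = sorted(csse_countries & stringency_countries)
only_in_csse = sorted(csse_countries - stringency_countries)
only_in_stringency = sorted(stringency_countries - csse_countries)

print(common_countries, only_in_csse, only_in_stringency)
```

Inspecting the two leftover lists before discarding them helps separate countries that are truly missing from those that are only spelled differently in the two sources.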