# COVID-19 epidemiological data analysis

The purpose of this notebook is to assist in exploratory epidemiological analysis related to the [current COVID-19 outbreak](https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic). Several raw data sources are available:
- [Johns Hopkins University GitHub repository](https://github.com/CSSEGISandData/COVID-19), compiled by the [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/research/public-health/ncov/) from [various primary sources](https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases)
- [World Health Organisation](https://data.humdata.org/dataset/coronavirus-covid-19-cases-and-deaths). Ref.: [WHO website](https://www.who.int/emergencies/diseases/novel-coronavirus-2019)
- [European Centre for Disease Prevention and Control (ECDC)](https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases)

WHO and ECDC numbers are similar, while Johns Hopkins data are higher (approx. 10-20% by country) due to the inclusion of estimated confirmed cases. [Source of datasets benchmark](https://ourworldindata.org/covid-sources-comparison).

[This](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) and other dashboards graph confirmed cases, deaths, and recoveries worldwide. However, more granular, local versions are lacking, although they can be useful for localities trying to assess their current threat levels. For example, the Johns Hopkins dataset provides by-state data for the US.

The statistics below are a quick overview using the Johns Hopkins dataset.

In [0]:
import pandas as pd
import plotly.express as px
import numpy as np
import datetime

## Define the constants below
- Number of top countries to analyze  
- Size of the rolling window (in days) used to estimate the number of active cases

In [0]:
top_n_countries = 30
roll_window = 21

## Download and preprocess data

In [0]:
data_c = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
data_d = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
populations = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv')

In [0]:
populations = populations[populations.Province_State.isnull()].drop(columns=['Province_State', 'Admin2', 'FIPS', 'code3', 'iso3', 'iso2', 'UID', 'Lat', 'Long_', 'Combined_Key'])

In [0]:
data_c_cp = data_c.copy()
data_d_cp = data_d.copy()

In [0]:
data_c_cp = data_c_cp.rename(columns={'Country/Region':'Country', 'Province/State':'Province'})
data_d_cp = data_d_cp.rename(columns={'Country/Region':'Country', 'Province/State':'Province'})

In [0]:
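# Keep only country-level rows (null Province) to avoid duplicated country names,
# assuming provinces/states have relatively low populations.
# Note: countries reported only by province are dropped entirely by this filter.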
data_c_cp = data_c_cp[data_c_cp.Province.isnull()]
data_d_cp = data_d_cp[data_d_cp.Province.isnull()]
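
Filtering on a null `Province` discards every row of a country that is reported only at province level. As a possible alternative, the sketch below sums province rows into one row per country. It runs on an invented toy frame that only mimics the renamed column layout; the country names and numbers are made up for illustration.

```python
import pandas as pd

# Toy wide-format frame mimicking the renamed layout (all values are invented).
toy = pd.DataFrame({
    'Province': [None, 'P1', 'P2', None],
    'Country': ['A', 'B', 'B', 'C'],
    '1/22/20': [1, 2, 3, 0],
    '1/23/20': [4, 5, 6, 1],
})

# Sum all rows belonging to a country, so province-only countries are kept.
by_country = (
    toy.drop(columns=['Province'])
       .groupby('Country', as_index=False)
       .sum()
)
print(by_country)
```

With this approach, a country that has both a national row and separate territory rows would include those territories in its total, which may or may not be desired.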

In [0]:
data_c_cp = data_c_cp.drop(['Lat', 'Long', 'Province'], axis=1)
data_d_cp = data_d_cp.drop(['Lat', 'Long', 'Province'], axis=1)

In [0]:
data_c_cp = pd.melt(data_c_cp, id_vars='Country').sort_values(by=['Country'])
data_d_cp = pd.melt(data_d_cp, id_vars='Country').sort_values(by=['Country'])

In [0]:
data_c_cp = data_c_cp.rename(columns={'variable':'Date', 'value': 'CumCase'})
data_d_cp = data_d_cp.rename(columns={'variable':'Date', 'value': 'CumDeath'})

In [0]:
data_c_cp['Date'] = pd.to_datetime(data_c_cp.Date).dt.date
data_d_cp['Date'] = pd.to_datetime(data_d_cp.Date).dt.date
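
The wide-to-long reshaping performed in the cells above can be followed on a small example. This is a sketch on invented data, assuming date column headers in `M/D/YY` form; it names the melted columns directly through `var_name` and `value_name` instead of renaming them afterwards.

```python
import pandas as pd

# Toy wide table: one row per country, one column per date (values invented).
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    '1/22/20': [1, 0],
    '1/23/20': [3, 2],
})

# Melt date columns into rows and name the resulting columns in one step.
long = pd.melt(wide, id_vars='Country', var_name='Date', value_name='CumCase')

# Parse the header strings into dates; the explicit format is an assumption
# about how the headers are written.
long['Date'] = pd.to_datetime(long['Date'], format='%m/%d/%y').dt.date

long = long.sort_values(['Country', 'Date']).reset_index(drop=True)
print(long)
```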

In [0]:
data_c_cp = data_c_cp.sort_values(by=['Country', 'Date'])
data_d_cp = data_d_cp.sort_values(by=['Country', 'Date'])

In [0]:
datacp = data_c_cp.merge(data_d_cp, how='outer').sort_values(by=['Country', 'Date'])

In [0]:
datacp = datacp.reset_index().drop(['index'], axis=1)

In [0]:
datacp['NewCase'], datacp['CumCaseRoll'], datacp['CumCasePC'], datacp['CumDeathPC'] = np.nan, np.nan, np.nan, np.nan

## Select top countries

### Top countries by number of cumulated cases

In [18]:
# top by cumulated cases
top_cases = datacp.groupby(by='Country').max().sort_values(by=['CumCase'], axis=0, ascending=False).reset_index()[['Country', 'CumCase']][:top_n_countries]
top_cases

Unnamed: 0,Country,CumCase
0,US,1283929
1,Spain,222857
2,Italy,217185
3,United Kingdom,211364
4,Russia,187859
5,France,174318
6,Germany,170588
7,Brazil,146894
8,Turkey,135569
9,Iran,104691


### Top countries by number of deaths

In [19]:
# top by cumulated deaths
top_deaths = datacp.groupby(by='Country').max().sort_values(by=['CumDeath'], axis=0, ascending=False).reset_index()[['Country', 'CumDeath']][:top_n_countries]
top_deaths

Unnamed: 0,Country,CumDeath
0,US,77180
1,United Kingdom,31241
2,Italy,30201
3,Spain,26299
4,France,26192
5,Brazil,10017
6,Belgium,8521
7,Germany,7510
8,Iran,6541
9,Netherlands,5359


### Merge top-case and top-death countries as a union (outer merge): the total number of top countries is equal to or greater than the number in each category separately

In [20]:
top_countries = top_cases[['Country']].merge(top_deaths[['Country']], how='outer')
total_top_countries = top_countries.shape[0]; total_top_countries

37
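
To illustrate why the merged list can be longer than either input, here is a minimal sketch with two made-up country lists. The `indicator=True` option of `merge` adds a `_merge` column showing which list each country came from.

```python
import pandas as pd

# Two invented "top" lists that overlap in B and C.
top_a = pd.DataFrame({'Country': ['A', 'B', 'C']})
top_b = pd.DataFrame({'Country': ['B', 'C', 'D']})

# Outer merge on the shared 'Country' column yields the union of both lists.
union = top_a.merge(top_b, how='outer', indicator=True)
print(union)
print(len(union))  # 4: larger than either list of 3
```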

## Compute cases and deaths per capita

Set the estimated length of the communicability period.
The period of communicability is the time interval during which an infectious agent may be transferred directly or indirectly from an infected person to another person. There are [different reports](https://www.publichealthontario.ca/-/media/documents/ncov/covid-wwksf/what-we-know-communicable-period-mar-27-2020.pdf?la=en) on COVID-19 communicability, but there is no conclusion yet. This period can be roughly estimated in order to compute the current number of active cases.

In [0]:
for country in top_countries.Country.values:
  population = populations.loc[populations['Country_Region'] == country, 'Population'].values[0]
  datacp.loc[datacp.Country == country, 'CumCasePC'] = datacp.loc[datacp.Country == country, 'CumCase'] * 1e5 / population
  datacp.loc[datacp.Country == country, 'CumDeathPC'] = datacp.loc[datacp.Country == country, 'CumDeath'] * 1e5 / population
  datacp.loc[datacp.Country == country, 'NewCase'] = datacp.loc[datacp.Country == country, 'CumCase'].diff()
  datacp.loc[datacp.Country == country, 'CumCaseRoll'] = datacp.loc[datacp.Country == country, 'NewCase'].rolling(roll_window).sum()
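
The same per-country quantities can also be computed without an explicit loop. The following is a sketch on invented data, not a drop-in replacement: the toy frames `df` and `pop`, their values, and the short window are assumptions made for illustration.

```python
import pandas as pd

roll_window = 2  # short window so the toy result is easy to check by hand

# Toy long-format data: cumulated cases for two invented countries.
df = pd.DataFrame({
    'Country': ['A'] * 4 + ['B'] * 4,
    'Date': list(pd.date_range('2020-01-01', periods=4)) * 2,
    'CumCase': [1, 3, 6, 10, 0, 2, 2, 5],
})
# Toy population lookup using the same column names as the lookup table.
pop = pd.DataFrame({'Country_Region': ['A', 'B'], 'Population': [1000, 2000]})

# A left merge attaches the population to every row of its country.
df = df.merge(pop, left_on='Country', right_on='Country_Region', how='left')
df['CumCasePC'] = df['CumCase'] * 1e5 / df['Population']

# Daily new cases and their rolling sum, computed within each country.
df['NewCase'] = df.groupby('Country')['CumCase'].diff()
df['CumCaseRoll'] = (
    df.groupby('Country')['NewCase']
      .transform(lambda s: s.rolling(roll_window).sum())
)
print(df[['Country', 'Date', 'CumCase', 'NewCase', 'CumCaseRoll']])
```

One difference from the loop: a country missing from the population table ends up with NaN per-capita values here, whereas indexing `.values[0]` on an empty selection raises an `IndexError`.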

In [0]:
# data for the top countries (their number is total_top_countries, computed above)
data_top_countries = datacp[datacp.Country.isin(top_countries.Country.values)]

## Plot world cumulated cases

- Group by date and aggregate by the sum of cumulated cases
- Merge with the equivalent aggregation over the top countries
- Melt the two columns into one, since plotly.express expects long-format data

In [0]:
data_sum = datacp.groupby(datacp['Date']).agg({'CumCase':'sum'}).reset_index()
data_sum_top_countries = data_top_countries.groupby(data_top_countries['Date']).agg({'CumCase':'sum'}).reset_index()
data_sum_comp = pd.merge(data_sum, data_sum_top_countries, on='Date').rename(columns={'CumCase_x':'CumCasesGlob', 'CumCase_y':'CumCasesTopCntrs'})
data_sum_comp_melted = data_sum_comp.melt(id_vars='Date', value_vars=['CumCasesGlob', 'CumCasesTopCntrs'])

In [34]:
fig = px.line(data_sum_comp_melted, x='Date', y='value', color='variable', title =f'Global confirmed cases of COVID-19 vs top-{total_top_countries} countries')
fig.show()

In [35]:
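# CumCaseRoll is only filled in for the top countries, so this sum covers those countries only
data_sum_roll = datacp.groupby(datacp['Date']).agg({'CumCaseRoll':'sum'}).reset_index()
fig = px.line(data_sum_roll, x='Date', y='CumCaseRoll', title=f'Estimated active confirmed cases of COVID-19 in top-{total_top_countries} countries over {roll_window} day(s)')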
fig.show()

## Plot cases and deaths per capita by country
The graphs are interactive:
- double-click on a name inside the legend to show only this country  
- single-click on a hidden country to add it to the plot  
- use the slider under the graph to change the time window

In [27]:
fig = px.line(data_top_countries, x="Date", y="CumCasePC", color="Country",
              line_group="Country", hover_name="Country",
              title=f"Cumulated confirmed cases per 100k for top-{total_top_countries} countries (with range slider)")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

In [28]:
fig = px.line(data_top_countries, x="Date", y="CumDeathPC", color="Country",
              line_group="Country", hover_name="Country",
              title=f"Cumulated death cases per 100k for top-{total_top_countries} countries (with range slider)")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

## Plot cases and deaths by country

In [29]:
fig = px.line(data_top_countries, x="Date", y="CumCase", color="Country",
              line_group="Country", hover_name="Country",
              title=f"Cumulated confirmed cases for top-{total_top_countries} countries (with range slider)")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

In [30]:
fig = px.line(data_top_countries, x="Date", y="CumDeath", color="Country",
              line_group="Country", hover_name="Country",
              title=f"Cumulated death cases for top-{total_top_countries} countries (with range slider)")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

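## Plot cases by country cumulated over N days
The window length N is a rough estimate of the average case duration, so the sum of new cases over the last N days approximates the number of currently active cases.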
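
In formula form, with $C_t$ denoting the cumulated confirmed cases on day $t$ and $w$ the window length (`roll_window`), the rolling sum of daily new cases telescopes to a difference of two cumulated counts. Treating $w$ as a fixed case duration is a simplifying assumption, not an established value:

$$\text{CumCaseRoll}_t = \sum_{i=0}^{w-1} \left(C_{t-i} - C_{t-i-1}\right) = C_t - C_{t-w}$$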

In [31]:
fig = px.line(data_top_countries, x="Date", y="CumCaseRoll", color="Country",
              line_group="Country", hover_name="Country",
              title=f"Cases cumulated over a {roll_window}-day period for top-{total_top_countries} countries (with range slider)")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()