# COVID-19 Exploratory Analysis
17 March 2020

The aim of this notebook is to explore the evolution of the novel coronavirus from both a historical and a present-day standpoint. Key objectives:
1. To ingest data from relevant sources
2. To note key events
3. To determine when the growth factor will be 1

**Note:** the data is received in real time.

In [1]:
%system python load_jhp_data.py
%autosave 60

import pandas as pd
import numpy as np
import ipywidgets as widgets
from ipywidgets import AppLayout, HTML, Layout
import matplotlib.pyplot as plt

Autosaving every 60 seconds


In [2]:
# pandas print settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)

# file path references
confirmed_out = "data/df_confirmed.csv"
deaths_out = "data/df_deaths.csv"
recovered_out = "data/df_recovered.csv"

# copy of originals
df_confirmed = pd.read_csv(confirmed_out)
df_deaths = pd.read_csv(deaths_out)
df_recovered = pd.read_csv(recovered_out)

df_confirmed.drop("Unnamed: 0", axis=1, inplace=True)
df_deaths.drop("Unnamed: 0", axis=1, inplace=True)
df_recovered.drop("Unnamed: 0", axis=1, inplace=True)

## Inspect dataframe

In [3]:
df_deaths.head(5)

# Missing values detected
# Data type expectations
#     - Province/State - Categorical
#     - Country/Region - Categorical
#     - Lat/Long - Float
#     - Date - datetime

# data type definitions
geographical_cat_features = ["Province/State", "Country/Region"]
geographical_float_features = ["Lat", "Long"]

# specify date features for each data set
date_features_confirmed = df_confirmed.columns[~df_confirmed.columns.isin(geographical_cat_features + geographical_float_features)]
date_features_deaths = df_deaths.columns[~df_deaths.columns.isin(geographical_cat_features + geographical_float_features)]
date_features_recovered = df_recovered.columns[~df_recovered.columns.isin(geographical_cat_features + geographical_float_features)]
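
The data type expectations list the date fields as datetime, while the column headers arrive as strings such as `1/22/20`. A minimal sketch of one way to parse them, assuming the month/day/two-digit-year format seen in the headers; the `toy` frame and its values are made up for illustration and are not the real data:

```python
import pandas as pd

# Hypothetical frame with the same column layout as the case files
toy = pd.DataFrame({
    "Province/State": [None, "State 1"],
    "Country/Region": ["Country A", "Country B"],
    "Lat": [15.0, 30.0],
    "Long": [101.0, 112.0],
    "1/22/20": [2, 5],
    "1/23/20": [3, 8],
})

geo_cols = ["Province/State", "Country/Region", "Lat", "Long"]
date_cols = toy.columns[~toy.columns.isin(geo_cols)]

# Parse the header strings as month/day/two-digit-year
parsed_dates = pd.to_datetime(date_cols, format="%m/%d/%y")
print(parsed_dates)
```

Keeping the parsed dates in a separate index, rather than renaming the columns, leaves the original string labels usable for column selection.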

In [4]:
df_confirmed.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20,2/26/20,2/27/20,2/28/20,2/29/20,3/1/20,3/2/20,3/3/20,3/4/20,3/5/20,3/6/20,3/7/20,3/8/20,3/9/20,3/10/20,3/11/20,3/12/20,3/13/20,3/14/20,3/15/20,3/16/20,3/17/20,3/18/20,3/19/20,3/20/20,3/21/20,3/22/20
0,,Thailand,15.0,101.0,2,3,5,7,8,8,14,14,14,19,19,19,19,25,25,25,25,32,32,32,33,33,33,33,33,34,35,35,35,35,35,35,35,35,37,40,40,41,42,42,43,43,43,47,48,50,50,50,53,59,70,75,82,114,147,177,212,272,322,411,599
1,,Japan,36.0,138.0,2,1,2,2,4,4,7,7,11,15,20,20,20,22,22,45,25,25,26,26,26,28,28,29,43,59,66,74,84,94,105,122,147,159,170,189,214,228,241,256,274,293,331,360,420,461,502,511,581,639,639,701,773,839,825,878,889,924,963,1007,1086
2,,Singapore,1.2833,103.8333,0,1,3,3,4,5,7,7,10,13,16,18,18,24,28,28,30,33,40,45,47,50,58,67,72,75,77,81,84,84,85,85,89,89,91,93,93,93,102,106,108,110,110,117,130,138,150,150,160,178,178,200,212,226,243,266,313,345,385,432,455
3,,Nepal,28.1667,84.25,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2
4,,Malaysia,2.5,112.5,0,0,0,3,4,4,4,7,8,8,8,8,8,10,12,12,12,16,16,18,18,18,19,19,22,22,22,22,22,22,22,22,22,22,22,22,23,23,25,29,29,36,50,50,83,93,99,117,129,149,149,197,238,428,566,673,790,900,1030,1183,1306


In [5]:
# check that the date columns are identical across all three datasets
dates_all_consistent = (
    date_features_confirmed.equals(date_features_deaths)
    and date_features_confirmed.equals(date_features_recovered)
)

print(dates_all_consistent)

True


## Null case evaluation
Assumptions
1. Province/State - state level information not available
2. Date - case information not available

In [6]:
for df in [df_confirmed, df_recovered, df_deaths]:
    
    print(df.isnull().sum()[df.isnull().sum() != 0])
    print("")

# 162 missing values from Province/State

Province/State    162
dtype: int64

Province/State    162
dtype: int64

Province/State    162
dtype: int64



## Data Dictionaries
1. States in country

In [7]:
states_in_country = {}

# initialize dictionary with unique countries, sorted alphabetically
for country in sorted(df_confirmed["Country/Region"].unique()):
    states_in_country[country] = []

# append states to dictionary for countries that do have provinces/states
for index, val in df_confirmed.iterrows():
    state = val["Province/State"]
    # missing provinces/states are read in as NaN, so skip them
    if pd.notnull(state):
        states_in_country[val["Country/Region"]].append(state)

# keys are already in alphabetical order because of the sorted() call above
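
The same country-to-states mapping can also be built without row-by-row iteration. A minimal sketch using `groupby`, with a made-up `toy` frame standing in for `df_confirmed`; the name `states_by_country` is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for df_confirmed (only the two columns that matter here)
toy = pd.DataFrame({
    "Province/State": [None, "State 1", "State 2"],
    "Country/Region": ["Country A", "Country B", "Country B"],
})

# Every country starts with an empty list, so countries without states are kept
states_by_country = {
    country: [] for country in sorted(toy["Country/Region"].unique())
}

# Collect the non-missing states per country and merge them in
grouped = (
    toy.dropna(subset=["Province/State"])
    .groupby("Country/Region")["Province/State"]
    .apply(list)
)
states_by_country.update(grouped.to_dict())
print(states_by_country)
```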

## Sources
    
3Blue1Brown - Exponentials and Epidemics, https://www.youtube.com/watch?v=Kas0tIxDvrg

MAT12X - Intermediate Algebra, https://www.cgc.edu/Academics/LearningCenter/Math/Documents/Exponential_Functions_Workshop.pdf

## Growth factor of confirmed cases
The growth factor is the factor by which the total number of cases changes from one day to the next.

New cases per day = number of current cases × number of people exposed × probability of spreading the infection:

$$\Delta n_{d+1} = E \cdot p \cdot n_d$$

where $\Delta n_{d+1}$ is the number of new cases caused by all existing cases, $E$ is the number of people exposed, and $p$ is the probability of spreading the infection.

Notes
- Since $\Delta n_{d+1}$ is proportional to $n_d$, the number of new cases rises with the total, which makes the total susceptible to large growth.

Total number of cases on the next day, $d+1$:

$$n_{d+1} = n_d + E \cdot p \cdot n_d = k \cdot n_d, \quad \text{where } k = 1 + E \cdot p$$

Total number of cases on day $d$:

$$n_d = n_0 \cdot k^d$$

where $n_0$ is the number of cases at the start of the outbreak and $k$ is the growth factor.

Standard form: $y = a b^x$, where $x$ is the number of days since the first infection.
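
The model can be checked numerically. The sketch below iterates the day-to-day recurrence, compares it with the closed form, and then recovers the growth factor from the simulated counts with a straight-line fit on the log scale. The values of `E`, `p` and `n0` are hypothetical and chosen only for illustration:

```python
import numpy as np

E = 2.0    # hypothetical: number of people exposed per case per day
p = 0.1    # hypothetical: probability of spreading the infection
n0 = 10.0  # hypothetical: number of cases on day 0

k = 1 + E * p
days = np.arange(0, 15)

# Iterate the recurrence n_{d+1} = n_d + E * p * n_d
cases = [n0]
for _ in days[1:]:
    cases.append(cases[-1] + E * p * cases[-1])
cases = np.array(cases)

# Closed form n_d = n0 * k^d
closed_form = n0 * k ** days

# Recover k from the counts: log(n_d) = log(n0) + d * log(k)
slope, intercept = np.polyfit(days, np.log(cases), 1)
k_est = np.exp(slope)
n0_est = np.exp(intercept)
print(k_est, n0_est)
```

On real case counts the fit would only be approximate, since `E` and `p` change over time as behaviour and interventions change.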
    

In [8]:
def get_growth_factor(df):
    """Day-over-day ratio of cases; one column per row of df (transpose to align with df)."""
    values = df.to_numpy(dtype=float)

    # days that follow a zero count produce inf or nan, so silence the
    # divide-by-zero warnings and mark those entries as missing instead
    with np.errstate(divide='ignore', invalid='ignore'):
        ratios = values[:, 1:] / values[:, :-1]
    ratios[~np.isfinite(ratios)] = np.nan

    return pd.DataFrame(ratios.T, columns=df.index)
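
As a cross-check, the same day-over-day ratio can be computed directly on a frame by dividing it by a copy shifted one column to the right. A minimal sketch on made-up numbers; `toy` and `ratios` are hypothetical names and the counts are not real data:

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts, one row per country and one column per day
toy = pd.DataFrame(
    {"1/22/20": [0, 10], "1/23/20": [2, 20], "1/24/20": [3, 30]},
    index=["Country A", "Country B"],
).astype(float)

# Divide each day by the previous day; the first day has no predecessor
ratios = toy.div(toy.shift(1, axis=1)).iloc[:, 1:]

# A zero count on the previous day gives inf, which is treated as missing
ratios = ratios.replace([np.inf, -np.inf], np.nan)
print(ratios)
```

Because the result keeps the original row labels and date columns, it can be joined back to the geographic columns without relabelling.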

In [9]:
# get dates only
dates_df_confirmed = df_confirmed.drop(geographical_float_features+geographical_cat_features, axis=1)
dates_df_recovered = df_recovered.drop(geographical_float_features+geographical_cat_features, axis=1)
dates_df_deaths = df_deaths.drop(geographical_float_features+geographical_cat_features, axis=1)

# calculate growth factors for each dataset
g_factors_confirmed = get_growth_factor(dates_df_confirmed).T
g_factors_recovered = get_growth_factor(dates_df_recovered).T
g_factors_deaths = get_growth_factor(dates_df_deaths).T

# merge datasets together
g_factors_confirmed.columns = dates_df_confirmed.columns[1:]
df_gf_confirmed = pd.concat(
    [df_confirmed.loc[:, geographical_cat_features+geographical_float_features],
    g_factors_confirmed],
    axis=1
    )

g_factors_recovered.columns = dates_df_recovered.columns[1:]
df_gf_recovered = pd.concat(
    [df_recovered.loc[:, geographical_cat_features+geographical_float_features],
    g_factors_recovered],
    axis=1
    )

g_factors_deaths.columns = dates_df_deaths.columns[1:]
df_gf_deaths = pd.concat(
    [df_deaths.loc[:, geographical_cat_features+geographical_float_features],
    g_factors_deaths],
    axis=1
    )




## Aggregated cases by country

In [10]:
# drop Province/State
df_confirmed_no_province = df_confirmed.drop(["Province/State", "Lat", "Long"], axis=1)
df_recovered_no_province = df_recovered.drop(["Province/State", "Lat", "Long"], axis=1)
df_deaths_no_province = df_deaths.drop(["Province/State", "Lat", "Long"], axis=1)

# groupby country/region
df_rec_grp = df_recovered_no_province.groupby("Country/Region").sum()
df_conf_grp = df_confirmed_no_province.groupby("Country/Region").sum()
df_death_grp = df_deaths_no_province.groupby("Country/Region").sum()
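
With per-country totals available, the third objective (finding when the growth factor reaches 1) can be approached from the daily increases. The sketch below uses an alternative definition of the growth factor, the ratio of new cases on successive days, which is an assumption and differs from the ratio of totals computed earlier. The `cumulative` series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative totals for one country (illustrative numbers only)
cumulative = pd.Series(
    [1, 2, 4, 8, 14, 20, 26, 30],
    index=pd.date_range("2020-03-01", periods=8),
)

# Cases added each day, then the ratio of successive daily increases
new_cases = cumulative.diff()
gf_new = new_cases / new_cases.shift(1)
gf_new = gf_new.replace([np.inf, -np.inf], np.nan)

# First day on which the daily increase stops growing (ratio <= 1)
plateau_days = gf_new[gf_new <= 1].index
first_plateau = plateau_days[0] if len(plateau_days) else None
print(first_plateau)
```

Daily counts are noisy, so in practice a rolling average of `new_cases` would likely be needed before the ratio gives a stable signal.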

## Visualization

In [11]:
# scatter plot visualization of cases over time
def f(Country, Dataset):
    
    # get dataframe of choice
    get_df_agg = {
        "Recovered": [df_rec_grp, date_features_recovered],
        "Confirmed": [df_conf_grp, date_features_confirmed],
        "Deaths": [df_death_grp, date_features_deaths]
    }
    
    # To continue this..
    
    # get ranges
    date_range = get_df_agg[Dataset][1]
    value_range = get_df_agg[Dataset][0].loc[Country, date_range]
    
    # Plot cases
    fig = plt.figure(figsize=(6, 4))
    ax_1 = plt.subplot(1, 2, 1)
    ax_1.scatter(range(0, len(date_range), 1), value_range)
    
    # Decorative properties
    ax_1.set_xlabel('Days since first record')
    ax_1.set_ylabel('Number of cases')

    ax_1.spines['right'].set_visible(False)
    ax_1.spines['top'].set_visible(False)
    ax_1.set_title(Country)
    
    # plot growth factor cases
    # set layout
    fig.tight_layout()

In [12]:
# scatter plot visualization of cases over time
def g(Country, Dataset):
    
    # get dataframe of choice
    get_df_agg = {
        "Recovered": [df_rec_grp, date_features_recovered],
        "Confirmed": [df_conf_grp, date_features_confirmed],
        "Deaths": [df_death_grp, date_features_deaths]
    }
    
    # get ranges
    date_range = get_df_agg[Dataset][1]
    value_range = get_df_agg[Dataset][0].loc[Country, date_range]
    
    # Plot cases
    fig = plt.figure(figsize=(6, 4))
    ax_1 = plt.subplot(1, 2, 1)
    ax_1.scatter(range(0, len(date_range), 1), value_range)
    
    # Decorative properties
    ax_1.set_xlabel('Days since first record')
    ax_1.set_ylabel('Number of cases')

    ax_1.spines['right'].set_visible(False)
    ax_1.spines['top'].set_visible(False)
    ax_1.set_title(Country)
    
    # plot growth factor cases
    # set layout
    fig.tight_layout()

In [13]:
# Setup introduction
introduction_text = """
    <br>
    <br>
    <h1>COVID-19 Global Situation Update</h1>
    <p>Source: <a href=\"https://github.com/CSSEGISandData\">Johns Hopkins Whiting School of Engineering, CSSEGISandData 2020</a></p>
    <br>
    <h3>Introduction</h3>
    <p>Use this tool to understand the impact of COVID-19 in your country.</p>
    <br>
    <h3>Functionality</h3>
    <ul>
    <li>Observe the total number of confirmed cases, recoveries, or deaths.</li>
    <li>View the growth factor of your selection. The closer it is to one, the more likely the outbreak is plateauing.</li>
    </ul>
    <br>
    <br>
"""
introduction = HTML(introduction_text, layout = Layout(height='auto'))

# left_sidebar
per_country = widgets.interactive(
    f, 
    Country=list(states_in_country.keys()), 
    Dataset=["Confirmed", "Deaths", "Recovered"]
)

# right_sidebar
growth_factor = widgets.interactive(
    g,
    Country=list(states_in_country.keys()), 
    Dataset=["Confirmed", "Deaths", "Recovered"]
)

# AppLayout Object
app = AppLayout(
    header = introduction,
    left_sidebar = per_country,
    right_sidebar = growth_factor,
)

In [14]:
display(app)

AppLayout(children=(HTML(value='\n    <br>\n    <br>\n    <h1>COVID-19 Global Siutation Update</h1>\n    <p>So…