In [1]:
from IPython.display import Image

# Header illustration for the notebook.
Image(url="https://cdn.downtoearth.org.in/library/large/2020-03-01/0.01792700_1583044755_coronavirus-illustration-carousel.jpg")

# COVID-19 Global Outlook: Exploratory data analysis (EDA)

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first cases were identified in December 2019 in Wuhan, China.
On March 11, 2020, after 118,000 people had been infected in 114 countries and 4,291 had died, COVID-19 was recognized as a pandemic.
As of today, April 13, 2020, the pandemic has infected 1,854,464 people in 185 countries and caused 114,331 deaths.

In the context of the global COVID-19 pandemic, we follow the suggestions from Kaggle's competitions to provide useful insights into the spread of the virus. We start with a global exploratory analysis, then focus on modelling and prediction for the countries with the largest number of confirmed cases. For modelling, we implement the SIR model with some extensions; for prediction, we use the logistic and Gompertz models. Finally, we choose the best model based on the $R^{2}$ score, check the predicted numbers of confirmed cases and fatalities for the next time interval, and display some results from NLTK sentiment analysis.
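
The logistic and Gompertz curves and the $R^{2}$ score mentioned above can be written as follows. These are common textbook parametrizations, given here as an assumption; the symbols $K$ (final size), $r$, $t_0$, $b$ and $c$ are illustrative and may differ from those used in the modelling sections.

$$
N_{\text{logistic}}(t) = \frac{K}{1 + e^{-r\,(t - t_0)}}, \qquad
N_{\text{Gompertz}}(t) = K \exp\!\left(-b\, e^{-c\,t}\right)
$$

$$
R^{2} = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
$$

Here $y_i$ are the observed values, $\hat{y}_i$ the fitted values and $\bar{y}$ the mean of the observations.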

Data: [COVID19 Global Forecasting](https://www.kaggle.com/c/covid19-global-forecasting-week-4)

**TABLE OF CONTENTS**

1. [Exploratory data analysis (EDA)](#section1)

    1.1. [Worldwide Trend](#section11)
    
    1.2. [Country-Wise growth](#section12)
    
    1.3. [Zoom up to](#section13)
    
      1.3.1. [Asia](#section131)
      
      1.3.2. [Europe](#section132)
      
      1.3.3. [US](#section133)

# **1. Exploratory data analysis (EDA)** <a id="section1"></a>

In [2]:
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

In [3]:
api.competition_download_file('covid19-global-forecasting-week-4','train.csv')
api.competition_download_file('covid19-global-forecasting-week-4','test.csv')
api.competition_download_file('covid19-global-forecasting-week-4','submission.csv')

train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
submission.csv: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission_example = pd.read_csv("submission.csv")

In [5]:
import numpy as np
import scipy as sp

# --- plotly ---
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio
pio.templates.default = "plotly_white"

In [6]:
train.rename({'Country_Region': 'country', 'Province_State': 'province', 'Id': 'id', 'Date': 'date', 'ConfirmedCases': 'confirmed', 'Fatalities': 'fatalities'}, axis=1, inplace=True)
test.rename({'Country_Region': 'country', 'Province_State': 'province', 'Id': 'id', 'Date': 'date', 'ConfirmedCases': 'confirmed', 'Fatalities': 'fatalities'}, axis=1, inplace=True)
train['country_province'] = train['country'].fillna('') + '/' + train['province'].fillna('')
test['country_province'] = test['country'].fillna('') + '/' + test['province'].fillna('')

### **1.1. Worldwide Trend** <a id="section11"></a>

As a first analysis, we want to take a look at the exponential spread of the pandemic worldwide. 

In [7]:
ww_df = train.groupby('date')[['confirmed', 'fatalities']].sum().reset_index()
ww_df['new_case'] = ww_df['confirmed'] - ww_df['confirmed'].shift(1)
ww_df.tail()

| | date | confirmed | fatalities | new_case |
|---|---|---|---|---|
| 77 | 2020-04-08 | 1510928.0 | 88332.0 | 85005.0 |
| 78 | 2020-04-09 | 1595174.0 | 95449.0 | 84246.0 |
| 79 | 2020-04-10 | 1691542.0 | 102519.0 | 96368.0 |
| 80 | 2020-04-11 | 1771337.0 | 108497.0 | 79795.0 |
| 81 | 2020-04-12 | 1846503.0 | 114088.0 | 75166.0 |


In [8]:
ww_melt_df = pd.melt(ww_df, id_vars=['date'], value_vars=['confirmed', 'fatalities', 'new_case'])

In [9]:
fig = px.line(ww_melt_df, x="date", y="value", color='variable', 
              title="Worldwide Confirmed/Death Cases Over Time")
fig.show()

As noted above, the confirmed-cases curve keeps growing exponentially through April 11, with no sign of deceleration. This is probably driven largely by the US, where New York state has become the epicentre of the outbreak, recording more than 180,000 of the country's nearly 530,000 cases.

Nevertheless, the new-cases curve shows a change of trend: the green curve has dropped below the fatalities curve. This could be explained by national responses to the coronavirus pandemic, which have included containment measures such as lockdowns, quarantines, and curfews.
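
The `new_case` series discussed above is the first difference of the cumulative confirmed count. A minimal sketch, using the five cumulative totals from the `ww_df.tail()` output and pandas' built-in `diff` (an equivalent alternative to the explicit `shift` used in the notebook):

```python
import pandas as pd

# Cumulative worldwide confirmed cases, copied from the ww_df tail above.
cumulative = pd.Series(
    [1510928, 1595174, 1691542, 1771337, 1846503],
    index=pd.to_datetime(
        ["2020-04-08", "2020-04-09", "2020-04-10", "2020-04-11", "2020-04-12"]
    ),
    name="confirmed",
)

# First difference: today's cumulative total minus yesterday's.
# The first entry has no predecessor, so it is NaN.
new_cases = cumulative.diff()

# Same result as the explicit shift-and-subtract form.
assert new_cases.equals(cumulative - cumulative.shift(1))

print(new_cases.dropna().astype(int).tolist())
# [84246, 96368, 79795, 75166]
```

The printed values match the `new_case` column for April 9 to April 12.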

In [10]:
fig = px.line(ww_melt_df, x="date", y="value", color='variable',
              title="Worldwide Confirmed/Death Cases Over Time (Log scale)",
             log_y=True)
fig.show()

Moreover, despite the lockdown policies in Europe and the US, the log-scale plot shows that the growth rate of confirmed cases increased slightly between the beginning and the end of March.
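
On a log scale, constant exponential growth appears as a straight line, and the slope of that line determines the doubling time. The sketch below estimates an implied doubling time from two cumulative totals taken from the `ww_df.tail()` output. The helper name `doubling_time_days` is hypothetical, and the estimate assumes a constant growth rate between the two dates.

```python
import math
from datetime import date


def doubling_time_days(n_start, n_end, days):
    """Doubling time implied by growth from n_start to n_end over `days` days,
    assuming constant exponential growth in between."""
    growth_rate = math.log(n_end / n_start) / days  # log growth per day
    return math.log(2) / growth_rate


# Worldwide cumulative confirmed cases from the ww_df tail shown earlier.
d0, n0 = date(2020, 4, 8), 1510928
d1, n1 = date(2020, 4, 12), 1846503

t_double = doubling_time_days(n0, n1, (d1 - d0).days)
print(f"Implied doubling time: {t_double:.1f} days")
```

For these two dates the estimate comes out at roughly two weeks. Applying the same function to early-March and late-March totals would quantify the change in growth rate described above.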

In [11]:
ww_df['mortality'] = ww_df['fatalities'] / ww_df['confirmed']

fig = px.line(ww_df, x="date", y="mortality", 
                  title="Worldwide CFR Over Time")
fig.show()

Initially, the World Health Organization (WHO) had mentioned 2% as a mortality rate estimate in a press conference on Wednesday, January 29 and again on February 10. However, in his opening remarks at the March 3 media briefing on Covid-19, WHO Director-General Dr Tedros Adhanom Ghebreyesus stated: “Globally, about 3.4% of reported COVID-19 cases have died. By comparison, seasonal flu generally kills far fewer than 1% of those infected.”

As of April 12, the worldwide mortality rate has risen to 6.2%.

We should stress that there is no single CFR (case fatality rate) figure for any particular disease. The CFR varies by location and typically changes over time (as shown in the graph above). CFRs vary widely between countries, from 2.3% in Germany to 12.8% in Italy.

The mortality rate reflects the severity of the disease in a particular context, at a particular time, in a particular population. This means that the CFR can decrease or increase over time as responses change, and that it can vary by location and by the characteristics of the infected population, such as age or sex. For instance, older populations (as in Italy) would be expected to see a higher CFR from COVID-19 than younger ones.
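
As a worked example of the ratio plotted above, the sketch below computes the naive CFR (cumulative deaths divided by cumulative confirmed cases) from the worldwide totals on 2020-04-12. The function name `case_fatality_rate` is hypothetical.

```python
def case_fatality_rate(fatalities, confirmed):
    """Naive CFR: cumulative deaths divided by cumulative confirmed cases."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return fatalities / confirmed


# Worldwide totals on 2020-04-12, from the ww_df tail shown earlier.
cfr = case_fatality_rate(fatalities=114088, confirmed=1846503)
print(f"Worldwide CFR on 2020-04-12: {cfr:.1%}")  # prints 6.2%
```

This ratio only uses reported cases and deaths to date, so it moves as testing coverage and the number of unresolved cases change.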


### **1.2. Country-Wise growth** <a id="section12"></a>

For a detailed view, we have focused on cases by country. 
Below, we show how the number of confirmed cases is distributed. 

In [12]:
country_df = train.groupby(['date', 'country'])[['confirmed', 'fatalities']].sum().reset_index()
target_date = country_df['date'].max()

print('Date: ', target_date)
for i in [1, 10, 100, 1000, 10000]:
    n_countries = len(country_df.query('(date == @target_date) & confirmed > @i'))
    print(f'{n_countries} countries have more than {i} confirmed cases')

Date:  2020-04-12
184 countries have more than 1 confirmed cases
168 countries have more than 10 confirmed cases
123 countries have more than 100 confirmed cases
70 countries have more than 1000 confirmed cases
20 countries have more than 10000 confirmed cases


In [13]:
top_country_df = country_df.query('(date == @target_date) & (confirmed > 1000)').sort_values('confirmed', ascending=False)
top_country_df = top_country_df.iloc[0:10]
top_country_melt_df = pd.melt(top_country_df, id_vars='country', value_vars=['confirmed', 'fatalities'])

In [14]:
fig = px.bar(top_country_melt_df.iloc[::-1],
             x='value', y='country', color='variable', barmode='group',
             title=f'Confirmed Cases/Deaths on {target_date}', text='value', height=1500, orientation='h')
fig.show()

COVID-19 has spread to many countries around the world. The most affected are the United States, Spain, Italy, France, Germany, the United Kingdom, China, Iran, Turkey and Belgium.

The US and many European countries now top the list, having overtaken China in confirmed cases.

In [15]:
top10_countries = top_country_df.sort_values('confirmed', ascending=False).iloc[:10]['country'].unique()
top10_countries_df = country_df[country_df['country'].isin(top10_countries)]
fig = px.line(top10_countries_df,
              x='date', y='confirmed', color='country',
              title=f'Confirmed Cases for top 10 countries as of {target_date}')
fig.show()

The coronavirus hit China first, where the trend is now slowing down; a second wave reached Europe (Italy, Spain, Germany, France, UK) in March. More recently, a third wave has reached the US, where cases are growing much faster than in China or even Europe. The main spread in the US started in mid-March, at a faster pace than in Italy. The US now appears to be in the most serious situation in terms of both total cases and speed of spread.


In [16]:
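# Rank all countries by fatalities on the latest date
# (top_country_df only holds the top 10 by confirmed cases).
top10_countries = country_df.query('date == @target_date').sort_values('fatalities', ascending=False).iloc[:10]['country'].unique()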
top10_countries_df = country_df[country_df['country'].isin(top10_countries)]
fig = px.line(top10_countries_df,
              x='date', y='fatalities', color='country',
              title=f'Fatalities for top 10 countries as of {target_date}')
fig.show()

On March 19, Italy became the first country to surpass China's death toll, followed by Spain and the US.
Focusing on the US, we notice that its spread is the fastest among the countries shown, and that it overtook Italy as the country with the highest number of cases on April 10.

In [17]:
# Copy so that adding a column below does not write to a view of country_df.
top_country_df = country_df.query('(date == @target_date) & (confirmed > 100)').copy()
top_country_df['mortality_rate'] = top_country_df['fatalities'] / top_country_df['confirmed']
top_country_df = top_country_df.sort_values('mortality_rate', ascending=False)

fig = px.bar(top_country_df[:15].iloc[::-1],
             x='mortality_rate', y='country',
             title=f'Mortality rate HIGH: top 15 countries on {target_date}', text='mortality_rate', height=800, orientation='h')
fig.show()

In the graph above, we plot the mortality rate by country, which shows that this coronavirus is truly a worldwide pandemic. Although Algeria tops the list, Italy is in the most serious situation: its mortality rate is over 10% as of March 28, 2020, with 600 to 700 deaths per day on average.

It would be interesting to analyze the evolution of the mortality rate for the top countries in the future, also taking into account the structure of the health system.

In [18]:
fig = px.bar(top_country_df[-15:],
             x='mortality_rate', y='country',
             title=f'Mortality rate LOW: top 15 countries on {target_date}', text='mortality_rate', height=800, orientation='h')
fig.show()

What about the countries with a low mortality rate?
Comparing the countries in the two charts above, we notice that most of the countries in this graph are among those with the lowest percentage of the population aged 65 and above.

The following plots give an interactive view of how confirmed cases and fatalities spread over time: China -> Europe -> US.

In [19]:
fig = px.scatter_geo(country_df, locations="country", locationmode='country names', 
                     color="confirmed", size='confirmed', hover_name="country", 
                     hover_data=['confirmed', 'fatalities'],
                     range_color= [0, country_df['confirmed'].max()], 
                     projection="natural earth", animation_frame="date", 
                     title='COVID-19: Confirmed cases spread Over Time', color_continuous_scale="portland")
# fig.update(layout_coloraxis_showscale=False)
fig.show()

In [20]:
fig = px.scatter_geo(country_df, locations="country", locationmode='country names', 
                     color="fatalities", size='fatalities', hover_name="country", 
                     hover_data=['confirmed', 'fatalities'],
                     range_color= [0, country_df['fatalities'].max()], 
                     projection="natural earth", animation_frame="date", 
                     title='COVID-19: Fatalities growth Over Time', color_continuous_scale="portland")
fig.show()

In [43]:
n_countries = 20
n_start_confirmed = 2000
confirmed_top_countries = top_country_df.sort_values('confirmed', ascending=False).iloc[:n_countries]['country'].values
country_df['date'] = pd.to_datetime(country_df['date'])


df_list = []
for country in confirmed_top_countries:
    # Copy so the new columns are not written to a view of country_df.
    this_country_df = country_df.query('country == @country').copy()
    # Align every country on the first day it exceeded the threshold.
    start_date = this_country_df.query('confirmed > @n_start_confirmed')['date'].min()
    this_country_df = this_country_df.query('date >= @start_date').copy()
    this_country_df['date_since'] = this_country_df['date'] - start_date
    # log10 of cumulative cases, shifted so that each curve starts at 0.
    this_country_df['confirmed_log1p'] = np.log10(this_country_df['confirmed'] + 1)
    this_country_df['confirmed_log1p'] -= this_country_df['confirmed_log1p'].values[0]
    df_list.append(this_country_df)

tmpdf = pd.concat(df_list)
tmpdf['date_since_days'] = tmpdf['date_since'] / pd.Timedelta('1 days')

In [45]:
fig = px.line(tmpdf,
              x='date_since_days', y='confirmed_log1p', color='country',
              title=f'Confirmed cases by country since {n_start_confirmed} confirmed cases, as of {target_date}')
# Reference lines on the log10 axis: three doublings (a factor of 8)
# raise the curve by 3 * log10(2), not by 3.
three_doublings = 3 * np.log10(2)
fig.add_trace(go.Scatter(x=[0, 21], y=[0, three_doublings], name='Double by 7 days', line=dict(dash='dash', color=('rgb(200, 200, 200)'))))
fig.add_trace(go.Scatter(x=[0, 42], y=[0, three_doublings], name='Double by 14 days', line=dict(dash='dash', color=('rgb(200, 200, 200)'))))
fig.add_trace(go.Scatter(x=[0, 63], y=[0, three_doublings], name='Double by 21 days', line=dict(dash='dash', color=('rgb(200, 200, 200)'))))
fig.show()

Are confirmed cases increasing at different rates in different countries?

The chart shown here is designed to allow these comparisons.
It lets us compare the trajectory of confirmed cases between countries. The starting point for each country is the first day on which it exceeded 2,000 total confirmed cases of COVID-19 (`n_start_confirmed` in the code above).

The x-axis shows the days since that starting point, and the y-axis shows the total number of confirmed cases on a log10 scale, relative to the starting day. The grey lines show trajectories for doubling times of 7, 14 and 21 days. A country whose curve rises more steeply than a grey line has a shorter doubling time than that line.

The US trajectory is notably steep compared to the other countries, showing how rapidly the coronavirus is spreading through it. China and South Korea follow different trajectories: there, confirmed cases seem to have reached a steady state, thanks to national responses including lockdowns, control over the population and frequent swab tests.
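
On a log10 axis, a series that doubles every $T$ days rises by $\log_{10}(2)/T$ per day. The sketch below shows how the end points of such reference lines could be computed. The helper name `reference_line` is hypothetical, and each line is drawn over three doublings as an illustrative choice.

```python
import math


def reference_line(doubling_days, horizon_days):
    """End points (x, y) of a straight line on a log10 axis that starts at
    (0, 0) and doubles every `doubling_days` days."""
    slope = math.log10(2) / doubling_days  # log10 units gained per day
    return [0, horizon_days], [0.0, slope * horizon_days]


for doubling_days in (7, 14, 21):
    # Cover three doublings (a factor of 8) for each reference line.
    x, y = reference_line(doubling_days, horizon_days=3 * doubling_days)
    print(f"double every {doubling_days:>2} days: x={x}, y_end={y[1]:.3f}")
```

Each line ends at $3 \log_{10} 2 \approx 0.903$, since three doublings multiply the count by 8.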

In [37]:
n_countries = 20
n_start_death = 10
fatality_top_countries = top_country_df.sort_values('fatalities', ascending=False).iloc[:n_countries]['country'].values
country_df['date'] = pd.to_datetime(country_df['date'])


df_list = []
for country in fatality_top_countries:
    # Copy so the new columns are not written to a view of country_df.
    this_country_df = country_df.query('country == @country').copy()
    # Align every country on the first day it exceeded the threshold.
    start_date = this_country_df.query('fatalities > @n_start_death')['date'].min()
    this_country_df = this_country_df.query('date >= @start_date').copy()
    this_country_df['date_since'] = this_country_df['date'] - start_date
    # log10 of cumulative deaths, shifted so that each curve starts at 0.
    this_country_df['fatalities_log1p'] = np.log10(this_country_df['fatalities'] + 1)
    this_country_df['fatalities_log1p'] -= this_country_df['fatalities_log1p'].values[0]
    df_list.append(this_country_df)

tmpdf = pd.concat(df_list)
tmpdf['date_since_days'] = tmpdf['date_since'] / pd.Timedelta('1 days')

In [38]:
fig = px.line(tmpdf,
              x='date_since_days', y='fatalities_log1p', color='country',
              title=f'Fatalities by country since 10 deaths, as of {target_date}')
# Reference lines on the log10 axis: three doublings (a factor of 8)
# raise the curve by 3 * log10(2), not by 3.
three_doublings = 3 * np.log10(2)
fig.add_trace(go.Scatter(x=[0, 21], y=[0, three_doublings], name='Double by 7 days', line=dict(dash='dash', color=('rgb(200, 200, 200)'))))
fig.add_trace(go.Scatter(x=[0, 42], y=[0, three_doublings], name='Double by 14 days', line=dict(dash='dash', color=('rgb(200, 200, 200)'))))
fig.add_trace(go.Scatter(x=[0, 63], y=[0, three_doublings], name='Double by 21 days', line=dict(dash='dash', color=('rgb(200, 200, 200)'))))
fig.show()

Are deaths increasing at different rates in different countries?

The chart shown here is designed to allow these comparisons.
It lets us compare the trajectory of confirmed deaths between countries. The starting point for each country is the first day on which it exceeded 10 total confirmed deaths from COVID-19.

The x-axis shows the days since that starting point, and the y-axis shows the total number of confirmed deaths on a log10 scale, relative to the starting day. The grey lines show trajectories for doubling times of 7, 14 and 21 days. A country whose curve rises more steeply than a grey line has a shorter doubling time than that line.

China's curve flattens into a stable trajectory, unlike those of the other countries.
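
The alignment step behind these charts (counting days from the first date on which a country passed the threshold) can also be written without an explicit loop. This is a minimal sketch on made-up data. The countries `A` and `B` and their counts are hypothetical, and it assumes cumulative counts never decrease, so that "above the threshold" and "on or after the start date" select the same rows.

```python
import pandas as pd

# Hypothetical cumulative death counts for two made-up countries.
df = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2
    ),
    "fatalities": [5, 12, 20, 35, 0, 3, 11, 18],
})
threshold = 10

# Keep only the rows above the threshold, then measure each date against
# the first such date within the same country.
aligned = df[df["fatalities"] > threshold].copy()
start = aligned.groupby("country")["date"].transform("min")
aligned["days_since"] = (aligned["date"] - start).dt.days

print(aligned[["country", "fatalities", "days_since"]].to_string(index=False))
```

Country `A` starts on 2020-03-02 and country `B` on 2020-03-03, so both curves begin at day 0 and can be compared directly.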

## **1.3. Zoom up to** <a id="section13"></a>

To conclude the EDA, we focus separately on three main regions, following the spread of the coronavirus pandemic over time: Asia, Europe and the US.

* In Asia, China and Iran have many confirmed cases, followed by Turkey. The daily new confirmed cases in China by province show that the curve is now very flat.
* In Europe, the Northern and Eastern areas are in a relatively better situation than the Western and Southern areas. Italy, Spain, Germany and France in particular are in a more serious situation.
* In the US, New York and its neighbor New Jersey dominate the spread and are in a serious situation. The mortality rate in New York does not seem high, at around 2% for now.
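
The per-province daily new cases mentioned in the first bullet require differencing within each province. The sketch below uses made-up counts (the numbers are hypothetical) and assumes the rows are stacked province by province, to show why a plain difference over the whole column gives a wrong value at the boundary between provinces.

```python
import pandas as pd

# Hypothetical cumulative counts for two provinces, stacked one after the other.
df = pd.DataFrame({
    "province": ["Hubei", "Hubei", "Hubei", "Beijing", "Beijing", "Beijing"],
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"] * 2),
    "confirmed": [100, 150, 210, 10, 12, 15],
})

# Differencing within each province keeps the last Hubei row from being
# subtracted from the first Beijing row.
df["new_case"] = df.groupby("province")["confirmed"].diff()

# A plain diff over the stacked column mixes provinces at the boundary.
df["naive_new_case"] = df["confirmed"].diff()

print(df.to_string(index=False))
```

In the naive column, the first Beijing row gets 10 - 210 = -200, whereas the grouped version correctly leaves it undefined (NaN).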



### **1.3.1. Asia** <a id="section131"></a>

In [31]:
country_latest = country_df.query('date == @target_date')

fig = px.choropleth(country_latest, locations="country", 
                    locationmode='country names', color="confirmed", 
                    hover_name="country", range_color=[1, 50000], 
                    color_continuous_scale='portland', 
                    title=f'Asian Countries with Confirmed Cases as of {target_date}', scope='asia', height=800)
fig.show()

In [32]:
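# Copy so the new columns are not written to a view of train.
china_df = train.query('country == "China"').copy()
# Shift within each province; grouping by country would subtract across province boundaries.
china_df['prev_confirmed'] = china_df.groupby('province')['confirmed'].shift(1)
china_df['new_case'] = china_df['confirmed'] - china_df['prev_confirmed']
# Guard against negative differences (e.g. when a cumulative count is revised downward).
china_df.loc[china_df['new_case'] < 0, 'new_case'] = 0.
fig = px.line(china_df,
              x='date', y='new_case', color='province',
              title='DAILY NEW Confirmed cases in China by province')
fig.show()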

### **1.3.2. Europe** <a id="section132"></a>

In [26]:
europe_country_list = [
    'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
    'Italy', 'Latvia', 'Luxembourg', 'Lithuania', 'Malta', 'Norway', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia',
    'Spain', 'Sweden', 'United Kingdom', 'Iceland', 'Russia', 'Switzerland', 'Serbia', 'Ukraine', 'Belarus',
    'Albania', 'Bosnia and Herzegovina', 'Kosovo', 'Moldova', 'Montenegro', 'North Macedonia']

country_df['date'] = pd.to_datetime(country_df['date'])
train_europe = country_df[country_df['country'].isin(europe_country_list)]
train_europe_latest = train_europe.query('date == @target_date')

In [27]:
fig = px.choropleth(train_europe_latest, locations="country", 
                    locationmode='country names', color="confirmed", 
                    hover_name="country", range_color=[1, 100000], 
                    color_continuous_scale='portland', 
                    title=f'European Countries with Confirmed Cases as of {target_date}', scope='europe', height=500)
fig.show()

In [30]:
europe_country_list = [
    'France', 'Germany', 'Italy', 'Spain', 'United Kingdom',
    'Belgium', 'Netherlands', 'Austria', 'Portugal', 'Norway']

country_df['date'] = pd.to_datetime(country_df['date'])
train_europe = country_df[country_df['country'].isin(europe_country_list)]
# Copy so the new columns are not written to a view of country_df.
train_europe_march = train_europe.query('date > "2020-03-01"').copy()

# Daily new cases: difference of the cumulative count within each country.
train_europe_march['prev_confirmed'] = train_europe_march.groupby('country')['confirmed'].shift(1)
train_europe_march['new_case'] = train_europe_march['confirmed'] - train_europe_march['prev_confirmed']
fig = px.line(train_europe_march,
              x='date', y='new_case', color='country',
              title='DAILY NEW Confirmed cases by country in Europe')
fig.show()

### **1.3.3. US** <a id="section133"></a>

In [21]:
api.dataset_download_file('corochann/usa-state-code','usa_states2.csv')

False

In [22]:
usa_state_code_df = pd.read_csv("usa_states2.csv")

In [23]:
# Prepare a data frame for the US only.
# Copy so the new columns are not written to a view of train.
train_us = train.query('country == "US"').copy()
train_us['mortality_rate'] = train_us['fatalities'] / train_us['confirmed']

# Convert the province column to its 2-character state code.
state_name_to_code = dict(zip(usa_state_code_df['state_name'], usa_state_code_df['state_code']))
train_us['province_code'] = train_us['province'].map(state_name_to_code)

# Only show the latest day.
train_us_latest = train_us.query('date == @target_date')

In [24]:
fig = px.choropleth(train_us_latest, locations='province_code', locationmode="USA-states",
                    color='confirmed', scope="usa", hover_data=['province', 'fatalities', 'mortality_rate'],
                    title=f'Confirmed cases in US on {target_date}')
fig.show()

In [25]:
fig = px.choropleth(train_us_latest, locations='province_code', locationmode="USA-states",
                    color='mortality_rate', scope="usa", hover_data=['province', 'fatalities', 'mortality_rate'],
                    title=f'Mortality rate in US on {target_date}')
fig.show()