# Cross-country analysis of the Covid-19 outbreak

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code from [lecture 5](https://numeconcopenhagen.netlify.com/lectures/Workflow_and_debugging).
> 1. Remember this [guide](https://www.markdownguide.org/basic-syntax/) on markdown and (a bit of) latex.
> 1. Turn on automatic numbering by clicking on the small icon on top of the table of contents in the left sidebar.
> 1. The `dataproject.py` file includes a function which will be used multiple times in this notebook.

Imports and set magics:

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# local modules
import dataproject



# Read and clean data

We collect data from the "2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE", available on GitHub at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series. The data was downloaded on the 26th of March 2020 and is read from static files, so this notebook does *not* contain the newest data on the number of Covid-19 cases. We use three datasets: the number of confirmed cases, the number of deaths and the number of recovered cases.

In [4]:
# a. load data
df_confirmed = pd.read_csv("time_series_covid19_confirmed_global.csv")
df_deaths = pd.read_csv("time_series_covid19_deaths_global.csv")
df_recovered = pd.read_csv("time_series_covid19_recovered_global.csv")

# b. drop irrelevant columns
dfs = [df_confirmed, df_deaths, df_recovered]
drop_these = ['Lat', 'Long', 'Province/State']
for df in dfs:
    df.drop(drop_these, axis=1, inplace=True)
    
# c. group by country/region (we sum the number of cases across provinces/states)
df_confirmed = df_confirmed.groupby('Country/Region').sum()
df_deaths = df_deaths.groupby('Country/Region').sum()
df_recovered = df_recovered.groupby('Country/Region').sum()

The three datasets now look alike. The dataset of confirmed cases looks like this:

In [5]:
df_confirmed.head()

Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,3/16/20,3/17/20,3/18/20,3/19/20,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20
Afghanistan,0,0,0,0,0,0,0,0,0,0,...,21,22,22,22,24,24,40,40,74,84
Albania,0,0,0,0,0,0,0,0,0,0,...,51,55,59,64,70,76,89,104,123,146
Algeria,0,0,0,0,0,0,0,0,0,0,...,54,60,74,87,90,139,201,230,264,302
Andorra,0,0,0,0,0,0,0,0,0,0,...,2,39,39,53,75,88,113,133,164,188
Angola,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,2,2,3,3,3


**Convert to long format:** We convert the datasets to long format (one row per country/region and date), which makes it straightforward to build an interactive plot. The function `long()` is defined in the file `dataproject.py`.

In [6]:
deaths_long = dataproject.long(df_deaths, 'deaths')
recovered_long = dataproject.long(df_recovered, 'recovered')
confirmed_long = dataproject.long(df_confirmed, 'confirmed')
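The source of `long()` is not shown in this notebook. A minimal sketch of what such a wide-to-long conversion could look like, assuming it is built on `pd.melt`; the name `to_long` and the exact column layout are hypothetical and only mirror the columns that appear in the tables of this notebook:

```python
import pandas as pd

def to_long(df_wide, value_name):
    """Hypothetical stand-in for dataproject.long(): one row per country and date."""
    # move the 'Country/Region' index back into a regular column
    df = df_wide.reset_index()
    # stack the date columns into a single 'date' column
    df_long = df.melt(id_vars='Country/Region', var_name='date', value_name=value_name)
    # label each row with the dataset it came from (assumed, to match a 'variable' column)
    df_long['variable'] = value_name
    return df_long[['Country/Region', 'date', 'variable', value_name]]

# toy wide dataset with made-up numbers, indexed by country like the grouped data
toy = pd.DataFrame(
    {'1/22/20': [0, 1], '1/23/20': [2, 3]},
    index=pd.Index(['A', 'B'], name='Country/Region'),
)
toy_long = to_long(toy, 'deaths')
print(toy_long)
```

Each date column of the wide table becomes a set of rows in the long table, so a single `date` column can be used as the x-axis when plotting.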

# Explore data set

**General function to plot data**

In [7]:
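def plot_covid(confirmed, deaths, recovered, dataset, country_region):
    """ plot the chosen dataset over time for a single country/region """

    # a. select the dataframe and value column matching the chosen dataset
    if dataset == 'Confirmed':
        df = confirmed
        y = 'confirmed'
    elif dataset == 'Deaths':
        df = deaths
        y = 'deaths'
    else:
        df = recovered
        y = 'recovered'

    # b. keep only the rows for the chosen country/region and plot them
    I = df['Country/Region'] == country_region
    ax = df.loc[I,:].plot(x='date', y=y, style='-o')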

**Interactive plot with all countries and regions**

This plot has drop-down menus for selecting the dataset and the country/region you want to look at.

In [8]:
widgets.interact(plot_covid, 
    
    confirmed = widgets.fixed(confirmed_long),
    deaths = widgets.fixed(deaths_long),
    recovered = widgets.fixed(recovered_long),
    dataset = widgets.Dropdown(description='Dataset', 
                               options=['Confirmed','Deaths','Recovered']),
    country_region = widgets.Dropdown(description='Country/Region', 
                                    options=confirmed_long['Country/Region'].unique())
                 
); 


**Interactive plot with top countries**

For simplicity, we now construct a similar interactive plot that only shows the top ten countries, as measured by the number of deaths.

In [9]:
# a. the total deaths in a country is given by the number of deaths at the latest date:
total_deaths = deaths_long.loc[deaths_long['date']==df_deaths.columns[-1]]

# b. the top 10 countries with the highest number of deaths
top_countries = total_deaths.sort_values(by = 'deaths', ascending=False).head(10)['Country/Region']
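Note that the dates are stored as strings such as `3/25/20`, which is why the latest date is taken from the column order rather than from a maximum. A small sketch of an alternative, assuming the dates are first parsed with `pd.to_datetime`; the toy data is made up and the names `toy`, `latest` and `top2` are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'B', 'C', 'C'],
    'date': ['3/9/20', '3/25/20'] * 3,
    'deaths': [5, 9, 1, 2, 7, 8],
})

# parse the m/d/yy strings so that dates compare chronologically
toy['date'] = pd.to_datetime(toy['date'], format='%m/%d/%y')

# rows for the latest date, then the two countries with the most deaths
latest = toy.loc[toy['date'] == toy['date'].max()]
top2 = latest.nlargest(2, 'deaths')['Country/Region'].tolist()
print(top2)
```

Here `nlargest` plays the same role as `sort_values(..., ascending=False).head(...)` in the cell above.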

In [10]:
widgets.interact(plot_covid, 
    
    confirmed = widgets.fixed(confirmed_long),
    deaths = widgets.fixed(deaths_long),
    recovered = widgets.fixed(recovered_long),
    dataset = widgets.Dropdown(description='Dataset', 
                               options=['Confirmed','Deaths','Recovered']),
    country_region = widgets.Dropdown(description='Country/Region', 
                                    options=top_countries)
                 
); 


From the above plot, we can see that China is affected first: the number of confirmed cases begins to increase in late January (the data starts on January 22nd) and stabilizes around 80,000 at the beginning of March.
By comparison, Italy experiences approximately exponential growth in the number of confirmed cases from the beginning of March. Other European countries such as France, the Netherlands and Belgium experience similar exponential growth from early to mid-March, a little later than Italy.

We can also see that China reports recoveries to a greater extent than the other countries. The plot of recoveries in China also makes it apparent that the number of cases there is very high, which results in "nice", smooth curves.

# Analysis

To get a quick overview of the data, we create a table of the number of deaths in the top ten countries as well as the increase since the day before.

In [11]:
# a. We create a new dataframe, where we sort by date
# and calculate absolute and relative changes since the day before
df_long = deaths_long.sort_values(by = ['Country/Region', 'date'])
df_long['diff'] = df_long.groupby('Country/Region')['deaths'].diff()
# (calling pct_change directly on the groupby keeps the original index, so the result aligns with df_long)
df_long['diff_pct'] = df_long.groupby('Country/Region')['deaths'].pct_change()*100

# b. We find the total deaths as the number of deaths at the latest date
table = df_long.loc[df_long['date']==df_deaths.columns[-1]]
table.sort_values(by = 'deaths', ascending=False).head(10)

Unnamed: 0,Country/Region,date,variable,deaths,diff,diff_pct
11043,Italy,3/25/20,deaths,7503,683.0,10.014663
11109,Spain,3/25/20,deaths,3647,839.0,29.878917
10995,China,3/25/20,deaths,3285,4.0,0.121914
11039,Iran,3/25/20,deaths,2077,143.0,7.394002
11020,France,3/25/20,deaths,1333,231.0,20.961887
11124,US,3/25/20,deaths,942,236.0,33.427762
11128,United Kingdom,3/25/20,deaths,466,43.0,10.165485
11076,Netherlands,3/25/20,deaths,357,80.0,28.880866
11024,Germany,3/25/20,deaths,206,49.0,31.210191
10978,Belgium,3/25/20,deaths,178,56.0,45.901639


We see that Italy has the highest number of deaths on the 25th of March. China is number three on the list but has only 4 more deaths than the day before. Spain and Italy are the countries with the highest absolute increase in the number of deaths since the day before.

However, when we look at the percentage increase in the number of deaths, we find that Belgium, the US and Germany experience the highest percentage growth. As epidemics are exponential by nature, the percentage increase per day is the most relevant measure if we want to predict which countries will suffer the most.
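To make the link between daily percentage growth and exponential spread concrete, a constant daily growth rate $g$ can be translated into a doubling time, $\ln 2 / \ln(1+g)$. The sketch below assumes that the growth rate of a single day persists, which is a strong simplification; the function name `doubling_time` is illustrative and not part of `dataproject.py`:

```python
import math

def doubling_time(pct_per_day):
    """Days needed to double at a constant daily growth rate given in percent."""
    return math.log(2) / math.log(1 + pct_per_day / 100)

# daily percentage increases in deaths taken from the table above
for country, pct in [('Belgium', 45.901639), ('Italy', 10.014663), ('China', 0.121914)]:
    print(f'{country}: {doubling_time(pct):.1f} days')
```

Under this assumption, a growth of about 46 percent per day corresponds to a doubling in under two days, while about 10 percent per day corresponds to a doubling roughly once a week.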

# Conclusion

In this project, we have used data on the number of cases of Covid-19. We have constructed graphs depicting the development in the number of confirmed cases, the number of recovered cases and the number of deaths from the Covid-19 virus across countries. We have looked at the top ten countries as measured by the number of deaths on the 25th of March 2020 and find that Spain and Italy have the largest absolute daily increases in the number of deaths. In terms of the relative increase in the number of deaths (which is the relevant measure for a pandemic that is exponential by nature), Belgium, the US and Germany top the list.