Last Updated: 02 Dec 2020

Today, we're going to be analyzing preprocessed Covid-19 (Coronavirus) data sourced from this upstream repository maintained by the amazing team at the Johns Hopkins University Center for Systems Science and Engineering (CSSE), who have been doing a great public service from early on by collating data from around the world.

Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to the more than 118,000 cases of the coronavirus illness in over 110 countries and territories around the world at the time.

This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:

* confirmed tested cases of Coronavirus infection
* the number of people who have reportedly died while sick with Coronavirus
* the number of people who have reportedly recovered from it

In [33]:
import plotly as py
from plotly.offline import plot, iplot, init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import folium
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import random
from datetime import timedelta

# color palette
cnf = '#393e46'
dth = '#ff2e63'
rec = '#21bf73'
act = '#fe9801'

In [2]:
df = pd.read_csv('covid_19_data_cleaned.csv', parse_dates=['Date'])

In [3]:
country_daywise = pd.read_csv('country_daywise.csv', parse_dates=['Date'])

In [4]:
countrywise = pd.read_csv('countrywise.csv')

In [5]:
daywise = pd.read_csv('daywise.csv', parse_dates=['Date'])

In [6]:
# Cruise ships are listed as countries or provinces; na=False treats a missing province as "not a ship"
ships = "Grand Princess|Diamond Princess|MS Zaandam"
ship_rows = df["Country"].str.contains(ships, na=False) | df["Province/State"].str.contains(ships, na=False)
df = df[~ship_rows]

In [7]:
ship_rows = country_daywise["Country"].str.contains("Grand Princess|Diamond Princess|MS Zaandam")
country_daywise = country_daywise[~ship_rows]

In [8]:
ship_rows = countrywise["Country"].str.contains("Grand Princess|Diamond Princess|MS Zaandam")
countrywise = countrywise[~ship_rows]
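To see how this kind of ship filter behaves, here is a minimal sketch on made-up rows (the toy data and the `toy_no_ships` name are invented for illustration). It assumes the ship names are joined into one regular expression with `|`, and that `na=False` is passed so that rows with a missing province count as non-matches.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Country": ["Canada", "Diamond Princess", "US", "Netherlands"],
    "Province/State": ["Grand Princess", None, None, None],
    "Confirmed": [13, 712, 100, 50],
})

ships = "Grand Princess|Diamond Princess|MS Zaandam"  # regex alternation

# A row is a ship row if either column mentions one of the ship names
ship_rows = (
    toy["Country"].str.contains(ships, na=False)
    | toy["Province/State"].str.contains(ships, na=False)
)
toy_no_ships = toy[~ship_rows]
print(toy_no_ships["Country"].tolist())
```

The Canada row is dropped because its province is a ship, even though its country name is not.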

In [9]:
df['Province/State'] = df['Province/State'].fillna("")

# Select the column before summing so that non-numeric columns are never aggregated
confirmed = df.groupby('Date')['Confirmed'].sum().reset_index()
recovered = df.groupby('Date')['Recovered'].sum().reset_index()
deaths = df.groupby('Date')['Deaths'].sum().reset_index()

# Global Number of Confirmed, Recovered and Death Cases

In [10]:
df["Date"] = df["Date"].astype(str)

In [11]:
date_country = country_daywise.sort_values(by=["Date", "Country"]).reset_index(drop=True)

Let's visualise this further. First, let's map the latest number of Covid19 cases worldwide. Then, we will make two line plots: one showing the number of new cases every day and one showing the number of countries reporting cases every day.

In [12]:
df_map = df[df["Date"] == max(df["Date"])]

world_map = folium.Map(location=[11,0], tiles="Stamen Terrain", zoom_start=1.5, max_zoom=6, min_zoom=1.5)

# One circle per location, with the radius scaled to its confirmed case count
for _, row in df_map.iterrows():
    folium.Circle(
        location=[row["Lat"], row["Long"]],
        fill=True,
        radius=int(row["Confirmed"] * 0.1),
        tooltip='<li><b>Country:</b> ' + str(row["Country"]) +
                '<li><b>Province/State:</b> ' + str(row["Province/State"]) +
                '<li><b>Confirmed:</b> ' + str(row["Confirmed"]) +
                '<li><b>Active:</b> ' + str(row["Active"]) +
                '<li><b>Recovered:</b> ' + str(row["Recovered"]) +
                '<li><b>Deaths:</b> ' + str(row["Deaths"]),
        fill_color="red",
        color='red').add_to(world_map)

world_map

In [13]:
fig1 = px.line(daywise.sort_values(by="Date"), x="Date", y="Confirmed", color_discrete_sequence=[act])
fig2 = px.line(daywise.sort_values(by="Date"), x="Date", y="No. of Countries", color_discrete_sequence=[dth])

fig = make_subplots(rows=1, cols=2, shared_xaxes=True, horizontal_spacing=0.1,
                    subplot_titles=("No. of New Cases Per Day", "No. of Countries Infected"))

fig.add_trace(fig1["data"][0], row=1, col=1)
fig.add_trace(fig2["data"][0], row=1, col=2)
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)')

figure=go.FigureWidget(fig)
figure

[Figure: line plots "No. of New Cases Per Day" and "No. of Countries Infected"]

From here, we have a very clear visualization of how the virus spread across the world over the year. The number of infected countries levelled off around April 10 at 185, which is about 96% of the world's countries.
Next, let's take a look at the number of confirmed/recovered/active/death cases over time.

In [14]:
fig = {'data': [go.Bar(x=daywise["Date"], y=daywise["Confirmed"], name="Confirmed"),
                go.Bar(x=daywise["Date"], y=daywise["Recovered"], name="Recovered"),
                go.Bar(x=daywise["Date"], y=daywise["Active"], name="Active"),
                go.Bar(x=daywise["Date"], y=daywise["Deaths"], name="Deaths")],
       'layout': go.Layout(barmode='overlay', colorway=px.colors.qualitative.Set2,
                           paper_bgcolor='rgba(0,0,0,0)',
                           plot_bgcolor='rgba(0,0,0,0)')}
figure=go.FigureWidget(fig)
figure

[Figure: overlaid bar chart of Confirmed, Recovered, Active and Deaths over time]

From this, we can observe a clear upward trend for all 4 series. Global confirmed cases seem to increase exponentially, while the picture is less clear for recovered cases (there might be a linear trend). The number of active cases matched the number of recovered cases until sometime before Jul 2020, when the number of recovered cases started to outstrip the number of active cases. Throughout, the number of death cases remained relatively small. However, around 1.5m people have died of Covid19 thus far.
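One rough way to tell an exponential trend from a linear one, sketched here on made-up series rather than on the dataset: under exponential growth the ratio between successive values stays roughly constant, while under linear growth the difference does. The helper names below are hypothetical.

```python
def successive_ratios(values):
    """Ratio of each value to the one before it."""
    return [b / a for a, b in zip(values, values[1:])]


def successive_differences(values):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


# Made-up cumulative counts, for illustration only
exponential_like = [100, 200, 400, 800, 1600]
linear_like = [100, 150, 200, 250, 300]

ratios = successive_ratios(exponential_like)      # constant ratio: doubling each step
differences = successive_differences(linear_like)  # constant difference: +50 each step
print(ratios)
print(differences)
```

Real case counts are noisy, so in practice the ratios and differences would only be approximately constant.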

The global trend doesn't seem to offer many more useful insights, so let's dive into individual countries. We want to see which countries are handling the pandemic best and which are handling it worst. Hence, let's plot the total number of confirmed/recovered/active/death cases for the top 10 countries (those with the highest values in each category).

# Top 10 Countries with Confirmed/Recovered/Active/Death Cases

In [15]:
fig1 = px.bar(countrywise.sort_values(by="Confirmed").tail(10), x="Confirmed", y="Country",
              text="Confirmed", orientation="h", color_discrete_sequence=[cnf])
fig2 = px.bar(countrywise.sort_values(by="Deaths").tail(10), x="Deaths", y="Country",
              text="Deaths", orientation="h", color_discrete_sequence=[dth])
fig3 = px.bar(countrywise.sort_values(by="Active").tail(10), x="Active", y="Country",
              text="Active", orientation="h", color_discrete_sequence=[act])
fig4 = px.bar(countrywise.sort_values(by="Recovered").tail(10), x="Recovered", y="Country",
              text="Recovered", orientation="h", color_discrete_sequence=[rec])

fig = make_subplots(rows=2, cols=2, shared_xaxes=False, horizontal_spacing=0.15, vertical_spacing=0.1,
                    subplot_titles=("Confirmed Cases", "Death Cases", "Active Cases", "Recovered Cases"))

fig.add_trace(fig1["data"][0], row=1, col=1)
fig.add_trace(fig2["data"][0], row=1, col=2)
fig.add_trace(fig3["data"][0], row=2, col=1)
fig.add_trace(fig4["data"][0], row=2, col=2)

fig.update_layout(height=1000, plot_bgcolor='rgba(0,0,0,0)')
figure=go.FigureWidget(fig)
figure

[Figure: horizontal bar charts of the top 10 countries by Confirmed Cases, Death Cases, Active Cases and Recovered Cases]

At first glance, it would seem that the US is doing the worst, with the highest number of confirmed, active and death cases, while Brazil and India are doing well, with the highest number of recovered cases. However, that is not the case, as Brazil and India are among the top 3 for the number of confirmed, active and death cases as well. Additionally, raw case counts alone are not an accurate gauge of how well each country is dealing with the virus, because countries differ in population and geography.

That is, a country with a greater population will tend to have a greater number of cases, but that does not mean that it is handling the virus worse than a country with a smaller population and fewer cases. Moreover, population alone is insufficient to account for the rate of infection, as other factors such as population density and climate are also important. A country with a large population but low population density might have a lower transmission rate than a country with a small population but high population density.

That being said, let's evaluate the countries by the number of death/recovered cases per 100 confirmed cases.
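The per-100-cases columns used next come precomputed in `countrywise.csv`. As a rough sketch of how such ratios could be derived (this is an assumption about the preprocessing, shown on made-up totals for two hypothetical countries):

```python
import pandas as pd

# Made-up totals for two hypothetical countries
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Confirmed": [2000, 500],
    "Deaths": [40, 25],
    "Recovered": [1500, 300],
})

# Cases per 100 confirmed cases, rounded to 2 decimal places
toy["Deaths / 100 Cases"] = (toy["Deaths"] / toy["Confirmed"] * 100).round(2)
toy["Recovered / 100 Cases"] = (toy["Recovered"] / toy["Confirmed"] * 100).round(2)

print(toy[["Country", "Deaths / 100 Cases", "Recovered / 100 Cases"]])
```

Country A has four times as many confirmed cases as B, yet fewer deaths per 100 cases, which is the kind of difference raw totals hide.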

In [16]:
fig1 = px.bar(countrywise.sort_values(by='Recovered / 100 Cases').tail(10), x='Recovered / 100 Cases', y="Country",
              text='Recovered / 100 Cases', orientation="h", color_discrete_sequence=[rec])
fig2 = px.bar(countrywise.sort_values(by='Deaths / 100 Cases').tail(10), x='Deaths / 100 Cases', y="Country",
              text='Deaths / 100 Cases', orientation="h", color_discrete_sequence=[dth])

fig = make_subplots(rows=1, cols=2, shared_xaxes=False, horizontal_spacing=0.15, vertical_spacing=0.1,
                    subplot_titles=('Recovered / 100 Cases', 'Deaths / 100 Cases'))

fig.add_trace(fig1["data"][0], row=1, col=1)
fig.add_trace(fig2["data"][0], row=1, col=2)

fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
figure=go.FigureWidget(fig)
figure

[Figure: top 10 countries by Recovered / 100 Cases and by Deaths / 100 Cases]

From this, we get a better picture of which countries are managing the pandemic well and which are not. However, this statistic is still not accurate enough. It says more about their healthcare systems (the ability to nurse the infected back to health) than about their ability to stop the transmission of the virus. Given that we do not have data for other important factors affecting transmission, we shall use a naive metric of "Confirmed Cases Per Capita" to measure countries' abilities to prevent transmission. In the following section, we will only evaluate countries that have had at least 100 cases in total.
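A minimal sketch of this metric on made-up numbers. The scaling to cases per 1,000 people and the column name `Cases Per 1,000 People` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up totals for three hypothetical countries
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Confirmed": [50_000, 80, 1_200],
    "Population": [10_000_000, 500_000, 300_000],
})

# Keep countries with at least 100 confirmed cases; .copy() gives an
# independent frame, so adding a column does not trigger pandas'
# SettingWithCopyWarning
kept = toy[toy["Confirmed"] >= 100].copy()

# Confirmed cases per 1,000 people, rounded to 2 decimal places
kept["Cases Per 1,000 People"] = (kept["Confirmed"] / kept["Population"] * 1e3).round(2)

print(kept[["Country", "Cases Per 1,000 People"]])
```

Country B is excluded by the threshold, and C ends up close to A per head despite having far fewer cases in total.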

In [17]:
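# Patch the populations of rows 36 and 79 (about 1.393 and 1.353 billion people).
# Note that 10e9 would mean 1e10, so the values are written as 1.393e9 and 1.353e9.
countrywise.at[36, "Population"] = 1.393e9
countrywise.at[79, "Population"] = 1.353e9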

In [18]:
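# .copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning
filtered_countries = countrywise[countrywise["Confirmed"] >= 100].copy()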

In [19]:
# Confirmed cases per 1,000 people (1e3, not 10e3, which would be 10,000), rounded to 2 decimal places
filtered_countries["Confirmed Cases Per Capita"] = (
    filtered_countries["Confirmed"] / filtered_countries["Population"] * 1e3
).round(2)

In [20]:
fig1 = px.bar(filtered_countries.sort_values(by="Confirmed Cases Per Capita", ascending=False).tail(10), x="Confirmed Cases Per Capita", y="Country",
              text="Confirmed Cases Per Capita", orientation="h", color_discrete_sequence=[act])
fig2 = px.bar(filtered_countries.sort_values(by="Confirmed Cases Per Capita").tail(10), x="Confirmed Cases Per Capita", y="Country",
              text="Confirmed Cases Per Capita", orientation="h", color_discrete_sequence=[cnf])

fig = make_subplots(rows=1, cols=2, shared_xaxes=False, horizontal_spacing=0.15, vertical_spacing=0.1,
                    subplot_titles=("Least Confirmed Cases Per Capita (per 1,000 people)", "Most Confirmed Cases Per Capita (per 1,000 people)"))

fig.add_trace(fig1["data"][0], row=1, col=1)
fig.add_trace(fig2["data"][0], row=1, col=2)

fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
for i in fig['layout']['annotations']:
    i['font'] = dict(size=12)
figure=go.FigureWidget(fig)
figure

[Figure: countries with the least and the most Confirmed Cases Per Capita]

Cool. Now let's visualize this all together as donut charts.

In [21]:
fig = make_subplots(rows=4, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}],
                                           [{'type':'domain'}, {'type':'domain'}],
                                           [{'type':'domain'}, {'type':'domain'}],
                                           [{'type':'domain'}, {'type':'domain'}]],
                    vertical_spacing=0.08,
                    subplot_titles=["Confirmed Cases", "Death Cases", "Active Cases", "Recovered Cases", 
                                    'Recovered / 100 Cases', 'Deaths / 100 Cases', "Least Confirmed Cases Per Capita", "Most Confirmed Cases Per Capita"])

fig.add_trace(go.Pie(labels=countrywise.sort_values(by="Confirmed").tail(10)["Country"], 
                     values=countrywise.sort_values(by="Confirmed").tail(10)["Confirmed"], name="Confirmed", 
                     marker_colors=px.colors.diverging.RdYlBu),1,1)

fig.add_trace(go.Pie(labels=countrywise.sort_values(by="Deaths").tail(10)["Country"], 
                     values=countrywise.sort_values(by="Deaths").tail(10)["Deaths"], name="Deaths",
                     marker_colors=px.colors.diverging.RdYlBu),1,2)

fig.add_trace(go.Pie(labels=countrywise.sort_values(by="Active").tail(10)["Country"], 
                     values=countrywise.sort_values(by="Active").tail(10)["Active"], name="Active",
                     marker_colors=px.colors.diverging.RdYlBu),2,1)

fig.add_trace(go.Pie(labels=countrywise.sort_values(by="Recovered").tail(10)["Country"], 
                     values=countrywise.sort_values(by="Recovered").tail(10)["Recovered"], name="Recovered",
                     marker_colors=px.colors.diverging.RdYlBu),2,2)

fig.add_trace(go.Pie(labels=countrywise.sort_values(by='Recovered / 100 Cases').tail(10)["Country"], 
                     values=countrywise.sort_values(by='Recovered / 100 Cases').tail(10)['Recovered / 100 Cases'], name='Recovered / 100 Cases', 
                     marker_colors=px.colors.diverging.RdYlGn),3,1)

fig.add_trace(go.Pie(labels=countrywise.sort_values(by='Deaths / 100 Cases').tail(10)["Country"], 
                     values=countrywise.sort_values(by='Deaths / 100 Cases').tail(10)['Deaths / 100 Cases'], name='Deaths / 100 Cases', 
                     marker_colors=px.colors.diverging.RdYlGn),3,2)

fig.add_trace(go.Pie(labels=filtered_countries.sort_values(by="Confirmed Cases Per Capita", ascending=False).tail(10)["Country"], 
                     values=filtered_countries.sort_values(by="Confirmed Cases Per Capita", ascending=False).tail(10)["Confirmed Cases Per Capita"], 
                     name="Confirmed Cases Per Capita", marker_colors=px.colors.diverging.Spectral_r),4,1)

fig.add_trace(go.Pie(labels=filtered_countries.sort_values(by="Confirmed Cases Per Capita").tail(10)["Country"], 
                     values=filtered_countries.sort_values(by="Confirmed Cases Per Capita").tail(10)["Confirmed Cases Per Capita"], 
                     name="Confirmed Cases Per Capita", marker_colors=px.colors.diverging.Spectral),4,2)

fig.update_layout(height=1200)
fig.update_traces(hole=.4, hoverinfo="label+percent+value", textinfo='label', textposition='inside')
fig.update(layout_title_text='Top 10 Countries In Each Category',
           layout_showlegend=False)

for i in fig['layout']['annotations']:
    i['font'] = dict(size=14)

figure=go.FigureWidget(fig)
figure

[Figure: donut charts, "Top 10 Countries In Each Category"]

# Flattening The Curve
One of the most widely discussed ways of determining whether a country handled Covid19 well is whether it managed to "flatten the curve". The curve refers to the total number of confirmed cases over time. As we can see from the data earlier on, Covid19 infections are increasing at an exponential rate. Therefore, if a country has "flattened the curve", it has successfully slowed the rate of infection. However, this measure has attracted some controversy due to the possibility of data manipulation (e.g. inaccurate reporting, selective reporting, misrepresentation etc). Here, we will look at some of the more well-known countries in Asia-Pacific, Europe and the Americas. (Disclaimer: The following list is completely arbitrary.)
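The charts that follow are read by eye, but one possible way to put a number on "flattening" is sketched below. This is a hypothetical heuristic, not a method used in this analysis: it compares the average number of new cases in the most recent window of days with the window before it. The function names, the 3-day window and the sample series are all made up for illustration.

```python
def daily_new_cases(cumulative):
    """Turn a cumulative case series into day-over-day new cases."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def is_flattening(cumulative, window=3):
    """True if the latest window averages fewer new cases than the one before it.

    Needs at least 2 * window + 1 cumulative values.
    """
    new_cases = daily_new_cases(cumulative)
    recent = new_cases[-window:]
    earlier = new_cases[-2 * window:-window]
    return sum(recent) / window < sum(earlier) / window


# Made-up cumulative counts for two hypothetical countries
flattening = [10, 30, 70, 130, 160, 175, 180]
still_rising = [10, 20, 40, 80, 160, 320, 640]

print(is_flattening(flattening))
print(is_flattening(still_rising))
```

The total keeps growing in both series; what differs is whether the daily increments are shrinking.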

In [22]:
countries = np.array(["Australia", "New Zealand", "Singapore", "Malaysia", "Indonesia",
             "Vietnam", "Cambodia", "Thailand", "India", "China",
             "Japan", "Korea, South", "Taiwan*", "Philippines", "Spain",
             "France", "Italy", "Russia", "United Kingdom", "Germany",
             "Netherlands", "Belgium", "Denmark", "Switzerland", "Sweden",
             "US", "Brazil", "Mexico", "Colombia", "Peru"])

In [23]:
country_date = country_daywise.groupby(["Country", "Date"])["Confirmed"].sum().reset_index()
country_date = country_date[country_date["Country"].isin(countries)]

In [27]:
ncols=5
nrows=6

fig = make_subplots(rows=nrows, cols=ncols, shared_xaxes=False, subplot_titles=countries)

for ind, country in enumerate(countries):
    row = ind // ncols + 1
    col = ind % ncols + 1
    # Use this country's own dates so that x and y always have the same length
    country_rows = country_date[country_date["Country"] == country]
    fig.add_trace(go.Bar(x=country_rows["Date"], y=country_rows["Confirmed"],
                         name=country, opacity=1), row=row, col=col)

# Tick fonts apply to every subplot, so set them once after the loop
fig.update_xaxes(tickfont={'size': 9})
fig.update_yaxes(tickfont={'size': 9})
fig.update_layout(title_text = "Confirmed Cases In Each Country Over Time", showlegend=False)
fig.update_layout(height=1500, plot_bgcolor='rgba(0,0,0,0)')
figure=go.FigureWidget(fig)
figure

[Figure: "Confirmed Cases In Each Country Over Time", one bar chart per country]

From here, we can see that Asia-Pacific has done pretty well in flattening the curve. Countries like Australia, Singapore, Thailand and China have managed to slow the rate of Covid19 transmission. In comparison, the rest of the world is still facing an exponential rise in cases every day.

# Assumptions
There are two key assumptions we make when visualising this data:

1. The figures are accurately reported by every country.

2. The figures are up-to-date.

Obviously, these two assumptions don't hold up in reality. Since it's impossible to check assumption 1 from the data alone, we will check assumption 2 instead. Let's get straight into it by looking at the start and end dates of daily updates by each country.

In [28]:
df = pd.read_csv('covid_19_data_cleaned.csv', parse_dates=['Date'])
ships = "Grand Princess|Diamond Princess|MS Zaandam"
ship_rows = df["Country"].str.contains(ships, na=False) | df["Province/State"].str.contains(ships, na=False)
df = df[~ship_rows]
df['Province/State'] = df['Province/State'].fillna("")

In [29]:
start_date = df[df["Confirmed"]>0]
start_date = start_date.groupby("Country")["Date"].agg(["min"]).reset_index()

# Select several columns with a list; indexing a groupby with a bare tuple of keys is deprecated
end_date = df.groupby(["Country", "Date"])[["Confirmed", "Deaths", "Recovered"]]
end_date = end_date.sum().diff().reset_index()

# It's a bit trickier to find the end date of cases reported due to the nature of our data.
# Even if the reporting has stopped, the number of cases might remain the same or be replaced with 0.
# Therefore, to find the actual end date of cases reported, we have to remove all duplicate and zero values.

In [30]:
# The first row of each country holds a difference against the previous country, so blank it out
mask = (end_date["Country"] != end_date["Country"].shift(1))
end_date.loc[mask, ["Confirmed", "Deaths", "Recovered"]] = np.nan

end_date = end_date[end_date["Confirmed"]>0]
end_date = end_date.groupby("Country")["Date"].agg(["max"]).reset_index()
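The idea behind this step can be checked on a small made-up table. The sketch below is a variant rather than the exact code above: it takes the difference within each country via `groupby(...).diff()`, so no separate mask is needed, and then keeps the last date on which the cumulative count actually moved. The toy data and the `Change` and `last_update` names are invented for illustration.

```python
import pandas as pd

# Made-up cumulative counts: country A stops changing after 3 March
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "Date": list(pd.date_range("2020-03-01", periods=5)) * 2,
    "Confirmed": [1, 3, 6, 6, 6, 0, 2, 5, 9, 12],
})

# Day-over-day change within each country; the first row of each country is NaN
toy["Change"] = toy.groupby("Country")["Confirmed"].diff()

# Last date on which the count increased, i.e. the last real update
last_update = toy[toy["Change"] > 0].groupby("Country")["Date"].max()
print(last_update)
```

Repeated values give a change of zero and are filtered out, which is why country A's last update lands on 3 March rather than on the final row.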

In [31]:
# Merge on Country rather than concatenating by position, so rows cannot be misaligned
start_end = start_date.merge(end_date, on="Country")
start_end["max"] = start_end["max"] + timedelta(days=1) #reporting is lagged by 1 day
start_end["days"] = start_end["max"] - start_end["min"]
start_end.columns = ["Country", "Start", "Finish", "No. of Days"]
start_end = start_end.sort_values("No. of Days")

In [34]:
color = ["#" + ''.join([random.choice("123456789ABCDEF") for j in range(6)]) for i in range(len(start_end))] #creating random rgb for each country

In [35]:
start_end["Task"] = start_end["Country"]
fig = ff.create_gantt(start_end, index_col="Country", colors=color, show_colorbar=False,
                     bar_width=0.2, showgrid_x=True, showgrid_y=True, height=2500)
fig.update_layout(height=1750, plot_bgcolor='rgba(0,0,0,0)')
fig.layout.yaxis['tickfont'] = {'size': 8}
figure=go.FigureWidget(fig)
figure

[Figure: Gantt chart of each country's reporting period, from Start to Finish]

Great. From here we can see which countries started reporting their cases later and which countries stopped reporting/updating their cases. This does not tell us if data has been manipulated though: some countries might have started reporting cases later simply because they weren't affected, some countries might have stopped reporting cases simply because they successfully contained the virus. If you're interested in whether data manipulation has occurred, further research is required. My visualisation only offers basic insights into that possibility.

# Conclusion
Asia and some parts of Africa have been doing better at stemming the spread of the virus. In contrast, most of what we would consider the "developed" world (Europe and the US) is faring the worst in that respect. This can possibly be attributed to a lack of strict containment measures, or to the fact that these countries can afford to go without them: with more advanced healthcare capabilities and capacity, they run less risk of high mortality rates or of an overwhelmed healthcare system than less developed countries do.