# Group 1 - Data Project - Covid-19  

**Imports and set magics:**

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
import folium
import requests
import numpy as np 
from matplotlib_venn import venn2 # install with pip install matplotlib-venn
from ipywidgets import interact, interactive, fixed, interact_manual
from datetime import datetime
from plotly.subplots import make_subplots  # install with pip install plotly==4.6.0
import plotly.graph_objects as go
import plotly.express as px 


# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# local modules
import dataproject

# Read and clean data

### Covid-19 data retrieved from The Humanitarian Data Exchange, collected by Johns Hopkins University. We use data on confirmed Covid-19 cases, deaths due to Covid-19, recovered Covid-19 patients, and a summary data set for each individual country. ###

**Loading the CSSEGIS data** on Covid-19: the data is retrieved from the official data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), with support from the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL). The data is available at: https://github.com/CSSEGISandData/COVID-19 (also available at the official webpage: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases).

The data is **cleaned**, variables are removed and columns renamed:


In [4]:
# a. Loading data
death = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
country = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv')


# b. Renaming country/region to country
confirmed = confirmed.rename(columns={'Country/Region': 'Country'})
recovered = recovered.rename(columns={'Country/Region': 'Country'})
death = death.rename(columns={'Country/Region': 'Country'})
country = country.rename(columns={'Country_Region': 'Country'})

# c. Dropping columns that are not used in the time series
drop_these = ['Province/State', 'Lat', 'Long']
confirmed.drop(drop_these, axis=1, inplace=True)
recovered.drop(drop_these, axis=1, inplace=True)
death.drop(drop_these, axis=1, inplace=True)

The data is updated daily, so rerunning the notebook loads the most recent numbers.
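
The renaming and dropping steps are the same for all three time series, so they could be wrapped in one function. The following is a minimal sketch under stated assumptions: `clean_time_series` is a hypothetical helper that is not part of the notebook, and the small `raw` frame holds made-up numbers in the same column layout as the JHU time series files. The last step is optional and shows that, once `Province/State` is gone, several rows can share a country name and can be summed into one row per country.

```python
import pandas as pd

def clean_time_series(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the country column and drop the location columns (hypothetical helper)."""
    df = df.rename(columns={'Country/Region': 'Country'})
    # errors='ignore' keeps the call safe if a column is already gone
    return df.drop(columns=['Province/State', 'Lat', 'Long'], errors='ignore')

# Made-up frame in the same layout as the JHU time series files (illustration only)
raw = pd.DataFrame({
    'Province/State': ['Hubei', 'Beijing', None],
    'Country/Region': ['China', 'China', 'Italy'],
    'Lat': [30.9, 40.2, 41.9],
    'Long': [112.3, 116.4, 12.6],
    '1/22/20': [444, 14, 0],
    '1/23/20': [444, 22, 0],
})

cleaned = clean_time_series(raw)

# Optional: one row per country, with the provinces summed together
per_country = cleaned.groupby('Country', as_index=False).sum()
print(per_country)
```

Returning a new DataFrame instead of using `inplace=True` keeps the raw download untouched, which can help when a cell is rerun.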

# Visualising the worst-hit countries in terms of number of infected individuals

**Initially, we want to create a table displaying the countries that have the highest number of confirmed infected individuals.**

In [5]:
# sort once so the worst-hit countries come first
sorted_country = country.sort_values('Confirmed', ascending=False)

def highlight_col(x):
    """Colour the columns at positions 4-6 (Confirmed, Deaths, Recovered in the output below)."""
    b = 'background-color: blue'
    d = 'background-color: darkblue'
    g = 'background-color: green'
    # start from an empty style frame with the same shape as the data
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    df1.iloc[:, 4] = d
    df1.iloc[:, 5] = b
    df1.iloc[:, 6] = g
    return df1

def show_latest_cases(n):
    n = int(n)  # the text box passes n as a string
    return sorted_country.head(n).style.apply(highlight_col, axis=None)

interact(show_latest_cases, n='10')

| | Country | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|---|
| 17 | US | 2020-04-06 13:15:41 | 40.0 | -100.0 | 337971 | 9654 | 17582 | 0 |
| 159 | Spain | 2020-04-06 13:15:20 | 40.4637 | -3.74922 | 135032 | 13055 | 40437 | 81540 |
| 10 | Italy | 2020-04-06 13:15:20 | 41.8719 | 12.5674 | 128948 | 15887 | 21815 | 91246 |
| 7 | Germany | 2020-04-06 13:15:20 | 51.1657 | 10.4515 | 100186 | 1590 | 28700 | 69896 |
| 6 | France | 2020-04-06 13:15:20 | 46.2276 | 2.2137 | 93780 | 8093 | 16354 | 69333 |
| 3 | China | 2020-04-06 09:37:01 | 30.5928 | 114.305 | 82665 | 3335 | 77310 | 2020 |
| 89 | Iran | 2020-04-06 13:15:20 | 32.4279 | 53.688 | 60500 | 3739 | 24236 | 32525 |
| 16 | United Kingdom | 2020-04-06 13:15:20 | 55.0 | -3.0 | 48451 | 4943 | 229 | 43279 |
| 171 | Turkey | 2020-04-06 13:15:20 | 38.9637 | 35.2433 | 27069 | 574 | 1042 | 25453 |
| 15 | Switzerland | 2020-04-06 13:15:20 | 46.8182 | 8.2275 | 21652 | 734 | 7298 | 13620 |


<function __main__.show_latest_cases(n)>

The table shows the countries worst hit by the coronavirus in terms of the number of confirmed infected individuals. The interactive function lets one choose how many countries to view. Data from April 6th shows that the countries with the most individuals diagnosed with the coronavirus are the US, Spain, Italy, Germany and France. It is interesting to see that China is now not in the top five.

**We now want to create a more visually intuitive visualisation of the above table.**

In [6]:
def bubble_chart(n):
    # bubble size and height both encode the number of confirmed cases
    fig = px.scatter(
        sorted_country.head(n), x="Country", y="Confirmed", size="Confirmed",
        color="Country", hover_name="Country", size_max=60
    )
    fig.update_layout(
        title=str(n) + " Worst Hit Countries",
        xaxis_title="Countries",
        yaxis_title="Confirmed Cases",
        width=700
    )
    fig.show()

interact(bubble_chart, n=10)

<function __main__.bubble_chart(n)>

The interactive slider lets one choose the number of countries viewed. The plot confirms what we found in the table. We note that, in order to fully grasp which countries are worst hit, we would need to look at the numbers relative to the sizes of the populations.
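
A minimal sketch of such a population-adjusted comparison could look as follows. The confirmed counts are copied from the table above (6 April 2020). The `Population` column is not part of the JHU data: the figures are rough, rounded assumptions used for illustration only, and a real analysis would need to merge in a proper population data set.

```python
import pandas as pd

# Confirmed cases copied from the table above (6 April 2020).
# Population figures are rough, rounded assumptions for illustration only.
cases = pd.DataFrame({
    'Country': ['US', 'Spain', 'Italy', 'Germany', 'Switzerland'],
    'Confirmed': [337971, 135032, 128948, 100186, 21652],
    'Population': [331_000_000, 47_000_000, 60_000_000, 83_000_000, 8_600_000],
})

# Confirmed cases per 100,000 inhabitants
cases['Confirmed_per_100k'] = cases['Confirmed'] / cases['Population'] * 100_000

ranked = cases.sort_values('Confirmed_per_100k', ascending=False)
print(ranked[['Country', 'Confirmed_per_100k']].round(1))
```

With these assumed populations the ranking changes: the US, first in absolute numbers, comes last of the five, while Switzerland moves up to second place.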

# Visualisation of worst affected countries in terms of deaths #

**To gain another perspective on the current status of Covid-19, we will now turn to look at the number of deaths across countries.**

In [7]:
px.bar(
    sorted_country.head(10),
    x = "Country",
    y = "Deaths",
    title="10 Countries most affected by Covid-19",
    color_discrete_sequence=["blue"], 
    height=400,
    width=800
)

Among the ten countries with the most confirmed cases, Italy, Spain, the US, France and the UK have the most deaths caused by Covid-19, in that order. These are almost the same countries as when we looked at the number of confirmed infected individuals, but the order is different.
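
The bar chart keeps the ordering by confirmed cases, so the ranking by deaths has to be read off the bar heights. As a small sketch, the ranking can also be computed directly with `DataFrame.nlargest`. The frame `top` below is rebuilt by hand from the table shown earlier (6 April 2020) so that the example runs on its own; in the notebook one would apply the same call to `sorted_country.head(10)`.

```python
import pandas as pd

# Confirmed and death counts copied from the table above (6 April 2020)
top = pd.DataFrame({
    'Country': ['US', 'Spain', 'Italy', 'Germany', 'France',
                'China', 'Iran', 'United Kingdom', 'Turkey', 'Switzerland'],
    'Confirmed': [337971, 135032, 128948, 100186, 93780,
                  82665, 60500, 48451, 27069, 21652],
    'Deaths': [9654, 13055, 15887, 1590, 8093,
               3335, 3739, 4943, 574, 734],
})

# Five countries with the most deaths, largest first
by_deaths = top.nlargest(5, 'Deaths')
print(by_deaths[['Country', 'Deaths']])
```

Note that this only ranks the ten countries with the most confirmed cases. A country outside that group could in principle have more deaths than some of them, so sorting the full `country` data by `Deaths` would be the safer choice for a true top ten.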

# Visualisation of the spread of Covid-19 across time #

**In order to gain an understanding of how the virus has spread globally over time, we now turn to a time series analysis.**

The data is converted to fit the plot we want to create. We import a package (`datetime`) which allows us to parse the dates given in the data, which are in month/day/year format. A list of the date column names is created for the timeline. Each date is converted to a datetime value, and the cases for that date are summed over all rows. The resulting timeline then spans the x-axis of the graph. This is done for the three DataFrames confirmed, death and recovered.

In [12]:
#Confirmed 
timeline = ['1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
       '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
       '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
       '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
       '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
       '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20', 
       '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
       '3/15/20', '3/16/20', '3/17/20', '3/18/20', '3/19/20', '3/20/20',
       '3/21/20', '3/22/20', '3/23/20', '3/24/20', '3/25/20', '3/26/20',
       '3/27/20', '3/28/20', '3/29/20', '3/30/20', '3/31/20', '4/1/20',
       '4/2/20', '4/3/20', '4/4/20', '4/5/20'] 
# lists collecting the x-axis dates and the summed values for the plot
time = []; value = []
for i in timeline:
    time.append(datetime.strptime(i, '%m/%d/%y'))  # parse the column name into a datetime
    value.append(confirmed[i].sum())  # global total for that date

new_confirmed = pd.DataFrame({'Timeline':time,'Covid-19 impact':value})

In [13]:
#Deaths
timeline = ['1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
       '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
       '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
       '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
       '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
       '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20', 
       '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
       '3/15/20', '3/16/20', '3/17/20', '3/18/20', '3/19/20', '3/20/20',
       '3/21/20', '3/22/20', '3/23/20', '3/24/20', '3/25/20', '3/26/20',
       '3/27/20', '3/28/20', '3/29/20', '3/30/20', '3/31/20', '4/1/20',
       '4/2/20', '4/3/20', '4/4/20', '4/5/20'] 
# lists collecting the x-axis dates and the summed values for the plot
time = []; value = []
for i in timeline:
    time.append(datetime.strptime(i, '%m/%d/%y'))  # parse the column name into a datetime
    value.append(death[i].sum())  # global total for that date

new_death = pd.DataFrame({'Timeline':time,'Covid-19 impact':value})

In [14]:
#Recovered
timeline = ['1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
       '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
       '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
       '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
       '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
       '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20', 
       '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
       '3/15/20', '3/16/20', '3/17/20', '3/18/20', '3/19/20', '3/20/20',
       '3/21/20', '3/22/20', '3/23/20', '3/24/20', '3/25/20', '3/26/20',
       '3/27/20', '3/28/20', '3/29/20', '3/30/20', '3/31/20', '4/1/20',
       '4/2/20', '4/3/20', '4/4/20', '4/5/20'] 
# lists collecting the x-axis dates and the summed values for the plot
time = []; value = []
for i in timeline:
    time.append(datetime.strptime(i, '%m/%d/%y'))  # parse the column name into a datetime
    value.append(recovered[i].sum())  # global total for that date

new_recovered = pd.DataFrame({'Timeline':time,'Covid-19 impact':value})
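
The three cells above repeat the same hard-coded list of dates, which stops at 4/5/20 even though the source files keep growing. One possible alternative, sketched here under stated assumptions, is to read the dates from the column names. `global_totals` is a hypothetical helper that is not part of the notebook, and `demo` is a made-up frame in the same layout as the cleaned `confirmed` DataFrame.

```python
from datetime import datetime
import pandas as pd

def global_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Sum every date column over all rows (hypothetical helper)."""
    time, value = [], []
    for col in df.columns:
        try:
            day = datetime.strptime(col, '%m/%d/%y')
        except ValueError:
            continue  # not a date column, e.g. 'Country'
        time.append(day)
        value.append(df[col].sum())
    return pd.DataFrame({'Timeline': time, 'Covid-19 impact': value})

# Made-up numbers in the same layout as the cleaned `confirmed` frame
demo = pd.DataFrame({
    'Country': ['A', 'B'],
    '1/22/20': [1, 2],
    '1/23/20': [3, 5],
})

totals = global_totals(demo)
print(totals)
```

In the notebook this would be called once per DataFrame, for example `new_confirmed = global_totals(confirmed)`, and the timeline would then follow the data whenever the files are updated.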

**We create a plot visualising the development of the three variables: infected, deaths and recovered.**

In [16]:
fig = make_subplots()

fig.add_trace(
    go.Scatter(x=new_confirmed["Timeline"], y=new_confirmed["Covid-19 impact"], name = 'Infected'))

fig.add_trace(
    go.Scatter(x=new_death["Timeline"], y=new_death["Covid-19 impact"], name = 'Deaths'))

fig.add_trace(
    go.Scatter(x=new_recovered["Timeline"], y=new_recovered["Covid-19 impact"], name = 'Recovery'))

fig.update_xaxes(title_text="Timeline")


fig.update_layout(height=500, width=800, title_text="Timeline of Covid-19")
fig.show()

We see that the number of infected individuals globally appears to grow exponentially. The number of recovered individuals is slowly increasing and, at an even slower pace, the number of deaths is increasing.

We deem it relevant to plot the above on a logarithmic y-axis too. Exponential growth appears as a straight line on a log scale, which makes the pace of the spread easier to discern. That is done below:

In [17]:
#Log-graph
fig = make_subplots()

fig.add_trace(
    go.Scatter(x=new_confirmed["Timeline"], y=new_confirmed["Covid-19 impact"], name = 'Infected'))

fig.add_trace(
    go.Scatter(x=new_death["Timeline"], y=new_death["Covid-19 impact"], name = 'Deaths'))

fig.add_trace(
    go.Scatter(x=new_recovered["Timeline"], y=new_recovered["Covid-19 impact"], name = 'Recovery'))

fig.update_xaxes(title_text="Timeline")


fig.update_layout(height=500, width=800, title_text="Timeline of Covid-19 (Logarithmic)", yaxis=dict(type='log', autorange=True))
fig.show()

This gives better insight into the exponential development. It can (potentially) hint at when the exponential increase might come to an end, or at least at what pace it is developing.
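
To put a number on the pace, one option is to fit a straight line to the logarithm of the case counts and convert the slope into a doubling time. The sketch below is an illustration, not part of the original analysis: it uses a synthetic series that doubles every 4 days instead of the JHU data, and it assumes that growth is purely exponential over the fitted window.

```python
import numpy as np

# Synthetic series that doubles every 4 days (made-up, not the JHU data)
days = np.arange(20)
cases = 100 * 2 ** (days / 4)

# On a log scale exponential growth is a straight line: log(y) = log(y0) + r * t
slope, intercept = np.polyfit(days, np.log(cases), 1)

# Time needed for the fitted curve to double
doubling_time = np.log(2) / slope
print(round(doubling_time, 2))
```

The fit recovers the 4 days built into the synthetic series. Applied to `new_confirmed`, a slope that flattens from one window of days to the next would suggest that the growth is slowing down.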

# Conclusion #

The conclusion is drawn from the data retrieved on April 6th. From our data project on the spread of Covid-19, it is evident that there are a handful of countries that are terribly hit by the pandemic, both in terms of the number of individuals infected and in terms of individuals dying. These countries are the US, Spain, Italy, Germany, France and the UK.

From a global perspective, the spread of the virus across time was visualised. It shows that, unfortunately, the number of infected individuals appears to be increasing exponentially. The numbers of deaths and recovered individuals are increasing too, but at a slower rate.