## Global COVID-19 Data Analysis
The dataset is sourced from this [upstream repository](https://github.com/CSSEGISandData/COVID-19) maintained by the amazing team at [Johns Hopkins University Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE). 
The sample dataset used in this notebook is maintained and updated by [laxmimerit](https://github.com/laxmimerit/Covid-19-Preprocessed-Dataset.git).
### With Case Study: United Kingdom
**With interactive Visualisations**

In [1]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import folium
%matplotlib inline

import math
import random
import os
from datetime import datetime
import plotly as py
py.offline.init_notebook_mode(connected = True)

#color palette (cnf = Confirmed, dth = Deaths, rec = Recovered, act = Active)
#Use color picker to select any color of your choice
cnf = '#393e46'
dth = '#ff2e63'
rec = '#21bf73'
act = '#fe9801'

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

In [5]:
import shutil

# Remove any previous clone first; ignore_errors avoids a failure when the folder is absent
shutil.rmtree('Covid-19-Preprocessed-Dataset', ignore_errors = True)
!git clone https://github.com/laxmimerit/Covid-19-Preprocessed-Dataset.git

fatal: destination path 'Covid-19-Preprocessed-Dataset' already exists and is not an empty directory.


### Read the datasets

In [3]:
file_path = './Covid-19-Preprocessed-Dataset/preprocessed/covid_19_data_cleaned.csv'
df = pd.read_csv(file_path, index_col = 'Date', parse_dates = True)
print(f'Size of dataset: {df.shape}')
df.head()

Size of dataset: (216576, 8)


Date,Province/State,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2020-01-22,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-23,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-24,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-25,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-26,,Afghanistan,33.93911,67.709953,0,0,0,0


In [4]:
df['Province/State'] = df['Province/State'].fillna('')
df.head()

Date,Province/State,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2020-01-22,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-23,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-24,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-25,,Afghanistan,33.93911,67.709953,0,0,0,0
2020-01-26,,Afghanistan,33.93911,67.709953,0,0,0,0


In [5]:
z = df.isnull().sum()
print(z)
print(f'Columns with null values: {z[z > 0]}')

Province/State    0
Country           0
Lat               0
Long              0
Confirmed         0
Recovered         0
Deaths            0
Active            0
dtype: int64
Columns with null values: Series([], dtype: int64)


In [25]:
df.rename(columns = {'Province/State': 'ProvinceState'}, inplace = True)
df.sample(5)

Date,ProvinceState,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2021-12-20,Zhejiang,China,29.1832,120.0934,1998,0,1,1997
2021-08-11,Sint Maarten,Netherlands,18.0425,-63.0548,3045,0,38,3007
2020-07-17,,Paraguay,-23.4425,-58.4438,3457,1481,28,1948
2021-07-22,,Mauritius,-20.348404,57.552152,3181,1854,19,1308
2021-08-17,,Suriname,3.9193,-56.0278,26802,0,689,26113


### Global Daily Confirmed, Recovered and Death Cases
#### Confirmed

In [26]:
global_confirmed_cases_by_date = df.groupby(df.index)['Confirmed'].sum()
global_confirmed_cases_by_date = pd.DataFrame(global_confirmed_cases_by_date)
global_confirmed_cases_by_date.head()

Date,Confirmed
2020-01-22,557
2020-01-23,655
2020-01-24,941
2020-01-25,1434
2020-01-26,2118


#### Recovered

In [27]:
global_recovered_cases_by_date = df.groupby(df.index)['Recovered'].sum()
global_recovered_cases_by_date = pd.DataFrame(global_recovered_cases_by_date)
global_recovered_cases_by_date.head()

Date,Recovered
2020-01-22,30
2020-01-23,32
2020-01-24,39
2020-01-25,42
2020-01-26,56


#### Deaths

In [28]:
global_deaths_cases_by_date = df.groupby(df.index)['Deaths'].sum()
global_deaths_cases_by_date = pd.DataFrame(global_deaths_cases_by_date)
global_deaths_cases_by_date.head()

Date,Deaths
2020-01-22,17
2020-01-23,18
2020-01-24,26
2020-01-25,42
2020-01-26,56
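
The three per-column aggregations above can also be produced by a single `groupby` call that selects all the columns at once. The following is a minimal sketch on a small made-up frame; the `toy` name and its values are illustrative assumptions, not data from the notebook.

```python
import pandas as pd

# Toy frame with the same column layout as the notebook's `df`; values are made up.
toy = pd.DataFrame(
    {
        'Country': ['A', 'B', 'A', 'B'],
        'Confirmed': [10, 5, 12, 9],
        'Recovered': [1, 0, 3, 2],
        'Deaths': [0, 1, 1, 1],
    },
    index=pd.to_datetime(['2020-01-22', '2020-01-22', '2020-01-23', '2020-01-23']),
)
toy.index.name = 'Date'

# One groupby over the date index sums every selected column per day.
global_totals = toy.groupby(toy.index)[['Confirmed', 'Recovered', 'Deaths']].sum()
print(global_totals)
```

Keeping the totals in one frame avoids three separate variables, at the cost of selecting a column (for example `global_totals['Deaths']`) whenever a single series is needed.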


In [29]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 216576 entries, 2020-01-22 to 2022-02-11
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ProvinceState  216576 non-null  object 
 1   Country        216576 non-null  object 
 2   Lat            216576 non-null  float64
 3   Long           216576 non-null  float64
 4   Confirmed      216576 non-null  int64  
 5   Recovered      216576 non-null  int64  
 6   Deaths         216576 non-null  int64  
 7   Active         216576 non-null  int64  
dtypes: float64(2), int64(4), object(2)
memory usage: 14.9+ MB
None


### Analysis and Visualisation of Confirmed, Recovered and Death Cases Globally

In [30]:
global_confirmed_cases_by_date.sample(5)

Date,Confirmed
2021-05-18,164733600
2021-03-01,114874419
2020-07-10,12516760
2020-08-01,17834470
2020-10-09,36922554


In [31]:
global_recovered_cases_by_date.sample(5)

Date,Recovered
2021-06-23,117507151
2021-05-16,99064468
2020-10-03,24329331
2020-07-15,7559130
2020-06-19,4250267


In [32]:
global_deaths_cases_by_date.sample(5)

Date,Deaths
2021-09-25,4744471
2020-10-11,1128009
2020-01-28,131
2020-08-10,772307
2021-03-09,2692107


#### Interactive Visualisation of results
*Uncomment the code and run the cell to see the plots*

In [1]:
# fig = go.Figure()
# fig.add_trace(go.Scatter(x = global_confirmed_cases_by_date.index, y = global_confirmed_cases_by_date.Confirmed, mode = 'lines + markers', name = 'Confirmed', line = dict(color = 'Orange', width = 2)))
# fig.add_trace(go.Scatter(x = global_recovered_cases_by_date.index, y = global_recovered_cases_by_date.Recovered, mode = 'lines + markers', name = 'Recovered', line = dict(color = 'Green', width = 2)))
# fig.add_trace(go.Scatter(x = global_deaths_cases_by_date.index, y = global_deaths_cases_by_date.Deaths, mode = 'lines + markers', name = 'Deaths', line = dict(color = 'Red', width = 2)))
# fig.update_layout(title = 'Global Covid-19 Cases', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))
# fig.show()

**The Deaths curve is barely visible because its values are far smaller than the Confirmed and Recovered counts.** **Let's plot it separately.**
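
To put a number on the range problem, the sketch below compares the global Confirmed and Deaths totals that this notebook reports for 2022-02-11. The variable names are illustrative assumptions; only the two totals come from the notebook's output.

```python
import math

# Global totals for 2022-02-11 as printed in this notebook's output.
confirmed = 408_441_408
deaths = 5_801_984

# Share of deaths relative to confirmed cases on a linear axis.
ratio = deaths / confirmed

# Distance between the two curves on a log10 axis, in orders of magnitude.
gap = math.log10(confirmed) - math.log10(deaths)

print(f'Deaths are {ratio:.2%} of confirmed cases')
print(f'Gap on a log10 axis: {gap:.2f} orders of magnitude')
```

On a shared linear axis the Deaths line therefore hugs the bottom of the chart. Besides a separate figure, a logarithmic y-axis (in Plotly, `fig.update_yaxes(type = 'log')`) is another possible way to show all three curves together, though zero values cannot be drawn on it.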

In [2]:
# fig = go.Figure()
# fig.add_trace(go.Scatter(x = global_deaths_cases_by_date.index, y = global_deaths_cases_by_date.Deaths, mode = 'lines + markers', name = 'Deaths', line = dict(color = 'Red', width = 2)))
# fig.update_layout(title = 'Global Covid-19 Deaths', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Deaths'))
# fig.show()

### Cases Density Animation on World Map
**Plotly Express requires the date to be of string type**, so the index is converted to strings before plotting.

In [35]:
print(type(df.index))
df.index = df.index.astype(str)
print(type(df.index))

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.base.Index'>
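
Both printed types above are the generic `Index`, so the type alone does not show whether the conversion changed anything. As a hedged sketch of the same idea, the snippet below converts a small made-up `DatetimeIndex` to date strings with `strftime`, which makes the target format explicit; the `dates` and `as_strings` names are assumptions for illustration.

```python
import pandas as pd

# A small made-up DatetimeIndex standing in for the notebook's date index.
dates = pd.date_range('2020-01-22', periods=3, freq='D')

# strftime returns a plain Index of Python strings in the requested format.
as_strings = dates.strftime('%Y-%m-%d')

print(type(dates).__name__, '->', type(as_strings).__name__)
print(as_strings.tolist())
```

Printing `df.index.dtype` before and after the conversion is a quick way to confirm that the change took effect.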


*Uncomment the code and run the cell to see the plots*

In [3]:
# fig = px.density_mapbox(df, lat = 'Lat', lon = 'Long', hover_name = 'Country', hover_data = ['Confirmed', 'Recovered', 'Deaths'], animation_frame = df.index, color_continuous_scale = 'Portland', radius = 7, zoom = 0, height = 700)
# fig.update_layout(title = 'Global Covid-19 Cases')
# fig.update_layout(mapbox_style = 'open-street-map', mapbox_center_lon = 0)
# fig.show()

## Case Study: United Kingdom

In [59]:
#print(list(df.Country.unique()))

In [37]:
df_uk = df.query('Country == "United Kingdom"')
df_uk.head()

Date,ProvinceState,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2020-01-22,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0
2020-01-23,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0
2020-01-24,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0
2020-01-25,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0
2020-01-26,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0


In [38]:
uk_confirmed_cases = df_uk.groupby(df_uk.index)['Confirmed'].sum()
uk_confirmed_cases = pd.DataFrame(uk_confirmed_cases)
uk_confirmed_cases.sample(5)

Date,Confirmed
2021-01-15,3325646
2021-01-22,3594098
2021-10-26,8894843
2021-05-17,4468582
2022-01-20,15718193


In [39]:
uk_recovered_cases = df_uk.groupby(df_uk.index)['Recovered'].sum()
uk_recovered_cases = pd.DataFrame(uk_recovered_cases)
uk_recovered_cases.sample(5)

Date,Recovered
2022-01-06,0
2021-03-13,11949
2020-08-02,1444
2021-05-20,15407
2020-06-18,1313


In [40]:
uk_deaths_cases = df_uk.groupby(df_uk.index)['Deaths'].sum()
uk_deaths_cases = pd.DataFrame(uk_deaths_cases)
uk_deaths_cases.sample(5)

Date,Deaths
2021-03-23,126523
2021-12-27,148480
2020-07-02,40617
2021-07-14,128797
2021-02-28,123083


*Uncomment the code and run the cell to see the plots*

In [4]:
# fig = go.Figure()
# fig.add_trace(go.Scatter(x = uk_confirmed_cases.index, y = uk_confirmed_cases.Confirmed, mode = 'lines + markers', name = 'Confirmed', line = dict(color = 'Orange', width = 2)))
# fig.add_trace(go.Scatter(x = uk_recovered_cases.index, y = uk_recovered_cases.Recovered, mode = 'lines + markers', name = 'Recovered', line = dict(color = 'Green', width = 2)))
# fig.add_trace(go.Scatter(x = uk_deaths_cases.index, y = uk_deaths_cases.Deaths, mode = 'lines + markers', name = 'Deaths', line = dict(color = 'Red', width = 2)))
# fig.update_layout(title = 'UK Covid-19 Cases', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))
# fig.show()

In [5]:
# fig = go.Figure()
# fig.add_trace(go.Scatter(x = uk_recovered_cases.index, y = uk_recovered_cases.Recovered, mode = 'lines + markers', name = 'Recovered', line = dict(color = 'Green', width = 2)))
# fig.add_trace(go.Scatter(x = uk_deaths_cases.index, y = uk_deaths_cases.Deaths, mode = 'lines + markers', name = 'Deaths', line = dict(color = 'Red', width = 2)))
# fig.update_layout(title = 'UK Covid-19 Cases', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))
# fig.show()

In [43]:
print(type(df_uk.index))
df_uk.index = df_uk.index.astype(str)
print(type(df_uk.index))

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.base.Index'>


In [6]:
# fig = px.density_mapbox(df_uk, lat = 'Lat', lon = 'Long', hover_name = 'ProvinceState', hover_data = ['Confirmed', 'Recovered', 'Deaths'], animation_frame = df_uk.index, color_continuous_scale = 'Portland', radius = 7, zoom = 0, height = 700)
# fig.update_layout(title = 'UK Covid-19 Cases')
# fig.update_layout(mapbox_style = 'open-street-map', mapbox_center_lon = 0)
# fig.show()

#### UK Cases Over Time with Area Plot

In [45]:
uk_cases = df_uk.groupby(df_uk.index)[['Confirmed', 'Recovered', 'Deaths', 'Active']].sum()
uk_cases = pd.DataFrame(uk_cases)
uk_cases.sample(5)

Date,Confirmed,Recovered,Deaths,Active
2021-07-27,5771732,18908,129591,5623233
2020-09-03,342708,1750,41616,299342
2020-10-24,857045,2678,44835,809532
2021-04-03,4371393,13222,127068,4231103
2021-10-08,8119442,0,137945,7981497


In [46]:
latest_uk_cases = uk_cases[uk_cases.index == max(uk_cases.index)]
latest_uk_cases

Date,Confirmed,Recovered,Deaths,Active
2022-02-11,18346553,0,159909,18186644


In [48]:
latest_uk = latest_uk_cases.reset_index().melt(id_vars = 'Date', value_vars = ['Active', 'Deaths', 'Recovered'])

*Uncomment the code and run the cell to see the plots*

In [7]:
# fig = px.treemap(latest_uk, path = ['variable'], values = 'value', height = 250, width = 800, color_discrete_sequence = [act, dth, rec])
# fig.data[0].textinfo = 'label+text+value'
# fig.show()

### Total Cases on Ships

In [50]:
print(type(df.index))
df.index = pd.to_datetime(df.index)
print(type(df.index))

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>


In [51]:
df.rename(columns = {'Province/State': 'ProvinceState'}, inplace = True)

#### Print the unique `ProvinceState` and `Country` values to identify possible ship names
Ships with COVID-19 cases are listed by name in the **ProvinceState** or **Country** column.

In [54]:
#print(df['ProvinceState'].unique())
#print(df['Country'].unique())

In [55]:
#ship_df = df.query('ProvinceState == "Grand Princess"')           # cool syntax!
#ship_df = df.query('ProvinceState == "Grand Princess" or ProvinceState == "Diamond Princess"')
ship_rows = df['ProvinceState'].str.contains('Grand Princess') | df['ProvinceState'].str.contains('Diamond Princess') | \
            df['Country'].str.contains('Grand Princess') | df['Country'].str.contains('Diamond Princess') | \
            df['Country'].str.contains('MS Zaandam') 
ship_df = df[ship_rows]
ship_df.sample(10)

Date,ProvinceState,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2020-08-23,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2021-10-09,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2021-01-21,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2021-05-14,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2021-09-27,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2021-11-24,Grand Princess,Canada,0.0,0.0,13,0,0,13
2020-03-25,,Diamond Princess,0.0,0.0,712,587,10,115
2021-07-02,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2020-02-01,,MS Zaandam,0.0,0.0,0,0,0,0
2021-11-10,Grand Princess,Canada,0.0,0.0,13,0,0,13
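
The ship filter above chains several `str.contains` calls. As an alternative sketch, the same rows can be selected with `isin` against a list of ship names. Note that this swaps substring matching for exact matching, and the `toy` frame, `ship_names` list and `is_ship` mask are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows mimicking how ships appear in either column of the dataset.
toy = pd.DataFrame({
    'ProvinceState': ['Diamond Princess', '', '', 'Ontario'],
    'Country': ['Canada', 'MS Zaandam', 'Kenya', 'Canada'],
    'Confirmed': [0, 9, 100, 50],
})

# Ship names mentioned in the notebook's filter.
ship_names = ['Grand Princess', 'Diamond Princess', 'MS Zaandam']

# A row is a ship row if either column exactly equals one of the ship names.
is_ship = toy['ProvinceState'].isin(ship_names) | toy['Country'].isin(ship_names)

ships = toy[is_ship]       # rows describing ships
non_ships = toy[~is_ship]  # everything else
print(ships)
print(non_ships)
```

Exact matching avoids accidental hits on longer names that merely contain a ship name, but it would miss variants such as extra whitespace, so `str.contains` remains the more forgiving choice.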


**Let's remove all ships from our dataframe**...

In [56]:
df = df[~ship_rows]
print(df.shape)

(213568, 8)


In [57]:
print(ship_df.shape)

(3008, 8)


#### Recent Cases on the Ships
Select the rows for the latest (most recent) date in the data.

In [58]:
ship_latest = ship_df[ship_df.index == max(ship_df.index)]
ship_latest.head()

Date,ProvinceState,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2022-02-11,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
2022-02-11,Grand Princess,Canada,0.0,0.0,13,0,0,13
2022-02-11,,Diamond Princess,0.0,0.0,712,0,13,699
2022-02-11,,MS Zaandam,0.0,0.0,9,0,2,7
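
The "latest date" selection used here is a boolean mask against the maximum of the index. The sketch below shows the pattern on a made-up frame using the index's own `max()` method; the `toy` and `latest` names and the values are illustrative assumptions.

```python
import pandas as pd

# Made-up frame with two rows per date, indexed by date like the notebook's data.
toy = pd.DataFrame(
    {'Confirmed': [1, 2, 3, 4]},
    index=pd.to_datetime(['2022-02-10', '2022-02-10', '2022-02-11', '2022-02-11']),
)

# Keep every row whose index equals the most recent date.
latest = toy[toy.index == toy.index.max()]
print(latest)
```

With a `DatetimeIndex` the maximum is the most recent date. With a string index the comparison is lexicographic, which gives the same result only for ISO-style `YYYY-MM-DD` strings.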


In [59]:
ship_latest.reset_index().style.background_gradient(cmap = 'Pastel1_r')       # make sure to reset_index()!

,Date,ProvinceState,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
0,2022-02-11 00:00:00,Diamond Princess,Canada,0.0,0.0,0,0,1,-1
1,2022-02-11 00:00:00,Grand Princess,Canada,0.0,0.0,13,0,0,13
2,2022-02-11 00:00:00,,Diamond Princess,0.0,0.0,712,0,13,699
3,2022-02-11 00:00:00,,MS Zaandam,0.0,0.0,9,0,2,7


### Global Cases Over Time with Area Plot

In [60]:
cases = df.groupby(df.index)[['Confirmed', 'Recovered', 'Deaths', 'Active']].sum()
cases = pd.DataFrame(cases)
cases.sample(5)

Date,Confirmed,Recovered,Deaths,Active
2020-02-18,74611,14352,2008,58251
2020-06-30,10468494,5351859,536106,4580529
2022-01-07,303825892,0,5483619,298342273
2022-02-06,395492616,0,5740788,389751828
2020-08-26,24214655,15793968,869008,7551679


In [61]:
latest_cases = cases[cases.index == max(cases.index)]
latest_cases

Date,Confirmed,Recovered,Deaths,Active
2022-02-11,408441408,0,5801984,402639424


In [62]:
#case1 = latest_cases.reset_index().melt(id_vars = 'Date', value_vars = ['Confirmed','Active', 'Deaths', 'Recovered'])
case1 = latest_cases.reset_index().melt(id_vars = 'Date', value_vars = ['Active', 'Deaths', 'Recovered'])

*Uncomment the code and run the cell to see the plots*

In [8]:
# fig = px.treemap(case1, path = ['variable'], values = 'value', height = 250, width = 800, color_discrete_sequence = [act, dth, rec])
# fig.data[0].textinfo = 'label+text+value'
# fig.show()

In [64]:
cases = df.groupby(df.index)[['Recovered', 'Deaths', 'Active']].sum()
cases = pd.DataFrame(cases)
cases = cases.reset_index().melt(id_vars = 'Date', value_vars = ['Recovered', 'Deaths', 'Active'], var_name = 'Case', value_name = 'Count')
cases.sample(5)

,Date,Case,Count
1021,2020-10-17,Deaths,1161760
134,2020-06-04,Recovered,2944654
2032,2021-07-03,Active,59414969
1786,2020-10-30,Active,13824760
615,2021-09-28,Recovered,0


*Uncomment the code and run the cell to see the plots*

In [9]:
# fig = px.area(cases, x = 'Date', y = 'Count', color = 'Case', height = 400, title = 'Cases over time', color_discrete_sequence = [rec, dth, act])
# fig.update_layout(xaxis_rangeslider_visible = True)
# fig.show()

### Global cases using Folium Maps
#### Recent cases globally

In [66]:
global_latest = df[df.index == max(df.index)]
global_latest.sample(5)

Date,ProvinceState,Country,Lat,Long,Confirmed,Recovered,Deaths,Active
2022-02-11,,Kenya,-0.0236,37.9062,322388,0,5626,316762
2022-02-11,,Mauritania,21.0079,-10.9408,58557,0,971,57586
2022-02-11,,United Arab Emirates,23.424076,53.847818,865576,0,2283,863293
2022-02-11,,New Zealand,-40.9006,174.886,19777,0,53,19724
2022-02-11,,Malta,35.9375,14.3754,69794,0,579,69215


In [67]:
global_latest = global_latest.reset_index()

In [68]:
#m = folium.Map(location = [0, 0], tiles = 'OpenStreetMap', min_zoom = 1, max_zoom = 4, zoom_start = 1)
m = folium.Map(location = [0, 0], tiles = 'cartodb positron', min_zoom = 1, max_zoom = 4, zoom_start = 1)
# Draw one circle per row; <b> is the valid HTML tag for bold text (<bold> is not)
for _, row in global_latest.iterrows():
    folium.Circle(location = [row['Lat'], row['Long']], color = 'crimson', fill = True, fill_color = 'crimson', 
                  tooltip = '<li><b>Country:</b> ' + str(row['Country']) + 
                  '<li><b>Province:</b> ' + str(row['ProvinceState']) + 
                  '<li><b>Confirmed:</b> ' + str(row['Confirmed']) + 
                  '<li><b>Deaths:</b> ' + str(row['Deaths']), 
                  radius = int(row['Confirmed'])**0.5).add_to(m)