# Covid-19 analysis
This Python 3 notebook is based on the original [Pharo Smalltalk version](https://github.com/olekscode/CovidAnalysis) by Oleksandr Zaitsev. <br> The dataset for this Covid-19 analysis is publicly available (see [Download latest COVID-19 data](#Download-latest-COVID-19-data)) and is updated daily. Where appropriate, some code has been added to view the supplied data from a different angle.

#### Disclaimer 
August 2020. To prevent inappropriate conclusions being drawn on this highly topical subject:
* I am a Data Science beginner; this Covid-19 analysis is intended purely for educational purposes.
* The output of the code cells in this notebook should be regarded as the *technical* result of the code and not be interpreted otherwise.
* I have no opinion on the dataset provided, nor am I responsible for possible misinterpretation of the outcome.

In [1]:
# The %... syntax is an IPython magic command and is not part of the Python language.
# In this case we're just telling the plotting library to draw things in
# the notebook, instead of in a separate window.
%matplotlib inline

# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().

import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options
import urllib.request
import pickle
import requests
import datetime

### Download latest COVID-19 data
Download the latest COVID-19 data from https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide.

In [2]:
def dataCsvURL():
    return 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'

In [3]:
def dataDirectory():
    return 'data/'

In [4]:
def dataCsvFile():
    return dataDirectory() + 'covidData.csv'

In [5]:
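# Let's give it a try and save to a CSV file
covidFile = requests.get(dataCsvURL())
# The with-statement closes the file even if the write fails
with open(dataCsvFile(), 'wb') as f:
    bytesWritten = f.write(covidFile.content)
bytesWritten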

2703798
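
The download cell above assumes that the `data/` directory already exists and that the request succeeds. The following is a minimal sketch of a more defensive variant, not the notebook's own method. It uses only the standard library (`urllib.request`, which the notebook already imports) instead of `requests`; the helper name `download_file` and the 30-second timeout are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path


def download_file(url, target, timeout=30):
    """Download `url` to `target` and return the number of bytes written.

    Hypothetical helper, not part of the original notebook.
    """
    target = Path(target)
    # Create the parent directory (e.g. 'data/') if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.URLError (or HTTPError) on failure, so an
    # error page never silently ends up in the CSV file.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        content = response.read()
    # write_bytes opens, writes and closes the file in one step.
    return target.write_bytes(content)
```

In the notebook this could be called as `download_file(dataCsvURL(), dataCsvFile())`.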

### Read and clean data
Now read the data from the CSV file into a dataframe using pandas. Keep only the relevant columns and give them shorter names.

In [6]:
def columnsToDisplay():
    return ["dateRep", "cases", "deaths", "countriesAndTerritories","popData2019"]

In [7]:
def columnsRenameDict():
    return {'dateRep':'date', 'countriesAndTerritories':'country', 'popData2019':'population' }

In [8]:
# Change the date format to "DD Month YYYY" (e.g. "30 August 2020").
# Convert population to the nullable integer type 'Int64'. Caution: astype(int) raises
# "Cannot convert non-finite values (NA or inf) to integer" when values are missing.
def preProcessing(df):
    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    df['date'] = df['date'].dt.strftime("%d %B %Y")
    df['population'] = df['population'].astype('Int64')
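
The caution in the comment above can be reproduced on a small scale. This is an illustrative sketch on made-up toy values, not the ECDC data: a plain `int` cast fails as soon as one value is missing, while the nullable `'Int64'` dtype keeps the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): the middle population value is missing.
population = pd.Series([1000.0, np.nan, 2500.0])

# A plain int cannot represent a missing value, so the cast raises.
try:
    population.astype(int)
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable 'Int64' dtype (capital I) keeps the missing entry as <NA>.
nullable = population.astype('Int64')

print(plain_int_failed)
print(nullable.dtype)
print(nullable.isna().sum())
```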

In [9]:
#df = pd.read_csv(dataCsvFile())
df = pd.read_csv(dataCsvFile(), usecols=columnsToDisplay())
df.rename(columns=columnsRenameDict(), inplace=True)
preProcessing(df)
df.head(20)


Unnamed: 0,date,cases,deaths,country,population
0,30 August 2020,3,0,Afghanistan,38041757
1,29 August 2020,11,1,Afghanistan,38041757
2,28 August 2020,3,0,Afghanistan,38041757
3,27 August 2020,55,4,Afghanistan,38041757
4,26 August 2020,1,0,Afghanistan,38041757
5,25 August 2020,71,10,Afghanistan,38041757
6,24 August 2020,0,0,Afghanistan,38041757
7,23 August 2020,105,2,Afghanistan,38041757
8,22 August 2020,38,0,Afghanistan,38041757
9,21 August 2020,97,2,Afghanistan,38041757


#### One country
The core of western civilization and a popular holiday destination:

In [10]:
# df[df.country == 'Greece'].head() is identical to:
df.query("country == 'Greece'").head()

Unnamed: 0,date,cases,deaths,country,population
14614,30 August 2020,177,1,Greece,10724599
14615,29 August 2020,269,5,Greece,10724599
14616,28 August 2020,251,6,Greece,10724599
14617,27 August 2020,293,5,Greece,10724599
14618,26 August 2020,168,1,Greece,10724599
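
The equivalence of the two selection styles mentioned in the cell above can be checked on a small example. This is a sketch on made-up toy data; the variable names are chosen for illustration only.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'country': ['Greece', 'Italy', 'Greece'],
    'cases': [10, 20, 30],
})

by_mask = toy[toy.country == 'Greece']        # boolean mask
by_query = toy.query("country == 'Greece'")   # query string

# Both selections return the same rows and keep the original index labels.
same = by_mask.equals(by_query)

# query() can also reference a Python variable with the @ prefix.
name = 'Greece'
by_variable = toy.query("country == @name")

print(same)
```

The `@name` form is convenient when the country comes from a variable rather than a literal.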


### Top 10 countries of reported cases (in numbers)
This is an aggregation over the dataframe. As a beginner in data science and pandas, I find it cumbersome to get the top 10 countries of reported cases. In my opinion, the **Pharo Smalltalk** solution below is much cleaner and easier to understand.

```
(df group: 'cases' by: 'country' aggregateUsing: #sum)
	sortDescending
	head: 10.
```

To find the 10 countries with the fewest reported cases, simply replace `head` with `tail`.

In [11]:
#df['country'].count()
#df['cases'].sum()
#df.sum()

# Select the column before aggregating, so only 'cases' is summed
casesPerCountry = df.groupby('country')[['cases']].sum()
sortedCasesPerCountry = casesPerCountry.sort_values(by=['cases'], ascending=False)
sortedCasesPerCountry.head(10)

country,cases
United_States_of_America,5961582
Brazil,3846153
India,3542733
Russia,985346
Peru,639435
South_Africa,622551
Colombia,599914
Mexico,591712
Spain,439286
Chile,408009
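
For comparison with the Smalltalk snippet, pandas also offers `nlargest` and `nsmallest`, which combine the sort and the `head` in one call. The sketch below uses made-up toy data and is only one possible way to write the aggregation.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C'],
    'cases': [5, 7, 1, 2, 20],
})

# Select the column before aggregating so only 'cases' is summed.
total_cases = toy.groupby('country')['cases'].sum()

top2 = total_cases.nlargest(2)      # highest totals, in descending order
bottom2 = total_cases.nsmallest(2)  # lowest totals, in ascending order

print(top2)
print(bottom2)
```

Note that `nsmallest` lists the lowest total first, whereas `tail` on a descending sort lists it last.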


### Top 10 countries of reported cases (relative to population)
Another approach is to collect the number of cases in a country relative to the population.

In [12]:
# Find the 10 countries with the highest number of cases relative to population.
# Population is constant per country, so the mean simply returns that value.
populationPerCountry = df.groupby('country')[['population']].mean()
populationPerCountry['cases (%)'] = casesPerCountry['cases'] / populationPerCountry['population'] * 100
populationPerCountry.sort_values(by=['cases (%)'], ascending=False).head(10)

country,population,cases (%)
Qatar,2832071,4.180933
Bahrain,1641164,3.131375
Chile,18952035,2.152851
Panama,4246440,2.150908
San_Marino,34453,2.060778
Kuwait,4207077,2.00196
Peru,32510462,1.966859
Aruba,106310,1.857774
Brazil,211049519,1.822394
United_States_of_America,329064917,1.811674
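
The relative figure is simply total cases divided by population, times 100. As a sketch on made-up toy data, the sum and the population lookup can also be done in a single `groupby` using named aggregation; taking the `'first'` population value per country is an assumption that relies on the population being constant within each country.

```python
import pandas as pd

# Toy data (made up for illustration): two daily rows per country.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'cases': [10, 30, 5, 5],
    'population': [1000, 1000, 200, 200],
})

# Named aggregation: one output column per (input column, function) pair.
per_country = toy.groupby('country').agg(
    cases=('cases', 'sum'),
    population=('population', 'first'),
)
per_country['cases (%)'] = per_country['cases'] / per_country['population'] * 100

ranked = per_country.sort_values(by='cases (%)', ascending=False)
print(ranked)
```

Here country B ranks first even though A has more cases in absolute numbers, which is the point of the relative view.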


### Top 10 countries of reported deaths (in numbers)
To find the 10 countries with the fewest reported Covid-19 deaths, simply replace `head` with `tail`.


In [13]:
# Note: casesPerCountry and sortedCasesPerCountry are reused here and now hold the death totals
casesPerCountry = df.groupby('country')[['deaths']].sum()
sortedCasesPerCountry = casesPerCountry.sort_values(by=['deaths'], ascending=False)
sortedCasesPerCountry.head(10)

country,deaths
United_States_of_America,182779
Brazil,120462
Mexico,63819
India,63498
United_Kingdom,41498
Italy,35473
France,30602
Spain,29011
Peru,28607
Iran,21359


### Top 10 countries of reported deaths (relative to population)
Another approach is to collect the number of deaths in a country relative to the population.

In [14]:
# Find the 10 countries with the highest number of deaths relative to population.
# casesPerCountry holds the death totals at this point (see the previous cell).
populationPerCountry = df.groupby('country')[['population']].mean()
populationPerCountry['deaths (%)'] = casesPerCountry['deaths'] / populationPerCountry['population'] * 100
populationPerCountry.sort_values(by=['deaths (%)'], ascending=False).head(10)

country,population,deaths (%)
San_Marino,34453,0.121905
Peru,32510462,0.087993
Belgium,11455519,0.086343
Andorra,76177,0.069575
United_Kingdom,66647112,0.062265
Spain,46937060,0.061808
Chile,18952035,0.058996
Italy,60359546,0.058769
Brazil,211049519,0.057078
Sweden,10230185,0.0569


### Covid-19 spread in The Netherlands

In [15]:
df_NL = df[df.country=="Netherlands"]
relevant_columns = ['date', 'cases', 'deaths']
covid_df_NL = df_NL[relevant_columns]
covid_df_NL

Unnamed: 0,date,cases,deaths
25707,30 August 2020,500,4
25708,29 August 2020,507,2
25709,28 August 2020,510,3
25710,27 August 2020,570,8
25711,26 August 2020,414,5
...,...,...,...
25946,04 January 2020,0,0
25947,03 January 2020,0,0
25948,02 January 2020,0,0
25949,01 January 2020,0,0


### Max daily cases in The Netherlands
Find the date on which the most cases were reported.

In [16]:
maxDailyCases = covid_df_NL['cases'].max()
print('Max daily cases NL ', maxDailyCases)

covid_df_NL[(covid_df_NL.cases==maxDailyCases)]


Max daily cases NL  1335


Unnamed: 0,date,cases,deaths
25848,11 April 2020,1335,115
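
An alternative to comparing against the maximum is `idxmax`, which returns the index label of the peak row directly. The sketch below uses made-up toy data to show how the two approaches differ when the maximum occurs more than once.

```python
import pandas as pd

# Toy data (made up for illustration): the maximum occurs twice.
toy = pd.DataFrame({
    'date': ['day 1', 'day 2', 'day 3', 'day 4'],
    'cases': [7, 12, 15, 15],
})

# idxmax returns the index label of the first maximum only.
first_peak = toy.loc[toy['cases'].idxmax()]

# The boolean mask keeps every row that ties for the maximum.
all_peaks = toy[toy['cases'] == toy['cases'].max()]

print(first_peak['date'])
print(len(all_peaks))
```

The mask used in the notebook is therefore the safer choice when ties are possible.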


### Max daily deaths in The Netherlands
Find the date on which the most deaths were reported.

In [17]:
maxDailyDeaths = covid_df_NL['deaths'].max()
print('Max daily deaths NL ', maxDailyDeaths)
covid_df_NL[(covid_df_NL.deaths==maxDailyDeaths)]

Max daily deaths NL  234


Unnamed: 0,date,cases,deaths
25851,08 April 2020,777,234


### Cumulative sum of cases and deaths
Compute the running totals of reported cases and deaths in the Netherlands up to each date.

In [18]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
covid_df_NL = covid_df_NL.copy()
# Rows are ordered newest first, so reverse before accumulating and reverse back afterwards
covid_df_NL['cumulativeCases'] = covid_df_NL.loc[::-1, 'cases'].cumsum(axis=0)[::-1]
covid_df_NL['cumulativeDeaths'] = covid_df_NL.loc[::-1, 'deaths'].cumsum(axis=0)[::-1]
covid_df_NL


Unnamed: 0,date,cases,deaths,cumulativeCases,cumulativeDeaths
25707,30 August 2020,500,4,69563,6215
25708,29 August 2020,507,2,69063,6211
25709,28 August 2020,510,3,68556,6209
25710,27 August 2020,570,8,68046,6206
25711,26 August 2020,414,5,67476,6198
...,...,...,...,...,...
25946,04 January 2020,0,0,0,0
25947,03 January 2020,0,0,0,0
25948,02 January 2020,0,0,0,0
25949,01 January 2020,0,0,0,0
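
Because the rows are ordered newest first, the cumulative sum has to run from the bottom of the table upwards. The sketch below, on made-up toy data, compares the reverse-and-accumulate approach with sorting by date first; it assumes the date column is kept as a real datetime so that it sorts chronologically.

```python
import pandas as pd

# Toy data (made up for illustration), newest row first like the ECDC file.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-03', '2000-01-02', '2000-01-01']),
    'cases': [5, 3, 1],
})

# Reverse the rows, accumulate, and let the assignment realign on the index.
toy['reversed_cumsum'] = toy['cases'].iloc[::-1].cumsum()

# Alternative: sort chronologically first, then accumulate.
toy['sorted_cumsum'] = toy.sort_values('date')['cases'].cumsum()

print(toy)
```

Both columns hold the same running totals, since pandas aligns the assigned series on the index labels rather than on row position.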


### Initial growth in cases
How long did it take to spread from 1 to 100 cases? And, next, how many days from 100 to 1000 cases?

In [19]:
firstCase = pd.to_datetime(covid_df_NL[(covid_df_NL.cumulativeCases > 0)].date).min()
plus100Cases = pd.to_datetime(covid_df_NL[(covid_df_NL.cumulativeCases >= 100)].date).min()
plus1000Cases = pd.to_datetime(covid_df_NL[(covid_df_NL.cumulativeCases >= 1000)].date).min()
print ("From 1st case to 100 or above in ", plus100Cases - firstCase)
print ("From 100 or above to 1000 or above in ", plus1000Cases - plus100Cases)

From 1st case to 100 or above in  8 days 00:00:00
From 100 or above to 1000 or above in  9 days 00:00:00
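
The threshold lookup above repeats the same pattern three times. As a sketch on made-up toy data, it can be wrapped in a small helper; the name `first_date_reaching` is hypothetical, and the example assumes the date column is stored as datetime rather than as formatted text.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                            '2000-01-04', '2000-01-05']),
    'cumulativeCases': [0, 1, 40, 120, 1500],
})


def first_date_reaching(frame, threshold):
    """Earliest date on which cumulativeCases is at least `threshold`."""
    return frame.loc[frame['cumulativeCases'] >= threshold, 'date'].min()


first_case = first_date_reaching(toy, 1)
reached_100 = first_date_reaching(toy, 100)
reached_1000 = first_date_reaching(toy, 1000)

# Subtracting two timestamps gives a Timedelta; .days is the whole-day count.
days_to_100 = (reached_100 - first_case).days
days_to_1000 = (reached_1000 - reached_100).days

print(days_to_100, days_to_1000)
```

Using `.days` prints a plain integer instead of the `8 days 00:00:00` style shown above.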
