# Covid-19 analysis
This Python 3 notebook is based on the original [Pharo Smalltalk version](https://github.com/olekscode/CovidAnalysis) by Oleksandr Zaitsev.


In [40]:
# The %... syntax is an IPython magic command and is not part of the Python language.
# Here it tells the plotting library to draw inside the notebook
# instead of in a separate window.
%matplotlib inline

# The "as ..." constructs just alias the package names.
# That way we can call plt.plot() instead of matplotlib.pyplot.plot().

import numpy as np  # fast numerical programming library
import scipy as sp  # scientific computing library (statistics live in scipy.stats)
import matplotlib as mpl  # the top-level matplotlib package
import matplotlib.cm as cm  # easy access to colormaps
import matplotlib.pyplot as plt  # the plotting interface, aliased as plt
import pandas as pd  # lets us handle data as dataframes
# Set up the pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns  # styles and more plotting options
import urllib.request
import pickle
import requests
import datetime

### Download latest COVID-19 data
Download the latest COVID-19 data from https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide.

In [3]:
def dataCsvURL():
    return 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'

In [5]:
def dataDirectory():
    return 'data/'

In [17]:
def dataCsvFile():
    return dataDirectory() + 'covidData.csv'

In [18]:
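# Download the data and save it to a CSV file
covidFile = requests.get(dataCsvURL())
covidFile.raise_for_status()  # stop here if the download failed
with open(dataCsvFile(), 'wb') as f:  # the with-block closes the file after writing
    bytesWritten = f.write(covidFile.content)
bytesWritten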

2644793
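
The download cell above writes into `data/`, which has to exist before `open()` is called. A minimal sketch of one way to guard against a missing directory follows; `saveDownload` is a hypothetical helper that is not part of the original notebook, and the bytes written here are made up so the example runs offline.

```python
import os
import tempfile


def saveDownload(content, path):
    """Write raw bytes to path, creating the parent directory if needed.

    Hypothetical helper for illustration; not part of the original notebook.
    Returns the number of bytes written.
    """
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, 'wb') as f:
        return f.write(content)


# Offline demonstration: made-up bytes stand in for covidFile.content.
demoContent = b'dateRep,cases\n26/08/2020,1\n'
demoDir = tempfile.mkdtemp()
demoPath = os.path.join(demoDir, 'data', 'covidData.csv')
bytesWritten = saveDownload(demoContent, demoPath)
```

In the notebook this would presumably be called as `saveDownload(covidFile.content, dataCsvFile())`.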

### Read and clean data
Now read the data from the CSV file into a dataframe using pandas. Keep only the relevant columns and rename them.

In [74]:
def columnsToDisplay():
    return ["dateRep", "cases", "deaths", "countriesAndTerritories","popData2019"]

In [75]:
def columnsRenameDict():
    return {'dateRep':'date', 'countriesAndTerritories':'country', 'popData2019':'population' }

In [120]:
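# The date conversion (to e.g. "26 August 2020") is left disabled, so dates stay as DD/MM/YYYY strings.
# If it is enabled, dayfirst=True is needed; otherwise pandas may read 03/01/2020 as 1 March 2020.
# Convert population to integer. Caution: astype(int) gives the error
# "Cannot convert non-finite values (NA or inf) to integer"; the nullable 'Int64' dtype accepts missing values.
# Note: this function modifies df in place.
def preProcessing(df):
#    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
#    df['date'] = df['date'].dt.strftime("%d %B %Y")
    df['population'] = df['population'].astype('Int64')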

In [121]:
#df = pd.read_csv(dataCsvFile())
df = pd.read_csv(dataCsvFile(), usecols=columnsToDisplay())
df.rename(columns=columnsRenameDict(), inplace=True)
preProcessing(df)
df.head()


| | date | cases | deaths | country | population |
|---|---|---|---|---|---|
| 0 | 26/08/2020 | 1 | 0 | Afghanistan | 38041757 |
| 1 | 25/08/2020 | 71 | 10 | Afghanistan | 38041757 |
| 2 | 24/08/2020 | 0 | 0 | Afghanistan | 38041757 |
| 3 | 23/08/2020 | 105 | 2 | Afghanistan | 38041757 |
| 4 | 22/08/2020 | 38 | 0 | Afghanistan | 38041757 |
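
The two cleaning steps discussed above (day-first dates and a population column with gaps) can be tried on a tiny sample. This is only a sketch: the rows are made up, and passing an explicit `format` is one possible way to avoid day/month confusion, not necessarily what the notebook intends.

```python
import io

import pandas as pd

# Made-up sample in the same layout as the ECDC file;
# the missing population value in the second row is deliberate.
sample = io.StringIO(
    "dateRep,cases,popData2019\n"
    "03/01/2020,5,100\n"
    "02/01/2020,3,\n"
)
demo = pd.read_csv(sample)

# An explicit format removes any day/month ambiguity: 03/01/2020 is 3 January.
demo['dateRep'] = pd.to_datetime(demo['dateRep'], format='%d/%m/%Y')

# The missing value makes pandas read the column as float64.
# 'Int64' (capital I) is the nullable integer dtype, which keeps the gap as <NA>.
demo['popData2019'] = demo['popData2019'].astype('Int64')
```

With real datetime values in the column, sorting and date arithmetic work directly, whereas the DD/MM/YYYY strings sort alphabetically.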


### Top 10 countries of reported cases
This is an aggregate operation on the dataframe. As a beginner in data science and pandas, I find it cumbersome to get the top 10 countries by reported cases. In my opinion, the **Pharo Smalltalk** solution below is much cleaner and easier to understand.

```
(df group: 'cases' by: 'country' aggregateUsing: #sum)
	sortDescending
	head: 10.
```

To find the 10 countries with the fewest reported cases, simply replace `head` with `tail`.

In [122]:
#df['country'].count()
#df['cases'].sum()
#df.sum()
# Select the 'cases' column before summing, so only that column is aggregated
casesPerCountry = df.groupby('country')[['cases']].sum()
sortedCasesPerCountry = casesPerCountry.sort_values(by=['cases'], ascending=False)
sortedCasesPerCountry.head(10)

| country | cases |
|---|---|
| United_States_of_America | 5779028 |
| Brazil | 3669995 |
| India | 3234474 |
| Russia | 966189 |
| South_Africa | 613017 |
| Peru | 607382 |
| Mexico | 568621 |
| Colombia | 562128 |
| Spain | 412553 |
| Chile | 400985 |
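
For comparison with the Smalltalk snippet, the same group, sum, sort, and head steps can be written as a single pandas method chain. The sketch below uses a small made-up dataframe (`demo`) so that it runs on its own; the column names match the renamed columns used in this notebook.

```python
import pandas as pd

# Made-up numbers, only to make the example self-contained.
demo = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C'],
    'cases':   [10, 5, 30, 1, 2],
})

topCases = (
    demo.groupby('country')['cases']     # group rows by country, keep only 'cases'
        .sum()                           # total cases per country
        .sort_values(ascending=False)    # largest totals first
        .head(2)                         # top 2 here; the notebook uses 10
)
```

`Series.nlargest(n)` can replace the `sort_values(...).head(n)` pair if a shorter chain is preferred.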


### Top 10 countries of reported deaths
To find the 10 countries with the fewest reported Covid-19 deaths, simply replace `head` with `tail`.


In [123]:
# Select the 'deaths' column before summing, so only that column is aggregated
deathsPerCountry = df.groupby('country')[['deaths']].sum()
sortedDeathsPerCountry = deathsPerCountry.sort_values(by=['deaths'], ascending=False)
sortedDeathsPerCountry.head(10)

| country | deaths |
|---|---|
| United_States_of_America | 178486 |
| Brazil | 116580 |
| Mexico | 61450 |
| India | 59449 |
| United_Kingdom | 41449 |
| Italy | 35445 |
| France | 30544 |
| Spain | 28924 |
| Peru | 28001 |
| Iran | 20776 |
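
The cases and deaths rankings differ only in the column being summed, so the shared logic could live in one function. The following is a minimal sketch; `topCountriesBy` is a hypothetical helper that does not appear in the original notebook, and the data is made up.

```python
import pandas as pd


def topCountriesBy(df, column, n=10, largest=True):
    """Sum `column` per country and return the n largest (or smallest) totals.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    totals = df.groupby('country')[column].sum()
    return totals.nlargest(n) if largest else totals.nsmallest(n)


# Made-up numbers, only to make the example self-contained.
demo = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C'],
    'cases':   [10, 5, 30, 1, 2],
    'deaths':  [1, 0, 4, 0, 0],
})

mostDeaths = topCountriesBy(demo, 'deaths', n=2)
fewestCases = topCountriesBy(demo, 'cases', n=1, largest=False)
```

One difference from the `head`/`tail` approach: `nsmallest` returns the smallest totals in ascending order, while `tail` on a descending sort lists them in descending order.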


### Covid-19 spread in The Netherlands

In [125]:
df_NL = df[df.country=="Netherlands"]
relevant_columns = ['date', 'cases', 'deaths']
covid_df_NL = df_NL[relevant_columns]
covid_df_NL

| | date | cases | deaths |
|---|---|---|---|
| 25163 | 26/08/2020 | 414 | 5 |
| 25164 | 25/08/2020 | 572 | 2 |
| 25165 | 24/08/2020 | 456 | 0 |
| 25166 | 23/08/2020 | 508 | 5 |
| 25167 | 22/08/2020 | 534 | 4 |
| ... | ... | ... | ... |
| 25398 | 04/01/2020 | 0 | 0 |
| 25399 | 03/01/2020 | 0 | 0 |
| 25400 | 02/01/2020 | 0 | 0 |
| 25401 | 01/01/2020 | 0 | 0 |
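
The rows above are listed newest first. To follow the spread over time, it may help to parse the dates and sort chronologically. The sketch below assumes a hypothetical helper `countryTimeline` (not in the original notebook) and made-up rows; the running total at the end is an optional extra.

```python
import pandas as pd


def countryTimeline(df, country):
    """Return date, cases and deaths for one country, oldest row first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rows = df[df['country'] == country].copy()  # copy so the caller's df is untouched
    rows['date'] = pd.to_datetime(rows['date'], format='%d/%m/%Y')
    rows = rows.sort_values('date').reset_index(drop=True)
    return rows[['date', 'cases', 'deaths']]


# Made-up numbers, only to make the example self-contained.
demo = pd.DataFrame({
    'country': ['Netherlands', 'Netherlands', 'Netherlands', 'Belgium'],
    'date':    ['03/01/2020', '02/01/2020', '01/01/2020', '03/01/2020'],
    'cases':   [4, 1, 0, 9],
    'deaths':  [1, 0, 0, 2],
})

timeline = countryTimeline(demo, 'Netherlands')
totalCases = timeline['cases'].cumsum()  # running total of reported cases
```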


### Max daily cases in The Netherlands
Find the date on which the most cases were reported.

In [133]:
maxDailyCases = covid_df_NL['cases'].max()
print('Max daily cases NL ', maxDailyCases)

covid_df_NL[(covid_df_NL.cases==maxDailyCases)]


Max daily cases NL  1335


| | date | cases | deaths |
|---|---|---|---|
| 25300 | 11/04/2020 | 1335 | 115 |
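
An alternative to computing the maximum and then filtering on it is `idxmax()`, which returns the index label of the row holding the largest value. The sketch below uses made-up numbers and an arbitrary index to show that the label, not the position, is returned.

```python
import pandas as pd

# Made-up numbers and index labels, only to make the example self-contained.
demo = pd.DataFrame(
    {
        'date':   ['10/04/2020', '11/04/2020', '12/04/2020'],
        'cases':  [10, 30, 20],
        'deaths': [3, 2, 1],
    },
    index=[101, 102, 103],
)

peakLabel = demo['cases'].idxmax()   # index label of the row with the most cases
peakRow = demo.loc[peakLabel]        # the full row for that day
peakDeathsDate = demo.loc[demo['deaths'].idxmax(), 'date']
```

If several days share the maximum, `idxmax()` returns only the first one, whereas the boolean filter used in the notebook returns all of them.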


### Max daily deaths in The Netherlands
Find the date on which the most deaths were reported.

In [134]:
maxDailyDeaths = covid_df_NL['deaths'].max()
print('Max daily deaths NL ', maxDailyDeaths)
covid_df_NL[(covid_df_NL.deaths==maxDailyDeaths)]

Max daily deaths NL  234


| | date | cases | deaths |
|---|---|---|---|
| 25303 | 08/04/2020 | 777 | 234 |
