### **Introduction to Programming and Numerical Analysis: Project 1**

# Visualization of the Coronavirus Outbreak over Time

The whole world is currently going through an extraordinary time. Data analysis, however, can be a basic tool in the fight against COVID-19. If we can compare infection, recovery and death rates across countries and over time, we might be able to find out which policies are most effective and which are not. So far this analysis is purely descriptive and supports no causal statements, but early indications of how the virus spreads and how the fight against it is going are very useful, especially when time is short.

In [1]:
# We import all the necessary packages at the beginning of our code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas                     ## install with: conda install geopandas
import descartes                     ## pip install descartes
import datetime
from ipywidgets import interact
from time import gmtime, strftime
import mapclassify                   ## pip install mapclassify
import ipywidgets as widgets

## **Import, merge and cleaning of data set**

### Coronavirus Data

We first import and merge all data on the number of corona cases, recoveries and deaths, which originally come as three separate data sets. The data is provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and is freely available in their GitHub repository: https://github.com/CSSEGISandData/COVID-19. All the time series are updated daily. 

In [2]:
# We import the number of corona cases and have a first look at the data:

url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
df_confirmed = pd.read_csv(url, on_bad_lines='skip')  # skip malformed rows; replaces error_bad_lines=False (pandas >= 1.3)

df_confirmed.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,3224,3392,3563,3778,4033,4402,4687,4963,5226,5639
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,820,832,842,850,856,868,872,876,880,898
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,4838,4997,5182,5369,5558,5723,5891,6067,6253,6442
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,751,751,752,752,754,755,755,758,760,761
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,36,36,36,43,43,45,45,45,45,48
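
In pandas 1.3 and later, malformed rows are controlled by the `on_bad_lines` argument of `read_csv`; the older `error_bad_lines` flag was deprecated and later removed. The following is a minimal sketch of what skipping bad lines means, using a small in-memory CSV (the data and the names `raw` and `df_toy` are made up for illustration) instead of the JHU file:

```python
import io
import pandas as pd

# Toy CSV: the row for "B" has one field too many and cannot be parsed.
raw = "Country,Lat,Long\nA,1.0,2.0\nB,3.0,4.0,EXTRA\nC,5.0,6.0\n"

# on_bad_lines="skip" drops rows with too many fields instead of raising an error.
df_toy = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df_toy)
```

Silently skipping rows can hide data problems, so it may be worth comparing the number of rows read with the number expected.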


In [3]:
# Instead of having each country as one observation (and each day as variable), 
# we want that each country/day combination corresponds to one observation:

confirmed = df_confirmed.melt(id_vars=["Country/Region", "Lat", "Long", "Province/State"], 
        var_name=str("Date"), 
        value_name="Confirmed")

# We also recode the date as a pandas-type day. Furthermore, we check for which period data exists i.e. if it has been updated.

confirmed['Date'] = pd.to_datetime(confirmed['Date'])
start_date =confirmed["Date"].min()
end_date = confirmed["Date"].max()
print('Start of DataFrame: ' + str(start_date))
print('End of DataFrame: ' + str(end_date))

# As a last step, we add a variable that counts the days from the first item.

confirmed['Days'] = (confirmed['Date'] - start_date).dt.days

# Now, our data frame looks like:

confirmed.sort_values(['Country/Region', 'Date'])

Start of DataFrame: 2020-01-22 00:00:00
End of DataFrame: 2020-05-14 00:00:00


Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Confirmed,Days
0,Afghanistan,33.0,65.0,,2020-01-22,0,0
266,Afghanistan,33.0,65.0,,2020-01-23,0,1
532,Afghanistan,33.0,65.0,,2020-01-24,0,2
798,Afghanistan,33.0,65.0,,2020-01-25,0,3
1064,Afghanistan,33.0,65.0,,2020-01-26,0,4
...,...,...,...,...,...,...,...
29224,Zimbabwe,-20.0,30.0,,2020-05-10,36,109
29490,Zimbabwe,-20.0,30.0,,2020-05-11,36,110
29756,Zimbabwe,-20.0,30.0,,2020-05-12,36,111
30022,Zimbabwe,-20.0,30.0,,2020-05-13,37,112
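
The reshaping step can be seen more easily on a toy table. This is only a sketch with made-up numbers (the frame `wide` and its values are hypothetical), showing how `melt` turns one column per day into one row per region and day, and how the day counter is derived:

```python
import pandas as pd

# Toy wide table shaped like the JHU files: one row per region, one column per day.
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# Wide -> long: one row per (region, day) combination.
long = wide.melt(id_vars=["Country/Region"], var_name="Date", value_name="Confirmed")

# Parse the month/day/year strings explicitly and count days from the first date.
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
long["Days"] = (long["Date"] - long["Date"].min()).dt.days

print(long.sort_values(["Country/Region", "Date"]))
```

Passing an explicit `format` is a design choice here: it avoids any ambiguity about whether `1/2/20` means January 2 or February 1.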


In [4]:
# We import the number of deaths due to corona. As for the number of cases, we also adjust the data structure:

url2 = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_death = pd.read_csv(url2, on_bad_lines='skip')  # skip malformed rows (pandas >= 1.3)
death = df_death.melt(id_vars=["Country/Region", "Lat", "Long", "Province/State"], 
        var_name=str("Date"), 
        value_name="Death")
death.head(-5)

Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Death
0,Afghanistan,33.000000,65.000000,,1/22/20,0
1,Albania,41.153300,20.168300,,1/22/20,0
2,Algeria,28.033900,1.659600,,1/22/20,0
3,Andorra,42.506300,1.521800,,1/22/20,0
4,Angola,-11.202700,17.873900,,1/22/20,0
...,...,...,...,...,...,...
30314,Malawi,-13.254308,34.301525,,5/14/20,3
30315,United Kingdom,-51.796300,-59.523600,Falkland Islands (Malvinas),5/14/20,0
30316,France,46.885200,-56.315900,Saint Pierre and Miquelon,5/14/20,0
30317,South Sudan,6.877000,31.307000,,5/14/20,0


In [5]:
# We import the number of recoveries from corona. As for the number of cases and deaths, we also adjust the data structure:

url3 = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
df_recovered = pd.read_csv(url3, on_bad_lines='skip')  # skip malformed rows (pandas >= 1.3)
recovered = df_recovered.melt(id_vars=["Country/Region", "Lat", "Long", "Province/State"], 
        var_name=str("Date"), 
        value_name="Recovered")
recovered.sort_values(['Country/Region', 'Date'])

Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Recovered
0,Afghanistan,33.0,65.0,,1/22/20,0
253,Afghanistan,33.0,65.0,,1/23/20,0
506,Afghanistan,33.0,65.0,,1/24/20,0
759,Afghanistan,33.0,65.0,,1/25/20,0
1012,Afghanistan,33.0,65.0,,1/26/20,0
...,...,...,...,...,...,...
26542,Zimbabwe,-20.0,30.0,,5/5/20,5
26795,Zimbabwe,-20.0,30.0,,5/6/20,5
27048,Zimbabwe,-20.0,30.0,,5/7/20,5
27301,Zimbabwe,-20.0,30.0,,5/8/20,9


In [6]:
# After having imported and adjusted all three individual data sets, we are now able to merge them together:

# first we replace NaN in Province/State which would mess up the merge
confirmed['Province/State'] = confirmed['Province/State'].fillna('Country')
death['Province/State'] = death['Province/State'].fillna('Country')
recovered['Province/State'] = recovered['Province/State'].fillna('Country')

# The Date columns need the same datetime dtype as in confirmed for the merge to work
death['Date'] = pd.to_datetime(death['Date'])
recovered['Date'] = pd.to_datetime(recovered['Date'])

# Merge
corona = confirmed.merge(death, on= ["Country/Region", "Lat", "Long", "Province/State","Date"],how='outer')
corona = corona.merge(recovered, on= ["Country/Region", "Lat", "Long", "Province/State","Date"],how='outer')
corona.sort_values(['Country/Region', 'Date'])

Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Confirmed,Days,Death,Recovered
0,Afghanistan,33.0,65.0,Country,2020-01-22,0.0,0.0,0.0,0.0
266,Afghanistan,33.0,65.0,Country,2020-01-23,0.0,1.0,0.0,0.0
532,Afghanistan,33.0,65.0,Country,2020-01-24,0.0,2.0,0.0,0.0
798,Afghanistan,33.0,65.0,Country,2020-01-25,0.0,3.0,0.0,0.0
1064,Afghanistan,33.0,65.0,Country,2020-01-26,0.0,4.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
29224,Zimbabwe,-20.0,30.0,Country,2020-05-10,36.0,109.0,4.0,9.0
29490,Zimbabwe,-20.0,30.0,Country,2020-05-11,36.0,110.0,4.0,9.0
29756,Zimbabwe,-20.0,30.0,Country,2020-05-12,36.0,111.0,4.0,9.0
30022,Zimbabwe,-20.0,30.0,Country,2020-05-13,37.0,112.0,4.0,12.0
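
An outer merge keeps rows that exist in only one of the inputs, which is where missing values after the merge come from. One way to inspect this, sketched here on toy frames (all names and numbers below are hypothetical), is the `indicator=True` option of `merge`, which adds a `_merge` column telling where each row came from:

```python
import pandas as pd

confirmed_toy = pd.DataFrame({
    "Country/Region": ["A", "A", "B"],
    "Province/State": ["Country", "Country", "Country"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-22"]),
    "Confirmed": [1, 2, 5],
})
recovered_toy = pd.DataFrame({
    "Country/Region": ["A", "A", "C"],
    "Province/State": ["Country", "Country", "Country"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-22"]),
    "Recovered": [0, 1, 3],
})

keys = ["Country/Region", "Province/State", "Date"]

# indicator=True marks each row as 'both', 'left_only' or 'right_only'.
merged = confirmed_toy.merge(recovered_toy, on=keys, how="outer", indicator=True)

# Rows that found no partner in the other frame carry NaN in the missing columns.
unmatched = merged[merged["_merge"] != "both"]
print(merged["_merge"].value_counts())
print(unmatched)
```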


In [7]:
# We observe that Recovered has some NaN values and is stored as a float instead of an integer. 
# We change that with the following commands:

corona.info()
# fill the Recovered cases with the previous daily case number of the Country/Region (forward filling)
corona['Recovered'] = corona.sort_values(['Country/Region', 'Date']).groupby('Country/Region', as_index=False).Recovered.ffill()
# fill the NaN Recovered cases with zero
corona['Recovered'] = corona.Recovered.fillna(0)
corona = corona.astype({"Recovered": int})

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30780 entries, 0 to 30779
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Country/Region  30780 non-null  object        
 1   Lat             30780 non-null  float64       
 2   Long            30780 non-null  float64       
 3   Province/State  30780 non-null  object        
 4   Date            30780 non-null  datetime64[ns]
 5   Confirmed       30324 non-null  float64       
 6   Days            30324 non-null  float64       
 7   Death           30324 non-null  float64       
 8   Recovered       28842 non-null  float64       
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 2.3+ MB


In [8]:
corona.sort_values(['Country/Region', 'Date'])

Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Confirmed,Days,Death,Recovered
0,Afghanistan,33.0,65.0,Country,2020-01-22,0.0,0.0,0.0,0
266,Afghanistan,33.0,65.0,Country,2020-01-23,0.0,1.0,0.0,0
532,Afghanistan,33.0,65.0,Country,2020-01-24,0.0,2.0,0.0,0
798,Afghanistan,33.0,65.0,Country,2020-01-25,0.0,3.0,0.0,0
1064,Afghanistan,33.0,65.0,Country,2020-01-26,0.0,4.0,0.0,0
...,...,...,...,...,...,...,...,...,...
29224,Zimbabwe,-20.0,30.0,Country,2020-05-10,36.0,109.0,4.0,9
29490,Zimbabwe,-20.0,30.0,Country,2020-05-11,36.0,110.0,4.0,9
29756,Zimbabwe,-20.0,30.0,Country,2020-05-12,36.0,111.0,4.0,9
30022,Zimbabwe,-20.0,30.0,Country,2020-05-13,37.0,112.0,4.0,12
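
Forward filling within groups can be illustrated on a small example. The sketch below uses made-up data and, as an assumption that differs from the cell above, groups by both country and province so that values are never carried over from one province to another:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Province/State": ["Country"] * 6,
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "Recovered": [np.nan, 4, np.nan, 7, np.nan, np.nan],
})

# Sort so that "previous value" means the previous day within the same region.
toy = toy.sort_values(["Country/Region", "Province/State", "Date"])

# Carry the last known value forward, separately for each region.
toy["Recovered"] = toy.groupby(["Country/Region", "Province/State"])["Recovered"].ffill()

# Anything still missing has no earlier observation; treat it as zero and store integers.
toy["Recovered"] = toy["Recovered"].fillna(0).astype(int)
print(toy)
```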



We import a second data set which contains information about total population. This allows us to present data not only in absolute but also in relative terms. More interestingly, it contains geographic information, which allows us to plot a world map.

### Geographic World Data

In [9]:
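# We import the second data set: 
# (note: geopandas.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this call needs an older version)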
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world.head()

Unnamed: 0,pop_est,continent,name,iso_a3,gdp_md_est,geometry
0,920938,Oceania,Fiji,FJI,8374.0,"MULTIPOLYGON (((180.00000 -16.06713, 180.00000..."
1,53950935,Africa,Tanzania,TZA,150600.0,"POLYGON ((33.90371 -0.95000, 34.07262 -1.05982..."
2,603253,Africa,W. Sahara,ESH,906.5,"POLYGON ((-8.66559 27.65643, -8.66512 27.58948..."
3,35623680,North America,Canada,CAN,1674000.0,"MULTIPOLYGON (((-122.84000 49.00000, -122.9742..."
4,326625791,North America,United States of America,USA,18560000.0,"MULTIPOLYGON (((-122.84000 49.00000, -120.0000..."


In [10]:
world.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   pop_est     177 non-null    int64   
 1   continent   177 non-null    object  
 2   name        177 non-null    object  
 3   iso_a3      177 non-null    object  
 4   gdp_md_est  177 non-null    float64 
 5   geometry    177 non-null    geometry
dtypes: float64(1), geometry(1), int64(1), object(3)
memory usage: 8.4+ KB


In [11]:
# We clean the data of unnecessary observations (we keep only countries with a positive population and exclude Antarctica) 
 
world = world[(world.pop_est>0) & (world.name!="Antarctica")]

# as well as unnecessary variables. drop returns a new frame, so we assign the result back.

world = world.drop(['continent','iso_a3', 'gdp_md_est'], axis=1)
world

Unnamed: 0,pop_est,name,geometry
0,920938,Fiji,"MULTIPOLYGON (((180.00000 -16.06713, 180.00000..."
1,53950935,Tanzania,"POLYGON ((33.90371 -0.95000, 34.07262 -1.05982..."
2,603253,W. Sahara,"POLYGON ((-8.66559 27.65643, -8.66512 27.58948..."
3,35623680,Canada,"MULTIPOLYGON (((-122.84000 49.00000, -122.9742..."
4,326625791,United States of America,"MULTIPOLYGON (((-122.84000 49.00000, -120.0000..."
...,...,...,...
172,7111024,Serbia,"POLYGON ((18.82982 45.90887, 18.82984 45.90888..."
173,642550,Montenegro,"POLYGON ((20.07070 42.58863, 19.80161 42.50009..."
174,1895250,Kosovo,"POLYGON ((20.59025 41.85541, 20.52295 42.21787..."
175,1218208,Trinidad and Tobago,"POLYGON ((-61.68000 10.76000, -61.10500 10.890..."
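
A detail worth keeping in mind with this kind of cleaning is that boolean filtering and `drop` both return new objects rather than modifying the frame in place. A minimal sketch with a plain pandas frame (the toy data is hypothetical and stands in for the GeoDataFrame):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["X", "Antarctica", "Y"],
    "pop_est": [100, 4000, -99],
    "continent": ["c1", "c2", "c3"],
})

# Keep rows with a positive population that are not Antarctica.
filtered = toy[(toy["pop_est"] > 0) & (toy["name"] != "Antarctica")]

# drop returns a new frame; `filtered` itself still has the column.
dropped = filtered.drop(columns=["continent"])

print(list(filtered.columns))
print(list(dropped.columns))
```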


### Compare Data Sets

In [12]:
# As the corona data frame does not contain the iso_a3 country code, we have to adjust the names in the two different data frames manually. 
# We first look at which names actually differ, meaning which are contained in one data set but not in the other:

diff_name = [name for name in world.name.unique() if name not in corona['Country/Region'].unique()] 
print(f'names in world data, but not in corona data: {diff_name}')

names in world data, but not in corona data: ['W. Sahara', 'United States of America', 'Dem. Rep. Congo', 'Dominican Rep.', 'Falkland Is.', 'Greenland', 'Fr. S. Antarctic Lands', 'Puerto Rico', "Côte d'Ivoire", 'Central African Rep.', 'Congo', 'Eq. Guinea', 'eSwatini', 'Palestine', 'Vanuatu', 'Myanmar', 'North Korea', 'South Korea', 'Turkmenistan', 'New Caledonia', 'Solomon Is.', 'Taiwan', 'N. Cyprus', 'Somaliland', 'Bosnia and Herz.', 'Macedonia', 'S. Sudan']


In [13]:
diff_country = [c for c in corona['Country/Region'].unique() if c not in world.name.unique()] 
print(f'names in corona data, but not in world data: {diff_country}')

names in corona data, but not in world data: ['Andorra', 'Antigua and Barbuda', 'Bahrain', 'Barbados', 'Bosnia and Herzegovina', 'Cabo Verde', 'Central African Republic', 'Congo (Brazzaville)', 'Congo (Kinshasa)', "Cote d'Ivoire", 'Diamond Princess', 'Dominican Republic', 'Equatorial Guinea', 'Eswatini', 'Holy See', 'Korea, South', 'Liechtenstein', 'Maldives', 'Malta', 'Mauritius', 'Monaco', 'North Macedonia', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'San Marino', 'Seychelles', 'Singapore', 'Taiwan*', 'US', 'Dominica', 'Grenada', 'West Bank and Gaza', 'Saint Kitts and Nevis', 'Burma', 'MS Zaandam', 'South Sudan', 'Western Sahara', 'Sao Tome and Principe', 'Comoros']


In [14]:
# We match the name of the countries in both dataframes now manually. Due to time restrictions, we only match the most interesting countries 
# considering the size and coronavirus cases. 

world.replace('United States of America', 'US', inplace=True)
corona.replace('Korea, South', 'South Korea',inplace=True)
corona.replace('Taiwan*','Taiwan',inplace=True)
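
More names could be matched with a single mapping dictionary. The sketch below is one possible approach, not the method used in this notebook: the dictionary `jhu_to_world` is hypothetical, its entries pair names that appear in the two printed difference lists, and the toy inputs stand in for the real columns:

```python
import pandas as pd

# Hypothetical mapping: name in the corona data -> name in the world data.
jhu_to_world = {
    "Korea, South": "South Korea",
    "Taiwan*": "Taiwan",
    "Burma": "Myanmar",
    "Congo (Kinshasa)": "Dem. Rep. Congo",
    "Congo (Brazzaville)": "Congo",
    "Cote d'Ivoire": "Côte d'Ivoire",
    "Dominican Republic": "Dominican Rep.",
    "Bosnia and Herzegovina": "Bosnia and Herz.",
}

# Toy stand-ins for corona['Country/Region'] and world['name'].
jhu_names = pd.Series(["Korea, South", "Burma", "Denmark", "Diamond Princess"])
world_names = {"South Korea", "Myanmar", "Denmark"}

# Replace every name found in the mapping; other names stay unchanged.
harmonised = jhu_names.replace(jhu_to_world)

# Whatever is left has no counterpart in the world data (here: a cruise ship).
still_unmatched = sorted(set(harmonised) - world_names)
print(still_unmatched)
```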

## **Output: Table and several graphs**

### Table

We are now able to code tables and graphs. The first output is a table where one can select the country as well as whether the data should be shown in absolute or relative terms:

In [15]:
def table_cases(Country='x', freq='absolute'):
    """ prints a table with the coronavirus cases
    
    Args:
        Country: name of the country
        freq: absolute or relative number of cases
        
    """ 
    # if the cases should be reported in absolute numbers 
    if (freq == 'absolute'):
        # we are only interested in the cases, we leave-out all other columns
        df_country=corona.loc[corona['Country/Region']==Country, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        df_country['Confirmed % increase'] = (df_country['Confirmed'].pct_change() * 100)
        df_country['Death % increase'] = (df_country['Death'].pct_change() * 100)
        df_country['Recovered % increase'] = (df_country['Recovered'].pct_change() * 100)
    

     # if the cases should be reported in relative numbers     
    elif (freq == 'relative'):
        # we are only interested in the cases, we leave-out all other columns
        df_absolute=corona.loc[corona['Country/Region']==Country, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        pop = world.loc[world['name']==Country, 'pop_est'].item()  # population of the selected country
        df_country=df_absolute/pop*100
        df_country.columns = ['Confirmed % of Population', 'Death % of Population', 'Recovered % of Population']
    
    # some values are nan or inf (e.g. a percentage change from zero); we replace them with zero
    df_country.index = df_country.index.date
    df_country.replace([np.inf, -np.inf], np.nan, inplace=True)
    df_country.fillna(0,inplace=True)
    
    
    display(df_country.style)
    
widgets.interact(table_cases,
    Country=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='Denmark'),
    freq=widgets.RadioButtons(options=['absolute', 'relative'], description='Frequency:', disabled=False),
);

interactive(children=(Dropdown(description='Country', index=47, options=('Afghanistan', 'Albania', 'Algeria', …
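
The percentage increase columns rely on `pct_change`, which produces NaN or infinite values whenever the previous value is zero. A small sketch with a made-up case series shows where these values come from and one way to clean them:

```python
import numpy as np
import pandas as pd

# Toy cumulative case counts; the first days are zero.
cases = pd.Series([0, 0, 2, 3, 3], name="Confirmed")

# Day-on-day change in percent: NaN for the first day and for 0 -> 0, inf for 0 -> 2.
growth = cases.pct_change() * 100
print(growth)

# Replace the numeric infinities (not the strings 'inf'/'nan') and fill the gaps with zero.
cleaned = growth.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned)
```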

### Coronavirus Cases over Time

Our second output is a graph that illustrates corona cases, deaths and recoveries for a selected country over time. In contrast to the table, it can also show rates (fatality and recovery), in addition to absolute or relative numbers.

In [16]:
def plot_cases(Country='x', freq='absolute'):
    """ prints a plot with the coronavirus cases per country
    
    Args:
        Country: name of the country
        freq: absolute or relative number of cases or rates (fatality / recovery)
        
    """     
    if (freq == 'absolute'):
        df_country=corona.loc[corona['Country/Region']==Country, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        
    elif (freq == 'relative'):
        df_absolute=corona.loc[corona['Country/Region']==Country, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        pop = world.loc[world['name']==Country, 'pop_est'].item()
        df_country=df_absolute/pop*1000
        df_country.columns = ['Confirmed Cases per 1000', 'Death Cases per 1000', 'Recovered Cases per 1000']
    
    elif (freq == 'rates'):
        df_absolute=corona.loc[corona['Country/Region']==Country, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        death = df_absolute['Death']/df_absolute['Confirmed']
        recov = df_absolute['Recovered']/df_absolute['Confirmed']
        d = {'Fatality Rate':death, 'Recovery Rate':recov}
        df_country = pd.DataFrame(data=d)
        
    df_country.plot(figsize  = (20, 10))
    plt.title("Coronavirus Cases over Time", fontsize=20)
    
widgets.interact(plot_cases,
    Country=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='Denmark'),
    freq=widgets.RadioButtons(
    options=['absolute', 'relative', 'rates'], description='Frequency:', disabled=False)
);

interactive(children=(Dropdown(description='Country', index=47, options=('Afghanistan', 'Albania', 'Algeria', …
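
The three display modes differ only in a simple transformation of the daily totals. As a sketch with made-up numbers (the frame `totals` and the `population` value are hypothetical), the per-1000 figures and the two rates can be computed like this:

```python
import pandas as pd

# Toy daily totals for one country.
totals = pd.DataFrame(
    {"Confirmed": [0, 10, 200], "Death": [0, 1, 8], "Recovered": [0, 2, 50]},
    index=pd.to_datetime(["2020-01-22", "2020-02-22", "2020-03-22"]),
)
population = 5_000_000  # hypothetical population size

# 'relative': cases per 1000 inhabitants.
per_1000 = totals / population * 1000

# 'rates': share of confirmed cases that died or recovered so far.
# Days with zero confirmed cases give 0/0, which pandas reports as NaN.
rates = pd.DataFrame({
    "Fatality Rate": totals["Death"] / totals["Confirmed"],
    "Recovery Rate": totals["Recovered"] / totals["Confirmed"],
})
print(per_1000)
print(rates)
```

Note that a fatality rate computed this way is based on confirmed cases only, so it depends on how much a country tests.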

### Worldwide Coronavirus Cases over Time

The next graph shows the same for the whole world:

In [17]:
def plot_world_cases(freq='absolute'):
    """ prints a plot with the coronavirus cases worldwide
    
    Args:
        freq: absolute or relative number of cases, or rates (fatality / recovery)
        
    """    
    if (freq == 'absolute'):
        df_country=corona.loc[:, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        
    elif (freq == 'relative'):
        df_absolute=corona.loc[:, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        pop = world.loc[:, 'pop_est'].sum().item()
        df_country=df_absolute/pop*1000
        df_country.columns = ['Confirmed Cases per 1000', 'Death Cases per 1000', 'Recovered Cases per 1000']
    
    elif (freq == 'rates'):
        df_absolute=corona.loc[:, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
        death = df_absolute['Death']/df_absolute['Confirmed']
        recov = df_absolute['Recovered']/df_absolute['Confirmed']
        d = {'Fatality Rate':death, 'Recovery Rate':recov}
        df_country = pd.DataFrame(data=d)
        
    df_country.plot(figsize  = (20, 10))
    plt.title("Worldwide Coronavirus Cases over Time", fontsize=20)
    
widgets.interact(plot_world_cases,
    freq=widgets.RadioButtons(
    options=['absolute', 'relative', 'rates'], description='Frequency:', disabled=False)
);

interactive(children=(RadioButtons(description='Frequency:', options=('absolute', 'relative', 'rates'), value=…

### Comparison of Coronavirus Cases over Time

Our fourth output illustrates and compares the number of corona cases, deaths or recoveries for three freely chosen countries:

In [18]:
def plot_compare(CountryX='x', CountryY='y', CountryZ='z', Cases='Confirmed'):
    """ prints a plot with the coronavirus cases for multiple countries
    
    Args:
        CountryX: name of the country X
        CountryY: name of the country Y
        CountryZ: name of the country Z
        Cases: kind of cases displayed: Confirmed, Death, Recovered
        
    """ 
    if (Cases == 'Confirmed'):
         # for country X
        df_X=corona.loc[corona['Country/Region']==CountryX, ['Confirmed', 'Date']].groupby(['Date']).sum()
        df_X.columns = ['Confirmed ' +str(CountryX)]
         # for country Y
        df_Y=corona.loc[corona['Country/Region']==CountryY, ['Confirmed', 'Date']].groupby(['Date']).sum()
        df_Y.columns = ['Confirmed ' +str(CountryY)]
        # for country Z
        df_Z=corona.loc[corona['Country/Region']==CountryZ, ['Confirmed', 'Date']].groupby(['Date']).sum()
        df_Z.columns = ['Confirmed ' +str(CountryZ)]
        # merge each country data set
        df_country = df_X.join(df_Y).join(df_Z)
        
        
    elif (Cases =='Death'):
         # for country X
        df_X=corona.loc[corona['Country/Region']==CountryX, ['Death', 'Date']].groupby(['Date']).sum()
        df_X.columns = ['Death ' +str(CountryX)]
        # for country Y
        df_Y=corona.loc[corona['Country/Region']==CountryY, ['Death', 'Date']].groupby(['Date']).sum()
        df_Y.columns = ['Death ' +str(CountryY)]
         # for country Z
        df_Z=corona.loc[corona['Country/Region']==CountryZ, ['Death', 'Date']].groupby(['Date']).sum()
        df_Z.columns = ['Death ' +str(CountryZ)]
         # merge each country data set
        df_country = df_X.join(df_Y).join(df_Z)
       
    
    elif (Cases =='Recovered'):
         # for country X
        df_X=corona.loc[corona['Country/Region']==CountryX, ['Recovered', 'Date']].groupby(['Date']).sum()
        df_X.columns = ['Recovered ' +str(CountryX)]
        # for country Y
        df_Y=corona.loc[corona['Country/Region']==CountryY, ['Recovered', 'Date']].groupby(['Date']).sum()
        df_Y.columns = ['Recovered ' +str(CountryY)]
         # for country Z
        df_Z=corona.loc[corona['Country/Region']==CountryZ, ['Recovered', 'Date']].groupby(['Date']).sum()
        df_Z.columns = ['Recovered ' +str(CountryZ)]
         # merge each country data set
        df_country = df_X.join(df_Y).join(df_Z)
        
        
    df_country.plot(figsize  = (20, 10))
    plt.title("Comparison of Coronavirus Cases over Time", fontsize=20)
    
widgets.interact(plot_compare,
    CountryX=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='Denmark'),
    CountryY=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='China'),
    CountryZ=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='Switzerland'),
    Cases=widgets.RadioButtons(
    options=['Confirmed','Death','Recovered'], description='Cases:', disabled=False)            
);

interactive(children=(Dropdown(description='CountryX', index=47, options=('Afghanistan', 'Albania', 'Algeria',…
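
The three branches of the comparison function repeat the same steps for each kind of case and each country. A more compact alternative, sketched here under the assumption that the data has the same columns as `corona` (the helper `compare_table` and the toy frame are hypothetical), is to build the comparison table with `pivot_table`:

```python
import pandas as pd

# Toy data: country A is split into two provinces on the first day.
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B"],
    "Province/State": ["P1", "P2", "P1", "Country", "Country"],
    "Date": pd.to_datetime(
        ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23"]
    ),
    "Confirmed": [1, 2, 4, 10, 20],
})

def compare_table(df, countries, cases="Confirmed"):
    """Return one column per country with the daily sum of `cases`."""
    subset = df[df["Country/Region"].isin(countries)]
    # Rows = dates, columns = countries, provinces summed within each country.
    table = subset.pivot_table(
        index="Date", columns="Country/Region", values=cases, aggfunc="sum"
    )
    # Keep the requested order and label the columns like the plot legend.
    return table[list(countries)].add_prefix(f"{cases} ")

table = compare_table(toy, ["A", "B"])
print(table)
```

Calling `table.plot(figsize=(20, 10))` on the result would then give one line per country.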

### Comparison of Coronavirus Outbreak

The next plot may be the most interesting one. It compares the development in different countries from the outbreak day onwards, defined as the first day on which the number of confirmed cases reaches a chosen threshold:

In [19]:
def plot_compare_outbreak(CountryX='x', CountryY='y', CountryZ='z', Cases='Confirmed', threshold=10, freq='absolute'):
    """ prints a plot which compares the evolution of the coronavirus cases between countries when a certain threshold is surpassed
    
    Args:
        CountryX: name of the country X
        CountryY: name of the country Y
        CountryZ: name of the country Z
        Cases: kind of cases displayed: Confirmed, Death, Recovered
        threshold: number of confirmed cases that defines the outbreak date (day zero on the x-axis)
        freq: absolute numbers or numbers relative to the population size
        
    """  
    # for country X
    
    # only interested in the cases columns and the data columns. 
    # Moreover, there could be multiple observations for a country on a given date (one per province), hence we group by date and sum them up. 
    df_X=corona.loc[corona['Country/Region']==CountryX, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
    
    # rename the columns, so that the labels in the figure are right
    df_X.columns = ['Confirmed ' +str(CountryX), 'Death ' +str(CountryX), 'Recovered ' +str(CountryX)]
    
    df_X['Date'] = pd.to_datetime(df_X.index.date)
    
    # define the variable Outbreak, which equals the Date on all days when the confirmed cases are at or above the threshold
    df_X.loc[df_X['Confirmed ' +str(CountryX)] >= threshold, 'Outbreak'] = df_X['Date']
    
    # for simplicity we fill na values with the min value. It makes the next step easier
    df_X['Outbreak'].fillna((df_X['Outbreak'].min()), inplace=True)
    
    # the outbreak date is set at the date at which the threshold is surpassed the first time
    df_X['Outbreak Date'] = df_X['Outbreak'].min()
    
    # then we define a new variable, which shows how many days have passed since the outbreak occurred. This variable will be used for the x-axis.
    df_X['Outbreak Days'] = (df_X['Date'] - df_X['Outbreak Date']).dt.days 
 

    # for country Y (we use the same code as above, just replace X with Y)
    df_Y=corona.loc[corona['Country/Region']==CountryY, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
    df_Y.columns = ['Confirmed ' +str(CountryY), 'Death ' +str(CountryY), 'Recovered ' +str(CountryY)]

    df_Y['Date'] = pd.to_datetime(df_Y.index.date)
    df_Y.loc[df_Y['Confirmed ' +str(CountryY)] >= threshold, 'Outbreak'] = df_Y['Date']
    df_Y['Outbreak'].fillna((df_Y['Outbreak'].min()), inplace=True)
    df_Y['Outbreak Date'] = df_Y['Outbreak'].min()
    df_Y['Outbreak Days'] = (df_Y['Date'] - df_Y['Outbreak Date']).dt.days  
    
    
    # for country Z (we use the same code as above, just replace X with Z)
    df_Z=corona.loc[corona['Country/Region']==CountryZ, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby(['Date']).sum()
    df_Z.columns = ['Confirmed ' +str(CountryZ), 'Death ' +str(CountryZ), 'Recovered ' +str(CountryZ)]

    df_Z['Date'] = pd.to_datetime(df_Z.index.date)
    df_Z.loc[df_Z['Confirmed ' +str(CountryZ)] >= threshold, 'Outbreak'] = df_Z['Date']
    df_Z['Outbreak'].fillna((df_Z['Outbreak'].min()), inplace=True)
    df_Z['Outbreak Date'] = df_Z['Outbreak'].min()
    df_Z['Outbreak Days'] = (df_Z['Date'] - df_Z['Outbreak Date']).dt.days 
    
    f = plt.figure(figsize=(20,10))
    ax = f.add_subplot(1,1,1)
    
    # depending on the function input we plot different columns
    if (freq == 'absolute'):
    
        if (Cases == 'Confirmed'):
            ax.plot(df_X['Outbreak Days'], df_X['Confirmed ' +str(CountryX)], label='Confirmed ' +str(CountryX))  
            ax.plot(df_Y['Outbreak Days'], df_Y['Confirmed ' +str(CountryY)], label='Confirmed ' +str(CountryY))
            ax.plot(df_Z['Outbreak Days'], df_Z['Confirmed ' +str(CountryZ)], label='Confirmed ' +str(CountryZ))

        
        elif (Cases =='Death'):
            ax.plot(df_X['Outbreak Days'], df_X['Death ' +str(CountryX)], label='Death ' +str(CountryX))  
            ax.plot(df_Y['Outbreak Days'], df_Y['Death ' +str(CountryY)], label='Death ' +str(CountryY))
            ax.plot(df_Z['Outbreak Days'], df_Z['Death ' +str(CountryZ)], label='Death ' +str(CountryZ))
    
        elif (Cases =='Recovered'):
            ax.plot(df_X['Outbreak Days'], df_X['Recovered ' +str(CountryX)], label='Recovered ' +str(CountryX))  
            ax.plot(df_Y['Outbreak Days'], df_Y['Recovered ' +str(CountryY)], label='Recovered ' +str(CountryY))
            ax.plot(df_Z['Outbreak Days'], df_Z['Recovered ' +str(CountryZ)], label='Recovered ' +str(CountryZ))
    
    elif (freq == 'relative'):
        # to estimate the relative numbers, we have to add the population size
        popX = world.loc[world['name']==CountryX, 'pop_est'].item()
        popY = world.loc[world['name']==CountryY, 'pop_est'].item()
        popZ = world.loc[world['name']==CountryZ, 'pop_est'].item()
        
        # same code as with the absolute numbers, but we divide the cases by the population size
        if (Cases == 'Confirmed'):
            ax.plot(df_X['Outbreak Days'], df_X['Confirmed ' +str(CountryX)]/popX, label='Confirmed ' +str(CountryX))  
            ax.plot(df_Y['Outbreak Days'], df_Y['Confirmed ' +str(CountryY)]/popY, label='Confirmed ' +str(CountryY))
            ax.plot(df_Z['Outbreak Days'], df_Z['Confirmed ' +str(CountryZ)]/popZ, label='Confirmed ' +str(CountryZ))

        
        elif (Cases =='Death'):
            ax.plot(df_X['Outbreak Days'], df_X['Death ' +str(CountryX)]/popX, label='Death ' +str(CountryX))  
            ax.plot(df_Y['Outbreak Days'], df_Y['Death ' +str(CountryY)]/popY, label='Death ' +str(CountryY))
            ax.plot(df_Z['Outbreak Days'], df_Z['Death ' +str(CountryZ)]/popZ, label='Death ' +str(CountryZ))
    
        elif (Cases =='Recovered'):
            ax.plot(df_X['Outbreak Days'], df_X['Recovered ' +str(CountryX)]/popX, label='Recovered ' +str(CountryX))  
            ax.plot(df_Y['Outbreak Days'], df_Y['Recovered ' +str(CountryY)]/popY, label='Recovered ' +str(CountryY))
            ax.plot(df_Z['Outbreak Days'], df_Z['Recovered ' +str(CountryZ)]/popZ, label='Recovered ' +str(CountryZ))
        
        
    #df_country.plot(figsize  = (20, 10))
    ax.set_xlim(left=0)
    # add legend
    ax.legend(loc='lower right')
    # add text that shows additional infos about the outbreak of the countries
    text = ' Outbreak Date'
    text += f'\n'
    text +='at which threshold surpassed:'
    text += f'\n'
    text += f'\n'
    text += str(CountryX) +':' + str(df_X['Outbreak Date'].min().date()) 
    text += f'\n'
    text +=str(CountryY) +':' + str(df_Y['Outbreak Date'].min().date())
    text += f'\n'
    text +=str(CountryZ) +':' + str(df_Z['Outbreak Date'].min().date())
    ax.text(0.1, 0.9,text, c='black', bbox=dict(facecolor='none', edgecolor='black'),horizontalalignment='center',verticalalignment='center',transform = ax.transAxes)
    plt.title("Comparison of Coronavirus Outbreak", fontsize=20) 
    plt.xlabel("Days since threshold surpassed")
    
style = {'description_width': 'initial'}
widgets.interact(plot_compare_outbreak,
    CountryX=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='US'),
    CountryY=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='Italy'),
    CountryZ=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='South Korea'),
    Cases=widgets.RadioButtons(options=['Confirmed','Death','Recovered'], description='Cases:', disabled=False),
    threshold=widgets.IntSlider(value=10, min=10,max=100, step=2, style=style, description='Confirmed Cases Threshold :'),
    freq=widgets.RadioButtons(options=['absolute', 'relative'], description='Frequency:', disabled=False)
);

interactive(children=(Dropdown(description='CountryX', index=174, options=('Afghanistan', 'Albania', 'Algeria'…
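
The core idea of this plot is to re-index each country's series by the number of days since it first reached the threshold. A compact sketch of that alignment step, with a hypothetical helper `days_since_outbreak` and made-up case numbers, assuming the series is sorted by date:

```python
import pandas as pd

def days_since_outbreak(series, threshold):
    """Re-index a date-indexed cumulative series by days since it first reached `threshold`."""
    reached = series[series >= threshold]
    if reached.empty:
        return None  # the threshold was never reached
    outbreak_date = reached.index[0]
    # Negative values are days before the outbreak date, zero is the outbreak day itself.
    days = (series.index - outbreak_date).days
    return pd.Series(series.to_numpy(), index=days, name=series.name)

dates = pd.date_range("2020-03-01", periods=5)
country_a = pd.Series([1, 5, 12, 30, 60], index=dates, name="Confirmed A")
country_b = pd.Series([0, 0, 0, 11, 15], index=dates, name="Confirmed B")

aligned_a = days_since_outbreak(country_a, threshold=10)
aligned_b = days_since_outbreak(country_b, threshold=10)
print(aligned_a)
print(aligned_b)
```

Returning `None` when the threshold is never reached is one possible design choice; it makes that case explicit instead of producing a series of missing values.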

### World Map: Coronavirus Cases Development

Our next plot is a world map that shows the development of the spread. In addition, information for a chosen country is shown in an information box. 

In [20]:
def world_map_plot_info(Country='x', day=1):
    """ prints a world map which shows the evolution of the coronavirus cases, includes information about a country
    Args:
        Country: name of the country for which information is displayed
        day: number of days since the first date in the data (the map shows that day)
    
    """
    fig, ax = plt.subplots(figsize  = (20, 10))
    # filter dataframe: only look at day
    corona_day = corona[corona['Days']==day]
    today = corona[corona['Days']==day]['Date'].iloc[0].date()
    
    # plot world map
    world.plot(alpha = 1, color="lightgrey",edgecolor = "black",ax=ax)
    
    # add cases points with the frequency as size of the point
    plt.scatter(x=corona_day['Long'], y=corona_day['Lat'], alpha=0.4, s=corona_day['Confirmed']**0.8, c='y', label='Confirmed')
    plt.scatter(x=corona_day['Long'], y=corona_day['Lat'], alpha=0.6, s=corona_day['Death']**0.8, c='r',label='Deaths')
    #plt.scatter(x=corona_day['Long'], y=corona_day['Lat'], alpha=0.5, s=corona_day['Recovered']**0.8, c='g', label='Recovered') 
    
    
    ax.set_title("Coronavirus Cases Development", fontsize=30)
    ax.set_axis_off()
    plt.axis('equal')
    leg = ax.legend(loc='lower center', title='Cases',fontsize='x-large')
    for handle in leg.legendHandles:
        handle.set_sizes([20.0])
    plt.setp(leg.get_title(),fontsize='x-large')
    
    # aggregate the chosen country's figures for the selected day (reuses the filtered frame from above)
    corona_country = corona_day.loc[corona_day['Country/Region']==Country, ['Confirmed', 'Death', 'Recovered', 'Date']].groupby('Date').sum()
    
    # add additional box with infos 
    Conf = corona_country['Confirmed'].item()
    Death = corona_country['Death'].item()
    Recovered = corona_country['Recovered'].item()
    
    text = 'Date: ' + str(today)
    text += f'\n'
    text += str(Country) +' Cases:' 
    text += f'\n'
    text +='Confirmed: ' +str(Conf)
    text += f'\n'
    text +='Death: ' +str(Death)
    text += f'\n'
    text +='Recovered: ' +str(Recovered)
    plt.text(-170, 90, text, fontsize='large', c='black', bbox=dict(facecolor='none', edgecolor='black'))
    
max_date = corona['Days'].max()    
widgets.interact(world_map_plot_info,
    Country=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='Denmark'),
    day=widgets.FloatSlider(description="Days since January 22", style=style, min=0, max=max_date, step=1),
);

interactive(children=(Dropdown(description='Country', index=47, options=('Afghanistan', 'Albania', 'Algeria', …
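
The scatter calls in the map function pass `s=corona_day['Confirmed']**0.8`. In matplotlib, `s` is the marker area in points squared, so an exponent below 1 compresses the range of marker sizes and keeps the largest outbreaks from covering the whole map. A small sketch with hypothetical case counts shows the effect:

```python
import numpy as np

# Hypothetical case counts spanning several orders of magnitude
counts = np.array([10, 1_000, 100_000, 1_000_000])

linear_sizes = counts.astype(float)  # what s=counts would give
damped_sizes = counts ** 0.8         # the scaling used in the scatter calls

# Ratio between the largest and the smallest marker area under each scaling
linear_ratio = linear_sizes.max() / linear_sizes.min()
damped_ratio = damped_sizes.max() / damped_sizes.min()
print(linear_ratio, damped_ratio)
```

With these numbers the largest marker is 100,000 times the area of the smallest under linear scaling, but only about 10,000 times under the 0.8 exponent.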

<p>         <br>
In the data used so far, the cases for most countries are aggregated at the country level. For the US, for example, there is only a single point. The Johns Hopkins University also provides local-level data for the US. In the next part we add this data to get a more detailed map of the US.

#### Read and Clean: US local-level Data

We first add the confirmed cases on the local level for the US using the same code as above:

In [21]:
url_us1 = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
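# on_bad_lines='skip' requires pandas >= 1.3; it replaces error_bad_lines=False, which was removed in pandas 2.0
df_us_confirmed = pd.read_csv(url_us1, on_bad_lines='skip')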
df_us_confirmed.rename(columns={"Province_State": "Province/State", "Long_" : "Long", "Country_Region" : "Country/Region"}, inplace=True)
df_us_confirmed.drop(columns=['UID','iso2','iso3','code3','FIPS','Admin2','Combined_Key'], inplace=True)

In [22]:
us_confirmed = df_us_confirmed.melt(id_vars=["Country/Region", "Lat", "Long", "Province/State"], 
        var_name=str("Date"), 
        value_name="Confirmed")

us_confirmed['Date'] = pd.to_datetime(us_confirmed['Date'])
us_confirmed['Days'] = (us_confirmed['Date'] - start_date).dt.days
us_confirmed.sort_values(['Country/Region', 'Date'])

Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Confirmed,Days
0,US,-14.271000,-170.132000,American Samoa,2020-01-22,0,0
1,US,13.444300,144.793700,Guam,2020-01-22,0,0
2,US,15.097900,145.673900,Northern Mariana Islands,2020-01-22,0,0
3,US,18.220800,-66.590100,Puerto Rico,2020-01-22,0,0
4,US,18.335800,-64.896300,Virgin Islands,2020-01-22,0,0
...,...,...,...,...,...,...,...
371749,US,39.372319,-111.575868,Utah,2020-05-14,29,113
371750,US,38.996171,-110.701396,Utah,2020-05-14,13,113
371751,US,37.854472,-111.441876,Utah,2020-05-14,187,113
371752,US,40.124915,-109.517442,Utah,2020-05-14,16,113
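
The reshaping step above turns the wide JHU layout, with one column per date, into a long layout with one row per location and date. A minimal sketch on a two-row toy table with rounded, illustrative values follows. Here `start` is only a stand-in for the notebook's `start_date`, and the explicit `format` argument is an assumption about the date headers.

```python
import pandas as pd

# Wide layout: one column per date, as in the JHU time series files
wide = pd.DataFrame({
    "Province/State": ["Utah", "Guam"],
    "Country/Region": ["US", "US"],
    "Lat": [39.37, 13.44],
    "Long": [-111.58, 144.79],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
})

# Long layout: the date columns become rows
long_df = wide.melt(
    id_vars=["Country/Region", "Lat", "Long", "Province/State"],
    var_name="Date",
    value_name="Confirmed",
)
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

start = long_df["Date"].min()  # assumption: stands in for `start_date`
long_df["Days"] = (long_df["Date"] - start).dt.days
print(long_df)
```

Each of the two locations now appears once per date, so the frame has four rows and a `Days` column counting from the first date.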


Then we add the deaths (no recovered data is available at the local level):

In [23]:
url_us2 = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
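# on_bad_lines='skip' requires pandas >= 1.3; it replaces error_bad_lines=False, which was removed in pandas 2.0
df_us_death = pd.read_csv(url_us2, on_bad_lines='skip')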
df_us_death.rename(columns={"Province_State": "Province/State", "Long_" : "Long", "Country_Region" : "Country/Region"}, inplace=True)
df_us_death.drop(columns=['UID','iso2','iso3','code3','FIPS','Admin2','Combined_Key', 'Population'], inplace=True)

In [24]:
us_death = df_us_death.melt(id_vars=["Country/Region", "Lat", "Long", "Province/State"], 
        var_name=str("Date"), 
        value_name="Death")

us_death['Date'] = pd.to_datetime(us_death['Date'])
us_corona = us_confirmed.merge(us_death,on = ["Country/Region", "Lat", "Long", "Province/State",'Date'],how ='outer')

us_corona.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 386688 entries, 0 to 386687
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Country/Region  386688 non-null  object        
 1   Lat             386688 non-null  float64       
 2   Long            386688 non-null  float64       
 3   Province/State  386688 non-null  object        
 4   Date            386688 non-null  datetime64[ns]
 5   Confirmed       386118 non-null  float64       
 6   Days            386118 non-null  float64       
 7   Death           386118 non-null  float64       
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 26.6+ MB
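
The `info()` output above lists 386688 rows but only 386118 non-null values in `Confirmed`, `Days` and `Death`. With an outer merge, this pattern appears when some key combinations exist in only one of the two frames. One way to inspect such rows is the `indicator` argument of `merge`, sketched here on toy data. The values are hypothetical, and the slightly different coordinates are just one possible cause, not a verified property of the JHU files.

```python
import pandas as pd

keys = ["Province/State", "Lat", "Long", "Date"]

confirmed = pd.DataFrame({
    "Province/State": ["Utah", "Guam"],
    "Lat": [39.37, 13.44],
    "Long": [-111.58, 144.79],
    "Date": pd.to_datetime(["2020-05-14", "2020-05-14"]),
    "Confirmed": [29, 150],
})
death = pd.DataFrame({
    "Province/State": ["Utah", "Guam"],
    "Lat": [39.37, 13.4443],  # hypothetical: Guam's latitude differs between the files
    "Long": [-111.58, 144.79],
    "Date": pd.to_datetime(["2020-05-14", "2020-05-14"]),
    "Death": [1, 5],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = confirmed.merge(death, on=keys, how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

In the toy result, Guam appears twice, once with a missing `Death` and once with a missing `Confirmed`. An outer merge leaves the same kind of gap whenever the keys do not match exactly.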


We drop the US country-level rows from the `corona` dataframe constructed before and replace them with the US local-level data:

In [25]:
corona_us = corona[corona['Country/Region'] != 'US']

corona = pd.concat([corona_us, us_corona], axis=0, ignore_index=True)
corona.sort_values(['Country/Region', 'Date'])

Unnamed: 0,Country/Region,Lat,Long,Province/State,Date,Confirmed,Days,Death,Recovered
0,Afghanistan,33.0,65.0,Country,2020-01-22,0.0,0.0,0.0,0.0
265,Afghanistan,33.0,65.0,Country,2020-01-23,0.0,1.0,0.0,0.0
530,Afghanistan,33.0,65.0,Country,2020-01-24,0.0,2.0,0.0,0.0
795,Afghanistan,33.0,65.0,Country,2020-01-25,0.0,3.0,0.0,0.0
1060,Afghanistan,33.0,65.0,Country,2020-01-26,0.0,4.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
29114,Zimbabwe,-20.0,30.0,Country,2020-05-10,36.0,109.0,4.0,9.0
29379,Zimbabwe,-20.0,30.0,Country,2020-05-11,36.0,110.0,4.0,9.0
29644,Zimbabwe,-20.0,30.0,Country,2020-05-12,36.0,111.0,4.0,9.0
29909,Zimbabwe,-20.0,30.0,Country,2020-05-13,37.0,112.0,4.0,12.0


### World Map: Coronavirus Cases Development 2

We then use the same function as above and plot the map again. This time the US shows many separate outbreak points instead of a single one, which is the only difference from the previous plot.

In [26]:
widgets.interact(world_map_plot_info,
    Country=widgets.Dropdown(options = sorted(corona['Country/Region'].unique()), value='US'),
    day=widgets.FloatSlider(description="Days since January 22", style=style, min=0, max=max_date, step=1),
);

interactive(children=(Dropdown(description='Country', index=174, options=('Afghanistan', 'Albania', 'Algeria',…

## Conclusion

In this project we have imported, cleaned and merged time series on coronavirus cases from the Johns Hopkins University. Additionally, we loaded the world dataset provided by geopandas. We then visualized the coronavirus outbreak, using figures from newspapers as inspiration.