Basic imports and setup

In [None]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
import pandas as pd
from pandas import Series, DataFrame, RangeIndex

### COVID-19 DATASET

Where to find the dataset and how to retrieve it from your notebook:<br>

https://github.com/CSSEGISandData/COVID-19

You can check the csv structure in the github page:<br>

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data

### File naming convention
MM-DD-YYYY.csv in UTC.
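
As a minimal sketch of the naming convention, the file URL for a given day can be built from a `date` object. The helper name `daily_report_url` is hypothetical, and the base URL is assumed to be the raw-file path used in the "Reading the data" cell of this notebook.

```python
from datetime import date

# Assumed base URL: the raw-file folder used in the "Reading the data" cell.
BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)


def daily_report_url(day: date) -> str:
    """Return the daily report URL for `day` (hypothetical helper)."""
    # File names follow the MM-DD-YYYY.csv convention.
    return f"{BASE_URL}{day.strftime('%m-%d-%Y')}.csv"


print(daily_report_url(date(2022, 11, 29)))
```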

### Field description
**FIPS**: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.<br>
**Admin2**: County name. US only.<br>
**Province_State**: Province, state or dependency name.<br>
**Country_Region**: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.<br>
**Last_Update**: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).<br>
**Lat** and **Long_**: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.<br>
**Confirmed**: Confirmed cases include presumptive positive cases and probable cases, in accordance with CDC guidelines as of April 14.<br>
**Deaths**: Death totals in the US include confirmed and probable, in accordance with CDC guidelines as of April 14.<br>
**Recovered**: Recovered cases outside China are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number.<br>
**Active**: Active cases = total confirmed - total recovered - total deaths.<br>
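**Combined_Key**: Admin2 + Province_State + Country_Region.<br>
**Incident_Rate**: Incidence rate = confirmed cases per 100,000 persons.<br>
**Case_Fatality_Ratio (%)**: Case-fatality ratio (%) = number of recorded deaths / number of confirmed cases.<br>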
**US Testing Rate**: = total test results per 100,000 persons. The "total test results" is equal to "Total test results (Positive + Negative)" from Covid Tracking Project.<br>
**US Hospitalization Rate (%)**: = Total number hospitalized / Number confirmed cases. The "Total number hospitalized" is the "Hospitalized – Cumulative" count from Covid Tracking Project. The "hospitalization rate" and "hospitalized - Cumulative" data is only presented for those states which provide cumulative hospital data.

### Reading the data

In [None]:
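# Load one daily report straight from GitHub (requires network access).
covid_data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/11-29-2022.csv')

In [None]:
# Column names, dtypes and non-null counts.
covid_data.info()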

### Exercise 1

For each column (i.e. each Series, or field), find how many values are missing. Also find how many values are missing in total.

e.g.

<pre>
FIPS                    738
Admin2                  734
Province_State          174
Country_Region            0
Last_Update               0
Lat                      90
Long_                    90
Confirmed                 0
Deaths                    0
Recovered              4006
Active                 4006
Combined_Key              0
Incident_Rate            91
Case_Fatality_Ratio      41
dtype: int64
</pre>

##### Solution
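
One possible approach, sketched on a tiny made-up frame rather than on the real `covid_data`: `isna()` marks each missing cell, and `sum()` counts them per column. The frame `sample` and its values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for covid_data (values are illustrative only).
sample = pd.DataFrame({
    "Admin2": [np.nan, "Abbeville", np.nan],
    "Country_Region": ["Italy", "US", "Yemen"],
    "Recovered": [np.nan, np.nan, 10.0],
})

# isna() gives a boolean frame; summing booleans counts the True cells per column.
missing_per_column = sample.isna().sum()

# Summing the per-column counts gives the total number of missing cells.
missing_total = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing:", missing_total)
```

On the real data the same two lines would be applied to `covid_data` instead of `sample`.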

### Exercise 2

Create a Series with the number of deaths for every country.

e.g.

<pre>
Country_Region
Afghanistan           1638
Albania                637
Algeria               2186
Andorra                 76
Angola                 328
                      ... 
West Bank and Gaza     580
Western Sahara           1
Yemen                  607
Zambia                 353
Zimbabwe               260
Name: Deaths, Length: 191, dtype: int64
</pre>

##### Solution

In [None]:
from IPython.display import display, HTML
# display(HTML(covid_data.to_html()))

# Sum the deaths of all provinces/states that belong to the same country.
# groupby replaces the older .sum(level="Country_Region"), which was removed in pandas 2.0.
deathsCount = covid_data.groupby("Country_Region")["Deaths"].sum()
# display(HTML(deathsCount.to_frame().to_html()))

# Without aggregation, countries split into provinces would appear on several rows:
# deathsCount = covid_data.set_index("Country_Region")["Deaths"]
deathsCount

### Exercise 3

Find all the countries with more deaths than Italy.

e.g.

<pre>
Country_Region
Brazil            166699
India             130993
Mexico             99026
US                248672
United Kingdom     52839
Name: Deaths, dtype: int64
</pre>

##### Solution
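
A minimal sketch using boolean indexing, assuming a Series of deaths indexed by country as in Exercise 2. The Series `deaths` below is a made-up stand-in with illustrative numbers, not the real data.

```python
import pandas as pd

# Made-up deaths per country (illustrative numbers only).
deaths = pd.Series(
    {"Brazil": 300, "Italy": 200, "Spain": 150, "US": 500},
    name="Deaths",
)
deaths.index.name = "Country_Region"

# Compare every country with Italy's value and keep the rows where the test is True.
more_than_italy = deaths[deaths > deaths["Italy"]]

print(more_than_italy)
```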

### Exercise 4

Find a data source with the population of each country and compute the ratio between deaths and population.

e.g.

<pre>
                Deaths  Population  Ratio
Country_Region
Afghanistan     1638    34656032.0  0.000047
Albania         637     2876101.0   0.000221
Algeria         2186    40606052.0  0.000054
Andorra         76      77281.0     0.000983
Angola          328     28813463.0  0.000011
...             ...     ...         ...
191 rows × 3 columns
</pre>

##### Solution
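
A possible sketch, assuming the population has already been loaded into a Series indexed by the same country names. The three rows below reuse the figures from the example table above; how the population table is obtained is left open.

```python
import pandas as pd

# Deaths and population for three countries, copied from the example table above.
deaths = pd.Series(
    {"Afghanistan": 1638, "Albania": 637, "Andorra": 76}, name="Deaths"
)
population = pd.Series(
    {"Afghanistan": 34656032.0, "Albania": 2876101.0, "Andorra": 77281.0},
    name="Population",
)
deaths.index.name = "Country_Region"

# join aligns the two on their index (left join on the country name).
table = deaths.to_frame().join(population)

# Element-wise division gives deaths per inhabitant.
table["Ratio"] = table["Deaths"] / table["Population"]

print(table)
```

A left join keeps every country from the deaths Series, so a country with no match in the population table ends up with `NaN` rather than disappearing.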

### Exercise 5

Sort the countries according to the percentage of deaths over population.

Are there any problems with missing data?

##### Solution
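
A short sketch of the sorting step, on a made-up table in which one country has no population value (an assumption, to show what happens with missing data). By default `sort_values` places `NaN` rows last.

```python
import numpy as np
import pandas as pd

# Illustrative table; the missing population for the last row is an assumption.
table = pd.DataFrame(
    {"Deaths": [637, 76, 1], "Population": [2876101.0, 77281.0, np.nan]},
    index=pd.Index(["Albania", "Andorra", "Western Sahara"], name="Country_Region"),
)

# Percentage of deaths over population; NaN propagates through the division.
table["Percent"] = 100 * table["Deaths"] / table["Population"]

# Highest percentage first; rows with NaN go to the end (na_position="last").
ranked = table.sort_values("Percent", ascending=False)

print(ranked)
print("Countries without a percentage:", int(ranked["Percent"].isna().sum()))
```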

### Exercise 6

Define a strategy to handle the missing data and the errors you ran into in Exercise 5.

If you did not have any missing data (which I doubt), try again with this population table: https://worldpopulationreview.com/countries

##### Solution
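
One possible strategy, sketched under assumptions: list the countries that fail to match, fix the known name mismatches with a mapping, and drop whatever still has no population. The frames, the `NAME_FIXES` mapping and the population figure for "United States" are all illustrative placeholders, not real data.

```python
import pandas as pd

# Made-up inputs: the two sources spell one country differently,
# and one country is absent from the population table.
deaths = pd.DataFrame({
    "Country_Region": ["Albania", "US", "Western Sahara"],
    "Deaths": [637, 248672, 1],
})
population = pd.DataFrame({
    "Country": ["Albania", "United States"],
    "Population": [2876101, 330_000_000],  # second value is a round placeholder
})

# Hypothetical mapping from the population table's names to the dataset's names.
NAME_FIXES = {"United States": "US"}
population["Country_Region"] = population["Country"].replace(NAME_FIXES)

# indicator=True adds a _merge column that says where each row was found.
merged = deaths.merge(
    population[["Country_Region", "Population"]],
    on="Country_Region",
    how="left",
    indicator=True,
)

# Countries still unmatched after renaming: report them instead of hiding them.
unmatched = merged.loc[merged["_merge"] == "left_only", "Country_Region"].tolist()

# Keep only rows with a known population, then compute the ratio.
clean = (
    merged.dropna(subset=["Population"])
    .drop(columns="_merge")
    .assign(Ratio=lambda d: d["Deaths"] / d["Population"])
)

print("Unmatched:", unmatched)
print(clean.sort_values("Ratio", ascending=False))
```

Dropping rows is only one option. Looking up the missing populations by hand, or matching on ISO country codes instead of names, may be preferable when few countries are affected.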