![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fpresentations&branch=master&subPath=data-science-with-covid-instructor.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/presentations/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Introduction to Data Science with COVID-19 Data

This Jupyter notebook uses [COVID-19 statistics from Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19).

First, `▶Run` the next cell to import the data. Once the data set has been downloaded and imported into a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), it will be displayed.

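You can change the date, but make sure you use the format `'MM-DD-YYYY'`, which matches the file names in the CSSE data set. Reports from before late March 2020 use different column names (for example `Country/Region` instead of `Country_Region`), so the cells below may need adjusting for earlier dates.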
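
One way to avoid typos in the date string is to build it from a date object. The following is a small sketch, not part of the original notebook; `report_date` and `base_url` are illustrative names.

```python
from datetime import date as Date

# Hypothetical example: build the 'MM-DD-YYYY' string from a date object
report_date = Date(2020, 4, 7)
date = report_date.strftime('%m-%d-%Y')

# Same URL pattern as the import cell, split into a base and a file name
base_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')
csv_url = base_url + date + '.csv'

print(date)  # 04-07-2020
```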

In [None]:
date = '04-07-2020'

import pandas as pd

csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)
covid_stats

## Data Cleaning

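`Run` the next cell to clean up the data. Some countries appear in several rows of the source data (one per province, state, or county), so we'll add up the values for each country and create a new DataFrame with one row per country.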

In [None]:
# To use specific countries only, put a # in front of the next line and remove the two ''' lines around the list below it
country_list = covid_stats['Country_Region'].unique()
'''
country_list = ['Italy', 'Spain', 'Germany', 'France', 
                'Israel', 'US', 'United Kingdom',
                'Singapore', 'Australia', 'Canada',
                'China', 'Argentina', 'Russia', 'India']
'''

# Collect one row of totals per country, then build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0, so we avoid it here)
rows = []
for country in country_list:
    country_stats = covid_stats[covid_stats['Country_Region']==country]
    rows.append({'Country':country,
                 'Confirmed':country_stats['Confirmed'].sum(),
                 'Recovered':country_stats['Recovered'].sum(),
                 'Deaths':country_stats['Deaths'].sum()})

df = pd.DataFrame(rows, columns=['Country', 'Confirmed', 'Recovered', 'Deaths'])

df.sort_values('Confirmed',ascending=False)
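
The per-country loop above can also be written with pandas' `groupby`, which sums each column for every country in one call. This is a sketch on a small made-up table (the numbers are invented for illustration), not the notebook's data; `sample_stats` and `totals` are illustrative names.

```python
import pandas as pd

# Made-up rows standing in for covid_stats; values are illustrative only
sample_stats = pd.DataFrame({
    'Country_Region': ['Canada', 'Canada', 'Italy', 'Italy', 'Spain'],
    'Confirmed': [10, 20, 30, 40, 50],
    'Recovered': [1, 2, 3, 4, 5],
    'Deaths': [0, 1, 2, 3, 4],
})

# Sum each numeric column per country, then turn the group labels back into a column
totals = (sample_stats
          .groupby('Country_Region')[['Confirmed', 'Recovered', 'Deaths']]
          .sum()
          .reset_index()
          .rename(columns={'Country_Region': 'Country'}))

print(totals.sort_values('Confirmed', ascending=False))
```

With the real data, the same pattern would replace `sample_stats` with `covid_stats`.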

## Add World Data

We can also add up all of the values in the data set to get worldwide totals.

In [None]:
confirmed = covid_stats['Confirmed'].sum()
recovered = covid_stats['Recovered'].sum()
deaths = covid_stats['Deaths'].sum()
world_values = {'Country':'World','Confirmed':confirmed,'Recovered':recovered,'Deaths':deaths}
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
df = pd.concat([df, pd.DataFrame([world_values])], ignore_index=True)
df.tail()

## Adding Population Data

We'll use population data from [Gapminder](https://gapminder.org).

In [None]:
pop_sheet_id = '18Ep3s1S0cvlT1ovQG9KdipLEoQ1Ktz5LtTTQpDcWbX0'
pop_gid = '1668956939'
pop_csv_url = 'https://docs.google.com/spreadsheets/d/'+pop_sheet_id+'/export?gid='+pop_gid+'&format=csv'
pop_df = pd.read_csv(pop_csv_url)
current_population = pop_df[pop_df['time']==2019]
current_population

In [None]:
cp = current_population.set_index('name')
# Rename countries to match the Gapminder names (exact matches in the Country column only)
df['Country'] = df['Country'].replace({'Korea, South':'South Korea', 'US':'United States'})
cs = df.set_index('Country')

In [None]:
new_df = cs.join(cp)
new_df
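
To see what the join does, here is a minimal sketch on two made-up tables (names and numbers are invented, and `Atlantis` is a deliberately unmatched entry). `join` matches rows by index and keeps every row of the left table, so a country without a population entry ends up with `NaN`.

```python
import pandas as pd

# Made-up tables; names and numbers are illustrative only
cases = pd.DataFrame({'Country': ['Canada', 'Italy', 'Atlantis'],
                      'Confirmed': [100, 200, 5]}).set_index('Country')
people = pd.DataFrame({'name': ['Canada', 'Italy'],
                       'population': [1000, 2000]}).set_index('name')

# join() is a left join on the index by default: every row of `cases` is kept
joined = cases.join(people)
print(joined)

# Rows with no population match contain NaN and are removed by dropna()
matched = joined.dropna()
print(matched)
```

This is why the country names are made consistent before joining: a name that differs between the two tables would not match and would be dropped later.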

In [None]:
new_df.drop(columns=['geo','time'],inplace=True)
new_df.rename(columns={'population':'Population'},inplace=True)
new_df

In [None]:
new_df = new_df.dropna()
new_df

In [None]:
new_df['Confirmed Percent'] = new_df['Confirmed']/new_df['Population']*100
new_df
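
The percentage is computed column-wise: each row's confirmed count is divided by that row's population and multiplied by 100. A small worked sketch with invented numbers (the labels `A` and `B` are placeholders):

```python
import pandas as pd

# Invented values: 50 cases among 1,000 people, and 300 cases among 60,000 people
sample = pd.DataFrame({'Confirmed': [50, 300], 'Population': [1000, 60000]},
                      index=['A', 'B'])

# Element-wise arithmetic on whole columns; no loop is needed
sample['Confirmed Percent'] = sample['Confirmed'] / sample['Population'] * 100
print(sample)
```

Here `A` works out to 5 percent and `B` to 0.5 percent, even though `B` has more confirmed cases in absolute terms.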

In [None]:
import cufflinks as cf
cf.go_offline()
new_df.sort_values('Confirmed Percent').tail(20).iplot(kind='bar',y='Confirmed Percent')

## Sorting Data

`Run` the next cell to sort the data by a particular column. The `ascending=False` argument sorts from largest to smallest; it is optional (the default is `True`, which sorts smallest first). `.head(16)` shows just the first 16 rows.

In [None]:
df.sort_values('Confirmed', ascending=False).head(16)
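
The effect of `ascending` and `.head()` is easier to see on a tiny table. This sketch uses invented values and placeholder country names:

```python
import pandas as pd

# Invented values for illustration
sample = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                       'Confirmed': [30, 10, 40, 20]})

# Default order is ascending (smallest first)
smallest_first = sample.sort_values('Confirmed')

# ascending=False puts the largest values first; head(2) keeps the top two rows
largest_first = sample.sort_values('Confirmed', ascending=False)
top_two = largest_first.head(2)

print(smallest_first)
print(top_two)
```

Note that `sort_values` returns a new DataFrame and leaves `sample` unchanged.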

## Selecting Specific Countries

To see a DataFrame of specific countries, edit and run the next cell.

In [None]:
#df[df['Country']=='Canada']
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)]
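
The selection works in two steps: `.isin()` produces a column of `True`/`False` values, and indexing the DataFrame with that column keeps only the `True` rows. A minimal sketch on invented data:

```python
import pandas as pd

# Invented values for illustration
sample = pd.DataFrame({'Country': ['Canada', 'China', 'Italy'],
                       'Confirmed': [10, 20, 30]})

# Step 1: build a boolean mask, one True/False per row
mask = sample['Country'].isin(['Canada', 'Italy'])
print(mask.tolist())  # [True, False, True]

# Step 2: keep only the rows where the mask is True
selected = sample[mask]
print(selected)
```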

## Graphing Data

We will use the `cufflinks` library to create a graph of our data set.

```python
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed').iplot(kind='bar',x='Country',y='Confirmed')
```

Another option:

```python
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed',ascending=False).head(20).iplot(kind='bar',x='Country',y='Confirmed',title='COVID Cases')
```

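To exclude the `World` row, you can drop it by its index label, for example `.drop(184)` (or `.drop('World')` if you've set the index to `'Country'`). The numeric label depends on the date and the country list, so check `df.tail()` first.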
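
Two ways of excluding the `World` row that do not depend on knowing its numeric index label are sketched below on invented data; `no_world` and `no_world_indexed` are illustrative names.

```python
import pandas as pd

# Invented values for illustration
sample = pd.DataFrame({'Country': ['Canada', 'Italy', 'World'],
                       'Confirmed': [10, 20, 30]})

# Option 1: filter by value, keeping every row whose Country is not 'World'
no_world = sample[sample['Country'] != 'World']

# Option 2: make Country the index, then drop the row by its label
no_world_indexed = sample.set_index('Country').drop('World')

print(no_world)
print(no_world_indexed)
```

Either result can then be sorted and plotted in the same way as `df`; with option 2 the country names are in the index rather than in a `Country` column.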

In [None]:
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed').iplot(kind='bar',x='Country',y='Confirmed')

**Hopefully that's an interesting introduction to data science using online COVID-19 data.**

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)