<a href="https://colab.research.google.com/github/RubeRad/tcscs/blob/master/CovidPlots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Python, Jupyter, Pandas, and Matplotlib
This is a Jupyter Python notebook, which is a collection of cells. Each cell is either of type 'markdown' (formatted text, like this cell) or code (python, grey background). The two most important rules of Jupyter Notebooks are:
1. ***SHIFT-ENTER*** will cause the current cell to execute. 
  - For Markdown cells, 'execute' means render the formatting. ([Here's a markdown cheatsheet](https://sqlbak.com/blog/wp-content/uploads/2020/04/Jupyter-Notebook-Markdown-Cheatsheet.pdf))
  - For Code cells, 'execute' means run the python.
  - Some Code cells take a while to execute; watch for the * to change to a number.
1. Any cell can be edited (double-click into it) and re-executed (SHIFT-ENTER again).
--- 

The first code in any Python script or Jupyter notebook needs to import any libraries that will be used. The `as` directives specify nicknames that are more convenient to type.

In [None]:
from datetime import date        # Because we're going to be plotting time series
import matplotlib.pyplot as plt  # This is for creating graphs
import pandas            as pd   # This greatly simplifies handling of tabular (csv) data

# this is one special little function to support graphs with separate left/right scales
from mpl_toolkits.axes_grid1 import host_subplot

# Date handling in python
There's a lot more we could go into regarding time zones, subsecond precision, etc., but for this dataset we only need to deal with calendar dates.
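As a small sketch of calendar-date arithmetic (the variable names `start`, `end`, and `gap` are made up for this illustration), text that looks like a date can be converted into a real `date` object, and two `date` objects can then be subtracted:

```python
from datetime import date

# Text that merely looks like a date cannot be subtracted,
# so convert each ISO-formatted string to a date object first.
start = date.fromisoformat('2023-01-30')
end = date.fromisoformat('2023-02-06')

gap = end - start   # subtracting two dates yields a timedelta
print(gap.days)     # whole days between the two dates
```

The result of the subtraction is a `timedelta`, and its `.days` attribute is an ordinary integer.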

In [None]:
# What answer do you expect for this?
date1 = '2023-01-30'
date2 = '2023-02-06'
date2 - date1

In [None]:
# The datetime module lets us do this and get the right answer
date1 = date(2023,2,25)  # what if it's 2020?
date2 = date(2023,3,5)
date2 - date1

# Slurp the data into a pandas DataFrame

The `pandas` function `read_csv()` can read .csv files on your computer, but it's so smart it can even slurp in a .csv directly from online. This reads in the csv data (linked from this page: https://ourworldindata.org/coronavirus-source-data).

It takes a few seconds to download; watch for the `*` to change to a number.

**The two most important terms in the grammar of pandas are `DataFrame` and `Series`/column**. A `DataFrame` is equivalent to one tab of a spreadsheet (or one worksheet of a workbook). Each column of the `DataFrame` is one `Series`. The data object returned by `read_csv()` is held in a variable named `dfall`, where `df` stands for `DataFrame`.
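To make the two terms concrete without waiting for a download, here is a minimal sketch that reads a tiny made-up csv from a text string. The name `toy` and its contents are invented for illustration and are not part of the real dataset.

```python
import io
import pandas as pd

# A tiny made-up csv, standing in for the real download
csv_text = """iso_code,date,new_cases
AAA,2021-01-01,5
AAA,2021-01-02,7
BBB,2021-01-01,2
"""

toy = pd.read_csv(io.StringIO(csv_text))  # read_csv also accepts file-like objects

print(type(toy))             # the whole table is a DataFrame
print(type(toy.new_cases))   # one column of it is a Series
print(toy.shape)             # (rows, columns)
```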

In [None]:
url = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
dfall = pd.read_csv(url)  # this takes a few seconds because it has to download the file
#dfall = pd.read_csv(url, parse_dates=['date'])

As usual, if you mention any python variable at the end of a code cell, the notebook will try to print or summarize it. The `NaN` ('not a number') are empty cells in the csv (missing data).

In [None]:
dfall

That doesn't even have enough room to show all 67 columns in the data. What are all those columns?

In [None]:
dfall.columns   # pandas assumes the first row of the csv are column headers

In [None]:
# info() is a method (a function attached to the DataFrame), so it needs (); sometimes input parameters go in there.
dfall.info() # Data size, column headers, counts, and types

## Exercise

Note that the type listed for column `3 date` up there, is just `object` (i.e. text string). We want pandas to see that as type `datetime`. Go back up to the `read_csv(url)` line and uncomment the 2nd option to tell pandas to re-read the data that way.

# Cleaning the data
In Data Science, the majority of the work is obtaining, aggregating, and cleaning the data. This is a great dataset that doesn't need much work, but we should fill in some of the holes. 

In [None]:
# isnull() by itself prints too much stuff, sum() helps summarize
dfall.isnull() #.sum()

## Filling in missing values in a specific column
`iso_code` and `date` have 0 missing entries, that's good. But we're going to be dividing by `population` so missing data is not good there.

In [None]:
# This gives us the list of True/False for where all the missing values are in the population column
# The list is too long to see where those 1113 Trues are hiding
dfall.population.isnull()

In [None]:
# Make a slice that captures all the rows where population is missing
df_no_pop = #

In [None]:
df_no_pop

In [None]:
# The fillna() command allows you to say what you want for the missing values
dfall.population.fillna(1)

Go back and check if that worked. 

* Go back up and re-execute the `dfall.isnull().sum()`; how many missing entries does `population` now have?
* Re-execute the `df_no_pop` slice, how many rows are in it now?

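What happened is that `fillna(1)` returned a filled-in copy of the column (which could have been caught in a separate variable), but didn't change the column itself. To actually change it, assign the result back: `dfall['population'] = dfall.population.fillna(1)`. (Adding `, inplace=True` to the `fillna` arguments has the same intent, but recent pandas versions warn that it may not update the `DataFrame` when called on a column fetched this way.)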
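The difference between returning a filled-in copy and changing the column can be seen on a toy table. This is only a sketch: `toy`, `filled`, and the two counters are made-up names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'location': ['A', 'B', 'C'],
                    'population': [1000.0, np.nan, 3000.0]})

filled = toy['population'].fillna(1)                     # a new, filled-in Series
missing_before = int(toy['population'].isnull().sum())   # the original still has its hole
print(missing_before)

toy['population'] = filled                               # assigning back changes the DataFrame
missing_after = int(toy['population'].isnull().sum())
print(missing_after)
```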

## Filling in missing values in an entire DataFrame
We had to be especially careful with the population column, because we're going to need to divide by it. But for the rest, it's ok to fill in missing values with 0. This is actually quite easy:

In [None]:
dfall.fillna(0, inplace=True)

# Slicing a DataFrame down to a subset of columns
67 is way more columns than we need for this exercise! Just like in the last notebook we learned about *slicing* a `DataFrame` to get a `DataFrame` for a subset of the rows, we can also slice a smaller `DataFrame` of desired `Series`/columns. As always in Python, take careful note of the syntax, with `[]` inside `[]`, and quotes and commas
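The double brackets are less mysterious once the inner `[]` is seen as an ordinary Python list of column names. A minimal sketch on a made-up table (`toy`, `wanted`, and `small` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'iso_code': ['AAA', 'BBB'],
                    'location': ['Aland', 'Bland'],
                    'new_cases': [5, 2],
                    'new_tests': [50, 20]})

wanted = ['iso_code', 'new_cases']   # a plain Python list of column names
small = toy[wanted]                  # same as toy[['iso_code', 'new_cases']]

print(list(small.columns))
print(type(toy['new_cases']))        # single [] with one name gives a Series instead
```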

In [None]:
# dfall is all the data, let's call this smaller DataFrame df
df = dfall[  ['iso_code', 'location', 'date', 'new_cases', 'new_deaths', 'total_cases', 'population']  ]

## Exercise
In the following cell, try all these ways of inspecting the smaller `DataFrame`:
* `df`
* `df.head()`
* `df.tail()`
* `df.describe()`
* `df.info()`
* `df.isnull().sum()`

# Choosing individual columns by name

Any individual `Series` (column) can be fetched out of the `DataFrame` one of two ways:
* `df['iso_code']` always works
* `df.iso_code` works if the column name has no spaces or weird characters.
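Here is a short sketch of the two spellings side by side. The column name `new cases` (with a space) is made up to show where the dot form stops working.

```python
import pandas as pd

toy = pd.DataFrame({'iso_code': ['AAA', 'BBB'],
                    'new cases': [5, 2]})   # note the space in this made-up name

by_bracket = toy['iso_code']
by_dot = toy.iso_code
print(by_bracket.equals(by_dot))   # both spellings fetch the same Series

print(toy['new cases'].sum())      # a name with a space only works with []
```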

In [None]:
df.iso_code

In [None]:
df.iso_code.describe()

In [None]:
df.iso_code.head()

In [None]:
df.iso_code.tail()

In [None]:
df.iso_code.value_counts()

## Exercise
* Go back into those cells above, and edit to investigate some other Series, like 'location', 'date', 'new_cases', 'population'
* 'value_counts()' has different top numbers for `iso_code` vs `location` vs `date`. What does that mean?

# Slicing DataFrame rows with a condition
A `DataFrame` can be 'sliced' (filtered) based on conditions applied to the data. These actions can be read something like "The new variable USA is a `DataFrame` made from df by selecting all the rows for which the 'iso_code' is equal to the text constant 'USA'"

***Critically important Python NOTE:*** One = means the action **assignment**, whatever is on the right, put it into the left. It is the same as Snap's `Set <variable> to <value>` from the  yellow Variables tab.  Two == means the **question** *are these two things the same* (or in this context, *where* are these two the same?), and is equivalent to the Predicate = in Snap!, from the green Operators tab.

In [None]:
# this is a Series of True/False; it's True in all the rows with iso_code=='AFG'
df.iso_code == 'AFG' 

In [None]:
# catch that list of True/False in a variable
AFGrows = (df.iso_code == 'AFG')

Once we have a list of True/False flags for all the rows, we can use that to filter the DataFrame to a smaller DataFrame with just the rows flagged True. Note that the resulting DataFrame is all `iso_code==AFG`, and fewer rows.
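The same two-step idea, flags first and filter second, can be sketched on a four-row toy table (all names and values here are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'iso_code': ['AAA', 'BBB', 'AAA', 'CCC'],
                    'new_cases': [5, 2, 7, 1]})

flags = (toy.iso_code == 'AAA')   # Series of True/False, one per row
print(flags.tolist())
print(flags.sum())                # True counts as 1, so this counts the matching rows

subset = toy[flags]               # keep only the rows flagged True
print(subset)
```

Note that the filtered rows keep their original index labels, which is why a sliced `DataFrame` can show row numbers with gaps.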

In [None]:
AFG = df[ AFGrows ]
AFG

In [None]:
# This can all be done in one line,
# without the intermediate variable AFGrows
AFG = df[ df.iso_code == 'AFG' ]
AFG

As the cell above shows, the whole slice takes just one line per country.

## Exercise
Use the following cells to make a USA sliced DataFrame, and inspect it like USA.info(), USA.describe(), etc.

In [None]:
# Make a slice of all the rows for the USA
USA = #

In [None]:
# Use various methods to inspect the USA DataFrame


# Plotting Pandas Series with Matplotlib
Now that we have a handle on manipulating csv data with pandas, we turn to the main point, which is to be able to visualize the data graphically. Here's a naive plot to start with. 


In [None]:
fig = plt.figure(figsize=(16,8)) # 16x8 is twice as wide as tall
ax = plt.gca()

ax.plot(USA.date, USA.new_cases)

#ax.set_xlim( date(2022,1,1), date(2022,3,1) )

plt.show()

## Exercise

Why does it look so jaggedy like that? Use the `ax.set_xlim()` function (as in the MatplotlibIntro notebook) and the `date` objects (as in the top of this notebook) to narrow down the graph above to a couple-month range and get a closer look.

# Data smoothing
The USA curve (like all the curves) is jaggedy because bureaucrats don't file paperwork on the weekends. Pandas lets us create a smoother dataset with a rolling window.
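To see what a 7-day rolling window does, here is a sketch on made-up counts with an exaggerated weekend dip (the `daily` values are invented, not taken from the dataset):

```python
import pandas as pd

# Made-up daily counts for two weeks: weekdays report 70, weekends report 0
daily = pd.Series([70, 70, 70, 70, 70, 0, 0] * 2)

smooth = daily.rolling(window=7).mean()   # average of each day and the 6 days before it

print(smooth.tolist())
```

The first six entries are `NaN` because a full 7-day window is not yet available. After that, every window contains exactly one week's worth of reports, so the weekend dips average out.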

In [None]:
# Variable name "USAcs" is "USA new Cases (Smoothed)"
# Try this first as-is, and then with the .mean() 
USAcs = USA.new_cases.rolling(window=7) # .mean()
USAcs

Now go copy that matplotlib code from above into the cell below, and add a plot for this new smoothed dataset `USAcs`. Then you can remove the `set_xlim` and see the big picture. (And then you can remove the plot of the jaggedy `USA.new_cases` series)

# Decorating with annotations
There are a lot of ways to annotate a matplotlib graph, some of which we've seen before:
* `ax.text(x,y,s)`  at position x,y, write text string s
* `ax.vlines(x, ymin, ymax)` at x (or list of xs), draw vertical line from ymin to ymax
* `ax.hlines(y, xmin, xmax)` at y (or list of ys), draw horizontal line from xmin to xmax
* `plt.axvline(x)` draw full vertical line at x  (or fractional with `ymin` and `ymax` between 0..1)
* `plt.axhline(y)` draw full horizontal line at y (or fractional with `xmin` and `xmax` between 0..1)
* `ax.annotate(s, (x,y) )` basically the same as `ax.text(x,y,s)`
  * But you can add optional arguments `xytext=(x,y), arrowprops=dict(arrowstyle='->',color='r')` to displace the text label to `xytext` and point at the annotated `(x,y)` with an arrow.
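A minimal sketch combining three of these calls on made-up data. The dates, values, and label strings are invented for illustration and are not the ones the exercise below asks for.

```python
from datetime import date
import matplotlib.pyplot as plt

# Made-up data: four monthly values, just to have something to decorate
dates = [date(2021, 1, 1), date(2021, 2, 1), date(2021, 3, 1), date(2021, 4, 1)]
values = [10, 40, 25, 15]

fig = plt.figure(figsize=(8, 4))
ax = plt.gca()
ax.plot(dates, values)

plt.axvline(date(2021, 3, 1), color='r')           # full-height red line at one date
ax.text(date(2021, 3, 3), 35, 'Marker')            # plain text just right of the line
ax.annotate('Peak', (date(2021, 2, 1), 40),        # the point being annotated
            xytext=(date(2021, 1, 5), 30),         # where the label sits
            arrowprops=dict(arrowstyle='->', color='r'))

plt.show()
```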

In [None]:
# Run this cell and see what happens:
ax.text()         # NOTE: () with nothing in them

In [None]:
# Run this cell and see what happens:
help(plt.vlines)  # NOTE: no ()

Mark up the following graph:
* Use `plt.axvline()` to make red lines to mark the dates when the classes of 2021 and 2022 did this notebook, `date(2020,12,9)` and `date(2021,12,8)`
* Use `ax.text()` to write "Year 1" and "Year 2" next to those lines.
* Use `ax.annotate()` to write "Omicron" with an arrow pointing to the Christmas 2021 spike.

In [None]:
fig = plt.figure(figsize=(16,8)) # 16x8 is twice as wide as tall
ax = plt.gca()

ax.plot(USA.date, USAcs)

plt.show()

# Dual/Twin axes
Use the cells below to also plot `new_deaths` for the USA. The scale of the `new_deaths` numbers is so much smaller than `new_cases`, no detail is visible! It is necessary to scale the `new_deaths` with a separate Y-axis on the right.

In [None]:
# Use this cell to construct a data Series for "USA new Deaths (Smoothed)"
USAds = #...

In [None]:
# Once USAds is working above, uncomment the first plot()
# to plot both series together

fig = plt.figure(figsize=(16,8)) # 16x8 is twice as wide as tall
ax = plt.gca()

#ax.plot(USA.date, USAcs)
ax.plot(USA.date, USAds)

plt.show()

## Step one: twinx()
Instead of plotting both curves on `ax = plt.gca()`, make a separate `ax2 = ax.twinx()` and use that to plot `new_deaths`.

How/why does that work? `ax2` has the same (twinned) X-axis as `ax`, but it makes a separate Y-axis on the right
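A bare-bones sketch of the twinning on made-up numbers (`days`, `big`, and `small` are illustrative stand-ins for the date, cases, and deaths series):

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
big = [1000, 3000, 2000, 4000, 3500]   # made-up large-scale series
small = [10, 12, 30, 25, 40]           # made-up small-scale series

fig = plt.figure(figsize=(8, 4))
ax = plt.gca()
ax2 = ax.twinx()          # new Axes sharing the X-axis, with its own Y-axis on the right

ax.plot(days, big)        # scaled against the left Y-axis
ax2.plot(days, small)     # scaled against the right Y-axis

print(ax.get_ylim(), ax2.get_ylim())   # the two Y ranges are independent
plt.show()
```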

## Step two: differentiate with colors
Which blue curve is which? Why are they both blue? (Because `ax2` starts over with default coloring)

* Specify the colors in the `plot` commands
* Use `set_ylabel()` for both `ax` and `ax2`, and also specify `color=` to match the plot lines

## Step three: add a Legend

* Easier way: `fig.legend(['one','two'])` -- but because this is a *figure* legend, it is complicated to force it inside the Axes
* Harder way: Replace `ax = plt.gca()` with `ax = host_subplot(111)`, and then instead of `fig.legend` do `ax.legend`. This method requires the special import statement up top, to bring in `host_subplot`
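As a sketch of the harder way on made-up numbers, the following assumes that each `plot` call is given a `label=`, so that `ax.legend()` has names to show. The data, colors, and label text are invented for illustration.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

days = [1, 2, 3, 4, 5]
big = [1000, 3000, 2000, 4000, 3500]   # made-up series for the left axis
small = [10, 12, 30, 25, 40]           # made-up series for the right axis

fig = plt.figure(figsize=(8, 4))
ax = host_subplot(111)    # takes the place of plt.gca()
ax2 = ax.twinx()

ax.plot(days, big, color='tab:blue', label='big')
ax2.plot(days, small, color='tab:red', label='small')
ax.set_ylabel('big', color='tab:blue')
ax2.set_ylabel('small', color='tab:red')

legend = ax.legend()      # one legend, drawn inside the Axes
plt.show()
```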

# Comparing countries per-capita
Prepare a sliced `DataFrame` and smoothed `Series` like above, for ITA (Italy) or some other country of your choosing.

In [None]:
ITA = # slice data frame for rows where iso_code == 'ITA'
ITAcs = # ITAly new Cases (Smoothed)
ITAds = # ITAly new Deaths (Smoothed)

In [None]:
fig = plt.figure(figsize=(16,8)) # 16x8 is twice as wide as tall
ax = plt.gca()

ax.plot(USA.date, USAcs)
ax.plot(ITA.date, ITAcs)

ax.legend(['USA','Italy'])

plt.show()

Those curves are obviously incomparable because of quite different country populations. We can convert to per-million by applying arithmetic to whole series
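Whole-series arithmetic can be sketched on a three-row toy table (the numbers are made up). Dividing one `Series` by another works row by row, and multiplying by a constant applies to every row.

```python
import pandas as pd

# Made-up rows for one country: daily cases and a constant population
toy = pd.DataFrame({'new_cases': [500, 1000, 1500],
                    'population': [2_000_000, 2_000_000, 2_000_000]})

per_capita = toy.new_cases / toy.population   # row-by-row division
per_million = per_capita * 1_000_000          # every row times the same constant

print(per_million.tolist())
```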

In [None]:
# Try each of these, one at a time
USAcs                               # the whole number of USA cases per day (smoothed)
#USAcs / USA.population             # row-by-row divide to get per capita
#USAcs / USA.population * 1000000   # per capita is too small, so scale up to per million

In [None]:
# Create two data series here which are USA/ITAly smoothed cases, per Million
USAcsm = # ...
ITAcsm = # ...


In [None]:
# Now copy the plotting code from above and plot the USAcsm and ITAcsm curves together

# Homework

Choose four countries (two of them can be USA and ITAly), and make a figure with two subplots above and below (refer to Anscombe's Quartet for subplots: above and below are 2x1, so the `fig.add_subplot()` codes would be `211` and `212`)

Here is a table of `iso_code` abbreviations that you might find helpful (but you don't have to pick from these)

|iso_code  | location |iso_code  | location |
|:---------|:---------|:---------|:---------|
|AUS       | Australia|IRN       | Iran     |
|BRA       | Brazil   |ISR       | Israel   |
|CAN       | Canada   |MEX       | Mexico   |
|ESP       | Spain    |NZL       | New Zealand |
|FRA       | France   |RUS       | Russia   |
|GBR       | United Kingdom |ZAF | South Africa |

The subplot above should be `new_cases` for the four countries (smoothed, per million), and the subplot below should be `new_deaths` for the four countries (smoothed, per million).

The whole figure should be well-designed, styled, colored, labeled, etc.

In [None]:
# Use this cell to prepare the new_cases/deaths smoothed, per-million Series for each country


In [None]:
# Use this cell to make your figure


# World Aggregation
We can add all the countries together. `groupby(['date']).sum()` says we want a new `DataFrame`, with a row for every unique date, and all the other columns are added up per date.

(For some situations, it might make more sense to `groupby(['date']).mean()` or `min()` or `max()` or `median()`, and there are many other grouping options!)

Note how `world.head()` or `tail()` prints `date` differently than the column headers, and `date` doesn't even show up in the columns; that's because it is now the 'index', not a regular column. 

Repeat the cell with `.reset_index()` active, and you will see that date is set to a regular column again (and the index is just a running counter)
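The index behavior is easier to see on a tiny made-up table with two countries and two dates. In this sketch `toy`, `by_date`, and `flat` are illustrative names, and `numeric_only=True` is used so the text column is left out of the sums.

```python
import pandas as pd

toy = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
                    'iso_code': ['AAA', 'BBB', 'AAA', 'BBB'],
                    'new_cases': [5, 2, 7, 3]})

# numeric_only=True leaves the text column iso_code out of the sums
by_date = toy.groupby(['date']).sum(numeric_only=True)
print(by_date.index.name)       # 'date' is now the index...
print(list(by_date.columns))    # ...so it is missing from the columns

flat = by_date.reset_index()    # turn 'date' back into a regular column
print(list(flat.columns))
```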

In [None]:
world = df.groupby(['date']).sum()  #.reset_index()
# ignore the warnings
# or as the warning suggests, put numeric_only=True inside the sum()

In [None]:
world.tail()  # is 'date' formatted differently?

In [None]:
world.columns # does it include 'date'?

After the `.reset_index()` above is applied, the following code should make a barebones graph:

In [None]:
fig = plt.figure()
ax = plt.gca()

ax.plot(world.date, world.new_cases)
ax.plot(world.date, world.new_deaths)

plt.show()

# Homework
Improve the graph above in the following ways:
1. Initialize the figure to have a larger figsize with an attractive aspect ratio.
1. Use a 7-day rolling average instead of the raw numbers
1. Use set_ylabel() to describe the left and right axes, and use set_title() to title the whole chart
1. Use different colors for the cases/deaths graphs
1. Use plt.legend()
1. Use the plt.axvline() example above to annotate a few significant dates, such as the start of vaccination, the discovery of the Delta/Omicron variants, etc.
1. Decorate the graph with text, annotations, etc, highlighting significant events during the progress of the pandemic. Some ideas: release of the vaccine, waves like Delta/Omicron, date of 1M US deaths, dates public figures got infected or died, etc.

---
# Optional extra: Shifted Dates

This is the most complicated step. These curves would be more comparable if they were date-shifted, to reflect the different times when the pandemic hit different countries. A common technique is to line them all up based on when they had a certain common minimum number of cases, say 100. We will filter on a condition again.

In [None]:
# I could just type 100 in every line below, but this way if I want to experiment with a different value
# I can edit just 1 line, instead of having to edit a line for every country (especially as countries are added)
min_cases = 100
# SWE (Sweden) and KOR (South Korea) are sliced here, the same way as USA and ITA above
SWE = df[ df.iso_code == 'SWE' ]
KOR = df[ df.iso_code == 'KOR' ]
USAsh = USA[ USA['total_cases'] >= min_cases ]  # 'sh' for shift
ITAsh = ITA[ ITA['total_cases'] >= min_cases ]
SWEsh = SWE[ SWE['total_cases'] >= min_cases ]
KORsh = KOR[ KOR['total_cases'] >= min_cases ]

In [None]:
USAsh['date']

In [None]:
USAsh['date'].min()

In [None]:
ITAsh['date'].min()

In [None]:
USAsh['date'].min() - ITAsh['date'].min()

Now we can see above that Italy reached 100 cases on Feb 23, 10 days before the US on Mar 4. (And that date objects can be subtracted!)

Here are all the dates where these countries reached 100 cases:

In [None]:
USAt0 = USAsh['date'].min()
ITAt0 = ITAsh['date'].min()
SWEt0 = SWEsh['date'].min()
KORt0 = KORsh['date'].min()

Just like we were able to simply multiply and divide the entire 'new_cases' Series by constant numbers, we can subtract the start date from the date Series, yielding number of days since 100 cases:

In [None]:
USAsh['date'] - USAt0

In [None]:
USAshX = USAsh['date'] - USAt0
USAshY = USAsh['new_cases'].rolling(window=7).mean()/USAsh.population*1000000
plt.figure(figsize=(16,8))
ax = plt.gca()
ax.plot(USAshX, USAshY)
plt.show()

Note that plot goes from 0 to 2.5e16. Even though the description of `USAsh['date'] - USAt0` above says 'days', the values are stored internally as nanoseconds, and that is what matplotlib plots. We can fix this by forcing conversion to days.
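One way to get plain day counts, sketched here on three made-up dates, is the `.dt.days` accessor of a timedelta `Series` (`dates`, `t0`, `elapsed`, and `days` are illustrative names):

```python
import pandas as pd

# Made-up dates standing in for a country's 'date' column
dates = pd.Series(pd.to_datetime(['2020-03-04', '2020-03-05', '2020-03-14']))
t0 = dates.min()

elapsed = dates - t0          # a Series of timedeltas
days = elapsed.dt.days        # plain integers: whole days since t0

print(elapsed.dtype)
print(days.tolist())
```

Because `days` holds ordinary integers, matplotlib plots it on an ordinary numeric axis.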

In [None]:
USAshX = (USAsh['date'] - USAt0).dt.days   # whole days (newer pandas rejects astype('timedelta64[D]'))
USAshX

In [None]:
plt.figure(figsize=(16,8))
ax = plt.gca()
ax.plot(USAshX, USAshY)
ax.plot((ITAsh['date']-ITAt0).dt.days, ITAsh['new_cases'].rolling(window=7).mean()/ITAsh.population*1000000)
ax.plot((SWEsh['date']-SWEt0).dt.days, SWEsh['new_cases'].rolling(window=7).mean()/SWEsh.population*1000000)
ax.plot((KORsh['date']-KORt0).dt.days, KORsh['new_cases'].rolling(window=7).mean()/KORsh.population*1000000)
ax.legend(['USA', 'Italy', 'Sweden', 'South Korea'])
ax.set_xlim(0, 300)
ax.set_ylim(0, 1000)
plt.show()