# Introduction to Python, Jupyter, Pandas, and Matplotlib
This is a Jupyter Python notebook, which is a collection of cells. Each cell is either of type 'markdown' (formatted text, like this cell) or code (python, grey background). The two most important rules of Jupyter Notebooks are:
1. ***SHIFT-ENTER*** will cause the current cell to execute. 
  - For Markdown cells, 'execute' means render the formatting. ([Here's a markdown cheatsheet](https://sqlbak.com/blog/wp-content/uploads/2020/04/Jupyter-Notebook-Markdown-Cheatsheet.pdf))
  - For Code cells, 'execute' means run the python.
  - Some Code cells take a while to execute; watch for the `*` to change to a number.
1. Any cell can be edited (double-click into it) and re-executed (SHIFT-ENTER again).
--- 

The first code in any Python script or Jupyter notebook needs to import any libraries that will be used. The `as` directives allow specification of nicknames that are more convenient to type.

In [None]:
from datetime import date        # Because we're going to be plotting time series
import matplotlib.pyplot as plt  # This is for creating graphs
import pandas            as pd   # This greatly simplifies handling of tabular (csv) data

The `pandas` function `read_csv()` can read .csv files on your computer, but it's so smart it can even slurp in a .csv directly from online. This reads in the csv data linked to on this page: https://ourworldindata.org/coronavirus-source-data.

It takes a few seconds to download; watch for the `*` to change to a number.

**The two most important terms in the grammar of pandas are `Series` and `DataFrame`**. A `DataFrame` is equivalent to one tab of a spreadsheet (or one worksheet of a workbook). Each column of the `DataFrame` is one `Series`. The data object returned by `read_csv()` is held in a variable named `df`, which stands for `DataFrame`.

Note: Python is happy with specifying text constants (strings) with either single or double quotes. This notebook uses single quotes because they read cleaner.

In [None]:
url = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
df  = pd.read_csv(url)  # this takes a few seconds because it has to download the file
#df  = pd.read_csv(url, parse_dates=['date'])
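
To make the `Series`/`DataFrame` vocabulary concrete without waiting for a download, here is a minimal sketch on a tiny hand-made table. The name `toy` and its values are made up for illustration and are not part of the real dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the downloaded csv
toy = pd.DataFrame({
    'location':  ['Italy', 'Italy', 'Sweden'],
    'new_cases': [10, 20, 5],
})

column = toy['new_cases']       # one column pulled out of the DataFrame

print(type(toy).__name__)       # DataFrame
print(type(column).__name__)    # Series
print(toy.shape)                # (3, 2) meaning 3 rows, 2 columns
```

The same relationship holds for the much larger `df`: the whole table is a `DataFrame`, and each column taken on its own is a `Series`.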

One thing you can do in Jupyter Notebooks that you can't do in regular Python scripts (programs) is to just mention something at the end of a code cell, which is a request for the notebook to display it. If it's too big to be printed completely, it will be automatically summarized. For starters, you can try to get a glimpse of the whole `DataFrame` object itself. The `NaN` ('not a number') values are empty cells in the csv.

In [None]:
df

Here are a few more ways to describe/understand the data. Execute each cell and take a minute to read and understand what information is provided:

In [None]:
df.shape     # note shape is a 'data member', a variable that belongs to every data frame. (Rows, Columns)

In [None]:
df.columns   # pandas assumes the first row of the csv holds the column headers

In [None]:
# info is a 'method' (a function that belongs to every data frame), so it needs (); sometimes input parameters go in there.
df.info() # Data size, column headers, counts, and types

## Exercise

Note that the type listed for column `3 date` up there is just `object`. For proper handling, we want pandas to see that column as type `datetime`. Go back up to the `read_csv(url)` line, uncomment the 2nd option, and re-run that cell to tell pandas to re-read the data that way.
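
To see what `parse_dates` changes without re-downloading anything, here is a small sketch that reads a csv from a text string instead of a URL. The csv text is invented for illustration, and `io.StringIO` is only there so `read_csv()` can treat the string like a file.

```python
import io
import pandas as pd

# Hypothetical three-line csv standing in for the real download
csv_text = (
    "location,date,new_cases\n"
    "Italy,2020-02-24,100\n"
    "Italy,2020-02-25,120\n"
)

as_text  = pd.read_csv(io.StringIO(csv_text))                        # dates stay plain text
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])  # dates become real dates

print(as_text['date'].dtype)    # a text type
print(as_dates['date'].dtype)   # a datetime type
```

Only the second version lets you do date arithmetic later in this notebook, such as subtracting one date from another.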

In [None]:
df.describe() # some common statistics on each column, note you can scroll to the right

In [None]:
df.head() # first few rows of data (default 5)

In [None]:
df.tail(8) # last few rows of data (we specify 8 here, just to show how)

# Choosing individual columns by name

Any individual `Series` can be fetched out of the data frame using square brackets, and some of the same functions apply, as well as a few others.

In [None]:
df['iso_code']

In [None]:
df['iso_code'].describe()

In [None]:
df['iso_code'].head()

In [None]:
df['iso_code'].tail()

In [None]:
df['iso_code'].value_counts()
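
If the output of `value_counts()` is hard to read on the full dataset, a tiny made-up `Series` can help. This is only a sketch, and the codes below are hypothetical sample values.

```python
import pandas as pd

# Hypothetical sample values, NOT the real column
codes = pd.Series(['USA', 'ITA', 'USA', 'SWE', 'USA', 'ITA'])

counts = codes.value_counts()   # how many times each distinct value appears
print(counts)
print(counts['USA'])            # 3
```

The result is itself a `Series`, sorted so the most frequent value comes first.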

## Exercise
Go back into those cells above, and edit to investigate some other Series, like 'location', 'date', 'new_cases'

Change the first line to `series = df['iso_code']`, and for the rest of the lines use the variable `series` instead of repeating `df['iso_code']` all the time. 

Then edit `'iso_code'` in the first line to other column names, and rerun the following cells to see different results.

'value_counts()' has different top numbers for `iso_code` vs `location` vs `date`. What does that mean?

---


# Slicing a DataFrame down to a subset of columns
A smaller `DataFrame` can be created from any group of `Series` that you choose. As always in Python, take careful note of the syntax, with `[]` inside `[]`, and quotes and commas

In [None]:
# Let's call this smaller DataFrame df6, because we're selecting 6 columns of interest
df6 = df[  ['iso_code', 'location', 'date', 'new_cases', 'new_deaths', 'total_cases']  ]
df6.head()
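
The difference between one pair of brackets and two is easy to miss, so here is a minimal sketch on a made-up table (the name `toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'iso_code':  ['USA', 'ITA'],
    'date':      ['2020-03-01', '2020-03-01'],
    'new_cases': [7, 240],
})

one_column    = toy['new_cases']                 # single brackets: a Series
two_columns   = toy[['iso_code', 'new_cases']]   # a list inside brackets: a smaller DataFrame
still_a_frame = toy[['new_cases']]               # a list of one name is still a DataFrame

print(type(one_column).__name__)      # Series
print(type(two_columns).__name__)     # DataFrame
print(type(still_a_frame).__name__)   # DataFrame
```

The inner `[]` is an ordinary Python list of column names; the outer `[]` is the "fetch from the DataFrame" operation.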

# Filtering DataFrame rows with a condition
A `DataFrame` can be 'sliced' (filtered) based on conditions applied to the data. These actions can be read something like "The new variable USA is a `DataFrame` made from df6 by selecting all the rows for which the 'iso_code' is equal to the text constant 'USA'"

***Critically important Python NOTE:*** A single `=` means **assignment**: whatever is on the right is put into the variable on the left. It is the same as Snap!'s `Set <variable> to <value>` from the yellow Variables tab. A double `==` asks the **question** *are these two things the same?* (or in this context, *where* are these two the same?), and is equivalent to the Predicate = in Snap!, from the green Operators tab.

In [None]:
AFGrows = df6['iso_code'] == 'AFG' # this is a Series of False/True, one per row of df6; it's True in all the rows with iso_code=='AFG'
AFGrows

Once we have a list of True/False flags for all the rows, we can use that to filter the DataFrame to a smaller DataFrame with just the rows flagged True. Note that the resulting DataFrame is all `iso_code==AFG`, and fewer rows.

In [None]:
AFG = df6[ AFGrows ]
AFG
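
The two steps above (build the True/False flags, then filter with them) can be sketched on a four-row made-up table, small enough to check by eye. The names `toy`, `mask`, and `afg` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'iso_code':  ['AFG', 'USA', 'AFG', 'ITA'],
    'new_cases': [1, 2, 3, 4],
})

mask = toy['iso_code'] == 'AFG'   # == asks the question row by row; = stores the answers in mask
afg  = toy[mask]                  # keep only the rows flagged True

print(mask.tolist())              # [True, False, True, False]
print(afg['new_cases'].tolist())  # [1, 3]
```

Note that the filtered rows keep their original index labels (0 and 2 here), which is why the row numbers in a filtered `DataFrame` can have gaps.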

Note we can do the above steps in one line per country, like this:

In [None]:
USA = df6[ df6['iso_code'] == 'USA' ] # Set variable USA to the DataFrame of the rows of df6 with iso_code==USA
ITA = df6[ df6['iso_code'] == 'ITA' ]
SWE = df6[ df6['iso_code'] == 'SWE' ]
KOR = df6[ df6['iso_code'] == 'KOR' ]

## Exercise
Use the following cell to try various looks at these sliced DataFrames, like USA.info() or SWE.tail(), etc.

In [None]:

USA.info()

# Plotting pandas Series with Matplotlib
Now that we have a handle on manipulating csv data with pandas, we turn to the main point, which is to be able to visualize the data graphically. Here's a naive plot to start with. 


In [None]:
USAx = USA['date']      # grab the Series 'date'      out of the USA DataFrame into a variable called USAx
USAy = USA['new_cases'] # grab the Series 'new_cases' out of the USA DataFrame into a variable called USAy
plt.figure(figsize=(16,8))
ax = plt.gca()
ax.plot(USAx, USAy)

# Uncomment these lines, one at a time, to see what they do
#ax.plot(ITA['date'], ITA['new_cases']) # The same can be done inside the plot command
#ax.plot(SWE['date'], SWE['new_cases'])
#ax.plot(KOR['date'], KOR['new_cases'])
#ax.legend(['USA', 'Italy', 'Sweden', 'South Korea'])

plt.show()

First off, those curves are not directly comparable because the countries have very different populations. We can convert to cases per million by applying arithmetic to the whole Series (this requires knowing each country's population).

In [None]:
USA['new_cases']
# USA['new_cases'] / 331000000             # USA population is about 331M, so this is per capita
# USA['new_cases'] / 331000000 * 1000000   # per capita is too small, so scale up to per million

In [None]:
# Same as the above plot, except scaling per million (USA~331M, etc)
USAperM = USA['new_cases']/331
plt.figure(figsize=(16,8))
ax = plt.gca()
ax.plot(USAx, USAperM)
ax.plot(ITA['date'], ITA['new_cases']/60) # again, all this can be done inline
ax.plot(SWE['date'], SWE['new_cases']/10)
ax.plot(KOR['date'], KOR['new_cases']/51)
ax.legend(['USA', 'Italy', 'Sweden', 'South Korea'])
plt.show()
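
One possible way to avoid retyping a population on every line is to keep the populations in a dictionary and loop over it. This is only a sketch: the populations are the same rounded figures (in millions) used in the cell above, while `toy` and its case counts are made up for illustration.

```python
import pandas as pd

# Populations in millions, the same rounded figures used in the plot above
population_millions = {'USA': 331, 'ITA': 60, 'SWE': 10, 'KOR': 51}

# Hypothetical toy rows standing in for df6
toy = pd.DataFrame({
    'iso_code':  ['USA', 'USA', 'ITA', 'ITA'],
    'new_cases': [3310, 6620, 600, 1200],
})

per_million = {}
for code, millions in population_millions.items():
    rows = toy[toy['iso_code'] == code]                 # filter to one country
    per_million[code] = rows['new_cases'] / millions    # divide the whole Series at once

print(per_million['USA'].tolist())   # [10.0, 20.0]
print(per_million['ITA'].tolist())   # [10.0, 20.0]
```

Adding a country then means adding one entry to the dictionary rather than another hand-edited plot line.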

Next, let's smooth out those weekly cycles with a 7-day rolling average.

In [None]:
USAperMrollWeek = USAperM.rolling(window=7) # this tracks statistics on a rolling window of 7 rows (days)
USAperMweekAvg  = USAperMrollWeek.mean()    # this grabs the mean() (average), as opposed to min(), max(), median(), etc.
plt.figure(figsize=(16,8))
ax=plt.gca()
ax.plot(USAx, USAperMweekAvg)
ax.plot(ITA['date'], ITA['new_cases'].rolling(window=7).mean()/60)  # These inlines are starting to get 
ax.plot(SWE['date'], SWE['new_cases'].rolling(window=7).mean()/10)  #    unmanageably complex
ax.plot(KOR['date'], KOR['new_cases'].rolling(window=7).mean()/51)
ax.legend(['USA', 'Italy', 'Sweden', 'South Korea'])


# Try these out...
#plt.axvline('2020-12-09', c='r', ls=':')
#plt.text(x='2020-12-15', y=100, s="Last year's class", c='r')


plt.show()
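
To see exactly what `rolling(...).mean()` computes, here is a sketch on six made-up numbers with a window of 3 instead of 7, so the arithmetic can be checked by hand. The `min_periods` variant is shown as an optional extra, not something the cells above use.

```python
import pandas as pd

daily = pd.Series([1, 2, 3, 4, 5, 6])   # hypothetical daily counts

# Each value is the average of itself and the 2 values before it
smoothed = daily.rolling(window=3).mean()
print(smoothed.tolist())   # [nan, nan, 2.0, 3.0, 4.0, 5.0]

# Optional: allow shorter windows at the start instead of NaN
filled = daily.rolling(window=3, min_periods=1).mean()
print(filled.tolist())     # [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
```

With `window=7`, the first six values are `NaN` for the same reason: there are not yet seven days to average.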

# Exercise
Modify this notebook to make the following changes:

1. Remove S. Korea (they did so well, they can barely be seen on the graph!)
1. Add **TWO** countries of your choosing that make this graph more interesting (note you'll have to google the population of those two new countries)
1. Change the graph to be **deaths per million** (so far all the graphs have been  **cases per million**) 

Here is a table of abbreviations that you might find helpful (but you don't have to pick from these)

|iso_code  | location |iso_code  | location |
|:---------|:---------|:---------|:---------|
|AUS       | Australia|IRN       | Iran     |
|BRA       | Brazil   |ISR       | Israel   |
|CAN       | Canada   |MEX       | Mexico   |
|ESP       | Spain    |NZL       | New Zealand |
|FRA       | France   |RUS       | Russia   |
|GBR       | United Kingdom |ZAF | South Africa |


# World Aggregation
We can add all the countries together. `groupby(['date']).sum()` says we want a new dataframe, with a row for every unique date, and all the other columns are added up per date.

(For some situations, it might make more sense to `groupby(['date']).mean()`)

Note how `world.head()` prints the date in bold; that's because it is now the 'index', not a regular column. Repeat the cell with `.reset_index()` active, and you will see that date is set to a regular column again (and the index is just a running counter)

In [None]:
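# numeric_only=True adds up only the number columns (recent pandas versions otherwise try to 'add' the text columns too)
world = df.groupby(['date']).sum(numeric_only=True)    #.reset_index()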
world.head()
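
Here is a minimal sketch of the same `groupby` idea on a made-up table with two countries and two dates, so the sums can be checked by hand. The name `toy` and its values are hypothetical, and the sketch picks out the one numeric column before summing.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'location':  ['Italy', 'Sweden', 'Italy', 'Sweden'],
    'date':      ['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02'],
    'new_cases': [10, 1, 20, 2],
})

by_date = toy.groupby('date')[['new_cases']].sum()   # one row per date; 'date' becomes the index
flat    = by_date.reset_index()                      # 'date' goes back to being a regular column

print(by_date['new_cases'].tolist())   # [11, 22]
print(by_date.index.name)              # date
print(list(flat.columns))              # ['date', 'new_cases']
```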

All the same kinds of plotting that happened above can be done with this world dataframe. If `.reset_index()` is used, then the x for the plots can be `world['date']` as before; if `.reset_index()` is *not* used, then the x for plotting must be `world.index`.

In [None]:
dateX   = world.index         # or world['date'] if .reset_index() is used
casesY  = world['new_cases']  # rolling average gets applied on these two lines
deathsY = world['new_deaths']

plt.figure()
ax = plt.gca()
ax.plot(dateX, casesY)

# This is how you have a 2nd y axis on the right; two Axes share the same (twin) X axis
ax2 = ax.twinx()  
ax2.plot(dateX, deathsY)

plt.show()

# Exercise
Improve the graph above in the following ways:
1. Initialize the figure to have a larger figsize with an attractive aspect ratio.
1. Use a 7-day rolling average instead of the raw numbers
1. Use set_ylabel() to describe the left and right axes, and use set_title() to title the whole chart
1. Use different colors for the cases/deaths graphs
1. Use plt.legend()
1. Use the plt.axvline() example above to annotate a few significant dates, such as the start of vaccination, the discovery of the Delta/Omicron variants, etc.
1. Use plt.text() or plt.annotate() (example in the MatplotlibIntro notebook) to annotate 'First Wave', 'Second Wave', etc.

---
# Optional extra: Shifted Dates

This is the most complicated step. These curves would be more comparable if they were date-shifted, to reflect the different times when the pandemic hit different countries. A common technique is to line them all up based on when each country reached a certain minimum number of total cases, say 100. We will filter on a condition again.

In [None]:
# I could just type 100 in every line below, but this way if I want to experiment with a different value
# I can edit just 1 line, instead of having to edit a line for every country (especially as countries are added)
min_cases = 100
USAsh = USA[ USA['total_cases'] >= min_cases ]  # 'sh' for shift
ITAsh = ITA[ ITA['total_cases'] >= min_cases ]
SWEsh = SWE[ SWE['total_cases'] >= min_cases ]
KORsh = KOR[ KOR['total_cases'] >= min_cases ]

In [None]:
USAsh['date']

In [None]:
USAsh['date'].min()

In [None]:
ITAsh['date'].min()

In [None]:
USAsh['date'].min() - ITAsh['date'].min()

Now we can see above that Italy reached 100 cases on Feb 24, 8 days before the US on Mar 3. (And that date objects can be subtracted!)
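
The date subtraction can also be tried in isolation. This sketch uses the two dates quoted above and assumes the year is 2020 (an assumption, since the sentence above gives only month and day).

```python
import pandas as pd

# The two dates mentioned above; the year 2020 is an assumption
italy_start = pd.Timestamp('2020-02-24')
usa_start   = pd.Timestamp('2020-03-03')

gap = usa_start - italy_start   # subtracting two dates gives a Timedelta
print(gap)                      # 8 days 00:00:00
print(gap.days)                 # 8
```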

Here are all the dates where these countries reached 100 cases:

In [None]:
USAt0 = USAsh['date'].min()
ITAt0 = ITAsh['date'].min()
SWEt0 = SWEsh['date'].min()
KORt0 = KORsh['date'].min()

Just like we were able to simply multiply and divide the entire 'new_cases' Series by constant numbers, we can subtract the start date from the date Series, yielding number of days since 100 cases:

In [None]:
USAsh['date'] - USAt0

In [None]:
USAshX = USAsh['date'] - USAt0
USAshY = USAsh['new_cases'].rolling(window=7).mean()/331
plt.figure(figsize=(16,8))
ax = plt.gca()
ax.plot(USAshX, USAshY)
plt.show()

Note that the plot's x axis goes from 0 to about 2.5e16. Even though the output of `USAsh['date'] - USAt0` above says 'days', the values are stored internally as nanoseconds, and those raw numbers are what matplotlib plots. We can fix this by forcing conversion to days.

In [None]:
USAshX = (USAsh['date'] - USAt0).dt.days   # .dt.days gives whole days as plain integers
USAshX
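
One way to turn a `Series` of time differences into plain day counts is the `.dt.days` accessor, sketched here on three made-up dates. The names `dates`, `elapsed`, and `days` are hypothetical.

```python
import pandas as pd

# Hypothetical dates for illustration only
dates = pd.Series(pd.to_datetime(['2020-03-03', '2020-03-04', '2020-03-10']))

elapsed = dates - dates.min()   # a Series of time differences (Timedelta values)
days    = elapsed.dt.days       # the same differences as whole-number days

print(days.tolist())            # [0, 1, 7]
```

Plain integers like these plot as an ordinary numeric axis, so the x axis reads directly as "days since the start date".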

In [None]:
plt.figure(figsize=(16,8))
ax = plt.gca()
ax.plot(USAshX, USAshY)
ax.plot((ITAsh['date']-ITAt0).dt.days, ITAsh['new_cases'].rolling(window=7).mean()/60)
ax.plot((SWEsh['date']-SWEt0).dt.days, SWEsh['new_cases'].rolling(window=7).mean()/10)
ax.plot((KORsh['date']-KORt0).dt.days, KORsh['new_cases'].rolling(window=7).mean()/51)
ax.legend(['USA', 'Italy', 'Sweden', 'South Korea'])
ax.set_xlim( (0, 300) )
plt.show()