<div>
    <img style="float:right;" src="images/smi-logo.png"/>
    <div style="float:left;color:#58288C;"><h1>Introduction to Python for Data Science</h1></div>
</div>

---
# Notebook 3: Pandas
This notebook introduces the `pandas` package as a convenient toolset to work with tabular data.

## Contents

[1. Importing data from APIs](#chapter1)  
[2. Introduction to DataFrames](#chapter2)  
[3. Simple data visualization](#chapter3)  
[4. INSIDER Task](#chapter4)

---

# 1. Importing data from APIs <a id="chapter1"/>

We'll start this session by using [REST APIs](https://en.wikipedia.org/wiki/Representational_state_transfer) to retrieve some data. In short, a REST API is queried with the same HTTP methods a browser uses to retrieve a webpage. Instead of the HTML description of a webpage, however, we receive the data itself.

For practice, [this](https://github.com/public-apis/public-apis) is a list of publicly available APIs. In this notebook we are going to use `corona-api.com`, which provides recent COVID infection data from the WHO and Johns Hopkins University in a compact data format.

To get a first impression, point your browser to http://corona-api.com/timeline. The displayed data is a nested data structure: some sections correspond to Python lists, others to dictionaries.

Let's retrieve this data, step by step...

In [None]:
import requests    # package to send HTTP queries to the API

link = "https://corona-api.com/timeline"   # URL to query; you can try http://corona-api.com/countries/DE instead, or replace DE with another country code

res = requests.get(link)   # send a GET request to that URL, store the response in the variable "res"
raw_data = res.json()      # parse the JSON response into Python lists and dictionaries
# raw_data                 # uncomment this line and execute the cell to inspect the data
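
Network requests can fail or hang, so it can help to make the request a little more defensive. The following is only a minimal sketch and not part of the original notebook: the helper name `fetch_json` and the 10-second timeout are assumptions chosen for illustration, while `timeout=` and `raise_for_status()` are regular features of the `requests` package.

```python
import requests


def fetch_json(url, timeout=10):
    """Hypothetical helper: fetch a URL and return the parsed JSON body."""
    res = requests.get(url, timeout=timeout)  # give up if the server does not answer in time
    res.raise_for_status()                    # raise an exception for HTTP error codes (4xx, 5xx)
    return res.json()                         # parse the JSON text into Python lists/dictionaries


# Usage (requires network access, so it is left commented out here):
# raw_data = fetch_json("https://corona-api.com/timeline")
```

With this variant a failed request stops with a clear error message instead of silently returning an error page.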

---
## <span style="color:#FF5D02;">Assignment: Analyze data structure </span>

The retrieved raw data should look like this:

```
{'data': [  
  {'updated_at': '...',
   'date': '2022-09-13',
   'deaths': 6467297,
   'confirmed': 604938009,
   'recovered': 0,
   'new_confirmed': 508345,
   'new_recovered': 0,
   'new_deaths': 1446,
   'active': 598470712},
  {'updated_at': '...',
   'date': '2022-09-12',
   'deaths': 6465852,
   'confirmed': 604429805,
   'recovered': 0,
   'new_confirmed': 313811,
   'new_recovered': 0,
   'new_deaths': 740,
   'active': 597963953},
   ...
  ]
```

Please take a minute to describe for yourself what data structures you recognize! Hint: they're nested inside each other.


**Hints**

Is it a list of dictionaries? A dictionary with keys that contain lists as values? A dictionary of dictionaries?

Recap how we accessed data inside dictionaries and lists.  
Try to access some of the data fields with the syntax you've learned previously to access dictionaries and lists:

In [None]:
raw_data[...][...]

Expand the following two cells to see the solution!

In [None]:
raw_data["data"][0] # this returns the most recent record

In [None]:
raw_data["data"][0]["confirmed"] # this returns the number of confirmed cases from the most recent record
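
The same access pattern also works in a loop. As a small offline sketch (so it runs without the API), the snippet below re-types a shortened version of the two sample records shown above into a variable called `sample`, a name chosen here purely for illustration, and walks through the list of dictionaries:

```python
# Shortened copy of the sample response shown above (only three fields per record)
sample = {
    "data": [
        {"date": "2022-09-13", "confirmed": 604938009, "new_confirmed": 508345},
        {"date": "2022-09-12", "confirmed": 604429805, "new_confirmed": 313811},
    ]
}

print(type(sample))             # the outer structure is a dictionary
print(type(sample["data"]))     # the value stored under "data" is a list
print(type(sample["data"][0]))  # each list element is again a dictionary

# Loop over the list and read two fields from every record
for record in sample["data"]:
    print(record["date"], record["new_confirmed"])

# Collect one field from all records with a list comprehension
dates = [record["date"] for record in sample["data"]]
print(dates)
```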

---

# 2. Introduction to DataFrames<a id="chapter2"/>
You probably noticed that working purely with lists of dictionaries and the like is not very convenient.
`pandas` is a central tool for reading and manipulating data in Python. For our purposes, the `DataFrame` data structure is the most important:

> **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types.  
> You can think of it like a spreadsheet or SQL table [...]. It is generally the most commonly used pandas object.  
> [(Source)](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

DataFrames contain rows and columns and distinguish between regular data and the index, a column that contains a unique identifier for each row.
For example, the COVID dataset we just imported would look like this:

<img src="images/dataframe.png"/>  

The full documentation can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). 

Let's explore, step by step, how DataFrames make life easier when exploring data.

In [None]:
import pandas
import requests

raw_data = requests.get("https://corona-api.com/timeline").json()  # fetch data
df = pandas.DataFrame(raw_data["data"])     # raw_data["data"] contains the relevant datatable as a list of dictionaries [{},{},{},{},....], see above

In [None]:
# let's have a look at our brand new dataframe...
df

In [None]:
df.head(5)   # shows the first n rows

In [None]:
df.sample(5) # shows n random rows

In [None]:
df.count()   # shows the number of non-missing entries per column
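
If the API is not reachable, you can still try these commands on a tiny DataFrame. The sketch below builds one from two shortened records copied from the sample response shown earlier; the variable names `records` and `small_df` are made up for this illustration.

```python
import pandas

# Two shortened records copied from the sample response shown earlier
records = [
    {"date": "2022-09-13", "new_confirmed": 508345, "new_deaths": 1446},
    {"date": "2022-09-12", "new_confirmed": 313811, "new_deaths": 740},
]

small_df = pandas.DataFrame(records)   # every dictionary becomes a row, every key a column

print(small_df.shape)           # (number of rows, number of columns)
print(list(small_df.columns))   # the column names, taken from the dictionary keys
print(small_df.dtypes)          # the data type pandas inferred for each column
```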

## 2.1. Preparing the dataset

Before working with the data, we usually want to remove/rename some columns, sort the data, apply filters or partition the data.
In this chapter we'll briefly walk through some commonly used functions to prepare datasets.

> **Important**: Most DataFrame methods return a modified copy and leave the original DataFrame unchanged, unless you explicitly tell the method to apply the change directly (`inplace=True`). So you usually have two options to apply changes:  
>
> `df = df.change_something(...)             # assign the copy with the change to the original variable`  
> `df.change_something(..., inplace=True)    # apply the change to the original dataframe`
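
To see the difference in action, here is a small sketch on a throwaway DataFrame. The names `demo`, `renamed` and `columns_after_step_1` are invented for this example; the two numbers are taken from the sample response above.

```python
import pandas

demo = pandas.DataFrame({"new_confirmed": [508345, 313811]})

# 1) Calling the method without assignment: the result is thrown away, "demo" stays unchanged
demo.rename(columns={"new_confirmed": "new_cases"})
columns_after_step_1 = list(demo.columns)
print(columns_after_step_1)    # still the old column name

# 2) Keeping the returned copy in a variable
renamed = demo.rename(columns={"new_confirmed": "new_cases"})
print(list(renamed.columns))   # the copy carries the new column name

# 3) Changing the original object directly
demo.rename(columns={"new_confirmed": "new_cases"}, inplace=True)
print(list(demo.columns))      # now the original is changed as well
```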

In [None]:
# Let's prepare the dataframe...

# First delete / rename some columns
df = df.drop(["updated_at", "deaths", "confirmed", "new_recovered", "recovered"], axis="columns")

df = df.rename(columns = {                      # pass a dictionary of "oldname": "newname" pairs to rename columns
    "new_confirmed": "new_cases", 
})

df = df.sort_values("date", ascending=True)     # sort data ascending

df = df.set_index("date")                       # set date column as unique identifier for records (index)

In [None]:
df.head() # check result
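
Because each of these methods returns a DataFrame, the preparation steps can also be written as one chain. This is just an alternative way of writing the same steps, sketched on two shortened sample records so that it runs without the API; `records` and `prepared` are names chosen for this example.

```python
import pandas

# Two shortened records copied from the sample response shown earlier
records = [
    {"updated_at": "...", "date": "2022-09-13", "new_confirmed": 508345, "new_deaths": 1446},
    {"updated_at": "...", "date": "2022-09-12", "new_confirmed": 313811, "new_deaths": 740},
]

prepared = (
    pandas.DataFrame(records)
    .drop(["updated_at"], axis="columns")             # remove a column
    .rename(columns={"new_confirmed": "new_cases"})   # rename a column
    .sort_values("date", ascending=True)              # oldest day first
    .set_index("date")                                # use the date as index
)
print(prepared)
```

The surrounding parentheses only allow the chain to span several lines; each step works on the result of the step before it.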

## 2.2. Selections and filtering

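DataFrames accept filters/selections in square brackets. In this notebook we use plain `df[row_filter]` with a True/False vector as row filter, and `df.loc[row_filter, column_filter]` to select rows and columns by name. The expressions and inner workings can be quite different, so we'll look at some of the most helpful ways.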

### Select by True/False vector

In [None]:
# by passing a vector of True and False as row_filter, we toggle which rows we want to keep
# a bool condition like the following generates such a structure, in this case with the date as index ... try it!

df.new_cases > 30000

In [None]:
# let's pass this as row_filter
peaks = df[df.new_cases > 30000]                                # select all days with > 30,000 cases
peaks.sort_values("new_cases", ascending=False)                 # show dataframe, sorted by "worst days" first

In [None]:
df[(df.new_cases > 30000) & (df.new_deaths < 600)]              # Use bool algebra operators & ("and") and | ("or") to combine filters

In [None]:
df[(df.new_cases > 30000) | (df.new_deaths < 600)]              # Important: don't forget the parentheses around each condition
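
Why are the parentheses needed? In Python, `&` and `|` bind more tightly than comparisons like `>`, so without parentheses Python would try to evaluate `30000 & df.new_deaths` first. One way around this is to store each filter in its own variable. The sketch below uses a tiny DataFrame with made-up numbers; `toy`, `many_cases` and `few_deaths` are names invented for this example.

```python
import pandas

# Made-up numbers, for illustration only
toy = pandas.DataFrame(
    {"new_cases": [10, 50, 40, 5], "new_deaths": [1, 9, 2, 0]},
    index=["day1", "day2", "day3", "day4"],
)

many_cases = toy.new_cases > 30     # Series of True/False values, one per row
few_deaths = toy.new_deaths < 5

print(many_cases)
print(toy[many_cases & few_deaths])   # rows where both conditions hold
print(toy[many_cases | few_deaths])   # rows where at least one condition holds
print(toy[~many_cases])               # "~" negates a filter: rows without many cases
```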

### Select by naming relevant rows, columns

In [None]:
# Select with the .loc indexer: .loc[row labels, column names]

df.loc["2020-08-01","new_cases"]   # single day, single column

In [None]:
df.loc["2020-09-01":"2020-09-07", "new_cases":"new_deaths"]  # ranges of days, range of columns

In [None]:
df.loc["2020-09-01":"2020-09-07", : ]  # ranges of days, all columns (full range)

In [None]:
df.loc[["2020-02-01","2020-03-01"], ["new_deaths", "active"]]  # subsets via lists of days (index) and columns
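
One detail that is easy to miss: ranges in `.loc` include the end label, whereas ordinary list slicing excludes the end position. A short sketch with made-up numbers (the names `toy`, `by_label` and `values` are chosen for this example):

```python
import pandas

# Made-up numbers, for illustration only
toy = pandas.DataFrame(
    {"new_cases": [10, 20, 30, 40], "new_deaths": [1, 2, 3, 4], "active": [100, 200, 300, 400]},
    index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"],
)

by_label = toy.loc["2020-09-02":"2020-09-03", "new_cases":"new_deaths"]
print(by_label)        # rows 09-02 AND 09-03, columns new_cases AND new_deaths

values = [10, 20, 30, 40]
print(values[1:2])     # list slicing excludes the end position: only one element
```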

---
## <span style="color:#FF5D02;">Assignment: Data selection</span>

Generate a new dataframe that only contains new_cases for January 2021!

**Hints**

The `df.loc` function can filter rows (here: dates) and columns (here: data fields)

You need to use a range of days and a single column (see example above)

In [None]:
# Solution

df.loc["2021-01-01":"2021-01-31", "new_cases"]
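
Strictly speaking, selecting a single column by its name returns a pandas `Series` (one column with its index) rather than a DataFrame. If you need a real DataFrame, pass the column name inside a list. A small sketch with made-up numbers; `toy`, `as_series` and `as_frame` are names invented for this example:

```python
import pandas

# Made-up numbers, for illustration only
toy = pandas.DataFrame(
    {"new_cases": [10, 20, 30, 40], "new_deaths": [1, 2, 3, 4]},
    index=["2020-12-31", "2021-01-01", "2021-01-31", "2021-02-01"],
)

as_series = toy.loc["2021-01-01":"2021-01-31", "new_cases"]     # column name as string
as_frame = toy.loc["2021-01-01":"2021-01-31", ["new_cases"]]    # column name inside a list

print(type(as_series))
print(type(as_frame))
```

For most calculations and plots in this notebook either form works.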

---

## 2.3. Calculations and simple statistics

In [None]:
# Calculations with whole columns work like calculations with single variables (applied row by row)

df["death_rate"] = df.new_deaths / df.active

In [None]:
# Calculate common descriptive statistics for numeric columns of the whole dataframe
df.describe()

In [None]:
# ... or for a single column
df.new_cases.describe()   # note the scientific notation ("e notation") in the result, if unknown, check here: https://en.wikipedia.org/wiki/Scientific_notation

In [None]:
# or just a specific metric :-)

print("New Cases")
print("Mean: ", df.new_cases.mean())
print("Median: ", df.new_cases.median())
print("Maximum: ", df.new_cases.max())
print("20% quantile: ", df.new_cases.quantile(0.2))
print("80% quantile: ", df.new_cases.quantile(0.8))
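
In case quantiles are new to you: the q quantile is the value below which roughly a share q of the observations fall, and by default pandas interpolates linearly between neighboring values. The following sketch uses five made-up numbers in a Series called `cases` (a name chosen for this example) so you can check the results by hand.

```python
import pandas

# Made-up numbers, for illustration only
cases = pandas.Series([10, 20, 30, 40, 50])

print("Mean: ", cases.mean())
print("Median: ", cases.median())
print("20% quantile: ", cases.quantile(0.2))   # lies between the two smallest values
print("80% quantile: ", cases.quantile(0.8))   # lies between the two largest values

# The median is the same as the 50% quantile
print(cases.quantile(0.5) == cases.median())
```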

---
## <span style="color:#FF5D02;">Assignment: Compare descriptive statistics</span>

Calculate the average of new cases of Jan 2021 and Feb 2021! Which one is bigger?

**Hints**

Use the `.loc[]` function to select the months (see above examples for index ranges)

Use the `.mean()` function to calculate the mean

You can chain functions like `df.function1().function2().function3()["2022-01-01"]`

In [None]:
# Solution

print("Jan 2021: ", df.loc["2021-01-01":"2021-01-31", "new_cases"].mean())
print("Feb 2021: ", df.loc["2021-02-01":"2021-02-28", "new_cases"].mean())
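
As a possible alternative, and assuming the index holds the dates as text in the form `YYYY-MM-DD` (as it does in this notebook), you can select a month by its prefix and don't need to know how many days it has. The sketch uses made-up numbers; `toy`, `january` and `february` are names chosen for this example.

```python
import pandas

# Made-up numbers, for illustration only
toy = pandas.DataFrame(
    {"new_cases": [100, 300, 200, 400]},
    index=["2021-01-15", "2021-01-31", "2021-02-01", "2021-02-28"],
)

january = toy[toy.index.str.startswith("2021-01")]    # True/False filter built from the index text
february = toy[toy.index.str.startswith("2021-02")]

print("Jan 2021: ", january.new_cases.mean())
print("Feb 2021: ", february.new_cases.mean())
```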

---
# 3. Simple data visualization<a id="chapter3"/>

There are numerous data visualization packages available for Python (e.g. matplotlib, seaborn). The [Python Graph Gallery](https://www.python-graph-gallery.com) gives a lot of examples with code snippets.

pandas includes a `.plot()` function that automatically calls the respective functionality from a visualization package (matplotlib, by default).

This section shows several examples to get you started.
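
Since `.plot()` hands the work over to matplotlib, it returns a matplotlib `Axes` object that you can keep and customize further. A minimal sketch with made-up numbers follows; the names `cases` and `ax` are chosen for this example, and the `matplotlib.use("Agg")` line is only there so the snippet also runs as a plain script without a display (leave it out inside Jupyter).

```python
import matplotlib
matplotlib.use("Agg")   # only needed outside a notebook; omit inside Jupyter
import pandas

# Made-up numbers, for illustration only
cases = pandas.Series([10, 30, 20, 40], index=["day1", "day2", "day3", "day4"], name="new_cases")

ax = cases.plot(title="New cases (toy data)")   # returns a matplotlib Axes object
ax.set_xlabel("Day")                            # the Axes can be customized afterwards
ax.set_ylabel("Number of cases")
```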

In [None]:
last_quarter = df.new_cases[-90:]
last_quarter.plot()  # plot the new cases for the last 90 days

# (if you're unsure about the syntax, recheck the Python basics notebook or google "python negative indexing" for lists or dataframes)
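
As a quick reminder of negative slicing: negative positions count from the end. A slice like `[-90:]` covers the last 90 entries, while `[-90:-1]` stops before the final entry and therefore covers only 89. The same rules on a short list of made-up values (`days` is a name chosen for this example):

```python
days = list(range(1, 11))   # ten made-up "days", numbered 1 to 10

print(days[-3:])     # the last three elements
print(days[-3:-1])   # starts at the same element, but stops before the last one
print(days[-6:-3])   # the three elements before the last three
```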

In [None]:
# let's make it look a little nicer, for a list of all parameters check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

last_quarter.plot(kind="area", figsize=(18,5), 
                  color="lightblue", legend=True,
                   title="New infections during the last quarter",
                   ylabel="Number of cases")

In [None]:
# let's check the distribution of daily case counts for the last quarter with a boxplot
last_quarter.plot(kind="box")

In [None]:
# let's do a histogram to check the overall distribution
last_quarter.plot(kind="hist", edgecolor="white") 

In [None]:
# For scatterplots (plot x vs. y values as dots) we need a slightly different syntax and go back to the full dataframe, which contains all columns:
df[-90:].plot(kind="scatter", x="new_cases", y="new_deaths", color="blue", title="Deaths vs. new infections per day")

To show multiple data series in a single plot, just put the statements in the same notebook cell:

In [None]:
df[-90:].new_cases.plot(kind="hist", figsize=(16,5), alpha=0.5, color="blue", legend=True,
                        label="last 90 days", title="Number of days with x new infections, quarterly comparison")
df[-180:-90].new_cases.plot(kind="hist", figsize=(16,5), alpha=0.2, color="green", legend=True,
                            label="previous 90 days")
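
If you prefer to be explicit about which plot a data series is drawn into, you can keep the `Axes` object returned by the first `.plot()` call and pass it to the second call via the `ax` parameter. This is a sketch with made-up numbers, not the notebook's dataset; `toy`, `ax` and `ax2` are names chosen for this example, `.iloc` selects rows by position, and the `matplotlib.use("Agg")` line is only needed outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")   # only needed outside a notebook; omit inside Jupyter
import pandas

# Made-up numbers, for illustration only
toy = pandas.DataFrame({"new_cases": [10, 12, 15, 30, 32, 35, 31, 11]})

# First histogram: keep the Axes object it is drawn on
ax = toy.new_cases.iloc[-4:].plot(kind="hist", alpha=0.5, color="blue",
                                  legend=True, label="last 4 days")

# Second histogram: draw it onto the same Axes
ax2 = toy.new_cases.iloc[-8:-4].plot(kind="hist", alpha=0.2, color="green",
                                     legend=True, label="previous 4 days", ax=ax)
```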

---
## <span style="color:#FF5D02;">Assignment: Explore visualization commands</span>

Change the above plotting commands to show other data fields or other sections of the dataframe.