# Lecture 5: Pandas II and Matplotlib

March 17, 2025

## Recap: Pandas

- What does it bring us that is new?

## Key packages

1. `pandas`: Data Manipulation and Analysis

   - pandas is the primary library for handling and manipulating structured data in Python.
   - Data Structures, Data Cleaning and Preparation, Time Series Support

2. `matplotlib`: Fundamental Plotting Library

   - Matplotlib is the foundational plotting library in Python, offering comprehensive control over plotting elements.
   - Flexibility, Broad Plot Types, Integration (pandas)
    
3. `seaborn`: Statistical Data Visualization
    
   - Built on top of Matplotlib, seaborn offers a more user-friendly interface for statistical plots.
   - Aesthetics, Easy Statistical Plots, Integration with pandas

## Pandas (cont'd) and other packages

- `seaborn` package [[link](https://seaborn.pydata.org)]:
  - *Seaborn is a library for making statistical graphics in Python. It builds on top of `matplotlib` and integrates closely with `pandas` data structures.*
  - *A high-level API for statistical graphics.*
- `zipfile`
  - *The ZIP file format is a common archive and compression standard. This module provides tools to create, read, write, append, and list a ZIP file.*
- `datetime`
  - *The datetime module supplies classes for manipulating dates and times.*

### Plots (more)

- configuration of plots

In [None]:
%pip install seaborn

In [None]:
import seaborn as sns
import pandas as pd
import zipfile
import numpy as np

import datetime

In [None]:
idx = pd.IndexSlice # for multi-indexing

In [None]:
idx?

In [None]:
# Default keyword arguments that are later passed to .plot()
plotconfig = {
    'style':'.',
    'grid':True,
    'markersize':5,
    'figsize':(10,4)
}

#### Simple `seaborn`

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
penguins.head()

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack")

In [None]:
# sns.histplot(data=penguins, x="flipper_length_mm", hue="species")

In [None]:
sns.pairplot(penguins, hue="species")

In [None]:
# sns.pairplot(penguins, hue="island")

#### Loading data from a .zip

In [None]:
with zipfile.ZipFile("data/covid.zip") as z:
    print("Files in the zip are: ", z.namelist())

In [None]:
# Load the data, which is in a zip file
# with zipfile.ZipFile("data/covid.zip") as z:
#     z.extractall("data")

In [None]:
!ls data

In [None]:
!ls data/Covid\ data

#### Work with .csv or write a loop to work with the data

In [None]:
# Path to the zip file
zip_file_path = 'data/covid.zip'

# Open the zip file and list the files inside
with zipfile.ZipFile(zip_file_path, 'r') as z:
    print(z.namelist())  # List all files in the zip archive

    # Specify the file you want to load
    csv_filename = 'Covid data/CovidDeaths.csv'

    # Load the specific CSV file into a DataFrame
    with z.open(csv_filename) as f:
        df = pd.read_csv(f)

# Display the DataFrame
print(df.head())
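
The pattern above can be tried without the course data. The following is a minimal sketch that builds a tiny archive in memory and reads a CSV from it; the file name `demo/cases.csv` and its contents are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory (hypothetical file name and data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr(
        "demo/cases.csv",
        "iso_code,date,new_cases\nCZE,24-12-20,100\nSVK,24-12-20,50\n",
    )

# Read the CSV straight from the archive, without extracting it to disk.
buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    with z.open("demo/cases.csv") as f:
        demo = pd.read_csv(f)

# Parse the day-month-year strings with an explicit format.
demo["date"] = pd.to_datetime(demo["date"], format="%d-%m-%y")
print(demo)
```

Passing the format explicitly avoids ambiguity between day-first and month-first dates.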

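## Don't forget to update to a newer Python with a newer pandas

Loading data from a zip file. The `date_format` argument of `pd.read_csv` used below requires pandas 2.0 or newer; the commented-out `date_parser` variant is the older equivalent.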

In [None]:
with zipfile.ZipFile("data/covid.zip") as z:
    with z.open("Covid data/CovidDeaths.csv") as f:
        # covid = pd.read_csv(f, index_col=["iso_code", "date"], parse_dates=["date"], date_parser=lambda d: pd.to_datetime(d, format="%d-%m-%y"))
        covid = pd.read_csv(f, index_col=["iso_code", "date"], parse_dates=["date"], date_format="%d-%m-%y")

        country_columns = ["continent", "location", "population"]
        countries = covid.groupby("iso_code").apply(
            lambda g: g.iloc[0][country_columns]
        )

        countries = countries[countries.apply(lambda row: len(row.name) == 3, axis=1)]
        countries.continent = countries.continent.astype("category")

        keep_covid_columns = [
            "new_cases",
            "new_deaths",
            "icu_patients",
            "hosp_patients",
        ]

        covid = covid[keep_covid_columns]
        covid = covid[covid.apply(lambda row: len(row.name[0]) == 3, axis=1)]

        covid = covid.sort_index()

        covid = covid.reset_index()

countries = countries

In [None]:
countries.head()

In [None]:
covid.head()

In [None]:
# Check if the data contains the Czech Republic
'CZE' in covid['iso_code'].unique()

In [None]:
czech_cases = covid.loc[covid['iso_code'] == 'CZE'].set_index('date')
slovak_cases = covid.loc[covid['iso_code'] == 'SVK'].set_index('date')

In [None]:
czech_cases.head()

In [None]:
slovak_cases.head()

### Args / Kwargs

In [None]:
plotconfig

In [None]:
czech_cases['new_cases'].plot(style='.',grid=True)

In [None]:
# all keyword arguments in plotconfig are passed to the plot function
czech_cases['new_cases'].plot(**plotconfig)

In [None]:
plotconfig
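
What `**plotconfig` does can be seen without any plotting. In this small sketch, `describe_plot` and `demo_config` are hypothetical names used only for illustration; the function simply returns whatever arguments it receives.

```python
def describe_plot(*args, **kwargs):
    """Return the positional and keyword arguments the function received."""
    return args, kwargs


demo_config = {"style": ".", "grid": True}

# These two calls are equivalent: ** unpacks the dict into keyword arguments.
explicit = describe_plot("new_cases", style=".", grid=True)
unpacked = describe_plot("new_cases", **demo_config)

print(explicit == unpacked)
```

Keeping the shared settings in one dict means a style change has to be made in only one place.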

## Indexing data
### Using `loc` - selecting based on index labels

In [None]:
czech_cases.index

In [None]:
datetime.date(year=2020, month=3, day=1)

In [None]:
czech_cases.loc[datetime.datetime(year=2020, month=12, day=24)]

In [None]:
czech_cases.loc['2020-12-24']

In [None]:
czech_cases.loc['2020-09-01':'2020-11-15'].plot()
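
The same kinds of label lookups on a `DatetimeIndex` can be sketched on a small synthetic series (the values are made up):

```python
import pandas as pd

dates = pd.date_range("2020-12-20", periods=10, freq="D")
demo = pd.Series(range(10), index=dates, name="new_cases")

one_day = demo.loc["2020-12-24"]              # single label -> scalar
window = demo.loc["2020-12-24":"2020-12-26"]  # label slice, both ends included
december = demo.loc["2020-12"]                # partial string -> the whole month
```

Unlike positional slicing with `iloc`, a label slice with `loc` includes its end point.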

### Sub-setting using `mask` - conditional on value of series

- Masking is a way to filter data by creating a "mask" (boolean array) that indicates which rows or columns should be included in a subset.
- If needed, use `~` to invert a mask.

In [None]:
czech_cases[(czech_cases['new_cases'] >= 5000) & (czech_cases['new_cases'] < 15000)]
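
A minimal sketch of building, combining, and inverting a mask on a small made-up DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({"new_cases": [1000, 6000, 12000, 20000]})

# Comparisons return boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than >= and <.
mask = (demo["new_cases"] >= 5000) & (demo["new_cases"] < 15000)

inside = demo[mask]    # rows where the mask is True
outside = demo[~mask]  # ~ inverts the mask
```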

In [None]:
# ax is the matplotlib Axes object; passing it to later calls draws everything on the same plot
ax = czech_cases["new_cases"].plot(color="lightgrey", label="other values", legend=True, **plotconfig)

czech_cases.loc[
    (czech_cases["new_cases"] >= 5000) & (czech_cases["new_cases"] < 15000), "new_cases"
].plot(ax=ax, label="Values between 5k and 15k", legend=True, **plotconfig)

czech_cases.loc[
    czech_cases.index.weekday == 6, "new_cases"
].plot(ax=ax, label="Sunday", legend=True, **plotconfig)

czech_cases.loc[czech_cases.index.weekday == 5, "new_cases"].plot(ax=ax, label="Saturday", legend=True, **plotconfig)

In [None]:
covid

In [None]:
CSSR = covid.loc[covid['iso_code'].isin(['SVK','CZE'])] 

In [None]:
CSSR.head()

## `MultiIndex`

In [None]:
CSSR = CSSR.set_index(['iso_code','date']) 

In [None]:
CSSR.head()

In [None]:
CSSR.loc[('CZE','2020-12-24')]

When slicing or selecting multiple labels, use `idx = pd.IndexSlice`.

In [None]:
idx = pd.IndexSlice

In [None]:
CSSR.loc[idx['CZE']]

In [None]:
CSSR.index.get_level_values('date')

In [None]:
# IndexSlice is used to slice multi-indexed dataframes
czechoslovak_christmas = CSSR.loc[pd.IndexSlice[['CZE','SVK'],'2020-12-24':'2020-12-27'],:]
czechoslovak_christmas

* alternatively use notation below with `slice()`

In [None]:
CSSR.loc[(['CZE','SVK'], slice(None)), :] # all dates, both countries; the trailing ", :" makes the tuple a row key
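
The two notations can be compared on a small synthetic frame (the values are made up). The sketch assumes a sorted `MultiIndex`, which slicing on a level requires.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["CZE", "SVK"], pd.date_range("2020-12-23", periods=4, freq="D")],
    names=["iso_code", "date"],
)
demo = pd.DataFrame({"new_cases": range(8)}, index=index)

idx = pd.IndexSlice

# All countries, two selected days: once with IndexSlice, once with slice().
with_idx = demo.loc[idx[:, "2020-12-24":"2020-12-25"], :]
with_slice = demo.loc[(slice(None), slice("2020-12-24", "2020-12-25")), :]

print(with_idx.equals(with_slice))
```

Writing the column indexer `:` explicitly keeps pandas from reading the tuple as a (rows, columns) pair.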

In [None]:
# a MultiIndex can also be built directly, not only via set_index
pd.MultiIndex.from_arrays([[1,1],['a','2']])

In [None]:
# get specific level from multi-index
CSSR.index.get_level_values(level = 'iso_code') # or level = 0

In [None]:
CSSR.index.get_level_values(level = 'iso_code').unique() # unique values in the level

In [None]:
# .reset_index can reset only a specific level
CSSR.reset_index(level = 'date')

## Reshaping and pivoting

https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

### Reshape `pd.Series` into `pd.DataFrame`: `.unstack`

In [None]:
czechoslovak_christmas['new_cases']

In [None]:
# unstack moves an index level into the columns (long → wide)
czechoslovak_christmas['new_cases'].unstack(level = 'iso_code')

### Stack `pd.DataFrame` to `pd.Series`


In [None]:
CSSR.stack()
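
A minimal round trip between the two shapes, on made-up numbers, might look like this:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["CZE", "SVK"], ["2020-12-24", "2020-12-25"]], names=["iso_code", "date"]
)
long_series = pd.Series([10, 20, 30, 40], index=index, name="new_cases")

wide = long_series.unstack(level="iso_code")  # dates in rows, one column per country
back = wide.stack()                           # index is now (date, iso_code)
restored = back.swaplevel().sort_index()      # back to (iso_code, date) order
```

`stack` appends the former column level as the innermost index level, which is why the levels have to be swapped to recover the original order.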

### melting → long format

In [None]:
CSSR = CSSR.reset_index()

In [None]:
CSSR.head()

In [None]:
CSSR.melt?

In [None]:
CSSR.melt().head()

In [None]:
CSSR.melt()['variable'].unique()
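
Called without arguments, `melt` turns every column into `variable`/`value` pairs, identifiers included. A small sketch with `id_vars` (made-up data; the column names `metric` and `count` are chosen only for illustration) keeps the identifier column intact:

```python
import pandas as pd

wide = pd.DataFrame({
    "iso_code": ["CZE", "SVK"],
    "new_cases": [100, 50],
    "new_deaths": [3, 1],
})

# Keep iso_code as an identifier; melt the remaining columns into rows.
long_df = wide.melt(id_vars="iso_code", var_name="metric", value_name="count")
print(long_df)
```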

## Applying functions

### Aggregation
- decreasing dimensionality

In [None]:
czech_cases

In [None]:
czech_cases[['new_cases', 'new_deaths']].mean()

In [None]:
czech_cases.min()

In [None]:
czech_cases.sum(numeric_only=True) # skip the text column iso_code, whose strings would otherwise be concatenated

### Transforming
* preserves dimensionality and shape

In [None]:
czech_cases = czech_cases.set_index('iso_code', append = True)

In [None]:
czech_cases.diff(axis = 0)

In [None]:
czech_cases.apply(np.log)

In [None]:
czech_cases.cumsum()

In [None]:
czech_cases.pct_change() # Note: the first row is NaN (no previous value); pandas 2.1+ may also emit a FutureWarning about the default fill_method

In [None]:
czech_cases.pct_change(fill_method=None)

#### Custom functions

In [None]:
czech_cases.apply(lambda x: (x - np.mean(x)) / np.std(x))
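
The difference between aggregating and transforming, and one detail of the standardization above, can be checked on a small made-up column:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"new_cases": [1.0, 2.0, 3.0, 4.0]})

total = demo.sum()       # aggregation: one value per column
running = demo.cumsum()  # transformation: same shape as the input

# np.std uses ddof=0 (population), while pandas .std() defaults to ddof=1 (sample).
z_pandas = (demo - demo.mean()) / demo.std(ddof=0)
z_apply = demo.apply(lambda x: (x - np.mean(x)) / np.std(x))

print(np.allclose(z_pandas, z_apply))
```

If sample standard deviation is wanted instead, `x.std()` inside the lambda would give it.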

## Group By

**Split-Apply-Combine Logic**

https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.
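
The three steps can be written out by hand and compared with the one-line `groupby` call. This is a minimal sketch on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "new_cases": [100, 50, 70],
})

# Split (iterate over groups), apply (sum), combine (collect into a Series).
parts = {name: group["new_cases"].sum() for name, group in demo.groupby("continent")}
by_hand = pd.Series(parts)

# The same result in one call.
by_groupby = demo.groupby("continent")["new_cases"].sum()
```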


In [None]:
countries.head()

In [None]:
covid = covid.merge(countries, how = 'left', on = 'iso_code')

In [None]:
covid.shape

In [None]:
covid.head()

In [None]:
covid.groupby('continent', observed=False).count() # observed=False keeps every category of the categorical key, even those with no rows

In [None]:
g = covid.groupby(['continent', 'date'], observed=False)

In [None]:
g.groups.keys()

In [None]:
g.groups.values()

In [None]:
# get_group is used to get a specific group from the groupby object
g.get_group(('Europe', '2020-12-24'))

### Group By + Apply

In [None]:
interesting_countries = ['Austria', 'Poland', 'Germany', 'Czechia', 'Slovakia', 'Hungary', 'France', 'Denmark', 'Sweden']

In [None]:
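# .copy() makes an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning
some_countries = covid.loc[covid.location.isin(interesting_countries)].copy()
some_countries['deaths_per_case'] = some_countries.new_deaths / some_countries.new_cases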
some_countries
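
A per-country version of this ratio can be sketched with `groupby` followed by `apply`. The data and the helper name `deaths_per_case` are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "location": ["Czechia", "Czechia", "Slovakia", "Slovakia"],
    "new_cases": [100, 300, 50, 150],
    "new_deaths": [2, 6, 1, 1],
})


def deaths_per_case(group):
    """Ratio of total deaths to total cases within one group."""
    return group["new_deaths"].sum() / group["new_cases"].sum()


# Select the value columns first, then apply the function to each group.
ratio = demo.groupby("location")[["new_cases", "new_deaths"]].apply(deaths_per_case)
print(ratio)
```

Summing within the group before dividing avoids the infinities that a row-by-row ratio produces on days with zero cases.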

## Merging and joining datasets

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

* `pd.concat` - alignment (along index or columns)
* `pd.merge` - combining data (along columns, by values)
    * `df.join` - merge on index
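
The three operations side by side, as a minimal sketch on made-up frames:

```python
import pandas as pd

cases = pd.DataFrame({"iso_code": ["CZE", "SVK"], "new_cases": [100, 50]})
more_cases = pd.DataFrame({"iso_code": ["POL"], "new_cases": [200]})
names = pd.DataFrame({"iso_code": ["CZE", "SVK"], "location": ["Czechia", "Slovakia"]})

# concat: put frames one under another (rows are aligned by column name).
stacked = pd.concat([cases, more_cases], ignore_index=True)

# merge: match rows by the values of a column; POL has no match, so it gets NaN.
merged = stacked.merge(names, how="left", on="iso_code")

# join: the same matching, but on the index.
joined = stacked.set_index("iso_code").join(names.set_index("iso_code"))
```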


### Concatenate
![concatenate](./img/concatenate.png)

### Merge
![merge](./img/merge.png)



- Good to know when working with time series: `pd.merge_asof`
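
A small sketch of `pd.merge_asof`: instead of requiring exact key matches, it attaches to each left row the most recent right row at or before its key. The `levels` table of restriction levels is hypothetical data invented for illustration; both frames must be sorted by the key.

```python
import pandas as pd

cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-15", "2020-11-01"]),
    "new_cases": [3000, 9000, 6000],
})
# Hypothetical table: the date from which each restriction level applied.
levels = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-20", "2020-10-10"]),
    "level": [1, 2],
})

# For each case date, take the last level that started on or before it.
matched = pd.merge_asof(cases, levels, on="date")
print(matched)
```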

## Rolling object

A `rolling` object is used for performing operations on a sliding window basis across a `DataFrame` or `Series`. 

This is particularly useful for time series analysis, where rolling operations (e.g., moving averages, sums, etc.) help smooth out data, identify trends, or calculate indicators.

Common Rolling Aggregations:

| Method | Description |
| ---- | ---- |
| mean() | Rolling mean (moving average) |
| sum() | Rolling sum |
| std() | Rolling standard deviation |
| min() | Rolling minimum |
| max() | Rolling maximum |
| apply(func) | Custom function application |
| corr() | Rolling correlation with another column |
| cov() | Rolling covariance with another column |
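
A few of these aggregations on a short made-up series, as a minimal sketch:

```python
import pandas as pd

demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window of 3 needs 3 observations, so the first two results are NaN.
roll_mean = demo.rolling(3).mean()

# min_periods=1 allows shorter windows at the start of the series.
roll_partial = demo.rolling(3, min_periods=1).mean()

# apply runs a custom function on each window (here: the range max - min).
roll_range = demo.rolling(3).apply(lambda w: w.max() - w.min())
```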

In [None]:
plotconfig = {
    'style':'.',
    'grid':True,
    'markersize':5,
    'figsize':(12,5)
}

In [None]:
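# Plot a single Series so that each `label` appears in the legend;
# drop the iso_code level added earlier to get a plain date axis
new_cases = czech_cases["new_cases"].droplevel("iso_code")
ax = new_cases.plot(label="original", legend=True, **plotconfig)
new_cases.rolling(3).mean().plot(label="3-day rolling mean", ax=ax, legend=True)
new_cases.rolling(5).mean().plot(label="5-day rolling mean", ax=ax, legend=True)
new_cases.rolling(10).mean().plot(label="10-day rolling mean", ax=ax, legend=True, **plotconfig)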