# Pivot tables with Pandas

## Learning Objectives

At the end of this notebook you should be able to:
- create pivot tables with Pandas

First, we import the modules and load the datasets.  

Note: please use the __nf_geo environment__ again.

In [None]:
# standard import of pandas
import pandas as pd

# additional import of the geopandas package
import geopandas as gpd

# numpy
import numpy as np

# hides warning messages
import warnings
warnings.filterwarnings("ignore")

In [None]:
# you don't need to look too closely at this cell - it just recreates the datasets you already know from before!

# bike theft data
thefts_df = pd.read_csv('data/Fahrraddiebstahl.csv', encoding='latin-1') # proper encoding is necessary here!
thefts_df.columns = thefts_df.columns.str.lower()  # make column names lowercase

# geodataframe based on the shapefile
gdf = gpd.GeoDataFrame.from_file('data/LOR_SHP_2021/lor_plr.shp')
gdf.columns = gdf.columns.str.lower()

# recreate the bike theft raw dataframe as in the notebook before.
thefts_df['lor_str'] = thefts_df['lor'].astype('str') # changing the lor column datatype to string
thefts_df['plr_id'] = thefts_df['lor_str'].apply(lambda x: x.zfill(8)) # fill leading gaps up to 8 characters with zeros and call the new column accordingly to the geodataframe
thefts_df.drop(columns=['lor', 'lor_str'], inplace=True) # dropping no longer needed columns
gdf_biketheft = gdf.merge(thefts_df, on='plr_id') # merging

# red wine dataset
red_wines_df = pd.read_csv('data/winequality-red.csv', delimiter=';')

## Pivot tables

From [wiki](https://en.wikipedia.org/wiki/Pivot_table): "Among other functions, a pivot table can automatically sort, count total, or give the average of the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. Pivot tables are also useful for quickly creating unweighted cross tabulations."

As you might have guessed, Pandas provides functionality for creating pivot tables: the `pivot_table()` function, which is available on the pandas module (imported here as `pd`). As the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) tell us, `pivot_table()` accepts a number of different arguments, including:

1. `data`: A DataFrame object
2. `values`: a column or a list of columns to aggregate
3. `index`: a column, Grouper, array with the same length as the data, or a list of them. These are the keys to group by on the pivot table index (the rows). If an array is passed, it is used in the same manner as column values.
4. `columns`: a column, Grouper, array with the same length as the data, or a list of them. These are the keys to group by on the pivot table columns. If an array is passed, it is used in the same manner as column values.
5. `aggfunc`: the function to use for aggregation, defaulting to the mean

Notice that by default this uses the mean for the `aggfunc` parameter. 
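
To see how these arguments fit together, here is a minimal sketch on a small made-up table. The `sales` DataFrame and its column names are purely hypothetical and not part of the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the arguments
sales = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Hamburg", "Hamburg", "Berlin"],
    "product": ["bike", "lock", "bike", "bike", "bike"],
    "price": [100, 20, 200, 400, 300],
})

# rows = city, columns = product, cell = mean price (the default aggfunc)
toy_pivot = pd.pivot_table(sales, values="price", index="city", columns="product")
print(toy_pivot)
```

Combinations that never occur in the data (here: a lock sold in Hamburg) show up as `NaN`, which is why the `fill_value` argument can come in handy.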

1. For this example, we want to show the maximum damage amount (`schadenshoehe`) for each type of bike and each district.  
Let's have a look at the available columns again:

In [None]:
gdf_biketheft.info()

We choose the name of the district as the rows and the type of bike as the columns to aggregate on,  
and the aggregation in this case is the maximum value.  
We also fill the NaNs with zero.

In [None]:
# We can specify a function to aggregate with (by default it is mean)
pd.pivot_table(gdf_biketheft,
                values='schadenshoehe',
                index='plr_name',
                columns='art_des_fahrrads',
                aggfunc='max',  # the string name avoids the warning newer pandas versions raise for the builtin max
                fill_value=0)

## Check your understanding

2. For the second pivot table, we ask you to recreate a table similar to the final table of the previous notebook, showing the _theft count_ and the _mean theft amount_ as in the example below.  
You can choose the columns to display and the functions to aggregate by;  
the syntax is shown in the comment in the cell below.  

Try to make it look like this:  
<img src="images/biketheft_pivot.png" alt="pivot_table" width="40%"/>

In [None]:
# Please ignore the slightly different values to the previous notebook - since we spared ourselves the cleaning.
"\N{smiling face with sunglasses}" * 3

In [None]:
# basic syntax of a pivot table with aggregate function "aggfunc" dictionary:
''' table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                     aggfunc={'D': np.mean,
                              'E': np.mean})'''
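
As a hint for the dictionary form of `aggfunc`, here is a small runnable sketch on made-up data. The `orders` DataFrame and its columns are hypothetical; the point is only that each value column can get its own aggregation function.

```python
import pandas as pd

# Hypothetical toy data, not the bike theft dataset
orders = pd.DataFrame({
    "shop": ["north", "north", "south", "south", "south"],
    "order_id": [1, 2, 3, 4, 5],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# One aggregation per value column: count the orders, average the amounts
summary = pd.pivot_table(
    orders,
    index="shop",
    values=["order_id", "amount"],
    aggfunc={"order_id": "count", "amount": "mean"},
)
print(summary)
```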
                                                  

## Pivot tables and binning

In [None]:
# At this point, we deserve some good wine again!
# Let's recall what the data looks like. 
red_wines_df.head()

Let's take a moment to quickly learn about another Pandas function called `cut()` that allows us to turn a column with continuous data into categories by specifying bins to place them in.

In [None]:
# Create categories: the bin edges run from 4 to 16, because np.arange(4, 17) excludes the stop value 17. Since no step size is given, the default of 1 is used.
# Numpy functionalities are covered in more depth in the next notebook
import numpy as np
pd.cut(red_wines_df['fixed acidity'], bins=np.arange(4, 17)).head()
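
To make the binning behaviour concrete, the following sketch applies `cut()` to a handful of made-up values (the `values` Series is hypothetical). By default the intervals are open on the left and closed on the right, so a value equal to the lowest edge falls outside all bins.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen to sit on and between bin edges
values = pd.Series([4.5, 5.0, 5.1, 15.9, 4.0])
edges = np.arange(4, 17)  # 4, 5, ..., 16 (the stop value 17 is excluded)

# Default behaviour: right-closed intervals such as (4, 5]
binned = pd.cut(values, bins=edges)
print(binned)
```

Here 5.0 still belongs to the interval `(4, 5]`, while 4.0 gets no category at all. Passing `include_lowest=True` would include the lowest edge.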

In [None]:
# let's create a new column with fixed_acidity split into categories
# 1. Create bins using np.arange
fixed_acidity_bins = np.arange(4, 17)

# 2. Create categories for fixed acidity, using the bins created above. 
# The labels of the categories are the lower boundary of the bins

fixed_acidity_series = pd.cut(red_wines_df['fixed acidity'], bins=fixed_acidity_bins, 
                              labels=fixed_acidity_bins[:-1])

# Give series a name, which will be the new column's name
fixed_acidity_series.name = 'fa_bin'

# Concatenate the original df with the newly created series
red_wines_df = pd.concat([red_wines_df, fixed_acidity_series], axis=1)

In [None]:
# Let's check the resulting df
red_wines_df.head()

Now we can get the mean residual sugar for each combination of quality category and fixed acidity bin like we did earlier, but this time with `pivot_table()` (mean is the default aggregation function).

In [None]:
pd.pivot_table(red_wines_df,
            values='residual sugar', 
            index='quality', 
            columns='fa_bin')

In [None]:
# or, we specify "max" as a function to aggregate with
pd.pivot_table(red_wines_df,
             values='residual sugar', 
             index='quality', 
             columns='fa_bin',
             aggfunc='max')
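
A pivot table is closely related to grouping by two keys and then moving one of them into the columns. The sketch below illustrates this on a tiny made-up table. The `toy` DataFrame is hypothetical and only borrows the column names used above.

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the wine example
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6],
    "fa_bin": [7, 8, 7, 7, 8],
    "residual sugar": [2.0, 3.0, 1.0, 3.0, 4.0],
})

# Variant 1: pivot table with the default aggregation (mean)
via_pivot = pd.pivot_table(toy, values="residual sugar",
                           index="quality", columns="fa_bin")

# Variant 2: group by both keys, take the mean, move fa_bin into the columns
via_groupby = toy.groupby(["quality", "fa_bin"])["residual sugar"].mean().unstack()

print(via_pivot)
print(via_groupby)
```

Both variants produce the same numbers. `pivot_table()` is mainly the more compact way to write it.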

---
pop the corks!