# Exploring datasets: Pivot tables

Like groupby, pivot tables partition the data and return summary statistics for each group.

Pandas provides the pivot_table() function. Its main arguments are:

- index: variables to partition on (becomes rows)
- columns: optional variable to partition on in the columns
- values: variables to get aggregates on
- aggfunc: function used to compute the aggregates (for example np.sum for the sum, or len for the count); the default is the mean
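To make the four arguments concrete, here is a minimal sketch on a small made-up DataFrame. The data are hypothetical (not taken from the Glassdoor file); only the column names mirror those used in the rest of this notebook. It also shows that the same summary can be built with groupby followed by unstack.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook's dataset
toy = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "department": ["audit", "tax", "audit", "audit", "tax", "tax"],
    "worklife": [3.0, 4.0, 5.0, 2.0, 4.0, 2.0],
})

# index -> rows, columns -> columns, values -> the variable being aggregated
by_pivot = pd.pivot_table(toy, index="firm", columns="department",
                          values="worklife", aggfunc="mean")

# The same summary built with groupby, then moving department into the columns
by_groupby = toy.groupby(["firm", "department"])["worklife"].mean().unstack()

print(by_pivot)
print(by_groupby)
```

Both tables have one row per firm and one column per department, which is why pivot tables are often described as a convenience layer on top of groupby.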

The following cheat-sheet is helpful:

In [None]:
from IPython.display import Image
Image("images/pivot-table-datasheet.png")

From: https://pbpython.com/pandas-pivot-table-explained.html

In [None]:
import pandas as pd
df = pd.read_csv("../datasets/glassdoor.csv")
df.head()

In [None]:
# group by firm and auditor city, get average value for worklife (mean is the default aggfunc)
pd.pivot_table(df,index=["firm","auditor_city"],values=["worklife"])

In [None]:
# columns for the different values of auditor_city
pd.pivot_table(df,index=["firm","department"],columns=['auditor_city'],values=["worklife"])

In [None]:
# fill_value="" to suppress NaNs
pd.pivot_table(df,index=["firm","department"],columns=['auditor_city'],values=["worklife"], fill_value="")

In [None]:
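# average and number of observations (len)
# note: recent pandas versions may warn that the string "mean" is preferred over np.mean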
import numpy as np
pd.pivot_table(df,index=["firm","department"],columns=['auditor_city'],values=["worklife"], fill_value="",aggfunc=[np.mean,len])

In [None]:
# assign pivot table to a variable (DataFrame)
t = pd.pivot_table(df,index=["firm","department"],columns=['auditor_city'],values=["worklife"], fill_value="",aggfunc=[np.mean,len])

In [None]:
t.index
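Looking at the index is useful because query() works with the names of the row levels. The sketch below uses hypothetical toy data (not the Glassdoor file) to show which names end up in the row index and which end up in the column levels when both columns and a list of aggfuncs are passed. The string "mean" is used here in place of np.mean.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook's dataset
toy = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "department": ["audit", "tax", "audit", "tax"],
    "auditor_city": ["miami", "orlando", "miami", "tampa"],
    "worklife": [3.0, 4.0, 2.0, 5.0],
})

toy_t = pd.pivot_table(toy, index=["firm", "department"], columns=["auditor_city"],
                       values=["worklife"], aggfunc=["mean", len])

# Row levels: these names can be referenced in query()
print(toy_t.index.names)

# Column levels: aggregate function, value variable, and auditor_city
print(toy_t.columns.names)
print(toy_t.columns.nlevels)
```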

In [None]:
# the pivot table is a DataFrame, so it supports query(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
# filter on departments (tax and audit only)
t.query('department == ["tax","audit"]')

In [None]:
# I wasn't able to filter on a variable that sits in the columns (here auditor_city)
# query() does not find it, so this throws an error
t.query('auditor_city == ["miami","orlando"]')

In [None]:
# the workaround: move auditor_city to the index and put department in the columns
t = pd.pivot_table(df,index=["firm","auditor_city"],columns=['department'],values=["worklife"], fill_value="",aggfunc=[np.mean,len])
t.query('auditor_city == ["miami","orlando"]')
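If auditor_city has to stay in the columns, one possible alternative to query() is to select columns by the values of that column level. This is a sketch on hypothetical toy data (not the Glassdoor file), and it assumes the column level is named auditor_city, as it is when that variable is passed to columns.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook's dataset
toy = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "department": ["audit", "tax", "audit", "tax"],
    "auditor_city": ["miami", "orlando", "miami", "tampa"],
    "worklife": [3.0, 4.0, 2.0, 5.0],
})

toy_t = pd.pivot_table(toy, index=["firm", "department"], columns=["auditor_city"],
                       values=["worklife"], aggfunc=["mean", len])

# Boolean mask over the column level named "auditor_city"
keep = toy_t.columns.get_level_values("auditor_city").isin(["miami", "orlando"])

# Keep all rows, and only the columns for the selected cities
subset = toy_t.loc[:, keep]
print(subset)
```

The row layout is unchanged, so this can be combined with query() on the index levels, for example subset.query('department == ["tax","audit"]').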