# Introduction

A pivot table summarizes a larger table: it groups rows by one or more keys and aggregates a value for each group, so that large totals are broken down by category. It is an essential tool for data scientists, and it is much easier to learn once you know the building blocks of Pandas.

In [1]:
# Let's load a simple file
import pandas as pd
import numpy as np

df = pd.read_csv('./titanic/train.csv')
df.sample(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
95,96,0,3,"Shorney, Mr. Charles Joseph",male,,0,0,374910,8.05,,S
475,476,0,1,"Clifford, Mr. George Quincy",male,,0,0,110465,52.0,A14,S
435,436,1,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0,B96 B98,S
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
627,628,1,1,"Longley, Miss. Gretchen Fiske",female,21.0,0,0,13502,77.9583,D9,S


With this data, suppose we need to count passengers by class and split each count into survivors and non-survivors. A pivot operation does this more easily than most alternatives.

In [8]:
df.pivot_table(index = 'Pclass', columns = 'Survived', values = 'PassengerId', aggfunc = 'count')

Survived,0,1
Pclass,,
1,80,136
2,97,87
3,372,119


Reading the Pandas source code shows that pivot_table performs essentially the same operation as the expression below.

In [9]:
# Group by both keys, count the values, then move Survived into the columns
df.groupby(['Pclass', 'Survived'])['PassengerId'].count().unstack()

Survived,0,1
Pclass,,
1,80,136
2,97,87
3,372,119


pivot_table first groups the data and applies the specified aggregation, then unstacks the result so that the values of the columns key become the column labels of the output table.
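The equivalence can be checked on a small frame. The sketch below uses invented rows (only the column names are borrowed from the Titanic data) and compares the one-call pivot with the explicit group, aggregate, and unstack steps. It is an illustration under those assumptions, not a trace of the Pandas internals.

```python
import pandas as pd

# Invented rows; every (Pclass, Survived) combination occurs at least once.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass":      [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived":    [0, 1, 1, 0, 1, 0, 0, 1],
})

# One call: pivot_table with a count aggregation.
pivoted = toy.pivot_table(index="Pclass", columns="Survived",
                          values="PassengerId", aggfunc="count")

# Two explicit steps: group and count, then move Survived into the columns.
grouped = toy.groupby(["Pclass", "Survived"])["PassengerId"].count()
manual = grouped.unstack()

# True when both tables hold the same values under the same labels.
print(pivoted.equals(manual))
```

Writing the steps out by hand is mostly useful for understanding; in day-to-day work the single pivot_table call is shorter and easier to read.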

# Missing value imputation

pivot_table has a fill_value parameter. The value you supply replaces any cell that is still missing after aggregation, for example when a combination of index and column keys has no data.

In [10]:
df.pivot_table(index = 'Pclass', columns = 'Embarked', values = 'Age', aggfunc = 'mean', fill_value = 0)

Embarked,C,Q,S
Pclass,,,
1,38.027027,38.5,38.152037
2,22.766667,43.5,30.386731
3,20.741951,25.9375,25.696552
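In the output above every class and port combination has data, so fill_value changes nothing visible. The sketch below uses invented rows in which one combination is absent, to show what the parameter does when a cell really is empty. The frame and its values are assumptions made for illustration.

```python
import pandas as pd

# Invented rows: class 1 has no passenger who embarked at "Q",
# so the (1, "Q") cell has nothing to aggregate.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 2],
    "Embarked": ["C", "S", "C", "Q", "S"],
    "Age":      [40.0, 30.0, 20.0, 50.0, 35.0],
})

without_fill = toy.pivot_table(index="Pclass", columns="Embarked",
                               values="Age", aggfunc="mean")
with_fill = toy.pivot_table(index="Pclass", columns="Embarked",
                            values="Age", aggfunc="mean", fill_value=0)

print(without_fill)  # the (1, "Q") cell is NaN
print(with_fill)     # the same cell now holds 0
```

Note that fill_value acts on the finished table, not on the input: missing ages in the source rows are skipped by the mean rather than replaced. A fill of 0 also reads like a real average age, so a value that cannot be mistaken for data may be a safer choice in practice.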


# Multiple Statistics

Data analysis often requires several statistics at once in order to understand the distributions and relationships in the data.

This is easy with pivot_table: pass all the aggregation functions as a list.

In [11]:
df.pivot_table(index = 'Pclass', columns = 'Embarked', values = 'Age', aggfunc = ['mean', 'median', 'std'], fill_value = 0)

,mean,mean,mean,median,median,median,std,std,std
Embarked,C,Q,S,C,Q,S,C,Q,S
Pclass,,,,,,,,,
1,38.027027,38.5,38.152037,36.5,38.5,37,14.243454,7.778175,15.315584
2,22.766667,43.5,30.386731,25.0,43.5,30,10.192551,19.091883,14.080001
3,20.741951,25.9375,25.696552,20.0,21.5,25,11.712367,16.807938,12.110906
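A table built from a list of functions has two levels of column labels: the statistic name on top and the Embarked value below it. The sketch below, again on invented rows, shows one way to pull a single statistic or a single cell out of such a table. The variable names are hypothetical.

```python
import pandas as pd

# Invented rows with two classes and two ports.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2],
    "Embarked": ["C", "C", "S", "S", "C", "C", "S", "S"],
    "Age":      [30.0, 40.0, 20.0, 50.0, 10.0, 30.0, 25.0, 35.0],
})

stats = toy.pivot_table(index="Pclass", columns="Embarked",
                        values="Age", aggfunc=["mean", "median", "std"])

# Selecting on the top column level returns the table for one statistic.
mean_table = stats["mean"]

# A (statistic, port) tuple addresses one column; adding a row label gives one cell.
std_class1_s = stats.loc[1, ("std", "S")]

print(mean_table)
print(std_class1_s)
```

Selecting by the top level keeps the per-statistic tables easy to compare or plot separately, without recomputing the pivot for each function.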
