# Meet the hardest functions of Pandas, Part I
## Master the when and how of `pivot_table()`, `stack()`, `unstack()`
<img src='images/gym_1.jpg'></img>

### Introduction <small id='intro'></small>

Here is the worst-case scenario: you are watching a paid course and the instructor is talking about a certain topic. Then, all of a sudden, he introduces a completely new function, saying, "This function/method is perfect in this case. It is very easy, so just check out its documentation for more details." You say OK, go to the documentation, and don't even know what you are looking at.

You feel frustrated, go read some articles or StackOverflow threads, and sometimes come back feeling even worse. Believe me, that happens to everyone. This article is specifically about such hard `pandas` functions.

Most sources skip the advanced functions of `pandas` because they are very case-specific. Basic methods and functions can be taught in their own context, for example on toy datasets. Harder functions are difficult to explain, and it is hard to construct a context in which they are useful.

Such functions are often in the toolbox of more experienced data scientists. They make a difference when a single line of code elegantly solves the problem you are having. This post is about three of them: `pivot_table()`, `stack()` and `unstack()`.

### Setup

In [21]:
# Load necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### Pandas pivot_table(), with comparison to groupby() <small id='pivot'></small>

> There should be one-- and preferably only one --obvious way to do it.

The above is a quote from the Zen of Python. Python wants to have only one obvious solution for a single problem. `pandas`, however, does not follow this strictly: there are often several ways to do one operation.

`pivot_table()` is an example. It covers the same ground as a `groupby()` aggregation and is sometimes the better choice. The difference is the shape of the result: aggregating a single column with a single function through `groupby()` returns a `Series`, while `pivot_table()` gives an easy-to-work-with dataframe.

Let's work on a problem and give the solutions using both functions. I will load the `tips` dataset from `seaborn`:

In [2]:
tips = sns.load_dataset('tips')
tips.head()

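| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |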


We want to find the sum of all bills for each gender:

In [3]:
# Using groupby
result = tips.groupby('sex')['total_bill'].sum()
type(result)
result

pandas.core.series.Series

sex
Male      3256.82
Female    1570.95
Name: total_bill, dtype: float64

In [4]:
# Using pivot_table
result_pivot = tips.pivot_table(values='total_bill', index='sex', aggfunc=np.sum)
type(result_pivot)
result_pivot

pandas.core.frame.DataFrame

| sex | total_bill |
|---|---|
| Male | 3256.82 |
| Female | 1570.95 |


Let's compare the syntax of the two functions. In `groupby()`, we pass the column we want to group by in the parentheses; in `pivot_table()`, the equivalent parameter is `index`. To choose the column to aggregate, we subset with brackets in `groupby()`, while in `pivot_table()` we pass it to `values`. Finally, to choose the aggregating function, we use method chaining in `groupby()`, whereas `pivot_table()` provides the `aggfunc` argument.

When I wrote an article about project setup for DS and ML, I researched a lot of notebooks. What I found surprising was that many people used `groupby()` and then called `.reset_index()` to turn the result into a dataframe. Let's explore further to find out why:
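To check this correspondence on something small enough to verify by hand, here is a minimal sketch. The four-row dataframe `df` and its values are made up for illustration and are not part of the `tips` data. The aggregation is passed as the string `'sum'`, which `pandas` accepts in place of `np.sum`.

```python
import pandas as pd

# Made-up stand-in for the tips data (hypothetical values)
df = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'total_bill': [10.0, 20.0, 30.0, 40.0],
})

# groupby: group key in the parentheses, column in brackets, function chained
by_groupby = df.groupby('sex')['total_bill'].sum()

# pivot_table: the same three choices go to index, values and aggfunc
by_pivot = df.pivot_table(values='total_bill', index='sex', aggfunc='sum')

# Same numbers, different containers
print(type(by_groupby).__name__, type(by_pivot).__name__)
print(by_groupby['Male'], by_pivot.loc['Male', 'total_bill'])
```

The totals are the same either way; what changes is whether you get them back as a `Series` or as a one-column dataframe.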

In [5]:
result = tips.groupby('sex')['total_bill'].sum().reset_index()
result

| | sex | total_bill |
|---|---|---|
| 0 | Male | 3256.82 |
| 1 | Female | 1570.95 |


If you use `pivot_table()`, you don't have to call `reset_index()` to get a dataframe, and a `Series` returned by `groupby()` is not as easy to work with as a dataframe. Let's see how to group by multiple columns and aggregate with multiple functions:

In [6]:
tips.groupby(['sex', 'day'])['total_bill'].agg([np.mean, np.median, np.sum]).reset_index()

| | sex | day | mean | median | sum |
|---|---|---|---|---|---|
| 0 | Male | Thur | 18.714667 | 16.975 | 561.44 |
| 1 | Male | Fri | 19.857 | 17.215 | 198.57 |
| 2 | Male | Sat | 20.802542 | 18.24 | 1227.35 |
| 3 | Male | Sun | 21.887241 | 20.725 | 1269.46 |
| 4 | Female | Thur | 16.715312 | 13.785 | 534.89 |
| 5 | Female | Fri | 14.145556 | 15.38 | 127.31 |
| 6 | Female | Sat | 19.680357 | 18.36 | 551.05 |
| 7 | Female | Sun | 19.872222 | 17.41 | 357.7 |


In [7]:
tips.pivot_table(values='total_bill', index=['sex', 'day'], aggfunc=[np.mean, np.median, np.sum])

| | | mean | median | sum |
|---|---|---|---|---|
| | | total_bill | total_bill | total_bill |
| sex | day | | | |
| Male | Thur | 18.714667 | 16.975 | 561.44 |
| Male | Fri | 19.857 | 17.215 | 198.57 |
| Male | Sat | 20.802542 | 18.24 | 1227.35 |
| Male | Sun | 21.887241 | 20.725 | 1269.46 |
| Female | Thur | 16.715312 | 13.785 | 534.89 |
| Female | Fri | 14.145556 | 15.38 | 127.31 |
| Female | Sat | 19.680357 | 18.36 | 551.05 |
| Female | Sun | 19.872222 | 17.41 | 357.7 |


Both approaches return a dataframe when you aggregate with multiple functions. But even though `pivot_table()` is more convenient for a single aggregation, calling `reset_index()` on the `groupby()` result gives a much nicer dataframe here, with flat column names instead of a multi-level header. Maybe that's why Kagglers prefer `groupby()`.

In `pivot_table()`, you can sometimes use the `columns` parameter instead of `index` (or both together) to display each group as a column. But if you pass several columns to `columns` and leave out `index`, the result will be a wide dataframe with a single row.

Another difference between `groupby()` and `pivot_table()` is the `fill_value` parameter. Sometimes, when you group by multiple variables, some combinations have no matching rows. In such cases, `groupby()` leaves `NaN`s, but in `pivot_table()` you can control this behavior:
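A short sketch of both situations may help. The dataframe `df` below is made up for illustration (it is not the `tips` data), and the names `wide` and `one_row` are just labels chosen for this example.

```python
import pandas as pd

# Made-up data: every sex/day combination appears exactly once
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'day': ['Sat', 'Sun', 'Sat', 'Sun'],
    'total_bill': [10.0, 20.0, 30.0, 40.0],
})

# Groups on both axes: one row per sex, one column per day
wide = df.pivot_table(values='total_bill', index='sex', columns='day', aggfunc='sum')

# Only `columns`, no `index`: every group becomes a column of a single row
one_row = df.pivot_table(values='total_bill', columns=['sex', 'day'], aggfunc='sum')

print(wide.shape, one_row.shape)
print(one_row)
```

Using `index` and `columns` together usually gives the most readable layout, because each grouping variable gets its own axis.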

In [8]:
pivoted = tips.pivot_table(values='total_bill', index=['sex', 'day'], aggfunc=np.median, fill_value=0)

In [22]:
# Overwrite two cells by hand to show what filled-in zeros look like
pivoted.loc[('Female', 'Sun')] = 0
pivoted.loc[('Male', 'Fri')] = 0

In [23]:
pivoted

| sex | day | total_bill |
|---|---|---|
| Male | Thur | 16.975 |
| Male | Fri | 0.0 |
| Male | Sat | 18.24 |
| Male | Sun | 20.725 |
| Female | Thur | 13.785 |
| Female | Fri | 15.38 |
| Female | Sat | 18.36 |
| Female | Sun | 0.0 |
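In `tips`, every combination of `sex` and `day` has at least one bill, so `fill_value` has nothing to fill on its own. To watch the parameter act by itself, here is a minimal sketch on made-up data in which one combination, `('Female', 'Sun')`, is deliberately missing. The dataframe and the names `with_nan` and `with_zero` are assumptions for this example only.

```python
import pandas as pd

# Made-up data with no ('Female', 'Sun') rows
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female'],
    'day': ['Sat', 'Sun', 'Sat'],
    'total_bill': [10.0, 20.0, 30.0],
})

# Without fill_value, the empty cell is NaN
with_nan = df.pivot_table(values='total_bill', index='sex',
                          columns='day', aggfunc='median')

# With fill_value=0, the empty cell is replaced by 0
with_zero = df.pivot_table(values='total_bill', index='sex',
                           columns='day', aggfunc='median', fill_value=0)

print(with_nan)
print(with_zero)
```

Note that the empty cell only shows up because `day` is spread across the columns. With both variables in `index`, a combination that never occurs in plain string columns simply has no row.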


When would you want to use `pivot_table()`? As I said previously, it can sometimes be a better alternative to `groupby()`. It also comes down to personal preference in syntax. An obvious reason to choose `pivot_table()` is that it has parameters that are not available in `groupby()`. I already covered `fill_value`, but there are others like `margins`. You can learn more about them in the documentation😁.
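As a quick taste of `margins`, here is a minimal sketch on made-up data (the dataframe `df` is hypothetical, not the `tips` dataset). With `margins=True`, `pivot_table()` appends a row and a column of totals, labeled `All` by default.

```python
import pandas as pd

# Made-up data: one bill per sex/day combination
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'day': ['Sat', 'Sun', 'Sat', 'Sun'],
    'total_bill': [10.0, 20.0, 30.0, 40.0],
})

# margins=True adds an 'All' row and an 'All' column with the aggregated totals
table = df.pivot_table(values='total_bill', index='sex', columns='day',
                       aggfunc='sum', margins=True)

print(table)
```

Getting the same totals from a `groupby()` result would take extra steps, which is one practical reason to reach for `pivot_table()`.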

### Pandas stack()

When used, `stack()` returns a reshaped dataframe (or a `Series`) with a multi-level index. The innermost index level is created by pivoting the columns of the dataframe. It is best to start with an example. I will load the `mpg` dataset of cars and subset it for better understanding:

In [11]:
cars_small = sns.load_dataset('mpg').set_index('name')[['weight', 'horsepower']].iloc[:10]

In [12]:
cars_small

| name | weight | horsepower |
|---|---|---|
| chevrolet chevelle malibu | 3504 | 130.0 |
| buick skylark 320 | 3693 | 165.0 |
| plymouth satellite | 3436 | 150.0 |
| amc rebel sst | 3433 | 150.0 |
| ford torino | 3449 | 140.0 |
| ford galaxie 500 | 4341 | 198.0 |
| chevrolet impala | 4354 | 220.0 |
| plymouth fury iii | 4312 | 215.0 |
| pontiac catalina | 4425 | 225.0 |
| amc ambassador dpl | 3850 | 190.0 |


Let's see how we pivot the dataframe so that the columns become index levels. When we use the `stack()` function on this dataframe, the result will have a multi-level index, with `name` as the outer level and `weight` and `horsepower` in the inner level:

In [13]:
stacked_cars = cars_small.stack()
stacked_cars

name                                 
chevrolet chevelle malibu  weight        3504.0
                           horsepower     130.0
buick skylark 320          weight        3693.0
                           horsepower     165.0
plymouth satellite         weight        3436.0
                           horsepower     150.0
amc rebel sst              weight        3433.0
                           horsepower     150.0
ford torino                weight        3449.0
                           horsepower     140.0
ford galaxie 500           weight        4341.0
                           horsepower     198.0
chevrolet impala           weight        4354.0
                           horsepower     220.0
plymouth fury iii          weight        4312.0
                           horsepower     215.0
pontiac catalina           weight        4425.0
                           horsepower     225.0
amc ambassador dpl         weight        3850.0
                           horsepower     190.0
dtype: float64

Here, the original dataframe had single-level column names. That's why the result was a `pandas.Series` rather than a dataframe.

Remember that `stack()` always pivots columns into the innermost index level. If there are no columns left, meaning the data is already a `Series`, `stack()` will not work. Let's try stacking the above `Series`:

In [14]:
stacked_cars.stack()

AttributeError: 'Series' object has no attribute 'stack'

A more complex example of `stack()` is when the column names themselves form a multi-level index. Let's get back to one of our pivoted tables:

In [24]:
multi_name = tips.pivot_table(values='total_bill', index=['sex', 'day'], aggfunc=[np.mean, np.median, np.sum])
multi_name

| | | mean | median | sum |
|---|---|---|---|---|
| | | total_bill | total_bill | total_bill |
| sex | day | | | |
| Male | Thur | 18.714667 | 16.975 | 561.44 |
| Male | Fri | 19.857 | 17.215 | 198.57 |
| Male | Sat | 20.802542 | 18.24 | 1227.35 |
| Male | Sun | 21.887241 | 20.725 | 1269.46 |
| Female | Thur | 16.715312 | 13.785 | 534.89 |
| Female | Fri | 14.145556 | 15.38 | 127.31 |
| Female | Sat | 19.680357 | 18.36 | 551.05 |
| Female | Sun | 19.872222 | 17.41 | 357.7 |


As you can see, the column names have a two-level hierarchy. You can access a column with a multi-level name like this:

In [25]:
multi_name[('mean', 'total_bill')]

sex     day 
Male    Thur    18.714667
        Fri     19.857000
        Sat     20.802542
        Sun     21.887241
Female  Thur    16.715312
        Fri     14.145556
        Sat     19.680357
        Sun     19.872222
Name: (mean, total_bill), dtype: float64

Let's use `stack()` on this dataframe and see what happens:

In [26]:
multi_name.stack()

| | | | mean | median | sum |
|---|---|---|---|---|---|
| sex | day | | | | |
| Male | Thur | total_bill | 18.714667 | 16.975 | 561.44 |
| Male | Fri | total_bill | 19.857 | 17.215 | 198.57 |
| Male | Sat | total_bill | 20.802542 | 18.24 | 1227.35 |
| Male | Sun | total_bill | 21.887241 | 20.725 | 1269.46 |
| Female | Thur | total_bill | 16.715312 | 13.785 | 534.89 |
| Female | Fri | total_bill | 14.145556 | 15.38 | 127.31 |
| Female | Sat | total_bill | 19.680357 | 18.36 | 551.05 |
| Female | Sun | total_bill | 19.872222 | 17.41 | 357.7 |


Now `total_bill`, which was the inner-level column name, has become an index level. You can control which column level you want to stack. Let's see how you would stack the outer-level column names:

In [28]:
multi_name.stack(level=0)

| | | | total_bill |
|---|---|---|---|
| sex | day | | |
| Male | Thur | mean | 18.714667 |
| Male | Thur | median | 16.975 |
| Male | Thur | sum | 561.44 |
| Male | Fri | mean | 19.857 |
| Male | Fri | median | 17.215 |
| Male | Fri | sum | 198.57 |
| Male | Sat | mean | 20.802542 |
| Male | Sat | median | 18.24 |
| Male | Sat | sum | 1227.35 |
| Male | Sun | mean | 21.887241 |


As you can see, using different levels gives differently shaped dataframes. By default, `level` is set to -1, which means the innermost column level.
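The effect of `level` is easy to check on a tiny frame. The sketch below builds a hypothetical two-row dataframe with two-level column names by hand (the values are made up) and stacks it three ways. Depending on your `pandas` version, calling `stack()` may print a `FutureWarning` about its new implementation; the shapes shown here are the same either way.

```python
import pandas as pd

# Hand-built frame with two-level column names (made-up values)
cols = pd.MultiIndex.from_product([['mean', 'sum'], ['total_bill']])
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                  index=['Male', 'Female'], columns=cols)

inner = df.stack()                   # default: innermost column level
inner_explicit = df.stack(level=-1)  # the same thing, spelled out
outer = df.stack(level=0)            # outermost column level instead

print(inner.shape, outer.shape)
print(inner.equals(inner_explicit))
```

Stacking the inner level leaves `mean` and `sum` as columns, while stacking the outer level leaves only `total_bill` and moves the function names into the index.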

### Pandas unstack()

As the name suggests, `unstack()` does exactly the opposite of `stack()`. It takes a `Series` or a dataframe with a multi-level index and pivots an index level (the innermost one by default) into columns. If we unstack the stacked cars `Series`, we get back the original dataframe:

In [15]:
print('Stacked Series:')
stacked_cars

Stacked Series:


name                                 
chevrolet chevelle malibu  weight        3504.0
                           horsepower     130.0
buick skylark 320          weight        3693.0
                           horsepower     165.0
plymouth satellite         weight        3436.0
                           horsepower     150.0
amc rebel sst              weight        3433.0
                           horsepower     150.0
ford torino                weight        3449.0
                           horsepower     140.0
ford galaxie 500           weight        4341.0
                           horsepower     198.0
chevrolet impala           weight        4354.0
                           horsepower     220.0
plymouth fury iii          weight        4312.0
                           horsepower     215.0
pontiac catalina           weight        4425.0
                           horsepower     225.0
amc ambassador dpl         weight        3850.0
                           horsepower     190.0
dtype: float64

In [16]:
print('Unstacked Dataframe:')
stacked_cars.unstack()

Unstacked Dataframe:


| name | weight | horsepower |
|---|---|---|
| chevrolet chevelle malibu | 3504.0 | 130.0 |
| buick skylark 320 | 3693.0 | 165.0 |
| plymouth satellite | 3436.0 | 150.0 |
| amc rebel sst | 3433.0 | 150.0 |
| ford torino | 3449.0 | 140.0 |
| ford galaxie 500 | 4341.0 | 198.0 |
| chevrolet impala | 4354.0 | 220.0 |
| plymouth fury iii | 4312.0 | 215.0 |
| pontiac catalina | 4425.0 | 225.0 |
| amc ambassador dpl | 3850.0 | 190.0 |
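One detail of the round trip is worth a small sketch: the labels and values come back, but an integer column that shared a stacked `Series` with a float column returns as floats, just as `weight` shows up as `3504.0` above. The two-row dataframe here is made up, and `car a` and `car b` are placeholder names.

```python
import pandas as pd

# Made-up two-car frame: integer weights next to float horsepower
df = pd.DataFrame(
    {'weight': [3504, 3693], 'horsepower': [130.0, 165.0]},
    index=pd.Index(['car a', 'car b'], name='name'),
)

stacked = df.stack()          # Series with a (name, column) index
restored = stacked.unstack()  # back to one row per car

print(df.dtypes)
print(restored.dtypes)
print(restored)
```

If the original dtypes matter, you can cast them back afterwards, for example with `astype()`.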


Perhaps the most obvious use case for unstacking is with the `groupby()` function. Although there is nothing to unstack when we group by one variable, unstacking proves very useful when grouping by multiple variables. Let's get back to our `tips` dataset:

In [17]:
tips.head()

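| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |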


In [18]:
multiple_groups = tips.groupby(['sex', 'smoker', 'day', 'time'])['total_bill'].sum()
multiple_groups

sex     smoker  day   time  
Male    Yes     Thur  Lunch     191.71
                      Dinner       NaN
                Fri   Lunch      34.16
                      Dinner    129.46
                Sat   Lunch        NaN
                      Dinner    589.62
                Sun   Lunch        NaN
                      Dinner    392.12
        No      Thur  Lunch     369.73
                      Dinner       NaN
                Fri   Lunch        NaN
                      Dinner     34.95
                Sat   Lunch        NaN
                      Dinner    637.73
                Sun   Lunch        NaN
                      Dinner    877.34
Female  Yes     Thur  Lunch     134.53
                      Dinner       NaN
                Fri   Lunch      39.78
                      Dinner     48.80
                Sat   Lunch        NaN
                      Dinner    304.00
                Sun   Lunch        NaN
                      Dinner     66.16
        No      Thur  Lunch     381

The result is a `Series` with a 4-level index. This is not what we want. Let's unstack it so that it is easier to use:

In [19]:
multiple_groups.unstack()

| | | time | Lunch | Dinner |
|---|---|---|---|---|
| sex | smoker | day | | |
| Male | Yes | Thur | 191.71 | NaN |
| Male | Yes | Fri | 34.16 | 129.46 |
| Male | Yes | Sat | NaN | 589.62 |
| Male | Yes | Sun | NaN | 392.12 |
| Male | No | Thur | 369.73 | NaN |
| Male | No | Fri | NaN | 34.95 |
| Male | No | Sat | NaN | 637.73 |
| Male | No | Sun | NaN | 877.34 |
| Female | Yes | Thur | 134.53 | NaN |
| Female | Yes | Fri | 39.78 | 48.8 |


The result still has a multi-level index. That's because, by default, `unstack()` pivots only one index level per call (the innermost one). Let's call it once more to move another level into the columns:

In [20]:
multiple_groups.unstack().unstack()

| | time | Lunch | Lunch | Lunch | Lunch | Dinner | Dinner | Dinner | Dinner |
|---|---|---|---|---|---|---|---|---|---|
| | day | Thur | Fri | Sat | Sun | Thur | Fri | Sat | Sun |
| sex | smoker | | | | | | | | |
| Male | Yes | 191.71 | 34.16 | NaN | NaN | NaN | 129.46 | 589.62 | 392.12 |
| Male | No | 369.73 | NaN | NaN | NaN | NaN | 34.95 | 637.73 | 877.34 |
| Female | Yes | 134.53 | 39.78 | NaN | NaN | NaN | 48.8 | 304.0 | 66.16 |
| Female | No | 381.58 | 15.98 | NaN | NaN | 18.78 | 22.75 | 247.05 | 291.54 |
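Chaining is not the only option: `unstack()` also accepts a `level` argument, given as a level name, a position, or a list of them. Here is a minimal sketch on a made-up, fully crossed dataset (eight hypothetical rows, not the `tips` data) that compares the chained call with a single call that names the levels.

```python
from itertools import product

import pandas as pd

# Made-up data: every sex/day/time combination appears once, bills are 1..8
rows = list(product(['Male', 'Female'], ['Sat', 'Sun'], ['Lunch', 'Dinner']))
df = pd.DataFrame(rows, columns=['sex', 'day', 'time'])
df['total_bill'] = range(1, 9)

grouped = df.groupby(['sex', 'day', 'time'])['total_bill'].sum()

# Two calls: first 'time', then 'day' move into the columns
chained = grouped.unstack().unstack()

# One call: name the levels to move, in the order you want them in the header
both_at_once = grouped.unstack(['day', 'time'])

print(chained)
print(both_at_once)
```

Both results have one row per `sex` and four columns. The only difference is the order of the two column levels, which follows the order in which the levels were unstacked.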


I think you already see the effect `unstack()` has when you use it with `groupby()`. Multi-level indexes are often hard to work with, so try to avoid them unless you absolutely have to. One way to do that is with `unstack()`.

Although `stack()` is not used often, I showed you its basic syntax and the general idea so that you have a better grasp of `unstack()`.
