## Stacking and unstacking

In this unit, we will look at the stack() and unstack() functions of pandas. These functions are useful for DataFrames with hierarchical indexing, that is, a MultiIndex on the rows or the columns. Their purpose is the following:

* the stack() function takes the innermost column label and turns it into the innermost row index. The overall effect is to make the DataFrame taller.
* the unstack() function is the inverse operation: it takes the innermost row index and turns it into the innermost column label. The overall effect is to make the DataFrame wider.
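
Before moving to a DataFrame with a MultiIndex on both axes, the two bullet points above can be sketched on a small frame with ordinary single-level labels. The frame `small` and its labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data: two rows, three columns, single-level labels.
small = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["a", "b"],
    columns=["x", "y", "z"],
)

# stack() moves the only column level into the row index, so the
# result is a taller object (a Series) with a two-level row index.
tall = small.stack()
print(tall.shape)          # (6,)
print(tall.index.nlevels)  # 2

# unstack() moves the innermost row level back into the columns.
wide = tall.unstack()
print(wide.shape)          # (2, 3)
```

With only one column level to move, stack() has no columns left over, which is why it returns a Series rather than a DataFrame here.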

In [2]:
import pandas as pd
import numpy as np

# define the MultiIndex for the rows: the levels hold the distinct labels,
# and the codes give, for each row, the position of its label in each level
row_levels = [["R0", "R1"], ["r00", "r01", "r10", "r11"]]
row_codes = [[0, 0, 1, 1], [0, 1, 2, 3]]
row_indices = pd.MultiIndex(levels=row_levels, codes=row_codes)

# define the MultiIndex for the columns in the same way
col_levels = [["C0", "C1"], ["c00", "c01", "c10", "c11"]]
col_codes = [[0, 0, 1, 1], [0, 1, 2, 3]]
col_indices = pd.MultiIndex(levels=col_levels, codes=col_codes)

# define the data
data = np.arange(16).reshape(4, 4)

# create the dataframe
df = pd.DataFrame(data, index=row_indices, columns=col_indices)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11
R0,r00,0,1,2,3
R0,r01,4,5,6,7
R1,r10,8,9,10,11
R1,r11,12,13,14,15


In [3]:
df.stack()

Unnamed: 0,Unnamed: 1,Unnamed: 2,C0,C1
R0,r00,c00,0.0,
R0,r00,c01,1.0,
R0,r00,c10,,2.0
R0,r00,c11,,3.0
R0,r01,c00,4.0,
R0,r01,c01,5.0,
R0,r01,c10,,6.0
R0,r01,c11,,7.0
R1,r10,c00,8.0,
R1,r10,c01,9.0,


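The innermost column level is [c00, c01, c10, c11], so the stack() function took this level and turned it into the innermost row level. Each of these labels exists under only one of C0 and C1, so the other column is filled with NaN, and the integer values are converted to floats as a result.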

In [4]:
# Let’s now try unstack() instead:
df.unstack()

Unnamed: 0_level_0,C0,C0,C0,C0,C0,C0,C0,C0,C1,C1,C1,C1,C1,C1,C1,C1
Unnamed: 0_level_1,c00,c00,c00,c00,c01,c01,c01,c01,c10,c10,c10,c10,c11,c11,c11,c11
Unnamed: 0_level_2,r00,r01,r10,r11,r00,r01,r10,r11,r00,r01,r10,r11,r00,r01,r10,r11
R0,0.0,4.0,,,1.0,5.0,,,2.0,6.0,,,3.0,7.0,,
R1,,,8.0,12.0,,,9.0,13.0,,,10.0,14.0,,,11.0,15.0


We can see that this is, in a sense, the opposite: the innermost row level [r00, r01, r10, r11] was taken and turned into the innermost column level.

#### Stacking and unstacking on different levels

pandas allows us to stack or unstack any level of the index, not just the innermost one, which is the default. To specify which level we want, use the level parameter. The outermost level is always level=0. Let’s try stacking the outermost level:

In [5]:
df.stack(level=0)

Unnamed: 0,Unnamed: 1,Unnamed: 2,c00,c01,c10,c11
R0,r00,C0,0.0,1.0,,
R0,r00,C1,,,2.0,3.0
R0,r01,C0,4.0,5.0,,
R0,r01,C1,,,6.0,7.0
R1,r10,C0,8.0,9.0,,
R1,r10,C1,,,10.0,11.0
R1,r11,C0,12.0,13.0,,
R1,r11,C1,,,14.0,15.0
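
The level parameter works the same way for unstack(). As a minimal sketch, the block below rebuilds the same 4x4 DataFrame (using MultiIndex.from_arrays, which is just a convenient assumption here, not the construction used above) and unstacks the outermost row level.

```python
import numpy as np
import pandas as pd

# Rebuild the 4x4 frame from this unit so the block runs on its own.
rows = pd.MultiIndex.from_arrays(
    [["R0", "R0", "R1", "R1"], ["r00", "r01", "r10", "r11"]]
)
cols = pd.MultiIndex.from_arrays(
    [["C0", "C0", "C1", "C1"], ["c00", "c01", "c10", "c11"]]
)
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Move the outermost row level (R0/R1) into the columns.
wide = df.unstack(level=0)

print(wide.index.tolist())   # ['r00', 'r01', 'r10', 'r11']
print(wide.columns.nlevels)  # 3
print(wide.shape)            # (4, 8)
```

The rows are now labelled only by the remaining level, and every original column appears once for R0 and once for R1, with NaN wherever that combination does not exist in the data.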


In [6]:
df.stack().unstack().dropna(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11
R0,r00,0.0,1.0,2.0,3.0
R0,r01,4.0,5.0,6.0,7.0
R1,r10,8.0,9.0,10.0,11.0
R1,r11,12.0,13.0,14.0,15.0


What you might have noticed while working through the quizzes above is that even though we can choose to stack or unstack at any level, the chosen level is always moved to the last, or innermost, position of the other index. This is why, after stacking the outermost level, simply calling unstack() does not give back our original DataFrame: the column levels come back in a different order. One way to fix this is to use an additional function.
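
A quick way to see where the moved level lands is to give the levels names and inspect them after each step. The names "R", "r", "C" and "c" below are assumptions added purely for illustration; the levels of the DataFrame in this unit are unnamed.

```python
import numpy as np
import pandas as pd

# Same 4x4 frame as in this unit, but with (hypothetical) level names.
rows = pd.MultiIndex.from_arrays(
    [["R0", "R0", "R1", "R1"], ["r00", "r01", "r10", "r11"]],
    names=["R", "r"],
)
cols = pd.MultiIndex.from_arrays(
    [["C0", "C0", "C1", "C1"], ["c00", "c01", "c10", "c11"]],
    names=["C", "c"],
)
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Stack the OUTERMOST column level: it becomes the INNERMOST row level.
stacked = df.stack(level=0)
print(list(stacked.index.names))    # ['R', 'r', 'C']
print(list(stacked.columns.names))  # ['c']

# Unstack it again: it returns as the INNERMOST column level,
# so the column levels are now ordered (c, C) instead of (C, c).
back = stacked.unstack()
print(list(back.columns.names))     # ['c', 'C']
```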

Here is a possible solution. First, we call:

In [15]:
df.stack(level=0).unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,c00,c00,c01,c01,c10,c10,c11,c11
Unnamed: 0_level_1,Unnamed: 1_level_1,C0,C1,C0,C1,C0,C1,C0,C1
R0,r00,0.0,,1.0,,,2.0,,3.0
R0,r01,4.0,,5.0,,,6.0,,7.0
R1,r10,8.0,,9.0,,,10.0,,11.0
R1,r11,12.0,,13.0,,,14.0,,15.0


Now this is almost correct, except that the two column levels are swapped. We can fix this with the swaplevel() function as follows:

In [16]:
df.stack(level=0).unstack().swaplevel(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C1,C0,C1,C0,C1,C0,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c00,c01,c01,c10,c10,c11,c11
R0,r00,0.0,,1.0,,,2.0,,3.0
R0,r01,4.0,,5.0,,,6.0,,7.0
R1,r10,8.0,,9.0,,,10.0,,11.0
R1,r11,12.0,,13.0,,,14.0,,15.0


Finally, we can drop the columns with missing entries. So the full command that returns the original DataFrame (with the values now stored as floats) is:

In [17]:
df.stack(level=0).unstack().swaplevel(axis=1).dropna(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,C0,C0,C1,C1
Unnamed: 0_level_1,Unnamed: 1_level_1,c00,c01,c10,c11
R0,r00,0.0,1.0,2.0,3.0
R0,r01,4.0,5.0,6.0,7.0
R1,r10,8.0,9.0,10.0,11.0
R1,r11,12.0,13.0,14.0,15.0
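
As a sanity check, the round trip can be compared against the original programmatically. This is only a sketch: it rebuilds the DataFrame with MultiIndex.from_arrays (an assumed, equivalent construction) so that it runs on its own, and the name `restored` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the 4x4 frame from this unit so the block runs on its own.
rows = pd.MultiIndex.from_arrays(
    [["R0", "R0", "R1", "R1"], ["r00", "r01", "r10", "r11"]]
)
cols = pd.MultiIndex.from_arrays(
    [["C0", "C0", "C1", "C1"], ["c00", "c01", "c10", "c11"]]
)
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Stack the outer column level, unstack it, swap the two column
# levels back into (C, c) order, and drop the all-NaN columns.
restored = (
    df.stack(level=0)
      .unstack()
      .swaplevel(axis=1)
      .dropna(axis=1)
)

# Same row labels, same column labels, same values ...
print(restored.index.tolist() == df.index.tolist())      # True
print(restored.columns.tolist() == df.columns.tolist())  # True
print((restored.to_numpy() == df.to_numpy()).all())      # True

# ... but the intermediate NaNs forced a conversion to floats.
print(restored.dtypes.iloc[0])  # float64
```

If the original integer dtype matters, calling astype(int) on the result converts the values back.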


We end this unit with the observation that, whenever possible, it is desirable to keep the data as stacked as possible, because stacked data can often provide significant performance benefits when accessing the entries of a DataFrame.