In [2]:
import pandas as pd
import seaborn as sns

## Stacking & Unstacking

An exploration of pivoting and unpivoting using pandas' built-in `.stack()` and `.unstack()` methods.

[Reference Video](https://www.youtube.com/watch?v=kJsiiPK5sxs)

In [33]:
# Load in sample dataset. This dataset has a good variety of numeric / categorical columns. 
main_df = sns.load_dataset('taxis')
main_df.head(3)

,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan


Observe that the dataset is in a [tidy format](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html#:~:text=Tidy%20data%20is%20a%20standard,with%20observations%2C%20variables%20and%20types.), where:

1. Every column is a variable
2. Every row is an observation
3. Every cell is a single value

Data in this format is easily understood.

### Unstacking (or Pivoting)

It is common to want to aggregate data by one or more dimensions. Suppose we want __the breakdown of the *number of* rides with each payment type by pickup borough__.

In [42]:
# Group by the two items
df = main_df.groupby(['payment', 'pickup_borough'])\
    .agg({'pickup' : 'size'})\
    .rename(columns={'pickup' : 'n'})
df

payment,pickup_borough,n
cash,Bronx,25
cash,Brooklyn,119
cash,Manhattan,1397
cash,Queens,266
credit card,Bronx,74
credit card,Brooklyn,261
credit card,Manhattan,3839
credit card,Queens,383


This gives us the multi-indexed aggregation. However, this view isn't great for making quick comparisons across dimensions (e.g. how do cash and credit card payments differ for the Bronx?).

Converting the `payment` level to separate columns, or __pivoting from long to wide__, would solve this problem.

In [43]:
df = df.unstack(0)
df

,n,n
pickup_borough,cash,credit card
Bronx,25,74
Brooklyn,119,261
Manhattan,1397,3839
Queens,266,383


The comparison is now plain to see. However, a new issue arises with the column indexing.

The columns are now a MultiIndex with two layers.

In [55]:
df.columns

MultiIndex([('n',        'cash'),
            ('n', 'credit card')],
           names=[None, 'payment'])
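
The outer `n` layer is there because the aggregation produced a one-column DataFrame before unstacking. A minimal sketch of one way to sidestep it, assuming a small hand-made frame (`toy` and its values are made up for illustration and are not taken from the taxis dataset): count with `.size()`, which returns a Series, and unstack the level by name.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi rides (values are made up).
toy = pd.DataFrame({
    "payment": ["cash", "cash", "credit card", "credit card", "credit card"],
    "pickup_borough": ["Bronx", "Queens", "Bronx", "Bronx", "Queens"],
})

# .size() returns a Series, so there is no 'n' column layer to carry along.
counts = toy.groupby(["payment", "pickup_borough"]).size()

# Unstack the 'payment' level by name rather than by position;
# fill_value=0 covers any payment/borough combination with no rides.
wide = counts.unstack("payment", fill_value=0)

print(wide.columns)
print(wide)
```

Referring to the level by name (`'payment'`) instead of position (`0`) also keeps the call readable if the order of the group-by keys changes.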

It is possible to index the data using [tuple indexing](https://pandas.pydata.org/docs/user_guide/advanced.html); however, most of the time it is easier to simply concatenate the two layers together.

In [56]:
# Join the two column layers into one name, e.g. ('n', 'cash') -> 'n__cash'
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df

pickup_borough,n__cash,n__credit card
Bronx,25,74
Brooklyn,119,261
Manhattan,1397,3839
Queens,266,383


This manual flattening also replaces the two-layer MultiIndex with a plain, single-level Index.

In [57]:
df.columns

Index(['n__cash', 'n__credit card'], dtype='object')
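
For comparison, here is a rough sketch of the tuple indexing mentioned earlier, along with `droplevel` as another way to flatten. The frame is rebuilt by hand from the counts shown above so the snippet runs on its own; the names `wide`, `bronx_cash` and `flat` are only illustrative.

```python
import pandas as pd

# Rebuild the two-layer frame from the counts in the unstacked table.
wide = pd.DataFrame(
    [[25, 74], [119, 261], [1397, 3839], [266, 383]],
    index=pd.Index(
        ["Bronx", "Brooklyn", "Manhattan", "Queens"], name="pickup_borough"
    ),
    columns=pd.MultiIndex.from_tuples(
        [("n", "cash"), ("n", "credit card")], names=[None, "payment"]
    ),
)

# Tuple indexing: one tuple entry per column layer.
bronx_cash = wide.loc["Bronx", ("n", "cash")]

# Alternative to joining the layers: drop the outer 'n' layer entirely.
flat = wide.droplevel(0, axis=1)

print(bronx_cash)
print(flat.columns)
```

Dropping the outer layer keeps the `payment` name on the columns, which the string join discards.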

### Stacking (or Unpivoting)

`.stack()` does the opposite of `.unstack()`: it moves the column labels back into the row index, pivoting from wide to long.

In [64]:
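# stack() moves the column labels into an unnamed inner index level,
# which reset_index() exposes as 'level_1'; the values land in column 0.
df = df.stack().reset_index()\
    .rename(columns={'level_1': 'payment_method', 0: 'n'})

# Strip the 'n__' prefix added when the column layers were joined.
df['payment_method'] = df['payment_method'].str.removeprefix('n__')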
df

,pickup_borough,payment_method,n
0,Bronx,cash,25
1,Bronx,credit card,74
2,Brooklyn,cash,119
3,Brooklyn,credit card,261
4,Manhattan,cash,1397
5,Manhattan,credit card,3839
6,Queens,cash,266
7,Queens,credit card,383
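
To illustrate that the two methods undo each other, here is a small round-trip sketch. It uses a hand-built Series holding the Bronx and Brooklyn counts from above; the variable names are illustrative. It also shows that when the column index keeps its name (`payment`), `reset_index()` yields a properly named column and no `level_1` rename is needed.

```python
import pandas as pd

# Long format: one row per (borough, payment) pair, counts taken from above.
long = pd.Series(
    [25, 74, 119, 261],
    index=pd.MultiIndex.from_product(
        [["Bronx", "Brooklyn"], ["cash", "credit card"]],
        names=["pickup_borough", "payment"],
    ),
    name="n",
)

wide = long.unstack("payment")  # long -> wide: payment types become columns
back = wide.stack()             # wide -> long: columns return to the index

# The columns are still named 'payment', so no 'level_1' placeholder appears.
tidy = wide.stack().rename("n").reset_index()

print(back)
print(tidy)
```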
