# Stacking and unstacking data

In [3]:
import pandas as pd
from dfply import *

## Reshaping data

Two ways

* We can **stack** data into a *tall* format.
* We can **unstack** data into a *wide* format.

## (totally real and not at all made-up) Example - Quarterly Auto Sales

**Note** that the last four columns hold

* the same measurement (quarterly sales)
* the same units

In [4]:
sales = pd.read_csv("./data/auto_sales.csv")
sales

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15.0,12
1,Bob,19,12,17.0,20
2,Doug,20,13,,20
3,Yolanda,19,8,32.0,15
4,Xerxes,12,23,18.0,9


## Stacking measurements of the same type/units

<img src="./img/stack_in_action.gif" width=600>

When the column labels themselves carry information (here, the car type), we can move that information into its own column by stacking the data with `gather`.

## A Stack by any other name ...

The act of stacking similar columns goes by various names.

* JMP and Minitab call this *stack*
* `pandas` calls this *melt*
* Wickham/`tidyr`/`dfply` call this *gather*

I prefer **stack**, primarily because it makes it clear we are *melting*/*gathering* data vertically.

## Stacking a `pandas` data frame with `gather` from `dfply`

Syntax: `gather(lbl_col_name, val_col_name, cols_to_stack)`

In [8]:
sales_cols = ['Compact', 'Sedan', 'SUV', 'Truck']
sales_stacked = (sales 
                 >> gather("CarType","QrtSales", sales_cols))
sales_stacked

Unnamed: 0,Salesperson,CarType,QrtSales
0,Ann,Compact,22.0
1,Bob,Compact,19.0
2,Doug,Compact,20.0
3,Yolanda,Compact,19.0
4,Xerxes,Compact,12.0
5,Ann,Sedan,18.0
6,Bob,Sedan,12.0
7,Doug,Sedan,13.0
8,Yolanda,Sedan,8.0
9,Xerxes,Sedan,23.0
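
For comparison, the same stack can be sketched with plain `pandas` using `melt`. This is a minimal sketch, not the notebook's own code: the data frame is typed in by hand from the table shown earlier rather than read from `auto_sales.csv`, and the name `sales_melted` is made up for illustration.

```python
import pandas as pd

# Hand-typed stand-in for auto_sales.csv (values copied from the table above)
sales = pd.DataFrame({
    "Salesperson": ["Ann", "Bob", "Doug", "Yolanda", "Xerxes"],
    "Compact": [22, 19, 20, 19, 12],
    "Sedan": [18, 12, 13, 8, 23],
    "SUV": [15.0, 17.0, None, 32.0, 18.0],
    "Truck": [12, 20, 20, 15, 9],
})

sales_cols = ["Compact", "Sedan", "SUV", "Truck"]

# id_vars stay as they are; value_vars are stacked into two new columns
sales_melted = sales.melt(
    id_vars="Salesperson",
    value_vars=sales_cols,
    var_name="CarType",    # old column labels go here
    value_name="QrtSales", # old cell values go here
)
print(sales_melted.head())
```

Each of the 5 salespeople now appears once per car type, so the result has 20 rows and 3 columns.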


In [11]:
sales_cols = ['Compact', 'Sedan', 'SUV', 'Truck']
sales_stacked = (sales 
                 >> gather("CarType","QrtSales", columns_from(X['Salesperson']))
                )
sales_stacked

Unnamed: 0,CarType,QrtSales
0,Salesperson,Ann
1,Salesperson,Bob
2,Salesperson,Doug
3,Salesperson,Yolanda
4,Salesperson,Xerxes
5,Compact,22
6,Compact,19
7,Compact,20
8,Compact,19
9,Compact,12
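
The output above shows `Salesperson` being stacked along with the sales columns, which suggests that `columns_from` includes its starting column. One way to pick only the columns to the right of a label column, sketched here in plain `pandas` (the two-row data frame is a shortened stand-in for the sales data, not the real file):

```python
import pandas as pd

# Shortened stand-in for the sales data
sales = pd.DataFrame({
    "Salesperson": ["Ann", "Bob"],
    "Compact": [22, 19],
    "Sedan": [18, 12],
    "SUV": [15.0, 17.0],
    "Truck": [12, 20],
})

# Find the label column's position, then take everything to its right
start = sales.columns.get_loc("Salesperson") + 1
sales_cols = list(sales.columns[start:])
print(sales_cols)
```

The resulting list can then be passed to `gather` (or `melt`) as the columns to stack.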


## Unstacking Data with `spread`

Syntax: `spread(split_by_col, to_split_col)`

In [9]:
(sales_stacked
 >> spread(X.CarType, X.QrtSales))

Unnamed: 0,Salesperson,Compact,SUV,Sedan,Truck
0,Ann,22.0,15.0,18.0,12.0
1,Bob,19.0,17.0,12.0,20.0
2,Doug,20.0,,13.0,20.0
3,Xerxes,12.0,18.0,23.0,9.0
4,Yolanda,19.0,32.0,8.0,15.0
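
The `pandas` counterpart of `spread` is `pivot`. The sketch below uses a small hand-typed stacked frame (an assumption for illustration, not the full `sales_stacked`), and the name `sales_wide` is made up.

```python
import pandas as pd

# Small hand-typed stacked frame: one row per salesperson and car type
sales_stacked = pd.DataFrame({
    "Salesperson": ["Ann", "Bob", "Ann", "Bob"],
    "CarType": ["Compact", "Compact", "Sedan", "Sedan"],
    "QrtSales": [22.0, 19.0, 18.0, 12.0],
})

sales_wide = (
    sales_stacked
    # One output column per distinct CarType, filled with QrtSales
    .pivot(index="Salesperson", columns="CarType", values="QrtSales")
    .reset_index()              # turn Salesperson back into a column
    .rename_axis(columns=None)  # drop the leftover "CarType" axis label
)
print(sales_wide)
```

As with `spread`, the new columns come back in sorted order rather than in their original order.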


## Safely working with `gather` and `spread`


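We were lucky the last example worked. Note that

* `spread` needs the remaining columns to identify each original row uniquely; otherwise it cannot tell which values belong together.
* `gather` will add an ID column (`_ID`) when called with `add_id=True`.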
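
To see why a unique identifier matters, here is a minimal `pandas` sketch of the failure. The data is hypothetical: two different salespeople who happen to share the name Ann. `pivot` stands in for `spread`, and the hand-made `_ID` column plays the role of the one `gather` adds.

```python
import pandas as pd

# Hypothetical stacked data: two different people, both named Ann
stacked = pd.DataFrame({
    "Salesperson": ["Ann", "Ann", "Ann", "Ann"],
    "CarType": ["Compact", "Compact", "Sedan", "Sedan"],
    "QrtSales": [22.0, 19.0, 18.0, 12.0],
})

# Without an ID, (Ann, Compact) appears twice, so the reshape is ambiguous
try:
    stacked.pivot(index="Salesperson", columns="CarType", values="QrtSales")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# An explicit row ID removes the ambiguity
stacked["_ID"] = [0, 1, 0, 1]
wide = (
    stacked
    .set_index(["_ID", "Salesperson", "CarType"])["QrtSales"]
    .unstack("CarType")
    .reset_index()
)
print(wide)
```

With the ID in place, each original row is recovered separately instead of colliding with its namesake.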

In [6]:
sales_stacked = sales >> gather("CarType","QrtSales", sales_cols, add_id=True)
sales_stacked >> head(2)

Unnamed: 0,Salesperson,_ID,CarType,QrtSales
0,Ann,0,Compact,22.0
1,Bob,1,Compact,19.0


In [7]:
sales_stacked >> spread(X.CarType, X.QrtSales) >> head(2)

Unnamed: 0,Salesperson,_ID,Compact,SUV,Sedan,Truck
0,Ann,0,22.0,15.0,18.0,12.0
1,Bob,1,19.0,17.0,12.0,20.0


## Why Stack?

* Perform transformations on many columns.
* Fix problems with the Golden Rule

## Example - Switching Units on All Sales

Suppose your manager wants these numbers as average *monthly* sales (quarterly sales divided by 3).  You could

1. Adjust each column with a separate formula
2. Stack --> Transform once --> Unstack

#### Method 1 - Column Transformations

In [8]:
(sales
 >> mutate(Compact = X.Compact/3,
           SUV =   X.SUV/3,
           Sedan = X.Sedan/3,
           Truck = X.Truck/3)
 >> head(2))

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,7.333333,6.0,5.0,4.0
1,Bob,6.333333,4.0,5.666667,6.666667


#### Method 2 - Stack-Transform-Unstack

In [9]:
(sales 
 >> gather("CarType","QrtSales", sales_cols)
 >> mutate(MonSales = X.QrtSales/3)
 >> drop(X.QrtSales)
 >> spread(X.CarType, X.MonSales)
 >> head(2))

Unnamed: 0,Salesperson,Compact,SUV,Sedan,Truck
0,Ann,7.333333,5.0,6.0,4.0
1,Bob,6.333333,5.666667,4.0,6.666667
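
The same stack, transform, unstack pipeline can be sketched in plain `pandas` with method chaining. This is an illustrative equivalent under the assumption of a two-row stand-in for the sales data, not the notebook's own method; the name `monthly` is made up.

```python
import pandas as pd

# Two-row stand-in for the sales data
sales = pd.DataFrame({
    "Salesperson": ["Ann", "Bob"],
    "Compact": [22, 19],
    "Sedan": [18, 12],
    "SUV": [15.0, 17.0],
    "Truck": [12, 20],
})

monthly = (
    sales
    # Stack: every non-ID column becomes (CarType, QrtSales) pairs
    .melt(id_vars="Salesperson", var_name="CarType", value_name="QrtSales")
    # Transform once, on the single stacked value column
    .assign(MonSales=lambda df: df["QrtSales"] / 3)
    .drop(columns="QrtSales")
    # Unstack back to one column per car type
    .pivot(index="Salesperson", columns="CarType", values="MonSales")
    .reset_index()
    .rename_axis(columns=None)
)
print(monthly.round(2))
```

The division appears exactly once, no matter how many car-type columns the data has.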


## Comparing the two methods

**Method 1:**
* More straightforward
* Lots of repeated code
* Doesn't scale ... imagine 100+ columns

**Method 2:**
* More complicated
* Scales well


## <font color="red"> Exercise 1 </font>
    
**Task:** Load the `health_survey.csv` data and use the Stack-Transform-Unstack trick to convert the responses to a Likert scale, with *Strongly Agree* mapped to 5 and *Strongly Disagree* mapped to 1.


In [15]:
from more_dfply import ifelse

survey = pd.read_csv("./data/health_survey.csv")
survey.head(2)

In [16]:
(survey
>> gather("Question", "Answer", columns_between(X["F1"], X["F2.11"]))
>> mutate(score = case_when([X.Answer == "Strongly Agree", 5],
                            [X.Answer == "Somewhat Agree", 4], 
                            [X.Answer == "Neither Agree nor Disagree", 3],
                            [X.Answer == "Somewhat Disagree", 2], 
                            [X.Answer == "Strongly Disagree", 1]))
>> drop(X.Answer)
>> spread(X.Question, X.score)
>> head
)


Unnamed: 0.1,Unnamed: 0,F1,F1.1,F1.2,F1.3,F1.4,F1.5,F1.6,F1.7,F2,...,F5.3,F5.4,F5.5,F5.6,F5.7,F6,F6.1,F6.2,F6.3,F6.4
0,1,4.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,...,4.0,4.0,2.0,4.0,4.0,2.0,4.0,4.0,2.0,4.0
1,2,4.0,4.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,...,4.0,3.0,3.0,2.0,4.0,2.0,4.0,4.0,2.0,2.0
2,3,5.0,5.0,5.0,5.0,2.0,5.0,5.0,4.0,4.0,...,4.0,4.0,2.0,2.0,5.0,2.0,3.0,4.0,2.0,4.0
3,4,4.0,4.0,5.0,4.0,3.0,4.0,5.0,3.0,5.0,...,4.0,4.0,2.0,2.0,4.0,3.0,4.0,4.0,4.0,3.0
4,5,5.0,5.0,5.0,5.0,4.0,4.0,5.0,4.0,3.0,...,5.0,3.0,1.0,1.0,5.0,1.0,4.0,5.0,3.0,4.0
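
An alternative to the chain of `case_when` conditions is a dictionary lookup with `Series.map`. The sketch below is in plain `pandas` and uses a hypothetical two-question survey in place of `health_survey.csv`; the column name `Respondent` and the names `likert` and `scored` are assumptions for illustration.

```python
import pandas as pd

# Answer text -> Likert score
likert = {
    "Strongly Agree": 5,
    "Somewhat Agree": 4,
    "Neither Agree nor Disagree": 3,
    "Somewhat Disagree": 2,
    "Strongly Disagree": 1,
}

# Hypothetical two-question survey standing in for health_survey.csv
survey = pd.DataFrame({
    "Respondent": [1, 2],
    "F1": ["Strongly Agree", "Somewhat Disagree"],
    "F2": ["Neither Agree nor Disagree", "Strongly Disagree"],
})

scored = (
    survey
    .melt(id_vars="Respondent", var_name="Question", value_name="Answer")
    # Look up each answer; text not in the dictionary becomes NaN
    .assign(score=lambda df: df["Answer"].map(likert))
    .drop(columns="Answer")
    .pivot(index="Respondent", columns="Question", values="score")
    .reset_index()
    .rename_axis(columns=None)
)
print(scored)
```

Keeping the mapping in one dictionary makes it easy to check or change the scale without touching the pipeline.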
