<a href="https://colab.research.google.com/github/AKerby/dsci_325_module_7_more_data_management_in_python/blob/main/Stacking%20and%20Unstacking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Colab Prep

Execute the following code cells whenever you open or restart the notebook in Google Colab.

In [None]:
!pip install "polars[all]"

In [None]:
!wget https://github.com/WSU-DataScience/dsci_325_module_7_more_data_management_in_python/raw/main/sample_data.zip

In [None]:
!unzip ./sample_data.zip

In [None]:
!pip install more_polars

# Stacking and Unstacking Data

In [None]:
import polars as pl

## Reshaping data

There are two ways to reshape data:

* We can **stack** data into a *tall* format.
* We can **unstack** data into a *wide* format.

## (totally real and not at all made-up) Example - Quarterly Auto Sales

**Note** that the last four columns share

* the same measurement (quarterly sales)
* the same units

In [None]:
sales = pl.read_csv("./sample_data/auto_sales.csv")
sales

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15.0,12
"""Bob""",19,12,17.0,20
"""Doug""",20,13,,20
"""Yolanda""",19,8,32.0,15
"""Xerxes""",12,23,18.0,9


## Stacking measurements of the same type/units

<img src="https://github.com/WSU-DataScience/dsci_325_module_7_more_data_management_in_python/raw/main/img/stack_in_action.gif" width=600>

When the column labels themselves carry information (here, the car type), we can fix the problem by stacking the data with `melt`.

## A Stack by any other name ...

The act of stacking similar columns goes by various names.

* `polars` calls this `melt`
* JMP and Minitab call this *stack*
* Wickham/`tidyr`/`dfply` call this *gather*

I prefer **stack**, primarily because it makes it clear we are *melting*/*gathering* data vertically.

## Stacking data in `polars` with `melt`

Syntax: `df.melt(id_vars, value_vars, variable_name, value_name)`

In [None]:
sales_cols = ['Compact', 'Sedan', 'SUV', 'Truck']
sales_stacked = (sales
                 .melt('Salesperson', sales_cols, "CarType","QrtSales")
                )
sales_stacked

Salesperson,CarType,QrtSales
str,str,i64
"""Ann""","""Compact""",22.0
"""Bob""","""Compact""",19.0
"""Doug""","""Compact""",20.0
"""Yolanda""","""Compact""",19.0
"""Xerxes""","""Compact""",12.0
"""Ann""","""Sedan""",18.0
"""Bob""","""Sedan""",12.0
"""Doug""","""Sedan""",13.0
"""Yolanda""","""Sedan""",8.0
"""Xerxes""","""Sedan""",23.0
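
To see what stacking does row by row, here is a minimal plain-Python sketch that uses no polars at all. The helper name `stack_rows` is hypothetical (it is not a polars function), and the two rows are copied from the sales table above. It only illustrates the idea that each old column label becomes a value in a new label column.

```python
# Plain-Python sketch of stacking; `stack_rows` is a hypothetical helper,
# not part of polars.
def stack_rows(rows, id_col, value_cols, variable_name, value_name):
    """Return one output row per (value column, input row) pair."""
    stacked = []
    for col in value_cols:      # go column by column, like the output above
        for row in rows:
            stacked.append({
                id_col: row[id_col],
                variable_name: col,       # the old column label becomes data
                value_name: row[col],     # the old cell becomes the value
            })
    return stacked

wide = [
    {"Salesperson": "Ann", "Compact": 22, "Sedan": 18},
    {"Salesperson": "Bob", "Compact": 19, "Sedan": 12},
]
tall = stack_rows(wide, "Salesperson", ["Compact", "Sedan"], "CarType", "QrtSales")
for row in tall:
    print(row)
```

Two rows with two sales columns become four rows, one per salesperson and car type.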


## Unstacking Data with `pivot`

Syntax: `df.pivot(values, index, columns, aggregate_fn = 'first')`

In [None]:
(sales_stacked
 .pivot('QrtSales', 'Salesperson', 'CarType')
)

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15.0,12
"""Bob""",19,12,17.0,20
"""Doug""",20,13,,20
"""Yolanda""",19,8,32.0,15
"""Xerxes""",12,23,18.0,9


## Safely STACK then UNSTACK


If we want to ensure we can unstack after stacking,

* Add an `ID`/`index` column of unique values
* Use this column as one of the index columns.
* Use `'first'` as the `aggregate_fn`.


In [None]:
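(sales
 # Add a unique row ID so that every original row stays distinct
 .with_columns(pl.arange(0, len(sales)).alias('ID'))
 # Stack, keeping both ID and Salesperson as identifier columns
 .melt(['ID', 'Salesperson'], sales_cols, "CarType","QrtSales")
 # Unstack, using the same identifier columns as the index
 .pivot('QrtSales', ['ID','Salesperson'], 'CarType')
)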

ID,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
0,"""Ann""",22,18,15.0,12
1,"""Bob""",19,12,17.0,20
2,"""Doug""",20,13,,20
3,"""Yolanda""",19,8,32.0,15
4,"""Xerxes""",12,23,18.0,9
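
The reason for the unique `ID` column can be sketched in plain Python. The helper `unstack_first` below is hypothetical and only mimics the idea of unstacking with a `'first'` aggregation. The data is made up: two different salespeople who happen to share a name.

```python
# Plain-Python sketch of unstacking with a 'first' aggregation;
# `unstack_first` is a hypothetical helper, not part of polars.
def unstack_first(tall, index_cols, column_col, value_col):
    """Unstack, keeping the first value seen for each (index, column) cell."""
    wide = {}
    for row in tall:
        key = tuple(row[c] for c in index_cols)
        out = wide.setdefault(key, {c: row[c] for c in index_cols})
        # setdefault keeps an existing value, so the first one wins
        out.setdefault(row[column_col], row[value_col])
    return list(wide.values())

# Made-up data: two different people named Ann.
tall = [
    {"ID": 0, "Salesperson": "Ann", "CarType": "Compact", "QrtSales": 22},
    {"ID": 1, "Salesperson": "Ann", "CarType": "Compact", "QrtSales": 7},
]

by_name = unstack_first(tall, ["Salesperson"], "CarType", "QrtSales")
by_id = unstack_first(tall, ["ID", "Salesperson"], "CarType", "QrtSales")

print(len(by_name))  # the two Anns collapse into one row
print(len(by_id))    # both rows survive
```

Without the `ID`, the second Ann's sales are silently dropped; with it, the round trip keeps every original row.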


## Why Stack?

* Perform transformations on many columns.
* Fix problems with the Golden Rule

## Example - Switching Units on All Sales

Suppose your manager wants these numbers in *monthly* sales.  You could

1. Adjust each column with a separate formula
2. Stack --> Transform once --> Unstack

#### Method 1 - Brute-force Column Transformations

In [None]:
# Allow keyword arguments in with_columns (used by Method 2 below)
pl.Config.with_columns_kwargs = True

(sales
 .with_columns([(pl.col('Compact')/3).alias('Compact'),
                (pl.col('SUV')/3).alias('SUV'),
                (pl.col('Sedan')/3).alias('Sedan'),
                (pl.col('Truck')/3).alias('Truck')
               ])
)

Salesperson,Compact,Sedan,SUV,Truck
str,f64,f64,f64,f64
"""Ann""",7.333333,6.0,5.0,4.0
"""Bob""",6.333333,4.0,5.666667,6.666667
"""Doug""",6.666667,4.333333,,6.666667
"""Yolanda""",6.333333,2.666667,10.666667,5.0
"""Xerxes""",4.0,7.666667,6.0,3.0


#### Method 2 - Stack-Transform-Unstack

In [None]:
(sales
 .melt('Salesperson', sales_cols, "CarType","QrtSales")
 .with_columns(MonSales = pl.col('QrtSales')/3)
 .drop('QrtSales')
 .pivot('MonSales', 'Salesperson', 'CarType')
)

Salesperson,Compact,Sedan,SUV,Truck
str,f64,f64,f64,f64
"""Ann""",7.333333,6.0,5.0,4.0
"""Bob""",6.333333,4.0,5.666667,6.666667
"""Doug""",6.666667,4.333333,,6.666667
"""Yolanda""",6.333333,2.666667,10.666667,5.0
"""Xerxes""",4.0,7.666667,6.0,3.0
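
The three steps of Method 2 can also be traced in plain Python. This is a sketch, not the polars implementation: the function name `quarterly_to_monthly` is hypothetical, and the two rows are copied from the sales table (with `None` standing in for Doug's missing SUV value). Note that the code never names an individual car type.

```python
# Plain-Python sketch of stack -> transform once -> unstack;
# `quarterly_to_monthly` is a hypothetical helper.
def quarterly_to_monthly(rows, id_col):
    value_cols = [c for c in rows[0] if c != id_col]

    # 1. Stack: one (id, label, value) triple per cell
    tall = [(row[id_col], col, row[col]) for row in rows for col in value_cols]

    # 2. Transform once: a single formula covers every column; missing stays missing
    tall = [(rid, col, None if v is None else v / 3) for rid, col, v in tall]

    # 3. Unstack: rebuild one wide row per id (ids are unique in this example)
    wide = {}
    for rid, col, v in tall:
        wide.setdefault(rid, {id_col: rid})[col] = v
    return list(wide.values())

sales_rows = [
    {"Salesperson": "Ann", "Compact": 22, "Sedan": 18, "SUV": 15, "Truck": 12},
    {"Salesperson": "Doug", "Compact": 20, "Sedan": 13, "SUV": None, "Truck": 20},
]
monthly = quarterly_to_monthly(sales_rows, "Salesperson")
for row in monthly:
    print(row)
```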


## Comparing the two methods

**Method 1:**
* More straightforward
* Lots of repeated code
* Doesn't scale ... imagine 100+ columns

**Method 2:**
* More complicated
* Scales well $\longrightarrow$ same code regardless of number of columns
* Easier with more complicated transformations

## <font color="red"> Exercise 7.2.1 </font>
    
**Task:** Load the `Artwork.csv` data and use the Stack-Transform-Unstack trick to convert all measurements in cm to mm.

**Hints.**
1. You will need to fix the `dtypes` for some of the measurement columns by passing the `artwork_dtypes` to `pl.read_csv`.
2. You will want to add an `ID` column to make the stack and unstack safe.
3. You can use `cols.from_` and `cols.to` to get the list of columns needed in `melt`.
4. `pivot` can't group by float columns, so you need to stack all measurements.
5. To process only the `cm` columns, use a `pl.when(cond).then(expr).otherwise(expr)` expression.
6. You should also replace the `cm` with `mm` in the label column (using the same trick in the last hint) before unstacking.
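
The conditional logic behind hints 5 and 6 can be sketched in plain Python on a few made-up stacked rows. This is not the polars solution: the column names `Measurement` and `Value` and the helper `cm_to_mm` are hypothetical, and the numbers are invented. The `if ... else` expressions play the role that `pl.when(cond).then(expr).otherwise(expr)` plays in polars.

```python
# Made-up stacked rows; the labels follow the measurement column names above.
stacked = [
    {"Measurement": "Height (cm)", "Value": 12.5},
    {"Measurement": "Weight (kg)", "Value": 3.0},
    {"Measurement": "Width (cm)", "Value": None},
]

def cm_to_mm(row):
    """Convert a cm row to mm; leave every other row unchanged."""
    label, value = row["Measurement"], row["Value"]
    is_cm = label.endswith("(cm)")
    return {
        # when cm: relabel as mm, otherwise keep the label
        "Measurement": label.replace("(cm)", "(mm)") if is_cm else label,
        # when cm and not missing: multiply by 10, otherwise keep the value
        "Value": value * 10 if is_cm and value is not None else value,
    }

converted = [cm_to_mm(row) for row in stacked]
for row in converted:
    print(row)
```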

In [None]:
artwork_dtypes = {'Title': pl.datatypes.Utf8,
                  'Artist': pl.datatypes.Utf8,
                  'ConstituentID': pl.datatypes.Utf8,
                  'ArtistBio': pl.datatypes.Utf8,
                  'Nationality': pl.datatypes.Utf8,
                  'BeginDate': pl.datatypes.Utf8,
                  'EndDate': pl.datatypes.Utf8,
                  'Gender': pl.datatypes.Utf8,
                  'Date': pl.datatypes.Utf8,
                  'Medium': pl.datatypes.Utf8,
                  'Dimensions': pl.datatypes.Utf8,
                  'CreditLine': pl.datatypes.Utf8,
                  'AccessionNumber': pl.datatypes.Utf8,
                  'Classification': pl.datatypes.Utf8,
                  'Department': pl.datatypes.Utf8,
                  'DateAcquired': pl.datatypes.Utf8,
                  'Cataloged': pl.datatypes.Utf8,
                  'ObjectID': pl.datatypes.Int64,
                  'URL': pl.datatypes.Utf8,
                  'ThumbnailURL': pl.datatypes.Utf8,
                  'Circumference (cm)': pl.datatypes.Float64,
                  'Depth (cm)': pl.datatypes.Float64,
                  'Diameter (cm)': pl.datatypes.Float64,
                  'Height (cm)': pl.datatypes.Float64,
                  'Length (cm)': pl.datatypes.Float64,
                  'Weight (kg)': pl.datatypes.Float64,
                  'Width (cm)': pl.datatypes.Float64,
                  'Seat Height (cm)': pl.datatypes.Float64,
                  'Duration (sec.)': pl.datatypes.Float64}

In [None]:
# Your code here