<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Joining" data-toc-modified-id="Joining-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Joining</a></span></li><li><span><a href="#Optional:-Joining-after-split-apply-combine" data-toc-modified-id="Optional:-Joining-after-split-apply-combine-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Optional: Joining after split-apply-combine</a></span><ul class="toc-item"><li><span><a href="#split-apply-combine-rejoin-via-.transform()" data-toc-modified-id="split-apply-combine-rejoin-via-.transform()-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span><strong>split-apply-combine-rejoin</strong> via <code>.transform()</code></a></span></li><li><span><a href="#Groupwise-imputation-of-NAs-using-.transform()" data-toc-modified-id="Groupwise-imputation-of-NAs-using-.transform()-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Groupwise imputation of <code>NA</code>s using <code>.transform()</code></a></span></li></ul></li><li><span><a href="#Tidy-data:-melting-and-pivoting" data-toc-modified-id="Tidy-data:-melting-and-pivoting-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Tidy data: melting and pivoting</a></span><ul class="toc-item"><li><span><a href="#Melting:-wide-form-to-long-long" data-toc-modified-id="Melting:-wide-form-to-long-long-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Melting: wide form to long long</a></span></li><li><span><a href="#Optional:-Pivoting:-long-form-to-wide-form" data-toc-modified-id="Optional:-Pivoting:-long-form-to-wide-form-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Optional: Pivoting: long form to wide form</a></span></li></ul></li></ul></div>

In this lesson we will look at common manipulations of `DataFrames`: joining one `DataFrame` to another via the `.merge()` method, transforming data from 'wide' to 'long' form using `.melt()`, and the reverse transformation using `.pivot()`.

# Setup

In [1]:
import pandas as pd
import numpy as np

stock = pd.DataFrame({
    'item_no': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='Int64'),
    'cost_class': pd.Series(['1st', '2nd', '3rd', '4th', '4th', '3rd', '2nd', np.nan, '1st', '3rd'], dtype='string'),
    'cost': pd.Series([10.99, np.nan, 2.99, np.nan, 2.99, 2.45, 5.99, 5.99, 3.00, None], dtype='float64'),
    'stock_code': pd.Series(['a', 'a', 'c', 'b', 'a', 'b', np.nan, np.nan, 'a', 'c'], dtype='string'),
    'priority_code': pd.Series([np.nan, None, 'a', 'b', None, 'a', 'e', None, 'a', 'd'], dtype='string'),
    'tax_rate': pd.Series([0, 0, 20, 20, 20, 0, 20, 20, 5, 20])
}).set_index('item_no')

feedback = pd.DataFrame({
    'item_no': pd.Series([2, 2, 3, 4, 5, 1, 9, 5, 7, 10, np.nan], dtype='Int64'),
    'date': pd.Series(['2020-04-11', '2020-04-12', '2020-05-13', np.nan, '2020-05-28', '2020-05-29',
                       '2020-06-01', '2020-06-07', '2020-06-300', '2020-06-30', '2020-08-01']),
    'rating': pd.Series([5, 1, 3, 5, 4, 3, 2, 5, 1, 4, 5], dtype='Int64'),
    'message': pd.Series(["Ideal for my lunchbox - Dave Smith", "Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or 07700 900796",
                        "My name is Tony 07700900829", "Bought another one for my sister", "Works pretty well, but can't handle carrots", 
                        "The concept is great, the execution- not so great, thin handles - Eleanor & dave", np.nan,
                        "Arrived on time, as expected", "Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, 07700 900572 or 0131 9496 0886", 
                        "Workks well, seems solid, good value", "Great finish on it, really decent build quality"], dtype='string')
})

sales = pd.DataFrame({
    'item_number': pd.Series([1, 2, 3, 5, 6, 7, 8, 9, 10], dtype='Int64'),
    'target_class': pd.Series(['a', 'b', 'b', 'c', 'c', 'b', 'a', 'a', 'a']),
    'days_in_reduction': pd.Series([0, 7, 14, 14, 0, 0, 7, 14, 30]),
    'days_sales_0_50':   pd.Series([120, 19, 282, 210, 194, 101, 298, 187, 103], dtype='Int64'),
    'days_sales_51_100': pd.Series([141, 341, 22, np.nan, 112, 87, 54, 130, 105], dtype='Int64'),
    'days_sales_101plus':   pd.Series([99, np.nan, 16, 49, 54, 130, np.nan, 23, 152], dtype='Int64')
})

Let's remind ourselves what these `DataFrame`s contain:

In [2]:
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
7,2nd,5.99,,e,20
8,,5.99,,,20
9,1st,3.0,a,a,5
10,3rd,,c,d,20


In [3]:
feedback

Unnamed: 0,item_no,date,rating,message
0,2.0,2020-04-11,5,Ideal for my lunchbox - Dave Smith
1,2.0,2020-04-12,1,"Broke first time I used it, I want a refund! G..."
2,3.0,2020-05-13,3,My name is Tony 07700900829
3,4.0,,5,Bought another one for my sister
4,5.0,2020-05-28,4,"Works pretty well, but can't handle carrots"
5,1.0,2020-05-29,3,"The concept is great, the execution- not so gr..."
6,9.0,2020-06-01,2,
7,5.0,2020-06-07,5,"Arrived on time, as expected"
8,7.0,2020-06-300,1,Customer service terrible - hello anyone there...
9,10.0,2020-06-30,4,"Workks well, seems solid, good value"


In [4]:
sales

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_0_50,days_sales_51_100,days_sales_101plus
0,1,a,0,120,141.0,99.0
1,2,b,7,19,341.0,
2,3,b,14,282,22.0,16.0
3,5,c,14,210,,49.0
4,6,c,0,194,112.0,54.0
5,7,b,0,101,87.0,130.0
6,8,a,7,298,54.0,
7,9,a,14,187,130.0,23.0
8,10,a,30,103,105.0,152.0


# Joining

Joining `DataFrame`s in `pandas` works very similarly to joining tables in `SQL`! Often, you will be able to identify a key relationship between `DataFrames`: a column in one `DataFrame` will relate to a column in another.

The `merge()` function is generally useful for this: 

* Specify the type of join with the `how=` argument (options are `left`, `right`, `outer` and `inner`)
* Specify shared columns (and/or `index`) to join on with argument `on=`, or specify differently named columns using `left_on=` and `right_on=` arguments
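
To get a feel for how `how=` changes the result, here is a minimal sketch on two made-up tables (`left` and `right` are illustrative names, not part of the lesson data). It counts the rows returned by each join type, and uses the `indicator=True` argument to label where each row of an outer join came from.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Count the rows each join type returns
rows_by_join = {
    how: len(left.merge(right, how=how, on='key'))
    for how in ['inner', 'left', 'right', 'outer']
}
print(rows_by_join)

# indicator=True adds a `_merge` column: 'left_only', 'right_only' or 'both'
outer = left.merge(right, how='outer', on='key', indicator=True)
print(outer)
```

Only keys 2 and 3 appear in both tables, so the inner join is the smallest result and the outer join the largest.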

Let's see an example:

> **Get the details of all the items for which feedback has been left, together with all feedback details**

This will be an **inner join**: the result keeps only those items that have feedback, with one row for each piece of feedback left

In [5]:
stock.merge(feedback, how='inner', on='item_no')

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate,date,rating,message
0,1,1st,10.99,a,,0,2020-05-29,3,"The concept is great, the execution- not so gr..."
1,2,2nd,,a,,0,2020-04-11,5,Ideal for my lunchbox - Dave Smith
2,2,2nd,,a,,0,2020-04-12,1,"Broke first time I used it, I want a refund! G..."
3,3,3rd,2.99,c,a,20,2020-05-13,3,My name is Tony 07700900829
4,4,4th,,b,b,20,,5,Bought another one for my sister
5,5,4th,2.99,a,,20,2020-05-28,4,"Works pretty well, but can't handle carrots"
6,5,4th,2.99,a,,20,2020-06-07,5,"Arrived on time, as expected"
7,7,2nd,5.99,,e,20,2020-06-300,1,Customer service terrible - hello anyone there...
8,9,1st,3.0,a,a,5,2020-06-01,2,
9,10,3rd,,c,d,20,2020-06-30,4,"Workks well, seems solid, good value"


Note that in this case we have joined the `index` of `stock` to a regular column in `feedback`. This works because both are named `item_no`: the `on=` argument accepts the name of an index level as well as the name of a column.
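
If the index and the column had different names, `on=` alone would not be enough. One option, sketched here on hypothetical `items` and `reviews` tables (not the lesson data), is to say explicitly that the left key is the index using `left_index=True`, and to name the right-hand column with `right_on=`.

```python
import pandas as pd

# Hypothetical tables: the key is the index of one and a column of the other,
# and the two have different names
items = pd.DataFrame(
    {'item_no': [1, 2, 3], 'cost': [10.99, 5.99, 2.99]}
).set_index('item_no')
reviews = pd.DataFrame({'item_id': [2, 2, 3], 'rating': [5, 1, 3]})

# Join the index of `items` to the `item_id` column of `reviews`
joined = items.merge(reviews, how='inner', left_index=True, right_on='item_id')
print(joined)
```

Item 1 has no reviews, so it drops out of the inner join, while item 2 appears once per review.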

How can we amend this code for the following:

> **Get the details of all items for which feedback has been left, together with feedback rating and date**

In [6]:
stock.merge(feedback.loc[:, ['item_no', 'rating', 'date']], how='inner', on='item_no')

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate,rating,date
0,1,1st,10.99,a,,0,3,2020-05-29
1,2,2nd,,a,,0,5,2020-04-11
2,2,2nd,,a,,0,1,2020-04-12
3,3,3rd,2.99,c,a,20,3,2020-05-13
4,4,4th,,b,b,20,5,
5,5,4th,2.99,a,,20,4,2020-05-28
6,5,4th,2.99,a,,20,5,2020-06-07
7,7,2nd,5.99,,e,20,1,2020-06-300
8,9,1st,3.0,a,a,5,2,2020-06-01
9,10,3rd,,c,d,20,4,2020-06-30


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

***Show all stock items, together with any sales data for them***

* Think about the type of join: **all** of one `DataFrame`, together with any matching rows from another `DataFrame`
* To keep `item_no` in your joined `DataFrame`, use `reset_index()` on `stock` prior to joining
* You will also need to use arguments `left_on=` and `right_on=`

If you have time, also try the following:

* drop column `item_number` after joining
* save to a new variable `stock_sales`

**Solution**

The key column in `stock` is `item_no`, while in `sales` it is `item_number`

In [7]:
stock_sales = stock \
    .reset_index() \
    .merge(sales, how='left', left_on='item_no', right_on='item_number') \
    .drop(columns='item_number')

stock_sales

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate,target_class,days_in_reduction,days_sales_0_50,days_sales_51_100,days_sales_101plus
0,1,1st,10.99,a,,0,a,0.0,120.0,141.0,99.0
1,2,2nd,,a,,0,b,7.0,19.0,341.0,
2,3,3rd,2.99,c,a,20,b,14.0,282.0,22.0,16.0
3,4,4th,,b,b,20,,,,,
4,5,4th,2.99,a,,20,c,14.0,210.0,,49.0
5,6,3rd,2.45,b,a,0,c,0.0,194.0,112.0,54.0
6,7,2nd,5.99,,e,20,b,0.0,101.0,87.0,130.0
7,8,,5.99,,,20,a,7.0,298.0,54.0,
8,9,1st,3.0,a,a,5,a,14.0,187.0,130.0,23.0
9,10,3rd,,c,d,20,a,30.0,103.0,105.0,152.0


***

<hr style="border:8px solid black"> </hr>

# Joining after grouping

Grouping and aggregating a `DataFrame`, and then joining the result back to the same `DataFrame`, is a common operation in data analysis. The grouping and aggregating part is known as **split-apply-combine**; here we follow it with a join back to the original rows. Let's see an example in `pandas`:

> **Add for each item in stock a column mean_cost_by_stock_code containing the mean cost for all items with the same stock_code**

First, we need to get the mean `cost` for each `stock_code`. Let's be specific about the name of the final column we want after aggregation

In [8]:
mean_cost_by_stock_code = stock.groupby('stock_code').agg(mean_cost_by_stock_code=('cost', 'mean'))
mean_cost_by_stock_code

stock_code,mean_cost_by_stock_code
a,5.66
b,2.45
c,2.99


Now we need to join this back to the `stock` `DataFrame`, using `stock_code` as the joining key

In [9]:
stock = stock.merge(mean_cost_by_stock_code, how='left', on='stock_code')
stock

Unnamed: 0,cost_class,cost,stock_code,priority_code,tax_rate,mean_cost_by_stock_code
0,1st,10.99,a,,0,5.66
1,2nd,,a,,0,5.66
2,3rd,2.99,c,a,20,2.99
3,4th,,b,b,20,2.45
4,4th,2.99,a,,20,5.66
5,3rd,2.45,b,a,0,2.45
6,2nd,5.99,,e,20,
7,,5.99,,,20,
8,1st,3.0,a,a,5,5.66
9,3rd,,c,d,20,2.99
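
Notice that the merged output above is numbered 0 to 9: the `item_no` index has not survived the merge. If you'd like to keep it, one possible pattern is to move the index into a column before merging and restore it afterwards. The sketch below uses a small hypothetical `items` table rather than the lesson's `stock`.

```python
import pandas as pd

# Hypothetical table indexed by item_no, with one missing stock_code
items = pd.DataFrame({
    'item_no': [1, 2, 3, 4],
    'stock_code': ['a', 'a', 'b', None],
    'cost': [10.0, 20.0, 5.0, 7.0],
}).set_index('item_no')

# Mean cost per stock_code (rows with a missing stock_code form no group)
group_means = items.groupby('stock_code').agg(mean_cost=('cost', 'mean'))

# Move the index into a column, merge, then restore the index
with_means = (
    items
    .reset_index()
    .merge(group_means, how='left', on='stock_code')
    .set_index('item_no')
)
print(with_means)
```

With or without the original index, the key result is the same: every row now carries the mean cost of its group.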


Why might we do this in the first place? To allow for operations like the following:

> **Now add a new column showing the ratio of each item's cost to the mean cost of items sharing the same stock_code**

In [10]:
stock.loc[:, 'cost_over_mean_by_stock_code'] = stock.cost / stock.mean_cost_by_stock_code
stock

Unnamed: 0,cost_class,cost,stock_code,priority_code,tax_rate,mean_cost_by_stock_code,cost_over_mean_by_stock_code
0,1st,10.99,a,,0,5.66,1.941696
1,2nd,,a,,0,5.66,
2,3rd,2.99,c,a,20,2.99,1.0
3,4th,,b,b,20,2.45,
4,4th,2.99,a,,20,5.66,0.528269
5,3rd,2.45,b,a,0,2.45,1.0
6,2nd,5.99,,e,20,,
7,,5.99,,,20,,
8,1st,3.0,a,a,5,5.66,0.530035
9,3rd,,c,d,20,2.99,


## Joining via `.transform()`

This **split-apply-combine** operation followed by **rejoining** to the original `DataFrame` is so common that `pandas` has a dedicated method for it: `.transform()`

Let's use it to add a new column `sum_cost_by_cost_class` which will contain, for each item, the sum of the `cost`s of items in the same `cost_class` as itself 

In [11]:
stock.loc[:, 'sum_cost_by_cost_class'] = stock.groupby('cost_class').cost.transform('sum')
stock

Unnamed: 0,cost_class,cost,stock_code,priority_code,tax_rate,mean_cost_by_stock_code,cost_over_mean_by_stock_code,sum_cost_by_cost_class
0,1st,10.99,a,,0,5.66,1.941696,13.99
1,2nd,,a,,0,5.66,,5.99
2,3rd,2.99,c,a,20,2.99,1.0,5.44
3,4th,,b,b,20,2.45,,2.99
4,4th,2.99,a,,20,5.66,0.528269,2.99
5,3rd,2.45,b,a,0,2.45,1.0,5.44
6,2nd,5.99,,e,20,,,5.99
7,,5.99,,,20,,,
8,1st,3.0,a,a,5,5.66,0.530035,13.99
9,3rd,,c,d,20,2.99,,5.44
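
To convince ourselves that `.transform()` gives the same values as the aggregate-then-merge route, here is a small check on a made-up `items` table (the table and variable names are illustrative only).

```python
import pandas as pd

# Hypothetical data with one missing cost
items = pd.DataFrame({
    'cost_class': ['1st', '2nd', '1st', '2nd'],
    'cost': [10.0, 6.0, 3.0, None],
})

# Route 1: aggregate per group, then join back on the grouping column
sums = items.groupby('cost_class').agg(sum_cost=('cost', 'sum'))
via_merge = items.merge(sums, how='left', on='cost_class')['sum_cost']

# Route 2: transform returns one value per original row directly
via_transform = items.groupby('cost_class')['cost'].transform('sum')

print(via_merge.tolist())
print(via_transform.tolist())
```

Both routes give one group total per original row; `.transform()` simply saves us the intermediate table and the join.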


## Optional: Groupwise imputation of `NA`s using `.transform()`

Imputation of `NA`s in a column by an aggregate of **all non-missing** values in the column [e.g. fill `NA`s in `cost` with `median(cost)`] is clearly a blunt instrument.

**Groupwise imputation** is often a better approach: 

* First group your `DataFrame` by a column or set of columns (e.g. `A` and `B`)
* Say you want to fill `NA`s in column `C`: calculate aggregates of `C` (e.g. mean, median etc) for **each `A`, `B` group**. So far, this is just **split-apply-combine** as usual.
* Now, for each `NA` in `C`, look at which `A`, `B` group it occurs in, and fill it with the aggregate calculated for that group
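
The steps above can be sketched on a toy `DataFrame` with the placeholder columns `A`, `B` and `C` (the data is invented for illustration, and `C_filled` is just a name chosen here so that the original column stays visible).

```python
import pandas as pd

# Hypothetical data: two groups, each with one missing value in C
df = pd.DataFrame({
    'A': ['x', 'x', 'x', 'y', 'y', 'y'],
    'B': ['p', 'p', 'p', 'q', 'q', 'q'],
    'C': [1.0, 3.0, None, 10.0, None, 20.0],
})

# Median of C within each (A, B) group, repeated on every row of the group
group_medians = df.groupby(['A', 'B'])['C'].transform('median')

# Fill each missing C with the median of its own group
df['C_filled'] = df['C'].fillna(group_medians)
print(df)
```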

Let's see an example of how to do this using `.fillna()` and `.transform()`

***Group stock by cost_class, and then fill each missing value in cost with the median cost calculated over the same group as that missing value*** 

In [12]:
# Let's keep track of which costs were originally missing
# Creating a column that tracks missing values is common before imputation, as the fact 
# a value was originally missing can itself be informative in later analysis steps
stock.loc[:, 'cost_missing'] = stock.cost.isna()

# Now fill with groupwise medians by cost_class
stock.fillna({'cost': np.round(stock.groupby('cost_class').cost.transform('median'), 2)}, inplace=True)
stock

Unnamed: 0,cost_class,cost,stock_code,priority_code,tax_rate,mean_cost_by_stock_code,cost_over_mean_by_stock_code,sum_cost_by_cost_class,cost_missing
0,1st,10.99,a,,0,5.66,1.941696,13.99,False
1,2nd,5.99,a,,0,5.66,,5.99,True
2,3rd,2.99,c,a,20,2.99,1.0,5.44,False
3,4th,2.99,b,b,20,2.45,,2.99,True
4,4th,2.99,a,,20,5.66,0.528269,2.99,False
5,3rd,2.45,b,a,0,2.45,1.0,5.44,False
6,2nd,5.99,,e,20,,,5.99,False
7,,5.99,,,20,,,,False
8,1st,3.0,a,a,5,5.66,0.530035,13.99,False
9,3rd,2.72,c,d,20,2.99,,5.44,True


# Tidy data: melting and pivoting

Before we get any further into reshaping `DataFrame`s, let's briefly discuss the concept of **tidy data**. 

> Happy families are all alike; every unhappy family is unhappy in its own way. — Leo Tolstoy  
> Tidy datasets are all alike; every messy dataset is messy in its own way. — Hadley Wickham

What is a tidy dataset? Here's the definition:

1. Each variable forms a **column**.
2. Each observation forms a **row**.
3. Each type of observational unit forms a **table** (or `DataFrame` in `pandas`)

This is probably more easily understood by seeing it as a figure:

![Tidy data](images/tidy.png "from 'R for Data Science' by Wickham and Grolemund")

*from 'R for Data Science' by Wickham and Grolemund*

So each set of observations for any one 'unit' (e.g. one stock item, one feedback) forms a row in a `DataFrame`. Each column should relate to one variable. Finally each cell contains one measurement of a particular variable for a particular unit. 

Some data analysis operations are made easier by having the data in tidy form, while other operations are easier in other forms. Tidy form is useful to have in your mind as a 'default' for datasets: frequently you will wish to deviate from tidy form for some analysis purpose, but it sometimes helps to think of tidy data as a starting point.

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

Have a look at the `sales` data. Is it in tidy form?

[**Hint** - this is quite a subtle question. Focus on the `days_sales...` columns]

In [13]:
sales

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_0_50,days_sales_51_100,days_sales_101plus
0,1,a,0,120,141.0,99.0
1,2,b,7,19,341.0,
2,3,b,14,282,22.0,16.0
3,5,c,14,210,,49.0
4,6,c,0,194,112.0,54.0
5,7,b,0,101,87.0,130.0
6,8,a,7,298,54.0,
7,9,a,14,187,130.0,23.0
8,10,a,30,103,105.0,152.0


**Solution**

Technically, no; `sales` is not in tidy form. The data relating to daily sales are spread over three columns: `days_sales_0_50`, `days_sales_51_100` and `days_sales_101plus`. The headers of these columns hold values of a variable (the sales class) rather than each naming a variable of its own. It *would* be in tidy form if we could gather these three columns into two columns:

* `days_sales_class` containing values '0_50', '51_100' and '101+'
* `no_of_days` containing the frequency

***

<hr style="border:8px solid black"> </hr>

## Melting: wide form to long form

Let's see how to do this by **melting** the `DataFrame`. To do this, we use the `.melt()` method, setting arguments:

* `id_vars=` to the list of columns we want to function as the **identifier** for a row. These columns will not be melted, and all others will be melted
* `var_name=` to the column name that will hold the **former headings** of the melted columns 
* `value_name=` to the column name that will hold the **former values** in the melted columns

To be honest, it probably makes more sense once you see the results!
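
As a warm-up, here is a minimal sketch of `.melt()` on a tiny made-up table (`wide`, `id`, `jan` and `feb` are hypothetical names, not part of the lesson data). The `id` column is kept as the identifier, and the two month columns are melted.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one column per month
wide = pd.DataFrame({
    'id': [1, 2],
    'jan': [10, 30],
    'feb': [20, 40],
})

# Former headers go to `month`, former cell values go to `sales`
long = wide.melt(id_vars=['id'], var_name='month', value_name='sales')
print(long)
print(wide.shape, '->', long.shape)
```

Two rows and three columns become four rows and three columns: each melted cell now gets a row of its own.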

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

Run the following cell, and then interpret the results. Think about the following:

* Where do the values in `days_sales_class` come from?
* Where do the values in `no_of_days` come from?
* What role do columns `item_number`, `target_class` and `days_in_reduction` play? Are rows in these columns now repeated?

[**Hint** - it might help to sort `sales_melted` by the `item_number` and `days_sales_class` columns using `.sort_values()` to see what is happening!]

In [14]:
sales_melted = sales.melt(
    id_vars=['item_number', 'target_class', 'days_in_reduction'], 
    var_name='days_sales_class',
    value_name='no_of_days'
)

In [15]:
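# melt() may not preserve the nullable Int64 dtype in older pandas 1.x releases,
# so we cast explicitly. We assign with [] rather than .loc[:, ...]: in recent pandas,
# .loc[:, col] = ... writes into the existing column and can keep its old dtype
sales_melted['no_of_days'] = sales_melted['no_of_days'].astype('Int64')
sales_melted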

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_class,no_of_days
0,1,a,0,days_sales_0_50,120.0
1,2,b,7,days_sales_0_50,19.0
2,3,b,14,days_sales_0_50,282.0
3,5,c,14,days_sales_0_50,210.0
4,6,c,0,days_sales_0_50,194.0
5,7,b,0,days_sales_0_50,101.0
6,8,a,7,days_sales_0_50,298.0
7,9,a,14,days_sales_0_50,187.0
8,10,a,30,days_sales_0_50,103.0
9,1,a,0,days_sales_51_100,141.0


**Solution**

In [16]:
sales_melted.sort_values(['item_number', 'days_sales_class'], inplace=True)
sales_melted

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_class,no_of_days
0,1,a,0,days_sales_0_50,120.0
18,1,a,0,days_sales_101plus,99.0
9,1,a,0,days_sales_51_100,141.0
1,2,b,7,days_sales_0_50,19.0
19,2,b,7,days_sales_101plus,
10,2,b,7,days_sales_51_100,341.0
2,3,b,14,days_sales_0_50,282.0
20,3,b,14,days_sales_101plus,16.0
11,3,b,14,days_sales_51_100,22.0
3,5,c,14,days_sales_0_50,210.0


Focus on `item_number` 1: formerly we had one row containing all the data for this item. Now this has been split over three rows: one each for the three columns that have been melted. The old column headers are now in `days_sales_class`; the counts of days are in `no_of_days`.

It's common to say that a transformation like this takes the data from **wide form** to **long form**: we took the data and **reduced** the number of columns by **increasing** the number of rows.

***

<hr style="border:8px solid black"> </hr>

Let's tidy up the values in `days_sales_class`. We don't need the repeated string 'days_sales_...', so we'll remove it

In [17]:
# using replace:
sales_melted.loc[:, 'days_sales_class'] = sales_melted.days_sales_class.replace({
    'days_sales_0_50': '0_50',
    'days_sales_51_100': '51_100',
    'days_sales_101plus': '101_plus',
})

sales_melted.reset_index(drop=True, inplace=True)
sales_melted

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_class,no_of_days
0,1,a,0,0_50,120.0
1,1,a,0,101_plus,99.0
2,1,a,0,51_100,141.0
3,2,b,7,0_50,19.0
4,2,b,7,101_plus,
5,2,b,7,51_100,341.0
6,3,b,14,0_50,282.0
7,3,b,14,101_plus,16.0
8,3,b,14,51_100,22.0
9,5,c,14,0_50,210.0


What does this long form data let us do? Well, it makes it easier to code a solution to the following, using the **split-apply-combine** techniques we learned earlier:

> **Find the most common days_sales_class for each item**

In [18]:
sales_melted \
    .sort_values('no_of_days', ascending=False) \
    .groupby('item_number') \
    .head(1) \
    .loc[:, ['item_number', 'days_sales_class']] \
    .sort_values('item_number')

Unnamed: 0,item_number,days_sales_class
2,1,51_100
5,2,51_100
6,3,0_50
9,5,0_50
12,6,0_50
16,7,101_plus
18,8,0_50
21,9,0_50
25,10,101_plus


Let's break this down:

* First, sort `sales_melted` in descending order of `no_of_days`
* Next, group rows by `item_number` (**split**)
* Then get the first row in each group using `.head(1)` (**apply**)
* **Combine** these rows together, extracting columns `item_number` and `days_sales_class`
* For readability, order by `item_number`

As long and nasty as this looks, the code to do it with **wide form** would be even worse!

Here's another way to do this using `.transform()`. This method applies an aggregator to each group, but **preserves all the original rows in the `DataFrame`**

In [19]:
group_max_mask = sales_melted.groupby('item_number').no_of_days.transform('max') == sales_melted.no_of_days
sales_melted.loc[group_max_mask, ['item_number', 'days_sales_class']]

Unnamed: 0,item_number,days_sales_class
2,1,51_100
5,2,51_100
6,3,0_50
9,5,0_50
12,6,0_50
16,7,101_plus
18,8,0_50
21,9,0_50
25,10,101_plus


Let's see what `.transform('max')` returns

In [20]:
sales_melted.groupby('item_number').no_of_days.transform('max')

0     141
1     141
2     141
3     341
4     341
5     341
6     282
7     282
8     282
9     210
10    210
11    210
12    194
13    194
14    194
15    130
16    130
17    130
18    298
19    298
20    298
21    187
22    187
23    187
24    152
25    152
26    152
Name: no_of_days, dtype: Int64

## Pivoting: long form to wide form

What if we want to go back in the opposite direction: from long data to wide data? This is where the `.pivot()` method can help. We've made this section optional as melting a `DataFrame` is generally more common than pivoting: we are often presented with data in wide form, but find certain operations are easier in long form. 

The logic when using `.pivot()` is very similar to creating a 'pivot table' in a spreadsheet. We need to provide three arguments:

* `index=` to specify the columns that won't be pivoted (these will appear as a `MultiIndex` in the output `DataFrame`)
* `columns=` to specify the column(s) from which the new column headers will be taken
* `values=` to specify the column(s) whose values will fill the cells of the new columns
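
Here is a minimal sketch of those three arguments on a tiny made-up long table (`long`, `id`, `month` and `sales` are hypothetical names). It also shows one thing worth knowing: `.pivot()` raises a `ValueError` if an `index`/`columns` combination occurs more than once, in which case `.pivot_table()` with an aggregation function is one possible alternative.

```python
import pandas as pd

# Hypothetical long table: one row per (id, month) pair
long = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'month': ['jan', 'feb', 'jan', 'feb'],
    'sales': [10, 20, 30, 40],
})

# One row per id, one column per distinct value of `month`
wide = long.pivot(index='id', columns='month', values='sales')
print(wide)

# Repeat the first row so that (id=1, month='jan') appears twice
dupes = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dupes.pivot(index='id', columns='month', values='sales')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table aggregates the duplicated entries instead of failing
summed = dupes.pivot_table(index='id', columns='month', values='sales', aggfunc='sum')
print(summed)
```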

As before, it's probably easier just to see the method in use: we'll 'unmelt' `sales_melted` using `.pivot()`

In [21]:
sales_pivoted = sales_melted \
    .pivot(
        index=['item_number', 'target_class', 'days_in_reduction'], 
        columns='days_sales_class', 
        values='no_of_days'
    )

sales_pivoted

item_number,target_class,days_in_reduction,0_50,101_plus,51_100
1,a,0,120,99.0,141.0
2,b,7,19,,341.0
3,b,14,282,16.0,22.0
5,c,14,210,49.0,
6,c,0,194,54.0,112.0
7,b,0,101,130.0,87.0
8,a,7,298,,54.0
9,a,14,187,23.0,130.0
10,a,30,103,152.0,105.0


Note that the rows now have a `MultiIndex`, built from the columns we passed to the `index=` argument. The pivoted column headers have been taken from `days_sales_class`, and the values in pivoted columns from `no_of_days`.

Let's do a bit of tidying up to get the `DataFrame` back in a shape closer to the original

In [22]:
sales_pivoted = sales_pivoted.reset_index() \
    .rename(columns={
        "0_50": "days_sales_0_50",
        "101_plus": "days_sales_101plus",
        "51_100": "days_sales_51_100"
    })


sales_pivoted.columns.name = None
sales_pivoted

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_0_50,days_sales_101plus,days_sales_51_100
0,1,a,0,120,99.0,141.0
1,2,b,7,19,,341.0
2,3,b,14,282,16.0,22.0
3,5,c,14,210,49.0,
4,6,c,0,194,54.0,112.0
5,7,b,0,101,130.0,87.0
6,8,a,7,298,,54.0
7,9,a,14,187,23.0,130.0
8,10,a,30,103,152.0,105.0


Some operations are easier in wide form than in long form. For example: we can easily add a `days_sales_total` column in wide form like so 

In [23]:
sales_pivoted.loc[:, 'days_sales_total'] = sales_pivoted.days_sales_0_50 + \
    sales_pivoted.days_sales_51_100 +\
    sales_pivoted.days_sales_101plus

sales_pivoted

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_0_50,days_sales_101plus,days_sales_51_100,days_sales_total
0,1,a,0,120,99.0,141.0,360.0
1,2,b,7,19,,341.0,
2,3,b,14,282,16.0,22.0,320.0
3,5,c,14,210,49.0,,
4,6,c,0,194,54.0,112.0,360.0
5,7,b,0,101,130.0,87.0,318.0
6,8,a,7,298,,54.0,
7,9,a,14,187,23.0,130.0,340.0
8,10,a,30,103,152.0,105.0,360.0
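
Notice that the total is missing wherever any one of the three counts is missing, because `+` propagates `NA`. If you would rather treat a missing count as zero, one option is a row-wise `.sum(axis=1)`, which skips missing values by default. The sketch below uses a two-row copy of the sales figures for items 1 and 2; the names `wide`, `strict_total` and `lenient_total` are chosen here for illustration.

```python
import pandas as pd

# Two rows copied from the sales data (items 1 and 2), nullable integers
wide = pd.DataFrame({
    'days_sales_0_50': pd.Series([120, 19], dtype='Int64'),
    'days_sales_51_100': pd.Series([141, 341], dtype='Int64'),
    'days_sales_101plus': pd.Series([99, pd.NA], dtype='Int64'),
})
day_cols = ['days_sales_0_50', 'days_sales_51_100', 'days_sales_101plus']

# Adding columns with + gives NA as soon as any term is NA
strict_total = (
    wide['days_sales_0_50'] + wide['days_sales_51_100'] + wide['days_sales_101plus']
)

# Row-wise sum skips missing values (skipna=True is the default)
lenient_total = wide[day_cols].sum(axis=1)

print(strict_total.tolist())
print(lenient_total.tolist())
```

Which behaviour is right depends on what a missing count means in your data: unknown (keep the `NA`) or genuinely zero (skip it).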


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

Now `.melt()` `sales_pivoted` with the new `days_sales_total` column **back to long form** as before, with columns `days_sales_class` and `no_of_days`. At the end, the `days_sales_class` column should contain four values: '0_50', '51_100', '101plus' and 'total'.

**Solution**

In [24]:
melted_again = sales_pivoted.melt(
    id_vars=['item_number', 'target_class', 'days_in_reduction'],
    var_name='days_sales_class',
    value_name='no_of_days'
).replace({
    'days_sales_class': {
        'days_sales_0_50': '0_50',
        'days_sales_101plus': '101plus',
        'days_sales_51_100': '51_100',
        'days_sales_total': 'total'
        
    }
})

# or

melted_again = sales_pivoted.melt(
    id_vars=['item_number', 'target_class', 'days_in_reduction'],
    var_name='days_sales_class',
    value_name='no_of_days'
)
melted_again.loc[:, 'days_sales_class'] = melted_again.days_sales_class.str.replace("days_sales_", "", regex=False)

melted_again.sort_values(['item_number', 'days_sales_class'])

Unnamed: 0,item_number,target_class,days_in_reduction,days_sales_class,no_of_days
0,1,a,0,0_50,120.0
9,1,a,0,101plus,99.0
18,1,a,0,51_100,141.0
27,1,a,0,total,360.0
1,2,b,7,0_50,19.0
10,2,b,7,101plus,
19,2,b,7,51_100,341.0
28,2,b,7,total,
2,3,b,14,0_50,282.0
11,3,b,14,101plus,16.0


***
<hr style="border:8px solid black"> </hr>