<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#The-.groupby()-method" data-toc-modified-id="The-.groupby()-method-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The <code>.groupby()</code> method</a></span></li><li><span><a href="#Applying-multiple-aggregators" data-toc-modified-id="Applying-multiple-aggregators-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Applying multiple aggregators</a></span></li><li><span><a href="#Optional---custom-aggregators" data-toc-modified-id="Optional---custom-aggregators-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Optional - custom aggregators</a></span></li><li><span><a href="#Optional---pandas-equivalent-of-SQL-HAVING" data-toc-modified-id="Optional---pandas-equivalent-of-SQL-HAVING-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Optional - <code>pandas</code> equivalent of <code>SQL</code> <code>HAVING</code></a></span></li></ul></div>

In this lesson we will learn more about the `groupby` command in `pandas`. The techniques we'll see are often termed **'split-apply-combine'** in data analysis: 

* **split** a `DataFrame` into separate groups of rows
* **apply** some form of aggregating function to each group
* **combine** the aggregates and group identifiers (or 'keys') back into a new `DataFrame`.
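
The three steps can be sketched by hand to make them concrete. This is only an illustration on a hypothetical toy `DataFrame` (the names `sales`, `region` and `amount` are assumptions, not part of the lesson data); in practice a single `.groupby()` call does all three steps.

```python
import pandas as pd

# Hypothetical toy data, not the lesson's `stock` DataFrame
sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'east'],
    'amount': [10, 20, 30, 40, 50],
})

# split: one sub-DataFrame per distinct value of 'region'
pieces = {key: rows for key, rows in sales.groupby('region')}

# apply: reduce each piece to a single number
totals = {key: rows['amount'].sum() for key, rows in pieces.items()}

# combine: collect the per-group results into one labelled object
by_hand = pd.Series(totals, name='amount').rename_axis('region')

# The same three steps in a single call
built_in = sales.groupby('region')['amount'].sum()

print(by_hand)
print(built_in)
```

Both objects should hold the same totals, indexed by the group keys.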

# Setup

In [1]:
import pandas as pd
import numpy as np

stock = pd.DataFrame({
    'item_no': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='Int64'),
    'cost_class': pd.Series(['1st', '2nd', '3rd', '4th', '4th', '3rd', '2nd', np.nan, '1st', '3rd'], dtype='string'),
    'cost': pd.Series([10.99, np.nan, 2.99, np.nan, 2.99, 2.45, 5.99, 5.99, 3.00, None], dtype='float64'),
    'stock_code': pd.Series(['a', 'a', 'c', 'b', 'a', 'b', np.nan, np.nan, 'a', 'c'], dtype='string'),
    'priority_code': pd.Series([np.nan, None, 'a', 'b', None, 'a', 'e', None, 'a', 'd'], dtype='string'),
    'tax_rate': pd.Series([0, 0, 20, 20, 20, 0, 20, 20, 5, 20])
}).set_index('item_no')

# Recap:  `.groupby()` method 

As we saw on Tuesday, all `pandas` `DataFrames` offer a `.groupby()` method, which returns a `DataFrameGroupBy` object. 

In [2]:
groupby_stock_code = stock.groupby('stock_code')
type(groupby_stock_code)

pandas.core.groupby.generic.DataFrameGroupBy

If we try to examine this object, we'll see that we just get object information but don't see the contents. `pandas` is acting in a **lazy** fashion here: `groupby_stock_code` will return results only when an **aggregating or summarising method** is called upon it. This is pretty similar to what you saw in R when using the `group_by()` function. 

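One way to see the contents is to pass the `DataFrameGroupBy` object to `pd.DataFrame()`. Iterating over the object yields one `(key, rows)` pair per group, so each group becomes one row of the result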

In [3]:
pd.DataFrame(groupby_stock_code)

| | 0 | 1 |
|---|---|---|
| 0 | a | cost_class cost stock_code priority_... |
| 1 | b | cost_class cost stock_code priority_c... |
| 2 | c | cost_class cost stock_code priority_c... |


So in column 0 and row 0, we have the first group key `'a'`, corresponding to one value of `stock_code`. What is stored against this value in column 1?

In [4]:
# check out what's in index 0 (corresponds to a)
pd.DataFrame(groupby_stock_code).loc[0, 1]

| item_no | cost_class | cost | stock_code | priority_code | tax_rate |
|---|---|---|---|---|---|
| 1 | 1st | 10.99 | a | `<NA>` | 0 |
| 2 | 2nd | NaN | a | `<NA>` | 0 |
| 5 | 4th | 2.99 | a | `<NA>` | 20 |
| 9 | 1st | 3.0 | a | a | 5 |


We have all the rows from the original `stock` `DataFrame` corresponding to `stock_code == 'a'`, and so on for `stock_code`s `'b'` and `'c'`. 


In [5]:
# check b
pd.DataFrame(groupby_stock_code).loc[1, 1]

| item_no | cost_class | cost | stock_code | priority_code | tax_rate |
|---|---|---|---|---|---|
| 4 | 4th | NaN | b | b | 20 |
| 6 | 3rd | 2.45 | b | a | 0 |


In [6]:
# check c
pd.DataFrame(groupby_stock_code).loc[2, 1]

| item_no | cost_class | cost | stock_code | priority_code | tax_rate |
|---|---|---|---|---|---|
| 3 | 3rd | 2.99 | c | a | 20 |
| 10 | 3rd | NaN | c | d | 20 |
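
The groups can also be inspected without building an intermediate `DataFrame`. The sketch below uses a hypothetical toy frame (`toy` is an assumption standing in for `stock`) and shows `.ngroups`, `.groups`, `.get_group()` and plain iteration.

```python
import pandas as pd

# Hypothetical toy data standing in for the lesson's `stock` DataFrame
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'c', 'b', 'a', 'b'],
    'cost': [10.99, 1.50, 2.99, 4.00, 2.99, 2.45],
})
grouped = toy.groupby('stock_code')

# Number of groups, and the row labels belonging to each key
print(grouped.ngroups)
print(grouped.groups)

# Pull out the rows of a single group by its key
rows_a = grouped.get_group('a')
print(rows_a)

# Iterating yields (key, sub-DataFrame) pairs, one per group
for key, rows in grouped:
    print(key, len(rows))
```

`.get_group()` is usually the quickest route when only one group is of interest.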




# Aggregates 

The inspection above was an aside to help you understand how `.groupby()` works. The normal workflow after **splitting** the `DataFrame` into groups is to **apply** an aggregating function to each group, and then **combine** the aggregates back into a final `DataFrame`. Let's do that now for `groupby_stock_code`.

> **Get the count of non-missing values in each column for each stock_code**

In [7]:
# count the number of non-missing values in each column for each group
counts = groupby_stock_code.count()
counts

| stock_code | cost_class | cost | priority_code | tax_rate |
|---|---|---|---|---|
| a | 4 | 3 | 1 | 4 |
| b | 2 | 1 | 2 | 2 |
| c | 2 | 1 | 2 | 2 |


Note that the group keys form the `Index` of the aggregated `DataFrame`

In [8]:
counts.index

Index(['a', 'b', 'c'], dtype='object', name='stock_code')

If we want to suppress this behaviour, we can pass argument `as_index=False` to `.groupby()` (or use `.reset_index()`)

In [9]:
stock.groupby('stock_code', as_index=False).count()

| | stock_code | cost_class | cost | priority_code | tax_rate |
|---|---|---|---|---|---|
| 0 | a | 4 | 3 | 1 | 4 |
| 1 | b | 2 | 1 | 2 | 2 |
| 2 | c | 2 | 1 | 2 | 2 |
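
The two ways of keeping the group keys out of the `Index` can be compared side by side. This is a small sketch on a hypothetical toy frame (`toy` is an assumption, not the lesson data).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'b', 'c', 'c'],
    'cost': [1.0, None, 2.0, 3.0, 4.0],
})

# Group keys kept as a regular column from the start
flat = toy.groupby('stock_code', as_index=False).count()

# Group keys first placed in the Index, then moved back into a column
reset = toy.groupby('stock_code').count().reset_index()

print(flat)
print(reset)
```

Both results carry `stock_code` as an ordinary column with a default integer `Index`.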


# Applying multiple aggregators

We can apply multiple aggregators by passing their names as a `list` of strings to the `.agg()` method

In [10]:
multiple_aggs = groupby_stock_code.agg(['count', 'max', 'min'])
multiple_aggs

| | cost_class | cost_class | cost_class | cost | cost | cost | priority_code | priority_code | priority_code | tax_rate | tax_rate | tax_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **stock_code** | count | max | min | count | max | min | count | max | min | count | max | min |
| a | 4 | 4th | 1st | 3 | 10.99 | 2.99 | 1 | a | a | 4 | 20 | 0 |
| b | 2 | 4th | 3rd | 1 | 2.45 | 2.45 | 2 | b | a | 2 | 20 | 0 |
| c | 2 | 3rd | 3rd | 1 | 2.99 | 2.99 | 2 | d | a | 2 | 20 | 20 |


Hmm, what's happened to the column labels? We seem to have multiple levels to them. In fact, this is an example of a **hierarchical index** known as a `MultiIndex`

In [11]:
multiple_aggs.columns

MultiIndex([(   'cost_class', 'count'),
            (   'cost_class',   'max'),
            (   'cost_class',   'min'),
            (         'cost', 'count'),
            (         'cost',   'max'),
            (         'cost',   'min'),
            ('priority_code', 'count'),
            ('priority_code',   'max'),
            ('priority_code',   'min'),
            (     'tax_rate', 'count'),
            (     'tax_rate',   'max'),
            (     'tax_rate',   'min')],
           )

How do we work with a `MultiIndex`? Say, for example, we want to get the `max()` `cost` for items with `stock_code` 'a'. You pass the levels of the `MultiIndex` to `.loc[]` as a `tuple`

In [12]:
multiple_aggs.loc['a', ('cost', 'max')]

10.99
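
Beyond a single lookup with `.loc[]`, there are a few other common ways to work with `MultiIndex` columns. The sketch below uses a hypothetical toy frame (`toy` and `aggs` are assumptions); the flattening step at the end is just one possible naming convention.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'b', 'b'],
    'cost': [10.0, 2.0, 5.0, 7.0],
    'tax_rate': [0, 20, 20, 5],
})
aggs = toy.groupby('stock_code').agg(['min', 'max'])

# Selecting a top-level label returns every aggregate for that column
cost_aggs = aggs['cost']

# A full (column, aggregator) tuple selects a single column as a Series
cost_max = aggs[('cost', 'max')]

# .xs() takes a cross-section: here, 'max' for every original column
all_max = aggs.xs('max', axis=1, level=1)

# One possible way to flatten the two levels into single string labels
flat = aggs.copy()
flat.columns = ['_'.join(labels) for labels in aggs.columns]
print(list(flat.columns))
```

Flattening is often convenient before exporting the result or joining it to another `DataFrame`.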

What if you don't want to apply the same aggregators to each column? In that case, you can pass `.agg()` a `dict` that maps each column label to the aggregator (or `list` of aggregators) to apply to it, like so

In [13]:
other_multiple_aggs = groupby_stock_code.agg({
    'cost': ['mean', 'count', 'sum'],
    'cost_class': 'count',
    'priority_code': 'count'
})
other_multiple_aggs

| | cost | cost | cost | cost_class | priority_code |
|---|---|---|---|---|---|
| **stock_code** | mean | count | sum | count | count |
| a | 5.66 | 3 | 16.98 | 4 | 1 |
| b | 2.45 | 1 | 2.45 | 2 | 2 |
| c | 2.99 | 1 | 2.99 | 2 | 2 |



Finally, for maximum user control, we can use **named aggregation**: each keyword argument gives the final column label, and its value is a `tuple` of the column to aggregate and the aggregator to apply

In [14]:
groupby_stock_code.agg(
    total_cost=('cost', 'sum'),
    no_of_items=('cost', 'count'),
    mean_tax_rate=('tax_rate', 'mean') 
)

| stock_code | total_cost | no_of_items | mean_tax_rate |
|---|---|---|---|
| a | 16.98 | 3 | 6.25 |
| b | 2.45 | 1 | 10.0 |
| c | 2.99 | 1 | 20.0 |
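
The `(column, aggregator)` tuples can also be written with `pd.NamedAgg`, which spells out the role of each element. A brief sketch on a hypothetical toy frame (`toy` is an assumption) comparing the two spellings:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'b'],
    'cost': [10.0, 4.0, 5.0],
    'tax_rate': [0, 20, 20],
})

# Plain tuples: (column to aggregate, aggregator)
with_tuples = toy.groupby('stock_code').agg(
    total_cost=('cost', 'sum'),
    mean_tax_rate=('tax_rate', 'mean'),
)

# The same request with the more explicit pd.NamedAgg
with_namedagg = toy.groupby('stock_code').agg(
    total_cost=pd.NamedAgg(column='cost', aggfunc='sum'),
    mean_tax_rate=pd.NamedAgg(column='tax_rate', aggfunc='mean'),
)

print(with_tuples)
print(with_namedagg)
```

Which form to use is a matter of taste; the tuples are shorter, while `pd.NamedAgg` is more self-documenting.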




**<u>Task - 2 mins</u>**

Interpret the output of the following line of code. In particular, why do only some of the columns in the `stock` `DataFrame` appear in the final aggregated `DataFrame`?

`groupby_stock_code.agg(['mean', 'count', 'max'])`

**Solution**

In [15]:
groupby_stock_code.agg(['mean', 'count', 'max'])

| | cost | cost | cost | tax_rate | tax_rate | tax_rate |
|---|---|---|---|---|---|---|
| **stock_code** | mean | count | max | mean | count | max |
| a | 5.66 | 3 | 10.99 | 6.25 | 4 | 20 |
| b | 2.45 | 1 | 2.45 | 10.0 | 2 | 20 |
| c | 2.99 | 1 | 2.99 | 20.0 | 2 | 20 |


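In the output above, `.agg()` has kept only the subset of columns to which **all** of the specified aggregator functions can be meaningfully applied. The `mean()` aggregator can be applied only to numeric columns, and so just `cost` and `tax_rate` appear. Note that this silent dropping of columns is the behaviour of older `pandas` releases: from version 2.0 onwards the same call raises a `TypeError` instead, and the numeric columns have to be selected explicitly before aggregating.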
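
One way to avoid depending on that column-dropping behaviour is to choose the numeric columns explicitly before aggregating. The sketch below uses a hypothetical toy frame (`toy` is an assumption) and shows two options: naming the columns, or picking them out with `select_dtypes()`.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'b'],
    'cost_class': ['1st', '2nd', '3rd'],  # text: a mean is not meaningful here
    'cost': [10.0, 2.0, 5.0],
    'tax_rate': [0, 20, 20],
})

# Option 1: name the columns to aggregate
named = toy.groupby('stock_code')[['cost', 'tax_rate']].agg(['mean', 'count', 'max'])

# Option 2: let pandas pick out the numeric columns by dtype
numeric_cols = list(toy.select_dtypes('number').columns)
picked = toy.groupby('stock_code')[numeric_cols].agg(['mean', 'count', 'max'])

print(named)
print(picked)
```

Either option makes the intent explicit, so the result does not change with the `pandas` version in use.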




# Grouping by multiple columns 



What happens if we `.groupby()` more than one column?



**<u>Task - 5 mins</u>**

Write code for the following:

1. Group the `stock` `DataFrame` by `cost_class` and `stock_code`, and then calculate the mean of all numerical columns.
2. Extract the `mean(cost)` for items in the 3rd `cost_class` with `stock_code` 'c'.
3. Interpret what you see if you also pass `dropna=False` into `.groupby()`.

**Hints**

* You can pass a `list` to `.groupby()`
* What appears as row `index`?


**Solution**

In [16]:
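# 1. group by variables and calculate the mean
# numeric_only=True restricts the mean to numeric columns
# (needed from pandas 2.0, which no longer drops other columns silently)
mean_by_class_code = stock.groupby(['cost_class', 'stock_code']).mean(numeric_only=True)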
mean_by_class_code

| cost_class | stock_code | cost | tax_rate |
|---|---|---|---|
| 1st | a | 6.995 | 2.5 |
| 2nd | a | NaN | 0.0 |
| 3rd | b | 2.45 | 0.0 |
| 3rd | c | 2.99 | 20.0 |
| 4th | a | 2.99 | 20.0 |
| 4th | b | NaN | 20.0 |


In [17]:
# have a look at the multiIndex 
mean_by_class_code.index

MultiIndex([('1st', 'a'),
            ('2nd', 'a'),
            ('3rd', 'b'),
            ('3rd', 'c'),
            ('4th', 'a'),
            ('4th', 'b')],
           names=['cost_class', 'stock_code'])

_The row `index` is now a `MultiIndex`, with one level for each grouping column: `cost_class` and `stock_code`._

In [18]:
# 2. Extract the mean(cost) for items in the 3rd cost_class with stock_code 'c'
mean_by_class_code.loc[('3rd', 'c'), 'cost']

2.99
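
A `MultiIndex` on the rows supports a few other selections besides a full key tuple. The following is a sketch on a hypothetical toy frame (`toy` and `means` are assumptions, not the lesson data).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'cost_class': ['1st', '1st', '3rd', '3rd', '3rd'],
    'stock_code': ['a', 'a', 'b', 'c', 'c'],
    'cost': [10.0, 4.0, 2.5, 3.0, 5.0],
})
means = toy.groupby(['cost_class', 'stock_code']).mean()

# A full key tuple plus a column label gives a single value
one_value = means.loc[('3rd', 'c'), 'cost']

# A partial key (outer level only) gives every row under that level
third_class = means.loc['3rd']

# .xs() selects on an inner level by name
code_c = means.xs('c', level='stock_code')

# .reset_index() turns both levels back into ordinary columns
flat = means.reset_index()

print(one_value)
print(third_class)
print(code_c)
print(flat)
```

If the hierarchical labels get in the way, `.reset_index()` is often the simplest way back to a flat table.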

In [19]:
# 3. Interpret what you see if you also pass dropna=False into .groupby()
# numeric_only=True again restricts the mean to numeric columns
stock.groupby(['cost_class', 'stock_code'], dropna=False).mean(numeric_only=True)

| cost_class | stock_code | cost | tax_rate |
|---|---|---|---|
| 1st | a | 6.995 | 2.5 |
| 2nd | a | NaN | 0.0 |
| 2nd | NaN | 5.99 | 20.0 |
| 3rd | b | 2.45 | 0.0 |
| 3rd | c | 2.99 | 20.0 |
| 4th | a | 2.99 | 20.0 |
| 4th | b | NaN | 20.0 |
| NaN | NaN | 5.99 | 20.0 |


_If we pass in `dropna=False`, rows with missing values in the grouping columns are kept, and `NaN` appears as a group key in its own right. With the default `dropna=True`, those rows are silently excluded from every group._
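
The effect of `dropna` is easiest to see on a very small example. This sketch uses a hypothetical toy frame (`toy` is an assumption) in which half of the group keys are missing.

```python
import pandas as pd

# Hypothetical toy data: two rows have no stock_code
toy = pd.DataFrame({
    'stock_code': ['a', None, 'a', None],
    'cost': [1.0, 2.0, 3.0, 4.0],
})

# Default behaviour: rows with a missing key belong to no group
dropped = toy.groupby('stock_code')['cost'].sum()

# dropna=False: the missing key forms a group of its own
kept = toy.groupby('stock_code', dropna=False)['cost'].sum()

# Total for the group whose key is missing
missing_total = kept[kept.index.isna()].iloc[0]

print(dropped)
print(kept)
print(missing_total)
```

Comparing the two results is a quick check of how many rows would otherwise be lost from a grouped summary.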

# Optional - custom aggregators

What if no pre-defined aggregator (like `sum()`, `mean()` etc) suits our needs? In that case, we will need to define a custom aggregator. We can do this either by defining a function in the usual way using `def`, or for one-time use, by creating a **`lambda`** (an **anonymous function**). 

Imagine we need to calculate the following:

***Get the sum of cost in each stock_code group, less a 2.00 restocking fee per item***

First let's define a function to calculate this for each group

In [20]:
# restocking_fee defaults to 2.00, i.e. if we call sum_less_restocking
# without specifying restocking_fee, a value of 2.00 will be assumed.
# rows.count() ignores missing values, so the fee is charged only
# for items with a known cost
def sum_less_restocking(rows, restocking_fee=2.00):
    return rows.sum() - (rows.count() * restocking_fee)

Now let's apply it (together with `sum` and `count`) to the `cost` column

In [21]:
groupby_stock_code.cost.agg(['sum', 'count', sum_less_restocking])

| stock_code | sum | count | sum_less_restocking |
|---|---|---|---|
| a | 16.98 | 3 | 10.98 |
| b | 2.45 | 1 | 0.45 |
| c | 2.99 | 1 | 0.99 |


Let's see a use of `sum_less_restocking()` with more user control

In [22]:
groupby_stock_code.agg(
    total_cost=('cost', 'sum'),
    no_of_items=('cost', 'count'),
    total_cost_less_restocking=('cost', sum_less_restocking)
)

| stock_code | total_cost | no_of_items | total_cost_less_restocking |
|---|---|---|---|
| a | 16.98 | 3 | 10.98 |
| b | 2.45 | 1 | 0.45 |
| c | 2.99 | 1 | 0.99 |


With experience, it may make sense to define a custom aggregator not as a separate function, but rather as a **`lambda`** (i.e. an anonymous function). This is true particularly if you anticipate no need to re-use an aggregator. Let's see how to write our custom aggregator as a lambda

In [23]:
groupby_stock_code.agg(
    total_cost_less_restocking = ('cost', lambda rows: rows.sum() - rows.count() * 2.00)
)

| stock_code | total_cost_less_restocking |
|---|---|
| a | 10.98 |
| b | 0.45 |
| c | 0.99 |


`Lambda`s are also useful to alter the arguments passed to a custom aggregator function. Earlier we made use of the fact that `sum_less_restocking()` defaulted the `restocking_fee` to 2.00, but what if we wish to use the function with a value of 1.00 instead?

In [24]:
groupby_stock_code.agg(
    total_cost_less_restocking = ('cost', lambda rows: sum_less_restocking(rows, restocking_fee=1.00))
)

| stock_code | total_cost_less_restocking |
|---|---|
| a | 13.98 |
| b | 1.45 |
| c | 1.99 |
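
A wrapping `lambda` is not the only way to change an argument of a custom aggregator. When aggregating a single column, extra keyword arguments given to `.agg()` are passed on to the function, and `functools.partial` can pre-fill an argument as well. The sketch below repeats the lesson's `sum_less_restocking()` on a hypothetical toy frame (`toy` is an assumption).

```python
from functools import partial

import pandas as pd

def sum_less_restocking(rows, restocking_fee=2.00):
    # Same definition as in the lesson
    return rows.sum() - (rows.count() * restocking_fee)

# Hypothetical toy data
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'b'],
    'cost': [10.0, 4.0, 5.0],
})
cost_by_code = toy.groupby('stock_code')['cost']

# Wrapping the call in a lambda, as in the lesson
with_lambda = cost_by_code.agg(lambda rows: sum_less_restocking(rows, restocking_fee=1.00))

# Passing the extra keyword argument straight through .agg()
with_kwargs = cost_by_code.agg(sum_less_restocking, restocking_fee=1.00)

# Pre-filling the argument with functools.partial
with_partial = cost_by_code.agg(partial(sum_less_restocking, restocking_fee=1.00))

print(with_lambda)
print(with_kwargs)
print(with_partial)
```

All three spellings should give the same per-group values; the keyword form only applies when a single function is passed to `.agg()`.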


# Optional - `pandas` equivalent of `SQL` `HAVING`

`SQL`'s `HAVING` clause lets you **filter groups based upon an aggregate function**. Let's see the equivalent in `pandas` for this problem:

***Group stock by stock_code, and then get the mean cost for any stock_code group having more than two items***

First, let's start by getting the `count` of rows in each `stock_code` group. Let's make use of `item_no` essentially as a primary key column on which to `count`. Given that it is currently the `index`, we first must turn it back into a regular column using `.reset_index()`

In [25]:
stock.reset_index()\
    .groupby('stock_code')\
    .item_no.count()

stock_code
a    4
b    2
c    2
Name: item_no, dtype: int64

Next we use the `.filter()` method available on `DataFrameGroupBy` objects. This method takes in a function that should apply a condition based on an aggregator to each group, returning `True` or `False`. If `True`, the rows from that group get selected.

The easiest way to write this function is usually as a `lambda` (i.e. an anonymous function). Think of this `lambda` as being fed each group of rows in turn, and being asked to decide `True` or `False` for each group

In [26]:
stock.reset_index()\
    .groupby('stock_code')\
    .filter(lambda group_rows: group_rows.item_no.count() > 2)

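| | item_no | cost_class | cost | stock_code | priority_code | tax_rate |
|---|---|---|---|---|---|---|
| 0 | 1 | 1st | 10.99 | a | `<NA>` | 0 |
| 1 | 2 | 2nd | NaN | a | `<NA>` | 0 |
| 4 | 5 | 4th | 2.99 | a | `<NA>` | 20 |
| 8 | 9 | 1st | 3.0 | a | a | 5 |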


So we see a regular `DataFrame` is returned, **but only the rows from the groups that 'passed' the `.filter()` are included**. Note above that the lambda variable `group_rows` has an arbitrary name: we could have called it anything (even `banana` if we like), but it's sensible to make the name fit the purpose.  
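
The same row selection can be sketched with `.transform()`, which broadcasts a per-group aggregate back onto every row so that it can be used as a boolean mask. This is an alternative shown on a hypothetical toy frame (`toy` is an assumption), not the approach the lesson uses.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'a', 'b', 'b', 'c'],
    'cost': [10.0, 2.0, 3.0, 5.0, 7.0, 1.0],
})

# Lesson-style: keep the rows of every group that passes the condition
via_filter = toy.groupby('stock_code').filter(
    lambda group_rows: group_rows['cost'].count() > 2
)

# Alternative: attach each row's group count, then mask the rows
group_counts = toy.groupby('stock_code')['cost'].transform('count')
via_mask = toy[group_counts > 2]

print(via_filter)
print(via_mask)
```

The mask version avoids calling a Python function once per group, which can matter when there are many groups.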

Now we have this filtered `DataFrame`, we carry on with the rest of the **split-apply-combine** logic as previously

In [27]:
stock.reset_index()\
    .groupby('stock_code')\
    .filter(lambda group_rows: group_rows.item_no.count() > 2)\
    .groupby('stock_code')\
    .agg(mean_cost=('cost', 'mean'))

| stock_code | mean_cost |
|---|---|
| a | 5.66 |


Only the `stock_code` 'a' group passed the filter, and so we get a `mean_cost` for just that group.
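
Closer in spirit to `SQL`, the aggregates can also be computed once for every group and the qualifying groups selected afterwards. The sketch below compares both routes on a hypothetical toy frame (`toy`, `n_items` and `summary` are assumptions); note that it counts rows with `'size'` rather than counting `item_no` as the lesson does.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'stock_code': ['a', 'a', 'a', 'b', 'b', 'c'],
    'cost': [10.0, 2.0, 3.0, 5.0, 7.0, 1.0],
})

# Route 1 (as in the lesson): filter rows by group, then aggregate
via_filter = (
    toy.groupby('stock_code')
       .filter(lambda group_rows: len(group_rows) > 2)
       .groupby('stock_code')
       .agg(mean_cost=('cost', 'mean'))
)

# Route 2: aggregate once, then keep only the qualifying groups
summary = toy.groupby('stock_code').agg(
    n_items=('cost', 'size'),
    mean_cost=('cost', 'mean'),
)
via_summary = summary.loc[summary['n_items'] > 2, ['mean_cost']]

print(via_filter)
print(via_summary)
```

Route 2 groups the data only once and keeps the group sizes available for inspection, at the cost of an extra helper column.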