In [8]:
import numpy as np
import pandas as pd

## Simple `groupby` example

In [1]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

|     | key | data |
| :-- | :-- | :--- |
| 0   | A   | 0    |
| 1   | B   | 1    |
| 2   | C   | 2    |
| 3   | A   | 3    |
| 4   | B   | 4    |
| 5   | C   | 5    |


A simple `groupby` runs in three steps:

1. Split the DataFrame based on the specified key\
    *Here we split on the `key` column. Think of the original DataFrame as being split into a collection of smaller DataFrames, one each for the rows with key A, B, and C.*

2. Apply the desired aggregation operation\
    *Here we simply apply `sum()`.*
    
3. Combine the per-group results into a single DataFrame
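
The three steps can be spelled out by hand to make the mechanics visible. The following is only a sketch for illustration; the names `pieces`, `sums`, and `manual` are introduced here and are not part of pandas, and `groupby` itself does not work this way internally.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# 1. Split: one small DataFrame per distinct value of `key`
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# 2. Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# 3. Combine: collect the per-group results into one Series
manual = pd.Series(sums, name='data').rename_axis('key').sort_index()

print(manual)
print(df.groupby('key')['data'].sum())  # same values in one call
```
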

In [3]:
df.groupby('key') # split based on the `key` column

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x6291837f0>

In [4]:
df.groupby('key').sum() # apply sum() and combine the results together

| key | data |
| :-- | :--- |
| A   | 3    |
| B   | 5    |
| C   | 7    |


## `DataFrameGroupBy` object

In [5]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x629748780>

`groupby()` returns a `DataFrameGroupBy` object. We can think of it as a collection of small `DataFrame`s, one per group.

A `DataFrameGroupBy` supports:

- Column indexing (as in `DataFrame`)

In [11]:
df1 = df.copy()
df1['data2'] = np.random.randint(10, size=6)
df1

|     | key | data | data2 |
| :-- | :-- | :--- | :---- |
| 0   | A   | 0    | 1     |
| 1   | B   | 1    | 1     |
| 2   | C   | 2    | 5     |
| 3   | A   | 3    | 6     |
| 4   | B   | 4    | 8     |
| 5   | C   | 5    | 4     |


In [12]:
df1.groupby('key')['data2'].sum()

key
A    7
B    9
C    9
Name: data2, dtype: int64

- Iteration over groups

In [16]:
for key, group in df.groupby('key'):
    print(f'key: {key} \n {group} \n')

key: A 
   key  data
0   A     0
3   A     3 

key: B 
   key  data
1   B     1
4   B     4 

key: C 
   key  data
2   C     2
5   C     5 
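
Besides looping, individual groups can be inspected directly. The sketch below uses the `groups` attribute and the `get_group()` and `size()` methods of the GroupBy object; the variable names `grouped` and `group_a` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})
grouped = df.groupby('key')

# Mapping from each group key to the row labels in that group
print(grouped.groups)

# Pull out a single group as an ordinary DataFrame
group_a = grouped.get_group('A')
print(group_a)

# Number of rows in each group
print(grouped.size())
```
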



- Dispatch methods

    Many `DataFrame` and `Series` methods can be called directly on a `GroupBy` object. They are applied to each individual group, and the per-group results are then combined and returned.

In [20]:
df1.groupby('key')['data2'].describe()

| key | count | mean | std      | min | 25%  | 50% | 75%  | max |
| :-- | :---- | :--- | :------- | :-- | :--- | :-- | :--- | :-- |
| A   | 2.0   | 3.5  | 3.535534 | 1.0 | 2.25 | 3.5 | 4.75 | 6.0 |
| B   | 2.0   | 4.5  | 4.949747 | 1.0 | 2.75 | 4.5 | 6.25 | 8.0 |
| C   | 2.0   | 4.5  | 0.707107 | 4.0 | 4.25 | 4.5 | 4.75 | 5.0 |


## Flexible control

`DataFrameGroupBy` objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.

### Aggregation

`aggregate()` (alias `agg()`) takes a string, a function, a list of these, or a dict mapping column names to them, and computes all the requested aggregates **at once**.

Pandas built-in aggregation:

| Aggregation          | Description                     |
| :------------------- | :------------------------------ |
| `count()`            | Total number of items           |
| `first()`, `last()`  | First and last item             |
| `mean()`, `median()` | Mean and median                 |
| `min()`, `max()`     | Minimum and maximum             |
| `std()`, `var()`     | Standard deviation and variance |
| `mad()`              | Mean absolute deviation (removed in pandas 2.0) |
| `prod()`             | Product of all items            |
| `sum()`              | Sum of all items                |

In [25]:
df1

|     | key | data | data2 |
| :-- | :-- | :--- | :---- |
| 0   | A   | 0    | 1     |
| 1   | B   | 1    | 1     |
| 2   | C   | 2    | 5     |
| 3   | A   | 3    | 6     |
| 4   | B   | 4    | 8     |
| 5   | C   | 5    | 4     |


In [21]:
df1.groupby('key').aggregate(['min', max, np.median])

| key | data / min | data / max | data / median | data2 / min | data2 / max | data2 / median |
| :-- | :--------- | :--------- | :------------ | :---------- | :---------- | :------------- |
| A   | 0          | 3          | 1.5           | 1           | 6           | 3.5            |
| B   | 1          | 4          | 2.5           | 1           | 8           | 4.5            |
| C   | 2          | 5          | 3.5           | 4           | 5           | 4.5            |


In [26]:
df1.groupby('key').aggregate({'data':['min', max], 'data2': [np.min, np.median, np.mean]})

| key | data / min | data / max | data2 / amin | data2 / median | data2 / mean |
| :-- | :--------- | :--------- | :----------- | :------------- | :----------- |
| A   | 0          | 3          | 1            | 3.5            | 3.5          |
| B   | 1          | 4          | 1            | 4.5            | 4.5          |
| C   | 2          | 5          | 4            | 4.5            | 4.5          |
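
The list and dict forms produce two-level column labels. When flat, self-chosen column names are preferable, pandas also accepts keyword arguments of the form `new_name=(column, aggfunc)` ("named aggregation"). A minimal sketch, with `df1` rebuilt from the values shown above and output column names that are arbitrary choices:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6),
                    'data2': [1, 1, 5, 6, 8, 4]})

# Each keyword names an output column; the tuple is (input column, aggregation)
summary = df1.groupby('key').agg(
    data_min=('data', 'min'),
    data_max=('data', 'max'),
    data2_mean=('data2', 'mean'),
)
print(summary)
```

Passing aggregations as strings such as `'min'` rather than `min` or `np.min` also keeps the resulting labels predictable (compare the `amin` label above).
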


### Filtering

Syntax: `df.groupby(key).filter(filter_criteria)`

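The filter function receives each group and returns a single Boolean; all rows of a group for which it returns `False` are dropped. For example, to keep only the groups in which the mean of the `data2` column is larger than some critical value: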

In [29]:
def filter_func(x):
    return x['data2'].mean() > 4


df1.groupby('key').filter(filter_func)

|     | key | data | data2 |
| :-- | :-- | :--- | :---- |
| 1   | B   | 1    | 1     |
| 2   | C   | 2    | 5     |
| 4   | B   | 4    | 8     |
| 5   | C   | 5    | 4     |
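
Group A is dropped because its `data2` mean is 3.5. The same selection can be written as a Boolean mask built with `transform`, which may be worth knowing for large frames. This is a sketch of one alternative, not a statement about which is faster in any given case; `kept`, `mask`, and `same` are names chosen here.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6),
                    'data2': [1, 1, 5, 6, 8, 4]})

# Group-wise filter, as in the cell above
kept = df1.groupby('key').filter(lambda g: g['data2'].mean() > 4)

# Equivalent mask: broadcast each group's mean back onto its rows
mask = df1.groupby('key')['data2'].transform('mean') > 4
same = df1[mask]

print(kept)
print(same)
```
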


### Transformation

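Whereas aggregation returns one row per group, transformation returns an object with the same index as the input, so the result can be recombined with the original data. A common example is centering the data by subtracting the group-wise mean.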

Syntax: `df.groupby(key).transform(transform_func)`

In [32]:
df

|     | key | data |
| :-- | :-- | :--- |
| 0   | A   | 0    |
| 1   | B   | 1    |
| 2   | C   | 2    |
| 3   | A   | 3    |
| 4   | B   | 4    |
| 5   | C   | 5    |


In [33]:
df.groupby('key').mean()

| key | data |
| :-- | :--- |
| A   | 1.5  |
| B   | 2.5  |
| C   | 3.5  |


In [31]:
df.groupby('key').transform(lambda x: x - x.mean())

|     | data |
| :-- | :--- |
| 0   | -1.5 |
| 1   | -1.5 |
| 2   | -1.5 |
| 3   | 1.5  |
| 4   | 1.5  |
| 5   | 1.5  |
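
Because the transformed result keeps the original index, it can be attached to the source frame as new columns. A minimal sketch, assuming the same `df` as above; the column names `group_mean` and `centered` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# One value per original row: the mean of that row's group
group_mean = df.groupby('key')['data'].transform('mean')

# Attach the group mean and the centered values next to the raw data
df_centered = df.assign(group_mean=group_mean,
                        centered=df['data'] - group_mean)
print(df_centered)
```
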


### Apply

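`apply()` calls an arbitrary function on each group. The function takes a `DataFrame` (one group) and returns a pandas object (a `DataFrame` or `Series`) or a scalar; the combine step adapts to the type that is returned.

For example, the function below divides `data` by the row-wise sum of `data` and `data2`: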

In [38]:
def norm(x):
    x = x.copy()  # avoid mutating the group that `apply` passes in
    x['data'] = x['data'] / (x['data'] + x['data2'])
    return x

In [39]:
df1

|     | key | data | data2 |
| :-- | :-- | :--- | :---- |
| 0   | A   | 0    | 1     |
| 1   | B   | 1    | 1     |
| 2   | C   | 2    | 5     |
| 3   | A   | 3    | 6     |
| 4   | B   | 4    | 8     |
| 5   | C   | 5    | 4     |


In [48]:
df1.groupby('key').apply(norm)

|     | key | data     | data2 |
| :-- | :-- | :------- | :---- |
| 0   | A   | 0.0      | 1     |
| 1   | B   | 0.5      | 1     |
| 2   | C   | 0.285714 | 5     |
| 3   | A   | 0.333333 | 6     |
| 4   | B   | 0.333333 | 8     |
| 5   | C   | 0.555556 | 4     |


In [52]:
df1.groupby('key').sum()

| key | data | data2 |
| :-- | :--- | :---- |
| A   | 3    | 7     |
| B   | 5    | 9     |
| C   | 7    | 9     |
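
Two further patterns are sketched below under the same `df1` values: an `apply` whose function returns a scalar, giving one value per group, and a normalization that actually depends on the group, dividing `data` by the group total of `data2`. The helper `spread` and the names `ranges` and `share` are hypothetical, introduced only for this illustration. Selecting a column before `apply` keeps the grouping column out of the function's input.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6),
                    'data2': [1, 1, 5, 6, 8, 4]})

def spread(s):
    """Range of one group's values (max minus min)."""
    return s.max() - s.min()

# Scalar per group: the result is indexed by the group key
ranges = df1.groupby('key')['data2'].apply(spread)
print(ranges)

# Group-dependent normalization: divide by each group's data2 total
share = df1['data'] / df1.groupby('key')['data2'].transform('sum')
print(share)
```
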


## Specify the split key

- List, array, `Series` or index as grouping keys

- A dictionary or series mapping index to group

In [41]:
df2 = df.set_index('key')
df2

| key | data |
| :-- | :--- |
| A   | 0    |
| B   | 1    |
| C   | 2    |
| A   | 3    |
| B   | 4    |
| C   | 5    |


In [42]:
mapping = {
    'A': 'foo',
    'B': 'bar',
    'C': 'foo'
}

df2.groupby(mapping).sum()

|     | data |
| :-- | :--- |
| bar | 5    |
| foo | 10   |


- Any Python function that takes an index value and returns the group label

In [43]:
df2.groupby(str.upper).sum()

|     | data |
| :-- | :--- |
| A   | 3    |
| B   | 5    |
| C   | 7    |


- A list of valid keys\
Any of the preceding key choices can be combined in a list to group on a multi-index.

In [46]:
df2.groupby([str.upper, mapping]).sum()

|     |     | data |
| :-- | :-- | :--- |
| A   | foo | 3    |
| B   | bar | 5    |
| C   | foo | 7    |
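
The first option in the list above, a list, array, or `Series` used directly as the key, has no example of its own. A minimal sketch follows; the label values in `labels` are arbitrary and chosen only for illustration. The key must have one entry per row of the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# A plain list: one group label per row
labels = [0, 1, 0, 1, 2, 0]
by_list = df.groupby(labels)['data'].sum()
print(by_list)    # label 0 collects rows 0, 2 and 5

# A Series derived from the data: group rows by parity of `data`
by_series = df.groupby(df['data'] % 2)['data'].sum()
print(by_series)  # 0 = even values, 1 = odd values
```
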
