# GroupBy-Split-Apply-Combine Chain

## References

* [How to Use the Split-Apply-Combine Strategy in pandas Groupby](https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e) by Filip Ciesielski
* [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)
* [`agg`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)
* [`DataFrameGroupBy.agg`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html)

## Import Libraries

In [1]:
import pandas as pd
import seaborn as sns

## Import Data

In [2]:
data = sns.load_dataset('iris')
df = data.head(5).copy()
df = pd.concat([df, data.iloc[50:55]])
df = pd.concat([df, data.iloc[100:105]])
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica


## The `groupby` Method

Calling the `groupby` method on a data frame returns a `DataFrameGroupBy` object. This object only records how the rows are split into groups; no computation is performed on the table until an aggregation (or another method) is called on it.

In [3]:
df.groupby('species')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000011D1B4350B8>
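To make the "split" step visible, the sketch below inspects a `DataFrameGroupBy` object before any aggregation runs. The six-row frame is a hand-typed stand-in for the notebook's `df` (two rows per species), not the full sample.

```python
import pandas as pd

# Hand-typed stand-in for the notebook's `df` (assumption: two rows per species).
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

grouped = df.groupby('species')

# Nothing has been aggregated yet; the object only knows how to split the rows.
print(grouped.ngroups)   # number of distinct groups
print(grouped.size())    # number of rows in each group

# Iterating yields (key, sub-frame) pairs, one pair per group.
for name, group in grouped:
    print(name, group.shape)

# A single group can be pulled out as an ordinary DataFrame.
setosa = grouped.get_group('setosa')
```

Each `group` in the loop is a plain `DataFrame`, which is what a custom function receives later in the chain.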

### Aggregation Methods

Aggregation methods (e.g. `sum`, `mean`, `min`, and `max`) perform operations on `DataFrameGroupBy` objects. When an aggregation method is called on a `DataFrameGroupBy` object, a new data frame is returned.

In [4]:
df.groupby('species').mean()

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,4.86,3.28,1.4,0.2
versicolor,6.46,2.92,4.54,1.44
virginica,6.4,2.98,5.68,2.1
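As a rough illustration of what an aggregation method does behind the scenes, the sketch below performs the split, apply, and combine steps by hand and compares the result with `groupby(...).mean()`. It describes the idea only, not how pandas implements it internally, and the small frame is a hand-typed stand-in for `df`.

```python
import pandas as pd

# Hand-typed stand-in for the notebook's `df` (assumption: two rows per species).
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})
value_cols = ['sepal_length', 'sepal_width']

# Split: one sub-frame per species.
pieces = {name: group for name, group in df.groupby('species')}

# Apply: reduce each sub-frame's numeric columns to their means.
means = {name: group[value_cols].mean() for name, group in pieces.items()}

# Combine: stack the per-group Series into one frame, one row per group.
manual = pd.DataFrame(means).T
manual.index.name = 'species'

# The built-in aggregation produces the same table in one call.
builtin = df.groupby('species').mean()
print(manual)
print(builtin)
```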


It is also possible to apply an aggregation method to a single column. Selecting the column with double brackets (`[['sepal_width']]`) keeps the result as a data frame -

In [5]:
df.groupby('species')[['sepal_width']].mean()

Unnamed: 0_level_0,sepal_width
species,Unnamed: 1_level_1
setosa,3.28
versicolor,2.92
virginica,2.98
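The bracket style used for the column selection determines the type of the result. The sketch below contrasts the two forms on a small hand-typed stand-in for `df`.

```python
import pandas as pd

# Hand-typed stand-in for the notebook's `df` (assumption: two rows per species).
df = pd.DataFrame({
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

# Single brackets select one column, so the aggregation returns a Series.
as_series = df.groupby('species')['sepal_width'].mean()

# Double brackets select a list of columns, so the result stays a DataFrame.
as_frame = df.groupby('species')[['sepal_width']].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```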


#### `MultiIndex` Objects

Passing a list of column names into the `groupby` method groups the rows by each unique combination of values in those columns. The aggregation method is then applied to all of the remaining applicable columns (i.e. the columns that are not mentioned in the list - `['sepal_length', 'sepal_width', 'petal_length']`).

In [6]:
df_multi_index = df.groupby(['species', 'petal_width']).mean()
df_multi_index

Unnamed: 0_level_0,Unnamed: 1_level_0,sepal_length,sepal_width,petal_length
species,petal_width,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,0.2,4.86,3.28,1.4
versicolor,1.3,5.5,2.3,4.0
versicolor,1.4,7.0,3.2,4.7
versicolor,1.5,6.6,3.033333,4.666667
virginica,1.8,6.3,2.9,5.6
virginica,1.9,5.8,2.7,5.1
virginica,2.1,7.1,3.0,5.9
virginica,2.2,6.5,3.0,5.8
virginica,2.5,6.3,3.3,6.0


The returned data frame contains a `MultiIndex` object as the index instead of an `Index` object - 

In [7]:
type(df_multi_index.index)

pandas.core.indexes.multi.MultiIndex

To select a subset from the `MultiIndex` data frame, call the `xs` method on the data frame -

In [8]:
df_multi_index.xs('versicolor', level = 'species')

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length
petal_width,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.3,5.5,2.3,4.0
1.4,7.0,3.2,4.7
1.5,6.6,3.033333,4.666667
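For the outermost index level, `.loc` offers an alternative to `xs`. The sketch below rebuilds a small `MultiIndex` frame from hand-typed rows (a stand-in for `df_multi_index`) and compares the two selections.

```python
import pandas as pd

# Hand-typed stand-in for part of the notebook's `df`.
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.9],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 1.5],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor'],
})
df_multi_index = df.groupby(['species', 'petal_width']).mean()

# `xs` selects by value on a named level and drops that level from the result.
via_xs = df_multi_index.xs('versicolor', level='species')

# For the first level, `.loc` with a single key gives the same subset.
via_loc = df_multi_index.loc['versicolor']

# A tuple addresses both levels at once and returns a single row as a Series.
one_row = df_multi_index.loc[('versicolor', 1.5)]
print(via_xs)
print(one_row)
```

`xs` remains the more direct choice when the level to select on is not the first one.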


#### `reset_index`

Calling the `reset_index` method on a `MultiIndex` data frame flattens the index into columns -

In [9]:
df_multi_index.reset_index()

Unnamed: 0,species,petal_width,sepal_length,sepal_width,petal_length
0,setosa,0.2,4.86,3.28,1.4
1,versicolor,1.3,5.5,2.3,4.0
2,versicolor,1.4,7.0,3.2,4.7
3,versicolor,1.5,6.6,3.033333,4.666667
4,virginica,1.8,6.3,2.9,5.6
5,virginica,1.9,5.8,2.7,5.1
6,virginica,2.1,7.1,3.0,5.9
7,virginica,2.2,6.5,3.0,5.8
8,virginica,2.5,6.3,3.3,6.0
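If the flat layout is what you want from the start, `groupby` also accepts `as_index=False`, which keeps the grouping keys as ordinary columns. The sketch below compares it with `reset_index` on a small hand-typed stand-in for `df`.

```python
import pandas as pd

# Hand-typed stand-in for part of the notebook's `df`.
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.9],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 1.5],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor'],
})
keys = ['species', 'petal_width']

# Group, aggregate, then flatten the MultiIndex afterwards.
via_reset = df.groupby(keys).mean().reset_index()

# Ask groupby to keep the keys as columns in the first place.
flat = df.groupby(keys, as_index=False).mean()

print(flat)
```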


## The `apply` Method

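The `apply` method runs a custom function on each group of a `DataFrameGroupBy` object and combines the results into a single object. Writing such a function requires an understanding of the split-apply-combine chain: the data frame is split into groups, the function is applied to each group, and the per-group results are combined.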

Here's our original data frame -

In [10]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica


To use the `apply` method, you must first create a `DataFrameGroupBy` object - 

In [11]:
df_species = df.groupby('species')
df_species

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000011D1EC28F60>

You must also write a custom aggregation function. `apply` calls it once per group and passes in that group's rows as a regular `DataFrame` (not the `DataFrameGroupBy` object itself) -

In [12]:
def rate(group):
    '''
    Custom aggregation function that calculates the rate of a group: the sum of the
    group's values divided by the number of rows in the full data frame.
    
    Args:
        group (DataFrame or Series): The rows of one group. `apply` passes a DataFrame;
            `agg` with a per-column dictionary passes a Series.
    Returns:
        (Series or scalar): The sum(s) of the group divided by `len(df)`.
    '''
    
    return group.sum() / len(df) # note that 'df' is a data frame that already exists

To call the custom aggregation function (e.g. `rate`), pass it into the `apply` method -

In [13]:
df_species.apply(rate)

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,1.62,1.093333,0.466667,0.066667
versicolor,2.153333,0.973333,1.513333,0.48
virginica,2.133333,0.993333,1.893333,0.7
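Depending on the pandas version, the grouping column may be included in the frame that `apply` hands to the function, and a function that relies on a global `df` is harder to reuse. The sketch below is one possible variant, not the notebook's original function: `rate_of_total` and its `total` parameter are hypothetical names, and the frame is a hand-typed stand-in for `df`.

```python
import pandas as pd

# Hand-typed stand-in for the notebook's `df` (assumption: two rows per species).
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})


def rate_of_total(group, total):
    """Sum each column of one group and divide by a caller-supplied row count.

    Hypothetical variant of `rate` that takes the denominator as an argument
    instead of reading a global data frame.
    """
    return group.sum() / total


# Selecting the numeric columns first keeps the string key out of the sums.
numeric_cols = ['sepal_length', 'petal_width']

# Extra keyword arguments given to `apply` are forwarded to the function.
result = df.groupby('species')[numeric_cols].apply(rate_of_total, total=len(df))
print(result)
```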


## The `agg` Method

The `agg` method allows us to "aggregate using one or more operations over a specified axis." Calling `agg` on a `DataFrameGroupBy` object with a dictionary that maps column names to aggregation functions applies a different aggregation to each column -

In [14]:
df.groupby('species').agg({'sepal_length': 'sum', 
                           'sepal_width': 'min', 
                           'petal_length': rate,
                           'petal_width': [rate, 'sum']}) # multiple aggregation functions can be called on one column

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width,petal_width
Unnamed: 0_level_1,sum,min,rate,rate,sum
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
setosa,24.3,3.0,0.466667,0.066667,1.0
versicolor,32.3,2.3,1.513333,0.48,7.2
virginica,32.0,2.7,1.893333,0.7,10.5
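The dictionary form above produces two-level column labels. When flat column names are preferable, named aggregation (available since pandas 0.25) is one alternative. The sketch below uses a hand-typed stand-in for `df`, and the output column names are chosen purely for illustration.

```python
import pandas as pd

# Hand-typed stand-in for the notebook's `df` (assumption: two rows per species).
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

# Each keyword is an output column name; each value is (input column, function).
summary = df.groupby('species').agg(
    sepal_length_sum=('sepal_length', 'sum'),
    sepal_width_min=('sepal_width', 'min'),
    petal_width_sum=('petal_width', 'sum'),
)
print(summary)
```

The result has a single level of column labels, so no flattening step is needed afterwards.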
