# Data Aggregation

By aggregation, I am generally referring to any data transformation that produces scalar values from arrays. In the examples above I have used several of them, such as *mean*, *count*, *min*, and *sum*. You may wonder what is going on when you invoke *mean()* on a GroupBy object. Many common aggregations, such as those found in the table below, have optimized implementations that compute the statistics on the dataset in place. However, you are not limited to this set of methods. You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object. For example, as you may recall, *quantile* computes sample quantiles of a Series or a DataFrame's columns:

In [63]:
import numpy as np
import pandas as pd
from statistics import median
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

In [89]:
df = DataFrame({'data1' : np.random.randn(5),
                'data2' : np.arange(5),
                'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one']})

In [90]:
df

      data1  data2 key1 key2
0  0.359679      0    a  one
1  0.725165      1    a  two
2 -0.178343      2    b  one
3 -1.810031      3    b  two
4 -0.414026      4    a  one


In [91]:
grouped = df.groupby('key1')

In [92]:
grouped['data1'].quantile(q = 0.9)

key1
a    0.652068
b   -0.341512
Name: data1, dtype: float64

While *quantile* may not have a dedicated optimized implementation on GroupBy in every pandas version, it is a Series method and thus available for use. Conceptually, GroupBy slices up the Series, calls *piece.quantile(0.9)* for each piece, then assembles those results together into the result object.
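The slice, call, and assemble steps just described can be reproduced by hand. The following is only a sketch of the idea, not how pandas is implemented internally; the names `pieces`, `by_hand`, and `built_in` are mine, and the data values are copied from the df printed above so that the result is reproducible.

```python
import pandas as pd

# Same values as the df printed above, fixed so the result is reproducible.
df = pd.DataFrame({'data1': [0.359679, 0.725165, -0.178343, -1.810031, -0.414026],
                   'data2': [0, 1, 2, 3, 4],
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

grouped = df.groupby('key1')

# Split: iterating over the grouped column yields (group key, piece) pairs.
# Apply: call quantile(0.9) on each piece, which is an ordinary Series.
pieces = {key: piece.quantile(0.9) for key, piece in grouped['data1']}

# Combine: assemble the per-group scalars into a single Series.
by_hand = pd.Series(pieces, name='data1')
by_hand.index.name = 'key1'

# The one-line GroupBy call, for comparison.
built_in = grouped['data1'].quantile(0.9)

print(by_hand)
print((by_hand - built_in).abs().max() < 1e-12)
```

Both routes should agree with the values 0.652068 and -0.341512 shown in the output above.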

To use your own aggregation functions, pass any function that aggregates an array to the *aggregate* or *agg* method:

In [102]:
def peak(arr):
    return arr.max() - arr.min()

In [103]:
# Select the numeric columns first: peak cannot subtract the strings in key2
grouped[['data1', 'data2']].agg(peak)


         data1  data2
key1
a     1.139191      4
b     1.631687      1
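As a sanity check on the custom function, the same numbers can be obtained from the built-in *max* and *min* aggregations. This is a small sketch under my own naming (`numeric`, `custom`, `same_via_alias`, `from_builtins`), reusing the values of the df shown earlier; it also illustrates that *agg* and *aggregate* give the same result.

```python
import pandas as pd

# Same values as the df printed earlier in this section.
df = pd.DataFrame({'data1': [0.359679, 0.725165, -0.178343, -1.810031, -0.414026],
                   'data2': [0, 1, 2, 3, 4],
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

def peak(arr):
    # Range of the values in one group: largest minus smallest.
    return arr.max() - arr.min()

# Restrict to the numeric columns so peak is never applied to strings.
numeric = df.groupby('key1')[['data1', 'data2']]

custom = numeric.agg(peak)               # custom function, called per group and column
same_via_alias = numeric.aggregate(peak) # aggregate is the long name for agg
from_builtins = numeric.max() - numeric.min()

print(custom)
print((custom - from_builtins).abs().max().max() < 1e-12)
```

The custom function is convenient, but when an aggregation can be expressed with the built-in methods, as here, the built-ins are generally the faster route.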


You’ll notice that some methods like *describe* also work, even though they are not aggregations, strictly speaking:

In [104]:
grouped.describe()

         data1
         count      mean       std       min       25%       50%       75%       max
key1
a          3.0  0.223606  0.581658 -0.414026 -0.027173  0.359679  0.542422  0.725165
b          2.0 -0.994187  1.153777 -1.810031 -1.402109 -0.994187 -0.586265 -0.178343

         data2
         count      mean       std  min   25%  50%   75%  max
key1
a          3.0  1.666667  2.081666  0.0   0.5  1.0   2.5  4.0
b          2.0       2.5  0.707107  2.0  2.25  2.5  2.75  3.0
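To see why *describe* is not an aggregation in the strict sense, it may help to compare the shape of its result with that of a true aggregation such as *mean*. A brief sketch follows, again using the values of the df shown earlier; the variable names `one_value` and `summary` are mine.

```python
import pandas as pd

# Same values as the df printed earlier in this section.
df = pd.DataFrame({'data1': [0.359679, 0.725165, -0.178343, -1.810031, -0.414026],
                   'data2': [0, 1, 2, 3, 4],
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

grouped = df.groupby('key1')

# A true aggregation: exactly one scalar per group.
one_value = grouped['data1'].mean()
print(one_value.shape)        # (2,)

# describe: several statistics per group, returned as columns.
summary = grouped['data1'].describe()
print(list(summary.columns))  # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.shape)          # (2, 8)
```

Each group still ends up as a single row, which is why *describe* fits naturally alongside the aggregations even though it returns eight values per group rather than one.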


![Optimized GroupBy methods](../../Pictures/Optimized%20groupedby%20methods.png)
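As a quick illustration of calling the built-in aggregations directly, here is a sketch that uses the ones already named in this section (*count*, *sum*, *mean*, *min*) plus *max*, applied to the integer column so the results are easy to check by hand. The data values come from the df shown earlier; the names `g` and `stats` are mine.

```python
import pandas as pd

# Same values as the df printed earlier in this section.
df = pd.DataFrame({'data1': [0.359679, 0.725165, -0.178343, -1.810031, -0.414026],
                   'data2': [0, 1, 2, 3, 4],
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

g = df.groupby('key1')['data2']

# Each call returns a Series indexed by key1; collect them side by side.
stats = pd.DataFrame({'count': g.count(),
                      'sum': g.sum(),
                      'mean': g.mean(),
                      'min': g.min(),
                      'max': g.max()})
print(stats)
```

Group a holds the values 0, 1, and 4, and group b holds 2 and 3, so both sums come out to 5 while the means differ.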