# GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations, and I think that’s a good description of the process. In the first stage, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed along a particular axis of the object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object usually depends on what’s being done to the data. The figure below shows a mockup of a simple group aggregation.

![Illustration of a group aggregation](../../Pictures/Illustration%20of%20a%20group%20aggregation.png)
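
To make the three stages concrete, here is a minimal sketch that performs them by hand on a tiny, made-up DataFrame and then compares the result with a single groupby call. The column names `key` and `value` and the numbers are illustrative assumptions, not data from this chapter.

```python
import pandas as pd

# Small fixed data set (values are made up for illustration).
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                   'value': [1.0, 2.0, 3.0, 4.0, 8.0]})

# Split: collect the values that belong to each key.
pieces = {}
for key, value in zip(df['key'], df['value']):
    pieces.setdefault(key, []).append(value)

# Apply: reduce each group to a single number (here, the mean).
applied = {key: sum(values) / len(values) for key, values in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied, name='value').rename_axis('key')

# groupby performs the same three stages in one call.
builtin = df.groupby('key')['value'].mean()

print(manual)
print(builtin)
```

Both Series hold one mean per distinct key, which is the split-apply-combine pattern in its simplest form.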

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [6]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2' : np.random.randn(5)})

In [7]:
df

  key1 key2     data1     data2
0    a  one  0.516541 -0.469395
1    a  two -1.823933  0.103379
2    b  one  0.978657 -0.265056
3    b  two  1.056487 -0.086620
4    a  one -1.269375  0.321822


Suppose you wanted to compute the mean of the data1 column using the group labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:

In [8]:
grouped = df['data1'].groupby(df['key1'])

In [9]:
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000015620EEA170>

This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, to compute group means we can call the GroupBy’s mean method:

In [10]:
grouped.mean()

key1
a   -0.858922
b    1.017572
Name: data1, dtype: float64
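
As a rough illustration of the point that the GroupBy object only holds grouping information until you ask for a result, the sketch below inspects such an object and reuses it for more than one aggregation. The data values are made up so that the results are easy to verify by hand.

```python
import pandas as pd

# Fixed, made-up values in place of the random data used in this section.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [0.5, -1.5, 1.0, 2.0, -2.0]})

grouped = df['data1'].groupby(df['key1'])

# The number of groups is available before any aggregation is requested.
n_groups = grouped.ngroups

# Iterating yields (group key, sub-Series) pairs.
members = {key: piece.tolist() for key, piece in grouped}

# The same object can feed several different aggregations.
means = grouped.mean()
maxima = grouped.max()

print(n_groups)
print(members)
print(means)
print(maxima)
```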

If instead we pass multiple arrays as a list, we get something different:

In [24]:
mean = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [25]:
mean

key1  key2
a     one    -0.376417
      two    -1.823933
b     one     0.978657
      two     1.056487
Name: data1, dtype: float64

In this case, we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

In [26]:
mean.unstack()

key2       one       two
key1
a    -0.376417 -1.823933
b     0.978657  1.056487
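
The hierarchical index can also be used directly for selection, without unstacking. The following sketch uses made-up data1 values (so the means are easy to check) and shows selection on the outer level and on a full key pair.

```python
import pandas as pd

# Same keys as in this section, with made-up data1 values.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# The index levels are named after the Series used as keys.
level_names = list(means.index.names)

# Selecting on the outer level returns a Series indexed by the inner level.
a_means = means.loc['a']

# A full (key1, key2) pair selects a single value.
a_one = means.loc[('a', 'one')]

print(level_names)
print(a_means)
print(a_one)
```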


In these examples, the group keys are all Series, though they could be any arrays of the right length:

In [27]:
states = np.array(['s1', 's2', 's2', 's1', 's1'])

In [28]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [29]:
df['data1'].groupby([states, years]).mean()

s1  2005    0.786514
    2006   -1.269375
s2  2005   -1.823933
    2006    0.978657
Name: data1, dtype: float64
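
Because the data in this section is random, the result above cannot be reproduced exactly by rerunning the cells. As a check, the sketch below rebuilds data1 from the values printed in the DataFrame earlier and groups it with the same two arrays. It also shows that array keys are matched to rows by position and that the resulting index levels have no names.

```python
import numpy as np
import pandas as pd

# data1 values copied from the DataFrame printed earlier in this section.
data1 = pd.Series([0.516541, -1.823933, 0.978657, 1.056487, -1.269375],
                  name='data1')

states = np.array(['s1', 's2', 's2', 's1', 's1'])
years = np.array([2005, 2005, 2006, 2005, 2006])

result = data1.groupby([states, years]).mean()

# Keys are matched to rows by position, so rows 0 and 3 form the
# ('s1', 2005) group.
s1_2005 = result.loc[('s1', 2005)]

# Plain arrays carry no names, so the index levels are unnamed.
names = list(result.index.names)

print(result)
print(s1_2005)
print(names)
```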

Frequently the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:

In [30]:
df.groupby('key1').mean()

         data1     data2
key1
a    -0.858922 -0.014731
b     1.017572 -0.175838


In [33]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one  -0.376417 -0.073786
     two  -1.823933  0.103379
b    one   0.978657 -0.265056
     two   1.056487 -0.086620


You may have noticed that in the first case, df.groupby('key1').mean(), there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a *nuisance column* and is excluded from the result in the pandas version used here. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset, as you’ll see soon.
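
How non-numeric columns are handled depends on the pandas version. To the best of my knowledge, releases from 2.0 onward no longer drop them silently and instead raise a TypeError when asked for the mean of a string column. The sketch below, using made-up values, shows two ways of stating the intent explicitly that should behave the same across versions: passing `numeric_only=True`, or selecting the columns to aggregate.

```python
import pandas as pd

# Same keys as in this section, with made-up numeric values.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# Option 1: ask explicitly for numeric columns only.
numeric_means = df.groupby('key1').mean(numeric_only=True)

# Option 2: select the columns to aggregate before calling mean.
selected_means = df.groupby('key1')[['data1', 'data2']].mean()

print(numeric_means)
print(selected_means)
```

Either form leaves key2 out of the result without relying on it being dropped automatically.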

Regardless of the objective in using groupby, a generally useful GroupBy method is size, which returns a Series containing group sizes:

In [36]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
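
One detail worth knowing is that size counts every row in a group, whereas the related count method counts only non-missing values. The sketch below uses a small made-up frame with one missing value to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in group 'a'.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, np.nan, 3.0, 4.0, 5.0]})

grouped = df.groupby('key1')

# size: number of rows per group, missing values included.
sizes = grouped.size()

# count: number of non-missing data1 values per group.
counts = grouped['data1'].count()

print(sizes)
print(counts)
```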