# Chapter 10. Data Aggregation and Group Operations

## 10.1 GroupBy Mechanics

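Hadley Wickham, an author of many popular packages for the R programming language, coined the term *split-apply-combine* to describe group operations: the data is split into groups based on one or more keys, a function is applied to each group, and the results are combined into a single result object.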
![Split Apply Combine](images/split-apply-combine.png)
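
To make the three stages concrete, here is a minimal sketch that performs them by hand and compares the result with `groupby`. The `toy` frame and the names `splits`, `applied`, `manual`, and `builtin` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, separate from the chapter's running example.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1, 2, 3, 4]})

# Split: one sub-Series of values per distinct key.
splits = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece independently.
applied = {k: piece.sum() for k, piece in splits.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(applied, name='value').sort_index()
manual.index.name = 'key'

# groupby performs the same three stages in one call.
builtin = toy.groupby('key')['value'].sum()
```

Both `manual` and `builtin` hold one sum per key; the `groupby` version simply hides the bookkeeping.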

In [2]:
import pandas as pd
import numpy as np

In [7]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                  'key2' : ['one', 'two', 'one', 'two', 'one'],
                  'data1' : [1, 2, 3, 4, 5],
                  'data2' : [10, 20, 30, 40, 50]})

df

  key1 key2  data1  data2
0    a  one      1     10
1    a  two      2     20
2    b  one      3     30
3    b  two      4     40
4    a  one      5     50


In [8]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x116182340>

In [9]:
# the name of the index is 'key1'
grouped.mean()

key1
a    2.666667
b    3.500000
Name: data1, dtype: float64

In [10]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     3.0
      two     2.0
b     one     3.0
      two     4.0
Name: data1, dtype: float64

In [11]:
means.unstack()

key2  one  two
key1
a     3.0  2.0
b     3.0  4.0


In [12]:
# You can use arrays from outside to group data
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

California  2005    2.0
            2006    3.0
Ohio        2005    2.5
            2006    5.0
Name: data1, dtype: float64
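
External arrays are matched to the rows by position, so they should have the same length as the data being grouped. As a sketch of that equivalence, the same result can be obtained by first attaching the arrays as columns. The column names `state` and `year` and the variable names below are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# Group directly by the external arrays (aligned with df by position).
by_arrays = df['data1'].groupby([states, years]).mean()

# Attach the arrays as (hypothetically named) columns, then group by name.
with_cols = df.assign(state=states, year=years)
by_columns = with_cols.groupby(['state', 'year'])['data1'].mean()
```

The only difference is that `by_columns` has named index levels, while `by_arrays` leaves them unnamed.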

In [13]:
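# 'key2' is non-numeric (a "nuisance column"), so it is left out of the result.
# pandas 2.0+ no longer drops such columns silently: without numeric_only=True
# this call raises a TypeError.
df.groupby('key1').mean(numeric_only=True)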

         data1      data2
key1
a     2.666667  26.666667
b     3.500000  35.000000
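
Another way to keep a non-numeric column out of the aggregation is to select the columns of interest before calling `mean`. A minimal sketch, with `by_selection` and `by_flag` as hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})

# Select the numeric columns explicitly, then aggregate.
by_selection = df.groupby('key1')[['data1', 'data2']].mean()

# Alternative: let mean() skip every non-numeric column.
by_flag = df.groupby('key1').mean(numeric_only=True)
```

Explicit selection states exactly which columns are aggregated, which may be easier to read when a frame has many columns.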


In [14]:
df.groupby(['key1', 'key2']).mean()

           data1  data2
key1 key2
a    one     3.0   30.0
     two     2.0   20.0
b    one     3.0   30.0
     two     4.0   40.0


In [15]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
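
One thing to watch with `size`: by default, rows whose group key is missing are excluded from the result. The sketch below uses a hypothetical frame with one missing key to show this, and the `dropna=False` option (available in pandas 1.1 and later) to keep those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing group key.
frame = pd.DataFrame({'key': ['a', np.nan, 'a', 'b'],
                      'value': [1, 2, 3, 4]})

# Default behaviour: the row with the missing key is not counted.
default_sizes = frame.groupby('key').size()

# dropna=False keeps missing keys as their own group.
all_sizes = frame.groupby('key', dropna=False).size()
```

Here `default_sizes` accounts for three of the four rows, while `all_sizes` accounts for all four.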

### Iterating Over Groups

In [16]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2  data1  data2
0    a  one      1     10
1    a  two      2     20
4    a  one      5     50
b
  key1 key2  data1  data2
2    b  one      3     30
3    b  two      4     40


In [17]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2  data1  data2
0    a  one      1     10
4    a  one      5     50
('a', 'two')
  key1 key2  data1  data2
1    a  two      2     20
('b', 'one')
  key1 key2  data1  data2
2    b  one      3     30
('b', 'two')
  key1 key2  data1  data2
3    b  two      4     40


In [18]:
pieces = dict(list(df.groupby('key1')))
pieces

{'a':   key1 key2  data1  data2
 0    a  one      1     10
 1    a  two      2     20
 4    a  one      5     50,
 'b':   key1 key2  data1  data2
 2    b  one      3     30
 3    b  two      4     40}
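
Once the groups are in a dict, a single piece can be looked up by its key. As a sketch, this gives the same rows as `get_group` on the GroupBy object. The names `b_rows` and `same` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})

# Build the dict of pieces, keyed by group name.
pieces = {name: group for name, group in df.groupby('key1')}

# Look up one group from the dict.
b_rows = pieces['b']

# Fetch the same group directly, without materializing every piece.
same = df.groupby('key1').get_group('b')
```

The dict is handy when several groups are needed repeatedly; `get_group` avoids building all of them when only one is of interest.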