# Chapter 10. Data Aggregation and Group Operations

## 10.1 GroupBy Mechanics

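Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine to describe group operations: the data is split into groups by one or more keys, a function is applied to each group, and the results are combined into a single result object.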
![Split Apply Combine](images/split-apply-combine.png)
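
As a minimal sketch of the three stages, the steps can be written out by hand and compared with `groupby`. The names `pieces`, `applied`, `manual`, and `via_groupby` are illustrative only, and the small frame is a cut-down copy of the `df` used in this chapter.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1, 2, 3, 4, 5]})

# Split: one sub-frame per distinct key
pieces = {key: df[df['key1'] == key] for key in df['key1'].unique()}

# Apply: reduce each piece to a single value
applied = {key: piece['data1'].mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name='data1').sort_index()

# The same computation expressed with groupby
via_groupby = df.groupby('key1')['data1'].mean()

print(manual)
print(via_groupby)
```

The hand-written version is only meant to show what `groupby` does conceptually; the built-in path is the one to use in practice.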

In [30]:
import pandas as pd
import numpy as np

%config Completer.use_jedi = False

In [2]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                  'key2' : ['one', 'two', 'one', 'two', 'one'],
                  'data1' : [1, 2, 3, 4, 5],
                  'data2' : [10, 20, 30, 40, 50]})

df

  key1 key2  data1  data2
0    a  one      1     10
1    a  two      2     20
2    b  one      3     30
3    b  two      4     40
4    a  one      5     50


In [3]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x11db62d60>

In [4]:
# the name of the index is 'key1'
grouped.mean()

key1
a    2.666667
b    3.500000
Name: data1, dtype: float64

In [5]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     3
      two     2
b     one     3
      two     4
Name: data1, dtype: int64

In [6]:
means.unstack()

key2  one  two
key1
a       3    2
b       3    4


In [7]:
# Group keys do not have to be columns of df: any arrays of the right length work
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

California  2005    2.0
            2006    3.0
Ohio        2005    2.5
            2006    5.0
Name: data1, dtype: float64

In [8]:
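# the non-numeric column 'key2' is excluded from the result (a "nuisance column");
# pandas 2.0 and later need numeric_only=True to drop it instead of raising a TypeError
df.groupby('key1').mean(numeric_only=True)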

         data1      data2
key1
a     2.666667  26.666667
b     3.500000  35.000000


In [9]:
df.groupby(['key1', 'key2']).mean()

           data1  data2
key1 key2
a    one       3     30
     two       2     20
b    one       3     30
     two       4     40


In [10]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

### Iterating Over Groups

In [11]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2  data1  data2
0    a  one      1     10
1    a  two      2     20
4    a  one      5     50
b
  key1 key2  data1  data2
2    b  one      3     30
3    b  two      4     40


In [12]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2  data1  data2
0    a  one      1     10
4    a  one      5     50
('a', 'two')
  key1 key2  data1  data2
1    a  two      2     20
('b', 'one')
  key1 key2  data1  data2
2    b  one      3     30
('b', 'two')
  key1 key2  data1  data2
3    b  two      4     40


In [13]:
pieces = dict(list(df.groupby('key1')))
pieces

{'a':   key1 key2  data1  data2
 0    a  one      1     10
 1    a  two      2     20
 4    a  one      5     50,
 'b':   key1 key2  data1  data2
 2    b  one      3     30
 3    b  two      4     40}

In [14]:
df.dtypes

key1     object
key2     object
data1     int64
data2     int64
dtype: object

In [20]:
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    print(group)

int64
   data1  data2
0      1     10
1      2     20
2      3     30
3      4     40
4      5     50
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
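
`DataFrame.groupby(..., axis=1)` has been deprecated since pandas 2.1, so the cell above may warn or fail on recent versions. A minimal sketch of one way to get the same per-dtype pieces without `axis=1` follows; `columns_by_dtype` and `pieces_by_dtype` are hypothetical names, and the explicit `astype` call is only there to pin the integer width for the example.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})
df = df.astype({'data1': 'int64', 'data2': 'int64'})

# Collect the column names that share each dtype
columns_by_dtype = defaultdict(list)
for column, dtype in df.dtypes.items():
    columns_by_dtype[str(dtype)].append(column)

# Slice the frame once per dtype
pieces_by_dtype = {name: df[cols] for name, cols in columns_by_dtype.items()}

for name, piece in pieces_by_dtype.items():
    print(name)
    print(piece)
```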


### Selecting a Column or Subset of Columns

```
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
The code above is syntactic sugar for:
```
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
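
A quick check, assuming the `df` defined earlier: selecting a column on the GroupBy object gives the same result as selecting the column first and grouping by `df['key1']`. A single column name yields a Series, while a list of names yields a DataFrame. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})

# Single column name: SeriesGroupBy, so the aggregate is a Series
sugar_series = df.groupby('key1')['data1'].mean()
explicit_series = df['data1'].groupby(df['key1']).mean()

# List of column names: DataFrameGroupBy, so the aggregate is a DataFrame
sugar_frame = df.groupby('key1')[['data2']].mean()
explicit_frame = df[['data2']].groupby(df['key1']).mean()

print(sugar_series.equals(explicit_series))
print(sugar_frame.equals(explicit_frame))
```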

### Grouping with Dicts and Series

In [15]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values

people

               a         b         c         d         e
Joe    -0.302710  0.257188  0.553120  0.621309  0.913201
Steve   0.960476 -0.190188  1.318647 -0.145830  0.791716
Wes     0.206235       NaN       NaN  1.556204 -0.225970
Jim     0.156508  0.053316  1.372105  0.518629 -0.750475
Travis  0.578852  0.275588  2.128975  0.431901  0.748738


In [16]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

In [17]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue       red
Joe     1.174429  0.867680
Steve   1.172818  1.562004
Wes     1.556204 -0.019735
Jim     1.890734 -0.540651
Travis  2.560876  1.603178
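
On pandas versions where `axis=1` is deprecated (2.1 onward), a column-wise grouping with a dict can be sketched by transposing, grouping the rows, and transposing back. The frame below uses fixed values instead of `np.random.randn` so the result is reproducible; `by_column_sum` is an illustrative name.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(25, dtype=float).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # same NA pattern as in the notes

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the index, group them through the
# mapping, aggregate, then transpose back to the original orientation
by_column_sum = people.T.groupby(mapping).sum().T

print(by_column_sum)
```

The unused key `'f'` in `mapping` is simply ignored, and `sum` skips the missing values in the `Wes` row.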


In [18]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [19]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


### Grouping with Functions

In [21]:
people.groupby(len).sum()

          a         b         c         d         e
3  0.060033  0.310505  1.925225  2.696142 -0.063244
5  0.960476 -0.190188  1.318647 -0.145830  0.791716
6  0.578852  0.275588  2.128975  0.431901  0.748738


In [23]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.302710  0.257188  0.553120  0.621309 -0.225970
  two  0.156508  0.053316  1.372105  0.518629 -0.750475
5 one  0.960476 -0.190188  1.318647 -0.145830  0.791716
6 two  0.578852  0.275588  2.128975  0.431901  0.748738
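
A function passed as a group key is called once per index value, and its return values are used as the group names. The sketch below, on a frame with fixed values instead of random ones, shows that `groupby(len)` matches grouping by a precomputed array of name lengths; `name_lengths`, `by_function`, and `by_array` are illustrative names.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(25, dtype=float).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan

# Passing a function: pandas applies it to every index label
by_function = people.groupby(len).sum()

# The same keys computed by hand: one length per row label
name_lengths = np.array([len(name) for name in people.index])
by_array = people.groupby(name_lengths).sum()

print(by_function)
```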


### Grouping by Index Levels

In [24]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -1.321575  0.169810 -1.361797 -0.249581 -0.045002
1      0.022117  2.089873  1.147045 -0.946652 -1.366134
2     -1.468431  0.494251 -0.894602  1.294916 -0.311233
3     -1.997622 -1.637637 -0.385429 -0.343458  0.253028


In [25]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
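
Grouping on a column level can also be written without `axis=1` (deprecated since pandas 2.1) by grouping the transposed frame on the corresponding index level. This is a sketch with fixed values in place of `np.random.randn`; `counts` is an illustrative name.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.arange(20, dtype=float).reshape(4, 5),
                       columns=columns)

# After transposing, 'cty' is an index level, so level= works on the rows
counts = hier_df.T.groupby(level='cty').count().T

print(counts)
```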


## 10.2 Data Aggregation

Aggregation: any data transformation that produces scalar values from arrays.

In [26]:
df

  key1 key2  data1  data2
0    a  one      1     10
1    a  two      2     20
2    b  one      3     30
3    b  two      4     40
4    a  one      5     50


In [27]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    4.4
b    3.9
Name: data1, dtype: float64

In [28]:
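# a custom aggregation function is generally much slower than the optimized built-in methods
def peak_to_peak(arr):
    return arr.max() - arr.min()

# select the numeric columns explicitly: strings in 'key2' cannot be subtracted
grouped[['data1', 'data2']].agg(peak_to_peak)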

      data1  data2
key1
a         4     40
b         1     10
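
Because custom functions passed to `agg` are slower, it can be worth checking whether the same result is available from built-in methods. A minimal sketch, assuming the `df` from this chapter: the peak-to-peak range equals the grouped `max` minus the grouped `min`. The names `custom` and `builtin` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Restrict to numeric columns so the subtraction is defined
grouped = df.groupby('key1')[['data1', 'data2']]

custom = grouped.agg(peak_to_peak)        # Python function called per group
builtin = grouped.max() - grouped.min()   # two optimized aggregations

print(custom)
print(builtin)
```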


In [31]:
grouped.describe()

      data1                                                 data2
      count      mean       std  min   25%  50%   75%  max  count       mean        std   min   25%   50%   75%   max
key1
a       3.0  2.666667  2.081666  1.0  1.50  2.0  3.50  5.0    3.0  26.666667  20.816660  10.0  15.0  20.0  35.0  50.0
b       2.0  3.500000  0.707107  3.0  3.25  3.5  3.75  4.0    2.0  35.000000   7.071068  30.0  32.5  35.0  37.5  40.0


### Column-Wise and Multiple Function Application