## 1 GroupBy Mechanics

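GroupBy operations follow the ***split-apply-combine*** pattern: the data is split into groups based on one or more keys, a function is applied to each group, and the results are combined into a single result object.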
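As a rough sketch of what split-apply-combine means, the three steps (split the data into groups, apply a function to each group, combine the results) can be written out by hand and compared with a single `groupby` call. The `toy` frame and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group means are easy to check by hand.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 8.0]})

# Split: one sub-Series per distinct key.
pieces = {k: toy.loc[toy['key'] == k, 'value']
          for k in sorted(toy['key'].unique())}

# Apply: reduce each piece to a scalar.
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied, name='value')
manual.index.name = 'key'

# groupby performs the same three steps in one call.
via_groupby = toy.groupby('key')['value'].mean()

print(manual)
print(via_groupby)
```

The hand-written version is only meant to make the steps visible; in practice `groupby` is both shorter and faster.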

Grouping keys can take many forms, and they do not all have to be of the same type:
- A list or array of values that is the same length as the axis being grouped
- A value indicating a column name in a DataFrame
- A dict or Series giving a correspondence between the values on the axis being grouped and the group names
- A function to be invoked on the axis index or the individual labels in the index
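The four kinds of key listed above can be tried side by side on a small frame. This is a minimal sketch on made-up data (`toy`, its index labels, and the group names are all hypothetical), not the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy data; the index labels are chosen so that len() is informative.
toy = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                    'score': [1, 2, 3, 4]},
                   index=['al', 'bo', 'cy', 'dee'])

# 1. A list with the same length as the axis being grouped.
by_list = toy['score'].groupby(['u', 'u', 'v', 'v']).sum()

# 2. A column name of the DataFrame.
by_column = toy.groupby('team')['score'].sum()

# 3. A dict mapping index labels to group names.
label_to_group = {'al': 'short', 'bo': 'short', 'cy': 'short', 'dee': 'long'}
by_dict = toy['score'].groupby(label_to_group).sum()

# 4. A function that is called on each index label.
by_func = toy['score'].groupby(len).sum()

for result in (by_list, by_column, by_dict, by_func):
    print(result, end='\n\n')
```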

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [11]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

In [12]:
df

      data1     data2 key1 key2
0  0.801638  1.115052    a  one
1  1.377083  0.080599    a  two
2 -1.304465 -0.436400    b  one
3  0.836782 -1.185623    b  two
4  0.715934  0.305773    a  one


Compute the mean of the `data1` column using the group labels from `key1`:

In [14]:
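# Method 1: move key1 into the index, then aggregate over that index level.
# (The original mean(level='key1') was removed in pandas 2.0;
# groupby(level=...) is the replacement.)
df[['data1', 'data2', 'key1']].set_index('key1').groupby(level='key1').mean()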

         data1     data2
key1
a     0.964885  0.500475
b    -0.233842 -0.811012


In [15]:
# Method 2
grouped = df['data1'].groupby(df['key1'])

In [16]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x10654dcd0>

In [18]:
grouped.mean()

key1
a    0.964885
b   -0.233842
Name: data1, dtype: float64

In [19]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [20]:
means

key1  key2
a     one     0.758786
      two     1.377083
b     one    -1.304465
      two     0.836782
Name: data1, dtype: float64

In [21]:
means.unstack()

key2       one       two
key1
a     0.758786  1.377083
b    -1.304465  0.836782


The group keys can be any arrays of the right length, not only columns of the DataFrame:

In [22]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [23]:
years = [2005, 2005, 2006, 2005, 2006]

In [24]:
df['data1'].groupby([states, years]).mean()

California  2005    1.377083
            2006   -1.304465
Ohio        2005    0.819210
            2006    0.715934
Name: data1, dtype: float64

Grouping the dataframe directly

In [26]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.758786  0.710413
     two   1.377083  0.080599
b    one  -1.304465 -0.436400
     two   0.836782 -1.185623


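By default, all numeric columns are aggregated. In the pandas version used here, a non-numeric column that is not a group key is treated as a nuisance column and silently excluded from the result; pandas 2.0 and later raise an error for such a column instead.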
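Because the handling of non-numeric columns differs between pandas versions, it can be safer to state explicitly which columns take part in the aggregation. The sketch below shows two ways of doing that on a made-up frame (`toy` is hypothetical): selecting the numeric columns first, or passing `numeric_only=True` to the aggregation.

```python
import pandas as pd

# Hypothetical toy data: key2 is non-numeric and is not used as a group key.
toy = pd.DataFrame({'key1': ['a', 'a', 'b'],
                    'key2': ['one', 'two', 'one'],
                    'data1': [1.0, 3.0, 5.0]})

# Option 1: select the numeric columns before aggregating.
selected = toy.groupby('key1')[['data1']].mean()

# Option 2: ask the aggregation itself to skip non-numeric columns.
numeric_only = toy.groupby('key1').mean(numeric_only=True)

print(selected)
print(numeric_only)
```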

The `size` method of a GroupBy object returns a Series containing the group sizes:

In [27]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

***Any missing values in a group key will be excluded from the result.***
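A small made-up example (the `toy` frame is hypothetical) shows the effect: the row whose key is missing does not appear in any group. In pandas 1.1 and later, `groupby` also accepts `dropna=False`, which keeps the missing key as a group of its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing group key.
toy = pd.DataFrame({'key': ['a', np.nan, 'a', 'b'],
                    'value': [1, 2, 3, 4]})

# Default behaviour: the row with the NaN key belongs to no group.
default_sizes = toy.groupby('key').size()

# dropna=False (pandas 1.1 or later) keeps NaN as its own group.
with_nan_sizes = toy.groupby('key', dropna=False).size()

print(default_sizes)
print(with_nan_sizes)
```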

### 1.1 Iterating Over Groups

Iterating a GroupBy object generates a sequence of 2-tuples containing the group name along with the chunk of data

In [28]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0  0.801638  1.115052    a  one
1  1.377083  0.080599    a  two
4  0.715934  0.305773    a  one
b
      data1     data2 key1 key2
2 -1.304465 -0.436400    b  one
3  0.836782 -1.185623    b  two


In the case of multiple keys, the first element in the tuple will be a tuple of key values

In [29]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
      data1     data2 key1 key2
0  0.801638  1.115052    a  one
4  0.715934  0.305773    a  one
('a', 'two')
      data1     data2 key1 key2
1  1.377083  0.080599    a  two
('b', 'one')
      data1   data2 key1 key2
2 -1.304465 -0.4364    b  one
('b', 'two')
      data1     data2 key1 key2
3  0.836782 -1.185623    b  two


In [31]:
dict([('a', 1), ('b', 2)])

{'a': 1, 'b': 2}

In [32]:
pieces = dict(list(df.groupby('key1')))

In [33]:
pieces['b']

      data1     data2 key1 key2
2 -1.304465 -0.436400    b  one
3  0.836782 -1.185623    b  two
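The `dict(list(...))` idiom above works because iterating a GroupBy object yields `(name, group)` pairs, and `dict` accepts a list of pairs. When only one group is needed, `get_group` returns it directly. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = toy.groupby('key1')

# Materialize every group: {group name: sub-DataFrame}.
pieces = dict(list(grouped))

# Fetch a single group without building the whole dict.
one_group = grouped.get_group('b')

print(pieces['b'])
print(one_group)
```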


Grouping columns by dtype

In [35]:
grouped = df.groupby(df.dtypes, axis=1)

In [36]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0  0.801638  1.115052
1  1.377083  0.080599
2 -1.304465 -0.436400
3  0.836782 -1.185623
4  0.715934  0.305773
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
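The `axis=1` argument of `groupby` is deprecated as of pandas 2.1, so the cell above may emit a warning or fail on recent versions. One possible alternative, sketched here on made-up data (`toy` and the variable names are hypothetical), is to collect the column names per dtype and select each set of columns directly.

```python
import pandas as pd

# Hypothetical toy data with one integer column and two float columns.
toy = pd.DataFrame({'n': [1, 2, 3],
                    'x': [0.5, 1.5, 2.5],
                    'y': [1.0, 2.0, 3.0]})

# Collect column names under the string name of their dtype.
columns_by_dtype = {}
for column, dtype in toy.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# One sub-DataFrame per dtype, analogous to iterating the grouped object.
pieces_by_dtype = {name: toy[cols] for name, cols in columns_by_dtype.items()}

for name, piece in pieces_by_dtype.items():
    print(name)
    print(piece)
```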


### 1.2 Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name has the effect of column subsetting for aggregation.

```
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
are syntactic sugar for:
```
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
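The equivalence can be checked directly. The sketch below uses a made-up frame (`toy` is hypothetical) and compares the results of both spellings; it also shows that indexing with a single name gives a Series result while indexing with a list gives a DataFrame result.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df.
toy = pd.DataFrame({'key1': ['a', 'a', 'b'],
                    'data1': [1.0, 2.0, 6.0],
                    'data2': [10.0, 20.0, 60.0]})

# Single column name: the result of the aggregation is a Series.
sugar_series = toy.groupby('key1')['data1'].mean()
long_series = toy['data1'].groupby(toy['key1']).mean()

# List of column names: the result of the aggregation is a DataFrame.
sugar_frame = toy.groupby('key1')[['data2']].mean()
long_frame = toy[['data2']].groupby(toy['key1']).mean()

print(sugar_series.equals(long_series))
print(sugar_frame.equals(long_frame))
```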

In [37]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one   0.710413
     two   0.080599
b    one  -0.436400
     two  -1.185623


In [38]:
df.groupby(['key1', 'key2'])['data2'].mean()

key1  key2
a     one     0.710413
      two     0.080599
b     one    -0.436400
      two    -1.185623
Name: data2, dtype: float64

### 1.3 Grouping with Dicts and Series

In [39]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [40]:
people.iloc[2, [1, 2]] = np.nan

In [41]:
people

               a         b         c         d         e
Joe     0.010259 -0.351395  0.122652  0.051172  0.262207
Steve  -0.717925  1.093584 -0.832276 -0.416769  0.583197
Wes     0.954470       NaN       NaN -1.039793 -0.059271
Jim     0.422414  0.187094 -1.040474  0.177332 -1.807394
Travis  0.457099 -1.635420  1.738532  1.135215 -0.039597


A group correspondence for the columns.

In [42]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

In [44]:
by_column = people.groupby(mapping, axis=1)

In [45]:
by_column.sum()

            blue       red
Joe     0.173824 -0.078929
Steve  -1.249045  0.958856
Wes    -1.039793  0.895199
Jim    -0.863141 -1.197887
Travis  2.873747 -1.217918


The same functionality holds for Series.

In [47]:
map_series = pd.Series(mapping)

In [48]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [49]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
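Two details of the examples above are easy to miss: the mapping contains a key (`'f'`) that matches no label, and it is simply ignored; and a dict and a Series built from it give the same grouping. The sketch below checks both. To stay short and avoid `axis=1`, it groups the index labels of a made-up Series (`values` is hypothetical) rather than the columns of `people`.

```python
import pandas as pd

# Hypothetical data: one value per label 'a'..'e'.
values = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Same mapping as above, including the unused key 'f'.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

by_dict = values.groupby(mapping).sum()
by_series = values.groupby(pd.Series(mapping)).sum()

print(by_dict)
print(by_series)
```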


### 1.4 Grouping with Functions

Any function passed as a group key will be called once per index value, with the return values being used as the group names.

Group by the length of the index labels (here, the people's names):

In [51]:
people.groupby(len).sum()

          a         b         c         d         e
3  1.387143 -0.164301 -0.917822 -0.811290 -1.604459
5 -0.717925  1.093584 -0.832276 -0.416769  0.583197
6  0.457099 -1.635420  1.738532  1.135215 -0.039597


Functions can be mixed with arrays, dicts, or Series, because everything gets converted to arrays internally:

In [52]:
key_list = ['one', 'one', 'one', 'two', 'two']

In [53]:
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one  0.010259 -0.351395  0.122652 -1.039793 -0.059271
  two  0.422414  0.187094 -1.040474  0.177332 -1.807394
5 one -0.717925  1.093584 -0.832276 -0.416769  0.583197
6 two  0.457099 -1.635420  1.738532  1.135215 -0.039597
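Since a function key ends up as an array of its return values, grouping by `len` should give the same result as grouping by an explicitly computed array of label lengths. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data indexed by names of different lengths.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]},
                   index=['Joe', 'Steve', 'Wes', 'Travis'])

# Function key: len is called once per index label.
by_func = toy.groupby(len).sum()

# Equivalent array key: the label lengths computed up front.
lengths = np.array([len(label) for label in toy.index])
by_array = toy.groupby(lengths).sum()

print(by_func)
print(by_array)
```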


### 1.5 Grouping by Index Levels

In [54]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

In [55]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

In [56]:
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -1.283369 -0.863683  1.929965 -0.721369 -0.499294
1     -0.896008 -0.322117  1.147151 -1.128040  1.170596
2      0.478959  0.503407  1.813620 -0.522976 -0.547660
3     -0.662532 -0.722552 -1.840245 -0.334214  1.526501


In [57]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
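On pandas versions where `axis=1` is deprecated (2.1 and later), one possible way to get the same result is to transpose, group the rows by level, and transpose back. The sketch below assumes a made-up frame with the same column structure as `hier_df` but deterministic values (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

# Hypothetical values 0.0 .. 19.0 so the sums are easy to check by hand.
toy = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# Transpose so the column levels become row levels, group, transpose back.
counts = toy.T.groupby(level='cty').count().T
sums = toy.T.groupby(level='cty').sum().T

print(counts)
print(sums)
```

Transposing copies the data, so this is a convenience for small frames rather than a general recommendation.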


## 2 Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.

The common aggregations listed below have optimized implementations on GroupBy objects:

<img src='img/10_2_1.png'>

Other aggregation methods, such as `quantile`, or user-defined functions can also be used:

In [58]:
df

      data1     data2 key1 key2
0  0.801638  1.115052    a  one
1  1.377083  0.080599    a  two
2 -1.304465 -0.436400    b  one
3  0.836782 -1.185623    b  two
4  0.715934  0.305773    a  one


In [60]:
grouped = df.groupby('key1')

In [61]:
grouped['data1'].quantile(0.9)

key1
a    1.261994
b    0.622657
Name: data1, dtype: float64
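For a user-defined aggregation, one common route is the `agg` method, which accepts any function that reduces an array to a scalar. The sketch below uses made-up data, and `peak_to_peak` is a hypothetical helper name chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 4.0, 2.0, 8.0, 3.0]})


def peak_to_peak(values):
    """Range of a group: largest value minus smallest value."""
    return values.max() - values.min()


grouped = toy.groupby('key1')

# A user-defined function is applied once per group.
spread = grouped['data1'].agg(peak_to_peak)

# Built-in aggregations can be requested by name, several at a time.
summary = grouped['data1'].agg(['mean', 'max'])

print(spread)
print(summary)
```

User-defined functions are generally slower than the optimized built-in aggregations, since they are called once per group in Python.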