## 1 GroupBy Mechanics

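GroupBy operations follow the ***split-apply-combine*** pattern: the data is split into groups based on one or more keys, a function is applied to each group, and the results are combined into a single object.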

Grouping keys can take many forms, and they do not all have to be of the same type:
- A list or array of values that is the same length as the axis being grouped
- A value indicating a column name in a DataFrame
- A dict or Series giving a correspondence between the values on the axis being grouped and the group names
- A function to be invoked on the axis index or the individual labels in the index
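As a rough illustration of the four key forms listed above, the following sketch groups one small Series in each way. The frame `toy`, its labels, and the mapping `label_to_group` are made up for this example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]},
                   index=["w", "x", "yy", "zz"])

# 1. An array of values with the same length as the axis being grouped
by_array = toy["value"].groupby(np.array(["u", "u", "v", "v"])).sum()

# 2. A column name in the DataFrame
by_column = toy.groupby("key")["value"].sum()

# 3. A dict mapping index labels to group names
label_to_group = {"w": "first", "x": "first", "yy": "second", "zz": "second"}
by_dict = toy["value"].groupby(label_to_group).sum()

# 4. A function called on each index label (here: the label's length)
by_function = toy["value"].groupby(len).sum()

print(by_array, by_column, by_dict, by_function, sep="\n\n")
```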

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

In [4]:
df

,data1,data2,key1,key2
0,1.273728,1.052138,a,one
1,-0.940964,0.878502,a,two
2,0.406551,-0.00749,b,one
3,-1.928155,1.36478,b,two
4,0.258822,-0.987653,a,one


Compute the mean of the `data1` column using the labels from `key1`.

In [5]:
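# Method 1: move key1 into the index, then aggregate over that index level
# (the older `.mean(level='key1')` spelling was removed in pandas 2.0)
df[['data1', 'data2', 'key1']].set_index('key1').groupby(level='key1').mean()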

key1,data1,data2
a,0.197195,0.314329
b,-0.760802,0.678645


In [6]:
# Method 2
grouped = df['data1'].groupby(df['key1'])

In [7]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x10aabbc10>

In [8]:
grouped.mean()

key1
a    0.197195
b   -0.760802
Name: data1, dtype: float64

In [9]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [10]:
means

key1  key2
a     one     0.766275
      two    -0.940964
b     one     0.406551
      two    -1.928155
Name: data1, dtype: float64

In [11]:
means.unstack()

key2,one,two
key1,,
a,0.766275,-0.940964
b,0.406551,-1.928155


The group keys can be any arrays of the right length.

In [12]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [13]:
years = [2005, 2005, 2006, 2005, 2006]

In [14]:
df['data1'].groupby([states, years]).mean()

California  2005   -0.940964
            2006    0.406551
Ohio        2005   -0.327214
            2006    0.258822
Name: data1, dtype: float64

Grouping the dataframe directly

In [15]:
df.groupby(['key1', 'key2']).mean()

key1,key2,data1,data2
a,one,0.766275,0.032243
a,two,-0.940964,0.878502
b,one,0.406551,-0.00749
b,two,-1.928155,1.36478


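By default, all of the numeric columns are aggregated, and a non-numeric column may be treated as a nuisance column and excluded from the result. Newer pandas releases (2.0 and later) no longer drop such columns silently, so pass `numeric_only=True` when the frame contains non-numeric columns that are not group keys.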
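A minimal sketch of restricting an aggregation to numeric columns explicitly, assuming a small made-up frame in which `key2` is a non-numeric column that is not used as a group key:

```python
import pandas as pd

# Hypothetical data: key2 is a string column that cannot be averaged
small = pd.DataFrame({"key1": ["a", "a", "b"],
                      "key2": ["one", "two", "one"],
                      "data1": [1.0, 3.0, 5.0]})

# numeric_only=True keeps only the numeric columns in the result
means = small.groupby("key1").mean(numeric_only=True)
print(means)
```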

The `size` method of a GroupBy object returns a Series containing the group sizes.

In [16]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

***Any missing values in a group key are excluded from the result.***
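The following sketch shows this exclusion on made-up data. It also uses the `dropna=False` option of `groupby`, which is available in pandas 1.1 and later (newer than the version the notebook was run with), to keep the missing key as its own group.

```python
import numpy as np
import pandas as pd

# Hypothetical values and keys; the second key is missing
values = pd.Series([1.0, 2.0, 3.0, 4.0])
keys = pd.Series(["a", np.nan, "a", "b"])

# Default behaviour: the row with the missing key belongs to no group
default_sizes = values.groupby(keys).size()

# dropna=False keeps the missing key as a separate group
sizes_with_nan = values.groupby(keys, dropna=False).size()

print(default_sizes)
print(sizes_with_nan)
```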

### 1.1 Iterating Over Groups

Iterating over a GroupBy object yields a sequence of 2-tuples, each containing the group name along with the corresponding chunk of data.

In [17]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0  1.273728  1.052138    a  one
1 -0.940964  0.878502    a  two
4  0.258822 -0.987653    a  one
b
      data1    data2 key1 key2
2  0.406551 -0.00749    b  one
3 -1.928155  1.36478    b  two


In the case of multiple keys, the first element in the tuple will be a tuple of key values

In [18]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
      data1     data2 key1 key2
0  1.273728  1.052138    a  one
4  0.258822 -0.987653    a  one
('a', 'two')
      data1     data2 key1 key2
1 -0.940964  0.878502    a  two
('b', 'one')
      data1    data2 key1 key2
2  0.406551 -0.00749    b  one
('b', 'two')
      data1    data2 key1 key2
3 -1.928155  1.36478    b  two


In [19]:
dict([('a', 1), ('b', 2)])

{'a': 1, 'b': 2}

In [20]:
pieces = dict(list(df.groupby('key1')))

In [21]:
pieces['b']

,data1,data2,key1,key2
2,0.406551,-0.00749,b,one
3,-1.928155,1.36478,b,two


Grouping columns by dtype

In [22]:
grouped = df.groupby(df.dtypes, axis=1)

In [23]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0  1.273728  1.052138
1 -0.940964  0.878502
2  0.406551 -0.007490
3 -1.928155  1.364780
4  0.258822 -0.987653
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


### 1.2 Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name has the effect of column subsetting for aggregation.

```
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
are syntactic sugar for:
```
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
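A quick way to convince yourself of this equivalence is to compare the two spellings on a small frame. The data below is made up for the check; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical data with the same column names as the notebook's df
small = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                      "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

sugar = small.groupby("key1")["data1"].mean()
explicit = small["data1"].groupby(small["key1"]).mean()

# Raises an AssertionError if the two results differ
pd.testing.assert_series_equal(sugar, explicit)
```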

In [24]:
df.groupby(['key1', 'key2'])[['data2']].mean()

key1,key2,data2
a,one,0.032243
a,two,0.878502
b,one,-0.00749
b,two,1.36478


In [25]:
df.groupby(['key1', 'key2'])['data2'].mean()

key1  key2
a     one     0.032243
      two     0.878502
b     one    -0.007490
      two     1.364780
Name: data2, dtype: float64

### 1.3 Grouping with Dicts and Series

In [26]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [27]:
people.iloc[2, [1, 2]] = np.nan

In [28]:
people

,a,b,c,d,e
Joe,-4.396123,0.006858,-0.667907,-0.381901,-0.202327
Steve,1.938087,0.817526,0.111457,-0.729114,-0.570212
Wes,-0.852693,,,-1.335131,0.727978
Jim,0.030195,1.717716,1.495565,1.089981,1.962407
Travis,1.540977,-0.232942,-0.348551,0.583127,2.326376


Suppose we have a group correspondence for the columns. Unused keys in the mapping, such as `'f'` here, are fine.

In [29]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

In [30]:
by_column = people.groupby(mapping, axis=1)

In [31]:
by_column.sum()

,blue,red
Joe,-1.049807,-4.591592
Steve,-0.617657,2.185401
Wes,-1.335131,-0.124715
Jim,2.585545,3.710317
Travis,0.234577,3.634412


The same functionality holds for Series.

In [32]:
map_series = pd.Series(mapping)

In [33]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [34]:
people.groupby(map_series, axis=1).count()

,blue,red
Joe,2,3
Steve,2,3
Wes,1,2
Jim,2,3
Travis,2,3
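The `axis=1` argument to `groupby` is deprecated in recent pandas releases (2.1 and later). One possible workaround, sketched here on made-up numbers, is to transpose the frame, group its rows with the same mapping, and transpose the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names and mapping follow the notebook
people_small = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                            columns=["a", "b", "c", "d", "e"],
                            index=["Joe", "Steve"])
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# Transpose so the columns become rows, group those rows, transpose back
by_color = people_small.T.groupby(mapping).sum().T
print(by_color)
```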


### 1.4 Grouping with Functions

Any function passed as a group key will be called once per index value, with the return values being used as the group names.

Group by the length of the row labels (the names in the index).

In [35]:
people.groupby(len).sum()

,a,b,c,d,e
3,-5.218622,1.724574,0.827658,-0.627051,2.488058
5,1.938087,0.817526,0.111457,-0.729114,-0.570212
6,1.540977,-0.232942,-0.348551,0.583127,2.326376
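The key function does not have to be a built-in such as `len`. As a small sketch on made-up data, a lambda can group index labels by their first letter:

```python
import pandas as pd

# Hypothetical scores indexed by name
scores = pd.Series([10, 20, 30, 40], index=["Joe", "Jim", "Steve", "Sam"])

# The lambda is called once per index label; its return value names the group
by_initial = scores.groupby(lambda name: name[0]).sum()
print(by_initial)
```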


Functions can be mixed with arrays, dicts, or Series, because everything gets converted to arrays internally.

In [36]:
key_list = ['one', 'one', 'one', 'two', 'two']

In [37]:
people.groupby([len, key_list]).min()

,,a,b,c,d,e
3,one,-4.396123,0.006858,-0.667907,-1.335131,-0.202327
3,two,0.030195,1.717716,1.495565,1.089981,1.962407
5,one,1.938087,0.817526,0.111457,-0.729114,-0.570212
6,two,1.540977,-0.232942,-0.348551,0.583127,2.326376


### 1.5 Grouping by Index Levels

In [38]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

In [39]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

In [40]:
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,1.615514,0.241491,-0.877846,-0.898354,-0.9636
1,-3.347172,-1.499991,0.710433,-1.350951,-1.233864
2,-0.604638,0.171662,-0.231906,-0.713595,0.583108
3,1.291055,-0.333279,-0.160237,1.622486,-0.775295


In [41]:
hier_df.groupby(level='cty', axis=1).count()

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3


## 2 Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.

The common aggregations listed below have optimized implementations for GroupBy objects.

<img src='img/10_2_1.png'>

Other aggregation methods or user-defined functions can also be passed, but they are generally much slower than the optimized functions in the table above. The extra cost comes from overhead such as function calls and data rearrangement when constructing the intermediate group data chunks.

In [42]:
df

,data1,data2,key1,key2
0,1.273728,1.052138,a,one
1,-0.940964,0.878502,a,two
2,0.406551,-0.00749,b,one
3,-1.928155,1.36478,b,two
4,0.258822,-0.987653,a,one


In [43]:
grouped = df.groupby('key1')

In [44]:
grouped['data1'].quantile(0.9)

key1
a    1.070746
b    0.173081
Name: data1, dtype: float64

To use your own aggregation functions, pass any function that aggregates an array to the `agg` method.

In [45]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [46]:
grouped.agg(peak_to_peak)

key1,data1,data2
a,2.214691,2.039791
b,2.334706,1.37227


In [47]:
grouped.agg(lambda arr: arr.max() - arr.min())

key1,data1,data2
a,2.214691,2.039791
b,2.334706,1.37227
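Because custom functions carry the overhead mentioned earlier, it can be worth checking whether a statistic can be composed from the optimized methods instead. For the peak-to-peak range, one sketch (on made-up data) is to subtract the grouped minimum from the grouped maximum; the result matches the custom aggregation.

```python
import pandas as pd

# Hypothetical data with the notebook's column names
small = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                      "data1": [1.0, 4.0, 2.0, 8.0, 3.0]})
grouped_small = small.groupby("key1")["data1"]

# Custom function: called once per group
custom = grouped_small.agg(lambda arr: arr.max() - arr.min())

# Same statistic built from two optimized aggregations
builtin = grouped_small.max() - grouped_small.min()

pd.testing.assert_series_equal(custom, builtin)
print(builtin)
```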


Some non-aggregation methods like `describe` also work

In [48]:
grouped.describe()

key1,,data1,data2
a,count,3.0,3.0
a,mean,0.197195,0.314329
a,std,1.108631,1.130887
a,min,-0.940964,-0.987653
a,25%,-0.341071,-0.054576
a,50%,0.258822,0.878502
a,75%,0.766275,0.96532
a,max,1.273728,1.052138
b,count,2.0,2.0
b,mean,-0.760802,0.678645


### 2.1 Column-Wise and Multiple Function Application

In [49]:
tips = pd.read_csv('examples/tips.csv')

In [50]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [51]:
tips.head()

,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


Using multiple functions to aggregate columns at once

In [53]:
grouped = tips.groupby(['day', 'smoker'])

You can pass the name of an optimized aggregation function as a string to `agg`.

In [54]:
grouped_pct = grouped['tip_pct']

In [55]:
grouped_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

Passing a list of functions or function names

In [56]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

day,smoker,mean,std,peak_to_peak
Fri,No,0.15165,0.028123,0.067349
Fri,Yes,0.174783,0.051293,0.159925
Sat,No,0.158048,0.039767,0.235193
Sat,Yes,0.147906,0.061375,0.290095
Sun,No,0.160113,0.042347,0.193226
Sun,Yes,0.18725,0.154134,0.644685
Thur,No,0.160298,0.038774,0.19335
Thur,Yes,0.163863,0.039389,0.15124


To use custom column names instead of the default function names, pass a list of (`name`, `function`) tuples.

In [57]:
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

day,smoker,foo,bar
Fri,No,0.15165,0.028123
Fri,Yes,0.174783,0.051293
Sat,No,0.158048,0.039767
Sat,Yes,0.147906,0.061375
Sun,No,0.160113,0.042347
Sun,Yes,0.18725,0.154134
Thur,No,0.160298,0.038774
Thur,Yes,0.163863,0.039389
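Newer pandas releases (0.25 and later) also accept keyword arguments for naming the outputs of a Series aggregation. This is an alternative spelling rather than what the notebook uses; the sketch below assumes a tiny made-up `tips_small` frame.

```python
import pandas as pd

# Hypothetical data with the notebook's column names
tips_small = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                           "tip_pct": [0.10, 0.20, 0.15, 0.25]})

# Each keyword becomes a column name; each value names the aggregation
named = tips_small.groupby("day")["tip_pct"].agg(foo="mean", bar="max")
print(named)
```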


With a DataFrame, you can also specify a list of functions to apply to all of the columns, or different functions per column.

In [58]:
functions = ['count', 'mean', 'max']

In [59]:
result = grouped[['tip_pct', 'total_bill']].agg(functions)

In [60]:
result

,,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
day,smoker,count,mean,max,count,mean,max
Fri,No,4,0.15165,0.187735,4,18.42,22.75
Fri,Yes,15,0.174783,0.26348,15,16.813333,40.17
Sat,No,45,0.158048,0.29199,45,19.661778,48.33
Sat,Yes,42,0.147906,0.325733,42,21.276667,50.81
Sun,No,57,0.160113,0.252672,57,20.506667,48.17
Sun,Yes,19,0.18725,0.710345,19,24.12,45.35
Thur,No,45,0.160298,0.266312,45,17.113111,41.19
Thur,Yes,17,0.163863,0.241255,17,19.190588,43.11


In [61]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]

In [62]:
grouped[['tip_pct', 'total_bill']].agg(ftuples)

,,tip_pct,tip_pct,total_bill,total_bill
day,smoker,Durchschnitt,Abweichung,Durchschnitt,Abweichung
Fri,No,0.15165,0.000791,18.42,25.596333
Fri,Yes,0.174783,0.002631,16.813333,82.562438
Sat,No,0.158048,0.001581,19.661778,79.908965
Sat,Yes,0.147906,0.003767,21.276667,101.387535
Sun,No,0.160113,0.001793,20.506667,66.09998
Sun,Yes,0.18725,0.023757,24.12,109.046044
Thur,No,0.160298,0.001503,17.113111,59.625081
Thur,Yes,0.163863,0.001551,19.190588,69.808518


Passing a dict to `agg` that maps column names to aggregation functions applies different functions to one or more of the columns.

In [63]:
grouped.agg({'tip': np.max, 'size': 'sum'})

day,smoker,tip,size
Fri,No,3.5,9
Fri,Yes,4.73,31
Sat,No,9.0,115
Sat,Yes,10.0,104
Sun,No,6.0,167
Sun,Yes,6.5,49
Thur,No,6.7,112
Thur,Yes,5.0,40


In [64]:
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'],
             'size': 'sum'})

,,tip_pct,tip_pct,tip_pct,tip_pct,size
day,smoker,min,max,mean,std,sum
Fri,No,0.120385,0.187735,0.15165,0.028123,9
Fri,Yes,0.103555,0.26348,0.174783,0.051293,31
Sat,No,0.056797,0.29199,0.158048,0.039767,115
Sat,Yes,0.035638,0.325733,0.147906,0.061375,104
Sun,No,0.059447,0.252672,0.160113,0.042347,167
Sun,Yes,0.06566,0.710345,0.18725,0.154134,49
Thur,No,0.072961,0.266312,0.160298,0.038774,112
Thur,Yes,0.090014,0.241255,0.163863,0.039389,40
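When the hierarchical column index produced by a dict of lists is not wanted, pandas 0.25 and later offer named aggregation, where each output column is given as `name=(column, function)`. This is a sketch of that alternative on made-up data, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical data with the notebook's column names
tips_small = pd.DataFrame({"day": ["Fri", "Fri", "Sat"],
                           "tip": [1.0, 3.0, 2.0],
                           "size": [2, 3, 4]})

# Each keyword is an output column; each tuple is (input column, function)
summary = tips_small.groupby("day").agg(max_tip=("tip", "max"),
                                        total_size=("size", "sum"))
print(summary)
```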


### 2.2 Returning Aggregated Data Without Row Indexes

In [65]:
tips.groupby(['day', 'smoker'], as_index=False).mean()

,day,smoker,total_bill,tip,size,tip_pct
0,Fri,No,18.42,2.8125,2.25,0.15165
1,Fri,Yes,16.813333,2.714,2.066667,0.174783
2,Sat,No,19.661778,3.102889,2.555556,0.158048
3,Sat,Yes,21.276667,2.875476,2.47619,0.147906
4,Sun,No,20.506667,3.167895,2.929825,0.160113
5,Sun,Yes,24.12,3.516842,2.578947,0.18725
6,Thur,No,17.113111,2.673778,2.488889,0.160298
7,Thur,Yes,19.190588,3.03,2.352941,0.163863
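With `as_index=False`, the group keys come back as ordinary columns instead of forming the row index. The same shape can be obtained by calling `reset_index` on the indexed result, at the cost of an extra step. The sketch below compares the two on a small made-up frame.

```python
import pandas as pd

# Hypothetical data with the notebook's column names
tips_small = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                           "smoker": ["No", "Yes", "No", "No"],
                           "tip": [1.0, 2.0, 3.0, 5.0]})

# Group keys returned as regular columns
flat = tips_small.groupby(["day", "smoker"], as_index=False)["tip"].mean()

# Equivalent result: aggregate with the default index, then reset it
via_reset = tips_small.groupby(["day", "smoker"])["tip"].mean().reset_index()

pd.testing.assert_frame_equal(flat, via_reset)
print(flat)
```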


## 3 Apply: General split-apply-combine

`apply` splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together
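To make the three steps concrete, the sketch below performs them by hand on made-up data: it iterates over the groups, calls a function on each piece, and glues the pieces together with `pd.concat`. This is only a conceptual picture of what `apply` does, not pandas' actual implementation, and `largest` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical data
small = pd.DataFrame({"key": ["a", "b", "a", "b"], "x": [3, 1, 2, 4]})

def largest(group):
    """Return the row with the largest x in this group."""
    return group.nlargest(1, "x")

# Split and apply: one result per group, stored under the group name
pieces = {name: largest(group) for name, group in small.groupby("key")}

# Combine: the dict keys become the outer level of the row index
manual = pd.concat(pieces)
print(manual)
```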

Select the top five `tip_pct` values by group.

In [66]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

In [67]:
top(tips, n=6)

,total_bill,tip,smoker,day,time,size,tip_pct
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
232,11.61,3.39,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


In [69]:
tips.groupby('smoker').apply(top)

smoker,,total_bill,tip,smoker,day,time,size,tip_pct
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


If you pass a function to `apply` that takes other arguments or keywords, you can pass these after the function.

In [70]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

smoker,day,,total_bill,tip,smoker,day,time,size,tip_pct
No,Fri,94,22.75,3.25,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Yes,Thur,Lunch,4,0.115982


Inside GroupBy, invoking a method like `describe` is actually a shortcut for:

In [78]:
tips.groupby('smoker')['tip_pct'].apply(lambda x: x.describe())

smoker       
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
Name: tip_pct, dtype: float64

### 3.1 Suppressing the Group Keys

Passing `group_keys=False` to `groupby` disables forming a hierarchical index from the group keys.

In [83]:
tips.groupby('smoker', group_keys=False).apply(top)

,total_bill,tip,smoker,day,time,size,tip_pct
88,24.71,5.85,No,Thur,Lunch,2,0.236746
185,20.69,5.0,No,Sun,Dinner,5,0.241663
51,10.29,2.6,No,Sun,Dinner,2,0.252672
149,7.51,2.0,No,Thur,Lunch,2,0.266312
232,11.61,3.39,No,Sat,Dinner,2,0.29199
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


`group_keys` applies to the `apply` method, which makes it a bit different from `as_index`: the latter controls whether aggregated output is indexed by the group keys.
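A small sketch of the effect of `group_keys` on the index of an `apply` result, using a made-up Series and keys:

```python
import pandas as pd

# Hypothetical values and group keys
values = pd.Series([3, 1, 2, 4])
keys = pd.Series(["a", "b", "a", "b"])

def top_one(group):
    """Keep the single largest value of the group."""
    return group.nlargest(1)

# group_keys=True: the group name is added as an outer index level
with_keys = values.groupby(keys, group_keys=True).apply(top_one)

# group_keys=False: only the original row labels remain
without_keys = values.groupby(keys, group_keys=False).apply(top_one)

print(with_keys)
print(without_keys)
```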

### 3.2 Quantile and Bucket Analysis

Combining `cut` and `qcut` with `groupby` makes it convenient to perform bucket or quantile analysis.

In [85]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})

In [95]:
quartiles = pd.cut(frame.data1, 4)

In [96]:
quartiles.head()

0     (-0.0695, 1.595]
1     (-0.0695, 1.595]
2    (-1.734, -0.0695]
3     (-0.0695, 1.595]
4     (-0.0695, 1.595]
Name: data1, dtype: category
Categories (4, object): [(-3.406, -1.734] < (-1.734, -0.0695] < (-0.0695, 1.595] < (1.595, 3.26]]

In [91]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

In [97]:
grouped = frame.data2.groupby(quartiles)

In [98]:
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
"(-3.406, -1.734]",36.0,2.171334,0.013803,-1.480722
"(-1.734, -0.0695]",438.0,3.197375,-0.035417,-2.729932
"(-0.0695, 1.595]",479.0,3.161869,-0.080301,-4.472571
"(1.595, 3.26]",47.0,2.787539,0.115021,-2.096223


These were equal-length buckets. To compute equal-size buckets based on sample quantiles, use `qcut`.

In [110]:
grouping = pd.qcut(frame.data1, 10, labels=False)

In [111]:
grouped = frame.data2.groupby(grouping)

In [112]:
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
0,100.0,2.187637,-0.103039,-2.131041
1,100.0,2.880006,-0.055354,-2.729932
2,100.0,3.197375,-0.054287,-2.468397
3,100.0,2.589869,0.015831,-2.295667
4,100.0,2.568551,-0.003762,-2.284145
5,100.0,3.161869,-0.161691,-3.77659
6,100.0,2.935506,-0.019543,-2.080056
7,100.0,2.806221,-0.123097,-3.181939
8,100.0,2.357783,-0.000699,-3.330223
9,100.0,2.787539,0.024903,-4.472571
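The difference between the two bucketing functions is easiest to see on skewed data. In this sketch the values are the squares of 0 to 99 (an arbitrary choice made for illustration): `cut` produces bins of equal width with very different counts, while `qcut` produces bins with equal counts.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: 0, 1, 4, 9, ..., 9801
values = pd.Series(np.arange(100) ** 2)

# Equal-width bins: each covers a quarter of the value range
width_counts = pd.cut(values, 4).value_counts(sort=False)

# Equal-size bins: each holds a quarter of the observations
quantile_counts = pd.qcut(values, 4).value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```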


### 3.3 Example: Filling Missing Values with Group-Specific Values