# 4.1 GroupBy Techniques

## 4.1.1 GroupBy
**Grouping, at its core, splits the data into smaller datasets according to a key**

![GroupBy]()
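
The split step described above can be sketched by hand and compared with `groupby`. This is only an illustration: the `toy` frame and its column names are made up for the example and do not appear elsewhere in this section.

```python
import pandas as pd

# Hypothetical toy data, used only for this sketch.
toy = pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# Split: build one sub-frame per distinct key with a boolean mask.
pieces = {k: toy[toy["key"] == k] for k in toy["key"].unique()}

# Apply and combine: reduce each piece, then collect the results.
manual = pd.Series({k: piece["value"].mean() for k, piece in pieces.items()})

# groupby performs the same split, apply and combine in one call.
builtin = toy.groupby("key")["value"].mean()
```

Both `manual` and `builtin` hold one mean per key; `groupby` simply avoids writing the loop.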

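**Grouping keys**  
1. A list or array whose length matches the axis being grouped
2. A value naming one of the DataFrame's columns
3. A dict or Series that maps the values on the grouped axis to group names
4. A function that is applied to the axis index, or to each label in the index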
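
A minimal sketch of the four key types listed above, side by side. The `people` frame, the `mapping` dict, and all values are hypothetical and chosen only so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
people = pd.DataFrame(
    {"team": ["red", "blue", "red", "blue"], "score": [1.0, 2.0, 3.0, 4.0]},
    index=["Joe", "Steve", "Wes", "Travis"],
)

# 1. An array with one entry per row.
by_array = people.groupby(np.array([2005, 2005, 2006, 2006]))["score"].sum()

# 2. A column name.
by_column = people.groupby("team")["score"].sum()

# 3. A dict mapping index labels to group names.
mapping = {"Joe": "staff", "Steve": "guest", "Wes": "staff", "Travis": "guest"}
by_dict = people.groupby(mapping)["score"].sum()

# 4. A function called on each index label; here the length of the name.
by_func = people.groupby(len)["score"].sum()
```

With the function key, `Joe` and `Wes` land in the same group because both labels have length 3.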

In [19]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from numpy import random, array

In [14]:
df = DataFrame(data={
    'key1': list('aabba'),
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': random.randn(5),
    'data2': random.randn(5)
})
df

  key1 key2     data1     data2
0    a  one -1.552385  1.166627
1    a  two  0.593793  1.489591
2    b  one -1.462639  0.849348
3    b  two -0.230402  0.804105
4    a  one  0.356141 -1.058369


In [15]:
# key2 holds strings, so restrict the mean to the numeric columns
df.groupby('key1').mean(numeric_only=True)

         data1     data2
key1                    
a    -0.200817  0.532616
b    -0.846521  0.826726


In [34]:
df.groupby(['key1', 'key2'])['data1'].mean()

key1  key2
a     one    -0.598122
      two     0.593793
b     one    -1.462639
      two    -0.230402
Name: data1, dtype: float64

In [41]:
df.groupby(['key1', 'key2'])['data1'].size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
Name: data1, dtype: int64

In [39]:
df.groupby([df['key1'], df['key2']])['data1'].mean()

key1  key2
a     one    -0.598122
      two     0.593793
b     one    -1.462639
      two    -0.230402
Name: data1, dtype: float64

In [37]:
states = array(['Ohio', 'Ohio', 'California', 'California', 'Ohio'])
years = array([2005, 2006, 2005, 2006, 2005])

df.groupby([states, years])['data1'].mean()

California  2005   -1.462639
            2006   -0.230402
Ohio        2005   -0.598122
            2006    0.593793
Name: data1, dtype: float64

## 4.1.2 Iterating over Groups

In [42]:
for key, group in df.groupby('key1'):
    print(key)
    print(group)

a
  key1 key2     data1     data2
0    a  one -1.552385  1.166627
1    a  two  0.593793  1.489591
4    a  one  0.356141 -1.058369
b
  key1 key2     data1     data2
2    b  one -1.462639  0.849348
3    b  two -0.230402  0.804105


In [44]:
for (key1, key2), group in df.groupby(['key1', 'key2']):
    print(key1, key2)
    print(group)

a one
  key1 key2     data1     data2
0    a  one -1.552385  1.166627
4    a  one  0.356141 -1.058369
a two
  key1 key2     data1     data2
1    a  two  0.593793  1.489591
b one
  key1 key2     data1     data2
2    b  one -1.462639  0.849348
b two
  key1 key2     data1     data2
3    b  two -0.230402  0.804105


In [47]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

  key1 key2     data1     data2
2    b  one -1.462639  0.849348
3    b  two -0.230402  0.804105


In [51]:
pieces = dict(list(df.groupby(['key1', 'key2'])))
pieces[('a', 'two')]

  key1 key2     data1     data2
1    a  two  0.593793  1.489591
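
When only one group is needed, `get_group` on the GroupBy object fetches it directly, without building a dict of every piece first. A small sketch, using a hypothetical `frame` with fixed values in place of the random `df` above:

```python
import pandas as pd

# Hypothetical data with fixed values so the result is predictable.
frame = pd.DataFrame({
    "key1": list("aabba"),
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Single key: pass the group label.
single = frame.groupby("key1").get_group("b")

# Multiple keys: pass a tuple with one label per key.
pair = frame.groupby(["key1", "key2"]).get_group(("a", "two"))
```

Here `single` contains rows 2 and 3, and `pair` contains row 1, matching what the dict lookups above return for the same keys.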


In [73]:
grouped = df.groupby(df.dtypes, axis=1)
pieces = dict(list(grouped))
pieces.keys()

dict_keys([dtype('float64'), dtype('O')])
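
Recent pandas releases deprecate the `axis` argument of `groupby` (to my knowledge starting with 2.1), so the `axis=1` call above may warn or fail depending on the installed version. One possible alternative, sketched here with a hypothetical `frame`, collects the column names per dtype in plain Python and then selects the columns:

```python
import pandas as pd

# Hypothetical data standing in for df above.
frame = pd.DataFrame({
    "key1": list("aabba"),
    "data1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "data2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Collect the column names under their dtype.
columns_by_dtype = {}
for column, dtype in frame.dtypes.items():
    columns_by_dtype.setdefault(dtype, []).append(column)

# One sub-frame per dtype, analogous to dict(list(grouped)) above.
pieces = {dtype: frame[cols] for dtype, cols in columns_by_dtype.items()}
```

The dtype reported for the string column can differ between pandas versions, so it is safer to look pieces up with a dtype taken from the frame itself, for example `pieces[frame["data1"].dtype]`.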