## **15. How to compute grouped statistics with Pandas group by**

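### group by first splits the data into groups, then applies aggregation or transformation functions to each group. It is similar to the GROUP BY clause of a SQL SELECT statement.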

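**Key points in this chapter:**

1. Group the data and compute statistics with aggregation functions
2. Iterate over the group by result to understand how it executes
3. Explore data with a worked grouping example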
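
The split, apply, combine flow described above can be sketched by hand. The frame below is toy data with fixed values chosen for illustration (not the random frame used in this chapter), and the dictionary comprehension is only a rough stand-in for what groupby does, not how pandas implements it.

```python
import pandas as pd

# Toy data with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Split: pick the rows for each distinct value of A.
# Apply: reduce each subset's C column to a single number.
manual = {key: float(df.loc[df['A'] == key, 'C'].sum())
          for key in df['A'].unique()}

# groupby performs split, apply and combine in one call.
grouped = df.groupby('A')['C'].sum()

print(manual)                      # {'foo': 4.0, 'bar': 6.0}
print(manual == grouped.to_dict()) # True
```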

In [1]:
import pandas as pd
import numpy as np

In [3]:
# The following magic command lets Jupyter Notebook display matplotlib charts inline
%matplotlib inline

In [4]:
df=pd.DataFrame({
        'A':['foo','bar','foo','bar','foo','bar','foo','foo'],
        'B':['one','one','two','three','two','two','one','three'],
        'C':np.random.randn(8),
        'D':np.random.randn(8)
})

In [5]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | 1.740608 | -0.385986 |
| 1 | bar | one | -0.36757 | 0.225512 |
| 2 | foo | two | 0.460869 | -0.473574 |
| 3 | bar | three | -0.270861 | 0.103749 |
| 4 | foo | two | -1.01254 | 0.544438 |
| 5 | bar | two | 0.983596 | 0.885655 |
| 6 | foo | one | -0.070675 | -1.304106 |
| 7 | foo | three | -0.162457 | 0.227928 |


### **15.1 Computing statistics on groups with aggregation functions**

#### **15.1.1 Group by a single column and aggregate all data columns**

In [7]:
df.groupby('A').sum()


| A | C | D |
|---|---|---|
| bar | 0.345165 | 1.214916 |
| foo | 0.955805 | -1.3913 |


In [13]:
df.groupby('A').sum(numeric_only=True)

| A | C | D |
|---|---|---|
| bar | 0.345165 | 1.214916 |
| foo | 0.955805 | -1.3913 |


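From the output above:

1. The grouping column A becomes the index of the result.
2. Column B is not numeric, so it is left out of the sums.
3. On pandas 1.5.x, omitting numeric_only emits a FutureWarning and still drops B. From pandas 2.0 the default is numeric_only=False, so sum() concatenates the strings in B instead of dropping it; pass numeric_only=True to get the result shown here.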
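
One way to avoid depending on the numeric_only default is to select the numeric columns before aggregating. The following is a minimal sketch on toy data (the values are made up for illustration); it also confirms that the grouping column ends up as the index.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Select the numeric columns explicitly, then aggregate.
totals = df.groupby('A')[['C', 'D']].sum()

# Equivalent result by asking pandas to skip non-numeric columns.
explicit = df.groupby('A').sum(numeric_only=True)

print(totals.index.name)        # A
print(list(totals.columns))     # ['C', 'D']
print(totals.equals(explicit))  # True
```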

#### **15.1.2 Group by multiple columns and aggregate all data columns**

In [15]:
df.groupby(['A','B']).mean()

| A | B | C | D |
|---|---|---|---|
| bar | one | -0.36757 | 0.225512 |
| bar | three | -0.270861 | 0.103749 |
| bar | two | 0.983596 | 0.885655 |
| foo | one | 0.834967 | -0.845046 |
| foo | three | -0.162457 | 0.227928 |
| foo | two | -0.275836 | 0.035432 |


**Note the composite (two-level) index in the result**
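
A two-level index like this can be addressed with a tuple for a single group or with just the first-level key for a whole block. A small sketch on toy data (fixed values chosen for illustration, not the random frame above):

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'foo'],
    'B': ['one', 'one', 'one', 'two', 'two'],
    'C': [1.0, 3.0, 2.0, 4.0, 5.0],
})

means = df.groupby(['A', 'B']).mean()

# Both grouping columns become levels of the row index.
print(list(means.index.names))             # ['A', 'B']

# A tuple selects one (A, B) group.
print(means.loc[('foo', 'one'), 'C'])      # 2.0

# A first-level key selects every B group under that A value.
print(means.loc['foo']['C'].tolist())      # [2.0, 5.0]
```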

In [17]:
df.groupby(['A','B'],as_index=False).mean()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | bar | one | -0.36757 | 0.225512 |
| 1 | bar | three | -0.270861 | 0.103749 |
| 2 | bar | two | 0.983596 | 0.885655 |
| 3 | foo | one | 0.834967 | -0.845046 |
| 4 | foo | three | -0.162457 | 0.227928 |
| 5 | foo | two | -0.275836 | 0.035432 |


**Note the index of this result: unlike with as_index=True, columns 'A' and 'B' do not become the index**
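
As a rough illustration, as_index=False should give the same frame as aggregating with the default index and then calling reset_index(). The data below is a toy frame with fixed values, assumed only for this sketch.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'foo'],
    'B': ['one', 'one', 'one', 'two', 'two'],
    'C': [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Grouping keys stay as ordinary columns.
flat = df.groupby(['A', 'B'], as_index=False).mean()

# Grouping keys become the index, then are moved back into columns.
same = df.groupby(['A', 'B']).mean().reset_index()

print(list(flat.columns))   # ['A', 'B', 'C']
print(flat['C'].tolist())   # [2.0, 4.0, 2.0, 5.0]
print(flat.equals(same))    # True
```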

#### **15.1.3 Viewing several statistics at once**

In [19]:
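# This is a chained operation: groupby first, then agg.
# String names replace np.sum/np.mean/np.std (same results), and C and D are
# selected explicitly so the text column B is not aggregated.
df.groupby('A')[['C','D']].agg(['sum','mean','std'])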


| A | (C, sum) | (C, mean) | (C, std) | (D, sum) | (D, mean) | (D, std) |
|---|---|---|---|---|---|---|
| bar | 0.345165 | 0.115055 | 0.753731 | 1.214916 | 0.404972 | 0.420712 |
| foo | 0.955805 | 0.191161 | 1.0144 | -1.3913 | -0.27826 | 0.713297 |


**The columns are now a multi-level index**
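
Multi-level columns are addressed with tuples, and they can be flattened into plain names when that is more convenient. The sketch below uses toy data with fixed values; joining the levels with an underscore is just one possible naming choice.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

stats = df.groupby('A')[['C', 'D']].agg(['sum', 'mean'])

# Each column label is a (column, function) tuple.
print(stats.columns.tolist())
# [('C', 'sum'), ('C', 'mean'), ('D', 'sum'), ('D', 'mean')]

# Select one statistic with a tuple key.
print(stats.loc['foo', ('C', 'mean')])   # 2.0

# Flatten the two levels into single names such as 'C_sum'.
stats.columns = ['_'.join(col) for col in stats.columns]
print(list(stats.columns))   # ['C_sum', 'C_mean', 'D_sum', 'D_mean']
```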

#### **15.1.4 Using different aggregation functions for each column**

In [23]:
# A powerful feature: each column gets its own list of aggregation functions in one call
df.groupby(['A','B']).agg({'C':['sum','mean'],'D':['min','max']})

| A | B | (C, sum) | (C, mean) | (D, min) | (D, max) |
|---|---|---|---|---|---|
| bar | one | -0.36757 | -0.36757 | 0.225512 | 0.225512 |
| bar | three | -0.270861 | -0.270861 | 0.103749 | 0.103749 |
| bar | two | 0.983596 | 0.983596 | 0.885655 | 0.885655 |
| foo | one | 1.669933 | 0.834967 | -1.304106 | -0.385986 |
| foo | three | -0.162457 | -0.162457 | 0.227928 | 0.227928 |
| foo | two | -0.551671 | -0.275836 | -0.473574 | 0.544438 |


**A very powerful feature!**
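
The shape of the dictionary values decides whether the result columns are flat or multi-level. A minimal sketch on toy data (fixed values assumed for illustration): a single function name per column keeps plain column labels, while a list of names produces tuple labels.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# One function per column: flat column labels.
flat = df.groupby('A').agg({'C': 'sum', 'D': 'max'})
print(list(flat.columns))        # ['C', 'D']

# A list per column: multi-level column labels.
nested = df.groupby('A').agg({'C': ['sum'], 'D': ['max']})
print(nested.columns.tolist())   # [('C', 'sum'), ('D', 'max')]
```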

#### **15.1.5 Applying different functions to columns and renaming the index of the resulting DataFrame**

In [32]:
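# Each keyword names a result row: x is the max of C, y the min of D, z the max of B.
# String names are used because newer pandas versions warn when Python builtins are passed.
df1=df.agg(x=('C','max'),y=('D','min'),z=('B','max'))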

In [36]:
df1.fillna(value='',inplace=True)

In [37]:
df1

| | C | D | B |
|---|---|---|---|
| x | 1.740608 | | |
| y | | -1.304106 | |
| z | | | two |
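
The same keyword=(column, function) form also works on a grouped frame, where each keyword names an output column instead of a row. The sketch below uses toy data, and the output names c_total and d_max are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: output_name=(input_column, function_name).
summary = df.groupby('A').agg(
    c_total=('C', 'sum'),
    d_max=('D', 'max'),
)

print(list(summary.columns))          # ['c_total', 'd_max']
print(summary.loc['foo', 'c_total'])  # 4.0
print(summary.loc['foo', 'd_max'])    # 30.0
```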


#### **15.1.6 Using a custom aggregation function**

In [53]:
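def fun1(x):
    # Custom aggregation: add up the values of one group, one at a time
    r=0
    for i in x:
        r=r+i
    return r

# Compare the custom function with the built-in 'sum' aggregation
df.groupby(['A','B']).agg({'C':[fun1,'sum'],'D':[fun1,'sum']})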

| A | B | (C, fun1) | (C, sum) | (D, fun1) | (D, sum) |
|---|---|---|---|---|---|
| bar | one | -0.36757 | -0.36757 | 0.225512 | 0.225512 |
| bar | three | -0.270861 | -0.270861 | 0.103749 | 0.103749 |
| bar | two | 0.983596 | 0.983596 | 0.885655 | 0.885655 |
| foo | one | 1.669933 | 1.669933 | -1.690092 | -1.690092 |
| foo | three | -0.162457 | -0.162457 | 0.227928 | 0.227928 |
| foo | two | -0.551671 | -0.551671 | 0.070864 | 0.070864 |
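
A custom function is more useful when it computes something that has no built-in name. As a sketch, the hypothetical helper value_range below returns the spread (max minus min) of each group; the data is a toy frame with fixed values.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 5.0, 4.0],
})

def value_range(s):
    """Hypothetical helper: spread of a group's values (max minus min)."""
    return s.max() - s.min()

# Mix a built-in aggregation name with the custom function.
result = df.groupby('A')['C'].agg(['sum', value_range])

# The custom column is labelled with the function's name.
print(list(result.columns))              # ['sum', 'value_range']
print(result.loc['foo', 'value_range'])  # 4.0
print(result.loc['bar', 'value_range'])  # 2.0
```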


#### **15.1.7 Viewing the statistics of a single column**

In [61]:
# Method 1: select the column first, then aggregate (better performance)
df.groupby('A')['C'].agg(['max','min','std'])

| A | max | min | std |
|---|---|---|---|
| bar | 0.983596 | -0.36757 | 0.753731 |
| foo | 1.740608 | -1.01254 | 1.0144 |


In [65]:
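# Method 2: aggregate all columns first, then select the column
# (this cell groups by both A and B, so its rows differ from method 1)
df.groupby(['A','B']).agg(['max','min','std'])['C']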

| A | B | max | min | std |
|---|---|---|---|---|
| bar | one | -0.36757 | -0.36757 | NaN |
| bar | three | -0.270861 | -0.270861 | NaN |
| bar | two | 0.983596 | 0.983596 | NaN |
| foo | one | 1.740608 | -0.070675 | 1.28077 |
| foo | three | -0.162457 | -0.162457 | NaN |
| foo | two | 0.460869 | -1.01254 | 1.041857 |
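
When both methods use the same grouping key they should return the same table; the first simply avoids computing statistics for columns that are thrown away. A small check on toy data (fixed values assumed for illustration):

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Method 1: select column C, then aggregate only that column.
m1 = df.groupby('A')['C'].agg(['max', 'min'])

# Method 2: aggregate C and D, then keep the C block of the result.
m2 = df.groupby('A')[['C', 'D']].agg(['max', 'min'])['C']

print(list(m1.columns))   # ['max', 'min']
print(m1.equals(m2))      # True
```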


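### **15.2 Iterating over the groupby result to understand how it executes**

**A for loop can iterate over each group**

#### **15.2.1 Iterating over groups from a single-column group by**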

In [66]:
g=df.groupby('A')

In [68]:
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f58b543a0>

**The result is a DataFrameGroupBy object**

In [69]:
for name,group in g:
    print(name)
    print(group)
    print()

bar
     A      B         C         D
1  bar    one -0.367570  0.225512
3  bar  three -0.270861  0.103749
5  bar    two  0.983596  0.885655

foo
     A      B         C         D
0  foo    one  1.740608 -0.385986
2  foo    two  0.460869 -0.473574
4  foo    two -1.012540  0.544438
6  foo    one -0.070675 -1.304106
7  foo  three -0.162457  0.227928



**A single group can be retrieved on its own**

In [70]:
g.get_group('bar')

| | A | B | C | D |
|---|---|---|---|---|
| 1 | bar | one | -0.36757 | 0.225512 |
| 3 | bar | three | -0.270861 | 0.103749 |
| 5 | bar | two | 0.983596 | 0.885655 |
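
Besides get_group, the grouped object exposes a few attributes that help when inspecting how the rows were split. The sketch below uses toy data with fixed values and checks that each group seen during iteration matches what get_group returns.

```python
import pandas as pd

# Toy frame with fixed values (an assumption for illustration only).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0],
})

g = df.groupby('A')

print(g.ngroups)            # 2
print(sorted(g.groups))     # ['bar', 'foo']
print(g.size().tolist())    # [2, 3]

# get_group keeps the original row labels of the selected group.
print(g.get_group('bar').index.tolist())   # [1, 3]

# Every (name, group) pair from iteration matches get_group(name).
same = all(group.equals(g.get_group(name)) for name, group in g)
print(same)                 # True
```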


#### **15.2.2 Iterating over groups from a multi-column group by**

In [75]:
g=df.groupby(['A','B'])