## **15. How to compute grouped statistics with Pandas group by**

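### group by first splits the data into groups, then applies aggregation or transformation functions to each group. It is similar to the GROUP BY clause of a SQL SELECT statement.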

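**Key topics in this chapter:**

1. Computing statistics on groups with aggregation functions
2. Iterating over group by results to understand the execution flow
3. Exploring a real dataset by grouping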

In [2]:
import pandas as pd
import numpy as np

In [3]:
# The following magic lets Jupyter Notebook display matplotlib charts inline
%matplotlib inline

In [230]:
df=pd.DataFrame({
        'A':['foo','bar','foo','bar','foo','bar','foo','foo'],
        'B':['one','one','two','three','two','two','one','three'],
        'C':np.random.randn(8),
        'D':np.random.randn(8)
})

In [231]:
df

,A,B,C,D
0,foo,one,1.607244,-0.484185
1,bar,one,0.939612,-0.163602
2,foo,two,-0.157027,-0.090537
3,bar,three,2.108325,0.209951
4,foo,two,-1.093747,0.282443
5,bar,two,-1.146026,-0.112424
6,foo,one,0.229108,-0.870564
7,foo,three,-1.138436,-1.446859


### **15.1 Computing statistics on groups with aggregation functions**

#### **15.1.1 Group by a single column and aggregate all data columns**

In [232]:
df.groupby('A').sum()

A,C,D
bar,1.901911,-0.066075
foo,-0.552859,-2.609702


In [233]:
df.groupby('A').sum(numeric_only=True)

A,C,D
bar,1.901911,-0.066075
foo,-0.552859,-2.609702


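As the output above shows:  

1. The grouping column A becomes the index of the result.
2. Column B is not numeric, so it is left out of the sum.
3. Without the `numeric_only` argument, the pandas version used here prints a warning. From pandas 2.0 on, `numeric_only` defaults to `False`, so pass `numeric_only=True` explicitly to keep non-numeric columns out of the result.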
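
The observations above can be checked on a tiny frame with fixed values. This is only a sketch: the frame `demo` and its numbers are made up for illustration and are not the random data used in this chapter.

```python
import pandas as pd

# Hypothetical data with fixed values so the result is reproducible
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# numeric_only=True drops the string column B before summing
result = demo.groupby('A').sum(numeric_only=True)

print(result.index.name)     # name of the index of the result
print(list(result.columns))  # columns that survived the aggregation
print(result)
```

Passing `numeric_only=True` explicitly keeps the behaviour the same across pandas versions.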

#### **15.1.2 Group by multiple columns and aggregate all data columns**

In [234]:
df.groupby(['A','B']).mean()

A,B,C,D
bar,one,0.939612,-0.163602
bar,three,2.108325,0.209951
bar,two,-1.146026,-0.112424
foo,one,0.918176,-0.677375
foo,three,-1.138436,-1.446859
foo,two,-0.625387,0.095953


**Note the composite (two-level) index in the result.**

In [235]:
df.groupby(['A','B'],as_index=False).mean()

,A,B,C,D
0,bar,one,0.939612,-0.163602
1,bar,three,2.108325,0.209951
2,bar,two,-1.146026,-0.112424
3,foo,one,0.918176,-0.677375
4,foo,three,-1.138436,-1.446859
5,foo,two,-0.625387,0.095953


**Note the index of this result: unlike with as_index=True, columns 'A' and 'B' do not become the index and stay as ordinary columns.**
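
As a rough illustration of the difference, the sketch below compares the two forms on a small made-up frame (`demo` and its values are assumptions for the example). With `as_index=False` the result should match what `reset_index()` produces from the indexed version.

```python
import pandas as pd

# Hypothetical data with fixed values
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Default: A and B form a MultiIndex on the result
indexed = demo.groupby(['A', 'B']).mean()
# as_index=False: A and B stay as ordinary columns
flat = demo.groupby(['A', 'B'], as_index=False).mean()

print(type(indexed.index).__name__)
print(list(flat.columns))
# Moving the index levels back into columns gives the same frame
print(indexed.reset_index().equals(flat))
```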

#### **15.1.3 Applying several aggregations to every column at once (using agg)**

In [236]:
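# A chained call: groupby first, then agg. The aggregation functions are passed to agg as a list.
# String names are used instead of NumPy callables, and C and D are selected explicitly so the non-numeric column B is left out.
df.groupby('A')[['C','D']].agg(['sum','mean','std'])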

,C,C,C,D,D,D
A,sum,mean,std,sum,mean,std
bar,1.901911,0.63397,1.648563,-0.066075,-0.022025,0.20252
foo,-0.552859,-0.110572,1.128225,-2.609702,-0.52194,0.672975


**The columns are now a multi-level (two-level) index.**

#### **15.1.4 Using different aggregation functions for different columns**

In [238]:
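# A powerful feature: each column gets its own list of aggregations, much like listing different aggregate functions per column in a SQL SELECT. Note that the argument is a dict here.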
df.groupby(['A','B']).agg({'C':['sum','mean'],'D':['min','max']})

,,C,C,D,D
A,B,sum,mean,min,max
bar,one,0.939612,0.939612,-0.163602,-0.163602
bar,three,2.108325,2.108325,0.209951,0.209951
bar,two,-1.146026,-1.146026,-0.112424,-0.112424
foo,one,1.836352,0.918176,-0.870564,-0.484185
foo,three,-1.138436,-1.138436,-1.446859,-1.446859
foo,two,-1.250774,-0.625387,-0.090537,0.282443


**A very powerful feature.**
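
The dict form produces two-level column labels. When flat column names are easier to work with, pandas' named aggregation is one alternative. The sketch below uses a made-up frame, and the output names `c_sum`, `c_mean`, `d_min`, `d_max` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data with fixed values
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Dict form: columns become a two-level MultiIndex such as ('C', 'sum')
nested = demo.groupby('A').agg({'C': ['sum', 'mean'], 'D': ['min', 'max']})

# Named aggregation: output_name=(input_column, function) gives flat names
named = demo.groupby('A').agg(
    c_sum=('C', 'sum'),
    c_mean=('C', 'mean'),
    d_min=('D', 'min'),
    d_max=('D', 'max'),
)

print(nested.columns.nlevels)
print(list(named.columns))
print(named)
```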

#### **15.1.5 Aggregating columns with different functions and renaming the index of the resulting DataFrame**

In [239]:
# Each keyword names a result row: row_label=(column, function name)
df1=df.agg(x=('C','max'),y=('D','min'),z=('B','max'))

In [240]:
df1.fillna(value='',inplace=True)

In [241]:
df1

,C,D,B
x,2.108325,,
y,,-1.446859,
z,,,two


#### **15.1.6 Using a custom aggregation function**

In [242]:
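# A hand-written sum, used to show that agg accepts user-defined functions
def fun1(x):
    r=0
    for i in x:
        r=r+i
    return r

# fun1 and the built-in 'sum' give identical results
df.groupby(['A','B']).agg({'C':[fun1,'sum'],'D':[fun1,'sum']})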

,,C,C,D,D
A,B,fun1,sum,fun1,sum
bar,one,0.939612,0.939612,-0.163602,-0.163602
bar,three,2.108325,2.108325,0.209951,0.209951
bar,two,-1.146026,-1.146026,-0.112424,-0.112424
foo,one,1.836352,1.836352,-1.354749,-1.354749
foo,three,-1.138436,-1.138436,-1.446859,-1.446859
foo,two,-1.250774,-1.250774,0.191907,0.191907


#### **15.1.7 Viewing statistics for a single column**

In [243]:
# Method 1: select the column before aggregating (better performance)
df.groupby('A')['C'].agg(['max','min','std'])

A,max,min,std
bar,2.108325,-1.146026,1.648563
foo,1.607244,-1.138436,1.128225


In [244]:
# Method 2: aggregate all columns first, then select the column
df.groupby(['A','B']).agg(['max','min','std'])['C']

A,B,max,min,std
bar,one,0.939612,0.939612,
bar,three,2.108325,2.108325,
bar,two,-1.146026,-1.146026,
foo,one,1.607244,0.229108,0.974489
foo,three,-1.138436,-1.138436,
foo,two,-0.157027,-1.093747,0.662361


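### **15.2 Iterating over groupby results to understand the execution flow**

**Approach: use a for loop to visit each group**

#### **15.2.1 Iterating over groups keyed by a single column**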

In [245]:
g=df.groupby('A')

In [246]:
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5e41fcd0>

**As shown, the result is a DataFrameGroupBy object.**

In [247]:
# Iterate with a for loop: each item is a (group name, sub-DataFrame) pair
for name,group in g:
    print(name)
    print(group)
    print()

bar
     A      B         C         D
1  bar    one  0.939612 -0.163602
3  bar  three  2.108325  0.209951
5  bar    two -1.146026 -0.112424

foo
     A      B         C         D
0  foo    one  1.607244 -0.484185
2  foo    two -0.157027 -0.090537
4  foo    two -1.093747  0.282443
6  foo    one  0.229108 -0.870564
7  foo  three -1.138436 -1.446859



**A single group's data can be retrieved on its own.**

In [248]:
# Retrieve the rows of one group
g.get_group('bar')

,A,B,C,D
1,bar,one,0.939612,-0.163602
3,bar,three,2.108325,0.209951
5,bar,two,-1.146026,-0.112424
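
To make the split, apply, combine flow concrete, the sketch below reproduces a grouped sum by hand with a for loop and compares it with the built-in result. The frame `demo` and its values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data with fixed values
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Split: the loop yields (name, sub-frame) pairs
# Apply: sum column C of each sub-frame
# Combine: collect the per-group results into one Series
manual = {}
for name, group in demo.groupby('A'):
    manual[name] = group['C'].sum()
manual = pd.Series(manual, name='C')

# The built-in aggregation performs the same three steps internally
builtin = demo.groupby('A')['C'].sum()

print(manual)
print(builtin)
```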


#### **15.2.2 Iterating over groups keyed by multiple columns**

In [249]:
g=df.groupby(['A','B'])

In [250]:
for name,group in g:
    print(name)
    print(group)
    print()

('bar', 'one')
     A    B         C         D
1  bar  one  0.939612 -0.163602

('bar', 'three')
     A      B         C         D
3  bar  three  2.108325  0.209951

('bar', 'two')
     A    B         C         D
5  bar  two -1.146026 -0.112424

('foo', 'one')
     A    B         C         D
0  foo  one  1.607244 -0.484185
6  foo  one  0.229108 -0.870564

('foo', 'three')
     A      B         C         D
7  foo  three -1.138436 -1.446859

('foo', 'two')
     A    B         C         D
2  foo  two -0.157027 -0.090537
4  foo  two -1.093747  0.282443



**Here name is a 2-tuple, with one element per grouping column.**

In [251]:
# Retrieve the rows of one group; the key is a tuple
g.get_group(('foo','two'))

,A,B,C,D
2,foo,two,-0.157027,-0.090537
4,foo,two,-1.093747,0.282443


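### **You can select one or several columns directly from a groupby result, which yields a SeriesGroupBy or a DataFrameGroupBy over just those columns**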

In [252]:
g['C']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f5e495cd0>

**This is a SeriesGroupBy object.**

In [253]:
# A SeriesGroupBy object can also be iterated with a for loop
for name,group in g['C']:
    print(name)
    print(group)
    print()
    print(type(name))
    print(type(group))
    print('----------------------------------------------------')

('bar', 'one')
1    0.939612
Name: C, dtype: float64

<class 'tuple'>
<class 'pandas.core.series.Series'>
----------------------------------------------------
('bar', 'three')
3    2.108325
Name: C, dtype: float64

<class 'tuple'>
<class 'pandas.core.series.Series'>
----------------------------------------------------
('bar', 'two')
5   -1.146026
Name: C, dtype: float64

<class 'tuple'>
<class 'pandas.core.series.Series'>
----------------------------------------------------
('foo', 'one')
0    1.607244
6    0.229108
Name: C, dtype: float64

<class 'tuple'>
<class 'pandas.core.series.Series'>
----------------------------------------------------
('foo', 'three')
7   -1.138436
Name: C, dtype: float64

<class 'tuple'>
<class 'pandas.core.series.Series'>
----------------------------------------------------
('foo', 'two')
2   -0.157027
4   -1.093747
Name: C, dtype: float64

<class 'tuple'>
<class 'pandas.core.series.Series'>
----------------------------------------------------


**Here name is a tuple and group is a Series.**

### **Every aggregation is ultimately applied to a DataFrame or a Series**
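
One way to see this is to record what pandas hands to a custom aggregation function. The sketch below is illustrative only: the frame `demo` and the helper `value_range` are made up, and it covers just the SeriesGroupBy case.

```python
import pandas as pd

# Hypothetical data with fixed values
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0],
})

seen_types = []

def value_range(values):
    # Record the type of the object pandas passes in for each group,
    # then return the difference between its largest and smallest value
    seen_types.append(type(values).__name__)
    return values.max() - values.min()

ranges = demo.groupby('A')['C'].agg(value_range)

print(set(seen_types))
print(ranges)
```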

### **15.3 Exploring a real dataset by grouping**

In [254]:
# Specify the dtype of each column while reading the file, so that code-like columns stay strings
df=pd.read_csv('gy202302.csv',dtype={'NF':np.str_,'YF':np.str_,'LEV':np.str_,'A07_2017':np.str_,
                                     'A18':np.str_,'A19':np.str_,'A09':np.str_,'A10':np.str_,
                                     'A341':np.str_,'A342':np.str_,'A343':np.str_,'B93':np.str_,
                                     'DZX':np.str_,'A14':np.str_,'A052':np.str_})

In [255]:
df.set_index(['NF','YF'],inplace=True,drop=False)

In [256]:
df.loc[df['A01']=='057686533',['A01','DZX']]

NF,YF,A01,DZX
2023,2,57686533,3


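### **1. Find the highest and lowest enterprise output value in each subdistrict/town, and the average output value of each subdistrict/town**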

In [275]:
g1=df.groupby(df['LEV'].str[6:9])

In [276]:
g1.get_group('501')[['H02','LEV','N2000_2']].head(2)

NF,YF,H02,LEV,N2000_2
2023,2,上海壹徕科技股份有限公司,310112501804,6627
2023,2,上海玖广兴塑胶制品有限公司,310112501803,4266
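
Grouping by a key derived from a column, as with `df['LEV'].str[6:9]`, can be sketched on a few made-up rows. The `LEV` codes and `N2000_2` values below are illustrative assumptions, as is the reading that characters 6 to 8 of the code identify the subdistrict/town; the name `town` is also chosen just for the example.

```python
import pandas as pd

# Hypothetical rows; the real dataset has many more columns
firms = pd.DataFrame({
    'LEV': ['310112501804', '310112501803', '310112502001', '310112502002'],
    'N2000_2': [6627, 4266, 1000, 3000],
})

# Derive the grouping key from a slice of the code and give it a readable name
town_code = firms['LEV'].str[6:9].rename('town')

# Highest, lowest and average value per derived key
stats = firms.groupby(town_code)['N2000_2'].agg(['max', 'min', 'mean'])
print(stats)
```

The key passed to `groupby` does not have to be a column of the frame. Any Series aligned with the frame's index works.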


In [289]:
# TODO: apply a custom function to the rows and columns of a DataFrame, or to the elements of a Series
ser1 = pd.Series(np.random.randint(-10,10,5),index=list('abcde'))
ser1

a    -6
b    -9
c     9
d     4
e   -10
dtype: int64

In [294]:
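# TODO: define a function that computes max minus min, the absolute value, and the square
def func(x):
    # Series.apply calls func once per element, so x is a scalar and max - min is always 0
    num = np.max(x) - np.min(x)
    print(num)
    a = abs(x)   # computed but not returned
    b = x**2
    return b

print(ser1.apply(func))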

0
0
0
0
0
a     36
b     81
c     81
d     16
e    100
dtype: int64
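
The zeros printed above come from `Series.apply` calling the function once per element. A short sketch, reusing the values of `ser1` shown in the output above, contrasts that with the whole-Series operations that give the sum, range, absolute values and squares directly (the variable names are chosen for the example):

```python
import numpy as np
import pandas as pd

# Same values as ser1 above, fixed here so the sketch is reproducible
ser = pd.Series([-6, -9, 9, 4, -10], index=list('abcde'))

# Whole-Series operations: no apply needed
total = ser.sum()
value_range = ser.max() - ser.min()
absolute = ser.abs()
squared = ser ** 2

# Element-wise apply: each call sees a single scalar,
# so "max minus min" of that one value is always 0
per_element_range = ser.apply(lambda x: np.max(x) - np.min(x))

print(total, value_range)
print(absolute.tolist())
print(squared.tolist())
print(per_element_range.tolist())
```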


In [288]:
df1 = pd.DataFrame(np.random.randint(-10,10,(4,5)),index=list('ACBD'),columns=list('abcde'))
df1

,a,b,c,d,e
A,-3,0,-5,3,8
C,-2,0,-5,-10,4
B,1,4,9,-1,4
D,9,-1,-6,9,8


In [298]:
def func1(x):
    # With axis=1, x is one row of df1 passed in as a Series
    num = np.max(x) - np.min(x)   # range of the row
    a = abs(x)                    # element-wise absolute values
    b = x**2                      # element-wise squares
    return a, b, num
print(df1.apply(func1,axis = 1))

A       ([3, 0, 5, 3, 8], [9, 0, 25, 9, 64], 13)
C    ([2, 0, 5, 10, 4], [4, 0, 25, 100, 16], 14)
B      ([1, 4, 9, 1, 4], [1, 16, 81, 1, 16], 10)
D     ([9, 1, 6, 9, 8], [81, 1, 36, 81, 64], 15)
dtype: object


In [None]:
# TODO: the same operations with lambdas: square, max minus min, absolute value
print(df1.apply(lambda x:x**2,axis=1))
print('------')
print(df1.apply(lambda x:np.max(x)-np.min(x),axis=1))
print('---------')
print(df1.apply(lambda x:abs(x),axis=1))
# Using applymap
# applymap works on each element, so no axis is needed (pandas 2.1+ renames it to DataFrame.map)
print(df1.applymap(lambda x:x**2))
print('---------')
print(df1.applymap(lambda x:abs(x)))
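
The difference between row-wise, column-wise and element-wise application can be sketched with the `df1` values shown earlier. The `hasattr` check is an assumption made here so the sketch runs on both older and newer pandas versions.

```python
import pandas as pd

# Same values as the df1 output shown earlier in this section
df1 = pd.DataFrame(
    [[-3, 0, -5, 3, 8],
     [-2, 0, -5, -10, 4],
     [1, 4, 9, -1, 4],
     [9, -1, -6, 9, 8]],
    index=list('ACBD'), columns=list('abcde'),
)

# axis=1: the lambda receives one row at a time
row_range = df1.apply(lambda row: row.max() - row.min(), axis=1)
# axis=0 (the default): the lambda receives one column at a time
col_range = df1.apply(lambda col: col.max() - col.min(), axis=0)

# Element-wise: DataFrame.map on pandas 2.1+, applymap on older versions
elementwise = df1.map if hasattr(df1, 'map') else df1.applymap
squared = elementwise(lambda v: v ** 2)

print(row_range.tolist())
print(col_range.tolist())
print(squared)
```

For simple arithmetic such as squaring, the vectorized form `df1 ** 2` gives the same values without calling a Python function per element.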

In [285]:
# agg expects callables (or function names), not the result of calling a function.
# The original cell passed max1('N2000_2','H02'), i.e. the return value 0, which raised
# TypeError: 'int' object is not callable.
def max1(x):
    # x is the N2000_2 Series of one group
    return x.max()
g1.agg({'N2000_2':[max1,'min','mean']})
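
If the intent behind a two-column function such as `max1` is to report which enterprise (`H02`) has the largest `N2000_2` in each group, one possible approach is `idxmax`. This is a sketch under assumptions: the rows, the `town` column and the firm names below are made up, and the frame has a unique index. The real `df` is indexed by `['NF','YF']`, so it would likely need `reset_index(drop=True)` first.

```python
import pandas as pd

# Hypothetical rows with a unique default index
firms = pd.DataFrame({
    'town': ['501', '501', '502', '502'],
    'H02': ['Firm A', 'Firm B', 'Firm C', 'Firm D'],
    'N2000_2': [6627, 4266, 1000, 3000],
})

# idxmax returns, per group, the index label of the row with the largest value
top_index = firms.groupby('town')['N2000_2'].idxmax()

# Use those labels to pull the full rows, including the enterprise name
top_rows = firms.loc[top_index, ['town', 'H02', 'N2000_2']]
print(top_rows)
```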