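Demonstrating pandas grouping operations with stock data

A groupby operation runs in three stages: (1) split the data into groups, (2) apply a function to each group, (3) combine the per-group results.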

![image.png](attachment:image.png)
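
The three stages can be traced by hand on a small table. The sketch below uses a hypothetical toy DataFrame (its values are made up and do not come from the stock file) and compares the manual steps with the one-line groupby call.

```python
import pandas as pd

# Toy data with made-up values, used only to illustrate the three stages
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'B', 'B', 'B'],
    'marketValue': [10.0, 30.0, 5.0, 15.0, 40.0],
})

# (1) Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
pieces = {key: part for key, part in toy.groupby('ticker')}

# (2) Apply: run a function on each piece
sums = {key: part['marketValue'].sum() for key, part in pieces.items()}

# (3) Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='marketValue').rename_axis('ticker')

# The one-liner performs the same three stages internally
direct = toy.groupby('ticker')['marketValue'].sum()
print(manual)
print(direct)
```

Both Series hold one total per ticker, so the manual version is only useful for seeing what groupby does on your behalf.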

In [1]:
import pandas as pd

In [3]:
# Load the data; read both code columns as strings
df = pd.read_csv('./srcdata/stocks_20170930.csv', dtype={'ticker': str, 'holdingTicker':str,}, encoding='GBK')

In [4]:
"""
ticker: fund code
holdingTicker: stock code of one of the fund's top-10 holdings
marketValue: market value of the stock holding
industryName1: industry the stock belongs to
"""
df = df[['ticker', 'holdingTicker', 'marketValue', 'industryName1']]
df.head()

,ticker,holdingTicker,marketValue,industryName1
0,1,2236,237401500.0,电子
1,1,568,216279100.0,食品饮料
2,1,300156,160320000.0,公用事业
3,1,603799,153880000.0,有色金属
4,1,600056,151963800.0,医药生物


1. Counting after grouping: number of stocks held by each fund

In [6]:
# Count the non-null values in every column of df after grouping by ticker
df.groupby('ticker').count().tail()

ticker,holdingTicker,marketValue,industryName1
740101,10,10,10
750001,10,10,10
750005,10,10,10
762001,10,10,10
770001,10,10,10


In [8]:
# Count only the holdingTicker column after grouping by ticker
df[['holdingTicker']].groupby(df['ticker']).count().tail()

ticker,holdingTicker
740101,10
750001,10
750005,10
762001,10
770001,10
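
Note that `count()` reports non-null values per column, while `size()` reports rows per group. The following sketch, on a hypothetical toy frame with one missing market value, shows where the two differ.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; one marketValue is deliberately missing
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B'],
    'holdingTicker': ['x1', 'x2', 'x3', 'y1', 'y2'],
    'marketValue': [10.0, np.nan, 30.0, 5.0, 15.0],
})

grouped = toy.groupby('ticker')

# count(): one value per column, NaN entries are not counted
non_null_counts = grouped.count()

# size(): one value per group, every row is counted
row_counts = grouped.size()

print(non_null_counts)
print(row_counts)
```

If the goal is simply "how many holdings does each fund have", `size()` is usually the safer choice because it does not depend on which columns contain missing data.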


2. Sorting by a statistic after grouping: number of fund holdings in each industry

In [12]:
# Count the holdings in each industry and list the largest industries first
df[['holdingTicker']].groupby(df['industryName1']).count().sort_values('holdingTicker', ascending=False).head()

industryName1,holdingTicker
银行,4666
电子,3850
食品饮料,3539
非银金融,3306
医药生物,2950


In [13]:
# Sum the market value of each fund's listed holdings and list the largest funds first
df[['marketValue']].groupby(df['ticker']).sum().sort_values('marketValue', ascending=False).head()

ticker,marketValue
510050,18185640000.0
1683,9222674000.0
1772,7983531000.0
150201,7650096000.0
150200,7650096000.0
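
For the common "count per category, largest first" pattern, `Series.value_counts()` gives the same ranking in one call. This is a minimal sketch on hypothetical toy data (the industry labels and stock codes are invented for illustration); it assumes neither column has missing values.

```python
import pandas as pd

# Toy data with made-up stock codes and industry labels
toy = pd.DataFrame({
    'holdingTicker': ['s1', 's2', 's3', 's4', 's5', 's6'],
    'industryName1': ['Banks', 'Electronics', 'Banks',
                      'Banks', 'Electronics', 'Food'],
})

# Group, count, then sort in descending order
via_groupby = (
    toy.groupby('industryName1')['holdingTicker']
       .count()
       .sort_values(ascending=False)
)

# value_counts() counts each distinct label and sorts descending by default
via_value_counts = toy['industryName1'].value_counts()

print(via_groupby)
print(via_value_counts)
```

The groupby form remains the more general tool, since the sorted statistic can be a sum or any other aggregate rather than a count.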


3. Applying a custom aggregation to one column after grouping

3.1 Spread of market values among each fund's holdings

In [21]:
# Define a custom aggregation function: the range (max minus min) of a group
def t_range(arr):
    return arr.max() - arr.min()

# Group 'marketValue' by 'ticker', then aggregate the values within each fund
df[['marketValue']].groupby(df['ticker']).agg(t_range).head()

ticker,marketValue
1,134945800.0
3,1820900.0
4,1820900.0
5,0.0
7,745400.0
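
The same range can also be obtained from the built-in `max` and `min` aggregations without a custom function. The sketch below, on a hypothetical toy frame, checks that both routes agree; it also shows why a fund with a single holding gets a range of 0.

```python
import pandas as pd

def t_range(arr):
    # Range of a group: largest value minus smallest value
    return arr.max() - arr.min()

# Toy data with made-up values; fund 'B' has only one holding
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B'],
    'marketValue': [10.0, 25.0, 40.0, 7.0],
})

grouped = toy.groupby('ticker')['marketValue']

# Custom function passed to agg()
custom = grouped.agg(t_range)

# Equivalent result from two built-in aggregations
builtin = grouped.max() - grouped.min()

print(custom)
print(builtin)
```

On large tables the built-in version tends to be faster, because pandas does not have to call a Python function once per group.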


3.2 Total, maximum, and range of market values among each fund's holdings

In [22]:
# Apply several aggregations to the same column at once
df[['marketValue']].groupby(df['ticker']).agg(['sum', 'max', t_range]).head()

,marketValue,marketValue,marketValue
ticker,sum,max,t_range
1,1517243000.0,237401500.0,134945800.0
3,5342707.0,1978000.0,1820900.0
4,5342707.0,1978000.0,1820900.0
5,1277144.0,1277144.0,0.0
7,8266880.0,1287000.0,745400.0


4. Applying different aggregations to different columns after grouping

In [24]:
# Map each column to its own list of aggregations
df[['marketValue', 'industryName1']].groupby(df['ticker']).agg({'marketValue': [t_range, 'max'], 'industryName1': ['count']}).head()

,marketValue,marketValue,industryName1
ticker,t_range,max,count
1,134945800.0,237401500.0,10
3,1820900.0,1978000.0,4
4,1820900.0,1978000.0,4
5,0.0,1277144.0,1
7,745400.0,1287000.0,10
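
A dict of lists produces two-level column labels. When flat column names are preferred, pandas named aggregation (keyword arguments of the form `new_name=(column, function)`) is one alternative. The sketch below uses a hypothetical toy frame, and the output names `value_range`, `value_max`, and `n_holdings` are chosen only for illustration.

```python
import pandas as pd

def t_range(arr):
    # Range of a group: largest value minus smallest value
    return arr.max() - arr.min()

# Toy data with made-up values and industry labels
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B'],
    'marketValue': [10.0, 25.0, 40.0, 7.0],
    'industryName1': ['Banks', 'Food', 'Banks', 'Electronics'],
})

# Each keyword names an output column; the tuple gives (input column, function)
summary = toy.groupby('ticker').agg(
    value_range=('marketValue', t_range),
    value_max=('marketValue', 'max'),
    n_holdings=('industryName1', 'count'),
)

print(summary)
```

The result has one ordinary column per keyword, which is easier to sort, filter, or export than a MultiIndex header.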


5. Top 5 stocks held by each fund

In [41]:
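# Sort each fund's holdings by market value (ascending), then step backwards with [::-3].
# Note: [::-3] keeps every third row starting from the largest (the 1st, 4th, 7th, 10th, ...),
# which is what the output below shows. It does not select the five largest holdings;
# for a true top 5, sort with ascending=False and take .head(5) within each group.
df.groupby(df['ticker']).apply(lambda x: x.sort_values(by='marketValue')[::-3]).head(6)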

ticker,,ticker,holdingTicker,marketValue,industryName1
1,0,1,2236,237401500.0,电子
1,3,1,603799,153880000.0,有色金属
1,6,1,858,131744500.0,食品饮料
1,9,1,600779,102455700.0,食品饮料
3,10,3,600622,1978000.0,房地产
3,13,3,600060,157100.0,家用电器
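
One way to select the N largest holdings per fund is to sort the whole table once and then take `head(n)` within each group. This is a minimal sketch on hypothetical toy data with `n = 2`; the variable names `n` and `top_n` are illustrative, and the same pattern with `n = 5` would apply to `df`.

```python
import pandas as pd

# Toy data with made-up fund codes, stock codes, and values
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'holdingTicker': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3'],
    'marketValue': [40.0, 10.0, 30.0, 20.0, 5.0, 50.0, 15.0],
})

n = 2  # number of holdings to keep per fund

top_n = (
    toy.sort_values('marketValue', ascending=False)  # largest values first
       .groupby('ticker')
       .head(n)                                      # first n rows of each group
       .sort_values(['ticker', 'marketValue'], ascending=[True, False])
)

print(top_n)
```

Because `groupby(...).head(n)` returns rows of the original frame, the result keeps a flat index and avoids the two-level index that `apply` produces.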
