# Outline of the approach

First, recall the three-factor model:

$$r_t-r_f=\alpha_t+\beta_0SMB_t+\beta_1HML_t+\beta_2MKT_t+\epsilon_t$$

where

- SMB: size factor (small minus big)
- HML: book-to-market factor (high minus low)
- MKT: market factor
- Price-to-book ratio (PB) = share price (P) / book value per share
- Book-to-market ratio (BM) = book value per share / share price (P) = 1 / PB
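
Once the three factor series exist, the loadings in the equation above can be estimated with ordinary least squares. Here is a minimal sketch on synthetic data. The variable names, the simulated series, and the chosen "true" loadings are all made up for illustration and are not part of the original workflow.

```python
import numpy as np

# Synthetic monthly data, for illustration only (not the article's data).
rng = np.random.default_rng(0)
n_months = 120
smb = rng.normal(0.0, 0.03, n_months)
hml = rng.normal(0.0, 0.03, n_months)
mkt = rng.normal(0.005, 0.05, n_months)

# Build excess returns from known loadings so the fit can be checked.
true_coef = np.array([0.001, 0.5, -0.3, 1.1])  # alpha, beta_SMB, beta_HML, beta_MKT
X = np.column_stack([np.ones(n_months), smb, hml, mkt])
excess_ret = X @ true_coef + rng.normal(0.0, 0.0005, n_months)

# Ordinary least squares: minimise ||X b - y||^2.
coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
alpha, beta_smb, beta_hml, beta_mkt = coef
print(coef.round(3))
```

With real data, `excess_ret` would be a stock's (or portfolio's) return minus the risk-free rate, aligned month by month with the factor series.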

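**Constructing the SMB factor**

- Sort stocks by float market cap and split them in half, then take the difference between the (float-cap-weighted) returns of the small-cap and large-cap portfolios. Formula: SMB=(SH+SM+SL)/3-(BH+BM+BL)/3

**Constructing the HML factor**

- Sort stocks by book-to-market ratio (using the 70% and 30% quantiles of BM as breakpoints), then take the difference between the (float-cap-weighted) returns of the high-BM and low-BM portfolios. Formula: HML=(SH+BH)/2-(SL+BL)/2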
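
As a worked instance of the two formulas, the sketch below plugs made-up returns for the six size/value portfolios into them. The helper name `smb_hml` and the numbers are hypothetical.

```python
def smb_hml(p):
    """Return (SMB, HML) from the six portfolio returns in dict p."""
    smb = (p["SH"] + p["SM"] + p["SL"]) / 3 - (p["BH"] + p["BM"] + p["BL"]) / 3
    hml = (p["SH"] + p["BH"]) / 2 - (p["SL"] + p["BL"]) / 2
    return smb, hml

# Made-up monthly returns for the six portfolios.
example = {"SH": 0.06, "SM": 0.05, "SL": 0.04, "BH": 0.03, "BM": 0.02, "BL": 0.01}
smb, hml = smb_hml(example)
print(round(smb, 4), round(hml, 4))  # 0.03 0.02
```

Note that both terms of HML are averages of two portfolios, so both are divided by 2.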

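So we also need the following (monthly, adjusted price data, following the examples found online):

- Closing prices of individual stocks
- Closing prices of the market
- The risk-free rate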

# Data processing

The data is available through my cloud-drive share. Password

#### Import and inspect the financial data

In [11]:
import pandas as pd

In [4]:
b2p = pd.HDFStore('b2p.h5', mode='r') 
totcap = pd.HDFStore('totcap.h5', mode='r') 
floatcap = pd.HDFStore('floatcap.h5', mode='r') 
print(b2p.keys(),totcap.keys(),floatcap.keys())

['/eqtopc2totcap'] ['/totcap'] ['/floatcap']


In [5]:
b2p_data=b2p['eqtopc2totcap']
totcap_data=totcap['totcap']
floatcap_data=floatcap['floatcap']

With the data loaded, let's see what is inside.

In [6]:
b2p_data.head()

Unnamed: 0,date,Stkcd,sig,facnameuse
0,20050104,000001.SZ,0.34096,eqtopc2totcap
1,20050104,000002.SZ,0.477448,eqtopc2totcap
2,20050104,000004.SZ,0.169741,eqtopc2totcap
3,20050104,000005.SZ,0.576144,eqtopc2totcap
4,20050104,000006.SZ,0.868864,eqtopc2totcap


In [7]:
b2p_data.tail()

Unnamed: 0,date,Stkcd,sig,facnameuse
4837,20220801,871970.BJ,0.403256,eqtopc2totcap
4838,20220801,871981.BJ,0.360205,eqtopc2totcap
4839,20220801,872925.BJ,0.42779,eqtopc2totcap
4840,20220801,873169.BJ,0.321375,eqtopc2totcap
4841,20220801,873223.BJ,0.318666,eqtopc2totcap


In [8]:
b2p_data.columns=['date','Stkcd','b2p','facnameuse'] # rename the columns for readability

The data covers 2005-2022. The columns are date, stock code, and sig (the book-to-price ratio, judging by the file name).

In [9]:
totcap_data.head()

Unnamed: 0,date,Stkcd,sig
0,20031231,000001.SZ,16928650000.0
1,20031231,000002.SZ,8996443000.0
2,20031231,000004.SZ,670133900.0
3,20031231,000005.SZ,1913386000.0
4,20031231,000006.SZ,1344036000.0


In [10]:
totcap_data.columns=['date','Stkcd','totcap'] # rename the columns for readability

In [11]:
floatcap_data.head()

Unnamed: 0,date,Stkcd,sig
0,20031231,000001.SZ,12261450000.0
1,20031231,000002.SZ,6000288000.0
2,20031231,000004.SZ,332424700.0
3,20031231,000005.SZ,1009664000.0
4,20031231,000006.SZ,836377300.0


In [12]:
floatcap_data.columns=['date','Stkcd','floatcap'] # rename the columns for readability

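The data covers 2003-2022. The columns are date, stock code, and sig (total market cap and float market cap respectively, judging by the file names).

#### Merge and drop missing values

B2P seems to start in 2005, and the other datasets may not line up either, so let's try an inner merge first.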

In [13]:
tfcap_data=pd.merge(totcap_data,floatcap_data,how='inner',on=['date','Stkcd'])
fin_data=pd.merge(b2p_data[['date','Stkcd','b2p']],tfcap_data,how='inner',on=['date','Stkcd'])
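
An inner merge silently drops every (date, Stkcd) pair that is missing from either side. To see how much is lost, one option is an outer merge with `indicator=True`. The sketch below uses tiny made-up frames, not the real data.

```python
import pandas as pd

# Toy frames (hypothetical values) keyed the same way as the real data.
left = pd.DataFrame({"date": [20050104, 20050104, 20050105],
                     "Stkcd": ["000001.SZ", "000002.SZ", "000001.SZ"],
                     "b2p": [0.34, 0.48, 0.35]})
right = pd.DataFrame({"date": [20050104, 20050105, 20050105],
                      "Stkcd": ["000001.SZ", "000001.SZ", "000004.SZ"],
                      "floatcap": [9.3e9, 9.4e9, 2.8e8]})

# An outer merge with indicator=True labels each row by where it came from.
audit = pd.merge(left, right, how="outer", on=["date", "Stkcd"], indicator=True)
counts = audit["_merge"].value_counts()
print(counts)
```

Rows labelled `both` are the ones an inner merge would keep; `left_only` and `right_only` rows are the ones it would drop.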

In [14]:
fin_data.head()

Unnamed: 0,date,Stkcd,b2p,totcap,floatcap
0,20050104,000001.SZ,0.34096,12822970000.0,9287695000.0
1,20050104,000002.SZ,0.477448,11959280000.0,8294219000.0
2,20050104,000004.SZ,0.169741,571881200.0,283685700.0
3,20050104,000005.SZ,0.576144,1559055000.0,822689200.0
4,20050104,000006.SZ,0.868864,1047333000.0,651743100.0


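Next, drop every row that contains a zero in any column (zeros are treated as missing values here), reset the index, and save an intermediate copy.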

In [15]:
fin_data=fin_data.loc[(fin_data != 0).all(axis=1)].reset_index(drop=True) # keep only rows with no zero in any column
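
One caveat about this kind of filter: `!= 0` is true for NaN, so a zero filter on its own keeps NaN rows. If the data might contain both, a sketch like the following (toy values, hypothetical variable names) shows the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"b2p": [0.3, 0.0, np.nan, 0.5],
                    "floatcap": [1e9, 2e9, 3e9, 4e9]})

# Zero filter only: the NaN row survives because NaN != 0 evaluates to True.
only_zero_filter = toy.loc[(toy != 0).all(axis=1)]

# Drop NaN rows first, then rows containing a zero.
zero_and_nan = toy.dropna().loc[lambda d: (d != 0).all(axis=1)]

print(len(only_zero_filter), len(zero_and_nan))  # 3 2
```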

In [16]:
fin_data.to_pickle('fin.pkl') # save as a pickle file as a checkpoint, just in case

In [None]:
del fin_data, totcap_data, floatcap_data, b2p_data

Free the memory as soon as the intermediate frames are no longer needed!

#### Merge the price data

In [12]:
fin_data=pd.read_pickle('fin.pkl')

In [13]:
prices=pd.HDFStore('ashareeodprices.h5', mode='r') 
prices.keys()

['/ashareeodprices', '/lastupdatedt']

In [14]:
prices_data=prices['ashareeodprices']
prices_data.head()

Unnamed: 0,date,Stkcd,preclose,open,high,low,close,ret,vol,amt,adjpreclose,adjopen,adjhigh,adjlow,adjclose,adjfac,avgprice,ztprice,dtprice
0,20000104,000001.SZ,17.45,17.5,18.55,17.2,18.29,4.8138,82161.0,147325.3568,378.0,379.08,401.82,372.58,396.19,21.661613,17.9313,,
1,20000104,000002.SZ,9.75,9.8,10.4,9.6,10.3,5.641,45747.0,46053.4516,82.99,83.42,88.53,81.72,87.68,8.512305,10.067,,
2,20000104,000003.SZ,5.47,5.48,5.85,5.4,5.74,4.936,19073.0,10787.1201,20.97,21.01,22.43,20.7,22.0,3.833487,5.6557,,
3,20000104,000004.SZ,8.51,8.55,8.75,8.36,8.74,2.7027,6577.0,5628.1931,26.48,26.6,27.23,26.01,27.19,3.111524,8.5574,,
4,20000104,000005.SZ,6.04,6.1,6.27,6.0,6.24,3.3113,8365.0,5132.4196,26.78,27.04,27.8,26.6,27.66,4.433378,6.1356,,


We only need the price columns (close and adjclose), so merge those in, check that no zero (missing) values remain, and then free the memory.

In [15]:
fama_data=pd.merge(fin_data,prices_data[['date','Stkcd','close','adjclose']],how='inner',on=['date','Stkcd'])
fama_data.loc[(fama_data == 0).any(axis=1)] # rows that still contain a zero; an empty result means none

Unnamed: 0,date,Stkcd,b2p,totcap,floatcap,close,adjclose


In [16]:
del fin_data,prices_data

Since only month-end data is needed, take the last observation of each month for each stock.

In [17]:
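fama_data['date']=pd.to_datetime(fama_data['date'],format = '%Y%m%d') # convert the integer dates to timestamps first
fama_m_data=fama_data.groupby(['Stkcd',pd.Grouper(freq='M',key='date')]).last().ffill().reset_index() 
# Group by stock code and calendar month, keep the last row of each month, and forward-fill any gaps
# (on pandas >= 2.2 the month-end alias is 'ME' rather than 'M')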
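
To make the month-end step concrete, here is a small sketch on toy data. It groups by a monthly period (`dt.to_period("M")`) instead of a `pd.Grouper` frequency alias, which is a substitution on my part to avoid depending on the pandas version. The data and column values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Stkcd": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-28",
                            "2020-01-23", "2020-02-27"]),
    "adjclose": [10.0, 11.0, 12.0, 20.0, 21.0],
})

# Sort first so that "last" really means the latest trading day in the month.
toy = toy.sort_values(["Stkcd", "date"])
month = toy["date"].dt.to_period("M").rename("month")
monthly = toy.groupby(["Stkcd", month])["adjclose"].last().reset_index()
print(monthly)
```

Each stock ends up with one row per month holding the last available price of that month.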

In [18]:
fama_m_data['date'].unique().max()

numpy.datetime64('2022-07-31T00:00:00.000000000')

In [19]:
fama_m_data['date'].unique().min()

numpy.datetime64('2003-12-31T00:00:00.000000000')

In [20]:
len(fama_m_data['date'].unique())

224

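From December 2003 through July 2022 there are 18*12+8=224 months (2004-2021 in full, plus December 2003 and January-July 2022), so nothing is missing or duplicated. Now the returns can be computed.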

In [21]:
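# Forward one-month return: (next month - this month) / this month, computed within each stock
next_adjclose = fama_m_data.groupby('Stkcd')['adjclose'].shift(-1)
fama_m_data['adjclose'] = (next_adjclose - fama_m_data['adjclose']) / fama_m_data['adjclose']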
fama_m_data = fama_m_data.dropna()
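
The timing matters here: with `shift(-1)`, the value stored on a month-t row is the return from month t to month t+1, and the last month of each stock has no return. A toy sketch (made-up prices, hypothetical column name `fwd_ret`):

```python
import pandas as pd

toy = pd.DataFrame({
    "Stkcd": ["A", "A", "A", "B", "B"],
    "adjclose": [10.0, 11.0, 9.9, 20.0, 25.0],
})

# Next month's price within each stock; the last month of each stock has none.
next_price = toy.groupby("Stkcd")["adjclose"].shift(-1)
toy["fwd_ret"] = next_price / toy["adjclose"] - 1
print(toy)
```

Because the shift happens inside each `Stkcd` group, the last row of stock A does not pick up the first price of stock B.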

#### Building the factors

Before building the factors, each stock has to be assigned to its SMB (size) and HML (value) groups.

In [22]:
fama_m_data.head()

Unnamed: 0,Stkcd,date,b2p,totcap,floatcap,close,adjclose
0,000001.SZ,2003-12-31,0.235216,16928650000.0,12261450000.0,8.51,0.090474
1,000001.SZ,2004-01-31,0.21473,18543690000.0,13431220000.0,9.28,0.116395
2,000001.SZ,2004-02-29,0.197909,20119800000.0,14572800000.0,10.36,0.02894
3,000001.SZ,2004-03-31,0.192691,20664630000.0,14967420000.0,10.66,-0.124766
4,000001.SZ,2004-04-30,0.232417,17979400000.0,13022500000.0,9.33,0.031107


In [23]:
import numpy as np

In [24]:
def dec_SMB(x): # size label: 'B' at or above the median, 'S' below it
    return pd.Series(np.where(x>=x.quantile(0.5),'B','S'),index=x.index)


def dec_HML(x): # value label: 'H' at or above the 70% quantile, 'L' at or below the 30% quantile, 'M' otherwise
    outp=pd.Series('M',index=x.index)
    outp[x>=x.quantile(0.7)]='H'
    outp[x<=x.quantile(0.3)]='L'
    return outp

In [25]:
# group_keys=False keeps the original row index so the labels align with fama_m_data
fama_m_data.loc[:,'SMB']=fama_m_data.groupby('date',group_keys=False)['floatcap'].apply(dec_SMB)
fama_m_data.loc[:,'HML']=fama_m_data.groupby('date',group_keys=False)['b2p'].apply(dec_HML)
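
The same labelling can also be done without `apply`, by broadcasting each date's breakpoints back onto the rows with `transform`. This is an alternative sketch on toy data, not the notebook's own method. The column names `size`, `value`, and the breakpoint variables are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": ["2020-01"] * 4 + ["2020-02"] * 4,
    "floatcap": [1.0, 2.0, 3.0, 4.0, 40.0, 30.0, 20.0, 10.0],
    "b2p": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
})

by_date = toy.groupby("date")
median_cap = by_date["floatcap"].transform(lambda x: x.quantile(0.5))
b2p_lo = by_date["b2p"].transform(lambda x: x.quantile(0.3))
b2p_hi = by_date["b2p"].transform(lambda x: x.quantile(0.7))

# Size: 'B' at or above the date's median, 'S' below it.
toy["size"] = np.where(toy["floatcap"] >= median_cap, "B", "S")
# Value: 'H' at or above the 70% quantile, 'L' at or below the 30% quantile.
toy["value"] = np.select([toy["b2p"] >= b2p_hi, toy["b2p"] <= b2p_lo],
                         ["H", "L"], default="M")
toy["tag"] = toy["size"] + toy["value"]
print(toy)
```

The breakpoints are recomputed for every date, so a stock can move between groups from one month to the next.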

In [26]:
fama_m_data['tag']=fama_m_data['SMB']+fama_m_data['HML'] # combine the two labels into six named groups

In [27]:
fama_m_data.head()

Unnamed: 0,Stkcd,date,b2p,totcap,floatcap,close,adjclose,SMB,HML,tag
0,000001.SZ,2003-12-31,0.235216,16928650000.0,12261450000.0,8.51,0.090474,B,L,BL
1,000001.SZ,2004-01-31,0.21473,18543690000.0,13431220000.0,9.28,0.116395,B,L,BL
2,000001.SZ,2004-02-29,0.197909,20119800000.0,14572800000.0,10.36,0.02894,B,L,BL
3,000001.SZ,2004-03-31,0.192691,20664630000.0,14967420000.0,10.66,-0.124766,B,L,BL
4,000001.SZ,2004-04-30,0.232417,17979400000.0,13022500000.0,9.33,0.031107,B,L,BL


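Constructing the SMB factor

- Sort stocks by float market cap and split them in half, then take the difference between the (float-cap-weighted) returns of the small-cap and large-cap portfolios. Formula: SMB=(SH+SM+SL)/3-(BH+BM+BL)/3

Constructing the HML factor

- Sort stocks by book-to-market ratio (using the 70% and 30% quantiles of BM as breakpoints), then take the difference between the (float-cap-weighted) returns of the high-BM and low-BM portfolios. Formula: HML=(SH+BH)/2-(SL+BL)/2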

In [28]:
factor_data=fama_m_data.groupby(['date','tag']).apply(lambda x: (x['adjclose']*x['floatcap']).sum()/x['floatcap'].sum()).reset_index()
# Float-cap-weighted average return of the stocks in each group

factor_data.rename(columns={factor_data.columns[-1]:'ret'},inplace=True)
# Rename the computed column to 'ret'

# Reshape with a pivot: one row per date, one column per group
factor_data=factor_data.pivot(index='date',columns='tag',values='ret')
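
The weighting itself is just sum(weight × return) / sum(weight) within each (date, tag) group. A small sketch with made-up numbers, computing it via two grouped sums instead of `apply` (the column name `w_ret` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2020-01", "2020-01", "2020-01", "2020-01"],
    "tag": ["BH", "BH", "SL", "SL"],
    "ret": [0.10, 0.00, 0.02, 0.04],
    "floatcap": [3.0, 1.0, 1.0, 1.0],
})

# Weighted mean = sum(weight * return) / sum(weight), computed per (date, tag).
toy["w_ret"] = toy["ret"] * toy["floatcap"]
sums = toy.groupby(["date", "tag"])[["w_ret", "floatcap"]].sum()
vw = (sums["w_ret"] / sums["floatcap"]).unstack("tag")
print(vw)
```

In the BH group the stock with three times the float cap pulls the average to 0.075, while an equal-weighted mean would give 0.05.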

In [29]:
factor_data.head()

date,BH,BL,BM,SH,SL,SM
2003-12-31,0.101038,0.047227,0.066838,0.105323,0.082403,0.107298
2004-01-31,0.091564,0.050715,0.052916,0.106129,0.12195,0.107536
2004-02-29,0.038319,0.005672,0.03837,0.047358,0.033937,0.055577
2004-03-31,-0.101308,-0.121238,-0.082126,-0.100408,-0.117495,-0.102251
2004-04-30,-0.018893,-0.032013,-0.026163,-0.019857,-0.004122,-0.009239


In [30]:
factor_data['SMB']=factor_data.apply(lambda x: (x['SH']+x['SM']+x['SL'])/3-(x['BH']+x['BM']+x['BL'])/3,axis=1)
factor_data['HML']=factor_data.apply(lambda x: (x['SH']+x['BH'])/2-(x['SL']+x['BL'])/2,axis=1)

In [31]:
factor_data.head()

date,BH,BL,BM,SH,SL,SM,SMB,HML
2003-12-31,0.101038,0.047227,0.066838,0.105323,0.082403,0.107298,0.02664,0.0384
2004-01-31,0.091564,0.050715,0.052916,0.106129,0.12195,0.107536,0.046807,0.0125
2004-02-29,0.038319,0.005672,0.03837,0.047358,0.033937,0.055577,0.01817,0.0230
2004-03-31,-0.101308,-0.121238,-0.082126,-0.100408,-0.117495,-0.102251,-0.005161,0.0185
2004-04-30,-0.018893,-0.032013,-0.026163,-0.019857,-0.004122,-0.009239,0.014617,-0.0013


In [42]:
factor_data[['SMB','HML']].to_csv('SMB_and_HML_factors.csv') # export the factor data

#### Market factor

I define the market return as the equal-weighted average return of all stocks.

In [32]:
factor_data['mrt']=fama_m_data.groupby(['date'])['adjclose'].agg('mean')

For the risk-free rate, I pulled the Shanghai Interbank Offered Rate (Shibor, 1M) from the tushare data library, covering 2006-10-08 to 2022-07-31.

In [69]:
rf_data=pd.read_csv('tushare_shibor.csv')

In [70]:
rf_data['date']=pd.to_datetime(rf_data['date'],format = '%Y%m%d')
rf_data=rf_data.set_index('date')
rf_data.columns=['rf'] # rename the rate column; date is now a datetime index, which makes the merge easier

In [78]:
fama3_data=pd.merge(factor_data[['SMB','HML','mrt']],rf_data,on='date')
fama3_data['MKT']=fama3_data['mrt']-fama3_data['rf'] # MKT factor = market return - risk-free rate
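
Two things are worth checking before this subtraction, since the contents of the CSV are not shown here. First, units: if the Shibor column is quoted as an annualised percentage (an assumption on my part), it needs converting to a monthly decimal rate before it can be subtracted from a monthly return. Second, dates: calendar month-end labels that fall on non-trading days may not match any daily quote in an exact-date merge, so keying both sides by month is safer. A sketch under those assumptions, with made-up quotes:

```python
import pandas as pd

# Hypothetical daily quotes, assumed to be annualised percentages.
rf_daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-02-27", "2020-02-28"]),
    "rf": [2.4, 3.6, 2.0, 1.2],
})

# Last quote of each calendar month, keyed by month rather than exact day.
rf_monthly = rf_daily.groupby(rf_daily["date"].dt.to_period("M"))["rf"].last()

# Annualised percent -> simple monthly decimal rate (no compounding).
rf_monthly = rf_monthly / 100 / 12
print(rf_monthly)
```

Dividing by 12 is the simple (non-compounded) convention. A compounded conversion would be `(1 + r/100) ** (1/12) - 1`.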

In [83]:
fama3_data=fama3_data.drop(columns=['mrt','rf']) # drop the columns that are no longer needed

In [84]:
fama3_data.to_csv('FAMA_3_factors.csv') # export the factor data

All done!