# Financial and Economic Data Processing

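+ In this lecture we continue learning and practicing how to process financial data with pandas, especially time series. The focus remains on data preprocessing, with a review of data alignment, data type conversion, and extracting and combining data from different sources.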
 

In [1]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 6))

In [2]:
%matplotlib inline

In [3]:
%pwd

'D:\\teaching\\金融数据分析datafin'

### Aligned computation along the time-series and cross-sectional dimensions

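+ Computing an index requires weighting prices, and aligning the data by hand is tedious. With a pandas DataFrame the problem becomes simple: arithmetic between DataFrames aligns automatically on row and column labels.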

+ The example below computes a volume-weighted average price (VWAP).

In [4]:
close_px = pd.read_csv('data/stock_px_11.csv', parse_dates=True, index_col=0)
volume = pd.read_csv('data/volume_11.csv', parse_dates=True, index_col=0)
prices = close_px.loc['2011-09-05':'2011-09-14', ['AAPL', 'JNJ', 'SPX', 'XOM']]
volume = volume.loc['2011-09-05':'2011-09-12', ['AAPL', 'JNJ', 'XOM']]

In [5]:
prices

Unnamed: 0,AAPL,JNJ,SPX,XOM
2011-09-06,379.74,64.64,1165.24,71.15
2011-09-07,383.93,65.43,1198.62,73.65
2011-09-08,384.14,64.95,1185.9,72.82
2011-09-09,377.48,63.64,1154.23,71.01
2011-09-12,379.94,63.59,1162.27,71.84
2011-09-13,384.62,63.61,1172.87,71.65
2011-09-14,389.3,63.73,1188.68,72.64


In [6]:
volume

Unnamed: 0,AAPL,JNJ,XOM
2011-09-06,18173500.0,15848300.0,25416300.0
2011-09-07,12492000.0,10759700.0,23108400.0
2011-09-08,14839800.0,15551500.0,22434800.0
2011-09-09,20171900.0,17008200.0,27969100.0
2011-09-12,16697300.0,13448200.0,26205800.0


In [7]:
prices * volume

Unnamed: 0,AAPL,JNJ,SPX,XOM
2011-09-06,6901205000.0,1024434000.0,,1808370000.0
2011-09-07,4796054000.0,704007200.0,,1701934000.0
2011-09-08,5700561000.0,1010070000.0,,1633702000.0
2011-09-09,7614489000.0,1082402000.0,,1986086000.0
2011-09-12,6343972000.0,855171000.0,,1882625000.0
2011-09-13,,,,
2011-09-14,,,,


In [8]:
vwap = (prices * volume).sum() / volume.sum()

In [9]:
vwap

AAPL    380.655181
JNJ      64.394769
SPX            NaN
XOM      72.024288
dtype: float64

In [10]:
vwap.dropna()

AAPL    380.655181
JNJ      64.394769
XOM      72.024288
dtype: float64
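
The CSV files used above may not be at hand, so here is a minimal sketch of the same alignment behaviour on small synthetic frames. The tickers AAA and BBB and all numbers are made up for illustration; only the SPX column name echoes the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: three dates, plus an index column 'SPX' that has no volume
prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0], "BBB": [20.0, 21.0, 22.0], "SPX": [100.0, 101.0, 102.0]},
    index=pd.to_datetime(["2011-09-06", "2011-09-07", "2011-09-08"]),
)
# Hypothetical volumes: only two of the dates and two of the columns
volume = pd.DataFrame(
    {"AAA": [100.0, 300.0], "BBB": [200.0, 200.0]},
    index=pd.to_datetime(["2011-09-06", "2011-09-07"]),
)

# Multiplication aligns on row AND column labels; unmatched cells become NaN
traded_value = prices * volume

# sum() skips NaN by default, so only the overlapping dates contribute
vwap = traded_value.sum() / volume.sum()
print(vwap.dropna())
```

The result keeps the union of both label sets, which is why SPX appears with NaN and has to be dropped explicitly.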

### Operations on time series with different frequencies

+ The resample method converts a time series to a different frequency; reindex conforms the data to a new index.

In [11]:
ts1 = Series(np.random.randn(3),
             index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1

2012-06-13   -0.319992
2012-06-20    0.278125
2012-06-27    0.253227
Freq: W-WED, dtype: float64

In [12]:
ts1.resample('B')

<pandas.core.resample.DatetimeIndexResampler object at 0x000002092097C848>

In [13]:
ts1.resample('B').ffill()

2012-06-13   -0.319992
2012-06-14   -0.319992
2012-06-15   -0.319992
2012-06-18   -0.319992
2012-06-19   -0.319992
2012-06-20    0.278125
2012-06-21    0.278125
2012-06-22    0.278125
2012-06-25    0.278125
2012-06-26    0.278125
2012-06-27    0.253227
Freq: B, dtype: float64

In [14]:
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
                          '2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.random.randn(6), index=dates)
ts2

2012-06-12    0.581441
2012-06-17    1.007213
2012-06-18    0.503563
2012-06-21    0.191923
2012-06-22    0.216226
2012-06-29   -2.710182
dtype: float64

In [15]:
ts1.reindex(ts2.index, method='ffill')

2012-06-12         NaN
2012-06-17   -0.319992
2012-06-18   -0.319992
2012-06-21    0.278125
2012-06-22    0.278125
2012-06-29    0.253227
dtype: float64

In [16]:
ts2 + ts1.reindex(ts2.index, method='ffill')

2012-06-12         NaN
2012-06-17    0.687220
2012-06-18    0.183571
2012-06-21    0.470048
2012-06-22    0.494352
2012-06-29   -2.456954
dtype: float64
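
Because the series above are random, the effect of the forward fill is hard to check by eye. The following sketch repeats the idea with fixed, made-up values and also shows what happens when the reindex step is skipped.

```python
import pandas as pd

# Weekly series observed on Wednesdays (values are made up)
weekly = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2012-06-13", periods=3, freq="W-WED"),
)
# Irregularly spaced series (values are made up)
irregular = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.DatetimeIndex(["2012-06-12", "2012-06-18", "2012-06-21", "2012-06-29"]),
)

# Carry the latest weekly value forward onto the irregular dates
weekly_on_irregular = weekly.reindex(irregular.index, method="ffill")

# Both operands now share one index, so the sum is element by element
total = irregular + weekly_on_irregular

# Without reindexing, the two indexes share no dates and every sum is NaN
naive = irregular + weekly
```

Here the first entry of total is NaN because no weekly value exists on or before 2012-06-12, while naive is NaN everywhere.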

#### Using periods instead of timestamps

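Periods are well suited to financial or economic series that are reported at annual or quarterly frequency under special conventions.
Below are two macroeconomic time series, one for GDP and one for inflation.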

In [17]:
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46],
             index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
infl = Series([0.025, 0.045, 0.037, 0.04],
              index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp

1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
1985Q3    2.31
1985Q4    2.46
Freq: Q-SEP, dtype: float64

In [18]:
infl

1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64

In [19]:
infl_q = infl.asfreq('Q-SEP', how='end')

In [20]:
infl_q

1983Q1    0.025
1984Q1    0.045
1985Q1    0.037
1986Q1    0.040
Freq: Q-SEP, dtype: float64

In [21]:
infl_q.reindex(gdp.index, method='ffill')

1984Q2    0.045
1984Q3    0.045
1984Q4    0.045
1985Q1    0.037
1985Q2    0.037
1985Q3    0.037
1985Q4    0.037
Freq: Q-SEP, dtype: float64
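
The how argument decides which quarter each annual value is stamped on. The sketch below compares how='end' with how='start' on made-up numbers. It assumes the annual period alias 'Y', which newer pandas releases prefer over the 'A-DEC' spelling used above; both describe a calendar year ending in December.

```python
import pandas as pd

# Annual inflation (made-up values) on calendar years ending in December
infl = pd.Series([0.025, 0.045], index=pd.period_range("1982", periods=2, freq="Y"))
# Quarterly GDP (made-up values) on a fiscal year ending in September
gdp = pd.Series([1.0, 1.1, 1.2, 1.3],
                index=pd.period_range("1983Q3", periods=4, freq="Q-SEP"))

# Stamp each annual value on the quarter that contains the END of that year
infl_end = infl.asfreq("Q-SEP", how="end")
# ...or on the quarter that contains the START of that year
infl_start = infl.asfreq("Q-SEP", how="start")

# Forward-fill the end-stamped values over the GDP quarters
infl_on_gdp = infl_end.reindex(gdp.index, method="ffill")
```

With a fiscal year ending in September, December 1982 falls in 1983Q1 and January 1982 falls in 1982Q2, so the two conversions differ by three quarters.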

### Time of day and "as of" data selection

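Suppose you have a long intraday market data series and want to extract the price at a particular time each day. What if the data are irregular? In practice, a small oversight can lead to incorrectly regularized data.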

In [4]:
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
rng

DatetimeIndex(['2012-06-01 09:30:00', '2012-06-01 09:31:00',
               '2012-06-01 09:32:00', '2012-06-01 09:33:00',
               '2012-06-01 09:34:00', '2012-06-01 09:35:00',
               '2012-06-01 09:36:00', '2012-06-01 09:37:00',
               '2012-06-01 09:38:00', '2012-06-01 09:39:00',
               ...
               '2012-06-01 15:50:00', '2012-06-01 15:51:00',
               '2012-06-01 15:52:00', '2012-06-01 15:53:00',
               '2012-06-01 15:54:00', '2012-06-01 15:55:00',
               '2012-06-01 15:56:00', '2012-06-01 15:57:00',
               '2012-06-01 15:58:00', '2012-06-01 15:59:00'],
              dtype='datetime64[ns]', length=390, freq='T')

In [6]:
x=[rng + pd.offsets.BDay(i) for i in range(1, 4)]
x[0]

DatetimeIndex(['2012-06-04 09:30:00', '2012-06-04 09:31:00',
               '2012-06-04 09:32:00', '2012-06-04 09:33:00',
               '2012-06-04 09:34:00', '2012-06-04 09:35:00',
               '2012-06-04 09:36:00', '2012-06-04 09:37:00',
               '2012-06-04 09:38:00', '2012-06-04 09:39:00',
               ...
               '2012-06-04 15:50:00', '2012-06-04 15:51:00',
               '2012-06-04 15:52:00', '2012-06-04 15:53:00',
               '2012-06-04 15:54:00', '2012-06-04 15:55:00',
               '2012-06-04 15:56:00', '2012-06-04 15:57:00',
               '2012-06-04 15:58:00', '2012-06-04 15:59:00'],
              dtype='datetime64[ns]', length=390, freq=None)

In [2]:
# Make an intraday date range and time series
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
# Make a 4-day series of 9:30-15:59 values (the first day plus the next 3 business days)
# rng + pd.offsets.BDay(1) shifts every timestamp in rng to the next business day
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1, 4)])
ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts

2012-06-01 09:30:00       0.0
2012-06-01 09:31:00       1.0
2012-06-01 09:32:00       2.0
2012-06-01 09:33:00       3.0
2012-06-01 09:34:00       4.0
                        ...  
2012-06-06 15:55:00    1555.0
2012-06-06 15:56:00    1556.0
2012-06-06 15:57:00    1557.0
2012-06-06 15:58:00    1558.0
2012-06-06 15:59:00    1559.0
Length: 1560, dtype: float64

In [7]:
from datetime import time
ts[time(10, 0)]

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

In [24]:
ts.at_time(time(10, 0))

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

In [25]:
ts.between_time(time(10, 0), time(10, 1))

2012-06-01 10:00:00      30.0
2012-06-01 10:01:00      31.0
2012-06-04 10:00:00     420.0
2012-06-04 10:01:00     421.0
2012-06-05 10:00:00     810.0
2012-06-05 10:01:00     811.0
2012-06-06 10:00:00    1200.0
2012-06-06 10:01:00    1201.0
dtype: float64
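
A compact sketch of the two selectors on a tiny series makes the behaviour easy to verify. The values are plain counters, and the 'min' frequency alias is used here as an assumption about the installed pandas version (it is the spelling that newer releases prefer over 'T').

```python
import pandas as pd

# Two trading days of minute bars from 09:30 to 09:34 (values are just counters)
day1 = pd.date_range("2012-06-01 09:30", periods=5, freq="min")
day2 = pd.date_range("2012-06-04 09:30", periods=5, freq="min")
idx = day1.append(day2)
bars = pd.Series(range(len(idx)), index=idx, dtype=float)

# One observation per day at exactly 09:32
at_0932 = bars.at_time("09:32")

# All observations from 09:31 to 09:32; both endpoints are included by default
window = bars.between_time("09:31", "09:32")
```

Both methods accept either datetime.time objects, as in the cells above, or time strings.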

In [26]:
np.random.seed(12346)

In [27]:
# Set most of the time series randomly to NA
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']

2012-06-01 09:50:00    20.0
2012-06-01 09:51:00     NaN
2012-06-01 09:52:00    22.0
2012-06-01 09:53:00    23.0
2012-06-01 09:54:00     NaN
2012-06-01 09:55:00    25.0
2012-06-01 09:56:00     NaN
2012-06-01 09:57:00     NaN
2012-06-01 09:58:00     NaN
2012-06-01 09:59:00     NaN
2012-06-01 10:00:00     NaN
dtype: float64

In [28]:
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')
irr_ts.asof(selection)

2012-06-01 10:00:00      25.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1197.0
Freq: B, dtype: float64
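
To see what asof does without random gaps, here is a small sketch on a hand-made series. All values are illustrative.

```python
import numpy as np
import pandas as pd

# Minute bars with gaps (NaN marks a minute with no observation); values are made up
idx = pd.date_range("2012-06-01 09:56", periods=6, freq="min")
irregular = pd.Series([25.0, np.nan, 27.0, np.nan, np.nan, 30.0], index=idx)

# An exact lookup returns NaN because nothing was observed at 10:00
exact = irregular.at_time("10:00")

# asof returns the last non-NaN value at or before each requested timestamp
as_of = irregular.asof(pd.DatetimeIndex(["2012-06-01 09:57", "2012-06-01 10:00"]))
```

Note that asof says nothing about how old the returned value is. In the output above, the 10:00 value for 2012-06-01 is 25.0, which was observed at 09:55, so a staleness check may be worth adding in real work.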

### Splicing together data sources

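The following operations come up often when working with financial and economic data:

+ Switching from one data source to another at a specific point in time

+ Patching the current time series with data from another time series

+ Replacing the data for some symbols (columns) with data from another source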


In [29]:
data1 = DataFrame(np.ones((6, 3), dtype=float),
                  columns=['a', 'b', 'c'],
                  index=pd.date_range('6/12/2012', periods=6))
data2 = DataFrame(np.ones((6, 3), dtype=float) * 2,
                  columns=['a', 'b', 'c'],
                  index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]])
spliced

Unnamed: 0,a,b,c
2012-06-12,1.0,1.0,1.0
2012-06-13,1.0,1.0,1.0
2012-06-14,1.0,1.0,1.0
2012-06-15,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0


In [30]:
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
                  columns=['a', 'b', 'c', 'd'],
                  index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]],sort=False)
spliced

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,
2012-06-14,1.0,1.0,1.0,
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


In [31]:
 spliced.combine_first?

In [32]:
spliced_filled = spliced.combine_first(data2)
spliced_filled

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


In [33]:
spliced.update(data2, overwrite=False)

In [34]:
spliced

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0
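
The two patching methods above give the same table in this example, but they differ in how they treat the index and whether they modify the caller. A minimal sketch with made-up frames (the names primary and secondary are hypothetical):

```python
import numpy as np
import pandas as pd

# Primary source with a hole; secondary source covering a later, overlapping range
primary = pd.DataFrame({"a": [1.0, np.nan, 1.0]},
                       index=pd.date_range("2012-06-12", periods=3))
secondary = pd.DataFrame({"a": [2.0, 2.0, 2.0]},
                         index=pd.date_range("2012-06-13", periods=3))

# combine_first returns a NEW frame on the union of both indexes,
# keeping primary values and using secondary only where primary is missing
patched = primary.combine_first(secondary)

# update works IN PLACE and keeps primary's index; with overwrite=False
# only the NaN cells of primary are filled
updated = primary.copy()
updated.update(secondary, overwrite=False)
```

Here patched gains the extra date 2012-06-15 from secondary, while updated still has only the three original dates.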


In [35]:
cp_spliced = spliced.copy()
cp_spliced[['a', 'c']] = data1[['a', 'c']]
cp_spliced

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,1.0,2.0,1.0,2.0
2012-06-16,1.0,2.0,1.0,2.0
2012-06-17,1.0,2.0,1.0,2.0
2012-06-18,,2.0,,2.0


In [36]:
data=pd.read_csv("data/000001.SH.csv",index_col=0,parse_dates=True)
data.head()

Unnamed: 0,OPEN,HIGH,LOW,CLOSE,VOLUME
1990-12-19,96.05,100.0,95.79,100.0,126000.0
1990-12-20,104.3,104.39,99.98,104.39,19700.0
1990-12-21,109.07,109.13,103.73,109.13,2800.0
1990-12-24,113.57,114.55,109.13,114.55,3200.0
1990-12-25,120.09,120.25,114.55,120.25,1500.0


### Return indexes and cumulative returns



In [37]:
# %run getsinadata.py
# price =get_sina_stock(601899,"1990-1-1","2015-8-15")["close"]
data=pd.read_csv("data/000001.SH.csv",index_col=0,parse_dates=True)
price=data["CLOSE"]
price[-5:]

2020-09-07    3292.5907
2020-09-08    3316.4170
2020-09-09    3254.6279
2020-09-10    3234.8234
2020-09-11    3260.3461
Name: CLOSE, dtype: float64

In [38]:
returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1  # Set first value to 1 (pct_change leaves it as NaN)
ret_index

1990-12-19     1.000000
1990-12-20     1.043900
1990-12-21     1.091300
1990-12-24     1.145500
1990-12-25     1.202500
                ...    
2020-09-07    32.925907
2020-09-08    33.164170
2020-09-09    32.546279
2020-09-10    32.348234
2020-09-11    32.603461
Name: CLOSE, Length: 7270, dtype: float64

In [39]:
m_returns = ret_index.resample('BM').last().pct_change()
m_returns['2012']

2012-01-31    0.042372
2012-02-29    0.059267
2012-03-30   -0.068231
2012-04-30    0.059010
2012-05-31   -0.010050
2012-06-29   -0.061884
2012-07-31   -0.054730
2012-08-31   -0.026674
2012-09-28    0.018875
2012-10-31   -0.008287
2012-11-30   -0.042904
2012-12-31    0.145957
Freq: BM, Name: CLOSE, dtype: float64

In [40]:
m_rets = (1 + returns).resample('M', kind='period').prod() - 1
m_rets['2012']

2012-01    0.042372
2012-02    0.059267
2012-03   -0.068231
2012-04    0.059010
2012-05   -0.010050
2012-06   -0.061884
2012-07   -0.054730
2012-08   -0.026674
2012-09    0.018875
2012-10   -0.008287
2012-11   -0.042904
2012-12    0.145957
Freq: M, Name: CLOSE, dtype: float64
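
The two monthly calculations above agree because compounding daily returns within a month equals the ratio of consecutive month-end index levels: 1 + R_month is the product of (1 + r_day) over the days of that month. The sketch below checks this on made-up prices. As an assumption for portability, it groups by monthly period labels instead of calling resample, since the resample frequency aliases differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up daily closing prices over three months of business days
rng = np.random.default_rng(0)
dates = pd.date_range("2012-01-02", "2012-03-30", freq="B")
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))), index=dates)

returns = price.pct_change()            # daily simple returns, first one is NaN
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1                   # base the return index at 1

month = price.index.to_period("M")      # month label of each trading day

# Method 1: month-end level of the return index, then percent change
m1 = ret_index.groupby(month).last().pct_change()
# Method 2: compound the daily returns inside each month
m2 = (1 + returns).groupby(month).prod() - 1

# The two agree for every month after the first
agree = np.allclose(m1.iloc[1:], m2.iloc[1:])
```

The first month differs by construction: method 1 has no earlier month-end to compare with and yields NaN, while method 2 compounds from the first available price.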

## Group computations

In [41]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 10
np.random.seed(12345)

In [42]:
import random; random.seed(0)
import string

N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in range(n)])
tickers = np.array([rands(5) for _ in range(N)])

In [43]:
''.join([random.choice(string.ascii_uppercase) for _ in range(6)])

'ILVWDD'

In [44]:
M = 500
df = DataFrame({'Momentum' : np.random.randn(M) / 200 + 0.03,
                'Value' : np.random.randn(M) / 200 + 0.08,
                'ShortInterest' : np.random.randn(M) / 200 - 0.02},
                index=tickers[:M])

In [45]:
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
                    name='industry')

In [46]:
by_industry = df.groupby(industries)
by_industry.mean()

Unnamed: 0_level_0,Momentum,Value,ShortInterest
industry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FINANCIAL,0.029485,0.079929,-0.020739
TECH,0.030407,0.080113,-0.019609


In [47]:
by_industry.describe()

Unnamed: 0_level_0,Momentum,Momentum,Momentum,Momentum,Momentum,...,ShortInterest,ShortInterest,ShortInterest,ShortInterest,ShortInterest
Unnamed: 0_level_1,count,mean,std,min,25%,...,min,25%,50%,75%,max
industry,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2
FINANCIAL,246.0,0.029485,0.004802,0.01721,0.026263,...,-0.036997,-0.024138,-0.020833,-0.017345,-0.006322
TECH,254.0,0.030407,0.005303,0.016778,0.026456,...,-0.032682,-0.022779,-0.019829,-0.016923,-0.003698


In [48]:
# Within-Industry Standardize
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)

In [49]:
df_stand.groupby(industries).agg(['mean', 'std'])

Unnamed: 0_level_0,Momentum,Momentum,Value,Value,ShortInterest,ShortInterest
Unnamed: 0_level_1,mean,std,mean,std,mean,std
industry,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
FINANCIAL,1.114736e-15,1.0,8.001278e-15,1.0,3.081772e-15,1.0
TECH,-2.779929e-16,1.0,-7.139521e-15,1.0,-1.910982e-15,1.0
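
A small sketch of the same within-industry standardization, using transform in place of apply. This is a swapped-in alternative to the cell above, chosen because transform always returns a result indexed like the input. The tickers and numbers are made up.

```python
import pandas as pd

# Made-up factor exposures for six hypothetical tickers in two industries
df = pd.DataFrame({"Momentum": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]},
                  index=["A", "B", "C", "D", "E", "F"])
industries = pd.Series(["FIN", "FIN", "FIN", "TECH", "TECH", "TECH"],
                       index=df.index, name="industry")

def zscore(group):
    # Subtract the group mean and divide by the group (sample) standard deviation
    return (group - group.mean()) / group.std()

# One z-score per row, computed against the row's own industry
df_stand = df.groupby(industries).transform(zscore)
```

Although TECH values are ten times larger than FIN values, both groups end up with the same standardized scores, which is the point of standardizing within industry.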


In [50]:
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])

Unnamed: 0_level_0,Momentum,Momentum,Value,Value,ShortInterest,ShortInterest
Unnamed: 0_level_1,min,max,min,max,min,max
industry,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
FINANCIAL,1.0,246.0,1.0,246.0,1.0,246.0
TECH,1.0,254.0,1.0,254.0,1.0,254.0


In [51]:
# Industry rank and standardize
by_industry.apply(lambda x: zscore(x.rank()))

Unnamed: 0,Momentum,Value,ShortInterest
MYNBI,-0.091346,-1.004802,-0.976696
QPMZJ,0.794005,-0.358356,1.299919
PLSGQ,-0.541047,-1.679355,-0.836164
EJEYD,-0.583207,0.990749,-1.623142
TZIRW,1.572120,0.374314,-0.265423
...,...,...,...
JPHKQ,-0.238200,0.074863,0.401537
VACPK,1.681011,-0.238200,-1.395171
MHNBS,0.673766,-1.313503,1.490451
YBNCI,1.623142,0.976696,0.541047
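
Ranking before standardizing makes the score depend only on the ordering inside each industry, not on the size of the gaps. The sketch below contrasts the two on made-up values with a deliberate outlier; it uses transform as an assumed alternative to the apply call above.

```python
import pandas as pd

# Made-up exposures; the 1000.0 is a deliberate outlier
values = pd.Series([1.0, 2.0, 1000.0, 5.0, 4.0, 3.0],
                   index=["A", "B", "C", "D", "E", "F"])
industry = pd.Series(["FIN"] * 3 + ["TECH"] * 3, index=values.index)

def zscore(group):
    return (group - group.mean()) / group.std()

# Standardize the raw values within each industry
z_raw = values.groupby(industry).transform(zscore)
# Rank within each industry first, then standardize the ranks
z_rank = values.groupby(industry).transform(lambda x: zscore(x.rank()))
```

In z_raw the outlier squeezes A and B to nearly the same score, whereas z_rank keeps them one full step apart.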


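Homework

1. Choose a stock yourself and download its financial data, daily data for the last three years, trade-by-trade details for the last week, and dividend records. Use the URLs below as a reference for downloading (pick a different stock; do not use 002051), or use the corresponding baostock interfaces.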

http://quotes.money.163.com/f10/fhpg_002051.html#01d05

http://quotes.money.163.com/trade/lsjysj_002051.html#01b07

http://quotes.money.163.com/trade/cjmx_002051.html#01b05

http://quotes.money.163.com/f10/zycwzb_002051.html#01c02

Store these data in an on-disk database.


2. Plot a candlestick (K-line) chart of the daily data, together with the 5-day, 20-day, and 30-day moving averages.

3. Using the last three years of daily data, compute daily, weekly, monthly, quarterly, and annual returns, both as simple returns and as log returns.

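4. For the one-week trade data, compute the volume of buy-side, sell-side, and neutral trades at each traded price, and plot it with price on the horizontal axis and volume on the vertical axis.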

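5. Using the one-week high-frequency data, compute 1-minute, 10-minute, 15-minute, and 30-minute simple and log returns, report summary statistics for each (mean, variance, quantiles, and so on), and draw a box-and-whisker plot for each series.

6. Interpret at least two of the financial indicators, plot them over time, and compare them with the corresponding stock price chart.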
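
As a starting point for the return calculations asked for above, here is a minimal sketch of simple versus log returns on three made-up closing prices. It is an illustration of the definitions only, not a solution to the exercises.

```python
import numpy as np
import pandas as pd

# Made-up closing prices
price = pd.Series([100.0, 110.0, 99.0],
                  index=pd.to_datetime(["2020-09-09", "2020-09-10", "2020-09-11"]))

simple = price.pct_change()        # P_t / P_{t-1} - 1
log_ret = np.log(price).diff()     # ln(P_t) - ln(P_{t-1})

# Log returns add up over time; simple returns compound
total_log = log_ret.sum()
total_simple = (1 + simple).prod() - 1
```

The two totals describe the same price move, since exp(total_log) - 1 equals total_simple.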
 
 


