## Correlation and Covariance

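Some summary statistics, such as correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes from Yahoo! Finance, obtained with the pandas-datareader package (which can be installed with conda or pip):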

In [47]:
import pandas_datareader.data as pdr
import fix_yahoo_finance as yf


import pandas as pd
import numpy as np


yf.pdr_override()

In [35]:
# Load the data manually from local pickle files
# data_path = r'/home/mylady/code/python/py-data-analysis/data_examples'
data_path = r'../../data_examples'


yahoo_volume = pd.read_pickle('%s/%s' % (data_path, 'yahoo_volume.pkl'))
yahoo_price = pd.read_pickle('%s/%s' % (data_path, 'yahoo_price.pkl'))

In [36]:
print('Volume, first 5 rows: \n', yahoo_volume.head())

print('\nVolume, index: \n', yahoo_volume.index)

print('\nVolume, columns: \n', yahoo_volume.columns)

Volume, first 5 rows: 
                  AAPL      GOOG      IBM      MSFT
Date                                              
2010-01-04  123432400   3927000  6155300  38409100
2010-01-05  150476200   6031900  6841400  49749600
2010-01-06  138040000   7987100  5605300  58182400
2010-01-07  119282800  12876600  5840600  50559700
2010-01-08  111902700   9483900  4197200  51197400

Volume, index: 
 DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
               '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
               '2010-01-14', '2010-01-15',
               ...
               '2016-10-10', '2016-10-11', '2016-10-12', '2016-10-13',
               '2016-10-14', '2016-10-17', '2016-10-18', '2016-10-19',
               '2016-10-20', '2016-10-21'],
              dtype='datetime64[ns]', name='Date', length=1714, freq=None)

Volume, columns: 
 Index(['AAPL', 'GOOG', 'IBM', 'MSFT'], dtype='object')


In [27]:
print('Price, first 5 rows: \n', yahoo_price.head())

print('\nPrice, index: \n', yahoo_price.index)

print('\nPrice, columns: \n', yahoo_price.columns)

Price, first 5 rows: 
                  AAPL        GOOG         IBM       MSFT
Date                                                    
2010-01-04  27.990226  313.062468  113.304536  25.884104
2010-01-05  28.038618  311.683844  111.935822  25.892466
2010-01-06  27.592626  303.826685  111.208683  25.733566
2010-01-07  27.541619  296.753749  110.823732  25.465944
2010-01-08  27.724725  300.709808  111.935822  25.641571

Price, index: 
 DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
               '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
               '2010-01-14', '2010-01-15',
               ...
               '2016-10-10', '2016-10-11', '2016-10-12', '2016-10-13',
               '2016-10-14', '2016-10-17', '2016-10-18', '2016-10-19',
               '2016-10-20', '2016-10-21'],
              dtype='datetime64[ns]', name='Date', length=1714, freq=None)

Price, columns: 
 Index(['AAPL', 'GOOG', 'IBM', 'MSFT'], dtype='object')


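> Note: Yahoo! Finance may no longer be available in this form by the time you read this, because Yahoo! was acquired by Verizon in 2017. Refer to the pandas-datareader documentation for the latest functionality.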

I originally downloaded this stock data with the pandas_datareader module; the cells above load a saved copy from pickle files.

Now compute the percent changes of the prices; time series operations are covered in Chapter 11:

In [29]:
returns = yahoo_price.pct_change()

returns

                AAPL      GOOG       IBM      MSFT
Date                                              
2010-01-04       NaN       NaN       NaN       NaN
2010-01-05  0.001729 -0.004404 -0.012080  0.000323
2010-01-06 -0.015906 -0.025209 -0.006496 -0.006137
2010-01-07 -0.001849 -0.023280 -0.003462 -0.010400
2010-01-08  0.006648  0.013331  0.010035  0.006897
...              ...       ...       ...       ...
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867


In [30]:
returns.tail()

                AAPL      GOOG       IBM      MSFT
Date                                              
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
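
As a quick check on what `pct_change` computes, here is a minimal sketch on a small hypothetical price Series. The values and the name `prices` are invented for illustration and are not taken from the Yahoo! data. It compares the method's result with the manual formula `prices / prices.shift(1) - 1`.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, invented for illustration only
prices = pd.Series(
    [100.0, 102.0, 99.96, 104.958],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# pct_change divides each value by the previous one and subtracts 1
auto = prices.pct_change()

# The same computation written out with shift
manual = prices / prices.shift(1) - 1

# The first entry is NaN because there is no earlier price to compare with
print(auto)
print(np.allclose(auto.dropna(), manual.dropna()))
```

This is why the first row of `returns` above is all NaN: the first trading day has no previous price.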


The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [33]:
returns['IBM']

Date
2010-01-04         NaN
2010-01-05   -0.012080
2010-01-06   -0.006496
2010-01-07   -0.003462
2010-01-08    0.010035
                ...   
2016-10-17    0.002072
2016-10-18   -0.026168
2016-10-19    0.003583
2016-10-20    0.001719
2016-10-21   -0.012474
Name: IBM, Length: 1714, dtype: float64

In [31]:
returns['MSFT'].corr(returns['IBM'])

0.49976361144151144

In [32]:
returns['MSFT'].cov(returns['IBM'])

8.870655479703546e-05
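
To make the "overlapping, non-NA, aligned-by-index" behavior concrete, the following sketch uses two small hypothetical Series (`a` and `b`, invented here) whose labels only partly match and appear in a different order. It then repeats the same steps by hand with NumPy, assuming the default Pearson correlation and the sample covariance (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Two hypothetical Series with partly different labels and one missing value
a = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], index=list("abcde"))
b = pd.Series([2.0, 1.0, 3.0, 5.0, 4.0, 9.0], index=list("bacdef"))

# corr and cov align on the index and drop pairs where either side is NA
r = a.corr(b)
c = a.cov(b)

# Doing the same steps by hand: align, drop NA pairs, then use NumPy
left, right = a.align(b, join="inner")
mask = left.notna() & right.notna()
r_manual = np.corrcoef(left[mask], right[mask])[0, 1]
c_manual = np.cov(left[mask], right[mask], ddof=1)[0, 1]

print(r, r_manual)
print(c, c_manual)
```

Only the labels `a`, `b`, `d`, and `e` contribute: `c` is dropped because of the NaN, and `f` is dropped because it exists in only one Series.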

Since MSFT is a valid Python attribute name, we can also select these columns using more concise syntax:

In [37]:
returns.MSFT.corr(returns.IBM)

0.49976361144151144

DataFrame's corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [38]:
returns.corr()

          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000
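
The two matrices are closely related: each correlation is the covariance divided by the product of the two standard deviations. The sketch below illustrates this on hypothetical random data. The frame `demo` and its columns `X`, `Y`, `Z` are assumptions made for the example, not part of the stock data.

```python
import numpy as np
import pandas as pd

# Hypothetical returns generated from a seeded random generator
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(250, 3)), columns=["X", "Y", "Z"])
demo["Y"] += 0.5 * demo["X"]  # make two columns correlated

cov = demo.cov()    # sample covariance matrix (ddof=1)
corr = demo.corr()  # Pearson correlation matrix

# Dividing each covariance by the product of the two standard deviations
# gives the correlation: corr_ij = cov_ij / (std_i * std_j)
std = demo.std()
rebuilt = cov / np.outer(std, std)

print(np.allclose(corr, rebuilt))
```

The diagonal of the covariance matrix holds each column's variance, which is why the diagonal of the correlation matrix is always 1.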


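Using DataFrame's corrwith method, you can compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column: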

In [44]:
print("IBM: ")
print(returns.corrwith(returns.IBM))

print("")

print("GOOG: ")
print(returns.corrwith(returns.GOOG))

print("")

print("AAPL: ")
print(returns.corrwith(returns.AAPL))

IBM: 
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

GOOG: 
AAPL    0.407919
GOOG    1.000000
IBM     0.405099
MSFT    0.465919
dtype: float64

AAPL: 
AAPL    1.000000
GOOG    0.407919
IBM     0.386817
MSFT    0.389695
dtype: float64
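
One way to read `corrwith` with a Series argument is as a loop over columns that calls `Series.corr` each time. The sketch below checks that equivalence on hypothetical random data. The frame `demo` and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data from a seeded random generator
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["X", "Y", "Z"])

# corrwith with a Series correlates every column against that Series
by_corrwith = demo.corrwith(demo["X"])

# The same result, one column at a time, via Series.corr
by_loop = pd.Series({name: demo[name].corr(demo["X"]) for name in demo.columns})

print(by_corrwith)
print(np.allclose(by_corrwith, by_loop))
```

The entry for `X` is 1 because a column is perfectly correlated with itself, just as the IBM, GOOG, and AAPL entries are 1.000000 in the output above.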


Passing a DataFrame computes the correlations of matching column names. Here I compute correlations of percent changes with volume:

In [46]:
returns.corrwith(yahoo_volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

Passing axis='columns' does things row by row instead. In all cases, the data points are aligned by label before the correlation is computed.
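
The name-based pairing and the `axis='columns'` option can be sketched on hypothetical data as follows. The frames `left` and `right` and their column names are invented for illustration. The example relies on the default `drop=False`, under which labels without a partner appear in the result as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data from a seeded random generator
rng = np.random.default_rng(2)
left = pd.DataFrame(rng.normal(size=(50, 3)), columns=["A", "B", "C"])
# The second frame lists its columns in a different order and lacks "C"
right = pd.DataFrame(rng.normal(size=(50, 2)), columns=["B", "A"])

# Columns are paired by name, not position; "C" has no partner and gives NaN
by_name = left.corrwith(right)
print(by_name)

# axis="columns" correlates matching rows instead of matching columns.
# Each row of left * 2 + 1 is a linear function of the same row of left,
# so every row-wise correlation should be 1 (up to rounding).
by_row = left.corrwith(left * 2 + 1, axis="columns")
print(by_row.head())
```

The row-wise result has one entry per row label rather than one per column name.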