# Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance:

In [5]:
from pandas_datareader import data as web
import yfinance
import datetime
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

In [8]:
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2017, 1, 1)

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOGL']:
    all_data[ticker] = web.get_data_yahoo(ticker, start, end)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

Next, I compute the percent change of each price from one trading day to the next:

In [9]:
returns = price.pct_change()

In [11]:
returns.tail()

| Date | AAPL | IBM | MSFT | GOOGL |
|---|---|---|---|---|
| 2016-12-23 | 0.001978 | -0.002095 | -0.004878 | -0.002322 |
| 2016-12-27 | 0.006351 | 0.002579 | 0.000632 | 0.002637 |
| 2016-12-28 | -0.004264 | -0.005684 | -0.004583 | -0.006618 |
| 2016-12-29 | -0.000257 | 0.002467 | -0.001429 | -0.002100 |
| 2016-12-30 | -0.007796 | -0.003661 | -0.012083 | -0.012991 |
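To make the percent-change step concrete, here is a minimal sketch on a tiny, invented price Series (the values and the names `toy_price`, `toy_returns`, and `by_hand` are assumptions for illustration, not part of the downloaded data). It shows that `pct_change` gives the same numbers as dividing each value by the previous one and subtracting 1, and that the first entry is NaN because it has no predecessor.

```python
import pandas as pd

# Hypothetical closing prices, invented purely for illustration.
toy_price = pd.Series(
    [100.0, 110.0, 99.0, 99.0],
    index=pd.date_range("2016-12-27", periods=4, freq="D"),
)

# Fractional change of each value relative to the previous row.
toy_returns = toy_price.pct_change()

# The same quantity written out by hand with shift().
by_hand = toy_price / toy_price.shift(1) - 1

# The first row is NaN in both versions: there is no earlier price to compare with.
print(pd.DataFrame({"pct_change": toy_returns, "by_hand": by_hand}))
```

Note that the result is a fraction, not a percentage: a move from 100 to 110 shows up as roughly 0.10.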


The *corr* method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, *cov* computes the covariance:

In [12]:
returns.MSFT.corr(returns.IBM)

0.4973963031548564

In [13]:
returns.MSFT.cov(returns.IBM)

8.735228717153598e-05
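The phrase "overlapping, non-NA, aligned-by-index" can be checked by hand. The sketch below uses two small made-up Series (`x`, `y`, and the helper names are hypothetical) whose indexes only partly overlap and one of which contains a missing value. It rebuilds the covariance and the Pearson correlation from the surviving pairs and compares them with what *cov* and *corr* return.

```python
import numpy as np
import pandas as pd

# Two invented return series with partly overlapping labels.
x = pd.Series([0.01, -0.02, 0.03, np.nan, 0.00, 0.02], index=list("abcdef"))
y = pd.Series([0.02, -0.01, 0.01, 0.04, -0.03, 0.01], index=list("bcdefg"))

# Align on the index and keep only rows where both values are present.
# Labels 'a' and 'g' are not shared, and 'd' is NaN in x, so 4 pairs remain.
pairs = pd.DataFrame({"x": x, "y": y}).dropna()

# Sample covariance (divides by n - 1) and Pearson correlation, by hand.
dx = pairs["x"] - pairs["x"].mean()
dy = pairs["y"] - pairs["y"].mean()
cov_by_hand = (dx * dy).sum() / (len(pairs) - 1)
corr_by_hand = cov_by_hand / (pairs["x"].std() * pairs["y"].std())

print(len(pairs))
print(np.isclose(x.cov(y), cov_by_hand))
print(np.isclose(x.corr(y), corr_by_hand))
```

The correlation is just the covariance rescaled by the two standard deviations, which is why it is unitless and bounded by -1 and 1 while the covariance of daily returns is a very small number.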

DataFrame’s *corr* and *cov* methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [15]:
returns.corr()

|  | AAPL | IBM | MSFT | GOOGL |
|---|---|---|---|---|
| AAPL | 1.000000 | 0.384859 | 0.393722 | 0.413241 |
| IBM | 0.384859 | 1.000000 | 0.497396 | 0.404737 |
| MSFT | 0.393722 | 0.497396 | 1.000000 | 0.466131 |
| GOOGL | 0.413241 | 0.404737 | 0.466131 | 1.000000 |


In [34]:
returns.cov()

|  | AAPL | IBM | MSFT | GOOGL |
|---|---|---|---|---|
| AAPL | 0.000272 | 0.000076 | 0.000095 | 0.000107 |
| IBM | 0.000076 | 0.000145 | 0.000087 | 0.000076 |
| MSFT | 0.000095 | 0.000087 | 0.000213 | 0.000107 |
| GOOGL | 0.000107 | 0.000076 | 0.000107 | 0.000246 |
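Both matrices have a structure worth verifying once: they are symmetric, the diagonal of the correlation matrix is 1, and the diagonal of the covariance matrix holds each column's variance. The following sketch checks this on randomly generated data (the frame `toy` and its column names are assumptions standing in for `returns`), and confirms that an off-diagonal entry matches the Series-level *corr*.

```python
import numpy as np
import pandas as pd

# Random stand-in for a returns table; seeded so the run is repeatable.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(250, 3)), columns=["A", "B", "C"])

corr = toy.corr()
cov = toy.cov()

# Both matrices are symmetric: entry (i, j) equals entry (j, i).
print(np.allclose(corr.values, corr.values.T))
print(np.allclose(cov.values, cov.values.T))

# Every column is perfectly correlated with itself.
print(np.allclose(np.diag(corr), 1.0))

# The covariance of a column with itself is its variance.
print(np.allclose(np.diag(cov), toy.var()))

# An off-diagonal entry is the same number Series.corr gives for that pair.
print(np.isclose(corr.loc["A", "B"], toy["A"].corr(toy["B"])))
```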


Using DataFrame’s *corrwith* method, you can compute pairwise correlations between a DataFrame’s columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [35]:
returns.corrwith(returns.IBM)

AAPL     0.384859
IBM      1.000000
MSFT     0.497396
GOOGL    0.404737
dtype: float64
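One way to read the Series case is as a shortcut for calling *corr* on every column in turn. The sketch below, again on invented random data (`toy` and `col_by_col` are hypothetical names), compares the two approaches.

```python
import numpy as np
import pandas as pd

# Random stand-in for a returns table; seeded so the run is repeatable.
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "C"])

# Correlate every column with column B in one call.
with_b = toy.corrwith(toy["B"])

# The same thing spelled out: apply Series.corr column by column.
col_by_col = toy.apply(lambda col: col.corr(toy["B"]))

print(with_b)
print(np.allclose(with_b, col_by_col))
```

The entry for `B` itself is 1, just as `IBM` correlates perfectly with itself in the output above.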

Passing a DataFrame computes the correlations between columns with matching names. Here I compute correlations of percent changes with volume:

In [36]:
returns.corrwith(volume)

AAPL    -0.073039
IBM     -0.200648
MSFT    -0.092002
GOOGL   -0.001307
dtype: float64
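Because the DataFrame case pairs columns by name, it is worth seeing what happens when the names only partly match. This is a minimal sketch on invented data (`toy_returns` and `toy_volume` are hypothetical frames, not the ones downloaded above): columns present in both frames are correlated, and columns present in only one come back as NaN unless `drop=True` is passed.

```python
import numpy as np
import pandas as pd

# Two random frames sharing a date index but only some column names.
rng = np.random.default_rng(1)
idx = pd.date_range("2016-01-01", periods=100, freq="D")
toy_returns = pd.DataFrame(rng.normal(size=(100, 3)), index=idx,
                           columns=["A", "B", "C"])
toy_volume = pd.DataFrame(rng.normal(size=(100, 3)), index=idx,
                          columns=["B", "C", "D"])

# B and C are paired by name; A and D have no partner and come back as NaN.
by_name = toy_returns.corrwith(toy_volume)
print(by_name)

# drop=True leaves the unmatched labels out of the result.
matched_only = toy_returns.corrwith(toy_volume, drop=True)
print(matched_only)

# Each matched entry equals the Series-level correlation of that pair.
print(np.isclose(by_name["B"], toy_returns["B"].corr(toy_volume["B"])))
```

In the stock example all four tickers appear in both `returns` and `volume`, so every entry is defined. `corrwith` also takes an `axis` argument for working row by row instead of column by column.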