*This notebook is intellectual property of Auquan and is distributed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License](https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode). Any modification or distribution of this notebook without express permission of Auquan is prohibited and will result in legal prosecution.*

# Covariance and Correlation

Covariance and correlation describe how two random variables are related.

## Covariance

Covariance measures how two variables vary together. Its sign shows the direction of the linear relationship between the variables, i.e. whether they tend to move together or in opposite directions. A positive sign indicates that the variables are directly related: when one increases, the other tends to increase as well. A negative sign indicates that the variables are inversely related: when one increases, the other tends to decrease. It is calculated as
$$Cov(X,Y) = E[XY] - E[X]E[Y] = E[(X- E[X])(Y-E[Y])]$$

Note that
$$Cov(X,X) = E[X^2] - E[X]^2 = E[(X- E[X])^2] = \sigma^2 $$

When the two variables are identical, covariance is the same as variance.
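
As a quick numerical check of the identities above, here is a minimal sketch on synthetic data. The seed, sample size and variable names are arbitrary choices made for illustration. Note that `np.cov` divides by $N-1$ by default, so `ddof=0` is passed to match the expectation formulas.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random(1000)
y = 2 * x + rng.normal(0, 0.1, 1000)

# The two equivalent expressions for covariance, estimated with sample means
cov_products = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
cov_centered = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X-E[X])(Y-E[Y])]

# ddof=0 makes np.cov divide by N, matching the estimates above
cov_numpy = np.cov(x, y, ddof=0)[0, 1]

# Cov(X, X) is the variance of X
var_from_cov = np.cov(x, x, ddof=0)[0, 1]

print('Covariance, three ways:', cov_products, cov_centered, cov_numpy)
print('Cov(X, X) vs Var(X):', var_from_cov, np.var(x))
```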

### Covariance isn't that meaningful by itself

Let's say we have two variables $X$ and $Y$ and we take the covariance of the two.

In [None]:
import os
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

X = np.random.rand(50)
Y = 2 * X + np.random.normal(0, 0.1, 50)

np.cov(X, Y)[0, 1]

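What does this mean? We know they move together, but how strongly? The magnitude of the covariance depends on the scale of the variables, so it is hard to interpret by itself. To make better sense of the data, we introduce correlation.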

## Correlation

Correlation uses information about the variance of $X$ and $Y$ to normalize this metric. The value of the correlation coefficient is always between -1 and 1. Once we've normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare correlations.

To normalize Covariance, consider

$$\frac{Cov(X, Y)}{\sqrt{Cov(X, X)}\sqrt{Cov(Y, Y)}}$$

$$= \frac{Cov(X, Y)}{\sigma(X)\sigma(Y)} = \rho$$ 

where $\rho$ is the correlation coefficient of two series $X$ and $Y$ and ranges from −1 to 1. A value of exactly 1 or −1 implies that the relationship between the two random variables can be perfectly described by a linear equation. A positive sign indicates that the variables tend to increase or decrease together, while a negative sign indicates that one tends to decrease as the other increases. A value of 0 implies that there is no linear correlation between the variables.
For example, if oil production in the world decreases, petrol prices rise. These two variables move in opposite directions, so their covariance will have a negative sign. How consistently petrol prices follow this linear relationship is shown by the magnitude of the correlation coefficient.
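
The normalization can be checked numerically. The sketch below uses synthetic data with an arbitrary seed. It computes $\rho$ by hand from the covariance and standard deviations, compares it with `np.corrcoef`, and shows that rescaling $X$ changes the covariance but not the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
x = rng.random(500)
y = 2 * x + rng.normal(0, 0.1, 500)

# Correlation as covariance divided by the product of standard deviations
# (ddof=1 matches the N-1 normalization that np.cov uses by default)
cov_xy = np.cov(x, y)[0, 1]
rho_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
rho_numpy = np.corrcoef(x, y)[0, 1]

# Multiplying X by 100 scales the covariance by 100 but leaves rho unchanged
cov_scaled = np.cov(100 * x, y)[0, 1]
rho_scaled = np.corrcoef(100 * x, y)[0, 1]

print('rho by hand: %.4f, np.corrcoef: %.4f' % (rho_manual, rho_numpy))
print('covariance: %.4f, after rescaling X: %.4f' % (cov_xy, cov_scaled))
print('correlation after rescaling X: %.4f' % rho_scaled)
```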

Two independent random sets of data will have a correlation coefficient close to 0.
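
A minimal sketch of this, assuming two independently drawn uniform samples (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
x = rng.random(10_000)
y = rng.random(10_000)          # drawn independently of x

rho_independent = np.corrcoef(x, y)[0, 1]
print('Correlation of two independent samples: %.3f' % rho_independent)
```

With smaller samples the estimate is noisier, so individual runs can land noticeably further from 0.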

### Correlation vs. Covariance

Correlation is simply a normalized form of covariance. The two are otherwise the same and are often used semi-interchangeably in everyday conversation. It is important to be precise with language when discussing them, but conceptually they are almost identical.


In [None]:
print('Covariance of X and Y: %.2f'%np.cov(X, Y)[0, 1])
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

To get a sense of what correlated data looks like, let's plot two correlated datasets.





In [None]:
X = np.random.rand(50)
Y = X + np.random.normal(0, 0.1, 50)

plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

And here's an inverse relationship:

In [None]:
X = np.random.rand(50)
Y = -X + np.random.normal(0, .1, 50)

plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

### Applications in finance

Let's look at correlation in financial datasets. We look for correlation between the prices of AAPL stock and the market (the S&P 500, tracked here by the SPY ETF) as well as another stock (LRCX).

In [None]:
# Install yahoo finance to obtain historical market data
!pip install yfinance

In [None]:
import yfinance as yf

# Download data for stocks
startDateStr = '2012-12-31'
endDateStr = '2017-12-31'
instrumentIds = ['AAPL','LRCX', 'SPY']
price_dict = {}
for instrumentId in instrumentIds:
    data = yf.download(instrumentId, startDateStr, endDateStr)
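    # squeeze() leaves a Series unchanged and collapses the one-column
    # DataFrame returned by some yfinance versions to a Series
    price_dict[instrumentId] = data['Close'].squeeze()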

In [None]:
a1 = price_dict['AAPL']
a2 = price_dict['LRCX']
bench = price_dict['SPY']
plt.scatter(a1,a2)
plt.xlabel('AAPL')  # a1 (AAPL) is plotted on the x-axis
plt.ylabel('LRCX')  # a2 (LRCX) is plotted on the y-axis
plt.title('Stock prices from ' + startDateStr + ' to ' + endDateStr)
plt.show()
print('Correlation of %s and %s: %.2f'%('AAPL', 'LRCX', np.corrcoef(a1,a2)[0, 1]))

Once we've established that two series are probably related, we can use that in an effort to predict future values of the series.

Another application is to find uncorrelated assets to produce hedged portfolios: if the assets are uncorrelated, a drawdown in one need not coincide with a drawdown in another. Combining many uncorrelated assets tends to produce a more stable return stream.

In [None]:
# Download data for stocks
startDateStr = '2012-12-31'
endDateStr = '2017-12-31'
instrumentIds = ['AAPL','MSFT', 'FB', 'GOOG', 'T', 'INTC', 'V', 'CSCO', 'VZ']
price_dict = {}
for instrumentId in instrumentIds:
    data = yf.download(instrumentId, startDateStr, endDateStr)
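    # squeeze() leaves a Series unchanged and collapses the one-column
    # DataFrame returned by some yfinance versions to a Series
    price_dict[instrumentId] = data['Close'].squeeze()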

**Ex1: Calculate the volatility of this basket now taking covariance into account.**

| Symbol | % of Portfolio |
| ------ | -------------- |
| AAPL | 24% |
| MSFT | 18.25% |
| GOOG | 17.5% |
| FB | 11.75% |
| T | 6.5% |
| INTC | 6% |
| V | 5.75% |
| CSCO | 5.25% |
| VZ | 5% |

Remember, $\sigma^2(X+Y) = \sigma^2(X) + \sigma^2(Y) + 2\,Cov(X,Y)$

If you are calculating a weighted sum, $\sigma^2(\sum_i w_i X_i) = w^T \Sigma\, w$,
where $w$ is the vector of weights $w_i$ and $\Sigma$ is the covariance matrix of the $X_i$, with entries $\Sigma_{ij} = Cov(X_i, X_j)$.
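
To illustrate the weighted-sum formula without giving away the exercise, here is a sketch on made-up data. The three synthetic return series, their shared factor and the weights are all hypothetical and are not the basket above. It computes the basket volatility from the covariance matrix, cross-checks it against the volatility of the weighted return series, and compares it with the value obtained when covariances are ignored.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Hypothetical daily returns for three assets that share a common factor
n_days = 1000
common = rng.normal(0, 0.01, n_days)
asset_returns = np.column_stack(
    [common + rng.normal(0, 0.01, n_days) for _ in range(3)]
)
weights = np.array([0.5, 0.3, 0.2])  # hypothetical weights, summing to 1

# Basket variance as the quadratic form of weights and covariance matrix
cov_matrix = np.cov(asset_returns, rowvar=False)  # columns are assets
basket_vol = np.sqrt(weights @ cov_matrix @ weights)

# Cross-check: volatility of the weighted return series itself
direct_vol = np.std(asset_returns @ weights, ddof=1)

# Ignoring the off-diagonal (covariance) terms understates the risk here,
# because the assets are positively correlated through the common factor
naive_vol = np.sqrt(np.sum(weights ** 2 * np.diag(cov_matrix)))

print('Basket volatility from the covariance matrix: %.5f' % basket_vol)
print('Volatility of the weighted return series:     %.5f' % direct_vol)
print('Volatility ignoring covariances:              %.5f' % naive_vol)
```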

In [None]:
##

What happens if you add a negatively correlated stock into the mix?

Consider the case of a single stock, AAPL. Let's say we find a second stock $Y$ with similar volatility such that
$$E[Future\:Returns\:of\:Y] \sim E[Future\:Returns\:of\:AAPL]$$ and $$Cov(AAPL,\:Y) < 0$$
If we buy these two in equal quantities, the expected return of the basket is
$$E[0.5 *AAPL + 0.5*Y ] = 0.5*(E[AAPL]+E[Y]) \sim E[AAPL]$$
and the standard deviation of this basket is
$$\sigma_{0.5\,AAPL+0.5\,Y} = \sqrt{0.25\,\sigma_{AAPL}^2+0.25\,\sigma_{Y}^2+0.5\,Cov(AAPL,\:Y)} <\sigma_{AAPL}$$
The inequality holds because $\sigma_Y \approx \sigma_{AAPL}$ and the covariance term is negative. Therefore we can reduce the risk of our investment without giving up expected return by adding this stock to our portfolio.
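
The effect can be seen on synthetic numbers. In the sketch below, `a` is a made-up stand-in for AAPL returns and `y` is constructed to have the same mean and standard deviation with a correlation of about -0.5. All parameter values are assumptions chosen for illustration. The basket volatility is computed with the 0.5 weights written out explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n_days, mu, sigma = 2000, 0.001, 0.02

# Hypothetical stand-in for AAPL daily returns
a = rng.normal(mu, sigma, n_days)

# Y has the same mean and standard deviation as `a`, with correlation -0.5:
# Var(Y) = 0.25*sigma^2 + 0.75*sigma^2 and Cov(a, Y) = -0.5*sigma^2
y = mu - 0.5 * (a - mu) + rng.normal(0, sigma * np.sqrt(0.75), n_days)

cov_ay = np.cov(a, y)[0, 1]

# Variance of the 50/50 basket: w_a^2 Var(a) + w_y^2 Var(y) + 2 w_a w_y Cov(a, y)
basket_vol = np.sqrt(
    0.25 * np.var(a, ddof=1) + 0.25 * np.var(y, ddof=1) + 0.5 * cov_ay
)
direct_vol = np.std(0.5 * a + 0.5 * y, ddof=1)  # same quantity, computed directly

print('Volatility of a alone:      %.4f' % np.std(a, ddof=1))
print('Volatility of 50/50 basket: %.4f' % basket_vol)
print('Mean return of a: %.5f, of basket: %.5f' % (a.mean(), (0.5 * a + 0.5 * y).mean()))
```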

Let's generate a fake stock 'Y' whose returns are built by drawing from a normal distribution with the same mean and standard deviation as AAPL returns and then subtracting the AAPL returns, so that they are negatively correlated with the returns of AAPL.

**Ex2: Add this stock Y to your original basket of tech stocks and give it 10% basket weight. Now recalculate basket volatility. What do you find?**


In [None]:
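# Daily returns per instrument, built from price_dict (squeeze() handles a Series or a one-column DataFrame)
returns = pd.DataFrame({k: v.squeeze() for k, v in price_dict.items()}).pct_change().dropna()
Y_returns = np.random.normal(returns['AAPL'].mean(),returns['AAPL'].std(),len(returns)) - returns['AAPL']
# recalculate basket volatility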

A large part of portfolio optimization is about finding such uncorrelated stocks with high returns, so that the resulting portfolio has high overall returns but low volatility.

### Important: Significance of Correlation

It's hard to rigorously determine whether or not a correlation is significant, especially when we do not know whether the variables are normally distributed.

Consider our two stocks, AAPL and LRCX. Their correlation coefficient is close to 1, so it's pretty safe to say that the two stock prices are correlated over the time period we tested over, but is this indicative of future correlation? 

As an example, remember that the correlation of AAPL and LRCX from 2013-1-1 to 2015-1-1 was 0.95. 

**Ex3: Plot the rolling 60 day correlation between the two to see how that varies.**
You can use `series1.rolling(window=period).corr(series2)` to calculate this. In general, `df.rolling()` can be used to calculate many rolling quantities, such as the rolling mean or standard deviation.
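
To show how the rolling call behaves before you apply it to the stock prices, here is a sketch on two synthetic random walks. The series, dates and seed are made up for illustration. The first 59 values are `NaN` because a 60-day window is not yet full.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)  # arbitrary seed
dates = pd.bdate_range('2013-01-01', periods=500)

# Two independent synthetic random walks standing in for price series
s1 = pd.Series(rng.normal(0, 1, 500).cumsum(), index=dates)
s2 = pd.Series(rng.normal(0, 1, 500).cumsum(), index=dates)

# Correlation over a trailing 60-observation window, one value per date
rolling_corr = s1.rolling(window=60).corr(s2)

print(rolling_corr.dropna().describe())
```

Calling `rolling_corr.plot()` would then draw the series over time.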

In [None]:
#Plot rolling correlation
plt.figure()
plt.show()

What do you find? You'll see the correlation is not only unstable, but also reverses sign!

We may also be led to believe that two stocks have a relationship because of their high correlation, when in fact both are driven by a third factor (the market).

**Ex4: Examine the correlation of each stock with the S&P 500 (labeled as SPY)**


In [None]:
#Calculate correlation with SPY

You'll see that each stock is very strongly correlated with the S&P 500, so we cannot be sure whether the stocks are directly related to each other or simply both follow the market.
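
One way to probe this, sketched here on synthetic data, is to remove the market's contribution from each series and correlate what is left. The `market`, `stock_a` and `stock_b` series and the helper `remove_factor` are hypothetical names introduced for this illustration, and the single-factor regression is an assumption rather than a method prescribed by this notebook.

```python
import numpy as np

rng = np.random.default_rng(8)  # arbitrary seed
n = 5000

# Two synthetic return series that are related only through a shared market factor
market = rng.normal(0, 1, n)
stock_a = market + rng.normal(0, 0.5, n)
stock_b = market + rng.normal(0, 0.5, n)

raw_corr = np.corrcoef(stock_a, stock_b)[0, 1]

def remove_factor(series, factor):
    """Subtract the part of `series` explained linearly by `factor`."""
    beta = np.cov(series, factor)[0, 1] / np.var(factor, ddof=1)
    return series - beta * factor

# Correlation of the residuals once the market component is taken out
residual_corr = np.corrcoef(
    remove_factor(stock_a, market), remove_factor(stock_b, market)
)[0, 1]

print('Raw correlation:      %.2f' % raw_corr)
print('Residual correlation: %.2f' % residual_corr)
```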

Another problem is that we may find a good correlation by picking the right time period, but it may not hold out of sample. To avoid this, one should compute the correlation of the two quantities over many historical time periods and examine the distribution of the correlation coefficient.
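
A minimal sketch of that procedure, assuming two synthetic return series with a modest true correlation and non-overlapping 60-day windows (both choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed
n_days, window = 1200, 60

# Synthetic returns with a true correlation of 0.2 through a common factor
common = rng.normal(0, 1, n_days)
r1 = 0.5 * common + rng.normal(0, 1, n_days)
r2 = 0.5 * common + rng.normal(0, 1, n_days)

# Correlation coefficient in each non-overlapping window
window_corrs = np.array([
    np.corrcoef(r1[start:start + window], r2[start:start + window])[0, 1]
    for start in range(0, n_days, window)
])

print('Number of windows: %d' % len(window_corrs))
print('Mean: %.2f, std: %.2f' % (window_corrs.mean(), window_corrs.std()))
print('Min: %.2f, max: %.2f' % (window_corrs.min(), window_corrs.max()))
```

Even though the underlying relationship never changes in this setup, the per-window estimates scatter around the true value, which is why a single window can be misleading.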

**Another shortcoming is that two variables may be associated in different, predictable but non-linear ways which this analysis would not pick up. For example, a variable may be related to the rate of change of another which will not be detected by correlation alone.**
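
A simple illustration of this blind spot uses a different non-linear relationship than the rate-of-change case mentioned above, chosen because it is easy to verify: $Y = X^2$ with $X$ symmetric around zero. $Y$ is fully determined by $X$, yet the correlation is close to 0. The data and seed are made up.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
x = rng.uniform(-1, 1, 10_000)
y = x ** 2                      # y is completely determined by x

# The linear correlation misses the dependence entirely
rho_linear = np.corrcoef(x, y)[0, 1]

# Correlating against the right transformation of x recovers it
rho_transformed = np.corrcoef(x ** 2, y)[0, 1]

print('Correlation of x and y:   %.3f' % rho_linear)
print('Correlation of x^2 and y: %.3f' % rho_transformed)
```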

**Just remember to be careful not to interpret results where there are none.**

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Auquan. Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Auquan, has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Auquan, at the time of publication. Auquan makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.