# Tutorial 5 - Correlation (Correlation Analysis of Assets and Risk)
- Correlation is a statistic that measures the degree to which two variables move in relation to each other.
- Correlation measures association; it does not show whether x causes y or vice versa.
- In finance, correlation can measure how a stock moves relative to a benchmark index, such as the S&P 500.


### Formula
- $r = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum(X - \bar{X})^2 \sum(Y - \bar{Y})^2}}$
- $r$: the correlation coefficient
- $\bar{X}$: the mean of the observations of $X$
- $\bar{Y}$: the mean of the observations of $Y$
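
As a minimal sketch of the formula, the helper below (`pearson_r` is a hypothetical name, not part of any library) computes $r$ term by term with NumPy and compares it with NumPy's built-in `np.corrcoef`. The input numbers are made up for illustration and are not market data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed directly from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up illustrative numbers, not market data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # NumPy's built-in, for comparison
print(r_manual, r_numpy)
```

For these numbers the numerator is 8 and both sums of squares are 10, so both values should come out at about 0.8.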

What does it mean?
- $r$ ranges between -1 and 1 (both inclusive)
- $r = 1$: Perfect positive correlation
- $r = -1$: Perfect negative correlation
- $r = 0$: No linear correlation
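
The three cases can be illustrated with a small sketch on made-up data. Note the last case: `y = x ** 2` is fully determined by `x`, yet its correlation with `x` is zero here, because $r$ only captures linear relationships.

```python
import numpy as np

# Made-up values, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

r_pos = np.corrcoef(x, 3 * x + 1)[0, 1]   # increasing linear function of x
r_neg = np.corrcoef(x, -2 * x + 5)[0, 1]  # decreasing linear function of x
r_sq = np.corrcoef(x, x ** 2)[0, 1]       # depends on x, but not linearly

# Expect values close to 1, -1 and 0 respectively
print(r_pos, r_neg, r_sq)
```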

### Resources
- Correlation https://www.investopedia.com/terms/c/correlation.asp
- SP500 by Market Cap https://www.slickcharts.com/sp500

In [1]:
import pandas as pd
import pandas_datareader as pdr
import datetime as dt
import numpy as np

In [2]:
tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT']
start = dt.datetime(2020, 1, 1)

data = pdr.get_data_yahoo(tickers, start)

In [3]:
data.head()

Attributes,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,High,High,...,Low,Low,Open,Open,Open,Open,Volume,Volume,Volume,Volume
Symbols,AAPL,TWTR,IBM,MSFT,AAPL,TWTR,IBM,MSFT,AAPL,TWTR,...,IBM,MSFT,AAPL,TWTR,IBM,MSFT,AAPL,TWTR,IBM,MSFT
Date
2020-01-02,73.894325,32.299999,115.726616,157.289886,75.087502,32.299999,129.46463,160.619995,75.150002,32.5,...,128.843216,158.330002,74.059998,32.310001,129.063095,158.779999,135480400.0,10721100.0,3293436.0,22622100.0
2020-01-03,73.175934,31.52,114.80368,155.33136,74.357498,31.52,128.432129,158.619995,75.144997,32.099998,...,127.686424,158.059998,74.287498,31.709999,127.695984,158.320007,146322800.0,14429500.0,2482890.0,21116200.0
2020-01-06,73.75901,31.639999,114.598579,155.732834,74.949997,31.639999,128.202682,159.029999,74.989998,31.709999,...,127.342255,156.509995,73.447502,31.23,127.552582,157.080002,118387200.0,12582500.0,2537073.0,20813700.0
2020-01-07,73.412117,32.540001,114.675476,154.312927,74.597504,32.540001,128.288712,157.580002,75.224998,32.700001,...,127.533463,157.320007,74.959999,31.799999,127.810707,159.320007,108872000.0,13712900.0,3232977.0,21634100.0
2020-01-08,74.593033,33.049999,115.632607,156.770828,75.797501,33.049999,129.359467,160.089996,76.110001,33.400002,...,128.030594,157.949997,74.290001,32.349998,128.59465,158.929993,132079200.0,14632400.0,4545916.0,27746500.0


## Select the adjusted close prices of the tickers

In [4]:
data = data['Adj Close']

In [5]:
data.head()

Date,AAPL,TWTR,IBM,MSFT
2020-01-02,73.894325,32.299999,115.726616,157.289886
2020-01-03,73.175934,31.52,114.80368,155.33136
2020-01-06,73.75901,31.639999,114.598579,155.732834
2020-01-07,73.412117,32.540001,114.675476,154.312927
2020-01-08,74.593033,33.049999,115.632607,156.770828


## Calculate the returns for the tickers

In [6]:
log_returns = np.log(data/data.shift())

In [7]:
log_returns

Date,AAPL,TWTR,IBM,MSFT
2020-01-02,,,,
2020-01-03,-0.009769,-0.024445,-0.008007,-0.012530
2020-01-06,0.007937,0.003800,-0.001788,0.002581
2020-01-07,-0.004714,0.028048,0.000671,-0.009159
2020-01-08,0.015958,0.015551,0.008312,0.015803
...,...,...,...,...
2022-02-18,-0.009400,-0.031831,-0.004974,-0.009678
2022-02-22,-0.017973,-0.041344,-0.003464,-0.000730
2022-02-23,-0.026205,-0.005176,-0.015042,-0.026234
2022-02-24,0.016543,0.065568,-0.000820,0.049831
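
The log-return step can be checked on a tiny example. This is a sketch with made-up prices, and the names `prices`, `log_ret` and `simple_ret` are chosen for illustration only. It shows that the first row is `NaN` because there is no previous price, that `np.log(p / p.shift())` matches `np.log(p).diff()`, and that log returns add up over time to the log of the total price ratio.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

simple_ret = prices / prices.shift() - 1   # ordinary percentage returns
log_ret = np.log(prices / prices.shift())  # same expression as in the cell above
log_ret_alt = np.log(prices).diff()        # equivalent formulation

# Summing log returns (NaN is skipped) gives the log of the overall price ratio
total_log_return = log_ret.sum()
direct = np.log(prices.iloc[-1] / prices.iloc[0])

print(log_ret)
print(total_log_return, direct)
```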


## Calculate the correlation with the .corr() method

In [8]:
log_returns.corr()

Symbols,AAPL,TWTR,IBM,MSFT
AAPL,1.0,0.504462,0.451414,0.804257
TWTR,0.504462,1.0,0.310268,0.525204
IBM,0.451414,0.310268,1.0,0.483954
MSFT,0.804257,0.525204,0.483954,1.0
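
A small sketch on made-up returns shows what `DataFrame.corr()` gives back: a symmetric matrix with ones on the diagonal, computed from the rows where both columns have values, so the leading `NaN` row is ignored. The column names `A`, `B` and `C` are placeholders, and `C` is defined as the exact negative of `A`.

```python
import numpy as np
import pandas as pd

# Made-up returns; the first row is NaN, as it is after data.shift()
returns = pd.DataFrame({
    "A": [np.nan, 0.01, -0.02, 0.03, 0.00],
    "B": [np.nan, 0.02, -0.01, 0.02, 0.01],
    "C": [np.nan, -0.01, 0.02, -0.03, 0.00],  # exact negative of A
})

corr = returns.corr()    # Pearson correlation by default
ab = corr.loc["A", "B"]  # look up a single pair by label
ac = corr.loc["A", "C"]  # should be close to -1

print(corr)
```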


## Get the S&P 500 index values and calculate the correlations

In [9]:
sp500 = pdr.get_data_yahoo("^GSPC", start)

In [10]:
log_returns['SP500'] = np.log(sp500['Adj Close']/sp500['Adj Close'].shift())

In [11]:
log_returns.corr()

Symbols,AAPL,TWTR,IBM,MSFT,SP500
AAPL,1.0,0.504462,0.451414,0.804257,0.803312
TWTR,0.504462,1.0,0.310268,0.525204,0.581588
IBM,0.451414,0.310268,1.0,0.483954,0.711039
MSFT,0.804257,0.525204,0.483954,1.0,0.850102
SP500,0.803312,0.581588,0.711039,0.850102,1.0
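
Adding the index returns as a new column relies on pandas aligning the assigned Series to the DataFrame's date index. The sketch below uses made-up dates and values (`frame` and `other` are illustrative names) to show that behavior: matching dates are paired up regardless of order, dates missing from the Series become `NaN`, and dates that exist only in the Series are dropped.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"])
frame = pd.DataFrame({"AAPL": [0.01, -0.02, 0.03]}, index=dates)

# A series on a different set of dates, in a different order
other = pd.Series(
    [0.5, 0.4, 0.3],
    index=pd.to_datetime(["2020-01-06", "2020-01-02", "2020-01-07"]),
)

# Assignment aligns on the index rather than on position
frame["SP500"] = other
print(frame)
```

Because `.corr()` skips missing pairs, any dates that fail to line up simply drop out of the correlation for that pair.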


## Define a function to calculate correlations

In [12]:
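def test_correlation(ticker):
    # Download prices for the extra ticker from the same start date
    df = pdr.get_data_yahoo(ticker, start)
    # Work on a copy so the shared log_returns DataFrame is not modified
    lr = log_returns.copy()
    # Add the ticker's daily log returns as a new column
    lr[ticker] = np.log(df['Adj Close']/df['Adj Close'].shift())
    return lr.corr()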

In [13]:
test_correlation("LQD")

Symbols,AAPL,TWTR,IBM,MSFT,SP500,LQD
AAPL,1.0,0.504462,0.451414,0.804257,0.803312,0.250959
TWTR,0.504462,1.0,0.310268,0.525204,0.581588,0.227495
IBM,0.451414,0.310268,1.0,0.483954,0.711039,0.228174
MSFT,0.804257,0.525204,0.483954,1.0,0.850102,0.269082
SP500,0.803312,0.581588,0.711039,0.850102,1.0,0.308282
LQD,0.250959,0.227495,0.228174,0.269082,0.308282,1.0


In [14]:
test_correlation("TLT")

Symbols,AAPL,TWTR,IBM,MSFT,SP500,TLT
AAPL,1.0,0.504462,0.451414,0.804257,0.803312,-0.26701
TWTR,0.504462,1.0,0.310268,0.525204,0.581588,-0.136915
IBM,0.451414,0.310268,1.0,0.483954,0.711039,-0.371393
MSFT,0.804257,0.525204,0.483954,1.0,0.850102,-0.246597
SP500,0.803312,0.581588,0.711039,0.850102,1.0,-0.388489
TLT,-0.26701,-0.136915,-0.371393,-0.246597,-0.388489,1.0


Notice that TLT is negatively correlated with every other ticker in the table.
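
One way to spot such assets quickly is to sort a single column of the correlation matrix. The sketch below hard-codes the `SP500` correlations reported in the tables above (the LQD and TLT values come from their respective runs) instead of downloading data again; `corr_with_sp500` is an illustrative name.

```python
import pandas as pd

# Correlations with SP500, copied from the output tables above
corr_with_sp500 = pd.Series({
    "AAPL": 0.803312,
    "TWTR": 0.581588,
    "IBM": 0.711039,
    "MSFT": 0.850102,
    "LQD": 0.308282,
    "TLT": -0.388489,
})

ranked = corr_with_sp500.sort_values()  # most negative first
negative = ranked[ranked < 0]           # assets that tend to move against the index

print(ranked)
print(list(negative.index))
```

With live data, the same idea would apply to a column such as `test_correlation("TLT")["SP500"]`, minus the index's correlation with itself.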

## Define a visualization function to plot any two tickers

In [15]:
import matplotlib.pyplot as plt
%matplotlib notebook

In [16]:
def visualize_correlation(ticker1, ticker2):
    df = pdr.get_data_yahoo([ticker1, ticker2], start)
    df = df['Adj Close']
    # Normalize each series to 1.0 on the first day so both share one scale
    df = df/df.iloc[0]
    fig, ax = plt.subplots()
    df.plot(ax=ax)

In [17]:
visualize_correlation("AAPL", "TLT")

[Figure: adjusted close prices of AAPL and TLT, each normalized to 1.0 on the first trading day]

In [18]:
visualize_correlation("^GSPC", "TLT")

[Figure: adjusted close prices of ^GSPC and TLT, each normalized to 1.0 on the first trading day]
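
The normalization step `df/df.iloc[0]` is what makes two assets with very different price levels comparable on one chart. A minimal sketch with made-up prices (the column names `X` and `Y` are placeholders) shows the effect: every column starts at 1.0, and later values read as growth relative to the first day.

```python
import pandas as pd

# Made-up prices on very different scales
prices = pd.DataFrame({
    "X": [50.0, 55.0, 60.0],
    "Y": [200.0, 190.0, 210.0],
})

# Divide every row by the first row; pandas matches the columns by name
normalized = prices / prices.iloc[0]
print(normalized)
```

Here `X` ends at about 1.2 (up 20%) and `Y` at about 1.05 (up 5%), even though `Y` trades at a much higher price.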

# End