# Correlation

**Correlation** is a measure of the strength and direction of the linear relationship between two variables. Correlation ranges from -1 to 1.
+ A perfect positive correlation is 1,
+ no correlation is 0, and 
+ a perfect negative correlation is -1. 
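The three reference values can be reproduced with a few made-up data points. This is a minimal sketch using `np.corrcoef`; the arrays are illustrative and not part of the notebook's data.

```python
import numpy as np

# Illustrative data (an assumption, not taken from the notebook)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# y rises linearly with x: perfect positive correlation
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls linearly with x: perfect negative correlation
r_negative = np.corrcoef(x, -3 * x)[0, 1]

# y = x**2 is symmetric around 0, so the linear association cancels out
r_zero = np.corrcoef(x, x ** 2)[0, 1]

print(r_positive, r_negative, r_zero)
```

Note that in the last case y is fully determined by x, yet the correlation is 0. A correlation of 0 only rules out a linear relationship.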

The calculation has two parts.
+ The first part is 1 / (n - 1), where n is the number of observations.
+ The second part is the sum of the products of the z-scores (also known as standard scores) of the two variables, x and y.

The correlation coefficient, denoted r, is the first part multiplied by the second part.

### Computing Correlation

Formula for correlation:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} \, z_{y_i}$$
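Substituting the z-score definition into the formula expresses r in terms of the raw observations. This is a standard identity rather than something the notebook states; the symbols for the sample means ($\bar{x}$, $\bar{y}$) and sample standard deviations ($s_x$, $s_y$) are introduced here for illustration.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left(\frac{x_i - \bar{x}}{s_x}\right)
    \left(\frac{y_i - \bar{y}}{s_y}\right)
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x\, s_y}
```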

In [1]:
import pandas as pd

In [2]:
# Let's say we have 2 variables, x and y
dict1 = {'x':[1, 2, 3, 3], 'y':[1, 2, 4, 5]}
df1 = pd.DataFrame(dict1)
df1

   x  y
0  1  1
1  2  2
2  3  4
3  3  5


In [3]:
# Let's calculate the mean of x
df1['x'].mean()

2.25

In [4]:
# sample standard deviation of x (pandas divides by n-1 by default)
df1['x'].std()

0.9574271077563381

In [5]:
# mean of y
df1['y'].mean()

3.0

In [6]:
# sample standard deviation of y (pandas divides by n-1 by default)
df1['y'].std()

1.8257418583505538
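The 1/(n-1) factor in the correlation formula pairs with the sample standard deviation, which is what pandas' `.std()` returns by default. NumPy's `np.std` defaults to the population version instead, so mixing the two would give slightly different z-scores. A small sketch of the difference, reusing the x values from above:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 3])

# pandas: ddof=1 by default (divide by n - 1)
sample_std = x.std()

# NumPy: ddof=0 by default (divide by n)
population_std = np.std(x.to_numpy())

# NumPy matches pandas once ddof=1 is passed explicitly
numpy_sample_std = np.std(x.to_numpy(), ddof=1)

print(sample_std, population_std, numpy_sample_std)
```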

Recall how the standard score (also called z-score) is calculated.

standard score = (observation - mean) / standard deviation

In [7]:
# Create a new column with z-scores for x and y
df1['zscore_x'] = (df1['x'] - df1['x'].mean())/df1['x'].std()
df1['zscore_y'] = (df1['y'] - df1['y'].mean())/df1['y'].std()
df1

   x  y  zscore_x  zscore_y
0  1  1 -1.305582 -1.095445
1  2  2 -0.261116 -0.547723
2  3  4  0.783349  0.547723
3  3  5  0.783349  1.095445


In [8]:
# Let's create another new column with the product of the z-scores for x and y
df1['zscore_product_xy'] = df1['zscore_x']*df1['zscore_y']
df1

   x  y  zscore_x  zscore_y  zscore_product_xy
0  1  1 -1.305582 -1.095445           1.430194
1  2  2 -0.261116 -0.547723           0.143019
2  3  4  0.783349  0.547723           0.429058
3  3  5  0.783349  1.095445           0.858116


In [9]:
# Apply the formula of Correlation (r)

# n is the number of observations
n = df1.shape[0]

r = (1/(n-1))*(df1['zscore_product_xy'].sum())
r

0.9534625892455924
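As a sanity check, the hand-computed value can be compared with the built-in Pearson implementations in pandas and NumPy. The sketch below rebuilds the same four observations so it runs on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 3], "y": [1, 2, 4, 5]})

# Manual calculation: z-scores, their products, then 1 / (n - 1) times the sum
zx = (df["x"] - df["x"].mean()) / df["x"].std()
zy = (df["y"] - df["y"].mean()) / df["y"].std()
r_manual = (zx * zy).sum() / (len(df) - 1)

# Built-in Pearson correlation in pandas and NumPy
r_pandas = df["x"].corr(df["y"])
r_numpy = np.corrcoef(df["x"], df["y"])[0, 1]

print(r_manual, r_pandas, r_numpy)
```

All three values should agree up to floating-point rounding.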

In [10]:
# Now, let's look at some other examples

import numpy as np
import pandas as pd

In [11]:
# Read in file

sports = pd.read_csv("sports.csv", skiprows=2)
# Reset column names
col_names = ['Month', 'Golf', 'Soccer', 'Tennis', 'Hockey', 'Baseball']
sports.columns = col_names

# Set index
sports.set_index('Month', inplace=True)

sports.head()

         Golf  Soccer  Tennis  Hockey  Baseball
Month                                          
2004-01    45      21      13      22        24
2004-02    50      24      13      23        32
2004-03    63      27      15      23        45
2004-04    80      29      16      16        53
2004-05    82      31      17      14        52


In [12]:
# pandas computes this directly with Series.corr (Pearson correlation by default)

sports['Golf'].corr(sports['Baseball'])

0.6812491123136919
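Two properties of `Series.corr` are worth knowing: the result does not depend on which series comes first, and it is unchanged when a series is rescaled by a positive linear transformation. The sketch below illustrates both using only the five rows shown by `sports.head()`, so its value will differ from the full-dataset result above.

```python
import pandas as pd

# Only the first five rows from sports.head(); not the full dataset
sample = pd.DataFrame(
    {"Golf": [45, 50, 63, 80, 82], "Baseball": [24, 32, 45, 53, 52]},
    index=["2004-01", "2004-02", "2004-03", "2004-04", "2004-05"],
)

r_sample = sample["Golf"].corr(sample["Baseball"])

# Swapping the two series gives the same value
r_swapped = sample["Baseball"].corr(sample["Golf"])

# Changing the units of one series (positive scale and shift) does not change r
r_rescaled = (sample["Golf"] * 10 + 3).corr(sample["Baseball"])

print(r_sample, r_swapped, r_rescaled)
```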

### Correlation Matrix

In [13]:
sports_correlation = sports[["Golf", "Soccer", "Tennis", "Hockey", "Baseball"]].corr()
print(sports_correlation)

              Golf    Soccer    Tennis    Hockey  Baseball
Golf      1.000000  0.442875  0.694183 -0.473888  0.681249
Soccer    0.442875  1.000000  0.379443 -0.404575  0.379608
Tennis    0.694183  0.379443  1.000000 -0.416072  0.282095
Hockey   -0.473888 -0.404575 -0.416072  1.000000 -0.329349
Baseball  0.681249  0.379608  0.282095 -0.329349  1.000000
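A correlation matrix is symmetric with 1s on the diagonal, so each pair appears twice. One way to rank the pairs is to keep only the entries above the diagonal and sort them by absolute value. The sketch below re-enters the printed matrix by hand so it runs without `sports.csv`; the helper names are illustrative.

```python
import numpy as np
import pandas as pd

names = ["Golf", "Soccer", "Tennis", "Hockey", "Baseball"]
# Values copied from the printed correlation matrix above
values = [
    [1.000000, 0.442875, 0.694183, -0.473888, 0.681249],
    [0.442875, 1.000000, 0.379443, -0.404575, 0.379608],
    [0.694183, 0.379443, 1.000000, -0.416072, 0.282095],
    [-0.473888, -0.404575, -0.416072, 1.000000, -0.329349],
    [0.681249, 0.379608, 0.282095, -0.329349, 1.000000],
]
corr = pd.DataFrame(values, index=names, columns=names)

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One entry per unordered pair; masked cells become NaN and are dropped
pairs = corr.where(upper).stack().dropna()

# Rank by strength of association, ignoring sign
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

With these values the strongest pair is Golf and Tennis (0.694183), followed by Golf and Baseball (0.681249).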


### Rolling Correlation

In [14]:
# Correlation over a rolling window of 12 rows (12 months); the first 11 rows are NaN
sports['Golf'].rolling(12).corr(sports['Soccer'])

Month
2004-01         NaN
2004-02         NaN
2004-03         NaN
2004-04         NaN
2004-05         NaN
             ...   
2019-07    0.717964
2019-08    0.700039
2019-09    0.688190
2019-10    0.693513
2019-11    0.661489
Length: 191, dtype: float64
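Each value in a rolling correlation is the ordinary correlation of the most recent `window` rows, which is why the first `window - 1` entries are NaN. The sketch below checks this on a short made-up pair of series with a window of 3; the data and names are illustrative, not from `sports.csv`.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the notebook)
a = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 7.0])
b = pd.Series([2.0, 1.0, 3.0, 5.0, 4.0, 8.0])
window = 3

rolling_r = a.rolling(window).corr(b)

# Recompute the final value by hand from the last `window` rows
manual_last = a.iloc[-window:].corr(b.iloc[-window:])

print(rolling_r)
print(manual_last)
```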
