# Manually calculating a correlation coefficient

$r = \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i (x_i - \overline{x})^2 \sum_i (y_i - \overline{y})^2}}$

In [3]:
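import numpy as np
import pandas as pd
# 5 rows of random integers in [0, 10) for columns x and y.
# No seed is set, so the values differ on every run.
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 2)), columns=list('xy'))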

In [5]:
df

|   | x | y |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 4 | 3 |
| 2 | 8 | 3 |
| 3 | 1 | 5 |
| 4 | 1 | 8 |
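
Because `np.random.randint` is called without a seed, rerunning the first cell produces different numbers. One way to reproduce the exact values used in this walkthrough is to build the frame by hand. This is a minimal sketch, with the literals copied from the output above:

```python
import pandas as pd

# Same x and y values as the table above, entered by hand so that
# every later step reproduces the numbers in this walkthrough.
df = pd.DataFrame({"x": [2, 4, 8, 1, 1], "y": [1, 3, 3, 5, 8]})

# Column means used in steps 1 and 2.
print(df.x.mean(), df.y.mean())
```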


In [10]:
df.x.mean()

3.2

Step 1 - $x_i - \overline{x}$

In [11]:
df['step1'] = df.x - df.x.mean()

In [12]:
df

|   | x | y | step1 |
|---|---|---|-------|
| 0 | 2 | 1 | -1.2 |
| 1 | 4 | 3 | 0.8 |
| 2 | 8 | 3 | 4.8 |
| 3 | 1 | 5 | -2.2 |
| 4 | 1 | 8 | -2.2 |


Step 2 - $y_i - \overline{y}$

In [13]:
df['step2'] = df.y - df.y.mean()

In [14]:
df

|   | x | y | step1 | step2 |
|---|---|---|-------|-------|
| 0 | 2 | 1 | -1.2 | -3.0 |
| 1 | 4 | 3 | 0.8 | -1.0 |
| 2 | 8 | 3 | 4.8 | -1.0 |
| 3 | 1 | 5 | -2.2 | 1.0 |
| 4 | 1 | 8 | -2.2 | 4.0 |


Step 3 - $(x_i - \overline{x})(y_i - \overline{y})$

In [15]:
df['step3'] = df.step1 * df.step2

In [16]:
df

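|   | x | y | step1 | step2 | step3 |
|---|---|---|-------|-------|-------|
| 0 | 2 | 1 | -1.2 | -3.0 | 3.6 |
| 1 | 4 | 3 | 0.8 | -1.0 | -0.8 |
| 2 | 8 | 3 | 4.8 | -1.0 | -4.8 |
| 3 | 1 | 5 | -2.2 | 1.0 | -2.2 |
| 4 | 1 | 8 | -2.2 | 4.0 | -8.8 |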


Step 4 - $\sum_i (x_i - \overline{x})(y_i - \overline{y})$

In [17]:
step4 = df.step3.sum()

In [18]:
step4

-13.0
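
As a check on the sum, the five `step3` values from the table can be added by hand:

$$3.6 + (-0.8) + (-4.8) + (-2.2) + (-8.8) = -13.0$$

The negative total already indicates that the correlation will be negative, since the denominator of $r$ is a square root and cannot be negative.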

Step 5 - $(x_i - \overline{x})^2$

In [19]:
df['step5'] = df.step1 ** 2

Step 6 - $(y_i - \overline{y})^2$

In [20]:
df['step6'] = df.step2 ** 2

In [21]:
df

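|   | x | y | step1 | step2 | step3 | step5 | step6 |
|---|---|---|-------|-------|-------|-------|-------|
| 0 | 2 | 1 | -1.2 | -3.0 | 3.6 | 1.44 | 9.0 |
| 1 | 4 | 3 | 0.8 | -1.0 | -0.8 | 0.64 | 1.0 |
| 2 | 8 | 3 | 4.8 | -1.0 | -4.8 | 23.04 | 1.0 |
| 3 | 1 | 5 | -2.2 | 1.0 | -2.2 | 4.84 | 1.0 |
| 4 | 1 | 8 | -2.2 | 4.0 | -8.8 | 4.84 | 16.0 |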


Step 7 - $\sum_i (x_i - \overline{x})^2 \sum_i (y_i - \overline{y})^2$

In [22]:
step7 = df.step5.sum() * df.step6.sum()
step7

974.3999999999999
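
Mathematically the product is 34.8 × 28 = 974.4; the trailing digits in the output come from binary floating-point rounding, not from a mistake in the calculation. A small sketch of checking the value with a tolerance instead of exact equality, using arrays copied from the `step5` and `step6` columns above:

```python
import numpy as np

# Squared deviations copied from the step5 and step6 columns above.
step5 = np.array([1.44, 0.64, 23.04, 4.84, 4.84])
step6 = np.array([9.0, 1.0, 1.0, 1.0, 16.0])

step7 = step5.sum() * step6.sum()

# Compare with a tolerance rather than ==, since floats are inexact.
print(np.isclose(step7, 974.4))
```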

Step 8 - $\sqrt{\sum_i (x_i - \overline{x})^2 \sum_i (y_i - \overline{y})^2}$

In [26]:
step8 = np.sqrt(step7)
step8

31.215380824202672

Final Step - $r = \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i (x_i - \overline{x})^2 \sum_i (y_i - \overline{y})^2}}$

In [27]:
step4/step8

-0.4164613615708485

In [30]:
df.x.corr(df.y)  # compare with the pandas calculation (Pearson by default)

-0.4164613615708485
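
The individual steps can also be collected into a single function. The following is a minimal sketch and not part of the original notebook: the name `pearson_r` is hypothetical, the inputs are the x and y values from the walkthrough, and `np.corrcoef` is used only as an independent check.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, following steps 1 to 8 above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                 # step 1: deviations of x from its mean
    dy = y - y.mean()                 # step 2: deviations of y from its mean
    numerator = (dx * dy).sum()       # steps 3 and 4: sum of products
    # steps 5 to 8: square root of the product of the sums of squares
    denominator = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    return numerator / denominator

x = [2, 4, 8, 1, 1]
y = [1, 3, 3, 5, 8]
r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])
```

The denominator is zero when either input is constant, so $r$ is undefined in that case; this sketch does not guard against it.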