In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Correlation

<hr>

__A. ⭐ Pearson Method__

Pearson correlation measures the strength and direction of the linear relationship between two variables X and Y. It is also known as:
- __the Pearson correlation coefficient (PCC)__ 
- __Pearson's r__
- __the Pearson product-moment correlation coefficient (PPMCC)__
- __the bivariate correlation__

By the __Cauchy–Schwarz inequality__, its value always lies between −1 and +1, where:
- 1 is total positive linear correlation 
- 0 is no linear correlation
- −1 is total negative linear correlation
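
To make these three reference values concrete, here is a minimal sketch using `np.corrcoef` on made-up arrays (the data is illustrative and not part of the notebook's datasets). The last case has a perfect but nonlinear relationship, and r still comes out as 0, which is why 0 means only "no linear correlation".

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # illustrative data, not from the notebook

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two inputs
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # increasing straight line
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # decreasing straight line
r_zero = np.corrcoef(x, x ** 2)[0, 1]     # symmetric parabola: dependent, but not linearly

print(r_pos, r_neg, r_zero)
```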

_Pearson formula_:
    
$$\displaystyle r = \frac {n(\sum{x y}) - (\sum{x})(\sum{y})} {\sqrt {(n \sum{x^2} - (\sum{x})^2)(n \sum{y^2} - (\sum{y})^2)}} $$

In [10]:
# 'mesin' = engine (x), 'harga' = price (y)
df = pd.DataFrame({
    'mesin (x)': [1000, 2000, 3000, 4000, 5000],
    'harga (y)': [10, 25, 35, 55, 80]
})
df

Unnamed: 0,mesin (x),harga (y)
0,1000,10
1,2000,25
2,3000,35
3,4000,55
4,5000,80


In [11]:
# Pearson correlation using Pandas
df.corr(method='pearson') # default method = 'pearson'

Unnamed: 0,mesin (x),harga (y)
mesin (x),1.0,0.98644
harga (y),0.98644,1.0


In [18]:
# Pearson Manual calculation
df['x^2'] = df['mesin (x)'] ** 2
df['y^2'] = df['harga (y)'] ** 2
df['xy'] = df['mesin (x)'] * df['harga (y)']
df

Unnamed: 0,mesin (x),harga (y),x^2,y^2,xy
0,1000,10,1000000,100,10000
1,2000,25,4000000,625,50000
2,3000,35,9000000,1225,105000
3,4000,55,16000000,3025,220000
4,5000,80,25000000,6400,400000


In [19]:
sumX = df['mesin (x)'].sum()
sumY = df['harga (y)'].sum()
sumX2 = df['x^2'].sum()
sumY2 = df['y^2'].sum()
sumXY = df['xy'].sum()
n = df['xy'].count()

n, sumX, sumY, sumX2, sumY2, sumXY

(5, 15000, 205, 55000000, 11375, 785000)

In [23]:
r = ((n * sumXY) - (sumX * sumY)) / (((n * sumX2) - (sumX ** 2)) * ((n * sumY2) - (sumY ** 2))) ** 0.5
r

0.9864400504156211
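
As a cross-check, the same coefficient can be obtained from the mean-centred form of the formula and from NumPy's `np.corrcoef`. This is only a sketch on the same five data points. The variable names `r_centred` and `r_numpy` are introduced here for illustration.

```python
import numpy as np

x = np.array([1000, 2000, 3000, 4000, 5000], dtype=float)  # mesin (x)
y = np.array([10, 25, 35, 55, 80], dtype=float)            # harga (y)

# Mean-centred form: r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
dx, dy = x - x.mean(), y - y.mean()
r_centred = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's correlation matrix; element [0, 1] is r between x and y
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_centred, r_numpy)
```

Both values should agree with the raw-sum calculation above up to floating-point rounding.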

<hr>

__B. ⭐ Spearman Method__

Spearman correlation is a non-parametric measure of rank correlation (statistical dependence between the rankings of two variables). Also known as: 
- __Spearman's rank correlation coefficient__
- __Spearman's $\displaystyle \rho$__ (rho)
- __Spearman's $\displaystyle r_{s}$__

_Spearman formula_:
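- There are __no "tied ranks"__ (no two data points share the same value and rank):
    $$\displaystyle \rho = 1 - \frac {6 \sum{d^2_{i}}} {n(n^2 - 1)}$$
    
    $d_i$ = difference between the two ranks of observation $i$, $n$ = number of observations
    
- There are __"tied ranks"__ (some data points share the same rank): apply the Pearson formula to the ranks
    $$\displaystyle \rho = \frac {\sum{(x - \bar{x})(y - \bar{y})}} {\sqrt {\sum{(x - \bar{x})^2} \sum{(y - \bar{y})^2}}}$$
    
    $x$, $y$ = ranks of the two variables (tied values receive the average of their ranks)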
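
The difference between the two cases can be sketched on a small made-up dataset that contains ties (the values below are hypothetical, chosen only to produce tied ranks). With ties, Pearson's formula applied to the average ranks matches `df.corr(method='spearman')`, while the $d^2$ shortcut gives a slightly different number.

```python
import pandas as pd

# Hypothetical data with ties: 20 appears twice in x, 3 appears twice in y
tied = pd.DataFrame({'x': [10, 20, 20, 30, 40], 'y': [1, 3, 2, 3, 5]})

# Average ranks: tied values share the mean of the ranks they occupy
ranks = tied.rank(method='average')

# Tied-rank formula: Pearson correlation computed on the ranks
rho_tied = ranks.corr(method='pearson').loc['x', 'y']

# Shortcut formula, which is only exact when there are no ties
n = len(tied)
d2 = ((ranks['x'] - ranks['y']) ** 2).sum()
rho_shortcut = 1 - (6 * d2) / (n * (n ** 2 - 1))

rho_pandas = tied.corr(method='spearman').loc['x', 'y']
print(rho_tied, rho_shortcut, rho_pandas)
```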

In [43]:
df = pd.DataFrame({
    'Math': [56, 75, 45, 71, 62, 64, 58, 80, 76, 61],
    'Physics': [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
})
df

Unnamed: 0,Math,Physics
0,56,66
1,75,70
2,45,40
3,71,60
4,62,65
5,64,56
6,58,59
7,80,77
8,76,67
9,61,63


In [46]:
# Pandas spearman correlation
df.corr(method='spearman')

Unnamed: 0,Math,Physics
Math,1.0,0.672727
Physics,0.672727,1.0


In [47]:
# Manual: 1. sort & rank data
df = df.sort_values(by='Math', ascending=False)
df['Rank Math'] = np.arange(1, 11)
df = df.sort_values(by='Physics', ascending=False)
df['Rank Phy'] = np.arange(1, 11)
df = df.sort_index()
df

Unnamed: 0,Math,Physics,Rank Math,Rank Phy
0,56,66,9,4
1,75,70,3,2
2,45,40,10,10
3,71,60,4,7
4,62,65,6,5
5,64,56,5,9
6,58,59,8,8
7,80,77,1,1
8,76,67,2,3
9,61,63,7,6


In [48]:
# Manual: 2. calculate d (difference between rank)
df['d'] = df['Rank Math'] - df['Rank Phy']
df['d^2'] = df['d'] ** 2
df

Unnamed: 0,Math,Physics,Rank Math,Rank Phy,d,d^2
0,56,66,9,4,5,25
1,75,70,3,2,1,1
2,45,40,10,10,0,0
3,71,60,4,7,-3,9
4,62,65,6,5,1,1
5,64,56,5,9,-4,16
6,58,59,8,8,0,0
7,80,77,1,1,0,0
8,76,67,2,3,-1,1
9,61,63,7,6,1,1


In [50]:
# Manual: 3. calculate ρ
n = len(df)  # number of observations (10 here)
sumd2 = df['d^2'].sum()
p = 1 - ((6 * sumd2) / (n * ((n**2) - 1)))
p

0.6727272727272727
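
The sort-and-number steps above work because this dataset has no ties. A more compact sketch of the same calculation uses `DataFrame.rank`, which also handles ties by assigning average ranks. The names `ranks`, `d2` and `rho` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Math': [56, 75, 45, 71, 62, 64, 58, 80, 76, 61],
    'Physics': [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
})

# ascending=False gives rank 1 to the highest score;
# the default method='average' would give tied values the mean of their ranks
ranks = df.rank(ascending=False)

n = len(df)
d2 = ((ranks['Math'] - ranks['Physics']) ** 2).sum()
rho = 1 - (6 * d2) / (n * (n ** 2 - 1))

print(rho)
```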

<hr>

__C. ⭐ Kendall Method__

Kendall correlation is a statistic that measures the ordinal association between two measured quantities. It is also known as:
- __Kendall rank correlation coefficient__
- __Kendall's τ (tau) coefficient__

- _Kendall Tau-A formula_:

$$\tau_{a} = \frac {n_{c} - n_{d}} {n_{0}}$$

- _Kendall Tau-B formula_:

$$\tau_{b} = \frac {n_{c} - n_{d}} {\sqrt{(n_{0} - n_{1})(n_{0} - n_{2})}}$$
    
- _Kendall Tau-C formula_:

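$$\tau_{c} = \frac {2(n_{c} - n_{d})} {n^2 \frac {m-1} {m}}$$

where:
- $n_{c}$ = number of concordant pairs, $n_{d}$ = number of discordant pairs
- $n$ = number of observations, $n_{0} = n(n-1)/2$ = total number of pairs
- $n_{1} = \sum_{i} t_{i}(t_{i}-1)/2$, with $t_{i}$ the number of tied values in the $i$-th group of ties of the first quantity
- $n_{2} = \sum_{j} u_{j}(u_{j}-1)/2$, with $u_{j}$ the number of tied values in the $j$-th group of ties of the second quantity
- $m$ = the smaller of the number of rows and columns of the contingency table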
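
A minimal sketch of the Tau-A and Tau-B formulas, counting concordant and discordant pairs by brute force on the Math/Physics scores used in the Spearman section. The helper `tie_term` and the variable names are assumptions made for this illustration. Tau-C is left out because it is defined on a contingency table rather than on raw scores.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

math_scores = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
physics_scores = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

n = len(math_scores)
n_c = n_d = 0
for i, j in combinations(range(n), 2):
    # Same sign of both differences -> concordant; opposite signs -> discordant
    s = (math_scores[i] - math_scores[j]) * (physics_scores[i] - physics_scores[j])
    if s > 0:
        n_c += 1
    elif s < 0:
        n_d += 1
    # s == 0 means the pair is tied in at least one variable and counts as neither

def tie_term(values):
    """Sum of t*(t-1)/2 over groups of equal values (0 when there are no ties)."""
    return sum(t * (t - 1) // 2 for t in Counter(values).values())

n_0 = n * (n - 1) // 2
n_1 = tie_term(math_scores)
n_2 = tie_term(physics_scores)

tau_a = (n_c - n_d) / n_0
tau_b = (n_c - n_d) / sqrt((n_0 - n_1) * (n_0 - n_2))
print(n_c, n_d, tau_a, tau_b)
```

Because this dataset has no ties, $n_{1} = n_{2} = 0$ and the two coefficients coincide.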

https://www.statisticshowto.datasciencecentral.com/kendalls-tau/