## Pearson Correlation Coefficient (CC)

The Pearson correlation coefficient is defined as $r = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$,
and measures the strength and direction of the linear relationship between two variables.

Given $n$ paired samples $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$, we estimate $r$ from the sample
covariance and sample standard deviations. The $\frac{1}{n-1}$ factors cancel, leaving:

$$
\begin{aligned}
r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2}\sum{(y_i - \bar{y})^2}}}
\end{aligned}
$$

$r$ takes values between -1 and 1. The larger its absolute value, the stronger the linear
relationship between x and y; the sign gives the direction of that relationship.



In [8]:
import numpy as np

In [9]:
def pearson_correlation(x, y):
    if x.shape != y.shape:
        raise ValueError("x and y shape should be the same")
    
    x_diffs = x - x.mean()
    y_diffs = y - y.mean()
    
    if np.all(x_diffs == 0) or np.all(y_diffs == 0):
        raise ValueError("x or y has constant value, cc is not defined")
    
    r = (x_diffs * y_diffs).sum() / np.sqrt((x_diffs ** 2).sum() * (y_diffs ** 2).sum())
    return r

In [10]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])
pearson_correlation(x, y)

np.float64(1.0)

In [13]:
x = np.array([1, 2, 3, 4, 5])
y = np.random.rand(5)
pearson_correlation(x, y)

np.float64(0.006683847076446301)
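As a cross-check on the hand-written function, NumPy's `np.corrcoef` returns the correlation matrix of its inputs, whose off-diagonal entry is $r$. The sketch below uses illustrative data that is not part of the cells above. It shows two cases worth remembering: a perfectly decreasing linear relationship gives $r$ close to -1, and a purely quadratic relationship on symmetric inputs gives $r$ close to 0 even though y is fully determined by x.

```python
import numpy as np

# Illustrative data (assumed for this sketch, not taken from the cells above)
x = np.array([1, 2, 3, 4, 5])
y_neg = np.array([50, 40, 30, 20, 10])  # decreasing linear function of x

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r for the pair
r_neg = np.corrcoef(x, y_neg)[0, 1]

# A nonlinear (quadratic) relationship on inputs symmetric around zero
x_sym = np.array([-2, -1, 0, 1, 2])
y_quad = x_sym ** 2
r_quad = np.corrcoef(x_sym, y_quad)[0, 1]

print(r_neg, r_quad)
```

The second case is why $r$ should be read as a measure of linear association only: a value near 0 does not rule out a strong nonlinear dependence.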

## Coefficient of Determination (R2 score)

The coefficient of determination measures how much of the variation in a dependent variable (the actual observed data, $y_i$)
is explained by a model's predictions, $\hat{y}_i$, which are computed from the independent variables. Here $\bar{y}$ is the mean of the observed data.

$$
\begin{aligned}
R^2 = 1 - \frac{\sum{(y_i - \hat{y}_i)^2}}{\sum{(y_i - \bar{y})^2}}
\end{aligned}
$$

$\sum{(y_i - \hat{y}_i)^2} = \sum{e_i ^ 2}$ is the residual sum of squares, $SS_{res}$, and
$\sum{(y_i - \bar{y})^2}$ is the total sum of squares, $SS_{tot}$, so we can also write $R^2$
as $1 - \frac{SS_{res}}{SS_{tot}} = 1 - FVU$,
where $FVU$ is the Fraction of Variance Unexplained, which compares the unexplained variance (the model's errors) with the
total variance of the data.



In [29]:
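def r2_score(y, y_hat):
    # y: observed values, y_hat: model predictions (same shape as y)
    if y.shape != y_hat.shape:
        raise ValueError("y and y_hat shape should be the same")

    ss_res = ((y - y_hat)**2).sum()     # residual sum of squares
    ss_tot = ((y - y.mean())**2).sum()  # total sum of squares

    if ss_tot == 0:
        raise ValueError("y has constant value, R2 is not defined")

    r2 = 1 - ss_res / ss_tot
    return r2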

In [31]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 3, 4, 5])
r2_score(x, y)

np.float64(1.0)

An R2 of 1 indicates a perfect fit to the data.

In [37]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([1.4, 2.3, 2.9, 4, 5])
r2_score(x, y)

np.float64(0.974)
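To see where 0.974 comes from, the arithmetic for the cell above can be written out by hand. The first argument, $(1, 2, 3, 4, 5)$, plays the role of the observed data, with mean $\bar{y} = 3$, and the second argument plays the role of the predictions:

$$
\begin{aligned}
SS_{res} &= (1 - 1.4)^2 + (2 - 2.3)^2 + (3 - 2.9)^2 + (4 - 4)^2 + (5 - 5)^2 = 0.16 + 0.09 + 0.01 = 0.26 \\
SS_{tot} &= (1 - 3)^2 + (2 - 3)^2 + (3 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 4 + 1 + 0 + 1 + 4 = 10 \\
R^2 &= 1 - \frac{0.26}{10} = 0.974
\end{aligned}
$$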

In [34]:
x = np.array([1, 2, 3, 4, 5])
y = np.random.rand(5)
pearson_correlation(x, y)

np.float64(0.37764462899758466)

In [38]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 3, 3, 3, 3])
r2_score(x, y)

np.float64(0.0)

An R2 of 0 means the model does only as well as the worst possible least-squares predictor, i.e. using the mean of the data as
the prediction for all values.

In [30]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])
r2_score(x, y)

np.float64(-444.5)

R2 cannot exceed 1, but it can be negative: a negative R2 means the model fits the data worse than just using the mean as the prediction.
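The comparison against the mean can be made explicit. The sketch below reuses the data from the last cell and names the mean predictor `baseline` (a label introduced here for illustration). Since $SS_{tot}$ is exactly the squared error of that baseline, R2 is negative whenever the model's squared error exceeds it.

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5])            # observed data
y_hat = np.array([10, 20, 30, 40, 50])   # model predictions (far off in scale)

# Hypothetical baseline for illustration: always predict the mean of y
baseline = np.full(y.shape, y.mean())

ss_res = ((y - y_hat) ** 2).sum()     # squared error of the model
ss_tot = ((y - baseline) ** 2).sum()  # squared error of the mean baseline

r2 = 1 - ss_res / ss_tot
print(ss_res, ss_tot, r2)
```

Here the model's squared error (4455) is far larger than the baseline's (10), which is what drives R2 to -444.5.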

### References

- https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
- https://en.wikipedia.org/wiki/Coefficient_of_determination
- https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/