# Pearson’s Product-Moment Correlation Coefficient, RV, and Canonical Correlation Analysis

In [1]:
from mgcpy.benchmarks.simulations import linear_sim
from mgcpy.independence_tests.rv_corr import RVCorr

Let $X$ and $Y$ be random variables, with realizations $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$, respectively. 

## Pearson's Product-Moment Correlation
Pearson correlation is a measure of the linear dependence between two univariate random variables.
Given sample data $\bf{x}$ and $\bf{y}$ where $p = q = 1$, the sample Pearson correlation is

$$\begin{equation} \label{eq:genpearson}
    \text{Pearson}_n \left( \bf{x}, \bf{y} \right) = \frac{\hat{\text{cov}} \left( \bf{x}, \bf{y} \right)}{\hat{\sigma}_{\bf{x}} \hat{\sigma}_{\bf{y}}},
\end{equation}$$

where $\hat{\text{cov}} \left( \bf{x}, \bf{y} \right)$ is the sample covariance, $\hat{\sigma}_{\bf{x}}$ and $\hat{\sigma}_{\bf{y}}$ are the sample standard deviations of $\bf{x}$ and $\bf{y}$ respectively.
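To make the formula concrete, here is a minimal NumPy sketch that evaluates the sample Pearson correlation directly from its definition. It is an illustration only, not the `RVCorr` implementation; the function name `pearson_from_definition` and the toy data are assumptions introduced for this example.

```python
import numpy as np


def pearson_from_definition(x, y):
    """Sample Pearson correlation of two 1-D samples (hypothetical helper)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    # Sample covariance with the n - 1 normalization
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    # Divide by the product of the sample standard deviations
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))


# Toy data (assumed for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, size=100)
y_demo = x_demo + rng.normal(scale=0.5, size=100)

r_demo = pearson_from_definition(x_demo, y_demo)
print("Pearson from the definition: %.2f" % r_demo)
```

The result should agree with `np.corrcoef(x_demo, y_demo)[0, 1]` up to floating-point error, since the normalization constants cancel in the ratio.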

This test returns a statistic between -1 and 1 that measures the linear correlation between two vectors. <b>Note: The inputs for this test must be 1-dimensional.</b>

Create an `RVCorr` object with `'pearson'` as the `which_test` parameter and then call the `test_statistic` method. The example below simulates linear data and computes the Pearson test statistic from it:

In [2]:
x, y = linear_sim(100, 1)

pearson = RVCorr(which_test='pearson')
test_stat = pearson.test_statistic(x, y)[0]
print("Pearson's test statistic: %.2f" % test_stat)

Pearson's test statistic: 0.50


p-values are calculated via permutation tests, as in other packages. This is done by permuting $\bf{y}$ and computing the test statistic for each permutation. The resulting empirical distribution estimates the distribution of the test statistic under the null. The p-value is the number of permuted test statistics that are greater than or equal to the observed statistic, divided by the `replication_factor`.

In [3]:
p_value = pearson.p_value(x, y)[0]
print("Pearson's p-value: %.2f" % p_value)

Pearson's p-value: 0.00
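The permutation procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the code that `p_value` runs: the helper names `permutation_p_value` and `pearson_stat`, the `seed` argument, and the toy data are all hypothetical, and `replication_factor` is treated here as the number of permutations.

```python
import numpy as np


def pearson_stat(x, y):
    """Test statistic used in this sketch: the sample Pearson correlation."""
    return np.corrcoef(x, y)[0, 1]


def permutation_p_value(x, y, statistic, replication_factor=1000, seed=0):
    """Fraction of permuted statistics >= the observed one (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    count = 0
    for _ in range(replication_factor):
        # Shuffling y breaks any dependence between x and y
        y_perm = y[rng.permutation(len(y))]
        if statistic(x, y_perm) >= observed:
            count += 1
    return count / replication_factor


# Toy data (assumed for illustration): a noisy linear relationship
rng = np.random.default_rng(1)
x_perm_demo = rng.uniform(-1, 1, size=100)
y_perm_demo = x_perm_demo + rng.normal(scale=0.5, size=100)

p_demo = permutation_p_value(x_perm_demo, y_perm_demo, pearson_stat)
print("Permutation p-value: %.3f" % p_demo)
```

Because the comparison is one-sided, this sketch detects positive association; a statistic that is non-negative by construction, such as RV, needs no such caveat.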


## RV
RV is a multivariate generalization of the squared Pearson coefficient. The derivation is as follows: assuming each column of $\bf{x}$ and $\bf{y}$ is pre-centered to zero mean, the sample covariance matrix is $\bf{\hat{\Sigma}_{xy}} = {\bf{x}} \bf{y}^T$, and the RV coefficient is

$$\begin{equation} \label{eq:rv}
    \text{RV}_n \left( \bf{x}, \bf{y} \right) = \frac{\text{tr} \left( \bf{\hat{\Sigma}_{xy}} \bf{\hat{\Sigma}_{yx}} \right)}{\sqrt{\text{tr} \left( \bf{\hat{\Sigma}^2_{xx}} \right) \text{tr} \left( \bf{\hat{\Sigma}^2_{yy}} \right)}}.
\end{equation}$$
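As a rough illustration, the RV coefficient can be evaluated with plain NumPy. The sketch below uses the usual normalization, dividing by the square root of the product of the two traces, so that the value lies in $[0, 1]$ and reduces to the squared Pearson correlation when $p = q = 1$. It assumes samples are stored in rows (so the cross-covariance is formed as `x.T @ y`); the function name `rv_coefficient` and the toy data are hypothetical and not part of `RVCorr`.

```python
import numpy as np


def rv_coefficient(x, y):
    """RV coefficient of two data matrices with samples in rows (hypothetical helper)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Center every column to zero mean
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Covariance matrices; constant factors cancel in the ratio
    sigma_xy = x.T @ y
    sigma_xx = x.T @ x
    sigma_yy = y.T @ y
    numerator = np.trace(sigma_xy @ sigma_xy.T)
    denominator = np.sqrt(np.trace(sigma_xx @ sigma_xx) * np.trace(sigma_yy @ sigma_yy))
    return numerator / denominator


# Toy data (assumed for illustration): 100 samples, 3 dimensions each
rng = np.random.default_rng(2)
x_rv_demo = rng.normal(size=(100, 3))
y_rv_demo = x_rv_demo + rng.normal(scale=0.5, size=(100, 3))

rv_demo = rv_coefficient(x_rv_demo, y_rv_demo)
print("RV from the definition: %.2f" % rv_demo)

# One-dimensional check: RV equals the squared Pearson correlation
rv_1d = rv_coefficient(x_rv_demo[:, 0], y_rv_demo[:, 0])
r_1d = np.corrcoef(x_rv_demo[:, 0], y_rv_demo[:, 0])[0, 1]
```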

As with Pearson, create an `RVCorr` object with `'rv'` as the `which_test` parameter and then call the `test_statistic` method. The example below simulates linear data and computes the RV test statistic from it:

In [4]:
x, y = linear_sim(100, 3)

rv = RVCorr(which_test='rv')
test_stat = rv.test_statistic(x, y)[0]
print("RV test statistic: %.2f" % test_stat)

RV test statistic: 0.57


p-values are calculated via permutation tests, as in other packages. This is done by permuting $\bf{y}$ and computing the test statistic for each permutation. The resulting empirical distribution estimates the distribution of the test statistic under the null. The p-value is the number of permuted test statistics that are greater than or equal to the observed statistic, divided by the `replication_factor`.

In [5]:
p_value = rv.p_value(x, y)[0]
print("RV p-value: %.2f" % p_value)

RV p-value: 0.00


## CCA
Another similarly defined tool is canonical correlation analysis (CCA), which finds the linear combinations of the dimensions of $\bf{x}$ and $\bf{y}$ that maximize their correlation.
It seeks vectors $\bf{a} \in {\mathbb{R}}^p$ and $\bf{b} \in {\mathbb{R}}^q$ to compute the first canonical correlation coefficient as

$$\begin{equation} \label{eq:tomaxcca}
 \max_{\bf{a} \in {\mathbb{R}}^p, \bf{b} \in {\mathbb{R}}^q}{ \frac{{\bf{a}} ^T \bf{\hat{\Sigma}_{xy} b}}{\sqrt{{\bf{a}} ^T \bf{\hat{\Sigma}_{xx} a}} \sqrt{{\bf{b}} ^T \bf{\hat{\Sigma}_{yy} b}}}}.
\end{equation}$$

The second, third, and subsequent canonical correlation coefficients can be derived in a similar manner, and CCA can also be generalized to more than two random variables. CCA can therefore be used to define a test statistic for dependence; the first canonical correlation coefficient or the sum of all canonical correlation coefficients is usually taken as the statistic.
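One way to compute the canonical correlations numerically, shown here only as a sketch, is the QR-based approach: after centering, take orthonormal bases of the column spaces of the two data matrices and read the canonical correlations off the singular values of their cross-product. This is a standard technique but is not necessarily what `RVCorr` does internally. It assumes samples are stored in rows and that both centered matrices have full column rank; the function name `canonical_correlations` and the toy data are hypothetical.

```python
import numpy as np


def canonical_correlations(x, y):
    """Canonical correlations, largest first, via QR and SVD (hypothetical helper)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Center every column to zero mean
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of x and y
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the canonical correlations
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    # Guard against values drifting slightly above 1 from rounding
    return np.clip(s, 0.0, 1.0)


# Toy data (assumed for illustration): y is a noisy linear map of x
rng = np.random.default_rng(3)
x_cca_demo = rng.normal(size=(100, 3))
weights = rng.normal(size=(3, 2))
y_cca_demo = x_cca_demo @ weights + rng.normal(size=(100, 2))

cc_demo = canonical_correlations(x_cca_demo, y_cca_demo)
print("First canonical correlation: %.2f" % cc_demo[0])
print("Sum of canonical correlations: %.2f" % cc_demo.sum())
```

The first entry corresponds to the maximization above, and the sum of all entries gives the alternative statistic mentioned in the text.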

As with the other tests, create an `RVCorr` object with `'cca'` as the `which_test` parameter and then call the `test_statistic` method. The example below simulates linear data and computes the CCA test statistic from it:

In [6]:
x, y = linear_sim(100, 6)

cca = RVCorr(which_test='cca')
test_stat = cca.test_statistic(x, y)[0]
print("CCA test statistic: %.2f" % test_stat)

CCA test statistic: 0.37


p-values are calculated via permutation tests, as in other packages. This is done by permuting $\bf{y}$ and computing the test statistic for each permutation. The resulting empirical distribution estimates the distribution of the test statistic under the null. The p-value is the number of permuted test statistics that are greater than or equal to the observed statistic, divided by the `replication_factor`.

In [7]:
p_value = cca.p_value(x, y)[0]
print("CCA p-value: %.2f" % p_value)

CCA p-value: 0.00
