# Multivariate Distributions 
- Covariance matrix
- Correlation

Both are ways to summarise the relationship between variables.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
dataset = pd.read_csv('../data/height_weight.csv')[['height','weight']]
dataset.head()

| | height | weight |
|---|---|---|
| 0 | 71.74 | 259.88 |
| 1 | 71.00 | 186.73 |
| 2 | 63.83 | 172.17 |
| 3 | 67.74 | 174.66 |
| 4 | 67.28 | 169.20 |


### Covariance

We've talked about variance (the average squared deviation from the mean). Covariance is, as you've guessed, similar. Let's say we have a data vector, $x^a$, which has $N$ points, so $x_i^a$ is the $i$-th element of the data vector. From the previous section we'd have that:

$$ Var^{a,a} = \frac{1}{N-1} \sum_{i=1}^N (x_i^a - \mu^a)(x_i^a - \mu^a), $$

This should look like the last section, except I've stuck $a$ in a few places. Another way of stating this is that it is the covariance of vector $x^a$ with itself. Notice there are two sets of brackets, and both use data vector $x^a$. Covariance is what you get when you change one of the letters. Like this:

$$ Var^{a,b} = \frac{1}{N-1} \sum_{i=1}^N (x_i^a - \mu^a)(x_i^b - \mu^b), $$

Easy! All we've done is make one set of brackets iterate over a different data vector. The goal is to do this for every pair of vectors you have, which forms a matrix. If we had only two vectors, our matrix is this:

$$ Cov = \begin{pmatrix} Var^{a,a} & Var^{a,b} \\ Var^{b,a} & Var^{b,b} \\ \end{pmatrix} $$

Notice how this is symmetric: $Var^{a,b} = Var^{b,a}$. The diagonals are just the variance of each data vector, and the off-diagonals are a measure of the joint spread between the two. If the concept still isn't perfect, don't worry, the examples will clear everything up.
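To make the formula concrete, here is a minimal sketch that fills in the 2x2 matrix directly from the sums above and compares it with `np.cov`. The arrays `a` and `b` are made-up toy values (not the height and weight data), and `manual_cov` is a hypothetical helper name used only for illustration.

```python
import numpy as np

# Toy data vectors (illustrative values only, not from height_weight.csv)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def manual_cov(x, y):
    """Sample covariance: sum of products of deviations, divided by N - 1."""
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)


# Fill the matrix entry by entry, following the Cov definition above
cov_manual = np.array([
    [manual_cov(a, a), manual_cov(a, b)],
    [manual_cov(b, a), manual_cov(b, b)],
])

# np.cov expects one variable per row by default, so stack a and b as rows
cov_numpy = np.cov(np.vstack([a, b]))

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

The diagonal entries are the plain variances of `a` and `b`, and the two off-diagonal entries are equal, which is the symmetry noted above.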

We can calculate the covariance using either `np.cov` ([doco here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html)) or `pd.DataFrame.cov` ([doco here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html)).

In [10]:
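# Pitfall: by default (rowvar=True) np.cov treats each ROW as a variable,
# so passing the DataFrame as-is gives an N x N matrix, not the 2 x 2 we want
covariance = np.cov(dataset)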
covariance

array([[17698.3298 , 10886.7211 , 10191.5438 , ...,  9768.2288 ,
         6992.2231 ,  4852.1306 ],
       [10886.7211 ,  6696.71645,  6269.0941 , ...,  6008.7016 ,
         4301.10545,  2984.6767 ],
       [10191.5438 ,  6269.0941 ,  5868.7778 , ...,  5625.0128 ,
         4026.4561 ,  2794.0886 ],
       ...,
       [ 9768.2288 ,  6008.7016 ,  5625.0128 , ...,  5391.3728 ,
         3859.2136 ,  2678.0336 ],
       [ 6992.2231 ,  4301.10545,  4026.4561 , ...,  3859.2136 ,
         2762.47445,  1916.9707 ],
       [ 4852.1306 ,  2984.6767 ,  2794.0886 , ...,  2678.0336 ,
         1916.9707 ,  1330.2482 ]])

In [11]:
covariance = np.cov(dataset.T) # OR covariance = np.cov(dataset, rowvar=False)
covariance

array([[  18.60200779,   78.50218098],
       [  78.50218098, 1512.91208783]])
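The difference between the two calls above comes down to how `np.cov` reads its input: with the default `rowvar=True`, each row is a variable and each column is an observation. A small sketch with a made-up array (`data` is a hypothetical name, holding 6 observations of 2 variables) shows the effect on the output shape:

```python
import numpy as np

# Made-up data: 6 observations (rows) of 2 variables (columns)
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 2))

# Default rowvar=True: every row is treated as its own variable
wrong = np.cov(data)

# Either transpose, or tell np.cov that the columns are the variables
right = np.cov(data, rowvar=False)
also_right = np.cov(data.T)

print(wrong.shape)   # one row/column per observation
print(right.shape)   # one row/column per variable
print(np.allclose(right, also_right))
```

If your data has observations in rows, as a DataFrame usually does, `rowvar=False` or a transpose is what you want.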

In [12]:
covariance = dataset.cov()
covariance

| | height | weight |
|---|---|---|
| height | 18.602008 | 78.502181 |
| weight | 78.502181 | 1512.912088 |


### Correlation

Correlation and covariance are easily linked. If we take that 2D covariance matrix from above, which is written in terms of variance, we can rewrite it in terms of standard deviation $\sigma$, since $Var = \sigma^2$. Here $\sigma^2_{a,b}$ is just notation for $Var^{a,b}$; unlike a true square, it can be negative.

$$ Cov = \begin{pmatrix} \sigma^2_{a,a} & \sigma^2_{a,b} \\ \sigma^2_{b,a} & \sigma^2_{b,b} \\ \end{pmatrix} $$

Great. And here is the correlation matrix:

$$ Corr = \begin{pmatrix} \sigma^2_{a,a}/\sigma^2_{a,a} & \sigma^2_{a,b}/(\sigma_{a,a}\sigma_{b,b}) \\ \sigma^2_{b,a}/(\sigma_{a,a}\sigma_{b,b}) & \sigma^2_{b,b}/\sigma^2_{b,b} \\ \end{pmatrix} $$

Which is the same as

$$ Corr = \begin{pmatrix} 1 & \rho_{a,b} \\ \rho_{b,a} & 1 \\ \end{pmatrix}, $$

where $\rho_{a,b} = \sigma^2_{a,b}/(\sigma_{a,a}\sigma_{b,b})$. Writing $\sigma_a$ for $\sigma_{a,a}$, another way to think about this is that

$$ Corr_{a,b} = \frac{Cov_{a,b}}{\sigma_a \sigma_b} $$

It is the joint variability normalised by the variability of each individual variable, which is why correlation always lies between $-1$ and $1$.
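As a quick check of this formula, the sketch below applies it to the covariance matrix printed in the covariance section. The numbers are copied from that output; the variable names are just for illustration. The off-diagonal entry should come out near 0.468.

```python
import numpy as np

# Covariance matrix copied from the np.cov(dataset.T) output above
cov = np.array([
    [18.60200779, 78.50218098],
    [78.50218098, 1512.91208783],
])

# Standard deviations are the square roots of the diagonal (the variances)
sigma = np.sqrt(np.diag(cov))

# Divide every entry Cov[a, b] by sigma_a * sigma_b
corr = cov / np.outer(sigma, sigma)

print(corr)
```

The diagonal becomes exactly 1 because each variance is divided by itself, matching the $Corr$ matrix above.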

But this is *still too mathy for me*. Let's just go back to the code. We can calculate a correlation matrix using `np.corrcoef` ([doco here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html)) or `pd.DataFrame.corr` ([doco here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)).


In [14]:
corr = np.corrcoef(dataset.T)
corr

array([[1.        , 0.46794517],
       [0.46794517, 1.        ]])

In [15]:
corr = dataset.corr()
corr

| | height | weight |
|---|---|---|
| height | 1.0 | 0.467945 |
| weight | 0.467945 | 1.0 |
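One practical consequence of the normalisation is that correlation does not depend on the units of the data, while covariance does. The sketch below illustrates this with synthetic data; the arrays `height_in` and `weight` are randomly generated stand-ins, not the real dataset, and the conversion from inches to centimetres is only an example of a rescaling.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the real height_weight.csv)
rng = np.random.default_rng(0)
height_in = rng.normal(loc=68.0, scale=4.0, size=500)
weight = 2.0 * height_in + rng.normal(scale=30.0, size=500)

# Same heights expressed in different units
height_cm = 2.54 * height_in

cov_in = np.cov(height_in, weight)[0, 1]
cov_cm = np.cov(height_cm, weight)[0, 1]
corr_in = np.corrcoef(height_in, weight)[0, 1]
corr_cm = np.corrcoef(height_cm, weight)[0, 1]

print(cov_cm / cov_in)                # covariance scales with the units
print(np.isclose(corr_in, corr_cm))   # correlation is unchanged
```

This is why correlation is usually the easier number to compare across datasets, while covariance keeps the original units.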
