In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [8]:
# load data with height and weight only, and inspect the first row
df = pd.read_csv('nd_data.csv')[['height', 'weight']]
df.head(1)

   height  weight
0   71.74  259.88


# Covariance
We defined variance as the squared deviation from the mean, summed over all observations and normalized by $N - 1$.

$$ Var^{a,a} = \frac{1}{N-1} \sum_{i=1}^N (x_i^a - \mu^a)(x_i^a - \mu^a), $$

Covariance follows the same idea, except that it compares how one column varies against how another column varies. For each observation, we take the deviation of one column from its mean and multiply it by the deviation of the other column from that column's mean.

$$ Var^{a,b} = \frac{1}{N-1} \sum_{i=1}^N (x_i^a - \mu^a)(x_i^b - \mu^b), $$

Formal Definition: 
* Covariance defines the linear relationship between 2 variables
* It can be any value, positive or negative
* ONLY measures how two variables change together, NOT how one depends on the other.
* Determines the DIRECTION of the relationship between 2 variables.
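
As a minimal sketch of the formula above, the covariance can be computed by hand and compared with `np.cov`. The arrays and the helper name `sample_covariance` are hypothetical and made up for illustration; they are not taken from `nd_data.csv`.

```python
import numpy as np

# Hypothetical sample values for illustration only (not from nd_data.csv)
height = np.array([60.0, 65.0, 70.0, 75.0])
weight = np.array([110.0, 150.0, 160.0, 220.0])

def sample_covariance(a, b):
    """Sum the products of deviations from the mean, then divide by N - 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

manual_cov = sample_covariance(height, weight)
# variance is simply the covariance of a column with itself
manual_var = sample_covariance(height, height)

# np.cov on two 1-D arrays returns the full 2x2 covariance matrix
reference = np.cov(height, weight)
print(manual_cov, reference[0, 1])
print(manual_var, reference[0, 0])
```

The off-diagonal entry of `reference` should match `manual_cov`, and the first diagonal entry should match `manual_var`, since `np.cov` also normalizes by $N - 1$ by default.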

We can calculate the covariance using either `np.cov` ([doco here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html)) or `pd.DataFrame.cov` ([doco here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html)).

In [11]:
# by default np.cov treats each row as a variable, so passing df directly
# would give one row/column per observation; in fact we know the result must be a 2x2 matrix
# to get that we can either transpose the data or use the rowvar argument

# using transpose
covar = np.cov(df.T)
print(covar)

# using argument
covariance = np.cov(df, rowvar=False)
print(covariance)

[[  18.60200779   78.50218098]
 [  78.50218098 1512.91208783]]
[[  18.60200779   78.50218098]
 [  78.50218098 1512.91208783]]
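
The effect of `rowvar` is easiest to see from the shape of the result. The sketch below uses a small random array (hypothetical data, 5 observations of 2 variables) rather than the notebook's dataframe.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5 observations (rows) of 2 variables (columns)
data = rng.normal(size=(5, 2))

default_cov = np.cov(data)                # rows are treated as variables
column_cov = np.cov(data, rowvar=False)   # columns are treated as variables
transposed_cov = np.cov(data.T)           # same result as rowvar=False

print(default_cov.shape)   # (5, 5)
print(column_cov.shape)    # (2, 2)
print(np.allclose(column_cov, transposed_cov))  # True
```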


In [12]:
# much simpler when using pandas
# the result is a DataFrame with the same values as np.cov
covariance = df.cov()
print(covariance)

           height       weight
height  18.602008    78.502181
weight  78.502181  1512.912088


# Correlation
Correlation and covariance are closely linked. A covariance matrix is expressed in terms of variances and covariances; dividing each entry by the corresponding standard deviations rescales it into the correlation matrix below:

$$ Corr = \begin{pmatrix} 1 & \rho_{a,b} \\ \rho_{b,a} & 1 \\ \end{pmatrix}, $$

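where $\rho_{a,b} = \sigma^2_{a,b}/(\sigma_{a,a}\sigma_{b,b})$, with $\sigma^2_{a,b} = Var^{a,b}$ the covariance and $\sigma_{a,a} = \sqrt{Var^{a,a}}$ the standard deviation of column $a$.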
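
As a rough check of this relationship, the covariance matrix printed earlier can be rescaled by hand. The values are copied from the `np.cov` output above; the variable names are only illustrative.

```python
import numpy as np

# Covariance matrix as printed by np.cov earlier in this notebook
cov_matrix = np.array([[18.60200779, 78.50218098],
                       [78.50218098, 1512.91208783]])

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix))

# Divide each entry (a, b) by std[a] * std[b]
corr_matrix = cov_matrix / np.outer(std, std)
print(corr_matrix)
```

The diagonal becomes 1 and the off-diagonal entry comes out at roughly 0.468, in line with the `np.corrcoef` result for the same data.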

Formal definition:
* Defines both the direction AND magnitude of the linear relationship between 2 variables.
* It can be any value between -1 and +1.
* This value defines how two variables are correlated with each other (positively or negatively).
* Determines the MAGNITUDE and DIRECTION of the relationship between 2 variables.
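
One way to see why covariance only gives the direction while correlation also gives the magnitude is to change the units of one variable. The sketch below uses hypothetical values and assumes, purely for illustration, that height is measured in inches and then converted to centimetres.

```python
import numpy as np

# Hypothetical values; the units (inches, pounds) are an assumption for illustration
height_in = np.array([60.0, 65.0, 70.0, 75.0])
weight_lb = np.array([110.0, 150.0, 160.0, 220.0])
height_cm = height_in * 2.54  # same heights, different unit

cov_in = np.cov(height_in, weight_lb)[0, 1]
cov_cm = np.cov(height_cm, weight_lb)[0, 1]
corr_in = np.corrcoef(height_in, weight_lb)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_lb)[0, 1]

print(cov_in, cov_cm)     # covariance scales with the unit change
print(corr_in, corr_cm)   # correlation is unchanged
```

The covariance grows by the conversion factor even though the relationship itself has not changed, whereas the correlation stays the same and always lies between -1 and +1.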

We can calculate a correlation matrix using `np.corrcoef` ([doco here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html)) or `pd.DataFrame.corr` ([doco here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)).

In [14]:
# using numpy corrcoef
correlation = np.corrcoef(df.T)
print(correlation)

[[1.         0.46794517]
 [0.46794517 1.        ]]


In [15]:
# using pandas
correlation = df.corr()
print(correlation)

          height    weight
height  1.000000  0.467945
weight  0.467945  1.000000
