In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from common import load_iris, percentile_rank, make_cdf

%matplotlib inline

# Relationships between variables

Chapter 7 of [Think Stats 2nd Edition](https://greenteapress.com/wp/think-stats-2e/).

In [None]:
iris = load_iris().features

In [None]:
f, a = plt.subplots()

_ = iris.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)', ax=a)

In [None]:
_ = sns.pairplot(iris)

Plotting the cumulative distributions (percentiles) of two variables on the same axes:

In [None]:
y, x = zip(*make_cdf(iris.loc[:, 'petal length (cm)']))
plt.plot(x, y)

y, x = zip(*make_cdf(iris.loc[:, 'petal width (cm)']))
plt.plot(x, y)

_ = plt.ylabel('cumulative (%)')
_ = plt.xlabel('petal length / petal width (cm)')

## Co-variance

Variance = the expected squared distance of a random variable from its mean

$$ \sigma^2_X = \mathbf{E} [(X - \mathbf{E}[X])^2] $$

Co-variance = measure of the **joint variability** of two random variables

$$ \sigma_{XY} = \mathbf{E} [(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])] $$

Use delta degrees of freedom (`ddof`) of one for an unbiased estimate.
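A minimal sketch of what `ddof` changes, using a small made-up array (the values are illustrative, not from the iris data). Dividing by `n` gives the biased estimate, dividing by `n - ddof` with `ddof=1` gives the unbiased one:

```python
import numpy as np

# made-up values, chosen so the arithmetic is easy to follow
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = values.shape[0]

# sum of squared deviations from the sample mean
squared_deviations = (values - values.mean()) ** 2

biased = squared_deviations.sum() / n          # ddof=0
unbiased = squared_deviations.sum() / (n - 1)  # ddof=1

print(biased, np.var(values))
print(unbiased, np.var(values, ddof=1))
```

The gap between the two estimates shrinks as `n` grows.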

In [None]:
x = iris.loc[:, 'sepal length (cm)']
y = iris.loc[:, 'sepal width (cm)']

def covariance(x, y, ddof=1):
    return np.sum((x - np.mean(x)) * (y - np.mean(y))) / (x.shape[0] - ddof)

In [None]:
#  the co-variance of a variable with itself is its variance
np.testing.assert_almost_equal(covariance(x, x), np.var(x, ddof=1))
covariance(x, x)

In [None]:
#  how does x change with y?
covariance(x, y)

The **co-variance matrix** shows all of the pairwise co-variances:

In [None]:
np.cov(x, y, ddof=1)

What does an identity co-variance matrix imply?

In [None]:
np.eye(3)
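One way to explore this question is to sample from a distribution whose co-variance matrix is the identity and then estimate the matrix from the samples. The sketch below assumes a multivariate normal distribution, which is only one possible choice. Each variable should come out with variance close to one, and each pair with co-variance close to zero (uncorrelated):

```python
import numpy as np

rng = np.random.default_rng(0)

# three variables, zero mean, identity co-variance (an assumed distribution)
samples = rng.multivariate_normal(mean=np.zeros(3), cov=np.eye(3), size=10_000)

# rowvar=False: each column is a variable, each row an observation
estimated = np.cov(samples, rowvar=False)
print(np.round(estimated, 2))
```

Zero co-variance means no linear relationship. It implies independence for normally distributed variables, but not in general.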

## Correlation

Correlation = **strength of a relationship** between two variables.  
- almost always it is a measure of the **linear relationship**
- correlation doesn't have to be perfect to be **useful**

Challenge = variables are measured in different units

Solution = transform each value
1. a standard score (standardization)
2. a rank
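The two transformations can be sketched on a small made-up array containing one extreme value. This is an illustration only: the standard scores keep the relative distances (so the outlier still dominates), while the ranks keep only the ordering:

```python
import numpy as np
from scipy.stats import rankdata

# made-up measurements with one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# 1. standard score: subtract the mean, divide by the standard deviation
standard_scores = (values - values.mean()) / values.std()

# 2. rank: position of each value in sorted order (1-based, ties averaged)
ranks = rankdata(values)

print(standard_scores)
print(ranks)
```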

## Karl Pearson 

1857 to 1936 - English biostatistician - [Wikipedia](https://en.wikipedia.org/wiki/Karl_Pearson).

![](assets/pearson.jpg)

Introduced
- Pearson correlation coefficient
- moments
- chi distance & chi squared test
- p-value
- PCA
- the histogram

## Pearson Correlation Coefficient

$$ \rho_{X,Y} = \frac{\mathbf{E}[(X - \mathbf{E}[X])(Y - \mathbf{E}[Y])]}{\sigma_{X}\sigma_{Y}} $$

- dimensionless
- always between -1 and +1
- measures a **linear relationship**
- its significance test (the p-value) assumes that both datasets are normally distributed

Problem with co-variance = its units are the product of the units of the two variables (e.g. $cm^2$)
- Pearson's correlation fixes this by using standard scores (aka standardization)
- subtracting the mean and dividing by the standard deviation removes the units

$$ z = \frac{x-\mu}{\sigma} $$

In [None]:
from scipy.stats import pearsonr

def standardize(x):
    return (x - np.mean(x)) / np.std(x)

def pearson(x, y):
    return np.mean(standardize(x) * standardize(y))

pearson(x, y)

In [None]:
pearsonr(x, y)
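The unit problem can be checked numerically. The sketch below uses synthetic data (the variable names and parameters are made up) and converts centimetres to millimetres: the co-variance grows by a factor of 100, while the Pearson coefficient is unchanged:

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic, made-up measurements in centimetres
length_cm = rng.normal(5.8, 0.8, size=200)
width_cm = 0.5 * length_cm + rng.normal(0.0, 0.4, size=200)

# the same measurements in millimetres
length_mm, width_mm = 10 * length_cm, 10 * width_cm

cov_cm = np.cov(length_cm, width_cm)[0, 1]
cov_mm = np.cov(length_mm, width_mm)[0, 1]

r_cm = np.corrcoef(length_cm, width_cm)[0, 1]
r_mm = np.corrcoef(length_mm, width_mm)[0, 1]

# Pearson = co-variance divided by the product of the standard deviations
r_manual = cov_cm / (np.std(length_cm, ddof=1) * np.std(width_cm, ddof=1))

print(cov_cm, cov_mm)
print(r_cm, r_mm, r_manual)
```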

## Spearman's Rank Correlation

Non-parametric (why?)

Dependence of the **ranks**
- assesses strength & direction of monotonic relationships

Monotonic
- as one variable increases, so does the other variable; or
- as one variable increases, the other variable decreases
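A short sketch of the difference between linear and monotonic, using a made-up exponential relationship. The relationship is perfectly monotonic but not linear, so Spearman's coefficient should be one while Pearson's is lower:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# made-up data: y always increases with x, but not along a straight line
x_values = np.linspace(0, 5, 50)
y_values = np.exp(x_values)

# index [0] is the correlation coefficient, [1] is the p-value
pearson_r = pearsonr(x_values, y_values)[0]
spearman_r = spearmanr(x_values, y_values)[0]

print(pearson_r, spearman_r)
```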

In [None]:
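def rank(x):
    #  ordinal ranks (0-based) - tied values get arbitrary distinct ranks
    order = np.asarray(x).argsort()
    ranks = np.empty_like(order)
    #  the element at sorted position i gets rank i
    ranks[order] = np.arange(len(order))
    return ranks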

ranked = rank(x)

In [None]:
def spearman(x, y):
    x_ranks = rank(x)
    y_ranks = rank(y)
    return pearson(x_ranks, y_ranks)

spearman(x, y)

In [None]:
from scipy.stats import spearmanr

corr, p_value = spearmanr(x, y)
#  differs from our spearman() above - a bug in our rank(), which does not average the ranks of tied values
corr

In [None]:
from scipy.stats import rankdata

#  rankdata averages the ranks of tied values, which matches spearmanr
pearson(rankdata(x), rankdata(y))
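The effect of ties can be seen on a tiny made-up array. The sketch below compares ordinal ranks from `argsort` (tied values get different ranks, depending on their position) with `rankdata`, which by default gives tied values the average of their ranks:

```python
import numpy as np
from scipy.stats import rankdata

# made-up values: 5.0 appears twice
values = np.array([5.0, 3.0, 5.0, 1.0])

# ordinal ranks via argsort (0-based); a stable sort makes the tie-break deterministic
order = values.argsort(kind='stable')
ordinal_ranks = np.empty_like(order)
ordinal_ranks[order] = np.arange(len(values))

# average ranks (1-based); tied values share one rank
average_ranks = rankdata(values)

print(ordinal_ranks)
print(average_ranks)
```

The two tied values receive different ordinal ranks but the same average rank, which is why the two Spearman calculations disagree on data with repeated values.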

In [None]:
#  using pandas here
iris.corr('spearman')

## Anscombe's quartet 

[Wikipedia](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)

Each dataset consists of eleven (x, y) points. 

Constructed in 1973 by the statistician Francis Anscombe to demonstrate:
- the importance of **graphing data before analyzing it**
- the **effect of outliers** and other influential observations on statistical properties

In [None]:
import pandas as pd

data = pd.DataFrame({
    'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'y1': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    'y2': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    'y3': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
})

data2 = pd.DataFrame({
    'x4': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    'y4': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
})

In [None]:
#  all our y's have similar means
data.mean(axis=0)

In [None]:
#  all our y's have similar variances
data.cov()

In [None]:
#  x has different values - hence two dfs 
data2.cov()

In [None]:
f, a = plt.subplots(2, 2, figsize=(10,5))
data.plot(ax=a[0][0], x='x', y='y1', kind='scatter')
data.plot(ax=a[0][1], x='x', y='y2', kind='scatter')
data.plot(ax=a[1][0], x='x', y='y3', kind='scatter')
data2.plot(ax=a[1][1], x='x4', y='y4', kind='scatter')

for ax in a.flatten():
    ax.get_yaxis().set_label_text('')
    ax.get_xaxis().set_label_text('')

a[0][0] = simple linear relationship

a[0][1] = the relationship is clearly non-linear, so the Pearson correlation coefficient is not an appropriate summary. A more general regression and the corresponding coefficient of determination would be more appropriate.

a[1][0] = the relationship is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.

a[1][1] =  one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
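The four plots look very different, yet the Pearson coefficient is nearly identical for each. A self-contained sketch that repeats the quartet's values and computes the coefficient with `np.corrcoef` (the dictionary layout is just one convenient arrangement):

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

# each entry pairs the x values with the matching y values
quartet = {
    'y1': (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'y2': (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'y3': (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'y4': (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Pearson correlation coefficient for each of the four datasets
correlations = {
    name: np.corrcoef(xs, ys)[0, 1] for name, (xs, ys) in quartet.items()
}

for name, r in correlations.items():
    print(name, round(r, 3))
```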

## Quiz

What is variance?

What is co-variance?

What is correlation?

What challenge do the Pearson & Spearman correlations attempt to solve?

What is the Pearson correlation coefficient?  What kind of relationship does it measure?

What is the Spearman correlation coefficient?  What kind of relationship does it measure?

What are two takeaways from Anscombe's quartet?