*Credit*: some material here has been adapted from [Machine Learning: A Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/) by Kevin P. Murphy (Chapter 2).


In [0]:
import numpy as np
import matplotlib.pyplot as plt

# Summary Statistics and Independence

## Expectation

The **expected value** of a function $g : \mathbb{R} \rightarrow \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is defined by:

$$
  \mathbb{E}_X \left[ g(x) \right] = \int_{\mathcal{X}} g(x) p(x) \, dx .
$$

The expected value of a function $g$ of a discrete random variable $X \sim p(x)$ is given by:

$$
  \mathbb{E}_X \left[ g(x) \right] = \sum_{x \in \mathcal{X}} g(x) p(x) .
$$

where $\mathcal{X}$ is the set of possible outcomes (target space) of the random variable $X$.
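
Both definitions can be checked numerically. The sketch below is only an illustration under assumed choices: a fair six-sided die for the discrete case, a uniform density on $[0, 1]$ for the continuous case, and $g(x) = x^2$ in both. The integral is approximated with a midpoint Riemann sum rather than computed exactly.

```python
import numpy as np

# Discrete case: a fair six-sided die, with g(x) = x**2.
outcomes = np.arange(1, 7)                    # the set of possible outcomes
pmf = np.full(6, 1 / 6)                       # p(x) = 1/6 for every face
expect_discrete = np.sum(outcomes**2 * pmf)   # sum over x of g(x) p(x)
print(expect_discrete)                        # exact value is 91/6

# Continuous case: X uniform on [0, 1], so p(x) = 1 there, with g(x) = x**2.
# A midpoint Riemann sum approximates the integral of g(x) p(x) dx.
n_bins = 100_000
dx = 1.0 / n_bins
x_mid = (np.arange(n_bins) + 0.5) * dx        # midpoints of the bins
expect_continuous = np.sum(x_mid**2 * 1.0) * dx
print(expect_continuous)                      # close to the exact value 1/3
```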

## Mean and variance

The most familiar property of a distribution is its **mean**, or **expected value**, denoted by $\mu$. For discrete rv's, it is defined as $\mathbb{E}_X[x] \triangleq \sum_{x \in \mathcal{X}} x p(x)$, and for continuous rv's, it is defined as $\mathbb{E}[x] \triangleq \int_{\mathcal{X}} x p(x) dx$.

The **variance** is a measure of the "spread" of a distribution, denoted by $\sigma^2$. This is defined as follows:

$$
\begin{align}
\mathbb{V}_X[x] & \triangleq \mathbb{E}_X\left[ \left( x - \mu\right)^2 \right] = \int \left( x - \mu \right) ^2 p(x) dx \\
&= \int x^2 p(x)dx + \mu^2 \int p(x) dx - 2 \mu \int x p(x) dx = \mathbb{E}_X[x^2] - \mu^2
\end{align}
$$

from which we derive the useful result

$$\mathbb{E}_X[x^2] = \mu^2 + \sigma^2$$

The **standard deviation** is defined as

$$\text{std}[x] \triangleq \sqrt{\mathbb{V}[x]}$$

It is often denoted $\sigma(x)$.
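
As a sanity check, the identity $\mathbb{V}[x] = \mathbb{E}[x^2] - \mu^2$ can be verified on a small discrete distribution. The values and probabilities below are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical discrete distribution, chosen only for illustration.
values = np.array([0.0, 1.0, 2.0, 3.0])
probs = np.array([0.1, 0.2, 0.3, 0.4])               # must sum to 1

mu = np.sum(values * probs)                          # E[x]
var_definition = np.sum((values - mu) ** 2 * probs)  # E[(x - mu)^2]
second_moment = np.sum(values**2 * probs)            # E[x^2]
var_shortcut = second_moment - mu**2                 # E[x^2] - mu^2
std = np.sqrt(var_definition)                        # standard deviation

print(mu, var_definition, var_shortcut, std)
```

Here $\mu = 2$ and $\mathbb{E}[x^2] = 5$, so both routes give a variance of $1$.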

## Population statistics vs. empirical statistics

Let's use `scipy.stats` to explore the difference between population mean and covariance vs. empirical mean and covariance.

First we will create a univariate random variable which is normally distributed. We will get into the details of the normal distribution in the next topic. The important thing to know here is that we have access to its distribution, so we know the true mean and covariance.

In [0]:
from scipy.stats import multivariate_normal

rv = multivariate_normal(2.5, 0.5)
print(rv.mean)
print(rv.cov)

x = np.linspace(0, 5, 100, endpoint=False)
y = rv.pdf(x)

plt.plot(x, y)
plt.title('Gaussian pdf')

But what if we only had access to samples from the distribution? We would estimate the mean and covariance from those samples, giving the empirical mean and covariance.

In [0]:
X = rv.rvs(20)  # sample 20 points from the distribution
print(X)

print(np.mean(X))  # empirical mean
print(np.cov(X))   # empirical covariance (=variance in the 1d case)

Note that our empirical statistics will approach the population statistics as we gain more data. This is a result of the law of large numbers.

In [0]:
X = rv.rvs(1000)  # sample 1000 points from the distribution

print(np.mean(X))  # empirical mean
print(np.cov(X))   # empirical covariance (=variance in the 1d case)
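
To see the convergence more systematically, one option is to repeat the estimate over a range of sample sizes. The sketch below assumes the same population mean (2.5) and variance (0.5) as the univariate `rv` above, but draws samples with NumPy's `default_rng` so the run is reproducible; the seed and the list of sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily, for reproducibility
true_mean, true_var = 2.5, 0.5       # population values of the univariate example

errors = []
for n in [10, 100, 1_000, 10_000, 100_000]:
    # rng.normal takes a standard deviation, so convert the variance
    samples = rng.normal(loc=true_mean, scale=np.sqrt(true_var), size=n)
    emp_mean = samples.mean()
    emp_var = samples.var(ddof=1)    # unbiased estimate, matching np.cov's default
    errors.append(abs(emp_mean - true_mean))
    print(n, emp_mean, emp_var)
```

The error of the empirical mean does not shrink monotonically from one run to the next, but its typical size falls roughly like $1/\sqrt{n}$.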

Let's do the same for a bivariate Gaussian, just so we see the difference between the univariate and multivariate case.

In [0]:
rv = multivariate_normal([0.5, -0.2], [[2.0, 0.3], [0.3, 0.5]])
print(rv.mean)
print(rv.cov)

x, y = np.mgrid[-1:1:.01, -1:1:.01]
pos = np.empty(x.shape + (2,))
pos[:, :, 0] = x; pos[:, :, 1] = y

plt.contourf(x, y, rv.pdf(pos))
plt.title('Gaussian pdf')

### Exercise

Generate samples from the bivariate normal distribution. Estimate the empirical mean and covariance, similar to the univariate case. Experiment with different sample sizes.

Make a contour plot which also shows the sampled data points.

## Independence and conditional independence

We say $X$ and $Y$ are **statistically independent**, denoted $X \perp Y$, if we can represent the joint as the product of the two marginals, i.e.,

$$X \perp Y \Longleftrightarrow p(x,y) = p(x)p(y)$$

<img width=400px src='https://drive.google.com/uc?id=1d5pnEkpZTyK0lesflkha8zkDImMvunpY' />

In general, we say a **set** of variables is mutually independent if the joint can be written as a product of marginals.
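
For discrete variables the factorisation can be tested directly on a joint probability table. The helper `is_independent` below is a hypothetical name introduced for this sketch, and both joint tables are made-up examples.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Return True if the joint table equals the outer product of its marginals."""
    p_x = joint.sum(axis=1)          # marginal of X (sum over y)
    p_y = joint.sum(axis=0)          # marginal of Y (sum over x)
    return np.allclose(joint, np.outer(p_x, p_y), atol=tol)

# Independent: built directly as a product of marginals.
joint_indep = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])

# Dependent: all mass on the diagonal, so X determines Y.
joint_dep = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

print(is_independent(joint_indep))   # True
print(is_independent(joint_dep))     # False
```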

We say $X$ and $Y$ are **conditionally independent** given $Z$ iff the conditional joint can be written as a product of conditional marginals:

$$X \perp Y|Z \Longleftrightarrow p(x,y|z)=p(x|z)p(y|z)$$


In contrast, when $X \perp Y$ holds without conditioning on any other variable, $X$ and $Y$ are said to be **unconditionally independent** or **marginally independent**.

In machine learning, conditional independence assumptions allow us to build large probabilistic models from small pieces.
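
The two notions really are different. The sketch below builds a joint over three binary variables from small pieces, $p(z)$, $p(x|z)$ and $p(y|z)$, so that $X \perp Y | Z$ holds by construction, and then checks that $X$ and $Y$ are nonetheless marginally dependent. All of the probability tables are hypothetical values chosen for illustration.

```python
import numpy as np

p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.9, 0.1],    # row z=0: p(x | z=0)
                        [0.2, 0.8]])   # row z=1: p(x | z=1)
p_y_given_z = np.array([[0.8, 0.2],    # row z=0: p(y | z=0)
                        [0.3, 0.7]])   # row z=1: p(y | z=1)

# joint[x, y, z] = p(z) p(x|z) p(y|z)
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

# Conditional independence: for each z, p(x, y | z) factorises.
cond_indep = True
for z in range(2):
    p_xy_given_z = joint[:, :, z] / joint[:, :, z].sum()
    p_x = p_xy_given_z.sum(axis=1)
    p_y = p_xy_given_z.sum(axis=0)
    cond_indep = cond_indep and np.allclose(p_xy_given_z, np.outer(p_x, p_y))

# Marginal independence: sum out z and test p(x, y) = p(x) p(y).
p_xy = joint.sum(axis=2)
marg_indep = np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))

print(cond_indep)   # True
print(marg_indep)   # False
```

Here $Z$ acts as a common cause: once it is summed out, knowing $X$ carries information about $Z$ and therefore about $Y$.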

## Covariance and correlation

The **covariance** between two rv's $X$ and $Y$ measures the degree to which $X$ and $Y$ are (linearly) related. Covariance is defined as

$$\text{Cov}_{X,Y}[x,y] \triangleq \mathbb{E}\left[\left(x - \mathbb{E}_X[X]\right)\left(y - \mathbb{E}_Y[Y]\right)\right]=\mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$$

Note that in the rightmost term above we have left off the subscript denoting the random variable associated with the expectation. This is common practice when the random variable is clear from the arguments.

If we consider a random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$, its **covariance matrix** is defined to be the following symmetric, positive semi-definite matrix:

$$
\begin{align}
\mathbb{V}_X[\mathbf{x}] = \text{Cov}_X[\mathbf{x},\mathbf{x}] & \triangleq \mathbb{E}_X \left[\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)^T\right]\\
& = \left( \begin{array}{cccc}
  \mathbb{V}[X_1] & \text{cov}[X_1, X_2] &  \ldots & \text{cov}[X_1, X_D] \\
  \text{cov}[X_2, X_1] & \mathbb{V}[X_2] &  \ldots & \text{cov}[X_2, X_D] \\
  \vdots               & \vdots          & \ddots  & \vdots\\
  \text{cov}[X_D, X_1] & \text{cov}[X_D, X_2] &  \ldots & \mathbb{V}[X_D] 
  \end{array} \right)
\end{align}
$$
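
The empirical version of this matrix replaces the expectation with an average over samples. The sketch below computes it by hand and compares it with `np.cov`. The population parameters are borrowed from the bivariate example earlier, and the sample size and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 0.3],
                     [0.3, 0.5]])
# samples has shape (N, D): one row per sample, one column per variable
samples = rng.multivariate_normal(mean=[0.5, -0.2], cov=true_cov, size=5_000)

centered = samples - samples.mean(axis=0)
# Average of the outer products (x - mean)(x - mean)^T, with the N - 1 correction
cov_manual = centered.T @ centered / (len(samples) - 1)

cov_numpy = np.cov(samples, rowvar=False)   # rowvar=False: columns are variables

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))            # same estimate
print(np.allclose(cov_manual, cov_manual.T))         # symmetric
print(np.all(np.linalg.eigvalsh(cov_manual) >= 0))   # positive semi-definite
```

Note that `np.cov` treats rows as variables by default, which is why `rowvar=False` is needed for data stored one sample per row.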

Covariances can be between $-\infty$ and $\infty$. Sometimes it is more convenient to work with a normalized measure, with finite bounds. The (Pearson) **correlation coefficient** between $X$ and $Y$ is defined as

$$\text{corr}[x,y] \triangleq \frac{\text{Cov}[X,Y]}{\sqrt{\mathbb{V}[X]\mathbb{V}[Y]}}$$

A **correlation matrix** has the form

$$
\text{corr}[\mathbf{x}, \mathbf{x}] = \left( \begin{array}{cccc}
  \text{corr}[X_1, X_1] & \text{corr}[X_1, X_2] &  \ldots & \text{corr}[X_1, X_D] \\
  \text{corr}[X_2, X_1] & \text{corr}[X_2, X_2] &  \ldots & \text{corr}[X_2, X_D] \\
  \vdots               & \vdots          & \ddots  & \vdots\\
  \text{corr}[X_D, X_1] & \text{corr}[X_D, X_2] &  \ldots & \text{corr}[X_D, X_D] 
  \end{array} \right)
$$

One can show that $-1 \leq \text{corr}[x,y] \leq 1$. Hence, in a correlation matrix, each entry on the diagonal is 1, and the other entries are between -1 and 1. One can also show that $|\text{corr}[x,y]|=1$ iff $Y=aX + b$ for some parameters $a \neq 0$ and $b$, i.e. there is an exact *linear* relationship between $X$ and $Y$; the correlation is $+1$ when $a > 0$ and $-1$ when $a < 0$. A good way to think of the correlation coefficient is as a degree of linearity.
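
These properties are easy to check on simulated data. In the sketch below, `corr` is a hypothetical helper that implements the definition above directly, and the slopes, offsets and noise level are arbitrary choices. `np.corrcoef` is NumPy's built-in equivalent and returns the full correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)

def corr(a, b):
    """Pearson correlation: Cov[a, b] / sqrt(V[a] V[b])."""
    cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
    return cov_ab / np.sqrt(a.var() * b.var())

y_up = 3.0 * x + 1.0                     # exact linear relation, positive slope
y_down = -2.0 * x + 4.0                  # exact linear relation, negative slope
y_noisy = x + rng.normal(size=x.size)    # linear trend plus independent noise

print(corr(x, y_up))             # 1 up to rounding
print(corr(x, y_down))           # -1 up to rounding
print(corr(x, y_noisy))          # strictly between -1 and 1
print(np.corrcoef(x, y_noisy))   # 2x2 correlation matrix with ones on the diagonal
```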

If $X$ and $Y$ are independent, meaning $p(\mathbf{x},\mathbf{y})=p(\mathbf{x})p(\mathbf{y})$, then $\text{Cov}_{X,Y}[\mathbf{x},\mathbf{y}]=\mathbf{0}$, and hence $\text{corr}[\mathbf{x},\mathbf{y}]=\mathbf{0}$ so they are uncorrelated. However, the converse is not true: *uncorrelated does not imply independent*. Some striking examples are shown below. The correlation between $X$ and $Y$ is shown at the top of each subplot.

<img src="https://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png" />

Source: https://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
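
A simple case of this can be simulated in a few lines. The sketch below assumes $X$ is uniform on $[-1, 1]$ and sets $Y = X^2$, so $Y$ is completely determined by $X$, yet the covariance is $\mathbb{E}[x^3] - \mathbb{E}[x]\mathbb{E}[x^2] = 0$ by symmetry. The sample size, seed and the 0.9 threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x**2            # Y is a deterministic function of X, so they are dependent

print(np.corrcoef(x, y)[0, 1])     # close to 0: uncorrelated

# Dependence shows up once we condition: knowing |X| is large changes E[Y].
print(y.mean())                    # roughly 1/3
print(y[np.abs(x) > 0.9].mean())   # roughly 0.9, far from the overall mean
```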

### Exercise

The *Iris* dataset consists of 3 different types of irises (Setosa, Versicolour, and Virginica). The iris types are stored as integer classes in the array `target`. Each example has 4 input features: Sepal Length, Sepal Width, Petal Length and Petal Width. These are stored in the array `data`.

I've started you off with some code that loads the dataset and partitions the features into an array `X` and labels into an array `y`. First, compute the empirical mean and covariance for the whole dataset (you won't need the labels). Then compute the empirical mean and covariance for each class individually.

Before you get going, think about what dimensions each of these objects should be.

*Iris* is a popular "toy" dataset in machine learning and statistics. You can read more information about it on its [Wikipedia page](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [0]:
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target

print(X.shape)
print(y.shape)