# Moments and Conditioning
### Understanding Uncertainty

## Roadmap
1. Key Moments
2. Random Vectors
3. Marginal and Conditional

# 1. Key Moments

## Expectation
- So we have a probability space $(\mathcal{Z}, \mathcal{E}, p)$ and a random variable $X: \mathcal{Z} \rightarrow \mathbb{R}$
- These yield a distribution function $F_X(x) = p(\{z\in \mathcal{Z}:X(z)\le x\})$ and density $f_X(x)=F_X'(x)$, or a mass function $m_X(x) = p(\{z\in \mathcal{Z}:X(z)= x\})$ when $X$ is discrete
- What now?
- The **Expected Value** or **Expectation** of a random variable $X$ is
$$
\mathbb{E}[X] = \begin{cases} \int_{x \in \text{supp}(X)} x \times f_X(x)dx, & \text{$X$ continuous}\\
\sum_{x \in \text{supp}(X)} x \times m_X(x), & \text{$X$ discrete}.
\end{cases}
$$
- We often write $\mu_X$ for the expectation of $X$
- This is one of the most important quantities in probability and statistics
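
As a rough numerical illustration of both branches of the definition, the sketch below computes an expectation for a made-up discrete distribution and for an Exponential density. The support points, masses, rate, and integration grid are all assumptions chosen for illustration, not part of the material above.

```python
import numpy as np

# Discrete case: sum of x * m_X(x) over the support.
# The support and masses below are hypothetical illustration values.
support = np.array([0.0, 1.0, 5.0])
mass = np.array([0.5, 0.3, 0.2])
expectation_discrete = float(np.sum(support * mass))

# Continuous case: integrate x * f_X(x) numerically.
# Assumed density: Exponential with rate 2, whose mean is 1/2.
rate = 2.0
grid = np.linspace(0.0, 20.0, 200_001)
density = rate * np.exp(-rate * grid)
integrand = grid * density
dx = grid[1] - grid[0]
# Trapezoid rule written out by hand.
expectation_continuous = float(np.sum((integrand[:-1] + integrand[1:]) / 2.0) * dx)

print(expectation_discrete)
print(expectation_continuous)
```

The discrete branch is just a weighted average of the support points; the continuous branch is the same weighted average with the sum replaced by an integral.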

## Exercises
- What is the expected value of a single die roll? 
- Rolling two dice and adding the results together?
- What is the expected winnings of any gamble in European roulette?


## Where does $\mathbb{E}[X]$ come from?
- Imagine you want to predict $X$ with a single number $\hat{x}$. You decide how badly you failed by computing the squared-error loss $(x-\hat{x})^2$ for the realized value $X=x$.
- Since $f_X(x)$ weights each $X=x$ by the likelihood it occurs, it makes sense to minimize the expected squared error:
$$
\min_{\hat{x}} \mathbb{E}[(X-\hat{x})^2]= \min_{\hat{x}} \int_{x} (x - \hat{x})^2 f_X(x) dx
$$
- Take the derivative with respect to $\hat{x}$ and set it equal to zero:
$$
\int_{x} -2(x - \hat{x}) f_X(x) dx = 0,
$$
and solve, using $\int_x f_X(x)dx = 1$:
$$
\hat{x} = \int_x x f_X(x)dx = \mathbb{E}[X]
$$
- The expectation $\mathbb{E}[X]$ is the optimal least squares predictor of $X$
- Minimizing expected loss is the core idea of machine learning
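
The derivation can be checked numerically by replacing the expectation with a sample average. The sketch below assumes Exponential data with mean 2 and a hypothetical grid of candidate predictions; the helper name `mean_squared_loss` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed data-generating process for illustration: Exponential with mean 2.
x = rng.exponential(scale=2.0, size=10_000)

def mean_squared_loss(x_hat, sample):
    """Sample analogue of E[(X - x_hat)^2]."""
    return float(np.mean((sample - x_hat) ** 2))

# Evaluate the loss on a grid of candidate predictions and pick the best one.
candidates = np.linspace(0.0, 5.0, 501)
losses = np.array([mean_squared_loss(c, x) for c in candidates])
best_candidate = float(candidates[np.argmin(losses)])

print(best_candidate, float(x.mean()))
```

Up to the grid spacing of 0.01, the loss-minimizing candidate should coincide with the sample mean, which is the sample version of the result $\hat{x} = \mathbb{E}[X]$.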

## What is the Expectation?
- You are a data scientist, so what does the expectation "mean" to you?
- We are investigating how our models work, **before** we gather the data
- How do we **expect** a model to behave? 
- In this case of minimizing squared loss, what is our optimal prediction $\hat{x}$? It's the expected value of $X$

## Exercise
- Compute the expected value for a uniform random variable.
- Show that $\mathbb{E}[a+bX] = a + b\mathbb{E}[X]$
- Show, by example, that in general $v(\mathbb{E}[X]) \neq \mathbb{E}[v(X)]$. For example, try $v(y) = y^2$ with a normally distributed random variable, or $v(y)=\sqrt{y}$ with a nonnegative one such as a uniform on $[0,1]$. This is a very important thing to remember: The expectation of a transformed random variable is generally not the transformation of the expected value.

## Variance
- If we plug our estimate $\hat{x} = \mathbb{E}[X]$ back into the objective function $\int (x-\hat{x})^2 f_X(x)dx$, we get the **Variance of $X$**:
$$
\mathbb{V}[X] = \int_{x} (x - \mathbb{E}[X])^2 f_X(x) dx
$$
- This is the expected squared error if we try to predict the value of $X$ by selecting the optimal $\hat{x} = \mathbb{E}[X]$
- We often write $\sigma_X^2$ for the variance of $X$

## What is the Variance?
- Our backgrounds in stats tempt us to say, "the variance is a measure of how uncertain $X$ is," which is true
- As data scientists, we have an additional interpretation: "If your loss is $L(e)=e^2$ and you predict $\hat{x} = \mathbb{E}[X]$, what is your expected loss? How wrong do you expect to be?"
- So, before we gather our data, how wrong do we expect to be if we predict $\hat{x} = \mathbb{E}[X]$?
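
The "expected loss at the optimal prediction" reading of the variance can be illustrated with simulated data. This is a minimal sketch, assuming Normal draws with mean 3 and standard deviation 2; the distribution and the offset of 0.1 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed distribution for illustration: Normal with mean 3 and standard deviation 2.
x = rng.normal(loc=3.0, scale=2.0, size=100_000)

x_hat = x.mean()
loss_at_mean = float(np.mean((x - x_hat) ** 2))   # sample expected loss at the mean
sample_variance = float(np.var(x))                # NumPy's default variance (ddof=0)

# Any other prediction does worse, by the squared distance from the mean.
other = x_hat + 0.1
loss_at_other = float(np.mean((x - other) ** 2))

print(loss_at_mean, sample_variance, loss_at_other)
```

The first two numbers agree because the sample variance is, by construction, the average squared error around the sample mean. The third is larger by $0.1^2$, the squared distance between the alternative prediction and the mean.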

## Exercise
- Compute the variance for a uniform random variable.
- Show that 
$$
\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2
$$
$$
\mathbb{V}[a+bX] = b^2 \mathbb{V}[X]
$$
- Show that if $X$ has distribution $F_X(x)$, then $Y = a+bX$ has distribution $F_Y(y) = F_X( (y-a)/b)$ and density $f_Y(y) = f_X((y-a)/b)/b$, if $b>0$

These properties get used all the time!


## Exercise: Decomposing a Random Variable
- Suppose $X$ has an expectation $\mathbb{E}[X]$ and variance $\mathbb{V}[X]$
- Consider making a new variable, $\varepsilon = X - \mathbb{E}[X]$
- What's the expectation of $\varepsilon$?
- What's the variance of $\varepsilon$?
- So we can write any random variable in the form $X = \mathbb{E}[X] + \varepsilon$, where $\mathbb{E}[\varepsilon]=0$ and $\mathbb{V}[\varepsilon] = \sigma_X^2$
- Now replace $\mathbb{E}[X]$ with $x\beta$, and the stage is set for regression models

## Exercise: Transformations of Normals
- Show that if $X$ is a normally distributed random variable, then $a + bX$ is distributed normally with mean $a+ b \mathbb{E}[X]$ and variance $b^2 \sigma_X^2$ 

## Example: Indicator Functions
- 🎶 *Hello darkness, my old friend...* 🎶
- Let
$$
\mathbb{I}\{ x \in A \} = \begin{cases}
0, & x \notin A \\
1, & x \in A
\end{cases}
$$
- What is the expectation of an indicator function? 
- Let's start simple: Whenever the indicator is 1, we integrate over $f_X(x)$, and otherwise we integrate over 0:
$$
\mathbb{E}_X[ \mathbb{I}\{ X \le x \} ] = \int_{-\infty}^x 1 f_X(z)dz + \int_{x}^\infty 0 f_X(z)dz  = F_X(x)
$$
- Exercise: What is $\mathbb{E}_X[ \mathbb{I}\{ X > x \} ]$?
- The expected value of an indicator function for the set $A$ is the probability of the event $A$: $\mathbb{E}_X[\mathbb{I}\{X \in A\}] = \int_{x \in A}f_X(x)dx = p(\{z\in \mathcal{Z}:X(z) \in A\})$
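
A quick simulation makes the "expectation of an indicator is a probability" claim concrete. The sketch below assumes standard normal draws and an arbitrary threshold of 0.5; the normal CDF is written out with `math.erf`.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
# Assumed distribution for illustration: standard normal.
x = rng.standard_normal(200_000)

threshold = 0.5
indicator = (x <= threshold).astype(float)   # I{X <= threshold}, one value per draw
indicator_mean = float(indicator.mean())

# Standard normal CDF written with the error function.
cdf_value = 0.5 * (1.0 + math.erf(threshold / math.sqrt(2.0)))

print(indicator_mean, cdf_value)
```

The average of the zeros and ones is the fraction of draws at or below the threshold, which should be close to $F_X(0.5)$.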

## ECDF to CDF
- Let's take a look at what happens when we take the expectation of our ECDF:
\begin{alignat*}{2}
\mathbb{E}[\hat{F}_X(x)] &=& \mathbb{E}\left[ \frac{1}{N} \sum_{i=1}^N \mathbb{I}\{x_i \le x \} \right] \quad (\text{Definition})\\
&=&  \frac{1}{N} \sum_{i=1}^N \mathbb{E}\left[\mathbb{I}\{x_i \le x \} \right] \quad (\text{Swap sum and expectation})\\
&=&  \frac{1}{N} \sum_{i=1}^N \mathbb{E}\left[\mathbb{I}\{X \le x \} \right] \quad (\text{The $x_i$ is the RV here, so swap notation})\\
&=&  \frac{1}{N} \sum_{i=1}^N F_X(x) \quad (\text{Expectation of indicator is probability})\\
&=&  \frac{1}{N} \underbrace{(F_X(x)+F_X(x)+...+F_X(x))}_{\text{$N$ times}} \\
&=&  F_X(x) \\
\end{alignat*}
- The expected value of the ECDF is the CDF; the ECDF is an **unbiased estimator** of the CDF. Averaged over possible samples, the ECDF equals the true distribution at every $x$, although the estimate from any one sample can be noisy.
- So, even before we gather our data, we **expect** the ECDF to be a good estimator of the true CDF.
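
Unbiasedness is a statement about averages over many hypothetical data sets, which a small Monte Carlo experiment can mimic. This is a sketch under assumed settings (standard normal data, 20 observations per sample, evaluation point 0.3); the helper `ecdf_at` is a name invented for this example.

```python
import math
import numpy as np

def ecdf_at(sample, x):
    """Empirical CDF of `sample` evaluated at the point x."""
    return float(np.mean(sample <= x))

rng = np.random.default_rng(3)
n_obs = 20               # small samples, so each individual ECDF is noisy
n_replications = 20_000
x0 = 0.3

# One ECDF value per simulated data set (assumed: standard normal data).
estimates = np.array([
    ecdf_at(rng.standard_normal(n_obs), x0) for _ in range(n_replications)
])

average_ecdf = float(estimates.mean())
true_cdf = 0.5 * (1.0 + math.erf(x0 / math.sqrt(2.0)))

print(average_ecdf, true_cdf, float(estimates.std()))
```

The average across replications should sit close to the true CDF value, while the standard deviation shows how far a single 20-observation ECDF can land from it.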

## KDE to PDF
- Let's look at the KDE:
$$
\mathbb{E}[ \hat{f}_{X,h}(x)] = \mathbb{E} \left[ \dfrac{1}{N} \sum_{i=1}^N \frac{1}{2h} \mathbb{I}\{ |x_i -x| \le h\} \right] = \dfrac{1}{N} \sum_{i=1}^N \mathbb{E}_X \left[ \frac{1}{2h} \mathbb{I}\{ |X -x| \le h\} \right]
$$
- Now take the expectation of the indicator:
$$
\mathbb{E}[ \hat{f}_{X,h}(x)] = \dfrac{1}{N} \sum_{i=1}^N \mathbb{E} \left[ \frac{1}{2h} \mathbb{I}\{ |X -x| \le h\} \right] = \dfrac{1}{N} \sum_{i=1}^N \dfrac{F(x+h) - F(x-h)}{2h}
$$
- Notice that the $1/N$ and the sum "cancel out", and we get:
$$
\mathbb{E}[ \hat{f}_{X,h} (x)] = \dfrac{F(x+h) - F(x-h)}{2h}
$$
- This isn't quite $f(x)$: The expected value of the KDE has a **bias** for any fixed $h > 0$, but the bias shrinks as $h$ shrinks, since the right-hand side is a symmetric difference quotient of $F$ and $F'(x)=f(x)$.
- So taking the limit as $h \rightarrow 0$,
$$
\lim_{h \rightarrow 0} \mathbb{E}[ \hat{f}_{X,h} (x)] = \lim_{h \rightarrow 0} \dfrac{F(x+h) - F(x-h)}{2h} = f(x)
$$
- So our KDE estimates the density as $h\rightarrow 0$. In general, the KDE is a biased but consistent estimator of the true pdf (as $N\rightarrow \infty$ and $h_N\rightarrow 0$ at the right rate, it will get closer and closer to the truth)
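
Because the expected value of this KDE has a closed form, the bias can be computed exactly for a known distribution. The sketch below assumes a standard normal $F$ and a hypothetical list of bandwidths; no data are simulated.

```python
import math

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expected_kde(x, h):
    """E[f_hat_{X,h}(x)] = (F(x+h) - F(x-h)) / (2h) for the uniform kernel."""
    return (normal_cdf(x + h) - normal_cdf(x - h)) / (2.0 * h)

x0 = 0.0
bandwidths = [1.0, 0.5, 0.1, 0.01]
biases = [expected_kde(x0, h) - normal_pdf(x0) for h in bandwidths]

for h, b in zip(bandwidths, biases):
    print(h, b)
```

At the peak of the normal density the bias is negative (the uniform window averages in lower density on both sides) and its magnitude falls quickly as $h$ gets smaller.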

## ECDF/CDF, KDE/PDF
- So this whole time, we've been estimating the underlying densities and distributions of dozens of random variables just by visualizing them
- What connects these objects, mathematically? The expectation of the indicator function is a probability
- The expectation operator is the bridge from our theory to our likely results

# 2. Random Vectors

## Random Vectors
- So far, we've dealt with a random variable $X: \mathcal{Z} \rightarrow \mathbb{R}$
- An $n$-dimensional random vector is a mapping $X: \mathcal{Z} \rightarrow \mathbb{R} \times \mathbb{R} \times... \times \mathbb{R} = \mathbb{R}^n$
- This is much harder than one dimension: We want to track how random variables move together

## Joint Distribution and Density
- The distribution function for $(X_1, ..., X_N)$ is given by
$$
F_{X_1, X_2, ... , X_N}(x_1, x_2, ..., x_N) = p\left( \{ z \in \mathcal{Z} : X_1(z) \le x_1, X_2(z) \le x_2, ..., X_N(z) \le x_N \} \right),
$$
and the density function is given by
$$
f_{X_1, X_2, ... , X_N}(x_1, x_2, ..., x_N) = \dfrac{\partial ^N F_{X_1, X_2, ... , X_N}(x_1, x_2, ..., x_N)}{\partial x_1 \partial x_2 ... \partial x_N}
$$
- These aren't very expressive. What we want to know is, how do these variables "move together"?

## Independence
- Before we dive into non-trivial examples of random vectors, we have a nice edge case: If it's true that
$$
f_{X_1, X_2, ..., X_n}(x_1, x_2, ..., x_n) = f_{X_1}(x_1) \times f_{X_2}(x_2) \times ... \times f_{X_n}(x_n)
$$
then we say that $(X_1,...,X_n)$ are **independent random variables**
- We are basically saying that the realizations of each random variable $X_i$ are unrelated to the realizations of the others
- For example, flip a coin and then flip it again: Independence
- We typically write this as
$$
f_{X_1, X_2, ..., X_n}(x_1, x_2, ..., x_n) = \prod_{i=1}^n f_{X_i}(x_i)
$$
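
The coin-flip example can be checked empirically: under independence, the joint mass function should factor into the product of the marginals. This is a minimal sketch with two simulated fair coins; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
# Two fair coin flips generated separately (1 = heads, 0 = tails).
first = rng.integers(0, 2, size=n)
second = rng.integers(0, 2, size=n)

# Empirical joint mass function as a 2x2 table of relative frequencies.
joint = np.zeros((2, 2))
for a in (0, 1):
    for b in (0, 1):
        joint[a, b] = np.mean((first == a) & (second == b))

# Marginals are row and column sums; under independence their
# outer product should be close to the joint table.
marginal_first = joint.sum(axis=1)
marginal_second = joint.sum(axis=0)
product = np.outer(marginal_first, marginal_second)

max_gap = float(np.max(np.abs(joint - product)))
print(joint)
print(max_gap)
```

Each cell of the joint table should be near 0.25, and the largest gap between the joint table and the product of marginals should be small sampling noise.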



## Exercise: Covariance and Independence
- The **covariance** of $X$ and $Y$ is
$$
\text{cov}(X,Y) = \int_{x,y} (x-\mathbb{E}[X])(y-\mathbb{E}[Y])f_{XY}(x,y) dxdy
$$
- Show that if $f_{XY}(x,y)=f_X(x)f_Y(y)$, then $\text{cov}(X,Y)=0$
- Provide an example where $\text{cov}(X,Y)=0$ but $f_{XY}(x,y)\neq f_X(x)f_Y(y)$, so that $X$ and $Y$ are uncorrelated but not independent

## Exercise: Markov Chains and Independence
- An example of a **continuous state Markov chain** is the **autoregressive process** or AR(1):
$$
x_{t+1} = \beta x_t + \sigma \varepsilon_t,
$$
where $\varepsilon_t$ is distributed $\text{Normal}(0,1)$
- What's the covariance between $x_{t+1}$ and $x_t$?
- What are the mean and variance of $x_{t+1}$?
- What's the distribution of $x_{t+1}$?
- Spatial and temporal correlation is crucial for modern data science
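
To get a feel for the process before working the exercise by hand, it can help to simulate a path. The sketch below is one possible implementation; the function name `simulate_ar1`, the starting value, and the parameter values are assumptions for illustration.

```python
import numpy as np

def simulate_ar1(beta, sigma, n_steps, x0=0.0, seed=0):
    """Simulate x_{t+1} = beta * x_t + sigma * eps_t with eps_t ~ Normal(0, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_steps)
    path = np.empty(n_steps + 1)
    path[0] = x0
    for t in range(n_steps):
        path[t + 1] = beta * path[t] + sigma * eps[t]
    return path

# Hypothetical parameter values chosen for illustration.
path = simulate_ar1(beta=0.8, sigma=1.0, n_steps=50_000)

# Sample moments, for comparison with a hand derivation.
lagged_cov = float(np.cov(path[1:], path[:-1])[0, 1])
print(float(path.mean()), float(path.var()), lagged_cov)
```

The printed sample mean, variance, and lag-one covariance give numbers to compare against whatever formulas come out of the exercise.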

## Bivariate Normal
- Let's start with something concrete: A pair of variables $(X,Y)$ over the same probability space, having a **bivariate normal distribution**
- It would not be very interesting to have $f_{XY}(x,y) = f_X(x)f_Y(y)$: That's just two normals living in the same space, with no connection. We want $X$ and $Y$ to carry information about one another. This complexity comes at a cost
- The bivariate normal density function is:
$$
f_{XY}(x,y) = \dfrac{1}{2 \pi \sqrt{1-\rho^2} \sigma_X \sigma_Y} \exp \left\lbrace - \frac{1}{2(1-\rho^2)} \left[ \left( \frac{x-\mu_X}{\sigma_X}\right)^2 + \left( \frac{y-\mu_Y}{\sigma_Y}\right)^2 - 2 \rho \frac{(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} \right] \right\rbrace
$$
where $\sigma_X, \sigma_Y > 0$ and $-1 < \rho <1 $
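
One way to sanity-check a transcription of this formula is to confirm that it integrates to roughly one. The sketch below does that with a simple Riemann sum; the function name `bivariate_normal_pdf`, the parameter values, and the grid are assumptions for illustration.

```python
import numpy as np

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density, transcribed term by term from the formula above."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    quad = zx**2 + zy**2 - 2.0 * rho * zx * zy
    norm_const = 2.0 * np.pi * np.sqrt(1.0 - rho**2) * sigma_x * sigma_y
    return np.exp(-quad / (2.0 * (1.0 - rho**2))) / norm_const

# Hypothetical parameter values chosen for illustration.
params = dict(mu_x=1.0, mu_y=-1.0, sigma_x=1.0, sigma_y=2.0, rho=0.6)

# Riemann sum over a wide grid: a valid density should integrate to about 1.
xs = np.linspace(-9.0, 11.0, 801)
ys = np.linspace(-21.0, 19.0, 801)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
X, Y = np.meshgrid(xs, ys, indexing="ij")
density = bivariate_normal_pdf(X, Y, **params)
total_mass = float(density.sum() * dx * dy)

print(total_mass)
```

The grid spans ten standard deviations on either side of each mean, so essentially all of the probability mass is captured.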

## Multivariate Normal
- Slightly less hideous: Letting $z = [x,y]^{\top}$ and $\mu = [\mu_X, \mu_Y]^{\top}$,
$$
f_{z}(z) = \dfrac{1}{\sqrt{(2\pi)^2 \det(\Sigma)}} \exp \left\lbrace - \frac{(z-\mu)^{\top} \Sigma^{-1}(z-\mu)}{2}\right\rbrace
$$
where
$$
\Sigma = \left[ \begin{array}{cc} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{array}\right]
$$
- If you squint, it's as if you're replacing $((x-\mu)/\sigma)^2$ in the one-dimensional version with $(z-\mu)^{\top} \Sigma^{-1}(z-\mu)$, and $1/\sigma$ with $1/\sqrt{\det(\Sigma)}$
- This equation generalizes to the pdf of a multivariate normal in $N$ dimensions: Letting $z = [x_1, ..., x_N]^{\top}$ and $\mu = [\mu_1, ..., \mu_N]^{\top}$,
$$
f_{X_1, ..., X_N}(x_1, ..., x_N) = \dfrac{1}{\sqrt{(2\pi)^N \det(\Sigma)}} \exp \left\lbrace - \frac{(z-\mu)^{\top} \Sigma^{-1}(z-\mu)}{2}\right\rbrace
$$
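where
$$
\Sigma = \left[ \begin{array}{cccc} \sigma_1^2 & \rho_{12} \sigma_1 \sigma_2 & ... & \rho_{1N} \sigma_1 \sigma_N \\  \rho_{12} \sigma_2 \sigma_1 & \sigma_2^2 & ... & \rho_{2N} \sigma_2 \sigma_N \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1N} \sigma_N \sigma_1 & \rho_{2N} \sigma_N \sigma_2 & ... & \sigma_N^2 \end{array}\right]
$$
and $\rho_{ij}$ is the correlation between $X_i$ and $X_j$, just as $\rho$ is in the bivariate case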
- The quantity $\sqrt{(z-\mu)^{\top} \Sigma^{-1}(z-\mu)}$ is called the **Mahalanobis distance** from $z$ to $\mu$: It's a covariance-weighted distance metric
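
The matrix form translates almost directly into NumPy. The sketch below is one possible implementation, with function names (`mahalanobis`, `mvn_pdf`) and example values that are assumptions for illustration; it uses a linear solve in place of an explicit matrix inverse.

```python
import numpy as np

def mahalanobis(z, mu, Sigma):
    """sqrt((z - mu)^T Sigma^{-1} (z - mu)), via a linear solve rather than an inverse."""
    d = np.asarray(z, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

def mvn_pdf(z, mu, Sigma):
    """Multivariate normal density in n = len(mu) dimensions."""
    n = len(mu)
    norm_const = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * mahalanobis(z, mu, Sigma) ** 2) / norm_const)

# Hypothetical two-dimensional example.
sigma_x, sigma_y, rho = 1.0, 2.0, 0.6
mu = np.array([1.0, -1.0])
Sigma = np.array([
    [sigma_x**2, rho * sigma_x * sigma_y],
    [rho * sigma_x * sigma_y, sigma_y**2],
])
z = np.array([2.0, 0.5])

print(mahalanobis(z, mu, Sigma), mvn_pdf(z, mu, Sigma))
```

When $\Sigma$ is the identity matrix, the Mahalanobis distance reduces to ordinary Euclidean distance; a non-trivial $\Sigma$ rescales and rotates the space so that directions with more variance count for less.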

## Exercise
- Suppose $X$ and $Y$ are distributed bivariate normal. Show that if $\rho=0$, then $X$ and $Y$ are independent.