# Probability Theory


## Notations and Rules:  
 
#### Notations  
 >$X,Y$, random variables, which can be discrete or continuous.  
 >$x_i, y_i$, specific values that $X$ and $Y$ take.  
 >**PDF**, probability density function, distribution for continuous random variable.  
 >**PMF**, probability mass function, distribution for discrete random variable.  
 >**CDF**, cumulative distribution function, the probability that $X$ lies in the interval $(-\infty, x]$.  
 >$P(X)$, the CDF of random variable $X$. 
 >$p(X)$, the PDF of random variable $X$.  
 >$p(x_i)=p(X=x_i)$, the probability that $X$ equals $x_i$.  
 >$p(X, Y)$, joint PDF of random variable $X$ and $Y$.  
 >$p(X=x_i, Y=y_i)$, the probability that $X=x_i$ and $Y=y_i$ jointly.  
 >$p(X|Y=y_i)$, the PDF of $X$ given that $Y=y_i$.  
 >$p(X=x_i|Y=y_i)$, the probability that $X=x_i$ given that $Y=y_i$.  
 
 
#### <font color='red'>The Rules of Probability</font>
 >sum rule $$\displaystyle{p(X) = \sum_Y p(X, Y)}$$
 >product rule $$\displaystyle{p(X, Y)=p(Y|X)p(X)}$$
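
As a small illustration of the two rules, the sketch below applies them to a discrete joint distribution. The table `joint` and the helper names are hypothetical, chosen only for this example.

```python
# Hypothetical joint PMF p(X, Y) over X in {0, 1} and Y in {"a", "b"}.
joint = {
    (0, "a"): 0.1, (0, "b"): 0.3,
    (1, "a"): 0.2, (1, "b"): 0.4,
}

def marginal_x(joint, x):
    """Sum rule: p(X=x) = sum over y of p(X=x, Y=y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_y_given_x(joint, y, x):
    """Product rule rearranged: p(Y=y | X=x) = p(X=x, Y=y) / p(X=x)."""
    return joint[(x, y)] / marginal_x(joint, x)

p_x0 = marginal_x(joint, 0)                          # 0.1 + 0.3
p_b_given_x0 = conditional_y_given_x(joint, "b", 0)  # 0.3 / 0.4
# Product rule: p(Y|X) p(X) recovers the joint entry p(X=0, Y="b").
reconstructed = p_b_given_x0 * p_x0
```

Multiplying the conditional by the marginal returns the original table entry, which is the product rule read from right to left.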
 
#### <font color='red'>Bayes' theorem</font>
 $$\displaystyle{p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}=\frac{p(X|Y)p(Y)}{\sum_Y p(X|Y)p(Y)}}$$
 >where $p(Y)$ is a **prior probability**, because it is the probability before we observe the identity of $X$.  
 >$p(Y|X)$ is a **posterior probability**, because it is the probability after we have observed $X$.
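
A minimal sketch of Bayes' theorem for a discrete $Y$, with the denominator computed by the sum and product rules. The prior and likelihood numbers are invented for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood):
    """Return p(Y | X=x) for every value of Y.

    prior:      dict mapping y -> p(Y=y)
    likelihood: dict mapping y -> p(X=x | Y=y) for the observed x
    """
    # Denominator p(X=x) = sum over y of p(X=x | Y=y) p(Y=y).
    evidence = sum(likelihood[y] * prior[y] for y in prior)
    return {y: likelihood[y] * prior[y] / evidence for y in prior}

# Hypothetical numbers: Y is a condition, X is a positive test result.
prior = {"ill": 0.01, "healthy": 0.99}
likelihood = {"ill": 0.95, "healthy": 0.05}   # p(X=positive | Y)
post = posterior(prior, likelihood)
```

With these numbers the posterior probability of `"ill"` is much larger than its prior but still well below one half, because the prior is so small.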
 
#### Independence 
 >If $X$ and $Y$ are independent, then  
 $$p(X, Y) = p(X)p(Y)$$
 
#### PDF
 >For PDF $p(x)$
 $$\begin{align*}&p(x\in(a,b))=\int_a^b p(x)dx\\
 &p(x) \geq 0\\
 &p(x\in(-\infty, \infty))=\int_{-\infty}^{\infty} p(x)dx=1\end{align*}$$
 
#### PDF Transform
 >PDF transform  
 >If $x = g(y)$, then a function $f(x)$ becomes $\tilde{f}(y)=f(g(y))$. Now consider a probability density $p_x(x)$ that corresponds to a density $p_y(y)$ with respect to the new variable $y$, where the suffixes denote the fact that $p_x(x)$ and $p_y(y)$ are different densities. Observations falling in the range $(x, x+\Delta x)$ will, for small values of $\Delta x$, be transformed into the range $(y, y+\Delta y)$ where $p_x(x)\Delta x \approx p_y(y)\Delta y$, and hence  
 $$\displaystyle{p_y(y) = p_x(x)\left|\frac{\Delta x}{\Delta y}\right|= p_x(x)\left|\frac{ dx}{dy}\right|=p_x(x)|g'(y)|}$$
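
The change-of-variables formula can be checked numerically. The sketch below assumes, purely as an example, that $x$ is uniform on $(0,1)$ and $x = g(y) = y^2$, so that $p_y(y) = 2y$ on $(0,1)$. It compares the predicted probability of an interval with the fraction of transformed samples that land in it.

```python
import math
import random

def p_x(x):
    """Density of x: uniform on (0, 1) (an assumed example)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def g(y):
    return y * y          # x = g(y)

def g_prime(y):
    return 2.0 * y

def p_y(y):
    """Transformed density p_y(y) = p_x(g(y)) |g'(y)|."""
    return p_x(g(y)) * abs(g_prime(y))

# Monte Carlo side: draw x, map to y = sqrt(x), count samples in [a, b].
rng = random.Random(0)
a, b, n = 0.5, 0.8, 100_000
ys = [math.sqrt(rng.random()) for _ in range(n)]
empirical = sum(a <= y <= b for y in ys) / n

# Formula side: midpoint-rule integral of p_y over [a, b].
steps = 1000
h = (b - a) / steps
predicted = sum(p_y(a + (k + 0.5) * h) for k in range(steps)) * h
```

The two numbers agree up to sampling noise. Dropping the factor $|g'(y)|$ would instead predict $b - a$, which does not match the samples.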
 
#### CDF  
 $$\displaystyle{P(z)=\int_{-\infty}^{z}p(x)dx}$$
 
#### <font color='red'>The Rules of Probability for Continuous Random Variables</font>
 >sum rule $$\displaystyle{p(x) = \int p(x, y)dy}$$
 >product rule $$\displaystyle{p(x, y)=p(y|x)p(x)}$$
 
-----------
## Expectations and Covariances


### Expectations and Covariance of function

#### Expectation of function  
>One of the most important operations involving probabilities is that of finding weighted averages of functions. The **<font color='red'>average value</font>** of some function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ and will be denoted by $\mathbb{E}[f]$ *(this is a value)*.  
>Discrete $$\displaystyle{\mathbb{E}[f]=\sum_x p(x)f(x)}$$
>Continuous $$\displaystyle{\mathbb{E}[f]=\int p(x)f(x)dx}$$

>In practice we often have only a finite number $N$ of points drawn from the probability distribution, and the expectation can then be approximated by a finite sum over these points.  
$$\displaystyle{\mathbb{E}[f]\approx\frac{1}{N}\sum_{n=1}^{N} f(x_n)}$$
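
A short sketch of this sample-based approximation, assuming as an example $f(x) = x^2$ and $x$ drawn from a standard Gaussian, for which the exact expectation is $1$. The function name `monte_carlo_expectation` is hypothetical.

```python
import random

def monte_carlo_expectation(f, sampler, n):
    """Approximate E[f] by (1/N) * sum of f(x_n) over N samples."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)
estimate = monte_carlo_expectation(
    f=lambda x: x * x,                    # f(x) = x^2
    sampler=lambda: rng.gauss(0.0, 1.0),  # x ~ N(0, 1)
    n=100_000,
)
```

The estimate approaches the exact value as $N$ grows, and its error shrinks roughly like $1/\sqrt{N}$.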

>For a 2-D function $f(x, y)$
$$\mathbb{E}_x[f(x,y)]$$
>denotes the average of the function $f(x,y)$ with respect to the distribution of $x$. Note that $\mathbb{E}_x[f(x,y)]$ is a function of $y$.  


#### Conditional expectation  
$$\displaystyle{\mathbb{E}_x[f|y]=\sum_xp(x|y)f(x)}$$

#### Variance  
$$var[f]=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])^2\big]=\mathbb{E}[f(x)^2]-\mathbb{E}[f(x)]^2$$

### Variances and Covariance of random variable
$$\begin{align*}var[x]&=\mathbb{E}[x^2]-\mathbb{E}[x]^2\\cov[x,y]&=\mathbb{E}[\{x-\mathbb{E}[x]\}\{y-\mathbb{E}[y]\}]=\mathbb{E}_{x,y}[xy]-\mathbb{E}[x]\mathbb{E}[y]\end{align*}$$
>If $x$ and $y$ are independent, then their covariance vanishes.  
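
The covariance formula can be tried on samples. The sketch below estimates $\mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y]$ for one independent pair and one dependent pair; the data-generating choices ($y = 2x + \text{noise}$) are assumptions made only for this example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Sample estimate of cov[x, y] = E[xy] - E[x]E[y]."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

rng = random.Random(0)
n = 100_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
independent = [rng.gauss(0.0, 1.0) for _ in range(n)]
dependent = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

cov_indep = cov(xs, independent)   # close to 0
cov_dep = cov(xs, dependent)       # close to 2 * var[x] = 2
var_x = cov(xs, xs)                # cov[x, x] is var[x], close to 1
```

The converse does not hold in general: a covariance of zero does not by itself imply that two variables are independent.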

>For vectors of random variables $\mathbf{x},\mathbf{y}$
$$cov[\mathbf{x},\mathbf{y}]=\mathbb{E}_{\mathbf{x},\mathbf{y}}[\{\mathbf{x}-\mathbb{E}[\mathbf{x}]\}\{\mathbf{y}^T-\mathbb{E}[\mathbf{y}^T]\}]=\mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^T]-\mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{y}^T]$$


--------------------

## The Gaussian Distribution

### Normal/Gaussian Distribution
#### Definition
$$\displaystyle{\mathcal{N}(x|\mu,\sigma^2)=\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}}$$
>where $x$ is a single real variable.  
>$\mu$ is the mean of $x$.  
>$\sigma^2$ is the variance of $x$.  

#### Expectation and Variance
>Taking $f(x) = x$, the expectation of this function under the Gaussian is  
$$\displaystyle{ \mathbb{E}[x]=\int_{-\infty}^{\infty}\mathcal{N}(x|\mu,\sigma^2)xdx=\mu }$$
>This is the **expectation** of the random variable $x$.  
>Similarly, for the second-order moment  
$$\displaystyle{ \mathbb{E}[x^2]=\int_{-\infty}^{\infty}\mathcal{N}(x|\mu,\sigma^2)x^2dx=\mu^2+\sigma^2 }$$
>The **variance** of $x$ is  
$$var[x]=\mathbb{E}[x^2]-\mathbb{E}[x]^2=\sigma^2$$
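
These three results can be verified by numerical integration. The sketch below uses a simple midpoint rule over $\mu \pm 10\sigma$; the parameter values and helper names are arbitrary choices for illustration.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h

mu, sigma2 = 1.5, 4.0                        # arbitrary example parameters
sigma = math.sqrt(sigma2)
lo, hi = mu - 10 * sigma, mu + 10 * sigma    # tails beyond this are negligible

total = integrate(lambda x: normal_pdf(x, mu, sigma2), lo, hi)           # ~ 1
first = integrate(lambda x: normal_pdf(x, mu, sigma2) * x, lo, hi)       # ~ mu
second = integrate(lambda x: normal_pdf(x, mu, sigma2) * x * x, lo, hi)  # ~ mu^2 + sigma^2
variance = second - first ** 2                                           # ~ sigma^2
```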

#### D-dim
>Gaussian distribution of D-dimensional vector $\mathbf{x}$
$$\displaystyle{\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\mathbf{\Sigma})=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\mathbf{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}}$$
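
To make the formula concrete, here is a sketch restricted to $D = 2$, where the determinant and inverse of $\mathbf{\Sigma}$ can be written out by hand. The function names and test values are hypothetical. As a sanity check, a diagonal covariance should make the density factorize into two 1-D Gaussians.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def mvn_pdf_2d(x, mu, cov):
    """N(x | mu, Sigma) for D = 2, with Sigma given as [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix: (1/det) * [[d, -b], [-c, a]]
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    # Normalizer (2 pi)^{D/2} |Sigma|^{1/2} with D = 2
    norm = (2.0 * math.pi) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm

x, mu = [0.3, -1.0], [0.0, 0.5]
diag = [[2.0, 0.0], [0.0, 0.5]]
joint = mvn_pdf_2d(x, mu, diag)
factored = normal_pdf(x[0], mu[0], 2.0) * normal_pdf(x[1], mu[1], 0.5)
```

For general $D$ one would normally rely on a linear algebra library rather than a hand-written inverse.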

### <font color='red'>Use Gaussian in Maximum Likelihood</font>
>Consider a 1-D data set $\mathbb{x}=\{x_1,x_2,\cdots,x_N\}$ of $N$ observations.  
>Suppose the observations are drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would like to determine these parameters from the data set.  
>Because the observations are independent, the probability of the data set given $\mu$ and $\sigma^2$ is the product of the individual probabilities  
$$\displaystyle{p(\mathbb{x}|\mu,\sigma^2)=\prod_{n=1}^N \mathcal{N}(x_n|\mu,\sigma^2)}$$
>This is the **<font color='red'>likelihood function</font>** for the Gaussian with two unknown parameters $\mu$ and $\sigma^2$. We choose $\mu$ and $\sigma^2$ so that the likelihood is maximized, which gives the Gaussian distribution most likely to have generated these data.  
>For convenience, we take the logarithm of each side of the equation. Since the logarithm is monotonically increasing, maximizing the log likelihood is equivalent to maximizing the likelihood.  
$$\displaystyle{\ln p(\mathbb{x}|\mu,\sigma^2)=-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2-\frac{N}{2}\ln\sigma^2-\frac{N}{2}\ln(2\pi)}$$
>Maximizing the log likelihood with respect to $\mu$ gives the solution  
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^{N}x_n$$
>which is the sample mean. Similarly, maximizing with respect to $\sigma^2$ gives the solution  
$$\sigma^2_{ML}=\frac{1}{N}\sum_{n=1}^N(x_n-\mu_{ML})^2$$
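>which is the sample variance measured with respect to the sample mean $\mu_{ML}$. This estimate is biased: on average it underestimates the true variance by a factor of $(N-1)/N$. The unbiased estimate of the variance is  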
$$\displaystyle{\tilde{\sigma}^2=\frac{N}{N-1}\sigma^2_{ML}}$$
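
The estimators above are straightforward to compute. The sketch below also averages them over many small synthetic data sets to show why the factor $N/(N-1)$ is applied: the expected value of $\sigma^2_{ML}$ is $(N-1)/N$ times the true variance. The function name `gaussian_ml` and the simulation settings are assumptions made for illustration.

```python
import random

def gaussian_ml(xs):
    """Return (mu_ML, sigma2_ML, corrected sigma2) for 1-D data xs."""
    n = len(xs)
    mu_ml = sum(xs) / n
    sigma2_ml = sum((x - mu_ml) ** 2 for x in xs) / n
    sigma2_corrected = n / (n - 1) * sigma2_ml
    return mu_ml, sigma2_ml, sigma2_corrected

# Average the estimates over many small data sets drawn from N(0, 1)
# to expose the bias of sigma2_ML.
rng = random.Random(0)
n, trials = 5, 20_000
ml_sum = corrected_sum = 0.0
for _ in range(trials):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    _, s2_ml, s2_corrected = gaussian_ml(xs)
    ml_sum += s2_ml
    corrected_sum += s2_corrected
avg_ml = ml_sum / trials                # near (N-1)/N = 0.8
avg_corrected = corrected_sum / trials  # near the true variance 1.0
```

The bias matters most for small $N$ and becomes negligible as $N$ grows.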