# Math Recap

## Outline
1. Linear Algebra
1. Probability Theory
1. Information Theory
1. Numerical Optimization

## Readings

1. Goodfellow et al. Deep Learning. pp. 43-106
1. recap.pdf
1. (Positive Definite Matrix) http://mlwiki.org/index.php/Positive-Definite_Matrices

## 1 Linear Algebra

**Vector Space**
1. $0$: $x + 0 = x$
1. $(-x)$: $x + (-x) = 0$
1. $1$: $1x = x$
1. $x + y = y + x$
1. $(x + y) + z = x + (y + z)$ 
1. $\alpha (x+y) = \alpha x + \alpha y$ and $(\alpha + \beta) x = \alpha x + \beta x$
<img src="images/vector.png" style="height:200px">

**Scalar Product**
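$\langle \cdot | \cdot \rangle : V \times V \rightarrow R$
1. $\langle x|x \rangle \geq 0$
1. $\langle x|x \rangle = 0$ iff $x = 0$
1. $\langle x|y \rangle = \langle y|x \rangle$
1. $\langle x+y|z \rangle = \langle x|z \rangle + \langle y|z \rangle$ and $\langle \alpha x| y \rangle = \alpha \langle x|y \rangle$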
<img src="images/scalar.png" style="height:200px">

**Norm**  
1. $\lVert x \rVert \geq 0$
2. $\lVert x \rVert = 0$ iff $x = 0$
3. $\lVert \alpha x \rVert = |\alpha| \lVert x \rVert$
4. $\lVert x + y \rVert \leq \lVert x \rVert + \lVert y \rVert$

**Metric**  
1. $d(x,y) \geq 0$
2. $d(x,y) = 0$ iff $x = y$
3. $d(x,y) = d(y,x)$
4. $d(x,y) \leq d(x,z) + d(z,y)$
<img src="images/metric.svg" style="height:200px">
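
The three definitions above chain together: a scalar product induces a norm, and a norm induces a metric. A minimal sketch, assuming the standard dot product on $R^n$ (the helper names `inner`, `norm` and `dist` are illustrative, not from the readings):

```python
import numpy as np

def inner(x, y):
    """Standard scalar product on R^n."""
    return float(np.dot(x, y))

def norm(x):
    """Norm induced by the scalar product: ||x|| = sqrt(<x|x>)."""
    return np.sqrt(inner(x, x))

def dist(x, y):
    """Metric induced by the norm: d(x, y) = ||x - y||."""
    return norm(x - y)

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])
z = np.array([0.0, 2.0])

print(norm(x))                                 # 5.0
print(dist(x, y) <= dist(x, z) + dist(z, y))   # triangle inequality holds
```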

**Linear Mapping**  
$L: V \rightarrow W$
1. $L(\alpha x + \beta y) = \alpha L(x) + \beta L(y)$

**Linear Operator**  
$L: V \rightarrow V$
1. $L(\alpha x + \beta y) = \alpha L(x) + \beta L(y)$

**Inverse Matrix**  
$\exists! B = A^{-1}$ : $AB = BA = I$ iff $\det(A) \neq 0$
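
A short NumPy sketch of the definition. The matrices are arbitrary examples; the first has a negative determinant and is still invertible, since only a zero determinant rules out an inverse.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

det_A = np.linalg.det(A)     # 1*4 - 2*3 = -2, nonzero, so A is invertible
A_inv = np.linalg.inv(A)

# Both products should reproduce the identity up to rounding error.
print(np.allclose(A @ A_inv, np.eye(2)))
print(np.allclose(A_inv @ A, np.eye(2)))

# A singular matrix (det = 0) has no inverse; NumPy raises LinAlgError.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
try:
    np.linalg.inv(S)
    singular_detected = False
except np.linalg.LinAlgError:
    singular_detected = True
print(singular_detected)
```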

**Orthogonal Matrix**  
$QQ^T = Q^TQ = I$

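**Positive definite matrix**  
A symmetric matrix $A$ is positive definite if $x^T A x > 0$ for all $x \neq 0$. If only $x^T A x \geq 0$ holds, $A$ is positive semi-definite.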
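
One common way to test the strict condition $x^T A x > 0$ for a symmetric matrix is to check that all eigenvalues are strictly positive. A minimal sketch (the function name `is_positive_definite` is illustrative):

```python
import numpy as np

def is_positive_definite(A):
    """Check x^T A x > 0 for all x != 0 via the eigenvalues of symmetric A."""
    if not np.allclose(A, A.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])   # eigenvalues 1 and 3
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])    # eigenvalues -1 and 3

print(is_positive_definite(A))  # True
print(is_positive_definite(B))  # False
```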

**Singular Value Decomposition**  
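$A = U \Sigma V^T$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal with non-negative entries (the singular values)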
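
A small sketch of the decomposition with `np.linalg.svd`, using an arbitrary $2 \times 3$ matrix. It rebuilds $A$ from the factors and checks the orthogonality of the factors.

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# full_matrices=False gives the reduced SVD: U is 2x2, s has 2 entries, Vt is 2x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))          # reconstruction matches A
print(np.allclose(U.T @ U, np.eye(2)))    # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # rows of Vt are orthonormal
```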

## 2 Probability Theory
**Probability**  
$$p(x)$$

**Conditional Probability**    
<img src="images/cond.png" style="height:200px">

**Joint Probability**     
$$p(x,y) = p(x | y) p(y) = p(y|x)p(x)$$

**Bayes Theorem**  
<img src="images/bayes.jpg" style="height:300px">
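
Dividing the joint probability identity by $p(x)$ gives Bayes' theorem, $p(y|x) = p(x|y) p(y) / p(x)$. A worked sketch with a diagnostic test; all numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01            # prior p(D)
p_pos_given_disease = 0.90  # likelihood p(+ | D)
p_pos_given_healthy = 0.05  # false positive rate p(+ | not D)

# Evidence p(+) by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1.0 - p_disease))

# Bayes' theorem: p(D | +) = p(+ | D) p(D) / p(+).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))  # 0.1538
```

Even with a fairly accurate test, the posterior stays low because the prior is small.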

**Expected Value (Mean)**  
$$E[x] = \int x p(x) dx$$

**Variance**  
$$Var[x] = \int (E[x] - x)^2 p(x) dx$$

**Covariance**  
$$Cov[x_i, x_j] = \iint (E[x_i] - x_i)(E[x_j] - x_j) p(x_i, x_j) dx_i dx_j$$

**Correlation**  
$$\rho(x_i, x_j) = \frac {Cov[x_i, x_j]} {\sqrt{Var[x_i] Var[x_j]}}$$
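
With samples instead of densities, the integrals become averages. A sketch on synthetic data (the linear relation between `x` and `y` is an assumption made for the example). The correlation divides the covariance by the product of the standard deviations, which keeps it in $[-1, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)  # y depends linearly on x plus noise

mean_x, mean_y = x.mean(), y.mean()
var_x = np.mean((x - mean_x) ** 2)
var_y = np.mean((y - mean_y) ** 2)
cov_xy = np.mean((x - mean_x) * (y - mean_y))
rho = cov_xy / np.sqrt(var_x * var_y)

# The manual estimate should agree with NumPy's built-in correlation.
print(np.isclose(rho, np.corrcoef(x, y)[0, 1]))
```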

**Normal Probability Distribution**  
$$N(x; m, \sigma^2) = \frac 1 {\sqrt{2\pi} \sigma} \exp\left(-\frac {(x - m)^2} {2\sigma^2}\right)$$
<img src="images/normal.png" style="height:200px">

**Maximum Likelihood Estimation**  
i.i.d. samples $x_1, \dots, x_n$  
Likelihood function:   
$$L(\theta) = p(x_1, ..., x_n; \theta) = \prod_i p(x_i; \theta)$$  
$$\hat \theta_{MLE} = \arg\max_{\theta} \log L(\theta) = \arg\max_{\theta} \sum_i \log p(x_i; \theta) $$  
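
As an illustration, for a normal model with $\theta = (m, \sigma)$ the maximizers have a closed form: the sample mean and the (biased) sample standard deviation. The sketch below uses synthetic data and checks that perturbing the estimate lowers the log-likelihood; the function name `log_likelihood` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic i.i.d. samples

def log_likelihood(m, sigma, x):
    """Sum of log N(x_i; m, sigma^2) over the samples."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - m) ** 2 / (2 * sigma**2))

# Closed-form maximizers of the Gaussian log-likelihood.
m_mle = x.mean()
sigma_mle = np.sqrt(np.mean((x - m_mle) ** 2))  # note: divides by n, not n - 1

# Any nearby parameter value should give a lower log-likelihood.
best = log_likelihood(m_mle, sigma_mle, x)
print(best >= log_likelihood(m_mle + 0.1, sigma_mle, x))
print(best >= log_likelihood(m_mle, sigma_mle * 1.1, x))
```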

**Maximum a posteriori estimation**   
i.i.d. samples $x_1, \dots, x_n$

posterior probability:  
$$p(\theta | x_1, ...,x_n) = \frac {p(\theta) p(x_1, ..., x_n | \theta)} {p(x_1, ..., x_n)}$$  
$$\hat \theta_{MAP} = \arg\max_{\theta} p(\theta) p(x_1, ..., x_n | \theta) = \arg\max_{\theta} \left(\log p(\theta) + \sum_i \log p(x_i | \theta)\right)$$
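
A sketch of MAP estimation for one simple case: a normal likelihood with known $\sigma$ and a normal prior $N(m_0, s_0^2)$ on the mean $\theta$. Both the model and the numbers are assumptions made for the example. In this case the maximizer has a closed form, which the code cross-checks against a grid search over the log-posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                      # known noise std (assumed)
x = rng.normal(loc=3.0, scale=sigma, size=10)

m0, s0 = 0.0, 1.0                # prior mean and std for theta (assumed)

def log_posterior(theta):
    """log p(theta) + sum_i log p(x_i | theta), up to an additive constant."""
    log_prior = -(theta - m0) ** 2 / (2 * s0**2)
    log_lik = -np.sum((x - theta) ** 2) / (2 * sigma**2)
    return log_prior + log_lik

n = len(x)
theta_mle = x.mean()
theta_map = (s0**2 * x.sum() + sigma**2 * m0) / (n * s0**2 + sigma**2)

# Cross-check the closed form against a brute-force grid search.
grid = np.linspace(-1.0, 5.0, 6001)
theta_grid = grid[np.argmax([log_posterior(t) for t in grid])]
print(theta_mle, theta_map, theta_grid)
```

The prior pulls the MAP estimate from the sample mean towards $m_0$; with more samples the pull weakens.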

## 3 Information Theory

**Entropy**  
$$ H(p) = - \sum_i p_i \log p_i$$ for discrete distributions  
$$ H(p) = - \int p(x) \log p(x) dx $$ for continuous distributions  

<img src="images/entropy.png" style="height:200px">

**Cross-entropy**

$$ H(p,q) = - \sum_i p_i \log q_i$$ for discrete distributions  
$$ H(p,q) = - \int p(x) \log q(x) dx$$ for continuous distributions  
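
A sketch of the discrete formulas in NumPy, using the natural logarithm (so the units are nats). The distributions are made up for the example. The cross-entropy $H(p,q)$ is never smaller than $H(p)$ and equals it when $q = p$.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i (assumes q_i > 0 wherever p_i > 0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # a poor model of the fair coin

print(entropy(p))           # log 2, about 0.693 nats
print(cross_entropy(p, q))  # larger than H(p)
print(np.isclose(cross_entropy(p, p), entropy(p)))
```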

## 4 Numerical Optimization  
$$f: R^n \rightarrow R$$  
$$f(x) = f(x_1, x_2, \dots, x_n)$$  

**Derivative**
<img src="images/derivative.svg" style="height:200px">

**Gradient**  
$$[\nabla f]_i = \frac {\partial f} {\partial x_i}$$
<img src="images/grad.png" style="height:200px">

**Hessian**  
$$[\nabla^2 f]_{i,j} = \frac {\partial^2 f} {\partial x_i \partial x_j}$$
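
Both objects can be approximated with central finite differences, which is a handy check for hand-derived formulas. The sketch below assumes a quadratic test function $f(x) = \frac 1 2 x^T A x - b^T x$ with symmetric $A$, whose gradient is $Ax - b$ and whose Hessian is $A$. The helper names are illustrative.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric, so the Hessian of f is A
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def numerical_gradient(f, x, h=1e-5):
    """Central differences: [grad f]_i ~ (f(x + h e_i) - f(x - h e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

def numerical_hessian(f, x, h=1e-4):
    """Column j is the central difference of the gradient along e_j."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(x)
        e[j] = h
        H[:, j] = (numerical_gradient(f, x + e)
                   - numerical_gradient(f, x - e)) / (2 * h)
    return H

x0 = np.array([0.5, -1.5])
print(np.allclose(numerical_gradient(f, x0), A @ x0 - b, atol=1e-6))
print(np.allclose(numerical_hessian(f, x0), A, atol=1e-4))
```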

**Gradient Descent**

$$f(x) \rightarrow \min_{x}$$

$$ x_i^{(t)} \leftarrow x_i^{(t-1)} - \alpha [\nabla f(x^{(t-1)}) ]_i$$

where $\alpha > 0$ is the learning rate (step size).
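
A minimal sketch of the update rule on a quadratic $f(x) = \frac 1 2 x^T A x - b^T x$, where the exact minimizer solves $Ax = b$ and can be used as a check. The test function, the step size `alpha=0.1` and the fixed step count are assumptions made for the example, not recommended defaults.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite, so f has a unique minimum
b = np.array([1.0, -1.0])

def grad_f(x):
    """Gradient of f(x) = 0.5 x^T A x - b^T x."""
    return A @ x - b

def gradient_descent(grad, x0, alpha=0.1, n_steps=500):
    """Repeat x <- x - alpha * grad(x) for a fixed number of steps."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
x_exact = np.linalg.solve(A, b)   # minimizer satisfies A x = b
print(np.allclose(x_min, x_exact))
```

If the step size is too large the iterates diverge; for this quadratic the method converges when `alpha` is below 2 divided by the largest eigenvalue of $A$.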