# Part I: Shaowei

## Overview

- Lesson 1: Introduction
    - Types of Machine Learning (ML)
- Lesson 2: Regression
    - Linear Regression
    - Gradient Descent
    - Multivariate Linear Regression
    - Ridge Regression
- Lesson 3: Classification
    - Linear Classification
    - Perceptron Algorithm
    - Hinge Loss Algorithm
    - Logistic Regression
- Lesson 4: Clustering
    - K-means
- Lesson 5: Recommendation Systems
    - K-nearest Neighbours
    - Matrix Factorization
- Lesson 6: Support Vector Machines (SVM)
    - Lagrange Multipliers
    - Maximum Margins
- Lesson 7: Deep Learning
    - Feedforward Networks
    - Backpropagation
    - Autoencoders
- Lesson 8: Generative Models
    - Maximum Likelihood
    - Log Likelihood
- Lesson 9: Expectation Maximization
    - Expectation Maximization

---

## Gradient Search

### Exact Solution

- Given the average gradient: $\nabla\mathcal L_n(\theta)$
- Solve $\nabla\mathcal L_n(\theta)=0$ for its solution(s)
- Among those solutions, choose the $\theta$ with the lowest training loss (average loss): $\mathcal L_n(\theta;S_n)$

### Gradient Descent

- Start with random $\theta$
- Update $\theta\leftarrow\theta-\eta_k\nabla\mathcal L_n(\theta)$
    - $\eta_k$ is the learning rate
    - $k$ is the iteration number
- Repeat step 2 until the change in $\mathcal L_n(\theta)$ is below a *small* threshold (convergence)

### Stochastic Gradient Descent

- Start with random $\theta$
- Update $\theta\leftarrow\theta-\eta_k\nabla\mathcal L_m(\theta;\mathcal B_m)$
    - $\mathcal B_m$ is a random sample of $S_n$ (minibatch) 
- Repeat step 2 until convergence
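
The two update rules above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the fixed learning rate (instead of a schedule $\eta_k$), the start at $\theta=0$ instead of a random point, the fixed step budget for SGD, and the toy quadratic loss are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(loss, grad, theta, eta=0.1, tol=1e-10, max_iter=10_000):
    """Full-batch gradient descent with a fixed learning rate eta."""
    theta = np.asarray(theta, dtype=float)
    prev = loss(theta)
    for _ in range(max_iter):
        theta = theta - eta * grad(theta)      # theta <- theta - eta * grad L_n(theta)
        cur = loss(theta)
        if abs(prev - cur) < tol:              # change in loss below a small threshold
            break
        prev = cur
    return theta

def minibatch_sgd(batch_grad, data, theta, eta=0.05, batch_size=2, steps=2_000, seed=0):
    """SGD: each step uses the gradient on a random minibatch B_m of the data."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    for _ in range(steps):                     # fixed step budget instead of a convergence test
        idx = rng.choice(len(data), size=batch_size, replace=False)
        theta = theta - eta * batch_grad(theta, data[idx])
    return theta

# Toy loss (an assumption for illustration): L_n(theta) = mean of 0.5 * (theta - s)^2,
# whose exact minimiser is the sample mean.
samples = np.array([1.0, 2.0, 3.0, 6.0])

def loss(theta):
    return np.mean(0.5 * (theta - samples) ** 2)

def grad(theta):
    return np.mean(theta - samples)

def batch_grad(theta, batch):
    return np.mean(theta - batch)

theta_gd = gradient_descent(loss, grad, theta=0.0)
theta_sgd = minibatch_sgd(batch_grad, samples, theta=0.0)
print(theta_gd, theta_sgd, samples.mean())
```

With a constant learning rate the SGD iterate keeps fluctuating around the minimiser rather than settling on it, which is one reason a decaying $\eta_k$ is often used in practice.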

---

## Linear Regression

### Problem: 
- Input: matrix x
$$
x=\left(\begin{array}{cccc}x_{11}&x_{21}&...&x_{d1}\\x_{12}&x_{22}&...&x_{d2}\\...&...&...&...\\x_{1n}&x_{2n}&...&x_{dn}\end{array}\right)
$$
    - $n$ is the number of observations
    - $d$ is the number of features
- Output: vector $y:\{y_1, y_2, ... , y_n\}$
- $x$ and $y$ can take any numerical value
- $x\in\mathbb R^d, y\in\mathbb R$

### Model:
- $f(x;\theta, \theta_0)=\theta_1x_1+\theta_2x_2+...+\theta_dx_d+\theta_0$
- $f(x;\theta, \theta_0)=\theta^Tx+\theta_0$
- parameters: $\theta\in\mathbb R^d,\theta_0\in\mathbb R$

### Optimization (**General**):
- Loss function: $\text{Loss}(z)=\frac{1}{2}z^2$
- Point loss: $\mathcal L_1(\theta;x,y)=\text{Loss}(y-f(x;\theta))$
- Average loss: $\mathcal L_n(\theta;S_n)=\frac{1}{n}\sum_{(x,y)\in S_n}\frac{1}{2}(y-f(x;\theta))^2$
- Point gradient: $\nabla\mathcal L_1(\theta;x,y)=\frac{d}{d\theta}\mathcal L_1(\theta;x,y)$
- Average gradient: $\nabla\mathcal L_n(\theta;S_n) = \frac{1}{n}\sum_{(x,y)\in S_n}\nabla\mathcal L_1(\theta;x,y)$

### Model (**Constant Feature Trick**):
- $f(x;\theta,\theta_0)=\theta_1x_1+\theta_2x_2+...+\theta_dx_d+\boldsymbol{\theta_0x_0}$
    - $x_0 = 1$
- $f(x;\tilde\theta)=\tilde\theta^T\tilde x$
    - $\tilde\theta=(\theta,\theta_0)$
    - $\tilde x=(x,x_0)$
- parameters: $\tilde\theta\in\mathbb R^{d+1},\tilde x\in\mathbb R^{d+1}$
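
As a quick check of the constant feature trick, the sketch below appends a column of ones to a small data matrix and confirms that $\tilde\theta^T\tilde x$ reproduces $\theta^Tx+\theta_0$. The numbers and variable names are made up for illustration; rows are observations here, which is an assumed layout.

```python
import numpy as np

# Three observations (rows) with d = 2 features each; values are arbitrary.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
theta = np.array([0.5, -1.0])
theta0 = 2.0

# x~ = (x, x0) with x0 = 1, and theta~ = (theta, theta0).
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
theta_tilde = np.append(theta, theta0)

original = X @ theta + theta0        # theta^T x + theta_0 for every row
trick = X_tilde @ theta_tilde        # theta~^T x~ for every row
print(np.allclose(original, trick))
```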

### Optimization
$$
\begin{aligned}
x\leftarrow\tilde x\\
\theta\leftarrow\tilde\theta
\end{aligned}
$$
- Point loss: $\mathcal L_1(\theta;x,y)=\frac{1}{2}(y-\theta^Tx)^2$
- Average loss: $\mathcal L_n(\theta;S_n)=\frac{1}{n}\sum_{(x,y)\in S_n}\frac{1}{2}(y-\theta^Tx)^2$
- Point gradient: 
$$
\begin{aligned}
\nabla\mathcal L_1(\theta;x,y)&=\frac{d}{d\theta}\frac{1}{2}(y-\theta^Tx)^2\\
&=\left(\begin{array}{c}\frac{\partial}{\partial\theta_1}\frac{1}{2}(y-\theta^Tx)^2\\\frac{\partial}{\partial\theta_2}\frac{1}{2}(y-\theta^Tx)^2\\...\\\frac{\partial}{\partial\theta_d}\frac{1}{2}(y-\theta^Tx)^2\end{array}\right)\\
&=\left(\begin{array}{c}-x_1(y-\theta^Tx)\\-x_2(y-\theta^Tx)\\...\\-x_d(y-\theta^Tx)\end{array}\right)\\
&=-\left(\begin{array}{c}x_1\\x_2\\...\\x_d\end{array}\right)(y-\theta^Tx)\\
&=-x(y-\theta^Tx)
\end{aligned}
$$
- Average gradient:
$$
\begin{aligned}
\nabla\mathcal L_n(\theta)&=\frac{1}{n}\sum_{(x,y)\in S_n}-x(y-\theta^Tx)\\
&= \frac{1}{n}(-x_1(y_1-\theta^Tx_1)-x_2(y_2-\theta^Tx_2)-...-x_n(y_n-\theta^Tx_n))\\
&= \frac{1}{n}\sum_{i=1}^n(-x_iy_i+x_ix_i^T\theta)\\
&=\frac{1}{n}(X^TX\theta -X^TY)
\end{aligned}
$$
    - $X$ is the matrix whose $i$-th row is $x_i^T$, and $Y=(y_1,...,y_n)^T$

### Solution:
- Exact solution: 
$$
\begin{aligned}
\frac{1}{n}(X^TX\hat\theta -X^TY) &= 0\\
X^TX\hat\theta &=X^TY\\
\hat\theta&=(X^TX)^{-1}X^TY
\end{aligned}
$$
- Gradient Descent: $\theta\leftarrow\theta-\eta_k\left[\frac{1}{n}(X^TX\theta-X^TY)\right]$
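
Both solutions can be compared numerically. The sketch below builds synthetic data (the sizes, true parameters, noise level, learning rate, and iteration count are assumptions chosen for the example), solves the normal equations, and runs the gradient descent update with the matrix form of the average gradient. It uses `np.linalg.solve` on $X^TX\hat\theta=X^TY$ rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Rows of X are observations; the last column is the constant feature x0 = 1.
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])          # made-up parameters
Y = X @ true_theta + 0.01 * rng.normal(size=n)        # small Gaussian noise

# Exact solution: solve X^T X theta = X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent with a fixed learning rate (assumed values).
theta = np.zeros(d + 1)
eta = 0.1
for _ in range(2000):
    grad = (X.T @ X @ theta - X.T @ Y) / n            # average gradient in matrix form
    theta = theta - eta * grad

print(theta_hat)
print(theta)
```

On this well-conditioned problem the two estimates agree to many decimal places; on poorly conditioned data gradient descent may need a smaller learning rate or many more iterations.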

### Test Loss (**Actual Data**):

$\mathcal R(\theta)=\frac{1}{n}\sum_{x,y}\frac{1}{2}(y-\theta^Tx)^2$

---

## Ridge Regression

### Problem:
Same as Linear Regression

### Model:
Same as Linear Regression with Constant Feature Trick

### Optimization:
- Point loss: $\mathcal L_{1,\lambda}(\theta)=\frac{1}{2}(y-\theta^Tx)^2\boldsymbol{+\frac{\lambda}{2}{\left\lVert\theta\right\rVert}^2}$
- Average loss: $\mathcal L_{n,\lambda}(\theta)=\frac{1}{n}\sum_{(x,y)\in S_n}\frac{1}{2}(y-\theta^Tx)^2+\frac{\lambda}{2}{\left\lVert\theta\right\rVert}^2$
- Point gradient:
$$
\begin{aligned}
\nabla\mathcal L_{1,\lambda}(\theta)&= \frac{d}{d\theta}\left(\frac{1}{2}(y-\theta^Tx)^2+\frac{\lambda}{2}{\left\lVert\theta\right\rVert}^2\right)\\
&= -x(y-\theta^Tx)+\lambda\theta
\end{aligned}
$$
- Average gradient:
$$
\begin{aligned}
\nabla \mathcal L_{n,\lambda}(\theta)&=\frac{d}{d\theta}\left(\frac{1}{n}\sum_{(x,y)\in S_n}\frac{1}{2}(y-\theta^Tx)^2+\frac{\lambda}{2}{\left\lVert\theta\right\rVert}^2\right)\\
&=\frac{1}{n}\sum_{i=1}^n\left(-x_i(y_i-\theta^Tx_i)\right)+\lambda\theta\\
&=\frac{1}{n}(X^TX\theta-X^TY)+\lambda\theta
\end{aligned}
$$
### Solution:
- Exact solution:
$$
\begin{aligned}
\frac{1}{n}(X^TX\hat\theta-X^TY)+\lambda\hat\theta&=0\\
X^TX\hat\theta+n\lambda\hat\theta&=X^TY\\
(X^TX+n\lambda I)\hat\theta&=X^TY\\
\hat\theta&=(X^TX+n\lambda I)^{-1}X^TY
\end{aligned}
$$
- Gradient descent: 
$$
\begin{aligned}
\theta&\leftarrow\theta-\eta_k(\frac{1}{n}(X^TX\theta-X^TY)+\lambda\theta)\\
\theta&\leftarrow\theta-\eta_k(\frac{1}{n}(X^TX\theta-X^TY))-\eta_k\lambda\theta\\
\theta&\leftarrow(1-\eta_k\lambda)\theta-\eta_k(\frac{1}{n}(X^TX\theta-X^TY))
\end{aligned}
$$
*Note:* $X^TX+n\lambda I$ is always invertible when $\lambda>0$, so the exact solution always exists.
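
The sketch below illustrates both ridge solutions on a tiny data set whose two columns are identical, so $X^TX$ alone is singular while $X^TX+n\lambda I$ is not. The function names, data, $\lambda$, learning rate, and iteration count are assumptions for the example.

```python
import numpy as np

def ridge_exact(X, Y, lam):
    """Solve (X^T X + n*lam*I) theta = X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def ridge_gd(X, Y, lam, eta=0.1, iters=5000):
    """Gradient descent in the 'shrink then step' form of the update."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        grad = (X.T @ X @ theta - X.T @ Y) / n
        theta = (1 - eta * lam) * theta - eta * grad
    return theta

# Duplicate columns: X^T X has rank 1, so plain least squares has no unique solution.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
Y = np.array([2.0, 4.0, 6.0])

rank = np.linalg.matrix_rank(X.T @ X)
theta_exact = ridge_exact(X, Y, lam=0.1)
theta_gd = ridge_gd(X, Y, lam=0.1)
print(rank, theta_exact, theta_gd)
```

Here the penalty splits the weight evenly between the two identical features, and both methods arrive at the same $\hat\theta$.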

### Test Loss:

$\mathcal R(\theta)=\frac{1}{n}\sum_{x,y}\frac{1}{2}(y-\theta^Tx)^2$

---

## Linear Classification (Perceptrons)

### Problem:

- Input: matrix
$$
x=\left(\begin{array}{cccc}x_{11}&x_{21}&...&x_{d1}\\x_{12}&x_{22}&...&x_{d2}\\...&...&...&...\\x_{1n}&x_{2n}&...&x_{dn}\end{array}\right)
$$
- Output: vector $y:\{y_1,y_2,...,y_n\}$
- $x$ can take any numerical value, $y_i\in\{-1,+1\}$
- $x\in\mathbb R^d, y\in\{-1,+1\}$

### Model

$$\begin{aligned}h(x;\theta,\theta_0)&=\text{sign}(\theta_1x_1+\theta_2x_2+...+\theta_dx_d+\theta_0)\\&=\text{sign}(\theta_1x_1+\theta_2x_2+...+\theta_dx_d+\theta_0x_0)\\&=\text{sign}(\tilde\theta^T\tilde x)\end{aligned}$$
$$
\begin{aligned}
x&\leftarrow\tilde x\\
\theta&\leftarrow\tilde\theta
\end{aligned}
$$
- parameters: $\theta\in\mathbb R^{d+1}$

### Optimization:

- Loss function: $\text{Loss}(z)=\text{Ind}[z\leq0]$
    - $\text{Ind}[x]=\left\{\begin{array}{ll}1&x=\text{True}\\0&x=\text{False}\end{array}\right.$
- Point loss: $\mathcal L_1(\theta;x,y)=\text{Ind}[y(\theta^Tx)\leq0]$
    - $y\neq h(x;\theta)\ \ \Rightarrow\ \ y(\theta^Tx)\leq0$
    - *Note: a point lying exactly on the decision boundary is counted as a misclassification.*
- Average loss: $\mathcal L_n(\theta;S_n)=\frac{1}{n}\sum_{(x,y)\in S_n}\mathcal L_1(\theta;x,y)$
- *Note: the gradient is 0 almost everywhere, so gradient descent cannot make progress on this loss.*

### Solution:

**Perceptron Algorithm (Mistake-Driven Algorithm)**
- Initialize $\theta=0$
- For each data point $(x,y)\in S_n$
    - if $y(\theta^Tx)\leq0$, 
        - $\theta\leftarrow\theta+yx$
- Repeat Step 2 until no mistakes are found
- *Note: the algorithm never terminates if the data is not linearly separable.*
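
A minimal sketch of the perceptron algorithm, assuming the constant feature has already been appended to each input. The `max_epochs` cap is an addition that is not part of the algorithm as stated: it only guards against the non-separable case, where the loop would otherwise never stop. The data set is made up.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Return (theta, converged) after running mistake-driven updates."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (theta @ x_i) <= 0:     # mistake, or point on the boundary
                theta = theta + y_i * x_i
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes
            return theta, True
    return theta, False                      # cap reached (assumed safeguard)

# Linearly separable toy data; last column is the constant feature.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-3.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

theta, converged = perceptron(X, y)
print(theta, converged)
```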

---

## Hinge Loss (Support Vector Machine)

### Problem

Same as Linear Classification

### Model

Same as Linear Classification with Constant Feature Trick

### Optimization

- Loss function: $\text{Loss}(z)=\text{max}\{1-z,0\}$
- Point loss: $\mathcal L_1(\theta)=\text{max}\{1-y(\theta^Tx),0\}$ 
- Average loss: $\mathcal L_n(\theta)=\frac{1}{n}\sum_{(x,y)\in S_n}\text{max}\{1-y(\theta^Tx),0\}$
- Point gradient: $\nabla\mathcal L_1(\theta)=\left\{\begin{array}{ll}0&\text{if }y(\theta^Tx)>1\\-yx&\text{otherwise}\end{array}\right.$

### Solution

**Stochastic Gradient Descent**
- Initialize $\theta = 0$
- Select data $(x,y)\in S_n$ **at random**
    - if $y(\theta^Tx)\leq1$,
        - $\theta\leftarrow\theta+\eta_kyx$
- Repeat Step 2 until convergence.
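
The stochastic update for the hinge loss can be sketched as below. The fixed learning rate, the fixed number of steps in place of a convergence test, the helper names, and the toy data are assumptions for illustration.

```python
import numpy as np

def average_hinge_loss(theta, X, y):
    """L_n(theta) = mean of max(1 - y * theta^T x, 0)."""
    return np.mean(np.maximum(1 - y * (X @ theta), 0))

def hinge_sgd(X, y, eta=0.1, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(X))                 # pick one data point at random
        if y[i] * (theta @ X[i]) <= 1:           # inside the margin or misclassified
            theta = theta + eta * y[i] * X[i]    # step against the point gradient -y x
    return theta

# Linearly separable toy data; last column is the constant feature.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-3.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

theta = hinge_sgd(X, y)
print(theta, average_hinge_loss(theta, X, y))
```

Unlike the perceptron, the update also fires for correctly classified points whose margin $y(\theta^Tx)$ is at most 1, which pushes the boundary away from the data.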

---

## Logistic Regression

### Problem:

Same as Linear Classification

### Model:

- $\begin{aligned}h(x;\theta)&=\mathbb P(y=+1|\ x)\\&=\text{Sig}(\theta^Tx)\end{aligned}$
    - $\text{Sig}(z)=\frac{1}{1+e^{-z}}$
    - $\text{Sig}(z)\in(0,1)$
- $\text{Sig}(\theta^Tx)\geq0.5\Leftrightarrow\theta^Tx\geq0$
- $\text{Sig}(\theta^Tx)<0.5\Leftrightarrow\theta^Tx<0$
- *Note:* $\theta^Tx=0$ *is the decision boundary.*
- $\mathbb P(y|x)=\text{Sig}(y(\theta^Tx))$ for $y\in\{-1,+1\}$

### Optimization:
- Loss function: $\text{Loss}(z)=\log(1+e^{-z})$
- Point loss: $\mathcal L_1(\theta;x,y)=\log(1+e^{-y(\theta^Tx)})$
- Average loss:
$$
\begin{aligned}
\mathcal L_n(\theta;S_n)&=-\frac{1}{n}\log\prod_{(x,y)\in S_n}\mathbb P(y|x)\\
&= -\frac{1}{n}\sum_{(x,y)\in S_n}\log\frac{1}{1+e^{-y(\theta^Tx)}}\\
&=\frac{1}{n}\sum_{(x,y)\in S_n}\log(1+e^{-y(\theta^Tx)})
\end{aligned}
$$
- Point gradient:
$$
\begin{aligned}
\nabla\mathcal L_1(\theta;x,y)&=\frac{d}{d\theta}(\log(1+e^{-y(\theta^Tx)}))\\
&=\frac{-yxe^{-y(\theta^Tx)}}{1+e^{-y(\theta^Tx)}}\\
&=\frac{-yx}{1+e^{y(\theta^Tx)}}\\
&=-yx\text{Sig}(-y(\theta^Tx))\\
&=\left\{\begin{array}{ll}x(\text{Sig}(\theta^Tx)-1)&\text{if }y=+1\\x(\text{Sig}(\theta^Tx))&\text{if }y=-1\end{array}\right.\\
&=x(h(x;\theta)-\text{Ind}[y=1])
\end{aligned}
$$
- Average gradient: $\nabla\mathcal L_n(\theta;S_n)=\frac{1}{n}\sum_{(x,y)\in S_n}x(h(x;\theta)-\text{Ind}[y=1])$

### Solution

- Gradient descent: $\theta\leftarrow\theta-\frac{\eta_k}{n}\sum_{(x,y)\in S_n}x(h(x;\theta)-\text{Ind}[y=1])$
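
A minimal sketch of this gradient descent update, using the average gradient $\frac{1}{n}\sum x(h(x;\theta)-\text{Ind}[y=1])$ in vectorised form. The helper names, learning rate, iteration count, and toy data are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y):
    """Average loss: mean of log(1 + exp(-y * theta^T x)), with y in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * (X @ theta))))

def logistic_gd(X, y, eta=0.5, iters=500):
    n = len(X)
    theta = np.zeros(X.shape[1])
    target = (y == 1).astype(float)                       # Ind[y = 1]
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - target) / n    # average gradient
        theta = theta - eta * grad
    return theta

# Toy data; last column is the constant feature.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-3.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

theta = logistic_gd(X, y)
predictions = np.where(sigmoid(X @ theta) >= 0.5, 1, -1)
print(theta, logistic_loss(theta, X, y), predictions)
```

Because this toy data is linearly separable, the loss keeps decreasing as $\lVert\theta\rVert$ grows, so the iteration count acts as the stopping rule here.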

---

## K-Means

### Problem:

- Input: matrix
$$
x=\left(\begin{array}{cccc}x_{11}&x_{21}&...&x_{d1}\\x_{12}&x_{22}&...&x_{d2}\\...&...&...&...\\x_{1n}&x_{2n}&...&x_{dn}\end{array}\right)
$$
- Output: sets $C_1,C_2,...,C_k\subset\{1,2,...,n\}$
- $x$ can take any numerical value; the $C_j$ are mutually exclusive and together cover $\{1,2,...,n\}$

- $x\in\mathbb R^d, C_j\subset\{1,2,...,n\}\ \ \forall j$

### Optimization:

- Average loss: 
$\begin{aligned}\mathcal L_{n,k}(C_1,...,C_k;S_n)&=\sum_{j=1}^k\sum_{i\in C_j}\left\lVert x_i-\frac{1}{|C_j|}\sum_{i'\in C_j}x_{i'}\right\rVert^2\\
\mathcal L_{n,k}(C_1,...,C_k,z_1,...,z_k;S_n)&=\sum_{j=1}^k\sum_{i\in C_j}\left\lVert x_i-z_j\right\rVert^2
\end{aligned}$


### Solution:

**Coordinate Descent**
- Initialize centroids $z_1,z_2,...,z_k$
- For each $j\in\{1,2,...,k\}$,
    - $C_j=\{\text{all }i\text{ such that }x_i\text{ is closer to }z_j\text{ than to any other centroid}\}$
- For each cluster $C_j$,
    - $z_j\leftarrow\frac{1}{|C_j|}\sum_{i\in C_j}x_i$
- Repeat Step 2 and Step 3 until convergence.
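
The coordinate descent steps above can be sketched as follows. Initialising the centroids from randomly chosen data points, keeping the old centroid when a cluster becomes empty, and the iteration cap are all assumptions of this sketch; the notes do not specify them. The data is made up.

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Alternate between assigning points and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # assumed initialisation
    for _ in range(iters):
        # Step 2: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):              # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of three points each.
X = np.array([[ 0.0,  0.0], [ 0.0,  1.0], [ 1.0,  0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```

Each step can only lower (or keep) the loss, so the procedure converges, though in general only to a local minimum that depends on the initial centroids.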

---

# Part II: Lu Wei

## Overview

- Hidden Markov Model (HMM)
- Bayesian Network
- Reinforcement Learning (RL)

---

## Hidden Markov Model (HMM)

### Problem

- Input: vector $x:\{x_1,x_2,...,x_n\}$
- Output: vector $y:\{y_1,y_2,...,y_n\}$
- Where $x$ is the sequence of observations and $y$ is the sequence of hidden states
- $x,y$ are usually categorical, but can be numerical
- $x\in\mathbb R,y\in\mathbb R$

### Model

### Optimization

### Solution