# Notes and Equations in Machine Learning
Linjian Li <br/>
Dec. 27th, 2018

---
## Linear Regression

### Closed-form Solution
* $ L(w) = \frac{1}{2} ||y-Xw||_2^2 $
* $ w^* = argmin_w L(w) = (X^T X)^{-1} X^T y $
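
The closed-form solution above can be sketched in plain Python by solving the normal equations $ X^T X w = X^T y $ instead of forming the inverse explicitly. This is a minimal illustration on toy data, not a production solver; the helper names `solve_linear_system` and `fit_least_squares` are hypothetical, and $ X^T X $ is assumed to be invertible.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Normalize the pivot row, then clear this column in all other rows.
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def fit_least_squares(X, y):
    """Return w* = (X^T X)^{-1} X^T y via the normal equations."""
    d = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(d)] for j in range(d)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(d)]
    return solve_linear_system(XtX, Xty)


# Toy data generated from y = 1 + 2x; the first column of X is a bias feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit_least_squares(X, y)
print(w)
```

Because the toy targets lie exactly on a line, the recovered weights match the intercept and slope used to generate them, up to floating-point error.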


---
## Linear Classification

### SVM
* $ y_i \in \{-1,+1\} $
* $ min_{w,b} \frac{||w||^2}{2} $ <br/>
    s.t. $ y_i (w^T x_i + b) \geq 1 $
* Soft Margin
    * $ min_{w,b} \frac{||w||^2}{2} + \frac{C}{n} \ \sum_{i=1}^{n} \xi_i $ <br/>
        s.t. $ y_i (w^T x_i + b) \geq 1- \xi_i, \ \xi_i \geq 0 $
    * hinge loss: $ \xi_i = max \big( 0,1-y_i (w^T x_i + b) \big) $
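
As a small illustration of the soft-margin objective, the sketch below evaluates the hinge loss and the full objective for a fixed $ (w, b) $. The function names and toy data are assumptions made for this example; it only evaluates the objective and does not train an SVM.

```python
def hinge_loss(w, b, x, y):
    """xi = max(0, 1 - y (w^T x + b)) for one sample with y in {-1, +1}."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, X, Y, C):
    """||w||^2 / 2 + (C / n) * sum_i xi_i, with xi_i given by the hinge loss."""
    n = len(X)
    reg = 0.5 * sum(wj * wj for wj in w)
    slack = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))
    return reg + C / n * slack


w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 1.0], [-1.0, 3.0]]
Y = [+1, +1, -1]
losses = [hinge_loss(w, b, x, y) for x, y in zip(X, Y)]
objective = soft_margin_objective(w, b, X, Y, C=3.0)
print(losses, objective)
```

Only the second sample sits inside the margin (its margin is 0.5), so it is the only one with a non-zero slack.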


---
## Logistic Regression
* sigmoid function: $ g(z) = \big( 1+exp(-z) \big)^{-1} $
    * $ g'(z) = g(z) \cdot \big( 1-g(z) \big) $
* $ h_w(x) = g(w^T x) = P(y = 1 | x) $
* $ J(w) = - \frac{1}{n} \bigg[ \sum_{i=1}^{n} y_i log \big( h_w(x_i) \big) + (1-y_i) log \big( 1-h_w(x_i) \big) \bigg] $ <br/>
    assume $ y \in \{0,1\} $
* Gradient
    * For one sample: $ \frac{\partial J(w)}{\partial w} = \big( h_w(x)-y \big) x $
    * For all samples: $ \frac{\partial J(w)}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \big( h_w(x_i)-y_i \big) x_i $
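
The cost and gradient above can be sketched as batch gradient descent in plain Python. The learning rate, iteration count, and toy data are arbitrary choices for illustration, and the first feature is assumed to be a constant bias term.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def h(w, x):
    """h_w(x) = g(w^T x), the predicted probability that y = 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def cost(w, X, Y):
    """Cross-entropy J(w) for labels y in {0, 1}."""
    total = 0.0
    for x, y in zip(X, Y):
        p = h(w, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(X)


def gradient(w, X, Y):
    """(1/n) * sum_i (h_w(x_i) - y_i) x_i."""
    n = len(X)
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = h(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return grad


# Toy data: the first column is a bias feature, labels are in {0, 1}.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 1]
w = [0.0, 0.0]
initial_gradient = gradient(w, X, Y)
initial_cost = cost(w, X, Y)
for _ in range(100):
    g = gradient(w, X, Y)
    w = [wj - 0.5 * gj for wj, gj in zip(w, g)]
final_cost = cost(w, X, Y)
print(initial_cost, final_cost)
```

At $ w = 0 $ every prediction is 0.5, so the initial cost equals $ log(2) $; the cost then decreases as the weights are updated.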

---
## Softmax Regression
* $ P(y_i = j | x_i) = \frac{exp(w_j^T x_i)}{\sum_{l=1}^{K} exp(w_l^T x_i)} $
* $ J(w) = - \frac{1}{n} \bigg[ \sum_{i=1}^{n} \sum_{j=1}^{K} I(y_i = j) log \frac{exp(w_j^T x_i)}{\sum_{l=1}^{K} exp(w_l^T x_i)} \bigg] $
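* $ \frac{\partial J(w)}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} \bigg[ \big( P(y_i = j | x_i ; w) - I(y_i = j) \big) x_i \bigg] + \lambda w_j $ <br/>
    the $ \lambda w_j $ term appears when an L2 penalty $ \frac{\lambda}{2} \sum_{j=1}^{K} ||w_j||_2^2 $ is added to $ J(w) $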
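
A minimal sketch of the softmax probabilities and the gradient above. Two assumptions are made for the code: classes are indexed from 0 rather than 1, and the maximum score is subtracted before exponentiating, which is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import math


def softmax_probs(W, x):
    """P(y = j | x) for each class j; W is a list of K weight vectors."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for w in W]
    m = max(scores)  # shifting all scores by a constant does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gradient(W, X, Y, lam=0.0):
    """dJ/dw_j = (1/n) sum_i (P(y_i = j | x_i) - I(y_i = j)) x_i + lam * w_j."""
    n, K, d = len(X), len(W), len(X[0])
    grad = [[lam * W[j][k] for k in range(d)] for j in range(K)]
    for x, y in zip(X, Y):
        p = softmax_probs(W, x)
        for j in range(K):
            coeff = (p[j] - (1.0 if y == j else 0.0)) / n
            for k in range(d):
                grad[j][k] += coeff * x[k]
    return grad


# Three classes, two features, all weights zero: every class gets probability 1/3.
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [0, 2]
probs = softmax_probs(W, X[0])
grad = gradient(W, X, Y)
print(probs, grad)
```

With `lam=0.0` the per-class gradients sum to zero in every feature, which is a quick sanity check for an implementation.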

---
## Ensemble Learning

### Decision Tree
* $ Entropy(D) = \sum_{i=1}^{c} -p_i log_2 p_i $
* $ Gain(D,A) = Entropy(D)-\sum_{v \in Values(A)} \frac{|D_v|}{|D|} Entropy(D_v) $
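
The two formulas above can be checked on a tiny dataset. In this sketch the attribute values and labels are made up for illustration, and `information_gain` takes the attribute column and the label column as parallel lists.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(D) = sum_i -p_i log2 p_i over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Gain(D, A), where values[i] is the value of attribute A for sample i."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rain", "rain"]  # splits the labels perfectly
windy = ["t", "f", "t", "f"]                  # tells us nothing about the label
gain_outlook = information_gain(outlook, labels)
gain_windy = information_gain(windy, labels)
print(gain_outlook, gain_windy)
```

A split that separates the classes perfectly recovers the full entropy of the dataset as gain, while an uninformative split yields zero gain.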

### Adaboost
* $ w_{m+1}(i) = \frac{w_{m}(i)}{z_m} exp\big(-\alpha_m y_i h_m(x_i)\big) $
* $ z_m = \sum_{i=1}^{n} w_m(i) exp\big(-\alpha_m y_i h_m(x_i)\big) $
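* $ \alpha_m = \frac{1}{2} log \frac{1-\epsilon_m}{\epsilon_m} $, where $ \epsilon_m = \sum_{i=1}^{n} w_m(i) I\big(h_m(x_i) \neq y_i\big) $ is the weighted error of $ h_m $
* $ H(x) = \sum_{m=1}^{M} \alpha_m h_m(x) $, and the predicted label is $ sign\big(H(x)\big) $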
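
One boosting round can be sketched as follows, taking $ \epsilon_m $ to be the weighted error of $ h_m $ and $ z_m $ to be the constant that makes the new weights sum to 1. The function name and the four-sample example are assumptions for illustration; the code assumes $ 0 < \epsilon_m < 1 $.

```python
import math


def adaboost_round(weights, y, h):
    """One AdaBoost round; y[i] and h[i] are in {-1, +1}, weights sum to 1."""
    # Weighted error of the base learner h_m.
    eps = sum(w for w, yi, hi in zip(weights, y, h) if yi != hi)
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Unnormalized update: misclassified samples are scaled up, correct ones down.
    unnorm = [w * math.exp(-alpha * yi * hi) for w, yi, hi in zip(weights, y, h)]
    z = sum(unnorm)  # normalization constant z_m
    return alpha, [u / z for u in unnorm]


weights = [0.25, 0.25, 0.25, 0.25]
y = [+1, +1, -1, -1]
h = [+1, +1, -1, +1]  # the last sample is misclassified
alpha, new_weights = adaboost_round(weights, y, h)
print(alpha, new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next base learner is pushed to get it right.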


---
## Clustering

### K-means
* $ L(r,\mu) = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} ||x_i - \mu_k||^2 $
* Expectation-Maximization (EM)
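    * E (assignment): $ r_{ik} = 1 $ if $ k = argmin_j ||x_i - \mu_j||^2 $, otherwise $ r_{ik} = 0 $
    * M (update): $ \mu_k = \frac{1}{n_k} \sum_{i=1}^{n} r_{ik}x_i $, where $ n_k = \sum_{i=1}^{n} r_{ik} $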
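
The alternation between assigning points to their nearest center and recomputing each center as the mean of its points can be sketched as below. The initial centers are passed in explicitly and the iteration count is fixed; both are simplifications for this example, and an empty cluster simply keeps its previous center.

```python
def squared_distance(a, b):
    """||a - b||^2 for two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, centers, iterations=10):
    """Alternate the assignment step and the mean-update step."""
    assign = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
            for x in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assign) if a == k]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        centers = new_centers
    return centers, assign


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, assign = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(centers, assign)
```

On this toy data the two groups are far apart, so the procedure converges after the first iteration.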

---
## Principal Component Analysis (PCA)

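Based on Mingkui Tan's slides “PCA.pptx”.

Starting from a vector $x$, find a matrix $W$ such that the new vector $z$ satisfies $z = Wx$. This amounts to rotating the coordinate system; $z$ and $x$ have the same dimension.

The dimensions of $z$ are ordered by decreasing importance: $z_1$ is the most important dimension, $z_2$ the second most important, $z_3$ the third, and so on.

$w^1$, the first row of $W$, projects $x$ onto $z_1$, the first dimension of $z$. The other $w^i$ work the same way.

Suppose for now that we only care about ***the most important dimension, $z_1$***.

For any $x$, projecting it onto this dimension gives $z_1 = w^1 x$. **How do we find this $w^1$?**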

We want the variance of $z_1$ to be as large as possible. So, 
$$w^1 = argmax_{w^1} \: Var(z_1), \quad Var(z_1) = \frac{1}{N} \sum_{z_1}(z_1-\bar{z_1})^2$$
subject to $||w^1||_2 = 1$

So what is $Var(z_1)$?

See slide 10 of the review slides, which derives $Var(z_1) = (w^1)^T Cov(x) (w^1)$ (with $w^1$ written as a column vector). Denote $Cov(x)$ by $S$.
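
For reference, one way to reach this identity is sketched below. It is not copied from the slides; it assumes $w^1$ is treated as a column vector, so that $z_1 = (w^1)^T x$ and $\bar{z_1} = (w^1)^T \bar{x}$, and that $Cov(x)$ uses the $\frac{1}{N}$ normalization.

$$
\begin{aligned}
Var(z_1) &= \frac{1}{N} \sum_{x} \big( (w^1)^T x - (w^1)^T \bar{x} \big)^2 \\
&= \frac{1}{N} \sum_{x} (w^1)^T (x-\bar{x})(x-\bar{x})^T w^1 \\
&= (w^1)^T \Big[ \frac{1}{N} \sum_{x} (x-\bar{x})(x-\bar{x})^T \Big] w^1 \\
&= (w^1)^T Cov(x) \, w^1
\end{aligned}
$$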

So, 
$$w^1 = argmax \big((w^1)^T S (w^1)\big)$$
subject to $||w^1||_2 = 1$

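A constrained optimization problem like this can be solved with the method of Lagrange multipliers. The $\alpha$ on slide 11 is the Lagrange multiplier.
Setting the derivative of the Lagrangian with respect to $w^1$ to zero gives $Sw^1 - \alpha w^1 = 0$, which is the equation shown in red on the slide, $Sw^1 = \alpha w^1$. Left-multiplying both sides by $(w^1)^T$ and using $||w^1||_2 = 1$ gives $(w^1)^T S (w^1) = \alpha$. Combined with the objective above,
$$ w^1 = argmax(\alpha) $$
so we choose the largest possible value of $\alpha$.
Hence $w^1$ is the eigenvector of $S$ corresponding to its largest eigenvalue $\lambda_1$.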

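Likewise, $w^2$ is also an eigenvector of $S$, the one corresponding to the second-largest eigenvalue $\lambda_2$. In addition, $w^1$ and $w^2$ must be orthogonal: $(w^1)^T w^2 = 0$.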

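The eigenvalues of a real symmetric matrix are always real. A symmetric n×n real matrix M is said to be positive definite if the scalar $z^TMz$ is positive for every non-zero column vector z of n real numbers, and positive semidefinite if $z^TMz \geq 0$ for every z. A covariance matrix such as $S$ is symmetric and positive semidefinite, so its eigenvalues are real and non-negative.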

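As for how the eigenvalues and eigenvectors are actually computed:

***I forgot.***

However, WQY said the exam only covers concepts, so the computation may not be needed.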
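
One standard numerical way to obtain the leading eigenvector is power iteration: repeatedly multiply a vector by $S$ and renormalize. The sketch below is not taken from the slides. It assumes the starting vector is not orthogonal to the leading eigenvector and that $S$ is not the zero matrix, and it uses the $\frac{1}{N}$ covariance normalization.

```python
import math


def covariance_matrix(data):
    """S = (1/N) * sum (x - mean)(x - mean)^T for rows x of data."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(d)] for a in range(d)]


def power_iteration(S, steps=200):
    """Approximate the largest eigenvalue of S and its unit eigenvector."""
    d = len(S)
    # Arbitrary start; assumed not orthogonal to the leading eigenvector.
    w = [1.0 / (j + 1) for j in range(d)]
    for _ in range(steps):
        v = [sum(S[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    # Rayleigh quotient w^T S w; w already has unit length.
    eigenvalue = sum(w[a] * S[a][b] * w[b] for a in range(d) for b in range(d))
    return eigenvalue, w


# Points lying exactly on the line x2 = x1, so all variance is along (1, 1).
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
S = covariance_matrix(data)
lambda_1, w1 = power_iteration(S)
print(lambda_1, w1)
```

Here the first principal direction points along the diagonal, and the returned eigenvalue equals the variance of the projected data, consistent with $(w^1)^T S (w^1) = \alpha$.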

---
## Recommender System

### Model-based Collaborative Filtering

m: Number of users

n: Number of items

k: Number of features

R is an m-by-n rating matrix, P is an m-by-k user matrix, and Q is a k-by-n item matrix, with $R \approx PQ$.

Suppose each $p_u$ and $q_i$ is a k-by-1 matrix, i.e. a k-dimensional **column** vector.

$P = [p_1, p_2, ..., p_m]^T$

$Q = [q_1, q_2, ..., q_n]$

Prediction of one element: $\hat{R_{ui}} = P_{u\cdot} Q_{\cdot i} = (p_u)^T q_i$

Loss for one element (squared error loss): $L(R_{ui},\hat{R_{ui}}) = (R_{ui}-\hat{R_{ui}})^2 = (R_{ui}-(p_u)^T q_i)^2$

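Loss for the whole P and Q is the sum of the losses over the observed elements plus a regularization term: $L = \sum_{u,i}(R_{ui}-(p_u)^T q_i)^2 + \lambda ( \sum_u n_{p_u} ||p_u||_2^2 + \sum_i n_{q_i} ||q_i||_2^2 )$, where $n_{p_u}$ is the number of items rated by user $u$ and $n_{q_i}$ is the number of users who rated item $i$.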

### Alternating Least Squares (ALS)
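Fixing Q, optimize P, update each $p_u$ as: $p_u \gets (Q Q^T + \lambda n_{p_u}I)^{-1} Q R_{u \cdot}^T$

Fixing P, optimize Q, update each $q_i$ as: $q_i \gets (P^T P + \lambda n_{q_i}I)^{-1} P^T R_{\cdot i}$

Here $Q Q^T = \sum_i q_i q_i^T$ and $P^T P = \sum_u p_u p_u^T$; when some ratings are missing, the sums run only over the observed entries of the corresponding row or column of R.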
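
A single ALS update for one user can be sketched as solving a k-by-k linear system built from the item vectors that user has rated. The helper names are hypothetical, and the sketch assumes the regularized system is non-singular. The item update is the same computation with the roles of users and items swapped.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def update_user(item_vectors, ratings, lam):
    """Solve (sum_i q_i q_i^T + lam * n_u * I) p_u = sum_i R_ui q_i.

    item_vectors holds q_i for the items this user rated, ratings the matching R_ui.
    """
    k = len(item_vectors[0])
    n_u = len(ratings)  # number of items rated by this user
    A = [
        [sum(q[a] * q[b] for q in item_vectors) + (lam * n_u if a == b else 0.0)
         for b in range(k)]
        for a in range(k)
    ]
    rhs = [sum(r * q[a] for q, r in zip(item_vectors, ratings)) for a in range(k)]
    return solve_linear_system(A, rhs)


item_vectors = [[1.0, 0.0], [0.0, 1.0]]
ratings = [4.0, 2.0]
p_unregularized = update_user(item_vectors, ratings, lam=0.0)
p_regularized = update_user(item_vectors, ratings, lam=0.5)
print(p_unregularized, p_regularized)
```

With orthonormal item vectors the unregularized solution reproduces the ratings exactly, and the regularized one is shrunk toward zero.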

ALS is **NOT** scalable to large-scale datasets, whereas SGD is.

### Stochastic Gradient Descent (SGD)
SGD chooses the loss function as: $L = \sum_{u,i} \big[ (R_{ui}-(p_u)^T q_i)^2 + \lambda_p ||p_u||_2^2 + \lambda_q ||q_i||_2^2 \big]$. For a single sampled rating $R_{ui}$, with $L_{ui}$ denoting its term in the sum:
* $ E_{ui} = R_{ui} - (p_u)^T q_i $
* $ \frac{\partial L_{ui}}{\partial p_u} = 2 \big( E_{ui}(-q_i) + \lambda_p p_u \big) $
* $ \frac{\partial L_{ui}}{\partial q_i} = 2 \big( E_{ui}(-p_u) + \lambda_q q_i \big) $
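
A single SGD step on one observed rating can be sketched as follows. The learning rate `lr` and the function name are assumptions for this example. The factor 2 from differentiating the squared error is kept explicit; dropping it only rescales the learning rate.

```python
def sgd_step(p_u, q_i, r_ui, lr, lam_p, lam_q):
    """Update p_u and q_i from one observed rating R_ui."""
    err = r_ui - sum(pk * qk for pk, qk in zip(p_u, q_i))  # E_ui
    # Gradients of (R_ui - p_u^T q_i)^2 + lam_p ||p_u||^2 + lam_q ||q_i||^2.
    grad_p = [-2.0 * err * qk + 2.0 * lam_p * pk for pk, qk in zip(p_u, q_i)]
    grad_q = [-2.0 * err * pk + 2.0 * lam_q * qk for pk, qk in zip(p_u, q_i)]
    # Both gradients use the old p_u and q_i, then both vectors move downhill.
    new_p = [pk - lr * g for pk, g in zip(p_u, grad_p)]
    new_q = [qk - lr * g for qk, g in zip(q_i, grad_q)]
    return new_p, new_q


p_u, q_i, r_ui = [1.0, 0.0], [1.0, 1.0], 3.0
new_p, new_q = sgd_step(p_u, q_i, r_ui, lr=0.1, lam_p=0.0, lam_q=0.0)
old_error = abs(r_ui - sum(a * b for a, b in zip(p_u, q_i)))
new_error = abs(r_ui - sum(a * b for a, b in zip(new_p, new_q)))
print(new_p, new_q, old_error, new_error)
```

Each step touches only one user vector and one item vector, so the cost per update does not depend on the size of R.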