### Linear regression
* How is the model defined?

$\widehat{y} = \theta_{0} + \theta_{1}x_{1} + \dots + \theta_{n}x_{n} = \theta^{T}x$, where $x_{0} = 1$

* How is the loss defined?

$MSE(X, \theta) = \frac{1}{m} \sum_{i=1}^{m} (y_{i} - \theta^{T}x_{i})^2$

* How is the model trained?

-> Normal equation

$\nabla_{\theta} MSE(\theta) = \frac{2}{m} X^T (X\theta - y) = 0 \Rightarrow \widehat{\theta} = (X^TX)^{-1}X^Ty$

    Computational cost grows quickly with the number of features n: inverting the (n+1) x (n+1) matrix X^T X takes roughly O(n^3) time with standard methods
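The normal equation can be tried out on synthetic data. The sketch below assumes NumPy and made-up data (`y = 4 + 3x` plus noise); none of the variable names come from the notes. It also compares the result with `np.linalg.lstsq`, which avoids forming the inverse explicitly.

```python
import numpy as np

# Hypothetical toy data: y = 4 + 3x plus a little Gaussian noise.
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=(m, 1))
y = 4.0 + 3.0 * x[:, 0] + rng.normal(0, 0.1, size=m)

# Add the bias column x0 = 1 so that theta[0] plays the role of theta_0.
X = np.c_[np.ones((m, 1)), x]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver, usually preferred numerically over an explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both estimates should land close to the generating values (4, 3) for this data.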

-> Gradient descent
    
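    Tweak the parameters iteratively until the cost function reaches its minimum.
    - Learning rate: too small and convergence is slow; too large and the updates may diverge
    - Feature scale: features on similar scales make convergence faster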
    
    Batch gradient descent -- uses all the training examples at every step

$\frac{\partial}{\partial\theta_{j}}MSE(X, \theta) = \frac{2}{m} \sum_{i=1}^{m} (\theta^{T}x_{i}-y_{i})x_{i,j}$

$\nabla_{\theta}MSE(X, \theta) = \frac{2}{m}X^T (X\theta-y)$ -- gradient vector of the cost function

    The gradient vector always points uphill, so the parameter update steps in the opposite (downhill) direction
    
$\theta \leftarrow \theta - \eta \nabla_{\theta} MSE(\theta)$
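A minimal sketch of batch gradient descent for the MSE cost, assuming NumPy. The function name, the default learning rate `eta=0.1`, the iteration count, and the toy data are illustrative choices, not values from the notes.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize MSE with full-batch gradient steps; X must include the bias column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of MSE: (2/m) X^T (X theta - y), computed on all m examples.
        gradient = (2.0 / m) * X.T @ (X @ theta - y)
        # Step downhill, against the gradient.
        theta = theta - eta * gradient
    return theta

# Hypothetical toy data: y = 4 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(100, 1))
y = 4.0 + 3.0 * x[:, 0] + rng.normal(0, 0.1, size=100)
X = np.c_[np.ones((100, 1)), x]

theta = batch_gradient_descent(X, y)
```

With a suitable learning rate the result should agree with the closed-form least-squares solution; a much larger `eta` can make the iterates diverge.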

### Regularized linear model

Ridge regression -- sensitive to feature scaling

$ J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \Sigma \theta_i^2$

Lasso regression

$ J(\theta) = MSE(\theta) + \alpha \Sigma |\theta_{i}|$

Elastic Net

$ J(\theta) = MSE(\theta) + r\alpha \Sigma {|\theta_{i}|} + \frac {1-r}{2} \alpha \Sigma{\theta_i}^2$ 
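The three penalties can be compared side by side. This is a sketch assuming NumPy; the function names are made up, and it assumes the bias term $\theta_0$ is left out of the penalty (the sums run over $i \ge 1$), which the formulas above do not state explicitly.

```python
import numpy as np

def mse(X, y, theta):
    residual = X @ theta - y
    return float(residual @ residual) / len(y)

def ridge_cost(X, y, theta, alpha):
    # L2 penalty: alpha * 1/2 * sum(theta_i^2), skipping the bias theta[0] (assumption).
    return mse(X, y, theta) + alpha * 0.5 * float(np.sum(theta[1:] ** 2))

def lasso_cost(X, y, theta, alpha):
    # L1 penalty: alpha * sum(|theta_i|), skipping the bias theta[0] (assumption).
    return mse(X, y, theta) + alpha * float(np.sum(np.abs(theta[1:])))

def elastic_net_cost(X, y, theta, alpha, r):
    # Mix of both penalties: r = 1 gives Lasso, r = 0 gives Ridge.
    l1 = float(np.sum(np.abs(theta[1:])))
    l2 = float(np.sum(theta[1:] ** 2))
    return mse(X, y, theta) + r * alpha * l1 + (1 - r) / 2 * alpha * l2

X = np.array([[1.0, 1.0], [1.0, 2.0]])  # first column is the bias feature
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])            # fits the toy data exactly, so MSE = 0
```

Setting `r` to 0 or 1 in `elastic_net_cost` recovers the Ridge and Lasso costs respectively.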

### Logistic regression

* How is the model defined?

Estimate the probability that an instance belongs to a particular class. Instead of outputting the linear regression result directly, the model outputs the logistic of that result

$\widehat p = \sigma (\theta^T x)$ 

where the logistic (sigmoid) function is $\sigma(t) = \frac{1}{1+\exp(-t)}$ (the logit is its inverse)

* How is the loss (cost function) defined?

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y_{i}\log(\widehat{p}_{i}) + (1-y_{i})\log(1-\widehat{p}_{i}) \right]$

* How is the model trained?

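There is no closed-form solution, but the cost function is convex, so gradient descent can find the global minimum. The partial derivatives are:

$\frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\sigma(\theta^{T}x_{i})-y_{i})x_{i,j}$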
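The model, loss, and gradient fit together as in the sketch below, which assumes NumPy and a tiny hand-made 1-D dataset. The function names, learning rate, and iteration count are illustrative, and the small `eps` inside the logs is an added numerical guard, not part of the formula.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(X, y, theta, eps=1e-12):
    # J(theta) = -1/m * sum[y log(p) + (1 - y) log(1 - p)]
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_logistic(X, y, eta=0.1, n_iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient: 1/m * X^T (sigmoid(X theta) - y)
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= eta * gradient
    return theta

# Hypothetical toy data: class 1 when the feature is positive.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
X = np.c_[np.ones(len(x)), x]  # bias column x0 = 1

theta = fit_logistic(X, y)
pred = (sigmoid(X @ theta) >= 0.5).astype(int)  # threshold the probability at 0.5
```

Thresholding $\widehat p$ at 0.5 is the usual way to turn the probability into a class label, though other thresholds are possible.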

### Softmax Regression

* How is the model defined?

For logistic regression, the weight matrix is n x 1, where n is the number of features + 1 (for the bias term). For softmax regression, the weight matrix is n x k, where k is the number of output classes

$s_{k}(x) = \theta^T_{k}x$

* How is the loss (cost function) defined?  -- crossentropy

$\widehat p_{k} = \frac{\exp(s_k(x))}{\sum_{j} \exp(s_j(x))}$

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \sum_{k} y^i_k \log(\widehat{p}^i_k)$

(cross entropy between p(x) and q(x): $H(p, q) = - \sum_{x} p(x)\log(q(x))$)

* How is the model trained?

$\nabla_{\theta_{k}} J(\theta) = \frac{1}{m}\sum_{i=1}^{m} (\widehat p^i_k - y^i_k)x^i$
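A sketch of softmax regression trained with batch gradient descent, assuming NumPy, one-hot targets, and a made-up 1-D dataset with three classes. Subtracting the row maximum before `exp` is a common numerical-stability trick rather than part of the formula, and the hyperparameters are illustrative.

```python
import numpy as np

def softmax(scores):
    # Shift each row by its max for stability; the probabilities are unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(P, Y, eps=1e-12):
    # J = -1/m * sum_i sum_k y_k^i log(p_k^i)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

def fit_softmax(X, Y, eta=0.5, n_iters=2000):
    m, n = X.shape
    k = Y.shape[1]
    Theta = np.zeros((n, k))  # one weight column theta_k per class
    for _ in range(n_iters):
        P = softmax(X @ Theta)
        # Column k of the gradient is 1/m * sum_i (p_k^i - y_k^i) x^i
        Theta -= eta * X.T @ (P - Y) / m
    return Theta

# Hypothetical toy data: three classes laid out along one feature.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
X = np.c_[np.ones(len(x)), x]  # bias column x0 = 1
Y = np.eye(3)[labels]          # one-hot targets, shape (m, k)

Theta = fit_softmax(X, Y)
P = softmax(X @ Theta)
pred = P.argmax(axis=1)        # predicted class = highest probability
```

The weight matrix here has shape (n, k), matching the n x k description above.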

### Support Vector Machine

Large margin classifier.

    -- Hard margin classifier or soft margin classifier
    -- Polynomial kernel methods (svm(kernel='poly', gamma=5))
    -- Similarity feature methods (svm(kernel='rbf', gamma=5))


* How is the model defined?
    
    $\widehat y = 0$ if $w^Tx+b < 0$
    
    $\widehat y = 1$ if $w^Tx+b \ge 0$

* How is the loss (cost function) defined?

    -- Hard margin

    minimize $\frac {1}{2}w^T w $
    
    subject to $t^{i}(w^Tx^i + b) \ge 1$, where $t^{i} \in \{-1, 1\}$
    
    -- Soft margin
    
    minimize $\frac {1}{2}w^T w + C \Sigma \zeta_i$
    
    subject to $t^{i}(w^Tx^i + b) \ge 1-\zeta_i$ and $\zeta_i \ge 0$, where $t^{i} \in \{-1, 1\}$
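To make the soft margin objective concrete, the sketch below evaluates it for a given `(w, b)`, assuming NumPy. It does not solve the QP. It relies on the fact that, for fixed `w` and `b`, the smallest slack satisfying the constraints is $\zeta_i = \max(0, 1 - t^i(w^Tx^i + b))$. The data, weights, and function names are made up, and the prediction rule assumes the usual convention that class 1 corresponds to $w^Tx + b \ge 0$.

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C):
    """Return (objective value, slack vector) for fixed w and b."""
    margins = t * (X @ w + b)                # t_i (w^T x_i + b)
    slack = np.maximum(0.0, 1.0 - margins)   # smallest feasible zeta_i
    return 0.5 * float(w @ w) + C * float(slack.sum()), slack

def predict(w, b, X):
    # Assumed convention: predict class 1 when w^T x + b >= 0, else class 0.
    return (X @ w + b >= 0).astype(int)

# Hypothetical data: the last point lies on the wrong side of the boundary.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [0.2, 0.1]])
t = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
b = 0.0

objective, slack = soft_margin_objective(w, b, X, t, C=1.0)
# slack is approximately [0, 0, 0, 1.15]; objective is approximately 0.25 + 1.15 = 1.4
```

A larger `C` penalizes margin violations more heavily, pushing the solution toward the hard margin case.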
    

* How is the model trained?
    
    -- Both the hard and soft margin problems are convex quadratic programs, so an off-the-shelf QP solver can be used

* Kernel trick



### Decision Trees

* How is the model defined?

* How is the loss (cost function) defined?

Gini impurity: $G_i = 1 - \sum_{k} p_{i,k}^2$, where $p_{i,k}$ is the fraction of instances at node i that belong to class k

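$J(k, t_k) = \frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right}$, the cost of splitting on feature k at threshold $t_k$, where $G_{left/right}$ and $m_{left/right}$ are the impurity and the number of instances of each subset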

Entropy can also be used to measure impurity

$H_i = -\sum_{k} p_{i,k}\log (p_{i,k})$; if the node is pure, $H_i = 0$
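Both impurity measures and the weighted split cost are short enough to compute by hand or in a few lines. The sketch below uses only the standard library; the function names are made up, and the base-2 logarithm is an assumption, since the formula above does not fix the base.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fraction p_k of each class present among the labels at a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    # G = 1 - sum_k p_k^2
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    # H = -sum_k p_k log2(p_k); classes with p_k = 0 never appear in the sum.
    return -sum(p * math.log2(p) for p in class_fractions(labels))

def split_cost(left, right, impurity=gini):
    # J = m_left/m * G_left + m_right/m * G_right
    m = len(left) + len(right)
    return len(left) / m * impurity(left) + len(right) / m * impurity(right)

mixed = ["a", "a", "b", "b"]   # evenly mixed node
pure = ["a", "a", "a"]         # pure node
```

For the evenly mixed two-class node the Gini impurity is 0.5 and the base-2 entropy is 1, while both measures are 0 for the pure node.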

* How is the model trained?
    - CART algorithm: builds binary trees, so every non-leaf node has exactly two children
    - ID3: nodes can have more than two children
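One greedy CART step can be sketched as an exhaustive search over features and thresholds for the split with the lowest weighted Gini impurity. The code below uses only the standard library; `best_split` is a hypothetical helper, and using midpoints between consecutive feature values as candidate thresholds is an assumption. A full tree would apply this step recursively to each subset until a stopping rule is met.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index k, threshold t_k, cost) of the lowest-cost binary split."""
    m = len(rows)
    best = None
    for k in range(len(rows[0])):
        values = sorted({row[k] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            t_k = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[k] <= t_k]
            right = [y for row, y in zip(rows, labels) if row[k] > t_k]
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if best is None or cost < best[2]:
                best = (k, t_k, cost)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = ["a", "a", "b", "b"]

split = best_split(rows, labels)
```

On this toy data the search picks feature 0 with threshold 2.5, which yields two pure children.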


### Adaboost and Gradient boost

AdaBoost: each new predictor pays more attention to the training instances that its predecessor underfitted

Gradient boosting: fit each new predictor to the residual errors made by the previous predictors
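The residual-fitting idea can be sketched with one-split regression stumps as weak learners. This assumes NumPy, a squared-error loss (for which the residuals are the negative gradient), and a 1-D feature; the stump learner, the shrinkage factor `learning_rate`, and the sine data are illustrative choices, not part of the notes.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression stump on a 1-D feature by minimizing squared error."""
    best = None
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2:  # midpoints as candidate thresholds
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def fit_gradient_boosting(x, y, n_estimators=20, learning_rate=0.5):
    base = y.mean()                       # start from a constant prediction
    prediction = np.full_like(y, base)
    stumps = []
    for _ in range(n_estimators):
        residual = y - prediction         # errors left by the ensemble so far
        stump = fit_stump(x, residual)    # new predictor is fit to the residuals
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict_gradient_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Hypothetical data: one period of a sine wave.
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)

model = fit_gradient_boosting(x, y)
mse_base = np.mean((y - y.mean()) ** 2)
mse_boost = np.mean((y - predict_gradient_boosting(model, x)) ** 2)
```

Each added stump can only lower the training error here, so `mse_boost` ends up below the error of the constant baseline; held-out error is a separate question and depends on `n_estimators` and `learning_rate`.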

* How is the model defined?

* How is the loss (cost function) defined?

* How is the model trained?

### 

* How is the model defined?

* How is the loss (cost function) defined?

* How is the model trained?

###

* How is the model defined?

* How is the loss (cost function) defined?

* How is the model trained? -- crossentropy

### Multilayer perceptron

* How is the model defined?

* How is the loss (cost function) defined?

* How is the model trained? -- crossentropy