---
## Why is minimizing the cost function equivalent to maximizing the likelihood of our model, regardless of the value of $\sigma$?

### Some background knowledge:
- The cost function of a general linear regression:
$$ J(\theta) = J({\theta}_0,...{\theta}_n) = \frac{1}{2}\sum_{i=1}^{n}(y_i-{\theta}^{T}h(x_i))^2 $$
- The probability density function of the Gaussian distribution:
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-{\frac{1}{2}}(\frac{x-\mu}{\sigma})^2} $$
- The Maximum Likelihood Estimation:
  - A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
  - The idea is to turn the problem of estimating the density directly into the problem of solving for the parameters that maximize the likelihood function. For independent and identically distributed random variables, the likelihood function is the product of the univariate density functions, which contain the unknown parameters.
  - The general likelihood function:
$$ L(\theta) = f(x_1,...,x_n|\theta) = \prod_{i=1}^n f(x_i|\theta) $$
$$ \hat{\theta} = \underset{\theta}{\textrm{arg max}} \; \log L(\theta) = \underset{\theta}{\textrm{arg max}} \sum_{i=1}^n \log f(x_i|\theta) $$
, where
$$ \theta = \{\theta_1, \theta_2, \ldots, \theta_k\} $$
To solve for $\theta$, we take the partial derivative of the log-likelihood with respect to each parameter, set it equal to 0, and solve for the parameters:
$$ \frac{\partial}{\partial {\theta_j}} \log L(\theta) = 0, \quad j = 1, \ldots, k $$
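  - ```Why do we use 'log' (usually the natural log)?```
  ![alt text](https://img-blog.csdn.net/2018041616140618)
When the base $a$ of $\log_a(\cdot)$ is greater than 1, the logarithm is strictly increasing, so $L(\theta)$ and $\log L(\theta)$ reach their maximum at the same $\theta$. The log also turns the product of densities into a sum, which is much easier to differentiate. The maximum is then found where the slope $k$ of $\log L(\theta)$ equals 0.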
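
There is also a practical, numerical reason for working with the log. The sketch below is a minimal illustration using only the Python standard library; the sample (2000 draws from $N(0, 1)$), the seed, and the helper name `gaussian_pdf` are assumptions made for this example. It compares the raw product of densities with the sum of log densities.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


random.seed(0)
# Hypothetical i.i.d. sample, assumed for illustration only.
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Likelihood as a raw product: every factor is below 0.4,
# so the product of 2000 factors underflows in double precision.
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Log-likelihood as a sum of logs: stays a finite negative number.
log_likelihood = sum(math.log(gaussian_pdf(x, 0.0, 1.0)) for x in data)

print("product of densities:", likelihood)
print("sum of log densities:", log_likelihood)
```

The product is reported as `0.0`, which carries no information about which parameters are better, while the sum of logs remains usable for comparison and optimization.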

- Now we assume that $y_i = {\theta}^{T}h(x_i) + \epsilon_i$, where the $\epsilon_i \sim N(0, {\sigma}^2)$ are independent; this implies that $y_i|x_i \sim N({\theta}^{T}h(x_i), {\sigma}^2)$.
- The likelihood function (for all data) can be expressed as:
  $$ L(\theta) = f(y_1,...,y_n|x_1,...,x_n;\theta) = \prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} e^{-{\frac{1}{2}}(\frac{y_i-{\theta}^{T}h(x_i)}{\sigma})^2} $$
  Here, we want to solve for the parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$ that make the observed data most probable under this model (distribution):
  $$ \log L(\theta) = \log\left(\prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} e^{-{\frac{1}{2}}(\frac{y_i-{\theta}^{T}h(x_i)}{\sigma})^2}\right) $$
- Traditionally, to maximize the likelihood function, we set the slope $k$ of $\log L(\theta)$ equal to 0, that is, we set the partial derivatives of $\log L(\theta)$ with respect to $\theta$ equal to 0.
- Here, we can maximize the likelihood function by minimizing its negative log, $-\log L(\theta)$:
  $$ -\log L(\theta) = -\log\left(\prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} e^{-{\frac{1}{2}}(\frac{y_i-{\theta}^{T}h(x_i)}{\sigma})^2}\right) $$
  $$ = \sum_{i=1}^n \frac{1}{2}\left(\frac{y_i-{\theta}^{T}h(x_i)}{\sigma}\right)^2 + \sum_{i=1}^n \log(\sigma \sqrt{2\pi}) $$
  $$ = \sum_{i=1}^n \frac{({{y_i}-{\theta}^{T}h(x_i)})^2}{2{\sigma}^2} + \sum_{i=1}^n \log(\sigma \sqrt{2\pi}) $$
- As $\frac{1}{{\sigma}^2}$ and $\sum_{i=1}^n \log(\sigma \sqrt{2\pi})$ do not depend on $\theta$ (for any fixed $\sigma > 0$, the first is a positive constant factor and the second is an additive constant), to maximize the likelihood function over $\theta$ we only need to minimize
  $$\sum_{i=1}^n \frac{1}{2}({{y_i}-{\theta}^{T}h(x_i)})^2$$
, which is the same as the cost function, $\frac{1}{2}\sum_{i=1}^{n}(y_i-{\theta}^{T}h(x_i))^2$.
- Therefore, we can say that minimizing the cost function is equivalent to maximizing the likelihood of our model, regardless of the value of $\sigma$.
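
The conclusion can be checked numerically. The following is a minimal sketch, assuming a one-parameter model $y_i = \theta x_i + \epsilon_i$ with synthetic data (true $\theta = 2$, noise standard deviation 1) and a plain grid search over $\theta$. The function names, the grid, and the data are illustrative assumptions, not part of the derivation. For several values of $\sigma$, the $\theta$ that minimizes the negative log-likelihood is compared with the $\theta$ that minimizes the cost function.

```python
import math
import random


def cost(theta, xs, ys):
    """Least-squares cost: 1/2 * sum of squared residuals."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


def neg_log_likelihood(theta, xs, ys, sigma):
    """-log L(theta) for y_i ~ N(theta * x_i, sigma^2)."""
    n = len(xs)
    squared = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return squared / (2.0 * sigma ** 2) + n * math.log(sigma * math.sqrt(2.0 * math.pi))


random.seed(1)
# Synthetic data, assumed for illustration only.
xs = [i / 10.0 for i in range(1, 51)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

# Candidate values of theta: 1.000, 1.001, ..., 3.000.
grid = [i / 1000.0 for i in range(1000, 3001)]

theta_cost = min(grid, key=lambda t: cost(t, xs, ys))
theta_nll = {
    sigma: min(grid, key=lambda t: neg_log_likelihood(t, xs, ys, sigma))
    for sigma in (0.5, 1.0, 3.0)
}

# Closed-form least-squares solution for this one-parameter model.
theta_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("argmin of cost:", theta_cost)
print("argmin of -log L for each sigma:", theta_nll)
print("closed-form least squares:", theta_closed)
```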
---

---
## Normal Equation:
- ```The Formula:``` 

A mathematical equation that gives the result directly, the Normal Equation:
$$ \hat{\theta} = ({X^{T}X})^{-1} X^{T}y $$
where $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function and $y$ is the vector of target values.
- ```Why does the Normal Equation give us the result?```

The idea is to minimize the cost function $J(\theta)$, where $X$ is the matrix whose $i$-th row is $x_i^{T}$ and $y$ is the vector of target values:
  $$ J(\theta) = J({\theta}_0,...,{\theta}_n) = \frac{1}{2m}\sum_{i=1}^{m}({\theta}^{T} x_i-y_i)^2 $$
  $$ = \frac{1}{2m}(X\theta-y)^{T}(X\theta-y) $$
  $$ = \frac{1}{2m}({\theta}^{T}{X}^{T}-{y}^{T})(X\theta-y) $$
  $$ = \frac{1}{2m}({\theta}^{T}{X}^{T}{X}{\theta}-(X\theta)^Ty-{y}^{T}X{\theta}+{y}^{T}y) $$
  $$ = \frac{1}{2m}({\theta}^{T}{X}^{T}{X}{\theta}-2(X\theta)^Ty+{y}^{T}y) $$
  The last step uses the fact that $(X\theta)^{T}y$ is a scalar, so it equals its transpose ${y}^{T}X\theta$. To minimize the cost function, we take the derivative with respect to $\theta$; the term ${y}^{T}y$ does not depend on $\theta$ and drops out:
  $$ \frac{\partial}{\partial \theta}J(\theta)=\frac{1}{2m}\frac{\partial}{\partial \theta}({\theta}^{T}{X}^{T}{X}{\theta}-2(X\theta)^Ty+{y}^{T}y) $$
  $$ = \frac{1}{2m}\frac{\partial}{\partial \theta}({\theta}^{T}{X}^{T}{X}{\theta}-2{y}^{T}X\theta) $$
  $$ = \frac{1}{2m}(X^{T}X\theta+X^{T}X\theta-2X^{T}y) $$
  $$ = \frac{1}{2m}(2X^{T}X\theta-2X^{T}y) $$
  Setting the derivative to 0 gives:
  $$ X^{T}X\theta = X^{T}y $$
  Multiplying both sides on the left by $(X^{T}X)^{-1}$ (assuming $X^{T}X$ is invertible):
  $$ \theta = (X^{T}X)^{-1}X^{T}y $$
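
As a small worked instance of the formula, here is a minimal sketch for a straight-line fit, assuming each row of $X$ is $[1, x_i]$ so that $X^{T}X$ is a $2 \times 2$ matrix that can be inverted by hand. The function name `normal_equation_line` and the toy data are hypothetical, chosen for illustration; only the Python standard library is used.

```python
def normal_equation_line(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y for X with rows [1, x_i]."""
    m = len(xs)
    # Entries of X^T X = [[m, s_x], [s_x, s_xx]] and X^T y = [s_y, s_xy].
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))

    det = m * s_xx - s_x * s_x
    if det == 0:
        raise ValueError("X^T X is singular, so the normal equation has no unique solution")

    # (X^T X)^{-1} = (1 / det) * [[s_xx, -s_x], [-s_x, m]], applied to X^T y.
    theta0 = (s_xx * s_y - s_x * s_xy) / det
    theta1 = (m * s_xy - s_x * s_y) / det
    return theta0, theta1


# Toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = normal_equation_line(xs, ys)
print(theta0, theta1)  # prints: 1.0 2.0
```

For more than a couple of parameters, forming the inverse explicitly is usually avoided; a linear algebra routine such as `numpy.linalg.solve` or `numpy.linalg.lstsq` is the more common choice.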

References:
- [Machine Learning: Derivation of the Normal Equation](https://blog.csdn.net/Mao_Jonah/article/details/82119408);
- [Derivatives of vectors and scalars with respect to a vector](https://blog.csdn.net/xidianliutingting/article/details/51673207)
---

---
## Optimization
- A main idea in machine learning is to convert the learning problem into a continuous optimization problem.
- Examples: maximizing the likelihood, minimizing a cost function.

### Batch Gradient Descent
- Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (with an appropriate learning rate):
  - The MSE cost function for a Linear Regression model happens to be a convex function: there are no local minima, just one global minimum.
  - It is also a continuous function with a slope that never changes abruptly.
- Batch Gradient Descent uses the whole batch of training data at every step. As a result, it is terribly slow on very large training sets.
  - Partial derivatives of the cost function:
$$ J(\theta) = J({\theta}_0,...,{\theta}_n) = \frac{1}{2m}\sum_{i=1}^{m}({\theta}^{T}x^{(i)}-y^{(i)})^2 $$
$$ \frac{\partial}{\partial {\theta}_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}({\theta}^{T}x^{(i)}-y^{(i)})x_{j}^{(i)} $$
    Stacking the partial derivatives for all $j$ gives the gradient vector, where $X$ is the matrix whose $i$-th row is $(x^{(i)})^{T}$ and $y$ is the vector of target values:
$$ \nabla_{\theta}J(\theta) = \frac{1}{m}(X^{T}X\theta-X^{T}y) $$
$$ = \frac{1}{m}X^{T}(X\theta-y) $$
  - Once we have the gradient vector, which points uphill, just go in the opposite direction to go downhill:
$$ {\theta}_j := {\theta}_j - \alpha\frac{\partial}{\partial {\theta}_j}J(\theta) $$
- How to select an appropriate learning rate:
  - We can try a value of $\alpha$ and check whether the updated $\theta$ decreases the cost function. If the cost increases instead, $\alpha$ is too large.
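
The update rule and the learning-rate check can be sketched as follows. This is a minimal illustration for a straight-line model $\theta_0 + \theta_1 x$ using only the Python standard library; the function names, the toy data, the starting point $\theta = (0, 0)$, and the two values of $\alpha$ are assumptions made for the example.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum of squared residuals for the line theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def batch_gradient_descent(xs, ys, alpha, n_steps):
    """Run n_steps of batch gradient descent starting from theta = (0, 0).

    Returns the final parameters and the cost recorded after every step.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(n_steps):
        # Residuals theta^T x^(i) - y^(i), computed on the whole training set.
        residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives; x_0 = 1 for the intercept term.
        grad0 = sum(residuals) / m
        grad1 = sum(r * x for r, x in zip(residuals, xs)) / m
        # Step against the gradient, updating both parameters together.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return theta0, theta1, history


# Toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1, good_history = batch_gradient_descent(xs, ys, alpha=0.05, n_steps=5000)
print("alpha = 0.05:", round(theta0, 6), round(theta1, 6), "final cost", good_history[-1])

# A learning rate that is too large for this data: the cost grows at every step.
_, _, bad_history = batch_gradient_descent(xs, ys, alpha=0.5, n_steps=10)
print("alpha = 0.5: cost went from", bad_history[0], "to", bad_history[-1])
```

Recording the cost after every step makes the learning-rate check direct: a decreasing history suggests $\alpha$ is acceptable, while an increasing one signals that it should be reduced.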

### Stochastic Gradient Descent

### Batch vs. Online learning
- Batch: uses all patterns in the training set and updates all the coefficients simultaneously;
- Online: 