Gradient Descent

1) **Optimization** - refers to the task of either minimizing or maximizing some function $f(x)$ by altering $x$
* What are we trying to accomplish in optimization?
    1. Find the parameters of a model which maximize the likelihood of data
    2. Find the parameters of a model which minimize a cost function
* **Objective function** - any function for which we wish to find the minimum or maximum
    * if we are minimizing, it goes by several names: **cost function, loss function, error function**
    * cost function optimization examples:
        * example: cost of quality of care associated with the total number of patients in an emergency
        * example: interested in predicting profit at a business and there is some cost associated with producing the product
    * cost functions in models:
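        * $RSS$ (scaled by $\frac{1}{n}$, i.e. the mean squared error) for Linear Regression: $J(\beta)=\frac{1}{n}\sum_{i=1}^n (h_{\beta}(x_i)-y_i)^2$, which is minimized in closed form by $\hat{\beta}=(X^TX)^{-1}X^Ty$ when $X^TX$ is invertible
            * $\sum_i (y_i-\beta^T x_i)^2$
        * Log-likelihood for Logistic Regression (this one is maximized; its negative is the cost to minimize): $J(\theta)=\frac{1}{n}\ln(p(\bar{y}|X;\theta))=\frac{1}{n}\sum_{i=1}^n (y_i \ln(h_\theta(x_i))+(1-y_i)\ln(1-h_{\theta}(x_i)))$
            * $\sum_i \left[ y_i \log(g(\beta^Tx_i))+(1-y_i)\log(1-g(\beta^Tx_i)) \right]$ where $g(z)=\frac{1}{1+e^{-z}}$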
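
As a rough illustration, the two objective functions above can be written in plain Python. The function names (`mse_cost`, `mean_log_likelihood`) and the toy data are assumptions made for this sketch, not part of the notes.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(beta, x):
    """Inner product beta^T x for two equal-length sequences."""
    return sum(b * xj for b, xj in zip(beta, x))

def mse_cost(beta, X, y):
    """Linear regression cost: (1/n) * sum (beta^T x_i - y_i)^2."""
    return sum((dot(beta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

def mean_log_likelihood(beta, X, y):
    """Logistic regression objective: (1/n) * sum of per-row log-likelihoods."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(beta, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)

# Hypothetical toy data: the first column is a constant 1 for the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y_reg = [1.0, 3.0, 5.0]   # generated exactly by y = 1 + 2x
y_clf = [0, 0, 1]

print(mse_cost([1.0, 2.0], X, y_reg))             # no error at the generating coefficients
print(mean_log_likelihood([0.0, 0.0], X, y_clf))  # every p is 0.5 when beta = 0
```

The linear regression cost is something to minimize, while the log-likelihood is something to maximize, so an optimizer written for minimization would be handed its negative.
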
* Useful calculus notations:
    * $\frac{dy}{dx} \rightarrow$ derivative of $y$ with respect to $x$
    * $\frac{\partial y}{\partial x} \rightarrow$ partial derivative of $y$ with respect to $x$
    * $\triangledown_x y \rightarrow$ gradient of $y$ with respect to $x$
    * $\triangledown_X y \rightarrow$ matrix derivatives of $y$ with respect to $X$
    * $\triangledown_{\text{X}}y \rightarrow$ tensor containing derivatives of $y$ with respect to X
        * tensors are geometric objects that describe linear relations between geometric vectors, scalars, and other tensors
    * $\frac{\partial f}{\partial x} \rightarrow$ Jacobian matrix $J \in \mathbb{R}^{m \times n}$ of $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$
    * $\triangledown_x^2 f(x)$ or $H(f)(x) \rightarrow$ the Hessian matrix (or second derivative) of $f$ at the input point $x$
    * $\int f(x)dx \rightarrow$ definite integral over the entire domain of $x$
    * $\int_{\mathbb{S}} f(x)dx \rightarrow$ definite integral with respect to $x$ over the set $\mathbb{S}$
* The derivative of a function can be used to follow the function downhill to a minimum
![grad_desc_deriv](gradient_descent_derivative.png)
    * Suppose we have a function $y=f(x)$, where both $x$ and $y$ are real numbers. The derivative of this function is denoted as $f'(x)$ or as $\frac{dy}{dx}$ and gives the slope of $f$ at the point $x$. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: $f(x+\epsilon)\approx f(x) + \epsilon f'(x)$
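
A quick numeric check of the first-order approximation, using $f(x)=x^2$ as an assumed example function (the point $x=3$ and the step sizes are also arbitrary choices for this sketch):

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x, eps = 3.0, 1e-3

exact = f(x + eps)
approx = f(x) + eps * f_prime(x)   # f(x + eps) ~ f(x) + eps * f'(x)
print(exact, approx, abs(exact - approx))

# Moving against the sign of the derivative lowers f,
# which is the idea gradient descent builds on.
step = 0.1
x_new = x - step * f_prime(x)
print(f(x), f(x_new))
```

For this particular $f$ the approximation error is exactly $\epsilon^2$, so it shrinks quickly as the step gets smaller.
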
* **Batch Gradient Descent (BGD)**
    * Intuition
        * explore a neighborhood of parameters
        * go in the direction of steepest descent
    * Mathematical definition
        * 1-dimension: $J(\beta)=\beta^2$
![grad_desc_steps](http://i.imgur.com/uqKsueE.jpg)
            * minimize $f(x)$ or $J(\beta)$
            * calculate the derivative $\frac{dJ}{d\beta}$; the direction of steepest descent is its negative
                * move in the direction of $-\triangledown f(x)$
            * choose a learning rate/step size: $\epsilon$ / $\alpha$
                * if $\alpha$ is too small, convergence takes a long time
                * if $\alpha$ is too big, can overshoot the minimum
            * repeatedly update parameters: $\beta^{t+1}=\beta^t-\epsilon \frac{dJ}{d\beta}(\beta^t)$
                * update: $x=x-\alpha \triangledown f(x)$
        * $n$-dimension: $J(\vec{\beta})=\sum_{j=1}^n \beta_j^2$
            * calculate the gradient $\triangledown J= \frac{\partial J}{\partial \beta_1}\hat{e}_1 + \frac{\partial J}{\partial \beta_2}\hat{e}_2 + \cdots + \frac{\partial J}{\partial \beta_n}\hat{e}_n$; the direction of steepest descent is $-\triangledown J$
                * $\triangledown f(a) = (\frac{\partial f}{\partial x_1}(a),\cdots,\frac{\partial f}{\partial x_n}(a))$
                * $\triangledown f(a)$ points in the direction of greatest increase of $f$ at $a$
            * choose a learning rate: $\epsilon$
            * repeatedly update parameters: $\vec{\beta}^{t+1} = \vec{\beta}^t - \epsilon \triangledown J(\vec{\beta}^t)$
    * Gradient ascent
        * to maximize $f$, we can minimize $-f$
        * still use almost the same algorithm but:
            * replace: $x = x - \alpha \triangledown f(x)$
            * with: $x = x + \alpha \triangledown f(x)$
    * **Convergence criteria**
        * when a set number of iterations is done (may not have converged)
        * when the percent change in cost falls below a chosen threshold: $\frac{|cost_{old}-cost_{new}|}{|cost_{old}|}$
        * when the cost function is flat enough: $\|\triangledown f\|<\epsilon$ (here $\epsilon$ is a small tolerance, not the learning rate)
    * When to use gradient descent
        * when the cost function is differentiable
        * when there is only one global optimum
        * reaching the global optimum is guaranteed (for a suitable learning rate) when the cost function is globally convex
        * when features have similar scales
        * when asymptotic answer is acceptable
    * When is gradient descent bad?
        * memory constraints (data doesn't fit in memory)
        * takes a long time to compute the cost function over many rows (CPU constrained: the cost function is a function of *all* the data)
        * "online" setting (data keeps coming in / new data continuously)
        * only finds local extrema
        * poor performance without feature scaling
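
The batch gradient descent loop and its three convergence criteria can be sketched as follows. This is a minimal illustration on the linear regression cost, not a definitive implementation: the function names, the toy data, and the defaults `alpha=0.1`, `max_iter=10_000`, and `tol=1e-10` are all assumptions chosen for the example.

```python
import math

def dot(beta, x):
    return sum(b * xj for b, xj in zip(beta, x))

def cost(beta, X, y):
    """J(beta) = (1/n) * sum (beta^T x_i - y_i)^2, computed over all rows."""
    return sum((dot(beta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

def gradient(beta, X, y):
    """Gradient of J: (2/n) * sum (beta^T x_i - y_i) * x_i."""
    n = len(y)
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        err = dot(beta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += 2.0 * err * xij / n
    return grad

def batch_gradient_descent(X, y, alpha=0.1, max_iter=10_000, tol=1e-10):
    beta = [0.0] * len(X[0])
    cost_old = cost(beta, X, y)
    for iteration in range(1, max_iter + 1):           # criterion: iteration cap
        grad = gradient(beta, X, y)
        if math.sqrt(sum(g * g for g in grad)) < tol:  # criterion: flat enough
            break
        # update: beta = beta - alpha * gradient
        beta = [b - alpha * g for b, g in zip(beta, grad)]
        cost_new = cost(beta, X, y)
        # criterion: relative change in cost is small enough
        if cost_old > 0 and abs(cost_old - cost_new) / cost_old < tol:
            break
        cost_old = cost_new
    return beta, iteration

# Hypothetical toy data: intercept column plus one feature on a small scale.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]

beta_hat, n_iter = batch_gradient_descent(X, y)
print(beta_hat, n_iter)
```

Every iteration touches all rows of `X`, which is why the method becomes expensive on large data sets. With a noticeably larger `alpha` the same loop would overshoot and diverge on this data.
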

2) **Stochastic Gradient Descent (True SGD)** - same as gradient descent, except that each step uses the gradient of the cost for just one training example $x_i$ and its corresponding $y_i$
* example: for linear regression, compute $(y_i-\beta^Tx_i)^2$ instead of $\sum_i(y_i-\beta^Tx_i)^2$
* SGD Algorithm:
    * sample a data point without replacement
    * for each data point, do a step of gradient descent: $\beta \leftarrow \beta - \epsilon \triangledown J_i(\beta)$
* Expected direction is correct
    * cost function is expected cost per observation: $J(\beta)=E[J_i(\beta)]=\sum_{i=1}^n \frac{1}{n}J_i(\beta)$
    * the gradient and expected gradient are also the same: $\triangledown J(\beta)=E[\triangledown J_i(\beta)]=\sum_{i=1}^n \frac{1}{n}\triangledown J_i(\beta)$
* Convergence criteria (a single noisy step can look flat or fail to improve the cost by chance, so one step is not a reliable signal)
    * take a moving average of the gradient descent criteria: $T_{old}\leftarrow pT_{current} + (1-p)T_{old}$
    * cut off after a fixed number of iterations
* Pros and Cons of SGD:
    * (+) only requires one observation in memory at once
    * (+) converges faster on average than batch SGD
    * (+) can optimize over a changing cost function (e.g. online setting)
    * (-) can oscillate around optimum
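
A minimal sketch of the SGD loop for linear regression, with the moving-average stopping rule applied to the per-observation cost. The function name `sgd`, the toy data, and the settings (`epsilon=0.1`, `p=0.1`, `tol=1e-12`, a fixed seed) are assumptions made for this example. The data is deliberately noise-free so that a constant step size still settles on the exact answer; with noisy data the iterates would keep hovering around the optimum.

```python
import random

def dot(beta, x):
    return sum(b * xj for b, xj in zip(beta, x))

def sgd(X, y, epsilon=0.1, p=0.1, tol=1e-12, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    order = list(range(len(y)))
    smoothed = None                  # moving average T of the per-observation cost
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(order)           # each pass visits every point once (no replacement)
        for i in order:
            err = dot(beta, X[i]) - y[i]
            cost_i = err ** 2        # J_i(beta) = (beta^T x_i - y_i)^2
            # beta <- beta - epsilon * grad J_i(beta), where grad J_i = 2 * err * x_i
            beta = [b - epsilon * 2.0 * err * xj for b, xj in zip(beta, X[i])]
            # T <- p * T_current + (1 - p) * T_old
            smoothed = cost_i if smoothed is None else p * cost_i + (1 - p) * smoothed
            if smoothed < tol:
                return beta, epoch
    return beta, max_epochs          # cut off after a fixed number of passes

# Hypothetical noise-free data generated by y = 1 + 2x, with x scaled to [0, 1].
X = [[1.0, 0.0], [1.0, 0.25], [1.0, 0.5], [1.0, 0.75], [1.0, 1.0]]
y = [1.0, 1.5, 2.0, 2.5, 3.0]

beta_hat, epochs_used = sgd(X, y)
print(beta_hat, epochs_used)
```

Only one observation is needed in memory for each update, which is the main practical advantage listed above.
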

3) Variants of Stochastic Gradient Descent (SGD)
* **"Online" SGD** - uses each observation as it is collected / updates model by performing a gradient descent step each time a new observation is collected
    * example: every time a new transaction occurs, update your fraud model with that transaction
    * can optionally discard old observations
* **"Batch" SGD** - normal or plain vanilla gradient descent, which computes the gradient of the cost function with respect to $\theta$ over the entire data set
* **"Minibatch" SGD** - uses a random subset of the data / performs an update for every mini-batch of training examples
    * if the entire dataset doesn't fit in memory, train on a random subset in each iteration
    * the minibatch gradient is a sample average that estimates the full gradient
* Which variant of gradient descent algorithm to use?
    * in practice, SGD is often preferred because each step requires less memory and computation
    * but small batches (minibatch SGD) reduce the variance of your steps compared with single-observation updates
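
The "sample average" and variance points can be checked directly on a tiny data set. In this sketch the helper names (`sample_gradient`, `average`, `spread`), the data, and the minibatch size of 2 are assumptions. It enumerates every possible minibatch instead of sampling, so the comparison is exact.

```python
from itertools import combinations

def sample_gradient(beta, xi, yi):
    """Gradient of the single-observation cost (beta^T x_i - y_i)^2."""
    err = sum(b * xj for b, xj in zip(beta, xi)) - yi
    return [2.0 * err * xj for xj in xi]

def average(vectors):
    """Component-wise mean of a collection of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def spread(estimates, center):
    """Mean squared distance of the gradient estimates from `center`."""
    return sum(
        sum((e - c) ** 2 for e, c in zip(est, center)) for est in estimates
    ) / len(estimates)

# Hypothetical toy data and a fixed parameter vector to evaluate gradients at.
X = [[1.0, 0.0], [1.0, 0.25], [1.0, 0.5], [1.0, 0.75], [1.0, 1.0]]
y = [1.0, 1.5, 2.0, 2.5, 3.0]
beta = [0.0, 0.0]

per_sample = [sample_gradient(beta, xi, yi) for xi, yi in zip(X, y)]
full_gradient = average(per_sample)   # batch gradient = mean of per-observation gradients

# Gradient estimate from every possible minibatch of size 2.
minibatch = [average(pair) for pair in combinations(per_sample, 2)]

print(spread(per_sample, full_gradient))   # variability of single-observation steps
print(spread(minibatch, full_gradient))    # variability of size-2 minibatch steps
```

Both kinds of estimate average out to the full gradient, but the minibatch estimates scatter less around it, which is the trade-off between cost per step and noise per step.
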

4) **Newton-Raphson Method** - (aka Newton's Method) optimization method similar to gradient descent, except that it applies a root-finding method to the cost function's first derivative, $f'(x)$
* Algorithm in 1-dimension:
    * choose initial $x_0$
    * while $|f'(x_i)|>\epsilon$: $x_{i+1}=x_i-\frac{f'(x_i)}{f''(x_i)}$
* Algorithm in higher dimensions:
    * $y_{i+1}=y_i-H(y_i)^{-1}\triangledown f(y_i)$
    * $H(a) = [\frac{\partial^2 f}{\partial x_i \partial x_j}(a)]$ is the Hessian matrix (the matrix of the second partial derivatives at $a$)
* What are issues with Newton's method?
    * Hessian might be singular, or computation can be slow
    * can diverge with a bad starting guess
* Gradient descent vs Newton's Method:

|            | Gradient descent | Newton's Method |
|------------:|:----------------:|:---------------:|
| Simplicity | a bit more       | a bit less      |
| Parameters | $\alpha$         | none            |
| Iterations | more             | fewer           |
| n < 1000   | same             | same            |
| n > 10000  | better           | worse           |
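
To make the iteration-count row concrete, here is a small 1-dimensional comparison. The test function $f(x)=e^x-2x$ (minimum at $x=\ln 2$), the starting point, the tolerance, and the learning rate `alpha=0.1` are all assumptions picked for this sketch. The higher-dimensional version would replace the division by $f''$ with a solve against the Hessian.

```python
import math

def f_prime(x):
    """First derivative of f(x) = e^x - 2x."""
    return math.exp(x) - 2.0

def f_second(x):
    """Second derivative of f(x) = e^x - 2x."""
    return math.exp(x)

def newton_minimize(x0, tol=1e-10, max_iter=100):
    x = x0
    for steps in range(max_iter):
        if abs(f_prime(x)) <= tol:
            return x, steps
        x = x - f_prime(x) / f_second(x)   # x_{i+1} = x_i - f'(x_i) / f''(x_i)
    return x, max_iter

def gradient_descent(x0, alpha=0.1, tol=1e-10, max_iter=10_000):
    x = x0
    for steps in range(max_iter):
        if abs(f_prime(x)) <= tol:
            return x, steps
        x = x - alpha * f_prime(x)         # x = x - alpha * f'(x)
    return x, max_iter

x_newton, newton_steps = newton_minimize(0.0)
x_gd, gd_steps = gradient_descent(0.0)

print(x_newton, newton_steps)
print(x_gd, gd_steps)
print(math.log(2.0))   # the true minimizer, for comparison
```

Newton's method needs no learning rate and takes far fewer steps here, at the price of evaluating the second derivative on every step.
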
    