# Q1
## Basic optimization

Consider a simplified logistic regression problem. We are given m training samples $(x^i, y^i)$, $i = 1, \dots, m$.
The data $x^i \in \mathbb{R}$ (note that we only have one feature for each sample), and $y^i \in \{0, 1\}$. To fit a
logistic regression model for classification, we solve the following optimization problem, where $\theta \in \mathbb{R}$
is a parameter we aim to find:

$$
\max_{\theta}\ \ell(\theta)
$$

where the log-likelihood function

$$
\ell (\theta) = \sum_{i=1}^m \left\{ - \log(1 + \exp\{-\theta x^i\}) + (y^i - 1)\theta x^i \right\}\ \ \ \ \ \ \ \ \text{(1)}
$$

#### (a) Show step-by-step mathematical derivation for the gradient of the cost function $\ell(\theta)$

Let **X** be the vector in $\mathbb{R}^m$ with entries $x^i$, and **y** the vector of labels $y^i$, for i = 1, ..., m. Applying $\log$, $\exp$, and products elementwise, the terms inside the summation in (1) are the entries of:

$$
-\log( 1 + \exp\{-\theta X\}) + (y - 1) \theta X \ \ \ \ \ \ \ \ \text{(2)}
$$

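Taking the derivative of (2) w.r.t. $\theta$ (every step acts elementwise on $X$ and $y$):

$$
\frac{\partial \ell}{\partial \theta} = \frac{\partial}{\partial \theta} \left[\theta \cdot X(y-1)\right] - \frac{\partial}{\partial \theta} \log(\exp\{-\theta X\} + 1 )
$$
$$
= X(y-1) - \frac{\frac{\partial}{\partial \theta} \exp\{-\theta X\} + \frac{\partial}{\partial \theta} 1}{\exp\{-\theta X\} + 1}
$$
$$
= X(y-1) - \frac{(-X) \cdot \exp\{-\theta X\}}{\exp\{-\theta X\} + 1}
$$
Which yields
$$
\boxed{X(y-1) + \frac{\exp\{-\theta X\}X}{\exp\{-\theta X\} + 1}}
$$

Restoring the summation over the m samples gives the gradient of (1):

$$
\nabla \ell(\theta) = \sum_{i=1}^m \left[ x^i(y^i-1) + \frac{\exp\{-\theta x^i\}x^i}{\exp\{-\theta x^i\} + 1} \right] = \sum_{i=1}^m x^i\left(y^i - \sigma(\theta x^i)\right), \qquad \sigma(t) = \frac{1}{1 + \exp\{-t\}}
$$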
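As a sanity check on the derivation above, the analytic derivative can be compared with a central finite difference of (1). The sketch below is only illustrative: the toy data, the test point `theta`, and the step `h` are hypothetical values chosen for the example, not part of the problem statement.

```python
import math

def log_likelihood(theta, xs, ys):
    """Log-likelihood (1) for a scalar theta and scalar features."""
    return sum(-math.log(1.0 + math.exp(-theta * x)) + (y - 1.0) * theta * x
               for x, y in zip(xs, ys))

def gradient(theta, xs, ys):
    """Sum over the samples of the boxed per-sample derivative."""
    total = 0.0
    for x, y in zip(xs, ys):
        e = math.exp(-theta * x)
        total += x * (y - 1.0) + e * x / (e + 1.0)
    return total

# Hypothetical toy data, used only for illustration.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 0, 1]
theta = 0.3   # arbitrary test point
h = 1e-6      # finite-difference step

# Central difference approximation of d ell / d theta.
numeric = (log_likelihood(theta + h, xs, ys)
           - log_likelihood(theta - h, xs, ys)) / (2.0 * h)
analytic = gradient(theta, xs, ys)
print(abs(numeric - analytic))  # should be close to zero
```

If the two values disagree by more than rounding error, the sign or a factor in the derivation is likely wrong.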

#### (b) Write a pseudo-code for performing gradient descent to find the optimizer $θ^*$. This is essentially what the training procedure does.


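* Given m examples with feature vector **X** and label vector **Y**, learning rate $\alpha$
* Randomly initialize **$\theta$** (problem (1) has no bias term, so $\theta$ is the only parameter)
* Repeat until convergence:
    * Predict $\hat{y}$ = h(X) = $\sigma(\theta X)$, where $\sigma(t) = 1/(1 + exp(-t))$
    * Evaluate the log-likelihood $\ell(\theta)$ from (1) to monitor convergence
    * Calculate $\nabla \ell(\theta)$ = $\sum_{i=1}^m \left[ x^i(y^i - 1) + \frac{exp(-\theta x^i)x^i}{exp(-\theta x^i)+1} \right]$
    * Update $\theta \leftarrow \theta + \alpha \nabla\ell(\theta)$ (a plus sign because (1) is a maximization; this is the same as gradient descent on $-\ell(\theta)$)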
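The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the toy data, the step size `alpha`, the tolerance `tol`, the iteration cap `max_iter`, and the zero initialization are all assumptions made for the example. The update adds the gradient because $\ell(\theta)$ is maximized.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient(theta, xs, ys):
    # x*(y-1) + x*exp(-theta*x)/(1+exp(-theta*x)) equals x*(y - sigmoid(theta*x)).
    return sum(x * (y - sigmoid(theta * x)) for x, y in zip(xs, ys))

def gradient_ascent(xs, ys, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Full-batch gradient ascent on the log-likelihood (1)."""
    theta = 0.0  # fixed start for reproducibility (an assumption)
    for _ in range(max_iter):
        g = gradient(theta, xs, ys)   # uses all m samples
        if abs(g) < tol:              # stop when the gradient vanishes
            break
        theta += alpha * g            # ascent step: move along the gradient
    return theta

# Hypothetical toy data (not linearly separable, so a finite optimum exists).
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 0, 1]

theta_star = gradient_ascent(xs, ys)
print(theta_star, gradient(theta_star, xs, ys))
```

The printed gradient at the returned `theta_star` should be close to zero, which is the first-order condition for a maximizer of (1).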

#### (c) Write the pseudo-code for performing the stochastic gradient descent algorithm to solve the training of logistic regression problem (1). Please explain the difference between gradient descent and stochastic gradient descent for training logistic regression.

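* Given m examples $(x^i, y^i)$, learning rate $\alpha$
* Randomly initialize **$\theta$** (problem (1) has no bias term, so $\theta$ is the only parameter)
* Repeat until convergence:
    * For i = 1, 2, ..., m (visiting the samples in a randomly shuffled order)
        * Predict $\hat{y}^i$ = $h(x^i)$ = $\sigma(\theta x^i)$, where $\sigma(t) = 1/(1 + exp(-t))$
        * Evaluate the contribution of sample i to the log-likelihood (1)
        * Calculate the single-sample gradient $\nabla \ell_i(\theta)$ = $x^i(y^i - 1) + \frac{exp(-\theta x^i)x^i}{exp(-\theta x^i)+1}$
        * Update $\theta \leftarrow \theta + \alpha \nabla\ell_i(\theta)$ (a plus sign because (1) is a maximization; this is the same as a descent step on $-\ell_i(\theta)$)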
        
        
Gradient descent computes the gradient from all m training observations before each update of $\theta$. With large training sets every step is therefore expensive, whether the sum is evaluated with a vectorized implementation or by looping over the observations one at a time. Stochastic gradient descent instead estimates the gradient from a small random subset of the training observations or, in the case of "online" learning (as shown above in the pseudo-code), from a single observation, and updates the weights after each estimate. Each update is much cheaper but noisier, because it uses only an estimate of the full gradient.
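The online variant described above could look like the following sketch. The toy data, the step size `alpha`, the number of passes `epochs`, and the random `seed` are hypothetical choices for illustration. With a constant step size the iterate hovers near the optimum instead of settling exactly on it.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def full_gradient(theta, xs, ys):
    return sum(x * (y - sigmoid(theta * x)) for x, y in zip(xs, ys))

def sgd(xs, ys, alpha=0.05, epochs=200, seed=0):
    """Stochastic gradient ascent on (1): one update per sample."""
    rng = random.Random(seed)
    theta = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)            # visit samples in random order
        for i in order:
            # Gradient of the i-th term of (1) only.
            g_i = xs[i] * (ys[i] - sigmoid(theta * xs[i]))
            theta += alpha * g_i      # update after every single sample
    return theta

# Hypothetical toy data, used only for illustration.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 0, 1]

theta_sgd = sgd(xs, ys)
print(theta_sgd, full_gradient(theta_sgd, xs, ys))
```

Compared with the full-batch loop, each update here touches one sample instead of all m, which is the cost difference the paragraph above describes.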

#### (d) We will show that the training problem in the basic logistic regression problem is concave. Derive the Hessian matrix of $\ell(\theta)$ and based on this, show the training problem (1) is concave (note that in this case, since we only have one feature, the Hessian matrix is just a scalar). Explain why the problem can be solved efficiently and gradient descent will achieve a unique global optimizer, as we discussed in class.

Taking the derivative of the result from part **(a)** yields:


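$$
\frac{\partial^2 \ell}{\partial \theta^2} = \frac{\partial}{\partial \theta} \left[ X(y-1) + \frac{\exp\{-\theta X\}X}{\exp\{-\theta X\} + 1} \right]
$$

$$
= \frac{-X^2\exp\{-\theta X\}\left(\exp\{-\theta X\} + 1\right) + X^2\exp\{-\theta X\}\exp\{-\theta X\}}{\left(\exp\{-\theta X\} + 1\right)^2}
$$

$$
\boxed{= - \frac{\exp(-\theta X)X^2}{(1 + \exp(-\theta X))^2}\ \ \ \ \ \ \ \text{(3)}}
$$

Because each $x^i \in \mathbb{R}$, there is a single $\theta$ in this logistic regression problem, so the Hessian is a single value. With $X \in \mathbb{R}^{m \times 1}$ and elementwise operations, restoring the summation gives

$$
\frac{\partial^2 \ell}{\partial \theta^2} = - \sum_{i=1}^m \frac{\exp(-\theta x^i)(x^i)^2}{(1 + \exp(-\theta x^i))^2} \le 0
$$

Every term in (3) is nonpositive for any real $\theta$ and $x^i$, because the exponential is positive and the squares are nonnegative, and it is strictly negative whenever $x^i \neq 0$. A twice-differentiable function of a single variable is concave if its second derivative is nonpositive everywhere, so $\ell(\theta)$ is concave and (1) is a concave maximization problem (equivalently, minimizing $-\ell(\theta)$ is a convex problem). In such a problem every local optimum is a global optimum, so gradient descent can solve it efficiently without getting stuck in a spurious local minimum (or maximum). When at least one $x^i \neq 0$, $\ell$ is strictly concave, so the global maximizer is unique whenever it exists. This is in contrast to a non-convex (non-concave) optimization problem, such as integer optimization, where a local search can get stuck at a point that is not globally optimal.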
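The sign of the second derivative can also be checked numerically. The sketch below evaluates the summed second derivative on a grid of $\theta$ values. The toy features and the grid range are hypothetical and chosen only for illustration. The labels do not appear, because the second derivative does not depend on $y^i$.

```python
import math

def second_derivative(theta, xs):
    """Sum over samples of -exp(-theta*x) * x^2 / (1 + exp(-theta*x))^2."""
    total = 0.0
    for x in xs:
        e = math.exp(-theta * x)
        total += -e * x * x / (1.0 + e) ** 2
    return total

# Hypothetical toy features, used only for illustration.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]

# Evaluate on a grid of theta values in [-5, 5].
thetas = [k / 10.0 for k in range(-50, 51)]
values = [second_derivative(t, xs) for t in thetas]

# Concavity requires every value to be <= 0; here they are strictly negative.
print(max(values) < 0.0)
```

A grid check is evidence rather than proof. The argument in the text covers every real $\theta$.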