# Logistic Regression

## Numerical Optimization
The Logistic Regression model is obtained by **minimizing the average cross-entropy between the model predictions and the observed labels**. As we have seen, this also corresponds to a **Maximum Likelihood solution for the observed labels**, or to **minimizing the average Logistic Loss function**. The next section covers the theory behind Logistic Regression; here we only report the final formulas for the three objective functions, where $c_i \in \{0, 1\}$ is the label of sample $\mathbf{x}_i$ and $y_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b)$ is the model prediction (the sigmoid of the score):
- log-likelihood: $l(\mathbf{w}, b) = \sum_{i=1}^{n} \left[c_i \log y_i + (1 - c_i) \log(1 - y_i)\right] \rightarrow$ GOAL: maximize $l(\mathbf{w}, b)$ wrt $\mathbf{w}, b$.

- average cross-entropy (written here without the $1/n$ factor, which does not change the minimizer): $\mathcal{J}(\mathbf{w}, b) = -l(\mathbf{w}, b) = - \sum_{i=1}^{n} \left[c_i \log y_i + (1 - c_i) \log(1 - y_i)\right] \rightarrow$ GOAL: minimize $\mathcal{J}(\mathbf{w}, b)$ wrt $\mathbf{w}, b$.

- average Logistic Loss function: $\mathcal{J}(\mathbf{w}, b) = \sum_{i=1}^{n} \log(1 + e^{-z_i(\mathbf{w}^T \mathbf{x}_i + b)})  \rightarrow$ GOAL: minimize $\mathcal{J}(\mathbf{w}, b)$ wrt $\mathbf{w}, b$. Here $z_i = 2c_i - 1 \in \{-1, +1\}$ is the label $c_i \in \{0, 1\}$ remapped to $\pm 1$.
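
As a quick numerical check of the three formulas above, the sketch below evaluates them on random data and arbitrary parameters. The data, the parameter values and the variable names are purely illustrative assumptions; the code also assumes $y_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b)$ and $z_i = 2c_i - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))        # illustrative data, one sample per row
c = rng.integers(0, 2, size=n)     # labels c_i in {0, 1}
w, b = rng.normal(size=d), 0.5     # arbitrary (not optimized) parameters

s = X @ w + b                      # scores w^T x_i + b
y = 1.0 / (1.0 + np.exp(-s))       # sigmoid predictions y_i

# Log-likelihood and its negative (cross-entropy, without the 1/n factor)
log_lik = np.sum(c * np.log(y) + (1 - c) * np.log(1 - y))
cross_entropy = -log_lik

# Logistic loss with labels remapped to z_i in {-1, +1};
# np.logaddexp(0, t) computes log(1 + e^t) in a numerically stable way
z = 2 * c - 1
logistic_loss = np.sum(np.logaddexp(0.0, -z * s))

print(log_lik, cross_entropy, logistic_loss)
```

The cross-entropy and the logistic loss should agree up to rounding error, and both equal the negative log-likelihood.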

While for Gaussian models closed-form expressions are available for the
ML solutions, this is not the case for Logistic Regression. This means that **we can't just solve a system of linear equations to find the optimal parameters**. The reason is that the **sigmoid** function involved in binary Logistic Regression (and the **softmax** function involved in multiclass Logistic Regression) makes the gradient of the loss nonlinear in the parameters, so setting it to zero does not yield a closed-form solution, even though the loss itself is convex.<br>
Therefore, we turn to numerical optimization
to find the maximizer of the class likelihoods, or, equivalently, the minimizer of the average cross-entropy or average Logistic Loss function. <br>
Numerical optimization algorithms look for the minimum of a function $f(x)$ with respect to the argument
$x$. Here we briefly explain two methods; we will adopt the second one:
### 1) Gradient Descent (GD)
With this iterative method, at each iteration $t$ we compute $x_{t+1}$ from $x_{t}$:
- we compute the gradient $\nabla f(x_t)$ of the loss function with respect to the current parameters $x_t$.
- we then update the parameters by moving in the **opposite direction of the gradient** (this is done by multiplying the gradient by $-1$), scaled by a learning rate (also called step) $\alpha_t$:

$$
x_{t+1} = x_t - \alpha_t \nabla f(x_t)
$$

Provided that the step size decreases over the iterations $\left( \alpha_t \rightarrow 0 \right)$ while the sum of all the steps is unbounded $\left( \sum_{t=1}^{\infty} \alpha_t = \infty \right)$, and under mild regularity conditions on $f$, the algorithm converges to a stationary point of $f$, which in practice is typically a **local minimum**. For a convex function, any local minimum is also a global minimum.
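
The update rule can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the quadratic test function, the schedule $\alpha_t = \alpha_0 / \sqrt{t}$ (which satisfies both conditions above) and all names are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad, x0, n_iter=2000, alpha0=0.1):
    """Plain gradient descent with the decaying step alpha_t = alpha0 / sqrt(t)."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_iter + 1):
        alpha_t = alpha0 / np.sqrt(t)   # alpha_t -> 0, but the sum of steps diverges
        x = x - alpha_t * grad(x)       # move against the gradient
    return x

# Illustrative convex function f(x) = ||x - a||^2, whose gradient is 2 (x - a)
a = np.array([1.0, -2.0])
x_gd = gradient_descent(lambda x: 2.0 * (x - a), x0=np.zeros(2))
print(x_gd)   # should be very close to a
```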
#### Pros of GD
- Easy to implement  
- Low memory usage

#### Cons of GD
- Can be **very slow to converge**  
- Sensitive to choice of learning rate  
- Struggles with ill-conditioned loss surfaces

### 2) L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
L-BFGS is a more advanced, **approximate second-order** optimization algorithm. Instead of relying solely on the gradient, it also uses curvature (second-order) information: it builds an approximation of the inverse Hessian of the function from the gradients and parameter updates of the last few iterations, and uses it to guide the search more efficiently. <br>

#### Pros of L-BFGS
- Much **faster convergence** than gradient descent  
- No need to compute or store the full Hessian (which would take $\mathcal{O}(d^2)$ memory)

#### Cons of L-BFGS
- Slightly more complex, with a higher per-iteration cost ($\mathcal{O}(md)$, where $d$ is the number of parameters and $m$ is the number of past iterations kept in memory, whereas GD has just $\mathcal{O}(d)$)
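
To make the trade-off concrete, the sketch below compares the number of iterations needed by plain gradient descent and by L-BFGS on an ill-conditioned quadratic. The test function, the fixed GD step and the stopping threshold are assumptions chosen for illustration; `m` is the `fmin_l_bfgs_b` parameter that sets how many past corrections are stored.

```python
import numpy as np
import scipy.optimize as opt

d_dim = 50
a = np.linspace(1.0, 100.0, d_dim)   # curvatures, condition number 100
x_star = 1.0 / a                     # exact minimizer of the quadratic below

def quad(x):
    # Returns (f(x), gradient) for f(x) = 0.5 * sum(a_i x_i^2) - sum(x_i)
    return 0.5 * np.sum(a * x**2) - np.sum(x), a * x - 1.0

# Plain gradient descent with the fixed step 1 / max(a)
x = np.zeros(d_dim)
gd_iters = 0
while np.max(np.abs(quad(x)[1])) > 1e-5:
    x = x - (1.0 / a.max()) * quad(x)[1]
    gd_iters += 1

# L-BFGS keeping m = 10 past corrections
x_lbfgs, f_lbfgs, info = opt.fmin_l_bfgs_b(quad, np.zeros(d_dim), m=10)

print(f"GD iterations: {gd_iters}, L-BFGS iterations: {info['nit']}")
```

On this kind of problem L-BFGS is expected to need far fewer iterations than gradient descent, at the price of storing $m$ extra vector pairs.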


This second algorithm is the one we'll use. It is implemented in `scipy` (requires importing `scipy.optimize`); we will use the `scipy.optimize.fmin_l_bfgs_b` interface to the numerical solver.

`scipy.optimize.fmin_l_bfgs_b` requires at least 2 arguments (check the documentation for more details):

* `func`: the function we want to minimize.
* `x0`: the starting value for the algorithm.

The L-BFGS algorithm requires computing the objective function and its gradient. To pass the gradient we have different options:

* Through `func`: `func` should return a tuple $(f(x), \nabla_x f(x))$.
* Through the optional parameter `fprime`: `fprime` is a function computing the gradient. In this case, `func` should only return the objective value $f(x)$.
* Let the implementation compute an approximated gradient: pass `approx_grad = True`. In this case, `func` should only return the objective value $f(x)$.
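
The three options can be sketched side by side. The function `g` and its helpers are hypothetical names introduced only for this illustration; the keyword arguments `fprime` and `approx_grad` are those of `fmin_l_bfgs_b`.

```python
import numpy as np
import scipy.optimize as opt

def g(x):
    # Illustrative objective: g(x) = sum((x_i - 1)^2), minimized at x = (1, ..., 1)
    return np.sum((x - 1.0) ** 2)

def g_grad(x):
    # Exact gradient of g
    return 2.0 * (x - 1.0)

def g_and_grad(x):
    # Objective and gradient returned together as a tuple
    return g(x), g_grad(x)

x0 = np.zeros(3)

# Option 1: func returns (f(x), gradient)
x_a, f_a, _ = opt.fmin_l_bfgs_b(g_and_grad, x0)

# Option 2: gradient passed separately through fprime
x_b, f_b, _ = opt.fmin_l_bfgs_b(g, x0, fprime=g_grad)

# Option 3: gradient approximated by finite differences
x_c, f_c, _ = opt.fmin_l_bfgs_b(g, x0, approx_grad=True)

print(x_a, x_b, x_c)
```

All three calls should return a point close to $(1, 1, 1)$.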

The last option does not require writing a function that computes the gradient, as an approximation of the gradient is automatically obtained through finite differences. While this has the advantage that we do not need to derive and implement the gradient, it has two drawbacks:

* The gradient computed through finite differences may not be accurate enough.
* The computations are much more expensive: at each iteration the objective function must be evaluated at least $D$ additional times, where $D$ is the size of $x$, and a more accurate approximation of the gradient may require many more evaluations of $f$.
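
Both drawbacks can be seen in a hand-written forward-difference gradient. This is only a sketch of the idea, not the code used inside `scipy`; the function `h`, the step `eps` and the call counter are assumptions made for the example.

```python
import numpy as np

calls = {"n": 0}   # counts evaluations of the objective

def h(x):
    # Illustrative objective: h(x) = sum(x_i^2) + sin(x_0)
    calls["n"] += 1
    return np.sum(x ** 2) + np.sin(x[0])

def h_grad(x):
    # Exact gradient of h, used as the reference
    grad = 2.0 * x
    grad[0] += np.cos(x[0])
    return grad

def forward_diff_grad(func, x, eps=1e-8):
    # One evaluation at x, plus one evaluation per coordinate
    f0 = func(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += eps
        grad[i] = (func(x_step) - f0) / eps
    return grad

x = np.linspace(-1.0, 1.0, 20)            # D = 20
approx = forward_diff_grad(h, x)
max_err = np.max(np.abs(approx - h_grad(x)))

print(f"objective evaluations: {calls['n']}, max abs error: {max_err:.2e}")
```

A single approximate gradient costs $D + 1$ evaluations of the objective, and the result matches the exact gradient only up to a small error.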



As an example, we now apply L-BFGS to the function:
$$
f(y, z) = (y + 3)^2 + \sin(y) + (z + 1)^2
$$

In [1]:
import numpy as np
import scipy.optimize as opt

In [4]:
# Implementation of the function f(y, z)
def f(x):
    # x is a numpy array of shape (2,)
    # x[0] is y and x[1] is z
    # The function returns the value of f(y,z) = (y+3)^2 + sin(y) + (z+1)^2
    y = x[0]
    z = x[1]

    return (y+3)**2 + np.sin(y) + (z+1)**2


# Now we call scipy.optimize.fmin_l_bfgs_b, passing the function f,
# the starting point x_0 = [0, 0] and approx_grad=True
x_0 = np.array([0, 0])

#x_min is the minimum point of the function f
#f_min is the value of the function f at the minimum point x_min
#d is a dictionary with information about the optimization process
x_min, f_min, d = opt.fmin_l_bfgs_b(f, x_0, approx_grad=True)

print(f"Minimum of f(y,z) is at x_min = {x_min}")
print(f"f(y_min,z_min) = f(x_min) = {f_min}")
print(f"Optimization info: {d}")

Minimum of f(y,z) is at x_min = [-2.57747138 -0.99999927]
f(y_min,z_min) = f(x_min) = -0.3561430123647649
Optimization info: {'grad': array([-1.49324998e-06,  1.46549439e-06]), 'task': 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL', 'funcalls': 21, 'nit': 6, 'warnflag': 0}
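
For comparison, the same minimization can be run with the exact gradient instead of `approx_grad=True`. The sketch below assumes the gradient $\nabla f(y, z) = \left(2(y + 3) + \cos(y),\ 2(z + 1)\right)$, obtained by differentiating $f$ by hand; the name `f_and_grad` is introduced only for this example.

```python
import numpy as np
import scipy.optimize as opt

def f_and_grad(x):
    # Returns f(y, z) = (y+3)^2 + sin(y) + (z+1)^2 together with its gradient
    y, z = x
    value = (y + 3) ** 2 + np.sin(y) + (z + 1) ** 2
    grad = np.array([2 * (y + 3) + np.cos(y), 2 * (z + 1)])
    return value, grad

# func returns (value, gradient), so neither fprime nor approx_grad is needed
x_min, f_min, info = opt.fmin_l_bfgs_b(f_and_grad, np.array([0.0, 0.0]))

print(f"x_min = {x_min}, f_min = {f_min}, funcalls = {info['funcalls']}")
```

The minimizer should agree with the one found above, while each step now costs a single call to `f_and_grad` rather than several finite-difference evaluations.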
