# Import the libraries needed
import numpy as np
import matplotlib.pyplot as plt

# Optimization via Stochastic Gradient Descent

In Machine Learning (ML) you are usually given a dataset $\mathbb{D} = \{X, Y\}$, with $X = [x^{(1)}\; x^{(2)}\; \dots\; x^{(N)}] \in \mathbb{R}^{d \times N}$ and $Y = [y^{(1)}\; y^{(2)}\; \dots\; y^{(N)}] \in \mathbb{R}^N$, together with a parametric function $f_w(x)$, where the vector $w$ is usually referred to as the weights of the model. The training procedure can be written as

$$ w^* = \underset{w}{\arg\min}\; l(w; \mathbb{D}) = \underset{w}{\arg\min}\; \sum_{i=1}^N l_i(w; x^{(i)}, y^{(i)}) $$

What is interesting from the optimization point of view is that the objective function $l(w; \mathbb{D})$ is a sum of independent terms, each related to a single datapoint (we will see in the next lab why this formulation is so common).

Suppose we want to apply Gradient Descent (GD). Given an initial vector $w_0 \in \mathbb{R}^n$, the iteration becomes

$$w_{k+1} = w_k - \alpha_k \nabla_w l(w_k; \mathbb{D}) = w_k - \alpha_k \sum_{i=1}^N \nabla_w l_i(w_k; x^{(i)}, y^{(i)})$$

Thus, to compute each iteration we need the gradient of the objective function with respect to the weights, which can be computed by summing the gradients of the independent functions $l_i(w;x^{(i)},y^{(i)})$.
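
As a minimal sketch of the statement above, the following code builds the full gradient by summing per-sample gradients and performs one GD step. The least-squares loss $l_i(w) = \frac{1}{2}(w^T x^{(i)} - y^{(i)})^2$, the synthetic data, and the names `grad_li` and `full_grad` are assumptions chosen only for illustration; they are not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 100
X = rng.normal(size=(d, N))   # columns are the samples x^(i)
y = rng.normal(size=N)

def grad_li(w, x_i, y_i):
    """Gradient of the illustrative loss l_i(w) = 0.5 * (w @ x_i - y_i)**2."""
    return (w @ x_i - y_i) * x_i

def full_grad(w, X, y):
    """Full gradient: the sum of the N per-sample gradients."""
    return sum(grad_li(w, X[:, i], y[i]) for i in range(X.shape[1]))

w = np.zeros(d)
alpha = 1e-3                               # hand-picked step size
w_next = w - alpha * full_grad(w, X, y)    # one GD iteration
```

The explicit loop over the $N$ samples makes the cost of one GD step visible: it grows linearly with the size of the dataset.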

Unfortunately, even if it is easy to compute the gradient of each $l_i(w;x^{(i)},y^{(i)})$, when the number of samples $N$ is large (which is common in Machine Learning), computing the full gradient $\nabla_w l(w_k; \mathbb{D})$ is prohibitive. For this reason, in such optimization problems it is better to use the Stochastic Gradient Descent (SGD) method instead of standard GD. SGD is a variant of classical GD where, instead of computing $\nabla_w l(w; \mathbb{D}) = \sum_{i=1}^N \nabla_w l_i(w; x^{(i)}, y^{(i)})$, the summation is reduced to a limited number of terms, called a batch. The idea is the following:

- Given a number $N_{batch}$ (usually called the batch size), randomly extract a subdataset $\mathbb{M}$ with $|\mathbb{M}| = N_{batch}$ from $\mathbb{D}$.

- Approximate the true gradient
    $$\nabla_w l(w; \mathbb{D}) = \sum_{i=1}^N \nabla_w l_i(w; x^{(i)}, y^{(i)})$$
    with
    $$\nabla_w l(w; \mathbb{M}) = \sum_{i \in \mathbb{M}} \nabla_w l_i(w; x^{(i)}, y^{(i)})$$

- Compute a single iteration of the GD algorithm
$$w_{k+1} = w_k - \alpha_k \nabla_w l(w_k; \mathbb{M})$$

- Repeat until you have extracted the full dataset. Notice that the random sampling at each iteration is done without replacement.
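
The steps above can be sketched as follows. Shuffling the indices once and slicing the permutation into consecutive chunks is one common way to sample without replacement. The helper names `batch_indices`, `sgd_epoch` and `lsq_grad`, the least-squares gradient, and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def batch_indices(N, batch_size, rng):
    """Shuffle 0..N-1 and cut it into disjoint batches (no replacement)."""
    perm = rng.permutation(N)
    return [perm[s:s + batch_size] for s in range(0, N, batch_size)]

def sgd_epoch(w, grad_batch, X, y, batch_size, alpha, rng):
    """Run one epoch: one GD-like step per batch, each using only that batch."""
    for idx in batch_indices(X.shape[1], batch_size, rng):
        w = w - alpha * grad_batch(w, X[:, idx], y[idx])
    return w

def lsq_grad(w, Xb, yb):
    """Illustrative batch gradient of 0.5 * sum((Xb.T @ w - yb)**2)."""
    return Xb @ (Xb.T @ w - yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))   # columns are samples
y = rng.normal(size=20)
w = sgd_epoch(np.zeros(3), lsq_grad, X, y, batch_size=5, alpha=1e-2, rng=rng)
```

With $N = 20$ and $N_{batch} = 5$, one epoch consists of four batch iterations, and every datapoint is used exactly once.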

Each iteration of the algorithm above is usually called a batch iteration. When the whole dataset has been
processed, we say that we have completed an epoch of the SGD method. The algorithm should be repeated for a
fixed number $E$ of epochs to reach convergence.

Unfortunately, one of the biggest drawbacks of SGD with respect to GD is that we can no longer check
convergence (since computing the gradient of $l(w; \mathbb{D})$ to check its distance from zero is exactly
the cost we are trying to avoid), and for the same reason we cannot use the backtracking algorithm. As a
consequence, the algorithm stops ONLY after reaching the fixed number of epochs, and we must set a good value
for the step size $\alpha_k$ by hand. These problems are addressed by more recent algorithms such as SGD with Momentum, Adam, AdaGrad, ...

## Implement SGD function

Write a Python script that implements the SGD algorithm, following the structure you already wrote
for GD. The script should work as follows:

    Input:
    - l: the function l(w; D) we want to optimize.
        It is supposed to be a Python function, not an array.
    - grad_f: the gradient of l(w; D). 
        It is supposed to be a Python function, not an array.
    - w0: an n-dimensional array which represents the initial iterate. 
        By default, it should be randomly sampled.
    - data: a tuple (x, y) that contains the two arrays x and 
            y, where x is the input data, y is the output data.
    - batch_size: an integer. 
        The size of each batch. Should be a divisor of the number of datapoints.
    - n_epochs: an integer. The number of epochs for which you want to 
                repeat the iterations.
    
    Output:
    - w: an array that contains the value of w_k FOR EACH
        iterate w_k (not only the last one).
    - f_val: an array that contains the value of l(w_k; D),
        evaluated ONLY at the iterate w_k reached at the end of each epoch.
    - grads: an array that contains the value of grad_l(w_k; D),
        evaluated ONLY at the iterate w_k reached at the end of each epoch.
    - err: an array that contains the value of ||grad_l(w_k; D)||_2,
        evaluated ONLY at the iterate w_k reached at the end of each epoch.
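
One possible shape for such a function is sketched below; it is not the reference solution. Several choices are assumptions not fixed by the specification above: `l` and `grad_f` are called as `l(w, X, y)` and `grad_f(w, X, y)`, `X` stores samples as columns, the constant step size `alpha` and the `seed` argument are extra parameters, and a missing `w0` is sampled with the same dimension as a datapoint. The least-squares test problem is also only illustrative.

```python
import numpy as np

def sgd(l, grad_f, w0, data, batch_size, n_epochs, alpha=1e-2, seed=0):
    X, y = data                          # X: (d, N), y: (N,)
    d, N = X.shape
    rng = np.random.default_rng(seed)
    if w0 is None:
        w0 = rng.normal(size=d)          # assumption: w has the size of x
    w = np.asarray(w0, dtype=float)
    w_hist = [w.copy()]
    f_val, grads, err = [], [], []
    for _ in range(n_epochs):
        perm = rng.permutation(N)        # reshuffle at every epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            w = w - alpha * grad_f(w, X[:, idx], y[idx])
            w_hist.append(w.copy())
        g = grad_f(w, X, y)              # full gradient, once per epoch
        f_val.append(l(w, X, y))
        grads.append(g)
        err.append(np.linalg.norm(g))
    return np.array(w_hist), np.array(f_val), np.array(grads), np.array(err)

# Illustrative noise-free least-squares problem.
def lsq(w, X, y):
    return 0.5 * np.sum((X.T @ w - y) ** 2)

def lsq_grad(w, X, y):
    return X @ (X.T @ w - y)

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 100))
y = X.T @ np.array([1.0, -2.0])
w_hist, f_val, grads, err = sgd(lsq, lsq_grad, np.zeros(2), (X, y),
                                batch_size=10, n_epochs=5)
```

The full gradient is evaluated only once per epoch, which keeps the monitoring cost well below the cost of the batch iterations themselves.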

## Prepare Dataset and Loss

To test the script above, consider the MNIST dataset we used in the previous laboratories, and do the following:

1. From the dataset, select only two digits. It would be great to let the user input the two digits to select.
2. Apply the same procedure as in the previous homework to obtain the training and test sets from (X, Y), choosing the $N_{train}$ you prefer.
3. Implement a logistic regression classifier as described in the corresponding post on my website.
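
A rough sketch of the three steps is given below. Since the referenced post is not reproduced here, the loss is assumed to be the standard cross-entropy of a sigmoid model without a bias term; the post may use a different formulation. The helper names and the synthetic stand-in for MNIST (random features, labels cycling through 0 to 9) are also assumptions.

```python
import numpy as np

def select_digits(X, Y, digit_a, digit_b):
    """Keep only two digits; relabel digit_a as 0 and digit_b as 1."""
    mask = (Y == digit_a) | (Y == digit_b)
    return X[:, mask], (Y[mask] == digit_b).astype(float)

def split(X, y, n_train, rng):
    """Random train/test split along the sample (column) axis."""
    perm = rng.permutation(X.shape[1])
    tr, te = perm[:n_train], perm[n_train:]
    return (X[:, tr], y[tr]), (X[:, te], y[te])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Cross-entropy summed over the samples (small eps avoids log(0))."""
    p = sigmoid(X.T @ w)
    eps = 1e-12
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def grad_loss(w, X, y):
    """Gradient of the cross-entropy: sum_i (p_i - y_i) x^(i)."""
    return X @ (sigmoid(X.T @ w) - y)

# Synthetic stand-in for MNIST: 4 features, 200 samples, labels 0..9.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(4, 200))
Y_all = np.arange(200) % 10
X_sel, y_sel = select_digits(X_all, Y_all, 3, 7)
(X_train, y_train), (X_test, y_test) = split(X_sel, y_sel, 30, rng)
```

Because `loss` and `grad_loss` take `(w, X, y)`, they can be evaluated on the full training set for GD or on a single batch for SGD.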

- Test the logistic regression classifier for different digits and different training set dimensions.

- The training procedure will end up with a set of optimal parameters $w^*$. Compare $w^*$ when computed with GD and SGD, for different digits and different training set dimensions.

- Comment on the obtained results (in terms of the accuracy of the learned classifier).

- _Hard (optional)_: Try to implement the 3-digit logistic regression classifier and compare its accuracy with the accuracy of the LDA and PCA classifiers.
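
For the comparison in terms of accuracy, a minimal sketch is shown below. It assumes labels in $\{0, 1\}$, a sigmoid model without bias, samples stored as columns, and a decision threshold of 0.5; the names `predict` and `accuracy` and the tiny dataset are hypothetical.

```python
import numpy as np

def predict(w, X, threshold=0.5):
    """Predict class 1 when the sigmoid output reaches the threshold."""
    p = 1.0 / (1.0 + np.exp(-(X.T @ w)))
    return (p >= threshold).astype(int)

def accuracy(w, X, y):
    """Fraction of samples whose predicted label matches y."""
    return np.mean(predict(w, X) == y)

# Tiny example: one feature, four samples (columns).
X_demo = np.array([[1.0, -1.0, 2.0, -2.0]])
y_demo = np.array([1, 0, 1, 0])
w_demo = np.array([1.0])
acc = accuracy(w_demo, X_demo, y_demo)
```

The same function can be applied to the weights returned by GD and by SGD on the test set, and the distance between the two solutions can be measured with `np.linalg.norm(w_gd - w_sgd)`, where `w_gd` and `w_sgd` are placeholders for the two results.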