In [4]:
#import libraries needed
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd

# Optimization via Stochastic Gradient Descent

When working with Machine Learning (ML), you are usually given a dataset $\mathbb{D} = \{X, Y\}$ with $X = [x^{(1)}\, x^{(2)} \dots x^{(N)}] \in \mathbb{R}^{d \times N}$ and $Y = [y^{(1)}\, y^{(2)} \dots y^{(N)}] \in \mathbb{R}^N$, and a parametric function $f_w(x)$, where the vector $w$ is usually referred to as the weights of the model. The training procedure can be written as

$$ w^* = \underset{w}{\arg\min}\; l(w; \mathbb{D}) = \underset{w}{\arg\min} \sum_{i=1}^N l_i(w; x^{(i)}, y^{(i)}) $$

What is interesting from the optimization point of view is that the objective function $l(w; \mathbb{D})$ is written as a sum of independent terms, each related to a single datapoint (we will see in the next lab why this formulation is so common).

Suppose we want to apply Gradient Descent (GD). Given an initial vector $w_0 \in \mathbb{R}^n$, the iteration becomes

$$w_{k+1} = w_k - \alpha_k \nabla_w l(w_k; \mathbb{D}) = w_k - \alpha_k \sum_{i=1}^N \nabla_w l_i(w_k; x^{(i)}, y^{(i)})$$

Thus, to compute the iteration we need the gradient of the objective function with respect to the weights, which can be computed by summing the gradients of the independent functions $l_i(w; x^{(i)}, y^{(i)})$.
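
As a small sanity check of this statement, the sketch below uses a toy least-squares term $l_i(w) = (w^T x^{(i)} - y^{(i)})^2$. This loss, the random data and the helper name `grad_li` are assumptions made only for illustration; they are not the loss used later in this lab. The code verifies that summing the $N$ per-sample gradients gives the same vector as the closed-form gradient of the full objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 5
X = rng.normal(size=(d, N))   # columns are the datapoints x^(i)
Y = rng.normal(size=N)
w = rng.normal(size=d)

def grad_li(w, x, y):
    # Gradient of the single term l_i(w) = (w^T x - y)^2 (hypothetical toy loss)
    return 2 * (w @ x - y) * x

# Sum of the N per-sample gradients
grad_sum = sum(grad_li(w, X[:, i], Y[i]) for i in range(N))

# Closed-form gradient of the full objective ||X^T w - Y||^2
grad_full = 2 * X @ (X.T @ w - Y)

print(np.allclose(grad_sum, grad_full))
```

The cost of `grad_sum` grows linearly with $N$, which is exactly what becomes a problem on large datasets.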

Unfortunately, even though the gradient of each $l_i(w; x^{(i)}, y^{(i)})$ is easy to compute, when the number of samples $N$ is large (which is common in Machine Learning) computing the full gradient $\nabla_w l(w_k; \mathbb{D})$ becomes prohibitively expensive. For this reason, in such optimization problems it is better to use the Stochastic Gradient Descent (SGD) method instead of standard GD. SGD is a variant of classical GD where, instead of computing $\nabla_w l(w; \mathbb{D}) = \sum_{i=1}^N \nabla_w l_i(w; x^{(i)}, y^{(i)})$, the summation is reduced to a limited number of terms, called a batch. The idea is the following:

- Given a number $N_{batch}$ (usually called the batch size), randomly extract a sub-dataset $\mathbb{M}$ with $|\mathbb{M}| = N_{batch}$ from $\mathbb{D}$.

- Approximate the true gradient 
    $$\nabla_w l(w; \mathbb{D}) = \sum_{i=1}^N \nabla_w l_i(w; x^{(i)}, y^{(i)})$$ 
    with 
    $$\nabla_w l(w; \mathbb{M}) = \sum_{i \in \mathbb{M}} \nabla_w l_i(w; x^{(i)}, y^{(i)})$$

- Compute one single iteration of the GD algorithm
$$w_{k+1} = w_k - \alpha_k \nabla_w l(w_k; \mathbb{M})$$

- Repeat until you have extracted the full dataset. Notice that the random sampling at each iteration is done without replacement.
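
One common way to sample batches without replacement is to shuffle the indices once per epoch and cut the permutation into consecutive chunks. The sketch below illustrates this idea; the helper name `batch_indices` and the sizes used are hypothetical and chosen only for the example.

```python
import numpy as np

def batch_indices(N, batch_size, rng):
    # Shuffle all indices once, then cut the permutation into consecutive chunks
    perm = rng.permutation(N)
    return [perm[s:s + batch_size] for s in range(0, N, batch_size)]

rng = np.random.default_rng(0)
batches = batch_indices(N=10, batch_size=5, rng=rng)

# Over one epoch, every index appears in exactly one batch
seen = np.sort(np.concatenate(batches))
print(len(batches), np.array_equal(seen, np.arange(10)))
```

Calling `batch_indices` again at the start of the next epoch produces a different partition of the same dataset.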

Each iteration of the algorithm above is usually called a batch iteration. When the whole dataset has been
processed, we say that we have completed an epoch of the SGD method. This algorithm should be repeated for a
fixed number $E$ of epochs to reach convergence.

Unfortunately, one of the biggest drawbacks of SGD with respect to GD is that we can no longer check
convergence (since we obviously cannot afford to compute the gradient of $l(w; \mathbb{D})$ to check its distance from
zero) and, for the same reason, we cannot use the backtracking algorithm. As a consequence, the algorithm
stops ONLY after reaching the fixed number of epochs, and we must set a good value for the step size
$\alpha_k$ by hand. These problems are addressed by more recent algorithms such as SGD with Momentum, Adam, AdaGrad, ...
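
As a hedged illustration of one of the variants named above, a common form of the momentum update keeps a running "velocity" built from past gradients and moves along it. The sketch below applies it to a toy quadratic objective; the function name `momentum_step`, the values of `alpha` and `beta`, and the objective itself are assumptions made for the example, not part of this lab.

```python
import numpy as np

def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    # Accumulate a velocity from past gradients, then move along it
    v = beta * v + grad
    w = w - alpha * v
    return w, v

# Toy objective l(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)

print(np.linalg.norm(w))
```

In an SGD setting, `grad` would be the batch gradient $\nabla_w l(w_k; \mathbb{M})$ instead of the exact gradient used here.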

## Implement SGD function

Write a Python script that implements the SGD algorithm, following the structure you already wrote
for GD. The script should work as follows:

    Input:
    - l: the function l(w; D) we want to optimize.
        It is supposed to be a Python function, not an array.
    - grad_l: the gradient of l(w; D). 
        It is supposed to be a Python function, not an array.
    - w0: an n-dimensional array which represents the initial iterate. 
        By default, it should be randomly sampled.
    - D: a tuple (x, y) that contains the two arrays x and 
            y, where x is the input data, y is the output data.
    - batch_size: an integer. 
        The dimension of each batch. Should be a divisor of the number of datapoints.
    - n_epochs: an integer. The number of epochs you want to 
                repeat the iterations for.
    
    Output:
    - w: an array that contains the value of w_k FOR EACH   
        iterate w_k (not only the last one).
    - f_val: an array that contains the value of l(w_k; D),
         computed ONLY after each epoch.
    - grads: an array that contains the value of grad_l(w_k; D), 
        computed ONLY after each epoch.
    - err: an array that contains the value of ||grad_l(w_k; D)||_2, 
        computed ONLY after each epoch.

In [2]:
def SGD(l, grad_l, w0, D, batch_size, n_epochs, alpha=1e-3):
    # alpha is the (fixed) step size: an extra optional argument that is
    # not part of the specification and must be tuned by hand.

    # D -> (X, Y) where X is d x N and Y has N entries
    X, Y = D[0], D[1]
    d, N = X.shape

    # Number of batch iterations in one epoch (the last batch may be smaller)
    n_batch_per_epoch = math.ceil(N / batch_size)
    tot_batch = n_batch_per_epoch * n_epochs

    # w stores EVERY iterate; f_val, grads and err are stored once per epoch
    w = np.zeros((tot_batch + 1, ) + w0.shape)
    f_val = np.zeros((n_epochs, ))
    grads = np.zeros((n_epochs, ) + w0.shape)
    err = np.zeros((n_epochs, ))

    w[0] = w0
    k = 0

    # For each epoch
    for epoch in range(n_epochs):

        # Shuffle again (differently) at every epoch: a new permutation of
        # the indices gives sampling without replacement
        idx = np.random.permutation(N)

        for b in range(n_batch_per_epoch):

            # Sample M from D
            batch_idx = idx[b * batch_size:(b + 1) * batch_size]
            Mx = X[:, batch_idx]   # shape d x batch_size
            My = Y[batch_idx]      # batch_size entries
            M = (Mx, My)

            # Update w using the gradient computed on the batch only
            w[k + 1] = w[k] - alpha * grad_l(w[k], M)
            k += 1

        # End of the epoch: evaluate loss and gradient on the full dataset
        f_val[epoch] = l(w[k], D)
        grads[epoch] = grad_l(w[k], D)
        err[epoch] = np.linalg.norm(grads[epoch])

    return w, f_val, grads, err


#REMEMBER: in SGD, w0 should be chosen randomly (sampled from a Gaussian)


## Prepare Dataset and Loss

- To test the script above, consider the MNIST dataset we used in the previous laboratories, and do the following:

1. From the dataset, select only two digits. It would be great to let the user input the two digits to select.
2. Apply the same procedure as in the previous homework to obtain the training and test sets from $(X, Y)$, choosing the $N_{train}$ you prefer.
3. Implement a logistic regression classifier as described in the corresponding post on my website.

In [5]:
#Load data into memory
data = pd.read_csv('./data.csv')

#Convert data into a matrix
data = np.array(data)   

X0 = data[:, 1:].T
Y0 = data[:, 0]

def choose_labels(labels):
    idx = [index for index, elem in enumerate(Y0) if elem in labels]

    X = X0[:, idx]     
    Y = Y0[idx]

    return X, Y

def split_data(X, Y, Ntrain):

    d, N = X.shape

    idx = np.arange(N)
    np.random.shuffle(idx)

    train_idx = idx[:Ntrain]
    test_idx = idx[Ntrain:]

    Xtrain = X[:, train_idx]
    Ytrain = Y[train_idx]
    
    Xtest = X[:, test_idx]
    Ytest = Y[test_idx]

    return (Xtrain, Ytrain), (Xtest, Ytest)



In [None]:
# def logistic_regression_classificator(X, Y, w):
#     # Create Xhat
#     X_hat = np.concatenate((np.ones((N,1)), X), axis=0)

#     fw_Xhat = np.zeros((N, ))
#     for i in range(N):
#         fw_Xhat = f(w[i], X[:, i])

#     l_wD = ell(w, X, Y)





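#Build the model

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# Compute the value of f for every column of xhat (shape d x N)
def f(w, xhat):
    return sigmoid(xhat.T @ w)

#--
#Training

# Sum of squared errors (no 1/N factor, to match the sum form of l(w; D))
def MSE(y, y1):
    return (np.linalg.norm(y - y1)**2)

# Value of the loss on the dataset D = (X, Y)
def loss_function(w, D):
    X, Y = D[0], np.ravel(D[1])   # accept Y of shape (N,) or (N, 1)
    y = f(w, X)
    return MSE(y, Y)

# Value of the gradient on the dataset D = (X, Y)
def grad_loss_function(w, D):
    X, Y = D[0], np.ravel(D[1])
    y = f(w, X)
    # Chain rule on sum_i (y_i - Y_i)^2, with dy_i/dw = y_i (1 - y_i) x_i
    return 2 * X @ ((y - Y) * y * (1 - y))


#Prediction over new data
def predict(w, X, threshold=0.5):
    y = f(w, X)
    # One 0/1 label per column of X
    return (y > threshold).astype(int)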
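
For the model $f_w(x) = \sigma(x^T w)$ with a sum-of-squares loss, the chain rule gives $\nabla_w l(w; \mathbb{D}) = 2 X \left[(f - Y) \odot f \odot (1 - f)\right]$, where $f$ is the vector of model outputs and $\odot$ is the elementwise product. A gradient formula like this is easy to get wrong, so the sketch below compares it with central finite differences on random data. The names `loss`, `grad_loss`, the sizes and the step `eps` are assumptions made for this check only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, X, Y):
    # Sum of squared errors between sigmoid(X^T w) and Y
    return np.sum((sigmoid(X.T @ w) - Y) ** 2)

def grad_loss(w, X, Y):
    # Analytical gradient obtained with the chain rule
    y = sigmoid(X.T @ w)
    return 2 * X @ ((y - Y) * y * (1 - y))

rng = np.random.default_rng(0)
d, N = 4, 20
X = rng.normal(size=(d, N))
Y = rng.integers(0, 2, size=N).astype(float)
w = rng.normal(size=d)

# Central finite differences, one coordinate at a time
eps = 1e-6
num_grad = np.zeros(d)
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    num_grad[j] = (loss(w + e, X, Y) - loss(w - e, X, Y)) / (2 * eps)

max_diff = np.max(np.abs(num_grad - grad_loss(w, X, Y)))
print(max_diff)
```

A value of `max_diff` close to zero suggests that the analytical gradient and the loss are consistent with each other.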


In [None]:
#Model setup and running
def log(X, Y, lr=0.01, n_iter=10000):
    # X has shape d x N, Y has N entries with values in {0, 1}
    d, N = X.shape
    Y = np.ravel(Y)
    W = np.random.uniform(0, 1, size=d)
    b = 0.1

    for i in range(n_iter):
        z = (X.T @ W) + b
        y_pred = sigmoid(z)

        # Gradients of the mean cross-entropy loss with respect to W and b
        gradient_W = X @ (y_pred - Y) / N
        gradient_b = np.mean(y_pred - Y)

        W = W - lr * gradient_W
        b = b - lr * gradient_b

    return W, b

#Test the performance of the model
#(Xtrain, Ytrain, Xtest, Ytest come from the data-preparation cell)
W, b = log(Xtrain, Ytrain)
r = sigmoid(Xtest.T @ W + b)
pred = (r > 0.5).astype(int)

In [None]:
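def logi_reg(X, Y):

    d, N = X.shape

    # Initial iterate sampled from a Gaussian (the small standard deviation
    # is a choice made here to avoid saturating the sigmoid at the start)
    w0 = np.random.normal(0, 1e-3, (d, ))
    D = (X, Y)
    batch_size = 5
    n_epochs = 50

    w, f_val, grads, err = SGD(loss_function, grad_loss_function, w0, D, batch_size, n_epochs)

    # The last iterate is the trained weight vector
    w_star = w[-1]

    return w_star, predict(w_star, X)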

- Test the logistic regression classifier for different digits and different training set sizes.

In [None]:
label1 = int(input("Choose a digit: "))
label2 = int(input("Choose another digit: "))
classes = [label1, label2]
X, Y = choose_labels(classes)  # X is already d x N, no transpose needed

# Map the two chosen digits to the binary labels 0 (label1) and 1 (label2)
Y = (Y == label2).astype(int)

# Check the shape
print(f"Shape of X: {X.shape}")
print(f"Shape of Y: {Y.shape}")

# Memorize the shape
d, N = X.shape

Ntrain = 8000
(Xtrain, Ytrain), (Xtest, Ytest) = split_data(X, Y, Ntrain)




- The training procedure will end up with a set of optimal parameters $w^*$. Compare $w^*$ when computed with GD and SGD, for different digits and different training set sizes.

- Comment on the obtained results (in terms of the accuracy of the learned classifier).
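
A comparison of this kind could be set up along the lines of the sketch below. To stay self-contained it uses two synthetic Gaussian blobs as a stand-in for the two digit classes, so its numbers say nothing about MNIST. The helper names, the step size `alpha`, the batch size and the number of epochs are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two digit classes: two Gaussian blobs in R^2
d, N = 2, 200
X = np.hstack([rng.normal(-1.0, 1.0, (d, N // 2)),
               rng.normal(+1.0, 1.0, (d, N // 2))])
Y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def grad(w, X, Y):
    # Gradient of sum_i (sigmoid(w^T x_i) - y_i)^2
    y = sigmoid(X.T @ w)
    return 2 * X @ ((y - Y) * y * (1 - y))

def accuracy(w, X, Y):
    # Percentage of samples whose thresholded output matches the label
    pred = (sigmoid(X.T @ w) > 0.5).astype(int)
    return np.mean(pred == Y) * 100

alpha, n_epochs, batch_size = 0.01, 50, 20
w0 = rng.normal(0, 0.1, d)

# Full-batch GD: one update per epoch
w_gd = w0.copy()
for _ in range(n_epochs):
    w_gd = w_gd - alpha * grad(w_gd, X, Y)

# SGD: N / batch_size updates per epoch, batches drawn without replacement
w_sgd = w0.copy()
for _ in range(n_epochs):
    perm = rng.permutation(N)
    for s in range(0, N, batch_size):
        idx = perm[s:s + batch_size]
        w_sgd = w_sgd - alpha * grad(w_sgd, X[:, idx], Y[idx])

print(f"GD accuracy:  {accuracy(w_gd, X, Y):.1f}%")
print(f"SGD accuracy: {accuracy(w_sgd, X, Y):.1f}%")
```

For the same number of epochs SGD performs many more updates than GD, which is worth keeping in mind when comparing the two sets of weights.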

In [None]:
def accuracy_check(pred,actual):
    c = 0
    for i in range(len(actual)):
        if(pred[i]==actual[i]):
            c+=1
    acc = (c/len(actual))*100
    return acc

- _Hard (optional)_: Try to implement the 3-digit logistic regression classifier and compare its accuracy with the accuracy of the LDA and PCA classifiers.