
# Exercises week 42
**October 9-13, 2023**

Date: **Deadline is Sunday October 22 at midnight**

You can hand in the exercises from week 41 and week 42 as one exercise and get a total score of two additional points.

# Overarching aims of the exercises this week

The aim of the exercises this week is to get started with implementing
gradient methods of relevance for project 2. The exercise this week is a simple
continuation from the previous week with the addition of automatic differentiation.
Everything you develop here will be used in project 2.

In order to get started, we will now replace the matrix inversion
algorithm in our standard ordinary least squares (OLS) and Ridge
regression codes (from project 1) with our own gradient descent (GD)
and stochastic gradient descent (SGD) codes.  You can use the Franke
function or the terrain data from
project 1. **However, we recommend using a simpler function like**
$f(x)=a_0+a_1x+a_2x^2$ or higher-order one-dimensional polynomials.
You can then test your final codes against, for example, the Franke
function. Automatic differentiation will be discussed next week.
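
As a rough illustration of the replacement described above, the sketch below fits $f(x)=a_0+a_1x+a_2x^2$ with plain gradient descent and compares the result with the closed-form OLS solution. The coefficients, noise level, seed, number of iterations and the choice of learning rate (the inverse of the largest eigenvalue of the Hessian $\frac{2}{n}X^TX$) are assumptions made for this example only; the exercise asks you to tune the learning rate yourself.

```python
import numpy as np

rng = np.random.default_rng(2023)            # hypothetical seed
n = 100
x = np.linspace(0, 1, n).reshape(-1, 1)
a0, a1, a2 = 1.0, 2.0, 3.0                   # hypothetical coefficients
y = a0 + a1 * x + a2 * x**2 + 0.05 * rng.normal(size=x.shape)

# Design matrix with columns 1, x, x^2
X = np.hstack([np.ones_like(x), x, x**2])

# Reference: closed-form OLS via the pseudoinverse
theta_ols = np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Fixed learning rate chosen from the largest Hessian eigenvalue (an assumption)
hessian = (2.0 / n) * X.T @ X
eta = 1.0 / np.max(np.linalg.eigvalsh(hessian))

# Plain gradient descent on the mean squared error
theta_gd = rng.normal(size=(3, 1))
for _ in range(20_000):
    gradient = (2.0 / n) * X.T @ (X @ theta_gd - y)
    theta_gd -= eta * gradient

print("OLS:", theta_ols.ravel())
print("GD: ", theta_gd.ravel())
```

The many iterations are needed because the polynomial design matrix is ill-conditioned, which is one reason momentum and adaptive learning rates are worth studying.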

You should include the following elements in your analysis of the GD and SGD codes:

1. A plain gradient descent with a fixed learning rate (you will need to tune it) using automatic differentiation. Compare this with the analytical expression of the gradients you obtained last week. Feel free to use the Python packages **Autograd** or **JAX**. You can use the examples from last week.

2. Add momentum to the plain GD code and compare convergence with a fixed learning rate (you may need to tune the learning rate). Compare this with the analytical expression of the gradients you obtained last week.

3. Repeat these steps for stochastic gradient descent with mini batches and a given number of epochs. Use a tunable learning rate as discussed in the lectures from week 39. Discuss the results as functions of the various parameters (size of batches, number of epochs, etc.).

4. Implement the Adagrad method in order to tune the learning rate. Do this with and without momentum for plain gradient descent and SGD using automatic differentiation.

5. Add RMSprop and Adam to your library of methods for tuning the learning rate, again using automatic differentiation.
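
To make item 2 concrete, here is a minimal sketch that runs gradient descent with and without momentum on the same noise-free quadratic polynomial and compares the final cost after an equal number of iterations. The function name `gd_momentum`, the data, the learning rate, the momentum value 0.9 and the iteration count are assumptions for illustration; the gradient is the analytical one for the mean squared error.

```python
import numpy as np

def gd_momentum(X, y, eta, momentum, n_iter):
    """Gradient descent on the MSE; momentum=0.0 gives plain GD."""
    n, p = X.shape
    theta = np.zeros((p, 1))
    change = np.zeros((p, 1))
    costs = np.zeros(n_iter)
    for i in range(n_iter):
        gradient = (2.0 / n) * X.T @ (X @ theta - y)
        # The previous update is added back, scaled by the momentum parameter
        change = eta * gradient + momentum * change
        theta -= change
        costs[i] = np.mean((X @ theta - y) ** 2)
    return theta, costs

x = np.linspace(0, 1, 100).reshape(-1, 1)
X = np.hstack([np.ones_like(x), x, x**2])
y = 1.0 + 2.0 * x + 3.0 * x**2               # hypothetical noise-free target

# Learning rate from the largest Hessian eigenvalue (an assumption)
hessian = (2.0 / len(x)) * X.T @ X
eta = 1.0 / np.max(np.linalg.eigvalsh(hessian))

_, cost_plain = gd_momentum(X, y, eta, momentum=0.0, n_iter=500)
_, cost_momentum = gd_momentum(X, y, eta, momentum=0.9, n_iter=500)

print("final cost, plain GD:     ", cost_plain[-1])
print("final cost, with momentum:", cost_momentum[-1])
```

Plotting `cost_plain` and `cost_momentum` on a logarithmic axis is one way to compare the convergence of the two variants.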

The lecture notes from weeks 39 and 40 contain more information and code examples. Feel free to use these examples.

We recommend reading chapter 8 on optimization from the textbook of [Goodfellow, Bengio and Courville](https://www.deeplearningbook.org/). This chapter contains many useful insights and discussions on the optimization part of machine learning.

In [25]:
import autograd.numpy as np  # Autograd's NumPy wrapper; a separate "import numpy as np" would only be shadowed
from autograd import grad
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

""" Data """
def FrankeFunction(x,y, noise, alpha, seed):
    term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
    term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
    term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
    term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
    if noise:
        np.random.seed(seed)
        return term1 + term2 + term3 + term4 + alpha*np.random.normal(0, 1, x.shape)
    else:
        return term1 + term2 + term3 + term4

def data_FF(noise=True, step_size=0.05, alpha=0.05, reshape=True):
    x = np.arange(0, 1, step_size)
    y = np.arange(0, 1, step_size)

    X, Y = np.meshgrid(x, y)
    if reshape:
        x = X.flatten().reshape(-1, 1)
        y = Y.flatten().reshape(-1, 1)
        Z = FrankeFunction(X, Y, noise, alpha, seed=3155)
        z = Z.flatten().reshape(-1, 1)
        return x, y, z
    if not reshape:
        Z = FrankeFunction(X, Y, noise, alpha, seed=3155)
        return X, Y, Z


def y_func(x, exact_theta):
    y = 0
    for i, theta in enumerate(exact_theta):
        y += theta*x**i
    return y

def polynomial(coeff, n, noise, alpha, seed):
    X = np.linspace(0, 1, n).reshape(-1, 1)    
    Y_true = y_func(X, coeff)
    if noise:
        np.random.seed(seed)
        Y_noise = (Y_true + alpha*np.random.normal(0, 1, X.shape))
    else:
        Y_noise = Y_true
    return X, Y_noise, Y_true

def xor():
    X = np.reshape([[0, 0], [0, 1], [1, 0], [1, 1]], (4, 2))
    Y = np.reshape([[0], [1], [1], [0]], (4, 1))
    return X, Y



def scaling(X):
    scaler = StandardScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    X[:,0] = 1
    return X

def designMatrix_1D(x, polygrad):
    n = len(x)
    X = np.ones((n,polygrad))
    for i in range(1,polygrad):
        X[:,i] = (x**i).ravel()
    return X

def learning_schedule(epoch, init_LR, decay):
    return init_LR * 1/(1 + decay*epoch)



""" Task a) """
def OLS(X_train, X_test, y):
    """ OLS """
    XT_X = X_train.T @ X_train
    theta_linreg = np.linalg.pinv(XT_X) @ (X_train.T @ y)
    y_predict_OLS = X_test @ theta_linreg
    return y_predict_OLS, theta_linreg

def cost_theta(theta, X, y, lmb):
    n = X.shape[0]
    return (1.0/n)*np.sum((y - (X @ theta))**2) + (lmb/2) * (theta.T@theta)

def cost_theta_diff(theta, X, y, lmb):
    n = X.shape[0]
    return lmb*theta - 2*X.T @(y - X@theta)/n

def auto_gradient(theta, X, y, lmb):
    gradient = grad(cost_theta)
    return gradient(theta, X, y, lmb)


""" NN """
def cost_w_b(w, b, X, y, lmb):
    n = X.shape[0]
    return (1.0/n)*np.sum((y - (X @ w + b))**2) + (lmb/2)*(w.T@w)

def cost_w_b_diff(w, b, X, y, lmb):
    n = X.shape[0]
    grad_w = lmb*w + np.sum(2*X*(b + w*X - y))/n
    grad_b = (2/n)*np.sum(b + w*X - y)
    return grad_w, grad_b

def cost(z, z_tilde, lmb, w):
    return mean_squared_error(z, z_tilde) + (lmb/2)*(w.T@w)

def cost_diff(z, z_tilde, lmb, w):
    return 2 * (z_tilde - z) / np.size(z_tilde) # + lmb*w

def MSE(z, z_tilde):
    return mean_squared_error(z, z_tilde)

def MSE_diff(z, z_tilde):
    return 2 * (z_tilde - z) / np.size(z_tilde)

def auto_gradient_NN(w, b, X, y, lmb):
    w_gradient = grad(cost_w_b,0)
    b_gradient = grad(cost_w_b,1)
    return w_gradient(w, b, X, y, lmb), b_gradient(w, b, X, y, lmb)

def r2(z, z_tilde):
    return r2_score(z, z_tilde)

# def Cost_GD(w, b, X, y, lmb):
#     n = X.shape[0]
#     return (1.0/n)*np.sum((y - (X @ w + b))**2) + (lmb/2) * w**2

# def Cost_GD_diff(w, b, X, y, lmb):
#     n = X.shape[0]
#     grad_w = lmb*w + (2*X*(b + w*X - y))/n # (2/n)*X*(b + w*X - y) # 2.0/n * X.T @ ((X @ w + b) - y)
#     grad_b = (2/n)*(b + w*X - y)
#     return np.mean(grad_w, axis=1), np.mean(grad_b, axis=1)


""" Logistic Regression """

def accuracy(y_pred,y):
    """Logistic:"""
    accuracy = np.mean(y_pred.flatten()==y.flatten())
    return accuracy

def cross_entropy_cost(y_train, output, lmb, weights):
    """the binary cross entropy cost function
    lmb is L2 reg. w is for the regularization of NN"""
    eps =1e-15 # small epsilon for numeric stability.
    p = output #prediction.
    y = y_train
    w= weights
    cost = - np.sum(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))+ (lmb/2)* w.T @ w
    cost = cost/y.shape[0]
    return cost

def diff_cross_entropy_cost(y_train, output, lmb, weights):
    """The derivative of the cost function for cross entropy NN classifier.
    Note that L2 regularisation happens in the layer.py backprop"""
    p = output
    y = y_train
    w = weights

    grad_cross_entropi =  (p - y) #Simplified for first layer backwards.
    mean = np.mean(grad_cross_entropi,axis =1)
    gradient_CE = np.expand_dims(mean, axis=1)
    return gradient_CE

def grad_cost_func_logregression(beta, X, y, lmb):
    """The derivative of the cost function for logistic regression"""
    p = 1/(1+np.exp(-X@beta))
    grad_cross_entropi = - X.T @ (y - p) +lmb*beta
    mean = np.mean(grad_cross_entropi,axis =1)
    gradient_CE = np.expand_dims(mean, axis=1)
    return gradient_CE
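
The analytical Ridge gradient in `cost_theta_diff` can also be sanity-checked without Autograd by comparing it with central finite differences of the cost. The sketch below restates the two formulas from `cost_theta` and `cost_theta_diff` under the hypothetical names `ridge_cost` and `ridge_gradient`; the helper `finite_difference_gradient`, the random test data and the step size `h` are assumptions for this example.

```python
import numpy as np

def ridge_cost(theta, X, y, lmb):
    """MSE plus (lmb/2)*||theta||^2, same formula as cost_theta."""
    n = X.shape[0]
    return np.sum((y - X @ theta) ** 2) / n + 0.5 * lmb * np.sum(theta**2)

def ridge_gradient(theta, X, y, lmb):
    """Analytical gradient, same formula as cost_theta_diff."""
    n = X.shape[0]
    return lmb * theta - 2.0 * X.T @ (y - X @ theta) / n

def finite_difference_gradient(f, theta, h=1e-6):
    """Central differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[j] = h
        grad.flat[j] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

rng = np.random.default_rng(0)               # hypothetical test data
X = rng.normal(size=(50, 3))
y = rng.normal(size=(50, 1))
theta = rng.normal(size=(3, 1))
lmb = 0.1

analytic = ridge_gradient(theta, X, y, lmb)
numeric = finite_difference_gradient(lambda t: ridge_cost(t, X, y, lmb), theta)
max_diff = np.max(np.abs(analytic - numeric))
print("largest deviation:", max_diff)
```

The same comparison can be made between `auto_gradient` and `cost_theta_diff`; a deviation far above rounding error usually points to a sign or scaling mistake in the analytical expression.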

In [26]:
class Optimizer:
    """A super class for three optimizers."""
    def __init__(self, eta):
        self.eta = eta
        self.delta = 1e-7 #to avoid division by zero.

    def __call__(self,gradients):
        raise TypeError("You need to specify which Optimizer")

class Adagrad(Optimizer):
    def __call__(self,gradients, Giter):
        Giter += gradients @ gradients.T
        self.Ginverse = np.c_[self.eta/(self.delta + np.sqrt(np.diagonal(Giter)))]
        return np.multiply(self.Ginverse,gradients)

class RMSprop(Optimizer):
    def __call__(self,gradients, Giter):
        beta = 0.90 # Decay rate, following Géron's book.
        # Exponentially decaying average of the squared gradients, updated
        # in place so that the caller's Giter carries over to the next call.
        Giter[:] = beta*Giter + (1 - beta)*(gradients @ gradients.T)
        self.Ginverse = np.c_[self.eta/(self.delta + np.sqrt(np.diagonal(Giter)))]
        return np.multiply(self.Ginverse,gradients)

class Adam(Optimizer):
    """https://towardsdatascience.com/how-to-implement-an-adam-optimizer-from-scratch-76e7b217f1cc and
    Algorithm 8.7 (Adam) in Chapter 8 of Goodfellow, Bengio and Courville"""
    def __init__(self,eta):
        super().__init__(eta) # Optimizer stores these.
        self.m = 0
        self.s = 0
        self.t = 1
        self.beta_1 = 0.90 # Ref. the books by Géron and Goodfellow et al.
        self.beta_2 = 0.999 # Ref. the books by Géron and Goodfellow et al.

    def __call__(self, gradients):

        #Update of 1st and 2nd moment:
        m = (self.beta_1*self.m + (1 - self.beta_1)*gradients)
        s = (self.beta_2*self.s + (1 - self.beta_2)*gradients**2)

        #Bias correction:
        self.mHat = m/(1 - self.beta_1**self.t) # with time step t.
        self.sHat = s/(1 - self.beta_2**self.t)

        #Compute update:
        self.Ginverse = self.eta/(self.delta + np.sqrt(self.sHat))
        self.m = m
        self.s = s
        self.t += 1
        
        return np.multiply(self.Ginverse,self.mHat)
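
For comparison, the three update rules can also be written elementwise. The `Adagrad` and `RMSprop` classes above accumulate the outer product of the gradient and then use only its diagonal, which equals the elementwise squared gradient used here. The function names, the `state` dictionaries, the learning rates and the small quadratic test function are assumptions for this sketch and are not part of the classes above.

```python
import numpy as np

def adagrad_step(grad, state, eta=0.1, delta=1e-7):
    # Running sum of squared gradients
    state["G"] = state.get("G", 0.0) + grad**2
    return eta * grad / (delta + np.sqrt(state["G"]))

def rmsprop_step(grad, state, eta=0.01, beta=0.9, delta=1e-7):
    # Exponentially decaying average of squared gradients
    state["G"] = beta * state.get("G", 0.0) + (1 - beta) * grad**2
    return eta * grad / (delta + np.sqrt(state["G"]))

def adam_step(grad, state, eta=0.01, beta1=0.9, beta2=0.999, delta=1e-7):
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * grad      # first moment
    s = beta2 * state.get("s", 0.0) + (1 - beta2) * grad**2   # second moment
    state.update(t=t, m=m, s=s)
    m_hat = m / (1 - beta1**t)                                # bias correction
    s_hat = s / (1 - beta2**t)
    return eta * m_hat / (delta + np.sqrt(s_hat))

# Hypothetical test problem: cost(theta) = 0.5 * sum(curvature * theta^2)
curvature = np.array([1.0, 10.0])

def cost(theta):
    return 0.5 * np.sum(curvature * theta**2)

def gradient(theta):
    return curvature * theta

initial_cost = cost(np.array([1.0, 1.0]))
final_costs = {}
for name, step in [("Adagrad", adagrad_step),
                   ("RMSprop", rmsprop_step),
                   ("Adam", adam_step)]:
    theta = np.array([1.0, 1.0])
    state = {}
    for _ in range(200):
        theta = theta - step(gradient(theta), state)
    final_costs[name] = cost(theta)
    print(name, final_costs[name])
```

Because each parameter is rescaled by its own gradient history, the two directions make similar progress even though their curvatures differ by a factor of ten.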

In [27]:

def GD(X_train, X_test, y_train, y_test, Gradient_method, Optimizer_method, Niterations, init_LR, decay, momentum, seed, lmb):

    """ Gradient Descent """
    np.random.seed(seed)
    theta = np.random.randn(np.shape(X_train)[1],1) # Initial thetas/betas

    change = 0.0
    mse_test = np.zeros(Niterations)
    mse_train = np.zeros(Niterations)
    
    """ Optimizer method """
    if Optimizer_method == 'Adagrad':
        optim = Adagrad(init_LR)
    if Optimizer_method == 'RMSprop':
        optim = RMSprop(init_LR)
    if Optimizer_method == 'Adam':
        optim = Adam(init_LR)

    # Gradient descent:
    Giter = np.zeros(shape=(X_train.shape[1],X_train.shape[1]))
    for i in range(Niterations):

        """ Gradient method """
        if Gradient_method == 'auto':  
            gradients =  auto_gradient(theta, X_train, y_train, lmb) # Autograd
        if Gradient_method == 'anal':
            gradients = cost_theta_diff(theta, X_train, y_train, lmb) # Analytical

        """ Optimizer method """
        if Optimizer_method == 'Adagrad':
            update = optim(gradients, Giter)#uses class
            theta -= update

        if Optimizer_method == 'RMSprop':
            update = optim(gradients, Giter)#uses class
            theta -= update

        if Optimizer_method == 'Adam':
            update = optim(gradients)
            theta -= update

        if Optimizer_method == 'momentum':
            eta = learning_schedule(i, init_LR, decay) # LR
            update = eta * gradients + momentum * change # Update to the thetas

            theta -= update
            change = update # Update the amount the momentum gets added

        y_predict_GD_test = X_test @ theta
        mse_test[i] = cost(y_test, y_predict_GD_test, lmb, theta)

        y_predict_GD_train = X_train @ theta
        mse_train[i] = cost(y_train, y_predict_GD_train, lmb, theta)

    y_predict_GD = X_test @ theta

    return y_predict_GD, theta, mse_test, mse_train

def SGD(X_train, X_test, y_train, y_test, Optimizer_method, Gradient_method, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, lmb):
    """ Stochastic Gradient Descent """

    np.random.seed(seed)
    theta = np.random.randn(np.shape(X_train)[1],1) # Initial thetas/betas

    mse_test = np.zeros(n_epochs*n_minibatches)
    mse_train = np.zeros(n_epochs*n_minibatches)

    count = 0
    change = 0.0

    """ Optimizer method """
    if Optimizer_method == 'Adagrad':
        optim = Adagrad(init_LR)
    if Optimizer_method == 'RMSprop':
        optim = RMSprop(init_LR)
    if Optimizer_method == 'Adam':
        optim = Adam(init_LR)

    for epoch in range(n_epochs):
        Giter = np.zeros(shape=(X_train.shape[1],X_train.shape[1]))
        for batch in range(n_minibatches):

            random_index = minibatch_size*np.random.randint(n_minibatches)
            X_batch = X_train[random_index:random_index+minibatch_size]
            y_batch = y_train[random_index:random_index+minibatch_size]


            """ Gradient method """
            if Gradient_method == 'auto':  
                gradients =  auto_gradient(theta, X_batch, y_batch, lmb) # Autograd
            if Gradient_method == 'anal':
                gradients = cost_theta_diff(theta, X_batch, y_batch, lmb) # Analytical

            """ Optimizer method """
            if Optimizer_method == 'Adagrad':
                update = optim(gradients, Giter)#uses class
                theta -= update

            if Optimizer_method == 'RMSprop':
                update = optim(gradients, Giter)#uses class
                theta -= update

            if Optimizer_method == 'Adam':
                update = optim(gradients)
                theta -= update

            if Optimizer_method == 'momentum':
                eta = learning_schedule(epoch, init_LR, decay) # LR
                update = eta * gradients + momentum * change # Update to the thetas
                theta -= update
                change = update


            y_predict_SGD_test = X_test @ theta
            mse_test[count] = cost(y_test, y_predict_SGD_test, lmb, theta)

            y_predict_SGD_train = X_train @ theta
            mse_train[count] = cost(y_train, y_predict_SGD_train, lmb, theta)
            count += 1

    y_predict_GD = X_test @ theta
    return y_predict_GD, theta, mse_test, mse_train
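
The `SGD` function above draws a random contiguous block for every minibatch, so some samples may be used several times in an epoch while others are skipped. A common alternative, sketched here under stated assumptions, shuffles the sample indices once per epoch so that every sample is visited exactly once, combined with the same time-based decay as `learning_schedule`. The function name `sgd_shuffled`, the noise-free linear target and all hyperparameter values are hypothetical choices for this example.

```python
import numpy as np

def learning_schedule(epoch, init_lr, decay):
    """Time-based decay: init_lr / (1 + decay * epoch)."""
    return init_lr / (1 + decay * epoch)

def sgd_shuffled(X, y, n_epochs=300, batch_size=10,
                 init_lr=0.1, decay=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros((p, 1))
    for epoch in range(n_epochs):
        eta = learning_schedule(epoch, init_lr, decay)
        order = rng.permutation(n)           # new sample order every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            # Analytical MSE gradient on the minibatch
            gradient = (2.0 / len(idx)) * X_b.T @ (X_b @ theta - y_b)
            theta -= eta * gradient
    return theta

x = np.linspace(0, 1, 100).reshape(-1, 1)
X = np.hstack([np.ones_like(x), x])
theta_true = np.array([[1.0], [2.0]])        # hypothetical parameters
y = X @ theta_true                           # noise-free data

theta_sgd = sgd_shuffled(X, y)
print("SGD estimate:", theta_sgd.ravel())
```

With noisy data the estimate would fluctuate around the OLS solution with a spread that shrinks as the learning rate decays, which is one motivation for using a schedule.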

In [28]:
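import autograd.numpy as np  # keep Autograd's wrapper: in a shared notebook namespace, plain numpy here would rebind np for the cells above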
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
# from tqdm import tqdm



""" PLOTS: """
plt.rcParams.update({
    "text.usetex": True,
    "font.family": "serif",
    "font.serif": ["ComputerModern"]})

""" (1/8) Heatmap of GD to find best lambda and model complexity """
def find_lambda_DG(x_train, x_test, y_train, y_test, x, y_true, G_M, lambda_min, lambda_max, nlambdas, max_polydeg, plot, n_epochs, seed):

    lambdas = np.logspace(lambda_min, lambda_max, nlambdas)
    polydeg = np.arange(max_polydeg)
    cost_lambda_degree = np.empty((nlambdas, max_polydeg))

    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.0                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.0                          # Momentum value for GD.

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[3] # Choose the optimization method

    for d_idx, deg in enumerate(polydeg):

        X_train = designMatrix_1D(x_train, deg + 1)
        X_test = designMatrix_1D(x_test, deg + 1)

        for l_idx, lmb in enumerate(lambdas):
            y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train, X_test, y_train, y_test, G_M, O_M, n_epochs, init_LR, decay, momentum, seed, lmb)
            cost_lambda_degree[l_idx, d_idx] = test_cost_GD[-1]
    
    index = np.argwhere(cost_lambda_degree == np.min(cost_lambda_degree))
    best_poly_deg_cost = polydeg[index[0,1]]
    best_lambda_cost = lambdas[index[0,0]]

    print(f'The lowest cost with GD was achieved at polynomial degree = {best_poly_deg_cost}, and with lambda = {best_lambda_cost}.')

    if plot:
        fig, ax = plt.subplots(figsize=(14,8))
        plt.rcParams.update({'font.size': 26})
        sns.heatmap(cost_lambda_degree[:,1:], cmap="RdYlGn_r", 
        annot=True, annot_kws={"size": 20},
        fmt="1.4f", linewidths=1, linecolor=(30/255,30/255,30/255,1),
        cbar_kws={"orientation": "horizontal", "shrink":0.8, "aspect":40, "label":r"Cost", "pad":0.05})
        x_idx = np.arange(max_polydeg-1) + 0.5
        y_idx = np.arange(nlambdas) + 0.5
        ax.set_xticks(x_idx, [deg for deg in polydeg[1:]], fontsize='medium')
        ax.set_yticks(y_idx, [float(f'{lam:1.1E}') for lam in lambdas], rotation=0, fontsize='medium')
        ax.set_xlabel(r"Polynomial degree", labelpad=10, fontsize='medium')
        ax.set_ylabel(r'$\lambda$', labelpad=10, fontsize='medium') # the tick labels show lambda itself, not its logarithm
        ax.set_title(r'\bf{Cost Heatmap for plain GD}', pad=15)
        ax.xaxis.tick_top()
        ax.xaxis.set_label_position('top')
        plt.tight_layout()
        plt.savefig('cost_heatmap_plain_GD_LR_0_1.png', dpi=150)
        plt.clf()

        plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
        axisbelow=True, grid=True)
        plt.rc('grid', color='w', linestyle='solid')
        plt.rc('lines', linewidth=2)

        X_train = designMatrix_1D(x_train, best_poly_deg_cost + 1)
        X_test = designMatrix_1D(x_test, best_poly_deg_cost + 1)

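        # Use the best lambda found above (lmb would be the last value left over from the grid loop)
        y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train, X_test, y_train, y_test, G_M, O_M, n_epochs, init_LR, decay, momentum, seed, best_lambda_cost)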

        fig, ax = plt.subplots(figsize=(16,9))
        fig.subplots_adjust(bottom=0.22)
        """ Regression line plot """
        ax.scatter(x_test, y_predict_GD, c='limegreen', s=5, label=r'GD')
        # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
        ax.plot(x, y_true, zorder=100, c='black', label='True y')
        ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
        ax.set_title(r'\bf{Regression line plot for plain GD}', pad=15)
        ax.set_xlabel(r'$x$', labelpad=10)
        ax.set_ylabel(r'$y$',  labelpad=10)
        ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
        string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}' # raw string keeps the LaTeX escape \_ intact
        plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
        plt.savefig('regression_line_plot_plain_GD.png', dpi=150)
        plt.clf()

        iters = np.arange(n_epochs)
        fig, ax = plt.subplots(figsize=(16,9))
        fig.subplots_adjust(bottom=0.22)
        ax.plot(iters, train_cost_GD, color='crimson', zorder=100, lw=2, label=r"Train Cost for GD") #zorder=0,
        ax.plot(iters, test_cost_GD, color='royalblue', lw=2, label=r"Test Cost for GD") #zorder=0,
        # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
        ax.set_xlabel(r"Iterations", labelpad=10)
        ax.set_ylabel(r"Cost", labelpad=10)
        ax.set_title(r"\bf{Cost as function of iterations for plain GD}", pad=15)
        ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
        ax.set_yscale('log')
        string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}' # raw string keeps the LaTeX escape \_ intact
        plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
        plt.savefig('cost_plot_plain_GD.png', dpi=150)
        plt.clf()

    return best_poly_deg_cost, best_lambda_cost

""" (2/8) Heatmap of SGD to find best lambda and model complexity """
def find_lambda_SDG(x_train, x_test, y_train, y_test, x, y_true, G_M, lambda_min, lambda_max, nlambdas, max_polydeg, plot, n_epochs, seed):

    lambdas = np.logspace(lambda_min, lambda_max, nlambdas)
    polydeg = np.arange(max_polydeg)
    cost_lambda_degree = np.empty((nlambdas, max_polydeg))

    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.0                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.0                          # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//20
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[3] # Choose the optimization method

    for d_idx, deg in enumerate(polydeg):

        X_train = designMatrix_1D(x_train, deg + 1)
        X_test = designMatrix_1D(x_test, deg + 1)

        for l_idx, lmb in enumerate(lambdas):
            y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train, X_test, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, lmb)
            cost_lambda_degree[l_idx, d_idx] = test_cost_SGD[-1]
    
    index = np.argwhere(cost_lambda_degree == np.min(cost_lambda_degree))
    best_poly_deg_cost = polydeg[index[0,1]]
    best_lambda_cost = lambdas[index[0,0]]

    print(f'The lowest cost with SGD was achieved at polynomial degree = {best_poly_deg_cost}, and with lambda = {best_lambda_cost}.')

    if plot:
        fig, ax = plt.subplots(figsize=(14,8))
        plt.rcParams.update({'font.size': 26})
        sns.heatmap(cost_lambda_degree[:,1:], cmap="RdYlGn_r", 
        annot=True, annot_kws={"size": 20},
        fmt="1.4f", linewidths=1, linecolor=(30/255,30/255,30/255,1),
        cbar_kws={"orientation": "horizontal", "shrink":0.8, "aspect":40, "label":r"Cost", "pad":0.05})
        x_idx = np.arange(max_polydeg-1) + 0.5
        y_idx = np.arange(nlambdas) + 0.5
        ax.set_xticks(x_idx, [deg for deg in polydeg[1:]], fontsize='medium')
        ax.set_yticks(y_idx, [float(f'{lam:1.1E}') for lam in lambdas], rotation=0, fontsize='medium')
        ax.set_xlabel(r"Polynomial degree", labelpad=10, fontsize='medium')
        ax.set_ylabel(r'$\lambda$', labelpad=10, fontsize='medium') # the tick labels show lambda itself, not its logarithm
        ax.set_title(r'\bf{Cost Heatmap for plain SGD}', pad=15)
        ax.xaxis.tick_top()
        ax.xaxis.set_label_position('top')
        plt.tight_layout()
        plt.savefig('cost_heatmap_plain_SGD_LR_0_1.png', dpi=150)
        plt.clf()

        plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
        axisbelow=True, grid=True)
        plt.rc('grid', color='w', linestyle='solid')
        plt.rc('lines', linewidth=2)

        X_train = designMatrix_1D(x_train, best_poly_deg_cost + 1)
        X_test = designMatrix_1D(x_test, best_poly_deg_cost + 1)

        y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train, X_test, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_cost)
        
        """ Regression line plot """
        fig, ax = plt.subplots(figsize=(16,9))
        fig.subplots_adjust(bottom=0.22)
        ax.scatter(x_test, y_predict_SGD, c='limegreen', s=5, label=r'SGD')
        # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
        ax.plot(x, y_true, zorder=100, c='black', label='True y')
        ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
        ax.set_title(r'\bf{Regression line plot for plain SGD}', pad=15)
        ax.set_xlabel(r'$x$', labelpad=10)
        ax.set_ylabel(r'$y$',  labelpad=10)
        ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
        string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}' # raw string keeps the LaTeX escape \_ intact
        plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
        plt.savefig('regression_line_plot_plain_SGD.png', dpi=150)
        plt.clf()

        iters = np.arange(n_epochs*n_minibatches)
        fig, ax = plt.subplots(figsize=(16,9))
        fig.subplots_adjust(bottom=0.22)
        ax.plot(iters, train_cost_SGD, color='crimson', zorder=100, lw=2, label=r"Train Cost for SGD") #zorder=0,
        ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Test Cost for SGD") #zorder=0,
        # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
        ax.set_xlabel(r"Iterations", labelpad=10)
        ax.set_ylabel(r"Cost", labelpad=10)
        ax.set_title(r"\bf{Cost as function of iterations for plain SGD}", pad=15)
        ax.set_yscale('log')
        ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
        string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}' # raw string keeps the LaTeX escape \_ intact
        plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
        plt.savefig('cost_plot_plain_SGD.png', dpi=150)
        plt.clf()

    return best_poly_deg_cost, best_lambda_cost

""" Func to find best minibatch size """
def find_minibatch_size_SDG(x_train, x_test, y_train, y_test, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed):

    minibatchs = [5, 10, 20, 30, 40, 50] # Candidate numbers of minibatches; the batch size is n_samples // value

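    # NOTE: designMatrix_1D takes the number of columns (polynomial degree + 1);
    # add 1 below if best_poly_deg_SGD is passed in as a degree.
    X_train = designMatrix_1D(x_train, best_poly_deg_SGD)
    X_test = designMatrix_1D(x_test, best_poly_deg_SGD)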

    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.0                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.0                          # Momentum value for GD.

    cost_minibatch = np.zeros(len(minibatchs))

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[3] # Choose the optimization method

    for mb_idx, size in enumerate(minibatchs):
        minibatch_size = np.shape(x_train)[0]//size
        n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

        y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train, X_test, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)
        cost_minibatch[mb_idx] = test_cost_SGD[-1]

    index = np.argwhere(cost_minibatch == np.min(cost_minibatch))
    best_minibatch_size = minibatchs[index[0,0]]

    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    print(f'The lowest cost with SGD was achieved with minibatch size = {minibatch_size}. Thus, {n_minibatches} minibatches.')
    return best_minibatch_size

""" (3/8) cost-plot for GD and SGD with a fixed learning rate using the chosen lambda """
def fixed_LR(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size):
    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.0                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.0                          # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[3] # Choose the optimization method

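    # NOTE: designMatrix_1D takes the number of columns (polynomial degree + 1);
    # add 1 in the four calls below if the best degrees are passed in as degrees.
    X_train_GD = designMatrix_1D(x_train, best_poly_deg_GD) # Train design matrix for GD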
    X_test_GD = designMatrix_1D(x_test, best_poly_deg_GD) # Test design matrix for GD

    X_train_SGD = designMatrix_1D(x_train, best_poly_deg_SGD) # Train design matrix for SGD
    X_test_SGD = designMatrix_1D(x_test, best_poly_deg_SGD) # Test design matrix for SGD

    y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)
    y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    iters = np.arange(n_epochs*n_minibatches)
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.plot(iters, test_cost_GD, color='crimson', lw=2, zorder=100, label=r"Cost for the test data - GD") #zorder=0,
    ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Cost for the test data - SGD") #zorder=0,
    # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for fixed LR}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}' # raw string keeps the LaTeX escape \_ intact
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_fixed_LR.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.scatter(x_test, y_predict_SGD, c='limegreen', s=5, label='SGD')
    ax.scatter(x_test, y_predict_GD, c='crimson', s=5,label='GD')
    # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for fixed LR}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}' # raw string keeps the LaTeX escape \_ intact
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_fixed_LR.png', dpi=150)
    plt.clf()

""" (4/8) cost-plot for GD and SGD with a fixed learning rate and momentum using the chosen lambda """
def fixed_LR_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size):
    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.0                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.9                          # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[3] # Choose the optimization method

    X_train_GD = designMatrix_1D(x_train, best_poly_deg_GD) # Train design matrix for GD 
    X_test_GD = designMatrix_1D(x_test, best_poly_deg_GD) # Test design matrix for GD

    X_train_SGD = designMatrix_1D(x_train, best_poly_deg_SGD) # Train design matrix for SGD
    X_test_SGD = designMatrix_1D(x_test, best_poly_deg_SGD) # Test design matrix for SGD

    y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)
    y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    iters = np.arange(n_epochs*n_minibatches)
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.plot(iters, test_cost_GD, color='crimson', lw=2, zorder=100, label=r"Cost for the test data - GD") #zorder=0,
    ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Cost for the test data - SGD") #zorder=0,
    # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for fixed LR and momentum}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_fixed_LR_momentum.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.scatter(x_test, y_predict_SGD, c='limegreen', s=5, label='SGD')
    ax.scatter(x_test, y_predict_GD, c='crimson', s=5,label='GD')
    # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for fixed LR and momentum}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_fixed_LR_momentum.png', dpi=150)
    plt.clf()

""" (5/8) cost-plot for GD and SGD with an adaptive learning rate and momentum using the chosen lambda """
def adaptive_LR_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size):
    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.01                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.9                        # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[3] # Choose the optimization method

    X_train_GD = designMatrix_1D(x_train, best_poly_deg_GD) # Train design matrix for GD 
    X_test_GD = designMatrix_1D(x_test, best_poly_deg_GD) # Test design matrix for GD

    X_train_SGD = designMatrix_1D(x_train, best_poly_deg_SGD) # Train design matrix for SGD
    X_test_SGD = designMatrix_1D(x_test, best_poly_deg_SGD) # Test design matrix for SGD

    y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)
    y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    iters = np.arange(n_epochs*n_minibatches)
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Cost for the test data - SGD") #zorder=0,
    ax.plot(iters, test_cost_GD, color='crimson', lw=2, zorder=100, label=r"Cost for the test data - GD") #zorder=0,
    # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for adaptive LR and momentum}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_adaptive_LR_momentum.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.scatter(x_test, y_predict_SGD, c='limegreen', s=5, label='SGD')
    ax.scatter(x_test, y_predict_GD, c='crimson', s=5,label='GD')
    # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for adaptive LR and momentum}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_adaptive_LR_momentum.png', dpi=150)
    plt.clf()

""" (6/8) cost-plot for GD and SGD with Adagrad with momentum using the chosen lambda """
def Adagrad_w_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size):
    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.01                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.9                        # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[0] # Choose the optimization method

    X_train_GD = designMatrix_1D(x_train, best_poly_deg_GD) # Train design matrix for GD 
    X_test_GD = designMatrix_1D(x_test, best_poly_deg_GD) # Test design matrix for GD

    X_train_SGD = designMatrix_1D(x_train, best_poly_deg_SGD) # Train design matrix for SGD
    X_test_SGD = designMatrix_1D(x_test, best_poly_deg_SGD) # Test design matrix for SGD

    y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)
    y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    iters = np.arange(n_epochs*n_minibatches)
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Cost for the test data - SGD") #zorder=0,
    ax.plot(iters, test_cost_GD, color='crimson', lw=2, zorder=100, label=r"Cost for the test data - GD") #zorder=0,
    # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for Adagrad w/ momentum}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_Adagrad_w_momentum.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.scatter(x_test, y_predict_SGD, c='limegreen', s=5, label='SGD')
    ax.scatter(x_test, y_predict_GD, c='crimson', s=5,label='GD')
    # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for Adagrad w/ momentum}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_Adagrad_w_momentum.png', dpi=150)
    plt.clf()
  
""" (7/8) cost-plot for GD and SGD with Adagrad without momentum using the chosen lambda """
def Adagrad_w_o_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size):
    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.01                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.0                        # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    """ Optimization method """
    # To get plain GD without any optimization choose 'momentum' with momentum value of 0
    Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    O_M = Optimizer_method[0] # Choose the optimization method

    X_train_GD = designMatrix_1D(x_train, best_poly_deg_GD) # Train design matrix for GD 
    X_test_GD = designMatrix_1D(x_test, best_poly_deg_GD) # Test design matrix for GD

    X_train_SGD = designMatrix_1D(x_train, best_poly_deg_SGD) # Train design matrix for SGD
    X_test_SGD = designMatrix_1D(x_test, best_poly_deg_SGD) # Test design matrix for SGD

    y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)
    y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    iters = np.arange(n_epochs*n_minibatches)
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Cost for the test data - SGD") #zorder=0,
    ax.plot(iters, test_cost_GD, color='crimson', lw=2, zorder=100, label=r"Cost for the test data - GD") #zorder=0,
    # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for Adagrad w/ no momentum}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_Adagrad_w_o_momentum.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    ax.scatter(x_test, y_predict_SGD, c='limegreen', s=5, label='SGD')
    ax.scatter(x_test, y_predict_GD, c='crimson', s=5,label='GD')
    # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for Adagrad w/ no momentum}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_Adagrad_w_o_momentum.png', dpi=150)
    plt.clf()

""" (8/8) cost-plot for either GD or SGD with Adagrad, Adam, RMSprop and with momentum using the chosen lambda """
def optim_plot_SGD(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size):

    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.01                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.9                        # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    X_train_SGD = designMatrix_1D(x_train, best_poly_deg_SGD) # Train design matrix for SGD
    X_test_SGD = designMatrix_1D(x_test, best_poly_deg_SGD) # Test design matrix for SGD


    # Optimizer_method = ['Adagrad', 'RMSprop', 'Adam','momentum']
    # colors_SGD = ['forestgreen', 'crimson', 'royalblue', 'darkorange']

    Optimizer_method = ['RMSprop', 'Adagrad', 'Adam', 'momentum']
    colors_SGD = ['crimson', 'forestgreen', 'royalblue', 'darkorange']
    zor = [5, 10, 20, 50]

    iters = np.arange(n_epochs*n_minibatches)

    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)

    for idx, O_M in enumerate(Optimizer_method):

        y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

        ax.plot(iters, test_cost_SGD, c=colors_SGD[idx], zorder=zor[idx], lw=2, label=fr"Cost - SGD using {O_M}") #zorder=0,

    # ax.plot(iters, test_cost_SGD, color='royalblue', lw=2, label=r"Cost for the test data - SGD") #zorder=0,
    # plt.scatter(best_poly_deg_GD, np.min(MSE_test), color='forestgreen', marker='x', zorder=100, s=150, label='Lowest MSE')
    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for different optimizers - SGD}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_optims_SGD.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    for idx, O_M in enumerate(Optimizer_method):

        y_predict_SGD, theta, test_cost_SGD, train_cost_SGD = SGD(X_train_SGD, X_test_SGD, y_train, y_test, O_M, G_M, minibatch_size, n_minibatches, n_epochs, init_LR, decay, momentum, seed, best_lambda_SGD)

        ax.scatter(x_test, y_predict_SGD, c=colors_SGD[idx], s=5, label=fr'SGD using {O_M}')

    # ax.scatter(x_test, y_predict_OLS, c='dodgerblue', s=5, label='OLS')
    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for different optimizers - SGD}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_optims_SGD.png', dpi=150)
    plt.clf()

def optim_plot_GD(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, G_M, n_epochs, seed, best_minibatch_size):

    """ Hyperparameters """
    init_LR = 0.1                          # Initial learning rate (LR)
    decay = 0.01                             # init_LR/n_epochs # LR decay rate (for fixed LR set decay to 0)
    momentum = 0.9                        # Momentum value for GD.
    minibatch_size = np.shape(x_train)[0]//best_minibatch_size
    n_minibatches = np.shape(x_train)[0]//minibatch_size #number of minibatches

    X_train_GD = designMatrix_1D(x_train, best_poly_deg_GD) # Train design matrix for GD
    X_test_GD = designMatrix_1D(x_test, best_poly_deg_GD) # Test design matrix for GD


    Optimizer_method = ['RMSprop', 'Adagrad', 'Adam','momentum']
    colors_GD = ['crimson', 'forestgreen', 'royalblue', 'darkorange']


    iters = np.arange(n_epochs*n_minibatches)


    plt.rc('axes', facecolor='whitesmoke', edgecolor='none',
    axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')
    plt.rc('lines', linewidth=2)

    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)

    for idx, O_M in enumerate(Optimizer_method):

        y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)

        ax.plot(iters, test_cost_GD, c=colors_GD[idx], lw=2, label=fr"Cost - GD using {O_M}") #zorder=0,

    ax.set_xlabel(r"Iterations", labelpad=10)
    ax.set_ylabel(r"Cost", labelpad=10)
    ax.set_title(r"\bf{Cost as function of iterations for different optimizers - GD}", pad=15)
    ax.set_yscale('log')
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('cost_plot_optims_GD.png', dpi=150)
    plt.clf()

    """ Regression line plot """
    fig, ax = plt.subplots(figsize=(16,9))
    fig.subplots_adjust(bottom=0.22)
    for idx, O_M in enumerate(Optimizer_method):

        y_predict_GD, theta, test_cost_GD, train_cost_GD = GD(X_train_GD, X_test_GD, y_train, y_test, G_M, O_M, n_epochs*n_minibatches, init_LR, decay, momentum, seed, best_lambda_GD)

        ax.scatter(x_test, y_predict_GD, c=colors_GD[idx], s=5, label=fr'GD using {O_M}')

    ax.plot(x, y_true, zorder=100, c='black', label='True y')
    ax.scatter(x_train, y_train, c='indigo', marker='o', s=3, alpha=0.3, label='Data') # Data
    ax.set_title(r'\bf{Regression line plot for different optimizers - GD}', pad=15)
    ax.set_xlabel(r'$x$', labelpad=10)
    ax.set_ylabel(r'$y$',  labelpad=10)
    ax.legend(framealpha=0.9, facecolor=(1, 1, 1, 1))
    string = rf'init-LR = {init_LR}, decay = {decay}, momentum = {momentum}, n\_epochs = {n_epochs}, minibatch\_size = {minibatch_size}, n\_minibatches = {n_minibatches}'
    plt.figtext(0.5, 0.05, string, ha="center", fontsize=18, bbox={'facecolor':'white', 'edgecolor':'black', 'lw':0.5, 'boxstyle':'round'})
    plt.savefig('regression_line_plot_optims_GD.png', dpi=150)
    plt.clf()



""" Data: """
coeff = [1.0, 1.0, 1.0]
n = 1000 # Number of datapoints
noise = True
alpha = 0.1 # Noise scaling
seed = 55 # Seed for noise (used to replicate results)
x, y_noise, y_true = polynomial(coeff, n, noise, alpha, seed)

x_train, x_test, y_train, y_test = train_test_split(x, y_noise, test_size=0.2, random_state = seed)

""" Gradient Gradient_method """
Gradient_method = ['auto', 'anal'] # 'auto': automatic differentiation, 'anal': analytical gradient
G_M = Gradient_method[1] # Choose the gradient method

lambda_min = -15
lambda_max = -1
nlambdas = 15
max_polydeg = 3

plot = True
n_epochs = 1000
seed = 55

""" Plots: """
best_poly_deg_GD, best_lambda_GD = find_lambda_DG(x_train, x_test, y_train, y_test, x, y_true, G_M, lambda_min, lambda_max, nlambdas, max_polydeg, plot, n_epochs, seed)
best_poly_deg_SGD, best_lambda_SGD = find_lambda_SDG(x_train, x_test, y_train, y_test, x, y_true,G_M, lambda_min, lambda_max, nlambdas, max_polydeg, plot, n_epochs, seed)
best_poly_deg_GD += 1
best_poly_deg_SGD += 1

best_minibatch_size = find_minibatch_size_SDG(x_train, x_test, y_train, y_test, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed)

fixed_LR(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size)
fixed_LR_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size)
adaptive_LR_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size)
Adagrad_w_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size)
Adagrad_w_o_momentum(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size)
optim_plot_SGD(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_SGD, best_lambda_SGD, G_M, n_epochs, seed, best_minibatch_size)
optim_plot_GD(x_train, x_test, y_train, y_test, x, y_true, best_poly_deg_GD, best_lambda_GD, G_M, n_epochs, seed, best_minibatch_size)



## <h2 style="text-align: center;"> <ins> NB: </ins> </h2>

Sadly, I did not have enough time to elaborate on the theory or the analysis in this week's set of exercises. This is mainly due to the work of finishing and delivering project 1, combined with tedious mandatory assignments in other courses. I hope, nonetheless, that this is okay, considering that a lot of work has already been put into project 2 code-wise (as seen above).

I would also like to mention that I do not know exactly why this notebook fails to produce output when plotting (the error points to a font size issue). I suspect it is related to the default version of matplotlib in the notebook environment, since the code was originally written as a plain Python file.

Lastly, the autograd library is an extension that can be installed with pip. More detailed documentation can be found at the link below.

Link: https://github.com/HIPS/autograd 