# Implementing Batch Gradient Descent with early stopping for Softmax Regression
__Author__ : Mohammad Rouintan , 400222042

__Course__ : Undergraduate Machine Learning Course

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Problem
Implement batch gradient descent with early stopping for softmax regression from scratch, then apply it to a classification task on the Penguins dataset.

### Softmax Regression
The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called _Softmax Regression_, or _Multinomial Logistic Regression_.

The idea is quite simple: when given an instance $x$, the Softmax Regression model first computes a score $S_{k}(x)$ for each class $k$, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. The equation to compute $S_{k}(x)$ should look familiar, as it is just like the equation for Linear Regression prediction.

#### Softmax score for class k
$$
\begin{align}
S_{k}(x) &= (\theta^{(k)})^{T}x \tag{1}
\end{align}
$$

Note that each class has its own dedicated parameter vector $\theta^{(k)}$. All these vectors are typically stored as rows in a parameter matrix $\Theta$.
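
As a minimal sketch of equation (1), assume the parameter vectors are stacked as rows of a matrix `Theta` with shape `(K, n)` and the instances as rows of `X` with shape `(m, n)`. The names `Theta`, `X`, `softmax_scores` and the toy values are illustrative assumptions. Under those assumptions, the scores of every class for every instance come from a single matrix product:

```python
import numpy as np

# Hypothetical toy values: K = 3 classes, n = 2 features, m = 2 instances.
Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])   # row k holds theta^(k)
X = np.array([[2.0, 3.0],
              [1.0, -1.0]])      # row i holds x^(i)

def softmax_scores(X, Theta):
    """Return an (m, K) matrix whose entry (i, k) is (theta^(k))^T x^(i)."""
    return X @ Theta.T

scores = softmax_scores(X, Theta)
print(scores)
```

Each row of `scores` is the score vector $s(x)$ of one instance.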

Once you have computed the score of every class for the instance $x$, you can estimate the probability $\hat{p}_{k}$ that the instance belongs to class $k$ by running the scores through the softmax function: it computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials).

#### Softmax function
$$
\begin{align}
\hat{p}_{k} &= \sigma(s(x))_{k} = \frac{e^{S_{k}(x)}}{\sum\limits_{j = 1}^{K} e^{S_{j}(x)}} \tag{2}
\end{align}
$$

1. $K$ is the number of classes.
2. $s(x)$ is a vector containing the scores of each class for instance $x$.
3. $\sigma(s(x))_{k}$ is the estimated probability that the instance $x$ belongs to class $k$ given the scores of each class for that instance.
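
Equation (2) can be sketched in NumPy as follows. Subtracting the row maximum before exponentiating is a common stability trick that is not part of the equation itself: it leaves the result unchanged because the common factor cancels in the ratio. The function name and the sample scores are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax of an (m, K) score matrix."""
    # Shifting by the row maximum does not change the ratio in equation (2),
    # but it keeps np.exp from overflowing on large scores.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[2.0, 3.0, 5.0],
                   [1000.0, 1000.0, 1000.0]])  # large scores would overflow without the shift
proba = softmax(scores)
print(proba)
```

Every row of `proba` sums to 1, and equal scores yield equal probabilities.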

Just like the Logistic Regression classifier, the Softmax Regression classifier predicts the class with the highest estimated probability (which is simply the class with the highest score).

#### Softmax Regression classifier prediction

$$
\begin{align}
\hat{y} = \underset{k}{\operatorname{argmax}} \,\, \sigma(s(x))_{k} = \underset{k}{\operatorname{argmax}} \,\, S_{k}(x) = \underset{k}{\operatorname{argmax}} \,\, ((\theta^{(k)})^{T}x) \tag{3}
\end{align}
$$
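The $\operatorname{argmax}$ operator returns the value of a variable that maximizes a function. In this equation, it returns the value of $k$ that maximizes the estimated probability $\sigma(s(x))_{k}$. Because the softmax function is monotonically increasing in each score, the same $k$ also maximizes the score $S_{k}(x)$.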
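
A short sketch of equation (3), using made-up scores: taking the argmax over the probabilities and over the raw scores gives the same class. Note that NumPy indexes classes from 0, whereas the equations count them from 1 to $K$.

```python
import numpy as np

# Hypothetical scores for two instances and three classes.
scores = np.array([[2.0, 3.0, 5.0],
                   [1.0, -1.0, 0.0]])

# Softmax probabilities (shifted by the row maximum for stability).
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)

pred_from_proba = np.argmax(proba, axis=1)    # class with the highest probability
pred_from_scores = np.argmax(scores, axis=1)  # class with the highest score
print(pred_from_proba, pred_from_scores)
```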

Now that you know how the model estimates probabilities and makes predictions, let’s take a look at training. The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes.

#### Cross entropy cost function
$$
\begin{align}
J(\Theta) = -\frac{1}{m} \sum\limits_{i = 1}^{m} \sum\limits_{k = 1}^{K} y_{k}^{(i)}\log(\hat{p}_{k}^{(i)}) \tag{4}
\end{align}
$$
$y_{k}^{(i)}$ is the target probability that the $i^{th}$ instance belongs to class $k$. In general, it is either equal to 1 or 0, depending on whether the instance belongs to the class or not.
Notice that when there are just two classes ($K = 2$), this cost function is equivalent to the Logistic Regression’s cost function.
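
The two-class equivalence can be checked numerically. The sketch below evaluates equation (4) on a two-column probability matrix and compares it with the usual Logistic Regression log loss; the function names and the sample probabilities are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy(Y, P):
    """Equation (4): mean cross entropy for one-hot targets Y and probabilities P, both (m, K)."""
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def log_loss(y, p):
    """Logistic Regression cost for binary targets y and estimated P(class 1) = p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p = np.array([0.9, 0.2, 0.6])    # hypothetical estimated probabilities of class 1
y = np.array([1, 0, 1])          # binary targets

P = np.column_stack([1 - p, p])  # two-class probability matrix
Y = np.column_stack([1 - y, y])  # matching one-hot targets

ce = cross_entropy(Y, P)
ll = log_loss(y, p)
print(ce, ll)
```

Both values agree up to floating-point rounding, because with $K = 2$ the inner sum in equation (4) has exactly the two terms of the log loss.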

#### Cross entropy gradient vector for class k
$$
\nabla_{\theta^{(k)}} J(\Theta) = \frac{1}{m} \sum\limits_{i = 1}^{m} (\hat{p}_{k}^{(i)} - y_{k}^{(i)})x^{(i)} \tag{5}
$$
Now you can compute the gradient vector for every class, then use Gradient Descent (or any other optimization algorithm) to find the parameter matrix $\Theta$ that minimizes the cost function.
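
One way to gain confidence in equation (5) is to compare it with a finite-difference approximation of equation (4). The sketch below does this on random data; the helper names, the random seed, and the matrix shapes (`Theta` as `(K, n)`) are assumptions for illustration and are independent of the class defined in the next cell.

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cost(Theta, X, Y):
    """Equation (4), with Theta of shape (K, n)."""
    return -np.mean(np.sum(Y * np.log(softmax(X @ Theta.T)), axis=1))

def gradient(Theta, X, Y):
    """Equation (5) for all classes at once; row k is the gradient for theta^(k)."""
    return (softmax(X @ Theta.T) - Y).T @ X / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                # m = 6 instances, n = 2 features
Y = np.eye(3)[rng.integers(0, 3, size=6)]  # one-hot targets, K = 3
Theta = rng.normal(size=(3, 2))

# Central finite differences, perturbing one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(Theta)
for idx in np.ndindex(*Theta.shape):
    step = np.zeros_like(Theta)
    step[idx] = eps
    numeric[idx] = (cost(Theta + step, X, Y) - cost(Theta - step, X, Y)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - gradient(Theta, X, Y)))
print(max_abs_diff)
```

A very small `max_abs_diff` indicates that the analytic gradient and the numerical estimate agree.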

In [1]:
class SoftmaxRegression():
    def __init__(self, eta=0.01, epochs=5000, l2=0.5):
        self.eta = eta        # learning rate
        self.epochs = epochs  # number of full passes over the training set
        self.l2 = l2          # L2 regularization strength

    def one_hot(self, y, n_labels):
        # Turn integer labels 0..n_labels-1 into one-hot rows.
        y_one_hot = np.zeros((len(y), n_labels))
        y_one_hot[np.arange(len(y)), y] = 1
        return y_one_hot

    def init_param(self, weights_shape, bias_shape, dtype='float64', scale=0.01):
        # Small random weights, zero biases.
        w = np.random.normal(loc=0.0, scale=scale, size=weights_shape)
        b = np.zeros(bias_shape)
        return b.astype(dtype), w.astype(dtype)

    def soft_max(self, logits):
        # Subtract the row maximum for numerical stability; the result is unchanged.
        numerator = np.exp(logits - np.max(logits, axis=1, keepdims=True))
        denominator = np.sum(numerator, axis=1, keepdims=True)
        return numerator / denominator

    def cross_entropy(self, y_encoded, y_proba):
        # Sum over classes (axis=1), then average over instances.
        return -np.mean(np.sum(np.log(y_proba) * y_encoded, axis=1))

    def compute_gradient(self, X, y_encoded, y_proba):
        m = X.shape[0]
        dw = (1 / m) * np.dot(X.T, (y_proba - y_encoded))
        db = (1 / m) * np.sum(y_proba - y_encoded, axis=0)  # one entry per class
        return dw, db

    def fit(self, X, y, init_params=True):
        y = np.asarray(y)  # integer class labels 0..K-1
        if init_params:
            self.n_classes = np.max(y) + 1
            self.n_features = X.shape[1]
            self.bias, self.weight = self.init_param((self.n_features, self.n_classes), (self.n_classes, ))

        y_encoded = self.one_hot(y, self.n_classes)
        self.loss_history = []  # regularized loss per epoch

        for epoch in range(self.epochs):
            logits = X.dot(self.weight) + self.bias
            softmax = self.soft_max(logits)
            cross_ent_loss = self.cross_entropy(y_encoded, softmax)
            l2_loss = self.l2 * np.sum(np.square(self.weight))
            self.loss_history.append(cross_ent_loss + l2_loss)

            dw, db = self.compute_gradient(X, y_encoded, softmax)
            dw += 2 * self.l2 * self.weight  # gradient of the L2 penalty in the loss

            self.weight -= self.eta * dw
            self.bias -= self.eta * db

        return self

    def predict_proba(self, X):
        logits = X.dot(self.weight) + self.bias
        softmax = self.soft_max(logits)
        return softmax

    def predict(self, X):
        proba_predict = self.predict_proba(X)
        return proba_predict.argmax(axis=1)
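
The class above trains for a fixed number of epochs. The early stopping named in the title can be sketched separately as follows: after each full-batch update, the loss on a held-out validation set is measured, and training stops once it has not improved for `patience` consecutive epochs, keeping the best parameters seen. This is a minimal sketch under stated assumptions, not the notebook's final implementation: the function `fit_early_stopping`, the parameters `patience` and `tol`, and the synthetic three-class data (used here in place of the Penguins dataset) are all hypothetical, and L2 regularization is omitted for brevity.

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, P, eps=1e-12):
    # eps guards against log(0) when a probability underflows.
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

def fit_early_stopping(X_tr, Y_tr, X_val, Y_val, eta=0.1, epochs=5000,
                       patience=20, tol=1e-6):
    """Batch gradient descent that stops when the validation loss stalls."""
    m, n = X_tr.shape
    W = np.zeros((n, Y_tr.shape[1]))
    b = np.zeros(Y_tr.shape[1])
    best_loss, best_W, best_b, waited = np.inf, W.copy(), b.copy(), 0
    for epoch in range(epochs):
        error = softmax(X_tr @ W + b) - Y_tr   # (m, K) matrix of p_hat - y
        W -= eta * X_tr.T @ error / m          # full-batch gradient step
        b -= eta * error.mean(axis=0)
        val_loss = cross_entropy(Y_val, softmax(X_val @ W + b))
        if val_loss < best_loss - tol:
            # Validation loss improved: remember these parameters.
            best_loss, best_W, best_b, waited = val_loss, W.copy(), b.copy(), 0
        else:
            waited += 1
            if waited >= patience:
                break                          # no improvement for `patience` epochs
    return best_W, best_b, best_loss, epoch + 1

# Synthetic three-class data (an assumption; not the Penguins dataset).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))
Y = np.eye(3)[y]

# First 200 rows for training, last 100 for validation.
W, b, best_val_loss, epochs_run = fit_early_stopping(X[:200], Y[:200], X[200:], Y[200:])
val_accuracy = np.mean(np.argmax(X[200:] @ W + b, axis=1) == y[200:])
print(epochs_run, best_val_loss, val_accuracy)
```

Returning the best parameters rather than the last ones means the model is rolled back to the epoch with the lowest validation loss.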
