## Softmax

Softmax Regression generalises Logistic Regression to support multiple classes directly. Given an instance $\bf{x}$, the Softmax model computes a score for each class $k$

$\Large s_k(\bf{x}) = \bf{x}^T\bf{\theta}^{(k)}$.

The probability of each class is then computed using the softmax function

$\Large \hat{p}_k = \sigma(\bf{s}(\bf{x}))_k = \frac{\exp(s_k(\bf{x}))}{\sum_{j=1}^{K}\exp(s_j(\bf{x}))} $

The Softmax Regression classifier predicts the class with the highest estimated probability, which is also the class with the highest score: the exponential is monotonically increasing and the denominator is the same for every class.

$\Large \hat{y} = \underset{k}{\operatorname{argmax}}\ \sigma(\bf{s}(\bf{x}))_k = \underset{k}{\operatorname{argmax}}\ s_k(\bf{x}) = \underset{k}{\operatorname{argmax}}\ \left((\bf{\theta}^{(k)})^T\bf{x}\right)$
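A quick numerical check may make this concrete. The sketch below uses made-up scores for two instances and three classes (the values and the standalone helper `softmax` are illustrative assumptions, not taken from any dataset) and shows that the probabilities sum to one and that taking the argmax of the probabilities gives the same classes as taking the argmax of the raw scores.

```python
import numpy as np

def softmax(s):
    """Row-wise softmax of an (m, K) score matrix."""
    s = s - np.max(s, axis=1, keepdims=True)  # shift scores for numerical stability
    exp_s = np.exp(s)
    return exp_s / np.sum(exp_s, axis=1, keepdims=True)

# Hypothetical scores: 2 instances, K = 3 classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
probs = softmax(scores)

print(probs.sum(axis=1))          # each row sums to 1
print(np.argmax(probs, axis=1))   # predicted classes from the probabilities
print(np.argmax(scores, axis=1))  # same classes straight from the scores
```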

## Cost Function

The generalised cost function for multinomial Logistic Regression is the cross-entropy

$\Large J(\bf{\theta}) = -\frac{1}{m}\sum_{i = 1}^{m}\sum_{k = 1}^{K}y_{k}^{(i)}\log\left(\hat{p}_{k}^{(i)}\right)$

where $y_{k}^{(i)}$ is 1 if the target class of the $i$-th instance is $k$ and 0 otherwise. When $K=2$ this becomes the log-loss cost function of binary Logistic Regression.
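The two-class case can be checked numerically. The following is a minimal sketch under stated assumptions: the function names `cross_entropy` and `log_loss`, the clipping constant `eps`, and the three example probabilities are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cross_entropy(p_hat, y_one_hot, eps=1e-12):
    """Mean cross-entropy; p_hat and y_one_hot both have shape (m, K)."""
    p_hat = np.clip(p_hat, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_one_hot * np.log(p_hat), axis=1))

def log_loss(p, y):
    """Binary log loss; p is the estimated P(y = 1), y holds 0/1 labels."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical binary problem: probability of the positive class per instance
p = np.array([0.9, 0.2, 0.6])
y = np.array([1, 0, 1])

p_hat = np.c_[1 - p, p]    # two-column probabilities (class 0, class 1)
y_one_hot = np.eye(2)[y]   # one-hot targets

print(cross_entropy(p_hat, y_one_hot), log_loss(p, y))  # the two values agree
```

With only two classes, each instance contributes either $\log(\hat{p})$ or $\log(1 - \hat{p})$ to the sum, which is exactly the binary log loss.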

The gradient vector of this cost function with respect to $\bf{\theta}^{(k)}$ is

$\Large \nabla_{\bf{\theta}^{(k)}}J(\bf{\theta}) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_{k}^{(i)} - y_{k}^{(i)}\right)\bf{x}^{(i)}$
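One way to gain confidence in this formula is to compare it with a finite-difference approximation of the cost. The sketch below does that on random data; the shapes, the seed, the step size `h`, and the helper names (`cost`, `analytic_gradient`) are assumptions made for the example. Stacking the per-class gradients as columns gives the matrix form $\frac{1}{m}X^T(\hat{P} - Y)$.

```python
import numpy as np

def softmax(s):
    s = s - np.max(s, axis=1, keepdims=True)  # shift scores for numerical stability
    exp_s = np.exp(s)
    return exp_s / np.sum(exp_s, axis=1, keepdims=True)

def cost(theta, X, Y):
    """Cross-entropy J(theta) for features X (m, n) and one-hot targets Y (m, K)."""
    return -np.mean(np.sum(Y * np.log(softmax(X @ theta)), axis=1))

def analytic_gradient(theta, X, Y):
    """Column k is (1/m) * sum_i (p_k^(i) - y_k^(i)) * x^(i)."""
    m = X.shape[0]
    return X.T @ (softmax(X @ theta) - Y) / m

rng = np.random.default_rng(0)
m, n, K = 20, 3, 4                           # hypothetical problem size
X = rng.normal(size=(m, n))
Y = np.eye(K)[rng.integers(0, K, size=m)]    # random one-hot targets
theta = rng.normal(size=(n, K))

# Central finite differences, one parameter at a time
numeric = np.zeros_like(theta)
h = 1e-6
for a in range(n):
    for b in range(K):
        step = np.zeros_like(theta)
        step[a, b] = h
        numeric[a, b] = (cost(theta + step, X, Y) - cost(theta - step, X, Y)) / (2 * h)

max_diff = np.max(np.abs(numeric - analytic_gradient(theta, X, Y)))
print(max_diff)  # should be tiny if the formula is right
```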

In [2]:
import numpy as np

In [311]:
class LogisticRegression:
    """Softmax (multinomial logistic) regression trained with batch gradient descent."""

    def __init__(self, theta = None, epochs = 1000, learning_rate = 0.01):
        self.theta = theta
        self.epochs = epochs
        self.learning_rate = learning_rate
   
    def softmax(self, s):
        s = s - np.max(s, axis=1, keepdims=True) # shift the scores so np.exp cannot overflow; the result is unchanged
        prob = np.exp(s) / np.sum(np.exp(s), axis=1, keepdims=True)
        return prob
    
    def fit(self, X, y):
        m, n = X.shape
        X_b = np.c_[np.ones((m, 1)), X] # include a column of ones for the bias term
        k = len(np.unique(y)) # number of classes; labels are assumed to be the integers 0..k-1
        self.theta = np.random.randn(n+1, k) # initialise the weight matrix, one column per class
        y_one_hot = np.zeros((m, k)) # one-hot encode the targets
        y_one_hot[np.arange(m), y] = 1
        for i in range(self.epochs):
            s = X_b.dot(self.theta)
            gradients = 1/m * X_b.T.dot(self.softmax(s) - y_one_hot)
            self.theta -= self.learning_rate*gradients

    def predict(self, X):
        X = np.array(X)
        m = X.shape[0]
        X_b = np.c_[np.ones((m, 1)), X]
        y_pred = self.softmax(X_b.dot(self.theta))
        return np.argmax(y_pred, axis=1)

In [8]:
from sklearn.datasets import load_iris

In [9]:
iris = load_iris().data

In [10]:
pl = iris[:, 2:3] # petal length only, kept as a 2-D column of shape (150, 1)
target = load_iris().target # integer class labels 0, 1, 2

In [315]:
log_reg = LogisticRegression()

In [316]:
log_reg.fit(pl, target)

In [317]:
log_reg.predict([[1.2], [3.2], [5.8]])

array([0, 1, 2], dtype=int64)
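Three correct predictions do not show whether gradient descent has made much progress, so it can help to track the cost during training. The sketch below repeats the same update rule outside the class on synthetic one-dimensional data. The three cluster centres, the noise level, and the seed are assumptions chosen to loosely resemble a single-feature, three-class problem; this is not the Iris data. It records the cross-entropy at every epoch.

```python
import numpy as np

def softmax(s):
    s = s - np.max(s, axis=1, keepdims=True)  # shift scores for numerical stability
    exp_s = np.exp(s)
    return exp_s / np.sum(exp_s, axis=1, keepdims=True)

rng = np.random.default_rng(42)

# Hypothetical 1-D data: three clusters of 50 points each
centres = np.array([1.5, 4.0, 5.5])
y = np.repeat(np.arange(3), 50)
X = (centres[y] + rng.normal(scale=0.3, size=y.size)).reshape(-1, 1)

m = X.shape[0]
X_b = np.c_[np.ones((m, 1)), X]     # add the bias column
Y = np.eye(3)[y]                    # one-hot targets
theta = rng.normal(size=(2, 3))     # random initial weights

learning_rate, epochs = 0.01, 1000  # same defaults as the class above
costs = []
for epoch in range(epochs):
    p_hat = softmax(X_b @ theta)
    costs.append(-np.mean(np.sum(Y * np.log(p_hat + 1e-12), axis=1)))
    theta -= learning_rate * X_b.T @ (p_hat - Y) / m

accuracy = np.mean(np.argmax(X_b @ theta, axis=1) == y)
print(costs[0], costs[-1], accuracy)
```

If the final cost is still falling noticeably, more epochs or a larger learning rate may be worth trying.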

In [27]:
# One-hot encode the 150 Iris targets: row i has a 1 in column target[i]
y_one_hot = np.zeros((150, 3))
y_one_hot[np.arange(150), target] = 1
y_one_hot

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...

In [24]:
target[0]

0