# Softmax


This notebook classifies flowers from the Iris dataset. The dataset contains 150 examples of Iris flowers split equally among 3 species, *Iris-setosa*, *Iris-versicolor*, and *Iris-virginica* (50 samples each). Each example has 4 features: *sepal length*, *sepal width*, *petal length*, and *petal width*. See the image below for an illustration.

<img src="https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png" alt="alt text" width="500" height="200">

In this notebook, the problem of multiclass classification is addressed directly using a softmax classifier.

In [None]:
import numpy as np
import pandas as pd

URL_='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
data = pd.read_csv(URL_, header = None)

print(type(data))
print(data)

<class 'pandas.core.frame.DataFrame'>
       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]


Below, the observations are stored in a matrix $\boldsymbol{X}$ of size $150 \times 4$, and the labels $\{0,1,2\}$ are assigned to *Iris-setosa*, *Iris-versicolor*, and *Iris-virginica*, respectively. This assignment is an arbitrary choice; one could equally label *Iris-setosa* by, say, 2. The labels are then encoded as one-hot vectors.

In [None]:
X = np.asarray(data.iloc[:,:4]) # Use data.iloc[:,:4] to drop the label column
# Use np.asarray() to convert the data type to array
num_samples, num_features = X.shape
# print(X)

num_classes = 3

# Create one-hot vectors for class labels.
# Rows 0-49 are Iris-setosa, 50-99 Iris-versicolor, 100-149 Iris-virginica.
y = np.zeros((num_samples, num_classes), dtype='int')

for i in range(y.shape[0]):
    if i < 50:
        y[i][0] = 1
    elif i < 100:
        y[i][1] = 1
    else:
        y[i][2] = 1
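
The index-based labelling above relies on the rows being sorted by species. As a minimal sketch of an alternative, the one-hot vectors can be built from the species names themselves, which does not depend on row order. The array `species` below is a small hypothetical stand-in for the label column (column 4 of `data`), and the names `class_names`, `name_to_label`, and `y_onehot` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for the species column of the dataset.
species = np.array(['Iris-setosa', 'Iris-versicolor',
                    'Iris-virginica', 'Iris-setosa'])

# Fixed class order, matching the labels {0, 1, 2} chosen in the text.
class_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
name_to_label = {name: k for k, name in enumerate(class_names)}

# Integer label for each sample.
labels = np.array([name_to_label[s] for s in species])

# Row k of the identity matrix is the one-hot vector of class k,
# so indexing the identity with the labels gives the one-hot matrix.
y_onehot = np.eye(len(class_names), dtype=int)[labels]
print(y_onehot)
```

Indexing an identity matrix with the integer labels avoids the explicit loop and any hard-coded row boundaries.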

Recall that the softmax classifier has the form
\begin{align}
p = \mathrm{softmax}\big({z}\big) &=
\mathrm{softmax}\big({W}^\top {x} \big)\\
\mathrm{softmax}\big({z_1,z_2,\cdots,z_K}\big) &= \frac{1}{\sum_i \exp(z_i)}
\big(\exp(z_1),\exp(z_2),\cdots,\exp(z_K)\big)^\top
\end{align}

The code below realizes the softmax classifier.

In [None]:
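def softmax_func(W, X):
    # Scores for all samples at once: row i of z0 is (W^T x_i)^T
    z0 = np.matmul(X, W)
    z1 = np.exp(z0)
    # Normalize each row so that the class probabilities sum to 1
    p = z1 / np.sum(z1, axis=1, keepdims=True)

    return p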
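
Exponentiating raw scores can overflow when the entries of $z$ are large. A common remedy, sketched here as an assumption rather than as part of the notebook, is to subtract the row-wise maximum before exponentiating; this leaves the softmax output unchanged because the common factor cancels in the ratio. The function name `stable_softmax` and the test arrays are hypothetical.

```python
import numpy as np

def stable_softmax(Z):
    """Row-wise softmax of a (samples x classes) score matrix."""
    # Subtracting the per-row maximum does not change the result,
    # since exp(z - c) / sum(exp(z - c)) = exp(z) / sum(exp(z)).
    Z_shifted = Z - np.max(Z, axis=1, keepdims=True)
    E = np.exp(Z_shifted)
    return E / np.sum(E, axis=1, keepdims=True)

# Small scores, and the same scores shifted by 999.
Z_small = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
Z_large = np.array([[1000.0, 1001.0, 1002.0]])

P_small = stable_softmax(Z_small)
P_large = stable_softmax(Z_large)
print(P_small)
print(P_large)
```

With the shift, the largest exponent in each row is $\exp(0) = 1$, so the computation stays finite even for scores around 1000, where a direct `np.exp` would overflow.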

The cross-entropy (CE) loss, i.e. the negative log-likelihood (NLL) of the softmax output, is:

\begin{align}
    \mathcal{L}_{\mathrm{NLL}}\big({W}\big) \triangleq -\frac{1}{m}
    \sum_{i=1}^m {y}_i^\top \log\big(\hat{y}_i\big)\;.   
\end{align}
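with $\hat{y}_i = \mathrm{softmax}\big({W}^\top {x}_i \big)$ and $m$ the number of samples.
The function below implements the NLL loss. Note that it returns the sum over all samples, i.e. it omits the $1/m$ factor, which is why the loss values printed during training are large.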

In [None]:
def nll_loss(y, y_hat):

    # Small constant used to keep the probabilities away from 0
    epsilon = 1e-15

    # Clip y_hat into [epsilon, 1 - epsilon] so that log(0) is never evaluated
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)

    # Sum the CE loss over all samples and classes (no 1/m factor)
    ce_loss = -np.sum(y * np.log(y_hat))

    return ce_loss
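
As a quick sanity check of the formula with the $1/m$ factor, consider a classifier that assigns probability $1/3$ to every class: each sample then contributes $-\log(1/3) = \log 3 \approx 1.0986$, so the mean loss is $\log 3$ regardless of the labels. The sketch below verifies this; `mean_nll` is a hypothetical helper that divides by the number of samples, unlike the summed version in the notebook.

```python
import numpy as np

def mean_nll(y, y_hat, epsilon=1e-15):
    """Mean NLL over samples (includes the 1/m factor of the formula)."""
    # Keep probabilities away from 0 so that log(0) is never evaluated.
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat)) / m

# Three samples, one per class, with uniform predicted probabilities.
y_demo = np.eye(3, dtype=int)
y_hat_uniform = np.full((3, 3), 1.0 / 3.0)

loss_uniform = mean_nll(y_demo, y_hat_uniform)
print(loss_uniform, np.log(3))
```

A mean loss well above $\log 3$ at the start of training indicates that the randomly initialized model is doing worse than a uniform guess on average.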

To update the model weights ${W}$, the gradient of the NLL with respect to ${W}$ is required, which can be written as:

\begin{align}
    \nabla_{W}  \mathcal{L}_{\mathrm{NLL}} = \frac{1}{m} \sum_{i=1}^m x_i \big(\hat{y}_i - {y}_i\big)^\top\;.
\end{align}
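
One way to gain confidence in this expression is to compare it against central finite differences on a small random problem. The sketch below is an illustration under stated assumptions: the data are synthetic, and the helpers `softmax_rows`, `mean_nll_of_W`, and `analytic_grad` are hypothetical names that use the $1/m$-scaled loss and gradient from the formulas.

```python
import numpy as np

def softmax_rows(Z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mean_nll_of_W(W, X, y):
    # Mean NLL as a function of the weights W.
    return -np.sum(y * np.log(softmax_rows(X @ W))) / X.shape[0]

def analytic_grad(W, X, y):
    # (1/m) * sum_i x_i (y_hat_i - y_i)^T, written in matrix form.
    return X.T @ (softmax_rows(X @ W) - y) / X.shape[0]

# Small synthetic problem: 6 samples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
y_demo = np.eye(3)[rng.integers(0, 3, size=6)]
W_demo = rng.normal(size=(4, 3))

# Central finite differences, one weight entry at a time.
h = 1e-6
numeric_grad = np.zeros_like(W_demo)
for a in range(W_demo.shape[0]):
    for b in range(W_demo.shape[1]):
        step = np.zeros_like(W_demo)
        step[a, b] = h
        f_plus = mean_nll_of_W(W_demo + step, X_demo, y_demo)
        f_minus = mean_nll_of_W(W_demo - step, X_demo, y_demo)
        numeric_grad[a, b] = (f_plus - f_minus) / (2 * h)

max_abs_diff = np.max(np.abs(numeric_grad - analytic_grad(W_demo, X_demo, y_demo)))
print(max_abs_diff)
```

If the formula is implemented correctly, the largest entrywise difference should be tiny compared with the gradient entries themselves.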

Gradient descent is then run in a loop to learn $W$. The update rule for $W$ can be written as:
\begin{align}
W \gets W - \eta  \nabla_{W}  \mathcal{L}_{\mathrm{NLL}}
\end{align}

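The parameter $\eta \in (0,1]$ is the learning rate of the algorithm (called `lr` below). The code uses the summed gradient, i.e. it omits the $1/m$ factor, which is equivalent to scaling $\eta$ by $m$. The training loop is: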

In [None]:
np.random.seed(10)
W = np.random.randn(num_features, num_classes) # Initialize the W parameter

lr = 0.0001
max_iter = 50
loss = []

# Run gradient descent iterations
for it in range(max_iter):
    y_hat = softmax_func(W, X)
    loss_iter = nll_loss(y, y_hat)
    loss.append(loss_iter)
    grad = np.matmul(X.T, (y_hat-y)) # Summed gradient (no 1/m factor)
    W -= lr * grad

    # Compute the accuracy of the predictions made before this update
    pred_label = np.argmax(y_hat, axis=1) # Index of the most probable class for each sample
    acc_iter = np.mean(pred_label == np.argmax(y, axis=1))

    print(f'iter:{it:3} : loss value {loss_iter:.3f}, classification accuracy {100*acc_iter:.2f}%')

iter:  0 : loss value 1123.093, classification accuracy 33.33%
iter:  1 : loss value 1058.168, classification accuracy 33.33%
iter:  2 : loss value 1009.615, classification accuracy 27.33%
iter:  3 : loss value 970.020, classification accuracy 25.33%
iter:  4 : loss value 933.860, classification accuracy 19.33%
iter:  5 : loss value 898.822, classification accuracy 22.00%
iter:  6 : loss value 864.145, classification accuracy 24.00%
iter:  7 : loss value 829.593, classification accuracy 24.67%
iter:  8 : loss value 795.094, classification accuracy 27.33%
iter:  9 : loss value 760.626, classification accuracy 28.00%
iter: 10 : loss value 726.183, classification accuracy 33.33%
iter: 11 : loss value 691.760, classification accuracy 34.67%
iter: 12 : loss value 657.359, classification accuracy 35.33%
iter: 13 : loss value 622.978, classification accuracy 36.67%
iter: 14 : loss value 588.617, classification accuracy 36.67%
iter: 15 : loss value 554.277, classification accuracy 39.33%
iter: