<a href="https://colab.research.google.com/github/RP272/Hands-On-ML/blob/main/Batch_Gradient_Descent_for_Softmax_Regression_with_Early_Stopping_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Chapter 4: Training Models

Exercise 12: Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).

First, let's define the functions needed for Softmax Regression: the softmax function, the softmax cost function (cross entropy), and finally the gradient vector of the cross entropy.

In [20]:
import numpy as np

def softmax_score(input_vector, parameter_matrix, k: int):
  return np.dot(input_vector, parameter_matrix[k])

def softmax(input_vector, parameter_matrix):
  softmax_scores = []
  sum_of_softmax_scores = 0
  for i in range(len(parameter_matrix)):
    score = np.exp(softmax_score(input_vector, parameter_matrix, i))
    softmax_scores.append(score)
    sum_of_softmax_scores += score
  # convert the list to an array so the division is explicitly element-wise
  return np.array(softmax_scores) / sum_of_softmax_scores

def cross_entropy_cost(parameter_matrix, input_vectors, class_count, target_probabilities):
  cost_sum = 0
  m = len(input_vectors)
  for i in range(m):
    softmax_scores = softmax(input_vectors[i], parameter_matrix)
    for k in range(class_count):
      cost_sum += target_probabilities[i][k] * np.log(softmax_scores[k])
  return -1/m * cost_sum


def cross_entropy_gradient(parameter_matrix, input_vectors, class_count, target_probabilities):
  m = len(input_vectors)
  class_gradients = []
  softmax_score_cache = {}
  for k in range(class_count):
    tmp = 0
    for i in range(m):
      if i not in softmax_score_cache:
        softmax_score_cache[i] = softmax(input_vectors[i], parameter_matrix)
      softmax_scores = softmax_score_cache[i]
      difference = softmax_scores[k] - target_probabilities[i][k]
      tmp += np.array(input_vectors[i]) * difference
    class_gradients.append(1/m * tmp)
  return class_gradients
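
The loops above follow the formulas term by term, which is easy to read but slow for larger datasets. As a rough sketch, the same quantities can be computed with matrix operations. The function names `softmax_batch`, `cross_entropy_cost_batch` and `cross_entropy_gradient_batch` and the `eps` parameter are my own and not part of the cells above. The sketch assumes `X` has shape (m, n) with the bias column included, `theta` has shape (K, n), and `Y` is a one-hot matrix of shape (m, K). Subtracting the row maximum before `np.exp` is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax_batch(X, theta):
    """Row-wise softmax probabilities for inputs X (m, n) and parameters theta (K, n)."""
    logits = X @ theta.T                                  # (m, K) class scores
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    exp_scores = np.exp(logits)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy_cost_batch(theta, X, Y, eps=1e-12):
    """Mean cross entropy; Y is one-hot (m, K). eps guards against log(0)."""
    probs = softmax_batch(X, theta)
    return -np.mean(np.sum(Y * np.log(probs + eps), axis=1))

def cross_entropy_gradient_batch(theta, X, Y):
    """Gradient of the mean cross entropy, same shape as theta (K, n)."""
    probs = softmax_batch(X, theta)
    return (probs - Y).T @ X / len(X)
```

Row k of the returned gradient corresponds to the k-th entry of the list returned by `cross_entropy_gradient`, so it could be used in the same update `theta = theta - eta * gradients`.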


With the softmax functions defined, let's prepare the dataset.

In [21]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]

X_with_bias = np.c_[np.ones(len(X)), X] # add the bias term (x0 = 1) to every instance

X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, test_size=0.2, random_state=42)

Now let's write the code for Batch Gradient Descent.

In [22]:
import numpy as np
from sklearn.metrics import accuracy_score

eta = 0.1
n_iterations = 5001
m = len(X_train) # size of training set
class_count = 3

# Parameter matrix: one row per class, one column per feature (including the bias). Each cell is the weight of that feature for that class. Random initialization is fine here; Batch Gradient Descent adjusts the values from there.
theta = np.random.rand(class_count, len(X_train[0]))

# Input matrix: one row per training instance, one column per feature. Each cell is the value of that feature for that instance.
input_vectors = X_train
# The number of columns in theta must equal the number of columns in input_vectors.

# Target probabilities (one-hot): one row per training instance, one column per class. Each cell is the probability of that class for that instance.
target_probabilities = np.zeros((m, class_count))
for i in range(m):
  target_probabilities[i][y_train[i]] = 1

for iteration in range(n_iterations):
  gradients = np.array(cross_entropy_gradient(theta, input_vectors, class_count, target_probabilities))
  theta = theta - eta * gradients

# Let's test the accuracy on the training set
y_pred = np.zeros(y_train.shape)

for i in range(m):
  softmax_scores = softmax(input_vectors[i], theta)
  y_pred[i] = np.argmax(softmax_scores)

print(accuracy_score(y_train, y_pred))



0.9583333333333334


The first version of this code didn't seem to learn very well, so I inspected it.

Found one problem: I forgot to use the exponential function in the softmax calculation. Fixing that boosted training accuracy from 50% to 75%.

The second problem turned out to be the missing bias term. Every element of input_vectors has to have the value 1 as its first element, so I added it.

BINGO!!! About 95% accuracy on the training set. Now we can say that the model learned something.

Let's test its performance on the test set.

In [23]:
y_test_pred = np.zeros(y_test.shape)

n = len(X_test)
for i in range(n):
  softmax_scores = softmax(X_test[i], theta)
  y_test_pred[i] = np.argmax(softmax_scores)

print(accuracy_score(y_test, y_test_pred))

1.0


Perfect accuracy on the test set. This is probably due to how small the test set is: only 30 instances (20% of the 150 Iris samples), so the estimate is noisy.
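
The exercise asks for early stopping, but the training loop above always runs for a fixed number of iterations. Here is a minimal sketch of one way it could be added, assuming a separate validation set is held out from the training data. The validation cross entropy is measured after every update, the best parameters seen so far are kept, and training stops once the validation loss has not improved for `patience` iterations. The names `softmax_batch`, `cross_entropy`, `train_with_early_stopping`, and the `patience` and `seed` parameters are illustrative and do not come from the cells above. The sketch uses matrix operations instead of the loops to keep it short.

```python
import numpy as np

def softmax_batch(X, theta):
    """Row-wise softmax probabilities for X (m, n) and theta (K, n)."""
    logits = X @ theta.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(logits)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy(theta, X, Y, eps=1e-12):
    """Mean cross entropy for one-hot targets Y (m, K)."""
    return -np.mean(np.sum(Y * np.log(softmax_batch(X, theta) + eps), axis=1))

def train_with_early_stopping(X_tr, Y_tr, X_val, Y_val, eta=0.1,
                              n_iterations=5001, patience=50, seed=42):
    """Batch Gradient Descent that keeps the parameters with the lowest validation loss."""
    rng = np.random.default_rng(seed)
    theta = rng.random((Y_tr.shape[1], X_tr.shape[1]))  # (classes, features)
    best_theta, best_loss, best_iteration = theta.copy(), np.inf, 0
    for iteration in range(n_iterations):
        # one full-batch gradient step on the training set
        gradients = (softmax_batch(X_tr, theta) - Y_tr).T @ X_tr / len(X_tr)
        theta = theta - eta * gradients
        # evaluate on the held-out validation set
        val_loss = cross_entropy(theta, X_val, Y_val)
        if val_loss < best_loss:
            best_theta, best_loss, best_iteration = theta.copy(), val_loss, iteration
        elif iteration - best_iteration >= patience:
            break  # no improvement for `patience` iterations: stop early
    return best_theta, best_loss, best_iteration
```

With the variables from this notebook, the validation set could be split off `X_train` and `y_train` with another `train_test_split` call, and the labels one-hot encoded with something like `np.eye(class_count)[y]`. The test set should stay untouched until the final evaluation.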