# Softmax layer implementation (with cross-entropy loss)
When a network is trained to classify inputs into one of 10 classes, as in CIFAR-10, a softmax activation is used in the output layer. It computes, for each class, the probability that the input belongs to it. The number of softmax units in the output layer must equal the number of classes in the dataset, so that each unit holds the probability of the sample belonging to one class. The outputs of all softmax units therefore sum to 1, forming a probability distribution.
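
For reference, the two quantities this layer computes can be written as follows. The symbols are chosen here for illustration: $z$ is the vector of logits for one sample, $y$ is its one-hot label, and $K$ is the number of classes.

$$
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L = -\sum_{k=1}^{K} y_k \log p_k
$$

Because $y$ is one-hot, the loss reduces to the negative log of the probability assigned to the true class.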

### Things to remember
- This function converts a vector of real values (logits) into a probability distribution.
- In a classifier like this one, it is used only at the output layer.
- The class with the highest probability is taken as the predicted output.
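
The points above can be illustrated with a minimal sketch. The helper name `softmax` and the example scores are made up for illustration and are not part of the class defined in this notebook.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Hypothetical raw scores for one sample and three classes.
scores = np.array([[2.0, 1.0, 0.1]])
probs = softmax(scores)

print(probs)                     # three values, each between 0 and 1
print(probs.sum(axis=1))         # sums to 1 (up to floating-point error)
print(np.argmax(probs, axis=1))  # index of the most probable class
```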

In [27]:
import numpy as np

class SoftmaxCrossEntropy:
    def __init__(self):
        self.probs = None  # Softmax output, cached for the backward pass.
        self.cache_labels = None  # One-hot labels, cached for the backward pass.

    def forward(self, logits: np.ndarray, cache_labels: np.ndarray) -> float:
        """
        Performs the forward pass for this softmax layer.

        Args:
        - logits: shape (batch_size, num_classes). These are the raw scores
                produced once a sample x has passed through the hidden layers
                and the final linear layer, just before softmax is applied.
        - cache_labels: one-hot encoded labels, shape (batch_size, num_classes).
                They must be one-hot encoded.

        Returns:
        - Scalar loss value (float), averaged over the batch.
        """
        # Softmax
        shifted = logits - np.max(logits, axis=1, keepdims=True)  # stability
        exp_x = np.exp(shifted)
        self.probs = exp_x / np.sum(exp_x, axis=1, keepdims=True)

        # Cache labels
        self.cache_labels = cache_labels

        # Cross-entropy loss
        batch_size = logits.shape[0]
        log_likelihood = -np.log(np.sum(self.probs * cache_labels, axis=1))
        loss = np.sum(log_likelihood) / batch_size
        return loss

    '''
    Why not use a full Jacobian-vector product here?
    A Jacobian-vector product is correct in theory but inefficient in practice.
    With softmax + cross-entropy, the gradient with respect to the logits simplifies to probs - labels.
    That is why SoftmaxCrossEntropy.backward() is leaner than a full Jacobian approach.
    '''
    def backward(self) -> np.ndarray:
        batch_size = self.cache_labels.shape[0]
        dX = (self.probs - self.cache_labels) / batch_size
        return dX
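
The claim that the gradient simplifies to `(probs - labels) / batch_size` can be checked numerically. The sketch below is self-contained rather than reusing the class: it compares the analytic expression with a central finite-difference estimate on small random inputs. The function names `softmax_ce_loss`, `analytic_grad`, and `numeric_grad` are hypothetical helpers introduced only for this check.

```python
import numpy as np

def softmax_ce_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy of softmax(logits) against one-hot labels."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    return float(-np.sum(labels * log_probs) / logits.shape[0])

def analytic_grad(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Closed-form gradient of the mean loss with respect to the logits."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_x = np.exp(shifted)
    probs = exp_x / np.sum(exp_x, axis=1, keepdims=True)
    return (probs - labels) / logits.shape[0]

def numeric_grad(logits: np.ndarray, labels: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        plus, minus = logits.copy(), logits.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (softmax_ce_loss(plus, labels) - softmax_ce_loss(minus, labels)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))                 # 4 samples, 5 classes
labels = np.eye(5)[rng.integers(0, 5, size=4)]   # random one-hot labels

max_diff = np.max(np.abs(analytic_grad(logits, labels) - numeric_grad(logits, labels)))
print("Max difference between analytic and numeric gradient:", max_diff)
```

A difference close to zero indicates that the two gradients agree on this input.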


# Testing the softmax implementation

### Forward Pass (Test)
Loading the dataset for testing

In [4]:
import tensorflow as tf
import matplotlib.pyplot as plt

cifar10 = tf.keras.datasets.cifar10
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

Function to plot a sample image

In [28]:
classes = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
def plot_data(X, y, index):
    plt.figure(figsize=(10, 2))  # Set the figure size
    plt.imshow(X[index])
    plt.xlabel(classes[(y[index])[0]])  # y[index] is a one-element array, e.g. [6]

Pipeline to build
- Image (32x32x3 -> 3072 features) -> hidden layers (e.g. ... -> linear layer, usually placed just before softmax) -> softmax + cross-entropy loss -> probabilities

Current pipeline (for testing)
- Image -> linear layer (synthetic weights) -> softmax -> probabilities
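
The first step of the pipeline, flattening images and one-hot encoding labels, can be sketched for a whole batch. The arrays here are synthetic stand-ins with CIFAR-10's shapes (images of shape (N, 32, 32, 3), integer labels of shape (N, 1)), and the helper name `prepare_batch` is hypothetical.

```python
import numpy as np

def prepare_batch(images: np.ndarray, int_labels: np.ndarray, num_classes: int = 10):
    """Flatten each image to a row vector and one-hot encode the labels."""
    flat = images.reshape(images.shape[0], -1)          # (N, 32*32*3) = (N, 3072)
    one_hot = np.eye(num_classes)[int_labels.ravel()]   # (N, num_classes)
    return flat, one_hot

rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)  # synthetic pixels
fake_labels = np.array([[6], [9], [9], [4]])                             # made-up class indices

X_flat, Y_one_hot = prepare_batch(fake_images, fake_labels)
print(X_flat.shape)     # (4, 3072)
print(Y_one_hot.shape)  # (4, 10)
```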

In [36]:
# A single image, flattened into a row vector.
img = X_train[0]
reshaped_img = img.reshape(1, -1)  # Softmax expects shape (1, 3072), not (32, 32, 3)

softmax = SoftmaxCrossEntropy()

# Synthetic weights + bias
np.random.seed(0)
W = np.random.rand(3072, 10) * 0.01
b = np.zeros(10)

# Applying linear layer
logits = np.dot(reshaped_img, W) + b  

# One-hot label for the first training image (a frog).
# classes = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
label_index = y_train[0][0]  # 6, the index of "frog" in classes
labels = np.zeros((1, 10))
labels[0, label_index] = 1

# Forward pass
loss = softmax.forward(logits, labels)
print("Predicted class:", np.argmax(softmax.probs))
print("Loss:", loss)

# Backward pass
grad = softmax.backward()
print("Gradient:", grad)



Predicted class: 7
Loss: 41.66676707127592
Gradient: [[ 2.66660736e-13  4.32930938e-23  8.07802945e-13  1.05902255e-08
   5.21080679e-17  1.58621037e-01 -1.00000000e+00  8.41373308e-01
   4.39864035e-06  1.24555925e-06]]
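
A loss this large suggests the softmax is saturated: with raw pixel values in the range 0 to 255 and 3072 features, the logits can differ by tens of units, so almost all probability mass lands on one or two classes. One common remedy is to scale the pixels to [0, 1] before the linear layer. The sketch below uses a synthetic random image rather than the CIFAR-10 sample, so its numbers will not match the output above, and `softmax_probs` is a hypothetical helper.

```python
import numpy as np

def softmax_probs(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(1, 3072)).astype(np.float64)  # synthetic pixels, 0..255
W = rng.random((3072, 10)) * 0.01                                   # same weight scale as above
b = np.zeros(10)

probs_raw = softmax_probs(fake_img @ W + b)               # raw pixel values
probs_scaled = softmax_probs((fake_img / 255.0) @ W + b)  # pixels scaled to [0, 1]

print("Max probability, raw pixels:   ", probs_raw.max())
print("Max probability, scaled pixels:", probs_scaled.max())
print("Uniform-guess loss, log(10):   ", np.log(10))
```

With scaled inputs the probabilities stay much closer to uniform, so an untrained layer would be expected to give a loss near log(10), roughly 2.3, instead of a value in the tens.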


In [None]:
# Inspect the bias vector used in the linear layer above (initialized with np.zeros).
print(b)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
