In [1]:
%config IPCompleter.greedy=True

import numpy as np
import matplotlib.pyplot as plt
import sklearn.preprocessing, sklearn.datasets, sklearn.model_selection

At the end of the previous notebook, I briefly showed one-vs-rest classification technique and normalization of the distribution. We may put this normalization after the sigmoid activation functions - that exactly, what softmax is suppose to do.

# Softmax

Formally, we may define the softmax as follow:

$$
softmax(\pmb{x})_i = \frac{e^{x_i}}{\sum_k e^{x_k}}
$$

In other word, we apply exponential to each input of the softmax and then normalize the probability distribution.

Note that softmax is just a generalization of the sigmoid activation.

$$
softmax(x, 0) = \frac{e^x}{e^x + e^0} = \frac{e^x}{e^x + 1} \cdot \frac{e^{-x}}{e^{-x}} = \frac{e^{x-x}}{e^{x-x} + e^{-x}} = \frac{e^0}{e^0 + e^{-x}} = \frac{1}{1+e^{-x}} = \sigma(x)
$$

Now, when we have the formula, it is easy to implement the sigmoid in numpy.

In [2]:
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

But why to bother with sigmoid and not just normalize the outputs of sigmoids? It turns out, we may get better numerical properties if we join sigmoids and normalization into one "layer". Thanks to the following equation.

$$
softmax(\pmb{x}+c)_i=\frac{e^{x_i+c}}{\sum_k e^{x_k+c}} = \frac{e^c}{e^c} \cdot \frac{e^{x_i}}{\sum_k e^{x_k}} = softmax(\pmb{x})_i
$$

The danger in the softmax are the exponents, when the divisor can become too big for float values and as a result become infinity. However, we may set $c=max(\pmb{x})$ and as all the scalars are negative or zero, there would be no overflow. Cases, where the divisor become zero are much more unlikely and even if some scalars are so small, that they are round to zero because of the float precision, it doesn't matter for use to distinguish between their actual value and zero. 

This way, we have our modified softmax function.

In [3]:
def softmax(x):
    x = x - np.max(x)
    return np.exp(x) / np.sum(np.exp(x))

## Gradient

Before we move on, we need to know how to compute gradient. That is not that easy as with softmax, as we have multiple inputs and multiple outputs. You may try it by hand, if you wish.

$$
\frac{\partial softmax(\pmb{x})_i}{\partial x_j} = \frac{\partial \frac{e^{x_i}}{\sum_k e^{x_k}}}{\partial x_j}
$$

There are three indices! Moreover, this is derivative only in respect to one input variable. We need to compute gradient of every output in respect to every input. I don't want to dig too much into the math, you can see [[1]](#Bibliography) if you wanna know more. In the end, the gradient pops out, so we may implement it.

$$
\frac{\partial softmax(\pmb{x})_i}{\partial x_j} = 
\begin{cases}
    softmax(\pmb{x})_j - softmax(\pmb{x})_j \cdot softmax(\pmb{x})_i & \text{if } i = j \\
    0 - softmax(\pmb{x})_j \cdot softmax(\pmb{x})_i & \text{if } i \neq j \\
\end{cases}
$$

You may notice something strange - the gradient is table. Remember, we compute gradient of each output in respect to each input. That need's to be a table. But we want only gradients in respect to the inputs - we need to sum the gradients over the outputs, the same way, as we sum gradients of multiple examples. I won't show the implementation just yet, as I believe it would be more helpfull to view the code and the explanation in bigger picture.

I mentioned **layer** a while ago. The fact is, the softmax is really a layer, how we understand them in the context of neural networks. We will implement our first simple neural networks in the very next notebook. It tooks us eight notebooks, but we finally have enough knowledge to do that. However, we need some refactoring of the code before we can do that.

# Refactoring

Let's start with the **layer** implementation itself. Maybe I should tell first, what the layer is. During the neural network training (we may thing about the logistic regression as about a simple neural network) we are exploiting the chain rule. You may remember from the school the chain rule.

$$
\frac{\partial f(g(h(x)))}{\partial x} = \frac{\partial f(g(h(x)))}{\partial g(h(x))} \cdot \frac{\partial g(h(x))}{\partial h(x)} \cdot \frac{\partial h(x)}{\partial x}
$$

This is not different from our model, remember that loss of logistic regression was written as:

$$
\mathcal{L}(\sigma(\pmb{x}\pmb{w})
$$

And we need to update the weights.

$$
\frac{\mathcal{L}}{\partial \pmb{w}} = \frac{\partial \mathcal{L}}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial \pmb{x}\pmb{w}} \cdot \frac{\partial \pmb{x}\pmb{w}}{\partial \pmb{w}}
$$

Each part of the gradient will be one layer. We will use the same layers for neural nerworks later on. We already implemented some - for example the loss function.

In [186]:
class CategoricalCrossEntropyLoss:
    def __call__(self, target, predicted):
        indices = np.arange(len(target))
        return -np.log(np.maximum(predicted[indices,target], 1e-15))
    
    def gradient(self, target, predicted):
        grad = np.zeros((len(target), 10))
        indices = np.arange(len(target))
        grad[indices,target] = -1 / predicted[indices,target]
        return grad

Notice, that the layer accepts output of the previous layer (in this case in shape `(batchsize, classes)`) and it returns it's gradient in respect to it's input - again in the shape `(batchsize,classes`).

We may use the same approach - let's now implement the sigmoid layer. It takes inputs and apply sigmoid activation function on each of it. It is simple to implement it right away, but there is one problem. If you carefully look at the chain rule, the gradients are multiplied by the gradients of the following layer. In other word, by computing gradient of the sigmoid, we need to compute this:

$$
\frac{\mathcal{L}}{\partial \pmb{w}} = \frac{\partial \mathcal{L}}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial \pmb{x}\pmb{w}}
$$

As a result, the gradient function doesn't accept only layer's input, but gradient from the following layer as well. Now we may implement the sigmoid activation layer.

In [5]:
class SigmoidLayer:
    def __call__(self, inputs):
        return 1 / (1 + np.exp(-inputs))
    
    def gradient(self, inputs, gradients):
        outputs = self(inputs)
        my_gradient = outputs * (1 - outputs)
        return my_gradient * gradients

Now let's look at the softmax layer. This case is a bit complicated, as the gradient is not a vector, but a table. In general, most layers that have multiple inputs and multiple outputs will generate gradient table (note that for each example), as it needs to compute gradient of each output in respect to each input. The sigmoid activation layer, and activation layers in general, are exceptions. Each output corresponds to one input and as such the inputs doesn't interfere. You may look at it from the other side, the gradient of output $i$ in respect to input $j$ is zero if $i \neq j$. That results in table full of zeros expect diagonals.

But let's return back to the softmax layer. 

In [162]:
class SoftmaxLayer:
    def __init__(self):
        pass
    
    def __call__(self, inputs):
        inputs = inputs - np.max(inputs)
        return np.exp(inputs) / np.sum(np.exp(inputs), axis=-1)[:,np.newaxis]
    
    def gradient(self, inputs, gradients):
        outputs = self(inputs)
        diag = np.zeros((outputs.shape[0], outputs.shape[1], outputs.shape[1]))
        my_gradient = np.diag(outputs) - outputs.T @ outputs
        return np.sum(gradients[np.newaxis,:] * my_gradient, axis=1)

In [325]:
target = np.random.randint(0,10,size=(3,), dtype=int)
vals = np.random.uniform(size=(3,10))
soft = SoftmaxLayer()
loss = CategoricalCrossEntropyLoss()

predicted = soft(vals)
l = loss(target, predicted)
l_grad = loss.gradient(target, predicted)
s_grad = soft.gradient(vals, l_grad)

ValueError: operands could not be broadcast together with shapes (3,) (10,10) 

In [326]:
import tensorflow as tf

target_tf = tf.Variable(target)
vals_tf = tf.Variable(vals)

with tf.GradientTape() as tape:
    predicted_tf = tf.nn.softmax(vals_tf)
    l_tf = tf.keras.losses.sparse_categorical_crossentropy(target_tf, predicted_tf)

l_grad_tf, s_grad_tf = tape.gradient(l_tf, [predicted_tf, vals_tf])

In [327]:
print("prediction")
print(predicted)
print("loss")
print(l)
print("loss grad")
print(l_grad)
print("softmax grad")
print(s_grad)

prediction
[[0.16005908 0.06725888 0.10924701 0.11384503 0.08331395 0.06459966
  0.1163387  0.06519166 0.06509037 0.15505567]
 [0.08421609 0.07562292 0.08035415 0.10763828 0.13765494 0.09973486
  0.08157759 0.07799766 0.17106817 0.08413533]
 [0.14815491 0.09326063 0.08769906 0.12453189 0.05808831 0.07857035
  0.05885886 0.08420299 0.13066567 0.13596733]]
loss
[2.73197873 2.47436928 1.99534064]
loss grad
[[  0.           0.           0.           0.           0.
    0.           0.           0.         -15.36325671   0.        ]
 [-11.87421547   0.           0.           0.           0.
    0.           0.           0.           0.           0.        ]
 [  0.           0.           0.           0.           0.
    0.           0.           0.           0.          -7.35470788]]
softmax grad
[[5.79791929 5.40976589 5.79812171 5.76095188 5.76404722 5.58312189
  5.29561435 5.88078342 5.96542029 5.89731699]]


In [328]:
print("prediction")
print(predicted_tf)
print("loss")
print(l_tf)
print("loss grad")
print(l_grad_tf)
print("softmax grad")
print(s_grad_tf)

prediction
tf.Tensor(
[[0.16005908 0.06725888 0.10924701 0.11384503 0.08331395 0.06459966
  0.1163387  0.06519166 0.06509037 0.15505567]
 [0.08421609 0.07562292 0.08035415 0.10763828 0.13765494 0.09973486
  0.08157759 0.07799766 0.17106817 0.08413533]
 [0.14815491 0.09326063 0.08769906 0.12453189 0.05808831 0.07857035
  0.05885886 0.08420299 0.13066567 0.13596733]], shape=(3, 10), dtype=float64)
loss
tf.Tensor([2.73197873 2.47436928 1.99534064], shape=(3,), dtype=float64)
loss grad
tf.Tensor(
[[  1.           1.           1.           1.           1.
    1.           1.           1.         -14.36325671   1.        ]
 [-10.87421547   1.           1.           1.           1.
    1.           1.           1.           1.           1.        ]
 [  1.           1.           1.           1.           1.
    1.           1.           1.           1.          -6.35470788]], shape=(3, 10), dtype=float64)
softmax grad
tf.Tensor(
[[ 0.16005908  0.06725888  0.10924701  0.11384503  0.08331395  0.

In [329]:
import torch

target_t = torch.tensor(target, dtype=torch.long)
vals_t = torch.tensor(vals, requires_grad=True)

predicted_t = torch.nn.functional.softmax(vals_t, dim=1)
predicted_t.retain_grad()
l_t = torch.nn.functional.nll_loss(torch.log(predicted_t), target_t, reduction='none')
l_t.backward(torch.ones(l_t.size()))

In [330]:
print("prediction")
print(predicted_t)
print("loss")
print(l_t)
print("loss grad")
print(predicted_t.grad)
print("softmax grad")
print(vals_t.grad)

prediction
tensor([[0.1601, 0.0673, 0.1092, 0.1138, 0.0833, 0.0646, 0.1163, 0.0652, 0.0651,
         0.1551],
        [0.0842, 0.0756, 0.0804, 0.1076, 0.1377, 0.0997, 0.0816, 0.0780, 0.1711,
         0.0841],
        [0.1482, 0.0933, 0.0877, 0.1245, 0.0581, 0.0786, 0.0589, 0.0842, 0.1307,
         0.1360]], dtype=torch.float64, grad_fn=<SoftmaxBackward>)
loss
tensor([2.7320, 2.4744, 1.9953], dtype=torch.float64,
       grad_fn=<NllLossBackward>)
loss grad
tensor([[  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0000, -15.3633,   0.0000],
        [-11.8742,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0000,   0.0000,   0.0000],
        [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0000,   0.0000,  -7.3547]], dtype=torch.float64)
softmax grad
tensor([[ 0.1601,  0.0673,  0.1092,  0.1138,  0.0833,  0.0646,  0.1163,  0.0652,
         -0.9349,  0.1551],
        [-0.9158,  0.0756,  0.0804,  0.

# Bibliography

- \[1\] The Softmax Function Derivative, Stephen Oman, 17th June 2019, [online](https://aimatters.wordpress.com/2019/06/17/the-softmax-function-derivative/)