# Softmax activation function

The formula for the **softmax activation function** is:

$$
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
$$

Where:
- $ z_i $ is the $ i $-th input element (logit) in a vector of $ n $ elements.
- $ e^{z_i} $ is the exponential of $ z_i $.
- $ \sum_{j=1}^{n} e^{z_j} $ is the sum of the exponentials of all $ n $ input elements.

The softmax function turns a layer's raw outputs (logits) into a probability distribution: each output lies in the range (0, 1), and all outputs sum to 1.

In [12]:
import math
import numpy as np

In [2]:
layer_outputs = [4.8 , 1.21 , 2.385]

In [3]:
# Euler's number; math.e is more precise than hard-coding 2.71828182846
E = math.e

In [6]:
exp_values = []
for output in layer_outputs:
    exp_values.append(E**output)
exp_values

[121.51041751873483, 3.353484652549023, 10.859062664920513]

In [10]:
exp_values = np.exp(layer_outputs)
exp_values

array([121.51041752,   3.35348465,  10.85906266])

In [8]:
norm_base = sum(exp_values)
norm_values = []

for value in exp_values:
    norm_values.append(value / norm_base)
print(norm_values)
print(sum(norm_values))

[0.8952826639572619, 0.024708306782099374, 0.0800090292606387]
0.9999999999999999


In [11]:
norm_values = exp_values / np.sum(exp_values)
print(norm_values)
print(sum(norm_values))

[0.89528266 0.02470831 0.08000903]
0.9999999999999999


### Batches

In [13]:
layer_outputs = [[4.8 , 1.21 , 2.385],
                 [8.9 , -1.81 , 0.2],
                 [1.41 , 1.051 , 0.026]]

In [14]:
exp_values = np.exp(layer_outputs)
print(exp_values)

[[1.21510418e+02 3.35348465e+00 1.08590627e+01]
 [7.33197354e+03 1.63654137e-01 1.22140276e+00]
 [4.09595540e+00 2.86051020e+00 1.02634095e+00]]


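`axis=None` (the default) adds up every value in the matrix, `axis=0` adds up the values in each column, and `axis=1` adds up the values in each row.

`keepdims=True` keeps the summed axis as a dimension of size 1, so the row sums of a 3×3 matrix have shape (3, 1) instead of (3,). That shape broadcasts correctly when each row is divided by its own sum.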
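To make the effect of `axis` and `keepdims` concrete, here is a minimal sketch that reuses the batch from this notebook. The variable names (`batch`, `row_sums_flat`, `by_row`, `by_col`) are chosen only for this illustration.

```python
import numpy as np

batch = np.array([[4.8, 1.21, 2.385],
                  [8.9, -1.81, 0.2],
                  [1.41, 1.051, 0.026]])

total = np.sum(batch)                              # axis=None: one scalar for the whole matrix
col_sums = np.sum(batch, axis=0)                   # shape (3,): one sum per column
row_sums_flat = np.sum(batch, axis=1)              # shape (3,): one sum per row
row_sums = np.sum(batch, axis=1, keepdims=True)    # shape (3, 1): one sum per row, as a column

print(col_sums.shape, row_sums_flat.shape, row_sums.shape)

# With keepdims=True, every element of row i is divided by the sum of row i.
by_row = batch / row_sums

# Without keepdims, the (3,) array is broadcast along the last axis,
# so column j is divided by the sum of row j, which is not what we want.
by_col = batch / row_sums_flat

print(by_row.sum(axis=1))   # each row now sums to 1
print(by_col.sum(axis=1))   # rows do not sum to 1
```

On a square matrix the version without `keepdims` runs without complaint and gives wrong numbers. On a non-square batch it would raise a broadcasting error instead.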

In [15]:
print(np.sum(layer_outputs , axis = 1 , keepdims= True))

[[8.395]
 [7.29 ]
 [2.487]]


In [16]:
norm_values = exp_values /  np.sum(exp_values , axis= 1 , keepdims= True)
norm_values

array([[8.95282664e-01, 2.47083068e-02, 8.00090293e-02],
       [9.99811129e-01, 2.23163963e-05, 1.66554348e-04],
       [5.13097164e-01, 3.58333899e-01, 1.28568936e-01]])

In [17]:
print(np.sum(norm_values , axis = 1 , keepdims= True))

[[1.]
 [1.]
 [1.]]


### Why do we subtract the maximum value in the softmax function?

In practice, we usually subtract the maximum value of the input vector $ z $ from each element before applying the exponential function. This step prevents numerical overflow when some inputs are very large.

Consider the softmax formula:

$$
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
$$

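If any $ z_i $ is large, $ e^{z_i} $ can exceed the largest representable floating-point number. With 64-bit floats this happens once $ z_i $ is above roughly 709, at which point the exponential overflows to infinity and the division can produce `nan`. To prevent this, we subtract the maximum value of $ z $ (denoted as $ z_{\text{max}} $) from all elements, so the largest exponent is 0 and no exponential is greater than 1: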

$$
\sigma(z_i) = \frac{e^{z_i - z_{\text{max}}}}{\sum_{j=1}^{n} e^{z_j - z_{\text{max}}}}
$$
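The following is a minimal sketch of the shifted formula for a batch of inputs, compared against the unshifted version on deliberately large logits. The function names `softmax_naive` and `softmax_stable` and the example values are assumptions made for this illustration, not part of the notebook above.

```python
import numpy as np

def softmax_naive(z):
    # Direct translation of the formula: exponentiate, then normalize each row.
    exp_values = np.exp(z)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(z):
    # Subtract each row's maximum first, so the largest exponent is 0.
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

large_logits = np.array([[1000.0, 999.0, 998.0]])

# Silence the overflow and invalid-value warnings so the output stays readable.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large_logits)    # every exp overflows to inf, and inf / inf is nan

stable = softmax_stable(large_logits)      # finite probabilities that sum to 1

print(naive)
print(stable)
```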

#### Why does this work without changing the result?

Subtracting a constant like $ z_{\text{max}} $ from each element of $ z $ doesn't change the result of the softmax function. This is because softmax is invariant to shifting all inputs by the same constant (it is shift-invariant, not scale-invariant). Here’s why:

- Both the numerator and denominator are scaled by the same factor, $ e^{-z_{\text{max}}} $. 
- This factor cancels out:

$$
\sigma(z_i) = \frac{e^{z_i - z_{\text{max}}}}{\sum_{j=1}^{n} e^{z_j - z_{\text{max}}}} = \frac{e^{-z_{\text{max}}} \, e^{z_i}}{e^{-z_{\text{max}}} \sum_{j=1}^{n} e^{z_j}} = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
$$

Thus, the result remains the same, but we avoid large exponentials that could lead to overflow.
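As a quick numerical check of this cancellation, the sketch below applies both forms to the first example vector from this notebook. The names `plain`, `shifted_inputs`, and `shifted` are introduced only for this check.

```python
import numpy as np

layer_outputs = np.array([4.8, 1.21, 2.385])

# Softmax computed directly from the original formula.
plain = np.exp(layer_outputs) / np.sum(np.exp(layer_outputs))

# Softmax computed after subtracting the maximum input.
shifted_inputs = layer_outputs - np.max(layer_outputs)
shifted = np.exp(shifted_inputs) / np.sum(np.exp(shifted_inputs))

print(shifted_inputs)               # the largest entry is now 0
print(np.allclose(plain, shifted))  # both forms agree up to floating-point rounding
```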
