# Softmax
**Outline**
1. [What is Softmax?](#1-what-is-softmax)
2. [Getting Started](#2-getting-started)
3. [Implementing softmax from scratch with NumPy](#3-implementing-softmax-from-scratch-with-numpy)
4. [Implementing softmax with TensorFlow](#4-implementing-softmax-with-tensorflow)

### 1. What is Softmax?
[Softmax][softmax] is an **activation function*** often used in multiclass classification models in deep learning. Softmax converts a vector of $n$ real numbers into a probability distribution: a vector of $n$ non-negative numbers that sum to $1$.

Softmax produces this output probability vector by first applying the standard exponential function to each element of the input vector and then normalizing the exponentiated vector, dividing each element by the sum of all elements so that the result sums to $1$. The softmax equation is found below:

$$\textit{softmax}(x_{i}) = \frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}}$$
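
As a quick worked instance of the two steps (exponentiate, then divide by the sum), take the input $x = (1, 2, 3)$. This input is a hypothetical one chosen purely for illustration, and the values are rounded:

$$e^{1} \approx 2.718,\quad e^{2} \approx 7.389,\quad e^{3} \approx 20.086,\quad e^{1} + e^{2} + e^{3} \approx 30.193$$

$$\textit{softmax}(x) \approx \left(\frac{2.718}{30.193},\ \frac{7.389}{30.193},\ \frac{20.086}{30.193}\right) \approx (0.090,\ 0.245,\ 0.665)$$

The largest input receives the largest probability, and the three outputs sum to roughly $1$.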

**Questions that I had when learning about softmax:**

>**Q1:** Why does softmax exponentiate the input vector? <br><br>
>**A1:** Softmax aims to produce a probability distribution vector that sums to 1. To achieve this, every element must first be made non-negative, even when the raw input is negative. Moreover, softmax aims to assign greater probabilities to larger inputs, so the mapping must be monotonically increasing. Exponentiating the input vector satisfies both needs, since $e^{x}$ is always positive and strictly increasing.
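
A minimal sketch of the two properties mentioned in A1, using made-up scores (the `scores` values are assumptions for illustration only): exponentiation turns negative inputs into positive values and keeps their ranking intact.

```python
import numpy as np

# Hypothetical raw scores, including negative values.
scores = np.array([-2.0, -0.5, 0.0, 1.5])

exponentiated = np.exp(scores)

# Every exponentiated value is strictly positive, even for negative inputs.
all_positive = bool(np.all(exponentiated > 0))

# exp is monotonically increasing, so the ranking of the scores is preserved.
same_order = bool(np.array_equal(np.argsort(scores), np.argsort(exponentiated)))

print(exponentiated)
print(all_positive, same_order)
```
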

>**Q2:** Why does softmax use the standard exponential function instead of another base?<br><br>
>**A2:** Finding the answer to this one was a bit more difficult for me. It seems that on some level, choosing $e$ was slightly arbitrary, as there are other forms of softmax functions<sup>[1][stackexchange],[2][arxiv]</sup>. However, it seems possible that the main reason for selecting $e$ is because it is so easily differentiable, as $\frac{d}{dx}e^x = e^x$. [3Blue1Brown][3b1b] has a great explanation of this. 
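
One way to see why the base is somewhat arbitrary: since $b^{x} = e^{x \ln b}$, a softmax with any base $b > 1$ is the standard softmax applied to rescaled inputs. The sketch below checks this numerically. `softmax_with_base` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Standard (base-e) softmax of a 1-D vector."""
    exp = np.exp(x)
    return exp / np.sum(exp)

def softmax_with_base(x: np.ndarray, base: float) -> np.ndarray:
    """Hypothetical helper: softmax that uses `base` instead of e."""
    powered = np.power(base, x)
    return powered / np.sum(powered)

logits = np.array([3.8, 0.2, -0.6, 0.9, 0.3])

base_2 = softmax_with_base(logits, 2.0)
# Rescaling the inputs by ln(2) reproduces the base-2 result with base e.
rescaled = softmax(logits * np.log(2.0))

print(np.allclose(base_2, rescaled))
```

In other words, changing the base only changes how sharply the distribution peaks, which a scale factor on the inputs could achieve just as well.
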

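>**Q3:** Why use softmax instead of using sigmoid activations for each final unit and finding the largest sigmoid output?<br><br>
>**A3:** Picking the single largest output is a 'hardmax' (argmax) operation. Hardmax is not differentiable, whereas softmax is. Moreover, hardmax results in information loss, since it discards the relative sizes of the other outputs, and can hinder training.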
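
The contrast in A3 can be sketched numerically. Here `hardmax` is a hypothetical helper that returns a one-hot vector at the largest input, and `nudged` is an assumed small perturbation of the logits. The hardmax output does not react to the nudge at all, which is why it gives no useful gradient, while the softmax output shifts slightly.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Standard softmax of a 1-D vector."""
    exp = np.exp(x)
    return exp / np.sum(exp)

def hardmax(x: np.ndarray) -> np.ndarray:
    """Hypothetical helper: one-hot vector marking the largest input."""
    one_hot = np.zeros_like(x)
    one_hot[np.argmax(x)] = 1.0
    return one_hot

logits = np.array([3.8, 0.2, -0.6, 0.9, 0.3])
# Slightly increase the second logit.
nudged = logits + np.array([0.0, 0.1, 0.0, 0.0, 0.0])

# Hardmax is piecewise constant: the small change is invisible.
print(hardmax(logits), hardmax(nudged))

# Softmax responds smoothly: the second probability grows a little.
print(softmax(logits)[1], softmax(nudged)[1])
```
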

>**Final note:** It was important for me to differentiate between 'multi-class' and 'multi-label' classifications when learning about softmax. Softmax should be used when there is only *one* correct output label. When more than one label is possible, sigmoid activations should be used for each final unit to determine whether the label is present, regardless of other labels.
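
To make the multi-class versus multi-label distinction concrete, here is a small sketch with made-up logits (an assumption for illustration) in which two labels score highly. Softmax forces the outputs to compete for a total probability of 1, while per-unit sigmoids score each label independently.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Standard softmax of a 1-D vector."""
    exp = np.exp(x)
    return exp / np.sum(exp)

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw outputs where the first two labels both look likely.
logits = np.array([2.0, 1.5, -3.0])

softmax_probs = softmax(logits)   # multi-class: outputs sum to 1
sigmoid_probs = sigmoid(logits)   # multi-label: each output stands alone

print(softmax_probs, softmax_probs.sum())
print(sigmoid_probs, sigmoid_probs.sum())
```

With sigmoids, both of the first two labels can exceed 0.5 at once, which is the behavior wanted when more than one label may be present.
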


***Related** - Activation functions are non-linear functions that produce an output value for a unit. This output value represents the unit's level of activation, given some input value.

[softmax]: https://en.wikipedia.org/wiki/Softmax_function
[stackexchange]: https://stats.stackexchange.com/questions/296471/why-e-in-softmax
[arxiv]: https://arxiv.org/abs/1511.05042
[3b1b]: https://www.youtube.com/watch?v=m2MIpDrF7Es

### 2. Getting Started
To place the softmax activation function into a realistic context, let's imagine we've trained a CNN that recognizes four types of animals and classifies any other image as 'other'.

At the end of our model, we have five output units. These units provide their raw output values to the softmax layer, which converts the outputs into a probability distribution vector for the classification of our input image.

### 3. Implementing softmax from scratch with NumPy
Now let's say that we've just run the following image of an elephant through our CNN, and the final five units produce the five output values below.

<p align="center">
<img src="./images/Softmax Figure.png" alt="Example multi-class CNN with softmax output" width="75%">
</p>

In [1]:
import numpy as np

raw_output = np.array([[3.8, 0.2, -0.6, 0.9, 0.3]])  # AKA logits in TF jargon

Let's now write a softmax function to convert this raw output vector into a probability distribution vector. The function will be quite short!

In [2]:
def softmax(input_vector: np.ndarray) -> np.ndarray:
    """Apply softmax to an input vector and return result.
    
    Parameters:
        input_vector : np.ndarray
        
    Returns:
        np.ndarray
    """
    
    exp = np.exp(input_vector)
    prob_vector = exp / np.sum(exp)
    
    return prob_vector

output_probabilities = softmax(raw_output)

print(f"Softmax output values: {output_probabilities[0]}")

Softmax output values: [0.88902982 0.0242916  0.01091492 0.04891728 0.02684637]


So in this mock example, our model classifies the image as containing an elephant with ~89% probability. Nice!
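
One caveat that the short function above does not address: `np.exp` overflows for large inputs, which turns the result into `nan`. A common remedy is to subtract the maximum input before exponentiating, which leaves the softmax output mathematically unchanged. The sketch below is one possible variant, not the function used in this demonstration. `stable_softmax` and the `large` test input are names and values assumed for illustration, and normalizing over the last axis is an added assumption so that each row of a batch is handled separately.

```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, shifting by the row maximum first."""
    # Subtracting the max makes the largest exponent 0, so exp cannot overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # Normalize each row independently so batches of inputs also work.
    return exp / np.sum(exp, axis=-1, keepdims=True)

raw_output = np.array([[3.8, 0.2, -0.6, 0.9, 0.3]])
probs = stable_softmax(raw_output)

print(probs[0])                  # same values as the simple version
print(probs.sum(axis=-1))        # each row sums to 1
print(int(np.argmax(probs[0])))  # index of the most likely class

# Inputs this large would overflow without the shift.
large = np.array([[1000.0, 999.0, 998.0]])
large_probs = stable_softmax(large)
print(large_probs)
```
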

### 4. Implementing softmax with TensorFlow
As usual, it's better to rely on well-established packages for most needs. We can apply softmax in a few different ways using TensorFlow. Below are two examples. We'll print the output value of the first example to compare it to our own function.

1. It can be applied on its own as a standalone activation, separate from any other layer.

2. It can be applied as an activation function for a `tf.keras.layers.Dense` fully-connected layer.

In [4]:
import tensorflow as tf

# first let's convert the numpy array to a tensor
raw_output_tensor = tf.convert_to_tensor(raw_output)

# Example 1. Apply softmax on its own as a standalone activation
output = tf.keras.activations.softmax(raw_output_tensor)
# Print the TensorFlow result (not the NumPy one) so the two can be compared
print(f"TF softmax output values: {output.numpy()[0]}")

# Example 2. Create a fully connected layer in a model with a softmax activation.
#   We'll treat this layer as if it were the final layer of our example model.
layer = tf.keras.layers.Dense(5, activation=tf.keras.activations.softmax)

TF softmax output values: [0.88902982 0.0242916  0.01091492 0.04891728 0.02684637]


Feel free to create an issue in the repository if you have any concerns or find any problems with this demonstration.