### **Softmax Activation function**

- It converts the outputs into a probability distribution
- It's a score which represents the class-membership probability for any label $t$
- Basically, a version of Logistic Regression which can be applied to multiple (more than 2) classes
$$
P(y = t | z^{[i]}_{t}) = \sigma_{softmax}(z^{[i]}_{t}) = \frac{e^{z^{[i]}_{t}}}{\sum^{h}_{j = 1}e^{z^{[i]}_{j}}}
$$
- Here, $z^{[i]}_{t}$ is the net input (logit) for class $t$ of the $i$-th training example, and $h$ is the number of classes
- Basically, the above equation reads "Here is the probability that the given input belongs to class $t$, given its features"
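To make the formula concrete, here is a minimal sketch that computes softmax by hand and compares it with `F.softmax`. The helper name `softmax_manual` is made up for illustration, and subtracting the row-wise maximum is an added numerical-stability trick that is not part of the formula above (it leaves the result unchanged).

```python
import torch
import torch.nn.functional as F


def softmax_manual(z: torch.Tensor) -> torch.Tensor:
    """Row-wise softmax for a 2D tensor of shape (examples, classes)."""
    # Assumption: shift each row by its max so exp() cannot overflow.
    # The shift cancels out in the ratio, so the probabilities are the same.
    z_shifted = z - z.max(dim=1, keepdim=True).values
    exp_z = torch.exp(z_shifted)
    # Divide each exponentiated score by the sum over all classes in that row
    return exp_z / exp_z.sum(dim=1, keepdim=True)


z = torch.tensor([[3.1, -2.3, 5.8]])
probs = softmax_manual(z)

print(probs)
# Should agree with PyTorch's built-in softmax
print(torch.allclose(probs, F.softmax(z, dim=1)))
```
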

### **Example**

In [2]:
import torch
import torch.nn.functional as F

In [3]:
z = torch.tensor([[3.1, -2.3, 5.8]])

In [4]:
F.softmax(z, dim = 1)

tensor([[6.2955e-02, 2.8434e-04, 9.3676e-01]])

In [6]:
# If you have scientific notation
torch.set_printoptions(precision = 3, sci_mode = False)
s = F.softmax(z, dim = 1)
s

tensor([[    0.063,     0.000,     0.937]])

In [7]:
# Check that the probability distributions sum up to 1
torch.sum(s)

tensor(1.000)

### **Now, convert these scores into class labels**

In [8]:
# Get the index of highest probability
torch.argmax(s, dim = 1)

tensor([2])

### **Cross-entropy loss**

- It is the loss function used in softmax regression
- Let's start with the **binary cross-entropy loss function**, which is
$$
L = \frac{1}{n}\sum^{n}_{i = 1}-(y^{[i]}\log(a^{[i]}) + (1 - y^{[i]})\log(1 - a^{[i]}))
$$
- **Note the $\sum$: with multiple training examples, we sum the per-example losses and divide by $n$, so $L$ is the average loss**
- Also, the $a$ here refers to the activation value (in this case, the output of the sigmoid activation function)
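As a quick sanity check of the binary formula, the sketch below evaluates it term by term and compares the result with `F.binary_cross_entropy`. The labels `y` and activations `a` are made-up values for illustration only.

```python
import torch
import torch.nn.functional as F

# Hypothetical data: 4 training examples with binary labels and
# sigmoid activations (made up for this sketch)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
a = torch.tensor([0.9, 0.2, 0.6, 0.4])

# Per-example loss: -(y * log(a) + (1 - y) * log(1 - a))
per_example = -(y * torch.log(a) + (1 - y) * torch.log(1 - a))

# Average over the n training examples (the 1/n * sum in the formula)
loss = per_example.mean()

print(loss)
# Should agree with PyTorch's built-in binary cross-entropy
print(torch.allclose(loss, F.binary_cross_entropy(a, y)))
```
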

**Multi-category cross-entropy loss function** for multiple training examples and multiple classes
$$
L = \frac{1}{n}\sum^{n}_{i = 1}\sum^{K}_{k = 1}-y^{[i]}_{k}\log(a^{[i]}_{k})
$$
- To apply this formula, the labels must first be one-hot encoded, so that $y^{[i]}_{k}$ is $1$ for the true class and $0$ for every other class

**A complete example (kind of)**

In [11]:
net_inputs = torch.tensor([
  [1.5, 0.1, -0.4],
  [0.5, 0.7, 2.1],
  [-2.1, 1.1, 0.8],
  [1.1, 2.5, -1.2]
])

In [12]:
activations = torch.softmax(net_inputs, dim = 1)
activations

tensor([[0.716, 0.177, 0.107],
        [0.139, 0.170, 0.690],
        [0.023, 0.561, 0.416],
        [0.194, 0.787, 0.019]])

In [9]:
y = torch.tensor([0, 2, 2, 1])
y_onehot = F.one_hot(y)
y_onehot

tensor([[1, 0, 0],
        [0, 0, 1],
        [0, 0, 1],
        [0, 1, 0]])

- For training example $1:$
  - Take $1^{st}$ row from softmax activations and one-hot vector
  - Pass it through the cross-entropy loss
$$
-1 \cdot \log(0.716) - 0 \cdot \log(0.177) - 0 \cdot \log(0.107) = -\log(0.716) \approx 0.334
$$
- Repeat this for the other rows, then average the per-row losses to get $L$
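The per-row calculation above can be done for all four rows at once. The following is a sketch that reuses the same `net_inputs` and `y` values and compares the result with `F.cross_entropy`, which takes the raw net inputs and integer class labels and applies the softmax step internally. Passing `num_classes=3` explicitly is an added precaution, not something the original cell does.

```python
import torch
import torch.nn.functional as F

net_inputs = torch.tensor([
    [1.5, 0.1, -0.4],
    [0.5, 0.7, 2.1],
    [-2.1, 1.1, 0.8],
    [1.1, 2.5, -1.2],
])
y = torch.tensor([0, 2, 2, 1])

activations = torch.softmax(net_inputs, dim=1)
y_onehot = F.one_hot(y, num_classes=3)

# Inner sum over classes: only the true-class term survives the one-hot mask
per_example = -(y_onehot * torch.log(activations)).sum(dim=1)

# Outer average over the n training examples
loss = per_example.mean()

print(per_example)
print(loss)
# Should agree with PyTorch's built-in cross-entropy on the raw net inputs
print(torch.allclose(loss, F.cross_entropy(net_inputs, y)))
```

The first entry of `per_example` corresponds to the hand calculation for training example 1.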