In [11]:
import torch
import torch.nn.functional as F
import torch.nn as nn

# Softmax function

The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers.

![Alt Text](https://miro.medium.com/v2/resize:fit:1400/1*ReYpdIZ3ZSAPb2W8cJpkBg.jpeg)
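
In case the image does not render, the definition above is usually written as follows. The symbols are assumed here: $z_i$ is the $i$-th input and $\sigma(z)_i$ is the $i$-th output probability.

$$
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

Every output is positive, and the denominator makes the $K$ outputs sum to 1.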


In [67]:
x = torch.tensor([1.3, 5.1, 2.2, 0.7, 1.1])
outputs = F.softmax(x, dim = 0)
print(outputs)

tensor([0.0202, 0.9025, 0.0497, 0.0111, 0.0165])


What actually happens:

In [9]:
print(torch.exp(x) / torch.exp(x).sum())

tensor([0.0202, 0.9025, 0.0497, 0.0111, 0.0165])
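
The direct formula above works for small inputs, but `torch.exp` overflows for large ones. A common workaround is to subtract the maximum before exponentiating, which does not change the result. The sketch below illustrates that idea; `stable_softmax` is a hypothetical helper written for this note, not a PyTorch function.

```python
import torch
import torch.nn.functional as F

def stable_softmax(z: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """Softmax that subtracts the max before exponentiating (hypothetical helper)."""
    shifted = z - z.max(dim=dim, keepdim=True).values  # largest entry becomes 0
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=dim, keepdim=True)

x = torch.tensor([1.3, 5.1, 2.2, 0.7, 1.1])
big = torch.tensor([1000.0, 1001.0, 1002.0])

# Direct formula: exp(1000.) overflows to inf in float32, and inf / inf is nan
naive_big = torch.exp(big) / torch.exp(big).sum()
stable_big = stable_softmax(big)

print('Naive :', naive_big)
print('Stable:', stable_big)
print('Matches F.softmax on x:', torch.allclose(stable_softmax(x), F.softmax(x, dim=0)))
```

`F.softmax` handles large inputs without this issue, so in practice the built-in function is the safer choice.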

In [87]:
x = torch.tensor([[1.3, 5.1, 2.2, 0.7, 1.1],
                  [2.3, 1.3, 1.2, 7.5, 4.1]])
outputs = F.softmax(x, dim = 0)
print('Softmax across rows (dim=0, each column sums to 1): \n', outputs)
outputs = F.softmax(x, dim = 1)
print('Softmax across columns (dim=1, each row sums to 1): \n', outputs)

Softmax across rows (dim=0, each column sums to 1): 
 tensor([[0.2689, 0.9781, 0.7311, 0.0011, 0.0474],
        [0.7311, 0.0219, 0.2689, 0.9989, 0.9526]])
Softmax across columns (dim=1, each row sums to 1): 
 tensor([[0.0202, 0.9025, 0.0497, 0.0111, 0.0165],
        [0.0053, 0.0019, 0.0018, 0.9590, 0.0320]])


### Argmax function

In [88]:
probs = F.softmax(x, dim = 1)
print('Probabilities: \n', probs)

# argmax returns the index of the maximum value
# we need max index across columns
most_likely_class = torch.argmax(probs, dim = 1)
print('Most likely class: ', most_likely_class)

Probabilities: 
 tensor([[0.0202, 0.9025, 0.0497, 0.0111, 0.0165],
        [0.0053, 0.0019, 0.0018, 0.9590, 0.0320]])
Most likely class:  tensor([1, 3])


In our case we need to predict the class of each example. Applying softmax gives us probabilities, and we choose the class with the highest probability. This is what the argmax function does.
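
Since softmax preserves the ordering of its inputs, the argmax of the probabilities is the same as the argmax of the raw scores. The short check below uses the same values as `x` above; the variable names are chosen for this example only.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.3, 5.1, 2.2, 0.7, 1.1],
                       [2.3, 1.3, 1.2, 7.5, 4.1]])

# Predicted class from the probabilities
from_probs = torch.argmax(F.softmax(logits, dim=1), dim=1)
# Predicted class straight from the raw scores
from_logits = torch.argmax(logits, dim=1)

print(from_probs)   # tensor([1, 3])
print(from_logits)  # tensor([1, 3])
```

So if only the predicted class is needed, and not the probabilities, the softmax step can be skipped.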

### What does dim (dimension) mean here?
The shape of our tensor x here is (2, 5): 2 rows and 5 columns. The dim argument names the dimension along which the values are normalized. With dim=0, softmax runs along the row dimension, so the 2 values in each column sum to 1. With dim=1, softmax runs along the column dimension, so the 5 values in each row sum to 1.
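
One way to check which direction was normalized is to sum the result along the same dimension. This is a small sketch using the same tensor as above; the names `over_dim0`, `over_dim1`, `col_sums` and `row_sums` are just for this example.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.3, 5.1, 2.2, 0.7, 1.1],
                  [2.3, 1.3, 1.2, 7.5, 4.1]])

over_dim0 = F.softmax(x, dim=0)  # normalizes the 2 values in each column
over_dim1 = F.softmax(x, dim=1)  # normalizes the 5 values in each row

col_sums = over_dim0.sum(dim=0)  # shape (5,): one sum per column
row_sums = over_dim1.sum(dim=1)  # shape (2,): one sum per row

print('Column sums for dim=0:', col_sums)
print('Row sums for dim=1:   ', row_sums)
```

Both printed tensors should contain ones, up to floating point rounding.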

## Binary Cross Entropy:

![text](https://androidkt.com/wp-content/uploads/2023/05/Selection_099.png)

## Multiclass Cross Entropy:

![alt](https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-03-11-43-42.png)

In both cases only one term is non-zero for each example, because each example belongs to exactly one class.
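
For reference, the two losses are commonly written as follows for a single example. The notation is assumed here and may differ from the images: $y$ is the true label, $p$ is the predicted probability of class 1 in the binary case, and $y_k$, $p_k$ are the one-hot label and the predicted probability for class $k$.

$$
L_{\text{binary}} = -\big[\, y \log p + (1 - y) \log (1 - p) \,\big], \qquad y \in \{0, 1\}
$$

$$
L_{\text{multiclass}} = -\sum_{k=1}^{K} y_k \log p_k
$$

If the true class is $c$, then $y_c = 1$ and all other $y_k = 0$, so the multiclass loss reduces to $-\log p_c$.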

### Now let's see how Cross Entropy Loss works in code

In [113]:
# random output from a neural network for 4 training examples
# We will have 3 classes
outputs = torch.randn(4, 3)
print(outputs)

tensor([[ 0.4786,  0.2170,  0.2445],
        [-0.9200,  0.7587,  0.0682],
        [ 0.1503, -1.6424,  0.7011],
        [ 0.8171, -0.7246, -1.2948]])


In [114]:
activations = F.softmax(outputs, dim = 1) # dim = 1 is across columns
print(activations)

tensor([[0.3905, 0.3006, 0.3090],
        [0.1106, 0.5924, 0.2970],
        [0.3447, 0.0574, 0.5979],
        [0.7490, 0.1603, 0.0906]])


Now we have the activations: the predicted probabilities for the 4 training examples, with one probability for each of the 3 classes.

In [116]:
# suppose these are true labels
y_labels = torch.tensor([2, 0, 1, 1]) # 4 training examples

# Now we need to One Hot Encode these labels
# 0 -> [1, 0, 0]
# 1 -> [0, 1, 0]
# 2 -> [0, 0, 1]
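# We can use PyTorch's built-in function
# (num_classes is passed explicitly; otherwise it is inferred from the largest label)
y_labels_oh = F.one_hot(y_labels, num_classes = 3)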
print(y_labels_oh)

tensor([[0, 0, 1],
        [1, 0, 0],
        [0, 1, 0],
        [0, 1, 0]])
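
One detail worth knowing about `F.one_hot`: when `num_classes` is not given, it is inferred from the largest label in the tensor. The example below uses a made-up batch in which class 2 happens to be missing.

```python
import torch
import torch.nn.functional as F

batch = torch.tensor([0, 1, 1])  # class 2 does not appear in this batch

inferred = F.one_hot(batch)                 # width inferred from max label + 1
explicit = F.one_hot(batch, num_classes=3)  # width fixed to 3 classes

print(inferred.shape)  # torch.Size([3, 2])
print(explicit.shape)  # torch.Size([3, 3])
```

With the inferred width, multiplying by a (3, 3) tensor of activations would fail, so passing `num_classes` explicitly is the safer habit.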


In [130]:
# Using the formula for cross entropy loss
print('Loss of each training example:\n', -1. * y_labels_oh * torch.log(activations))
loss = -torch.sum(y_labels_oh * torch.log(activations))
print('\nOverall Loss:', loss)
print('Avg Loss', loss / len(y_labels_oh))

Loss of each training example:
 tensor([[0.0000, 0.0000, 1.1746],
        [2.2022, 0.0000, 0.0000],
        [0.0000, 2.8578, 0.0000],
        [0.0000, 1.8306, 0.0000]])

Overall Loss: tensor(8.0652)
Avg Loss tensor(2.0163)


I really like how beautiful this is. We are using matrices and vectors to do all the calculations. This is the power of linear algebra. We can do all the calculations with just one line of code.

When we multiply y_labels_oh by torch.log(activations), only one term per example is non-zero: the log of the probability of the correct class. Multiplying by -1 gives the negative log of the probability of the correct class, which is exactly what we want.
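
Because the one-hot multiplication only picks out one entry per row, the same values can also be obtained by indexing with the labels directly. The sketch below compares both routes on random scores; the variable names and the seed are chosen for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # 4 examples, 3 classes
labels = torch.tensor([2, 0, 1, 1])

log_probs = F.log_softmax(logits, dim=1)

# Route 1: multiply by the one-hot labels and sum over classes
one_hot = F.one_hot(labels, num_classes=3)
loss_one_hot = -(one_hot * log_probs).sum(dim=1)

# Route 2: pick the log-probability of the correct class by indexing
loss_indexed = -log_probs[torch.arange(len(labels)), labels]

print(loss_one_hot)
print(loss_indexed)
print('Mean matches F.cross_entropy:',
      torch.allclose(loss_indexed.mean(), F.cross_entropy(logits, labels)))
```

Both routes give one loss value per training example, and their mean agrees with the built-in function.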

### Now let's wrap this into one function

In [131]:
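def cross_entropy(nn_output, y_labels):
    # log_softmax is numerically more stable than torch.log(F.softmax(...))
    log_activations = F.log_softmax(nn_output, dim = 1)
    # pass num_classes explicitly so the shapes match even if a class is missing from the batch
    y_labels_oh = F.one_hot(y_labels, num_classes = nn_output.shape[1])
    loss = -torch.sum(y_labels_oh * log_activations)
    avg_loss = loss / len(y_labels_oh)
    return avg_loss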

In [135]:
# random output from a neural network for 4 training examples
# We will have 3 classes
outputs = torch.randn(4, 3)
print('Sample output for 4 training examples:\n', outputs)

# suppose these are true labels
y_labels = torch.tensor([2, 0, 1, 1]) # 4 training examples
print('True labels:', y_labels)

loss = cross_entropy(outputs, y_labels)
print('Loss:', loss)

print('Correct loss using built-in function:', F.cross_entropy(outputs, y_labels))

Sample output for 4 training examples:
 tensor([[ 0.8414,  0.3765,  1.1708],
        [-0.4216, -1.7063, -1.3469],
        [ 0.9570, -2.6469, -1.8086],
        [-0.1210,  0.4247,  0.9643]])
True labels: tensor([2, 0, 1, 1])
Loss: tensor(1.5431)
Correct loss using built-in function: tensor(1.5431)


In [144]:
# Sanity check: average loss over 2 examples when the true class gets probability 1/3 each time
-2*torch.log(torch.tensor(1/3.))/2

tensor(1.0986)
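
The value 1.0986 is log(3), the loss of a model that spreads its probability evenly over 3 classes. The same number can be reproduced with the built-in function by feeding it all-zero scores, as in this small sketch (the labels are arbitrary and chosen for the example).

```python
import math
import torch
import torch.nn.functional as F

num_classes = 3
uniform_logits = torch.zeros(2, num_classes)  # equal scores -> probability 1/3 per class
labels = torch.tensor([0, 2])                 # any labels give the same loss here

loss = F.cross_entropy(uniform_logits, labels)
print(loss.item(), math.log(num_classes))
```

This makes log(K) a handy reference point: an untrained classifier with K classes usually starts near this loss.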

# SoftmaxRegression Model using PyTorch
It has a single linear layer with 3 output neurons (one per class) and no hidden layer.

![ss](softmax-regression-model.png)

Softmax regression (also called multinomial logistic regression) is a generalization of logistic regression: the only difference is that it can predict the probability of more than two classes. It uses the softmax function as the hypothesis function. The softmax function is a generalization of the logistic function that "squashes" a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range (0, 1) that add up to 1.

In [82]:
class SoftmaxRegression(nn.Module):
    def __init__(self, num_features) -> None:
        super().__init__()
        self.linear = nn.Linear(num_features, 3)

    def forward(self, x):
        # dim = -1 normalizes over the class dimension, both for a single example
        # (output shape (3,)) and for a batch of examples (output shape (N, 3))
        return F.softmax(self.linear(x), dim = -1)
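
When training with `F.cross_entropy`, the model is usually made to return the raw scores (logits), because `F.cross_entropy` applies log-softmax internally. The sketch below shows one possible training loop under that assumption. `SoftmaxRegressionLogits`, the random data, the learning rate and the number of steps are all made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxRegressionLogits(nn.Module):
    """Hypothetical variant that returns raw logits instead of probabilities."""
    def __init__(self, num_features: int, num_classes: int = 3) -> None:
        super().__init__()
        self.linear = nn.Linear(num_features, num_classes)

    def forward(self, x):
        return self.linear(x)  # no softmax here; the loss function applies it

torch.manual_seed(0)
X = torch.randn(30, 4)            # 30 random examples with 4 features
y = torch.randint(0, 3, (30,))    # random labels from 3 classes

model = SoftmaxRegressionLogits(num_features=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(X), y)  # expects logits and integer labels
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Probabilities are only needed at prediction time
with torch.no_grad():
    probs = F.softmax(model(X), dim=1)

print('First loss:', losses[0], 'Last loss:', losses[-1])
```

Since the labels here are random, the loss only drops a little; the point is the shape of the loop, not the accuracy.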

# Multilayer Neural Networks
![](multilayer-perceptron.png)

1) Why do we need non-linear activation functions?

Because without an activation function (a non-linearity), stacking layers gains nothing: a composition of linear layers is itself a linear function of the input, so the whole network would be equivalent to a single linear model. Linear models have limited power and can only learn linear relationships between the inputs and the output. To learn non-linear relationships we need non-linear activation functions.

We can use the logistic (sigmoid) activation, but nowadays people mostly use ReLU (Rectified Linear Unit), which is just max(0, x). It is very simple and works very well in practice.
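
The claim about linear layers can be checked numerically. In the sketch below, two stacked `nn.Linear` layers without an activation give the same output as one linear map built from their weights; the layer sizes and variable names are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    # Two linear layers applied one after another, no activation in between
    stacked = layer2(layer1(x))

    # The same computation folded into a single weight matrix and bias
    W = layer2.weight @ layer1.weight               # shape (3, 4)
    b = layer2.weight @ layer1.bias + layer2.bias   # shape (3,)
    collapsed = x @ W.T + b

    # With ReLU in between, the network is no longer this single linear map
    with_relu = layer2(torch.relu(layer1(x)))

print('Stacked equals single linear map:', torch.allclose(stacked, collapsed, atol=1e-5))
print('ReLU version equals it too:      ', torch.allclose(with_relu, collapsed, atol=1e-5))
```

The first check should print True and the second False, which is the whole reason for putting a non-linearity between layers.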

---

2) Wide vs Deep Neural Networks

Wide neural networks have more neurons in a single layer, while deep neural networks have more layers. Neither is automatically larger: the number of parameters depends on the actual layer sizes. Any network with too many parameters for the available data is prone to overfitting, and one with too few is prone to underfitting. Deep neural networks are also harder to train, because they can suffer from vanishing and exploding gradients, so we need to use some techniques to avoid these problems.
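
To get a feel for how layer sizes drive the parameter count, the sketch below counts the parameters of one wide and two deep networks. The architectures and the helper `count_parameters` are made up for this comparison.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of weights and biases in the model (hypothetical helper)."""
    return sum(p.numel() for p in model.parameters())

# One wide hidden layer
wide = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 3))

# Three narrow hidden layers
deep_narrow = nn.Sequential(
    nn.Linear(10, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 3),
)

# Three larger hidden layers
deep_large = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

print(count_parameters(wide))         # 1403
print(count_parameters(deep_narrow))  # 1123
print(count_parameters(deep_large))   # 9219
```

The deep network can have fewer or more parameters than the wide one, depending only on the sizes chosen.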

---

3) Why do we initialize weights randomly and not just zero?

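This is because if we initialized everything to zero, all the neurons in a hidden layer would compute the same output and receive the same gradient update, so they would stay identical throughout training. The hidden layer would then behave as if it had just one neuron. So we need to initialize the weights randomly to break this symmetry.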

![](random-weight-init.png)

That is why we initialize weights to small random values, so that each neuron can live its own life :)
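
The symmetry problem can be observed directly. The sketch below sets every weight and bias of a small network to zero, trains it for a few steps on random data, and then looks at the hidden layer. The network size, data, learning rate and step count are arbitrary choices for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(16, 4)
y = torch.randint(0, 3, (16,))

# 4 inputs -> 5 hidden neurons (sigmoid) -> 3 classes
model = nn.Sequential(nn.Linear(4, 5), nn.Sigmoid(), nn.Linear(5, 3))

# Zero initialization of every weight and bias
for p in model.parameters():
    nn.init.zeros_(p)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
for _ in range(20):
    optimizer.zero_grad()
    F.cross_entropy(model(X), y).backward()
    optimizer.step()

# One row per hidden neuron: after training, all rows are still identical
hidden_weights = model[0].weight.detach()
print(hidden_weights)
```

All 5 rows come out the same, so the 5 hidden neurons carry no more information than a single one would.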