## Objectives:

- Quickly Understand Backpropagation in a Single Neuron
- Implement Multi-Layer Perceptrons with Backpropagation algorithm using PyTorch

\

## Prerequisites:

1. Perceptrons (Single neuron)

2. Backpropagation and Computational Graphs

3. Basic Knowledge of PyTorch (and/or TensorFlow)


    Reference: https://github.com/SunlightWings/Machine-Learning-and-Data-Mining-/blob/main/ANN/PyTorch_ANN.ipynb

\


# Part A: Understanding Backpropagation

## Computational Graphs:

Components:
  * Nodes: Represent operations or variables.
  * Edges: Represent the flow of data and carry weights.

**Illustration**:

Let's consider a simple pair of equations, represented as a computational graph below:

<p align="center">s = 2x</p>
<p align="center">y = s + a</p>

<center>
<img src="https://doc.google.com/a/fusemachines.com/uc?id=1_WwWjWXvT6b31TS402a9tVe7EjZw_Exu">

Figure 1: Computational graph
</center>
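As a quick check of this graph, the sketch below builds it with PyTorch tensors and lets autograd walk the edges backwards. The values `x = 3.0` and `a = 1.0` are arbitrary assumptions chosen only for illustration.

```python
import torch

# Leaf nodes of the graph (values are illustrative assumptions)
x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor(1.0, requires_grad=True)

s = 2 * x      # intermediate node: s = 2x
y = s + a      # output node: y = s + a

y.backward()   # traverse the graph backwards, starting from y

print(y.item())       # 7.0
print(x.grad.item())  # dy/dx = 2.0
print(a.grad.item())  # dy/da = 1.0
```

Each gradient is the product of the local derivatives along the path from `y` back to the leaf, which is the idea the rest of Part A develops by hand.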


**Illustration**:
Problem: Logistic regression with two input variables $x_1$ and $x_2$.

* $z$ is the weighted sum of the inputs plus the bias,
* $\hat{y}$ is the output of the sigmoid function, and
* $y$ is the label for a training example; $\hat{y}$ and $y$ are finally fed to the binary cross-entropy loss.

<center>
    <img src="https://doc.google.com/a/fusemachines.com/uc?export=download&id=14MXS74SqI0ouBzvtwgyLTA8v93g66mRG" alt="ford_prop" height="400" width = "900">

Figure 3: Computational Graph for Logistic regression
</center>

We have,

$$z = 1 * w_0 + x_1 * w_1 + x_2 * w_2$$

$$\hat{y} = \frac{1}{1 + e^{-(1 * w_0 + x_1 * w_1 + x_2 * w_2)}}$$

$$Loss (L) = - ( y \log{\hat{y}} + (1 - y) \log{(1-\hat{y})})$$
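To make the three equations concrete, here is a minimal forward pass in plain Python. The inputs, weights, and label are assumed values picked so that the arithmetic is easy to follow; they do not come from the figure.

```python
import math

# Illustrative values (assumptions, not taken from the figure)
x1, x2 = 1.0, 2.0
w0, w1, w2 = 0.5, -1.0, 0.25
y = 1.0

z = 1 * w0 + x1 * w1 + x2 * w2                                  # weighted sum plus bias
y_hat = 1.0 / (1.0 + math.exp(-z))                              # sigmoid
loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))   # binary cross entropy

print(z, y_hat, loss)
```

With these values `z` is 0, so the sigmoid outputs 0.5 and the loss equals `log(2)`.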

## Backprop Using Chain Rule:

<center>
    <img src="https://doc.google.com/a/fusemachines.com/uc?export=download&id=1RFe8TVQyTZcg41W1Gc64rbE3u_VX2GfU" alt="chain_rule" height="350">

Figure 5: Chain Rule for Logistic Regression.
</center>


1. **Known Gradients** = Local Gradients = $\frac{\partial{L}}{\partial{\hat{y}}}$, $\frac{\partial{\hat{y}}}{\partial{z}}$, ( $\frac{\partial{z}}{\partial{w_0}}$, $\frac{\partial{z}}{\partial{w_1}}$, $\frac{\partial{z}}{\partial{w_2}}$ )

2. **To Find** = Final Loss with respect to local inputs (weights) = $\frac{\partial{L}}{\partial{\hat{y}}}$, $\frac{\partial{L}}{\partial{z}}$, ( $\frac{\partial{L}}{\partial{w_0}}$, $\frac{\partial{L}}{\partial{w_1}}$, $\frac{\partial{L}}{\partial{w_2}}$ )

3. **Chain Rule** = Use point (1) to calculate point (2) as shown in Figure (5).


## Calculation:

**Task1: Calculate the gradients**

Let's calculate all the local gradients.

1. Loss w.r.t. output:

$\frac{\partial{L}}{\partial{\hat{y}}}=\frac{\partial{(-(y\log{\hat{y}}+(1-y)\log{(1-\hat{y})}))}}{\partial{\hat{y}}}$

$\therefore \frac{\partial{L}}{\partial{\hat{y}}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})$

\

2. Output w.r.t. z:

$\frac{\partial{\hat{y}}}{\partial{z}}=\frac{\partial({\frac{1}{1+e^{-z}}})}{\partial{z}}$        (Since, $\hat{y}=\frac{1}{1+e^{-z}}$)

$\frac{\partial{\hat{y}}}{\partial{z}}=\frac{(1+e^{-z})*\frac{\partial{(1)}}{\partial{z}}-\frac{\partial{(1+e^{-z})}}{\partial{z}}*1}{(1+e^{-z})^2}$

$\frac{\partial{\hat{y}}}{\partial{z}}=\frac{0-(-e^{-z})}{(1+e^{-z})^2}$

$\frac{\partial{\hat{y}}}{\partial{z}}=\frac{e^{-z}}{(1+e^{-z})^2}$

$\frac{\partial{\hat{y}}}{\partial{z}}=\frac{1}{(1+e^{-z})}*\frac{e^{-z}}{(1+e^{-z})}$

$\frac{\partial{\hat{y}}}{\partial{z}}=\hat{y}*(1-\frac{1}{(1+e^{-z})})$

$\therefore \frac{\partial{\hat{y}}}{\partial{z}}=\hat{y}*(1-\hat{y})$
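The result $\hat{y}(1-\hat{y})$ can be checked numerically against autograd. This is a small sanity check, assuming three arbitrary test points for `z`.

```python
import torch

z = torch.tensor([-2.0, 0.0, 1.5], requires_grad=True)  # arbitrary test points
y_hat = torch.sigmoid(z)

# Summing lets one backward() call return d(y_hat_i)/d(z_i) for every i,
# because each output depends only on its own input.
y_hat.sum().backward()

analytic = (y_hat * (1 - y_hat)).detach()   # the formula derived above
print(z.grad)
print(analytic)
```

Both printed tensors should agree up to floating-point rounding.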

\

3. z w.r.t. input weights:

$\frac{\partial{z}}{\partial{w_0}}=\frac{\partial{(w_0+x_1*w_1+x_2*w_2)}}{\partial{w_0}}$

$\therefore \frac{\partial{z}}{\partial{w_0}}=1$

$\therefore \frac{\partial{z}}{\partial{w_1}}=x_1$

$\therefore \frac{\partial{z}}{\partial{w_2}}=x_2$

* **Note**: The gradient of `z` with respect to each weight is the input that the weight multiplies (and 1 for the bias weight $w_0$), as seen just now. This is because `z` is a weighted sum of the inputs.

\

### Chain Rule using the local gradients:

$\therefore\frac{\partial{L}}{\partial{\hat{y}}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})$

$\therefore\frac{\partial{L}}{\partial{z}}=\frac{\partial{L}}{\partial{\hat{y}}}*\frac{\partial{\hat{y}}}{\partial{z}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})*\hat{y}*(1-\hat{y})$


$\therefore\frac{\partial{L}}{\partial{w_0}}=\frac{\partial{L}}{\partial{\hat{y}}}*\frac{\partial{\hat{y}}}{\partial{z}}*\frac{\partial{z}}{\partial{w_0}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})*\hat{y}(1-\hat{y})*1$

$\therefore\frac{\partial{L}}{\partial{w_1}}=\frac{\partial{L}}{\partial{\hat{y}}}*\frac{\partial{\hat{y}}}{\partial{z}}*\frac{\partial{z}}{\partial{w_1}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})*\hat{y}(1-\hat{y})*x_1$

$\therefore \frac{\partial{L}}{\partial{w_2}}=\frac{\partial{L}}{\partial{\hat{y}}}*\frac{\partial{\hat{y}}}{\partial{z}}*\frac{\partial{z}}{\partial{w_2}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})*\hat{y}(1-\hat{y})*x_2$
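
Multiplying out the first two factors shows why the sigmoid pairs so well with binary cross entropy. The lines below are only an algebraic simplification of the expressions above and introduce no new assumptions:

$\frac{\partial{L}}{\partial{z}}=-(\frac{y}{\hat{y}}-\frac{(1-y)}{(1-\hat{y})})*\hat{y}*(1-\hat{y})=-(y(1-\hat{y})-(1-y)\hat{y})=\hat{y}-y$

$\therefore\frac{\partial{L}}{\partial{w_0}}=(\hat{y}-y)*1, \quad \frac{\partial{L}}{\partial{w_1}}=(\hat{y}-y)*x_1, \quad \frac{\partial{L}}{\partial{w_2}}=(\hat{y}-y)*x_2$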


### Weight Update:
The main objective of backpropagation is to provide the gradients needed to update the weights so that the model may "learn" something.

Now, we update the weights as follows:

$new \space w = old \space w + \Delta w = old \space w + \eta (- \frac{\partial L}{\partial w})$

where, $\eta$ is the learning rate.

For example,

$$new \space w_0 = old \space w_0 + \Delta w_0 = old \space w_0 + \eta (- \frac{\partial L}{\partial w_0})$$
$$new \space w_1 = old \space w_1 + \Delta w_1 = old \space w_1 + \eta (- \frac{\partial L}{\partial w_1})$$

$$new \space w_2 = old \space w_2 + \Delta w_2 = old \space w_2 + \eta (- \frac{\partial L}{\partial w_2})$$
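
The whole procedure (forward pass, local gradients, chain rule, update) can be sketched for one training example as follows. The inputs, initial weights, label, and learning rate `eta = 0.1` are assumptions chosen for illustration, and the helper `forward` is a hypothetical name.

```python
import math

def forward(w, x, y):
    """Return (y_hat, loss) for one logistic neuron; x[0] = 1 is the bias input."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return y_hat, loss

# Illustrative values (assumptions)
x = [1.0, 1.0, 2.0]      # [bias input, x1, x2]
w = [0.5, -1.0, 0.25]    # [w0, w1, w2]
y = 1.0
eta = 0.1                # learning rate

y_hat, loss_before = forward(w, x, y)

# Chain rule: dL/dw_i = dL/dy_hat * dy_hat/dz * dz/dw_i
dL_dyhat = -(y / y_hat - (1 - y) / (1 - y_hat))
dyhat_dz = y_hat * (1 - y_hat)
grads = [dL_dyhat * dyhat_dz * xi for xi in x]

# new w = old w + eta * (-dL/dw)
w_new = [wi + eta * (-g) for wi, g in zip(w, grads)]
_, loss_after = forward(w_new, x, y)

print(grads)
print(loss_before, loss_after)
```

For this example the loss after the update is smaller than before, which is what a step against the gradient is meant to achieve when the learning rate is small enough.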

### Implementation of Backpropagation from Scratch:

Link: https://github.com/SunlightWings/Machine-Learning-and-Data-Mining-/blob/main/ANN/Backpropagation_and_Computational_Graph.ipynb


# Part B: MLP with Backpropagation (With PyTorch):

Autograd in PyTorch is used for rapid computation of multiple derivatives (gradients).

A machine learning model can be defined as:

$\vec{y} = \vec{M}(\vec{x})$

where,

$\vec{x}$ = *i*-dimensional input vector

$\vec{M}$ = Vector-valued function (function that outputs a vector $\vec{y}$, and not a scalar)

Meanwhile, the loss function L($\vec{y}$) = L($\vec{M}$($\vec{x}$)) is a scalar-valued function of the model's output: it maps the output vector to a single number.

Goal: Minimize the loss function.
      Ideally, reach a point where its gradient w.r.t. the inputs being optimized (the weights, during training) is 0.

For a function $\vec{y}=f(\vec{x})$, with n-dimensional input and m-dimensional output, the complete gradient is a matrix of the derivative of every output with respect to every input, called the **Jacobian:**

$$\begin{aligned}
J
=
\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)
\end{aligned}$$

For a second function, $l=g\left(\vec{y}\right)$, that takes an m-dimensional input and produces a scalar output (like a loss function), the gradient with respect to $\vec{y}$ can be expressed as a column vector,
$v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$

Now think of $J$ as the Jacobian of the PyTorch model and $v$ as the gradient of the loss function. Multiplying the transpose $J^{T}$ by $v$ gives a column vector that contains the gradient of the loss with respect to the inputs, which is exactly what we want:


$$\begin{aligned}
J^{T}\cdot v=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\left(\begin{array}{c}
\frac{\partial l}{\partial y_{1}}\\
\vdots\\
\frac{\partial l}{\partial y_{m}}
\end{array}\right)=\left(\begin{array}{c}
\frac{\partial l}{\partial x_{1}}\\
\vdots\\
\frac{\partial l}{\partial x_{n}}
\end{array}\right)
\end{aligned}$$
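
The product of the transposed Jacobian with $v$ is what `backward` computes when it is given a vector argument. The sketch below compares it with an explicitly built Jacobian; the toy function `f` and the values of `x` and `v` are assumptions made up for this illustration.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Toy vector-valued function R^3 -> R^2 (an assumption for illustration)
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, -1.0])        # stands in for dl/dy

J = jacobian(f, x.detach())          # shape (2, 3): rows are outputs, columns are inputs

y = f(x)
y.backward(v)                        # autograd computes J^T v without building J

print(J.T @ v)
print(x.grad)
```

Autograd never materializes the full Jacobian during `backward`; building it here with `jacobian` is only for comparison.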

## Building Network Architecture:
Let's begin by building the neural network architecture.
The architecture looks something like this:


<center>
<img src="https://drive.google.com/uc?export=view&id=18Z0NofFLsTIxW4zBdILxcMEczhd113c4" alt="ANN" height="550" width="650">

Figure 1: Neural Network with two hidden layers
</center>

In the above figure, the layers with the number of neurons are:
* First layer = 10 neurons
* Second layer = 16 neurons
* Third layer = 8 neurons
* Fourth layer = 1 neuron

In [1]:
import torch
import torch.nn as nn

In [2]:
linear1 = nn.Linear(10,16)
linear2 = nn.Linear(16,8)
linear3 = nn.Linear(8,1)

In [3]:
linear1, linear2, linear3  # print

(Linear(in_features=10, out_features=16, bias=True),
 Linear(in_features=16, out_features=8, bias=True),
 Linear(in_features=8, out_features=1, bias=True))

In [4]:
model = nn.Sequential(linear1, linear2, linear3)

In [5]:
model

Sequential(
  (0): Linear(in_features=10, out_features=16, bias=True)
  (1): Linear(in_features=16, out_features=8, bias=True)
  (2): Linear(in_features=8, out_features=1, bias=True)
)

### Data preparation:

Let's define some random data for now.

Note that in an actual application, this is one of the most important steps.

It is usually done by making a 'Dataset' class.

In [6]:
torch.manual_seed(42)
input1 = torch.randn(10)   # input to the model (fixed set of random numbers for reproducibility)
input1

tensor([ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229, -0.1863,  2.2082, -0.6380,
         0.4617,  0.2674])

### Forward pass:
Output from the model

In [7]:
output = model(input1)
print(output)

tensor([-0.4179], grad_fn=<ViewBackward0>)


In [8]:
output.shape

torch.Size([1])

**Task2: Manually do the forward pass through each layer and print them, along with their shape**

In [9]:
## TO DO
output1 = linear1(input1)
print(output1)

tensor([-1.0438, -0.5736, -0.6513, -0.3797, -0.8882, -0.3074, -0.2861, -0.4009,
         0.0018, -0.1239, -0.3237, -1.1705,  0.0145,  0.5963, -0.2574, -0.2240],
       grad_fn=<ViewBackward0>)


In [10]:
output1.shape

torch.Size([16])

In [11]:
## TO DO



## With activation function:

Activation functions add non-linearity to the network.

Without them, the whole network collapses to a single linear layer (a single neuron when there is one output), no matter how many layers are stacked.

**Task3: Discuss why**
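
One way to approach this discussion is to check numerically that two stacked linear layers with no activation in between behave like one linear layer. The sketch below folds the two weight matrices into a single layer; the layer sizes mirror the network above, and the batch of 5 random inputs is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1, lin2 = nn.Linear(10, 16), nn.Linear(16, 8)
stacked = nn.Sequential(lin1, lin2)        # no activation in between

# Fold the two layers into one: W = W2 W1, b = W2 b1 + b2
merged = nn.Linear(10, 8)
with torch.no_grad():
    merged.weight.copy_(lin2.weight @ lin1.weight)
    merged.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.randn(5, 10)
print(torch.allclose(stacked(x), merged(x), atol=1e-5))
```

Inserting a non-linear activation between `lin1` and `lin2` breaks this equivalence, because the composition can no longer be written as a single matrix product.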

Let's create a new model (network) with 2 neurons in the final layer.

In [12]:
linear1 = nn.Linear(10,16)
linear2 = nn.Linear(16,8)
linear3 = nn.Linear(8,2)

model2 = nn.Sequential(
    linear1,
    nn.LeakyReLU(0.02),
    linear2,
    nn.Sigmoid(),
    linear3,
    nn.Softmax(dim = -1)
)

In [13]:
model2

Sequential(
  (0): Linear(in_features=10, out_features=16, bias=True)
  (1): LeakyReLU(negative_slope=0.02)
  (2): Linear(in_features=16, out_features=8, bias=True)
  (3): Sigmoid()
  (4): Linear(in_features=8, out_features=2, bias=True)
  (5): Softmax(dim=-1)
)

**Task4: Do forward pass on this new model**

In [14]:
## TO DO
output2 = model2(input1)
output2

tensor([0.4489, 0.5511], grad_fn=<SoftmaxBackward0>)

### One hot encoding:

Until now, we have:
* Input = `input1` = [ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229, -0.1863,  2.2082, -0.6380, 0.4617,  0.2674]
* Output = `output_of_model` = [prob1, prob2]

Now we can create ground truth as:
* GT = `actual_output` = [actual_prob1, actual_prob2]

But in real-world datasets, the ground truth often gives only the target class instead of probabilities, such as:
* GT = `actual_output` = Class 0

**Task4: Answer/Ponder**

In such a case, how do we compare `output_of_model` and `actual_output` to calculate the loss function? They are simply incompatible in shape!

Let's say the ground truth is class 1. The objective is then to turn it into [0, 1] so that it can be compared to [prob1, prob2].

In [15]:
import torch.nn.functional as F

actual_output_class = 0

actual_output_one_hot = F.one_hot(torch.tensor(actual_output_class), num_classes=2).float()
print("One-hot encoded ground truth:", actual_output_one_hot)

One-hot encoded ground truth: tensor([1., 0.])
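
The same call works on a whole batch of labels. A minimal sketch, assuming a hypothetical batch of four class indices:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 1, 1, 0])                    # hypothetical batch of class indices
one_hot = F.one_hot(labels, num_classes=2).float()     # shape (4, 2)
print(one_hot)

# argmax along the class dimension recovers the original indices
recovered = one_hot.argmax(dim=-1)
print(recovered)
```

`F.one_hot` expects an integer (`long`) tensor, which is why the result is converted with `.float()` only afterwards.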


**Task5: Check the one hot output after the forward pass on above model2**

In [16]:
output_of_model = model2(input1)
print(output_of_model)
print(actual_output_one_hot)

tensor([0.4489, 0.5511], grad_fn=<SoftmaxBackward0>)
tensor([1., 0.])


### Loss function:

In [17]:
criteria = nn.CrossEntropyLoss()                           # this is defined before the training loop
criteria

CrossEntropyLoss()

In [18]:
loss = criteria(output_of_model, actual_output_one_hot)      # this will be done in the training loop.
loss

tensor(0.7456, grad_fn=<DivBackward1>)
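
A point worth knowing here: `nn.CrossEntropyLoss` applies log-softmax internally, so it expects raw, unnormalized scores (logits). Because `model2` already ends with `nn.Softmax`, the softmax is effectively applied twice, and the 0.7456 above is the negative log-softmax of the already normalized `[0.4489, 0.5511]`. The sketch below uses made-up logits to show what the loss computes, that a class index and a one-hot target give the same value (probability targets require a recent PyTorch version), and how passing probabilities changes the result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

logits = torch.tensor([[1.2, -0.3]])          # hypothetical raw scores, no softmax applied
target_index = torch.tensor([0])              # target given as a class index
target_one_hot = torch.tensor([[1.0, 0.0]])   # same target as class probabilities

loss_index = criterion(logits, target_index)
loss_one_hot = criterion(logits, target_one_hot)
manual = -F.log_softmax(logits, dim=-1)[0, 0]  # what the loss computes internally

# Feeding probabilities instead of logits applies softmax a second time
probs = F.softmax(logits, dim=-1)
loss_double = criterion(probs, target_index)

print(loss_index.item(), loss_one_hot.item(), manual.item(), loss_double.item())
```

The first three values agree, while the last one differs. Dropping the final `nn.Softmax` from the model and applying softmax only when probabilities are needed for reporting is a common way to avoid the double application.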

In [None]:
loss.backward(retain_graph = True)

### Weights and gradients:

**Task6: Print them**

In [46]:
# Accessing parameters of specific layers
print(model2)
print("Linear1 weights:", model2[0].weight)
print("Linear1 biases:", model2[0].bias)

Sequential(
  (0): Linear(in_features=10, out_features=16, bias=True)
  (1): LeakyReLU(negative_slope=0.02)
  (2): Linear(in_features=16, out_features=8, bias=True)
  (3): Sigmoid()
  (4): Linear(in_features=8, out_features=2, bias=True)
  (5): Softmax(dim=-1)
)
Linear1 weights: Parameter containing:
tensor([[-0.1457, -0.0371, -0.1284,  0.2098, -0.2496, -0.1458, -0.0893, -0.1901,
          0.0298, -0.3123],
        [ 0.2856, -0.2686,  0.2441,  0.0526, -0.1027,  0.1954,  0.0493,  0.2555,
          0.0346, -0.0997],
        [ 0.0850, -0.0858,  0.1331,  0.2823,  0.1828, -0.1382,  0.1825,  0.0566,
          0.1606, -0.1927],
        [-0.3130, -0.1222, -0.2426,  0.2595,  0.0911,  0.1310,  0.1000, -0.0055,
          0.2475, -0.2247],
        [ 0.0199, -0.2158,  0.0975, -0.1089,  0.0969, -0.0659,  0.2623, -0.1874,
         -0.1886, -0.1886],
        [ 0.2844,  0.1054,  0.3043, -0.2610, -0.3137, -0.2474, -0.2127,  0.1281,
          0.1132,  0.2628],
        [-0.1633, -0.2156,  0.1678, -0.1278,

In [21]:
## TO DO:

# Access the gradients of the first layer's parameters
print("Gradients of Linear1 weights:", model2[0].weight.grad)
print("Gradients of Linear1 biases:", model2[0].bias.grad)

Gradients of Linear1 weights: tensor([[-1.1016e-03, -4.2145e-04, -7.6714e-04, -7.5363e-04,  3.6739e-03,
          6.0965e-04, -7.2250e-03,  2.0875e-03, -1.5105e-03, -8.7474e-04],
        [-7.7563e-04, -2.9674e-04, -5.4013e-04, -5.3062e-04,  2.5867e-03,
          4.2924e-04, -5.0870e-03,  1.4697e-03, -1.0635e-03, -6.1589e-04],
        [ 1.8456e-04,  7.0609e-05,  1.2852e-04,  1.2626e-04, -6.1551e-04,
         -1.0214e-04,  1.2105e-03, -3.4973e-04,  2.5307e-04,  1.4655e-04],
        [ 1.1969e-05,  4.5789e-06,  8.3346e-06,  8.1878e-06, -3.9915e-05,
         -6.6235e-06,  7.8496e-05, -2.2679e-05,  1.6411e-05,  9.5037e-06],
        [-3.0357e-03, -1.1614e-03, -2.1140e-03, -2.0767e-03,  1.0124e-02,
          1.6800e-03, -1.9910e-02,  5.7524e-03, -4.1624e-03, -2.4105e-03],
        [-3.0969e-03, -1.1848e-03, -2.1566e-03, -2.1186e-03,  1.0328e-02,
          1.7139e-03, -2.0311e-02,  5.8683e-03, -4.2463e-03, -2.4591e-03],
        [-6.3868e-04, -2.4434e-04, -4.4476e-04, -4.3693e-04,  2.1300e-03,
  

### Weight update:

We could try to update the weights of the first layer using the formula:

`model2[0].weight = model2[0].weight - learning_rate * model2[0].weight.grad`.

But there are two reasons we don't do so:

1. PyTorch doesn't allow assigning a plain tensor to a module attribute that is registered as a parameter; the assignment raises a `TypeError`.

  Thus, we can instead update the parameter in place, inside `torch.no_grad()` so that the update itself is not tracked by autograd:

  `model2[0].weight -= learning_rate * model2[0].weight.grad`

2. Even though the above update works, it has to be repeated by hand for every parameter, and non-convex functions don't converge easily, so we need a proper optimizer to update the weights.

  Thus we may use one of the most popular, simple and effective optimizers, called `SGD`.
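
As a sketch of how `SGD` relates to the manual update, the example below applies one step both ways to two identical copies of a small layer. The layer size, the random batch, and the targets are assumptions made for this comparison; momentum and weight decay are left at their defaults of 0.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer_manual = nn.Linear(10, 2)               # small stand-in layer
layer_sgd = copy.deepcopy(layer_manual)       # identical starting weights
learning_rate = 0.001

x = torch.randn(4, 10)
target = torch.tensor([0, 1, 1, 0])
criterion = nn.CrossEntropyLoss()

# Manual in-place update of every parameter
criterion(layer_manual(x), target).backward()
with torch.no_grad():
    for p in layer_manual.parameters():
        p -= learning_rate * p.grad

# The same step through the optimizer
optimizer = optim.SGD(layer_sgd.parameters(), lr=learning_rate)
optimizer.zero_grad()
criterion(layer_sgd(x), target).backward()
optimizer.step()

print(torch.allclose(layer_manual.weight, layer_sgd.weight))
print(torch.allclose(layer_manual.bias, layer_sgd.bias))
```

The optimizer handles every parameter it was given in one `step()` call, and options such as momentum can be switched on without changing the training code.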

In [22]:
learning_rate = 0.001

In [23]:
with torch.no_grad():
    model2[0].weight -= learning_rate * model2[0].weight.grad    # updation using the above point number 1.
    model2[0].bias -= learning_rate * model2[0].bias.grad

In [24]:
# Access the gradients of the first layer's parameters
print("Gradients of Linear1 weights:", model2[0].weight.grad)
print("Gradients of Linear1 biases:", model2[0].bias.grad)

Gradients of Linear1 weights: tensor([[-1.1016e-03, -4.2145e-04, -7.6714e-04, -7.5363e-04,  3.6739e-03,
          6.0965e-04, -7.2250e-03,  2.0875e-03, -1.5105e-03, -8.7474e-04],
        [-7.7563e-04, -2.9674e-04, -5.4013e-04, -5.3062e-04,  2.5867e-03,
          4.2924e-04, -5.0870e-03,  1.4697e-03, -1.0635e-03, -6.1589e-04],
        [ 1.8456e-04,  7.0609e-05,  1.2852e-04,  1.2626e-04, -6.1551e-04,
         -1.0214e-04,  1.2105e-03, -3.4973e-04,  2.5307e-04,  1.4655e-04],
        [ 1.1969e-05,  4.5789e-06,  8.3346e-06,  8.1878e-06, -3.9915e-05,
         -6.6235e-06,  7.8496e-05, -2.2679e-05,  1.6411e-05,  9.5037e-06],
        [-3.0357e-03, -1.1614e-03, -2.1140e-03, -2.0767e-03,  1.0124e-02,
          1.6800e-03, -1.9910e-02,  5.7524e-03, -4.1624e-03, -2.4105e-03],
        [-3.0969e-03, -1.1848e-03, -2.1566e-03, -2.1186e-03,  1.0328e-02,
          1.7139e-03, -2.0311e-02,  5.8683e-03, -4.2463e-03, -2.4591e-03],
        [-6.3868e-04, -2.4434e-04, -4.4476e-04, -4.3693e-04,  2.1300e-03,
  

# Part C: Standard Pipeline

## Dataset

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=10, n_redundant=0, n_informative=3, n_clusters_per_class=2, n_classes=2)

In [52]:
print(X.shape)
print(y.shape)
# print(y[:50])

(5000, 10)
(5000,)


In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [28]:
import numpy as np
print(np.unique(y_train))

[0 1]


In [29]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4000, 10)
(4000,)
(1000, 10)
(1000,)


In [30]:
from torch.utils.data import Dataset
class CustomDataset(Dataset):

    # method to initialize dataset, and conversion to tensor
    def __init__(self, X_train, y_train):
        self.X = torch.from_numpy(X_train.astype(np.float32))
        self.y = torch.from_numpy(y_train).type(torch.LongTensor)
        self.len = self.X.shape[0]

    # method to return a single sample from the dataset given an index.
    def __getitem__(self, index):
        return self.X[index], self.y[index]

    # method to return the number of samples in the dataset.
    def __len__(self):
        return self.len


In [31]:
print(X_train[25])                                     # simple access

traindata = CustomDataset(X_train, y_train)
print(traindata[25])                                   # access with above class' method

[ 0.33787199  0.05870305  0.14166288 -0.57210746 -1.07297266 -0.28239403
  0.25714039 -0.64167298  1.80071396  0.93124613]
(tensor([ 0.3379,  0.0587,  0.1417, -0.5721, -1.0730, -0.2824,  0.2571, -0.6417,
         1.8007,  0.9312]), tensor(1))


In [32]:
from torch.utils.data import DataLoader

trainloader = DataLoader(traindata, batch_size = 16)

In [33]:
# Iterate through the TrainLoader

for batch_X, batch_y in trainloader:
    print(batch_X.shape, batch_y.shape)

torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([16, 10]) torch.Size([16])
torch.Size([
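
The loop above prints one line per batch; with 4000 training samples and `batch_size = 16` the loader yields 250 batches per epoch. The sketch below shows the same mechanics on a small stand-in dataset, where the number of samples is not a multiple of the batch size. The data here is random, and `TensorDataset` is used instead of `CustomDataset` only to keep the example self-contained.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with the same feature width as the tutorial (values are random)
features = torch.randn(50, 10)
labels = torch.randint(0, 2, (50,))
dataset = TensorDataset(features, labels)

# shuffle=True reorders the samples at the start of every epoch
loader = DataLoader(dataset, batch_size=16, shuffle=True)

print(len(loader))                                   # batches per epoch: ceil(50 / 16)
batch_sizes = [batch_X.shape[0] for batch_X, _ in loader]
print(batch_sizes)                                   # the last batch holds the remainder
```

Passing `drop_last=True` to `DataLoader` would discard the smaller final batch instead.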

### Model

In [34]:
class Network(nn.Module):

    # Define the layers and all in the constructor.
    def __init__(self):
        super(Network, self).__init__()  # Initialize parent class within the constructor
        self.linear1 = nn.Linear(10, 16)
        self.leaky_relu = nn.LeakyReLU()
        self.linear2 = nn.Linear(16, 8)
        self.sigmoid = nn.Sigmoid()
        self.linear3 = nn.Linear(8, 2)
        self.softmax = nn.Softmax(dim=-1)

    # Method for forward pass
    def forward(self, x):
        x = self.linear1(x)
        x = self.leaky_relu(x)
        x = self.linear2(x)  # Pass input to the linear2 layer
        x = self.sigmoid(x)  # Pass input to the sigmoid layer
        x = self.linear3(x)  # Pass input to the linear3 layer
        x = self.softmax(x)  # Pass input to the softmax layer
        return x

In [35]:
model3 = Network()
print(model3)

Network(
  (linear1): Linear(in_features=10, out_features=16, bias=True)
  (leaky_relu): LeakyReLU(negative_slope=0.01)
  (linear2): Linear(in_features=16, out_features=8, bias=True)
  (sigmoid): Sigmoid()
  (linear3): Linear(in_features=8, out_features=2, bias=True)
  (softmax): Softmax(dim=-1)
)


In [36]:
criterion = nn.CrossEntropyLoss()
print(criterion)

CrossEntropyLoss()


In [37]:
import torch.optim as optim

optimizer = optim.Adam(model3.parameters(), lr = 0.001)
print(optimizer)

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)


## Training loop:

In [38]:
num_epochs = 10
epoch_loss = []

for epoch in range(num_epochs):
  train_loss = 0
  for batch_X, batch_y in trainloader:
    optimizer.zero_grad()
    predicted = model3(batch_X)                      # forward pass in batches
    loss = criterion(predicted, batch_y)
    loss.backward()                                  # gradient calculation with help of loss
    optimizer.step()                                 # weight update

    # Evaluation example
    train_loss += loss.item()

    # After processing all batches, calculate average loss for the epoch
    average_loss = train_loss / len(trainloader)
    epoch_loss.append(average_loss)

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {average_loss}')


Epoch 1/10, Loss: 0.002774604320526123
Epoch 1/10, Loss: 0.00556499981880188
Epoch 1/10, Loss: 0.008438211917877197
Epoch 1/10, Loss: 0.011234872579574584
Epoch 1/10, Loss: 0.01400769019126892
Epoch 1/10, Loss: 0.016783303260803222
Epoch 1/10, Loss: 0.019548776388168335
Epoch 1/10, Loss: 0.022294262886047363
Epoch 1/10, Loss: 0.0250646595954895
Epoch 1/10, Loss: 0.027904199600219726
Epoch 1/10, Loss: 0.03071305775642395
Epoch 1/10, Loss: 0.033452316999435426
Epoch 1/10, Loss: 0.036163713216781614
Epoch 1/10, Loss: 0.03898351550102234
Epoch 1/10, Loss: 0.04169109177589417
Epoch 1/10, Loss: 0.044447808265686034
Epoch 1/10, Loss: 0.047232107639312744
Epoch 1/10, Loss: 0.0500559024810791
Epoch 1/10, Loss: 0.052795119285583496
Epoch 1/10, Loss: 0.05554371023178101
Epoch 1/10, Loss: 0.05831700849533081
Epoch 1/10, Loss: 0.06103218007087707
Epoch 1/10, Loss: 0.06380135369300842
Epoch 1/10, Loss: 0.06657582259178162
Epoch 1/10, Loss: 0.06932265090942383
Epoch 1/10, Loss: 0.07216454982757568
Ep
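
The output above shows many lines for epoch 1 with a steadily growing value, because the averaging and `print` statements are indented inside the batch loop. A sketch of the same loop with per-epoch logging follows; the synthetic data and the compact stand-in model (which outputs raw logits) are assumptions used only to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Small synthetic stand-in for the tutorial's data (assumption)
X = torch.randn(64, 10)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=16)

# Compact stand-in model that outputs raw logits
model = nn.Sequential(nn.Linear(10, 16), nn.LeakyReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 3
epoch_loss = []

for epoch in range(num_epochs):
    train_loss = 0.0
    for batch_X, batch_y in loader:
        optimizer.zero_grad()                    # clear gradients left from the previous step
        loss = criterion(model(batch_X), batch_y)
        loss.backward()                          # compute gradients
        optimizer.step()                         # update weights
        train_loss += loss.item()

    # Runs once per epoch because it sits outside the batch loop
    average_loss = train_loss / len(loader)
    epoch_loss.append(average_loss)
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {average_loss:.4f}')
```

With this indentation `epoch_loss` ends up with exactly one entry per epoch, which makes it suitable for plotting a loss curve.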

## Evaluation:

In [39]:
# save the trained model
PATH = './mymodel.pth'
torch.save(model3.state_dict(), PATH)

In [40]:
# Loading the saved model
model3 = Network()
model3.load_state_dict(torch.load(PATH))

<All keys matched successfully>

In [41]:
testdata = CustomDataset(X_test, y_test)

testloader = DataLoader(testdata, batch_size=4, shuffle=True, num_workers=2)

In [42]:
print(testdata[5])
print(testloader)

(tensor([ 0.6102, -1.8529, -0.3723,  0.7270,  0.7007,  0.6258, -1.2360,  1.4126,
        -3.3738, -1.0379]), tensor(0))
<torch.utils.data.dataloader.DataLoader object at 0x7f220d01b1f0>


In [43]:
# Function to perform inference
def test_model(model, testloader):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():  # Disable gradient calculation
        for batch_X, batch_y in testloader:
            outputs = model(batch_X)  # Perform forward pass
            _, predicted = torch.max(outputs, 1)  # torch.max returns (max value, index); the value is discarded and the index is the predicted class
            total += batch_y.size(0)              # total number of samples so far
            correct += (predicted == batch_y).sum().item()     # (predicted == batch_y) is a boolean tensor; .sum() counts the True entries and .item() converts the count to a Python number

    accuracy = correct / total
    print(f'Test Accuracy: {accuracy * 100:.2f}%')

In [44]:
# Test the model
test_model(model3, testloader)

Test Accuracy: 88.30%
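
The accuracy computation inside `test_model` can be traced on a tiny hand-made batch. The outputs and labels below are hypothetical values chosen so that the result is easy to verify by eye.

```python
import torch

# Hypothetical model outputs for 3 samples and 2 classes
outputs = torch.tensor([[0.9, 0.1],
                        [0.2, 0.8],
                        [0.6, 0.4]])
batch_y = torch.tensor([0, 1, 1])

values, predicted = torch.max(outputs, 1)    # (max value per row, index of that value)
same = torch.equal(predicted, outputs.argmax(dim=1))

correct = (predicted == batch_y).sum().item()
accuracy = correct / batch_y.size(0)
print(predicted, same, accuracy)
```

Two of the three predictions match the labels here, so the accuracy is 2/3. `outputs.argmax(dim=1)` gives the same predictions as the index returned by `torch.max`.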
