# Basics of installing PyTorch (CUDA) in Anaconda
- Open Anaconda Powershell Prompt
- Create a new virtual environment: `conda create -n py312 python=3.12`
- Activate it: `conda activate py312`
- Install PyTorch using conda: `conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`
- Verify the PyTorch installation: `python -c "import torch; print(torch.__version__)"`
- Verify CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
### Alternative PyTorch
- Install PyTorch for CPU: `conda install pytorch torchvision torchaudio cpuonly -c pytorch`

## Tips to redirect your Jupyter Notebook kernel to the new environment
First, activate your virtual environment in the Anaconda Powershell Prompt.
- #### Install ipykernel
conda install ipykernel

- #### Add the environment to Jupyter (e.g. if your virtual environment name is py312)
python -m ipykernel install --user --name=py312

In [2]:
# check that PyTorch is installed; otherwise, follow the steps above to install it

import torch
torch.__version__

'2.5.1'

# Introduction to PyTorch

A tensor can be viewed as a multi-dimensional array. Similar to how an n-dimensional vector is shown as a one-dimensional array with _n_ elements relative to a specific basis, any tensor can be expressed as a multi-dimensional array when referenced to a basis. The individual values within this multi-dimensional structure are referred to as the tensor's components.

The PyTorch library offers multi-dimensional tensor data structures and implements various mathematical functions to manipulate these tensors. It also includes tools for efficient tensor serialisation and for handling arbitrary data types, along with several other practical utilities.

PyTorch shares significant similarities with NumPy, though it uses the term "tensor" instead of "N-dimensional array". For example,

In [3]:
import torch
import numpy as np

array_np = np.array([[1, 2, 3],
                    [4, 5, 6]])
array_pytorch = torch.tensor([[1, 2, 3],
                             [4, 5, 6]])
print(array_np)
print(array_pytorch)

[[1 2 3]
 [4 5 6]]
tensor([[1, 2, 3],
        [4, 5, 6]])
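
To make the similarity a little more concrete, here is a minimal sketch of moving data between the two libraries. It assumes only `torch.from_numpy` and `Tensor.numpy`; the variable names `tensor_shared` and `back_to_np` are ours and purely illustrative.

```python
import numpy as np
import torch

array_np = np.array([[1, 2, 3],
                     [4, 5, 6]])

# torch.from_numpy creates a tensor that shares memory with the NumPy array
tensor_shared = torch.from_numpy(array_np)

# Shape and indexing behave much as they do in NumPy
print(tuple(tensor_shared.shape))   # (2, 3)
print(tensor_shared[1, 2].item())   # 6

# Convert the tensor back to a NumPy array
back_to_np = tensor_shared.numpy()
print(np.array_equal(back_to_np, array_np))  # True
```

Because the memory is shared, an in-place change to `array_np` would also be visible through `tensor_shared`; use `torch.tensor(array_np)` when you want an independent copy.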


A parallel [PyTorch CUDA](https://pytorch.org/docs/2.5/cuda.html) version is also available, allowing you to execute tensor calculations on NVIDIA GPUs that have a compute capability of 3.0 or higher. In this course, however, time is limited, so we will restrict ourselves to pre-defined datasets. If a future project of yours needs CUDA acceleration, install the CUDA version of PyTorch.
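
If you do install the CUDA version later, a common pattern is to pick the device at run time and fall back to the CPU when no GPU is found. The snippet below is only a sketch of that pattern; the names `device`, `x`, and `y` are illustrative and not used elsewhere in this notebook.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the tensor directly on the chosen device
x = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]], device=device)

# The matrix product is computed on that device
y = x @ x

# Move the result back to the CPU before printing or converting to NumPy
print(device.type)
print(y.cpu())
```

The same code runs unchanged on a machine with or without a GPU, which is why this pattern is popular in tutorials.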

## Classes

Classes usually refer to the categories or labels that our neural network is supposed to predict.

Classification can be of two types:
- binary classification (yes or no, malignant or benign, dog or cat, etc.), or
- multi-class classification (cat, dog, or capybara; digit recognition; species of flowers; etc.).

Okay, now that we know the basic terminology and have our libraries set up, let us try building our first network.

For this, we will use the [iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) from scikit-learn.

Before proceeding, make sure to install scikit-learn from your Anaconda Powershell Prompt (for example, with `conda install scikit-learn`).

In [4]:
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns

Note: if you see an error message such as `No module named 'matplotlib'`, open your Anaconda Powershell Prompt and install the missing library from there.

Once it is installed, restart the kernel.

### Step 1: Load and explore the Iris dataset
------------------------------------------
The [Iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) is a classic dataset in machine learning practice containing measurements of sepals and petals from three species of iris flowers.

In [5]:
from sklearn.datasets import load_iris

# load the dataset
iris = load_iris()

# extract features and target classes
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# print to check the overall structure of our dataset
# and also to find how many classes we have

print(f"Dataset dimensions: {X.shape}")
print(f"Target classes: {target_names}")

Dataset dimensions: (150, 4)
Target classes: ['setosa' 'versicolor' 'virginica']


We now know that our dataset has 150 samples and 4 features.
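
It can also help to check how the samples are distributed across the classes. The following is a small sketch using `np.bincount`; the variable name `counts` is ours.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples carry each class label (0, 1, 2)
counts = np.bincount(iris.target)

for name, count in zip(iris.target_names, counts):
    print(f"{name}: {count} samples")
```

Each species contributes 50 samples, so the dataset is balanced.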

### Step 2: Split data into training and testing sets
------------------------------------------

We now divide our data into training and testing sets in an 80:20 ratio. This means we will use 80% of our data for training and 20% for evaluating the model's performance.

In [10]:
# split data into training and testing sets with a seed for reproducibility
# X_train here contains training set for feature data
# y_train here contains target labels for training set, or what we want to predict, or the ground truth

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
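
As a quick sanity check, we can look at the sizes of the resulting sets. The sketch below also passes `stratify=y`, which is an optional extra that the cell above does not use: it asks scikit-learn to keep the class proportions the same in both sets. The `_s` suffix on the variable names is only there to keep them apart from the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same 80:20 split, but stratified by class label (an optional variation)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train_s.shape, X_test_s.shape)   # (120, 4) (30, 4)
print(np.bincount(y_test_s))             # [10 10 10]
```

Without `stratify`, the split sizes are the same, but the number of test samples per class can vary slightly.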

### Step 3: Standardise or scale the feature data
------------------------------------------

In [11]:
# standardise the feature data
scaler = StandardScaler()

# learn the parameter from training data and fit a transformer to it
# fit() - computes mean and std deviation to scale
# transform() - used to scale using mean and std deviation calculated using fit()
# fit_transform() - combination of both fit() and transform()
X_train = scaler.fit_transform(X_train)

# transform only (no fit()): reusing the training mean and std avoids data leakage
X_test = scaler.transform(X_test)
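
To see what the scaler actually did, we can check that each training feature now has a mean of about 0 and a standard deviation of about 1, and that the test set was scaled with the training statistics. This is a self-contained sketch; names such as `X_train_scaled` and `manual_test` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Per-feature statistics of the scaled training data
train_mean = X_train_scaled.mean(axis=0)
train_std = X_train_scaled.std(axis=0)
print(np.round(train_mean, 6))
print(np.round(train_std, 6))

# transform() is (x - training mean) / training std, applied feature by feature
manual_test = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual_test, X_test_scaled))  # True
```

The test set will generally not have a mean of exactly 0, and that is expected: it is scaled with statistics it never contributed to.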

Now let us convert the feature matrices to `FloatTensor` (32-bit floating-point tensors, used for numerical data) and the label vectors to `LongTensor` (64-bit integer tensors, used for integer class labels).

In [12]:
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.LongTensor(y_train)


X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.LongTensor(y_test)
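
An alternative way to do the same conversion is `torch.tensor` with an explicit `dtype`. The sketch below uses a tiny hand-written array in place of the Iris data; `features` and `labels` are illustrative names.

```python
import numpy as np
import torch

# Two made-up samples with four features each, plus their class labels
features = np.array([[5.1, 3.5, 1.4, 0.2],
                     [6.2, 2.9, 4.3, 1.3]])
labels = np.array([0, 1])

# Explicit dtypes: float32 for features, int64 (long) for class labels
features_tensor = torch.tensor(features, dtype=torch.float32)
labels_tensor = torch.tensor(labels, dtype=torch.long)

print(features_tensor.dtype, labels_tensor.dtype)  # torch.float32 torch.int64
```

Stating the `dtype` explicitly makes the intended types visible in the code and does not depend on the dtype of the NumPy input.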

### Step 4: Create tensor dataset and [data loader](https://www.eletreby.me/blog/getting-started-with-pytorch-dataset-and-dataloader) for batch training
-------------------------------------------------------

The `DataLoader` class wraps a `Dataset` and handles batching and shuffling; it can also use Python's multiprocessing (via the `num_workers` argument) to speed up data retrieval.

In [21]:
# combine features and labels into a single dataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)

# create a dataloader 
# batch_size - instead of processing all training examples at once, it splits them into smaller batches of 16 examples each
# shuffle = True - randomise the order of training examples before creating batches
train_loader = DataLoader(dataset=train_dataset, batch_size=16, shuffle=True)
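
To get a feel for what the loader yields, we can iterate over it once and look at the batch sizes. The sketch below uses random tensors with the same shape as our training set (120 samples, 4 features) so that it runs on its own; `features`, `labels`, and `loader` are illustrative names.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data with the same shape as the Iris training set
features = torch.randn(120, 4)
labels = torch.randint(0, 3, (120,))

loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

# Each iteration yields one (inputs, labels) pair of tensors
batch_sizes = [len(batch_labels) for _, batch_labels in loader]

print(len(loader))    # 8
print(batch_sizes)    # [16, 16, 16, 16, 16, 16, 16, 8]
```

Since 120 is not a multiple of 16, the last batch holds the remaining 8 samples. Passing `drop_last=True` would discard that incomplete batch instead.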

Finally, our dataset is ready for model definition, training, and evaluation.

The following sections will explain the model that we will utilise in this notebook.

## Multi-layer perceptron
---------------------------------------
A [multi-layer perceptron](https://www.datacamp.com/tutorial/multilayer-perceptrons-in-machine-learning) is a type of feedforward neural network (FNN) comprised of fully connected neurons with a non-linear activation function. It is commonly employed to differentiate data that cannot be separated linearly.
![MLP](https://upload.wikimedia.org/wikipedia/commons/4/46/Colored_neural_network.svg)

### Input layer:
This is where the data enters; each neuron represents one piece of information (e.g. petal length).

### Hidden layer:
This is where the real work happens. Each neuron here is connected to the neurons in the layers before and after it (the input and output layers in the figure), and every connection carries a weight. These weights represent how important some connections are relative to others.

### Output layer:
This is the final layer, where we get our results. In our Iris dataset example, each neuron in this layer represents one possible class (setosa, versicolor, or virginica).

#### Workflow:
- Information propagates in a forward direction through the network
- Within each (artificial) neuron, input signals are aggregated via a weighted sum operation
- This aggregated value is then passed through an activation function (introducing non-linearity). Common activation functions include [sigmoid](https://machinelearningmastery.com/a-gentle-introduction-to-sigmoid-function/), [tanh](https://www.geeksforgeeks.org/tanh-activation-in-neural-network/), [ReLU (Rectified Linear Unit)](https://medium.com/@gauravnair/the-spark-your-neural-network-needs-understanding-the-significance-of-activation-functions-6b82d5f27fbf#69d4), etc.
- The resulting output is then forwarded to neurons in the subsequent layer
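
The workflow above can be sketched for a single neuron with made-up numbers. The inputs, weights, and bias below are arbitrary values chosen for illustration; they are not taken from a trained model.

```python
import torch

# Three input signals arriving at one neuron (arbitrary example values)
inputs = torch.tensor([1.0, -2.0, 0.5])
weights = torch.tensor([0.4, 0.3, -0.2])
bias = torch.tensor(0.1)

# Weighted sum: 1.0*0.4 + (-2.0)*0.3 + 0.5*(-0.2) + 0.1 = -0.2
weighted_sum = torch.dot(inputs, weights) + bias

# The same value passed through three common activation functions
relu_out = torch.relu(weighted_sum)       # negative inputs become 0
sigmoid_out = torch.sigmoid(weighted_sum) # squashed into (0, 1)
tanh_out = torch.tanh(weighted_sum)       # squashed into (-1, 1)

print(weighted_sum.item(), relu_out.item(), sigmoid_out.item(), tanh_out.item())
```

A layer such as `nn.Linear` performs this weighted sum for all of its neurons at once as a matrix multiplication.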


Check out the [Neural Network Playground](https://playground.tensorflow.org/) to visualise a neural network and experiment with settings such as learning rate, activation, regularization, and problem type.

## Step 1: Define the MLP model

In [22]:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        """
        Initialise the MLP architecture with specified dimensions.
        
        Parameters:
         input_size: Number of input features (4 for Iris dataset)
         hidden_size: Number of neurons in the hidden layer
         num_classes: Number of output classes (3 for Iris species)
        """
        
        super(MLP, self).__init__()
        
        # First layer (input to hidden)
        # Linear transformation from input features to hidden neurons
        self.layer1 = nn.Linear(input_size, hidden_size)

        # ReLU activation function
        self.relu = nn.ReLU()
        
        # Second layer (hidden to hidden)
        # Another hidden layer with the same size for more complex pattern recognition
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        
        # Output layer (hidden to output)
        # Maps from hidden representation to class scores (logits)
        self.output = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x):
        """
        Define the forward pass through the network.
        
        Parameter:
         x: Input tensor of shape [batch_size, input_size]
        
        Returns:
         Output tensor of shape [batch_size, num_classes]
        """
        
        # Forward pass through the network
        # Each hidden layer applies a linear transformation followed by a ReLU activation;
        # the output layer applies only a linear transformation and returns raw logits
        x = self.layer1(x)
        x = self.relu(x)
        
        x = self.layer2(x)
        x = self.relu(x)
        
        x = self.output(x)
        return x

## Step 2: Set model parameters

In [24]:
input_size = X_train.shape[1]  # Number of features (4 for Iris)
hidden_size = 10               # Number of neurons in each hidden layer
num_classes = 3                # Number of output classes (3 for Iris)
learning_rate = 0.01           # Learning rate for optimiser
num_epochs = 100               # Number of training epochs

## Step 3: Initialise model

In [25]:
model = MLP(input_size, hidden_size, num_classes)
print(model)

MLP(
  (layer1): Linear(in_features=4, out_features=10, bias=True)
  (relu): ReLU()
  (layer2): Linear(in_features=10, out_features=10, bias=True)
  (output): Linear(in_features=10, out_features=3, bias=True)
)
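
A useful follow-up check is to count the trainable parameters and to push a dummy batch through the network. So that the snippet runs on its own, it rebuilds the same layer sizes with `nn.Sequential` as a stand-in for the `MLP` class; `model_sketch` and `dummy_batch` are illustrative names.

```python
import torch
import torch.nn as nn

# Stand-in with the same layer sizes as the MLP above: 4 -> 10 -> 10 -> 3
model_sketch = nn.Sequential(
    nn.Linear(4, 10),
    nn.ReLU(),
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 3),
)

# Weights plus biases: (4*10 + 10) + (10*10 + 10) + (10*3 + 3) = 193
total_params = sum(p.numel() for p in model_sketch.parameters() if p.requires_grad)
print(total_params)          # 193

# A random batch of 16 samples with 4 features gives 16 rows of 3 logits
dummy_batch = torch.randn(16, 4)
logits = model_sketch(dummy_batch)
print(tuple(logits.shape))   # (16, 3)
```

The same two checks work on the `model` object from the cell above.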


## Step 4: Loss function and optimiser

In [26]:
criterion = nn.CrossEntropyLoss()
optimiser = optim.Adam(model.parameters(), lr=learning_rate)

`criterion = nn.CrossEntropyLoss()`
- Defines our loss function, which measures how far our predictions deviate from the actual labels
- [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) is ideal for multi-class classification (such as our Iris dataset)
- It combines softmax activation and negative log-likelihood loss in a single, numerically stable function
- It expects raw model outputs ([logits](https://www.columbia.edu/~so33/SusDev/Lecture_9.pdf)) rather than probabilities
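
The third point can be checked numerically. The sketch below compares `nn.CrossEntropyLoss` with an explicit log-softmax followed by negative log-likelihood, using two made-up rows of logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Raw scores (logits) for two samples and three classes; values are arbitrary
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # true class index for each sample

# The combined loss, as used in this notebook
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same computation in two explicit steps
log_probs = F.log_softmax(logits, dim=1)
loss_manual = F.nll_loss(log_probs, targets)

print(torch.allclose(loss_ce, loss_manual))  # True
```

This is also why the model's `forward` method returns raw logits: applying softmax there would apply it twice.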

`optimiser = optim.Adam(model.parameters(), lr=learning_rate)`
- This creates an optimiser that will update our model's weights
- [Adam (Adaptive Moment Estimation)](https://arxiv.org/abs/1412.6980) is a popular optimiser that:
     - Adapts the learning rate for each parameter
     - Combines the benefits of [AdaGrad](https://medium.com/@brijesh_soni/understanding-the-adagrad-optimization-algorithm-an-adaptive-learning-rate-approach-9dfaae2077bb) and [RMSProp](https://medium.com/@nerdjock/deep-learning-course-lesson-7-3-rmsprop-root-mean-square-propagation-7ff9a3ae2cca)
     - Works well for most problems without requiring excessive tuning
- `model.parameters()` gives the optimiser access to all trainable weights in our network
- `lr=learning_rate` sets the learning rate (how large each update step should be)
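
To see the optimiser at work outside a neural network, here is a toy sketch that minimises `(w - 2)^2` for a single parameter `w`. The starting value, learning rate, and number of steps are arbitrary choices for illustration.

```python
import torch
import torch.optim as optim

# One trainable parameter, starting far from the minimum at w = 2
w = torch.tensor([5.0], requires_grad=True)
optimiser = optim.Adam([w], lr=0.1)

losses = []
for step in range(3):
    optimiser.zero_grad()            # clear gradients from the previous step
    loss = (w - 2.0).pow(2).sum()    # simple quadratic loss
    loss.backward()                  # compute d(loss)/dw
    optimiser.step()                 # update w using Adam
    losses.append(loss.item())

print(losses)
print(w.item())
```

The loss shrinks from one step to the next as `w` moves towards 2. The training loop in the next cell follows the same zero_grad, backward, step sequence for every batch.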


In [27]:
# Lists to track loss values
train_losses = []
test_losses = []

# Training loop
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0
    
    for inputs, labels in train_loader:
        # Zero the parameter gradients
        optimiser.zero_grad()
        
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # Backward pass and optimise
        loss.backward()
        optimiser.step()
        
        running_loss += loss.item()
    
    # Calculate average loss for this epoch
    epoch_loss = running_loss / len(train_loader)
    train_losses.append(epoch_loss)
    
    # Evaluate on test data
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():
        test_outputs = model(X_test_tensor)
        test_loss = criterion(test_outputs, y_test_tensor).item()
        test_losses.append(test_loss)
    
    # Print progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {epoch_loss:.4f}, Test Loss: {test_loss:.4f}")

print("Training complete!")

Epoch 10/100, Training Loss: 0.1442, Test Loss: 0.0901
Epoch 20/100, Training Loss: 0.0597, Test Loss: 0.0356
Epoch 30/100, Training Loss: 0.0465, Test Loss: 0.0201
Epoch 40/100, Training Loss: 0.0485, Test Loss: 0.0336
Epoch 50/100, Training Loss: 0.0401, Test Loss: 0.0143
Epoch 60/100, Training Loss: 0.0406, Test Loss: 0.0148
Epoch 70/100, Training Loss: 0.0447, Test Loss: 0.0113
Epoch 80/100, Training Loss: 0.0402, Test Loss: 0.0112
Epoch 90/100, Training Loss: 0.0463, Test Loss: 0.0099
Epoch 100/100, Training Loss: 0.0400, Test Loss: 0.0093
Training complete!


## Step 5: Model evaluation