## Introduction to PyTorch
* This tutorial introduces PyTorch, training models with PyTorch, and evaluating models using Weights and Biases, all of which will be critical for the rest of the course.
* Proficiency with Python is expected, as well as familiarity with object-oriented programming in Python. For further information on PyTorch, please refer to https://pytorch.org/tutorials/beginner/basics/intro.html#learn-the-basics.
* An introductory understanding of machine learning is also expected, e.g., dataset splitting, data loaders, the difference between sklearn and PyTorch, and feature selection.

Connect the environment to a GPU as follows:

* Select 'Runtime' in the top left
* Select 'Change Runtime Type'
* Select the GPU runtime available

In [2]:
%pip install wandb
import wandb
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import os
from typing import Union, Callable, Tuple, List, Literal
from torch.autograd import Variable
import torch.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import random
# Seed every random number generator in use so that results are reproducible
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)


### Dataset
* The diabetes dataset used in this tutorial is small and tabular, so we'll use the standard DataLoader and define a custom Dataset to handle input data which:

  * Is a pandas DataFrame
  * Has a 1-dimensional dependent variable which does not require processing
  * Has an n-dimensional feature space which requires min-max scaling

In [22]:
# Load example dataset
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
# Specify the path to your CSV file in Google Drive
file_path = '/content/drive/My Drive/COMP0188 - Deep Representations of Learning/diabetes.csv'

In [24]:
# Load the CSV file into a DataFrame
df = pd.read_csv(file_path)

In [25]:
# Check the shape of the DataFrame
print(df.shape)

# Define your variables
y_var = "Outcome"
X_vars = [col for col in df.columns if col != y_var]

# Display the first few rows
df.head()

(768, 9)


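| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |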


### Data Type and Visualization
* Understanding your data is a critical step in any data science workflow. In this part, we analyze the data types and visualize the diabetes dataset to understand its structure and relationships between variables.
  * Data Types Check: Checking data types helps ensure all features are in the correct format.
  * Visualizing Distributions: Understanding the distribution of the target variable is essential for choosing the right model and evaluation metric.
  * Pair Plot: A pair plot shows relationships between features and the target, allowing for a quick visual understanding of potential correlations.

In [None]:
# Check the data types of the features
# Hint: a pandas DataFrame has an attribute "dtypes"
print() # YOUR CODE HERE

# Visualize the distribution of the features to identify the target and independent features
# Hint: Consider using the seaborn package
# YOUR CODE HERE


* Independent Variables: Usually numeric or categorical variables that represent different attributes or features.
* Target Variable: Often numeric (for regression problems) or categorical (for classification problems). It is the variable that the model is trying to predict.

In [None]:
target_variable = 'Outcome'
independent_variables = df.columns[df.columns != target_variable]

### Train/Test Splits with Scikit-Learn
* To build a reliable machine learning model, it is crucial to assess its performance on unseen data. To achieve this, we split the dataset into a training set, which is used to train the model, and a testing set, which is used to evaluate its performance. This helps in understanding the model's generalization ability.
  * Utilize train_test_split from Scikit-Learn to split the data.
  * Create training and testing datasets.
  * Set the test size to 20% of the original dataset (this proportion is a suggestion, not a requirement).

In [None]:
# Split the dataset into training and testing sets
# YOUR CODE HERE

print(f'Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}')
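
As a minimal sketch of the splitting step, the example below applies `train_test_split` to a synthetic table. The names `toy_df`, `toy_X` and `toy_y`, the random values, and `random_state=0` are assumptions made for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table with three features and a binary target
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy_df["Outcome"] = rng.integers(0, 2, size=100)

# Separate the independent variables from the target variable
toy_X = toy_df.drop(columns="Outcome")
toy_y = toy_df["Outcome"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)
print(toy_X_train.shape, toy_X_test.shape)
```

The same call can be applied a second time to the training portion when a separate validation set is needed.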

### Linear Regression with Scikit-Learn
* Scikit-Learn provides the LinearRegression class, which we use to train a model that predicts the target variable from the independent variables. The model is then evaluated using the Mean Squared Error (MSE) metric.
  * Linear regression assumes that the relationship between the features and the target is linear, and it finds the line of best fit by minimizing the sum of squared differences between the observed and predicted values.
  * MSE is the average of the squared differences between the predicted and actual values. A lower MSE indicates a better fit.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize the linear regression model
# YOUR CODE HERE

# Train the model
# YOUR CODE HERE

# Predict on the test set
# YOUR CODE HERE

# Evaluate the model
# YOUR CODE HERE
print(f'Mean Squared Error on Test Set: {mse:.2f}')
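
The fit, predict and evaluate steps can be sketched on synthetic data where the true relationship is known. Everything named `toy_*` is hypothetical, and the line `y = 2x + 1` with small noise is an assumption chosen so that the fitted coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y = 2x + 1 plus a little Gaussian noise
rng = np.random.default_rng(0)
toy_X = rng.uniform(-1.0, 1.0, size=(200, 1))
toy_y = 2.0 * toy_X[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)

# Fit on the first 160 rows and predict on the remaining 40
toy_model = LinearRegression()
toy_model.fit(toy_X[:160], toy_y[:160])
toy_pred = toy_model.predict(toy_X[160:])

# MSE from scikit-learn and the same quantity computed by hand
toy_mse = mean_squared_error(toy_y[160:], toy_pred)
manual_mse = np.mean((toy_y[160:] - toy_pred) ** 2)
print(toy_model.coef_, toy_model.intercept_, toy_mse)
```

Computing the metric by hand alongside the library call is a quick way to confirm what `mean_squared_error` measures.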

### PyTorch Basics
#### Tensor
* PyTorch provides 'tensors' as the fundamental data structure. They enable efficient linear algebra, automatic differentiation and integration with CUDA.
    * A torch.Tensor is a multi-dimensional matrix containing elements of a single data type.
    * tens.T returns the transpose of a 2-D tensor (matrix).
    * Try pushing tens to the GPU with tens.cuda().

##### What's special about torch tensors?
1. Torch tensors can be used on GPUs, which are much faster than CPUs at large parallel computations
2. Torch automatically keeps track of gradient information for tensors created with requires_grad=True

In [None]:
# Create multi-dimension tensor (4-dims)
rands = torch.rand(1,2,3,4)
print(rands)
print(rands.shape)
# torch.reshape returns a tensor with the same data and number of elements as input, but with the specified shape.

# 1*2*3*4 = 1*(2*3)*4
rands_reshape = rands.reshape(1, 6, 4)
print(rands_reshape.shape)

# A single dimension may be -1, in which case it’s inferred from the remaining dimensions and the number of elements in input.
rands_reshape = rands.reshape(1, -1)
print(rands_reshape.shape)

In [None]:
# Change our dataset to tensor
tens = torch.tensor(df.values)
print(tens)
print(tens.shape)

In [None]:
# CPU vs GPU
# Use GPU if available
import time
device = "cuda" if torch.cuda.is_available() else "cpu"
dim=6000

x=torch.randn(dim,dim)
y=torch.randn(dim,dim)
start_time = time.time()
z=torch.matmul(x,y)
elapsed_time = time.time() - start_time
print('CPU_time = ', elapsed_time)


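# Note: if no GPU is available, device is "cpu" and this block simply repeats the CPU timing
x=torch.randn(dim,dim,device=device)
y=torch.randn(dim,dim,device=device)
if device == "cuda":
    torch.cuda.synchronize()  # Wait for the tensors to be created before starting the timer
start_time = time.time()
z=torch.matmul(x,y)
if device == "cuda":
    torch.cuda.synchronize()  # CUDA kernels run asynchronously, so wait for matmul to finish
elapsed_time = time.time() - start_time
print('GPU_time = ',elapsed_time)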

In [None]:
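# Torch tensors created with requires_grad=True keep track of the gradient information
a=torch.rand(64, requires_grad=True)

b=4*a
c=6*a

out=(b+c).sum()
out.backward()
# out = sum(4*a + 6*a), so d(out)/da = 10 for every element
print(a.grad)

# detach() returns a tensor that shares the same data but is cut off from the computation graph,
# so later operations on it are excluded from gradient calculation, saving memory and computation
a = a.detach()
print(a.grad)  # None: the detached tensor carries no gradient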

#### PyTorch Dataset and DataLoader
PyTorch Datasets and DataLoaders provide a useful API for loading batches of data for deep learning models.

* Dataset
  * The 'Dataset' represents the entire training/validation/test data. The \_\_len\_\_ and \_\_getitem\_\_ dunder methods are required for the Dataset class as they:
    * Define the number of data observations (an observation being e.g., a single row in a dataset or a single image) and;
    * Allow a single data observation to be retrieved
  * The Dataset class simplifies managing large and non-standard datasets since, e.g., not all of the data needs to be loaded into RAM at once
* DataLoader
  * The 'DataLoader' handles how a given dataset should be batched. If the output of a dataset.\_\_getitem\_\_ call is a tensor then the base DataLoader class can be used as is; however, if non-standard types are being used, e.g., variable-length data or custom objects, then defining custom batching (a collate function) is useful
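
To illustrate custom batching, here is a minimal sketch that passes a collate function to the DataLoader. It assumes a hypothetical dataset of variable-length 1-D sequences (`ToySequenceDataset`) and a hypothetical helper (`pad_collate`); neither is used elsewhere in this tutorial.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class ToySequenceDataset(Dataset):
    """Hypothetical dataset whose observations have different lengths."""
    def __init__(self) -> None:
        self._seqs = [torch.ones(n) for n in (2, 5, 3, 4)]

    def __len__(self) -> int:
        return len(self._seqs)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self._seqs[idx]

def pad_collate(batch):
    # batch is a list of 1-D tensors of varying length, so they cannot be stacked directly.
    # Pad them with zeros to a common length and also return the original lengths.
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths

toy_loader = DataLoader(ToySequenceDataset(), batch_size=2, shuffle=False, collate_fn=pad_collate)
padded, lengths = next(iter(toy_loader))
print(padded.shape, lengths)
```

The first batch contains the sequences of length 2 and 5, so the shorter one is padded with three zeros.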

The diabetes dataset used in this tutorial is small and tabular, so we'll use the standard DataLoader and define a custom Dataset to handle input data which:
* Is a pandas DataFrame;
* Has a 1-dimensional dependent variable which does not require processing
* Has an n-dimensional feature space which requires min-max scaling

In [None]:
class PandasDataset(Dataset):
    def __init__(self, X:pd.DataFrame, y:pd.Series)->None:
        # Your code here
        self._X = torch.from_numpy(X.values).float()
        self._X = self.__min_max_norm(self._X)
        self.feature_dim = X.shape[1]
        self._len = X.shape[0]
        self._y = torch.from_numpy(y.values)[:,None].float()

    def __len__(self)->int:
        # Your code here


    def __getitem__(self, idx:int) -> Tuple[torch.Tensor, torch.Tensor]:
        # Your code here


    def __min_max_norm(self, in_tens:torch.Tensor) -> torch.Tensor:
        # Your code here
        # Perform min-max normalization on the input tensor

        # Calculate normalized tensor: (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
        # Note: Add a small epsilon to avoid division by zero
        return norm_tens
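
For reference, one possible shape of such a dataset is sketched below on a three-row toy table. This is not the intended solution to the exercise above: the class name `ToyTableDataset`, the `eps` value and the toy data are all assumptions made for illustration.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class ToyTableDataset(Dataset):
    """Hypothetical tabular dataset with column-wise min-max scaling."""
    def __init__(self, X: pd.DataFrame, y: pd.Series, eps: float = 1e-8) -> None:
        feats = torch.from_numpy(X.values).float()
        col_min = feats.min(dim=0).values  # Per-column minimum
        col_max = feats.max(dim=0).values  # Per-column maximum
        # eps avoids division by zero when a column is constant
        self._X = (feats - col_min) / (col_max - col_min + eps)
        self._y = torch.from_numpy(y.values)[:, None].float()

    def __len__(self) -> int:
        # Number of observations (rows)
        return self._X.shape[0]

    def __getitem__(self, idx: int):
        # A single (features, target) observation
        return self._X[idx], self._y[idx]

toy_X = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0]})
toy_y = pd.Series([0, 1, 1])
toy_data = ToyTableDataset(toy_X, toy_y)
print(len(toy_data), toy_data[1])
```

The middle row sits halfway between the minimum and maximum of each column, so both of its scaled features come out at roughly 0.5.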

In [None]:
# Split train_data to training and validation dataset
# YOUR CODE HERE
# Create datasets
train_data = PandasDataset(X=X_train, y=y_train)
val_data = PandasDataset(X=X_val, y=y_val)
test_data = PandasDataset(X=X_test, y=y_test)
# Get length of training, validation and test dataset
print(f"The training data has: {len(train_data)} samples")
print(f"The validation data has: {len(val_data)} samples")
print(f"The test data has: {len(test_data)} samples")

In [None]:
# Let's load the dataset into PyTorch dataloaders. Given that the dataset is small, a large batch size is not required.
batch_size = 32
shuffle = True
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=shuffle)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=shuffle)
print(f"First train example: {train_data[0]} \n with shape {(train_data[0][0].shape, train_data[0][1].shape)}")
print("\n")
print(f"Second train example: {train_data[1]} \n with shape {(train_data[1][0].shape, train_data[1][1].shape)}")
print("\n")

In [None]:
# Notice how the dataloader stacks the observations along a new first (batch) dimension
first_batch = next(iter(train_dataloader))
print(f"First train batch: {first_batch} \n with shape {(first_batch[0].shape, first_batch[1].shape)}")
print("\n")

#### PyTorch Model
* PyTorch models are developed by subclassing nn.Module. The core requirement for a PyTorch model is defining the forward method, which defines the model's forward pass. The new subclass will most likely make use of other nn.Module subclasses, one of which is:
  * nn.Linear(in_features, out_features) - this defines a single fully connected layer with a given number of input and output features
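
The subclassing pattern can be sketched with a small two-layer network. The class `TinyMLP`, its hidden size of 16 and the random input batch are hypothetical choices for illustration only; the model used later in this tutorial is different.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Hypothetical two-layer network built from nn.Linear layers."""
    def __init__(self, input_dim: int, hidden_dim: int = 16) -> None:
        super().__init__()  # Required so that nn.Module can register the sub-layers
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass: linear layer, non-linearity, linear layer
        return self.output(self.activation(self.hidden(x)))

toy_model = TinyMLP(input_dim=8)
toy_batch = torch.rand(32, 8)   # A batch of 32 observations with 8 features each
toy_out = toy_model(toy_batch)  # Calling the model invokes forward()
print(toy_out.shape)            # torch.Size([32, 1])
```

Layers assigned as attributes in `__init__` are registered automatically, which is why `toy_model.parameters()` would return the weights and biases of both linear layers.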
    

#### Linear Regression in PyTorch
Unlike Scikit-Learn, PyTorch provides more flexibility for customizing the model architecture and training process. We define a simple linear model using nn.Module, specify the loss function, and use an optimizer to minimize the loss during training.

In [None]:
# Define the linear regression model
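device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # 'cuda' selects an NVIDIA GPU when one is available
device = 'cpu' # This line overrides the selection above and forces the CPU; remove it to train on the GPU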
class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim):
        # YOUR CODE HERE

    def forward(self, x):
        # YOUR CODE HERE

#### Training pipeline
The train_single_epoch function provides an exemplar function that trains an initialised model for a single epoch and returns the mean batch loss for that epoch. Of note:
* model.train(): certain nn.Module functionality such as dropout behaves differently during training and evaluation, so we must tell the model that it is being trained
* optimizer.zero_grad(), loss.backward() and optimizer.step(): for every minibatch, loss.backward() 'accumulates' gradients into the parameters and, based on this accumulation, the optimiser takes a 'step'. At the start of each gradient step the previous gradients are set to 0 so that they can be re-accumulated - _gradient calculations will be covered later in the course!_

In [None]:
def train_single_epoch(model, data_loader, criterion, optimizer):
    model.to(device)
    model.train()
    epoch_loss = 0
    for X_batch, y_batch in data_loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        optimizer.zero_grad()
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(data_loader)
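
To see why the gradients are reset for every minibatch, the sketch below runs the same backward pass with and without zeroing. The tensor `w`, the toy loss `sum(2 * w)` and the learning rate of 0.1 are assumptions chosen so that the numbers are easy to follow.

```python
import torch

w = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without zeroing in between: the gradients add up
(2 * w).sum().backward()
(2 * w).sum().backward()
accumulated = w.grad.clone()  # tensor([4., 4., 4.])

# Reset the gradients, then run a single backward pass
optimizer.zero_grad()
(2 * w).sum().backward()
fresh = w.grad.clone()        # tensor([2., 2., 2.])

# The optimiser step uses whatever is currently stored in w.grad:
# w <- w - lr * grad = 1 - 0.1 * 2 = 0.8
optimizer.step()
print(accumulated, fresh, w)
```

Without the call to `optimizer.zero_grad()`, the step would have used the stale, accumulated gradient instead of the gradient of the current minibatch.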

After each epoch, we would like to evaluate the model. Notice:
* model.eval() now tells the model we are evaluating and ensures functionality such as dropout behaves appropriately
* torch.no_grad() tells PyTorch not to track gradients since, in evaluation, we do not update the parameters!

The functions below calculate the mean loss over an epoch; they mirror the training function above but omit the gradient and optimiser steps

In [None]:
def validate(model, data_loader, criterion):
    model.to(device)
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            val_loss += loss.item()
    return val_loss / len(data_loader)

def test(model, data_loader, criterion):
    model.to(device)
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            test_loss += loss.item()
    return test_loss / len(data_loader)
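
The effect of the two evaluation switches can be checked in isolation. The sketch below is illustrative only: a standalone `nn.Dropout` layer with `p=0.5` and a toy tensor `w` are assumptions, not part of the model trained in this tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: elements are randomly zeroed and the survivors are scaled by 1 / (1 - p) = 2
layer.train()
train_out = layer(x)

# Evaluation mode: dropout is disabled and the input passes through unchanged
layer.eval()
eval_out = layer(x)

# Inside torch.no_grad() the result is not attached to a computation graph
w = torch.ones(3, requires_grad=True)
tracked = (2 * w).sum()
with torch.no_grad():
    untracked = (2 * w).sum()

print(train_out.unique(), torch.equal(eval_out, x))
print(tracked.requires_grad, untracked.requires_grad)
```

Note that `model.eval()` and `torch.no_grad()` are independent: the first changes layer behaviour, the second switches off gradient tracking.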

* We can now use the above functions to run a single epoch worth of training and validation.
* nn.MSELoss() is used since we are performing a linear regression task. This is not the only loss function we could use, so experiment with others if you wish!

In [None]:
# Initialize the model, loss function, and optimizer
input_dim = X_train.shape[1]
model = LinearRegressionModel(input_dim)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Train for one epoch and display the result
train_loss_one_epoch = train_single_epoch(model, train_dataloader, criterion, optimizer)
print(f'Training Loss after one epoch: {train_loss_one_epoch:.4f}')

# Validate the model after one epoch
val_loss_one_epoch = validate(model, val_dataloader, criterion)
print(f'Validation Loss after one epoch: {val_loss_one_epoch:.4f}')

#### Monitoring
* A significant part of developing machine learning models involves experimentation. Tracking and managing these experiments can become challenging as the number of experiments grows. To address this, we use tools like Weights and Biases (wandb), which help in logging and visualizing metrics, saving model checkpoints, and comparing different runs effectively.
* The training loop has been updated to log both training and validation metrics (such as loss) to wandb. This allows you to monitor the model's performance in real time and keep track of the training process. Additionally, a checkpoint of the model's parameters is saved whenever the validation loss improves on the best value seen so far, so the most recent checkpoint always holds the best-performing model from the run.

In [None]:
wandb.login()
epochs = 50
lr=0.001
weight_decay=0.0

config={
    "learning_rate": lr,
    "architecture": "LinearRegressionModel",
    "epochs": epochs,
    "weight_decay": weight_decay,
    "batch_size": batch_size,
    "shuffle": shuffle,
    "loss": type(criterion).__name__  # Log the name of the loss (e.g. "MSELoss") rather than the module object
    }

wandb.init(project='diabetes_prediction', config=config)


def train_all_epochs(model, train_loader, val_loader, criterion, optimizer, epochs):
    train_losses = []
    val_losses = []
    best_val_loss = float('inf')  # Initialize best validation loss to a very high value
    model.to(device)

    for epoch in range(epochs):
        # Train for one epoch
        model.train()  # Set model to training mode
        train_loss = train_single_epoch(model, train_loader, criterion, optimizer)

        # Validate the model after the epoch
        model.eval()  # Set model to evaluation mode
        with torch.no_grad():  # Disable gradient calculation for validation
            val_loss = validate(model, val_loader, criterion)

        # Log the training and validation loss to wandb
        wandb.log({"train_loss": train_loss, "val_loss": val_loss})

        # Save the best model checkpoint
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            chkp_pth = os.path.join(wandb.run.dir, f"mdl_chkpnt_epoch_{epoch}.pt")
            torch.save(
                {
                    'epoch': epoch,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                }, chkp_pth)
            # Record the path of the best checkpoint in the run summary
            # (a separate wandb.log call here would advance the logging step)
            wandb.run.summary["best_model_path"] = chkp_pth

        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f'Epoch {epoch+1}/{epochs}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

    return train_losses, val_losses
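
A saved checkpoint is only useful if it can be restored. The sketch below shows a save and load round trip using the same dictionary keys as the training loop above, but with assumptions of its own: a hypothetical `nn.Linear(4, 1)` model, a temporary directory instead of the wandb run directory, and an arbitrary epoch number.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(4, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the epoch number together with the model and optimiser state
    chkp_pth = os.path.join(tmp_dir, "toy_chkpnt.pt")
    torch.save(
        {
            "epoch": 3,
            "model_state_dict": toy_model.state_dict(),
            "optimizer_state_dict": toy_optimizer.state_dict(),
        },
        chkp_pth,
    )

    # Restore into freshly constructed objects of the same architecture
    checkpoint = torch.load(chkp_pth)
    restored_model = nn.Linear(4, 1)
    restored_model.load_state_dict(checkpoint["model_state_dict"])
    restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.01)
    restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

print(checkpoint["epoch"], torch.equal(toy_model.weight, restored_model.weight))
```

Saving the optimiser state as well as the model state matters mainly when training is resumed from the checkpoint rather than only evaluated.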

In [None]:
# Train the model
train_losses, val_losses = train_all_epochs(model, train_dataloader, val_dataloader, criterion, optimizer, epochs)
wandb.finish()

In [None]:
# Plot MSE loss over epochs
# Hint: Use the matplotlib package and the plot object
# Multiple lines can be placed on a single graph by calling plot multiple times before calling show
# YOUR CODE HERE
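
As a rough sketch of the plotting pattern described in the hints, the example below draws two curves on one set of axes. The loss values are made-up placeholders generated by a formula, not results from this tutorial, and the `toy_*` names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt

# Placeholder loss curves (NOT real training results)
toy_train_losses = [1.0 / (epoch + 1) for epoch in range(10)]
toy_val_losses = [1.2 / (epoch + 1) + 0.05 for epoch in range(10)]

fig, ax = plt.subplots()
# Calling plot twice on the same axes places both lines on one graph
ax.plot(toy_train_losses, label="Training loss")
ax.plot(toy_val_losses, label="Validation loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("MSE loss")
ax.legend()
```

In a notebook, a final `plt.show()` displays the figure; a validation curve that rises while the training curve keeps falling is a common sign of overfitting.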

### Feature Selection
Feature selection is a technique to improve model performance by focusing on the most relevant features. We hypothesize which features are most important and iteratively add features to our model to observe changes in performance. This can help in building more efficient models.
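* The f_regression function in sklearn.feature_selection helps evaluate each feature's significance: it scores the linear association between every feature and the target, returning an F-statistic and a p-value per feature.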

In [None]:
from sklearn.feature_selection import f_regression

# Start with the most important feature
initial_feature = 'Glucose'  # Hypothetical important feature

# Create a new dataset with only the selected features
selected_features = [initial_feature]

# Loop to add one feature at a time and check performance
for feature in independent_variables:
    if feature not in selected_features:
        selected_features.append(feature)
        X_train_subset = X_train[selected_features]
        X_test_subset = X_test[selected_features]

        # Train a new model with the selected features
        model = LinearRegression()
        model.fit(X_train_subset, y_train)
        y_pred = model.predict(X_test_subset)
        mse = mean_squared_error(y_test, y_pred)

        print(f'Features: {selected_features}, MSE: {mse:.5f}')
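
Since `f_regression` is imported above, here is a minimal sketch of how it could be used to rank features. The synthetic data is an assumption: only the first of three hypothetical features actually drives the target, so it should receive the highest score.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: the target depends on feature 0 only; features 1 and 2 are pure noise
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=0.5, size=200)

# One F-statistic and one p-value per feature
f_scores, p_values = f_regression(toy_X, toy_y)

# Feature indices ordered from most to least associated with the target
ranking = np.argsort(f_scores)[::-1]
print(f_scores, p_values, ranking)
```

A ranking like this could replace the hypothesised starting feature in the loop above, so that features are added in order of their scores.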

### Extended exercise 1
* Update the Dataset class and train functions to make running the model on a GPU more efficient! _Hint: Front-load the data being pushed to the GPU!_
* Compare the differences in run time

### Extended exercise 2
* Using Weights and Biases to diagnose model performance, try to develop the best-performing model
* Don't evaluate the model on the test set until you have finished experimenting