# Starting with PyTorch

Welcome to the first in a series of Pytorch tutorials we've planned for the last two weeks of the Machine Learning module. This fun and engaging tutorial will guide you through the installation process, teach you about Pytorch tensors, Autograd, and even help you create your first neural network based on a single artificial neuron. Get ready to learn, grow and enjoy the ride!

## Set up your Pytorch environment

Pytorch is a widely favored platform for building and training deep neural network models. You have the option to get Pytorch up and running on your personal computer, or you can use Google Colab to run your notebooks. For this week's tutorial, your laptop should suffice, but as we move into next week, we'll need to harness the power of a GPU. That's when Google Colab will come in really handy!

### Install Pytorch locally
You can install PyTorch locally by following the instructions at

https://pytorch.org/get-started/locally/

It's quite easy to get started. Just grab the command from this website and pop it into your Anaconda Prompt. You won't even need a GPU to run Pytorch on your own computer, as it runs on a CPU by default. The examples in this notebook are nice and simple and won't require any GPU power.

But, next week, we're going to turn things up a notch and you'll need a GPU. To make that happen, you'd typically need an NVIDIA GPU card and CUDA installed. It can be a bit of a pickle, so we won't get into the nitty-gritty here. Instead, we'll guide you to use Google Colab, which makes everything much simpler.

### Use Google Colab
To run your notebooks on Google Colab instead, follow these instructions:
1. Go to https://colab.research.google.com/
2. Click __Upload__
3. Upload this notebook

You are now ready to start working in Colab. Next week we will also show how to use a GPU in Colab.

### Import Pytorch

Check that Pytorch is available in your environment by importing it:

In [None]:
import torch
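
If the import works, it can also be handy to check which version you have and whether a GPU is visible. The snippet below is a minimal sketch of such a check. The variable names `version` and `cuda_available` are just chosen for this illustration, and on a laptop without an NVIDIA GPU you should expect `False` for the second value.

```python
import torch

# Report the installed version and whether a CUDA-capable GPU is visible
version = torch.__version__
cuda_available = torch.cuda.is_available()

print("PyTorch version:", version)
print("CUDA available:", cuda_available)
```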

## PyTorch tensors
Pytorch tensors bear a close resemblance to numpy arrays, but they've got some extra bells and whistles. Not only can they operate on a GPU, but they also come with features that allow automatic differentiation.

Let's go ahead and create a random Pytorch tensor!

In [None]:
tensor = torch.rand(2, 2, 2)
print(tensor)

Or we can create a tensor from some specific values:

In [None]:
tensor2 = torch.tensor([1,2])
print(tensor2)

Just like numpy arrays, we can easily get to the individual elements of tensors and check out their shape:

In [None]:
print(tensor[0,0,0])
print(tensor.shape)

We can reshape tensors using `view`, which does not copy the data:

In [None]:
print(tensor.view(2,4))
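
To see what "without copying" means in practice, here is a small sketch using a made-up tensor `t` (not the random tensor from above). Writing through the view changes the original tensor as well, because both share the same memory.

```python
import torch

t = torch.arange(8)      # a 1D tensor with values 0..7
v = t.view(2, 4)         # the same data, seen as a 2x4 matrix
v[0, 0] = 100            # write through the view

# Both tensors point at the same underlying storage
shares_memory = t.data_ptr() == v.data_ptr()

print(t[0].item())       # the original tensor sees the change
print(shares_memory)
```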

We can easily convert between numpy arrays and tensors:

In [None]:
import numpy as np
a = np.array([1,2,3])
t = torch.from_numpy(a)
print(t)

In [None]:
a2 = t.numpy()
print(a2)
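
One detail worth knowing: `torch.from_numpy` shares memory with the numpy array, while `torch.tensor` makes a copy. The sketch below illustrates the difference; the name `t_copy` is just chosen for this example.

```python
import numpy as np
import torch

a = np.array([1, 2, 3])
t = torch.from_numpy(a)    # shares memory with a
t_copy = torch.tensor(a)   # independent copy of the data

a[0] = 10                  # modify the numpy array in place
a[1] = 20

print(t)                   # reflects the changes made to a
print(t_copy)              # keeps the original values
```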

**Activity 1:** Practise working with PyTorch tensors. First, create and display a PyTorch tensor:
* Create a numpy array with values `[1,...,12]` (*Hint:* You can use `np.linspace`).
* Convert it to a pytorch tensor.
* Print out the tensor as a matrix of size $3\times 4$

In [None]:
import numpy as np
import torch

# Create a numpy array
x = None

# Convert to tensor
t = None

# Display in shape 3x4
print(None)

**Activity 2:** Learn how to concatenate two Pytorch tensors
* Create two random Pytorch tensors with sizes $1 \times 2 \times 4 \times 4$ and $1 \times 5 \times 4 \times 4$
* Concatenate them along axis 1 using `torch.cat` (*Hint:* axes are numbered from 0, so this is the second dimension)
* Print the dimensions of the concatenated tensor

In [None]:
# Create two random torch arrays
a = None
b = None

# Concatenate
c = None

# Print shape
print('c.shape: ', None)

## Autograd

PyTorch offers automatic differentiation to enable training of neural networks, so we don't have to work out the derivatives by hand. If a tensor stores parameters that we want to learn, we set its `requires_grad` attribute to `True`. Consider it like switching on the learning mode for that particular tensor.

In [None]:
w = torch.tensor([1.0,2.0], requires_grad=True)

Let's consider the sum of squared error loss, which we want to minimise with respect to `w`, implemented as:

In [None]:
s=w**2
loss=s.sum()

Note that PyTorch has attached to each tensor a function for calculating its derivatives with respect to `w`. You can see it in the `grad_fn` field when the tensors are printed:

In [None]:
print(s)
print(loss)

The derivatives are calculated by the **chain rule**, or in other words by **backpropagation**, from the variable `loss` through the intermediate steps (variable `s`) towards the parameters `w`. Backpropagation is started by calling the function `backward` on the loss. Once it has run, the gradients can be accessed through `w.grad`.

**Analogy:** Picture it like a domino effect, starting from the `loss` variable, tumbling through the intermediate steps (the variable `s`), and finally reaching the parameters `w`.

**Attention:** If you run `backward` twice on the same graph, PyTorch will raise an error, because the intermediate results needed for backpropagation are freed after the first call. If you run into this, just re-run the cells above to rebuild the graph.

In [None]:
loss.backward()
print(w.grad)
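
It is easy to check this result by hand: for a loss of the form $\sum_j w_j^2$ the derivative with respect to $w_j$ is $2w_j$. The sketch below repeats the steps above in one self-contained cell and compares the gradient from Autograd with the analytic one (the name `analytic` is just chosen for this illustration).

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()

# For loss = sum_j w_j^2 the analytic gradient is 2 * w
analytic = 2 * w.detach()

print(w.grad)
print(torch.allclose(w.grad, analytic))
```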

**Activity 3:** Practise using the Autograd feature of Pytorch:
* Create a Pytorch tensor `y` with values `[0,1]`
* Create another PyTorch tensor `p` with values `[0.5,0.5]`. Set `requires_grad` to `True`.
* Implement cross-entropy loss $L=-y_0\log(p_0)-y_1\log(p_1)$
* Print out the loss value
* Calculate the gradients of the loss with respect to `p` and print them out.

In [None]:
# Create tensor y = [0,1]
y = None
# Create tensor p = [0.5, 0.5], requires grad
p = None

# Calculate cross-entropy loss
#m = -y*torch.log(p)
ce_loss = None
# Print loss value
print(ce_loss)

# Calculate gradients of loss w.r.t. p

# Print gradients w.r.t. p


## The first neural network in Pytorch

<img src="https://drive.google.com/uc?export=view&id=112QQeYbWEQcnsTu4JYL_nNHN9fLZHzfI" width = "300" style="float: right;">

The simplest neural network consists of a single artificial neuron. It can be expressed by the equations
$$z=\sum_{j=0}^Dw_jx_j$$
$$\hat{y}=f(z)$$
where $x_j$ are the input features, $\hat{y}$ is the output, $w_j$ are the learnable weights and $f$ is an activation function. Assuming the usual convention $x_0=1$, the weight $w_0$ plays the role of the bias. If we choose **mean squared error** as the loss to be minimised and the **identity** as the activation function, we obtain a simple multivariate **linear regression** with $D$ input features, $\hat{y}=\sum_{j=0}^Dw_jx_j$.

In Pytorch the equation $z=\sum_{j=0}^Dw_jx_j$ is implemented as a **linear layer** with $D$ inputs and $1$ output:

In [None]:
import torch.nn as nn

D=3
nn.Linear(D,1)
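
As a rough sketch of what the linear layer computes, the example below applies it to a small random batch and reproduces the result by hand from the layer's `weight` and `bias`. The batch size of 5 and the names `layer`, `x` and `z_manual` are made up for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

D = 3
layer = nn.Linear(D, 1)

# A batch of 5 samples with D features each
x = torch.rand(5, D)
z = layer(x)

# The same computation written out: weighted sum of inputs plus bias
z_manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(z, z_manual))
```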

**Activity 4:** Play with parameters of the linear layer to see how it changes.

### Create a neural network model
To create a neural network model in Pytorch we need to define its architecture in a new class inherited from `torch.nn.Module`.
To do that we need to define functions `__init__` and `forward`:

1. The function `__init__` is a constructor in which we define the layers and any parameters we need. We'll always need to call the `super` function inside it to ensure our parent class gets initialized too.
2. The function `forward` defines the forward pass, which calculates the output $\hat{y}$ from the input features $x_j$.

Here's the best part - we don't need to define the backward pass at all. PyTorch is going to do that for us automatically!

The network architecture for a **univariate linear regression** model consists of a **single linear layer** with $D=1$ input feature and $1$ output. So let's get our hands dirty and define our new single artificial neuron regressor `ANRegressor` below:

In [None]:
class ANRegressor(nn.Module):
    def __init__(self):
        super(ANRegressor, self).__init__()
        self.layer = nn.Linear(1, 1)

    def forward(self, x):
        x = self.layer(x)
        return x

We're about to bring our model to life. Our next step is to create an instance of this model. Let's name it `net` and print it:

In [None]:
net = ANRegressor()
print(net)

We've got ourselves a simple model with a single linear layer that has just one input and one output feature. Because the bias is enabled (`bias=True`), it represents our univariate linear model:

$$y=wx+b$$

But that's not all! We can also take a peek at the learnable parameters of this model, our 'knobs and dials' if you will. Just iterate over `net.parameters()`, and you'll see that we have two of them ready to adjust and refine, the weight $w$ and the bias $b$:

In [None]:
for parameter in net.parameters():
    print(parameter)
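
If you would like to see which parameter is which, `named_parameters()` also gives you their names. The sketch below redefines `ANRegressor` so that it runs on its own, and counts the learnable values with `numel()`. The variable names `names` and `n_params` are just chosen for this example.

```python
import torch.nn as nn

class ANRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

net = ANRegressor()

# List each parameter with its name and shape
names = []
for name, parameter in net.named_parameters():
    names.append(name)
    print(name, tuple(parameter.shape))

# Total number of learnable values in the model
n_params = sum(p.numel() for p in net.parameters())
print("number of parameters:", n_params)
```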

**Activity 5:** Alright, time for a little hands-on activity! Let's tweak the `ANRegressor` a bit. Try changing the number of input features and see what happens. Notice any change in the number of parameters we have to optimize? It's a neat way to see how the complexity of our model changes with more input features.

But don't forget! Once you're done exploring, make sure you set the number of input features back to 1. We've got more exciting things to learn in this tutorial and we want everything set up just right. Enjoy playing around with your model!

**Answer:** Number of input features + 1.

### The loss and the optimiser

Alright, moving on to the next key elements of our model: the loss function and the optimizer.

The **loss function** measures how well our model is doing during training. It's like a scorecard in a game, but instead of trying to get the highest score, our goal is to get the lowest one! We'll store it in a variable aptly named `loss_function`.

Since we're building a regression model, we're going to use the **mean squared error (MSE)** as our loss function. Here's what it looks like mathematically:

$$L(\mathbf{y},\mathbf{\hat{y}})=\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2$$

Here, $y_i$ are the true target values (the actual scores in our game), $\hat{y}_i$ are the predicted target values (the scores our model thinks it's going to get), and $N$ is the number of samples. The MSE loss function calculates the average squared difference between the two, giving us a single number to aim at reducing. Let's dive in and get this set up!

In [None]:
loss_function = nn.MSELoss()
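
By default `nn.MSELoss` averages the squared differences over all samples. The sketch below checks this on three made-up target and prediction values (purely illustrative numbers, not taken from our dataset) by comparing the built-in loss with the same calculation written out by hand.

```python
import torch
import torch.nn as nn

# Made-up targets and predictions, shape N x 1
y = torch.tensor([[1.0], [2.0], [3.0]])
y_hat = torch.tensor([[1.5], [2.0], [2.0]])

loss_function = nn.MSELoss()
loss = loss_function(y_hat, y)

# The same quantity written out: mean of squared differences
manual = ((y - y_hat) ** 2).mean()

print(loss.item(), manual.item())   # both equal 1.25 / 3
```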

Awesome, we're making great progress! Now, it's time to pick our model's guide for this learning journey - the optimizer. In our case, we'll go for the Stochastic Gradient Descent (SGD) optimizer with a learning rate ($\eta$) of 0.2.

This learning rate is like the size of the steps our model takes while learning. With a learning rate of 0.2, our model is going to take reasonably sized strides towards its goal of minimizing the loss.

And here's an important note: the first argument of the optimizer is the set of learnable parameters of our network `net`. It's the optimizer's job to tune these parameters and help our model improve over time. Let's set it up!

In [None]:
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
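
To get a feel for what the optimizer does, here is a minimal sketch of a single plain SGD step on a toy parameter vector `w` (a stand-in for the network parameters, assumed only for this illustration). The updated values should match $w-\eta\,\partial L/\partial w$ computed by hand.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.2)

# Toy loss: sum of squares, so the gradient is 2 * w
loss = (w ** 2).sum()

optimizer.zero_grad()
loss.backward()

# Remember the old values and gradient before the update
w_before = w.detach().clone()
grad = w.grad.clone()

optimizer.step()

# Plain SGD update written out by hand
expected = w_before - 0.2 * grad

print(w.detach())   # updated parameters
print(expected)     # hand-computed update
```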

### Training data

Excellent! Now, let's revisit an example we've seen before - predicting brain volumes from a baby's age. We worked with this dataset in Week 2 of this module, remember?

We need to prep our data a bit before we start training. Here are the steps we'll follow:

- Reshape the data to a size of N x 1, where N is the number of samples, and 1 represents the number of input (for `X`) or output (for `y`) features.
- Convert our input and output values into Pytorch tensors. You can think of tensors like multi-dimensional arrays.
- Ensure our data are `float` values. Pytorch needs the data in this format to work its magic.
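
The three steps above can be sketched on a tiny example. The values in `ages` and `volumes` below are hypothetical placeholders, not the real measurements, and the names `X_demo` and `y_demo` are chosen so that they don't clash with the variables used later.

```python
import numpy as np
import torch

# Hypothetical values standing in for the real dataset
ages = np.array([30.0, 34.5, 40.0])
volumes = np.array([180.0, 260.0, 390.0])

# Reshape to N x 1, convert to tensors, and cast to float32
X_demo = torch.from_numpy(ages.reshape(-1, 1)).float()
y_demo = torch.from_numpy(volumes.reshape(-1, 1)).float()

print(X_demo.shape, X_demo.dtype)
print(y_demo.shape, y_demo.dtype)
```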

**One cool thing to note:** as long as our tensors don't require gradients, we can plot Pytorch tensors just like numpy arrays.
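
Here is a small sketch of why the gradient requirement matters. Converting a tensor that requires gradients to numpy raises an error, and calling `detach()` first gives a tensor that can be converted. The names `conversion_failed` and `arr` are just chosen for this illustration.

```python
import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)

# Direct conversion is refused for tensors that require gradients
try:
    t.numpy()
    conversion_failed = False
except RuntimeError:
    conversion_failed = True

# Detaching first gives a tensor that can be converted
arr = t.detach().numpy()

print(conversion_failed)
print(arr)
```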

Now, if you're working on Google Colab, there's one more step. You'll need to upload the dataset first. Just run the code below and upload the file `'neonatal_brain_volumes.csv'` that you downloaded from KEATS. Then, we're all set to dive in!

In [None]:
# only do this if you work on Google Colab
# run the cell
# then upload file 'neonatal_brain_volumes.csv'

from google.colab import files
files.upload()

Now, let's put our data into action.

Just hit run on the next cell. This will load our data, convert it into the right format, and then plot it for us to see.

Remember, visualizing our data is a really helpful step. It allows us to understand the distribution and the relationships between our variables. It's always a good idea to know what you're working with!

So go ahead, let the data take the stage. Hit run and let's see what we've got!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# load
data = pd.read_csv('neonatal_brain_volumes.csv').to_numpy()

# standardise and reshape
X = StandardScaler().fit_transform(data[:,0].reshape(-1,1))
y = data[:,1].reshape(-1,1)

# convert
X = torch.from_numpy(X).float()
y = torch.from_numpy(y).float()
print('X: ', X.shape)
print('y: ', y.shape)

# plot
plt.plot(X, y,'*')
plt.xlabel('age of the baby (standardised)')
plt.ylabel('brain volume [mL]')

### Training the neural network

Alright, it's time to train our neural network! So, what exactly happens when we train the network? Well, this training takes place over several rounds, and each of these rounds is known as an **epoch**.

In every epoch, there are five key steps we'll go through:

1. We start off by clearing the gradients. Think of this as cleaning the slate before we start our calculations.
2. Next comes the forward pass, where the network predicts the outputs $\hat{y}_i$ using the current estimate of the network parameters $w_j^{(n)}$.
3. Once we have our predictions, we compute the loss $L(\mathbf{y},\mathbf{\hat{y}})$, which measures how far our predictions are from the true outputs.
4. Up next is the backward pass. In this step, we calculate the gradients $\frac{\partial L(\mathbf{w}^{(n)})}{\partial w_j^{(n)}}$ of the loss with respect to the network parameters. It's like figuring out the slope of the error, which tells us the direction to move in to reduce it.
5. Lastly, we update the network parameters: $w_j^{(n+1)}=w_j^{(n)}-\eta \frac{\partial L(\mathbf{w}^{(n)})}{\partial w_j^{(n)}}$. This is like adjusting our compass based on the direction we found in the previous step. We keep tweaking these parameters until we find the sweet spot where our loss is minimized.

Now, remember, we'll iterate over these steps for each epoch. And the best part is, there's no one-size-fits-all number of epochs. It's something we figure out based on our dataset, the task at hand, and the learning rate.

**Quick heads-up:** If you decide to run this cell again, make sure you also rerun the cells from the point where we created the network. This is because the network keeps track of the fitted parameters.

Okay, let's train this network! On your marks, get set, go!


In [None]:
epochs = 10
for i in range(epochs):

    # 1. Clear gradients
    optimizer.zero_grad()
    # 2. Forward pass
    prediction = net(X)
    # 3. Compute loss
    loss = loss_function(prediction, y)
    # 4. Calculate gradients
    loss.backward()
    # 5. Update network parameters
    optimizer.step()

    # Display results
    if i % 2 == 0:
        # note how we need to transform the data back to numpy
        #plt.cla()
        plt.plot(X.numpy(), y.numpy(), '*')
        plt.plot(X.numpy(), prediction.detach().numpy(), 'r-', lw=2)
        plt.xlabel('x')
        plt.ylabel('f(x)')
        plt.title(f"Epoch={i} | Loss={loss.item():.4f}")
        plt.pause(0.1)
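
If you'd like to check that the five steps really drive the loss down without looking at plots, you can record the loss in every epoch. The sketch below does this on synthetic data generated from $y=3x+1$ with a little noise. This data, the 50 epochs, the plain `nn.Linear(1, 1)` model used in place of `ANRegressor`, and the names `X_syn`, `y_syn`, `model`, `opt` and `losses` are all assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data: y = 3x + 1 plus a little noise
X_syn = torch.randn(100, 1)
y_syn = 3 * X_syn + 1 + 0.1 * torch.randn(100, 1)

model = nn.Linear(1, 1)
loss_function = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.2)

losses = []
for epoch in range(50):
    opt.zero_grad()                           # 1. clear gradients
    prediction = model(X_syn)                 # 2. forward pass
    loss = loss_function(prediction, y_syn)   # 3. compute loss
    loss.backward()                           # 4. calculate gradients
    opt.step()                                # 5. update parameters
    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
print(f"w = {model.weight.item():.3f}, b = {model.bias.item():.3f}")
```

The fitted weight and bias should end up close to the values 3 and 1 used to generate the data.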

## Exercise 1: Crafting a Neural Network Classifier

Alright, are you ready for some action? Let's roll up our sleeves and build a simple binary classifier using Pytorch! We'll be predicting heart failure based on cardiac indices EF and GLS.

Now, first things first. Click 'Run' on the next cell. This will get all our necessary libraries and plotting functions in place. You might notice that the plotting functions look familiar - they're quite similar to what we used in Week 4. The only difference is we've tweaked the 'PlotClassification' function a bit so that it can predict data using the Pytorch model, instead of the Scikit-learn model.

So, let's get this party started.

In [None]:
# imports
import torch
import torch.nn as nn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler


# plotting functions
def PlotData(X,y):
    y=y.flatten()
    plt.plot(X[y==0,0],X[y==0,1],'bo',alpha=0.75,markeredgecolor='k',label = 'Healthy')
    plt.plot(X[y==1,0],X[y==1,1],'rd',alpha=0.75,markeredgecolor='k',label = 'HF')
    plt.title('Diagnosis of Heart Failure')
    plt.xlabel('EF')
    plt.ylabel('GLS')
    plt.legend()

def PlotClassification(net,X,y):

    # Create a 1D array of samples for each feature
    x1 = np.linspace(-2.5, 2, 1000)
    x2 = np.linspace(-3, 3.5, 1000).T # note the transpose
    # Creates 2D arrays that hold the coordinates in 2D feature space
    x1, x2 = np.meshgrid(x1, x2)
    # Flatten x1 and x2 to 1D vector and concatenate into a feature matrix
    Feature_space = np.c_[x1.ravel(), x2.ravel()]

    # NEW: convert numpy to torch
    Feature_space = torch.from_numpy(Feature_space).float()
    # NEW: Predict output scores for the whole feature space
    output_scores = net(Feature_space)
    # NEW: Threshold the output probabilities
    y_pred = output_scores>0.5
    # NEW: Convert to numpy
    y_pred = y_pred.numpy()

    # Reshape to 2D
    y_pred = y_pred.reshape(x1.shape)
    # Plot using contourf
    plt.contourf(x1, x2, y_pred, cmap = 'summer')

    # Plot data
    PlotData(X,y)


### Training data


The first step we're going to take is dealing with our training data. The code below is going to load and plot our data, so go ahead and hit 'Run'.

**Task 1.1:** We need a bit of your magic here! I want you to complete the following code. Your task is to convert our feature matrix `X` and labels `y` into Pytorch tensors.

Remember, tensors are Pytorch's way of storing data, and they're essential for building and training our network. So, give it a shot, and let's make it happen!


In [None]:
# only do this if you work on Google Colab
# run the cell
# then upload file 'heart_failure_data.csv'

from google.colab import files
files.upload()

In [None]:
# load, standardise and reshape the training data
df = pd.read_csv('heart_failure_data.csv')
scaler = StandardScaler()
data = df.to_numpy()
X = scaler.fit_transform(data[:,:2])
y = data[:,2].reshape(-1,1)

# convert to tensors


print('X: ', X.shape)
print('y: ', y.shape)

# Plot data
PlotData(X,y)

### Network architecture

Our simple binary linear classifier network will in fact be a **logistic regression** model.

To create the network architecture we only need a single **linear layer** $z=\sum_{j=0}^Dw_jx_j$ with $D=2$ input features (EF, GLS) and one output feature (HF):

In [None]:
nn.Linear(2,1)

Since we're putting together a logistic regression classifier, our choice of activation function $f(z)$ will be the **sigmoid** function. This function, denoted as $\sigma(z)=\frac{1}{1+e^{-z}}$, is a perfect fit for our needs. It maps any input into a value between 0 and 1, which aligns well with our goal of predicting a binary outcome. So let's move forward with the sigmoid activation function!

In [None]:
nn.Sigmoid()
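
As a quick sanity check, the sketch below applies `nn.Sigmoid` to a few made-up scores and compares the result with the formula $\sigma(z)=\frac{1}{1+e^{-z}}$ written out by hand. The values in `z` are arbitrary and only serve as an illustration.

```python
import torch
import torch.nn as nn

# Arbitrary scores: negative, zero and positive
z = torch.tensor([-2.0, 0.0, 2.0])

sigmoid = nn.Sigmoid()
p = sigmoid(z)

# The same function written out by hand
manual = 1 / (1 + torch.exp(-z))

print(p)        # values between 0 and 1, with sigma(0) = 0.5
print(manual)
```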

For our binary classification task, we'll use the binary cross-entropy as our loss function. It's mathematically expressed as:

$$L(\mathbf{y},\mathbf{\hat{p}})=-\sum_{i=1}^N(y_i\log\hat{p_i}+(1-y_i)\log(1-\hat{p_i}))$$

In this equation, $\hat{p}_i=\sigma(z_i)$ is the probability predicted by our model for sample $i$. The binary cross-entropy loss measures the error between these predictions and the true labels $y_i$. Note that PyTorch averages the loss over the samples by default rather than summing it, which only rescales the expression above by $\frac{1}{N}$. And the good news is that PyTorch provides a built-in function for this, making our work even easier!

The binary cross-entropy loss in PyTorch is:

In [None]:
nn.BCELoss()
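
To see how `nn.BCELoss` relates to the formula, the sketch below evaluates it on three made-up labels and predicted probabilities (illustrative values only) and compares it with the hand-written expression, averaged over the samples as PyTorch does by default.

```python
import torch
import torch.nn as nn

# Made-up labels and predicted probabilities, shape N x 1
y = torch.tensor([[1.0], [0.0], [1.0]])
p = torch.tensor([[0.9], [0.2], [0.6]])

bce = nn.BCELoss()(p, y)

# Binary cross-entropy written out, averaged over samples
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

print(bce.item(), manual.item())
```

Note that `nn.BCELoss` expects probabilities as its first argument, which is why the sigmoid is applied before the loss.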

**Task 1.2:** Complete the code to create a linear binary classifier network using the building blocks shown above

In [None]:
# Define network architecture
class ANClassifier(nn.Module):
    def __init__(self):
        super(ANClassifier, self).__init__()
        self.layer = None
        self.sigmoid = None

    def forward(self, x):
        x = self.layer(x)
        x = self.sigmoid(x)
        return x

### Training
**Task 1.3:**
Fill in the code to create the network instance, loss and optimiser, as well as the 5 steps of training that are performed during each epoch. The code to plot the fitted model is provided for you. Some code has been given but commented out.


---

**Do you need some guidance?**

Alright, it's your time to shine! You're now going to put your network into action. Your mission for this exercise involves creating an instance of your network, setting up the loss function, getting the optimizer ready, and preparing your network for training. Don't worry, we've got some code snippets ready to guide you. Follow these steps:

- Start by creating **an instance** of your network. Give it a name you'll remember. We'll need to call on it soon.

- Next, we'll need to specify our **loss function**. We've previously talked about using **Binary Cross-Entropy**, but it's your decision!

- Now, choose an **optimizer**. Remember to pass in the network parameters and specify the learning rate.

- **Training time!** Set up a loop that will run for a number of **epochs** of your choosing. For each epoch, you'll have to implement these steps:

  - clear the gradients,
  - perform a forward pass,
  - compute the loss,
  - execute a backward pass,
  - and finally update the parameters.

You've got this! Let's dive in and see how well your network performs. Remember, we're here for the journey, not the destination. And don't forget, when the training is done, some code is provided to help you visualize your fitted model. You'll be able to see just how well your network has learned to classify the data. Happy coding!

In [None]:
# Create network
net2 = None

# Loss
loss_function = None

# Optimizer
optimizer = None #torch.optim.SGD(net2.parameters(), lr=0.2)

# Training
epochs = 100
for i in range(epochs):

    # 1. Clear gradients

    # 2. Forward pass
    prediction = None
    # 3. Compute loss
    loss = None
    # 4. Calculate gradients

    # 5. Update network parameters


# Plot classification result
#PlotClassification(net2,X,y)