# Lab 5.2: An Introduction to PyTorch

**Sources**: This lab is based on the PyTorch introduction lab by Juan-José Giraldo, Mauricio A Álvarez and Haiping Lu.

**Note**: Try to answer the questions when you first see them rather than coming back after going through the rest.

In this notebook, we look at PyTorch (the `torch` library in Python), which provides tensors with automatic differentiation. We will use PyTorch to implement different neural network models later on.

**Suggested reading**: 
* What is PyTorch from [PyTorch tutorial](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py)

**Assumptions**: basic Python programming and [Anaconda](https://anaconda.org/) installed.


## PyTorch Installation and Basics

If you are running on Google Colab, PyTorch will already be installed, but you may need to install it on your own machine. The PyTorch homepage lists several installation options, including conda and pip. Here we describe how to install it using conda.



### Install-1: direct installation (e.g., on your own machine with full installation rights)

**Install [PyTorch](https://github.com/pytorch/pytorch) via [Anaconda](https://anaconda.org/)**
`conda install -c pytorch pytorch`

 When you are asked whether to proceed, say `y`

**Install [torchvision](https://github.com/pytorch/vision)**
`conda install -c pytorch torchvision`

 When you are asked whether to proceed, say `y`

### Install-2: Set up Anaconda Python environment (e.g., on a university desktop)

On a university desktop, you may not have permission to install new packages in the main Anaconda environment. Please follow the instructions below to set up a new environment. This is also recommended if you have several Python projects that may require different environments.

Open a command line terminal.

**Create a new conda environment with Python 3.8**<br>
`conda create -n mlai python=3.8 anaconda`

**Activate the conda environment `mlai`** (see [conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-environments))<br>

For conda 4.6 and later versions: `conda activate mlai`

For conda versions prior to 4.6<br>
`activate mlai` (Windows)<br>
`source activate mlai` (Mac/Linux)<br><br>
You will see `(mlai)` on the left, indicating your environment

**Install PyTorch and torchvision** (non-CUDA/GPU version for simplicity)<br>
`conda install pytorch torchvision cpuonly -c pytorch`<br>
If you have a GPU, install the GPU version using the command given on the [PyTorch homepage](https://pytorch.org/)

**Start Jupyter notebook server**: `jupyter notebook`
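
Once the notebook server is running, a quick way to confirm that the installation worked is to import the library and run a tiny computation. The cell below is only a minimal sketch of such a check; the variable names `cuda_ok` and `check` are ours and are not used elsewhere in the lab.

```python
import torch

# Report the installed version and whether a CUDA-capable GPU is visible.
print("PyTorch version:", torch.__version__)
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)

# A tiny computation to confirm that tensor operations work.
check = torch.ones(2, 2) + 1
print(check)
```

With the CPU-only installation described above, we would expect `CUDA available: False`.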

## Tensors
A tensor generalises the concept of vectors and matrices to an arbitrary number of dimensions. Another name for the same concept is multidimensional array. The dimensionality of a tensor is the number of indexes used to refer to scalar values within the tensor. The cell below shows an example that initialises 1D, 2D and 3D tensors with values drawn uniformly at random from $[0, 1)$:

In [24]:
# We first import the torch library that comes with the Anaconda distribution
import torch 
# Tensor 1D presents 1 index
y = torch.rand([2])
print('Tensor 1D presents one index','with shape', y.shape,':\n',y) #get specific size with .shape
# Tensor 2D presents 2 indexes
y = torch.rand([2,3])
print('\nTensor 2D presents two indexes','with shape',y.shape,':\n',y)
#Tensor 3D presents 3 indexes
y = torch.rand([5,2,3])
print('\nTensor 3D presents three indexes','with shape',y.shape,':\n',y)

Tensor 1D presents one index with shape torch.Size([2]) :
 tensor([0.1953, 0.8615])

Tensor 2D presents two indexes with shape torch.Size([2, 3]) :
 tensor([[4.8850e-02, 1.2034e-01, 7.1931e-01],
        [1.5116e-01, 9.6343e-01, 3.1418e-04]])

Tensor 3D presents three indexes with shape torch.Size([5, 2, 3]) :
 tensor([[[0.7422, 0.4080, 0.0885],
         [0.4490, 0.8160, 0.9972]],

        [[0.7986, 0.0524, 0.7635],
         [0.4398, 0.1688, 0.6449]],

        [[0.8268, 0.0685, 0.7311],
         [0.3339, 0.4794, 0.2739]],

        [[0.4282, 0.9191, 0.7203],
         [0.9002, 0.9195, 0.7612]],

        [[0.2984, 0.9470, 0.0988],
         [0.2995, 0.5953, 0.1672]]])
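
To make the link between dimensionality and indexing concrete, the short sketch below builds a tensor with known values and inspects it. The tensor `t` and its sizes are chosen only for illustration.

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)  # values 0..23 laid out in a 2 x 3 x 4 tensor

print(t.ndim)       # number of dimensions (indexes): 3
print(t.shape)      # torch.Size([2, 3, 4])
print(t[1, 2, 3])   # three indexes select a single scalar: tensor(23)
print(t[0].shape)   # one index leaves a 2D tensor: torch.Size([3, 4])
```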


In [21]:
# Create a tensor with specific values
x = torch.tensor([4.0,5.0], dtype=torch.float32)
y = torch.tensor([2.0,3.0], dtype=torch.float32)

# Tensor multiplication (point-wise multiplication)
print(x*y)

# Tensor matrix multiplication
# Like numpy the @ symbol is defined as operating the matrix multiplication
# We should get the same output as if we call the matmul function
print('Using @:', x @ y)
print('Torch.matmul: ', torch.matmul(x, y))

# But wait, these are vectors, so how can we do a matmul? PyTorch
# computes the dot product when matmul is called on two 1D tensors.
# That is why we get a scalar, the same as from torch.dot:
print('Torch.dot: ', torch.dot(x, y))

# If we define a 2D matrix instead then it will return a vector
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float32)
print(A @ y)

tensor([ 8., 15.])
Using @: tensor(23.)
Torch.matmul:  tensor(23.)
Torch.dot:  tensor(23.)
tensor([ 8., 18.])


### Initialise a tensor with torch.zeros or torch.ones 

In [22]:
x_zeros = torch.zeros(size=[3,4])
print('x_zeros:',x_zeros,'with shape',x_zeros.shape,'\n')
x_ones = torch.ones(size=[2,6])
print('x_ones:',x_ones,'with shape',x_ones.shape)

x_zeros: tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]) with shape torch.Size([3, 4]) 

x_ones: tensor([[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]]) with shape torch.Size([2, 6])


### Reshape a tensor using .view

In [23]:
y = torch.ones([3,2])
y_reshaped = y.view(6,1)  # .view returns a tensor with a new shape that shares the same data (PyTorch also provides .reshape, as in NumPy)
print(y_reshaped)

tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]])
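
Two properties of `.view` are worth seeing in action: it does not copy the data, and it only works when the memory layout allows it. The cell below is a small sketch of both; the names `v`, `yt` and `r` are ours.

```python
import torch

y = torch.ones([3, 2])
v = y.view(-1)        # -1 asks PyTorch to infer the size: 6 elements here
v[0] = 5.0            # writing through the view...
print(y[0, 0])        # ...changes the original tensor too

# A transposed tensor is not contiguous in memory, so .view cannot be used on it;
# .reshape copies the data when it has to.
yt = y.t()
print(yt.is_contiguous())
r = yt.reshape(6)
print(r.shape)
```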


## Numpy interoperability 

PyTorch tensors can be converted efficiently to NumPy arrays and vice versa. By doing so, you can leverage the huge swath of functionality in the wider Python ecosystem that has built up around the NumPy array type.

In [5]:
# Tensor_torch to tensor Numpy
Tensor_torch = torch.ones(3,4)
Tensor_numpy = Tensor_torch.numpy() #Returns a NumPy multidim. array of the right size, shape and numerical type.
print('Array in numpy form with shape', Tensor_numpy.shape,':\n',Tensor_numpy, type(Tensor_numpy))

# Tensor Numpy to Tensor_torch
import numpy as np
Tensor_np = np.random.randn(5,8)    
Tensor_numpy_to_torch = torch.from_numpy(Tensor_np)
print('\nArray from Numpy to Torch with shape', Tensor_numpy_to_torch.shape,':\n',Tensor_numpy_to_torch)

Array in numpy form with shape (3, 4) :
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]] <class 'numpy.ndarray'>

Array from Numpy to Torch with shape torch.Size([5, 8]) :
 tensor([[-0.0797, -0.7932,  1.0387,  0.0757,  0.2101,  0.1149, -1.3329, -0.0188],
        [ 0.4187, -0.7249,  1.0331,  0.3841,  0.5832,  1.8500,  0.8665,  1.7129],
        [ 1.2308,  0.1967, -1.8170,  0.2413, -1.5926,  1.5866, -0.4867,  0.0288],
        [ 1.4488,  0.9729,  0.1536, -1.1897,  2.0185, -0.6600, -1.0183,  0.7835],
        [ 0.1226, -0.1430,  0.4806, -0.5670,  0.7416,  0.4147,  0.9983,  0.7901]],
       dtype=torch.float64)
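
Part of the reason the conversion is efficient is that `torch.from_numpy` does not copy the data: the tensor and the array share the same memory. The sketch below illustrates this and contrasts it with `torch.tensor`, which makes a copy. The names `a`, `t` and `c` are chosen only for this example.

```python
import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)   # no copy: t and a share the same buffer
a[0] = 1.0
print(t)                  # the change made in NumPy is visible from PyTorch

c = torch.tensor(a)       # torch.tensor copies the data instead
a[1] = 2.0
print(c)                  # unaffected by the later change to a
```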


### Loading a .csv dataset

We can take advantage of the interoperability between NumPy and PyTorch by loading a .csv file as a NumPy array and transforming it to a Torch tensor using `torch.from_numpy(dataset_np)`. 

In [6]:
# This cell is simply to download the winequality-red.csv dataset from its root url
import urllib.request
urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', './winequality-red.csv')

('./winequality-red.csv', <http.client.HTTPMessage at 0x1039abeb0>)

In [7]:
import numpy as np
#In the line below we skip the first row (skiprows=1) of the .csv file, which contains the column names
#the delimiter of data for this dataset is ";"
wine_np = np.loadtxt("./winequality-red.csv", dtype=np.float32, delimiter=";", skiprows=1)
wine_torch = torch.from_numpy(wine_np)  #We take advantage of the interoperability with numpy
wine_torch

tensor([[ 7.4000,  0.7000,  0.0000,  ...,  0.5600,  9.4000,  5.0000],
        [ 7.8000,  0.8800,  0.0000,  ...,  0.6800,  9.8000,  5.0000],
        [ 7.8000,  0.7600,  0.0400,  ...,  0.6500,  9.8000,  5.0000],
        ...,
        [ 6.3000,  0.5100,  0.1300,  ...,  0.7500, 11.0000,  6.0000],
        [ 5.9000,  0.6450,  0.1200,  ...,  0.7100, 10.2000,  5.0000],
        [ 6.0000,  0.3100,  0.4700,  ...,  0.6600, 11.0000,  6.0000]])
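
Once the data is in a tensor, a common next step is to separate the input features from the target column by slicing. The sketch below does this on three made-up rows that mimic the layout of the wine file (header line, `;` delimiter, target in the last column), so that it runs without the download; the column names and values are illustrative only.

```python
import io
import numpy as np
import torch

# Three made-up rows in the same layout as the wine file: a header line,
# ";" as delimiter, and the target in the last column.
csv_text = "a;b;quality\n7.4;0.70;5\n7.8;0.88;5\n6.3;0.51;6\n"
data_np = np.loadtxt(io.StringIO(csv_text), dtype=np.float32, delimiter=";", skiprows=1)
data = torch.from_numpy(data_np)

features = data[:, :-1]        # every column except the last
target = data[:, -1].long()    # last column, converted to integer labels
print(features.shape, target)
```

The same two slices could be applied to `wine_torch` from the cell above.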

## Automatic Differentiation





### Computational Graph


A computation graph defines/visualises a sequence of operations to go from input to model output. 

Consider a linear regression model $\hat y = Wx + b$, where $x$ is our input, $W$ is a weight matrix, $b$ is a bias, and $\hat y$ is the predicted output. As a computation graph, this looks like:

![Linear Regression Computation Graph](https://imgur.com/IcBhTjS.png)

PyTorch builds the computational graph dynamically, as the operations are executed, for example

![DynamicGraph.gif](https://raw.githubusercontent.com/pytorch/pytorch/master/docs/source/_static/img/dynamic_graph.gif)
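
We can peek at the graph PyTorch records for the linear model $\hat y = Wx + b$ through the `grad_fn` attribute of the result. The cell below is a minimal sketch with made-up values for `W`, `x` and `b`.

```python
import torch

x = torch.tensor([1.0, 2.0])
W = torch.tensor([[0.5, -1.0]], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

y_hat = W @ x + b   # the graph is recorded while this line runs

# Each result remembers the operation that produced it.
print(y_hat.grad_fn)                 # the final addition node
print(y_hat.grad_fn.next_functions)  # the nodes feeding into that addition
print(x.grad_fn, W.grad_fn)          # tensors we created directly have no grad_fn: None None
```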

### Computing gradients with PyTorch
PyTorch can automatically compute the gradients of a function with respect to a tensor. When creating the tensor, we have to indicate that it requires the gradient computation using the flag `requires_grad`  

In [8]:
x = torch.rand(3, requires_grad=True)
print(x)

tensor([0.6484, 0.6068, 0.4757], requires_grad=True)


Notice that now the Tensor shows the flag `requires_grad` as True. We can also activate such a flag in a Tensor already created as follows:

In [16]:
x = torch.tensor([1.0,2.0,3.0])
x.requires_grad_(True)
print(x)

tensor([1., 2., 3.], requires_grad=True)


Let us define a function $y=x^2+5$. The new tensor $y$ carries not only the result of evaluating the function at $x$, but also a reference called `grad_fn` to the operation that produced it, which PyTorch uses to compute $\frac{\partial y}{\partial x}$

In [17]:
x = torch.tensor([2.0])
x.requires_grad_(True)  #indicate we will need the gradients with respect to this variable
y = x**2 + 5
print(y)

tensor([9.], grad_fn=<AddBackward0>)


To evaluate the partial derivative $\frac{\partial y}{\partial x}$, we use the `.backward()` function and the result of the gradient evaluation is stored in `x.grad` 

In [18]:
y.backward()  #dy/dx
print('PyTorch gradient:', x.grad)

#Let us compare with the analytical gradient of y = x**2+5
with torch.no_grad():    #this is to only use the tensor value without its gradient information
    dy_dx = 2*x  #analytical gradient
print('Analytical gradient:',dy_dx)

PyTorch gradient: tensor([4.])
Analytical gradient: tensor([4.])


If we evaluate the function on a vector $\mathbf{w}=[w_1, \ldots, w_D]^{\top}$, we obtain another vector $\mathbf{g}=[g_1, \ldots, g_D]^{\top}$ with elements $g_i=w_i^2+5$. Because $\mathbf{g}$ is not a scalar, calling `g.backward()` without an argument raises an error. To obtain the gradient w.r.t. $\mathbf{w}$, we have to pass a vector with the same shape as `g` (here equal to `w.shape`) to the function, i.e., `g.backward(vect)`. 

In [12]:
w = torch.tensor([1.0,2.0,3.0])
w.requires_grad_(True)

g = w**2+5
# Below, g.backward(vect) multiplies vect = [1.0,1.0,1.0] by the Jacobian of g;
# since g acts element by element, using ones leaves the gradient values unchanged
vect = torch.tensor([1.0,1.0,1.0],dtype=torch.float32) 
g.backward(vect)
print(w.grad)

tensor([2., 4., 6.])
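
To see what the vector passed to `backward` does, the sketch below compares two options: reducing `g` to a scalar with a sum, and passing a vector that is not all ones. The weighting vector `[1.0, 0.0, 2.0]` is made up for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Option 1: reduce to a scalar first, then call backward() with no argument.
g = w**2 + 5
g.sum().backward()
grad_sum = w.grad.clone()
print(grad_sum)            # same values as with a vector of ones: 2, 4, 6

# Option 2: pass a non-uniform vector; each gradient entry is scaled by it.
w.grad.zero_()
g = w**2 + 5
g.backward(torch.tensor([1.0, 0.0, 2.0]))
grad_weighted = w.grad.clone()
print(grad_weighted)       # entries 2*1, 4*0 and 6*2
```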


On the other hand, when `.backward()` is called repeatedly, e.g., in a for loop, PyTorch accumulates (adds up) the gradients at each
iteration. To avoid this behaviour, we have to call `.grad.zero_()` at each iteration. See in the example below what happens when you comment out the line `w.grad.zero_()`:

In [14]:
#Pytorch uses a cumulative process for the gradients
w = torch.tensor([1.0,2.0,3.0])
w.requires_grad_(True)

for i in range(3):
    g = w**2+5
    g.backward(torch.ones_like(w))
    print(w.grad)
    w.grad.zero_()    #this line avoids the accumulation of the gradients; comment it out to see its effect

tensor([2., 4., 6.])
tensor([2., 4., 6.])
tensor([2., 4., 6.])
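
For comparison, here is a sketch of the same loop without the call to `w.grad.zero_()`. The list `history` is ours and only stores a copy of the gradient after each iteration.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

history = []
for i in range(3):
    g = w**2 + 5
    g.backward(torch.ones_like(w))
    history.append(w.grad.clone())   # clone, because w.grad is updated in place
    # no w.grad.zero_() here, so the next backward() adds to the stored values

for grad in history:
    print(grad)
```

The stored gradient grows by `[2., 4., 6.]` at every iteration, which is why the training loops later in this lab zero the gradients after each update.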


###  Question 5

Verify that the gradients provided by PyTorch coincide with the *analytical* gradients of the function $f(x) = \exp \big(-x^2-2x- \sin (x) \big)$ w.r.t $x$.

In [None]:
# Provide your answer here

## Linear Regression Basic Example

We now provide a very simple example of linear regression with one input dimension, $y=wx+b$, and illustrate how we use PyTorch to optimise the parameters of the model

In [None]:
Ndata = 100 
x = torch.rand(Ndata)
true_w = 1.5
true_bias = 1.0
# We generate the dataset from the actual model but adding some noise
y = true_w*x + true_bias + 0.05*torch.randn(Ndata)
# We make sure to set the requires_grad flag to True for both parameters
w = torch.tensor(0.0,dtype=torch.float32,requires_grad=True)
bias = torch.tensor(0.0,dtype=torch.float32,requires_grad=True)

We now define two useful functions, the prediction function and the objective function

In [None]:
def model_prediction(x,w,bias):
    return w*x + bias

def loss_function(y,y_pred):
    return ((y_pred-y)**2).mean()  #Mean Squared Error (MSE)

And we use gradient descent with step size $\eta$ to estimate the parameters of the model, where $E$ is the loss

\begin{align*}
    w_{k+1} &= w_k - \eta \frac{dE}{dw}\\ 
    b_{k+1} &= b_k - \eta \frac{dE}{db}
\end{align*}

We know that there is a closed form solution for $w$ and $b$ through the normal equation. The example is for illustrative purposes.
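
For reference, the closed-form least-squares solution can be obtained with `torch.linalg.lstsq`. The cell below is a sketch on freshly generated data from the same model ($w=1.5$, $b=1.0$, noise scale $0.05$); the names `x_cf`, `y_cf`, `X_cf`, `w_hat` and `bias_hat` are ours, chosen so that the variables used in the rest of this section are not overwritten.

```python
import torch

torch.manual_seed(0)
n = 100
x_cf = torch.rand(n)
y_cf = 1.5 * x_cf + 1.0 + 0.05 * torch.randn(n)

# Design matrix with a column of ones for the bias and a column holding x.
X_cf = torch.stack([torch.ones(n), x_cf], dim=1)

# Least-squares solution of X @ theta = y; lstsq expects a 2D right-hand side.
theta = torch.linalg.lstsq(X_cf, y_cf.unsqueeze(1)).solution.squeeze(1)
bias_hat, w_hat = theta[0].item(), theta[1].item()
print(f"closed form: w = {w_hat:.3f}, bias = {bias_hat:.3f}")
```

The estimates should be close to the true values, and gradient descent should approach the same solution when run on the same data.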

In [None]:
Max_Niter = 500
step_size = 0.1
for Niter in range(Max_Niter):
    # Evaluate the prediction and the loss
    y_approx = model_prediction(x,w,bias)
    my_loss = loss_function(y,y_approx)
    
    # The function .backward() has to be called in order to store the gradients in w.grad and bias.grad
    # Notice that here it is not necessary to pass a vector since loss_function returns a scalar
    my_loss.backward()  
        
    with torch.no_grad():        # no_grad() stops PyTorch from recording the update in the computational graph,
        w -= step_size*w.grad    # so the value of w can be changed in place
        bias -= step_size*bias.grad
    
    # Set the gradients to zero to avoid accumulation
    w.grad.zero_()
    bias.grad.zero_()
    
    # We print the loss, and the parameters values every 50 iterations
    if Niter%50==0:
        print(f'Iteration = {Niter+1}, Loss = {my_loss:.8f}, w = {w:.3f}, bias = {bias:.3f}')        

print(f'Iteration = {Niter+1}, Loss = {my_loss:.8f}, w = {w:.3f}, bias = {bias:.3f}')    
    

We finally plot the result

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x,y,'x')
xtest = torch.linspace(0,1,10)
with torch.no_grad():
    y_pred = model_prediction(xtest,w,bias)
plt.plot(xtest,y_pred)

## Revisiting linear regression for the Rented Bike Dataset of lab notebook 2

We will implement a linear regression for the Rented Bike dataset previously used in Lab 2. We will use the same data preparation through `sklearn.preprocessing`: `OneHotEncoder()`, which transforms a categorical variable into a one-hot encoding representation, and `StandardScaler()`, which performs feature scaling by standardisation.

In [None]:
import urllib.request
urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv', './SeoulBikeData.csv')

The following code was borrowed from Lab Notebook 2. You can go back to that Notebook for details.

In [None]:
import pandas as pd 

bike_sharing_data = pd.read_csv('SeoulBikeData.csv', encoding= 'unicode_escape')
bike_sharing_data = bike_sharing_data.drop('Date', axis=1)

for col in ['Rented Bike Count', 'Hour', 'Humidity(%)', 'Visibility (10m)']:
    bike_sharing_data[col] = bike_sharing_data[col].astype('float64')

attributes_cat = ['Seasons', 'Holiday', 'Functioning Day']
attributes_num = ['Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', \
                  'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

# We split our dataset for Training and Testing

from sklearn.model_selection import train_test_split
bs_train_set, bs_test_set = train_test_split(bike_sharing_data, test_size=0.15, random_state=42)
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

full_transform = ColumnTransformer([
    ("num", StandardScaler(), attributes_num),
    ("cat", OneHotEncoder(), attributes_cat),
])

# We separate the features from the labels

bs_train_set_attributes = bs_train_set.drop('Rented Bike Count', axis=1)
bs_test_set_attributes = bs_test_set.drop('Rented Bike Count', axis=1)
bs_train_set_labels = bs_train_set['Rented Bike Count']
bs_test_set_labels = bs_test_set['Rented Bike Count']

We now use the function `torch.from_numpy()` to transform the data previously prepared, into a Torch Tensor. We make sure to add a column of ones to the attributes (remember that $x_0=1$) both in the train and test sets.

In [None]:
# We apply the preprocessing transformation over the features of the training data

bs_train_set_attributes_prepared = full_transform.fit_transform(bs_train_set_attributes)
bs_test_set_attributes_prepared = full_transform.transform(bs_test_set_attributes)

Train_torch = torch.from_numpy(bs_train_set_attributes_prepared)

# The line below adds a feature vector of ones in order to allow the bias weight
# to be represented in a unique weight vector.

Train_torch = torch.cat((torch.ones([Train_torch.shape[0],1],dtype=torch.float64),Train_torch), 1)  
Test_torch = torch.from_numpy(bs_test_set_attributes_prepared)

# The line below adds a feature vector of ones in order to allow the bias weight
# to be represented in a unique weight vector.

Test_torch = torch.cat((torch.ones([Test_torch.shape[0],1],dtype=torch.float64),Test_torch), 1)
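# Labels are reshaped to column vectors of shape (N, 1) so that they match the output of
# torch.matmul(x, w); a 1D label tensor would broadcast to shape (N, N) inside the loss.
Train_Label_torch = torch.from_numpy(bs_train_set_labels.values).reshape(-1, 1)

Test_Label_torch = torch.from_numpy(bs_test_set_labels.values).reshape(-1, 1)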

We create a vector of weights $\mathbf{w}$ with the corresponding flag for the gradient and two functions, one for prediction and one for the loss function.

In [None]:
# We create the vector of weights to be optimised in the linear regression model
dim = Train_torch.shape[1]
w = torch.randn([dim,1],dtype=torch.float64)  # vector of weight w is a vector Dim x 1
w.requires_grad_(True)

# We create the model prediction, which consists of the matrix product Xw, where X is a design matrix of N x Dim
def model_prediction_lr(x,w):
    return torch.matmul(x,w)

def loss_function_lr(y,y_pred):
    return ((y_pred-y)**2).mean()  # Mean Squared Error (MSE)
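
One thing to watch with a loss written this way is the shape of its two arguments. A prediction of shape `(N, 1)` minus a label tensor of shape `(N,)` does not fail; it broadcasts to an `(N, N)` matrix and the mean is taken over all pairs. The sketch below shows the effect on three made-up values; the names `pred_col` and `labels_flat` are ours.

```python
import torch

pred_col = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), as returned by X @ w
labels_flat = torch.tensor([1.0, 2.0, 3.0])      # shape (3,), e.g. labels from a pandas column

diff_bad = pred_col - labels_flat                # broadcasts to shape (3, 3)
diff_ok = pred_col - labels_flat.reshape(-1, 1)  # shape (3, 1), element by element

mse_bad = (diff_bad**2).mean()
mse_ok = (diff_ok**2).mean()
print(diff_bad.shape, mse_bad)   # non-zero although the predictions are perfect
print(diff_ok.shape, mse_ok)     # zero, as expected
```

Checking `y.shape` and `y_pred.shape` before training is a cheap way to catch this.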

We finally use gradient descent to find the optimal value for $\mathbf{w}$
$$
\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \frac{dE(\mathbf{w})}{d\mathbf{w}}
$$


In [None]:
# Training the model with Gradient Descent

Max_Niter = 50 # If you have many iterations, this process can take some time
step_size = 0.001
for Niter in range(Max_Niter):
    y_approx = model_prediction_lr(Train_torch,w)
    my_loss = loss_function_lr(Train_Label_torch,y_approx)
    
    # The function .backward() has to be called in order to store the gradients in w.grad
    # Notice that here it is not necessary to pass a vector since loss_function_lr returns a scalar
    
    my_loss.backward()  
        
    with torch.no_grad():    # no_grad() stops PyTorch from recording the update in the computational graph,
        w -= step_size*w.grad    # so the value of w can be changed in place
        
    # print(w.grad)
    # Set the gradient to zero to avoid accumulation
    w.grad.zero_()
    
    if Niter%20==0 or Niter == Max_Niter-1:
        print(f'Iteration = {Niter+1}, Loss = {my_loss:.8f}')
        #print('Weights vector:\n', w)

We finally provide the RMSE for the test set

In [None]:
# RMSE over the test set

y_pred_test = model_prediction_lr(Test_torch,w)
MSE_test = loss_function_lr(Test_Label_torch,y_pred_test)
print('The Root Mean Squared Error over the test set is:', np.sqrt(MSE_test.detach().numpy()))

## Using torch.nn and torch.optim

In the above sections we have performed a lot of the training steps by hand but PyTorch provides many useful implementations of these steps. In this section, we will use the torch.nn and torch.optim parts to simplify the training.

### torch.nn
Let's start with the nn part of the library. This implements many building blocks for neural networks and graphs. This includes linear layers and non-linear activation functions. We will start to use more of this next time but this will hopefully get you started with it. 


In [None]:
from torch import nn


First let's have a look at a linear layer. This is simply the operation that we have been performing with $\mathbf{w} \mathbf{x}$, i.e., it is a linear combination of the weight parameters and the inputs. The [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) class will set up and apply this operation for us. Internally it holds the weight and bias parameters that we will learn in linear regression. Let's test it out with a simple layer with 2 inputs and 1 output.

In [None]:
# The weights and biases will be initialised randomly, so let's fix the seed to get the same values.
torch.manual_seed(12345)

# Create an instance of a Linear layer (setting the bias to be included)
lin = nn.Linear( in_features = 2, out_features = 1, bias=True)

print("The initial weights are:\n", lin.weight)
print("The initial bias is:\n", lin.bias)
# Note that when we print these it tells us that it is a Parameter. This is a special type indicator that will be used for training.

#Let's test it out. It has an () operator that will apply the operation to the input.
input = torch.rand([2])
print("Input values are:\n", input)
print("The result of the linear layer is:\n", lin(input))

# Calculate the result for yourself. Is it correct?
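
One way to do this check in code is sketched below. It uses a fresh layer and input (named `layer` and `inp` here so that `lin` and `input` from the cell above are left untouched) and relies on the fact that `nn.Linear` stores its weight with shape `(out_features, in_features)`.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=2, out_features=1, bias=True)
inp = torch.rand(2)

# nn.Linear stores its weight with shape (out_features, in_features)
# and computes inp @ weight.T + bias.
by_hand = inp @ layer.weight.T + layer.bias
print(layer(inp), by_hand)
print(torch.allclose(layer(inp), by_hand))
```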

### torch.nn.functional

nn.Linear allowed us to implement the linear summation but we can also use the [nn.functional](https://pytorch.org/docs/stable/nn.functional.html) library. Again there are many options here but let's just consider replacing our loss function with one implemented by PyTorch. Earlier we used the mean squared error so we can do the same here. Note that it is also implemented as a class in torch.nn, but nn.functional gives us the function directly.




In [None]:
import torch.nn.functional as F

loss_func = F.mse_loss

#Let's try it out
predictions = torch.rand([3])
targets = torch.rand([3])

print(predictions)
print(targets)
print("PyTorch loss function:", loss_func(predictions, targets))
print("Our earlier loss function:", loss_function_lr(predictions, targets))
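
For completeness, the sketch below compares the functional form with the class form `nn.MSELoss` on hand-picked values, and shows the `reduction` argument. The names `p`, `t` and `criterion` are chosen for this example only.

```python
import torch
from torch import nn
import torch.nn.functional as F

p = torch.tensor([0.0, 1.0, 2.0])
t = torch.tensor([0.0, 3.0, 2.0])

criterion = nn.MSELoss()            # class form: build once, then call like a function
loss_class = criterion(p, t)
loss_functional = F.mse_loss(p, t)  # functional form: called directly
print(loss_class, loss_functional)  # both equal (0 + 4 + 0) / 3

# The reduction argument controls how the per-element errors are combined.
loss_sum = F.mse_loss(p, t, reduction="sum")
loss_each = F.mse_loss(p, t, reduction="none")
print(loss_sum)    # sum of the squared errors
print(loss_each)   # one squared error per element
```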


### torch.optim

When we have many parameters (weights, biases, etc.) it can be tricky keeping track of them individually. The [torch.optim](https://pytorch.org/docs/stable/optim.html) library helps with this by automatically applying our gradient descent weight updates. There is also a variety of update rules available, for example Adam is a widely used one. We will use `optim.SGD`, the stochastic gradient descent optimiser; when the loss is computed over the whole training set, each step is the same gradient descent update as earlier. 

In [None]:
from torch import optim

# When we initialise the SGD class we need to tell it what parameters we want to update and the learning rate.
# Here we will just use the linear layer from above and step_size from before.
opt = optim.SGD(lin.parameters(), lr = step_size)

# Since the parameters are linked to the optimiser we can use it to zero all of their gradients
opt.zero_grad()

# So now if we use our linear layer as a model to accumulate the gradients
targets = torch.rand([1])
predictions = lin(input)
loss = loss_func(predictions, targets)
loss.backward()

print("Our gradients are:", lin.weight.grad)

# Now when we want to update the parameters we can just call
opt.step()

print('Our updated weights are:\n', lin.weight)
print('Our updated bias is:\n', lin.bias)

# And remember to zero the gradients to stop multiple steps building up (unless that's what you want, see batches)
opt.zero_grad()
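
To connect `opt.step()` with the manual updates from earlier, the sketch below checks that one step of plain SGD (no momentum or weight decay) equals `parameter - lr * gradient`. It uses its own small layer, optimiser and data (`layer`, `sgd`, `inp`, `target`), all made up for this check.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(2, 1)
sgd = optim.SGD(layer.parameters(), lr=0.1)

inp = torch.tensor([1.0, 2.0])
target = torch.tensor([0.5])

loss = ((layer(inp) - target) ** 2).mean()
loss.backward()

# What a plain SGD step should produce: parameter - lr * gradient.
expected_weight = layer.weight.detach() - 0.1 * layer.weight.grad
expected_bias = layer.bias.detach() - 0.1 * layer.bias.grad

sgd.step()
print(torch.allclose(layer.weight, expected_weight))
print(torch.allclose(layer.bias, expected_bias))
```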




## Putting them together
Let's now use this new functionality to simplify our training routine from earlier.

In [None]:
# Training the model with Gradient Descent

Max_Niter = 50 # If you have many iterations, this process can take some time
step_size = 0.001

# Create a linear model with the correct number of input features
# Here we will turn the bias off since we didn't have it before
# By default, Linear will be float32 (single precision) dtype but our training
# data is float64 (double) so we will need to initialise it correctly.
model = nn.Linear(in_features = dim, out_features = 1, bias=False, dtype=Train_torch.dtype)

# Set up our optimiser, linked to the model parameters and step size.
opt = optim.SGD(model.parameters(), lr=step_size)

# nn.Linear gives us an output of shape (number_of_samples, 1), so the labels must have the same shape.
# To avoid shape mismatches, let's reshape our label array into a column (this has no effect if it already is one).
Train_Label_torch = Train_Label_torch.reshape((-1,1))
print(Train_Label_torch.shape)

for Niter in range(Max_Niter):
    y_approx = model(Train_torch)
    my_loss = loss_func(y_approx, Train_Label_torch)
    
    my_loss.backward()  

    opt.step()
    opt.zero_grad()
        
    if Niter%20==0 or Niter == Max_Niter-1:
        print(f'Iteration = {Niter+1}, Loss = {my_loss:.8f}')
        #print('Weights vector:\n', w)

Try changing the number of max iterations (epochs) to see how low you can get the loss. While it is performing the same operations as before this will set us up for more complicated models.

---
## Logistic regression with PyTorch

Now that we are more familiar with PyTorch, let's try implementing logistic regression using PyTorch rather than Scikit-Learn. The implementation is very similar to linear regression but we need to modify our model and loss function. 

For converting the output of the linear layer to a probability you will need to use the [logistic sigmoid](https://pytorch.org/docs/stable/generated/torch.nn.functional.sigmoid.html#torch.nn.functional.sigmoid).
Instead of the mse_loss, the logistic regression uses the negative log likelihood. Another name for this is [binary cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy.html#torch.nn.functional.binary_cross_entropy). 
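
The sketch below shows how these two pieces fit together on three made-up logits and labels, and checks that the binary cross entropy matches the mean negative log likelihood written out by hand. It is only an illustration of the two functions, not a full training loop; the values of `z` and `labels` are arbitrary.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 0.0, 3.0])        # outputs of a linear layer (logits)
labels = torch.tensor([0.0, 1.0, 1.0])    # binary targets as floats

prob = torch.sigmoid(z)                   # probabilities in (0, 1)
bce = F.binary_cross_entropy(prob, labels)

# The same quantity written out as the mean negative log likelihood.
nll = -(labels * torch.log(prob) + (1 - labels) * torch.log(1 - prob)).mean()

# A variant that takes the logits directly and applies the sigmoid internally.
bce_logits = F.binary_cross_entropy_with_logits(z, labels)
print(bce, nll, bce_logits)
```

All three printed values should agree up to floating-point error.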



### Question 6

Implement logistic regression using PyTorch (the `torch.nn` module) and apply it to the synthetic 2D data in part A (or a real dataset if you want it to be more challenging) for classification. You may also vary the synthetic data to observe performance variation. Check out [**reproducibility** in PyTorch](https://pytorch.org/docs/stable/notes/randomness.html).

In [None]:
# Provide your answer here

## Additional ideas to explore

* Change the [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions) to different choices and compare the results.  
* Formulate another regression problem and solve it using `torch.nn`
* Compare the `torch.nn` solution against the closed-form solution
* Explore any other variations that you can think of to learn more
* Explore more advanced examples at the [PyKale library](https://github.com/pykale/pykale/tree/master/examples) 