<center>
<img src="https://venturebeat.com/wp-content/uploads/2019/06/pytorch-e1576624094357.jpg?w=1200&strip=all" style="width: 800px;">
</center>


# PyTorch



---

### In this lesson you'll learn:
* how to program a simple neural net using PyTorch.
* how to implement more advanced layers in your neural net (Dropout, Batchnorm).
* about more advanced optimisation (momentum, Adam).
---

Last week you programmed a simple neural network yourself. As mentioned earlier, it is not necessary to program every net yourself. Certain software packages take care of many of the inconveniences of creating and training nets "by hand".

Essentially, there are two libraries that can be used: PyTorch and TensorFlow. TensorFlow is developed by Google and is the more popular choice, especially in industry. PyTorch, on the other hand, is mostly used in the scientific world. PyTorch is generally considered the easier framework to learn and is a bit more user-friendly overall.

While there used to be major differences, today the two libraries are becoming more and more similar in functionality.

Finally, there are Keras and PyTorch Lightning. Both aim to make neural network creation even easier. Keras uses TensorFlow in the background, but makes it easier to train networks, especially for beginners. PyTorch Lightning does the same on top of PyTorch.

In cheminformatics, however, PyTorch is a good choice, because specialized libraries, for example for Graph Neural Networks, exist (or for a long time existed) only for PyTorch.


An essential part of PyTorch is **autograd**. Autograd is the component of PyTorch that, as the name suggests, computes and collects the gradients automatically, so you don't have to calculate the gradients yourself.
In addition, many functions, such as activation functions or linear transformations, are already implemented in PyTorch.

*TensorFlow has these functionalities as well, of course.*


### Tensors

While you have been working with `numpy` arrays so far, today we will use `tensors`, more precisely PyTorch `tensors`. 

**What is the difference?**

First of all, none. Arrays and tensors are similar in many ways. Both store numbers/values in a structured form. So in NumPy you can store matrices in a 2D array, but you can also store the same matrix in a 2D tensor.
Also, tensors can be converted to arrays and arrays can be converted to tensors.

The difference between the two "storage options" is that PyTorch tensors were developed by PyTorch. And NumPy arrays were developed by the developers of NumPy. 
Many features that NumPy offers are also available from PyTorch for their tensors (but may be called differently). 
PyTorch developed its tensors to perform mathematical operations faster. In addition, tensors can also be "loaded" onto the graphics card, which increases the speed of operations many times over.

Calculations with `tensors` are almost identical to calculations with `np.arrays`. But the functions can have different names. For example `torch.mm()` is the function for matrix multiplication and `.t()` is the transpose of a matrix. Similar to the function `np.array()` to create arrays, `torch.tensor()` creates tensors.
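The conversion between arrays and tensors, and the "loading" onto the graphics card, can be sketched as follows. This is only a minimal illustration: the variable names are made up for this example, and the code falls back to the CPU when no CUDA-capable graphics card is available.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])  # illustrative NumPy array

t = torch.tensor(arr)   # array -> tensor (copies the data)
back = t.numpy()        # tensor -> array (works for tensors on the CPU)

# Move the tensor to the graphics card if one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
t_dev = t.to(device)

print(t.dtype)       # NumPy's default float64 is kept
print(back.shape)
print(t_dev.device)
```

Note that `torch.tensor()` keeps the data type of the array, which is why the data type is set explicitly later in this lesson.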

In [None]:
import torch # loads PyTorch

In [None]:
X = torch.tensor([[1,2,3],
                [4,5,6]])

W = torch.tensor([[8,9,10],
                 [11,12,313]])

b = torch.tensor([1,2])

torch.mm(X,W.t())+b

This is the linear transformation, which you also already know.<br>
However, PyTorch simplifies this step. 
PyTorch has a module called `nn`, which contains many functions that are helpful in creating neural networks.

We can load the module `nn` with `from torch import nn`. 

# Neural Net with PyTorch

In [None]:
from torch import nn

Among others, the submodule `nn` provides the function `nn.Linear`. It performs the linear transformation $xW^T + b$.
As input, the function takes:


* `in_features` the number of features the input has before the transformation, i.e. the size of the input layer. Yesterday the images had 784 pixels, so 784 features.
* `out_features` the number of features the input should have after the transformation. So `out_features` defines the size of the hidden layer. 



In [None]:
layer_1 = nn.Linear(in_features = 784, out_features=300, bias=True)

Isn't the input for the layer missing?

That's right. So far you have not performed a linear transformation at all; you have only created a variable `layer_1`. This variable can then perform the linear transformation for us.

---
*Technically `nn.Linear` is not a function, but a `class`. Classes are special Python objects. How exactly classes work is not relevant for this course. What is important to understand is that*

```py
layer_1 = nn.Linear(in_features = 784, out_features=300, bias=True)
```
*creates an object `layer_1` which belongs to the class `nn.Linear`. Each class in Python can have its own functions. For example, most `nn` classes have a `forward` function that executes a particular forward pass.*

---
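To make the connection to the manual computation from the beginning concrete, the following sketch compares an `nn.Linear` object with `torch.mm`. The layer sizes and the random input are chosen only for illustration. The sketch uses the attributes `weight` and `bias`, which hold the parameters of the layer.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=4, out_features=3)  # small illustrative layer
x = torch.randn(5, 4)                             # 5 samples with 4 features each

out_layer = layer(x)                                     # calls layer.forward(x)
out_manual = torch.mm(x, layer.weight.t()) + layer.bias  # x W^T + b by hand

print(layer.weight.shape)  # (out_features, in_features)
print(torch.allclose(out_layer, out_manual, atol=1e-5))
```

Calling the object like a function (`layer(x)`) is the usual way to run its forward pass.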



A practical feature of `nn` layers is that the weights of these layers are automatically initialized by PyTorch. This already saves you a little work.
The weights $W$ of these layers can also be viewed.

For this you use `list(layer_1.parameters())[0]`.
If you want to know the exact size of the weight matrix, you can use `.shape` like NumPy: `list(layer_1.parameters())[0].shape`. 

In [None]:
list(layer_1.parameters())[0]

In [None]:
list(layer_1.parameters())[0].shape

As you can see, the weight matrix has the same size as the matrix from last week. You can also see that the matrix already contains (randomly initialized) weights.
All you need now is an input (images) that you want to change with this linear transformation.
For this, you load the training dataset from last week with `numpy`.

Additionally, the images need to be transformed into a tensor. For this you use `torch.tensor()`.


Of course you have to scale the data again. For this you use the min-max scaler.
In PyTorch, you have to pay more attention to the data types.
Therefore we define the datatype `dtype` explicitly. The datatype for our images is `float32`. You know `float` from yesterday; the `32` is the number of bits used per number and therefore defines how exact this number can be.
The labels are stored as `long`. This may not mean anything to you, but it simply denotes integers (stored with 64 bits).
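The effect of the `dtype` argument can be seen in a small sketch. The arrays below are made-up stand-ins for the pixel values and the labels, not the real dataset.

```python
import numpy as np
import torch

pixels = np.array([0.0, 0.5, 1.0])   # NumPy floats are float64 by default
labels = np.array([3, 1, 4])         # illustrative integer labels

x_default = torch.tensor(pixels)                       # keeps float64
x_float32 = torch.tensor(pixels, dtype=torch.float32)  # matches the default dtype of nn layers
y_long = torch.tensor(labels, dtype=torch.long)        # torch.long is an alias for torch.int64

print(x_default.dtype, x_float32.dtype, y_long.dtype)
```

Without the explicit `dtype`, the images would stay `float64` and would not match the `float32` weights of the layers.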



<br>
<details>
<summary><strong>Only for the particularly interested:</strong></summary>

In the last notebook we discussed why we initialize the weight matrix in this particular way.
In fact, in TensorFlow, the weight matrices are created so that matrix multiplication can be performed without `transpose`.

    
Here is the description of the code for the forward pass in TensorFlow.
    

  `Dense` implements the operation:
  `output = activation(dot(input, kernel) + bias)`
  where `activation` is the element-wise activation function
  passed as the `activation` argument, `kernel` is a weights matrix
  created by the layer, and `bias` is a bias vector created by the layer
  (only applicable if `use_bias` is `True`). These are all attributes of
  `Dense`.

PyTorch's description can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html).
    

</details>


In [None]:
import numpy as np
def min_max(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))


# genfromtxt reads text files; with delimiter=',' it can also read .csv (comma-separated values) files
train_data = np.genfromtxt('https://uni-muenster.sciebo.de/s/xSU1IKM6ui4WKAV/download', delimiter=',', skip_header =False)

train_images = min_max(train_data[:,1:])
train_images = torch.tensor(train_images, dtype = torch.float32)
train_labels=torch.tensor(train_data[:,0].astype(int), dtype = torch.long) 


# the test set is read the same way (comma-separated values)
test_data = np.genfromtxt('https://uni-muenster.sciebo.de/s/fByBt5wd24chROg/download', delimiter=',', skip_header =False)

test_images = min_max(test_data[:,1:])
test_images = torch.tensor(test_images, dtype = torch.float32)
test_labels=torch.tensor(test_data[:,0].astype(int), dtype = torch.long) 

train_images.shape

The data set contains 60000 images with 784 pixels each.
You can now use these as input for the linear transformation.

In [None]:
z_1=layer_1(train_images)
print(z_1)
z_1.shape

`layer_1` returns the output (`z_1`). This has the shape `[60000,300]`. So there are still 60000 images, but this time each has only 300 features (the size of the hidden layer), just as it was defined when `layer_1` was created.


What you should notice is the `grad_fn=<AddmmBackward>` at the end of the `z_1` tensor (newer PyTorch versions print `AddmmBackward0`). It shows that *autograd* has recorded this transformation, so that the gradients can be computed later. You can also see that we have performed a matrix multiplication `mm` and an addition `Add`. Only the last transformation applied to the tensor is shown.
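How *autograd* turns such recorded operations into gradients can be seen in a tiny example that is independent of the network. The function $y = x^2 + 2x$ is chosen only for illustration; `requires_grad=True` tells autograd to track the tensor.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # tell autograd to track operations on x
y = x ** 2 + 2 * x                         # y = x^2 + 2x

print(y.grad_fn)   # the last recorded operation (an addition)
y.backward()       # computes dy/dx and stores it in x.grad
print(x.grad)      # dy/dx = 2x + 2 = 8 at x = 3
```

The parameters of `nn` layers have `requires_grad=True` by default, which is why `z_1` carries a `grad_fn`.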

What you are missing now is the activation function. PyTorch can help here as well. 
The `nn` library of PyTorch has a submodule `functional`. It contains many additional mathematical functions, among others the `relu` and `sigmoid` functions. Since the `relu` function usually gives better results in practice than the `sigmoid` function, we will use it now.

`functional` can be imported as follows: `from torch.nn import functional as F`. We rename `functional` to `F`, which is the usual convention when working with PyTorch.

`F.relu()` can now be used to apply the ReLU function.

In [None]:
from torch.nn import functional as F

a_1 = F.relu(z_1)
print(a_1)

If you compare `a_1` with `z_1`, you can see that all values that were negative before became zero and all values that were positive are unchanged.
Also you can see in the `grad_fn` that a ReLU was applied. This was also recorded by *autograd*. 

The first part of the forward pass is already done. 
For the second step, we can simply create another layer that gets `a_1` as input.

In [None]:
layer_2 = nn.Linear(300,10) # 10 is the number of out_features, since we have 10 digits.
z_2=layer_2(a_1)

Again you need an activation function, but this time the `softmax` function to get the probabilities.
`nn.functional` also has a `softmax()` function.

In [None]:
y_hat = F.softmax(z_2,dim=1) # dim=1 applies the softmax across the 10 class scores of each image (per row)
y_hat.shape
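What `dim=1` means can be checked on a small made-up tensor: each row holds the scores of one sample, and after the softmax each row sums to 1.

```python
import torch
from torch.nn import functional as F

z = torch.tensor([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])   # 2 samples, 3 illustrative class scores

p = F.softmax(z, dim=1)   # normalize across the classes of each sample
print(p)
print(p.sum(dim=1))       # each row sums to 1
```

With `dim=0` the softmax would instead be taken across the samples, which is not what is wanted here.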

In PyTorch you can also combine different layers. With `nn.Sequential` you can write the linear transformation and the activation function directly one after the other. The input is automatically passed through each of the layers.
This makes code clearer and easier to write.

In [None]:
net = nn.Sequential(nn.Linear(784,300), 
                         nn.ReLU(), 
                         nn.Linear(300,10))
net

As you can see, a network with a hidden layer has been created. What you should notice is that instead of `F.relu` `nn.ReLU` has been used. If a `relu` function is to be used inside `Sequential()`, you must always use `nn.ReLU`. 

The network `net` can now classify the images:
`net(input)` can be used to pass the input, e.g. our images, through the network.
From the tensor size of the output you can see that in the end there are actually 60000 images with 10 features each (the 10 digits) as output.

In [None]:
output = net(train_images)
output.shape

Another change is that you no longer apply the last activation function (softmax) yourself. The loss function introduced in the next section applies it internally, so the net returns the raw scores. In general, which activation function belongs to the last layer depends on the choice of the loss function.

### Loss Function


`nn` can also help with the loss function. The most common loss functions are already included in PyTorch.
You can simply create a new variable and assign the `nn.CrossEntropyLoss()` function to it.

In [None]:
loss_function = nn.CrossEntropyLoss()

The function `loss_function` can now calculate the loss; it applies the softmax function automatically.
To do this, you only have to pass the raw `output` of the net (not `y_hat`) and the `train_labels` to the function. Here you can see another advantage of PyTorch: you don't have to one-hot encode the labels.

In [None]:
loss = loss_function(output, train_labels)
loss
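That `nn.CrossEntropyLoss()` applies the softmax internally can be checked with a small sketch. The random scores and labels are made up for this example; the manual version uses `F.log_softmax` and then picks the log-probability of the correct class.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # raw network outputs for 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])  # class indices, not one-hot encoded

loss_builtin = nn.CrossEntropyLoss()(logits, labels)

# The same value by hand: log-softmax, then pick the log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), labels].mean()

print(loss_builtin.item(), loss_manual.item())
```

Both values agree, which is why the net itself must not end with a softmax when this loss function is used.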

### Backpropagation

The last step is to perform a backpropagation. Thanks to *autograd* this is easily possible with the command `loss.backward()`. It calculates the gradients for all weight matrices.

After that you only have to update the weight matrices. As you can imagine, PyTorch takes care of this task as well.
PyTorch even provides a variety of different algorithms that update the weights in different ways.

To update the weights, a new module of PyTorch is needed.
For this, you import `optim` with `from torch import optim`. `optim` contains functions that optimize the net for us, i.e. that update the weights.

Similar to the loss function, you can simply create a variable and assign an update function to it. 
You can now use the function `optim.SGD()` to update the weights. SGD = Stochastic Gradient Descent.  In the function itself you define which parameters (weights) are to be changed. You also define the learning rate here.

In [None]:
loss.backward() # collects the gradients

In [None]:
from torch import optim
update_weights=optim.SGD(net.parameters(), lr=0.01) 
# You define which parameters and with which learning rate these should be changed.

update_weights.step()  # step() updates the weights
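What `step()` does for plain SGD (without momentum or other options) can be reproduced by hand: every weight is moved against its gradient, scaled by the learning rate. The toy loss and the values below are assumptions made only for this sketch.

```python
import torch
from torch import optim

w = torch.tensor([1.0, -2.0], requires_grad=True)  # two illustrative weights
loss = (w ** 2).sum()
loss.backward()                                    # gradient is 2*w = [2, -4]

lr = 0.1
expected = w.detach() - lr * w.grad   # manual update rule: w <- w - lr * gradient

opt = optim.SGD([w], lr=lr)
opt.step()                            # the optimizer performs the same update in place

print(w)
print(expected)
```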


Now you have everything you need to train a net.

You can again use a `for-loop` to automate the training.

You will notice that we also use `update.zero_grad()`. This function clears the gradients from the previous update. If this is not done, `loss.backward()` would keep adding the new gradients to the old ones, so the gradients of all epochs would be summed up.
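The accumulation of gradients is easy to observe on a single made-up weight. In this sketch the gradient is reset directly with `w.grad.zero_()`; the optimizer's `zero_grad()` does the equivalent for all parameters it manages.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()    # gradient of 2*w is 2

(2 * w).backward()       # no reset in between: the new gradient is added
second = w.grad.item()   # 2 + 2 = 4

w.grad.zero_()           # reset the stored gradient
(2 * w).backward()
third = w.grad.item()    # back to 2

print(first, second, third)
```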

In [None]:
## Define network, loss function and update algorithm
net = nn.Sequential(nn.Linear(784,300), 
                    nn.ReLU(), 
                    nn.Linear(300,10))

loss_function = nn.CrossEntropyLoss()
update = optim.SGD(net.parameters(), lr=0.3)
EPOCHS = 50

## Training Loop
for i in range(EPOCHS):
    update.zero_grad()
    output = net(train_images) # forward propagation
    
    loss   = loss_function(output, train_labels)
    loss.backward()
    acc=((output.max(dim=1)[1]==train_labels).sum()/float(output.shape[0])).item()
    print(i, 
        "Training Loss: %.2f Training Accuracy: %.2f"
        % (loss.item(), acc)
    )
    
    update.step()

You can see that you can train a neural network with much less effort. You can also add a second or third hidden layer to your model without much effort.
Just add an `nn.ReLU` and an `nn.Linear` layer to `Sequential`, and *autograd* can calculate the gradients for these layers as well. Everything else remains the same. Just remember that the dimensions must fit from one layer to the next.

In [None]:
## Define network, loss function and update algorithm
net = nn.Sequential(nn.Linear(784,300), 
                    nn.ReLU(), 
                    nn.Linear(300,300),# <----- EXTRA LAYER
                    nn.ReLU(), 
                    nn.Linear(300,10))
print(net)
loss_function = nn.CrossEntropyLoss()
update = optim.SGD(net.parameters(), lr=0.3)
EPOCHS = 50

## Training Loop
for i in range(EPOCHS):
    update.zero_grad()
    output = net(train_images) # forward propagation
    
    loss   = loss_function(output, train_labels)
    loss.backward()
    
    acc=((output.max(dim=1)[1]==train_labels).sum()/float(output.shape[0])).item()
    print(i,
        "Training Loss: %.2f Training Accuracy: %.2f"
        % (loss.item(), acc)
    )
    
    update.step()

You may have noticed that we are using Stochastic Gradient Descent as an optimizer (to update the weights). So far, we have only talked about Gradient Descent.
In fact, as explained in the lecture, plain Gradient Descent is rarely used in practice; Stochastic Gradient Descent is used instead.

The difference:<br>
In Stochastic Gradient Descent, the data set is not sent through the network all at once, but the dataset is passed through the network in smaller parts (**minibatches**).<br>
In this dataset, there are 60000 images in total. With Gradient Descent, the forward pass is done simultaneously with 60000 images, and for the 60000 images the loss is calculated simultaneously. After that, the weights are updated **once**.
Then the step is repeated.

It would be more efficient if an update took place not only after all 60000 images, but already after 200 or even only 100 images, so that the network can learn much faster.
This is exactly what Stochastic Gradient Descent does. Not all images, but only e.g. 32 images are sent through the network at once. For these 32 images the loss is then calculated and the weights are updated.
Then this step is repeated, but this time with 32 new images. In this way, the weights can be updated much more frequently within an epoch.

The batch size specifies how large a minibatch (the small portion of data that is passed through the network) should be, and can also affect the performance of the model.
<center>
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT0TVrYkk0A0FfPvnzYTe747F0qPLG2rU2Bmg&usqp=CAU" style="width: 600px;">
</center>
<h8><center>Source: Insu Han and Jongheon Jeong, http://alinlab.kaist.ac.kr/resource/Lec2_SGD.pdf</center></h8>


Advantages:<br>
* faster
* needs less memory (on the graphics card)

Disadvantages: <br>
* the updates are noisier, so the exact optimum is usually not reached (but this noise can also help to prevent overfitting)

To take advantage of Stochastic Gradient Descent, we first need to split the data into minibatches. We can use PyTorch for this as well; the corresponding functions are available in the submodule `torch.utils.data`.

In [None]:
from torch.utils import data

In `torch.utils.data` there are two functions you need:

* `data.TensorDataset(input,labels)` creates a PyTorch dataset from the data.
* `data.DataLoader(Dataset, batch_size)` creates minibatches of the specified size from a PyTorch dataset.

In [None]:
train_data = data.TensorDataset(train_images, train_labels) 
# input are our tensors which contain the images and the labels
loader = data.DataLoader(train_data, batch_size = 32)

In [None]:
print(len(loader))

The variable `loader` now contains 1875 minibatches, each with 32 images and their 32 labels. In the next cell you can see the content of the first batch.

In [None]:
list(loader)[0]
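How the number of minibatches comes about can be reproduced with a toy dataset. The tensors below are made up; `shuffle=True` is an optional argument of `DataLoader` that is not used in this lesson and reshuffles the samples in every epoch.

```python
import math
import torch
from torch.utils import data

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)  # 10 toy samples
targets = torch.arange(10)

toy_data = data.TensorDataset(features, targets)
toy_loader = data.DataLoader(toy_data, batch_size=4, shuffle=True)  # shuffle is optional

batch_sizes = [len(y) for x, y in toy_loader]
print(len(toy_loader), batch_sizes)   # 3 batches: 4, 4 and 2 samples
print(math.ceil(60000 / 32))          # 1875, the number of minibatches in the lesson
```

The last minibatch is simply smaller when the dataset size is not a multiple of the batch size.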

To bring everything together, you need a second `for-loop` that selects the minibatches one by one within the first `for-loop` and passes them through the network. 

In [None]:
## Define network, loss function and update algorithm
net = nn.Sequential(nn.Linear(784,300), 
                    nn.ReLU(), 
                    nn.Linear(300,300),
                    nn.ReLU(), 
                    nn.Linear(300,10))
loss_function = nn.CrossEntropyLoss()
update = optim.SGD(net.parameters(), lr=0.3)
EPOCHS = 2

## Training Loop
for i in range(EPOCHS):
    loss_list = [] # this list stores the loss of each minibatch
                   # with this we can calculate the average loss within the epoch
    for minibatch in loader: # loop through all minibatches
        images, labels = minibatch # minibatch is divided into images and labels
        
        update.zero_grad()
        output = net(images) # forward propagation
    
        loss   = loss_function(output, labels)
        loss.backward()
        loss_list.append(loss.item())
        update.step()
        
    output = net(train_images)
    acc=((output.max(dim=1)[1]==train_labels).sum()/float(output.shape[0])).item()
    print(
        "Training Loss: %.2f Training Accuracy: %.2f"
        % (np.mean(loss_list), acc)
    )
    

After only two epochs, the accuracy is much higher than ever before. A single epoch takes much longer than with "normal" Gradient Descent, but because far fewer epochs are needed, the total training time is reduced.

To calculate the accuracy, the entire dataset is sent through the network again after all minibatches have passed through it (without changing the weights). The accuracy is then calculated based on these predictions. The loss of the epoch is the average loss of the minibatches.

Tip: If you want to copy/output the value of a tensor and not the tensor itself, you can use `x.item()`.
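The accuracy line used in the training loops is quite compact. The following sketch breaks it into its steps on made-up scores for 4 samples and 3 classes.

```python
import torch

output = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.2],
                       [1.0, 0.9, 0.8]])   # made-up scores: 4 samples, 3 classes
labels = torch.tensor([0, 2, 0, 0])

values, predicted = output.max(dim=1)   # max returns (largest value, its index) per row
correct = (predicted == labels).sum()   # number of correct predictions, still a tensor
acc = (correct / float(output.shape[0])).item()  # .item() turns the tensor into a Python number

print(predicted)   # tensor([0, 2, 1, 0])
print(acc)         # 0.75
```

The index of the largest score is the predicted class, so `output.max(dim=1)[1]` in the loops is exactly `predicted` here.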

# Advanced Layers


In the following we will deal with new layers that are used in addition to linear layers.

## Dropout

Dropout is used during training to randomly *remove* individual neurons from the network. In other words, they are turned off for a short time. Mathematically, this means that their output is simply set to zero.
Each time a batch passes through the network, the output of randomly selected neurons is set to zero. 

This random temporary *erasure* of neurons forces the network not to rely on individual neurons. Similar to Random Forest, where variables are randomly removed, Dropout helps to prevent *overfitting*.
However, during the validation of the network (validation or test set), no more neurons are removed and the dropout layer is skipped.

How many neurons are removed from the network is another hyperparameter that, like the learning rate, can be chosen by you. The dropout is given as a fraction between 0 and 1. So a dropout of `0.8` means that in this layer on average 80% of the neurons are removed, i.e. their output is set to zero. A commonly used value for the dropout is `0.2`. Often a dropout layer is used directly after the input.

In PyTorch, a dropout layer is defined as `nn.Dropout(0.2)`.


In [None]:
torch.manual_seed(1235)

example_x = torch.tensor([[1.,2.,3.,4.,5.]] )
do = nn.Dropout(0.5)
do(example_x)

Three of the values were set to `0`, but the other values doubled. **Why did this happen?**

This is because we train with dropout but evaluate without dropout. This means that with a dropout of `0.5` only half of the neurons are used. In the forward pass, the values are passed as a weighted sum. If half of the neurons have the value `0`, the sum is obviously much smaller than in a network where no dropout is used. In the course of training, the scale of the weights is adjusted to the expected scale of the inputs.

Without Dropout: $$z = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 2 + \beta_3 \cdot 3 + \beta_4 \cdot 4 + \beta_5 \cdot 5$$

With Dropout: $$\begin{align}z&= \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 2 + \beta_3 \cdot 0 + \beta_4 \cdot 4 + \beta_5 \cdot 0 \\
&=\beta_0 + \beta_2 \cdot 2 + \beta_4 \cdot 4\end{align}$$


The sum $z$ with dropout contains fewer terms and is therefore generally smaller in magnitude than the sum without dropout. By scaling up the remaining inputs (with a dropout of `0.5` by the factor $1/(1-0.5) = 2$), we ensure that the weights do not adjust to the wrong scale of inputs.

This is because in the evaluation, where no dropout is used, we suddenly have twice as many inputs. This discrepancy between training and evaluation is thus prevented.

You can change a layer or a complete network from training to evaluation mode by using `network.eval()`. To put it in training mode, `.train()` is used. **Important**: By default every layer and every network is in `train()` mode.



In [None]:
do.eval()
do(example_x)

In `.eval()` mode the dropout is not applied.
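The scaling and the difference between the two modes can be checked on a larger made-up input. With a dropout of `p`, the surviving values are multiplied by $1/(1-p)$, so the average of the output stays roughly the same as the average of the input.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p)
x = torch.ones(1000)          # illustrative input: 1000 ones

drop.train()                  # training mode (the default)
y_train = drop(x)
kept = y_train[y_train != 0]  # the values that were not set to zero

print(kept.unique())          # surviving values are scaled by 1 / (1 - p) = 2
print(y_train.mean().item())  # stays close to the original mean of 1

drop.eval()                   # evaluation mode
y_eval = drop(x)
print(torch.equal(y_eval, x)) # the input passes through unchanged
```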

## Batchnorm


Batchnorm layers are another commonly used layer in neural networks. As the name suggests, they normalize the batches, or more accurately, the activations of the minibatches.
Recall that we scale our inputs. Batchnorm does the same thing, but in the net itself, so the values deeper in the net are also of roughly the same order of magnitude. Basically, the variables are normalized as follows:

$$x_s = \frac{x-\bar{x}}{sd_x}$$

Where $\bar{x}$ is the mean and $sd_x$ is the standard deviation of $x$.<br><br>
In a classical neural network, the activations of a layer are computed in two steps. First, the linear transformation is performed:

$$Z = XW^T+b$$
*Here $X$ is a minibatch, so for example only 32 images*.

This is followed by a non-linear activation function:

$$A = \sigma(Z)$$

With Batchnorm the values are normalized again before the activation function.

$$Z_s = (\frac{Z-\bar{Z}}{sd_Z}) \cdot \gamma + \beta $$

This is done independently for each neuron. The mean $\bar{Z}$ and the standard deviation $sd_Z$ are again calculated only per minibatch. What is new are $\gamma$ and $\beta$. These are just two individual parameters per neuron that can scale and shift the normalized values (which have mean 0 and standard deviation 1). These two parameters are also learnable, i.e. they are also changed during training.

It is also important to note that during training, running averages of the minibatch means and variances are kept. These running estimates are then used in evaluation mode, e.g. for the test set, instead of the statistics of the current minibatch.

*Whether you apply the activation function first and then the batchnorm or vice versa is a question that no one can answer for you. There are pros and cons to both methods.*
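The normalization formula can be compared with `nn.BatchNorm1d` on a random minibatch. The sizes and values are made up for this sketch. It relies on two details of the PyTorch layer: a freshly created layer starts with $\gamma = 1$ and $\beta = 0$, and a small constant `eps` is added to the variance for numerical stability.

```python
import torch
from torch import nn

torch.manual_seed(0)
z = torch.randn(32, 4) * 3 + 5   # minibatch of 32 samples with 4 features
bn = nn.BatchNorm1d(4)           # new layer: gamma = 1, beta = 0, training mode

out = bn(z)                      # normalized per feature using the minibatch statistics

# The same normalization by hand
mean = z.mean(dim=0)
var = z.var(dim=0, unbiased=False)            # variance over the minibatch
manual = (z - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-4))
print(out.mean(dim=0))           # approximately 0 for every feature
print(bn.running_mean)           # running estimate that is used in eval() mode
```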

Complete the `layer_one`. The size of the hidden layers and the dropout is up to you.

In [None]:
batch_x, batch_y = next(iter(loader)) # here the first minibatch is chosen

layer_one = nn.Sequential(nn.Linear(____,___),
                         nn.BatchNorm1d(_____),
                         nn.ReLU(),
                         nn.Dropout(____))
layer_one(batch_x)

<details>
    <summary><b>Solution:</b></summary>

```python
layer_one = nn.Sequential(nn.Linear(784,300),
                         nn.BatchNorm1d(300),
                         nn.ReLU(),
                         nn.Dropout(0.2))
```
</details>

Now extend the complete net from before.
Important: Neither batchnorm nor dropout is used after the last linear layer.

In [None]:
net= nn.Sequential(nn.Linear(784,300), 
                   ______________,
                   ______________,
                   ______________,
                   nn.Linear(300,300),
                   ______________,
                   ______________,
                   ______________,
                   nn.Linear(300,10))

<details>
    <summary><b>Solution:</b></summary>

```python
net= nn.Sequential(nn.Linear(784,300), 
                   nn.BatchNorm1d(300),
                   nn.ReLU(),
                   nn.Dropout(0.2),
                   nn.Linear(300,300),
                   nn.BatchNorm1d(300),
                   nn.ReLU(),
                   nn.Dropout(0.2),
                   nn.Linear(300,10))
```
</details>

# Optimizers


Optimizers determine exactly how the weights of the net are updated. So far you have always used `SGD`. The PyTorch implementation of `SGD` has additional parameters that can make the training more effective. One of them is *momentum*. When you use momentum for training, the weights are not only updated according to the gradients of the current minibatch. The gradients of the previous minibatches also have an influence. The parameter `momentum` indicates how strong or weak the influence of the previous minibatches is.

Momentum is meant to make the optimization follow a steadier, straighter path. As an example you can imagine a ball rolling down a hill. The longer it rolls in the same direction, the faster it becomes. And the faster it gets, the less influence small changes in direction triggered by changes in the terrain (gradients) have.

So if the gradients have been pointing in the same direction for a while, a single minibatch should not change that direction all at once. 
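The idea can be written as an update rule with a "velocity" $v$ that collects past gradients: $v \leftarrow \mu v + g$ and $w \leftarrow w - \eta v$, where $g$ is the current gradient, $\mu$ the momentum and $\eta$ the learning rate. The sketch below compares this rule with `optim.SGD` using its default settings apart from `momentum`; the toy loss $w^2$ and all values are assumptions made for illustration.

```python
import torch
from torch import optim

lr, mu = 0.1, 0.9

# Reference: PyTorch's SGD with momentum on the toy loss w^2
w = torch.tensor(1.0, requires_grad=True)
opt = optim.SGD([w], lr=lr, momentum=mu)
for _ in range(2):
    opt.zero_grad()
    loss = w ** 2
    loss.backward()
    opt.step()

# The same two steps by hand
w_manual, v = 1.0, 0.0
for _ in range(2):
    grad = 2 * w_manual          # derivative of w^2
    v = mu * v + grad            # velocity: decayed history plus current gradient
    w_manual = w_manual - lr * v

print(w.item(), w_manual)
```

In the second step the velocity still contains 90% of the first gradient, so the weight moves further than it would with plain SGD.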

One of the most commonly used algorithms for training neural nets is the "Adam" algorithm. It combines several improvements over plain SGD, including a version of momentum.

**ADAM does not always have to be better than SGD**.

An overview of all available optimization algorithms can be found [here](https://pytorch.org/docs/stable/optim.html#algorithms). 




In [None]:
loss_function = nn.CrossEntropyLoss()
update = optim.Adam(net.parameters(), lr=0.1)
EPOCHS = 10

## Training Loop
for i in range(EPOCHS):
    loss_list = [] # in this list we save the loss of each minibatch so we can
                   # calculate the average loss at the end of the epoch
    net.train()
    for minibatch in loader: # loop through all minibatches
        images, labels = minibatch # divide minibatches in labels and images
        
        update.zero_grad()
        output = net(images) # forward propagation
    
        loss   = loss_function(output, labels)
        loss.backward()
        loss_list.append(loss.item())
        update.step()
    net.eval()    
    output = net(train_images)
    acc=((output.max(dim=1)[1]==train_labels).sum()/float(output.shape[0])).item()
    
    print(i,
        "Training Loss: %.2f Training Accuracy: %.2f"
        % (np.mean(loss_list), acc)
    )

# Practice Exercise

For the exercise, you will train a network, but this time using the toxicity data from notebook 5.
First, load all the required libraries and data again. This time, also use Batchnorm, Dropout and the ADAM algorithm.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import sys
if 'google.colab' in sys.modules: # checks whether the notebook runs on Colab
    !wget https://raw.githubusercontent.com/kochgroup/intro_pharma_ai/main/utils/utils.py
    !pip install rdkit==2022.3.4
    %run utils.py
else:
    %run ../utils/utils.py # loads pre-written functions

In [None]:
data_tox = pd.read_csv("https://raw.githubusercontent.com/filipsPL/tox21_dataset/master/compounds/sr-mmp.tab", sep = "\t")
data_tox = data_tox.iloc[:,1:] # all columns except the first (index 0) are chosen
data_tox.columns = ["smiles", "activity"]
data_tox.head()

Next, you calculate the fingerprints. As in notebook 5, the function `get_fingerprints` is available for this purpose.

In [None]:
fps = get_fingerprints(data_tox)
fps["activity"] = data_tox.activity
fps.head()

Before you can use them in PyTorch, you need to convert both the fingerprints and the `activity` to `tensors`. Note that both are in the DataFrame `fps`.

`.values` converts a DataFrame into an `np.array`.

Then the data is split into a training set and a test set.

In [None]:
fps = torch.tensor(___.values, dtype=torch.float32) #

In [None]:
train, test=train_test_split(fps,test_size= 0.2 , train_size= 0.8, random_state=1234)


train_x = train[:,:-1]
train_y = train[:,-1]
test_x = test[:,:-1]
test_y = test[:,-1]

Now we want to use minibatches again. For this we still have to convert our training data into a `DataLoader`. Why only the training data? The use of minibatches is only relevant for training. As long as your computer is able to run the test dataset through the network all at once, we don't need to split the test dataset into minibatches.

In [None]:
train_data=data.TensorDataset(______, _____) # input are our tensors, for the fingerprints and the activities
loader=data.DataLoader(train_data, batch_size = 32)
len(loader)

Adjust the net so that the input and output have the right size: the input size is the length of the fingerprints, and the output size follows from what we predict (with `nn.BCEWithLogitsLoss`, a single value per molecule).

In [None]:
net= nn.Sequential(nn.Linear(), 
                   nn.BatchNorm1d(),
                   nn.ReLU(), 
                   nn.Dropout(),
                   nn.Linear(),
                   nn.BatchNorm1d(),
                   nn.ReLU(), 
                   nn.Dropout(),
                   nn.Linear())

loss_function = nn.BCEWithLogitsLoss()
update = ___________________, lr=0.1)    
EPOCHS = 10

Finally, fill in the `for` loop.

`.squeeze` converts the `(n,1)` `output` tensor to a 1-dimensional `tensor` of length `n`.
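Both `.squeeze()` and the threshold `output>0` in the accuracy calculation can be checked on a few made-up outputs. `nn.BCEWithLogitsLoss` applies the sigmoid function internally, and a raw output greater than 0 corresponds to a probability greater than 0.5.

```python
import torch

output = torch.tensor([[-1.2], [0.3], [2.0]])  # made-up raw outputs (logits), shape (3, 1)
flat = output.squeeze()                        # shape (3,)

probs = torch.sigmoid(flat)            # probabilities that belong to the raw outputs
pred_from_logits = (flat > 0).int()    # threshold used in the accuracy calculation
pred_from_probs = (probs > 0.5).int()  # the same decision expressed with probabilities

print(output.shape, flat.shape)
print(torch.equal(pred_from_logits, pred_from_probs))
```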

In [None]:
for i in range(EPOCHS):
    loss_list = [] # in this list we save the loss of each minibatch 
    net.train() # training mode: dropout and batchnorm behave as during training
    for minibatch in loader: # loop through all minibatches
        update.__________
        molecules, activity = minibatch # divide minibatches in labels and molecules
        output = net(____________) # forward propagation
        loss   = loss_function(output.squeeze(), ____________)
        loss._______
        loss_list.append(loss.item())
        update.________
    # here the accuracy for the testset is calculated
    net.eval() # evaluation mode: dropout is switched off, batchnorm uses its running statistics
    output = net(test_x)
    acc = torch.sum((output>0).squeeze().int() == test_y)/float(test_y.shape[0])
   
    print(
        "Training Loss: %.2f Test Accuracy: %.2f"
        % (np.mean(loss_list), acc.item())
    )