<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/ECCB2022/blob/main/notebooks/05b_CNN_for_sequences_fastai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this notebook on a GPU in Colab, select Runtime -> Change runtime type -> GPU.

In [1]:
!pip install datasets torchmetrics --quiet


In [16]:
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader

nucleotide_to_number = {
    'A':0,
    'C':1,
    'T':2,
    'G':3,
    'N':4,
}

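def numericalize(x, vocab):
  # Map each character of the sequence to its integer index in the vocabulary.
  # Characters missing from the vocabulary (e.g. lowercase letters) raise a KeyError.
  return [vocab[s] for s in x]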

We will create a function that takes a batch of raw samples and prepares them for the neural network.

For each sample, it will
- numericalize the sequence (map each nucleotide to an integer)
- one-hot encode the integers
- convert the data to tensors (required by PyTorch models)
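
The three steps above can be traced on a single sequence. This is a minimal sketch: the sequence `"ACGT"` is a made-up toy input, not a sample from the dataset, and the vocabulary is repeated here only so the snippet runs on its own.

```python
import torch
from torch.nn import functional as F

# Same vocabulary as defined earlier in the notebook.
nucleotide_to_number = {'A': 0, 'C': 1, 'T': 2, 'G': 3, 'N': 4}

seq = "ACGT"  # hypothetical toy sequence, not taken from the dataset

# 1) numericalize: characters -> integer indices
idx = [nucleotide_to_number[s] for s in seq]

# 2) one-hot encode: shape (length,) -> (length, 5)
one_hot = F.one_hot(torch.tensor(idx), num_classes=5).float()

# 3) move channels first, as Conv1d expects: (length, 5) -> (5, length)
x = one_hot.transpose(0, 1)

print(idx)      # [0, 1, 3, 2]
print(x.shape)  # torch.Size([5, 4])
```

Each column of `x` contains a single 1, in the row that corresponds to the nucleotide at that position.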

In [None]:
def preprocess(batch):
  xs, ys = [], []
  for example in batch:
    x = example['seq']
    y = example['labels']

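    x = numericalize(x, vocab=nucleotide_to_number)
    # One-hot encode: (length,) -> (length, 5)
    x = F.one_hot(torch.tensor(x), num_classes=5).float()
    # Conv1d expects channels first: (length, 5) -> (5, length)
    x = x.transpose(0, 1)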

    xs.append(x)
    ys.append([y])
  
  return torch.stack(xs), torch.tensor(ys).float()

We know how to improve the model on a single example.

We could go through the dataset and improve a little on each sequence in turn (on the 1st, then the 2nd, then the 3rd, ...).

In practice, we usually improve on a whole 'batch' of sequences at once (on samples 1-100, then 101-200, ...).

This makes the learning process much faster, since a whole batch is processed in parallel, and it tends to improve generalization.
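
To make the idea of batching concrete, here is a plain-Python sketch that splits a list of samples into consecutive batches. The helper `iter_batches` is a hypothetical name introduced for illustration; in the notebook, `DataLoader` does this job and also handles shuffling and collation.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # stand-ins for 10 sequences
batches = list(iter_batches(samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note that the last batch can be smaller than `batch_size` when the dataset size is not a multiple of it.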

In [25]:
from torch.utils.data import DataLoader
from datasets import load_dataset

train_dset = load_dataset("simecek/human_nontata_promoters", split="train")
test_dset = load_dataset("simecek/human_nontata_promoters", split="test")

train_loader = DataLoader(train_dset, batch_size=32, shuffle=True, collate_fn=preprocess)  
test_loader = DataLoader(test_dset, batch_size=32, collate_fn=preprocess)  



We will wrap our architecture in a module so it's ready for training.

In [32]:
import torch
from torch import nn
from torch.nn import functional as F

class FullyConv(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
      super().__init__()
      self.net = nn.Sequential(
          nn.Conv1d(in_channels=input_dim, out_channels=hidden_dim, kernel_size=5, stride=1),
          nn.ReLU(),
          nn.Conv1d(in_channels=hidden_dim, out_channels=hidden_dim, kernel_size=5, stride=1),
          nn.ReLU(),
          nn.Flatten(),
          nn.LazyLinear(out_features=output_dim),  # lazy layer: in_features is inferred on the first forward pass
          nn.Sigmoid(),
      )

    def forward(self, x):
      return self.net(x)

net = FullyConv(5,30,1)
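
To see what `nn.LazyLinear` has to infer, the size of the flattened features can be worked out by hand. The sketch below uses the output-length formula from the PyTorch `Conv1d` documentation; the input length of 251 is an assumption made for illustration, so substitute the actual length of your sequences. The helper name `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

seq_len = 251    # assumed input length, for illustration only
hidden_dim = 30  # number of channels produced by both convolutions

after_conv1 = conv1d_out_len(seq_len, kernel_size=5)      # 251 -> 247
after_conv2 = conv1d_out_len(after_conv1, kernel_size=5)  # 247 -> 243
flat_features = hidden_dim * after_conv2                  # 30 * 243 = 7290

print(after_conv1, after_conv2, flat_features)
```

With no padding, each convolution with `kernel_size=5` shortens the sequence by 4 positions, so the flattened size depends on the input length. The lazy layer spares us this calculation.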



The most basic training pass through the whole dataset looks like this:

```
for each batch in data_loader:
  model.learn_from(batch)
```

In practice, we iterate through the same dataset multiple times, since each pass improves the model only in small steps.

One pass through the whole dataset is called an epoch.


```
for i in range(number_of_epochs):
  for each batch in data_loader:
    model.learn_from(batch)
```

This loop is the core of what a Trainer/Learner does for us.
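
For comparison, the pseudocode above can be written out as a plain PyTorch loop. This is a simplified sketch, not what fastai runs internally: it uses a constant learning rate (no one-cycle schedule) and no callbacks. The function `train`, the toy model, and the random toy data are all made up for illustration.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, epochs, lr):
    """Run a basic SGD training loop and return the mean loss of each epoch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    epoch_losses = []
    for epoch in range(epochs):              # one epoch = one pass over the data
        model.train()
        total, n = 0.0, 0
        for x, y in loader:                  # one batch at a time
            optimizer.zero_grad()            # clear gradients from the previous step
            loss = F.binary_cross_entropy(model(x), y)
            loss.backward()                  # compute gradients
            optimizer.step()                 # take a small step that reduces the loss
            total += loss.item() * len(x)
            n += len(x)
        epoch_losses.append(total / n)
    return epoch_losses

# Random toy data standing in for one-hot encoded sequences: (samples, channels, length)
torch.manual_seed(0)
toy_x = torch.rand(64, 5, 20)
toy_y = torch.randint(0, 2, (64, 1)).float()
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=16, shuffle=True)

# Small toy model; the convolution maps length 20 -> 16, so Flatten yields 8 * 16 features
toy_model = nn.Sequential(
    nn.Conv1d(5, 8, kernel_size=5), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 16, 1), nn.Sigmoid(),
)

losses = train(toy_model, toy_loader, epochs=3, lr=1e-2)
print(losses)
```

Because the toy labels are random, the loss values carry no meaning here. The point is only the structure of the two nested loops.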


In [29]:
from torchmetrics import Accuracy

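# Note: newer torchmetrics releases require a task argument, e.g. Accuracy(task="binary")
acc = Accuracy()#.to('cuda')
def test_accuracy(x,y):
  # Targets arrive as floats from preprocess; the metric expects integer labels
  return acc(x, y.int())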

In [33]:
from fastai.text.all import *

data = DataLoaders(train_loader, test_loader)
learn = Learner(data, net, loss_func=F.binary_cross_entropy, opt_func=SGD, metrics=[test_accuracy])

In [34]:
learn.fit_one_cycle(3, 1e-2)

epoch,train_loss,valid_loss,test_accuracy,time
0,0.499127,0.504473,0.760793,00:20
1,0.458048,0.461305,0.780607,00:18
2,0.436222,0.454155,0.784038,00:18
