<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Train_CNN_Classifier_With_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How To Train a CNN Classifier With PyTorch

This notebook demonstrates how to use `genomic_benchmarks` to train a neural network classifier on one of its benchmark datasets, [human_enhancers_cohn](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs/human_enhancers_cohn).

In [1]:
# If you work in Google Colaboratory, uncomment the following line to install the package on your virtual machine
#!pip install -qq tensorflow_addons genomic-benchmarks


In [2]:
import torch
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn
from genomic_benchmarks.models.torch import CNN
from genomic_benchmarks.dataset_getters.utils import coll_factory, LetterTokenizer, build_vocab
from genomic_benchmarks.data_check import info



# Choose the dataset

Create a PyTorch dataset object.

In [3]:
train_dset = HumanEnhancersCohn('train', version=0)

Downloading...
From: https://drive.google.com/uc?id=176563cDPQ5Y094WyoSBF02QjoVQhWuCh
To: /root/.genomic_benchmarks/human_enhancers_cohn.zip
100%|██████████| 11.9M/11.9M [00:00<00:00, 83.6MB/s]


# Print out information about the dataset

In [4]:
info("human_enhancers_cohn", 0)

Dataset `human_enhancers_cohn` has 2 classes: negative, positive.

All lengths of genomic intervals equals 500.

Totally 27791 sequences have been found, 20843 for training and 6948 for testing.


| class | train | test |
|---|---|---|
| negative | 10422 | 3474 |
| positive | 10421 | 3474 |


# Tokenizer and vocab

Create a tokenizer for the dataset so that we can numericalize the sequences and feed them to the neural network.
The dataset info shows that all sequences have the same length, so no padding is needed.


In [5]:
tokenizer = get_tokenizer(LetterTokenizer())
vocabulary = build_vocab(train_dset, tokenizer, use_padding=False)

print("vocab len:", len(vocabulary))
print(vocabulary.get_stoi())

vocab len: 7
{'A': 5, 'C': 4, '<eos>': 6, 'G': 3, 'T': 2, '<bos>': 1, '<unk>': 0}
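
To make the numericalization step concrete, here is a minimal pure-Python sketch of what a letter-level tokenizer plus a vocabulary lookup amounts to. The `STOI` table is copied from the output above. The helpers `letter_tokenize` and `numericalize` are hypothetical names used only for illustration, and wrapping each sequence in `<bos>`/`<eos>` is an assumption suggested by the vocabulary, not a statement about the library's internals.

```python
# Mapping copied from the vocabulary printed above.
STOI = {'A': 5, 'C': 4, '<eos>': 6, 'G': 3, 'T': 2, '<bos>': 1, '<unk>': 0}


def letter_tokenize(sequence):
    """Split a DNA string into one-letter tokens.

    Assumption: the sequence is wrapped in <bos>/<eos> markers.
    """
    return ['<bos>'] + list(sequence) + ['<eos>']


def numericalize(tokens, stoi=STOI):
    """Map tokens to integer ids; letters missing from the table map to <unk>."""
    return [stoi.get(token, stoi['<unk>']) for token in tokens]


ids = numericalize(letter_tokenize("ACGTN"))
print(ids)
```

In this sketch, the letter `N` is not in the table, so it falls back to the `<unk>` index.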


# Dataloader and batch preparation

We will create a PyTorch data loader, which preprocesses and batches the data for the neural network.

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

collate = coll_factory(vocabulary, tokenizer, device, pad_to_length=None)
train_loader = DataLoader(train_dset, batch_size=32, shuffle=True, collate_fn=collate)

Using cpu device
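
The collate function is the piece that turns a list of individual samples into one batch. The following is a rough pure-Python sketch of that idea, assuming each sample is a `(sequence, label)` pair and that all sequences share one length, so no padding is required. `simple_collate` and `TOY_STOI` are hypothetical and only illustrate the concept; the function returned by `coll_factory` presumably also converts the batch to tensors on `device`.

```python
# Hypothetical lookup table for illustration (indices follow the printed vocabulary).
TOY_STOI = {'<unk>': 0, 'T': 2, 'G': 3, 'C': 4, 'A': 5}


def simple_collate(batch, stoi=TOY_STOI):
    """Turn [(sequence, label), ...] into (list of id lists, list of float labels)."""
    xs, ys = [], []
    for sequence, label in batch:
        xs.append([stoi.get(letter, stoi['<unk>']) for letter in sequence])
        ys.append(float(label))
    # Without padding, every sequence in the batch must have the same length.
    if len({len(x) for x in xs}) > 1:
        raise ValueError("sequences differ in length; padding would be needed")
    return xs, ys


xs, ys = simple_collate([("ACGT", 1), ("TTGA", 0)])
print(xs, ys)
```

A function of this shape is what `DataLoader` expects for its `collate_fn` argument: it receives the list of samples for one batch and returns the batched inputs and targets.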


# Model
We will initialize our model.
From the dataset info, we know that all inputs are 500 characters long, and the number of classes is 2.

In [8]:
model = CNN(
    number_of_classes=2,
    vocab_size=len(vocabulary),
    embedding_dim=100,
    input_len=500,
    device=device
).to(device)
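
A likely reason a CNN of this kind needs `input_len` up front is that the size of the flattened feature map feeding the final dense layers depends on the sequence length. The sketch below applies the standard output-length formula for 1D convolution and pooling layers (as given in the `torch.nn.Conv1d` documentation). The kernel sizes and the two-layer stack are hypothetical examples, not the architecture of the library's `CNN`.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution or pooling layer."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


length = 500  # all sequences in this dataset are 500 letters long

# Hypothetical layer stack, for illustration only.
after_conv = conv1d_out_len(length, kernel_size=8)
after_pool = conv1d_out_len(after_conv, kernel_size=2, stride=2)

print(after_conv, after_pool)
```

Changing the input length changes both numbers, which is why a model with a fixed dense head cannot accept sequences of arbitrary length without padding or adaptive pooling.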

## Training

In [10]:
model.fit(train_loader, epochs=5)

Epoch 0
Train metrics: 
 Accuracy: 64.6%, Avg loss: 0.653754 

Epoch 1
Train metrics: 
 Accuracy: 67.1%, Avg loss: 0.645691 

Epoch 2
Train metrics: 
 Accuracy: 67.6%, Avg loss: 0.641918 

Epoch 3
Train metrics: 
 Accuracy: 68.8%, Avg loss: 0.638241 

Epoch 4
Train metrics: 
 Accuracy: 69.1%, Avg loss: 0.637030 



## Testing

In [11]:
test_dset = HumanEnhancersCohn('test', version=0)
test_loader = DataLoader(test_dset, batch_size=32, shuffle=False, collate_fn=collate)

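predictions = []
with torch.no_grad():  # gradients are not needed for inference
    for x, _ in test_loader:
        # Round the model output to a hard 0/1 class label
        output = torch.round(model(x))
        # Each row holds a single value; collect it as the prediction
        predictions.extend(row[0] for row in output.tolist())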

In [12]:
from sklearn.metrics import f1_score
from genomic_benchmarks.data_check.info import labels_in_order

labels = labels_in_order(dset_name='human_enhancers_cohn')
f1_score(labels, predictions)

0.34848759282738634
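
For reference, the F1 score is the harmonic mean of precision and recall for the positive class. The following minimal pure-Python sketch shows the binary case on toy labels. `binary_f1` is a hypothetical helper written for illustration; `sklearn.metrics.f1_score` remains the function used above.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the benchmark.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
score = binary_f1(toy_true, toy_pred)
print(round(score, 3))
```

Because the test set here is balanced (3474 sequences per class), plain accuracy is also a reasonable metric to report alongside F1.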