<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Train_CNN_Classifier_With_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How To Train a CNN Classifier With PyTorch

This notebook demonstrates how to use `genomic_benchmarks` to train a neural network classifier on one of its benchmark datasets, [human_enhancers_cohn](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs/human_enhancers_cohn).

In [1]:
# If you are working in Google Colaboratory, uncomment the following line to install the package on your virtual machine
#!pip install -qq tensorflow_addons genomic-benchmarks

In [2]:
import torch
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn
from genomic_benchmarks.models.torch import CNN
from genomic_benchmarks.dataset_getters.utils import coll_factory, LetterTokenizer, build_vocab
from genomic_benchmarks.data_check import info


# Choose the dataset

Create a PyTorch dataset object for the training split.

In [3]:
train_dset = HumanEnhancersCohn('train', version=0)

Downloading 176563cDPQ5Y094WyoSBF02QjoVQhWuCh into /root/.genomic_benchmarks/human_enhancers_cohn.zip... Done.
Unzipping...Done.


# Print out information about the dataset

In [4]:
info("human_enhancers_cohn", 0)

Dataset `human_enhancers_cohn` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 500.

Totally 27791 sequences have been found, 20843 for training and 6948 for testing.


| class | train | test |
|---|---|---|
| negative | 10422 | 3474 |
| positive | 10421 | 3474 |
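
The reported counts can be checked with a few lines of arithmetic. The sketch below hard-codes the numbers from the dataset info above (it does not query the package) and confirms that the two classes are balanced and that roughly three quarters of the sequences are in the training split.

```python
# Counts copied by hand from the dataset info printed above
counts = {
    "negative": {"train": 10422, "test": 3474},
    "positive": {"train": 10421, "test": 3474},
}

train_total = sum(c["train"] for c in counts.values())
test_total = sum(c["test"] for c in counts.values())
total = train_total + test_total

# Fraction of all sequences that ends up in the training split
train_fraction = train_total / total
# Fraction of training sequences that belong to the positive class
positive_fraction = counts["positive"]["train"] / train_total

print(train_total, test_total, total)
print(f"train fraction: {train_fraction:.3f}")
print(f"positive fraction in train: {positive_fraction:.3f}")
```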


# Tokenizer and vocab

Create a tokenizer for the dataset so that we can numericalize the sequences and feed them to the neural network.
The dataset info shows that all sequences have the same length, so no padding is needed.


In [5]:
tokenizer = get_tokenizer(LetterTokenizer())
vocabulary = build_vocab(train_dset, tokenizer, use_padding=False)

print("vocab len:", len(vocabulary))
print(vocabulary.get_stoi())

vocab len: 7
{'<eos>': 6, '<bos>': 1, '<unk>': 0, 'C': 5, 'A': 2, 'T': 3, 'G': 4}
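
To make the mapping concrete, here is a minimal plain-Python sketch of letter-level numericalization. It reuses the token-to-index mapping printed above. The helper `numericalize` is hypothetical and not part of `genomic_benchmarks`, and wrapping each sequence in `<bos>`/`<eos>` is an assumption suggested by the presence of those tokens in the vocabulary.

```python
# Token-to-index mapping copied from the vocabulary printed above
toy_stoi = {'<unk>': 0, '<bos>': 1, 'A': 2, 'T': 3, 'G': 4, 'C': 5, '<eos>': 6}

def numericalize(sequence, stoi):
    """Hypothetical helper: split a sequence into letters and map them to indices."""
    # Assumption: the sequence is wrapped in begin/end-of-sequence tokens
    tokens = ['<bos>'] + list(sequence) + ['<eos>']
    # Letters missing from the vocabulary (e.g. 'N') fall back to '<unk>'
    return [stoi.get(token, stoi['<unk>']) for token in tokens]

print(numericalize("ACGT", toy_stoi))  # [1, 2, 5, 4, 3, 6]
print(numericalize("ANT", toy_stoi))   # [1, 2, 0, 3, 6]
```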


# Dataloader and batch preparation

We will create a PyTorch data loader that preprocesses and batches the data for the neural network.

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

collate = coll_factory(vocabulary, tokenizer, device, pad_to_length=None)
train_loader = DataLoader(train_dset, batch_size=32, shuffle=True, collate_fn=collate)

Using cpu device
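
`DataLoader` calls `collate_fn` on a list of samples to assemble one batch. The sketch below illustrates the idea in plain Python. `make_collate` is a hypothetical stand-in, not the implementation of `coll_factory`, which also has to produce tensors on `device`. It assumes that each sample is a `(sequence, label)` pair and uses a simplified toy vocabulary.

```python
def make_collate(stoi):
    """Hypothetical factory that returns a collate function bound to a vocabulary."""
    def collate(batch):
        # batch is assumed to be a list of (sequence, label) pairs
        inputs = [[stoi.get(letter, stoi['<unk>']) for letter in seq]
                  for seq, _ in batch]
        labels = [float(label) for _, label in batch]
        return inputs, labels
    return collate

toy_stoi = {'<unk>': 0, 'A': 2, 'T': 3, 'G': 4, 'C': 5}
toy_collate = make_collate(toy_stoi)

toy_inputs, toy_labels = toy_collate([("ACGT", 1), ("TTGA", 0)])
print(toy_inputs)  # [[2, 5, 4, 3], [3, 3, 4, 2]]
print(toy_labels)  # [1.0, 0.0]
```

Binding the vocabulary in a factory keeps the collate function's signature down to the single `batch` argument that `DataLoader` passes in.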


# Model
We will initialize our model.
From the dataset info, we know that all inputs are 500 characters long, and the number of classes is 2.

In [7]:
model = CNN(
    number_of_classes=2,
    vocab_size=len(vocabulary),
    embedding_dim=100,
    input_len=500
).to(device)

## Training

In [8]:
model.train(train_loader, epochs=5)

Epoch 0
Train metrics: 
 Accuracy: 65.4%, Avg loss: 0.645070 

Epoch 1
Train metrics: 
 Accuracy: 67.7%, Avg loss: 0.637024 

Epoch 2
Train metrics: 
 Accuracy: 68.1%, Avg loss: 0.637149 

Epoch 3
Train metrics: 
 Accuracy: 67.3%, Avg loss: 0.636471 

Epoch 4
Train metrics: 
 Accuracy: 70.7%, Avg loss: 0.630674 



## Testing

In [9]:
test_dset = HumanEnhancersCohn('test', version=0)
test_loader = DataLoader(test_dset, batch_size=32, shuffle=False, collate_fn=collate)

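predictions = []
# Gradients are not needed for inference
with torch.no_grad():
    for x, _ in test_loader:
        # Round the model output to the nearest class label (0 or 1)
        output = torch.round(model(x))
        predictions.extend(prediction[0] for prediction in output.tolist())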

In [10]:
from sklearn.metrics import f1_score
from genomic_benchmarks.data_check.info import labels_in_order

labels = labels_in_order(dset_name='human_enhancers_cohn')
f1_score(labels, predictions)

0.40943812595484635
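
The F1 score is the harmonic mean of precision and recall for the positive class. As a rough illustration of what `f1_score` computes in the binary case, the following sketch derives it from true-positive, false-positive, and false-negative counts. `binary_f1` is a hypothetical helper, and the labels are made-up toy values, not results from this notebook.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Hypothetical helper: F1 score of the positive class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: one of four positives is found, one negative is flagged by mistake
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(binary_f1(toy_true, toy_pred))
```

In this toy example precision is 0.5 and recall is 0.25, so the F1 score is 1/3, while accuracy on the same data is 4/8.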