<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Train_CNN_Classifier_With_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How To Train a CNN Classifier With PyTorch

This notebook demonstrates how to use `genomic_benchmarks` to train a neural network classifier on one of its benchmark datasets, [human_enhancers_cohn](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs/human_enhancers_cohn).

In [1]:
pip install genomic-benchmarks


In [2]:
import torch
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn
from genomic_benchmarks.models.torch import CNN
from genomic_benchmarks.dataset_getters.utils import coll_factory, LetterTokenizer, build_vocab
from genomic_benchmarks.data_check import info


# Choose the dataset

Create the PyTorch dataset object for the training split.

In [3]:
train_dset = HumanEnhancersCohn('train', version=0)

Downloading 176563cDPQ5Y094WyoSBF02QjoVQhWuCh into /root/.genomic_benchmarks/human_enhancers_cohn.zip... Done.
Unzipping...Done.


# Print out information about the dataset

In [4]:
info("human_enhancers_cohn", 0)

Dataset `human_enhancers_cohn` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 500.

Totally 27791 sequences have been found, 20843 for training and 6948 for testing.


| | train | test |
|---|---|---|
| negative | 10422 | 3474 |
| positive | 10421 | 3474 |
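
As a quick sanity check on the counts above, the arithmetic can be reproduced in plain Python. This is only a sketch: the numbers are copied from the `info` output, and the variable names are illustrative rather than part of `genomic_benchmarks`.

```python
# Class counts copied from the info() output above (illustrative names).
counts = {
    "train": {"negative": 10422, "positive": 10421},
    "test": {"negative": 3474, "positive": 3474},
}

split_sizes = {split: sum(per_class.values()) for split, per_class in counts.items()}
total = sum(split_sizes.values())

# Share of all sequences that ends up in the training split.
train_share = split_sizes["train"] / total

# Accuracy of a trivial classifier that always predicts the largest test class.
majority_baseline = max(counts["test"].values()) / split_sizes["test"]

print(split_sizes, total)
print(f"train share: {train_share:.1%}, majority baseline: {majority_baseline:.1%}")
```

The two classes are almost perfectly balanced, so a classifier that always predicts one class would reach about 50% test accuracy. Any accuracy the model achieves should be read against that baseline.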


# Tokenizer and vocab

Create a tokenizer for the dataset so that the sequences can be numericalized and fed to the neural network.
The dataset info shows that all sequences have the same length, so no padding is needed.


In [5]:
tokenizer = get_tokenizer(LetterTokenizer())
vocabulary = build_vocab(train_dset, tokenizer, use_padding=False)

print("vocab len:", len(vocabulary))
print(vocabulary.get_stoi())

vocab len: 7
{'<eos>': 6, '<bos>': 1, '<unk>': 0, 'C': 2, 'A': 3, 'T': 4, 'G': 5}
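
To make the numericalization step concrete, here is a minimal pure-Python sketch of what a letter-level tokenizer plus vocabulary lookup amounts to. It assumes each sequence is split into single characters and wrapped in `<bos>`/`<eos>` markers, and that unknown letters map to `<unk>`; the printed vocabulary suggests this, but the notebook does not show it. `STOI`, `tokenize`, and `numericalize` are hypothetical names, not the `genomic_benchmarks` or `torchtext` API.

```python
# Mapping copied from the printed vocabulary above.
STOI = {"<unk>": 0, "<bos>": 1, "C": 2, "A": 3, "T": 4, "G": 5, "<eos>": 6}


def tokenize(sequence):
    """Split a DNA string into letters and add begin/end markers (assumed behavior)."""
    return ["<bos>"] + list(sequence) + ["<eos>"]


def numericalize(sequence):
    """Map each token to its integer id; unknown letters fall back to <unk> (assumed)."""
    return [STOI.get(token, STOI["<unk>"]) for token in tokenize(sequence)]


ids = numericalize("ACGTN")
print(ids)
```

Under these assumptions every 500-letter sequence becomes a list of integers of identical length, which is why no padding is required.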


# Dataloader and batch preparation

We create a PyTorch data loader, which preprocesses and batches the data for the neural network.

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

collate = coll_factory(vocabulary, tokenizer, device, pad_to_length=None)
train_loader = DataLoader(train_dset, batch_size=32, shuffle=True, collate_fn=collate)

Using cuda device
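
With `batch_size=32`, the number of batches per epoch follows directly from the size of the training split. The sketch below works this out in plain Python. It relies on the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`), and the variable names are illustrative.

```python
import math

train_size = 20843  # training sequences, from the dataset info
batch_size = 32

# The last batch is kept even if it is incomplete (drop_last=False by default).
batches_per_epoch = math.ceil(train_size / batch_size)
last_batch_size = train_size - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)
```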


# Model

We initialize the model.
From the dataset info, we know that every input sequence is 500 characters long and that there are 2 classes.

In [7]:
model = CNN(
    number_of_classes=2,
    vocab_size=len(vocabulary),
    embedding_dim=100,
    input_len=500
).to(device)
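
The `input_len` argument is presumably needed because the size of the flattened convolutional features, and therefore of the first dense layer, depends on the sequence length. The sketch below illustrates that dependency with the standard output-length formula for 1D convolution and pooling. The layer sizes are hypothetical and chosen only for illustration; they are not taken from the `genomic_benchmarks` CNN.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution or pooling layer (floor division)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Hypothetical stack: (conv k=8 -> max-pool k=2, stride=2) applied twice.
length = 500
for _ in range(2):
    length = conv1d_out_len(length, kernel_size=8)            # convolution
    length = conv1d_out_len(length, kernel_size=2, stride=2)  # max pooling

out_channels = 16  # hypothetical number of channels after the last convolution
flattened_features = out_channels * length
print(length, flattened_features)
```

Changing the input length changes `flattened_features`, so a model built this way has to know the sequence length up front.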

## Training

In [8]:
model.train(train_loader, epochs=15)

Epoch 0
Train metrics: 
 Accuracy: 66.6%, Avg loss: 0.642394 

Epoch 1
Train metrics: 
 Accuracy: 68.3%, Avg loss: 0.639536 

Epoch 2
Train metrics: 
 Accuracy: 69.2%, Avg loss: 0.635860 

Epoch 3
Train metrics: 
 Accuracy: 69.9%, Avg loss: 0.633593 

Epoch 4
Train metrics: 
 Accuracy: 70.2%, Avg loss: 0.631239 

Epoch 5
Train metrics: 
 Accuracy: 69.6%, Avg loss: 0.629804 

Epoch 6
Train metrics: 
 Accuracy: 70.6%, Avg loss: 0.628481 

Epoch 7
Train metrics: 
 Accuracy: 68.7%, Avg loss: 0.630534 

Epoch 8
Train metrics: 
 Accuracy: 71.4%, Avg loss: 0.627743 

Epoch 9
Train metrics: 
 Accuracy: 72.7%, Avg loss: 0.623673 

Epoch 10
Train metrics: 
 Accuracy: 72.1%, Avg loss: 0.621493 

Epoch 11
Train metrics: 
 Accuracy: 72.3%, Avg loss: 0.622850 

Epoch 12
Train metrics: 
 Accuracy: 72.6%, Avg loss: 0.620830 

Epoch 13
Train metrics: 
 Accuracy: 74.1%, Avg loss: 0.620416 

Epoch 14
Train metrics: 
 Accuracy: 73.3%, Avg loss: 0.620259 



## Testing

In [9]:
test_dset = HumanEnhancersCohn('test', version=0)
test_loader = DataLoader(test_dset, batch_size=32, shuffle=True, collate_fn=collate)

model.test(test_loader)

test_loss  140.02356469631195
num_batches 218
correct 4789
size 6948
Test Error: 
 Accuracy: 68.9%, Avg loss: 0.642310 
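
The summary line can be reproduced from the raw values printed just above it. The sketch below assumes, as the numbers suggest, that accuracy is `correct / size` and that the average loss is `test_loss / num_batches`; the variable names mirror the printed labels.

```python
import math

# Raw values copied from the test output above.
test_loss = 140.02356469631195
num_batches = 218
correct = 4789
size = 6948
batch_size = 32

accuracy = correct / size
avg_loss = test_loss / num_batches

# Number of batches implied by the test set size and the batch size.
expected_batches = math.ceil(size / batch_size)

print(f"Accuracy: {100 * accuracy:.1f}%, Avg loss: {avg_loss:.6f}")
print(expected_batches == num_batches)
```

The test accuracy of 68.9% is lower than the 73.3% training accuracy reported for the last epoch.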

