# How To Train a CNN Classifier With PyTorch

This notebook demonstrates how to use `genomic_benchmarks` to train a neural network classifier on one of its benchmark datasets, `human_enhancers_cohn` (see the [dataset documentation](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs)).

# Install torchtext

This demo uses utilities from `torchtext`.

In [None]:
!pip install torchtext

In [1]:
import torch
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn
from genomic_benchmarks.models.torch_cnn import CNN
from genomic_benchmarks.dataset_getters.utils import coll_factory, LetterTokenizer, build_vocab
from genomic_benchmarks.data_check import info


# Choose the dataset

Create a PyTorch dataset object for the training split.

In [2]:
train_dset = HumanEnhancersCohn('train', version=0)

Downloading 176563cDPQ5Y094WyoSBF02QjoVQhWuCh into /home/martinekvlastimil95/.genomic_benchmarks/translated_human_enhancers_cohn/human_enhancers_cohn.zip... Done.
Unzipping...Done.


# Print out information about the dataset

In [4]:
info("human_enhancers_cohn")

Dataset `human_enhancers_cohn` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 500.

Totally 27791 sequences have been found, 20843 for training and 6948 for testing.




| class | train | test |
|---|---|---|
| negative | 10422 | 3474 |
| positive | 10421 | 3474 |
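
Because the two classes are almost perfectly balanced, a classifier that always predicts the same class would score about 50%, which is a useful baseline when reading the accuracies reported later in this notebook. A minimal sketch of that arithmetic, with the counts copied from the table above (the `counts` dictionary and the `majority_baseline` helper are illustrative names, not part of `genomic_benchmarks`):

```python
# Class counts copied from the output of info("human_enhancers_cohn").
counts = {
    "train": {"negative": 10422, "positive": 10421},
    "test": {"negative": 3474, "positive": 3474},
}


def majority_baseline(split_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(split_counts.values())
    return max(split_counts.values()) / total


for split, split_counts in counts.items():
    total = sum(split_counts.values())
    baseline = majority_baseline(split_counts)
    print(f"{split}: {total} sequences, majority baseline {baseline:.1%}")
```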


# Tokenizer and vocab

Create a tokenizer for the dataset so that the sequences can be numericalized and fed to the neural network.
The dataset info shows that all sequences have the same length, so no padding is needed.


In [5]:
tokenizer = get_tokenizer(LetterTokenizer())
vocabulary = build_vocab(train_dset, tokenizer, use_padding=False)

print("vocab len:", len(vocabulary))
print(vocabulary.get_stoi())

vocab len: 6
{'<eos>': 5, 'G': 4, 'A': 3, 'C': 2, 'T': 1, '<bos>': 0}
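
To make the numericalization step concrete, here is a plain-Python sketch that does not use `torchtext`. It assumes the tokenizer splits a sequence into single letters and wraps it in `<bos>` and `<eos>` markers, which is what the vocabulary printed above suggests. The `stoi` dictionary is copied from that output, and `tokenize` and `numericalize` are illustrative helpers rather than library functions.

```python
# Token-to-index mapping copied from the vocabulary printed above.
stoi = {"<bos>": 0, "T": 1, "C": 2, "A": 3, "G": 4, "<eos>": 5}


def tokenize(sequence):
    """Split a DNA string into letters, wrapped in <bos>/<eos> (assumed behaviour)."""
    return ["<bos>"] + list(sequence) + ["<eos>"]


def numericalize(tokens, mapping):
    """Replace each token with its integer index."""
    return [mapping[token] for token in tokens]


tokens = tokenize("GATTACA")
ids = numericalize(tokens, stoi)
print(tokens)
print(ids)
```

A letter outside the mapping, such as `N`, would raise a `KeyError` in this sketch.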


# Dataloader and batch preparation

We create a PyTorch `DataLoader`, which preprocesses and batches the data for the neural network.

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

collate = coll_factory(vocabulary, tokenizer, device, pad_to_length=None)
train_loader = DataLoader(train_dset, batch_size=32, shuffle=True, collate_fn=collate)

Using cuda device
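
A collate function receives a list of dataset items and returns one batch. The sketch below illustrates the idea in plain Python and is not the implementation behind `coll_factory`: `simple_collate` is a hypothetical helper, it assumes each dataset item is a `(sequence, label)` pair, and it returns nested lists where a real collate function would return tensors on `device`. Since every sequence has the same length, the batch is rectangular without any padding.

```python
# Letter-to-index mapping; the indices mirror the vocabulary printed earlier.
STOI = {"<bos>": 0, "T": 1, "C": 2, "A": 3, "G": 4, "<eos>": 5}


def simple_collate(batch):
    """Turn a list of (sequence, label) pairs into (ids, labels).

    Hypothetical helper: returns plain lists instead of tensors.
    """
    ids, labels = [], []
    for sequence, label in batch:
        tokens = ["<bos>"] + list(sequence) + ["<eos>"]
        ids.append([STOI[token] for token in tokens])
        labels.append(label)
    # Equal-length inputs give a rectangular batch, so no padding is needed.
    lengths = {len(row) for row in ids}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return ids, labels


batch = [("ACGT", 0), ("TTGA", 1)]
batch_ids, batch_labels = simple_collate(batch)
print(batch_ids, batch_labels)
```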


# Model
We now initialize the model.
From the dataset info, we know that every input is 500 characters long and that there are 2 classes.

In [7]:
model = CNN(
    number_of_classes=2,
    vocab_size=len(vocabulary),
    embedding_dim=100,
    input_len=500
).to(device)
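
The architecture inside `genomic_benchmarks.models.torch_cnn.CNN` is not shown in this notebook. As a rough illustration of how the four constructor arguments can shape such a network, here is a deliberately small PyTorch sketch. `TinySeqCNN`, its layer sizes, and its two-logit output are all assumptions made for this example and do not describe the library's model.

```python
import torch
from torch import nn


class TinySeqCNN(nn.Module):
    """Illustrative 1D CNN; not the architecture used by genomic_benchmarks."""

    def __init__(self, number_of_classes, vocab_size, embedding_dim, input_len):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # kernel_size=9 with padding=4 keeps the sequence length unchanged.
        self.conv = nn.Conv1d(embedding_dim, 16, kernel_size=9, padding=4)
        self.pool = nn.MaxPool1d(2)
        # The flattened feature size depends on input_len, which is why a
        # model of this shape needs the sequence length up front.
        self.classifier = nn.Linear(16 * (input_len // 2), number_of_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)   # (batch, length, embedding_dim)
        x = x.permute(0, 2, 1)          # (batch, embedding_dim, length)
        x = self.pool(torch.relu(self.conv(x)))
        return self.classifier(x.flatten(start_dim=1))


model_sketch = TinySeqCNN(number_of_classes=2, vocab_size=6, embedding_dim=100, input_len=500)
dummy_batch = torch.randint(0, 6, (4, 500))  # 4 random sequences of 500 token ids
logits = model_sketch(dummy_batch)
print(logits.shape)
```

In this sketch, passing sequences of any length other than `input_len` would fail at the linear layer, which illustrates why a fixed input length matters for a flatten-based classifier.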

## Training

In [8]:
model.train(train_loader, epochs=15)

Epoch 0
Train metrics: 
 Accuracy: 65.0%, Avg loss: 0.653522 

Epoch 1
Train metrics: 
 Accuracy: 66.2%, Avg loss: 0.650962 

Epoch 2
Train metrics: 
 Accuracy: 67.3%, Avg loss: 0.643297 

Epoch 3
Train metrics: 
 Accuracy: 68.4%, Avg loss: 0.641080 

Epoch 4
Train metrics: 
 Accuracy: 68.7%, Avg loss: 0.636772 

Epoch 5
Train metrics: 
 Accuracy: 69.9%, Avg loss: 0.631040 

Epoch 6
Train metrics: 
 Accuracy: 70.1%, Avg loss: 0.630158 

Epoch 7
Train metrics: 
 Accuracy: 67.5%, Avg loss: 0.633789 

Epoch 8
Train metrics: 
 Accuracy: 71.4%, Avg loss: 0.625305 

Epoch 9
Train metrics: 
 Accuracy: 73.0%, Avg loss: 0.625349 

Epoch 10
Train metrics: 
 Accuracy: 71.5%, Avg loss: 0.625071 

Epoch 11
Train metrics: 
 Accuracy: 71.2%, Avg loss: 0.622787 

Epoch 12
Train metrics: 
 Accuracy: 73.3%, Avg loss: 0.621158 

Epoch 13
Train metrics: 
 Accuracy: 73.0%, Avg loss: 0.620399 

Epoch 14
Train metrics: 
 Accuracy: 72.4%, Avg loss: 0.622846 



## Testing

In [10]:
test_dset = HumanEnhancersCohn('test', version=0)
test_loader = DataLoader(test_dset, batch_size=32, shuffle=False, collate_fn=collate)

model.test(test_loader)

Downloading 176563cDPQ5Y094WyoSBF02QjoVQhWuCh into /home/martinekvlastimil95/.genomic_benchmarks/translated_human_enhancers_cohn/human_enhancers_cohn.zip... Done.
Unzipping...Done.
test_loss  140.27838283777237
num_batches 218
correct 4736
size 6948
Test Error: 
 Accuracy: 68.2%, Avg loss: 0.643479 
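
The summary line can be reproduced from the raw counters printed just above it. The sketch below assumes, as the numbers suggest, that accuracy is `correct / size` and that the average loss is the summed per-batch loss divided by the number of batches. The variable names mirror the printed labels, and `batch_size` is taken from the `DataLoader` call.

```python
import math

# Raw counters copied from the test output above.
test_loss = 140.27838283777237
num_batches = 218
correct = 4736
size = 6948
batch_size = 32

accuracy = correct / size
avg_loss = test_loss / num_batches

# 6948 samples in batches of 32 give 217 full batches plus one partial batch.
expected_batches = math.ceil(size / batch_size)

print(f"Batches: {expected_batches}")
print(f"Accuracy: {accuracy:.1%}, Avg loss: {avg_loss:.6f}")
```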

