# TF CNN Classifier

To run this notebook on another benchmark, use:

```
papermill utils/tf_cnn_classifier.ipynb tf_cnn_experiments/[DATASET NAME].ipynb -p DATASET [DATASET NAME]
```
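To run the notebook over several benchmarks, the command above can be generated once per dataset. The following is a minimal sketch and not part of the original notebook: the helper name `papermill_command` is hypothetical, and the dataset list holds only the name used here, so extend it with the benchmarks you actually have.

```python
import shlex

NOTEBOOK = "utils/tf_cnn_classifier.ipynb"
OUTPUT_DIR = "tf_cnn_experiments"


def papermill_command(dataset):
    """Build the papermill argument list for one dataset (hypothetical helper)."""
    return [
        "papermill",
        NOTEBOOK,
        f"{OUTPUT_DIR}/{dataset}.ipynb",
        "-p", "DATASET", dataset,
    ]


# Only the dataset used in this notebook; add other benchmark names as needed.
datasets = ["demo_coding_vs_intergenomic_seqs"]

for name in datasets:
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(papermill_command(name)))
```

Building the command as a list rather than a string avoids quoting problems if a dataset name ever contains spaces or shell metacharacters.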

In [1]:
DATASET = 'demo_coding_vs_intergenomic_seqs'
VERSION = 0
BATCH_SIZE = 64
EPOCHS = 10

In [2]:
# Parameters
DATASET = "demo_coding_vs_intergenomic_seqs"


In [3]:
print(DATASET, VERSION, BATCH_SIZE, EPOCHS)

demo_coding_vs_intergenomic_seqs 0 64 10


## Data download

In [4]:
from pathlib import Path
import tensorflow as tf
import tensorflow_addons as tfa

import numpy as np
from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.models.tf import vectorize_layer
from genomic_benchmarks.models.tf import get_basic_cnn_model_v0 as get_model

if not is_downloaded(DATASET):
    download_dataset(DATASET)

2022-06-02 23:13:55.552490: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-02 23:13:55.942375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43670 MB memory:  -> device: 0, name: NVIDIA A40, pci bus id: 0000:a3:00.0, compute capability: 8.6


Reference /home/jovyan/.genomic_benchmarks/fasta/Homo_sapiens.GRCh38.cdna.all.fa.gz already exists. Skipping.
Reference /home/jovyan/.genomic_benchmarks/fasta/Homo_sapiens.GRCh38.dna.toplevel.fa.gz already exists. Skipping.


100%|█████████▉| 189154/190000 [00:03<00:00, 47949.96it/s]
100%|██████████| 24/24 [00:26<00:00,  1.11s/it]


In [5]:
info(DATASET)



Dataset `demo_coding_vs_intergenomic_seqs` has 2 classes: coding_seqs, intergenomic_seqs.

All lengths of genomic intervals equals 200.

Totally 100000 sequences have been found, 75000 for training and 25000 for testing.


| class | train | test |
|---|---|---|
| coding_seqs | 37500 | 12500 |
| intergenomic_seqs | 37500 | 12500 |


## TF Dataset object

In [6]:
SEQ_PATH = Path.home() / '.genomic_benchmarks' / DATASET
CLASSES = [x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir()]
NUM_CLASSES = len(CLASSES)

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

Found 75000 files belonging to 2 classes.
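The integer label of each class follows its position in `class_names`, and `Path.iterdir()` yields entries in arbitrary order, so the order of `CLASSES` may differ between machines. A minimal sketch of a deterministic variant is shown below. It is an assumption-laden illustration, not the notebook's code: the helper `list_classes` is hypothetical, and it uses `.name` rather than `.stem` so that a directory name containing a dot would not be truncated.

```python
import tempfile
from pathlib import Path


def list_classes(split_dir):
    """Return the sorted names of the class subdirectories of split_dir."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


# Build a throwaway directory tree that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for class_name in ("intergenomic_seqs", "coding_seqs"):
        (train_dir / class_name).mkdir(parents=True)
    (train_dir / "README.txt").write_text("not a class")  # plain files are ignored

    classes = list_classes(train_dir)

print(classes)  # ['coding_seqs', 'intergenomic_seqs']
```

With a sorted list, label 0 always refers to the same class across runs, which makes saved models and reported metrics easier to compare.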


In [7]:
if NUM_CLASSES > 2:
    train_dset = train_dset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))
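For datasets with more than two classes, the integer labels are converted to one-hot vectors; with two classes, as here, they are left as integers. The effect of `tf.one_hot` can be illustrated with a small NumPy sketch (the helper `one_hot` below is a hypothetical stand-in, not the TensorFlow function):

```python
import numpy as np


def one_hot(labels, depth):
    """Turn integer class labels into one-hot rows of length depth."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(depth, dtype=np.float32)[labels]


encoded = one_hot([0, 2, 1], depth=3)
print(encoded)
```

Each row contains a single 1 at the position of the class index and 0 elsewhere.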

## Text vectorization

In [8]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
VOCAB_SIZE = len(vectorize_layer.get_vocabulary())
vectorize_layer.get_vocabulary()

['', '[UNK]', 'a', 't', 'g', 'c']

In [9]:
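def vectorize_text(text, label):
    # Add a trailing axis: shape (batch,) -> (batch, 1).
    text = tf.expand_dims(text, -1)
    # Shift token ids by 2 so that a, t, g, c map to 0..3 ('' becomes -2, '[UNK]' becomes -1).
    return vectorize_layer(text) - 2, label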

train_ds = train_dset.map(vectorize_text)
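The subtraction of 2 shifts the vocabulary indices so that the four nucleotides occupy ids 0 to 3. A pure-Python sketch of the same mapping follows. It assumes character-level tokenization with lowercasing, which the printed vocabulary suggests but the notebook does not show; the helper `encode` is hypothetical.

```python
VOCAB = ['', '[UNK]', 'a', 't', 'g', 'c']  # vocabulary printed by the notebook
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
UNK_ID = TOKEN_TO_ID['[UNK]']


def encode(sequence, offset=2):
    """Map each character to its vocabulary index minus offset."""
    return [TOKEN_TO_ID.get(ch, UNK_ID) - offset for ch in sequence.lower()]


print(encode("ATGCN"))  # [0, 1, 2, 3, -1]
```

Characters outside the vocabulary, such as `N`, fall back to `[UNK]` and end up as -1 after the shift.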

## Model training

In [10]:
model = get_model(NUM_CLASSES, VOCAB_SIZE)

In [11]:
history = model.fit(
    train_ds,
    epochs=EPOCHS)

Epoch 1/10


2022-06-02 23:18:00.896593: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8204
2022-06-02 23:18:02.262349: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
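The number of optimizer steps behind these epochs follows from the dataset size and batch size reported above. A short sketch of the arithmetic, assuming the final partial batch is kept (the default for `text_dataset_from_directory`):

```python
import math

TRAIN_SIZE = 75_000  # training sequences reported by info(DATASET)
BATCH_SIZE = 64
EPOCHS = 10

steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
# Size of the final, partial batch in each epoch.
last_batch_size = TRAIN_SIZE - (steps_per_epoch - 1) * BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch, last_batch_size, total_steps)  # 1172 56 11720
```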


## Evaluation on the test set

In [12]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

if NUM_CLASSES > 2:
    test_dset = test_dset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))
test_ds = test_dset.map(vectorize_text)

Found 25000 files belonging to 2 classes.


In [13]:
model.evaluate(test_ds)



[0.2579881250858307, 0.8960800170898438, 0.8944406509399414]
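`model.evaluate` returns the loss first, followed by the compiled metrics in order. Which metrics these are depends on how `get_basic_cnn_model_v0` compiles the model, which this notebook does not show, so the names below are placeholders. In Keras, `model.metrics_names` gives the actual labels, and `model.evaluate(test_ds, return_dict=True)` returns a labelled dictionary directly. A minimal sketch of pairing the values with names:

```python
# Values copied from the evaluation output above.
results = [0.2579881250858307, 0.8960800170898438, 0.8944406509399414]

# Placeholder names (assumption): replace with model.metrics_names.
names = ["loss", "metric_1", "metric_2"]

report = {name: round(value, 4) for name, value in zip(names, results)}
print(report)  # {'loss': 0.258, 'metric_1': 0.8961, 'metric_2': 0.8944}
```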