# TF CNN Classifier

To run this notebook on another benchmark, use

```
papermill tf_cnn_classifier.ipynb tf_cnn_experiments/[DATASET NAME].ipynb -p DATASET [DATASET NAME]
```

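When several benchmarks need to be run, the command above can be assembled programmatically. The following is a minimal sketch that only builds the argument list and runs nothing. The helper name `papermill_command` and its `notebook` and `out_dir` parameters are illustrative assumptions; only the command layout comes from the line above.

```python
import shlex


def papermill_command(dataset, notebook="tf_cnn_classifier.ipynb",
                      out_dir="tf_cnn_experiments"):
    """Build the papermill argument list for one dataset (nothing is executed)."""
    return [
        "papermill",
        notebook,
        f"{out_dir}/{dataset}.ipynb",  # one output notebook per dataset
        "-p", "DATASET", dataset,      # overrides the DATASET parameter
    ]


# Dataset names that appear in this notebook; other benchmark names
# would be passed the same way.
for name in ["human_nontata_promoters", "demo_mouse_enhancers"]:
    print(shlex.join(papermill_command(name)))
```

The resulting list could be handed to `subprocess.run(..., check=True)` once the output directory exists.
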
In [1]:
DATASET = 'human_nontata_promoters'
VERSION = 0
BATCH_SIZE = 64
EPOCHS = 10

In [2]:
# Parameters
DATASET = "demo_mouse_enhancers"


In [3]:
print(DATASET, VERSION, BATCH_SIZE, EPOCHS)

demo_mouse_enhancers 0 64 10


## Data download

In [4]:
from pathlib import Path
import tensorflow as tf
import numpy as np

from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.models.tf import vectorize_layer
from genomic_benchmarks.models.tf import basic_cnn_model_v0 as model

if not is_downloaded(DATASET):
    download_dataset(DATASET)

In [5]:
info(DATASET)

Dataset `demo_mouse_enhancers` has 2 classes: negative, positive.

The lenght of genomic intervals ranges from 331 to 4776, with average 2369.5768595041322 and median 2381.0.

Totally 1210 sequences have been found, 968 for training and 242 for testing.




| class | train | test |
|---|---|---|
| negative | 484 | 121 |
| positive | 484 | 121 |


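The per-class counts in the table can be cross-checked directly on disk. This is a minimal sketch, assuming the downloaded dataset is laid out as `<dataset>/<split>/<class>/<one file per sequence>`, which is the layout the cells below read from. The helper name `count_sequences` is hypothetical and not part of `genomic_benchmarks`.

```python
from pathlib import Path


def count_sequences(dataset_dir):
    """Return {split: {class_name: number_of_files}} for a dataset directory."""
    counts = {}
    for split_dir in sorted(p for p in Path(dataset_dir).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


# Example call, using the cache location this notebook assumes:
# count_sequences(Path.home() / ".genomic_benchmarks" / "demo_mouse_enhancers")
```
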
## TF Dataset object

In [6]:
SEQ_PATH = Path.home() / '.genomic_benchmarks' / DATASET
CLASSES = [x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir()]

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

Found 968 files belonging to 2 classes.

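`text_dataset_from_directory` assigns integer labels by position in `class_names`, so the order of `CLASSES` decides which class becomes label 0. Because `Path.iterdir()` does not guarantee an order, sorting the names keeps the mapping stable across machines. The sketch below mimics that labelling in plain Python; `list_labelled_files` is a hypothetical helper, not a TensorFlow or `genomic_benchmarks` function.

```python
from pathlib import Path


def list_labelled_files(split_dir, class_names=None):
    """Pair every file under split_dir/<class>/ with the index of its class."""
    split_dir = Path(split_dir)
    if class_names is None:
        # Sorting gives a deterministic class order (e.g. negative=0, positive=1).
        class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    examples = []
    for label, name in enumerate(class_names):
        for path in sorted((split_dir / name).iterdir()):
            if path.is_file():
                examples.append((path, label))
    return class_names, examples
```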

## Text vectorization

In [7]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
vectorize_layer.set_vocabulary(vocabulary=np.asarray(['a', 'c', 't', 'g', 'n']))
vectorize_layer.get_vocabulary()

['', '[UNK]', 'a', 'c', 't', 'g', 'n']

In [8]:
def vectorize_text(text, label):
    # Shift indices by 2 so that 'a' maps to 0; padding ('') becomes -2 and '[UNK]' becomes -1.
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text) - 2, label

train_ds = train_dset.map(vectorize_text)

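The effect of the vocabulary and the `-2` shift can be reproduced without TensorFlow. This is a rough sketch only: it assumes `vectorize_layer` lowercases its input, splits it into single characters, and pads with index 0, none of which is shown in this notebook. The names `encode`, `TOKEN_TO_INDEX`, and `OFFSET` are illustrative.

```python
# Vocabulary as reported by vectorize_layer.get_vocabulary() above.
VOCABULARY = ["", "[UNK]", "a", "c", "t", "g", "n"]
TOKEN_TO_INDEX = {token: index for index, token in enumerate(VOCABULARY)}
OFFSET = 2  # the shift applied in vectorize_text


def encode(sequence, length=None):
    """Map a nucleotide string to shifted integer ids (assumed character-level split)."""
    ids = [TOKEN_TO_INDEX.get(ch, TOKEN_TO_INDEX["[UNK]"]) for ch in sequence.lower()]
    if length is not None:
        # Truncate or right-pad with the padding token '' (index 0).
        ids = ids[:length] + [TOKEN_TO_INDEX[""]] * max(0, length - len(ids))
    return [i - OFFSET for i in ids]


print(encode("ACGTN"))   # [0, 1, 3, 2, 4]
print(encode("ACX", 5))  # [0, 1, -1, -2, -2]
```

After the shift, the five nucleotide symbols occupy ids 0 to 4, while unknown characters and padding fall on negative ids.
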
## Model training

In [9]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

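With `from_logits=True` the loss receives raw scores rather than probabilities, which is why the accuracy threshold is `0.0`: a logit of 0 corresponds to a sigmoid probability of 0.5. The plain-Python sketch below illustrates that relationship for a single example. The function names are illustrative and this is not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed from the raw logit."""
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def predict_class(logit, threshold=0.0):
    """Thresholding the logit at 0 matches thresholding the probability at 0.5."""
    return int(logit > threshold)


for z in (-2.0, -0.5, 0.5, 2.0):
    assert predict_class(z) == int(sigmoid(z) > 0.5)
```
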
In [10]:
history = model.fit(
    train_ds,
    epochs=EPOCHS)

Epoch 1/10
 1/16 [>.............................] - ETA: 15s - loss: 0.9479 - binary_accuracy: 0.3281
 2/16 [==>...........................] - ETA: 3s - loss: 0.9111 - binary_accuracy: 0.2969
 3/16 [====>.........................] - ETA: 3s - loss: 0.8575 - binary_accuracy: 0.3021
Epoch 2/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5804 - binary_accuracy: 0.7500
 2/16 [==>...........................] - ETA: 4s - loss: 0.5635 - binary_accuracy: 0.7422
 3/16 [====>.........................] - ETA: 3s - loss: 0.5597 - binary_accuracy: 0.7396
Epoch 3/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5677 - binary_accuracy: 0.7344
 2/16 [==>...........................] - ETA: 3s - loss: 0.5524 - binary_accuracy: 0.7656
 3/16 [====>.........................] - ETA: 3s - loss: 0.5588 - binary_accuracy: 0.7708
Epoch 4/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5626 - binary_accuracy: 0.7031
 2/16 [==>...........................] - ETA: 3s - loss: 0.5534 - binary_accuracy: 0.7109
 3/16 [====>.........................] - ETA: 3s - loss: 0.5229 - binary_accuracy: 0.7604
Epoch 5/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5786 - binary_accuracy: 0.6719
 2/16 [==>...........................] - ETA: 4s - loss: 0.5435 - binary_accuracy: 0.7266
 3/16 [====>.........................] - ETA: 3s - loss: 0.5258 - binary_accuracy: 0.7500
Epoch 6/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5149 - binary_accuracy: 0.7656
 2/16 [==>...........................] - ETA: 3s - loss: 0.4975 - binary_accuracy: 0.7734
 3/16 [====>.........................] - ETA: 3s - loss: 0.5144 - binary_accuracy: 0.7604
Epoch 7/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5320 - binary_accuracy: 0.7031
 2/16 [==>...........................] - ETA: 3s - loss: 0.5542 - binary_accuracy: 0.7344
 3/16 [====>.........................] - ETA: 3s - loss: 0.5325 - binary_accuracy: 0.7292
Epoch 8/10
 1/16 [>.............................] - ETA: 4s - loss: 0.5459 - binary_accuracy: 0.6875
 2/16 [==>...........................] - ETA: 4s - loss: 0.5028 - binary_accuracy: 0.7500
 3/16 [====>.........................] - ETA: 3s - loss: 0.5207 - binary_accuracy: 0.7500
Epoch 9/10
 1/16 [>.............................] - ETA: 4s - loss: 0.4963 - binary_accuracy: 0.8125
 2/16 [==>...........................] - ETA: 4s - loss: 0.5687 - binary_accuracy: 0.7344
 3/16 [====>.........................] - ETA: 3s - loss: 0.5541 - binary_accuracy: 0.7448
Epoch 10/10
 1/16 [>.............................] - ETA: 4s - loss: 0.4664 - binary_accuracy: 0.7500
 2/16 [==>...........................] - ETA: 3s - loss: 0.4861 - binary_accuracy: 0.7891
 3/16 [====>.........................] - ETA: 3s - loss: 0.4992 - binary_accuracy: 0.7969

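The `16` in the progress counter follows from the dataset size and `BATCH_SIZE`: 968 training files in batches of 64 give 15 full batches plus one partial batch. A small sketch of that arithmetic, with the illustrative helper name `batches_per_epoch`:

```python
import math


def batches_per_epoch(n_examples, batch_size):
    """Return (number of batches, size of the last batch) for one pass over the data."""
    steps = math.ceil(n_examples / batch_size)
    last_batch = n_examples - (steps - 1) * batch_size
    return steps, last_batch


print(batches_per_epoch(968, 64))  # (16, 8)  training set
print(batches_per_epoch(242, 64))  # (4, 50)  test set
```
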
## Evaluation on the test set

In [11]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

test_ds = test_dset.map(vectorize_text)

Found 242 files belonging to 2 classes.


In [12]:
model.evaluate(test_ds)

[0.6554393768310547, 0.5702479481697083]
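
`model.evaluate` returns the loss followed by the compiled metrics. The sketch below labels the two numbers and puts the accuracy in context. The metric names are an assumption taken from the training log labels (`loss`, `binary_accuracy`), and the test-set size and class balance come from the `info(DATASET)` output above.

```python
# Values copied from the model.evaluate(test_ds) output above.
results = [0.6554393768310547, 0.5702479481697083]
metric_names = ["loss", "binary_accuracy"]  # assumed order: loss first, then metrics

report = dict(zip(metric_names, results))

n_test = 242                 # test sequences, 121 per class
n_correct = round(report["binary_accuracy"] * n_test)
baseline = 121 / n_test      # accuracy of always predicting one class

print(report)
print(f"{n_correct} of {n_test} test sequences classified correctly")
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

On this balanced test set, a test accuracy of about 0.57 corresponds to 138 correct predictions, only modestly above the 0.50 baseline.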