# TF CNN Classifier

To run this notebook on another benchmark, use:

```
papermill utils/tf_cnn_classifier.ipynb tf_cnn_experiments/[DATASET NAME].ipynb -p DATASET [DATASET NAME]
```
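When several benchmarks need to be run, the same command can be generated in a loop. The sketch below only builds and prints the command lines; the helper name `build_papermill_command` and the dataset list are illustrative assumptions, while the notebook paths and the `-p DATASET` parameter are taken from the command above.

```python
import shlex


def build_papermill_command(dataset,
                            template="utils/tf_cnn_classifier.ipynb",
                            out_dir="tf_cnn_experiments"):
    """Return the papermill invocation for one dataset as an argument list."""
    output = f"{out_dir}/{dataset}.ipynb"
    return ["papermill", template, output, "-p", "DATASET", dataset]


# Hypothetical list of benchmarks to run; replace with the datasets you need.
datasets = ["demo_mouse_enhancers", "demo_coding_vs_intergenomic_seqs"]

for name in datasets:
    # Print a shell-ready command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_papermill_command(name)))
```

Returning an argument list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting issues.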

In [1]:
DATASET = 'demo_coding_vs_intergenomic_seqs'
VERSION = 0
BATCH_SIZE = 64
EPOCHS = 10

In [2]:
# Parameters
DATASET = "demo_mouse_enhancers"


In [3]:
print(DATASET, VERSION, BATCH_SIZE, EPOCHS)

demo_mouse_enhancers 0 64 10


## Data download

In [4]:
from pathlib import Path
import tensorflow as tf

import numpy as np
from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.models.tf import vectorize_layer, binary_f1_score
from genomic_benchmarks.models.tf import basic_cnn_model_v0 as model

if not is_downloaded(DATASET):
    download_dataset(DATASET)


In [5]:
info(DATASET)



Dataset `demo_mouse_enhancers` has 2 classes: negative, positive.

The length of genomic intervals ranges from 331 to 4776, with average 2369.5768595041322 and median 2381.0.

Totally 1210 sequences have been found, 968 for training and 242 for testing.


| | train | test |
|---|---|---|
| negative | 484 | 121 |
| positive | 484 | 121 |
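The split sizes determine how many batches each pass over the data takes. As a quick check, assuming the final partial batch is kept rather than dropped, the 968 training sequences with `BATCH_SIZE = 64` give 16 steps per epoch, which is the step count shown in the Keras progress output during training. The helper name `steps_per_epoch` is illustrative.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches when the last, partially filled batch is kept."""
    return math.ceil(n_samples / batch_size)


BATCH_SIZE = 64
train_steps = steps_per_epoch(968, BATCH_SIZE)  # 968 / 64 = 15.125, rounded up
test_steps = steps_per_epoch(242, BATCH_SIZE)   # 242 / 64 = 3.78..., rounded up

print(train_steps, test_steps)  # prints: 16 4
```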


## TF Dataset object

In [6]:
SEQ_PATH = Path.home() / '.genomic_benchmarks' / DATASET
# Sort so that the class-to-label mapping does not depend on directory listing order.
CLASSES = sorted(x.name for x in (SEQ_PATH / 'train').iterdir() if x.is_dir())

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

Found 968 files belonging to 2 classes.
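`text_dataset_from_directory` reads one subdirectory per class, and the position of a name in `class_names` becomes its integer label. The sketch below reproduces that layout in a temporary directory using only the standard library; the helper `list_classes` and the file name `0.txt` are illustrative assumptions. Because `Path.iterdir()` yields entries in arbitrary order, sorting the names is one way to keep the label mapping stable between runs.

```python
import tempfile
from pathlib import Path


def list_classes(split_dir):
    """Return class names (subdirectory names) in a deterministic order."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    # One folder per class, each holding one text file per sequence.
    for label in ("positive", "negative"):
        (train / label).mkdir(parents=True)
        (train / label / "0.txt").write_text("ACGT")
    classes = list_classes(train)

# Index in the list is the integer label: negative -> 0, positive -> 1.
print(classes)  # prints: ['negative', 'positive']
```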


## Text vectorization

In [7]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
# vectorize_layer.set_vocabulary(vocabulary=np.asarray(['a', 'c', 't', 'g', 'n']))
vectorize_layer.get_vocabulary()


['', '[UNK]', 'n', 't', 'a', 'c', 'g']

In [8]:
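def vectorize_text(text, label):
    # Add a trailing axis so that each sequence is a separate row for the layer.
    text = tf.expand_dims(text, -1)
    # Shift by 2 so the reserved '' and '[UNK]' indices fall below zero and nucleotides start at 0.
    return vectorize_layer(text) - 2, label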

train_ds = train_dset.map(vectorize_text)
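To see what the `- 2` shift does, here is a pure-Python sketch that applies the same arithmetic to the vocabulary printed above. It assumes the layer splits sequences into single lowercase characters, which the vocabulary suggests but this notebook does not show; `encode` is an illustrative helper, not part of `genomic_benchmarks`.

```python
# Vocabulary as returned by vectorize_layer.get_vocabulary() above.
VOCAB = ['', '[UNK]', 'n', 't', 'a', 'c', 'g']

# Subtracting 2 moves the reserved padding ('') and '[UNK]' entries to -2 and -1,
# so the five nucleotide symbols occupy the indices 0..4.
SHIFTED = {token: index - 2 for index, token in enumerate(VOCAB)}


def encode(sequence):
    """Map each character to its shifted index (assumes character-level tokens)."""
    return [SHIFTED.get(ch, SHIFTED['[UNK]']) for ch in sequence.lower()]


print(SHIFTED)
print(encode("ACGTN"))  # prints: [2, 3, 4, 1, 0]
```

Note that positions padded by the layer would end up as -2 after the shift, so the model has to tolerate negative indices.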

## Model training

In [9]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0), binary_f1_score])
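The loss is configured with `from_logits=True`, which implies the model outputs raw logits rather than probabilities. That is why `BinaryAccuracy` uses `threshold=0.0`: a logit of 0 corresponds to a probability of 0.5 under the sigmoid. The sketch below checks this equivalence in plain Python; the function names and the sample logits are illustrative.

```python
import math


def sigmoid(z):
    """Convert a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_from_logit(z, threshold=0.0):
    """Predict class 1 when the logit exceeds the threshold."""
    return int(z > threshold)


def predict_from_probability(p, threshold=0.5):
    """Predict class 1 when the probability exceeds the threshold."""
    return int(p > threshold)


logits = [-2.0, -0.1, 0.3, 4.0]  # made-up example values
by_logit = [predict_from_logit(z) for z in logits]
by_prob = [predict_from_probability(sigmoid(z)) for z in logits]

print(sigmoid(0.0))         # prints: 0.5
print(by_logit == by_prob)  # prints: True
```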

In [10]:
history = model.fit(
    train_ds,
    epochs=EPOCHS)

Epoch 1/10


 1/16 [>.............................] - ETA: 29s - loss: 0.7253 - binary_accuracy: 0.5000 - f1_score: 0.0000e+00

 2/16 [==>...........................] - ETA: 6s - loss: 0.7306 - binary_accuracy: 0.4453 - f1_score: 0.0000e+00 

 3/16 [====>.........................] - ETA: 5s - loss: 0.7272 - binary_accuracy: 0.4531 - f1_score: 0.0000e+00



























Epoch 2/10


 1/16 [>.............................] - ETA: 4s - loss: 0.6697 - binary_accuracy: 0.5781 - f1_score: 0.1250

 2/16 [==>...........................] - ETA: 4s - loss: 0.6914 - binary_accuracy: 0.5234 - f1_score: 0.2133

 3/16 [====>.........................] - ETA: 3s - loss: 0.6895 - binary_accuracy: 0.5365 - f1_score: 0.1887



























Epoch 3/10


 1/16 [>.............................] - ETA: 5s - loss: 0.6433 - binary_accuracy: 0.6562 - f1_score: 0.2162

 2/16 [==>...........................] - ETA: 4s - loss: 0.6686 - binary_accuracy: 0.6016 - f1_score: 0.1667

 3/16 [====>.........................] - ETA: 4s - loss: 0.6632 - binary_accuracy: 0.6146 - f1_score: 0.1667



























Epoch 4/10


 1/16 [>.............................] - ETA: 5s - loss: 0.6059 - binary_accuracy: 0.7188 - f1_score: 0.2927

 2/16 [==>...........................] - ETA: 5s - loss: 0.6121 - binary_accuracy: 0.7031 - f1_score: 0.3077

 3/16 [====>.........................] - ETA: 4s - loss: 0.6130 - binary_accuracy: 0.7031 - f1_score: 0.2261





























Epoch 5/10


 1/16 [>.............................] - ETA: 8s - loss: 0.5758 - binary_accuracy: 0.8281 - f1_score: 0.2500

 2/16 [==>...........................] - ETA: 7s - loss: 0.5764 - binary_accuracy: 0.8125 - f1_score: 0.2651

 3/16 [====>.........................] - ETA: 5s - loss: 0.5655 - binary_accuracy: 0.8229 - f1_score: 0.3125





























Epoch 6/10


 1/16 [>.............................] - ETA: 7s - loss: 0.5921 - binary_accuracy: 0.6562 - f1_score: 0.4186

 2/16 [==>...........................] - ETA: 7s - loss: 0.5824 - binary_accuracy: 0.7188 - f1_score: 0.4878

 3/16 [====>.........................] - ETA: 6s - loss: 0.5675 - binary_accuracy: 0.7396 - f1_score: 0.4706



























Epoch 7/10


 1/16 [>.............................] - ETA: 5s - loss: 0.6061 - binary_accuracy: 0.7344 - f1_score: 0.2326

 2/16 [==>...........................] - ETA: 6s - loss: 0.5688 - binary_accuracy: 0.7812 - f1_score: 0.3478

 3/16 [====>.........................] - ETA: 5s - loss: 0.5492 - binary_accuracy: 0.7865 - f1_score: 0.4058





























Epoch 8/10


 1/16 [>.............................] - ETA: 8s - loss: 0.5437 - binary_accuracy: 0.7812 - f1_score: 0.4167

 2/16 [==>...........................] - ETA: 5s - loss: 0.5331 - binary_accuracy: 0.7969 - f1_score: 0.4583

 3/16 [====>.........................] - ETA: 5s - loss: 0.5466 - binary_accuracy: 0.7708 - f1_score: 0.4741



























Epoch 9/10


 1/16 [>.............................] - ETA: 5s - loss: 0.5398 - binary_accuracy: 0.8125 - f1_score: 0.5778

 2/16 [==>...........................] - ETA: 4s - loss: 0.5486 - binary_accuracy: 0.8047 - f1_score: 0.5243

 3/16 [====>.........................] - ETA: 5s - loss: 0.5543 - binary_accuracy: 0.7760 - f1_score: 0.5098





























Epoch 10/10


 1/16 [>.............................] - ETA: 7s - loss: 0.5398 - binary_accuracy: 0.7656 - f1_score: 0.5652

 2/16 [==>...........................] - ETA: 5s - loss: 0.5280 - binary_accuracy: 0.7969 - f1_score: 0.5306

 3/16 [====>.........................] - ETA: 4s - loss: 0.5398 - binary_accuracy: 0.7812 - f1_score: 0.5000



























## Evaluation on the test set

In [11]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

test_ds = test_dset.map(vectorize_text)

Found 242 files belonging to 2 classes.


In [12]:
model.evaluate(test_ds)

[0.6139058470726013, 0.7355371713638306, 0.21739129722118378]
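`model.evaluate` returns the loss followed by the compiled metrics, in the order they appear in the training log (`loss`, `binary_accuracy`, `f1_score`). A small sketch that attaches those names to the values above makes the result easier to read; the values are copied from the output, and the count of correct predictions is derived from the 242 test sequences reported earlier.

```python
# Metric names in the order shown in the training progress output.
metric_names = ["loss", "binary_accuracy", "f1_score"]
# Values copied from the model.evaluate(test_ds) output above.
values = [0.6139058470726013, 0.7355371713638306, 0.21739129722118378]

results = dict(zip(metric_names, values))
for name, value in results.items():
    print(f"{name}: {value:.4f}")

# Accuracy times the test set size gives the number of correct predictions.
n_test = 242
n_correct = round(results["binary_accuracy"] * n_test)
print(f"correct: {n_correct} of {n_test}")  # prints: correct: 178 of 242
```

Recent Keras versions can also return this mapping directly via `model.evaluate(test_ds, return_dict=True)`.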