# TF CNN Classifier

To run this notebook on another benchmark, use:

```
papermill utils/tf_cnn_classifier.ipynb tf_cnn_experiments/[DATASET NAME].ipynb -p DATASET [DATASET NAME]
```
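
To run the same template over several benchmarks, the command above can be generated in a loop. This is a minimal sketch: the helper names `papermill_command` and `run_all` are illustrative, and executing the commands assumes the `papermill` CLI is installed and on the `PATH`.

```python
import subprocess


def papermill_command(dataset,
                      template="utils/tf_cnn_classifier.ipynb",
                      out_dir="tf_cnn_experiments"):
    """Build the papermill argument list for one dataset (hypothetical helper)."""
    return ["papermill", template, f"{out_dir}/{dataset}.ipynb",
            "-p", "DATASET", dataset]


def run_all(datasets, dry_run=True):
    """Return one command per dataset; execute them only if dry_run is False."""
    commands = [papermill_command(name) for name in datasets]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop at the first failing notebook.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_all(["demo_coding_vs_intergenomic_seqs", "drosophila_enhancers_stark"])` returns the two commands without running them; pass `dry_run=False` to launch them.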

In [1]:
DATASET = 'demo_coding_vs_intergenomic_seqs'
VERSION = 0
BATCH_SIZE = 64
EPOCHS = 10

In [2]:
# Parameters
DATASET = "drosophila_enhancers_stark"


In [3]:
print(DATASET, VERSION, BATCH_SIZE, EPOCHS)

drosophila_enhancers_stark 0 64 10


## Data download

In [4]:
from pathlib import Path
import tensorflow as tf
import tensorflow_addons as tfa

import numpy as np
from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.models.tf import vectorize_layer
from genomic_benchmarks.models.tf import get_basic_cnn_model_v0 as get_model

if not is_downloaded(DATASET):
    download_dataset(DATASET)

2022-06-29 15:39:28.461432: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-29 15:39:28.461456: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
 The versions of TensorFlow you are currently using is 2.8.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
2022-06-29 15:39:29.916467: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot ope

In [5]:
info(DATASET)



Dataset `drosophila_enhancers_stark` has 2 classes: negative, positive.

The length of genomic intervals ranges from 236 to 3237, with average 2118.1238067688746 and median 2142.0.

Totally 6914 sequences have been found, 5184 for training and 1730 for testing.


| class    | train | test |
|----------|-------|------|
| negative | 2592  | 865  |
| positive | 2592  | 865  |


## TF Dataset object

In [6]:
SEQ_PATH = Path.home() / '.genomic_benchmarks' / DATASET
# Sort the class directories so the label order is deterministic across runs.
CLASSES = sorted(x.name for x in (SEQ_PATH/'train').iterdir() if x.is_dir())
NUM_CLASSES = len(CLASSES)

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

Found 5184 files belonging to 2 classes.
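
`text_dataset_from_directory` expects one sub-directory per class, each holding one text file per sample. The sketch below counts the files per class with the standard library only, as a quick sanity check of that layout. The helper name `count_sequences` is illustrative and not part of `genomic_benchmarks`.

```python
from pathlib import Path


def count_sequences(split_dir):
    """Map each class sub-directory name to the number of files it contains."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }
```

For this dataset, `count_sequences(SEQ_PATH / 'train')` should agree with the per-class counts reported by `info(DATASET)`.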


In [7]:
if NUM_CLASSES > 2:
    train_dset = train_dset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))
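
The cell above converts integer labels to one-hot vectors only when there are more than two classes, so it is a no-op for this two-class dataset. As a rough illustration of what the conversion does to a batch of labels, here is a plain-Python equivalent. It is a sketch for intuition, not the TensorFlow implementation.

```python
def one_hot(labels, depth):
    """One-hot encode integer labels; out-of-range labels give an all-zero row."""
    return [[1.0 if i == label else 0.0 for i in range(depth)]
            for label in labels]


# Three labels from a hypothetical 3-class problem.
encoded = one_hot([0, 2, 1], depth=3)
```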

## Text vectorization

In [8]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
VOCAB_SIZE = len(vectorize_layer.get_vocabulary())
vectorize_layer.get_vocabulary()

['', '[UNK]', 't', 'a', 'c', 'g']

In [9]:
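def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    # Shift token ids by 2 so that t, a, c, g map to 0..3;
    # padding ('') becomes -2 and '[UNK]' becomes -1.
    return vectorize_layer(text) - 2, label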

train_ds = train_dset.map(vectorize_text)
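
To make the effect of subtracting 2 concrete, the sketch below mimics the lookup in plain Python using the vocabulary printed above. It assumes that `vectorize_layer` lowercases its input and splits it into single characters; the function `encode` is illustrative only.

```python
VOCAB = ['', '[UNK]', 't', 'a', 'c', 'g']  # vocabulary printed by the cell above


def encode(sequence, vocab=VOCAB, shift=2):
    """Character-level vocabulary lookup followed by the same -2 shift."""
    index = {token: i for i, token in enumerate(vocab)}
    unk = index['[UNK]']
    # ASSUMPTION: input is lowercased and split per character.
    return [index.get(ch, unk) - shift for ch in sequence.lower()]


ids = encode("TACGN")
```

Under these assumptions the four nucleotides map to 0..3, any other character maps to -1, and padding positions (not produced by this helper) would map to -2.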

## Model training

In [10]:
model = get_model(NUM_CLASSES, VOCAB_SIZE)

In [11]:
history = model.fit(
    train_ds,
    epochs=EPOCHS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Evaluation on the test set

In [12]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

if NUM_CLASSES > 2:
    test_dset = test_dset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))
test_ds = test_dset.map(vectorize_text)

Found 1730 files belonging to 2 classes.


In [13]:
model.evaluate(test_ds)



[0.9589601159095764, 0.5236994028091431, 0.6909236311912537]
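
One way to put these numbers in context is to compare them with a majority-class baseline computed from the test-set class counts reported by `info(DATASET)`. The sketch below assumes that the second value returned by `model.evaluate` is accuracy; `model.metrics_names` gives the actual order and should be checked before relying on it.

```python
# Values printed by model.evaluate(test_ds) above.
results = [0.9589601159095764, 0.5236994028091431, 0.6909236311912537]

# Test-set class counts from the info(DATASET) table.
test_counts = {"negative": 865, "positive": 865}


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


baseline = majority_baseline(test_counts)
# ASSUMPTION: results[1] is accuracy; confirm with model.metrics_names.
margin = results[1] - baseline
print(f"baseline={baseline:.3f}, margin over baseline={margin:.3f}")
```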