# DNABERT

## Dependencies

First, bootstrap the notebook so that local imports resolve correctly.

In [18]:
import bootstrap

Installed dependencies

In [19]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import os
import shelve
import time
import tf_utils as tfu
import wandb

Local dependencies

In [20]:
from common.data import find_shelves, DnaSequenceGenerator
from common.models.dnabert import DnaBertBase, DnaBertPretrainModel, create_dnabert_pretrain_model

---
## Strategy

In [21]:
strategy = tfu.strategy.gpu(0)

## Wandb API

In [22]:
api = wandb.Api()

Here, we connect to the most recent DNABERT pretraining run on W&B, which is the run we want to analyze.

In [23]:
run = api.runs(
    path="sirdavidludwig/deep-learning-dna",
    filters={"group": {"$regex": "dnabert:pretrain"}})[0]
run.name

'dnabert-1651388268'
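The numeric suffix of the run name looks like a Unix timestamp, which gives a quick sanity check that index 0 really is the most recent run. The sketch below assumes that naming convention (the notebook does not confirm it), and `run_name_to_datetime` is a hypothetical helper, not part of `wandb` or this repository.

```python
from datetime import datetime, timezone


def run_name_to_datetime(name: str) -> datetime:
    """Interpret the trailing integer of a run name as Unix seconds (assumption)."""
    suffix = name.rsplit("-", 1)[-1]
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


# Run name taken from the cell output above.
started = run_name_to_datetime("dnabert-1651388268")
print(started.isoformat())
```

If the assumption holds, applying the same helper to the other runs returned by the query and comparing the results would show whether the first element is the newest one. The order of the returned list depends on the `order` argument of `api.runs`, so it is worth checking rather than assuming.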

## Dataset

Next, we fetch the dataset artifact used by the run, starting with a temporary directory to hold the download.

In [24]:
temp_path = "./tmp"
os.makedirs(temp_path, exist_ok=True)

### Fetch from Artifact

In [25]:
dataset_artifact = run.used_artifacts()[0]
dataset_artifact.name

'dnasamples:v5'

In [26]:
dataset_dir = dataset_artifact.download(temp_path)
dataset_dir

[34m[1mwandb[0m: Downloading large artifact dnasamples:v5, 328.84MB. 63 files... Done. 0:0:0


'./tmp'

### Data Generator

In [27]:
sample_files = find_shelves(os.path.join(dataset_dir, "test"), prepend_path=True)
sample_files

['./tmp/test/fall_2016-10-07',
 './tmp/test/fall_2017-10-13',
 './tmp/test/spring_2016-04-22',
 './tmp/test/spring_2017-05-02',
 './tmp/test/spring_2018-04-23',
 './tmp/test/spring_2019-05-14',
 './tmp/test/spring_2020-05-11']

In [28]:
dataset = DnaSequenceGenerator(
    sample_files,
    length=run.config["length"],
    kmer=run.config["kmer"],
    batch_size=run.config["batch_size"],
    batches_per_epoch=run.config["val_batches_per_epoch"],
    augment=run.config["data_augment"],
    balance=run.config["data_balance"]
)

In [29]:
dataset[0]

array([[76,  7, 37, ..., 32, 35, 52],
       [30, 27, 11, ..., 10, 52, 12],
       [67, 86, 56, ..., 62, 60, 52],
       ...,
       [67, 86, 56, ..., 63, 65, 77],
       [27, 12, 61, ..., 35, 50,  2],
       [86, 56, 30, ..., 55, 27, 10]], dtype=int32)

In [13]:
dataset[0].shape

(512, 148)
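The second dimension of 148 is consistent with overlapping k-mer tokenization, where a sequence of length `L` yields `L - k + 1` tokens. The sketch below is illustrative only: it assumes `length=150`, `kmer=3`, and a five-letter alphabet (`ACGTN`), none of which are shown in the notebook, and its id assignment is not necessarily the one `DnaSequenceGenerator` uses.

```python
import itertools

import numpy as np

ALPHABET = "ACGTN"  # assumed alphabet; 5**3 = 125 possible 3-mers


def kmer_encode(sequence: str, k: int) -> np.ndarray:
    """Map each overlapping k-mer to an integer id (illustrative encoding)."""
    vocab = {
        "".join(letters): index
        for index, letters in enumerate(itertools.product(ALPHABET, repeat=k))
    }
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return np.array([vocab[kmer] for kmer in kmers], dtype=np.int32)


# Random 150-base sequence standing in for a real sample.
rng = np.random.default_rng(0)
sequence = "".join(rng.choice(list("ACGT"), size=150))

tokens = kmer_encode(sequence, k=3)
print(tokens.shape)  # (148,)
```

Under these assumptions, each row of a batch holds 148 token ids drawn from a vocabulary of 125, which would match both the batch shape above and the 125-unit output layer in the model summary.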

## Model

In [14]:
model_path = run.file("model.h5").download(temp_path, replace=True)

In [15]:
model = DnaBertPretrainModel.load(os.path.join(temp_path, "model.h5"))



In [16]:
model.summary()

Model: "DNABERT_pretrain"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 148)]             0         
_________________________________________________________________
dna_bert_base (DnaBertBase)  (None, 149, 128)          2547584   
_________________________________________________________________
lambda_1 (Lambda)            (None, 148, 128)          0         
_________________________________________________________________
dense_16 (Dense)             (None, 148, 125)          16125     
Total params: 2,563,709
Trainable params: 2,563,709
Non-trainable params: 0
_________________________________________________________________
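The parameter counts in the summary can be checked by hand. A dense layer that maps 128 features to 125 outputs has one weight per input-output pair plus one bias per output. The short sketch below reproduces the reported numbers; the variable names are illustrative and the values are read directly from the summary.

```python
embed_dim = 128          # last dimension of the dna_bert_base output
vocab_size = 125         # last dimension of the dense_16 output
base_params = 2_547_584  # reported for dna_bert_base

# Dense layer: weight matrix (embed_dim x vocab_size) plus one bias per output.
dense_params = embed_dim * vocab_size + vocab_size
total_params = base_params + dense_params

print(dense_params, total_params)  # 16125 2563709
```

The Lambda layer contributes no parameters. Its output has one fewer position than its input (149 to 148), which would be consistent with dropping a single extra token, such as a prepended class token, before prediction; that reading is an assumption, since the layer's function is not shown here.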
