DNA Foundation model for practice after learning the LLM NanoGPT / GPT2 from Dr. Andrej Karpathy and Mistral DNA. 

**References**

<li> https://github.com/shreydan/makemore-series  </li>
<li> https://github.com/karpathy/nanoGPT  </li>
<li> https://github.com/raphaelmourad/Mistral-DNA  </li>



In [1]:
# conda activate torch_gpu
import os
import pandas as pd
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F
from transformers import GPT2LMHeadModel
from datasets import load_dataset

  from .autonotebook import tqdm as notebook_tqdm


## Tokenization
To pretrain a model, we will use a file containig 100,000 non-overlapping DNA sequences of 200 bases corresponding to around 1% of the human genome (hg38 assembly). Mistral DNA works on Causal Language modelling (CLM) just like GPT not like BERT that works on MLM (Masked language model). In CLM, the token is predicted from the previous tokens. While in MLM, the token is masked irrespective of the position and all the tokens.

Tokenization is performed based on the Byte Pair Encoding (BPE). 

In [2]:
# First load the dataset
# Got the dataset from https://github.com/raphaelmourad/LLM-for-genomics-training
# dataset_text = load_dataset("csv", data_files="/mnt/data/projects/.immune/Personal/DNA-Language-Model/DNA_FM/data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")
savedir = "/mnt/data/projects/.immune/Personal/DNA-Language-Model/Mistral_DNA/"
os.chdir(savedir)
import pandas as pd
DNA_text = pd.read_csv(os.path.join(savedir,"data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz"))

In [3]:
DNA_text[0:5]

Unnamed: 0,text
0,TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC...
1,CCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCC...
2,TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAA...
3,GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGC...
4,CACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAG...


### Performing Tokenization

In [6]:
[ord(x)for x in DNA_text['text'][0]][0:5] ## These are basically the ASCII character

[84, 65, 65, 67, 67]

In [7]:
DNA_joined = "".join(DNA_text['text'].tolist())
DNA_joined[0:5]

'TAACC'

In [8]:
tokens = DNA_joined.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(DNA_joined[0:10])
print("length:", len(DNA_joined))
print('---')
print(tokens[0:10])
print("length:", len(tokens))

---
TAACCCTAAC
length: 19999800
---
[84, 65, 65, 67, 67, 67, 84, 65, 65, 67]
length: 19999800


In [9]:
print(max(tokens))
set(tokens) # ATGC

84


{65, 67, 71, 84}

In [10]:
### Combining the two character into one
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]): # Pythonic way to iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
# print(stats)
# print(sorted(((v,k) for k,v in stats.items()), reverse=True))

In [11]:
top_pair = max(stats, key=stats.get)
top_pair

(67, 65)

In [13]:
def merge(ids, pair, idx):
    new_text = []
    i = 0
    while i < len(ids):
        if i < len(ids) -1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            new_text.append(idx)
            i+=2
        else:
            new_text.append(ids[i])
            i+=1
    return new_text

# print(merge([5,5,6,6,7,8,6,7,9],[6,7],99))

# tokens2 = merge(tokens, top_pair, 256)
# print(tokens2[0:10])
# print("length:", len(tokens2))

In [None]:
vocab_size = 4096 # the desired final vocabulary size
num_merges = vocab_size - 4 # since it is only A,T,G,C
ids = list(tokens) # copy so we don't destroy the original list

merges = {}
for i in range(num_merges):
    print(i)
    stats = get_stats(ids)
    pair = max(stats, key = stats.get)
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx


In [12]:
print(len(ids), len(set(ids)))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")

3721774 4093
compression ratio: 5.37X


In [10]:
def encode(text):
    token = text.encode("utf-8")
    token = list(map(int, token))
    return token

In [None]:
text = "ATGGCCTTAACCCCCCTCTGCGAATTACCATTGGGAGTTTCACCC"
token_encoded = encode(text)
print(len(token_encoded), len(text))
print(token_encoded, "\n",text)

In [None]:
## Now UTF-8 is not 256 it is now 4096. 
def encode(text):
  # given a string, return list of integers (the tokens)
  tokens = list(text.encode("utf-8"))
  while len(tokens) >= 2:
    stats = get_stats(tokens)
    pair = min(stats, key=lambda p: merges.get(p, float("inf")))
    if pair not in merges:
      break # nothing else can be merged
    idx = merges[pair]
    tokens = merge(tokens, pair, idx)
  return tokens

token_encoded = encode(text)
print(len(token_encoded), len(text))
print(token_encoded, "\n",text)

12 45
[308, 309, 259, 1493, 743, 337, 481, 293, 497, 256, 260, 67] 
 ATGGCCTTAACCCCCCTCTGCGAATTACCATTGGGAGTTTCACCC


In [16]:
len(token_encoded)

12

In [None]:
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]
value_voc =list(vocab.values())
print(value_voc[740:744], value_voc[100:104])

[b'AACAAA', b'TGGGGT', b'CAGGGT', b'CTGCG'] [b'd', b'e', b'f', b'g']


In [None]:
vocab

In [20]:
for i in range(len(token_encoded)):
    print(vocab[token_encoded[i]])

b'ATGG'
b'CCTT'
b'AA'
b'CCCCCCT'
b'CTGCG'
b'AATT'
b'ACCAT'
b'TGGG'
b'AGTTT'
b'CA'
b'CC'
b'C'


In [14]:
def decode(token):
    token_join = b''.join(vocab[idx] for idx in token)
    # print(token_join)
    text = token_join.decode("utf-8", errors = "replace") ## output not valid so we replace
    return text

In [22]:
text = decode(token_encoded)
print(text)

ATGGCCTTAACCCCCCTCTGCGAATTACCATTGGGAGTTTCACCC


In [23]:
print(decode(encode("ATGGCCTTAACC")))
text2 = decode(encode(text))
print(text2 == text)

ATGGCCTTAACC
True


In [25]:
print(b''.join(vocab[idx] for idx in token_encoded))

b'ATGGCCTTAACCCCCCTCTGCGAATTACCATTGGGAGTTTCACCC'


In [45]:
pd.DataFrame(ids).to_csv("/mnt/data/projects/.immune/Personal/DNA-Language-Model/Mistral_DNA/data/genome_sequences/hg38/ids_encoded_2.csv", index = False, header = False)

In [52]:
import pickle
with open("/mnt/data/projects/.immune/Personal/DNA-Language-Model/Mistral_DNA/data/genome_sequences/hg38/vocab.pkl", "wb") as f:
    pickle.dump(my_dict, f)

In [4]:
ids = pd.read_csv(os.path.join(savedir, "data/genome_sequences/hg38/ids_encoded_2.csv"), header = None)

In [5]:
ids = ids[0].tolist()

In [6]:
len(set(ids))

4093

In [17]:
import pickle
with open(os.path.join("data/genome_sequences/hg38/vocab.pkl"), "rb") as f:
    vocab = pickle.load(f)

In [18]:
vocab

{'A': 1, 'B': [1, 2, 3], 'C': {'x': 10}}

In [42]:
sequences = [
    "ACGTACGTACGT",
    "CGTACGTACGTA",
    "ATATATATAT"
]

tokenizer = DNABert2LikeTokenizer(max_len=6)
tokenizer.train(sequences)

encoded = tokenizer.encode("ACGTACGT")
print(encoded)
print(tokenizer.decode(encoded))


[2, 31, 12, 3]
['[CLS]', 'ACGTAC', 'GT', '[SEP]']


## DNA GPT Model Architecture

In [6]:
ids_t = torch.tensor(ids)

In [7]:
# I have tried a configuration of 2**n makes it more efficients
@dataclass
class DNAGPTconfig:
    block_size: int = 1024 ## it is the token size
    n_layer: int = 16
    embd_size: int = 512
    n_head: int = 16
    vocab_size: int = (ids_t.max() + 1)

In [8]:
import torch
import torch.nn as nn
from torch.nn import functional as f

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.embd_size, 3 * config.embd_size) # 3 dimension as it is divided into q,k,v
        self.c_proj = nn.Linear(config.embd_size, config.embd_size)
        self.n_head = config.n_head
        self.embd_size = config.embd_size
        # self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.embd_size, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1,2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1,2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1,2)
        # wei = q @ k.transpose(-2,-1)
        # wei = wei * C**-0.5
        # wei = F.softmax(wei, dim = -1) 
        # wei = self.dropout(wei)
        # wei = wei @ v
        # Instead of running all of them, we can use flash attention at once
        wei = F.scaled_dot_product_attention(q,k,v, is_causal = True) # flash attention
        # combine all of them
        wei = wei.transpose(1,2).contiguous().view(B,T,C)
        wei = self.c_proj(wei)
        return wei


In [9]:
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.Linear(config.embd_size, 4 * config.embd_size)
        self.nln = nn.GELU(approximate = "tanh")
        self.ln2 = nn.Linear(4 * config.embd_size, config.embd_size)
    
    def forward(self, x):
        x = self.ln1(x)
        x = self.nln(x)
        x = self.ln2(x)
        return x


In [10]:
class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_f1 = nn.LayerNorm(config.embd_size)
        self.self_attn = CausalSelfAttention(config)
        self.ln_f2 = nn.LayerNorm(config.embd_size)
        self.mlp = MLP(config)
    
    def forward(self, x):
        x = x + self.self_attn(self.ln_f1(x))
        x = x + self.mlp(self.ln_f2(x))
        return x


In [11]:
class DNAGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformers = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.embd_size),
            wpe = nn.Embedding(config.block_size, config.embd_size),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_norm = nn.LayerNorm(config.embd_size)
            )
        )
        self.lm_head = nn.Linear(config.embd_size, config.vocab_size, bias = False)
    
    def forward(self, idx, targets = None):
        B,T = idx.shape
        tok = self.transformers.wte(idx)
        pos = self.transformers.wpe(torch.arange(T, dtype = torch.long, device = idx.device))
        x = tok + pos ## require both the position and token
        for block in self.transformers.h:
            x=block(x) ## Since it will go through all the layers of the transformers
        x=self.transformers.ln_norm(x)
        logits=self.lm_head(x)
        loss = None

        if targets is not None:
            B, T, C = logits.shape
            logits = logits.view(B*T,C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # neg log likelihood

        return logits,loss


In [12]:
DNAGPT(DNAGPTconfig())

DNAGPT(
  (transformers): ModuleDict(
    (wte): Embedding(4348, 512)
    (wpe): Embedding(1024, 512)
    (h): ModuleList(
      (0-15): 16 x Block(
        (ln_f1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (self_attn): CausalSelfAttention(
          (c_attn): Linear(in_features=512, out_features=1536, bias=True)
          (c_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (ln_f2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (ln1): Linear(in_features=512, out_features=2048, bias=True)
          (nln): GELU(approximate='tanh')
          (ln2): Linear(in_features=2048, out_features=512, bias=True)
        )
      )
    )
    (ln_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=4348, bias=False)
)

In [13]:
## Now for training you surely need to have GPU or cuda
# device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends,"mps") and torch.backends.mps.is_available():
    device = "mps"

print("device name",device)

device name cuda


In [14]:
## This will pass all the weights and bias from gpt2 to our model
model = DNAGPT(DNAGPTconfig())
model.to(device)

DNAGPT(
  (transformers): ModuleDict(
    (wte): Embedding(4348, 512)
    (wpe): Embedding(1024, 512)
    (h): ModuleList(
      (0-15): 16 x Block(
        (ln_f1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (self_attn): CausalSelfAttention(
          (c_attn): Linear(in_features=512, out_features=1536, bias=True)
          (c_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (ln_f2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (ln1): Linear(in_features=512, out_features=2048, bias=True)
          (nln): GELU(approximate='tanh')
          (ln2): Linear(in_features=2048, out_features=512, bias=True)
        )
      )
    )
    (ln_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=4348, bias=False)
)

In [110]:
split = int(0.9 * len(ids))
train_ids = ids[:split]
val_ids   = ids[split:]

In [111]:
print(len(train_ids), len(val_ids))

3349596 372178


In [15]:
class DataLoaderLite:
    def __init__(self, B, T):
        self.B = B
        self.T = T
        # state
        self.current_position = 0

    def nextbatch(self, ids):
        B = self.B
        T = self.T
        ids = torch.tensor(ids)
        buf = ids[self.current_position : self.current_position+B*T+1]
        x = buf[:-1].view(B,T)
        y = buf[1:].view(B,T)
        self.current_position += B * T + 1
        # if loading the last batch is greater than the lenght
        if (self.current_position + (B * T + 1) > len(ids)):
            self.current_position = 0
        x=x.to(device) ## putting it on GPU
        y=y.to(device)
        return x,y


In [113]:
device = "cuda"
d = DataLoaderLite(6, 4)

In [114]:
model.to(device)

DNAGPT(
  (transformers): ModuleDict(
    (wte): Embedding(4348, 512)
    (wpe): Embedding(1024, 512)
    (h): ModuleList(
      (0-15): 16 x Block(
        (ln_f1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (self_attn): CausalSelfAttention(
          (c_attn): Linear(in_features=512, out_features=1536, bias=True)
          (c_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (ln_f2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (ln1): Linear(in_features=512, out_features=2048, bias=True)
          (nln): GELU(approximate='tanh')
          (ln2): Linear(in_features=2048, out_features=512, bias=True)
        )
      )
    )
    (ln_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=4348, bias=False)
)

In [115]:
# Create a torch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr = 3e-4)

In [133]:
model.train() # In Training mode
for _ in range(1000):
    x, y = d.nextbatch(ids[:192 + 1])
    logits,loss = model(x,y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(loss.item())

0.0792996808886528


In [134]:
@torch.no_grad() ## So it should not update any weights

def val_step():
    model.eval()
    losses=[]
    for _ in range(100):
        x, y = d.nextbatch(val_ids[:192 + 1])
        logits, lossing = model(x, targets = None)
        B, T, C = logits.shape
        logits = logits.view(B*T,C)
        targets = y.view(B*T)
        loss = F.cross_entropy(logits, targets) # neg log likelihood
        losses.append(loss.item())
    return sum(losses) / len(losses)


In [135]:
print(val_step())

13.109191064834596


In [136]:
1000 % 100

0

In [None]:
@torch.no_grad() ## So it should not update any weights

def val_step2():
    model.eval()
    losses=[]
    for _ in range(10):
        x, y = d.nextbatch(val_ids[:192 + 1])
        logits, loss = model(x, targets = None)
        # B, T, C = logits.shape
        # logits = logits.view(B*T,C)
        # targets = y.view(B*T)
        # loss = F.cross_entropy(logits, targets) # neg log likelihood
        # losses.append(loss.item())
    return logits


In [91]:
logistic, loss = val_step2()
print(logistic.shape)

torch.Size([24, 8, 4348])


In [87]:
logistic.shape

torch.Size([24, 8, 4348])

In [77]:
logits.shape

torch.Size([192, 4348])

In [16]:
### testing to see if this work
buf = torch.tensor(ids[:192 + 1]) # one for target
x = buf[:-1].view(24,8)
y = buf[1:].view(24,8)
print(x,"\n",y)
x=x.to(device)
y=y.to(device)

tensor([[  84, 1331, 1331, 1331, 1331, 1331, 1331, 1331],
        [1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331],
        [1331, 1331, 1083,  296,  362,  314, 1331, 1331],
        [1331, 1331, 2846, 1331, 1331, 1331, 1331,  681],
        [1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331],
        [2846, 1331, 1331,  682,  362,  296,  362,  314],
        [1331, 1331, 1331, 1976,  830,  830,  830,  830],
        [ 830,  296,  362,  368, 1331, 1331, 1331,  296],
        [ 362,  314, 1331, 1331, 1331, 1331, 2846, 2846],
        [1331, 1331, 1331, 1331, 1331, 1331, 2846, 1331],
        [1331, 1331, 1331,  271,  496, 3393,  352,  349],
        [ 349, 2293,  264,  405, 1803, 1949, 4094,  881],
        [ 325,  292,  355, 3649, 1105, 3220,  642,  314],
        [ 824,  972, 1404, 3070, 1803,  271,  621,  292],
        [ 383, 3522,  539,  834, 3093, 1211, 3828,  539],
        [ 834, 3093, 1211, 3828,  539,  834, 3093, 1211],
        [3828,  539,  834, 3093, 1211, 3828,  539,  834],
        [3093,

In [17]:
## This will pass all the weights and bias from gpt2 to our model
model = DNAGPT(DNAGPTconfig())
model.to(device)

DNAGPT(
  (transformers): ModuleDict(
    (wte): Embedding(4348, 512)
    (wpe): Embedding(1024, 512)
    (h): ModuleList(
      (0-15): 16 x Block(
        (ln_f1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (self_attn): CausalSelfAttention(
          (c_attn): Linear(in_features=512, out_features=1536, bias=True)
          (c_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (ln_f2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (ln1): Linear(in_features=512, out_features=2048, bias=True)
          (nln): GELU(approximate='tanh')
          (ln2): Linear(in_features=2048, out_features=512, bias=True)
        )
      )
    )
    (ln_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=4348, bias=False)
)

In [18]:
logits,loss = model(x,y)

In [19]:
loss

tensor(8.4052, device='cuda:0', grad_fn=<NllLossBackward0>)

In [16]:
# Create a torch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr = 3e-4)

In [21]:
for steps in range(100):
    logits, loss = model(x,y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(loss.item())

### We have overfit the model since we are using a single batch. So it learns everything
## We will updae it in the batches using DataLoader that changes the batches randomly for it to generalize well

0.0823088064789772


In [22]:
len(ids)

3721774

In [18]:
class DataLoaderLite:
    def __init__(self, B, T):
        self.B = B
        self.T = T
        # state
        self.current_position = 0

    def nextbatch(self, ids):
        B = self.B
        T = self.T
        ids = torch.tensor(ids)
        buf = ids[self.current_position : self.current_position+B*T+1]
        x = buf[:-1].view(B,T)
        y = buf[1:].view(B,T)
        self.current_position += B * T + 1
        # if loading the last batch is greater than the lenght
        if (self.current_position + (B * T + 1) > len(ids)):
            self.current_position = 0
        x=x.to(device) ## putting it on GPU
        y=y.to(device)
        return x,y


In [35]:
split = int(0.9 * len(ids))
train_ids = ids[:split]
val_ids   = ids[split:]

In [17]:
d = DataLoaderLite(16,DNAGPTconfig.block_size)
# val_loader = DataLoaderLite(16,DNAGPTconfig.block_size)
# x,y = d.nextbatch()
# print(x.shape, y.shape)

In [20]:
## This will pass all the weights and bias from gpt2 to our model
model = DNAGPT(DNAGPTconfig())
model.to(device)

DNAGPT(
  (transformers): ModuleDict(
    (wte): Embedding(4348, 512)
    (wpe): Embedding(1024, 512)
    (h): ModuleList(
      (0-15): 16 x Block(
        (ln_f1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (self_attn): CausalSelfAttention(
          (c_attn): Linear(in_features=512, out_features=1536, bias=True)
          (c_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (ln_f2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (ln1): Linear(in_features=512, out_features=2048, bias=True)
          (nln): GELU(approximate='tanh')
          (ln2): Linear(in_features=2048, out_features=512, bias=True)
        )
      )
    )
    (ln_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=4348, bias=False)
)

In [20]:
for _ in range(100):
    x,y = d.nextbatch(ids)
    logits,loss = model(x,y)
    optimizer.zero_grad(set_to_none = True)
    loss.backward()
    optimizer.step()
    print(loss.item())

7.4897894859313965
7.39863395690918
7.3050737380981445
7.250854969024658
7.307278633117676
7.214249610900879
7.238442897796631
7.284749507904053
7.297162055969238
7.394287586212158
7.199598789215088
7.128458499908447
7.172311782836914
7.1416239738464355
7.016969203948975
7.126737594604492
6.961976528167725
7.029718399047852
7.576591491699219
7.290907859802246
7.022881507873535
7.155508995056152
7.075373649597168
7.07481050491333
7.047640800476074
7.051405429840088
7.029366970062256
6.996538162231445
6.9593024253845215
7.092647552490234
7.07595157623291
7.297584533691406
7.218262195587158
7.260923862457275
7.181478023529053
7.200864315032959
7.255090713500977
7.162593364715576
7.179099082946777
7.173696517944336
7.153949737548828
7.111686706542969
7.13004207611084
7.139341354370117
7.132724761962891
7.135664463043213
7.141758441925049
7.115208148956299
7.132131099700928
7.122882843017578
7.128759384155273
7.106016635894775
7.123307228088379
7.1655802726745605
7.15963077545166
7.15639209

In [19]:
# training for more steps upto 100 
for _ in range(100):
    x,y = train_loader.nextbatch(ids)
    logits,loss = model(x,y)
    optimizer.zero_grad(set_to_none = True)
    loss.backward()
    optimizer.step()
    print(loss.item())

NameError: name 'train_loader' is not defined

# Mistral DNA

<p><strong>Generative Artificial Intelligence</strong> (AI) represents a cutting-edge domain within machine learning, focused on creating new, synthetic yet realistic data. This includes generating text, images, music, and even biological sequences. At the heart of many generative AI applications are <strong>Large Language Models</strong> (LLMs), which have revolutionized natural language processing and beyond.</p>
<p>LLMs are <strong>sophisticated neural networks</strong> trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on <strong>Transformers</strong>, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery.</p>
<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-transformers"><button class="gtn-boxify-button details" type="button" aria-controls="details-transformers" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details:  Transformers </span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>Transformers are a type of neural network model designed to handle sequential data, such as text, by using self-attention mechanisms to weigh the importance of input elements relative to each other, enabling the model to understand and generate coherent and contextually relevant outputs.</p>
</blockquote>
<p>In this tutorial, we will explore the intersection of generative AI and genomics by <strong>pretraining an LLM from scratch on DNA sequences</strong>. This process will equip the model with a foundational understanding of the ‚Äúgrammar‚Äù of DNA, enabling it to generate and analyze genetic data with remarkable accuracy.</p>
<p><a href="https://mistral.ai/">Mistral AI</a>, French artificial intelligence (AI) startup, recently launched large language models (LLMs) showing performances superior to Llama2. In particular, Mixtral-8x7B implements:</p>
<ul>
<li><strong>Grouped-Query Attention</strong>: Efficiently computes attention by grouping queries, reducing computational load and memory usage.</li>
<li><strong>Sliding-Window Attention</strong>: Focuses on a fixed-size window of tokens, sliding over the sequence to manage long texts efficiently.</li>
<li><strong>Byte-fallback BPE Tokenizer</strong>: Tokenizes text into subword units, falling back to byte-level tokenization for unknown words, ensuring robust handling of diverse text inputs.</li>
</ul>
<p>These techniques collectively enhance the performance and efficiency of large language models, enabling them to process and generate text more effectively.</p>
<p>In this tutorial, we will use a simplified Mistral model architecture with fewer layers and hidden units to reduce computational requirements. The model will be trained to predict the next base in the sequence. For instance, for a sequence like <code style="color: inherit">ATTTGTTGGT</code>, the model will be trained to predict the suffix <code style="color: inherit">TTGGT</code> given the prefix <code style="color: inherit">ATTTG</code>. This process is called <strong>causal language modeling</strong>.</p>
<p>To pretrain the model, we will use a file containing 100,000 non-overlapping DNA sequences of 200 bases, corresponding to around 1% of the human genome (hg38 assembly). This involves training the model to predict the end of a DNA sequence.</p>
<p>By the end of this tutorial, we will obtain a Mistral-DNA model with an internal representation of DNA sequence grammar. This pretrained model can then be used for various applications, such as fine-tuning for classification tasks or predicting mutational effects.</p>

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What are the required dependencies doing?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>üëÅ View solution</summary>
<div class="box-title solution-title" id="solution"><button class="gtn-boxify-button solution" type="button" aria-controls="solution" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li>
<p><code style="color: inherit">accelerate</code>: A library by <a href="https://huggingface.co/">Hugging Face</a> ‚Äì a platform that provides tools and resources for building, training, and deploying machine learning models ‚Äì designed to simplify the process of training and deploying machine learning models across different hardware environments. It provides tools to optimize performance on GPUs, TPUs, and other accelerators, making it easier to scale models efficiently.</p>
</li>
<li>
<p><code style="color: inherit">datasets</code>: A library by Hugging Face for managing and processing datasets. It provides tools to load, manipulate, and share datasets in a standardized format, making it easier to work with machine learning data.</p>
</li>
<li>
<p><code style="color: inherit">numpy</code>: A fundamental package for scientific computing in Python.</p>
</li>
<li>
<p><code style="color: inherit">torch</code>: Also known as PyTorch, it is an open-source machine learning library developed by Facebook‚Äôs AI Research lab. It provides a flexible platform for building and training neural networks, with a focus on tensor computations and automatic differentiation.</p>
</li>
<li>
<p><code style="color: inherit">transformers</code>: A library by Hugging Face that provides implementations of state-of-the-art transformer models for natural language processing (NLP). It includes pre-trained models and tools for fine-tuning, making it easier to apply transformers to various NLP tasks.</p>
</li>
<li>
<p><code style="color: inherit">flash-attn</code>: Implementation of FlashAttention, a Fast and Memory-Efficient Exact Attention with IO-Awareness
These libraries are widely used in the machine learning and data science communities for their efficiency, flexibility, and extensive functionality.</p>
</li>
</ul>
</details>
</blockquote>


<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-loaded-functions-and-classes-from-datasets-and-transformers-libraries"><button class="gtn-boxify-button details" type="button" aria-controls="details-loaded-functions-and-classes-from-datasets-and-transformers-libraries" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details: Loaded functions and classes from datasets and transformers libraries</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li><code style="color: inherit">datasets</code>:
<ul>
<li><code style="color: inherit">load_dataset</code>: function to load datasets from the Hugging Face Hub or local files.</li>
</ul>
</li>
<li><code style="color: inherit">transformers</code>:
<ul>
<li><code style="color: inherit">AutoConfig</code>: Automatically loads the configuration for a pre-trained model. It defines the architecture and hyperparameters of the model.</li>
<li><code style="color: inherit">AutoModelForCausalLM</code>: Loads a pre-trained causal language model for tasks like text generation, where the model predicts the next token in a sequence.</li>
<li><code style="color: inherit">AutoTokenizer</code>: Loads the tokenizer associated with a pre-trained model. It converts text into tokens that the model can process.</li>
<li><code style="color: inherit">DataCollatorForLanguageModeling</code>: A data collator specifically designed for language modeling tasks. It prepares batches of data for training by handling padding and masking.</li>
<li><code style="color: inherit">EarlyStoppingCallback</code>: A callback used during training to stop the process early if the model‚Äôs performance on the validation set stops &gt; improving, saving time and resources.</li>
<li><code style="color: inherit">Trainer</code>: A high-level API for training and evaluating transformer &gt; models. It simplifies the training loop and handles tasks like gradient accumulation and evaluation.</li>
<li><code style="color: inherit">TrainingArguments</code>: A class to define the training configuration, including hyperparameters like learning rate, batch size, and number &gt; of epochs. It is used to configure the <code style="color: inherit">Trainer</code>.</li>
</ul>
</li>
</ul>
<p>These components work together to streamline the process of training and fine-tuning transformer models for various NLP tasks.</p>
</blockquote>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-versions"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Versions</div>
<p>This tutorial has been tested with following versions:</p>
<ul>
<li><code style="color: inherit">accelerate</code> &gt; 0.32.1</li>
<li><code style="color: inherit">flash_attn</code> &gt; 2.6.0.post1 and 2.7.0.post2</li>
<li><code style="color: inherit">transformers</code> &gt; 4.47.1</li>
</ul>
<p>You can check the versions with:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">accelerate.__version__
flash_attn.__version__
transformers.__version__
</code></pre></div>  </div>

In [None]:
!pip install accelerate
!pip install flash-attn

Collecting flash-attn
  Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m8.4/8.4 MB[0m [31m49.6 MB/s[0m  [33m0:00:00[0m eta [36m0:00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Collecting einops (from flash-attn)
  Using cached einops-0.8.1-py3-none-any.whl.metadata (13 kB)
Using cached einops-0.8.1-py3-none-any.whl (64 kB)
Building wheels for collected packages: flash-attn
[33m  DEPRECATION: Building 'flash-attn' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'flash-attn'. Discussion can be found at https://github.com/pypa/pip/issu

In [None]:
import os
import accelerate
# import flash_attn
import torch
import transformers
from datasets import load_dataset
from transformers import (
    AutoConfig, # load the configuration of pre-trained model. architecture and hyperparameter of the model
    AutoModelForCausalLM, # loads the pretrained causal language model for task like text generation
    AutoTokenizer, # load the tokenizer with a pre-trained model. convert the text to tokens
    DataCollatorForLanguageModeling, # designed for language modelling task. prepares batches for training by handling padding and masking
    EarlyStoppingCallback,  # is used to stop the training, if in the validation performance stops improving to save time and resources
    Trainer, # A high level API for training and evaluating the transformers. 
    TrainingArguments, # define the hyperparameter like learning rate, batch size, epoch, weight decay. 
)

: 

# Choose the LLM architecture

Let‚Äôs look at the original archicture of Mixtral-8x7B-v0.1 which is stored in the data/models/Mixtral-8x7B-v0.1 folder Github https://github.com/raphaelmourad/Mistral-DNA/tree/main/data/models/Mixtral-8x7B-v0.1

In [28]:
savedir = "/mnt/data/projects/.immune/Personal/DNA-Language-Model/Mistral_DNA/"
os.chdir(savedir)
config = AutoConfig.from_pretrained("data/models/Mixtral-8x7B-v0.1")

In [20]:
config

MixtralConfig {
  "_name_or_path": "data/models/Mixtral-8x7B-v0.1",
  "architectures": [
    "MixtralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 256,
  "max_position_embeddings": 512,
  "model_type": "mixtral",
  "num_attention_heads": 8,
  "num_experts_per_tok": 1,
  "num_hidden_layers": 8,
  "num_key_value_heads": 8,
  "num_local_experts": 64,
  "output_router_logits": false,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "router_aux_loss_coef": 0.02,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 4096
}

By loading the configuration, we can inspect or modify the model‚Äôs architecture without loading the actual model weights. Let‚Äôs now initialize a causal language model from the loaded configuration object, with a specific attention implementation:


In [27]:
!nvidia-smi

Sat Dec 20 05:42:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   34C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

<p>By loading the configuration, we can inspect or modify the model‚Äôs architecture without loading the actual model weights. Let‚Äôs now initialize a causal language model from the loaded configuration object, with a specific attention implementation:</p>

In [21]:
model = AutoModelForCausalLM.from_config(config, attn_implementation="eager")
# eager specifies the attention implementatin to use. Attention mechanism will be executed 
# eagerly which can be useful for debugging or when working with dynamic computation graphs
# Eager execution runs operations immediatedy as they are called in Python rather than adding 
# them to graph execution

In [22]:
# This is GPT2 model architecture
# DNAGPT(
#   (transformers): ModuleDict(
#     (wte): Embedding(4348, 512)
#     (wpe): Embedding(1024, 512)
#     (h): ModuleList(
#       (0-15): 16 x Block(
#         (ln_f1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
#         (self_attn): CausalSelfAttention(
#           (c_attn): Linear(in_features=512, out_features=1536, bias=True)
#           (c_proj): Linear(in_features=512, out_features=512, bias=True)
#         )
#         (ln_f2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
#         (mlp): MLP(
#           (ln1): Linear(in_features=512, out_features=2048, bias=True)
#           (nln): GELU(approximate='tanh')
#           (ln2): Linear(in_features=2048, out_features=512, bias=True)
#         )
#       )
#     )
#     (ln_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
#   )
#   (lm_head): Linear(in_features=512, out_features=4348, bias=False)
# )

In [23]:
model

MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(4096, 256)
    (layers): ModuleList(
      (0-7): 8 x MixtralDecoderLayer(
        (self_attn): MixtralAttention(
          (q_proj): Linear(in_features=256, out_features=256, bias=False)
          (k_proj): Linear(in_features=256, out_features=256, bias=False)
          (v_proj): Linear(in_features=256, out_features=256, bias=False)
          (o_proj): Linear(in_features=256, out_features=256, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=256, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=256, out_features=256, bias=False)
              (w2): Linear(in_features=256, out_features=256, bias=False)
              (w3): Linear(in_features=256, out_features=256, bias=False)
              (act_fn): SiL

In [26]:
Total_parameters = sum(p.numel() for p in model.parameters()) / 1000 ** 2
print(f"Total Parameter {Total_parameters:.1f} million") 

Total Parameter 105.0 million


### Model Architecture
##### Embedding layer
4096 input i.e. 4**6 4 = [A,T,G,C], 6 mers and 256 dimensions
##### 8 Decoder layer
query, key, value, output, rotary: Position information
<p> This allows the model to weigh the importance of differenttokens in the sequence relative to each other, capturing dependenciesand context. </p>

### MixtralSparseMoeBlock 



## Tokenization

In [29]:
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)



In [31]:
tokenizer

PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [36]:
tokens = tokenizer(
    # "ATGGCCTTAACCCCCCTCTGCGAATTACCATTGGGAGTTTCACCC",
    "ATTGCATTACHHCCGGGCCAAKKKA!!##",
    return_tensors="pt"
)

print(tokens)
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]))


{'input_ids': tensor([[   1, 2061,  754,    6,    0,    0,  443,  156,    0,    0,    0,    5,
            0,    0,    0,    0,    2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'ATT', 'GCATTA', 'C', '[UNK]', '[UNK]', 'CCGG', 'GCCAA', '[UNK]', '[UNK]', '[UNK]', 'A', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[SEP]']


In [37]:
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
# hidden_states = model(inputs)[0] # [1, sequence_length, 768]

In [38]:
tokenizer.convert_ids_to_tokens(inputs[0])

['[CLS]',
 'A',
 'CGTA',
 'GCA',
 'TCGGA',
 'TCTATCTA',
 'TCGACA',
 'CTTGG',
 'TTA',
 'TCGA',
 'TCTA',
 'CGA',
 'GCA',
 'TCTC',
 'GTTA',
 'GC',
 '[SEP]']

In [49]:
tokenizer.padding_side  = "left"

In [50]:
encoding = tokenizer("ATT", padding="longest", return_tensors="pt")
print(encoding)

{'input_ids': tensor([[   1, 2061,    2]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}




PreTrainedTokenizerFast is a fast and efficient tokenizer used to process text data for the DNABERT-2-117M model. 

Here‚Äôs a breakdown of its configuration:<li>name_or_path='zhihan1996/DNABERT-2-117M': Specifies the name or path of the pre-trained tokenizer, indicating that it is associated with the DNABERT-2-117M model, which is designed for processing DNA sequences.</li>
<li>vocab_size=4096 Defines the size of the tokenizer‚Äôs vocabulary. 4**6 (ATGC) * 6 mer = 4096</li>

**Special_tokens:** Defines a set of special tokens used by the tokenizer: 
<li> unk_token: '[UNK]' - Represents unknown or out-of-vocabulary tokens.</li>
<li> sep_token: '[SEP]' - Used to separate segments within a sequence. </li>
<li> pad_token: '[PAD]' - Used for padding sequences to a uniform length. </li>
<li> cls_token: '[CLS]' - Typically used as the first token in a sequence to represent the classification token.</li>
<li> mask_token: '[MASK]' - Used in masked language modeling to hide tokens that the model must predict.</li>

</ul>
</li>
</ul>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-11"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the other configuration parameters mean?</p>
<ol>
<li><code style="color: inherit">model_max_length=1000000000000000019884624838656</code></li>
<li><code style="color: inherit">is_fast=True</code></li>
<li><code style="color: inherit">padding_side='right'</code></li>
<li><code style="color: inherit">truncation_side='right'</code></li>
<li><code style="color: inherit">clean_up_tokenization_spaces=False</code></li>
<li><code style="color: inherit">added_tokens_decoder</code></li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>üëÅ View solution</summary>
<div class="box-title solution-title" id="solution-11"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-11" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>
<p><code style="color: inherit">model_max_length=1000000000000000019884624838656</code>: Represents the maximum length of sequences that the model can handle.</p>
<p>This extremely large value suggests that the model is designed to process very long sequences, although in practice, the actual limit will be constrained by available computational resources.</p>
</li>
<li><code style="color: inherit">is_fast=True</code>: Indicates that this tokenizer is optimized for speed, leveraging Rust-based implementations to accelerate tokenization processes.</li>
<li><code style="color: inherit">padding_side='right'</code>: Configures the tokenizer to pad sequences on the right side, ensuring that all sequences in a batch have the same length by adding padding tokens to the end of shorter sequences.</li>
<li><code style="color: inherit">truncation_side='right'</code>: Specifies that sequences will be truncated from the right side if they exceed the maximum length, preserving the beginning of the sequence.</li>
<li><code style="color: inherit">clean_up_tokenization_spaces=False</code>: Indicates that the tokenizer will not remove spaces after tokenization, preserving the original spacing in the text.</li>
<li><code style="color: inherit">added_tokens_decoder</code>: Maps token IDs to their corresponding <code style="color: inherit">AddedToken</code> objects, which include metadata such as whether the token is a special token and how it should be processed (e.g., stripping whitespace).</li>
</ol>
</blockquote>
</blockquote>
<p>This configuration ensures that the tokenizer is tailored to efficiently process DNA sequences, handling both the tokenization and padding/truncation of sequences in a manner that aligns with the model‚Äôs requirements.</p>
<p>By default, tokenizers may pad sequences on the right side (<code class="language-plaintext highlighter-rouge">padding_side='right'</code>). Let‚Äôs set the padding direction for the tokenizer.</p>


In [54]:
# Tokenize Data based on BPE letter
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="longest", truncation=True, return_tensors="pt")

In [52]:
dataset_text = load_dataset("csv", data_files="data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")

Generating train split: 99999 examples [00:00, 287295.34 examples/s]


In [55]:
dataset = dataset_text.map(tokenize_function, batched=True)

Map:   0%|          | 0/99999 [00:00<?, ? examples/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 99999/99999 [00:07<00:00, 13819.28 examples/s]
