## Training and Inference on the Mistral DNA Language Model

<p><strong>Generative Artificial Intelligence</strong> (AI) represents a cutting-edge domain within machine learning, focused on creating new, synthetic yet realistic data. This includes generating text, images, music, and even biological sequences. At the heart of many generative AI applications are <strong>Large Language Models</strong> (LLMs), which have revolutionized natural language processing and beyond.</p>
<p>LLMs are <strong>sophisticated neural networks</strong> trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on <strong>Transformers</strong>, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery.</p>
<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-transformers"><button class="gtn-boxify-button details" type="button" aria-controls="details-transformers" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details:  Transformers </span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>Transformers are a type of neural network model designed to handle sequential data, such as text, by using self-attention mechanisms to weigh the importance of input elements relative to each other, enabling the model to understand and generate coherent and contextually relevant outputs.</p>
</blockquote>
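<p>As a rough illustration of the self-attention idea described in the box above, the following plain-Python sketch computes scaled dot-product attention over a toy sequence. The function names, the toy vectors, and the optional causal mask are illustrative assumptions; real models use learned query, key, and value projections and a tensor library.</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=True):
    """Scaled dot-product attention; with causal=True a position only sees itself and earlier positions."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # future positions are masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional vectors used as queries, keys, and values alike
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

<p>Each row of <code style="color: inherit">weights</code> sums to 1 and shows how strongly one position attends to the others; the zeros above the diagonal come from the causal mask.</p>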
<p><a href="https://mistral.ai/">Mistral AI</a>, a French artificial intelligence (AI) startup, has released large language models (LLMs) reported to outperform Llama 2. In particular, Mixtral-8x7B implements:</p>
<ul>
<li><strong>Grouped-Query Attention</strong>: Efficiently computes attention by grouping queries, reducing computational load and memory usage.</li>
<li><strong>Sliding-Window Attention</strong>: Focuses on a fixed-size window of tokens, sliding over the sequence to manage long texts efficiently.</li>
<li><strong>Byte-fallback BPE Tokenizer</strong>: Tokenizes text into subword units, falling back to byte-level tokenization for unknown words, ensuring robust handling of diverse text inputs.</li>
</ul>
<p>These techniques collectively enhance the performance and efficiency of large language models, enabling them to process and generate text more effectively.</p>
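<p>Two of the techniques listed above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Mistral's implementation: the helper names are hypothetical, and the window size and head counts are arbitrary example values.</p>

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j:
    j must not be in the future and must lie within the last `window` positions."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Grouped-query attention: consecutive query heads share one key/value head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# Sliding-window attention with a toy window of 3 tokens
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Grouped-query attention with 8 query heads sharing 2 key/value heads
mapping = [kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]
print(mapping)
```

<p>In the printed mask, each row contains at most three <code style="color: inherit">x</code> marks, so the cost per token stays constant as the sequence grows. In the head mapping, fewer key/value heads mean fewer key and value tensors to store.</p>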
<p>In this tutorial, we will use a simplified Mistral model architecture with fewer layers and hidden units to reduce computational requirements. The model will be trained to predict the next base in the sequence. For instance, for a sequence like <code style="color: inherit">ATTTGTTGGT</code>, the model will be trained to predict the suffix <code style="color: inherit">TTGGT</code> given the prefix <code style="color: inherit">ATTTG</code>. This process is called <strong>causal language modeling</strong>.</p>
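<p>The prefix and suffix example above can be made concrete with a short sketch. It assumes, for simplicity, that every base is its own token (the tokenizer used later in this tutorial produces multi-base tokens), and the helper name <code style="color: inherit">next_token_pairs</code> is hypothetical.</p>

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each token is predicted from all tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sequence = list("ATTTGTTGGT")
pairs = next_token_pairs(sequence)

# Show the pairs that start from the prefix ATTTG
for context, target in pairs[4:]:
    print("".join(context), "->", target)
```

<p>Reading the targets from top to bottom spells out the suffix <code style="color: inherit">TTGGT</code>: the model never sees a token before it has to predict it.</p>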
<p>To pretrain the model, we will use a file containing about 100,000 non-overlapping DNA sequences of 200 bases each, that is roughly 20 million bases or a little under 1% of the human genome (hg38 assembly). This involves training the model to predict the end of a DNA sequence.</p>
<p>By the end of this tutorial, we will obtain a Mistral-DNA model with an internal representation of DNA sequence grammar. This pretrained model can then be used for various applications, such as fine-tuning for classification tasks or predicting mutational effects.</p>

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the required dependencies do?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution"><button class="gtn-boxify-button solution" type="button" aria-controls="solution" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li>
<p><code style="color: inherit">accelerate</code>: A library by <a href="https://huggingface.co/">Hugging Face</a> (a platform that provides tools and resources for building, training, and deploying machine learning models) designed to simplify training and deploying models across different hardware environments. It provides tools to optimize performance on GPUs, TPUs, and other accelerators, making it easier to scale models efficiently.</p>
</li>
<li>
<p><code style="color: inherit">datasets</code>: A library by Hugging Face for managing and processing datasets. It provides tools to load, manipulate, and share datasets in a standardized format, making it easier to work with machine learning data.</p>
</li>
<li>
<p><code style="color: inherit">numpy</code>: A fundamental package for scientific computing in Python.</p>
</li>
<li>
<p><code style="color: inherit">torch</code>: Also known as PyTorch, it is an open-source machine learning library developed by Facebook's AI Research lab. It provides a flexible platform for building and training neural networks, with a focus on tensor computations and automatic differentiation.</p>
</li>
<li>
<p><code style="color: inherit">transformers</code>: A library by Hugging Face that provides implementations of state-of-the-art transformer models for natural language processing (NLP). It includes pre-trained models and tools for fine-tuning, making it easier to apply transformers to various NLP tasks.</p>
</li>
<li>
<p><code style="color: inherit">flash-attn</code>: An implementation of FlashAttention, a fast and memory-efficient exact attention algorithm with IO-awareness.</p>
</li>
</ul>
<p>These libraries are widely used in the machine learning and data science communities for their efficiency, flexibility, and extensive functionality.</p>
</details>
</blockquote>


In [2]:
import os
import accelerate
# import flash_attn
import torch
import transformers
from datasets import load_dataset
from transformers import (
    AutoConfig,  # loads a model configuration: architecture and hyperparameters
    AutoModelForCausalLM,  # builds or loads a causal language model (next-token prediction)
    AutoTokenizer,  # loads the tokenizer that converts text into token IDs
    DataCollatorForLanguageModeling,  # builds training batches: padding, labels, and optional masking
    EarlyStoppingCallback,  # stops training when validation performance stops improving
    Trainer,  # high-level API for training and evaluating transformers models
    TrainingArguments,  # training hyperparameters: learning rate, batch size, epochs, weight decay
)



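Let's look at the architecture of Mixtral-8x7B-v0.1 as stored in the data/models/Mixtral-8x7B-v0.1 folder of the Mistral-DNA GitHub repository: https://github.com/raphaelmourad/Mistral-DNA/tree/main/data/models/Mixtral-8x7B-v0.1. Note that this configuration is already scaled down (for example, a hidden size of 256), as the output of the following cells shows.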


<p>In the name 8x7B, 8 is the number of experts in each mixture-of-experts layer, and 7B refers to the 7-billion-parameter scale of the Mistral model on which each expert is based.</p>

In [3]:
savedir = "/mnt/data/projects/.immune/Personal/DNA-Language-Model/Mistral_DNA/"
os.chdir(savedir)
config = AutoConfig.from_pretrained("data/models/Mixtral-8x7B-v0.1")

In [4]:
config

MixtralConfig {
  "_name_or_path": "data/models/Mixtral-8x7B-v0.1",
  "architectures": [
    "MixtralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 256,
  "max_position_embeddings": 512,
  "model_type": "mixtral",
  "num_attention_heads": 8,
  "num_experts_per_tok": 1,
  "num_hidden_layers": 8,
  "num_key_value_heads": 8,
  "num_local_experts": 64,
  "output_router_logits": false,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "router_aux_loss_coef": 0.02,
  "router_jitter_noise": 0.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.1",
  "use_cache": true,
  "vocab_size": 4096
}

<p>By loading the configuration, we can inspect or modify the model's architecture without loading the actual model weights. Let's now initialize a causal language model from the loaded configuration object, with a specific attention implementation:</p>

In [5]:
model = AutoModelForCausalLM.from_config(config, attn_implementation="eager")
# attn_implementation="eager" selects the plain PyTorch implementation of attention
# (explicit matrix multiplications and softmax) instead of optimized kernels such as
# "sdpa" or "flash_attention_2". It needs no extra dependency (flash_attn is not
# imported here) and is easy to debug, at the cost of speed and memory.

In [6]:
model

MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(4096, 256)
    (layers): ModuleList(
      (0-7): 8 x MixtralDecoderLayer(
        (self_attn): MixtralAttention(
          (q_proj): Linear(in_features=256, out_features=256, bias=False)
          (k_proj): Linear(in_features=256, out_features=256, bias=False)
          (v_proj): Linear(in_features=256, out_features=256, bias=False)
          (o_proj): Linear(in_features=256, out_features=256, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=256, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=256, out_features=256, bias=False)
              (w2): Linear(in_features=256, out_features=256, bias=False)
              (w3): Linear(in_features=256, out_features=256, bias=False)
              (act_fn): SiL

In [7]:
Total_parameters = sum(p.numel() for p in model.parameters()) / 1000 ** 2
print(f"Total Parameter {Total_parameters:.1f} million") 

Total Parameter 105.0 million
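
<p>The total can also be reproduced by hand from the configuration and the layer shapes printed above. The sketch below is a back-of-the-envelope check, assuming that each normalization layer holds one weight per hidden unit and that the rotary embedding has no trainable parameters; the variable names are chosen for illustration.</p>

```python
# Values copied from the configuration printed above
hidden_size = 256
intermediate_size = 256
vocab_size = 4096
num_hidden_layers = 8
num_local_experts = 64

embedding = vocab_size * hidden_size                 # embed_tokens
attention = 4 * hidden_size * hidden_size            # q_proj, k_proj, v_proj, o_proj (no bias)
gate = hidden_size * num_local_experts               # router of the mixture of experts
experts = num_local_experts * 3 * hidden_size * intermediate_size  # w1, w2, w3 per expert
layer_norms = 2 * hidden_size                        # assumed: input and post-attention norms
per_layer = attention + gate + experts + layer_norms

final_norm = hidden_size                             # assumed: one weight per hidden unit
lm_head = hidden_size * vocab_size                   # separate from the embedding (tie_word_embeddings is false)

total = embedding + num_hidden_layers * per_layer + final_norm + lm_head
print(f"{total:,} parameters ({total / 1000**2:.1f} million)")
print(f"Share held by the experts: {num_hidden_layers * experts / total:.1%}")
```

<p>The count matches the 105.0 million reported by PyTorch, and it shows that almost all of the parameters sit in the expert networks.</p>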


<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-loaded-functions-and-classes-from-datasets-and-transformers-libraries"><button class="gtn-boxify-button details" type="button" aria-controls="details-loaded-functions-and-classes-from-datasets-and-transformers-libraries" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details: Loaded functions and classes from datasets and transformers libraries</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li><code style="color: inherit">datasets</code>:
<ul>
<li><code style="color: inherit">load_dataset</code>: function to load datasets from the Hugging Face Hub or local files.</li>
</ul>
</li>
<li><code style="color: inherit">transformers</code>:
<ul>
<li><code style="color: inherit">AutoConfig</code>: Automatically loads the configuration for a pre-trained model. It defines the architecture and hyperparameters of the model.</li>
<li><code style="color: inherit">AutoModelForCausalLM</code>: Loads a pre-trained causal language model for tasks like text generation, where the model predicts the next token in a sequence.</li>
<li><code style="color: inherit">AutoTokenizer</code>: Loads the tokenizer associated with a pre-trained model. It converts text into tokens that the model can process.</li>
<li><code style="color: inherit">DataCollatorForLanguageModeling</code>: A data collator specifically designed for language modeling tasks. It prepares batches of data for training by handling padding and masking.</li>
<li><code style="color: inherit">EarlyStoppingCallback</code>: A callback used during training to stop the process early if the model's performance on the validation set stops improving, saving time and resources.</li>
<li><code style="color: inherit">Trainer</code>: A high-level API for training and evaluating transformer models. It simplifies the training loop and handles tasks like gradient accumulation and evaluation.</li>
<li><code style="color: inherit">TrainingArguments</code>: A class to define the training configuration, including hyperparameters like learning rate, batch size, and number of epochs. It is used to configure the <code style="color: inherit">Trainer</code>.</li>
</ul>
</li>
</ul>
<p>These components work together to streamline the process of training and fine-tuning transformer models for various NLP tasks.</p>
</blockquote>

<p>As expected, the model is a <code style="color: inherit">MixtralForCausalLM</code> model with several key components:</p>
<ol>
<li>
<p><strong>Embedding Layer (<code class="language-plaintext highlighter-rouge">embed_tokens</code>)</strong>: Converts input DNA tokens into dense vectors of fixed size. It maps each of the 4,096 tokens in the vocabulary to a 256-dimensional vector. Although 4,096 equals \(4^{6}\), the tokens produced by the BPE tokenizer used in this tutorial are variable-length DNA subsequences plus a few special tokens, not fixed 6-mers. This embedding layer transforms discrete DNA sequences into a format suitable for neural network processing.</p>
</li>
<li><strong>Decoder Layers (<code class="language-plaintext highlighter-rouge">layers</code>)</strong>: Consists of eight <code style="color: inherit">MixtralDecoderLayer</code> modules, each containing several sub-components:
<ul>
<li>
<p><strong>Self-Attention Mechanism (<code class="language-plaintext highlighter-rouge">self_attn</code>)</strong></p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-6"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What are the components?</li>
<li>What is the purpose?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-6"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-6" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The components are linear projections (<code class="language-plaintext highlighter-rouge">q_proj</code>, <code style="color: inherit">k_proj</code>, <code class="language-plaintext highlighter-rouge">v_proj</code>, <code style="color: inherit">o_proj</code>) for queries, keys, values, and outputs, along with a rotary embedding (<code class="language-plaintext highlighter-rouge">rotary_emb</code>) to incorporate positional information.</li>
<li>This allows the model to weigh the importance of different tokens in the sequence relative to each other, capturing dependencies and context.</li>
</ol>
</details>
</blockquote>
</li>
<li>
<p><strong>Sparse Mixture of Experts (<code class="language-plaintext highlighter-rouge">block_sparse_moe</code>)</strong>:</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-7"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What are the components?</li>
<li>What is the purpose?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-7"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-7" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The components are a gating mechanism (<code class="language-plaintext highlighter-rouge">gate</code>) and a list of 64 expert networks (<code class="language-plaintext highlighter-rouge">experts</code>), each with three linear layers (<code class="language-plaintext highlighter-rouge">w1</code>, <code style="color: inherit">w2</code>, <code style="color: inherit">w3</code>) and an activation function (<code class="language-plaintext highlighter-rouge">act_fn</code>).</li>
<li>This efficiently processes input data by activating only a subset of expert networks for each token, reducing computational load while maintaining model capacity.</li>
</ol>
</details>
</blockquote>
</li>
<li>
<p><strong>Layer Normalization (<code class="language-plaintext highlighter-rouge">input_layernorm</code>, <code style="color: inherit">post_attention_layernorm</code>)</strong>: Stabilizes and accelerates the training process by normalizing the inputs and outputs of the attention mechanism.</p>
</li>
</ul>
</li>
<li>
<p><strong>Final Layer Normalization (<code class="language-plaintext highlighter-rouge">norm</code>)</strong>: Applies normalization to the output of the final decoder layer, ensuring stable and consistent outputs.</p>
</li>
<li><strong>Language Model Head (<code class="language-plaintext highlighter-rouge">lm_head</code>)</strong>: Projects the 256-dimensional output of the final decoder layer back into the 4,096-dimensional vocabulary space of DNA tokens. This linear layer (<code class="language-plaintext highlighter-rouge">Linear</code>) maps the hidden states to the original token space, enabling the model to predict the next DNA token accurately.</li>
</ol>
<p>This architecture lets the model capture complex patterns in DNA sequences while remaining computationally efficient, making it suitable for tasks like DNA sequence generation and analysis. For each position, the model outputs one score per token in the 4,096-token vocabulary, so the output dimension matches the input vocabulary. This match is what allows the model to predict the next token of a given DNA sequence.</p>
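<p>To make the sparse mixture-of-experts step more concrete, here is a minimal plain-Python sketch of expert routing. It is a simplified illustration, not the <code style="color: inherit">transformers</code> implementation: the experts are toy scalar functions, the gate scores are made up, and the helper names <code style="color: inherit">route</code> and <code style="color: inherit">moe_forward</code> are hypothetical. The configuration in this tutorial uses 64 experts and activates one per token (<code style="color: inherit">num_experts_per_tok: 1</code>).</p>

```python
import math

def route(gate_logits, top_k=1):
    """Pick the top_k experts from the gate scores and renormalize their weights."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, gate_logits, experts, top_k=1):
    """Run only the selected experts and combine their outputs with the routing weights."""
    out = 0.0
    for idx, weight in route(gate_logits, top_k):
        out += weight * experts[idx](x)
    return out

# Four toy experts (assumption: scalar functions standing in for small MLPs)
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
toy_gate_logits = [0.1, 2.0, -1.0, 0.5]

selected = route(toy_gate_logits, top_k=1)
result = moe_forward(3.0, toy_gate_logits, experts, top_k=1)
print(selected, result)
```

<p>With <code style="color: inherit">top_k=1</code>, only the expert with the highest gate score is evaluated, so the cost per token is that of a single expert even though the layer stores many.</p>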
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-8"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>How many parameters are in this model?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-8"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-8" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">pytorch_total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {pytorch_total_params/1000**2:.1f}M parameters")
</code></pre></div>    </div>
<p>There are about 105 million parameters. That is large for a tutorial model, even though it is far smaller than the full Mixtral-8x7B.</p>
</details>
</blockquote>
<h1 id="prepare-the-tokenizer">Prepare the tokenizer</h1>
<p>A tokenizer is a crucial component in natural language processing (NLP) that transforms raw text into a format that can be processed by machine learning models. In this section, we will load and configure the <strong>Byte-Pair Encoding (BPE) letter tokenizer</strong>. The BPE tokenizer efficiently handles rare and unknown words by breaking them down into frequent subword units, ensuring that the model can generalize better to unseen data. This process involves initializing the tokenizer with a predefined vocabulary and settings, enabling it to convert text into a format suitable for neural network processing. By doing so, we prepare the tokenizer to effectively manage DNA sequences, facilitating accurate and reliable model predictions.</p>
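<p>The way a BPE tokenizer builds longer tokens from frequent pairs can be sketched as follows. This is a toy illustration: the merge list is invented for the example and is not the vocabulary of the tokenizer loaded in this tutorial, and <code style="color: inherit">bpe_tokenize</code> is a hypothetical helper.</p>

```python
def bpe_tokenize(sequence, merges):
    """Apply an ordered list of BPE merges to a sequence, starting from single characters."""
    tokens = list(sequence)
    for left, right in merges:  # merges are applied in the order they were learned
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Invented merges, as if learned from a corpus rich in the TAACCC repeat
toy_merges = [("T", "A"), ("C", "C"), ("TA", "A"), ("CC", "C"), ("TAA", "CCC")]
tokens = bpe_tokenize("TAACCCTAACCCTAG", toy_merges)
print(tokens)  # ['TAACCC', 'TAACCC', 'TA', 'G']
```

<p>Frequent motifs end up as single long tokens, while rarer stretches fall back to shorter pieces, which is why BPE tokens have variable length.</p>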
<p>Let's load a pre-trained tokenizer from the Hugging Face Model Hub. The tokenizer is associated with the model <code style="color: inherit">DNABERT-2-117M</code>, which is designed for processing DNA sequences.</p>


In [7]:
# !pip install transformers==4.41.1

In [8]:
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

In [9]:
tokenizer

PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

<p>The <code style="color: inherit">PreTrainedTokenizerFast</code> is a fast and efficient tokenizer used to process text data for the <code style="color: inherit">DNABERT-2-117M</code> model. Here's a breakdown of its configuration:</p>
<ul>
<li>
<p><code style="color: inherit">name_or_path='zhihan1996/DNABERT-2-117M'</code>: Specifies the name or path of the pre-trained tokenizer, indicating that it is associated with the <code style="color: inherit">DNABERT-2-117M</code> model, which is designed for processing DNA sequences.</p>
</li>
<li>
<p><code style="color: inherit">vocab_size=4096</code>: Defines the size of the tokenizer's vocabulary.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-10"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>Why is the size of the tokenizer's vocabulary set to 4,096?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-10"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-10" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>It is the number of unique tokens that the tokenizer can produce and that the model can recognize in DNA sequences: a few special tokens plus variable-length DNA subsequences learned by BPE. Although 4,096 equals \(4^{6}\), the tokens are not fixed 6-mers; for example, the 3-base sequence <code style="color: inherit">ATT</code> is encoded as a single token later in this tutorial.</p>
</details>
</blockquote>
</li>
<li>
<p><code style="color: inherit">special_tokens</code>: Defines a set of special tokens used by the tokenizer:</p>
<ul>
<li><code style="color: inherit">unk_token: '[UNK]'</code> - Represents unknown or out-of-vocabulary tokens.</li>
<li><code style="color: inherit">sep_token: '[SEP]'</code> - Used to separate segments within a sequence.</li>
<li><code style="color: inherit">pad_token: '[PAD]'</code> - Used for padding sequences to a uniform length.</li>
<li><code style="color: inherit">cls_token: '[CLS]'</code> - Typically used as the first token in a sequence to represent the classification token.</li>
<li><code style="color: inherit">mask_token: '[MASK]'</code> - Used in masked language modeling to hide tokens that the model must predict.</li>
</ul>
</li>
</ul>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-11"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the other configuration parameters mean?</p>
<ol>
<li><code style="color: inherit">model_max_length=1000000000000000019884624838656</code></li>
<li><code style="color: inherit">is_fast=True</code></li>
<li><code style="color: inherit">padding_side='right'</code></li>
<li><code style="color: inherit">truncation_side='right'</code></li>
<li><code style="color: inherit">clean_up_tokenization_spaces=True</code></li>
<li><code style="color: inherit">added_tokens_decoder</code></li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-11"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-11" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>
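<p><code style="color: inherit">model_max_length=1000000000000000019884624838656</code>: Represents the maximum sequence length that the tokenizer will accept before truncating.</p>
<p>This extremely large value is the placeholder that the <code style="color: inherit">transformers</code> library uses when the tokenizer configuration does not specify a maximum length, so the tokenizer itself imposes no limit. In practice, the actual limit is set by the model (for example, its <code style="color: inherit">max_position_embeddings</code>) and by the available computational resources.</p>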
</li>
<li><code style="color: inherit">is_fast=True</code>: Indicates that this tokenizer is optimized for speed, leveraging Rust-based implementations to accelerate tokenization processes.</li>
<li><code style="color: inherit">padding_side='right'</code>: Configures the tokenizer to pad sequences on the right side, ensuring that all sequences in a batch have the same length by adding padding tokens to the end of shorter sequences.</li>
<li><code style="color: inherit">truncation_side='right'</code>: Specifies that sequences will be truncated from the right side if they exceed the maximum length, preserving the beginning of the sequence.</li>
<li><code style="color: inherit">clean_up_tokenization_spaces=True</code>: Indicates that, when decoding token IDs back into text, the tokenizer cleans up extra spaces introduced by tokenization (for example, spaces before punctuation).</li>
<li><code style="color: inherit">added_tokens_decoder</code>: Maps token IDs to their corresponding <code style="color: inherit">AddedToken</code> objects, which include metadata such as whether the token is a special token and how it should be processed (e.g., stripping whitespace).</li>
</ol>
</details>
</blockquote>
<p>This configuration ensures that the tokenizer is tailored to efficiently process DNA sequences, handling both the tokenization and padding/truncation of sequences in a manner that aligns with the model's requirements.</p>
<p>By default, this tokenizer pads sequences on the right side (<code class="language-plaintext highlighter-rouge">padding_side='right'</code>). Let's switch the padding direction to the left, so that in a causal language model the real tokens sit at the end of each padded sequence, next to the position where the next token is predicted.</p>


In [10]:
tokenizer.padding_side  = "left"

<p>When tokenizing a batch of sequences, shorter sequences will be padded with special tokens on the left to match the length of the longest sequence in the batch. This can be useful for ensuring consistent input sizes, especially in models that expect fixed-size inputs.</p>
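<p>The effect of the padding side can be illustrated without the tokenizer itself. The sketch below pads a toy batch of token ID lists on either side and builds the matching attention masks; <code style="color: inherit">pad_batch</code> is a hypothetical helper, and the pad ID of 3 is taken from the <code style="color: inherit">[PAD]</code> entry in the tokenizer output above.</p>

```python
PAD_ID = 3  # ID of [PAD] in the tokenizer printed above

def pad_batch(batch, pad_id=PAD_ID, side="left"):
    """Pad every list of token IDs to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        if side == "left":
            input_ids.append([pad_id] * n_pad + ids)
            attention_mask.append([0] * n_pad + [1] * len(ids))
        else:
            input_ids.append(ids + [pad_id] * n_pad)
            attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Toy batch: a short and a longer sequence of token IDs
batch = [[1, 2061, 2], [1, 2061, 281, 485, 2]]

left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")
print(left_ids[0], left_mask[0])    # [3, 3, 1, 2061, 2] [0, 0, 1, 1, 1]
print(right_ids[0], right_mask[0])  # [1, 2061, 2, 3, 3] [1, 1, 1, 0, 0]
```

<p>In both cases the attention mask marks padding with 0 so that the model ignores it; only the position of the padding changes.</p>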

In [11]:
encoding = tokenizer("ATT", padding="longest", return_tensors="pt")
print(encoding)

{'input_ids': tensor([[   1, 2061,    2]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}


In [12]:
## encoding for maximum token length
encoding = tokenizer("ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC", max_length=5, padding='max_length', truncation=True, return_tensors="pt")
print(encoding)

{'input_ids': tensor([[   1, 2061,  281,  485,    2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


In [13]:
dataset_text = load_dataset("csv", data_files="data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")

In [14]:
print(dataset_text)
print(dataset_text['train']['text'][0])
print(len(dataset_text['train']['text'][0]))
print(len(dataset_text['train']['text']))

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 99999
    })
})
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAA
200
99999


### Tokenization

In [15]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="longest", truncation=True, return_tensors="pt")

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-14"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the following parameters do?</p>
<ol>
<li><code style="color: inherit">padding="longest"</code></li>
<li><code style="color: inherit">truncation=True</code></li>
<li><code style="color: inherit">return_tensors="pt"</code></li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-14"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-14" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li><code style="color: inherit">padding="longest"</code> ensures that all sequences in the batch are padded to the length of the longest sequence, adding padding tokens as needed.</li>
<li><code style="color: inherit">truncation=True</code> specifies that sequences exceeding the model's maximum length will be truncated to fit.</li>
<li><code style="color: inherit">return_tensors="pt"</code> indicates that the output should be in the form of PyTorch tensors, suitable for use with PyTorch-based models.</li>
</ol>
</details>
</blockquote>

In [16]:
dataset = dataset_text.map(tokenize_function, batched=True)

In [17]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 99999
    })
})

<ol>
<li>
<p><code style="color: inherit">dataset</code> is a <code style="color: inherit">DatasetDict</code> with one <code style="color: inherit">train</code> <code style="color: inherit">Dataset</code> made of 99,999 rows and 4 features:</p>
<ul>
<li><code style="color: inherit">text</code>: The original text data before tokenization.</li>
<li><code style="color: inherit">input_ids</code>: The tokenized input data, represented as numerical IDs.</li>
<li><code style="color: inherit">token_type_ids</code>: Indicates the type of each token, useful for models that handle multiple segments.</li>
<li><code style="color: inherit">attention_mask</code>: Specifies which tokens should be attended to by the model (<code style="color: inherit">1</code> for real tokens, <code style="color: inherit">0</code> for padding).</li>
</ul>
</li>

In [18]:
print(dataset['train']['input_ids'][1][0:10])
print(dataset['train']['input_ids'][1][32:42])

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[888, 956, 102, 615, 79, 956, 338, 956, 956, 956]


In [19]:
print(dataset['train']['token_type_ids'][1][0:10])
print(dataset['train']['token_type_ids'][1][32:42])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [20]:
print(dataset['train']['attention_mask'][1][0:10])
print(dataset['train']['attention_mask'][1][32:42])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


<li>The second tokenized sequence of the <code style="color: inherit">train</code> <code style="color: inherit">Dataset</code> (<code class="language-plaintext highlighter-rouge">dataset["train"][1]</code>, index 1) is a dictionary with:
<ul>
<li><code style="color: inherit">text</code>: the 200-base sequence</li>
<li><code style="color: inherit">input_ids</code>: list of token IDs, left-padded with the <code style="color: inherit">[PAD]</code> token (ID <code style="color: inherit">3</code>), as the first ten values printed above show. Its length is that of the longest sequence in the tokenization batch.</li>
<li><code style="color: inherit">token_type_ids</code>: list of <code style="color: inherit">0</code> of the same length</li>
<li><code style="color: inherit">attention_mask</code>: list of the same length, with <code style="color: inherit">0</code> for each padding position followed by <code style="color: inherit">1</code> for each real token</li>
</ul>
</li>
</ol>


In [21]:
# Data splitting
train_size = int(0.8 * len(dataset["train"]))
val_size = len(dataset["train"]) - train_size

In [22]:
train_set, val_set = torch.utils.data.random_split(dataset["train"], [train_size, val_size])

In [23]:
print(train_set.dataset, len(train_set.indices))
print(val_set.dataset, len(val_set.indices))

Dataset({
    features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 99999
}) 79999
Dataset({
    features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 99999
}) 20000
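
<p>As a plain-Python illustration of what the random split above does conceptually, the sketch below shuffles row indices with a fixed seed and cuts them 80/20. The seed value 42 and the helper name <code style="color: inherit">split_indices</code> are assumptions for illustration and are not part of the original notebook. With PyTorch, passing a seeded <code style="color: inherit">torch.Generator</code> to <code style="color: inherit">random_split</code> through its <code style="color: inherit">generator</code> argument serves the same purpose.</p>

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_indices, val_indices) from a seeded shuffle of range(n_rows).

    Hypothetical helper: it mimics the idea of a random split, not its exact
    implementation in PyTorch.
    """
    train_size = int(train_fraction * n_rows)
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(99_999)
print(len(train_idx), len(val_idx))
```

<p>Fixing the seed means the same rows end up in the validation set on every run, which makes validation losses comparable between runs.</p>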


<h2 id="data-collation">Data Collation</h2>
<p>The <code style="color: inherit">DataCollatorForLanguageModeling</code> is a utility class, designed to prepare and format batches of data for language modeling tasks. It handles the dynamic padding and masking of input sequences, ensuring that each batch fed into the model is correctly formatted and optimized for training.</p>

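<p>A data collator is a function or class that:</p>
<ul>
<li>Takes a list of tokenized samples</li>
<li>Pads them to the same length</li>
<li>Creates the final tensors (<code style="color: inherit">input_ids</code>, <code style="color: inherit">attention_mask</code>, <code style="color: inherit">labels</code>)</li>
<li>Prepares them in the format the model expects for training</li>
<li>Runs at batch time, right before the data is fed to the model</li>
</ul>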

In [24]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# tokenizer: the tokenizer that converted the raw text into token IDs; the collator uses it to pad batches
# mlm=False: prepare batches for causal language modeling (CLM) rather than masked language modeling (MLM)

In [25]:
data_collator

DataCollatorForLanguageModeling(tokenizer=PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, mlm=False, mlm_probability=0.15, pad_to_multiple_of=

<p>This will:</p>
<ol>
<li>Automatically pad sequences within a batch to ensure they are of equal length, which is necessary for efficient batch processing in neural networks.</li>
<li>Generate attention masks that indicate which tokens should be attended to by the model, ignoring padding tokens.</li>
<li>Collate individual examples into batches, handling the necessary formatting and ensuring compatibility with the model's input requirements.</li>
</ol>
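
<p>The three steps above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper name <code style="color: inherit">collate_causal_lm</code> and the toy token IDs are assumptions, padding is applied on the left to match the tokenizer's <code style="color: inherit">padding_side='left'</code>, and padded positions get the label <code style="color: inherit">-100</code> so that the loss ignores them.</p>

```python
PAD_ID = 3           # ID of the [PAD] token in the tokenizer shown above
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal_lm(samples, pad_id=PAD_ID):
    """Left-pad lists of token IDs to equal length and build masks and labels.

    Hypothetical helper that returns plain lists instead of tensors.
    """
    max_len = max(len(ids) for ids in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in samples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append([pad_id] * n_pad + list(ids))
        # 0 marks padding, 1 marks real tokens.
        batch["attention_mask"].append([0] * n_pad + [1] * len(ids))
        # For causal language modeling the labels are the inputs themselves;
        # padded positions are excluded from the loss.
        batch["labels"].append([IGNORE_INDEX] * n_pad + list(ids))
    return batch


batch = collate_causal_lm([[1, 888, 956, 2], [1, 102, 2]])
for name, rows in batch.items():
    print(name, rows)
```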


# Train the model

## Define parameters for pretraining

<p>We are going to define the hyperparameters and configuration for training the language model using the Hugging Face Transformers library.</p>

<p>The batch size is the number of samples processed together in one forward and backward pass. It should be chosen to balance computational efficiency and memory usage.</p>

In [26]:
batchsize = 128
training_args = TrainingArguments(
    output_dir="./results/models",  # where checkpoints and results are saved
    eval_strategy="epoch",  # evaluate on the validation set after each epoch
    save_strategy="epoch",  # save a checkpoint after each epoch
    num_train_epochs=50,  # at most 50 passes over the training set
    per_device_train_batch_size=batchsize,  # samples per device in each training step
    per_device_eval_batch_size=batchsize,  # samples per device in each evaluation step
    learning_rate=5e-4,
    weight_decay=0.01,  # weight decay (L2-style regularization) to limit overfitting
    logging_dir="./logs",
    load_best_model_at_end=True,  # reload the checkpoint with the lowest evaluation loss
    bf16=True,  # mixed-precision training with 16-bit bfloat16 numbers
    gradient_accumulation_steps=50,  # accumulate gradients over 50 batches per weight update
    report_to="none",  # do not report to external logging integrations
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


**load_best_model_at_end**
<p>ensures that the best model, the one with the lowest evaluation loss, is loaded at the end of training. As training proceeds, the evaluation loss typically decreases and then, at some point, starts to increase again. We want to keep the model from the epoch with the lowest evaluation loss, not a later one where the loss has started to rise. So "load best model at end" means selecting the model with the best loss across all epochs.</p>

**gradient_accumulation_steps**
<p>accumulates gradients over 50 batches before performing a weight update (optimizer step). This increases the effective batch size to 128 × 50 = 6,400 samples per device without requiring the memory of a larger batch, which helps to stabilize training.</p>
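
<p>To see why accumulation behaves like a larger batch, the sketch below uses a toy one-parameter regression (an assumption for illustration, unrelated to the DNA model). It compares the gradient of a mean squared error computed on a full batch with the same gradient built up from equally sized micro-batches. All names and data values are hypothetical.</p>

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
w = 0.5

# Gradient computed on the full batch of 6 samples at once.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient built from 3 micro-batches of 2 samples each.
accumulation_steps = 3
micro_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo, hi = step * micro_size, (step + 1) * micro_size
    # Scale each micro-batch gradient so that the running sum is an average.
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
# Only at this point, after all micro-batches, would the optimizer update w.

print(abs(full_batch_grad - accumulated_grad) < 1e-9)

# Effective batch size for the settings used in this tutorial.
effective_batch_size = 128 * 50  # per-device batch size x accumulation steps
print(effective_batch_size)
```

<p>Only one micro-batch has to fit in memory at a time, which is why the effective batch size can grow without a matching increase in memory use.</p>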

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_set,
    eval_dataset=val_set,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

<p>With a patience of three, even if we find a good minimum, we wait for three more epochs to ensure that the loss does not improve further. If the loss does not decrease for three consecutive epochs, we stop training. However, if a better model with a lower loss is found within those three epochs, training continues. This approach helps in finding a more robust local minimum by reducing the impact of noise in the training data.</p>
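
<p>The patience rule described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code of <code style="color: inherit">EarlyStoppingCallback</code>: the function name and the validation losses are made up, and any improvement, however small, resets the counter.</p>

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, best validation loss seen).

    Hypothetical helper: stops once the loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0  # a new best resets the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss
    # Patience was never exhausted: training ran for all epochs.
    return len(val_losses), best_loss


# Made-up losses: the best value is reached at epoch 3, then three epochs
# pass without improvement, so epoch 7 is never run.
stop_epoch, best = early_stopping_epoch([6.5, 6.3, 6.1, 6.2, 6.15, 6.12, 6.0])
print(stop_epoch, best)
```

<p>Combined with <code style="color: inherit">load_best_model_at_end=True</code>, the model kept in this toy run would be the one from epoch 3, not the one from the epoch at which training stopped.</p>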


In [28]:
trainer.train()
# Note: the full training run was completed, but its output was not saved in this notebook.
# Training stopped after 20 epochs.
# The loss was 6.09.

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [30]:
# Load a previously trained Mistral-DNA language model
model = AutoModelForCausalLM.from_pretrained("RaphaelMourad/Mistral-DNA-v1-17M-hg38")

<p>This is a Mixtral (mixture-of-experts) model with approximately 17 million parameters, pre-trained on the entire human genome (assembly GRCh38, also known as hg38). Unlike models pre-trained on sequences of 200 bases, it was pre-trained on sequences of 10,000 bases (10K). This lets it process larger DNA contexts and capture longer-range patterns and dependencies in genomic data.</p>


In [31]:
model

MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(4096, 256)
    (layers): ModuleList(
      (0-7): 8 x MixtralDecoderLayer(
        (self_attn): MixtralSdpaAttention(
          (q_proj): Linear(in_features=256, out_features=256, bias=False)
          (k_proj): Linear(in_features=256, out_features=256, bias=False)
          (v_proj): Linear(in_features=256, out_features=256, bias=False)
          (o_proj): Linear(in_features=256, out_features=256, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=256, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=256, out_features=256, bias=False)
              (w2): Linear(in_features=256, out_features=256, bias=False)
              (w3): Linear(in_features=256, out_features=256, bias=False)
              (act_fn): Si

In [None]:
dna = ["ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC","AAGGTTTCCCAAAAATTTTTACTCTACTCGGGCAGGCAGACGACGAGCTTCATATCAGCGAGCA"]
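tokenized_dna = tokenizer(dna, max_length=10, return_tensors="pt", padding="longest", truncation=True)
# truncation=True cuts each sequence to at most max_length (10) tokens
inputs = tokenized_dna["input_ids"]
model_outputs = model(inputs)
# For a causal LM, the first output holds the logits (one score per vocabulary token),
# which is why the last dimension printed below is the vocabulary size (4096)
hidden_states = model_outputs[0]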

In [67]:
print(inputs)
inputs.shape

tensor([[   1,    5,  194,   32,  757, 1239, 2092,  294,   24,    2],
        [   1,    9,  101,   26,  175,  536,  436, 3189,  253,    2]])


torch.Size([2, 10])

In [None]:
print(len(dna))
print(hidden_states.shape)

2
torch.Size([2, 10, 4096])


In [71]:
model_outputs[0].shape

torch.Size([2, 10, 4096])
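
<p>The last dimension of this output (4096) matches the vocabulary size, so each position carries one score per token. As a hedged, plain-Python sketch of how such scores are usually read, the example below turns the scores at the last position into probabilities with a softmax and picks the most likely next token. The toy scores and the tiny vocabulary of 4 tokens are assumptions for illustration, not values produced by the model.</p>

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores shaped like [batch, sequence, vocabulary] = [1, 2, 4].
toy_logits = [[[0.1, 2.0, -1.0, 0.3],
               [1.5, 0.2, 0.1, 3.0]]]

# The scores at the last position describe the token that would come next.
last_position = toy_logits[0][-1]
probs = softmax(last_position)
next_token_id = max(range(len(probs)), key=probs.__getitem__)
print(next_token_id, round(probs[next_token_id], 3))
```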