In [1]:
# Following this notebook https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb#scrollTo=8XJthJun4mVw 
# !conda create -n nt_finetune_v2 python=3.10
# !pip install transformers datasets huggingface_hub accelerate
# !pip install biopython
# !pip install bitsandbytes
# !pip install xformers
# !pip install --upgrade torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# !pip install numpy==1.26.4 scipy==1.13.1  # keep numpy and scipy pinned to these versions
# !source ~/.bashrc

In [None]:
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForSequenceClassification



#### 1. AutoTokenizer

    This class is a factory that automatically loads the correct tokenizer associated with a specific pre-trained model checkpoint (e.g., an NT model on the Hugging Face Hub).
    Why you use it: You don't need to know the exact tokenizer class (like BertTokenizer or DNATokenizer). You just provide the model's name or path, and AutoTokenizer ensures you get the exact tokenization logic (like the 6-mer split, k-mer vocabulary, and special tokens) that the original model was trained with. This is crucial for data consistency.
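
To make the "6-mer split" concrete, here is a minimal sketch of non-overlapping k-mer tokenization. It is a simplified illustration and not the tokenizer that `AutoTokenizer` loads: the function name `kmer_tokenize` is hypothetical, and falling back to single-character tokens for leftover bases or for chunks containing characters other than A/C/G/T is an assumption of this sketch.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers, left to right.

    Assumption of this sketch: when a full k-mer of A/C/G/T cannot be
    formed at the current position, emit one single-character token.
    """
    tokens = []
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        if len(chunk) == k and set(chunk) <= set("ACGT"):
            tokens.append(chunk)      # a full k-mer token
            i += k
        else:
            tokens.append(sequence[i])  # single-nucleotide fallback
            i += 1
    return tokens


tokens = kmer_tokenize("ATGCGTACGTTAGC")
print(tokens)  # ['ATGCGT', 'ACGTTA', 'G', 'C']
```

In practice the tokens are then mapped to integer IDs from the model's vocabulary, which is why the tokenizer must match the checkpoint.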

#### 2. AutoModelForSequenceClassification

    Similar to AutoTokenizer, this is a factory that automatically loads the weights of a pre-trained Transformer model (like NT) and adds a sequence classification head on top.
    Why you use it: This is the specific architecture for fine-tuning. It loads the large pre-trained Transformer layers and attaches a small, newly initialized classification layer (or "head") with the appropriate number of output classes for your specific task (e.g., 2 classes for benign/pathogenic, or 5 classes for tissue types). By default all layers are trainable; the pre-trained layers are only frozen if you freeze them explicitly.

#### 3. TrainingArguments

    This class is used to define all the hyperparameters and configuration settings for your fine-tuning run.
    You instantiate this class with arguments like:
        output_dir: Where to save the checkpoints.
        learning_rate: The specific rate for optimization.
        per_device_train_batch_size: The number of samples processed per device (GPU) in each training step (e.g., 512).
        num_train_epochs: How many times to loop over the data.
        logging_steps: How often to print training progress.
        ...and many other settings (like weight decay, evaluation strategy, etc.).
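
A per-device batch of 512 may not fit in GPU memory for a 500M-parameter model. One common workaround is gradient accumulation (`gradient_accumulation_steps` in `TrainingArguments`). The sketch below only shows the arithmetic; the helper name `effective_batch_size` and the example numbers are illustrative assumptions, not settings from this notebook.

```python
def effective_batch_size(per_device_batch_size: int,
                         num_devices: int = 1,
                         gradient_accumulation_steps: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# Example: 8 samples per device, 1 GPU, 64 accumulation steps.
print(effective_batch_size(8, num_devices=1, gradient_accumulation_steps=64))  # 512
```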

#### 4. Trainer

    The Trainer class is a high-level training API that orchestrates the entire fine-tuning process. It ties together the model, data, and hyperparameters.
    You initialize the Trainer by passing it four key ingredients:
        The model (from AutoModelForSequenceClassification).
        The training arguments (from TrainingArguments).
        Your training and evaluation datasets.
        A custom function to compute metrics (like accuracy or F1-score).
    Execution: Once initialized, you simply call trainer.train(), and it handles the entire loop: moving data to the GPU, calculating loss, backpropagation, gradient updates, and saving checkpoints.
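
The "custom function to compute metrics" receives the model's raw outputs (logits) together with the true labels and returns a dictionary of named scores. A minimal sketch for a binary task follows, using only NumPy; the toy logits and labels are made up for illustration, and the F1 computation assumes class 1 is the positive class.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn (logits, labels) into a dict of metrics for a binary task."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per row

    accuracy = float((predictions == labels).mean())

    # Confusion-matrix counts needed for F1 (class 1 = positive).
    tp = int(((predictions == 1) & (labels == 1)).sum())
    fp = int(((predictions == 1) & (labels == 0)).sum())
    fn = int(((predictions == 0) & (labels == 1)).sum())
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1}


# Toy example: 4 samples, 2 classes.
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.2, 0.4]])
labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((logits, labels))
print(metrics)
```

A function with this shape would be passed to the Trainer through its `compute_metrics` argument.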

In [3]:
import torch
from sklearn.metrics import matthews_corrcoef, f1_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from accelerate.test_utils.testing import get_backend
device, _, _ = get_backend()


The Matthews Correlation Coefficient (MCC) is a single, comprehensive metric used for evaluating the quality of binary classification models (and can be extended to multiclass).
Why MCC is Often Preferred

In genomics and bioinformatics, you often deal with highly imbalanced datasets (e.g., many neutral variants but very few pathogenic ones). Simple metrics like accuracy or F1-score can be misleading in these cases:

Accuracy can be high if the model just predicts the majority class every time.
F1-Score is more robust than accuracy, but it ignores true negatives, so it does not reflect all four components of the confusion matrix.

MCC is generally considered the most informative single score because it takes into account all four values in the confusion matrix:
    True Positives (TP)
    True Negatives (TN)
    False Positives (FP)
    False Negatives (FN)

#### The MCC Formula

The MCC is calculated using this formula:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

The MCC score ranges from -1 to +1:
    +1: Represents a perfect prediction (the model is correct in every instance).
    0: Represents a prediction no better than random guessing.
    -1: Represents a total disagreement between prediction and observation (the model is always wrong).
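
As a worked example of the imbalance problem, the sketch below computes MCC directly from the four confusion-matrix counts. The counts (95 negatives, 5 positives) are invented for illustration, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Dataset with 95 negatives and 5 positives.
# A model that always predicts "negative" gets 95% accuracy...
majority = mcc(tp=0, tn=95, fp=0, fn=5)
accuracy = (0 + 95) / 100
print(accuracy, majority)  # 0.95 0.0

# ...whereas a perfect model and a fully inverted model hit the extremes.
perfect = mcc(tp=5, tn=95, fp=0, fn=0)
inverted = mcc(tp=0, tn=0, fp=95, fn=5)
print(perfect, inverted)  # 1.0 -1.0
```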

| Model name          | Num layers | Num parameters | Training dataset       |
|---------------------|------------|----------------|------------------------|
| `500M Human Ref`    | 24         | 500M           | Human reference genome |
| `500M 1000G`        | 24         | 500M           | 1000G genomes          |
| `2.5B 1000G`        | 32         | 2.5B           | 1000G genomes          |
| `2.5B Multispecies` | 32         | 2.5B           | Multi-species dataset  |

This task was introduced in DeePromoter, where a set of TATA and non-TATA promoters was gathered. A negative sequence was generated from each promoter, by randomly sampling subsets of the sequence, to guarantee that some obvious motifs were present both in the positive and negative dataset.

In [4]:
num_labels_promoter = 2
# Load the model
model = AutoModelForSequenceClassification.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref", num_labels=num_labels_promoter)
model = model.to(device)

Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
model

EsmForSequenceClassification(
  (esm): EsmModel(
    (embeddings): EsmEmbeddings(
      (word_embeddings): Embedding(4105, 1280, padding_idx=1)
      (dropout): Dropout(p=0.0, inplace=False)
      (position_embeddings): Embedding(1002, 1280, padding_idx=1)
    )
    (encoder): EsmEncoder(
      (layer): ModuleList(
        (0-23): 24 x EsmLayer(
          (attention): EsmAttention(
            (self): EsmSelfAttention(
              (query): Linear(in_features=1280, out_features=1280, bias=True)
              (key): Linear(in_features=1280, out_features=1280, bias=True)
              (value): Linear(in_features=1280, out_features=1280, bias=True)
            )
            (output): EsmSelfOutput(
              (dense): Linear(in_features=1280, out_features=1280, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (LayerNorm): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
          )
          (intermediate): EsmIntermediate(
            

In [12]:
from datasets import load_dataset, Dataset

# Load the promoter dataset from the InstaDeep Hugging Face resources
dataset_name = "promoter_all"
train_dataset_promoter = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    dataset_name,
    split="train",
    streaming=False,
)
test_dataset_promoter = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    dataset_name,
    split="test",
    streaming=False,
)

OSError: [Errno 28] No space left on device: '/home/cb_voyant_bio_com/.cache/huggingface/hub/datasets--InstaDeepAI--nucleotide_transformer_downstream_tasks'
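
The error above means the disk holding the Hugging Face cache is full. A minimal sketch for checking free space before retrying is shown below. The helper name `free_gib` is hypothetical; the default cache location `~/.cache/huggingface` matches the path in the error message. If another disk has room, `load_dataset` accepts a `cache_dir` argument, and the `HF_HOME` environment variable relocates the whole cache.

```python
import os
import shutil


def free_gib(path: str) -> float:
    """Free space (GiB) on the filesystem that holds `path`.

    Walks up to the nearest existing parent, so it also works for a
    cache directory that has not been created yet.
    """
    path = os.path.abspath(os.path.expanduser(path))
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 1024**3


cache_path = "~/.cache/huggingface"
print(f"{free_gib(cache_path):.1f} GiB free for {cache_path}")

# If space is short, one option is to point the download elsewhere, e.g.
# load_dataset(..., cache_dir="/path/with/space")  # hypothetical path
```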