# Using Sequence Packing to Improve Pretraining of ESM-2 with BioNeMo Recipes
This Starter Kit demonstrates pretraining the [ESM-2 model](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v2) using BioNeMo Recipes.
BioNeMo Recipes showcases an easy path to accelerate, scale, and deploy transformer-based biological foundation models using NVIDIA [TransformerEngine](https://github.com/NVIDIA/TransformerEngine). To learn more about BioNeMo Recipes, check out the GitHub repo: https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes

ESM-2 is a pretrained, bidirectional encoder (a BERT-style model) over amino acid sequences. ESM-2 models provide embeddings for amino acids that have led to state-of-the-art performance on downstream tasks such as structure and function prediction.

The ESM-2 recipe also includes sequence packing in THD format (total tokens, heads, head dimension) to improve computational efficiency when training on variable-length protein sequences. This example pretrains the ESM-2 model with and without sequence packing to showcase its benefits.


#### Requirements:
* Must be run on NVIDIA Ampere or newer GPUs
* Should be run on the NGC `pytorch:25.06-py3` image with TransformerEngine

#### Dataset:
This example uses a subset of the `esm2_uniref_pretraining_data` dataset, available on [Hugging Face](https://huggingface.co/datasets/nvidia/esm2_uniref_pretraining_data).

## Setting up BioNeMo Recipes

To start using BioNeMo Recipes, clone BioNeMo Framework from GitHub and install the dependencies listed in the `requirements.txt` of your desired recipe.

This example uses the [`esm2_native_te` recipe from BioNeMo Recipes](https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/esm2_native_te).


In [None]:
%%bash
git clone https://github.com/NVIDIA/bionemo-framework.git
cd bionemo-framework/bionemo-recipes/recipes
pip install -r esm2_native_te/requirements.txt

## ESM2 Training with Megatron-FSDP

The ESM2 training recipe has support for the following parallelism strategies:
* [Distributed Data Parallelism (DDP)](https://docs.pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) - The full model is replicated onto each GPU, and each batch is split among the GPUs.
* [Fully Sharded Data Parallelism (FSDP2)](https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html) - The model parameters, gradients, and optimizer states are all sharded across GPUs. This allows models that do not fit on a single GPU to be trained.
* [Megatron-FSDP (mFSDP)](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src) - An NVIDIA implementation of FSDP that provides up to a 25% speedup and 23% memory savings compared to FSDP2.
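
To make the difference between replication and sharding concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam at about 16 bytes of model state per parameter (2 for the half-precision weight, 2 for the gradient, 12 for the FP32 master weight and two optimizer moments), a round 3 billion parameters, and 8 GPUs. The function name `model_state_gib` is hypothetical, and activations, buffers, and communication overhead are ignored, so real memory usage will be higher.

```python
def model_state_gib(num_params, bytes_per_param=16, num_shards=1):
    """Estimate per-GPU memory (GiB) for weights, gradients, and optimizer state.

    bytes_per_param=16 is an assumption for mixed-precision Adam;
    num_shards=1 models DDP (full replica on every GPU).
    """
    return num_params * bytes_per_param / num_shards / 2**30


params = 3_000_000_000  # assumed round figure for a 3B-parameter model

ddp_gib = model_state_gib(params)                 # full replica per GPU
fsdp_gib = model_state_gib(params, num_shards=8)  # state sharded over 8 GPUs

print(f"DDP:  {ddp_gib:.1f} GiB of model state per GPU")
print(f"FSDP: {fsdp_gib:.1f} GiB of model state per GPU")
```

Under these assumptions, sharding divides the per-GPU model state by the number of GPUs, which is why sharded strategies can train models whose full state would not fit on one device.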

This example trains with mFSDP; to use DDP or FSDP2 instead, replace `train_mfsdp.py` with `train_ddp.py` or `train_fsdp2.py`, respectively.

In [None]:
%%bash
cd bionemo-framework/bionemo-recipes/recipes/esm2_native_te
torchrun --nproc_per_node=8 train_mfsdp.py --config-name L1_3B.yaml \
    +wandb_init_args.mode=offline \
    num_train_steps=500 

## ESM2 Training with Sequence Packing
Sequence packing is implemented in THD format (total tokens, heads, head dimension) to improve computational efficiency when training on variable-length protein sequences.

To turn on sequence packing, set `dataset.use_sequence_packing=true` in your ESM-2 config.

Let's explore the value of using sequence packing for your data.

### The Problem with Traditional Padding

Traditional BERT-like models pad all sequences to the same length, leading to significant computational waste:

- **Memory waste**: Padding tokens consume GPU memory but provide no learning signal
- **FLOPS waste**: Every layer processes padding tokens through expensive operations (attention, feed-forward)
- **Scaling issues**: Waste increases with batch size and sequence length variance

For protein sequences with high length variability (50-1000+ amino acids), padding can waste **65-90% of computation**.
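
As a rough illustration of where that waste comes from, the sketch below computes the fraction of token slots taken up by padding when a batch is padded to its longest sequence. The sequence lengths are made-up values and `padding_waste` is a hypothetical helper, not part of the recipe.

```python
def padding_waste(seq_lengths, pad_to=None):
    """Return the fraction of token slots in a padded batch that hold padding."""
    if pad_to is None:
        pad_to = max(seq_lengths)  # pad to the longest sequence in the batch
    # Sequences longer than pad_to are treated as truncated to pad_to.
    real_tokens = sum(min(n, pad_to) for n in seq_lengths)
    total_slots = pad_to * len(seq_lengths)
    return 1.0 - real_tokens / total_slots


# Hypothetical batch of protein lengths (in amino acids)
lengths = [50, 120, 300, 1000]
waste = padding_waste(lengths)
print(f"{waste:.1%} of token slots are padding")
```

A single long sequence sets the padded length for the whole batch, so the waste grows as the spread of lengths within a batch grows.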

### THD Format with Sequence Packing

Instead of padding, we can:
1. **Concatenate sequences** without padding tokens
2. **Pack multiple sequences** into efficient batches
3. **Use Transformer Engine with Flash Attention** and sequence boundary metadata (`cu_seq_lens`)
4. **Avoid spending FLOPs on padding** - nearly every FLOP contributes to learning
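
The steps above can be sketched in plain Python. This is a minimal illustration, not the recipe's actual collator: `pack_sequences` and `unpack_sequences` are hypothetical helpers, and the token IDs are made up. It assumes `cu_seq_lens` holds cumulative sequence offsets starting at zero, which is the layout variable-length attention kernels typically expect.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate token-ID lists into one flat list plus cumulative offsets."""
    packed = [tok for seq in sequences for tok in seq]
    # cu_seq_lens[i] is where sequence i starts; the last entry is the total length.
    cu_seq_lens = [0, *accumulate(len(seq) for seq in sequences)]
    return packed, cu_seq_lens


def unpack_sequences(packed, cu_seq_lens):
    """Recover the original sequences from the packed tokens and offsets."""
    return [packed[start:end] for start, end in zip(cu_seq_lens[:-1], cu_seq_lens[1:])]


# Three hypothetical tokenized sequences of different lengths
seqs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed, cu_seq_lens = pack_sequences(seqs)

print(packed)       # one flat run of 9 tokens, no padding
print(cu_seq_lens)  # [0, 3, 5, 9]
```

The offsets tell the attention kernel where each sequence begins and ends, so tokens attend only within their own sequence even though the whole batch is one flat tensor.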


In [None]:
%%bash
cd bionemo-framework/bionemo-recipes/recipes/esm2_native_te
torchrun train_mfsdp.py --config-name L0_sanity \
    +dataset.use_sequence_packing=true \
    num_train_steps=500

# TODO: analysis

In [None]:
!pip install seaborn

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tensorboard.backend.event_processing import event_accumulator


# Function to extract data from TensorBoard event files and convert to DataFrame
def tensorboard_to_dataframe(event_file):
    """Given a TensorBoard event file, return a pandas DataFrame with the training metrics."""
    # Load the event file
    ea = event_accumulator.EventAccumulator(
        event_file,
        size_guidance={
            event_accumulator.SCALARS: 0,  # 0 means load all
        },
    )
    ea.Reload()

    # Get list of all available tags
    tags = ea.Tags()["scalars"]

    # First, find the union of all steps
    all_steps = set()
    for tag in tags:
        events = ea.Scalars(tag)
        steps = [event.step for event in events]
        all_steps.update(steps)

    # Sort steps for proper ordering
    all_steps = sorted(all_steps)

    # Initialize the dataframe with steps
    df = pd.DataFrame({"step": all_steps})

    # Add each metric as a column
    for tag in tags:
        events = ea.Scalars(tag)
        # Create a dictionary mapping steps to values
        step_to_value = {event.step: event.value for event in events}
        # Add the values to the dataframe, using NaN for missing steps
        df[tag] = df["step"].map(step_to_value)

    return df


# Example of creating a multi-metric plot with seaborn
def plot_multiple_training_metrics(df, metrics_to_plot, figsize=(15, 10)):
    """Given a pandas DataFrame with the training metrics, plot the metrics."""
    n = len(metrics_to_plot)
    fig, axes = plt.subplots(n, 1, figsize=figsize, sharex=True)

    if n == 1:  # Handle the case of a single plot
        axes = [axes]

    sns.set_style("whitegrid")

    for i, metric in enumerate(metrics_to_plot):
        if metric in df.columns:
            sns.lineplot(x="step", y=metric, data=df, ax=axes[i], linewidth=2.5, errorbar="sd")
            axes[i].set_title(metric, fontsize=14)
            axes[i].set_ylabel("Value", fontsize=12)
    axes[-1].set_xlabel("Steps", fontsize=14)
    plt.tight_layout()
    plt.show()

In [None]:
log_dirs = !find pretraining_demo/evo2/dev -name "events.out.tfevents*"
tf_event_file = log_dirs[0]

# Extract data from your event file
df = tensorboard_to_dataframe(tf_event_file)

In [None]:
plot_multiple_training_metrics(df, ["reduced_train_loss", "lr", "grad_norm", "val_loss"])