## Parallel Head Data Preprocessing Example

### Background and motivation
In this notebook, we demonstrate how to preprocess data for training a model with multiple prediction heads in parallel using the BioNeMo Evo2 framework. With parallel heads, a single model can be trained on several biological data types at once, such as DNA sequence together with RNA-seq or ChIP-seq signal tracks.

For this example, we will focus on preprocessing RNA-seq data from BigWig files and preparing it for model training.
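Before preprocessing, it can help to look at what a BigWig track actually contains. The sketch below is not part of the BioNeMo pipeline: `summarize_signal` and `summarize_bigwig` are hypothetical helper names, and the second one assumes the third-party `pyBigWig` package is installed. It treats NaN values as positions with no coverage, which is how `pyBigWig` reports bases without data.

```python
import math
from typing import Iterable


def summarize_signal(values: Iterable[float]) -> dict:
    """Summarize a per-base signal track, treating NaN as 'no coverage'."""
    total = 0
    covered = []
    for value in values:
        total += 1
        if not math.isnan(value):
            covered.append(value)
    return {
        "positions": total,
        "covered_fraction": len(covered) / total if total else 0.0,
        "mean_signal": sum(covered) / len(covered) if covered else 0.0,
        "max_signal": max(covered) if covered else 0.0,
    }


def summarize_bigwig(path: str, max_bases: int = 100_000) -> dict:
    """Summarize the first `max_bases` bases of every contig in a BigWig file."""
    # Assumption: pyBigWig is installed separately; it is imported lazily so the
    # pure-Python helper above stays usable without it.
    import pyBigWig

    bw = pyBigWig.open(path)
    try:
        return {
            chrom: summarize_signal(bw.values(chrom, 0, min(length, max_bases)))
            for chrom, length in bw.chroms().items()
        }
    finally:
        bw.close()
```

Once the files are downloaded later in this notebook, calling `summarize_bigwig` on `SRR1145649_forward.normalized.bw` gives a quick per-contig view of how much of the genome carries RNA-seq signal.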

In [None]:
# Replace the installed config.py with the modified version for parallel head support, saving a backup of the original.
# `cp -n` (no-clobber) keeps the first backup intact if this cell is re-run after config.py has already been replaced.
!cp -n /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/config.py /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/config.py.bak
!cp /workspace/sub-packages/bionemo-evo2/src/bionemo/evo2/utils/config.py /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/config.py

In [2]:
import os

from bionemo.core.utils.subprocess_utils import run_subprocess_safely  # noqa


data_path = "parallel_head_data"

In [None]:
# Remove outputs from any previous run so the notebook starts from a clean state.
CLEANUP: bool = True
if CLEANUP and os.path.exists(data_path):
    !rm -rf {data_path}
    !rm -rf ./preprocessed_data
    !rm -f parallel_preprocess_config.yaml

In [None]:
if not os.path.exists(data_path):
    !mkdir -p {data_path}
    !wget https://storage.googleapis.com/tbb-public-bucket/datasets/parallel-head-example/GCA_000525045.1_DREv1_genomic.fna -O {data_path}/GCA_000525045.1_DREv1_genomic.fna
    !wget https://storage.googleapis.com/tbb-public-bucket/datasets/parallel-head-example/SRR1145649_forward.normalized.bw -O {data_path}/SRR1145649_forward.normalized.bw
    !wget https://storage.googleapis.com/tbb-public-bucket/datasets/parallel-head-example/SRR1145649_reverse.normalized.bw -O {data_path}/SRR1145649_reverse.normalized.bw
    !wget https://storage.googleapis.com/tbb-public-bucket/datasets/parallel-head-example/GCA_000525045.1_DREv1_genomic.gtf -O {data_path}/GCA_000525045.1_DREv1_genomic.gtf

In [None]:
# Create a YAML config for preprocessing with RNA-seq BigWig files.
fasta_base = "GCA_000525045.1_DREv1_genomic.fna"
bigwig_forward = "SRR1145649_forward.normalized.bw"  # No need for reverse, since both are handled together.
full_fasta_path = os.path.abspath(os.path.join(data_path, fasta_base))
output_prefix = "fungi_dna_rnaseq"

output_dir = os.path.abspath("preprocessed_data")
output_yaml = f"""
- datapaths: ["{full_fasta_path}"]
  output_dir: "{output_dir}"
  output_prefix: {output_prefix}
  train_split: 0.9
  valid_split: 0.05
  test_split: 0.05
  overwrite: true
  embed_reverse_complement: true
  random_reverse_complement: 0.0
  random_lineage_dropout: 0.0
  include_sequence_id: false
  transcribe: "back_transcribe"
  force_uppercase: false
  indexed_dataset_dtype: "uint8"
  tokenizer_type: "Byte-Level"
  vocab_file: null
  vocab_size: null
  merges_file: null
  pretrained_tokenizer_model: null
  special_tokens: null
  fast_hf_tokenizer: true
  append_eod: true
  enforce_sample_length: null
  ftfy: false
  workers: 1
  preproc_concurrency: 100000
  chunksize: 25
  drop_empty_sequences: true
  nnn_filter: false  # If you split your fasta on NNN (in human these are contigs), then you should set this to true.
  seed: 12342  # Not relevant because we are not using random reverse complement or lineage dropout.
  fasta_rnaseq_bigwig_map:
    {full_fasta_path}: {os.path.abspath(os.path.join(data_path, bigwig_forward))}
"""
with open("parallel_preprocess_config.yaml", "w") as f:
    print(output_yaml, file=f)
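A quick sanity check on the config can catch path or split mistakes before the preprocessing run. The following is a minimal sketch, not part of BioNeMo: `check_preprocess_entry` is a hypothetical helper that works on one already-parsed config entry (a plain `dict`), and it assumes that the three split fractions are expected to sum to 1 and that every FASTA in `datapaths` should have an entry in `fasta_rnaseq_bigwig_map`.

```python
import math
import os


def check_preprocess_entry(entry: dict) -> list:
    """Return a list of human-readable problems found in one config entry."""
    problems = []

    # Assumption: train/valid/test fractions should add up to 1.
    splits = [entry.get(key, 0.0) for key in ("train_split", "valid_split", "test_split")]
    if not math.isclose(sum(splits), 1.0, abs_tol=1e-9):
        problems.append(f"splits sum to {sum(splits)}, expected 1.0")

    # Assumption: each FASTA is paired with one (forward) BigWig file.
    bigwig_map = entry.get("fasta_rnaseq_bigwig_map") or {}
    for fasta in entry.get("datapaths", []):
        if not os.path.isfile(fasta):
            problems.append(f"missing FASTA: {fasta}")
        bigwig = bigwig_map.get(fasta)
        if bigwig is None:
            problems.append(f"no BigWig mapped to FASTA: {fasta}")
        elif not os.path.isfile(bigwig):
            problems.append(f"missing BigWig: {bigwig}")
    return problems
```

If PyYAML is available in the environment, the entry can be loaded with `yaml.safe_load(open("parallel_preprocess_config.yaml"))[0]` and passed to the helper; an empty list means no problems were found.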

In [None]:
# Now we can run the preprocessing script with this config.
!python \
    /workspace/sub-packages/bionemo-evo2/src/bionemo/evo2/utils/heads/preprocess.py \
    --config parallel_preprocess_config.yaml

Now that we have a prepared dataset, we can train a model using the parallel head approach. This requires a model architecture with multiple output heads and a training setup that optimizes all heads simultaneously.

We will use the simple dataset we created in the previous section to illustrate this process.
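To make "optimizing each head simultaneously" concrete, a common pattern in multi-head training is to reduce the per-head losses to one scalar with a weighted sum. The sketch below is a generic illustration of that pattern in plain Python, not the loss used by `train_parallel.py`; the function name, head names, and weights are all hypothetical.

```python
from typing import Dict, Optional


def combine_head_losses(
    losses: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-head losses; heads without a weight default to 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss for name, loss in losses.items())


# Hypothetical values: a DNA next-token loss and an RNA-seq regression loss.
total = combine_head_losses({"dna": 1.5, "rna_seq": 0.5}, weights={"rna_seq": 0.5})
print(total)  # 1.75
```

Because the heads share one backbone, the gradient of this single scalar updates the shared parameters with contributions from every head, and the weights control how strongly each signal influences training.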

In [None]:
# First, let's get a model to train.
if not os.path.exists("nemo2_evo2_1b_8k"):
    !evo2_convert_to_nemo2 \
      --model-path hf://arcinstitute/savanna_evo2_1b_base \
      --model-size 1b --output-dir nemo2_evo2_1b_8k

In [None]:
# Configure the training dataset
from pathlib import Path


output_pfx = str(Path(os.path.abspath("preprocessed_data")) / output_prefix)
output_yaml = f"""
- dataset_prefix: {output_pfx}_byte-level_train
  dataset_split: train
  dataset_weight: 1.0
- dataset_prefix: {output_pfx}_byte-level_val
  dataset_split: validation
  dataset_weight: 1.0
- dataset_prefix: {output_pfx}_byte-level_test
  dataset_split: test
  dataset_weight: 1.0
"""
with open("training_data_config.yaml", "w") as f:
    print(output_yaml, file=f)
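Training fails early if a `dataset_prefix` does not point at real files, so it can be worth checking the prefixes first. This sketch assumes the preprocessing step writes Megatron-style indexed datasets, that is, a `.bin` and `.idx` file per prefix; both that layout and the helper name `missing_dataset_files` are assumptions rather than something stated in this notebook.

```python
import os
from typing import Iterable, List, Sequence


def missing_dataset_files(
    prefixes: Iterable[str],
    suffixes: Sequence[str] = (".bin", ".idx"),  # assumed indexed-dataset layout
) -> List[str]:
    """Return every expected `prefix + suffix` path that does not exist on disk."""
    return [
        prefix + suffix
        for prefix in prefixes
        for suffix in suffixes
        if not os.path.isfile(prefix + suffix)
    ]
```

In this notebook it could be called as `missing_dataset_files([f"{output_pfx}_byte-level_{split}" for split in ("train", "val", "test")])`; an empty list means every expected file is present.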

In [None]:
# Copy the heads folder from the source tree into the installed bionemo package.
!cp -r \
    /workspace/sub-packages/bionemo-evo2/src/bionemo/evo2/utils/heads \
    /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/

# Also copy the loss folder from the source tree into the installed bionemo package.
!cp -r \
    /workspace/sub-packages/bionemo-evo2/src/bionemo/evo2/utils/loss \
    /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/

In [None]:
# Now let's go ahead and train a model with parallel heads!
WARMUP_STEPS = 100
MAX_STEPS = 1000
VAL_CHECK_INTERVAL = 25

MODEL_SUBNET_OPTION = "--activation-checkpoint-recompute-num-layers 5"

!NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 python \
    /workspace/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train_parallel.py \
    -d training_data_config.yaml \
    --dataset-dir ./preprocessed_data \
    --result-dir parallel_pretraining_demo \
    --experiment-name evo2 \
    --model-size 1b \
    --devices 2 \
    --num-nodes 1 \
    --seq-length 8192 \
    --micro-batch-size 2 \
    --lr 0.000015 \
    --min-lr 0.0000149 \
    --warmup-steps {WARMUP_STEPS} \
    --grad-acc-batches 4 \
    --max-steps {MAX_STEPS} \
    --ckpt-dir nemo2_evo2_1b_8k \
    --clip-grad 5 \
    --wd 0.001 \
    --attention-dropout 0.01 \
    --hidden-dropout 0.01 \
    --val-check-interval {VAL_CHECK_INTERVAL} \
    {MODEL_SUBNET_OPTION} \
    --create-tensorboard-logger \
    --parallel-heads \
    --parallel-dna-head \
    --parallel-rna-seq-head \
    --ckpt-async-save
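The batch-related flags above determine how much data each optimizer step consumes. The arithmetic below is a sketch under one stated assumption: all GPUs are used for data parallelism (tensor, pipeline, and context parallel sizes of 1). The helper name `tokens_per_step` and its `model_parallel_size` parameter are hypothetical.

```python
def tokens_per_step(
    devices: int,
    num_nodes: int,
    micro_batch_size: int,
    grad_acc_batches: int,
    seq_length: int,
    model_parallel_size: int = 1,  # assumption: pure data parallelism
) -> tuple:
    """Return (sequences per optimizer step, tokens per optimizer step)."""
    data_parallel = (devices * num_nodes) // model_parallel_size
    global_batch = data_parallel * micro_batch_size * grad_acc_batches
    return global_batch, global_batch * seq_length


# Values taken from the training command in this notebook.
global_batch, tokens = tokens_per_step(
    devices=2, num_nodes=1, micro_batch_size=2, grad_acc_batches=4, seq_length=8192
)
print(global_batch, tokens)  # 16 131072
```

Under that assumption, each step processes 16 sequences of 8,192 tokens, so the 1,000-step run sees about 131 million tokens in total.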

In [None]:
CLEANUP: bool = True
if CLEANUP and os.path.exists(data_path):
    !rm -rf {data_path}
    !rm -rf parallel_pretraining_demo
    !rm -rf preprocessed_data
    !rm -f parallel_preprocess_config.yaml training_data_config.yaml
    # Restore the original config.py from the backup made at the start of the notebook.
    !mv /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/config.py.bak /usr/local/lib/python3.12/dist-packages/bionemo/evo2/utils/config.py