# Step 1. DAPT (Domain-Adaptive Pre-Training)

Before you begin, either create the [Step 1 dummy data](./Step0_Dummy_Data.ipynb) or prepare real data ([see here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)) for actual DAPT.
If you plan to train a real model, prepare general-purpose data in addition to the domain-specific data; both are used in continued pretraining.

This tutorial uses the Hugging Face "meta-llama/Llama-3.1-8B" model for practice.

In this step, you will perform domain-adaptive tokenization and domain-adaptive continued pretraining (DAPT).


## (1) Domain-adaptive tokenization

In [1]:
import glob
import jsonlines


MODEL_ROOT_DIR = "/work/Models" # change to your path
DATA_ROOT_DIR = "/work/Data"

all_files = sorted(glob.glob(f"{DATA_ROOT_DIR}/dapt/*.jsonl"))  # DAPT data path

# Stream every "text" field straight into one file instead of
# building a single large string in memory first.
all_text_file = f"{DATA_ROOT_DIR}/all_dapt_text.txt"
with open(all_text_file, "w", encoding="utf-8") as data_fp:
    for data_file in all_files:
        with jsonlines.open(data_file) as reader:
            for obj in reader:
                data_fp.write(obj["text"] + "\n")

print(f"Save all dapt text data to {all_text_file}")


Save all dapt text data to /work/Data/all_dapt_text.txt
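
Before concatenating, it can be worth checking that every record actually carries a usable `text` field, since a record without that key would otherwise raise a `KeyError` partway through the loop. The following is a minimal, standard-library-only sketch; the helper name `count_valid_records` and the temporary sample file are illustrative assumptions, not part of the NeMo tooling.

```python
import json
import os
import tempfile


def count_valid_records(path, key="text"):
    """Return (valid, invalid) counts of JSONL records with a non-empty string `key`."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            value = obj.get(key) if isinstance(obj, dict) else None
            if isinstance(value, str) and value.strip():
                valid += 1
            else:
                invalid += 1
    return valid, invalid


# Toy demonstration on a temporary file (made-up records).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as fp:
        fp.write('{"text": "first document"}\n')
        fp.write('{"title": "no text field"}\n')
        fp.write('{"text": ""}\n')
    sample_counts = count_valid_records(sample_path)

print(sample_counts)
```

On real data, the same function could be called once per file in `all_files` before the text file is written.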


In [2]:
tokenizer_spe_type = "bpe"
vocab_size = 100 # target vocab size for domain specific data

!python /opt/NeMo/scripts/tokenizers/process_asr_text_tokenizer.py --data_file $all_text_file --data_root=$DATA_ROOT_DIR --vocab_size=$vocab_size --tokenizer=spe --spe_type=$tokenizer_spe_type  

[NeMo I 2024-12-19 05:10:07 sentencepiece_tokenizer:378] Processing /work/Data/all_dapt_text.txt and store at /work/Data/tokenizer_spe_bpe_v100
sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=/work/Data/all_dapt_text.txt --model_prefix=/work/Data/tokenizer_spe_bpe_v100/tokenizer --vocab_size=100 --shuffle_input_sentence=true --hard_vocab_limit=false --model_type=bpe --character_coverage=1.0 --bos_id=-1 --eos_id=-1 --normalization_rule_name=nmt_nfkc_cf --remove_extra_whitespaces=false
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /work/Data/all_dapt_text.txt
  input_format: 
  model_prefix: /work/Data/tokenizer_spe_bpe_v100/tokenizer
  model_type: BPE
  vocab_size: 100
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16


In [3]:
custom_tokenizer_dir = DATA_ROOT_DIR + f"/tokenizer_spe_{tokenizer_spe_type}_v{vocab_size}"

! ls $custom_tokenizer_dir

tokenizer.model  tokenizer.vocab  vocab.txt
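
SentencePiece writes `tokenizer.vocab` as a tab-separated text file with one piece and its score per line, which makes it easy to inspect what was learned. A small sketch for reading that layout is shown below; it runs on an inline toy sample with made-up pieces and scores, and the function name `read_spe_vocab` is an assumption for illustration.

```python
import io


def read_spe_vocab(lines):
    """Parse SentencePiece .vocab lines ("piece<TAB>score") into a list of pieces."""
    pieces = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        piece, _, _score = line.partition("\t")
        pieces.append(piece)
    return pieces


# Made-up sample in the same layout as tokenizer.vocab.
toy_vocab_file = io.StringIO("<unk>\t0\n▁\t-1.5\nzr\t-2.0\n")
toy_pieces = read_spe_vocab(toy_vocab_file)
print(toy_pieces)
```

With the real file, the argument would be something like `open(f"{custom_tokenizer_dir}/tokenizer.vocab", encoding="utf-8")`.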


## (2) Add domain-specific tokens to the original tokenizer

In [4]:
import os
from nemo.collections import nlp as nemo_nlp
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import torch

HF_LLM_MODEL = "meta-llama/Llama-3.1-8B"

domain_tokenizer = nemo_nlp.modules.get_tokenizer(tokenizer_name="sentencepiece", tokenizer_model=custom_tokenizer_dir+"/tokenizer.model")

tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)
model = AutoModelForCausalLM.from_pretrained(HF_LLM_MODEL)



[NeMo I 2024-12-19 05:10:17 tokenizer_utils:106] tokenizer_model: /work/Data/tokenizer_spe_bpe_v100/tokenizer.model


Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.01it/s]


In [5]:
# Filtering Domain-Only Token

general_vocab = set(tokenizer.vocab.keys())
domain_vocab = set(domain_tokenizer.vocab)
domain_only_vocab = domain_vocab - general_vocab
domain_only_vocab = list(domain_only_vocab)
print("Domain Only Vocab: ", domain_only_vocab)

Domain Only Vocab:  ['jv', '<unk>', 'nq', 'zr', 'zg', '▁', 'uq']
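
The output above includes `<unk>` and the bare `▁` marker, which are SentencePiece artifacts rather than domain terms. One possible refinement, sketched here on toy data, is to drop such entries before extending the tokenizer. The set of excluded tokens and the helper name `filter_domain_tokens` are assumptions, and whether this filtering is appropriate depends on your data.

```python
def filter_domain_tokens(domain_vocab, general_vocab,
                         excluded=("<unk>", "<s>", "</s>", "▁")):
    """Return sorted tokens that are new to the general vocab and not in `excluded`."""
    candidates = set(domain_vocab) - set(general_vocab)
    return sorted(tok for tok in candidates if tok not in excluded and tok.strip())


# Toy vocabularies (made-up values).
toy_general = {"a", "b", "the"}
toy_domain = ["<unk>", "▁", "zr", "the", "jv"]
toy_new_tokens = filter_domain_tokens(toy_domain, toy_general)
print(toy_new_tokens)
```

Sorting the result also makes the assigned token IDs reproducible, since the iteration order of a set of strings can differ between Python processes.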


In [6]:
print("Ori Vocab: ", len(tokenizer))
tokenizer.add_tokens(domain_only_vocab)
model.resize_token_embeddings(len(tokenizer))
print("New Vocab: ", len(tokenizer))

Ori Vocab:  128256


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


New Vocab:  128263


## (3) Reinitialize the embedding matrix of the LLM

In [7]:
def get_embedding_mean(tokens, tokenizer):
    """Return, for each token, the mean input embedding of its sub-tokens."""
    embedding_layer = model.get_input_embeddings()
    embedding_values = []
    with torch.no_grad():
        for token in tokens:
            # Split the token into pieces known to `tokenizer`, then average their embeddings.
            split_token = tokenizer.tokenize(token, add_special_tokens=False)
            token_ids = tokenizer.convert_tokens_to_ids(split_token)
            embeddings = embedding_layer.weight[token_ids]
            avg_embedding = embeddings.mean(dim=0)
            embedding_values.append(avg_embedding)

    return embedding_values

embedding_values = get_embedding_mean(domain_only_vocab, tokenizer)
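
The idea behind this initialization is to start each new token's input embedding at the average of the embeddings of the pieces the original tokenizer would have split it into. One thing worth checking in practice: if the tokens have already been added to `tokenizer`, tokenizing a new token may simply return that token itself, in which case the average is just its freshly resized row. The identical vectors printed by the verification cell further down are consistent with that. The dependency-free toy below illustrates the intended computation using a snapshot of the original vocabulary; the tiny vocabulary, the character-level split, and all names are assumptions for illustration only.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy "original" vocabulary and embedding table (made-up values).
original_vocab = {"j": 0, "v": 1, "z": 2}
embedding_table = [
    [1.0, 0.0],  # j
    [0.0, 1.0],  # v
    [2.0, 2.0],  # z
]


def split_with_original_vocab(token):
    """Stand-in for the original tokenizer: split into known single characters."""
    return [ch for ch in token if ch in original_vocab]


def init_embedding(token):
    """Average the embeddings of the original sub-tokens of `token`."""
    piece_ids = [original_vocab[p] for p in split_with_original_vocab(token)]
    return mean_vector([embedding_table[i] for i in piece_ids])


jv_embedding = init_embedding("jv")
print(jv_embedding)
```

In the notebook's setting, a comparable effect could likely be obtained by computing the sub-token IDs with a tokenizer instance loaded before `add_tokens` was called.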

In [8]:

def set_embedding_value(tokens, new_tokenizer, mean_emb_values):
    """Overwrite the input and output embedding rows of the newly added tokens."""
    new_embedding_layer = model.get_input_embeddings()
    output_embedding_layers = model.get_output_embeddings()
    with torch.no_grad():
        for i, token in enumerate(tokens):
            token_id = new_tokenizer.convert_tokens_to_ids(token)
            # Input embedding: mean of the sub-token embeddings.
            new_embedding_layer.weight[token_id] = mean_emb_values[i]
            # Output (lm_head) row: zeros, so the new token starts with a logit of 0.
            output_embedding_layers.weight[token_id] = torch.zeros_like(mean_emb_values[i])


set_embedding_value(domain_only_vocab, tokenizer, embedding_values)

In [9]:
# Verify that the initialization was applied
embedding_layer = model.get_input_embeddings()
output_embedding_layer = model.get_output_embeddings()

for i, token in enumerate(domain_only_vocab):
    token_id = tokenizer.convert_tokens_to_ids(token)
    ori_value = embedding_values[i].data.numpy()
    init_value = embedding_layer.weight[token_id].data.numpy()
    out_value = output_embedding_layer.weight[token_id].data.numpy()
    print(f"Embedding for {token}: {init_value}", "Is Same: ", ori_value==init_value)
    print(f"Output Embedding for {token}: {out_value}")

Embedding for jv: [-0.00240288 -0.00399607  0.00210693 ...  0.00032894 -0.00118068
  0.0008361 ] Is Same:  [ True  True  True ...  True  True  True]
Output Embedding for jv: [0. 0. 0. ... 0. 0. 0.]
Embedding for <unk>: [-0.00240288 -0.00399607  0.00210693 ...  0.00032894 -0.00118068
  0.0008361 ] Is Same:  [ True  True  True ...  True  True  True]
Output Embedding for <unk>: [0. 0. 0. ... 0. 0. 0.]
Embedding for nq: [-0.00240288 -0.00399607  0.00210693 ...  0.00032894 -0.00118068
  0.0008361 ] Is Same:  [ True  True  True ...  True  True  True]
Output Embedding for nq: [0. 0. 0. ... 0. 0. 0.]
Embedding for zr: [-0.00240288 -0.00399607  0.00210693 ...  0.00032894 -0.00118068
  0.0008361 ] Is Same:  [ True  True  True ...  True  True  True]
Output Embedding for zr: [0. 0. 0. ... 0. 0. 0.]
Embedding for zg: [-0.00240288 -0.00399607  0.00210693 ...  0.00032894 -0.00118068
  0.0008361 ] Is Same:  [ True  True  True ...  True  True  True]
Output Embedding for zg: [0. 0. 0. ... 0. 0. 0.]
Embe

In [10]:
# Save Converted Model
new_hf_model_path = f"{MODEL_ROOT_DIR}/llama3-new-token"

tokenizer.save_pretrained(new_hf_model_path)
model.save_pretrained(new_hf_model_path)

## (4) Convert HF model to .nemo

In [11]:

nemo_ckpt_path = os.path.join(new_hf_model_path, "model.nemo")
precision = "bf16"

# Convert HF Model to NeMo
!python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path $new_hf_model_path --output_path $nemo_ckpt_path --precision $precision --llama31 True 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-19 05:12:39 convert_llama_hf_to_nemo:128] loading checkpoint /work/Models/llama3-new-token
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:01<00:00,  4.64it/s]
hf_config: {'vocab_size': 128263, 'max_position_embeddings': 131072, 'hidden_size': 4096, 'intermediate_size': 14336, 'num_hidden_layers': 32, 'num_attention_heads': 32, 'num_key_value_heads': 8, 'hidden_act': 'silu', 'initializer_range': 0.02, 'rms_norm_eps': 1e-05, 'pretraining_tp': 1, 'use_cache': True, 'rope_theta': 500000.0, 'rope_scaling': {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}, 'attention_bias': False, 'attention_dropout': 0.0, 'mlp_bias': False, 'head_dim': 128, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': torch.float32, 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_emb

## (5) Convert JSONL data to MMAP

In [12]:
# To train a real model, you also need to convert the general-purpose data in the same way.

domain_data_folder = f"{DATA_ROOT_DIR}/dapt"
os.makedirs(f"{DATA_ROOT_DIR}/mmap", exist_ok=True)
output_folder = f"{DATA_ROOT_DIR}/mmap/da_mmap"

!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=$domain_data_folder \
--json-keys=text \
--tokenizer-library=huggingface \
--dataset-impl mmap \
--tokenizer-type $new_hf_model_path \
--output-prefix=$output_folder \
--append-eod \
--workers=4 --preproc-folder

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Searching folder for .json or .jsonl or json.gz or .jsonl.gz files...
Found 4 .json or .jsonl or json.gz or .jsonl.gz files.
[NeMo I 2024-12-19 05:14:51 tokenizer_utils:185] Getting HuggingFace AutoTokenizer with pretrained_model_name: /work/Models/llama3-new-token
Vocab size: 128263
Output prefix: /work/Data/mmap/da_mmap
Time to startup: 0.4700319766998291
[NeMo I 2024-12-19 05:14:51 tokenizer_utils:185] Getting HuggingFace AutoTokenizer with pretrained_model_name: /work/Models/llama3-new-token
[NeMo I 2024-12-19 05:14:51 tokenizer_utils:185] Getting HuggingFace AutoTokenizer with pretrained_model_name: /work/Models/llama3-new-token
[NeMo I 2024-12-19 05:14:51 tokenizer_utils:185] Getting HuggingFace AutoTokenizer with pretrained_model_name: /work/Models/llama3-new-token
Processing file /work/Data/dapt/dapt_data2.jsonl 1/4
[NeMo I 2024-12-19 05:14:51 tokenizer_utils:185] Getting HuggingFace AutoTokenizer with pretrained_model_name: /work/Models/llama3-
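
For a real run, the general-purpose corpus would be converted the same way, and the two resulting prefixes combined into a weighted list for `model.data.data_prefix`, alternating weight and path. The helper below is a sketch of how that override string could be built; the 7:3 ratio and the `general_mmap` path are placeholders, not recommendations.

```python
def build_data_prefix(blend):
    """Format (weight, path) pairs as a list override: [w1,p1,w2,p2,...]."""
    total = sum(weight for weight, _ in blend)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    parts = []
    for weight, path in blend:
        parts.append(f"{weight / total:.4g}")  # normalize so the weights sum to 1
        parts.append(path)
    return "[" + ",".join(parts) + "]"


blend_override = build_data_prefix([
    (7, "/work/Data/mmap/da_mmap_text_document"),
    (3, "/work/Data/mmap/general_mmap_text_document"),  # hypothetical path
])
print(blend_override)
```

The resulting string would take the place of `[1.0,$data_prefix]` in the training command.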

## (6) Domain adaptive continued pretraining

In [None]:
"""
To train a real model, blend the domain-specific data with general-purpose data and train on both.
Adjust the hyperparameters below as needed.
"""

data_prefix = output_folder + "_text_document"
output_dir = "/work/log/megatron_llama_dapt"
max_steps = 10  # 23200
global_batch_size = 64  # 256

TP = 4
PP = 2

!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py  \
    --config-path=/opt/NeMo/examples/nlp/language_modeling/conf \
    --config-name=megatron_llama_config \
    restore_from_path=$nemo_ckpt_path \
    trainer.devices=8 \
    trainer.num_nodes=1 \
    trainer.max_steps=$max_steps \
    trainer.val_check_interval=10 \
    trainer.log_every_n_steps=5 \
    trainer.limit_val_batches=8 \
    trainer.limit_test_batches=8 \
    trainer.accumulate_grad_batches=1 \
    trainer.precision=bf16 \
    model.micro_batch_size=1 \
    model.global_batch_size=$global_batch_size \
    model.tensor_model_parallel_size=$TP \
    model.pipeline_model_parallel_size=$PP \
    model.tokenizer.library=huggingface \
    model.tokenizer.type=$new_hf_model_path \
    model.tokenizer.model=null \
    model.megatron_amp_O2=true \
    model.encoder_seq_length=4096 \
    model.sequence_parallel=true \
    ++model.data.data_prefix=[1.0,$data_prefix] \
    model.data.num_workers=8 \
    model.optim.name=fused_adam \
    model.optim.lr=5e-6 \
    model.optim.betas=[0.9,0.95] \
    exp_manager.explicit_log_dir=$output_dir \
    exp_manager.resume_if_exists=true \
    exp_manager.resume_ignore_no_checkpoint=true \
    exp_manager.create_checkpoint_callback=true \
    exp_manager.create_wandb_logger=true \
    exp_manager.wandb_logger_kwargs.project=DAPT \
    exp_manager.wandb_logger_kwargs.name=step1 \
    exp_manager.checkpoint_callback_params.monitor=val_loss \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.checkpoint_callback_params.mode=min \
    exp_manager.checkpoint_callback_params.always_save_nemo=false \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=true \
    ~model.optim.sched
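
The parallelism and batch settings above interact: with 8 GPUs, tensor parallelism 4 and pipeline parallelism 2, only one data-parallel replica remains, so the global batch is reached by accumulating over micro-batches. The arithmetic below is a sketch of the usual Megatron-style relationship (global batch = micro batch × data-parallel size × number of micro-batches); the function name `batch_layout` is illustrative.

```python
def batch_layout(num_gpus, tp, pp, micro_batch_size, global_batch_size, seq_length):
    """Derive data-parallel size, micro-batches per step, and tokens per step."""
    if num_gpus % (tp * pp) != 0:
        raise ValueError("num_gpus must be divisible by tp * pp")
    data_parallel = num_gpus // (tp * pp)
    per_pass = micro_batch_size * data_parallel
    if global_batch_size % per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel"
        )
    return {
        "data_parallel": data_parallel,
        "micro_batches_per_step": global_batch_size // per_pass,
        "tokens_per_step": global_batch_size * seq_length,
    }


# Values taken from the command above.
layout = batch_layout(num_gpus=8, tp=4, pp=2, micro_batch_size=1,
                      global_batch_size=64, seq_length=4096)
print(layout)
```

Running the same check with the settings for a real run helps catch a global batch size that does not divide evenly before a job is launched.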