# Domain Adaptive Pre-Training (DAPT)

## Goal

Given a foundational language model (in this case llama-2-7B) that was pre-trained on a broad, general-purpose corpus, our goal is to further pretrain the model on a specific domain (in this example, ChipDesign) to enhance its understanding of domain-specific language and context. This process is called Domain-Adaptive Pretraining (DAPT). DAPT adapts a general-purpose model to specialized tasks within a particular field. Instead of training from scratch, we aim to “specialize” the model by focusing on a target domain corpus, allowing it to adapt to the unique vocabulary, semantics, and syntax of that field.

Our primary goals with respect to DAPT are as follows:
* Improve the model’s performance and accuracy on domain-specific tasks
* Ensure the model retains general language capabilities
* Minimize pretraining time by leveraging existing knowledge in the model

DAPT typically enhances a model’s efficacy in downstream tasks for the domain by exposing it to domain-relevant texts. This pretraining phase can result in more accurate and context-aware predictions on domain-specific data, as the model gains an understanding of field-specific terminology, abbreviations, and common phrases.

# NeMo Tools and Resources

* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)

# Software Requirements
* Access to latest NeMo Framework NGC Containers
* This playbook has been tested on: nvcr.io/nvidia/nemo:dev. It is expected to work similarly on other environments.


#### Launch the NeMo Framework container as follows: 

```
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:dev
```

#### Launch Jupyter Notebook as follows: 
```
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''

```


# Hardware Requirements

* This playbook has been tested on 2xA100 80G but can be scaled to multiple GPUs as well as multiple nodes by modifying the appropriate parameters

# Data

* In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv)

# Notebook Outline

* Step 1: Prepare the data for pretraining. This is a multi-step process, discussed in detail in its own section later in the notebook.

* Step 2: Download the llama-2-7B Hugging Face checkpoint and convert it to .nemo format.

* Step 3: Continue pretraining the llama-2-7b model using the prepared data and the custom-trained tokenizer (from the previous notebook).

# Step 1: Data Preparation for pretraining

Identify the different file types (for example, code or text) in the pretraining data. This is typically dataset dependent; in this case, we only have 'code' type files.

If you used the Data Curation tutorial as instructed in the Readme, point the `data_path` variable to the directory containing the curated data.

In [1]:
import os
import json

# Count the JSONL files that contain records of each file type (code, text)
def identify_jsonl_files(data_path):
    code_files = []
    text_files = []
    cnt_text = 0
    cnt_code = 0
    for root, _, files in os.walk(data_path):
        for file in files:
            if file.endswith('.jsonl'):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    has_code = False
                    has_text = False
                    for line in f:
                        try:
                            json_obj = json.loads(line.strip())
                            file_type = json_obj.get('file_type', '').lower()
                            if file_type == 'code':
                                has_code = True
                            elif file_type == 'text':
                                has_text = True
                            if has_code and has_text:
                                break
                        except json.JSONDecodeError:
                            continue
                if has_code:
                    code_files.append(file_path)
                    cnt_code = cnt_code + 1
                if has_text:
                    text_files.append(file_path)
                    cnt_text = cnt_text + 1
    return code_files, text_files, cnt_code, cnt_text

# Modify data path to point to jsonl data source, in this case data_path='code/data/all_jsonl_data'
data_path = 'code/data/all_jsonl_data'

code_files, text_files, cnt_code, cnt_text = identify_jsonl_files(data_path)

print("\nNumber of Files containing 'file_type':'text':", cnt_text)
print("Number of Files containing 'file_type':'code':", cnt_code)


Number of Files containing 'file_type':'text': 0
Number of Files containing 'file_type':'code': 8835


### Merging code JSONL files into a single JSONL file for further preprocessing

This step is optional; the workflow also works with multiple JSONL files. This example uses a single merged JSONL file.

In [3]:
import os
import json

def list_jsonl_files(directory):
    jsonl_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.jsonl'):
                jsonl_files.append(os.path.join(root, file))
    return jsonl_files

# Function to merge multiple jsonl files into a single file 
def merge_jsonl_files(directory, output_file):
    jsonl_files = list_jsonl_files(directory)
    
    with open(output_file, 'w') as outfile:
        for input_file in jsonl_files:
            with open(input_file, 'r') as infile:
                for line in infile:
                    try:
                        json_object = json.loads(line.strip())
                        json.dump(json_object, outfile)
                        outfile.write('\n')
                    except json.JSONDecodeError:
                        print(f"Skipping invalid JSON in {input_file}: {line.strip()}")

    print(f"Merged {len(jsonl_files)} JSONL files into {output_file}")

In [4]:
directory = 'code/data/all_jsonl_data'
output_file = 'code_merged_output.jsonl'
merge_jsonl_files(directory, output_file)

Merged 8835 JSONL files into code_merged_output.jsonl
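Because the preprocessing step later in this notebook reads the `text` key of each record (`--json-keys=text`), it can be worth checking the merged file before converting it. The following is a minimal sketch of such a check; the helper name `summarize_jsonl` is hypothetical and not part of NeMo.

```python
import json

def summarize_jsonl(path, key="text"):
    """Count non-blank lines, valid JSON records, and records with a non-empty `key`.

    Hypothetical helper for a quick sanity check; not part of NeMo.
    """
    total = valid = with_key = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # counted in `total` but not in `valid`
            valid += 1
            if isinstance(record, dict) and record.get(key):
                with_key += 1
    return {"lines": total, "valid_json": valid, f"with_{key}": with_key}
```

Calling `summarize_jsonl('code_merged_output.jsonl')` returns the three counts as a dictionary. If `with_text` is much smaller than `valid_json`, many records would contribute nothing to the tokenized dataset.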


### Data Format Conversion for pretraining: JSONL to bin/idx files 

For efficient pretraining, we convert data from JSONL to bin/idx format. 

JSONL files, while convenient for storing structured text data, are not optimized for high-speed data loading during large language model training. In pretraining workflows, particularly those with large datasets and complex model architectures, the need for fast data access and efficient memory management is essential.

The bin/idx format is a binary format specifically designed to facilitate high-throughput data loading. This format allows direct, randomized access to data samples, which speeds up I/O operations and reduces the memory footprint compared to loading JSONL files. By converting data to bin/idx format, hardware utilization can be maximized and bottlenecks in data processing can be avoided, leading to a more efficient pretraining process.

#### Benefits of bin/idx format for Pretraining:

* **Optimized I/O Performance:** The binary format enables quicker data reads and reduces latency, allowing the model to continuously access data at high speeds.
* **Efficient Memory Usage:** Data in bin/idx format consumes less memory during loading, making it suitable for large datasets and enabling better use of available system resources.
* **Enhanced Scalability:** With bin/idx, it’s easier to handle shuffling and batching of large datasets, which is essential for pretraining on diverse domain-specific data.
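To illustrate why a binary file paired with an index allows direct access to individual samples, here is a toy sketch of the idea. It is deliberately simplified and is not the actual Megatron bin/idx layout, which has its own headers and metadata; the function names and the JSON-based index are assumptions made for illustration only.

```python
import json
from array import array

def write_bin_idx(token_sequences, prefix):
    """Store all token IDs back to back in <prefix>.bin and their offsets in <prefix>.idx."""
    offsets = [0]  # offsets[i] is the position (in tokens) where sequence i starts
    with open(prefix + ".bin", "wb") as bin_file:
        for seq in token_sequences:
            array("i", seq).tofile(bin_file)
            offsets.append(offsets[-1] + len(seq))
    with open(prefix + ".idx", "w") as idx_file:
        json.dump(offsets, idx_file)

def read_sample(prefix, i):
    """Read only sequence i by seeking to its offset, without parsing the rest of the file."""
    with open(prefix + ".idx") as idx_file:
        offsets = json.load(idx_file)
    item_size = array("i").itemsize
    start, end = offsets[i], offsets[i + 1]
    tokens = array("i")
    with open(prefix + ".bin", "rb") as bin_file:
        bin_file.seek(start * item_size)
        tokens.fromfile(bin_file, end - start)
    return tokens.tolist()
```

Reading sample `i` costs one seek and one read of exactly that sample's bytes, whereas a JSONL file has to be scanned and parsed line by line to reach the same record.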

In [5]:
# After running through custom_tokenization.ipynb, you will have
# the new domain-adapted tokenizer model in the following directory
!ls models/tokenizer/llama2/custom_tokenizer_init_20000_json

merges.txt		 tokenizer.json		vocab.json
special_tokens_map.json  tokenizer_config.json


Modify `input` to point to the merged `jsonl` file. Similarly, modify `vocab`, `tokenizer-model`, and `merge-file` to point to the relevant file paths.

In the following code block, `tokenizer-model` is set to the original tokenizer that ships with llama2-7b-hf. If your data contains domain-specific terminology, point `tokenizer-model` to the custom tokenizer (trained in the custom tokenizer training notebook) instead.

In [None]:
!python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input='code_merged_output.jsonl' \
--json-keys=text \
--tokenizer-library=sentencepiece \
--vocab 'models/tokenizer/llama2/custom_tokenizer_init_20000_json/vocab.json' \
--dataset-impl mmap \
--tokenizer-model '/workspace/Llama-2-7b-hf/tokenizer.model' \
--tokenizer-type llama \
--merge-file 'models/tokenizer/llama2/custom_tokenizer_init_20000_json/merges.txt' \
--append-eod \
--output-prefix='preprocessed_data'

In [6]:
# If the above step runs successfully, two files with the extensions .bin and .idx will be generated
!ls 

README.md			   nemo_experiments
cdeng				   preprocessed_data_text_document
code				   preprocessed_data_text_document.bin
code_merged_output.jsonl	   preprocessed_data_text_document.idx
domain_adaptive_pretraining.ipynb  venv


# Step 2: Download Llama-2-7b Hugging Face checkpoint and convert to .nemo checkpoint

The code below assumes you already have the llama-2-7b checkpoint downloaded in ```/workspace/Llama-2-7b-hf/```

Llama-2-7b-hf checkpoint can be downloaded from https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main

In [None]:
!python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=/workspace/Llama-2-7b-hf/ --output_path=/workspace/llama2-7b.nemo

The conversion generates a `llama2-7b.nemo` file, which can be used for continued pretraining with the NeMo Toolkit as shown in Step 3.
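A quick way to confirm that the conversion produced a usable file is to list what the checkpoint contains. The sketch below assumes that a `.nemo` file is a tar archive that bundles the model configuration and weights; the helper name `list_nemo_contents` is hypothetical, and the exact member names depend on the NeMo version.

```python
import tarfile

def list_nemo_contents(nemo_path, limit=20):
    """Return up to `limit` member names from a .nemo archive (assumed to be a tar file)."""
    names = []
    # "r:*" lets tarfile auto-detect whether the archive is compressed
    with tarfile.open(nemo_path, "r:*") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break  # stop early; large checkpoints can contain many shards
    return names
```

For example, `list_nemo_contents('/workspace/llama2-7b.nemo')` should return a list of file names that includes a model configuration file alongside the weights.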

In [7]:
!ls /workspace

Llama-2-7b-hf		  dapt-custom-tokenization  megatron_llama
bin-idx-conversion.ipynb  dapt-data-curation	    megatron_llama_config.yaml
convert.py		  llama2-7b.nemo	    sentencepiece
custom-tokenizer	  loader_llama2.py	    venv


# Step 3: Continued Pretraining using Llama2-7b with NeMo

This step uses `megatron_gpt_pretraining.py` from the NeMo Toolkit for continued pretraining. The script exposes parameters that can be configured to match your setup. For example, `trainer.devices` and `model.tensor_model_parallel_size` depend on the number of GPUs available for the job.
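The device, parallelism, and batch-size settings have to be mutually consistent. The sketch below assumes the usual Megatron-style relationship: the data-parallel size is the total GPU count divided by the product of the tensor and pipeline parallel sizes, and the global batch size must be a multiple of the micro batch size times the data-parallel size. The helper `check_parallel_config` is hypothetical and only mirrors that arithmetic.

```python
def check_parallel_config(devices, num_nodes, tensor_parallel, pipeline_parallel,
                          micro_batch_size, global_batch_size):
    """Validate a parallelism/batch configuration and derive the implied values."""
    world_size = devices * num_nodes
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into model-parallel groups of {model_parallel}"
        )
    data_parallel = world_size // model_parallel
    samples_per_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size * data_parallel_size = {samples_per_step}"
        )
    return {
        "data_parallel_size": data_parallel,
        "grad_accumulation_steps": global_batch_size // samples_per_step,
    }
```

With the 2-GPU settings used in this notebook (`trainer.devices=2`, tensor parallel size 2, micro batch size 1, global batch size 4), this yields a data-parallel size of 1 and 4 gradient accumulation steps.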

Also set `model.tokenizer.model` to the path of the custom-trained tokenizer and `model.restore_from_path` to the `.nemo` checkpoint.

`model.data.data_prefix` is specified in the form `[weight, datafile]`. For example, `[1,preprocessed_data_text_document]` assigns the full weight (1) to `preprocessed_data_text_document`. If there are multiple files, each can be assigned a different weight (the weights should sum to 1) to control the data blend for pretraining.
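When blending several datasets, a small helper can build the `data_prefix` value and catch weights that do not sum to 1. This is a minimal sketch: `build_data_prefix` is a hypothetical helper, and the file names in the two-file example are placeholders rather than files produced by this notebook.

```python
def build_data_prefix(weighted_files, tol=1e-6):
    """Turn (weight, prefix) pairs into a string like '[0.7,a,0.3,b]'."""
    if not weighted_files:
        raise ValueError("at least one (weight, prefix) pair is required")
    total = sum(weight for weight, _ in weighted_files)
    if abs(total - 1.0) > tol:
        raise ValueError(f"weights sum to {total}, expected 1")
    parts = []
    for weight, prefix in weighted_files:
        if weight <= 0:
            raise ValueError(f"weight for {prefix!r} must be positive")
        parts.extend([f"{weight:g}", prefix])
    return "[" + ",".join(parts) + "]"

# Single dataset, as used in this notebook
single = build_data_prefix([(1, "preprocessed_data_text_document")])

# Hypothetical two-file blend (placeholder prefixes)
blend = build_data_prefix([(0.7, "code_text_document"), (0.3, "docs_text_document")])
```

The resulting string can be passed on the command line as `+model.data.data_prefix=<value>`.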


In [None]:
# Test the pretraining setup with mock data: model.data.data_impl=mock

!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py  \
    --config-path=/opt/NeMo/examples/nlp/language_modeling/conf \
    --config-name=megatron_llama_config \
    trainer.precision=bf16 \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.max_steps=2 \
    trainer.val_check_interval=8 \
    model.data.data_impl=mock \
    model.micro_batch_size=1 \
    model.global_batch_size=4 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.tokenizer.library=sentencepiece \
    model.tokenizer.model=/workspace/Llama-2-7b-hf/tokenizer.model \
    +model.restore_from_path=/workspace/llama2-7b.nemo \
    exp_manager.name=megatron_llama_continual \
    exp_manager.resume_ignore_no_checkpoint=false \
    exp_manager.resume_if_exists=false 

In [None]:
# Pretraining using preprocessed data (+model.data.data_prefix)

!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py  \
    --config-path=/opt/NeMo/examples/nlp/language_modeling/conf \
    --config-name=megatron_llama_config \
    trainer.precision=bf16 \
    trainer.devices=2 \
    trainer.num_nodes=1 \
    trainer.max_steps=5 \
    trainer.val_check_interval=8 \
    model.micro_batch_size=1 \
    model.global_batch_size=4 \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    model.tokenizer.library=sentencepiece \
    model.tokenizer.model=/workspace/Llama-2-7b-hf/tokenizer.model \
    model.megatron_amp_O2=True \
    +model.restore_from_path=/workspace/llama2-7b.nemo \
    +model.data.data_prefix=[1,preprocessed_data_text_document] \
    exp_manager.name=megatron_llama_continual \
    exp_manager.resume_ignore_no_checkpoint=true \
    exp_manager.resume_if_exists=false 

### To monitor the training, launch TensorBoard from another terminal

`tensorboard --logdir nemo_experiments --bind_all`