In this notebook, we will prepare the model and the dataset for the subsequent pruning and distillation steps.

Let's define the paths to the model and the dataset.

In [None]:
HF_MODEL_NAME_OR_PATH = "meta-llama/Llama-3.1-8B"

ROOT_DIR = "/workspace"
NEMO_OUTPUT_PATH = f"{ROOT_DIR}/Llama-3.1-8B-nemo"
DATA_PATH = f"{ROOT_DIR}/wikitext-data"

### Step 1: Convert the Hugging Face model to NeMo checkpoint format

You can skip this step if you already have the model in NeMo 2.0 checkpoint format.

In [None]:
!python -c 'from nemo.collections import llm; llm.import_ckpt(llm.LlamaModel(llm.Llama31Config8B()), source="hf://{HF_MODEL_NAME_OR_PATH}", output_path="{NEMO_OUTPUT_PATH}")'

The resulting NeMo checkpoint should look similar to this:

```
Llama-3.1-8B-nemo/
├── context
│   ├── artifacts
│   │   └── generation_config.json
│   ├── io.json
│   ├── model.yaml
│   └── nemo_tokenizer
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       └── tokenizer.json
└── weights
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── common.pt
    └── metadata.json
```
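
As a quick sanity check after conversion, the layout shown above can be verified programmatically. The following is a minimal sketch, not part of NeMo: the helper name `missing_checkpoint_entries` and the list of entries are assumptions taken from the example tree, and the `.distcp` shard files are deliberately not checked because their names and count may differ between conversions.

```python
from pathlib import Path

# Entries copied from the example tree above (an assumption, not an official
# NeMo specification). The .distcp shards under weights/ are skipped on purpose.
EXPECTED_ENTRIES = [
    "context/io.json",
    "context/model.yaml",
    "context/nemo_tokenizer",
    "weights/common.pt",
    "weights/metadata.json",
]


def missing_checkpoint_entries(ckpt_dir):
    """Return the expected entries that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# In this notebook: print(missing_checkpoint_entries(NEMO_OUTPUT_PATH))
```

An empty list means every listed file and directory was found; it does not confirm that the weights themselves are valid.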


`NOTE:` To convert a NeMo model back to Hugging Face format after pruning and distillation, use the following command:

```bash
python -c 'from nemo.collections import llm; llm.export_ckpt(path="<NEMO_MODEL_PATH>", target="hf", output_path="<HF_OUTPUT_PATH>")'
```

### Step 2: Prepare the dataset

**Obtain the dataset**: Load the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset and write its train, validation, and test splits to `wikitext-{train/validation/test}.jsonl` files.

In [None]:
import json
import os

from datasets import load_dataset

# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")

# Define the destination folder
os.makedirs(DATA_PATH, exist_ok=True)


# Function to save dataset split to a JSONL file
def save_to_jsonl(file_path, data):
    with open(file_path, "w") as file:
        for item in data:
            file.write(json.dumps(item) + "\n")


# Define splits
splits = ["train", "validation", "test"]
file_paths = {split: os.path.join(DATA_PATH, f"wikitext-{split}.jsonl") for split in splits}

# Save each available split to its JSONL file
for split in splits:
    if split in dataset:
        print(f"Saving {split} split to {file_paths[split]}")
        save_to_jsonl(file_paths[split], dataset[split])
    else:
        print(f"Split {split} not found in the dataset.")

print("Dataset saved to JSONL files.")
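
Before tokenizing, it can be useful to confirm what the JSONL files contain. The sketch below counts the records in a file and how many of them have blank text. It assumes each record carries a `text` field, as the WikiText records written above do; the helper name `summarize_jsonl` and its `text_key` parameter are hypothetical and not part of NeMo or `datasets`.

```python
import json


def summarize_jsonl(file_path, text_key="text"):
    """Count the records in a JSONL file and how many have blank text."""
    total = 0
    blank = 0
    with open(file_path) as file:
        for line in file:
            if not line.strip():
                continue  # tolerate stray empty lines in the file
            record = json.loads(line)
            total += 1
            # A missing or whitespace-only field counts as blank.
            if not record.get(text_key, "").strip():
                blank += 1
    return total, blank


# In this notebook: print(summarize_jsonl(file_paths["validation"]))
```

WikiText stores one line of text per record, so a noticeable share of blank records is expected rather than a sign of a failed export.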

The dataset must then be preprocessed with the [preprocess_data_for_megatron.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py) script included in the NeMo Framework. The script tokenizes the data with the `meta-llama/Llama-3.1-8B` tokenizer and writes the result in a memory-mapped format.

> `NOTE:` In the block of code below, pass the paths to your train, test, and validation data files.

In [None]:
!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input="{DATA_PATH}/wikitext-train.jsonl" \
    --tokenizer-library=huggingface \
    --tokenizer-type="{HF_MODEL_NAME_OR_PATH}" \
    --output-prefix="{DATA_PATH}/wikitext_tokenized_train" \
    --append-eod \
    --workers=32

In [None]:
!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input="{DATA_PATH}/wikitext-validation.jsonl" \
    --tokenizer-library=huggingface \
    --tokenizer-type="{HF_MODEL_NAME_OR_PATH}" \
    --output-prefix="{DATA_PATH}/wikitext_tokenized_val" \
    --append-eod \
    --workers=32

In [None]:
!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input="{DATA_PATH}/wikitext-test.jsonl" \
    --tokenizer-library=huggingface \
    --tokenizer-type="{HF_MODEL_NAME_OR_PATH}" \
    --output-prefix="{DATA_PATH}/wikitext_tokenized_test" \
    --append-eod \
    --workers=32

After running the above scripts, you will see the preprocessed `/workspace/wikitext-data/wikitext_tokenized_{train/val/test}_text_document.{idx/bin}` files. These output files will be used in the next step.
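
To confirm that all three preprocessing runs completed, the expected output files can be checked in one pass. This is a minimal sketch under the assumption that the file names follow the pattern stated above; the helper name `missing_tokenized_files` is hypothetical.

```python
import os

# Split names and suffixes follow the --output-prefix values used above.
SPLITS = ["train", "val", "test"]
EXTENSIONS = ["bin", "idx"]


def missing_tokenized_files(data_path):
    """Return the expected .bin/.idx paths that do not exist under data_path."""
    expected = [
        os.path.join(data_path, f"wikitext_tokenized_{split}_text_document.{ext}")
        for split in SPLITS
        for ext in EXTENSIONS
    ]
    return [path for path in expected if not os.path.isfile(path)]


# In this notebook: print(missing_tokenized_files(DATA_PATH))
```

An empty list indicates that each split produced both its `.bin` and `.idx` file; any path that is returned points to the preprocessing cell that should be rerun.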