# Data Processing for NeMo 2.0 LLMs with the SlimPajama Dataset

This tutorial will guide you through the process of transforming a raw pretraining dataset into a configured data module for pretraining with a NeMo 2.0 recipe. We will use the [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B) dataset as our reference. We will also show how to exclude specific sources from the dataset; by default, all data from the `RedPajamaBook` set is excluded.

This tutorial involves four steps:

1. Download data
2. Extract data
3. Concatenate data
4. Preprocess data for NeMo 2.0/Megatron

First, we will define each step. Then we will use NeMo-Run to execute the steps sequentially, either on your local workstation using Docker or on a Slurm cluster.

### Prerequisites
This notebook assumes familiarity with [NeMo-Run](https://github.com/NVIDIA/NeMo-Run). The Docker execution and Slurm execution steps require access to Docker on your host and to a remote Slurm cluster, respectively.
You will also need to complete the following steps:

1. Set `HOST_DATA_PATH` in the first cell to the parent folder on your workstation where you want to save the data.
1. Create the directories `{HOST_DATA_PATH}/tokenizer` and `{HOST_DATA_PATH}/slimpajama`.
1. Download the Llama `tokenizer.model` file from either [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b/blob/main/tokenizer.model) or https://www.llama.com/llama-downloads/ and place it at `{HOST_DATA_PATH}/tokenizer/tokenizer.model`.
    For Hugging Face, you can do this by running:
    ```bash
    HF_TOKEN=... huggingface-cli download meta-llama/Llama-2-7b tokenizer.model --local-dir {HOST_DATA_PATH}/tokenizer/
    ```
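
The directory setup from step 2 can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `prepare_data_dirs` and `tokenizer_is_present` are hypothetical and are not part of the tutorial code.

```python
from pathlib import Path


def prepare_data_dirs(host_data_path: str) -> dict[str, Path]:
    """Create the tokenizer and slimpajama folders under host_data_path."""
    root = Path(host_data_path)
    dirs = {name: root / name for name in ("tokenizer", "slimpajama")}
    for path in dirs.values():
        # parents=True also creates host_data_path itself if it is missing.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def tokenizer_is_present(host_data_path: str) -> bool:
    """Return True once tokenizer.model has been placed in the tokenizer folder."""
    return (Path(host_data_path) / "tokenizer" / "tokenizer.model").is_file()
```

Calling `prepare_data_dirs(HOST_DATA_PATH)` before downloading the tokenizer, and `tokenizer_is_present(HOST_DATA_PATH)` afterwards, gives a quick check that the prerequisites are in place.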

> [!NOTE]
> All code for this tutorial can be found at https://github.com/NVIDIA/NeMo/tree/main/examples/llm/slimpajama.

In [2]:
import nemo_run as run

from data.download import download_slimpajama
from data.extract import run_extraction
from data.preprocess import preprocess_data

HOST_DATA_PATH = "/data"

## Download Data

First, we will configure the task to download data from Hugging Face. We will use the Hugging Face CLI for this. The function that configures the download script can be found [here](./data/download.py).

In [None]:
download_task = download_slimpajama(
    include_pattern='--include "train/chunk1/*_100*zst"',
)

# Print the configured download script
print(download_task.inline)
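
To get a feel for which files an include pattern selects, the sketch below applies the same glob to a few file paths with Python's `fnmatch` module. It assumes fnmatch-style matching, and the file paths are hypothetical examples chosen for illustration, not a listing of the actual repository.

```python
from fnmatch import fnmatchcase

INCLUDE_PATTERN = "train/chunk1/*_100*zst"

# Hypothetical repository paths, used only to illustrate the matching.
candidate_files = [
    "train/chunk1/example_train_100.jsonl.zst",
    "train/chunk1/example_train_1005.jsonl.zst",
    "train/chunk1/example_train_200.jsonl.zst",
    "train/chunk2/example_train_100.jsonl.zst",
]

# Keep only the paths that match the include pattern.
selected = [path for path in candidate_files if fnmatchcase(path, INCLUDE_PATTERN)]
print(selected)
```

Only the first two paths are selected: the third lacks the `_100` fragment and the fourth lives under a different chunk folder. Widening or narrowing the pattern is the simplest way to control how much data is downloaded.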

## Extract Data

The downloaded data is in Zstandard-compressed (`.zst`) format, so we need to extract it into JSONL files. For that, we will configure the `run_extraction` function defined [here](./data/extract.py). This function also allows excluding certain sources. By default, all data from the `RedPajamaBook` set is excluded, but this setting is configurable.

In [None]:
run_extraction??

In [None]:
extract_task = run.Partial(run_extraction, data_dir="/data/slimpajama")
extract_task
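
The source exclusion can be pictured as a per-record filter over the extracted JSONL lines. The sketch below is not the code in `data/extract.py`; it skips decompression entirely and assumes that each record stores its source under `meta.redpajama_set_name`. The name `filter_records` is hypothetical.

```python
import json
from typing import Iterable, Iterator

# Assumption: each record stores its source under meta.redpajama_set_name.
DEFAULT_EXCLUDED_SOURCES = frozenset({"RedPajamaBook"})


def filter_records(
    lines: Iterable[str],
    excluded: frozenset[str] = DEFAULT_EXCLUDED_SOURCES,
) -> Iterator[str]:
    """Yield JSONL lines whose source is not in the excluded set."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        record = json.loads(line)
        source = record.get("meta", {}).get("redpajama_set_name")
        if source in excluded:
            continue  # Drop records from excluded sources.
        yield line
```

Passing a different `excluded` set, or an empty `frozenset()`, changes which sources are dropped without touching the rest of the logic.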

## Concatenate Data

This optional step concatenates small JSONL files into a single large JSONL file. The example script is [here](./data/concat.sh), but feel free to change it based on your needs.

In [None]:
concat_task = run.Script("/nemo_run/code/data/concat.sh", args=["/data/slimpajama/train", "1"])
concat_task
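
If you prefer Python over a shell script for this step, the idea can be sketched as follows. This is a rough equivalent written for illustration, not a translation of `concat.sh`; the function name `concatenate_jsonl` and its arguments are hypothetical.

```python
from pathlib import Path


def concatenate_jsonl(input_dir: str, output_file: str) -> int:
    """Append every *.jsonl file in input_dir to output_file; return the file count."""
    output_path = Path(output_file)
    # Sort for a deterministic order and never read the output file itself.
    sources = sorted(
        p for p in Path(input_dir).glob("*.jsonl")
        if p.resolve() != output_path.resolve()
    )
    with output_path.open("wb") as out:
        for source in sources:
            with source.open("rb") as src:
                for line in src:
                    # Guarantee one record per line even if a file lacks a final newline.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
    return len(sources)
```

Streaming line by line keeps memory usage flat, which matters when the inputs add up to many gigabytes.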

## Preprocess Data

This final step converts the JSONL files into the BIN and IDX files required by NeMo and Megatron Core. It uses the `preprocess_data` function defined [here](./data/preprocess.py).

In [None]:
preprocess_data??

In [None]:
preprocess_task = run.Partial(
    preprocess_data,
    data_dir="/data/slimpajama",
    output_dir="/data/slimpajama_megatron",
    tokenizer_model="/data/tokenizer/tokenizer.model",
    tokenizer_library="sentencepiece",
)

In [None]:
preprocess_task

## Put It All Together

Now that all the tasks are configured, let's define an executor to run them on and an experiment to run them sequentially.

> [!NOTE]
> Each task can be run individually or in any combination. The notebook runs all tasks sequentially. To skip a task, remove its `exp.add(...)` call.
> This customization is handy if you already have JSONL files processed, for example, from NeMo-Curator.

In [9]:
# Define a Docker executor to run the experiment on the local workstation.
def docker_executor(host_data_path: str):
    # Package all code inside the folder.
    # NOTE: only committed changes are packaged, so commit any change you want to run.
    packager = run.GitArchivePackager(subpath="examples/llm/slimpajama")
    executor = run.DockerExecutor(
        packager=packager,
        ipc_mode="host",
        shm_size="30g",
        env_vars={"PYTHONUNBUFFERED": "1"},
        volumes=[f"{host_data_path}:/data"],
        container_image="python:3.11",
        ulimits=["memlock:-1", "stack:67108864"],
    )
    return executor

In [None]:
# HOST_DATA_PATH (set in the first cell) is the path on your host where the data is saved.
executor = docker_executor(host_data_path=HOST_DATA_PATH)

with run.Experiment("slimpajama-data-pipeline") as exp:
    exp.add(download_task, name="download_slimpajama", executor=executor)

    # Use NeMo image for the remaining tasks
    executor.container_image = "nvcr.io/nvidia/nemo:dev"
    exp.add(extract_task, name="extract_slimpajama", executor=executor)

    # examples/llm/slimpajama is automatically mounted to /nemo_run/code
    exp.add(concat_task, name="concat_slimpajama", executor=executor)
    exp.add(preprocess_task, name="preprocess_slimpajama", executor=executor)

    exp.run(sequential=True, tail_logs=True)

If the experiment runs successfully, you will see the BIN and IDX files as shown below. These files can be used directly in NeMo and Megatron data loaders.

In [3]:
!ls {HOST_DATA_PATH}/slimpajama_megatron

concatenated_chunk1.jsonl_text_document.bin
concatenated_chunk1.jsonl_text_document.idx
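
Each dataset consists of a BIN file and an IDX file that share a prefix, and Megatron-style data loaders are typically pointed at that prefix without the extension. The sketch below collects the prefixes that have both files; the helper name `find_dataset_prefixes` is hypothetical and is not part of NeMo.

```python
from pathlib import Path


def find_dataset_prefixes(output_dir: str) -> list[str]:
    """Return the path prefix of every .bin file that has a matching .idx file."""
    prefixes = []
    for bin_file in sorted(Path(output_dir).glob("*.bin")):
        # Only the final suffix is swapped, so "x.jsonl_text_document.bin"
        # is paired with "x.jsonl_text_document.idx".
        if bin_file.with_suffix(".idx").is_file():
            prefixes.append(str(bin_file.with_suffix("")))
    return prefixes
```

For the listing above, the single prefix would end in `concatenated_chunk1.jsonl_text_document`. A BIN file without its IDX counterpart usually points to an interrupted preprocessing run.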


## Appendix

### Running on Slurm

You can also run the same experiment on a remote Slurm cluster by replacing the Docker executor with a Slurm executor. A sample definition of a Slurm executor looks like this:

```python
from typing import Optional

import nemo_run as run


def slurm_executor(
    user: str,
    host: str,
    remote_job_dir: str,
    account: str,
    partition: str,
    nodes: int,
    tasks_per_node: int,
    time: str = "04:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    retries: int = 0,
) -> run.SlurmExecutor:
    # All of these arguments are required; fail early if any is missing or empty.
    if not (user and host and remote_job_dir and account and partition and nodes and tasks_per_node):
        raise RuntimeError(
            "Please set the user, host, remote_job_dir, account, partition, nodes and tasks_per_node args to use this function."
        )

    mounts = []
    if custom_mounts:
        mounts.extend(custom_mounts)

    env_vars = {
        "NVIDIA_VISIBLE_DEVICES": "void", # Might be needed for CPU only nodes with NeMo docker image
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir,
            identity="/path/to/identity/file/for/ssh/to/cluster",  # OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password
        ),
        nodes=nodes,
        ntasks_per_node=tasks_per_node,
        mem="0",
        exclusive=True,
        packager=run.GitArchivePackager(subpath="examples/llm/slimpajama"),
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time

    return executor
```