# Building Your Own Small Language Model from Scratch

Author: [Yoshi Suhara](https://github.com/suhara)

**_NOTE_**: This notebook has been tested in the following environment:
- [NeMo Framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) (`nvcr.io/nvidia/nemo:24.01.framework`) 

## Overview

Small Language Models (SLMs) are language models with relatively few parameters (~3B), which lets them run on consumer GPUs (e.g., NVIDIA RTX GPUs) and embedded systems (e.g., NVIDIA Jetson Orin). Recently, many open-source SLMs such as [phi2](https://huggingface.co/microsoft/phi-2), [Gemma](https://huggingface.co/google/gemma-2b), and [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) have been released. Training SLMs does not require as many computational resources as training larger models, which makes it practical to train your own SLM for a specific purpose.

In this tutorial, you will learn how to train an SLM from scratch using pre-training and supervised fine-tuning techniques. You'll also learn how to generate text using your SLM. The exact number of GPUs needed will depend on the amount of training data and the model architecture you use. This notebook was tested on 1x A100/H100-80GB GPU.


### Objective

In this notebook, you will learn how to:
- Pre-train an SLM from scratch
- Generate responses from the trained SLM
- Fine-tune the pre-trained SLM

This tutorial uses the following NVIDIA services and resources:
- [NeMo Framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) 
- [NeMo Toolkit](https://github.com/NVIDIA/NeMo)
- [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner)

This tutorial uses the following open-source components and resources:
- [Hugging Face Tokenizers](https://huggingface.co/docs/tokenizers/en/index)
- [Hugging Face Models](https://huggingface.co/models)
- [Hugging Face Datasets](https://huggingface.co/datasets)
- [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) - Hugging Face
- [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)


The steps performed include:
- Step 1: Downloading Pre-training Data and Tokenizer
- Step 2: Preprocessing Pre-training Data
- Step 3: Pre-training SLM from Scratch
- Step 4: Generating Text using Your SLM
- Step 5: Improving Trained SLM to Talk Better (Optional)


## Background: Training Techniques for SLMs


### Pre-training

**Pre-training** is an important step for LLMs to learn knowledge from a large amount of textual data. The same techniques used for LLMs can be used to train SLMs. A recent trend is to use high-quality data, including data synthetically generated by larger LMs. For example, the authors of Microsoft's phi series report that they trained the models on a small amount of "textbook quality" pre-training data generated by GPT-3.5/4 ([paper](https://arxiv.org/abs/2306.11644)).

In this tutorial, we will use a subset of [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia), which was generated by the Mixtral-8x7B model, one of the best open-source LLMs.


### Supervised Fine-tuning / Alignment (Optional) 

**Supervised Fine-Tuning (SFT)** is the process of training all of a model’s parameters on supervised data of inputs and outputs to help it acquire **instruction-following** capability. This notebook describes the steps involved in fine-tuning your custom SLM. Another type of training uses human preference data to teach the model to produce outputs that humans would prefer. This type of alignment is often called **Reinforcement Learning from Human Feedback (RLHF)**. [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner) implements major techniques including PPO, DPO, and NVIDIA's preference-based alignment technique SteerLM.


## Before you begin

### Getting NeMo Framework

[NeMo Framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo)  is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.

If you haven't already, you can pull a container that includes the version of NeMo Framework and all dependencies needed for this notebook with the following:

```bash
docker pull nvcr.io/nvidia/nemo:24.01.framework
```

The best way to run this notebook is from within the container. You can do that by launching the container with the following command:

```bash
docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.01.framework
```

Then, from within the container, start the Jupyter server with:

```bash
jupyter lab --no-browser --port=8080 --allow-root --ip 0.0.0.0
```

Alternatively, you can combine the two steps and directly launch a Jupyter Lab server with the following Docker command:

```bash
docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.01.framework bash -c "jupyter lab --no-browser --port=8080 --allow-root --ip 0.0.0.0"
```



### Installation

Install the following packages required to execute this notebook. 

In [None]:
!pip install --quiet ipywidgets
!pip install --upgrade --quiet huggingface_hub

Restart the kernel after installing packages:

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [None]:
import json
import os
import random
import requests

from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer

## Step 1: Download Pre-training Data and Tokenizer

Collecting high-quality data and blending it to create pre-training data is crucial for the quality of SLMs. It is becoming more common to use powerful LLMs such as GPT-3.5/4 and/or Mixtral 8x7B to generate synthetic data to train SLMs. Synthetic data generation is an active research area, and different recipes have been created.

In this example, we will use a small portion of [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) to quickly walk through the training process with the NeMo Framework.


### Cosmopedia 100k data

Run the following command to download [Cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k)---a subset of Cosmopedia.

In [None]:
# Download Cosmopedia 100k data
TEXT_FILE = "cosmopedia-100k.jsonl"

if not os.path.exists(TEXT_FILE):
    dataset = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
    dataset.to_json(TEXT_FILE, orient="records", lines=True)
else:
    print("Dataset already downloaded. Skip.")

### Tokenizer

You will need a pre-trained tokenizer to preprocess the pre-training dataset. In this tutorial, we will use a pre-trained Byte Pair Encoding (BPE) tokenizer trained for GPT-2. More details about BPE and other types of tokenizers can be found in [Hugging Face's documentation](https://huggingface.co/docs/transformers/en/tokenizer_summary). 

The pre-trained tokenizer needs the following two files:

- `vocab.json` contains mapping between tokens and token IDs.
- `merges.txt` stores the rules on what tokens will be merged at each iteration.
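
To make the roles of the two files concrete, here is a minimal sketch of how merge rules and a vocabulary turn a word into token IDs. The merge rules and vocabulary below are made up for illustration, and the sketch works on characters, whereas the GPT-2 tokenizer works on bytes and has many more rules.

```python
# Toy illustration of BPE merging. The merge rules and vocabulary are
# hypothetical; the real ones live in merges.txt and vocab.json.
def apply_bpe(word, merges):
    """Greedily apply merge rules (earlier rules have higher priority)."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank, if any pair is known
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]           # plays the role of merges.txt
toy_vocab = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4, "e": 5, "r": 6}  # plays the role of vocab.json

tokens = apply_bpe("lower", toy_merges)
token_ids = [toy_vocab[t] for t in tokens]
print(tokens, token_ids)  # ['low', 'er'] [0, 1]
```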

In [None]:
VOCAB_FILE = "vocab.json"
MERGE_FILE = "merges.txt"

# Download the tokenizer files
!wget -nc https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O {VOCAB_FILE}
!wget -nc https://huggingface.co/openai-community/gpt2/resolve/main/merges.txt -O {MERGE_FILE}

In [None]:
# Confirm that cosmopedia-100k.jsonl, merges.txt, and vocab.json exist
!ls

## Step 2: Preprocessing Pre-training Data

Next, you will need to tokenize the pre-training data using the tokenizer that you just downloaded, which converts the data into a format that is ready to use for training. More specifically, we want each document to be converted into a sequence of token IDs, which is the input format for the LM.

[The NeMo Toolkit](https://github.com/NVIDIA/NeMo/tree/main) has [a preprocessing script](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py) for this purpose. The script will create two binary files (.idx and .bin), which can be efficiently loaded during the pre-training step.
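
The idea behind splitting the data into an index file and a binary file can be sketched as follows. This is a simplified illustration only, not the actual on-disk format used by the NeMo script: the function names, the `uint16`/`int64` dtypes, and the offset layout are all assumptions made for this example.

```python
import os
import tempfile

import numpy as np

def write_indexed(docs, prefix):
    """Write all token IDs to <prefix>.bin and document offsets to <prefix>.idx."""
    flat = np.concatenate([np.asarray(d, dtype=np.uint16) for d in docs])
    offsets = np.cumsum([0] + [len(d) for d in docs]).astype(np.int64)
    flat.tofile(prefix + ".bin")
    offsets.tofile(prefix + ".idx")

def read_document(prefix, doc_id):
    """Read one document without loading the whole .bin file into memory."""
    offsets = np.fromfile(prefix + ".idx", dtype=np.int64)
    tokens = np.memmap(prefix + ".bin", dtype=np.uint16, mode="r")
    return tokens[offsets[doc_id]:offsets[doc_id + 1]].tolist()

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy_text_document")
    write_indexed([[5, 6, 7], [8, 9]], prefix)
    second_doc = read_document(prefix, 1)
    print(second_doc)  # [8, 9]
```

Because the index stores where each document starts and ends, a data loader can memory-map the binary file and jump straight to any document, which is what makes this kind of layout efficient to load during training.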

Run the command below to preprocess the cosmopedia data. It will take a few minutes or more depending on the environment. 

In [None]:
DATA_PREFIX = "cosmopedia-100k"

idx_file = "{}_text_document.idx".format(DATA_PREFIX)
bin_file = "{}_text_document.bin".format(DATA_PREFIX)
if not (os.path.exists(idx_file) and os.path.exists(bin_file)):
    !python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
        --input={TEXT_FILE} \
        --json-keys=text \
        --tokenizer-library=megatron \
        --tokenizer-type=GPT2BPETokenizer \
        --dataset-impl=mmap \
        --merge-file={MERGE_FILE} \
        --vocab-file={VOCAB_FILE} \
        --output-prefix={DATA_PREFIX} \
        --append-eod \
        --workers=4
else:
    print("Files already exist. Skip.")

After preprocessing, confirm that the following files were created:
- `cosmopedia-100k_text_document.idx`
- `cosmopedia-100k_text_document.bin`

In [None]:
!ls

## Step 3: Pre-training SLM from Scratch

Now we are ready to train a custom SLM using the pre-training data we just preprocessed. To start training, we need to configure training settings including the model architecture. In this tutorial, we train a 200M GPT model. The training process might take more than 30 minutes for 1000 steps. You can use a larger value for `MAX_STEPS` if you'd like the model to learn more. At the end of training, the script will create a NeMo checkpoint `results/megatron_gpt/checkpoints/megatron_gpt.nemo`, which we will use for text generation later.

### Model Architecture / Optimization Settings (Advanced)

Here are tips for deciding on the SLM architecture. The following descriptions are for advanced users who want a better understanding of how these hyperparameters determine the final model architecture.

#### Architecture configuration

- Increasing the depth of the model (= a higher value for `model.num_layers`) is a good way to have a larger model with fewer risks.
- Increasing the value for `model.hidden_size` may make the model take longer to converge.
- Make sure `model.ffn_hidden_size` is always larger than `model.hidden_size` (`4 * model.hidden_size` is a commonly used setting).

Below is the default setting for this tutorial.

- `model.num_layers: 12`
- `model.hidden_size: 768`
- `model.ffn_hidden_size: 3072`
- `model.num_attention_heads: 12`
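
To see how these settings translate into model size, the following sketch gives a rough parameter estimate. It assumes a standard GPT-2 style layer layout, a vocabulary of 50257 tokens, learned position embeddings, and tied input/output embeddings. These are assumptions for illustration, and the count reported by NeMo may differ (for example, if the vocabulary is padded or the output layer is not tied).

```python
def estimate_gpt_params(num_layers, hidden_size, ffn_hidden_size,
                        vocab_size=50257, max_positions=1024,
                        tie_embeddings=True):
    """Rough parameter count for a GPT-2 style decoder (weights and biases)."""
    h, f = hidden_size, ffn_hidden_size
    attention = 4 * h * h + 4 * h          # Q, K, V, and output projections
    mlp = 2 * h * f + f + h                # two linear layers with biases
    layer_norms = 2 * 2 * h                # two LayerNorms (scale and shift)
    per_layer = attention + mlp + layer_norms
    embeddings = vocab_size * h + max_positions * h
    final_layer_norm = 2 * h
    total = num_layers * per_layer + embeddings + final_layer_norm
    if not tie_embeddings:
        total += vocab_size * h            # separate output projection
    return total

default_params = estimate_gpt_params(num_layers=12, hidden_size=768, ffn_hidden_size=3072)
print(f"{default_params / 1e6:.1f}M parameters")
```

Note that `model.num_attention_heads` does not appear in the estimate: under these assumptions it changes how the hidden dimension is split across heads, not the number of parameters.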

#### Optimization settings (Advanced)

- A higher learning rate (LR) helps the model converge faster with a risk of loss divergence.
- If you observe unstable training/validation loss values over steps, consider decreasing `model.optim.lr`.
- `model.optim.sched.min_lr` should be set to 1/10 of `model.optim.lr`.

Below is the default setting for this tutorial.

- `model.optim.lr: 2e-3`
- `model.optim.sched.min_lr: 2e-4`
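
The relationship between `model.optim.lr`, `model.optim.sched.min_lr`, and the warmup steps can be sketched as below. This assumes a linear warmup followed by cosine decay; the notebook does not set the scheduler name explicitly, so the exact curve used by NeMo may differ, and `lr_at_step` is a hypothetical helper written for this example.

```python
import math

def lr_at_step(step, max_lr=2e-3, min_lr=2e-4, warmup_steps=100, max_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the learning rate peaks at `model.optim.lr` right after warmup and ends at `model.optim.sched.min_lr` on the last step, which is why the two values are chosen together.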

In [None]:
# ======================================
MAX_STEPS = 1000  # Change to higher values such as 2124 (= 1 epoch) if you'd like to get a better model :)
NUM_LAYERS = 12
NUM_GPUS = 1
HIDDEN_SIZE = 768
FFN_HIDDEN_SIZE = 3072
NUM_ATTENTION_HEADS = 12
SEQ_LENGTH = 1024

MAX_LR = 2e-3
MIN_LR = 2e-4
WARMUP_STEPS = 100

TENSOR_MP_SIZE = 1
PIPELINE_MP_SIZE = 1
INPUT_DATA_PREFIX = "cosmopedia-100k_text_document"
INDEX_MAPPING_DIR = "index_cache"
EXP_DIR = "results"
# =======================================    
    
!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
        trainer.devices={NUM_GPUS} \
        trainer.accelerator=gpu \
        trainer.log_every_n_steps=100 \
        trainer.val_check_interval=100 \
        trainer.max_steps={MAX_STEPS} \
        trainer.precision=16 \
        trainer.gradient_clip_val=1.0 \
        exp_manager.exp_dir={EXP_DIR} \
        model.global_batch_size=64 \
        model.tensor_model_parallel_size={TENSOR_MP_SIZE} \
        model.pipeline_model_parallel_size={PIPELINE_MP_SIZE} \
        model.optim.name=fused_adam \
        model.optim.lr={MAX_LR} \
        model.optim.sched.warmup_steps={WARMUP_STEPS} \
        model.optim.sched.min_lr={MIN_LR} \
        model.optim.sched.constant_steps=0 \
        model.max_position_embeddings={SEQ_LENGTH} \
        model.encoder_seq_length={SEQ_LENGTH} \
        model.data.seq_length={SEQ_LENGTH} \
        model.tokenizer.type=GPT2BPETokenizer \
        model.tokenizer.library=megatron \
        model.tokenizer.vocab_file={VOCAB_FILE} \
        model.tokenizer.merge_file={MERGE_FILE} \
        model.data.eod_mask_loss=True \
        model.data.splits_string=\'98,1,1\' \
        model.num_layers={NUM_LAYERS} \
        model.hidden_size={HIDDEN_SIZE} \
        model.num_attention_heads={NUM_ATTENTION_HEADS} \
        model.ffn_hidden_size={FFN_HIDDEN_SIZE} \
        model.data.data_prefix=[{INPUT_DATA_PREFIX}] \
        model.data.index_mapping_dir={INDEX_MAPPING_DIR} \
        exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True
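
As a rough sanity check on `MAX_STEPS`, the number of training tokens seen can be estimated from the batch size and sequence length used in the command above. This sketch assumes every sequence is packed to the full sequence length; `tokens_seen` is a hypothetical helper.

```python
def tokens_seen(max_steps, global_batch_size=64, seq_length=1024):
    """Training tokens consumed after max_steps optimizer steps."""
    return max_steps * global_batch_size * seq_length

for steps in (1000, 2124):
    print(steps, f"{tokens_seen(steps) / 1e6:.1f}M tokens")
```

With these settings, each step consumes 65,536 tokens, so the default 1000 steps cover roughly half of the tokens that the 2124-step (one epoch) run would.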

### After training

After training, confirm that the following .nemo checkpoint has been created:
- `results/megatron_gpt/checkpoints/megatron_gpt.nemo`

In [None]:
# Check if `results/megatron_gpt/checkpoints/megatron_gpt.nemo` has been created.
!ls results/megatron_gpt/checkpoints

## Step 4: Generating Text using Your SLM

Let's play with this freshly made SLM!

Open [megatron_gpt_eval_server.ipynb](./megatron_gpt_eval_server.ipynb) in another tab on the Jupyter server to launch a text generation server. This step has to be done after Step 3. If the port is in use, change the port number accordingly.

- [megatron_gpt_eval_server.ipynb](./megatron_gpt_eval_server.ipynb)

```
!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
  gpt_model_file=results/megatron_gpt/checkpoints/megatron_gpt.nemo \
  server=True \
  port=55555
```

You can communicate with the server with the following Python script to get a response to prompts.

In [None]:
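class MegatronGPTEvalClient:
    """Minimal client for the text generation server started by megatron_gpt_eval.py."""

    def __init__(self,
                 batch_size: int = 1,
                 port_num: int = 55555,
                 headers=None):
        self.batch_size = batch_size
        self.port_num = port_num
        # Build the default here to avoid a shared mutable default argument
        self.headers = headers or {"Content-Type": "application/json"}

    def generate(self, text: str) -> str:
        # Sampling settings sent with every request
        data = {"sentences": [text] * self.batch_size,
                "tokens_to_generate": 32,
                "temperature": 1.0,
                "add_BOS": True,
                "top_k": 0,
                "top_p": 0.9,
                "greedy": False,
                "all_probs": False,
                "repetition_penalty": 1.2,
                "min_tokens_to_generate": 2}
        resp = requests.put('http://localhost:{}/generate'.format(self.port_num),
                            data=json.dumps(data),
                            headers=self.headers)
        # Fail early with a clear error if the server rejects the request
        resp.raise_for_status()
        sentences = resp.json()['sentences']
        generation = sentences[0]
        # Strip the echoed prompt and return only the continuation
        return generation[len(text):].lstrip()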

In [None]:
client = MegatronGPTEvalClient()
client.generate("How are you?")

## Step 5: Improving Trained SLM to Talk Better (Optional)

Are you happy with the responses that the SLM generated? Probably not. You likely got something that looks like English, but the SLM did not answer your question or follow your instruction. A trial run produced a response like the one below:


```
>>> client.generate("How are you?")

There's what it's all like complex AI is how data theft can help organizations save inventory and continuous products that make they need to their chances.
```

Why did it happen? The pre-training step teaches the SLM how to generate natural English sentences but may not always help the SLM learn to answer questions and follow instructions, especially when only a limited amount of data is used. It seems that the SLM still does not have a sufficient instruction-following capability.

As mentioned above, **Supervised Fine-Tuning (SFT)** (often called **Instruction Tuning**) is a solution for this. With SFT, the model can learn to follow instructions and gain better conversational ability. We will use [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner), which implements major alignment techniques including SFT and Reinforcement Learning from Human Feedback (RLHF). Please see [the NeMo-Aligner official GitHub repo](https://github.com/NVIDIA/NeMo-Aligner) for more details.


In this optional step, we will further train the SLM with Instruction Tuning so that it talks better. Make sure that you have completed the pre-training step and that you have a .nemo checkpoint file (`megatron_gpt.nemo`).

### Step 5-1: Download Dolly 15k dataset

We need to download training data for Instruction Tuning. Typical Instruction Tuning data contains pairs of **prompts** and **human-generated responses**, which help the model learn how to follow instructions.

In this example, we will use Databricks' [Dolly 15k dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k).


In [None]:
!wget -nc https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl

#### Preparation: Format Conversion

Let's take a look at one example in the JSONL file.

In [None]:
!head -1 databricks-dolly-15k.jsonl | jq .

Since NeMo-Aligner's SFT script expects the input data to follow the format below, we need to concatenate `instruction` and `context` into a single `input` and use `response` as `output` in the converted file.

```
{"input": str,
 "output": str,
 "category": str}
```

Note that randomly shuffling the order of `instruction` and `context` is a commonly used technique that helps stabilize the model. Run the following Python script to convert the data into the NeMo-Aligner format. This script also skips examples whose `instruction` or `context` is empty and removes examples that are longer than the context length (the default value for this notebook example is `512`). Update the context length value accordingly if you'd like to use a longer context.

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

max_token = 512
input_file = "databricks-dolly-15k.jsonl"
output_file = input_file.replace(".jsonl", "-output.jsonl")
with open(input_file, "r") as fin, open(output_file, "w") as fout:
    for line in fin:
        # Read one JSONL record in the original Dolly format
        record = json.loads(line)
        context = record["context"].strip()
        instruction = record["instruction"].strip()
        # Skip examples that lack either an instruction or a context
        if context == "" or instruction == "":
            continue
        output_text = record["response"]
        # Randomly decide whether the context or the instruction comes first
        parts = [context, instruction]
        random.shuffle(parts)
        input_text = "\n\n".join(parts)
        # Keep only examples whose input + output fit within max_token tokens
        if len(tokenizer.encode("{} {}".format(input_text, output_text))) < max_token:
            fout.write("{}\n".format(
                json.dumps({"input": input_text, "output": output_text, "category": record["category"]})))

Let's take a quick look at how many examples were filtered out.

In [None]:
!wc -l databricks-dolly-15k.jsonl 
!wc -l databricks-dolly-15k-output.jsonl 

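Only 3762 examples remained after filtering. Keep in mind that the script drops examples with an empty `instruction` or `context` as well as those that exceed the context length, so consider relaxing either filter, for example by increasing the context length, for the next trial. Let's keep moving forward with this option for now.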
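
To understand which filter removes the most examples, a small helper along the following lines could tally the reasons. This is a sketch: `drop_reason` and `count_tokens` are hypothetical names, and the toy records and whitespace token counter are made up so that the example runs without downloading anything.

```python
from collections import Counter

def drop_reason(record, count_tokens, max_token=512):
    """Return why a record would be filtered out, or "kept" if it passes."""
    context = record["context"].strip()
    instruction = record["instruction"].strip()
    if context == "" or instruction == "":
        return "empty_field"
    text = "{}\n\n{} {}".format(context, instruction, record["response"])
    if count_tokens(text) >= max_token:
        return "too_long"
    return "kept"

# Toy records and a whitespace token counter, both made up for illustration
toy_records = [
    {"instruction": "Summarize.", "context": "A short text.", "response": "Short."},
    {"instruction": "Name a color.", "context": "", "response": "Blue."},
    {"instruction": "Repeat.", "context": "word " * 600, "response": "ok"},
]
reasons = Counter(drop_reason(r, lambda t: len(t.split())) for r in toy_records)
print(reasons)
```

With the real data, you would pass something like `lambda t: len(tokenizer.encode(t))` as `count_tokens` and apply the helper to each parsed line of `databricks-dolly-15k.jsonl`.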

Confirm if the converted file follows the format described above.

In [None]:
!head -1 databricks-dolly-15k-output.jsonl | jq .

### Step 5-2: Fine-Tuning SLM with NeMo-Aligner

We are ready to use a subset of the Dolly 15k dataset to fine-tune the SLM to make it a better model. :)
This training step will consume 3762 examples and should take 10 minutes or more.
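
For a sense of scale, the number of optimizer steps in one epoch can be estimated from the example count and the global batch size of 64 used in the training command. The sketch assumes one optimizer step per global batch and that the final incomplete batch is dropped, as the `drop_last=True` setting suggests; `sft_steps_per_epoch` is a hypothetical helper.

```python
def sft_steps_per_epoch(num_examples, global_batch_size, drop_last=True):
    """Optimizer steps in one epoch for a given global batch size."""
    if drop_last:
        return num_examples // global_batch_size
    return -(-num_examples // global_batch_size)  # ceiling division

steps = sft_steps_per_epoch(3762, 64)
print(steps)  # 58
```

Under these assumptions, a validation interval of 20 steps gives only a couple of validation runs per epoch, which is worth keeping in mind when reading the loss curves.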

In [None]:
TRAIN_FILEPATH = "databricks-dolly-15k-output.jsonl"
VALID_FILEPATH = "databricks-dolly-15k-output.jsonl"
NEMO_CKPT_PATH = "results/megatron_gpt/checkpoints/megatron_gpt.nemo"
EXP_DIR = "results"

!python /opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=bf16 \
   trainer.num_nodes=1 \
   trainer.devices=1 \
   trainer.sft.max_epochs=1 \
   trainer.sft.max_steps=-1 \
   trainer.sft.val_check_interval=20 \
   trainer.sft.limit_val_batches=10 \
   model.tensor_model_parallel_size=1 \
   model.pipeline_model_parallel_size=1 \
   model.megatron_amp_O2=False \
   model.restore_from_path={NEMO_CKPT_PATH} \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   ++model.bias_activation_fusion=true \
   model.data.num_workers=8 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=64 \
   model.data.train_ds.file_path={TRAIN_FILEPATH} \
   model.data.train_ds.drop_last=True \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=64 \
   model.data.validation_ds.file_path={VALID_FILEPATH} \
   model.data.validation_ds.drop_last=True \
   exp_manager.exp_dir={EXP_DIR} \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True

In [None]:
# Check if megatron_gpt_sft.nemo has been created
!ls results/megatron_gpt_sft/checkpoints/

### Step 5-3: Generating Responses from Fine-tuned SLM

Open the tab for [megatron_gpt_eval_server.ipynb](./megatron_gpt_eval_server.ipynb) again (open a new tab if you haven't) to launch a text generation server. This time, you will need to apply an additional fix to the .nemo checkpoint.

- [megatron_gpt_eval_server.ipynb](./megatron_gpt_eval_server.ipynb)

```
!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
  gpt_model_file=results/megatron_gpt_sft/checkpoints/megatron_gpt_sft.nemo \
  server=True \
  port=55555
```

You can communicate with the server with the following Python script to get a response to prompts.

In [None]:
client = MegatronGPTEvalClient()
client.generate("How are you?")

## Next Steps

I hope you got better responses with the fine-tuned SLM! However, fine-tuning with just 3762 examples may not be sufficient.

Now that you know how to pre-train and fine-tune your own SLM, you should be able to explore different model architectures as well as different/more training data to make smarter SLMs using the NeMo Framework!