<a href="https://colab.research.google.com/github/TrelisResearch/llama-2-setup/blob/main/Llama_2_Prompt_and_Tokenizer_Format.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 2: Prompt, Tokenizer and Padding Guide

---

Created by Trelis Research.

Learn more:
- Function-calling Llama 2 on [Trelis' HuggingFace](https://huggingface.co/Trelis)
- LLM install guides on [GitHub](https://github.com/TrelisResearch)
- Fine-tuning tutorials on [YouTube](https://www.youtube.com/@TrelisResearch)


In [2]:
# Required to access models/datasets that are gated on HuggingFace, like Llama 2.
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()


### Connect Google Drive

Optional, but caching the model on Google Drive saves time by avoiding a re-download next time.

In [3]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import os
cache_dir = "/content/drive/My Drive/huggingface_cache"
os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists

# Load the Llama 2 Model

In [6]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


In [7]:
# We load the model quantized (i.e. weights shrunk to 4-bit instead of 16-bit) so that it fits on a free Colab notebook.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"
# model_id = "meta-llama/Llama-2-7b-hf"
# model_id = "meta-llama/Llama-2-13b-chat-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0},
    cache_dir=cache_dir)


# Llama 2 Tokenizer and Padding

Let's start by loading the Llama 2 tokenizer and inspecting it.

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)


In [33]:
# Print the BOS (beginning of sequence) and EOS (end of sequence) tokens
print("BOS Token:", tokenizer.bos_token)
print("EOS Token:", tokenizer.eos_token)

BOS Token: <s>
EOS Token: </s>


Next, let's see how these tokens are applied when we tokenize:

In [34]:
# Sample sentence
sample_sentence = "Hello, world!"

# Tokenize without special tokens
tokenized_output_no_special = tokenizer(sample_sentence, add_special_tokens=False)
print("Without special tokens:")
print("Tokenized Text:", [tokenizer.decode([x]) for x in tokenized_output_no_special["input_ids"]])
print("Token IDs:", tokenized_output_no_special["input_ids"])

Without special tokens:
Tokenized Text: ['Hello', ',', 'world', '!']
Token IDs: [15043, 29892, 3186, 29991]


Above, we set 'add_special_tokens=False'. Notice what happens when we set it to True (the tokenizer's default, and often managed for you by the Trainer).

In [35]:
# Tokenize with special tokens
tokenized_output_with_special = tokenizer(sample_sentence, add_special_tokens=True)
print("\nWith special tokens:")
print("Tokenized Text:", [tokenizer.decode([x]) for x in tokenized_output_with_special["input_ids"]])
print("Token IDs:", tokenized_output_with_special["input_ids"])


With special tokens:
Tokenized Text: ['<s>', 'Hello', ',', 'world', '!']
Token IDs: [1, 15043, 29892, 3186, 29991]


Notice how there is a BOS (beginning of sequence) token added to the start of the tokenized text.

If you want to add an EOS token, you have to add that within the data, like this:

In [36]:
# Sample sentence
sample_sentence_with_EOS = "Hello, world!</s>"

# Tokenize with special tokens
tokenized_output_with_special_and_EOS = tokenizer(sample_sentence_with_EOS, add_special_tokens=True)
print("\nWith special tokens:")
print("Tokenized Text:", [tokenizer.decode([x]) for x in tokenized_output_with_special_and_EOS["input_ids"]])
print("Token IDs:", tokenized_output_with_special_and_EOS["input_ids"])


With special tokens:
Tokenized Text: ['<s>', 'Hello', ',', 'world', '!', '</s>']
Token IDs: [1, 15043, 29892, 3186, 29991, 2]
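When preparing training data, adding the EOS string "within the data" usually means appending it to every example before tokenizing. Here is a minimal sketch of that step in plain Python. The helper name `with_eos` is hypothetical, and the EOS string is assumed to be the `</s>` printed by the tokenizer above.

```python
EOS_TOKEN = "</s>"  # assumed to match tokenizer.eos_token printed earlier


def with_eos(text, eos_token=EOS_TOKEN):
    """Append the EOS string once, unless the text already ends with it."""
    text = text.rstrip()
    return text if text.endswith(eos_token) else text + eos_token


examples = ["Hello, world!", "Already terminated.</s>"]
terminated = [with_eos(t) for t in examples]
print(terminated)
```

The resulting strings can then be passed to the tokenizer as in the cell above, which turns the trailing `</s>` into the EOS token ID.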


### Pad Token (and Mask and Unknown Tokens)

Let's start by printing out the other special tokens:
- Padding token (used to pad the sequences in a batch to the same length)
- Mask token (used to ignore certain tokens, either for attention or for the loss calculation)
- Unknown token, unk (stands in for input that is not in the vocabulary)

In [37]:
# Print additional special tokens
print("Mask Token:", tokenizer.mask_token)
print("Pad Token:", tokenizer.pad_token)
print("Unknown Token:", tokenizer.unk_token)

Mask Token: <mask>
Pad Token: <pad>
Unknown Token: <unk>


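### Setting the pad token
Llama 2 does not ship with a default pad token, so it is common to reuse the end of sequence token (`</s>`) for padding. (The `<pad>` and `<mask>` values printed above most likely come from re-running that cell after the tokens below had already been added.)

But since the end of sequence token is supposed to serve its own purpose, it's best to define a new pad token: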

In [39]:
# Check if the pad token is already in the tokenizer vocabulary
if '<pad>' not in tokenizer.get_vocab():
    # Add the pad token
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

# Resize the embeddings to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

# Check if they are equal
assert model.config.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID!"

# Print the pad token ids
print('Tokenizer pad token ID:', tokenizer.pad_token_id)
print('Model pad token ID:', model.config.pad_token_id)

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32002. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Tokenizer pad token ID: 32000
Model pad token ID: 32000


Above, we have set the pad token. Notice how we also have to:
1. Update the model's pad token ID (not just the tokenizer's).
2. Increase the size of the token embeddings (because our vocabulary now has one more token, the pad token).

One other thing: by default, padding of sequences (relevant for training) is done by adding tokens to the left. In principle it shouldn't matter which side the pad tokens are added on, because those positions are then ignored (the attention mask and loss mask are set to 0 there). However, it's probably safer to add pad tokens to the end, so that actual tokens always sit at the start of each sequence. Here's how to do that:

In [None]:
# tokenizer.padding_side = 'left'
tokenizer.padding_side = 'right'
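To make the difference between the two padding sides concrete, here is a small plain-Python sketch of what padding a batch produces. It is an illustration only, not the tokenizer's actual implementation. The function name `pad_batch` is hypothetical, and the pad ID 32000 is taken from the output printed earlier.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad lists of token IDs to a common length and build attention masks.

    In the attention mask, 1 marks a real token and 0 marks padding.
    """
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")

    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        real_mask, pad_mask = [1] * len(seq), [0] * len(pad)
        if padding_side == "right":
            input_ids.append(seq + pad)
            attention_mask.append(real_mask + pad_mask)
        else:
            input_ids.append(pad + seq)
            attention_mask.append(pad_mask + real_mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs borrowed from the "Hello, world!" outputs above
batch = [[1, 15043, 29892, 3186, 29991], [1, 15043]]
right = pad_batch(batch, pad_id=32000, padding_side="right")
left = pad_batch(batch, pad_id=32000, padding_side="left")
print(right["input_ids"][1], right["attention_mask"][1])
print(left["input_ids"][1], left["attention_mask"][1])
```

With the real tokenizer, the equivalent step is calling it on a list of texts with padding enabled; the sketch only shows the shape of the result.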

### Adding a mask token
Using mask tokens is more advanced, because it means you are really customising the training. Masks are used to block out certain positions from either:
a) being taken into account by neighbouring tokens (an attention mask), or
b) being taken into account when calculating the loss (a loss mask).
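As a rough illustration of option b), a loss mask is commonly expressed by copying the input IDs into a labels list and overwriting the positions to ignore with a sentinel value. The sketch below assumes the -100 sentinel, which is the default `ignore_index` of PyTorch's cross-entropy loss. The function name `build_labels` and the example IDs are hypothetical.

```python
IGNORE_INDEX = -100  # assumed sentinel: default ignore_index of PyTorch's cross-entropy loss


def build_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input IDs to labels, replacing masked-out positions with ignore_index."""
    return [
        token_id if keep == 1 else ignore_index
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# "<s> Hello , world ! </s>" followed by two pad tokens (ID 32000)
input_ids = [1, 15043, 29892, 3186, 29991, 2, 32000, 32000]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask)
print(labels)
```

Positions carrying the sentinel contribute nothing to the loss, while the attention mask separately keeps other tokens from attending to them.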

You can set the mask token just like the pad token:

In [40]:
# Check if the mask token is already in the tokenizer vocabulary
if '<mask>' not in tokenizer.get_vocab():
    # Add the mask token
    tokenizer.add_special_tokens({"mask_token": "<mask>"})

# Resize the embeddings to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Configure the mask token in the model
model.config.mask_token_id = tokenizer.mask_token_id

# Check if they are equal
assert model.config.mask_token_id == tokenizer.mask_token_id, "The model's mask token ID does not match the tokenizer's mask token ID!"

# Print the mask token ids
print('Tokenizer mask token ID:', tokenizer.mask_token_id)
print('Model mask token ID:', model.config.mask_token_id)
print('Model config mask token ID:', model.config.mask_token_id)

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32002. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Tokenizer mask token ID: 32001
Model mask token ID: 32001
Model config mask token ID: 32001


Notice how the embeddings are now 32002 in size: the original 32000 tokens (IDs 0 to 31999) plus the pad token (ID 32000) and the mask token (ID 32001).
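The warning printed above mentions a `pad_to_multiple_of` parameter for the resize call. The arithmetic behind it is simple rounding up, sketched here in plain Python. The multiple of 64 is an assumed example value, not a recommendation from this notebook, and `padded_vocab_size` is a hypothetical helper.

```python
def padded_vocab_size(vocab_size, multiple):
    """Round vocab_size up to the nearest multiple (ceiling division)."""
    return -(-vocab_size // multiple) * multiple


base_vocab = 32000   # original Llama 2 vocabulary (IDs 0 to 31999)
new_tokens = 2       # <pad> and <mask>
resized = base_vocab + new_tokens
padded = padded_vocab_size(resized, 64)  # 64 is an assumed example multiple

print("Resized:", resized)
print("Padded to a multiple of 64:", padded)
```

In the notebook, this would correspond to passing `pad_to_multiple_of=64` to `model.resize_token_embeddings(...)`; the extra rows are unused embedding slots.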

Alternatively, you could set the pad token (or mask token) to the existing unknown token. This avoids resizing the embeddings.

# Llama 2 Prompt Format and Inference

Llama 2 chat (only the chat variant!) is fine-tuned to expect a specific prompt format.

This prompt format involves:
- B_INST, beginning of instruction (`[INST]`)
- E_INST, end of instruction (`[/INST]`)
- B_SYS, beginning of system message (`<<SYS>>`)
- E_SYS, end of system message (`<</SYS>>`)

User messages must be wrapped within B_INST and E_INST, while system messages are wrapped within B_SYS and E_SYS.

Perhaps somewhat oddly, the system message (and its wrappers) has to be included within the first user prompt, but not in subsequent user inputs.

So, here is a simple prompt with a system and user message.

In [42]:
from transformers import TextStreamer

# Stream a single-turn chat response to stdout
def stream(user_prompt):

    system_prompt = 'You are a helpful assistant.'

    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    B_INST, E_INST = "[INST]", "[/INST]"

    # Chat model prompt
    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=50, temperature=0.01)

In [43]:
stream("Howdy!")

<s> [INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Howdy! [/INST]

Well, howdy there! *adjusts cowboy hat* It's a pleasure to meet you! How can I help you today? Do you have any questions or tasks you'd like me to assist you with? Just let me


Once there are more user messages, the prompt has to be formatted like this:

In [48]:
from transformers import TextStreamer

# Stream a multi-turn chat response to stdout
def multi_stream():

    system_prompt = 'You are a helpful assistant.'

    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    B_INST, E_INST = "[INST]", "[/INST]"

    user_prompt_1 = "Howdy!"

    assistant_response_1 = "Well, howdy there! *adjusts cowboy hat* It's a pleasure to meet you!"

    user_prompt_2 = "What is the largest solar system in the universe? Give a very brief response"

    # Chat model prompt
    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt_1.strip()} {E_INST}" + f"{assistant_response_1.strip()}{B_INST} {user_prompt_2.strip()}{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=200, temperature=0.01)

In [49]:
multi_stream()

<s> [INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Howdy! [/INST]Well, howdy there! *adjusts cowboy hat* It's a pleasure to meet you![INST] What is the largest solar system in the universe? Give a very brief response[/INST]  The largest solar system in the universe is believed to be the Milky Way galaxy, which contains over 100 billion stars and a supermassive black hole at its center.</s>


Notice how .strip() is applied to each message. This removes leading and trailing whitespace from user or assistant text, so the spacing around the delimiters stays consistent.
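The two hand-written prompts above can be generalised to any number of turns. The following is a sketch under stated assumptions, not the notebook's method: the function name `build_llama2_prompt` and its parameters are hypothetical, and the spacing plus the optional `</s><s>` separator between completed turns follow the layout of Meta's reference chat code as best understood here, which differs slightly from the cells above. The leading `<s>` is left to the tokenizer, as shown earlier.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"


def build_llama2_prompt(system_prompt, history, user_message, close_turns=True):
    """Build a Llama 2 chat prompt string.

    history: list of (user, assistant) pairs from earlier turns.
    close_turns: if True, put "</s><s>" after each completed turn
                 (an assumption based on Meta's reference layout).
    """
    user_messages = [user.strip() for user, _ in history] + [user_message.strip()]
    # The system message is folded into the first user message only.
    user_messages[0] = f"{B_SYS}{system_prompt.strip()}{E_SYS}{user_messages[0]}"

    parts = []
    # zip stops at the shorter list, so this covers completed turns only.
    for user, (_, assistant) in zip(user_messages, history):
        parts.append(f"{B_INST} {user} {E_INST} {assistant.strip()} ")
        if close_turns:
            parts.append(f"{EOS}{BOS}")
    # The final user message is left open for the model to answer.
    parts.append(f"{B_INST} {user_messages[-1]} {E_INST}")
    return "".join(parts)


prompt = build_llama2_prompt(
    "You are a helpful assistant.",
    [("Howdy!", "Well, howdy there! It's a pleasure to meet you!")],
    "What is the largest solar system in the universe? Give a very brief response",
)
print(prompt)
```

The returned string can be passed to the tokenizer in place of the hand-built `prompt` in the cells above. Setting `close_turns=False` keeps the turns in one unbroken sequence, closer to the notebook's own multi-turn example.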