## **Tokenizers**
In this Colab session, we explore the world of tokenizers.

You can run this notebook on a free CPU, or locally on your box if you prefer.

Importing the necessary libraries

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

Read the Hugging Face token from the Colab secret named `HF_TOKEN` (accessed through `userdata`), then log in to Hugging Face with it. Passing `add_to_git_credential=True` also saves the token so Git commands can use it with Hugging Face repos.

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
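
When running locally rather than in Colab, `google.colab.userdata` is not available. A minimal sketch of one alternative, assuming the token has been exported as an environment variable (the helper name `get_hf_token` and the variable name `HF_TOKEN` are illustrative choices, not part of the original notebook):

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable.

    Hypothetical helper for local runs; raises if the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


# Usage (left commented out so the sketch has no side effects):
# from huggingface_hub import login
# login(get_hf_token(), add_to_git_credential=True)
```

Failing early with a clear message is a design choice here: it avoids passing an empty token to `login` and getting a less obvious error later.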

Download and load the tokenizer for the Meta Llama 3.1 8B model. Setting `trust_remote_code=True` allows any custom code shipped with the model repository to run.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B', trust_remote_code=True)

Take a sentence and use the tokenizer to turn it into a list of token IDs, the numbers that a language model works with.

In [None]:
text = "I am excited to show Tokenizers in action to my LLM engineers"
tokens = tokenizer.encode(text)
tokens

Display the number of tokens.

In [None]:
len(tokens)

Decode the token IDs back into a single human-readable string.

In [None]:
tokenizer.decode(tokens)

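Decode each token ID separately to see the piece of text behind each one. Given a flat list of IDs, `batch_decode` returns one string per token rather than one joined string.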

In [None]:
tokenizer.batch_decode(tokens)
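
To make the difference between `encode`, `decode`, and `batch_decode` concrete without downloading a model, here is a toy sketch. The five-entry vocabulary, the function names, and the greedy longest-match rule are all assumptions for illustration; real tokenizers such as Llama's use a much larger learned vocabulary and a different, learned splitting algorithm.

```python
# Toy vocabulary (hypothetical): text piece -> token ID.
TOY_VOCAB = {"I": 0, " am": 1, " excited": 2, " Token": 3, "izers": 4}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Split text into known pieces, always taking the longest match."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = (p for p in TOY_VOCAB if text.startswith(p, pos))
        match = max(candidates, key=len, default=None)
        if match is None:
            raise ValueError(f"No vocabulary entry matches at position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    """Join all pieces back into one string (analogous to `decode`)."""
    return "".join(ID_TO_PIECE[i] for i in ids)


def toy_batch_decode(ids):
    """Return one string per token ID (analogous to `batch_decode` on a flat list)."""
    return [ID_TO_PIECE[i] for i in ids]


ids = toy_encode("I am excited Tokenizers")
print(ids)                    # [0, 1, 2, 3, 4]
print(toy_decode(ids))        # I am excited Tokenizers
print(toy_batch_decode(ids))  # ['I', ' am', ' excited', ' Token', 'izers']
```

Note how the leading space belongs to the piece that follows it, and how a less common word is split into more than one piece. The real tokenizer output above shows the same two patterns.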

Show me the entire list of words or pieces the tokenizer understands, along with the number assigned to each one.

In [None]:
tokenizer.vocab

Show just the tokens that were added on top of the base vocabulary. For this tokenizer these are special tokens defined by the model's authors, such as the marker for the beginning of the text.

In [None]:
tokenizer.get_added_vocab()

## **Instruct variants of models**

Many models have a variant that has been trained for use in chat.  
These are typically labelled with the word "Instruct" at the end.  
They have been trained to expect prompts with a particular format that includes system, user and assistant prompts.  

There is a utility method `apply_chat_template` that will convert from the messages list format we are familiar with, into the right input prompt for this model.

Download and load the tokenizer for the Llama 3.1 8B Instruct model, again allowing any custom code from the model repository to run.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

Create a conversation with a system message and a user message, format it with the chat template that the Llama Instruct model expects, and add the generation prompt so that the assistant's reply comes next.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
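
To show the kind of work a chat template does, here is a simplified sketch that wraps each message in Llama 3 style header tokens. The function name `render_llama3_style` is hypothetical, and the output is an approximation: the real template stored with the Llama 3.1 tokenizer may add further content (for example extra lines in the system block), so use `apply_chat_template` for real prompts.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Approximate a Llama 3 style prompt from a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each message gets a role header, a blank line, its content, and an end marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header signals that the model should write the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
print(render_llama3_style(example))
```

The `add_generation_prompt` flag mirrors the parameter of the same name in `apply_chat_template`: without it, the prompt ends after the user message and the model has no cue that an assistant turn begins.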

## **Trying new models**

We will now work with 3 models:

- Phi-3 from Microsoft

- Qwen2 from Alibaba Cloud

- StarCoder2 from BigCode (ServiceNow + Hugging Face + NVIDIA)

Set `PHI3_MODEL_NAME`, `QWEN2_MODEL_NAME`, and `STARCODER2_MODEL_NAME`.

In [None]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

Load the Phi-3 tokenizer. Print the token IDs that the Llama tokenizer produces for the sentence, then print the text pieces that the Phi-3 tokenizer splits the same sentence into.

In [None]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)
text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))
print()
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))

Format the conversation into model-specific prompts using both the Llama and Phi-3 tokenizers, then print each one to see how they differ in style.

In [None]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

Load the Qwen2 tokenizer and print the token IDs that the Llama, Phi-3, and Qwen2 tokenizers each produce for the same sentence.

In [None]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)
text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))
print()
print(phi3_tokenizer.encode(text))
print()
print(qwen2_tokenizer.encode(text))
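
Raw ID lists are hard to compare by eye, so a token count per tokenizer can be a useful summary. The sketch below assumes only that each object has an `encode` method returning a list of IDs; the helper `compare_token_counts` and the two stub classes are hypothetical and exist so the example runs without downloading any model.

```python
def compare_token_counts(text, tokenizers):
    """Return {name: number of token IDs} for each tokenizer in the mapping."""
    return {name: len(tok.encode(text)) for name, tok in tokenizers.items()}


class WhitespaceStub:
    """Offline stand-in: one token per whitespace-separated word."""

    def encode(self, text):
        return list(range(len(text.split())))


class CharacterStub:
    """Offline stand-in: one token per character."""

    def encode(self, text):
        return [ord(ch) for ch in text]


counts = compare_token_counts(
    "tokenizers differ",
    {"words": WhitespaceStub(), "chars": CharacterStub()},
)
print(counts)  # {'words': 2, 'chars': 17}
```

With the tokenizers loaded in this notebook, the call would look like `compare_token_counts(text, {"llama": tokenizer, "phi3": phi3_tokenizer, "qwen2": qwen2_tokenizer})`. Keep in mind that `encode` adds special tokens by default, so a count can include markers such as a beginning-of-text token.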

Compare the chat prompts produced by the Llama, Phi-3, and Qwen2 tokenizers for the same conversation.

In [None]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

Load the tokenizer for the StarCoder2 model using the model name stored in `STARCODER2_MODEL_NAME`, allowing it to run any custom code it needs. Then encode a short Python function and print each token ID next to the text it decodes to.

In [None]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")
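
When tokenizing code, newlines and indentation become tokens too, and a plain `print` makes them hard to see. One option is to print the `repr` of each decoded piece. The sketch below assumes a tokenizer object with `encode` and `decode` methods; the helper `describe_tokens` and the `LineStub` class are hypothetical, with the stub standing in for a real tokenizer so the example runs offline.

```python
def describe_tokens(tokenizer, text):
    """Return (token_id, repr of decoded piece) pairs so whitespace stays visible."""
    return [
        (token_id, repr(tokenizer.decode(token_id)))
        for token_id in tokenizer.encode(text)
    ]


class LineStub:
    """Offline stand-in: one token per line, keeping the trailing newline."""

    def __init__(self):
        self.pieces = []

    def encode(self, text):
        self.pieces = text.splitlines(keepends=True)
        return list(range(len(self.pieces)))

    def decode(self, token_id):
        return self.pieces[token_id]


pairs = describe_tokens(LineStub(), "def f():\n  pass\n")
for token_id, piece in pairs:
    print(f"{token_id}={piece}")
# 0='def f():\n'
# 1='  pass\n'
```

With the notebook's objects, `describe_tokens(starcoder2_tokenizer, code)` would return the same kind of pairs for the `hello_world` snippet.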