<a href="https://colab.research.google.com/github/Satyadeep-Dey/AI-experiments/blob/main/5__Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers

In this notebook, we use the `transformers` tokenizer to encode and decode prompts.

We try this with several different models.

**Note:** *When actually calling an LLM, we first encode the prompt, then call the model to get a response, and finally decode that response.*
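
The encode, generate, decode round trip from the note can be sketched without downloading a model. Everything below is hypothetical: `VOCAB`, `encode`, `decode` and `fake_llm` are toy stand-ins rather than parts of `transformers`, and the "model" simply returns a canned list of token IDs.

```python
# Toy word-level vocabulary (an assumption made purely for illustration).
VOCAB = {"hello": 0, "world": 1, "hi": 2, "there": 3}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text):
    """Turn text into token IDs (toy: one ID per whitespace-separated word)."""
    return [VOCAB[word] for word in text.lower().split()]

def decode(token_ids):
    """Turn token IDs back into text."""
    return " ".join(ID_TO_TOKEN[i] for i in token_ids)

def fake_llm(input_ids):
    """Stand-in for a model: it consumes token IDs and produces token IDs."""
    return [VOCAB["hi"], VOCAB["there"]]

prompt_ids = encode("Hello world")     # 1. encode the prompt
response_ids = fake_llm(prompt_ids)    # 2. call the (fake) model
response_text = decode(response_ids)   # 3. decode the response
print(prompt_ids, response_ids, response_text)
```

The point of the sketch is only the order of the three steps: the model never sees or returns raw text.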



In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B', trust_remote_code=True)

In [None]:
text = "I am excited to learn Generative AI . It's super cool !"
tokens = tokenizer.encode(text) # encode
tokens

In [None]:
# Let's count the number of words in the sentence being tokenized
words = text.split()  # split on whitespace; the standalone "." and "!" count as words too
# print(words)
len(words)  # number of words

In [None]:
len(tokens) # total number of tokens
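
The token count can exceed the word count because a tokenizer may split a word into several subword pieces (and may add special tokens). The sketch below illustrates the splitting with a greedy longest-match search over a tiny, made-up vocabulary. This is a simplification under stated assumptions: the Llama tokenizer uses byte-pair encoding with a learned vocabulary, and `TOY_VOCAB` and `toy_tokenize` are hypothetical names, so real splits will differ.

```python
# Hypothetical subword vocabulary; a leading space marks the start of a word.
TOY_VOCAB = {" I", " am", " learn", "ing", " Gener", "ative", " AI"}

def toy_tokenize(text):
    """Greedy longest-match split of `text` into pieces from TOY_VOCAB."""
    text = " " + text  # treat the first word like every other word
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            if text[pos:end] in TOY_VOCAB:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return pieces

sentence = "I am learning Generative AI"
pieces = toy_tokenize(sentence)
print(len(sentence.split()), "words ->", len(pieces), "tokens")
print(pieces)
```

Here "learning" and "Generative" each become two pieces, which is why five words turn into seven tokens in this toy setup.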

# Decode and Special Tokens

1. `tokenizer.decode(tokens)` converts a list of token IDs back into a human-readable string, reversing the tokenization process.
2. The decoded string does not exactly match the original, because a special token, ***<|begin_of_text|>***, was added during encoding.

***Note: These special tokens were used when training the model, so they differ from model to model.***
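
The effect of a beginning-of-text token on the round trip can be illustrated with a toy tokenizer. `ToyTokenizer` below is hypothetical and only mimics the behaviour described above (prepend a special token on encode, optionally drop it on decode); it does not reproduce how `transformers` implements this, and it joins tokens with spaces for simplicity.

```python
class ToyTokenizer:
    """Word-level toy that prepends a special token when encoding."""

    BOS = "<|begin_of_text|>"

    def __init__(self, words):
        # ID 0 is reserved for the special token; the words follow in order.
        self.id_to_token = [self.BOS] + list(words)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.special_ids = {self.token_to_id[self.BOS]}

    def encode(self, text):
        """Return token IDs, with the special token's ID at the front."""
        return [self.token_to_id[self.BOS]] + [self.token_to_id[w] for w in text.split()]

    def decode(self, ids, skip_special_tokens=False):
        """Return text, optionally leaving out special tokens."""
        if skip_special_tokens:
            ids = [i for i in ids if i not in self.special_ids]
        return " ".join(self.id_to_token[i] for i in ids)

toy = ToyTokenizer(["tokenizers", "are", "fun"])
ids = toy.encode("tokenizers are fun")
print(ids)
print(toy.decode(ids))
print(toy.decode(ids, skip_special_tokens=True))
```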

In [None]:

tokenizer.decode(tokens) #decode

In [None]:
# We can skip special tokens when decoding

tokenizer.decode(tokens, skip_special_tokens=True)


In [None]:
tokenizer.batch_decode(tokens)  # a misuse, since we have one sequence rather than a batch, but the output is instructive

# Each token ID is decoded on its own, and a token that begins a word starts with a space.
# "learn Generative AI" decodes to ' learn', ' Gener', 'ative', ' AI'

In [None]:
# Intended use of batch_decode: decode several token sequences at once
token1 = tokenizer.encode("How are you ?") # encode
token2 = tokenizer.encode("I am fine, thank you") # encode
print(token1)
print(token2)
tokenizer.batch_decode([token1, token2]) # decode batch
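
A useful mental model for the two `batch_decode` cells: the method behaves roughly like calling `decode` on each element of its argument. A flat list of IDs is therefore decoded one ID at a time, while a list of lists is decoded one sequence at a time. The sketch below mimics this with hypothetical `toy_decode` and `toy_batch_decode` helpers and made-up IDs; it is an illustration, not the library's code.

```python
# Made-up IDs for the pieces of "learn Generative AI".
TOY_ID_TO_PIECE = {10: " learn", 11: " Gener", 12: "ative", 13: " AI"}

def toy_decode(ids):
    """Decode a single ID or a list of IDs into one string."""
    if isinstance(ids, int):
        ids = [ids]
    return "".join(TOY_ID_TO_PIECE[i] for i in ids)

def toy_batch_decode(sequences):
    """Decode each element of `sequences` independently."""
    return [toy_decode(seq) for seq in sequences]

flat = [10, 11, 12, 13]
print(toy_batch_decode(flat))          # one string per token
print(toy_batch_decode([flat, [13]]))  # one string per sequence
```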

In [None]:
tokenizer.get_added_vocab() # these are the special tokens added by this LLM
# e.g., <|begin_of_text|> indicates the beginning of a prompt

In [None]:
# Returns a dictionary mapping each token string to its integer ID.

tokenizer.vocab

# Instruct variants of models

Many models have a variant that has been trained for use in chats.  
These are typically labelled with the word "Instruct" at the end of the model name.  
They have been trained to expect prompts in a particular format that includes system, user and assistant sections.  

The utility method `apply_chat_template` converts the familiar list-of-messages format into the input prompt this model expects.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

In [None]:

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

This is what the prompt will look like. Note that there is a section for **system**, one for **user**, and a final section for **assistant**.
This is what will be sent to the LLM for inference.


      <|begin_of_text|><|start_header_id|>system<|end_header_id|>

      Cutting Knowledge Date: December 2023
      Today Date: 26 Jul 2024

      You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

      Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>



# Let us now try a few more models

1. Phi3 from Microsoft
2. Qwen2 from Alibaba Cloud
3. StarCoder2 from BigCode (ServiceNow + Hugging Face + NVIDIA)

In [None]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [None]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(phi3_tokenizer.encode(text))
print("")
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.decode(tokens))
print("")
print(phi3_tokenizer.batch_decode(tokens))

In [None]:
print("This is the prompt for meta-llama/Meta-Llama-3.1-8B-Instruct  ")
print()
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print("This is the prompt for microsoft/Phi-3-mini-4k-instruct  ")
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# The prompts for the two LLMs are very different




**The output of the above cell will look as follows:**


This is the prompt for meta-llama/Meta-Llama-3.1-8B-Instruct  

      <|begin_of_text|><|start_header_id|>system<|end_header_id|>

      Cutting Knowledge Date: December 2023
      Today Date: 26 Jul 2024

      You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

      Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>



This is the prompt for microsoft/Phi-3-mini-4k-instruct. You'll notice it's much simpler:

      <|system|>
      You are a helpful assistant<|end|>
      <|user|>
      Tell a light-hearted joke for a room of Data Scientists<|end|>
      <|assistant|>
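
The two prompts differ mainly in the markers wrapped around each message, which a rough hand-written imitation makes visible. `render_llama3_style` and `render_phi3_style` below are hypothetical helpers written for illustration. The real templates ship with each tokenizer and do more (the Llama template, for instance, also inserts the date lines shown above), so `apply_chat_template` remains the right tool in practice.

```python
demo_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]

def render_llama3_style(messages):
    """Simplified Llama-3-style prompt (omits the date preamble of the real template)."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Generation prompt: open an assistant turn for the model to complete.
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"

def render_phi3_style(messages):
    """Simplified Phi-3-style prompt."""
    prompt = ""
    for m in messages:
        prompt += f"<|{m['role']}|>\n{m['content']}<|end|>\n"
    # Generation prompt: open an assistant turn for the model to complete.
    return prompt + "<|assistant|>\n"

print(render_llama3_style(demo_messages))
print(render_phi3_style(demo_messages))
```

Both functions end by opening an assistant turn, which is the role `add_generation_prompt=True` plays in the cells above.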

In [None]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))  # Llama
print()
print(phi3_tokenizer.encode(text))
print()
print(qwen2_tokenizer.encode(text))
# The output of this cell shows that the three LLMs encode the same text very differently


In [None]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))  # Llama
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# The output of this cell shows that the generated prompt, which will be sent to the LLM, is different for each model

In [None]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
tokens = starcoder2_tokenizer.encode(code)
print(tokens)
print(starcoder2_tokenizer.decode(tokens))