# Tokenizers

Let's head back to Google Colab, where we'll look at several different tokenizers:

https://colab.research.google.com/drive/1WD6Y2N7ctQi1X9wa6rpkg8UfyA4iSVuz?usp=sharing

In [1]:
# if this gives an "ERROR" about pip dependency conflicts, ignore it! It doesn't affect anything.

!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124

!pip install -q --upgrade transformers==4.48.3 datasets==3.2.0

ERROR: Could not find a version that satisfies the requirement torch==2.5.1+cu124 (from versions: 2.6.0+cu124)
ERROR: No matching distribution found for torch==2.5.1+cu124


In [6]:

from huggingface_hub import login
from transformers import AutoTokenizer
from dotenv import load_dotenv
import os

In [7]:
load_dotenv()  # loads environment variables from a .env file if present
hf_token = os.getenv('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [10]:
# Use a public model to avoid gated repo errors
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [12]:
text = "I am excited to show Tokenizers in action to my LLM engineers"
tokens = tokenizer.encode(text)
tokens

[101,
 1045,
 2572,
 7568,
 2000,
 2265,
 19204,
 17629,
 2015,
 1999,
 2895,
 2000,
 2026,
 2222,
 2213,
 6145,
 102]

In [13]:
len(tokens)

17

In [14]:
tokenizer.decode(tokens)

'[CLS] i am excited to show tokenizers in action to my llm engineers [SEP]'

In [15]:
tokenizer.batch_decode(tokens)

['[CLS]',
 'i',
 'am',
 'excited',
 'to',
 'show',
 'token',
 '##izer',
 '##s',
 'in',
 'action',
 'to',
 'my',
 'll',
 '##m',
 'engineers',
 '[SEP]']
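
The `##` prefix in the pieces above marks a continuation of the previous piece, which is characteristic of WordPiece-style tokenizers. As a rough illustration of how such a split can arise, here is a minimal greedy longest-match-first sketch. The `wordpiece` function and `toy_vocab` are hypothetical and hand-picked for this example; they are not the real `distilbert-base-uncased` vocabulary or the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to mirror the pieces printed above.
toy_vocab = {"token", "##izer", "##s", "ll", "##m", "engineers"}

for word in ["tokenizers", "llm", "engineers", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

With this toy vocabulary, `tokenizers` splits into the same three pieces seen in the notebook output, and a word with no matching pieces falls back to `[UNK]`.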

In [16]:
# tokenizer.vocab
tokenizer.get_added_vocab()

{'[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102, '[MASK]': 103}
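
These special-token IDs explain why the encoded sentence has 17 entries: the first ID is `[CLS]` and the last is `[SEP]`. The sketch below only reuses the numbers printed earlier in this notebook (no model download); the variable names are illustrative.

```python
# Special tokens as printed by tokenizer.get_added_vocab() above.
special_tokens = {'[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102, '[MASK]': 103}

# Token IDs as printed by tokenizer.encode(text) above.
tokens = [101, 1045, 2572, 7568, 2000, 2265, 19204, 17629, 2015,
          1999, 2895, 2000, 2026, 2222, 2213, 6145, 102]

special_ids = set(special_tokens.values())
content_ids = [t for t in tokens if t not in special_ids]

print(len(tokens))       # total length, including [CLS] and [SEP]
print(len(content_ids))  # length once the special tokens are stripped
print(tokens[0] == special_tokens['[CLS]'], tokens[-1] == special_tokens['[SEP]'])
```

In the notebook itself, calling `tokenizer.encode(text, add_special_tokens=False)` should give the same content-only IDs directly.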

In [18]:
# The requested model is gated and cannot be accessed without permission.
# Use the already loaded public tokenizer instead.
# tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)
print("Using the existing public tokenizer: distilbert-base-uncased")

Using the existing public tokenizer: distilbert-base-uncased


In [22]:
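# Use the already loaded public tokenizer instead of a gated model
print("Using the existing public tokenizer: distilbert-base-uncased")

# Chat messages reused by the chat-template cells further down
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]

# Example: encode the user message with the available tokenizer
prompt = tokenizer.encode(messages[1]['content'])
print(prompt)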


Using the existing public tokenizer: distilbert-base-uncased
[101, 2425, 1037, 2422, 1011, 18627, 8257, 2005, 1037, 2282, 1997, 2951, 6529, 102]


In [20]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [23]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))
print()
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))


[101, 1045, 2572, 7568, 2000, 2265, 19204, 17629, 2015, 1999, 2895, 2000, 2026, 2222, 2213, 6145, 102]

['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in', 'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']
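
Both tokenizers produce 15 content pieces for this sentence, yet they split different words. To make that easier to see, the sketch below regroups each tokenizer's pieces by the word they came from. The piece lists are copied from the outputs in this notebook; `group_pieces_by_word` is a hypothetical helper written for this comparison, and it assumes the pieces concatenate exactly back into the words.

```python
def group_pieces_by_word(words, pieces, continuation_prefix=""):
    """Assign consecutive pieces to each word until the word is rebuilt."""
    groups = []
    remaining = iter(pieces)
    for word in words:
        group, built = [], ""
        while built != word:
            piece = next(remaining)
            group.append(piece)
            # Drop the continuation marker (e.g. "##") before rebuilding the word.
            if continuation_prefix and piece.startswith(continuation_prefix):
                piece = piece[len(continuation_prefix):]
            built += piece
        groups.append(group)
    return groups


text = "I am excited to show Tokenizers in action to my LLM engineers"

# Pieces copied from the notebook outputs ([CLS] and [SEP] left out).
distilbert_pieces = ['i', 'am', 'excited', 'to', 'show', 'token', '##izer', '##s',
                     'in', 'action', 'to', 'my', 'll', '##m', 'engineers']
phi3_pieces = ['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in',
               'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']

# distilbert-base-uncased lowercases its input, so compare against lowercase words.
bert_groups = group_pieces_by_word(text.lower().split(), distilbert_pieces, "##")
phi3_groups = group_pieces_by_word(text.split(), phi3_pieces)

for word, bert, phi3 in zip(text.split(), bert_groups, phi3_groups):
    if len(bert) > 1 or len(phi3) > 1:
        print(f"{word}: {bert} vs {phi3}")
```

Only the words that at least one tokenizer splits are printed, which here are `Tokenizers`, `LLM` and `engineers`.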


In [25]:
# Only call apply_chat_template if the tokenizer supports it
if getattr(tokenizer, "chat_template", None):
	print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
else:
	print("tokenizer does not support chat templates.")

print()

if getattr(phi3_tokenizer, "chat_template", None):
	print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
else:
	print("phi3_tokenizer does not support chat templates.")

tokenizer does not support chat templates.

<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>
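
To see what a chat template is doing, it can help to reproduce the printed Phi-3 prompt by hand. The sketch below only mimics the output shown above; `render_phi3_style` is a hypothetical helper, not the actual Jinja template shipped with the Phi-3 tokenizer, and it may not match that template in other cases.

```python
def render_phi3_style(messages, add_generation_prompt=True):
    """Format chat messages in the layout printed by the Phi-3 tokenizer above."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]

print(render_phi3_style(messages))
```

The role markers and `<|end|>` delimiters are special tokens for the model, which is why each model family needs its own template.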



In [27]:
if getattr(tokenizer, "chat_template", None):
	print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
else:
	print("tokenizer does not support chat templates.")

print()

if getattr(phi3_tokenizer, "chat_template", None):
	print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
else:
	print("phi3_tokenizer does not support chat templates.")

print()

if 'qwen2_tokenizer' in locals() and getattr(qwen2_tokenizer, "chat_template", None):
	print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
else:
	print("qwen2_tokenizer does not support chat templates or is not defined.")

tokenizer does not support chat templates.

<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>


qwen2_tokenizer does not support chat templates or is not defined.


In [28]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")


222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
353=
 
1489= print
459=("
8302=Hello
411=",
4944= person
46=)
222=
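
Notice that whitespace and newlines are tokens too, so joining the decoded pieces gives back the original source exactly. The check below uses the pieces transcribed from the output above; the exact whitespace inside each piece (for example the newline-plus-space for ID 353) is my reading of the printed output, so treat it as an assumption.

```python
code = """
def hello_world(person):
  print("Hello", person)
"""

# (token_id, decoded_piece) pairs transcribed from the printed output.
decoded = [
    (222, "\n"), (610, "def"), (17966, " hello"), (100, "_"), (5879, "world"),
    (45, "("), (6427, "person"), (731, "):"), (353, "\n "), (1489, " print"),
    (459, '("'), (8302, "Hello"), (411, '",'), (4944, " person"), (46, ")"),
    (222, "\n"),
]

reconstructed = "".join(piece for _, piece in decoded)
print(reconstructed == code)       # the pieces concatenate back to the source
print(len(decoded), "tokens for", len(code), "characters")
```

A code-oriented tokenizer such as this one keeps tokens like `def` and ` print` whole, which keeps the token count low for source code.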


