# Tokenizer Introduction

#### Load the model and tokenizer using the Transformers library

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  7.06it/s]


#### Generate the next 20 tokens from the given prompt

In [7]:
prompt = "What is a LLM tokenizer?"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cpu")

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

What is a LLM tokenizer? LLM tokenizer is a tool used in natural language processing (NLP) to break down text


#### Display the actual input_ids passed to the model. An LLM does not read words as humans do; it receives integer token IDs.

In [8]:
input_ids

tensor([[ 1724,   338,   263,   365, 26369,  5993,  3950, 29973]])

In [9]:
# Decode each token ID on its own to see the piece of text it represents
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

What
is
a
L
LM
token
izer
?


- Observe how the word `LLM` is split into `L` + `LM`, and `tokenizer` into `token` + `izer`.
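To make the splitting concrete, here is a minimal toy sketch of subword tokenization. It assumes a hypothetical eight-entry vocabulary (`TOY_VOCAB`) built only from the pieces and IDs printed above, and it uses a simple greedy longest-match rule. The real Phi-3 tokenizer has a far larger learned vocabulary and its own algorithm, so treat this as an illustration of the idea, not a reimplementation.

```python
# Toy vocabulary: pieces and IDs copied from the notebook output above.
# This is an illustrative assumption, not the real Phi-3 vocabulary.
TOY_VOCAB = {
    "What": 1724, "is": 338, "a": 263, "L": 365,
    "LM": 26369, "token": 5993, "izer": 3950, "?": 29973,
}

def toy_tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, then shrink it.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces

pieces = toy_tokenize("What is a LLM tokenizer?", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)
print(ids)
```

Because `LLM` and `tokenizer` are not in the toy vocabulary as whole words, the function falls back to smaller pieces that are, which is the same behavior seen in the decoded output above.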

#### What was the actual output from the model? Was it words? Think again.

In [10]:
generation_output

tensor([[ 1724,   338,   263,   365, 26369,  5993,  3950, 29973,   365, 26369,
          5993,  3950,   338,   263,  5780,  1304,   297,  5613,  4086,  9068,
           313, 29940, 13208, 29897,   304,  2867,  1623,  1426]])

Decoded, the 20 newly generated tokens read:
`LLM tokenizer is a tool used in natural language processing (NLP) to break down text`
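Note that `generation_output` holds the prompt IDs followed by the newly generated IDs. The sketch below separates the two using plain Python lists, with the IDs copied from the tensors above. The names `prompt_ids`, `output_ids`, and `new_ids` are introduced here for illustration only.

```python
# IDs copied from the input_ids and generation_output tensors shown above.
prompt_ids = [1724, 338, 263, 365, 26369, 5993, 3950, 29973]
output_ids = [1724, 338, 263, 365, 26369, 5993, 3950, 29973, 365, 26369,
              5993, 3950, 338, 263, 5780, 1304, 297, 5613, 4086, 9068,
              313, 29940, 13208, 29897, 304, 2867, 1623, 1426]

# The output starts with an exact copy of the prompt IDs.
assert output_ids[:len(prompt_ids)] == prompt_ids

# Everything after the prompt is what the model generated.
new_ids = output_ids[len(prompt_ids):]
print(len(new_ids))  # 20, matching max_new_tokens=20
print(new_ids[:4])   # the IDs for "L", "LM", "token", "izer"
```

In the notebook itself, the equivalent slice on the tensors would be `generation_output[0][input_ids.shape[1]:]`, which can then be passed to `tokenizer.decode`.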

In [14]:
print(tokenizer.decode(365))
print(tokenizer.decode(26369))
print(tokenizer.decode(5993))
print(tokenizer.decode(3950))
print('...')

L
LM
token
izer
...
