# Tokens

Models process text as small chunks called tokens, which can be words, subwords, or characters. Tokens are both the input and the output of a model. In the pipeline, a tokenizer splits the text into tokens before the model processes it.

## Code Example

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [10]:
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-mini-4k-instruct',
    device_map='cuda',
    torch_dtype='auto',
)

tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3-mini-4k-instruct')

In [11]:
prompt = 'Write a short message telling them that I learn about tokenization today. Explain what it is. <|assistant|>'

In [12]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')
input_ids

tensor([[14350,   263,  3273,  2643, 14509,   963,   393,   306,  5110,  1048,
          5993,  2133,  9826, 29889, 12027,  7420,   825,   372,   338, 29889,
         29871, 32001]], device='cuda:0')

The prompt has been split into tokens, converted to numerical token IDs, and returned as a PyTorch tensor. The model takes these IDs as its input rather than the raw prompt text.

In [14]:
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20
)
generation_output

tensor([[14350,   263,  3273,  2643, 14509,   963,   393,   306,  5110,  1048,
          5993,  2133,  9826, 29889, 12027,  7420,   825,   372,   338, 29889,
         29871, 32001, 18637,   727, 29991,   306,   925, 10972,  1048,  5993,
          2133,  9826, 29889,   739, 29915, 29879,   263, 21028,   262,  1218,
          1889,  1304]], device='cuda:0')

Comparing the input tokens to the generated output, we can see that the first 22 IDs are the prompt repeated unchanged and that the model starts generating at token **18637**.
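The comparison can also be done programmatically. The following is a minimal sketch that uses plain Python lists holding the IDs printed above, so it runs without the model or a GPU; the helper name `split_generation` is made up for illustration.

```python
# Token IDs copied from the two tensors printed above.
prompt_ids = [14350, 263, 3273, 2643, 14509, 963, 393, 306, 5110, 1048,
              5993, 2133, 9826, 29889, 12027, 7420, 825, 372, 338, 29889,
              29871, 32001]
output_ids = prompt_ids + [18637, 727, 29991, 306, 925, 10972, 1048, 5993,
                           2133, 9826, 29889, 739, 29915, 29879, 263, 21028,
                           262, 1218, 1889, 1304]


def split_generation(prompt, output):
    """Return only the IDs that come after the prompt (hypothetical helper)."""
    if output[:len(prompt)] != prompt:
        raise ValueError("output does not start with the prompt")
    return output[len(prompt):]


new_ids = split_generation(prompt_ids, output_ids)
print(len(new_ids), new_ids[0])  # 20 18637
```

The 20 new IDs match `max_new_tokens=20`. With the actual tensors, the equivalent slice would likely be `generation_output[0, input_ids.shape[1]:]`.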

In [16]:
print(tokenizer.decode(generation_output[0]))

Write a short message telling them that I learn about tokenization today. Explain what it is. <|assistant|> Hey there! I just learned about tokenization today. It's a fascinating process used


In [36]:
# Decode each input ID on its own to see which piece of text it maps to
for token_id in input_ids[0]:
    print('Word:', tokenizer.decode(token_id), '\tToken:', token_id.item())

Word: Write 	Token: 14350
Word: a 	Token: 263
Word: short 	Token: 3273
Word: message 	Token: 2643
Word: telling 	Token: 14509
Word: them 	Token: 963
Word: that 	Token: 393
Word: I 	Token: 306
Word: learn 	Token: 5110
Word: about 	Token: 1048
Word: token 	Token: 5993
Word: ization 	Token: 2133
Word: today 	Token: 9826
Word: . 	Token: 29889
Word: Exp 	Token: 12027
Word: lain 	Token: 7420
Word: what 	Token: 825
Word: it 	Token: 372
Word: is 	Token: 338
Word: . 	Token: 29889
Word:  	Token: 29871
Word: <|assistant|> 	Token: 32001


The output shows the numerical ID assigned to each token, e.g. **Write - 14350**. Not every token is a whole word: "tokenization" is split into "token" and "ization", and "Explain" into "Exp" and "lain".
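One thing the loop output suggests is that the mapping from token to ID is fixed: the period appears twice and gets ID 29889 both times. The sketch below checks this using the pairs copied from the output above. It only inspects those printed pairs and does not call the tokenizer; the entry for ID 29871 is written as an empty string because it printed as blank.

```python
# (decoded text, token ID) pairs copied from the loop output above.
pairs = [
    ("Write", 14350), ("a", 263), ("short", 3273), ("message", 2643),
    ("telling", 14509), ("them", 963), ("that", 393), ("I", 306),
    ("learn", 5110), ("about", 1048), ("token", 5993), ("ization", 2133),
    ("today", 9826), (".", 29889), ("Exp", 12027), ("lain", 7420),
    ("what", 825), ("it", 372), ("is", 338), (".", 29889),
    ("", 29871), ("<|assistant|>", 32001),
]

text_to_id = {}
for text, token_id in pairs:
    # setdefault keeps the first ID seen for a text; a mismatch later on
    # would mean the same text was assigned two different IDs.
    assert text_to_id.setdefault(text, token_id) == token_id

print(len(pairs), len(text_to_id))  # 22 21
print(text_to_id["."])              # 29889
```

There are 22 tokens but only 21 distinct entries, because the two periods share one ID.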

## Types of Tokenization
Text can be tokenized at different levels of granularity:
1. Word tokens: Splitting text on whitespace, so that each word is one token
2. Subword tokens: Using full words and word pieces, so that rare words are split into smaller known units
3. Character tokens: Using each individual character in the input as a token
4. Byte tokens: Using the individual bytes that encode Unicode characters as tokens
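
The four approaches listed above can be compared on a short string. This is a rough sketch in plain Python: `TOY_VOCAB` and `subword_tokenize` are made up for illustration, and the greedy longest-match rule is a simplification, not the algorithm a trained tokenizer such as the one used for Phi-3 applies.

```python
text = "tokenization today"

# 1. Word tokens: split on whitespace.
word_tokens = text.split()

# 2. Subword tokens: greedy longest match against a tiny, hypothetical vocabulary.
TOY_VOCAB = {"token", "ization", "today", " "}


def subword_tokenize(s, vocab):
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate starting at position i first.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(s[i])
            i += 1
    return tokens


subword_tokens = subword_tokenize(text, TOY_VOCAB)

# 3. Character tokens: every character is its own token.
char_tokens = list(text)

# 4. Byte tokens: the UTF-8 bytes of the text; "é" takes two bytes.
byte_tokens = list("é".encode("utf-8"))

print(word_tokens)       # ['tokenization', 'today']
print(subword_tokens)    # ['token', 'ization', ' ', 'today']
print(len(char_tokens))  # 18
print(byte_tokens)       # [195, 169]
```

Finer-grained schemes need a smaller vocabulary but produce longer sequences: the same 2 words become 4 subword tokens or 18 character tokens.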
