## Tokenizing with code

Interactive tokenizer to try in the browser: https://platform.openai.com/tokenizer

#### What Is Tiktoken?

* Tiktoken is a tokenizer library used for OpenAI models (GPT-3, GPT-4, GPT-5).
* Its main purpose: convert text into tokens and tokens back into text efficiently.
* Tokens are the basic units the model understands, often smaller than words.
* Example (simplified): "Hello, world!" → ["Hello", ",", " world", "!"]
* Each token has an integer ID in the model's vocabulary.

#### Why Tokenization Matters
Large Language Models (LLMs) don't operate on plain text. They work on tokens, which are mapped to vectors (embeddings).
* Tokens: atomic pieces of text (subwords, words, characters)
* Vocabulary: the set of all tokens the model knows
* IDs: each token has a unique numeric ID → input to the neural network
* Tiktoken handles the text side of the pipeline: Text -> Tokens -> Token IDs. The model itself then maps the IDs to embeddings.
* It also handles the reverse: Model output IDs -> Tokens -> Text
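
The mapping described above can be sketched without any library. The snippet below is only a toy illustration: the vocabulary `VOCAB` and the helpers `toy_encode` and `toy_decode` are made up for this example, and the greedy longest-match lookup is a stand-in, not the byte pair encoding procedure that Tiktoken actually uses. It only shows the idea that text becomes a list of IDs and that the IDs can be turned back into the same text.

```python
# Hypothetical toy vocabulary; real vocabularies are far larger.
VOCAB = ["Hello", ",", " world", "!", " ", "H", "e", "l", "o", "w", "r", "d"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}


def toy_encode(text):
    """Greedy longest-match lookup (a stand-in, NOT Tiktoken's BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining slice first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def toy_decode(ids):
    """Map each ID back to its text piece and join the pieces."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = toy_encode("Hello, world!")
print(ids)              # [0, 1, 2, 3]
print(toy_decode(ids))  # Hello, world!
```

Note that the space belongs to the token " world" rather than standing alone, which mirrors the leading spaces visible in the real Tiktoken output further down.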



In [2]:
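import tiktoken

# Look up the encoding (tokenizer) that matches the given model name
encoding = tiktoken.encoding_for_model("gpt-4.1-mini")

# Convert the text into a list of integer token IDs
tokens = encoding.encode("Hi my name is Betül and I like purple!")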

In [3]:
print(tokens)

[12194, 922, 1308, 382, 10559, 13595, 326, 357, 1299, 37896, 0]


In [6]:
# Decode each ID on its own to see which piece of text it stands for
for token_id in tokens:
    token_text = encoding.decode([token_id])
    print(f"{token_id} : {token_text}")

12194 : Hi
922 :  my
1308 :  name
382 :  is
10559 :  Bet
13595 : ül
326 :  and
357 :  I
1299 :  like
37896 :  purple
0 : !
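
Tiktoken's encodings work on the UTF-8 bytes of the text rather than on characters. In the output above every token happened to contain whole characters, so decoding the IDs one at a time produced readable pieces. That is not guaranteed for other inputs: a token can end in the middle of a multi-byte character. The sketch below uses only the Python standard library, and the split point is chosen by hand purely for illustration (it is not a split that Tiktoken produced), to show what a half character looks like when decoded on its own.

```python
name = "Bet\u00fcl"        # "Betül", with "ü" written as an explicit code point
raw = name.encode("utf-8")

print(len(name))  # 5 characters
print(len(raw))   # 6 bytes, because "ü" takes two bytes in UTF-8
print(raw)        # b'Bet\xc3\xbcl'

# Hypothetical split that cuts "ü" in half (hand-picked, not from Tiktoken)
left, right = raw[:4], raw[4:]

# Each half alone is invalid UTF-8, so the broken byte becomes U+FFFD
print(left.decode("utf-8", errors="replace"))
print(right.decode("utf-8", errors="replace"))

# Joined back together, the bytes decode cleanly
print((left + right).decode("utf-8"))  # Betül
```

When inspecting tokens one by one, a replacement character in the output therefore usually means a character was split across tokens, not that the text was lost; decoding the full ID list at once restores it.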


In [8]:
encoding.decode([10559])

' Bet'

In [12]:
# The neighboring ID maps to an unrelated token
encoding.decode([10560])

' Sunday'