### Walkthrough of the Mistral v3 tokenizer

(Source: https://docs.mistral.ai/guides/tokenization/)


If you've read the primer on BPE, the following statement should make sense.

Mistral's v3 (tekken) tokenizer uses Byte-Pair Encoding (BPE) with Tiktoken. It's called Tekken.

Compression measures how effective the learned merges are: the better the compression, the fewer tokens are needed to represent the same text, so more text fits into a fixed context window.
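To make "compression" concrete, one common way to measure it is the number of UTF-8 bytes represented per token. The sketch below is a toy illustration, not Tekken itself: `bytes_per_token` is a hypothetical helper, and the two "tokenizers" are stand-ins (one token per character versus one token per whitespace-separated word).

```python
def bytes_per_token(text: str, tokens: list) -> float:
    """Average number of UTF-8 bytes covered by one token (higher = better compression)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(text.encode("utf-8")) / len(tokens)


text = "the cat sat on the mat"

# Stand-in tokenizers for illustration only (not real BPE):
char_tokens = list(text)       # one token per character, i.e. no merges at all
word_tokens = text.split(" ")  # one token per word, as if every word had been fully merged

char_ratio = bytes_per_token(text, char_tokens)
word_ratio = bytes_per_token(text, word_tokens)
print(char_ratio, word_ratio)
```

A real BPE tokenizer lands somewhere between these two extremes, and the same measurement can be used to compare two tokenizers on the same corpus.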

Tekken was trained on more than 100 languages and compresses natural language text and source code more efficiently than the SentencePiece tokeniser used in previous Mistral models. In particular, it is ~30% more efficient at compressing source code, Chinese, Italian, French, German, Spanish, and Russian. It is also 2x and 3x more efficient at compressing Korean and Arabic, respectively. Compared to the Llama 3 tokeniser, Tekken proved more proficient at compressing text for approximately 85% of all languages.


The vocabulary has 130k tokens plus 1k control tokens.


Control tokens

```
<unk>
<s>
</s>
[INST]
[/INST]
[AVAILABLE_TOOLS]
[/AVAILABLE_TOOLS]
[TOOL_RESULTS]
[/TOOL_RESULTS]
[TOOL_CALLS]
<pad>
[PREFIX]
[MIDDLE]
[SUFFIX]
```

These are special tokens used during encoding to mark structure such as instructions and tool definitions. They are not produced by encoding their text; each one is inserted directly as a single reserved token, like this: `[INST] + encode("I love Paris") + [/INST]`
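The wrapping rule can be sketched with a toy encoder. Everything here is hypothetical: the control-token IDs, the `CONTROL_TOKENS` table, `NUM_CONTROL`, and the byte-level `encode_text` stand-in are made up for illustration and are not Mistral's actual vocabulary or API. The point is only that control tokens are inserted as single reserved IDs, while user text goes through the normal encoder.

```python
# Hypothetical control-token IDs, chosen for illustration only.
CONTROL_TOKENS = {"<s>": 1, "</s>": 2, "[INST]": 3, "[/INST]": 4}
NUM_CONTROL = 10  # assumption: the first 10 IDs are reserved for control tokens


def encode_text(text: str) -> list[int]:
    """Stand-in for a real BPE encoder: one token per UTF-8 byte, shifted past the reserved IDs."""
    return [b + NUM_CONTROL for b in text.encode("utf-8")]


def encode_user_turn(text: str) -> list[int]:
    """Build <s>[INST] + encode(text) + [/INST]; control tokens never pass through encode_text."""
    return (
        [CONTROL_TOKENS["<s>"], CONTROL_TOKENS["[INST]"]]
        + encode_text(text)
        + [CONTROL_TOKENS["[/INST]"]]
    )


ids = encode_user_turn("I love Paris")
print(ids[:2], ids[-1], len(ids))

# A user who literally types "[INST]" gets ordinary text tokens, never the reserved ID.
spoofed = encode_text("[INST]")
print(CONTROL_TOKENS["[INST]"] in spoofed)
```

Keeping control tokens out of the text encoder is what prevents user-supplied text from being mistaken for a real instruction boundary.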


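```python
from mistral_common.protocol.instruct.messages import (
    UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.tool_calls import (
    Function,
    Tool,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Note: v3() with no arguments loads the SentencePiece-based v3 tokenizer
# (hence the ▁ markers in the printed output further down). The Mistral docs
# use MistralTokenizer.v3(is_tekken=True) to load Tekken instead.
tokenizer = MistralTokenizer.v3()
```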

```python
# Tokenize a list of messages
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        tools=[
            Tool(
                function=Function(
                    name="get_current_weather",
                    description="Get the current weather",
                    parameters={
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state, e.g. San Francisco, CA",
                            },
                            "format": {
                                "type": "string",
                                "enum": ["celsius", "fahrenheit"],
                                "description": "The temperature unit to use. Infer this from the users location.",
                            },
                        },
                        "required": ["location", "format"],
                    },
                )
            )
        ],
        messages=[
            UserMessage(content="What's the weather like today in Paris"),
        ],
    )
)
# tokens is the list of token IDs; text is the rendered prompt string for inspection
tokens, text = tokenized.tokens, tokenized.text
```

```python
from pprint import pprint

pprint(text)
```

```
'<s>[AVAILABLE_TOOLS]▁[{"type":▁"function",▁"function":▁{"name":▁"get_current_weather",▁"description":▁"Get▁the▁current▁weather",▁"parameters":▁{"type":▁"object",▁"properties":▁{"location":▁{"type":▁"string",▁"description":▁"The▁city▁and▁state,▁e.g.▁San▁Francisco,▁CA"},▁"format":▁{"type":▁"string",▁"enum":▁["celsius",▁"fahrenheit"],▁"description":▁"The▁temperature▁unit▁to▁use.▁Infer▁this▁from▁the▁users▁location."}},▁"required":▁["location",▁"format"]}}}][/AVAILABLE_TOOLS][INST]▁What\'s▁the▁weather▁like▁today▁in▁Paris[/INST]'
```
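To see the structure of the rendered prompt more clearly, it can help to split it on the control tokens. The following is a small sketch using only the standard library: `split_on_control_tokens` is a hypothetical helper, the control-token list is copied from the list earlier in these notes, and `sample` is an abbreviated stand-in for the full `text` printed above (the tool schema is shortened to `[...]`). It is meant for inspection only, since literal text that happens to look like a control token would also be matched.

```python
import re

# Control tokens as listed earlier in this walkthrough.
CONTROL = [
    "<unk>", "<s>", "</s>", "[INST]", "[/INST]",
    "[AVAILABLE_TOOLS]", "[/AVAILABLE_TOOLS]",
    "[TOOL_RESULTS]", "[/TOOL_RESULTS]", "[TOOL_CALLS]",
    "<pad>", "[PREFIX]", "[MIDDLE]", "[SUFFIX]",
]

# The capturing group makes re.split keep the control tokens in its output.
CONTROL_RE = re.compile("(" + "|".join(re.escape(t) for t in CONTROL) + ")")


def split_on_control_tokens(rendered: str) -> list[str]:
    """Split a rendered prompt into control tokens and the text between them."""
    return [piece for piece in CONTROL_RE.split(rendered) if piece]


# Abbreviated stand-in for the rendered prompt.
sample = (
    "<s>[AVAILABLE_TOOLS]▁[...][/AVAILABLE_TOOLS]"
    "[INST]▁What's▁the▁weather▁like▁today▁in▁Paris[/INST]"
)
pieces = split_on_control_tokens(sample)
for piece in pieces:
    print(repr(piece))
```

Read this way, the prompt is a tool-definition section wrapped in `[AVAILABLE_TOOLS]` ... `[/AVAILABLE_TOOLS]`, followed by the user message wrapped in `[INST]` ... `[/INST]`.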
