# Mini Tutorial: Tokenization in NLP
This notebook walks through four tokenization strategies (word, character, subword, and byte level) and applies each one to a short example.

## Example sentence

In [None]:

text = "Learning turbulence models is hard!"
print(text)


## 1. Word-level Tokenization

In [None]:

import re

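def word_tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    # \w matches letters, digits, and underscore, so punctuation such as "!" is dropped.
    return re.findall(r"\b\w+\b", text.lower())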

print(word_tokenize(text))
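
To make the out-of-vocabulary problem concrete, the sketch below maps word tokens to integer ids. The vocabulary `word_to_id`, the `<unk>` symbol, and the helper `encode_words` are hypothetical names introduced only for this illustration; a real vocabulary would be built from a training corpus.

```python
import re

def word_tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

# Hypothetical toy vocabulary; id 0 is reserved for unknown words.
UNK = "<unk>"
word_to_id = {UNK: 0, "learning": 1, "models": 2, "is": 3, "hard": 4}

def encode_words(text):
    # Any token missing from the vocabulary falls back to the <unk> id.
    return [word_to_id.get(tok, word_to_id[UNK]) for tok in word_tokenize(text)]

ids = encode_words("Learning turbulence models is hard!")
print(ids)  # [1, 0, 2, 3, 4]
```

Because "turbulence" is not in the toy vocabulary, it collapses to id 0 and its identity is lost, which is the weakness the subword and byte-level approaches try to avoid.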


## 2. Character-level Tokenization

In [None]:

def char_tokenize(text):
    """Split the text into characters, keeping spaces and punctuation."""
    return list(text)

print(char_tokenize(text))
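
The trade-off of character-level tokenization can be sketched by building a character vocabulary and comparing sequence lengths. Deriving the vocabulary from the example sentence itself is an assumption made for brevity, and `char_to_id` and `id_to_char` are hypothetical names.

```python
text = "Learning turbulence models is hard!"

# Build a character vocabulary from the text itself (illustration only;
# a real vocabulary would come from a training corpus).
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in ids)

print(len(chars))                    # size of the character vocabulary
print(len(text.split()), len(ids))   # 5 whitespace-separated words vs 35 characters
print(decoded == text)               # the mapping round-trips losslessly
```

The vocabulary stays small, but the same sentence needs seven times as many tokens as a word-level split.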


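## 3. Toy BPE-style Tokenization (simplified)

This cell only applies a fixed subword vocabulary by greedy longest match. It does not learn merges from data, which is what real BPE does.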

In [None]:

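def bpe_tokenize(text, vocab):
    """Greedy longest-match segmentation of text against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(len(text), i, -1):
            sub = text[i:j]
            if sub in vocab:
                tokens.append(sub)
                i = j
                break
        else:
            # No vocabulary entry starts at i: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens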

vocab = {"Learning", " turbulence", " models", " is", " hard", "!"}
print(bpe_tokenize(text, vocab))
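
The cell above takes its vocabulary as given. As a rough sketch of where such a vocabulary could come from, the code below learns BPE merges by repeatedly fusing the most frequent adjacent symbol pair. The function name `learn_bpe_merges` and the tiny word list are assumptions for illustration; production tokenizers add details such as end-of-word markers, pre-tokenization, and byte fallback.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Ties are broken by first occurrence (a choice made for this sketch).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)        # [('l', 'o'), ('lo', 'w')]
print(dict(corpus))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer suffixes remain split into characters.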


## 4. Byte-level Tokenization

In [None]:

def byte_tokenize(text):
    """Return the UTF-8 bytes of the text as integers in the range 0-255."""
    return list(text.encode("utf-8"))

print(byte_tokenize("Hi!"))
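
"Hi!" is pure ASCII, so each character is exactly one byte. The sketch below uses a non-ASCII character to show that byte sequences can be longer than character sequences and that the encoding round-trips. The helper `byte_detokenize` is a hypothetical name added for this example.

```python
def byte_tokenize(text):
    return list(text.encode("utf-8"))

def byte_detokenize(byte_ids):
    # Inverse of byte_tokenize: rebuild the bytes, then decode as UTF-8.
    return bytes(byte_ids).decode("utf-8")

sample = "na\u00efve"  # "naive" written with i-diaeresis (U+00EF)
ids = byte_tokenize(sample)

print(ids)                             # [110, 97, 195, 175, 118, 101]
print(len(sample), len(ids))           # 5 characters, 6 bytes
print(byte_detokenize(ids) == sample)  # True
```

The single character U+00EF becomes the two bytes 195 and 175, so every possible input is representable with only 256 symbols, at the cost of longer sequences.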


## Summary
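- Word-level: simple, but any word missing from the vocabulary is out-of-vocabulary (OOV)
- Character-level: no OOV words, but sequences are much longer
- BPE: subword units learned by merging frequent symbol pairs; GPT-2 and GPT-3 use a byte-level variant
- SentencePiece: a tokenizer library that trains on raw text without language-specific pre-tokenization, which suits multilingual models (Gemini, LLaMA)
- Byte-level: any string maps to values 0-255, so nothing is OOV, at the cost of the longest sequences; GPT-4 applies BPE on top of bytes rather than using raw bytes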