# Tokenization

**INPUT**: Preprocessed train, validation, test corpora

**OUTPUT**: LSTM and LLM tokenizer model & vocabulary

| Step | Decision | Status | Comment |
|------|----------|--------|---------|
| Train tokenizer | - | Revision | Parameter decisions pending: the post-preprocessing EDA and the post-tokenization EDA are still missing |
| LSTM tokenization | - | - | - |
| LLM tokenization | - | - | - |

## Input & Setup

### Imports

In [15]:
import pandas as pd
from pathlib import Path
import sentencepiece as spm

### Read corpora

In [16]:
root = Path.cwd().parent
train_path_preprocessed = root / "data" / "corpora" / "processed" / "train-pp.csv"
validation_path_preprocessed = root / "data" / "corpora" / "processed" / "validation-pp.csv"
test_path_preprocessed = root / "data" / "corpora" / "processed" / "test-pp.csv"
train = pd.read_csv(train_path_preprocessed)
validation = pd.read_csv(validation_path_preprocessed)
test = pd.read_csv(test_path_preprocessed)

### Prepare tokenizer training data

In [17]:
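models_dir = root / "data" / "models"
models_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
train_tokenizer_path = models_dir / "lstm_tokenizer_train_corpus.txt"
# Dump the training messages to a text file for SentencePieceTrainer (no index, no header)
train["Message"].to_csv(train_tokenizer_path, index=False, header=False)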
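
Note that `to_csv` applies CSV quoting: a message that contains a comma, a double quote, or a line break is written wrapped in `"` characters, and those characters then become part of the tokenizer's training text. SentencePiece reads its input as raw text with one sentence per line. Below is a minimal sketch of a writer that avoids the quoting, assuming it is acceptable to collapse line breaks inside a message into spaces. `write_one_per_line` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path
from typing import Iterable


def write_one_per_line(messages: Iterable[object], path: Path) -> int:
    """Write each message on its own line as raw text; return the line count."""
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("w", encoding="utf-8") as f:
        for message in messages:
            # Collapse every whitespace run (including newlines) to one space.
            line = " ".join(str(message).split())
            if line:  # skip messages that are empty after normalization
                f.write(line + "\n")
                written += 1
    return written
```

In this notebook it could be called as `write_one_per_line(train["Message"].dropna(), train_tokenizer_path)`. Switching to it changes the training text, so the token IDs printed further down may differ after retraining.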

## Steps

### Train tokenizer

In [18]:
tokenizer_model_path = models_dir / "lstm_tokenizer.model"
# SentencePiece appends ".model" and ".vocab" to the prefix, so drop the suffix here
tokenizer_model_prefix = str(tokenizer_model_path.with_suffix(''))
spm.SentencePieceTrainer.train(
    input=str(train_tokenizer_path),
    model_prefix=tokenizer_model_prefix,
    vocab_size=16000,
    model_type='unigram',
    character_coverage=1.0  # keep every character seen in the training corpus
)


### Tokenization test

In [21]:
sp = spm.SentencePieceProcessor()
sp.load(str(tokenizer_model_path))

# Round trip: text -> ids -> text
ids = sp.encode("winning prizes now!!!")
print(ids)
text = sp.decode(ids)
print(text)
# Encode again as ids and as subword pieces to inspect the segmentation
ids = sp.encode(text, out_type=int)
pieces = sp.encode(text, out_type=str)

for token_id, subword in zip(ids, pieces):
    print(f"ID {token_id:5d} -> '{subword}'")

[858, 824, 11, 130, 15981, 15981, 15981]
winning prizes now!!!
ID   858 -> '▁winning'
ID   824 -> '▁prize'
ID    11 -> 's'
ID   130 -> '▁now'
ID 15981 -> '!'
ID 15981 -> '!'
ID 15981 -> '!'


TODO: build the LSTM, create a notebook for the LLM, and decide whether to build the LLM in OOP style

TODO: pre-tokenization EDA - message lengths, new tokens, artifact effects (to inform the max length and num_words model parameters)
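
As a starting point for the message-length part of this EDA, here is a minimal sketch using only the standard library. `length_summary` is a hypothetical helper; the 95th percentile and the whitespace-based word count are assumptions chosen for illustration, not decisions made in this notebook.

```python
import statistics
from typing import Iterable, List


def length_summary(messages: Iterable[object]) -> dict:
    """Summarize message lengths in characters and whitespace-separated words."""
    texts = [str(m) for m in messages]
    if not texts:
        raise ValueError("no messages to summarize")

    def describe(values: List[int]) -> dict:
        values = sorted(values)
        # Nearest-rank 95th percentile: index ceil(0.95 * n) - 1, in integer math.
        p95_index = (95 * len(values) + 99) // 100 - 1
        return {
            "min": values[0],
            "median": statistics.median(values),
            "p95": values[p95_index],
            "max": values[-1],
        }

    return {
        "n": len(texts),
        "chars": describe([len(t) for t in texts]),
        "words": describe([len(t.split()) for t in texts]),
    }
```

Calling `length_summary(train["Message"].dropna())` on each split gives a first indication of a reasonable maximum sequence length before any tokenizer is involved.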

TODO: post-tokenization EDA - sequence lengths, vocab coverage, OOV rates (to inform model parameters)
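
A minimal sketch of how these statistics could be collected. `tokenization_stats` is a hypothetical helper that accepts any encode function, so it can be tried without a trained model; for a subword tokenizer, "OOV" is taken here to mean tokens mapped to the unknown id, which is an assumption about what the TODO intends.

```python
from typing import Callable, Iterable, List


def tokenization_stats(
    messages: Iterable[str],
    encode: Callable[[str], List[int]],
    unk_id: int,
    max_len: int,
) -> dict:
    """Sequence-length and unknown-token statistics for a candidate max_len."""
    lengths = []
    unk_tokens = 0
    for message in messages:
        ids = encode(message)
        lengths.append(len(ids))
        unk_tokens += sum(1 for i in ids if i == unk_id)
    total_tokens = sum(lengths)
    n = len(lengths)
    return {
        "n_sequences": n,
        "mean_len": total_tokens / n if n else 0.0,
        "max_len_seen": max(lengths, default=0),
        # Share of sequences that would be truncated at max_len.
        "truncated_share": sum(length > max_len for length in lengths) / n if n else 0.0,
        # Share of all tokens that were mapped to the unknown id.
        "unk_rate": unk_tokens / total_tokens if total_tokens else 0.0,
    }
```

With the processor loaded above, a call might look like `tokenization_stats(validation["Message"], sp.encode, sp.unk_id(), max_len=128)`, where `128` is only a placeholder value.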

TODO: decide how to handle **replies** and **forwarded** messages (remove them? inspect them manually in a separate notebook)

TODO: train one tokenizer for the LSTM and one for the LLM, with different vocab sizes and different training data for the SentencePieceTrainer
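
One way to keep the two tokenizers apart is to build the trainer arguments from a small helper, so each model gets its own corpus file, prefix, and vocab size. This is a sketch under assumptions: `trainer_kwargs` is a hypothetical helper, it reuses the `unigram` model type and `character_coverage=1.0` from the training cell above, and the vocab size for the LLM has not been decided in this notebook.

```python
from pathlib import Path


def trainer_kwargs(name: str, corpus_path: Path, models_dir: Path, vocab_size: int) -> dict:
    """Build keyword arguments for SentencePieceTrainer.train for one tokenizer."""
    return {
        "input": str(corpus_path),
        # SentencePiece writes <prefix>.model and <prefix>.vocab
        "model_prefix": str(models_dir / f"{name}_tokenizer"),
        "vocab_size": vocab_size,
        "model_type": "unigram",
        "character_coverage": 1.0,
    }
```

The LSTM tokenizer from this notebook would then be trained with `spm.SentencePieceTrainer.train(**trainer_kwargs("lstm", train_tokenizer_path, models_dir, 16000))`, and the LLM tokenizer with its own corpus path and vocab size once those are chosen.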
