# BPE tokenization on genetic sequences

This notebook trains a Byte Pair Encoding (BPE) subword tokenizer with SentencePiece on genomic haplotype sequences stored in a MongoDB database. The trained tokenizer prepares the data for training models such as Word2Vec or Transformers.

In [None]:
#numeric
import pandas as pd
import numpy as np
#system
import os
from pymongo import MongoClient
#graphic
import matplotlib.pyplot as plt
#tokenizers
import sentencepiece as spm

## 📁 Define Paths and Database Parameters

We define variables for:

- MongoDB database and collection names.
- File path to the dataset partition CSV (train/test splits).
- Paths to save the tokenizer input file and the trained tokenizer model.


In [None]:
db_name = "------"
collection_name = "------"
dset_partition_path = "------"
tokenizer_file_path = "------"
tokenizer_model_path = "------"

## Connect to MongoDB

In [None]:
client = MongoClient("mongodb://localhost:11111/")

db = client[db_name]
collection = db[collection_name]

## Extract Sequences for Tokenizer Training

Here we retrieve haplotype sequences from the MongoDB collection to prepare training data for the tokenizer.


In [None]:
query = {"organism_ID":'1'}
n_sequences = collection.count_documents(query)
training_sequences = collection.find(query)
print(f'Number of training sequences: {n_sequences}')

In [None]:
# Write one uppercase haplotype per line. Opening with "w" truncates any existing file.
with open(tokenizer_file_path, "w") as file:
    for i, sequence in enumerate(training_sequences):
        file.write(sequence['haplotype_1'].upper() + "\n")
        # Overwrite the same console line to show progress
        print(f'{i+1} of {n_sequences}', end='\r')
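
Before training, it can help to confirm that the input file holds one sequence per line and contains only the expected characters. The sketch below is a minimal, hypothetical helper (`summarize_sequences` is not part of the original notebook). The default alphabet `ACGTN` is an assumption and should be adjusted to the data.

```python
from collections import Counter

def summarize_sequences(lines, expected_alphabet="ACGTN"):
    """Count non-empty sequences and characters in an iterable of lines.

    `expected_alphabet` is an assumed set of valid symbols, not a property
    of the original dataset.
    """
    n_sequences = 0
    counts = Counter()
    for line in lines:
        sequence = line.strip()
        if not sequence:
            continue  # skip blank lines
        n_sequences += 1
        counts.update(sequence)
    # Characters that fall outside the assumed alphabet
    unexpected = {ch: n for ch, n in counts.items() if ch not in expected_alphabet}
    return n_sequences, counts, unexpected

# Hypothetical demo lines standing in for the tokenizer input file
demo_lines = ["ACGT\n", "ACGNN\n", "\n", "ACRT\n"]
n_demo, demo_counts, demo_unexpected = summarize_sequences(demo_lines)
print(n_demo, demo_unexpected)
```

In the notebook, the same function could be applied to the real file by passing an open file handle, for example `with open(tokenizer_file_path) as handle: summarize_sequences(handle)`.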

## Train a BPE tokenizer with SentencePiece

We train a subword tokenizer using the SentencePiece library with the following settings:

- `--input`: path to the text file containing haplotype sequences.
- `--model_prefix`: prefix for the output model and vocabulary files.
- `--vocab_size=12000`: the desired vocabulary size.
- `--model_type=bpe`: specifies Byte Pair Encoding (BPE) as the tokenization algorithm.
- `--unk_id=0`, `--unk_piece=N`: unknown tokens are assigned ID 0 and represented as `'N'`.
- `--num_threads=1`: sets single-threaded training (can be increased for faster training).
- `--minloglevel=2`: raises the minimum log level so that informational and warning messages are suppressed and only errors are shown.

This model will later be used to tokenize sequences into subword units for downstream tasks such as sequence modeling or embedding training.


In [None]:
spm.SentencePieceTrainer.train(
    f'--input={tokenizer_file_path} '
    f'--model_prefix={tokenizer_model_path} '
    '--vocab_size=12000 --model_type=bpe '
    '--unk_id=0 --unk_piece=N '
    '--num_threads=1 --minloglevel=2'
)
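
To make the idea behind BPE concrete, the following is a toy sketch of the merge loop in pure Python: start from single characters, repeatedly find the most frequent adjacent pair, and merge it into a new token. This is an illustration only, not SentencePiece's actual implementation, and the function names and example sequences are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter()
    for tokens in sequences:
        pairs.update(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(texts, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(text) for text in texts]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(tokens, pair) for tokens in sequences]
    return merges, sequences

# Hypothetical toy sequences, not taken from the database
toy_merges, toy_tokens = toy_bpe(["ACGACGAC", "ACGT"], num_merges=2)
print(toy_merges)
print(toy_tokens)
```

In this toy setting each merge adds one token to the vocabulary, so the number of merges plays a role similar to `--vocab_size` above.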

## 🧪 Load Tokenizer and Test on Example Sequence

This section demonstrates how to use the trained SentencePiece tokenizer:

1. We load the BPE model using `SentencePieceProcessor()`.
2. We retrieve a single example sequence from the MongoDB collection.
3. We print the original `haplotype_1` in uppercase.
4. We tokenize the sequence:
   - `encode_as_pieces`: returns the tokenized sequence as subword strings.
   - `encode_as_ids`: returns the tokenized sequence as corresponding token IDs.

This test verifies that the tokenizer was trained successfully and produces expected subword units.


In [None]:
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(f'{tokenizer_model_path}.model')

example_sequence = collection.find_one()
haplotype_1 = example_sequence['haplotype_1'].upper()
print(haplotype_1)
print('-'*100)
print(tokenizer.encode_as_pieces(haplotype_1))
print('-'*100)
print(tokenizer.encode_as_ids(haplotype_1))
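
A simple follow-up check is to join the pieces back into a string and look at the average piece length. The sketch below uses hypothetical helpers and a hand-written list of pieces rather than real tokenizer output. It relies on the fact that SentencePiece marks whitespace with the character `▁` (U+2581), which by default also prefixes the first piece.

```python
WORD_BOUNDARY = "\u2581"  # the whitespace marker used by SentencePiece

def pieces_to_text(pieces):
    """Join subword pieces and turn the whitespace marker back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

def mean_piece_length(pieces):
    """Average number of characters per piece, ignoring the marker."""
    lengths = [len(piece.replace(WORD_BOUNDARY, "")) for piece in pieces]
    lengths = [n for n in lengths if n > 0]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Hypothetical pieces, written by hand for illustration
demo_pieces = ["\u2581ACG", "TTA", "G"]
demo_text = pieces_to_text(demo_pieces)
demo_mean = mean_piece_length(demo_pieces)
print(demo_text, demo_mean)
```

In the notebook, `pieces_to_text(tokenizer.encode_as_pieces(haplotype_1))` would be expected to match `haplotype_1` when the sequence contains no whitespace, though this is worth verifying on the actual data.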