# Byte Pair Encoding (BPE) Tokenizer - Testing Notebook

This notebook tests the functionality of the BPE tokenizer you've implemented.

## Setup


Run the cells below to load the required modules.


In [1]:
import os
import sys
print("Current working directory:", os.getcwd())
sys.path.append(os.getcwd())

Current working directory: d:\Neural Networks for NLP Assignments\Assignment 1\Exercise 1.3\BPE


In [2]:
from regex_tokenizer import RegexTokenizer
from basic import BasicTokenizer
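sample_text_path = "sample_text.txt"
# Read as UTF-8 explicitly so the result does not depend on the platform default encoding.
with open(sample_text_path, "r", encoding="utf-8") as file:
    sample_text = file.read()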

print("Sample Text Loaded:")
print(sample_text[:200])


Sample Text Loaded:
Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.
---

Main menu

WikipediaThe Free Encyclopedia

Search
Create account
Log in

Personal tools
Contents  hide
(Top)
Life and care


### Complete the tokenizer code before running the cell below.
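For orientation, the cells in this notebook call five methods on each tokenizer: `building_merges`, `save`, `load`, `encode`, and `decode`. The sketch below writes that interface down as a `typing.Protocol`. The method names come from the notebook, but the parameter names, type hints, and return types are assumptions, and `TokenizerLike` and `ByteOnlyTokenizer` are hypothetical names used only for illustration.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class TokenizerLike(Protocol):
    """Hypothetical interface inferred from the calls made in this notebook."""

    def building_merges(self, text: str, vocab_size: int, verbose: bool = False) -> None: ...
    def encode(self, text: str) -> List[int]: ...
    def decode(self, ids: List[int]) -> str: ...
    def save(self, file_prefix: str) -> None: ...
    def load(self, model_file: str) -> None: ...


class ByteOnlyTokenizer:
    """Placeholder with no learned merges: every token is a raw UTF-8 byte."""

    def building_merges(self, text, vocab_size, verbose=False):
        pass  # a real implementation would learn vocab_size - 256 merges here

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8", errors="replace")

    def save(self, file_prefix):
        pass  # persistence is omitted in this placeholder

    def load(self, model_file):
        pass  # persistence is omitted in this placeholder


placeholder = ByteOnlyTokenizer()
# isinstance on a runtime-checkable Protocol only checks that the methods exist.
print(isinstance(placeholder, TokenizerLike))
print(placeholder.decode(placeholder.encode("héllo")) == "héllo")
```

A check like this only confirms that the methods are present; it says nothing about whether the merges themselves are correct.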

In [7]:
import os
import time

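# Read the training text with a context manager so the file handle is closed.
with open("sample_text.txt", "r", encoding="utf-8") as f:
    text = f.read()
os.makedirs("models", exist_ok=True)

t0 = time.time()
for TokenizerClass, name in zip([BasicTokenizer, RegexTokenizer], ["basic", "regex"]):
    tokenizer = TokenizerClass()
    # Target vocabulary of 512 tokens: 256 raw byte values plus 256 learned merges.
    tokenizer.building_merges(text, 512, verbose=True)
    # Save under models/<name>; a later cell loads models/<name>.model.
    prefix = os.path.join("models", name)
    tokenizer.save(prefix)
t1 = time.time()

# The elapsed time covers training and saving both tokenizers.
print(f"Merging took {t1 - t0:.2f} seconds")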

Merge 25/256: pair (277, 102) -> 280, freq: 871
Merge 50/256: pair (116, 111) -> 305, freq: 642
Merge 75/256: pair (322, 328) -> 330, freq: 446
Merge 100/256: pair (271, 100) -> 355, freq: 277
Merge 125/256: pair (68, 342) -> 380, freq: 207
Merge 150/256: pair (321, 110) -> 405, freq: 176
Merge 175/256: pair (395, 355) -> 430, freq: 136
Merge 200/256: pair (44, 91) -> 455, freq: 118
Merge 225/256: pair (405, 366) -> 480, freq: 106
Merge 250/256: pair (480, 273) -> 505, freq: 89
Merge 25/256: pair (277, 102) -> 280, freq: 871
Merge 50/256: pair (116, 111) -> 305, freq: 642
Merge 75/256: pair (322, 328) -> 330, freq: 446
Merge 100/256: pair (271, 100) -> 355, freq: 277
Merge 125/256: pair (68, 342) -> 380, freq: 207
Merge 150/256: pair (321, 110) -> 405, freq: 176
Merge 175/256: pair (395, 355) -> 430, freq: 136
Merge 200/256: pair (44, 91) -> 455, freq: 118
Merge 225/256: pair (405, 366) -> 480, freq: 106
Merge 250/256: pair (480, 273) -> 505, freq: 89
Merging took 25.69 seconds
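Each log line above reports one merge: the most frequent adjacent pair of token ids is replaced by a newly assigned id. The following is a minimal sketch of a single merge step on a toy string, not the implementation in `basic.py` or `regex_tokenizer.py`. The helper names `get_pair_counts` and `merge` are hypothetical.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


# Start from raw UTF-8 bytes, so the initial ids are in the range 0..255.
ids = list("aaabdaaabac".encode("utf-8"))
counts = get_pair_counts(ids)
top_pair, freq = counts.most_common(1)[0]

# The first learned token gets id 256, the first value after the byte range.
new_ids = merge(ids, top_pair, 256)
print(f"pair {top_pair} -> 256, freq: {freq}")
print(new_ids)
```

Repeating this step, each time on the output of the previous one and with the next unused id, yields a merge table like the one logged during training.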


In [4]:
sample_text = "This is an example sentence to encode."

basic_tokenizer = BasicTokenizer()
basic_tokenizer.load("models/basic.model")

regex_tokenizer = RegexTokenizer()
regex_tokenizer.load("models/regex.model")

encoded_basic = basic_tokenizer.encode(sample_text)
encoded_regex = regex_tokenizer.encode(sample_text)

print("Encoded with Basic Tokenizer:", encoded_basic)
print("Encoded with Regex Tokenizer:", encoded_regex)

decoded_basic = basic_tokenizer.decode(encoded_basic)
decoded_regex = regex_tokenizer.decode(encoded_regex)

print("Decoded with Basic Tokenizer:", decoded_basic)
print("Decoded with Regex Tokenizer:", decoded_regex)


Encoded with Basic Tokenizer: [84, 432, 262, 498, 357, 101, 120, 362, 112, 467, 115, 450, 290, 99, 256, 352, 290, 99, 111, 100, 101, 46]
Encoded with Regex Tokenizer: [84, 104, 367, 32, 367, 32, 270, 32, 101, 120, 362, 112, 397, 32, 115, 450, 290, 99, 101, 32, 305, 32, 290, 99, 111, 100, 101, 46]
Decoded with Basic Tokenizer: This is an example sentence to encode.
Decoded with Regex Tokenizer: This is an example sentence to encode.
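Both tokenizers reproduce the input exactly, but they use a different number of tokens to do so. As a rough way to compare them, the sketch below copies the two id lists printed above and computes the token count and the average number of UTF-8 bytes per token. The variable names are illustrative, and the lists are pasted from the output rather than recomputed.

```python
text = "This is an example sentence to encode."

# Token ids copied verbatim from the notebook output above.
encoded_basic = [84, 432, 262, 498, 357, 101, 120, 362, 112, 467, 115, 450,
                 290, 99, 256, 352, 290, 99, 111, 100, 101, 46]
encoded_regex = [84, 104, 367, 32, 367, 32, 270, 32, 101, 120, 362, 112, 397,
                 32, 115, 450, 290, 99, 101, 32, 305, 32, 290, 99, 111, 100,
                 101, 46]

n_bytes = len(text.encode("utf-8"))

for name, ids in [("basic", encoded_basic), ("regex", encoded_regex)]:
    # Every id should fall inside the trained vocabulary of 512 tokens.
    assert all(0 <= token_id < 512 for token_id in ids)
    ratio = n_bytes / len(ids)
    print(f"{name}: {len(ids)} tokens for {n_bytes} bytes ({ratio:.2f} bytes per token)")
```

On this sentence the basic tokenizer needs 22 tokens and the regex tokenizer 28, for 38 bytes of input. A single short sentence is not a reliable benchmark, so treat the comparison as a sanity check rather than a measurement.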
