## 1. Introduction

In this notebook, you will learn about tokenizers and about data preprocessing for training a large language model.

First, you download the raw data and implement a preprocessing pipeline for it. Then you train the three types of tokenizers covered in class on your data and compare them.

Answer the questions marked in green font.


In [1]:
print("➡️ Step 1: Installing a compatible NumPy version...")
!pip install -U "numpy>=1.26"

print("\n➡️ Step 2: Installing Hazm without its dependencies...")
!pip install --no-deps hazm

print("\n➡️ Step 3: Installing other required libraries (with corrected package name)...")
!pip install nltk fasttext-wheel "gensim<5.0.0" flashtext transformers huggingface_hub python-crfsuite

print("\n➡️ Step 4: Importing libraries and setting up the environment...")
from IPython.display import clear_output
from huggingface_hub import hf_hub_download
from tqdm.notebook import tqdm
import numpy as np
import random
import transformers
import gzip
import json
import fasttext
from hazm import *

# Initialize components and set the seed
normalizer = Normalizer()
SEED=21
np.random.seed(SEED)
random.seed(SEED)

clear_output()
print("✅✅✅ Success! All packages are correctly installed and the environment is ready.")

✅✅✅ Success! All packages are correctly installed and the environment is ready.



### Download Raw Data

In this section, download the raw data.

In [3]:

!GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
%cd c4
!git lfs pull --include "c4-fa.tfrecord-00000-of-01024.json.gz"
%cd ..

file_path = '/content/c4/multilingual/c4-fa.tfrecord-00000-of-01024.json.gz'

with gzip.open(file_path, 'rb') as f:
    file_content = f.read().decode('utf8')

data = file_content.split('\n')

text_data = []

for i in data:
    if i!='':
        text_data.append(json.loads(i)['text'])

!wget https://raw.githubusercontent.com/roshan-research/hazm/master/hazm/data/stopwords.dat
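# Read the stop word list explicitly as UTF-8 so Persian text decodes correctly on any platform
with open('stopwords.dat', 'r', encoding='utf-8') as f:
    stop_words = f.read()
stop_words = stop_words.rstrip()
stop_words = stop_words.split('\n')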

clear_output()

### Statistics (5 points)

In this section, you should report several statistical characteristics of the data:
- Average length of the documents
- Largest length of the documents
- Smallest length of the documents
- The number of words in the data
- The most frequent word

In [4]:
from collections import Counter
import re

# --- Calculate Word-Based Statistics ---

doc_lengths_in_words = [len(doc.split()) for doc in text_data]
avg_length = sum(doc_lengths_in_words) / len(doc_lengths_in_words)
max_length = max(doc_lengths_in_words)
min_length = min(doc_lengths_in_words)
total_words = sum(doc_lengths_in_words)
word_counts = Counter()
for doc in text_data:
    words = doc.split()
    word_counts.update(words)
most_common_word, most_common_word_count = word_counts.most_common(1)[0]

# --- Report the Statistics ---
print("--- Dataset Statistics ---")
print(f"Number of documents: {len(text_data):,}")
print(f"Average document length: {avg_length:.2f} words")
print(f"Largest document length: {max_length:,} words")
print(f"Smallest document length: {min_length:,} words")
print("-" * 25)
print(f"Total number of words: {total_words:,}")
print(f"Most frequent word: '{most_common_word}' (appeared {most_common_word_count:,} times)")


--- Dataset Statistics ---
Number of documents: 52,663
Average document length: 504.17 words
Largest document length: 32,478 words
Smallest document length: 2 words
-------------------------
Total number of words: 26,550,903
Most frequent word: 'و' (appeared 1,026,606 times)


In [8]:
text_data[0]

'قیمت دوربین مراقبت بچه بیسیم با برد 200 متر و قابلیت مکالمه 2 طرفه مدل 601 فقط 435000 تومان!!!\nخانهمراقبتی و امنیتیدوربینهای مداربسته و نظارتیدوربین مراقبت بچه بیسیم با برد 200 متر و قابلیت مکالمه 2 طرفه مدل 601\nکد کالا: 32514\nست کامل دوربین و مانیتور وایرلس با برد 200 متر با ارتباط صوتی و مکالمه دوطرفه ، 8 لامپ IR دید در شب با صفحه نمایش 2 اینچی\nدوربین مراقبت بچه چیست ؟ دستگاهی شامل یک مانیتور یا تلویزیون و یک دوربین میباشد که توانایی پخش تصاویر ویدیویی دوربین بصورت زنده و لایو را دارد.\nدوربین مراقبت از کودک و مانیتور بی سیم یکی از مناسبترین وسیله ها برای مراقبت از بچه ها و کودکان و حتی کهنسالان یا معلولین جسمی و ذهنی و فرزندان و افرادی که نیاز به مراقبت و نظارت و کنترل محسوس یا نامحسوس دارند میباشد.\nمانیتور کوچک آن براحتی قابل حمل بوده و با مزیت هایی که در زیر به بعضی از آنها اشاره میکنیم یک وسیله نظارتی کامل در دستان خود خواهید داشت .\nتصویر Live و زنده\nتصویری که از طرف دوربین بر روی مانیتور میبینید بصورت آنلاین میباشد یعنی تصویر زنده لحطه ای از محیط روبروی لنز دوربین\nوایرل

In [11]:


orginal_text_data = text_data[:]
len(orginal_text_data)


52663

### Low-quality Examples

In this section, several examples of poor-quality data from each category are provided. At the end of the preprocessing pipeline, you must remove such data from the original dataset to obtain high-quality data.

In [12]:

non_persian_example = orginal_text_data[9]
print(non_persian_example)


آذر ۷, ۱۳۹۷ - دیمه نیوز
Sportsbook Sites and Football Wagering Tips
Football gambling on tips-football playing advice on line The This country's Bookie Sportsbook, as a good useful support, delivers this kind of perceptive sections technique side bet your preferred sporting such as hockey, soccer, basketball game, tennis, sports, moose race, NASCAR, rugby, plus world of golf. Family home " Much more " Sports Bets ...
Major Business Strategies Selections Your solution to organization will most likely be influenced by the end objective. Then you can observe how without problems you operate your organization with low concerns in the foreseeable future. No matter of what topic you select to your new marketing and advertising business, ...
Top rated Business Strategies Options Your approach to organization is to damaged by the end target. Then you are going to observe how easily you run your company with very minimal issues down the road. Irrespective of what issue you select to your new on

In [13]:

short_length_example = orginal_text_data[1494]
print(short_length_example)


﻿ ﻿ مرجع آموزش زبان ایرانیان - _دانلود_نرم_افزار_سطح_مقدماتی_Oxford_Word_Skills_Basic_ ﻿


In [14]:

mean_length_word_example = orginal_text_data[250]
print(mean_length_word_example)


دو جمله حکیمانه : روانشناسی و اعتیاد - Page 9
159 پست • صفحه 9 از 11 • 1 ... 6, 7, 8, 9, 10, 11
[ 접속주소:opxx3.COM] &rdquo;&rdquo; 안동 섹시 &rdquo;&rdquo;
توسط 97WzugSKd3 » 6 مرداد 1397, 22:27
 [opxx3.COM]  &rdquo;☚&rdquo; 고양 강남 출장 만남 &there4;
توسط VTR8ve0Hhj » 6 مرداد 1397, 00:36
URL &rarr; [opxx3.COM] &rdquo;&rdquo; 김포 출장샵 &rdquo;✌&r
توسط PjTf4lb7wh » 6 مرداد 1397, 00:39
[opxx3.COM] &rdquo;&rdquo; 경주 출장안마 ※
توسط 97WzugSKd3 » 7 مرداد 1397, 06:02
[ 접속주소:opxx3.COM] &rdquo;&rdquo; 김천 서비스 &rdquo;&rdquo;
توسط rH7yB6bGKL » 7 مرداد 1397, 08:18
[opxx3.COM] &rdquo;&rdquo; 김천 콜걸 &rdquo;&rdquo;
توسط J5HDhyr3Hc » 7 مرداد 1397, 08:18
[opxx3.COM] &rdquo;&rdquo; 포항 강남 출장 아로마 &rdquo;&rdquo;
توسط J5HDhyr3Hc » 7 مرداد 1397, 08:20
WEBSITE  [opxx3.COM]  ※ 남양주 부평 출장 &rdquo;☚&rdquo;
توسط 97WzugSKd3 » 7 مرداد 1397, 12:23
[url=ï»¿https://stackoverflow.com/search?q=%F0%9F%8C%8F%20%5Bopxx3.COM%5D%20%F0%9F%8C%8F%2B%E2%80%9D%E2%9C%8C%E2%80%9D%2B%E2%80%9D%F0%9F%A4%A3%E2%80%9D%2B%E2%80%9D%F0%9F%8D%94%E2%8

In [15]:

symbol_ration_example = orginal_text_data[15018]
print(symbol_ration_example)


آموزش اضافه کردن تب های admincp و modcp به وی بی ایران خوش آمدید
ثبت نام کنید یا وارد شوید صفحه اصلی سایت مقالات رسانه تصویری آپلودسنتر فروشگاه دانلودر پست های جدید پرسش و پاسخ تقویم ابزار انجمن انتخاب بعنوان خوانده شده کلیدهای میانبر ارسال های امروز نمایش مدیران انجمن User Tagging Statistics Hash Tag Subscriptions وی بولتین مقالات آموزش اضافه کردن تب های admincp و modcp تاپیک به صورت خودکار بعد از 5 ثانیه آپدیت میشود بنابراین برای نمایش پست های جدید نیازی به رفرش صفحه نیست حالت ریفرش اتوماتیک به علت بی توجهی شما به این صفحه غیر فعال شد . «روشن کردن ریفرش اتوماتیک»
مقاله: آموزش اضافه کردن تب های admincp و modcp لینک بک آدرس لینک بک درباره لینک بک ها اضافه کردن به علاقه مندی / اشتراک فرستادن این موضوع به Digg !اضافه کردن موضوع به Delicious !اضافه کردن به علاقه مندی تکنوراتیفرستادن موضوع به Twitter ! ابزار مقاله پرینت این صفحه / حالت نمایش بصورت پرینت شده ارسال این صفحه به ایمیل یک دوست یا خودتان… Subscribe to this Article… نحوه نمایش موضوع حالت خطی تعویض به حالت ترکیبی تعوض به حالت رشته

In [16]:

alphabet_example = orginal_text_data[1]
print(alphabet_example)


املاک -- مشاهده تمام آگهی های دسته -- اجاره املاک اداری تجاری تهران ( 3) اجاره املاک مسکونی تهران ( 2) خرید و فروش آپارتمان ( 21) خرید و فروش املاک اداری تجاری تهران ( 3) خرید و فروش املاک مسکونی تهران ( 1) خرید و فروش خانه ( 3) خرید و فروش زمین ( 5) رهن و اجاره آپارتمان ( 5) رهن و اجاره خانه ( 1) سایر ( 9) کلنگی ( 2) مغازه و غرفه ( 3) ویلا ( 1) کالاو لوازم -- مشاهده تمام آگهی های دسته -- اثاثیه منزل ( 14) اداری ( 11) الکترونیک و دیجیتال ( 23) ایمنی ( 0) -- دوربین مدار بسته بازی و سرگرمی ( 1) برقی و گازی ( 12) بهداشتی ساختمان ( 9) بهداشتی و آرایشی ( 20) پزشکی ( 28) پوشاک ( 14) تزئینی , پرده , کرکره و ... ( 8) تزیینی ( 12) تعمیرگاهی ( 0) چوبی و فلزی ( 6) خیاطی و بافندگی ( 2) رستوران و فست فود ( 5) زیور آلات ( 1) ساعت ( 0) سایر ( 29) سرمایشی و گرمایشی ( 4) سیسمونی و نوزاد ( 0) شکار و ماهیگیری ( 1) صنعتی ( 5) صوتی و تصویری ( 9) عینک ( 0) فرش و زیر انداز ( 1) فیلم برداری و عکاسی ( 0) کالا و تجهیزات برقی ( 6) کتاب و مجلات ( 4) کیف و کفش ( 0) لوازم التحریر ( 1) لوازم ساختمانی ( 19) لوازم فرو

In [17]:

stopword_example = orginal_text_data[74]
print(stopword_example)

بقشاب پرنده های زیر آبی :: پنجره ی جهان نما
بقشاب پرنده های زیر آبی
چهارشنبه, ۱ مرداد ۱۳۹۳، ۰۷:۵۷ ب.ظ
منبع : مجله ی اینترنتی دانستنیها


## 2. Preprocessing (85 points)

In this part, you have to implement the pre-processing pipeline. This pipeline consists of the following sections:

1- Language Identification: In this section, you need to identify the language of the data using FastText and remove any data that is not in Farsi.

2- Length Filter: Examples that are much shorter or much longer than a typical document are unlikely to contain complete sentences or useful text. In this section, you should remove samples whose word count falls outside a chosen range.

3- Mean Word Length: In high-quality text, the average number of characters per word falls within a certain range. You should compute the average word length of each sample and exclude samples that fall outside the range.

4- Symbol Ratio: To ensure that a text is of good quality for training a model, the ratio of symbol count to word count should not exceed a threshold. In this task, you should calculate this ratio for each sample and filter out any sample whose ratio is above the threshold.

5- Alphabetic Filter: In this part, you need to find the ratio of the number of words that contain at least one alphabetic character to the total number of words. If this ratio is below a certain limit, you should remove the sample.

6- Stop Word Filter: Find the number of stop words in a sample. If it's below a limit, delete the sample.
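
The six stages above share one shape: each takes a list of documents and returns the subset that passes. The following is a minimal sketch of chaining such stages and reporting how much of the raw data survives each one. The helper `run_pipeline` and the two toy filters with their thresholds are hypothetical stand-ins for illustration, not the filters required by the assignment.

```python
from typing import Callable, List, Tuple

Filter = Callable[[List[str]], List[str]]


def run_pipeline(
    docs: List[str], stages: List[Tuple[str, Filter]]
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Apply each filter in order and record the fraction of raw documents kept."""
    total = len(docs)
    report = []
    for name, stage in stages:
        docs = stage(docs)
        kept = len(docs) / total if total else 0.0
        report.append((name, kept))
    return docs, report


# Toy stand-ins for two stages (hypothetical thresholds, chosen for the demo only).
def toy_length_filter(docs: List[str]) -> List[str]:
    return [d for d in docs if 3 <= len(d.split()) <= 10]


def toy_alphabetic_filter(docs: List[str]) -> List[str]:
    def ratio(d: str) -> float:
        words = d.split()
        if not words:
            return 0.0
        return sum(any(c.isalpha() for c in w) for w in words) / len(words)

    return [d for d in docs if ratio(d) >= 0.8]


raw = ["این یک جمله خوب است", "کوتاه", "1 2 3 4 5"]
clean, report = run_pipeline(
    raw, [("length", toy_length_filter), ("alphabetic", toy_alphabetic_filter)]
)
for name, kept in report:
    print(f"{kept:.0%} of raw data remained after {name} filtering")
```

Reporting every stage against the original total, rather than against the previous stage, makes it easy to see which filter removes the most data.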

In [54]:
# Ensure that each sample is labeled either as "__label__pes_Arab" or "__label__prs_Arab"

# You must find the right hyperparameters for these!!!
length_filter_min = 50
length_filter_max = 10000
word_length_filter_min = 3
word_length_filter_max = 10
symbol_filter_max_ratio = 0.1
alphabetic_filter_min_ratio = 0.8
stop_words_filter_min_count = 3
##################################################
#Tuning
length_filter_min = 75
length_filter_max = 10000
word_length_filter_min = 3
word_length_filter_max = 8
symbol_filter_max_ratio = 0.07
alphabetic_filter_min_ratio = 0.9
stop_words_filter_min_count = 5

def language_filter(text_data):
    model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
    model = fasttext.load_model(model_path)
    return_value = []
    for text_row in tqdm(text_data):
        text = text_row.replace("\n", " ")
        # The model.predict method returns a tuple of labels and probabilities.
        # We ask for the top prediction (k=1).
        prediction = model.predict(text, k=1)
        # The label is the first element of the first tuple, e.g., ('__label__pes_Arab',)
        lang = prediction[0][0]
        if lang in ['__label__pes_Arab', '__label__prs_Arab']:
            return_value.append(text_row)
    return return_value


# Keep samples that have a word count between "min_length" and "max_length"
def length_filter(text_data, min_length=length_filter_min, max_length=length_filter_max):
    return_value = []
    for text_row in tqdm(text_data):
        length = len(text_row.split())
        if length >= min_length and length <= max_length:
            return_value.append(text_row)
    return return_value


# Keep samples that have a mean word length between "min_length" and "max_length"
def word_length_filter(text_data, min_length=word_length_filter_min, max_length=word_length_filter_max):
    return_value = []
    for text_row in tqdm(text_data):
        words = text_row.split()

        # Handle cases with no words to avoid division by zero
        if not words:
            mean_word_length = 0
        else:
            total_word_length = sum(len(word) for word in words)
            mean_word_length = total_word_length / len(words)
        if min_length <= mean_word_length <= max_length:
            return_value.append(text_row)
    return return_value

# Calculate the ratio of symbol count to total word count for each sample then
# keep samples that have a ratio lower than the "max_ratio"
def symbol_filter(text_data, max_ratio=symbol_filter_max_ratio):
    symbols = ["#", "*", "...", "@", "$", "&"]
    return_value = []
    for text_row in tqdm(text_data):

        words = text_row.split()
        total_symbols = 0
        for j in symbols:
            # For each symbol, count its occurrences in the entire text
            total_symbols += text_row.count(j)

        total_words = len(words)

        symbol_to_word_ratio = total_symbols / total_words if total_words > 0 else 0
        if symbol_to_word_ratio <= max_ratio:
            return_value.append(text_row)
    return return_value



# Calculate the ratio of alphabetical-word count to total word count for each sample then
# keep samples that have a ratio higher than the "min_ratio"
def alphabetic_filter(text_data, min_ratio=alphabetic_filter_min_ratio):
    return_value = []
    for text_row in tqdm(text_data):
        words = text_row.split()

        if not words:
            alphabetic_word_count = 0
        else:
            # Count a word if it contains at least one alphabetic character.
            # This is more robust than word.isalpha() because it allows words like "COVID-19".
            alphabetic_word_count = sum(1 for word in words if any(char.isalpha() for char in word))

        total_words = len(words)
        alphabetic_ratio = alphabetic_word_count / total_words if total_words > 0 else 0
        if alphabetic_ratio >= min_ratio:
            return_value.append(text_row)
    return return_value


# Keep samples that have a stop word count higher than the "min_count"
# Stop words list provided in the "stop_words" variable
def stop_words_filter(text_data, stop_words_list=stop_words, min_count=stop_words_filter_min_count):
    return_value = []
    # For efficiency, convert the list of stop words to a set
    stop_words_set = set(stop_words_list)

    for text_row in tqdm(text_data):
        words = text_row.split()

        # Count how many words from the document are in our set of stop words
        stop_word_count = sum(1 for word in words if word in stop_words_set)
        if stop_word_count >= min_count:
            return_value.append(text_row)
    return return_value

### Sanity Check

In this cell, a simple check is run on the examples above.

In [50]:

good_example = orginal_text_data[150]

assert (language_filter([non_persian_example, good_example])==[good_example])
assert (length_filter([short_length_example, good_example])==[good_example])
assert (word_length_filter([mean_length_word_example, good_example])==[good_example])
assert (symbol_filter([symbol_ration_example, good_example])==[good_example])
assert (alphabetic_filter([alphabet_example, good_example])==[good_example])
assert (stop_words_filter([stopword_example, good_example])==[good_example])





### Normalization

In this cell, normalize the filtered texts using `normalizer.normalize`.

In [52]:
def normalize(text_data):
    return [
        normalizer.normalize(text_row)
        for text_row in tqdm(text_data)
    ]

### Pipeline

Run your preprocessing pipeline in this section.

Pass all of the raw data as input, keep only the high-quality samples, and finally normalize them.

Find the hyperparameters of the preprocessing pipeline so that 60% of the raw data remains at the end. The hyperparameters are:
- length_filter_min
- length_filter_max
- word_length_filter_min
- word_length_filter_max
- symbol_filter_max_ratio
- alphabetic_filter_min_ratio
- stop_words_filter_min_count
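
Rather than adjusting thresholds purely by trial and error, a starting value can be read off the empirical distribution of a statistic. The sketch below assumes a filter of the form "keep if value >= threshold" and uses a hypothetical helper, `threshold_for_keep_fraction`, on toy documents. Because the filters interact, the combined retention still has to be measured by running the whole pipeline.

```python
import math
from typing import List


def threshold_for_keep_fraction(values: List[float], keep_fraction: float) -> float:
    """Largest cutoff t such that at least `keep_fraction` of the values are >= t."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values, reverse=True)
    n_keep = min(len(ordered), max(1, math.ceil(keep_fraction * len(ordered))))
    return ordered[n_keep - 1]


# Toy documents with known word counts (illustrative only).
toy_docs = ["کلمه " * n for n in (2, 10, 50, 75, 120, 300, 504, 900)]
lengths = [len(doc.split()) for doc in toy_docs]

min_len = threshold_for_keep_fraction(lengths, keep_fraction=0.75)
kept = sum(length >= min_len for length in lengths)
print(f"length_filter_min={min_len} keeps {kept}/{len(lengths)} documents")
```

The same idea applies to upper bounds and ratios by sorting in the other direction.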

In [53]:
total = len(text_data)

print(f'Total number of raw data: {total}')

text_data = language_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after language filtering')

text_data = length_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after length filtering')

text_data = word_length_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after mean word length filtering')

text_data = symbol_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after symbol ratio filtering')

text_data = alphabetic_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after alphabet filtering')

text_data = stop_words_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after stop word filtering')

text_data = normalize(text_data)


Total number of raw data: 52663




  0%|          | 0/52663 [00:00<?, ?it/s]

98.72775952756204% of raw data remained after language filtering


  0%|          | 0/51993 [00:00<?, ?it/s]

85.31796517479066% of raw data remained after length filtering


  0%|          | 0/44931 [00:00<?, ?it/s]

85.23441505421263% of raw data remained after mean word length filtering


  0%|          | 0/44887 [00:00<?, ?it/s]

84.84514744697415% of raw data remained after symbol ratio filtering


  0%|          | 0/44682 [00:00<?, ?it/s]

81.93798302413458% of raw data remained after alphabet filtering


  0%|          | 0/43151 [00:00<?, ?it/s]

81.65505193399541% of raw data remained after stop word filtering


  0%|          | 0/43002 [00:00<?, ?it/s]

In [55]:
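# Note: text_data is not reset here, so this pass re-filters the already
# filtered and normalized output of the previous run using the tuned thresholds.
print(f'Total number of raw data: {total}')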

text_data = language_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after language filtering')

text_data = length_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after length filtering')

text_data = word_length_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after mean word length filtering')

text_data = symbol_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after symbol ratio filtering')

text_data = alphabetic_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after alphabet filtering')

text_data = stop_words_filter(text_data)
print(f'{(len(text_data)/total)*100}% of raw data remained after stop word filtering')

Total number of raw data: 52663




  0%|          | 0/43002 [00:00<?, ?it/s]

81.6341644038509% of raw data remained after language filtering


  0%|          | 0/42991 [00:00<?, ?it/s]

77.07118850046523% of raw data remained after length filtering


  0%|          | 0/40588 [00:00<?, ?it/s]

77.01042477640848% of raw data remained after mean word length filtering


  0%|          | 0/40556 [00:00<?, ?it/s]

76.96295311698915% of raw data remained after symbol ratio filtering


  0%|          | 0/40531 [00:00<?, ?it/s]

58.027457607808145% of raw data remained after alphabet filtering


  0%|          | 0/30559 [00:00<?, ?it/s]

57.934413155346256% of raw data remained after stop word filtering


In [59]:
text_data = normalize(text_data)


  0%|          | 0/30510 [00:00<?, ?it/s]

### Save

In [60]:

with open('train.txt', 'w', encoding="utf-8") as f:
    f.write('\n'.join(text_data))

with open('raw_text.txt', 'w', encoding="utf-8") as f:
    f.write('\n'.join(orginal_text_data))

## 3. Tokenizer (10 points)

In this section, you will train tokenizers for cleaned and raw data. The training code is provided.

Finally, compare the output of the tokenizers.

### WordPiece Tokenizer

In [61]:

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer_WordPiece = Tokenizer(models.WordPiece(unk_token="[UNK]"))

tokenizer_WordPiece.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents()]
)

tokenizer_WordPiece.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)

print(tokenizer_WordPiece.pre_tokenizer.pre_tokenize_str("این یک تست است!!!"))

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer_WordPiece.train(["train.txt"], trainer=trainer)
tokenizer_WordPiece.save("tokenizer_WordPiece.json")

[('این', (0, 3)), ('یک', (4, 6)), ('تست', (7, 10)), ('است', (11, 14)), ('!', (14, 15)), ('!', (15, 16)), ('!', (16, 17))]


In [62]:

tokenizer_WordPiece2 = Tokenizer(models.WordPiece(unk_token="[UNK]"))

tokenizer_WordPiece2.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents()]
)

tokenizer_WordPiece2.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)


special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer_WordPiece2.train(["raw_text.txt"], trainer=trainer)
tokenizer_WordPiece2.save("tokenizer_WordPiece2.json")


In [63]:
encoding = tokenizer_WordPiece.encode(text_data[0])
print(encoding.tokens)
encoding = tokenizer_WordPiece2.encode(text_data[0])
print(encoding.tokens)

['قیمت', 'دوربین', 'مراقبت', 'بچه', 'بیسیم', 'با', 'برد', '۲۰۰', 'متر', 'و', 'قابلیت', 'مکالمه', '۲', 'طرفه', 'مدل', '۶۰', '##۱', 'فقط', '۴۳', '##۵۰', '##۰۰', 'تومان', '!', '!', '!', 'خانه', '##مر', '##اق', '##بتی', 'و', 'امنیتی', '##دور', '##بین', '##های', 'مداربسته', 'و', 'نظارتی', '##دور', '##بین', 'مراقبت', 'بچه', 'بیسیم', 'با', 'برد', '۲۰۰', 'متر', 'و', 'قابلیت', 'مکالمه', '۲', 'طرفه', 'مدل', '۶۰', '##۱', 'کد', 'کالا', ':', '۳۲', '##۵۱', '##۴', 'ست', 'کامل', 'دوربین', 'و', 'مانیتور', 'وایرلس', 'با', 'برد', '۲۰۰', 'متر', 'با', 'ارتباط', 'صوتی', 'و', 'مکالمه', 'دوطرفه', '،', '۸', 'لامپ', 'IR', 'دید', 'در', 'شب', 'با', 'صفحه', 'نمایش', '۲', 'اینچی', 'دوربین', 'مراقبت', 'بچه', 'چیست', '؟', 'دستگاهی', 'شامل', 'یک', 'مانیتور', 'یا', 'تلویزیون', 'و', 'یک', 'دوربین', 'می\u200cباشد', 'که', 'توانایی', 'پخش', 'تصاویر', 'ویدیویی', 'دوربین', 'بصورت', 'زنده', 'و', 'لایو', 'را', 'دارد', '.', 'دوربین', 'مراقبت', 'از', 'کودک', 'و', 'مانیتور', 'بی\u200cسیم', 'یکی', 'از', 'مناسبت', '##رین', 'وسیله',

### BPE Tokenizer

In [64]:

tokenizer_BPE = Tokenizer(models.BPE())
tokenizer_BPE.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
print(tokenizer_BPE.pre_tokenizer.pre_tokenize_str("این یک تست است!!!"))

trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer_BPE.train(["train.txt"], trainer=trainer)

tokenizer_BPE.save("tokenizer_BPE.json")


[('Ø§ÛĮÙĨ', (0, 3)), ('ĠÛĮÚ©', (3, 6)), ('ĠØªØ³Øª', (6, 10)), ('ĠØ§Ø³Øª', (10, 14)), ('!!!', (14, 17))]


In [65]:

tokenizer_BPE2 = Tokenizer(models.BPE())
tokenizer_BPE2.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer_BPE2.train(["raw_text.txt"], trainer=trainer)

tokenizer_BPE2.save("tokenizer_BPE2.json")


In [66]:
encoding = tokenizer_BPE.encode(text_data[0])
print(encoding.tokens)
encoding = tokenizer_BPE2.encode(text_data[0])
print(encoding.tokens)

['ÙĤÛĮÙħØª', 'ĠØ¯ÙĪØ±Ø¨ÛĮÙĨ', 'ĠÙħØ±Ø§ÙĤØ¨Øª', 'ĠØ¨ÚĨÙĩ', 'ĠØ¨ÛĮØ³ÛĮÙħ', 'ĠØ¨Ø§', 'ĠØ¨Ø±Ø¯', 'ĠÛ²Û°Û°', 'ĠÙħØªØ±', 'ĠÙĪ', 'ĠÙĤØ§Ø¨ÙĦÛĮØª', 'ĠÙħÚ©Ø§ÙĦÙħÙĩ', 'ĠÛ²', 'ĠØ·Ø±ÙģÙĩ', 'ĠÙħØ¯ÙĦ', 'ĠÛ¶Û°', 'Û±', 'ĠÙģÙĤØ·', 'ĠÛ´', 'Û³Ûµ', 'Û°Û°Û°', 'ĠØªÙĪÙħØ§ÙĨ', '!!!', 'Ġ', 'Ċ', 'Ø®Ø§ÙĨÙĩ', 'ÙħØ±', 'Ø§ÙĤ', 'Ø¨ØªÛĮ', 'ĠÙĪ', 'ĠØ§ÙħÙĨÛĮØª', 'ÛĮØ¯', 'ÙĪØ±', 'Ø¨ÛĮÙĨ', 'ÙĩØ§ÛĮ', 'ĠÙħØ¯Ø§Ø±Ø¨Ø³ØªÙĩ', 'ĠÙĪ', 'ĠÙĨØ¸Ø§Ø±Øª', 'ÛĮØ¯', 'ÙĪØ±', 'Ø¨ÛĮÙĨ', 'ĠÙħØ±Ø§ÙĤØ¨Øª', 'ĠØ¨ÚĨÙĩ', 'ĠØ¨ÛĮØ³ÛĮÙħ', 'ĠØ¨Ø§', 'ĠØ¨Ø±Ø¯', 'ĠÛ²Û°Û°', 'ĠÙħØªØ±', 'ĠÙĪ', 'ĠÙĤØ§Ø¨ÙĦÛĮØª', 'ĠÙħÚ©Ø§ÙĦÙħÙĩ', 'ĠÛ²', 'ĠØ·Ø±ÙģÙĩ', 'ĠÙħØ¯ÙĦ', 'ĠÛ¶Û°', 'Û±', 'Ċ', 'Ú©Ø¯', 'ĠÚ©Ø§ÙĦØ§', ':', 'ĠÛ³Û²', 'Ûµ', 'Û±Û´', 'Ċ', 'Ø³Øª', 'ĠÚ©Ø§ÙħÙĦ', 'ĠØ¯ÙĪØ±Ø¨ÛĮÙĨ', 'ĠÙĪ', 'ĠÙħØ§ÙĨÛĮØªÙĪØ±', 'ĠÙĪØ§ÛĮØ±ÙĦØ³', 'ĠØ¨Ø§', 'ĠØ¨Ø±Ø¯', 'ĠÛ²Û°Û°', 'ĠÙħØªØ±', 'ĠØ¨Ø§', 'ĠØ§Ø±ØªØ¨Ø§Ø·', 'ĠØµÙĪØªÛĮ', 'ĠÙĪ', 'ĠÙħÚ©Ø§ÙĦÙħÙĩ', 'ĠØ¯ÙĪØ·Ø±ÙģÙĩ', 'ØĮ', 'ĠÛ¸', 'ĠÙĦØ§ÙħÙ¾', 'ĠIR', 'ĠØ¯ÛĮØ¯', 'ĠØ¯Ø±', 'ĠØ´Ø¨', 'ĠØ¨Ø§', 'ĠØµÙģØŃÙĩ', 'ĠÙĨÙħØ§ÛĮØ´', 'ĠÛ²', 'ĠØ§ÛĮÙĨÚĨÛĮ', 'Ċ', 'Ø¯ÙĪ

### Unigram Tokenizer

In [67]:

tokenizer_unigram = Tokenizer(models.Unigram())

tokenizer_unigram.normalizer = normalizers.Sequence(
    [
        normalizers.NFKD(),
        normalizers.StripAccents()
    ]
)

tokenizer_unigram.pre_tokenizer = pre_tokenizers.Metaspace()

tokenizer_unigram.pre_tokenizer.pre_tokenize_str("این یک تست است!!!")

special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer_unigram.train(["train.txt"], trainer=trainer)


tokenizer_unigram.save("tokenizer_unigram.json")


In [68]:

tokenizer_unigram2 = Tokenizer(models.Unigram())

tokenizer_unigram2.normalizer = normalizers.Sequence(
    [
        normalizers.NFKD(),
        normalizers.StripAccents()
    ]
)

tokenizer_unigram2.pre_tokenizer = pre_tokenizers.Metaspace()

special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer_unigram2.train(["raw_text.txt"], trainer=trainer)


tokenizer_unigram2.save("tokenizer_unigram2.json")


In [69]:
encoding = tokenizer_unigram.encode(text_data[0])
print(encoding.tokens)
encoding = tokenizer_unigram2.encode(text_data[0])
print(encoding.tokens)

['▁قیمت', '▁دوربین', '▁مراقبت', '▁بچه', '▁بیسیم', '▁با', '▁برد', '▁۲۰۰', '▁متر', '▁و', '▁قابلیت', '▁مکالمه', '▁۲', '▁طرفه', '▁مدل', '▁۶', '۰۱', '▁فقط', '▁', '۴۳۵', '۰۰۰', '▁تومان', '!!!', '▁', '\n', 'خانه', 'مراقبت', 'ی', '▁و', '▁امنیتی', 'دوربین', 'های', '▁مداربسته', '▁و', '▁نظارتی', 'دوربین', '▁مراقبت', '▁بچه', '▁بیسیم', '▁با', '▁برد', '▁۲۰۰', '▁متر', '▁و', '▁قابلیت', '▁مکالمه', '▁۲', '▁طرفه', '▁مدل', '▁۶', '۰۱', '\n', 'ک', 'د', '▁کالا', ':', '▁۳', '۲۵', '۱۴', '\n', 'ست', '▁کامل', '▁دوربین', '▁و', '▁مانیتور', '▁وایرلس', '▁با', '▁برد', '▁۲۰۰', '▁متر', '▁با', '▁ارتباط', '▁صوتی', '▁و', '▁مکالمه', '▁دوطرفه', '،', '▁۸', '▁لامپ', '▁IR', '▁دید', '▁در', '▁شب', '▁با', '▁صفحه', '▁نمایش', '▁۲', '▁اینچی', '\n', 'دوربین', '▁مراقبت', '▁بچه', '▁چیست؟', '▁دستگاه', 'ی', '▁شامل', '▁یک', '▁مانیتور', '▁یا', '▁تلویزیون', '▁و', '▁یک', '▁دوربین', '▁می\u200cباشد', '▁که', '▁توانایی', '▁پخش', '▁تصاویر', '▁ویدیویی', '▁دوربین', '▁بصورت', '▁زنده', '▁و', '▁لایو', '▁را', '▁دارد.', '▁', '\n', 'دوربین', '▁مراقبت', '

### Analyze

In this part, the WordPiece and Unigram tokenizers are applied to the cleaned data, and the number of generated tokens and the size of their intersection are displayed.

According to this section, answer the following questions.

<font color='green'>

1- Compare the output of tokenizers.

2- Why is the BPE output unreadable?

3- Analyze the intersection between WordPiece tokens and Unigram tokens.

</font>

In [71]:
unigram_decoding = []
wordpiece_decoding = []

for i in tqdm(text_data):
    wordpiece_decoding.append(tokenizer_WordPiece.encode(i).tokens)
    unigram_decoding.append(tokenizer_unigram.encode(i).tokens)


  0%|          | 0/30510 [00:00<?, ?it/s]

In [72]:
import itertools

print(f'Number of tokens generated by unigram: {len(list(itertools.chain(*unigram_decoding)))}')
print(f'Number of tokens generated by wordpiece: {len(list(itertools.chain(*wordpiece_decoding)))}')

print(f'Number of unique tokens generated by unigram: {len(set(itertools.chain(*unigram_decoding)))}')
print(f'Number of unique tokens generated by wordpiece: {len(set(itertools.chain(*wordpiece_decoding)))}')

print(f'Number of intersection tokens: {len(list(set(itertools.chain(*unigram_decoding)) & set(itertools.chain(*wordpiece_decoding))))}')


Number of tokens generated by unigram: 23172876
Number of tokens generated by wordpiece: 22748283
Number of unique tokens generated by unigram: 24989
Number of unique tokens generated by wordpiece: 24380
Number of intersection tokens: 3270



# 1. Compare the output of tokenizers.

Here’s a comparison of the three tokenization models we've trained:

**WordPiece:** This model is used by BERT. It's a "greedy" algorithm that starts with an alphabet of single characters and iteratively merges them to build a vocabulary that best represents the training data. A key feature is its use of a special prefix (like ##) to mark subwords that are not at the beginning of a word (e.g., tokenization -> ['token', '##ization']).
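
To make the `##` convention concrete, here is a minimal sketch of the encoding side only, using the usual greedy longest-match-first rule. The vocabulary is a hypothetical toy set, and the function `wordpiece_encode` is an illustration rather than the library's implementation; how the vocabulary is learned during training is not shown.

```python
from typing import List, Set


def wordpiece_encode(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
toy_vocab = {"token", "##ization", "##ize", "##s", "t", "##o"}
print(wordpiece_encode("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_encode("xyz", toy_vocab))           # ['[UNK]']
```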

**BPE (Byte-Pair Encoding):** This model is used by the GPT family. It's also a greedy merging algorithm, but it simply merges the most frequently occurring pair of tokens. The version we used with ByteLevel pre-tokenization works on raw bytes rather than characters, so text in any language can be represented without an "unknown" token, provided every byte value that appears is part of the vocabulary.
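
The merge loop can be sketched on a toy word-frequency table. This is a character-level illustration under simplifying assumptions: the function names are hypothetical, ties are broken lexicographically just to keep the result deterministic, and the library's byte-level trainer differs in its details.

```python
from collections import Counter
from typing import Dict, List, Tuple

Corpus = Dict[Tuple[str, ...], int]


def most_frequent_pair(corpus: Corpus) -> Tuple[str, str]:
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken lexicographically (an arbitrary choice for this sketch).
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(corpus: Corpus, pair: Tuple[str, str]) -> Corpus:
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged: Corpus = {}
    for symbols, freq in corpus.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules starting from single characters."""
    corpus: Corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in corpus):
            break  # nothing left to merge
        pair = most_frequent_pair(corpus)
        merges.append(pair)
        corpus = merge_pair(corpus, pair)
    return merges


merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w')]
```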

**Unigram:** This model is used by T5 and is the default algorithm of the SentencePiece library. It's fundamentally different because it's probabilistic. Instead of building up, it starts with a very large vocabulary and progressively removes (prunes) the tokens whose removal hurts the likelihood of the training data the least. Because every token has a probability, the model can score multiple valid tokenizations of the same text, and sampling among them during training can make a language model more robust. It is often paired with a Metaspace pre-tokenizer, which replaces spaces with a special character (▁) so that the tokenization is reversible.
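
At encoding time, a Unigram model picks the segmentation with the highest total probability. The sketch below shows that search as a small Viterbi-style dynamic program. The function `unigram_segment` and the token probabilities are hypothetical values made up for the example; training and pruning are not shown.

```python
import math
from typing import Dict, List


def unigram_segment(text: str, log_probs: Dict[str, float]) -> List[str]:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any segmentation of text[:i]
    best_start = [0] * (n + 1)          # start index of the last piece in that segmentation
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk back from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical token probabilities.
toy_log_probs = {
    "▁token": math.log(0.05),
    "ization": math.log(0.02),
    "▁to": math.log(0.06),
    "ken": math.log(0.01),
    "iz": math.log(0.01),
    "ation": math.log(0.03),
}
print(unigram_segment("▁tokenization", toy_log_probs))  # ['▁token', 'ization']
```

Other segmentations such as `['▁to', 'ken', 'ization']` are also valid under this vocabulary; they simply score lower.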

# 2. Why is the BPE output unreadable?

The BPE output we saw earlier looks unreadable because the ByteLevel pre-tokenizer does not work with characters; it works with the raw bytes that represent those characters.

The "garbled" text (like Ø§ÛĮÙĨ) appears because each byte of the UTF-8 encoding is displayed as its own printable stand-in character. A Persian letter takes two bytes in UTF-8, so every letter shows up as two unrelated-looking symbols, and a space shows up as Ġ.

So, it's not an error, but rather a display artifact of looking "under the hood." The tokenizer is working correctly on a fundamental byte level, which is what allows it to be so universally applicable. The final merged tokens it learns are combinations of these bytes that represent meaningful Persian subwords.
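
The mapping can be reproduced in a few lines. This sketch assumes the ByteLevel pre-tokenizer uses the same byte-to-character table as GPT-2, which is consistent with the output shown earlier; the helper names `to_visible` and `from_visible` are hypothetical.

```python
def bytes_to_unicode() -> dict:
    """Map every byte value 0..255 to a printable character (GPT-2 style table)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Bytes without a printable form are shifted to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(text: str) -> str:
    """Encode text as UTF-8 and show each byte as its stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_visible(token: str) -> str:
    """Invert the mapping: stand-in characters back to bytes, then decode as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


visible = to_visible("این")
print(visible)                # Ø§ÛĮÙĨ
print(from_visible(visible))  # این
```

Because the table is a bijection, decoding the tokens recovers the original Persian text exactly.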

# 3. Analyze the intersection between WordPiece tokens and Unigram tokens.
This is the most interesting finding from our output.

* Unique WordPiece Tokens: 24,380
* Unique Unigram Tokens: 24,989
* Intersection (Shared Tokens): 3,270

The intersection between the two sets of produced tokens is very small: 3,270 shared tokens, or about 13% of either set. Even though both tokenizers were trained on the same cleaned data with the same target vocabulary size (25,000), they arrived at very different sets of subword tokens.

There are two primary reasons for this:

**Different Algorithms:** WordPiece and Unigram learn in fundamentally different ways. WordPiece's greedy merging process (building up) and Unigram's probabilistic pruning process (cutting down) are biased toward finding different kinds of subwords.

**Different Pre-tokenization and Formatting:** This is the biggest factor. WordPiece's tokens look like ['token', '##ization'], while Unigram's look like ['▁token', 'ization'] because of Metaspace. WordPiece marks word-internal pieces with `##`, whereas Unigram marks pieces that follow a space with `▁`, so "##ization" never matches "ization" and "token" never matches "▁token". Only strings that carry no marker in either scheme can be shared, such as punctuation, digits, and pieces that start a word for WordPiece but do not follow a space for Unigram (for example, the first word after a newline).
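
One way to check how much of the gap comes from formatting alone is to strip the position markers before intersecting. The sketch below uses small hand-written token sets and a hypothetical helper, `strip_markers`. Note that stripping discards word-position information, so the result measures overlap in surface strings only.

```python
from typing import Iterable, Set


def strip_markers(tokens: Iterable[str]) -> Set[str]:
    """Remove the WordPiece '##' prefix and the Metaspace '▁' prefix from each token."""
    surface = set()
    for tok in tokens:
        if tok.startswith("##"):
            tok = tok[2:]
        elif tok.startswith("▁"):
            tok = tok[1:]
        if tok:  # skip tokens that were only a marker
            surface.add(tok)
    return surface


# Hand-written toy token sets (illustrative only).
wordpiece_tokens = {"token", "##ization", "دوربین", "##های", "!"}
unigram_tokens = {"▁token", "ization", "▁دوربین", "های", "!!!"}

raw_overlap = wordpiece_tokens & unigram_tokens
surface_overlap = strip_markers(wordpiece_tokens) & strip_markers(unigram_tokens)
print(len(raw_overlap), len(surface_overlap))  # 0 4
```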

**Conclusion:** This result powerfully demonstrates that there is no single "correct" vocabulary for a language. The subwords a model learns are a direct artifact of the tokenization algorithm used. This is why you cannot easily swap a tokenizer from one model (like BERT) to another (like T5) without retraining.

