# Description

- This notebook explores tokenization for the Vietnamese language in detail, comparing two approaches: Byte Pair Encoding (BPE) and word-based (space-based) tokenization.

- Comparison criteria:
    - Number of tokens in the vocabulary.
    - Ability to handle out-of-vocabulary (OOV) words.
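
As a rough illustration of the two criteria above, here is a minimal sketch that measures vocabulary size and the share of unseen tokens for a whitespace vocabulary. The helper names (`build_word_vocab`, `oov_rate`) and the toy sentences are hypothetical and are not part of this notebook's `utils` package.

```python
def build_word_vocab(sentences):
    """Return the set of whitespace-separated tokens seen in the sentences."""
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence.split())
    return vocab


def oov_rate(vocab, sentences):
    """Fraction of whitespace-separated tokens that are missing from vocab."""
    total = 0
    missing = 0
    for sentence in sentences:
        for token in sentence.split():
            total += 1
            if token not in vocab:
                missing += 1
    return missing / total if total else 0.0


# Toy data (hypothetical, not taken from the notebook's corpus)
toy_train = ["a b c d .", "e f g h"]
toy_test = ["e f b", "zzz c d"]

toy_vocab = build_word_vocab(toy_train)
print(len(toy_vocab))                 # criterion 1: vocabulary size
print(oov_rate(toy_vocab, toy_test))  # criterion 2: share of unseen tokens
```

The same two numbers could be computed on the real train and test files once they are loaded.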


**CONCLUSION**:
- The space-based tokenizer produces a much larger vocabulary (345_765 tokens) than BPE (42_473 tokens). Furthermore, the space-based tokenizer puts some WEIRD tokens into the vocab, such as merged words and words with punctuation attached.

=> Byte Pair Encoding is much better.

In [1]:
import os
import sys
sys.path.append('../')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try: tf.config.experimental.set_memory_growth(gpus[0], True)
    except RuntimeError as e:   print(e)

import multiprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
import string
import nltk

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset
import tensorflow_text as tf_text

from utils.read_file_utils import *
from utils.tokenizer_utils import *



In [2]:
PATH_EN_FILE_TRAIN = r"../data/processed_data/en_sent_train.txt"
PATH_VI_FILE_TRAIN = r"../data/processed_data/vi_sent_train.txt"

PATH_EN_FILE_TEST = r"../data/processed_data/en_sent_test.txt"
PATH_VI_FILE_TEST = r"../data/processed_data/vi_sent_test.txt"

PATH_FOLDER_VOCAB = r"../data/vocab"

FILE_NAME_BPE_TOKENIZER = "vi_bpe_tokenizer.txt"
FILE_NAME_SPACE_TOKENIZER = "vi_space_tokenizer.txt"

# 1. Read file

In [3]:
list_en_sentence_train = read_text_file(PATH_EN_FILE_TRAIN)
list_vi_sentence_train = read_text_file(PATH_VI_FILE_TRAIN)

assert len(list_en_sentence_train) == len(list_vi_sentence_train)
print(f"Number of sentence pairs: {len(list_en_sentence_train)}")

Number of sentence pairs: 2408732


In [4]:
train_en = tf.data.Dataset.from_tensor_slices(list_en_sentence_train)
train_vi = tf.data.Dataset.from_tensor_slices(list_vi_sentence_train)

In [5]:
for en, vi in zip(train_en, train_vi):
    print("English:   ", en.numpy().decode('utf-8'))
    print("Vietnamese:   ", vi.numpy().decode('utf-8'))
    break

English:    is only the beginning .
Vietnamese:    chỉ mới bắt đầu thôi .


# 2. Generate Vietnamese Vocabulary

In this section, we explore and compare two types of tokenization techniques: BPE and word-based.

## 2.1. Byte Pair Encoding 

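a.k.a. BertTokenizer. Strictly speaking, `bert_vocab_from_dataset` learns a WordPiece vocabulary, a subword method closely related to BPE; this notebook refers to it as BPE throughout.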
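
To make the idea behind subword vocabularies concrete, here is a minimal sketch of classic BPE merge learning on a toy word list. It is a simplified illustration under the assumption of character-level starting symbols; it is not the algorithm that `bert_vocab_from_dataset` runs internally, and the function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Represent each distinct word as a tuple of symbols with its frequency
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (first seen wins on ties)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
toy_merges = learn_bpe_merges(toy_words, num_merges=3)
print(toy_merges)
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the vocabulary size, which is why a subword vocabulary can stay far smaller than a word-level one.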

### 2.1.1. Build and save tokenizer using BPE

In [None]:
# bert_tokenizer_params=dict(lower_case=True)
bert_tokenizer_params = dict()
RESERVED_TOKENS = ["[PAD]", "[UNK]", "[START]", "[END]"]
VOCAB_SIZE = 100_000  # upper bound on the number of tokens in the vocab

bert_vocab_args = dict(
    vocab_size=VOCAB_SIZE,
    reserved_tokens=RESERVED_TOKENS,  # Reserved tokens that must be included in the vocabulary
    bert_tokenizer_params=bert_tokenizer_params,
    learn_params={},
)

In [16]:
%%time
vi_vocab = bert_vocab_from_dataset.bert_vocab_from_dataset(
    train_vi.batch(1000).prefetch(tf.data.AUTOTUNE),
    **bert_vocab_args
)

CPU times: user 6min 24s, sys: 2.21 s, total: 6min 26s
Wall time: 5min 21s


In [24]:
print(vi_vocab[:10])
print(vi_vocab[1000:1010])
print(vi_vocab[-10:])

['[PAD]', '[UNK]', '[START]', '[END]', '!', '"', '#', '$', '%', '&']
['phải', 'năm', 'đến', 'sự', 'cô', 'về', 'lại', 'việc', 'nói', 'từ']
['##한', '##해', '##현', '##화', '##️', '##＋', '##，', '##￼', '##�', '##𒀭']


In [28]:
print(f"Number of tokens in the Vietnamese vocab: {len(vi_vocab)}")
print()
write_vocab_file(FILE_NAME_BPE_TOKENIZER, vi_vocab)
print(f"[INFO] Write Vietnamese vocab to file: {FILE_NAME_BPE_TOKENIZER}")

Number of tokens in the Vietnamese vocab: 42473

[INFO] Write Vietnamese vocab to file: vi_bpe_tokenizer.txt


- Test the BPE Tokenizer

In [None]:
vi_tokenizer = tf_text.BertTokenizer(FILE_NAME_BPE_TOKENIZER)

vi_test = 'thành phố hồ chí minh ngập nước'
vi_token_idx = vi_tokenizer.tokenize(vi_test)
vi_token_idx = vi_token_idx.merge_dims(-2, -1)
print(f"Vietnamese token index: {vi_token_idx}")

Vietnamese token index: <tf.RaggedTensor [[1026, 1433, 1641, 1420, 1406, 2861, 1067]]>


In [46]:
# Lookup each token id in the vocabulary.
# txt_tokens = tf.gather(vi_vocab, vi_token_idx)
txt_tokens = vi_tokenizer.detokenize(vi_token_idx)
print(f"Output of detokenizer: {txt_tokens}")

# Join with spaces.
original_str = tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1).numpy()[0].decode('utf-8')
original_str

Output of detokenizer: <tf.RaggedTensor [[b'th\xc3\xa0nh', b'ph\xe1\xbb\x91', b'h\xe1\xbb\x93', b'ch\xc3\xad',
  b'minh', b'ng\xe1\xba\xadp', b'n\xc6\xb0\xe1\xbb\x9bc']]>


'thành phố hồ chí minh ngập nước'
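
The tokenize step above maps each word to one or more vocabulary pieces. A minimal pure-Python sketch of the greedy longest-match-first strategy used by WordPiece-style tokenizers is shown below. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical; the real `tf_text.BertTokenizer` also performs basic text cleanup and punctuation splitting that this sketch leaves out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the one learned in this notebook
toy_subword_vocab = {"[UNK]", "minh", "l", "##inh", "li", "##nh"}
print(wordpiece_tokenize("minh", toy_subword_vocab))
print(wordpiece_tokenize("linh", toy_subword_vocab))
print(wordpiece_tokenize("xyz", toy_subword_vocab))
```

A word falls back to `[UNK]` only when some position cannot be covered by any piece, which is rare when the vocabulary contains single characters.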

### 2.1.2. Test the tokenizer with weird (rare) words

In [None]:
list_weird_vietnamese = ['đa đoan', 'đìu hiu', 'lom dom', 'đượm', 'phiêu linh', 'chắp bút', 'giấu giếm', 'điểm xuyết', \
                        'hàm súc', 'khẳng khái', 'xoay xở', 'súc tích', "khánh kiệt", "trầm mặc", "lửng lơ", "trắc ẩn"]

for weird_vi in list_weird_vietnamese:
    weird_vi_token_idx = vi_tokenizer.tokenize(weird_vi)
    weird_vi_token_idx = weird_vi_token_idx.merge_dims(-2, -1)
    weird_vi_txt_tokens = vi_tokenizer.detokenize(weird_vi_token_idx)
    weird_vi_original_str = tf.strings.reduce_join(weird_vi_txt_tokens, separator=' ', axis=-1).numpy()[0].decode('utf-8')
    print(f"Original: {weird_vi} -> Tokenized: {weird_vi_original_str}")

Original: đa đoan -> Tokenized: đa đoan
Original: đìu hiu -> Tokenized: đìu hiu
Original: lom dom -> Tokenized: lom dom
Original: đượm -> Tokenized: đượm
Original: phiêu linh -> Tokenized: phiêu linh
Original: chắp bút -> Tokenized: chắp bút
Original: giấu giếm -> Tokenized: giấu giếm
Original: điểm xuyết -> Tokenized: điểm xuyết
Original: hàm súc -> Tokenized: hàm súc
Original: khẳng khái -> Tokenized: khẳng khái
Original: xoay xở -> Tokenized: xoay xở
Original: súc tích -> Tokenized: súc tích


<font color='red'>NOTE</font>: BPE handles these rare words well: every word shown above is reconstructed exactly, with no `[UNK]` token.

## 2.2. Space-based Tokenizer

In [6]:
vi_vocab_space = ["[PAD]", "[UNK]", "[START]", "[END]"]

In [7]:
def decode_and_split_text(text):
    text = text.decode('utf-8')
    text = text.split()
    return text

# def build_space_based_vocab(train_vi):
#     vi_vocab_space = ["[PAD]", "[UNK]", "[START]", "[END]"]
    
#     for idx_sample, vi_sample in enumerate(train_vi):
#         vi_sample = vi_sample.numpy().decode('utf-8')
#         vi_text = split_text(vi_sample)
#         vi_vocab_space.extend(vi_text)

#         if idx_sample > 50_000:
#             break   
#     vi_vocab_space = set(vi_vocab_space)
#     return list(vi_vocab_space)


def build_space_based_vocab_parallel(train_vi):
    """
    Build the space-based Vietnamese vocab from the train_vi dataset.
    """

    # 1. Collect the raw byte strings to pass to the worker processes
    input_arg = [vi_sample.numpy() for vi_sample in train_vi]

    # 2. Decode and split text in parallel
    with multiprocessing.Pool(8) as pool:
        total_vi_vocab_space = pool.map(decode_and_split_text, input_arg)

    # 3. Build vocab space
    vi_vocab_space = ["[PAD]", "[UNK]", "[START]", "[END]"]
    for vi_text in total_vi_vocab_space:
        vi_vocab_space.extend(vi_text)
    # Remove duplicate tokens. Note: a set has no defined order, so the
    # reserved tokens do not stay at the front of the returned list.
    vi_vocab_space = set(vi_vocab_space)

    return list(vi_vocab_space)
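
Converting a `set` to a `list` yields an arbitrary order, so the reserved tokens lose their fixed positions. One possible alternative, sketched here under the assumption that plain Python strings are available, keeps the reserved tokens first and sorts the rest by frequency. The names `build_ordered_space_vocab`, `RESERVED` and `min_count` are hypothetical and not part of this notebook.

```python
from collections import Counter

RESERVED = ["[PAD]", "[UNK]", "[START]", "[END]"]


def build_ordered_space_vocab(sentences, reserved_tokens=RESERVED, min_count=1):
    """Whitespace vocab with reserved tokens first, then words by frequency."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    # Most frequent first; ties broken alphabetically for reproducibility
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    reserved = set(reserved_tokens)
    words = [tok for tok, c in ordered if c >= min_count and tok not in reserved]
    return list(reserved_tokens) + words


toy_sentences = ["a b a", "b c a"]
print(build_ordered_space_vocab(toy_sentences))
print(build_ordered_space_vocab(toy_sentences, min_count=2))
```

A `min_count` threshold is also a cheap way to drop one-off tokens, which would shrink a word-level vocabulary at the cost of mapping more words to `[UNK]`.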

In [8]:
%%time
# vi_vocab_space = build_space_based_vocab(train_vi)
vi_vocab_space = build_space_based_vocab_parallel(train_vi)

print(f"Number of tokens in the Vietnamese vocab: {len(vi_vocab_space)}")
print()
print(vi_vocab_space[:10])
print(vi_vocab_space[1000:1010])
print(vi_vocab_space[-10:])

Number of tokens in the Vietnamese vocab: 345765

['vựccác', 'maclean', 'marielagriffor', 'armour)', 'cácđập', '2013-2017:', '(坊主めくり)', 'sewell', 'antechamber', 'bulông-4']
['léc', 'nolfox', '2017một', 'evgenii', 'appel', 'phuthi', 'sunbul', '(홍익대학교)', 'preparen', '"slide"']
['tinlà', 'bokura', 'kaliningrad"', '(enac)', 'beta-secretase', 'brookahven', 'ਰਾਖਾ"', 'yeigo', 'littoral', 'ancestral']
CPU times: user 49.3 s, sys: 3.36 s, total: 52.7 s
Wall time: 48 s


In [9]:
write_vocab_file(FILE_NAME_SPACE_TOKENIZER, vi_vocab_space)
print(f"[INFO] Write Vietnamese vocab to file: {FILE_NAME_SPACE_TOKENIZER}")

[INFO] Write Vietnamese vocab to file: vi_space_tokenizer.txt


- Test the space-based tokenizer

In [16]:
vi_tokenizer_space = tf_text.BertTokenizer(FILE_NAME_SPACE_TOKENIZER)

# vi_test = 'hoa phượng đỏ là tuổi tôi mười tám, thầm lặng ai hay mối tình đầu'
vi_test = 'haluliii là mối tình đầu'
vi_token_idx = vi_tokenizer_space.tokenize(vi_test)
vi_token_idx = vi_token_idx.merge_dims(-2, -1)
print(f"Vietnamese token index: {vi_token_idx}")

Vietnamese token index: <tf.RaggedTensor [[307926, 194886, 216521, 320506, 5014]]>


In [17]:
# Lookup each token id in the vocabulary.
txt_tokens = vi_tokenizer_space.detokenize(vi_token_idx)
print(f"Output of detokenizer: {txt_tokens}")

# Join with spaces.
original_str = tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1).numpy()[0].decode('utf-8')
original_str

Output of detokenizer: <tf.RaggedTensor [[b'[UNK]', b'l\xc3\xa0', b'm\xe1\xbb\x91i', b't\xc3\xacnh',
  b'\xc4\x91\xe1\xba\xa7u']]>


'[UNK] là mối tình đầu'

In [14]:
list_weird_vietnamese = ['đa đoan', 'đìu hiu', 'lom dom', 'đượm', 'phiêu linh', 'chắp bút', 'giấu giếm', 'điểm xuyết', \
                        'hàm súc', 'khẳng khái', 'xoay xở', 'súc tích', "khánh kiệt", "trầm mặc", "lửng lơ", "trắc ẩn"]

for weird_vi in list_weird_vietnamese:
    weird_vi_token_idx = vi_tokenizer_space.tokenize(weird_vi)
    weird_vi_token_idx = weird_vi_token_idx.merge_dims(-2, -1)
    weird_vi_txt_tokens = vi_tokenizer_space.detokenize(weird_vi_token_idx)
    weird_vi_original_str = tf.strings.reduce_join(weird_vi_txt_tokens, separator=' ', axis=-1).numpy()[0].decode('utf-8')
    print(f"Original: {weird_vi} -> Tokenized: {weird_vi_original_str}")

Original: đa đoan -> Tokenized: đa đoan
Original: đìu hiu -> Tokenized: đìu hiu
Original: lom dom -> Tokenized: lom dom
Original: đượm -> Tokenized: đượm
Original: phiêu linh -> Tokenized: phiêu linh
Original: chắp bút -> Tokenized: chắp bút
Original: giấu giếm -> Tokenized: giấu giếm
Original: điểm xuyết -> Tokenized: điểm xuyết
Original: hàm súc -> Tokenized: hàm súc
Original: khẳng khái -> Tokenized: khẳng khái
Original: xoay xở -> Tokenized: xoay xở


Original: súc tích -> Tokenized: súc tích
Original: khánh kiệt -> Tokenized: khánh kiệt
Original: trầm mặc -> Tokenized: trầm mặc
Original: lửng lơ -> Tokenized: lửng lơ
Original: trắc ẩn -> Tokenized: trắc ẩn


<font color='red'>NOTE</font>: 
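- The space-based tokenizer creates far more tokens in the vocabulary (345_765) than BPE (42_473).
- Furthermore, it creates some WEIRD tokens in the vocab, for example merged words (`vựccác`, `2017một`) and words with punctuation attached (`armour)`, `"slide"`).
- An unseen word such as `haluliii` is mapped to `[UNK]`.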
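
To get a rough count of such weird tokens, a simple heuristic can flag entries that contain punctuation, digits, or non-Latin letters. The sketch below is only an assumption about what counts as weird, and the function name `looks_suspicious` is hypothetical. It cannot detect merged words such as `vựccác`, which would need a dictionary or a word segmenter.

```python
import string
import unicodedata


def looks_suspicious(token):
    """Heuristic: flag tokens with punctuation, digits, or non-Latin letters."""
    has_punct = any(ch in string.punctuation for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    # Vietnamese letters have Unicode names that start with "LATIN"
    non_latin = any(
        ch.isalpha() and "LATIN" not in unicodedata.name(ch, "")
        for ch in token
    )
    return has_punct or has_digit or non_latin


# A few tokens copied from the vocab printout above
sample_tokens = ["armour)", "2013-2017:", "sewell", "(홍익대학교)", "bulông-4", "littoral"]
flagged = [tok for tok in sample_tokens if looks_suspicious(tok)]
print(flagged)
```

Running the same filter over the full space-based vocab would give a rough estimate of how much of it is noise.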