# Subword Tokenization

In this exercise, we will learn how to train our own subword tokenizers with two algorithms: BPE and Unigram. We will use `sentencepiece`, a library from Google, to create our tokenizers.

## Ref:
https://github.com/google/sentencepiece/blob/master/python

## Setup

In [1]:
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/kratoo-40000000-40002000.jsonl

--2025-01-19 15:01:58--  https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt [following]
--2025-01-19 15:01:59--  https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3231076 (3.1M) [application/octet-stream]
Saving to: ‘pra-apai-manee-ch1-50.txt.4’


2025-01-19 15:01:59 (40.7 MB/s) - ‘pra-apai-manee-ch1-50.txt.4’ saved [3231076/3231076]

--2025-01-1

## Code

In [2]:
import sentencepiece as spm
import io
import json

Load data

In [3]:
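pantip_text = []
# Read the JSON Lines file explicitly as UTF-8 so the Thai text decodes the same way on every platform
with open('kratoo-40000000-40002000.jsonl', 'r', encoding='utf-8') as json_file:
    for json_str in json_file:
        result = json.loads(json_str)
        # Keep the title and the content of each post as one training document
        pantip_text.append(f"{result['title']}\n{result['content']}\n")
sum(len(t) for t in pantip_text)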

1060318

In [4]:
with open("pra-apai-manee-ch1-50.txt", encoding="utf-8") as f:
    pra_apai_manee_data = f.readlines()

In [5]:
sum([len(t) for t in pra_apai_manee_data])

1100605

In [6]:
pantip_train_text = pantip_text[:int(len(pantip_text)*0.8)]
pantip_test_text = pantip_text[int(len(pantip_text)*0.8):]

pam_train_text = pra_apai_manee_data[:int(len(pra_apai_manee_data)*0.8)] #pam = pra_apai_manee
pam_test_text = pra_apai_manee_data[int(len(pra_apai_manee_data)*0.8):]

## Run tokenizer training

The Python wrapper provides multiple APIs for training our tokenizers:

1. `spm.SentencePieceTrainer.train(input='input.txt', model_prefix='m', vocab_size=vocab_size, model_type=model_type)`
  <br> This will output the tokenizer files `m.model` and `m.vocab` that can be later loaded into `SentencePieceProcessor`.
  <br><br>
2. `spm.SentencePieceTrainer.train(sentence_iterator=iterator, model_writer=obj_with_write_method, vocab_size=vocab_size, model_type=model_type)`
  <br> This method requires a file-like object, e.g. `obj_with_write_method = io.BytesIO()`. Its advantage is that you can run sentencepiece in environments that have limited access to the local file system. You still have to save the model bytes if you want to reuse the model; otherwise you will have to train it again.
<br><br>
3.  `spm.SentencePieceTrainer.train('--input=input.txt --model_prefix=m --vocab_size=vocab_size --model_type=model_type')`
<br> Same as no. 1, but with the arguments passed as a single command-line-style string.




### Unigram tokenizer

We are going to start by training a unigram tokenizer. You can use any of the training methods above. Make sure to set `vocab_size` to 1000.

In [7]:
## Train

# Define file paths for output
pantip_train_file = "pantip_train.txt"
pantip_test_file = "pantip_test.txt"
pam_train_file = "pam_train.txt"
pam_test_file = "pam_test.txt"

# Save the datasets to .txt files
with open(pantip_train_file, "w", encoding="utf-8") as f:
    f.writelines(pantip_train_text)

with open(pantip_test_file, "w", encoding="utf-8") as f:
    f.writelines(pantip_test_text)

with open(pam_train_file, "w", encoding="utf-8") as f:
    f.writelines(pam_train_text)

with open(pam_test_file, "w", encoding="utf-8") as f:
    f.writelines(pam_test_text)

print("Files written successfully:")
print(f"Pantip Train File: {pantip_train_file}")
print(f"Pantip Test File: {pantip_test_file}")
print(f"PAM Train File: {pam_train_file}")
print(f"PAM Test File: {pam_test_file}")

Files written successfully:
Pantip Train File: pantip_train.txt
Pantip Test File: pantip_test.txt
PAM Train File: pam_train.txt
PAM Test File: pam_test.txt


In [8]:
spm.SentencePieceTrainer.train(input=pam_train_file, model_prefix="unigram_pam", vocab_size=1000, model_type="unigram")

sp_pam = spm.SentencePieceProcessor(model_file='unigram_pam.model')

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: pam_train.txt
  input_format: 
  model_prefix: unigram_pam
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  d

### Q1 MCV

How many tokens did you get when tokenizing the following sentence with your unigram tokenizer: <br>
'อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม'

In [9]:
len(sp_pam.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str))

29

### BPE Tokenizer

Now try training a BPE tokenizer.
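
As a rough intuition for what a BPE trainer does, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy corpus. It is a simplified illustration in plain Python, not SentencePiece's implementation; the function name `learn_bpe_merges`, the tie-breaking rule, and the toy word counts are assumptions made up for this example.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict (toy illustration)."""
    # Start with every word split into single-character symbols.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word count.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds one new piece to the vocabulary, so the number of merges plays the role that `vocab_size` plays in the real trainer.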

In [10]:
spm.SentencePieceTrainer.train(input=pam_train_file, model_prefix="bpe_pam", vocab_size=1000, model_type="bpe")

bpe_pam = spm.SentencePieceProcessor(model_file='bpe_pam.model')

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: pam_train.txt
  input_format: 
  model_prefix: bpe_pam
  model_type: BPE
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  different

### Q2 MCV

How many tokens did you get when tokenizing the following sentence with your BPE tokenizer: <br>
'อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม'

In [11]:
len(bpe_pam.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str))

28

These are some of your vocabs. Note that you will see "▁" (U+2581) in every type of tokenizer in SentencePiece, since it marks whitespace and makes it possible to perform detokenization (rejoining your sentences) without relying on language-specific resources.
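
To make the role of "▁" concrete, here is a minimal sketch of detokenization in plain Python. The helper name `detokenize` and the example pieces are hypothetical; the only assumption taken from this notebook is that the marker stands for a space and that a leading marker is added at the start of each sentence, as the encoded outputs show.

```python
def detokenize(pieces):
    """Rebuild text from SentencePiece-style pieces (illustrative helper)."""
    # Concatenate the pieces, then turn the whitespace marker back into spaces.
    text = "".join(pieces).replace("\u2581", " ")
    # Drop the single leading space that comes from the sentence-initial marker.
    return text[1:] if text.startswith(" ") else text

pieces = ["\u2581Hello", "\u2581wor", "ld", "."]
print(detokenize(pieces))
```

Because the whitespace is stored inside the pieces, no language-specific rule about where spaces belong is needed to undo the split.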

In [12]:
unigram_vocabs = [sp_pam.id_to_piece(id) for id in range(sp_pam.get_piece_size())]
" | ".join(unigram_vocabs[:500])

'<unk> | <s> | </s> | ▁ | า | เ | น | ม | ย | ก | ร | ว | ด | ส | ง | บ | ค | มา | อ | ล | จะ | ท | ให้ | ห | ไป | ไม่ | แ | ว่า | พ | ุ | ี | ๏ | ฯ | ข | ช | เป็น | พระ | โ | ที่ | ใจ | ▁จะ | จ | ะ | ิ | ต | ก็ | อยู่ | ป | ได้ | ่ | ไ | เข้า | ู | ▁พระ | ้า | ตาม | ใน | ้ | ▁แล้ว | เหมือน | รา | ศ | เจ้า | เห็น | ลา | กัน | ั | หา | นาง | ทรง | ประ | ์ | ยา | ัก | ํา | ซ | าน | ัง | ฉ | องค์ | ัด | แล้ว | อน | ดู | ถ | ด้วย | มี | ▁จึง | นี้ | ่า | ผ | น้อง | แต่ | ทํา | ▁นาง | ▁ให้ | รัก | พี่ | คิด | ลูก | พา | รู้ | การ | กับ | ัน | หน้า | กระ | วน | ออก | ่อ | เขา | ถึง | ระ | ข้า | ับ | พล | นั่ง | ทั้ง | หน | รับ | ษ | กล | วง | ลง | ฝ | กร | พร | ความ | เสีย | ดี | ขึ้น | อง | ่ง | ธ | ▁แต่ | คน | กลับ | ▁ฝ่าย | ้น | อด | ภ | หรือ | ตร | ือ | ฟัง | แม่ | ▁ไม่ | ไว้ | ยัง | ▁เห็น | นา | ขอ | มิ | น้ํา | หล | ดัง | ▁พอ | ▁ทั้ง | ช่วย | สม | นั้น | ริ | ทัพ | ต้อง | วัน | อา | น้อย | รบ | ิน | อย่า | เอา | จน | เรา | สุด | เสียง | ข้าง | หลัง | ตี | ตัว | ละ | สุ | วัง | ทุก | ่น

In [13]:
bpe_vocabs = [bpe_pam.id_to_piece(id) for id in range(bpe_pam.get_piece_size())]
" | ".join(bpe_vocabs[:500])

'<unk> | <s> | </s> | ้า | ่า | อง | ระ | ํา | รา | อย | ่ง | มา | จะ | ัง | ัน | ▁เ | าย | ้ว | ับ | ี่ | ม่ | อน | ให | าม | ้น | ็น | พระ | ีย | าง | กล | ้ง | ัก | หน | ให้ | ไม่ | หล | ่น | ึง | ▁แ | ทั | ตร | าร | ้อง | ไป | ิด | ข้า | ว่า | หม | คร | ือ | ล้ว | เป | เส | ประ | าน | ั่ง | ▁๏ | ▁ฯ | ที่ | อก | เล | ิน | ได | พล | ทร | ัด | นาง | ึก | ได้ | ู่ | ▁จะ | ค์ | ี้ | พร | เป็น | สุ | ทั้ง | อม | ัย | เร | ห็น | ▁จ | ▁พระ | ก็ | ใจ | อา | ื่ | ่าง | ต่ | กร | ิง | วง | วน | ือน | เจ | ู้ | ียง | อยู่ | รร | ตาม | ▁พ | ้วย | าว | ถึง | คล | ั้น | รี | เข | ด้วย | สม | องค์ | สน | าก | ▁แล้ว | เช | ัว | ย์ | ใน | คว | น้ | หมือน | ▁ส | ูก | อบ | กระ | เจ้า | ทรง | ลา | กัน | มี | ่าย | พรา | ิ่ง | เข้า | เห็น | ิต | สง | อด | ณ์ | วย | ้ม | คิด | เม | เก | เด | ▁นาง | วา | ุก | ▁ให้ | ดู | หา | ▁อ | ▁จึง | ทํา | ลง | รัก | เค | แล้ว | ่าน | พี่ | เหมือน | ั่น | ความ | ยง | อย่า | หร | มิ | ืน | ช่ | การ | ัญ | ▁ไม่ | ฝ่าย | ศรี | ้าง | วก | ้อม | ือง | น้อง | ยว | พา | แก |

### User-defined symbols

Another important concept to know is user-defined symbols. These symbols are reserved for a special purpose (e.g., a mask token such as `<MASK>`, analogous to `[MASK]` in BERT) and will always be tokenized into one token.

Refer to the documentation for ways to add these special tokens to your tokenizer.

https://github.com/google/sentencepiece/blob/master/python
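
The sketch below illustrates only the behaviour described here, namely that a reserved symbol always stays one token, using a toy character-level tokenizer in plain Python. It is not how SentencePiece implements it: in SentencePiece the symbols are declared at training time (the `user_defined_symbols` training option covered in the documentation linked above). The names `RESERVED` and `tokenize_with_reserved` are hypothetical.

```python
import re

RESERVED = ["<MASK>", "<SEP>"]  # hypothetical reserved symbols

def tokenize_with_reserved(text, reserved=RESERVED):
    """Toy character-level tokenizer that never splits reserved symbols."""
    # A capturing group makes re.split keep the reserved symbols in the result.
    pattern = "(" + "|".join(re.escape(sym) for sym in reserved) + ")"
    tokens = []
    for chunk in re.split(pattern, text):
        if chunk in reserved:
            tokens.append(chunk)   # keep the symbol as a single token
        else:
            tokens.extend(chunk)   # otherwise fall back to one token per character
    return tokens

print(tokenize_with_reserved("ab<MASK>cd"))
```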

## Train another tokenizer on another domain

Now train another unigram tokenizer on the Pantip data (`pantip_text`); we will compare it with the unigram tokenizer we trained earlier.

In [14]:
## Train
spm.SentencePieceTrainer.train(input=pantip_train_file, model_prefix="unigram_pantip", vocab_size=1000, model_type="unigram")

sp_pantip = spm.SentencePieceProcessor(model_file='unigram_pantip.model')

spm.SentencePieceTrainer.train(input=pantip_train_file, model_prefix="bpe_pantip", vocab_size=1000, model_type="bpe")

bpe_pantip = spm.SentencePieceProcessor(model_file='bpe_pantip.model')

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: pantip_train.txt
  input_format: 
  model_prefix: unigram_pantip
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy:

In [15]:
print("unigram_pantip: ")
print(len(sp_pantip.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str)))
print(sp_pantip.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str))

print("unigram_pra apai manee: ")
print(len(sp_pam.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str)))
print(sp_pam.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str))

unigram_pantip: 
32
['▁', 'อ', 'รุ', 'ณ', 'ส', 'ว', 'ั', 'ส', 'ด', 'ิ', '์', '▁', 'ฉัน', 'เอา', 'ม', 'เห', 'สี', 'มา', 'หา', 'ม', '▁', 'ส', 'ว', 'ั', 'ส', 'ดี', '▁', 'ประเทศ', 'ไทย', 'สบาย', 'ดี', 'ไหม']
unigram_pra apai manee: 
29
['▁', 'อ', 'ร', 'ุ', 'ณ', 'สวัสดิ์', '▁', 'ฉัน', 'เอา', 'มเหสี', 'มา', 'หา', 'ม', '▁', 'ส', 'ว', 'ั', 'ส', 'ดี', '▁', 'ประเทศ', 'ไ', 'ท', 'ย', 'สบาย', 'ดี', 'ไ', 'ห', 'ม']


## Analyse top tokens on different datasets

Use your tokenizers to tokenize the datasets and analyse your most common vocabularies (try 300-400 vocabs with len>1). Hint: tokenize your data and count the tokens.

In [16]:
import matplotlib.pyplot as plt
from collections import Counter
import pandas as pd

# Analyze the most common tokens in a dataset
def analyze_top_tokens(data, tokenizer, top_n=300, min_len=1):
    # Tokenize the data
    all_tokens = []
    for line in data:
        tokens = tokenizer.encode(line.strip(), out_type=str)
        all_tokens.extend(tokens)
    
    # Count token frequencies
    token_counts = Counter(all_tokens)
    
    # Filter tokens with length > min_len
    filtered_tokens = {token: count for token, count in token_counts.items() if len(token) > min_len}
    
    # Sort tokens by frequency
    sorted_tokens = sorted(filtered_tokens.items(), key=lambda x: x[1], reverse=True)
    
    # Create a DataFrame for top tokens
    top_tokens_df = pd.DataFrame(sorted_tokens[:top_n], columns=["Token", "Frequency"])
    
    # Return both DataFrame and list of pairs
    return top_tokens_df, sorted_tokens[:top_n]


# Tokenize and analyze tokens for each dataset and tokenizer
print("Analyzing Pantip dataset with Unigram tokenizer...")
pantip_unigram_results_df, sorted_tokens = analyze_top_tokens(pantip_train_text, sp_pantip)
display(sorted_tokens)

print("\nAnalyzing Pra-Apai-Manee dataset with Unigram tokenizer...")
pam_unigram_results_df, sorted_tokens = analyze_top_tokens(pam_train_text, sp_pam)
display(sorted_tokens)

Analyzing Pantip dataset with Unigram tokenizer...


[('ที่', 4183),
 ('เรา', 2568),
 ('จะ', 2559),
 ('มา', 2478),
 ('ไป', 2449),
 ('ได้', 2432),
 ('ก็', 2392),
 ('ไม่', 2222),
 ('ว่า', 2195),
 ('มี', 2076),
 ('เป็น', 2056),
 ('การ', 1747),
 ('ให้', 1608),
 ('ของ', 1475),
 ('ใน', 1376),
 ('แล้ว', 1367),
 ('เลย', 1358),
 ('ครับ', 1325),
 ('กัน', 1305),
 ('นี้', 1291),
 ('กับ', 1234),
 ('ค่ะ', 1172),
 ('ดี', 1155),
 ('▁แต่', 1107),
 ('คน', 1065),
 ('ทํา', 1044),
 ('ต้อง', 1021),
 ('มัน', 1000),
 ('้า', 932),
 ('มาก', 924),
 ('อยู่', 910),
 ('จาก', 903),
 ('เขา', 900),
 ('่า', 880),
 ('ความ', 865),
 ('ใช้', 767),
 ('ด้วย', 749),
 ('แต่', 740),
 ('อะไร', 736),
 ('▁เรา', 736),
 ('ผม', 731),
 ('ตัว', 727),
 ('ใจ', 713),
 ('เรื่อง', 709),
 ('หา', 686),
 ('ํา', 682),
 ('ไม่ได้', 666),
 ('ดู', 659),
 ('ีย', 646),
 ('▁และ', 642),
 ('วัน', 640),
 ('พอ', 636),
 ('▁(', 634),
 ('และ', 612),
 ('ับ', 611),
 ('รับ', 600),
 ('เข้า', 594),
 ('แบบ', 591),
 ('งาน', 588),
 ('อยาก', 585),
 ('นั้น', 585),
 ('ถึง', 585),
 ('ัง', 582),
 ('คือ', 580),
 ('ขึ้น', 57


Analyzing Pra-Apai-Manee dataset with Unigram tokenizer...


[('มา', 3002),
 ('จะ', 2506),
 ('ไป', 2407),
 ('ให้', 2391),
 ('ว่า', 2214),
 ('ไม่', 2085),
 ('ที่', 1632),
 ('เป็น', 1587),
 ('▁จะ', 1580),
 ('พระ', 1561),
 ('ใจ', 1458),
 ('ก็', 1368),
 ('อยู่', 1339),
 ('▁พระ', 1251),
 ('ได้', 1247),
 ('เข้า', 1190),
 ('รา', 1128),
 ('ตาม', 1096),
 ('้า', 1086),
 ('▁แล้ว', 1085),
 ('ใน', 1069),
 ('ยา', 1063),
 ('ลา', 1052),
 ('ํา', 1039),
 ('เจ้า', 1024),
 ('เหมือน', 1009),
 ('กัน', 976),
 ('หา', 973),
 ('ประ', 944),
 ('อน', 933),
 ('ทรง', 914),
 ('ัง', 914),
 ('เห็น', 910),
 ('▁ให้', 898),
 ('าน', 879),
 ('นาง', 875),
 ('ัก', 868),
 ('ัด', 837),
 ('ดู', 834),
 ('องค์', 827),
 ('มี', 809),
 ('่า', 801),
 ('▁จึง', 799),
 ('▁นาง', 792),
 ('วน', 782),
 ('พา', 777),
 ('แล้ว', 771),
 ('นี้', 770),
 ('่อ', 765),
 ('ด้วย', 760),
 ('ลูก', 744),
 ('น้อง', 741),
 ('ทํา', 738),
 ('รัก', 736),
 ('หน', 731),
 ('พี่', 719),
 ('การ', 714),
 ('คิด', 712),
 ('พล', 712),
 ('กระ', 710),
 ('รู้', 695),
 ('กับ', 683),
 ('แต่', 677),
 ('หน้า', 670),
 ('ระ', 666),
 ('ออก

### To answer
What are some notable differences you see between the two vocabs?

Write your answer below.

In [17]:
# - Tokens from the Pantip dataset (unigram tokenizer) look like colloquial, spoken-style Thai, e.g. ครับ ค่ะ นะคะ
# - Tokens from the Pra-Apai-Manee dataset (unigram tokenizer) look like older, literary Thai, e.g. เจ้า พระ ข้า

## Using tokenizer across domains

One problem you may face is that your dataset is very specialized. In that case, a tokenizer trained on a general domain may not perform as well as it should when used on your dataset.

Next, you will take a tokenizer trained on a general domain (Pantip) and use it on a specialized domain (พระอภัยมณี), and vice versa.

### Q3 MCV

What percentage increase in the number of tokens do you observe when tokenizing the whole พระอภัยมณี dataset with a tokenizer trained on Pantip, compared to the one trained on พระอภัยมณี?

In [18]:
tokens_pam_on_pam_trained = [a for line in pra_apai_manee_data for a in sp_pam.encode(line, out_type=str)]
tokens_pam_on_pantip_trained = [a for line in pra_apai_manee_data for a in sp_pantip.encode(line, out_type=str)]

print(tokens_pam_on_pam_trained[:10])
print(tokens_pam_on_pantip_trained[:10])
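# This prints the token-count ratio as a percentage; the percentage increase is this value minus 100.
print(100*len(tokens_pam_on_pantip_trained)/len(tokens_pam_on_pam_trained))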

['▁', '๏', '▁แต่', 'ป', 'า', 'ง', 'หลัง', 'ยัง', 'มี', 'กรุง']
['▁', '๏', '▁แต่', 'ป', 'า', 'ง', 'หลัง', 'ยัง', 'มี', 'ก']
141.50978497925553


### Q4 MCV

What percentage increase in the number of tokens do you observe when tokenizing the whole Pantip dataset with a tokenizer trained on พระอภัยมณี, compared to the one trained on Pantip?

In [19]:
tokens_pantip_on_pantip_trained = [a for line in pantip_text for a in sp_pantip.encode(line, out_type=str)]
tokens_pantip_on_pam_trained = [a for line in pantip_text for a in sp_pam.encode(line, out_type=str)]

print(tokens_pantip_on_pantip_trained[:10])
print(tokens_pantip_on_pam_trained[:10])
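# This prints the token-count ratio as a percentage; the percentage increase is this value minus 100.
print(100*len(tokens_pantip_on_pam_trained)/len(tokens_pantip_on_pantip_trained))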

['▁', 'ใคร', 'รู้จัก', 'คน', 'นี้', 'บ้าง', '▁คือ', 'เรา', 'ค', 'ุ']
['▁', 'ใคร', 'รู้', 'จัก', 'คน', 'นี้', 'บ้าง', '▁', 'ค', 'ือ']
115.5704503918366
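
The Q3 and Q4 cells print the token-count ratio as a percentage (about 141.5 and 115.6), so the percentage increase the questions ask for is that value minus 100. A small sketch of the arithmetic follows; the helper names and the token counts in the last line are hypothetical, while the two ratios are copied from the outputs above.

```python
def percentage_increase(baseline_count, other_count):
    """Relative growth of other_count over baseline_count, in percent."""
    return 100.0 * (other_count - baseline_count) / baseline_count

def ratio_to_increase(ratio_percent):
    """Convert a ratio expressed in percent (e.g. 141.5) to a percentage increase."""
    return ratio_percent - 100.0

# Ratios printed by the Q3 and Q4 cells.
q3_increase = ratio_to_increase(141.50978497925553)
q4_increase = ratio_to_increase(115.5704503918366)
print(f"Q3: {q3_increase:.2f}%  Q4: {q4_increase:.2f}%")

# The same quantity computed directly from two (hypothetical) token counts.
print(percentage_increase(1000, 1415))
```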


### To answer
Why do you think the general tokenizer (the one trained on Pantip) shows a higher percentage increase in token count than the specialized tokenizer? (Hint: we fixed the vocab size.)

In [20]:
# I believe the Pra apai manee dataset has a more diverse vocabulary than the Pantip dataset.
# With the vocab size fixed, the tokenizer trained on Pra apai manee should therefore generalize across the language better than the one trained on Pantip,
# which results in a smaller percentage increase in token count when it is applied to the Pantip dataset.

## The effect on language models

Next, we will see the effect of using "cross-domain" tokenizers on language models.

### Setup
We are going to reuse the code from the last assignment.

In [21]:
!pip install lightning



In [22]:
import itertools
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import lightning as L
from tqdm import tqdm
import numpy as np

In [23]:
class TextDataset(Dataset):
  def __init__(self, data, tokenizer, seq_len = 128):

    token_ids = [tokenizer.encode(d, add_bos=True, add_eos=True) for d in data]
    flatten_token_ids = list(itertools.chain(*token_ids))
    encoded = torch.LongTensor(flatten_token_ids)

    left_over = len(encoded) % seq_len
    encoded = encoded[:len(encoded)-left_over]
    self.encoded = encoded.view(-1, seq_len)

  def __getitem__(self, idx):
    return self.encoded[idx]

  def __len__(self):
    return len(self.encoded)

In [24]:
class LSTM(L.LightningModule):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, learning_rate, criterion):

        super().__init__()

        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.vocab_size=vocab_size

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                    dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.learning_rate = learning_rate
        self.criterion = criterion

    def forward(self, src):
        emb = self.dropout(self.embedding(src))
        lstm_out, _ = self.lstm(emb)
        lstm_out = self.dropout(lstm_out)
        out = self.fc(lstm_out)
        return out

    def training_step(self, batch, batch_idx):

        src = batch[:, :-1]
        target = batch[:, 1:]
        prediction = self(src)
        prediction = prediction.reshape(-1, self.vocab_size)
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx, dataloader_idx=0):

        src = batch[:, :-1]
        target = batch[:, 1:]
        with torch.no_grad():
          prediction = self(src)
        prediction = prediction.reshape(-1, self.vocab_size)
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [25]:
vocab_size = sp_pam.get_piece_size()
embedding_dim = 200
hidden_dim = 512
num_layers = 3
dropout_rate = 0.2
lr = 1e-3
criterion = nn.CrossEntropyLoss()
train_batch_size = 64
test_batch_size = 128

### Training

<a name="no1"></a>
#### 1. Training on Pantip data with Pantip tokenizer

In [26]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pantip)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pantip)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pantip)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pantip)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pantip_train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/jaf/anaconda3/envs/nlp/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA GeForce RTX 3060 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_

Epoch 9: 100%|██████████| 44/44 [00:01<00:00, 24.94it/s, v_num=5]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 44/44 [00:01<00:00, 23.64it/s, v_num=5]


In [27]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader,pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/jaf/anaconda3/envs/nlp/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/home/jaf/anaconda3/envs/nlp/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Testing DataLoader 3: 100%|██████████| 9/9 [00:00<00:00, 53.43it/s]  
Perplexity on Pantip train set is:	77.62148913999245
Perplexity on Pra apai manee train set is:	111.55689831390926
Perplexity on Pantip test set is:	106.26474865149454
Perplexity on Pra apai manee test set is:	113.8366556777151
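
The cell above reports perplexity as `np.exp` of the test loss. Since `nn.CrossEntropyLoss` averages the negative log-likelihood over the target tokens by default, this corresponds to the usual definition of per-token perplexity, up to how the per-batch losses are averaged. A minimal sketch of that definition in plain Python, with a hypothetical helper name and made-up probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) of the target tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that spreads probability uniformly over a 1000-piece vocabulary.
uniform_ppl = perplexity([1 / 1000] * 5)
print(uniform_ppl)

# A model that is more confident about the correct next token.
print(perplexity([0.5, 0.25, 0.125, 0.125]))
```

As a reference point, a uniform guess over the 1000-piece vocabulary used here scores a perplexity of about 1000, so lower values mean the model has learned something about the token distribution.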


<a name="no2"></a>
#### 2. Training on Pantip data with Pra apai manee tokenizer

In [28]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pam)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pam)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pam)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pam)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pantip_train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode


Epoch 9: 100%|██████████| 51/51 [00:01<00:00, 25.55it/s, v_num=6]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 51/51 [00:02<00:00, 24.41it/s, v_num=6]


In [29]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader,pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 3: 100%|██████████| 7/7 [00:00<00:00, 41.26it/s]  
Perplexity on Pantip train set is:	36.38565572511128
Perplexity on Pra apai manee train set is:	442.9921778117153
Perplexity on Pantip test set is:	46.447643820664446
Perplexity on Pra apai manee test set is:	419.1088872194669


#### To answer

The perplexity numbers should indicate that:
1. Training the LM on Pantip with the Pra apai manee tokenizer (no. [2](#no2)) results in overfitting to Pantip and poor generalization to the Pra apai manee dataset.
2. However, using the Pantip tokenizer (no. [1](#no1)) results in much better generalization.

Try to come up with some reasons for the results above. <br>
Hint:
1. Think about "general" vocabs and domain-specific vocabs.
2. What do you think happens to the model when the sequences of token ids become longer?

The Pantip tokenizer uses more common subwords, allowing the model to generalize better across datasets by breaking rare or domain-specific words into familiar components.

In contrast, the Pra apai manee tokenizer focuses on rare, domain-specific tokens that do not appear in the Pantip dataset, leading to poor generalization.
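
Related to hint 2: with a fixed `seq_len` of 128 token ids, a tokenizer that produces more tokens cuts the same text into more windows, each covering less raw text. The progress bars above show this on the Pantip training text: 44 batches per epoch with the Pantip tokenizer versus 51 with the Pra apai manee tokenizer. The sketch below mirrors that bookkeeping, including how `TextDataset` drops the leftover tokens; the helper names and all counts are hypothetical and chosen only for illustration.

```python
import math

def num_batches(total_tokens, seq_len=128, batch_size=64):
    """Batches per epoch for a flat token stream cut into fixed-length windows."""
    windows = total_tokens // seq_len        # leftover tokens are dropped
    return math.ceil(windows / batch_size)   # the last batch may be smaller

def chars_per_window(total_chars, total_tokens, seq_len=128):
    """Average number of characters of raw text covered by one window."""
    return seq_len * total_chars / total_tokens

# Hypothetical counts: the same text tokenized by an in-domain tokenizer
# and by a cross-domain tokenizer that needs 15% more tokens.
total_chars = 800_000
in_domain_tokens = 350_000
cross_domain_tokens = 402_500

for name, n_tokens in [("in-domain", in_domain_tokens), ("cross-domain", cross_domain_tokens)]:
    print(name, num_batches(n_tokens), round(chars_per_window(total_chars, n_tokens), 1))
```

The cross-domain tokenizer yields more batches, but each 128-token window spans less text, so the model sees less context per prediction.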


<a name="no3"></a>
#### 3. Training on Pra apai manee data with Pantip tokenizer


In [30]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pantip)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pantip)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pantip)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pantip)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pam_train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode


Epoch 9: 100%|██████████| 66/66 [00:02<00:00, 24.23it/s, v_num=7]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 66/66 [00:02<00:00, 23.29it/s, v_num=7]


In [31]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader,pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 3: 100%|██████████| 9/9 [00:00<00:00, 41.94it/s]  
Perplexity on Pantip train set is:	3895.934105398706
Perplexity on Pra apai manee train set is:	41.26073564164805
Perplexity on Pantip test set is:	3151.679503293764
Perplexity on Pra apai manee test set is:	44.29268968018155
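
The perplexities printed above are `np.exp` of the test loss. Assuming that loss is the mean per-token cross-entropy in nats (the model summary lists `CrossEntropyLoss` as the criterion), perplexity can be read as the effective number of equally likely choices the model faces at each token. The sketch below uses plain Python and a hypothetical helper, `perplexity_from_probs`, that is not part of the notebook. It checks this reading on a toy case: a model that is uniform over 4 tokens should have a perplexity of 4.

```python
import math


def perplexity_from_probs(target_probs):
    """Perplexity = exp(mean negative log-likelihood of the target tokens).

    `target_probs` holds the probability the model assigned to each
    correct next token (hypothetical input, for illustration only).
    """
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is uniform over 4 tokens assigns p = 0.25 to every target.
uniform_ppl = perplexity_from_probs([0.25] * 10)

# A more confident model assigns higher probability to the targets,
# so its perplexity is lower (closer to 1).
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95, 0.7])

print(f"uniform over 4 tokens: {uniform_ppl:.3f}")
print(f"confident model:       {confident_ppl:.3f}")
```

Read this way, a perplexity in the thousands on Pantip means the model is roughly as uncertain as if it were choosing uniformly among thousands of tokens at each step.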


<a name="no4"></a>
#### 4. Training on Pra apai manee data with Pra apai manee tokenizer




In [32]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pam)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pam)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pam)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pam)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pam_train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.511    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
/home/jaf/anaconda3/envs/nlp/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (48) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n

Epoch 9: 100%|██████████| 48/48 [00:01<00:00, 25.01it/s, v_num=8]

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 48/48 [00:02<00:00, 23.69it/s, v_num=8]


In [33]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader, pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 3: 100%|██████████| 7/7 [00:00<00:00, 41.98it/s]  
Perplexity on Pantip train set is:	595.8866183048801
Perplexity on Pra apai manee train set is:	76.8869146157321
Perplexity on Pantip test set is:	584.2165288791805
Perplexity on Pra apai manee test set is:	85.709975512397


#### To answer

The perplexity numbers should indicate that:
1. Both LMs overfit on the Pra apai manee data and perform very poorly on the Pantip data.
2. However, using the Pra apai manee tokenizer (no. [4](#no4)) results in better generalization than using the Pantip tokenizer (no. [3](#no3)).

Try and come up with some reasons for the results above. <br>

<br>
The Pra Apai Manee tokenizer likely covers a broader range of tokens, capturing both general and domain-specific vocabulary. Because it is trained on the rich and varied language of Pra Apai Manee, it can probably handle more diverse compositions and word usage, which would help it generalize across domains.

The Pantip tokenizer, on the other hand, is specialized in everyday language and informal writing. Its vocabulary is built around Pantip-style text, so many of its tokens probably appear rarely, if at all, in the Pra Apai Manee training data. The model therefore has little chance to learn them, and it struggles when it encounters text from a different domain.

This explains why the Pra Apai Manee tokenizer leads to better generalization, despite both models overfitting on the Pra Apai Manee data.
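
One caveat when comparing no. 3 and no. 4: the reported perplexities are per token, and the two tokenizers split the same text into different numbers of tokens, so the raw numbers are not strictly comparable. One way to put them on a common scale is to normalize the total loss by the number of characters instead of the number of tokens. The sketch below is a minimal illustration under stated assumptions: `per_char_perplexity` is a hypothetical helper, and the losses and token counts are made-up numbers, not values from this notebook.

```python
import math


def per_char_perplexity(mean_token_loss, n_tokens, n_chars):
    """Convert a mean per-token loss (in nats) to a per-character perplexity.

    The total negative log-likelihood of the text is mean_token_loss * n_tokens;
    dividing by the character count removes the dependence on how finely
    the tokenizer splits the text.
    """
    total_nll = mean_token_loss * n_tokens
    return math.exp(total_nll / n_chars)


# Hypothetical example: the same 1000-character text under two tokenizers.
# Tokenizer A produces 250 long tokens, each hard to predict (loss 4.0).
# Tokenizer B produces 500 short tokens, each easier to predict (loss 2.0).
ppl_a = per_char_perplexity(mean_token_loss=4.0, n_tokens=250, n_chars=1000)
ppl_b = per_char_perplexity(mean_token_loss=2.0, n_tokens=500, n_chars=1000)

print(f"per-token perplexity A: {math.exp(4.0):.2f}, B: {math.exp(2.0):.2f}")
print(f"per-char  perplexity A: {ppl_a:.3f}, B: {ppl_b:.3f}")
```

In this toy case the per-token perplexities differ by a large factor, yet both models assign the same total probability to the text. Applying the same normalization to the real test sets would need the token and character counts of each dataset under each tokenizer.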