<a href="https://colab.research.google.com/github/CUknot/NLP/blob/main/Lab2_3_sentencepiece_to_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Subword Tokenization

In this exercise, we will learn how to train our own subword tokenizers with different algorithms: BPE and Unigram. We will use `sentencepiece`, a library from Google to help create our tokenizers.

## Ref:
https://github.com/google/sentencepiece/blob/master/python

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Setup

In [None]:
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/kratoo-40000000-40002000.jsonl

--2025-01-18 08:59:51--  https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt [following]
--2025-01-18 08:59:52--  https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3231076 (3.1M) [application/octet-stream]
Saving to: ‘pra-apai-manee-ch1-50.txt’


2025-01-18 08:59:53 (107 MB/s) - ‘pra-apai-manee-ch1-50.txt’ saved [3231076/3231076]

--2025-01-18 08:

## Code

In [None]:
import sentencepiece as spm
import io
import json

Load data

In [None]:
pantip_text = []
with open('kratoo-40000000-40002000.jsonl', 'r', encoding='utf-8') as json_file:
    # Each line of the .jsonl file is one JSON object (one Pantip thread)
    for json_str in json_file:
        result = json.loads(json_str)
        pantip_text.append(f"{result['title']}\n{result['content']}\n")
sum([len(t) for t in pantip_text])

1060318

In [None]:
with open("pra-apai-manee-ch1-50.txt", encoding="utf-8") as f:
  pra_apai_manee_data = f.readlines()

In [None]:
sum([len(t) for t in pra_apai_manee_data])

1100605

In [None]:
pantip_train_text = pantip_text[:int(len(pantip_text)*0.8)]
pantip_test_text = pantip_text[int(len(pantip_text)*0.8):]

pam_train_text = pra_apai_manee_data[:int(len(pra_apai_manee_data)*0.8)] #pam = pra_apai_manee
pam_test_text = pra_apai_manee_data[int(len(pra_apai_manee_data)*0.8):]

## Run tokenizer training

The Python wrapper provides multiple APIs for training our tokenizers:

1. `spm.SentencePieceTrainer.train(input='input.txt', model_prefix='m', vocab_size=vocab_size, model_type=model_type)`
  <br> This will output the tokenizer files `m.model` and `m.vocab` that can be later loaded into `SentencePieceProcessor`.
  <br><br>
2. `spm.SentencePieceTrainer.train(sentence_iterator=iterator, model_writer=obj_with_write_method, vocab_size=vocab_size, model_type=model_type)`
  <br> This method requires a file-like object, e.g. `obj_with_write_method = io.BytesIO()`. Its advantage is that you can run sentencepiece in environments with limited access to the local file system. You will still have to save the model yourself if you want to reuse it; otherwise you will have to train it again.
<br><br>
3.  `spm.SentencePieceTrainer.train('--input=input.txt --model_prefix=m --vocab_size=vocab_size --model_type=model_type')`
<br> Equivalent to method 1, with the arguments passed as a single command-line-style string.




### Unigram tokenizer

We are going to start with training a unigram tokenizer. You can use any method of training one. Make sure to set vocab_size to 1000.
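To give some intuition for what a unigram tokenizer does at tokenization time, here is a minimal sketch: each piece has a probability, and the chosen segmentation is the one that maximizes the product of the piece probabilities, found with dynamic programming (Viterbi). The function name `viterbi_segment` and the toy probabilities are assumptions made for illustration; this is not SentencePiece's implementation, and the training procedure that estimates the probabilities and prunes the vocabulary is not shown.

```python
import math


def viterbi_segment(text, piece_logprob):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logprob:
                score = best[start] + piece_logprob[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the text to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Made-up piece probabilities (hypothetical, for illustration only).
toy_probs = {"a": 0.1, "b": 0.1, "c": 0.1, "ab": 0.3, "bc": 0.2, "abc": 0.2}
toy_logprob = {piece: math.log(p) for piece, p in toy_probs.items()}

print(viterbi_segment("abcab", toy_logprob))
```

Here `abc` + `ab` scores 0.2 × 0.3 = 0.06, which beats `ab` + `c` + `ab` (0.009) and `a` + `bc` + `ab` (0.006), so the two-piece segmentation is selected.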

In [None]:
## Train
# Save training data to files
with open("pantip_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(pantip_train_text))

with open("pam_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(pam_train_text))

# Train tokenizer on pantip_train_text
spm.SentencePieceTrainer.train(
    input="pantip_train.txt",
    model_prefix="pantip_unigram",
    vocab_size=1000,
    model_type="unigram"
)

# Train tokenizer on pam_train_text
spm.SentencePieceTrainer.train(
    input="pam_train.txt",
    model_prefix="pam_unigram",
    vocab_size=1000,
    model_type="unigram"
)

In [None]:
# Load the tokenizer
pantip_tokenizer_unigram = spm.SentencePieceProcessor(model_file="pantip_unigram.model")
pam_tokenizer_unigram = spm.SentencePieceProcessor(model_file="pam_unigram.model")

### Q1 MCV

How many tokens did you get when tokenizing the following sentence with your unigram tokenizer: <br>
'อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม'

In [None]:
sp_pam = pam_tokenizer_unigram

In [None]:
len(sp_pam.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str))

29

### BPE Tokenizer

Now try training a BPE tokenizer.
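To make the idea behind BPE concrete, here is a minimal sketch of the merge-learning loop on a toy English word list: repeatedly count adjacent symbol pairs and merge the most frequent one into a new symbol. It is not SentencePiece's implementation, which operates on whole sentences and is heavily optimized; the function name `learn_bpe_merges`, the toy corpus, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent every word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken alphabetically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe_merges(toy_words, 3))
```

Because merges are learned in order of frequency, the early entries of a BPE vocabulary tend to be short, very common character sequences, which matches what the vocabulary dumps in this notebook show.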

In [None]:
# Train a BPE tokenizer on pantip_train_text.
# Use a distinct model_prefix so the unigram model files are not overwritten.
spm.SentencePieceTrainer.train(
    input="pantip_train.txt",
    model_prefix="pantip_bpe",
    vocab_size=1000,
    model_type="bpe"
)

# Train a BPE tokenizer on pam_train_text
spm.SentencePieceTrainer.train(
    input="pam_train.txt",
    model_prefix="pam_bpe",
    vocab_size=1000,
    model_type="bpe"
)

In [None]:
# Load the BPE tokenizers
pantip_tokenizer_bpe = spm.SentencePieceProcessor(model_file="pantip_bpe.model")
pam_tokenizer_bpe = spm.SentencePieceProcessor(model_file="pam_bpe.model")

### Q2 MCV

How many tokens did you get when tokenizing the following sentence with your BPE tokenizer: <br>
'อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม'

In [None]:
bpe = pam_tokenizer_bpe

In [None]:
len(bpe.encode('อรุณสวัสดิ์ ฉันเอามเหสีมาหาม สวัสดี ประเทศไทยสบายดีไหม', out_type=str))

28

These are some of your vocabs. Note that you will see "▁" (U+2581) with every tokenizer type in SentencePiece: it stands in for whitespace, which makes it possible to detokenize (rejoin the pieces into the original sentence) without relying on language-specific resources.
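As a minimal sketch of why the "▁" marker makes detokenization language-independent: joining the pieces and turning every "▁" back into a space recovers the text. The piece lists below are hand-written and hypothetical (they are not output from the tokenizers trained here), and stripping the leading space assumes the default behavior of prepending a dummy whitespace to the input.

```python
WS = "\u2581"  # the "▁" whitespace marker


def detokenize(pieces):
    """Rejoin subword pieces into text by mapping the marker back to spaces."""
    text = "".join(pieces).replace(WS, " ")
    # Assumption: the tokenizer prepended one dummy whitespace, so drop it.
    return text.lstrip(" ")


# Hypothetical piece sequences, written by hand for illustration.
english_pieces = [WS + "Hello", WS + "wor", "ld", "."]
thai_pieces = [WS, "สวัสดี", WS + "ประเทศ", "ไทย"]

print(detokenize(english_pieces))
print(detokenize(thai_pieces))
```

No word list or language-specific rule is needed: the information about where the spaces were is carried by the pieces themselves.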

In [None]:
unigram_vocabs = [sp_pam.id_to_piece(id) for id in range(sp_pam.get_piece_size())]
" | ".join(unigram_vocabs[:500])

'<unk> | <s> | </s> | ้า | ่า | อง | ระ | ํา | รา | อย | ่ง | มา | จะ | ัง | ัน | ▁เ | าย | ้ว | ับ | ี่ | ม่ | อน | ให | าม | ้น | ็น | พระ | ีย | าง | กล | ้ง | ัก | หน | ให้ | ไม่ | หล | ่น | ึง | ▁แ | ทั | ตร | าร | ้อง | ไป | ิด | ข้า | ว่า | หม | คร | ือ | ล้ว | เป | เส | ประ | าน | ั่ง | ▁๏ | ▁ฯ | ที่ | อก | เล | ิน | ได | พล | ทร | ัด | นาง | ึก | ได้ | ู่ | ▁จะ | ค์ | ี้ | พร | เป็น | สุ | ทั้ง | อม | ัย | เร | ห็น | ▁จ | ▁พระ | ก็ | ใจ | อา | ื่ | ่าง | ต่ | กร | ิง | วง | วน | ือน | เจ | ู้ | ียง | อยู่ | รร | ตาม | ▁พ | ้วย | าว | ถึง | คล | ั้น | รี | เข | ด้วย | สม | องค์ | สน | าก | ▁แล้ว | เช | ัว | ย์ | ใน | คว | น้ | หมือน | ▁ส | ูก | อบ | กระ | เจ้า | ทรง | ลา | กัน | มี | ่าย | พรา | ิ่ง | เข้า | เห็น | ิต | สง | อด | ณ์ | วย | ้ม | คิด | เม | เก | เด | ▁นาง | วา | ุก | ▁ให้ | ดู | หา | ▁อ | ▁จึง | ทํา | ลง | รัก | เค | แล้ว | ่าน | พี่ | เหมือน | ั่น | ความ | ยง | อย่า | หร | มิ | ืน | ช่ | การ | ัญ | ▁ไม่ | ฝ่าย | ศรี | ้าง | วก | ้อม | ือง | น้อง | ยว | พา | แก |

In [None]:
bpe_vocabs = [bpe.id_to_piece(id) for id in range(bpe.get_piece_size())]
" | ".join(bpe_vocabs[:500])

'<unk> | <s> | </s> | ้า | ่า | อง | ระ | ํา | รา | อย | ่ง | มา | จะ | ัง | ัน | ▁เ | าย | ้ว | ับ | ี่ | ม่ | อน | ให | าม | ้น | ็น | พระ | ีย | าง | กล | ้ง | ัก | หน | ให้ | ไม่ | หล | ่น | ึง | ▁แ | ทั | ตร | าร | ้อง | ไป | ิด | ข้า | ว่า | หม | คร | ือ | ล้ว | เป | เส | ประ | าน | ั่ง | ▁๏ | ▁ฯ | ที่ | อก | เล | ิน | ได | พล | ทร | ัด | นาง | ึก | ได้ | ู่ | ▁จะ | ค์ | ี้ | พร | เป็น | สุ | ทั้ง | อม | ัย | เร | ห็น | ▁จ | ▁พระ | ก็ | ใจ | อา | ื่ | ่าง | ต่ | กร | ิง | วง | วน | ือน | เจ | ู้ | ียง | อยู่ | รร | ตาม | ▁พ | ้วย | าว | ถึง | คล | ั้น | รี | เข | ด้วย | สม | องค์ | สน | าก | ▁แล้ว | เช | ัว | ย์ | ใน | คว | น้ | หมือน | ▁ส | ูก | อบ | กระ | เจ้า | ทรง | ลา | กัน | มี | ่าย | พรา | ิ่ง | เข้า | เห็น | ิต | สง | อด | ณ์ | วย | ้ม | คิด | เม | เก | เด | ▁นาง | วา | ุก | ▁ให้ | ดู | หา | ▁อ | ▁จึง | ทํา | ลง | รัก | เค | แล้ว | ่าน | พี่ | เหมือน | ั่น | ความ | ยง | อย่า | หร | มิ | ืน | ช่ | การ | ัญ | ▁ไม่ | ฝ่าย | ศรี | ้าง | วก | ้อม | ือง | น้อง | ยว | พา | แก |

### User-defined symbols

Another important concept to know is user-defined symbols. These special symbols are reserved for a special purpose (e.g., the `[MASK]` token used in BERT) and will always be tokenized into one token.

Refer to the documentation for ways to add these special tokens to your tokenizer.

https://github.com/google/sentencepiece/blob/master/python
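The guarantee that a reserved symbol always comes out as a single token can be illustrated with a small standalone sketch. This is only a conceptual model, not how SentencePiece implements it (there, the symbols are registered at training time, for example through the `user_defined_symbols` training option); the function name `split_with_reserved` and the character-level fallback are assumptions made for illustration.

```python
import re


def split_with_reserved(text, reserved):
    """Tokenize `text`, keeping every reserved symbol as exactly one token."""
    if not reserved:
        return list(text)
    # Try longer symbols first so overlapping symbols match greedily.
    ordered = sorted(reserved, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(symbol) for symbol in ordered) + ")"
    tokens = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue
        if chunk in reserved:
            tokens.append(chunk)   # reserved symbol: never split
        else:
            tokens.extend(chunk)   # stand-in for a real subword tokenizer
    return tokens


print(split_with_reserved("ab<MASK>c", {"<MASK>"}))
```

Without the reserved set, the same input would be broken into `<`, `M`, `A`, and so on, which is exactly what user-defined symbols are meant to prevent.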

## Train another tokenizer on another domain

Now try training another unigram tokenizer on `pantip_text` and we will use it to compare with the unigram tokenizer we trained earlier.

In [None]:
## Train
pantip_tokenizer_unigram

<sentencepiece.SentencePieceProcessor; proxy of <Swig Object of type 'sentencepiece::SentencePieceProcessor *' at 0x7c628939ea90> >

## Analyse top tokens on different datasets

Use your tokenizers to tokenize the datasets and analyse your most common vocabularies (try 300-400 vocabs with len>1). Hint: tokenize your data and count the tokens.

In [None]:
from collections import Counter

# Tokenize datasets
pantip_tokens = []
pam_tokens = []

# Tokenize Pantip dataset
for text in pantip_train_text:
    pantip_tokens.extend(pantip_tokenizer_bpe.encode(text, out_type=str))

# Tokenize PAM dataset
for text in pam_train_text:
    pam_tokens.extend(pam_tokenizer_bpe.encode(text, out_type=str))

# Count token frequencies
pantip_token_counts = Counter(pantip_tokens)
pam_token_counts = Counter(pam_tokens)

# Filter and sort tokens
pantip_top_tokens = [
    (token, count) for token, count in pantip_token_counts.most_common(400)
    if len(token) > 1
]

pam_top_tokens = [
    (token, count) for token, count in pam_token_counts.most_common(400)
    if len(token) > 1
]

In [None]:
print("Top Pantip Tokens:")
for token, count in pantip_top_tokens[:300]:
    print(f"{token}: {count}")

Top Pantip Tokens:
ที่: 3847
มา: 2400
ไป: 2360
ได้: 2281
ไม่: 2274
เรา: 2228
ว่า: 2166
จะ: 1995
ก็: 1978
มี: 1892
เป็น: 1867
▁เ: 1743
การ: 1653
ให้: 1637
นี้: 1370
แล้ว: 1367
ครับ: 1332
ของ: 1322
กัน: 1303
คน: 1280
ทํา: 1279
ดี: 1273
เลย: 1217
ค่ะ: 1207
ใน: 1177
▁แต่: 1107
มาก: 1087
กับ: 1086
ความ: 1080
่า: 1036
▁ส: 998
าย: 963
ัน: 924
อยู่: 910
ใจ: 889
▁เรา: 889
ํา: 876
แต่: 859
เก: 831
จาก: 813
ต้อง: 808
ตัว: 793
สอบ: 789
ประ: 761
▁1: 752
ด้วย: 749
้า: 746
อยาก: 745
อก: 744
มัน: 744
ัง: 734
อน: 730
ผม: 729
อะไร: 719
าน: 709
ไม่ได้: 707
หา: 690
วัน: 689
ัก: 678
รับ: 678
ผู้: 677
ใช้: 675
ิน: 675
▁แ: 669
▁พ: 651
อย่าง: 646
าก: 645
▁และ: 642
อง: 640
ดู: 640
▁(: 634
งาน: 633
ับ: 631
เล: 628
่ง: 623
▁2: 620
คะ: 616
้น: 614
และ: 613
กล: 606
คร: 602
ยัง: 599
าง: 599
เรื่อง: 597
าร: 587
เขา: 586
ถึง: 585
่น: 585
หน: 584
▁อ: 583
เข้า: 579
▁น: 577
หน้า: 574
▁แล้ว: 566
คือ: 564
▁ค: 563
ทาง: 562
ทุก: 549
ขึ้น: 543
▁ไม่: 537
คิด: 535
นั้น: 529
แบบ: 519
▁มี: 518
ตร: 518
อม: 516
▁ร: 515
ัด: 514
ออก

In [None]:
print("\nTop PAM Tokens:")
for token, count in pam_top_tokens[:300]:
    print(f"{token}: {count}")


Top PAM Tokens:
มา: 3067
จะ: 2471
ให้: 2381
ไม่: 2105
ไป: 2046
ว่า: 1976
▁๏: 1923
▁ฯ: 1922
พระ: 1705
▁จะ: 1685
ประ: 1616
เป็น: 1587
าน: 1534
ใจ: 1458
ที่: 1392
กล: 1368
▁พระ: 1326
▁เ: 1284
าย: 1260
่า: 1232
ได้: 1151
รา: 1142
อยู่: 1129
▁แล้ว: 1085
ก็: 1073
ใน: 1069
อง: 1054
พล: 1029
อน: 1017
คร: 1000
ตาม: 984
เจ้า: 975
คล: 967
กัน: 938
ัด: 922
เข้า: 917
ิน: 913
เห็น: 910
ทั้ง: 907
นาง: 892
เก: 875
ลา: 852
▁นาง: 848
้า: 845
▁ให้: 843
วน: 837
ทรง: 836
▁ส: 836
เร: 831
หล: 820
▁จึง: 799
เส: 789
ัน: 783
ระ: 780
ัก: 775
แล้ว: 771
กร: 769
อด: 767
ด้วย: 760
คิด: 757
เหมือน: 752
ับ: 752
สม: 745
กระ: 734
าก: 733
ถึง: 732
เล: 727
ํา: 724
ัง: 721
สน: 715
การ: 713
หน: 710
▁ไม่: 710
้น: 707
อย: 705
อก: 691
อา: 687
พา: 677
รัก: 675
หา: 675
หน้า: 670
่น: 670
ลง: 669
ยา: 668
ดี: 667
มี: 666
▁ทั้ง: 662
อม: 657
ปรา: 655
คน: 653
ตร: 653
ึก: 647
รับ: 646
าม: 644
แต่: 643
าว: 637
ดู: 637
้อง: 635
สํา: 632
ดา: 632
นี้: 629
พร: 624
ลูก: 619
ทํา: 618
รู้: 612
ี่: 609
▁ฝ่าย: 605
ย์: 601
ั่น: 596
ข้า: 593
หม: 

### To answer
What are some notable differences you see between the two vocabs?

Write your answer below.

- Pantip: informal language.
- Pam: formal language, literary, from an older era.

In [None]:
print(pantip_top_tokens)

[('ที่', 3847), ('มา', 2400), ('ไป', 2360), ('ได้', 2281), ('ไม่', 2274), ('เรา', 2228), ('ว่า', 2166), ('จะ', 1995), ('ก็', 1978), ('มี', 1892), ('เป็น', 1867), ('▁เ', 1743), ('การ', 1653), ('ให้', 1637), ('นี้', 1370), ('แล้ว', 1367), ('ครับ', 1332), ('ของ', 1322), ('กัน', 1303), ('คน', 1280), ('ทํา', 1279), ('ดี', 1273), ('เลย', 1217), ('ค่ะ', 1207), ('ใน', 1177), ('▁แต่', 1107), ('มาก', 1087), ('กับ', 1086), ('ความ', 1080), ('่า', 1036), ('▁ส', 998), ('าย', 963), ('ัน', 924), ('อยู่', 910), ('ใจ', 889), ('▁เรา', 889), ('ํา', 876), ('แต่', 859), ('เก', 831), ('จาก', 813), ('ต้อง', 808), ('ตัว', 793), ('สอบ', 789), ('ประ', 761), ('▁1', 752), ('ด้วย', 749), ('้า', 746), ('อยาก', 745), ('อก', 744), ('มัน', 744), ('ัง', 734), ('อน', 730), ('ผม', 729), ('อะไร', 719), ('าน', 709), ('ไม่ได้', 707), ('หา', 690), ('วัน', 689), ('ัก', 678), ('รับ', 678), ('ผู้', 677), ('ใช้', 675), ('ิน', 675), ('▁แ', 669), ('▁พ', 651), ('อย่าง', 646), ('าก', 645), ('▁และ', 642), ('อง', 640), ('ดู', 640), ('▁

In [None]:
print(pam_top_tokens)

[('มา', 3067), ('จะ', 2471), ('ให้', 2381), ('ไม่', 2105), ('ไป', 2046), ('ว่า', 1976), ('▁๏', 1923), ('▁ฯ', 1922), ('พระ', 1705), ('▁จะ', 1685), ('ประ', 1616), ('เป็น', 1587), ('าน', 1534), ('ใจ', 1458), ('ที่', 1392), ('กล', 1368), ('▁พระ', 1326), ('▁เ', 1284), ('าย', 1260), ('่า', 1232), ('ได้', 1151), ('รา', 1142), ('อยู่', 1129), ('▁แล้ว', 1085), ('ก็', 1073), ('ใน', 1069), ('อง', 1054), ('พล', 1029), ('อน', 1017), ('คร', 1000), ('ตาม', 984), ('เจ้า', 975), ('คล', 967), ('กัน', 938), ('ัด', 922), ('เข้า', 917), ('ิน', 913), ('เห็น', 910), ('ทั้ง', 907), ('นาง', 892), ('เก', 875), ('ลา', 852), ('▁นาง', 848), ('้า', 845), ('▁ให้', 843), ('วน', 837), ('ทรง', 836), ('▁ส', 836), ('เร', 831), ('หล', 820), ('▁จึง', 799), ('เส', 789), ('ัน', 783), ('ระ', 780), ('ัก', 775), ('แล้ว', 771), ('กร', 769), ('อด', 767), ('ด้วย', 760), ('คิด', 757), ('เหมือน', 752), ('ับ', 752), ('สม', 745), ('กระ', 734), ('าก', 733), ('ถึง', 732), ('เล', 727), ('ํา', 724), ('ัง', 721), ('สน', 715), ('การ', 713),

## Using tokenizer across domains

One problem you may face is that your dataset is very specialized. In that case, a tokenizer trained on a general domain may not perform as well as it should on your dataset.

Next, you will take a tokenizer trained on a general domain (Pantip) and use it on a specialized domain (พระอภัยมณี), and vice versa.

### Q3 MCV

What percentage increase in token count do you observe when tokenizing the whole พระอภัยมณี dataset with a tokenizer trained on Pantip, compared to the one trained on พระอภัยมณี?

In [None]:
# Tokenize พระอภัยมณี dataset with both tokenizers
pam_tokens_with_pantip_tokenizer = [
    pantip_tokenizer_bpe.encode(text) for text in pam_test_text
]
pam_tokens_with_pam_tokenizer = [
    pam_tokenizer_bpe.encode(text) for text in pam_test_text
]

# Calculate the total token count for each tokenizer
total_tokens_pantip = sum(len(tokens) for tokens in pam_tokens_with_pantip_tokenizer)
total_tokens_pam = sum(len(tokens) for tokens in pam_tokens_with_pam_tokenizer)

In [None]:
# Calculate percentage increase
percentage_increase = ((total_tokens_pantip - total_tokens_pam) / total_tokens_pam) * 100

print(f"Percentage Increase in Token Count: {percentage_increase:.2f}%")


Percentage Increase in Token Count: 28.75%


### Q4 MCV

What percentage increase in token count do you observe when tokenizing the whole Pantip dataset with a tokenizer trained on พระอภัยมณี, compared to the one trained on Pantip?

In [None]:
# Tokenize Pantip dataset with both tokenizers
pantip_tokens_with_pantip_tokenizer = [
    pantip_tokenizer_bpe.encode(text) for text in pantip_test_text
]
pantip_tokens_with_pam_tokenizer = [
    pam_tokenizer_bpe.encode(text) for text in pantip_test_text
]

# Calculate the total token count for each tokenizer
total_tokens_pantip = sum(len(tokens) for tokens in pantip_tokens_with_pantip_tokenizer)
total_tokens_pam = sum(len(tokens) for tokens in pantip_tokens_with_pam_tokenizer)

In [None]:
# Calculate percentage increase
percentage_increase = ((total_tokens_pam - total_tokens_pantip) / total_tokens_pantip) * 100

print(f"Percentage Increase in Token Count: {percentage_increase:.2f}%")


Percentage Increase in Token Count: 3.75%


### To answer
Why do you think the number of tokens tokenized by the general tokenizer (the one trained on Pantip) has a higher percentage increase compared to the number of tokens tokenized by the specialized tokenizer? (Hint: we fixed vocab size.)

Because the vocabulary and the dataset do not match, and we fixed the vocab size.
  - The tokenizer trained on Pantip is tuned to the vocabulary and language patterns found in the Pantip data, which is typically varied and informal.
  - The พระอภัยมณี dataset, however, is specialized data that may contain distinctive vocabulary, sentence structures, and language patterns, such as archaic language or poetry, which the Pantip tokenizer cannot represent efficiently.
  - As a result, the Pantip tokenizer splits these words into more sub-pieces, possibly down to single characters, which increases the number of tokens.
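The effect described above can be reproduced on a toy scale. The sketch below uses a simple greedy longest-match tokenizer with two tiny hand-made vocabularies; it is not the BPE or unigram algorithm used by SentencePiece, and the function names, vocabularies, and sample text are all hypothetical. It only shows that when a fixed vocabulary lacks the pieces a text needs, the text falls apart into many more tokens.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation; unknown characters become single tokens."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Accept the longest known piece, or fall back to one character.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def percentage_increase(baseline, other):
    """Relative growth of `other` over `baseline`, in percent."""
    return (other - baseline) * 100 / baseline


# Hypothetical vocabularies: one matches the text's domain, one does not.
in_domain_vocab = {"king", "queen", "sword", " "}
out_of_domain_vocab = {"phone", "price", "in", "en", " "}
text = "king queen sword"

n_in = len(greedy_tokenize(text, in_domain_vocab))
n_out = len(greedy_tokenize(text, out_of_domain_vocab))
print(n_in, n_out, f"{percentage_increase(n_in, n_out):.2f}%")
```

The in-domain vocabulary covers the text with 5 tokens, while the mismatched one needs 14 because most words decay into single characters.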

## The effect on language models

Next, we will see the effect of using "cross-domain" tokenizers on Language models.

### Setup
We are going to reuse the code from the last assignment

In [None]:
!pip install lightning

Collecting lightning
  Downloading lightning-2.5.0.post0-py3-none-any.whl.metadata (40 kB)
Collecting lightning-utilities<2.0,>=0.10.0 (from lightning)
  Downloading lightning_utilities-0.11.9-py3-none-any.whl.metadata (5.2 kB)
Collecting torchmetrics<3.0,>=0.7.0 (from lightning)
  Downloading torchmetrics-1.6.1-py3-none-any.whl.metadata (21 kB)
Collecting pytorch-lightning (from lightning)
  Downloading pytorch_lightning-2.5.0.post0-py3-none-any.whl.metadata (21 kB)
Downloading lightning-2.5.0.post0-py3-none-any.whl (815 kB)
Downloading lightning_utilities-0.11.9-py3-none-any.whl (28 kB)
Downloading torchmetrics-1.6.1-py3-none-any

In [None]:
import itertools
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import lightning as L
from tqdm import tqdm
import numpy as np

In [None]:
class TextDataset(Dataset):
  def __init__(self, data, tokenizer, seq_len = 128):

    token_ids = [tokenizer.encode(d, add_bos=True, add_eos=True) for d in data]
    flatten_token_ids = list(itertools.chain(*token_ids))
    encoded = torch.LongTensor(flatten_token_ids)

    left_over = len(encoded) % seq_len
    encoded = encoded[:len(encoded)-left_over]
    self.encoded = encoded.view(-1, seq_len)

  def __getitem__(self, idx):
    return self.encoded[idx]

  def __len__(self):
    return len(self.encoded)

In [None]:
class LSTM(L.LightningModule):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, learning_rate, criterion):

        super().__init__()

        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.vocab_size=vocab_size

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                    dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.learning_rate = learning_rate
        self.criterion = criterion

    def forward(self, src):
        # Convert token IDs to embeddings
        embeddings = self.embedding(src)  # Shape: [batch_size, seq_len, embedding_dim]

        # Pass embeddings through the LSTM
        lstm_out, _ = self.lstm(embeddings)  # lstm_out: [batch_size, seq_len, hidden_dim]

        # Apply dropout
        lstm_out = self.dropout(lstm_out)

        # Project the hidden states to vocab size
        output = self.fc(lstm_out)  # Shape: [batch_size, seq_len, vocab_size]

        return output

    def training_step(self, batch, batch_idx):

        src = batch[:, :-1]
        target = batch[:, 1:]
        prediction = self(src)
        prediction = prediction.reshape(-1, self.vocab_size)
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx, dataloader_idx=0):

        src = batch[:, :-1]
        target = batch[:, 1:]
        with torch.no_grad():
          prediction = self(src)
        prediction = prediction.reshape(-1, self.vocab_size)
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [None]:
vocab_size = sp_pam.get_piece_size()
embedding_dim = 200
hidden_dim = 512
num_layers = 3
dropout_rate = 0.2
lr = 1e-3
criterion = nn.CrossEntropyLoss()
train_batch_size = 64
test_batch_size = 128

### Training

<a name="no1"></a>
#### 1. Training on Pantip data with Pantip tokenizer

In [None]:
sp_pantip = pantip_tokenizer_bpe
sp_pam = pam_tokenizer_bpe

In [None]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pantip)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pantip)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pantip)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pantip)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pantip_train_loader)

INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Tot

Training: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=10` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader,pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.11/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.


Testing: |          | 0/? [00:00<?, ?it/s]

Perplexity on Pantip train set is:	68.82131446383693
Perplexity on Pra apai manee train set is:	203.81059852239193
Perplexity on Pantip test set is:	124.02908684976566
Perplexity on Pra apai manee test set is:	206.5725660473081


<a name="no2"></a>
#### 2. Training on Pantip data with Pra apai manee tokenizer

In [None]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pam)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pam)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pam)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pam)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pantip_train_loader)

INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Tot

Training: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=10` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader,pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: |          | 0/? [00:00<?, ?it/s]

Perplexity on Pantip train set is:	29.337146543670855
Perplexity on Pra apai manee train set is:	808.5649434058206
Perplexity on Pantip test set is:	53.525623050762704
Perplexity on Pra apai manee test set is:	777.6412182802582


#### To answer

The perplexity numbers should indicate that:
1. Training the LM with Pra apai manee tokenizer on Pantip (no. [2](#no2)) results in overfitting to Pantip and poor generalization to the Pra apai manee dataset.
2. However using the Pantip tokenizer (no. [1](#no1)) results in a much better generalization.

Try and come up with some reasons for the results above. <br>
Hint:
1. Think about "general" vocabs and domain-specific vocabs.
2. What do you think happens to the model when the token ID sequences become longer?

Because of the difference between general vocabulary and domain-specific vocabulary, and the length of the token ID sequences.

The Pantip tokenizer was trained on a more varied and broader vocabulary, so it can handle the specific vocabulary in the พระอภัยมณี dataset better, whereas the พระอภัยมณี tokenizer focuses on words specific to the classical-literature domain and so cannot handle the varied vocabulary in the Pantip dataset.

When a domain-specific tokenizer such as the พระอภัยมณี one is used on the Pantip dataset, unfamiliar words may have to be split into many small tokens, which makes the token ID sequences longer.
Longer token ID sequences may make learning harder for the model, because it has to process fragmented information whose relationships are hard to capture.
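One caveat when reading these numbers: the notebook reports perplexity as `exp` of the mean per-token cross-entropy, so the value depends on how many tokens the tokenizer produces for the same text. The sketch below shows that relation and one possible way to normalize by characters instead of tokens. The per-character normalization is a suggestion, not part of the original assignment, and the function names and loss values are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Per-token perplexity: exp of the mean negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def per_char_perplexity(token_nlls, num_chars):
    """Perplexity normalized by the character count of the underlying text."""
    return math.exp(sum(token_nlls) / num_chars)


# Sanity check: a model that is uniform over 1000 pieces has perplexity 1000.
uniform_nlls = [math.log(1000)] * 4
print(perplexity(uniform_nlls))

# The same 12-character text under two hypothetical tokenizers:
# A yields 4 tokens with loss 3.0 each, B yields 6 tokens with loss 2.0 each.
nlls_a = [3.0] * 4
nlls_b = [2.0] * 6
print(perplexity(nlls_a), perplexity(nlls_b))
print(per_char_perplexity(nlls_a, 12), per_char_perplexity(nlls_b, 12))
```

Both hypothetical models assign the same total probability to the text, yet their per-token perplexities differ, so comparisons across runs that use different tokenizers should be made with some care.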


<a name="no3"></a>
#### 3. Training on Pra apai manee data with Pantip tokenizer


In [None]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

pantip_train_dataset = TextDataset(pantip_train_text, sp_pantip)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pantip)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pantip)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pantip)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

trainer.fit(model, train_dataloaders=pam_train_loader)

INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Tot

Training: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=10` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader,pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Perplexity on Pantip train set is:	6520.157483850194
Perplexity on Pra apai manee train set is:	40.525696117798105
Perplexity on Pantip test set is:	5597.954359431243
Perplexity on Pra apai manee test set is:	45.73927222308066
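
The perplexities above are the exponential of the loss reported by `trainer.test`. Assuming `test_loss` is the mean per-token cross-entropy in nats (the default behaviour of `CrossEntropyLoss`), the relationship can be sketched with made-up probabilities rather than the lab's model. The helper name `perplexity_from_probs` is hypothetical.

```python
import math

def perplexity_from_probs(target_probs):
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    nll = [-math.log(p) for p in target_probs]
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigns to the correct next token.
uniform = [1 / 4] * 10          # always guessing among 4 equally likely tokens
confident = [0.9, 0.8, 0.95]    # mostly confident, correct predictions

print(perplexity_from_probs(uniform))    # close to 4: like choosing among 4 tokens
print(perplexity_from_probs(confident))  # slightly above 1
```

Read this way, a perplexity of about 45 means the model is roughly as uncertain as a uniform choice among 45 tokens at each step.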


<a name="no4"></a>
#### 4. Training on Pra apai manee data with Pra apai manee tokenizer




In [None]:
trainer = L.Trainer(
    max_epochs=10,
    deterministic=True
)
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

# Re-tokenize every split with the Pra apai manee tokenizer (sp_pam).
pantip_train_dataset = TextDataset(pantip_train_text, sp_pam)
pantip_train_loader = DataLoader(pantip_train_dataset, batch_size = train_batch_size, shuffle = True)

pantip_test_dataset = TextDataset(pantip_test_text, sp_pam)
pantip_test_loader = DataLoader(pantip_test_dataset, batch_size = test_batch_size, shuffle = False)

pam_train_dataset = TextDataset(pam_train_text, sp_pam)
pam_train_loader = DataLoader(pam_train_dataset, batch_size = train_batch_size, shuffle = True)

pam_test_dataset = TextDataset(pam_test_text, sp_pam)
pam_test_loader = DataLoader(pam_test_dataset, batch_size = test_batch_size, shuffle = False)

# Train on Pra apai manee only; the Pantip loaders are used for evaluation below.
trainer.fit(model, train_dataloaders=pam_train_loader)

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | Embedding        | 200 K  | train
1 | lstm      | LSTM             | 5.7 M  | train
2 | dropout   | Dropout          | 0      | train
3 | fc        | Linear           | 513 K  | train
4 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Tot

INFO: `Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
# Evaluate on all four splits; each result dict holds the logged `test_loss` for one dataloader.
test_result = trainer.test(model, dataloaders=[pantip_train_loader, pam_train_loader, pantip_test_loader, pam_test_loader], verbose=False)

print(f"Perplexity on Pantip train set is:\t{np.exp(test_result[0]['test_loss/dataloader_idx_0'])}")
print(f"Perplexity on Pra apai manee train set is:\t{np.exp(test_result[1]['test_loss/dataloader_idx_1'])}")
print(f"Perplexity on Pantip test set is:\t{np.exp(test_result[2]['test_loss/dataloader_idx_2'])}")
print(f"Perplexity on Pra apai manee test set is:\t{np.exp(test_result[3]['test_loss/dataloader_idx_3'])}")

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Perplexity on Pantip train set is:	955.0051635110201
Perplexity on Pra apai manee train set is:	93.53121856971529
Perplexity on Pantip test set is:	938.8484614262105
Perplexity on Pra apai manee test set is:	112.77532744585123


#### To answer

The perplexity numbers should indicate that:
1. Both LMs overfit on Pra apai manee data and perform very poorly on Pantip data.
2. However, using the Pra apai manee tokenizer (no. [4](#no4)) results in better generalization than using the Pantip tokenizer (no. [3](#no3)).

Try to come up with some reasons for the results above. <br>

1. Vocabulary fit within a specific domain

  The Pra apai manee tokenizer was trained on domain-specific vocabulary, so its subword units are finer-grained and better suited to literary language, whereas the Pantip tokenizer may not handle the same structures and vocabulary as well.
  Using the Pra apai manee tokenizer therefore helps the LM capture the meaning and context of the Pra apai manee data more accurately.

2. Vocabulary relationship between the two datasets

  Although the Pantip and Pra apai manee datasets do not share all of their vocabulary, the Pra apai manee tokenizer may capture the words or structures the two have in common better than the Pantip tokenizer does.
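
One way to probe these explanations is to compare how each tokenizer segments the same text: how many tokens it produces per character, and how often it falls back to the unknown token. The sketch below is not part of the lab. `token_stats` is a hypothetical helper, and it accepts any `encode` callable so it can be tried without a trained model. With the lab's objects one would presumably pass `sp_pantip.encode` and `sp_pantip.unk_id()`, or the `sp_pam` equivalents.

```python
def token_stats(texts, encode, unk_id):
    """Return (tokens per character, fraction of unknown tokens) over texts."""
    n_chars = n_tokens = n_unk = 0
    for text in texts:
        ids = encode(text)
        n_chars += len(text)
        n_tokens += len(ids)
        n_unk += sum(1 for i in ids if i == unk_id)
    # max(..., 1) avoids division by zero on empty input.
    return n_tokens / max(n_chars, 1), n_unk / max(n_tokens, 1)

# Toy stand-in for a tokenizer: one id per character, 0 for anything unseen.
toy_vocab = {"a": 1, "b": 2}

def toy_encode(text):
    return [toy_vocab.get(ch, 0) for ch in text]

tokens_per_char, unk_rate = token_stats(["abab", "abc"], toy_encode, unk_id=0)
print(tokens_per_char, unk_rate)
```

Because per-token perplexity depends on how many tokens a text is split into, the numbers from no. 3 and no. 4 are not strictly comparable. Dividing the total loss by the number of characters instead of the number of tokens is one way to put them on the same scale.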