<a href="https://colab.research.google.com/github/JeansAthiwat/NLP_NoScope/blob/main/codes/L4_Token_Classification/HW_4_POS_Tagging_with_HuggingFace_for_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW 4 - POS Tagging with Hugging Face

In this exercise, you will create a part-of-speech (POS) tagging system for Thai text using NECTEC’s ORCHID corpus. Instead of building your own deep learning architecture from scratch, you will leverage a pretrained tokenizer and a pretrained token classification model from Hugging Face.

We have provided some starter code for data cleaning and preprocessing in this notebook, but feel free to modify those parts to suit your needs. You are welcome to use additional libraries (e.g., scikit-learn) as long as you incorporate the pretrained Hugging Face model. Specifically, you will need to:

1. Load a pretrained tokenizer and token classification model.
2. Fine-tune it on the ORCHID corpus for POS tagging.
3. Evaluate and report the performance of your model on the test data.

### Don't forget to change hardware accelrator to GPU in runtime on Google Colab ###

## 1. Setup and Preprocessing

In [1]:
# Install transformers and thai2transformers
!pip install wandb
!pip install -q transformers==4.30.1 datasets evaluate thaixtransformers
!pip install -q emoji pythainlp sefr_cut tinydb seqeval sentencepiece pydantic jsonlines
!pip install peft==0.10.0

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m113.6/113.6 kB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m25.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.9/17.9 MB[0m [31m26.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Setup

1. Register [Wandb account](https://wandb.ai/login?signup=true) (and confirm your email)

2. `wandb login` and copy paste the API key when prompt

In [2]:
!wandb login

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjeansathiwat[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [3]:
import wandb

We encourage you to login to your `Hugging Face` account so you can upload and share your model with the community. When prompted, enter your token to login

In [4]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Download the dataset from Hugging Face

In [5]:
from datasets import load_dataset

orchid = load_dataset("Thichow/orchid_corpus")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

orchid_corpus.py:   0%|          | 0.00/7.91k [00:00<?, ?B/s]

The repository for Thichow/orchid_corpus contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/Thichow/orchid_corpus.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/5.24M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [6]:
orchid

DatasetDict({
    train: Dataset({
        features: ['id', 'label_tokens', 'pos_tags', 'sentence'],
        num_rows: 18500
    })
    test: Dataset({
        features: ['id', 'label_tokens', 'pos_tags', 'sentence'],
        num_rows: 4625
    })
})

In [7]:
orchid['train'][0]

{'id': '0',
 'label_tokens': ['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1'],
 'pos_tags': [21, 39, 26, 26, 37, 4, 18],
 'sentence': 'การประชุมทางวิชาการ ครั้งที่ 1'}

In [8]:
orchid['train'][0]["sentence"]

'การประชุมทางวิชาการ ครั้งที่ 1'

In [9]:
''.join(orchid['train'][0]['label_tokens'])

'การประชุมทางวิชาการ ครั้งที่ 1'

In [10]:
label_list = orchid["train"].features[f"pos_tags"].feature.names
print('total type of pos_tags :', len(label_list))
print(label_list)

total type of pos_tags : 47
['ADVI', 'ADVN', 'ADVP', 'ADVS', 'CFQC', 'CLTV', 'CMTR', 'CMTR@PUNC', 'CNIT', 'CVBL', 'DCNM', 'DDAC', 'DDAN', 'DDAQ', 'DDBQ', 'DIAC', 'DIAQ', 'DIBQ', 'DONM', 'EAFF', 'EITT', 'FIXN', 'FIXV', 'JCMP', 'JCRG', 'JSBR', 'NCMN', 'NCNM', 'NEG', 'NLBL', 'NONM', 'NPRP', 'NTTL', 'PDMN', 'PNTR', 'PPRS', 'PREL', 'PUNC', 'RPRE', 'VACT', 'VATT', 'VSTA', 'XVAE', 'XVAM', 'XVBB', 'XVBM', 'XVMM']


In [11]:
import numpy as np
import numpy.random
import torch

from tqdm.auto import tqdm
from functools import partial

#transformers
from transformers import (
    CamembertTokenizer,
    AutoTokenizer,
    AutoModel,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)

#thaixtransformers
from thaixtransformers import Tokenizer
from thaixtransformers.preprocess import process_transformers

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

Next, we load a pretrained tokenizer from Hugging Face. In this work, we utilize WangchanBERTa, a Thai-specific pretrained model, as the tokenizer.

# Choose Pretrained Model

In this notebook, you can choose from 5 versions of WangchanBERTa, XLMR and mBERT to perform downstream tasks on Thai datasets. The datasets are:

* `wangchanberta-base-att-spm-uncased` (recommended) - Largest WangchanBERTa trained on 78.5GB of Assorted Thai Texts with subword tokenizer SentencePiece
* `xlm-roberta-base` - Facebook's [XLMR](https://arxiv.org/abs/1911.02116) trained on 100 languages
* `bert-base-multilingual-cased` - Google's [mBERT](https://arxiv.org/abs/1911.03310) trained on 104 languages
* `wangchanberta-base-wiki-newmm` - WangchanBERTa trained on Thai Wikipedia Dump with PyThaiNLP's word-level tokenizer  `newmm`
* `wangchanberta-base-wiki-syllable` - WangchanBERTa trained on Thai Wikipedia Dump with PyThaiNLP's syllabel-level tokenizer `syllable`
* `wangchanberta-base-wiki-sefr` - WangchanBERTa trained on Thai Wikipedia Dump with word-level tokenizer  `SEFR`
* `wangchanberta-base-wiki-spm` - WangchanBERTa trained on Thai Wikipedia Dump with subword-level tokenizer SentencePiece

In the first part, we require you to select the wangchanberta-base-att-spm-uncased.

<b> Learn more about using wangchanberta at [wangchanberta_getting_started_ai_reseach](https://colab.research.google.com/github/PyThaiNLP/thaixtransformers/blob/main/notebooks/wangchanberta_getting_started_aireseach.ipynb?fbclid=IwY2xjawH61XZleHRuA2FlbQIxMAABHZUaAmHobzmCMHpX0EgdLdjDAEwSX0bjqpo5xPUSIx9b4O_dsIvvG8KVNA_aem_IyKkvzy-VPf9k2pYAFf6Nw#scrollTo=n5IaCot9b3cF) <b>



*   You need to set the transformers version to transformers==4.30.1.



`In the first part, we require you to select the wangchanberta-base-att-spm-uncased.`

In [None]:
model_names = [
    'airesearch/wangchanberta-base-att-spm-uncased',
    'airesearch/wangchanberta-base-wiki-newmm',
    'airesearch/wangchanberta-base-wiki-ssg',
    'airesearch/wangchanberta-base-wiki-sefr',
    'airesearch/wangchanberta-base-wiki-spm',
]

#@title Choose Pretrained Model
model_name = "airesearch/wangchanberta-base-att-spm-uncased"

#create tokenizer
tokenizer = Tokenizer(model_name).from_pretrained(
                f'{model_name}',
                revision='main',
                model_max_length=416,)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CamembertTokenizer'. 
The class this function is called from is 'WangchanbertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CamembertTokenizer'. 
The class this function is called from is 'WangchanbertaTokenizer'.


Let's try using a pretrained tokenizer.

In [None]:
text = 'ศิลปะไม่เป็นเจ้านายใคร และไม่เป็นขี้ข้าใคร'
print('text :', text)
tokens = []
for i in tokenizer([text], is_split_into_words=True)['input_ids']:
  tokens.append(tokenizer.decode(i))
print('tokens :', tokens)

text : ศิลปะไม่เป็นเจ้านายใคร และไม่เป็นขี้ข้าใคร
tokens : ['<s>', '', 'ศิลปะ', 'ไม่เป็น', 'เจ้านาย', 'ใคร', '<_>', 'และ', 'ไม่เป็น', 'ขี้ข้า', 'ใคร', '</s>']


model : * `wangchanberta-base-att-spm-uncased`

First, we print examples of label tokens from our dataset for inspection.

In [None]:
example = orchid["train"][0]
for i in example :
    print(i, ':', example[i])

id : 0
label_tokens : ['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1']
pos_tags : [21, 39, 26, 26, 37, 4, 18]
sentence : การประชุมทางวิชาการ ครั้งที่ 1


Then, we use the sentence 'การประชุมทางวิชาการ<space>ครั้งที่ 1' to be tokenized by the pretrained tokenizer model.

In [None]:
text = 'การประชุมทางวิชาการ ครั้งที่ 1'
tokenizer(text)

{'input_ids': [5, 10, 882, 8222, 8, 10, 1014, 8, 10, 59, 6], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

These are already mapped into discrete values. We can uncover the original token text from the tokens by.

In [None]:
for i in tokenizer(text)['input_ids']:
  print(tokenizer.convert_ids_to_tokens(i))

<s>
▁
การประชุม
ทางวิชาการ
<_>
▁
ครั้งที่
<_>
▁
1
</s>


Now let's look at another example.

In [None]:
example = orchid["train"][1899]
print('sentence :', example["sentence"])
tokenized_input = tokenizer([example["sentence"]], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print('tokens :',tokens)
print('label tokens :', example["label_tokens"])
print('label pos :', example["pos_tags"])

sentence : โดยพิจารณาจากพจนานุกรมภาษาคู่ (Bilingual transfer dictionary)
tokens : ['<s>', '▁โดย', 'พิจารณาจาก', 'พจนานุกรม', 'ภาษา', 'คู่', '<_>', '▁(', '<unk>', 'i', 'ling', 'ual', '<_>', '▁', 'trans', 'fer', '<_>', '▁', 'di', 'ction', 'ary', ')', '</s>']
label tokens : ['โดย', 'พิจารณา', 'จาก', 'พจนานุกรม', 'ภาษา', 'คู่', ' ', '(', 'Bilingual transfer dictionary', ')']
label pos : [25, 39, 38, 26, 26, 5, 37, 37, 26, 37]


Notice how `B` becomes an ``<unk>`` token. This is because this is an uncased model, meaning it only handles small English characters.

# #TODO 0

Convert the dataset to lowercase.

In [12]:
# Create a lowercase dataset for uncased BERT
def lower_case_sentences(examples):
  lower_cased_examples = examples

  # fill code here to lower case the "sentence" and "label_tokens"
  lower_cased_examples["sentence"] = examples["sentence"].lower()
  lower_cased_examples["label_tokens"] = [word.lower() for word in examples["label_tokens"]]
  return lower_cased_examples

In [13]:
orchidl = orchid.map(lower_case_sentences)

Map:   0%|          | 0/18500 [00:00<?, ? examples/s]

Map:   0%|          | 0/4625 [00:00<?, ? examples/s]

In [14]:
orchidl

DatasetDict({
    train: Dataset({
        features: ['id', 'label_tokens', 'pos_tags', 'sentence'],
        num_rows: 18500
    })
    test: Dataset({
        features: ['id', 'label_tokens', 'pos_tags', 'sentence'],
        num_rows: 4625
    })
})

In [None]:
orchidl["train"][1899]['pos_tags']

[25, 39, 38, 26, 26, 5, 37, 37, 26, 37]

Now let's examine the labels again.

In [None]:
example = orchidl["train"][1899]
print('sentence :', example["sentence"])
tokenized_input = tokenizer([example["sentence"]], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print('tokens :',tokens)
print('label tokens :', example["label_tokens"])
print('label pos :', example["pos_tags"])

sentence : โดยพิจารณาจากพจนานุกรมภาษาคู่ (bilingual transfer dictionary)
tokens : ['<s>', '▁โดย', 'พิจารณาจาก', 'พจนานุกรม', 'ภาษา', 'คู่', '<_>', '▁(', 'bi', 'ling', 'ual', '<_>', '▁', 'trans', 'fer', '<_>', '▁', 'di', 'ction', 'ary', ')', '</s>']
label tokens : ['โดย', 'พิจารณา', 'จาก', 'พจนานุกรม', 'ภาษา', 'คู่', ' ', '(', 'bilingual transfer dictionary', ')']
label pos : [25, 39, 38, 26, 26, 5, 37, 37, 26, 37]


In [None]:
example = orchidl["train"][0]
print('sentence :', example["sentence"])
tokenized_input = tokenizer([example["sentence"]], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print('tokens :',tokens)
print('label tokens :', example["label_tokens"])
print('label pos :', example["pos_tags"])

sentence : การประชุมทางวิชาการ ครั้งที่ 1
tokens : ['<s>', '▁', 'การประชุม', 'ทางวิชาการ', '<_>', '▁', 'ครั้งที่', '<_>', '▁', '1', '</s>']
label tokens : ['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1']
label pos : [21, 39, 26, 26, 37, 4, 18]


In the example above, tokens refer to those tokenized using the pretrained tokenizer, while label tokens refer to tokens tokenized from our dataset.

**Do you see something?**

Yes, the tokens from the two tokenizers do not match.

- sentence : `การประชุมทางวิชาการ ครั้งที่ 1`

---

- tokens : `['<s>', '▁', 'การประชุม', 'ทางวิชาการ', '<_>', '▁', 'ครั้งที่', '<_>', '▁', '1', '</s>']`


---


- label tokens : `['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1']`
- label pos : `[21, 39, 26, 26, 37, 4, 18]`

You can see that in our label tokens, 'การ' has a POS tag of 21, and 'ประชุม' has a POS tag of 39. However, when we tokenize the sentence using WangchanBERTa, we get the token 'การประชุม'. What POS tag should we assign to this new token?

**What should we do ?**

Based on this example, we found that the tokens from the WangchanBERTa do not directly align with our label tokens. This means we cannot directly use the label POS tags. Therefore, we need to reassign POS tags to the tokens produced by WangchanBERTa tokenization. The method we will use is majority voting:
- If a token from the WangchanBERTa matches a label token exactly, we will directly assign the POS tag from the label POS.
- If the token generated overlaps or combines multiple label tokens, we assign the POS tag based on the number of characters in each token: If the token contains the most characters from any label token, we assign the POS tag from that label token.

**Example :**

    # "การประชุม" (9 chars) is formed from "การ" (3 chars) + "ประชุม" (6 chars).
    # "การ" has a POS tag of 21,
    # and "ประชุม" has a POS tag of 39.
    # Therefore, the POS tag for "การประชุม" is 39,
    # as "การประชุม" is derived more from the "ประชุม" part than from the "การ" part.

    # 'ทางวิชาการ' (10 chars) is formed from 'ทาง' (3 chars) + 'วิชาการ' (7 chars)
    # "ทาง" has a POS tag of 26,
    # and "วิชาการ" has a POS tag of 2.
    # Therefore, the POS tag for "ทางวิชาการ" is 2,
    # as "ทางวิชาการ" is derived more from the "ทาง" part than from the "วิชาการ" part.

# #TODO 1

`**Warning: Please be careful of <unk>, an unknown word token.**`

`**Warning: Please be careful of " ำ ", the 'am' vowel. WangchanBERTa's internal preprocessing replaces all " ำ " to 'ํ' and 'า'**`

Assigning the label -100 to the special tokens `[<s>]` and `[</s>]` and `[_]`  so they’re ignored by the PyTorch loss function (see [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html): ignore_index)

In [None]:
# test = []
# for i in tokenizer(['สวัสดีทุกท่านทุกคนในวันนี้เรามาตกปลาHหมึกยามค่ำข้างๆลิงสีดำ'], is_split_into_words=True)['input_ids']:
#   test.append(tokenizer.decode(i))
# print('tokens :', test)

# test per character print of 'am' ำ
for char in 'ำ':
  print(char)

print(len(tokenizer(["สวัสดีทุกท่านทุกคนในวันนี้เรามาตกปลาหมึกยามค่ำข้างๆลิงสีดำ"], is_split_into_words=True)['input_ids']))
print(len(tokenizer.convert_ids_to_tokens(tokenizer(["สวัสดีทุกท่านทุกคนในวันนี้เรามาตกปลาหมึกยามค่ำข้างๆลิงสีดำ"], is_split_into_words=True)['input_ids'])))
print(tokenizer.convert_ids_to_tokens(tokenizer(["ำ"], is_split_into_words=True)['input_ids']))
print(tokenizer(["ำ"], is_split_into_words=True)['input_ids'])

ำ
14
14
['<s>', '▁', 'ํา', '</s>']
[5, 10, 4556, 6]


In [None]:
def majority_vote_pos(examples):
    tokenized_inputs = tokenizer([examples["sentence"]], is_split_into_words=True)
    new_tokens = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"])

    label_tokens = examples["label_tokens"]
    label_pos_tags = examples["pos_tags"]
    # tokenized_inputs["chs"] = []
    # tokenized_inputs["pos"]= []
    def normalize_am(char):
        return 'ํา' if char == 'ำ' else char

    char_list = []
    pos_list = []
    for word_i, word in enumerate(label_tokens):
        for c in word:
            normed = normalize_am(c)
            if normed == 'ํา':
                char_list.append('ํ')
                pos_list.append(label_pos_tags[word_i])
                char_list.append('า')
                pos_list.append(label_pos_tags[word_i])
            else:
                char_list.append(normed)
                pos_list.append(label_pos_tags[word_i])

    new_labels = []
    char_ptr = 0  # pointer to char_list / pos_list

    new_labels = []
    char_ptr = 0  # pointer to char_list / pos_list
    tok_index = 0
    new_tokens = [' ' if x == '<_>' else x.replace('▁', '') for x in new_tokens]
    while tok_index < len(new_tokens):
        tok = new_tokens[tok_index]
        if tok in ['<s>', '</s>', '▁']:
            new_labels.append(-100)
            tok_index += 1
            continue

        elif tok == '<unk>':
            pos_count_inner = {}
            # while char_ptr < len(char_list):
            next_token = new_tokens[tok_index+1]
            next_char = new_tokens[tok_index+1][0]
            print(next_token)
            print(next_char)
            while not (  char_list[char_ptr] == next_token or char_list[char_ptr] == next_char):
                vote_pos = pos_list[char_ptr]
                pos_count_inner[vote_pos] = pos_count_inner.get(vote_pos, 0) + 1
                char_ptr += 1

            if len(pos_count_inner) == 0:
                new_labels.append(-88)
                assert False, "error"
            else:
                majority_pos = max(pos_count_inner, key=pos_count_inner.get)
                new_labels.append(majority_pos)
                tok_index += 1
            continue


        token_str = tok

        pos_count = {}
        for ch in token_str:
            if char_ptr >= len(char_list):
                char_ptr += 1
                break
            if ch == char_list[char_ptr]:
                vote_pos = pos_list[char_ptr]
                pos_count[vote_pos] = pos_count.get(vote_pos, 0) + 1
                char_ptr += 1
            else:
                char_ptr += 1

            # tokenized_inputs.setdefault("chs", []).append(ch)
            # tokenized_inputs.setdefault("pos", []).append(vote_pos)

        if len(pos_count) == 0:
            new_labels.append(-100)
            tok_index += 1
        else:
            majority_pos = max(pos_count, key=pos_count.get)
            new_labels.append(majority_pos)
            tok_index += 1



    tokenized_inputs["tokens"] = new_tokens
    tokenized_inputs["labels"] = new_labels
    # tokenized_inputs["chars"] = char_list
    # tokenized_inputs["pos_tags"] = pos_list
    # tokenized_inputs["chars_len"] = len(char_list)

    return tokenized_inputs

In [None]:
tokenized_orchid = orchidl.map(majority_vote_pos)

In [None]:
tokenized_orchid
print(tokenized_orchid['train'][0]['tokens'])
print(tokenized_orchid['train'][0]['labels'])
# i = 0
# for sentence in tokenized_orchid['train']:
#   print(sentence['tokens'])
#   print(sentence['labels'])
#   if i == 500:
#     break
#   i += 1

['<s>', '', 'การประชุม', 'ทางวิชาการ', ' ', '', 'ครั้งที่', ' ', '', '1', '</s>']
[-100, -100, 39, 26, 37, -100, 4, 18, -100, 18, -100]


In [None]:
tokenized_orchid['train'][0]

{'id': '0',
 'label_tokens': ['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1'],
 'pos_tags': [21, 39, 26, 26, 37, 4, 18],
 'sentence': 'การประชุมทางวิชาการ ครั้งที่ 1',
 'input_ids': [5, 10, 882, 8222, 8, 10, 1014, 8, 10, 59, 6],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'tokens': ['<s>',
  '',
  'การประชุม',
  'ทางวิชาการ',
  ' ',
  '',
  'ครั้งที่',
  ' ',
  '',
  '1',
  '</s>'],
 'labels': [-100, -100, 39, 26, 37, -100, 4, 18, -100, 18, -100]}

In [None]:
example = tokenized_orchid["train"][0]
for i in example :
    print(i, ":", example[i])

id : 0
label_tokens : ['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1']
pos_tags : [21, 39, 26, 26, 37, 4, 18]
sentence : การประชุมทางวิชาการ ครั้งที่ 1
input_ids : [5, 10, 882, 8222, 8, 10, 1014, 8, 10, 59, 6]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
tokens : ['<s>', '', 'การประชุม', 'ทางวิชาการ', ' ', '', 'ครั้งที่', ' ', '', '1', '</s>']
labels : [-100, -100, 39, 26, 37, -100, 4, 18, -100, 18, -100]


This is the result after we realigned the POS based on the majority vote.
- label_tokens : `['การ', 'ประชุม', 'ทาง', 'วิชาการ', ' ', 'ครั้ง', 'ที่ 1']`
- pos_tags : `[21, 39, 26, 26, 37, 4, 18]`
- tokens : `['<s>', '▁', 'การประชุม', 'ทางวิชาการ', '<_>', '▁', 'ครั้งที่', '<_>', '▁', '1', '</s>']`
- labels : `[-100, -100, 39, 26, 37, -100, 4, 18, -100, 18, -100]`

`['<s>', '▁', '</s>'] : -100`

**Check :**

> "การประชุม" (9 chars) is formed from "การ" (3 chars) + "ประชุม" (6 chars).


> "การ" has a POS tag of 21,

> and "ประชุม" has a POS tag of 39.

> Therefore, the POS tag for "การประชุม" is 39,

> as "การประชุม" is derived more from the "ประชุม" part than from the "การ" part.





In [None]:
# hard test case
example = tokenized_orchid["train"][1899]
for i in example :
    print(i, ":", example[i])

id : 1899
label_tokens : ['โดย', 'พิจารณา', 'จาก', 'พจนานุกรม', 'ภาษา', 'คู่', ' ', '(', 'bilingual transfer dictionary', ')']
pos_tags : [25, 39, 38, 26, 26, 5, 37, 37, 26, 37]
sentence : โดยพิจารณาจากพจนานุกรมภาษาคู่ (bilingual transfer dictionary)
input_ids : [5, 489, 15617, 19737, 958, 493, 8, 1241, 4906, 11608, 12177, 8, 10, 11392, 9806, 8, 10, 2951, 15779, 8001, 29, 6]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
tokens : ['<s>', 'โดย', 'พิจารณาจาก', 'พจนานุกรม', 'ภาษา', 'คู่', ' ', '(', 'bi', 'ling', 'ual', ' ', '', 'trans', 'fer', ' ', '', 'di', 'ction', 'ary', ')', '</s>']
labels : [-100, 25, 39, 26, 26, 5, 37, 37, 26, 26, 26, 26, -100, 26, 26, 26, -100, 26, 26, 26, 37, -100]


Expected output


```
id : 1899
label_tokens : ['โดย', 'พิจารณา', 'จาก', 'พจนานุกรม', 'ภาษา', 'คู่', ' ', '(', 'bilingual transfer dictionary', ')']
pos_tags : [25, 39, 38, 26, 26, 5, 37, 37, 26, 37]
sentence : โดยพิจารณาจากพจนานุกรมภาษาคู่ (bilingual transfer dictionary)
input_ids : [5, 489, 15617, 19737, 958, 493, 8, 1241, 4906, 11608, 12177, 8, 10, 11392, 9806, 8, 10, 2951, 15779, 8001, 29, 6]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
tokens : ['<s>', '▁โดย', 'พิจารณาจาก', 'พจนานุกรม', 'ภาษา', 'คู่', '<_>', '▁(', 'bi', 'ling', 'ual', '<_>', '▁', 'trans', 'fer', '<_>', '▁', 'di', 'ction', 'ary', ')', '</s>']
labels : [-100, 25, 39, 26, 26, 5, 37, 37, 26, 26, 26, 26, -100, 26, 26, 26, -100, 26, 26, 26, 37, -100]
```

### MY TEST

In [None]:
print(tokenized_orchid["train"]['tokens'][0][0])
for example in tokenized_orchid["train"]:
    for token in example['tokens']:
      # print(token)
      if not isinstance(token, str):
          print(token)
    for label in example['labels']:
        # print(label)
        if not isinstance(label, int):
            print(label)
        # else:
        #     print("sex: ",label)
    for label in example['labels']:
        if not isinstance(label, int):
            print(label)
    if example["labels"] is None:
        print("Found None labels:", example)
    elif len(example["tokens"]) != len(example["labels"]):
        print(f"Mismatch: Tokens ({len(example['tokens'])}) vs Labels ({len(example['labels'])})")
    elif len(example["tokens"]) != len(example["labels"]):
        print(f"Mismatch: Tokens ({len(example['tokens'])}) vs Labels ({len(example['labels'])})")


<s>


# Train and Evaluate model

We will create a batch of examples using [DataCollatorWithPadding.](https://huggingface.co/docs/transformers/v4.48.0/en/main_classes/data_collator#transformers.DataCollatorWithPadding)  

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

DataCollatorWithPadding will help us pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. This allows for efficient computation during each batch.

*   DataCollatorForTokenClassification : `padding (bool, str or PaddingStrategy, optional, defaults to True)`
*   `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).



In [15]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

For evaluating your model’s performance. You can quickly load a evaluation method with the [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) framework (see the Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric). Seqeval actually produces several scores: precision, recall, F1, and accuracy.

In [16]:
import evaluate

seqeval = evaluate.load("seqeval")

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

Huggingface requires us to write a ``compute_metrics()`` function. This will be invoked when huggingface evalutes a model.

Note that we ignore to evaluate on -100 labels.

In [17]:
import numpy as np
import warnings


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")
        results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

The total number of labels in our POS tag set.

In [18]:
id2label = {
    0: 'ADVI',
    1: 'ADVN',
    2: 'ADVP',
    3: 'ADVS',
    4: 'CFQC',
    5: 'CLTV',
    6: 'CMTR',
    7: 'CMTR@PUNC',
    8: 'CNIT',
    9: 'CVBL',
    10: 'DCNM',
    11: 'DDAC',
    12: 'DDAN',
    13: 'DDAQ',
    14: 'DDBQ',
    15: 'DIAC',
    16: 'DIAQ',
    17: 'DIBQ',
    18: 'DONM',
    19: 'EAFF',
    20: 'EITT',
    21: 'FIXN',
    22: 'FIXV',
    23: 'JCMP',
    24: 'JCRG',
    25: 'JSBR',
    26: 'NCMN',
    27: 'NCNM',
    28: 'NEG',
    29: 'NLBL',
    30: 'NONM',
    31: 'NPRP',
    32: 'NTTL',
    33: 'PDMN',
    34: 'PNTR',
    35: 'PPRS',
    36: 'PREL',
    37: 'PUNC',
    38: 'RPRE',
    39: 'VACT',
    40: 'VATT',
    41: 'VSTA',
    42: 'XVAE',
    43: 'XVAM',
    44: 'XVBB',
    45: 'XVBM',
    46: 'XVMM',
    # 47: 'O'
}
label2id = {}
for k, v in id2label.items() :
    label2id[v] = k

# label2id

In [19]:
labels = [i for i in id2label.values()]
# labels

## Load pretrained model

Select a pretrained model for fine-tuning to develop a POS Tagger model using the Orchid corpus dataset.



*   model : `wangchanberta-base-att-spm-uncased`
*   Don't forget to update the num_labels.

You’re ready to start training your model now! Load pretrained model with AutoModelForTokenClassification along with the number of expected labels, and the label mappings:




`In the first part, we require you to select the wangchanberta-base-att-spm-uncased.`

In [None]:
model_names = [
    'wangchanberta-base-att-spm-uncased',
    'wangchanberta-base-wiki-newmm',
    'wangchanberta-base-wiki-ssg',
    'wangchanberta-base-wiki-sefr',
    'wangchanberta-base-wiki-spm',
]

#@title Choose Pretrained Model
model_name = "wangchanberta-base-att-spm-uncased"

#create model
model = AutoModelForTokenClassification.from_pretrained(
    f"airesearch/{model_name}",
    revision='main',
    num_labels=47, id2label=id2label, label2id=label2id
)




model.safetensors:   0%|          | 0.00/423M [00:00<?, ?B/s]

Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertForTokenClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing CamembertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably T

### #TODO 2

* Configure your training hyperparameters using `**TrainingArguments**`. The only required parameter is is `output_dir`, which determines the directory where your model will be saved. To upload the model to the Hugging Face Hub, set push_to_hub=True (note: you must be logged into Hugging Face for this). During training, the Trainer will compute seqeval metrics at the end of each epoch and store the training checkpoint.
* Provide the `**Trainer**` with the training arguments, as well as the model, dataset, tokenizer, data collator, and compute_metrics function.
* Use `**train()**` to fine-tune the model.


Read [huggingface's tutorial](https://huggingface.co/docs/transformers/en/tasks/token_classification) for more details.

In [None]:
training_args = TrainingArguments(
    output_dir="pos_tagger_model_att_spm_uncased",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_orchid["train"],
    eval_dataset=tokenized_orchid["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/content/pos_tagger_model_att_spm_uncased is already a clone of https://huggingface.co/JeansAthiwat/pos_tagger_model_att_spm_uncased. Make sure you pull the latest changes with `repo.git_pull()`.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.3153,0.295221,0.862339,0.875845,0.86904,0.916193
2,0.2377,0.271533,0.870161,0.881117,0.875605,0.920963
3,0.2031,0.272307,0.870657,0.880847,0.875722,0.920873


  state_dict = torch.load(best_model_path, map_location="cpu")


TrainOutput(global_step=6939, training_loss=0.2804172888048136, metrics={'train_runtime': 946.1523, 'train_samples_per_second': 58.659, 'train_steps_per_second': 7.334, 'total_flos': 1264810975026288.0, 'train_loss': 0.2804172888048136, 'epoch': 3.0})

# Inference

With your model fine-tuned, you can now perform inference.

In [None]:
text = "การประชุมทางวิชาการ ครั้งที่ 1"

`In the first part, we require you to select the wangchanberta-base-att-spm-uncased.`

In [None]:
from transformers import AutoTokenizer

# Load pretrained tokenizer from Hugging Face
#@title Choose Pretrained Model
model_name = "airesearch/wangchanberta-base-att-spm-uncased"

tokenizer = Tokenizer(model_name).from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CamembertTokenizer'. 
The class this function is called from is 'WangchanbertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CamembertTokenizer'. 
The class this function is called from is 'WangchanbertaTokenizer'.


In [None]:
inputs

{'input_ids': tensor([[   5,   10,  882, 8222,    8,   10, 1014,    8,   10,   59,    6]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
from transformers import AutoModelForTokenClassification

## Load your fine-tuned model from Hugging Face
model = AutoModelForTokenClassification.from_pretrained("JeansAthiwat/pos_tagger_model_att_spm_uncased") ## your model path from Hugging Face
with torch.no_grad():
    logits = model(**inputs).logits

  return torch.load(checkpoint_file, map_location="cpu")


In [None]:
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class

['DONM',
 'PUNC',
 'VACT',
 'NCMN',
 'PUNC',
 'PUNC',
 'CFQC',
 'DONM',
 'DONM',
 'DONM',
 'PUNC']

In [None]:
# id2label

In [None]:
# Inference
# ignore special tokens

text = 'จะว่าไปแล้วเชิงเทียนของผมก็สวยดีเหมือนกัน'
inputs = tokenizer(text, return_tensors="pt")
tokenized_input = tokenizer([text], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print('tokens :', tokens)
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
print('predict pos :', predicted_token_class)

tokens : ['<s>', '▁', 'จะว่าไป', 'แล้ว', 'เชิง', 'เทียน', 'ของ', 'ผมก็', 'สวยดี', 'เหมือนกัน', '</s>']
predict pos : ['PUNC', 'PUNC', 'ADVS', 'XVAE', 'NCMN', 'NCMN', 'RPRE', 'PPRS', 'VATT', 'ADVN', 'PUNC']


**Evaluate model :**

The output from the model is a softmax over classes. We choose the maximum class as the answer for evaluation. Again, we will ignore the -100 labels.

In [20]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred, get_only_acc=False):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))

    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    speacial_tag = 0
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            # pass special token
            if tag_true == -100 :
                speacial_tag += 1
                pass
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    print('speacial_tag :',speacial_tag) # delete number of special token from all_count
    accuracy = (all_correct / (all_count-speacial_tag))

    # get only accuracy for testing
    if get_only_acc:
      return accuracy

    accuracy *= 100


    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else 0
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f1_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'

        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f1_score': ''})

    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f1_score', 'correct_count']]

    display(df)


In [None]:
# prepare test set
test_data = tokenized_orchid["test"]

In [None]:
# labels for test set
y_test = []
for inputs in test_data:
  y_test.append(inputs['labels'])

In [None]:
y_pred = []
device = 'cuda' if torch.cuda.is_available() else 'cpu'
for inputs in test_data:
    text = inputs['sentence']
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        pred = model(**inputs).logits
        predictions = torch.argmax(pred, dim=2)
        # Append padded predictions to y_pred
        y_pred.append(predictions.tolist()[0])

KeyboardInterrupt: 

In [None]:
# check our prediction with label
# -100 is special tokens : [<s>, </s>, _]
print(y_pred[0])
print(y_test[0])

In [None]:
evaluation_report(y_test, y_pred)

# Other Pretrained model

In this section, we will experiment by fine-tuning other pretrained models, such as airesearch/wangchanberta-base-wiki-newmm, to see how about their performance.

Since each model uses a different word-tokenization method.
for example, **airesearch/wangchanberta-base-wiki-newmm uses newmm**,
while **airesearch/wangchanberta-base-att-spm-uncased uses SentencePiece**.
please try fine-tuning and compare the performance of these models.

### #TODO 3

CURRENT PIPELINE IS FOR WIKI SENTENCEPIECEMATCHING (WIKI SPM)

In [54]:
model_names = [
    'airesearch/wangchanberta-base-att-spm-uncased',
    'airesearch/wangchanberta-base-wiki-newmm',
    'airesearch/wangchanberta-base-wiki-ssg',
    'airesearch/wangchanberta-base-wiki-sefr',
    'airesearch/wangchanberta-base-wiki-spm',
]

#@title Choose Pretrained Model
model_name = "airesearch/wangchanberta-base-wiki-newmm" #@param ["airesearch/wangchanberta-base-att-spm-uncased", "airesearch/wangchanberta-base-wiki-newmm", "airesearch/wangchanberta-base-wiki-syllable", "airesearch/wangchanberta-base-wiki-sefr", "airesearch/wangchanberta-base-wiki-spm"]

#create tokenizer
tokenizer = Tokenizer(model_name).from_pretrained(
                f'{model_name}',
                revision='main',
                model_max_length=416,)


newmm.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/559 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'. 
The class this function is called from is 'ThaiWordsNewmmTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'. 
The class this function is called from is 'ThaiWordsNewmmTokenizer'.


In [55]:
example = orchidl["train"][1899]
print('sentence :', example["sentence"])
tokenized_input = tokenizer([example["sentence"]], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print('tokens :',tokens)
print('label tokens :', example["label_tokens"])

sentence : โดยพิจารณาจากพจนานุกรมภาษาคู่ (bilingual transfer dictionary)
tokens : ['<s>', 'โดย', 'พิจารณา', 'จาก', 'พจนานุกรม', 'ภาษา', 'คู่', '<_>', '<unk>', '<_>', 'transfer', '<_>', 'dictionary', ')', '</s>']
label tokens : ['โดย', 'พิจารณา', 'จาก', 'พจนานุกรม', 'ภาษา', 'คู่', ' ', '(', 'bilingual transfer dictionary', ')']


It's the same problem as above.

`**Warning: Can we use same function as above ?**`

`**Warning: Please beware of <unk>, an unknown word token.**`

`**Warning: Please be careful of " ำ ", the 'am' vowel. WangchanBERTa's internal preprocessing replaces all " ำ " to 'ํ' and 'า'**`

In [56]:
for char in "คู่":
    print(char)

ape = {}
ape[' '] = "negus"
print(ape)

ค
ู
่
{' ': 'negus'}


In [63]:
def majority_vote_pos(examples):
    tokenized_inputs = tokenizer([examples["sentence"]], is_split_into_words=True)
    new_tokens = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"])

    label_tokens = examples["label_tokens"]
    label_pos_tags = examples["pos_tags"]
    # tokenized_inputs["chs"] = []
    # tokenized_inputs["pos"]= []
    def normalize_am(char):
        return 'ํา' if char == 'ำ' else char

    char_list = []
    pos_list = []
    for word_i, word in enumerate(label_tokens):
        for c in word:
                char_list.append(c)
                pos_list.append(label_pos_tags[word_i])
            # normed = normalize_am(c)
            # if normed == 'ํา':
            #     char_list.append('ํ')
            #     pos_list.append(label_pos_tags[word_i])
            #     char_list.append('า')
            #     pos_list.append(label_pos_tags[word_i])
            # else:
            #     char_list.append(normed)
            #     pos_list.append(label_pos_tags[word_i])

    new_labels = []
    char_ptr = 0  # pointer to char_list / pos_list

    new_labels = []
    char_ptr = 0  # pointer to char_list / pos_list
    tok_index = 0
    new_tokens = [' ' if x == '<_>' else x.replace('▁', '') for x in new_tokens]
    while tok_index < len(new_tokens):
        tok = new_tokens[tok_index]
        if tok in ['<s>', '</s>', '▁']:
            new_labels.append(-100)
            tok_index += 1
            continue

        elif tok == '<unk>':
            pos_count_inner = {}
            # while char_ptr < len(char_list):
            next_token = new_tokens[tok_index+1]
            next_char = new_tokens[tok_index+1][0]
            print(next_token)
            print(next_char)
            while not (  char_list[char_ptr] == next_token or char_list[char_ptr] == next_char):
                vote_pos = pos_list[char_ptr]
                pos_count_inner[vote_pos] = pos_count_inner.get(vote_pos, 0) + 1
                char_ptr += 1

            if len(pos_count_inner) == 0:
                new_labels.append(pos_list[char_ptr])
                # assert False, "error"
            else:
                majority_pos = max(pos_count_inner, key=pos_count_inner.get)
                new_labels.append(majority_pos)
            tok_index += 1
            continue


        token_str = tok

        pos_count = {}
        for ch in token_str:
            if char_ptr >= len(char_list):
                char_ptr += 1
                break
            if ch == char_list[char_ptr]:
                vote_pos = pos_list[char_ptr]
                pos_count[vote_pos] = pos_count.get(vote_pos, 0) + 1
                char_ptr += 1
            else:
                char_ptr += 1

            # tokenized_inputs.setdefault("chs", []).append(ch)
            # tokenized_inputs.setdefault("pos", []).append(vote_pos)

        if len(pos_count) == 0:
            new_labels.append(-100)
            tok_index += 1
        else:
            majority_pos = max(pos_count, key=pos_count.get)
            new_labels.append(majority_pos)
            tok_index += 1



    tokenized_inputs["tokens"] = new_tokens
    tokenized_inputs["labels"] = new_labels
    # tokenized_inputs["chars"] = char_list
    # tokenized_inputs["pos_tags"] = pos_list
    # tokenized_inputs["chars_len"] = len(char_list)

    return tokenized_inputs

In [64]:
# orchidl["train"][63]

In [65]:
# hard test cases
example = orchidl["train"][63]
result_example = majority_vote_pos(example)
for i in result_example :
    print(i, ":", result_example[i])

# example = tokenized_orchid["train"][1899]

# for i in example :
#     print(i, ":", example[i])

 
 
input_ids : [0, 1192, 794, 981, 8518, 1673, 2829, 22, 214, 4214, 7969, 80, 12428, 169, 170, 3568, 5, 5494, 5, 58198, 5, 12553, 5, 35030, 5, 3697, 107, 37, 273, 5, 3, 5, 55, 11943, 262, 25, 10587, 4181, 9967, 2]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
tokens : ['<s>', 'ก็ได้', 'ปรากฏ', 'ผลงาน', 'วิจัยและพัฒนา', 'หลาย', 'โครงการ', 'ที่', 'สามารถ', 'นำไปสู่', 'การผลิต', 'โดย', 'ภาคเอกชน', 'ได้', 'อย่าง', 'ชัดเจน', ' ', 'อาทิ', ' ', 'ไมโครคอมพิวเตอร์', ' ', '32', ' ', 'บิท', ' ', 'วงจร', 'รวม', 'ขนาดใหญ่', 'มาก', ' ', '<unk>', ' ', 'และ', 'มอเตอร์', 'ขนาดเล็ก', 'ใน', 'ผลิตภัณฑ์', 'ไฟฟ้า', 'เป็นสำคัญ', '</s>']
labels : [-100, 43, 41, 26, 26, 17, 26, 36, 43, 39, 39, 38, 26, 42, 22, 40, 37, 38, 37, 26, 37, 10, 37, 6, 37, 26, 26, 26, 1, 37, 26, 37, 24, 26, 2

In [66]:
tokenized_orchid = orchidl.map(majority_vote_pos)

Map:   0%|          | 0/18500 [00:00<?, ? examples/s]

 
 
 
 
 
 
)
)
 
 
 
 
 
 
 
 
 
 
)
)
 
 
 
 
)
)
 
 
 
 
 
 
</s>
<
 
 
 
 
)
)
 
 
 
 
)
)
 
 
 
 
<unk>
<
</s>
<
 
 
 
 
 
 
 
 
 
 
 
 
 
 
)
)
 
 
 
 
 
 
)
)
 
 
 
 
 
 
-
-
 
 
 
 
 
 
)
)
 
 
 
 
 
 
 
 
 
 
 
 
</s>
<


IndexError: list index out of range

In [69]:
model_names = [
    'wangchanberta-base-att-spm-uncased',
    'wangchanberta-base-wiki-newmm',
    'wangchanberta-base-wiki-ssg',
    'wangchanberta-base-wiki-sefr',
    'wangchanberta-base-wiki-spm',
]

#@title Choose Pretrained Model
model_name = "wangchanberta-base-wiki-newmm" #@param ["wangchanberta-base-att-spm-uncased", "wangchanberta-base-wiki-newmm", "wangchanberta-base-wiki-syllable", "wangchanberta-base-wiki-sefr", "wangchanberta-base-wiki-spm"]

#create model
model = AutoModelForTokenClassification.from_pretrained(
    f"airesearch/{model_name}",
    revision='main',
    num_labels=47, id2label=id2label, label2id=label2id
)


pytorch_model.bin:   0%|          | 0.00/646M [00:00<?, ?B/s]

Some weights of the model checkpoint at airesearch/wangchanberta-base-wiki-newmm were not used when initializing RobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-wiki-newmm and are newly initialized: ['classifier.bias', 'classifie

In [70]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

### #TODO 4

Fine-tuning other pretrained model with our orchid corpus.

In [71]:
training_args = TrainingArguments(
    output_dir="pos_tagger_model_wiki_newmm",
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_orchid["train"],
    eval_dataset=tokenized_orchid["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [None]:
######## EVALUATE YOUR MODEL ########
# Inference
# ignore special tokens
# Load pretrained tokenizer from Hugging Face
#@title Choose Pretrained Model
model_name = "airesearch/wangchanberta-base-att-spm-uncased"

tokenizer = Tokenizer(model_name).from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt")
## Load your fine-tuned model from Hugging Face
model = AutoModelForTokenClassification.from_pretrained("JeansAthiwat/pos_tagger_model_att_spm_uncased") ## your model path from Hugging Face
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class

text = 'จะว่าไปแล้วเชิงเทียนของผมก็สวยดีเหมือนกัน'
inputs = tokenizer(text, return_tensors="pt")
tokenized_input = tokenizer([text], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print('tokens :', tokens)
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
print('predict pos :', predicted_token_class)


y_pred = []
device = 'cuda' if torch.cuda.is_available() else 'cpu'
for inputs in test_data:
    text = inputs['sentence']
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        pred = model(**inputs).logits
        predictions = torch.argmax(pred, dim=2)
        # Append padded predictions to y_pred
        y_pred.append(predictions.tolist()[0])
# check our prediction with label
# -100 is special tokens : [<s>, </s>, _]
print(y_pred[0])
print(y_test[0])
evaluation_report(y_test, y_pred)

### #TODO 5

Compare the results between both models. Are they comparable? (Think about the ground truths of both models).

Propose a way to fairly evaluate the models.

<b>Write your answer here :</b>
<br>
We cant compare them directly with just metrices on prediction vs tokenized label. as it is not fare in terms of the token amount ( newmm vs attn-spm)


<b>What we can do</b>is to maybe assemble each tokenized value back in to the original ground truth form(Do alignment again but on pred_token to ground truth) and compare that token instead?

A note on preprocessing data.

``process_transformers`` in ``thaixtransformers.preprocess`` also provides a preprocess code that deals with many issues such as casing, text cleaning, and white space replacement with <_>. You can also use this to preprocess your text. Note that space replacement is done automatically without preprocessing in thaixtransformers.
