# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will build models to classify texts from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa

**You will be asked to submit 3 different files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free-form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.
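
Since all three models are compared on accuracy and inference time, it may help to measure both with the same helper. The following is a minimal sketch, not part of the assignment template: `evaluate_with_timing` and `toy_predict` are hypothetical names, and the toy classifier only stands in for a real model's prediction function.

```python
import time
from typing import Callable, List, Sequence, Tuple


def evaluate_with_timing(
    predict_fn: Callable[[Sequence[str]], List[int]],
    texts: Sequence[str],
    labels: Sequence[int],
) -> Tuple[float, float]:
    """Return (accuracy, seconds per example) for a prediction function."""
    start = time.perf_counter()
    predictions = predict_fn(texts)
    elapsed = time.perf_counter() - start
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels), elapsed / len(labels)


# Hypothetical stand-in classifier: predicts class 1 if the text contains "net".
def toy_predict(batch: Sequence[str]) -> List[int]:
    return [1 if "net" in t else 0 for t in batch]


acc, sec_per_example = evaluate_with_timing(
    toy_predict, ["internet slow", "pay bill"], [1, 0]
)
print(f"accuracy={acc:.3f}, seconds/example={sec_per_example:.6f}")
```

Wrapping each model's inference in a function with the same signature keeps the timing comparable; for GPU models, the first call usually includes warm-up cost, so timing a second pass may be fairer.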

In [1]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
# !pip install pythainlp

In [2]:
# Install transformers and thai2transformers
!pip install wandb
!pip install -q transformers==4.30.1 datasets evaluate thaixtransformers
!pip install -q emoji pythainlp sefr_cut tinydb seqeval sentencepiece pydantic jsonlines
!pip install peft==0.10.0



## Import libraries for WangchanBERTa

In [3]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import torch
import pickle
from collections import defaultdict
from functools import partial
from tqdm.auto import tqdm
from IPython.display import display
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from transformers import (
    AutoModel,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
    CamembertTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)
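# Note: datasets.Dataset shadows the torch.utils.data.Dataset imported above;
# every later use of `Dataset` in this notebook is the Hugging Face one.
from datasets import Dataset, DatasetDict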
from thaixtransformers import Tokenizer
from thaixtransformers.preprocess import process_transformers

from transformers import DataCollatorWithPadding
import evaluate

# Set random seed for reproducibility
np.random.seed(42)



# Model 3 WangchanBERTa

We ask you to train a WangchanBERTa-based model.

We recommend you use the thaixtransformers fork (which we used in the PoS homework).
https://github.com/PyThaiNLP/thaixtransformers

The structure of the code will be very similar to the PoS homework. You will also find the Hugging Face [tutorial](https://huggingface.co/docs/transformers/en/tasks/sequence_classification) useful. Alternatively, you can add a softmax layer yourself, as in the previous homework.

Which WangchanBERTa model will you use? Why? (Don't forget to clean your text accordingly).

**Ans:**


## Loading the cleaned dataset and converting it to Dataset objects

In [4]:


with open('template_cleaned_dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)

# Extract tokenized text and labels
label_2_num_map, num_2_label_map = dataset["label_2_num_map"], dataset["num_2_label_map"]
train_texts, train_labels = dataset["train"]["input"], dataset["train"]["label"]
val_texts, val_labels = dataset["val"]["input"], dataset["val"]["label"]
test_texts, test_labels = dataset["test"]["input"], dataset["test"]["label"]


# Create Dataset objects
train_dataset = Dataset.from_dict({"text": train_texts, "label": train_labels})
val_dataset = Dataset.from_dict({"text": val_texts, "label": val_labels})
test_dataset = Dataset.from_dict({"text": test_texts, "label": test_labels})

# Create DatasetDict
dataset_dict = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset
})

# Display dataset structure
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10710
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1339
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1340
    })
})
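
As a quick sanity check, the split sizes printed above can be turned into proportions. This is a small sketch that only uses the row counts shown in the output; the variable names are illustrative.

```python
# Row counts copied from the DatasetDict printout above.
split_sizes = {"train": 10710, "validation": 1339, "test": 1340}

total = sum(split_sizes.values())
fractions = {name: n / total for name, n in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({frac:.1%})")
```

The counts correspond to roughly an 80/10/10 split.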


In [5]:
dataset_dict["test"][0]

{'text': 'ซื้อแอร์การ์ดมาจากเซเว่นค่ะ เบอร์นี้นะค่ะ แล้วเติมเงิน หรือสมัครอะไรไม่ได้เลยค่ะ',
 'label': 3}

## Define Tokenizer for the model
- "airesearch/wangchanberta-base-att-spm-uncased"

In [6]:
MODEL_NAME = "airesearch/wangchanberta-base-att-spm-uncased"

#create tokenizer
tokenizer = Tokenizer(MODEL_NAME).from_pretrained(
                f'{MODEL_NAME}',
                revision='main',
                model_max_length=416,)

text = 'ศิลปะไม่เป็นเจ้านายใคร และไม่เป็นขี้ข้าใคร'
print('text :', text)
tokens = []
for i in tokenizer([text], is_split_into_words=True)['input_ids']:
  tokens.append(tokenizer.decode(i))
print('tokens :', tokens)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CamembertTokenizer'. 
The class this function is called from is 'WangchanbertaTokenizer'.


text : ศิลปะไม่เป็นเจ้านายใคร และไม่เป็นขี้ข้าใคร
tokens : ['<s>', '', 'ศิลปะ', 'ไม่เป็น', 'เจ้านาย', 'ใคร', '<_>', 'และ', 'ไม่เป็น', 'ขี้ข้า', 'ใคร', '</s>']


## Preprocess dataset

In [7]:
def lower_case_sentences(examples):
  lower_cased_examples = examples
  lower_cased_examples["text"] = examples["text"].lower()
  return lower_cased_examples

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset_dict_lower = dataset_dict.map(lower_case_sentences)
tokenized_dataset = dataset_dict_lower.map(partial(preprocess_function, tokenizer=tokenizer))


In [8]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

id2label = {k: v for k, v in num_2_label_map.items()}
label2id = {k: v for k, v in label_2_num_map.items()}
num_labels = len(label2id)

# print(label2id)
print(id2label)
print("num_labels", num_labels)

{0: 'payment', 1: 'package', 2: 'suspend', 3: 'internet', 4: 'phone_issues', 5: 'service', 6: 'nontruemove', 7: 'balance', 8: 'detail', 9: 'bill', 10: 'credit', 11: 'promotion', 12: 'mobile_setting', 13: 'iservice', 14: 'roaming', 15: 'truemoney', 16: 'information', 17: 'lost_stolen', 18: 'balance_minutes', 19: 'idd', 20: 'garbage', 21: 'ringtone', 22: 'rate', 23: 'loyalty_card', 24: 'contact', 25: 'officer'}
num_labels 26
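
One design choice worth noting: `preprocess_function` already pads every example to the tokenizer's maximum length (416 here) via `padding="max_length"`, so `DataCollatorWithPadding` has nothing left to pad. Dropping `padding="max_length"` and keeping `truncation=True` would let the collator pad each batch only to its longest member. The sketch below estimates the difference in tokens processed; `padded_tokens` is a hypothetical helper and the token lengths are made-up values, not measurements from this dataset.

```python
from typing import Dict, List


def padded_tokens(lengths: List[int], batch_size: int, max_length: int) -> Dict[str, int]:
    """Count tokens fed to the model under two padding strategies."""
    # Fixed padding: every example is padded to max_length.
    fixed = len(lengths) * max_length
    # Dynamic padding: each batch is padded to its own longest example.
    dynamic = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        dynamic += max(batch) * len(batch)
    return {"max_length": fixed, "dynamic": dynamic}


# Hypothetical token lengths for eight short call-center utterances.
example_lengths = [12, 30, 18, 25, 9, 40, 22, 15]
counts = padded_tokens(example_lengths, batch_size=4, max_length=416)
print(counts)
```

For short utterances the gap can be large, which affects both training time and the inference-time numbers this homework asks for.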


## Defining the SequenceClassification model

In [9]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    f"{MODEL_NAME}",
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id
)

training_args = TrainingArguments(
    output_dir="TRUEvoice_objective_wangchanberta_uncaesd_5epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
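    eval_dataset=tokenized_dataset["test"],  # NOTE: this run evaluates on the test split; tokenized_dataset["validation"] is the usual choice for checkpoint selection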
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased and are newly ini

## Fit the model

In [10]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mjeansathiwat[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.2111 | 0.898832 | 0.74403 |
| 2 | 0.6932 | 0.849231 | 0.761194 |
| 3 | 0.5066 | 0.888763 | 0.764925 |
| 4 | 0.3806 | 0.941494 | 0.776866 |
| 5 | 0.3174 | 0.984565 | 0.776119 |


TrainOutput(global_step=6695, training_loss=0.6852920332802508, metrics={'train_runtime': 2151.5685, 'train_samples_per_second': 24.889, 'train_steps_per_second': 3.112, 'total_flos': 1.14502644180288e+16, 'train_loss': 0.6852920332802508, 'epoch': 5.0})
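
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` selects the checkpoint with the lowest evaluation loss by default, which need not be the one with the highest accuracy. The sketch below replays that selection on the numbers logged above; it also checks the step count under the assumption that the default per-device batch size of 8 was used on a single device.

```python
import math

# Per-epoch evaluation results copied from the training log above.
epochs = [
    {"epoch": 1, "eval_loss": 0.898832, "accuracy": 0.74403},
    {"epoch": 2, "eval_loss": 0.849231, "accuracy": 0.761194},
    {"epoch": 3, "eval_loss": 0.888763, "accuracy": 0.764925},
    {"epoch": 4, "eval_loss": 0.941494, "accuracy": 0.776866},
    {"epoch": 5, "eval_loss": 0.984565, "accuracy": 0.776119},
]

best_by_loss = min(epochs, key=lambda r: r["eval_loss"])
best_by_acc = max(epochs, key=lambda r: r["accuracy"])
print("lowest eval loss :", best_by_loss)
print("highest accuracy :", best_by_acc)

# Assumption: batch size 8 (the TrainingArguments default) on one device.
steps_per_epoch = math.ceil(10710 / 8)
print("steps per epoch:", steps_per_epoch, "| total:", steps_per_epoch * 5)
```

The lowest loss falls on epoch 2 while the highest accuracy falls on epoch 4, so the reloaded model is the epoch 2 checkpoint. Passing `metric_for_best_model="accuracy"` to `TrainingArguments` would select on accuracy instead.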

In [11]:
trainer.push_to_hub()

Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 1.00/402M [00:00<?, ?B/s]

Upload file runs/Feb14_19-50-30_JafPC/events.out.tfevents.1739537437.JafPC.100291.0:   0%|          | 1.00/9.1…

To https://huggingface.co/JeansAthiwat/TRUEvoice_objective_wangchanberta_uncaesd_5epoch
   24e119d..4b5d08f  main -> main

To https://huggingface.co/JeansAthiwat/TRUEvoice_objective_wangchanberta_uncaesd_5epoch
   4b5d08f..6c7338a  main -> main



'https://huggingface.co/JeansAthiwat/TRUEvoice_objective_wangchanberta_uncaesd_5epoch/commit/4b5d08f8ea4a849b9c13efe4b410127f94236e20'

## Evaluate

In [13]:
# Load pretrained tokenizer from Hugging Face
model_name = "airesearch/wangchanberta-base-att-spm-uncased"

tokenizer = Tokenizer(model_name).from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt")
inputs

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CamembertTokenizer'. 
The class this function is called from is 'WangchanbertaTokenizer'.


{'input_ids': tensor([[    5,    10,  2391,  1501,  5365,   197,     8,   222,  1501, 21325,
           197,     6]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [14]:
from transformers import AutoModelForSequenceClassification

## Load your fine-tuned model from Hugging Face
model = AutoModelForSequenceClassification.from_pretrained("/home/jaf/NLP_NoScope/codes/L06_Sequence_Classification/TRUEvoice_objective_wangchanberta_uncaesd_5epoch") ## your model path from local or hugging face
with torch.no_grad():
    logits = model(**inputs).logits

In [21]:
prediction = torch.nn.functional.softmax(logits, dim=1).argmax().item()
prediction

5
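
The predicted id 5 corresponds to `'service'` in the `id2label` mapping printed earlier. Because softmax preserves the ordering of the logits, taking `argmax` of the raw logits gives the same index, so the softmax call is optional when only the label is needed. A small pure-Python sketch of this, using made-up logits (not the model's actual output) and only the first six labels:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


# First six entries of the id2label mapping printed earlier in the notebook.
id2label = {0: "payment", 1: "package", 2: "suspend",
            3: "internet", 4: "phone_issues", 5: "service"}

# Hypothetical logits, chosen only for illustration.
logits = [0.1, -1.2, 0.3, 1.5, -0.4, 2.7]

pred_from_logits = argmax(logits)
pred_from_probs = argmax(softmax(logits))
print(pred_from_logits, pred_from_probs, id2label[pred_from_logits])
```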

In [23]:
import torch
from transformers import PreTrainedTokenizer, PreTrainedModel
from typing import Dict, List, Tuple

def get_true_and_predicted_labels(
    dataset: List[Dict],
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    text_key: str = 'text',
    label_key: str = 'label',
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
) -> Tuple[List[int], List[int]]:
    """Run the model on every example and return (true labels, predicted labels)."""
    model.to(device)
    model.eval()

    y_true = []
    y_pred = []

    with torch.no_grad():
        for entry in dataset:
            # Extract true label
            y_true.append(entry[label_key])

            # Tokenize input text
            inputs = tokenizer(entry[text_key], return_tensors="pt", truncation=True, padding=True)
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Get model predictions
            outputs = model(**inputs).logits

            # Get predicted label
            predicted_label = torch.argmax(torch.nn.functional.softmax(outputs, dim=1), dim=1).item()
            y_pred.append(predicted_label)

    return y_true, y_pred

# Assuming 'tokenized_dataset' is a dictionary containing your datasets
val_data = tokenized_dataset["validation"]
test_data = tokenized_dataset["test"]

# Get true and predicted labels for the test set
y_test_true, y_test_pred = get_true_and_predicted_labels(test_data, model, tokenizer)

# Get true and predicted labels for the validation set
y_val_true, y_val_pred = get_true_and_predicted_labels(val_data, model, tokenizer)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
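
The loop above runs the model on one example at a time, and the warning indicates that `truncation=True` had no effect because the reloaded tokenizer carries no maximum length. Batching the inputs and passing `max_length` explicitly is one way to address both. The sketch below shows only the batching logic in plain Python; `iter_batches` is a hypothetical helper, and the tokenizer and model calls appear as comments because they need the fine-tuned checkpoint.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = [f"utterance {i}" for i in range(10)]
batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=4)]
print(batch_sizes)

# Inside the evaluation loop, each batch could then be processed roughly as:
#   inputs = tokenizer(batch, return_tensors="pt", padding=True,
#                      truncation=True, max_length=416)
#   inputs = {k: v.to(device) for k, v in inputs.items()}
#   preds = model(**inputs).logits.argmax(dim=1).tolist()
```

The batch size is a tuning choice: larger batches usually shorten total inference time on a GPU until memory becomes the limit.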


In [24]:
val_acc = accuracy_score(y_val_true, y_val_pred)
test_acc = accuracy_score(y_test_true, y_test_pred)

print(f"Validation accuracy: {val_acc}")
print(f"Test accuracy: {test_acc}")

Validation accuracy: 0.7647498132935027
Test accuracy: 0.7611940298507462
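
With 26 classes, a single accuracy number can hide classes that the model rarely gets right. A per-class breakdown is one way to look further. The following is a minimal sketch with a hypothetical helper `per_class_accuracy` and toy labels; in the notebook it could be applied to `y_test_true` and `y_test_pred`.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """Fraction of correctly predicted examples for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Toy labels for illustration only (not results from the model).
toy_true = [0, 0, 0, 3, 3, 14]
toy_pred = [0, 0, 3, 3, 3, 0]

breakdown = per_class_accuracy(toy_true, toy_pred)
overall = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(breakdown, f"overall={overall:.3f}")
```

In the toy data the overall accuracy looks moderate even though class 14 is never predicted correctly, which is the kind of pattern this breakdown is meant to surface.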
