# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from a TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa
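
As a point of reference for the first model in the list above, here is a minimal tf-idf baseline sketch. The toy texts and labels are invented, and the choice of character n-grams with logistic regression is an assumption made for illustration (Thai text is not whitespace-delimited, so a word tokenizer such as one from PyThaiNLP could be swapped in instead).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for call-center utterances and their Object labels (invented).
texts = ["pay my bill", "bill payment due", "internet is slow", "slow internet speed"]
labels = ["payment", "payment", "internet", "internet"]

# Character 1- to 3-grams avoid the need for word segmentation (an assumption,
# not a requirement of the assignment).
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

print(clf.predict(["internet bill"]))
```

The pipeline keeps vectorization and classification together, so timing `clf.predict` covers both feature extraction and model compute.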

**You will be asked to submit 3 different files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.

In [1]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

In [2]:
# !pip install pythainlp

## Import Libs

In [3]:
%matplotlib inline
import time
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from IPython.display import display
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset

## Loading data
First, we load the data from disk into a DataFrame.

A DataFrame is essentially a table, or 2D array/matrix, with a name for each column.

In [4]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')

Let's preview the data.

In [5]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call `DataFrame.describe()` again.
Notice that there are 33 unique labels/classes for Object and 10 unique labels for Action that the model will try to predict.
But there are unwanted duplicates that differ only in capitalization, e.g. Idd, idd, loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances out of 16175 utterances. You have to clean that too!

## TODO 0.1:
You will have to remove unwanted label duplications as well as duplications in text inputs.
Also, you will have to trim out unwanted whitespaces from the text inputs.
This shouldn't be too hard, as you have already seen it in the demo.
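
The three cleaning steps described above can be sketched on a toy table. The rows below are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for the call-center table; rows are invented for illustration.
toy_df = pd.DataFrame({
    "Sentence Utterance": ["  check my bill ", "check my bill", "roaming fee?", "IDD rate"],
    "Object": ["bill", "Bill", "roaming", "IDD"],
})

# Trim surrounding whitespace so near-identical utterances compare equal.
toy_df["Sentence Utterance"] = toy_df["Sentence Utterance"].str.strip()
# Lowercase labels so that e.g. "Bill" and "bill" collapse into one class.
toy_df["Object"] = toy_df["Object"].str.lower()
# Keep only the first occurrence of each utterance.
toy_df = toy_df.drop_duplicates("Sentence Utterance", keep="first").reset_index(drop=True)

print(toy_df)
print(sorted(toy_df["Object"].unique()))
```

Stripping before deduplicating matters: the first two rows only become identical once the whitespace is removed.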



In [6]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [7]:
# TODO 0.1: Data cleaning
clean_time = 0
t_start = time.time()
# Trim leading/trailing whitespace from the utterances.
data_df['clean Sentence Utterance'] = data_df['Sentence Utterance'].str.strip()
# Lowercase the Object labels so that e.g. 'IDD', 'Idd', and 'idd' become one class.
data_df['clean Object'] = data_df['Object'].str.lower()

# Drop repeated utterances, keeping the first occurrence of each.
data_df.drop_duplicates("clean Sentence Utterance", keep="first", inplace=True)

# Keep only the cleaned columns; Action is not needed for the Object task.
data_df.drop(['Sentence Utterance', 'Action', 'Object'], axis=1, inplace=True)
t_end = time.time()
clean_time += t_end - t_start


data_df.describe()

# idx = 1
# print(f'"{data_df["Sentence Utterance"][idx]}"')
# print(f'"{data_df["clean Sentence Utterance"][idx]}"')


Unnamed: 0,clean Sentence Utterance,clean Object
count,13367,13367
unique,13367,26
top,สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ,service
freq,1,2108


In [8]:
t_start = time.time()
data = data_df.to_numpy()
unique_label = data_df['clean Object'].unique()

# Map each label string to an integer id, and back.
label_2_num = dict(zip(unique_label, range(len(unique_label))))
num_2_label = dict(zip(range(len(unique_label)), unique_label))

# Replace the label column with integer class ids.
data[:, 1] = np.vectorize(label_2_num.get)(data[:, 1])
# Decompose sara am (U+0E33) into nikhahit + sara aa (U+0E4D U+0E32).
for i in range(len(data)):
    data[i, 0] = data[i, 0].replace('ำ', 'ํา')

t_end = time.time()
clean_time += t_end - t_start

Split the data into train, validation, and test sets (normally the ratio is 80:10:10, respectively). We recommend using train_test_split from scikit-learn to split the data into train, validation, and test sets.

In addition, the split should keep the label distribution similar across the train, validation, and test sets. The **stratify** option handles this.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Make sure the same data splitting is used for all models.
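
A minimal sketch of an 80:10:10 stratified split with `train_test_split`, applied twice. The toy arrays are invented; the fixed `random_state` is what keeps the split identical across notebooks, assuming the input data and its order are the same.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, two classes with a 70/30 imbalance (invented for illustration).
X = np.arange(100)
y = np.array([0] * 70 + [1] * 30)

# First hold out 10% as the test set, preserving label proportions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
# Then take 1/9 of the remaining 90% as validation, i.e. 10% of the full data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1 / 9, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(np.bincount(y_train), np.bincount(y_val), np.bincount(y_test))
```

Note that stratification needs at least two samples per class in each call, so very rare labels may need special handling.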

In [9]:
bin_label = np.bincount(np.array(data[:, 1], dtype=int))
# print(data[:, 1])
print(bin_label)

[ 641 1791  730 1786  581 2108  246 1478  327  540  173 1142  280   22
  246  248  296  231   50  206   49   79   36   67    4   10]


In [10]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
# First pass (untimed): find the longest tokenized utterance to use as the padding length.
max_len = 0
for i in range(len(data)):
    max_len = max(max_len, len(tokenizer(data[i, 0])['input_ids']))
print(max_len)
# Second pass (timed): tokenize every utterance, padding/truncating to max_len.
token_time = 0
t_start = time.time()
for i in range(len(data)):
    data[i, 0] = tokenizer(data[i, 0], max_length=max_len, padding='max_length', truncation=True)
t_end = time.time()
token_time += t_end - t_start



124


In [11]:
from sklearn.model_selection import StratifiedShuffleSplit
t_start = time.time()
# Hold out 10% of the data for testing, then 1/9 of the remaining 90%
# (10% of the full data) for validation, giving an 80:10:10 split.
sss_trainval_test = StratifiedShuffleSplit(n_splits=1, test_size=1/10, random_state=42)
sss_train_val = StratifiedShuffleSplit(n_splits=1, test_size=1/9, random_state=42)

# Split off the test set, stratified by label.
trainval_idx, test_idx = next(sss_trainval_test.split(data[:, 0], data[:, 1]))
trainval_raw = data[trainval_idx]
test_raw = data[test_idx]

# Split the remainder into train and validation sets, stratified by label.
train_idx, val_idx = next(sss_train_val.split(trainval_raw[:, 0], trainval_raw[:, 1]))
train_raw = trainval_raw[train_idx]
val_raw = trainval_raw[val_idx]

t_end = time.time()
clean_time += t_end - t_start

In [12]:
num_Object = len(unique_label)
print(num_Object)

26


# Model 3 WangchanBERTa

We ask you to train a WangchanBERTa-based model.

We recommend you use the thaixtransformers fork (which we used in the PoS homework).
https://github.com/PyThaiNLP/thaixtransformers

The structure of the code will be very similar to the PoS homework. You will also find the huggingface [tutorial](https://huggingface.co/docs/transformers/en/tasks/sequence_classification) useful. Or you can also add a softmax layer by yourself just like in the previous homework.
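
Adding a softmax layer by hand amounts to a linear projection of the sentence embedding followed by softmax and a cross-entropy loss. The NumPy sketch below illustrates the arithmetic only; the sizes, random embeddings, and weights are invented and stand in for the real encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, num_classes, batch_size = 8, 3, 4   # toy sizes, not the real model's
embeddings = rng.normal(size=(batch_size, hidden_size))  # stand-in for sentence embeddings
W = rng.normal(size=(hidden_size, num_classes)) * 0.1    # linear layer weights
b = np.zeros(num_classes)                                # linear layer bias
labels = np.array([0, 2, 1, 2])                          # invented gold labels

# Linear head: one unnormalized score (logit) per class.
logits = embeddings @ W + b

# Numerically stable softmax: subtract the row-wise max before exponentiating.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Cross-entropy: mean negative log-probability of the true class.
loss = -np.log(probs[np.arange(batch_size), labels]).mean()
preds = probs.argmax(axis=1)

print(probs.sum(axis=1))
print(loss, preds)
```

In PyTorch, `F.cross_entropy` applies the log-softmax internally, so a classification head trained with it should return raw logits rather than probabilities.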

Which WangchanBERTa model will you use? Why? (Don't forget to clean your text accordingly).

**Ans: I use "wangchanberta-base-att-spm-uncased" because it yields the best accuracy in HW4.**
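
A sketch of what cleaning the text "accordingly" might look like for an uncased checkpoint. The helper name `clean_utterance` is hypothetical, and lowercasing is an assumption about what the uncased model expects; the sara am replacement mirrors the one used in this notebook's cleaning cell.

```python
def clean_utterance(text: str) -> str:
    """Hypothetical helper: normalize one utterance before tokenization."""
    text = text.strip()             # drop leading/trailing whitespace
    text = " ".join(text.split())   # collapse internal runs of whitespace
    text = text.lower()             # assumed appropriate for an uncased checkpoint
    # Decompose sara am (U+0E33) into nikhahit + sara aa (U+0E4D U+0E32),
    # as done in this notebook's cleaning cell.
    text = text.replace("\u0e33", "\u0e4d\u0e32")
    return text


print(clean_utterance("  Internet   Roaming  "))
```

Lowercasing only affects the Latin-script words mixed into the Thai utterances, such as product names.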


In [13]:
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningModule, Trainer
from transformers import AutoTokenizer, AutoModel
from torchmetrics import Accuracy
import torch.nn as nn
from torch.nn import functional as F

In [14]:
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")

example = train_raw[4, 0]
print(example)



{'input_ids': [5, 10, 260, 37, 1751, 1073, 4213, 27, 7675, 103, 726, 775, 775, 103, 335, 627, 200, 10, 2035, 10, 1849, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


In [15]:
class TrueDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {
            key: torch.tensor(val) for key, val in self.encodings[idx].items()
            if key in ['input_ids', 'attention_mask']
        }
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [16]:
train_dataset = TrueDataset(train_raw[:, 0], train_raw[:, 1])
val_dataset = TrueDataset(val_raw[:, 0], val_raw[:, 1])
test_dataset = TrueDataset(test_raw[:, 0], test_raw[:, 1])
# print(train_dataset[4])
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128)
test_loader = DataLoader(test_dataset, batch_size=128)

In [17]:
print(train_dataset[0]['input_ids'])
print(train_dataset[4]['input_ids'])

tensor([    5,    10, 10038,  1361,  8333,  1640,    10,     3,    10,    73,
            6,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1])
tensor([   5,   10,  260,  

In [18]:
class BaseModel(LightningModule):
    def __init__(
          self,
          model_name: str = 'airesearch/wangchanberta-base-att-spm-uncased',
          learning_rate: float = 2e-5
    ):
        super().__init__()
        self.save_hyperparameters()

        self.encoder = AutoModel.from_pretrained(model_name)
        self.learning_rate = learning_rate

    def get_embeddings(self, input_ids, attention_mask):
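        # TODO 1: get CLS token embedding to represent as a sentence embedding
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Position 0 of the last hidden state is the sequence-start (CLS-style) token.
        return outputs.last_hidden_state[:, 0]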

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer

    def forward(self, input_ids, attention_mask):
        return self.get_embeddings(input_ids, attention_mask)

In [19]:
class LMWithLinearClassfier(BaseModel):
    def __init__(
          self,
          model_name: str = 'airesearch/wangchanberta-base-att-spm-uncased',
          ckpt_path: str = None,
          learning_rate: float = 2e-5,
          freeze_encoder_weights: bool = False
    ):
        super().__init__(
            model_name,
            learning_rate
        )
        self.save_hyperparameters()

        if ckpt_path is not None:
            # Load only the encoder weights, stripping the 'encoder.' prefix from each key.
            state_dict = torch.load(ckpt_path)['state_dict']
            prefix = 'encoder.'
            new_state_dict = {
                k[len(prefix):]: v
                for k, v in state_dict.items()
                if k.startswith(prefix)
            }
            self.encoder.load_state_dict(new_state_dict)

        self.linear_layer = nn.Linear(self.encoder.config.hidden_size, num_Object)

        if freeze_encoder_weights:
          self.freeze_weights(self.encoder)  # Freeze model

        self.accuracy = Accuracy(task='multiclass', num_classes=num_Object)

    def freeze_weights(self, model):
        for param in model.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        embeddings = self.get_embeddings(input_ids, attention_mask)
        return self.linear_layer(embeddings)

    def training_step(self, batch, batch_idx):
        xb, yb = batch['input_ids'], batch['labels']
        mask = batch['attention_mask']
        out = self(xb, mask)
        loss = F.cross_entropy(out, yb)
        acc = self.accuracy(out, yb)
        self.log('train_loss', loss, prog_bar=True)
        self.log('train_acc', acc, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        xb, yb = batch['input_ids'], batch['labels']
        mask = batch['attention_mask']
        out = self(xb, mask)
        loss = F.cross_entropy(out, yb)
        acc = self.accuracy(out, yb)
        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        xb, yb = batch['input_ids'], batch['labels']
        mask = batch['attention_mask']
        out = self(xb, mask)
        loss = F.cross_entropy(out, yb)
        acc = self.accuracy(out, yb)
        self.log('test_loss', loss, prog_bar=True)
        self.log('test_acc', acc, prog_bar=True)

In [20]:
model = LMWithLinearClassfier(
    model_name='airesearch/wangchanberta-base-att-spm-uncased',
    learning_rate=2e-5,
    freeze_encoder_weights=False
)

Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertModel: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
out = model(train_dataset[4]['input_ids'].unsqueeze(0), train_dataset[4]['attention_mask'].unsqueeze(0))
print(out.shape, out)

torch.Size([1, 26]) tensor([[-0.3932,  0.3188,  0.0880,  0.1463, -0.9910, -0.3220,  0.2828,  0.1696,
         -0.0709,  0.6598,  0.5253, -0.7914,  0.5810, -0.2353, -1.2406, -0.0430,
         -0.4157,  0.1644, -0.6503, -0.0283,  0.2575,  0.5041, -0.5448, -0.7062,
          0.2821,  0.5667]], grad_fn=<AddmmBackward0>)


In [22]:
import pytorch_lightning as pl

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val_acc",  # Metric to monitor
    mode="max",  # "min" for loss, "max" for accuracy
    save_top_k=1,  # Save only the best model(s)
    save_weights_only=True, # Saves only weights, not the entire model
    dirpath="./checkpoints/", # Path where the checkpoints will be saved
    filename="best_wangchan_model-{epoch}-{val_acc:.2f}", # Customized name for the checkpoint
    verbose=True,
)

trainer = Trainer(
    max_epochs=5,
    accelerator='auto',
    callbacks=[checkpoint_callback], # Add the ModelCheckpoint callback
    gradient_clip_val=1.0,
    precision=16, # Mixed precision training
    devices=1,
)

train_time = 0
t_start = time.time()

trainer.fit(model, train_loader, val_loader)

t_end = time.time()
train_time += t_end - t_start


/home/andre/anaconda3/envs/ML10/lib/python3.10/site-packages/lightning_fabric/connector.py:572: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4080') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
2025-02-16 02:44:12.520238: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the enviro


/home/andre/anaconda3/envs/ML10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=23` in the `DataLoader` to improve performance.
/home/andre/anaconda3/envs/ML10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=23` in the `DataLoader` to improve performance.



Epoch 0, global step 84: 'val_acc' reached 0.48542 (best 0.48542), saving model to '/home/andre/Desktop/CU_submission/NLP_2025/L06_Sequence_Classification/HW/checkpoints/best_wangchan_model-epoch=0-val_acc=0.49.ckpt' as top 1



Epoch 1, global step 168: 'val_acc' reached 0.72700 (best 0.72700), saving model to '/home/andre/Desktop/CU_submission/NLP_2025/L06_Sequence_Classification/HW/checkpoints/best_wangchan_model-epoch=1-val_acc=0.73.ckpt' as top 1



Epoch 2, global step 252: 'val_acc' reached 0.75243 (best 0.75243), saving model to '/home/andre/Desktop/CU_submission/NLP_2025/L06_Sequence_Classification/HW/checkpoints/best_wangchan_model-epoch=2-val_acc=0.75.ckpt' as top 1



Epoch 3, global step 336: 'val_acc' reached 0.76215 (best 0.76215), saving model to '/home/andre/Desktop/CU_submission/NLP_2025/L06_Sequence_Classification/HW/checkpoints/best_wangchan_model-epoch=3-val_acc=0.76-v1.ckpt' as top 1



Epoch 4, global step 420: 'val_acc' reached 0.76739 (best 0.76739), saving model to '/home/andre/Desktop/CU_submission/NLP_2025/L06_Sequence_Classification/HW/checkpoints/best_wangchan_model-epoch=4-val_acc=0.77.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=5` reached.


In [23]:
infer_time = 0
t_start = time.time()

trainer.test(model, test_loader)

t_end = time.time()
infer_time += t_end - t_start

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/andre/anaconda3/envs/ML10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=23` in the `DataLoader` to improve performance.


Testing: |          | 0/? [00:00<?, ?it/s]

In [24]:
print(f"Cleaning time: {clean_time:.5f} s")
print(f"Tokenization time: {token_time:.5f} s")
print(f"Training time: {train_time:.5f} s")
print(f"Inference time: {infer_time:.5f} s")

Cleaning time: 0.02212 s
Tokenization time: 0.71313 s
Training time: 96.03775 s
Inference time: 0.78743 s


# Comparison

After you have completed the 3 models, compare their accuracy, ease of implementation, and inference speed (from cleaning and tokenization through model computation) in mycourseville.
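
To keep the timing comparable across the three notebooks, the per-stage measurements can be collected with one small helper. This is a sketch, not part of the assignment: the `timed` context manager and the `stage_seconds` dictionary are hypothetical names, and the workloads are placeholders for the real pipeline stages.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)  # accumulated wall-clock time per stage


@contextmanager
def timed(stage: str):
    """Add the elapsed wall-clock time of the enclosed block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


# Placeholder workloads standing in for the real pipeline stages.
with timed("cleaning"):
    cleaned = [s.strip().lower() for s in ["  Hello ", "World  "]]
with timed("tokenization"):
    tokens = [s.split() for s in cleaned]
with timed("inference"):
    lengths = [len(t) for t in tokens]

total = sum(stage_seconds.values())
for stage, seconds in stage_seconds.items():
    print(f"{stage}: {seconds:.6f} s")
print(f"total: {total:.6f} s")
```

When timing GPU inference, CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` before reading the clock is likely needed for accurate numbers.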