# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa

**You will be asked to submit 3 different files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.

In [12]:
!wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

--2025-02-15 17:25:49--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/scl/fi/8h8hvsw9uj6o0524lfe4i/clean-phone-data-for-students.csv?rlkey=lwv5xbf16jerehnv3lfgq5ue6 [following]
--2025-02-15 17:25:49--  https://www.dropbox.com/scl/fi/8h8hvsw9uj6o0524lfe4i/clean-phone-data-for-students.csv?rlkey=lwv5xbf16jerehnv3lfgq5ue6
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucd140658d06553ba0fb5f39edc4.dl.dropboxusercontent.com/cd/0/inline/CkKTYFCxdlz7-k1p7pp6ebl7mdHz3qc9Me3wCOfH7Qpj7N2tzFBWDmdb9ghn5fc5B0bYSN7Tl29IaoqMKf8YBtC4PAviOqFCeB2KcTZJkgOpMXJqVbbmbPP6Dt2s1xG4MKs/file# [following]
--2025-02-15 17:25:49--  https://ucd140658d06553ba0fb5f39ed

In [13]:
!pip install pythainlp
!pip install -U sentence-transformers
!pip install tf-keras



## Import Libs

In [14]:
%matplotlib inline
import pandas
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from torch.utils.data import Dataset
from IPython.display import display
from collections import defaultdict
from sklearn.metrics import accuracy_score

In [15]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')

In [16]:
def clean_data(df):
    """Cleans the dataset by selecting relevant columns, normalizing labels, 
    trimming whitespace, and removing duplicates."""
    
    # Select and rename columns
    df = df[["Sentence Utterance", "Object"]].rename(columns={"Sentence Utterance": "input", "Object": "raw_label"})

    # Normalize label (lowercase)
    df["clean_label"] = df["raw_label"].str.lower()

    # Trim white spaces in input column
    df["input"] = df["input"].str.strip()

    # Remove duplicates based on input
    df = df.drop_duplicates(subset="input", keep="first")

    # Drop the raw label column (reassign rather than mutating in place)
    df = df.drop(columns=["raw_label"])

    return df

# Apply cleaning function
data_df = clean_data(data_df)

# Display summary
display(data_df.describe())
display(data_df["clean_label"].unique())


| | input | clean_label |
| --- | --- | --- |
| count | 13367 | 13367 |
| unique | 13367 | 26 |
| top | สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ | service |
| freq | 1 | 2108 |


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

In [17]:
# Mapping and Trimming
data = data_df.to_numpy()
unique_label = data_df.clean_label.unique()

label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))

data[:,1] = np.vectorize(label_2_num_map.get)(data[:,1]) 

def strip_str(string):
    return string.strip()
data[:,0] = np.vectorize(strip_str)(data[:,0])

display(data)

array([['<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท',
        0],
       ['internet ยังความเร็วอยุ่เท่าไหร ครับ', 1],
       ['ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ', 2],
       ...,
       ['ยอดเงินเหลือเท่าไหร่ค่ะ', 7],
       ['ยอดเงินในระบบ', 7],
       ['สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ', 1]], dtype=object)

In [18]:
# Split
from sklearn.model_selection import train_test_split

# Constants
SEED = 42
MIN_INSTANCES = 10  # Minimum instances per class


def filter_data(data_df, min_instances=MIN_INSTANCES):
    """
    Filters out classes with fewer than `min_instances` occurrences.
    Returns filtered input (X) and labels (y).
    """
    class_counts = data_df["clean_label"].value_counts()
    valid_classes = class_counts[class_counts >= min_instances].index

    filtered_data = data_df[data_df["clean_label"].isin(valid_classes)]
    return filtered_data["input"], filtered_data["clean_label"].astype(int)

def split_data(data_df, random_state=SEED, min_instances=MIN_INSTANCES):
    """
    Splits data into train (80%), validation (10%), and test (10%) sets.
    Ensures stratification and filtering of rare classes.
    """
    # Filter classes
    X, y = filter_data(data_df, min_instances)

    # Split 80% Train, 20% Temp
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=random_state
    )

    # Split 10% Validation, 10% Test
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=random_state
    )

    print(f"Train size: {len(X_train)}")
    print(f"Validation size: {len(X_val)}")
    print(f"Test size: {len(X_test)}")

    return (
        np.array(X_train), np.array(X_val), np.array(X_test),
        np.array(y_train), np.array(y_val), np.array(y_test)
    )

# Convert to DataFrame
df = pd.DataFrame(data, columns=['input', 'clean_label'])

# Split dataset
X_train, X_val, X_test, y_train, y_val, y_test = split_data(df)


Train size: 10690
Validation size: 1336
Test size: 1337
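Because both calls to `train_test_split` pass `stratify`, each subset should end up with roughly the same label proportions. A minimal sketch of how that could be checked is given below. The helper names `label_proportions` and `max_proportion_gap` and the toy label lists are hypothetical; the lists only stand in for `y_train` and `y_test`.

```python
from collections import Counter


def label_proportions(labels):
    """Map each label to its share of the given label sequence."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(labels_a, labels_b):
    """Largest absolute difference in label share between two splits."""
    props_a = label_proportions(labels_a)
    props_b = label_proportions(labels_b)
    all_labels = set(props_a) | set(props_b)
    return max(
        abs(props_a.get(label, 0.0) - props_b.get(label, 0.0))
        for label in all_labels
    )


# Toy stand-ins for y_train and y_test (assumed values, not the real data).
toy_train = [0] * 80 + [1] * 16 + [2] * 4
toy_test = [0] * 8 + [1] * 2
gap = max_proportion_gap(toy_train, toy_test)
print(f"Largest class-share gap: {gap:.3f}")
```

With the real arrays, `max_proportion_gap(y_train.tolist(), y_test.tolist())` should come out close to zero if stratification worked as intended.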


# Model 2 MUSE

Build a simple logistic regression model using features from the MUSE model.

Which MUSE model will you use? Why?

**Ans:** I will use sentence-transformers/use-cmlm-multilingual because:

- It is pre-trained on multiple languages, including Thai, so it covers the language of this dataset.
- It captures sentence-level semantics rather than just individual words, which gives more meaningful embeddings.
- It can generalize better than sparse bag-of-words features such as TF-IDF, which may improve performance on downstream tasks.

MUSE is typically used with TensorFlow. However, community-made PyTorch conversions are available:

https://huggingface.co/sentence-transformers/use-cmlm-multilingual
https://huggingface.co/dayyass/universal-sentence-encoder-multilingual-large-3-pytorch

In [19]:
import time  # needed for the timing calls below

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

start_time = time.time()
print("MUSE + Logistic Regression")

muse_model = SentenceTransformer("sentence-transformers/use-cmlm-multilingual")

def encode_texts(texts):
    return muse_model.encode(texts, convert_to_numpy=True)

start_enc_time = time.time()
X_train_enc = encode_texts(X_train.tolist())
X_val_enc = encode_texts(X_val.tolist())
X_test_enc = encode_texts(X_test.tolist())
end_enc_time = time.time()
print(f"Encoding Time: {end_enc_time - start_enc_time:.4f} seconds")

model = LogisticRegression(random_state=SEED)

start_train_time = time.time()
model.fit(X_train_enc, y_train)
end_train_time = time.time()
print(f"Training Time: {end_train_time - start_train_time:.4f} seconds")

y_pred_train = model.predict(X_train_enc)
y_pred_val = model.predict(X_val_enc)
y_pred_test = model.predict(X_test_enc)

train_acc = np.mean(y_train.astype(int) == y_pred_train)
val_acc = np.mean(y_val.astype(int) == y_pred_val)
test_acc = np.mean(y_test.astype(int) == y_pred_test)

print(f"Train Accuracy: {train_acc:.4f}")
print(f"Validation Accuracy: {val_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
end_time = time.time()
print(f"Total Time: {end_time - start_time:.4f} seconds")

MUSE + Logistic Regression


Some weights of the model checkpoint at sentence-transformers/use-cmlm-multilingual were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Encoding Time: 21.6718 seconds
Training Time: 2.2648 seconds
Train Accuracy: 0.7373
Validation Accuracy: 0.7073
Test Accuracy: 0.7023
Total Time: 38.5585 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
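The message above is scikit-learn's convergence warning: the default `lbfgs` solver stopped at its `max_iter` limit (100 by default) before converging. As the warning suggests, two common remedies are raising `max_iter` and standardizing the features. The sketch below tries both on synthetic data from `make_classification`. The data and the `max_iter=1000` value are assumptions for illustration only, and the accuracies reported above were not produced with this setup.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sentence embeddings (assumed, not the real data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=42
)

# Standardize the features, then fit with a larger iteration budget.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# Record warnings so we can check whether the solver still hit its limit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X, y)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
n_iter = int(clf.named_steps["logisticregression"].n_iter_.max())
print(f"Converged: {converged}, iterations used: {n_iter}")
```

In this notebook the same idea would amount to passing a larger `max_iter` to `LogisticRegression` before calling `model.fit(X_train_enc, y_train)`; whether that changes the accuracy would need to be re-measured.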


# Comparison

After you have completed the 3 models, compare the accuracy, ease of implementation, and inference speed (measured end to end, from cleaning and tokenization through model computation) of the three models in mycourseville (MCV).
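One way to keep the inference-speed numbers comparable is to time the whole prediction path with a single helper. The sketch below is a minimal, hypothetical example: `time_inference` and `dummy_predict` are illustrative names that do not appear in the notebook, and the dummy function only stands in for a real cleaning, tokenization, and prediction pipeline.

```python
import time
from statistics import mean


def time_inference(predict_fn, texts, repeats=3):
    """Return mean wall-clock seconds per call and per text for predict_fn(texts).

    predict_fn is assumed to run the whole path being compared
    (cleaning, tokenization, model compute) on a list of strings.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(texts)
        durations.append(time.perf_counter() - start)
    total = mean(durations)
    return {"total_seconds": total, "seconds_per_text": total / len(texts)}


def dummy_predict(texts):
    # Hypothetical stand-in for a real model: strips each text, returns length parity.
    return [len(t.strip()) % 2 for t in texts]


timing = time_inference(dummy_predict, ["hello", "world  ", "call center"])
print(
    f"Total: {timing['total_seconds']:.6f} s, "
    f"per text: {timing['seconds_per_text']:.6f} s"
)
```

For Model 2, `predict_fn` could wrap `encode_texts` followed by `model.predict`, so that the encoding time is counted as part of inference.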