# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from a TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa

**You will be asked to submit 3 different files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.
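Since the comparison asks for inference time as well as accuracy, here is a minimal sketch of one way to time a prediction call. The helper `time_inference` and the `dummy_predict` function are hypothetical stand-ins, not part of the assignment template; any of the three models' predict functions could be passed in instead.

```python
import time
from statistics import mean

def time_inference(predict_fn, inputs, n_repeats=5):
    """Return the mean wall-clock seconds of predict_fn(inputs) over n_repeats runs."""
    durations = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        predict_fn(inputs)
        durations.append(time.perf_counter() - start)
    return mean(durations)

# Hypothetical stand-in for a real model's predict method
def dummy_predict(texts):
    return [len(t) % 2 for t in texts]

sample_texts = ["internet ใช้ไม่ได้", "สอบถามยอดค้างชำระ"] * 100
avg_seconds = time_inference(dummy_predict, sample_texts)
print(f"Mean inference time: {avg_seconds:.6f} s for {len(sample_texts)} texts")
```

Averaging over several runs reduces noise from one-off effects such as caching. For a fair comparison, the same inputs and the same hardware should be used for every model.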

In [1]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
# !pip install pythainlp

## Import Libs

In [2]:
%matplotlib inline
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from torch.utils.data import Dataset
from IPython.display import display
from collections import defaultdict
from sklearn.metrics import accuracy_score

#My import 
np.random.seed(42)
from sklearn.model_selection import train_test_split


## Loading data
First, we load the data from disk into a Dataframe.

A Dataframe is essentially a table, or 2D-array/Matrix with a name for each column.

In [3]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')

Let's preview the data.

In [4]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

| | Sentence Utterance | Action | Object |
|---|---|---|---|
| 0 | <PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte... | enquire | payment |
| 1 | internet ยังความเร็วอยุ่เท่าไหร ครับ | enquire | package |
| 2 | ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้... | report | suspend |
| 3 | พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ... | enquire | internet |
| 4 | ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ... | report | phone_issues |


| | Sentence Utterance | Action | Object |
|---|---|---|---|
| count | 16175 | 16175 | 16175 |
| unique | 13389 | 10 | 33 |
| top | บริการอื่นๆ | enquire | service |
| freq | 97 | 10377 | 2525 |


## Data cleaning

We call `DataFrame.describe()` again.
Notice that there are 33 unique labels/classes for Object and 10 unique labels for Action that the model will try to predict.
However, some labels are unwanted duplicates that differ only in letter case, e.g. Idd, idd, loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances out of 16175 utterances. You have to clean that too!

## #TODO 0.1:
You will have to remove unwanted label duplications as well as duplications in text inputs.
Also, you will have to trim out unwanted whitespaces from the text inputs.
This shouldn't be too hard, as you have already seen it in the demo.
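The cleaning steps described above can be sketched on a toy DataFrame. The data and the column names `input`, `raw_label`, and `clean_label` are illustrative assumptions; this is one possible ordering of the steps, not the reference solution.

```python
import pandas as pd

# Toy data, not the real call-center file
toy_df = pd.DataFrame({
    "input": ["  check my bill ", "check my bill", "roaming rate?", "roaming rate?"],
    "raw_label": ["Bill", "bill", "IDD", "idd"],
})

# Strip surrounding whitespace and lowercase the labels in one step
cleaned = toy_df.assign(
    input=toy_df["input"].str.strip(),
    clean_label=toy_df["raw_label"].str.lower(),
).drop(columns="raw_label")

# Drop repeated utterances, keeping the first occurrence
cleaned = cleaned.drop_duplicates("input", keep="first").reset_index(drop=True)
print(cleaned)
```

Stripping before deduplicating means that utterances differing only in leading or trailing whitespace collapse into one row. Doing the steps in the opposite order would keep both.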



In [5]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

| | Sentence Utterance | Action | Object |
|---|---|---|---|
| count | 16175 | 16175 | 16175 |
| unique | 13389 | 10 | 33 |
| top | บริการอื่นๆ | enquire | service |
| freq | 97 | 10377 | 2525 |


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [6]:
# TODO.1: Data cleaning

# select only the columns that we need
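# .copy() avoids pandas' SettingWithCopyWarning when adding a column below
data_df = data_df[['Sentence Utterance', 'Object']].copy()
data_df.columns = ['input', 'raw_label']

# merge labels that differ only in letter case (e.g. 'IDD', 'Idd', 'idd')
data_df['clean_label'] = data_df['raw_label'].str.lower()
data_df = data_df.drop(columns='raw_label')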

# remove duplicate input rows
data_df = data_df.drop_duplicates("input", keep="first")
display(data_df.describe())

| | input | clean_label |
|---|---|---|
| count | 13389 | 13389 |
| unique | 13389 | 26 |
| top | สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ | service |
| freq | 1 | 2111 |


In [7]:
# map label to number and convert to numpy "data"
data = data_df.to_numpy()
unique_label = data_df.clean_label.unique()

label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))

# convert label to number
data[:,1] = np.vectorize(label_2_num_map.get)(data[:,1])

In [8]:
# Input string cleaning
def strip_str(string):
    return string.strip()
     
# Trim off extra leading and trailing whitespace in the string
data[:,0] = np.vectorize(strip_str)(data[:,0])

### Keep in mind that the classes are imbalanced

Split the data into train, validation, and test sets (normally in an 80:10:10 ratio, respectively). We recommend using train_test_split from scikit-learn to split the data into train, validation, and test sets.

In addition, the split should keep the label distribution similar across the train, validation, and test sets. The **stratify** option handles this.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Make sure the same data splitting is used for all models.
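As a minimal sketch of the 80:10:10 stratified split described above, the following uses toy data with an 80/20 class imbalance. The toy texts and labels are assumptions for illustration only; fixing `random_state` is what makes the split reproducible across the three model notebooks.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 of class 1
texts = [f"utterance {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# First split off 20% as a temporary pool, then halve it into val and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Each split should keep roughly the same 80/20 class ratio
for name, split in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, dict(Counter(split)))
```

A stratified split needs enough samples of every class in each stage, so very rare classes may have to be distributed by hand, as the next cell does for the class with only 4 samples.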

In [9]:
X = data[:,0]
y = data[:,1]

X_class_24, y_class_24 = X[y == 24], y[y == 24] #handle class 24 with only 4 samples
X, y = X[y != 24], y[y != 24] # remove class 24 from the data to add it later

random_idx = np.random.choice(len(X_class_24), 2, replace=False)

X_class_24_val, X_class_24_test = X_class_24[random_idx[0]], X_class_24[random_idx[1]]
y_class_24_val, y_class_24_test = y_class_24[random_idx[0]], y_class_24[random_idx[1]]

X_class_24_train = np.delete(X_class_24, random_idx)
y_class_24_train = np.delete(y_class_24, random_idx)

# train val test split 80:10:10
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# add class 24 to the train, val, test set
X_train, y_train = np.append(X_train, X_class_24_train), np.append(y_train, y_class_24_train)
X_val, y_val = np.append(X_val, X_class_24_val), np.append(y_val, y_class_24_val)
X_test, y_test = np.append(X_test, X_class_24_test), np.append(y_test, y_class_24_test)

#check label distribution
# display(np.unique(y_train, return_counts=True))

In [10]:
def create_dataset_dict(dataset, data_split, split_name, keep_ws=False):
    for input_str, label in data_split:
        dataset[split_name]["input"].append(input_str)
        dataset[split_name]["label"].append(label)

dataset = { "train": {"input": [], "label": []},
            "val": {"input": [], "label": []},
            "test": {"input": [], "label": []} ,
            'label_2_num_map': label_2_num_map,
            'num_2_label_map': num_2_label_map
            }

create_dataset_dict(dataset, zip(X_train, y_train), "train")
create_dataset_dict(dataset, zip(X_val, y_val), "val")
create_dataset_dict(dataset, zip(X_test, y_test), "test")

# save as pickle
import pickle
with open('template_cleaned_dataset.pkl', 'wb') as f:
    pickle.dump(dataset, f)

print(dataset["train"]["input"][:5])
print(dataset["train"]["label"][:5])
print(dataset["label_2_num_map"])

['เดือนละ 150 บาทเล่นได้ทั้งวันหรือเปล่า', 'ครับ มีการชำระค่าบริการมาแล้ว ดูสัญญาณให้หน่อย', 'จ่ายค่าทรูมูฟค่ะ สัญญาณใช้ได้วันนี้เลยหรือเปล่าค่ะ', 'ขายบัตรทรูมันนี่ ที่ไหน', 'เมื่อคืนนี้ผมโทรมาแจ้งเรื่องโทรศัพท์หาไม่เจอ และระงับไว้ ตอนนี้เจอ อยู่ใต้เบาะรถ ต้องการเปิดสัญญาณการใช้งาน']
[1, 0, 0, 15, 5]
{'payment': 0, 'package': 1, 'suspend': 2, 'internet': 3, 'phone_issues': 4, 'service': 5, 'nontruemove': 6, 'balance': 7, 'detail': 8, 'bill': 9, 'credit': 10, 'promotion': 11, 'mobile_setting': 12, 'iservice': 13, 'roaming': 14, 'truemoney': 15, 'information': 16, 'lost_stolen': 17, 'balance_minutes': 18, 'idd': 19, 'garbage': 20, 'ringtone': 21, 'rate': 22, 'loyalty_card': 23, 'contact': 24, 'officer': 25}


# Model 1 TF-IDF

Build a model to train a tf-idf text classifier. Use a simple logistic regression model for the classifier.

For this part, you may find this [tutorial](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) helpful.

Below are some design choices you need to consider to accomplish this task. Be sure to answer them when you submit your model.

What tokenizer will you use? Why?

**Ans:** **"newmm"**. It is fast and efficient, and its word-level tokenization accuracy on the BEST benchmark is good enough for this task.

Will you ignore some stop words (a, an, the, to, etc. for English) in your tf-idf? Is it important?
PythaiNLP provides a list of stopwords if you want to use (https://pythainlp.org/docs/2.0/api/corpus.html#pythainlp.corpus.common.thai_stopwords)

**Ans:** I will not remove Thai stop words explicitly. Instead, I use the `max_df` parameter of `TfidfVectorizer()` to drop words that appear in too many documents.
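To illustrate how `max_df` can act as a corpus-specific stop-word filter, here is a small sketch on a toy English corpus. The corpus is an assumption for illustration; the default scikit-learn tokenizer is used here instead of newmm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy English corpus; "the" appears in every document
docs = ["the cat sat", "the dog ran", "the bird flew", "the cat ran"]

# Terms whose document frequency exceeds 70% of the documents are dropped
vectorizer = TfidfVectorizer(max_df=0.7)
vectorizer.fit(docs)

vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
```

Here "the" occurs in all four documents, so it is excluded from the vocabulary, while "cat" and "ran" (two documents each) are kept. Unlike a fixed stop-word list, the cutoff adapts to whatever words happen to be frequent in the training data.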

The dictionary of TF-IDF is usually based on the training data. How many words in the test set are OOVs?

**Ans:**

- Number of OOV words in validation set: 212
- Number of OOV words in test set: 187
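- OOV ratio in validation set (OOV word types divided by training vocabulary size): 0.05898720089037284
- OOV ratio in test set (OOV word types divided by training vocabulary size): 0.05203116304952699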
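The OOV check in this notebook divides the number of OOV word types by the training vocabulary size. Two other common denominators are the number of distinct word types in the evaluation set and the total number of evaluation tokens. The sketch below computes both on toy data; `oov_stats` is a hypothetical helper, and whitespace splitting stands in for the newmm tokenizer.

```python
def oov_stats(train_docs, eval_docs, tokenize=str.split):
    """Return (OOV types, type-level ratio, token-level ratio) for eval_docs."""
    vocab = {tok for doc in train_docs for tok in tokenize(doc)}
    eval_tokens = [tok for doc in eval_docs for tok in tokenize(doc)]
    oov_tokens = [tok for tok in eval_tokens if tok not in vocab]
    oov_types = set(oov_tokens)
    # Share of distinct evaluation words that were never seen in training
    type_ratio = len(oov_types) / len(set(eval_tokens))
    # Share of running evaluation tokens that were never seen in training
    token_ratio = len(oov_tokens) / len(eval_tokens)
    return oov_types, type_ratio, token_ratio

train_docs = ["pay my bill", "check my balance"]
eval_docs = ["pay roaming bill", "roaming rate"]
oov_types, type_ratio, token_ratio = oov_stats(train_docs, eval_docs)
print(sorted(oov_types), type_ratio, token_ratio)
```

The token-level ratio weights each OOV word by how often it occurs, so it says more about how much of the evaluation text the model cannot represent.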

In [11]:
from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

print(set(thai_stopwords()))

{'ฯล', 'ให้แด่', 'พอกัน', 'ประการ', 'มองว่า', 'ดั่งเคย', 'คงอยู่', 'เช่น', 'อย่างนี้', 'แท้', 'หรือไง', 'รึ', 'คิดว่า', 'สั้น', 'ฉะนั้น', 'ที่สุด', 'ล้วนจน', 'เล็กน้อย', 'มี', 'ง่ายๆ', 'ทั้งนี้', 'เมื่อครั้งก่อน', 'กว่า', 'ที่แล้ว', 'ยิ่งขึ้นไป', 'ข้างๆ', 'ความ', 'ใดๆ', 'ค่อยไปทาง', 'เป็นอันว่า', 'ถูกต้อง', 'เสียนั่น', 'จรดกับ', 'ภายหลัง', 'เมื่อเย็น', 'ทั้งมวล', 'นางสาว', 'เขียน', 'เท่าที่', 'ค่อย', 'สมัยก่อน', 'แม้แต่', 'คงจะ', 'บ่อยกว่า', 'ทว่า', 'รือว่า', 'ทุกแห่ง', 'ข้างล่าง', 'ให้แก่', 'น่า', 'ครั้งละ', 'ด้วยเหตุนี้', 'เหล่านั้น', 'ผิด', 'ใคร่', 'พอสม', 'จนแม้', 'เช่นก่อน', 'จริงๆ', 'เกือบๆ', 'ถือ', 'ส่วนที่', 'ปัจจุบัน', 'ช่วงๆ', 'เหล่า', 'อัน', 'นอกเหนือ', 'ภาค', 'ตลอด', 'นำมา', 'พวกที่', 'พร้อม', 'ส่วนมาก', 'ทุก', 'ซึ่ง', 'รับรอง', 'คล้ายกับว่า', 'วันนี้', 'ใหญ่ๆ', 'นัก', 'พวกเขา', 'ถ้า', 'ถือว่า', 'ตลอดกาล', 'พอๆ', 'ด้วยเหตุนั้น', 'อย่าง', 'เกือบ', 'ถ้าหาก', 'อันที่จะ', 'อย่างเช่น', 'เกี่ยวกับ', 'เกี่ยวเนื่อง', 'ก็ได้', 'ออก', 'นอกจากที่', 'มัน', 'แค่ไหน', 'พวกนี้', 'ตลอดศก',

In [12]:
pythainlp_tokenizer = lambda x: word_tokenize(x, engine="newmm", keep_whitespace=False)

# Extract tokenized text and labels
train_texts, train_labels = dataset["train"]["input"], dataset["train"]["label"]
val_texts, val_labels = dataset["val"]["input"], dataset["val"]["label"]
test_texts, test_labels = dataset["test"]["input"], dataset["test"]["label"]

# map texts to lowercase
train_texts = [text.lower() for text in train_texts]
val_texts = [text.lower() for text in val_texts]
test_texts = [text.lower() for text in test_texts]

# Initialize and fit TF-IDF Vectorizer on training data
vectorizer = TfidfVectorizer(tokenizer=pythainlp_tokenizer, max_df=0.7, min_df=1)
X_train_tfidf = vectorizer.fit_transform(train_texts)  # Fit and transform training data

X_val_tfidf = vectorizer.transform(val_texts)  # Transform validation data
X_test_tfidf = vectorizer.transform(test_texts)  # Transform test data

# Print shape of TF-IDF matrices
print("Train shape:", X_train_tfidf.shape)
print("Val shape:", X_val_tfidf.shape)
print("Test shape:", X_test_tfidf.shape)

# vectorizer get feature names
feature_names = vectorizer.get_feature_names_out()
print("Number of features:", len(feature_names))

# Print the TF-IDF features of the first training sample
print("Sample TF-IDF features (first training sample):")
print(X_train_tfidf[0])

Train shape: (10710, 3594)
Val shape: (1339, 3594)
Test shape: (1340, 3594)
Number of features: 3594
Sample TF-IDF features (first training sample):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8 stored elements and shape (1, 3594)>
  Coords	Values
  (0, 2856)	0.2885629413746919
  (0, 2174)	0.3644740882130535
  (0, 127)	0.45100667026141533
  (0, 1677)	0.28148034536449795
  (0, 3122)	0.28025673780847243
  (0, 3522)	0.16197780948030555
  (0, 1455)	0.5452004308679377
  (0, 2514)	0.31500429643097466


In [13]:
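# check for OOV words in the validation and test sets
# a set gives constant-time membership tests (the feature-name array would be scanned linearly)
word_dict = set(vectorizer.get_feature_names_out())

def get_oov_words(texts):
    """Collect tokens in texts that are absent from the TF-IDF vocabulary."""
    oov_set = set()
    for text in texts:
        for word in pythainlp_tokenizer(text):
            if word not in word_dict:
                oov_set.add(word)
    return oov_set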

oov_val = get_oov_words(val_texts)
oov_test = get_oov_words(test_texts)

print("Number of OOV words in validation set:", len(oov_val))
print("Number of OOV words in test set:", len(oov_test))
print("OOV ratio in validation set:", len(oov_val) / len(word_dict))
print("OOV ratio in test set:", len(oov_test) / len(word_dict))

Number of OOV words in validation set: 212
Number of OOV words in test set: 187
OOV ratio in validation set: 0.05898720089037284
OOV ratio in test set: 0.05203116304952699


## Define a LogisticRegression model with tf-idf as feature.

In [14]:
# GridSearchCV selects the best hyperparameters via cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the hyperparameter grid
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV( estimator=LogisticRegression(max_iter=1000, class_weight="balanced"), 
                            param_grid=param_grid, 
                            cv=5, 
                            n_jobs=-1, 
                            verbose=2)

In [15]:
# Fit the GridSearchCV object on the training data
grid_search.fit(X_train_tfidf, train_labels)

# get the best model
best_model = grid_search.best_estimator_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.3s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.3s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.3s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.3s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.3s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=   0.4s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=   0.4s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=   0.4s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=   0.5s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=   0.4s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=   0.5s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=   0.6s
[CV] END ................C=0

[CV] END .................C=10, penalty=l1, solver=liblinear; total time=   5.2s
[CV] END ................C=100, penalty=l1, solver=liblinear; total time=   5.0s
[CV] END ................C=100, penalty=l1, solver=liblinear; total time=   5.5s
[CV] END .................C=10, penalty=l1, solver=liblinear; total time=  13.6s


In [16]:
best_model

## Evaluation

In [17]:
# Evaluate the best model on the validation data
val_preds = best_model.predict(X_val_tfidf)
val_acc = accuracy_score(val_labels, val_preds)
print("Validation Accuracy:", val_acc)

test_preds = best_model.predict(X_test_tfidf)
test_acc = accuracy_score(test_labels, test_preds)
print("Test Accuracy:", test_acc)


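# Generate classification report on test data
# pass explicit labels so each name lines up with its numeric class id
print(classification_report(test_labels, test_preds,
                            labels=list(num_2_label_map.keys()),
                            target_names=list(num_2_label_map.values())))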

Validation Accuracy: 0.7303958177744585
Test Accuracy: 0.7231343283582089
                 precision    recall  f1-score   support

        payment       0.68      0.86      0.76        64
        package       0.73      0.68      0.71       180
        suspend       0.81      0.77      0.79        73
       internet       0.72      0.77      0.74       179
   phone_issues       0.63      0.76      0.69        58
        service       0.82      0.69      0.75       211
    nontruemove       0.39      0.44      0.42        25
        balance       0.87      0.83      0.85       149
         detail       0.46      0.39      0.43        33
           bill       0.75      0.80      0.77        54
         credit       0.83      0.88      0.86        17
      promotion       0.69      0.69      0.69       115
 mobile_setting       0.42      0.46      0.44        28
       iservice       0.50      0.50      0.50         2
        roaming       0.87      0.80      0.83        25
      truemon

