# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa

**You will be asked to submit 3 different files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free-form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.

In [1]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

In [2]:
# !pip install pythainlp

## Import Libs

In [3]:
%matplotlib inline
import pandas
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from torch.utils.data import Dataset
from IPython.display import display
from collections import defaultdict
from sklearn.metrics import accuracy_score

## Loading data
First, we load the data from disk into a Dataframe.

A Dataframe is essentially a table, or 2D-array/Matrix with a name for each column.

In [4]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')

Let's preview the data.

In [5]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call the DataFrame.describe() again.
Notice that there are 33 unique labels/classes for object and 10 unique labels for action that the model will try to predict.
However, some labels are unwanted duplicates that differ only in capitalization, e.g. Idd, idd, IDD and loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances among the 16175 utterances. You have to clean that too!

## #TODO 0.1:
You will have to remove unwanted label duplications as well as duplications in text inputs.
Also, you will have to trim unwanted whitespace from the text inputs.
This shouldn't be too hard, as you have already seen it in the demo.



In [6]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [7]:
# TODO.1: Data cleaning
data_df['clean Sentence Utterance'] = data_df['Sentence Utterance'].str.strip().copy()
# data_df['clean Action'] = data_df['Action'].str.lower().copy()
data_df['clean Object'] = data_df['Object'].str.lower().copy()

# data_df.drop_duplicates("Sentence Utterance", keep="first", inplace=True)
data_df.drop_duplicates("clean Sentence Utterance", keep="first", inplace=True)

data_df.drop('Sentence Utterance', axis=1, inplace=True)
data_df.drop('Action', axis=1, inplace=True)
data_df.drop('Object', axis=1, inplace=True)


data_df.describe()

# idx = 1
# print(f'"{data_df["Sentence Utterance"][idx]}"')
# print(f'"{data_df["clean Sentence Utterance"][idx]}"')


Unnamed: 0,clean Sentence Utterance,clean Object
count,13367,13367
unique,13367,26
top,สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ,service
freq,1,2108


In [8]:
data = data_df.to_numpy()
unique_label = data_df['clean Object'].unique()

label_2_num = dict(zip(unique_label, range(len(unique_label))))
num_2_label = dict(zip(range(len(unique_label)), unique_label))

display(label_2_num)
display(num_2_label)

display(data[:, 1])
# Map each string label to its integer id
data[:, 1] = np.vectorize(label_2_num.get)(data[:, 1])
display(data[:, 1])

{'payment': 0,
 'package': 1,
 'suspend': 2,
 'internet': 3,
 'phone_issues': 4,
 'service': 5,
 'nontruemove': 6,
 'balance': 7,
 'detail': 8,
 'bill': 9,
 'credit': 10,
 'promotion': 11,
 'mobile_setting': 12,
 'iservice': 13,
 'roaming': 14,
 'truemoney': 15,
 'information': 16,
 'lost_stolen': 17,
 'balance_minutes': 18,
 'idd': 19,
 'garbage': 20,
 'ringtone': 21,
 'rate': 22,
 'loyalty_card': 23,
 'contact': 24,
 'officer': 25}

{0: 'payment',
 1: 'package',
 2: 'suspend',
 3: 'internet',
 4: 'phone_issues',
 5: 'service',
 6: 'nontruemove',
 7: 'balance',
 8: 'detail',
 9: 'bill',
 10: 'credit',
 11: 'promotion',
 12: 'mobile_setting',
 13: 'iservice',
 14: 'roaming',
 15: 'truemoney',
 16: 'information',
 17: 'lost_stolen',
 18: 'balance_minutes',
 19: 'idd',
 20: 'garbage',
 21: 'ringtone',
 22: 'rate',
 23: 'loyalty_card',
 24: 'contact',
 25: 'officer'}

array(['payment', 'package', 'suspend', ..., 'balance', 'balance',
       'package'], dtype=object)

array([0, 1, 2, ..., 7, 7, 1], dtype=object)

Split the data into train, validation, and test sets (normally the ratio is 80:10:10, respectively). We recommend using train_test_split from scikit-learn to split the data into train, validation, and test sets.

In addition, the split should keep the label distribution similar across the train, validation, and test sets. The **stratify** option handles this.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Make sure the same data splitting is used for all models.
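
As a minimal sketch of the 80:10:10 stratified split described above, the following calls `train_test_split` twice on hypothetical toy data (the texts, labels, and variable names are illustrative and not taken from the homework dataset). Fixing `random_state` is one way to keep the split identical across notebooks.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 utterances with two imbalanced labels.
texts = [f"utterance {i}" for i in range(100)]
labels = ["internet"] * 60 + ["payment"] * 40

# Use absolute counts so validation and test each get 10% of the full data.
n_holdout = len(texts) // 10

# First split: hold out the test set while preserving label proportions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    texts, labels, test_size=n_holdout, stratify=labels, random_state=42
)

# Second split: carve the validation set out of the remaining 90%.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=n_holdout, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(Counter(y_train), Counter(y_val), Counter(y_test))
```

Stratified splitting needs at least two samples per class, and very rare classes may still end up with no examples in the validation or test set.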

In [9]:
bin_label = np.bincount(np.array(data[:, 1], dtype=int))
# print(data[:, 1])
print(bin_label)

[ 641 1791  730 1786  581 2108  246 1478  327  540  173 1142  280   22
  246  248  296  231   50  206   49   79   36   67    4   10]


In [16]:
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

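# Hold out 10% of the full data as the test set
sss_train_valtest = StratifiedShuffleSplit(n_splits=1, test_size=1/10, random_state=42)
# Take 1/9 of the remaining 90% as the validation set (about 10% of the full data)
sss_val_test = StratifiedShuffleSplit(n_splits=1, test_size=1/9, random_state=42)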

print(data.shape)
trainval_idx, test_idx = next(sss_train_valtest.split(data[:, 0], data[:, 1]))
trainval_raw = data[trainval_idx]
test_raw = data[test_idx]
# print(trainval_raw.shape, test_raw.shape)

train_idx, val_idx = next(sss_val_test.split(trainval_raw[:, 0], trainval_raw[:, 1]))
train_raw = trainval_raw[train_idx]
val_raw = trainval_raw[val_idx]

print(train_raw.shape, val_raw.shape, test_raw.shape)

(13367, 2)
(10693, 2) (1337, 2) (1337, 2)


In [17]:
import pythainlp
from pythainlp import word_tokenize

thai_stopwords = pythainlp.corpus.thai_stopwords()
thai_stopwords = list(thai_stopwords)
print(thai_stopwords)
tokenizer = pythainlp.tokenize.word_tokenize

print(tokenizer('ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้'))

['ยิ่งจะ', 'แค่ว่า', 'เป็นแต่เพียง', 'ขอ', 'ถ้าหาก', 'บอกแล้ว', 'อื่น', 'ก็ดี', 'อย่างไรเสีย', 'จากนี้ไป', 'ข้าง', 'อย่างดี', 'อย่างๆ', 'หรือเปล่า', 'จะได้', 'มิได้', 'ครั้งที่', 'เสียยิ่งนัก', 'มุ่งหมาย', 'ข้างบน', 'เพิ่ม', 'อย่างนั้น', 'คุณๆ', 'ค่อนมาทาง', 'มั้ยเนี่ย', 'นับแต่นั้น', 'นำ', 'เท่าใด', 'พวกมัน', 'ตลอดวัน', 'เป็นเพียง', 'จนแม้น', 'ณ', 'เดียว', 'เยอะแยะ', 'คราวโน้น', 'แหละ', 'บางที', 'ขณะใด', 'ครบถ้วน', 'จำเป็น', 'จน', 'ด้วยที่', 'ความ', 'ครา', 'อย่างที่', 'ด้วยเหตุนี้', 'จัดแจง', 'ภายภาค', 'พร้อมทั้ง', 'ตลอดระยะเวลา', 'นี่', 'ทุกคราว', 'รวมๆ', 'หน่อย', 'ร่วมกัน', 'วันนั้น', 'เถอะ', 'บางขณะ', 'พวกท่าน', 'เถิด', 'เขียน', 'ที่ใด', 'เช่นเมื่อ', 'ยิ่ง', 'ที่ไหน', 'ดังเคย', 'ยังจะ', 'แห่งไหน', 'เล็ก', 'ข้า', 'ล่าสุด', 'นะ', 'แต่ทว่า', 'รวด', 'กันเถอะ', 'คราวหลัง', 'ให้ไป', 'แห่ง', 'ยังโง้น', 'แค่จะ', 'เป็นเพียงว่า', 'เป็น', 'นี้เอง', 'นี้แหล่', 'ช้า', 'กันและกัน', 'สมัยก่อน', 'ไม่ค่อยจะ', 'ในเมื่อ', 'กล่าวคือ', 'อย่างโน้น', 'ฯลฯ', 'ที่นั้น', 'ช่วย', 'ครั้งหลังสุด', 'ทีๆ', 'ด้วย

# Model 1 TF-IDF

Build a model to train a tf-idf text classifier. Use a simple logistic regression model for the classifier.

For this part, you may find this [tutorial](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) helpful.
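
One possible way to keep vectorization and classification together is a scikit-learn pipeline, so that the same preprocessing is applied at training and inference time. The sketch below is illustrative only: it uses hypothetical English toy utterances and the default tokenizer, whereas Thai text needs a word tokenizer (for example the PyThaiNLP one used in this notebook) passed through the `tokenizer=` argument.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (English stand-ins for the Thai utterances).
train_texts = [
    "my internet is very slow",
    "internet keeps disconnecting",
    "i want to pay my bill",
    "how can i pay this month",
    "roaming rates for japan",
    "activate roaming before my trip",
]
train_labels = ["internet", "internet", "payment", "payment", "roaming", "roaming"]

# The pipeline fits the TF-IDF vocabulary and the classifier in one call.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before classifying.
preds = model.predict(["the internet is slow again", "where can i pay"])
print(preds)
```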

Below are some design choices you need to consider to accomplish this task. Be sure to answer them when you submit your model.

What tokenizer will you use? Why?

**Ans: I use the PyThaiNLP word tokenizer because the dataset is in Thai, so this tokenizer should suit the task of tokenizing Thai sentences.**

Will you ignore some stop words (a, an, the, to, etc. for English) in your tf-idf? Is it important?
PyThaiNLP provides a list of stop words if you want to use one (https://pythainlp.org/docs/2.0/api/corpus.html#pythainlp.corpus.common.thai_stopwords)

**Ans: In my experiments, removing stop words gives a test-set accuracy of 0.69, while keeping them reaches 0.74. The sentences in this dataset are probably too short, so removing stop words strips away too much of each sentence's context.**
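
To illustrate why stop-word removal can hurt on short utterances, the sketch below measures how many tokens survive filtering. The token lists and the stop-word set are hypothetical English stand-ins, not the PyThaiNLP list or the homework data.

```python
# Hypothetical pre-tokenized short utterances (English stand-ins).
utterances = [
    ["can", "i", "pay", "my", "bill"],
    ["is", "it", "not", "working"],
    ["how", "much", "is", "roaming"],
]
# Hypothetical stop-word set, for illustration only.
stop_words = {"can", "i", "my", "is", "it", "not", "how", "much"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


kept = [remove_stop_words(u, stop_words) for u in utterances]
total_before = sum(len(u) for u in utterances)
total_after = sum(len(k) for k in kept)

for before, after in zip(utterances, kept):
    print(before, "->", after)
print(f"tokens kept: {total_after}/{total_before}")
```

In this toy case the negation in the second utterance disappears, so "not working" and "working" become indistinguishable to the classifier.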

The dictionary of TF-IDF is usually based on the training data. How many words in the test set are OOVs?

**Ans: OOV words occur 8193 times in the test set, covering 473 unique words.**
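
The two numbers in the answer count different things: total OOV occurrences (tokens) and distinct OOV words (types). A minimal sketch of the distinction on hypothetical toy data, using a whitespace split in place of the Thai tokenizer:

```python
from collections import Counter

# Hypothetical toy corpora; a real run would use the notebook's tokenizer.
train_sentences = ["pay my bill", "check my balance"]
test_sentences = ["pay roaming bill", "roaming balance please"]

# The vocabulary is built from the training data only.
vocab = {tok for sent in train_sentences for tok in sent.split()}

# Count every test token that is missing from the training vocabulary.
oov_counts = Counter(
    tok for sent in test_sentences for tok in sent.split() if tok not in vocab
)

n_oov_tokens = sum(oov_counts.values())  # total OOV occurrences
n_oov_types = len(oov_counts)            # distinct OOV words
print(n_oov_tokens, n_oov_types)
```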

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorizer = TfidfVectorizer(tokenizer=tokenizer, stop_words=None)
vectorizer = TfidfVectorizer(tokenizer=tokenizer, stop_words=thai_stopwords)

X_train = vectorizer.fit_transform(train_raw[:, 0])
X_val = vectorizer.transform(val_raw[:, 0])
X_test = vectorizer.transform(test_raw[:, 0])

y_train = train_raw[:, 1].astype(int)
y_val = val_raw[:, 1].astype(int)
y_test = test_raw[:, 1].astype(int)

print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)

(10693, 3286) (1337, 3286) (1337, 3286)
(10693,) (1337,) (1337,)


In [33]:
example = train_raw[4, 0]
print(f"'{example}'")
print(vectorizer.transform([example]))
print(vectorizer.get_feature_names_out()[np.where(vectorizer.transform([example]).toarray()[0] > 0)])

'คะจะรบกวนสอบถามนิดนึงว่า พอดีใช้โทรศัพท์บีบีใช้ไหม แล้วเล่น ไลน์ ได้ไหม'
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9 stored elements and shape (1, 3286)>
  Coords	Values
  (0, 1)	0.3001256331419422
  (0, 1480)	0.267998540180016
  (0, 1555)	0.5518381273514346
  (0, 1880)	0.23692016311205774
  (0, 2191)	0.12867156319941367
  (0, 2858)	0.2278119799118207
  (0, 3136)	0.20822971804046236
  (0, 3258)	0.3808233638797442
  (0, 3265)	0.4696851977498628
[' ' 'นิดนึง' 'บี' 'รบกวน' 'สอบถาม' 'เล่น' 'โทรศัพท์' 'ไลน์' 'ไหม']


In [34]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)


In [35]:
print("Accuracy on test set:", clf.score(X_test, y_test))

Accuracy on test set: 0.6925953627524308


In [42]:
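# Build the vocabulary once as a set for fast membership tests
vocab = set(vectorizer.get_feature_names_out())
# Note: if the vectorizer was fit with stop_words, those words are absent
# from the vocabulary and are therefore counted as OOV here as well
oov_list = []
for sent in test_raw[:, 0]:
    for word in tokenizer(sent):
        if word not in vocab:
            oov_list.append(word)
print(len(oov_list))  # total OOV occurrences
oov_list_np = np.array(oov_list)
print(np.unique(oov_list_np).shape)  # number of distinct OOV words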

8193
(473,)


# Comparison

After you have completed the 3 models, compare the accuracy, ease of implementation, and inference speed (from cleaning and tokenization through model computation) of the three models in mycourseville.
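
A hedged sketch of how end-to-end inference time might be measured so that all three models are timed the same way. `time_inference` and `toy_predict` are hypothetical names introduced here for illustration; in practice `predict_fn` would wrap cleaning, tokenization, and the model's prediction step.

```python
import time


def time_inference(predict_fn, texts, n_repeats=3):
    """Run predict_fn over texts n_repeats times; return (predictions, best seconds)."""
    best = float("inf")
    preds = None
    for _ in range(n_repeats):
        start = time.perf_counter()
        preds = predict_fn(texts)
        best = min(best, time.perf_counter() - start)
    return preds, best


def toy_predict(texts):
    """Hypothetical stand-in for a full pipeline: clean, then classify by keyword."""
    cleaned = [t.strip().lower() for t in texts]
    return ["internet" if "internet" in t else "service" for t in cleaned]


preds, seconds = time_inference(toy_predict, ["  Internet is slow ", "other services"])
print(preds, f"{seconds * 1000:.3f} ms")
```

Taking the best of several repeats reduces noise from one-off effects such as caching; reporting the mean instead is an equally reasonable choice as long as it is applied consistently to all three models.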