# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa

**You will be asked to submit 3 different files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free-form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.

In [1]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
# !pip install pythainlp

## Import Libs

In [2]:
%matplotlib inline
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from torch.utils.data import Dataset
from IPython.display import display
from collections import defaultdict
from sklearn.metrics import accuracy_score

# My imports
import pickle
from sklearn.model_selection import train_test_split

np.random.seed(42)  # fix the NumPy seed for reproducibility

## Loading the cleaned dataset from my folder

In [3]:
with open('template_cleaned_dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)

# Extract tokenized text and labels
label_2_num_map, num_2_label_map = dataset["label_2_num_map"], dataset["num_2_label_map"]
train_texts, train_labels = dataset["train"]["input"], dataset["train"]["label"]
val_texts, val_labels = dataset["val"]["input"], dataset["val"]["label"]
test_texts, test_labels = dataset["test"]["input"], dataset["test"]["label"]

# Model 2 MUSE

Build a simple logistic regression model using features from the MUSE model.

Which MUSE model will you use? Why?

**Ans:** 

- I use sentence-transformers/use-cmlm-multilingual because it has more likes on Hugging Face, and the other conversion does not have native Hugging Face support.

MUSE is typically used with TensorFlow. However, there are some community-made PyTorch conversions:

- https://huggingface.co/sentence-transformers/use-cmlm-multilingual
- https://huggingface.co/dayyass/universal-sentence-encoder-multilingual-large-3-pytorch

## Import libs for MUSE

In [4]:
from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_stopwords

from sentence_transformers import SentenceTransformer


from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

print(set(thai_stopwords()))

MODEL_NAME = 'sentence-transformers/use-cmlm-multilingual'



{'ตาม', 'ค่อนมาทาง', 'เคยๆ', 'ในช่วง', 'แต่', 'ด้วยเช่นกัน', 'มีแต่', 'เช่นไร', 'ไม่ค่อยเป็น', 'จำเป็น', 'ด้วยเหตุว่า', 'รึว่า', 'ถ้าจะ', 'เสร็จ', 'แห่งไหน', 'กันเถอะ', 'ยิ่ง', 'นอกนั้น', 'ทําให้', 'แล้วแต่', 'ประมาณ', 'จึงจะ', 'ใต้', 'ตลอดถึง', 'ถูกๆ', 'รวด', 'ยิ่งขึ้นไป', 'ได้ที่', 'เพียงพอ', 'เพิ่งจะ', 'ถูก', 'เอง', 'แก้ไข', 'จัดแจง', 'เดียวกัน', 'นี่แหละ', 'คุณ', 'ตลอดกาล', 'นี่เอง', 'เสียจนกระทั่ง', 'ใครๆ', 'ตลอดไป', 'มาก', 'ทุกแห่ง', 'ครา', 'จวบ', 'เสีย', 'อย่างน้อย', 'ต่างๆ', 'ตลอดระยะเวลา', 'ที่แท้', 'เล็กๆ', 'ขวางๆ', 'เพียงแต่', 'ทุกวันนี้', 'อาจ', 'เมื่อ', 'ยิ่งจน', 'เพียงไร', 'ถึงเมื่อใด', 'ครั้งไหน', 'ใหญ่', 'ทุกที', 'ดั่ง', 'เป็นแต่', 'ภายใน', 'เช่นดังว่า', '\ufeffๆ', 'ทั้งนั้น', 'กลุ่ม', 'บัดดล', 'ถ้าหาก', 'เถิด', 'เพราะว่า', 'นับจากนี้', 'ตลอดวัน', 'เช่นดัง', 'ใคร', 'ประการ', 'สุด', 'นิดหน่อย', 'รือ', 'แต่เมื่อ', 'ราย', 'กันไหม', 'ลง', 'ก็ต่อเมื่อ', 'พยายาม', 'เปิด', 'จึง', 'ของ', 'พูด', 'ก็แค่', 'เมื่อไร', 'ภายภาคหน้า', 'คราไหน', 'พอ', 'ทีละ', 'ครัน', 'ยัง', 'ทุกที่', '

## Initializing the model

In [5]:
model = SentenceTransformer('sentence-transformers/use-cmlm-multilingual')
model

Some weights of the model checkpoint at sentence-transformers/use-cmlm-multilingual were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [6]:
sentences = ["This is an example sentence", "Each sentence is converted"]

embeddings = model.encode(sentences)
print(embeddings)

[[ 0.01540179 -0.02176822 -0.01264849 ...  0.05045101 -0.02573791
   0.01540428]
 [ 0.00617291 -0.01532528 -0.03717829 ...  0.02221827 -0.02915066
  -0.09352502]]


In [7]:
def encode_split(texts, model=model):
    embeddings = model.encode(texts)
    return embeddings

emb_train = encode_split(train_texts)
emb_val = encode_split(val_texts)
emb_test = encode_split(test_texts)

In [8]:
emb_train.shape, emb_val.shape, emb_test.shape

((10710, 768), (1339, 768), (1340, 768))

## Build a simple logistic regression model with grid search

In [9]:
# Define the hyperparameter grid
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV( estimator=LogisticRegression(max_iter=1000, class_weight="balanced"), 
                            param_grid=param_grid, 
                            cv=5, 
                            n_jobs=-1, 
                            verbose=2)
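As a quick sanity check on the search size, the grid above can be expanded by hand. This is a minimal sketch using only the standard library; `cv_folds`, `candidates`, `n_candidates`, and `n_fits` are illustrative names, not part of the notebook. It only counts the cross-validation fits, and assumes the default `refit=True`, under which one extra fit on the full training set follows the search.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
cv_folds = 5  # matches cv=5 passed to GridSearchCV

# The Cartesian product of all value lists gives every candidate setting.
keys = sorted(param_grid)
candidates = [
    dict(zip(keys, values))
    for values in product(*(param_grid[k] for k in keys))
]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With 4 values of `C`, 2 penalties, and 1 solver, this yields 8 candidates and 40 cross-validation fits.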

In [10]:
# Fit the GridSearchCV object on the training data
grid_search.fit(emb_train, train_labels)

# get the best model
best_model = grid_search.best_estimator_

Fitting 5 folds for each of 8 candidates, totalling 40 fits


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=  15.0s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=  15.1s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=  15.7s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=  15.8s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=  16.1s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=  25.2s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=  26.0s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=  26.1s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=  27.5s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=  27.7s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=  39.7s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=  39.9s
[CV] END ...............C=0.
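The `huggingface/tokenizers` warning in the output above appears because `n_jobs=-1` forks worker processes after the tokenizer has already been used for encoding. One way to silence it, as the warning itself suggests, is to set `TOKENIZERS_PARALLELISM` explicitly. The sketch below assumes it is run in the first cell of the notebook, before `sentence_transformers` is imported or used.

```python
import os

# Set this before the tokenizer is first used, ideally in the first cell.
# "false" disables tokenizer-level parallelism, so forking workers later
# (e.g. GridSearchCV with n_jobs=-1) no longer triggers the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Since the embeddings are computed once before the grid search, disabling tokenizer parallelism should not affect the search itself.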

In [11]:
best_model

In [12]:
# Evaluate the best model on the validation data
val_preds = best_model.predict(emb_val)
val_acc = accuracy_score(val_labels, val_preds)
print("Validation Accuracy:", val_acc)

test_preds = best_model.predict(emb_test)
test_acc = accuracy_score(test_labels, test_preds)
print("Test Accuracy:", test_acc)


# Generate classification report on test data
print(classification_report(test_labels, test_preds, target_names=num_2_label_map.values()))

Validation Accuracy: 0.6915608663181478
Test Accuracy: 0.6828358208955224
                 precision    recall  f1-score   support

        payment       0.58      0.77      0.66        64
        package       0.70      0.57      0.63       180
        suspend       0.72      0.75      0.74        73
       internet       0.71      0.75      0.73       179
   phone_issues       0.63      0.67      0.65        58
        service       0.85      0.65      0.74       211
    nontruemove       0.36      0.48      0.41        25
        balance       0.88      0.77      0.82       149
         detail       0.29      0.42      0.34        33
           bill       0.66      0.69      0.67        54
         credit       0.83      0.88      0.86        17
      promotion       0.69      0.69      0.69       115
 mobile_setting       0.40      0.50      0.44        28
       iservice       0.33      0.50      0.40         2
        roaming       0.76      0.76      0.76        25
      truemon

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
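The `_warn_prf` lines above most likely come from scikit-learn's undefined-metric warning, which is raised when a class has no predicted (or no true) samples. A minimal sketch of a more explicit call follows, using toy labels and a hypothetical three-class map rather than the real data: passing `labels` keeps the class names aligned with their ids even if a class is absent from the split, and `zero_division=0` reports 0 for undefined scores without warning.

```python
from sklearn.metrics import classification_report

# Toy stand-ins (assumptions for illustration, not the homework data).
num_2_label_map = {0: "payment", 1: "package", 2: "iservice"}
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]  # class 2 never appears in either list

label_ids = sorted(num_2_label_map)
report = classification_report(
    y_true,
    y_pred,
    labels=label_ids,  # fix the row order by class id
    target_names=[num_2_label_map[i] for i in label_ids],
    zero_division=0,   # undefined precision/recall become 0 without a warning
    output_dict=True,
)

for name in ("payment", "package", "iservice"):
    print(name, report[name])
```

In the notebook, the same arguments could be passed with `test_labels` and `test_preds` in place of the toy lists.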


# Model 3 WangchanBERTa

We ask you to train a WangchanBERTa-based model.

We recommend you use the thaixtransformers fork (which we used in the PoS homework).
https://github.com/PyThaiNLP/thaixtransformers

The structure of the code will be very similar to the PoS homework. You will also find the huggingface [tutorial](https://huggingface.co/docs/transformers/en/tasks/sequence_classification) useful. Or you can also add a softmax layer by yourself just like in the previous homework.

Which WangchanBERTa model will you use? Why? (Don't forget to clean your text accordingly).

**Ans:**



# Comparison

After you have completed the 3 models, compare their accuracy, ease of implementation, and inference speed (measured end to end, from cleaning and tokenization through model computation) in mycourseville.
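For the inference-speed comparison, one option is to wrap each model's full pipeline in a single function and time it the same way for all three. The helper below is a minimal sketch using only the standard library; `time_pipeline` and `dummy_predict` are hypothetical names introduced here, and the dummy function merely stands in for a real pipeline.

```python
import time
from statistics import mean


def time_pipeline(predict_fn, texts, n_runs=3):
    """Return (mean seconds per run, mean milliseconds per text)."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(texts)  # should cover cleaning, tokenization, and the model
        durations.append(time.perf_counter() - start)
    avg = mean(durations)
    return avg, 1000.0 * avg / len(texts)


# Hypothetical stand-in for an end-to-end pipeline.
def dummy_predict(texts):
    return [len(t) % 2 for t in texts]


total_s, per_text_ms = time_pipeline(dummy_predict, ["a", "bb", "ccc"])
print(f"{total_s:.6f} s per run, {per_text_ms:.6f} ms per text")
```

For the MUSE model in this notebook, the timed function could be something like `lambda texts: best_model.predict(model.encode(texts))` called on `test_texts`. Averaging over several runs reduces the effect of one-off costs such as caching, though the first run may still be slower than the rest.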