# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g., enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

We will focus only on the Object Classification task for this homework.

In this homework, you are asked to compare different text classification models in terms of accuracy and inference time.

You will need to build 3 different models.

1. A model based on tf-idf
2. A model based on MUSE
3. A model based on wangchanBERTa

**You will be asked to submit 3 files (.pdf exported from .ipynb), one for each of the 3 models. Finally, report the accuracy and runtime numbers in MCV.**

This homework is quite free-form, and your answers may vary. We hope that working through this assignment will make you think more about the design choices in text classification.

In [1]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
# !pip install pythainlp

## Import Libs

In [2]:
%matplotlib inline
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from torch.utils.data import Dataset
from IPython.display import display
from collections import defaultdict
from sklearn.metrics import accuracy_score

# My imports
np.random.seed(42)
from sklearn.model_selection import train_test_split
import pickle

## Loading cleaned dataset from my folder.

In [3]:
with open('template_cleaned_dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)

# Extract tokenized text and labels
label_2_num_map, num_2_label_map = dataset["label_2_num_map"], dataset["num_2_label_map"]
train_texts, train_labels = dataset["train"]["input"], dataset["train"]["label"]
val_texts, val_labels = dataset["val"]["input"], dataset["val"]["label"]
test_texts, test_labels = dataset["test"]["input"], dataset["test"]["label"]

# Model 2 MUSE

Build a simple logistic regression model using features from the MUSE model.

Which MUSE model will you use? Why?

**Ans:** 

- I use `sentence-transformers/use-cmlm-multilingual` because it has more likes on the Hugging Face Hub, and the other candidate does not have native Hugging Face support.

MUSE is typically used with TensorFlow. However, there are some community-made PyTorch conversions:

- https://huggingface.co/sentence-transformers/use-cmlm-multilingual
- https://huggingface.co/dayyass/universal-sentence-encoder-multilingual-large-3-pytorch
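
The "logistic regression on sentence embeddings" setup described at the top of this section can be sketched as follows. This is only a minimal illustration, not the required solution: `toy_encode` is a hypothetical stand-in encoder so the snippet runs offline, and `fit_embedding_classifier` and the demo texts are made-up names and data. In the actual notebook, the `encode` argument would presumably be `SentenceTransformer(MODEL_NAME).encode`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def toy_encode(texts, dim=64):
    """Hypothetical stand-in for a sentence encoder: hashed character counts.

    It only mimics the interface (list of strings -> 2-D float array);
    it is NOT MUSE and carries no semantic information.
    """
    vectors = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for ch in text:
            vectors[row, ord(ch) % dim] += 1.0
        norm = np.linalg.norm(vectors[row])
        if norm > 0:
            vectors[row] /= norm  # unit-length rows, like many sentence encoders
    return vectors


def fit_embedding_classifier(encode, texts, labels):
    """Encode texts into fixed-size vectors, then fit logistic regression on them."""
    features = encode(texts)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf


# Made-up demo data; the homework would use train_texts / train_labels instead.
demo_train_texts = ["pay my bill", "bill payment due",
                    "internet is slow", "slow internet speed"]
demo_train_labels = [0, 0, 1, 1]

clf = fit_embedding_classifier(toy_encode, demo_train_texts, demo_train_labels)
demo_preds = clf.predict(toy_encode(demo_train_texts))
demo_acc = accuracy_score(demo_train_labels, demo_preds)
```

Passing the encoder in as a function keeps the classifier code unchanged when switching between the two MUSE conversions listed above.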

## Import libs for MUSE

In [4]:
from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_stopwords

from sentence_transformers import SentenceTransformer

print(set(thai_stopwords()))

MODEL_NAME = 'sentence-transformers/use-cmlm-multilingual'



{'พอสม', 'ช่วงหลัง', 'ฉัน', 'เมื่อก่อน', 'เมื่อไร', 'ที่', 'ทำไม', 'หากว่า', 'เยอะ', 'เช่นดังก่อน', 'เช่นนี้', 'เมื่อคืน', 'เธอ', 'ใกล้', 'สั้น', 'จะ', 'ด้วยว่า', 'แห่งนั้น', 'สิ่งนั้น', 'แม้ว่า', 'ในช่วง', 'หากแม้นว่า', 'คราวไหน', 'ช่วงแรก', 'หนอย', 'พวกฉัน', 'พา', 'มัก', 'ตลอดถึง', 'ทั้งคน', 'คราวก่อน', 'ครั้งใด', 'ภายภาคหน้า', 'อย่างเดียว', 'ถึงเมื่อใด', 'ตั้ง', 'ไม่ค่อยเป็น', 'เพิ่มเติม', 'มากมาย', 'เยอะๆ', 'รวมถึง', 'พร้อมเพียง', 'ใหญ่ๆ', 'แท้', 'ในที่', 'คราวละ', 'ควร', 'ช่วงๆ', 'เอา', 'เป็นดัง', 'จวบกับ', 'เช่นดัง', 'จึง', 'หรือไม่', 'ประการใด', 'วันใด', 'ซึ่งกัน', 'แม้นว่า', 'คล้ายว่า', 'กระทั่ง', 'ต่อ', 'แห่งใด', 'เป็นต้นไป', 'ระหว่าง', 'กว่า', 'บ้าง', 'ดั่งกับว่า', 'เมื่อเย็น', 'คล้ายกัน', 'เสียนี่', 'เป็นที', 'นั้นไว', 'ง่ายๆ', 'รึว่า', 'ว่า', 'เป็นแต่', 'หลัง', 'บางกว่า', 'เคย', 'คราวนั้น', 'ดังกล่าว', 'ซะจนกระทั่ง', 'ถ้าจะ', 'พอเหมาะ', 'ช่วงนี้', 'ข้าพเจ้า', 'ด้วยกัน', 'ช่วงต่อไป', 'คงจะ', 'ด้วยประการฉะนี้', 'ที', 'เสียนี่กระไร', 'สุดๆ', 'มุ่ง', 'เห็น', 'ตั้งแต่', 'ทันใดนั

In [None]:
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset
from torch import nn

class ClassifierHead(nn.Module):
    def __init__(self, in_features=768, out_features=len(label_2_num_map), dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(in_features, 256)
        self.norm = nn.LayerNorm(256)  # must match the 256-dim output of linear1
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(256, out_features)
        
    def forward(self, x):
        x = self.linear1(x)
        x = self.norm(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x

class TrueObjectClassifier(pl.LightningModule):
    def __init__(self, model_name, num_labels, lr=2e-5):
        super().__init__()
        self.encoder = SentenceTransformer(model_name)
        


## Initializing the model

In [None]:
model = SentenceTransformer(MODEL_NAME)
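# Encode a small sample of the training texts (assumes each entry is a plain string).
sentences = list(train_texts[:5])
embeddings = model.encode(sentences)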
print(embeddings)

# Model 3 WangchanBERTa

We ask you to train a WangchanBERTa-based model.

We recommend you use the thaixtransformers fork (which we used in the PoS homework).
https://github.com/PyThaiNLP/thaixtransformers

The structure of the code will be very similar to the PoS homework. You will also find the huggingface [tutorial](https://huggingface.co/docs/transformers/en/tasks/sequence_classification) useful. Or you can also add a softmax layer by yourself just like in the previous homework.
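
For the "add a softmax layer by yourself" option, a rough sketch of such a head is shown below. It assumes the encoder returns per-token hidden states of shape `(batch, tokens, hidden)` plus an attention mask; the class name `MeanPoolSoftmaxHead`, the mean-pooling choice, the hidden size of 768, and the 5 labels are all illustrative assumptions, and random tensors stand in for real WangchanBERTa outputs.

```python
import torch
from torch import nn


class MeanPoolSoftmaxHead(nn.Module):
    """Hypothetical head: masked mean pooling followed by a linear layer.

    It returns logits; softmax is applied outside (or implicitly by
    cross_entropy during training).
    """

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states, attention_mask):
        # (B, T) -> (B, T, 1) so padded positions contribute zero to the sum
        mask = attention_mask.unsqueeze(-1).to(token_states.dtype)
        summed = (token_states * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        pooled = summed / counts
        return self.linear(pooled)


torch.manual_seed(0)
head = MeanPoolSoftmaxHead(hidden_size=768, num_labels=5)

# Random stand-ins for encoder outputs: 2 texts, 4 tokens each.
token_states = torch.randn(2, 4, 768)
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])

logits = head(token_states, attention_mask)
probs = torch.softmax(logits, dim=-1)            # class probabilities
labels = torch.tensor([0, 3])                    # made-up gold labels
loss = nn.functional.cross_entropy(logits, labels)  # applies log-softmax itself
```

Returning logits rather than probabilities keeps training numerically stable, since `cross_entropy` already includes the log-softmax step.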

Which WangchanBERTa model will you use? Why? (Don't forget to clean your text accordingly).

**Ans:**



# Comparison

After you have completed the 3 models, compare the accuracy, ease of implementation, and inference speed (from cleaning and tokenization through model computation) across the three models in MyCourseVille.
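
One possible way to measure accuracy and end-to-end inference time consistently across the three models is sketched below. The helper name `benchmark` and the toy `keyword_pipeline` are hypothetical; the idea is that each model is wrapped in a single function that takes raw text and returns a label, so that cleaning and tokenization are included in the timing.

```python
import time


def benchmark(predict_fn, texts, labels):
    """Time predict_fn over raw texts and compute accuracy against labels."""
    start = time.perf_counter()
    preds = [predict_fn(text) for text in texts]
    elapsed = time.perf_counter() - start

    correct = sum(pred == gold for pred, gold in zip(preds, labels))
    return {
        "accuracy": correct / len(labels),
        "total_seconds": elapsed,
        "ms_per_text": 1000.0 * elapsed / len(texts),
    }


def keyword_pipeline(text):
    """Toy stand-in for a full pipeline: clean -> tokenize -> predict."""
    cleaned = text.strip().lower()   # cleaning
    tokens = cleaned.split()         # tokenization
    return "internet" if "internet" in tokens else "payment"  # "model"


# Made-up demo data; the homework would pass test_texts / test_labels.
demo_texts = ["Internet is down ", "pay my bill", "slow internet"]
demo_labels = ["internet", "payment", "internet"]

report = benchmark(keyword_pipeline, demo_texts, demo_labels)
```

For the neural models, batching usually changes the timing substantially, so it is worth stating whether the reported number is per text or per batch.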