Before running: the checkpoint for the PyTorch model is too large to push to GitHub, so please download the saved checkpoint from [Google Drive](https://drive.google.com/file/d/1Eg4ZGp1hS-EcDB7LfCEUbPvtLzSjxc8f/view?usp=sharing) first and move it to the `work/` directory.

In [None]:
import torch

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, udf
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.linalg import Vectors, VectorUDT

from torch import nn
from torch.utils.data import Dataset, DataLoader

from transformers import AutoModel, AutoTokenizer
from sklearn.metrics import classification_report

from classifiers import SENTIMENTS_AS_INDEX, MLPClassifierWithPhoBERT

# Misc libs
from tqdm import tqdm

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

phobert_tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base-v2')
apply_tokenization = lambda minibatch: phobert_tokenizer(
    minibatch, return_tensors='pt', padding=True,
    truncation=True, max_length=256
)

IDX_AS_SENT = {idx: sentiment for sentiment, idx in SENTIMENTS_AS_INDEX.items()}

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Loading test data and model

In [5]:
class ReviewDataset(Dataset):
    def __init__(self, data_as_spark_df):
        self.data_as_rdd = data_as_spark_df.rdd.zipWithIndex()
        self.len = data_as_spark_df.count()

    def __len__(self): return self.len

    def __getitem__(self, index: int):
        if index < 0 or index > self.len - 1:
            # IndexError is what Python's sequence protocol expects for out-of-range access
            raise IndexError('index out of range for the dataset')

        nth_row = (self.data_as_rdd
                   .filter(lambda data: data[1] == index)
                   .take(1)[0][0]
        )
        review, sentiment = nth_row

        return review, SENTIMENTS_AS_INDEX[sentiment]
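
Each `__getitem__` call above launches a Spark `filter` job over the whole RDD, so iterating the dataset costs one job per row. For a small test set, one possible alternative is to collect the rows to the driver once and index a Python list. The sketch below only illustrates that idea: `InMemoryReviewDataset`, `rows`, and the toy label mapping are hypothetical names, and the class implements just the `__len__`/`__getitem__` protocol of a map-style dataset, so it runs without Spark or PyTorch.

```python
class InMemoryReviewDataset:
    """Map-style dataset backed by a list of (review, sentiment) pairs."""

    def __init__(self, rows, sentiments_as_index):
        # `rows` could come from `spark_df.collect()`; here it is any sequence of pairs.
        self.rows = [(review, sentiment) for review, sentiment in rows]
        self.sentiments_as_index = sentiments_as_index

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        review, sentiment = self.rows[index]  # raises IndexError when out of range
        return review, self.sentiments_as_index[sentiment]


# Toy data; the real label mapping lives in `classifiers.SENTIMENTS_AS_INDEX`.
toy_mapping = {"positive": 0, "neutral": 1, "negative": 2}
toy_rows = [("ngon", "positive"), ("tệ", "negative")]
toy_dataset = InMemoryReviewDataset(toy_rows, toy_mapping)
```

This trades driver memory for speed, which is only reasonable when the whole split fits comfortably on the driver.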

In [None]:
test_df = spark.read.parquet(
    'hdfs://namenode:9000/training_data/test_set'
)
test_set = ReviewDataset(test_df)
test_loader = DataLoader(test_set, batch_size=512)

## Loading Logistic Regression Model

In [None]:
phobert = AutoModel.from_pretrained('vinai/phobert-base-v2')
if torch.cuda.is_available(): phobert.cuda()
phobert.eval()

@udf(returnType=VectorUDT())
def create_embedding(text):
    tokens = phobert_tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
    input_ids = tokens["input_ids"].to(device)
    attention_mask = tokens["attention_mask"].to(device)
    with torch.no_grad():
        output = phobert(input_ids, attention_mask)
    return Vectors.dense(output.last_hidden_state[0, 0, :].cpu().numpy())

Some weights of RobertaModel were not initialized from the model checkpoint at vinai/phobert-base-v2 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
lr_model = LogisticRegressionModel.load('work/models/lr_sentiment_model')

## Loading MLP model

In [9]:
@torch.no_grad()
def get_cm(
    model: nn.Module,
    data_loader: DataLoader,
    n_labels: int,
    use_gpu: bool = False,
    return_preds: bool = False
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Make inference with a `torch.nn.Module` and return the confusion matrix.

    :param nn.Module model: The model to make inference with.
    :param DataLoader data_loader: The data to make inference on.
    :param int n_labels: The number of labels within the dataset. Note that
        this should be the number of labels in the WHOLE dataset. The `data_loader`
        must contain at most `n_labels` distinct labels.
    :param bool use_gpu: Whether or not to do computations on GPU.
    :param bool return_preds: Whether or not to return the predictions.

    :return: The confusion matrix as a 2-d tensor of integers, where each row
        corresponds to a predicted label and each column to a ground-truth label,
        followed by a tensor of predictions.

        The predictions tensor is empty unless `return_preds=True`.
    :rtype: tuple[torch.Tensor, torch.Tensor]
    """
    model.eval()
    flattened_dim = n_labels ** 2
    confusion_mat = torch.zeros(flattened_dim, dtype=torch.long)
    preds = torch.empty(0, dtype=torch.long)  # same dtype as the argmax output

    for X, y in tqdm(data_loader):
        tokenized_X = apply_tokenization(X)

        X_input_ids = tokenized_X['input_ids']
        X_att_mask = tokenized_X['attention_mask']

        if use_gpu:
            X_input_ids = X_input_ids.cuda()
            X_att_mask = X_att_mask.cuda()

        pred = model(X_input_ids, X_att_mask).argmax(dim=1).cpu()
        if return_preds: preds = torch.concat([preds, pred])

        # Flat index of each (prediction, truth) pair; `minlength` pads the
        # counts so that pairs unseen in this batch still get a zero entry.
        count_as_idx = y + n_labels * pred
        confusion_mat += torch.bincount(count_as_idx, minlength=flattened_dim)
    return confusion_mat.reshape((n_labels, n_labels)), preds
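
The flattened-index trick used in `get_cm` can be checked on a tiny example. The sketch below reimplements only the counting step in plain Python (no PyTorch), purely for illustration; `confusion_matrix`, `y_true`, and `y_pred` are hypothetical names. Each (prediction, truth) pair maps to the flat index `truth + n_labels * prediction`, so after reshaping, rows index predictions and columns index the ground truth.

```python
def confusion_matrix(y_true, y_pred, n_labels):
    """Count (prediction, truth) pairs; rows are predictions, columns are truths."""
    flat = [0] * (n_labels ** 2)
    for truth, pred in zip(y_true, y_pred):
        flat[truth + n_labels * pred] += 1  # same flat index as in `get_cm`
    # Reshape the flat counts into an n_labels x n_labels nested list.
    return [flat[row * n_labels:(row + 1) * n_labels] for row in range(n_labels)]


y_true = [0, 0, 1, 2, 2]
y_pred = [0, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, n_labels=3)
print(matrix)
```

The diagonal entries count the correct predictions; everything off the diagonal is a misclassification.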

In [None]:
checkpoint = torch.load('work/models/03_05_25-epoch25-model.tar', map_location=device)

review_model = MLPClassifierWithPhoBERT([512, 512], nn.LeakyReLU(.02))
review_model.load_state_dict(checkpoint['model_param'])
if torch.cuda.is_available(): review_model.cuda()

Some weights of RobertaModel were not initialized from the model checkpoint at vinai/phobert-base-v2 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Inference

## Comparison between Logistic Regression and MLP on test set

In [35]:
cm, mlp_predictions = get_cm(review_model, test_loader, 3, torch.cuda.is_available(), True)
cm

100%|██████████| 2/2 [03:20<00:00, 100.20s/it]


tensor([[573,   7,   5],
        [ 43,  14,  26],
        [  6,   9, 121]])

In [18]:
idx_to_sentiment = udf(lambda idx: IDX_AS_SENT[idx])

test_df = test_df.withColumn('features', create_embedding(col('review')))
lr_predictions = lr_model.transform(test_df)
lr_predictions = lr_predictions.withColumn('prediction', idx_to_sentiment(col('prediction')))

preds_pd = lr_predictions.select("sentiment", "prediction").toPandas()

In [28]:
print("MLP eval metrics per class:")
print(classification_report([IDX_AS_SENT[int(i)] for _, y in test_loader for i in y], [IDX_AS_SENT[int(i)] for i in mlp_predictions], digits=4))

MLP eval metrics per class:
              precision    recall  f1-score   support

    negative     0.8897    0.7961    0.8403       152
     neutral     0.1687    0.4667    0.2478        30
    positive     0.9795    0.9212    0.9495       622

    accuracy                         0.8806       804
   macro avg     0.6793    0.7280    0.6792       804
weighted avg     0.9323    0.8806    0.9026       804



In [19]:
print("Logistic Regression eval metrics per class:")
print(classification_report(preds_pd["sentiment"], preds_pd["prediction"], digits=4))

Logistic Regression eval metrics per class:
              precision    recall  f1-score   support

    negative     0.8218    0.9408    0.8773       152
     neutral     0.0000    0.0000    0.0000        30
    positive     0.9665    0.9743    0.9704       622

    accuracy                         0.9316       804
   macro avg     0.5961    0.6384    0.6159       804
weighted avg     0.9031    0.9316    0.9166       804



From the class-wise metrics, both models perform similarly on `positive` and `negative` samples: their F1 scores differ by about 2 points on `positive` (0.9495 for MLP vs. 0.9704 for LR) and about 4 points on `negative` (0.8403 vs. 0.8773), in favor of LR in both cases.

Further inspection shows that the MLP has higher precision on both the `positive` and `negative` labels but lower recall on both, and its recall on the `negative` class (0.7961) is much lower than that of the LR model (0.9408).

Inspection of the `neutral` samples shows that while the MLP *can* correctly predict some neutral reviews (LR predicts none of them correctly), its precision is low (0.1687), so most of its `neutral` predictions are wrong. Furthermore, its recall of about 47% means that it misses more than half of the 30 neutral reviews in the test set.

In summary, both models perform well on positive and negative reviews, with LR slightly ahead on F1 for both classes and on overall accuracy (0.9316 vs. 0.8806). However, the MLP handles neutral reviews better: it correctly predicts some of them, whereas LR predicts none, which gives the MLP the higher macro-averaged F1 (0.6792 vs. 0.6159).
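
As a sanity check, the MLP figures in the classification report can be re-derived from the confusion matrix printed earlier (rows are predictions, columns are ground truth). The sketch below does this in plain Python. The label order `positive`, `neutral`, `negative` is an assumption inferred from the column sums (622, 30, 152) matching the support column of the report, and `per_class_metrics` is a hypothetical helper name.

```python
# MLP confusion matrix from the output above: rows are predictions, columns are ground truth.
cm = [
    [573, 7, 5],
    [43, 14, 26],
    [6, 9, 121],
]
# Assumed label order, inferred from the column sums (622, 30, 152).
labels = ["positive", "neutral", "negative"]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    true_pos = cm[k][k]
    predicted = sum(cm[k])               # row sum: everything predicted as k
    actual = sum(row[k] for row in cm)   # column sum: everything that truly is k
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


metrics = {label: per_class_metrics(cm, k) for k, label in enumerate(labels)}
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for label, (p, r, f1) in metrics.items():
    print(f"{label:>8}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}")
print(f"accuracy={accuracy:.4f}")
```

For instance, the `neutral` precision is 14 / 83 and the `neutral` recall is 14 / 30, which match the 0.1687 and 0.4667 reported by `classification_report`.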

## Inference on comments from Foody
We picked out a few comments about KFC on [Foody](https://www.foody.vn/ho-chi-minh/kfc-ly-thuong-kiet/binh-luan) and used both models to run inference on them.

In [None]:
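# The reviews are Vietnamese text, so read them explicitly as UTF-8.
with open('work/foody_reviews.txt', encoding='utf-8') as file:
    foody_reviews = file.read().splitlines()  # no empty entry for a trailing newline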

In [None]:
from pyvi import ViTokenizer
# load dictionary
abbr_fp = 'work/teencode.txt'

ABBREV_DICT = {}
with open(abbr_fp, 'r', encoding='utf-8') as abb:
    for line in abb:
        parts = line.strip().split('\t')
        if len(parts) == 2:
            short, full = parts
            ABBREV_DICT[short] = ViTokenizer.tokenize(full)

def clean_review(text: str) -> str:
    """
    Tokenize text with `PyVi.ViTokenizer` and substitute each abbreviation with its full form.
    :param str text: The text to clean
    :return: The cleaned text
    :rtype: str
    """
    return ' '.join(ABBREV_DICT.get(word.lower(), word) for word in ViTokenizer.tokenize(text).split(' '))
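
To illustrate the substitution step in `clean_review` without the `pyvi` dependency, the sketch below uses a plain whitespace split as a stand-in for `ViTokenizer.tokenize` and a two-entry dictionary. Both the dictionary contents and the `expand_abbreviations` helper are hypothetical; the real entries are read from `work/teencode.txt`.

```python
# Hypothetical abbreviation entries; the real ones are read from `work/teencode.txt`.
toy_abbrev = {"ko": "không", "dc": "được"}


def expand_abbreviations(text, abbrev):
    """Replace each whitespace-separated word by its full form when one is known."""
    # The lookup is case-insensitive, but unknown words keep their original casing.
    return " ".join(abbrev.get(word.lower(), word) for word in text.split(" "))


cleaned = expand_abbreviations("Gà KO ngon", toy_abbrev)
print(cleaned)
```

Words that are not in the dictionary pass through unchanged, which is also how `clean_review` treats them.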

In [43]:
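tokens = apply_tokenization(foody_reviews)
input_ids, att_mask = tokens['input_ids'].to(device), tokens['attention_mask'].to(device)
with torch.no_grad():  # inference only; `device` also covers CPU-only machines
    mlp_predictions = review_model(input_ids, att_mask).argmax(dim=1).cpu()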

In [47]:
foody_rdd = spark.sparkContext.textFile('work/foody_reviews.txt').map(clean_review)
foody_df = (foody_rdd.map(lambda x: Row(review=x))
    .toDF()
    .withColumn('features', create_embedding(col('review')))
)
lr_predictions = lr_model.transform(foody_df).toPandas()

In [59]:
ansi_colors = {
    'positive': '\033[32;10m',
    'neutral': '',
    'negative': '\033[31;10m'
}

for i, review in enumerate(foody_reviews):
    mlp_sent = IDX_AS_SENT[int(mlp_predictions[i])]
    lr_sent = IDX_AS_SENT[int(lr_predictions.iloc[i]['prediction'])]
    print(review[:100] + '...')
    print(f"  MLP predicted: {ansi_colors[mlp_sent]}{mlp_sent}\033[0m")
    print(f"  LR predicted: {ansi_colors[lr_sent]}{lr_sent}\033[0m\n")

Tối ngày 10/10 mình có ghé KFC (chi nhánh*** ăn và gọi combo 1 gồm: 2 miếng gà rán, 1 khoai tây chiê...
  MLP predicted: negative
  LR predicted: negative

KFC này mình hay ngồi lại ăn, không gian ổn, nhiều lần deli nên cũng tin tưởng chất lượng. Đi ban ng...
  MLP predicted: neutral
  LR predicted: positive

Dịch vụ tệ, mình cảm thấy hơi có lỗi vì ghé gần giờ đóng cửa lúc 9: 25pm và 10:00 đóng cửa nhưng thự...
  MLP predicted: negative
  LR predicted: negative

Chi nhánh này địa điểm đẹp, không gian rộng rãi, có lầu. Có điều đi 1 lần vào lúc 8h tối, nhân viên ...
  MLP predicted: negative
  LR predicted: negative

Mình là một khách hàng quen thuộc của KFC, nhưng hôm nay KFC (cụ thể là KFC chi nhánh Lý...
  MLP predicted: negative
  LR predicted: negative

Lần đầu trực tiếp qua ăn do gần nhà cũng có BigC vs Ng Tri Phương rồi. Mình rất thích bên này, ngoài...
  MLP

The two models agree on most reviews; the exception is the second review:
```
KFC này mình hay ngồi lại ăn, không gian ổn, nhiều lần deli nên cũng tin tưởng chất lượng.
Đi ban ngày nên hơi nóng, có chỗ giữ xe có người canh. Gọi phần burger (combo) thêm cheese
ăn khá ngon, nhưng mình gọi sớm nên chưa có tôm thì phải. Khoai tây ổn, nước oke nhưng gà
dạo này hơi khô, trong lớp da nhiều mỡ ăn mau ngán. Phục vụ ổn.
```

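Roughly translated, the reviewer says that they often eat in at this KFC, that the space is fine and they trust the quality after many delivery orders, that the burger combo with extra cheese was quite tasty, and that the fries and drink were okay, but that the chicken has been a bit dry lately and the fatty skin quickly becomes cloying.

While the review is indeed mostly positive, the MLP appears to pick up on the criticism in it, e.g. that the chicken was dry and "too fatty" for the reviewer's liking.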

Additionally, words like `"ổn"` ("fine"), `"oke"` ("okay") and `"khá"` ("fairly") are also typical of a 3-star review. Given that we encoded 3-star reviews as neutral, we consider the MLP's `neutral` prediction to be appropriate.