### 0. Introduction
---
Models: BERT, roBERTa

##### How data is processed?

In topic modelling we retrieved reviews on movies, restaurants and books from 3 separated datasets. We manually retrieved the text content and gave them numerical labels accordingly (0-movie, 1-restaurant, 2-book, because the models do not accept labels in string), transformed them into pandas Dataframe (the datatype that the models can work with), merged them into corpus and removed duplicated contents and shuffled.

Then, the corpus is splitted into train, validation and test dataset, in the proportion takes 60%, 20% and 20% of the corpus. Before training begins, the data was checked to make sure that all three of the datasets are using different data with no overlapping data to avoid overfitting (a phenomenon that the ML model "memorizes" the answer and lose the ability of classifying unseen inputs).

##### How training is proceeded?

Both BERT and rBERTa models apply the same fine-tune parameters. The parameters are adjusted accordingly to reach the maximum performance. Both models train on the training dataset and validated with the validation dataset, which takes approximately 5 minutes for each of them in total. In the end, an evaluation of performance is executed on both models with the test dataset.



Two models were adjusted twice. In the first version, the result was remarkably good, this raises the suspesion of overfitting issue. To confirm that two models are not overfitted, the following were adjusted:

*   Make data cleaner by removing duplications in corpus
*   Train models without cross-validation and testing on the pre-trained model by setting epocs to zero

After the cleaning of data and a comparison of the evaluation with and without epocs, it was discovered that both of the models perform remarkably well with a weighted average f1 score of 99%, however, when training without epocs, the performance drops drastically. Additionally, the algorithm contains a pre-stop function that when overfitting starts to occur, the training on epochs will terminate, and in this case, the algorithm indeed terminated before running all epochs. This is a strong proof that the models are not overfitted.

##### Result:

Surprisingly, after excluding the possibility of overfitting, both BERT and rBERTa model performed extremly well and both returned high precisions and high recalls on determining all three different topics. This indicates that both BERT and rBERTa are powerful and sufficient on modelling topics - both models can identify the topics preciseely and classify them correctly, with a very small possibility of mismacthing. In conclusion, both models are capable for most of the topic modelling tasks.

##### Additional note:
1. Due to the hardward limitation, only 2000 datapoints were used from each dataset (1000 from Restaurant Dataset).
2. This project was completed in 2023. All results were acquired after the first execution. <br>
Unexpected *ValueError* occurred when executing the second time. Will be fixed later.

### 1. Data preprocessing:

Original source of datasets:


*   Movie dataset: IMDB Dataset of 50K Movie Reviews, https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

*   Restaurant dataset: Restaurant Customer Review:
https://www.kaggle.com/datasets/vigneshwarsofficial/reviews


*   
Book dataset:
Amazon Book Review:
https://www.kaggle.com/datasets/shrutimehta/amazon-book-reviews-webscraped?resource=download


> Model takes long time to run! For a faster execution, please consider using Google Colab with extra RAM space.




In [1]:
import pandas as pd
import csv
import random

In [2]:
# Retrieve the text contents from the datasets
movie_reviews = []
restaurant_reviews = []
book_reviews = []

with open('Data/IMDB Dataset.csv', newline='', encoding='iso-8859-1') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in list(reader)[1:2001]:
      movie_reviews.append(row[0])

with open('Data/Restaurant_Reviews.tsv', newline='',encoding='iso-8859-1') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in list(reader)[1:2001]:
      restaurant_reviews.append(row[0])

with open('Data/Amazon book reviews.csv', newline='', encoding='iso-8859-1') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in list(reader)[1:2001]:
      book_reviews.append(row[0])

In [3]:
# Manually transform the string label to integer label the data and transform the data with their labels into dictionaries
# 0 - movie
# 1 - restaurant
# 2 - book

def remove_duplicate(datalist):
    """
    Removes duplicated reviews from a datalist.
    :param datalist (list): A list containing reviews in text


    :returns list: A new list with duplicate elements removed.
    """
  return list(set(datalist))


def labeller(datalist, label):
    """
    Manually assigns each review in the datalist to a corresponding string label
    :param list datalist: A list containing reviews in text.
    :param str label: The string label to be assigned to each element.

    :returns list(list) result: A list of lists, each containing two strings: the original review in text and its label

    Example:
        >>> labeled(["This movie is awesome!", "This movie is lame!",...], "movie")
        [["This movie is awesome!","movie"], ["This movie is lame!", "movie"],...]]
    """

  result = []

  for review in datalist:
    result.append([review, string_label_to_numeric(label)])

  print("Labelling complete!")

  return result


def string_label_to_numeric(label):
    """
    Converts a string label to its corresponding numerical label:
    'movie'-> 0
    'restaurant'-> 1
    'book'-> 2

    :param str label: A string label ("movie", "restaurant", or "book"):

    :returns int: The numeric representation of the label (0 for "movie", 1 for "restaurant", 2 for "book").

    Example:
        >>> string_label_to_numeric("movie")
        0
    """

  if label == "movie":
    return 0
  elif label == "restaurant":
    return 1
  elif label == "book":
    return 2
  else:
    pass

  return "Invalid label."


def list_to_dict(datalist):
    """
    Converts a list of lists (labelled) into a list of dictionaries.

    :param list(list) datalist: A list of lists (labelled), where each inner list contains two elements (review text and label).

    :returns list(dict()) result: A list of dictionaries, where each dictionary contains "text" and "label" as keys.

    Example:
        >>> list_to_dict([['review1', "movie"], ['review2', "book"]])
        [{'text': 'review1', 'label': "movie"}, {'text': 'review2', 'label': "book"}]
    """ 
  result = []
  for review in datalist:
    temp_dict = {"text":review[0], "label":review[1]}
    result.append(temp_dict)

  return result

In [4]:
def data_preview(show_example = False, train_test_splitted = False):
    """
    Print the summary of the dataset including the number of reviews for each category.
    
    :param show_example (bool, optional): Whether to print examples of reviews. Defaults to False.
    :param train_test_splitted (bool, optional): Whether the dataset is splitted into train, validation, and test sets.
            If True, also prints the lengths of train, validation, and test sets. Defaults to False.
    """    
  print(f"Number of book reviews: {len(book_reviews)}")
  print(f"Number of restaurant reviews: {len(restaurant_reviews)}")
  print(f"Number of movie reviews:{len(movie_reviews)}")

  if train_test_splitted:
    print(f"Total length of dataset: {len(df)}")
    print(f"Number of training data: {len(train)}")
    print(f"Number of validation data: {len(validation)}")
    print(f"Number of test data: {len(test)}")

  if show_example:
    for i in range(10):
      print(f" review {i}: \n movie: {movie_reviews[i]} \n restaurant: {restaurant_reviews[i]} \n book: \n{ book_reviews[i]} \n")

###### Preview the data

In [5]:
print("Number of data in each category before eliminating duplication: ")
data_preview()
movie_reviews = remove_duplicate(movie_reviews)
restaurant_reviews = remove_duplicate(restaurant_reviews)
book_reviews = remove_duplicate(book_reviews)
print("Number of data in each category after eliminating duplication: ")
data_preview(show_example=True)

Number of data in each category before eliminating duplication: 
Number of book reviews: 2000
Number of restaurant reviews: 992
Number of movie reviews:2000
Number of data in each category after eliminating duplication: 
Number of book reviews: 1775
Number of restaurant reviews: 977
Number of movie reviews:2000
 review 0: 
 movie: I very much looked forward to this movie. Its a good family movie; however, if Michael Landon Jr.'s editing team did a better job of editing, the movie would be much better. Too many scenes out of context. I do hope there is another movie from the series, they're all very good. But, if another one is made, I beg them to take better care at editing. This story was all over the place and didn't seem to have a center. Which is unfortunate because the other movies of the series were great. I enjoy the story of Willie and Missy; they're both great role models. Plus, the romantic side of the viewers always enjoy a good love story. 
 restaurant: Our server was very 

###### Assign labels and transform the data

In [6]:
movie_reviews = labeller(movie_reviews, "movie")
restaurant_reviews = labeller(restaurant_reviews, "restaurant")
book_reviews = labeller(book_reviews, "book")

Labelling complete!
Labelling complete!
Labelling complete!


In [7]:
movie_reviews = list_to_dict(movie_reviews)
restaurant_reviews = list_to_dict(restaurant_reviews)
book_reviews = list_to_dict(book_reviews)

Data shuffling and merging

In [8]:
# Merge the labeled data into a big list, namely corpus, and shuffle the data to make sure that
# all types of data are randomly distributed.
corpus = movie_reviews + restaurant_reviews + book_reviews
random.shuffle(corpus)
df = pd.DataFrame(corpus, columns=['text','label'])

In [9]:
## Uncomment and run if simpletransformers is not installed
# !pip install simpletransformers

Defaulting to user installation because normal site-packages is not writeable


###### Data splitting and data cleaningness check

In [10]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import matplotlib.pyplot as plt
import seaborn as sn




In [11]:
# Split the data into train, validate (develope) and test datasets in proportion of 60/20/20
train_without_val, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['label']])
train, validation = train_test_split(train_without_val, test_size=0.2, random_state=0, stratify=train_without_val[['label']])

In [22]:
# check overlapping: make sure that the training dataseet and the test dataset
# are different before training to avoid overfitting!

def check_overlapping(datalist1, datalist2):
    """
    Check if two datasets are distinct and have no mutual datapoint. 
    Mainly used to ensure the integrity and cleaningness of training, validation and test dataset.

    :param list(str) datalist1: The first datalist of text review.
    :param list(str) datalist1: The second datalist of text review.

    Examples:
        >>>check_overlapping(datalist1=[i for i in train["text"]], datalist2=[i for i in test["text"]])
            "There is no overlapping! These two datasets have different data."
    """
    
    if len(set(datalist1).intersection(set(datalist2))) == 0:
        print("There is no overlapping! These two datasets have different data.")
    else:
        print(f"Overlapping data found: \n{set(datalist1).intersection(set(datalist2))}")

In [13]:
check_overlapping(datalist1=[i for i in train["text"]], datalist2=[i for i in test["text"]])
check_overlapping(datalist1=[i for i in train["text"]], datalist2=[i for i in validation["text"]])
check_overlapping(datalist1=[i for i in validation["text"]], datalist2=[i for i in test["text"]])

There is no overlapping! These two datasets have different data.
There is no overlapping! These two datasets have different data.
There is no overlapping! These two datasets have different data.


### 2: Topic modelling with BERT

###### Model Initialization

In [16]:
# Model configuration # https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
model_args = ClassificationArgs()

model_args.overwrite_output_dir=True # overwrite existing saved models in the same directory
model_args.evaluate_during_training=True # to perform evaluation while training the model
# (eval data should be passed to the training method)

model_args.num_train_epochs=10 # number of epochs
model_args.train_batch_size=32 # batch size
model_args.learning_rate=4e-6 # learning rate
model_args.max_seq_length=128 # maximum sequence length
# Note! Increasing max_seq_len may provide better performance, but training time will increase.

# Early stopping to combat overfitting: https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
model_args.use_early_stopping=True
model_args.early_stopping_delta=0.01 # "The improvement over best_eval_loss necessary to count as a better checkpoint"
model_args.early_stopping_metric='eval_loss'
model_args.early_stopping_metric_minimize=True
model_args.early_stopping_patience=2
model_args.evaluate_during_training_steps=32 # how often you want to run validation in terms of training steps (or batches)

In [18]:
model = ClassificationModel('bert', 'bert-base-uncased', num_labels=3, args=model_args, use_cuda=False) # CUDA is enabled

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


###### Execute Training

In [19]:
_, history = model.train_model(train, eval_df=validation, multi_label=True)



  0%|          | 0/6 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/95 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

In [None]:
train_loss = history['train_loss']
eval_loss = history['eval_loss']
plt.plot(train_loss, label='Training loss')
plt.plot(eval_loss, label='Evaluation loss')
plt.title('Training and evaluation loss')
plt.legend()

In [None]:
bert_predicted, bert_probabilities = model.predict(test.text.to_list())
test['predicted'] = bert_predicted

In [None]:
print("Classification Report of BERT")
print(classification_report(test['label'], test['predicted']))

### 3. Topic modelling with roBERTa

In [None]:
roberta = ClassificationModel('roberta', 'roberta-base',num_labels=3, args=model_args,use_cuda=True)

In [None]:
_, history = roberta.train_model(train, eval_df=validation)

In [None]:
train_loss = history['train_loss']
eval_loss = history['eval_loss']
plt.plot(train_loss, label='Training loss')
plt.plot(eval_loss, label='Evaluation loss')
plt.title('Training and evaluation loss')
plt.legend()

In [None]:
roberta_predicted, roberta_probabilities = roberta.predict(test.text.to_list())
test['predicted'] = roberta_predicted

In [None]:
print("Classification report of roBERTa")
print(classification_report(test['label'], test['predicted']))