
YELP Reviews Sentiment Analysis

Task

In this project, we train and compare several Transformer-based models. The task we are solving is known as sentiment analysis: in a nutshell, the models learn to classify free-text reviews as positive or negative.

Data

We use Yelp reviews for this task. The original Yelp Open Dataset is available here.

To reduce training time and avoid relying on expensive and often unavailable hardware, we extract 25,000 records per star rating (125,000 reviews in total) and split them into training, validation and test sets for development and final model evaluation. This is still too much for an ordinary CPU to process in a reasonable time. Fortunately, Google Colab offers performant GPUs (up to 16 GB of GPU RAM), which are sufficient for training the models in an acceptable timeframe.
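
The subsample itself is produced by utils.dataset_utils.py (see the repository structure below); the snippet here is only a minimal sketch of the idea, with the split proportions, the random seed and the stars column treated as assumptions rather than the project's actual logic:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load raw reviews (after unzipping yelp_academic_dataset_review.json.zip);
# the file is several GB, so chunked reading (chunksize=...) may be needed in practice.
reviews = pd.read_json('data/yelp_academic_dataset_review.json', lines=True)

# Keep 25,000 reviews per star rating, i.e. 125,000 reviews in total.
subsample = (reviews
             .groupby('stars', group_keys=False)
             .apply(lambda g: g.sample(n=25_000, random_state=42)))

# Split into training, validation and test sets; proportions are illustrative.
train, rest = train_test_split(subsample, test_size=0.3,
                               stratify=subsample['stars'], random_state=42)
valid, test = train_test_split(rest, test_size=0.5,
                               stratify=rest['stars'], random_state=42)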

Basic exploratory data analysis of the subsample is available in notebooks/yelp_eda.ipynb.

Repository Structure

  • data - contains the data we use for development and testing. Since both the initial dataset and the subsample are quite large, this folder needs to be created locally. Only yelp_academic_dataset_review.json.zip from the original dataset needs to be stored in data. A subsample of the dataset can be generated by running utils.dataset_utils.py.
  • models - model, training, evaluation and prediction definitions.
  • notebooks - auxiliary notebooks, such as data exploration and an example of model usage.
  • utils - helpers and supplementary methods, such as subsampling the original dataset and preprocessing text for subsequent Transformer model training.
  • .gitignore - lists files and folders ignored by git.
  • main.py - default root of the project, not used at the moment.
  • README.md - the doc you're reading :)
  • requirements.txt - project dependencies. Execute pip install -r requirements.txt in a console to install the additional packages needed to run the project.

Data Preprocessing and Preparation for Training

Transformer-based models expect fixed-length sequences of token IDs as inputs. Thus, the first step is to transform the textual data into sequences of token IDs.

TextPreprocessor wraps the text preprocessing and transformation routine in a single class. When instantiating it, one should consider which Transformer model will be used, because its two principal parameters (tokenizer and vocab_file) have to be consistent with the chosen model.

tokenizer - must be a PreTrainedTokenizer object. If the parameter is not provided, BertTokenizer is used by default.

vocab_file - the vocabulary used by the tokenizer; must be a string. Refer to HuggingFace's documentation for a full list of vocabularies (Shortcut name column) and the associated model architectures. If none is provided, 'bert-base-cased' is used by default.
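
For instance, to prepare inputs for a DistilBERT model, both parameters should be switched to their DistilBERT counterparts. A minimal sketch, assuming the tokenizer class is passed and instantiated internally from vocab_file (check utils/text_preprocessing.py if the constructor expects an already-constructed tokenizer instead):

from transformers import DistilBertTokenizer
from utils.text_preprocessing import TextPreprocessor

# Tokenizer and vocabulary must match the model that will consume the sequences.
# Passing the class (rather than an instance) is an assumption about the constructor.
prep = TextPreprocessor(tokenizer=DistilBertTokenizer,
                        vocab_file='distilbert-base-cased')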

The main working part of the TextPreprocessor class is the preprocess method. It can take the following two parameters:

texts - a list of strings to be processed and transformed.

fit - a boolean that tells whether to determine the maximum sequence length used for padding/truncation. It should normally be True only when feeding a training corpus.

The method sequentially performs the following preprocessing steps:

  1. Tokenization using the PreTrainedTokenizer instance provided. The tokenizer internally performs four actions:
    1. Tokenizes the input strings.
    2. Prepends [CLS] token.
    3. Appends [SEP] token.
    4. Maps tokens to their IDs.
  2. Padding or truncating the sequences of token IDs to the same length.
  3. Attention mask generation: 1 denotes tokens extracted from the text and 0 denotes padding tokens.
  4. Conversion of input matrices with token IDs and attention masks to PyTorch tensors.
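
For orientation, the same four steps can be reproduced with a HuggingFace tokenizer directly; the snippet below is independent of the repository's helper class and only illustrates what happens under the hood:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

texts = ['A short review.', 'Another, somewhat longer review text.']

# One call tokenizes, adds [CLS]/[SEP], maps tokens to IDs, pads the shorter
# sequence and builds the attention masks (1 = real token, 0 = padding).
encoded = tokenizer(texts, padding='longest', truncation=True, return_tensors='pt')

input_ids = encoded['input_ids']             # matrix of token IDs (PyTorch tensor)
attention_masks = encoded['attention_mask']  # matrix of 0/1 masks (PyTorch tensor)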

Usage example:

from utils.text_preprocessing import TextPreprocessor

train_texts = ['One morning, when Gregor Samsa woke from troubled dreams, he',
               'found himself transformed in his bed into a horrible vermin. He',
               'lay on his armour-like back, and if he lifted his head a little',
               'he could see his brown belly, slightly domed and divided by',
               'arches into stiff sections.']
test_texts = ['The bedding was hardly able to cover',
              'it and seemed ready to slide off any moment. His many legs,',
              'pitifully thin compared with the size of the rest of him, waved',
              'about helplessly as he looked.']

# Default setup: BertTokenizer with the 'bert-base-cased' vocabulary.
prep = TextPreprocessor()

# fit=True determines the padding/truncation length from the training texts;
# the test texts are then padded/truncated to the same length.
train_seqs, train_masks = prep.preprocess(train_texts, fit=True)
test_seqs, test_masks = prep.preprocess(test_texts)

Models

We make use of HuggingFace's transformers library (PyTorch implementation), which provides general-purpose architectures for NLP tasks with a constantly growing number of pre-trained models.

The central point of the project is the TransformersGeneric class. It acts as an abstraction that encapsulates model definition, training and evaluation. Its main purpose is to hide the complexity associated with the training process and expose only a high-level classification API.

The constructor of the class takes a few parameters; the first three are the crucial ones:

num_classes - number of classes, 2 in this example.

transformers_model - an instance of the PreTrainedModel class. BertForSequenceClassification is used if none is provided.

model_name - name of the model whose pre-trained weights will be used. The full list can be found here. 'bert-base-cased' is used by default if the parameter is not specified.
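
As an illustration, a binary RoBERTa-based classifier could be set up roughly as follows. This is only a sketch: the import path of TransformersGeneric is assumed, and whether the model class itself or an instantiated model is expected should be verified against the notebook.

from transformers import RobertaForSequenceClassification
from models import TransformersGeneric  # import path is an assumption

# Binary sentiment classifier on top of pre-trained RoBERTa weights;
# passing the model class (with weights resolved via model_name) is an assumption.
clf = TransformersGeneric(num_classes=2,
                          transformers_model=RobertaForSequenceClassification,
                          model_name='roberta-base')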

For an example of its usage, refer to notebooks/yelp_reviews_sentiment_analysis.ipynb.

Results

We have experimented with ALBERT Base, DistilRoBERTa Base, RoBERTa Base, BERT Base Cased and DistilBERT Base Cased models for comparison. The following image shows training and validation losses, as well as accuracy and F1 score measured on the validation set.

[Figure: Metrics (training/validation loss, validation accuracy and F1 score per model)]

Interestingly, judging by the training and validation losses, the distilled RoBERTa and BERT models fit the data more closely than their non-distilled counterparts. However, this does not translate into better performance on the validation set. On the other hand, both RoBERTa-based models outperform BERT and ALBERT. Thus, even though RoBERTa's pretraining approach differs from BERT's seemingly only marginally, it still yields a more robust and better-performing model in terms of both validation accuracy and F1 score in this particular setting.

To check how well the models generalize to unseen data, we evaluate them on a hold-out test set. The following image presents the final accuracy (X-axis) and F1 score (Y-axis).

[Figure: Test Accuracy vs F1]

The leaders do not change here: the RoBERTa-based models score better than BERT and ALBERT on both metrics. Compared to the validation accuracy, the test accuracy stays roughly the same, slightly above 0.92. At the same time, the F1 score drops by approximately 0.03 (RoBERTa) and 0.02 (DistilRoBERTa), resulting in 0.896 and 0.899 respectively. A similar decrease in F1 is observed for the remaining three models.

An interactive dashboard with these results and run details is available on the Weights & Biases project page.