In [None]:
%load_ext autoreload
%autoreload 2

# 9. Sentiment Analysis - BERT Classification

**BERT** is a bidirectional language model based on the transformer architecture. It is pre-trained on two unsupervised tasks, masked language modeling and next sentence prediction. It is commonly used as an encoder for downstream tasks, providing a vector representation of the input text.<br/>
The decoder part, which is responsible for producing a prediction for the task, has to be added separately and depends on the task. For text classification, the decoder usually consists of a few linear layers.<br/><br/>
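
To make the encoder/decoder split concrete, here is a minimal sketch of a classification decoder in PyTorch. The class name `ClassificationHead`, the dropout rate, and the random stand-in for the encoder output are assumptions made for illustration; only the hidden size of 768 corresponds to `bert-base-uncased`.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Hypothetical task head: maps one encoder vector per text to class logits."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded has shape (batch_size, hidden_size): one vector per input text
        return self.linear(self.dropout(encoded))


torch.manual_seed(0)
head = ClassificationHead().eval()  # eval() disables dropout
fake_encoder_output = torch.randn(4, 768)  # random stand-in for BERT outputs
logits = head(fake_encoder_output)
print(logits.shape)  # torch.Size([4, 2])
```

The head produces one unnormalized score (logit) per class for every input text; the encoder itself is unchanged by the choice of head.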

The goal of this exercise is to show how to use a pre-trained BERT model for text classification. We use the standard implementation from the [Hugging Face transformers library](https://huggingface.co/transformers/model_doc/bert.html).<br/>
We explain how to prepare the data, load the model, run it, and evaluate the results. This notebook does not cover fine-tuning BERT for a specific downstream task, so the classification layer added on top keeps its random initial weights and the predictions should not be expected to be accurate. Fine-tuning is highly recommended as a homework exercise, to fully understand the capabilities of the model.


## Setup

First, set up the path to the (preprocessed) dataset:

In [None]:
# Path to the preprocessed data
import os

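# In a notebook __file__ is not defined, so the quoted string '__file__' is
# resolved against the current working directory; fileDir is that directory.
fileDir = os.path.dirname(os.path.realpath('__file__'))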
absFilePathToPreprocessedDataset = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon_preprocessed.csv')
pathToPreprocessedDataset = os.path.abspath(os.path.realpath(absFilePathToPreprocessedDataset))
print(pathToPreprocessedDataset)

Set up which device to use:

In [None]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

In [None]:
from Common.TwitterDataset import TwitterDataset

# Step #1: Instantiate the dataset
dataset = TwitterDataset.load_dataset_and_make_vectorizer(pathToPreprocessedDataset)

Initialize BertTokenizer, which is based on WordPiece tokenization. It encodes the input text in the format the model expects and encapsulates the token vocabulary.
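
To illustrate the idea behind WordPiece, the sketch below splits a single word greedily into the longest sub-word pieces found in a vocabulary. The function `wordpiece_tokenize` and the toy vocabulary are hypothetical and written only for this illustration; the real tokenizer also handles lowercasing, punctuation, and special tokens, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for this example.
toy_vocab = {"play", "##ing", "##ed", "un", "##happy", "happy"}
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("unhappy", toy_vocab))  # ['un', '##happy']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Because rare words are broken into known pieces, the vocabulary stays small while most inputs avoid the unknown token.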

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Define the batch size that should be used:

In [None]:
# set the batch size
batch_size = 32

Create data loaders for all three splits of the dataset (training, validation and test).<br/>
The following two steps are repeated for each split:<br/>
* Iterate through the split and encode each tweet individually (tokenization + vectorization).
* Group the tweets into batches of <code>batch_size</code> elements to create a DataLoader object.
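
The two steps above can be sketched as follows. This is not the code of <code>prepare_dataloader</code>, whose implementation is not shown here; the helper `pad_and_mask`, the illustrative token ids, and the tiny batch size are assumptions chosen for the example. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of the `bert-base-uncased` vocabulary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad one id sequence; the mask is 1 for real tokens, 0 for padding."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Illustrative token ids for three short tweets and their sentiment labels.
encoded_tweets = [[101, 7592, 102], [101, 2204, 2154, 102], [101, 2919, 102]]
labels = [1, 1, 0]

# Step 1: bring every tweet to the same length and build the attention masks.
padded = [pad_and_mask(ids, max_length=5) for ids in encoded_tweets]
input_ids = torch.tensor([ids for ids, _ in padded])
attention_mask = torch.tensor([mask for _, mask in padded])

# Step 2: group the encoded tweets into batches.
toy_dataset = TensorDataset(input_ids, attention_mask, torch.tensor(labels))
toy_dataloader = DataLoader(toy_dataset, batch_size=2, shuffle=False)

for batch_ids, batch_mask, batch_labels in toy_dataloader:
    print(batch_ids.shape, batch_mask.sum(dim=1).tolist(), batch_labels.tolist())
```

The attention mask lets the model ignore padding positions, so tweets of different lengths can share one tensor.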

In [None]:
from BERTModel.BERTDataLoader import prepare_dataloader

train_dataloader = prepare_dataloader(tokenizer, dataset.train_df, batch_size)
validation_dataloader = prepare_dataloader(tokenizer, dataset.validation_df, batch_size)
test_dataloader = prepare_dataloader(tokenizer, dataset.test_df, batch_size)

## BERT for Sequence Classification

We load BertForSequenceClassification, the pretrained BERT model with a single linear classification layer on top.

In [None]:
from transformers import BertForSequenceClassification

# Step #2: Instantiate the model
model = BertForSequenceClassification.from_pretrained(
    # use weights from pretrained 12-layer BERT model, with an uncased vocab.
    pretrained_model_name_or_path="bert-base-uncased",
    num_labels=2,  # the number of output labels
    output_attentions=False,  # whether the model returns attention weights.
    output_hidden_states=False,  # whether the model returns all hidden-states.
)
# send model to appropriate device
model = model.to(device)

## Evaluate the results

We run model inference on every tweet in the test set and collect the predicted labels.
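
A classification model of this kind outputs one logit per class, and a predicted label is usually obtained by taking the class with the highest score. The sketch below shows that step in isolation; the logits are made up, the mapping of index 0 to negative and index 1 to positive is an assumption, and the actual `predict` helper may work differently.

```python
import torch

# Made-up logits for three tweets and two classes (assumed order: negative, positive).
logits = torch.tensor([[2.0, -1.0], [0.3, 0.9], [-0.5, 0.5]])

probabilities = torch.softmax(logits, dim=-1)    # each row sums to 1
predicted_labels = probabilities.argmax(dim=-1)  # index of the most likely class

print(predicted_labels.tolist())  # [0, 1, 1]
```

Softmax does not change which class scores highest, so it is only needed when the probabilities themselves are of interest.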

In [None]:
from BERTModel.BERTPredictor import predict

y_predicted = dataset.test_df.text.apply(lambda x: predict(text=x, model=model, tokenizer=tokenizer))

### More detailed evaluation on the Test Set

In [None]:
from RunHelper import print_evaluation_report

print_evaluation_report(y_labeled=dataset.test_df.target, y_predicted=y_predicted)
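
The output of `print_evaluation_report` is not shown here. As a rough illustration of the metrics such a report typically contains, the sketch below computes accuracy, precision, recall and F1 for a binary task in plain Python. The function `binary_classification_report` and the sample labels are hypothetical.

```python
def binary_classification_report(y_true, y_pred, positive_label=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions for six tweets.
report = binary_classification_report([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(report)
```

Precision and recall are worth reporting next to accuracy because accuracy alone hides which class the errors fall on.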