# BERT Hugging Face/transformers implementation for Sentiment Analysis

This notebook trains a sentiment analysis model to classify movie reviews as *positive* or *negative*, based on the text of the review.


## Setup

### Training using Colab GPU

Google Colab offers free GPUs and TPUs. Because we are training a large neural network, we attach a GPU; without one, training would take a very long time.

A GPU can be added by going to the menu and selecting:

Edit → Notebook Settings → Hardware accelerator → (GPU)

In [1]:
pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/

In [2]:
import transformers
import torch
import torch.nn as nn
from tqdm import tqdm
import pandas as pd
from sklearn import model_selection, metrics
import numpy as np

In [3]:
# Uncomment to mount Google Drive when BERT_PATH points to a Drive folder.
# from google.colab import drive
# drive.mount('/content/gdrive')

## BERT Tokenizer

To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.
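As a rough illustration of those two steps, the sketch below splits a sentence into WordPiece-style tokens using greedy longest-match-first lookup and then maps each token to its vocabulary index. The tiny `TOY_VOCAB` and the helper names `wordpiece` and `encode` are hypothetical and exist only for this example; the real `BertTokenizer` loads a vocabulary of roughly 30,000 entries from the checkpoint and also handles punctuation and text normalization, which this sketch ignores.

```python
# Hypothetical toy vocabulary; the real one ships with the BERT checkpoint.
TOY_VOCAB = {
    '[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
    'the': 4, 'movie': 5, 'was': 6, 'un': 7, '##watch': 8, '##able': 9,
}

def wordpiece(word, vocab):
    """Split one lower-cased word into sub-word tokens, longest match first."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ['[UNK]']  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

def encode(text, vocab):
    """Lower-case, split on whitespace, tokenize, and map tokens to ids."""
    tokens = ['[CLS]']
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append('[SEP]')
    return tokens, [vocab[token] for token in tokens]

tokens, ids = encode('The movie was unwatchable', TOY_VOCAB)
print(tokens)
print(ids)
```

A word missing from the vocabulary is broken into known pieces where possible, which is why BERT rarely needs the `[UNK]` token.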

In [4]:
"""BERT Configuration"""

BERT_PATH = '/content/gdrive/MyDrive/bert_base_uncased'
TOKENIZER = transformers.BertTokenizer.from_pretrained(pretrained_model_name_or_path = BERT_PATH, do_lower_case = True)
MAX_LENGTH = 64
TRAIN_FILE = '/content/dataset.csv'
TRAIN_BATCH_SIZE = 8
TRAIN_N_WORKERS = 4
VALIDATION_BATCH_SIZE = 4
VALIDATION_N_WORKERS = 1
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is attached
EPOCHS = 10
MODEL_PATH = 'model.bin'

## Loading and Preprocessing the input data

We'll need to transform our data into a format BERT understands. Text inputs need to be transformed to numeric token ids and arranged in several Tensors before being input to BERT.
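To make the target format concrete, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask and token type ids. The function name `pad_and_mask` and the example ids are hypothetical; in this notebook the tokenizer's `encode_plus` call does this work.

```python
def pad_and_mask(token_ids, max_length, pad_id = 0):
    """Return fixed-length input_ids, attention_mask and token_type_ids lists."""
    # truncate sequences that are too long (simplification: the real tokenizer
    # keeps the final [SEP] token when it truncates)
    input_ids = token_ids[:max_length]
    n_real = len(input_ids)
    n_pad = max_length - n_real

    input_ids = input_ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    token_type_ids = [0] * max_length            # single sentence: every segment id is 0
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'token_type_ids': token_type_ids,
    }

# hypothetical ids standing in for "[CLS] great movie [SEP]"
example = pad_and_mask([101, 2307, 3185, 102], max_length = 8)
print(example['input_ids'])
print(example['attention_mask'])
```

Because every example ends up with the same length, a data loader can stack them into a single batch tensor.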

In [5]:
"""data_loader"""

class BERTDataset:
    def __init__(self, review, target):
        self.review = review
        self.target = target
        self.tokenizer = TOKENIZER
        self.max_length = MAX_LENGTH

    def __len__(self):
        return len(self.review)

    def __getitem__(self, item_index):
        # normalize: cast to str and collapse runs of whitespace
        review = str(self.review[item_index])
        review = ' '.join(review.split())

        # encode: add [CLS]/[SEP], then pad or truncate to max_length
        inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens = True,
            max_length = self.max_length,
            padding = 'max_length',
            truncation = True
        )

        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        token_type_ids = inputs['token_type_ids']

        return {
            'input_ids': torch.tensor(input_ids, dtype = torch.long),
            'attention_mask': torch.tensor(attention_mask, dtype = torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype = torch.long),
            'targets': torch.tensor(self.target[item_index], dtype = torch.float)
        }

## Defining the model

BERT base is a stack of 12 Transformer encoder layers, each with 12 bidirectional self-attention heads and a hidden size of 768. Since sentiment analysis here is a binary classification problem, the linear output layer takes *in_features* = 768 and produces *out_features* = 1.

*last_hidden_state* holds the final-layer hidden state of every token in every sequence of the batch, with shape (batch size, sequence length, 768).

*pooled_output* represents each input sequence as a whole: it is derived from the final hidden state of the [CLS] token and has shape (batch size, 768). You can think of it as an embedding of the entire movie review. For fine-tuning we use *pooled_output*.
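The shape bookkeeping can be sketched without loading BERT. The snippet below uses random numbers as a stand-in for the pooled output and applies a plain linear map from 768 values to a single logit per review. The names `linear_head`, `pooled_output`, `weights` and `bias` are hypothetical; it is a pure-Python approximation of what a 768-to-1 linear layer computes, not the notebook's model.

```python
import random

HIDDEN_SIZE = 768  # width of BERT base's pooled output

def linear_head(pooled_batch, weights, bias):
    """Map each 768-dimensional pooled vector to a single logit."""
    return [[sum(w * x for w, x in zip(weights, pooled)) + bias] for pooled in pooled_batch]

random.seed(0)
batch_size = 4
# stand-in for pooled_output: shape (batch_size, 768), values in [-1, 1]
pooled_output = [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(batch_size)]
weights = [random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)]
bias = 0.0

logits = linear_head(pooled_output, weights, bias)
print(len(logits), len(logits[0]))    # batch_size rows, one logit per review
n_head_parameters = len(weights) + 1  # 768 weights plus one bias
print(n_head_parameters)
```

The classification head therefore adds only 769 trainable parameters on top of BERT itself.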

In [6]:
"""BERT Model"""

class BERTBaseUncased(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained(pretrained_model_name_or_path = BERT_PATH)
        self.bert_dropout = nn.Dropout(0.3)
        # 768 = hidden size of BERT base; 1 = single logit for binary classification
        self.output_layer = nn.Linear(in_features = 768, out_features = 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # return_dict = False makes BertModel return a (last_hidden_state, pooled_output) tuple
        last_hidden_state, pooled_output = self.bert(
            input_ids,
            attention_mask = attention_mask,
            token_type_ids = token_type_ids,
            return_dict = False
        )
        bert_output = self.bert_dropout(pooled_output)
        output = self.output_layer(bert_output)
        return output

## Training the model

### Loss function

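Since this is a binary classification problem and the model outputs a single raw score (a logit) from its one-unit output layer, we use binary cross-entropy. PyTorch's `BCEWithLogitsLoss` applies the sigmoid internally, so the model itself needs no sigmoid layer.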

In [7]:
"""Loss function"""

def loss_fn(outputs, targets):
    return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))
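For intuition about what this loss computes on a single example, here is a minimal pure-Python sketch of binary cross-entropy on a raw logit, using the usual numerically stable rearrangement. The function names `bce_with_logits`, `sigmoid` and `naive_bce` are hypothetical; the sketch illustrates the formula and is not PyTorch's implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one raw logit and a 0/1 target."""
    # same value as -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))],
    # rearranged so that exp() never receives a large positive argument
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

def naive_bce(logit, target):
    """Textbook form, fine for moderate logits but unstable for extreme ones."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

print(bce_with_logits(0.0, 1.0))  # log(2): the model is undecided
print(bce_with_logits(3.0, 1.0))  # confident and correct: small loss
print(bce_with_logits(3.0, 0.0))  # confident and wrong: large loss
```

Working on logits rather than probabilities is what keeps the loss finite when the model is very confident.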

In [8]:
"""Train function"""

def train_fn(data_loader, model, optimizer, device, scheduler):
    TOTAL_N_BATCHES = len(data_loader)
    model.train()

    for batch_index, dataset in tqdm(enumerate(data_loader), total = TOTAL_N_BATCHES):
        input_ids = dataset['input_ids']
        attention_mask = dataset['attention_mask']
        token_type_ids = dataset['token_type_ids']
        targets = dataset['targets']

        # send to device
        input_ids = input_ids.to(device, dtype = torch.long)
        attention_mask = attention_mask.to(device, dtype = torch.long)
        token_type_ids = token_type_ids.to(device, dtype = torch.long)
        targets = targets.to(device, dtype = torch.float)

        optimizer.zero_grad()
        outputs = model(input_ids = input_ids, attention_mask = attention_mask, token_type_ids = token_type_ids)

        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()
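Calling `scheduler.step()` after every batch means the learning rate changes at each optimizer update. The sketch below computes the multiplier that a linear schedule with warmup applies to the base learning rate, written from the documented behaviour of `get_linear_schedule_with_warmup` rather than copied from the library. The function name `linear_schedule_factor` is hypothetical, and the step count assumes the 10 epochs of 5625 batches seen in this notebook's run.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # then decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

BASE_LR = 3e-5
TOTAL_STEPS = 56250  # 10 epochs x 5625 batches (assumed from this notebook's run)

for step in (0, 28125, 56250):
    print(step, BASE_LR * linear_schedule_factor(step, 0, TOTAL_STEPS))
```

With zero warmup steps the learning rate starts at its full value and reaches zero on the final step, so the total step count passed to the scheduler should match the number of batches actually processed.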

In [9]:
"""Evaluation function"""

def evaluate_fn(data_loader, model, device):
    final_outputs, final_targets = [], []
    TOTAL_N_BATCHES = len(data_loader)
    model.eval()
    with torch.no_grad():
        for batch_index, dataset in tqdm(enumerate(data_loader), total = TOTAL_N_BATCHES):
            input_ids = dataset['input_ids']
            attention_mask = dataset['attention_mask']
            token_type_ids = dataset['token_type_ids']
            targets = dataset['targets']

            # send to device
            input_ids = input_ids.to(device, dtype = torch.long)
            attention_mask = attention_mask.to(device, dtype = torch.long)
            token_type_ids = token_type_ids.to(device, dtype = torch.long)
            targets = targets.to(device, dtype = torch.float)

            outputs = model(input_ids = input_ids, attention_mask = attention_mask, token_type_ids = token_type_ids)
            
            final_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
            final_targets.extend(targets.cpu().detach().numpy().tolist())
      
    return final_outputs, final_targets
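The evaluation function returns sigmoid probabilities, which still have to be thresholded before they can be compared with the 0/1 targets. A minimal pure-Python sketch of that step follows; the name `accuracy_from_logits` and the example values are hypothetical, and the notebook itself relies on NumPy and `sklearn.metrics.accuracy_score` for this.

```python
import math

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

def accuracy_from_logits(logits, targets, threshold = 0.5):
    """Turn raw logits into 0/1 predictions and compare them with the targets."""
    predictions = [1 if sigmoid(logit) >= threshold else 0 for logit in logits]
    n_correct = sum(int(p == int(t)) for p, t in zip(predictions, targets))
    return n_correct / len(targets)

# hypothetical logits for four reviews and their true labels
logits = [2.1, -0.7, 0.3, -1.5]
targets = [1.0, 0.0, 0.0, 0.0]
print(accuracy_from_logits(logits, targets))
```

A probability threshold of 0.5 corresponds to a logit threshold of 0, so the sign of the raw output already determines the predicted class.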

In [10]:
def main():
    LEARNING_RATE = 3e-5


    dataframe = pd.read_csv(TRAIN_FILE).fillna('none')
    dataframe.sentiment = dataframe.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

    # split into training and validation datasets
    dataframe_train, dataframe_validation = model_selection.train_test_split(
        dataframe, test_size = 0.1, random_state = 42, stratify = dataframe.sentiment.values
    )
    dataframe_train = dataframe_train.reset_index(drop = True)
    dataframe_validation = dataframe_validation.reset_index(drop = True)


    # create training dataset
    train_dataset = BERTDataset(
        review = dataframe_train.review.values, target = dataframe_train.sentiment.values
    )
    # create training data_loader
    train_data_loader = torch.utils.data.DataLoader(
        dataset = train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers = TRAIN_N_WORKERS
    )
    # create validation dataset
    validation_dataset = BERTDataset(
        review = dataframe_validation.review.values, target = dataframe_validation.sentiment.values
    )
    # create validation data_loader
    validation_data_loader = torch.utils.data.DataLoader(
        dataset = validation_dataset, batch_size = VALIDATION_BATCH_SIZE, num_workers = VALIDATION_N_WORKERS
    )


    device = torch.device(DEVICE)
    model = BERTBaseUncased()
    model.to(device)

    # Optimizer:
    # For fine-tuning, we use the same kind of optimizer that BERT was originally
    # trained with: "Adaptive Moments" (Adam) with weight decay, also known as AdamW.
    # It minimizes the prediction loss and regularizes by decaying the weights
    # directly rather than through the moment estimates. Biases and LayerNorm
    # parameters are excluded from weight decay below.

    parameter_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_parameters = [
        {
            'params': [parameter for n, parameter in parameter_optimizer if not any(nd in n for nd in no_decay)],
            'weight_decay': 0.001
        },
        {
            'params': [parameter for n, parameter in parameter_optimizer if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0
        }
    ]

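    # the scheduler is stepped once per batch, so count batches rather than rows
    n_train_steps = EPOCHS * len(train_data_loader)
    # torch.optim.AdamW replaces transformers.AdamW, which is deprecated in newer transformers releases
    optimizer = torch.optim.AdamW(optimizer_parameters, lr = LEARNING_RATE)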
    scheduler = transformers.get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps = 0, num_training_steps = n_train_steps
    )

    best_accuracy = 0
    for _ in range(EPOCHS):
        train_fn(train_data_loader, model, optimizer, device, scheduler)
        outputs, targets = evaluate_fn(validation_data_loader, model, device)
        outputs = np.array(outputs) >= 0.5
        accuracy = metrics.accuracy_score(targets, outputs)
        print('Accuracy Score = {}'.format(accuracy))
        if accuracy > best_accuracy:
            # save the weights of the best-performing epoch so far
            torch.save(model.state_dict(), MODEL_PATH)
            best_accuracy = accuracy
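The parameter grouping inside `main()` relies on substring matching against parameter names. The sketch below reproduces that selection logic on a handful of names written in the style of `model.named_parameters()`; the helper `split_for_weight_decay` and the short name list are hypothetical and only illustrate which parameters end up in which group.

```python
def split_for_weight_decay(parameter_names, no_decay = ('bias', 'LayerNorm.bias', 'LayerNorm.weight')):
    """Return (decayed, not_decayed) name lists using substring matching."""
    decayed = [n for n in parameter_names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in parameter_names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed

# a few illustrative names in the style produced by model.named_parameters()
names = [
    'bert.embeddings.word_embeddings.weight',
    'bert.embeddings.LayerNorm.weight',
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'output_layer.weight',
    'output_layer.bias',
]

decayed, not_decayed = split_for_weight_decay(names)
print('weight decay applied:', decayed)
print('weight decay skipped:', not_decayed)
```

Every parameter lands in exactly one group, so the optimizer sees each tensor once, with weight decay applied only to the weight matrices.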

In [11]:
main()

Some weights of the model checkpoint at /content/gdrive/MyDrive/bert_base_uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Accuracy Score = 0.8096


Accuracy Score = 0.83


Accuracy Score = 0.8456


Accuracy Score = 0.8326



Accuracy Score = 0.853


Accuracy Score = 0.851



Accuracy Score = 0.849



Accuracy Score = 0.8586


Accuracy Score = 0.8538



Accuracy Score = 0.8576





## References

TensorFlow Hub authors

Chris McCormick