<a href="https://colab.research.google.com/github/demoleiwang/SDSC_Bert_Seminar/blob/master/02_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Simple Example for Sentiment Analysis with BERT

If you have any questions, feel free to contact us.

Download the IMDB dataset from Google Drive.

In [None]:
!gdown --id '1vP1lVYFGTLGHjvST3kSH5pxowd_4DcAe' --output IMDB_Dataset.csv

Downloading...
From: https://drive.google.com/uc?id=1vP1lVYFGTLGHjvST3kSH5pxowd_4DcAe
To: /content/IMDB_Dataset.csv
66.2MB [00:02, 31.2MB/s]


In [None]:
! ls

IMDB_Dataset.csv  sample_data


## Data Processing

Torchtext is a convenient library for preparing data for PyTorch models. Some classic datasets can be imported directly with `from torchtext import datasets`, such as the IMDB dataset for sentiment analysis or WikiText103 for language modeling. In this example, we prepare the data in a more general way with torchtext by loading the CSV file ourselves.

In [None]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

In [None]:
from torchtext import data
from torchtext import datasets

import torch
from transformers import BertTokenizer, BertModel

Load the pre-trained BERT model and its corresponding tokenizer.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

BERT adds two special tokens, [CLS] and [SEP], to every input, so we cap the number of tokens per review at the maximum BERT input length minus 2.

In [None]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
print(max_input_length)

def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    return tokens

512
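
To make the arithmetic concrete, here is a minimal sketch in plain Python. It does not call the real tokenizer: the token list is a hypothetical stand-in, and the value 512 is taken from the output printed above.

```python
# Stand-in for tokenizer.tokenize() output: a review that is too long for BERT.
MAX_MODEL_LENGTH = 512          # value printed above for 'bert-base-uncased'
tokens = ["word"] * 600         # hypothetical over-long token list

# Reserve two positions for the special tokens, as tokenize_and_cut does.
truncated = tokens[:MAX_MODEL_LENGTH - 2]

# Once [CLS] and [SEP] are added, the sequence fits the model exactly.
model_input = ["[CLS]"] + truncated + ["[SEP]"]

print(len(truncated))    # 510
print(len(model_input))  # 512
```

Reviews shorter than 510 tokens pass through the slice unchanged.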


Torchtext simplifies data preparation by letting us predefine data fields.

In [None]:
REVIEW = data.Field(batch_first=True,
                    use_vocab=False,
                    tokenize = tokenize_and_cut,
                    preprocessing = tokenizer.convert_tokens_to_ids,
                    init_token = tokenizer.cls_token_id,
                    eos_token = tokenizer.sep_token_id,
                    pad_token = tokenizer.pad_token_id,
                    unk_token = tokenizer.unk_token_id
                   )
SENTIMENT = data.LabelField(dtype = torch.float)

In [None]:
fields = {'review': ('r', REVIEW), 'sentiment': ('s', SENTIMENT)}

Note that loading the data takes more than 3 minutes.

In [None]:
IMDB_data = data.TabularDataset(
    path = './IMDB_Dataset.csv',
    format = 'csv',
    fields = fields)

In [None]:
train_data, valid_data, test_data = IMDB_data.split(split_ratio=[0.8, 0.1, 0.1])

print (len(train_data), len(valid_data), len(test_data))

40000 5000 5000
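
The sizes above follow directly from the ratios. The sketch below reproduces the arithmetic with a plain shuffled index list; `split_indices` is a hypothetical helper for illustration and is not how torchtext implements `split`.

```python
import random

def split_indices(n_examples, ratios, seed=0):
    """Shuffle indices and cut them into consecutive chunks sized by `ratios`."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    sizes = [int(round(r * n_examples)) for r in ratios[:-1]]
    sizes.append(n_examples - sum(sizes))   # remainder goes to the last split
    splits, start = [], 0
    for size in sizes:
        splits.append(indices[start:start + size])
        start += size
    return splits

train_idx, valid_idx, test_idx = split_indices(50000, [0.8, 0.1, 0.1])
print(len(train_idx), len(valid_idx), len(test_idx))  # 40000 5000 5000
```

Because the indices are shuffled before cutting, the three splits are disjoint and together cover every example once.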


The BERT tokenizer, which we apply to the input text, comes with its own vocabulary. Thus, we only need to build a vocabulary for the labels.

In [None]:
SENTIMENT.build_vocab(train_data)
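
Conceptually, a label vocabulary is just a mapping from label strings to integer indices. The following is a simplified sketch of that idea, not torchtext's implementation; the label strings are assumed for illustration, and the actual mapping can be inspected with `SENTIMENT.vocab.stoi`.

```python
from collections import Counter

# Hypothetical label column; the real values come from the 'sentiment' field.
labels = ["positive", "negative", "positive", "negative", "positive"]

# Order labels by frequency (ties broken alphabetically) and number them.
counts = Counter(labels)
itos = sorted(counts, key=lambda label: (-counts[label], label))
stoi = {label: index for index, label in enumerate(itos)}

# LabelField(dtype=torch.float) stores these indices as floats for the loss.
encoded = [float(stoi[label]) for label in labels]
print(stoi)     # {'positive': 0, 'negative': 1}
print(encoded)  # [0.0, 1.0, 0.0, 1.0, 0.0]
```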

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print (device)

cuda


Construct the data iterators.

In [None]:
BATCH_SIZE = 16

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort = False, #don't sort test/validation data
    batch_size=BATCH_SIZE,
    device=device)
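
Each batch produced by the iterator is a rectangular tensor, so shorter reviews are padded with the pad token. Here is a minimal sketch of that step in plain Python. The ids 0, 101 and 102 are assumed to be the [PAD], [CLS] and [SEP] ids of `bert-base-uncased`, and the remaining ids are arbitrary placeholders.

```python
PAD_ID = 0  # assumed value of tokenizer.pad_token_id for 'bert-base-uncased'

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token-id sequences, already wrapped in [CLS]=101 ... [SEP]=102.
batch = [
    [101, 2023, 3185, 102],
    [101, 2307, 102],
    [101, 1045, 3866, 2009, 102],
]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

With `batch_first=True`, the resulting tensor has shape [batch size, sent len], which is what the model below expects.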

## Build a Simple Model with Bert

The model follows [this GitHub repo](https://github.com/bentrevett/pytorch-sentiment-analysis). It stacks a GRU layer on top of BERT.

In [None]:
import torch.nn as nn


class BERTGRUSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):

        super().__init__()

        self.bert = bert

        embedding_dim = bert.config.to_dict()['hidden_size']

        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          batch_first=True,
                          dropout=0 if n_layers < 2 else dropout)

        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        # text = [batch size, sent len]

        with torch.no_grad():
            embedded = self.bert(text)[0]

        # embedded = [batch size, sent len, emb dim]

        _, hidden = self.rnn(embedded)

        # hidden = [n layers * n directions, batch size, hid dim]

        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        else:
            hidden = self.dropout(hidden[-1, :, :])

        # hidden = [batch size, hid dim * n directions]

        output = self.out(hidden)

        # output = [batch size, out dim]

        return output

HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)
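
To check the tensor shapes in `forward` without downloading BERT, the sketch below feeds a random tensor through the same GRU and linear layers. The embedding size of 768 is assumed to be the `hidden_size` of `bert-base-uncased`, and the batch size and sentence length are arbitrary.

```python
import torch
import torch.nn as nn

batch_size, sent_len, emb_dim, hid_dim = 4, 12, 768, 256

# Random tensor standing in for the BERT output self.bert(text)[0].
embedded = torch.randn(batch_size, sent_len, emb_dim)

rnn = nn.GRU(emb_dim, hid_dim, num_layers=2, bidirectional=True,
             batch_first=True, dropout=0.25)
out = nn.Linear(hid_dim * 2, 1)

_, hidden = rnn(embedded)
# hidden = [n layers * n directions, batch size, hid dim]
final = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
logits = out(final)

print(tuple(hidden.shape))  # (4, 4, 256)
print(tuple(final.shape))   # (4, 512)
print(tuple(logits.shape))  # (4, 1)
```

The last two slices of `hidden` are the final forward and backward states of the top GRU layer, which is why the linear layer takes `hidden_dim * 2` inputs in the bidirectional case.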

For this example, we freeze the parameters of BERT. This means we use BERT only as a feature extractor for sentiment analysis.

In [None]:
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False
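
With BERT frozen, only the GRU and the output layer are trained. The sketch below estimates how many parameters that leaves, using the standard GRU layout of three gates with two bias vectors each. `gru_param_count` is a hypothetical helper, and 768 is assumed to be the `hidden_size` of `bert-base-uncased`.

```python
def gru_param_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Parameters of a PyTorch-style GRU (three gates, two bias vectors)."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        layer_input = input_dim if layer == 0 else hidden_dim * n_directions
        per_direction = 3 * (layer_input * hidden_dim   # input-to-hidden weights
                             + hidden_dim * hidden_dim  # hidden-to-hidden weights
                             + 2 * hidden_dim)          # b_ih and b_hh
        total += per_direction * n_directions
    return total

EMBEDDING_DIM = 768   # assumed hidden_size of bert-base-uncased
gru = gru_param_count(EMBEDDING_DIM, 256, n_layers=2, bidirectional=True)
linear = 256 * 2 * 1 + 1   # weights plus bias of the output layer

print(gru)           # 2758656
print(gru + linear)  # 2759169
```

The estimate can be compared against the real model with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.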

Define the optimizer and the binary cross-entropy loss (this is a binary classification task).

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
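
`BCEWithLogitsLoss` takes raw logits, applies the sigmoid internally, and by default averages the per-example losses over the batch. The following sketch shows the per-example formula in plain Python, comparing a numerically stable form with the textbook form. Both helper functions are illustrative and are not PyTorch code.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    """Textbook form: apply the sigmoid first, then take the log-loss."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Hypothetical (logit, target) pairs.
for logit, target in [(2.0, 1.0), (2.0, 0.0), (-0.5, 1.0)]:
    print(round(bce_with_logits(logit, target), 4),
          round(bce_naive(logit, target), 4))
```

Both forms agree for moderate logits. The stable form avoids taking the logarithm of a value that has rounded to zero when the logit is very large in magnitude.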

Evaluation function

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum() / len(correct)
    return acc
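
Rounding the sigmoid output is equivalent to checking the sign of the logit, since the sigmoid exceeds 0.5 exactly when its input is positive. Here is a small sketch in plain Python, using hypothetical logits and labels:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.3, -1.1, 0.4, -0.2, 3.0]   # hypothetical model outputs
labels = [1.0, 0.0, 0.0, 0.0, 1.0]     # hypothetical ground truth

by_rounding = [float(round(sigmoid(x))) for x in logits]
by_sign = [1.0 if x > 0 else 0.0 for x in logits]

correct = sum(p == y for p, y in zip(by_rounding, labels))
print(by_rounding == by_sign)   # True
print(correct / len(labels))    # 0.8
```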

Training and evaluation 

In [None]:
from tqdm import tqdm

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    with tqdm(total=len(iterator)) as pbar:
        for batch in iterator:
            optimizer.zero_grad()

            input_tensor = batch.r  # [batch size, sent len]
            ground_y = batch.s.squeeze(0)

            predictions = model(input_tensor).squeeze(1)

            loss = criterion(predictions, ground_y)

            acc = binary_accuracy(predictions, ground_y)

            loss.backward()

            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()

            pbar.update(1)

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            input_tensor = batch.r  # [batch size, sent len]
            ground_y = batch.s.squeeze(0)

            predictions = model(input_tensor).squeeze(1)

            loss = criterion(predictions, ground_y)

            acc = binary_accuracy(predictions, ground_y)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Helper for reporting the running time of each epoch
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
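
As a quick check of the helper, the sketch below repeats its definition so the block is self-contained and calls it with two hypothetical `time.time()` readings that are 1179.6 seconds apart.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 1179.6 seconds between two hypothetical time.time() readings.
print(epoch_time(100.0, 1279.6))  # (19, 39)
```

The fractional part of a second is truncated rather than rounded.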

Main training loop

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')

    print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc * 100:.2f}%')

100%|██████████| 2500/2500 [17:59<00:00,  2.32it/s]
  0%|          | 0/2500 [00:00<?, ?it/s]

Epoch: 01 | Epoch Time: 19m 39s
	Train Loss: 0.231 | Train Acc: 90.85%
	 Val. Loss: 0.193 |  Val. Acc: 92.23%


 51%|█████▏    | 1287/2500 [09:15<08:43,  2.32it/s]


KeyboardInterrupt: ignored