# **Practice: BERT** (Bidirectional Encoder Representations from Transformers)

Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." [(paper link).](https://arxiv.org/abs/1810.04805)

BERT is one of the best-known pre-trained language models, released by Google in 2018. A pre-trained BERT can be adapted to many downstream tasks through a process called `fine-tuning`: training further on task-specific data so that the parameters of the pre-trained BERT are readjusted for the new task.

In this practice, we focus on how to use BERT for the task we want to solve, so we will load a pre-trained BERT model from Hugging Face and use it. Implementing the BERT model yourself is complicated, but it helps a lot in understanding the Transformer in depth. If you're curious about the detailed code of the model, check out this [link](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py).

Now, let's practice fine-tuning BERT to classify Naver movie reviews!

**Note:** To ensure a smooth workflow, please run all the cells in sequential order. This way, dependencies and intermediate variables will correctly propagate from one cell to the next.

## Device

You might need to use a GPU for this Colab.

Please click `Runtime` and then `Change runtime type`. Then set the `Hardware accelerator` to GPU.

## Installation

In [None]:
# Get transformers made by HuggingFace
!pip install transformers
!pip install tensorflow
!pip install torch
!pip install pandas



In [None]:
import tensorflow as tf
import torch

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import random
import time
import datetime


#https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

In [None]:
# Download Naver movie reviews and sentiment analysis data
!git clone https://github.com/e9t/nsmc.git

Cloning into 'nsmc'...
remote: Enumerating objects: 14763, done.
remote: Counting objects: 100% (14762/14762), done.
remote: Compressing objects: 100% (13012/13012), done.
remote: Total 14763 (delta 1748), reused 14762 (delta 1748), pack-reused 1 (from 1)
Receiving objects: 100% (14763/14763), 56.19 MiB | 7.73 MiB/s, done.
Resolving deltas: 100% (1748/1748), done.
Updating files: 100% (14737/14737), done.


In [None]:
# List files in a directory
!ls nsmc -la

total 38684
drwxr-xr-x 5 root root     4096 Nov  4 16:27 .
drwxr-xr-x 1 root root     4096 Nov  4 16:27 ..
drwxr-xr-x 2 root root     4096 Nov  4 16:27 code
drwxr-xr-x 8 root root     4096 Nov  4 16:27 .git
-rw-r--r-- 1 root root  4893335 Nov  4 16:27 ratings_test.txt
-rw-r--r-- 1 root root 14628807 Nov  4 16:27 ratings_train.txt
-rw-r--r-- 1 root root 19515078 Nov  4 16:27 ratings.txt
drwxr-xr-x 2 root root   512000 Nov  4 16:27 raw
-rw-r--r-- 1 root root     2596 Nov  4 16:27 README.md
-rw-r--r-- 1 root root    36746 Nov  4 16:27 synopses.json


## Prepare Model's Input

### Load Data

In this section, we will examine the structure of the Naver movie review data.


In [None]:
# Load training and test data by using Pandas
train = pd.read_csv("nsmc/ratings_train.txt", sep='\t')
test = pd.read_csv("nsmc/ratings_test.txt", sep='\t')

def get_shape(dataset):
  # DataFrame.shape is a (rows, columns) tuple
  num_row, num_col = dataset.shape

  return num_row, num_col

# Print shapes of train and test data
train_num_row, train_num_col = get_shape(train)
test_num_row, test_num_col = get_shape(test)
print("Train dataset has {} rows and {} columns".format(train_num_row, train_num_col))
print("Test dataset has {} rows and {} columns".format(test_num_row, test_num_col))

Train dataset has 150000 rows and 3 columns
Test dataset has 50000 rows and 3 columns


In [None]:
# Print the first 10 lines of the training set
train.head(10)

|   | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |
| 5 | 5403919 | 막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움. | 0 |
| 6 | 7797314 | 원작의 긴장감을 제대로 살려내지못했다. | 0 |
| 7 | 9443947 | 별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단... | 0 |
| 8 | 7156791 | 액션이 없는데도 재미 있는 몇안되는 영화 | 1 |
| 9 | 5912145 | 왜케 평점이 낮은건데? 꽤 볼만한데.. 헐리우드식 화려함에만 너무 길들여져 있나? | 1 |


The Naver movie review dataset has three columns: id, document, and label.

*   "id" is the identifier of the review.
*   "document" contains the text of the review.
*   "label" is the sentiment class of the review: "0" represents negative sentiment and "1" represents positive sentiment.

The data is loaded as a pandas DataFrame with the columns "id", "document", and "label".

### Preprocessing

In this section, we will preprocess the data into the input format that BERT expects. Each input sentence for BERT should start with the special token [CLS] and end with the special token [SEP].

Extract the training and test review sentences and convert them into the input format for BERT.




In [None]:
train_sentences = train['document']
train_labels = train['label'].values

test_sentences = test['document']
test_labels = test['label'].values

print(train_sentences[:5])
print(train_labels[:5])
print(test_sentences[:5])
print(test_labels[:5])

train_sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in train_sentences]
test_sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in test_sentences]

print(train_sentences[:5])
print(test_sentences[:5])


0                                  아 더빙.. 진짜 짜증나네요 목소리
1                    흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
2                                    너무재밓었다그래서보는것을추천한다
3                        교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
4    사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...
Name: document, dtype: object
[0 1 0 0 1]
0                                                  굳 ㅋ
1                                 GDNTOPCLASSINTHECLUB
2               뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아
3                     지루하지는 않은데 완전 막장임... 돈주고 보기에는....
4    3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??
Name: document, dtype: object
[1 0 0 0 0]
['[CLS] 아 더빙.. 진짜 짜증나네요 목소리 [SEP]', '[CLS] 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 [SEP]', '[CLS] 너무재밓었다그래서보는것을추천한다 [SEP]', '[CLS] 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 [SEP]', '[CLS] 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다 [SEP]']
['[CLS] 굳 ㅋ [SEP]', '[CLS] GDNTOPCLASSINTHECLUB [SEP]', '[CLS] 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아 [SEP]', '[CLS] 지루하지는 않은데

### Tokenizing

Tokenize the preprocessed sentences using the BERT tokenizer.
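
BERT's tokenizer splits words into WordPiece subwords, which is why continuation pieces carry a `##` prefix and unmatched words become `[UNK]`. The following is a minimal sketch of greedy longest-match-first splitting for a single word. The function name `wordpiece_tokenize` and the English `toy_vocab` are invented for illustration; the real tokenizer also handles whitespace, punctuation, and a vocabulary of over 100,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            # No piece matches, so the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```
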

In [None]:
# 1. Load BERT tokenizer (use 'bert-base-multilingual-cased' model and set do_lower_case=False)
# Please refer to the tutorial below for Huggingface tokenizers:
# https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt
# 2. Tokenize the sentences by using the loaded BERT tokenizer
# 3. Put the tokenized sentences into a list.

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)

train_tokenized_texts = [tokenizer.tokenize(sent) for sent in train_sentences]
test_tokenized_texts = [tokenizer.tokenize(sent) for sent in test_sentences]

print(train_sentences[0])
print(train_tokenized_texts[0])
print(test_sentences[0])
print(test_tokenized_texts[0])

[CLS] 아 더빙.. 진짜 짜증나네요 목소리 [SEP]
['[CLS]', '아', '더', '##빙', '.', '.', '진', '##짜', '짜', '##증', '##나', '##네', '##요', '목', '##소', '##리', '[SEP]']
[CLS] 굳 ㅋ [SEP]
['[CLS]', '굳', '[UNK]', '[SEP]']


### Padding

In natural language processing, we convert a natural language sentence into a list of token ids. A language model processes a batch of multiple sentences in each iteration. However, sentences of different lengths cannot be stacked into a single matrix. We therefore pad the sentences so that they all have the same length, and then combine them into one matrix.



*   If a sequence is longer than the maximum length specified by the user, truncate it to the maximum length.
*   If a sequence is shorter than the maximum length, add padding at the end ("post-padding") to produce a sequence of the maximum length. (Adding padding at the beginning of a sentence is called "pre-padding".)
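
The two rules above can be sketched in plain Python. This is only an illustration of post-truncation and post-padding, not the `pad_sequences` implementation; the helper name `pad_or_truncate` is hypothetical, and the short example reuses the token ids of the first test sentence.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut the sequence at max_len, then append pad_id until it has max_len items."""
    clipped = token_ids[:max_len]                      # post-truncation
    return clipped + [pad_id] * (max_len - len(clipped))  # post-padding


short_ids = [101, 8911, 100, 102]   # shorter than the limit
long_ids = list(range(1, 11))       # ten made-up ids, longer than the limit

print(pad_or_truncate(short_ids, 8))  # [101, 8911, 100, 102, 0, 0, 0, 0]
print(pad_or_truncate(long_ids, 8))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

Note that post-truncation drops the trailing tokens of an over-long sentence, which would include its final [SEP] token.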



In [None]:
# Maximum length of the input sequence
MAX_LEN = 128

train_input_ids = [tokenizer.convert_tokens_to_ids(x) for x in train_tokenized_texts]
train_input_ids = pad_sequences(train_input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

test_input_ids = [tokenizer.convert_tokens_to_ids(x) for x in test_tokenized_texts]
test_input_ids = pad_sequences(test_input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

print(train_input_ids[0])
print(test_input_ids[0])

[   101   9519   9074 119005    119    119   9708 119235   9715 119230
  16439  77884  48549   9284  22333  12692    102      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0]
[ 101 8911  100  102    0    0    0    0    0    0    0    0    0    0
    0    0    0    

### Attention Mask

The attention mask marks which positions hold real tokens (1) and which hold padding (0), so that BERT's attention operations do not attend to the padding positions.
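
To see why a mask removes attention from padding, here is a small sketch of a masked softmax over one row of attention scores. The scores are made-up numbers and `masked_softmax` is a hypothetical helper; the general idea, pushing masked positions to a very negative score before the softmax, is an assumption about how such masks are typically applied rather than a copy of BERT's code.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores in which positions with mask == 0 receive zero weight."""
    # Replace masked scores with a very negative value before exponentiating.
    adjusted = [s if m else -1e9 for s, m in zip(scores, mask)]
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


toy_ids = [101, 8911, 100, 102, 0, 0]           # last two positions are padding
toy_mask = [float(i > 0) for i in toy_ids]      # same rule as in this notebook
toy_scores = [0.5, 1.0, 0.2, 0.1, 0.9, 0.9]     # hypothetical raw attention scores

toy_weights = masked_softmax(toy_scores, toy_mask)
print(toy_mask)
print([round(w, 3) for w in toy_weights])
```

The weights at the two padded positions come out as zero, and the remaining weights still sum to one.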

In [None]:
# Initialize attention masks
train_attention_masks = []
test_attention_masks = []

# The padding id is 0, so every id greater than 0 is a real token (mask 1.0)
for seq in train_input_ids:
    seq_mask = [float(i>0) for i in seq]
    train_attention_masks.append(seq_mask)

for seq in test_input_ids:
    seq_mask = [float(i>0) for i in seq]
    test_attention_masks.append(seq_mask)


print(train_attention_masks[0])
print(test_attention_masks[0])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

### Data Split

To prepare validation dataset, split training data into a training set and a validation set. Also, split attention masks into training masks and validation masks.
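
Splitting the inputs and the masks in two separate calls only works because both calls shuffle in the same way when given the same seed. The sketch below illustrates that idea with the standard library; `split_indices` is a hypothetical helper and does not reproduce scikit-learn's actual shuffling.

```python
import random


def split_indices(n, test_size, seed):
    """Shuffle the indices 0..n-1 with a fixed seed and hold out a share of them."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n * test_size))
    return indices[n_val:], indices[:n_val]


# Toy token ids (0 is padding) and the masks derived from them.
toy_input_ids = [[101, 10 + i, 102] + [0] * (i % 3) for i in range(10)]
toy_masks = [[float(t > 0) for t in row] for row in toy_input_ids]

# Two separate calls with the same seed yield the same partition.
toy_train_idx, toy_val_idx = split_indices(len(toy_input_ids), test_size=0.1, seed=2018)
mask_train_idx, mask_val_idx = split_indices(len(toy_masks), test_size=0.1, seed=2018)

print(len(toy_train_idx), len(toy_val_idx))  # 9 1
print(toy_train_idx == mask_train_idx)       # True
```

With a different seed in the second call, the masks would no longer line up with their inputs.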

In [None]:
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(train_input_ids,
                                                                                    train_labels,
                                                                                    random_state=2018,
                                                                                    test_size=0.1)

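# Split the attention masks with the same random_state so that they stay aligned with the inputs.
# The second argument is only a placeholder; its split is discarded.
train_masks, validation_masks, _, _ = train_test_split(train_attention_masks,
                                                       train_input_ids,
                                                       random_state=2018,
                                                       test_size=0.1)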


train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)
test_inputs = torch.tensor(test_input_ids)
test_labels = torch.tensor(test_labels)
test_masks = torch.tensor(test_attention_masks)

print(train_inputs[0])
print(train_labels[0])
print(train_masks[0])
print(validation_inputs[0])
print(validation_labels[0])
print(validation_masks[0])
print(test_inputs[0])
print(test_labels[0])
print(test_masks[0])

tensor([   101,   9711,  11489,   9364,  41850,   9004,  32537,   9491,  35506,
         17360,  48549,    119,    119,   9477,  26444,  12692,   9665,  21789,
         11287,   9708, 119235,   9659,  22458, 119136,  12965,  48549,    119,
           119,   9532,  22879,   9685,  16985,  14523,  48549,    119,    119,
          9596, 118728,    119,    119,   9178, 106065, 118916,    119,    119,
          8903,  11664,  11513,   9960,  14423,  25503, 118671,  48549,    119,
           119,  21890,   9546,  37819,  22879,   9356,  14867,   9715, 119230,
        118716,  48345,    119,   9663,  23321,  10954,   9638,  35506, 106320,
         10739,  20173,   9359,  19105,  11102,  42428,  17196,  48549,    119,
           119,    100,    117,   9947,  12945,   9532,  25503,   8932,  14423,
         35506, 119050,  11903,  14867,  10003,  14863,  33188,  48345,    119,
           102,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0, 

### Creating DataLoader, and Making Mini-Batch

Now, we are going to create the final input for BERT. We bundle the input ids, attention masks, and labels into a single dataset, and a data loader retrieves them in mini-batches of the given batch size during training.
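
As a rough picture of what a data loader does, the sketch below yields mini-batches from a list, optionally in shuffled order. `iterate_batches` is a hypothetical stand-in, not the PyTorch `DataLoader`. The last line works out the number of training batches per epoch for this notebook, assuming 135,000 training examples (150,000 rows minus the 10% validation split) and a batch size of 32.

```python
import math
import random


def iterate_batches(examples, batch_size, shuffle, seed=0):
    """Yield consecutive mini-batches, optionally after shuffling the order."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


toy_examples = list(range(100))
toy_batches = list(iterate_batches(toy_examples, batch_size=32, shuffle=True))

print(len(toy_batches), [len(b) for b in toy_batches])  # 4 [32, 32, 32, 4]
print(math.ceil(135000 / 32))                           # 4219
```

The final batch is smaller whenever the dataset size is not a multiple of the batch size.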

In [None]:
# Set the batch size
batch_size = 32

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)  # evaluation does not need shuffling
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

## GPU setup

In [None]:
# Get the device name
device_name = tf.test.gpu_device_name()

# Inspect if the device is GPU
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
# Set the device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## Create Model

In [None]:
# Create a BERT model for classification
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=2)
model.to(device)  # move the model to the device selected above

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

## Optimizer & scheduler

In [None]:
# Set an optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8
                )

# Set the number of epochs
epochs = 4

total_steps = len(train_dataloader) * epochs
print(total_steps)

# Create a scheduler that linearly decays the learning rate over training (no warmup steps here)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)



16876
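
To make the schedule concrete, here is a sketch of the multiplier that a linear warmup-then-decay schedule applies to the base learning rate. It is written under the assumption that it mirrors the behaviour of `get_linear_schedule_with_warmup`; the function name `linear_schedule_factor` is invented for this example. The step count 16876 is the value printed above (4,219 batches times 4 epochs).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Afterwards decay linearly from 1 to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
example_total_steps = 16876

for step in (0, 4219, 8438, 16876):
    lr = base_lr * linear_schedule_factor(step, 0, example_total_steps)
    print(step, lr)
```

With zero warmup steps the learning rate starts at the full 2e-5, reaches half of that midway through training, and ends at zero.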


## Metric: Accuracy

Accuracy is a commonly-used metric to evaluate the performance of a classification model. Accuracy measures how many of the predictions made by the model are correct compared to the total number of predictions.

In [None]:
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    accuracy = np.sum(pred_flat == labels_flat) / len(labels_flat)

    return accuracy
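
The same computation can be followed by hand on a few made-up logits. The sketch below uses plain Python lists instead of NumPy, and the names `toy_logits` and `toy_labels` are illustrative only. It also compares the overall accuracy with the mean of per-batch accuracies, which is how this notebook reports its scores; the two can differ slightly when the last batch is smaller than the others.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda j: row[j])


def accuracy(preds, labels):
    """Share of predictions that equal their labels."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


toy_logits = [[-0.7, 0.7], [1.2, -0.3], [0.1, 0.4], [0.9, 0.2], [-0.5, 0.6]]
toy_labels = [1, 0, 0, 0, 0]
toy_preds = [argmax(row) for row in toy_logits]

overall = accuracy(toy_preds, toy_labels)

# Mean of per-batch accuracies with batch size 2 (the last batch has one example).
per_batch = [accuracy(toy_preds[i:i + 2], toy_labels[i:i + 2])
             for i in range(0, len(toy_labels), 2)]
batch_mean = sum(per_batch) / len(per_batch)

print(toy_preds)    # [1, 0, 1, 0, 1]
print(overall)      # 0.6
print(batch_mean)   # 0.5
```
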

In [None]:
# Function that shows time
def format_time(elapsed):

    # Round
    elapsed_rounded = int(round((elapsed)))

    # Convert into the format hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

## Training
Now, we're going to train the model. Each epoch consists of a training pass followed by a validation pass. With PyTorch, the forward and backward operations take only a few lines each, as the loop below shows.
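
One step of the training loop, `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`, rescales the gradients when their overall norm exceeds 1.0. The following sketch shows the idea on a flat list of numbers. The helper `clip_by_global_norm` is hypothetical, and the small constant in the denominator is an assumption modelled on how such clipping is commonly implemented.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The factor is capped at 1.0, so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


large_grads = [3.0, 4.0]   # L2 norm 5.0, above the limit
small_grads = [0.3, 0.4]   # L2 norm 0.5, below the limit

clipped_large, norm_large = clip_by_global_norm(large_grads, max_norm=1.0)
clipped_small, norm_small = clip_by_global_norm(small_grads, max_norm=1.0)

print(norm_large, [round(g, 3) for g in clipped_large])
print(norm_small, clipped_small)
```

The direction of the gradient is preserved; only its length is reduced.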

In [19]:
# Fix a random seed for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Initialize gradient
model.zero_grad()

# Repeat for the number of epochs
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Set the start time
    t0 = time.time()

    # Initialize loss
    total_loss = 0

    # Set the model to train mode
    model.train()

    # For each batch retrieved from a data loader
    for step, batch in enumerate(train_dataloader):
        # Show the information of every 500 iterations
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Move a current batch to GPU
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

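        # Forward pass; passing labels makes the model return the loss first
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)

        # Accumulate the loss for the epoch average
        loss = outputs[0]
        total_loss += loss.item()

        # Backward pass to compute the gradients
        loss.backward()

        # Clip the gradient norm to 1.0 to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the parameters, then the learning rate
        optimizer.step()
        scheduler.step()

        # Reset the gradients before the next batch
        model.zero_grad()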


    avg_train_loss = total_loss / len(train_dataloader)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))

    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    # Set the initial time
    t0 = time.time()

    # Change model to eval mode
    model.eval()

    # Initialize variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # For each batch retrieved from a data loader
    for batch in validation_dataloader:
        # Put the batch into GPU
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        # No gradient computation
        with torch.no_grad():
            # Forward propagation
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)

        # Get logits
        logits = outputs[0]

        # Move data to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Compute accuracy by using output logits and labels
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...
  Batch   500  of  4,219.    Elapsed: 0:05:12.
  Batch 1,000  of  4,219.    Elapsed: 0:10:24.
  Batch 1,500  of  4,219.    Elapsed: 0:15:35.
  Batch 2,000  of  4,219.    Elapsed: 0:20:46.
  Batch 2,500  of  4,219.    Elapsed: 0:25:58.
  Batch 3,000  of  4,219.    Elapsed: 0:31:09.
  Batch 3,500  of  4,219.    Elapsed: 0:36:20.
  Batch 4,000  of  4,219.    Elapsed: 0:41:31.

  Average training loss: 0.38
  Training epoch took: 0:43:48

Running Validation...
  Accuracy: 0.86
  Validation took: 0:01:32

Training...
  Batch   500  of  4,219.    Elapsed: 0:05:11.
  Batch 1,000  of  4,219.    Elapsed: 0:10:22.
  Batch 1,500  of  4,219.    Elapsed: 0:15:34.
  Batch 2,000  of  4,219.    Elapsed: 0:20:45.
  Batch 2,500  of  4,219.    Elapsed: 0:25:57.
  Batch 3,000  of  4,219.    Elapsed: 0:31:08.
  Batch 3,500  of  4,219.    Elapsed: 0:36:19.
  Batch 4,000  of  4,219.    Elapsed: 0:41:30.

  Average training loss: 0.28
  Training epoch took: 0:43:47

Running Validation...
  Accura

## Model Evaluation

In [20]:
# Set initial time
t0 = time.time()

# Change to eval mode
model.eval()

# Initialize variables
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

# For each batch from the data loader
for step, batch in enumerate(test_dataloader):
    # Show the information for every 100 iterations
    if step % 100 == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataloader), elapsed))

    # Put the batch into GPU
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    logits = outputs[0]

    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()


    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("")
print("Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
print("Test took: {:}".format(format_time(time.time() - t0)))

  Batch   100  of  1,563.    Elapsed: 0:00:20.
  Batch   200  of  1,563.    Elapsed: 0:00:39.
  Batch   300  of  1,563.    Elapsed: 0:00:59.
  Batch   400  of  1,563.    Elapsed: 0:01:18.
  Batch   500  of  1,563.    Elapsed: 0:01:38.
  Batch   600  of  1,563.    Elapsed: 0:01:58.
  Batch   700  of  1,563.    Elapsed: 0:02:17.
  Batch   800  of  1,563.    Elapsed: 0:02:37.
  Batch   900  of  1,563.    Elapsed: 0:02:56.
  Batch 1,000  of  1,563.    Elapsed: 0:03:16.
  Batch 1,100  of  1,563.    Elapsed: 0:03:35.
  Batch 1,200  of  1,563.    Elapsed: 0:03:55.
  Batch 1,300  of  1,563.    Elapsed: 0:04:15.
  Batch 1,400  of  1,563.    Elapsed: 0:04:34.
  Batch 1,500  of  1,563.    Elapsed: 0:04:54.

Accuracy: 0.87
Test took: 0:05:06


## Let's feed new sentences

In [21]:
def convert_input_data(sentence):
    # Wrap the single input sentence with BERT's special tokens
    sentences = ["[CLS] " + str(sentence) + " [SEP]"]

    # Split into WordPiece tokens
    tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

    # Use the same maximum length as in training
    MAX_LEN = 128

    # Map the tokens to vocabulary ids
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

    # Truncate or pad to MAX_LEN
    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

    # Initialize attention masks
    attention_masks = []

    # Mark real tokens with 1.0 and padding (id 0) with 0.0
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)

    # Convert data to PyTorch tensors
    inputs = torch.tensor(input_ids)
    masks = torch.tensor(attention_masks)

    return inputs, masks

# Predict the sentiment of a single sentence
# (named predict_sentiment so that it does not overwrite the test_sentences list defined earlier)
def predict_sentiment(sentence):

    # Change to eval mode
    model.eval()

    # Convert the sentence to the input of BERT
    inputs, masks = convert_input_data(sentence)

    # Move data to the device
    b_input_ids = inputs.to(device)
    b_input_mask = masks.to(device)

    # No gradient computation
    with torch.no_grad():
        # Forward propagation
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    # Get logits
    logits = outputs[0]

    # Move data to CPU
    logits = logits.detach().cpu().numpy()

    return logits


In [22]:
# Enter your review below to test your trained model
logits = predict_sentiment('Enter your review here')

print(logits)
print(np.argmax(logits))

[[-0.75916314  0.7327059 ]]
1
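
The printed logits are unnormalized scores. To read them as class probabilities, a softmax can be applied, as in the sketch below. It uses the logits printed above and a hypothetical `softmax` helper written with the standard library, so it runs without the model.

```python
import math


def softmax(values):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(values)                           # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-0.75916314, 0.7327059]        # values printed above
example_probs = softmax(example_logits)
example_label = max(range(len(example_probs)), key=lambda j: example_probs[j])

print([round(p, 3) for p in example_probs])
print(example_label)  # 1
```

Here class 1 (positive) receives a probability of roughly 0.82, which matches the `argmax` result of 1.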
