# **Practice: BERT** (Bidirectional Encoder Representations from Transformers)

Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." [(paper link)](https://arxiv.org/abs/1810.04805).

BERT is one of the best-known pre-trained language models, released by Google in 2018. A pre-trained BERT can be adapted to many downstream tasks through a process called fine-tuning: the model is trained further on a task-specific dataset, which readjusts the pre-trained parameters.

In this practice, we focus on how to use BERT for a task of our choice, so we load a pre-trained BERT model from Hugging Face instead of building one. Implementing the BERT model yourself is complicated, but it helps a lot in understanding the Transformer in depth. If you're curious about the detailed code of the model, check out this [link](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py).

Now, let's practice fine-tuning BERT to classify Naver movie reviews!

**Note:** To ensure a smooth workflow, please run all the cells in sequential order. This way, dependencies and intermediate variables will correctly propagate from one cell to the next.

## Device

You will likely need a GPU for this Colab.

Click `Runtime`, then `Change runtime type`, and set the `Hardware accelerator` to GPU.

## Installation

In [1]:
# Get transformers made by HuggingFace
!pip install transformers
!pip install tensorflow
!pip install torch
!pip install pandas

Collecting transformers
  Using cached transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Collecting filelock (from transformers)
  Using cached filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Using cached huggingface_hub-0.26.3-py3-none-any.whl.metadata (13 kB)
Collecting numpy>=1.17 (from transformers)
  Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting pyyaml>=5.1 (from transformers)
  Using cached PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2024.11.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting requests (from transformers)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Using cached tokenizers-0.20.3-cp39-cp39-manylinux_2_17_x86_64.

In [1]:
import tensorflow as tf
import torch

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import random
import time
import datetime


#https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

2024-12-04 14:02:33.939000: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-12-04 14:02:33.965047: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1733288553.983996 1950407 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1733288553.988234 1950407 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-04 14:02:34.002700: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

In [4]:
# Download Naver movie reviews and sentiment analysis data
!git clone https://github.com/e9t/nsmc.git

Cloning into 'nsmc'...
remote: Enumerating objects: 14763, done.
remote: Counting objects: 100% (14762/14762), done.
remote: Compressing objects: 100% (13012/13012), done.
remote: Total 14763 (delta 1748), reused 14762 (delta 1748), pack-reused 1 (from 1)
Receiving objects: 100% (14763/14763), 56.19 MiB | 18.30 MiB/s, done.
Resolving deltas: 100% (1748/1748), done.
Updating files: 100% (14737/14737), done.


In [2]:
# List files in a directory
!ls nsmc -la

total 38616
drwxrwxr-x 5 student student     4096 12월  4 13:14 .
drwxrwxr-x 3 student student     4096 12월  4 13:14 ..
drwxrwxr-x 2 student student     4096 12월  4 13:14 code
drwxrwxr-x 8 student student     4096 12월  4 13:14 .git
-rw-rw-r-- 1 student student  4893335 12월  4 13:14 ratings_test.txt
-rw-rw-r-- 1 student student 14628807 12월  4 13:14 ratings_train.txt
-rw-rw-r-- 1 student student 19515078 12월  4 13:14 ratings.txt
drwxrwxr-x 2 student student   438272 12월  4 13:14 raw
-rw-rw-r-- 1 student student     2596 12월  4 13:14 README.md
-rw-rw-r-- 1 student student    36746 12월  4 13:14 synopses.json


## Prepare Model's Input

### Load Data

In this section, we will examine the structure of the Naver movie review data.



### Question 1: What is the shape of data?

In [3]:
# Load training and test data by using Pandas
train = pd.read_csv("nsmc/ratings_train.txt", sep='\t')
test = pd.read_csv("nsmc/ratings_test.txt", sep='\t')

def get_shape(dataset):
  #TODO: Implement this function, which takes a dataset object
  #and returns the shape of the dataset.

  num_row = 0
  num_col = 0

  ############ Your code here #############
  ## (~2 line of code)
  num_row, num_col = len(dataset), len(dataset.columns)

  #########################################

  return num_row, num_col

#Print shapes of train and test data
train_num_row, train_num_col = get_shape(train)
test_num_row, test_num_col = get_shape(test)
print("Train dataset has {} rows and {} columns".format(train_num_row, train_num_col))
print("Test dataset has {} rows and {} columns".format(test_num_row, test_num_col))

Train dataset has 150000 rows and 3 columns
Test dataset has 50000 rows and 3 columns


In [4]:
# Print the first 10 lines of the training set
train.head(10)

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
5,5403919,막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.,0
6,7797314,원작의 긴장감을 제대로 살려내지못했다.,0
7,9443947,별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단...,0
8,7156791,액션이 없는데도 재미 있는 몇안되는 영화,1
9,5912145,왜케 평점이 낮은건데? 꽤 볼만한데.. 헐리우드식 화려함에만 너무 길들여져 있나?,1


The Naver movie review dataset has three columns: id, document, and label.

*   "id" is the identifier of the review.
*   "document" contains the text of the review.
*   "label" is the sentiment category: "0" marks a negative review and "1" marks a positive review.

After loading with pandas, `train` and `test` are DataFrames with the columns "id", "document", and "label".

### Preprocessing

In this section, we preprocess the data to build the input for BERT. Each input sentence for BERT should start with the special token [CLS] and end with the special token [SEP].

Extract the review sentences from the training and test sets and convert them into the input format for BERT.




In [5]:
# TODO:  Extract review sentences and labels from the training and test dataset

############ Your code here #############
##(~4 line of code)
train_sentences = train.document
train_labels = np.array(train.label)
test_sentences = test.document
test_labels = np.array(test.label)



#########################################

print(train_sentences[:5])
print(train_labels[:5])
print(test_sentences[:5])
print(test_labels[:5])

#TODO: Convert the sentences into the input format for BERT (add the [CLS] and [SEP] tokens)

############ Your code here #############
##(~2 line of code)
train_sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in train_sentences]
test_sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in test_sentences]

#########################################

print(train_sentences[:5])
print(test_sentences[:5])


0                                  아 더빙.. 진짜 짜증나네요 목소리
1                    흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
2                                    너무재밓었다그래서보는것을추천한다
3                        교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
4    사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...
Name: document, dtype: object
[0 1 0 0 1]
0                                                  굳 ㅋ
1                                 GDNTOPCLASSINTHECLUB
2               뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아
3                     지루하지는 않은데 완전 막장임... 돈주고 보기에는....
4    3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??
Name: document, dtype: object
[1 0 0 0 0]
['[CLS] 아 더빙.. 진짜 짜증나네요 목소리 [SEP]', '[CLS] 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 [SEP]', '[CLS] 너무재밓었다그래서보는것을추천한다 [SEP]', '[CLS] 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 [SEP]', '[CLS] 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다 [SEP]']
['[CLS] 굳 ㅋ [SEP]', '[CLS] GDNTOPCLASSINTHECLUB [SEP]', '[CLS] 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아 [SEP]', '[CLS] 지루하지는 않은데

### Tokenizing

Tokenize the preprocessed sentences using the BERT tokenizer.

In [6]:
# TODO
# 1. Load BERT tokenizer (use 'bert-base-multilingual-cased' model and set do_lower_case=False)
# Please refer to the tutorial below for Huggingface tokenizers:
# https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt
# 2. Tokenize the sentences by using the loaded BERT tokenizer
# 3. Put the tokenized sentences into a list.

train_tokenized_texts = []
test_tokenized_texts = []
############ Your code here #############
##(~3 line of code)
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)
train_tokenized_texts = [tokenizer.tokenize(sentence) for sentence in train_sentences]
test_tokenized_texts = [tokenizer.tokenize(sentence) for sentence in test_sentences]


#########################################

print (train_sentences[0])
print (train_tokenized_texts[0])
print (test_sentences[0])
print (test_tokenized_texts[0])

[CLS] 아 더빙.. 진짜 짜증나네요 목소리 [SEP]
['[CLS]', '아', '더', '##빙', '.', '.', '진', '##짜', '짜', '##증', '##나', '##네', '##요', '목', '##소', '##리', '[SEP]']
[CLS] 굳 ㅋ [SEP]
['[CLS]', '굳', '[UNK]', '[SEP]']
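
The `##` prefix in the output above marks a subword piece that continues the previous piece, and `[UNK]` marks a word the vocabulary cannot cover. The sketch below illustrates the greedy longest-match-first idea behind WordPiece tokenization on a single word. It is a simplified illustration, not the Hugging Face implementation: the function name `wordpiece_split` and the tiny `toy_vocab` are made up for this example.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real multilingual one
toy_vocab = {"더", "##빙", "play", "##ing"}

print(wordpiece_split("더빙", toy_vocab))     # ['더', '##빙']
print(wordpiece_split("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_split("ㅋ", toy_vocab))       # ['[UNK]']
```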


### Padding

In natural language processing, we convert each sentence into a list of token ids, and the model processes a batch of several sentences in each iteration. Sentences of different lengths cannot be stacked directly into one matrix. We therefore pad the sentences so that they all have the same length, and then combine them into a single matrix.



*   If a sequence is longer than the maximum length specified by the user, truncate it to the maximum length.
*   If a sequence is shorter than the maximum length, append padding at the end ("post-padding") until it reaches the maximum length. (Putting padding at the beginning of a sentence is called "pre-padding".)
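
The two rules above can be sketched in plain Python. This is only an illustration of post-truncation and post-padding. The helper name `pad_or_truncate` and the default `pad_id=0` are assumptions made for this example, and the exercise itself uses `pad_sequences`.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a copy of `ids` with exactly `max_len` entries."""
    ids = list(ids)[:max_len]                      # post-truncation: drop the tail
    return ids + [pad_id] * (max_len - len(ids))   # post-padding: fill the tail


batch = [[101, 9519, 102], [101, 8911, 100, 9074, 119, 102]]
padded = [pad_or_truncate(ids, max_len=4) for ids in batch]
print(padded)  # [[101, 9519, 102, 0], [101, 8911, 100, 9074]]
```

Note that post-truncation can cut off the closing [SEP] token of a long sentence, as happens to the second toy sequence here.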



In [7]:
# Maximum length of the input sequence
MAX_LEN = 128


############ Your code here #############
# TODO
# 1. Convert the tokens into token ids,
# which are integer indices of tokens in their look-up table.
# Use 'tokenizer.convert_tokens_to_ids'
# (input: tokenized_texts, output: integer indices)
# 2. If |sequence| > MAX_LEN, truncate each sentence up to the MAX_LEN
# 3. If |sequence| < MAX_LEN, then put paddings at the end of the sentence
# Function 'pad_sequences(input, maxlen, truncating, padding)' can be useful
# to generate a new sequence with length of MAX_LEN
# Hint: Argument input takes integer indices. Set truncating="post", padding="post".
## (~4 line of code)
train_input_ids = [tokenizer.convert_tokens_to_ids(token) for token in train_tokenized_texts]
test_input_ids = [tokenizer.convert_tokens_to_ids(token) for token in test_tokenized_texts]
train_input_ids = pad_sequences(train_input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
test_input_ids = pad_sequences(test_input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")




#########################################


print(train_input_ids[0])
print(test_input_ids[0])

[   101   9519   9074 119005    119    119   9708 119235   9715 119230
  16439  77884  48549   9284  22333  12692    102      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0]
[ 101 8911  100  102    0    0    0    0    0    0    0    0    0    0
    0    0    0    

### Attention Mask

The attention mask distinguishes real tokens from padding tokens during BERT's attention operations, so that no attention is directed towards padding positions.
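
As a minimal sketch of the idea, the mask can be derived from a padded id sequence by marking every non-padding position with 1. The helper name `build_attention_mask` is made up for this example, and it assumes that the padding id is 0, which matches the padded outputs shown earlier.

```python
PAD_ID = 0  # assumption: padding positions hold id 0


def build_attention_mask(padded_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if token_id == pad_id else 1 for token_id in padded_ids]


padded_ids = [101, 8911, 100, 102, 0, 0, 0, 0]
mask = build_attention_mask(padded_ids)
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```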

In [8]:
# Initialize attention masks
train_attention_masks = []
test_attention_masks = []
#TODO: Set attention mask to 0 if a corresponding position is a padding, 1 otherwise.

############ Your code here #############
## (~6 lines of code)
train_attention_masks = [[0.0 if id == 0 else 1.0 for id in ids] for ids in train_input_ids]
test_attention_masks = [[0.0 if id == 0 else 1.0 for id in ids] for ids in test_input_ids]

########################################

print(train_attention_masks[0])
print(test_attention_masks[0])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

### Data Split

To prepare a validation dataset, split the training data into a training set and a validation set. Split the attention masks into training masks and validation masks in the same way, so that each mask stays paired with its input.
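
The reason the inputs and the masks must be split with the same seed can be shown with a small plain-Python sketch. This is not how scikit-learn implements `train_test_split`. The helper `split_in_parallel` is a hypothetical stand-in that only illustrates that the same seed produces the same shuffle, which keeps rows aligned.

```python
import random


def split_in_parallel(items, test_size, seed):
    """Shuffle indices with a seeded RNG and hold out a `test_size` fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(items) * test_size))
    n_train = len(items) - n_val
    train = [items[i] for i in indices[:n_train]]
    val = [items[i] for i in indices[n_train:]]
    return train, val


inputs = [f"ids_{i}" for i in range(10)]
masks = [f"mask_{i}" for i in range(10)]

# Same seed for both calls, so both lists are shuffled identically
train_in, val_in = split_in_parallel(inputs, test_size=0.1, seed=0)
train_mk, val_mk = split_in_parallel(masks, test_size=0.1, seed=0)

aligned = all(
    a.split("_")[1] == b.split("_")[1]
    for a, b in zip(train_in + val_in, train_mk + val_mk)
)
print(aligned)  # True
```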

In [9]:
# TODO
# 1. Split train data into training set and validation set.
# 2. Split attention masks into training masks and validation masks.
# 3. Convert the output of Steps 1 and 2, test inputs, labels, and masks to pytorch tensors.

############ Your code here #############
## (~11 lines of code)
## Note:
## Use sklearn's 'train_test_split()' function to split data and attention masks.
## To ensure consistency between data and attention masks, the random_state should be the same.
## (Also, using the same random_state helps maintain reproducibility in the data splitting process)
## Set test_size = 0.1

test_size = 0.1
random_state=0
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(train_input_ids, train_labels, test_size=test_size, random_state=random_state)
train_masks, validation_masks = train_test_split(train_attention_masks, test_size=test_size, random_state=random_state)
test_inputs, test_masks = test_input_ids, test_attention_masks

train_inputs = torch.tensor(train_inputs)
train_masks = torch.tensor(train_masks)
train_labels = torch.tensor(train_labels, dtype=torch.int64)
validation_inputs = torch.tensor(validation_inputs)
validation_masks = torch.tensor(validation_masks)
validation_labels = torch.tensor(validation_labels, dtype=torch.int64)
test_inputs = torch.tensor(test_inputs)
test_masks = torch.tensor(test_masks)
test_labels = torch.tensor(test_labels, dtype=torch.int64)

########################################


print(train_inputs[0])
print(train_labels[0])
print(train_masks[0])
print(validation_inputs[0])
print(validation_labels[0])
print(validation_masks[0])
print(test_inputs[0])
print(test_labels[0])
print(test_masks[0])

tensor([   101,  58466,    119,    119,    119,   9992, 119312,  42428,  11018,
          9638,  17730,  48556,  30858,  18227,  80001,    100,    102,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0, 

### Creating DataLoader, and Making Mini-Batch

Now we create the final input for BERT. We bundle the input tensors (token ids, attention masks, and labels) into a single dataset and retrieve the data in mini-batches of `batch_size` during training.
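
To make the idea of mini-batches concrete, here is a plain-Python sketch that walks over a dataset in chunks. It only mimics what a sequential loader does conceptually. The generator name `iterate_minibatches` and the toy data are made up for this example.

```python
def iterate_minibatches(dataset, batch_size):
    """Yield consecutive slices of `dataset` with at most `batch_size` items."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


toy_dataset = list(zip(range(10), "abcdefghij"))  # 10 toy (input, label) pairs
batches = list(iterate_minibatches(toy_dataset, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.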

In [10]:
# Set the batch size
batch_size = 128
# TODO: Set the Pytorch DataLoader with input, attention masks, labels

############ Your code here #############
## Note:
## 1. Create a TensorDataset object to combine train_inputs, train_masks and train_labels.
## Create a TensorDataset object to combine test_inputs, test_masks and test_labels.
## Create a TensorDataset object to combine validation_inputs, validation_masks and validation_labels.
## (using TensorDataset())
## 2. Create a RandomSampler object by using RandomSampler() for training and test sets.
## 3. Create a SequentialSampler object by using SequentialSampler() for validation set.
## 4. Create train dataloader, validation dataloader and test dataloader
## (use 'DataLoader()' and set its argument "sampler" to the samplers created above.)
## (~9 lines of code)
train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
validation_dataset = TensorDataset(validation_inputs, validation_masks, validation_labels)
test_dataset = TensorDataset(test_inputs, test_masks, test_labels)
train_sampler = RandomSampler(train_dataset)
validation_sampler = SequentialSampler(validation_dataset)
test_sampler = RandomSampler(test_dataset)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size, sampler=validation_sampler)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, sampler=test_sampler)


########################################

## GPU setup

In [None]:
############# Only needed when it is running at server #############
# import os
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "3"

In [12]:
# Get the device name
device_name = tf.test.gpu_device_name()

# Inspect if the device is GPU
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


I0000 00:00:1733288596.531351 1950407 gpu_device.cc:2022] Created device /device:GPU:0 with 22339 MB memory:  -> device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:68:00.0, compute capability: 8.6


In [13]:
# Set the device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: NVIDIA RTX A5000


## Create Model

In [14]:
# Create a BERT model for classification
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=2)
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

## Optimizer & scheduler

### Question 2: What is the number of total steps for training?

---



The number of total steps = (the number of batches) $\times$ (the number of epochs)
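
As a worked example of this formula, the arithmetic below assumes the settings used in this notebook: 150,000 training rows with 10% held out for validation, a batch size of 128, and 4 epochs. The number of batches is rounded up because the last, smaller batch still counts as a step.

```python
import math

num_train_examples = 135_000  # 150,000 rows minus the 10% validation split
batch_size = 128
epochs = 4

batches_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = batches_per_epoch * epochs
print(batches_per_epoch, total_steps)  # 1055 4220
```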


In [15]:
# Set an optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # learning rate
                  eps = 1e-8 # epsilon value that prevents division by zero
                )

# Set the number of epochs
epochs = 4

############ Your code here #############
## Note:
## (~ 1 line of code)
## Total steps = the number of batches in a dataset * number of epochs
total_steps = len(train_dataloader) * epochs

#########################################
print(total_steps)

# Create a scheduler that linearly decays the learning rate (no warmup steps here)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)



4220
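
To see what a linear schedule with warmup does to the learning rate, the sketch below computes the multiplier applied to the base learning rate at a given step. It reflects the usual linear warmup and linear decay rule and is not copied from the library. The function name `linear_schedule_factor` is made up for this example.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total = 4220  # total steps computed above
for step in (0, 2110, 4220):
    print(step, base_lr * linear_schedule_factor(step, 0, total))
```

With zero warmup steps, the learning rate starts at its full value, is halved at the midpoint, and reaches zero at the last step.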


## Metric: Accuracy

Accuracy is a commonly-used metric to evaluate the performance of a classification model. Accuracy measures how many of the predictions made by the model are correct compared to the total number of predictions.
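
The definition can be illustrated on plain Python lists before writing the tensor version. In this sketch the predicted class is the index of the largest logit in each row. The helper name `accuracy_from_logits` and the toy values are made up for this example.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the labeled class index."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.7]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```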

In [16]:
# TODO: Define function that computes an accuracy
# Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
# preds: [batch size, number of classes (i.e., 2)], labels: [batch size]
def flat_accuracy(preds, labels):
############ Your code here #############
##(~ 3 lines of code)
    accuracy = ((preds.argmax(dim=1).flatten() == labels.flatten()).sum() / len(labels)).item()

#########################################
    return accuracy

In [17]:
# Function that shows time
def format_time(elapsed):

    # Round
    elapsed_rounded = int(round((elapsed)))

    # Convert into the format hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

## Training
Now, we're going to train the model. An epoch loop consists of training and validation processes. With PyTorch, you can simply implement forward and backward operations. Now, let's fill in the code below!

### Question 3: What is the average of training loss for each epoch?

In [19]:
# Fix a random seed for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Initialize gradient
model.zero_grad()

# Repeat for the number of epochs
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Set the start time
    t0 = time.time()

    # Initialize loss
    total_loss = 0

    # Set the model to train mode
    model.train()

    # For each batch retrieved from a data loader
    for step, batch in enumerate(train_dataloader):
        # Show the information of every 500 iterations
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Move a current batch to GPU
        batch = tuple(t.to(device) for t in batch)
        ################ Your code here ################################
        ## Note: (~ 10 lines of code)
        ## 1. Extract data(input ids, mask, labels) from the batch.
        ## 2. Forward propagation. Give the model input consisting of ids, mask and labels.
        ## 3. Get loss. The model's outputs contain the loss,
        ##    so you don't need to calculate the loss again.
        ##    Just simply get the loss from the output.
        ## 4. Compute the total loss
        ## 5. Do back-propagation, i.e., loss.backward()
        ## 6. Gradient clipping. Use torch.nn.utils.clip_grad_norm_(), set max_norm=1.0
        ## 7. Update parameters by using the gradients, i.e., optimizer.step()
        ## 8. Decrease the learning rate with scheduler, i.e., scheduler.step()
        ## 9. Initialize gradient
        ###############################################################
        input_ids, mask, labels = batch
        outputs = model(input_ids=input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()


    ################ Your code here ################################
    ## 10. Compute the average loss for each epoch
    ##(~1 line of code)
    avg_train_loss = total_loss / len(train_dataloader)
    #########################################################

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(format_time(time.time() - t0)))

    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    # Set the initial time
    t0 = time.time()

    # Change model to eval mode
    model.eval()

    # Initialize variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # For each batch retrieved from a data loader
    for batch in validation_dataloader:
        # Put the batch into GPU
        batch = tuple(t.to(device) for t in batch)
        ########### Your code here ##############################
        ## Note: (~9 lines of code)
        ## 1. Extract data(input ids, mask, labels) from the batch
        ## 2. In validation step, you don't need to compute gradients.
        ##    Wrap the forward operation with the 'with torch.no_grad():' statement.
        ## 3. Get "logits"
        ## 4. Move logits and labels to CPU. (use '.cpu()' or '.to('cpu')')
        ## 5. Compute accuracy by using output logits and labels
        ## (use flat_accuracy function that we defined before.)
        # Unpack the batch
        input_ids, mask, labels = batch
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=mask, labels=labels)
            logits = outputs.logits
            logits = logits.cpu()
            labels = labels.cpu()
            eval_accuracy += flat_accuracy(logits, labels)
            nb_eval_steps += 1

    ##########################################################
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...
  Batch   500  of  1,055.    Elapsed: 0:05:53.
  Batch 1,000  of  1,055.    Elapsed: 0:11:49.

  Average training loss: 0.40
  Training epcoh took: 0:12:28

Running Validation...
  Accuracy: 0.85
  Validation took: 0:00:30

Training...
  Batch   500  of  1,055.    Elapsed: 0:05:56.
  Batch 1,000  of  1,055.    Elapsed: 0:11:52.

  Average training loss: 0.30
  Training epcoh took: 0:12:31

Running Validation...
  Accuracy: 0.86
  Validation took: 0:00:30

Training...
  Batch   500  of  1,055.    Elapsed: 0:05:56.
  Batch 1,000  of  1,055.    Elapsed: 0:11:53.

  Average training loss: 0.25
  Training epcoh took: 0:12:32

Running Validation...
  Accuracy: 0.87
  Validation took: 0:00:30

Training...
  Batch   500  of  1,055.    Elapsed: 0:05:56.
  Batch 1,000  of  1,055.    Elapsed: 0:11:53.

  Average training loss: 0.22
  Training epcoh took: 0:12:32

Running Validation...
  Accuracy: 0.87
  Validation took: 0:00:30

Training complete!


## Model Evaluation

### Question 4: What is the test accuracy of our model?

In [20]:
# Set initial time
t0 = time.time()

# Change to eval mode
model.eval()

# Initialize variables
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

# For each batch from the data loader
for step, batch in enumerate(test_dataloader):
    # Show the information for every 100 iterations
    if step % 100 == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataloader), elapsed))

    # Put the batch into GPU
    batch = tuple(t.to(device) for t in batch)
    ############# Your code here ##############################
    ## Note: (~9 lines of code)
    ## 1. Extract data(input ids, mask, labels) from the batch
    ## 2. In evaluation step, you don't need to compute gradients.
    ##    Wrap the forward operation with the 'with torch.no_grad():' statement.
    ## 3. Get "logits"
    ## 4. Move logits and labels to CPU. (use '.cpu()' or '.to('cpu')')
    ## 5. Compute accuracy by using output logits and labels
    ## (use flat_accuracy function that we defined before.)
    input_ids, mask, labels = batch
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=mask, labels=labels)
        logits = outputs.logits
        logits = logits.cpu()
        labels = labels.cpu()
        eval_accuracy += flat_accuracy(logits, labels)
        nb_eval_steps += 1


    ##########################################################
print("")
print("Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
print("Test took: {:}".format(format_time(time.time() - t0)))

  Batch   100  of    391.    Elapsed: 0:00:23.
  Batch   200  of    391.    Elapsed: 0:00:48.
  Batch   300  of    391.    Elapsed: 0:01:13.

Accuracy: 0.87
Test took: 0:01:35


## Can your model correctly categorize new reviews? Try feeding it your own sentences!

In [66]:
#TODO:
# Write a function that converts sentences into the input data format.
# This function performs preprocessing, tokenization, and padding,
# and creates attention masks. (just the same as what we did earlier.)
# Set the maximum length = 128

def convert_input_data(sentences):
    ######### Your code here ##################
    #(~9 lines of code)
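    # Accept either a single review string or a list of review strings
    if isinstance(sentences, str):
        sentences = [sentences]
    sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in sentences]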
    tokenized_texts = [tokenizer.tokenize(sentence) for sentence in sentences]
    MAX_LEN = 128
    input_ids = [tokenizer.convert_tokens_to_ids(token) for token in tokenized_texts]
    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
    attention_masks = [[0 if id == 0 else 1 for id in ids] for ids in input_ids]

    #################################################
    # Convert data to pytorch tensors
    inputs = torch.tensor(input_ids)
    masks = torch.tensor(attention_masks)

    return inputs, masks

# Test sentences
def test_sentences(sentences):

    # Change to eval mode
    model.eval()

    # Convert sentences to the input of BERT
    inputs, masks = convert_input_data(sentences)

    # Move data into GPU
    b_input_ids = inputs.to(device)
    b_input_mask = masks.to(device)

    # No gradient computation
    with torch.no_grad():
        # Forward propagation
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    # Get logits
    logits = outputs[0]

    # Move data to CPU
    logits = logits.detach().cpu().numpy()

    return logits


In [70]:
# Enter your review below to test your trained model
logits = test_sentences('최고였어요! 꼭 다시 보고 싶어요!')

print(logits)
print(np.argmax(logits))

[[-2.2724063  2.4469035]]
1
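
The raw logits are unnormalized scores. If you want to read them as class probabilities, you can apply a softmax. The sketch below does this in plain Python for the two logits printed above. The `softmax` helper is written for this example and is not part of the notebook.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-2.2724063, 2.4469035]  # values printed by the cell above
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))
print(predicted_class, [round(p, 3) for p in probs])  # 1 [0.009, 0.991]
```

Here class 1 (positive) receives almost all of the probability mass, which agrees with the `np.argmax` result above.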
