# BERT tutorial using Hugging Face
## Tutorial goals
Use the Hugging Face libraries to quickly fine-tune a BERT model on downstream tasks:
- Single-sentence text classification
- Question answering

## Introduction to Hugging Face
- 🤗 Hugging Face provides libraries dedicated to natural language processing
- Its libraries support both PyTorch and TensorFlow
- The main 🤗 Hugging Face packages are:
    1. Transformers ([link](https://huggingface.co/transformers/index.html))
        - Provides many of today's strongest NLP models behind a flexible, convenient API
    2. Tokenizers ([link](https://huggingface.co/docs/tokenizers/python/latest/))
        - Fast tokenization for BERT-family models
    3. Datasets ([link](https://huggingface.co/docs/datasets/))
        - Provides datasets for a wide range of NLP tasks

In [1]:
# !pip install transformers

In [12]:
import torch
print("PyTorch version: {}".format(torch.__version__))

import transformers
print("Hugging Face Transformers version: {}".format(transformers.__version__))

import datasets
print("Hugging Face Datasets version: {}".format(datasets.__version__))

PyTorch version: 1.7.1
Hugging Face Transformers version: 4.5.1
Hugging Face Datasets version: 1.6.1


In [None]:
import os
import json
from pathlib import Path

# Single-sentence text classification
## Preparing the dataset (download required)
We use the IMDb reviews dataset as the example.

In [2]:
# !pip install wget
import wget
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filename = wget.download(url, out='./')



In [3]:
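# tarfile is part of the Python standard library, so no pip install is needed
import tarfile

# Open the gzip-compressed tar archive and extract it into the current directory
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()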

## Next, we preprocess the data
First, look at the directory structure after extraction:
```
aclImdb---
        |--train
        |    |--neg
        |    |--pos
        |    |--...
        |--test
        |    |--neg
        |    |--pos
        |    |--...
        |--imdb.vocab
        |--imdbEr.txt
        |--README
```
The train and test folders each contain a neg and a pos folder.

These two label folders are the ones we process.
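
Before reading the reviews, it can help to confirm that the extracted layout matches the tree above. The sketch below is only an illustration: `count_reviews` is a hypothetical helper, and it runs against a tiny temporary directory that mimics `aclImdb` so that it works without the download.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_reviews(root):
    """Count the .txt files under <root>/<split>/<label> for the labelled folders."""
    counts = Counter()
    for split in ("train", "test"):
        for label in ("neg", "pos"):
            folder = Path(root) / split / label
            counts[(split, label)] = sum(1 for _ in folder.glob("*.txt"))
    return counts

# Build a tiny stand-in for aclImdb (one toy review per folder).
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "test"):
        for label in ("neg", "pos"):
            folder = Path(tmp) / split / label
            folder.mkdir(parents=True)
            (folder / "0_1.txt").write_text("a toy review", encoding="utf-8")
    toy_counts = count_reviews(tmp)

print(toy_counts)
```

On the real data, `count_reviews('aclImdb')` should report the same count for all four folders, since the dataset is balanced between positive and negative reviews.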

In [4]:
# pathlib is in the standard library (Python 3.4+)
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        # The review files all end in .txt, so glob("*.txt") picks them up;
        # (split_dir/label_dir).iterdir() would list every entry instead,
        # much like os.listdir()
        for text_file in (split_dir/label_dir).glob("*.txt"):
            # Read as UTF-8 explicitly so the result does not depend on the
            # platform's default encoding
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

In [5]:
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

### Split the training data to create a validation set

In [6]:
from sklearn.model_selection import train_test_split

# Set a random seed so the split is reproducible
random_seed = 42

# Fraction of the training data to hold out as validation data
valid_ratio = 0.2

# Split the data with train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=valid_ratio, random_state=random_seed)
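
To make the role of the seed and the ratio concrete, here is a minimal pure-Python sketch of a seeded hold-out split. `holdout_split` is a hypothetical helper written for illustration; it does not reproduce scikit-learn's exact shuffling, so the rows it selects will differ from those chosen by `train_test_split`.

```python
import random

def holdout_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a seeded RNG, then slice off the validation part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(items) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Ten toy reviews with alternating labels.
toy_texts = ["review {}".format(i) for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

tr_x, va_x, tr_y, va_y = holdout_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))  # 8 training rows, 2 validation rows
```

Calling the function again with the same seed returns the same split, which is the property `random_state` provides in the cell above.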

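### Tokenization
- Tokenization here uses the DistilBERT (Sanh et al., 2019) tokenizer as the example
- A Hugging Face tokenizer converts the data directly into BERT's input format, adding the [CLS] and [SEP] tokens for you
- Because neural networks are trained on mini-batches, we then wrap the data in our own PyTorch dataset object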
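
The points above can be sketched with a toy encoder that shows what adding special tokens, truncating, and padding do to a batch. This is not the real tokenizer: `toy_encode` is a hypothetical helper that splits on whitespace, whereas the DistilBERT tokenizer uses WordPiece and returns integer `input_ids`.

```python
PAD, CLS, SEP = "[PAD]", "[CLS]", "[SEP]"

def toy_encode(texts, max_length=8):
    """Whitespace-split each text, add [CLS]/[SEP], truncate, then pad to the longest."""
    batch = []
    for text in texts:
        words = text.lower().split()
        # Truncation: leave room for the two special tokens.
        words = words[: max_length - 2]
        batch.append([CLS] + words + [SEP])

    longest = max(len(tokens) for tokens in batch)
    encoded = {"tokens": [], "attention_mask": []}
    for tokens in batch:
        n_pad = longest - len(tokens)
        encoded["tokens"].append(tokens + [PAD] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        encoded["attention_mask"].append([1] * len(tokens) + [0] * n_pad)
    return encoded

enc = toy_encode(["A great movie", "Boring"])
print(enc["tokens"][1])          # ['[CLS]', 'boring', '[SEP]', '[PAD]', '[PAD]']
print(enc["attention_mask"][1])  # [1, 1, 1, 0, 0]
```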

In [20]:
# In the Hugging Face libraries, .from_pretrained() loads a pretrained model or tokenizer
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [8]:
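# Tokenize all three splits (train/valid/test)
# truncation=True cuts sequences longer than the model's maximum input length;
# padding=True pads every sequence to the longest one in the call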

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [9]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [10]:
len(test_dataset)

25000
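
A dataset object only needs `__getitem__` and `__len__` for a loader to assemble mini-batches from it. The sketch below shows that protocol in plain Python, without PyTorch: `ToyDataset` and `iter_batches` are hypothetical stand-ins, and the token ids are made up. A real `DataLoader` would additionally shuffle and stack the fields into tensors.

```python
class ToyDataset:
    """Stand-in with the same __getitem__/__len__ protocol as IMDbDataset."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

def iter_batches(dataset, batch_size):
    """Yield dicts of lists, one list per field, batch_size examples at a time."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        examples = [dataset[i] for i in range(start, stop)]
        yield {key: [ex[key] for ex in examples] for key in examples[0]}

toy = ToyDataset({"input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102]]}, [1, 0, 1])
batches = list(iter_batches(toy, batch_size=2))
print(batches[0]["labels"])  # [1, 0]
```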

### Using Hugging Face Datasets

In [11]:
# !pip install datasets

In [None]:
datasets_list = datasets.list_datasets()

print("Hugging Face Datasets currently offers {} datasets".format(len(datasets_list)))
print("===============================================")
print("All available datasets: ")
print(', '.join(datasets_list))

In [5]:
# Set a random seed so the split is reproducible
random_seed = 42

train = datasets.load_dataset("imdb", split="train")
splits = train.train_test_split(
    test_size=0.2,
    seed=random_seed
)
train, valid = splits['train'], splits['test']

test = datasets.load_dataset("imdb", split="test")

Reusing dataset imdb (/home/dean/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b)
Reusing dataset imdb (/home/dean/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b)


In [14]:
print(len(train))
print(len(valid))
print(len(test))

20000
5000
25000


In [6]:
def to_torch_data(hug_dataset):
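    dataset = hug_dataset.map(
        # map() tokenizes in separate batches, so pad to a fixed length to
        # give every row the same shape regardless of its batch
        lambda batch: tokenizer(
            batch["text"],
            truncation=True,
            padding="max_length"
        ),
        batched=True
    )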

    dataset.set_format(
        type='torch',
        columns=[
            'input_ids',
            'attention_mask',
            'label'
        ]
    )

    return dataset

In [7]:
train_dataset = to_torch_data(train)
val_dataset = to_torch_data(valid)
test_dataset = to_torch_data(test)


In [8]:
training_args = transformers.TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = transformers.DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = transformers.Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
)

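# trainer.is_model_parallel=True
# Force single-GPU execution by overriding a private attribute; this is a fragile
# workaround, and setting CUDA_VISIBLE_DEVICES before starting Python is more robust
trainer.args._n_gpu=1
# Training is left commented out here, so the predictions below come from a model
# whose classification head is still randomly initialized
# trainer.train()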

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi
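
The `warmup_steps=500` argument is easier to picture with numbers. The sketch below assumes the Trainer's default linear schedule and its default learning rate of 5e-5, neither of which is set explicitly in the cell above. The total of 3,750 steps is also an assumption: 20,000 training examples at 16 per batch for 3 epochs, on one device with no gradient accumulation. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3750):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 2125, 3750):
    print(step, linear_warmup_lr(step))
```

The rate peaks at step 500 and is halved again by step 2,125, the midpoint of the decay phase.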

In [9]:
trainer.predict(test_dataset)

PredictionOutput(predictions=array([[-0.12364519, -0.05538433],
       [-0.09750886, -0.08097227],
       [-0.14379609, -0.05519198],
       ...,
       [-0.06995446, -0.06064164],
       [-0.11827329, -0.04072288],
       [-0.13526428, -0.09345569]], dtype=float32), label_ids=array([1, 1, 1, ..., 0, 0, 0]), metrics={'test_loss': 0.6959018111228943, 'test_runtime': 127.6522, 'test_samples_per_second': 195.845, 'init_mem_cpu_alloc_delta': 2148904960, 'init_mem_gpu_alloc_delta': 268953088, 'init_mem_cpu_peaked_delta': 199368704, 'init_mem_gpu_peaked_delta': 0, 'test_mem_cpu_alloc_delta': 15785984, 'test_mem_gpu_alloc_delta': 0, 'test_mem_cpu_peaked_delta': 1085440, 'test_mem_gpu_peaked_delta': 2316313600})
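
The `predictions` array holds raw logits, one pair per review, so one more step is needed to get class labels. The sketch below uses only the six rows and six labels visible in the printed output above (the first three and last three of 25,000), and it uses plain Python rather than NumPy to stay self-contained.

```python
import math

# The six logit rows and labels shown in the printed PredictionOutput.
logits = [
    [-0.12364519, -0.05538433],
    [-0.09750886, -0.08097227],
    [-0.14379609, -0.05519198],
    [-0.06995446, -0.06064164],
    [-0.11827329, -0.04072288],
    [-0.13526428, -0.09345569],
]
label_ids = [1, 1, 1, 0, 0, 0]

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# The predicted class is the index of the largest logit in each row.
predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
accuracy = sum(p == y for p, y in zip(predicted, label_ids)) / len(label_ids)
probs = [softmax(row) for row in logits]

print(predicted)  # [1, 1, 1, 1, 1, 1]
print(accuracy)   # 0.5
```

Every row predicts class 1 with a probability close to 0.5, which is what one would expect from a classification head that has not been trained yet.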

# Question answering

## Preparing the dataset (download required)

In [None]:
# Create the squad directory if it does not already exist
os.makedirs("./squad", exist_ok=True)

In [None]:


# !pip install wget
import wget

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json'
filename = wget.download(url, out='./squad/')

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'
filename = wget.download(url, out='./squad/')

In [None]:
def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
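
The nesting that `read_squad` walks is easier to see on a tiny example. The dictionary below is a made-up record in the SQuAD v2.0 layout, and `flatten_squad` is a hypothetical helper that repeats the same loops on an in-memory dict instead of a file. Note that the unanswerable question, whose `answers` list is empty, contributes no rows.

```python
def flatten_squad(squad_dict):
    """Return parallel lists with one entry per (context, question, answer) triple."""
    contexts, questions, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    contexts.append(passage["context"])
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers

# A made-up record: one answerable and one unanswerable question.
toy_squad = {"data": [{"title": "Toy", "paragraphs": [{
    "context": "BERT was released in 2018.",
    "qas": [
        {"id": "q1", "question": "When was BERT released?", "is_impossible": False,
         "answers": [{"text": "2018", "answer_start": 21}]},
        {"id": "q2", "question": "Who directed BERT?", "is_impossible": True,
         "answers": []},
    ],
}]}]}

toy_contexts, toy_questions, toy_answers = flatten_squad(toy_squad)
print(toy_questions)  # ['When was BERT released?']
```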

In [None]:
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
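
A worked example shows what the offset correction does. `align_answer` below is a condensed, hypothetical restatement of the same idea for a single answer, with one addition that is not in the cell above: a guard against negative start indices. The context and the mislabelled offset are made up.

```python
def align_answer(answer, context):
    """Try the labelled start, then one and two characters earlier."""
    gold = answer["text"]
    start = answer["answer_start"]
    for shift in (0, 1, 2):
        s = start - shift
        e = s + len(gold)
        if s >= 0 and context[s:e] == gold:
            answer["answer_start"] = s
            answer["answer_end"] = e  # exclusive end index
            return answer
    # No offset matched: the answer is returned without an answer_end key.
    return answer

context = "The film premiered in Paris."
answer = {"text": "Paris", "answer_start": 23}  # labelled one character too late

align_answer(answer, context)
print(answer)  # {'text': 'Paris', 'answer_start': 22, 'answer_end': 27}
```

In `add_end_idx`, an answer that matches none of the three offsets is likewise left without `answer_end`, so a later lookup of that key would raise a `KeyError`; checking for such cases before tokenizing may be worthwhile.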

In [None]:
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
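
The conversion from character offsets to token indices can be sketched with a toy whitespace tokenizer. Both helpers are hypothetical: the real `char_to_token` works on WordPiece tokens and accounts for the special tokens, which this sketch ignores. It does show why the end position is looked up at `answer_end - 1` (the exclusive end can fall on a space, which belongs to no token) and why a truncated answer yields `None`.

```python
def whitespace_offsets(text):
    """Return (start, end) character spans for whitespace-separated tokens."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((start, end))
        pos = end
    return spans

def toy_char_to_token(spans, char_index):
    """Return the index of the token covering char_index, or None."""
    for token_index, (start, end) in enumerate(spans):
        if start <= char_index < end:
            return token_index
    return None

context = "BERT was released in 2018."
spans = whitespace_offsets(context)

# "released" covers characters 9..16, so its exclusive end is 17 (a space).
print(toy_char_to_token(spans, 9))        # 2
print(toy_char_to_token(spans, 17))       # None: a space belongs to no token
print(toy_char_to_token(spans, 17 - 1))   # 2
# Keeping only the first three tokens simulates truncation.
print(toy_char_to_token(spans[:3], 21))   # None: "2018." was cut off
```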

In [None]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [None]:
model = transformers.DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

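# torch.optim.AdamW replaces transformers.AdamW, which is deprecated in later
# transformers releases; weight_decay=0.0 keeps the former default behavior
optim = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)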

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()