In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!nvidia-smi

Wed Nov 25 05:39:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    24W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Transformer Pre-trained model

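This chapter introduces the `Transformer`, currently the strongest family of models in natural language processing. Compared with the `RNN` family, the `Transformer` has a clear advantage in both performance (`metrics`) and computational efficiency (`parallel`). Well-known `pre-train` models are listed below, each with a link to its paper. These models are essentially variants of the `Transformer`; they differ in their pre-training strategy, for example the amount of data, the `Masked` scheme, and the `Self-attention` matrix. The most unusual one is the last entry, `ELECTRA`, a paper proposed in early November `2019` that combines the `transformer` with a `GAN`-style generator and discriminator.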

* BERT: https://arxiv.org/abs/1810.04805
 - Masked Language Modeling + Next Sentence Prediction


* GPT: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
 - AutoRegressive Prediction


* Transformer-XL: https://arxiv.org/abs/1901.02860
 - Learning dependency beyond a fixed length(>512)


* XLNet: https://arxiv.org/abs/1906.08237
 - Permutation Modeling


* XLM: https://arxiv.org/abs/1901.07291
 - Pretrain on cross-lingual language


* RoBERTa: https://arxiv.org/abs/1907.11692
 - Pretrain model longer, more data


* DistilBERT: https://arxiv.org/abs/1910.01108
* CTRL: https://arxiv.org/abs/1909.05858
* ELECTRA: https://openreview.net/pdf?id=r1xMH1BtvB
 - Transformer + GAN
 
 
### [GLUE Benchmark](https://gluebenchmark.com/leaderboard)

## [Transformer](https://huggingface.co/transformers/)

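Here we use the `Transformers` package for `finetune`. Before fine-tuning, it helps to understand how natural language processing tasks differ. There are two main kinds of classification task:

1. `Text classification`: takes one sentence as input and outputs the class of that sentence.
2. `Sentence-Pair classification`: takes a pair of sentences as input and outputs the relationship between the two sentences.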

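* PS. Beyond their strong benchmark results, the most important contribution of these pre-trained models is the pre-trained `word embedding`. A `word embedding` captures the relationships between words in a corpus. The best-known example is: man - woman = king - queen. With a well-trained `word embedding` that encodes relationships like this, performance on other downstream applications, such as chatbots and recommender systems, is usually good as well.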
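The analogy above can be illustrated with a toy example. This is only a sketch: the three-dimensional vectors below are hypothetical hand-written values, not embeddings taken from `BERT` or any trained model, and the helper names `sub`, `add`, `cosine` and `analogy` are made up for illustration.

```python
import math

# Hypothetical 3-d embeddings (assumption, not from a trained model):
# axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ unrelated
emb = {
    "man":   [0.0,  1.0, 0.1],
    "woman": [0.0, -1.0, 0.1],
    "king":  [1.0,  1.0, 0.2],
    "queen": [1.0, -1.0, 0.2],
    "apple": [0.1,  0.0, 1.0],
}

def sub(u, v):
    return [x - y for x, y in zip(u, v)]

def add(u, v):
    return [x + y for x, y in zip(u, v)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = add(sub(emb[a], emb[b]), emb[c])
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# The "gender" offset is the same for both word pairs in this toy space.
print(sub(emb["man"], emb["woman"]), sub(emb["king"], emb["queen"]))
print(analogy("king", "man", "woman"))
```

In a real embedding space the two offsets are only approximately equal, so the nearest-neighbour search in `analogy` does the actual work.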

Here we will use `BERT` for `finetune`.

In [4]:
!pip install transformers
!pip install sacremoses

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
     |████████████████████████████████| 1.3MB 5.6MB/s 
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 20.3MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 43.1MB/s 
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix

from transformers import TFBertForSequenceClassification, BertTokenizer, glue_convert_examples_to_features, TFXLNetForSequenceClassification, XLNetTokenizer

os.chdir('/content/drive/Shareddrives/類技術班教材/標準版/NLP進階/Seq2seq 系列模型/4.Transformer_based_model/Finetune_on_glue')

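## Model names explained

* `bert-base-uncased`:
  - `bert`: the model name.
  - `base`: the model size. `base` means $12$ layers, a `word embedding(hidden)` size of $768$, and $12$ `heads`. There is also `large`, with $24$ layers, a `word embedding(hidden)` size of $1024$, and $16$ `heads`.
  - `uncased`: describes the text preprocessing. `uncased` means all text is lowercased, whereas `cased` keeps the original casing.
 
These are not the only models available; for the others, see: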
https://huggingface.co/transformers/pretrained_models.html
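To make the `base` and `large` sizes concrete, here is a rough parameter-count sketch. The layer counts, hidden sizes and head counts come from the description above. The vocabulary size (30522), maximum position count (512), two segment types, and a feed-forward width of four times the hidden size are assumptions taken from the published `bert-base-uncased` configuration, not from this notebook. The function name `bert_encoder_params` is made up for illustration, and the count covers only the encoder (embeddings, layers, pooler), not the classification head.

```python
def bert_encoder_params(layers, hidden, vocab_size=30522, max_positions=512,
                        type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler)."""
    ffn = ffn_mult * hidden
    layer_norm = 2 * hidden                       # scale + bias
    # token, position and segment embedding tables, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)    # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

for name, layers, hidden, heads in [("base", 12, 768, 12), ("large", 24, 1024, 16)]:
    total = bert_encoder_params(layers, hidden)
    print(f"{name}: {total / 1e6:.1f}M parameters, {hidden // heads} dims per head")
```

Both sizes end up with the same per-head dimension, because the hidden size and the number of heads grow together.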

In [None]:
# model = TFXLNetForSequenceClassification.from_pretrained('xlnet-base-cased')
# tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

In [None]:
"""
Load the pre-trained model
"""
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['dropout_75', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
"""
Load the model's tokenizer
"""
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## Finetune

<figure>
<center>
<img src='https://drive.google.com/uc?export=view&id=1uOR-Xmf0Gd8fXuaFIW2oew-OUkCAt4b_' width="800"/>
<figcaption>Self-attention</figcaption></center>
</figure>

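These pre-trained models are all compared on the [GLUE Benchmark](https://gluebenchmark.com/leaderboard). The benchmark provides a range of natural language processing tasks, most of which are classification tasks that differ mainly in dataset size and source. Here we use one of the classification tasks, `MRPC`, for `finetune`.

* Data source: [tensorflow dataset](https://www.tensorflow.org/datasets/catalog/overview#wmt19_translate)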

In [None]:
data, info = tfds.load('glue/mrpc', with_info=True)

INFO:absl:Load dataset info from /root/tensorflow_datasets/glue/mrpc/1.0.0
INFO:absl:Reusing dataset glue (/root/tensorflow_datasets/glue/mrpc/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split None, from /root/tensorflow_datasets/glue/mrpc/1.0.0


### Info

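This is the dataset description. The most important thing to note is the shape of the data: because `MRPC` is a `Sentence-Pair classification` task, each example contains `sentence1` and `sentence2` with one corresponding `label`. `MRPC` classifies whether two sentences have the same meaning; a `label` of $1$ means they do, and $0$ means they do not.

Because this is a benchmark dataset, it is already split into `train`, `validation` and `test`.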

In [None]:
info

tfds.core.DatasetInfo(
    name='glue',
    version=1.0.0,
    description='GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.',
    homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398',
    features=FeaturesDict({
        'idx': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'sentence1': Text(shape=(), dtype=tf.string),
        'sentence2': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=5801,
    splits={
        'test': 1725,
        'train': 3668,
        'validation': 408,
    },
    supervised_keys=None,
    citation="""@inproceedings{dolan2005automatically,
      title={Automatically constructing a corpus of sentential paraphrases},
      author={Dolan, William B and Brockett, Chris},
      booktitle={Proceedings of the Third International Workshop on Para

In [None]:
for k, v in data.items():
    print('key:', k)
    print('data shapes:\n', v)
    print('-' * 20)

key: test
data shapes:
 <PrefetchDataset shapes: {idx: (), label: (), sentence1: (), sentence2: ()}, types: {idx: tf.int32, label: tf.int64, sentence1: tf.string, sentence2: tf.string}>
--------------------
key: train
data shapes:
 <PrefetchDataset shapes: {idx: (), label: (), sentence1: (), sentence2: ()}, types: {idx: tf.int32, label: tf.int64, sentence1: tf.string, sentence2: tf.string}>
--------------------
key: validation
data shapes:
 <PrefetchDataset shapes: {idx: (), label: (), sentence1: (), sentence2: ()}, types: {idx: tf.int32, label: tf.int64, sentence1: tf.string, sentence2: tf.string}>
--------------------


### Dataset overview

`tensorflow` stores the data as a `tf.data.Dataset`. You can call `iter` on it to create an iterator and use `next` to look at the first example. Each example contains `idx`, `label`, `sentence1` and `sentence2`.

In [None]:
assert isinstance(data['train'], tf.data.Dataset)

temp = data['train']
temp_gen = iter(temp)
next(temp_gen)

{'idx': <tf.Tensor: shape=(), dtype=int32, numpy=1680>,
 'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
 'sentence1': <tf.Tensor: shape=(), dtype=string, numpy=b'The identical rovers will act as robotic geologists , searching for evidence of past water .'>,
 'sentence2': <tf.Tensor: shape=(), dtype=string, numpy=b'The rovers act as robotic geologists , moving on six wheels .'>}

### Training data format

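Next, we need to convert the dataset into a format the model can read. The input has three parts:

* `input_ids`: the sentence after tokenization, with every token mapped to an `id` that is used to look up its `token embeddings`. In the example below, `101` stands for `[CLS]` and `102` stands for `[SEP]`. Because `MRPC` is a `Sentence-Pair classification` task, the example contains two `102` tokens.

* `attention mask`: `BERT` limits the input length to at most `512` tokens, and here we choose `128`. Not every input is `128` tokens long, so shorter inputs are extended at the end with `padding` (zeros). The mask is `1` for real tokens and `0` for `padding`, so that self-attention ignores the `padding` positions.

* `token_type_ids`: represents the `Segment embedding` (see the figure above), which indicates the sentence each token belongs to. Because `MRPC` has two sentences, there are two `ids`, `0` and `1`.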
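The three inputs described above can be assembled by hand once the token ids are known. The following is a minimal sketch, not the implementation used by `glue_convert_examples_to_features`: the function name `build_pair_features` is hypothetical, truncation is not handled, and the token ids are copied from the first converted training example shown in this notebook.

```python
def build_pair_features(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Assemble [CLS] A [SEP] B [SEP] and pad to max_length (no truncation)."""
    first = [cls_id] + ids_a + [sep_id]    # segment 0
    second = ids_b + [sep_id]              # segment 1
    input_ids = first + second
    if len(input_ids) > max_length:
        raise ValueError("pair is longer than max_length; truncation is not sketched here")
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "token_type_ids": [0] * len(first) + [1] * len(second) + [0] * pad,
    }

# Token ids of the two sentences, without the special tokens
ids_a = [1996, 7235, 9819, 2097, 2552, 2004, 20478, 21334, 2015,
         1010, 6575, 2005, 3350, 1997, 2627, 2300, 1012]
ids_b = [1996, 9819, 2552, 2004, 20478, 21334, 2015, 1010,
         3048, 2006, 2416, 7787, 1012]

features = build_pair_features(ids_a, ids_b, max_length=128)
print(sum(features["attention_mask"]), "real tokens out of", len(features["input_ids"]))
```

Padding positions get `token_type_ids` of `0` here; that choice is harmless because the `attention mask` already hides them.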

In [None]:
max_length = 128
task = 'mrpc'

train_dataset = glue_convert_examples_to_features(data['train'],
                                                  tokenizer,
                                                  max_length,
                                                  task)
valid_dataset = glue_convert_examples_to_features(data['validation'],
                                                  tokenizer,
                                                  max_length,
                                                  task)
test_dataset = glue_convert_examples_to_features(data['test'],
                                                 tokenizer,
                                                 max_length,
                                                 task)



### Example

Inspect the converted dataset.

In [None]:
next(iter(train_dataset))

({'attention_mask': <tf.Tensor: shape=(128,), dtype=int32, numpy=
  array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>,
  'input_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
  array([  101,  1996,  7235,  9819,  2097,  2552,  2004, 20478, 21334,
          2015,  1010,  6575,  2005,  3350,  1997,  2627,  2300,  1012,
           102,  1996,  9819,  2552,  2004, 20478, 21334,  2015,  1010,
          3048,  2006,  2416,  7787,  1012,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,  

### Parameter settings

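With `tf.data.Dataset`, three standard operations are usually chained onto the training dataset:

* `.shuffle()`: shuffles the dataset. It first fills a `buffer` with `buffer_size` examples, then repeatedly draws one example at random from the `buffer` and refills that slot with the next example from the dataset. The `buffer` exists mainly to handle datasets that cannot be loaded into memory all at once; when `buffer_size` is smaller than the dataset, the shuffle is only approximate.

* `.batch()`: the number of examples used in each iteration.
* `.repeat()`: how many times the dataset is repeated, here the number of `epochs`.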
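The buffer-based shuffle described above can be sketched in plain Python. This is an illustration of the general idea rather than TensorFlow's actual implementation; the function name `shuffle_buffer` and the `seed` parameter are made up for this sketch.

```python
import random

def shuffle_buffer(items, buffer_size, seed=None):
    """Yield items in approximately shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    it = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size elements, in order.
    for x in it:
        buffer.append(x)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffer slot, then refill that slot with the next input element.
    for x in it:
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = x
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

out = list(shuffle_buffer(range(20), buffer_size=5, seed=0))
print(out)
```

Every element still appears exactly once, but an element can only move forward by fewer than `buffer_size` positions. With `buffer_size = 100` and 3668 training examples, the order is therefore only locally shuffled; a `buffer_size` at least as large as the dataset would give a full shuffle.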

In [None]:
buffer_size = 100
train_bz = 16
epochs = 3
valid_bz = 50

train_dataset = train_dataset.shuffle(buffer_size).batch(train_bz).repeat(epochs)
valid_dataset = valid_dataset.batch(valid_bz)
test_dataset = test_dataset.batch(1)

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5,
                                     epsilon=1e-8,
                                     clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                     reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
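As a rough illustration of what the loss configured above computes, the sketch below evaluates sparse categorical cross-entropy directly from raw logits and averages it over the batch, which is what `from_logits=True` with `SUM_OVER_BATCH_SIZE` amounts to. The function name `sparse_ce_from_logits` and the example logits are hypothetical; this is not Keras code.

```python
import math

def sparse_ce_from_logits(logits, labels):
    """Mean over the batch of -log softmax(logits)[label]."""
    losses = []
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[y])
    return sum(losses) / len(losses)

# Hypothetical logits for a batch of two examples with two classes
logits = [[2.0, -1.0], [0.0, 0.0]]
labels = [0, 1]
print(sparse_ce_from_logits(logits, labels))
```

Because the model outputs raw logits rather than probabilities, the softmax is folded into the loss, which is why `from_logits=True` is needed.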

## Training

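* `.fit()`: accepts `generator`-style input such as a `tf.data.Dataset`. `fit_generator` can also be used, but it is deprecated in recent TensorFlow 2 releases.
* `steps_per_epoch`: the number of training steps in each `epoch`, usually $\frac{train\_size}{batch\_size}$, so that each epoch goes through the whole training set.
* `validation_steps`: the same idea as `steps_per_epoch`, applied to the validation set.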
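The step counts follow from simple arithmetic on the split sizes reported by `info` (3668 training and 408 validation examples) and the batch sizes chosen above. The sketch below compares integer division, which is what this notebook uses, with rounding up; the helper name `steps_for` is made up for illustration.

```python
import math

def steps_for(num_examples, batch_size):
    """Return (full batches, batches needed to see everything, examples left over)."""
    full = num_examples // batch_size
    every = math.ceil(num_examples / batch_size)
    skipped = num_examples - full * batch_size
    return full, every, skipped

for name, n, bz in [("train", 3668, 16), ("validation", 408, 50)]:
    full, every, skipped = steps_for(n, bz)
    print(f"{name}: {full} full batches, {every} to cover all examples, "
          f"{skipped} examples left over with integer division")
```

With integer division, the last partial batch is not visited, so the validation metrics reported during training are computed on slightly fewer than 408 examples. Rounding up with `math.ceil` would cover every example.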

In [None]:
history = model.fit(train_dataset,
                    epochs=epochs,
                    steps_per_epoch=3668//train_bz, 
                    validation_data=valid_dataset,
                    validation_steps=408//valid_bz)

Epoch 1/3
Epoch 2/3
Epoch 3/3


## Evaluation

In [None]:
valid_pred = model.predict(valid_dataset)
valid_pred_ids = np.argmax(valid_pred[0], axis=-1)

In [None]:
import numpy as np
"""
Collect the labels from the tf.data.Dataset
"""
valid_label = list()
for x in valid_dataset:
  valid_label.append(x[1].numpy())
valid_label = np.concatenate(valid_label)

In [None]:
confm = confusion_matrix(y_pred=valid_pred_ids, y_true=valid_label)

index = ['Actual_0', 'Actual_1']
columns = ['Pred_0', 'Pred_1']
pd.DataFrame(confm, index=index, columns=columns)

          Pred_0  Pred_1
Actual_0      61      68
Actual_1       7     272


In [None]:
print(classification_report(y_pred=valid_pred_ids, y_true=valid_label))

              precision    recall  f1-score   support

           0       0.90      0.47      0.62       129
           1       0.80      0.97      0.88       279

    accuracy                           0.82       408
   macro avg       0.85      0.72      0.75       408
weighted avg       0.83      0.82      0.80       408
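The numbers in the report can be reproduced by hand from the confusion matrix above. The sketch below is plain Python rather than `sklearn`; the function name `per_class_metrics` is made up for illustration, and the matrix values are copied from the notebook output.

```python
# Rows are actual classes, columns are predicted classes (from the output above)
confm = [[61, 68],
         [7, 272]]

def per_class_metrics(confm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = confm[k][k]
    predicted_k = sum(row[k] for row in confm)   # column sum
    actual_k = sum(confm[k])                     # row sum
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(confm[k][k] for k in range(len(confm))) / sum(map(sum, confm))

for k in range(len(confm)):
    p, r, f1 = per_class_metrics(confm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The low recall for class 0 comes from the 68 non-paraphrase pairs that were predicted as paraphrases, which shows the model leans toward predicting class 1.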



## Save model

In [None]:
save_path = 'save'
os.makedirs(save_path, exist_ok=True)  # create the directory if it does not exist yet

In [None]:
model.save_pretrained(save_path)

## Load model and predict

Here we follow the `MRPC` input format and again use the `glue_convert_examples_to_features` function for the conversion.

In [None]:
new_model = TFBertForSequenceClassification.from_pretrained(save_path)

Some layers from the model checkpoint at save were not used when initializing TFBertForSequenceClassification: ['dropout_75']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at save and are newly initialized: ['dropout_113']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
sentence1 = ["Anorld Schwarzenegger is my idol."]
sentence2 = ["My favorite idol is Anorld Schwarzenegger."]

test_dataset = pd.DataFrame(dict(idx=list(range(len(sentence1))),
                                 label=[0]*len(sentence1),
                                 sentence1=sentence1,
                                 sentence2=sentence2))

In [None]:
"""
Mimic the GLUE input format: (idx, label, sentence1, sentence2)
The label is a dummy value required by the input format; it does not affect the prediction
"""
test_dataset

   idx  label                          sentence1                                   sentence2
0    0      0  Anorld Schwarzenegger is my idol.  My favorite idol is Anorld Schwarzenegger.


In [None]:
test_gen = tf.data.Dataset.from_tensor_slices(dict(test_dataset))

In [None]:
test_gen = glue_convert_examples_to_features(test_gen, tokenizer, max_length, task)



In [None]:
test_gen = test_gen.batch(1)

In [None]:
next(iter(test_gen))

({'attention_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
  array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
        dtype=int32)>,
  'input_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
  array([[  101,  2019,  2953,  6392, 29058,  8625, 13327,  2003,  2026,
          10282,  1012,   102,  2026,  5440, 10282,  2003,  2019,  2953,
           6392, 29058,  8625, 13327,  1012,   102,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     

In [None]:
pred = new_model.predict(test_gen)

In [None]:
pred_ids = np.argmax(pred, axis=-1)

In [None]:
print(pred_ids[0])

[1]
