# 『人を知る』人工知能講座 ("Understanding People" AI Course) <br> <span style="color: #00B0F0;">Session 3: Language Media</span> <br> <span style="background-color: #1F4E79; color: #FFFFFF;">&nbsp;3&nbsp;</span> Natural Language Processing with BERT: English Fine-tuning

## 1. Installing the Library

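In this exercise, we fine-tune with [transformers](https://github.com/huggingface/transformers), a library built on PyTorch.

The library can be installed directly with pip, but because we will modify part of the code in this exercise, we install from source code obtained with git. The source code is placed under the `transformers` directory.

We install this source code with pip. (After modifying the source code later, we will install it with pip again.)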

In [1]:
!pip install -r transformers/examples/requirements.txt
!pip install transformers/

Processing ./transformers
Building wheels for collected packages: transformers
  Building wheel for transformers (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-8sy9ii97/wheels/7b/98/b9/2da18dcef55b090a377c480bc2c98287794672928a7a1e869e
Successfully built transformers
Installing collected packages: transformers
  Found existing installation: transformers 2.1.1
    Uninstalling transformers-2.1.1:
      Successfully uninstalled transformers-2.1.1
Successfully installed transformers-2.1.1


If `Successfully installed transformers-2.1.1` is printed, the installation succeeded.
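
Besides reading the pip log, the installed version can also be queried from Python. The following is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is available); the helper name `installed_version` is hypothetical and not part of this exercise.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# In the environment of this exercise, this is expected to report the version
# that pip printed in the log above.
print(installed_version("transformers"))
```

If the function returns `None`, the package is not visible to the current Python interpreter, which usually means pip installed it into a different environment.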

## 2. English Pre-trained Model

This time we use the uncased BERT<sub>BASE</sub> model. It is located under /data/nlp/tool/bert/bert-base-uncased.

In [2]:
!ls /data/nlp/tool/bert/bert-base-uncased

config.json  pytorch_model.bin	vocab.txt


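In the lecture we showed the contents of the TensorFlow version of the pre-trained model (page 81), but this model is the PyTorch version. pytorch_model.bin holds the model weights, and config.json is the configuration file (page 83).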

In [3]:
!cat /data/nlp/tool/bert/bert-base-uncased/config.json

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
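
A few quantities follow directly from the values in config.json. The sketch below copies a subset of the fields shown above into a string (so that it runs without access to the model directory) and derives the per-head dimension, the feed-forward expansion ratio, and the number of entries in the three embedding tables. The variable names are illustrative only.

```python
import json

# Subset of the config.json fields printed above, embedded so the sketch is self-contained.
CONFIG_JSON = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
"""

config = json.loads(CONFIG_JSON)

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward layer expands the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

# Word, position, and token-type embedding tables each have hidden_size columns.
embedding_rows = (
    config["vocab_size"]
    + config["max_position_embeddings"]
    + config["type_vocab_size"]
)
embedding_table_params = embedding_rows * config["hidden_size"]

print(head_dim, ffn_ratio, embedding_table_params)
```

The `max_position_embeddings` value of 512 is also the upper bound on the input sequence length the model can accept.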


Finally, vocab.txt is the vocabulary list (page 82). Let's look at the first 10 lines.

In [4]:
!head -n 10 /data/nlp/tool/bert/bert-base-uncased/vocab.txt

[PAD]
[unused0]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]


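You do not need to worry about the [unused..] entries here. Apart from a few special tokens such as [CLS] and [SEP], the [unused..] entries continue until ordinary tokens begin at line 1000, starting with punctuation and digits. (The command below uses sed; just take it as a way to print a range of lines.)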

In [5]:
!sed -n 1000,1020p /data/nlp/tool/bert/bert-base-uncased/vocab.txt

!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
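
The position of a token in vocab.txt determines its id. The sketch below assumes the usual convention that ids are zero-based line indices (so line 1000 corresponds to id 999) and rebuilds the ids of the first tokens in the listing above. The names `FIRST_LINE` and `LISTED_TOKENS` are illustrative.

```python
# First tokens printed by the sed command above, starting at line 1000 of vocab.txt.
FIRST_LINE = 1000
LISTED_TOKENS = ["!", '"', "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", "."]

# Assumption: token ids are zero-based line indices, so id = line number - 1.
token_to_id = {
    token: (FIRST_LINE - 1) + offset
    for offset, token in enumerate(LISTED_TOKENS)
}

for token in ['"', "."]:
    print(repr(token), token_to_id[token])
```

Under this assumption the period gets id 1012 and the double quote gets id 1000, which agrees with the `input_ids` printed in the fine-tuning log in section 4, where sentences end with 1012.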


## 3. Downloading the GLUE Data

We fine-tune on the GLUE (General Language Understanding Evaluation) datasets explained in the lecture.

The Python script below downloads the datasets for all tasks. It takes about 2 minutes in total.

In [6]:
!git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
!python download_glue_repo/download_glue_data.py --data_dir='glue_data' --tasks all

Cloning into 'download_glue_repo'...
remote: Enumerating objects: 21, done.
remote: Total 21 (delta 0), reused 0 (delta 0), pack-reused 21
Unpacking objects: 100% (21/21), done.
Checking connectivity... done.
Downloading and extracting CoLA...
	Completed!
Downloading and extracting SST...
	Completed!
Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
	Completed!
Downloading and extracting QQP...
	Completed!
Downloading and extracting STS...
	Completed!
Downloading and extracting MNLI...
	Completed!
Downloading and extracting SNLI...
	Completed!
Downloading and extracting QNLI...
	Completed!
Downloading and extracting RTE...
	Completed!
Downloading and extracting WNLI...
	Completed!
Downloading and extracting diagnostic...
	Completed!


If `Downloading and extracting diagnostic... Completed!` is printed, the download succeeded.

As a first example, let's look at the SST-2 (Stanford Sentiment Treebank) data.

In [7]:
!ls glue_data/SST-2

dev.tsv  original  test.tsv  train.tsv


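In GLUE tasks, the model is trained on train.tsv and its accuracy is measured on dev.tsv. test.tsv is for submission to the GLUE leaderboard. The original directory is not used.

This task is a single-sentence classification problem: each sentence is classified as positive (=1) or negative (=0). Let's look at the beginning of dev.tsv.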

In [8]:
!head glue_data/SST-2/dev.tsv

sentence	label
it 's a charming and often affecting journey . 	1
unflinchingly bleak and desperate 	0
allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 	1
the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 	1
it 's slow -- very , very slow . 	0
although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women . 	1
a sometimes tedious film . 	0
or doing last year 's taxes with your ex-wife . 	0
you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance . 	1
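
The TSV layout shown above can be read with Python's standard `csv` module. The following sketch parses a few rows copied from the listing (embedded as a string so that it runs without the downloaded data) and counts the labels. The function name `read_sst2` is hypothetical, and quoting is disabled on the assumption that quote characters in the text are literal.

```python
import csv
import io
from collections import Counter

# A few rows copied from the dev.tsv listing above: a header line, then sentence<TAB>label.
SAMPLE_TSV = (
    "sentence\tlabel\n"
    "it 's a charming and often affecting journey . \t1\n"
    "unflinchingly bleak and desperate \t0\n"
    "it 's slow -- very , very slow . \t0\n"
    "a sometimes tedious film . \t0\n"
)


def read_sst2(handle):
    """Return a list of (sentence, label) pairs from an SST-2 style TSV stream."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["sentence"].strip(), int(row["label"])) for row in reader]


examples = read_sst2(io.StringIO(SAMPLE_TSV))
label_counts = Counter(label for _, label in examples)
print(label_counts)
```

To read the real file instead, the same function could be called on `open("glue_data/SST-2/dev.tsv", encoding="utf-8")`.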


Next, let's look at the MRPC (Microsoft Research Paraphrase Corpus) data. This is a sentence-pair classification problem: the two sentences are classified as having the same meaning (=1) or not (=0).

In [9]:
!head glue_data/MRPC/dev.tsv

﻿Quality	#1 ID	#2 ID	#1 String	#2 String
1	1355540	1355592	He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .	" The foodservice pie business does not fit our long-term growth strategy .
0	2029631	2029565	Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .	His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .
0	487993	487952	The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .	The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .
1	1989515	1989458	The AFL-CIO is waiting until October to decide if it will endorse a candidate .	The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
0	1783137	1782659	No dates ha
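
For a sentence-pair task like MRPC, both sentences are fed to BERT as one sequence. The sketch below illustrates the packing under these assumptions: the ids 101, 102, and 0 stand for [CLS], [SEP], and [PAD] (as seen in the fine-tuning log in section 4), the token ids are copied from the `dev-1` example in that log, and the pair fits within the maximum length. The function `pack_pair` is hypothetical and is not the library's own implementation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids as they appear in the fine-tuning log


def pack_pair(ids_a, ids_b, max_seq_length=128):
    """Build input_ids, attention_mask, and token_type_ids for a sentence pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    if len(input_ids) > max_seq_length:
        # Truncation is deliberately left out of this sketch.
        raise ValueError("pair is longer than max_seq_length")
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    padding = max_seq_length - len(input_ids)
    return (
        input_ids + [PAD_ID] * padding,
        attention_mask + [0] * padding,   # padding positions are masked out
        token_type_ids + [0] * padding,
    )


# Token ids of the first dev example, copied from the log (without [CLS]/[SEP]).
ids_a = [2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056,
         4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012]
ids_b = [1000, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2515, 2025, 4906, 2256,
         2146, 1011, 2744, 3930, 5656, 1012]

input_ids, attention_mask, token_type_ids = pack_pair(ids_a, ids_b)
print(len(input_ids), sum(attention_mask))
```

The attention mask marks the real tokens with 1 and the padding with 0, which is the pattern visible in the `attention_mask` lines of the log.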

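## 4. Fine-tuning on GLUE

The MRPC dataset we just looked at is one of the smaller ones in GLUE, so we use it to try fine-tuning. No code changes are needed; you only need to run the command below.

Training normally runs for 3 epochs, and one epoch takes about 3 minutes. The command below is shown with `--num_train_epochs 3.0`; to try a single epoch first, change that value to 1.0.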
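
The per-epoch time can be cross-checked against the progress bars in the run log. The following is a rough sketch, assuming that the MRPC training split has 3,668 examples (a figure not stated in this notebook), that a single GPU is used, and that the rate of about 2.17 it/s read off the progress bar holds for the whole epoch.

```python
import math

train_examples = 3668          # assumption: size of the MRPC training split
batch_size = 8                 # --per_gpu_train_batch_size, single GPU assumed
iterations_per_second = 2.17   # approximate rate shown by the progress bar
num_epochs = 3

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
minutes_per_epoch = steps_per_epoch / iterations_per_second / 60
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, round(minutes_per_epoch, 1), total_steps)
```

Under these assumptions the step count matches the 459 iterations per epoch shown by the progress bar, and one epoch comes out at roughly three and a half minutes.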

In [11]:
%set_env TASK_NAME=MRPC
%set_env GLUE_DIR=glue_data
!python ./transformers/examples/run_glue.py \
    --model_type bert \
    --model_name_or_path  /data/nlp/tool/bert/bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --save_steps 1000 \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir glue_result/$TASK_NAME/ \
    --overwrite_output_dir \
    --overwrite_cache

env: TASK_NAME=MRPC
env: GLUE_DIR=glue_data
11/28/2019 12:32:09 - INFO - transformers.configuration_utils -   loading configuration file /data/nlp/tool/bert/bert-base-uncased/config.json
11/28/2019 12:32:09 - INFO - transformers.configuration_utils -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": "mrpc",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522
}

11/28/2019 12:32:09 - INFO - transformers.tokenization_utils -   Model name '/data/nlp/tool/bert/bert-base-uncased' not found in model shortcut name list (bert-base-

11/28/2019 12:32:14 - INFO - transformers.data.processors.glue -   *** Example ***
11/28/2019 12:32:14 - INFO - transformers.data.processors.glue -   guid: train-5
11/28/2019 12:32:14 - INFO - transformers.data.processors.glue -   input_ids: 101 1996 4518 3123 1002 1016 1012 2340 1010 2030 2055 2340 3867 1010 2000 2485 5958 2012 1002 2538 1012 4868 2006 1996 2047 2259 4518 3863 1012 102 18720 1004 1041 13058 1012 6661 5598 1002 1015 1012 6191 2030 1022 3867 2000 1002 2538 1012 6021 2006 1996 2047 2259 4518 3863 2006 5958 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/28/2019 12:32:14 - INFO - transformers.data.processors.glue -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Iteration:  37%|██████████▉                   | 168/459 [01:16<02:12,  2.19it/s]
Iteration:  79%|███████████████████████▋      | 362/459 [02:45<00:44,  2.18it/s]
Iteration:   0%|                                        | 0/459 [00:00<?, ?it/s]
Iteration:  42%|████████████▋                 | 194/459 [01:29<02:02,  2.16it/s]
Iteration:  85%|█████████████████████████▎    | 388/459 [02:59<00:33,  2.14it/s]
Iteration:  26%|███████▋                      | 118/459 [00:55<02:38,  2.15it/s]
Iteration:  68%|████████████████████▍         | 312/459 [02:24<01:07,  2.16it/s]

11/28/2019 12:43:02 - INFO - __main__ -   Creating features from dataset file at glue_data/MRPC
11/28/2019 12:43:02 - INFO - transformers.data.processors.glue -   Writing example 0
11/28/2019 12:43:02 - INFO - transformers.data.processors.glue -   *** Example ***
11/28/2019 12:43:02 - INFO - transformers.data.processors.glue -   guid: dev-1
11/28/2019 12:43:02 - INFO - transformers.data.processors.glue -   input_ids: 101 2002 2056 1996 9440 2121 7903 2063 11345 2449 2987 1005 1056 4906 1996 2194 1005 1055 2146 1011 2744 3930 5656 1012 102 1000 1996 9440 2121 7903 2063 11345 2449 2515 2025 4906 2256 2146 1011 2744 3930 5656 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/28/2019 12:43:02 - INFO - transformers.data.processors.glue -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0

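The numbers printed at the end are the scores on the dev set. The F1 score is about 88 after 1 epoch and about 91 after 3 epochs. (If you ran only 1 epoch and have time, rerun with `--num_train_epochs 3.0` to train for 3 epochs.)
The BERT paper reports the score on the test set, which is about 89, so we obtain roughly the same value.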
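
For reference, the F1 score is the harmonic mean of precision and recall on the positive class (here, "same meaning"). The sketch below computes it on a small set of made-up labels and predictions; the data and the function name `f1_score` are hypothetical, and this is not the evaluation code used by run_glue.py.

```python
def f1_score(labels, predictions, positive=1):
    """Compute F1 for the positive class from gold labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 3 true positives, 1 false positive, 1 false negative.
labels = [1, 1, 1, 0, 0, 1]
predictions = [1, 1, 0, 0, 1, 1]
score = f1_score(labels, predictions)
print(score)
```

A score reported as "about 88" in the text corresponds to a value of about 0.88 on this scale.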

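This time, in order to inspect the contents of the English pre-trained model, we downloaded the model in advance and specified it with the `--model_name_or_path` option. In practice, however, if you specify the model type as shown below, the model is downloaded automatically and can be run.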