## Module

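- nlp: used to load the CNN/DailyMail dataset
- logging: makes debugging easier when you want to follow the overall flow of processing
- transformers: handles everything BERT-related
- assorted Japanese tokenizer utilities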

In [1]:
import nlp
import logging
from transformers import BertTokenizer, EncoderDecoderModel, Trainer, TrainingArguments
import torch
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
from transformers.modeling_bert import BertForMaskedLM
import csv
import pandas as pd
from datasets import load_dataset

## Model, Tokenizer
- Set the log level with logging.basicConfig(level=logging.INFO)

In [2]:
logging.basicConfig(level=logging.INFO)

- Tohoku University's pre-trained Japanese BERT model (used for both the tokenizer and the model)

In [3]:
bert_jp_model = 'cl-tohoku/bert-base-japanese-whole-word-masking'
mecab_opts = {"mecab_option": "-r /dev/null -d /usr/local/lib/mecab/dic/ipadic"}

- Both the encoder and the decoder are initialized from the pre-trained "bert_jp_model" weights
- The tokenizer is likewise loaded from "bert_jp_model"

In [4]:
tokenizer = BertJapaneseTokenizer.from_pretrained(bert_jp_model, mecab_kwargs=mecab_opts)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(bert_jp_model, bert_jp_model)

INFO:transformers.tokenization_utils_base:loading file https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/vocab.txt from cache at /home/ats432/.cache/torch/transformers/72ee6ecba54b20bba483760db4f23b836f27a6afda54ede38c488e8514bb3705.5fac9da4d8565963664ed9744688dc7008ff5ec4045f604e9515896f9fe46d9c
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json from cache at /home/ats432/.cache/torch/transformers/c96f5e731b9f4dc2e8263336947ec74b6f93917c0b9db6e9cf974a8a945dd313.986cf1d2960e38dfd1c7218eb101f7d8eb581c487bd3204db98f76c977e21743
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
 

INFO:transformers.configuration_encoder_decoder:Set `config.is_decoder=True` for decoder_config


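- Make the CLS token act as the BOS token. Why? BERT has no dedicated BOS token, and the tokenizer already puts [CLS] at the start of every sequence.
- Make the SEP token act as the EOS token. Why? BERT has no dedicated EOS token, and [SEP] already marks the end of every sequence.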

In [5]:
# CLS token will work as BOS token
tokenizer.bos_token = tokenizer.cls_token

# SEP token will work as EOS token
tokenizer.eos_token = tokenizer.sep_token
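
The toy sketch below mimics the `[CLS] ... [SEP]` layout with made-up token IDs, to show how the decoder can treat `[CLS]` as its start signal and `[SEP]` as its stop signal. `CLS_ID`, `SEP_ID`, `PAD_ID` and both helper functions are illustrative placeholders, not values or calls taken from the real tokenizer.

```python
# Hypothetical special-token IDs, chosen only for illustration.
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0


def wrap_like_bert(token_ids, max_length):
    """Mimic BERT-style encoding: [CLS] tokens [SEP], then pad to max_length."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    wrapped = [CLS_ID] + body + [SEP_ID]
    return wrapped + [PAD_ID] * (max_length - len(wrapped))


def cut_at_eos(generated_ids, eos_id=SEP_ID):
    """Keep everything up to and including the first EOS token."""
    if eos_id in generated_ids:
        return generated_ids[: generated_ids.index(eos_id) + 1]
    return generated_ids


encoded = wrap_like_bert([10, 11, 12], max_length=8)
print(encoded)              # first ID is CLS (acting as BOS), SEP closes the text
print(cut_at_eos(encoded))  # padding after SEP (acting as EOS) is dropped
```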

-------------------------------------------------------------------------------------

## Dataset

In [6]:
train_dataset = nlp.load_dataset('csv', data_files='dataset.csv', split = 'train[100:300]')
val_dataset = nlp.load_dataset('csv', data_files='dataset.csv', split = 'train[:100]')

INFO:nlp.load:Checking /home/ats432/.cache/huggingface/datasets/9350c66cf2ea2a49cf6ce363b88ef736a7c16547eb32867995b14580f63d4094.76d6b941b81a75d9a6c8523c6186cf1c3b96109d52a396a1016839dbea94d4fd.py for additional imports.
INFO:filelock:Lock 47827138107920 acquired on /home/ats432/.cache/huggingface/datasets/9350c66cf2ea2a49cf6ce363b88ef736a7c16547eb32867995b14580f63d4094.76d6b941b81a75d9a6c8523c6186cf1c3b96109d52a396a1016839dbea94d4fd.py.lock
INFO:nlp.load:Found main folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/csv/csv.py at /home/ats432/anaconda3/envs/myenv_torch/lib/python3.7/site-packages/nlp/datasets/csv
INFO:nlp.load:Found specific version folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/csv/csv.py at /home/ats432/anaconda3/envs/myenv_torch/lib/python3.7/site-packages/nlp/datasets/csv/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b
INFO:nlp.load:Found script file from https://s3.amazonaws.com/datas

In [7]:
train_dataset = load_dataset('csv', data_files='dataset_csv.csv', names = ('id', 'highlights', 'article'), split = 'train[100:300]')
val_dataset = load_dataset('csv', data_files='dataset_csv.csv', names = ('id', 'highlights', 'article'), split = 'train[:100]')

Using custom data configuration default
Reusing dataset csv (/home/ats432/.cache/huggingface/datasets/csv/default-3f13e8976bf07429/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Using custom data configuration default
Reusing dataset csv (/home/ats432/.cache/huggingface/datasets/csv/default-3f13e8976bf07429/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)


In [28]:
train_dataset

Dataset(features: {'id': Value(dtype='string', id=None), 'highlights': Value(dtype='string', id=None), 'article': Value(dtype='string', id=None)}, num_rows: 200)

In [7]:
# load rouge for validation
rouge = nlp.load_metric("rouge")

INFO:nlp.load:Checking /home/ats432/.cache/huggingface/datasets/5ecb6e4b474317b41ae1fe5d702d1af8d86d452f0b1d70f77a12f6f014ded6ac.35bc2c477aa456d2f589656477ccb0b463c21cdfb83a9de86d63de8560a96d1b.py for additional imports.
INFO:filelock:Lock 47827605984080 acquired on /home/ats432/.cache/huggingface/datasets/5ecb6e4b474317b41ae1fe5d702d1af8d86d452f0b1d70f77a12f6f014ded6ac.35bc2c477aa456d2f589656477ccb0b463c21cdfb83a9de86d63de8560a96d1b.py.lock
INFO:nlp.load:Found main folder for metric https://s3.amazonaws.com/datasets.huggingface.co/nlp/metrics/rouge/rouge.py at /home/ats432/anaconda3/envs/myenv_torch/lib/python3.7/site-packages/nlp/metrics/rouge
INFO:nlp.load:Found specific version folder for metric https://s3.amazonaws.com/datasets.huggingface.co/nlp/metrics/rouge/rouge.py at /home/ats432/anaconda3/envs/myenv_torch/lib/python3.7/site-packages/nlp/metrics/rouge/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1
INFO:nlp.load:Found script file from https://s3.amazonaws.com

In [10]:
len(train_dataset)

200

In [11]:
len(val_dataset)

100
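
The split strings `train[:100]` and `train[100:300]` read like Python slices over the rows of the CSV. A minimal sketch, with a plain list standing in for the dataset (the `rows` list is a stand-in, not the real data), shows that the two ranges do not overlap, so no validation example leaks into training:

```python
# Stand-in for the rows of the CSV file; only the indices matter here.
rows = list(range(300))

val_rows = rows[:100]        # mirrors split='train[:100]'
train_rows = rows[100:300]   # mirrors split='train[100:300]'

print(len(train_rows), len(val_rows))
print(set(train_rows).isdisjoint(val_rows))
```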

In [12]:
type(train_dataset['article'])

list

In [8]:
# set decoding params
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
# generation settings are read from model.config, so set them there
model.config.early_stopping = True
model.config.length_penalty = 2.0
model.config.num_beams = 4
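
To make `no_repeat_ngram_size = 3` concrete: during generation, any token that would complete a trigram already present in the output is blocked. The function below is a simplified pure-Python illustration of that rule, not the library's implementation; `banned_next_tokens` is a hypothetical helper name.

```python
def banned_next_tokens(generated, ngram_size=3):
    """Return tokens that would repeat an n-gram already in `generated`."""
    assert ngram_size >= 2, "this sketch only handles n-grams of size 2 or more"
    if len(generated) < ngram_size:
        return set()
    # The last (n-1) tokens form the prefix of the n-gram being completed.
    prefix = tuple(generated[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends in (5, 6); the trigram (5, 6, 7) already occurred,
# so emitting 7 next would repeat it.
print(banned_next_tokens([5, 6, 7, 8, 5, 6]))
```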

In [9]:
# map data correctly
def map_to_encoder_decoder_inputs(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512)
    # force summarization <= 128
    outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=128)
    
    
    batch["input_ids"] = inputs.input_ids  # token IDs of the article
    batch["attention_mask"] = inputs.attention_mask  # 1 for real tokens, 0 for padding (encoder side)

    batch["decoder_input_ids"] = outputs.input_ids  # token IDs of the summary
    batch["labels"] = outputs.input_ids.copy()  # copy of the summary IDs, used as labels
    # mask loss for padding
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
    ]  # -100 is the loss function's ignore index, so padded positions do not contribute to the loss

    batch["decoder_attention_mask"] = outputs.attention_mask  # 1 for real tokens, 0 for padding (decoder side)

    # assert <condition>, <message>: the message is shown if the condition is False
    assert all([len(x) == 512 for x in inputs.input_ids]), "inputs must be padded/truncated to 512 tokens"
    assert all([len(x) == 128 for x in outputs.input_ids]), "outputs must be padded/truncated to 128 tokens"

    return batch
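
Replacing pad positions with `-100` works because PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to `-100`. A small pure-Python sketch of the same idea follows; the per-token losses, `PAD_ID`, and the helper `masked_mean_loss` are made up for illustration.

```python
PAD_ID = 0          # hypothetical pad token ID for this toy example
IGNORE_INDEX = -100


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace pad positions so they are excluded from the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]


def masked_mean_loss(per_token_loss, labels):
    """Average the loss over positions whose label is not IGNORE_INDEX."""
    kept = [l for l, y in zip(per_token_loss, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = mask_padding([2, 15, 27, 3, 0, 0])
print(labels)
# Losses at the two padded positions are large but do not affect the mean.
print(masked_mean_loss([1.0, 2.0, 3.0, 2.0, 9.0, 9.0], labels))
```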


In [10]:
def compute_metrics(pred):
    
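    labels_ids = pred.label_ids  # token IDs of the reference summaries
    # argmax over the vocabulary dimension picks the most likely token at each position
    pred_ids = pred.predictions.argmax(-1)  # token IDs of the predicted summaries

    # labels use -100 for padding; restore the pad token so they can be decoded
    # (assumes label_ids is a NumPy array)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id

    # all unnecessary tokens are removed
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)  # decode references, dropping special tokens
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)  # decode predictions, dropping special tokens

    # ROUGE-2 scores bigram overlap between predictions and references;
    # .mid is the middle value of the aggregated (low, mid, high) score
    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid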

    return {
        "rouge2_precision": round(rouge_output.precision, 4),  # precision
        "rouge2_recall": round(rouge_output.recall, 4),  # recall
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),  # F-measure
    }
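
ROUGE-2 measures bigram overlap between a prediction and its reference. The sketch below computes precision, recall and F-measure for one pair of already-tokenized sequences. It is a simplified stand-in for what the `rouge` metric does (no tokenization rules, stemming or bootstrap aggregation), and `rouge2` is a hypothetical helper name.

```python
from collections import Counter


def bigrams(tokens):
    """Count the bigrams in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, fmeasure) of bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


pred_tokens = ["the", "cat", "sat", "down"]
ref_tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(rouge2(pred_tokens, ref_tokens))
```

Here two of the three predicted bigrams appear among the five reference bigrams, so precision is 2/3 and recall is 2/5.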

In [11]:
# set batch size here
batch_size = 16

# make train dataset ready
train_dataset = train_dataset.map(
    map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["highlights", "article"],
)
train_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

# same for validation dataset
val_dataset = val_dataset.map(
    map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
val_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)


INFO:nlp.arrow_dataset:Loading cached processed dataset at /home/ats432/.cache/huggingface/datasets/csv/default-57cde2e1803d3953/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b/cache-b6ea05ce863d1dfdd212aaf38a4ad596.arrow
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
INFO:nlp.arrow_dataset:Loading cached processed dataset at /home/ats432/.cache/huggingface/datasets/csv/default-57cde2e1803d3953/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b/cache-8d1d451896b57d0ac289316d6aed3621.arrow
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
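
With `batched=True`, the mapping function receives a dict of column lists (up to `batch_size` rows at a time) rather than a single row. The following is a toy re-creation of that calling convention over plain Python lists. `batched_map` and `add_length` are hypothetical helpers, not the real `Dataset.map`.

```python
def batched_map(columns, fn, batch_size):
    """Apply fn to slices of a column dict and concatenate the results."""
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for key, values in fn(batch).items():
            out.setdefault(key, []).extend(values)
    return out


def add_length(batch):
    # Each value in `batch` is a list with one entry per row in the slice.
    batch["n_chars"] = [len(text) for text in batch["article"]]
    return batch


data = {"article": ["aa", "bbbb", "c", "dddddd", "ee"]}
result = batched_map(data, add_length, batch_size=2)
print(result["n_chars"])
```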


In [12]:
train_dataset

Dataset(features: {'id': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'decoder_input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'decoder_attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 200)

In [13]:
# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    # predict_from_generate=True,
    evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    logging_steps=10,
    save_steps=1000,
    # eval_steps=1000,
    overwrite_output_dir=True,
    warmup_steps=2000,
    save_total_limit=10,
)
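
With `warmup_steps=2000` and a run that lasts only a few dozen steps, the learning rate never leaves the warm-up ramp. Assuming the default peak learning rate of `5e-5` and a linear ramp from zero, the rate at a given step works out as below (`warmup_lr` is a hypothetical helper that covers only the ramp, not the decay that follows it). The values for steps 10, 20 and 30 agree with the `learning_rate` entries in the training log.

```python
def warmup_lr(step, peak_lr=5e-5, warmup_steps=2000):
    """Linear warm-up rate; only valid while step <= warmup_steps."""
    return peak_lr * step / warmup_steps


for step in (10, 20, 30):
    print(step, warmup_lr(step))
```

Lowering `warmup_steps` to a fraction of the total step count would let the optimizer reach a useful learning rate on a dataset this small.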

In [14]:
# instantiate trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

INFO:transformers.training_args:PyTorch: setting up devices
INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


In [15]:
# start training
trainer.train()

INFO:transformers.trainer:***** Running training *****
INFO:transformers.trainer:  Num examples = 200
INFO:transformers.trainer:  Num Epochs = 3
INFO:transformers.trainer:  Instantaneous batch size per device = 16
INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 16
INFO:transformers.trainer:  Gradient Accumulation steps = 1
INFO:transformers.trainer:  Total optimization steps = 39


INFO:transformers.trainer:{'loss': 11.589942264556885, 'learning_rate': 2.5000000000000004e-07, 'epoch': 0.7692307692307693, 'step': 10}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 100
INFO:transformers.trainer:  Batch size = 16


INFO:nlp.arrow_writer:Done writing 100 examples in 132717 bytes /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.
INFO:filelock:Lock 47827606114448 released on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:filelock:Lock 47827606115024 acquired on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:filelock:Lock 47827606115024 released on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects for no columns  (when key is int or slice) and don't output other (un-formated) columns.
INFO:transformers.trainer:{'eval_loss':




INFO:transformers.trainer:{'loss': 11.237214756011962, 'learning_rate': 5.000000000000001e-07, 'epoch': 1.5384615384615383, 'step': 20}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 100
INFO:transformers.trainer:  Batch size = 16


INFO:nlp.arrow_writer:Done writing 100 examples in 131775 bytes /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.
INFO:filelock:Lock 47827665238928 acquired on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:filelock:Lock 47827665238928 released on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects for no columns  (when key is int or slice) and don't output other (un-formated) columns.
INFO:transformers.trainer:{'eval_loss': 11.307438169206891, 'eval_rouge2_precision': 0.011, 'eval_rouge2_recall': 0.0014, 'eval_rouge2_fmeasure': 0.0024, 'epoch': 1.5384615384615383, 'step': 20}





INFO:transformers.trainer:{'loss': 10.729289245605468, 'learning_rate': 7.5e-07, 'epoch': 2.3076923076923075, 'step': 30}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 100
INFO:transformers.trainer:  Batch size = 16


INFO:nlp.arrow_writer:Done writing 100 examples in 130003 bytes /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.
INFO:filelock:Lock 47827678072144 acquired on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:filelock:Lock 47827678072144 released on /home/ats432/.cache/huggingface/metrics/rouge/default/1.0.0/06783dbed5f6b6a5413f84d2a5f0d9dc9cb871f1aeb3787f2c90a8e3fe60b1c1/cache-rouge-0.arrow.lock
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects for no columns  (when key is int or slice) and don't output other (un-formated) columns.
INFO:transformers.trainer:{'eval_loss': 10.283991268702916, 'eval_rouge2_precision': 0.0136, 'eval_rouge2_recall': 0.0014, 'eval_rouge2_fmeasure': 0.0025, 'epoch': 2.3076923076923075, 'step': 30}
INFO:transformers.trainer:

Training co





TrainOutput(global_step=39, training_loss=10.932890329605494)
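
The counters in the log follow from the dataset and batch sizes. A quick check, assuming one optimizer step per batch (gradient accumulation of 1, as logged):

```python
import math

num_train, num_val = 200, 100
batch_size, num_epochs = 16, 3

steps_per_epoch = math.ceil(num_train / batch_size)  # batches per epoch
total_steps = steps_per_epoch * num_epochs
eval_batches = math.ceil(num_val / batch_size)

print(steps_per_epoch, total_steps, eval_batches)
print(10 / steps_per_epoch)  # fractional epoch reached at step 10
```

This reproduces the 39 total optimization steps and the fractional epoch of about 0.77 reported at step 10.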