<a href="https://colab.research.google.com/github/Exe-dev/M1-Pytorch-Tutorial/blob/main/Pytorch_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Tutorial

This tutorial covers the following two topics.


1.   Fine-tuning an entailment classification model with BERT
2.   Fine-tuning an entailment sentence generation model with BERT2BERT



## Install libraries

In [15]:
!pip install transformers
!pip install datasets
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Imports

In [16]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer, 
    TrainingArguments,
    EncoderDecoderModel,
    Seq2SeqTrainer,     
    Seq2SeqTrainingArguments
) 
import transformers
import torch
from tqdm import tqdm
from datasets import load_dataset
import random
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import pandas as pd
# Avoid load model warnings
import logging
transformers.tokenization_utils.logger.setLevel(logging.ERROR)
transformers.configuration_utils.logger.setLevel(logging.ERROR)
transformers.modeling_utils.logger.setLevel(logging.ERROR)

# GPU Setup

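+ Check whether a GPU is available with torch.cuda.is_available().
+ If you tend to forget to set CUDA_VISIBLE_DEVICES on the command line, you can set it through os.environ instead.
+ The datasets cache location can be set with HF_DATASETS_CACHE. Setting it explicitly is recommended.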
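
The environment-variable approach from the list above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `configure_environment` and its default values are hypothetical. It uses `setdefault` so that a value already exported on the command line is kept. These variables are typically read when CUDA is initialised or when `datasets` is imported, so the call would need to run before those imports.

```python
import os


def configure_environment(gpu_ids="0", datasets_cache="./hf_datasets_cache"):
    """Set GPU and cache variables only if they are not already defined.

    Both default values are placeholders chosen for this sketch.
    """
    # setdefault leaves a value exported on the command line untouched
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_ids)
    os.environ.setdefault("HF_DATASETS_CACHE", datasets_cache)
    return {
        "CUDA_VISIBLE_DEVICES": os.environ["CUDA_VISIBLE_DEVICES"],
        "HF_DATASETS_CACHE": os.environ["HF_DATASETS_CACHE"],
    }


settings = configure_environment()
print(settings)
```

Calling the helper at the very top of the script keeps the command-line form `CUDA_VISIBLE_DEVICES=... python script.py` authoritative while still providing a fallback.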


In [17]:
""" optional settings
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ['TRANSFORMERS_CACHE'] = "./"
os.environ['HF_DATASETS_CACHE'] = "./"
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"
"""
CUDA_AVAILABLE = False
if torch.cuda.is_available():
    CUDA_AVAILABLE = True
    print("CUDA IS AVAILABLE")
else:
    print("CUDA NOT AVAILABLE")
#device = torch.device('cpu')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

CUDA IS AVAILABLE


# 分類モデル

Build a classifier for natural language inference (NLI).

In [23]:
def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], padding="max_length", truncation=True, max_length=256)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
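
One property of `average='micro'` in the function above is worth illustrating: for single-label multi-class problems, micro-averaged precision, recall and F1 all coincide with accuracy. The sketch below recomputes them in plain Python so the arithmetic is visible. The helper names are hypothetical, and the ten label/prediction values are the ones printed in the "logits to labels" cell of this notebook.

```python
def micro_precision_recall_f1(labels, preds):
    """Micro-average: pool true/false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for cls in sorted(set(labels) | set(preds)):
        for y, p in zip(labels, preds):
            if p == cls and y == cls:
                tp += 1
            elif p == cls and y != cls:
                fp += 1
            elif p != cls and y == cls:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(labels, preds):
    """Fraction of positions where the prediction equals the label."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


labels = [1, 2, 0, 2, 2, 2, 2, 1, 2, 1]
preds = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(micro_precision_recall_f1(labels, preds), accuracy(labels, preds))
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the four numbers agree. If per-class behaviour matters, `average='macro'` or `average=None` would be more informative.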

## Loading model and tokenizer

In [19]:
BATCH_SIZE = 8
MAX_LENGTH = 128
NUM_EPOCHS = 2

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

## Load MNLI datasets and preprocessing

In [20]:
raw_datasets = load_dataset("multi_nli")
tokenized_datasets = raw_datasets.map(tokenize, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(
    ['promptID','pairID', 'premise','premise_binary_parse','premise_parse', 'hypothesis','hypothesis_binary_parse', 'hypothesis_parse','genre']
)
# Use a small subset because the full dataset takes a long time
train_dataset = tokenized_datasets["train"].select(range(400))
test_dataset = tokenized_datasets["validation_matched"].select(range(50))
train_dataset, test_dataset

Using custom data configuration default
Reusing dataset multi_nli (/root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39)


Loading cached processed dataset at /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-57647e8e37010e48.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-07aa51a9a93d911b.arrow


(Dataset({
     features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 400
 }), Dataset({
     features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 50
 }))
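
The three features listed above (`input_ids`, `token_type_ids`, `attention_mask`) come from encoding each premise/hypothesis pair as one sequence. The toy sketch below shows the layout only. It works on whitespace-style tokens instead of real WordPiece ids, the function name `layout_sentence_pair` is hypothetical, and truncation is deliberately not modelled.

```python
def layout_sentence_pair(premise_tokens, hypothesis_tokens, max_length):
    """Lay out a sentence pair as [CLS] premise [SEP] hypothesis [SEP] plus padding."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("truncation is not modelled in this sketch")
    # Segment 0 covers [CLS] + premise + [SEP]; segment 1 covers hypothesis + [SEP]
    token_type_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }


example = layout_sentence_pair(["a", "man", "sleeps"], ["someone", "rests"], max_length=10)
for name, values in example.items():
    print(name, values)
```

The attention mask is what tells the model to ignore the padding positions, so it should normally be passed to the model together with the input ids.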

## Train

Configure training with TrainingArguments, then train the model with Trainer.train().

In [24]:
model.train()
training_args = TrainingArguments(
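    output_dir="./",          # output directory
    num_train_epochs=NUM_EPOCHS,              # number of epochs
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    do_train=True,
    save_steps=100,
    eval_steps=50,            # only used when evaluation_strategy="steps" is set
    warmup_steps=1000,        # longer than the 100 optimization steps of this run
    weight_decay=0.01,
    #evaluate_during_training=True,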
    logging_dir='./outputs/models/logs'
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()
trainer.evaluate()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 400
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 100


Step,Training Loss


Saving model checkpoint to ./checkpoint-100


Training completed. Do not forget to share your model on huggingface.co/models =)


***** Running Evaluation *****
  Num examples = 50
  Batch size = 8


{'epoch': 2.0,
 'eval_accuracy': 0.34,
 'eval_f1': 0.34,
 'eval_loss': 1.1039652824401855,
 'eval_precision': 0.34,
 'eval_recall': 0.34,
 'eval_runtime': 0.8177,
 'eval_samples_per_second': 61.144,
 'eval_steps_per_second': 8.56}
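
The log above reports 100 total optimization steps, while the arguments request `warmup_steps=1000`. The sketch below reproduces the step count and estimates how far the learning rate gets. It assumes a single device, no gradient accumulation, a linear warmup followed by linear decay, and a base learning rate of 5e-5. The last two are assumptions about the Trainer defaults, not values stated in this notebook, and the function names are hypothetical.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_schedule_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BASE_LR = 5e-5  # assumed default learning rate
total = total_optimization_steps(num_examples=400, batch_size=8, num_epochs=2)
peak_multiplier = linear_schedule_multiplier(total, warmup_steps=1000, total_steps=total)
print(total, peak_multiplier, BASE_LR * peak_multiplier)
```

Under these assumptions the learning rate only reaches a tenth of its nominal value before training stops, which may contribute to the near-chance accuracy of 0.34. Choosing a warmup shorter than the total number of steps would be the usual remedy.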

## Load the fine-tuned model

In [25]:
# Load the checkpoint first so that classification_model exists before it is used below
classification_model = AutoModelForSequenceClassification.from_pretrained("./checkpoint-100")
classification_model.config.max_length = 256

## Predict with the fine-tuned model

In [26]:
# Only input_ids are passed here, so padding positions are not masked out
pred = classification_model(torch.tensor(test_dataset["input_ids"][0:10]))
pred

SequenceClassifierOutput([('logits', tensor([[-0.0239,  0.0068, -0.3391],
                                   [ 0.0102, -0.0229, -0.2956],
                                   [ 0.0431,  0.0213, -0.2269],
                                   [ 0.0300, -0.0063, -0.2523],
                                   [ 0.0370, -0.0021, -0.2569],
                                   [ 0.0132, -0.0017, -0.2976],
                                   [ 0.0859,  0.0508, -0.1927],
                                   [-0.0331, -0.0096, -0.3775],
                                   [ 0.0159,  0.0032, -0.3178],
                                   [ 0.0517,  0.0275, -0.2442]], grad_fn=<AddmmBackward0>))])

## logits to labels

In [27]:
pred = pred[0].detach().numpy().tolist()
pred = [*map(lambda x: x.index(max(x)), pred)]
pred, test_dataset["label"][0:10]

([1, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 2, 0, 2, 2, 2, 2, 1, 2, 1])
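
The conversion above picks, for each row of logits, the index of the largest value. The plain-Python sketch below does the same for the first three rows of the logits printed earlier and also turns them into probabilities with a softmax. The order of `LABEL_NAMES` is an assumption about how the MNLI label ids are defined, and the helper names are hypothetical. With tensors, `torch.argmax(logits, dim=-1)` would give the same indices.

```python
import math

LABEL_NAMES = ["entailment", "neutral", "contradiction"]  # assumed id order


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


logits = [
    [-0.0239, 0.0068, -0.3391],
    [0.0102, -0.0229, -0.2956],
    [0.0431, 0.0213, -0.2269],
]
predicted_ids = [argmax(row) for row in logits]
predicted_names = [LABEL_NAMES[i] for i in predicted_ids]
probabilities = [softmax(row) for row in logits]
print(predicted_ids, predicted_names)
```

The probabilities are all close to one third, which is consistent with a model that has barely moved away from its random initial classification head.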

# Generation model

Build a model that generates a sentence standing in an entailment relation to the input sentence.

In [28]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

## load tokenizer and Encoder Decoder Model

In [29]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
generation_model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints

loading file https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4
loading file https://huggingface.co/bert-base-uncased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-uncased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a

## Data preprocessing

+ Extract only the examples with the entailment label from the dataset
+ Remove unused columns
+ Rename the dataset columns
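
The three steps above can be sketched on plain dictionaries. This is an illustration under assumptions: the rows are invented, the helper name `to_generation_pairs` is hypothetical, and the id order in `LABEL_NAMES` (0 = entailment, 1 = neutral, 2 = contradiction) is how the MNLI labels are assumed to be defined. Selecting the class by name instead of by a hard-coded integer makes it harder to pick the wrong one.

```python
LABEL_NAMES = ["entailment", "neutral", "contradiction"]  # assumed id order


def to_generation_pairs(rows, label_name="entailment"):
    """Keep rows with the requested label and rename hypothesis/premise columns."""
    target_id = LABEL_NAMES.index(label_name)
    return [
        # Only the two renamed columns are kept; everything else is dropped
        {"input": row["hypothesis"], "label": row["premise"]}
        for row in rows
        if row["label"] == target_id
    ]


# Hypothetical rows, invented for this sketch
rows = [
    {"premise": "A man is sleeping on a bench.", "hypothesis": "A person is resting.", "label": 0, "genre": "fiction"},
    {"premise": "A man is sleeping on a bench.", "hypothesis": "The man is homeless.", "label": 1, "genre": "fiction"},
    {"premise": "A man is sleeping on a bench.", "hypothesis": "The man is running.", "label": 2, "genre": "fiction"},
]
pairs = to_generation_pairs(rows)
print(pairs)
```

With a loaded `datasets` object, the label names can usually be read from the `label` feature rather than hard-coded, which would be the safer way to confirm the id order.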

In [30]:
raw_datasets = load_dataset("multi_nli")
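# NOTE: in the Hugging Face multi_nli dataset the label ids are 0=entailment, 1=neutral, 2=contradiction,
# so label==1 as run here selects neutral pairs; use 0 to select the entailment pairs.
generation_datasets = raw_datasets.filter(lambda x:x["label"]==1)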
generation_datasets = generation_datasets.remove_columns(
    ["promptID","pairID","premise_binary_parse","premise_parse","hypothesis_binary_parse", "hypothesis_parse","genre", "label"]
)
generation_datasets = generation_datasets.rename_column("hypothesis", "input")
generation_datasets = generation_datasets.rename_column("premise", "label")
generation_datasets["train"][0]

Using custom data configuration default
Reusing dataset multi_nli (/root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39)


{'input': 'Product and geography are what make cream skimming work. ',
 'label': 'Conceptually cream skimming has two basic dimensions - product and geography.'}

## Unlike the classification model, build the training data with pandas

+ Training would not finish in a reasonable time on the full dataset, so randomly sample 100 examples

In [31]:
df_dataset = pd.DataFrame({
    "inputs":generation_datasets["train"]["input"],
    "label":generation_datasets["train"]["label"]
})
df_dataset = df_dataset.sample(100).reset_index(drop=True)
df_dataset.head(1)

|   | inputs | label |
|---|---|---|
| 0 | Stevens was a talkative guy, and many couldn't... | You Stevens shut your trap! Muller's roar brou... |


## Encode the input texts and the output labels (sentences) separately to create the training and evaluation data

In [32]:
inputs = tokenizer.batch_encode_plus(
    df_dataset["inputs"].tolist(),
    return_tensors="pt", 
    add_special_tokens=False,
    truncation=True,
    padding="max_length",
    max_length=256
    )
labels = tokenizer.batch_encode_plus(
    df_dataset["label"].tolist(),
    return_tensors="pt", 
    add_special_tokens=True,
    truncation=True,
    padding="max_length",
    max_length=256
    )
train_data = []
for i in range(len(inputs["input_ids"])):
    train_data.append(
        {
            "input_ids":inputs["input_ids"][i],
            "token_type_ids":inputs["token_type_ids"][i],
            "attention_mask":inputs["attention_mask"][i],
            "label":labels["input_ids"][i] 
        }
    )
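random.shuffle(train_data)
train_size = int(len(train_data)*0.98)
# NOTE: train_data is not shortened here, so the evaluation examples are also part of the training set
eval_data = train_data[train_size:]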
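
In the cell above the evaluation examples are sliced from `train_data`, but `train_data` itself keeps all examples. A disjoint split could look like the sketch below. The function name `split_train_eval` and the fixed seed are choices made for this illustration, and integers stand in for the encoded examples.

```python
import random


def split_train_eval(examples, train_fraction=0.98, seed=0):
    """Shuffle a copy of the examples and cut it into disjoint train and eval parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    train_size = int(len(shuffled) * train_fraction)
    return shuffled[:train_size], shuffled[train_size:]


# Integers stand in for the 100 encoded examples
train_part, eval_part = split_train_eval(range(100))
print(len(train_part), len(eval_part))
```

Passing `train_part` and `eval_part` to the trainer instead of the full list would keep the validation loss from being measured on sentences the model has already seen.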

## model configuration

In [33]:
generation_model.config.decoder_start_token_id = tokenizer.cls_token_id
generation_model.config.eos_token_id = tokenizer.sep_token_id
generation_model.config.pad_token_id = tokenizer.pad_token_id
# sensible parameters for beam search
generation_model.config.vocab_size = generation_model.config.decoder.vocab_size
generation_model.config.max_length = 100
generation_model.config.min_length = 20
generation_model.config.no_repeat_ngram_size = 1
generation_model.config.early_stopping = True
generation_model.config.length_penalty = 2.0
generation_model.config.num_beams = 20


## Training

+ Use Seq2SeqTrainer for the generation model
+ Writing the backward() step yourself also works
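
For the second option in the list above, a hand-written loop follows the usual pattern: zero the gradients, run the forward pass, compute the loss, call `backward()`, and step the optimizer. The sketch below shows that pattern on a tiny linear classifier with random data, which is a hypothetical stand-in for the real model and dataset. The learning rate and step count are arbitrary choices for the illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the model and data; not the BERT2BERT setup
model = nn.Linear(4, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()
features = torch.randn(32, 4)
targets = torch.randint(0, 3, (32,))


def train_steps(num_steps):
    """Run a plain training loop and return the loss recorded at each step."""
    losses = []
    model.train()
    for _ in range(num_steps):
        optimizer.zero_grad()            # clear gradients from the previous step
        logits = model(features)         # forward pass
        loss = loss_fn(logits, targets)
        loss.backward()                  # compute gradients
        optimizer.step()                 # update parameters
        losses.append(loss.item())
    return losses


losses = train_steps(50)
print(losses[0], losses[-1])
```

For a Hugging Face model, the forward pass would typically take the batch tensors (including labels) and return an output object whose `loss` field plays the role of `loss` here.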

In [34]:
# Train Param
batch_size = 8
generation_model.train()
# https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=10,
    save_steps=30,
    eval_steps=5000,
    warmup_steps=1000,
    overwrite_output_dir=True,
    save_total_limit=5,
    fp16=False,
    num_train_epochs=3,
    no_cuda=not CUDA_AVAILABLE
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=generation_model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 100
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 39
The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids. If token_type_ids are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.


Step,Training Loss,Validation Loss


Saving model checkpoint to ./checkpoint-30
tokenizer config file saved in ./checkpoint-30/tokenizer_config.json
Special tokens file saved in ./checkpoint-30/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=39, training_loss=10.61686765230619, metrics={'train_runtime': 46.3326, 'train_samples_per_second': 6.475, 'train_steps_per_second': 0.842, 'total_flos': 92018115072000.0, 'train_loss': 10.61686765230619, 'epoch': 3.0})

## Load the trained model

In [35]:
# from_pretrained is a classmethod, so call it on the class rather than on the instance
created_model = EncoderDecoderModel.from_pretrained("./checkpoint-30")

## Generate entailment sentence

In [36]:
tokenized = tokenizer(df_dataset["inputs"][0], return_tensors="pt", truncation=True, padding=True, max_length=256)
pred = created_model.generate(tokenized["input_ids"])
pred

tensor([[  101,  1996,  1012,  1025,   999,  1010,  1585, 30112, 30114,  1584,
          1586, 30111, 27876,  1583,  1587, 30132, 30130, 30129, 30131,  1141,
          1536, 25292,  1592, 30113,  1591, 19174,  1064,  4414,  1621, 17928,
          3031,  1588, 28637,  8778,  1607, 11916, 20955,  2004,  3022,  2133,
         18880, 16302, 13811, 27362,  2000,  2830,  8848,  2091,  2067,  2101,
          2083,  2627, 11165, 24288, 29053, 29051,  1998,  5685, 26379,  2664,
         16808,  5743, 15834,  7652, 19442, 25430, 13366,  1510,  4125,  2368,
         24333, 12942,  2046, 10359, 22625, 25693, 17741,  3413,  5235,  4084,
         10024, 22953, 26864, 11563,  4063, 15454,  5441,  2663,  2062, 22302,
          5963,  3553, 20755, 13806, 13776,  2721, 10278,  7367,  2061,  5879]])

## Decode predicted tensors

In [37]:
df_dataset["inputs"][0], tokenizer.decode(pred[0], skip_special_tokens=True)

("Stevens was a talkative guy, and many couldn't stand him.",
 'the. ;!, →↑↓ ↑ ↓←missive ← ↔∪∨∧∩ ʲ ⁰ vis ∂→ ⇒ nanny | respectively ☆ api onto ↦gree protocol ≡ency hartley asas... pasadenacare dominancerath to forward backward down back later through past successive successively travers hays andandndt yet moor forth onwardwardbeat sw def ᵥ riseenfastlum intolike fidelity gorman mcmahon pass passes stepsbra bro barnetnderder reject maintain win more hotterback closeribelatelatedlalam se so norman')

# Option

+ Command to convert "pytorch_tutorial.ipynb" into "pytorch_tutorial.py"

```jupyter nbconvert --to script pytorch_tutorial.ipynb```

+ How to run with specific GPUs (setting os.environ also works)
+ Trainer in particular uses every visible GPU, so you need to specify which ones to use.
+ To run on GPUs 0 through 2:

```CUDA_VISIBLE_DEVICES=0,1,2 python pytorch_tutorial.py ```