<a href="https://colab.research.google.com/github/Takumi173/JPMA2022TF1-1/blob/main/JPMA_2022_TF_1_1_demo_(3)BERT%E5%88%86%E9%A1%9E%E3%83%A2%E3%83%87%E3%83%AB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

## Connecting Google Drive

In [1]:
# Mount Google Drive so that data can be passed between notebooks
from google.colab import drive
drive.mount('/content/drive')

# Directory where the data is stored
datadir = '/content/drive/MyDrive/datadir/'

Mounted at /content/drive


## Loading the data

In [2]:
# Load the pre-segmented (wakati) text and the vectorized data
import pickle

with open(datadir + 'datadic.pkl', 'rb') as f:
  datadic = pickle.load(f)
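
The later cells read `datadic['wakati']` (segmented text) and `datadic['flg']` (labels). As a minimal sketch of that layout, the block below writes and reloads a hypothetical stand-in for `datadic.pkl`. The sample values are invented for illustration, and the real file may contain further keys. Note that `pickle.load` should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for datadic.pkl; only the 'wakati' and 'flg' keys
# come from the notebook, the values are invented for illustration.
sample = {
    'wakati': ['抗体 検査 明らか', '朝食 後 0 日間'],
    'flg': [1, 0],
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'datadic.pkl')
    with open(path, 'wb') as f:
        pickle.dump(sample, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# Basic sanity checks before training: both keys exist and have equal length
assert {'wakati', 'flg'} <= set(loaded)
assert len(loaded['wakati']) == len(loaded['flg'])
print(len(loaded['wakati']), 'records loaded')
```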

## pip install

In [3]:
!pip install transformers
!pip install mecab-python3 fugashi 
!pip install jaconv neologdn
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mecab-python3
  Downloading mecab_python3-1.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (577 kB)

## Data loading and environment setup

In [4]:
import jaconv
import unicodedata
import neologdn
import re
import torch

# Device selection: use the GPU when available, otherwise the CPU
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cuda:0


In [5]:
# Set up MeCab and NEologd
!apt install mecab libmecab-dev mecab-ipadic-utf8 file
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
!mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -a -y

# Point the MECABRC environment variable at the mecabrc file
import os
os.environ['MECABRC'] = "/etc/mecabrc"

# Get the directory where NEologd was installed
import subprocess
cmd = 'echo `mecab-config --dicdir`"/mecab-ipadic-neologd"'
neologd_dic_dir_path = subprocess.check_output(cmd, shell=True).decode('utf-8').strip()

# Download and configure the MANBYO dictionary (万病辞書)
!wget http://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic
manbyo_dic_path = 'MANBYO_201907_Dic-utf8.dic'

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libmagic-mgc libmagic1 libmecab2 mecab-ipadic mecab-jumandic
  mecab-jumandic-utf8 mecab-utils
The following NEW packages will be installed:
  file libmagic-mgc libmagic1 libmecab-dev libmecab2 mecab mecab-ipadic
  mecab-ipadic-utf8 mecab-jumandic mecab-jumandic-utf8 mecab-utils
0 upgraded, 11 newly installed, 0 to remove and 20 not upgraded.
Need to get 29.3 MB of archives.
After this operation, 282 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.4 [184 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic1 amd64 1:5.32-2ubuntu0.4 [68.6 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-up

In [6]:
# Specify the BERT model name (Tohoku University model; license: CC BY-SA 3.0)
# To use another pretrained model (e.g. UTHBERT or MEDBERTjp), specify the directory where it was extracted
model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'

# MeCab parameters for the Japanese tokenizer (not needed for fine-tuning, because the training data is already segmented)
MeCabDic = {"mecab_dic": None, "mecab_option": "-d " + neologd_dic_dir_path + " -u " + manbyo_dic_path}

# When using another pretrained model, check its maximum sequence length (UTHBERT: 512, MEDBERTjp: 128)
MaxSeqLen = 512

# Preparing for training

## Setting the training parameters

In [7]:
# Training parameters
# Batch size
BATCH_SIZE = 32

# Learning rate
LEAENING_RATE = 1e-6

# Number of epochs
N_EPOCHS = 10

## Configuring the tokenizer

In [8]:
from transformers import BertJapaneseTokenizer

# Set up the tokenizer
tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name,
    word_tokenizer_type = "mecab",
    mecab_kwargs = MeCabDic
    )


## Tokenizing the training data

In [9]:
# Split the data with random seed 0
# Fine-tuning is run on the resulting training split
from sklearn.model_selection import train_test_split
text, X_test, label, y_test = train_test_split(datadic['wakati'], datadic['flg'], test_size=0.15, random_state=0)


# Maximum sequence length supported by the model
# Note that tokens beyond this length are truncated
model_seq_len = MaxSeqLen

# Find the largest token count in the training data
max_tk = 0
max_idx = 0
for i, chktoken in enumerate(text):
  tk = tokenizer.tokenize(chktoken)
  if len(tk) > max_tk:
    max_tk = len(tk)
    max_idx = i   # index of the longest text (avoids shadowing the built-in id)

# Use the longest training text plus [CLS] and [SEP] as the length, capped at the model limit
max_len = min(max_tk + 2, model_seq_len)

# Check the longest text
tokchk = tokenizer.encode_plus(
    text[max_idx],
    add_special_tokens = True,        # add the special tokens
    truncation = True,                # truncate inputs longer than max_length
    max_length = max_len,             # longest input length, capped at the model limit
    padding = 'max_length',           # pad up to max_length
    return_overflowing_tokens = True  # also return the truncated tokens
    )

print("Max token count:", max_tk)
print("*** Text with the maximum token count ***")
print("  ", text[max_idx])
print("*** Text fed to BERT ***")
print("  ", tokenizer.decode(tokchk['input_ids']))
print("*** Truncated text ***")
print("  ", tokenizer.decode(tokchk['overflowing_tokens']))


Max token count: 105
*** Text with the maximum token count ***
   ラフチジン 錠 0m g サワイ 0 錠 0 朝食 夕食 後 0 日間 ベイスン OD 錠 0 0 0 0m g 0 錠 0 朝食 直前 0 日間 フェブリク 錠 0m g 0 錠 0 朝食 後 0 日間 アダラート CR 錠 0m g 0 錠 0 夕食 後 0 日間 スピロノラクトン 錠 0m g TOWA 0 0 錠 ラシックス 錠 0m g 0 錠 0 朝食 後 0 日間 ノイエル カプセル 0m g 0 カプセル 0 朝食 夕食 後 0 日間
*** Text fed to BERT ***
   [CLS] ラフチジン 錠 0m g サワイ 0 錠 0 朝食 夕食 後 0 日間 ベイスン OD 錠 0 0 0 0m g 0 錠 0 朝食 直前 0 日間 フェブリク 錠 0m g 0 錠 0 朝食 後 0 日間 アダラート CR 錠 0m g 0 錠 0 夕食 後 0 日間 スピロノラクトン 錠 0m g TOWA 0 0 錠 ラシックス 錠 0m g 0 錠 0 朝食 後 0 日間 ノイエル カプセル 0m g 0 カプセル 0 朝食 夕食 後 0 日間 [SEP]
*** Truncated text ***
   


In [10]:


# Tokenization
# Collect the token IDs and attention masks needed by the model
token_ids = []
attention_masks = []

for t in text:
  tknzd = tokenizer.encode_plus(
      t,
      add_special_tokens = True,        # add the special tokens
      truncation = True,                # truncate inputs longer than max_length
      max_length = max_len,             # longest input length, capped at the model limit
      padding = 'max_length'            # pad up to max_length
      )
  token_ids.append(tknzd['input_ids'])
  attention_masks.append(tknzd['attention_mask'])

# Convert to tensors
token_ids_t = torch.tensor(token_ids)
attention_masks_t = torch.tensor(attention_masks)
labels_t = torch.tensor(label)

# Check the converted result
x = 0
print(tokenizer.tokenize(text[x]))
print(token_ids_t[x])
print(attention_masks_t[x])
print(labels_t[x])

['抗体', '##検', '##査', '明らか', 'だ', '抗体', '##価', '上昇', '認める', 'いる', 'ない']
tensor([    2, 14744, 29192, 29037,  2275,    75, 14744, 29120,  4312,  7044,
           33,    80,     3,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 
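
The output above shows how each text becomes a fixed-length ID sequence: `[CLS]` (ID 2), the word pieces, `[SEP]` (ID 3), then `[PAD]` (ID 0), with an attention mask of 1 for real tokens and 0 for padding. The sketch below reproduces that layout in plain Python. `pad_and_mask` is a hypothetical helper written for illustration, and its truncation rule is a simplified stand-in for what the tokenizer does internally.

```python
def pad_and_mask(token_ids, max_len, cls_id=2, sep_id=3, pad_id=0):
    """Add [CLS]/[SEP], truncate to max_len, pad, and build the attention mask."""
    body = token_ids[:max_len - 2]            # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)                     # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

# The 11 word-piece IDs of the first training sample, copied from the output above
sample = [14744, 29192, 29037, 2275, 75, 14744, 29120, 4312, 7044, 33, 80]

# Same rule as the notebook: longest text + 2 special tokens, capped at the model limit
max_tk, model_seq_len = 105, 512
max_len = min(max_tk + 2, model_seq_len)

ids, mask = pad_and_mask(sample, max_len)
print(len(ids), sum(mask))   # sequence length and number of attended tokens
```

With `max_tk = 105` the sequence length is 107, and the sample's 11 word pieces plus two special tokens give 13 attended positions, matching the tensors printed above.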

## Creating the datasets and data loaders

In [11]:
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.utils.data.dataset import Subset
from sklearn.model_selection import StratifiedKFold

# Wrap all training data in a dataset
dataset = TensorDataset(token_ids_t, attention_masks_t, labels_t)

# Stratified k-fold
k = 5
kf = StratifiedKFold(n_splits=k)

train_sets={}
valid_sets={}
for _fold, (train_index, valid_index) in enumerate(kf.split(dataset,dataset[:][-1])):
  train_dataset = Subset(dataset, train_index)
  valid_dataset = Subset(dataset, valid_index)

  train_dataloader = DataLoader(
      train_dataset,
      batch_size = BATCH_SIZE,
      shuffle = True,     # draw samples in random order
      drop_last = True    # drop the last batch if it is smaller than BATCH_SIZE
      )
  valid_dataloader = DataLoader(
      valid_dataset,
      batch_size = BATCH_SIZE,
      shuffle = False,
      drop_last = False
      )
  
  tname = "train_" + str(_fold)
  vname = "valid_" + str(_fold)
  train_sets[tname] = train_dataloader
  valid_sets[vname] = valid_dataloader

  print('*** Fold ', _fold, '***')
  print('Training samples:', len(train_dataset))
  print('Validation samples:', len(valid_dataset))
  print('Sum of Pos in Val: ', sum(valid_dataset[:][-1]), '\n')


*** Fold  0 ***
Training samples: 3412
Validation samples: 854
Sum of Pos in Val:  tensor(432) 

*** Fold  1 ***
Training samples: 3413
Validation samples: 853
Sum of Pos in Val:  tensor(432) 

*** Fold  2 ***
Training samples: 3413
Validation samples: 853
Sum of Pos in Val:  tensor(432) 

*** Fold  3 ***
Training samples: 3413
Validation samples: 853
Sum of Pos in Val:  tensor(432) 

*** Fold  4 ***
Training samples: 3413
Validation samples: 853
Sum of Pos in Val:  tensor(432) 
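
The training loader uses `drop_last = True` and the validation loader uses `drop_last = False`, so the two round the number of mini-batches differently. The sketch below works this out for fold 0 with the sizes printed above. `n_batches` is a hypothetical helper, not part of PyTorch.

```python
import math

BATCH_SIZE = 32

def n_batches(n_samples, batch_size, drop_last):
    """Number of mini-batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)  # incomplete final batch is kept

# Fold 0 sizes reported above
n_train, n_valid = 3412, 854

train_batches = n_batches(n_train, BATCH_SIZE, drop_last=True)
valid_batches = n_batches(n_valid, BATCH_SIZE, drop_last=False)
dropped = n_train - train_batches * BATCH_SIZE

print(train_batches, valid_batches, dropped)  # 106 27 20
```

Because the training loader reshuffles every epoch, the 20 samples left out of an epoch are not the same ones each time.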



# Running and checking fine-tuning

## Defining the training and validation functions

In [12]:
import pandas as pd
import numpy as np

# Training function
def train(train_dataloader, model, optimizer, device):
  train_losses = []
  model.train()     # training mode
  optimizer.zero_grad()
  for n_iter, d in enumerate(train_dataloader):
    outputs = model(
      d[0].to(device),                    # input_ids_t
      attention_mask = d[1].to(device),   # attention_masks_t
      labels = d[2].to(device),           # labels_t
      token_type_ids=None
      )
    
    loss = outputs.loss # CrossEntropyLoss computed inside BertForSequenceClassification
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # gradient clipping (max norm 1.0)

    optimizer.step()
    optimizer.zero_grad()

    train_losses.append(loss.item())

  return train_losses

# Validation function
def validation(val_dataloader, model, device):
  val_losses = []
  logits     = []
  labels     = []
  inputs     = []

  model.eval()    # evaluation mode
  for n_iter, d in enumerate(val_dataloader):
    with torch.no_grad():
      outputs = model(
        d[0].to(device),                    # input_ids_t
        attention_mask = d[1].to(device),   # attention_masks_t
        labels = d[2].to(device),           # labels_t
        token_type_ids=None
        )

    loss = outputs.loss # CrossEntropyLoss computed inside BertForSequenceClassification
    val_losses.append(loss.item())
    
    # Element-wise sigmoid of the logits: the two values need not sum to 1,
    # but the argmax used for the prediction below is unchanged
    logits += outputs.logits.sigmoid().cpu().tolist()
    inputs += d[0].tolist()
    labels += d[2].tolist()

  # Return the predictions as a DataFrame
  val_res = pd.DataFrame(logits, columns=['logit0', 'logit1'])
  val_pred = np.argmax(val_res.values, axis=1).tolist()
  val_res['label'] = labels
  val_res['pred']  = val_pred
  val_res['input_ids']  = inputs
  
  return val_losses, val_res
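
The validation function stores `outputs.logits.sigmoid()` in the `logit0` and `logit1` columns and predicts with `argmax`. Sigmoid is monotonic, so the predicted class is the same as with raw logits or softmax. Only softmax yields two values that sum to 1. The sketch below checks this in plain Python. The logit values are hypothetical, and softmax is shown only for comparison, since the notebook itself does not use it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical raw logits for one sample: (class 0, class 1)
logits = [-1.2, 0.7]

sig = [sigmoid(x) for x in logits]        # what the notebook stores
soft = softmax(logits)                    # normalized class probabilities

print(argmax(logits), argmax(sig), argmax(soft))   # same class in all three cases
print(round(sum(sig), 4), round(sum(soft), 4))     # only softmax sums to 1
```

If the stored scores are later read as class probabilities, for example for thresholding or ROC curves, softmax over the two logits may be the safer choice.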

## Running training and saving the models

In [13]:
from transformers import BertForSequenceClassification
from sklearn.metrics import confusion_matrix
from tqdm.notebook import tqdm


# Lists that collect the loss of every mini-batch
train_losses = []
val_losses = []

# Run training
for i in range(k):

  # Load the pretrained model into BertForSequenceClassification
  model = BertForSequenceClassification.from_pretrained(
      model_name,
      num_labels = 2,                # Binary classification
      output_attentions = False,     # do not return attention weights
      output_hidden_states = False,  # do not return hidden states
      )
  
  # Optimizer
  optimizer = torch.optim.AdamW(model.parameters(), lr=LEAENING_RATE)

  # Move the model to the selected device (GPU when available)
  model.to(device)
  print(device)

  print('*** CV', i, 'started')
  for epoch in tqdm(range(N_EPOCHS), total = N_EPOCHS):

    train_ = train(train_sets["train_"+str(i)], model, optimizer, device)
    loss, val_res = validation(valid_sets["valid_"+str(i)], model, device)

    '''
    # Run this commented-out block to display the confusion matrix for each epoch
    cm = confusion_matrix(val_res['label'].tolist(), val_res['pred'].tolist())
    cm_df = pd.DataFrame(cm,columns=['Predicted Neg', 'Predicted Pos'], index=['Actual Neg', 'Actual Pos'])
    display(cm_df)
    '''

    # Sum of the validation mini-batch losses for this epoch
    print('  epoch', epoch, 'total loss :', sum(loss))

    train_losses += train_
    val_losses += loss

 
  '''
  # Run this commented-out block to list the samples whose prediction differs from the label
  val_res['Text'] = tokenizer.batch_decode(val_res['input_ids'].tolist(), skip_special_tokens=True)
  Errors = val_res.query('label!=pred')
  display(Errors)
  '''

  # Save the tokenizer and the model for this fold
  tokenizer.save_pretrained(datadir + 'bert/' + 'BERT_MODEL_' + str(i) + '/')
  model.save_pretrained(datadir + 'bert/' + 'BERT_MODEL_' + str(i) + '/')


Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialize

cuda:0
*** CV 0 started



  epoch 0 total loss : 15.973736464977264
  epoch 1 total loss : 12.971617549657822
  epoch 2 total loss : 10.886202543973923
  epoch 3 total loss : 9.409014716744423
  epoch 4 total loss : 8.325396701693535
  epoch 5 total loss : 7.6624883115291595
  epoch 6 total loss : 7.309448100626469
  epoch 7 total loss : 6.86689867079258
  epoch 8 total loss : 6.6041359603405
  epoch 9 total loss : 6.6322054117918015


cuda:0
*** CV 1 started

  epoch 0 total loss : 16.42708659172058
  epoch 1 total loss : 13.47651493549347
  epoch 2 total loss : 10.64563935995102
  epoch 3 total loss : 8.977853775024414
  epoch 4 total loss : 7.612244680523872
  epoch 5 total loss : 6.781359180808067
  epoch 6 total loss : 6.263590782880783
  epoch 7 total loss : 5.743235342204571
  epoch 8 total loss : 5.408562239259481
  epoch 9 total loss : 5.216616563498974


cuda:0
*** CV 2 started

  epoch 0 total loss : 15.826271057128906
  epoch 1 total loss : 12.451465904712677
  epoch 2 total loss : 9.570117026567459
  epoch 3 total loss : 7.5916359424591064
  epoch 4 total loss : 6.400576569139957
  epoch 5 total loss : 5.615303583443165
  epoch 6 total loss : 5.133206494152546
  epoch 7 total loss : 4.70725890994072
  epoch 8 total loss : 4.431269191205502
  epoch 9 total loss : 4.198870878666639


cuda:0
*** CV 3 started

  epoch 0 total loss : 16.822385907173157
  epoch 1 total loss : 14.173374742269516
  epoch 2 total loss : 11.387856930494308
  epoch 3 total loss : 9.511450156569481
  epoch 4 total loss : 8.12158977985382
  epoch 5 total loss : 7.127610415220261
  epoch 6 total loss : 6.303596794605255
  epoch 7 total loss : 5.841823682188988
  epoch 8 total loss : 5.524155884981155
  epoch 9 total loss : 5.22299575060606


cuda:0
*** CV 4 started

  epoch 0 total loss : 16.86497050523758
  epoch 1 total loss : 14.349805265665054
  epoch 2 total loss : 11.108561247587204
  epoch 3 total loss : 8.641068994998932
  epoch 4 total loss : 7.235273033380508
  epoch 5 total loss : 6.444413512945175
  epoch 6 total loss : 5.8605131804943085
  epoch 7 total loss : 5.560221206396818
  epoch 8 total loss : 5.297652006149292
  epoch 9 total loss : 5.036112768575549
