# BERT Classification Fine-Tuning with PyTorch Lightning
Fine-tune BERT for text classification in a local environment.

## 0. Prerequisites
### Data
Run the following command in a terminal to preprocess the Livedoor news corpus.
```bash
python utils/livedoor-dataprep.py
```

### Python environment setup
Run the following command in a terminal to create the conda environment.

```bash
conda env create --file bert_finetune_local.yml
```

## 1. Import libraries

In [1]:
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
from scipy.special import logit
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

from src import datasets, models

pl.seed_everything(1234)
torch.manual_seed(1234)
np.random.seed(1234)

Global seed set to 1234
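
`pl.seed_everything` already seeds Python's `random`, NumPy, and PyTorch, so the two explicit seeding calls after it are redundant but harmless. What a fixed seed buys can be illustrated with the standard library alone; the `draw` helper below is a hypothetical name used only for this sketch.

```python
import random


def draw(seed, n=5):
    """Draw n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


first = draw(1234)
second = draw(1234)

# The same seed reproduces the same sequence on every run.
print(first == second)
```

The same reasoning applies to weight initialization and data shuffling in the cells that follow: with the seed fixed, a rerun starts from the same random state.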


In [2]:
# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## 2. Data preprocessing

In [3]:
df = pd.read_csv("./data/processed/livedoor.tsv", delimiter='\t')
df = df.dropna()
df.head()

Unnamed: 0,text,label_index,label
0,ソーシャルレビューコミュニティ「zigsow（ジグソー）」が運営する企業向け＆ビジネス向け商...,1,it-life-hack
1,ブログなどよりも気軽に発信できるため、有名人も多くが利用しているTwitter。だが、その気...,2,kaden-channel
2,「秋葉原通り魔事件」の映画化がネットで反響を呼んでいる。だが、その内容はあまりにも衝撃的だっ...,8,topic-news
3,全国に21000人の部員を誇る「iPhone女子部」が、女子のハートをがっちりつかむiPho...,5,peachy
4,2006年に、同業の小沢コージ君（みんカラスペシャルブログメンバー）と『力説自動車』という単...,3,livedoor-homme


In [4]:
X_train, X_test = train_test_split(df, test_size=0.2, stratify=df['label_index'])
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
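
The `stratify=df['label_index']` argument keeps the label distribution roughly the same in the training and test splits. The idea can be sketched in plain Python; `stratified_split` and the toy labels below are hypothetical illustrations, not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each label group separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test split.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical toy labels: class 0 is four times as common as class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=1234)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both splits keep the 4:1 class ratio of the toy labels, which is the property that matters when some news categories have fewer articles than others.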

In [5]:
X_train.to_csv("./data/processed/livedoor-train.tsv", sep='\t', index=False)
X_test.to_csv("./data/processed/livedoor-test.tsv", sep='\t', index=False)

In [6]:
train_dataset = datasets.LivedoorDataset(X_train)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

In [7]:
# Build the evaluation dataset from the held-out split, not the training split
test_dataset = datasets.LivedoorDataset(X_test)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)

## 3. Model training

In [8]:
model = models.LitBert()

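# Freeze all parameters of the pretrained BERT model
for param in model.bert.bert.parameters():
    param.requires_grad = False

# Unfreeze only the last encoder layer so that it is fine-tuned
for param in model.bert.bert.encoder.layer[-1].parameters():
    param.requires_grad = True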

model.to(device)
trainer = pl.Trainer(gpus=1, default_root_dir='pl-model', max_epochs=5)

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
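
The freeze-then-unfreeze pattern used in the cell above can be tried on a toy model. This is a minimal sketch: the three small `nn.Linear` layers stand in for the BERT encoder and the nine-way `classifier` for the classification head, and the names and sizes are assumptions chosen only for illustration.

```python
import torch.nn as nn

# Hypothetical stand-ins: a three-layer "encoder" and a nine-class head.
encoder = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
classifier = nn.Linear(8, 9)

# Freeze every encoder parameter, then unfreeze only the last layer.
for param in encoder.parameters():
    param.requires_grad = False
for param in encoder[-1].parameters():
    param.requires_grad = True


def count_params(module, trainable):
    """Count parameters whose requires_grad flag equals `trainable`."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad == trainable)


trainable = count_params(encoder, True) + count_params(classifier, True)
frozen = count_params(encoder, False) + count_params(classifier, False)
print(trainable, frozen)
```

Only the last encoder layer and the head receive gradients, so the optimizer updates a small fraction of the weights while the frozen layers keep their pretrained values.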


In [9]:
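# Start training
# Note: the `train_dataloader` keyword is deprecated since v1.4 (see the warning below);
# newer PyTorch Lightning versions expect `train_dataloaders` instead.
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=test_loader)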

  "`trainer.fit(train_dataloader)` is deprecated in v1.4 and will be removed in v1.6."
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type               | Params
--------------------------------------------
0 | bert | BertClassification | 110 M 
--------------------------------------------
7.1 M     Trainable params
103 M     Non-trainable params
110 M     Total params
442.497   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f"Your {mode}_dataloader has `shuffle=True`, it is best practice to turn"
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Global seed set to 1234
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Training: -1it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]
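
The `7.1 M` trainable parameters in the model summary are consistent with one unfrozen BERT-base encoder layer plus a small classification head. The arithmetic below is a sketch under stated assumptions: the BERT-base sizes (hidden size 768, feed-forward size 3072) and a head that is a single linear layer mapping 768 features to the 9 Livedoor categories. The actual head in `src.models` is not shown here and may differ.

```python
HIDDEN = 768          # BERT-base hidden size (assumed)
INTERMEDIATE = 3072   # BERT-base feed-forward size (assumed)
NUM_LABELS = 9        # Livedoor news categories


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


# Self-attention: query, key, value, and output projections.
attention = 4 * linear(HIDDEN, HIDDEN)
# Feed-forward block: expand to INTERMEDIATE, then project back.
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
# Two LayerNorms, each with a scale and a shift vector.
layer_norms = 2 * (2 * HIDDEN)

encoder_layer = attention + feed_forward + layer_norms
classifier = linear(HIDDEN, NUM_LABELS)  # assumed single linear head
trainable = encoder_layer + classifier

print(encoder_layer, classifier, trainable)
```

Under these assumptions the total is about 7.09 million, which rounds to the `7.1 M` reported by the summary.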

## 4. Model evaluation

In [10]:
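# Evaluate the model
# Note: `test_dataloaders` is deprecated since v1.4; newer versions expect `dataloaders`.
result = trainer.test(model, test_dataloaders=test_loader)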
print(result)

  "`trainer.test(test_dataloaders)` is deprecated in v1.4 and will be removed in v1.6."
  rank_zero_warn(f"you passed in a {loader_name} but have no {step_name}. Skipping {stage} loop")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[]

The result is an empty list because, as the warning above indicates, `LitBert` defines no `test_step`, so the test loop is skipped. Define a `test_step` on the model to obtain test metrics.


In [11]:
# Save the model checkpoint
trainer.save_checkpoint("./model/bert-livedoor.ckpt")

### Launch TensorBoard
```bash
tensorboard --logdir pl-model/lightning_logs
```