### __Text classification with BERT__

- Classify the livedoor news corpus (9 classes)

- Use BertForSequenceClassification

    - Pass the ground-truth labels via the `labels` argument

    - [Official documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification)
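
As a minimal sketch of what the `labels` argument does, the snippet below builds a tiny, randomly initialised model so that nothing has to be downloaded. The config sizes and the name `toy_model` are assumptions chosen only for illustration; they are not the settings used in this notebook.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Hypothetical tiny config: small enough to run instantly on CPU.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=9,  # same number of classes as the livedoor corpus
)
toy_model = BertForSequenceClassification(config)
toy_model.eval()

# Two fake sequences of 8 token ids each, with one class id per sequence.
input_ids = torch.randint(0, 100, (2, 8))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 3])

with torch.no_grad():
    without_labels = toy_model(input_ids=input_ids, attention_mask=attention_mask)
    with_labels = toy_model(
        input_ids=input_ids, attention_mask=attention_mask, labels=labels
    )

print(without_labels.loss)       # no labels given, so no loss is computed
print(with_labels.loss)          # scalar loss tensor
print(with_labels.logits.shape)  # one score per class for each sequence
```

When `labels` is supplied, the output carries a scalar `loss` next to the `logits`, which is what `Trainer` relies on during training.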

In [1]:
!nvidia-smi

Tue Dec 20 20:32:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install transformers[ja] datasets | tail -n 1

Successfully installed datasets-2.8.0 fugashi-1.2.1 huggingface-hub-0.11.1 ipadic-1.0.0 multiprocess-0.70.14 plac-1.3.5 pyknp-0.6.1 responses-0.18.0 sudachidict-core-20221021 sudachipy-0.6.6 tokenizers-0.13.2 transformers-4.25.1 unidic-1.1.0 unidic-lite-1.0.8 urllib3-1.25.11 xxhash-3.1.0


In [3]:
from glob import glob
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score, f1_score
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    EvalPrediction,
    TrainingArguments,
    Trainer
)

In [4]:
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=9)  # number of classes

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialize

In [5]:
# BertForSequenceClassification adds a classification head on top of BertModel
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [6]:
# Dataset: livedoor news corpus

# Download
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz

# Extract
!tar -zxf ldcc-20140209.tar.gz

--2022-12-20 20:34:06--  https://www.rondhuit.com/download/ldcc-20140209.tar.gz
Resolving www.rondhuit.com (www.rondhuit.com)... 59.106.19.174
Connecting to www.rondhuit.com (www.rondhuit.com)|59.106.19.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8855190 (8.4M) [application/x-gzip]
Saving to: ‘ldcc-20140209.tar.gz’


2022-12-20 20:34:14 (3.05 MB/s) - ‘ldcc-20140209.tar.gz’ saved [8855190/8855190]



In [13]:
# Build the Dataset

max_length = 128

dataset = []
categories = sorted(c.split("/")[1] for c in glob("text/**/"))  # sorted so label ids are reproducible
for i, category in tqdm(enumerate(categories), total=len(categories)):
    for file in glob("text/" + category + f"/{category}-*.txt"):
        with open(file, encoding="utf-8") as f:
            texts = f.readlines()[3:]  # the article text starts on line 4
        text = "".join(texts)
        encoding = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
        encoding["labels"] = i
        dataset.append(encoding)

dataset = Dataset.from_list(dataset)
dataset = dataset.train_test_split(test_size=0.2, shuffle=True)

len(dataset)  # DatasetDict with "train" and "test" splits, hence 2

2
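
Because each label id is just the position of a directory name in `categories`, it helps to keep an explicit mapping for turning predictions back into category names. The sketch below assumes the nine directory names of the livedoor corpus in sorted order; `id2label` and `label2id` are names introduced here for illustration.

```python
# Assumed: the nine category directories of the livedoor news corpus, sorted.
categories = [
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news",
]

# Label id -> category name, and the reverse lookup.
id2label = dict(enumerate(categories))
label2id = {name: i for i, name in id2label.items()}

# Example: turn a few predicted class ids back into readable names.
predicted_ids = [2, 1, 8]
print([id2label[i] for i in predicted_ids])
```

If the directory listing is not sorted, the ids can differ between runs, so the mapping should be built from the same `categories` list that was used to create the dataset.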

In [14]:
# Metric function
# Returns the metrics as a dict
def compute_metrics(p: EvalPrediction):
    pred = p.predictions.argmax(axis=-1)
    labels = p.label_ids
    return {
        "accuracy": accuracy_score(labels, pred),
        "f1": f1_score(labels, pred, average="macro"),
    }
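
To see why macro F1 is reported alongside accuracy, here is a small pure-Python sketch of macro averaging on made-up labels. The helper `macro_f1` and the toy data are illustrative assumptions; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def macro_f1(labels, preds):
    """Unweighted mean of per-class F1 over every class seen in labels or preds."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: class 0 dominates, and the single class-2 example is misclassified.
toy_labels = [0, 0, 0, 0, 1, 2]
toy_preds = [0, 0, 0, 0, 1, 1]

toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
toy_macro_f1 = macro_f1(toy_labels, toy_preds)

print(toy_accuracy)  # 5 of 6 correct
print(toy_macro_f1)  # per-class F1 of 1, 2/3 and 0, averaged
```

Accuracy stays high because the frequent class is predicted well, while macro F1 drops because every class counts equally, including the one that was missed entirely.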

In [15]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,  # watch GPU memory usage
    per_device_eval_batch_size=64,
    warmup_steps=100,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [16]:
# Training
trainer.train()

***** Running training *****
  Num examples = 5893
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1107
  Number of trainable parameters = 110624265


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.368926 | 0.893487 | 0.888049 |
| 2 | 0.687800 | 0.365601 | 0.895522 | 0.889872 |
| 3 | 0.136600 | 0.387305 | 0.907734 | 0.901112 |


***** Running Evaluation *****
  Num examples = 1474
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1474
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1474
  Batch size = 64


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1107, training_loss=0.3789015561476848, metrics={'train_runtime': 488.8804, 'train_samples_per_second': 36.162, 'train_steps_per_second': 2.264, 'total_flos': 1162958174459136.0, 'train_loss': 0.3789015561476848, 'epoch': 3.0})
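
The step counts in the log above can be reproduced with a little arithmetic. This is a rough sketch that assumes a single GPU, no gradient accumulation, and the `TrainingArguments` defaults `save_steps=500` and `logging_steps=500`, none of which are overridden in this notebook.

```python
import math

# Values copied from the training log.
num_train_examples = 5893
num_eval_examples = 1474
train_batch_size = 16
eval_batch_size = 64
num_epochs = 3

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
eval_batches = math.ceil(num_eval_examples / eval_batch_size)

# Assumed defaults: a checkpoint and a training-loss log entry every 500 steps.
save_steps = 500
logging_steps = 500
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
first_epoch_has_train_loss = steps_per_epoch >= logging_steps

print(steps_per_epoch, total_steps, eval_batches)  # 369 1107 24
print(checkpoint_steps)                            # [500, 1000]
print(first_epoch_has_train_loss)                  # False
```

This matches `Total optimization steps = 1107` and the `checkpoint-500` and `checkpoint-1000` directories. It also explains the `No log` entry for epoch 1: the first epoch ends at step 369, before the first training loss is logged at step 500.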

In [29]:
# Inference
# Softmax is applied inside the loss computation,
# so the model itself outputs the pre-softmax scores (logits).
# Drop the "labels" column at inference time.

prediction = trainer.predict(dataset["test"].remove_columns("labels"))
prediction

***** Running Prediction *****
  Num examples = 1474
  Batch size = 64


PredictionOutput(predictions=array([[-1.0119767 , -0.10744184, -0.72416866, ..., -1.0647011 ,
        -1.9622073 , -1.3940773 ],
       [-1.1255065 , -0.2811904 , -2.0328465 , ..., -1.1692348 ,
        -0.6740754 ,  0.03338239],
       [-1.0427496 , -0.6820771 , -1.0405897 , ..., -0.9456292 ,
        -1.8252132 , -1.531799  ],
       ...,
       [-1.2731665 , -1.5314319 ,  7.0346704 , ..., -1.0889316 ,
        -0.94325805, -1.0704814 ],
       [-1.4367728 ,  7.3573074 , -1.2244735 , ..., -1.1020482 ,
        -1.0586737 , -0.261383  ],
       [ 0.6389012 , -0.49120837, -0.88427347, ..., -1.6719192 ,
        -1.4458706 ,  7.4639792 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 11.7054, 'test_samples_per_second': 125.925, 'test_steps_per_second': 2.05})
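
Since `predictions` holds raw logits, class probabilities have to be computed separately when they are needed. Below is a minimal NumPy sketch; `softmax` and `toy_logits` are illustrative names, and the values are made up rather than taken from the output above.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for two examples and three classes.
toy_logits = np.array([[-1.0, 7.0, -0.5], [0.2, 0.1, 3.0]], dtype=np.float32)
probs = softmax(toy_logits)

print(probs.sum(axis=-1))      # each row sums to 1 (up to float rounding)
print(probs.argmax(axis=-1))   # same classes as toy_logits.argmax(axis=-1)
```

Softmax preserves the ordering of the scores within each row, so taking `argmax` over the logits directly gives the same predicted class.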

In [31]:
# Check the metrics
pred = prediction.predictions.argmax(-1)
labels = dataset["test"]["labels"]

# Matches the values logged during training
accuracy_score(labels, pred), f1_score(labels, pred, average="macro")

(0.9077340569877883, 0.9011118037466537)