## __Text Classification__

- Classify the livedoor news corpus (9 classes)

- Use `AutoModelForSequenceClassification`

    - Pass the ground-truth labels via the `labels` argument

    - [Official documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification)
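
When `labels` is supplied, the model returns a loss together with the logits. For single-label classification with integer labels, that loss is cross-entropy over the raw logits. The following is a minimal pure-Python sketch of that computation for a single example; the function name `cross_entropy` and the logit values are illustrative assumptions, not part of the library.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for a 3-class problem (values are made up).
logits = [2.0, 0.5, -1.0]
loss_correct = cross_entropy(logits, label=0)  # label has the highest score
loss_wrong = cross_entropy(logits, label=2)    # label has the lowest score
print(loss_correct, loss_wrong)
```

The loss is small when the labeled class already has the highest score and grows as that class's score falls behind the others.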

### __Setup__

In [1]:
!nvidia-smi

Mon Oct 23 01:49:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [2]:
!pip install "transformers[ja,torch]" datasets | tail -n 1

Successfully installed accelerate-0.23.0 datasets-2.14.5 dill-0.3.7 fugashi-1.3.0 huggingface-hub-0.17.3 ipadic-1.0.0 multiprocess-0.70.15 plac-1.4.0 rhoknp-1.3.0 safetensors-0.4.0 sudachidict-core-20230927 sudachipy-0.6.7 tokenizers-0.14.1 transformers-4.34.1 unidic-1.1.0 unidic-lite-1.0.8 wasabi-0.10.1


In [3]:
from glob import glob
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score, f1_score
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    EvalPrediction,
    TrainingArguments,
    Trainer
)

In [4]:
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=9) # number of classes

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# Dataset: livedoor news corpus

# Download
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz

# Extract
!tar -zxf ldcc-20140209.tar.gz

--2023-10-23 01:50:54--  https://www.rondhuit.com/download/ldcc-20140209.tar.gz
Resolving www.rondhuit.com (www.rondhuit.com)... 59.106.19.174
Connecting to www.rondhuit.com (www.rondhuit.com)|59.106.19.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8855190 (8.4M) [application/x-gzip]
Saving to: ‘ldcc-20140209.tar.gz’


2023-10-23 01:50:57 (3.62 MB/s) - ‘ldcc-20140209.tar.gz’ saved [8855190/8855190]



In [6]:
# Build the Dataset

max_length = 128

dataset = []
# Sort so the category-to-label-id mapping is the same on every run
categories = sorted(c.split("/")[1] for c in glob("text/**/"))
for i, category in tqdm(enumerate(categories), total=len(categories)):
    for file in glob("text/" + category + f"/{category}-*.txt"):
        with open(file, encoding="utf-8") as f:
            texts = f.readlines()[3:] # the article body starts on line 4
        text = "".join(texts)
        encoding = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
        encoding["labels"] = i  # integer class id for this category
        dataset.append(encoding)

dataset = Dataset.from_list(dataset)
dataset = dataset.train_test_split(test_size=0.2, shuffle=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5893
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1474
    })
})
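
The `padding="max_length"` and `truncation=True` options make every encoded example exactly `max_length` tokens long, which is why all rows share the same shape. The sketch below illustrates that idea in plain Python. It is a simplification that ignores special tokens such as `[CLS]` and `[SEP]`, and the helper name `pad_or_truncate` and the default `pad_id=0` are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    pad_id=0 is an assumed padding id, used here only for illustration.
    """
    ids = token_ids[:max_length]          # truncation: drop tokens past the limit
    mask = [1] * len(ids)                 # 1 marks real tokens
    n_pad = max_length - len(ids)         # padding: fill up to the limit
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask lets the model ignore padded positions, so padding affects only the tensor shape.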

In [7]:
# Custom evaluation function
def compute_metrics(p: EvalPrediction):
    pred = p.predictions.argmax(axis=-1)  # predicted class id per example
    labels = p.label_ids
    return {
        "accuracy": accuracy_score(labels, pred),
        "f1": f1_score(labels, pred, average="macro"),
    }
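
For reference, here is a rough pure-Python sketch of what the two metrics measure: accuracy is the fraction of exact matches, and macro F1 is the unweighted mean of the per-class F1 scores. The helper names and the toy labels are illustrative assumptions; the notebook itself relies on scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, made up for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, f1)
```

Because every class contributes equally to the macro average, a class with few examples influences the score as much as a large one.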

In [8]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    evaluation_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)

In [9]:
# Train
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.488431 | 0.837178 | 0.832285 |
| 2 | No log | 0.471125 | 0.873813 | 0.868244 |
| 3 | 0.366400 | 0.495785 | 0.875848 | 0.869703 |
| 4 | 0.366400 | 0.540352 | 0.877883 | 0.872216 |
| 5 | 0.366400 | 0.551387 | 0.881275 | 0.875302 |


TrainOutput(global_step=925, training_loss=0.20896520356874207, metrics={'train_runtime': 687.6515, 'train_samples_per_second': 42.849, 'train_steps_per_second': 1.345, 'total_flos': 1938263624098560.0, 'train_loss': 0.20896520356874207, 'epoch': 5.0})
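
The step count in `TrainOutput` can be checked by hand. The sketch below assumes a single GPU, no gradient accumulation, and that `logging_steps` was left at its default of 500 (it is not set in the `TrainingArguments` call). Under those assumptions it also suggests why the training loss reads "No log" for the first two epochs.

```python
import math

n_train = 5893        # rows in dataset["train"]
batch_size = 32       # per_device_train_batch_size
epochs = 5            # num_train_epochs
logging_steps = 500   # assumed default, not set explicitly in the notebook

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
# Epoch during which the first training-loss value is logged
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, total_steps, first_logged_epoch)  # 185 925 3
```

The next log point would be step 1000, which is past the final step, so the same training-loss value is shown for epochs 3 through 5.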

In [10]:
# Inference
# The loss computation applies softmax internally, so the model
# itself outputs raw scores (logits) from before the softmax.
# Drop the `labels` column at inference time.

prediction = trainer.predict(dataset["test"].remove_columns("labels"))
prediction

PredictionOutput(predictions=array([[-1.8266122 ,  3.009017  , -1.1699245 , ..., -1.8693671 ,
        -1.4403919 ,  6.459254  ],
       [-1.1472561 , -1.7351799 ,  7.8606834 , ..., -1.3176523 ,
        -0.3460874 , -1.3403219 ],
       [-1.384682  ,  6.104582  , -1.7276295 , ..., -0.8640675 ,
         0.89589506, -1.2378716 ],
       ...,
       [-0.9674564 , -0.69104326, -0.98230076, ...,  0.42792907,
        -1.2998247 , -0.696225  ],
       [-1.7133216 ,  7.6840434 , -1.3893275 , ..., -0.51261014,
         0.02509275, -1.042011  ],
       [-0.9984986 , -0.75192314, -1.5250102 , ...,  7.6136384 ,
        -0.9469103 , -1.2879707 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 10.6495, 'test_samples_per_second': 138.41, 'test_steps_per_second': 2.254})
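
Since `predictions` holds raw logits, a softmax is needed to read them as class probabilities, while `argmax` alone is enough to pick the predicted class. Below is a minimal pure-Python sketch; the category names and logit values are hypothetical placeholders, not taken from the corpus.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical category names and logits, for illustration only.
categories = ["cat_a", "cat_b", "cat_c"]
logits = [-1.2, 3.0, 0.4]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(categories[best], probs[best])
```

Softmax preserves the ordering of the scores, so taking `argmax` over the logits and over the probabilities gives the same class.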

In [11]:
# Check the evaluation metrics
pred = prediction.predictions.argmax(-1)
labels = dataset["test"]["labels"]

# Matches the final-epoch values in the training log
accuracy_score(labels, pred), f1_score(labels, pred, average="macro")

(0.8812754409769336, 0.8753020895562503)