# Final Project: Transformer-based Language Models





*   **2021-18031 전종원 (자유전공학부)**
*   Computational Linguistics (Spring 2023)
*   Construct a system dealing with NLP tasks using Transformer-based Language Models


## Links to related files
Note that these files are accessible only to users who (1) **have the link** and (2) are logged in with **an SNU account**.

*   [Report](https://drive.google.com/file/d/1sDY0yx9_toHkkW8AbA8kjlMOna5n80Oj/view?usp=sharing)
   *   the same pdf file in the submitted .zip


*   [Domain-Adaptive pre-trained model](https://drive.google.com/drive/folders/1b1nRP9PrEjpbndkxlGlO6SkJKprgqKO4?usp=sharing)
  *   FYI, the domain-adaptive pre-training run took about 3 hours to complete

If you have any problems accessing the files, feel free to contact me via [email](mailto:cjw107@snu.ac.kr).




## Selected Paper
**LEGAL-BERT: The Muppets straight out of Law School**
> Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

## References

The following references were consulted while writing the code below:
*   [LBox Open: 한국어 AI Benchmark Dataset](https://blog.lbox.kr/lbox-open)
*   [Legal-BERT, 법률 도메인에 특화된 언어모델 개발기](https://blog.lbox.kr/legal-bert)
*   [Fine-tuning BERT (and friends) for multi-label text classification.ipynb](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb#scrollTo=AFWlSsbZaRLc)
*   [BERT — Pre-training + Fine-tuning](https://medium.com/analytics-vidhya/bert-pre-training-fine-tuning-eb574be614f6)
*   [Multi-label-classification](https://flonelin.wordpress.com/2022/01/09/multi-label-sentence-classification-bert-transformers/)
*   [Multiclass and Multilabel Text Classification in One BERT Model](https://lajavaness.medium.com/multiclass-and-multilabel-text-classification-in-one-bert-model-95c54aab59dc)
*   [Huggingface - RobertaForSequenceClassification의 반환값 분석](https://kyunghyunlim.github.io/ml_ai/2021/10/04/roberta_cls.html)
*   [Bert For Domain Adaptation](https://github.com/yangoos57/Bert_For_Domain_Adaptation/blob/main/%5Btutorial%5D%20Bert%20Domain%20Adaptation.ipynb)
*   [A-Domain-adaptive-Pre-training-Approach-for-Language-BiasDetection-in-News](https://github.com/Media-Bias-Group/A-Domain-adaptive-Pre-training-Approach-for-Language-BiasDetection-in-News)


# 0. Environment Setting

In [None]:
!pip install -q torch
!pip install -q pandas
!pip install -q transformers
!pip install -q datasets

!pip install -q transformers[torch]
!pip install -q accelerate -U


# Part 1. Fine-tuning

## 1. Load libraries, model, and dataset

In [None]:
import torch
import pandas
import transformers
import datasets
import numpy as np

from transformers import AutoModel
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

from datasets import load_dataset
from datasets import Dataset  # load_metric is unused here and is no longer available in recent datasets releases

from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, roc_auc_score

from transformers import Trainer, TrainerCallback, TrainingArguments
from transformers import AdamW
from transformers import get_scheduler

from tqdm.auto import tqdm

import accelerate

device = 'cuda' if torch.cuda.is_available() else "cpu"

In [None]:
# model = AutoModelForSequenceClassification.from_pretrained("klue/roberta-base")
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")

len(tokenizer)

32000

In [None]:
# statutes classification task
data_st = load_dataset("lbox/lbox_open", "statute_classification")
# data_st_plus = load_dataset("lbox/lbox_open", "statute_classification_plus")




In [None]:
data_st

DatasetDict({
    train: Dataset({
        features: ['id', 'casetype', 'casename', 'statutes', 'facts'],
        num_rows: 2208
    })
    validation: Dataset({
        features: ['id', 'casetype', 'casename', 'statutes', 'facts'],
        num_rows: 276
    })
    test: Dataset({
        features: ['id', 'casetype', 'casename', 'statutes', 'facts'],
        num_rows: 276
    })
    test2: Dataset({
        features: ['id', 'casetype', 'casename', 'statutes', 'facts'],
        num_rows: 538
    })
})

## 2. Preprocess Data


*   Encode "statutes" (labels) as multi-hot vectors
*   Tokenize "facts" with the tokenizer
*   Remove unused columns



For multi-label text classification, the labels form a matrix of shape (batch_size, num_labels). They must be floats rather than integers; otherwise PyTorch's BCEWithLogitsLoss (which the model uses for this problem type) will raise an error.
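
As a minimal sketch of what such a label matrix looks like, the snippet below builds multi-hot float vectors from a toy label set in plain Python. The statute names and the helper `to_multi_hot` are hypothetical and only illustrate the idea; the notebook's own `preprocess_function` does the real encoding.

```python
# Toy label inventory (hypothetical); the notebook derives the real one from the dataset.
toy_labels = ["statute A", "statute B", "statute C"]
toy_label2id = {label: idx for idx, label in enumerate(toy_labels)}

def to_multi_hot(statutes, label2id):
    """Return a float vector with 1.0 at the index of every statute present."""
    vector = [0.0] * len(label2id)
    for statute in statutes:
        vector[label2id[statute]] = 1.0
    return vector

# Two examples give a (batch_size=2, num_labels=3) matrix of floats.
toy_batch = [
    to_multi_hot(["statute A"], toy_label2id),
    to_multi_hot(["statute A", "statute C"], toy_label2id),
]
print(toy_batch)  # [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
```

Every entry is a Python `float`, so the resulting tensor has a floating-point dtype, which is what a binary cross-entropy loss on logits expects.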

In [None]:
# gather every statute that appears in any split
labels = set()
for split in ['train', 'validation', 'test', 'test2']:
  for statutes in data_st[split]['statutes']:
    labels.update(statutes)
# sort so that the label-to-index mapping is identical across runs
labels = sorted(labels)
len(labels)

188

In [None]:
# 2 dictionaries that map labels to integers and back.
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

In [None]:
# preprocess function
def preprocess_function(example):
    statute = [0] * len(id2label)
    for k, l in id2label.items():
        if l in example["statutes"]:
            statute[k] = float(1)
        else:
            statute[k] = float(0)
    example = tokenizer(example["facts"], truncation=True, padding=True)
    example["statutes"] = statute
    return example

In [None]:
tokenized_dataset = data_st.map(preprocess_function)
tokenized_dataset = tokenized_dataset.remove_columns(["id", "casetype", "casename", "facts"])
tokenized_dataset = tokenized_dataset.rename_column("statutes", "labels")
tokenized_dataset.set_format("torch")
tokenized_dataset






DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2208
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 276
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 276
    })
    test2: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 538
    })
})

## 3. Construct Iterator and Metrics

In [None]:
# Construct data_collator with padding and Iterator(dataloader)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
BATCH_SIZE = 4

train_dataloader = DataLoader(tokenized_dataset["train"], shuffle=True, batch_size=BATCH_SIZE, collate_fn=data_collator)
# evaluation loaders do not need shuffling; a fixed order keeps the reported metrics reproducible
validation_dataloader = DataLoader(tokenized_dataset["validation"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)
test_dataloader = DataLoader(tokenized_dataset["test"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)
test2_dataloader = DataLoader(tokenized_dataset["test2"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import torch

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.05):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # print("probs: ", probs)

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # print("pred: ", y_pred)

    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')

    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc}
    return metrics
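
To make the sigmoid, threshold, and micro-averaging steps concrete, here is a small plain-Python sketch on toy data. The helper names `sigmoid` and `micro_f1` and the toy logits are assumptions made for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def micro_f1(logits, gold, threshold=0.05):
    """Micro-averaged F1: pool TP/FP/FN over every (example, label) cell."""
    tp = fp = fn = 0
    for row_logits, row_gold in zip(logits, gold):
        for logit, target in zip(row_logits, row_gold):
            pred = 1 if sigmoid(logit) >= threshold else 0
            if pred == 1 and target == 1:
                tp += 1
            elif pred == 1 and target == 0:
                fp += 1
            elif pred == 0 and target == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical batch: 2 examples, 3 labels.
toy_logits = [[2.0, -5.0, -1.0], [-4.0, 0.5, -6.0]]
toy_gold = [[1, 0, 0], [0, 1, 1]]

print(round(micro_f1(toy_logits, toy_gold, threshold=0.05), 3))  # 0.667
print(round(micro_f1(toy_logits, toy_gold, threshold=0.5), 3))   # 0.8
```

With the low threshold of 0.05, the logit of -1.0 (probability about 0.27) counts as a positive prediction and adds a false positive, which is why the score differs from the one obtained at 0.5.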

In [None]:
def mergeDict(dictA, dictB):
  for key in dictB:
    if key in dictA:
        dictB[key] = (dictB[key] + dictA[key])/2
    else:
        pass
  res = dictA | dictB
  return res
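
One property of this merge is worth seeing on numbers: folding values pairwise as `(new + old) / 2` gives the last batch a weight of 1/2, the one before it 1/4, and so on, so the result generally differs from the plain mean over batches. The sketch below compares the two on hypothetical per-batch F1 values; `pairwise_average` and `true_mean` are illustrative names, not functions used elsewhere in this notebook.

```python
def pairwise_average(values):
    """Fold values the way mergeDict does: running = (value + running) / 2."""
    running = None
    for value in values:
        running = value if running is None else (value + running) / 2
    return running

def true_mean(values):
    """Arithmetic mean, where every batch has equal weight."""
    return sum(values) / len(values)

# Hypothetical per-batch F1 scores: three poor batches followed by a perfect one.
batch_f1 = [0.0, 0.0, 0.0, 1.0]

print(pairwise_average(batch_f1))  # 0.5
print(true_mean(batch_f1))         # 0.25
```

If an equal-weight summary is wanted, one possible alternative is to collect all logits and labels over the epoch and compute the metrics once at the end.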

## 4. Build Model


*   Define model
*   Set optimizer
*   Define Train, Evaluate Method



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-base",
    problem_type="multi_label_classification",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id)

Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']

In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)



In [None]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler (
    "linear",
    optimizer = optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps
)
print(num_training_steps)

1656
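
The step count printed above can be checked by hand, and the shape of the linear schedule follows from it. The sketch below assumes the usual definition of a linear schedule with warmup (the learning rate rises linearly during warmup, then decays linearly to zero); `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
import math

train_rows, batch_size, num_epochs = 2208, 4, 3
steps_per_epoch = math.ceil(train_rows / batch_size)  # 552 batches per epoch
total_steps = num_epochs * steps_per_epoch            # 1656, as printed above

def linear_lr(step, base_lr=5e-5, warmup_steps=0, total=total_steps):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup_steps))

print(total_steps)             # 1656
print(linear_lr(0))            # 5e-05
print(linear_lr(total_steps))  # 0.0
```

With no warmup, the rate starts at the full 5e-5 and reaches half of that value at step 828, the midpoint of training.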


In [None]:
model.to(device)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [None]:
def train(model, dataloader, optimizer):
  epoch_loss = 0
  metrics = {}

  model.train()
  for batch in dataloader:  # iterate over the loader passed in, not the global one
    optimizer.zero_grad()

    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    epoch_loss += loss.item()
    # print(loss)

    logits = outputs.logits.cpu()
    # print(logits)
    metrics = mergeDict(metrics, multi_label_metrics(predictions=logits, labels=batch['labels'].cpu()))
    # for key in metrics.keys():
    #   print(key, ":", metrics[key]*100)

    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

  return epoch_loss / len(dataloader), metrics

In [None]:
def evaluate(model, dataloader):
  model.eval()
  test_loss = 0
  metrics = {}

  for batch in dataloader:  # iterate over the loader passed in (validation, test, or test2)
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():
      outputs = model(**batch)

    loss = outputs.loss
    test_loss += loss.item()

    logits = outputs.logits.cpu()
    metrics = mergeDict(metrics, multi_label_metrics(predictions=logits, labels=batch['labels'].cpu()))

  return test_loss / len(dataloader), metrics

In [None]:
def printMetrics(exp, metrics):
  for key in metrics.keys():
    print(f'\t {exp} {key}: {metrics[key]:.3f}', end=' ')
  print()

## 5. Train Model

In [None]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs) :
  train_loss, train_metrics = train(model, train_dataloader, optimizer)
  validation_loss, validation_metrics = evaluate(model, validation_dataloader)

  print(f'Epoch: {epoch+1:02}')
  print(f'\t Train loss: {train_loss:.3f}')
  printMetrics("Train", train_metrics)
  print(f'\t Val. loss: {validation_loss:.3f}')
  printMetrics("Validation", validation_metrics)


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch: 01
	 Train loss: 0.088
	 Train f1: 0.115 	 Train roc_auc: 0.623 
	 Val. loss: 0.041
	 Validation f1: 0.118 	 Validation roc_auc: 0.615 
Epoch: 02
	 Train loss: 0.040
	 Train f1: 0.123 	 Train roc_auc: 0.670 
	 Val. loss: 0.039
	 Validation f1: 0.136 	 Validation roc_auc: 0.654 
Epoch: 03
	 Train loss: 0.040
	 Train f1: 0.159 	 Train roc_auc: 0.689 
	 Val. loss: 0.039
	 Validation f1: 0.157 	 Validation roc_auc: 0.734 


## 6. Evaluate Model

In [None]:
test_loss, test_metrics = evaluate(model, test_dataloader)
test2_loss, test2_metrics = evaluate(model, test2_dataloader)

print(f'\t Test loss: {test_loss:.3f}')
printMetrics("Test", test_metrics)
print(f'\t Test2 loss: {test2_loss:.3f}')
printMetrics("Test2", test2_metrics)

	 Test loss: 0.039
	 Test f1: 0.071 	 Test roc_auc: 0.577 
	 Test2 loss: 0.020
	 Test2 f1: 0.182 	 Test2 roc_auc: 0.771 


### (Personal Use) Inference Test

In [None]:
data_st['test'][0]

{'id': 190,
 'casetype': 'criminal',
 'casename': '강제추행',
 'statutes': ['형법 제298조'],
 'facts': '피고인은 2020. 12. 17. 20:30경 강원 속초시 B에 있는 피고인이 운영하는 ‘C 노래연습장\' 안에서, 그곳에 손님으로 온 피해자 D(여, 21세)에게 "옷이 이게 뭐야, 딸 생각난다."고 말하며 갑자기 피해자의 가슴 부분 윗옷을 위로 잡아당기는 듯하며 양 엄지 손가락으로 피해자의 가슴 부분을 만졌다.\n이로써 피고인은 피해자를 강제로 추행하였다.'}

In [None]:
data_st['test'][21]

{'id': 793,
 'casetype': 'criminal',
 'casename': '공무집행방해, 업무방해',
 'statutes': ['형법 제136조 제1항', '형법 제314조 제1항'],
 'facts': '1. 업무방해\n피고인은 2020. 11. 13. 00:28경 서울 종로구 B에 있는 피해자 C이 운영하는 ‘D주점\'에서 같이 술을 마시던 E에게 폭력을 행사하는 것을 피해자가 말리자 손으로 피해자를 밀고 탁자 위에 있던 맥주병과 맥주잔을 바닥에 던지는 등 약 20분간 위력으로써 피해자의 주점 영업 업무를 방해하였다.\n2. 공무집행방해\n피고인은 2020. 11. 13. 00:39경 위와 같은 장소에서 112신고를 받고 출동한 서울혜화경찰서 F파출소 소속 경위 G과 순경 H으로부터 사건 경위에 대한 질문을 받자 "경찰은 참견하지 말아라, 니들은 나를 건드릴 수 없다"고 말하며 위 G에게 주먹을 휘두르고 손으로 어깨를 밀치고 멱살을 잡아 흔들고 옆구리 부분을 잡고 흔드는 등 폭행하여 경찰관의 범죄의 예방 및 진압 등에 관한 정당한 직무집행을 방해하였다.'}

In [None]:
facts = ["피고인은 2020. 12. 17. 20:30경 강원 속초시 B에 있는 피고인이 운영하는 ‘C 노래연습장\' 안에서, 그곳에 손님으로 온 피해자 D(여, 21세)에게 옷이 이게 뭐야, 딸 생각난다고 말하며 갑자기 피해자의 가슴 부분 윗옷을 위로 잡아당기는 듯하며 양 엄지 손가락으로 피해자의 가슴 부분을 만졌다.\n이로써 피고인은 피해자를 강제로 추행하였다.", '1. 업무방해\n피고인은 2020. 11. 13. 00:28경 서울 종로구 B에 있는 피해자 C이 운영하는 ‘D주점\'에서 같이 술을 마시던 E에게 폭력을 행사하는 것을 피해자가 말리자 손으로 피해자를 밀고 탁자 위에 있던 맥주병과 맥주잔을 바닥에 던지는 등 약 20분간 위력으로써 피해자의 주점 영업 업무를 방해하였다.\n2. 공무집행방해\n피고인은 2020. 11. 13. 00:39경 위와 같은 장소에서 112신고를 받고 출동한 서울혜화경찰서 F파출소 소속 경위 G과 순경 H으로부터 사건 경위에 대한 질문을 받자 "경찰은 참견하지 말아라, 니들은 나를 건드릴 수 없다"고 말하며 위 G에게 주먹을 휘두르고 손으로 어깨를 밀치고 멱살을 잡아 흔들고 옆구리 부분을 잡고 흔드는 등 폭행하여 경찰관의 범죄의 예방 및 진압 등에 관한 정당한 직무집행을 방해하였다.']
tokens = tokenizer(facts, padding=True, truncation=True, return_tensors="pt").to(device)

# inference only: switch to eval mode and disable gradient tracking
model.eval()
with torch.no_grad():
  output = model(**tokens)

# convert logits to probabilities and apply the same 0.05 threshold as multi_label_metrics
probs = torch.sigmoid(output.logits).cpu().numpy()
pred = (probs >= 0.05).astype(float)
print(pred)

# print the predicted statutes for the second example (index 1)
for t in np.where(probs[1] >= 0.05)[0]:
  print(id2label[int(t)])

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

# Part 2. DAPT (Domain Adaptive Pre-training)

## 1. Load libraries, model, and dataset

In [None]:
import torch
import pandas
import transformers
import datasets
import numpy as np

from transformers import AutoModel
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM

from datasets import load_dataset
from datasets import Dataset  # load_metric is unused here and is no longer available in recent datasets releases

from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.metrics import classification_report

from transformers import AdamW
from transformers import get_scheduler

from transformers import Trainer, TrainerCallback, TrainingArguments
from transformers import DataCollatorForLanguageModeling

import accelerate

device = 'cuda' if torch.cuda.is_available() else "cpu"

In [None]:
# Base Model
modelDA = AutoModelForMaskedLM.from_pretrained("klue/roberta-base")
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")


In [None]:
modelDA.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): 

In [None]:
# precedent corpus
# data_corpus = load_dataset("lbox/lbox_open", "precedent_corpus")
# data_corpus

# Since the corpus is too large to handle, I used only 30% of the whole corpus
data_corpus_30 = load_dataset("lbox/lbox_open", "precedent_corpus", split='train[:30%]')
data_corpus_30

## 2. Preprocess Data

In [None]:
# tokenize function for precedent
def tokenize_function(examples):
    return tokenizer(examples['precedent'], padding=True, truncation=True)

In [None]:
# tokenized_dataset = data_corpus.map(tokenize_function)
# tokenized_dataset = tokenized_dataset.remove_columns(['id', 'precedent'])
# tokenized_dataset = tokenized_dataset['train'].train_test_split(test_size=0.2)
# tokenized_dataset.set_format("torch")
# tokenized_dataset

In [None]:
tokenized_dataset = data_corpus_30.map(tokenize_function)
tokenized_dataset = tokenized_dataset.remove_columns(['id', 'precedent'])
tokenized_dataset



Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 45000
})

In [None]:
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)
tokenized_dataset.set_format("torch")
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 40500
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4500
    })
})

## 3. Build Trainer and Train

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, return_tensors="pt"
)
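
For intuition about what masked-language-model collation does to a sequence, the following plain-Python sketch applies the common 80/10/10 corruption recipe at a 15% selection rate. It is an illustration under stated assumptions, not the library's implementation: `mask_tokens` is a hypothetical helper, the special-token ids `{0, 1, 2}` are assumed, and the mask id `4` and vocabulary size `32000` are taken from values that appear elsewhere in this notebook.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (corrupted_ids, labels) following an 80/10/10 MLM recipe."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    for i, token in enumerate(token_ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

# Hypothetical sequence: a start token, 40 ordinary tokens, an end token.
ids = [0] + list(range(100, 140)) + [2]
corrupted, mlm_labels = mask_tokens(ids, mask_id=4, vocab_size=32000,
                                    special_ids={0, 1, 2})
selected = [i for i, label in enumerate(mlm_labels) if label != -100]
print(len(selected), "positions selected for prediction")
```

Only the selected positions contribute to the loss, and special tokens are never selected, so the model is trained to reconstruct ordinary tokens from their context.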

In [None]:
training_args = TrainingArguments(
    output_dir="/",
    per_device_eval_batch_size=8,
    per_device_train_batch_size=8,
    logging_steps=100,
    num_train_epochs=2,
    evaluation_strategy='epoch',
)

In [None]:
# Custom callback (subclass of TrainerCallback) that prints a progress header
class myCallback(TrainerCallback):
    def on_step_begin(self, args, state, control, logs=None, **kwargs):
        # at the start of every `logging_steps`-th step
        if state.global_step % args.logging_steps == 0:
            print("")
            # the message reads "epoch N in progress --- result at step M" in Korean
            print(
                f"{int(state.epoch)}번째 epoch 진행 중 --- {state.global_step}번째 step 결과"
            )

In [None]:
# Build customtrainer by inheriting Trainer
class customtrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def step_check(self):
        return self.state.global_step

    def compute_loss(self, model, inputs, return_outputs=False):
        if self.label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        outputs = model(**inputs)
        # Save past state if it exists
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            loss = self.label_smoother(outputs, labels)
        else:
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        if self.step_check() % self.args.logging_steps == 0:
            num = 1
            input_id = inputs.input_ids[num].reshape(-1).data.tolist()
            output_id = outputs.logits[num].argmax(dim=-1).reshape(-1).data.tolist()
            attention_mask = inputs.attention_mask[num]

            # locate [MASK] positions via the tokenizer instead of a hard-coded id
            mask_idx = (inputs.input_ids[num] == self.tokenizer.mask_token_id).nonzero().data.reshape(-1).tolist()

            input_id_without_pad = [
                input_id[i] for i in range(len(input_id)) if attention_mask[i]
            ]
            output_id_without_pad = [
                output_id[i] for i in range(len(output_id)) if attention_mask[i]
            ]

            inputs_tokens = self.tokenizer.convert_ids_to_tokens(input_id_without_pad)[
                1:-1
            ]
            outputs_tokens = self.tokenizer.convert_ids_to_tokens(
                output_id_without_pad
            )[1:-1]

            for i in mask_idx:
                outputs_tokens[i - 1] = "[" + outputs_tokens[i - 1] + "]"

            inputs_sen = self.tokenizer.convert_tokens_to_string(inputs_tokens)
            outputs_sen = self.tokenizer.convert_tokens_to_string(outputs_tokens)

            print(f"input 문장 : {''.join(inputs_sen)}")
            print(f"output 문장 : {''.join(outputs_sen)}")

        return (loss, outputs) if return_outputs else loss

In [None]:
trainer = customtrainer(
    model=modelDA,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    data_collator=data_collator,
    args=training_args,
    tokenizer=tokenizer,
    callbacks=[myCallback],
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



0번째 epoch 진행 중 --- 0번째 step 결과
input 문장 : 궂 1. 위 당사자 사이의 서울 [MASK]앙지방법원 2018카합20275 [MASK]금지가처분 신청사건에 관하여 위 법원이 2018. 5. 10. 한 가처분결정을 인가한다. 2 [MASK] 사비용 [MASK] 채무자 [MASK] [MASK]한다 [MASK] 이유 1. 가처분결정 주문 제1 [MASK] 기재 가처분 신청사건에 관하여 서울중앙 [MASK]법원 [MASK] 2018 [MASK] 5. 10. [MASK]들의 [MASK]을 받아들여 담보제공을 조건으로, ' 채무자 [MASK] 별지 목록 기재 [MASK] 저작물을 판매, 배포 [MASK] 전시하여서는 아니 된다 ' [MASK] 내용 [MASK] 가처분결 [MASK] ( 이하 ' 이 사건 가처분결정 ' 이라 한다 ) 을 하였다. 2 [MASK] 피 [MASK]전권리 및 보전의 [MASK]성 채무자가 이의 [MASK]청을 통하여 거듭 강조하거나 새롭게 제기 [MASK] 있 [MASK] 주장과 소명자료를 염두에 두고 기록을 살펴보아도, [MASK] 사건 [MASK]결정은 여전히 정당하여 유지할 필요성이 있 [MASK]. 따라서 이 법원은 민사집행규칙 제 [MASK]3조 [MASK] [MASK] 제2항 [MASK] [MASK]203조 제1항 제3호에 따라 이 사건 가처분결정의 이유를 그대로 인용한다. 3. 결론 [MASK] 사건 가처분 [MASK]정은 정당하므로 이를 인가하기로 하여 주문과 같이 결정한다. 2018 [MASK] 6. 7.
output 문장 : ' 1. 위 당사자 사이의 서울 [']앙지방법원 2018카합20275 [.]금지가처분 신청사건에 관하여 위 법원이 2018. 5. 10. 한 가처분결정을 인가한다. 2 [.] 사비용 [및] 채무자 [신청] [##의] 가처분 [결정] 이유 1. 가처분결정 주문 제1 [목록] 기재 가처분 신청사건에 관하여 서울중앙 [지법]법원 [은] 2018 [.] 5. 10. [원고]들의 [신청]

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.7541 | 0.70236 |
| 2 | 0.6689 | 0.619354 |



0번째 epoch 진행 중 --- 100번째 step 결과
input 문장 : 주문 원심판결 [MASK] 파기한다. 피고인들은 각 무죄. 이유 피고인들의 [MASK]이유의 요지 제1점은, 피고인 [MASK] 원심판시 사실과 [MASK] 속칭 고스 [MASK]이라는 화투놀이를 한 것은 사실이나, 이는 처방 [MASK]의 정도에 불과함에도 원심이 이를 유죄로 [MASK]한 것은 [MASK]죄 [MASK] 법리를 [MASK]함으로써 판결의 결과에 영향 [MASK] 미친 잘못이 있고, 그 제 계열점은 원심의 형량이 너무 무거워서 부당하다는 데 있다. 살피건대 원심이 증거로 한 사법 [MASK] [MASK] 작성 [MASK] 피고인들 [MASK] 대한 각 피의자신문조서의 각 기재와 피고인들의 원심 및 당심법정에서의 각 진술 등을 종합하면 [MASK] 피고인들 [MASK] 이 사건 당시 모두 공소사실 첫머리에 기재된 각 직업 [MASK] 종사하면서 각 월 50 - 80만 원 정도의 수입을 얻고 있던 [MASK]들 [MASK]서, 그 [MASK] 이 사건 고스톱을 한 장소인 서울 영등포구 [MASK]동 근처에 직장을 가지고 [MASK]거나, 또는 그 부근에 [MASK]고 있 [MASK] 세기 평소 동네 이웃으로 서로 잘 아는 사이인바, 이 사건 고스톱을 하게 된 1992. 9. 14. 에도 그날 직장일이 끝난 후 [MASK]들이 전부터 추진해 오던 친목회를 구성하기 위하여 같은 회원인 공소외 황산남이 경영하는 영등포구 당산동 1가 ( 지 [MASK] 생략 ) 소재 위 ( [MASK] 생략 ) 에 모였다가 [MASK] 오지 않 [MASK] [MASK]을 [MASK]는 상담소에 다 끝난 후 딴돈으로 술을 사 먹자고 합의가 되어 [MASK] 1시간 10분 정도 3 - [MASK]회에 걸쳐 매 [MASK]회당 3점에 500원으로 하고 [MASK] 매 2점이 올라갈 때마다 500원씩 추가되는 방법으로 고스톱을 하 [MASK] 된 사실 및 당시 피고인들이 [MASK] [MASK]고

TrainOutput(global_step=10126, training_loss=0.7977693553218581, metrics={'train_runtime': 9776.2404, 'train_samples_per_second': 8.285, 'train_steps_per_second': 1.036, 'total_flos': 2.1319957610496e+16, 'train_loss': 0.7977693553218581, 'epoch': 2.0})
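
The figures in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes training ran on a single device, so that the effective batch size equals `per_device_train_batch_size`; the variable names are illustrative.

```python
import math

train_rows, per_device_batch, epochs = 40500, 8, 2
steps_per_epoch = math.ceil(train_rows / per_device_batch)  # 5063 (last batch is partial)
total_steps = steps_per_epoch * epochs                      # 10126, matches global_step

train_runtime_s = 9776.2404
hours = train_runtime_s / 3600
steps_per_second = total_steps / train_runtime_s

print(total_steps)                 # 10126
print(round(hours, 2))             # 2.72
print(round(steps_per_second, 3))  # 1.036
```

The computed step count and throughput agree with the reported `global_step` and `train_steps_per_second`, and the runtime works out to a little under three hours.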

In [None]:
modelDA.save_pretrained(".")

**After creating a folder named 'modelDA' in the directory, I manually placed the pytorch_model.bin and config.json files into the 'modelDA' folder.**

## 4. Prepare for fine-tuning

In [None]:
# statutes classification task
data_st = load_dataset("lbox/lbox_open", "statute_classification")

# gather every statute that appears in any split
labels = set()
for split in ['train', 'validation', 'test', 'test2']:
  for statutes in data_st[split]['statutes']:
    labels.update(statutes)
# sort so that the label-to-index mapping is identical across runs
labels = sorted(labels)

# 2 dictionaries that map labels to integers and back.
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

# preprocess function
def preprocess_function(example):
    statute = [0] * len(id2label)
    for k, l in id2label.items():
        if l in example["statutes"]:
            statute[k] = float(1)
        else:
            statute[k] = float(0)
    example = tokenizer(example["facts"], truncation=True, padding=True)
    example["statutes"] = statute
    return example

# tokenize facts and encode labels
tokenized_dataset = data_st.map(preprocess_function)
tokenized_dataset = tokenized_dataset.remove_columns(["id", "casetype", "casename", "facts"])
tokenized_dataset = tokenized_dataset.rename_column("statutes", "labels")
tokenized_dataset.set_format("torch")

# Construct data_collator with padding and Iterator(dataloader)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
BATCH_SIZE = 4

# Construct Dataloader
train_dataloader = DataLoader(tokenized_dataset["train"], shuffle=True, batch_size=BATCH_SIZE, collate_fn=data_collator)
# evaluation loaders do not need shuffling; a fixed order keeps the reported metrics reproducible
validation_dataloader = DataLoader(tokenized_dataset["validation"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)
test_dataloader = DataLoader(tokenized_dataset["test"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)
test2_dataloader = DataLoader(tokenized_dataset["test2"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)

# multi_label_metrics function
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
def multi_label_metrics(predictions, labels, threshold=0.05):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1

    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')

    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc}
    return metrics

def mergeDict(dictA, dictB):
  for key in dictB:
    if key in dictA:
        dictB[key] = (dictB[key] + dictA[key])/2
    else:
        pass
  res = dictA | dictB
  return res


In [None]:
# Define Model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    './modelDA',
    problem_type="multi_label_classification",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id)
model.to(device)

# Define Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Set hyperparameters
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Define training function
def train(model, dataloader, optimizer):
  epoch_loss = 0
  metrics = {}

  model.train()
  for batch in dataloader:  # iterate over the loader passed in as an argument
    optimizer.zero_grad()

    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    epoch_loss += loss.item()

    logits = outputs.logits.cpu()
    metrics = mergeDict(metrics, multi_label_metrics(predictions=logits, labels=batch['labels'].cpu()))

    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

  return epoch_loss / len(dataloader), metrics

# Define evaluation function
def evaluate(model, dataloader):
  model.eval()
  test_loss = 0
  metrics = {}

  for batch in dataloader:  # iterate over the loader passed in as an argument
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():
      outputs = model(**batch)

    loss = outputs.loss
    test_loss += loss.item()

    logits = outputs.logits.cpu()
    metrics = mergeDict(metrics, multi_label_metrics(predictions=logits, labels=batch['labels'].cpu()))

  return test_loss / len(dataloader), metrics

# Define printMetrics function
def printMetrics(exp, metrics):
  for key in metrics.keys():
    print(f'\t {exp} {key}: {metrics[key]:.3f}', end=' ')
  print()

Some weights of the model checkpoint at ./modelDA were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./modelDA and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably
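
The number of scheduler steps and the shape of the `"linear"` schedule can be worked out by hand. The sketch below is a plain-Python approximation of a linear warmup/decay rule, not the library implementation; `linear_lr` is a hypothetical helper, and the batch size of 4 is an assumption that is consistent with the 2208 training examples and the 1656 total steps shown by the progress bar during fine-tuning.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * (step / max(1, num_warmup_steps))
    remaining = max(0, num_training_steps - step)
    return base_lr * (remaining / max(1, num_training_steps - num_warmup_steps))

train_size = 2208          # rows in the train split
batch_size = 4             # assumption, see the note above
num_epochs = 3

steps_per_epoch = math.ceil(train_size / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, num_training_steps)   # 552 1656

for step in (0, 828, 1656):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With no warmup, the learning rate starts at 5e-5, is halved at the midpoint (step 828), and reaches 0 at the last step.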

## 5. Fine-tuning and Evaluation

In [None]:
from transformers import get_scheduler
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs) :
  train_loss, train_metrics = train(model, train_dataloader, optimizer)
  validation_loss, validation_metrics = evaluate(model, validation_dataloader)

  print(f'Epoch: {epoch+1:02}')
  print(f'\t Train loss: {train_loss:.3f}')
  printMetrics("Train", train_metrics)
  print(f'\t Val. loss: {validation_loss:.3f}')
  printMetrics("Validation", validation_metrics)

  0%|          | 0/1656 [00:00<?, ?it/s]

Epoch: 01
	 Train loss: 0.087
	 Train f1: 0.136 	 Train roc_auc: 0.645 
	 Val. loss: 0.040
	 Validation f1: 0.116 	 Validation roc_auc: 0.640 
Epoch: 02
	 Train loss: 0.040
	 Train f1: 0.230 	 Train roc_auc: 0.814 
	 Val. loss: 0.039
	 Validation f1: 0.152 	 Validation roc_auc: 0.741 
Epoch: 03
	 Train loss: 0.039
	 Train f1: 0.177 	 Train roc_auc: 0.772 
	 Val. loss: 0.037
	 Validation f1: 0.197 	 Validation roc_auc: 0.798 


In [None]:
test_loss, test_metrics = evaluate(model, test_dataloader)
test2_loss, test2_metrics = evaluate(model, test2_dataloader)

print(f'\t Test loss: {test_loss:.3f}')
printMetrics("Test", test_metrics)
print(f'\t Test2 loss: {test2_loss:.3f}')
printMetrics("Test2", test2_metrics)

	 Test loss: 0.037
	 Test f1: 0.157 	 Test roc_auc: 0.732 
	 Test2 loss: 0.019
	 Test2 f1: 0.219 	 Test2 roc_auc: 0.778 


# Part 3. Baseline (KLUE w/o fine-tuning)

In [None]:
import torch
import pandas
import transformers
import datasets
import numpy as np

from transformers import AutoModel
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

from datasets import load_dataset
from datasets import Dataset  # load_metric is unused here and is no longer available in recent `datasets` releases

from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

from transformers import Trainer, TrainerCallback, TrainingArguments
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of the PyTorch one
from transformers import get_scheduler

from tqdm.auto import tqdm

import accelerate

device = 'cuda' if torch.cuda.is_available() else "cpu"

In [None]:
# Load Model and Tokenizer
model_name = "klue/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# statutes classification task
data_st = load_dataset("lbox/lbox_open", "statute_classification")

# gather every statute label that appears in any split
labels = set()
for split in ("train", "validation", "test", "test2"):
  for statutes in data_st[split]["statutes"]:
    labels.update(statutes)
# sort so that the label <-> id mapping is the same on every run
# (the iteration order of a set of strings is not stable across sessions)
labels = sorted(labels)

# 2 dictionaries that map labels to integers and back.
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

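# preprocess function: tokenize the facts and build a multi-hot label vector
def preprocess_function(example):
    # 1.0 where the statute applies to this case, 0.0 elsewhere
    # (floats, because the multi-label loss expects float targets)
    statute = [float(label in example["statutes"]) for label in id2label.values()]
    # padding is left to the data collator, which pads each batch dynamically
    example = tokenizer(example["facts"], truncation=True)
    example["statutes"] = statute
    return example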

tokenized_dataset = data_st.map(preprocess_function)
tokenized_dataset = tokenized_dataset.remove_columns(["id", "casetype", "casename", "facts"])
tokenized_dataset = tokenized_dataset.rename_column("statutes", "labels")
tokenized_dataset.set_format("torch")
tokenized_dataset




DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2208
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 276
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 276
    })
    test2: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 538
    })
})
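
The `labels` column above holds one multi-hot vector per case. A minimal sketch of that encoding on toy data follows; `build_label_maps` and `multi_hot` are hypothetical helper names, and the statute names are placeholders rather than real entries from the dataset.

```python
def build_label_maps(label_lists):
    """Collect every label seen in the data and assign ids in sorted order."""
    vocab = sorted({label for labels in label_lists for label in labels})
    id2label = dict(enumerate(vocab))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id

def multi_hot(example_labels, label2id):
    """Return a float vector with 1.0 at the index of each label of the example."""
    vec = [0.0] * len(label2id)
    for label in example_labels:
        vec[label2id[label]] = 1.0
    return vec

# Placeholder data: three cases, each with one or more statutes.
cases = [["statute_B"], ["statute_A", "statute_C"], ["statute_B", "statute_C"]]
id2label, label2id = build_label_maps(cases)

print(id2label)                       # {0: 'statute_A', 1: 'statute_B', 2: 'statute_C'}
print(multi_hot(cases[1], label2id))  # [1.0, 0.0, 1.0]
```

Sorting the label vocabulary before numbering keeps the mapping identical between sessions, which matters when a fine-tuned checkpoint is reloaded later.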

In [None]:
# Construct data_collator with padding and Iterator(dataloader)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
BATCH_SIZE = 4

# evaluation does not need shuffling; a fixed order keeps the results reproducible
test_dataloader = DataLoader(tokenized_dataset["test"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)
test2_dataloader = DataLoader(tokenized_dataset["test2"], shuffle=False, batch_size=BATCH_SIZE, collate_fn=data_collator)

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import torch

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.05):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # print("probs: ", probs)

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # print("pred: ", y_pred)

    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')

    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc}
    return metrics

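def mergeDict(dictA, dictB):
  # dictA: metrics accumulated so far, dictB: metrics of the newest batch.
  # Keys present in both are averaged pairwise, so repeated calls weight
  # recent batches more heavily than a plain mean over all batches would.
  for key in dictB:
    if key in dictA:
        dictB[key] = (dictB[key] + dictA[key])/2
  res = dictA | dictB  # dict union operator requires Python 3.9+
  return res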
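
Because `mergeDict` halves the previously accumulated value on every call, the last batch contributes half of the final number, the one before it a quarter, and so on. The sketch below contrasts that behaviour with a plain arithmetic mean over batches. `pairwise_running_average` mimics the update rule for a single metric, and `MetricAverager` is a hypothetical alternative, not part of the original notebook.

```python
def pairwise_running_average(values):
    """Replay the (old + new) / 2 update rule for a sequence of per-batch values."""
    avg = None
    for v in values:
        avg = v if avg is None else (avg + v) / 2
    return avg

class MetricAverager:
    """Accumulate per-batch metric dicts and report their arithmetic mean."""
    def __init__(self):
        self.sums = {}
        self.count = 0

    def update(self, metrics):
        for key, value in metrics.items():
            self.sums[key] = self.sums.get(key, 0.0) + value
        self.count += 1

    def result(self):
        if self.count == 0:
            return {}
        return {key: total / self.count for key, total in self.sums.items()}

# Made-up per-batch F1 scores: three poor batches followed by one perfect batch.
batch_f1 = [0.0, 0.0, 0.0, 1.0]

averager = MetricAverager()
for f1 in batch_f1:
    averager.update({"f1": f1})

print(pairwise_running_average(batch_f1))  # 0.5
print(averager.result())                   # {'f1': 0.25}
```

On this toy sequence the pairwise rule reports 0.5 while the mean over batches is 0.25, so the order of the batches changes the reported metric. Pooling all logits and labels and calling the metric function once would avoid per-batch averaging altogether.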

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    problem_type="multi_label_classification",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id)

model.to(device)

Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [None]:
def evaluate(model, dataloader):
  model.eval()
  test_loss = 0
  metrics = {}

  for batch in dataloader:  # iterate over the loader passed in as an argument
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():
      outputs = model(**batch)

    loss = outputs.loss
    test_loss += loss.item()

    logits = outputs.logits.cpu()
    metrics = mergeDict(metrics, multi_label_metrics(predictions=logits, labels=batch['labels'].cpu()))

  return test_loss / len(dataloader), metrics

def printMetrics(exp, metrics):
  for key in metrics.keys():
    print(f'\t {exp} {key}: {metrics[key]:.3f}', end=' ')
  print()

In [None]:
test_loss, test_metrics = evaluate(model, test_dataloader)
test2_loss, test2_metrics = evaluate(model, test2_dataloader)

print(f'\t Test loss: {test_loss:.3f}')
printMetrics("Test", test_metrics)
print(f'\t Test2 loss: {test2_loss:.3f}')
printMetrics("Test2", test2_metrics)

	 Test loss: 0.704
	 Test f1: 0.017 	 Test roc_auc: 0.500 
	 Test2 loss: 0.360
	 Test2 f1: 0.018 	 Test2 roc_auc: 0.500 
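
One plausible reading of these baseline numbers, offered as a sanity check rather than a confirmed diagnosis: a freshly initialized classification head tends to emit logits near 0, so every probability is close to 0.5. Under that assumption the binary cross-entropy per label is about ln 2, every label passes the 0.05 threshold, and a constant prediction gives a ROC AUC of 0.5. If every label is predicted positive, micro F1 reduces to 2p / (1 + p), where p is the fraction of (case, label) cells that are truly positive. The helper names below are hypothetical.

```python
import math

def bce_at_uniform_guess():
    """Binary cross-entropy per label when the model predicts 0.5 everywhere."""
    return -math.log(0.5)

def all_positive_micro_f1(positive_rate):
    """Micro F1 when every label is predicted positive: TP = P, FP = N - P, FN = 0."""
    return 2 * positive_rate / (1 + positive_rate)

def implied_positive_rate(f1):
    """Invert F1 = 2p / (1 + p) to recover p."""
    return f1 / (2 - f1)

print(round(bce_at_uniform_guess(), 3))         # 0.693
print(round(implied_positive_rate(0.017), 4))   # 0.0086
```

The value 0.693 is close to the reported Test loss of 0.704, and an F1 near 0.017 would correspond to fewer than 1% of the label cells being positive. Since the reported F1 is averaged per batch rather than pooled, this is only a rough consistency check.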
