# K-MHaS (Korean Multi-label Hate Speech Dataset)

## Dataset loading

Load the K-MHaS dataset from [HuggingFace](https://huggingface.co/datasets/jeanlee/kmhas_korean_hate_speech) and check its meta information. The dataset paper was published at COLING 2022.


In [None]:
!pip install transformers
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1

In [None]:
from datasets import load_dataset

dataset = load_dataset("jeanlee/kmhas_korean_hate_speech")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/5.24M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/579k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.46M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/78977 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8776 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/21939 [00:00<?, ? examples/s]

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 78977
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 8776
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 21939
    })
})
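
The three split sizes printed above add up to the roughly 109k utterances reported for the dataset. As a quick sanity check, the proportions can be computed from the `num_rows` values alone (the variable names below are illustrative, not part of the notebook):

```python
# Split sizes copied from the DatasetDict output above.
split_sizes = {"train": 78977, "validation": 8776, "test": 21939}

total = sum(split_sizes.values())
fractions = {name: n / total for name, n in split_sizes.items()}

print("total:", total)
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

This corresponds to an approximately 72/8/20 train/validation/test split.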

In [None]:
dataset = load_dataset("jeanlee/kmhas_korean_hate_speech", split="test")

In [None]:
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 21939
})

In [None]:
dataset.features

{'text': Value(dtype='string', id=None),
 'label': Sequence(feature=ClassLabel(names=['origin', 'physical', 'politics', 'profanity', 'age', 'gender', 'race', 'religion', 'not_hate_speech'], id=None), length=-1, id=None)}

In [None]:
# meta information

print(dataset.info.description)
print(dataset.info.homepage)

The K-MHaS (Korean Multi-label Hate Speech) dataset contains 109k utterances from Korean online news comments labeled with 8 fine-grained hate speech classes or Not Hate Speech class.
The fine-grained hate speech classes are politics, origin, physical, age, gender, religion, race, and profanity and these categories are selected in order to reflect the social and historical context.

https://github.com/adlnlp/K-MHaS


In [None]:
print(dataset.info.citation)

@inproceedings{lee-etal-2022-k,
    title = "K-{MH}a{S}: A Multi-label Hate Speech Detection Dataset in {K}orean Online News Comment",
    author = "Lee, Jean  and
      Lim, Taejun  and
      Lee, Heejun  and
      Jo, Bogeun  and
      Kim, Yangsok  and
      Yoon, Heegeun  and
      Han, Soyeon Caren",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.311",
    pages = "3530--3538",
    abstract = "Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news commen

## Data preparation

- Prepare data from the train, validation, and test splits
- Each multi-label annotation (a list of class indices) is converted to a multi-hot vector with one 0/1 entry per class

      class_label:
        names:
          0: origin
          1: physical
          2: politics
          3: profanity
          4: age
          5: gender
          6: race
          7: religion
          8: not_hate_speech
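
The conversion described above can be sketched in plain Python. This is a minimal illustration that assumes the class order listed here; `to_multi_hot` is a hypothetical helper, not a function from the notebook or from any library:

```python
# Class names in index order, as listed above.
CLASS_NAMES = ["origin", "physical", "politics", "profanity", "age",
               "gender", "race", "religion", "not_hate_speech"]

def to_multi_hot(label_ids, num_classes=len(CLASS_NAMES)):
    """Turn a list of class indices into a fixed-length 0/1 vector."""
    vector = [0.0] * num_classes
    for idx in label_ids:
        vector[idx] = 1.0
    return vector

print(to_multi_hot([8]))     # only the not_hate_speech class is set
print(to_multi_hot([2, 3]))  # politics and profanity are both set
```

Every vector has the same length regardless of how many labels a comment carries, which is what a model with one output per class expects as a target.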

In [None]:
#!pip install transformers
#!pip install datasets

In [None]:
!pip install keras-preprocessing

Collecting keras-preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Installing collected packages: keras-preprocessing
Successfully installed keras-preprocessing-1.1.2


In [None]:
from datasets import load_dataset

dataset = load_dataset("jeanlee/kmhas_korean_hate_speech")

In [None]:
import tensorflow as tf
import torch

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras_preprocessing.sequence import pad_sequences

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

import pandas as pd
import numpy as np
import random
import time
import datetime
from tqdm import tqdm

import csv
import os



In [None]:
# load train, validation, and test dataset from HuggingFace

train = load_dataset("jeanlee/kmhas_korean_hate_speech", split="train")
validation = load_dataset("jeanlee/kmhas_korean_hate_speech", split="validation")
test = load_dataset("jeanlee/kmhas_korean_hate_speech", split="test")

In [None]:
# add the BERT special tokens [CLS] and [SEP] around each sentence (this step can be removed depending on the model)

train_sentences = list(map(lambda x: '[CLS] ' + str(x) + ' [SEP]', train['text']))
validation_sentences = list(map(lambda x: '[CLS] ' + str(x) + ' [SEP]', validation['text']))
test_sentences = list(map(lambda x: '[CLS] ' + str(x) + ' [SEP]', test['text']))

In [None]:
# convert each list of label indices to a multi-hot vector (one 0/1 entry per class)
# [8] -> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

from sklearn.preprocessing import MultiLabelBinarizer

# fix the class order to indices 0..8 so that every split gets the same 9 columns,
# even if a class happens to be missing from one of the splits
enc = MultiLabelBinarizer(classes=list(range(9)))

def multi_label(example):
    enc_label = enc.fit_transform(example['label'])
    return enc_label.astype(float).tolist()

train_labels = multi_label(train)
validation_labels = multi_label(validation)
test_labels = multi_label(test)

In [None]:
test_sentences[:10]

['[CLS] 그만큼 길예르모가 잘했다고 보면되겠지 기대되네 셰이프 오브 워터 [SEP]',
 '[CLS] "1. 8넘의 문재앙" [SEP]',
 '[CLS] "문재인 정권의 내로남불은 타의 추종을 불허하네. 자한당 욕할거리도 없음." [SEP]',
 '[CLS] "짱개들 지나간 곳은 폐허된다 ㅋㅋ" [SEP]',
 '[CLS] 곱창은 자갈치~~~~~ [SEP]',
 '[CLS] 밥맛없게생겼냐 [SEP]',
 '[CLS] 알고 보니 외국 국적? 또는 국가유공자? [SEP]',
 '[CLS] "중국 유학생, 중국인들 입국 금지시키고 그들을 위해 쓰여질 많은 세금을 줄여" [SEP]',
 '[CLS] "댓글 길게 쓴거보니 우리 도태한녀 화 많이 났넹 ㅋㅋ 우쭈쭈" [SEP]',
 '[CLS] 이미연 닮음 [SEP]']

In [None]:
test_labels[:10]

[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
 [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
 [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
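
To read these rows, each 1.0 can be mapped back to its class name. The snippet below is a small sketch using the class order from the dataset features; `decode` is a hypothetical helper, and the three rows are copied from the output above (the second, fourth, and ninth rows):

```python
CLASS_NAMES = ["origin", "physical", "politics", "profanity", "age",
               "gender", "race", "religion", "not_hate_speech"]

def decode(row):
    """Return the class names whose position in the multi-hot row is set."""
    return [name for name, flag in zip(CLASS_NAMES, row) if flag == 1.0]

rows = [
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
]
for row in rows:
    print(decode(row))
```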

## Preparation for PyTorch

In [None]:
# Tokenizing : bert-base-multilingual-cased

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]



config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

In [None]:
MAX_LEN = 128

def data_to_tensor(sentences, labels):
  # split each sentence into WordPiece tokens and map the tokens to vocabulary ids
  tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
  input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
  # pad with 0 or truncate at the end so that every sequence has length MAX_LEN
  input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

  # attention mask: 1.0 for real tokens, 0.0 for padding (id 0)
  attention_masks = [[float(i > 0) for i in seq] for seq in input_ids]

  tensor_inputs = torch.tensor(input_ids)
  tensor_labels = torch.tensor(labels)
  tensor_masks = torch.tensor(attention_masks)

  return tensor_inputs, tensor_labels, tensor_masks
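
The padding and masking step can be illustrated without any library. This is a sketch of the same "post" truncation and "post" padding behaviour on a single sequence; the token ids are made up, `pad_and_mask` is a hypothetical helper, and `MAX_LEN` is set to 8 only to keep the output short:

```python
MAX_LEN = 8   # small value for illustration; the notebook uses 128
PAD_ID = 0    # padding id, matching the 0 used for padding in the notebook

def pad_and_mask(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate or right-pad token ids, then mark real tokens with 1.0."""
    ids = token_ids[:max_len]                     # drop tokens past max_len
    ids = ids + [pad_id] * (max_len - len(ids))   # fill the rest with padding
    mask = [float(i > 0) for i in ids]            # 1.0 = real token, 0.0 = padding
    return ids, mask

ids, mask = pad_and_mask([101, 9521, 118741, 102])  # made-up token ids
print(ids)
print(mask)

long_ids, long_mask = pad_and_mask(list(range(1, 11)))  # 10 tokens, longer than MAX_LEN
print(long_ids)
```

Note that with truncation at the end, a sequence longer than `MAX_LEN` loses its trailing tokens, including a final `[SEP]` if one was added.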


In [None]:
train_inputs, train_labels, train_masks = data_to_tensor(train_sentences, train_labels)
validation_inputs, validation_labels, validation_masks = data_to_tensor(validation_sentences, validation_labels)
test_inputs, test_labels, test_masks = data_to_tensor(test_sentences, test_labels)

In [None]:
batch_size = 32

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)  # fixed order for evaluation; shuffling is not needed at test time
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

In [None]:
print('testset size:', len(test_labels))
print('trainset size:', len(train_labels))
print('validset size:', len(validation_labels))

testset size: 21939
trainset size: 78977
validset size: 8776
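
With `batch_size = 32`, the number of batches per split follows from a ceiling division, because the last batch may be smaller than the others. A short check using the split sizes printed above (variable names are illustrative):

```python
import math

batch_size = 32
split_sizes = {"train": 78977, "validation": 8776, "test": 21939}

# The last batch of each split may hold fewer than batch_size examples.
num_batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}
print(num_batches)
```

The train and test values match the batch totals shown in the training and test progress logs of this notebook (2,469 and 686).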


## Multi-BERT model

### GPU setting

In [None]:
device_name = tf.test.gpu_device_name()
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


### Model setting

In [None]:
num_labels = 9

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=num_labels, problem_type="multi_label_classification")
model.to(device)

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [None]:
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8
                )

# change epochs for improving results (our paper : epochs = 4)
epochs = 2
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
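
With `num_warmup_steps = 0`, the schedule decays the learning rate linearly from its initial value to zero over `total_steps`. The sketch below re-implements that shape in plain Python for illustration only; it reflects my understanding of the schedule and is not the library's code. `linear_schedule_factor` is a hypothetical name, and 2469 is the number of training batches per epoch reported in the training log:

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    # linear decay from 1 to 0 after warmup
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 2469 * 2  # batches per epoch * epochs

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

Because the decay is tied to `total_steps`, changing `epochs` also changes how quickly the learning rate falls within each epoch.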



In [None]:
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))  # hh:mm:ss

In [None]:
def multi_label_metrics(predictions, labels, threshold=0.5):

    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1

    # finally, compute metrics
    y_true = labels
    accuracy = accuracy_score(y_true, y_pred)
    f1_macro_average = f1_score(y_true=y_true, y_pred=y_pred, average='macro', zero_division=0)
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro', zero_division=0)
    f1_weighted_average = f1_score(y_true=y_true, y_pred=y_pred, average='weighted', zero_division=0)
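    # note: ROC-AUC is computed here on the thresholded 0/1 predictions (y_pred), not on the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')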
    hamming = hamming_loss(y_true, y_pred)

    # return as dictionary
    metrics = {'accuracy': accuracy,
               'f1_macro': f1_macro_average,
               'f1_micro': f1_micro_average,
               'f1_weighted': f1_weighted_average,
               'roc_auc': roc_auc,
               'hamming_loss': hamming}

    return metrics
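
For multi-label targets, `accuracy_score` counts an example as correct only if every label matches (subset accuracy), while Hamming loss counts individual label errors. The following plain-Python sketch shows the difference on two made-up examples with three labels; the helper names are hypothetical and the logits are invented for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binarize(logit_rows, threshold=0.5):
    """Apply sigmoid to each logit and threshold the result to 0 or 1."""
    return [[1 if sigmoid(z) >= threshold else 0 for z in row] for row in logit_rows]

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming(y_true, y_pred):
    """Fraction of individual label positions that are wrong."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

logits = [[2.0, -1.0, -3.0], [0.5, 1.5, -2.0]]  # made-up model outputs
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = binarize(logits)

print(y_pred)
print(subset_accuracy(y_true, y_pred))
print(hamming(y_true, y_pred))
```

A single wrong label in the second example halves the subset accuracy but affects only one of the six label positions, which is why accuracy can look much lower than one minus the Hamming loss.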

### Model training

In [None]:
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

model.zero_grad()
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()
    total_loss = 0

    model.train()

    for step, batch in tqdm(enumerate(train_dataloader)):
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)

        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping if it is over a threshold
        optimizer.step()
        scheduler.step()

        model.zero_grad()

    avg_train_loss = total_loss / len(train_dataloader)

    print("")
    print("  Average training loss: {0:.4f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...


500it [05:14,  1.56it/s]

  Batch   500  of  2,469.    Elapsed: 0:05:14.


1000it [10:34,  1.57it/s]

  Batch 1,000  of  2,469.    Elapsed: 0:10:34.


1500it [15:54,  1.56it/s]

  Batch 1,500  of  2,469.    Elapsed: 0:15:54.


2000it [21:14,  1.56it/s]

  Batch 2,000  of  2,469.    Elapsed: 0:21:14.


2469it [26:14,  1.57it/s]


  Average training loss: 0.1740
  Training epoch took: 0:26:14

Training complete!





In [None]:
# ========================================
#               Validation
# ========================================

print("")
print("Running Validation...")

t0 = time.time()
model.eval()
accum_logits, accum_label_ids = [], []

for batch in validation_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    for b in logits:
        accum_logits.append(list(b))

    for b in label_ids:
        accum_label_ids.append(list(b))

accum_logits = np.array(accum_logits)
accum_label_ids = np.array(accum_label_ids)
results = multi_label_metrics(accum_logits, accum_label_ids)

print("Accuracy: {0:.4f}".format(results['accuracy']))
print("F1 (Macro) Score: {0:.4f}".format(results['f1_macro']))
print("F1 (Micro) Score: {0:.4f}".format(results['f1_micro']))
print("F1 (Weighted) Score: {0:.4f}".format(results['f1_weighted']))
print("ROC-AUC: {0:.4f}".format(results['roc_auc']))
print("Hamming Loss: {0:.4f}".format(results['hamming_loss']))
print("Validation took: {:}".format(format_time(time.time() - t0)))


Running Validation...
Accuracy: 0.7248
F1 (Macro) Score: 0.6513
F1 (Micro) Score: 0.7864
F1 (Weighted) Score: 0.7804
ROC-AUC: 0.8648
Hamming Loss: 0.0514
Validation took: 0:00:56


### Evaluation

In [None]:
# save model

import torch

# Path where the model weights are saved
path = './'  # save in the current directory
torch.save(model.state_dict(), path + "BERT_model.pt")


In [None]:
# load model

from transformers import BertForSequenceClassification

# Re-initialize the model
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=num_labels)

# Path to load the saved weights from
path = './'  # current directory
model.load_state_dict(torch.load(path + "BERT_model.pt"))

# Switch the model to evaluation mode
model.eval()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
import time
import torch
from tqdm import tqdm

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

# Device setup (use the GPU if available, otherwise the CPU)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Set the model to evaluation mode
t0 = time.time()
model.eval()
accum_logits, accum_label_ids = [], []

for step, batch in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):
    if step % 100 == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataloader), elapsed))

    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.cpu().numpy()

    for b in logits:
        accum_logits.append(list(b))

    for b in label_ids:
        accum_label_ids.append(list(b))


 15%|█▍        | 100/686 [00:20<02:02,  4.77it/s]

  Batch   100  of    686.    Elapsed: 0:00:20.


 29%|██▉       | 200/686 [00:41<01:42,  4.73it/s]

  Batch   200  of    686.    Elapsed: 0:00:42.


 44%|████▎     | 300/686 [01:02<01:18,  4.90it/s]

  Batch   300  of    686.    Elapsed: 0:01:02.


 58%|█████▊    | 401/686 [01:22<00:57,  4.93it/s]

  Batch   400  of    686.    Elapsed: 0:01:23.


 73%|███████▎  | 500/686 [01:42<00:37,  4.91it/s]

  Batch   500  of    686.    Elapsed: 0:01:43.


 87%|████████▋ | 600/686 [02:03<00:17,  4.84it/s]

  Batch   600  of    686.    Elapsed: 0:02:04.


100%|██████████| 686/686 [02:21<00:00,  4.86it/s]


### Evaluation breakdown by number of labels

In [None]:
import numpy as np

accum_results = []

# Check the distribution of label counts in the dataset
label_counts = [len(np.where(labels)[0]) for labels in accum_label_ids]
label_distribution = np.bincount(label_counts)
print(f'Label counts distribution: {label_distribution}')

for i in range(num_labels):
    ith_label_ids, ith_logits = [], []

    for j, labels in enumerate(accum_label_ids):
        if len(np.where(labels)[0]) == i + 1:
            ith_label_ids.append(accum_label_ids[j])
            ith_logits.append(accum_logits[j])

    ith_label_ids = np.array(ith_label_ids)
    ith_logits = np.array(ith_logits)

    if ith_label_ids.shape[0] == 0 or ith_logits.shape[0] == 0:
        print(f'No data for {i+1} labels.')
        continue

    # Ensure that the shapes are consistent
    try:
        assert ith_label_ids.shape == ith_logits.shape
    except AssertionError:
        print(f"Inconsistent shapes for {i+1} labels: {ith_label_ids.shape} and {ith_logits.shape}")
        continue

    results = multi_label_metrics(ith_logits, ith_label_ids)
    accum_results.append(list(results.values()))

    print('# of labels:', i + 1)
    print("Accuracy: {0:.4f}".format(results['accuracy']))
    print("F1 (Macro) Score: {0:.4f}".format(results['f1_macro']))
    print("F1 (Micro) Score: {0:.4f}".format(results['f1_micro']))
    print("F1 (Weighted) Score: {0:.4f}".format(results['f1_weighted']))
    print("ROC-AUC: {0:.4f}".format(results['roc_auc']))
    print("Hamming Loss: {0:.4f}".format(results['hamming_loss']))

    print('\n')


Label counts distribution: [    0 19185  2439   290    25]
# of labels: 1
Accuracy: 0.7671
F1 (Macro) Score: 0.6304
F1 (Micro) Score: 0.7978
F1 (Weighted) Score: 0.7954
ROC-AUC: 0.8837
Hamming Loss: 0.0446


# of labels: 2
Accuracy: 0.4461
F1 (Macro) Score: 0.5983
F1 (Micro) Score: 0.7517
F1 (Weighted) Score: 0.7646
ROC-AUC: 0.8123
Hamming Loss: 0.0947


# of labels: 3
Accuracy: 0.1241
F1 (Macro) Score: 0.5401
F1 (Micro) Score: 0.6908
F1 (Weighted) Score: 0.6888
ROC-AUC: 0.7644
Hamming Loss: 0.1636


# of labels: 4
Accuracy: 0.0400
F1 (Macro) Score: 0.4238
F1 (Micro) Score: 0.6494
F1 (Weighted) Score: 0.6376
ROC-AUC: 0.7340
Hamming Loss: 0.2400


No data for 5 labels.
No data for 6 labels.
No data for 7 labels.
No data for 8 labels.
No data for 9 labels.
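
The breakdown above groups test examples by how many labels they carry and scores each group separately. The grouping idea can be sketched in plain Python as follows; the data is made up and `accuracy_by_label_count` is a hypothetical helper that reports only exact-match accuracy:

```python
from collections import defaultdict

def accuracy_by_label_count(y_true, y_pred):
    """Exact-match accuracy, grouped by the number of true labels per example."""
    groups = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        groups[sum(t)].append(t == p)
    return {count: sum(hits) / len(hits) for count, hits in sorted(groups.items())}

# Made-up multi-hot targets and predictions with three classes.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [1, 0, 0]]

print(accuracy_by_label_count(y_true, y_pred))
```

Exact-match accuracy requires every label of an example to be right, so it tends to fall as the number of labels per example grows, which is consistent with the drop from 0.7671 to 0.0400 reported above.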


### Test

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.metrics import classification_report, roc_auc_score
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Load the pre-trained model
model_path = 'bert-base-multilingual-cased'
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=9)

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_path)

# Load the test dataset
# Assumes `dataset` (the DatasetDict with a 'test' split) has already been loaded
test_dataset = dataset['test']

# Tokenize the input texts
test_texts = test_dataset['text']
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')

# Load the test labels
test_labels = test_dataset['label']

# Inspect the structure of the test labels
print(f"Original test labels shape: {len(test_labels)}")
print(test_labels[:5])

# Convert the labels to a 2-D multi-hot array with 9 columns
fixed_test_labels = []
for labels in test_labels:
    fixed_labels = np.zeros(9)
    for label in labels:
        fixed_labels[label] = 1
    fixed_test_labels.append(fixed_labels)

# Convert the labels to a NumPy array
test_labels = np.array(fixed_test_labels)

# Convert the labels to a PyTorch tensor
labels = torch.tensor(test_labels, dtype=torch.float)

# Convert inputs and labels to PyTorch tensors
input_ids = test_encodings['input_ids']
attention_mask = test_encodings['attention_mask']

# Check if the sizes match
assert input_ids.size(0) == labels.size(0) == attention_mask.size(0), "Size mismatch between tensors"

# Create a DataLoader for the test dataset
batch_size = 16  # Adjust this based on your GPU memory
test_data = TensorDataset(input_ids, attention_mask, labels)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# Device setup (use the GPU if available, otherwise the CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

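# Train the model (a simplified training loop is included here)
# NOTE: the loop below iterates over test_dataloader, so the model is fine-tuned on the
# same test split that is evaluated afterwards; the scores are not held-out estimates.
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(1):  # adjust the number of training epochs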
    for batch in test_dataloader:
        b_input_ids, b_attention_mask, b_labels = tuple(t.to(device) for t in batch)

        outputs = model(b_input_ids, attention_mask=b_attention_mask, labels=b_labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} finished.")

# Evaluate the model
model.eval()
test_predictions, true_labels = [], []

for batch in test_dataloader:
    b_input_ids, b_attention_mask, b_labels = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        outputs = model(b_input_ids, attention_mask=b_attention_mask)

    logits = outputs.logits
    predictions = torch.sigmoid(logits).cpu().numpy()  # sigmoid for multi-label output; the threshold is applied below
    test_predictions.extend(predictions)
    true_labels.extend(b_labels.cpu().numpy())

# Convert the predicted probabilities to binary values
threshold = 0.5
test_predictions = np.array(test_predictions)
binary_predictions = (test_predictions > threshold).astype(int)

# Print the classification report
print(classification_report(true_labels, binary_predictions, zero_division=0))

# Compute ROC AUC for each label, only when both classes are present in the true labels
for i in range(9):
    true_label = [label[i] for label in true_labels]
    pred_prob = [pred[i] for pred in test_predictions]

    if len(np.unique(true_label)) > 1:
        roc_auc = roc_auc_score(true_label, pred_prob)
        print(f"ROC AUC for label {i}: {roc_auc:.4f}")
    else:
        print(f"Only one class present in true labels for label {i}. ROC AUC score is not defined.")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Original test labels shape: 21939
[[8], [2, 3], [2], [0], [8]]




Epoch 1 finished.
              precision    recall  f1-score   support

           0       0.88      0.50      0.64      2166
           1       0.77      0.53      0.63      1747
           2       0.87      0.67      0.75      2456
           3       0.91      0.51      0.66      3224
           4       0.90      0.58      0.71      1490
           5       0.61      0.55      0.58      1581
           6       0.00      0.00      0.00        58
           7       0.88      0.64      0.74       492
           8       0.82      0.89      0.86     11819

   micro avg       0.83      0.71      0.77     25033
   macro avg       0.74      0.54      0.62     25033
weighted avg       0.83      0.71      0.75     25033
 samples avg       0.77      0.74      0.75     25033

ROC AUC for label 0: 0.9476
ROC AUC for label 1: 0.9127
ROC AUC for label 2: 0.9604
ROC AUC for label 3: 0.9364
ROC AUC for label 4: 0.9529
ROC AUC for label 5: 0.9257
ROC AUC for label 6: 0.7078
ROC AUC for label 7: 0.9785
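
The gap between the micro and macro averages in the report comes from how rare labels are weighted: macro averaging gives every label the same weight, while micro averaging pools the counts first. A minimal sketch with hypothetical counts (one frequent label and one rare label that is never predicted, similar in spirit to label 6 above):

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-label counts as (tp, fp, fn).
counts = {"frequent": (90, 10, 10), "rare": (0, 0, 5)}

# Macro: average the per-label F1 scores.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: sum the counts over labels, then compute a single F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.3f}")
print(f"micro F1: {micro_f1:.3f}")
```

A label with an F1 of zero pulls the macro average down sharply but barely moves the micro average when its support is small.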

In [None]:
# Save the model
model_save_path = '/content/gdrive/MyDrive/bert_multilabel_model.pth'
torch.save(model.state_dict(), model_save_path)
print(f"Model saved to {model_save_path}")

Model saved to /content/gdrive/MyDrive/bert_multilabel_model.pth
