**Fine-tuning a BERT Model for Multi-Class Text Classification**

> Fine-tuning KLUE-BERT



**CSV to JSON (extracting text and emotion)**

In [None]:
import pandas as pd

file_path = "/content/drive/MyDrive/datasets/text-emotion_결측치 제거.csv"
df = pd.read_csv(file_path)

# Keep only the 'text' and 'emotion' columns
text_emotion_df = df[['text', 'emotion']]

# Save as a JSON file
output_path = "text_emotion_data.json"
text_emotion_df.to_json(output_path, orient='records', force_ascii=False, indent=2)

print(f"JSON file saved: {output_path}")

JSON file saved: text_emotion_data.json
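For reference, `orient='records'` produces a JSON array with one object per row. The sketch below reproduces that shape with the standard library only. The two rows are made-up placeholders, not samples from the dataset.

```python
import json

# Hypothetical rows standing in for the 'text' and 'emotion' columns.
rows = [
    {"text": "sample sentence one", "emotion": "기쁨"},
    {"text": "sample sentence two", "emotion": "unknown"},
]

# ensure_ascii=False keeps Korean characters readable instead of \uXXXX escapes,
# which plays the same role as force_ascii=False in DataFrame.to_json.
serialized = json.dumps(rows, ensure_ascii=False, indent=2)

# Round-trip to confirm the structure is a list of {"text", "emotion"} objects.
restored = json.loads(serialized)
print(serialized)
```

The later cells rely on this layout when they read `data['text']` and `data['emotion']` from each element.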


**Inspecting the JSON data**

In [None]:
import json
import pandas as pd

In [None]:
# Load Train-set
with open('/content/text_emotion_data.json', mode='rt', encoding='utf-8-sig') as f:
    train_dataset_raw = json.load(f)

train_dataset_list = [{'text':data['text'], 'label':data['emotion']} for data in train_dataset_raw]
train_df = pd.DataFrame(train_dataset_list)
train_df.head()

Unnamed: 0,text,label
0,아빠는 없다. 나에게 아빠는 없다. 나에게 아빠는 없었다. 애초에 그건 나에게 없...,unknown
1,"다들 바쁘게 사는 것 같은데 나만 멈춰 있는 기분이 든다. 뒤처진 것도, 앞서간 것...",unknown
2,"요즘 뭘 해도 재미가 없다. 좋아하던 영화도, 음악도, 아무런 감흥이 없다. 감정이...",unknown
3,"하루 종일 아무 말도 하지 않았다. 말을 걸 사람도 없고, 굳이 이야기할 이유도 없...",unknown
4,"사람들과 어울려도 외롭고, 혼자 있어도 외롭다. 누군가에게 말하고 싶지만 막상 입을...",unknown


In [None]:
train_df.groupby(by=['label']).count()

label,text
unknown,82
기쁨,129
두려움,94
분노,91
슬픔,123
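A quick way to read these counts is as class shares. The sketch below only reuses the numbers printed above; the `imbalance_ratio` name is my own and not part of the notebook.

```python
# Class counts copied from the groupby output above.
label_counts = {"unknown": 82, "기쁨": 129, "두려움": 94, "분노": 91, "슬픔": 123}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Ratio between the largest and smallest class; values near 1 mean a balanced set.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

The classes are mildly imbalanced, so a stratified split is worth considering later on.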


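**Filling missing labels with "unknown"**

In [None]:
print(train_df.isnull().sum())  # count missing values per column
# Use the existing lowercase "unknown" label so missing values do not create a separate class
train_df["label"] = train_df["label"].fillna("unknown")
print(train_df.isnull().sum())  # count missing values per column again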

text     0
label    0
dtype: int64
text     0
label    0
dtype: int64


**Encoding emotion labels as integers**

In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(train_df['label'])
num_labels = len(label_encoder.classes_)

train_df['encoded_label'] = np.asarray(label_encoder.transform(train_df['label']), dtype=np.int32)
train_df.head()

Unnamed: 0,text,label,encoded_label
0,아빠는 없다. 나에게 아빠는 없다. 나에게 아빠는 없었다. 애초에 그건 나에게 없...,unknown,0
1,"다들 바쁘게 사는 것 같은데 나만 멈춰 있는 기분이 든다. 뒤처진 것도, 앞서간 것...",unknown,0
2,"요즘 뭘 해도 재미가 없다. 좋아하던 영화도, 음악도, 아무런 감흥이 없다. 감정이...",unknown,0
3,"하루 종일 아무 말도 하지 않았다. 말을 걸 사람도 없고, 굳이 이야기할 이유도 없...",unknown,0
4,"사람들과 어울려도 외롭고, 혼자 있어도 외롭다. 누군가에게 말하고 싶지만 막상 입을...",unknown,0
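To make the id assignment explicit: `LabelEncoder` sorts the distinct class names and numbers them in that order, which is why `unknown` gets id 0 here. A minimal pure-Python sketch of the same idea, using a short made-up label list:

```python
# Made-up label sequence for illustration only.
labels = ["슬픔", "unknown", "기쁨", "분노", "두려움", "기쁨"]

# Sort the distinct classes, then number them in sorted order.
classes = sorted(set(labels))
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

encoded = [label2id[label] for label in labels]
decoded = [id2label[idx] for idx in encoded]

print(label2id)
print(encoded)
```

ASCII letters sort before Hangul syllables, so `unknown` comes first, followed by the Korean labels in code-point order.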


**Splitting data into training and validation sets**

In [None]:
train_texts = train_df["text"].to_list() # Features (not-tokenized yet)
train_labels = train_df["encoded_label"].to_list() # Labels

In [None]:
from sklearn.model_selection import train_test_split

# Split Train and Validation data
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=0)
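The split above is purely random. With only about 80 to 130 samples per class, a stratified split keeps the class shares equal in both parts; `train_test_split` offers this through its `stratify` argument. The function below is a hand-written sketch of the idea on toy data, not the scikit-learn implementation, and `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=0):
    """Split so that each label keeps roughly the same share in both parts."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one sample).
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    def select(idx):
        return [texts[i] for i in idx], [labels[i] for i in idx]

    return select(train_idx), select(val_idx)

# Toy data: 10 samples of class 0 and 5 samples of class 1.
toy_texts = [f"sample {i}" for i in range(15)]
toy_labels = [0] * 10 + [1] * 5
(train_x, train_y), (val_x, val_y) = stratified_split(toy_texts, toy_labels)
print(len(train_y), len(val_y))
```

Both classes contribute 20% of their samples to the validation part, so the 2:1 ratio is preserved on each side.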

**Load Tokenizer and Tokenizing**

In [None]:
from transformers import BertTokenizerFast

HUGGINGFACE_MODEL_PATH = "klue/bert-base"

# Load Tokenizer (from_pt is a model-loading option, so it is not passed to the tokenizer)
tokenizer = BertTokenizerFast.from_pretrained(HUGGINGFACE_MODEL_PATH)

# Tokenizing
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
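With `padding=True`, every sequence in a call is padded to the longest one in that call, and `attention_mask` marks which positions are real tokens. The toy function below illustrates this with made-up token ids. `pad_batch` is a hypothetical helper, and unlike the real tokenizer its truncation does not preserve special tokens.

```python
def pad_batch(token_id_lists, pad_id=0, max_length=None):
    """Pad (and optionally truncate) token id lists to one common length."""
    if max_length is not None:
        token_id_lists = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in token_id_lists)

    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two sentences of different length.
batch = pad_batch([[2, 10, 11, 3], [2, 12, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the training and validation texts are tokenized in separate calls, the two encodings can end up with different padded lengths.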

**Converting the tokenized dataset into a TensorFlow Dataset object**

In [None]:
import tensorflow as tf

# training set
train_dataset = tf.data.Dataset.from_tensor_slices({
    'input_ids': train_encodings['input_ids'],
    'token_type_ids': train_encodings['token_type_ids'],
    'attention_mask': train_encodings['attention_mask'],
    'labels': train_labels # Add labels with the key 'labels'
})

# validation-set
val_dataset = tf.data.Dataset.from_tensor_slices({
    'input_ids': val_encodings['input_ids'],
    'token_type_ids': val_encodings['token_type_ids'],
    'attention_mask': val_encodings['attention_mask'],
    'labels': val_labels # Add labels with the key 'labels'
})

In [None]:
print(train_dataset)
print(val_dataset)

**Fine Tuning Using Native Tensorflow**


In [None]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

num_labels = len(label_encoder.classes_)
model = TFBertForSequenceClassification.from_pretrained(HUGGINGFACE_MODEL_PATH, num_labels=num_labels, from_pt=True)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

# Compile without an explicit loss: when no loss is passed, the Hugging Face TF model
# falls back to its built-in loss, computed from the 'labels' key in the dataset.
model.compile(optimizer=optimizer, metrics=['accuracy'])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
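As far as I understand, the built-in loss for sequence classification is a sparse categorical cross-entropy over the logits: softmax turns the five logits into probabilities, and the loss is the negative log of the probability assigned to the true class. A small pure-Python sketch with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label_id):
    """Negative log-probability of the true class given raw logits."""
    return -math.log(softmax(logits)[label_id])

# An untrained 5-class head that outputs equal logits gives a loss of ln(5).
uniform_loss = sparse_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], label_id=2)

# A confident and correct prediction drives the loss towards 0.
confident_loss = sparse_cross_entropy([0.0, 0.0, 8.0, 0.0, 0.0], label_id=2)

print(f"{uniform_loss:.4f} {confident_loss:.4f}")
```

A first-epoch loss close to ln(5) ≈ 1.61 would therefore mean the freshly initialized classifier is still guessing uniformly.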


**Training**

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Option 1: stop when val_accuracy fails to improve by at least 0.001 for 2 epochs in a row
callback_earlystop = EarlyStopping(
    monitor="val_accuracy",
    min_delta=0.001,
    patience=2)

# Option 2: stop when val_loss stops improving for 2 epochs and restore the best weights
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True,
    mode='auto')

# The datasets are already batched, so batch_size is not passed to fit().
# Only the training set is shuffled; validation order does not affect the metrics.
model.fit(
    train_dataset.shuffle(500).batch(16), epochs=3,
    validation_data=val_dataset.batch(16),
    #callbacks = [early_stop]  # both callbacks are defined but unused in this run
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7c3f301d2fd0>
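The patience rule behind `EarlyStopping` can be sketched in a few lines. This is a simplified re-implementation for illustration, not Keras code, and the validation losses below are invented.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses per epoch.
history = [0.90, 0.70, 0.72, 0.75, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(history)
print(stop_epoch, best_epoch)
```

With `epochs=3` and `patience=2`, the callback could only fire if neither the second nor the third epoch improves on the first, so it matters more once the epoch count is raised.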

In [None]:
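# Alternative: training with TFTrainer (kept commented out; TFTrainer is deprecated in recent transformers releases)
# from transformers import TFTrainer, TFTrainingArguments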

# training_args = TFTrainingArguments(
#     output_dir='./results',          # output directory
#     num_train_epochs=5,              # total number of training epochs
#     per_device_train_batch_size=16,  # batch size per device during training
#     per_device_eval_batch_size=64,   # batch size for evaluation
#     warmup_steps=500,                # number of warmup steps for learning rate scheduler
#     weight_decay=0.01,               # strength of weight decay
#     logging_dir='./logs'            # directory for storing logs
# )

# with training_args.strategy.scope():
#     trainer_model = TFBertForSequenceClassification.from_pretrained(HUGGINGFACE_MODEL_PATH, num_labels=num_labels, from_pt=True)

# trainer = TFTrainer(
#     model=trainer_model,                 # the instantiated Transformers model to be trained
#     args=training_args,                  # training arguments, defined above
#     train_dataset=train_dataset,         # training dataset
#     eval_dataset=val_dataset             # evaluation dataset
# )

**Change id2label, label2id in model.config**

In [None]:
# Replace the default LABEL_0 ... LABEL_n names with the original emotion labels.
# label_encoder.classes_ is ordered by encoded id, so enumerate() yields (id, label) pairs.
id2label = {idx: str(label) for idx, label in enumerate(label_encoder.classes_)}

model.config.id2label = id2label
model.config.label2id = {label: idx for idx, label in id2label.items()}

**Saving the model and tokenizer**

In [None]:
import os
MODEL_NAME = 'fine-tuned-klue-bert-base'
MODEL_SAVE_PATH = os.path.join("/content/drive/MyDrive/_model", MODEL_NAME) # change this to your preferred location

if os.path.exists(MODEL_SAVE_PATH):
    print(f"{MODEL_SAVE_PATH} -- Folder already exists \n")
else:
    os.makedirs(MODEL_SAVE_PATH, exist_ok=True)
    print(f"{MODEL_SAVE_PATH} -- Folder create complete \n")

# save tokenizer, model
model.save_pretrained(MODEL_SAVE_PATH)
tokenizer.save_pretrained(MODEL_SAVE_PATH)

/content/drive/MyDrive/_model/fine-tuned-klue-bert-base -- Folder already exists 



('/content/drive/MyDrive/_model/fine-tuned-klue-bert-base/tokenizer_config.json',
 '/content/drive/MyDrive/_model/fine-tuned-klue-bert-base/special_tokens_map.json',
 '/content/drive/MyDrive/_model/fine-tuned-klue-bert-base/vocab.txt',
 '/content/drive/MyDrive/_model/fine-tuned-klue-bert-base/added_tokens.json',
 '/content/drive/MyDrive/_model/fine-tuned-klue-bert-base/tokenizer.json')

**Usage**

In [None]:
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForSequenceClassification
import numpy as np

MODEL_SAVE_PATH = '/content/drive/MyDrive/_model/fine-tuned-klue-bert-base'

tokenizer = BertTokenizerFast.from_pretrained(MODEL_SAVE_PATH)
model = TFBertForSequenceClassification.from_pretrained(MODEL_SAVE_PATH)

# Prediction function
def classify_text(text):
    inputs = tokenizer(text, return_tensors="tf", truncation=True, padding=True)
    outputs = model(**inputs)
    logits = outputs.logits
    probs = tf.nn.softmax(logits, axis=-1)
    predicted_class = tf.argmax(probs, axis=1).numpy()[0]
    return {
        "label": int(predicted_class),
        "scores": probs.numpy()[0].tolist()
    }

# Label list (in order, starting from index 0)
labels = ["unknown", "기쁨", "두려움", "분노", "슬픔"]

# Output formatting function
def pretty_print_result(result):
    pred_idx = result["label"]
    print(f"Predicted emotion: {labels[pred_idx]}")
    print("Probability distribution:")
    for i, score in enumerate(result["scores"]):
        print(f"  {labels[i]}: {score:.2%}")

# Test
sample_text = """오늘은 너무 몸이 가벼운 하루. 아침에 일어났더니 너무 상쾌했고, 해야할 과제도 어제 다 끝내고 기분 최상이었다. 점심도 맛있었고, 저녁은 끝내주게 좋았다."""
result = classify_text(sample_text)
pretty_print_result(result)


Some layers from the model checkpoint at /content/drive/MyDrive/_model/fine-tuned-klue-bert-base were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at /content/drive/MyDrive/_model/fine-tuned-klue-bert-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.

Predicted emotion: 기쁨
Probability distribution:
  unknown: 26.80%
  기쁨: 62.35%
  두려움: 5.74%
  분노: 2.07%
  슬픔: 3.05%
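Beyond the top label, the gap between the best and second-best class gives a rough sense of how decisive the prediction is. The sketch below reuses the probabilities printed above (rounded to four decimals); the `margin` measure is my own addition, not something the notebook computes.

```python
labels = ["unknown", "기쁨", "두려움", "분노", "슬픔"]
# Probabilities copied from the output above.
scores = [0.2680, 0.6235, 0.0574, 0.0207, 0.0305]

# Rank classes from most to least likely.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
(top_label, top_score), (second_label, second_score) = ranked[:2]

# A small margin means the model is torn between two classes.
margin = top_score - second_score

print(f"top: {top_label} ({top_score:.2%}), runner-up: {second_label} ({second_score:.2%})")
print(f"margin: {margin:.2%}")
```

Here the runner-up is `unknown`, which is consistent with that class absorbing many predictions in the evaluation further down.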


**Evaluation**

In [None]:
from transformers import TextClassificationPipeline

# Load Fine-tuning model
loaded_tokenizer = BertTokenizerFast.from_pretrained(MODEL_SAVE_PATH)
loaded_model = TFBertForSequenceClassification.from_pretrained(MODEL_SAVE_PATH)

text_classifier = TextClassificationPipeline(
    tokenizer=loaded_tokenizer,
    model=loaded_model,
    framework='tf',
    return_all_scores=True
)


In [None]:
# Load Test-set
with open('/content/drive/MyDrive/datasets/test_text_emotion_data.json', mode='rt', encoding='utf-8-sig') as f:
    test_dataset = json.load(f)

test_dataset_list = [{'text':data['text'], 'label':data['emotion']} for data in test_dataset]
test_df = pd.DataFrame(test_dataset_list)
test_df.head()
#print(len(test_df))

Unnamed: 0,text,label
0,"오늘은 유난히 기분이 좋았다. 평소보다 일찍 일어났고, 아침 햇살이 방 안으로 들어...",기쁨
1,"요즘은 아무것도 손에 잡히지 않는다. 마음 한구석이 자꾸 불안하고, 무언가 큰일이 ...",두려움
2,"정말 짜증나는 하루였다. 아침부터 버스를 놓치고, 회사에서는 상사가 시비를 걸고, ...",분노
3,"오늘은 하루종일 울고만 싶었다. 별일이 없는데도 눈물이 났고, 아무것도 하기 싫었다...",슬픔
4,하루가 어떻게 지나갔는지 모르겠다. 그냥 멍하니 시간만 보내고 있는 기분이다. 뭘 ...,unknown


In [None]:
predicted_label_list = []
predicted_score_list = []

for text in test_df['text']:
    # Predict: with return_all_scores=True the pipeline returns one score dict per class
    preds_list = text_classifier(text)[0]

    # Keep only the highest-scoring class
    top_pred = max(preds_list, key=lambda x: x['score'])
    predicted_label_list.append(top_pred['label'])  # predicted label
    predicted_score_list.append(top_pred['score'])  # score of the predicted label

In [None]:
test_df['pred'] = predicted_label_list
test_df['score'] = predicted_score_list
test_df.head()

Unnamed: 0,text,label,pred,score
0,"오늘은 유난히 기분이 좋았다. 평소보다 일찍 일어났고, 아침 햇살이 방 안으로 들어...",기쁨,기쁨,0.4049473702907562
1,"요즘은 아무것도 손에 잡히지 않는다. 마음 한구석이 자꾸 불안하고, 무언가 큰일이 ...",두려움,두려움,0.7478803992271423
2,"정말 짜증나는 하루였다. 아침부터 버스를 놓치고, 회사에서는 상사가 시비를 걸고, ...",분노,분노,0.9389721155166626
3,"오늘은 하루종일 울고만 싶었다. 별일이 없는데도 눈물이 났고, 아무것도 하기 싫었다...",슬픔,두려움,0.6031561493873596
4,하루가 어떻게 지나갔는지 모르겠다. 그냥 멍하니 시간만 보내고 있는 기분이다. 뭘 ...,unknown,두려움,0.5328735709190369


In [None]:
# Evaluate with classification_report
from sklearn.metrics import classification_report
print(classification_report(y_true=test_df['label'], y_pred=test_df['pred']))

              precision    recall  f1-score   support

     unknown       0.45      0.91      0.61        11
          기쁨       1.00      0.27      0.43        11
         두려움       0.62      0.91      0.74        11
          분노       0.77      0.91      0.83        11
          슬픔       1.00      0.09      0.17        11

    accuracy                           0.62        55
   macro avg       0.77      0.62      0.56        55
weighted avg       0.77      0.62      0.56        55
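The report can be sanity-checked by hand, since F1 is the harmonic mean of precision and recall. The sketch below assumes the rounded recalls correspond to 1 of 11 (슬픔) and 3 of 11 (기쁨) correct predictions, which is inferred from the support of 11 rather than stated in the output.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 슬픔: precision 1.00, recall 0.09 (assumed to be 1 of 11 samples).
f1_sadness = f1_score(1.0, 1 / 11)

# 기쁨: precision 1.00, recall 0.27 (assumed to be 3 of 11 samples).
f1_joy = f1_score(1.0, 3 / 11)

# Macro average: unweighted mean of the per-class F1 values from the report.
reported_f1 = [0.61, 0.43, 0.74, 0.83, 0.17]
macro_f1 = sum(reported_f1) / len(reported_f1)

print(f"{f1_sadness:.2f} {f1_joy:.2f} {macro_f1:.2f}")
```

Perfect precision with very low recall for 기쁨 and 슬픔 means the model rarely predicts those classes, while `unknown` and 두려움 collect most of the misclassified samples.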

