## KLUE NSMC model
KLUE: Korean Language Understanding Evaluation, the Korean counterpart of GLUE   
NSMC: Naver Sentiment Movie Corpus, sentiment analysis on Naver movie reviews

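#### Rubric
1. The model and data load correctly and are confirmed to work.
2. Preprocessing is improved, and fine-tuning improves model performance.  
    Maximum padding length: 512 -> 32
3. Bucketing is successfully applied to model training, and the results are compared and analyzed.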

In [1]:
import tensorflow as tf
import numpy as np
import transformers
import datasets

### Dataset

In [2]:
from datasets import load_dataset

dataset = load_dataset("nsmc", trust_remote_code=True)

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})

In [4]:
dataset["train"][0]

{'id': '9976970', 'document': '아 더빙.. 진짜 짜증나네요 목소리', 'label': 0}

### Model and tokenizer

In [20]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base")
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [42]:
# With padding="max_length" the tokenizer pads to 512 tokens by default, but the reviews are much shorter
len(tokenizer("sample text", padding="max_length")["input_ids"])

512

In [46]:
# Token counts of the tokenized train documents
train_documents = dataset["train"]["document"]
train_input_lengths = [len(tokenizer(doc)["input_ids"]) for doc in train_documents]

# Token counts of the tokenized test documents
test_documents = dataset["test"]["document"]
test_input_lengths = [len(tokenizer(doc)["input_ids"]) for doc in test_documents]

# Mean token count per split
avg_train_input_length = sum(train_input_lengths) / len(train_input_lengths)
avg_test_input_length = sum(test_input_lengths) / len(test_input_lengths)

In [54]:
print("Train, Test max length: ", max(train_input_lengths), max(test_input_lengths))
print(f"Train mean length: {avg_train_input_length}")
print(f"Test mean length: {avg_test_input_length}")

Train, Test max length:  142 122
Train mean length: 22.275513333333333
Test mean length: 22.35976
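The length statistics above motivate a smaller padding length: the mean is about 22 tokens while the maximum is 142. One quick way to sanity-check a candidate cutoff such as 32 is to measure how many sequences fit without truncation. The sketch below is a minimal, dependency-free illustration; `coverage` is a hypothetical helper and the `lengths` list is made-up data, not the NSMC statistics.

```python
def coverage(lengths, max_len):
    """Return the fraction of sequences whose token count is <= max_len."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Made-up token counts for illustration only (not taken from NSMC).
lengths = [8, 12, 15, 20, 22, 25, 30, 33, 60, 142]

for cutoff in (16, 32, 64):
    print(cutoff, coverage(lengths, cutoff))
```

In the notebook, `coverage(train_input_lengths, 32)` would report the share of training reviews that are left untruncated at the chosen length.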


In [55]:
tokenizer.model_max_length = 32
len(tokenizer("sample text", padding="max_length")["input_ids"])

32
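With `model_max_length` set to 32, `padding="max_length"` combined with `truncation=True` yields exactly 32 ids per example. The sketch below mimics that effect on plain lists of ids to make the behavior concrete. It is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumed padding id, and a real tokenizer also keeps special tokens such as `[SEP]` at the end when it truncates.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = ids[:max_len]                      # truncation: drop tokens past max_len
    n_pad = max_len - len(kept)               # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut (made-up ids).
short_ids, short_mask = pad_or_truncate([2, 17, 45, 3], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_len=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```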

In [56]:
def transform(data):
    return tokenizer(
        text=data["document"],
        truncation=True,
        padding="max_length",
        return_token_type_ids=False,
    )

In [57]:
dataset_tokenized = dataset.map(
    transform,
    batched=True,
)

In [59]:
train = dataset_tokenized["train"]
test = dataset_tokenized["test"]

In [63]:
# Training takes a long time, so use a smaller subset of the data

subset_size = 15000
train_subset = train.shuffle(seed=42).select(range(subset_size))
test_subset = test.shuffle(seed=42).select(range(int(subset_size / 3)))

### Training

In [75]:
import numpy as np
from transformers import Trainer, TrainingArguments

output_dir = "model/nsmc/"

training_arguments = TrainingArguments(
    output_dir,  # directory where outputs are saved
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    learning_rate=2e-5,  # peak learning rate
    warmup_steps=500,  # linear warmup over the first 500 steps
    per_device_train_batch_size=8,  # training batch size per device
    per_device_eval_batch_size=8,  # evaluation batch size per device
    num_train_epochs=3,  # total number of training epochs
    weight_decay=0.01,  # weight decay
)



In [76]:
# Accuracy and F1
from datasets import load_metric

metric = load_metric("glue", "mrpc")  # the GLUE MRPC metric reports accuracy and F1


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # logits -> predicted class ids
    return metric.compute(predictions=predictions, references=labels)
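The `glue`/`mrpc` metric returns accuracy and F1 for the positive class, which is what appears as `eval_accuracy` and `eval_f1` in the training logs. To show what those numbers mean, here is a dependency-free sketch of the same computation on toy logits. `accuracy_and_f1` is a hypothetical helper written for illustration, and the logits and labels are made up.

```python
def accuracy_and_f1(logits, labels):
    """Compute accuracy and binary F1 (positive class = 1) from per-class logits."""
    # argmax over the class axis: index of the largest logit in each row
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Toy data: four examples, two classes (0 = negative, 1 = positive).
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.2, 0.4]]
toy_labels = [0, 1, 1, 1]
result = accuracy_and_f1(toy_logits, toy_labels)
print(result)
```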

In [77]:
trainer = Trainer(
    model=model,  # model to train
    args=training_arguments,  # arguments configured through TrainingArguments
    train_dataset=train_subset,  # training dataset
    eval_dataset=test_subset,
    compute_metrics=compute_metrics,
)
trainer.train()

{'loss': 0.6993, 'grad_norm': 10.24004077911377, 'learning_rate': 2e-05, 'epoch': 0.27}
{'loss': 0.7021, 'grad_norm': 2.190335273742676, 'learning_rate': 1.804878048780488e-05, 'epoch': 0.53}
{'loss': 0.6414, 'grad_norm': 11.948502540588379, 'learning_rate': 1.6097560975609757e-05, 'epoch': 0.8}


{'eval_loss': 0.5070755481719971, 'eval_accuracy': 0.8162, 'eval_f1': 0.8149788604791625, 'eval_runtime': 9.1766, 'eval_samples_per_second': 544.863, 'eval_steps_per_second': 68.108, 'epoch': 1.0}
{'loss': 0.4501, 'grad_norm': 56.362056732177734, 'learning_rate': 1.4146341463414635e-05, 'epoch': 1.07}
{'loss': 0.3579, 'grad_norm': 1.8230552673339844, 'learning_rate': 1.2195121951219513e-05, 'epoch': 1.33}
{'loss': 0.3311, 'grad_norm': 6.776037693023682, 'learning_rate': 1.024390243902439e-05, 'epoch': 1.6}
{'loss': 0.3221, 'grad_norm': 15.574055671691895, 'learning_rate': 8.292682926829268e-06, 'epoch': 1.87}


{'eval_loss': 0.5396305918693542, 'eval_accuracy': 0.8552, 'eval_f1': 0.8491666666666666, 'eval_runtime': 9.3938, 'eval_samples_per_second': 532.264, 'eval_steps_per_second': 66.533, 'epoch': 2.0}
{'loss': 0.2461, 'grad_norm': 6.702902317047119, 'learning_rate': 6.341463414634147e-06, 'epoch': 2.13}
{'loss': 0.1976, 'grad_norm': 0.9099440574645996, 'learning_rate': 4.390243902439025e-06, 'epoch': 2.4}
{'loss': 0.1879, 'grad_norm': 0.10862462222576141, 'learning_rate': 2.4390243902439027e-06, 'epoch': 2.67}
{'loss': 0.2052, 'grad_norm': 0.23882833123207092, 'learning_rate': 4.878048780487805e-07, 'epoch': 2.93}


{'eval_loss': 0.6092442870140076, 'eval_accuracy': 0.8574, 'eval_f1': 0.8559304910082846, 'eval_runtime': 8.6386, 'eval_samples_per_second': 578.796, 'eval_steps_per_second': 72.349, 'epoch': 3.0}
{'train_runtime': 427.4598, 'train_samples_per_second': 105.273, 'train_steps_per_second': 13.159, 'train_loss': 0.39198618842230903, 'epoch': 3.0}


TrainOutput(global_step=5625, training_loss=0.39198618842230903, metrics={'train_runtime': 427.4598, 'train_samples_per_second': 105.273, 'train_steps_per_second': 13.159, 'total_flos': 739999843200000.0, 'train_loss': 0.39198618842230903, 'epoch': 3.0})
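The `learning_rate` values in the log follow from `warmup_steps=500` and the `Trainer`'s default linear scheduler: the rate rises linearly to 2e-5 over the first 500 steps, then decays linearly to 0 at step 5625. The sketch below reproduces the logged values under the assumption that logging happens every 500 optimizer steps (the `Trainer` default). `linear_schedule` is a hypothetical helper, not a library function.

```python
def linear_schedule(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values from the run above: 5625 total steps, 500 warmup steps, peak 2e-5.
for step in (500, 1000, 1500, 5500):
    print(step, linear_schedule(step, 2e-5, 500, 5625))
```

Steps 500, 1000, 1500, and 5500 correspond to epochs 0.27, 0.53, 0.8, and 2.93 in the log, and the computed rates agree with the logged ones.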

### Bucketing

Implement bucketing and dynamic padding with a data collator, then compare the results with the fixed-padding run.

In [67]:
from transformers import DataCollatorWithPadding

In [68]:
datacollator = DataCollatorWithPadding(tokenizer, padding=True)
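`DataCollatorWithPadding` with `padding=True` pads each batch only up to its longest sequence instead of a global maximum. One caveat is worth checking: `transform` above already pads every example with `padding="max_length"`, so every row that reaches the collator is 32 tokens long and there is nothing left to shorten. Dynamic padding only pays off when tokenization is done without padding. The sketch below shows the idea on plain lists. `pad_batch` is a hypothetical helper and `pad_id=0` is an assumed padding id; it illustrates the technique and is not the collator's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token ids of different lengths (made-up values).
batch = [[2, 11, 3], [2, 11, 12, 13, 3], [2, 3]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```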

In [73]:
training_arguments_bucket = TrainingArguments(
    output_dir,  # directory where outputs are saved
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    learning_rate=2e-5,  # initial learning rate
    per_device_train_batch_size=8,  # training batch size per device
    per_device_eval_batch_size=8,  # evaluation batch size per device
    num_train_epochs=3,  # total number of training epochs
    weight_decay=0.01,  # weight decay
    group_by_length=True,
)
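`group_by_length=True` asks the `Trainer` to batch examples of similar length together so that each batch needs little padding. The sketch below is a simplified version of that idea, not the library's sampler: it shuffles the indices, sorts them by length inside large chunks, and slices the result into batches. `bucketed_batches`, `chunk_factor`, and `padding_tokens` are hypothetical names, and the lengths are randomly generated.

```python
import random


def bucketed_batches(lengths, batch_size, chunk_factor=50, seed=0):
    """Group indices into batches of similar length (simplified bucketing)."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)                       # keep randomness across chunks
    chunk = batch_size * chunk_factor
    batches = []
    for start in range(0, len(indices), chunk):
        # sort one chunk by length, longest first, then cut it into batches
        part = sorted(indices[start:start + chunk], key=lambda i: lengths[i], reverse=True)
        batches += [part[i:i + batch_size] for i in range(0, len(part), batch_size)]
    return batches


def padding_tokens(lengths, batches):
    """Count pad tokens needed when each batch is padded to its longest member."""
    return sum(max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b) for b in batches)


rng = random.Random(42)
lengths = [rng.randint(5, 60) for _ in range(1000)]        # made-up token counts
plain = [list(range(i, i + 8)) for i in range(0, 1000, 8)]  # batches in original order
bucketed = bucketed_batches(lengths, batch_size=8)
print(padding_tokens(lengths, plain), padding_tokens(lengths, bucketed))
```

The two printed counts compare the padding needed with and without bucketing. The saving only materializes when batches are padded dynamically, so bucketing and dynamic padding are normally used together.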



In [72]:
trainer_bucket = Trainer(
    model=model,  # model to train
    args=training_arguments_bucket,  # arguments configured through TrainingArguments
    train_dataset=train_subset,  # training dataset
    eval_dataset=test_subset,
    compute_metrics=compute_metrics,
    data_collator=datacollator,
)
trainer_bucket.train()

{'loss': 0.6792, 'grad_norm': 4.326119899749756, 'learning_rate': 0.00018222222222222224, 'epoch': 0.27}
{'loss': 0.7079, 'grad_norm': 6.315160751342773, 'learning_rate': 0.00016444444444444444, 'epoch': 0.53}
{'loss': 0.7083, 'grad_norm': 12.499991416931152, 'learning_rate': 0.00014666666666666666, 'epoch': 0.8}
{'loss': 0.7036, 'grad_norm': 13.895317077636719, 'learning_rate': 0.00012888888888888892, 'epoch': 1.07}
{'loss': 0.709, 'grad_norm': 2.4762637615203857, 'learning_rate': 0.00011111111111111112, 'epoch': 1.33}
{'loss': 0.7023, 'grad_norm': 2.2617084980010986, 'learning_rate': 9.333333333333334e-05, 'epoch': 1.6}
{'loss': 0.7048, 'grad_norm': 11.508976936340332, 'learning_rate': 7.555555555555556e-05, 'epoch': 1.87}
{'loss': 0.6998, 'grad_norm': 3.150838613510132, 'learning_rate': 5.7777777777777776e-05, 'epoch': 2.13}
{'loss': 0.7006, 'grad_norm': 8.67473030090332, 'learning_rate': 4e-05, 'epoch': 2.4}
{'loss': 0.699, 'grad_norm': 2.4435043334960938, 'learning_rate': 2.222222

TrainOutput(global_step=5625, training_loss=0.7008982788085938, metrics={'train_runtime': 394.6325, 'train_samples_per_second': 114.03, 'train_steps_per_second': 14.254, 'total_flos': 739999843200000.0, 'train_loss': 0.7008982788085938, 'epoch': 3.0})
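Comparing the two runs: the fixed-padding run finished with a training loss of 0.392 and an evaluation accuracy of 0.8574 in 427.5 s. The bucketing run finished in 394.6 s, but its training loss stayed near 0.70, which is roughly ln 2 ≈ 0.693, the loss of a binary classifier guessing at chance. Two details of the cells may explain this and are worth ruling out before attributing anything to bucketing. First, the logged learning rates start near 1.8e-4, which does not match `learning_rate=2e-5` in the cell shown, and the execution counters (In [72] for training, In [73] for the arguments) suggest the cells were not run in document order. Second, both trainers receive the same `model` object, so whichever run executes second starts from the weights left by the other. The sketch below checks the first point by inverting a linear decay without warmup, assuming logging every 500 steps; `implied_base_lr` is a hypothetical helper.

```python
def implied_base_lr(logged_lr, step, total_steps, warmup_steps=0):
    """Invert a linear-decay schedule to recover the configured peak learning rate."""
    if step < warmup_steps:
        raise ValueError("step must be past warmup to invert the decay phase")
    if step >= total_steps:
        raise ValueError("step must be smaller than total_steps")
    return logged_lr * (total_steps - warmup_steps) / (total_steps - step)


# First logged value of the bucketing run: epoch 0.27 of 3, i.e. step 500 of 5625.
base_lr = implied_base_lr(0.00018222222222222224, step=500, total_steps=5625)
print(f"{base_lr:.1e}")
```

The implied initial rate is about 2e-4, ten times the 2e-5 shown in the cell, so the bucketing run probably used an earlier version of the arguments. A like-for-like comparison would reload `klue/bert-base` before each run and keep the hyperparameters identical.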