# Project: Building a Custom Project from Scratch

We will use the KLUE model (klue/bert-base) to take on the NSMC (Naver Sentiment Movie Corpus) task.

In [1]:
import tensorflow
import numpy
import transformers
import datasets

print(tensorflow.__version__)
print(numpy.__version__)
print(transformers.__version__)
print(datasets.__version__)

2.6.0
1.21.4
4.11.3
1.14.0


## STEP 1. Analyzing the NSMC data and building a Huggingface dataset

* The dataset can be downloaded from GitHub or loaded from Huggingface datasets. Let's use the methods covered earlier!

In [2]:
import datasets
from datasets import load_dataset

huggingface_nsmc_dataset = load_dataset('nsmc')
print(huggingface_nsmc_dataset)

Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)



DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})


In [3]:
train = huggingface_nsmc_dataset['train']
cols = train.column_names
cols

['id', 'document', 'label']

In [4]:
for i in range(5):
    for col in cols:
        print(col, ":", train[col][i])
    print('\n')

id : 9976970
document : 아 더빙.. 진짜 짜증나네요 목소리
label : 0


id : 3819312
document : 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
label : 1


id : 10265843
document : 너무재밓었다그래서보는것을추천한다
label : 0


id : 9045019
document : 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
label : 0


id : 6483659
document : 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
label : 1




## 📊 Analysis
The Huggingface nsmc dataset is structured as shown above.

The DatasetDict contains a train dataset and a test dataset, and each Dataset has the columns ‘id’, ‘document’, and ‘label’.


* This is a movie review dataset in the Korean language. Reviews were scraped from Naver Movies.

    Each file consists of three columns: id, document, label
    * id: The review id, provided by Naver
    * document: The actual review
    * label: The sentiment class of the review. (0: negative, 1: positive)

**Characteristics**

* All reviews are shorter than 140 characters
* Each sentiment class is sampled equally (i.e., random guess yields 50% accuracy)
* 100K negative reviews (originally reviews of ratings 1-4)
* 100K positive reviews (originally reviews of ratings 9-10)
* Neutral reviews (originally reviews of ratings 5-8) are excluded
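
The balance and length claims above can be checked directly on any split. The following is a minimal sketch, assuming rows are dicts with `document` and `label` keys as in the dataset printed earlier; `summarize_reviews` and the three sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter

def summarize_reviews(rows):
    """Return (label counts, length of the longest document) for a sequence of rows."""
    label_counts = Counter(row["label"] for row in rows)
    max_chars = max(len(row["document"]) for row in rows)
    return label_counts, max_chars

# Hypothetical stand-in rows; with the real data, pass huggingface_nsmc_dataset['train'].
sample_rows = [
    {"id": 1, "document": "good movie", "label": 1},
    {"id": 2, "document": "boring", "label": 0},
    {"id": 3, "document": "great acting", "label": 1},
]
label_counts, max_chars = summarize_reviews(sample_rows)
print(label_counts, max_chars)
```

On the real train split, roughly equal counts for labels 0 and 1 would be consistent with the equal-sampling claim.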

## STEP 2. Loading the klue/bert-base model and tokenizer


In [8]:
import transformers
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import AutoModelForSequenceClassification

huggingface_tokenizer = BertTokenizer.from_pretrained('klue/bert-base')
huggingface_model = AutoModelForSequenceClassification.from_pretrained('klue/bert-base', num_labels = 2)
config = huggingface_model.config

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

In [9]:
def transform(data):
    return huggingface_tokenizer(
        data['document'],
        truncation = True,
        padding = 'max_length',
        return_token_type_ids = False,
        )

In [10]:
# train & validation & test split

# hf_train_dataset = load_dataset('nsmc', split = 'train[:80%]')
# hf_val_dataset = load_dataset('nsmc', split = 'train[80%:100%]')
# hf_test_dataset = load_dataset('nsmc', split = 'test')

# hf_train_dataset = hf_train_dataset.map(transform, batched=True)
# hf_train_dataset

# hf_val_dataset = hf_val_dataset.map(transform, batched=True)
# hf_val_dataset

# hf_test_dataset = hf_test_dataset.map(transform, batched=True)
# hf_test_dataset

## 🚨 Revision
### Using too much data makes the model run take too long, so this was changed to run on only 1~2% of the given data!

In [11]:
# Second attempt: train & validation & test split

hf_train_dataset = load_dataset('nsmc', split = 'train[:2%]')
hf_val_dataset = load_dataset('nsmc', split = 'train[2%:3%]')
hf_test_dataset = load_dataset('nsmc', split = 'test[:2%]')

hf_train_dataset = hf_train_dataset.map(transform, batched=True)


hf_val_dataset = hf_val_dataset.map(transform, batched=True)


hf_test_dataset = hf_test_dataset.map(transform, batched=True)

print(hf_train_dataset)
print(hf_val_dataset)
print(hf_test_dataset)

Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)
Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)
Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)
Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-c28c09024d5ef772.arrow
Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-4f10c8a4eb4783af.arrow
Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/defa

Dataset({
    features: ['attention_mask', 'document', 'id', 'input_ids', 'label'],
    num_rows: 3000
})
Dataset({
    features: ['attention_mask', 'document', 'id', 'input_ids', 'label'],
    num_rows: 1500
})
Dataset({
    features: ['attention_mask', 'document', 'id', 'input_ids', 'label'],
    num_rows: 1000
})
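
Because `transform` uses `padding='max_length'`, every row is padded to the tokenizer's maximum length before training starts. The sketch below estimates how much of the input is padding under that setting. The token lengths are made up, `padding_fraction` is a hypothetical helper, and the maximum length of 512 is an assumption about the tokenizer's default rather than something shown in the output above.

```python
def padding_fraction(token_lengths, max_length):
    """Fraction of positions that are padding when every row is padded to max_length."""
    total_positions = len(token_lengths) * max_length
    real_tokens = sum(min(n, max_length) for n in token_lengths)
    return 1 - real_tokens / total_positions

# Hypothetical token counts for four short movie reviews.
lengths = [12, 30, 8, 45]
print(round(padding_fraction(lengths, max_length=512), 3))
```

For short reviews like these, most positions in each row would be padding, which is the cost that dynamic padding and bucketing try to reduce.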


## STEP 3. Preprocessing the dataset with the tokenizer loaded above and training the model

In [12]:
import os
import numpy as np
from transformers import Trainer, TrainingArguments

output_dir = os.getenv('HOME')+'/aiffel/transformers'

training_arguments = TrainingArguments(
    output_dir,                            # directory where outputs are saved
    evaluation_strategy="epoch",           # how often to run evaluation
    learning_rate = 2e-5,                  # learning rate
    per_device_train_batch_size = 8,       # training batch size per device
    per_device_eval_batch_size = 8,        # evaluation batch size per device
    num_train_epochs = 3,                  # total number of training epochs
    weight_decay = 0.01,                   # weight decay
)

In [13]:
from datasets import load_metric
metric = load_metric('glue', 'mrpc')

def compute_metrics(eval_pred):    
    predictions,labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references = labels)
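
For reference, the `glue`/`mrpc` metric reports accuracy and F1 for the positive class. The sketch below reproduces the same two numbers by hand with NumPy on made-up logits; `accuracy_and_f1` is a hypothetical helper and not part of `datasets`.

```python
import numpy as np

def accuracy_and_f1(logits, labels):
    """Compute accuracy and binary F1 (positive class = 1) from raw logits."""
    preds = np.argmax(logits, axis=1)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives
    accuracy = float(np.mean(preds == labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up logits for four reviews; columns are (negative, positive).
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
labels = [0, 1, 1, 1]
result = accuracy_and_f1(logits, labels)
print(result)
```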

In [14]:
trainer = Trainer(
    model=huggingface_model,           # model to train
    args=training_arguments,           # arguments set through TrainingArguments
    train_dataset=hf_train_dataset,    # training dataset
    eval_dataset=hf_val_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running training *****
  Num examples = 3000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1125


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.405592 | 0.856667 | 0.857898 |
| 2 | 0.375300 | 0.495182 | 0.864667 | 0.858141 |
| 3 | 0.198200 | 0.611858 | 0.864667 | 0.864214 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-500
Configuration saved in /aiffel/aiffel/transformers/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-1000
Configuration saved in /aiffel/aiffel/transformers/checkpoint-1000/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding

TrainOutput(global_step=1125, training_loss=0.2702951422797309, metrics={'train_runtime': 1030.738, 'train_samples_per_second': 8.732, 'train_steps_per_second': 1.091, 'total_flos': 2367999498240000.0, 'train_loss': 0.2702951422797309, 'epoch': 3.0})
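
The step count and checkpoint names in the log follow from the training settings. Here is a small sketch of the arithmetic, assuming a single device, no gradient accumulation, and a checkpoint every 500 steps (an interval consistent with the `checkpoint-500` and `checkpoint-1000` directories in the log); `total_optimization_steps` is a hypothetical helper.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs

steps = total_optimization_steps(num_examples=3000, batch_size=8, num_epochs=3)
checkpoints = list(range(500, steps + 1, 500))
print(steps, checkpoints)  # 1125 [500, 1000]
```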

In [15]:
trainer.evaluate(hf_test_dataset)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'eval_loss': 0.6369144916534424,
 'eval_accuracy': 0.859,
 'eval_f1': 0.8648130393096836,
 'eval_runtime': 32.1174,
 'eval_samples_per_second': 31.136,
 'eval_steps_per_second': 3.892,
 'epoch': 3.0}

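## STEP 5. Applying bucketing to training and comparing with the STEP 4 results
* Using the links below, find out what bucketing and dynamic padding are, then apply them to train the model.
* Compare the results from STEP 4 with the results of training with bucketing applied, and see what advantages there are in terms of model performance and training time.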

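## What is bucketing?


As with many other kinds of data, it is best to reshuffle sequential data every epoch when building mini-batches, where possible. Sentence lengths vary widely, though, and in the worst case a mini-batch combines one very long sentence with many very short ones. Padding then goes up to the longest sentence, which is highly inefficient. Unless you write the code yourself in C++, it is hard to avoid padding in TF, which takes a 3-D tensor as input. Bucketing is an attempt to limit this inefficiency as much as possible.

The idea is simple: sort the data by step length, split it into a few groups, and when shuffling, shuffle only within each group. The groups could be, for example, 20 words or fewer, 21 to 40 words, and 41 words or more. Because the lengths within a group are bounded, batches no longer end up mostly padding. If the program takes the sequence length as an external argument, you can pass each bucket's length, so there is the added advantage that it need not be changed dynamically every time.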
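
The grouping described above can be sketched in plain Python. This is an illustration under stated assumptions, not the method used by `Trainer`: `bucketed_batches`, the boundaries of 20 and 40, and the example lengths are all hypothetical.

```python
import random

def bucketed_batches(lengths, boundaries, batch_size, seed=0):
    """Group example indices into length buckets, shuffle inside each bucket,
    and cut every bucket into batches."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for idx, n in enumerate(lengths):
        # First bucket whose upper bound is >= n; the last bucket is open-ended.
        b = next((i for i, upper in enumerate(boundaries) if n <= upper), len(boundaries))
        buckets[b].append(idx)
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # shuffle only within the bucket
        for start in range(0, len(bucket), batch_size):
            batches.append(bucket[start:start + batch_size])
    rng.shuffle(batches)  # mix the order of batches across buckets
    return batches

lengths = [5, 38, 12, 61, 22, 19, 44, 7]
batches = bucketed_batches(lengths, boundaries=[20, 40], batch_size=2)
for batch in batches:
    print([lengths[i] for i in batch])
```

Every printed batch contains lengths from a single range, so padding to the longest item in the batch wastes little space. In the Transformers version used here, `TrainingArguments` also has a `group_by_length` flag intended for a similar purpose; it is not used in this notebook.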


## What is dynamic padding?

Dynamic padding means padding each batch separately rather than the whole dataset, which avoids excessive padding.
Each batch is padded to the length of its longest sequence, so only as much padding as the batch needs is added, which makes the model more efficient.

To do this, a collate function is needed that applies the right amount of padding to each element of the dataset being batched. The Transformers library provides this through DataCollatorWithPadding.
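
What such a collate function does can be sketched without any library. This is a simplified illustration, not the implementation of `DataCollatorWithPadding`: `pad_batch` is hypothetical, and the pad id of 0 and the token ids are made up.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids = [ids + [pad_token_id] * (longest - len(ids)) for ids in batch_input_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; 2 and 3 stand in for the [CLS] and [SEP] ids.
batch = [[2, 11, 12, 3], [2, 21, 3], [2, 31, 32, 33, 34, 3]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

A collator can only shorten batches whose rows were not already padded to a fixed length during tokenization. Rows tokenized with `padding='max_length'` arrive at the collator already at full length, so per-batch padding has nothing left to save.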

In [16]:
from transformers import DataCollatorWithPadding
collate_function = DataCollatorWithPadding(tokenizer = huggingface_tokenizer)

training_arguments = TrainingArguments(
    output_dir,                            # directory where outputs are saved
    evaluation_strategy="epoch",           # how often to run evaluation
    learning_rate = 2e-5,                  # learning rate
    per_device_train_batch_size = 8,       # training batch size per device
    per_device_eval_batch_size = 8,        # evaluation batch size per device
    num_train_epochs = 3,                  # total number of training epochs
    weight_decay = 0.01,                   # weight decay
)

trainer = Trainer(
    model=huggingface_model,           # model to train
    args=training_arguments,           # arguments set through TrainingArguments
    data_collator = collate_function,  # ❗️added for dynamic padding❗️
    train_dataset=hf_train_dataset,    # training dataset
    eval_dataset=hf_val_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,
)
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running training *****
  Num examples = 3000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1125


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.744031 | 0.864667 | 0.862745 |
| 2 | 0.087400 | 0.892822 | 0.864 | 0.860656 |
| 3 | 0.033000 | 0.930898 | 0.866667 | 0.86541 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-500
Configuration saved in /aiffel/aiffel/transformers/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-1000
Configuration saved in /aiffel/aiffel/transformers/checkpoint-1000/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding

TrainOutput(global_step=1125, training_loss=0.056650420082939995, metrics={'train_runtime': 1028.7269, 'train_samples_per_second': 8.749, 'train_steps_per_second': 1.094, 'total_flos': 2367999498240000.0, 'train_loss': 0.056650420082939995, 'epoch': 3.0})

In [17]:
trainer.evaluate(hf_test_dataset)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'eval_loss': 0.9647986888885498,
 'eval_accuracy': 0.864,
 'eval_f1': 0.8671875,
 'eval_runtime': 34.761,
 'eval_samples_per_second': 28.768,
 'eval_steps_per_second': 3.596,
 'epoch': 3.0}

## ✅ Comparison before and after applying bucketing

First, both models, with and without bucketing, were run using only 1~2% of the given data. The results are as follows.

* **Before applying bucketing**
    
    'train_runtime': 591.4823, 
    
    'train_samples_per_second': 7.608, 
    
    'train_steps_per_second': 0.954, 
    
    'total_flos': 1183999749120000.0, 
    
    'train_loss': 0.027098344995620402, 
    
    'epoch': 3.0
    

    {'eval_loss': 0.5826172232627869,
     'eval_accuracy': 0.854,
     'eval_f1': 0.8596153846153847,
     'eval_runtime': 34.6742,
     'eval_samples_per_second': 28.84,
     'eval_steps_per_second': 3.605,
     'epoch': 3.0}
 
* **After applying bucketing**

    'train_runtime': 588.6736,
    
    'train_samples_per_second': 7.644, 
    
    'train_steps_per_second': 0.958, 
    
    'total_flos': 1183999749120000.0, 
    
    'train_loss': 0.06299833033947234, 
    
    'epoch': 3.0
    
    
    {'eval_loss': 0.9059286713600159,
     'eval_accuracy': 0.862,
     'eval_f1': 0.8667953667953667,
     'eval_runtime': 32.0769,
     'eval_samples_per_second': 31.175,
     'eval_steps_per_second': 3.897,
     'epoch': 3.0}
     
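### 📊 Analysis
* Accuracy increased from 0.854 to 0.862, and the F1 score increased from 0.860 to 0.867.
* Evaluation runtime decreased from 34.7 s to 32.1 s, while training runtime was nearly unchanged (591.5 s versus 588.7 s).
* However, the evaluation loss increased from 0.58 to 0.91.

One interpretation: if the loss and the accuracy both increase, regularization may be working and countering overfitting, but this holds only if the loss then starts to decrease while the accuracy keeps increasing. Otherwise, if the loss keeps growing, the model is diverging and the cause should be investigated (often a learning rate that is too high).

In the training logs above, validation loss rises every epoch while training loss falls, which is more commonly a sign of overfitting: predictions become more confident, so the few wrong ones are penalized more heavily even though accuracy barely changes. The second run also continues from the weights already fine-tuned in the first run, because the same `huggingface_model` object is passed to both `Trainer` instances, so the two runs do not start from the same point.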
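
How loss can rise without accuracy falling is easier to see with numbers. The sketch below uses made-up probabilities for four examples; both sets of predictions make the same single mistake, but one is far more confident. `mean_cross_entropy` and `accuracy` are hypothetical helpers written for this illustration.

```python
import numpy as np

def mean_cross_entropy(probs_of_positive, labels):
    """Mean binary cross-entropy given P(label = 1) for each example."""
    p = np.asarray(probs_of_positive, dtype=float)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy(probs_of_positive, labels):
    """Share of examples whose thresholded prediction matches the label."""
    preds = (np.asarray(probs_of_positive) >= 0.5).astype(int)
    return float(np.mean(preds == np.asarray(labels)))

labels = [1, 1, 0, 0]
# Made-up predictions: both get the last example wrong.
mild = [0.7, 0.7, 0.3, 0.6]             # moderately confident
confident = [0.99, 0.99, 0.01, 0.99]    # very confident, including on the mistake

for name, probs in [("mild", mild), ("confident", confident)]:
    print(name, accuracy(probs, labels), round(mean_cross_entropy(probs, labels), 3))
```

Both rows report the same accuracy, while the confident predictions have the higher loss, because cross-entropy penalizes a confident wrong answer much more than a hesitant one.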

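## Retrospective
What went well: this project helped me better understand the style and flow of code that uses Huggingface.

What fell short: it is disappointing that accuracy did not reach 90%.

What to work on: find ways to raise accuracy further, such as using all of the data or tuning the parameters.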