# STEP 1. Analyzing the NSMC data and building a Huggingface dataset
- The dataset can be downloaded from GitHub or loaded from [Huggingface datasets](https://huggingface.co/datasets). Let's apply the methods covered earlier!

In [1]:
import datasets
from datasets import load_dataset, Dataset

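train_data = load_dataset('nsmc', split='train[:80%]')  # first 80% of train -> 120,000 rows
val_data = load_dataset('nsmc', split='train[-7%:]')    # last 7% of train -> 10,500 rows (no overlap with train_data)
test_data = load_dataset('nsmc', split='test[:21%]')    # first 21% of test -> 10,500 rows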

Downloading and preparing dataset nsmc/default to /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3...


Dataset nsmc downloaded and prepared to /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3. Subsequent calls will reuse this data.


Found cached dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)


In [2]:
train_data

Dataset({
    features: ['id', 'document', 'label'],
    num_rows: 120000
})

In [3]:
val_data

Dataset({
    features: ['id', 'document', 'label'],
    num_rows: 10500
})

In [4]:
test_data

Dataset({
    features: ['id', 'document', 'label'],
    num_rows: 10500
})

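### Data preprocessing

The NSMC dataset ships with only train and test splits.<br>
Inspecting the columns shows that each example consists of id, document, and label.<br>
Because there is no validation split, the validation data is carved out of train.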
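
The row counts shown above follow directly from the percent slices. As a minimal sketch, the helper below (`percent_slice_rows` is a hypothetical name, not part of `datasets`) reproduces the arithmetic for the split sizes logged in this notebook, assuming simple rounding to the nearest row.

```python
def percent_slice_rows(num_rows, start_pct=None, end_pct=None):
    """Row count selected by a percent slice such as 'train[:80%]' or 'train[-7%:]'."""
    def to_index(pct, default):
        if pct is None:
            return default
        idx = round(num_rows * pct / 100)
        # Negative percentages count from the end, like Python slicing.
        return num_rows + idx if idx < 0 else idx

    start = to_index(start_pct, 0)
    end = to_index(end_pct, num_rows)
    return max(end - start, 0)


# Split sizes taken from the download log: 150,000 train rows, 50,000 test rows.
train_rows = percent_slice_rows(150_000, end_pct=80)    # train[:80%]
val_rows = percent_slice_rows(150_000, start_pct=-7)    # train[-7%:]
test_rows = percent_slice_rows(50_000, end_pct=21)      # test[:21%]
print(train_rows, val_rows, test_rows)
```

Because the training slice ends at row 120,000 and the validation slice starts at row 139,500, the two slices do not share any rows.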

# STEP 2. Loading the klue/bert-base model and tokenizer

In [23]:
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base", use_fast=True)


# model = AutoModelForSequenceClassification.from_pretrained("klue/roberta-small", num_labels=2)
# tokenizer = AutoTokenizer.from_pretrained("klue/roberta-small")


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

In [24]:
def transform(data):
    return tokenizer(
        data['document'],
        truncation=True,
        padding='max_length',
        max_length=30,
        return_token_type_ids=False,)

In [25]:
train_data = train_data.map(transform, batched=True)
test_data = test_data.map(transform, batched=True)
val_data = val_data.map(transform, batched=True)

Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-3b0438473d9d52d4.arrow
Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-7c8bf3841eee4408.arrow
Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-2bfefb7a5d16dca8.arrow


The first training run padded to the full max_length,<br>
but max_length was reduced to 30 to shorten training time.<br>
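
One way to sanity-check a cutoff such as 30 is to look at how many samples fit without truncation. The sketch below is only an illustration: `coverage_at` is a hypothetical helper and `toy_lengths` are made-up token counts, not measurements from NSMC.

```python
import numpy as np


def coverage_at(lengths, max_length):
    """Fraction of samples whose token count fits within max_length (no truncation)."""
    lengths = np.asarray(lengths)
    return float((lengths <= max_length).mean())


# Hypothetical token counts for ten reviews; real values would come from
# len(tokenizer(text)['input_ids']) for each document.
toy_lengths = [5, 8, 12, 14, 17, 21, 25, 29, 34, 60]

for candidate in (16, 30, 64):
    print(candidate, coverage_at(toy_lengths, candidate))
```

A shorter cutoff trains faster but truncates more reviews, so the coverage number helps make that trade-off explicit.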


# STEP 3. Preprocessing the dataset with the tokenizer loaded above and training the model

In [29]:
import os
import numpy as np
from transformers import Trainer, TrainingArguments

output_dir = os.getenv("HOME") + '/aiffel/aiffel_quest/mini_quest_240326'

training_arguments = TrainingArguments(
    output_dir,                        # directory where outputs are saved
    evaluation_strategy='epoch',       # evaluate at the end of every epoch
    learning_rate=2e-5,                # learning rate
    per_device_train_batch_size=576,   # training batch size per device (default: 8)
    per_device_eval_batch_size=576,    # evaluation batch size per device (default: 8)
    num_train_epochs=8,                # total number of training epochs
    weight_decay=0.01)                 # weight decay



In [30]:
from datasets import load_metric
metric = load_metric('accuracy')


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
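
Newer releases of `datasets` deprecate `load_metric` in favor of the separate `evaluate` package, so the cell above may not run unchanged everywhere. As a hedged alternative, accuracy can be computed with NumPy alone; `compute_accuracy` is a hypothetical name and the logits and labels below are toy values.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Accuracy from (logits, labels), mirroring compute_metrics above."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == np.asarray(labels)).mean())}


# Toy logits for four samples and two classes.
toy_logits = np.array([[2.0, 0.1], [0.3, 1.2], [0.9, 0.8], [0.1, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_accuracy((toy_logits, toy_labels)))
```

The function returns a dict with the same `accuracy` key, so it could be passed to `Trainer` as `compute_metrics` in place of the version above.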

In [31]:
import torch

# Release cached GPU memory when needed in the middle of a run
torch.cuda.empty_cache()


In [32]:
%%time
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=val_data,
    compute_metrics=compute_metrics,)

trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 120000
  Num Epochs = 8
  Instantaneous batch size per device = 576
  Total train batch size (w. parallel, distributed & accumulation) = 576
  Gradient Accumulation steps = 1
  Total optimization steps = 1672


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.264576,0.885143
2,No log,0.248351,0.895619
3,0.279400,0.249185,0.896762
4,0.279400,0.253465,0.900286
5,0.182400,0.266606,0.898286
6,0.182400,0.278067,0.898286
7,0.182400,0.28822,0.89819
8,0.133700,0.293996,0.897524


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 576
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 576
Saving model checkpoint to /aiffel/aiffel/aiffel_quest/mini_quest_240326/checkpoint-500
Configuration saved in /aiffel/aiffel/aiffel_quest/mini_quest_240326/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/aiffel_quest/mini_quest_240326/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 576
The followin

CPU times: user 1h 6min 6s, sys: 8min 37s, total: 1h 14min 43s
Wall time: 1h 14min 42s


TrainOutput(global_step=1672, training_loss=0.19038934570750551, metrics={'train_runtime': 4477.434, 'train_samples_per_second': 214.409, 'train_steps_per_second': 0.373, 'total_flos': 1.4799996864e+16, 'train_loss': 0.19038934570750551, 'epoch': 8.0})
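
The `Total optimization steps` values reported in this notebook's training logs follow from the dataset size, batch size, and epoch count. The sketch below reproduces them, assuming a single device and that the last partial batch of each epoch is kept; `total_optimization_steps` is a hypothetical helper.

```python
import math


def total_optimization_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Number of optimizer updates for a full training run on one device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch kept
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


print(total_optimization_steps(120_000, 576, 8))    # batch size 576, 8 epochs
print(total_optimization_steps(120_000, 512, 10))   # batch size 512, 10 epochs
```

With the default `save_steps` of 500, a run of this length only writes a few checkpoints, which matches the `checkpoint-500` lines in the log.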

In [33]:
trainer.evaluate(test_data)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 576


{'eval_loss': 0.32934555411338806,
 'eval_accuracy': 0.8917142857142857,
 'eval_runtime': 18.3454,
 'eval_samples_per_second': 572.351,
 'eval_steps_per_second': 1.036,
 'epoch': 8.0}

# STEP 4. Improving model performance (accuracy) through fine-tuning
- Adjust the data preprocessing, TrainingArguments, and so on to raise the model's accuracy to 90% or higher.

In [34]:
import gc
del training_arguments, trainer, model
gc.collect()

3

In [35]:
model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base", use_fast=True)

# model = AutoModelForSequenceClassification.from_pretrained("klue/roberta-small", num_labels=2)
# tokenizer = AutoTokenizer.from_pretrained("klue/roberta-small")

loading configuration file https://huggingface.co/klue/bert-base/resolve/main/config.json from cache at /aiffel/.cache/huggingface/transformers/fbd0b2ef898c4653902683fea8cc0dd99bf43f0e082645b913cda3b92429d1bb.99b3298ed554f2ad731c27cdb11a6215f39b90bc845ff5ce709bb4e74ba45621
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file https://huggingface.co/klue/bert-base/resolve/main/pytorch_model.bin from cache at /aiffel/.cache/huggingface/transform

In [36]:
output_dir = os.getenv("HOME") + '/aiffel/aiffel_quest/mini_quest_240326'

training_arguments = TrainingArguments(
    output_dir,                        # directory where outputs are saved
    evaluation_strategy='epoch',       # evaluate at the end of every epoch
    learning_rate=2e-5,                # learning rate
    per_device_train_batch_size=512,   # training batch size per device (default: 8)
    per_device_eval_batch_size=512,    # evaluation batch size per device (default: 8)
    num_train_epochs=10,               # total number of training epochs
    weight_decay=0.01)                 # weight decay



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [37]:
%%time
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=val_data,
    compute_metrics=compute_metrics,)

trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 120000
  Num Epochs = 10
  Instantaneous batch size per device = 512
  Total train batch size (w. parallel, distributed & accumulation) = 512
  Gradient Accumulation steps = 1
  Total optimization steps = 2350


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.259228,0.888095
2,No log,0.247204,0.895143
3,0.283400,0.247491,0.896952
4,0.283400,0.256262,0.900476
5,0.185500,0.268784,0.900762
6,0.185500,0.289511,0.899238
7,0.131300,0.301005,0.899619
8,0.131300,0.325092,0.898381
9,0.098900,0.337379,0.898667
10,0.098900,0.342531,0.898381


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 512
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 512
Saving model checkpoint to /aiffel/aiffel/aiffel_quest/mini_quest_240326/checkpoint-500
Configuration saved in /aiffel/aiffel/aiffel_quest/mini_quest_240326/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/aiffel_quest/mini_quest_240326/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 512
The followin

CPU times: user 1h 20min 16s, sys: 10min 1s, total: 1h 30min 18s
Wall time: 1h 30min 55s


TrainOutput(global_step=2350, training_loss=0.16093394178025267, metrics={'train_runtime': 5455.8327, 'train_samples_per_second': 219.948, 'train_steps_per_second': 0.431, 'total_flos': 1.849999608e+16, 'train_loss': 0.16093394178025267, 'epoch': 10.0})

In [38]:
trainer.evaluate(test_data)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 10500
  Batch size = 512


{'eval_loss': 0.3852943778038025,
 'eval_accuracy': 0.8865714285714286,
 'eval_runtime': 17.0778,
 'eval_samples_per_second': 614.834,
 'eval_steps_per_second': 1.23,
 'epoch': 10.0}
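
In the STEP 4 training log, validation loss bottoms out early and then rises while training loss keeps falling, which suggests the later epochs overfit. The sketch below simply picks the best epochs from the logged table; the numbers are copied from the log above, and the variable names are illustrative.

```python
# (epoch, validation_loss, accuracy) copied from the STEP 4 training log.
history = [
    (1, 0.259228, 0.888095),
    (2, 0.247204, 0.895143),
    (3, 0.247491, 0.896952),
    (4, 0.256262, 0.900476),
    (5, 0.268784, 0.900762),
    (6, 0.289511, 0.899238),
    (7, 0.301005, 0.899619),
    (8, 0.325092, 0.898381),
    (9, 0.337379, 0.898667),
    (10, 0.342531, 0.898381),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])

print("lowest validation loss at epoch", best_by_loss[0])
print("highest validation accuracy at epoch", best_by_accuracy[0])
```

Since `trainer.evaluate(test_data)` here scores the weights from the final epoch, one option worth trying is to keep the best checkpoint instead, for example with `load_best_model_at_end=True` in `TrainingArguments` (whether that helps on this data is an assumption, not something this notebook measured).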

# STEP 5. Training with bucketing and comparing against the STEP 4 results
- Using the links below, find out what bucketing and dynamic padding are, then train the model with them applied.
    - [Data Collator](https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/data_collator)
    - [group_by_length in Trainer.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)
- Compare the STEP 4 results with the results of training with bucketing, and examine what advantages there are in terms of both model performance and training time.

In [44]:
del training_arguments, trainer, model, tokenizer
gc.collect()

18

In [17]:
from transformers import DataCollatorWithPadding

# Define the tokenizer and data collator
model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base", use_fast=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

In [10]:
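def transform(data):
    # Note: padding='max_length' already pads every sample to 30 tokens, so
    # DataCollatorWithPadding has no additional per-batch padding left to do
    # and group_by_length sees identical lengths for all samples.
    encoded_data = tokenizer(
        data['document'],
        truncation=True,
        padding='max_length',
        max_length=30,
        return_tensors="pt",
        return_token_type_ids=False,)

    # Convert the tensors to plain lists before returning
    return {key: value.squeeze().tolist() for key, value in encoded_data.items()}


# Applied one example at a time (batched=True is not used here)
train_data = train_data.map(transform)
test_data = test_data.map(transform)
val_data = val_data.map(transform)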

In [14]:
import os
import numpy as np
from transformers import Trainer, TrainingArguments

In [21]:
output_dir = os.getenv("HOME") + '/aiffel/aiffel_quest/mini_quest_240326'

training_arguments = TrainingArguments(
    output_dir,                        # directory where outputs are saved
    evaluation_strategy='epoch',       # evaluate at the end of every epoch
    learning_rate=2e-5,                # learning rate
    per_device_train_batch_size=512,   # training batch size per device (default: 8)
    per_device_eval_batch_size=512,    # evaluation batch size per device (default: 8)
    num_train_epochs=10,               # total number of training epochs
    group_by_length=True,              # batch samples of similar length together (bucketing)
    weight_decay=0.1)                  # weight decay (STEP 4 used 0.01)

In [22]:
from datasets import load_metric
metric = load_metric('accuracy')


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [23]:
%%time
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    data_collator=data_collator,  # added for dynamic padding
    eval_dataset=val_data,
    compute_metrics=compute_metrics,)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.263432,0.887905
2,No log,0.249843,0.896286
3,0.274400,0.254918,0.895238
4,0.274400,0.260062,0.900095
5,0.184700,0.273178,0.900286
6,0.184700,0.29692,0.899333
7,0.130300,0.310607,0.900476
8,0.130300,0.343896,0.896
9,0.096700,0.352533,0.898571
10,0.096700,0.358275,0.897619


CPU times: user 1h 20min 50s, sys: 8min 33s, total: 1h 29min 23s
Wall time: 1h 29min 18s


TrainOutput(global_step=2350, training_loss=0.15802060553368102, metrics={'train_runtime': 5349.695, 'train_samples_per_second': 224.312, 'train_steps_per_second': 0.439, 'total_flos': 1.849999608e+16, 'train_loss': 0.15802060553368102, 'epoch': 10.0})

In [24]:
trainer.evaluate(test_data)

{'eval_loss': 0.3893945515155792,
 'eval_accuracy': 0.8895238095238095,
 'eval_runtime': 16.8408,
 'eval_samples_per_second': 623.484,
 'eval_steps_per_second': 1.247,
 'epoch': 10.0}

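What is bucketing?<br>
In NLP, bucketing is one strategy for processing data efficiently. It is used in particular when training sequence-to-sequence models, and it builds batches by grouping examples according to the length of the input sentence.

Normally, when sentences of different lengths are processed in the same batch, they are all padded to the same length. This can use GPU memory inefficiently. For example, if one sentence in a batch is much longer than the rest, every other sentence has to be padded up to that length. The padding tokens carry no real meaning but still take part in the computation, wasting memory and compute time.

Bucketing is one way to address this problem. It divides the data into several groups and forms batches from sentences of similar length within each group. This minimizes the amount of padding and makes efficient use of GPU memory.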

For example, bucketing can group the data by sentence length as follows:

- sentences of 1-10 tokens
- sentences of 11-20 tokens
- sentences of 21-30 tokens
- ...

Each group is then batched from sentences of similar length and processed, which makes training more efficient.
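
The effect of grouping by length can be sketched without any model. The example below counts padding tokens for batches drawn in arbitrary order versus batches built from length-sorted data. It is a simplification under stated assumptions: the lengths are random toy values, the helper names are hypothetical, and a full sort is used, whereas the Trainer's `group_by_length` keeps some randomness in how batches are formed.

```python
import random


def make_batches(lengths, batch_size):
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_tokens(batches):
    """Total pad tokens when each batch is padded to its own longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)


random.seed(0)
lengths = [random.randint(1, 30) for _ in range(1000)]  # hypothetical token counts

unsorted_batches = make_batches(lengths, batch_size=32)
bucketed_batches = make_batches(sorted(lengths), batch_size=32)

print("pad tokens, arbitrary batches:", padding_tokens(unsorted_batches))
print("pad tokens, bucketed batches :", padding_tokens(bucketed_batches))
```

The saving only appears when each batch is padded to its own longest sequence; if every sample is already padded to one fixed length, grouping by length changes nothing.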

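Bucketing is especially useful when training self-attention-based models such as the Transformer. Because the cost of self-attention grows steeply with sequence length (quadratically in the number of tokens), processing sentences of similar length together improves computational efficiency.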
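
To give a rough sense of that scaling, the sketch below counts the entries in the attention score matrices of a single layer. It assumes 12 attention heads and a 512-token position limit, as in the klue/bert-base config printed earlier, and ignores every other cost in the model; `attention_score_entries` is a hypothetical helper.

```python
def attention_score_entries(seq_len, batch_size=1, num_heads=12):
    """Number of entries in one layer's attention score matrices (heads x seq x seq)."""
    return batch_size * num_heads * seq_len * seq_len


short = attention_score_entries(30)    # max_length used in this notebook
full = attention_score_entries(512)    # max_position_embeddings of the model
print(short, full, round(full / short, 1))
```

Padding tokens inflate `seq_len` just like real tokens do, which is why trimming padding pays off more than linearly for the attention part of the computation.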

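What is dynamic padding?<br>
Dynamic padding adjusts the amount of padding to the lengths of the samples in each batch. Unlike static padding, when batches contain samples of different lengths it uses as little padding as possible, which makes efficient use of memory.

Static padding typically pads every sample to one fixed maximum length, whereas dynamic padding pads each batch only up to the length of its longest sample. As a result, samples are not all padded to the same global length; the padding follows the longest sample in each batch, so memory is used more efficiently.

Dynamic padding is mainly used with sequence models such as RNNs (Recurrent Neural Networks) and Transformers. Input sequences for these models can differ in length, and dynamic padding plays an important role in batching them efficiently. It is also easy to apply with the Hugging Face Transformers library.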
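
The difference between the two padding modes can be sketched in plain Python. `pad_batch` below is a hypothetical helper, not the library's collator; it assumes sequences are already truncated and uses pad id 0, matching `pad_token_id` in the config printed earlier. The token ids are made up.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Pad token-id lists to max_length (static) or to the batch's longest (dynamic)."""
    target = max_length if max_length is not None else max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[2, 11, 12, 3], [2, 21, 3], [2, 31, 32, 33, 34, 3]]  # hypothetical token ids

static = pad_batch(batch, max_length=30)  # every row padded to 30
dynamic = pad_batch(batch)                # rows padded to the longest in this batch

print(len(static["input_ids"][0]), len(dynamic["input_ids"][0]))
```

With `DataCollatorWithPadding`, the per-batch behavior only takes effect if the tokenizer call itself leaves sequences unpadded (truncation without `padding='max_length'`); otherwise every sample reaches the collator already at the fixed length.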

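# Retrospective
- Fine-tuning raised validation accuracy to as high as 0.900762 (STEP 4, epoch 5). Reaching this was possible by using as much of the available compute as I could and training on a large amount of data. Test accuracy stayed just under 0.90 in every run (0.8866 to 0.8917).
- After several rounds of NLP work some parts are starting to make sense, but much of it is still difficult, so a lot more study seems necessary.
- With bucketing, training finished about a minute and a half sooner (wall time 1h 30min 55s -> 1h 29min 18s) and the final training loss was lower than in the STEP 4 fine-tuning run (0.1609 -> 0.1580), so it appeared to have some effect.
- On the test set, the bucketing run had a slightly higher loss (0.3853 -> 0.3894), although its accuracy was slightly higher (0.8866 -> 0.8895) and evaluation was marginally faster rather than slower (17.08 s -> 16.84 s).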
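
The comparison between the STEP 4 and STEP 5 runs can be tabulated directly from the logged metrics. The numbers below are copied from the `TrainOutput` and `trainer.evaluate` outputs in this notebook; the dictionary layout is only illustrative. Note that STEP 5 also changed `weight_decay` from 0.01 to 0.1, so the differences cannot be attributed to bucketing alone.

```python
# Metrics copied from the notebook outputs (STEP 4: no bucketing, STEP 5: bucketing).
step4 = {
    "train_runtime": 5455.8327,
    "train_loss": 0.16093394178025267,
    "test_loss": 0.3852943778038025,
    "test_accuracy": 0.8865714285714286,
    "eval_runtime": 17.0778,
}
step5 = {
    "train_runtime": 5349.695,
    "train_loss": 0.15802060553368102,
    "test_loss": 0.3893945515155792,
    "test_accuracy": 0.8895238095238095,
    "eval_runtime": 16.8408,
}

# Difference per metric (STEP 5 minus STEP 4).
deltas = {key: step5[key] - step4[key] for key in step4}

for key in step4:
    print(f"{key:>14}: {step4[key]:.4f} -> {step5[key]:.4f} ({deltas[key]:+.4f})")
```

The gaps are small relative to the run length (under 2% of training time), so a single pair of runs like this is weak evidence either way.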