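# Building a custom HuggingFace project

## Project goals

1. The model and data load correctly and are confirmed to work.
    - Fine-tuned klue/bert-base on the NSMC dataset and confirmed that the model works correctly.
2. Preprocessing was improved, and fine-tuning improved the model's performance.
    - Raised validation accuracy to 90% or higher.
3. Bucketing was applied successfully to model training, and the results were compared and analyzed.
    - Ran the bucketing task to check whether a trade-off arises between compute speed and model performance during fine-tuning, and presented the analysis.

## Code implementation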

In [39]:
from datasets import DatasetDict
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import os
import numpy as np



from transformers import Trainer, TrainingArguments
from datasets import load_metric

from transformers import DataCollatorWithPadding

In [40]:
# Initially used the Blpeng/nsmc dataset
# An error occurred during tokenization
# Some rows appear to have `document` stored in a different format

# Switched to e9t/nsmc and tokenized again
# Works correctly
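
The tokenizer error noted above would be consistent with some `document` values not being plain strings (for example missing or empty entries), although the exact cause was not confirmed. A minimal sketch of how such rows could be detected before tokenizing, using a hypothetical helper `is_valid_document` and made-up sample rows:

```python
def is_valid_document(example):
    """Return True when the 'document' field is a non-empty string."""
    doc = example.get("document")
    return isinstance(doc, str) and doc.strip() != ""


# Made-up rows for illustration; they are not taken from either dataset.
rows = [
    {"id": "1", "document": "재미있어요", "label": 1},
    {"id": "2", "document": None, "label": 0},
    {"id": "3", "document": "   ", "label": 0},
    {"id": "4", "document": 12345, "label": 1},
]

# Keep only the rows the tokenizer can safely process.
clean_rows = [row for row in rows if is_valid_document(row)]
print(len(rows), "->", len(clean_rows))
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `dataset.filter(is_valid_document)`.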

In [41]:
loaded_nsme_dataset = load_dataset("e9t/nsmc")
print(loaded_nsme_dataset)

Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/e9t___nsmc)/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})


In [42]:
train = loaded_nsme_dataset['train']
cols = train.column_names
cols

['id', 'document', 'label']

In [43]:
len(train)

150000

In [24]:
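# Because of training speed, train on a subset sampled from the 150,000 training examples (50,000 in the first runs; the cell below draws 100,000)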

In [25]:
train_dataset_1M = loaded_nsme_dataset['train'].train_test_split(train_size=100000, seed=42)['train']
len(train_dataset_1M)

Loading cached split indices for dataset at /aiffel/.cache/huggingface/datasets/e9t___nsmc)/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-a54e64762cbfb223.arrow and /aiffel/.cache/huggingface/datasets/e9t___nsmc)/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-bd2febced1d21d3a.arrow


100000

In [26]:
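# Print the first five training examples, one field per line
for i in range(5):
    row = train[i]  # index the row first to avoid loading a whole column per lookup
    for col in cols:
        print(col, ":", row[col])
    print('\n')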

id : 9976970
document : 아 더빙.. 진짜 짜증나네요 목소리
label : 0


id : 3819312
document : 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
label : 1


id : 10265843
document : 너무재밓었다그래서보는것을추천한다
label : 0


id : 9045019
document : 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
label : 0


id : 6483659
document : 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
label : 1




In [44]:
# First split: train / temp (validation + test)
dataset_split = train.train_test_split(test_size=0.2, seed=42)

# Split temp again into validation / test
test_val_split = dataset_split['test'].train_test_split(test_size=0.5, seed=42)

# Final DatasetDict
loaded_nsme_dataset = DatasetDict({
    'train': dataset_split['train'],
    'validation': test_val_split['train'],
    'test': test_val_split['test']
})

In [45]:
len(loaded_nsme_dataset['train']), len(loaded_nsme_dataset['validation']), len(loaded_nsme_dataset['test'])

(120000, 15000, 15000)

### Loading the klue/bert-base model and tokenizer

In [46]:
print(loaded_nsme_dataset['train'][0])

{'document': '이정도로 지루함... 몹시도 굼벵이 같은 주인공..빨리감기 추천', 'id': '5050479', 'label': 0}


In [47]:
huggingface_tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
huggingface_model = AutoModelForSequenceClassification.from_pretrained('klue/bert-base', num_labels = 2)

loading configuration file https://huggingface.co/klue/bert-base/resolve/main/config.json from cache at /aiffel/.cache/huggingface/transformers/fbd0b2ef898c4653902683fea8cc0dd99bf43f0e082645b913cda3b92429d1bb.99b3298ed554f2ad731c27cdb11a6215f39b90bc845ff5ce709bb4e74ba45621
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

loading file https://huggingface.co/klue/bert-base/resolve/main/vocab.txt from cache at /aiffel/.cache/huggingface/transformers/1a36e69d48a0

In [48]:
# Define a transform function that tokenizes each document
def transform(data):
    return huggingface_tokenizer(
        data['document'],
        truncation=True,
        # padding='max_length',  # disabled: padding here would prevent dynamic padding per batch
        return_token_type_ids=False,
    )


In [49]:
nsme_dataset = loaded_nsme_dataset.map(
    transform,
    batched=True
)

# train & validation & test split
nsme_train_dataset = nsme_dataset['train']
nsme_val_dataset = nsme_dataset['validation']
nsme_test_dataset = nsme_dataset['test']

  0%|          | 0/120 [00:00<?, ?ba/s]

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/15 [00:00<?, ?ba/s]

### Training the model

In [50]:
# Load the GLUE MRPC metric (accuracy and F1)
metric = load_metric('glue', 'mrpc')

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
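
The GLUE MRPC metric reports accuracy and F1 for the positive class. As a rough, dependency-free sketch of what `compute_metrics` ends up calculating (the helper names `argmax_rows` and `accuracy_and_f1` and the sample logits are illustrative assumptions, not part of the notebook):

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row of logits."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(predictions, references):
    """Compute accuracy and binary F1 with label 1 as the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for four reviews; columns are [negative, positive].
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]]
labels = [0, 1, 0, 1]

preds = argmax_rows(logits)
scores = accuracy_and_f1(preds, labels)
print(preds)
print(scores)
```

Accuracy and F1 are close to each other on a balanced dataset such as NSMC, which is one reason the follow-up notes suggest looking at finer-grained metrics.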

In [51]:
output_dir = os.getenv('HOME')+'/aiffel/transformers'

In [132]:


training_arguments = TrainingArguments(
    output_dir,                       # directory where outputs are saved
    evaluation_strategy="epoch",      # evaluate at the end of every epoch
    learning_rate=2e-5,               # learning rate
    per_device_train_batch_size=8,    # training batch size per device
    per_device_eval_batch_size=8,     # evaluation batch size per device
    num_train_epochs=3,               # total number of training epochs
    weight_decay=0.01,                # weight decay
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
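
With these arguments, the number of optimizer steps follows from the dataset size, batch size, and epoch count. A simplified single-device sketch (the function name `total_optimization_steps` is an assumption, and it ignores multi-GPU training and other `Trainer` details):

```python
import math


def total_optimization_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Estimate the number of optimizer steps for single-device training."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# 40,000 training examples, batch size 8, 3 epochs
steps_small = total_optimization_steps(40_000, 8, 3)
# 120,000 training examples with the same settings
steps_full = total_optimization_steps(120_000, 8, 3)
print(steps_small, steps_full)
```

These values agree with the `Total optimization steps` lines in the two training logs in this notebook (15000 and 45000).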


In [134]:
trainer = Trainer(
    model=huggingface_model,             # model to train
    args=training_arguments,             # arguments configured through TrainingArguments
    train_dataset=nsme_train_dataset,    # training dataset
    eval_dataset=nsme_val_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running training *****
  Num examples = 40000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 15000


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.31 | 0.331196 | 0.8856 | 0.884491 |
| 2 | 0.2496 | 0.402897 | 0.8866 | 0.88557 |
| 3 | 0.1499 | 0.513854 | 0.8906 | 0.889873 |


Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-500
Configuration saved in /aiffel/aiffel/transformers/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-500/pytorch_model.bin

TrainOutput(global_step=15000, training_loss=0.24337798716227213, metrics={'train_runtime': 11882.6268, 'train_samples_per_second': 10.099, 'train_steps_per_second': 1.262, 'total_flos': 3.15733266432e+16, 'train_loss': 0.24337798716227213, 'epoch': 3.0})

In [None]:
trainer.evaluate(nsme_test_dataset)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 8


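### Training with bucketing and comparing against the STEP 4 results

1. Dynamic padding
    - Pads according to sentence length
    - Before: every batch is padded to the maximum length
    - Dynamic padding: each batch is padded to the length of its longest sentence

2. Bucketing
    - Batches sentences of similar length together
    - Grouping similar-length sentences minimizes padding

- Dynamic padding and bucketing can be expected to improve memory efficiency and training speed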
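
To make the difference concrete, the following sketch counts padding tokens under three strategies: padding everything to a fixed maximum length, dynamic padding per batch, and bucketing (sorting by length before batching). The token lengths, `max_length`, and function names are made up for illustration. The `Trainer` option `group_by_length` groups samples of similar length with some randomness rather than a full sort, so this is only an approximation of its behavior.

```python
def make_batches(lengths, batch_size):
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_fixed(lengths, max_length):
    """Pad tokens needed when every sequence is padded to max_length."""
    return sum(max_length - n for n in lengths)


def padding_dynamic(lengths, batch_size):
    """Pad tokens needed when each batch is padded to its longest sequence."""
    return sum(max(batch) - n
               for batch in make_batches(lengths, batch_size)
               for n in batch)


def padding_bucketed(lengths, batch_size):
    """Dynamic padding after sorting, so each batch holds similar lengths."""
    return padding_dynamic(sorted(lengths), batch_size)


# Made-up token counts for eight reviews.
lengths = [5, 40, 8, 35, 6, 38, 7, 42]

print(padding_fixed(lengths, max_length=64))    # 331
print(padding_dynamic(lengths, batch_size=4))   # 147
print(padding_bucketed(lengths, batch_size=4))  # 19
```

Fewer padding tokens mean less wasted computation per batch, which is where the expected speed and memory gains come from.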


In [52]:
# Enable the bucketing option (group_by_length=True)
training_arguments = TrainingArguments(
    output_dir,                       # directory where outputs are saved
    evaluation_strategy="epoch",      # evaluate at the end of every epoch
    learning_rate=2e-5,               # learning rate
    per_device_train_batch_size=8,    # training batch size per device
    per_device_eval_batch_size=8,     # evaluation batch size per device
    num_train_epochs=3,               # total number of training epochs
    weight_decay=0.01,                # weight decay
    group_by_length=True,             # bucketing: batch samples of similar length together
    fp16=True,                        # mixed-precision training
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [53]:
# Dynamic padding: pad each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(
    tokenizer=huggingface_tokenizer
)


trainer = Trainer(
    model=huggingface_model,             # model to train
    args=training_arguments,             # arguments configured through TrainingArguments
    train_dataset=nsme_train_dataset,    # training dataset
    eval_dataset=nsme_val_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,
    data_collator=data_collator,         # add dynamic padding
)
trainer.train()

Using amp fp16 backend
The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 120000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 45000


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2955 | 0.298654 | 0.8948 | 0.89533 |
| 2 | 0.2384 | 0.366269 | 0.902133 | 0.902381 |
| 3 | 0.1673 | 0.476985 | 0.9026 | 0.9019 |


  nn.utils.clip_grad_norm_(
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-500
Configuration saved in /aiffel/aiffel/transformers/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-500/pytorch_model.bin

TrainOutput(global_step=45000, training_loss=0.23620720638699003, metrics={'train_runtime': 5365.2577, 'train_samples_per_second': 67.098, 'train_steps_per_second': 8.387, 'total_flos': 4285149980900160.0, 'train_loss': 0.23620720638699003, 'epoch': 3.0})

In [54]:
trainer.evaluate(nsme_test_dataset)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 15000
  Batch size = 8


{'eval_loss': 0.475166916847229,
 'eval_accuracy': 0.9034,
 'eval_f1': 0.9035093560631285,
 'eval_runtime': 54.0362,
 'eval_samples_per_second': 277.591,
 'eval_steps_per_second': 34.699,
 'epoch': 3.0}

In [29]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Tue Mar 18 03:17:27 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0              29W /  70W |  14923MiB /

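1. Accuracy by amount of data
- 10,000 examples: 0.87
- 50,000 examples: 0.89
- 150,000 examples: 0.91

2. Training time before and after dynamic padding and bucketing
- Before: 3 hours 18 minutes (about 200 minutes)
- After: 1 hour 32 minutes (about 90 minutes)
- Training time was more than halved
- No meaningful difference in accuracy

3. Problem encountered
- Padding was not being applied as intended
- The transform function already padded every example, so dynamic padding had no effect
- Removed the padding argument from transform
- After removing it, training that took about 3 hours finished in under 1 hour, roughly a 3x reduction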
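
The wall-clock comparison above mixes runs with different settings: according to the training logs, the run without bucketing used 40,000 examples, while the run with dynamic padding and bucketing used 120,000 examples and also enabled `fp16`. A small sketch that converts the logged `train_runtime` values into throughput, which is less sensitive to dataset size (the variable and key names are assumptions; the numbers are copied from the two `TrainOutput` lines):

```python
# Values copied from the training logs in this notebook.
runs = {
    "baseline": {"examples": 40_000, "epochs": 3, "runtime_s": 11882.6268},
    "bucketing": {"examples": 120_000, "epochs": 3, "runtime_s": 5365.2577},
}


def samples_per_second(run):
    """Training samples processed per second over the whole run."""
    return run["examples"] * run["epochs"] / run["runtime_s"]


throughput = {name: samples_per_second(run) for name, run in runs.items()}

for name, value in throughput.items():
    minutes = runs[name]["runtime_s"] / 60
    print(f"{name}: {minutes:.0f} min, {value:.1f} samples/s")

ratio = throughput["bucketing"] / throughput["baseline"]
print(f"throughput ratio: {ratio:.1f}x")
```

Because `fp16` was switched on at the same time, this ratio cannot be attributed to dynamic padding and bucketing alone.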

In [None]:
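# - Is the effect the same when only 50,000 or 10,000 of the 150,000 examples are used?
# - Check the sequence length statistics
# - The metrics are too coarse and need to be revised
# - Define criteria for sampling the data (sentence length, etc.)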


SyntaxError: invalid syntax (1041892719.py, line 1)
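
One of the follow-up items above is to look at sequence length statistics. A minimal sketch of how that could be done from a list of token counts, using only the standard library (the function name `length_summary` and the sample lengths are hypothetical; in the notebook the lengths could come from `len(example['input_ids'])` for each tokenized example):

```python
import math
import statistics


def length_summary(lengths):
    """Summarize token counts: min, mean, median, 95th percentile, max."""
    ordered = sorted(lengths)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of data at or below it.
        rank = max(math.ceil(p * len(ordered) / 100), 1)
        return ordered[rank - 1]

    return {
        "min": ordered[0],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": percentile(95),
        "max": ordered[-1],
    }


# Made-up token counts for ten reviews.
sample_lengths = [5, 40, 8, 35, 6, 38, 7, 42, 12, 20]
summary = length_summary(sample_lengths)
print(summary)
```

A large gap between the median and the maximum would suggest that fixed-length padding wastes a lot of computation and that bucketing has more room to help.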