# 320. Fine-Tuning a Hugging Face BERT Model on a Custom Dataset

- Fine-tune a transformers BERT model on the NAVER movie review dataset

- Fine-tuning uses PyTorch and `Trainer` (the PyTorch version is more stable than the TensorFlow one)

In [1]:
!pip install transformers[torch]


In [3]:
from transformers import BertTokenizer
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import torch.nn.functional as F
import tensorflow as tf
import pandas as pd

In [4]:
DATA_TRAIN_PATH = tf.keras.utils.get_file("ratings_train.txt",
                     "https://raw.github.com/ironmanciti/Infran_NLP/master/data/naver_movie/ratings_train.txt")
DATA_TEST_PATH = tf.keras.utils.get_file("ratings_test.txt",
                    "https://raw.github.com/ironmanciti/Infran_NLP/master/data/naver_movie/ratings_test.txt")

Downloading data from https://raw.github.com/ironmanciti/NLP_lecture/master/data/naver_movie/ratings_train.txt
Downloading data from https://raw.github.com/ironmanciti/NLP_lecture/master/data/naver_movie/ratings_test.txt


### Train Set

In [5]:
train_data = pd.read_csv(DATA_TRAIN_PATH, delimiter='\t')
print(train_data.shape)
train_data.head()

(150000, 3)


| | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |


In [6]:
train_data.dropna(inplace=True)
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 149995 entries, 0 to 149999
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        149995 non-null  int64 
 1   document  149995 non-null  object
 2   label     149995 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.6+ MB


### Test Set

In [7]:
test_data = pd.read_csv(DATA_TEST_PATH, delimiter='\t')
print(test_data.shape)
test_data.head()

(50000, 3)


| | id | document | label |
|---|---|---|---|
| 0 | 6270596 | 굳 ㅋ | 1 |
| 1 | 9274899 | GDNTOPCLASSINTHECLUB | 0 |
| 2 | 8544678 | 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아 | 0 |
| 3 | 6825595 | 지루하지는 않은데 완전 막장임... 돈주고 보기에는.... | 0 |
| 4 | 6723715 | 3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠?? | 0 |


In [8]:
test_data.dropna(inplace=True)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49997 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        49997 non-null  int64 
 1   document  49997 non-null  object
 2   label     49997 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.5+ MB


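- To shorten training time, sample only 1/10 of the data. Expect training to take on the order of 6 to 15 minutes depending on hardware (the run logged below took about 15 minutes).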

In [9]:
df_train = train_data.sample(n=15000, random_state=1)
df_test = test_data.sample(n=5000, random_state=1)
print(df_train.shape)
print(df_test.shape)

(15000, 3)
(5000, 3)


In [10]:
df_train['label'].value_counts()

label
0    7524
1    7476
Name: count, dtype: int64

In [11]:
X_train = df_train['document'].values.tolist()
y_train = df_train['label'].values.tolist()

X_test = df_test['document'].values.tolist()
y_test = df_test['label'].values.tolist()

## Loading the pre-trained BERT model
### Loading the tokenizer
- Tokenize the text. Load the pre-trained tokenizer for the multilingual version of BERT.

In [12]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')


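Use the pre-trained tokenizer to tokenize the train set and the test set. The tokenizer returns three fields:

- Input IDs: token indices, the numerical representation of the tokens that make up the sequence fed to the model
- Token Type IDs: used for tasks on a pair of sequences, such as sentence-pair classification or question answering. `0` marks tokens of the first sequence and `1` marks tokens of the second.
- Attention mask: `1` marks a value the model should attend to and `0` marks a padded value.

For a sentence pair, the token layout and its token type IDs look like this:
```
[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
ex) [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```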
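To make the token type IDs and the attention mask concrete, here is a minimal hand-rolled sketch of how they line up for a padded sentence pair. It does not call the real tokenizer: the helper `build_pair_features`, the pre-split word tokens, and `max_len` are hypothetical and chosen only for illustration.

```python
def build_pair_features(tokens_a, tokens_b, max_len, pad_token="[PAD]"):
    """Illustrative only: mimic the layout of token_type_ids and attention_mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence pair is longer than max_len")

    # Segment 0 covers [CLS] + sequence A + first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens get 1 in the attention mask.
    attention_mask = [1] * len(tokens)

    # Pad everything to max_len; padded positions get mask 0 and segment 0.
    n_pad = max_len - len(tokens)
    tokens = tokens + [pad_token] * n_pad
    token_type_ids = token_type_ids + [0] * n_pad
    attention_mask = attention_mask + [0] * n_pad

    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


features = build_pair_features(["Where", "is", "it", "?"], ["In", "NYC"], max_len=12)
for name, values in features.items():
    print(name, values)
```

The movie reviews in this notebook are single sentences, so their token type IDs are all `0`; only the attention mask varies with padding.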

In [13]:
train_encodings = tokenizer(X_train, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)

In [14]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [15]:
print(train_encodings['input_ids'][0])
print(train_encodings['attention_mask'][0])
print(train_encodings['token_type_ids'][0])

[101, 113, 9926, 34907, 20626, 58931, 24974, 122, 114, 9532, 25503, 12030, 28911, 9367, 19855, 47869, 9682, 9634, 21386, 136, 8924, 11261, 119351, 12605, 20308, 12453, 117, 9792, 73352, 21876, 20173, 9294, 36553, 11287, 52560, 9391, 11664, 9640, 18784, 12030, 12508, 9304, 12508, 19709, 9684, 52560, 10892, 8932, 118651, 14523, 48549, 119, 8905, 119377, 11102, 117, 9604, 78123, 11102, 117, 9684, 89523, 42769, 15387, 9792, 73352, 21876, 20173, 47058, 8982, 28188, 11664, 9294, 36553, 11287, 52560, 9597, 10530, 19709, 9792, 73352, 21876, 100698, 11018, 9670, 14871, 15387, 9637, 12945, 22333, 43022, 113, 9069, 18227, 114, 63783, 9641, 42337, 14801, 119, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

### Convert encodings to Tensors

- Convert the labels and encodings into a Dataset object, using PyTorch.

- In PyTorch, this is done by subclassing `torch.utils.data.Dataset` and implementing `__len__` and `__getitem__`.

- In TensorFlow, the input encodings and labels are passed to the `from_tensor_slices` constructor method. (unstable)

In [16]:
import torch

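class IMDbDataset(torch.utils.data.Dataset):
    """Map-style dataset wrapping tokenizer encodings and optional labels.

    Despite the IMDb name, it holds the NAVER movie review encodings here.
    """
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One sample: every encoding field (input_ids, ...) converted to a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Labels are absent at inference time, so attach them only when provided
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = IMDbDataset(train_encodings, y_train)
test_dataset = IMDbDataset(test_encodings, y_test)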

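Now that the datasets are ready, the model can be fine-tuned with 🤗 `Trainer` or with native PyTorch/TensorFlow. See [training](https://huggingface.co/transformers/training.html).

- Training warmup steps:

    - This generally means using a very low learning rate for a set number of training steps (the warmup steps). After the warmup steps, the "regular" learning rate or a learning-rate scheduler takes over. The learning rate can also be increased gradually over the warmup steps.

- weight_decay: weight decay, a regularizer closely related to L2 regularization (with `Trainer`'s default AdamW optimizer it is applied as decoupled weight decay).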
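The warmup behaviour described above can be sketched as a small function. This is a simplified illustration, not the library's code: it assumes a linear warmup followed by a linear decay to zero, a base learning rate of `5e-5`, and `total_steps=3750` (the step count of the training run in this notebook). The function name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3750):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at the last step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 2125, 3750):
    print(step, linear_warmup_lr(step))
```

With `warmup_steps=500`, the first 500 of the 3750 updates run below the base learning rate, which keeps the freshly initialized classifier head from causing large early updates to the pre-trained weights.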

In [17]:
training_args = TrainingArguments(
    output_dir='./results',          # directory where outputs are saved
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=16,   # batch size per device during evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory where logs are saved
    logging_steps=10,                # log the training loss every 10 steps
)

### Train the model

In [18]:
import time

model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

s = time.time()

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|---|---|
| 10 | 0.7025 |
| 20 | 0.6923 |
| 30 | 0.7038 |
| 40 | 0.6876 |
| 50 | 0.6874 |
| 60 | 0.6855 |
| 70 | 0.7059 |
| 80 | 0.6598 |
| 90 | 0.6825 |
| 100 | 0.6768 |


TrainOutput(global_step=3750, training_loss=0.5625850903828938, metrics={'train_runtime': 898.4771, 'train_samples_per_second': 33.39, 'train_steps_per_second': 4.174, 'total_flos': 1896249598200000.0, 'train_loss': 0.5625850903828938, 'epoch': 2.0})
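The numbers in `TrainOutput` can be checked by hand from the sample count, batch size, and epoch count. The short calculation below is a sanity check under the assumption of a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

num_samples = 15000      # rows sampled into df_train
batch_size = 8           # per_device_train_batch_size
epochs = 2               # num_train_epochs
train_runtime = 898.4771 # seconds, taken from the TrainOutput above

# Optimizer steps: one per batch, repeated for every epoch
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Throughput figures as reported in the metrics dictionary
samples_per_second = num_samples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 2), round(steps_per_second, 3))
```

The result matches `global_step=3750`, which also means the 500 warmup steps cover a little over 13% of the run.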

In [19]:
print("Elapsed time: {:.2f} min".format((time.time() - s)/60))

Elapsed time: 14.98 min


In [20]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.4893137812614441,
 'eval_runtime': 34.0236,
 'eval_samples_per_second': 146.957,
 'eval_steps_per_second': 9.199,
 'epoch': 2.0}
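The evaluation output above reports only the loss. One common way to get accuracy from `trainer.evaluate()` as well is to pass a `compute_metrics` function to `Trainer`. The sketch below is an assumption about how that could look here, not part of the original notebook: the name `compute_accuracy` and the stand-in class `_FakeEvalPrediction` are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Turn the logits and labels collected during evaluation into accuracy."""
    logits = eval_pred.predictions   # shape: (num_samples, num_labels)
    labels = eval_pred.label_ids     # shape: (num_samples,)
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}


class _FakeEvalPrediction:
    """Stand-in with the two attributes used above, for a quick offline check."""
    predictions = np.array([[0.5, -0.7], [-1.0, 1.3], [0.1, 0.3]])
    label_ids = np.array([0, 1, 0])


metrics = compute_accuracy(_FakeEvalPrediction())
print(metrics)
```

In the notebook it would be wired in as `Trainer(..., compute_metrics=compute_accuracy)`, after which the evaluation dictionary should also contain an `eval_accuracy` entry.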

In [21]:
prediction = trainer.predict(test_dataset)

The fine-tuned model returns logits, one per class, rather than probabilities.

In [22]:
trainer.model.classifier

Linear(in_features=768, out_features=2, bias=True)

In [23]:
y_logit = torch.tensor(prediction[0])
y_logit[:10]

tensor([[ 0.5371, -0.7004],
        [ 0.8602, -1.1240],
        [ 0.0580,  0.3154],
        [ 1.0007, -1.2351],
        [-0.9990,  1.2586],
        [ 0.9937, -1.2316],
        [-0.0258,  0.4150],
        [ 0.9588, -1.2089],
        [-1.0017,  1.2621],
        [ 0.9329, -1.1884]])

In [24]:
y_pred = F.softmax(y_logit, dim=-1).argmax(axis=1).numpy()
print(list(y_pred[:30]))
print(y_test[:30])

[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
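The step from logits to class predictions can be reproduced in plain Python. The sketch below uses three of the logit rows printed earlier and a hand-written `softmax` helper (hypothetical, standing in for `F.softmax`). It also illustrates that softmax preserves the ordering of the logits, so taking `argmax` directly on the logits gives the same labels.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Rows 0, 2 and 4 of the logits printed earlier in the notebook
rows = [[0.5371, -0.7004], [0.0580, 0.3154], [-0.9990, 1.2586]]

probs = [softmax(r) for r in rows]
pred_from_probs = [p.index(max(p)) for p in probs]
pred_from_logits = [r.index(max(r)) for r in rows]

print(probs)
print(pred_from_probs, pred_from_logits)
```

The probabilities are still useful when a confidence score is needed, for example to flag reviews whose two class probabilities are nearly equal.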


In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score

print(accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test, y_pred)
cm

0.7814


array([[1898,  597],
       [ 496, 2009]])
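The accuracy can be recovered from the confusion matrix itself, along with precision and recall for the positive class. The calculation below relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix from the cell above: rows = true label, columns = predicted label
cm = [[1898, 597],
      [496, 2009]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The two error counts (597 false positives, 496 false negatives) are of similar size, so the model is not strongly biased toward either class.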

In [26]:
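x = '돈주고 보기에는 아까운 영화 ㅠㅠ...'   # "A movie not worth paying to see"
# x = '정말 재미있는 영화'              # "A really fun movie"
tokenized = tokenizer([x], truncation=True, padding=True)
# No labels are passed, so the dataset yields inputs only
pred = trainer.predict(IMDbDataset(tokenized))

logit = torch.tensor(pred[0])
result = F.softmax(logit, dim=-1).argmax(1).numpy()
"positive" if result == 1 else "negative"

'negative'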

# Next Step
Fine-tune on the full dataset of 200,000 reviews.