# RoBERTa를 이용한 한국어 자연어추론(NLI)
- 사전학습 모델 : KLUE-RoBERTa (MODU, CC-100-Kor, NAMUWIKI, NEWSCRAWL, PETITION)
- 데이터 : KLUE-NLI (WIKITREE, POLICY, WIKINEWS, WIKIPEDIA, NSMC and AIRBNB)

# 사전 준비

In [1]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
[K     |████████████████████████████████| 4.4 MB 5.2 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
[K     |████████████████████████████████| 101 kB 5.0 MB/s 
[?25hCollecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 24.4 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 70.8 MB/s 
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstall

**KLUE-NLI 데이터 불러오기**

In [2]:
from datasets import load_dataset

datasets = load_dataset("klue", "nli")

Downloading builder script:   0%|          | 0.00/5.21k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.93k [00:00<?, ?B/s]

Downloading and preparing dataset klue/nli (download: 1.20 MiB, generated: 6.10 MiB, post-processed: Unknown size, total: 7.30 MiB) to /root/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...


Downloading data:   0%|          | 0.00/1.26M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/24998 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Dataset klue downloaded and prepared to /root/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
datasets

DatasetDict({
    train: Dataset({
        features: ['guid', 'source', 'premise', 'hypothesis', 'label'],
        num_rows: 24998
    })
    validation: Dataset({
        features: ['guid', 'source', 'premise', 'hypothesis', 'label'],
        num_rows: 3000
    })
})

In [4]:
# label 0: entailment(함의) / 1: neutral(중립) / 2: contradiction(모순)
print(datasets["train"][0])
print(datasets["validation"][0])

{'guid': 'klue-nli-v1_train_00000', 'source': 'NSMC', 'premise': '힛걸 진심 최고다 그 어떤 히어로보다 멋지다', 'hypothesis': '힛걸 진심 최고로 멋지다.', 'label': 0}
{'guid': 'klue-nli-v1_dev_00000', 'source': 'airbnb', 'premise': '흡연자분들은 발코니가 있는 방이면 발코니에서 흡연이 가능합니다.', 'hypothesis': '어떤 방에서도 흡연은 금지됩니다.', 'label': 2}


**KLUE-RoBERTa 모델과 토크나이저 불러오기**

In [5]:
from transformers import AutoModel, AutoTokenizer

roberta_model = AutoModel.from_pretrained("klue/roberta-base")
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")

Downloading:   0%|          | 0.00/546 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/422M [00:00<?, ?B/s]

Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for

Downloading:   0%|          | 0.00/375 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/243k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/734k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/173 [00:00<?, ?B/s]

In [6]:
roberta_model.config

RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "BertTokenizer",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32000
}

# 토크나이징, 데이터 구축

**스페셜 토큰 확인**

In [7]:
for i in range (10):
    print("index : ",i," =  tokens : ",tokenizer.decode(i))

index :  0  =  tokens :  [CLS]
index :  1  =  tokens :  [PAD]
index :  2  =  tokens :  [SEP]
index :  3  =  tokens :  [UNK]
index :  4  =  tokens :  [MASK]
index :  5  =  tokens :  !
index :  6  =  tokens :  "
index :  7  =  tokens :  #
index :  8  =  tokens :  $
index :  9  =  tokens :  %


**[CLS] 전제 [SEP] 가설 [SEP] [PAD]...**

In [8]:
import torch
from torch.utils.data import Dataset

In [9]:
class NLIDataset(Dataset):
    def __init__(self, data, max_len=64):  # 데이터셋의 전처리를 해주는 부분
        self._data = data
        self.max_len = max_len
        self.bos = tokenizer.bos_token      # [CLS]
        self.eos = tokenizer.eos_token      # [SEP]
        self.pad = tokenizer.pad_token      # [PAD]
        self.sep = tokenizer.sep_token      # [SEP]
        self.tokenizer = tokenizer
    
    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):  # 로드한 데이터를 차례차례 DataLoader로 넘겨주는 메서드
        index = self._data[idx]

        p = index["premise"]  # 전제
        p_toked = self.tokenizer.tokenize(self.bos + p + self.sep)      # [CLS] 전제 [SEP]
        p_len = len(p_toked)

        h = index["hypothesis"]  # 가설
        h_toked = self.tokenizer.tokenize(h + self.eos)      # 가설 [SEP]
        h_len = len(p_toked)

        # 전제 + 가설 길이가 최대길이보다 클때
        if p_len + h_len > self.max_len:
            h_len = self.max_len - p_len        # 가설의 길이 = 최대길이 - 전제길이

            if p_len <= 0:       # 전제의 길이가 너무 길어 전제만으로 최대 길이를 초과 한다면
                p_toked = p_toked[-(int(self.max_len / 2)) :]   # 전제길이를 최대길이의 반으로 
                p_len = len(p_toked)
                h_len = self.max_len - p_len              # 답변의 길이를 최대길이 - 전제길이
                
            h_toked = h_toked[:h_len]
            h_len = len(h_toked)

        # 전제 + 가설 토큰을 index로 변환   
        token_ids = self.tokenizer.convert_tokens_to_ids(p_toked + h_toked)

        # 최대 길이만큼 padding
        while len(token_ids) < self.max_len:
            token_ids += [self.tokenizer.pad_token_id]

        # attention_mask(어텐션마스크) = 전제 + 가설 길이 1 + 나머지(패딩) 0
        attention_mask = [1]*(p_len + h_len) + [0]*(self.max_len - p_len - h_len)

        # label = 0: entailment(함의) / 1: neutral(중립) / 2: contradiction(모순)
        label = index["label"]

        # 전제+가설 + 답변, 어텐션마스크, label
        return (token_ids, attention_mask, label)

**데이터셋 구축** <br>
구성 : (token_ids, attention_mask, token_type_ids, label)

In [10]:
# 훈련 데이터셋
train_dataset = NLIDataset(datasets["train"])

for n in range(5):
    print("train_dataset[",n,"]")
    print("token_ids      : ", train_dataset[n][0])
    print("attention_mask : ", train_dataset[n][1])
    print("label          : ", train_dataset[n][2],"\n")

train_dataset[ 0 ]
token_ids      :  [0, 3, 7254, 3841, 2062, 636, 3711, 12717, 2178, 2062, 11980, 2062, 2, 3, 7254, 3841, 2200, 11980, 2062, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask :  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label          :  0 

train_dataset[ 1 ]
token_ids      :  [0, 3911, 2377, 2366, 1521, 3061, 4785, 1282, 2955, 3308, 3515, 2170, 22, 2532, 5675, 2, 3911, 2377, 2366, 1525, 2062, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask :  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [11]:
# 검증 데이터셋
val_dataset = NLIDataset(datasets["validation"])

for n in range(5):
    print("val_dataset[",n,"]")
    print("token_ids      : ", val_dataset[n][0])
    print("attention_mask : ", val_dataset[n][1])
    print("label          : ", val_dataset[n][2],"\n")

val_dataset[ 0 ]
token_ids      :  [0, 25313, 2377, 2031, 2073, 20812, 2116, 1513, 2259, 1129, 24094, 20812, 27135, 9753, 2052, 3662, 11800, 18, 2, 3711, 1129, 27135, 2119, 9753, 2073, 5040, 3598, 3606, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask :  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label          :  2 

val_dataset[ 1 ]
token_ids      :  [0, 3633, 2211, 2052, 3655, 3704, 31302, 5153, 2530, 4087, 4671, 2371, 2062, 18, 2, 3633, 2211, 2052, 3655, 3704, 31302, 5153, 2530, 2052, 1039, 2886, 2062, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask :  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

**데이터로더 구축**

In [12]:
# collate_fn 구성
def collate_batch(batch):
    token_ids = [item[:][0] for item in batch]
    attention_mask = [item[:][1] for item in batch]
    label_ids = [item[:][2] for item in batch]

    return torch.LongTensor(token_ids), torch.LongTensor(attention_mask), torch.LongTensor(label_ids)

In [13]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn = collate_batch, batch_size=8)
val_dataloader = DataLoader(val_dataset, collate_fn = collate_batch, batch_size=16)

In [14]:
# 데이터로더 확인
sample_data = iter(train_dataloader)
sample_ids = next(sample_data)

token_ids, attention_mask, label_ids = sample_ids

print("first item of batch (train_dataloader)")
print("token_ids \n", token_ids[:][0],"batch size : ", token_ids.size(),"\n")
print("attention_mask \n", attention_mask[:][0], "batch size : ", attention_mask.size(),"\n")
print("label_ids \n", label_ids[:][0], "batch size : ", label_ids.size())

first item of batch (train_dataloader)
token_ids 
 tensor([    0,  4999,  2173,  2211,  2079,  4498,  2116,  1652,  2207,  1174,
         3415,  2466,  2118,  2170,  3844, 19521,  1513,  4007,    16,  3768,
           22,  2211,  2079,  4498,  2116,  1652,   627,  2079,  1644,  1508,
         2015,  2138,  4994,  2097,  1131,  2456,  2444,  5332,  2170,  2318,
         7789,  2069,  3750,  2371,  2062,    18,     2,  9701,  2073,  4498,
         5714, 27135,  3750,  2496,  2359,  2062,    18,     2,     1,     1,
            1,     1,     1,     1]) batch size :  torch.Size([8, 64]) 

attention_mask 
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]) batch size :  torch.Size([8, 64]) 

label_ids 
 tensor(1) batch size :  torch.Size([8])


# 모델 학습

**모델 정의**

In [15]:
# RoBERTa를 포함한 신경망 모형
class NLIModel(torch.nn.Module):
    def __init__(self, pretrained_model, token_size, num_labels): 
        super(NLIModel, self).__init__()
        
        self.token_size = token_size
        self.num_labels = num_labels
        self.pretrained_model = pretrained_model

        # 분류기 정의
        self.classifier = torch.nn.Linear(self.token_size, self.num_labels)

    def forward(self, input_ids, attention_mask):
        # BERT 모형에 입력을 넣고 출력을 받음
        outputs = self.pretrained_model(input_ids, attention_mask)
        # BERT 출력에서 CLS 토큰에 해당하는 부분만 가져옴
        bert_clf_token = outputs.last_hidden_state[:,0,:]
        # 3개의 라벨로 분류
        outputs = self.classifier(bert_clf_token)

        return outputs

# token_size는 BERT 토큰과 동일
model = NLIModel(roberta_model, token_size=roberta_model.config.hidden_size, num_labels=3)

**파라미터 설정**

In [16]:
from transformers import get_linear_schedule_with_warmup
import torch.nn.functional as F
import time

# GPU 가속을 사용할 수 있으면 device를 cuda로 설정하고, 아니면 cpu로 설정
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# 옵티마이저 AdamW로 설정
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01) # 가중치 감쇠 설정
criterion = torch.nn.CrossEntropyLoss()    # 멀티클래스이므로 크로스 엔트로피를 손실함수로 사용 -> RoBERTa 코드 내 포함되어있음

num_epochs = 3      # 학습 epoch를 3회로 설정

total_training_steps = num_epochs * len(train_dataloader)
# 학습 스케줄러 설정
scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
                                            num_training_steps=total_training_steps,
                                            num_warmup_steps=200)

step = 0
eval_steps = 500

In [17]:
model.to(device)  
model.train()     # 학습모드

NLIModel(
  (pretrained_model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm

**학습 진행**

In [18]:
# GPU 캐시 비우기 (GPU 메모리 확보)
torch.cuda.empty_cache()

In [21]:
from tqdm import tqdm

for epoch in range(num_epochs):
    train_loss = 0
    avg_loss = 0.0
    
    for batch_idx, samples in enumerate(tqdm(train_dataloader)):    # Error: expected sequence of length 64 at dim 1 (got 67)
        optimizer.zero_grad()       # optimizer 초기화(Gradient)

        # 모델 입력 텐서 GPU에 올리기
        token_ids, attention_mask, label_ids = samples

        token_ids = token_ids.to(device)
        attention_mask = attention_mask.to(device)
        label_ids = label_ids.to(device)

        out = model(
            input_ids=token_ids,
            attention_mask=attention_mask,
            )

        out.argmax(dim=1)

        loss = criterion(out, label_ids)
        loss.backward()
        optimizer.step()

        avg_loss += loss.item()
        
        step += 1
        if step % eval_steps == 0:  # eval_steps 마다 loss를 출력
            with torch.no_grad():   # 학습 X (그래디언트 계산 X)
                val_loss = 0
                model.eval()        # 평가모드로 전환

                for batch_idx, samples in enumerate(tqdm(val_dataloader)):

                    token_ids, attention_mask, label_ids = samples

                    token_ids = token_ids.to(device)
                    attention_mask = attention_mask.to(device)
                    label_ids = label_ids.to(device)
                    
                    out = model(
                        input_ids=token_ids,
                        attention_mask=attention_mask,
                        )

                    out.argmax(dim=1)

                    loss = criterion(out, label_ids)  
                    val_loss += loss

                avg_val_loss = val_loss / len(val_dataloader)

            avg_train_loss = train_loss / eval_steps    # eval_steps의 평균 loss 계산
            
            print('Step %d, train loss: %.4f, validation loss: %.4f' 
                  % (step, avg_train_loss, avg_val_loss))

 11%|█▏        | 354/3125 [00:43<05:42,  8.09it/s]


ValueError: ignored