## 1. Load the Data

Training uses only 20,000 samples drawn from the test split of the NSMC dataset provided by Korpora.

In [2]:
import numpy as np
import pandas as pd
from Korpora import Korpora

corpus = Korpora.load("nsmc")
df = pd.DataFrame(corpus.test).sample(20000, random_state=42)

# Split into train : val : test = 6 : 2 : 2
train_df, val_df, test_df = np.split(
    df.sample(frac=1, random_state=42), [int(0.6 * len(df)), int(0.8 * len(df))]
)

# Check the result
print(train_df.head(5).to_markdown())
print(f"train: {len(train_df)}, val: {len(val_df)}")
print(f"test: {len(test_df)}")


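    Korpora only provides a convenient way to download and use
    corpora that others have shared for research purposes.

    We thank everyone who shared these corpora, and share the description and license of each corpus.
    If you would like to learn more about this corpus, see the description below;
    if you use it for research or commercial purposes, please refer to the license below.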

    # Description
    Author : e9t@github
    Repository : https://github.com/e9t/nsmc
    References : www.lucypark.kr/docs/2015-pyconkr/#39

    Naver sentiment movie corpus v1.0
    This is a movie review dataset in the Korean language.
    Reviews were scraped from Naver Movies.

    The dataset construction is based on the method noted in
    [Large movie review dataset][^1] from Maas et al., 2011.

    [^1]: http://ai.stanford.edu/~amaas/data/sentiment/

    # License
    CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
    Details in https://creativecommons.org/publicdomain/zero/1.0/

[Korpora] Corpus `nsmc` is already installed at /home/ho/Korpora/nsmc/ratings_train.txt
[Korpora] Corpus `nsmc` is already installed at /home/ho/Korpora/n
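
The 6 : 2 : 2 split used above can be checked on a toy frame. The sketch below swaps `np.split` for plain `iloc` slicing, since recent pandas releases may emit a deprecation warning when `np.split` is applied to a DataFrame. The helper name `split_frame` and the toy columns are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def split_frame(df, train_end_frac=0.6, val_end_frac=0.8, seed=42):
    """Shuffle df and cut it by position into train / val / test (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    train_end = int(train_end_frac * len(shuffled))
    val_end = int(val_end_frac * len(shuffled))
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:val_end],
        shuffled.iloc[val_end:],
    )

# Toy data standing in for the NSMC frame (made-up rows).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})
train_part, val_part, test_part = split_frame(toy)
print(len(train_part), len(val_part), len(test_part))
```

The boundaries are cumulative fractions (0.6 and 0.8), mirroring the indices passed to `np.split` in the cell above.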

## 2. Dataset & DataLoader

Functions that wrap the loaded NSMC data so that it can be read in batches through a DataLoader.

In [2]:
import torch
from torch.utils.data import TensorDataset, DataLoader

def make_dataset(data, tokenizer, device):
    tokenized = tokenizer(
        text = data.text.to_list(),
        padding="longest", # pad to the longest sequence passed in this call (here, the whole split)
        truncation=True, # truncate to the maximum length the model can handle (512 for BERT)
        return_tensors="pt"
    )

    input_ids = tokenized["input_ids"].to(device)
    att_masks = tokenized["attention_mask"].to(device)
    labels = torch.tensor(data.label.values, dtype=torch.long).to(device)
    
    return TensorDataset(input_ids, att_masks, labels)

def get_dataloader(dataset, sampler, batch_size):
    data_sampler = sampler(dataset)
    data_loader = DataLoader(dataset, sampler=data_sampler, batch_size=batch_size)

    return data_loader
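
As a minimal sketch of how `get_dataloader` behaves, the example below runs it on toy tensors instead of tokenizer output. The tensor values and variable names are made up for illustration; only the helper itself mirrors the notebook. Note that the sampler is passed as a class and instantiated inside the helper.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Toy stand-ins for tokenizer output: 10 sequences of 4 token ids each.
input_ids = torch.arange(40).reshape(10, 4)
att_masks = torch.ones_like(input_ids)
labels = torch.tensor([0, 1] * 5)
toy_dataset = TensorDataset(input_ids, att_masks, labels)

def get_dataloader(dataset, sampler, batch_size):
    # Same shape as the notebook's helper: `sampler` is a sampler class.
    return DataLoader(dataset, sampler=sampler(dataset), batch_size=batch_size)

seq_loader = get_dataloader(toy_dataset, SequentialSampler, batch_size=4)
rand_loader = get_dataloader(toy_dataset, RandomSampler, batch_size=4)

# 10 samples in batches of 4, so the final batch is smaller.
batch_shapes = [tuple(ids.shape) for ids, _, _ in seq_loader]
print(batch_shapes)

# The random loader visits every sample exactly once, in shuffled order.
seen_first_ids = sorted(int(i) for ids, _, _ in rand_loader for i in ids[:, 0])
print(seen_first_ids)
```

`RandomSampler` suits the training split, while `SequentialSampler` keeps validation and test batches in a fixed order.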

## 3. Check the Network Architecture

A function for inspecting the network structure of the BERT and GPT2 models used in the experiments below.

In [4]:
def check_network(net):
    for main_name, main_module in net.named_children():
        print(main_name)

        for sub_name, sub_module in main_module.named_children():
            print("- ", sub_name)

            for ssub_name, ssub_module in sub_module.named_children():
                print("|    - ", ssub_name)

                for sssub_name, _ in ssub_module.named_children():
                    print("|    |   - ", sssub_name)
    
    # Note: this is the name of the last top-level child visited (e.g. "classifier" for BERT)
    return main_name
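
The nested loops above walk `named_children()` four levels deep. A recursive variant with a depth limit is sketched below on a small toy network; `list_children` and the toy module names are hypothetical and only illustrate the traversal.

```python
import torch.nn as nn

def list_children(module, max_depth=2, depth=0):
    """Return (depth, name) pairs for nested child modules (hypothetical helper)."""
    rows = []
    for name, child in module.named_children():
        rows.append((depth, name))
        # Descend only while the next level is still within max_depth.
        if depth + 1 < max_depth:
            rows.extend(list_children(child, max_depth, depth + 1))
    return rows

# Toy network: an "encoder" block with two layers, then a "classifier" layer.
toy_net = nn.Sequential()
toy_net.add_module("encoder", nn.Sequential(nn.Linear(4, 8), nn.ReLU()))
toy_net.add_module("classifier", nn.Linear(8, 2))

rows = list_children(toy_net)
for depth, name in rows:
    print("    " * depth + "- " + name)
```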

## 4. Train & Test Function

Code for training and testing the models.

In [5]:
import numpy as np
import torch.nn as nn
from tqdm.notebook import tqdm

def calc_accuracy(preds, labels):
    pred_flatten = np.argmax(preds, axis=1).flatten()
    labels_flatten = labels.flatten()

    return np.sum(pred_flatten == labels_flatten) / len(labels_flatten)

def train(net, optimizer, data_loader):
    net.train()
    train_loss = 0.0

    for input_ids, att_mask, labels in tqdm(data_loader):
        optimizer.zero_grad()

        outputs = net(
            input_ids=input_ids,
            attention_mask=att_mask,
            labels=labels
        )

        loss = outputs.loss
        train_loss += loss.item()
        
        loss.backward()
        optimizer.step()
    
    train_loss /= len(data_loader)
    return train_loss

def eval(net, data_loader):
    net.eval()
    criterion = nn.CrossEntropyLoss()
    val_loss, val_acc = 0.0, 0.0

    with torch.no_grad():
        for input_ids, att_mask, labels in tqdm(data_loader):
            outputs = net(
                input_ids=input_ids,
                attention_mask=att_mask,
                labels=labels
            )
            
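            logits = outputs.logits
            loss = criterion(logits, labels)

            # Move to CPU NumPy arrays before computing accuracy
            logits = logits.detach().cpu().numpy()
            labels = labels.to("cpu").numpy()
            acc = calc_accuracy(logits, labels)

            # .item() keeps the running loss a plain Python float
            val_loss += loss.item()
            val_acc += acc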

    val_loss /= len(data_loader)
    val_acc /= len(data_loader)

    return val_loss, val_acc
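
A small worked example of `calc_accuracy`, using made-up logits and labels, shows what the function returns. The value is a fraction in [0, 1], so it needs to be multiplied by 100 before being printed with a `%` sign.

```python
import numpy as np

def calc_accuracy(preds, labels):
    # Predicted class = index of the largest logit in each row.
    pred_flatten = np.argmax(preds, axis=1).flatten()
    labels_flatten = labels.flatten()
    return np.sum(pred_flatten == labels_flatten) / len(labels_flatten)

# Four toy samples with two classes each (illustrative values only).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]])
labels = np.array([0, 1, 1, 1])

acc = calc_accuracy(logits, labels)
print(f"{acc:.2f} as a fraction, {acc * 100:.2f}% as a percentage")
```

Here the predicted classes are `[0, 1, 0, 1]`, so three of the four samples match their labels.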

In [6]:
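def test(net, main_name, test_loader):
    # GPT-2 has no padding token by default; models that define one (e.g. BERT) are left untouched
    if net.config.pad_token_id is None:
        net.config.pad_token_id = net.config.eos_token_id
    net.load_state_dict(torch.load(f"./pth/{main_name}.pth"))

    test_loss, test_acc = eval(net, test_loader)
    # eval returns accuracy as a fraction in [0, 1]
    print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc * 100:.2f}%")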


## 5. BERT

Using BERT, compare a model fine-tuned from self-supervised pretrained weights (transfer learning) with one trained from scratch.

### 5-1. Hyperparameter Settings

In [20]:
EPOCHS = 5
BATCH_SIZE = 32
LR = 1e-5
EPS = 1e-8

device = "cuda" if torch.cuda.is_available() else "cpu"
pretrained = "bert-base-multilingual-cased"

### 5-2. Load the DataLoaders

The tokenizer always uses the pretrained vocabulary, so that the only variable in the comparison is whether the model weights are pretrained.

In [30]:
from transformers import AutoTokenizer
from torch.utils.data import RandomSampler, SequentialSampler

def prepare_data(pretrained):
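    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=pretrained,
        do_lower_case=False
    )
    # GPT-2 ships without a padding token; reuse EOS so padding="longest" works
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token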

    train_dataset = make_dataset(train_df, tokenizer, device)
    train_loader = get_dataloader(train_dataset, RandomSampler, BATCH_SIZE)

    val_dataset = make_dataset(val_df, tokenizer, device)
    val_loader = get_dataloader(val_dataset, SequentialSampler, BATCH_SIZE)

    test_dataset = make_dataset(test_df, tokenizer, device)
    test_loader = get_dataloader(test_dataset, SequentialSampler, BATCH_SIZE)

    return train_loader, val_loader, test_loader

train_loader, val_loader, test_loader = prepare_data(pretrained)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
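
The traceback above is what a tokenizer without a padding token raises when asked to pad. The downloaded `vocab.json` and `merges.txt` files suggest that `pretrained` was set to `"gpt2"` when this cell ran. Following the hint in the error message, a minimal guard could look like the sketch below. `ensure_pad_token` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins that mimic only the attributes used here; real tokenizers come from `AutoTokenizer.from_pretrained(...)`.

```python
from types import SimpleNamespace

def ensure_pad_token(tokenizer):
    """Reuse the EOS token for padding when no pad token is defined (hypothetical helper)."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

# Stand-in objects, not real tokenizers: one without a pad token, one with.
gpt2_like = SimpleNamespace(pad_token=None, eos_token="<|endoftext|>")
bert_like = SimpleNamespace(pad_token="[PAD]", eos_token=None)

print(ensure_pad_token(gpt2_like).pad_token)  # EOS is reused as the pad token
print(ensure_pad_token(bert_like).pad_token)  # an existing pad token is kept
```

For GPT2 sequence classification, the model config also needs a `pad_token_id`, which is why `test` assigns `net.config.pad_token_id` before evaluation.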

### 5-3. Supervised Learning

In [24]:
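# Define the model (randomly initialized, no pretrained weights)
from transformers import BertConfig, BertForSequenceClassification

# Reuse the checkpoint's architecture settings (including vocab_size) so the
# model matches the pretrained tokenizer; only the config is loaded, not the weights
config = BertConfig.from_pretrained(pretrained, num_labels=2)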

net = BertForSequenceClassification(config=config).to(device)
main_name = check_network(net)

bert
-  embeddings
|    -  word_embeddings
|    -  position_embeddings
|    -  token_type_embeddings
|    -  LayerNorm
|    -  dropout
-  encoder
|    -  layer
|    |   -  0
|    |   -  1
|    |   -  2
|    |   -  3
|    |   -  4
|    |   -  5
|    |   -  6
|    |   -  7
|    |   -  8
|    |   -  9
|    |   -  10
|    |   -  11
-  pooler
|    -  dense
|    -  activation
dropout
classifier


In [25]:
# Define the optimizer
from torch.optim import AdamW
optimizer = AdamW(net.parameters(), lr=LR, eps=EPS)

In [26]:
# Training
def main():
    best_loss = 1e9
    for epoch in range(EPOCHS):
        train_loss = train(net, optimizer, train_loader)
        val_loss, val_acc = eval(net, val_loader)

        print(f"Epoch [{epoch+1}/{EPOCHS}]")
        print(f"  Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Val Accuracy: {val_acc * 100:.2f}%")
        
        if val_loss < best_loss:
            best_loss = val_loss
            torch.save(net.state_dict(), f"./pth/{main_name}.pth")

main()

In [None]:
# Test
net = BertForSequenceClassification(config=config).to(device)
test(net, main_name, test_loader)

### 5-4. Self-Supervised Learning

Note that the model structure is identical to the one above; the only difference is whether the weights are pretrained.

In [27]:
# Define the model
from transformers import AutoModelForSequenceClassification

net = AutoModelForSequenceClassification.from_pretrained(
        pretrained, num_labels=2
    ).to(device)
main_name = check_network(net)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


bert
-  embeddings
|    -  word_embeddings
|    -  position_embeddings
|    -  token_type_embeddings
|    -  LayerNorm
|    -  dropout
-  encoder
|    -  layer
|    |   -  0
|    |   -  1
|    |   -  2
|    |   -  3
|    |   -  4
|    |   -  5
|    |   -  6
|    |   -  7
|    |   -  8
|    |   -  9
|    |   -  10
|    |   -  11
-  pooler
|    -  dense
|    -  activation
dropout
classifier


In [None]:
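# Training (rebuild the optimizer so that it tracks the new model's parameters)
optimizer = AdamW(net.parameters(), lr=LR, eps=EPS)
main()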

In [None]:
# Test
net = AutoModelForSequenceClassification.from_pretrained(
        pretrained, num_labels=2
    ).to(device)
test(net, main_name, test_loader)

## 6. GPT2

Using GPT2, compare a model fine-tuned from self-supervised pretrained weights (transfer learning) with one trained from scratch.

### 6-1. Hyperparameter Settings

In [28]:
EPOCHS = 5
BATCH_SIZE = 32
LR = 1e-5
EPS = 1e-8

device = "cuda" if torch.cuda.is_available() else "cpu"
pretrained = "gpt2"

### 6-2. Load the DataLoaders

In [None]:
train_loader, val_loader, test_loader = prepare_data(pretrained)

### 6-3. Supervised Learning

In [9]:
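# Define the model (randomly initialized, no pretrained weights)
from transformers import GPT2Config, GPT2ForSequenceClassification

config = GPT2Config(
    num_labels=2
)

net = GPT2ForSequenceClassification(config=config).to(device)
# GPT-2 defines no padding token; batched classification needs pad_token_id
net.config.pad_token_id = net.config.eos_token_id
main_name = check_network(net)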

In [1]:
# Define the optimizer
from torch.optim import AdamW
optimizer = AdamW(net.parameters(), lr=LR, eps=EPS)

NameError: name 'net' is not defined

In [None]:
# Training
main()

In [None]:
# Test
net = GPT2ForSequenceClassification(config=config).to(device)
test(net, main_name, test_loader)

### 6-4. Self-Supervised Learning

In [None]:
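# Define the model
from transformers import AutoModelForSequenceClassification

net = AutoModelForSequenceClassification.from_pretrained(
        pretrained, num_labels=2
    ).to(device)
# GPT-2 defines no padding token; batched classification needs pad_token_id
net.config.pad_token_id = net.config.eos_token_id
main_name = check_network(net)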

In [None]:
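# Training (rebuild the optimizer so that it tracks the new model's parameters)
optimizer = AdamW(net.parameters(), lr=LR, eps=EPS)
main()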

In [None]:
# Test
net = AutoModelForSequenceClassification.from_pretrained(
        pretrained, num_labels=2
    ).to(device)
test(net, main_name, test_loader)