This assignment uses a BERT model to classify BBC news articles by category. Clone the code, and study it by annotating it carefully, line by line, with comments!

## Loading and exploring the data

In [1]:
%%capture
!pip install transformers

In [2]:
import pandas as pd
import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from torch import nn
from torch.optim import Adam
from tqdm import tqdm

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
path = '/content/drive/MyDrive/KUBIG/24_w_nlp_session/Week4'

In [6]:
df = pd.read_csv(path+'/bbc-text.csv')

In [7]:
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [8]:
print(len(df))

2225


In [9]:
df.groupby('category').count()

category,text
business,510
entertainment,386
politics,417
sport,511
tech,401
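
The counts above suggest the five classes are fairly balanced. A minimal sketch that turns them into proportions and a majority-class baseline, which gives a reference point for the accuracies reported later. The dictionary simply copies the numbers printed above; `category_counts` and `majority_baseline` are names introduced here for illustration.

```python
# Class counts copied from the groupby output above.
category_counts = {
    "business": 510,
    "entertainment": 386,
    "politics": 417,
    "sport": 511,
    "tech": 401,
}

total = sum(category_counts.values())

# Share of each category in the whole dataset.
proportions = {name: count / total for name, count in category_counts.items()}

# Accuracy of a classifier that always predicts the most frequent category.
majority_baseline = max(category_counts.values()) / total

for name, share in proportions.items():
    print(f"{name:<14}{share:.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```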


## BertTokenizer

For the tokenizer, we load the BertTokenizer that ships with a pretrained BERT checkpoint. Try several variants.

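- bert-base-uncased : about 110M parameters, lowercased English text
- bert-large-cased : about 340M parameters, case-sensitive English text
- bert-base-cased : about 110M parameters, case-sensitive English text (the multilingual checkpoint is bert-base-multilingual-cased)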


In [10]:
bert_model_name = 'bert-large-cased'

tokenizer = BertTokenizer.from_pretrained(bert_model_name)

labels = {'business':0,
          'entertainment':1,
          'sport':2,
          'tech':3,
          'politics':4
          }

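
The `labels` dictionary maps category names to integer class ids. To read model predictions back as category names, the mapping can be inverted. A minimal sketch; `id_to_label` and the example predictions are hypothetical and not part of the original notebook.

```python
# Same mapping as in the cell above.
labels = {'business': 0, 'entertainment': 1, 'sport': 2, 'tech': 3, 'politics': 4}

# Invert it so an integer class id can be turned back into a category name.
id_to_label = {idx: name for name, idx in labels.items()}

predicted_ids = [4, 1, 2]  # hypothetical model predictions
decoded = [id_to_label[i] for i in predicted_ids]
print(decoded)
```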

## Dataset

In [41]:
class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):
        self.labels = [labels[label] for label in df['category']] # map each string label to its integer id
        self.texts = [tokenizer(text,
                               padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for text in df['text']] # tokenize each text, padded or truncated to 512 tokens

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y
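
The tokenizer call in `__init__` pads or truncates every article to exactly 512 token ids and returns an attention mask marking which positions hold real tokens. A simplified pure-Python sketch of that behavior follows. `pad_or_truncate` is a hypothetical helper, it assumes a pad id of 0 (consistent with the zeros in the batch shown later), and it ignores special tokens, which the real tokenizer keeps in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Simplified stand-in for padding='max_length' with truncation=True.

    Special tokens ([CLS], [SEP]) are not handled here.
    """
    kept = token_ids[:max_length]                    # drop tokens past the limit
    n_pad = max_length - len(kept)                   # number of pad slots needed
    input_ids = kept + [pad_id] * n_pad              # right-pad to a fixed length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```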

## Train & Evaluate BertClassifier

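We load the pretrained BertModel and stack a few simple layers on top of it.

- bert-base-cased: 12-layer, 768-hidden, 12 self-attention heads, 110M parameters. Trained on cased English text.
- bert-large-cased: 24-layer, 1024-hidden, 16 self-attention heads, 340M parameters. Trained on cased English text. This is the model used here, which is why the linear layer below takes 1024 input features.


Other pretrained models are listed at the link below.

https://huggingface.co/transformers/v2.9.1/pretrained_models.html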

In [12]:
class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):
        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained(bert_model_name) # load the pretrained BERT model
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(1024, 5) # linear classification layer: 1024 hidden features -> 5 classes
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):

        # with return_dict=False the model returns a tuple; the second element is the pooled output
        _, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output) # note: ReLU clamps negative class scores to 0 before the loss

        return final_layer # shape: [batch_size, 5]
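
The `1024` in `nn.Linear(1024, 5)` has to match the hidden size of the chosen checkpoint, so switching `bert_model_name` to a base model would also require changing it to 768. A small sketch of a lookup that keeps the two in sync. `HIDDEN_SIZES`, `NUM_CLASSES`, and `hidden_size_for` are hypothetical helpers written for this illustration, not part of `transformers`.

```python
# Hidden sizes of the checkpoints listed earlier in this notebook.
HIDDEN_SIZES = {
    "bert-base-uncased": 768,
    "bert-base-cased": 768,
    "bert-large-cased": 1024,
}

NUM_CLASSES = 5  # business, entertainment, sport, tech, politics


def hidden_size_for(model_name):
    """Return the pooled-output width for a known checkpoint name."""
    try:
        return HIDDEN_SIZES[model_name]
    except KeyError:
        raise ValueError(f"unknown checkpoint: {model_name}") from None


# Shape the classification head would need: (in_features, out_features).
head_shape = (hidden_size_for("bert-large-cased"), NUM_CLASSES)
print(head_shape)
```

In practice, reading `hidden_size` from the loaded model's `config` is likely the simpler way to avoid the hard-coded number.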

In [38]:
df_dataset = Dataset(df)
train_dataloader = torch.utils.data.DataLoader(df_dataset, batch_size=2, shuffle=True) # batch size 2
# inspect the dataloader

In [33]:
next(iter(train_dataloader))

[{'input_ids': tensor([[[  101, 25138,  2239,  ...,     0,     0,     0]],
 
         [[  101,  1366,  4217,  ...,     0,     0,     0]]]), 'token_type_ids': tensor([[[0, 0, 0,  ..., 0, 0, 0]],
 
         [[0, 0, 0,  ..., 0, 0, 0]]]), 'attention_mask': tensor([[[1, 1, 1,  ..., 0, 0, 0]],
 
         [[1, 1, 1,  ..., 0, 0, 0]]])},
 tensor([4, 1])]
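
Each sample is tokenized with `return_tensors="pt"`, so its `input_ids` has shape `[1, 512]`, and batching two samples gives `[2, 1, 512]`, as the nested brackets in the output above show. The training code removes that middle axis with `.squeeze(1)`. A plain-Python sketch of the reshaping, using nested lists and a sequence length of 4 instead of 512; `shape` and `squeeze_axis1` are hypothetical helpers written only for this illustration.

```python
def shape(nested):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def squeeze_axis1(batch):
    """Drop a size-1 second axis: [B, 1, L] -> [B, L]."""
    assert all(len(sample) == 1 for sample in batch), "axis 1 must have size 1"
    return [sample[0] for sample in batch]


# Two samples, each wrapped in an extra list like a [1, L] tensor.
batch = [[[101, 25138, 2239, 0]], [[101, 1366, 4217, 0]]]
squeezed = squeeze_axis1(batch)

print(shape(batch))
print(shape(squeezed))
```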

In [42]:
def train(model, train_data, val_data, learning_rate, epochs):

    train, val = Dataset(train_data), Dataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True) # batch size 2
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss() # cross-entropy loss for multi-class classification
    optimizer = Adam(model.parameters(), lr= learning_rate) # Adam optimizer over all model parameters

    if use_cuda:
            # move the model and the loss function to the GPU
            model = model.cuda()
            criterion = criterion.cuda()

    for epoch_num in range(epochs):
            # initialize the running totals for training accuracy and loss
            total_acc_train = 0
            total_loss_train = 0

            # wrap the dataloader in tqdm to show progress
            for train_input, train_label in tqdm(train_dataloader):

                train_label = train_label.to(device)
                mask = train_input['attention_mask'].to(device)
                input_id = train_input['input_ids'].squeeze(1).to(device) # drop the extra size-1 dimension from input_ids, then move to the device

                output = model(input_id, mask)

                batch_loss = criterion(output, train_label.long()) # CrossEntropyLoss expects integer (long) class labels
                total_loss_train += batch_loss.item()

                acc = (output.argmax(dim=1) == train_label).sum().item() # compare the highest-scoring class with the label and count the matches
                total_acc_train += acc

                model.zero_grad()   # reset the gradients left over from the previous step
                batch_loss.backward() # backpropagate to compute gradients
                optimizer.step() # update the parameters

            # initialize the running totals for validation accuracy and loss
            total_acc_val = 0
            total_loss_val = 0

            # no gradients are needed during evaluation
            # note: model.eval() is never called, so self.dropout stays active here
            with torch.no_grad():

                for val_input, val_label in val_dataloader:

                    val_label = val_label.to(device)
                    mask = val_input['attention_mask'].to(device)
                    input_id = val_input['input_ids'].squeeze(1).to(device)

                    output = model(input_id, mask)

                    batch_loss = criterion(output, val_label.long())
                    total_loss_val += batch_loss.item()

                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc
                    # same steps as in training, minus the parameter update

            # print the metrics at the end of each epoch
            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} | Train Accuracy: {total_acc_train / len(train_data): .3f} | Val Loss: {total_loss_val / len(val_data): .3f} | Val Accuracy: {total_acc_val / len(val_data): .3f}')
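
`nn.CrossEntropyLoss` averages over the batch by default, so each `batch_loss.item()` is already a per-sample mean for that batch. Summing these means and dividing by `len(train_data)`, as the print statement does, therefore reports roughly the mean loss divided by the batch size. A sketch with made-up loss values, assuming equally sized batches:

```python
BATCH_SIZE = 2

# Hypothetical per-sample losses for four samples.
per_sample_losses = [0.8, 0.4, 0.6, 0.2]

# Split into batches and take the mean of each, as the default loss reduction does.
batches = [per_sample_losses[i:i + BATCH_SIZE]
           for i in range(0, len(per_sample_losses), BATCH_SIZE)]
batch_means = [sum(b) / len(b) for b in batches]

total_loss = sum(batch_means)

reported = total_loss / len(per_sample_losses)  # what the print statement computes
per_sample_mean = total_loss / len(batches)     # mean loss per sample

print(f"reported: {reported:.3f}, per-sample mean: {per_sample_mean:.3f}")
```

The printed loss values in this notebook are still comparable with each other across epochs, since the same scaling applies to all of them.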


In [43]:
def evaluate(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:

        model = model.cuda()

    total_acc_test = 0
    # no gradients are needed at test time either
    with torch.no_grad():

        # same procedure as validation, only the dataset differs
        for test_input, test_label in test_dataloader:

              test_label = test_label.to(device)
              mask = test_input['attention_mask'].to(device)
              input_id = test_input['input_ids'].squeeze(1).to(device)

              output = model(input_id, mask)

              acc = (output.argmax(dim=1) == test_label).sum().item()
              total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')
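
The accuracy bookkeeping in `evaluate` takes the index of the highest score in each row of the model output and counts how often it equals the label. A pure-Python sketch of that counting, with hypothetical scores for three samples; `argmax` here is a small helper written for the example, standing in for `output.argmax(dim=1)`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical class scores for a batch of three samples and five classes.
logits = [
    [0.1, 2.3, 0.0, 0.4, 0.2],
    [1.5, 0.2, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.3, 3.1],
]
true_labels = [1, 2, 4]

predictions = [argmax(row) for row in logits]
num_correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = num_correct / len(true_labels)

print(predictions, num_correct, f"{accuracy:.3f}")
```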

In [44]:
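np.random.seed(112) # note: the shuffle below is fixed by random_state=42, not by this seed
# shuffle the rows, then split 80% / 10% / 10% into train / validation / test
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                                     [int(.8*len(df)), int(.9*len(df))])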

print(len(df_train),len(df_val), len(df_test))

1780 222 223
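
The three sizes printed above follow directly from the two split indices passed to `np.split`. A short worked check of that arithmetic, using the row count reported earlier (`n_rows`, `train_end`, and `val_end` are names introduced here):

```python
n_rows = 2225  # len(df) printed earlier

# Same boundaries that are passed to np.split in the cell above.
train_end = int(0.8 * n_rows)
val_end = int(0.9 * n_rows)

# Rows before train_end, between the two indices, and after val_end.
sizes = (train_end, val_end - train_end, n_rows - val_end)
print(sizes)
```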


In [19]:
model = BertClassifier()
# before training: the BERT weights are pretrained, but the linear classification layer is randomly initialized
evaluate(model, df_test)

Test Accuracy:  0.175


In [20]:
EPOCHS = 2 # try increasing the number of epochs!
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

100%|██████████| 890/890 [03:21<00:00,  4.42it/s]


Epochs: 1 | Train Loss:  0.731 | Train Accuracy:  0.404 | Val Loss:  0.545 | Val Accuracy:  0.676


100%|██████████| 890/890 [03:21<00:00,  4.42it/s]


Epochs: 2 | Train Loss:  0.371 | Train Accuracy:  0.861 | Val Loss:  0.205 | Val Accuracy:  0.991
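
The `890/890` in the progress bars is the number of batches per epoch: 1780 training rows at a batch size of 2. A small worked check that also recovers the elapsed time shown by tqdm from its reported rate (the variable names are introduced here; the numbers are copied from the outputs above):

```python
import math

n_train = 1780      # size of df_train printed earlier
batch_size = 2      # batch size used in train()
its_per_sec = 4.42  # rate reported by tqdm above

# The DataLoader keeps a final partial batch by default, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)

seconds = steps_per_epoch / its_per_sec
minutes, secs = divmod(round(seconds), 60)

print(steps_per_epoch, f"{minutes:02d}:{secs:02d}")
```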


In [21]:
evaluate(model, df_test)

Test Accuracy:  0.978


In [45]:
EPOCHS = 2 # train for 2 more epochs
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

100%|██████████| 890/890 [03:20<00:00,  4.43it/s]


Epochs: 1 | Train Loss:  0.142 | Train Accuracy:  0.978 | Val Loss:  0.072 | Val Accuracy:  0.995


100%|██████████| 890/890 [03:21<00:00,  4.43it/s]


Epochs: 2 | Train Loss:  0.055 | Train Accuracy:  0.995 | Val Loss:  0.036 | Val Accuracy:  1.000


In [46]:
evaluate(model, df_test)

Test Accuracy:  0.991
