In this assignment we use a BERT model to classify BBC news articles by category. Clone-code the notebook, and annotate the code carefully, line by line, as you study it.

## Loading and Exploring the Data

In [1]:
%%capture
!pip install transformers

In [2]:
import pandas as pd
import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from torch import nn
from torch.optim import Adam
from tqdm import tqdm

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv('/content/drive/MyDrive/KUBIG/24_2_Basicstudy_NLP/NLP_Week4/bbc-text.csv')

In [5]:
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


- The dataset consists of a target (`category`) and an input (`text`).

In [6]:
print(len(df))

2225


In [7]:
df.groupby('category').count()

category,text
business,510
entertainment,386
politics,417
sport,511
tech,401
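
The category counts above give a reference point for the accuracies reported later. As a rough sketch (the counts are copied from the table; `majority_baseline` is a name introduced here for illustration), the classes are fairly balanced, and a classifier that always predicts the most frequent class would be right a little under a quarter of the time:

```python
# Category counts copied from the groupby output above.
counts = {"business": 510, "entertainment": 386, "politics": 417,
          "sport": 511, "tech": 401}

total = sum(counts.values())                         # number of articles
shares = {k: v / total for k, v in counts.items()}   # per-class proportion
majority_baseline = max(counts.values()) / total     # always predict the largest class

print(total)
print({k: round(v, 3) for k, v in shares.items()})
print(round(majority_baseline, 3))
```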


In [None]:
df[df['category']=='business'].sample(3)

Unnamed: 0,category,text
1142,business,ebbers denies worldcom fraud former worldcom c...
2015,business,japan turns to beer alternatives japanese brew...
1792,business,worldcom bosses $54m payout ten former direct...


In [None]:
df[df['category']=='entertainment'].sample(3)

Unnamed: 0,category,text
655,entertainment,eminem secret gig venue revealed rapper eminem...
37,entertainment,row threatens hendrix museum plan proposals to...
1596,entertainment,downloads enter us singles chart digital music...


In [None]:
df[df['category']=='politics'].sample(3)

Unnamed: 0,category,text
612,politics,profile: gordon brown the ultimate prize of 10...
1805,politics,kelly trails new discipline power teachers cou...
1487,politics,labour chooses manchester the labour party wil...


In [None]:
df[df['category']=='sport'].sample(3)

Unnamed: 0,category,text
659,sport,beckham rules out management move real madrid ...
306,sport,young debut cut short by ginepri fifteen-year-...
1808,sport,dal maso in to replace bergamasco david dal ma...


In [None]:
df[df['category']=='tech'].sample(3)

Unnamed: 0,category,text
512,tech,digital guru floats sub-$100 pc nicholas negro...
1550,tech,latest opera browser gets vocal net browser op...
1614,tech,microsoft seeking spyware trojan microsoft is ...


## BertTokenizer

Load the pretrained `BertTokenizer` for BERT. Try several variants.

- bert-base-uncased: about 110M parameters, all text lowercased
- bert-large-cased: about 340M parameters, case-sensitive
- bert-base-cased: about 110M parameters, English, case-sensitive
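
The parameter counts quoted for these checkpoints can be sanity-checked from the architecture. The following is a back-of-the-envelope sketch, not an official formula: `bert_param_count` is a hypothetical helper, and the vocabulary size of 28,996 for the cased checkpoints, the feed-forward widths of 3072 and 4096, the 512 positions, and the 2 segment types are assumptions taken from the published BERT configurations rather than from this notebook.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count encoder + pooler parameters of a BERT-style model (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output projections
    feed_forward = (hidden * intermediate + intermediate      # expand
                    + intermediate * hidden + hidden          # project back
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base_cased = bert_param_count(28996, 768, 12, 3072)
large_cased = bert_param_count(28996, 1024, 24, 4096)
print(f"{base_cased / 1e6:.1f}M, {large_cased / 1e6:.1f}M")
```

Under these assumptions the base model comes out at roughly 108M parameters and the large model at roughly 334M, about three times as many. The commonly quoted 110M and 340M figures are rounded.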


In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
labels = {'business':0,
          'entertainment':1,
          'sport':2,
          'tech':3,
          'politics':4
          }


## Dataset

In [9]:
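class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):
        # Map each category name to its integer id using the `labels` dict above
        self.labels = [labels[label] for label in df['category']]
        # Tokenize every text up front: pad or truncate to 512 tokens and
        # return PyTorch tensors (each tensor has shape [1, 512])
        self.texts = [tokenizer(text,
                                padding='max_length', max_length=512, truncation=True,
                                return_tensors="pt") for text in df['text']]

    def classes(self):
        # All integer labels, in dataset order
        return self.labels

    def __len__(self):
        # Number of samples
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Label of sample `idx` as a NumPy scalar array
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Tokenizer output (input_ids, attention_mask, ...) of sample `idx`
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        # The DataLoader collates these (input, label) pairs into batches
        return batch_texts, batch_y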

## Train & Evaluate BertClassifier

Load the pretrained `BertModel` and stack a few simple layers on top of it.

- bert-base-cased: 12-layer, 768-hidden, 12-self attention heads, 110M parameters. Trained on cased English text.


Other pretrained models are listed at the link below.

https://huggingface.co/transformers/v2.9.1/pretrained_models.html

In [10]:
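class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):

        super(BertClassifier, self).__init__()

        # Pretrained BERT encoder (hidden size 768 for bert-base-cased)
        self.bert = BertModel.from_pretrained('bert-base-cased')
        # Dropout applied to the pooled [CLS] representation
        self.dropout = nn.Dropout(dropout)
        # Linear head mapping the 768-dim pooled output to the 5 categories
        self.linear = nn.Linear(768, 5)
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):

        # With return_dict=False, BertModel returns (sequence_output, pooled_output);
        # only the pooled output is used here
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        # ReLU clamps negative class scores to 0 before they reach the loss
        final_layer = self.relu(linear_output)

        return final_layer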
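
One design point in `forward` is worth examining: the final `ReLU` is applied to the class scores that are then passed to `nn.CrossEntropyLoss`. The sketch below uses made-up logits to illustrate what that clamp does. Any negative score becomes 0, so when every class scores below zero the output is a five-way tie and the loss equals that of a uniform guess, ln 5. Returning `linear_output` directly is a common alternative; this is an observation about the technique, not a claim about how the results reported in this notebook would change.

```python
import math
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()

# Made-up raw logits for one sample and five classes, all negative.
raw_logits = torch.tensor([[-2.0, -1.0, -3.0, -0.5, -4.0]])
target = torch.tensor([3])  # the true class has the largest raw score

clamped = torch.relu(raw_logits)   # what a ReLU on the final layer would return
loss_raw = criterion(raw_logits, target).item()
loss_clamped = criterion(clamped, target).item()

print(clamped)                     # every entry is zero, so the ranking is lost
print(loss_raw, loss_clamped, math.log(5))
```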

In [None]:
'''
def train(model, train_data, val_data, learning_rate, epochs):

    train, val = Dataset(train_data), Dataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr= learning_rate)

    if use_cuda:

            model = model.cuda()
            criterion = criterion.cuda()

    for epoch_num in range(epochs):

            total_acc_train = 0
            total_loss_train = 0

            for train_input, train_label in tqdm(train_dataloader):

                train_label = train_label.to(device)
                mask = train_input['attention_mask'].to(device)
                input_id = train_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)

                batch_loss = criterion(output, train_label.long())
                total_loss_train += batch_loss.item()

                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc

                model.zero_grad()
                batch_loss.backward()
                optimizer.step()

            total_acc_val = 0
            total_loss_val = 0

            with torch.no_grad():

                for val_input, val_label in val_dataloader:

                    val_label = val_label.to(device)
                    mask = val_input['attention_mask'].to(device)
                    input_id = val_input['input_ids'].squeeze(1).to(device)

                    output = model(input_id, mask)

                    batch_loss = criterion(output, val_label.long())
                    total_loss_val += batch_loss.item()

                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc

            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} | Train Accuracy: {total_acc_train / len(train_data): .3f} | Val Loss: {total_loss_val / len(val_data): .3f} | Val Accuracy: {total_acc_val / len(val_data): .3f}')
'''

In [11]:
def train(model, train_data, val_data, learning_rate, epochs):
    train, val = Dataset(train_data), Dataset(val_data)
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    for epoch_num in range(epochs):
        total_acc_train = 0
        total_loss_train = 0

        for train_input, train_label in tqdm(train_dataloader):
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].to(device)

            # Shape check and adjustment
            if len(input_id.shape) > 2:
                input_id = input_id.squeeze(1)
            if len(mask.shape) > 2:
                mask = mask.squeeze(1)

            output = model(input_id, mask)

            batch_loss = criterion(output, train_label.long())
            total_loss_train += batch_loss.item()

            acc = (output.argmax(dim=1) == train_label).sum().item()
            total_acc_train += acc

            model.zero_grad()
            batch_loss.backward()
            optimizer.step()

        total_acc_val = 0
        total_loss_val = 0

        with torch.no_grad():
            for val_input, val_label in val_dataloader:
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].to(device)

                # Shape check and adjustment
                if len(input_id.shape) > 2:
                    input_id = input_id.squeeze(1)
                if len(mask.shape) > 2:
                    mask = mask.squeeze(1)

                output = model(input_id, mask)

                batch_loss = criterion(output, val_label.long())
                total_loss_val += batch_loss.item()

                acc = (output.argmax(dim=1) == val_label).sum().item()
                total_acc_val += acc

        print(f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} | Train Accuracy: {total_acc_train / len(train_data): .3f} | Val Loss: {total_loss_val / len(val_data): .3f} | Val Accuracy: {total_acc_val / len(val_data): .3f}')
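
A note on reading the printed losses: `nn.CrossEntropyLoss` averages over the batch by default, and `train` adds up one such per-batch mean for each batch but then divides by the number of samples. With `batch_size=2` the printed value is therefore about half the mean per-sample loss. The sketch below replays that arithmetic with made-up per-sample losses; only the batching logic mirrors the function above.

```python
# Made-up per-sample losses for six samples, processed in batches of two.
per_sample = [1.6, 1.4, 1.2, 1.0, 0.8, 0.6]
batch_size = 2

batches = [per_sample[i:i + batch_size]
           for i in range(0, len(per_sample), batch_size)]
batch_means = [sum(b) / len(b) for b in batches]   # what criterion(...) returns per batch

printed = sum(batch_means) / len(per_sample)       # what train() prints
true_mean = sum(per_sample) / len(per_sample)      # mean loss per sample
per_batch = sum(batch_means) / len(batch_means)    # dividing by the number of batches instead

print(printed, true_mean, per_batch)
```

Dividing by `len(train_dataloader)` instead would report the per-sample mean whenever all batches have the same size. Read this way, a printed loss such as 0.746 corresponds to roughly 1.49 per sample, close to ln 5 ≈ 1.61, the loss of a uniform guess over five classes.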


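- The original code, which called `squeeze(1)` only on `input_ids`, kept raising an error saying the mask shape was larger than expected. The code now squeezes dimension 1 of both `input_id` and `mask` whenever the tensor has more than two dimensions.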
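
The shape problem can be reproduced without a tokenizer. In the sketch below, constant tensors stand in for tokenizer outputs (an assumption for illustration): with `return_tensors="pt"` each encoded text has shape `[1, 512]`, so stacking a batch of two, as the `DataLoader` does, yields `[2, 1, 512]` for both `input_ids` and `attention_mask`. Squeezing dimension 1 on both restores the `[batch, seq_len]` shape that `BertModel` expects.

```python
import torch

seq_len, batch_size = 512, 2

# Stand-ins for one tokenized text each: shape [1, seq_len], as with return_tensors="pt".
items = [{"input_ids": torch.zeros(1, seq_len, dtype=torch.long),
          "attention_mask": torch.ones(1, seq_len, dtype=torch.long)}
         for _ in range(batch_size)]

# Stacking per key mimics what the DataLoader's default collation does.
batch = {key: torch.stack([item[key] for item in items]) for key in items[0]}
print(batch["attention_mask"].shape)   # three dimensions: [batch, 1, seq_len]

# Same guard as in train() and evaluate(): drop the extra dimension if present.
input_id, mask = batch["input_ids"], batch["attention_mask"]
if input_id.dim() > 2:
    input_id = input_id.squeeze(1)
if mask.dim() > 2:
    mask = mask.squeeze(1)
print(input_id.shape, mask.shape)
```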

In [None]:
'''
def evaluate(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:

        model = model.cuda()

    total_acc_test = 0
    with torch.no_grad():

        for test_input, test_label in test_dataloader:

              test_label = test_label.to(device)
              mask = test_input['attention_mask'].to(device)
              input_id = test_input['input_ids'].squeeze(1).to(device)

              output = model(input_id, mask)

              acc = (output.argmax(dim=1) == test_label).sum().item()
              total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')
'''

In [12]:
def evaluate(model, test_data):
    test = Dataset(test_data)
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    total_acc_test = 0
    with torch.no_grad():
        for test_input, test_label in test_dataloader:
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].to(device)

            # Shape check and adjustment
            if len(input_id.shape) > 2:
                input_id = input_id.squeeze(1)
            if len(mask.shape) > 2:
                mask = mask.squeeze(1)

            output = model(input_id, mask)

            acc = (output.argmax(dim=1) == test_label).sum().item()
            total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

- The same error occurred here, so `evaluate` was fixed in the same way.
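
A related detail: neither `train` nor `evaluate` calls `model.train()` or `model.eval()`, so the classifier's own `nn.Dropout(0.5)` layer stays in its default training mode and keeps zeroing activations during validation and testing. The sketch below shows the difference on a plain dropout layer. Calling `model.eval()` before the evaluation loop (and `model.train()` before each training epoch) is the usual remedy; it is offered here as a suggestion, not as what produced the numbers reported in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(0.5)
x = torch.ones(1, 768)

dropout.train()            # default mode of a freshly built module
train_out = dropout(x)     # entries are either zeroed or scaled by 1 / (1 - 0.5)

dropout.eval()             # evaluation mode: dropout is the identity
eval_out = dropout(x)

print((train_out == 0).float().mean().item())   # fraction of zeroed entries
print(torch.equal(eval_out, x))
```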

In [13]:
np.random.seed(112)
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                                     [int(.8*len(df)), int(.9*len(df))])

print(len(df_train),len(df_val), len(df_test))

1780 222 223
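
The split points can be checked by hand: `int(.8 * 2225)` is 1780 and `int(.9 * 2225)` is 2002, which gives the 1780/222/223 sizes printed above. The sketch below performs the same 80/10/10 split with `iloc` slicing on a synthetic frame (the `row` column is a placeholder, not the BBC data). This is one alternative to passing a DataFrame to `np.split`, which recent pandas/NumPy combinations may warn about.

```python
import pandas as pd

# Synthetic stand-in with the same number of rows as the BBC dataset.
df_demo = pd.DataFrame({"row": range(2225)})

shuffled = df_demo.sample(frac=1, random_state=42)   # same shuffle call as above
n = len(shuffled)
train_end, val_end = int(.8 * n), int(.9 * n)        # 80% / 10% / 10% boundaries

df_train_demo = shuffled.iloc[:train_end]
df_val_demo = shuffled.iloc[train_end:val_end]
df_test_demo = shuffled.iloc[val_end:]

print(len(df_train_demo), len(df_val_demo), len(df_test_demo))
```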


## Results

### bert-base-cased: about 110M parameters, English, case-sensitive

In [16]:
EPOCHS = 2  # try increasing the number of epochs!
model = BertClassifier()
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

100%|██████████| 890/890 [03:08<00:00,  4.72it/s]


Epochs: 1 | Train Loss:  0.746 | Train Accuracy:  0.359 | Val Loss:  0.572 | Val Accuracy:  0.608


100%|██████████| 890/890 [03:08<00:00,  4.72it/s]


Epochs: 2 | Train Loss:  0.394 | Train Accuracy:  0.820 | Val Loss:  0.207 | Val Accuracy:  0.973


In [24]:
evaluate(model, df_test)    # bert-base-cased # epoch 2

Test Accuracy:  0.978


- Took about 7 minutes
- Train accuracy: 0.82
- Test accuracy: 0.978

Let's also try 10 epochs.

In [None]:
EPOCHS = 10
model = BertClassifier()
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

100%|██████████| 890/890 [03:15<00:00,  4.56it/s]


Epochs: 1 | Train Loss:  0.782 | Train Accuracy:  0.282 | Val Loss:  0.688 | Val Accuracy:  0.505


100%|██████████| 890/890 [03:14<00:00,  4.58it/s]


Epochs: 2 | Train Loss:  0.627 | Train Accuracy:  0.556 | Val Loss:  0.509 | Val Accuracy:  0.734


100%|██████████| 890/890 [03:16<00:00,  4.54it/s]


Epochs: 3 | Train Loss:  0.394 | Train Accuracy:  0.806 | Val Loss:  0.228 | Val Accuracy:  0.968


100%|██████████| 890/890 [03:13<00:00,  4.60it/s]


Epochs: 4 | Train Loss:  0.181 | Train Accuracy:  0.962 | Val Loss:  0.153 | Val Accuracy:  0.950


100%|██████████| 890/890 [03:13<00:00,  4.61it/s]


Epochs: 5 | Train Loss:  0.098 | Train Accuracy:  0.985 | Val Loss:  0.082 | Val Accuracy:  0.982


100%|██████████| 890/890 [03:13<00:00,  4.60it/s]


Epochs: 6 | Train Loss:  0.062 | Train Accuracy:  0.992 | Val Loss:  0.058 | Val Accuracy:  0.986


100%|██████████| 890/890 [03:13<00:00,  4.59it/s]


Epochs: 7 | Train Loss:  0.038 | Train Accuracy:  0.996 | Val Loss:  0.067 | Val Accuracy:  0.973


100%|██████████| 890/890 [03:16<00:00,  4.53it/s]


Epochs: 8 | Train Loss:  0.029 | Train Accuracy:  0.995 | Val Loss:  0.043 | Val Accuracy:  0.986


100%|██████████| 890/890 [03:13<00:00,  4.59it/s]


Epochs: 9 | Train Loss:  0.018 | Train Accuracy:  0.999 | Val Loss:  0.035 | Val Accuracy:  0.986


100%|██████████| 890/890 [03:13<00:00,  4.60it/s]


Epochs: 10 | Train Loss:  0.013 | Train Accuracy:  0.999 | Val Loss:  0.024 | Val Accuracy:  0.991


In [None]:
evaluate(model, df_test)     # bert-base-cased # epoch 10

Test Accuracy:  0.996


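- Took about 33 minutes
- Train accuracy: 0.999, test accuracy: 0.996
- Accuracy improved markedly
  - Train accuracy stays at roughly 0.99 from epoch 5 onward

Other models would probably reach similarly high accuracy at 10 epochs, and training takes too long, so from here on we run only 2 epochs for comparison.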

### bert-base-uncased: about 110M parameters, all text lowercased
Unlike bert-base-cased, bert-base-uncased lowercases all text. Let's try it.

In [None]:
EPOCHS = 2
model = BertClassifier()
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

100%|██████████| 890/890 [03:13<00:00,  4.60it/s]


Epochs: 1 | Train Loss:  0.718 | Train Accuracy:  0.458 | Val Loss:  0.546 | Val Accuracy:  0.802


100%|██████████| 890/890 [03:15<00:00,  4.54it/s]


Epochs: 2 | Train Loss:  0.411 | Train Accuracy:  0.896 | Val Loss:  0.286 | Val Accuracy:  0.959


In [None]:
evaluate(model, df_test)      # bert-base-uncased # epoch 2

Test Accuracy:  0.946


- Took about 7 minutes
- Train accuracy: 0.896
- Test accuracy: 0.946
- Test accuracy is slightly lower than with bert-base-cased

### bert-large-cased: about 340M parameters, case-sensitive
- Let's also try large-cased, which has about three times as many parameters
- Again, training would take too long, so we run only 2 epochs
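
Switching checkpoints touches more than the model name: `bert-large-cased` has a hidden size of 1024, so a head defined as `nn.Linear(768, 5)` would not fit its pooled output, and the tokenizer should come from the same checkpoint as the model. One way to avoid hard-coding these values, sketched below, is to read the width from the encoder's config. `FlexibleBertClassifier` is a hypothetical name, not the class used for the runs in this notebook, and the tiny randomly initialised `BertConfig` exists only so the shape check runs without downloading weights.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel

class FlexibleBertClassifier(nn.Module):
    """Same head as BertClassifier, but sized from the encoder's config."""

    def __init__(self, bert, num_classes=5, dropout=0.5):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        # hidden_size is 768 for base checkpoints and 1024 for large ones
        self.linear = nn.Linear(bert.config.hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask,
                                     return_dict=False)
        return self.relu(self.linear(self.dropout(pooled_output)))

# Tiny random encoder (assumed sizes, for an offline shape check only).
tiny_config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                         num_attention_heads=2, intermediate_size=64)
clf = FlexibleBertClassifier(BertModel(tiny_config))
clf.eval()

ids = torch.randint(0, 100, (2, 8))          # batch of 2 sequences, 8 tokens each
mask = torch.ones(2, 8, dtype=torch.long)
with torch.no_grad():
    logits = clf(ids, mask)
print(logits.shape)
```

For a real run, the encoder would instead be loaded with `BertModel.from_pretrained('bert-large-cased')` and paired with `BertTokenizer.from_pretrained('bert-large-cased')`.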

In [None]:
EPOCHS = 2
model = BertClassifier()
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

100%|██████████| 890/890 [10:37<00:00,  1.40it/s]


Epochs: 1 | Train Loss:  0.812 | Train Accuracy:  0.230 | Val Loss:  0.804 | Val Accuracy:  0.189


100%|██████████| 890/890 [10:42<00:00,  1.39it/s]


Epochs: 2 | Train Loss:  0.514 | Train Accuracy:  0.648 | Val Loss:  0.192 | Val Accuracy:  0.977


In [None]:
evaluate(model, df_test)   # bert-large-cased  # epoch 2

Test Accuracy:  0.973


- Took about 22 minutes
- Train accuracy: 0.648
- Test accuracy: 0.973