# KUBIG 24S NLP Basic Week 4

## Preview

### 19th cohort, 정종락

In this assignment we use a BERT model to classify BBC news articles by category. Clone the code, and study it by annotating it carefully with line-by-line comments.

$$
\\
$$

# Data Loading and Exploration

## Import Libraries and Modules

In [1]:
%%capture
!pip install transformers

In [2]:
# Libraries for data manipulation and computation
## data manipulation
import pandas as pd
## numerical computations
import numpy as np

# Library and modules for deep learning (PyTorch)
# deep learning framework
import torch
# neural network module
from torch import nn
# adam optimizer
from torch.optim import Adam

# Hugging Face library for NLP models
## modules for BERT
from transformers import BertTokenizer, BertModel

# Other
## progress bar
from tqdm import tqdm

$$
\\
$$

## Load Dataset

In [3]:
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/Othercomputers/My MacBook Pro/Colab/KUBIG/24S/NLP/Week 4/HW/'

Mounted at /content/drive


In [4]:
df = pd.read_csv(path + 'bbc-text.csv')

In [5]:
df.head()

| | category | text |
| --- | --- | --- |
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In [6]:
print(len(df))

2225


In [7]:
df.groupby('category').count()

| category | text |
| --- | --- |
| business | 510 |
| entertainment | 386 |
| politics | 417 |
| sport | 511 |
| tech | 401 |

$$
\\
$$

# Setting

## BertTokenizer

As the tokenizer, we load the BertTokenizer that belongs to a pretrained BERT model. Try several variants.

- bert-base-uncased: about 110M parameters, lowercased English text
- bert-large-cased: about 340M parameters, case-sensitive English text
- bert-base-cased: about 110M parameters, case-sensitive English text


### Define tokenizers

In [8]:
tokenizer1 = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer2 = BertTokenizer.from_pretrained('bert-large-cased')
tokenizer3 = BertTokenizer.from_pretrained('bert-base-uncased')


### Set labels

In [9]:
labels = {'business': 0,
          'entertainment': 1,
          'sport': 2,
          'tech': 3,
          'politics': 4
          }
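
The `labels` dictionary maps category names to class indices, but the model's predictions come back as indices. A minimal sketch of the reverse lookup follows; `id2label` and the `predicted` list are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Same mapping as in the notebook.
labels = {'business': 0,
          'entertainment': 1,
          'sport': 2,
          'tech': 3,
          'politics': 4}

# Hypothetical reverse mapping: class index -> category name.
id2label = {idx: name for name, idx in labels.items()}

# Made-up predicted indices, e.g. what output.argmax(dim=1).tolist() could return.
predicted = [2, 0, 4]
decoded = [id2label[i] for i in predicted]
print(decoded)  # ['sport', 'business', 'politics']
```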

$$
\\
$$

## Dataset

In [10]:
class Dataset(torch.utils.data.Dataset):

    # Initialize the `Dataset` class with a dataframe and a tokenizer
    ## Note: add 'tokenizer' argument to try several tokenizers
    def __init__(self, df, tokenizer):
        ### Extract labels from the dataframe
        self.labels = [labels[label] for label in df['category']]
        ### Tokenize each text in the dataframe
        self.texts = [tokenizer(text,
                                padding = 'max_length', max_length = 512, truncation = True,
                                return_tensors = "pt") for text in df['text']]
    # Return the labels
    def classes(self):
        return self.labels

    # Return the length of the dataset
    def __len__(self):
        return len(self.labels)

    # Get labels for a batch of data
    def get_batch_labels(self, idx):
        return np.array(self.labels[idx])

    # Get texts for a batch of data
    def get_batch_texts(self, idx):
        return self.texts[idx]

    # Get a batch of texts and labels
    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y
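
Because the tokenizer is called on one text at a time with `return_tensors = "pt"`, every field of an item carries a leading dimension of size 1. The sketch below uses placeholder tensors instead of a real tokenizer (an assumption made so that nothing has to be downloaded) to show why the training loop later calls `squeeze(1)` on the batched inputs.

```python
import torch

max_length = 512

# Stand-in for one tokenized text: each field has shape (1, max_length).
item = {
    "input_ids": torch.zeros(1, max_length, dtype=torch.long),
    "attention_mask": torch.ones(1, max_length, dtype=torch.long),
}

# Batching stacks the items along a new leading dimension,
# which is emulated here with torch.stack.
batch_size = 2
batch = {key: torch.stack([value] * batch_size) for key, value in item.items()}

stacked_shape = tuple(batch["input_ids"].shape)
squeezed_shape = tuple(batch["input_ids"].squeeze(1).shape)
print(stacked_shape)   # (2, 1, 512)
print(squeezed_shape)  # (2, 512)
```

BERT expects inputs of shape `(batch_size, sequence_length)`, so the extra middle dimension has to be removed before the forward pass.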

$$
\\
$$

## Train & Evaluate BertClassifier

We load the pretrained BertModel and stack a few simple layers on top of it.

- bert-base-cased: 12-layer, 768-hidden, 12-self attention heads, 110M parameters. Trained on cased English text.


Other pretrained models are listed at the link below.

https://huggingface.co/transformers/v2.9.1/pretrained_models.html

$$
\\
$$

### `BertClassifier`

In [11]:
class BertClassifier(nn.Module):

    # initialize the `BertClassifier` class with a dropout rate
    ## Note: add `bert_model_name` to try several pre-trained models
    def __init__(self, bert_model_name = 'bert-base-cased', dropout = 0.5):
        super(BertClassifier, self).__init__()
        ## load pre-trained BERT model
        self.bert = BertModel.from_pretrained(bert_model_name)
        ## dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout)
        ## linear layer to map BERT output to desired output size
        self.linear = nn.Linear(768, 5)
        ## ReLU activation function
        self.relu = nn.ReLU()

    # Define the forward pass
    def forward(self, input_id, mask):
        ## BERT forward pass
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        ## Apply dropout
        dropout_output = self.dropout(pooled_output)
        ## Apply linear layer
        linear_output = self.linear(dropout_output)
        ## Apply ReLU activation
        final_layer = self.relu(linear_output)
        return final_layer

$$
\\
$$

### `train`

In [12]:
def train(model, train_data, val_data, tokenizer, learning_rate, epochs):
    # Initialize datasets with tokenizer
    train_dataset = Dataset(train_data, tokenizer)
    val_dataset = Dataset(val_data, tokenizer)

    # Create data loaders
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size = 2, shuffle = True)
    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=2)

    # GPU
    ## Check if CUDA is available
    use_cuda = torch.cuda.is_available()
    ## Set device to CUDA if available, else CPU
    device = torch.device("cuda" if use_cuda else "cpu")

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Define optimizer
    optimizer = Adam(model.parameters(), lr = learning_rate)

    if use_cuda:
        ## Move model to GPU
        model = model.cuda()
        ## Move criterion to GPU
        criterion = criterion.cuda()

    # Iterate over epochs
    for epoch_num in range(epochs):

        total_acc_train = 0
        total_loss_train = 0

        ## Iterate over training batches
        for train_input, train_label in tqdm(train_dataloader):

            ### Move labels to device
            train_label = train_label.to(device)

            ### Move attention mask to device
            mask = train_input['attention_mask'].to(device)

            ### Remove extra dimension from mask
            mask = mask.squeeze(1)

            ### Move input IDs to device and remove extra dimension
            input_id = train_input['input_ids'].squeeze(1).to(device)

            ### Forward pass
            output = model(input_id, mask)

            ### Compute loss
            batch_loss = criterion(output, train_label.long())
            total_loss_train += batch_loss.item()

            ### Compute accuracy
            acc = (output.argmax(dim=1) == train_label).sum().item()
            total_acc_train += acc

            ### Gradients and update
            model.zero_grad()
            batch_loss.backward()
            optimizer.step()

        total_acc_val = 0
        total_loss_val = 0

        # Validation
        with torch.no_grad():
            ## Iterate over validation batches
            for val_input, val_label in val_dataloader:
                ### Move labels to device
                val_label = val_label.to(device)

                ### Move attention mask to device
                mask = val_input['attention_mask'].to(device)

                ### Remove extra dimension from mask
                mask = mask.squeeze(1)

                ### Move input IDs to device and remove extra dimension
                input_id = val_input['input_ids'].squeeze(1).to(device)

                ### Forward pass
                output = model(input_id, mask)

                ### Loss
                batch_loss = criterion(output, val_label.long())
                total_loss_val += batch_loss.item()

                ### Accuracy
                acc = (output.argmax(dim=1) == val_label).sum().item()
                total_acc_val += acc

        print(f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} | Train Accuracy: {total_acc_train / len(train_data): .3f} | Val Loss: {total_loss_val / len(val_data): .3f} | Val Accuracy: {total_acc_val / len(val_data): .3f}')
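
A note on reading the printed loss: `nn.CrossEntropyLoss` averages over the batch by default, so `total_loss_train` is a sum of per-batch means, and `train` then divides it by the number of samples. With `batch_size = 2` the printed value is therefore about half of the mean per-sample loss. The small sketch below uses made-up loss values to illustrate the arithmetic; dividing by the number of batches instead is shown only as one possible alternative.

```python
batch_size = 2

# Made-up per-sample losses for four samples.
per_sample_losses = [1.6, 1.4, 1.2, 1.0]

# Group the samples into batches and take the mean of each batch,
# which is what the criterion returns for one batch.
batches = [per_sample_losses[i:i + batch_size]
           for i in range(0, len(per_sample_losses), batch_size)]
batch_means = [sum(b) / len(b) for b in batches]

# What the notebook prints: sum of batch means / number of samples.
reported = sum(batch_means) / len(per_sample_losses)

# Alternative: sum of batch means / number of batches.
per_batch_average = sum(batch_means) / len(batches)

print(f"{reported:.3f}")           # 0.650
print(f"{per_batch_average:.3f}")  # 1.300
```

The accuracy is not affected, because correct predictions are counted per sample before dividing by the number of samples.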


$$
\\
$$

### `evaluate`

In [13]:
def evaluate(model, test_data, tokenizer):

    # Initialize the Dataset with the provided tokenizer
    test = Dataset(test_data, tokenizer)

    # Create data loader
    test_dataloader = torch.utils.data.DataLoader(test, batch_size = 2)

    # GPU
    ## Check if CUDA is available
    use_cuda = torch.cuda.is_available()
    ## Set device to CUDA if available, else CPU
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        ## Move model to GPU
        model = model.cuda()

    total_acc_test = 0

    # Evaluation
    with torch.no_grad():
        ## Iterate over test batches
        for test_input, test_label in test_dataloader:

            ### Move labels to device
            test_label = test_label.to(device)

            ### Move attention mask to device
            mask = test_input['attention_mask'].to(device)

            ### Remove extra dimension from mask
            mask = mask.squeeze(1)

            ### Move input IDs to device and remove extra dimension
            input_id = test_input['input_ids'].squeeze(1).to(device)

            ### Forward pass
            output = model(input_id, mask)

            ### Compute accuracy
            acc = (output.argmax(dim=1) == test_label).sum().item()
            total_acc_test += acc  # Accumulate the test accuracy

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

$$
\\
$$

# Training & Evaluation

## Split dataset

In [14]:
np.random.seed(112)
df_train, df_val, df_test = np.split(df.sample(frac = 1, random_state = 42),
                                     [int(.8*len(df)), int(.9*len(df))])

print(len(df_train),len(df_val), len(df_test))

1780 222 223
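
The split sizes follow directly from the two cut points passed to `np.split`. The arithmetic can be checked without NumPy, as in this small sketch (the variable names are chosen here for illustration):

```python
import math

n = 2225  # len(df)

# Cut points at 80% and 90% of the shuffled dataframe.
cut_train = int(.8 * n)
cut_val = int(.9 * n)

sizes = (cut_train, cut_val - cut_train, n - cut_val)
print(sizes)  # (1780, 222, 223)

# With batch_size = 2, one training epoch iterates over this many batches.
batches_per_epoch = math.ceil(sizes[0] / 2)
print(batches_per_epoch)  # 890
```

The 890 batches per epoch match the `890/890` shown by the progress bars during training.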




$$
\\
$$

## `EPOCHS = 2`

In [15]:
EPOCHS = 2
LR = 1e-6

model1 = BertClassifier('bert-base-cased')
model2 = BertClassifier('bert-large-cased')
model3 = BertClassifier('bert-base-uncased')


$$
\\
$$

### Train

In [16]:
train(model1, df_train, df_val, tokenizer1, LR, EPOCHS)

100%|██████████| 890/890 [03:07<00:00,  4.73it/s]


Epochs: 1 | Train Loss:  0.715 | Train Accuracy:  0.421 | Val Loss:  0.591 | Val Accuracy:  0.622


100%|██████████| 890/890 [03:07<00:00,  4.76it/s]


Epochs: 2 | Train Loss:  0.392 | Train Accuracy:  0.848 | Val Loss:  0.198 | Val Accuracy:  0.982


$$
\\
$$

In [17]:
# train(model2, df_train, df_val, tokenizer2, LR, EPOCHS)

```
  0%|          | 0/890 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-35-caafb928dc27> in <cell line: 1>()
----> 1 train(model2, df_train, df_val, tokenizer2, LR, EPOCHS)

6 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py in forward(self, input)
    114
    115     def forward(self, input: Tensor) -> Tensor:
--> 116         return F.linear(input, self.weight, self.bias)
    117
    118     def extra_repr(self) -> str:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1024 and 768x5)
```

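model2 fails because the linear layer in `BertClassifier` is hard-coded for 768-dimensional input, while the shapes in the error above show that bert-large-cased produces 1024-dimensional output. The layer sizes would have to be set differently for this model.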
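
One way to avoid hard-coding the input size is to pass the encoder's hidden size into the classification layers. The sketch below is not the notebook's code: `ClassifierHead` is a hypothetical stand-alone module, and random tensors stand in for BERT's pooled output so that no model has to be downloaded. Inside `BertClassifier`, the size could presumably be read from `self.bert.config.hidden_size` instead of writing `768`.

```python
import torch
from torch import nn


class ClassifierHead(nn.Module):
    """Hypothetical head mirroring the dropout -> linear -> ReLU layers."""

    def __init__(self, hidden_size, num_labels=5, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Input size follows the encoder instead of a fixed 768.
        self.linear = nn.Linear(hidden_size, num_labels)
        self.relu = nn.ReLU()

    def forward(self, pooled_output):
        return self.relu(self.linear(self.dropout(pooled_output)))


# 768 corresponds to the base models, 1024 to bert-large-cased.
output_shapes = {}
for hidden_size in (768, 1024):
    head = ClassifierHead(hidden_size)
    pooled = torch.randn(2, hidden_size)  # stand-in for a batch of 2
    output_shapes[hidden_size] = tuple(head(pooled).shape)

print(output_shapes)  # {768: (2, 5), 1024: (2, 5)}
```

Note that bert-large-cased is also roughly three times larger than the base models, so memory use and training time would grow accordingly even once the shapes match.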

$$
\\
$$

In [18]:
train(model3, df_train, df_val, tokenizer3, LR, EPOCHS)

100%|██████████| 890/890 [03:07<00:00,  4.74it/s]


Epochs: 1 | Train Loss:  0.721 | Train Accuracy:  0.409 | Val Loss:  0.602 | Val Accuracy:  0.608


100%|██████████| 890/890 [03:07<00:00,  4.74it/s]


Epochs: 2 | Train Loss:  0.485 | Train Accuracy:  0.762 | Val Loss:  0.374 | Val Accuracy:  0.905


$$
\\
$$

### Evaluation

In [19]:
evaluate(model1, df_test, tokenizer1)

Test Accuracy:  0.964


$$
\\
$$

In [20]:
# evaluate(model2, df_test, tokenizer2)

$$
\\
$$

In [21]:
evaluate(model3, df_test, tokenizer3)

Test Accuracy:  0.879


$$
\\
$$

## `EPOCHS = 3`

In [22]:
EPOCHS = 3
LR = 1e-6

model1 = BertClassifier('bert-base-cased')
model2 = BertClassifier('bert-large-cased')
model3 = BertClassifier('bert-base-uncased')

$$
\\
$$

### Train

In [23]:
train(model1, df_train, df_val, tokenizer1, LR, EPOCHS)

100%|██████████| 890/890 [03:07<00:00,  4.76it/s]


Epochs: 1 | Train Loss:  0.754 | Train Accuracy:  0.340 | Val Loss:  0.596 | Val Accuracy:  0.608


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 2 | Train Loss:  0.373 | Train Accuracy:  0.843 | Val Loss:  0.196 | Val Accuracy:  0.973


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 3 | Train Loss:  0.151 | Train Accuracy:  0.964 | Val Loss:  0.108 | Val Accuracy:  0.964


$$
\\
$$

In [24]:
# train(model2, df_train, df_val, tokenizer2, LR, EPOCHS)

$$
\\
$$

In [25]:
train(model3, df_train, df_val, tokenizer3, LR, EPOCHS)

100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 1 | Train Loss:  0.726 | Train Accuracy:  0.430 | Val Loss:  0.566 | Val Accuracy:  0.739


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 2 | Train Loss:  0.451 | Train Accuracy:  0.858 | Val Loss:  0.325 | Val Accuracy:  0.941


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 3 | Train Loss:  0.248 | Train Accuracy:  0.969 | Val Loss:  0.181 | Val Accuracy:  0.968


$$
\\
$$

### Evaluation

In [26]:
evaluate(model1, df_test, tokenizer1)

Test Accuracy:  0.982


$$
\\
$$

In [27]:
# evaluate(model2, df_test, tokenizer2)

$$
\\
$$

In [28]:
evaluate(model3, df_test, tokenizer3)

Test Accuracy:  0.973


$$
\\
$$

## `EPOCHS = 4`

In [29]:
EPOCHS = 4
LR = 1e-6

model1 = BertClassifier('bert-base-cased')
model2 = BertClassifier('bert-large-cased')
model3 = BertClassifier('bert-base-uncased')

$$
\\
$$

### Train

In [30]:
train(model1, df_train, df_val, tokenizer1, LR, EPOCHS)

100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 1 | Train Loss:  0.737 | Train Accuracy:  0.389 | Val Loss:  0.622 | Val Accuracy:  0.536


100%|██████████| 890/890 [03:07<00:00,  4.76it/s]


Epochs: 2 | Train Loss:  0.436 | Train Accuracy:  0.780 | Val Loss:  0.260 | Val Accuracy:  0.946


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 3 | Train Loss:  0.177 | Train Accuracy:  0.964 | Val Loss:  0.118 | Val Accuracy:  0.982


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 4 | Train Loss:  0.088 | Train Accuracy:  0.981 | Val Loss:  0.072 | Val Accuracy:  0.982


$$
\\
$$

In [31]:
# train(model2, df_train, df_val, tokenizer2, LR, EPOCHS)

$$
\\
$$

In [32]:
train(model3, df_train, df_val, tokenizer3, LR, EPOCHS)

100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 1 | Train Loss:  0.744 | Train Accuracy:  0.387 | Val Loss:  0.616 | Val Accuracy:  0.689


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 2 | Train Loss:  0.455 | Train Accuracy:  0.873 | Val Loss:  0.304 | Val Accuracy:  0.964


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 3 | Train Loss:  0.230 | Train Accuracy:  0.978 | Val Loss:  0.160 | Val Accuracy:  0.977


100%|██████████| 890/890 [03:07<00:00,  4.75it/s]


Epochs: 4 | Train Loss:  0.128 | Train Accuracy:  0.983 | Val Loss:  0.096 | Val Accuracy:  0.986


$$
\\
$$

### Evaluation

In [33]:
evaluate(model1, df_test, tokenizer1)

Test Accuracy:  0.982


$$
\\
$$

In [34]:
# evaluate(model2, df_test, tokenizer2)

$$
\\
$$

In [35]:
evaluate(model3, df_test, tokenizer3)

Test Accuracy:  0.991
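
Since the test set holds only 223 articles, it can help to translate the reported accuracies back into counts of misclassified articles. The sketch below does this for the values printed above; the `reported` dictionary simply collects those outputs, and rounding to the nearest integer is an assumption that holds as long as the printed values are accurate to three decimals.

```python
n_test = 223  # len(df_test)

# Test accuracies printed above, keyed by (epochs, model name).
reported = {
    (2, 'bert-base-cased'): 0.964,
    (2, 'bert-base-uncased'): 0.879,
    (3, 'bert-base-cased'): 0.982,
    (3, 'bert-base-uncased'): 0.973,
    (4, 'bert-base-cased'): 0.982,
    (4, 'bert-base-uncased'): 0.991,
}

# Recover the number of misclassified test articles for each run.
errors = {}
for key, accuracy in reported.items():
    correct = round(accuracy * n_test)
    errors[key] = n_test - correct

for (epochs, name), wrong in errors.items():
    print(f"epochs={epochs} {name}: {wrong} of {n_test} misclassified")
```

At 4 epochs the two base models differ by only two test articles (4 versus 2 errors), so small gaps between runs should be read with caution.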
