# Problem 4: Natural Language Inference with BERT Model

## Problem Description

In this problem, we will be exploring the field of Natural Language Inference (NLI) using a BERT (Bidirectional Encoder Representations from Transformers) Model. NLI is a subfield of natural language processing (NLP) that focuses on determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".

The BERT model, a transformer-based machine learning technique for NLP tasks, is designed to understand the context of each word in a sentence, rather than just the individual words. This makes it particularly well-suited for tasks like NLI.

Our goal is to compare the performance of the BERT model on the NLI task.

## Requirements

1. **Load and Preprocess the Data**: The first step is to load the Stanford SNLI (Stanford Natural Language Inference) dataset from the provided link: [Stanford SNLI Dataset](https://nlp.stanford.edu/projects/snli/snli_1.0.zip). The SNLI dataset is a collection of 570k human-written English sentence pairs manually annotated for balanced classification with the labels entailment, contradiction, and neutral. After loading, these files should be preprocessed to extract the necessary features for our BERT model. This preprocessing might involve steps such as tokenization (breaking down the text into individual words or tokens) or converting the text into input features suitable for BERT.

For faster training time, you can sample a subset of the data (100k instances) for training). You can also use the full dataset if you have the computational resources.

2. **Model**: Implement a BERT model for this task. The model should take as input the features extracted from the text and output the predicted category (entailment, contradiction, or neutral) for each instance.

3. **Evaluation**: Evaluate the performance of the BERT model. Use appropriate metrics for this comparison, such as accuracy, precision, recall, and F1 score.

4. **Report Your Results**: After evaluating your model, report the results. This should include the performance of your model on the test set. Discuss any significant findings or observations from your results. For example, how well did the model perform? Were there any categories that the model struggled with more than others? What might be some reasons for this?

In [1]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
import lightning as pl
from torchmetrics import Accuracy
from transformers import AutoModel, AutoTokenizer, DataCollatorForLanguageModeling
import pandas as pd
# from zipfile import ZipFile
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

In [2]:
max_length = 512
batch_size = 256
model_name = 'MoritzLaurer/deberta-v3-large-zeroshot-v2.0'

file_name = "Data/snli_1.0.zip"

with ZipFile(file_name, 'r') as zip:
    zip.printdir()
    print('Extracting all the files to the Data folder...')
    zip.extractall('Data')
    print('Done!')

## Preparing Inputs for BERT

Once we have our dataset, we need to prepare the inputs for the BERT model. BERT requires three types of inputs:

1. **Tokens Index**: This is the main input to BERT and contains the indexes of the sequence tokens.

2. **Attention Mask**: This helps the model distinguish between useful tokens and padding that is done during batch preparation. It is a sequence of 1’s with the same length as the input tokens.

3. **Token Type IDs**: These help the model know which token belongs to which sentence. For tokens of the first sentence in the input, token type IDs contain 0, and for the second sentence tokens, they contain 1.

For example, consider the following two sentences: "Man is wearing blue jeans." and "Man is wearing red jeans.". The input tokens, attention mask, and token type IDs would look like this:

```python
Input tokens: [ '[CLS]', 'Man', 'is', 'wearing', 'blue', 'jeans', '.', '[SEP]', 'Man', 'is', 'wearing', 'red', 'jeans', '.', '[SEP]' ]
Attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Token type ids: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

In the next steps, we will prepare these inputs for our SNLI dataset.
```

In [3]:
# Change directory to the root of the project
import os
os.chdir(os.path.dirname("F:/IT/DataScience/NLP/FinalExam/"))

# 1. Load the data

In [4]:
# Read the data
df_train = pd.read_csv("snli_1.0/snli_1.0/snli_1.0_train.txt", sep="\t")
df_val = pd.read_csv("snli_1.0/snli_1.0/snli_1.0_dev.txt", sep="\t")
df_test = pd.read_csv("snli_1.0/snli_1.0/snli_1.0_test.txt", sep="\t")

In [5]:
# Let's take a look at the data
df_train.head()

Unnamed: 0,gold_label,sentence1_binary_parse,sentence2_binary_parse,sentence1_parse,sentence2_parse,sentence1,sentence2,captionID,pairID,label1,label2,label3,label4,label5
0,neutral,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,A person is training his horse for a competition.,3416050480.jpg#4,3416050480.jpg#4r1n,neutral,,,,
1,contradiction,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is at a diner, ordering an omelette.",3416050480.jpg#4,3416050480.jpg#4r1c,contradiction,,,,
2,entailment,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,"( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is outdoors, on a horse.",3416050480.jpg#4,3416050480.jpg#4r1e,entailment,,,,
3,neutral,( Children ( ( ( smiling and ) waving ) ( at c...,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...,Children smiling and waving at camera,They are smiling at their parents,2267923837.jpg#2,2267923837.jpg#2r1n,neutral,,,,
4,entailment,( Children ( ( ( smiling and ) waving ) ( at c...,( There ( ( are children ) present ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...,Children smiling and waving at camera,There are children present,2267923837.jpg#2,2267923837.jpg#2r1e,entailment,,,,


In [6]:
# We only keep the necessary columns
df_train = df_train[["gold_label", "sentence1", "sentence2"]]
df_val = df_val[["gold_label", "sentence1", "sentence2"]]
df_test = df_test[["gold_label", "sentence1", "sentence2"]]

In [7]:
# Convert the labels to integers
df_train["gold_label"] = df_train["gold_label"].map({"neutral": 0, "entailment": 1, "contradiction": 2})
df_val["gold_label"] = df_val["gold_label"].map({"neutral": 0, "entailment": 1, "contradiction": 2})
df_test["gold_label"] = df_test["gold_label"].map({"neutral": 0, "entailment": 1, "contradiction": 2})

In [8]:
# Remove the rows with missing values
df_train = df_train.dropna().reset_index(drop=True)
df_val = df_val.dropna().reset_index(drop=True)
df_test = df_test.dropna().reset_index(drop=True)

In [9]:
# Shape of the data
print(f'Train shape: {df_train.shape}')
print(f'Val shape: {df_val.shape}')
print(f'Test shape: {df_test.shape}')

Train shape: (549361, 3)
Val shape: (9842, 3)
Test shape: (9824, 3)


In [10]:
# In order to train in a reasonable time, we only keep a subset of the data
df_train = df_train.sample(50000)

In [11]:
# Because we have 2 sentences, we need to concatenate them
df_train["text"] = df_train["sentence1"] + " [SEP] " + df_train["sentence2"]
df_val["text"] = df_val["sentence1"] + " [SEP] " + df_val["sentence2"]
df_test["text"] = df_test["sentence1"] + " [SEP] " + df_test["sentence2"]

In [12]:
# Drop the unnecessary columns
df_train = df_train[["text", "gold_label"]]
df_val = df_val[["text", "gold_label"]]
df_test = df_test[["text", "gold_label"]]
df_train.head()

Unnamed: 0,text,gold_label
380513,A surfer in a black wetsuit holding his balanc...,0.0
389235,"Three women are dressed in African clothes, an...",0.0
212368,A lady is in a wheelchair on the corner of a s...,0.0
365454,Five ballet dancers caught mid jump in a danci...,2.0
449198,Two men in red and blue T-Shirts are playing s...,2.0


# 2. Build BERT model

In [13]:
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

class DataModule(pl.LightningDataModule):
    def __init__(self, train_df, test_df, batch_size=16, max_len=512):
        super().__init__()
        self.save_hyperparameters(ignore=['train_df', 'test_df'])
        self.train_df = train_df
        self.test_df = test_df
        self.batch_size = batch_size
        self.max_len = max_len

        self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            self.train_dataset = self._create_dataset(self.train_df)
            self.val_dataset = self._create_dataset(self.test_df)
        if stage == 'test' or stage is None:
            self.test_dataset = self._create_dataset(self.test_df)

    def _create_dataset(self, df):
        texts = df['text'].tolist()
        labels = df['gold_label'].tolist()
        
        encodings = self.tokenizer(texts, truncation=True, padding=True, max_length=self.max_len)
        
        return TextDataset(encodings, labels)

    def _collate_fn(self, batch):
        input_ids = torch.stack([item['input_ids'] for item in batch]).long()
        attention_mask = torch.stack([item['attention_mask'] for item in batch]).long()
        labels = torch.stack([item['labels'] for item in batch]).long()
        return input_ids, attention_mask, labels

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset, 
            batch_size=self.batch_size, 
            collate_fn=self._collate_fn,
            shuffle=True,
            num_workers=0, 
            pin_memory=True
        )
    
    def val_dataloader(self):
        return DataLoader(
            self.val_dataset, 
            batch_size=self.batch_size, 
            collate_fn=self._collate_fn,
            num_workers=0, 
            pin_memory=True
        )
    
    def test_dataloader(self):
        return DataLoader(
            self.test_dataset, 
            batch_size=self.batch_size, 
            collate_fn=self._collate_fn,
            num_workers=0, 
            pin_memory=True
        )

In [14]:
# Initialize the data module
data_module = DataModule(df_train, df_test, batch_size=batch_size, max_len=max_length)

tokenizer_config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/970 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

In [15]:
class Classifier(pl.LightningModule):
    def __init__(self, max_seq_len=512, batch_size=16, learning_rate=2e-5):
        super().__init__()
        
        self.save_hyperparameters()
        
        self.criterion = nn.CrossEntropyLoss()
        self.accuracy = Accuracy(task='multiclass', num_classes=3)
        
        self.pretrained_model = AutoModel.from_pretrained(model_name)
        
        for param in self.pretrained_model.parameters():
            param.requires_grad = False
            
        self.fc = nn.Sequential(
            nn.Linear(self.pretrained_model.config.hidden_size, 2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3)
        )
        
    def forward(self, input_ids, attention_mask):
        output = self.pretrained_model(input_ids, attention_mask)
        output = output.last_hidden_state[:, 0, :]
        output = self.fc(output)
        return output
    
    def _shared_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        
        # Ensure input tensors are of type long
        input_ids = input_ids.long()
        attention_mask = attention_mask.long()
        labels = labels.long()
        
        output = self(input_ids, attention_mask)
        
        # Ensure output is float32 for loss calculation
        output = output.float()
        
        loss = self.criterion(output, labels)
        acc = self.accuracy(output, labels)
        
        return loss, acc
    
    def training_step(self, batch, batch_idx):
        loss, acc = self._shared_step(batch, batch_idx)
        self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        self.log('train_acc', acc, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        loss, acc = self._shared_step(batch, batch_idx)
        self.log('val_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        self.log('val_acc', acc, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def test_step(self, batch, batch_idx):
        loss, acc = self._shared_step(batch, batch_idx)
        self.log('test_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        self.log('test_acc', acc, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.learning_rate)

In [16]:
# Initialize the model
model = Classifier(max_seq_len=max_length, batch_size=batch_size, learning_rate=2e-5)

config.json:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/870M [00:00<?, ?B/s]

In [17]:
# Early stopping
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    verbose=True,
)

In [18]:
# Model checkpoint
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    dirpath='checkpoints',
    filename='bertxlm-{epoch:02d}-{val_loss:.2f}',
    save_top_k=3,
    mode='min',
)

In [19]:
# Define the trainer
trainer = pl.Trainer(
    devices=-1,
    max_epochs=20,
    callbacks=[early_stopping, checkpoint_callback],
    precision='16-mixed'
)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [20]:
# Train the model
trainer.fit(model, data_module)

You are using a CUDA device ('NVIDIA GeForce RTX 3070') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
f:\IT\qkenv39\lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:654: Checkpoint directory F:\IT\DataScience\NLP\FinalExam\checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type               | Params | Mode 
----------------------------------------------------------------
0 | criterion        | CrossEntropyLoss   | 0      | train
1 | accuracy         | MulticlassAccuracy | 0      | train
2 | pretrained_model | DebertaV2Model     | 434 M  | eval 
3 | fc               | Sequential         | 4.2 M  | train
----------------------------------------------------------

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

f:\IT\qkenv39\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
f:\IT\qkenv39\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved. New best score: 0.330


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.036 >= min_delta = 0.0. New best score: 0.294


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.014 >= min_delta = 0.0. New best score: 0.280


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.004 >= min_delta = 0.0. New best score: 0.275


Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.002 >= min_delta = 0.0. New best score: 0.273


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.003 >= min_delta = 0.0. New best score: 0.270


Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.000 >= min_delta = 0.0. New best score: 0.270


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.005 >= min_delta = 0.0. New best score: 0.265


Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.000 >= min_delta = 0.0. New best score: 0.265


Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.002 >= min_delta = 0.0. New best score: 0.263


Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.002 >= min_delta = 0.0. New best score: 0.261


Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.001 >= min_delta = 0.0. New best score: 0.260


Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=20` reached.


# 3. Test the data

In [21]:
# Load the best model
model = Classifier.load_from_checkpoint(checkpoint_callback.best_model_path)

In [22]:
# Test the model
trainer.test(model, data_module)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
f:\IT\qkenv39\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


Testing: |          | 0/? [00:00<?, ?it/s]

[{'test_loss_epoch': 0.25981611013412476,
  'test_acc_epoch': 0.9072679281234741}]

In [28]:
# Save the model
torch.save(model.state_dict(), 'deberta_v3_model.pth')

# 4. Conclusion
1. Độ chính xác (Accuracy):
- Mô hình đạt được độ chính xác 90.73% trên tập test (test_acc_epoch: 0.9072).
- Đây là một kết quả khá tốt, cho thấy mô hình có khả năng phân loại chính xác trong hơn 90% trường hợp.
2. Mất mát (Loss):
- Giá trị loss trên tập test là 0.2598 (test_loss_epoch: 0.25981).
- Loss này tương đối thấp, cho thấy mô hình đã hội tụ khá tốt và có khả năng dự đoán với độ tin cậy cao.
3. Đánh giá tổng thể:
- Kết quả này cho thấy mô hình DeBERTa-v3 đã được huấn luyện khá hiệu quả cho bài toán NLI.
- Độ chính xác trên 90% thường được coi là một kết quả khá tốt.