# A4: Do You AGREE

## Task 1: Training BERT from Scratch

## Acknowledgments and Data Credits

This project uses the following open-source datasets and libraries for training and evaluation:

### 1. Pre-training Data (BERT from Scratch)
* **BookCorpus & WikiText-103**: Used for the initial self-supervised pre-training of the BERT model using Masked Language Modeling (MLM).
    * *Reference*: Zhu, Y., et al. (2015). "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books."
    * *Source*: Accessed via the Hugging Face `datasets` library.
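
Masked Language Modeling hides a random subset of tokens and trains the model to recover them. The sketch below follows the same logic as this notebook's training loop (15% Bernoulli masking over non-special tokens, labels set to -1 everywhere else); the function name `mask_tokens` and the toy IDs are hypothetical. It always substitutes `[MASK]`, as the loop does, whereas the original BERT recipe replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged.

```python
import torch


def mask_tokens(input_ids, special_ids, mask_id, mask_prob=0.15):
    """Return (masked_inputs, labels) for masked language modeling."""
    input_ids = input_ids.clone()   # leave the caller's tensor untouched
    labels = input_ids.clone()

    # Candidates are all positions that do not hold a special token
    candidates = torch.ones_like(input_ids, dtype=torch.bool)
    for token_id in special_ids:
        candidates &= input_ids != token_id

    # Sample each position independently, then keep only valid candidates
    probs = torch.full(input_ids.shape, mask_prob, device=input_ids.device)
    selected = torch.bernoulli(probs).bool() & candidates

    labels[~selected] = -1          # skipped by CrossEntropyLoss(ignore_index=-1)
    input_ids[selected] = mask_id
    return input_ids, labels


torch.manual_seed(0)
# IDs: 0=[PAD], 1=[CLS], 2=[SEP], 3=[MASK]; everything >= 5 is a regular word
batch = torch.tensor([[1, 5, 6, 7, 8, 9, 2, 0],
                      [1, 10, 11, 12, 2, 0, 0, 0]])
masked, labels = mask_tokens(batch, special_ids=(0, 1, 2), mask_id=3, mask_prob=0.5)
print(masked)
print(labels)
```

The demo uses `mask_prob=0.5` only so that the tiny batch is likely to contain a few masked positions; the training loop uses 0.15.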

### 2. Fine-tuning Data (Sentence-BERT)
* **The Stanford Natural Language Inference (SNLI) Corpus**: Used for fine-tuning the Siamese Network architecture to determine sentence relationships (Entailment, Neutral, Contradiction).
    * *Reference*: Bowman, S., et al. (2015). "A large annotated corpus for learning natural language inference."
    * *Source*: [The Stanford NLP Group](https://nlp.stanford.edu/projects/snli/)
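
The Siamese setup described above can be sketched as follows. This is a minimal illustration, assuming mean pooling over token states and the softmax classification objective over `[u; v; |u - v|]` from the Sentence-BERT paper; the names `mean_pool` and `SiameseNLIHead` and all tensor sizes are hypothetical and not taken from this notebook.

```python
import torch
import torch.nn as nn


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)          # avoid division by zero
    return summed / counts


class SiameseNLIHead(nn.Module):
    """Hypothetical 3-way classifier over [u; v; |u - v|]."""

    def __init__(self, hidden_size, num_labels=3):
        super().__init__()
        self.classifier = nn.Linear(hidden_size * 3, num_labels)

    def forward(self, u, v):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)              # unnormalized logits


# Toy tensors standing in for encoder outputs (batch=2, seq=4, hidden=8)
torch.manual_seed(0)
premise_states = torch.randn(2, 4, 8)
hypothesis_states = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

u = mean_pool(premise_states, mask)
v = mean_pool(hypothesis_states, mask)
logits = SiameseNLIHead(hidden_size=8)(u, v)
print(logits.shape)  # torch.Size([2, 3])
```

Both sentences pass through the same encoder weights; only the small linear head is specific to the three-way NLI task.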

### 3. Software and Frameworks
* **PyTorch**: Deep learning framework used for building the transformer architecture.
* **Hugging Face Transformers**: Provides the `BertTokenizer` and the `get_cosine_schedule_with_warmup` learning-rate scheduler; datasets are loaded with the companion `datasets` library.
* **Scikit-learn**: Used for generating performance metrics and classification reports.

In [1]:
import math
import numpy as np
import re
import logging
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from datasets import load_dataset
from tqdm.auto import tqdm
import os
import json
from datetime import datetime
from transformers import get_cosine_schedule_with_warmup

In [None]:
# Configure logging to track training progress in both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('bert_training.log'), # Saves logs to local disk
        logging.StreamHandler()                   # Prints logs to terminal/notebook
    ]
)
logger = logging.getLogger(__name__)

def get_free_gpu():
    """Identify and return the GPU with the highest available VRAM."""
    if not torch.cuda.is_available():
        return torch.device('cpu') # Fallback to CPU if no NVIDIA GPU is found

    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device('cpu')

    max_free_memory = 0
    selected_gpu = 0

    # Iterate through all available GPUs to find the one with the most free space
    for gpu_id in range(n_gpus):
        try:
            # mem_get_info reports device-wide free memory, including memory held by
            # other processes (memory_allocated only sees this process's tensors)
            free_memory, _total_memory = torch.cuda.mem_get_info(gpu_id)
            if free_memory > max_free_memory:
                max_free_memory = free_memory
                selected_gpu = gpu_id
        except Exception:
            continue

    device = torch.device(f'cuda:{selected_gpu}')
    logger.info(f"Selected GPU {selected_gpu} with {max_free_memory/1024/1024:.2f}MB free memory")
    return device

In [3]:
# Set device and seeds for reproducibility
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
device = get_free_gpu()
logger.info(f"Using device: {device}")

### Model configuration
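
To get a feel for the model size implied by the configuration in this section, the parameter count can be worked out by hand. The sketch below assumes the layer layout used in this notebook (separate Q/K/V projections with no attention output projection, a two-layer feed-forward block, two LayerNorms per block, plus the pooler and the MLM head). The function names and the example vocabulary size of 30,000 are illustrative, since the real vocabulary size is only known after preprocessing.

```python
def encoder_layer_params(hidden, intermediate):
    """Parameters in one encoder block as laid out in this notebook."""
    qkv = 3 * (hidden * hidden + hidden)            # query, key, value projections
    ffn_in = hidden * intermediate + intermediate   # intermediate linear layer
    ffn_out = intermediate * hidden + hidden        # output linear layer
    layer_norms = 2 * (2 * hidden)                  # weight + bias, two LayerNorms
    return qkv + ffn_in + ffn_out + layer_norms


def total_params(vocab_size, hidden=256, layers=6, intermediate=1024,
                 max_positions=128, type_vocab=2):
    """Total parameters: embeddings + encoder + pooler + MLM head."""
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    encoder = layers * encoder_layer_params(hidden, intermediate)
    pooler = hidden * hidden + hidden
    mlm_head = hidden * vocab_size + vocab_size
    return embeddings + encoder + pooler + mlm_head


print(encoder_layer_params(256, 1024))   # 723968
print(total_params(vocab_size=30_000))   # 19833392
```

With a vocabulary of that order, most parameters sit in the word embedding table and the MLM head rather than in the six encoder blocks, so the vocabulary built during preprocessing largely determines the memory footprint.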

In [None]:
class BertConfig:
    def __init__(self):
        # Model architecture and dimensionality settings
        self.vocab_size = None  # Set dynamically after processing text
        self.hidden_size = 256
        self.num_hidden_layers = 6
        self.num_attention_heads = 8
        self.intermediate_size = 1024

        # Dropout and normalization parameters
        self.hidden_dropout_prob = 0.1
        self.attention_probs_dropout_prob = 0.1
        self.layer_norm_eps = 1e-12

        # Sequence and token type constraints
        self.max_position_embeddings = 128
        self.max_len = 128
        self.type_vocab_size = 2
        self.pad_token_id = 0

        # Special token ID assignments
        self.mask_token_id = 3
        self.cls_token_id = 1
        self.sep_token_id = 2

        # Optimization and training hyper-parameters
        self.learning_rate = 1e-4
        self.batch_size = 64
        self.gradient_accumulation_steps = 4
        self.weight_decay = 0.01
        self.adam_epsilon = 1e-8
        self.warmup_ratio = 0.1

class BertLayerNorm(nn.Module):
    """Custom Layer Normalization for feature standardization."""
    def __init__(self, hidden_size, eps=1e-12):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        # Calculate mean and variance across the last dimension
        mean = x.mean(-1, keepdim=True)
        variance = (x - mean).pow(2).mean(-1, keepdim=True)
        # Apply normalization and learnable affine transformation
        x = (x - mean) / torch.sqrt(variance + self.variance_epsilon)
        return self.weight * x + self.bias

class BertEmbeddings(nn.Module):
    """Combines word, position, and token type embeddings."""
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        seq_length = input_ids.size(1)
        # Generate position indices (0, 1, ..., seq_len-1)
        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # Sum token, position, and segment embeddings
        words_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = words_embeddings + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

class BertSelfAttention(nn.Module):
    """Multi-head self-attention mechanism implementation."""
    def __init__(self, config):
        super().__init__()
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.hidden_size // config.num_attention_heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        # Linear projections for Query, Key, and Value
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self._init_weights()

    def _init_weights(self):
        # Standard BERT weight initialization
        for module in [self.query, self.key, self.value]:
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()

    def transpose_for_scores(self, x):
        # Reshape for multi-head parallel computation
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask=None):
        query_layer = self.transpose_for_scores(self.query(hidden_states))
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))

        # Scaled dot-product attention calculation
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)

        # Apply attention mask to ignore specific tokens (padding)
        if attention_mask is not None:
            attention_scores = attention_scores + attention_mask

        # Convert scores to probabilities
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        attention_probs = self.dropout(attention_probs)

        # Weighted sum of values and reshape back to original dimensions
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        return context_layer.view(*new_context_layer_shape)

class BertLayer(nn.Module):
    """One full Transformer encoder block."""
    def __init__(self, config):
        super().__init__()
        self.attention = BertSelfAttention(config)
        self.intermediate = nn.Linear(config.hidden_size, config.intermediate_size)
        self.output = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm1 = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.LayerNorm2 = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.activation = F.gelu

    def forward(self, hidden_states, attention_mask=None):
        # Sublayer 1: Multi-head attention with residual connection
        attention_output = self.attention(hidden_states, attention_mask)
        attention_output = self.dropout(attention_output)
        attention_output = self.LayerNorm1(attention_output + hidden_states)

        # Sublayer 2: Feed-forward network with residual connection
        intermediate_output = self.activation(self.intermediate(attention_output))
        layer_output = self.output(intermediate_output)
        layer_output = self.dropout(layer_output)
        return self.LayerNorm2(layer_output + attention_output)

class BertModel(nn.Module):
    """Core BERT model containing embeddings and encoder layers."""
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)
        self.pooler_activation = nn.Tanh()
        self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.gradient_checkpointing = False

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)

        # Prepare 4D attention mask for broadcasting
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        hidden_states = self.embeddings(input_ids, token_type_ids)

        # Iteratively process through each encoder layer
        for layer in self.encoder:
            if self.gradient_checkpointing and self.training:
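                # use_reentrant=False selects the non-reentrant implementation recommended by PyTorch
                hidden_states = torch.utils.checkpoint.checkpoint(
                    layer, hidden_states, extended_attention_mask, use_reentrant=False
                )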
            else:
                hidden_states = layer(hidden_states, extended_attention_mask)

        # Calculate loss if labels are provided (training mode)
        if masked_lm_labels is not None:
            prediction_scores = self.mlm_head(hidden_states)
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            return loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))

        return hidden_states

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True

def pad_sequence(tokens, max_len, pad_token):
    """Ensures input sequences match the required model length."""
    return tokens[:max_len] if len(tokens) > max_len else tokens + [pad_token] * (max_len - len(tokens))

def prepare_batch(texts, word2idx, config):
    """Tokenizes and formats a batch of text for model ingestion."""
    batch_input_ids = []
    for text in texts:
        tokens = text.split()
        # Map words to IDs and add special BERT boundary tokens
        token_ids = [word2idx.get(word, word2idx['[UNK]']) for word in tokens]
        token_ids = [word2idx['[CLS]']] + token_ids + [word2idx['[SEP]']]
        batch_input_ids.append(pad_sequence(token_ids, config.max_len, word2idx['[PAD]']))

    input_ids = torch.tensor(batch_input_ids).to(device)
    attention_mask = (input_ids != word2idx['[PAD]']).float()
    return input_ids, attention_mask

def load_and_preprocess_data():
    """Downloads BookCorpus and prepares the initial vocabulary."""
    dataset = load_dataset("rojagtap/bookcorpus", split='train[:150000]')
    texts = [text.lower() for text in dataset['text']]
    # Keep lowercase letters, digits, whitespace, and common punctuation; drop the rest.
    # (The exact set of punctuation kept here is an assumption.)
    texts = [re.sub(r"[^a-z0-9\s.,!?'-]", '', text) for text in texts]

    word_set = set()
    for text in texts:
        word_set.update(text.split())

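    # Build word-to-index mapping with special tokens; sorting keeps token IDs
    # stable across runs (set iteration order for strings is not deterministic)
    vocab = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]'] + sorted(word_set)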
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    return texts, word2idx, vocab

def save_model_and_config(model, config, epoch, loss, save_dir='model_checkpoints'):
    """Serializes model weights and configuration to disk."""
    os.makedirs(save_dir, exist_ok=True)

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    # Export state dictionary for loading later
    torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'loss': loss}, 
               os.path.join(save_dir, f'bert_epoch_{epoch}_{timestamp}.pt'))

    # Save hyper-parameters as a JSON file
    config_dict = {k: v for k, v in vars(config).items() if not k.startswith('__')}
    with open(os.path.join(save_dir, f'config_{timestamp}.json'), 'w') as f:
        json.dump(config_dict, f, indent=4)
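
To make the batching helpers concrete, here is a small pure-Python walk-through of what `prepare_batch` does to one sentence: whitespace tokenization, `[UNK]` fallback, `[CLS]`/`[SEP]` wrapping, padding, and the padding mask. The toy vocabulary and `max_len=8` are made up for illustration.

```python
def pad_sequence(tokens, max_len, pad_token):
    """Trim or right-pad a list of token IDs to exactly max_len."""
    if len(tokens) > max_len:
        return tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))


# Toy vocabulary (hypothetical); the first five entries mirror the special tokens
vocab = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]', 'the', 'cat', 'sat']
word2idx = {word: idx for idx, word in enumerate(vocab)}

text = "the cat sat down"
ids = [word2idx.get(w, word2idx['[UNK]']) for w in text.split()]   # 'down' -> [UNK]
ids = [word2idx['[CLS]']] + ids + [word2idx['[SEP]']]
ids = pad_sequence(ids, max_len=8, pad_token=word2idx['[PAD]'])
attention_mask = [int(i != word2idx['[PAD]']) for i in ids]

print(ids)             # [1, 5, 6, 7, 4, 2, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Note that truncation cuts from the right, so a sequence longer than `max_len` loses its trailing `[SEP]`.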

In [None]:
def main():
    logger.info("Starting BERT training from scratch")

    # Unified device-agnostic setup (GPU vs CPU)
    device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Initialize modernized GradScaler for mixed precision (FP16) training
    scaler = torch.amp.GradScaler(device_type, enabled=(device_type == 'cuda'))

    # Load and preprocess raw BookCorpus data
    texts, word2idx, vocab = load_and_preprocess_data()

    # Initialize hyper-parameters and update vocabulary size
    config = BertConfig()
    config.vocab_size = len(vocab)
    logger.info(f"Vocabulary size: {config.vocab_size}")

    # Initialize model and enable gradient checkpointing to save VRAM
    model = BertModel(config).to(device)
    model.enable_gradient_checkpointing()
    logger.info("Model initialized with gradient checkpointing")

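    # Apply weight decay exclusion for bias and LayerNorm parameters.
    # Matching on 'LayerNorm' also covers the LayerNorm1/LayerNorm2 modules in BertLayer,
    # whose names do not contain the substring 'LayerNorm.weight'.
    no_decay = ['bias', 'LayerNorm']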
    optimizer_grouped_parameters = [
        {
            'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            'weight_decay': 0.01
        },
        {
            'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0
        }
    ]
    optimizer = optim.AdamW(optimizer_grouped_parameters, lr=config.learning_rate, eps=1e-8)

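    # Learning rate scheduler with warmup and cosine decay.
    # The schedule must span every epoch the loop below runs (15); with a shorter
    # horizon the cosine multiplier climbs again once the planned steps are exceeded.
    num_epochs = 15
    batches_per_epoch = math.ceil(len(texts) / config.batch_size)
    num_training_steps = (batches_per_epoch // config.gradient_accumulation_steps) * num_epochs
    num_warmup_steps = int(num_training_steps * config.warmup_ratio)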
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        num_cycles=0.5,
    )

    logger.info("Starting training...")
    model.train()

    try:
        for epoch in range(15):
            total_loss = 0
            valid_loss_count = 0
            optimizer.zero_grad() 

            progress_bar = tqdm(range(0, len(texts), config.batch_size),
                              desc=f"Epoch {epoch+1}")

            for step, batch_start in enumerate(progress_bar):
                batch_texts = texts[batch_start:batch_start + config.batch_size]

                # Convert text to token IDs
                input_ids, attention_mask = prepare_batch(batch_texts, word2idx, config)

                # Masked Language Modeling (MLM) Logic
                masked_labels = input_ids.clone()
                special_tokens = {word2idx['[PAD]'], word2idx['[CLS]'], word2idx['[SEP]']}
                
                # Identify tokens that are not special tokens
                mask_candidates = torch.ones_like(input_ids, device=device).bool()
                for special_token in special_tokens:
                    mask_candidates &= (input_ids != special_token)

                # Apply 15% masking probability as per BERT standard
                mask_prob = torch.full(input_ids.shape, 0.15, device=device)
                mask = (torch.bernoulli(mask_prob).bool() & mask_candidates)

                # Set labels for unmasked tokens to -1 to ignore them in CrossEntropyLoss
                masked_labels[~mask] = -1 
                # Replace masked positions with the [MASK] token ID
                input_ids[mask] = word2idx['[MASK]']

                try:
                  # Mixed precision forward pass (autocast is only enabled on CUDA)
                  with torch.amp.autocast(device_type=device_type, dtype=torch.float16,
                                          enabled=(device_type == 'cuda')):
                    loss = model(input_ids, attention_mask=attention_mask, masked_lm_labels=masked_labels)
                    # Normalize loss to account for gradient accumulation
                    loss = loss / config.gradient_accumulation_steps

                  # Scale loss to prevent underflow during backpropagation
                  scaler.scale(loss).backward()

                  # Optimization step after accumulation is reached
                  if (step + 1) % config.gradient_accumulation_steps == 0:
                    scaler.unscale_(optimizer)
                    # Clip gradients to maintain training stability
                    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                    scaler.step(optimizer)
                    scaler.update()
                    scheduler.step()
                    optimizer.zero_grad()

                    # Loss bookkeeping
                    loss_value = loss.item() * config.gradient_accumulation_steps
                    if np.isfinite(loss_value):
                        total_loss += loss_value
                        valid_loss_count += 1

                    progress_bar.set_postfix({'loss': loss_value, 'lr': scheduler.get_last_lr()[0]})

                except RuntimeError as e:
                    # Graceful handling of OOM errors (skips the current batch)
                    if "out of memory" in str(e):
                        logger.warning("Out of memory in batch. Skipping.")
                        if torch.cuda.is_available():
                            torch.cuda.empty_cache()
                        optimizer.zero_grad()
                        continue
                    raise

                # Periodically clear GPU cache to minimize fragmentation
                if step % 100 == 0 and torch.cuda.is_available():
                    torch.cuda.empty_cache()

            # End of epoch summary and checkpointing
            avg_loss = total_loss / valid_loss_count if valid_loss_count > 0 else float('nan')
            logger.info(f"Epoch {epoch+1} completed. Avg Loss: {avg_loss:.4f}")
            save_model_and_config(model, config, epoch+1, avg_loss)

        logger.info("Training completed!")

    except RuntimeError as e:
        if "out of memory" in str(e):
            logger.error(f"Fatal GPU OOM: {e}")
        raise e

if __name__ == "__main__":
    main()

Output (progress-bar widgets omitted): download of `README.md` (21.0 B), `books_large_p1.txt` (2.52 GB), and `books_large_p2.txt` (2.10 GB); generation of a train split with 74,004,228 examples; training progress bars for epochs 1 to 15, each over 2344 batches.

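The learning-rate schedule used above is easier to reason about as a plain function of the optimizer step. The sketch below reimplements, as an assumption about how `get_cosine_schedule_with_warmup` behaves with `num_cycles=0.5`, the multiplier applied to the base learning rate: linear warmup followed by a half-cosine decay. The function name `lr_multiplier` and the step counts are illustrative.

```python
import math


def lr_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


# 10 warmup steps, 100 planned steps; step 145 lies beyond the planned horizon
for step in (0, 5, 10, 55, 100, 145):
    print(step, round(lr_multiplier(step, 10, 100), 3))
# multipliers: 0.0, 0.5, 1.0, 0.5, 0.0, 0.5
```

One consequence is visible in the last value: if training continues past `num_training_steps`, the cosine term starts rising again, so the planned step count should cover every epoch the loop actually runs.
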
## Task 2: Sentence Embedding with Sentence BERT

In [23]:
import os
import logging
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from datasets import load_dataset, concatenate_datasets
from transformers import BertTokenizer
from tqdm.auto import tqdm
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
import json
from datetime import datetime

In [24]:
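# Configure logging. force=True replaces the handlers installed by the Task 1 cell;
# without it, a second basicConfig call in the same kernel has no effect.
logging.basicConfig(
    level=logging.INFO,
    force=True,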
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('sbert_training.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Using device: {device}")

In [None]:
import math
import numpy as np
import re
import logging
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from datasets import load_dataset
from tqdm.auto import tqdm
import os
import json
from datetime import datetime
from transformers import get_cosine_schedule_with_warmup

class BertConfig:
    def __init__(self):
        # Model architecture
        self.vocab_size = None  # Set dynamically after processing text
        self.hidden_size = 256
        self.num_hidden_layers = 6
        self.num_attention_heads = 8
        self.intermediate_size = 1024

        # Dropout and normalization
        self.hidden_dropout_prob = 0.1
        self.attention_probs_dropout_prob = 0.1
        self.layer_norm_eps = 1e-12

        # Sequence parameters
        self.max_position_embeddings = 128
        self.max_len = 128
        self.type_vocab_size = 2
        self.pad_token_id = 0

        # Special tokens
        self.mask_token_id = 3
        self.cls_token_id = 1
        self.sep_token_id = 2

        # Training hyper-parameters
        self.learning_rate = 1e-4
        self.batch_size = 64
        self.gradient_accumulation_steps = 4
        self.weight_decay = 0.01
        self.adam_epsilon = 1e-8
        self.warmup_ratio = 0.1

class BertLayerNorm(nn.Module):
    """Custom Layer Normalization to ensure feature stability."""
    def __init__(self, hidden_size, eps=1e-12):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        variance = (x - mean).pow(2).mean(-1, keepdim=True)
        # Normalize and apply learnable scale (weight) and shift (bias)
        x = (x - mean) / torch.sqrt(variance + self.variance_epsilon)
        return self.weight * x + self.bias

class BertEmbeddings(nn.Module):
    """Constructs embeddings from word, position, and token_type IDs."""
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        seq_length = input_ids.size(1)
        # Generate position IDs if not provided (standard 0 to seq_len-1)
        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # Sum all three embedding types
        embeddings = (self.word_embeddings(input_ids) + 
                      self.position_embeddings(position_ids) + 
                      self.token_type_embeddings(token_type_ids))

        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

class BertSelfAttention(nn.Module):
    """Multi-head self-attention mechanism."""
    def __init__(self, config):
        super().__init__()
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.hidden_size // config.num_attention_heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self._init_weights()

    def _init_weights(self):
        # Normal(0, 0.02) initialization (the original BERT code uses a truncated normal)
        for module in [self.query, self.key, self.value]:
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()

    def transpose_for_scores(self, x):
        """Reshape tensor to (batch, heads, seq_len, head_size)."""
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask=None):
        # Project hidden states to Q, K, V
        query_layer = self.transpose_for_scores(self.query(hidden_states))
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))

        # Scaled dot-product attention
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)

        if attention_mask is not None:
            attention_scores = attention_scores + attention_mask

        # Normalize scores to probabilities
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        attention_probs = self.dropout(attention_probs)

        # Combine attention with values
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        return context_layer.view(*new_context_layer_shape)

class BertLayer(nn.Module):
    """A single Transformer block."""
    def __init__(self, config):
        super().__init__()
        self.attention = BertSelfAttention(config)
        self.intermediate = nn.Linear(config.hidden_size, config.intermediate_size)
        self.output = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm1 = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.LayerNorm2 = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.activation = F.gelu

    def forward(self, hidden_states, attention_mask=None):
        # Self-attention with residual connection and LayerNorm
        attention_output = self.attention(hidden_states, attention_mask)
        attention_output = self.dropout(attention_output)
        attention_output = self.LayerNorm1(attention_output + hidden_states)

        # Feed-forward network (Intermediate -> Activation -> Output)
        intermediate_output = self.activation(self.intermediate(attention_output))
        layer_output = self.output(intermediate_output)
        layer_output = self.dropout(layer_output)
        # Final residual connection and LayerNorm
        return self.LayerNorm2(layer_output + attention_output)

class BertModel(nn.Module):
    """Full BERT model for pre-training with MLM head."""
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)
        self.pooler_activation = nn.Tanh()
        self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.gradient_checkpointing = False

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)

        # Broadcast mask for multi-head attention
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        hidden_states = self.embeddings(input_ids, token_type_ids)

        # Pass through all transformer layers
        for layer in self.encoder:
            if self.gradient_checkpointing and self.training:
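                # use_reentrant=False selects the non-reentrant implementation recommended by PyTorch
                hidden_states = torch.utils.checkpoint.checkpoint(
                    layer, hidden_states, extended_attention_mask, use_reentrant=False
                )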
            else:
                hidden_states = layer(hidden_states, extended_attention_mask)

        # MLM loss using cross-entropy, ignoring padding/unmasked tokens
        if masked_lm_labels is not None:
            prediction_scores = self.mlm_head(hidden_states)
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            return loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))

        return hidden_states

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True

def pad_sequence(tokens, max_len, pad_token):
    """Trims or pads tokens to a fixed length."""
    return tokens[:max_len] if len(tokens) > max_len else tokens + [pad_token] * (max_len - len(tokens))

def prepare_batch(texts, word2idx, config):
    """Converts raw strings into padded tensors for training."""
    batch_input_ids = []
    for text in texts:
        tokens = text.split()
        token_ids = [word2idx.get(word, word2idx['[UNK]']) for word in tokens]
        token_ids = [word2idx['[CLS]']] + token_ids + [word2idx['[SEP]']]
        batch_input_ids.append(pad_sequence(token_ids, config.max_len, word2idx['[PAD]']))

    input_ids = torch.tensor(batch_input_ids).to(device)
    attention_mask = (input_ids != word2idx['[PAD]']).float()
    return input_ids, attention_mask

def load_and_preprocess_data():
    """Fetches dataset and builds vocabulary dictionary."""
    dataset = load_dataset("rojagtap/bookcorpus", split='train[:150000]')
    texts = [text.lower() for text in dataset['text']]
    # Strip punctuation and other non-alphanumeric characters, keeping whitespace for tokenization
    texts = [re.sub(r'[^a-z0-9\s]+', '', text) for text in texts]

    word_set = set()
    for text in texts:
        word_set.update(text.split())

    vocab = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]'] + list(word_set)
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    return texts, word2idx, vocab

def save_model_and_config(model, config, epoch, loss, save_dir='model_checkpoints'):
    """Exports model weights and configuration as JSON/PT files."""
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    # Save parameters for future loading
    torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'loss': loss}, 
               os.path.join(save_dir, f'bert_epoch_{epoch}_{timestamp}.pt'))

    config_dict = {k: v for k, v in vars(config).items() if not k.startswith('__')}
    with open(os.path.join(save_dir, f'config_{timestamp}.json'), 'w') as f:
        json.dump(config_dict, f, indent=4)
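
The `forward` method above turns the 0/1 `attention_mask` into an additive bias: `(1.0 - mask) * -10000.0` leaves real tokens at 0 and pushes padded positions to -10000 before the softmax. The following is a minimal, framework-free sketch of that effect. The helper names `additive_mask` and `softmax` and the example scores are illustrative assumptions, not part of the notebook.

```python
import math

def additive_mask(attention_mask, penalty=-10000.0):
    """Map 1 (real token) to 0.0 and 0 (padding) to the penalty, as in (1 - mask) * -10000."""
    return [(1.0 - m) * penalty for m in attention_mask]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for one query over four key positions;
# the last two positions are padding.
raw_scores = [2.0, 1.0, 3.0, 0.5]
mask = [1, 1, 0, 0]

masked_scores = [s + b for s, b in zip(raw_scores, additive_mask(mask))]
weights = softmax(masked_scores)
print([round(w, 4) for w in weights])  # → [0.7311, 0.2689, 0.0, 0.0]
```

Even though the third position had the highest raw score, the bias drives its weight to zero, so padding never contributes to the attention output.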

In [None]:
def main():
    # Load dataset subset (800 samples) for task efficiency
    datasets = load_datasets(num_samples=800)
    logger.info(f"Loaded {len(datasets['train'])} training samples and {len(datasets['validation'])} validation samples")

    # Initialize standard BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    vocab_size = len(tokenizer.vocab)
    logger.info(f"Tokenizer vocabulary size: {vocab_size}")

    # Configure minimal model architecture to save memory/compute
    config = BertConfig()
    config.vocab_size = vocab_size 
    config.batch_size = 2  
    config.hidden_size = 64  
    config.num_hidden_layers = 2  
    config.num_attention_heads = 4  
    config.gradient_accumulation_steps = 16  
    config.intermediate_size = 256  
    config.max_len = 32  
    config.attention_probs_dropout_prob = 0.1
    config.hidden_dropout_prob = 0.1

    # Instantiate the custom BERT backbone
    bert_model = BertModel(config)

    # Attempt to load weights from Task 1 pre-training checkpoints
    try:
        checkpoints = [f for f in os.listdir('model_checkpoints') if f.startswith('bert_epoch_')]
        if checkpoints:
            latest_checkpoint = max(checkpoints, key=lambda x: int(x.split('_')[2]))
            checkpoint_path = os.path.join('model_checkpoints', latest_checkpoint)

            checkpoint = torch.load(checkpoint_path, map_location='cpu')
            state_dict = checkpoint['model_state_dict'] if 'model_state_dict' in checkpoint else checkpoint

            # Filter state_dict to only include keys with matching shapes
            model_dict = bert_model.state_dict()
            state_dict = {k: v for k, v in state_dict.items()
                         if k in model_dict and v.shape == model_dict[k].shape}

            bert_model.load_state_dict(state_dict, strict=False)
            logger.info(f"Loaded compatible weights from {checkpoint_path}")
        else:
            logger.warning("No pre-trained BERT checkpoints found. Starting with random initialization.")
    except Exception as e:
        logger.warning(f"Failed to load pre-trained BERT weights: {e}")

    # Wrap BERT into Sentence-BERT (Siamese) architecture
    model = SentenceBERT(bert_model, hidden_size=config.hidden_size, config=config)

    # Enable gradient checkpointing to reduce VRAM usage
    if hasattr(model.bert, 'enable_gradient_checkpointing'):
        model.bert.enable_gradient_checkpointing()
        logger.info("Enabled gradient checkpointing")

    # Transfer model to the target device (GPU/CPU)
    model = model.to(device)
    torch.cuda.empty_cache()

    # Convert raw text to token IDs and create DataLoaders
    tokenized_datasets = preprocess_data(datasets, tokenizer, max_length=config.max_len)

    train_dataloader = DataLoader(
        tokenized_datasets['train'],
        batch_size=config.batch_size,
        shuffle=True,
        pin_memory=True,
        num_workers=0
    )

    val_dataloader = DataLoader(
        tokenized_datasets['validation'],
        batch_size=config.batch_size,
        pin_memory=True,
        num_workers=0
    )

    # Define parameters with and without weight decay (standard practice for LayerNorm/Bias)
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': config.weight_decay},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

    optimizer = optim.AdamW(optimizer_grouped_parameters, lr=config.learning_rate, eps=config.adam_epsilon)

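    # Set up classification loss and learning rate schedule
    num_epochs = 10
    criterion = nn.CrossEntropyLoss()
    # scheduler.step() runs once per batch, so the cosine schedule must span every optimizer step
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * len(train_dataloader))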

    try:
        for epoch in range(num_epochs):
            model.train()
            total_loss = 0
            epoch_cos_sims = []

            progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}")
            for step, batch in enumerate(progress_bar):
                optimizer.zero_grad()

                # Siamese forward pass: returns classification logits and cosine similarity
                outputs, cos_sim = model(
                    batch['premise_input_ids'].to(device),
                    batch['premise_attention_mask'].to(device),
                    batch['hypothesis_input_ids'].to(device),
                    batch['hypothesis_attention_mask'].to(device)
                )

                # Cross-entropy loss for NLI classification
                loss = criterion(outputs, batch['labels'].to(device))
                loss.backward()

                # Prevent exploding gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

                optimizer.step()
                scheduler.step()

                # Track metrics and clean up memory
                total_loss += loss.item()
                epoch_cos_sims.extend(cos_sim.detach().cpu().numpy())
                progress_bar.set_postfix({'loss': total_loss / (step + 1)})

                del outputs, loss, batch
                torch.cuda.empty_cache()

            # End of epoch evaluation and logging
            avg_loss = total_loss / len(train_dataloader)
            epoch_cos_sims = np.array(epoch_cos_sims)

            logger.info(f"Epoch {epoch+1} completed. Average loss: {avg_loss:.4f}")
            logger.info(f"Epoch {epoch+1} Cosine Similarities - Mean: {epoch_cos_sims.mean():.4f}")

            # Calculate validation accuracy and F1 metrics
            accuracy, report = evaluate_model(model, val_dataloader)
            logger.info(f"Validation accuracy: {float(accuracy):.4f}")
            logger.info(f"Validation classification report:\n{report}")

            # Persist model weights, config, and performance metrics
            metrics_dict = {
                'accuracy': float(accuracy),
                'loss': float(avg_loss),
                'classification_report': report,
                'cosine_similarity_mean': float(epoch_cos_sims.mean()),
                'cosine_similarity_std': float(epoch_cos_sims.std())
            }
            save_model(model, tokenizer, config, metrics_dict, output_dir='sbert_model')

    except KeyboardInterrupt:
        logger.info("Training interrupted by user")
    except Exception as e:
        logger.error(f"Error during training: {e}")
        raise
    else:
        # Report completion only when the loop finished without an interruption or error
        logger.info("Training completed!")

if __name__ == "__main__":
    main()
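
Because `scheduler.step()` runs once per batch in the loop above, the `T_max` argument of `CosineAnnealingLR` is measured in batches, not epochs. The sketch below evaluates the closed-form cosine annealing curve to show what different `T_max` values do. The base learning rate of `2e-5` is an assumed placeholder (the notebook reads `config.learning_rate`), and `cosine_annealing_lr` is a hypothetical helper rather than a PyTorch function.

```python
import math

def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed form: eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

base_lr = 2e-5            # assumed placeholder value
steps_per_epoch = 400     # batches per epoch in the run above (800 samples, batch size 2)
num_epochs = 10
total_steps = steps_per_epoch * num_epochs

# T_max=10 with per-batch stepping: the rate bottoms out after 10 batches,
# then climbs back to the base rate by batch 20, and keeps oscillating.
short = [cosine_annealing_lr(s, base_lr, t_max=10) for s in (0, 10, 20)]

# T_max=total_steps: one smooth decay across the entire run.
full = [cosine_annealing_lr(s, base_lr, t_max=total_steps)
        for s in (0, total_steps // 2, total_steps)]

print("T_max=10:  ", short)
print("T_max=4000:", full)
```

With per-batch stepping, a `T_max` equal to the total number of optimizer steps gives a single decay from the base rate to the minimum, while a much smaller value makes the rate cycle many times within each epoch.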


In [None]:
import torch
import json
import os
import pandas as pd
from sklearn.metrics import classification_report

# 1. Load the architecture hyper-parameters used during training
with open('sbert_model/config.json', 'r') as f:
    config_dict = json.load(f)

# 2. Reconstruct the BERT and SentenceBERT architecture
config = BertConfig()
# Map saved JSON values back to the BertConfig object attributes
for key, value in config_dict.items():
    setattr(config, key, value)

# Initialize the custom BERT backbone and the Siamese wrapper
bert_model = BertModel(config)
model = SentenceBERT(bert_base=bert_model, hidden_size=config.hidden_size)

# 3. Load the trained weights into the reconstructed model
# Map weights to the current device (CPU or GPU)
model.load_state_dict(torch.load('sbert_model/model.pt', map_location=device))
model.to(device)
# Set model to evaluation mode (disables dropout for inference)
model.eval()
logger.info("Model successfully loaded from sbert_model/ folder")

In [None]:
from torch.utils.data import DataLoader
from transformers import BertTokenizer

# 1. Initialize the same tokenizer and configuration used during training
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig()
config.batch_size = 2 # Use the same small batch size to manage memory
config.max_len = 32   # Ensure sequence length matches the training setup

# 2. Re-load the SNLI dataset (subset of 800 samples for evaluation)
datasets = load_datasets(num_samples=800)

# 3. Convert raw text pairs into token IDs and attention masks
tokenized_datasets = preprocess_data(datasets, tokenizer, max_length=config.max_len)

# 4. Create the DataLoader to feed batches into the model during evaluation
val_dataloader = DataLoader(
    tokenized_datasets['validation'],
    batch_size=config.batch_size,
    shuffle=False,      # Keep order consistent for metrics reporting
    pin_memory=True     # Speeds up data transfer to GPU
)

logger.info("val_dataloader created successfully in global scope.")

Map:   0%|          | 0/392702 [00:00<?, ? examples/s]

Map:   0%|          | 0/9815 [00:00<?, ? examples/s]

Map:   0%|          | 0/9832 [00:00<?, ? examples/s]

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

In [None]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score
import torch

def generate_final_report(model, dataloader, device):
    all_preds = []
    all_labels = []
    target_names = ['entailment', 'neutral', 'contradiction']

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            # Inference: Get classification logits (ignoring similarity score)
            logits, _ = model(
                batch['premise_input_ids'].to(device),
                batch['premise_attention_mask'].to(device),
                batch['hypothesis_input_ids'].to(device),
                batch['hypothesis_attention_mask'].to(device)
            )
            # Pick the class index with the highest logit
            preds = torch.argmax(logits, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch['labels'].cpu().numpy())

    # 1. Calculate high-level accuracy and detailed class-wise metrics
    acc = accuracy_score(all_labels, all_preds)
    report_dict = classification_report(
        all_labels,
        all_preds,
        labels=[0, 1, 2],
        target_names=target_names,
        output_dict=True,
        zero_division=0
    )

    # 2. Convert dictionary to DataFrame for table manipulation
    df = pd.DataFrame(report_dict).transpose()

    # Create a standalone row for the overall accuracy metric.
    # Empty cells use float NaN (not None) so the columns stay numeric when concatenated.
    accuracy_row = pd.DataFrame({
        'precision': [float('nan')],
        'recall': [float('nan')],
        'f1-score': [acc],
        'support': [df.loc['macro avg', 'support']]
    }, index=['accuracy'])

    # 3. Concatenate class metrics, overall accuracy, and averages into final order
    df_classes = df.loc[target_names]
    df_avgs = df.loc[['macro avg', 'weighted avg']]
    final_df = pd.concat([df_classes, accuracy_row, df_avgs])

    # 4. Apply professional styling for Jupyter display
    print("\nTABLE 1. Classification Report")

    # Format numbers to 2 decimal places and hide missing values
    styled_table = final_df.style.format({
        'precision': '{:.2f}',
        'recall': '{:.2f}',
        'f1-score': '{:.2f}',
        'support': '{:.0f}'
    }, na_rep='')

    # Visual improvement: Center and bold the table headers
    styled_table = styled_table.set_table_styles([
        {'selector': 'th', 'props': [('font-weight', 'bold'), ('text-align', 'center')]}
    ])

    display(styled_table)
    return final_df

# Execute evaluation and display Task 3 results
task3_report = generate_final_report(model, val_dataloader, device)


TABLE 1. Classification Report



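| | precision | recall | f1-score | support |
| :--- | :---: | :---: | :---: | :---: |
| entailment | 0.48 | 0.40 | 0.43 | 273 |
| neutral | 0.41 | 0.72 | 0.52 | 273 |
| contradiction | 0.28 | 0.11 | 0.16 | 246 |
| accuracy | | | 0.41 | 792 |
| macro avg | 0.39 | 0.41 | 0.37 | 792 |
| weighted avg | 0.39 | 0.42 | 0.38 | 792 |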
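
The two average rows of TABLE 1 follow from the per-class rows: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. As a sanity check, this sketch recomputes both from the rounded per-class values printed above. The `rows` dictionary and the helper names are illustrative, and because the inputs are already rounded, the last digit could in principle differ from scikit-learn's own figures.

```python
# Per-class (precision, recall, f1-score, support) copied from TABLE 1
rows = {
    "entailment":    (0.48, 0.40, 0.43, 273),
    "neutral":       (0.41, 0.72, 0.52, 273),
    "contradiction": (0.28, 0.11, 0.16, 246),
}
total_support = sum(r[3] for r in rows.values())

def macro_avg(i):
    """Unweighted mean of metric i over the classes."""
    return sum(r[i] for r in rows.values()) / len(rows)

def weighted_avg(i):
    """Mean of metric i weighted by each class's support."""
    return sum(r[i] * r[3] for r in rows.values()) / total_support

for name, i in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:9s} macro={macro_avg(i):.2f} weighted={weighted_avg(i):.2f}")
```

The gap between the per-class recalls (0.72 for neutral against 0.11 for contradiction) is what the averages hide: the model leans heavily toward predicting the neutral class.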


In [None]:
import os
from transformers import BertTokenizer

# 1. Directory setup for Task 4 web application
tokenizer_dir = 'sbert_model/tokenizer'
os.makedirs(tokenizer_dir, exist_ok=True)

# 2. Load the standard BERT tokenizer (BertTokenizer is already the slow, pure-Python implementation)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 3. Manually extract the vocabulary and sort by ID to maintain correct index mapping
vocab = tokenizer.get_vocab()
# Sort ensures that line 0 is [PAD], line 101 is [CLS], etc.
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])

vocab_file = os.path.join(tokenizer_dir, 'vocab.txt')

# Write tokens line-by-line to recreate the BERT vocab format
with open(vocab_file, 'w', encoding='utf-8') as f:
    for word, idx in sorted_vocab:
        f.write(word + '\n')

# 4. Final verification of file existence and token count
if os.path.exists(vocab_file):
    print(f"SUCCESS: Created vocab.txt at {vocab_file}")
    print(f"Total tokens written: {len(sorted_vocab)}")
else:
    print("CRITICAL ERROR: Could not write file. Check disk permissions.")

SUCCESS: Created vocab.txt at sbert_model/tokenizer/vocab.txt
Total tokens written: 30522
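
Because the file is written in ID order, the zero-based line number of each token is its ID. Below is a minimal sketch of reading such a file back into a lookup table, using a tiny stand-in vocabulary written to a temporary directory. The five-token `toy_vocab` and the `load_vocab` helper are assumptions for illustration; the real file has 30,522 lines.

```python
import os
import tempfile

def load_vocab(path):
    """Rebuild the token -> id mapping from a one-token-per-line vocab file."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}

# Stand-in vocabulary (the IDs here are not the real BERT IDs)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    # Write tokens sorted by ID, mirroring the export cell above
    with open(vocab_path, "w", encoding="utf-8") as f:
        for token, _ in sorted(toy_vocab.items(), key=lambda item: item[1]):
            f.write(token + "\n")
    restored = load_vocab(vocab_path)

print(restored == toy_vocab)  # → True
```

If the tokens were written in any other order, the restored IDs would no longer match the embedding rows the model was trained with, which is why the sort step matters.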


In [None]:
import json
import os
from transformers import BertTokenizer

# 1. Initialize directory for Task 4 web app tokenizer assets
tokenizer_dir = 'sbert_model/tokenizer'
os.makedirs(tokenizer_dir, exist_ok=True)

# 2. Load the standard tokenizer to access its pre-defined special tokens
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 3. Map special tokens to a dictionary for JSON serialization
# These tokens are essential for BERT to identify sentence boundaries and unknown words
special_tokens = {
    "unk_token": tokenizer.unk_token, # For out-of-vocabulary words
    "sep_token": tokenizer.sep_token, # Separator between premise and hypothesis
    "pad_token": tokenizer.pad_token, # Fills sequences to match max_len
    "cls_token": tokenizer.cls_token, # Represents the entire sequence for classification
    "mask_token": tokenizer.mask_token # Used during pre-training MLM
}

# 4. Save the map to a JSON file required by the Transformers library for local loading
file_path = os.path.join(tokenizer_dir, 'special_tokens_map.json')
with open(file_path, 'w', encoding='utf-8') as f:
    # Save with indentation for human-readability
    json.dump(special_tokens, f, indent=4)

# 5. Verify file creation and display mapped tokens
if os.path.exists(file_path):
    print(f"SUCCESS: Created special_tokens_map.json at {file_path}")
    print("Content:", special_tokens)
else:
    print("ERROR: Could not create the file.")

SUCCESS: Created special_tokens_map.json at sbert_model/tokenizer/special_tokens_map.json
Content: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}


# Task 3: Sentence-BERT Evaluation & Analysis

## 1. Performance Metrics (Classification Report)
The table below summarizes the performance of the fine-tuned Sentence-BERT model on the Natural Language Inference (NLI) task, as reported by the evaluation cell (TABLE 1). The metrics were calculated on a validation subset of 800 samples from the SNLI dataset; the per-class support totals 792.

| Category | Precision | Recall | F1-Score | Support |
| :--- | :---: | :---: | :---: | :---: |
| **Entailment** | 0.48 | 0.40 | 0.43 | 273 |
| **Neutral** | 0.41 | 0.72 | 0.52 | 273 |
| **Contradiction** | 0.28 | 0.11 | 0.16 | 246 |
| **Accuracy** | | | **0.41** | 792 |
| **Macro Avg** | 0.39 | 0.41 | 0.37 | 792 |
| **Weighted Avg** | 0.39 | 0.42 | 0.38 | 792 |

## 2. Discussion: Limitations, Challenges, and Improvements

### **Challenges Encountered**
* **Hardware Constraints (VRAM):** Training Transformer architectures locally put significant pressure on GPU memory. To prevent **Out-of-Memory (OOM)** errors, I implemented **Gradient Checkpointing** and **Mixed Precision (FP16)** training. In addition, **Gradient Accumulation** over 16 steps was configured to simulate a larger effective batch size without increasing peak memory usage.
* **Environment Conflicts:** The `transformers` library disabled its PyTorch backend because of a version incompatibility with Python 3.12. I worked around this in the final Web Application (Task 4) with a **manual tensor conversion** strategy, so the model could still perform inference.
* **Semantic Convergence:** Training from scratch with limited data meant the model initially struggled to distinguish between 'Neutral' and 'Entailment' labels, as these categories often share high lexical overlap.
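
To illustrate why gradient accumulation simulates a larger batch, the sketch below compares the gradient of a full batch with the sum of micro-batch gradients, each scaled by `1 / accumulation_steps`. It uses a one-parameter model and made-up data in plain Python, so it is an illustration of the principle rather than the notebook's training loop. In a PyTorch loop the same idea usually appears as dividing the loss by the number of accumulation steps before `backward()` and calling `optimizer.step()` only after the last micro-batch.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy data: eight hypothetical (x, y) pairs
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1),
        (5.0, 9.8), (6.0, 12.3), (7.0, 13.9), (8.0, 16.2)]
w = 0.5
micro_batch_size = 2
accumulation_steps = len(data) // micro_batch_size  # 4 micro-batches

# Reference: gradient computed over the full batch of 8
full_grad = mse_grad(w, data)

# Accumulation: scale each micro-batch gradient and sum them up;
# a single optimizer step would follow once every micro-batch is processed.
accumulated = 0.0
for i in range(accumulation_steps):
    micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    accumulated += mse_grad(w, micro) / accumulation_steps

print(full_grad, accumulated)
```

The two values agree up to floating-point error, while only one micro-batch of activations needs to be held in memory at a time.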

### **Limitations**
* **Reduced Dataset Size:** Due to computational time limits, only 800 samples were used for fine-tuning. This is a small fraction of the 550k+ samples in the full SNLI corpus, which naturally limits the model's F1-score and generalizability.
* **Model Depth:** The backbone was restricted to 2 encoder layers and 4 attention heads, with a hidden size of 64. While this allowed for faster training, it reduced the model's capacity to capture the deeper semantic dependencies that accurate NLI classification requires.

### **Proposed Improvements**
* **Advanced Loss Functions:** Implementing **Triplet Loss** (one of the objectives described in the SBERT paper) or **Multiple Negatives Ranking Loss** (available in the `sentence-transformers` library) would better optimize the vector space for sentence similarity than standard Softmax classification.
* **Transfer Learning:** Initializing the Siamese network with weights from a larger BERT model pre-trained on the full BookCorpus/WikiText datasets would give the encoder a much stronger linguistic baseline.
* **Hyperparameter Optimization:** Using automated tuning (for example with Optuna) for the `learning_rate` and `warmup_steps` could help the model converge to a better minimum during the fine-tuning stage.
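
As a rough illustration of the triplet objective mentioned in the first improvement, the sketch below computes `max(||a - p|| - ||a - n|| + margin, 0)` with Euclidean distance and a margin of 1. The two-dimensional vectors are made up for the example, and `triplet_loss` is a hypothetical helper rather than code from this notebook.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the negative is not at least `margin` farther than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
easy_negative = [3.0, 4.0]   # distance 5.0: already far enough away
hard_negative = [0.0, 1.5]   # distance 1.5: inside the margin

print(triplet_loss(anchor, positive, easy_negative))  # → 0.0
print(triplet_loss(anchor, positive, hard_negative))  # → 0.5
```

Only the hard negative produces a non-zero loss, so training effort goes to pairs the embedding space does not yet separate, which is the property that makes this objective attractive for similarity search.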