Where BERT is Still Used Today

1. Search & Retrieval (Vector Search, RAG Base Models)

- For generating dense embeddings (e.g., Sentence-Transformers, MiniLM).

- Stored in FAISS, Pinecone, Milvus for fast similarity search.

- Used in smaller LLM pipelines (retriever + generator architecture).

2. Enterprise-level NLP Tasks (Fast & Cost-Effective)

- Named Entity Recognition (NER)

- Sentiment Analysis

- Classification tasks (spam detection, intent classification)

- Extractive summarization using lightweight variants such as DistilBERT (encoder-only models select existing sentences; they do not generate new text).

3. Hybrid Pipelines with LLMs

- BERT embeddings for the retriever, then an LLM generates the answer (RAG architecture).

4. Multilingual NLP

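- XLM-R (a multilingual encoder in the BERT/RoBERTa family) is still a top choice when roughly 100 languages must be covered.

- Used for cross-lingual classification and search; translation itself needs an encoder-decoder model such as mT5.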

5. On-Device / Low-Latency Inference

- For mobile apps and edge devices where large hosted LLMs such as GPT or Claude cannot run locally.

- Quantized DistilBERT/MiniLM models for chatbots and offline NLP tasks.

| **Model**           | **Main Use-Cases**                                  | **Why Still Used**                    |
| ------------------- | --------------------------------------------------- | ------------------------------------- |
| BERT / DistilBERT   | NER, classification, embeddings                     | Small, fast, cheap inference          |
| RoBERTa / DeBERTa   | Classification, QA, summarization                   | High accuracy, Kaggle/enterprise use  |
| MPNet / MiniLM      | Vector search, semantic retrieval (RAG)             | Best for FAISS/Pinecone retrieval     |
| T5 / Flan-T5        | Summarization, translation, instruction tasks       | Lightweight text-to-text generation   |
| BART / Pegasus      | Abstractive summarization                           | Less resource-hungry than LLMs        |
| XLNet / Electra     | Classification, QA (legacy setups)                  | Still optimized for speed             |
| XLM-R / mT5 / LaBSE | Multilingual NLP, translation, cross-lingual search | 100+ language support, enterprise use |
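
The retriever step mentioned in items 1 and 3 can be sketched in a few lines. This is only an illustration of similarity search, not how FAISS, Pinecone, or Milvus work internally: the three-dimensional vectors are made-up stand-ins for real encoder embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy "embeddings" (assumed values, for illustration only).
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0
    [0.0, 1.0, 0.0],  # doc 1
    [0.7, 0.3, 0.1],  # doc 2
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))
```

In a real pipeline the brute-force loop would be replaced by a vector index, and the retrieved passages would be handed to the generator as context.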


In [None]:
# !pip install --upgrade datasets fsspec transformers

# text-classification

In [None]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

In [None]:
# For a harder, multi-class task, try one of these datasets instead:
# load_dataset("ag_news")
# load_dataset("dbpedia_14")

In [None]:
# Customer feedback classification (positive/negative/neutral)
# Support ticket intent detection (billing, technical, general)
# Email/topic categorization

In [None]:
from datasets import load_dataset
from transformers import BertTokenizer
# Load IMDB dataset and subset
dataset = load_dataset("imdb")

In [None]:
dataset

In [None]:
train_dataset = dataset["train"].select(range(1000))
test_dataset = dataset["test"].select(range(500))

In [None]:
# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")



##### The tokenizer reads the raw `text` column (each IMDB row contains one review).
##### Every review is padded to the same length (256 tokens).
##### Reviews longer than 256 tokens are truncated.
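
To make the padding and truncation rule concrete, here is a minimal sketch on plain lists of token IDs. It is not the tokenizer's own implementation: `pad_or_truncate` is a hypothetical helper, the IDs are toy values, and a real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]        # truncate long sequences
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding      # pad short sequences
    mask = mask + [0] * padding         # 0 marks padding
    return ids, mask

short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions.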

    


In [None]:
# Tokenization function
def tokenize_fn(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

In [None]:
# Apply tokenization + rename + format in a single flow
def preprocess(ds):
    ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])  # remove raw text (saves memory)
    ds = ds.rename_column("label", "labels")
    ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    return ds

In [None]:
train_dataset = preprocess(train_dataset)

In [None]:
test_dataset = preprocess(test_dataset)

In [None]:
# 3. Model load
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

In [None]:
# Inspect the encoder layers (bert-base has 12)
for layer in model.bert.encoder.layer:
    print(layer)


In [None]:
### Optional: fine-tune only the last few encoder layers and freeze the rest.
## The classifier head stays trainable because only model.bert is frozen.

# for param in model.bert.parameters():
#     param.requires_grad = False  # Freeze BERT encoder

## Unfreeze last 2 encoder layers

# for layer in model.bert.encoder.layer[-2:]:
#     for param in layer.parameters():
#         param.requires_grad = True

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./bert-finetuned-imdb",       # Directory to save model checkpoints and outputs
    num_train_epochs=1,                       # Number of training epochs (1 full pass over the training data)
    per_device_train_batch_size=8,            # Batch size per GPU/CPU device
    logging_dir="./logs",                     # Directory to store logs for visualization (e.g., with TensorBoard)
    learning_rate=2e-5,                       # Learning rate for the AdamW optimizer (typically low for BERT)
    weight_decay=0.01,                        # Weight decay (L2 regularization) to reduce overfitting
    report_to="none"                          # Disable reporting to external tracking tools (e.g., wandb, hub)
)

In [None]:
# 5. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

In [None]:
# 6. Train
trainer.train()

In [None]:
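# Note: report_to="none" above disables the TensorBoard callback, so ./logs stays empty
# unless training is rerun with report_to="tensorboard".
!tensorboard --logdir=./logs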

In [None]:
trainer.save_model("./bert-finetuned-imdb")

In [None]:
tokenizer.save_pretrained("./bert-finetuned-imdb")

In [None]:
# 7. Evaluate
metrics = trainer.evaluate()

In [None]:
print(metrics)
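
As written, the `Trainer` above is given no metric function, so `trainer.evaluate()` reports the evaluation loss and timing figures but no accuracy. A minimal sketch of a metric function follows; the toy logits and labels are made up so the function can be checked on its own.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) from the Trainer into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}

# Stand-alone check with made-up logits for four reviews.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]])
toy_labels = np.array([0, 1, 0, 0])
print(compute_metrics((toy_logits, toy_labels)))  # three of four correct
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.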

## Prediction

In [None]:
# Load from the same directory the model and tokenizer were saved to above
tokenizer = BertTokenizer.from_pretrained("./bert-finetuned-imdb")
model = BertForSequenceClassification.from_pretrained("./bert-finetuned-imdb")

In [None]:
from transformers import pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [None]:
# Predict
text = "This movie was amazing and I loved the acting!"
result = classifier(text)

In [None]:
print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.98}]; LABEL_1 = positive unless id2label is set in the config
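
A checkpoint fine-tuned without an `id2label` mapping in its config generally reports generic names such as `LABEL_0` and `LABEL_1`. The sketch below renames them after the fact; `rename_labels` and `LABEL_NAMES` are hypothetical names, and the sample output is made up.

```python
# Assumed mapping, following the IMDB convention 0 = negative, 1 = positive.
LABEL_NAMES = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}

def rename_labels(results, label_names=LABEL_NAMES):
    """Replace generic pipeline labels with readable names; unknown labels pass through."""
    return [
        {**item, "label": label_names.get(item["label"], item["label"])}
        for item in results
    ]

# Made-up pipeline output, used only to show the transformation.
sample = [{"label": "LABEL_1", "score": 0.98}]
print(rename_labels(sample))
```

An alternative is to pass `id2label` and `label2id` dictionaries to `from_pretrained` before training, so the names are stored in the saved config.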

### Pushing the Model to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from huggingface_hub import whoami
print(whoami())

In [None]:
tokenizer.push_to_hub("ganesh/my-bert-imdb2")

In [None]:
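# Trainer.push_to_hub() takes a commit message as its first argument, not a repo id,
# so push the model directly to the target repository instead.
model.push_to_hub("ganesh/my-bert-imdb2")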

# BERT Fine-Tuning on IMDB Sentiment Classification

This project fine-tunes a pre-trained BERT model (`bert-base-uncased`) on the **IMDB movie reviews dataset** to perform **binary sentiment classification** (positive or negative).

---

## Summary

### Task Type
**Text Classification (Binary)**  
Each IMDB review is labeled as:
- `0`: Negative
- `1`: Positive

### Goal
Fine-tune a general-purpose language model (BERT) to accurately classify movie reviews based on sentiment.

---

In [None]:
# from datasets import load_dataset
# from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding

# # 1. Load IMDB dataset (subset for speed)
# dataset = load_dataset("imdb")
# train_dataset = dataset["train"].select(range(1000))
# test_dataset = dataset["test"].select(range(500))

# # 2. Initialize tokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# # 3. Tokenization function (no fixed padding here)
# def tokenize_fn(examples):
#     return tokenizer(examples["text"], truncation=True, max_length=256)

# # 4. Preprocess dataset (map + rename + torch format)
# def preprocess(ds):
#     ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])  # remove raw text
#     ds = ds.rename_column("label", "labels")
#     ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
#     return ds

# train_dataset = preprocess(train_dataset)
# test_dataset = preprocess(test_dataset)

# # 5. Initialize model
# model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# # 6. Data collator (dynamic padding)
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# # 7. Training arguments
# training_args = TrainingArguments(
#     output_dir="./bert-finetuned-imdb",   # where to save the model
#     num_train_epochs=1,                   # train for 1 epoch
#     per_device_train_batch_size=8,        # training batch size
#     per_device_eval_batch_size=8,         # eval batch size
#     logging_dir="./logs",                 # logs for TensorBoard
#     logging_steps=50,                     # log every 50 steps
#     learning_rate=2e-5,                   # small LR for fine-tuning
#     weight_decay=0.01,                    # regularization
#     eval_steps=500,                       # evaluate every 500 steps (only applies when the evaluation strategy is "steps")
#     save_steps=500,                       # save checkpoint every 500 steps
#     save_total_limit=1,                   # keep only the latest checkpoint
#     report_to="none"                      # disable WandB or other reporting
# )


# # 8. Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=test_dataset,
#     data_collator=data_collator,  # dynamic padding here
# )

# # 9. Train
# trainer.train()


# Fine-Tuning on Multiple Problems

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    BertTokenizer,
    BertTokenizerFast,
    BertForSequenceClassification,
    BertForTokenClassification,
    BertForQuestionAnswering,
    get_linear_schedule_with_warmup,  # Linear warmup, then linear decay of the learning rate
)
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, f1_score
import numpy as np
from tqdm import tqdm
from torch.optim import AdamW  # Adam optimizer with decoupled weight decay

In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Explanation of Training Steps and Learning Rate Scheduling

### Dataset size = 800 samples

You have **800 training examples** in total.

---

### batch_size = 8 → 1 epoch = 800 / 8 = 100 batches

- Your batch size is 8.
- In one epoch, the model processes 800 samples in batches of 8.
- Therefore, one epoch consists of **100 batches**.

---

### epochs = 3

- The model will train over **3 full passes (epochs)** through the dataset.

---

### total_steps = 100 × 3 = 300

- Total number of training steps (optimizer updates) is:
  - Number of batches per epoch × number of epochs
  - 100 × 3 = **300 steps**

Each step corresponds to processing one batch and updating model weights.

---

### Warmup = 10% of total steps → 30 steps

- Warmup is a phase at the start of training where the learning rate increases gradually from zero to its maximum value.
- 10% of 300 steps is **30 warmup steps**.
- For these first 30 steps, the learning rate rises linearly.

---

### Remaining 270 steps → Learning rate linearly decreases

- After warmup, for the remaining **270 steps**, the learning rate decreases linearly.
- This helps the model converge smoothly towards the end of training.

---

# Summary

- Train on 800 samples with batch size 8 → 100 batches per epoch.
- Train for 3 epochs → total 300 steps.
- Learning rate warms up for first 30 steps.
- Learning rate then decays linearly for remaining 270 steps.
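
The arithmetic above can be checked with a short pure-Python sketch. `linear_warmup_decay` is a hypothetical helper that follows the linear warmup and linear decay shape described here; it is not the library scheduler itself, and the base learning rate of `2e-5` is taken from the training arguments used earlier.

```python
import math

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier in [0, 1]: linear warmup, then linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

num_samples, batch_size, epochs = 800, 8, 3
steps_per_epoch = math.ceil(num_samples / batch_size)  # 100 batches per epoch
total_steps = steps_per_epoch * epochs                 # 300 optimizer steps
warmup_steps = total_steps // 10                       # 30 warmup steps
base_lr = 2e-5

# Sample the schedule at the start, mid-warmup, the peak, mid-decay, and the end.
for step in (0, 15, 30, 165, 300):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The multiplier reaches 1.0 exactly at step 30 and falls back to 0.0 at step 300, so the peak learning rate is used for only a single step.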


# When to Use Hugging Face’s `.map()` + `.set_format()` vs Custom PyTorch Dataset Class

- **Hugging Face’s `.map()` + `.set_format()`**  
  Use this approach **if you already have your dataset in the Hugging Face Dataset format**.  
  It’s easy to preprocess and convert your data using these built-in methods.

- **Custom PyTorch Dataset Class**  
  Use this when:  
  - You **do not have your data in Hugging Face Dataset format** (for example, you just have plain Python lists like `train_texts` and `train_labels`).  
  - You need **custom preprocessing logic** that goes beyond simple transformations (like using a special tokenizer or additional data processing steps).  
  - You want more control over how data is accessed and transformed during training.

### Example analogy with a simple class:

```python
class Basket:
    def __init__(self, fruits):
        self.fruits = fruits  # List of fruits

    def __len__(self):
        return len(self.fruits)  # Total number of fruits

    def __getitem__(self, idx):
        return self.fruits[idx]  # Get specific fruit by index

basket = Basket(["Apple", "Banana", "Mango"])
```


In [None]:
class Basket:
    def __init__(self, fruits):
        self.fruits = fruits  # List of fruits

    def __len__(self):
        return len(self.fruits)  # Total number of fruits

    def __getitem__(self, idx):
        return self.fruits[idx]  # Get a specific fruit by index


In [None]:
basket = Basket(["Apple", "Banana", "Mango"])

In [None]:
print(len(basket))
print(basket[0])
print(basket[1])

# Explanation of the `TextClassificationDataset` Class

This is a **custom PyTorch Dataset class** designed for text classification tasks.

### Purpose:
- To **prepare and serve text data** in a format that a PyTorch model can easily use during training or evaluation.
- It **tokenizes the text inputs on-the-fly** using a tokenizer (like from Hugging Face’s Transformers library).
- It converts raw texts and labels into tensors that can be fed into a neural network.

---

### Why use this class?

- It abstracts data loading and preprocessing, so you can plug it into a PyTorch `DataLoader`.
- Enables efficient batch processing and GPU compatibility.
- Supports custom tokenization logic per item.

---

### Summary:

This class wraps raw text and labels into a format suitable for training a text classification model with PyTorch and Hugging Face tokenizers.


In [None]:
# Dataset class for text classification using PyTorch
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        # texts: list of raw text samples
        # labels: list of integer labels corresponding to texts
        # tokenizer: a tokenizer object to convert text to token IDs
        # max_length: max number of tokens per text (for padding/truncation)
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        # Returns the total number of samples in the dataset
        return len(self.texts)

    def __getitem__(self, idx):
        # Fetches the text and label at index `idx`
        text = str(self.texts[idx])   # Convert to string in case input isn't already
        label = int(self.labels[idx]) # Convert label to integer

        # Tokenize the text using the tokenizer
        encoding = self.tokenizer(
            text,
            truncation=True,           # Cut off tokens beyond max_length
            padding='max_length',      # Pad shorter texts to max_length
            max_length=self.max_length,
            return_tensors='pt'        # Return PyTorch tensors
        )

        # Return a dictionary containing:
        # - input_ids: token ids tensor (flattened to 1D)
        # - attention_mask: tensor indicating real tokens vs padding (flattened)
        # - labels: tensor of the label (long integer type)
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }


# Purpose of the Code `BERTTextClassifier`

The main purpose of this `BERTTextClassifier` code is to create a reusable **BERT-based text classification system** that can:

- Load and preprocess text data (e.g., IMDb movie reviews).
- Fine-tune a pretrained BERT model to classify texts into categories such as positive or negative sentiment.
- Evaluate the trained model's performance on test data using metrics like accuracy and F1 score.
- Make predictions with confidence scores (probabilities) on new, unseen text inputs.

### Summary:
This code implements a full pipeline for **fine-tuning and using BERT** for text classification tasks, wrapping data loading, training, evaluation, and prediction inside a single convenient class.

---

## Workflow:

1. **Initialize the classifier**  
   Load the BERT model, tokenizer, and move the model to the appropriate device (CPU/GPU).

2. **Load IMDb data**  
   Sample training and testing texts along with their labels.

3. **Train the model**  
   - Convert texts and labels into a dataset format.  
   - Use `DataLoader` for batching data.  
   - For each epoch and batch: perform a forward pass, compute loss, backpropagate, update weights, and adjust the learning rate.

4. **Evaluate the model**  
   - Run the model on test data without calculating gradients.  
   - Collect predictions and compute metrics such as accuracy, F1 score, and classification report.

5. **Predict new texts**  
   - Tokenize new input texts, run them through the model, apply softmax to obtain probabilities, and return predicted classes.
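
The last step, turning logits into probabilities and a predicted class, can be illustrated without a model. The logits below are made-up values for the two classes; the class itself uses `torch.softmax`, and this pure-Python version only shows the computation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]  # made-up scores for [negative, positive]
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # same result as argmax over the logits

print(predicted_class, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits or of the probabilities gives the same class.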


In [None]:
# BERT Text Classifier
class BERTTextClassifier:
    """BERT for Text Classification (Sentiment, Spam etc.)"""

    # --- Initialize the classifier ---
    def __init__(self, model_name='bert-base-uncased', num_classes=2, max_length=512):
        self.model_name = model_name
        self.num_classes = num_classes
        self.max_length = max_length

        # Load the tokenizer for the specified pretrained BERT model
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        # Load the pretrained BERT model for sequence classification with given number of classes
        self.model = BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_classes
        )
        # Move the model to the available device (CPU or GPU)
        self.model.to(device)

    # --- Load IMDb data ---
    def load_imdb_data(self, sample_size=5000):
        """Load IMDb movie reviews dataset"""

        print("Loading IMDb dataset...")

        # Download the IMDb dataset using Hugging Face datasets library
        dataset = load_dataset("imdb")

        # Randomly sample indices from the training set for quicker experiments
        train_indices = np.random.choice(len(dataset['train']),
                                       min(sample_size, len(dataset['train'])),
                                       replace=False)

        # Randomly sample indices from the test set (smaller subset)
        test_indices = np.random.choice(len(dataset['test']),
                                      min(sample_size//4, len(dataset['test'])),
                                      replace=False)

        # Extract texts and labels for training samples
        train_texts = [dataset['train'][int(i)]['text'] for i in train_indices]
        train_labels = [dataset['train'][int(i)]['label'] for i in train_indices]

        # Extract texts and labels for test samples
        test_texts = [dataset['test'][int(i)]['text'] for i in test_indices]
        test_labels = [dataset['test'][int(i)]['label'] for i in test_indices]

        print(f"Train samples: {len(train_texts)}")
        print(f"Test samples: {len(test_texts)}")

        return train_texts, train_labels, test_texts, test_labels

    # --- Train the model ---
    def train(self, train_texts, train_labels, epochs=1, batch_size=8, learning_rate=2e-5):
        """Train the text classifier"""

        # Create dataset from raw texts and labels with tokenizer
        train_dataset = TextClassificationDataset(
            train_texts, train_labels, self.tokenizer, self.max_length
        )

        # DataLoader for batching and shuffling training data
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        # Optimizer with weight decay to update model parameters
        optimizer = AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)

        # Total number of training steps (batches * epochs)
        total_steps = len(train_loader) * epochs

        # Scheduler for gradually decreasing learning rate with warmup period
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=total_steps//10, num_training_steps=total_steps
        )

        self.model.train()  # Set model to training mode

        for epoch in range(epochs):
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')

            for batch in progress_bar:
                optimizer.zero_grad()  # Reset gradients

                # Move inputs and labels to device
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                # Forward pass: get model outputs and calculate loss
                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                loss = outputs.loss  # Extract loss value
                total_loss += loss.item()

                loss.backward()  # Backpropagation

                # Gradient clipping to avoid exploding gradients
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

                optimizer.step()  # Update model weights
                scheduler.step()  # Update learning rate

                # Show the current loss on the progress bar
                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            print(f'Epoch {epoch+1}, Average Loss: {total_loss/len(train_loader):.4f}')

    # --- Evaluate the model ---
    def evaluate(self, test_texts, test_labels, batch_size=8):
        """Evaluate the text classifier"""

        # Prepare dataset and dataloader for test set
        test_dataset = TextClassificationDataset(
            test_texts, test_labels, self.tokenizer, self.max_length
        )
        test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

        self.model.eval()  # Set model to evaluation mode

        predictions = []
        true_labels = []

        with torch.no_grad():  # Disable gradient calculation
            for batch in tqdm(test_loader, desc='Evaluating'):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                # Forward pass without labels (no loss computed)
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits

                # Predicted class is the argmax of logits
                preds = torch.argmax(logits, dim=1).cpu().numpy()

                predictions.extend(preds)  # Collect predictions
                true_labels.extend(labels.cpu().numpy())  # Collect true labels

        # Compute accuracy and F1 score for evaluation
        accuracy = accuracy_score(true_labels, predictions)
        f1 = f1_score(true_labels, predictions, average='weighted')

        # Detailed classification report (precision, recall, F1 per class)
        report = classification_report(true_labels, predictions,
                                     target_names=['Negative', 'Positive'])

        return accuracy, f1, report

    # --- Predict new texts ---
    def predict(self, texts):
        """Predict sentiment for new texts"""
        predictions = []
        probabilities = []

        self.model.eval()  # Set model to eval mode

        for text in texts:
            # Tokenize input text, prepare tensors
            encoding = self.tokenizer(
                text,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt'
            )

            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)

            with torch.no_grad():
                # Forward pass to get logits
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits

                # Convert logits to probabilities using softmax
                probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
                # Predicted class is argmax of logits
                pred = torch.argmax(logits, dim=1).cpu().numpy()[0]

                predictions.append(pred)       # Append predicted class
                probabilities.append(probs)    # Append class probabilities

        return predictions, probabilities


# Purpose of the `NERDataset` Class

The `NERDataset` class is designed to prepare data for Named Entity Recognition (NER) tasks using transformer models like BERT.

## Key goals:

- **Input Handling:**  
  Accepts pre-tokenized input sentences (`tokens_list`) and their corresponding word-level labels (`labels_list`).

- **Tokenization & Alignment:**  
  Uses a BERT-compatible tokenizer to split words into subword tokens while maintaining alignment between original word labels and tokenized subwords.

- **Label Alignment:**  
  Assigns the correct label to the first subword token of each word and marks subsequent subword tokens with `-100` so that they are ignored during loss calculation.

- **Padding & Truncation:**  
  Ensures that all sequences and their aligned labels have a consistent length (`max_length`) by padding or truncating as needed.

- **Output Format:**  
  Provides data samples as dictionaries containing `input_ids`, `attention_mask`, and `labels` tensors, ready to be fed directly into transformer models for training or evaluation.

## Why is this important?

Transformer tokenizers often split words into multiple subwords. For sequence labeling tasks like NER, the model expects token-level labels aligned with these subwords. This class automates the process of properly aligning labels to tokenized inputs, which is crucial for effective training of token classification models.


# Token Classification Label Alignment with BERT

This example demonstrates how to handle token classification tasks (like Named Entity Recognition - NER) using BERT's tokenizer, including how to align original word-level labels with BERT's subword tokens.

---

## Input Example

**Sentence (tokens):**  
`["John", "lives", "in", "London"]`

**Labels (per token):**  
`[1, 0, 0, 2]`

- `1` = B-PER (Beginning of a Person entity, e.g., "John")  
- `0` = O (Outside any entity)  
- `2` = B-LOC (Beginning of a Location entity, e.g., "London")

---

## Step 1: Tokenizer Output

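| Position    | 0         | 1    | 2     | 3  | 4      | 5       | 6         | 7         |
|-------------|-----------|------|-------|----|--------|---------|-----------|-----------|
| BERT token  | `[CLS]`   | John | lives | in | Lon    | `##don` | `[SEP]`   | `[PAD]`   |
| Source word | (special) | John | lives | in | London | London  | (special) | (padding) |
| Word ID     | None      | 0    | 1     | 2  | 3      | 3       | None      | None      |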

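- In this illustration, BERT splits "London" into two subword tokens: `Lon` and `##don`. (A real vocabulary may keep a common word like this as a single token; rarer words are the ones that get split.)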
- Special tokens like `[CLS]` (start) and `[SEP]` (end) have no word IDs.

---

## Step 2: Label Alignment

| Token      | Word ID | Original Label | Final Label for Model |
|------------|---------|----------------|-----------------------|
| `[CLS]`    | None    | —              | -100 (ignore)         |
| John       | 0       | 1 (B-PER)      | 1                     |
| lives      | 1       | 0 (O)          | 0                     |
| in         | 2       | 0 (O)          | 0                     |
| Lon        | 3       | 2 (B-LOC)      | 2                     |
| ##don      | 3       | 2 (B-LOC)      | -100 (ignore)         |
| `[SEP]`    | None    | —              | -100 (ignore)         |
| `[PAD]`    | None    | —              | -100 (ignore)         |

- The first subword token of a word gets the original label.
- Continuation subwords (like `##don`) get `-100` to be ignored during loss calculation.
- Special tokens and padding tokens also get `-100`.
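
The three rules above fit in a small function. This is a sketch on plain lists, using the word IDs from the example table; `align_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def align_labels(word_ids, word_labels):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special or padding token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned

# Word IDs from the example: [CLS], John, lives, in, Lon, ##don, [SEP], [PAD]
word_ids = [None, 0, 1, 2, 3, 3, None, None]
word_labels = [1, 0, 0, 2]  # B-PER, O, O, B-LOC
print(align_labels(word_ids, word_labels))  # [-100, 1, 0, 0, 2, -100, -100, -100]
```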

---

## Step 3: Model Input Example

```python
{
  'input_ids': tensor([...]),           # Token IDs for BERT tokens (PyTorch tensor)
  'attention_mask': tensor([1, 1, 1, ...]),  # 1 for real tokens, 0 for padding
  'labels': tensor([-100, 1, 0, 0, 2, -100, -100, ...])  # Labels aligned with tokens
}
```


In [None]:
import torch
from torch.utils.data import Dataset

class NERDataset(Dataset):
    """Dataset class for Named Entity Recognition (NER) tasks"""

    def __init__(self, tokens_list, labels_list, tokenizer, max_length=512):
        """
        Initialize the dataset with:
        - tokens_list: list of tokenized sentences (list of tokens per sentence)
        - labels_list: corresponding list of label sequences for each sentence
        - tokenizer: tokenizer compatible with BERT (or similar) model
        - max_length: max token sequence length (for padding/truncation)
        """
        self.tokens_list = tokens_list
        self.labels_list = labels_list
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        # Return number of samples in the dataset
        return len(self.tokens_list)

    def __getitem__(self, idx):
        # Retrieve tokens and labels for the given index
        tokens = self.tokens_list[idx]
        labels = self.labels_list[idx]

        # Tokenize input tokens with padding and truncation
        # `is_split_into_words=True` tells tokenizer that input is already tokenized at word-level
        encoding = self.tokenizer(
            tokens,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            is_split_into_words=True,
            return_tensors='pt'  # Return PyTorch tensors
        )

        # word_ids maps each tokenized subword token back to the original word index (or None for special tokens)
        word_ids = encoding.word_ids(batch_index=0)

        aligned_labels = []  # List to store labels aligned with subword tokens
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens like [CLS], [SEP] get label -100 (ignored in loss)
                aligned_labels.append(-100)
            elif word_idx != previous_word_idx:
                # For the first subword token of a word, assign the original word's label
                # Defensive check in case word_idx exceeds labels length (assign 0 if so)
                aligned_labels.append(labels[word_idx] if word_idx < len(labels) else 0)
            else:
                # For subsequent subword tokens of the same word, assign -100 to ignore them
                aligned_labels.append(-100)
            previous_word_idx = word_idx

        # Ensure aligned labels list matches max_length by padding or truncating
        if len(aligned_labels) < self.max_length:
            aligned_labels += [-100] * (self.max_length - len(aligned_labels))
        elif len(aligned_labels) > self.max_length:
            aligned_labels = aligned_labels[:self.max_length]

        # Return dictionary with input IDs, attention mask, and aligned labels as tensors
        return {
            'input_ids': encoding['input_ids'].squeeze(0),        # Token IDs tensor shape: (max_length,)
            'attention_mask': encoding['attention_mask'].squeeze(0),  # Attention mask tensor shape: (max_length,)
            'labels': torch.tensor(aligned_labels, dtype=torch.long)  # Aligned labels tensor shape: (max_length,)
        }


| **Label**  | **Full Form**          | **Meaning (English)**                                                   | **Example**                                                       |
|------------|------------------------|------------------------------------------------------------------------|------------------------------------------------------------------|
| **O**      | Outside                | Not part of any named entity (normal word)                             | "works", "at"                                                    |
| **B-PER**  | Begin - Person         | First word of a person’s name                                          | "John" → `B-PER`                                                 |
| **I-PER**  | Inside - Person        | Continuation word(s) of a person’s name                                | "Mary Jane" → `Mary = B-PER`, `Jane = I-PER`                     |
| **B-ORG**  | Begin - Organization   | First word of an organization or company name                          | "Google" → `B-ORG`                                               |
| **I-ORG**  | Inside - Organization  | Continuation word(s) of an organization’s name                         | "New York Times" → `New = B-ORG`, `York = I-ORG`, `Times = I-ORG`|
| **B-LOC**  | Begin - Location       | First word of a location or place name                                 | "London" → `B-LOC`                                               |
| **I-LOC**  | Inside - Location      | Continuation word(s) of a location name                                | "New York" → `New = B-LOC`, `York = I-LOC`                       |
| **B-MISC** | Begin - Miscellaneous  | First word of miscellaneous entities (events, products, nationalities) | "Indian" (nationality) → `B-MISC`                                |
| **I-MISC** | Inside - Miscellaneous | Continuation word(s) of miscellaneous entities                         | "South Korean" → `South = B-MISC`, `Korean = I-MISC`             |



# Explanation of NER Labels

- **B** → Begin: The first word of a named entity (the starting word of the entity).
- **I** → Inside: Continuation words that belong to the same named entity.
- **O** → Outside: Words that are not part of any named entity (normal words).
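
To make the B/I/O scheme concrete, here is a minimal sketch that groups a tagged sentence back into entities. The helper name `bio_to_spans` is hypothetical (it is not part of the notebook's classes), and the sketch simply drops a stray `I-` tag that does not continue an open entity, which is one of several reasonable conventions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_words = None, []

    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new entity, closing any open one
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = entity_type, [token]
        elif prefix == "I" and current_type == entity_type:
            # An I- tag extends the entity that is currently open
            current_words.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []

    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


tokens = ["Mary", "Jane", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Mary Jane'), ('ORG', 'Google'), ('LOC', 'New York')]
```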

---

# 📚 About WikiAnn Dataset

The **WikiAnn** dataset (also known as PAN-X) is a multilingual Named Entity Recognition (NER) dataset built from Wikipedia articles, with person, organization, and location tags derived largely automatically from Wikipedia's link structure rather than by full manual annotation. It is widely used for training and evaluating NER models across multiple languages.

---

# 📌 Purpose of the `BERTNERClassifier` Code

This code implements a **Named Entity Recognition (NER)** pipeline that fine-tunes a pre-trained **BERT** model on the **WikiAnn (English)** dataset.

---

## 🎯 Goal

To automatically **identify and label named entities** (like person names, locations, and organizations) in raw text using a BERT-based deep learning model.

---

## 🛠️ What the Code Does

### 1. Load the Dataset
- Loads the [WikiAnn English](https://huggingface.co/datasets/wikiann) dataset, which contains:
  - Sentences broken into **tokens**
  - Corresponding **NER labels** such as:
    - `B-PER` (beginning of a person’s name)
    - `I-LOC` (continuation of a location)
    - `O` (non-entity words)

### 2. Define a Classifier
- The `BERTNERClassifier` class is responsible for:
  - Loading a pre-trained **BERT** model (`bert-base-uncased`)
  - Modifying it for **token classification**
  - Tokenizing inputs
  - Managing label definitions

### 3. Train the Model
- Fine-tunes the BERT model using:
  - A **sampled subset** of WikiAnn
  - `AdamW` optimizer with weight decay
  - **Learning rate scheduler** for gradual warmup
  - **Gradient clipping** to prevent exploding gradients
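
The warmup-then-decay shape of the learning rate can be sketched in a few lines. This is an illustration of the schedule shape under the notebook's settings (10% warmup, linear decay to zero), not a copy of the library's implementation; `linear_warmup_factor` is a hypothetical helper name.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
total_steps = 125                  # 1000 samples / batch size 8, one epoch
warmup_steps = total_steps // 10   # 12 steps, matching total_steps // 10 in train()

for step in (0, 6, 12, 68, 125):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The rate starts at zero, reaches the full `2e-5` once warmup ends, and falls back to zero on the last step, which is why very short runs spend a noticeable share of their steps at a reduced rate.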

### 4. Evaluate the Model
- Tests the model’s performance on unseen data
- Calculates:
  - **Accuracy**
  - **F1-score (weighted)**
- Ignores every position labelled `-100` (special tokens, non-first subword pieces, and padding) for a fair evaluation
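
As a rough illustration of how masked positions affect the metrics, the sketch below computes token-level accuracy and a support-weighted F1 in plain Python, skipping labels equal to `-100`. The function name `masked_scores` and the toy label rows are made up for this example; the notebook itself relies on `sklearn` for the same numbers.

```python
from collections import Counter

IGNORE_INDEX = -100  # label value the loss and the evaluation both skip


def masked_scores(true_rows, pred_rows):
    """Token-level accuracy and support-weighted F1 over positions whose label is not -100."""
    pairs = [
        (t, p)
        for true_row, pred_row in zip(true_rows, pred_rows)
        for t, p in zip(true_row, pred_row)
        if t != IGNORE_INDEX
    ]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    support = Counter(t for t, _ in pairs)            # how often each label is the truth
    predicted = Counter(p for _, p in pairs)          # how often each label is predicted
    correct = Counter(t for t, p in pairs if t == p)  # true positives per label

    weighted_f1 = 0.0
    for label, n_true in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_f1 += f1 * n_true / len(pairs)
    return accuracy, weighted_f1


# Two toy sequences: -100 marks [CLS]/[SEP], trailing subwords, and padding
true_rows = [[-100, 1, -100, 0, 0, -100], [-100, 3, 4, 0, -100, -100]]
pred_rows = [[0, 1, 2, 0, 5, 0], [0, 3, 0, 0, 0, 0]]

accuracy, weighted_f1 = masked_scores(true_rows, pred_rows)
print(f"accuracy={accuracy:.4f}  weighted_f1={weighted_f1:.4f}")
```

Note that this is a per-token score. Because most tokens are `O`, token-level numbers tend to look more flattering than a score that only counts fully correct entities.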

### 5. Predict on New Text
- Accepts tokenized sentences (word lists)
- Returns predicted NER tags aligned with original tokens
- Handles alignment between subword tokens and full words
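
The subword alignment in both directions can be sketched without loading a tokenizer. The `word_ids` list below is a hand-written stand-in for what a fast tokenizer's `word_ids()` returns, and the split of "Johanson" into two pieces is assumed for illustration only.

```python
IGNORE_INDEX = -100


def align_labels(word_ids, word_labels):
    """Training direction: the first subword of each word gets the label, the rest get -100."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(IGNORE_INDEX)            # special token or padding
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])   # first piece of a new word
        else:
            aligned.append(IGNORE_INDEX)            # later piece of the same word
        previous_word_idx = word_idx
    return aligned


def first_subword_predictions(word_ids, token_predictions):
    """Inference direction: keep one prediction per word, taken from its first subword."""
    kept = []
    previous_word_idx = None
    for position, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != previous_word_idx:
            kept.append(token_predictions[position])
        previous_word_idx = word_idx
    return kept


# Assumed split: ["Johanson", "visited", "Paris"] -> [CLS] johan ##son visited paris [SEP] [PAD]
word_ids = [None, 0, 0, 1, 2, None, None]
word_labels = [1, 0, 5]  # B-PER, O, B-LOC in the notebook's label order

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 0, 5, -100, -100]

token_predictions = [0, 1, 2, 0, 5, 0, 0]  # one predicted class id per token
print(first_subword_predictions(word_ids, token_predictions))  # [1, 0, 5]
```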

---

## 🔍 Real-World Application

This pipeline can be used in various real-world scenarios like:

- 🗂️ **Information extraction** from unstructured documents  
- 🤖 **Conversational agents** that identify people, places, or companies  
- 📑 **Document classifiers** with entity context awareness  
- 🔍 **Search engines** with better indexing using named entities  
- 📢 **Social media monitoring** tools to detect key mentions in text

---

## 📦 Summary Table

| Component         | Description |
|------------------|-------------|
| **Dataset**       | WikiAnn (English) |
| **Model**         | BERT (`bert-base-uncased`) |
| **NER Tags**      | `B-PER`, `I-LOC`, `B-ORG`, `O`, etc. |
| **Core Functions**| Load data, train, evaluate, and predict |
| **Libraries Used**| `transformers`, `datasets`, `sklearn`, `torch`, `numpy`, `tqdm` |

---

> 💡 **Tip:** You can easily extend this pipeline to other languages or entity types by switching the WikiAnn language version or adjusting the label list.

---

# 🔁 Complete Flow of the Classifier

1. Load the **WikiAnn English dataset** from Hugging Face using `load_dataset`.
2. Randomly sample:
   - **1000 training examples**
   - **250 test examples**
3. Extract:
   - **Tokens**: Words in each sentence  
   - **NER labels**: Tags for each word
4. Return token-label pairs to use in:
   - Training (`.train()`)
   - Evaluation (`.evaluate()`)
   - Inference (`.predict()`)

---


In [None]:
# Define the main class for BERT-based Named Entity Recognition
class BERTNERClassifier:
    """
    A classifier for Named Entity Recognition using BERT.

    It includes methods for:
    - Loading and sampling the WikiAnn dataset
    - Fine-tuning the model
    - Evaluating it on test data
    - Making predictions on new text
    """

    # ---------------- Initialization ----------------
    def __init__(self, model_name='bert-base-uncased', num_labels=9, max_length=512):
        # model_name: the pre-trained BERT model to use
        # num_labels: number of unique entity tags (e.g. B-PER, I-LOC, etc.)
        # max_length: max sequence length for padding/truncation
        self.model_name = model_name
        self.num_labels = num_labels
        self.max_length = max_length

        # Load tokenizer for BERT
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)

        # Load BERT model for token classification
        self.model = BertForTokenClassification.from_pretrained(model_name, num_labels=num_labels)
        self.model.to(device)

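        # Define the list of NER labels. WikiAnn itself only uses the first seven
        # (O plus B-/I- tags for PER, ORG, LOC); the MISC tags follow the CoNLL-2003 scheme.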
        self.labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
                       'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
        
    # ---------------- Dataset Loader ----------------
    def load_wikiann_data(self, sample_size=1000):
        # sample_size: number of examples to sample from the dataset
        print("Loading WikiAnn English dataset...")
        dataset = load_dataset("wikiann", "en")  # Load WikiAnn NER dataset for English

        # Randomly choose training and test indices
        train_indices = np.random.choice(len(dataset['train']),
                                         min(sample_size, len(dataset['train'])),
                                         replace=False)
        test_indices = np.random.choice(len(dataset['test']),
                                        min(sample_size // 4, len(dataset['test'])),
                                        replace=False)

        # Extract tokens and their corresponding NER labels
        train_tokens = [dataset['train'][int(i)]['tokens'] for i in train_indices]
        train_labels = [dataset['train'][int(i)]['ner_tags'] for i in train_indices]
        test_tokens = [dataset['test'][int(i)]['tokens'] for i in test_indices]
        test_labels = [dataset['test'][int(i)]['ner_tags'] for i in test_indices]

        print(f"Train samples: {len(train_tokens)}")
        print(f"Test samples: {len(test_tokens)}")

        return train_tokens, train_labels, test_tokens, test_labels
    
    # ---------------- Training Function ----------------
    def train(self, train_tokens, train_labels, epochs=1, batch_size=8, learning_rate=2e-5):
        # Fine-tunes the BERT model using the provided training data
        # train_tokens: list of word-token lists
        # train_labels: list of NER tag sequences
        # epochs: number of training passes
        # batch_size: number of samples per batch
        # learning_rate: optimizer learning rate

        train_dataset = NERDataset(train_tokens, train_labels, self.tokenizer, self.max_length)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        # Set up the AdamW optimizer and a linear warmup/decay learning rate scheduler
        optimizer = AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)
        total_steps = len(train_loader) * epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=total_steps // 10, num_training_steps=total_steps
        )

        self.model.train()  # Set model to training mode
        for epoch in range(epochs):
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}/{epochs}")

            for batch in progress_bar:
                optimizer.zero_grad()  # Reset gradients

                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss  # Compute loss
                total_loss += loss.item()

                loss.backward()  # Backpropagation
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)  # Gradient clipping
                optimizer.step()  # Update weights
                scheduler.step()  # Update learning rate

                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            print(f"Epoch {epoch + 1}, Avg Loss: {total_loss / len(train_loader):.4f}")

    # ---------------- Evaluation Function ----------------       
    def evaluate(self, test_tokens, test_labels, batch_size=8):
        # Evaluates the model's performance on unseen test data
        # Returns overall accuracy and weighted F1 score

        test_dataset = NERDataset(test_tokens, test_labels, self.tokenizer, self.max_length)
        test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

        self.model.eval()  # Set model to evaluation mode
        predictions = []
        true_labels = []

        with torch.no_grad():
            for batch in tqdm(test_loader, desc="Evaluating"):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits  # Raw scores
                preds = torch.argmax(logits, dim=2).cpu().numpy()  # Predicted class indices
                labels = labels.cpu().numpy()

                # Keep only positions with a real label
                # (-100 marks special tokens, non-first subwords, and padding)
                for i in range(preds.shape[0]):
                    for j in range(preds.shape[1]):
                        if labels[i][j] != -100:
                            predictions.append(preds[i][j])
                            true_labels.append(labels[i][j])

        accuracy = accuracy_score(true_labels, predictions)
        f1 = f1_score(true_labels, predictions, average='weighted')
        return accuracy, f1
    
    # ---------------- Prediction Function ----------------
    def predict(self, tokens_list):
        # Predict NER tags for a list of tokenized word lists (new sentences)

        predictions = []
        self.model.eval()

        for tokens in tokens_list:
            encoding = self.tokenizer(
                tokens,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt',
                is_split_into_words=True
            )

            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)

            with torch.no_grad():
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                preds = torch.argmax(logits, dim=2).cpu().numpy()[0]

                # Get predicted tags aligned with original words
                word_ids = encoding.word_ids(batch_index=0)
                token_predictions = []
                previous_word_idx = None

                for i, word_idx in enumerate(word_ids):
                    if word_idx is not None and word_idx != previous_word_idx:
                        if word_idx < len(tokens):  # Avoid indexing errors
                            token_predictions.append(self.labels[preds[i]])
                    previous_word_idx = word_idx

                predictions.append(token_predictions)

        return predictions


# Explanation of `QADataset` Class

## 🎯 Purpose

The `QADataset` class prepares data for a **Question Answering (QA)** task (like SQuAD). It helps a model learn **where the answer is located within a given context** by:

- Tokenizing the question and context together.
- Finding the start and end token positions of the answer in the context.
- Returning this data so the model can learn to predict the answer span.

---

## 🔍 What the Code Does

1. **Inputs:**  
   - A list of questions  
   - Corresponding contexts (paragraphs containing answers)  
   - Answers with their text and character start positions  
   - A tokenizer for converting text into tokens  
   - Max sequence length (for padding/truncation)  

2. **Tokenization:**  
   - Combines question and context  
   - Tokenizes into input IDs and attention masks  
   - Provides offset mapping to link tokens back to original characters  

3. **Answer Span Identification:**  
   - Uses offset mapping to find tokens that cover the answer text  
   - Determines start and end token indices for the answer  

4. **Returns:**  
   A dictionary containing:  
   - `input_ids`: tokenized question + context  
   - `attention_mask`: mask for padding tokens  
   - `start_positions`: index of token where answer starts  
   - `end_positions`: index of token where answer ends  

---

## 📝 Simple Example

| Input     | Example Value                                |
|-----------|----------------------------------------------|
| Question  | "Where is the Eiffel Tower located?"         |
| Context   | "The Eiffel Tower is located in Paris, France." |
| Answer    | "Paris" (starts at character 31 in context)  |

### How it works:

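- The tokenizer converts question + context into a single sequence, with `[SEP]` separating the two segments:  
  `[CLS] where is the eiffel tower located ? [SEP] the eiffel tower is located in paris , france . [SEP]`

- The code finds the tokens that cover `"Paris"` by matching character offsets.

- Sets (assuming every word maps to a single token; WordPiece may split rare words and shift the indices):  
  - `start_positions = 15` (token where "Paris" starts, counting `[CLS]` as 0)  
  - `end_positions = 15` (same for a single-word answer)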

- These positions help the model learn to point to the answer span during training.
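
The offset-matching step can be tried without downloading a model. The sketch below uses a toy regex tokenizer as a stand-in for BERT's WordPiece tokenizer, so the helper names (`toy_tokenize`, `encode_pair`, `find_answer_span`) are hypothetical and real token indices may differ when words are split into pieces. It restricts the search to context tokens, since character offsets restart at 0 for each segment.

```python
import re


def toy_tokenize(text):
    """Split into words and punctuation, keeping (start, end) character offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def encode_pair(question, context):
    """Mimic a [CLS] question [SEP] context [SEP] layout with offsets and sequence ids."""
    tokens, offsets, sequence_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for token, start, end in toy_tokenize(text):
            tokens.append(token)
            offsets.append((start, end))
            sequence_ids.append(seq_id)   # 0 = question, 1 = context
        tokens.append("[SEP]")
        offsets.append((0, 0))
        sequence_ids.append(None)         # special tokens belong to no segment
    return tokens, offsets, sequence_ids


def find_answer_span(offsets, sequence_ids, answer_start, answer_text):
    """Return (start_token, end_token) for the answer, or (0, 0) if it is not found."""
    answer_end = answer_start + len(answer_text)
    start_token = end_token = 0
    for i, (start, end) in enumerate(offsets):
        if sequence_ids[i] != 1:
            continue                      # only context tokens can hold the answer
        if start <= answer_start < end:
            start_token = i
        if start < answer_end <= end:
            end_token = i
            break
    return start_token, end_token


question = "Where is the Eiffel Tower located?"
context = "The Eiffel Tower is located in Paris, France."
answer_text = "Paris"
answer_start = context.index(answer_text)

tokens, offsets, sequence_ids = encode_pair(question, context)
start_token, end_token = find_answer_span(offsets, sequence_ids, answer_start, answer_text)
print(answer_start, start_token, end_token, tokens[start_token:end_token + 1])
# 31 15 15 ['paris']
```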

---

## 🔑 Why Is This Important?

- QA models need to **locate exact answer spans** inside paragraphs.
- This dataset class formats data correctly to train such models by providing tokenized inputs and precise answer token positions.

---

> **Analogy:**  
> If someone asks you, "Which word in the sentence is the answer?" you’d count along the sentence and give that word's position. This class teaches the model to do the same with tokens.

---



In [None]:
class QADataset(Dataset):
    """Dataset class for Question Answering tasks (SQuAD-style)"""

    def __init__(self, questions, contexts, answers, tokenizer, max_length=512):
        # Initialize the dataset with:
        # questions: list of question strings
        # contexts: list of context paragraphs (each containing the answer)
        # answers: list of dicts with 'text' and 'answer_start' (character-level answer positions)
        # tokenizer: tokenizer to convert text into tokens suitable for model input
        # max_length: max number of tokens in input sequence (for padding/truncation)
        self.questions = questions
        self.contexts = contexts
        self.answers = answers
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        # Returns the total number of samples (questions) in the dataset
        return len(self.questions)

    def __getitem__(self, idx):
        # Fetch the idx-th sample: question, context, and answer info
        question = self.questions[idx]
        context = self.contexts[idx]
        answer = self.answers[idx]  # dict like {'text': [...], 'answer_start': [...]}

        # Tokenize the question and context together, producing:
        # - input_ids: token IDs of question + context
        # - attention_mask: mask for real tokens vs padding
        # - offset_mapping: maps tokens back to character positions in original text
        encoding = self.tokenizer(
            question,
            context,
            truncation=True,              # truncate to max_length if too long
            padding='max_length',         # pad shorter sequences to max_length
            max_length=self.max_length,   # maximum allowed length for input
            return_offsets_mapping=True,  # get token-to-char position mapping
            return_tensors='pt'           # return PyTorch tensors
        )

        # offset_mapping has shape (1, max_length, 2), pop to remove from encoding dict
        # [0] extracts the sequence dimension since batch size = 1 here
        offset_mapping = encoding.pop("offset_mapping")[0]  # shape: (max_length, 2)

        # Default start and end token positions (for answers not found)
        start_positions = torch.tensor(0, dtype=torch.long)
        end_positions = torch.tensor(0, dtype=torch.long)

        # If an answer exists and answer start positions are provided...
        if answer and 'answer_start' in answer and answer['answer_start']:
            # Extract character index where answer starts in context
            answer_start = answer['answer_start'][0]
            # Extract the answer text string
            answer_text = answer['text'][0]
            # Calculate the character index where answer ends
            answer_end = answer_start + len(answer_text)

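            # sequence_ids marks each token as question (0), context (1), or special/padding (None).
            # Offsets are relative to each segment's own string, so only context tokens may match;
            # otherwise a question token with the same character range could be picked by mistake.
            sequence_ids = encoding.sequence_ids(0)

            # Iterate through offset mappings to find the token span covering the answer
            for token_idx, (start, end) in enumerate(offset_mapping.tolist()):
                if sequence_ids[token_idx] != 1:
                    continue  # Skip question, special, and padding tokens
                # Find token whose char span includes the answer start char
                if start <= answer_start < end:
                    start_positions = torch.tensor(token_idx, dtype=torch.long)
                # Find token whose char span includes the answer end char
                if start < answer_end <= end:
                    end_positions = torch.tensor(token_idx, dtype=torch.long)
                    break  # Once end token found, stop searching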

        # Return a dict with inputs for model training/inference:
        return {
            'input_ids': encoding['input_ids'].squeeze(0),      # token ids, shape: (max_length,)
            'attention_mask': encoding['attention_mask'].squeeze(0),  # attention mask, shape: (max_length,)
            'start_positions': start_positions,                  # index of answer start token
            'end_positions': end_positions                       # index of answer end token
        }


# Purpose of the BERT Question Answering Code

---

### What is this code for?

This code builds a **Question Answering (QA) system** using a BERT model.  
You provide:

- A **context** (paragraph or text passage)  
- A **question** related to that context  

The system finds the **exact answer span** within the context.

---

### Why is this useful?

You can use this for:

- Smart assistants that answer questions based on documents  
- Search engines that find specific answers, not just documents  
- Chatbots that understand and answer factual questions  

---

### How does it work? (Simple steps)

1. **Load data:** Uses the SQuAD dataset containing questions, contexts, and exact answers.  
2. **Train:** Fine-tunes BERT to learn where answers start and end inside paragraphs.  
3. **Answer:** Given a new question and paragraph, predicts the text span that answers the question.

---

### Key Idea

The model doesn’t generate text from scratch — it **selects a snippet** (span) from the given paragraph as the answer, similar to highlighting a phrase in a book.
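
One way to picture span selection is as a search over (start, end) pairs. The sketch below scores every valid pair jointly, which is a common alternative to taking the two argmax positions independently (the approach `answer_question` uses) and then repairing invalid spans. The helper name `best_span` and the logit values are invented for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest start + end score,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best_score, best = float("-inf"), (0, 0)
    for start, start_score in enumerate(start_logits):
        # end may range from start up to start + max_answer_len - 1
        last_end = min(len(end_logits), start + max_answer_len)
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


tokens = ["[CLS]", "where", "?", "[SEP]", "in", "paris", ",", "france", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 1.2, 4.0, 0.1, 2.5, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 3.9, 3.5, 0.2, 3.0, 0.0]

# Independent argmax would give start=5, end=4: an end before the start.
start, end = best_span(start_logits, end_logits)
print((start, end), tokens[start:end + 1])
# (5, 5) ['paris']
```

The quadratic loop is fine for a 512-token input. A production system would usually also exclude question and padding positions from the search.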

---


In [None]:
class BERTQuestionAnswering:
    """BERT model wrapper for Question Answering tasks (SQuAD style)"""

    # ---------------- Initialization ----------------
    def __init__(self, model_name='bert-base-uncased', max_length=512):
        """
        Initialize the tokenizer and model.

        Parameters:
        - model_name (str): Pretrained BERT model name from Hugging Face (default 'bert-base-uncased').
        - max_length (int): Maximum token length for input sequences (default 512).
        """
        self.model_name = model_name  # Store model name
        self.max_length = max_length  # Store max token length for inputs

        # Load pretrained tokenizer to convert text to tokens/ids
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        # Load pretrained BERT model with a QA head (predicts start/end tokens)
        self.model = BertForQuestionAnswering.from_pretrained(model_name)

        # Send model to GPU if available, else CPU
        self.model.to(device)

    # ---------------- Dataset Loading ----------------
    def load_squad_data(self, sample_size=2000):
        """
        Load and sample the SQuAD dataset.

        Steps:
        - Loads the full SQuAD dataset.
        - Randomly selects a subset of training and validation samples for faster experiments.
        - Extracts questions, contexts, and answers from the dataset.

        Parameters:
        - sample_size (int): Number of training samples to randomly select (default 2000).

        Returns:
        - train_questions, train_contexts, train_answers: training data lists.
        - val_questions, val_contexts, val_answers: validation data lists.
        """
        print("Loading SQuAD dataset...")

        # Load the full SQuAD dataset from Hugging Face datasets
        dataset = load_dataset("squad")

        # Randomly select sample_size examples from training set without replacement
        train_indices = np.random.choice(len(dataset['train']),
                                         min(sample_size, len(dataset['train'])),
                                         replace=False)
        # Randomly select 1/4th sample size for validation set
        val_indices = np.random.choice(len(dataset['validation']),
                                       min(sample_size // 4, len(dataset['validation'])),
                                       replace=False)

        # Extract question texts for selected training samples
        train_questions = [dataset['train'][int(i)]['question'] for i in train_indices]
        # Extract context paragraphs for selected training samples
        train_contexts = [dataset['train'][int(i)]['context'] for i in train_indices]
        # Extract answers for selected training samples (dict with text and start positions)
        train_answers = [dataset['train'][int(i)]['answers'] for i in train_indices]

        # Do the same extraction for validation samples
        val_questions = [dataset['validation'][int(i)]['question'] for i in val_indices]
        val_contexts = [dataset['validation'][int(i)]['context'] for i in val_indices]
        val_answers = [dataset['validation'][int(i)]['answers'] for i in val_indices]

        # Print counts for user info
        print(f"Train samples: {len(train_questions)}")
        print(f"Validation samples: {len(val_questions)}")

        # Return extracted lists
        return (train_questions, train_contexts, train_answers,
                val_questions, val_contexts, val_answers)
    
    # ---------------- Training ----------------
    def train(self, questions, contexts, answers, epochs=1, batch_size=8, learning_rate=2e-5):
        """
        Fine-tune the BERT QA model on the provided data.

        Steps:
        - Wrap the data into a custom Dataset (QADataset) for tokenization and label processing.
        - Use DataLoader for batching and shuffling.
        - Define optimizer (AdamW) and learning rate scheduler.
        - Run training epochs with loss backpropagation and gradient clipping.
        - Print progress and average loss per epoch.

        Parameters:
        - questions (list): List of question strings.
        - contexts (list): List of context strings.
        - answers (list): List of answer dicts containing 'text' and 'answer_start'.
        - epochs (int): Number of training epochs.
        - batch_size (int): Batch size for DataLoader.
        - learning_rate (float): Optimizer learning rate.
        """
        # Create the dataset with questions, contexts, answers, and tokenizer
        train_dataset = QADataset(questions, contexts, answers, self.tokenizer, self.max_length)
        # Create data loader to iterate over dataset in batches and shuffle for randomness
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        # Initialize AdamW optimizer with weight decay (better for transformers)
        optimizer = AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)

        # Total training steps = number of batches per epoch * epochs
        total_steps = len(train_loader) * epochs

        # Learning rate scheduler to warm-up and then decay linearly
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=total_steps // 10, num_training_steps=total_steps
        )

        self.model.train()  # Set model to training mode

        for epoch in range(epochs):
            total_loss = 0  # Track cumulative loss per epoch
            # Progress bar to monitor training progress
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}')

            for batch in progress_bar:
                optimizer.zero_grad()  # Reset gradients before each batch

                # Move batch data to device
                input_ids = batch['input_ids'].to(device)           # Token ids for question+context
                attention_mask = batch['attention_mask'].to(device) # Mask to ignore padding tokens
                start_positions = batch['start_positions'].to(device) # True start token index of answer
                end_positions = batch['end_positions'].to(device)     # True end token index of answer

                # Forward pass through BERT model
                outputs = self.model(
                    input_ids=input_ids,             # Input tokens batch
                    attention_mask=attention_mask,   # Attention mask batch
                    start_positions=start_positions, # True start token indices (labels)
                    end_positions=end_positions      # True end token indices (labels)
                )

                loss = outputs.loss  # Compute loss between predicted and true start/end positions

                total_loss += loss.item()  # Accumulate loss

                loss.backward()  # Backpropagation to compute gradients

                # Clip gradients to max norm 1.0 to stabilize training
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

                optimizer.step()  # Update model parameters based on gradients

                scheduler.step()  # Update learning rate according to scheduler

                # Update progress bar with current loss value
                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            # Print average loss at end of epoch
            print(f'Epoch {epoch + 1}, Average Loss: {total_loss / len(train_loader):.4f}')

    # ---------------- Inference / Prediction ----------------
    def answer_question(self, question, context, max_answer_len=30):
        """
        Predict answer span for a single question and context pair.

        Steps:
        - Tokenize and encode input question and context.
        - Pass through BERT QA model to get start and end logits.
        - Select token positions with highest start and end scores.
        - Ensure valid span and limit maximum answer length.
        - Decode predicted tokens back into text answer.

        Parameters:
        - question (str): Question string.
        - context (str): Context paragraph string.
        - max_answer_len (int): Max allowed answer length in tokens.

        Returns:
        - answer (str): Predicted answer text extracted from context.
        """
        # Tokenize and encode question + context as model input
        encoding = self.tokenizer(
            question,
            context,
            truncation=True,              # Truncate if longer than max_length
            padding='max_length',         # Pad sequences to max_length
            max_length=self.max_length,   # Max tokens length
            return_tensors='pt'           # Return PyTorch tensors
        )

        input_ids = encoding['input_ids'].to(device)             # Token IDs (1 x max_length)
        attention_mask = encoding['attention_mask'].to(device)   # Attention mask for padding

        self.model.eval()  # Set model to evaluation mode (disables dropout)

        with torch.no_grad():  # Disable gradient computation for inference
            outputs = self.model(
                input_ids=input_ids,           # Input token IDs
                attention_mask=attention_mask  # Attention mask
            )

            start_logits = outputs.start_logits  # Start position logits (1 x max_length)
            end_logits = outputs.end_logits      # End position logits (1 x max_length)

            # Pick token with highest start logit score as predicted start index
            start_idx = torch.argmax(start_logits, dim=1).item()
            # Pick token with highest end logit score as predicted end index
            end_idx = torch.argmax(end_logits, dim=1).item()

            # Correct invalid spans where end comes before start
            if end_idx < start_idx:
                end_idx = start_idx

            # Limit answer length to max_answer_len tokens (span length is end - start + 1)
            if (end_idx - start_idx + 1) > max_answer_len:
                end_idx = start_idx + max_answer_len - 1

            # Extract predicted token ids for the answer span
            answer_tokens = input_ids[0][start_idx:end_idx + 1]

            # Decode tokens to human-readable string, skipping special tokens like [CLS], [SEP]
            answer = self.tokenizer.decode(answer_tokens, skip_special_tokens=True).strip()

            return answer


# Explanation of `run_text_classification_demo()` Function

---

### Purpose

This function runs a **demo of text classification**, specifically sentiment analysis, using a BERT-based classifier.  
It walks through loading data, training a model, evaluating it, and testing it on custom text samples.

---

### Workflow Breakdown

1. **Print Demo Header**  
   Displays a clear, formatted title for the demo in the console.

2. **Initialize Classifier**  
   Creates an instance of the `BERTTextClassifier` with 2 classes (Positive, Negative).

3. **Load Data**  
   Calls `load_imdb_data()` on the classifier to get a subset of the IMDB movie reviews dataset, with:
   - `train_texts`: training review texts  
   - `train_labels`: their corresponding sentiment labels (0=Negative, 1=Positive)  
   - `test_texts`: testing review texts  
   - `test_labels`: testing sentiment labels  
   *Note:* Only 1000 samples are loaded here for a quick demo.

4. **Display Sample**  
   Prints the first training review (first 200 characters) and its sentiment label.

5. **Train the Model**  
   Trains the classifier on the training data for 1 epoch with a batch size of 8.  
   This fine-tunes BERT to classify sentiments in text.

6. **Evaluate the Model**  
   Evaluates the trained model on the test data to calculate:
   - Accuracy (percentage of correct predictions)  
   - F1 Score (harmonic mean of precision and recall)  
   - A detailed classification report (precision, recall, f1 for each class)

7. **Print Evaluation Metrics**  
   Displays accuracy and F1 score on the console.

8. **Predict on Custom Reviews**  
   Tests the trained model on three custom example sentences.  
   For each, it predicts sentiment and prints:  
   - The review snippet  
   - Predicted sentiment (Positive/Negative)  
   - Confidence percentage of the prediction
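
To make the metrics in step 6 concrete, here is a minimal pure-Python sketch of accuracy and F1 for the positive class. The helper name `accuracy_and_f1` and the toy labels are made up for illustration. The notebook's own `evaluate()` may average F1 differently (for example, weighted across classes), so treat this as one possible reading rather than its exact implementation.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the `positive` class) for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, f1


# Toy labels (0 = Negative, 1 = Positive); these are not results from the demo
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"Accuracy: {acc:.4f}")
print(f"F1 Score: {f1:.4f}")
```

Accuracy alone can look good on imbalanced data, which is why the demo reports F1 alongside it.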

---

### Summary

This demo function shows how to:

- Load and prepare data for sentiment classification  
- Fine-tune a BERT model on a sample dataset  
- Evaluate the model’s performance  
- Make predictions on new, unseen text  

It’s a full mini pipeline demonstrating text classification with BERT.
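
The confidence percentage in step 8 is the probability the model assigns to its predicted class. The sketch below assumes those probabilities come from a softmax over the two output logits, which is the usual setup but is not confirmed by the code shown here. The logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one review: index 0 = Negative, index 1 = Positive
logits = [-1.2, 2.3]

probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
sentiment = "Positive" if pred == 1 else "Negative"
confidence = probs[pred] * 100

print(f"{sentiment} ({confidence:.1f}%)")
```

A mixed review such as "Not bad, could be better though." would typically produce logits that are closer together, and therefore a confidence nearer to 50%.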

---

> **Note:**  
> This is intended as a small-scale demo (using only 1000 samples and 1 epoch) for quick experimentation and learning, not for production-ready accuracy.



In [None]:
def run_text_classification_demo():

    """Demo for text classification"""

    print("\n" + "="*60)
    print("TEXT CLASSIFICATION (Sentiment Analysis) DEMO")
    print("="*60)

    classifier = BERTTextClassifier(num_classes=2)

    # Load data
    train_texts, train_labels, test_texts, test_labels = classifier.load_imdb_data(sample_size=1000)

    # Show sample
    print(f"\nSample Review: {train_texts[0][:200]}...")
    print(f"Label: {'Positive' if train_labels[0] == 1 else 'Negative'}")

    # Train for 1 epoch (kept small for the demo)
    classifier.train(train_texts, train_labels, epochs=1, batch_size=8)

    # Evaluate
    accuracy, f1, report = classifier.evaluate(test_texts, test_labels, batch_size=8)

    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Test custom examples
    custom_reviews = [
        "This movie was fantastic! Amazing acting and great plot.",
        "Boring and terrible. Waste of time.",
        "Not bad, could be better though."
    ]

    predictions, probabilities = classifier.predict(custom_reviews)

    print("\nCustom Predictions:")

    for text, pred, prob in zip(custom_reviews, predictions, probabilities):

        sentiment = "Positive" if pred == 1 else "Negative"

        confidence = prob[pred] * 100

        print(f"'{text[:50]}...' -> {sentiment} ({confidence:.1f}%)")


# Explanation of `run_ner_demo()` Function

---

### Purpose

This function runs a **demo for Named Entity Recognition (NER)** using a BERT-based model.  
It demonstrates loading data, training, evaluating, and predicting named entities in text.

---

### Workflow Breakdown

1. **Print Demo Header**  
   Prints a decorative header announcing the NER demo in the console.

2. **Initialize NER Model**  
   Creates an instance of `BERTNERClassifier` configured for 9 token-level labels (BIO tags such as `B-PER` and `I-LOC`, plus `O` for non-entity tokens).

3. **Load Dataset Subset**  
   Calls `load_wikiann_data()` on the model to load a small subset (500 samples) of the WikiAnn dataset for quick training/testing:  
   - `train_tokens`: tokenized training sentences  
   - `train_labels`: corresponding NER labels for each token  
   - `test_tokens`: tokenized test sentences  
   - `test_labels`: corresponding NER labels for test tokens  

4. **Display Sample Tokens and Labels**  
   Prints the first 10 tokens of the first training sample.  
   Converts numeric labels into human-readable label names (e.g., `B-PER`, `I-LOC`) and prints them.

5. **Train the Model**  
   Fine-tunes the NER model for 1 epoch on the training tokens and labels, with batch size 8.

6. **Evaluate the Model**  
   Evaluates the model on the test data to calculate:  
   - Accuracy (percentage of tokens labeled correctly)  
   - F1 Score (harmonic mean of precision and recall for entity recognition)

7. **Print Evaluation Metrics**  
   Prints the accuracy and F1 score for the test set.

8. **Predict on Custom Sentences**  
   Provides two sample sentences as token lists and runs the model’s `predict()` method.  
   Prints tokens alongside their predicted NER labels.
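
The per-token labels printed in step 8 are easier to read once they are grouped into entities. The following sketch shows one way to do that, assuming `predict()` returns BIO tag strings like `B-PER` and `I-LOC`. The helper `bio_to_spans` is hypothetical, and the tags in the example are hand-written, not model output.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into a list of (entity_type, text) tuples."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity currently being built, if any
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity (dropped here)
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["John", "Smith", "works", "at", "Google", "in", "California"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]  # hand-written example tags

for entity_type, text in bio_to_spans(tokens, tags):
    print(f"{entity_type}: {text}")
```

Dropping a stray `I-` tag is a design choice made for simplicity; some evaluation tools instead treat it as the start of a new entity.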

---

### Summary

This function demonstrates a full NER pipeline, including:

- Loading and preparing token-level NER data  
- Fine-tuning a BERT-based NER classifier  
- Evaluating model performance  
- Predicting entities on new sentences  

It’s designed as a quick, illustrative example using a small data subset.
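
Preparing token-level data has one subtle step: BERT splits words into subword pieces, so word-level labels need to be aligned to those pieces. The sketch below shows a common approach, where only the first piece of each word keeps the label and everything else gets `-100`, the index that PyTorch's `CrossEntropyLoss` ignores by default. The helper `align_labels` and the example values are hypothetical. The `word_ids` list mimics what a Hugging Face fast tokenizer returns from `word_ids()`, and whether the notebook's `NERDataset` aligns labels in exactly this way is an assumption.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_labels, word_ids):
    """Map word-level label ids onto subword positions.

    word_ids holds, for each subword position, the index of the source word,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)           # special token
        elif wid != previous:
            aligned.append(int(word_labels[wid]))  # first piece of a word keeps the label
        else:
            aligned.append(IGNORE_INDEX)           # later pieces of the same word are ignored
        previous = wid
    return aligned


# Hypothetical example: three words, the second one split into two pieces
word_labels = [1, 2, 0]
word_ids = [None, 0, 1, 1, 2, None]

print(align_labels(word_labels, word_ids))
```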

---

> **Note:**  
> This demo is simplified for speed and learning purposes, so it uses only 500 samples and trains for 1 epoch.  
> Real applications typically require more data and longer training.



In [None]:
def run_ner_demo():

    """Demo for Named Entity Recognition"""

    print("\n" + "="*60)
    print("NAMED ENTITY RECOGNITION DEMO")
    print("="*60)

    ner_model = BERTNERClassifier(num_labels=9)

    # Load small subset for demo (fast training)
    train_tokens, train_labels, test_tokens, test_labels = ner_model.load_wikiann_data(sample_size=500)

    # Show sample
    print(f"\nSample tokens: {train_tokens[0][:10]}")

    label_names = [ner_model.labels[l] if l < len(ner_model.labels) else "O" for l in train_labels[0][:10]]

    print(f"Sample labels: {label_names}")

    # Train for 1 epoch (kept small for the demo)
    ner_model.train(train_tokens, train_labels, epochs=1, batch_size=8)

    # Evaluate
    accuracy, f1 = ner_model.evaluate(test_tokens, test_labels, batch_size=8)
    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Custom examples
    custom_sentences = [
        ["John", "Smith", "works", "at", "Google", "in", "California"],
        ["Apple", "Inc.", "was", "founded", "by", "Steve", "Jobs"]
    ]

    predictions = ner_model.predict(custom_sentences)
    print("\nCustom Predictions:")
    for tokens, preds in zip(custom_sentences, predictions):
        print("Tokens:", tokens)
        print("Labels:", preds)
        print()


# Explanation of `run_qa_demo()` Function

---

### Purpose

This function runs a **demo for Question Answering (QA)** using a BERT-based QA model.  
It demonstrates how to load data, train the model, and answer questions based on a given context.

---

### Workflow Breakdown

1. **Print Demo Header**  
   Prints a visually clear header to announce the start of the Question Answering demo.

2. **Initialize QA Model**  
   Creates an instance of `BERTQuestionAnswering`, which internally loads a BERT model and tokenizer for QA tasks.

3. **Load SQuAD Dataset Subset**  
   Uses the `load_squad_data()` method to load a small subset (500 samples) of the SQuAD dataset, which contains:  
   - `train_questions`: list of questions in the training set  
   - `train_contexts`: corresponding paragraphs (contexts)  
   - `train_answers`: ground truth answers with start positions  
   - `val_questions`, `val_contexts`, `val_answers`: similarly for validation set  

4. **Print Sample Data**  
   Displays a sample question, the first 200 characters of the corresponding context, and the ground truth answer (or a fallback message if missing).

5. **Train the Model**  
   Calls the `train()` method to fine-tune the BERT QA model on the training data for 1 epoch with a batch size of 4. Both values are kept small for demo purposes.

6. **Run Custom Q&A Tests**  
   Defines a list of custom question-context pairs not from the dataset.  
   For each pair, it calls the `answer_question()` method of the model to predict answers and prints the question and the predicted answer.
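
SQuAD stores each answer as a character offset into the context, while the model is trained on token positions. The sketch below shows how that conversion can work using per-token character offsets, in the style of the `offset_mapping` a Hugging Face fast tokenizer can return. The helper name and the offsets are hypothetical. For simplicity it covers a context-only encoding; a real question-plus-context encoding also needs to skip the question tokens.

```python
def char_span_to_token_span(offsets, answer_start, answer_text):
    """Return (start_token, end_token), inclusive, for an answer given by character position.

    offsets is a list of (char_start, char_end) pairs, one per token.
    Special tokens are represented by empty spans such as (0, 0).
    """
    answer_end = answer_start + len(answer_text)  # exclusive character end
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # skip special tokens
        if start_token is None and tok_start <= answer_start < tok_end:
            start_token = i
        if tok_start < answer_end <= tok_end:
            end_token = i
    return start_token, end_token


context = "Paris is the capital of France."
# Hypothetical offsets: [CLS], Paris, is, the, capital, of, France, ".", [SEP]
offsets = [(0, 0), (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31), (0, 0)]

answer_text = "capital of France"
answer_start = context.index(answer_text)

span = char_span_to_token_span(offsets, answer_start, answer_text)
print(span)
```

If either returned index is `None`, the answer does not fit inside the encoded window, which can happen when a long context is truncated.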

---

### Summary

This demo shows a complete pipeline for:

- Loading a QA dataset (SQuAD subset)  
- Preparing and training a BERT QA model  
- Testing the model with new, unseen questions and contexts  
- Outputting the predicted answers for human review  

---

### Notes

- The training sample size and epochs are kept small for quick demonstration.  
- The `answer_question()` method limits the answer span length (max 30 tokens) to avoid excessively long predictions.  
- This pipeline can be extended to larger datasets, longer training, and different QA models.
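
The `answer_question()` method picks the start and end indices independently and then repairs invalid or overlong spans. A common alternative is to score every valid (start, end) pair jointly and keep the best one. The sketch below illustrates that idea with plain Python lists; it is not the notebook's method. The function name `best_span` and the logits are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair, inclusive, with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best_score, best = float("-inf"), (0, 0)
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))  # exclusive upper bound for end
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best


# Hypothetical logits where independent argmax gives start=2, end=1 (an invalid span)
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.2, 4.0, 0.1, 3.0, 0.5]

print(best_span(start_logits, end_logits))
print(best_span(start_logits, end_logits, max_answer_len=1))
```

The double loop is quadratic in sequence length, so real implementations usually restrict the search to the top few start and end candidates.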



In [None]:
def run_qa_demo():

    """Demo for Question Answering"""

    print("\n" + "="*60)
    print("QUESTION ANSWERING DEMO")
    print("="*60)

    qa_model = BERTQuestionAnswering()

    # Load small subset (for speed)
    (train_questions, train_contexts, train_answers,
     val_questions, val_contexts, val_answers) = qa_model.load_squad_data(sample_size=500)

    # Safe sample print
    ans_text = train_answers[0]['text'][0] if train_answers[0]['text'] else 'No answer'

    print(f"\nSample Question: {train_questions[0]}")

    print(f"Sample Context: {train_contexts[0][:200]}...")

    print(f"Sample Answer: {ans_text}")

    # Train model (1 epoch for the demo)
    qa_model.train(train_questions, train_contexts, train_answers, epochs=1, batch_size=4)

    # Test on custom questions
    print("\nCustom Q&A Examples:")

    test_cases = [
        {
            "question": "What is the capital of France?",
            "context": "France is a country in Europe. Paris is the capital and largest city of France. The city is known for the Eiffel Tower and the Louvre Museum."
        },
        {
            "question": "Who founded Apple?",
            "context": "Apple Inc. is an American technology company. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976. The company is known for products like iPhone and Mac."
        }
    ]

    for case in test_cases:
        answer = qa_model.answer_question(case["question"], case["context"], max_answer_len=30)
        print(f"Q: {case['question']}")
        print(f"A: {answer}")
        print()


# BERT Multi-Task Demo Launcher Explanation

## Purpose  
This script provides a simple command-line menu for users to select and run demos of different BERT-based NLP tasks:
- Text Classification (Sentiment Analysis)
- Named Entity Recognition (NER)
- Question Answering (QA)

## Workflow Overview

1. **Display Menu**  
   The script prints a header and a list of task options for the user to choose from.

2. **User Input**  
   It prompts the user to enter a choice (1 to 4), then strips leading and trailing whitespace from the input.

3. **Execute Task(s)**  
   Based on the input:
   - **1:** Runs the Text Classification demo.
   - **2:** Runs the NER demo.
   - **3:** Runs the Question Answering demo.
   - **4:** Runs all three demos sequentially.
   - **Invalid input:** Shows an error message indicating the choice is invalid.

4. **Error Handling**  
   Wraps the task execution inside a try-except block to catch any runtime errors.  
   If an error occurs, it displays:
   - The error message.
   - Instructions to ensure necessary packages and classes are installed and loaded.
   - Advice on using fixed versions of the code to avoid common bugs.
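
The menu logic above can also be written as a lookup table instead of an `if`/`elif` chain. This is only a sketch of an alternative design, not the launcher's actual code. The three `demo_*` functions are hypothetical stand-ins for `run_text_classification_demo()`, `run_ner_demo()`, and `run_qa_demo()` so that the example runs on its own.

```python
# Hypothetical stand-ins for the real demo functions
def demo_classification():
    return "classification"


def demo_ner():
    return "ner"


def demo_qa():
    return "qa"


# Each menu choice maps to the list of demos it should run, in order
DEMOS = {
    "1": [demo_classification],
    "2": [demo_ner],
    "3": [demo_qa],
    "4": [demo_classification, demo_ner, demo_qa],
}


def run_choice(choice):
    """Run the demos for a menu choice and return their results."""
    demos = DEMOS.get(choice.strip())
    if demos is None:
        print("Invalid choice! Please run again.")
        return []
    return [demo() for demo in demos]


print(run_choice("4"))
```

With this layout, adding a new task means adding one dictionary entry rather than another branch.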

## Key Points

- **User-Friendly Interface:** Provides a clean and straightforward way to run different BERT demos without modifying the code.  
- **Robustness:** Includes error handling to inform the user about missing dependencies or other issues.  
- **Dependencies:** Requires that all necessary Python packages and demo classes/datasets are installed and available.  
- **Flexibility:** Allows running individual tasks or all tasks in one go for demonstration or testing purposes.
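
Since the launcher depends on several packages, it can help to check for them before running a demo rather than after a failure. The sketch below uses the standard library's `importlib.util.find_spec` for that. The `REQUIRED` mapping and the `missing_packages` helper are hypothetical; the package list is taken from the launcher's own `pip install` hint.

```python
import importlib.util

# pip package name -> name used in `import` statements
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "datasets": "datasets",
    "scikit-learn": "sklearn",  # installed as scikit-learn, imported as sklearn
    "tqdm": "tqdm",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be found for import."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages. Try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

This only confirms that each package can be located. It does not check versions or whether a GPU is available.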

## Summary  
This script is a launcher that helps users quickly demo multiple NLP tasks powered by BERT models via a simple menu system, with built-in error handling and user guidance.


In [None]:
print("BERT Multi-Task Demo")
print("Choose a task to run:")
print("1. Text Classification (Sentiment Analysis)")
print("2. Named Entity Recognition (NER)")
print("3. Question Answering")
print("4. Run All Tasks")

choice = input("\nEnter your choice (1-4): ").strip()

try:
    if choice == "1":
        run_text_classification_demo()
    elif choice == "2":
        run_ner_demo()
    elif choice == "3":
        run_qa_demo()
    elif choice == "4":
        run_text_classification_demo()
        run_ner_demo()
        run_qa_demo()
    else:
        print("Invalid choice! Please run again.")
except Exception as e:
    print("\n--- ERROR OCCURRED ---")
    print(f"Error: {e}")
    print("\nMake sure you have:")
    print("1. Installed required packages:")
    print("   pip install torch transformers datasets scikit-learn tqdm numpy")
    print("2. Loaded all classes & dataset helpers (BERTTextClassifier, BERTNERClassifier, BERTQuestionAnswering, TextClassificationDataset, NERDataset, QADataset)")
    print("3. Used the fixed versions (with int casting, offset_mapping for QA, word_ids fix for NER)")
