### **Detailed Explanation of the Multi-Task Sentence Transformer Model**

This implementation extends a **pre-trained transformer (BERT)** to perform **two NLP tasks simultaneously**:
1. **Task A: Sentence Classification** – Predicts a class label for an input sentence.
2. **Task B: Named Entity Recognition (NER)** – Assigns a label to each token in the input sentence.

---


### **Key Components of the Model**
1. **Transformer Backbone (`self.encoder`)**
   - Loads **BERT** using `AutoModel.from_pretrained("bert-base-uncased")`.
   - Outputs **contextual embeddings** for each token.

2. **Tokenizer (`self.tokenizer`)**
   - Converts input sentences into **tokenized tensors**.

3. **Mean Pooling (`self.mean_pooling`)**
   - Aggregates **token embeddings** into a **single fixed-length vector**.
   - Averages only the non-padding tokens, using the **attention mask**. An `nn.AdaptiveAvgPool1d(1)` module is also registered as `self.pooling`, but `forward` never calls it.

4. **Sentence Classification Head (`self.classifier`)**
   - A fully connected **linear layer** (`nn.Linear`).
   - Maps **sentence embeddings** to **class logits** (one unnormalized score per class).

5. **Token Classification Head (`self.token_classifier`)**
   - Another **fully connected layer** for **NER**.
   - Predicts a **label for each token** in the sentence.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AutoModel, AutoTokenizer


# Defining the Multi-Task Sentence Transformer Model

class MultiTaskSentenceTransformer(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_classes=3, num_labels=5):
        super(MultiTaskSentenceTransformer, self).__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Mean Pooling Layer (declared but never called in forward();
        # the masked mean_pooling() method below is used instead)
        self.pooling = nn.AdaptiveAvgPool1d(1)

        # Task A: Sentence Classification
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

        # Task B: Named Entity Recognition (NER)
        self.token_classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def mean_pooling(self, token_embeddings, attention_mask):
        """Compute mean pooling over token embeddings based on attention mask"""
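        # Cast the integer mask to float so the multiplication and clamp below run in floating point
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()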
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)  # Avoid division by zero
        return sum_embeddings / sum_mask


    # Forward pass through the model
    def forward(self, sentences):
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        outputs = self.encoder(**inputs)
        pooled_output = self.mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

        # Task A: Sentence Classification
        class_logits = self.classifier(pooled_output)

        # Task B: Named Entity Recognition (NER)
        token_logits = self.token_classifier(outputs.last_hidden_state)  # (batch_size, seq_len, num_labels)

        return class_logits, token_logits


### **Breakdown of the Forward Pass**
1. **Tokenization**
   - Converts raw sentences into tokenized tensors (`return_tensors="pt"` ensures PyTorch tensors).
   - `padding=True` ensures uniform input length.
   - `truncation=True` prevents excessively long inputs.

2. **Passing Through Transformer**
   - `outputs.last_hidden_state` contains **contextual embeddings** for each token.

3. **Applying Mean Pooling**
   - Converts token embeddings into a **single sentence embedding**.

4. **Sentence Classification**
   - Passes the **sentence embedding** through `self.classifier` to obtain **class logits** (unnormalized scores; `CrossEntropyLoss` applies the softmax internally).

5. **Token Classification (NER)**
   - Passes **token embeddings** through `self.token_classifier` to get **per-token label logits**.

- **CrossEntropyLoss** is used for both classification tasks.
- **Sentence classification**: Compares **logits with ground-truth class labels**.
- **Token classification**: Compares **token-level logits with true token labels**.

In [3]:
# Model Training
multi_task_model = MultiTaskSentenceTransformer()
classification_loss_fn = nn.CrossEntropyLoss()
token_classification_loss_fn = nn.CrossEntropyLoss()


optimizer = optim.AdamW(multi_task_model.parameters(), lr=5e-5)

# Running a Sample Training Step
sample_sentences = ["This is a test sentence.", "Sentence transformers generate embeddings."]
sentence_labels = torch.tensor([0, 1])  # Example class labels
class_logits, token_logits = multi_task_model(sample_sentences)
# Example token labels (random for demo), shaped (batch_size, seq_len) to match the tokenized batch
token_labels = torch.randint(0, token_logits.size(-1), token_logits.shape[:2])



- `optimizer.zero_grad()`: Clears previous gradients.
- `total_loss.backward()`: Computes gradients.
- `optimizer.step()`: Updates model parameters.


In [4]:
# Computing Loss

classification_loss = classification_loss_fn(class_logits, sentence_labels)
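# Flatten to (batch_size * seq_len, num_labels); read num_labels from the tensor instead of hard-coding 5
token_classification_loss = token_classification_loss_fn(token_logits.view(-1, token_logits.size(-1)), token_labels.view(-1))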

# Total Loss
total_loss = classification_loss + token_classification_loss

# Backpropagation and Optimization
optimizer.zero_grad()
total_loss.backward()
optimizer.step()

print("Sentence Classification Output Shape:", class_logits.shape)  # Expected: (batch_size, num_classes)
print("Token Classification Output Shape:", token_logits.shape)  # Expected: (batch_size, seq_len, num_labels)


Sentence Classification Output Shape: torch.Size([2, 3])
Token Classification Output Shape: torch.Size([2, 10, 5])


## **Summary of Key Features**
| **Component** | **Description** |
|--------------|----------------|
| **Transformer Encoder** | Extracts token embeddings using a pre-trained model. |
| **Mean Pooling** | Aggregates token embeddings into a single sentence representation. |
| **Sentence Classification Head** | Maps sentence embeddings to class logits. |
| **NER Head** | Assigns labels to each token in the sentence. |
| **CrossEntropyLoss** | Computes loss for classification tasks. |
| **AdamW Optimizer** | Fine-tunes the transformer model. |

### **Changes Made to the Architecture to Support Multi-Task Learning**

The architecture was modified to support **two NLP tasks simultaneously**:  
1. **Task A: Sentence Classification** – Predicting a class label for an input sentence.  
2. **Task B: Named Entity Recognition (NER)** – Assigning a label to each token in a sentence.  

### **Key Changes to the Model Architecture**
---

### **1. Shared Transformer Encoder**
```python
self.encoder = AutoModel.from_pretrained(model_name)
```
- A **pre-trained transformer model (BERT)** is used as the **shared feature extractor** for both tasks.
- It generates **contextualized embeddings** for input tokens.

---

### **2. Adding a Mean Pooling Layer for Sentence Classification**
```python
self.pooling = nn.AdaptiveAvgPool1d(1)
```
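- **Why?**  
  - BERT outputs an embedding for each token, but **sentence classification** requires a **single embedding per sentence**.
  - **Mean pooling** averages the token embeddings, producing a **fixed-length sentence representation**.
  - Note that `forward` calls the custom `self.mean_pooling` method, which uses the attention mask to exclude padding tokens. The `self.pooling` module shown above is registered in `__init__` but never invoked.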
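
Why the mask matters can be seen on a toy tensor. The following is a minimal sketch with made-up embeddings rather than real BERT outputs; `masked_mean` is a hypothetical standalone copy of the model's `mean_pooling` logic.

```python
import torch

def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings, counting only positions where the mask is 1."""
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts

# One sentence, three positions, hidden size 2; the last position is padding.
embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

masked = masked_mean(embeddings, mask)  # averages the two real tokens only
unmasked = embeddings.mean(dim=1)       # the padding vector leaks into the average
print(masked)
print(unmasked)
```

The masked result depends only on the two real tokens, whereas a plain average over the sequence dimension is pulled toward whatever values sit in the padded positions.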

---

### **3. Introducing Task-Specific Heads**
The model now includes **two separate output heads**, one for each task.

#### **Task A: Sentence Classification Head**
```python
self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)
```
- A **fully connected layer** that maps the **sentence embedding** to `num_classes` outputs.
- Each output corresponds to a possible sentence label (e.g., Positive, Negative, Neutral).

#### **Task B: Named Entity Recognition (NER) Head**
```python
self.token_classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
```
- A **token classification head** that assigns a **label to each token** in the input sequence.
- Output shape: `(batch_size, seq_length, num_labels)`, where:
  - `batch_size` = number of sentences in a batch.
  - `seq_length` = number of tokens per sentence.
  - `num_labels` = number of possible entity classes (e.g., PERSON, LOCATION, ORGANIZATION).
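
Both heads return raw logits, so turning them into predictions takes one more step. A short sketch, using random tensors as stand-ins for the two head outputs (the sizes are illustrative assumptions, not values from a trained model):

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, num_classes, num_labels = 2, 4, 3, 5

# Stand-ins for the two head outputs (random values, not real model outputs).
class_logits = torch.randn(batch_size, num_classes)
token_logits = torch.randn(batch_size, seq_len, num_labels)

# Softmax over the last dimension turns logits into probabilities.
class_probs = torch.softmax(class_logits, dim=-1)
token_probs = torch.softmax(token_logits, dim=-1)

# Argmax picks the highest-scoring class per sentence and label per token.
sentence_pred = class_logits.argmax(dim=-1)  # shape: (batch_size,)
token_pred = token_logits.argmax(dim=-1)     # shape: (batch_size, seq_len)
print(sentence_pred.shape, token_pred.shape)
```

Mapping the integer predictions back to names such as PERSON or LOCATION requires a label vocabulary, which the notebook does not define.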

---

### **4. Forward Pass Adjustments**
```python
def forward(self, sentences):
    inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    outputs = self.encoder(**inputs)

    # Mean pooling for sentence classification
    pooled_output = self.mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

    # Task A: Sentence Classification
    class_logits = self.classifier(pooled_output)

    # Task B: Named Entity Recognition (NER)
    token_logits = self.token_classifier(outputs.last_hidden_state)  # (batch_size, seq_len, num_labels)

    return class_logits, token_logits
```
- The **token embeddings** from BERT are passed through:
  - **Mean pooling** → Sentence Classification Head.
  - **Directly** → Token Classification Head.

---

### **5. Task-Specific Loss Functions**
```python
classification_loss_fn = nn.CrossEntropyLoss()
token_classification_loss_fn = nn.CrossEntropyLoss()
```
- **Sentence classification loss (`classification_loss_fn`)**  
  - Compares predicted class logits with the true **sentence label**.
- **NER loss (`token_classification_loss_fn`)**  
  - Compares predicted token logits with the true **token labels**.
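
The notebook's demo computes the NER loss over every position, including padding. An optional refinement, sketched here with random stand-in tensors, is to label padded positions `-100`, which is the default `ignore_index` of `nn.CrossEntropyLoss`, so that they do not contribute to the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_labels = 5
token_logits = torch.randn(2, 4, num_labels)   # (batch_size, seq_len, num_labels)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])   # second sentence has two padded positions
token_labels = torch.randint(0, num_labels, (2, 4))

# Positions labelled -100 are skipped by CrossEntropyLoss (its default ignore_index).
masked_labels = token_labels.masked_fill(attention_mask == 0, -100)

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(token_logits.view(-1, num_labels), masked_labels.view(-1))

# The same value computed by hand over the six real tokens only.
keep = attention_mask.view(-1).bool()
manual = nn.functional.cross_entropy(
    token_logits.view(-1, num_labels)[keep], token_labels.view(-1)[keep]
)
print(loss.item(), manual.item())
```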

---

### **6. Combining Losses for Joint Training**
```python
# Compute Losses
classification_loss = classification_loss_fn(class_logits, sentence_labels)
token_classification_loss = token_classification_loss_fn(token_logits.view(-1, token_logits.size(-1)), token_labels.view(-1))

# Total Loss
total_loss = classification_loss + token_classification_loss
```
- **Both task-specific losses are added together**.
- This ensures **simultaneous training of both tasks**.
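
Summing the losses means the shared layers receive the sum of both tasks' gradients. A small sketch that checks this on hypothetical stand-in layers (`shared`, `head_a`, and `head_b` are tiny linear layers, not the BERT model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
shared = nn.Linear(4, 4)  # stand-in for the shared encoder
head_a = nn.Linear(4, 3)  # stand-in for the sentence classification head
head_b = nn.Linear(4, 5)  # stand-in for the NER head

x = torch.randn(8, 4)
labels_a = torch.randint(0, 3, (8,))
labels_b = torch.randint(0, 5, (8,))
loss_fn = nn.CrossEntropyLoss()

def shared_grad(use_a, use_b):
    """Gradient of the selected loss terms w.r.t. the shared layer's weight."""
    shared.zero_grad()
    hidden = shared(x)
    loss = torch.zeros(())
    if use_a:
        loss = loss + loss_fn(head_a(hidden), labels_a)
    if use_b:
        loss = loss + loss_fn(head_b(hidden), labels_b)
    loss.backward()
    return shared.weight.grad.clone()

grad_a = shared_grad(True, False)
grad_b = shared_grad(False, True)
grad_total = shared_grad(True, True)
print(torch.allclose(grad_total, grad_a + grad_b, atol=1e-6))
```

Each head, by contrast, only receives gradients from its own loss term.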

---

### **7. Optimization Adjustments**
```python
optimizer = optim.AdamW(multi_task_model.parameters(), lr=5e-5)
```
- **A single optimizer (`AdamW`)** updates the **shared transformer** and **both task-specific heads**.

---

## **Summary of Architectural Changes**
| **Change** | **Reason** | **Benefit** |
|------------|-----------|-------------|
| **Shared Transformer Encoder** | Common feature extraction for both tasks | Reduces redundancy and improves efficiency |
| **Mean Pooling Layer** | Aggregates token embeddings for sentence classification | Converts token-level embeddings into a fixed-size sentence vector |
| **Separate Task-Specific Heads** | One for **sentence classification**, one for **NER** | Allows independent prediction for different NLP tasks |
| **Task-Specific Loss Functions** | `CrossEntropyLoss` for both tasks | Ensures correct optimization for different outputs |
| **Combined Loss Function** | Sum of classification and NER losses | Enables multi-task training |
| **Single Optimizer** | Updates both task heads and transformer backbone | Ensures joint learning |

---

## **Benefits of Multi-Task Learning**
1. **Parameter Efficiency**  
   - The **shared transformer** runs once for both tasks, so training and inference need one backbone instead of two.

2. **Knowledge Transfer**  
   - The **NER task** can benefit from the **sentence-level signal** learned for classification (and vice versa), though the gain depends on how related the tasks are.

3. **Improved Generalization**  
   - Solving **multiple tasks at once** acts as a regularizer and can yield **more robust sentence representations**.

---

### **Discussion: Training Strategy for Multi-Task Sentence Transformer**

When training a **multi-task model**, we need to decide:
1. **Which parts of the network to train**
2. **Which parts to freeze (keep unchanged)**

---

## **1. When to Freeze the Transformer Backbone?**
The **transformer backbone (`self.encoder`)** is a **pre-trained model** that extracts contextual embeddings. Freezing it means that **only the task-specific heads (classification layers) are trained**.

### **Scenarios for Freezing the Transformer Backbone**
✅ **When Using a Small Dataset**
- If we have **limited data**, fine-tuning the transformer can cause **overfitting**.
- Keeping it frozen ensures we **retain the general knowledge** from the pre-trained model.

✅ **When the Pre-Trained Embeddings Are Sufficient**
- If BERT's embeddings already represent the data well, training only the classification heads **reduces training time**.

✅ **When Transferring to a New Task Quickly**
- If we need **quick adaptation** to new tasks, freezing BERT and training only task heads **speeds up training**.

### **How to Freeze the Transformer?**
Modify the model initialization:
```python
for param in self.encoder.parameters():
    param.requires_grad = False
```
This ensures only the **task-specific heads** are trained.
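
A quick way to confirm the freeze took effect is to count trainable parameters and pass only those to the optimizer. This is a sketch on a hypothetical stand-in model (`TinyMultiTask`) that reuses the notebook's attribute names but replaces BERT with a small linear layer:

```python
import torch.nn as nn
import torch.optim as optim

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in with the same attribute names as the notebook's model."""
    def __init__(self, hidden=8, num_classes=3, num_labels=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)
        self.token_classifier = nn.Linear(hidden, num_labels)

model = TinyMultiTask()

# Freeze the backbone.
for param in model.encoder.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.AdamW(trainable, lr=5e-5)

num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {num_trainable} / {num_total}")
```

Filtering the parameter list is optional, since the optimizer skips parameters without gradients, but it makes the intent explicit.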

---

## **2. When to Fine-Tune the Transformer Backbone?**
Fine-tuning the transformer means training **all layers**, including `self.encoder`.

### **Scenarios for Fine-Tuning the Transformer Backbone**
✅ **When the Data is Different from the Pre-Trained Domain**
- If the pre-trained model was trained on general text (e.g., Wikipedia), but our data is **specialized** (e.g., medical text), **fine-tuning helps adapt embeddings**.

✅ **When the Dataset is Large**
- If we have **plenty of labeled data**, fine-tuning improves performance by adapting the embeddings.

✅ **When Both Tasks Require Deep Contextual Understanding**
- If **NER and sentence classification share deeper dependencies**, updating the transformer improves **shared representations**.

---

## **3. When to Freeze One Task-Specific Head and Train the Other?**
Sometimes, one task has **more labeled data** than the other. We might freeze a task-specific head and train only the other.

### **Scenarios for Freezing One Head**
✅ **When One Task Has Already Converged**
- If **sentence classification performs well** but **NER is underperforming**, we can **freeze the classification head** and train **only the NER head**.

✅ **When We Want to Avoid Forgetting Previously Learned Tasks**
- If we previously trained the **sentence classification model** and now want to add **NER**, freezing the classification head **keeps its weights fixed**. Its predictions can still drift if the shared encoder keeps training, so freeze the encoder as well when they must stay unchanged.

✅ **When We Want to Balance Training**
- If **NER has much less data** than classification, freezing the classification head lets training **focus on the NER head** instead of continuing to refine an already well-trained one.

### **How to Freeze One Task-Specific Head?**
Modify the model initialization:
```python
for param in self.classifier.parameters():
    param.requires_grad = False  # Freeze classification head, train only NER head
```
This keeps the **classification head fixed**. The NER head continues to train, and so does the encoder unless it is frozen as well.
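
Freezing a head does not stop its loss from reaching the shared encoder. The sketch below illustrates this on hypothetical stand-in layers (small linear layers in place of BERT and the two heads):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(4, 4)           # stand-in for the shared backbone
classifier = nn.Linear(4, 3)        # Task A head (frozen below)
token_classifier = nn.Linear(4, 5)  # Task B head

for param in classifier.parameters():
    param.requires_grad = False

x = torch.randn(6, 4)
hidden = encoder(x)
loss_fn = nn.CrossEntropyLoss()
loss = (loss_fn(classifier(hidden), torch.randint(0, 3, (6,)))
        + loss_fn(token_classifier(hidden), torch.randint(0, 5, (6,))))
loss.backward()

head_a_has_grad = classifier.weight.grad is not None        # frozen head: no gradient
encoder_has_grad = encoder.weight.grad is not None          # still trained
head_b_has_grad = token_classifier.weight.grad is not None  # still trained
print(head_a_has_grad, encoder_has_grad, head_b_has_grad)
```

To train the NER head in isolation, the encoder would need to be frozen too, or the classification loss left out of the total.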

---

## **4. Summary of Training Strategies**
| **Scenario** | **Freeze Transformer?** | **Freeze One Head?** | **Train Only Heads?** |
|-------------|------------------|------------------|----------------|
| **Small dataset** | ✅ Yes | ❌ No | ✅ Yes |
| **Plenty of data** | ❌ No | ❌ No | ❌ No |
| **Fast adaptation to a new task** | ✅ Yes | ❌ No | ✅ Yes |
| **Pre-trained embeddings are good enough** | ✅ Yes | ❌ No | ✅ Yes |
| **Different domain data** | ❌ No | ❌ No | ❌ No |
| **NER needs improvement, Classification is already good** | ❌ No | ✅ Yes (freeze classification) | ❌ No |

---

### **Final Thoughts**
- Freezing the **transformer backbone** speeds up training but reduces adaptability.
- Freezing **one task head** helps balance multi-task learning.
- Fine-tuning everything is **best for large datasets** or when adapting to **domain-specific data**.

### **When to Use a Multi-Task Model vs. Separate Models**
Choosing between a **multi-task model** and **separate models** depends on several factors, including **task similarity, data availability, computational resources, and performance trade-offs**.

---

## **1. When to Implement a Multi-Task Model**
A **multi-task model** shares a common transformer backbone while having separate task-specific heads. It is beneficial in the following cases:

### ✅ **When Tasks Share Common Features or Representations**
- Both **sentence classification and NER** rely on **semantic understanding of text**.
- The **same transformer embeddings** can serve both tasks effectively.

**Example Use Case:**  
- **Medical NLP System** where:
  - **Task A:** Classify medical reports as "Normal" or "Abnormal".
  - **Task B:** Identify medical entities (e.g., diseases, drugs).
  - Since both tasks rely on **understanding medical terminology**, sharing the transformer backbone helps.

### ✅ **When There is Limited Data for One or Both Tasks**
- **Multi-task learning allows parameter sharing**, which helps improve generalization, especially when one task has **less labeled data**.
- The model **leverages the task with more data** to improve performance on the task with **fewer samples**.

**Example Use Case:**  
- **Sentiment Analysis (Task A) has 50,000 labeled sentences**.
- **NER (Task B) has only 5,000 labeled sentences**.
- Training a multi-task model can help **improve NER performance** due to shared sentence representations.

### ✅ **When Computational Efficiency is Required**
- **Training one model** is more efficient than training two separate models.
- **Faster inference** since the same transformer processes both tasks **simultaneously**.

**Example Use Case:**  
- **Chatbot Applications**
  - Task A: Detect **user intent** (e.g., "Order Food", "Check Weather").
  - Task B: Extract **named entities** (e.g., food item names, locations).
  - A multi-task model reduces the need for **two separate API calls**, making the chatbot faster.

### ✅ **When Tasks Benefit from Joint Learning**
- Training both tasks together can act as **regularization**, reducing overfitting.
- If **one task improves** as the other task improves, multi-task learning is beneficial.

**Example Use Case:**  
- **Machine Translation & Part-of-Speech (POS) Tagging**
  - Both tasks require **understanding sentence structure**.
  - Training together can improve **word representation quality**.

---

## **2. When to Use Separate Models**
In some cases, training **two independent models** is a better choice.

### ❌ **When Tasks Have Very Different Data Distributions**
- If Task A and Task B come from **completely different datasets**, forcing them into a multi-task model may degrade performance.

**Example Use Case:**  
- **Task A:** Classify **legal documents** into categories.
- **Task B:** Identify **product reviews** as positive/negative.
- **Issue:** Legal text and product reviews have vastly different language patterns.

### ❌ **When Tasks Require Different Model Architectures**
- Some NLP tasks require specialized architectures that do not fit well within a shared transformer.

**Example Use Case:**  
- **Task A:** Summarization (requires a transformer with an encoder-decoder architecture).
- **Task B:** Named Entity Recognition (NER) (requires token classification).
- Since **summarization needs a decoder**, it makes sense to use **two separate models**.

### ❌ **When There is No Performance Gain from Multi-Task Learning**
- If training both tasks together leads to **no improvement** or **worse performance**, it's better to train them separately.

**Example Use Case:**  
- **Spam Detection (Task A) & Machine Translation (Task B)**
  - The two tasks have **completely different objectives**.
  - A shared model may cause **negative transfer**, where one task **hurts the other**.

### ❌ **When You Need More Flexibility in Training**
- Training two separate models allows:
  - **Different learning rates** for each task.
  - **Different pre-training approaches**.
  - **Task-specific optimizations**.

**Example Use Case:**  
- **Task A:** Sentiment Analysis on social media (fast training, large dataset).
- **Task B:** Named Entity Recognition for legal documents (slow training, small dataset).
- A separate model for each allows more **fine-tuned training** strategies.

---

## **3. Summary Table: Multi-Task vs. Separate Models**
| **Criteria** | **Multi-Task Model** ✅ | **Separate Models** ❌ |
|-------------|-------------------|----------------|
| **Shared Features** | Yes, if tasks use similar features (e.g., Sentence Classification + NER) | No, if tasks are different (e.g., Translation + Spam Detection) |
| **Limited Data** | Helps if one task has less labeled data | Not helpful if tasks have **no shared features** |
| **Computational Efficiency** | Faster since transformer is shared | More expensive since two models must be trained |
| **Performance Boost** | Works well if tasks complement each other | Better if tasks are very different |
| **Training Complexity** | Joint optimization is needed | Simpler, each model can be trained separately |
| **Inference Time** | Faster, one model runs both tasks | Slower, requires running two models |
| **Negative Transfer Risk** | Possible if tasks are unrelated | No risk, tasks do not interfere |

---

## **4. Final Decision: When to Use Multi-Task Learning?**
Use a **multi-task model** when:
✅ **The tasks share underlying features** (e.g., sentence classification and NER).  
✅ **One task has less labeled data**, and we want to improve it.  
✅ **Computational efficiency is needed**, and inference time matters.  
✅ **Tasks benefit from joint learning** (e.g., medical entity extraction and document classification).  

Use **separate models** when:
❌ **The tasks have different data distributions**.  
❌ **One task requires a different architecture**.  
❌ **There is no performance improvement from multi-task learning**.  
❌ **We need full flexibility in training each task independently**.  

---

### **Handling Data Imbalance in Multi-Task Learning**
When training a multi-task model where **Task A (Sentence Classification) has abundant data** but **Task B (NER) has limited data**, we need to implement **strategies to balance the training process**. Otherwise, Task A's updates may **dominate the shared encoder**, and the model may perform poorly on Task B.

---

## **1. Adjusting Task-Specific Loss Weights**
Since Task B has **less data**, we can increase its contribution to the total loss.

### **Modification in the Loss Computation**
Update the **total loss calculation** by assigning different weights:
```python
# Define task-specific loss weights
task_A_weight = 1.0  # Since Task A has abundant data
task_B_weight = 2.0  # Increase Task B weight due to limited data

# Compute weighted total loss
total_loss = (task_A_weight * classification_loss) + (task_B_weight * token_classification_loss)
```
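- A **higher weight for Task B** scales up its gradients, so the smaller dataset still has a meaningful influence on the shared encoder. The value `2.0` is illustrative; treat the weights as hyperparameters to tune.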

---

## **2. Using Different Learning Rates for Each Task**
With more data, Task A's head **tends to converge sooner**, while Task B's head **may need larger steps** to keep up.

### **Solution: Use Per-Group Learning Rates**
Configure a single optimizer with **parameter groups**, each with its own learning rate:
```python
optimizer = optim.AdamW([
    {'params': multi_task_model.encoder.parameters(), 'lr': 5e-5},  # Transformer backbone
    {'params': multi_task_model.classifier.parameters(), 'lr': 3e-5},  # Task A head
    {'params': multi_task_model.token_classifier.parameters(), 'lr': 8e-5}  # Task B head
])
```
- **Lower LR for Task A** to prevent overfitting.
- **Higher LR for Task B** to allow more adaptation.
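
Parameter groups can be inspected through `optimizer.param_groups`. A minimal sketch on hypothetical stand-in layers; here the Task A group omits `lr` to show that it falls back to the optimizer-level default:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-ins for the backbone and the two heads.
encoder = nn.Linear(8, 8)
classifier = nn.Linear(8, 3)
token_classifier = nn.Linear(8, 5)

optimizer = optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 5e-5},
        {"params": classifier.parameters()},  # no lr given: uses the default below
        {"params": token_classifier.parameters(), "lr": 8e-5},
    ],
    lr=3e-5,  # default for groups that do not set their own lr
)

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

The groups keep the order in which they were passed, which is also the order a learning-rate scheduler would see them in.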

---

## **3. Data Sampling: Oversampling Task B**
Since **Task B has less data**, we can **train on its samples multiple times per epoch**.

### **Solution: Use Oversampling**
Modify the data loader to **oversample** Task B:
```python
from torch.utils.data import WeightedRandomSampler

# Assumes `dataset` is the concatenation of task_A_data followed by task_B_data,
# so each sample's task identifier (0 for Task A, 1 for Task B) is known by position
task_labels = [0] * len(task_A_data) + [1] * len(task_B_data)

# Assign higher sampling weights to Task B samples
weights = [1.0 if label == 0 else 3.0 for label in task_labels]
sampler = WeightedRandomSampler(weights, num_samples=len(task_labels), replacement=True)

# Use the sampler in the DataLoader (a sampler cannot be combined with shuffle=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)
```
- Ensures **Task B samples appear more frequently** in training.
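
The effect of the sampler can be measured on toy data. In this sketch the two datasets are hypothetical placeholders whose items are just task ids, sized 900 and 100 purely for illustration:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Toy datasets: each item is a single task id (0 = Task A, 1 = Task B).
task_A_data = TensorDataset(torch.zeros(900, dtype=torch.long))
task_B_data = TensorDataset(torch.ones(100, dtype=torch.long))
dataset = ConcatDataset([task_A_data, task_B_data])

# Per-sample weights: each Task B sample is three times as likely to be drawn.
weights = [1.0] * len(task_A_data) + [3.0] * len(task_B_data)
sampler = WeightedRandomSampler(
    weights,
    num_samples=len(dataset),
    replacement=True,
    generator=torch.Generator().manual_seed(0),
)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Fraction of drawn samples that belong to Task B in one pass.
drawn = torch.cat([batch[0] for batch in dataloader])
task_B_share = drawn.float().mean().item()
print(f"Task B share of draws: {task_B_share:.2f}")
```

With these sizes and weights, Task B makes up 10% of the data but is expected to account for about 300 / (900 + 300) = 25% of the draws.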

---

## **4. Task-Specific Training Scheduling**
Instead of training both tasks in every step, we can **alternate between tasks**.

### **Solution: Alternating Task Updates**
Modify the training loop so that **Task B drives two of every three updates**:
```python
for epoch in range(num_epochs):
    for step, batch in enumerate(dataloader):
        sentences, sentence_labels, token_labels = batch

        # Forward pass
        class_logits, token_logits = multi_task_model(sentences)

        # Task A drives every third step; Task B drives the other two
        if step % 3 == 0:
            loss = classification_loss_fn(class_logits, sentence_labels)
        else:
            loss = token_classification_loss_fn(
                token_logits.view(-1, token_logits.size(-1)), token_labels.view(-1)
            )

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
- Each step then optimizes a single task's loss, and the schedule controls **how many updates each task receives**.
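
One way to give Task B more updates than Task A is a step-based schedule. The helper below is a sketch; `task_for_step` and its `cycle` parameter are hypothetical names, not part of the notebook.

```python
def task_for_step(step, cycle=3):
    """Return which task drives the update at this step.

    Task A gets the first step of each cycle; Task B gets the remaining cycle - 1 steps.
    """
    return "A" if step % cycle == 0 else "B"

schedule = [task_for_step(step) for step in range(6)]
print(schedule)  # ['A', 'B', 'B', 'A', 'B', 'B']
```

Raising `cycle` shifts more of the updates to Task B, while `cycle=2` gives a strict one-to-one alternation.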

---

## **5. Freezing Task A Temporarily**
Since Task A has **a lot of data**, we can **freeze it for a few epochs** to allow Task B to **catch up**.

### **Solution: Freeze Task A in Early Epochs**
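```python
# Run at the start of each epoch
freeze_task_a = epoch < 5  # Freeze the Task A head for the first 5 epochs, then unfreeze it
for param in multi_task_model.classifier.parameters():
    param.requires_grad = not freeze_task_a
```
- This holds the **Task A head fixed early on**. Task A's loss still sends gradients to the shared encoder unless that loss is also dropped or down-weighted during these epochs.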

---

## **Summary of Strategies**
| **Method** | **Implementation** | **Effect** |
|------------|--------------------|------------|
| **Weighted Loss** | Increase Task B’s loss contribution | Balances training despite fewer samples |
| **Separate Learning Rates** | Use different LRs for Task A and B | Allows Task B to train faster |
| **Oversampling** | Sample Task B data more frequently | Ensures Task B is seen more times |
| **Alternating Training** | Update Task B more frequently | Helps Task B improve without Task A overwhelming it |
| **Freezing Task A** | Temporarily freeze Task A head | Allows Task B to catch up |

---

### **Final Thoughts**
These techniques can help **Task B (NER) perform well** even with fewer labeled examples, while limiting the impact on **Task A (sentence classification)**. Which combination works best depends on the data, so compare them on a validation set.