## Task 1: Sentence Transformer Implementation

### Loading Pre-trained Model and Tokenizer
In this section, we load the pre-trained model and tokenizer from the `sentence-transformers` library. The `bert-base-nli-mean-tokens` model is specifically designed to generate sentence embeddings.

### Defining the Encoding Function
Here, we define a function to encode sentences into fixed-length embeddings. The function takes a list of sentences as input, tokenizes them, and then feeds them through the model. The mean of the last hidden states is computed to obtain a single embedding for each sentence.

### Sample Sentences and Embedding Display
In this section, we test our encoding function with a few sample sentences and display the resulting embeddings. Each embedding is a fixed-length vector representation of the input sentence.


In [1]:
import torch
from transformers import AutoTokenizer, AutoModel

# Loading Pre-trained Model and Tokenizer
# We use the `sentence-transformers/bert-base-nli-mean-tokens` model to generate sentence embeddings.
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Defining the Encoding Function
# This function encodes input sentences into fixed-length embeddings using mean pooling on the last hidden states.
def encode_sentences(sentences, max_length=128):
    # Encode input sentences with truncation and padding
    inputs = tokenizer(sentences, padding=True, truncation=True, max_length=max_length, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Obtain the last hidden states and compute mean pooling
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings

# Sample Sentences and Embedding Display
# We test the model with sample sentences and display the resulting embeddings.
sentences = ["This is a test sentence.", "Sentence transformers are great!", "This task is interesting."]
embeddings = encode_sentences(sentences)

# Check the shape and statistics of the embeddings
for i, embedding in enumerate(embeddings):
    print(f"Sentence: {sentences[i]}")
    print(f"Embedding shape: {embedding.shape}")
    print(f"Embedding mean: {embedding.mean().item()}, std: {embedding.std().item()}\n")


  from .autonotebook import tqdm as notebook_tqdm


Sentence: This is a test sentence.
Embedding shape: torch.Size([768])
Embedding mean: -0.016957160085439682, std: 0.590369462966919

Sentence: Sentence transformers are great!
Embedding shape: torch.Size([768])
Embedding mean: -0.01905854046344757, std: 0.6100025177001953

Sentence: This task is interesting.
Embedding shape: torch.Size([768])
Embedding mean: -0.019545016810297966, std: 0.6200253367424011



## Task 2: Multi-Task Learning Expansion

In [2]:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Loading Pre-trained Model and Tokenizer
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModel.from_pretrained(model_name)

# Defining Multi-Task Model Architecture
class MultiTaskModel(nn.Module):
    def __init__(self, base_model, num_classes_task_a, num_classes_task_b):
        super(MultiTaskModel, self).__init__()
        self.base_model = base_model
        # Task A: Sentence Classification Head
        self.classifier_task_a = nn.Linear(base_model.config.hidden_size, num_classes_task_a)
        # Task B: Sentiment Analysis Head
        self.classifier_task_b = nn.Linear(base_model.config.hidden_size, num_classes_task_b)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation (first token) as sentence embedding
        cls_output = outputs.last_hidden_state[:, 0, :]
        # Task A: Sentence Classification
        logits_task_a = self.classifier_task_a(cls_output)
        # Task B: Sentiment Analysis
        logits_task_b = self.classifier_task_b(cls_output)
        return logits_task_a, logits_task_b

# Example usage with 3 classes for Task A and 2 classes for Task B
num_classes_task_a = 3  # Example: classifying sentences into 3 categories
num_classes_task_b = 2  # Example: positive, negative sentiment
multi_task_model = MultiTaskModel(base_model, num_classes_task_a, num_classes_task_b)

# Sample Input for Testing
sentences_task_a = ["This is a test sentence.", "Sentence transformers are great!", "This task is interesting."]
inputs_task_a = tokenizer(sentences_task_a, padding=True, truncation=True, max_length=128, return_tensors='pt')

sentences_task_b = ["I love this movie!", "This is the worst day of my life.", "I feel fantastic!"]
inputs_task_b = tokenizer(sentences_task_b, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Forward pass for Task A and Task B
logits_task_a, _ = multi_task_model(inputs_task_a['input_ids'], inputs_task_a['attention_mask'])
_, logits_task_b = multi_task_model(inputs_task_b['input_ids'], inputs_task_b['attention_mask'])

# Check the outputs
print(f"Logits for Task A (Sentence Classification): {logits_task_a}")
print(f"Logits for Task B (Sentiment Analysis): {logits_task_b}")

# Defining loss functions for both tasks
criterion_task_a = nn.CrossEntropyLoss()
criterion_task_b = nn.CrossEntropyLoss()

# Example labels (randomly generated for illustration)
labels_task_a = torch.tensor([0, 2, 1])  # Example labels for Task A
labels_task_b = torch.tensor([1, 0, 1])  # Example labels for Task B

# Calculating losses
loss_task_a = criterion_task_a(logits_task_a, labels_task_a)
loss_task_b = criterion_task_b(logits_task_b, labels_task_b)
total_loss = loss_task_a + loss_task_b

print(f"Loss for Task A: {loss_task_a.item()}")
print(f"Loss for Task B: {loss_task_b.item()}")
print(f"Total Loss: {total_loss.item()}")

# Define an optimizer
optimizer = torch.optim.Adam(multi_task_model.parameters(), lr=1e-5)

num_epochs = 3
for epoch in range(num_epochs):
    multi_task_model.train()
    optimizer.zero_grad()
    
    logits_task_a, _ = multi_task_model(inputs_task_a['input_ids'], inputs_task_a['attention_mask'])
    _, logits_task_b = multi_task_model(inputs_task_b['input_ids'], inputs_task_b['attention_mask'])
    
    loss_task_a = criterion_task_a(logits_task_a, labels_task_a)
    loss_task_b = criterion_task_b(logits_task_b, labels_task_b)
    total_loss = loss_task_a + loss_task_b
    
    total_loss.backward()
    optimizer.step()
    
    print(f"Epoch {epoch + 1}, Loss: {total_loss.item()}")

Logits for Task A (Sentence Classification): tensor([[-0.6985,  0.1159,  0.1417],
        [-0.4934, -0.5920, -0.0066],
        [-0.4135, -0.3569, -0.2814]], grad_fn=<AddmmBackward0>)
Logits for Task B (Sentiment Analysis): tensor([[-0.3619,  0.3586],
        [-0.3751,  0.4887],
        [-0.3012,  0.4846]], grad_fn=<AddmmBackward0>)
Loss for Task A: 1.2000242471694946
Loss for Task B: 0.6624807715415955
Total Loss: 1.8625049591064453
Epoch 1, Loss: 1.9825471639633179
Epoch 2, Loss: 1.8293399810791016
Epoch 3, Loss: 1.3688188791275024


## Task 3: Training Considerations

#### Freezing the Entire Network
- **Impact and Advantages**: The model will not learn new information and will only use pre-trained knowledge. This approach is beneficial if the pre-trained model already performs well on the new tasks.
- **Training Strategy**: No training is needed for the frozen model, just use the pre-trained embeddings directly for downstream tasks.

In [3]:
# Scenario 1: Freeze the entire network
for param in multi_task_model.base_model.parameters():
    param.requires_grad = False


#### Freezing Only the Transformer Backbone
- **Impact and Advantages**: This allows the model to retain the pre-trained knowledge while the task-specific heads can learn the new tasks. It reduces training time and computational resources.
- **Training Strategy**: Train only the classification heads (`classifier_task_a` and `classifier_task_b`). Use a smaller learning rate for the heads to fine-tune them.

In [4]:
# Scenario 2: Freeze only the transformer backbone
for param in multi_task_model.base_model.parameters():
    param.requires_grad = False
for param in multi_task_model.classifier_task_a.parameters():
    param.requires_grad = True
for param in multi_task_model.classifier_task_b.parameters():
    param.requires_grad = True


#### Freezing Only One Task-Specific Head
- **Impact and Advantages**: This enables the model to learn a new task without affecting the other pre-trained task head. Useful when one task is more important or has more data than the other.
- **Training Strategy**: Freeze one task-specific head (e.g., `classifier_task_a`) and train the other head (`classifier_task_b`). Adjust the learning rate accordingly.

In [5]:
# Scenario 3: Freeze one task-specific head
for param in multi_task_model.classifier_task_a.parameters():
    param.requires_grad = False
for param in multi_task_model.classifier_task_b.parameters():
    param.requires_grad = True


#### Transfer Learning Scenario
- **Choosing a Pre-trained Model**: Use a model pre-trained on a large, diverse corpus (e.g., `bert-base-nli-mean-tokens`).
- **Freezing/Unfreezing Layers**: Freeze the initial layers of the Transformer (to retain basic language understanding) and unfreeze the higher layers and task-specific heads (to fine-tune for the new tasks).
- **Rationale**: Freezing the lower layers helps retain general language features, while fine-tuning the higher layers and heads allows the model to adapt to the specific tasks with new data.

In [6]:
# Transfer Learning Scenario
for name, param in multi_task_model.base_model.named_parameters():
    if "layer.11" in name or "layer.10" in name:  # Adjust layer numbers based on the specific model
        param.requires_grad = True
    else:
        param.requires_grad = False

# Ensure task-specific heads are trainable
for param in multi_task_model.classifier_task_a.parameters():
    param.requires_grad = True
for param in multi_task_model.classifier_task_b.parameters():
    param.requires_grad = True


In [7]:
# Simplified training loop example
criterion_task_a = nn.CrossEntropyLoss()
criterion_task_b = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, multi_task_model.parameters()), lr=1e-5)

num_epochs = 3

# Updated input and labels for consistency
sentences_task_a = ["This is a test sentence.", "Sentence transformers are great!", "This task is interesting."]
inputs_task_a = tokenizer(sentences_task_a, padding=True, truncation=True, max_length=128, return_tensors='pt')
labels_task_a = torch.tensor([0, 2, 1])  # Example labels for Task A

sentences_task_b = ["I love this movie!", "This is the worst day of my life.", "I feel fantastic!"]
inputs_task_b = tokenizer(sentences_task_b, padding=True, truncation=True, max_length=128, return_tensors='pt')
labels_task_b = torch.tensor([1, 0, 1])  # Example labels for Task B

for epoch in range(num_epochs):
    multi_task_model.train()
    optimizer.zero_grad()
    
    # Forward pass for Task A and Task B
    logits_task_a, _ = multi_task_model(inputs_task_a['input_ids'], inputs_task_a['attention_mask'])
    _, logits_task_b = multi_task_model(inputs_task_b['input_ids'], inputs_task_b['attention_mask'])
    
    # Calculate losses for both tasks
    loss_task_a = criterion_task_a(logits_task_a, labels_task_a)
    loss_task_b = criterion_task_b(logits_task_b, labels_task_b)
    total_loss = loss_task_a + loss_task_b
    
    # Backpropagation
    total_loss.backward()
    optimizer.step()
    
    print(f"Epoch {epoch + 1}, Loss: {total_loss.item()}")


Epoch 1, Loss: 1.1768361330032349
Epoch 2, Loss: 1.3216972351074219
Epoch 3, Loss: 1.0765254497528076


### Brief Summary
- **Freezing Entire Network**: No further training, rely on pre-trained knowledge.
- **Freezing Transformer Backbone**: Train task-specific heads only.
- **Freezing One Task Head**: Train the other task head while retaining the frozen head.
- **Transfer Learning**: Freeze lower layers, fine-tune higher layers and task heads, leveraging pre-trained language understanding for specific tasks.

## Task 4: Layer-wise Learning Rate Implementation (BONUS)

In [8]:
# Define layer-wise learning rates
optimizer_grouped_parameters = [
    {'params': multi_task_model.base_model.embeddings.parameters(), 'lr': 1e-5},
    {'params': multi_task_model.base_model.encoder.layer[:6].parameters(), 'lr': 5e-5},
    {'params': multi_task_model.base_model.encoder.layer[6:].parameters(), 'lr': 1e-4},
    {'params': multi_task_model.classifier_task_a.parameters(), 'lr': 1e-3},
    {'params': multi_task_model.classifier_task_b.parameters(), 'lr': 1e-3},
]

optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

# Simplified training loop example
criterion_task_a = nn.CrossEntropyLoss()
criterion_task_b = nn.CrossEntropyLoss()
num_epochs = 3

for epoch in range(num_epochs):
    multi_task_model.train()
    optimizer.zero_grad()
    
    # Forward pass for Task A and Task B
    logits_task_a, _ = multi_task_model(inputs_task_a['input_ids'], inputs_task_a['attention_mask'])
    _, logits_task_b = multi_task_model(inputs_task_b['input_ids'], inputs_task_b['attention_mask'])
    
    # Calculate losses for both tasks
    loss_task_a = criterion_task_a(logits_task_a, labels_task_a)
    loss_task_b = criterion_task_b(logits_task_b, labels_task_b)
    total_loss = loss_task_a + loss_task_b
    
    # Backpropagation
    total_loss.backward()
    optimizer.step()
    
    print(f"Epoch {epoch + 1}, Loss: {total_loss.item()}")


Epoch 1, Loss: 1.1857175827026367
Epoch 2, Loss: 0.5991258025169373
Epoch 3, Loss: 0.2687854468822479


### Explanation of Layer-wise Learning Rates

Using different learning rates for different layers helps fine-tune pre-trained models more effectively. Here's our setup:

- **Embedding Layer**: We used a very low rate (1e-5) to keep basic token representations stable.
- **First 6 Encoder Layers**: These had a slightly higher rate (5e-5) to allow some fine-tuning while preserving general language features.
- **Last 6 Encoder Layers**: These needed a higher rate (1e-4) to adapt more to the specific tasks.
- **Task-specific Heads**: These new layers had the highest rate (1e-3) to quickly learn task-specific mappings.

### Benefits of Layer-wise Learning Rates

1. **Fine-tuning Control**:
   - Different learning rates let us control how much each part of the model is adjusted, keeping pre-trained knowledge intact while making necessary tweaks.

2. **Stability and Convergence**:
   - Lower rates for lower layers keep training stable. Higher rates for higher layers, which are more task-specific, help them adapt quickly. This combo helps the model converge faster and more stably.

3. **Adaptability**:
   - Different parts of the model learning at different rates makes it easier for the network to adapt to new tasks. This flexibility is key for effective fine-tuning.

### Impact of Multi-task Setting

1. **Enhanced Task-specific Learning**:
   - In a multi-task setup, each task might need different adaptations. Layer-wise learning rates let each task-specific head learn efficiently while the shared backbone is fine-tuned in a controlled way.

2. **Efficient Resource Utilization**:
   - Sharing parameters across tasks can lead to more efficient use of resources. Layer-wise learning rates optimize this, leading to faster training times and better generalization.

3. **Mitigation of Catastrophic Forgetting**:
   - Learning new tasks can sometimes make a model forget old ones. Using lower rates for shared layers and higher rates for task-specific layers helps the model retain old information while adapting to new tasks, reducing the risk of forgetting.

In summary, layer-wise learning rates help fine-tune the model more effectively, keep training stable, and make the model adaptable to multiple tasks without forgetting what it already learned.

### Task 4: Brief Summary

In Task 4, we used layer-wise learning rates to improve the training of a multi-task sentence transformer. Here's what we did:

1. **Layer-wise Learning Rates**:
   - We applied different learning rates to different layers:
     - Embedding Layer: 1e-5
     - First 6 Encoder Layers: 5e-5
     - Last 6 Encoder Layers: 1e-4
     - Task-specific Heads: 1e-3
   - This let us fine-tune the model more precisely, keeping the foundational layers stable while letting the task-specific layers adapt more.

2. **Fine-tuning Control**:
   - Using lower rates for lower layers and higher rates for higher layers and task heads helped fine-tune effectively without losing pre-trained knowledge.

3. **Stability and Convergence**:
   - This approach made training more stable and led to faster convergence, reducing the risk of overfitting.

4. **Adaptability in Multi-task Setting**:
   - In a multi-task setup, layer-wise learning rates improved task-specific learning while keeping shared knowledge intact, which is key for efficient training and better generalization.

5. **Mitigating Catastrophic Forgetting**:
   - Different learning rates helped retain previously learned info while adapting to new tasks, reducing the risk of forgetting.

### Results and Conclusion
- The loss values dropped significantly over the epochs.
- Layer-wise learning rates proved beneficial for fine-tuning, letting the model leverage pre-trained knowledge and adapt to new tasks effectively, leading to better performance and faster convergence.