In [1]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


## Task 1: Sentence Transformer Implementation

In [2]:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SentenceTransformer(nn.Module):
    def __init__(self, pretrained_model_name='bert-base-uncased'):
        super(SentenceTransformer, self).__init__()
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
        self.encoder = BertModel.from_pretrained(pretrained_model_name)
        self.pooling = nn.AdaptiveAvgPool1d(1) 

    def forward(self, sentences):
        input_ids = self.tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)['input_ids']
        outputs = self.encoder(input_ids)
        pooled_output = self.pooling(outputs.last_hidden_state.permute(0, 2, 1)).squeeze(-1)
        return pooled_output

sample_sentences = ["This is a sample sentence.", "Another example sentence."]
model = SentenceTransformer()
embeddings = model(sample_sentences)
print("Embeddings shape:", embeddings.shape)
print("Obtained embeddings:", embeddings)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Embeddings shape: torch.Size([2, 768])
Obtained embeddings: tensor([[-0.0639, -0.4284, -0.0668,  ..., -0.1753, -0.1239,  0.3197],
        [-0.1491, -0.4124, -0.0350,  ..., -0.1290,  0.2041,  0.0163]],
       grad_fn=<SqueezeBackward1>)


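I used `BertModel` from the Hugging Face Transformers library as the transformer backbone.

`BertTokenizer` is used to tokenize the input sentences.

I used an adaptive average pooling layer (`nn.AdaptiveAvgPool1d(1)`) to average the token embeddings along the sequence dimension, which yields one fixed-length 768-dimensional embedding per sentence. As written, the average also includes padding positions, and the encoder is called without an `attention_mask`, which is what triggers the warning in the output above.

The forward method takes a list of sentences, tokenizes them, passes them through the transformer, and pools the token embeddings into sentence embeddings.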
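The warning in the output above suggests masking out padding. A minimal sketch of masked mean pooling follows, assuming the tokenizer's `attention_mask` is available; `masked_mean_pool` is a hypothetical helper, and the tensors are toy stand-ins for `outputs.last_hidden_state` rather than real BERT output.

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real tokens only, ignoring padding."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # (batch, 1), avoids division by zero
    return summed / counts

# Toy stand-in for encoder output: 2 sentences, 4 token positions, hidden size 3.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
# The second sentence has two padding positions at the end.
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)
print("Pooled shape:", pooled.shape)
```

In the model above, this would roughly correspond to keeping the full tokenizer output (`encoded = self.tokenizer(...)`), calling `self.encoder(**encoded)`, and pooling with `encoded['attention_mask']` instead of `nn.AdaptiveAvgPool1d`.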

## Task 2: Multi-Task Learning Expansion

To expand the Sentence Transformer for multi-task learning, we need to modify the architecture to accommodate multiple tasks. Here, I have extended the model to handle two tasks: Sentence Classification (Task A) and Named Entity Recognition (Task B).

Here are the changes made to the architecture:

Task-Specific Heads: A task-specific head is added on top of the transformer backbone for each task. Each head predicts the output for its own task.

Loss Function: For multi-task learning, the losses from both tasks are combined into a single objective, for example as a weighted sum.

Training Data: We need labeled data for both tasks. Each input will be associated with labels for both tasks.

Fine-Tuning: During training, the entire model, including the transformer backbone and the task-specific heads, is fine-tuned using the combined loss function.

In [3]:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class MultiTaskSentenceTransformer(nn.Module):
    def __init__(self, num_classes_task_a, num_classes_task_b, pretrained_model_name='bert-base-uncased'):
        super(MultiTaskSentenceTransformer, self).__init__()
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
        self.encoder = BertModel.from_pretrained(pretrained_model_name)
        
        # Task A classification head
        self.classification_head_task_a = nn.Linear(self.encoder.config.hidden_size, num_classes_task_a)
        
        # Task B classification head
        self.classification_head_task_b = nn.Linear(self.encoder.config.hidden_size, num_classes_task_b)

    def forward(self, sentences):
        input_ids = self.tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)['input_ids']
        outputs = self.encoder(input_ids)
        pooled_output = outputs.last_hidden_state[:, 0] 
        
        logits_task_a = self.classification_head_task_a(pooled_output)
        
        logits_task_b = self.classification_head_task_b(pooled_output)
        
        return logits_task_a, logits_task_b

sample_sentences = ["This is a sample sentence.", "Another example sentence."]
model = MultiTaskSentenceTransformer(num_classes_task_a=3, num_classes_task_b=5)
logits_task_a, logits_task_b = model(sample_sentences)
print("Task A logits shape:", logits_task_a.shape)
print("Task B logits shape:", logits_task_b.shape)


Task A logits shape: torch.Size([2, 3])
Task B logits shape: torch.Size([2, 5])


In this modified version:

Two classification heads are added on top of the transformer backbone, one for each task (Task A and Task B).

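During the forward pass, we take the hidden state of the `[CLS]` token from the transformer and pass it through both task-specific heads to get logits for each task. Note that this gives one prediction per sentence for Task B as well; token-level NER would instead apply the Task B head to every token's hidden state.

The logits are used to compute the loss for each task during training. With `nn.CrossEntropyLoss`, the raw logits are passed in directly because the softmax is applied inside the loss; an explicit softmax is only needed to read out probabilities at inference time.

During training, the losses from both tasks need to be combined, typically as a weighted sum whose weights reflect the importance of each task.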
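The loss combination described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's training code: `combined_loss`, the task weights, and the labels are hypothetical, the hidden states are random stand-ins for `outputs.last_hidden_state`, and Task B is treated as token-level NER (one label per token, with `-100` marking padded positions to ignore).

```python
import torch
import torch.nn as nn

def combined_loss(logits_a, labels_a, logits_b, labels_b,
                  weight_a=1.0, weight_b=1.0):
    """Weighted sum of a sentence-level and a token-level cross-entropy loss."""
    ce_sentence = nn.CrossEntropyLoss()
    ce_token = nn.CrossEntropyLoss(ignore_index=-100)  # skip padded positions
    loss_a = ce_sentence(logits_a, labels_a)
    # Flatten (batch, seq_len, classes) to (batch * seq_len, classes) for the token loss.
    loss_b = ce_token(logits_b.reshape(-1, logits_b.size(-1)), labels_b.reshape(-1))
    return weight_a * loss_a + weight_b * loss_b, loss_a, loss_b

torch.manual_seed(0)
batch, seq_len, hidden_size = 2, 6, 16
hidden_states = torch.randn(batch, seq_len, hidden_size)  # stand-in for encoder output

head_a = nn.Linear(hidden_size, 3)  # sentence classification, 3 classes
head_b = nn.Linear(hidden_size, 5)  # NER, 5 tag classes

logits_a = head_a(hidden_states[:, 0])  # [CLS] position only: (2, 3)
logits_b = head_b(hidden_states)        # every token: (2, 6, 5)

# Made-up labels for illustration only.
labels_a = torch.tensor([0, 2])
labels_b = torch.tensor([[0, 1, 1, 0, 4, 0],
                         [0, 2, 0, 0, -100, -100]])

total, loss_a, loss_b = combined_loss(logits_a, labels_a, logits_b, labels_b, weight_b=0.5)
total.backward()  # gradients flow into both heads from the single combined objective
print("Task B logits shape:", logits_b.shape)
```

A fixed weighted sum is the simplest choice; the weights would normally be tuned on validation data.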

## Task 3: Training Considerations

Discuss the implications and advantages of each scenario and explain your rationale as to how the model should be trained given the following:

If the entire network should be frozen.

--> Implications: No parameters are updated during training, so the model acts as a fixed feature extractor. This only makes sense when the heads have already been trained or when the embeddings are used directly (for example, for similarity search), because the heads defined above are randomly initialized and a fully frozen network cannot learn either task.

Advantages: There is no training cost, which helps when computational resources are limited, and the pre-trained weights cannot overfit to a small labeled dataset.


If only the transformer backbone should be frozen.

--> Implications: The pre-trained weights of the transformer are kept frozen, while the task-specific heads and any additional layers are trainable. This scenario is beneficial when the pre-trained transformer already captures general language understanding and the labeled data is too limited to update the backbone safely.

Advantages: Training is cheaper because only the heads are updated. The heads learn task-specific mappings on top of the pre-trained language representation, and because the backbone never changes, catastrophic forgetting of the pre-trained representations is avoided. The trade-off is that the features themselves cannot adapt to the tasks.


If only one of the task-specific heads (either for Task A or Task B) should be frozen.

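--> Implications: One of the task-specific heads is kept frozen while the backbone and the other head remain trainable. This is useful when the frozen head is already well trained for its task and the goal is to adapt the model to the other task, for example when a new task is added later or when one task has far less labeled data. Because the backbone still changes, the inputs to the frozen head can drift, so its performance should be monitored.

Advantages: Gradient updates are concentrated on the task that needs more adaptation. This is useful when one task is relatively simpler or when you want to prioritize one task over the other.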
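The three freezing scenarios above can be sketched with `requires_grad`. `TinyMultiTask` is a hypothetical stand-in that only mirrors the attribute names of `MultiTaskSentenceTransformer` (a small stack of linear layers replaces BERT so the example runs without downloading weights); `set_trainable` and `count_trainable` are helper names introduced here for illustration.

```python
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in with the same attribute names as the model above."""
    def __init__(self, hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, hidden)
        )
        self.classification_head_task_a = nn.Linear(hidden, 3)
        self.classification_head_task_b = nn.Linear(hidden, 5)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

demo = TinyMultiTask()

# Scenario 2: freeze only the backbone, train both heads.
set_trainable(demo.encoder, False)
backbone_frozen = count_trainable(demo)

# Scenario 3: train the backbone and the Task B head, freeze the Task A head.
set_trainable(demo.encoder, True)
set_trainable(demo.classification_head_task_a, False)
head_a_frozen = count_trainable(demo)

# Scenario 1: freeze the entire network.
set_trainable(demo, False)
all_frozen = count_trainable(demo)

print("Trainable parameters:", backbone_frozen, head_a_frozen, all_frozen)
```

When building the optimizer, it is common to pass only the parameters that still have `requires_grad=True`.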


Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:

The choice of a pre-trained model.

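--> Choose a model that has been pre-trained on a large corpus of text data. Encoder models such as BERT or RoBERTa are a natural fit for sentence classification and NER; decoder-only models such as GPT can also be used but are designed primarily for text generation. The choice depends on factors like model size, computational resources, and performance on similar tasks.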

The layers you would freeze/unfreeze.

--> Freezing: Initially freeze all layers of the pre-trained model.

Unfreezing: Gradually unfreeze layers starting from the top layers closer to the task-specific heads and fine-tune them along with the task-specific layers.

The rationale behind these choices.

--> Freezing all pre-trained layers at first lets the randomly initialized heads settle without disturbing the general language understanding learned during pre-training.

Gradually unfreezing allows the model to adapt to task-specific features while limiting catastrophic forgetting.

Fine-tuning the top layers first is beneficial because they are closest to the task-specific heads and tend to encode the most task-specific features, while the lower layers capture more general features.
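A gradual unfreezing schedule of this kind could look like the sketch below. `TinyEncoder`, `unfreeze_top_layers`, and the stage schedule are assumptions made for illustration; the layer list is named `layer` to mirror how Hugging Face's `BertModel` exposes its transformer blocks (`encoder.layer`), which for the model above would be `model.encoder.encoder.layer`.

```python
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in: a stack of layers, index 0 is the lowest layer."""
    def __init__(self, num_layers=4, hidden=8):
        super().__init__()
        self.layer = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

def unfreeze_top_layers(encoder: nn.Module, num_unfrozen: int) -> None:
    """Freeze every layer, then make only the top `num_unfrozen` layers trainable."""
    layers = list(encoder.layer)
    cutoff = len(layers) - num_unfrozen
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index >= cutoff

def trainable_layer_indices(encoder: nn.Module) -> list:
    return [
        index for index, layer in enumerate(encoder.layer)
        if all(p.requires_grad for p in layer.parameters())
    ]

encoder_demo = TinyEncoder()
schedule = {}
# Stage 0 trains only the heads; later stages open up more of the backbone.
for stage, num_unfrozen in enumerate([0, 1, 2, 4]):
    unfreeze_top_layers(encoder_demo, num_unfrozen)
    schedule[stage] = trainable_layer_indices(encoder_demo)
    # ... train for a few epochs at this stage before moving on ...
    print(f"Stage {stage}: trainable layers {schedule[stage]}")
```

How many epochs to spend at each stage, and how many layers to unfreeze per stage, are hyperparameters that would need tuning.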


Training Process:

Divide the training data into batches for each task.

Use a suitable optimizer (e.g., Adam) with task-specific learning rates and weight decay.

Monitor performance on both tasks using appropriate evaluation metrics.

Employ techniques like early stopping and learning rate scheduling to prevent overfitting and improve convergence.
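The early stopping step mentioned above can be sketched as a small helper. `EarlyStopping` is a hypothetical class written for this example, not a library API, and the validation losses are made-up numbers; in practice the monitored value could be the combined validation loss over both tasks.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""
    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, one per epoch, for illustration only.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print("Stopped at epoch:", stopped_at, "best loss:", stopper.best)
```

A real loop would also save a checkpoint whenever the best loss improves, so the best model can be restored after stopping.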

## Task 4: Layer-wise Learning Rate Implementation (BONUS) 

Implementing layer-wise learning rates involves assigning different learning rates to different layers of the neural network during training. In PyTorch, this is done by passing the optimizer a list of parameter groups, where each group is a dictionary holding a set of parameters and the learning rate to use for them.

In [4]:
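import torch.optim as optim

# One parameter group per component; `model` is the MultiTaskSentenceTransformer from Task 2.
param_groups = [
    {"params": model.encoder.parameters(), "lr": 1e-5},  # pre-trained backbone: small steps
    {"params": model.classification_head_task_a.parameters(), "lr": 1e-4},  # Task A head
    {"params": model.classification_head_task_b.parameters(), "lr": 1e-4}  # Task B head
]

optimizer = optim.Adam(param_groups)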
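The cell above uses one rate for the whole backbone. A finer-grained variant, sketched below under stated assumptions, gives each encoder layer its own rate that decays toward the bottom of the stack. `layerwise_param_groups`, the decay factor of 0.9, and the toy layers are hypothetical choices for illustration; the toy `nn.Linear` stack stands in for the transformer blocks, and the embedding layer is left out for brevity.

```python
import torch.nn as nn
import torch.optim as optim

def layerwise_param_groups(layers, head_modules,
                           head_lr=1e-4, top_layer_lr=1e-5, decay=0.9):
    """Build optimizer parameter groups with lower rates for lower layers."""
    groups = []
    num_layers = len(layers)
    for index, layer in enumerate(layers):
        # The top layer gets top_layer_lr; each layer below is scaled by `decay` again.
        lr = top_layer_lr * decay ** (num_layers - 1 - index)
        groups.append({"params": list(layer.parameters()), "lr": lr})
    for head in head_modules:
        groups.append({"params": list(head.parameters()), "lr": head_lr})
    return groups

# Toy stand-ins: three "encoder layers" and two task heads.
toy_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
toy_heads = [nn.Linear(8, 3), nn.Linear(8, 5)]

groups = layerwise_param_groups(toy_layers, toy_heads)
toy_optimizer = optim.Adam(groups)

lrs = [group["lr"] for group in toy_optimizer.param_groups]
print("Learning rates (bottom layer to heads):", lrs)
```

Each head could also be given its own `head_lr` if the two tasks need to adapt at different speeds.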

Rationale for specific learning rates:

Transformer Backbone:

The backbone captures general language representations. Since it's already pre-trained, we use a relatively lower learning rate to fine-tune it gradually. A small learning rate prevents drastic changes to the pre-trained weights.

Task-Specific Heads:

Task-specific heads require more adaptation to the specific tasks. Hence, we use a higher learning rate compared to the backbone. This allows faster updates to the task-specific parameters and facilitates task-specific feature learning.

Potential benefits of using layer-wise learning rates:

Faster Convergence:

Layer-wise learning rates can speed up convergence: the newly initialized heads, which start far from a good solution, take larger steps, while the pre-trained backbone, which starts close to one, takes smaller steps.

Better Optimization:

A small learning rate on the pre-trained layers keeps the large, noisy gradients produced by the untrained heads early in training from overwriting useful pre-trained weights, leading to more stable optimization.

Task-Specific Adaptation:

In a multi-task setting, each task-specific head can be given its own learning rate so that it adapts at its own pace, potentially improving performance on individual tasks. The example above uses the same rate for both heads.

Improved Generalization:

By keeping updates to the pre-trained layers small, we reduce the risk of overfitting them to a small task-specific dataset, which can improve generalization across tasks.