
ML Apprenticeship Take-Home
Sentence Transformers and Multi-Task Learning


#### Task 1: Sentence Transformer Implementation
Implement a sentence transformer model using any deep learning framework of your choice. This model should be able to encode input sentences into fixed-length embeddings. Test your implementation with a few sample sentences and showcase the obtained embeddings.
Describe any choices you had to make regarding the model architecture outside of the transformer backbone.

In [26]:
# Import all the libraries needed

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from transformers import DistilBertTokenizer, DistilBertModel

import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import os
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from datasets import load_dataset


In [27]:

'''
We chose distilbert-base-uncased because it is a smaller, faster, and lighter version of BERT.
According to its authors, it retains 97% of BERT's language-understanding performance while being 60% faster and 40% smaller.
This makes it a good choice for applications where computational resources are limited.

Transformers: The Hugging Face transformers library provides easy access to pre-trained transformer models like BERT and DistilBERT,
including tokenizers and model classes.

embedding_dim=768: The dimension of the output sentence embeddings. By default, it is set to 768, matching DistilBERT's hidden size.
DistilBERT returns one embedding per token, so the raw output grows with the sequence length; pooling reduces it to a single fixed-length vector per sentence.
The forward method passes the input through DistilBERT, pools the token embeddings, and applies the projection layer to get the sentence embeddings.
The sentence embeddings are then returned.

A linear layer is added after DistilBERT to allow for dimension reduction or transformation of the embeddings.
Here the projection layer keeps the dimension the same as DistilBERT's output dimension (768).
Note that this layer is randomly initialized, so its outputs only become meaningful after training.

Pooling strategy: [CLS] pooling, a commonly used technique, is applied here. The embedding of the first ([CLS]) token from the last hidden state represents the whole sentence.

'''

class SentenceTransformer(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased", embedding_dim=768):
        super(SentenceTransformer, self).__init__()
        self.bert = DistilBertModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.bert.config.hidden_size, embedding_dim)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state
        sentence_embeddings = self.projection(last_hidden_state[:, 0, :])  # [CLS] pooling: use the first token's hidden state
        return sentence_embeddings

# Example usage: encode a few sample sentences into fixed-length sentence embeddings
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = SentenceTransformer()

sentences = ["Hi", "How are you", "This is a sample", "This is a sentence.", "This is to check"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
embeddings = model(encoded_input['input_ids'], encoded_input['attention_mask'])

print(embeddings.shape)
print(embeddings)



torch.Size([5, 768])
tensor([[-0.0992,  0.5831,  0.2383,  ...,  0.0702,  0.0611,  0.0679],
        [-0.1668,  0.5245,  0.3737,  ...,  0.0159, -0.0261,  0.0620],
        [-0.1075,  0.5769,  0.2630,  ..., -0.0784,  0.0303,  0.1201],
        [-0.0251,  0.5881,  0.2733,  ..., -0.0503,  0.0371,  0.0736],
        [-0.0730,  0.5766,  0.2521,  ..., -0.0258,  0.0538,  0.0762]],
       grad_fn=<AddmmBackward0>)
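
The [CLS] pooling used above is one option among several. A commonly used alternative is masked mean pooling, which averages the token embeddings while ignoring padded positions. The following is a minimal sketch of that alternative, not part of the implementation above: the function name `mean_pool` is hypothetical, and random tensors stand in for DistilBERT's `last_hidden_state` so the example runs without downloading a model.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1), avoids division by zero
    return summed / counts

# Random stand-in for DistilBERT output: 2 sentences, 4 tokens each, hidden size 8
torch.manual_seed(0)
hidden = torch.randn(2, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 0],   # last token is padding
                               [1, 1, 0, 0]])  # last two tokens are padding

pooled = mean_pool(hidden, attention_mask)
print(pooled.shape)  # torch.Size([2, 8])
```

To try it in the model above, `last_hidden_state[:, 0, :]` would be replaced by `mean_pool(last_hidden_state, attention_mask)` before the projection layer. Which strategy works better depends on the task and is worth checking empirically.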


In [28]:
# Task 2: Multi-Task Learning Expansion
'''
The primary modification here is adding separate task-specific heads to handle different tasks:
Task A (sentence classification) and Task B (sentiment analysis).
Two separate linear layers are added after the projection layer. These heads produce the task-specific outputs.

The forward method accepts an argument, task, which specifies which task's head should be used to produce the final output.
Depending on the value of task, the model routes the sentence embeddings through the appropriate task head.

This reuses the Task 1 implementation from the cell above, slightly modified for multi-task learning.'''
from transformers import DistilBertModel, DistilBertTokenizer

class MultiTaskSentenceTransformer(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased", embedding_dim=768, num_classes_task_a=5, num_classes_task_b=3):
        super(MultiTaskSentenceTransformer, self).__init__()
        self.bert = DistilBertModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.bert.config.hidden_size, embedding_dim)  # Project the DistilBERT output to a fixed embedding dimension
        self.task_a_head = nn.Linear(embedding_dim, num_classes_task_a)  # Sentence Classification
        self.task_b_head = nn.Linear(embedding_dim, num_classes_task_b)  # Sentiment Analysis

    def forward(self, input_ids, attention_mask, task):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)  # Run the inputs through DistilBERT
        last_hidden_state = outputs.last_hidden_state  # Token embeddings, shape (batch, seq_len, hidden_size)
        sentence_embeddings = self.projection(last_hidden_state[:, 0, :])  # [CLS] pooling, then projection to embedding_dim

        if task == "task_a":
            logits = self.task_a_head(sentence_embeddings)
        elif task == "task_b":
            logits = self.task_b_head(sentence_embeddings)
        else:
            raise ValueError("Invalid task specified.")

        return logits


# Example usage
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = MultiTaskSentenceTransformer()  # Default head sizes for convenience: 5 classes for Task A, 3 classes for Task B

# Sample sentences
sentences = ["Hi", "How are you","This is a sample", "This is a sentence.","This is to check"]

# Tokenize input
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Forward pass for Task A (Sentence Classification)
logits_task_a = model(encoded_input['input_ids'], encoded_input['attention_mask'], task="task_a")
print("Logits for Task A:", logits_task_a)

# Forward pass for Task B (Sentiment Analysis)
logits_task_b = model(encoded_input['input_ids'], encoded_input['attention_mask'], task="task_b")
print("Logits for Task B:", logits_task_b)

Logits for Task A: tensor([[-0.1810, -0.0567, -0.0717,  0.1778,  0.0090],
        [-0.1964, -0.0556, -0.0214,  0.2183,  0.0572],
        [-0.1689, -0.0339, -0.0520,  0.2031,  0.0484],
        [-0.1649, -0.0683, -0.0442,  0.2533,  0.0707],
        [-0.1233, -0.0086, -0.0539,  0.2406,  0.0861]],
       grad_fn=<AddmmBackward0>)
Logits for Task B: tensor([[-0.0492,  0.1269,  0.0531],
        [-0.0455,  0.1618, -0.0046],
        [-0.0605,  0.1634, -0.0419],
        [-0.0185,  0.1620, -0.0037],
        [-0.0525,  0.1719,  0.0045]], grad_fn=<AddmmBackward0>)
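
The logits above come from untrained heads, so they carry no meaning yet. One simple way to train both heads is to alternate between tasks, taking one optimizer step per task batch. The sketch below illustrates that idea under stated assumptions: `TinyMultiTaskModel` is a hypothetical stand-in with the same `forward(input_ids, attention_mask, task)` interface (an embedding layer replaces DistilBERT so nothing is downloaded), and the batches and labels are random toy data.

```python
import torch
import torch.nn as nn

class TinyMultiTaskModel(nn.Module):
    """Hypothetical stand-in with the same interface as MultiTaskSentenceTransformer."""
    def __init__(self, vocab_size=100, hidden=16, num_classes_task_a=5, num_classes_task_b=3):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)  # stands in for DistilBERT
        self.projection = nn.Linear(hidden, hidden)
        self.task_a_head = nn.Linear(hidden, num_classes_task_a)
        self.task_b_head = nn.Linear(hidden, num_classes_task_b)

    def forward(self, input_ids, attention_mask, task):
        # attention_mask is accepted for interface parity but unused by this stand-in
        token_embeddings = self.encoder(input_ids)                        # (batch, seq_len, hidden)
        sentence_embeddings = self.projection(token_embeddings[:, 0, :])  # first-token pooling
        if task == "task_a":
            return self.task_a_head(sentence_embeddings)
        if task == "task_b":
            return self.task_b_head(sentence_embeddings)
        raise ValueError("Invalid task specified.")

torch.manual_seed(0)
model = TinyMultiTaskModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Random toy batch; real inputs would come from the tokenizer and labelled datasets
input_ids = torch.randint(0, 100, (4, 6))
attention_mask = torch.ones_like(input_ids)
labels = {"task_a": torch.randint(0, 5, (4,)), "task_b": torch.randint(0, 3, (4,))}

losses = {}
for task in ("task_a", "task_b"):  # alternate tasks: one optimizer step per task batch
    optimizer.zero_grad()
    logits = model(input_ids, attention_mask, task=task)
    loss = criterion(logits, labels[task])
    loss.backward()
    optimizer.step()
    losses[task] = loss.item()

print(losses)
```

An alternative design is to compute both losses on each step and backpropagate their (possibly weighted) sum, which requires each batch to carry labels for both tasks.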


#### Task 3: Training Considerations


---


1. If the entire network should be frozen:

Freezing the entire network means that no parameters will be updated during training, effectively turning the model into a fixed feature extractor.

Advantages: Computational efficiency, as no gradients need to be computed or stored.
Disadvantages: The model cannot adapt to the specific tasks or domains, and its performance is limited by the pre-trained representations. In this model the projection layer and both task heads are randomly initialized, so freezing them as well leaves the task outputs untrained.

My approach: Freezing the entire network is generally not recommended for multi-task learning scenarios, as it defeats the purpose of fine-tuning
the model on the specific tasks.
It should only be considered if the pre-trained representations and the heads are already well suited to the target tasks, which is rarely the case.
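
As a minimal sketch of what freezing everything looks like in PyTorch, the snippet below disables gradients for every parameter and confirms that nothing is left to train. The `nn.ModuleDict` is a hypothetical stand-in for the multi-task model (small linear layers replace DistilBERT), used only so the example runs without downloading weights.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the multi-task model; "bert" plays the role of the backbone
model = nn.ModuleDict({
    "bert": nn.Linear(16, 16),
    "projection": nn.Linear(16, 16),
    "task_a_head": nn.Linear(16, 5),
    "task_b_head": nn.Linear(16, 3),
})

# Freeze every parameter in the network
for param in model.parameters():
    param.requires_grad = False
model.eval()  # a real backbone has dropout, which should also be switched off

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")  # trainable parameters: 0 / 680

# The frozen model can still be used as a fixed feature extractor
with torch.no_grad():
    features = model["projection"](model["bert"](torch.randn(2, 16)))
print(features.requires_grad)  # False
```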



---


2. If only the transformer backbone should be frozen:

In this scenario, the transformer backbone (here, DistilBERT) is frozen, while the task-specific heads and the projection layer are allowed to be trained.

Advantages: Computational efficiency, as the backbone holds the large majority of the parameters and no gradients are computed for it; leverages the pre-trained language representations while adapting the task-specific components.

Disadvantages: Limited ability to adapt the language representations to the specific tasks or domains.

My approach:
Freezing the transformer backbone is a common practice in transfer learning for NLP tasks. It allows the model to leverage the general language representations learned during pre-training while fine-tuning the task-specific components. This approach strikes a balance between computational efficiency and task-specific adaptation.
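
A minimal sketch of backbone-only freezing follows. `TinyMultiTask` is a hypothetical stand-in whose attribute names mirror the model above (`bert`, `projection`, `task_a_head`, `task_b_head`), with a linear layer in place of DistilBERT. After one training step, the backbone weights should be unchanged while the head weights have moved.

```python
import copy
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in; attribute names mirror MultiTaskSentenceTransformer."""
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(16, 16)  # stands in for the DistilBERT backbone
        self.projection = nn.Linear(16, 16)
        self.task_a_head = nn.Linear(16, 5)
        self.task_b_head = nn.Linear(16, 3)

    def forward(self, x, task):
        embeddings = self.projection(torch.relu(self.bert(x)))
        head = self.task_a_head if task == "task_a" else self.task_b_head
        return head(embeddings)

torch.manual_seed(0)
model = TinyMultiTask()

# Freeze only the backbone
for param in model.bert.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-3)

backbone_before = copy.deepcopy(model.bert.state_dict())
head_before = model.task_a_head.weight.detach().clone()

# One training step on random toy data for Task A
logits = model(torch.randn(4, 16), "task_a")
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (4,)))
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(value, model.bert.state_dict()[name]) for name, value in backbone_before.items()
)
head_changed = not torch.equal(head_before, model.task_a_head.weight)
print(backbone_unchanged, head_changed)
```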



---


3. If only one of the task-specific heads (either for Task A or Task B) should be frozen:

In this scenario, one of the task-specific heads is frozen, while the other head, the transformer backbone,
and the projection layer are allowed to be trained.

Advantages: Preserves what the frozen head has already learned, while the shared backbone and projection layer continue to adapt, which can improve performance on the other task.
Disadvantages: The frozen head cannot adjust to changes in the shared layers. If the backbone and projection layer shift substantially while training on the other task, performance on the frozen head's task may degrade.

My approach:
Freezing one of the task-specific heads can be useful when one task is already well trained and the goal is to improve the other.
For example, if you have a highly accurate model for Task A,
you could freeze the Task A head and fine-tune the rest of the model on Task B, reusing the shared representations learned for Task A.
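
The sketch below freezes the Task A head and takes one training step on Task B. It uses a hypothetical stand-in model (`TinyMultiTask`, with a linear layer in place of DistilBERT) and random toy data. It also checks a caveat worth keeping in mind: the frozen head's weights stay fixed, but its outputs can still change, because the shared layers feeding it are updated.

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in; attribute names mirror MultiTaskSentenceTransformer."""
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(16, 16)  # stands in for the DistilBERT backbone
        self.projection = nn.Linear(16, 16)
        self.task_a_head = nn.Linear(16, 5)
        self.task_b_head = nn.Linear(16, 3)

    def forward(self, x, task):
        embeddings = self.projection(torch.relu(self.bert(x)))
        head = self.task_a_head if task == "task_a" else self.task_b_head
        return head(embeddings)

torch.manual_seed(0)
model = TinyMultiTask()

# Freeze only the Task A head
for param in model.task_a_head.parameters():
    param.requires_grad = False

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

x = torch.randn(4, 16)
head_a_before = model.task_a_head.weight.detach().clone()
logits_a_before = model(x, "task_a").detach().clone()

# One training step on Task B with random toy labels
loss = nn.functional.cross_entropy(model(x, "task_b"), torch.randint(0, 3, (4,)))
loss.backward()
optimizer.step()

head_a_unchanged = torch.equal(head_a_before, model.task_a_head.weight)
logits_a_shifted = not torch.equal(logits_a_before, model(x, "task_a").detach())
print(head_a_unchanged, logits_a_shifted)
```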

Transfer Learning Scenario:


---



Choice of a pre-trained model:

For transfer learning in NLP tasks, it is recommended to choose a pre-trained model that has been trained on a large corpus of data relevant to the target tasks.
Popular choices include BERT, RoBERTa, XLNet, and GPT models, which have been pre-trained on large datasets like Wikipedia, BookCorpus, and web crawl data.

The choice of the pre-trained model can significantly impact the performance of the downstream tasks, as the model's initial representations can influence the fine-tuning process.






---


Layers to freeze/unfreeze:

A common approach is to freeze the transformer backbone (e.g., BERT) and fine-tune the task-specific heads and the projection layer.
Alternatively, you could unfreeze a few layers of the transformer backbone and fine-tune them along with the task-specific heads.

This can be beneficial if the target tasks are significantly different from the pre-training data domain.
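
Partial unfreezing can be sketched as follows: freeze the whole backbone, then re-enable gradients for the last few transformer blocks. In Hugging Face's `DistilBertModel` the blocks are stored in `transformer.layer`; the `TinyBackbone` class below is a hypothetical stand-in that mimics only that layout (linear layers replace the real blocks), and the helper name `unfreeze_top_layers` is likewise an assumption for illustration.

```python
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in mimicking DistilBERT's layout: embeddings + transformer.layer."""
    def __init__(self, num_layers=6, hidden=16):
        super().__init__()
        self.embeddings = nn.Embedding(100, hidden)
        self.transformer = nn.Module()
        self.transformer.layer = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

def unfreeze_top_layers(backbone: nn.Module, num_unfrozen: int) -> None:
    """Freeze the whole backbone, then re-enable gradients for the last `num_unfrozen` blocks."""
    for param in backbone.parameters():
        param.requires_grad = False
    if num_unfrozen > 0:  # guard: a slice of [-0:] would select every block
        for block in backbone.transformer.layer[-num_unfrozen:]:
            for param in block.parameters():
                param.requires_grad = True

backbone = TinyBackbone()
unfreeze_top_layers(backbone, num_unfrozen=2)

trainable_blocks = [
    index for index, block in enumerate(backbone.transformer.layer)
    if all(p.requires_grad for p in block.parameters())
]
print(trainable_blocks)  # [4, 5]
```

Unfreezing from the top down is the usual choice because the upper blocks tend to hold the most task-specific features, while the embeddings and lower blocks stay fixed.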


---



My approach:

1. Freezing the transformer backbone reduces the number of trainable parameters, making the fine-tuning process more computationally efficient and less prone to overfitting, especially when dealing with limited task-specific data.

2. Fine-tuning the task-specific heads and the projection layer allows the model to adapt to the specific tasks and learn task-relevant representations.

3. Unfreezing a few layers of the transformer backbone can help the model adapt its language representations to the specific tasks or domains, potentially improving performance if the target tasks are significantly different from the pre-training data.



---



In summary, the choice of freezing or fine-tuning different components of the model depends on the specific requirements, computational resources, and the similarity between the pre-training data and the target tasks.

A common approach is to freeze the transformer backbone and fine-tune the task-specific heads and the projection layer, while unfreezing a few layers of the backbone can be considered if the target tasks are significantly different from the pre-training data.

In [29]:
# Task 4: Layer-wise Learning Rate Implementation (BONUS)

'''
Lower learning rates are typically set for the lower layers of the network because they capture more general features and benefit from smaller, more stable updates.
Slightly higher learning rates can be set for the middle layers (e.g., the projection layer) to allow faster adaptation to task-specific features.
Higher learning rates are set for the task-specific heads (task_a_head and task_b_head), which are randomly initialized and need to converge from scratch.

Advantages of using layer-wise learning rates:
Improved stability and faster convergence.
Different layers of a neural network may learn at different rates or require different magnitudes of updates to their parameters.
Layer-wise learning rates allow us to adjust the learning rate for each layer individually, providing finer control over the learning process.
This can help accelerate convergence and improve overall training performance.
Keeping the learning rate small for pre-trained layers can also act as a mild form of regularization, since it limits how far they drift from their pre-trained weights.

The code below applies this at the level of parameter groups: one rate for the whole backbone, one for the projection layer, and one for each head.


'''

model = MultiTaskSentenceTransformer()

# Define a separate learning rate for each parameter group
learning_rates = [
    {"params": model.bert.parameters(), "lr": 1e-5},         # Lower learning rate for the DistilBERT backbone
    {"params": model.projection.parameters(), "lr": 1e-4},   # Slightly higher learning rate for the projection layer
    {"params": model.task_a_head.parameters(), "lr": 1e-3},  # Higher learning rate for the Task A head
    {"params": model.task_b_head.parameters(), "lr": 1e-3},  # Higher learning rate for the Task B head
]

# Optimizer with a separate learning rate per parameter group
optimizer = optim.Adam(learning_rates)
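
The grouping above uses a single rate for the whole backbone. A finer-grained variant gives each transformer block its own rate, decaying from the top of the network to the bottom. The sketch below shows one way this could be set up; it is an illustration under stated assumptions, not the configuration used above. `TinyBackbone` is a hypothetical stand-in that mimics DistilBERT's `embeddings` and `transformer.layer` layout, and the helper name `layerwise_param_groups`, the base rate, and the decay factor are all illustrative choices.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in mimicking DistilBERT's layout: embeddings + transformer.layer."""
    def __init__(self, num_layers=3, hidden=16):
        super().__init__()
        self.embeddings = nn.Embedding(100, hidden)
        self.transformer = nn.Module()
        self.transformer.layer = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

def layerwise_param_groups(backbone, head, base_lr=1e-3, decay=0.8):
    """The head gets base_lr; each block further down gets the rate above it times `decay`."""
    groups = [{"params": head.parameters(), "lr": base_lr}]
    lr = base_lr
    for block in reversed(backbone.transformer.layer):  # top block first
        lr *= decay
        groups.append({"params": block.parameters(), "lr": lr})
    # Embeddings sit at the bottom and receive the smallest rate
    groups.append({"params": backbone.embeddings.parameters(), "lr": lr * decay})
    return groups

backbone = TinyBackbone(num_layers=3)
head = nn.Linear(16, 5)
optimizer = torch.optim.Adam(layerwise_param_groups(backbone, head))

lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)  # one rate per group, decreasing from the head down to the embeddings
```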
