## TASK 1

Implement a sentence transformer model using any deep learning framework of your choice.
This model should be able to encode input sentences into fixed-length embeddings. Test your
implementation with a few sample sentences and showcase the obtained embeddings.
Describe any choices you had to make regarding the model architecture outside of the
transformer backbone.

In [1]:
pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
# imported required libraries
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer



In [3]:
class SentenceTransformer(nn.Module):
    def __init__(self, model_name='bert-base-uncased'): 
        # I chose bert-base-uncased as the default model because it is relatively lightweight compared with larger BERT variants and has a strong general understanding of English.
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        # I included the tokenizer directly in the model for end-to-end usability
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        
    def forward(self, sentences):
        # I tokenize the input sentences and move the resulting tensors to the same device as the model to avoid runtime errors caused by device mismatches
        inputs = self.tokenizer(
            sentences, 
            padding=True, 
            truncation=True, 
            return_tensors='pt'
        )
        inputs = {k: v.to(self.transformer.device) for k, v in inputs.items()}
        
        # Pass through transformer
        outputs = self.transformer(**inputs)
        last_hidden_state = outputs.last_hidden_state
        # Here, I used all token embeddings from the final layer instead of the [CLS] token to retain full contextual information.
        
        # Mean pooling with attention mask
        attention_mask = inputs['attention_mask']
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9) # The torch.clamp operation guards against division by zero in rare cases where all tokens in a sequence are masked.
        sentence_embeddings = sum_embeddings / sum_mask
        
        return sentence_embeddings

- In the model architecture, I applied mean pooling to the transformer's final hidden states to create fixed-length sentence embeddings. This involves averaging only the representations of actual tokens, while using the attention mask to exclude padding tokens from the calculation. 
- I chose not to use the [CLS] token because, in pre-trained BERT, it is trained for next-sentence prediction rather than as a general-purpose sentence embedding. Without fine-tuning, it often underperforms mean pooling on tasks like semantic similarity.
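
To make the effect of the attention mask concrete, the following minimal sketch applies the same masked mean pooling to a small hand-made tensor. The helper name `masked_mean_pool` and the toy values are assumptions chosen for illustration; they are not part of the model above.

```python
import torch

def masked_mean_pool(token_embeddings, attention_mask):
    # Expand the (batch, seq_len) mask to (batch, seq_len, hidden) so padding rows are zeroed out
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)  # number of real tokens per sequence
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up values)
token_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
naive = token_embeddings.mean(dim=1)  # unmasked mean, distorted by the padding row

print("masked mean:", pooled)
print("naive mean: ", naive)
```

For the first sequence the masked mean is `[2.0, 3.0]`, the average of the two real tokens only, whereas the naive mean is pulled toward the padding values.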

In [4]:
# Testing with sample sentences
model = SentenceTransformer()
sentences = ["I love tres leches", "Something fun", "Transformers are powerful models for NLP!"]
    
embeddings = model(sentences)
    
print("Embeddings shape:", embeddings.shape)
print("\nSample embeddings (first 10 dimensions):")
for i, emb in enumerate(embeddings):
    print(f"Sentence {i+1}:", emb[:10].detach().cpu().numpy())

Embeddings shape: torch.Size([3, 768])

Sample embeddings (first 10 dimensions):
Sentence 1: [ 0.40155387  0.27310404 -0.12392522 -0.14442347  0.23539612  0.08395826
 -0.02269688  0.2874141  -0.02653583 -0.5048618 ]
Sentence 2: [ 0.12102238 -0.42853853 -0.06121706  0.16960466  0.00936238 -0.57226175
  0.23743963  0.38778707 -0.2520484  -0.1089665 ]
Sentence 3: [ 0.07447346 -0.66652596  0.5844453   0.28960234  0.19072461 -0.28513166
 -0.1948359  -0.00175048  0.06677036 -0.11012895]


## TASK 2

Expand the sentence transformer to handle a multi-task learning setting.
1. Task A: Sentence Classification – Classify sentences into predefined classes (you can make these up).
2. Task B: [Choose another relevant NLP task such as Named Entity Recognition,
Sentiment Analysis, etc.] (you can make the labels up)
Describe the changes made to the architecture to support multi-task learning.

In [5]:
class MultiTaskSentenceTransformer(nn.Module):
    def __init__(
        self,
        model_name: str = "bert-base-uncased",
        num_classes: int = 3,  # Task A: 3 classes (e.g., sentence classification categories)
        num_entities: int = 5  # Task B: 5 entity types (e.g., PER, LOC, ORG)
    ):
        super().__init__()
        # I use a shared transformer backbone so that both tasks reuse one forward pass instead of duplicating computation
        self.transformer = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        hidden_dim = self.transformer.config.hidden_size

        # Task-specific heads
        self.classifier = nn.Linear(hidden_dim, num_classes)  # Task A: I added a linear layer to map pooled embeddings to class labels
        self.ner_head = nn.Linear(hidden_dim, num_entities)   # Task B: I added a token-level classifier for NER
    
    def _mean_pooling(self, token_embeddings, attention_mask):
        # Masked padding tokens during pooling
        mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        pooled = torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(1), min=1e-9)
        return pooled

    def forward(self, sentences, task="both"):
        # Tokenizing
        inputs = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            return_tensors="pt"
        ).to(self.transformer.device)

        # Shared transformer backbone
        outputs = self.transformer(**inputs)
        token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_dim)

        task_outputs = {"A": None, "B": None}

        # Task A: Sentence classification: I used mean-pooling to convert token embeddings to sentence embedding
        if task in ("both", "A"):
            pooled = self._mean_pooling(token_embeddings, inputs["attention_mask"])
            task_outputs["A"] = self.classifier(pooled)  # (batch, num_classes)

        # Task B: NER: Here I predict an entity type per token (no pooling needed)
        if task in ("both", "B"):
            task_outputs["B"] = self.ner_head(token_embeddings)  # (batch, seq_len, num_entities)

        return task_outputs
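
In short, the architectural change from Task 1 is that one shared encoder now feeds two heads: a sentence-level head on the pooled embedding and a token-level head on the per-token embeddings. The sketch below mirrors that layout with a tiny stand-in encoder so that it runs without downloading any weights. `TinyMultiTask`, the `nn.Embedding` backbone, and all sizes are hypothetical and serve only to show the output shapes.

```python
import torch
from torch import nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: an embedding layer plays the role of the transformer backbone."""

    def __init__(self, vocab_size=50, hidden_dim=8, num_classes=3, num_entities=5):
        super().__init__()
        self.backbone = nn.Embedding(vocab_size, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)   # sentence-level head (Task A)
        self.ner_head = nn.Linear(hidden_dim, num_entities)    # token-level head (Task B)

    def forward(self, input_ids, attention_mask):
        token_embeddings = self.backbone(input_ids)             # (batch, seq_len, hidden_dim)
        mask = attention_mask.unsqueeze(-1).float()             # (batch, seq_len, 1)
        pooled = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return {
            "A": self.classifier(pooled),           # (batch, num_classes)
            "B": self.ner_head(token_embeddings),   # (batch, seq_len, num_entities)
        }

# Made-up token ids; 0 is used as the padding id here
input_ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

outputs = TinyMultiTask()(input_ids, attention_mask)
print(outputs["A"].shape)  # one row of class logits per sentence
print(outputs["B"].shape)  # one row of entity logits per token
```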

## TASK 3

Discuss the implications and advantages of each scenario and explain your rationale as to how
the model should be trained given the following:


1. If the entire network should be frozen.

    Implications: It means that no parameters (transformer backbone or task heads) are updated during training.

    Advantages: 

    - Avoids overfitting especially if the dataset is very small.
    - Skipping backpropagation through the transformer also reduces computation time and keeps memory and GPU/CPU usage minimal, which suits resource-constrained environments.

    Rationale: Freezing the entire network is rarely optimal but can work if the pre-trained model already captures task-relevant features.

2. If only the transformer backbone should be frozen.

    Implications: Here the transformer's parameters remain fixed, and only the task-specific heads are trained.

    Advantages:

    - Uses the power of pre-trained models while tailoring heads to new tasks.
    - There is less chance of overfitting because far fewer parameters are trainable.

    Rationale: The transformer backbone generates strong, general embeddings, and the task-specific heads learn to map these fixed features to each task's labels without disrupting the core model's stability.

3. If only one of the task-specific heads (either for Task A or Task B) should be frozen.

    Implications: Here, one head remains frozen, while the backbone and the other head are trained.

    Advantages:

    - It gives me the opportunity to prioritize the harder task.
    - The trainable head and the backbone can adapt the shared representations to the active task while the frozen head's weights stay exactly as they were.

    Rationale: Freezing a head stops its own weights from changing, which protects a head that already performs well. The frozen head still passes gradients back to the backbone whenever its loss is part of the objective, so to let the backbone focus on the task being actively trained, the frozen head's loss should also be dropped or down-weighted.
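
The three scenarios above can be sketched by toggling `requires_grad` on the relevant submodules. This is a minimal sketch under stated assumptions: `ToyMultiTask` is a hypothetical stand-in that reuses the attribute names `transformer`, `classifier`, and `ner_head` from Task 2 but replaces every component with a small linear layer, and the helpers `set_trainable` and `count_trainable` are illustrative names.

```python
import torch
from torch import nn

class ToyMultiTask(nn.Module):
    """Hypothetical stand-in with the same attribute names as the Task 2 model."""
    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(8, 8)   # stands in for the backbone (72 parameters)
        self.classifier = nn.Linear(8, 3)    # Task A head (27 parameters)
        self.ner_head = nn.Linear(8, 5)      # Task B head (45 parameters)

def set_trainable(module, trainable):
    # Frozen parameters receive no gradient and are never changed by the optimizer
    for p in module.parameters():
        p.requires_grad = trainable

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model_toy = ToyMultiTask()

# Scenario 1: freeze the entire network
set_trainable(model_toy, False)
frozen_all = count_trainable(model_toy)

# Scenario 2: freeze only the backbone, train both heads
set_trainable(model_toy, True)
set_trainable(model_toy.transformer, False)
heads_only = count_trainable(model_toy)

# Scenario 3: freeze one head (here Task B), train the backbone and the other head
set_trainable(model_toy, True)
set_trainable(model_toy.ner_head, False)
one_head_frozen = count_trainable(model_toy)

print(frozen_all, heads_only, one_head_frozen)

# A frozen head still lets gradients reach the backbone if its loss is backpropagated
x = torch.randn(4, 8)
model_toy.ner_head(model_toy.transformer(x)).sum().backward()
backbone_got_grad = model_toy.transformer.weight.grad is not None
frozen_head_got_grad = model_toy.ner_head.weight.grad is not None
print(backbone_got_grad, frozen_head_got_grad)
```

The last lines show why freezing a head alone does not isolate the backbone from that task: the backbone receives a gradient through the frozen head, while the head's own weights receive none.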


Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:

The key benefit of transfer learning is that it significantly reduces the amount of data and computational resources needed for training, especially when the new task has limited data.

1. The choice of a pre-trained model: For this project, I selected the BERT-base model as the pre-trained backbone due to its strong general-purpose language understanding capabilities, making it well-suited for a wide range of NLP tasks. Since my dataset is small and self-made, using a pre-trained model helps overcome data limitations by providing a strong understanding of language.

2. The layers you would freeze/unfreeze: For this project, I would freeze the lower transformer layers of the BERT model, which are responsible for capturing general linguistic patterns. I would unfreeze and fine-tune the upper layers and task-specific heads, as they are more adaptable to the classification and Named Entity Recognition (NER) tasks at hand.

3. The rationale behind these choices: Freezing the lower layers helps preserve the foundational language understanding acquired during pre-training and minimizes the risk of overfitting, especially given the small dataset. Then, by fine-tuning the upper layers and task-specific heads, the model can adjust to the specific domain and task requirements.
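
The partial freezing described in points 2 and 3 can be sketched as follows. To keep the example runnable without downloading weights, it uses PyTorch's generic `nn.TransformerEncoder` as a stand-in for BERT; the layer count, sizes, and the choice of freezing the bottom two blocks are assumptions for illustration. In the Hugging Face `BertModel`, the corresponding blocks are exposed as `encoder.layer`, so the same loop would apply to `model.transformer.encoder.layer`.

```python
import torch
from torch import nn

# Stand-in encoder with 4 blocks (bert-base has 12); all sizes are arbitrary
block = nn.TransformerEncoderLayer(d_model=16, nhead=2, dim_feedforward=32)
encoder = nn.TransformerEncoder(block, num_layers=4)

num_frozen = 2  # assumption: freeze the bottom half, fine-tune the top half

# Freeze the lower blocks, which tend to capture general linguistic patterns
for lower_block in encoder.layers[:num_frozen]:
    for p in lower_block.parameters():
        p.requires_grad = False

# Hand only the trainable parameters to the optimizer
trainable = [p for p in encoder.parameters() if p.requires_grad]
optimizer_sketch = torch.optim.AdamW(trainable, lr=2e-5)

# Report, per block, whether every parameter is frozen
frozen_flags = [
    all(not p.requires_grad for p in blk.parameters()) for blk in encoder.layers
]
print(frozen_flags)  # [True, True, False, False]
```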

## TASK 4

If not already done, code the training loop for the Multi-Task Learning Expansion in Task 2.
Explain any assumptions or decisions made paying special attention to how training within a
MTL framework operates. Please note you need not actually train the model.
Things to focus on:
- Handling of hypothetical data
- Forward pass
- Metrics

In [6]:
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

In [7]:
# Hypothetical dataset class (assuming data is preprocessed)
class MultiTaskDataset(Dataset):
    def __init__(self, sentences, labels_A, labels_B, attention_masks):
        self.sentences = sentences
        self.labels_A = labels_A  # Sentence classification labels
        self.labels_B = labels_B  # NER labels (aligned with tokenized inputs)
        self.attention_masks = attention_masks # To ensure the model focuses only on meaningful parts of the input.

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return {
            "sentences": self.sentences[idx],
            "label_A": self.labels_A[idx],
            "label_B": self.labels_B[idx],
            "attention_mask": self.attention_masks[idx]
        }

In [8]:
# Initializing model, optimizer, and loss hyperparameters
model = MultiTaskSentenceTransformer(num_classes=3, num_entities=5)
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
alpha = 0.5  # Weight for Task A loss (Task B gets 1 - alpha); this matches the default that train() uses below

In [9]:
def multi_task_loss(outputs, labels_A, labels_B, attention_mask, alpha=0.5):
    # Task A loss (Sentence Classification)
    loss_A = nn.CrossEntropyLoss()(outputs["A"], labels_A)
    
    # Calculate classification accuracy
    preds_A = outputs["A"].argmax(dim=1)
    correct_A = (preds_A == labels_A).sum().item()
    acc_A = correct_A / labels_A.size(0)  # Accuracy for this batch

    # Task B loss (NER - ignoring the padding tokens)
    # Flattened the logits and labels while preserving the last dimension for class scores
    active_logits = outputs["B"].view(-1, outputs["B"].shape[-1])  # (batch_size * seq_len, num_entities)
    active_labels = labels_B.view(-1)  # (batch_size * seq_len)
    
    # Created mask to ignore padding tokens (mask value = 1 for real tokens)
    active_mask = attention_mask.view(-1).bool()  # (batch_size * seq_len)
    
    # Calculated loss only for active tokens
    loss_B = nn.CrossEntropyLoss()(
        active_logits[active_mask],  # (num_active_tokens, num_entities)
        active_labels[active_mask]   # (num_active_tokens)
    )
    
    # Calculate NER accuracy (only for actual tokens)
    preds_B = outputs["B"].argmax(dim=-1)
    valid_tokens = attention_mask.sum().item()  # Total non-pad tokens
    correct_B = ((preds_B == labels_B) & attention_mask.bool()).sum().item()
    acc_B = correct_B / valid_tokens if valid_tokens > 0 else 0.0

    # Returning all the metrics
    return {
        "total_loss": alpha * loss_A + (1 - alpha) * loss_B,
        "loss_A": loss_A.item(),
        "loss_B": loss_B.item(),
        "acc_A": acc_A,
        "acc_B": acc_B
    }
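
The NER loss above excludes padding by boolean indexing. A common alternative, sketched here under the assumption that padded positions are relabeled with `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), gives the same value without indexing. The logits, labels, and mask below are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 4, 5)                        # (batch, seq_len, num_entities)
labels = torch.tensor([[1, 0, 3, 0], [2, 0, 0, 0]])  # padded positions hold a dummy 0
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

flat_logits = logits.view(-1, logits.shape[-1])      # (batch * seq_len, num_entities)
flat_labels = labels.view(-1)                        # (batch * seq_len,)

# Approach used above: keep only the rows belonging to real tokens
active = attention_mask.view(-1).bool()
loss_masked = F.cross_entropy(flat_logits[active], flat_labels[active])

# Alternative: mark padded labels with -100 and let the loss skip them
labels_ignored = labels.masked_fill(attention_mask == 0, -100)
loss_ignore = F.cross_entropy(flat_logits, labels_ignored.view(-1), ignore_index=-100)

print(loss_masked.item(), loss_ignore.item())
```

Both losses average over the same five real tokens, so the two printed values agree up to floating-point error.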


In [10]:
def train(model, dataloader, optimizer, num_epochs=3, alpha=0.5):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        total_loss_A = 0.0
        total_loss_B = 0.0
        total_acc_A = 0.0
        total_acc_B = 0.0
        
        for batch in dataloader:
            # Forward pass
            outputs = model(batch["sentences"], task="both")
            
            # Compute loss and metrics
            loss_dict = multi_task_loss(
                outputs,
                labels_A=batch["label_A"],
                labels_B=batch["label_B"],
                attention_mask=batch["attention_mask"],
                alpha=alpha
            )
            
            # Backpropagation
            optimizer.zero_grad()
            loss_dict["total_loss"].backward()
            optimizer.step()
            
            # Accumulate stats
            total_loss += loss_dict["total_loss"].item()
            total_loss_A += loss_dict["loss_A"]
            total_loss_B += loss_dict["loss_B"]
            total_acc_A += loss_dict["acc_A"]
            total_acc_B += loss_dict["acc_B"]
        
        # Epoch averages
        num_batches = len(dataloader)
        avg_loss = total_loss / num_batches
        avg_loss_A = total_loss_A / num_batches
        avg_loss_B = total_loss_B / num_batches
        avg_acc_A = total_acc_A / num_batches
        avg_acc_B = total_acc_B / num_batches
        
        # Print metrics
        print(f"Epoch {epoch + 1}")
        print(f"  Total Loss: {avg_loss:.4f}")
        print(f"  Task A (Classification): Loss = {avg_loss_A:.4f}, Acc = {avg_acc_A:.4f}")
        print(f"  Task B (NER): Loss = {avg_loss_B:.4f}, Acc = {avg_acc_B:.4f}\n")
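
One assumption built into the epoch averages above is that every batch has the same size. When the last batch is smaller, the mean of per-batch accuracies differs from the accuracy over all samples. The short sketch below uses made-up batch results to show the gap.

```python
# Hypothetical per-batch results: (number of correct predictions, batch size)
batches = [(16, 16), (16, 16), (1, 4)]

# Mean of per-batch accuracies, as computed in the training loop above
mean_of_batch_acc = sum(correct / size for correct, size in batches) / len(batches)

# Accuracy over all samples, which weights each batch by its size
overall_acc = sum(correct for correct, _ in batches) / sum(size for _, size in batches)

print(round(mean_of_batch_acc, 4))  # 0.75
print(round(overall_acc, 4))        # 0.9167
```

Accumulating the raw counts of correct predictions and samples, and dividing once per epoch, avoids this discrepancy.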

In [11]:
# Example usage with hypothetical data
sentences = [
    "Lionel Messi won the FIFA World Cup in 2022.",
    "Apple unveiled the new iPhone in California.",
    "Barack Obama gave a speech at the climate summit in Paris."
]

# Tokenize sentences
tokenized = model.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Sentence classification labels (Task A)
# 0 = Sports, 1 = Technology, 2 = Politics
labels_A = torch.tensor([0, 1, 2])

# NER labels (Task B)
# 0 = O (no entity), 1 = Person, 2 = Organization, 3 = Location, 4 = Event
# Note: You must pad NER labels to match tokenized input shape (done here manually)
labels_B = torch.tensor([
    # "Lionel Messi won the FIFA World Cup in 2022."
    [0, 1, 1, 0, 0, 4, 4, 4, 0, 0, 0, 0, 0, 0],
    
    # "Apple unveiled the new iPhone in California."
    [0, 2, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0],
    
    # "Barack Obama gave a speech at the climate summit in Paris."
    [0, 1, 1, 0, 0, 0, 0, 0, 4, 4, 0, 3, 0, 0]
])

attention_masks = tokenized["attention_mask"]  # From tokenizer

dataset = MultiTaskDataset(sentences, labels_A, labels_B, attention_masks)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
train(model, dataloader, optimizer, num_epochs=3)

Epoch 1
  Total Loss: 1.3179
  Task A (Classification): Loss = 1.0413, Acc = 0.6667
  Task B (NER): Loss = 1.5945, Acc = 0.1842

Epoch 2
  Total Loss: 1.1145
  Task A (Classification): Loss = 0.9142, Acc = 1.0000
  Task B (NER): Loss = 1.3148, Acc = 0.6053

Epoch 3
  Total Loss: 1.0099
  Task A (Classification): Loss = 0.8376, Acc = 1.0000
  Task B (NER): Loss = 1.1822, Acc = 0.6579
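
The NER labels in the example above were padded by hand. As a hedged alternative, `torch.nn.utils.rnn.pad_sequence` can pad variable-length label sequences automatically. The label sequences below are hypothetical, and using `-100` as the padding label is an assumption (it is the value that PyTorch's cross-entropy loss ignores by default), whereas the example above pads with 0 and relies on the attention mask.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical per-sentence label sequences of different lengths (one label per token)
label_seqs = [
    torch.tensor([0, 1, 1, 0]),
    torch.tensor([0, 2, 0]),
    torch.tensor([0, 1, 1, 0, 3, 0]),
]

PAD_LABEL = -100  # assumption: padding label that the loss can be told to skip

# Pad every sequence to the length of the longest one in the batch
padded_labels = pad_sequence(label_seqs, batch_first=True, padding_value=PAD_LABEL)

# A mask of real (non-padding) label positions can be recovered from the padded tensor
label_mask = (padded_labels != PAD_LABEL).long()

print(padded_labels)
print(label_mask)
```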



- The results demonstrate that the model is functioning as expected and can perform both classification (Task A) and named entity recognition (Task B). However, its performance is strongly influenced by the small size and simplicity of the synthetic dataset.

- Across all epochs, Task A consistently outperforms Task B. This performance gap suggests that classification is a simpler task for the model to learn, likely due to its fewer output classes and less complex structure. In contrast, NER involves more intricate label dependencies and appears to benefit less from the limited data.

- Training accuracy rises quickly: Task A reaches 1.0 by the second epoch, and Task B climbs from 0.18 to 0.66. With only three training sentences and no held-out data, this most likely reflects memorization of the small, artificially constructed dataset rather than generalization. The steady decline in loss across epochs confirms that the optimization itself is working in this constrained setting.

- While these early results validate the basic functionality of the multi-task learning setup, they may not generalize to real-world applications. To improve robustness and ensure practical reliability, the next step would be to evaluate the model on a larger and more diverse dataset, along with more rigorous validation.
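
As a quick sanity check on the logged numbers, the total loss per epoch can be recomputed from the two task losses. The values below are copied from the training output above; `alpha = 0.5` is the default of `train()`, which is what the call in the example uses.

```python
alpha = 0.5  # default weight in train(); Task B is weighted by 1 - alpha

# (loss_A, loss_B, total_loss) per epoch, copied from the printed output
reported = [
    (1.0413, 1.5945, 1.3179),
    (0.9142, 1.3148, 1.1145),
    (0.8376, 1.1822, 1.0099),
]

differences = []
for epoch, (loss_a, loss_b, total) in enumerate(reported, start=1):
    recomputed = alpha * loss_a + (1 - alpha) * loss_b
    differences.append(abs(recomputed - total))
    print(f"Epoch {epoch}: recomputed = {recomputed:.4f}, reported = {total:.4f}")
```

The recomputed totals match the reported ones to four decimal places, which confirms that both tasks were weighted equally during this run.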