**TASK 1**

USING MINILM-L6-V2

In [12]:
from sentence_transformers import SentenceTransformer


model = SentenceTransformer("all-MiniLM-L6-v2")


sentences = [
    "This is the first excercise.",
    "I am using MiniLM L6.",
    "This is extremely fun."
]


embeddings = model.encode(sentences)


def print_embeddings(sentences, embeddings):
    for sentence, embedding in zip(sentences, embeddings):
        print(f"Sentence: {sentence}")
        print(f"Embedding: {embedding[:5]}... (truncated)\n")




In [13]:
print_embeddings(sentences, embeddings)

Sentence: This is the first excercise.
Embedding: [-0.0258184  -0.05110373  0.0023471   0.07200564 -0.09817741]... (truncated)

Sentence: I am using MiniLM L6.
Embedding: [ 0.01930069 -0.06063944 -0.07139064 -0.037043    0.03860332]... (truncated)

Sentence: This is extremely fun.
Embedding: [ 0.03399094 -0.00855662  0.0275319  -0.04598624 -0.06823699]... (truncated)



USING BERT BASE

In [14]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = [
    "This is the first excercise.",
    "I am using MiniLM L6.",
    "This is extremely fun."
]


embeddings = model.encode(sentences)


def print_embeddings(sentences, embeddings):
    for sentence, embedding in zip(sentences, embeddings):
        print(f"Sentence: {sentence}")
        print(f"Embedding: {embedding[:5]}... (truncated)\n")




In [15]:
print_embeddings(sentences, embeddings)

Sentence: This is the first excercise.
Embedding: [-0.10511372  0.06052209  0.7207366   0.25130892 -0.14731582]... (truncated)

Sentence: I am using MiniLM L6.
Embedding: [ 0.2196106   0.12576272  1.3329709  -0.03008128  0.50363946]... (truncated)

Sentence: This is extremely fun.
Embedding: [-0.00433037 -0.466831    2.1145084   0.36574316 -0.09815084]... (truncated)



As the outputs show, the two Hugging Face models return different embeddings for the same sentences: the values differ in sign and scale, and the vectors differ in size (384 dimensions for all-MiniLM-L6-v2, 768 for bert-base-nli-mean-tokens). BERT base is the larger model and is typically used to capture rich contextual embeddings. MiniLM-L6, on the other hand, is a compact model better suited to large-scale tasks; it produces meaningful embeddings without losing too much depth. I implemented both to show that, for a small sample like this one, each model returns a fixed-size embedding per sentence with the same few lines of code, so the faster model is the better fit.
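Since the two models place sentences in different vector spaces, their raw numbers cannot be compared directly. A fairer check is to compare how each model scores sentence pairs, for example with cosine similarity. Below is a minimal sketch in plain Python; the helper names and the toy vectors are made up for illustration and are not model outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities for a list of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy 3-dimensional "embeddings" (hypothetical values, not model outputs).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]

sims = similarity_matrix(toy_embeddings)
for row in sims:
    print(" ".join(f"{s:.2f}" for s in row))
```

With the real models, `model.encode(sentences).tolist()` could be passed in place of `toy_embeddings`, once per model, and the two resulting matrices compared instead of the raw vectors.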


**TASK 2**

In [19]:
from sentence_transformers import SentenceTransformer
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSentenceModel(nn.Module):
    def __init__(self, transformer_model="paraphrase-MiniLM-L6-v2", hidden_size=384, num_labels_task_a=3, num_labels_task_b=2):
        super(MultiTaskSentenceModel, self).__init__()
        self.encoder = SentenceTransformer(transformer_model)
        self.task_a_head = nn.Linear(hidden_size, num_labels_task_a)
        self.task_b_head = nn.Linear(hidden_size, num_labels_task_b)


        self.task_a_labels = ["Finance", "Entertainment", "Technology"]
        self.task_b_labels = ["Negative", "Positive"]

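    def forward(self, sentences):
        # encode() runs inference without gradient tracking, so the encoder
        # behaves as a fixed feature extractor feeding the two heads.
        embeddings = self.encoder.encode(sentences, convert_to_tensor=True)
        logits_a = self.task_a_head(embeddings)
        logits_b = self.task_b_head(embeddings)
        return logits_a, logits_b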

    def predict(self, sentences):
        logits_a, logits_b = self.forward(sentences)
        probs_a = F.softmax(logits_a, dim=1)
        probs_b = F.softmax(logits_b, dim=1)
        pred_a = torch.argmax(probs_a, dim=1)
        pred_b = torch.argmax(probs_b, dim=1)

        for i, sentence in enumerate(sentences):
            print(f"Sentence: {sentence}")
            print(f"\tPredicted Topic: {self.task_a_labels[pred_a[i]]}")
            print(f"\tPredicted Sentiment: {self.task_b_labels[pred_b[i]]}\n")


model = MultiTaskSentenceModel()
sample_sentences = [
    "This is the second excercise.",
    "I am using MiniLM L6.",
    "This is extremely fun."
]

model.predict(sample_sentences)

Sentence: This is the second excercise.
	Predicted Topic: Entertainment
	Predicted Sentiment: Positive

Sentence: I am using MiniLM L6.
	Predicted Topic: Entertainment
	Predicted Sentiment: Negative

Sentence: This is extremely fun.
	Predicted Topic: Finance
	Predicted Sentiment: Positive



For this task I implemented topic classification alongside sentiment classification. Each task has its own head, a single linear layer on top of the shared 384-dimensional sentence embedding. The end result is a very basic multi-task sentence model. Because the heads have not been trained yet, the predictions printed above are effectively arbitrary.
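To make the role of a task head concrete, here is a minimal sketch of what one linear head plus softmax computes, written in plain Python. The sizes are shrunk (4 inputs instead of 384) and the embedding values are hypothetical; only the label names come from the notebook.

```python
import math
import random

def linear(x, weights, bias):
    """Compute W @ x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
embedding_dim, num_classes = 4, 3  # tiny sizes, for illustration only
labels = ["Finance", "Entertainment", "Technology"]

# Untrained head: small random weights and a zero bias.
weights = [[random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

embedding = [0.5, -0.2, 0.1, 0.3]  # hypothetical sentence embedding
probs = softmax(linear(embedding, weights, bias))
predicted = labels[probs.index(max(probs))]
print(probs, predicted)
```

With small random weights the three probabilities stay close to one third each, so the predicted label depends on the random initialization rather than on the sentence.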

**TASK 3**

1.) If the entire network is frozen, including both the transformer backbone and the task-specific heads, the model can only act as a static feature extractor. It cannot adapt to new tasks, and the task heads will produce essentially random outputs because their weights are never trained.

2.) Freezing only the transformer backbone while training the task-specific heads works well with limited labeled data: it reduces the risk of overfitting and lets the heads learn meaningful decision boundaries on top of the embeddings generated by the MiniLM model.

3.) If only one task head is frozen, for example keeping the topic classification head static while continuing to train the sentiment classification head, performance on the well-functioning task is preserved while the underperforming component is fine-tuned without disrupting the entire system. This holds most cleanly when the shared backbone is frozen as well, since any update to the backbone changes the inputs that the frozen head sees.
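The three freezing setups above can be sketched in PyTorch by toggling `requires_grad` on the relevant parameters. The model below is a small stand-in (a linear "backbone" instead of the transformer), and the class and helper names are hypothetical; the point is only to show which parameters remain trainable in each case.

```python
import torch.nn as nn

class TinyMultiTaskModel(nn.Module):
    """Stand-in for the notebook's model: a small backbone plus two heads."""
    def __init__(self, in_dim=8, hidden=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.task_a_head = nn.Linear(hidden, 3)  # topic
        self.task_b_head = nn.Linear(hidden, 2)  # sentiment

    def forward(self, x):
        h = self.backbone(x)
        return self.task_a_head(h), self.task_b_head(h)

def set_trainable(module, trainable):
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)

def trainable_names(model):
    """Names of the parameters an optimizer would still update."""
    return [n for n, p in model.named_parameters() if p.requires_grad]

model = TinyMultiTaskModel()

# Case 1: freeze everything; nothing is left to train.
set_trainable(model, False)
frozen_all = trainable_names(model)

# Case 2: freeze only the backbone; both heads train.
set_trainable(model, True)
set_trainable(model.backbone, False)
heads_only = trainable_names(model)

# Case 3: freeze only the topic head; backbone and sentiment head train.
set_trainable(model, True)
set_trainable(model.task_a_head, False)
one_head_frozen = trainable_names(model)

print(frozen_all, heads_only, one_head_frozen, sep="\n")
```

When building the optimizer afterwards, passing only the parameters with `requires_grad=True` keeps the frozen weights out of the update step entirely.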

Scenarios

1.) When transfer learning is viable, the process should begin with choosing a powerful and efficient pre-trained transformer model, such as paraphrase-MiniLM-L6-v2, which is optimized for sentence-level semantic understanding.

2.) I would freeze the transformer at first and train only the added task heads, which adapts the model quickly to the target task without modifying the core representations. As training progresses and more labeled data becomes available, gradually unfreezing the upper layers of the transformer lets the model become more task-aware and adapt better to domain-specific nuances.

3.) This fine-tuning approach ensures a balance between leveraging general language knowledge and learning task-specific patterns. Ultimately, freezing and unfreezing strategies should be chosen based on the amount of available data, the complexity of tasks, and the desired balance between generalization and specialization.
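A rough sketch of the gradual unfreezing idea is shown below. It uses a stack of six small linear layers as a stand-in for the transformer layers, and gives the unfrozen encoder layers a smaller learning rate than the head. The layer sizes, the `head_lr` value, the stage schedule, and the helper names are all assumptions made for illustration.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for a transformer: a stack of six "layers" (hypothetical sizes).
encoder_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
head = nn.Linear(8, 3)

def unfreeze_top_layers(layers, num_top):
    """Freeze all layers, then unfreeze only the last `num_top` of them."""
    for p in layers.parameters():
        p.requires_grad_(False)
    if num_top > 0:
        for layer in layers[-num_top:]:
            for p in layer.parameters():
                p.requires_grad_(True)

def build_optimizer(layers, head, encoder_lr=2e-5, head_lr=1e-3):
    """Give unfrozen encoder layers a smaller learning rate than the head."""
    encoder_params = [p for p in layers.parameters() if p.requires_grad]
    groups = [{"params": list(head.parameters()), "lr": head_lr}]
    if encoder_params:
        groups.append({"params": encoder_params, "lr": encoder_lr})
    return optim.Adam(groups)

# Stage 1: head only. Stage 2: top two layers join. Stage 3: top four.
schedule = [0, 2, 4]
trainable_counts = []
for num_top in schedule:
    unfreeze_top_layers(encoder_layers, num_top)
    optimizer = build_optimizer(encoder_layers, head)
    count = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
    trainable_counts.append(count)
    # ... train for a few epochs with this optimizer before the next stage ...

print(trainable_counts)
```

Rebuilding the optimizer at each stage is the simplest option; it resets Adam's running statistics, which may or may not be desirable depending on the setup.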



**TASK 4**

In [20]:
from sentence_transformers import SentenceTransformer
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class MultiTaskSentenceModel(nn.Module):
    def __init__(self, transformer_model="paraphrase-MiniLM-L6-v2", hidden_size=384, num_labels_task_a=3, num_labels_task_b=2):
        super(MultiTaskSentenceModel, self).__init__()
        self.encoder = SentenceTransformer(transformer_model)
        self.task_a_head = nn.Linear(hidden_size, num_labels_task_a)
        self.task_b_head = nn.Linear(hidden_size, num_labels_task_b)
        self.task_a_labels = ["Finance", "Entertainment", "Technology"]
        self.task_b_labels = ["Negative", "Positive"]

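    def forward(self, sentences):
        # encode() runs inference without gradient tracking, so no gradients
        # reach the encoder here; only the two heads can learn.
        embeddings = self.encoder.encode(sentences, convert_to_tensor=True)
        logits_a = self.task_a_head(embeddings)
        logits_b = self.task_b_head(embeddings)
        return logits_a, logits_b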

    def predict(self, sentences):
        logits_a, logits_b = self.forward(sentences)
        probs_a = F.softmax(logits_a, dim=1)
        probs_b = F.softmax(logits_b, dim=1)
        pred_a = torch.argmax(probs_a, dim=1)
        pred_b = torch.argmax(probs_b, dim=1)
        for i, sentence in enumerate(sentences):
            print(f"Sentence: {sentence}")
            print(f"\tPredicted Topic: {self.task_a_labels[pred_a[i]]}")
            print(f"\tPredicted Sentiment: {self.task_b_labels[pred_b[i]]}\n")


model = MultiTaskSentenceModel()
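# model.parameters() includes the encoder weights, but they never receive
# gradients through encode(), so in practice only the heads are updated.
optimizer = optim.Adam(model.parameters(), lr=2e-5)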
loss_fn = nn.CrossEntropyLoss()


sentences = [
    "The stock market crashed due to inflation.",
    "The film was a cinematic masterpiece.",
    "Tech companies are investing in AI.",
    "The weather ruined my mood today.",
    "Quantum computing is the next big thing."
]

task_a_labels = torch.tensor([0, 1, 2, 1, 2])
task_b_labels = torch.tensor([0, 1, 1, 0, 1])


model.train()
for epoch in range(3):
    logits_a, logits_b = model(sentences)
    loss_a = loss_fn(logits_a, task_a_labels)
    loss_b = loss_fn(logits_b, task_b_labels)
    total_loss = loss_a + loss_b

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()


    pred_a = torch.argmax(logits_a, dim=1)
    pred_b = torch.argmax(logits_b, dim=1)
    acc_a = (pred_a == task_a_labels).float().mean().item()
    acc_b = (pred_b == task_b_labels).float().mean().item()

    print(f"Epoch {epoch+1} | Loss: {total_loss.item():.4f} | Task A Acc: {acc_a:.2f} | Task B Acc: {acc_b:.2f}")

Epoch 1 | Loss: 1.7453 | Task A Acc: 0.40 | Task B Acc: 0.20
Epoch 2 | Loss: 1.7428 | Task A Acc: 0.40 | Task B Acc: 0.20
Epoch 3 | Loss: 1.7402 | Task A Acc: 0.40 | Task B Acc: 0.20


In Task 2, the focus was solely on building the multi-task architecture: a shared transformer encoder and two task-specific classification heads, one for topic classification and one for sentiment analysis. Task 4 expands on this by adding the training mechanics. Five synthetic sentences with hand-assigned labels simulate a dataset. CrossEntropyLoss is computed separately for each task and the two losses are summed, and an Adam optimizer updates the model's parameters. The training loop runs three epochs, each with a forward pass, loss computation for both tasks, backpropagation, and a parameter update. After each epoch the loop also prints accuracy for both task heads, offering insight into learning progress.

In this run the loss falls only slightly (1.7453 to 1.7402) and both accuracies stay flat. That is consistent with the setup: the learning rate is small (2e-5), there are only three epochs, and because the embeddings come from encode(), only the two linear heads are actually being trained. Even so, these additions turn the static architecture from Task 2 into a functional training pipeline capable of multi-task optimization, even with hypothetical data.
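For comparison, here is a minimal sketch of the same two-loss training step where gradients do reach the encoder. To keep it self-contained, a small linear layer stands in for the transformer and random features stand in for tokenized sentences; the learning rate, epoch count, and variable names are assumptions. With the real SentenceTransformer, the equivalent change would presumably be to tokenize the sentences and call the model's forward pass instead of `encode()`, but that is not shown here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in encoder (hypothetical): 16 input features -> 8-d "embedding".
encoder = nn.Sequential(nn.Linear(16, 8), nn.Tanh())
topic_head = nn.Linear(8, 3)
sentiment_head = nn.Linear(8, 2)

params = (list(encoder.parameters())
          + list(topic_head.parameters())
          + list(sentiment_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Random features stand in for sentences; labels mirror the notebook's.
features = torch.randn(5, 16)
topic_labels = torch.tensor([0, 1, 2, 1, 2])
sentiment_labels = torch.tensor([0, 1, 1, 0, 1])

# Embeddings computed under no_grad are cut off from the encoder weights.
with torch.no_grad():
    detached = encoder(features)
no_grad_tracks = detached.requires_grad

losses = []
for epoch in range(50):
    optimizer.zero_grad()
    embeddings = encoder(features)  # gradients are tracked here
    loss = (loss_fn(topic_head(embeddings), topic_labels)
            + loss_fn(sentiment_head(embeddings), sentiment_labels))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

encoder_got_gradients = all(p.grad is not None for p in encoder.parameters())
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
print("encoder received gradients:", encoder_got_gradients)
```

Summing the two losses weights both tasks equally; a weighted sum is a common alternative when one task dominates.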

