# Sentence Transformers and Multi-Task Learning

## Install Necessary Packages

In [1]:
! pip install datasets transformers torch


## Task 1: Sentence Transformer Implementation

* Summary:
  * This task uses a pretrained transformer model with mean pooling to obtain fixed-length sentence embeddings.
  * Model Choice: `bert-base-uncased` was selected for its balance of speed and accuracy.
  * Pooling Method: Mean pooling is simple and still draws on the contextual embedding of every token, making it suitable for varied NLP tasks.
* Key Decisions:
  * Efficiency: Mean pooling adds no trainable layers, aligning with the task’s performance goals.
  * Clarity: This approach is straightforward, making it reproducible and accessible for a variety of tasks and models.
* Output: Each sentence produces a fixed-length embedding of shape [256]; the 768-dimensional pooled BERT output is truncated to its first 256 dimensions (`fixed_length=256`).
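
To make the pooling step concrete, here is a minimal, dependency-free sketch of masked mean pooling on toy numbers. The function name `masked_mean_pool` and the values are illustrative assumptions, not part of the notebook; the guard against an empty mask mirrors the `torch.clamp` call in the model's `forward` method.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding row so the division never hits zero.
    count = max(count, 1e-9)
    return [value / count for value in total]


# Toy sentence: three real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the point of multiplying by the attention mask before averaging.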

### Import Dependencies

In [3]:
import torch
from transformers import BertModel, BertTokenizer
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam
import torch.optim as optim
from google.colab import files

### Define SentenceTransformerModel
```
Initializes the SentenceTransformerModel using a pre-trained BERT model.

Args:
    model_name (str): The name of the pre-trained model to use (default is 'bert-base-uncased').
    fixed_length (int): The fixed length for sentence embeddings.
```



In [4]:
class SentenceTransformerModel(torch.nn.Module):
    def __init__(self, model_name='bert-base-uncased', fixed_length=256):
        super(SentenceTransformerModel, self).__init__()

        # Load a pre-trained BERT model (transformer backbone) from HuggingFace
        self.bert = BertModel.from_pretrained(model_name)

        # Choice: Use mean pooling to get fixed-length sentence embeddings.
        self.pooling = 'mean'
        self.fixed_length = fixed_length

    def forward(self, input_ids, attention_mask):
        """
        Forward pass for the model to obtain sentence embeddings.

        Args:
            input_ids (torch.Tensor): Tokenized input sentences.
            attention_mask (torch.Tensor): Attention mask indicating where the padding is in the input sequence.

        Returns:
            torch.Tensor: Sentence embeddings of fixed length.
        """
        # Pass the tokenized sentences through the BERT model
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        # Extract the last hidden states (token embeddings) from the output of BERT
        last_hidden_state = outputs.last_hidden_state

        # Apply the chosen pooling method to get fixed-length embeddings from variable-length token sequences
        if self.pooling == 'mean':
            input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
            sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, dim=1)
            sum_mask = input_mask_expanded.sum(dim=1)
            sum_mask = torch.clamp(sum_mask, min=1e-9)
            sentence_embeddings = sum_embeddings / sum_mask
        else:
            sentence_embeddings = last_hidden_state[:, 0, :]

        # Ensure fixed-length embeddings
        if sentence_embeddings.size(1) > self.fixed_length:
            # Truncate embeddings to the desired length
            sentence_embeddings = sentence_embeddings[:, :self.fixed_length]
        elif sentence_embeddings.size(1) < self.fixed_length:
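            # Pad embeddings to the desired length (on the same device and dtype as the embeddings)
            padding = torch.zeros(
                sentence_embeddings.size(0),
                self.fixed_length - sentence_embeddings.size(1),
                dtype=sentence_embeddings.dtype,
                device=sentence_embeddings.device,
            )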
            sentence_embeddings = torch.cat((sentence_embeddings, padding), dim=1)

        return sentence_embeddings
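
The fixed-length step at the end of `forward` can be illustrated on plain Python lists. This is only a sketch; `fit_to_length` is a hypothetical helper name. Note that `bert-base-uncased` has a hidden size of 768, so with `fixed_length=256` only the truncation branch runs and the last 512 dimensions of each pooled vector are discarded.

```python
def fit_to_length(vector, fixed_length):
    """Truncate or zero-pad a vector so it has exactly fixed_length entries."""
    if len(vector) > fixed_length:
        return vector[:fixed_length]
    return vector + [0.0] * (fixed_length - len(vector))


long_vector = [0.1, 0.2, 0.3, 0.4, 0.5]
short_vector = [0.1, 0.2]
print(fit_to_length(long_vector, 3))   # [0.1, 0.2, 0.3]
print(fit_to_length(short_vector, 3))  # [0.1, 0.2, 0.0]
```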


### Declare Model

In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = SentenceTransformerModel(fixed_length=256)



### Obtain Embeddings from sample sentences

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = SentenceTransformerModel(fixed_length=256)

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Transformer-based model change the world quickly.",
    "I love programming in Python.",
    "I love to work with machine learning stuffs.",
    "perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions",
    "The product exceeded my expectations, and I will definitely buy again.",
    "Amazing customer service! The staff were friendly and helpful.",
    "I love the design and performance of this phone, highly recommend it!",
    "The quality of the product was terrible and broke after one use.",
    "Customer service was rude, and they didn’t resolve my issue.",
    "Very disappointed with the experience, I won’t be coming back.",
    "The new iPhone features a faster processor and improved camera.",
    "Artificial Intelligence is transforming industries across the globe.",
    "The football team won their fifth championship title this season.",
    "The swimmer broke the world record in the 100-meter freestyle.",
    "A balanced diet and regular exercise are key to maintaining good health.",
    "Doctors recommend getting at least 7 hours of sleep per night.",
    "The pandemic has raised awareness about the importance of hygiene.",
]

# Tokenize the sentences: Convert sentences into token IDs for BERT input.
encoded_input = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt',
    max_length=128  # Maximum sequence length in tokens; longer inputs are truncated
)

# Pass tokenized sentences through the model to obtain embeddings (disable gradient computation)
with torch.no_grad():
    embeddings = model(
        input_ids=encoded_input['input_ids'],
        attention_mask=encoded_input['attention_mask']
    )

# Print the sentence embeddings and their corresponding sentence
print("Sentence Embeddings:")
for idx, embedding in enumerate(embeddings):
    print(f"Sentence {idx+1}: {sentences[idx]}")
    print(f"Embedding shape: {embedding.shape}")
    print(f"Embedding vector:\n{embedding}\n")

Sentence Embeddings:
Sentence 1: The quick brown fox jumps over the lazy dog.
Embedding shape: torch.Size([256])
Embedding vector:
tensor([-1.4466e-02, -7.4887e-02,  5.6368e-02,  4.5168e-03,  4.0891e-01,
         2.5804e-02, -7.5612e-02,  4.7453e-01, -1.8951e-03, -1.5011e-01,
        -1.0119e-01, -1.5718e-01, -2.1705e-01, -1.6252e-02, -4.5524e-01,
        -2.5191e-01,  2.0256e-01, -2.0102e-02, -1.6217e-01, -8.5939e-03,
         1.9837e-01, -3.7650e-01, -5.1490e-01, -6.7321e-02,  4.7553e-01,
         2.2702e-01, -3.7373e-03,  2.4479e-01, -3.7153e-01,  1.7458e-02,
         2.2193e-01, -1.3309e-01, -1.1020e-02,  1.4927e-01, -1.6918e-01,
        -3.3690e-02,  4.0328e-02, -3.5493e-01, -4.4815e-01,  8.7867e-02,
        -2.4535e-01, -5.4502e-02, -8.5848e-02, -8.2122e-02,  1.0064e-01,
        -4.1189e-01,  6.9579e-02, -2.2557e-01,  7.4127e-01, -3.2274e-01,
        -5.2064e-01,  5.7741e-01, -3.2543e-01,  1.8200e-01, -3.3087e-01,
         2.8472e-01,  3.9729e-01, -1.5372e-01,  1.5706e-01, -2.711

### Save Model and Download

In [7]:
# Save the model
model_path = '/content/sentence_transformer_model.pth'
torch.save(model.state_dict(), model_path)

# Download the saved model
files.download(model_path)
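
Because only the `state_dict` is saved, restoring the model later means rebuilding the architecture first and then loading the weights into it. The sketch below shows that round trip with a tiny stand-in module so it runs without downloading BERT; `build_model` and the file name are hypothetical, and with the real model you would construct `SentenceTransformerModel(fixed_length=256)` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Hypothetical stand-in for SentenceTransformerModel, small enough to run anywhere.
    return nn.Linear(4, 2)


original = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pth")
    torch.save(original.state_dict(), path)

    # Rebuild the same architecture first, then load the saved weights into it.
    restored = build_model()
    state_dict = torch.load(path, map_location="cpu")
    restored.load_state_dict(state_dict)
    restored.eval()  # switch to inference mode before computing embeddings

weights_match = all(
    torch.equal(original.state_dict()[name], restored.state_dict()[name])
    for name in original.state_dict()
)
print(weights_match)  # True
```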


## Task 2: Explanation of Changes to Support Multi-Task Learning

### 1. Define MultiTaskSentenceTransformerModel


The model adds two task-specific heads on top of the shared sentence embedding:

*   `self.classifier_task_a`: A linear layer mapping the sentence embeddings to class scores for Task A (Sentence Classification).
*   `self.classifier_task_b`: A separate linear layer for Task B (Sentiment Analysis).

**Rationale:**

By adding task-specific output layers, the model can share the transformer backbone and sentence embeddings while learning to perform different tasks simultaneously. This setup allows the model to learn representations that are beneficial for both tasks.
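
The idea of one shared embedding feeding two independent heads can be sketched without any framework. All weights and the 2-dimensional embedding below are made up for illustration; in the actual model each head is a `torch.nn.Linear` layer applied to the pooled BERT embedding.

```python
def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Hypothetical 2-dimensional sentence embedding produced by the shared backbone.
sentence_embedding = [1.0, 2.0]

# Task A head: 3 classes. Task B head: 2 classes. Weights are made up.
weight_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias_a = [0.0, 0.0, 0.0]
weight_b = [[1.0, -1.0], [-1.0, 1.0]]
bias_b = [0.5, 0.0]

# Both heads read the same embedding, so the backbone is evaluated only once.
logits_task_a = linear(sentence_embedding, weight_a, bias_a)
logits_task_b = linear(sentence_embedding, weight_b, bias_b)
print(logits_task_a)  # [1.0, 2.0, 3.0]
print(logits_task_b)  # [-0.5, 1.0]
```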


**Model Definition**

In [8]:
class MultiTaskSentenceTransformerModel(SentenceTransformerModel):
    """
    A model that extends the basic SentenceTransformerModel to handle multi-task learning.
    This model performs two tasks:
    1. Sentence Classification (Task A)
    2. Sentiment Analysis (Task B)
    """
    def __init__(self, model_name='bert-base-uncased', num_classes_task_a=3, num_classes_task_b=2):
        """
        Initializes the multi-task sentence transformer model by extending the base SentenceTransformerModel.

        Args:
            model_name (str): The pre-trained BERT model to use (default is 'bert-base-uncased').
            num_classes_task_a (int): Number of classes for sentence classification task A (default is 3).
            num_classes_task_b (int): Number of classes for sentiment analysis task B (default is 2).
        """
        # Call the parent constructor with the requested model name.
        # It loads the pre-trained BERT backbone and sets mean pooling, so neither is repeated here.
        super(MultiTaskSentenceTransformerModel, self).__init__(model_name=model_name)

        # Task A: Sentence Classification
        # A linear layer that maps sentence embeddings to class scores for Task A
        # Example: Task A could classify sentences into categories like 'News', 'Opinion', 'Entertainment'
        self.classifier_task_a = torch.nn.Linear(self.bert.config.hidden_size, num_classes_task_a)

        # Task B: Sentiment Analysis
        # A linear layer that maps sentence embeddings to class scores for Task B
        # Example: Task B could classify the sentiment of the sentence (e.g., 'Positive', 'Negative')
        self.classifier_task_b = torch.nn.Linear(self.bert.config.hidden_size, num_classes_task_b)

    def forward(self, input_ids, attention_mask):
        """
        Forward pass for the model to get predictions for both tasks.

        Args:
            input_ids (torch.Tensor): Tokenized input sentences.
            attention_mask (torch.Tensor): Attention mask to distinguish padding tokens from real tokens.

        Returns:
            tuple[torch.Tensor, torch.Tensor]: logits_task_a (Task A class scores) and logits_task_b (Task B class scores).
        """
        # Pass the input tokens through BERT to obtain hidden states (token embeddings)
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        # Extract the last hidden state from the BERT outputs (the token embeddings)
        last_hidden_state = outputs.last_hidden_state

        # Perform pooling to obtain sentence-level embeddings (mean pooling by default)
        if self.pooling == 'mean':
            # Expand the attention mask to match the size of the last hidden state for element-wise multiplication
            input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()

            # Compute the sum of embeddings, ignoring padding tokens (by applying attention mask)
            sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, dim=1)

            # Sum the attention mask (counts real tokens) to later normalize the embeddings
            sum_mask = input_mask_expanded.sum(dim=1)

            # Prevent division by zero by clamping the sum_mask to a minimum value
            sum_mask = torch.clamp(sum_mask, min=1e-9)

            # Normalize the embeddings by dividing the sum of embeddings by the number of real tokens
            sentence_embeddings = sum_embeddings / sum_mask
        else:
            # Fallback: use the [CLS] token's embedding as a representation of the whole sentence
            sentence_embeddings = last_hidden_state[:, 0, :]

        # Task A output: Class scores for sentence classification (based on the sentence embeddings)
        logits_task_a = self.classifier_task_a(sentence_embeddings)

        # Task B output: Class scores for sentiment analysis (based on the sentence embeddings)
        logits_task_b = self.classifier_task_b(sentence_embeddings)

        # Return the outputs for both tasks
        return logits_task_a, logits_task_b

### 2. Finetuning Model

#### Prepare Data for Finetuning

In [20]:
# Define a dataset class
class MultiTaskDataset(Dataset):
    def __init__(self, texts, labels_task_a, labels_task_b, tokenizer, max_length=256):
        self.texts = texts
        self.labels_task_a = labels_task_a
        self.labels_task_b = labels_task_b
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label_task_a': torch.tensor(self.labels_task_a[idx], dtype=torch.long),
            'label_task_b': torch.tensor(self.labels_task_b[idx], dtype=torch.long)
        }

# Example sentences and labels (news=0, opinion=1, entertainment=2)
sentences = [
    "The stock market saw a sharp decline today.",
    "I think remote work is the future of employment.",
    "The new action movie was absolutely thrilling!",
    "Scientists discovered a new planet outside our solar system.",
    "The new education policy is a step in the right direction.",
    "The concert last night was an unforgettable experience.",
    "The latest smartphone lacks innovation and feels overpriced.",
    "The football team's performance was disappointing this season.",
    "The annual music festival attracted thousands of fans.",
    "A major earthquake struck the city, causing widespread damage.",
    "The new restaurant in town serves the best Italian food.",
    "The gaming industry continues to evolve with new technology."
]
labels_task_a = [0, 1, 2, 0, 0, 2, 1, 1, 2, 0, 1, 2]  # Example class labels for Task A
labels_task_b = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # Sentiment: 1 = Positive, 0 = Negative
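
Each item returned by `MultiTaskDataset` is padded to `max_length`, and the attention mask marks real tokens with 1 and padding with 0. The sketch below reproduces that layout on a short list of token ids. `pad_to_max_length` is a hypothetical helper, the two middle ids are illustrative, and plain slicing is a simplification: the real tokenizer keeps the final `[SEP]` token when it truncates.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_to_max_length(token_ids, max_length):
    """Truncate or right-pad token ids and build the matching attention mask."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [PAD_ID] * padding
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the ids between them are illustrative.
input_ids, attention_mask = pad_to_max_length([101, 7592, 2088, 102], max_length=6)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

With `max_length=256` and sentences of roughly ten tokens, most of every input is padding, which is why the model relies on the attention mask during pooling.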



#### Finetune the model

In [21]:
# Hyperparameters
model_name = 'bert-base-uncased'
max_length = 256
batch_size = 2
num_epochs = 3

# Tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

# Prepare dataset and dataloader
train_dataset = MultiTaskDataset(sentences, labels_task_a, labels_task_b, tokenizer, max_length=max_length)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)


# Define number of classes for each task
num_classes_task_a = 3  # e.g., Sentence classification could have 3 classes: 'News', 'Opinion', 'Entertainment'
num_classes_task_b = 2  # e.g., Sentiment analysis could have 2 classes: 'Positive', 'Negative'

# Create an instance of the multi-task model with the specified number of classes
model = MultiTaskSentenceTransformerModel(
    num_classes_task_a=num_classes_task_a,
    num_classes_task_b=num_classes_task_b
)

# Loss function and optimizer
criterion_task_a = nn.CrossEntropyLoss()
criterion_task_b = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def train(model, train_loader, optimizer, criterion_task_a, criterion_task_b, epochs=3):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_task_a = batch['label_task_a'].to(device)
            labels_task_b = batch['label_task_b'].to(device)

            optimizer.zero_grad()
            logits_task_a, logits_task_b = model(input_ids, attention_mask)

            loss_a = criterion_task_a(logits_task_a, labels_task_a)
            loss_b = criterion_task_b(logits_task_b, labels_task_b)
            loss = loss_a + loss_b

            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

# Run training
train(model, train_loader, optimizer, criterion_task_a, criterion_task_b, epochs=num_epochs)

Epoch 1/3, Loss: 11.3722
Epoch 2/3, Loss: 8.7849
Epoch 3/3, Loss: 6.8588
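
The training loop adds the two cross-entropy losses with equal weight and prints the sum over all batches in an epoch. The sketch below works through both pieces in plain Python: a per-example joint loss on made-up logits, and the per-batch average implied by the printed totals (12 sentences with a batch size of 2 give 6 batches). The `cross_entropy` helper is an illustrative reimplementation, not the PyTorch function.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Made-up logits for one sentence: Task A has 3 classes, Task B has 2.
loss_a = cross_entropy([0.0, 0.0, 0.0], target=0)  # uniform over 3 classes: ln(3)
loss_b = cross_entropy([0.0, 0.0], target=1)       # uniform over 2 classes: ln(2)
joint_loss = loss_a + loss_b
print(round(joint_loss, 4))  # 1.7918

# Convert the printed epoch totals into an average loss per batch.
epoch_totals = [11.3722, 8.7849, 6.8588]
per_batch = [round(total / 6, 3) for total in epoch_totals]
print(per_batch)  # [1.895, 1.464, 1.143]
```

A model that guesses uniformly would score about 1.79 per batch, so the first-epoch average of roughly 1.90 is consistent with classifier heads that start from random weights.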


#### Save and Download finetuned model

In [None]:
# Save the model
model_path = '/content/MultiTask-Sentence-Transformer-model.pth'
torch.save(model.state_dict(), model_path)

# Download the saved model
files.download(model_path)

### 3. Model Evaluation
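
Evaluation here turns each head's logits into probabilities with softmax and takes the most probable class as the prediction. A minimal plain-Python sketch of that step follows; the logits and the `labels` list are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['News', 'Opinion', 'Entertainment']  # hypothetical label order for Task A

logits = [2.0, 0.5, 0.1]  # made-up Task A scores for one sentence
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
print(labels[predicted])  # News
```

Softmax does not change which class has the highest score, so it is only needed when the probabilities themselves are reported.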

In [22]:
# Example inference
model.eval()

# Sample sentences for testing
test_sentences = [
    "The government has announced a new policy to tackle climate change, aiming for carbon neutrality by 2050.",
    "Scientists have discovered a new species of deep-sea fish in the Pacific Ocean.",
    "The stock market saw a sharp decline today as investors reacted to inflation concerns.",
    "A major earthquake struck the city, causing widespread damage and power outages.",
    "I believe that remote work is the future of employment, offering flexibility and increased productivity.",
    "Social media has done more harm than good by spreading misinformation and polarizing society.",
    "The new healthcare policy is a step in the right direction, but it still fails to address affordability.",
    "In my view, electric cars are not the ultimate solution to climate change, but they are a necessary step forward.",
    "The latest Marvel movie broke box office records, becoming the highest-grossing film of the year.",
    "The Grammy Awards featured stunning performances from some of the biggest names in the music industry.",
    "A new fantasy novel series has taken the literary world by storm, selling millions of copies worldwide.",
    "The hit TV show’s season finale left fans with a shocking cliffhanger, sparking online debates."
]

# Tokenize the sentences using BERT tokenizer
encoded_input = tokenizer(
    test_sentences,  # List of sentences to tokenize
    padding='max_length',  # Pad sentences to make them the same length
    truncation=True,  # Truncate sentences that exceed the max token length for BERT
    return_tensors='pt',  # Return as PyTorch tensors for compatibility with the model
    max_length=max_length
)

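with torch.no_grad():
    logits_task_a, logits_task_b = model(
        input_ids=encoded_input['input_ids'].to(device),  # Tokenized input sentences, moved to the model's device
        attention_mask=encoded_input['attention_mask'].to(device)  # Attention mask to ignore padding tokens
    )

# Move logits back to the CPU so they can be converted to NumPy arrays below
logits_task_a, logits_task_b = logits_task_a.cpu(), logits_task_b.cpu()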

# Apply softmax to logits to obtain probabilities for both tasks (optional)
probabilities_task_a = torch.nn.functional.softmax(logits_task_a, dim=1)  # For classification (Task A)
probabilities_task_b = torch.nn.functional.softmax(logits_task_b, dim=1)  # For sentiment analysis (Task B)

label_map = {0: 'News', 1: 'Opinion', 2: 'entertainment'}

print("Task A: Sentence Classification")
for idx, probs in enumerate(probabilities_task_a):
    predicted_class = torch.argmax(probs, dim=0).item()
    print(f"Sentence {idx+1}: {test_sentences[idx]}")
    print(f"Class Probabilities: {probs.numpy()}")
    print(f"Predicted Class: {label_map[predicted_class]}")

Task A: Sentence Classification
Sentence 1: The government has announced a new policy to tackle climate change, aiming for carbon neutrality by 2050.
Class Probabilities: [0.5219605  0.23490465 0.2431348 ]
Predicted Class: News
Sentence 2: Scientists have discovered a new species of deep-sea fish in the Pacific Ocean.
Class Probabilities: [0.5241299  0.27061197 0.20525806]
Predicted Class: News
Sentence 3: The stock market saw a sharp decline today as investors reacted to inflation concerns.
Class Probabilities: [0.5946374  0.19517238 0.21019024]
Predicted Class: News
Sentence 4: A major earthquake struck the city, causing widespread damage and power outages.
Class Probabilities: [0.6535912  0.12692635 0.21948244]
Predicted Class: News
Sentence 5: I believe that remote work is the future of employment, offering flexibility and increased productivity.
Class Probabilities: [0.35063282 0.3928569  0.25651032]
Predicted Class: Opinion
Sentence 6: Social media has done more harm than good by