In [None]:
# Check whether the BERT model loads properly on this machine

from transformers import TFBertModel, BertTokenizer

model_name = "bert-base-uncased"
try:
    tokenizer = BertTokenizer.from_pretrained(model_name, force_download=True)
    model = TFBertModel.from_pretrained(model_name, force_download=True)
    print("Model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on,

Model loaded successfully.


## Task 1: Sentence Transformer Implementation

In [2]:
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf

model_name = "bert-base-uncased"

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained(model_name, force_download=True)
model = TFBertModel.from_pretrained(model_name, force_download=True)

# Sample sentences for embedding generation
sentences = ["This is a test sentence.", "Transformers are powerful models for NLP tasks."]

# Tokenize the sentences
inputs = tokenizer(sentences, return_tensors="tf", padding=True, truncation=True, max_length=128)

# Generate embeddings using the model
outputs = model(inputs.input_ids)

# Mean pooling to get a fixed-size vector for each sentence
embeddings = tf.reduce_mean(outputs.last_hidden_state, axis=1)

print("Embeddings:", embeddings)



Embeddings: tf.Tensor(
[[-0.2708747  -0.44847974  0.18935685 ... -0.2094822   0.13687545
  -0.06472529]
 [ 0.04127148 -0.33472702  0.3767434  ... -0.45713213 -0.24520482
  -0.03398323]], shape=(2, 768), dtype=float32)


### Task-1 Explanation

In this task, the model architecture is primarily based on the BERT transformer backbone (TFBertModel). There were no significant changes or additional architectural choices beyond using the pre-trained BERT model and generating sentence embeddings. However, here are some considerations made during the design:

1. Pooling Strategy:

The embeddings are generated by mean pooling over the token embeddings (outputs.last_hidden_state), which gives a fixed-size 768-dimensional vector per sentence regardless of sentence length. One caveat: the model is called without the attention mask and the mean runs over every position, so in a padded batch the shorter sentence's vector is also influenced by padding tokens. Weighting the mean by the attention mask avoids this.

2. Tokenization and Padding:

padding=True pads every sentence to the longest one in the batch, and truncation=True cuts anything beyond max_length=128 tokens. Equal lengths within a batch are needed so the inputs can be stacked into a single tensor.

3. Output Representation:

The model's outputs are taken directly from the final hidden states of the BERT model. No additional layers or task-specific heads (such as classification layers) are added in this task, as the goal is just to generate embeddings.
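The pooling step can be made padding-aware by weighting each token with the attention mask. The following is a minimal NumPy sketch of that idea on toy data, not the notebook's actual code; `masked_mean_pool` is a hypothetical helper name, and in TensorFlow the same arithmetic would use `inputs["attention_mask"]` with `tf.reduce_sum`.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden) array
    attention_mask: (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # padded slots contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # guard against division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, hidden size 2.
# The last slot of the second sentence is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)  # second row averages only the two real tokens
plain = hidden.mean(axis=1)              # second row is pulled toward the padding vector
print(pooled)
print(plain)
```

For a sentence without padding the two results agree; they differ only where padding is present.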

In [None]:
#Save the sentence embeddings in your machine
import numpy as np
np.save("sentence_embeddings.npy", embeddings.numpy())
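Once saved, the embeddings can be loaded back and compared. The sketch below is an illustration under stated assumptions: it uses random stand-in vectors of the same shape (2, 768) instead of the real embeddings, writes to a temporary directory, and `cosine_similarity` is a hypothetical helper.

```python
import os
import tempfile
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for `embeddings.numpy()`: random values, same shape and dtype
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(2, 768)).astype(np.float32)

# Save and reload, as the cell above does with the real embeddings
path = os.path.join(tempfile.mkdtemp(), "sentence_embeddings.npy")
np.save(path, fake_embeddings)
loaded = np.load(path)

print(loaded.shape)
print(cosine_similarity(loaded[0], loaded[1]))
```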


## Task-2: Multi-Task Learning Expansion

In [None]:
# Task-2:Multi-Task Learning Expansion

class MultiTaskSentenceTransformer(tf.keras.Model):
    def __init__(self, model_name="bert-base-uncased", num_classes=3):
        super(MultiTaskSentenceTransformer, self).__init__()
        
        # Load the tokenizer and BERT model
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.bert = TFBertModel.from_pretrained(model_name)
        
        # Define task-specific output layers
        self.classification_head = tf.keras.layers.Dense(num_classes, activation='softmax')  # For Sentence Classification
        self.sentiment_head = tf.keras.layers.Dense(2, activation='softmax')  # Binary sentiment classification (Positive/Negative)

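    def call(self, sentences, task="classification"):
        # Tokenize and pass through the BERT model, including the attention mask so padding is ignored
        inputs = self.tokenizer(sentences, return_tensors="tf", padding=True, truncation=True, max_length=128)
        outputs = self.bert(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
        # Masked mean pooling: average only over real tokens, not padding positions
        mask = tf.cast(inputs["attention_mask"][..., tf.newaxis], outputs.last_hidden_state.dtype)
        pooled_output = tf.reduce_sum(outputs.last_hidden_state * mask, axis=1) / tf.reduce_sum(mask, axis=1)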
        
        # Select task-specific head
        if task == "classification":
            return self.classification_head(pooled_output)  # For Sentence Classification
        elif task == "sentiment":
            return self.sentiment_head(pooled_output)  # For Sentiment Analysis
        else:
            raise ValueError("Invalid task specified. Choose between 'classification' and 'sentiment'.")

# Testing the multi-task model
multi_task_model = MultiTaskSentenceTransformer()
sentences = ["This is a test sentence.", "Transformers are powerful models for NLP tasks."]

classification_output = multi_task_model(sentences, task="classification")
sentiment_output = multi_task_model(sentences, task="sentiment")

print("Classification Output:", classification_output)
print("Sentiment Output:", sentiment_output)



Classification Output: tf.Tensor(
[[0.57701844 0.20382427 0.21915731]
 [0.48584175 0.11417866 0.39997956]], shape=(2, 3), dtype=float32)
Sentiment Output: tf.Tensor(
[[0.21390598 0.786094  ]
 [0.47390592 0.526094  ]], shape=(2, 2), dtype=float32)


### Task-2: Explanation
### Describe the changes made to the architecture to support multi-task learning?

1. **Shared BERT Encoder:** The BERT model (TFBertModel) is used as a shared encoder to generate contextual embeddings for the input sentences, which are then used for both tasks.

2. **Task-Specific Heads:**

   A. Sentence Classification: A dense softmax layer with num_classes outputs (3 by default) is added on top of the pooled embeddings.

   B. Sentiment Analysis: A dense softmax layer with two outputs is added for binary sentiment classification (positive/negative).

3. **Multi-Task Setup:** Both heads share the same BERT encoder but have separate output layers. Each task gets its own loss function. Since both heads output softmax probabilities, categorical cross-entropy fits both; a single sigmoid unit with binary cross-entropy would be an alternative for the sentiment head.

4. **Training:** The cell above only runs a forward pass with untrained heads, so the printed probabilities are not meaningful predictions. In training, both tasks would be optimized jointly, for example by minimizing the sum of the two losses.
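A joint objective for the two heads can be sketched as a weighted sum of per-task cross-entropy losses. This is a minimal NumPy illustration, not the notebook's training code: the probabilities are the softmax outputs printed above, while the labels, the task weights, and the helper names `cross_entropy` and `multi_task_loss` are assumptions made up for the example.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of integer labels under softmax outputs."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def multi_task_loss(class_probs, class_labels, sent_probs, sent_labels, weights=(1.0, 1.0)):
    """Weighted sum of the classification and sentiment losses."""
    loss_cls = cross_entropy(class_probs, class_labels)
    loss_sent = cross_entropy(sent_probs, sent_labels)
    total = weights[0] * loss_cls + weights[1] * loss_sent
    return total, loss_cls, loss_sent

# Softmax outputs copied from the test run above
class_probs = np.array([[0.57701844, 0.20382427, 0.21915731],
                        [0.48584175, 0.11417866, 0.39997956]])
sent_probs = np.array([[0.21390598, 0.786094],
                       [0.47390592, 0.526094]])

# Hypothetical labels, for illustration only
class_labels = np.array([0, 2])
sent_labels = np.array([1, 1])

total, loss_cls, loss_sent = multi_task_loss(class_probs, class_labels, sent_probs, sent_labels)
print(total, loss_cls, loss_sent)
```

Unequal weights are one simple way to keep a task with a larger loss scale from dominating the shared encoder's updates.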

## Task 3: Training Considerations
1. **Freezing the Entire Network:**

*Scenario:* In this case, the entire network (including the transformer backbone and both task-specific heads) is frozen. The model will not be updated during training, meaning no gradients will be computed for the weights in the network.

**Implications and Advantages:**

**Advantages:**

A. **Pre-trained Features:** This is useful when you want to use the pre-trained model as a feature extractor without modifying it. The transformer layer captures rich linguistic features, so freezing the entire model can save computational resources.

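B. **Limited Data:** If the dataset is small, freezing everything removes the risk of overfitting, because nothing is trained. The flip side is that the task-specific heads are frozen too: unless they were trained beforehand, they stay at their random initialization, so this setup only makes sense for pure feature extraction or for inference with already-trained heads.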

**Disadvantages:**

**Limited Flexibility:** The model won’t adapt to your specific task, since the weights will remain unchanged. If the pre-trained model doesn’t generalize well to your specific data, the performance could be suboptimal.



2. **Freezing the Transformer Backbone (BERT) and Unfreezing Task-Specific Heads**

*Scenario:* In this case, the transformer backbone (BERT) is frozen, and only the task-specific heads (the classification and sentiment heads) are trainable. This allows the pre-trained BERT model to provide rich feature representations, while training only the final layers for the specific tasks.

**Implications and Advantages:**

**Advantages:**

A. **Preserving Pre-trained Knowledge:** The transformer model (BERT) has been pre-trained on large datasets, which gives it a rich understanding of language. By freezing it, we preserve this knowledge and use it for our tasks without having to re-learn it from scratch.

B. **Faster Training:** Training only the heads (classification and sentiment) reduces the computational load significantly, as the majority of the model is frozen.

C. **Task-Specific Adaptation:** The heads will be fine-tuned to the specific tasks (e.g., classification and sentiment analysis), so they can learn the necessary mapping from BERT’s features to task-specific outputs.

**Disadvantages:**

**Suboptimal Performance:** If the transformer’s representation is not optimal for your tasks, freezing it could limit the model’s ability to adapt and improve. In some cases, the transformer’s feature representation might need task-specific adjustments to achieve the best performance.



3. **Freezing One of the Task-Specific Heads (Task A or Task B)**

*Scenario:* In this scenario, we freeze one of the task-specific heads while leaving the other one trainable. For instance, you could freeze the classification head and only train the sentiment head, or vice versa.

**Implications and Advantages:**

**Advantages:**

A. **Task-Specific Fine-tuning:** This is useful if you have more data for one task than the other. For example, if you have a large dataset for sentiment analysis but a smaller dataset for classification, you can fine-tune the sentiment head while keeping the classification head fixed.

B. **Efficiency:** If one task (e.g., sentiment analysis) is more critical or needs more fine-tuning, this approach allows you to focus resources on that specific task while preserving the rest of the model.

**Disadvantages:**

**Imbalance:** Freezing one of the heads means you won’t be able to jointly optimize both tasks. This could lead to suboptimal performance on the frozen task since it won’t be able to adapt further.
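The three freezing scenarios differ only in which parameter groups receive updates. The sketch below illustrates this with plain NumPy and dummy gradients; it is a framework-free toy, not the notebook's model, and `sgd_step` is a hypothetical helper. In Keras the corresponding switch is the `trainable` attribute of a layer (for example setting `trainable = False` on the BERT layer or on one head).

```python
import numpy as np

def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, skipping every parameter group marked as frozen."""
    return {
        name: value - lr * grads[name] if trainable[name] else value
        for name, value in params.items()
    }

# Toy parameter groups standing in for the backbone and the two heads
params = {
    "backbone": np.ones((4, 4)),
    "classification_head": np.ones((4, 3)),
    "sentiment_head": np.ones((4, 2)),
}
# Dummy gradients (all ones), only there to make updates visible
grads = {name: np.ones_like(value) for name, value in params.items()}

scenarios = {
    "1. freeze everything": {
        "backbone": False, "classification_head": False, "sentiment_head": False},
    "2. freeze backbone, train heads": {
        "backbone": False, "classification_head": True, "sentiment_head": True},
    # Assumption: the backbone stays trainable here; the text leaves this open
    "3. freeze classification head": {
        "backbone": True, "classification_head": False, "sentiment_head": True},
}

results = {}
for label, trainable in scenarios.items():
    updated = sgd_step(params, grads, trainable)
    results[label] = [name for name in params
                      if not np.array_equal(updated[name], params[name])]
    print(f"{label}: updated groups = {results[label]}")
```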



### Transfer Learning: A Beneficial Scenario

Transfer learning allows us to take a pre-trained model and adapt it to a new task. Here's how you can approach the transfer learning process with a BERT-based model, particularly in the context of multi-task learning:

1. **Choice of Pre-Trained Model**

The choice of a pre-trained model depends on the task you're solving. For this example, we are using BERT (bert-base-uncased), which is pre-trained on large corpora and has been shown to work well for a variety of NLP tasks.

**Pre-trained Model:** bert-base-uncased is a commonly used version for transfer learning in NLP tasks. It has been trained on large text corpora and provides strong feature representations for sentences.

2. **Which Layers to Freeze/Unfreeze**

A. **Freeze the Transformer Backbone (BERT):**

When training the task-specific heads, we would generally freeze the BERT backbone, since it already captures useful language features. This allows us to adapt the final dense layers (task-specific heads) without changing the transformer’s weights.

Why Freeze the Transformer? Freezing the transformer reduces the risk of overfitting on small datasets and lowers training cost, because only the heads are updated. BERT's pre-trained weights already encode general language features, so they can be reused as they are.

B. **Unfreeze the Task-Specific Heads:** 

These layers are trainable because we want to adapt the final decision-making layers to the specific tasks (sentence classification and sentiment analysis). Each task-specific head can learn to map the transformer’s output to the correct task outputs.

3. **Rationale Behind Freezing/Unfreezing**

A. **Freezing the Backbone:** 

BERT has already been pre-trained on a large corpus and learned to capture deep language features. Fine-tuning this on a small dataset might lead to overfitting or slower convergence, so freezing helps in faster adaptation to new tasks without losing generalization.

B. **Unfreezing the Task-Specific Heads:**

We want to learn task-specific mappings from the BERT features to the desired outputs. By unfreezing the heads, we allow the model to adapt to the specific nature of the tasks (classification and sentiment).
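To get a feel for how little is trained when the backbone is frozen, the head sizes can be worked out by hand. The sketch below counts the parameters of the two dense heads; the backbone size of about 110 million parameters is a rough, commonly cited figure for BERT-base and is used here only as an assumption for scale.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

hidden_size = 768  # pooled embedding width, matching the (2, 768) embeddings from Task 1
classification_params = dense_params(hidden_size, 3)  # classification head, num_classes=3
sentiment_params = dense_params(hidden_size, 2)       # sentiment head, 2 outputs
approx_backbone_params = 110_000_000  # assumption: approximate size of BERT-base

trainable = classification_params + sentiment_params
share = trainable / (trainable + approx_backbone_params)

print(classification_params, sentiment_params)  # 2307 1538
print(f"trainable share with a frozen backbone: {share:.6%}")
```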

## Task-4: Layer-wise Learning Rate Implementation (BONUS)

In [None]:
# Task 4: layer-wise learning rates
import re

def layer_depth(var_name, num_layers=12):
    # Depth 0 = embeddings, 1..num_layers = encoder layers, num_layers + 1 = pooler and task heads.
    # Assumes Hugging Face TF variable names such as ".../encoder/layer_._3/attention/..."
    match = re.search(r"layer_\._(\d+)", var_name)
    if match:
        return int(match.group(1)) + 1
    return 0 if "embeddings" in var_name else num_layers + 1

# Function to create one Adam optimizer per layer depth.
# Scaling gradients inside a single Adam optimizer does not change the effective
# learning rate, because Adam normalizes each update by the gradient magnitude.
def get_optimizers(model, base_lr=1e-5, decay=0.9, num_layers=12):
    # Group the trainable variables by depth
    groups = {}
    for var in model.trainable_variables:
        groups.setdefault(layer_depth(var.name, num_layers), []).append(var)

    optimizers = []
    for depth, variables in sorted(groups.items()):
        # Heads train at base_lr; each layer further from the output gets a rate scaled by `decay`
        lr = base_lr * decay ** (num_layers + 1 - depth)
        optimizers.append((tf.keras.optimizers.legacy.Adam(learning_rate=lr), variables))
    return optimizers

# One training step: compute all gradients once, then let each optimizer update its own group
def train_step(model, optimizers, sentences, labels, task, loss_fn):
    variables = [var for _, group in optimizers for var in group]
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(sentences, task=task))
    grads = tape.gradient(loss, variables)

    start = 0
    for optimizer, group in optimizers:
        group_grads = grads[start:start + len(group)]
        start += len(group)
        # Variables that did not take part in this task (e.g. the other head) have no gradient
        pairs = [(g, v) for g, v in zip(group_grads, group) if g is not None]
        if pairs:
            optimizer.apply_gradients(pairs)
    return loss

# Assuming multi_task_model has already been created and called once, so the heads are built
optimizers = get_optimizers(multi_task_model)
loss_fn = tf.keras.losses.CategoricalCrossentropy()  # both heads output softmax probabilities
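One caveat when implementing per-layer rates with Adam: multiplying a gradient by a constant is not the same as multiplying the learning rate, because Adam divides each update by a running estimate of the gradient magnitude. The following NumPy sketch of a single textbook Adam step (starting from zero moments) illustrates this; `adam_first_step` is a hypothetical helper and not a Keras function.

```python
import numpy as np

def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """Size of the first Adam update for a gradient, starting from zero moments."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)  # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

grad = np.array([0.5, -2.0, 0.03])

full = adam_first_step(grad)
scaled_grad = adam_first_step(0.1 * grad)    # gradient scaled down by 10x
smaller_lr = adam_first_step(grad, lr=1e-4)  # learning rate scaled down by 10x

print(full)
print(scaled_grad)  # almost the same step as `full`
print(smaller_lr)   # a 10x smaller step than `full`
```

This is why per-layer rates are usually applied through the optimizer's learning rate (for example one optimizer per group of variables) rather than by rescaling gradients.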


### Rationale for Learning Rates:

1. **Lower Layers:** These layers focus on general language features, so they need smaller learning rates to avoid drastic changes.

2. **Upper Layers:** These layers are more task-specific and benefit from larger learning rates to adapt quickly.

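3. **Exponential Decay:** The learning rate shrinks exponentially from the top of the network toward the embeddings (for example, multiplied by 0.9 per layer), so the task heads get the full base rate and the lowest layers get the smallest one. This balances training stability and task adaptation.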
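The resulting schedule can be tabulated directly. The sketch below assumes the 12 encoder layers of BERT-base, a base rate of 1e-5, and a decay factor of 0.9; the depth convention (0 for embeddings, 13 for the task heads) and the function name `layerwise_learning_rates` are choices made for this illustration.

```python
def layerwise_learning_rates(base_lr=1e-5, decay=0.9, num_layers=12):
    """Learning rate per depth.

    Depth 0 = embeddings, 1..num_layers = encoder layers, num_layers + 1 = task heads.
    """
    top = num_layers + 1
    return {depth: base_lr * decay ** (top - depth) for depth in range(top + 1)}

rates = layerwise_learning_rates()
for depth, lr in rates.items():
    print(f"depth {depth:2d}: lr = {lr:.2e}")
```

With these values the embeddings train at roughly a quarter of the rate used for the heads.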

### Benefits of Layer-wise Learning Rates:

1. **Efficient Training:** Lower layers adjust slowly to preserve pre-trained features, while upper layers learn quickly for task adaptation.

2. **Stabilized Training:** Updating the lower layers only slightly, rather than freezing them outright, reduces the risk of overfitting and helps keep training stable.

3. **Faster Convergence:** Larger learning rates for upper layers allow faster convergence, while smaller rates for lower layers prevent drastic changes.

4. **Pre-trained Models:** BERT’s lower layers are already well-trained, so they don’t need significant updates during fine-tuning.

### Effect in Multi-task Learning:

In multi-task learning, layer-wise learning rates help balance task-specific fine-tuning while maintaining generalization across tasks:

1. **Task-Specific Fine-Tuning:** The task-specific heads train at the full rate, while the BERT layers, which are shared across tasks, are updated more conservatively.

2. **Avoiding Overfitting:** Conservative updates to the shared layers reduce the risk of overfitting to one task at the expense of the others.

3. **Task Interference:** Because the shared layers change slowly, gradients from a single task are less likely to dominate them, which can reduce interference between tasks.
