# Transfer Learning & Transformers

The purpose of this tutorial is to introduce the concept of "Transfer Learning" and the architecture of "Transformer" models, illustrated through a practical example. We will explore how the knowledge accumulated in one task can be reused to improve the performance of a model in another task.

As a case study, we will address the problem of Semantic Textual Similarity (STS). Considering the theme of the hackathon, we will use the [RO-STS dataset](https://github.com/dumitrescustefan/RO-STS) and [RoBERT](https://github.com/dumitrescustefan/Romanian-Transformers), a model pre-trained specifically for the Romanian language 🇷🇴, to illustrate how advanced machine learning techniques can be applied to Romanian text.

❗The material is based on the tutorial offered by [Ștefan Dumitrescu](https://scholar.google.com/citations?hl=en&user=UR_c_N4AAAAJ&view_op=list_works&sortby=pubdate) last year, which served as the inspiration and structure for the current notebook. Many thanks ✨

## Pre-requisites: Some theoretical stuff

1. What is an Embedding
2. Static vs Contextualized Word Embeddings
3. A few words regarding Transformers

### What is an Embedding?
An embedding converts categorical data, such as words or IDs, into vectors of numbers. Items that are related or similar get vectors that are close to each other, which lets a model capture and exploit the relationships between them.


#### One-hot representation vs Dense Representation

  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1c5ArBZ_1V-aNMHcO4VZFuRGBjREqvwVA" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
  </div>


  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1tZ0sGn0K_uKIA2JMo0pw49Acb_K_rdnB" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
  </div>
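
To make the contrast concrete, here is a minimal sketch using NumPy. The four-word vocabulary and the 3-dimensional dense vectors are made up for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["rege", "regină", "măr", "pară"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Hand-picked dense vectors, one row per word. In practice this matrix is learned.
embedding_matrix = np.array([
    [0.9, 0.8, 0.1],   # rege
    [0.9, 0.7, 0.2],   # regină
    [0.1, 0.2, 0.9],   # măr
    [0.2, 0.1, 0.8],   # pară
])

def dense(word):
    """Look up the dense vector; equivalent to one_hot(word) @ embedding_matrix."""
    return embedding_matrix[word_to_id[word]]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors of two different words are always orthogonal.
print(cosine(one_hot("rege"), one_hot("regină")))
# Dense vectors can place related words close together and unrelated words apart.
print(cosine(dense("rege"), dense("regină")))
print(cosine(dense("rege"), dense("măr")))
```

The one-hot similarity is zero for any pair of distinct words, so that representation cannot express that "rege" is closer to "regină" than to "măr"; the dense vectors can.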


### What is Word2Vec?
  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1MsjwlTFHrNor04E9lasMNtI5HB6A20Ao" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
  </div>

### How does Word2Vec work?
  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1Q-Cr6Oct5SpGSuE2QGAMdSwG3RY7yI7U" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
  </div>

### Word2Vec: Skip-Gram
  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1R23QLh5pjTIxzusnw82V2XEwDhGsiglY" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
  </div>

### Word2Vec: CBoW
  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=10xBythFgNugBt-UtMG41IQDMeQmQYMU7" style="max-width: 100%; height: 40; display: block; margin-left: auto; margin-right: auto;">
  </div>
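
The two Word2Vec variants differ mainly in how training examples are built from a sliding window over the text. The sketch below generates the pairs each variant would train on: Skip-Gram predicts each context word from the center word, while CBoW predicts the center word from its surrounding context. The example sentence, the window size of 2, and the helper names are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

def cbow_pairs(tokens, window=2):
    """Yield (context_words, center) pairs: the neighbors jointly predict the center."""
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        yield context, center

tokens = "pisica doarme pe canapea".split()
print(list(skipgram_pairs(tokens))[:3])
print(list(cbow_pairs(tokens))[:2])
```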

### Static Embeddings Limitations


*   We have established good representations for individual words.
*   But are these representations sufficient?
  *   **No**, they are fixed and cannot adapt to context (a problem for polysemous words).
* Challenges with sentence encoding:
   * How can we use fixed word embeddings for entire sentences? Can we simply average all the embeddings to find the meaning of a sentence?
   * **No again**: simply averaging fixed embeddings (the Bag of Words approach) loses semantic meaning, because word order is ignored.
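
A small NumPy sketch makes the averaging problem visible. The vectors below are random stand-ins for pre-trained static embeddings (an assumption; any fixed lookup table behaves the same way). Because the mean ignores word order, two sentences with different meanings receive the same representation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Ana", "o", "vede", "pe", "Maria"]
# Random stand-ins for static embeddings: one fixed 4-dimensional vector per word.
static_embeddings = {word: rng.normal(size=4) for word in vocab}

def average_embedding(sentence):
    """Bag-of-words sentence vector: the mean of the word vectors."""
    return np.mean([static_embeddings[w] for w in sentence.split()], axis=0)

a = average_embedding("Ana o vede pe Maria")   # "Ana sees Maria"
b = average_embedding("Maria o vede pe Ana")   # "Maria sees Ana"

# Same words, different meaning, yet the two representations coincide.
print(np.allclose(a, b))
```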





### Contextualized Embeddings
  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1gz-LsAtilfzS3QoavPnrkQoJ4oiOtHjk" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
  </div>

## Pre-requisites: Setup

In [None]:
!pip install -q transformers pytorch_lightning
!wget -q https://raw.githubusercontent.com/dumitrescustefan/RO-STS/master/dataset/text-similarity/RO-STS.train.tsv
!wget -q https://raw.githubusercontent.com/dumitrescustefan/RO-STS/master/dataset/text-similarity/RO-STS.dev.tsv
!wget -q https://raw.githubusercontent.com/dumitrescustefan/RO-STS/master/dataset/text-similarity/RO-STS.test.tsv

In [None]:
import os
import sys
import json
import logging
import random

import numpy as np
import torch
import torch.nn as nn
import pytorch_lightning as pl


from pprint import pprint

from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import EarlyStopping
from transformers import AutoTokenizer, AutoModel, AutoConfig, TrainingArguments

TRAIN_DATASET_FILENAME = "RO-STS.train.tsv"
DEV_DATASET_FILENAME = "RO-STS.dev.tsv"
TEST_DATASET_FILENAME = "RO-STS.test.tsv"

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Ensure reproducibility (seed Python's `random` as well, since it is used for sampling later)
random.seed(42)
torch.manual_seed(42)
np.random.seed(42)

## Data processing & Loading


In the following sections we implement a custom dataset. Writing your own dataset class lets you tailor data handling to the needs of a project, for example:
* **Data Format Adaptation:** Support data formats that standard datasets do not handle directly, so that the samples match the model's input requirements.
* **Domain-Specific Preprocessing:** Apply specialized preprocessing steps informed by domain knowledge.
* **Data Augmentation:** Incorporate data augmentation within the loading process.

  ***Example:*** *In image classification tasks, a custom dataset can apply random transformations (e.g., rotations, etc.) to each image as it is loaded.*
* **Efficient Data Pipeline:** To optimize data loading and preprocessing for improved computational efficiency.

  ***Example:*** *Loading a large dataset entirely into RAM is often impractical: once the dataset exceeds the available memory, loading becomes a bottleneck or fails outright. The alternative is to stream the data directly from the filesystem, consuming it in manageable chunks as needed, which drastically reduces the memory footprint.*
  <div style="text-align: center;">
      <img src="https://drive.google.com/uc?id=1yNU2qO9Cfe3mB3Pe0DHIqBhG8SfZKNu2" style="max-width: 50%; height: auto; display: block; margin-left: auto; margin-right: auto;">
      <p><a href="https://medium.com/@alexanderwei_12000/dataset-streaming-in-pytorch-1c480157db9c" target="_blank">Source: Dataset Streaming in PyTorch</a></p>
  </div>
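
As a rough sketch of the streaming idea, the generator below reads a tab-separated file lazily, one line at a time, and groups rows into fixed-size chunks, so only one chunk is held in memory at any moment. The file name `toy.tsv`, the `stream_tsv` helper, and the chunk size are hypothetical; in PyTorch the same pattern is typically wrapped in a `torch.utils.data.IterableDataset`.

```python
import os
import tempfile

def stream_tsv(path, chunk_size=2):
    """Lazily yield lists of up to `chunk_size` rows without loading the whole file."""
    chunk = []
    with open(path, "r", encoding="utf8") as f:
        for line in f:  # the file object is itself a lazy line iterator
            chunk.append(line.rstrip("\n").split("\t"))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # leftover rows that did not fill a whole chunk
        yield chunk

# Write a tiny example file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "toy.tsv")
with open(path, "w", encoding="utf8") as f:
    f.write("4.0\tA\tB\n1.5\tC\tD\n0.0\tE\tF\n")

chunks = list(stream_tsv(path))
for chunk in chunks:
    print(chunk)
```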



  



### How to create a custom dataset?

[PyTorch](https://pytorch.org/) gives you the freedom to manipulate the `Dataset` class in any manner you choose, provided you implement the following three methods:

1. **\_\_init__(self, ...):** The constructor method is used to initialize the dataset object. Here, you perform tasks such as reading the data files, initializing data structures, preprocessing data, and loading any necessary resources (e.g., tokenizers). This method runs once when the dataset object is first created.

2. **\_\_len__(self):** This method should return the total number of samples in the dataset. PyTorch uses this method to get the dataset's size. Implementing this method allows PyTorch's data handling utilities, like `DataLoader`, to know the size of the dataset for batching, sampling, shuffling, and other data manipulation operations.

3. **\_\_getitem__(self, index):** This method should return a single sample from the dataset at the specified index (it retrieves the specific data instance required for training or inference at each step). PyTorch's DataLoader calls this method to fetch data in batches during the training or evaluation loop.


In [None]:
class STSDataset(Dataset):
    """
    Custom dataset for processing text data, including special tokens for BERT models.
    """
    def __init__(self, file_path):
        """
        Reads the file at `file_path` and stores the preprocessed samples in memory.

        Args:
            file_path: Path to the file containing data samples.
        """
        self.samples = []

        with open(file_path, "r", encoding="utf8") as f:
            for line in f:
                similarity_score, first_sentence, second_sentence = line.strip().split("\t")
                first_sentence = self._standardize_diacritics(first_sentence)
                second_sentence = self._standardize_diacritics(second_sentence)
                formatted_sentence = self._format_with_special_tokens(first_sentence, second_sentence)
                self.samples.append({
                    "similarity_score": float(similarity_score) / 5.0,  # Normalize similarity score: [0, 5] -> [0, 1]
                    "sentences_pair": formatted_sentence
                })

    def _standardize_diacritics(self, text):
        """
        Standardizes Romanian diacritics in the given text.

        Args:
            text: The input text string to process.

        Returns:
            The text string with standardized Romanian diacritics.
        """
        return text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

    def _format_with_special_tokens(self, sentence1, sentence2):
        """
        Formats two sentences with [CLS] and [SEP] tokens. [CLS] is used at the start
        for classification tasks to represent the whole sequence. [SEP] separates sentences
        or marks the end of a sentence.

        Args:
            sentence1: The first sentence.
            sentence2: The second sentence.

        Returns:
            A string with [CLS] at the start, [SEP] between sentences, and [SEP] at the end.
        """
        return f"[CLS]{sentence1}[SEP]{sentence2}[SEP]"

    def __len__(self):
        """Returns the number of samples in the dataset."""
        return len(self.samples)

    def __getitem__(self, index):
        """
        Retrieves a sample by its index.

        Args:
            index: Index of the sample to retrieve.

        Returns:
            A sample consisting of a normalized similarity score and a concatenated
            sentence with [CLS] and [SEP] tokens.
        """
        return self.samples[index]
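
The diacritics step in the dataset matters because Romanian text often mixes two encodings of the same letters: legacy forms with a cedilla and standard forms with a comma below. They look almost identical but are different Unicode code points, so without normalization they would be treated as different characters. The sketch below illustrates this with explicit code points; the standalone `standardize_diacritics` helper and the `CEDILLA_TO_COMMA` table are illustrative equivalents of the method in the class, not a replacement for it.

```python
import unicodedata

# Legacy cedilla forms mapped to the comma-below forms used in standard Romanian.
CEDILLA_TO_COMMA = {
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
}

def standardize_diacritics(text):
    """Replace every cedilla variant with its comma-below counterpart."""
    return text.translate(str.maketrans(CEDILLA_TO_COMMA))

for old, new in CEDILLA_TO_COMMA.items():
    print(f"U+{ord(old):04X} {unicodedata.name(old)} -> U+{ord(new):04X} {unicodedata.name(new)}")

legacy = "\u015ftiin\u0163\u0103"        # "science" typed with cedilla forms
expected = "\u0219tiin\u021b\u0103"      # the same word with comma-below forms
standard = standardize_diacritics(legacy)
print(legacy == expected, standard == expected)
```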

### Splitting the Dataset into Training, Validation, and Test Sets

In [None]:
train_dataset = STSDataset(TRAIN_DATASET_FILENAME)
dev_dataset = STSDataset(DEV_DATASET_FILENAME)
test_dataset = STSDataset(TEST_DATASET_FILENAME)

Let's see some samples from our training dataset.

In [None]:
# Define the number of samples you want to pick
num_samples_to_show = 5

# Generate random indices
random_indices = random.sample(range(len(train_dataset)), num_samples_to_show)

# Retrieve and display the samples
for idx in random_indices:
    sample = train_dataset[idx]
    print(f"Sample {idx}: {sample}")

Sample 1808: {'similarity_score': 0.12, 'sentences_pair': '[CLS]Un bărbat pulverizează un lichid dintr-un furtun lung la plajă.[SEP]Un băiat stă pe bicileta sa cu o roată în aer.[SEP]'}
Sample 2932: {'similarity_score': 0.9199999999999999, 'sentences_pair': '[CLS]Sorkin, care este acuzat de conspirație pentru obstrucționarea justiției și de sperjur, urma să fie judecat separat.[SEP]Sorkin urma să fie judecat separat pentru acuzații de conspirație și sperjur.[SEP]'}
Sample 1423: {'similarity_score': 0.8, 'sentences_pair': '[CLS]Opt sticle de lager Harp aliniate pe podea.[SEP]Opt sticle de bere Harp aliniate pe o podea din lemn.[SEP]'}
Sample 4072: {'similarity_score': 0.72, 'sentences_pair': '[CLS]Sondaj AP: Majoritatea din SUA are prejudicii împotriva negrilor[SEP]Sondaj AP: Majoritatea are prejudecăți împotriva negrilor[SEP]'}
Sample 4339: {'similarity_score': 0.12, 'sentences_pair': '[CLS]Muncitorii cu dizabilități „trădați” protestează împotriva închiderilor Remploy[SEP]Ziua de mai,

In our case, the dataset is already split into train, dev, and test sets, but most of the time you will have to split it yourself. How you split depends on how much data you have.

A good rule of thumb for splitting small and medium datasets into training, validation (development), and testing sets is as follows:

**Small Datasets:**
* **Training Set:** Approximately 60-70%. This ensures that the model has access to as much data as possible for learning, which is crucial when the overall dataset size is limited.
* **Validation Set:** Around 15-20%. This portion is used for tuning the model's hyperparameters and for performing checks during the training process to avoid overfitting.
* **Test Set:** Also around 15-20%. This final slice of the dataset serves to evaluate how well the model can generalize to new, unseen data.

**Medium Datasets:**
* **Training Set:** Typically, 70-80%. With a slightly larger dataset, you can afford to allocate more data for training without compromising the model's ability to generalize from the validation and test sets.
* **Validation Set:** About 10-15%. This allows for effective model tuning and validation, ensuring the model performs well on data it hasn't been trained on.
* **Test Set:** Again, 10-15%. This ensures that there is a sufficient amount of unseen data to accurately assess the model's performance and generalizability.
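
When a dataset does not come pre-split, the proportions above can be applied with a few lines of plain Python. This is a minimal sketch: the `split_dataset` helper, the 70/15/15 ratios, and the seed are illustrative choices, not part of the RO-STS tooling.

```python
import random

def split_dataset(samples, train_frac=0.7, dev_frac=0.15, seed=42):
    """Shuffle a copy of `samples` and cut it into train/dev/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG: does not touch global state
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))
```

Shuffling before cutting matters: if the file is sorted (for example by score or by source), slicing it directly would give the three sets different distributions.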

### How to create a custom data collator?
Custom data collators like the one below enable more efficient, flexible, and model-specific preparation of data batches. A collator takes the individual samples returned by the dataset and batches them dynamically, in a way that fits the requirements of the model or the training process.


In [None]:
MODEL_PATH = "dumitrescustefan/bert-base-romanian-cased-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, strip_accents=False)


In [None]:
class STSCollator:
    """
    A custom data collator that prepares batches for model training or evaluation.
    This collator handles tokenization and ensures that sequences are padded to a uniform length.
    """
    def __init__(self, tokenizer: AutoTokenizer, max_seq_len: int):
        """
        Initializes the collator with a tokenizer and maximum sequence length.

        :param tokenizer: The tokenizer to use for tokenizing input sequences.
        :param max_seq_len: The maximum length of the sequences after tokenization.
        """
        self.max_seq_len = max_seq_len
        self.tokenizer = tokenizer

    def __call__(self, input_batch: list[dict]) -> dict:
        """
        Processes a batch of input data, tokenizing sequences and padding them to the same length.

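        :param input_batch: A list of dictionaries, where each dictionary contains the
            'similarity_score' and 'sentences_pair' keys.
        :return: A dictionary with tokenized and padded sequences, and similarity scores.
        """
        similarity_scores = [instance['similarity_score'] for instance in input_batch]
        sentences = [instance['sentences_pair'] for instance in input_batch]

        # Note: the sentences already contain [CLS] and [SEP], and the tokenizer also adds
        # its own special tokens by default, so each encoded sequence starts with two [CLS]
        # tokens. Passing add_special_tokens=False would avoid the duplication.
        tokenized_batch = self.tokenizer(
            sentences,
            padding=True,
            max_length=self.max_seq_len,
            truncation=True,
            return_tensors="pt"
        )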
        similarity_scores_tensor = torch.tensor(similarity_scores, dtype=torch.float)

        return {
            "tokenized_batch": tokenized_batch,
            "similarity_scores": similarity_scores_tensor
        }
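
To see what `padding=True` and `truncation=True` do conceptually, here is a library-free sketch that pads a batch of token-ID lists to the length of the longest sequence in the batch and builds the matching attention mask. The `pad_batch` helper and the example IDs are made up for illustration, and the truncation is a simplification; the real tokenizer handles all of this internally and returns tensors.

```python
def pad_batch(sequences, max_len, pad_id=0):
    """Truncate to `max_len`, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[2, 15, 8, 3], [2, 42, 3], [2, 7, 9, 11, 5, 3]], max_len=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding only to the longest sequence in each batch, rather than to a global maximum, keeps batches of short sentences small and saves computation.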

Let's create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for each dataset split and iterate through them.

In [None]:
MAX_SEQ_LEN = 256  # Adjust the maximum length of the sequence to your needs
BATCH_SIZE = 16    # Adjust the batch size to your requirements

sts_collator = STSCollator(tokenizer, MAX_SEQ_LEN)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=sts_collator, pin_memory=True)
validation_dataloader = DataLoader(dev_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=sts_collator, pin_memory=True)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=sts_collator)

In [None]:
# Let's iterate through just the first batch
for batch in train_dataloader:
    tokenized_batch = batch['tokenized_batch']
    similarity_scores = batch['similarity_scores']

    # Convert token IDs back to tokens to check the first few tokens of the first sequence
    # Note: Adjust the slicing based on how large you want the sample to be
    tokens = tokenizer.convert_ids_to_tokens(tokenized_batch['input_ids'][0][:10])

    print("First few tokens from the first sequence:", tokens)
    print("Corresponding similarity score:", similarity_scores[0].item())

    # Assuming you only want to check the first batch for now
    # Note: You can delete this if you want to iterate through more batches
    break

First few tokens from the first sequence: ['[CLS]', '[CLS]', 'Senatul', 'SUA', 'va', 'vota', 'acordul', 'în', 'legătură', 'cu']
Corresponding similarity score: 0.5600000023841858


## Model

PyTorch is renowned for its simplicity and flexibility, making it a popular choice for building complex AI models. However, as research projects grow in complexity—incorporating elements like multi-GPU training, 16-bit precision, and TPU support—the risk of introducing bugs increases significantly. [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) addresses these challenges by structuring PyTorch code to abstract away the intricate details of the training process. This abstraction not only minimizes potential errors but also accelerates the research cycle, making AI development more scalable and iteration faster. While PyTorch Lightning adds structure and a layer of abstraction to PyTorch, it usually maintains compatibility with the original PyTorch code.

Since we're utilizing PyTorch Lightning's Trainer for our training process, it's essential to implement at least the `training_step()` and `configure_optimizers()` methods.

The `training_step()` method outlines the operations to be performed on each batch of data during training, such as forward pass computations and loss calculation.

`configure_optimizers()` specifies the optimization algorithms and learning rate schedules to be used for updating model parameters based on computed gradients.
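
To build intuition for how these two methods are used, the sketch below writes out, in plain PyTorch, a greatly simplified version of the loop that a trainer runs on your behalf. `TinyRegressor` is a hypothetical stand-in that only mimics the two hooks (it is an `nn.Module`, not a `LightningModule`), and the synthetic data is made up; the real `Trainer` additionally handles devices, logging, validation, checkpointing, and more.

```python
import torch
import torch.nn as nn

class TinyRegressor(nn.Module):
    """Hypothetical stand-in exposing the same two hooks, without a Lightning dependency."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(3, 1)
        self.loss_fct = nn.MSELoss()

    def training_step(self, batch, batch_idx):
        features, targets = batch
        predictions = self.layer(features).flatten()
        return self.loss_fct(predictions, targets)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=0.1)

torch.manual_seed(0)
features = torch.randn(32, 3)
targets = features.sum(dim=1)  # a simple target that a linear layer can fit
batches = [(features[i:i + 8], targets[i:i + 8]) for i in range(0, 32, 8)]

model = TinyRegressor()
optimizer = model.configure_optimizers()
losses = []
for epoch in range(50):
    for batch_idx, batch in enumerate(batches):
        loss = model.training_step(batch, batch_idx)  # forward pass + loss
        optimizer.zero_grad()
        loss.backward()                                # compute gradients
        optimizer.step()                               # update parameters
    losses.append(loss.item())                         # loss of the epoch's last batch

print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.4f}")
```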

In [None]:
class RoBERTModel(pl.LightningModule):
  def __init__(self, model_name: str, lr: float = 2e-5, sequence_max_length: int = MAX_SEQ_LEN):
    """
    Initializes the RoBERTModel with a specified base model and configuration.

    :param model_name: Name or path to the pretrained model.
    :param lr: Learning rate.
    :param sequence_max_length: Maximum input sequence length for the model.
    """
    super().__init__()

    self.tokenizer = AutoTokenizer.from_pretrained(model_name, strip_accents=False)
    self.model = AutoModel.from_pretrained(model_name)
    self.output_layer = nn.Linear(self.model.config.hidden_size, 1)

    self.loss_fct = nn.MSELoss()

    self.lr = lr
    self.save_hyperparameters()

    self.gpu_available = torch.cuda.is_available()
    if self.gpu_available:
      print(f"GPU is available: {torch.cuda.get_device_name(0)}")
    else:
      print("GPU is not available, using CPU.")

  def forward(self, tokenized_batch):
    """
    Forward pass through the model.

    :param tokenized_batch: Tokenized input batch including input_ids and attention_mask.
    :return: Predictions for the input batch.
    """
    output = self.model(**tokenized_batch, return_dict=True)
    cls_embedding = output.pooler_output
    prediction = self.output_layer(cls_embedding)

    return prediction.flatten()


  def training_step(self, batch, batch_idx):
    tokenized_batch = batch['tokenized_batch']
    ground_truth = batch['similarity_scores']

    prediction = self.forward(tokenized_batch)
    loss = self.loss_fct(prediction, ground_truth)

    # Log the loss as a plain Python float, detached from the computation graph.
    self.log("train_loss", loss.detach().cpu().item(), on_step=True, on_epoch=True, prog_bar=True)
    return {"loss": loss}


  def validation_step(self, batch, batch_idx):
    tokenized_batch = batch['tokenized_batch']
    ground_truth = batch['similarity_scores']

    prediction = self.forward(tokenized_batch)
    loss = self.loss_fct(prediction, ground_truth)

    # Log the loss under its own name so it is not mixed up with the training loss.
    self.log("val_loss", loss.detach().cpu().item(), on_step=True, on_epoch=True, prog_bar=True)
    return {"loss": loss}

  def configure_optimizers(self):
    """
    Configures the model's optimizers.

    :return: The AdamW optimizer with the defined learning rate and parameters.
    """
    return AdamW([p for p in self.parameters() if p.requires_grad], lr=self.lr, eps=1e-08)


In [None]:
model = RoBERTModel(MODEL_PATH)

trainer = pl.Trainer(
    devices=-1,  # Use all available GPUs; remove this line when training on CPU
    accelerator="gpu",  # Change to "cpu" when no GPU is available
    max_epochs=5,  # Set this to -1 to train with no epoch limit
    #limit_train_batches=10,  # Uncomment for a quick debugging run on a few batches
    #limit_val_batches=5,  # Uncomment for a quick debugging run on a few batches
    gradient_clip_val=1.0,
    enable_checkpointing=True
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


GPU is available: Tesla T4


### Training

In [None]:
trainer.fit(model, train_dataloader, validation_dataloader)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name         | Type      | Params
-------------------------------------------
0 | model        | BertModel | 124 M 
1 | output_layer | Linear    | 769   
2 | loss_fct     | MSELoss   | 0     
-------------------------------------------
124 M     Trainable params
0         Non-trainable params
124 M     Total params
497.768   Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


## Predict

In [None]:
def predict(model, sent1, sent2):
    """
    Generates a prediction score for a pair of sentences using a given model.

    Args:
        model: The trained model to use for generating predictions.
        sent1 (str): The first sentence.
        sent2 (str): The second sentence.

    Returns:
        float: The prediction score, scaled by a factor of 5.
    """
    # Format the input sentences for the model. Prepend [CLS] and append [SEP] as needed.
    # TODO

    # Tokenize the concatenated sentences. Specify padding (True), max_length (MAX_SEQ_LEN), truncation (True), and tensor type (pt).
    # TODO

    # Generate predictions using the model's forward method. No need to explicitly call forward.
    # TODO

    # Scale back the prediction
    # TODO

    return prediction_score

In [None]:
# @title Solutions { display-mode: "form" }
def predict(model, sent1, sent2):
    """
    Generates a prediction score for a pair of sentences using a given model.

    Args:
        model: The trained model to use for generating predictions.
        sent1 (str): The first sentence.
        sent2 (str): The second sentence.

    Returns:
        float: The prediction score, scaled by a factor of 5.
    """
    # Format the input sentences for the model. Prepend [CLS] and append [SEP] as needed.
    concatenated_sentences = f"[CLS]{sent1.strip()}[SEP]{sent2.strip()}[SEP]"

    # Tokenize the concatenated sentences. Specify padding (True), max_length (MAX_SEQ_LEN), truncation (True), and tensor type (pt).
    tokenized_batch = model.tokenizer(
        [concatenated_sentences],
        padding=True,
        max_length=MAX_SEQ_LEN,
        truncation=True,
        return_tensors="pt"
    ).to(model.device)  # Keep the inputs on the same device as the model

    # Generate predictions using the model's forward method. No need to explicitly call forward.
    predictions = model(tokenized_batch)

    # The prediction is scaled by 5 due to normalization applied during the model's training phase.
    # This scaling ensures the prediction is on the same scale as the target values.
    prediction_score = predictions[0].item() * 5

    return prediction_score

In [None]:
def evaluate_model_on_tests(model, tests):
    """
    Evaluate the model on a series of sentence pairs and print their similarity scores.

    Args:
        model: The trained model to be evaluated.
        tests: A list of tuples, where each tuple contains two sentences to compare.
    """
    model.eval()  # Set the model to evaluation mode.

    with torch.no_grad():  # Inference mode, no gradients needed.
        for (s1, s2) in tests:
            similarity_score = predict(model, s1, s2)
            print(f"'{s1}' || '{s2}' \t SIM = {similarity_score:.2f}")

# Define test sentence pairs.
tests = [
    ("Ana are mere.", "Andreea are pere."),
    ("Filmul este foarte bun", "Filmul este extrem de slab."),
    ("Cerul este albastru azi", "Pisica a urcat pe acoperiș."),
    ("Cartea a fost interesantă", "Lectura nu m-a captivat deloc."),
    ("Am alergat un maraton", "Am terminat cursa de 42 de kilometri."),
    ("Muzica clasică este relaxantă", "Genul clasic muzical îmi calmează mintea.")
]

# Call the function to evaluate the model on the test data.
evaluate_model_on_tests(model, tests)

'Ana are mere.' || 'Andreea are pere.' 	 SIM = 2.50
'Filmul este foarte bun' || 'Filmul este extrem de slab.' 	 SIM = 2.95
'Cerul este albastru azi' || 'Pisica a urcat pe acoperiș.' 	 SIM = 0.47
'Cartea a fost interesantă' || 'Lectura nu m-a captivat deloc.' 	 SIM = 2.05
'Am alergat un maraton' || 'Am terminat cursa de 42 de kilometri.' 	 SIM = 2.28
'Muzica clasică este relaxantă' || 'Genul clasic muzical îmi calmează mintea.' 	 SIM = 2.64


## Other resources

*   If you want to read more about RoBERT: https://arxiv.org/abs/2009.08712
*   A centralized space dedicated to Romanian language models, specifically those built using Transformer architectures, and a consistent framework for evaluating these models: https://github.com/dumitrescustefan/Romanian-Transformers

