# Task 1

Task Definition - Given two sentences, calculate the similarity between them. The similarity is given as a score ranging from 0 to 5.
<br>

Train datapoint examples -
<br>
Columns (tab-separated): score, sentence1, sentence2
* 4.750 A young child is riding a horse. A child is riding a horse.
* 2.400 A woman is playing the guitar. A man is playing guitar.

<br>
The dataset is already divided into training and validation sets in the files ‘train.csv’ and ‘dev.csv’, respectively. Both files are in the zip file attached to the assignment. Please note that they are tab-separated. A test file without the score field will be provided during the demo to run inference on. You are required to create dataset classes and data loaders appropriate for your training and evaluation setups.
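As a minimal sketch of such a dataset class, the following reads a tab-separated file into a map-style dataset. The class name `STSPairDataset` and the `has_score` flag are hypothetical, and the column names `score`, `sentence1`, and `sentence2` are assumed from the example header above. It relies only on the standard library; an object exposing `__len__` and `__getitem__` like this one can generally be handed to `torch.utils.data.DataLoader`.

```python
import csv
import io


class STSPairDataset:
    """Map-style dataset over a tab-separated STS file (hypothetical helper)."""

    def __init__(self, handle, has_score=True):
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        self.rows = []
        for row in reader:
            item = {"sentence1": row["sentence1"], "sentence2": row["sentence2"]}
            if has_score:
                # The test file has no score column, so this is optional.
                item["score"] = float(row["score"])
            self.rows.append(item)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]


# Tiny in-memory stand-in for train.csv, built from the two examples above.
sample = io.StringIO(
    "score\tsentence1\tsentence2\n"
    "4.750\tA young child is riding a horse.\tA child is riding a horse.\n"
    "2.400\tA woman is playing the guitar.\tA man is playing guitar.\n"
)
train_set = STSPairDataset(sample)
print(len(train_set), train_set[0]["score"])
```

With a real file, the handle would come from something like `open("train.csv", newline="", encoding="utf-8")`, and `has_score=False` would cover the test file that omits the score field.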
For this task, you are required to implement three setups:


## Setup - 1A

You are required to train a BERT model (google-bert/bert-base-uncased on
Hugging Face) using HuggingFace for the task of Text Similarity. You are required to
obtain BERT embeddings while making use of the special token that BERT uses for
separating multiple sentences in an input text, together with an appropriate linear layer
or a suitable setting of the BertForSequenceClassification class for a float output. Choose a
suitable loss function. Report the required evaluation metric on the validation set.
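To make the input layout concrete, here is a small illustrative sketch of how a sentence pair is arranged around BERT's `[CLS]` and `[SEP]` tokens, along with mean squared error as one common loss for a single float output. The helper names `layout_pair` and `mse_loss` are hypothetical, and whitespace splitting merely stands in for BERT's WordPiece tokenizer; in practice the HuggingFace tokenizer builds this layout when it is given two sentences.

```python
def layout_pair(sentence1, sentence2):
    """Return (tokens, segment_ids) in BERT's sentence-pair layout."""
    # Whitespace splitting is only a stand-in for WordPiece tokenization.
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def mse_loss(predictions, targets):
    """Mean squared error between predicted and gold similarity scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


tokens, segment_ids = layout_pair(
    "A woman is playing the guitar.", "A man is playing guitar."
)
print(tokens)
print(segment_ids)
print(mse_loss([4.0, 2.0], [4.75, 2.4]))
```

Mean squared error is a natural fit here because the target is a continuous score rather than a class label; other regression losses would also be defensible.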

In [None]:
!pip install datasets

In [None]:
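import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, DataCollatorWithPadding

def load_split(path):
    # Load the tab-separated CSV file; rows with a missing sentence are dropped
    df = pd.read_csv(path, sep='\t').dropna(subset=["sentence1", "sentence2"])
    # Convert DataFrame to Dataset format compatible with datasets library
    return Dataset.from_dict({
        "sentence1": df["sentence1"].tolist(),
        "sentence2": df["sentence2"].tolist(),
        "label": df["score"].astype(float).tolist(),  # float labels for regression
    })

# Load both splits so that the "train" and "validation" keys exist.
# The dev.csv location is assumed to sit next to train.csv.
dataset = DatasetDict({
    "train": load_split('/content/train.csv'),
    "validation": load_split('/content/dev.csv'),
})

# Load tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Define function to tokenize examples as sentence pairs
# (the tokenizer inserts [SEP] between and after the two sentences)
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Tokenize every split
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Define data collator that pads each batch to its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)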


In [None]:
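from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Default training arguments; checkpoints are written to the "test-trainer" directory
training_args = TrainingArguments("test-trainer")
# num_labels=1 gives a single float output, i.e. a regression head
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=1)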

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
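
The cell above builds the trainer but does not yet compute an evaluation metric. Setup 1B below reports Pearson correlation, so the following sketch assumes the same metric here. The function names are hypothetical, and the code assumes predictions of shape (N, 1) and labels of shape (N,), which is what a single-output regression head would typically produce.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of floats."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compute_metrics(eval_pred):
    """Metric hook; assumes predictions of shape (N, 1) and labels of shape (N,)."""
    predictions, labels = eval_pred
    preds = [float(row[0]) for row in predictions]  # flatten the single output
    golds = [float(y) for y in labels]
    return {"pearson": pearson(preds, golds)}


# Toy check with nested lists standing in for model outputs.
print(compute_metrics(([[4.5], [2.0], [0.5]], [4.75, 2.4, 0.2])))
```

Assuming the tokenized data has `train` and `validation` splits, this function could be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`; calling `trainer.train()` and then `trainer.evaluate()` would report the score on the validation split.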

## Setup - 1B

You are required to make use of the Sentence-BERT model
(https://arxiv.org/pdf/1908.10084.pdf) and the SentenceTransformers framework.
For this setup, use the Sentence-BERT model to encode the sentences and compute
the cosine similarity between the two embeddings of each sentence pair in the
validation set. Report the required evaluation metric on the validation set.
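
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python; the embedding values are made up for illustration and are far shorter than real Sentence-BERT embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings" for one sentence pair.
emb1 = [0.2, 0.7, 0.1]
emb2 = [0.25, 0.6, 0.2]
print(round(cosine_similarity(emb1, emb2), 4))
```

The result lies in [-1, 1] and does not depend on vector length, which is why it can be correlated with the gold scores without further calibration.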

In [None]:
!pip install -U transformers

In [None]:
!pip install -U sentence-transformers

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_df = pd.read_csv('./A3_task1_data_files/train.csv', sep='\t')
val_df = pd.read_csv('./A3_task1_data_files/dev.csv', sep='\t')

# Rescale scores to [0, 1]; fit on the training set only and reuse it for validation
scaler = MinMaxScaler(feature_range=(0, 1))
train_df['score'] = scaler.fit_transform(train_df[['score']])
val_df['score'] = scaler.transform(val_df[['score']])

In [None]:
# train_df.head()

In [None]:
import pandas as pd
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Drop rows with a missing sentence up front so that predictions stay aligned with the scores
val_clean = val_df.dropna(subset=['sentence1', 'sentence2']).reset_index(drop=True)

# Encode each column in one batched call
sentence1_embeddings = model.encode(val_clean['sentence1'].tolist(), convert_to_tensor=True)
sentence2_embeddings = model.encode(val_clean['sentence2'].tolist(), convert_to_tensor=True)

# Calculate cosine similarity between the two embeddings of each pair
cosine_similarities = F.cosine_similarity(sentence1_embeddings, sentence2_embeddings, dim=1).cpu().tolist()

correlation_coefficient = val_clean['score'].corr(pd.Series(cosine_similarities))
print("Correlation coefficient (Pearson correlation) between predicted similarities and actual scores:", correlation_coefficient)

Correlation coefficient (Pearson correlation) between predicted similarities and actual scores: 0.6379508453621849


## Setup - 1C

In this setup, you must fine-tune the Sentence-BERT model for the task of
Semantic Textual Similarity (STS). Make use of the CosineSimilarityLoss function (see the
Losses page of the Sentence-Transformers documentation). Report the required evaluation
metric on the validation set; for reference, see the Semantic Textual Similarity page of the
Sentence-Transformers documentation. You must train for at least two epochs and surpass
the performance of Setup 1B.
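
As a rough illustration of what is being optimized, CosineSimilarityLoss by default takes the mean squared error between the cosine similarity of the two sentence embeddings and the gold label. The plain-Python sketch below mirrors that computation on made-up 2-dimensional embeddings; the function names are hypothetical, and the labels are assumed to be the example scores divided by 5 so that they fall in the same range as a cosine value.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the gold label of each pair."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Made-up embeddings: one identical pair, one orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [0.95, 0.48]  # assumed rescaling of 4.750 and 2.400 to the 0-1 range
print(cosine_similarity_loss(pairs, labels))
```

Fine-tuning moves the embeddings so that this loss shrinks, which is why scores on the original 0 to 5 scale need rescaling before training.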