## Setup - 1B

You are required to use the Sentence-BERT model
(https://arxiv.org/pdf/1908.10084.pdf) through the SentenceTransformers framework.
For this setup, encode both sentences of each validation pair with Sentence-BERT,
compute the cosine similarity between the two embeddings, and report the required
evaluation metric on the validation set.
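As a minimal sketch of the similarity measure named above, the snippet below computes cosine similarity for small toy vectors. The helper name `cosine_sim` and the three vectors are illustrative assumptions standing in for real sentence embeddings, which have hundreds of dimensions.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for sentence embeddings (hypothetical values)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a
c = [-3.0, 0.0, 1.0]   # orthogonal to a

sim_ab = cosine_sim(a, b)  # close to 1: vectors point the same way
sim_ac = cosine_sim(a, c)  # close to 0: vectors are orthogonal
print(sim_ab, sim_ac)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.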

In [None]:
!pip install -U transformers

In [None]:
!pip install -U sentence-transformers

In [None]:
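import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The files are tab-separated despite the .csv extension
train_df = pd.read_csv('./A3_task1_data_files/train.csv', sep='\t')
val_df = pd.read_csv('./A3_task1_data_files/dev.csv', sep='\t')

# Scale gold scores to [0, 1]: fit the scaler on the training set only,
# then apply the same transformation to the validation set
scaler = MinMaxScaler(feature_range=(0, 1))
train_df['score'] = scaler.fit_transform(train_df[['score']])
val_df['score'] = scaler.transform(val_df[['score']])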

In [None]:
# train_df.head()

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Encode each sentence pair and compute its cosine similarity. The gold score
# is stored alongside so both lists stay aligned when a row is skipped.
cosine_similarities = []
gold_scores = []
for _, row in val_df.iterrows():
    if pd.notnull(row['sentence1']) and pd.notnull(row['sentence2']):  # Skip rows with missing sentences
        sentence1_embedding = model.encode(row['sentence1'], convert_to_tensor=True)
        sentence2_embedding = model.encode(row['sentence2'], convert_to_tensor=True)
        # util.cos_sim accepts CPU or GPU tensors and returns a 1x1 tensor here
        cosine_similarities.append(util.cos_sim(sentence1_embedding, sentence2_embedding).item())
        gold_scores.append(row['score'])

# Pearson correlation between gold scores and predicted similarities
correlation_coefficient = pd.Series(gold_scores).corr(pd.Series(cosine_similarities))
print("Correlation coefficient (Pearson correlation) between predicted similarities and actual scores:", correlation_coefficient)

Correlation coefficient (Pearson correlation) between predicted similarities and actual scores: 0.6379508453621849


A Pearson correlation coefficient of about 0.638 indicates a moderately strong positive linear relationship between the predicted cosine similarities and the actual scores on the validation set.
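To make the metric concrete, here is a small sketch that computes the Pearson correlation by hand. The helper `pearson` and the two short lists are hypothetical illustration data, not values from the dataset. It also checks that rescaling the gold scores to [0, 1] leaves the coefficient unchanged, since Pearson correlation is invariant under positive linear rescaling.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical gold scores on a 0-5 scale and predicted cosine similarities
gold = [0.0, 1.5, 2.0, 3.5, 5.0]
pred = [0.10, 0.35, 0.30, 0.70, 0.90]

# Min-max scaling of the gold scores to [0, 1]
lo, hi = min(gold), max(gold)
gold_scaled = [(g - lo) / (hi - lo) for g in gold]

r_raw = pearson(gold, pred)
r_scaled = pearson(gold_scaled, pred)
print(r_raw, r_scaled)  # the two values agree up to floating-point error
```

In practice this means the min-max scaling of the `score` column affects only the scale of the labels, not the reported correlation.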