## Text Similarity Feature

The goal here is to check the performance of the OpenAI and Cohere embedding models on specific datasets for
textual similarity and plagiarism detection.

### Reporting results
#### STSB dataset (1500 pairs; very short sentences)
> Note: The timing is not that accurate, as it includes the time it takes to load the dataset. I made sure to restart the kernel between runs.
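
One way to keep setup work out of the measurement is to wrap only the embedding step in a timer. A minimal sketch, assuming a hypothetical `timed` helper (not part of the notebook) built on `time.perf_counter`:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Time only the wrapped block. `timed` is a hypothetical helper, not from the notebook."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the elapsed time even if the wrapped block raises
        result["seconds"] = time.perf_counter() - start

# Stand-in workload; in the notebook this would be the embedding calls only
with timed("embedding step") as t:
    total = sum(range(100_000))

print(f"{t['label']}: {t['seconds']:.4f} seconds")
```

Dataset loading, imports and client setup would then sit outside the `with` block and not count toward the reported time.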

__text-embedding-3-small__

time = 38 secs (0.63 mins); correlation = 0.88

__text-embedding-3-large__

time = 34 secs (0.57 mins); correlation = 0.88

#### MRPC dataset (408 pairs; fairly short sentences as well)
> Note: This evaluation does not include the time required to load the dataset.

After eyeballing the MRPC dataset, it is clearly a tricky one: any model that does well on it has to draw a sharp distinction between a paraphrased entry (text2 is a paraphrase of text1 that stays on topic) and a merely semantically related entry (text2 shares a good number of words with text1 but generally does not stay on topic).

##### text-embedding-3-small

The scaled similarity array has length 408, the same as the dataset, so each score corresponds to one entry in the dataset.

- time = 11.3 secs (0.19 minutes)
- mean-similarity (equivalent pair) = 0.93
- mean-similarity (non-equivalent pair) = 0.87
- separation between the two = 0.06

**classification metrics**

- precision = 0.7319
- recall = 0.9785
- f1-Score = 0.8374

**ranking metrics**

- ROC-AUC = 0.7711
- PR-AUC = 0.8755

##### text-embedding-3-large

The scaled similarity array has length 408, the same as the dataset, so each score corresponds to one entry in the dataset.

- time = 14.85 seconds (0.25 minutes)
- mean-similarity (equivalent pair) = 0.93
- mean-similarity (non-equivalent pair) = 0.87
- separation between the two = 0.06

**classification metrics**

- precision = 0.7549
- recall = 0.9606
- f1-Score = 0.8454

**ranking metrics**

- ROC-AUC = 0.7816
- PR-AUC = 0.8809
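
As a quick sanity check on the reported numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the precision and recall reported for each model. A small sketch (the `f1_from` helper name is illustrative):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the two MRPC runs
reported = {
    "text-embedding-3-small": (0.7319, 0.9785),
    "text-embedding-3-large": (0.7549, 0.9606),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {f1_from(precision, recall):.4f}")
```

Both recomputed values agree with the reported F1 scores to four decimal places (0.8374 and 0.8454).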

### Summary
- For STSB, the OpenAI small and large embedding models both achieved a correlation of `0.88` with the ground-truth labels, with no clear winner.
- For MRPC, the OpenAI small and large embedding models score about the same, so going forward there is not much benefit in testing `text-embedding-3-large`.

## Evaluate OpenAI embed model on STSB dataset

In [None]:
# @title Load STSB dataset

from datasets import load_dataset

# Load STSB validation set for a quick evaluation
dataset = load_dataset("sentence-transformers/stsb", split="validation")

# The validation set has 1500 pairs

# Get all pairs and true scores
sents1 = dataset["sentence1"]
sents2 = dataset["sentence2"]
scores = dataset["score"]  # Ground truth similarity scores (float: 0-1)

In [None]:
# @title Evaluate openai model on STSB dataset on huggingface

# Evaluate openai embedding on STSB dataset on huggingface
# Url https://huggingface.co/datasets/sentence-transformers/stsb
from openai import OpenAI
import numpy as np
from scipy.stats import spearmanr
from dotenv import load_dotenv
import os
import time

# Start timing
start_time = time.time()

# Load environment variables from .env file
load_dotenv()

# Set your OpenAI API key from .env file
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not found in .env file. Please create a .env file with OPENAI_API_KEY=your_key")

client = OpenAI(api_key=openai_api_key)

def get_openai_embedding(text, model="text-embedding-ada-002"):
    # Crude guard against over-long inputs: this truncates by characters,
    # while the model's 8191 limit is counted in tokens
    if len(text) > 8191:
        text = text[:8191]
    response = client.embeddings.create(model=model, input=[text])
    return response.data[0].embedding

def batch_get_openai_embeddings(texts, model="text-embedding-ada-002", batch_size=64):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        response = client.embeddings.create(model=model, input=batch)
        batch_embeds = [item.embedding for item in response.data]
        embeddings.extend(batch_embeds)
    return embeddings



print("Fetching embeddings for sentence1...")
embeds1 = batch_get_openai_embeddings(sents1, model="text-embedding-3-large")
print("Fetching embeddings for sentence2...")
embeds2 = batch_get_openai_embeddings(sents2, model="text-embedding-3-large")

# Cosine similarity function
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compute similarity scores
pred_similarities = [
    cosine_similarity(e1, e2) for e1, e2 in zip(embeds1, embeds2)
]
# Scale cosine similarity (-1..1) to (0..1) to match the normalized STSB scoring
pred_similarities_scaled = [(sim + 1) / 2 for sim in pred_similarities]

# Evaluate with Spearman correlation coefficient
spearman_corr, _ = spearmanr(pred_similarities_scaled, scores)
print(f"OpenAI embedding on STSB (validation): Spearman correlation = {spearman_corr:.4f}")

# End timing and print duration
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nTotal execution time: {elapsed_time:.2f} seconds ({elapsed_time/60:.2f} minutes)")

Fetching embeddings for sentence1...
Fetching embeddings for sentence2...
OpenAI embedding on STSB (validation): Spearman correlation = 0.8775

Total execution time: 33.93 seconds (0.57 minutes)
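
Because Spearman correlation depends only on ranks, the `(sim + 1) / 2` rescaling does not change the score; it only puts the similarities on the same 0 to 1 range as the labels. A dependency-free sketch with made-up numbers shows this (the helper names and toy values are illustrative; the notebook itself uses `scipy.stats.spearmanr`):

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy cosine similarities and gold scores (made up for illustration)
cosine_sims = [0.91, 0.35, 0.78, -0.10, 0.60]
gold_scores = [0.95, 0.20, 0.80, 0.40, 0.50]

scaled_sims = [(s + 1) / 2 for s in cosine_sims]

raw_corr = spearman(cosine_sims, gold_scores)
scaled_corr = spearman(scaled_sims, gold_scores)
print(raw_corr, scaled_corr)
```

The two printed values are identical, since the rescaling is strictly increasing and leaves the ranking of the pairs untouched.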


## Evaluate OpenAI embed model on MRPC dataset

In [1]:
# @title Load MRPC dataset
# url https://huggingface.co/datasets/SetFit/mrpc
from datasets import load_dataset

# The validation set has 408 pairs
dataset = load_dataset("SetFit/mrpc", split="validation")

# There is no score here only binary label (`0` and `1`)
# 1 means equivalent(SIMILAR)
# 0 means non-equivalent(DISSIMILAR)
# The two texts are labeled text1 and text2
len(dataset)

Repo card metadata block was not found. Setting CardData to empty.


408

In [2]:
# @title Evaluate openai model on MRPC dataset
# url https://huggingface.co/datasets/SetFit/mrpc

from openai import OpenAI
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score, roc_curve, precision_recall_curve
from dotenv import load_dotenv
import os
import time

# Start timing
start_time = time.time()

# Load environment variables from .env file
load_dotenv()

# Set your OpenAI API key from .env file
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not found in .env file. Please create a .env file with OPENAI_API_KEY=your_key")

client = OpenAI(api_key=openai_api_key)

def batch_get_openai_embeddings(texts, model="text-embedding-ada-002", batch_size=64):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        response = client.embeddings.create(model=model, input=batch)
        batch_embeds = [item.embedding for item in response.data]
        embeddings.extend(batch_embeds)
    return embeddings

# Get text pairs and labels
text1 = dataset["text1"]
text2 = dataset["text2"]
labels = dataset["label"]  # Binary labels: 1 = similar, 0 = dissimilar

print("Fetching embeddings for text1...")
embeds1 = batch_get_openai_embeddings(text1, model="text-embedding-3-large")
print("Fetching embeddings for text2...")
embeds2 = batch_get_openai_embeddings(text2, model="text-embedding-3-large")

# Cosine similarity function
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compute similarity scores (range: -1 to 1)
pred_similarities = np.array([
    cosine_similarity(e1, e2) for e1, e2 in zip(embeds1, embeds2)
])

# Scale cosine similarity (-1..1) to (0..1) for easier threshold interpretation
pred_similarities_scaled = (pred_similarities + 1) / 2

# Find optimal threshold that maximizes F1-score
best_threshold = 0.5
best_f1 = 0
thresholds = np.arange(0, 1.01, 0.01)

for threshold in thresholds:
    pred_labels = (pred_similarities_scaled >= threshold).astype(int)
    f1 = f1_score(labels, pred_labels)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

# Predictions using optimal threshold
pred_labels = (pred_similarities_scaled >= best_threshold).astype(int)

# Calculate classification metrics
accuracy = accuracy_score(labels, pred_labels)
precision = precision_score(labels, pred_labels)
recall = recall_score(labels, pred_labels)
f1 = f1_score(labels, pred_labels)

# Calculate ROC-AUC and PR-AUC (using scaled similarities as probabilities)
roc_auc = roc_auc_score(labels, pred_similarities_scaled)
pr_auc = average_precision_score(labels, pred_similarities_scaled)

# Calculate mean similarity scores for equivalent and non-equivalent pairs
labels_array = np.array(labels)
mean_sim_equivalent = np.mean(pred_similarities_scaled[labels_array == 1])
mean_sim_non_equivalent = np.mean(pred_similarities_scaled[labels_array == 0])
separation = mean_sim_equivalent - mean_sim_non_equivalent

print(f"\n{'='*60}")
print(f"OpenAI embedding on MRPC (validation) - Binary Classification")
print(f"{'='*60}")
print(f"Optimal threshold: {best_threshold:.3f}")
print(f"\nLength of scaled pred similarities: {len(pred_similarities_scaled)}")
print(f"\nSimilarity Score Statistics:")
print(f"  Mean similarity (Equivalent pairs):     {mean_sim_equivalent:.4f}")
print(f"  Mean similarity (Non-equivalent pairs): {mean_sim_non_equivalent:.4f}")
print(f"  Separation (Difference):                {separation:.4f}")
print(f"\nClassification Metrics:")
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1-Score:  {f1:.4f}")
print(f"\nRanking Metrics:")
print(f"  ROC-AUC:   {roc_auc:.4f}")
print(f"  PR-AUC:    {pr_auc:.4f}")
print(f"{'='*60}")

# End timing and print duration
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nTotal execution time: {elapsed_time:.2f} seconds ({elapsed_time/60:.2f} minutes)")


Fetching embeddings for text1...
Fetching embeddings for text2...

OpenAI embedding on MRPC (validation) - Binary Classification
Optimal threshold: 0.840

Length of scaled pred similarities: 408

Similarity Score Statistics:
  Mean similarity (Equivalent pairs):     0.9302
  Mean similarity (Non-equivalent pairs): 0.8718
  Separation (Difference):                0.0584

Classification Metrics:
  Accuracy:  0.7598
  Precision: 0.7549
  Recall:    0.9606
  F1-Score:  0.8454

Ranking Metrics:
  ROC-AUC:   0.7816
  PR-AUC:    0.8809

Total execution time: 14.85 seconds (0.25 minutes)
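
The threshold in this run is chosen on the same 408 pairs that the metrics are reported on, so the F1 score may be somewhat optimistic. One common alternative, sketched here under the assumption that the data can be split into a tuning part and an evaluation part, is to pick the threshold on one part and score on the other. The helper names and toy numbers below are illustrative only:

```python
def f1_at(scores, labels, threshold):
    """F1 for the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pick_threshold(scores, labels, steps=100):
    """Sweep thresholds 0.00..1.00 and keep the first one with the best F1."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: f1_at(scores, labels, t))

# Made-up scaled similarities and labels, split into tuning and evaluation parts
tune_scores, tune_labels = [0.95, 0.91, 0.88, 0.86, 0.80], [1, 1, 0, 1, 0]
eval_scores, eval_labels = [0.97, 0.90, 0.84, 0.79], [1, 1, 0, 0]

threshold = pick_threshold(tune_scores, tune_labels)
held_out_f1 = f1_at(eval_scores, eval_labels, threshold)
print(f"threshold={threshold:.2f}, held-out F1={held_out_f1:.4f}")
```

The ranking metrics (ROC-AUC and PR-AUC) do not depend on a threshold, so they are unaffected by this choice.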


## Evaluate Cohere embed model on STSB dataset

In [None]:
# Evaluate Cohere embedding on STSB dataset (validation split)

import os
import time

import numpy as np
import requests
from scipy.stats import spearmanr
from tqdm.notebook import tqdm

# Start timing
start_time = time.time()

COHERE_API_KEY = os.getenv("COHERE_API_KEY")
assert COHERE_API_KEY is not None, "Set COHERE_API_KEY in your environment."

# Models to try 'embed-english-v3.0', 'embed-english-light-v3.0'.
def batch_get_cohere_embeddings(texts, model="embed-english-v3.0", batch_size=32):
    endpoint = "https://api.cohere.ai/v2/embed"
    headers = {"Authorization": f"Bearer {COHERE_API_KEY}", "Content-Type": "application/json"}
    embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Cohere Embeddings"):
        batch_texts = texts[i:i+batch_size]
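        data = {
            "texts": batch_texts,
            "model": model,
            "input_type": "classification",
            # The v2 endpoint returns embeddings keyed by type, so request floats explicitly
            "embedding_types": ["float"]
        }
        response = requests.post(endpoint, headers=headers, json=data)
        response.raise_for_status()
        batch_embeds = response.json()["embeddings"]["float"]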
        embeddings.extend(batch_embeds)
    return embeddings

print("Fetching Cohere embeddings for sentence1...")
cohere_embeds1 = batch_get_cohere_embeddings(sents1)
print("Fetching Cohere embeddings for sentence2...")
cohere_embeds2 = batch_get_cohere_embeddings(sents2)

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cohere_pred_similarities = [
    cosine_similarity(e1, e2) for e1, e2 in zip(cohere_embeds1, cohere_embeds2)
]
cohere_pred_similarities_scaled = [(sim + 1) / 2 for sim in cohere_pred_similarities]

cohere_spearman_corr, _ = spearmanr(cohere_pred_similarities_scaled, scores)
print(f"Cohere embedding on STSB (validation): Spearman correlation = {cohere_spearman_corr:.4f}")

# End timing and print duration
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nTotal execution time: {elapsed_time:.2f} seconds ({elapsed_time/60:.2f} minutes)")
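
Both `batch_get_openai_embeddings` and `batch_get_cohere_embeddings` slice the input into fixed-size batches, so the number of API requests per call follows directly from the dataset size and `batch_size`. A small sketch with a hypothetical `iter_batches` helper (not in the notebook) shows the slicing and the resulting request counts:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder sentences standing in for one column of the 1500 STSB validation pairs
sentences = [f"sentence {i}" for i in range(1500)]

openai_batches = list(iter_batches(sentences, 64))  # batch size used for OpenAI
cohere_batches = list(iter_batches(sentences, 32))  # batch size used for Cohere

print(len(openai_batches), len(cohere_batches))
print(len(openai_batches[-1]), len(cohere_batches[-1]))
```

With 1500 sentences this gives 24 requests at `batch_size=64` and 47 at `batch_size=32` for each sentence column, and the final batch holds the remaining 28 sentences in both cases.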