# Embedding Comparison on ISL_CLSRT Dataset

This notebook demonstrates how to generate sentence embeddings from cleaned gloss sentences and compute pairwise cosine similarity using the sentence-transformers model `all-MiniLM-L6-v2`.


In [None]:
# !pip install sentence-transformers scikit-learn pandas matplotlib seaborn


In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Load pre-cleaned ISL_CLSRT data
df = pd.read_csv('isl_train_meta_cleaned.csv')
df[['Sentences', 'cleaned_gloss']].head()


## Step 1: Initialize Sentence Transformer Model

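We use `all-MiniLM-L6-v2`, a lightweight sentence-transformers model, to extract embeddings. A smaller model such as `bert-tiny` is a possible alternative, but the code in this notebook loads `all-MiniLM-L6-v2`.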


In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


## Step 2: Generate Sentence Embeddings


In [None]:
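# Replace any missing glosses with empty strings so encode() receives only str values
# and the list stays aligned with the rows of df
sentences = df['cleaned_gloss'].fillna('').astype(str).tolist()
embeddings = model.encode(sentences, show_progress_bar=True)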


## Step 3: Compute Pairwise Cosine Similarity


In [None]:
similarity_matrix = cosine_similarity(embeddings)
similarity_df = pd.DataFrame(similarity_matrix, index=df['cleaned_gloss'], columns=df['cleaned_gloss'])
similarity_df.iloc[:5, :5]
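
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The following is a minimal sketch of that computation with NumPy on made-up 2-D vectors; `cosine_sim_matrix` and `toy` are illustrative names, not part of the notebook's pipeline, and the sketch assumes no vector is all zeros.

```python
import numpy as np

def cosine_sim_matrix(vectors):
    """Pairwise cosine similarity: scale each row to unit length, then take dot products."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # assumption: no all-zero rows, so no division by zero
    return unit @ unit.T

# Made-up toy vectors standing in for sentence embeddings
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
toy_sim = cosine_sim_matrix(toy)
print(np.round(toy_sim, 3))
```

The result is symmetric with ones on the diagonal, and orthogonal vectors (the first and third rows here) score 0.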


## Step 4: Visualize Similarity Matrix (Heatmap)


In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(similarity_df.iloc[:10, :10], annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Cosine Similarity Heatmap (First 10 Glosses)")
plt.show()


### Summary
This notebook demonstrated how to compute semantic similarity between gloss sentences using sentence embeddings. Such techniques are useful for duplicate detection, glossary alignment, and synonym discovery in sign language corpora.
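
As one possible follow-up for duplicate detection, the sketch below lists pairs whose similarity meets a threshold, reading only the upper triangle so each pair appears once and self-matches are skipped. The helper `top_similar_pairs`, the threshold of 0.9, and the toy labels and scores are all hypothetical; a suitable threshold for this dataset would need to be chosen by inspection.

```python
import numpy as np

def top_similar_pairs(sim, labels, threshold=0.9):
    """Return (label_i, label_j, score) for i < j with score >= threshold, highest first."""
    sim = np.asarray(sim, dtype=float)
    rows, cols = np.triu_indices_from(sim, k=1)  # strictly above the diagonal
    keep = sim[rows, cols] >= threshold
    pairs = [(labels[i], labels[j], float(sim[i, j]))
             for i, j in zip(rows[keep], cols[keep])]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up similarity scores for three placeholder glosses
toy_labels = ["gloss_a", "gloss_b", "gloss_c"]
toy_matrix = np.array([[1.00, 0.95, 0.10],
                       [0.95, 1.00, 0.20],
                       [0.10, 0.20, 1.00]])
candidates = top_similar_pairs(toy_matrix, toy_labels, threshold=0.9)
print(candidates)
```

With the notebook's variables, the call would take the form `top_similar_pairs(similarity_matrix, sentences)`.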
