In [1]:
!jupyter nbextension disable --py widgetsnbextension

Disabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


# **Sentence Similarity Experiment Using Embeddings**

This notebook compares any two sentences and reports a semantic similarity score: the cosine similarity between their pre-trained sentence embeddings, expressed as a percentage. The higher the score, the closer the two sentences are in meaning.


In [2]:
# Install the sentence-transformers library
# - Provides pre-trained models to generate embeddings for sentences, paragraphs, or documents.
# - Embeddings are numerical vector representations of text, capturing semantic meaning.

!pip install -q sentence-transformers

In [3]:
# Import required libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [4]:
# Load a pre-trained sentence transformer model
# 'all-MiniLM-L6-v2' is lightweight, fast, and effective for semantic similarity tasks.
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
def compare_sentence_similarity(embed_model, sent1, sent2):
    """
    Compute the semantic similarity between two sentences using embeddings.

    Args:
        embed_model (SentenceTransformer): Pre-trained sentence transformer model.
        sent1 (str): First sentence to compare.
        sent2 (str): Second sentence to compare.

    Returns:
        float: Cosine similarity expressed as a percentage. Cosine similarity
            ranges from -1 to 1, so the result lies between -100 and 100.

    Steps:
        1. Encode each sentence into embeddings.
        2. Compute cosine similarity between embeddings.
        3. Convert the similarity score to a percentage.
    """
    # Step 1: Generate embeddings for both sentences.
    # Each sentence is passed as a one-element list, so each result is a 2D array
    # of shape (1, embedding_dim), which is the input shape cosine_similarity expects.
    emb1 = embed_model.encode([sent1])
    emb2 = embed_model.encode([sent2])
    # Step 2: Compute cosine similarity; the result is a 1x1 matrix, so take its only entry
    sim_score = cosine_similarity(emb1, emb2)[0][0]
    # Step 3: Convert similarity to a percentage (cast the NumPy scalar to a plain float)
    similarity_percentage = float(sim_score * 100)
    return similarity_percentage
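
To make step 2 concrete, here is a minimal sketch of the cosine similarity calculation itself, written in plain Python rather than with scikit-learn. The helper name `cosine_similarity_1d` and the 3-dimensional toy vectors are assumptions made for illustration only; real embeddings from 'all-MiniLM-L6-v2' have 384 dimensions.

```python
import math


def cosine_similarity_1d(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (illustrative values only).
same_direction = cosine_similarity_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity_1d([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(f"same direction: {same_direction * 100:.2f}%")
print(f"orthogonal:     {orthogonal * 100:.2f}%")
print(f"opposite:       {opposite * 100:.2f}%")
```

The score depends only on the angle between the vectors, not on their lengths, which is why scaling a vector leaves it unchanged. Vectors pointing in opposite directions give a negative score, so the percentage is not strictly limited to 0 to 100.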


In [6]:
def interpret_similarity_score(sim_percentage):
    """
    Interpret a similarity score as a human-readable message.

    Args:
        sim_percentage (float): Similarity score as a percentage.

    Returns:
        str: Message describing how similar the sentences are.

    Thresholds:
        > 95%             : Highly similar
        > 70% and <= 95%  : Somewhat similar
        <= 70%            : Not similar
    """
    if sim_percentage > 95:
        return "The sentences are highly similar in meaning."
    elif sim_percentage > 70:
        return "The sentences are somewhat similar."
    else:
        return "The sentences are not similar."
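
The comparisons above are strict (`>`), so a score that lands exactly on a threshold falls into the lower band. The sketch below illustrates this with a compact stand-in, `label_for`, which is a hypothetical helper written for this example and not part of the notebook; its default cut-offs of 95 and 70 mirror the thresholds used above.

```python
def label_for(sim_percentage, high=95.0, low=70.0):
    """Return a short similarity label using strict (exclusive) thresholds."""
    if sim_percentage > high:
        return "highly similar"
    if sim_percentage > low:
        return "somewhat similar"
    return "not similar"


# Boundary and out-of-range cases (illustrative inputs only).
for score in (95.01, 95.0, 70.0, 46.06, -5.0):
    print(f"{score:6.2f}% -> {label_for(score)}")
```

Negative scores, which cosine similarity can produce, also end up in the "not similar" band, so no special handling is needed for them.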


In [7]:
# Interactive user input
# Prompt the user to enter two sentences for comparison
sentence1 = input("Enter the first sentence: ")
sentence2 = input("Enter the second sentence: ")

# Compute similarity score
similarity_score = compare_sentence_similarity(embed_model, sentence1, sentence2)
# Interpret the similarity score
are_they_similar = interpret_similarity_score(similarity_score)
# Display results
print("Result = ")
print(are_they_similar)
print("Similarity Score = {:.2f}%".format(similarity_score))


Enter the first sentence: The dog is happy
Enter the second sentence: The cat is sad
Result = 
The sentences are not similar.
Similarity Score = 46.06%
