# Text Similarity Checker: Cosine Similarity and Embeddings

## Introduction
**Text similarity** is a core concept in Natural Language Processing (NLP): determining how semantically alike two pieces of text are. It underpins tasks such as semantic search, plagiarism detection, recommendation systems, and grouping similar documents.

Instead of relying on simple word overlap, modern approaches leverage **text embeddings**. These are dense vector representations of words, phrases, or entire sentences that capture their meaning and context. Texts with similar meanings will have embeddings that are 'close' to each other in a multi-dimensional space.

To measure this 'closeness', we often use **Cosine Similarity**, which calculates the cosine of the angle between two non-zero vectors. A cosine similarity of 1 indicates identical direction (most similar), 0 indicates orthogonality (no similarity), and -1 indicates opposite directions (most dissimilar).
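
As a minimal sketch of the three reference values described above, the following computes cosine similarity by hand on toy 2-D vectors. The function name `cosine_similarity` and the vectors are illustrative assumptions, not part of the assignment's starter code, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy vectors chosen to hit the three reference cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # approximately 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # approximately -1

print(same_direction, orthogonal, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` differ in length but not in direction, so they score as maximally similar: cosine similarity ignores vector magnitude.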

In this assignment, you'll use pre-trained models from the `sentence-transformers` library to generate these embeddings and then apply cosine similarity to build a text similarity checker.

---

## Learning Objectives
Upon completion of this assignment, you should be able to:
- Understand the concept of text embeddings and their role in semantic similarity.
- Load a pre-trained sentence embedding model using `sentence-transformers`.
- Generate embeddings for individual sentences and batches of sentences.
- Calculate cosine similarity between text embeddings.
- Implement a function to calculate text similarity between any two given texts.
- Find top-K most similar sentences from a collection.
- Discuss the advantages, applications, and limitations of embedding-based text similarity.

---

## Setup and Prerequisites
Ensure you have the necessary libraries installed. If not, uncomment and run the following command:

```bash
# pip install sentence-transformers torch
```

---

```python
import torch
import sentence_transformers  # imported as a module so __version__ is accessible
from sentence_transformers import SentenceTransformer, util

print(f"PyTorch Version: {torch.__version__}")
print(f"Sentence-Transformers Version: {sentence_transformers.__version__}")

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Sample Sentences for Testing ---
sample_sentences = [
    "The cat sat on the mat.",
    "A feline was resting on the rug.",
    "The dog barked loudly at the mailman.",
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is a rapidly evolving field.",
    "AI is transforming various industries.",
    "The capital of France is Paris.",
    "What is the largest city in France?",
    "I love to eat pizza.",
    "I enjoy consuming delicious pepperoni slices.",
    "Space exploration fascinates me.",
    "Astronauts are preparing for a mission to Mars."
]

print("\nSample sentences loaded. Total sentences:", len(sample_sentences))
print("\nFirst sample sentence:\n", sample_sentences[0])
```

---

## Assignment Questions

---

### Question 1: Load Sentence Embedding Model
The `SentenceTransformer` class makes it easy to load pre-trained models specifically designed for generating sentence-level embeddings. A good general-purpose model is `'all-MiniLM-L6-v2'`.

1.  **Load Model:** Load the pre-trained `SentenceTransformer` model `'all-MiniLM-L6-v2'`. Move the model to your `device`.
2.  **Inspect:** Print the type of the loaded model and confirm it's on the correct device. You can also print the model's structure (e.g., `model` itself) to see its components.

---

### Question 2: Generate Sentence Embeddings
Once the model is loaded, you can generate embeddings for individual sentences or lists of sentences.

1.  **Select Sentences:** Choose `sample_sentences[0]` and `sample_sentences[1]`.
2.  **Encode:** Generate embeddings for each of these two sentences using `model.encode()`. Pass `convert_to_tensor=True` so that it returns PyTorch tensors rather than NumPy arrays (its default), and confirm the tensors are on your `device`.
3.  **Print Shape:** Print the shape of each generated embedding. Explain what the dimensions represent (e.g., `(384,)` means a single sentence is embedded into a 384-dimensional vector, which is the output size of `all-MiniLM-L6-v2`).

---

### Question 3: Calculate Cosine Similarity
The `util.cos_sim` function from `sentence_transformers` or `torch.nn.functional.cosine_similarity` can be used to calculate the similarity between two embedding vectors.

1.  **Calculate Similarity:** Calculate the cosine similarity between the two embeddings you generated in Question 2.
2.  **Print Score:** Print the similarity score.
3.  **Interpretation:** Based on the sentences and the score, explain what a high score (close to 1) and a low score (close to 0 or negative) signifies in terms of text similarity.
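
One detail worth keeping in mind when choosing between the two functions: `util.cos_sim` returns a matrix of all pairwise scores (entry `[i][j]` compares row `i` of the first input with row `j` of the second), so comparing a single pair yields a 1×1 result from which the scalar still has to be extracted, whereas `torch.nn.functional.cosine_similarity` compares matching positions along one dimension (for two 1-D tensors, `dim=0`). The sketch below mimics the all-pairs layout in plain Python on toy vectors; `pairwise_cosine` is a hypothetical helper written for illustration, not a library function.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_cosine(rows_a, rows_b):
    """Hypothetical helper: result[i][j] is the cosine similarity of rows_a[i] and rows_b[j]."""
    return [[cosine(u, v) for v in rows_b] for u in rows_a]


# Toy "embeddings": 2 vectors on one side, 3 on the other.
toy_a = [[1.0, 0.0], [1.0, 1.0]]
toy_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

matrix = pairwise_cosine(toy_a, toy_b)
print(len(matrix), len(matrix[0]))  # number of rows, number of columns

# A single pair still comes back as a 1x1 matrix; index into it for the scalar.
single = pairwise_cosine([toy_a[0]], [toy_b[0]])
score = single[0][0]
print(score)
```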

---

### Question 4: Build a Text Similarity Function
Encapsulate the embedding generation and cosine similarity calculation into a reusable function.

1.  **Create Function:** Define a Python function `get_text_similarity(text1, text2, model)` that:
    * Takes two strings (`text1`, `text2`) and the `SentenceTransformer` model as input.
    * Encodes both texts into embeddings.
    * Calculates their cosine similarity.
    * Returns the similarity score.
2.  **Test Cases:** Test your function with the following pairs from `sample_sentences` and print their similarities:
    * `sample_sentences[4]` (AI) and `sample_sentences[5]` (AI transforming industries)
    * `sample_sentences[2]` (dog barked) and `sample_sentences[3]` (fox jumps)
    * `sample_sentences[8]` (love pizza) and `sample_sentences[9]` (enjoy pepperoni)
    * `sample_sentences[6]` (Paris capital) and `sample_sentences[7]` (largest city in France)

---

### Question 5: Top-K Similar Sentences
A common application is to find the most similar documents/sentences to a given query from a larger collection.

1.  **Define Query:** Choose a query sentence, e.g., `"I want to learn about space."`
2.  **Encode All Sentences:** Encode the query sentence and *all* `sample_sentences` from the initial list into embeddings.
3.  **Calculate All Similarities:** Calculate the cosine similarity between the query embedding and the embedding of *each* sentence in the `sample_sentences` list.
4.  **Find Top-K:** Identify the top 3 most similar sentences to the query. Sort them by similarity score in descending order.
5.  **Print Results:** Print the query sentence, and then the top 3 similar sentences along with their respective similarity scores.
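
The ranking in step 4 can be sketched independently of the model. The example below assumes the similarity scores have already been computed and uses made-up numbers in place of real model output; `top_k`, `toy_sentences`, and `toy_scores` are hypothetical names introduced only for this illustration. When the scores are held in a PyTorch tensor, `torch.topk` is one way to get the same result.

```python
import heapq


def top_k(scores, sentences, k=3):
    """Hypothetical helper: return the k (score, sentence) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, sentences), key=lambda pair: pair[0])


toy_sentences = ["space", "pizza", "mars", "cats"]
toy_scores = [0.62, 0.05, 0.48, 0.11]  # made-up numbers, not model output

# Print the three best matches in descending order of score.
for score, sentence in top_k(toy_scores, toy_sentences, k=3):
    print(f"{score:.2f}  {sentence}")
```

Keeping each score paired with its sentence during the sort avoids a separate index lookup afterwards and makes it harder to mismatch scores and texts when printing.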

---

### Question 6: Discussion and Applications
1.  **Embeddings vs. Traditional Methods:** What are the key advantages of using sentence embeddings and cosine similarity for text similarity compared to traditional methods like TF-IDF + Cosine Similarity or Bag-of-Words based approaches? (Think about semantic understanding).
2.  **Real-world Applications:** Describe three distinct real-world applications (different from those mentioned in the introduction) where an embedding-based text similarity checker would be highly beneficial. Explain *how* it would be used in each scenario.
3.  **Limitations/Challenges:** What are some potential limitations or challenges of this embedding-based text similarity approach? (Consider aspects like sarcasm, ambiguity, very specific domain knowledge, or the computational cost for extremely large datasets).

---

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_text_similarity_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested in Markdown cells.
- Feel free to add additional code cells or markdown cells for clarity or experimentation.

---