# Embedding Model Task
In this notebook, you will work with embedding models to convert textual information into vector representations. Using these ideas, you will perform a similarity search task, where you will identify the most similar sentences to a certain target sentence. Finally, you will consider the results of this process in the context of the semantic meaning of the sentences. Refer to [the README](README.md) for more detailed guidance on how to approach this task.

## Task 1: Defining the Data
In this task, your target sentence will be:

>***A polar bear's fur is actually transparent, and not white (as is commonly believed).***

A list of miscellaneous sentences is provided in the `data.txt` file - this will serve as our dataset of texts from which we wish to identify the closest match to the above sentence. Start by loading the data from this file and splitting it into individual sentences.

In [4]:
# Task 1: Load and lightly clean the corpus so downstream models work on
# a predictable list of sentences. Minimal preprocessing keeps punctuation
# intact, which usually helps transformer embeddings.

from pathlib import Path

# Target statement supplied in the brief
target = "A polar bear's fur is actually transparent, and not white (as is commonly believed)."

# Read the source file once so later cells can reuse the in-memory list
repo_root = Path.cwd()
data_path = repo_root / "data.txt"
raw_text = data_path.read_text(encoding="utf-8").strip()

# Split on ". " to keep sentence-level granularity without removing
# useful characters such as apostrophes. Strip any full stop a fragment
# still carries (the final sentence keeps its own), then reattach exactly
# one so the text the model sees still looks like a sentence.
sentences = [s.strip().rstrip(".") + "." for s in raw_text.split(". ") if s.strip()]

# Remove the target sentence from the candidate pool to prevent a trivial match
sentences = [s for s in sentences if s != target]

print(f"Loaded {len(sentences)} candidate sentences")
print("Sample:")
for s in sentences[:5]:
    print("-", s)


Loaded 101 candidate sentences
Sample:
- The octopus has three hearts and blue blood.
- Honey is a food that technically never spoils.
- Some cats can actually be allergic to humans.
- Lightning strikes the Earth continuously, multiple times every second.
- A group of flamingos is called a "flamboyance".
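
The split on `". "` in the loading cell assumes every sentence ends with a full stop. As a hedged alternative, the sketch below splits after `.`, `!` or `?` using a regular expression. The helper name `split_sentences` and the sample text are hypothetical, and the pattern would still split incorrectly inside abbreviations such as "e.g. this".

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the closing punctuation attached to each
    # sentence, so nothing has to be re-appended afterwards.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


# Hypothetical sample text, not taken from data.txt
sample = "Honey never spoils. Can cats be allergic to humans? Yes! Some can."
for sentence in split_sentences(sample):
    print("-", sentence)
```

Because the punctuation is never removed, the final sentence cannot end up with a doubled full stop.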


## Task 2: Embedding the Sentences
With the sentences loaded, we can now produce vector representations of these using text embedding models. Examples of such embedding models can be found on the HuggingFace website as [feature extraction models](https://huggingface.co/models?pipeline_tag=feature-extraction&sort=downloads&search=embed), [sentence similarity models](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads) or on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) (Massive Text Embedding Benchmark).

To this end, your task in this part is to choose an appropriate model, and use this to produce vector embeddings for both the list of sentences from `data.txt` and the target sentence.

In [5]:
# Task 2: Create sentence embeddings so the similarity step operates on
# fixed-length vectors produced by a strong semantic model.

from sentence_transformers import SentenceTransformer
import torch

model_name = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(model_name)

# Encode corpus sentences in one batch for efficiency
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

# Encode the target sentence separately so we can compare it to each row
target_embedding = model.encode(target, convert_to_tensor=True)

print("Embeddings ready:")
print(" - Sentences:", sentence_embeddings.shape)
print(" - Target:", target_embedding.shape)





Embeddings ready:
 - Sentences: torch.Size([101, 768])
 - Target: torch.Size([768])


### Model choice

I used `sentence-transformers/all-mpnet-base-v2` because it performs strongly on semantic textual similarity benchmarks (STS) and produces stable sentence-level embeddings without extra pooling logic. It is a widely adopted baseline for retrieval-style tasks like this, so the results are easier to interpret and defend.
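
For context on what "extra pooling logic" would involve: a transformer produces one vector per token, and these must be combined into a single sentence vector. The sketch below is a minimal NumPy illustration, assuming mean pooling over non-padded tokens (a strategy commonly used by sentence-transformers models). The function name `mean_pool` and the toy arrays are hypothetical and are not the library's internal code.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, dim)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts


# Toy input: 2 sentences, up to 3 tokens each, 4-dimensional token vectors
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sentence has one padded token
pooled = mean_pool(tokens, mask)
print(pooled.shape)
print(pooled)
```

`SentenceTransformer.encode` returns one fixed-length vector per sentence directly, which is why the embedding cell needs no step like this.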


## Task 3: Calculating Sentence Similarity
With these vector embeddings, you are now able to determine which sentences are the most similar. To do this, you will need to choose an appropriate distance metric to quantify how similar two sentences are.

Your task is now to choose an appropriate distance metric, and calculate this between the target sentence and every sentence from `data.txt`. Using these distances, you should then display the five most similar sentences to the target, along with their calculated distance scores. Please explicitly state which distance metric you choose to use in the markdown block after the next cell.

In [6]:
# Task 3: Score every sentence against the target to inspect the
# most similar statements.

import numpy as np
from torch.nn.functional import cosine_similarity

sim_scores = cosine_similarity(
    target_embedding.unsqueeze(0),
    sentence_embeddings,
    dim=1,
)

top_indices = np.argsort(-sim_scores.detach().cpu().numpy())

top_k = 5
print(f"Top {top_k} most similar sentences to the target:\n")
for rank, idx in enumerate(top_indices[:top_k], start=1):
    sentence = sentences[idx]
    score = float(sim_scores[idx])
    print(f"{rank}. Similarity: {score:.4f}")
    print(f"   Sentence: {sentence}\n")



Top 5 most similar sentences to the target:

1. Similarity: 0.9337
   Sentence: The fur of a polar bear is transparent, not white.

2. Similarity: 0.7957
   Sentence: Polar bears are renowned for their white fur.

3. Similarity: 0.6820
   Sentence: A polar bear's skin is black underneath its fur.

4. Similarity: 0.5394
   Sentence: Grizzly bears are carnivorous mammals with brown fur.

5. Similarity: 0.5138
   Sentence: Fish are a type of animal that do not have fur.



### Why cosine similarity

Cosine similarity compares the direction of two embedding vectors rather than their magnitude. Direction encodes meaning for sentence transformers, so cosine gives a stable, scale-independent measure of semantic closeness. Euclidean distance, by contrast, is sensitive to vector length and is less aligned with how these models are trained.
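
The scale argument can be checked on a toy example. The sketch below uses hypothetical three-dimensional vectors rather than real embeddings: lengthening one vector without changing its direction leaves the cosine similarity unchanged while the Euclidean distance grows.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between a and b."""
    return float(np.linalg.norm(a - b))


# Hypothetical toy vectors, not model outputs
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.0])
scaled_b = 10.0 * b  # same direction as b, ten times longer

print("cosine:   ", cosine_sim(a, b), cosine_sim(a, scaled_b))
print("euclidean:", euclidean_dist(a, b), euclidean_dist(a, scaled_b))
```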

## Task 4: Explaining Results
Another researcher has performed a similar task to that above, and they obtain the following results using their own choice of model and distance metric (where higher scores represent more similar sentences):

|Sentence|Similarity Score|
|--------|--------|
|The fur of a polar bear is transparent, not white.|0.91|
|Polar bears are renowned for their white fur.|0.78|
|A polar bear's skin is black underneath its fur.|0.74|
|Fish are a type of animal that do not have fur.|0.68|
|Grizzly bears are carnivorous mammals with brown fur.|0.66|

Is there anything interesting you notice about these sentences, in terms of the semantic meaning of these sentences compared with the target? Specifically, are you surprised by any of these results having a high similarity score? What does this tell you about the suitability of text embeddings for fact checking? Please put your answer in the markdown cell below.

### Interpretation

The scores follow an expected pattern. Direct paraphrases of the target sentence rank highest because the embedding model rewards close semantic overlap. Sentences about polar bears, fur colour, or related physical traits also score well because the topic and vocabulary are similar, even when the factual claims differ (e.g. “polar bears have white fur”).

One interesting behaviour is that sentences about different species (“Grizzly bears…”) still receive moderate scores. Transformer embeddings often anchor on shared context tokens such as “bear” or “fur”, so topical similarity can outweigh factual mismatch. That is useful for retrieval, but it means these models should not be used alone for fact checking: additional verification is needed to test whether the statements agree with the target claim.
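
One way to picture "additional verification" is a two-stage flow: similarity shortlists candidates, then a separate checker decides whether each one agrees with the claim. The sketch below reuses the top-five scores printed by the similarity cell. The threshold of 0.75, the function names, and `placeholder_judge` are all hypothetical; a real pipeline would put an entailment model or a human reviewer in that slot.

```python
from typing import Callable, List, Tuple

# Scores copied from the top-five output of the similarity cell
ranked: List[Tuple[str, float]] = [
    ("The fur of a polar bear is transparent, not white.", 0.9337),
    ("Polar bears are renowned for their white fur.", 0.7957),
    ("A polar bear's skin is black underneath its fur.", 0.6820),
    ("Grizzly bears are carnivorous mammals with brown fur.", 0.5394),
    ("Fish are a type of animal that do not have fur.", 0.5138),
]


def retrieve(candidates: List[Tuple[str, float]], threshold: float) -> List[Tuple[str, float]]:
    """Stage 1: keep candidates whose similarity clears the threshold."""
    return [(s, score) for s, score in candidates if score >= threshold]


def verify(candidates: List[Tuple[str, float]], judge: Callable[[str], str]) -> List[Tuple[str, float, str]]:
    """Stage 2: attach a verdict from a separate checker to each candidate."""
    return [(s, score, judge(s)) for s, score in candidates]


def placeholder_judge(sentence: str) -> str:
    # Hypothetical stand-in: similarity alone cannot supply this label,
    # so every candidate stays "unverified" until a real checker runs.
    return "unverified"


shortlist = retrieve(ranked, threshold=0.75)  # arbitrary illustrative cut-off
results = verify(shortlist, placeholder_judge)
for sentence, score, verdict in results:
    print(f"{score:.4f}  {verdict:<10}  {sentence}")
```

With this cut-off, the shortlist contains both the paraphrase and the "renowned for their white fur" sentence, which is exactly the case where a similarity score cannot tell agreement from disagreement.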

---

### Notes

**AI tools:** I used an AI assistant for wording and structural polishing, but all modelling choices, code, and interpretations are my own and have been checked by me.


In [7]:
# Benchmarking: compare multiple embedding models to show the ranking is not
# specific to a single architecture.

alternative_models = [
    "sentence-transformers/all-mpnet-base-v2",      # chosen main model
    "sentence-transformers/all-MiniLM-L6-v2",      # smaller, faster encoder
    "sentence-transformers/paraphrase-MiniLM-L6-v2" # paraphrase-oriented model
]

# Looks at the top 3 sentences for each model to keep the
# output readable in the notebook.
comparison_top_k = 3

for name in alternative_models:
    print("=" * 80)
    print(f"Model: {name}")
    
    # Load the model for this comparison
    model_cmp = SentenceTransformer(name)
    tgt_emb = model_cmp.encode(target, convert_to_tensor=True)
    sent_embs = model_cmp.encode(sentences, convert_to_tensor=True)
    
    scores = cosine_similarity(tgt_emb.unsqueeze(0), sent_embs, dim=1)
    scores_np = scores.detach().cpu().numpy()
    sorted_idx = np.argsort(-scores_np)
    
    print(f"Top {comparison_top_k} most similar sentences:\n")
    for rank, idx in enumerate(sorted_idx[:comparison_top_k], start=1):
        sent = sentences[idx]
        score = float(scores_np[idx])
        print(f"{rank}. {sent} (score={score:.4f})")
    print()



Model: sentence-transformers/all-mpnet-base-v2
Top 3 most similar sentences:

1. The fur of a polar bear is transparent, not white. (score=0.9337)
2. Polar bears are renowned for their white fur. (score=0.7957)
3. A polar bear's skin is black underneath its fur. (score=0.6820)

Model: sentence-transformers/all-MiniLM-L6-v2
Top 3 most similar sentences:

1. The fur of a polar bear is transparent, not white. (score=0.9799)
2. Polar bears are renowned for their white fur. (score=0.8228)
3. A polar bear's skin is black underneath its fur. (score=0.7252)

Model: sentence-transformers/paraphrase-MiniLM-L6-v2
Top 3 most similar sentences:

1. The fur of a polar bear is transparent, not white. (score=0.9280)
2. A polar bear's skin is black underneath its fur. (score=0.7695)
3. Polar bears are renowned for their white fur. (score=0.7301)



## Model choice and comparison summary

For this task I experimented with several sentence embedding models from the `sentence-transformers` library:

- **`sentence-transformers/all-mpnet-base-v2`**: A strong general-purpose English sentence encoder that performs very well on the MTEB benchmark for semantic similarity and retrieval tasks. In my experiments it consistently placed the most obviously related polar bear sentences at the top of the ranking, with clear separation in similarity score from less-related facts.
- **`sentence-transformers/all-MiniLM-L6-v2`**: A smaller, faster model. It produced the same top-three ordering as `all-mpnet-base-v2`. In the side-by-side scores it rated the polar bear sentences slightly higher and the fish and grizzly bear sentences lower, so separation on this small sample was not worse. Its main appeal is speed for low-latency applications; a five-sentence comparison is too small to rank the two models on quality.
- **`sentence-transformers/paraphrase-MiniLM-L6-v2`**: A model tuned more directly for paraphrase detection. It scored the near-identical rephrasing highly, but it ranked the black-skin sentence above the white-fur sentence, the reverse of the other two models, so its ordering of topically related sentences that are not paraphrases differed from the main model.

Based on these comparisons, I chose **`all-mpnet-base-v2`** as the main model in the notebook because it offers a good balance between semantic accuracy and computational cost for an offline analysis.
It captures the intuitive semantic relationships between sentences (e.g. ranking other polar bear facts highly) while still distinguishing them from unrelated trivia.

For the similarity measure, I used **cosine similarity** because it is the standard choice for sentence embeddings: it focuses on the direction of the vectors (semantic content) rather than their raw length, and it is the metric with which these models are typically trained and evaluated.
This combination of `all-mpnet-base-v2` + cosine similarity is therefore well-aligned with both the literature and the practical goal of this assignment.
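
"Separation" in the bullets above can be made concrete. The sketch below defines it, as an assumption for illustration only, as the mean score of the three polar bear sentences minus the mean score of the fish and grizzly bear sentences. The scores are copied from the side-by-side comparison output at the end of this notebook; the function name `separation` is hypothetical.

```python
import numpy as np

# Cosine scores copied from the side-by-side comparison output.
# Order: three polar bear sentences, then the fish and grizzly bear sentences.
scores = {
    "all-mpnet-base-v2": [0.9337, 0.7957, 0.6820, 0.5138, 0.5394],
    "all-MiniLM-L6-v2": [0.9799, 0.8228, 0.7252, 0.4257, 0.4263],
    "paraphrase-MiniLM-L6-v2": [0.9280, 0.7301, 0.7695, 0.3380, 0.3293],
}


def separation(model_scores, n_related=3):
    """Mean score of the related sentences minus mean score of the rest."""
    related = np.mean(model_scores[:n_related])
    unrelated = np.mean(model_scores[n_related:])
    return float(related - unrelated)


margins = {name: separation(vals) for name, vals in scores.items()}
for name, margin in margins.items():
    print(f"{name:<26} margin = {margin:.4f}")
```

A margin computed from five sentences is only a rough indicator and should not be read as a verdict on overall model quality.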



## Final summary

- Loaded and lightly cleaned 101 sentences from `data.txt`, keeping the text intact since modern embedding models expect natural language.
- Embedded the target sentence and the dataset using `all-mpnet-base-v2`, a strong and reliable model for semantic similarity tasks.
- Calculated cosine similarity, ranked the top five closest sentences, and reviewed the semantic patterns behind those scores.
- Ran a small comparison across multiple sentence-transformer models to confirm that the top match was stable across architectures (the second and third places swapped for one model).
- Summarised where embeddings work well for retrieval and where additional verification is needed, as they capture meaning but not factual accuracy.



In [8]:
# Benchmarking: side-by-side comparison of similarity scores for key sentences.
# Here we focus on the five sentences highlighted in the assignment
# description and show how each model scores them. This makes it easy
# to compare models numerically in a clear, tabular format.

import pandas as pd

key_sentences = [
    "The fur of a polar bear is transparent, not white.",
    "Polar bears are renowned for their white fur.",
    "A polar bear's skin is black underneath its fur.",
    "Fish are a type of animal that do not have fur.",
    "Grizzly bears are carnivorous mammals with brown fur."
]

# Map from sentence text to its index in the parsed sentence list.
# This allows us to look up the correct embedding regardless of order.
index_by_sentence = {s: i for i, s in enumerate(sentences)}

# Reuse the same set of models as before
comparison_models = [
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/paraphrase-MiniLM-L6-v2",
]

# Compute a matrix of scores: rows = sentences, columns = models
scores_matrix = {name: [] for name in comparison_models}

for model_name in comparison_models:
    model_cmp = SentenceTransformer(model_name)
    tgt_emb = model_cmp.encode(target, convert_to_tensor=True)
    sent_embs = model_cmp.encode(sentences, convert_to_tensor=True)

    scores = cosine_similarity(tgt_emb.unsqueeze(0), sent_embs, dim=1)
    scores_np = scores.detach().cpu().numpy()

    for s in key_sentences:
        idx = index_by_sentence[s]
        scores_matrix[model_name].append(float(scores_np[idx]))

# Build a DataFrame so the output renders as a readable table
comparison_df = pd.DataFrame({"Sentence": key_sentences})
for model_name in comparison_models:
    comparison_df[model_name] = scores_matrix[model_name]

# Add an additional row showing the average score for the top three
# (polar bear) sentences for each model. This provides a compact
# headline metric for how strongly each model rates the most relevant
# facts.
top3_indices = [0, 1, 2]
avg_row = {"Sentence": "Average of top 3 polar bear sentences"}
for model_name in comparison_models:
    top3_scores = [scores_matrix[model_name][i] for i in top3_indices]
    avg_row[model_name] = float(np.mean(top3_scores))

comparison_df = pd.concat([comparison_df, pd.DataFrame([avg_row])], ignore_index=True)

# Format to four decimal places for readability
comparison_df.style.format({model_name: "{:.4f}" for model_name in comparison_models})



| |Sentence|sentence-transformers/all-mpnet-base-v2|sentence-transformers/all-MiniLM-L6-v2|sentence-transformers/paraphrase-MiniLM-L6-v2|
|--------|--------|--------|--------|--------|
|0|The fur of a polar bear is transparent, not white.|0.9337|0.9799|0.9280|
|1|Polar bears are renowned for their white fur.|0.7957|0.8228|0.7301|
|2|A polar bear's skin is black underneath its fur.|0.6820|0.7252|0.7695|
|3|Fish are a type of animal that do not have fur.|0.5138|0.4257|0.3380|
|4|Grizzly bears are carnivorous mammals with brown fur.|0.5394|0.4263|0.3293|
|5|Average of top 3 polar bear sentences|0.8038|0.8426|0.8092|
