# Slight Detour: What is Semantic Similarity?

Semantic similarity is a measure, used in natural language processing and computational linguistics, of how close two pieces of text are in meaning, rather than in their lexical (word-level) overlap.

At its core, semantic similarity aims to capture when two different texts express similar ideas, concepts, or information, even if they use completely different words. For example, "The automobile won't start" and "My car isn't working" have high semantic similarity despite using different vocabulary.
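To make the contrast with lexical matching concrete, here is a minimal sketch that scores the two example sentences by word overlap (Jaccard similarity). The function name `jaccard_similarity` and the simple whitespace tokenization are illustrative assumptions and are not part of the pipeline used in this notebook.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Share of unique lowercase words that two texts have in common."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # Two empty texts: define the overlap as zero
    return len(words_a & words_b) / len(words_a | words_b)


score = jaccard_similarity("The automobile won't start", "My car isn't working")
print(f"Lexical overlap: {score:.2f}")
```

The two sentences share no words, so a purely lexical measure treats them as unrelated even though they describe the same problem. That is the gap embeddings aim to close.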

In [None]:
!pip install faiss-cpu numpy requests

In [30]:
import requests
import numpy as np
import faiss

In [31]:
texts = [
    "The cat sat on the mat",
    "A feline was resting on a rug",
    "Dogs are great pets",
    "I love having a canine companion",
    "Paris is the capital of France",
    "The Eiffel Tower is in Paris"
]

In [32]:
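# Embed text with a Nomic model served locally via Ollama
# Ollama is a friend --> https://ollama.com/
# Assumes the Ollama server is running and the model has already been pulled
def get_embeddings_from_ollama(text, model="nomic-embed-text"):
    """Return the embedding of `text` as a 1-D float32 NumPy array."""
    url = "http://localhost:11434/api/embeddings"

    payload = {
        "model": model,
        "prompt": text
    }

    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()  # Surface HTTP errors instead of a confusing KeyError below
    return np.array(response.json()["embedding"], dtype=np.float32)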


In [14]:
# Generate embeddings for all texts
embeddings = []
for text in texts:
    embedding = get_embeddings_from_ollama(text)
    embeddings.append(embedding)

In [15]:
# Stack the list into a (num_texts, dimension) array and build an exact (brute-force) L2 index
embeddings = np.array(embeddings)

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
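For intuition about what a flat L2 index does, the sketch below reproduces an exhaustive search in plain NumPy: compute the squared Euclidean distance from the query to every stored vector, then sort. The helper name `brute_force_l2_search` and the toy 2-D vectors are hypothetical and chosen only for illustration. Like this sketch, `IndexFlatL2` reports squared distances.

```python
import numpy as np


def brute_force_l2_search(stored, query, k):
    """Return (squared L2 distances, indices) of the k stored vectors nearest to query."""
    diffs = stored - query                 # Broadcast the query against every row
    sq_dists = np.sum(diffs ** 2, axis=1)  # One squared distance per stored vector
    order = np.argsort(sq_dists)[:k]       # Smallest distance first
    return sq_dists[order], order


# Toy data standing in for real embeddings
stored = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 2.0]], dtype=np.float32)
query = np.array([1.0, 0.0], dtype=np.float32)

dists, idxs = brute_force_l2_search(stored, query, k=3)
for rank, (d, i) in enumerate(zip(dists, idxs), start=1):
    print(f"{rank}. vector {i} (squared distance: {d:.1f})")
```

An exhaustive scan like this is exact but grows linearly with the number of stored vectors, which is fine for six sentences.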

In [34]:
# Create an embedding for the query sentence
# We will compare it to the sentences indexed above
query_text = "A cat is sitting on a carpet"
query_vector = get_embeddings_from_ollama(query_text)
query_vector = query_vector.reshape(1, -1)  # Reshape for FAISS

In [35]:
# Print out for fun
print(f"Vector shape: {query_vector.shape}")
print(f"Vector type: {query_vector.dtype}")
print(query_vector)

Vector shape: (1, 768)
Vector type: float32
[[ 1.19977033e+00  1.58822513e+00 -2.68919611e+00 -1.53683496e+00
   4.33461040e-01  9.94788587e-01 -6.42810762e-01  6.79520011e-01
  -1.20871091e+00 -8.47337246e-01 -4.66290534e-01  1.79046774e+00
   7.40709901e-01  5.90299785e-01 -2.15659842e-01 -1.99350402e-01
   5.68904698e-01 -9.70953226e-01  6.25149012e-01 -8.45871985e-01
   4.55646873e-01 -4.34556752e-02 -6.85987473e-01 -3.99236917e-01
   1.07917547e+00  8.49327743e-01  1.18206644e+00  7.07792044e-01
   1.79182863e+00  7.41470754e-01  8.41558695e-01 -5.60902119e-01
  -3.05747807e-01 -1.06164205e+00 -1.47894129e-01 -3.14708531e-01
   1.75366330e+00  9.26748633e-01  1.53842962e+00  1.12964797e+00
  -9.59145844e-01  6.95822120e-01 -3.55064809e-01 -1.78971723e-01
  -6.95277810e-01 -5.74833333e-01 -1.71975046e-01 -2.54991561e-01
  -5.86528897e-01  2.52667904e-01 -1.33063757e+00 -2.06139833e-01
  -1.72461605e+00 -9.54838455e-01  1.48306823e+00  5.10760486e-01
   8.08741391e-01  8.61016929e-0
  ... (output truncated)

In [36]:
# Search the index against ALL texts
k = len(texts)  # Return every indexed text
distances, indices = index.search(query_vector, k)

# Display results, nearest first
# IndexFlatL2 reports squared L2 distance, so a lower value means more similar
print(f"Query: {query_text}\n")
print("All texts ranked by similarity:")
for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
    print(f"{rank}. \"{texts[idx]}\" (Distance: {dist:.4f})")

Query: A cat is sitting on a carpet

All texts ranked by similarity:
1. "A feline was resting on a rug" (Distance: 185.5565)
2. "The cat sat on the mat" (Distance: 206.1640)
3. "Dogs are great pets" (Distance: 521.8005)
4. "The Eiffel Tower is in Paris" (Distance: 589.1736)
5. "Paris is the capital of France" (Distance: 598.8473)
6. "I love having a canine companion" (Distance: 647.4026)
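
The raw distances above are hard to interpret in isolation because they depend on vector magnitude as well as direction. A common alternative is cosine similarity, which is bounded in [-1, 1]. The sketch below is a minimal illustration using toy vectors in place of real embeddings; `cosine_similarity` and `unit` are hypothetical helper names. It also checks the identity that links the two measures for unit-length vectors: squared L2 distance equals 2 - 2 * cosine similarity.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # Same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])  # Orthogonal to a (dot product is 0)

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)

# For unit vectors u and v: ||u - v||^2 = 2 - 2 * cos(u, v)
sq_dist_ac = float(np.sum((unit(a) - unit(c)) ** 2))

print(f"cos(a, b) = {sim_ab:.4f}")
print(f"cos(a, c) = {sim_ac:.4f}")
print(f"squared L2 between unit(a) and unit(c) = {sq_dist_ac:.4f}")
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score close to 0. With FAISS, one way to get the same ranking is to L2-normalize the embeddings (for example with `faiss.normalize_L2`) and search an inner-product index such as `faiss.IndexFlatIP`. Whether that changes the ordering for these six sentences has not been checked here.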
