# Similarity Searches and Embeddings

## Introduction to Similarity and Distance
Similarity and distance are inversely related: the closer two documents are in meaning, the higher their similarity and the smaller the distance between their embeddings. To compare documents represented as vectors, we therefore compute a distance metric, such as cosine distance, and treat the smallest distance as the best match.

## Similarity with CosineDistance
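Cosine similarity is the cosine of the angle between two vectors, and cosine distance is usually defined as one minus that value. Vectors pointing in the same direction therefore have a distance of 0, and orthogonal vectors have a distance of 1. Because the measure depends only on direction and not on magnitude, it is commonly used to compare document embeddings.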

In [1]:
from swarmauri.distances.concrete.CosineDistance import CosineDistance
from swarmauri.vectors.concrete.Vector import Vector

# Example documents represented as vectors
doc1_vector = Vector(value=[1, 2])
doc2_vector = Vector(value=[1, 2])

# Calculate Cosine Distance between the two vectors
cosine_distance = CosineDistance().distance(doc1_vector, doc2_vector)
print(f"Cosine Distance between [1,2] and [1,2]: {cosine_distance}")

Cosine Distance between [1,2] and [1,2]: 2.220446049250313e-16
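
To make the calculation above concrete, here is a minimal pure-Python sketch of cosine distance, assuming the usual definition `1 - (a · b) / (|a| |b|)`. The function name `cosine_distance` and the zero-vector handling are illustrative choices, not Swarmauri's implementation, which may differ in its details.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length lists of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector has no direction, so report maximal dissimilarity.
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1, 0], [1, 0]))   # same direction
print(cosine_distance([1, 0], [0, 1]))   # orthogonal
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
```

Scaling a vector does not change the result: `[1, 2]` and `[2, 4]` point in the same direction, so their distance is zero up to rounding.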


## Explanation of Floating-Point Precision

Floating-point precision is a crucial concept in numerical computations, particularly in machine learning. It refers to the limitations of how computers represent real numbers.

When calculating cosine distance, you may find that the computed distance between two identical vectors is not exactly zero. This discrepancy arises because:

- **Cosine Distance and Angular Measurement:** Theoretically, the cosine distance between two identical vectors should be zero, as there is no angular difference.
- **Floating-Point Arithmetic:** Computers store numbers using finite precision (typically 64 bits for floating-point numbers). This limitation can lead to tiny rounding errors in calculations.
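
The two points above can be seen with the standard library alone. The sketch below prints the float64 machine epsilon (the gap between 1.0 and the next representable number), which has the same value as the residue in the cosine distance output earlier, and shows a familiar rounding artifact. The variable names are illustrative.

```python
import sys

# Machine epsilon for 64-bit floats: the gap between 1.0 and the next float.
eps = sys.float_info.epsilon
print(eps)  # 2.220446049250313e-16

# 0.1 and 0.2 have no exact binary representation, so their sum is slightly off.
total = 0.1 + 0.2
print(total == 0.3)             # False
print(abs(total - 0.3) < eps)   # True: the error is below one epsilon
```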

### Example: Why Cosine Distance Doesn't Equal 0

In [2]:
# Floating point precision explanation
# Theoretical cosine distance between identical vectors should be 0, but due to FP64 limitations:
expected_result = 0.0
actual_result = CosineDistance().distance(Vector(value=[1, 2]), Vector(value=[1, 2]))
print(f"Expected Result: {expected_result}, Actual Result: {actual_result}")

# Difference due to floating-point precision
difference = abs(expected_result - actual_result)
print(f"Difference due to Floating Point Precision: {difference}")

Expected Result: 0.0, Actual Result: 2.220446049250313e-16
Difference due to Floating Point Precision: 2.220446049250313e-16
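
In practice, a result like this is usually compared against zero with a tolerance rather than with `==`. The following is a small sketch using `math.isclose`; the helper names and the tolerance of `1e-12` are assumptions chosen for illustration and should be adapted to the scale of your data.

```python
import math


def is_effectively_zero(distance, abs_tol=1e-12):
    """True when the distance differs from 0.0 by no more than abs_tol."""
    return math.isclose(distance, 0.0, abs_tol=abs_tol)


def clamp_distance(distance, abs_tol=1e-12):
    """Snap tiny rounding residue to exactly 0.0; leave other values untouched."""
    return 0.0 if abs(distance) <= abs_tol else distance


observed = 2.220446049250313e-16  # value printed by the cell above
print(observed == 0.0)                # False
print(is_effectively_zero(observed))  # True
print(clamp_distance(observed))       # 0.0
```

An absolute tolerance is needed here because a relative tolerance is meaningless when one side of the comparison is exactly zero.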


## Similarity Search Example

Now, let’s create a list of document embeddings and find the most similar document using `CosineDistance`.

In [16]:
from swarmauri.distances.concrete.CosineDistance import CosineDistance
from swarmauri.vectors.concrete.Vector import Vector

# NOTE: `embedder` is assumed to have been created in an earlier cell; it must
# expose fit_transform() and return Vector objects.
cosine = CosineDistance()

# Example documents
documents = ["The cat is on the mat", "Dogs are loyal pets", "I love eating bananas"]

# Generate embeddings for the query and the documents.
# Caution: each fit_transform call refits the embedder, so the query and the
# documents end up with different vocabularies (and different vector lengths).
query_embedding = embedder.fit_transform(["The cat is playing"])[0]
documents_embeddings = embedder.fit_transform(documents)

# Print the embedding values for the query and for each document
print("Query embedding shape:", query_embedding.value)
print("Documents embeddings shapes:")
for doc_embedding in documents_embeddings:
    print(doc_embedding.value)

# Determine the maximum length for padding/trimming
query_embedding_values = query_embedding.value
max_length = max(len(query_embedding_values), max(len(embedding.value) for embedding in documents_embeddings))

# Pad a vector with zeros, or trim it, so that it has exactly max_length entries
def normalize_vector(vector, max_length):
    return vector + [0] * (max_length - len(vector)) if len(vector) < max_length else vector[:max_length]

# Bring the query and document vectors to the same length
query_embedding_values_normalized = normalize_vector(query_embedding_values, max_length)
embeddings_normalized = [normalize_vector(embedding.value, max_length) for embedding in documents_embeddings]

# Compute the cosine distance between the query and each document
distances = []
for embedding_normalized in embeddings_normalized:
    distances.append(cosine.distance(Vector(value=query_embedding_values_normalized), Vector(value=embedding_normalized)))

# Find the most similar document (smallest distance)
most_similar_index = distances.index(min(distances))
print(f"Most similar document: {documents[most_similar_index]}")


Query embedding shape: [0.5, 0.5, 0.5, 0.5]
Documents embeddings shapes:
[0.0, 0.0, 0.3535533905932738, 0.0, 0.0, 0.3535533905932738, 0.0, 0.0, 0.3535533905932738, 0.3535533905932738, 0.0, 0.7071067811865476]
[0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0]
[0.0, 0.5773502691896257, 0.0, 0.0, 0.5773502691896257, 0.0, 0.5773502691896257, 0.0, 0.0, 0.0, 0.0, 0.0]
Most similar document: Dogs are loyal pets


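The code takes a set of documents and a query, generates their embeddings, pads or trims the vectors to a common length, calculates the cosine distance between the query and each document, and reports the document with the smallest distance as the most similar. Note that the query and the documents are embedded with separate `fit_transform` calls, so their dimensions come from different vocabularies: the query vector has 4 entries and each document vector has 12. Zero-padding equalizes the lengths but does not align those dimensions, which is why the match reported here, "Dogs are loyal pets", should not be read as a semantic result.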
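
One way to keep the query and the documents in the same vector space is to build the vocabulary once from the documents and reuse it for the query. The sketch below illustrates that idea with a hand-rolled term-count embedding in plain Python. The helpers `tokenize`, `build_vocabulary`, `embed`, and `cosine_distance` are hypothetical stand-ins written for this example; they are not Swarmauri APIs, and a real embedder would typically apply weighting such as TF-IDF.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of two or more letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


def build_vocabulary(documents):
    """Sorted list of every distinct token found in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def embed(text, vocabulary):
    """Term-count vector over a fixed vocabulary; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in vocabulary]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat zero vectors as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


documents = ["The cat is on the mat", "Dogs are loyal pets", "I love eating bananas"]

# Build the vocabulary once, then embed documents and query against it.
vocabulary = build_vocabulary(documents)
doc_vectors = [embed(doc, vocabulary) for doc in documents]
query_vector = embed("The cat is playing", vocabulary)

distances = [cosine_distance(query_vector, vec) for vec in doc_vectors]
best = documents[distances.index(min(distances))]
print(f"Most similar document: {best}")  # Most similar document: The cat is on the mat
```

Because every vector is indexed by the same vocabulary, no padding or trimming is needed, and the query shares the terms "the", "cat", and "is" only with the first document.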

## Notebook Metadata

In [17]:
import os
import platform
import sys
from datetime import datetime

author_name = "Huzaifa Irshad " 
github_username = "irshadhuzaifa"

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

notebook_file = "Notebook_02_Preprocessing_Data_For_Embeddings.ipynb"
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except Exception as e:
    print(f"Could not retrieve last modified datetime: {e}")

print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

try:
    import swarmauri
    print(f"Swarmauri Version: {swarmauri.__version__}")
except ImportError:
    print("Swarmauri is not installed.")

Author: Huzaifa Irshad 
GitHub Username: irshadhuzaifa
Last Modified: 2024-10-18 11:16:32.881434
Platform: Windows 11
Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
Swarmauri Version: 0.5.0
