In [1]:
pip install faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install faiss-gpu

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)
ERROR: No matching distribution found for faiss-gpu


In [1]:
import faiss
import numpy as np

# Initialize a Flat Index (brute-force search)
dimension = 128  # Dimension of vectors
index = faiss.IndexFlatL2(dimension)  # L2 (Euclidean) metric; search returns squared distances

# Generate random vectors (e.g., embeddings)
num_vectors = 1000
vectors = np.random.random((num_vectors, dimension)).astype('float32')

# Add vectors to the index
index.add(vectors)

print(f"Number of vectors in the index: {index.ntotal}")

Number of vectors in the index: 1000


In [2]:
# Generate a query vector
query_vector = np.random.random((1, dimension)).astype('float32')

# Search for the 5 nearest neighbors
k = 5
distances, indices = index.search(query_vector, k)

print("Distances:", distances)
print("Indices of nearest neighbors:", indices)


Distances: [[14.156978 14.640308 14.93914  15.006089 15.42473 ]]
Indices of nearest neighbors: [[ 52 450 918 534 177]]


Query Explanation
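Distances: The squared L2 (Euclidean) distance between the query vector and each retrieved vector, sorted from smallest to largest. Lower values mean closer, more similar vectors.
Indices: The positions of the retrieved vectors in the order they were added to the index, i.e., their row numbers in the original array.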


Key Concepts
What is a Vector?

A vector is a numerical representation of data, often generated by a machine learning model (an embedding).
Example:
Images: Embeddings extracted with a CNN (ResNet, etc.).
Text: Embeddings extracted with NLP models (BERT, word2vec).

What is a Flat Index?

A basic FAISS index that stores every vector as-is and performs brute-force search: each query is compared against all stored vectors.
Results are exact, but search time grows linearly with the number of vectors, so it works best for smaller datasets (e.g., a few thousand vectors).

What is L2 Distance?

The Euclidean (straight-line) distance between two vectors. Smaller distances indicate more similarity.
Example:
Two identical vectors have a distance of 0.
Vectors that are farther apart have larger distances.
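
To make the brute-force idea concrete, here is a minimal NumPy-only sketch of what a flat L2 search computes. It is not how FAISS is implemented internally (FAISS uses optimized routines); the seed, array sizes, and variable names are assumptions chosen for illustration. It reports squared distances, which is also what IndexFlatL2 returns.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable
dimension = 128
vectors = rng.random((1000, dimension), dtype=np.float32)
query = rng.random((1, dimension), dtype=np.float32)

# Squared L2 distance from the query to every stored vector
diff = vectors - query                     # broadcasts to shape (1000, 128)
sq_dists = np.sum(diff * diff, axis=1)     # shape (1000,)

# Keep the k smallest distances, closest first
k = 5
nearest = np.argsort(sq_dists)[:k]
print("Indices:", nearest)
print("Squared L2 distances:", sq_dists[nearest])
```

Every query touches all 1000 rows, which is why the cost of a flat index grows linearly with the number of stored vectors.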

In [5]:
# Generate a random query vector (e.g., similar to embeddings from a new image)
query_vector = np.random.random((1, dimension)).astype('float32')

# Perform a search for the 5 nearest neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(query_vector, k)

# Print the results
print("Query Vector:", query_vector)
print("Distances to Neighbors:", distances)
print("Indices of Neighbors:", indices)


Query Vector: [[0.7279403  0.10713751 0.32860276 0.4993713  0.7423833  0.27279237
  0.79273903 0.2916031  0.48047748 0.06298422 0.06591693 0.09477851
  0.55380917 0.12910019 0.04598691 0.9078159  0.77033895 0.88538265
  0.03636814 0.2612472  0.4319332  0.62366617 0.505568   0.096889
  0.5075388  0.11634566 0.8683764  0.9852671  0.97208357 0.09180584
  0.32982185 0.21606591 0.9732137  0.27511105 0.99516237 0.11287562
  0.15530793 0.8789886  0.9932234  0.49323556 0.24727508 0.533561
  0.7541848  0.04145546 0.27339694 0.8266389  0.06087651 0.29161167
  0.7918901  0.70615387 0.24259007 0.7554129  0.13981779 0.6970747
  0.58426106 0.8528335  0.79104716 0.2480611  0.4978429  0.95577997
  0.60850316 0.19041154 0.18559423 0.95386124 0.30269492 0.23273489
  0.6286228  0.90319395 0.46590626 0.61094815 0.316676   0.9066248
  0.792105   0.81925005 0.83912486 0.27209285 0.02078237 0.80653065
  0.5683758  0.6223491  0.19222236 0.5523597  0.88503015 0.8711004
  0.46168157 0.5705977  0.71410024 0.1892

In [6]:
pip install faiss-cpu sentence-transformers numpy

Note: you may need to restart the kernel to use updated packages.


In [1]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize the Sentence Transformer model to get sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I love programming in Python.",
    "Artificial intelligence is transforming the world.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning models are highly effective for image recognition."
]

# Generate sentence embeddings (each sentence is converted to a 384-dimensional vector)
embeddings = model.encode(sentences)

# Convert embeddings to numpy array (float32 type is required by FAISS)
embeddings = np.array(embeddings).astype('float32')

# Initialize FAISS index for L2 distance
dimension = embeddings.shape[1]  # 384 dimensions for the 'all-MiniLM-L6-v2' model
index = faiss.IndexFlatL2(dimension)

# Add embeddings to the FAISS index
index.add(embeddings)

# Check the number of vectors in the index
print(f"Number of sentence embeddings in the index: {index.ntotal}")



Number of sentence embeddings in the index: 5


In [11]:
print(embeddings)

[[ 0.04393355  0.05893443  0.04817838 ...  0.05216278  0.05610652
   0.10206394]
 [-0.05761699  0.00426226 -0.02815318 ...  0.11543837  0.10225611
  -0.01581942]
 [ 0.03872415 -0.00110552  0.08271618 ... -0.02902935  0.04854367
  -0.03839865]
 [-0.02345928 -0.01058114  0.07192817 ...  0.06038335  0.07951811
  -0.04710237]
 [-0.01509204 -0.06898728  0.07579856 ...  0.02140344 -0.05689414
  -0.04317182]]


In [5]:
print(index)

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001BC64328120> >


In [2]:
# Query sentence
query_sentence = "I enjoy coding in Python."

# Convert the query sentence to an embedding
query_embedding = model.encode([query_sentence])

# Perform the search for the 2 most similar sentences
k = 2  # Number of nearest neighbors to retrieve
distances, indices = index.search(np.array(query_embedding).astype('float32'), k)

# Print the query and the most similar sentences
print(f"Query Sentence: {query_sentence}")
print("\nMost similar sentences:")
for i in range(k):
    print(f"Sentence: {sentences[indices[0][i]]} - Distance: {distances[0][i]}")


Query Sentence: I enjoy coding in Python.

Most similar sentences:
Sentence: I love programming in Python. - Distance: 0.12243586778640747
Sentence: Artificial intelligence is transforming the world. - Distance: 1.4568238258361816


Sentence Embedding Generation:
First, the sentences are passed through the SentenceTransformer model ('all-MiniLM-L6-v2'), which produces one 384-dimensional vector (an array of floating-point numbers) per sentence. These embeddings capture the semantic meaning of the sentences. For example, with illustrative values rather than actual model output:
"The quick brown fox jumps over the lazy dog." -> [0.12, -0.34, 0.56, ..., 0.98, -0.12]



FAISS Indexing:
Next, these embeddings are added to a FAISS index using index.add(embeddings). FAISS is designed to store and search vectors in high-dimensional space efficiently. Here's what happens:

FAISS Index: IndexFlatL2 is a simple, brute-force index. The L2 in the name refers to the Euclidean distance used for comparison. When you query the index, FAISS computes the distance between the query vector and every vector in the index, then returns the closest matches. The reported values are squared Euclidean distances.

Storing Embeddings: The embeddings are stored directly in the FAISS index. IndexFlatL2 keeps them uncompressed, in the order they were added, and does not build any additional search structure.

Storing the Embeddings:
Indexing Process: index.add(embeddings) copies the embeddings into the index, which lives in memory (RAM). The index is not written to disk unless you save it explicitly, for example with faiss.write_index. The stored data can be pictured as a matrix in which each row is one sentence embedding (the values below are illustrative, not actual model output):


Sentence                                                            Embedding (384-dimensional vector)
"The quick brown fox jumps over the lazy dog."                      [0.12, -0.34, 0.56, ..., 0.98, -0.12]
"I love programming in Python."                                     [0.23, -0.12, 0.45, ..., 0.76, -0.54]
"Artificial intelligence is transforming the world."                [0.56, 0.34, 0.67, ..., 0.34, -0.76]
"Machine learning is a subset of artificial intelligence."          [0.21, -0.67, 0.45, ..., 0.78, 0.12]
"Deep learning models are highly effective for image recognition."  [0.43, -0.22, 0.34, ..., 0.56, 0.45]


Flat Index (IndexFlatL2):

This is a simple, brute-force index where all vectors are stored directly in memory, uncompressed.
The number of vectors it can store depends on the available system memory (RAM). Memory use scales linearly with the number of vectors.
For example, 100,000 vectors of dimension 128 require 100,000 * 128 * 4 bytes = 51,200,000 bytes, or about 51.2 MB (each float32 element is 4 bytes).
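
The arithmetic above can be wrapped in a small helper to estimate memory for other dataset sizes. The function name is hypothetical, and the estimate covers only the raw vector data, not the small fixed overhead of the index object.

```python
def flat_index_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dimension * bytes_per_value


size = flat_index_bytes(100_000, 128)
print(f"{size:,} bytes = {size / 1e6:.1f} MB")  # 51,200,000 bytes = 51.2 MB
```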
Inverted File Index (IVF):

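The Inverted File Index (IVF) is designed to make search faster on large datasets.
It partitions the vectors into groups (clusters) and keeps an inverted list of the vectors that belong to each cluster. At query time only the clusters closest to the query are scanned, so results are approximate.
The basic variant, IndexIVFFlat, still stores every vector uncompressed, so its memory use is similar to a flat index. IVF saves memory only when combined with compression such as product quantization (IndexIVFPQ). With those variants it can scale to millions of vectors, and FAISS also offers on-disk inverted lists for datasets that do not fit in RAM.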
HNSW (Hierarchical Navigable Small World) Index:

The HNSW index is a graph-based index that is highly efficient for large-scale approximate vector search.
It organizes vectors in a hierarchical graph structure. IndexHNSWFlat stores the full vectors plus the graph links, so it uses more memory than FlatL2, not less.
HNSW can handle millions of vectors while maintaining fast search times. The extra memory depends on the number of neighbors stored per vector (the M parameter, which can be adjusted).