Facebook AI Similarity Search (FAISS)

In [1]:
pip install faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install numpy requests

Note: you may need to restart the kernel to use updated packages.


In [3]:
data = """Journal on Vector Databases
Abstract

With the rapid rise of artificial intelligence (AI) and large language models (LLMs), the need for efficient storage and retrieval of high-dimensional vector data has emerged as a critical requirement. Vector databases (VectorDBs) provide the infrastructure to perform similarity search at scale, enabling applications such as semantic search, recommendation systems, anomaly detection, and retrieval-augmented generation (RAG). This paper discusses the architecture, applications, advantages, and challenges of vector databases.

Introduction

Traditional databases are designed to handle structured, relational, or document-based data. However, modern AI applications generate embeddings—dense numerical vectors—that capture semantic meaning. Searching in this high-dimensional space requires specialized data structures and algorithms. Vector databases bridge this gap, enabling nearest-neighbor search across billions of vectors efficiently.

Core Concepts

Vector Embeddings: Numerical representations of unstructured data (e.g., sentences, images) in a continuous vector space.

Similarity Metrics: Methods such as cosine similarity, Euclidean distance, or dot product to measure closeness of vectors.

Indexing Techniques: Structures like HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index), and PQ (Product Quantization) optimize search performance.

Approximate Nearest Neighbor (ANN) Search: Balances speed and accuracy for scalable similarity queries.

Architecture of a Vector Database

A VectorDB typically includes:

Storage Layer: Stores embeddings along with metadata.

Index Layer: Organizes embeddings into searchable structures.

Query Engine: Executes similarity searches.

Integration Layer: Provides APIs for AI/ML pipelines, often REST or gRPC endpoints.

Applications

Semantic Search – retrieving relevant documents or passages based on meaning rather than keywords.

Recommendation Systems – finding similar products, users, or media.

Fraud and Anomaly Detection – identifying outliers in high-dimensional financial or cybersecurity data.

Multimodal AI – searching across different media types (text-to-image retrieval).

Retrieval-Augmented Generation (RAG) – powering LLMs with external knowledge bases.

Popular Vector Databases

Pinecone – managed vector DB service with strong cloud integration.

Weaviate – open-source, schema-aware vector database.

Milvus – high-performance open-source VectorDB for large-scale similarity search.

FAISS (Facebook AI Similarity Search) – a library for efficient ANN search (often embedded in VectorDBs).

Qdrant – open-source vector search engine optimized for production.

Challenges

Scalability: Managing billions of vectors with low latency.

Hybrid Search: Combining vector similarity with traditional keyword or filter-based search.

Data Freshness: Updating embeddings dynamically as content evolves.

Cost Efficiency: Storing and querying large volumes of embeddings.

Future Directions

The adoption of VectorDBs will accelerate as LLMs and generative AI scale. Integration with relational databases, graph databases, and cloud-native architectures will shape the next generation of hybrid intelligent data platforms.

Conclusion

Vector databases represent a paradigm shift in data storage and retrieval. They enable machines to understand and search based on meaning rather than syntax, making them foundational for AI-driven applications. Their role will continue to expand across industries as data becomes increasingly unstructured and AI systems demand more sophisticated retrieval mechanisms."""

Split the data into small chunks so it is easier to process and retrieve.
Clean the data following basic NLP preprocessing principles.
The usual steps are: 1. data collection, 2. data reading, 3. data cleaning (split the data into overlapping chunks so context carries across chunk boundaries), 4. embeddings.

#Data Cleaning

In [4]:
Clean_data = data.strip()

✅ Use 300–500 tokens per chunk for general-purpose search.
✅ Increase to 800–1200 tokens if your model and use case need more context (e.g., RAG).
✅ Keep 10–20% overlap between chunks: for a chunk size of 1000, the overlap should be 100-200.
✅ Align chunks to sentence/paragraph boundaries where possible.
✅ Pre-process text (remove boilerplate, tables, repeated headers).
✅ Store metadata (e.g., section, page, source) with each chunk in the VectorDB → helps in filtering later.
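
The guideline about aligning chunks to sentence boundaries can be sketched as follows. This is a minimal illustration, not the method used in the cells below: `chunk_by_sentences` is a hypothetical helper, it measures size in characters rather than tokens, and its regex sentence splitter is deliberately naive.

```python
import re

def chunk_by_sentences(text, max_chars=800, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap_sentences` sentences of a finished chunk are repeated
    at the start of the next one. A chunk can exceed max_chars when a single
    sentence (plus the carried overlap) is longer than the limit.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = (
    "Vector databases store embeddings. They support similarity search. "
    "FAISS is a library for ANN search. Metadata helps with filtering."
)
sample_chunks = chunk_by_sentences(sample, max_chars=80, overlap_sentences=1)
for chunk in sample_chunks:
    print(repr(chunk))
```

Each chunk here ends on a sentence boundary, and consecutive chunks share one sentence. A production splitter would likely count tokens with the embedding model's tokenizer instead of characters.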

In [5]:
max_char = 800  # chunk size in characters (not tokens)
overlap = 100   # characters shared by consecutive chunks so context carries across boundaries
chunks = []
i = 0
while i < len(Clean_data):
    piece = Clean_data[i:i + max_char]
    chunks.append(piece)
    i += max_char - overlap  # advance by the step size (700 characters)


In [6]:
len(chunks)

6
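
The chunk count follows directly from the window arithmetic: the window advances by `max_char - overlap` characters each time, so a text of `n` characters yields `ceil(n / step)` chunks. A small sketch of that calculation (`count_chunks` and `sliding_chunks` are hypothetical helper names, and the 4000-character test string is made up):

```python
import math

def count_chunks(text_len, max_char=800, overlap=100):
    """Number of chunks a sliding window produces when it advances by max_char - overlap."""
    step = max_char - overlap
    if step <= 0:
        # With overlap >= max_char the window would never move forward.
        raise ValueError("overlap must be smaller than max_char")
    return math.ceil(text_len / step)

def sliding_chunks(text, max_char=800, overlap=100):
    """Same chunking as the while loop above, written with range()."""
    step = max_char - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_char")
    return [text[i:i + max_char] for i in range(0, len(text), step)]

dummy_text = "x" * 4000  # made-up length, only for illustration
print(count_chunks(len(dummy_text)), len(sliding_chunks(dummy_text)))
```

With a step of 700 characters, any text between 3,501 and 4,200 characters long gives 6 chunks, which is consistent with the result above.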

In [7]:
chunks

['Journal on Vector Databases\nAbstract\n\nWith the rapid rise of artificial intelligence (AI) and large language models (LLMs), the need for efficient storage and retrieval of high-dimensional vector data has emerged as a critical requirement. Vector databases (VectorDBs) provide the infrastructure to perform similarity search at scale, enabling applications such as semantic search, recommendation systems, anomaly detection, and retrieval-augmented generation (RAG). This paper discusses the architecture, applications, advantages, and challenges of vector databases.\n\nIntroduction\n\nTraditional databases are designed to handle structured, relational, or document-based data. However, modern AI applications generate embeddings—dense numerical vectors—that capture semantic meaning. Searching in this',
 'cations generate embeddings—dense numerical vectors—that capture semantic meaning. Searching in this high-dimensional space requires specialized data structures and algorithms. Vector da

Embedding: convert the text into numerical vectors using a model or API. The model must be a text-embedding model suited to similarity search.

In [9]:

import os

import requests
import numpy as np

def generate_embeddings(text):
    """Return the embedding of `text` as a NumPy array."""
    url = "https://api.euron.one/api/v1/euri/embeddings"
    headers = {
        "Content-Type": "application/json",
        # Read the API key from the environment instead of hardcoding it.
        # EURI_API_KEY is an assumed variable name; set it before running.
        "Authorization": f"Bearer {os.environ['EURI_API_KEY']}"
    }
    payload = {
        "input": text,
        "model": "text-embedding-3-small"  # OpenAI-based embedding model; supports multilingual text
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()  # fail early on HTTP errors
    body = response.json()

    embedding = np.array(body['data'][0]['embedding'])

    return embedding

text = "The weather is sunny today."

embedding = generate_embeddings(text)

In [10]:
for i in chunks:
    embedding = generate_embeddings(i)
    print(embedding)


[-0.01709348  0.03181893  0.03002567 ...  0.02860025  0.01844992
  0.00995491]
[-0.00992389  0.02803624  0.05719573 ...  0.01869083 -0.00962623
  0.01687117]
[-0.02161345  0.04889561  0.03228603 ...  0.02150847  0.01083588
 -0.00042282]
[-8.1308280e-05  3.7904516e-02  5.5489090e-02 ...  1.7649697e-02
  1.3715965e-02  1.1892379e-02]
[-0.01607724  0.02467457  0.02392202 ...  0.03739953  0.01955494
  0.00642519]
[-0.00881631 -0.00226156  0.05537876 ... -0.00156267 -0.00101808
  0.02672122]


Store the embedded data in the vector DB

In [14]:
emb_list = []
metadata = []
for idx, chunk in enumerate(chunks):
    vec = generate_embeddings(chunk)
    emb_list.append(vec.astype("float32"))
    metadata.append({"id": idx, "text": chunk})
    print(f"chunk{idx+1} embedding:{vec}")

chunk1 embedding:[-0.01709348  0.03181893  0.03002567 ...  0.02860025  0.01844992
  0.00995491]
chunk2 embedding:[-0.0099126   0.02803608  0.05728526 ...  0.01870195 -0.0096374
  0.0168823 ]
chunk3 embedding:[-0.02167191  0.04886963  0.03226888 ...  0.02149704  0.01100499
 -0.00031895]
chunk4 embedding:[-8.1308280e-05  3.7904516e-02  5.5489090e-02 ...  1.7649697e-02
  1.3715965e-02  1.1892379e-02]
chunk5 embedding:[-0.01605848  0.02470361  0.02390524 ...  0.03743178  0.01958268
  0.00640971]
chunk6 embedding:[-0.00885155 -0.00227339  0.05537213 ... -0.00149971 -0.00100888
  0.02671802]


In [15]:
emb_list

[array([-0.01709348,  0.03181893,  0.03002567, ...,  0.02860025,
         0.01844992,  0.00995491], shape=(1536,), dtype=float32),
 array([-0.0099126 ,  0.02803608,  0.05728526, ...,  0.01870195,
        -0.0096374 ,  0.0168823 ], shape=(1536,), dtype=float32),
 array([-0.02167191,  0.04886963,  0.03226888, ...,  0.02149704,
         0.01100499, -0.00031895], shape=(1536,), dtype=float32),
 array([-8.1308281e-05,  3.7904516e-02,  5.5489089e-02, ...,
         1.7649697e-02,  1.3715965e-02,  1.1892379e-02],
       shape=(1536,), dtype=float32),
 array([-0.01605848,  0.02470361,  0.02390524, ...,  0.03743178,
         0.01958268,  0.00640971], shape=(1536,), dtype=float32),
 array([-0.00885155, -0.00227339,  0.05537213, ..., -0.00149971,
        -0.00100888,  0.02671802], shape=(1536,), dtype=float32)]

In [16]:
metadata

[{'id': 0,
  'text': 'Journal on Vector Databases\nAbstract\n\nWith the rapid rise of artificial intelligence (AI) and large language models (LLMs), the need for efficient storage and retrieval of high-dimensional vector data has emerged as a critical requirement. Vector databases (VectorDBs) provide the infrastructure to perform similarity search at scale, enabling applications such as semantic search, recommendation systems, anomaly detection, and retrieval-augmented generation (RAG). This paper discusses the architecture, applications, advantages, and challenges of vector databases.\n\nIntroduction\n\nTraditional databases are designed to handle structured, relational, or document-based data. However, modern AI applications generate embeddings—dense numerical vectors—that capture semantic meaning. Searching in this'},
 {'id': 1,
  'text': 'cations generate embeddings—dense numerical vectors—that capture semantic meaning. Searching in this high-dimensional space requires specialized 

In [19]:
xb = np.vstack(emb_list)  # stack the vectors into one (n_chunks, 1536) float32 matrix, the layout FAISS expects


Normalize the vectors to unit length so that inner-product search returns cosine similarity

In [20]:
import faiss

In [28]:
faiss.normalize_L2(xb)
d = xb.shape[1]
d

1536

In [None]:
index = faiss.IndexFlatIP(d)  # exact (brute-force) inner-product index
index.add(xb)  # add the vectors to the index; at this point the data lives only in memory, not in a physical file
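
To make the flat inner-product index less of a black box, here is a rough pure-Python sketch of an exhaustive inner-product search: every stored vector is scored against the query and the top k are returned. `l2_normalize` and `flat_ip_search` are hypothetical names, the 2-dimensional vectors are made up, and FAISS itself does this with optimized native code rather than Python loops.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def flat_ip_search(vectors, query, k):
    """Score every stored vector by inner product and return the top k as (scores, ids)."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in ranked], ranked

# Toy 2-dimensional "embeddings", normalized so inner product equals cosine similarity.
stored = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query_vec = l2_normalize([2.0, 0.1])

top_scores, top_ids = flat_ip_search(stored, query_vec, k=2)
print(top_scores, top_ids)
```

Because nothing is approximated, the results are exact, at the cost of comparing the query with every stored vector.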

Data Storage in physical files

In [32]:
index_path = "index_vectordb.faiss"  # file name for the FAISS index on disk
meta_path = "meta_vectordb.json"  # file name for the chunk metadata

In [34]:
faiss.write_index(index, index_path)

In [35]:
import json, os

In [36]:
with open(meta_path, "w") as f:
    for item in metadata:
        f.write(json.dumps(item) + "\n")
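
The loop above writes one JSON object per line (the JSON Lines layout), so the file has to be read back line by line; a single `json.load` on the whole file would fail. A minimal round-trip sketch, using hypothetical helpers `save_metadata` and `load_metadata` and a temporary file with made-up records:

```python
import json
import os
import tempfile

def save_metadata(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def load_metadata(path):
    """Read the file back, parsing each non-empty line as its own JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records, only for illustration.
sample_records = [{"id": 0, "text": "first chunk"}, {"id": 1, "text": "second chunk"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "meta_sample.jsonl")
    save_metadata(sample_records, sample_path)
    restored = load_metadata(sample_path)

print(restored == sample_records)
```

The index file has its own counterpart: `faiss.read_index(index_path)` loads what `faiss.write_index` saved.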

Search Operation

In [39]:
query = "what is vectordb ?"
# convert the query to an embedding with shape (1, d)
q = generate_embeddings(query).astype("float32").reshape(1, -1)
faiss.normalize_L2(q)  # normalize the query in place so inner product equals cosine similarity
index.search(q, 5)  # returns (scores, indices) of the 5 most similar chunks

(array([[0.4657818 , 0.46344537, 0.45117822, 0.44971085, 0.37138897]],
       dtype=float32),
 array([[2, 0, 3, 4, 1]]))

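The first array holds the cosine similarity scores between the query and the stored chunks; the second holds the indices of the matching chunks, best match first.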
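
The indices only become useful once they are mapped back to the chunk text. A small sketch of that lookup, assuming the metadata list is in the same order in which the vectors were added to the index (as in the cells above). `collect_hits` is a hypothetical helper, the scores and indices are copied from the search output above, and the chunk texts are placeholders.

```python
def collect_hits(scores, ids, metadata):
    """Pair each returned index with its score and stored text."""
    hits = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0:
            # FAISS fills missing results with -1 when fewer than k vectors exist.
            continue
        hits.append({
            "score": float(score),
            "id": int(idx),
            "text": metadata[int(idx)]["text"],
        })
    return hits

# Values copied from the search output; plain lists stand in for the NumPy arrays.
scores = [[0.4657818, 0.46344537, 0.45117822, 0.44971085, 0.37138897]]
ids = [[2, 0, 3, 4, 1]]
# Placeholder metadata with the same shape as the real list.
placeholder_metadata = [{"id": i, "text": f"chunk {i} text"} for i in range(6)]

hits = collect_hits(scores, ids, placeholder_metadata)
for hit in hits:
    print(f"{hit['score']:.4f}  chunk {hit['id']}: {hit['text']}")
```

In a RAG pipeline, the texts of the top hits are what would be passed to the LLM as context.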

In [40]:
# Test how L2 normalization works: faiss.normalize_L2 rescales each row to unit length, so its L2 norm becomes 1.
test = np.array([[5,6,7,9,10]], dtype = np.float32)
test


array([[ 5.,  6.,  7.,  9., 10.]], dtype=float32)

In [41]:
np.linalg.norm(test)  # L2 norm before normalization

np.float32(17.058722)

In [46]:
faiss.normalize_L2(test)  # in-place L2 normalization; on unit vectors, the inner product used by IndexFlatIP equals cosine similarity

In [47]:
np.linalg.norm(test)

np.float32(1.0)

Test done
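
The numbers in this test can be reproduced by hand: the L2 norm is the square root of the sum of squares, sqrt(25 + 36 + 49 + 81 + 100) = sqrt(291) ≈ 17.0587. The sketch below repeats that arithmetic in plain Python and also checks that the inner product of two normalized vectors matches the cosine similarity of the raw vectors. The second vector `other` is made up for the comparison.

```python
import math

def l2_norm(vec):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vec))

def normalize(vec):
    """Divide every component by the vector's L2 norm."""
    norm = l2_norm(vec)
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

values = [5.0, 6.0, 7.0, 9.0, 10.0]   # same numbers as the test vector
other = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up second vector

norm_before = l2_norm(values)              # sqrt(291)
norm_after = l2_norm(normalize(values))    # unit length after normalization

cosine_raw = dot(values, other) / (l2_norm(values) * l2_norm(other))
ip_normalized = dot(normalize(values), normalize(other))

print(norm_before, norm_after)
print(cosine_raw, ip_normalized)
```

This is why both the stored vectors and the query are normalized before being used with an inner-product index.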