# Introduction

If you're building with AI, especially if you've worked on recommendation systems or need to store embeddings generated by large models, you're likely familiar with the challenges of storing and searching high-dimensional vectors. By the end of this document, I hope to accomplish two things:

1. Provide you with a better understanding of how vector databases work.
2. Demonstrate a practical example of how to start using a vector database right now, using Python.


## Getting Started: Understanding the Basics

### Facebook AI Similarity Search (Faiss)
I’ll begin by talking about Faiss (Facebook AI Similarity Search), which is a library rather than a database. I'll show you how to use Faiss in your project and walk through three of its index types (Flat L2, IVF Flat, and IVF with product quantization) to help you understand the progress in this field. An excellent article on the Pinecone website titled "Introduction to Facebook AI Similarity Search" offers an in-depth explanation of these concepts.

### Pinecone Vector Database
The vector database I’ll demonstrate today is Pinecone, one of the first vector databases to emerge with the AI revolution and probably the most popular. Pinecone offers serverless indexes, so you don't need to provision or maintain a server. We’ll get into that shortly.

### Faiss Library Paper
I also recommend reading the Faiss library paper, written by the AI department at Facebook (FAIR). It covers all the technical aspects of Faiss, which will help deepen your understanding.


## Getting Into the Code
Let's dive into the code and go through it line by line to understand how it works.

### Loading the Dataset
First, I’ll start with a simple dataset, a CSV file containing sentences. These are just random sentences—like "A little girl is smiling and running outside"—and I have a thousand of them. Now, imagine these sentences as summaries of books. Suppose we work for a library, and we want to allow people to search our collection, not just by keyword but also by more obscure ideas contained within the book summaries. This is the essence of semantic search.

### Semantic Search Explained
Semantic search goes beyond simple keyword matching. For example, if someone searches for books about soccer, I don’t want to return only those that explicitly mention the word "soccer." Instead, I want to include books that discuss related concepts, even if they don’t use the exact term. This is similar to when you search your phone's photo library for pictures of the ocean—an algorithm identifies ocean images, even if you didn't tag them as such.
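To make the limitation of keyword matching concrete, here is a small sketch in plain Python. The `keyword_search` helper is hypothetical (it is not part of the notebook), and the sentences are taken from the dataset used later in this document.

```python
# Sentences taken from the dataset used in this document.
sentences = [
    "A man is punching a soccer ball",
    "The crowd is watching a football game",
    "A black bird is sitting on a dead tree",
]


def keyword_search(query_word, docs):
    """Return the docs that contain query_word as a whole word (case-insensitive)."""
    word = query_word.lower()
    return [doc for doc in docs if word in doc.lower().split()]


hits = keyword_search("soccer", sentences)
print(hits)  # ['A man is punching a soccer ball']
```

The football sentence is clearly related to soccer, but a keyword match never finds it. Semantic search is meant to close that gap.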

---

To perform this kind of search, I’ll convert the sentences into embeddings using the OpenAI Embeddings API. An embedding is essentially a vector of numbers that represents coordinates in a multi-dimensional space, much like latitude and longitude represent coordinates on a map. The `text-embedding-3-small` model used here returns vectors with 1,536 dimensions. Mind-blowing, right? Similar concepts will be positioned close together in this space, while unrelated concepts will be farther apart.
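As a rough illustration of "close together" versus "far apart", the sketch below uses three made-up 3-dimensional vectors. The numbers are invented for this example only; real embeddings come from the model and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_embeddings = {
    "soccer": [0.9, 0.1, 0.0],
    "football": [0.8, 0.2, 0.1],
    "bird": [0.0, 0.1, 0.9],
}


def distance(a, b):
    """Euclidean distance between the toy embeddings of two words."""
    return math.dist(toy_embeddings[a], toy_embeddings[b])


print(distance("soccer", "football"))  # small: related concepts
print(distance("soccer", "bird"))      # large: unrelated concepts
```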

### Loading the Sentences
The first step in the code is loading the sentences from the CSV file into memory using the Pandas library. After loading, I display the sentences to ensure everything loaded correctly.

### Generating the Embeddings
The code also includes a function to generate embeddings for the sentences using OpenAI’s API. If the embedded sentences CSV file already exists, the code simply loads it. The branch that would otherwise generate the embeddings is commented out, because I’ve already generated them for you: you don’t need to run this yourself, which saves time and avoids unnecessary API costs.
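The load-or-generate pattern described above can be sketched without any API access. Everything here is an assumption made for illustration: `fake_embedding` stands in for the real API call, `load_or_create_embeddings` is a hypothetical helper, and the cache is a JSON file in a temporary directory rather than the notebook's CSV.

```python
import json
import os
import tempfile


def fake_embedding(sentence):
    # Hypothetical stand-in for a real (and billable) embedding API call.
    return [float(len(sentence)), float(sentence.count(" "))]


def load_or_create_embeddings(sentences, cache_path, embed=fake_embedding):
    """Load embeddings from cache_path if it exists; otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    embeddings = [embed(s) for s in sentences]
    with open(cache_path, "w") as f:
        json.dump(embeddings, f)
    return embeddings


cache_path = os.path.join(tempfile.mkdtemp(), "embedded_sentences.json")
sample = ["A black bird", "An elderly man"]
first = load_or_create_embeddings(sample, cache_path)   # computes and writes the cache
second = load_or_create_embeddings(sample, cache_path)  # reads the cache, no embed calls
```

The second call never touches `embed`, which is the point of caching: the expensive step runs once.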


In [1]:
import pandas as pd

dataset = pd.read_csv("./data/sentences.csv")
dataset.head()

Unnamed: 0,sentence
0,A little girl is smiling and running outside
1,A man is drawing on a digital dry erase board
2,A black bird is sitting on a dead tree
3,An elderly man is sitting on a bench
4,A man and a woman are sitting comfortably on t...


In [2]:
from dotenv import find_dotenv, load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY from a .env file before creating the client,
# because OpenAI() reads the key from the environment when it is constructed.
_ = load_dotenv(find_dotenv())

client = OpenAI()

def get_embedding(sentence):
    return (
        client.embeddings.create(input=sentence, model="text-embedding-3-small")
        .data[0]
        .embedding
    )

In [6]:
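import ast
import os

import numpy as np

if os.path.exists("./data/embedded_sentences.csv"):
    dataset = pd.read_csv("./data/embedded_sentences.csv")
    # The CSV stores each embedding as a string such as "[0.04, 0.01, ...]".
    # ast.literal_eval parses it without executing arbitrary code, unlike eval.
    dataset["embedding"] = dataset.embedding.apply(ast.literal_eval).apply(np.array)
# else:
#     dataset["embedding"] = dataset["sentence"].apply(get_embedding)
#     dataset.to_csv("./data/embedded_sentences.csv", index=False)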

In [7]:
dataset["id"] = range(1, len(dataset) + 1)
dataset.head()

Unnamed: 0,sentence,embedding,id
0,A little girl is smiling and running outside,"[0.0436425618827343, 0.01375775970518589, 0.00...",1
1,A man is drawing on a digital dry erase board,"[-0.008048108778893948, 0.030766354873776436, ...",2
2,A black bird is sitting on a dead tree,"[0.027433251962065697, 1.8205369087809231e-06,...",3
3,An elderly man is sitting on a bench,"[-0.004122881218791008, -0.056238383054733276,...",4
4,A man and a woman are sitting comfortably on t...,"[0.021146269515156746, -0.032280709594488144, ...",5


In [8]:
embedding_dimension = len(dataset.iloc[0]["embedding"])
embedding_dimension

1536

## Vector Store Preparation
Up to this point, all we’ve done is preparation. The real magic happens next—storing and searching these embeddings.

### Vector Store Creation
The vector store will allow us to quickly retrieve embeddings that are similar to a user’s query. For example, if a user searches for "I love soccer," we want to find all related sentences in our dataset.

### Cosine Similarity
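To compare vectors, we use cosine similarity. The closer two vectors point in the same direction, the higher their cosine similarity, indicating that they represent similar concepts. The Faiss examples below use Euclidean (L2) distance instead; for unit-length vectors, which is how OpenAI describes its embeddings, both measures rank neighbors in the same order. However, there’s a challenge: if we have thousands of books or millions of embeddings, comparing the user’s query with every embedding in the database would be incredibly slow.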
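Here is a minimal sketch of cosine similarity in plain Python. It also checks numerically that, for unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine`. The helper names and the two sample vectors are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

cos = cosine_similarity(a, b)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors the two measures are tied together:
print(math.isclose(squared_l2, 2 - 2 * cos))  # True
```

Because of this relationship, a higher cosine similarity always means a smaller L2 distance when the vectors are normalized.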

## The Power of Vector Databases

## Faiss

Check [Faiss Indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes) for more information.


### Flat L2 Index
Initially, we’ll use a simple approach: the Flat L2 index. This method stores all vectors and compares them one by one against the query vector using Euclidean distance (L2). While straightforward, this method doesn’t scale well with large datasets.
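Conceptually, a flat index is a linear scan. The sketch below is a plain-Python approximation of that idea, not the Faiss implementation; `flat_l2_search` is a hypothetical helper and the vectors are toy data.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distance, index) pairs for the k vectors closest to query."""
    scored = []
    for i, v in enumerate(vectors):
        # One full distance computation per stored vector: O(n * d) work per query.
        d = sum((x - y) ** 2 for x, y in zip(v, query))
        scored.append((d, i))
    scored.sort()
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
results = flat_l2_search(vectors, [1.0, 1.0], k=2)
print([i for _, i in results])  # [1, 2]
```

Every query touches every stored vector, which is why the cost grows linearly with the size of the collection.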

### Implementing the Flat L2 Index
I’ll create the Flat L2 index with 1,536 dimensions, add all the embeddings to it, and then search for the top four documents related to the query "I love soccer." The results will show sentences closely related to soccer, demonstrating the effectiveness of this approach.


In [9]:
# Faiss works with float32 matrices; older versions reject float64 input.
embeddings = np.array(dataset.embedding.tolist()).astype("float32")

query = "I love soccer"
xq = get_embedding(query)

In [10]:
xq

[0.005098666530102491,
 -0.010370757430791855,
 -0.012289983220398426,
 -0.011544259265065193,
 -0.021943921223282814,
 -0.02225608378648758,
 0.0421304777264595,
 0.048235002905130386,
 -0.035124145448207855,
 -0.03172503784298897,
 -0.01620936580002308,
 -0.006110306829214096,
 -0.0686296671628952,
 -0.012382475659251213,
 0.01810546964406967,
 0.01703023910522461,
 -0.05600440129637718,
 -0.004878995940089226,
 0.043402254581451416,
 -0.015307561494410038,
 0.07454921305179596,
 0.03332053869962692,
 -0.01686837710440159,
 0.0052431863732635975,
 0.008734790608286858,
 0.006480277981609106,
 0.029597701504826546,
 -0.01758519746363163,
 -0.03799142315983772,
 0.010353414341807365,
 -0.0016200694954022765,
 -0.018070783466100693,
 -0.027794091030955315,
 -0.03394486382603645,
 0.051102280616760254,
 -0.02031373418867588,
 -0.038245778530836105,
 0.047448813915252686,
 0.02659168466925621,
 -0.023770654574036598,
 0.063912533223629,
 -0.0017963838763535023,
 0.02192079834640026,
 ...]

In [11]:
len(xq)

1536

### IndexFlatL2 - Exact Search for L2


In [12]:
import faiss

index_l2 = faiss.IndexFlatL2(embedding_dimension)
index_l2.is_trained

True

In [13]:
index_l2.add(embeddings)
index_l2.ntotal

1000

In [14]:
_, document_indices = index_l2.search(np.expand_dims(xq, axis=0), k=4)
dataset.iloc[document_indices[0]]

Unnamed: 0,sentence,embedding,id
684,A man is punching a soccer ball,"[-0.01688985712826252, 0.029744451865553856, 0...",685
950,A soccer player is sitting on the field and is...,"[-0.011246275156736374, 0.009713653475046158, ...",951
352,An opponent is tackling a soccer player,"[0.0006955061689950526, -0.02708514593541622, ...",353
25,A group of men is playing soccer on the beach,"[0.004342387896031141, 0.04711844399571419, 0....",26


### IndexIVFFlat - Inverted file with exact post-verification



### IVF Flat Index: A Better Solution
The IVF (Inverted File) Flat index improves search efficiency by clustering related embeddings. This method reduces the number of comparisons needed, making it faster and more scalable.
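A back-of-the-envelope estimate shows where the saving comes from. This sketch assumes the clusters are all the same size, which real data rarely guarantees. `expected_comparisons` is a hypothetical helper, and `nprobe` is the number of clusters scanned per query, the same setting used in the Faiss code later in this section.

```python
def expected_comparisons(n_vectors, n_cells, nprobe):
    """Centroid comparisons plus vector comparisons, assuming equally sized cells."""
    return n_cells + nprobe * n_vectors / n_cells


flat = 1000                                 # a flat index compares against every vector
ivf_1 = expected_comparisons(1000, 20, 1)   # 20 centroids + 50 vectors
ivf_5 = expected_comparisons(1000, 20, 5)   # 20 centroids + 250 vectors
print(flat, ivf_1, ivf_5)  # 1000 70.0 270.0
```

With only 1,000 vectors the absolute saving is small, but the same ratio applied to millions of vectors is what makes the approach scale.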

### Voronoi Diagrams
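To understand the IVF Flat index, think of Voronoi diagrams. These diagrams divide a space into cells, with each cell containing the points closer to its centroid than to any other centroid. Similarly, the IVF Flat index clusters embeddings so that at search time we only compare the query with embeddings in the nearest clusters, significantly speeding up the search. By default Faiss visits a single cluster (`nprobe = 1`), which is fast but can miss close vectors that sit just across a cell boundary. Raising `nprobe` searches more clusters and trades speed for recall.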

<img src='images/ivf.png' width="1000">
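The cell-based search can be sketched in plain Python. This is a simplification under stated assumptions: the centroids are picked by hand (Faiss learns them with k-means during `train()`), the data is a handful of 2-D points, and the function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_inverted_lists(vectors, centroids):
    """Assign each vector index to the cell of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(vectors, centroids, lists, query, k, nprobe=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in by_distance[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


# Hand-picked centroids (an assumption for this sketch).
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [4.0, 0.0], [6.0, 0.0], [9.0, 0.0]]
lists = build_inverted_lists(vectors, centroids)

query = [4.9, 0.0]
one_cell = ivf_search(vectors, centroids, lists, query, k=2, nprobe=1)
two_cells = ivf_search(vectors, centroids, lists, query, k=2, nprobe=2)
print(one_cell)   # [1, 0]
print(two_cells)  # [1, 2]
```

The query sits near the boundary between the two cells. With one cell probed, the true second-nearest vector (index 2) is missed because it lives in the neighboring cell; probing both cells recovers it.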


In [15]:
ncentroids = 20
quantizer = faiss.IndexFlatL2(embedding_dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, embedding_dimension, ncentroids)
index_ivf.is_trained

False

In [16]:
index_ivf.train(embeddings)
index_ivf.is_trained

True

In [17]:
index_ivf.add(embeddings)
index_ivf.ntotal

1000

In [18]:
_, document_indices = index_ivf.search(np.expand_dims(xq, axis=0), k=4)
dataset.iloc[document_indices[0]]

Unnamed: 0,sentence,embedding,id
352,An opponent is tackling a soccer player,"[0.0006955061689950526, -0.02708514593541622, ...",353
979,The crowd is watching a football game,"[-0.01140331570059061, 0.015461748465895653, -...",980
551,A football player is running past an official ...,"[0.03300335630774498, 0.017781982198357582, -0...",552
172,Two men are playing table football,"[-0.03752468526363373, 0.05197532847523689, -0...",173


In [19]:
index_ivf.nprobe = 5
_, document_indices = index_ivf.search(np.expand_dims(xq, axis=0), k=4)
dataset.iloc[document_indices[0]]

Unnamed: 0,sentence,embedding,id
684,A man is punching a soccer ball,"[-0.01688985712826252, 0.029744451865553856, 0...",685
950,A soccer player is sitting on the field and is...,"[-0.011246275156736374, 0.009713653475046158, ...",951
352,An opponent is tackling a soccer player,"[0.0006955061689950526, -0.02708514593541622, ...",353
137,A group of boys is playing soccer on the seashore,"[0.023316802456974983, 0.0333450511097908, 0.0...",138


### IndexIVFPQ - IVF + Product Quantizer (PQ)

The product quantizer is a technique often used in large-scale similarity search and nearest neighbor search, particularly in the context of vector databases like Pinecone. It helps reduce the memory footprint of high-dimensional vectors while still allowing for efficient approximate nearest neighbor (ANN) searches.

**How Product Quantization Works:**
1. Vector Splitting:
    * The original high-dimensional vector is split into smaller sub-vectors. For example, if you have a 128-dimensional vector, it might be split into four 32-dimensional sub-vectors.
2. Sub-Quantization:
    * Each sub-vector is quantized independently using a technique such as k-means clustering. The idea is to represent each sub-vector by the nearest centroid from a precomputed codebook (set of representative vectors).
3. Codebook Lookup:
    * Instead of storing the full high-dimensional vectors, only the indices of the nearest centroids from the codebooks are stored for each sub-vector. This reduces the storage requirements significantly.
4. Approximate Search:
    * During a query, the query vector is split into sub-vectors in the same way, and its distance to each database vector is estimated from the stored codebook indices rather than from the full vectors. By default Faiss leaves the query unquantized and compares its sub-vectors against the codebook centroids. This allows for fast similarity searches with reduced memory usage.
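The four steps above can be sketched in plain Python. The codebooks here are written by hand purely for illustration (in practice they come from k-means training), all function names are hypothetical, and the distance estimate uses the asymmetric variant, where the query itself is left unquantized.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vector, m):
    """Step 1: split a vector into m equal-length sub-vectors."""
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]


def encode(vector, codebooks):
    """Steps 2 and 3: replace each sub-vector with the index of its nearest centroid."""
    subs = split(vector, len(codebooks))
    return [
        min(range(len(book)), key=lambda j: sq_dist(sub, book[j]))
        for sub, book in zip(subs, codebooks)
    ]


def approx_sq_dist(query, code, codebooks):
    """Step 4: estimate the distance from the stored code instead of the full vector."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, book[j]) for sub, book, j in zip(subs, codebooks, code))


# Hypothetical codebooks: 2 sub-spaces with 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 1.1, 0.1, 0.8]
code = encode(vector, codebooks)
print(code)  # [1, 0]

query = [1.0, 1.0, 0.0, 1.0]
estimate = approx_sq_dist(query, code, codebooks)
exact = sq_dist(query, vector)
print(estimate, exact)  # the estimate is 0.0, the exact value is about 0.07
```

The four floats of `vector` are stored as two small integers. The gap between `estimate` and `exact` is the quantization error.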

**Advantages:**
* **Memory Efficiency**: By quantizing vectors into a smaller number of bits, the memory footprint of the dataset is significantly reduced.
* **Speed**: The search becomes faster as it operates on quantized representations rather than full vectors.
* **Scalability**: It's particularly useful for very large datasets where storing and processing full vectors would be impractical.

**Limitations:**
* **Approximation**: The search is approximate, meaning it may not always return the exact nearest neighbors but rather those that are close enough.
* **Quantization Error**: The process introduces some error due to the approximation, which might affect the precision of the search results.
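To put a number on the memory saving, the sketch below compares the per-vector storage of a raw embedding with that of its PQ code, using this notebook's settings (1,536 dimensions, 8 sub-vectors, 4 bits each). It assumes vectors are stored as 4-byte floats and ignores the codebooks and the inverted-list bookkeeping; the helper names are hypothetical.

```python
def raw_bytes(dimension, bytes_per_float=4):
    """Storage for one uncompressed vector of 4-byte floats."""
    return dimension * bytes_per_float


def pq_bytes(n_subvectors, bits_per_code):
    """Storage for one PQ code: one centroid index per sub-vector."""
    return n_subvectors * bits_per_code / 8


raw = raw_bytes(1536)        # 6144 bytes
compressed = pq_bytes(8, 4)  # 4.0 bytes
print(raw, compressed, raw / compressed)  # 6144 4.0 1536.0
```

Compression this aggressive leaves only 16 centroids per sub-vector, so some loss of precision in the results is to be expected.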

<img src='images/ivf-pq.png' width="1000">

In [20]:
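# Number of sub-vectors each embedding is split into (M in the Faiss docs).
# It must divide the embedding dimension evenly (1536 / 8 = 192).
code_size = 8
# Bits used to store each sub-vector's centroid index (2**4 = 16 centroids per codebook).
bits_per_centroid = 4

# Reuses the coarse quantizer created for index_ivf above.
index_ivf_pq = faiss.IndexIVFPQ(
    quantizer, embedding_dimension, ncentroids, code_size, bits_per_centroid
)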
index_ivf_pq.is_trained

False

In [21]:
index_ivf_pq.train(embeddings)
index_ivf_pq.add(embeddings)
index_ivf_pq.ntotal

1000

In [22]:
index_ivf_pq.nprobe = 5
_, document_indices = index_ivf_pq.search(np.expand_dims(xq, axis=0), k=4)
dataset.iloc[document_indices[0]]

Unnamed: 0,sentence,embedding,id
352,An opponent is tackling a soccer player,"[0.0006955061689950526, -0.02708514593541622, ...",353
551,A football player is running past an official ...,"[0.03300335630774498, 0.017781982198357582, -0...",552
979,The crowd is watching a football game,"[-0.01140331570059061, 0.015461748465895653, -...",980
469,A football player in a red and white uniform i...,"[-0.01379761379212141, -0.04831472039222717, 0...",470


## Pinecone


In [27]:
import os

from pinecone import Pinecone

load_dotenv()

# Retrieve the Pinecone API key
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# Initialize the Pinecone client
database = Pinecone(api_key=PINECONE_API_KEY)

In [28]:
from pinecone import ServerlessSpec

serverless_spec = ServerlessSpec(cloud="aws", region="us-east-1")

In [29]:
import time

INDEX_NAME = "underfitted-random-sentences"

if INDEX_NAME not in database.list_indexes().names():
    database.create_index(
        name=INDEX_NAME,
        dimension=embedding_dimension,
        metric="cosine",
        spec=serverless_spec,
    )

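    # Wait until the new index reports that it is ready to accept requests.
    while not database.describe_index(INDEX_NAME).status["ready"]:
        time.sleep(1)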

pinecone_index = database.Index(INDEX_NAME)

In [30]:
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [31]:
def iterator(dataset, size):
    for i in range(0, len(dataset), size):
        yield dataset.iloc[i : i + size]


def vector(batch):
    # Build (id, values, metadata) tuples in the format expected by upsert.
    vectors = []
    for row in batch.to_dict("records"):
        vectors.append((str(row["id"]), row["embedding"], {"sentence": row["sentence"]}))

    return vectors

In [32]:
if pinecone_index.describe_index_stats()["total_vector_count"] == 0:
    for batch in iterator(dataset, 100):
        pinecone_index.upsert(vector(batch))

In [33]:
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}

In [34]:
response = pinecone_index.query(vector=xq, top_k=4, include_metadata=True)
for match in response["matches"]:
    print(match["metadata"]["sentence"])

A man is punching a soccer ball
A soccer player is sitting on the field and is drinking water
An opponent is tackling a soccer player
A group of men is playing soccer on the beach


In [35]:
query2 = "I like animals that eat too much"
xq2 = get_embedding(query2)
response = pinecone_index.query(vector=xq2, top_k=5, include_metadata=True)
for match in response["matches"]:
    print(match["metadata"]["sentence"])

The animal with big eyes is voraciously eating
Some kittens are hungry
Someone is cleaning an animal
A lemur is eating quickly
A cat is eating corn on the cob


In [36]:
database.delete_index(INDEX_NAME)