So far, we have worked primarily with relational databases. We have also explored MongoDB and APIs. These systems excel at filtering and querying structured data based on explicit attributes, such as filtering by a specific field value. However, they struggle when it comes to understanding the semantic meaning behind data — for example, finding items that are similar in meaning rather than exact matches.

Today, we will expand our toolkit with something similar yet quite distinct — vector databases.

## Background

Relational databases, MongoDB, and similar systems are excellent at filtering data by specific attributes or exact matches, which makes them well-suited for structured queries. For example, if we want to find books with a particular word in the title, we can easily do so with MongoDB:

```json
{"original_title": {"$regex": "world", "$options": "i"}}
```

However, these databases are limited when we want to perform semantic search — that is, searching based on the meaning or similarity of content rather than exact keywords. For instance, finding titles that are similar in meaning or recommending books with related themes is challenging with traditional databases.
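To make the limitation concrete, here is a minimal sketch of the same case-insensitive keyword match in plain Python. The titles are hypothetical stand-ins rather than records from a real collection, and `re` is used here only to mimic what the `$regex` filter does.

```python
import re

# Hypothetical titles, used only for illustration
titles = [
    "Brave New World",
    "The War of the Worlds",
    "Journey to the Center of the Earth",
    "A Short History of Nearly Everything",
]

# Same idea as {"$regex": "world", "$options": "i"}: case-insensitive substring match
pattern = re.compile("world", re.IGNORECASE)
matches = [title for title in titles if pattern.search(title)]

print(matches)
```

Only the titles that literally contain "world" are returned. A title about the Earth is arguably related in meaning, but a keyword match has no way of knowing that.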

Vector databases address this problem by representing data as vectors in a high-dimensional space. They enable semantic search by performing nearest neighbor searches to find items that are most similar in meaning.

> If terms like "vector space" sound intimidating, don’t worry — the concept is quite intuitive, as we will see.

## Embeddings

If you have some basic understanding of machine learning, you know that algorithms like SVMs or neural networks require numerical input. For images, this might be pixel values. For text, however, we need to convert words or sentences into numerical representations that capture their meaning.

This is where embeddings come in. Embeddings are numerical vectors that represent text in a way that preserves semantic relationships. A good embedding algorithm ensures that texts with similar meanings have similar vector representations.

> Embeddings are not limited to text — they can be applied to images, audio, and other data types as well.

For example, consider these embeddings:

- `"king"` → [0.12, -0.43, ...]
- `"queen"` → [0.10, -0.40, ...]
- `"apple"` → [0.87, 0.11, ...]

Notice that the vectors for `"king"` and `"queen"` are close together, reflecting their semantic similarity, while `"king"` and `"apple"` are far apart.
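As a rough check of that claim, the sketch below treats the two components shown for each word as if they were complete 2-D vectors. That is an assumption made purely for illustration, since real embeddings have hundreds of dimensions. It compares the vectors with cosine similarity, which is covered in the Similarity Search section.

```python
import numpy as np

# Toy 2-D vectors built from the two components listed above (illustrative only)
toy = {
    "king": np.array([0.12, -0.43]),
    "queen": np.array([0.10, -0.40]),
    "apple": np.array([0.87, 0.11]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(toy["king"], toy["queen"])
sim_king_apple = cosine_similarity(toy["king"], toy["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

The first score should come out very close to 1 and the second much lower, which is what "close together" and "far apart" mean in practice.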

### Calculating Embeddings

There are many models available for generating embeddings, broadly classified into:

- Classical NLP models
- Deep learning models
- Transformers

Transformers, themselves a deep learning architecture, are currently the most popular choice due to their efficiency, accuracy, and ease of use, thanks to numerous libraries.

Popular providers offering transformer-based embedding models include:

- HuggingFace
- OpenAI
- AWS Bedrock

#### HuggingFace

In 2017, Google Research published a landmark paper titled [Attention Is All You Need](https://arxiv.org/abs/1706.03762), introducing the Transformer architecture. This architecture revolutionized NLP by effectively handling long sequences of text.

Following that, models like GPT (2018), BERT (2018), and GPT-2 (2019) pushed the boundaries of what Transformers could achieve. OpenAI's GPT-3 (2020) further advanced the field by generating highly natural text (and won the best paper award at NeurIPS 2020).

Since then, Transformers have become the _de facto_ standard for NLP tasks. HuggingFace, founded in 2016, is a great example of a company arriving at the perfect time with a strong team. It provides a rich ecosystem of pre-trained models and tools through its Python `transformers` library.

To install the library, run:

In [1]:
!pip install transformers sentence-transformers



We can generate embeddings using HuggingFace's `sentence-transformers`. All we need to do is specify the model name and pass in a list of sentences.


In [7]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Call me Ishmael", "Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."]
embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding[:5]}...")

Sentence: Call me Ishmael
Embedding: [-0.0425336   0.03945521  0.05039315  0.00251705 -0.04552436]...
Sentence: Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Embedding: [0.07079883 0.05998563 0.05050527 0.06264909 0.08035159]...


If you check the length of an embedding above, it will be 384. This is the dimensionality of the embeddings produced by this model.


In [8]:
print(len(embeddings[0]))

384


#### OpenAI

OpenAI provides an API to generate embeddings using their advanced models. You can use the `openai` Python package to interact with the API.

First, install the OpenAI package:

In [21]:
!pip install openai





OpenAI offers embedding models with much higher dimensionality, which can capture finer distinctions in meaning, such as:

- `text-embedding-3-small`
- `text-embedding-3-large`
- `text-embedding-ada-002` (an older model)


In [17]:
from openai import OpenAI

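# Assumes the key is set in the OPENAI_API_KEY environment variable,
# which the client reads by default; avoid hard-coding keys in notebooks
client = OpenAI()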

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=sentences
)

embeddings = [item.embedding for item in response.data]

for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding length: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")
    print("---")

Sentence: Call me Ishmael
Embedding length: 1536
First 5 values: [0.007819030433893204, 0.015133155509829521, -0.0009440690628252923, 0.032959140837192535, -0.043506067246198654]
---
Sentence: Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Embedding length: 1536
First 5 values: [0.0020605484023690224, -0.004665764514356852, -0.012547033838927746, 0.07706713676452637, -0.012763588689267635]
---


As you can see, the embedding dimensionality (1536) is much higher than that of the free model we used earlier (384). With a larger model, you get an even higher dimensionality:

In [18]:
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=sentences
)

embeddings = [item.embedding for item in response.data]

for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding length: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")
    print("---")

Sentence: Call me Ishmael
Embedding length: 3072
First 5 values: [0.0021298318170011044, -0.016946136951446533, -0.0015679803909733891, 0.011603247374296188, -0.02826412208378315]
---
Sentence: Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Embedding length: 3072
First 5 values: [0.010106501169502735, 0.029184548184275627, -0.003808274632319808, -0.045645251870155334, 0.010237754322588444]
---


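OpenAI's models are behind a paywall, and as you can see, they produce much higher-dimensional embeddings than the HuggingFace model we used. Dimensionality alone does not prove better accuracy, but larger models often capture finer distinctions in meaning, which can justify their price. HuggingFace models will be fine for your basic tasks.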

## Similarity Search

Now that we understand embeddings and how to generate them, let's explore how to use these vectors for similarity search—that is, finding which stored embeddings are most similar to a given query embedding.

### Cosine Similarity

**Cosine similarity** measures the cosine of the angle between two vectors. This means it considers their direction, not their magnitude—so it's _scale-invariant_. If two vectors point in the same direction, their cosine similarity is 1; if they're orthogonal, it's 0; if they're opposite, it's -1. This is particularly useful for comparing embeddings, since the scale of the vectors often doesn't matter—just their orientation in space.

**Intuition:** Two sentences with similar meaning will have embeddings pointing in similar directions, even if their magnitudes differ.

**Code Example:**

In [6]:
import numpy as np

embedding_a = np.array([0.1, 0.2, 0.7, 0.5, 0.3])
embedding_b = np.array([0.2, 0.1, 0.6, 0.4, 0.4])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embedding_a, embedding_b)
print(f"Cosine similarity: {sim:.3f}")


Cosine similarity: 0.973
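The three reference values mentioned above (1, 0, and -1) and the scale invariance can be checked directly. This is a small sketch with made-up vectors; `cosine_similarity` is redefined so that the block runs on its own.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])

# Same direction, different magnitude: scaling does not change the result
same_direction = cosine_similarity(v, 10 * v)

# Perpendicular vectors
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite direction
opposite = cosine_similarity(v, -v)

print(f"same direction: {same_direction:.3f}")
print(f"orthogonal:     {orthogonal:.3f}")
print(f"opposite:       {opposite:.3f}")
```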


### L2 Distance

**$L_2$ distance** (also called _Euclidean distance_) measures the straight-line distance between two points in space: $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$. For embeddings, this is the length of the difference between the two vectors, so it takes both direction and magnitude into account. Smaller $L_2$ distances mean the vectors are closer together.

**Intuition:** If two embeddings are "close together" in the vector space, their $L_2$ (Euclidean) distance will be small, indicating high similarity.

**Code Example:**

In [20]:
import numpy as np

embedding_a = np.array([0.1, 0.2, 0.7, 0.5, 0.3])
embedding_b = np.array([0.2, 0.1, 0.6, 0.4, 0.4])

def l2_distance(a, b):
    return np.linalg.norm(a - b)

dist = l2_distance(embedding_a, embedding_b)
print(f"L2 (Euclidean) distance: {dist:.3f}")

L2 (Euclidean) distance: 0.224
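The two measures are closely related. The sketch below, which reuses the same toy vectors, illustrates two points under the assumption that we are free to rescale the vectors: $L_2$ distance changes when one vector is scaled, and for unit-length vectors the squared $L_2$ distance equals $2 - 2\cos\theta$, so both measures rank neighbours in the same order.

```python
import numpy as np

a = np.array([0.1, 0.2, 0.7, 0.5, 0.3])
b = np.array([0.2, 0.1, 0.6, 0.4, 0.4])

# L2 distance is sensitive to magnitude: doubling b moves it away from a
dist_original = np.linalg.norm(a - b)
dist_scaled = np.linalg.norm(a - 2 * b)

# Normalize both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = float(a_unit @ b_unit)                       # cosine similarity
dist_unit_sq = float(np.sum((a_unit - b_unit) ** 2))  # squared L2 distance

print(f"L2 original: {dist_original:.3f}, L2 after scaling b: {dist_scaled:.3f}")
print(f"squared L2 on unit vectors: {dist_unit_sq:.4f}")
print(f"2 - 2 * cosine:             {2 - 2 * cos:.4f}")
```

This is why many systems normalize embeddings once at insert time: after that, a plain dot product is enough to compare vectors.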


Now let's put this into perspective by taking some random vectors and performing a cosine similarity search.


In [8]:
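import numpy as np

np.random.seed(37)
db = np.random.rand(1000000, 1024).astype('float32')
query = np.random.rand(1024).astype('float32')

# Normalize each row separately: without axis=1, np.linalg.norm returns
# a single norm for the whole matrix, which would distort every score
db_norms = np.linalg.norm(db, axis=1)
cos_sims = (db @ query) / (db_norms * np.linalg.norm(query))
best_idx = np.argmax(cos_sims)
print(f"Brute-force best match index: {best_idx}, similarity: {cos_sims[best_idx]:.3f}")


Because every component here is drawn uniformly from [0, 1), any two of these vectors have a cosine similarity of roughly 0.75, so the best match will score only somewhat higher than that.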


Great, it worked. But look at the execution time for the cell above (~15 sec): that is far too slow for search. If a user-facing transaction depended on this lookup, a 15-second delay would be intolerable. And this is only 1 million vectors with 1024 dimensions. With 1536- or 3072-dimensional vectors, it would take proportionally longer.

The reason is clear: brute-force search has to compare the query against every stored vector, which means $O(N)$ comparisons, each costing time proportional to the vector's dimensionality. Datasets of millions or billions of vectors aren't uncommon, so we need a better solution.
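Before turning to approximate methods, it helps to see what a tidy brute-force baseline looks like. The sketch below is one possible way to write it, not a prescribed implementation: it normalizes the database once, so each query needs only a matrix-vector product, and it uses `np.argpartition` to pick the top `k` results without sorting all scores. The sizes are deliberately small, and `top_k` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(37)
db = rng.random((10_000, 64), dtype=np.float32)   # small toy database
query = rng.random(64, dtype=np.float32)

# Normalize once, up front; afterwards a dot product equals cosine similarity
db_unit = db / np.linalg.norm(db, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

def top_k(query_unit, db_unit, k=5):
    """Return indices and scores of the k most similar rows, best first."""
    sims = db_unit @ query_unit                     # still O(N) work per query
    candidates = np.argpartition(-sims, k - 1)[:k]  # k best, in arbitrary order
    order = np.argsort(-sims[candidates])           # sort only those k
    return candidates[order], sims[candidates][order]

indices, scores = top_k(query_unit, db_unit, k=5)
for idx, score in zip(indices, scores):
    print(f"index {idx}: similarity {score:.3f}")
```

This trims constant factors but does not change the linear scan: every query still touches every vector.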

## Approximate Nearest Neighbours (ANN)

Vector databases use approximation algorithms, many of which rely on greedy search to reach the nearest neighbours quickly, trading a small amount of accuracy for speed. For example:

- HNSW
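To give a feel for what "greedy search" means here, the sketch below walks a simple nearest-neighbour graph toward a query. It is a deliberately simplified, single-layer illustration and not HNSW itself; the sizes, the brute-force graph construction, and the `greedy_search` helper are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, m = 2000, 32, 8   # hypothetical sizes: points, dimensions, links per node

# Unit-normalize so that a dot product equals cosine similarity
points = rng.standard_normal((n, dim)).astype("float32")
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Build a graph linking each point to its m most similar points.
# Brute force is acceptable here only because the dataset is tiny.
pairwise = points @ points.T
np.fill_diagonal(pairwise, -np.inf)          # a node is not its own neighbour
graph = np.argsort(-pairwise, axis=1)[:, :m]  # shape (n, m)

def greedy_search(query, entry=0):
    """Hop to the neighbour most similar to the query until none is better."""
    current = entry
    current_sim = float(points[current] @ query)
    comparisons = 1
    while True:
        neighbour_ids = graph[current]
        neighbour_sims = points[neighbour_ids] @ query
        comparisons += len(neighbour_ids)
        best = int(np.argmax(neighbour_sims))
        if neighbour_sims[best] <= current_sim:
            # Local optimum: no linked neighbour is closer to the query
            return current, current_sim, comparisons
        current = int(neighbour_ids[best])
        current_sim = float(neighbour_sims[best])

query = rng.standard_normal(dim).astype("float32")
query /= np.linalg.norm(query)

approx_idx, approx_sim, comparisons = greedy_search(query)
exact_sims = points @ query
exact_idx = int(np.argmax(exact_sims))

print(f"greedy: index {approx_idx}, similarity {approx_sim:.3f}, comparisons {comparisons}")
print(f"exact:  index {exact_idx}, similarity {exact_sims[exact_idx]:.3f}, comparisons {n}")
```

The greedy walk only compares the query against the neighbours of the nodes it visits, so it can stop at a local optimum that is not the true nearest neighbour. That is the "approximate" part of the trade-off.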

### HNSW

(To be continued)