# Introduction to Vector Search

Vector search might sound scary and complicated, but it's actually really easy! And then it gets really hard and complicated!

We have computed *embeddings* of audio snippets. Now, given some example, we want to find similar audio snippets in our dataset. To do this, we'll use vector search over the embeddings.

We'll start by doing *brute force nearest-neighbor search*. This is conceptually very easy: Just compare the query embedding to all of the target embeddings, and return the closest one(s).
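
As a minimal sketch of the idea, assuming random toy vectors in place of real audio embeddings (the function name `brute_force_search` and all the data here are made up for illustration):

```python
import numpy as np

def brute_force_search(query, targets, top_k=1):
    """Return (indices, distances) of the top_k targets closest to the query."""
    # Squared Euclidean distance from the query to every target row.
    diffs = targets - query[np.newaxis, :]
    sq_dists = np.sum(diffs ** 2, axis=1)
    # Sort ascending by distance and keep the top_k smallest.
    order = np.argsort(sq_dists)[:top_k]
    return order, np.sqrt(sq_dists[order])

# Toy data: 1000 random "embeddings" of dimension 16.
rng = np.random.default_rng(0)
targets = rng.normal(size=(1000, 16)).astype(np.float32)

# Build a query that is a slightly perturbed copy of row 42.
query = targets[42] + 0.01 * rng.normal(size=16).astype(np.float32)

indices, distances = brute_force_search(query, targets, top_k=3)
print(indices, distances)
```

Since the query is a noisy copy of row 42, that row should come back first. Sorting all the distances is the simplest thing that works; for very large arrays, `np.argpartition` is one way to avoid the full sort.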

We'll also find that we need some good data management along the way: it's not enough to find the best embedding, we also need to know what audio it is associated with.
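
One simple way to keep track, sketched here under made-up assumptions (the filenames, the 5-second window, and the helper `source_of` are all hypothetical), is to keep a list of metadata records in the same order as the embedding rows:

```python
import numpy as np

# Hypothetical dataset: 3 windows from one file, 2 from another.
window_seconds = 5.0
files = {"rec_001.wav": 3, "rec_002.wav": 2}

# One metadata record per embedding row, in the same order as the rows.
metadata = [
    (filename, i * window_seconds)
    for filename, num_windows in files.items()
    for i in range(num_windows)
]

# Placeholder embeddings; real ones would come from the embedding model.
embeddings = np.zeros((len(metadata), 8), dtype=np.float32)

def source_of(row):
    """Return the (filename, offset_seconds) that produced embeddings[row]."""
    return metadata[row]

print(source_of(4))
```

With a filename and an offset in hand, loading the matching audio is a matter of reading that file and slicing out the window.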

In [None]:
# ------------------------------------------------------------
# 📝 Exercise 0 – Create a query.
# ------------------------------------------------------------
# This should mostly use code you've already written for
# previous exercises.
# 1) Create a variable for a file-path to some audio.
# 2) Load the audio.
# 3) Run the audio through an embedding model (Perch?) and get
#    the audio embedding. This is your query embedding.
# ------------------------------------------------------------


# ------------------------------------------------------------
# 📝 Exercise 1 – Create an embeddings dataset.
# ------------------------------------------------------------
# This also should rely heavily on code you've already 
# written for previous exercises... but will go a bit further.
# 1) Get all the embeddings for your dataset (e.g., your
#    favorite anuraset site, or all of the anuraset data).
# 2) Write the embeddings to disk, somehow. Check that you
#    can reload the embeddings from disk, and that the 
#    reloaded embeddings match the originals.
# 3) Given some choice of embedding from your dataset,
#    write a function which gets the audio that the embedding
#    came from.
# ------------------------------------------------------------


In [None]:
# ------------------------------------------------------------
# 📝 Exercise 2 – Nearest-neighbor search.
# ------------------------------------------------------------
# Now we should have a *query embedding* and a numpy array 
# of *target embeddings.*
# 1) Write a function which compares the query embedding to
#    each of the target embeddings, and finds the one that is
#    closest in Euclidean distance. Return the embedding and
#    the associated audio.
# 2) Update your function to take a `top_k` parameter. Then
#    it should return a numpy array of embeddings (with shape
#    [top_k, embedding_dim]) and an array of audio (with shape
#    [top_k, 5*sample_rate]).
# 3) Update your function to have some options for how to
#    find the nearest neighbor - try out cosine similarity and
#    maximum inner product.
# ------------------------------------------------------------


# ------------------------------------------------------------
# 📝 Exercise 3 – Displaying results.
# ------------------------------------------------------------
# Now that you have some top_k results, display them!
# Draw a spectrogram and audio player for each result.
# It's also helpful to write the filename and offset within
# the file for each result, to make it possible to go
# back and see the result in context.
#
# Extensions:
# E1) Provide a button for 'relevant' / 'irrelevant'.
# E2) Skip results which have already been marked.
# ------------------------------------------------------------


## Vector Databases in a Nutshell

Brute force search is fantastic (and fast!) until you get to millions 
or billions of embeddings.

(Micro-exercise: A 32-bit floating point number takes 4 bytes. 
How many embeddings can fit in RAM in the machine you're working on?
How many hours of audio does that number of embeddings correspond to?
How long would it take to run a brute-force search on that many embeddings?)
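
Here is a sketch of the arithmetic for the first two questions. Every number in it is an assumption to swap for your own: 16 GiB of RAM, 1280-dimensional float32 embeddings, and one embedding per 5 seconds of audio.

```python
ram_bytes = 16 * 1024**3   # assumed: 16 GiB of RAM
embedding_dim = 1280       # assumed: embedding size of your model
bytes_per_float = 4        # 32-bit float
window_seconds = 5         # assumed: one embedding per 5 s of audio

bytes_per_embedding = embedding_dim * bytes_per_float
num_embeddings = ram_bytes // bytes_per_embedding
hours_of_audio = num_embeddings * window_seconds / 3600

print(f"{bytes_per_embedding} bytes per embedding")
print(f"{num_embeddings:,} embeddings fit in RAM")
print(f"about {hours_of_audio:,.0f} hours of audio")
```

This ignores everything else that needs RAM (the operating system, your notebook, the model), so treat the result as an upper bound.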

When the number of embeddings becomes enormous, vector databases become
sort of helpful!

The idea of a vector database is to find *approximate* nearest neighbors
quickly. This is done by *indexing* the data. There is an enormous literature
on ways to do this well, but a good cartoon-version of what works well is 
*hierarchical k-means*. You cluster the data into k clusters, then cluster
all the data assigned to each cluster centroid, and so on. Then to find 
nearest neighbors, you find the nearest top-centroid, then the nearest centroid
at the second level, and so on.
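
To make the cartoon concrete, here is a toy two-level version in plain numpy. It is a sketch of the idea only, not how usearch or any real vector database is implemented; the helper names (`kmeans`, `cluster`, `build_index`, `search`) and the random data are made up for illustration.

```python
import numpy as np

def kmeans(data, k, n_iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (empty clusters stay put).
        for j in range(k):
            members = data[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def cluster(data, ids, k):
    """Cluster data[ids]; return (centroids, member-id arrays), dropping empty clusters."""
    k = min(k, len(ids))
    centroids, assign = kmeans(data[ids], k)
    keep = [j for j in range(k) if np.any(assign == j)]
    return centroids[keep], [ids[assign == j] for j in keep]

def build_index(data, k=8):
    """Two levels: top centroids, then sub-centroids and leaf ids under each one."""
    top_centroids, top_members = cluster(data, np.arange(len(data)), k)
    children = [cluster(data, ids, k) for ids in top_members]
    return top_centroids, children

def nearest(query, points):
    """Index of the row of `points` closest to `query` (Euclidean)."""
    return int(((points - query) ** 2).sum(axis=1).argmin())

def search(query, data, index):
    """Walk down the tree, then brute-force search inside the chosen leaf only."""
    top_centroids, children = index
    sub_centroids, leaves = children[nearest(query, top_centroids)]
    candidates = leaves[nearest(query, sub_centroids)]
    return int(candidates[nearest(query, data[candidates])])

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 8)).astype(np.float32)
index = build_index(data)

query = data[123] + 0.001 * rng.normal(size=8).astype(np.float32)
approx = search(query, data, index)
exact = nearest(query, data)
print("approximate:", approx, "exact:", exact)
```

The search only compares the query against a handful of centroids plus one leaf, instead of all 2000 points. The price is that the answer is approximate: it can differ from the exact one when the true neighbor sits in a different leaf.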

(More info than you need: this hierarchical k-means procedure has trouble when
searching for something near the cluster boundary. Various tricks can be introduced
to deal with this. There is also a family of *graph-based* indices, which
are really cool mathematically and work pretty well, but are deep in the weeds.)

In my opinion, the best vector database in 2025 is called **usearch**. It's great because:

* It is self-contained, with almost no dependencies: It has one job, and it does it well.
* It can use an *on-disk* index! Most of the popular vector databases only work with embeddings in RAM, in order to give results as fast as possible. Using an on-disk index lets you scale to a much larger number of vectors for far less money.

Let's try it out!

In [None]:
from usearch import index as uindex
import numpy as np

# For "free", here's an example of how to use usearch to index 
# and search some vectors.

# Create the index, specifying the size of the vectors, 
# data type (dtype), and the metric to use.
ui = uindex.Index(ndim=512, metric="L2sq", dtype='f16')

# Make some random data.
n = 100_000
keys = np.arange(n)
vectors = np.random.rand(n, 512).astype(np.float32)
print('\ntime to create the index: ')
%time ui.add(keys, vectors)

print(len(ui.keys))

# Make a random query.
query = np.random.rand(512).astype(np.float32)

print('\ntime to run exact search: ')
%time exact_top_k = ui.search(query, count=5, exact=True)

print('\ntime to run approximate search: ')
%time approx_top_k = ui.search(query, count=5, exact=False)

# TODO: Write some code measuring the proportion of exact_top_k
# that are in approx_top_k. This is a *recall* metric.

# Save the index to disk.
ui.save("/tmp/index.bin")

# Use the on-disk index.
ui2 = uindex.Index()
ui2.view("/tmp/index.bin")
# Check that the loaded index has the same keys as the original.
assert np.all(np.array(ui2.keys) == np.array(ui.keys))

print('\ntime to run approximate search from disk: ')
%time approx_disk_top_k = ui2.search(query, count=5, exact=False)


CPU times: user 2min 9s, sys: 427 ms, total: 2min 9s
Wall time: 17.9 s
100000

time to run exact search: 
CPU times: user 18.1 ms, sys: 1 μs, total: 18.1 ms
Wall time: 18 ms

time to run approximate search: 
CPU times: user 1.02 ms, sys: 0 ns, total: 1.02 ms
Wall time: 1.02 ms

time to run approximate search from disk: 
CPU times: user 3.46 ms, sys: 3.99 ms, total: 7.44 ms
Wall time: 7.44 ms
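
For the recall TODO in the cell above, one possible shape for the measurement is sketched below. The function `recall_at_k` is a made-up helper that works on plain arrays of keys; I believe usearch search results expose the matched keys as a `.keys` attribute, but check that against your installed version before relying on it.

```python
import numpy as np

def recall_at_k(exact_keys, approx_keys):
    """Fraction of the exact top-k keys that also appear in the approximate top-k."""
    exact = set(np.asarray(exact_keys).tolist())
    approx = set(np.asarray(approx_keys).tolist())
    return len(exact & approx) / len(exact)

# Toy example with made-up keys: 3 of the 5 exact results were found.
example_recall = recall_at_k([3, 7, 9, 12, 20], [3, 9, 7, 21, 40])
print(example_recall)
```

Recall from a single query is noisy, so averaging over many queries gives a more trustworthy number.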


In [None]:
# ------------------------------------------------------------
# 📝 Exercise 4 – Embeddings usearch.
# ------------------------------------------------------------
# Do all the same stuff again, but this time with embeddings
# from the Perch model.
#
# 1) Create a usearch index for the embeddings, and insert
#    the embeddings into it.
# 2) Save the index to disk, and load it again.
# 3) Create a query embedding, and search the index for the
#    nearest neighbors.
# 4) Display the results, with a spectrogram of the query
#    and the nearest neighbors.
# 5) Measure the recall of the approximate search.
# ------------------------------------------------------------

