## Credit

Notes are taken from the NLPlanet course "Practical NLP with Python", section 2.10, "Semantic Search on Big Data":
* https://www.nlplanet.org/course-practical-nlp/02-practical-nlp-first-tasks/10-semantic-search-big-data

Authored by Fabio Chiusano
* https://medium.com/@chiusanofabio94

**All quotes '' are sourced from the NLPlanet course.**

## Semantic Search Recap

<u>Recap:</u>
* All the documents are embedded using the embedding model.
* The query is embedded using the same model, producing the query embedding.
* A similarity between the query embedding and the embedding of each document is computed.
* The document with the highest similarity is returned as the best result.
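
The four steps above can be sketched with NumPy alone. The tiny hand-written vectors below stand in for real embeddings (a real pipeline would get them from an embedding model, which is not shown here), and the `cosine_search` name is hypothetical, chosen only for illustration.

```python
import numpy as np

def cosine_search(doc_embeddings, query_embedding):
    """Return (best_index, similarities) using cosine similarity."""
    # Normalize to unit length so the dot product equals cosine similarity
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    similarities = docs @ query
    # The document with the highest similarity is the best result
    return int(np.argmax(similarities)), similarities

# Toy 3-dimensional "embeddings" (made-up values, not produced by a real model)
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
], dtype="float32")
query_embedding = np.array([0.9, 0.1, 0.0], dtype="float32")

best_index, similarities = cosine_search(doc_embeddings, query_embedding)
print(best_index, similarities)
```

Comparing the query against every document like this is exact but scales linearly with the number of documents, which is the cost the following sections try to reduce.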

## Speeding up Semantic Search using Faiss

In [None]:
# Install faiss library
!pip install faiss-cpu
# also a GPU version available: faiss-gpu

### Generate Vectors

In [1]:
# 500k vectors of 512 dimensions

import numpy as np
# used for manipulating data
from sklearn.preprocessing import normalize
# normalize function used for normalizing arrays or vectors

np.random.seed(1234)
# sets seed for NumPy random number generator

num_dimensions = 512
number_of_vectors = 10**5 * 5

vectors = np.random.random((number_of_vectors, num_dimensions)).astype('float32')
# np.random.random(shape) draws uniform random floats from the half-open interval [0.0, 1.0)
# (number_of_vectors, num_dimensions) specifies the shape of the generated array
# .astype('float32') converts the generated array to float32
    # a data type representing 32-bit floating-point numbers (the dtype Faiss expects)
vectors = normalize(vectors)
# scales each row (vector) to unit L2 norm (sklearn defaults: norm='l2', axis=1)
print(vectors[:10])

[[0.01417776 0.04605334 0.03240402 ... 0.00286546 0.05125616 0.0352571 ]
 [0.01299159 0.00484864 0.0024733  ... 0.04634576 0.01395291 0.03983717]
 [0.05602823 0.07297722 0.04174635 ... 0.04760421 0.00648946 0.0560016 ]
 ...
 [0.04004947 0.02678304 0.02726158 ... 0.07530405 0.07019431 0.06849267]
 [0.07222328 0.07369468 0.0300264  ... 0.04832328 0.00220381 0.06468705]
 [0.06522474 0.02248333 0.04374786 ... 0.03468747 0.04629404 0.03798534]]
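
As a rough check of what the normalization step does, the sketch below reproduces L2 row normalization with plain NumPy on a small random array. The array size and the `demo` and `unit_rows` names are illustrative assumptions, not part of the original notebook. After scaling, every row should have a Euclidean length of about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
demo = rng.random((5, 8)).astype("float32")  # 5 small vectors, 8 dimensions

# Divide each row by its own L2 norm (Euclidean length)
row_norms = np.linalg.norm(demo, axis=1, keepdims=True)
unit_rows = demo / row_norms

print(np.linalg.norm(unit_rows, axis=1))  # each value is approximately 1.0
```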


### Brute-force Search (Slow)

In [15]:
import faiss
# Facebook AI Similarity Search (Faiss)
# Efficient similarity search and clustering of dense vectors

# Create Index
index = faiss.IndexFlatL2(num_dimensions)
# Creates an IndexFlatL2 for vectors with the specified number of dimensions
# A flat index stores every vector as-is, with no hierarchy or partitioning,
    # so each search compares the query against all stored vectors
# L2 is the Euclidean distance; the nearest neighbors are the vectors
    # with the smallest distance to the query (Faiss reports squared L2 distances)
index.add(vectors)

# Unused below; shows how to read a stored vector back from the index:
retrieved_vector = index.reconstruct(0)
# .reconstruct retrieves a single data point from the index based on the provided position/index (0)

In [16]:
# Creating a random query vector to find its 4 nearest neighbors

# Create Query Vector
query_vector = np.random.random((1, num_dimensions)).astype('float32')
# creates a single vector with 512 dimensions
query_vector = normalize(query_vector)
# rescales the vector to unit length (L2 norm of 1); its direction stays the same
# L2 norm is a measure of the magnitude (Euclidean length) of a vector:
    # |v| = sqrt((c1)^2 + (c2)^2 + ... + (cn)^2)
    # where c1 ... cn are the components of the vector
# L1 norm is the sum of the absolute values of all components in a vector

# Nearest Neighbor Search
num_neighbors = 4
distances, indices = index.search(query_vector, num_neighbors)
# .search() performs a nearest neighbor search given:
    # the query vector(s) (query_vector)
    # the number of neighbors to find (4)
# Finds the 1st, 2nd, 3rd, and 4th closest vectors to the query_vector
# Returns:
    # the (squared L2) distances between each returned vector and the query_vector
    # the position of each returned vector inside the IndexFlatL2 object

print(distances)
print(indices)

[[0.3934843  0.3936579  0.39374584 0.39577028]]
[[421983 173520 400525 455621]]
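
To see what a flat L2 index computes conceptually, the following NumPy sketch runs an exhaustive search over a small random dataset and returns squared Euclidean distances, which is the quantity Faiss reports for `IndexFlatL2`. The dataset size and the `brute_force_l2` helper are assumptions made for illustration; this is not Faiss's implementation. For unit-length vectors the squared distance equals `2 - 2 * cosine_similarity`, so ranking by smallest L2 distance matches ranking by highest cosine similarity.

```python
import numpy as np

def brute_force_l2(data, query, k):
    """Exhaustive k-nearest-neighbor search using squared L2 distance."""
    diffs = data - query                      # broadcast the query against every row
    sq_dists = np.sum(diffs ** 2, axis=1)     # squared Euclidean distance per row
    nearest = np.argsort(sq_dists)[:k]        # positions of the k smallest distances
    return sq_dists[nearest], nearest

rng = np.random.default_rng(1234)
data = rng.random((1000, 16)).astype("float32")
data /= np.linalg.norm(data, axis=1, keepdims=True)   # unit-length rows
query = rng.random(16).astype("float32")
query /= np.linalg.norm(query)

sq_dists, nearest = brute_force_l2(data, query, k=4)
cosine = data[nearest] @ query

print(sq_dists, nearest)
print(2 - 2 * cosine)  # approximately equal to sq_dists for unit vectors
```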


### Search with Space-partitioning Index (Fast)

In [20]:
import faiss
# Facebook AI Similarity Search (Faiss)
# Efficient similarity search and clustering of dense vectors

# create index
n_cells = 500
quantizer = faiss.IndexFlatL2(num_dimensions)
# Creates a flat index object, used below as the coarse quantizer
index = faiss.IndexIVFFlat(quantizer, num_dimensions, n_cells)
# Inverted file (IVF) index with flat (uncompressed) storage
    # partitions the vector space into n_cells cells (clusters) of vectors
    # each cell contains the vectors closest to that cell's centroid
# The quantizer index serves as a coarse quantizer
    # it finds the cell(s) nearest to a vector; a search then scans only the vectors in those cells
    # the number of cells scanned per query is set by index.nprobe (default 1)
# num_dimensions must match the dimensionality of the vectors being indexed

index.train(vectors)
# trains the index on the vectors dataset
# learns the cell centroids (k-means clustering) that define the partition
index.add(vectors)
# adds the vectors to the index structure
# each vector is assigned to the cell of its nearest centroid, ready for nearest neighbor searches

In [22]:
# Explained under brute-force method

# Create query vector
query_vector = np.random.random((1, num_dimensions)).astype('float32')
query_vector = normalize(query_vector)

# Nearest Neighbor Search
num_neighbors = 4
distances, indexes = index.search(query_vector, num_neighbors)

print(distances)
print(indexes)

[[0.39594385 0.40824294 0.40859386 0.41268638]]
[[411930 188744 286482 319469]]
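
The idea behind the inverted-file index can be sketched in NumPy: assign every vector to its nearest centroid, then at query time scan only the vectors in the few cells whose centroids are closest to the query. This is a simplified illustration under stated assumptions, not how Faiss is implemented: the centroids are simply sampled from the data instead of being learned with k-means, and `sq_l2`, `build_cells`, `ivf_search`, and `n_probe` are hypothetical names (Faiss exposes the comparable setting as the index's `nprobe` attribute).

```python
import numpy as np

def sq_l2(a, b):
    """Squared L2 distances between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def build_cells(data, centroids):
    """Map each cell id to the positions of the vectors assigned to it."""
    assignment = np.argmin(sq_l2(data, centroids), axis=1)
    return {c: np.where(assignment == c)[0] for c in range(len(centroids))}

def ivf_search(data, centroids, cells, query, k, n_probe):
    """Search only the n_probe cells whose centroids are closest to the query."""
    cell_order = np.argsort(sq_l2(query[None, :], centroids)[0])[:n_probe]
    candidates = np.concatenate([cells[c] for c in cell_order])
    dists = sq_l2(query[None, :], data[candidates])[0]
    order = np.argsort(dists)[:k]
    return dists[order], candidates[order]

rng = np.random.default_rng(0)
data = rng.random((2000, 8)).astype("float32")
# Assumption: centroids are sampled from the data rather than trained
centroids = data[rng.choice(len(data), size=20, replace=False)]
cells = build_cells(data, centroids)
query = rng.random(8).astype("float32")

# Probe 3 of 20 cells (approximate) versus all 20 cells (exact)
approx_d, approx_i = ivf_search(data, centroids, cells, query, k=4, n_probe=3)
exact_d, exact_i = ivf_search(data, centroids, cells, query, k=4, n_probe=20)
print(approx_i, exact_i)
```

Because only some cells are scanned, the approximate result can miss true nearest neighbors; probing every cell recovers the brute-force answer at brute-force cost. The number of probed cells is therefore the knob that trades speed against accuracy.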
