<a href="https://colab.research.google.com/github/RDGopal/IB9CW0-Text-Analytics/blob/main/day_seven_vector_dbs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Databases and Embeddings Retrieval
In this notebook we will build a vector database and write queries that return similar documents. We will use Facebook's [Faiss](https://github.com/facebookresearch/faiss) as the store, along with a sentence-level [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert) for embeddings. Our target domain is academic papers, which we will extract from the popular preprint resource [ArXiv](https://arxiv.org/).

In [1]:
%%capture
# %%capture hides output messages; as a cell magic it must be the first line of the cell

!pip install accelerate -U
!pip install -U sentence-transformers
!pip install faiss-gpu
!pip install arxiv

import faiss
import arxiv
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

Now we need some data! We will be extracting 500 abstracts of academic papers that match the search term _"text analytics"_ from ArXiv. To do so we will create a simple client and connect to the API:

In [2]:
# Number of records
n_records = 500

# Construct the default API client.
client = arxiv.Client()

# Search for the n_records most recent articles matching the keyword "text analytics".
search = arxiv.Search(
  query = "text analytics", # search query
  max_results = n_records, # number of records to return - defined above
  sort_by = arxiv.SortCriterion.SubmittedDate # sort order (submission date)
)

results = client.results(search) # collect results

Now that we have some results, we can extract the bits we need and save them in a dataframe. Obviously we need the abstracts (documents), but we will also keep the ArXiv unique ID. However, the ArXiv ID is alphanumeric and Faiss wants purely numeric IDs, so we will construct our own ID system as an auto-incrementing integer.

In [3]:
ids = [] # empty list to store ArXiv IDs
abstracts = [] # empty list to store abstracts

for r in client.results(search): # iterate through the results
  ids.append(r.entry_id) # add the ArXiv ID to the list
  abstracts.append(r.summary) # add the abstract to the list

# create one numeric ID per abstract actually returned (0-499 if the API returns all n_records)
uid = np.arange(len(ids), dtype="int64") # Faiss expects 64-bit integer IDs

# combine the data together in a dictionary
df_data = {'uid': uid, 'aid': ids, 'abstract': abstracts}

# create a dataframe from this dictionary
df = pd.DataFrame(df_data)
df.head() # top 5 records

| | uid | aid | abstract |
|---|---|---|---|
| 0 | 0 | http://arxiv.org/abs/2404.19759v1 | This work introduces MotionLCM, extending cont... |
| 1 | 1 | http://arxiv.org/abs/2404.19758v1 | 3D scene generation has quickly become a chall... |
| 2 | 2 | http://arxiv.org/abs/2404.19757v1 | Utilizing a series of Bott indices formulated ... |
| 3 | 3 | http://arxiv.org/abs/2404.19753v1 | Vision-language datasets are vital for both te... |
| 4 | 4 | http://arxiv.org/abs/2404.19752v1 | Existing automatic captioning methods for visu... |
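
Because Faiss only hands back the numeric IDs, it helps to keep a lookup between those integers and the ArXiv IDs. The snippet below is a minimal sketch of one way to do this with plain dictionaries; `build_id_table` is a hypothetical helper (not part of the notebook or of any library), and the three sample IDs are copied from the table above.

```python
# Hypothetical helper: numbers the ArXiv IDs in list order, starting from 0,
# so the integer IDs always match however many results the API returned.
def build_id_table(arxiv_ids):
    """Return (uid -> ArXiv ID) and (ArXiv ID -> uid) lookup dictionaries."""
    uid_to_aid = dict(enumerate(arxiv_ids))
    aid_to_uid = {aid: uid for uid, aid in uid_to_aid.items()}
    return uid_to_aid, aid_to_uid


sample_ids = [
    "http://arxiv.org/abs/2404.19759v1",
    "http://arxiv.org/abs/2404.19758v1",
    "http://arxiv.org/abs/2404.19757v1",
]
uid_to_aid, aid_to_uid = build_id_table(sample_ids)

print(uid_to_aid[1])                                     # ArXiv ID for numeric ID 1
print(aid_to_uid["http://arxiv.org/abs/2404.19757v1"])   # numeric ID for an ArXiv ID
```

In the notebook itself the dataframe plays this role, since `uid` and `aid` sit side by side in each row.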


Now that we have our data, we will create embeddings of the abstracts (encoding) using a sentence-level [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert). DistilBERT is a smaller version of classic BERT, designed to deliver similar performance with 40% fewer parameters, which makes it faster.

In [4]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Convert abstracts to vectors
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)


Let's check the shape of the vectorised abstracts to ensure everything worked:

In [5]:
print(f'Shape of the vectorised abstract: {embeddings.shape}')

Shape of the vectorised abstract: (500, 768)
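
The `mean-tokens` part of the model name suggests that each abstract's single 768-dimensional vector is an average of its per-token vectors. The following is a rough NumPy sketch of that pooling idea only, under the assumption that padding tokens are excluded via an attention mask; `mean_pool` and the toy arrays are illustrative names and data, not the library's internals.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average token vectors per document, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype("float32")   # (batch, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # guard against divide-by-zero
    return summed / counts


# Toy input: 2 "abstracts", 4 token slots each, 768 dimensions per token.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768)).astype("float32")
mask = np.array([[1, 1, 1, 0],    # last slot is padding
                 [1, 1, 0, 0]])   # last two slots are padding

pooled = mean_pool(tokens, mask)
print(pooled.shape)   # one 768-dimensional vector per abstract
```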


As we expected, we have 500 records (matching our earlier query) by 768 dimensions (our embedding space). With this complete, we can populate our database. The data will be the custom IDs we created and the 768 dimensions of the embedding space.

In [6]:
# Step 1: Change data type to float32
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Create the index based on the column shape of the embeddings
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.uid.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 500
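
Faiss works with `float32` vectors and 64-bit integer IDs, so it can be worth validating both arrays before calling `add_with_ids`. The check below is a small sketch using NumPy only; `prepare_for_faiss` is a hypothetical helper name, and the uniqueness check is an extra assumption about what you want from your IDs rather than something Faiss enforces.

```python
import numpy as np


def prepare_for_faiss(vectors, ids):
    """Return (vectors, ids) as contiguous float32 / int64 arrays, or raise."""
    vectors = np.ascontiguousarray(vectors, dtype="float32")
    ids = np.ascontiguousarray(ids, dtype="int64")
    if vectors.ndim != 2:
        raise ValueError("vectors must be a 2D array of shape (n, dim)")
    if ids.shape != (vectors.shape[0],):
        raise ValueError("need exactly one ID per vector")
    if len(np.unique(ids)) != len(ids):
        raise ValueError("IDs must be unique")   # our own requirement
    return vectors, ids


# Toy data standing in for the real embeddings and df.uid.values
toy_vectors, toy_ids = prepare_for_faiss(np.ones((3, 4)), [0, 1, 2])
print(toy_vectors.dtype, toy_ids.dtype)
```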


Let's have a look at one of the abstracts we loaded:

In [7]:
df.iloc[9, 2]

'Recent popular decoder-only text-to-speech models are known for their ability\nof generating natural-sounding speech. However, such models sometimes suffer\nfrom word skipping and repeating due to the lack of explicit monotonic\nalignment constraints. In this paper, we notice from the attention maps that\nsome particular attention heads of the decoder-only model indicate the\nalignments between speech and text. We call the attention maps of those heads\nAlignment-Emerged Attention Maps (AEAMs). Based on this discovery, we propose a\nnovel inference method without altering the training process, named\nAttention-Constrained Inference (ACI), to facilitate monotonic synthesis. It\nfirst identifies AEAMs using the Attention Sweeping algorithm and then applies\nconstraining masks on AEAMs. Our experimental results on decoder-only TTS model\nVALL-E show that the WER of synthesized speech is reduced by up to 20.5%\nrelatively with ACI while the naturalness and speaker similarity are\ncomparab

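Using this abstract as a query, let's search our database for similar papers. We'll use [$L2$ distance](https://en.wikipedia.org/wiki/Euclidean_distance) - also known as Euclidean distance - as our distance measure. Note that `IndexFlatL2` reports *squared* $L2$ distances, so take the square root if you need the Euclidean distance itself: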

In [8]:
# Retrieve the 10 nearest neighbours
l2_distance, similar = index.search(np.array([embeddings[9]]), k=10)
l2_distance = l2_distance.flatten().tolist()
similar = similar.flatten().tolist()
print(f'L2 distances: {l2_distance}')
print(f'Top papers: {similar}')

L2 distances: [0.0, 74.3465347290039, 75.58238983154297, 78.76838684082031, 79.36618041992188, 81.44171905517578, 82.61811065673828, 84.16861724853516, 85.44877624511719, 87.46012878417969]
Top papers: [9, 158, 495, 18, 52, 97, 288, 474, 201, 172]
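
A flat index performs an exhaustive comparison of the query against every stored vector. To make that concrete, here is a minimal NumPy sketch of the same idea on random toy data; `brute_force_search` is a hypothetical function written for illustration, and it returns squared distances to mirror what the flat $L2$ index reports.

```python
import numpy as np


def brute_force_search(database, query, k):
    """Return the k smallest squared L2 distances and the matching row indices."""
    diffs = database - query                         # broadcast query over all rows
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)   # row-wise sum of squares
    order = np.argsort(sq_dists, kind="stable")[:k]  # nearest first
    return sq_dists[order], order


# Toy database: 50 random vectors of 8 dimensions (the notebook uses 500 x 768)
rng = np.random.default_rng(42)
toy_db = rng.normal(size=(50, 8)).astype("float32")

dists, ids = brute_force_search(toy_db, toy_db[9], k=3)
print(ids)              # row 9 comes first, because the query is row 9 itself
print(np.sqrt(dists))   # square root gives the Euclidean distances
```

This approach scales linearly with the number of stored vectors, which is fine for 500 abstracts.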


Obviously the closest match is the abstract itself (#9), with an $L2$ distance of 0. Let's have a look at the next two closest results - abstracts #158 and #495:

In [9]:
df.iloc[158, 2]

'Conventional text-to-speech (TTS) research has predominantly focused on\nenhancing the quality of synthesized speech for speakers in the training\ndataset. The challenge of synthesizing lifelike speech for unseen,\nout-of-dataset speakers, especially those with limited reference data, remains\na significant and unresolved problem. While zero-shot or few-shot\nspeaker-adaptive TTS approaches have been explored, they have many limitations.\nZero-shot approaches tend to suffer from insufficient generalization\nperformance to reproduce the voice of speakers with heavy accents. While\nfew-shot methods can reproduce highly varying accents, they bring a significant\nstorage burden and the risk of overfitting and catastrophic forgetting. In\naddition, prior approaches only provide either zero-shot or few-shot\nadaptation, constraining their utility across varied real-world scenarios with\ndifferent demands. Besides, most current evaluations of speaker-adaptive TTS\nare conducted only on datas

In [10]:
df.iloc[495, 2]

"Large language models (LLMs) can adapt to new tasks through in-context\nlearning (ICL) based on a few examples presented in dialogue history without\nany model parameter update. Despite such convenience, the performance of ICL\nheavily depends on the quality of the in-context examples presented, which\nmakes the in-context example selection approach a critical choice. This paper\nproposes a novel Bayesian in-Context example Selection method (ByCS) for ICL.\nExtending the inference probability conditioned on in-context examples based on\nBayes' theorem, ByCS focuses on the inverse inference conditioned on test\ninput. Following the assumption that accurate inverse inference probability\n(likelihood) will result in accurate inference probability (posterior),\nin-context examples are selected based on their inverse inference results.\nDiverse and extensive cross-tasking and cross-modality experiments are\nperformed with speech, text, and image examples. Experimental results show the\neff

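We can see that both abstracts relate to our reference abstract. The first clearly discusses text-to-speech (TTS), just like the reference; the second covers in-context learning for LLMs, with experiments on speech, text, and image examples, so it is a looser match.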

We can also visualise this similarity via a 3D scatter graph. However, we first need to reduce the embeddings from 768 down to 3 dimensions. For this we can use [$t$-SNE](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf), a dimension-reduction technique like PCA but better suited to non-linear relationships in data.

In [11]:
n_components = 3 # 3D visualisation

# create an empty list
viz_embeddings = []

# add the closest 10 embeddings to the list
for embedding in similar:
  viz_embeddings.append(embeddings[embedding])

# convert to a np array
viz_array = np.array(viz_embeddings)

# reduce the embeddings to n_components (3) dimensions using TSNE
tsne = TSNE(n_components=n_components, random_state=42, perplexity=5)
reduced_vectors = tsne.fit_transform(viz_array) # transform the data
reduced_vectors[0:5] # show the first 5

array([[ -42.20033  ,   34.388603 ,  -17.566608 ],
       [  12.53604  ,  -55.056408 ,    2.4248497],
       [   7.4069624, -136.09106  ,  110.76674  ],
       [ -10.024233 ,   90.393456 , -124.575615 ],
       [ -93.96338  ,  -96.549164 ,   -2.6139417]], dtype=float32)
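
For comparison with $t$-SNE, the linear alternative (PCA) can be sketched in a few lines of NumPy using a singular value decomposition. This is not what the notebook uses for its plot; `pca_reduce` is a hypothetical helper, and the random array merely stands in for the ten 768-dimensional embeddings.

```python
import numpy as np


def pca_reduce(vectors, n_components=3):
    """Project vectors onto their first n_components principal directions."""
    centred = vectors - vectors.mean(axis=0)                  # centre each dimension
    _, _, vt = np.linalg.svd(centred, full_matrices=False)    # rows of vt = directions
    return centred @ vt[:n_components].T


# Toy stand-in for the 10 nearest-neighbour embeddings
rng = np.random.default_rng(42)
toy = rng.normal(size=(10, 768)).astype("float32")

reduced = pca_reduce(toy)
print(reduced.shape)   # 10 points, 3 coordinates each
```

Unlike $t$-SNE, this projection is deterministic and linear, so it cannot unfold curved structure in the data.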

In [12]:
# Code adapted from: Afzal(2024)
# https://medium.com/@sarmadafzalj/visualize-vector-embeddings-in-a-rag-system-89d0c44a3be4

import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.io as pio

# Create a 3D scatter plot
scatter_plot = go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color='grey', opacity=0.5, line=dict(color='lightgray', width=1)),
    text=[f"Point {i}" for i in range(len(reduced_vectors))],
    name="Abstracts"
)

# Highlight the first point with a different colour (red)
highlighted_point = go.Scatter3d(
    x=[reduced_vectors[0, 0]],
    y=[reduced_vectors[0, 1]],
    z=[reduced_vectors[0, 2]],
    mode='markers',
    marker=dict(size=8, color='red', opacity=0.8, line=dict(color='lightgray', width=1)),
    text=["Question"],
    name="Query abstract"

)

# Highlight the closest two points with a different colour again (blue)
blue_points = go.Scatter3d(
    x=reduced_vectors[1:3, 0],
    y=reduced_vectors[1:3, 1],
    z=reduced_vectors[1:3, 2],
    mode='markers',
    marker=dict(size=8, color='blue', opacity=0.8,  line=dict(color='black', width=1)),
    text=["Top 1 Document","Top 2 Document"],
    name="Closest abstracts"
)

# Create the layout for the plot
layout = go.Layout(
    scene=dict(
        xaxis=dict(title='X'),
        yaxis=dict(title='Y'),
        zaxis=dict(title='Z'),
    ),
    title='3D Representation after t-SNE (Perplexity=5)'
)


fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'scatter3d'}]])

# Add the scatter plots to the Figure
fig.add_trace(scatter_plot)
fig.add_trace(highlighted_point)
fig.add_trace(blue_points)

fig.update_layout(layout)

pio.write_html(fig, 'interactive_plot.html')
fig.show()

As we can see in 3-dimensional space, our query abstract is relatively close to our two papers, and further from the remaining 7.

How about querying with new text, rather than an existing abstract in the database? The original text (abstract #9) discusses monotonic alignment in text-to-speech models, so let's get ChatGPT to create something similar:

In [13]:
rag_text = "Monotonic synthesis in text-to-speech (TTS) models refers to a specific \
  method of handling prosody—the rhythm, stress, and intonation in speech—which is \
  crucial for creating natural-sounding spoken language. While the term itself might \
  not be commonly used in a standalone manner, it relates closely to methods that \
  ensure a direct, one-to-one mapping between input text units and output speech \
  units without complex alignments or patterns. Here's a breakdown of how monotonic \
  behavior is relevant in TTS systems, particularly in prosody generation and \
  alignment techniques"

Now we can embed our text, as before, and use it to search the database:

In [14]:
# Convert RAG text to vectors
rag_embedding = model.encode(rag_text, show_progress_bar=True)

In [15]:
# Retrieve the 10 nearest neighbours
l2_distance_two, similar_two = index.search(np.array([rag_embedding]), k=10)
similar_two = similar_two.flatten().tolist()
print(f'Original search: {similar}')
print(f'New search: {similar_two}')
common_elements = set(similar) # change similar to a set
# print the documents in common (the intersection of the sets of the two lists)
print(f'Common documents: {common_elements.intersection(set(similar_two))}')

Original search: [9, 158, 495, 18, 52, 97, 288, 474, 201, 172]
New search: [9, 172, 158, 126, 192, 52, 470, 18, 116, 113]
Common documents: {9, 172, 18, 52, 158}


As you can see, we get similar results: five documents appear in both top-ten lists, and four of the original search's top five (#9, #158, #18 and #52) reappear in the new results. Great work!
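
How much two ranked result lists agree depends on the cut-off used. The sketch below computes the overlap at a chosen `k` for the two lists printed above; `overlap_at_k` is a hypothetical helper written for this illustration.

```python
def overlap_at_k(first, second, k):
    """Return the sorted IDs that appear in the top k of both ranked lists."""
    return sorted(set(first[:k]) & set(second[:k]))


# The two result lists printed by the searches above
original = [9, 158, 495, 18, 52, 97, 288, 474, 201, 172]
new = [9, 172, 158, 126, 192, 52, 470, 18, 116, 113]

top5_shared = overlap_at_k(original, new, k=5)
top10_shared = overlap_at_k(original, new, k=10)

print(f"Shared in top 5:  {top5_shared}")    # [9, 158]
print(f"Shared in top 10: {top10_shared}")   # [9, 18, 52, 158, 172]
```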

Let's also, as a bit of foreshadowing, try finding documents to support a more standard Q&A prompt:

In [16]:
# Q&A prompt
qna_prompt = "what is sentiment analysis?"

# Convert Q&A prompt to vectors
rag_embedding = model.encode(qna_prompt, show_progress_bar=True)

Now we can find the single best abstract related to this query:

In [17]:
# Retrieve the top nearest neighbour
l2_distance_three, similar_three = index.search(np.array([rag_embedding]), k=1)
similar_three = similar_three.flatten().tolist()

# Print the result
print(f'Top result: {df.iloc[similar_three[0], 2]}')

Top result: This paper explores the importance of text sentiment analysis and
classification in the field of natural language processing, and proposes a new
approach to sentiment analysis and classification based on the bidirectional
gated recurrent units (GRUs) model. The study firstly analyses the word cloud
model of the text with six sentiment labels, and then carries out data
preprocessing, including the steps of removing special symbols, punctuation
marks, numbers, stop words and non-alphabetic parts. Subsequently, the data set
is divided into training set and test set, and through model training and
testing, it is found that the accuracy of the validation set is increased from
85% to 93% with training, which is an increase of 8%; at the same time, the
loss value of the validation set decreases from 0.7 to 0.1 and tends to be
stable, and the model is gradually close to the actual value, which can
effectively classify the text emotions. The confusion matrix shows that the
accuracy 

A seemingly relevant result!
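
To hint at where this is heading, retrieved abstracts can be stitched into the context of a Q&A prompt. The snippet below is only a sketch of that assembly step: `build_prompt`, its `max_chars` parameter, and the prompt wording are all assumptions made for illustration, and the example document is a shortened version of the abstract returned above.

```python
def build_prompt(question, documents, max_chars=500):
    """Combine retrieved documents and a question into one prompt string."""
    # Number each document and trim it so the prompt stays a manageable size
    context = "\n\n".join(
        f"[{i + 1}] {doc[:max_chars]}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    "This paper explores the importance of text sentiment analysis and "
    "classification in the field of natural language processing",
]
prompt = build_prompt("what is sentiment analysis?", retrieved)
print(prompt)
```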

## Exercise
Complete the following tasks:

1. Create a new API query for arXiv and extract 500 abstracts.
2. Add the abstracts to your Faiss database.
3. Query the database to find new relevant documents.
