<a href="https://colab.research.google.com/github/RDGopal/IB9CW0-Text-Analytics/blob/main/day_seven_vector_dbs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Databases and Embeddings Retrieval
In this notebook we will build a vector database and write queries that return similar documents. We will use Facebook's [Faiss](https://github.com/facebookresearch/faiss) library as the vector store, along with sentence-level [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert) for embeddings. Our target domain will be academic papers, which we will extract from the popular preprint resource [ArXiv](https://arxiv.org/).

In [1]:
%%capture
# %%capture hides output messages; a cell magic must be the first line of the cell

!pip install accelerate -U
!pip install -U sentence-transformers
!pip install faiss-gpu
!pip install arxiv

import faiss
import arxiv
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

Now we need some data! We will be extracting 500 abstracts of academic papers that match the search term _"text analytics"_ from ArXiv. To do so we will create a simple client and connect to the API:

In [2]:
# Number of records
n_records = 500

# Construct the default API client.
client = arxiv.Client()

# Search for the n_records (500) most recent articles matching the keyword "text analytics".
search = arxiv.Search(
  query = "text analytics", # search query
  max_results = n_records, # number of records to return - defined above
  sort_by = arxiv.SortCriterion.SubmittedDate # sort order (submission date)
)

results = client.results(search) # collect results

Now that we have some results, we can extract the parts we need and save them in a dataframe. We obviously need the abstracts (our documents), and we will also keep the ArXiv unique ID. However, the ArXiv ID is alphanumeric and Faiss requires purely numeric IDs, so we will also construct our own ID as an auto-incrementing integer.

In [3]:
ids = [] # empty list to store ArXiv IDs
abstracts = [] # empty list to store abstracts

for r in client.results(search): # iterate through the results
  ids.append(r.entry_id) # add the ArXiv ID to the list
  abstracts.append(r.summary) # add the abstract to the list

# create numeric IDs 0..N-1, one per record actually returned
# (the API may return fewer than n_records; Faiss expects int64 IDs)
uid = np.arange(len(abstracts), dtype=np.int64)

# combine the data together in a dictionary
df_data = {'uid': uid, 'aid': ids, 'abstract': abstracts}

# create a dataframe from this dictionary
df = pd.DataFrame(df_data)
df.head() # top 5 records

|   | uid | aid | abstract |
|---|-----|-----|----------|
| 0 | 0 | http://arxiv.org/abs/2405.02284v1 | Backflow, or retropropagation, is a counterint... |
| 1 | 1 | http://arxiv.org/abs/2405.02282v1 | In this work, we investigate the modulational ... |
| 2 | 2 | http://arxiv.org/abs/2405.02278v1 | Photon loss rates set an effective upper limit... |
| 3 | 3 | http://arxiv.org/abs/2405.02272v1 | On a smooth manifold, we associate to any clos... |
| 4 | 4 | http://arxiv.org/abs/2405.02262v1 | Magnetic hopfions are localized magnetic solit... |
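
Because Faiss will only ever hand back our numeric IDs, it helps to be able to translate them back to ArXiv identifiers. The following is a minimal sketch of that lookup, not part of the original notebook: the helper name `uid_to_arxiv_id` and the toy lists `demo_uids` and `demo_aids` are assumptions, with values copied from the preview above.

```python
# Toy stand-ins for the `uid` and `aid` columns (values copied from the preview above)
demo_uids = [0, 1, 2]
demo_aids = [
    "http://arxiv.org/abs/2405.02284v1",
    "http://arxiv.org/abs/2405.02282v1",
    "http://arxiv.org/abs/2405.02278v1",
]

# Build a numeric-ID -> ArXiv-URL lookup table once
uid_lookup = dict(zip(demo_uids, demo_aids))


def uid_to_arxiv_id(lookup, uid):
    """Return the short ArXiv identifier (the last URL segment) for a numeric uid."""
    return lookup[uid].rsplit("/", 1)[-1]


print(uid_to_arxiv_id(uid_lookup, 1))
```

With the real dataframe, the same lookup could be built from `df.uid` and `df.aid`.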


Now that we have our data, we will create embeddings of the abstracts (encoding) using sentence-level [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert). DistilBERT is a smaller version of classic BERT, designed to deliver similar performance with 40% fewer parameters, which makes it faster to run.

In [4]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Convert abstracts to vectors
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)



Let's check the shape of the vectorised abstracts to ensure everything worked:

In [5]:
print(f'Shape of the vectorised abstract: {embeddings.shape}')

Shape of the vectorised abstract: (500, 768)


As expected, we have 500 records (matching our earlier query) by 768 dimensions (our embedding space). With this complete, we can populate our database. The data will be the custom IDs we created and the 768-dimensional embedding vectors.

In [6]:
# Step 1: Make sure the embeddings are a float32 array (the dtype Faiss expects)
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Create the index based on the column shape of the embeddings (# of columns)
# IndexFlatIP ranks by inner product; this equals cosine similarity only for L2-normalised vectors
index = faiss.IndexFlatIP(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap - allows us to map vectors to the ID
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.uid.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 500
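
A note on the similarity measure: an inner-product index only returns true cosine similarities when every stored vector (and every query vector) has unit length. The sketch below uses plain NumPy and random toy data rather than the notebook's embeddings, so `l2_normalise` and `toy` are illustrative assumptions. Faiss also provides `faiss.normalize_L2`, which normalises a float32 array in place and could be applied before `add_with_ids` and to each query.

```python
import numpy as np


def l2_normalise(vectors):
    """Return a float32 copy of `vectors` with every row scaled to unit L2 length."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows


rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 8)).astype("float32")  # hypothetical stand-in for `embeddings`
unit = l2_normalise(toy)

# Inner product of the unit vectors ...
inner = unit @ unit.T
# ... compared with cosine similarity computed directly from the raw vectors
raw_norms = np.linalg.norm(toy, axis=1)
cosine = (toy @ toy.T) / np.outer(raw_norms, raw_norms)

print(np.allclose(inner, cosine, atol=1e-5))
```

The rest of this notebook keeps the unnormalised vectors, so the scores it reports are raw inner products.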


Let's have a look at one of the abstracts we loaded:

In [34]:
df.iloc[54, 2]

'Large language models (LLMs) increasingly serve as the backbone for\nclassifying text associated with distinct domains and simultaneously several\nlabels (classes). When encountering domain shifts, e.g., classifier of movie\nreviews from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label\nclassifier is challenging due to incomplete label sets at the target domain and\ndaunting training overhead. The existing domain adaptation methods address\neither image multi-label classifiers or text binary classifiers. In this paper,\nwe design DALLMi, Domain Adaptation Large Language Model interpolator, a\nfirst-of-its-kind semi-supervised domain adaptation method for text data models\nbased on LLMs, specifically BERT. The core of DALLMi is the novel variation\nloss and MixUp regularization, which jointly leverage the limited positively\nlabeled and large quantity of unlabeled text and, importantly, their\ninterpolation from the BERT word embeddings. DALLMi also introduces a\nlabel-b

Using this abstract as a query, let's search our database for similar papers. The index ranks by inner product, so a higher score means a closer match. Because the embeddings were not normalised, the scores are not bounded by 1 and are not true cosine similarities:

In [36]:
# Retrieve the 10 nearest neighbours (the query must be a 2D array)
scores, similar = index.search(np.array([embeddings[54]]), k=10)
scores = scores.flatten().tolist()
similar = similar.flatten().tolist()
print(f'Inner product scores: {scores}')
print(f'Top papers: {similar}')

Inner product scores: [159.4054412841797, 123.85952758789062, 119.92066192626953, 118.6053237915039, 118.54507446289062, 118.35850524902344, 118.09319305419922, 118.04651641845703, 117.84342193603516, 116.38523864746094]
Top papers: [54, 142, 90, 240, 68, 316, 403, 287, 417, 338]
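
To make the search step less of a black box, here is a rough NumPy sketch of what an exhaustive ("flat") inner-product search computes: score every stored vector against the query and keep the `k` largest. The function `brute_force_search` and the toy vectors are assumptions for illustration; this is not Faiss's actual implementation.

```python
import numpy as np


def brute_force_search(query, vectors, k):
    """Return (scores, row_ids) for the k rows of `vectors` with the largest inner product."""
    scores = vectors @ query       # one inner product per stored vector
    top = np.argsort(-scores)[:k]  # indices of the k highest scores, best first
    return scores[top], top


# Four toy 2D vectors; row 1 points the same way as row 0 but is twice as long
toy_vectors = np.array(
    [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype="float32"
)

top_scores, top_ids = brute_force_search(toy_vectors[0], toy_vectors, k=3)
print(top_ids.tolist())
```

In this toy example the longer vector in row 1 outranks the query's own row, which illustrates that unnormalised inner product rewards vector length as well as direction.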


As expected, the closest match is the query abstract itself (#54). Let's have a look at the next two closest results:

In [37]:
df.iloc[similar[1], 2]

'Large Language Models (LLMs) have enabled new ways to satisfy information\nneeds. Although great strides have been made in applying them to settings like\ndocument ranking and short-form text generation, they still struggle to compose\ncomplete, accurate, and verifiable long-form reports. Reports with these\nqualities are necessary to satisfy the complex, nuanced, or multi-faceted\ninformation needs of users. In this perspective paper, we draw together\nopinions from industry and academia, and from a variety of related research\nareas, to present our vision for automatic report generation, and -- critically\n-- a flexible framework by which such reports can be evaluated. In contrast\nwith other summarization tasks, automatic report generation starts with a\ndetailed description of an information need, stating the necessary background,\nrequirements, and scope of the report. Further, the generated reports should be\ncomplete, accurate, and verifiable. These qualities, which are desirab

In [38]:
df.iloc[similar[2], 2]

'Large-scale Text-to-Image (T2I) diffusion models demonstrate significant\ngeneration capabilities based on textual prompts. Based on the T2I diffusion\nmodels, text-guided image editing research aims to empower users to manipulate\ngenerated images by altering the text prompts. However, existing image editing\ntechniques are prone to editing over unintentional regions that are beyond the\nintended target area, primarily due to inaccuracies in cross-attention maps. To\naddress this problem, we propose Localization-aware Inversion (LocInv), which\nexploits segmentation maps or bounding boxes as extra localization priors to\nrefine the cross-attention maps in the denoising phases of the diffusion\nprocess. Through the dynamic updating of tokens corresponding to noun words in\nthe textual input, we are compelling the cross-attention maps to closely align\nwith the correct noun and adjective words in the text prompt. Based on this\ntechnique, we achieve fine-grained image editing over part

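Both abstracts share some features with our reference abstract: each opens with large pretrained models and deals with text-driven systems. The match is only loose, though. The first is about LLM-based report generation and the second about text-guided image editing with diffusion models, and neither addresses domain adaptation for multi-label classification.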

We can also visualise this similarity in a 3D scatter plot. To do that, we first need to reduce the 768 dimensions down to 3. For this we can use [$t$-SNE](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf), a dimensionality-reduction technique like PCA, but a non-linear one: it tries to keep points that are neighbours in the original space close together in the reduced space.

In [11]:
n_components = 3 # 3D visualisation

# select the embeddings of the 10 closest results (NumPy indexing with a list of row IDs)
viz_array = embeddings[similar]

# reduce the embeddings to n_components (3) dimensions using TSNE
tsne = TSNE(n_components=n_components, random_state=42, perplexity=5)
reduced_vectors = tsne.fit_transform(viz_array) # transform the data
reduced_vectors[0:5] # show the first 5

array([[ -96.76133 ,  -16.78268 ,  -12.690235],
       [-109.19027 ,  -88.27165 ,  -75.16752 ],
       [ -19.696934,  -84.72847 ,   94.33796 ],
       [-179.14848 ,  -19.984583,   68.73254 ],
       [  79.20754 ,   24.442919,   63.034298]], dtype=float32)
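
For comparison with the t-SNE projection, a linear PCA reduction can be sketched in a few lines of NumPy using the singular value decomposition. This is an illustrative assumption rather than part of the original notebook: `pca_reduce` is a hypothetical helper and the input is random toy data with the same shape as `viz_array`.

```python
import numpy as np


def pca_reduce(vectors, n_components=3):
    """Project `vectors` onto their first `n_components` principal components."""
    centred = vectors - vectors.mean(axis=0)                 # PCA works on mean-centred data
    _, _, vt = np.linalg.svd(centred, full_matrices=False)   # rows of vt are principal axes
    return centred @ vt[:n_components].T


rng = np.random.default_rng(42)
toy_points = rng.normal(size=(10, 768)).astype("float32")  # stand-in for `viz_array`

pca_vectors = pca_reduce(toy_points)
print(pca_vectors.shape)
```

Unlike t-SNE, this projection is deterministic and has no perplexity setting, but it can only capture linear structure.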

In [12]:
# Code adapted from: Afzal (2024)
# https://medium.com/@sarmadafzalj/visualize-vector-embeddings-in-a-rag-system-89d0c44a3be4

import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.io as pio

# Create a 3D scatter plot
scatter_plot = go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color='grey', opacity=0.5, line=dict(color='lightgray', width=1)),
    text=[f"Point {i}" for i in range(len(reduced_vectors))],
    name="Abstracts"
)

# Highlight the first point with a different colour (red)
highlighted_point = go.Scatter3d(
    x=[reduced_vectors[0, 0]],
    y=[reduced_vectors[0, 1]],
    z=[reduced_vectors[0, 2]],
    mode='markers',
    marker=dict(size=8, color='red', opacity=0.8, line=dict(color='lightgray', width=1)),
    text=["Question"],
    name="Query abstract"

)

# Highlight the closest two points with a different colour again (blue)
blue_points = go.Scatter3d(
    x=reduced_vectors[1:3, 0],
    y=reduced_vectors[1:3, 1],
    z=reduced_vectors[1:3, 2],
    mode='markers',
    marker=dict(size=8, color='blue', opacity=0.8,  line=dict(color='black', width=1)),
    text=["Top 1 Document","Top 2 Document"],
    name="Closest abstracts"
)

# Create the layout for the plot
layout = go.Layout(
    scene=dict(
        xaxis=dict(title='X'),
        yaxis=dict(title='Y'),
        zaxis=dict(title='Z'),
    ),
    title='3D Representation after t-SNE (Perplexity=5)'
)


fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'scatter3d'}]])

# Add the scatter plots to the Figure
fig.add_trace(scatter_plot)
fig.add_trace(highlighted_point)
fig.add_trace(blue_points)

fig.update_layout(layout)

pio.write_html(fig, 'interactive_plot.html')
fig.show()

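In the 3D plot, the query abstract (red) can be compared with its two nearest neighbours (blue) and the remaining seven results (grey). Treat the layout as indicative only: t-SNE preserves local neighbourhoods rather than exact distances, and with just 10 points the picture can change noticeably with the perplexity and the random seed.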
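
The visual impression can also be checked numerically by measuring Euclidean distances in the reduced space. The sketch below uses only the five rows of `reduced_vectors` printed earlier, copied into a hypothetical array `reduced_head`; it assumes row 0 is the query, and those rows may come from a different run than the search results shown above.

```python
import numpy as np

# First five rows of `reduced_vectors`, copied from the output printed earlier
reduced_head = np.array(
    [
        [-96.76133, -16.78268, -12.690235],
        [-109.19027, -88.27165, -75.16752],
        [-19.696934, -84.72847, 94.33796],
        [-179.14848, -19.984583, 68.73254],
        [79.20754, 24.442919, 63.034298],
    ],
    dtype="float32",
)

# Euclidean distance from the query (row 0) to each of the other rows
distances = np.linalg.norm(reduced_head[1:] - reduced_head[0], axis=1)

# Row numbers ordered from nearest to farthest in the reduced space
nearest_first = (np.argsort(distances) + 1).tolist()

for row, dist in zip(range(1, len(reduced_head)), distances):
    print(f"Point {row}: distance {dist:.1f}")
print(f"Nearest first: {nearest_first}")
```

The ordering in the reduced space does not have to match the ranking returned by the index, which is one reason to use the plot as a qualitative aid rather than as evidence of similarity.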

How about querying with new text, rather than an abstract that is already in the database? Let's get ChatGPT to create something similar to our reference abstract:

In [42]:
rag_text = "Large language models (LLMs) are increasingly pivotal in supporting \
text categorization across various specialized areas while managing multiple \
labels concurrently. Adapting these LLM-based multi-label classifiers to domain \
shifts, such as transitioning a news sentiment classifier from financial to \
political news, presents significant challenges. These challenges stem from \
incomplete label sets in the new domain and the considerable burden of retraining. "

Now we can embed our text, as before, and use it to search the database:

In [43]:
# Convert RAG text to vectors
rag_embedding = model.encode(rag_text, show_progress_bar=True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [44]:
# Retrieve the 10 nearest neighbours
cs_similarity_two, similar_two = index.search(np.array([rag_embedding]), k=10)
similar_two = similar_two.flatten().tolist()
print(f'Original search: {similar}')
print(f'New search: {similar_two}')
common_elements = set(similar) # change similar to a set
# print the documents in common (the intersection of the sets of the two lists)
print(f'Common documents: {common_elements.intersection(set(similar_two))}')

Original search: [54, 142, 90, 240, 68, 316, 403, 287, 417, 338]
New search: [142, 54, 240, 93, 338, 445, 13, 55, 166, 90]
Common documents: {142, 240, 338, 54, 90}


As you can see, we get similar results: five of the ten documents returned by the new query (including the original abstract, #54) also appear in the original search. Great work!
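
The agreement between the two result lists can be quantified. The sketch below computes the share of the original top 10 that reappears and the Jaccard similarity of the two sets, using the ID lists printed above; the helper names `overlap_at_k` and `jaccard` are assumptions for illustration.

```python
# Result lists copied from the output above
original_search = [54, 142, 90, 240, 68, 316, 403, 287, 417, 338]
new_search = [142, 54, 240, 93, 338, 445, 13, 55, 166, 90]


def overlap_at_k(first, second):
    """Fraction of the IDs in `first` that also appear in `second`."""
    return len(set(first) & set(second)) / len(first)


def jaccard(first, second):
    """Size of the intersection divided by the size of the union."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b)


print(f"Overlap@10: {overlap_at_k(original_search, new_search):.2f}")
print(f"Jaccard:    {jaccard(original_search, new_search):.2f}")
```

Neither measure takes rank order into account, so they treat a document at position 1 and one at position 10 as equally important.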

Let's also, as a bit of foreshadowing, try finding documents to support a more standard Q&A prompt:

In [45]:
# Q&A prompt
qna_prompt = "what is sentiment analysis?"

# Convert Q&A prompt to vectors
rag_embedding = model.encode(qna_prompt, show_progress_bar=True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Now we can find the single best abstract for this query:

In [46]:
# Retrieve the top nearest neighbour
cs_similarity_three, similar_three = index.search(np.array([rag_embedding]), k=1)
similar_three = similar_three.flatten().tolist()

# Print the result
print(f'Top result: {df.iloc[similar_three[0], 2]}')

Top result: Text summarization models have typically focused on optimizing aspects of
quality such as fluency, relevance, and coherence, particularly in the context
of news articles. However, summarization models are increasingly being used to
summarize diverse sources of text, such as social media data, that encompass a
wide demographic user base. It is thus crucial to assess not only the quality
of the generated summaries, but also the extent to which they can fairly
represent the opinions of diverse social groups. Position bias, a long-known
issue in news summarization, has received limited attention in the context of
social multi-document summarization. We deeply investigate this phenomenon by
analyzing the effect of group ordering in input documents when summarizing
tweets from three distinct linguistic communities: African-American English,
Hispanic-aligned Language, and White-aligned Language. Our empirical analysis
shows that although the textual quality of the summaries remain

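A loosely relevant result: the abstract deals with summarising opinions in social-media text, but it does not actually explain what sentiment analysis is.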