# Semantic Search using Open-source Transformers and Faiss

Ref: [How to Build a Semantic Search Engine With Transformers and Faiss](https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8)

**A GPU will make this demo run faster.**

For all the required packages:
`!pip install -r requirements.txt`

In [2]:
import numpy as np
import pandas as pd
import pickle

from pathlib import Path


import torch
from sentence_transformers import SentenceTransformer

import faiss

#import s3fs


In [3]:
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from vector_engine.utils import vector_search, id2details   

## Raw Data

The dataset consists of papers on misinformation, disinformation and fake news, preprocessed and stored in a CSV file.

In [4]:
#df = pd.read_csv("s3://vector-search-blog/misinformation_papers.csv")
df = pd.read_csv("../data/misinformation_papers.csv")

df.head()

|  | original_title | abstract | year | citations | id | is_EN |
|---|---|---|---|---|---|---|
| 0 | When Corrections Fail: The Persistence of Poli... | An extensive literature addresses citizen igno... | 2010 | 901 | 2132553681 | 1 |
| 1 | A postmodern Pandora's box: anti-vaccination m... | The Internet plays a large role in disseminati... | 2010 | 440 | 2117485795 | 1 |
| 2 | Spread of (Mis)Information in Social Networks | We provide a model to investigate the tension ... | 2010 | 278 | 2120015072 | 1 |
| 3 | Mandatory Influenza Vaccination of Health Care... | BACKGROUND Influenza vaccination of health car... | 2010 | 232 | 2156683801 | 1 |
| 4 | Intrauterine contraception in Saint Louis: a s... | Abstract Background Many obstacles to intraute... | 2010 | 99 | 2148310716 | 1 |


In [5]:
print(f"Misinformation, disinformation and fake news papers: {df.shape}")

Misinformation, disinformation and fake news papers: (8430, 6)


## Embedding using Language Model

[SentenceTransformers](https://www.sbert.net/)

The [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) offers pretrained transformers that produce state-of-the-art sentence embeddings. Check out this [spreadsheet](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/) with all the available models.

The `distilbert-base-nli-stsb-mean-tokens` model has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Moreover, compared to BERT, it is smaller and faster.

BTW, [Orion's semantic search engine](https://www.orion-search.org/) is using the same model!

In [6]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

cpu


In [8]:
# Convert abstracts to vectors
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)

In [38]:
print(f"Shape of the vectorised abstract: {embeddings.shape}")

Shape of the vectorised abstract: (8430, 768)


## Vector similarity search with Faiss



[Faiss](https://github.com/facebookresearch/faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, even ones that do not fit in RAM. [ref: Faiss Paper](https://arxiv.org/abs/1702.08734)
    
Faiss is built around the `Index` object which contains, and sometimes preprocesses, the searchable vectors. Faiss has a large collection of [indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). You can even create [composite indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)). Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s.

**Note**: Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.
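The conversion mentioned in the note can be sketched with NumPy alone. `to_faiss_input` below is a hypothetical helper, not part of Faiss or of this project; it assumes the vectors arrive as any array-like and returns a 2-D, row-major `float32` array, which is the layout the Faiss Python wrappers generally expect.

```python
import numpy as np


def to_faiss_input(vectors):
    """Return `vectors` as a 2-D, C-contiguous float32 array (hypothetical helper)."""
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        # A single vector becomes a batch containing one row
        arr = arr.reshape(1, -1)
    return arr


# NumPy creates float64 arrays by default, so a conversion is needed
doubles = np.random.default_rng(0).random((3, 4))
converted = to_faiss_input(doubles)
print(doubles.dtype, "->", converted.dtype)  # float64 -> float32
```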


Here, we will use the `IndexFlatL2` index:
- It's a simple index that performs a brute-force L2 distance search
- It scales linearly. It will work fine with our data but you might want to try [faster indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search) if you work with millions of vectors.
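To make "brute-force" concrete, here is a minimal NumPy sketch of an exhaustive L2 search. It is an illustration of the idea, not Faiss's implementation; the function name and the random data are made up. It returns squared L2 distances, which is also what `IndexFlatL2` reports, and it compares every query against every stored vector, which is why the cost grows linearly with the size of the collection.

```python
import numpy as np


def brute_force_l2_search(database, queries, k):
    """Exhaustive k-NN by squared L2 distance (illustrative, not Faiss code)."""
    # Pairwise squared distances: ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)
    x_norms = (database ** 2).sum(axis=1)
    dists = q_norms - 2.0 * queries @ database.T + x_norms
    dists = np.maximum(dists, 0.0)  # clip tiny negatives caused by rounding
    # Keep the k smallest distances per query, in ascending order
    order = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, order, axis=1), order


rng = np.random.default_rng(42)
database = rng.random((1000, 8), dtype=np.float32)

# Query with the first stored vector: its nearest neighbour should be itself
D, I = brute_force_l2_search(database, database[:1], k=3)
print(D, I)
```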

To create an index with the `misinformation` abstract vectors, we will:
1. Change the data type of the abstract vectors to float32.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their paper IDs from MAG.

In [39]:
# Step 1: Change data type
embeddings_f32 = embeddings.astype("float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings_f32.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings_f32, df.id.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 8430


### Searching the index
The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned. 

Let's query the index with an abstract from our dataset and retrieve the 10 most relevant documents. **The first one must be our query!**


In [54]:
# pick a paper abstract
random_pick = 5415
df.iloc[random_pick]["abstract"]

"We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction number [Formula: see text] for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."

In [55]:
# Retrieve the 10 nearest neighbours
D, I = index.search(embeddings_f32[random_pick:random_pick + 1], k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 1.3749969005584717, 55.55474853515625, 65.87544250488281, 67.96585083007812, 69.05718994140625, 69.70634460449219, 70.40652465820312, 70.57284545898438, 71.0338134765625]

MAG paper IDs: [3092618151, 3011345566, 3012936764, 3011186656, 3092128270, 3048848247, 3044429417, 3055557295, 3024620668, 3044097955]


In [56]:
# Fetch the paper titles based on their index
id2details(df, I, 'original_title')

[['The COVID-19 social media infodemic.'],
 ['The COVID-19 Social Media Infodemic'],
 ['Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset'],
 ['Coronavirus Goes Viral: Quantifying the COVID-19 Misinformation Epidemic on Twitter'],
 ['Analysis of online misinformation during the peak of the COVID-19 pandemics in Italy'],
 ['COVID-19-Related Infodemic and Its Impact on Public Health: A Global Social Media Analysis.'],
 ['Effects of misinformation on COVID-19 individual responses and recommendations for resilience of disastrous consequences of misinformation'],
 ['Covid-19 infodemic reveals new tipping point epidemiology and a revised R formula.'],
 ['Quantifying COVID-19 Content in the Online Health Opinion War Using Machine Learning'],
 ['Coronavirus-related online web search desire amidst the rising novel coronavirus incidence in Ethiopia: Google Trends-based infodemiology']]
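`id2details` is imported from `vector_engine.utils` and its source is not shown in this notebook. A minimal sketch of how such a lookup might work, assuming the DataFrame has an `id` column and `I` is the 2-D array of IDs returned by the search, is given below. The function name and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd


def id2details_sketch(df, I, column):
    """For each ID in the first query's results, list the matching `column` values."""
    return [df.loc[df["id"] == idx, column].tolist() for idx in I[0]]


# Toy data standing in for the papers DataFrame
papers = pd.DataFrame(
    {"id": [11, 22, 33], "original_title": ["Paper A", "Paper B", "Paper C"]}
)
I_toy = np.array([[33, 11]])  # IDs as a search would return them, best match first

titles = id2details_sketch(papers, I_toy, "original_title")
print(titles)  # [['Paper C'], ['Paper A']]
```

Each result is itself a list because an ID could, in principle, match more than one row.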

In [25]:
# Fetch the paper abstracts based on their index
id2details(df, I, 'abstract')

[["We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction number [Formula: see text] for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."],
 ["We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and p


## Query

So far, we've built a Faiss index using the misinformation abstract vectors we encoded with a sentence-DistilBERT model. That's helpful, but in a real-world scenario we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

1. Encode the query with the same sentence-DistilBERT model we used for the rest of the abstract vectors.
2. Change its data type to float32.
3. Search the index with the encoded query.
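The three steps above can be sketched as a small helper. This is an illustration only: `vector_search_sketch` is a hypothetical name, and the actual `vector_search` in `vector_engine.utils` may differ. It assumes `model` exposes an `encode` method that accepts a list of strings and `index` exposes a Faiss-style `search(x, k)` method.

```python
import numpy as np


def vector_search_sketch(query, model, index, num_results=10):
    """Encode `query` (a list of strings) and search `index` (illustrative helper)."""
    # Step 1: encode with the same model used for the indexed vectors
    vectors = model.encode(list(query))
    # Step 2: Faiss expects float32 input
    vectors = np.asarray(vectors, dtype="float32")
    # Step 3: k-nearest-neighbour search; returns (distances, ids)
    return index.search(vectors, k=num_results)
```

With the notebook's `model` and `index`, a call would look like `D, I = vector_search_sketch(["some query text"], model, index, num_results=10)`.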

Here, we will use the introduction of an article published on [HKS Misinformation Review](https://misinforeview.hks.harvard.edu/article/can-whatsapp-benefit-from-debunked-fact-checked-stories-to-reduce-misinformation/).


In [57]:
user_query = """
WhatsApp was alleged to have been widely used to spread misinformation and propaganda 
during the 2018 elections in Brazil and the 2019 elections in India. Due to the 
private encrypted nature of the messages on WhatsApp, it is hard to track the dissemination 
of misinformation at scale. In this work, using public WhatsApp data from Brazil and India, we 
observe that misinformation has been largely shared on WhatsApp public groups even after they 
were already fact-checked by popular fact-checking agencies. This represents a significant portion 
of misinformation spread in both Brazil and India in the groups analyzed. We posit that such 
misinformation content could be prevented if WhatsApp had a means to flag already fact-checked 
content. To this end, we propose an architecture that could be implemented by WhatsApp to counter 
such misinformation. Our proposal respects the current end-to-end encryption architecture on WhatsApp, 
thus protecting users’ privacy while providing an approach to detect the misinformation that benefits 
from fact-checking efforts.
"""

In [58]:
# For convenience, I've wrapped all steps in the vector_search function.
# It takes four arguments: 
# A query, the sentence-level transformer, the Faiss index and the number of requested results
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [7.6366119384765625, 58.32743453979492, 58.32743453979492, 70.91807556152344, 73.32896423339844, 81.48760986328125, 85.36541748046875, 85.85224914550781, 87.20014953613281, 92.0755615234375]

MAG paper IDs: [3047438096, 3037966274, 3021927925, 2889959140, 2791045616, 2943077655, 2990343632, 2974128076, 3014380170, 3028584171]


In [59]:
# Fetching the paper titles based on their index
id2details(df, I, 'original_title')

[['Can WhatsApp Benefit from Debunked Fact-Checked Stories to Reduce Misinformation?'],
 ['A Dataset of Fact-Checked Images Shared on WhatsApp During the Brazilian and Indian Elections'],
 ['A Dataset of Fact-Checked Images Shared on WhatsApp During the Brazilian and Indian Elections'],
 ['A System for Monitoring Public Political Groups in WhatsApp'],
 ['Politics of Fake News: How WhatsApp Became a Potent Propaganda Tool in India'],
 ['Characterizing Attention Cascades in WhatsApp Groups'],
 ['Can WhatsApp Counter Misinformation by Limiting Message Forwarding'],
 ['Can WhatsApp Counter Misinformation by Limiting Message Forwarding'],
 ['OS IMPACTOS JURÍDICOS E SOCIAIS DAS FAKE NEWS EM TERRITÓRIO BRASILEIRO'],
 ['Images and Misinformation in Political Groups: Evidence from WhatsApp in India']]

In [None]:
# Define project base directory
# Change the index from 1 to 0 if you run this on Google Colab
project_dir = Path('notebooks').resolve().parents[1]
print(project_dir)

# Serialise index and store it as a pickle
with open(f"{project_dir}/models/faiss_index.pickle", "wb") as h:
    pickle.dump(faiss.serialize_index(index), h)