# Semantic Search using Language Models and Nearest Neighbor Indexes

First, install the required libraries:


In [20]:
 ! pip install sentence-transformers faiss-cpu ir-measures pyserini torch 

Collecting pyserini
  Using cached pyserini-0.18.0-py3-none-any.whl (114.8 MB)
Collecting onnxruntime>=1.8.1
  Using cached onnxruntime-1.12.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)


Installing collected packages: onnxruntime, pyserini
  Attempting uninstall: onnxruntime
    Found existing installation: onnxruntime 1.6.0
    Uninstalling onnxruntime-1.6.0:
      Successfully uninstalled onnxruntime-1.6.0
Successfully installed onnxruntime-1.12.1 pyserini-0.18.0


Approximate nearest neighbor (ANN) search is a form of semantic search that uses vector-based representations of queries and documents to perform retrieval. ANN approaches have become very popular for search because of the success of language models such as BERT.

Before exploring how they can be used for search, we will look at how language models can produce vector-based representations. We will use the Sentence Transformers library, a language model library optimized for representing sentences as vectors.

## Creating Sentence Representations

Import the needed libraries

In [10]:
from sentence_transformers import SentenceTransformer
import json
import numpy
from torch import nn
import torch

To create sentence representations we must select and load a model as shown below

In [3]:
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

Let's encode a few sample sentences and look at their embeddings; we will compare them using cosine similarity afterwards.

In [5]:
# What we want to represent
sentences = ['CS410 is a computer science class focused on information retireval',
    'The Cat is in the hat',
    'Methods in search use semantic search']

# create embedding
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: CS410 is a computer science class focused on information retireval
Embedding: [-1.85216412e-01 -1.66864976e-01 -3.03928524e-01  3.46954137e-01
 -1.48631826e-01 -2.07251552e-02  2.20957547e-01  1.60436645e-01
 -7.27824211e-01 -4.80092205e-02  2.63962775e-01 -1.21291324e-01
  1.32310703e-01  3.23694944e-01  3.07744425e-02 -1.75159276e-01
  2.13862639e-02  3.27777356e-01 -2.46569261e-01 -1.20440729e-01
 -1.21249333e-01  9.51177031e-02 -3.43358248e-01 -2.69212246e-01
 -2.54858673e-01 -2.17364237e-01 -8.54702219e-02  2.82846838e-01
 -9.65579003e-02  4.40875441e-02  2.73259938e-01  3.32060866e-02
  1.88677147e-01  2.57546425e-01 -1.06852895e-04 -5.55440187e-01
  8.37837905e-02 -2.93398649e-02 -2.79575706e-01  3.69383901e-01
 -4.65674400e-01  2.36389264e-02 -1.54408246e-01 -1.41146809e-01
 -1.71877176e-01  1.29293889e-01 -7.72814825e-02  1.71887368e-01
  3.20354372e-01  2.20564947e-01  2.44294897e-01 -9.51622576e-02
  1.04230627e-01 -8.53689983e-02  3.65195185e-01 -1.71411216e-01
  

As the similarities below show, sentence one, which deals with search, is closer to sentence three than to sentence two, but all three pairs have low similarity. Try some sentences of your own to see how minor semantic variations are captured. Each sentence has an extracted embedding, which is a projection of the input text into an N-dimensional space; in this case, N = 768.

In [18]:
cos = nn.CosineSimilarity(dim=0, eps=1e-6)
output = cos(torch.tensor(embeddings[0]),torch.tensor(embeddings[1]))
print("The similarity between sentence one and two is: {}".format(output))
output = cos(torch.tensor(embeddings[0]),torch.tensor(embeddings[2]))
print("The similarity between sentence one and three is: {}".format(output))
output = cos(torch.tensor(embeddings[1]),torch.tensor(embeddings[2]))
print("The similarity between sentence two and three is: {}".format(output))


The similarity between sentence one and two is: 0.24906384944915771
The similarity between sentence one and three is: 0.29295048117637634
The similarity between sentence two and three is: 0.23856401443481445
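
To make the similarity scores above less of a black box, here is a minimal, dependency-free sketch of cosine similarity written out by hand. The three short vectors are made up for illustration and are not model output, and the `eps` guard is only similar in spirit to the one in `nn.CosineSimilarity`.

```python
import math

def cosine_similarity(a, b, eps=1e-6):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors
    return dot / max(norm_a * norm_b, eps)

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration
search_a = [1.0, 2.0, 0.0]
search_b = [2.0, 4.0, 0.0]   # same direction as search_a, twice the length
unrelated = [0.0, 0.0, 3.0]  # orthogonal to both

sim_same_direction = cosine_similarity(search_a, search_b)
sim_orthogonal = cosine_similarity(search_a, unrelated)
print(sim_same_direction, sim_orthogonal)
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why `search_a` and `search_b` score close to 1 while the orthogonal pair scores 0.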


## Using Sentence Representations for Search

Sentence representations are an effective way of representing text, and they are also very efficient to search when paired with a nearest neighbor index such as FAISS. Below is an example of how to load sentence embeddings into an index and how to search it.

In [31]:
import faiss

d = 768  # the dimensionality of our embeddings
model.max_seq_length = 512  # maximum number of tokens encoded per input
index = faiss.IndexFlatL2(d)  # build an exact (brute-force) L2 index
print(index.is_trained)  # flat indexes need no training
print(index.ntotal)  # number of vectors currently stored

sentences = ['CS410 is a computer science class focused on information retireval',
    'The Cat is in the hat',
    'Romeo Romeo where are you']

for sentence in sentences:
    embedding = model.encode([sentence])  # array of shape (1, d)
    index.add(embedding)  # add the vector to the index

print("Our Index now has {} documents".format(index.ntotal))

True
0
Our Index now has 3 documents
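
`IndexFlatL2` is an exact index: it compares a query with every stored vector and reports squared Euclidean (L2) distances. The sketch below imitates that behaviour in plain Python to show what a search returns. The function name `flat_l2_search` and the 2-dimensional vectors are assumptions made for illustration; this is not how FAISS is implemented internally.

```python
def flat_l2_search(stored, query, k):
    """Exhaustively compare `query` with every stored vector.

    Returns (distances, indexes) for the k closest vectors, where each
    distance is the squared Euclidean distance.
    """
    scored = []
    for position, vector in enumerate(stored):
        squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((squared_distance, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [p for _, p in top]

# Hypothetical 2-dimensional vectors standing in for document embeddings
stored_vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
distances, indexes = flat_l2_search(stored_vectors, query=[1.0, 0.5], k=2)
print(distances, indexes)  # prints [0.25, 1.25] [2, 0]
```

Smaller distances mean closer matches, so the first returned position is the best hit.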


Now that each document is loaded into our index, we can search it semantically. Because our index holds only 3 items, we retrieve just the single closest document (k = 1) for each query. In practice, a FAISS index can hold billions of items and still work very well.

In [34]:
k = 1  # number of documents to retrieve

def search_and_print(query):
    embedding = model.encode([query])
    distances, indexes = index.search(embedding, k)  # each has shape (1, k)
    # Row 0 holds the results for our single query
    for doc_id, distance in zip(indexes[0], distances[0]):
        print("Document {} is {} far away from the query: {}".format(doc_id, distance, query))

search_and_print('lets learn about cats')
search_and_print('computer science')
search_and_print('Juliet misses you')

Document 1 is 28.104280471801758 far away from the query: lets learn about cats
Document 0 is 32.00548553466797 far away from the query: computer science
Document 2 is 26.879138946533203 far away from the query: Juliet misses you
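
Comparing a query with every stored vector, as a flat index does, becomes expensive as a collection grows, which is where approximate indexes come in. The toy sketch below illustrates one common idea, an inverted file: vectors are grouped by their nearest centroid, and a search scans only the closest groups. The class `ToyInvertedIndex`, the hand-picked centroids, and the 2-dimensional vectors are all hypothetical; real libraries learn centroids from data and are far more optimized. The name `nprobe` mirrors the parameter FAISS uses for its IVF indexes.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, target):
    """Position of the point in `points` closest to `target`."""
    return min(range(len(points)), key=lambda i: squared_l2(points[i], target))

class ToyInvertedIndex:
    """Buckets vectors by nearest centroid; searches only the closest buckets."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]  # lists of (doc_id, vector)

    def add(self, doc_id, vector):
        self.buckets[nearest(self.centroids, vector)].append((doc_id, vector))

    def search(self, query, k, nprobe=1):
        # Rank buckets by centroid distance and scan only the first `nprobe`
        order = sorted(range(len(self.centroids)),
                       key=lambda i: squared_l2(self.centroids[i], query))
        candidates = [pair for b in order[:nprobe] for pair in self.buckets[b]]
        candidates.sort(key=lambda pair: squared_l2(pair[1], query))
        return [doc_id for doc_id, _ in candidates[:k]]

# Hypothetical centroids and 2-dimensional vectors, chosen by hand
toy_index = ToyInvertedIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
vectors = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0], [11.0, 10.0]]
for doc_id, vector in enumerate(vectors):
    toy_index.add(doc_id, vector)

approx_hits = toy_index.search([8.0, 8.0], k=2, nprobe=1)
print(approx_hits)  # prints [2, 3]
```

With `nprobe=1` only half of the vectors are scanned here. Raising `nprobe` scans more buckets, trading speed for a lower chance of missing a true neighbor.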


Using this same approach, we have created representations of the CS410 corpus with several different language models. Feel free to add other models if you wish.

In [37]:
import pickle
import numpy as np

In [40]:
model_names = [
    'sentence-transformers/multi-qa-mpnet-base-dot-v1',
    'sentence-transformers/multi-qa-distilbert-cos-v1',
    'sentence-transformers/msmarco-distilbert-base-tas-b',
    'sentence-transformers/all-mpnet-base-v2',
    'sentence-transformers/sentence-t5-base',
    'sentence-transformers/all-distilroberta-v1',
    'sentence-transformers/msmarco-bert-base-dot-v5',
    'sentence-transformers/stsb-distilbert-base',
    'sentence-transformers/nq-distilbert-base-v1',
]
for model_name in model_names:
    model = SentenceTransformer(model_name)
    embeddings = []
    # collection.jsonl holds one JSON document per line
    with open('data/collection.jsonl', 'r') as f:
        for line in f:
            data = json.loads(line)
            embeddings.append(model.encode([data['contents']]))

    # Write one pickle of document embeddings per model
    with open('data/documents' + model_name.replace('/', '') + '.pkl', 'wb') as f:
        pickle.dump(embeddings, f)

In [None]:
# Encode the queries with the same models (model_names is defined in the cell above)
for model_name in model_names:
    model = SentenceTransformer(model_name)
    embeddings = []
    # queries.txt holds one query per line
    with open('data/queries.txt', 'r') as f:
        for line in f:
            embeddings.append(model.encode([line.strip()]))

    # Write one pickle of query embeddings per model
    with open('data/queries' + model_name.replace('/', '') + '.pkl', 'wb') as f:
        pickle.dump(embeddings, f)
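
Each pickle written above holds a Python list with one `(1, d)` array per text, whereas a FAISS index expects a single float32 matrix of shape `(n, d)`. The sketch below shows one way the saved list might be turned into such a matrix. It uses random numbers in place of real embeddings and an in-memory pickle round trip in place of the files in `data/`; both are assumptions made to keep the example self-contained.

```python
import pickle
import numpy as np

# Stand-in for the list saved by the loops above: one (1, d) array per text.
# Random values replace real model output; d = 4 keeps the example small.
rng = np.random.default_rng(0)
saved = [rng.random((1, 4), dtype=np.float32) for _ in range(3)]

# Round-trip through pickle in memory instead of reading from data/
restored = pickle.loads(pickle.dumps(saved))

# Stack the per-text rows into one (n, d) float32 matrix
matrix = np.vstack(restored).astype('float32')
print(matrix.shape)  # prints (3, 4)
```

A matrix in this form could then be passed to `index.add` in a single call rather than adding one vector at a time.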