<a href="https://www.kaggle.com/code/aisuko/wikipedia-q-a-retrieval-semantic-search?scriptVersionId=162241049" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

We showed two **symmetric search tasks** in the [Similar Questions Retrieval](https://www.kaggle.com/code/aisuko/similar-questions-retrieval) and [Semantic Search in Publications](https://www.kaggle.com/code/aisuko/semantic-search-in-publications) notebooks. In this notebook, we show an **asymmetric search task**: a short query (here, a question) is matched against longer passages that contain the answer.

We will use the `nq-distilbert-base-v1` model, which was trained on the Natural Questions dataset. That dataset consists of about 100k real Google search queries, each paired with an annotated Wikipedia passage that provides the answer.

As the corpus, we use the smaller Simple English Wikipedia so that it fits easily into memory.

In [1]:
!pip install sentence-transformers==2.3.1

Collecting sentence-transformers==2.3.1
  Downloading sentence_transformers-2.3.1-py3-none-any.whl.metadata (11 kB)
Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.3.1


In [2]:
import os

os.environ['MODEL_NAME']='nq-distilbert-base-v1'
os.environ['DATASET_NAME']='simplewiki-2020-11-01.jsonl.gz'
os.environ['EMBEDDINGS_NAME']='simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
os.environ['EMBEDDINGS_URL']='http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
os.environ['DATASET_URL']='http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz'

# Loading the dataset

As the dataset, we use Simple English Wikipedia. It is much smaller than the full English Wikipedia, with only about 170k articles. We split these articles into paragraphs and encode them with our semantic search model.

In [3]:
import gzip
import json
from sentence_transformers.util import http_get

# download the dataset first
http_get(os.getenv('DATASET_URL'), os.getenv('DATASET_NAME'))

  0%|          | 0.00/50.2M [00:00<?, ?B/s]

In [4]:
passages=[]
with gzip.open(os.getenv('DATASET_NAME'), 'rt', encoding='utf-8') as fIn:
    for line in fIn:
        data=json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])
len(passages)

509663
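
To make the loading step concrete, here is a minimal sketch of the same parsing logic run on a tiny in-memory sample instead of the real download. It assumes only what the cell above relies on: each line is a JSON object with a `title` and a list of `paragraphs`. The helper name `read_passages` and the two sample articles are made up for illustration.

```python
import gzip
import io
import json


def read_passages(fileobj):
    """Yield [title, paragraph] pairs from a gzip-compressed JSON Lines stream."""
    with gzip.open(fileobj, 'rt', encoding='utf-8') as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for paragraph in record['paragraphs']:
                yield [record['title'], paragraph]


# Hypothetical sample data: two articles with three paragraphs in total.
sample_records = [
    {'title': 'Canberra', 'paragraphs': ['Canberra is the capital of Australia.',
                                         'It is an inland city.']},
    {'title': 'Sydney', 'paragraphs': ['Sydney is the largest city in Australia.']},
]

# Write the sample as gzip-compressed JSON Lines into an in-memory buffer.
buffer = io.BytesIO()
with gzip.open(buffer, 'wt', encoding='utf-8') as f_out:
    for record in sample_records:
        f_out.write(json.dumps(record) + '\n')
buffer.seek(0)

sample_passages = list(read_passages(buffer))
print(len(sample_passages))
print(sample_passages[0])
```

Each article contributes one entry per paragraph, which is why the real corpus yields far more passages than it has articles.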

# Loading/Computing the embeddings

We can either load the pre-computed embeddings or compute them directly with the model.

In [5]:
import torch

# Set to True to download the pre-computed embeddings instead of encoding the corpus
load_pre_computed=False

if load_pre_computed:
    http_get(os.getenv('EMBEDDINGS_URL'), os.getenv('EMBEDDINGS_NAME'))
    corpus_embeddings=torch.load(os.getenv('EMBEDDINGS_NAME'))
    corpus_embeddings=corpus_embeddings.float().to('cuda')

# Loading model

In [6]:
from sentence_transformers import SentenceTransformer

model=SentenceTransformer(os.getenv('MODEL_NAME'))
model.max_seq_length=256
model.to('cuda')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

In [7]:
# Encoding took about 11 minutes on a single P100 GPU
from sentence_transformers.util import normalize_embeddings
if not load_pre_computed:
    corpus_embeddings=model.encode(passages, convert_to_tensor=True, show_progress_bar=True).to('cuda')
    corpus_embeddings=normalize_embeddings(corpus_embeddings)

corpus_embeddings

Batches:   0%|          | 0/15927 [00:00<?, ?it/s]

tensor([[-0.0540,  0.0581, -0.0644,  ...,  0.0849,  0.0323, -0.0228],
        [ 0.0377, -0.0514,  0.0110,  ...,  0.0031,  0.0120, -0.0113],
        [ 0.0179, -0.0202,  0.0298,  ..., -0.0110,  0.0071, -0.0185],
        ...,
        [-0.0289, -0.0142, -0.0056,  ..., -0.0289, -0.0078,  0.0238],
        [ 0.0190,  0.0472,  0.0008,  ..., -0.0226, -0.0427,  0.0419],
        [ 0.0648,  0.0707, -0.0032,  ...,  0.0265, -0.0434,  0.0143]],
       device='cuda:0')
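
The `normalize_embeddings` call above rescales every embedding to unit length. The following dependency-free sketch illustrates why that is convenient: for unit-length vectors, the plain dot product gives the same value as cosine similarity. The helper names (`l2_normalize`, `dot`, `cosine`) and the two-dimensional vectors are illustrative assumptions, not part of the library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Rescale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-d vectors standing in for real 768-d embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization, a dot product is enough to get the cosine similarity.
print(round(dot(a_unit, b_unit), 6), round(cosine(a, b), 6))
```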

In [8]:
# Saving the embeddings to CSV takes too long, so this step is disabled
# import pandas as pd

# embedding_data=pd.DataFrame(corpus_embeddings.cpu())
# embedding_data.to_csv('simple_english_wikipedia.csv', index=False)

# Define the search function

In [9]:
import time
from sentence_transformers.util import semantic_search

def search(query, top_k=5):
    # Encode the query using the model and find potentially relevant passages
    start_time=time.time()
    question_embedding=model.encode(query, convert_to_tensor=True).to('cuda')
    hits=semantic_search(
        question_embedding,
        corpus_embeddings,
        query_chunk_size=100,
        corpus_chunk_size=500000,
        top_k=top_k
    )
    hits=hits[0]
    end_time=time.time()
    
    # Output of top-k hits
    print('Input question', query)
    print('Results (after {:.3f} seconds):'.format(end_time-start_time))
    for hit in hits:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))
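
Conceptually, the search step scores the query embedding against every corpus embedding and keeps the `top_k` best matches. The sketch below mimics that idea in plain Python, assuming cosine similarity as the score. It is not the library's implementation, which works on batched tensors; `toy_semantic_search` and the toy vectors are hypothetical. The returned dictionaries use the same `corpus_id` and `score` keys that the `search` function reads from each hit.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_semantic_search(query_vec, corpus_vecs, top_k=5):
    """Return the top_k corpus entries by cosine similarity, best first."""
    scored = (
        {'corpus_id': idx, 'score': cosine(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    )
    return heapq.nlargest(top_k, scored, key=lambda hit: hit['score'])


# Toy 2-d "embeddings"; real ones come from model.encode(...).
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

toy_hits = toy_semantic_search(toy_query, toy_corpus, top_k=2)
for hit in toy_hits:
    print('{:.3f}\t{}'.format(hit['score'], hit['corpus_id']))
```

The query points almost in the same direction as the first corpus vector, so that entry ranks first, followed by the diagonal vector.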

# Inference

In [10]:
search(query='What is the capital of the Australia?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Input question What is the capital of the Australia?
Results (after 0.068 seconds):
	0.771	['Australia', 'Australia, formally the Commonwealth of Australia, is a country and sovereign state in the southern hemisphere, located in Oceania. Its capital city is Canberra, and its largest city is Sydney.']
	0.707	['New South Wales', 'New South Wales is one of the states of Australia. It the oldest state in Australia and is sometimes called the "Premier State". Of all Australian states, New South Wales has the most people. An inhabitant of New South Wales is referred to as a New South Welshman. The capital city of New South Wales is Sydney. Sydney is the biggest city in Australia.']
	0.687	['Australia', "In 2013 according to world bank Australia had just over 23.13 million people. Most Australians live in cities along the coast, such as Sydney, Melbourne, Brisbane, Perth, Darwin, Hobart and Adelaide. The largest inland city is Canberra, which is also the nation's capital. The largest city is 