<a href="https://colab.research.google.com/github/B34R-e/My-Projects/blob/main/Sentence_Transformer_Text_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets sentence_transformers

In [None]:
from datasets import load_dataset

dataset = load_dataset('ms_marco', 'v1.1')

In [3]:
# select a test sample
subset = dataset['test']

# declare a list contains related queries and documents
queries_infos = []
queries = []
corpus = []

# split the data
# Duyet qua tung sample trong bo test duoc lay ra tu dataset va chi lay cac sample co cau truy van thuoc kieu 'entity'
for sample in subset:
  query_type = sample['query_type']
  if query_type != 'entity':
    continue
  # Lay ra noi dung cau truy van va id cua chung
  query_id = sample['query_id']
  query_str = sample['query']
  # Lay danh sach cac tai lieu va nhan tuong ung cua cau truy van
  passages_dict = sample['passages']
  is_selected_lst = passages_dict['is_selected']
  passage_text_lst = passages_dict['passage_text']
  # Khai bao mot dictionary chua cac thong tin cua cau truy van
  query_info = {
      'query_id': query_id,
      'query': query_str,
      'relevant_docs': []
  }
  # Tu danh sach cac tai lieu va nhan
  # Chon cac tai lieu duoc gan co lien quan den cau truy van
  # Dua vao key 'relevant_docs'
  # Luu tru nhan duoi dang chi muc trong list corpus
  current_len_corpus = len(corpus)
  for idx in range(len(is_selected_lst)):
    if is_selected_lst[idx] == 1:
      doc_idx = current_len_corpus + idx
      query_info['relevant_docs'].append(doc_idx)
    # Bo qua cac sample khong chua tai lieu lien quan de thuan tien cho viec danh gia
    if query_info['relevant_docs'] == []:
      continue
  # Dua thong tin cau truy van va tai lieu vao cac danh sanh da khai bao
  queries.append(query_str)
  queries_infos.append(query_info)
  corpus += passage_text_lst

In [None]:
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = model.encode(corpus, convert_to_tensor = True)

### **Custom Search Function**

In [9]:
corpus_embeddings.shape
print(len(corpus))
print(len(corpus_embeddings))

7486
7486


In [7]:
from sentence_transformers import util

def similarity(query_embeddings, corpus_embeddings):
  return util.cos_sim(query_embeddings, corpus_embeddings)[0]

In [8]:
def ranking(query, top_k = 10):
  query_embeddings = model.encode(query, convert_to_tensor = True)
  cos_scores = similarity(query_embeddings, corpus_embeddings)

  top_results = torch.topk(cos_scores, k = top_k)

  return top_results

In [23]:
custom_queries = ['what is the official language in Fiji']

top_k = min(5, len(corpus))

for query in custom_queries:
  top_results = ranking(query, top_k)

  print(f'Query: {query}')
  print('\n=================')
  print(f'Top {top_k} most similar sentences in corpus:\n')

  for idx, (score, doc_idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f'Document rank {idx + 1}:')
    print(corpus[doc_idx], f'\n(Score: {score:.4f})', '\n')

Query: what is the official language in Fiji

Top 5 most similar sentences in corpus:

Document rank 1:
The official languages. Fiji’s 1997 Constitution established Fijian as one of the official languages of the country. Fijian is an Austronesian language, a grouping that includes thousands of other languages spanning the globe. The language is of the Malayo-Polynesian family, not too different from Hawaiian and Maori. 
(Score: 0.8663) 

Document rank 2:
Fiji has three official languages under the 1997 constitution (and not revoked by the 2013 Constitution): English, Fijian and Hindi. Fijian is spoken either as a first or second language by indigenous Fijians who make up around 54% of the population. 
(Score: 0.8464) 

Document rank 3:
The Republic of the Fiji Islands citizens speak British English. Fijian and Fiji-Hindi is the second language. Other major language that is taught in elementary/primary schools and high schools are Urdu and French. Urdu and French is never considered to 