## **Reranking with Sentence Transformers and BM25 API**

In Retrieval-Augmented Generation (RAG), reranking refines the list of documents or passages retrieved from a knowledge base so that it better matches the user's query. After an initial retrieval step, a reranking model re-scores the retrieved documents, reorders them, and keeps only the top ones. The generative model then receives the most relevant passages, which tends to produce more accurate and contextually appropriate responses.
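
As a rough illustration of the retrieve-then-rerank pattern described above, here is a minimal sketch in plain Python. The function `retrieve_then_rerank` and the two toy scorers (`word_overlap`, `overlap_ratio`) are hypothetical names used only for illustration; in the notebook that follows, the first stage is embedding similarity and the second stage is BM25.

```python
from typing import Callable, List

# A scorer takes (query, document) and returns a relevance score.
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Scorer,
    second_stage: Scorer,
    k: int,
) -> List[str]:
    """Keep the top-k documents by the first scorer, then reorder them by the second."""
    # Stage 1: score every document and keep the k best candidates.
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k]
    # Stage 2: rescore only the surviving candidates and reorder them.
    return sorted(candidates, key=lambda d: second_stage(query, d), reverse=True)


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct words shared with the query."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy second-stage scorer: shared words divided by document length."""
    words = doc.lower().split()
    return word_overlap(query, doc) / len(words) if words else 0.0


toy_docs = [
    "keyword extraction for search",
    "cooking pasta at home",
    "fast keyword extraction",
    "keyword extraction methods explained in great detail here",
]
result = retrieve_then_rerank("keyword extraction", toy_docs, word_overlap, overlap_ratio, k=3)
print(result)
```

The first stage discards the unrelated document, and the second stage reorders the three survivors using a different signal. That is the same division of labor the rest of this notebook follows.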

### **Benefits of reranking**  
- Getting more relevant results from the LLM
- Reducing noise in the retrieved context
- Making more efficient use of the retrieved data


In [1]:
# Small sample corpus to rank against a query
documents = [
    "This is a list which containing sample documents .",
    "Keywords are important for keyowrd-baseed search",
    "Document analysis involves extracking keywords",
    "keyword-based search relies on sparse embeddings",
    "understanding the document structure aids in keyword extraction.",
    "efficient keyword extraction enhances search accuracy",
]

In [2]:
!pip install sentence_transformers

Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-3.0.1


In [3]:
model_name = "sentence-transformers/paraphrase-distilroberta-base-v1"

In [4]:
from sentence_transformers import SentenceTransformer




In [5]:
!huggingface-cli login




    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in yo

In [6]:
model =  SentenceTransformer(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [7]:
document_embeddings = model.encode(documents)

In [8]:
len(document_embeddings)

6

In [9]:
query = "Natural Langueage processing techinques enhance keyword extraction  efficiency."

In [10]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [11]:
query_embedding =  model.encode([query])

In [12]:
query_embedding

array([[ 4.83688742e-01,  1.83549672e-01,  1.80234045e-01,
        -1.63556620e-01,  1.23959687e-03, -1.12243645e-01,
         1.17718289e-02,  1.20779127e-03,  2.19325274e-01,
        -1.21330440e-01, -2.22452030e-01, -5.54232299e-01,
         7.58434311e-02,  7.37263262e-01, -2.94861123e-02,
        -1.00635457e+00, -3.40117991e-01, -7.80434608e-02,
         3.07097197e-01,  2.23024786e-01, -5.29717058e-02,
        -2.66743958e-01, -1.50195450e-01, -1.53032750e-01,
         4.72508878e-01, -1.42260581e-01,  9.20170024e-02,
         2.49737918e-01, -4.67505574e-01,  3.29155922e-02,
        -3.25162828e-01, -2.10755453e-01,  8.19945261e-02,
         7.47007877e-02,  1.10842809e-01, -1.05710261e-01,
         7.81222060e-02, -4.30204660e-01, -3.33237708e-01,
         4.82284501e-02,  1.09717183e-01,  3.26865941e-01,
         2.01408401e-01,  2.59879529e-01, -2.27288470e-01,
        -1.17897615e-01, -3.33150655e-01, -7.25063562e-01,
         1.49504632e-01,  3.83876085e-01, -3.36214393e-0

In [13]:
len(query_embedding)

1

In [14]:
similaries = cosine_similarity(query_embedding , document_embeddings)

In [15]:
similaries

array([[-0.01127331,  0.2993295 ,  0.3493766 ,  0.22223252,  0.46395612,
         0.6157012 ]], dtype=float32)
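
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The helper below is a plain-Python sketch of that formula. The name `cosine` is ours and is not part of scikit-learn, and the sketch assumes neither vector is all zeros. `cosine_similarity` computes the same quantity for each pair of rows in its two inputs, which is why the result above has shape 1 x 6.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors: close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors: 0
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite vectors: -1
print(same_direction, orthogonal, opposite)
```

Scores near 1 mean the embeddings point in nearly the same direction, and scores near 0 (such as the first document above) mean they are close to unrelated.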

In [16]:
most_similar_index = np.argmax(similaries)

In [17]:
most_similar_index

5

In [18]:
documents[most_similar_index]

'efficient keyword extraction enhances search accuracy'

In [19]:
sorted_index = np.argsort(similaries)[0][::-1]

In [20]:
ranked_documents = [documents[i] for i in sorted_index]

In [21]:
ranked_documents

['efficient keyword extraction enhances search accuracy',
 'understanding the document structure aids in keyword extraction.',
 'Document analysis involves extracking keywords',
 'Keywords are important for keyowrd-baseed search',
 'keyword-based search relies on sparse embeddings',
 'This is a list which containing sample documents .']

In [22]:
top_4_document =  [documents[i] for i in sorted_index[:4]]

In [23]:
top_4_document

['efficient keyword extraction enhances search accuracy',
 'understanding the document structure aids in keyword extraction.',
 'Document analysis involves extracking keywords',
 'Keywords are important for keyowrd-baseed search']

In [24]:
tokenized_top_4_document =  [doc.split()  for doc in top_4_document]

In [25]:
tokenized_top_4_document

[['efficient', 'keyword', 'extraction', 'enhances', 'search', 'accuracy'],
 ['understanding',
  'the',
  'document',
  'structure',
  'aids',
  'in',
  'keyword',
  'extraction.'],
 ['Document', 'analysis', 'involves', 'extracking', 'keywords'],
 ['Keywords', 'are', 'important', 'for', 'keyowrd-baseed', 'search']]

In [34]:
tokenized_query = query.split()

In [37]:
tokenized_query

['Natural',
 'Langueage',
 'processing',
 'techinques',
 'enhance',
 'keyword',
 'extraction',
 'efficiency.']
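
Note that `str.split()` keeps case and punctuation, so `'extraction.'` and `'extraction'`, or `'Keywords'` and `'keyword'`, are different tokens under exact matching. One possible way to reduce such mismatches is to lowercase the text and strip punctuation before tokenizing. The `tokenize` helper below is a hypothetical sketch and is not used in this notebook; adopting it would change the BM25 scores shown later.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters/digits, allowing internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


doc_tokens = tokenize("understanding the document structure aids in keyword extraction.")
query_tokens = tokenize(
    "Natural Langueage processing techinques enhance keyword extraction  efficiency."
)

# Tokens the document and the query have in common after normalization.
shared = sorted(set(doc_tokens) & set(query_tokens))
print(shared)
```

With this normalization, the trailing period no longer prevents `extraction` in the document from matching the query.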

In [None]:
# Install and import the BM25 implementation
!pip install rank_bm25
from rank_bm25 import BM25Okapi


In [30]:
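# Build the BM25 index over the tokenized top-4 documents.
# Keeping the index in its own variable lets the scoring cell be re-run safely.
bm25 = BM25Okapi(tokenized_top_4_document)

In [31]:
bm25

<rank_bm25.BM25Okapi at 0x7f20ec6a9ff0>

In [35]:
bm25_score = bm25.get_scores(tokenized_query)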

In [36]:
bm25_score

array([0.86282878, 0.        , 0.        , 0.        ])
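
To see where these numbers come from, the sketch below re-implements the Okapi BM25 scoring formula in plain Python, assuming the `k1=1.5` and `b=0.75` defaults of `BM25Okapi`. It is a simplified illustration rather than the library's actual code: the function `bm25_scores` is ours, and it omits the library's special handling of negative IDF values, which does not come into play for this small corpus.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(corpus: List[List[str]], query: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # exact token match only
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    ["efficient", "keyword", "extraction", "enhances", "search", "accuracy"],
    ["understanding", "the", "document", "structure", "aids", "in", "keyword", "extraction."],
    ["Document", "analysis", "involves", "extracking", "keywords"],
    ["Keywords", "are", "important", "for", "keyowrd-baseed", "search"],
]
query = ["Natural", "Langueage", "processing", "techinques",
         "enhance", "keyword", "extraction", "efficiency."]

scores = bm25_scores(corpus, query)
print(scores)
```

Only `'extraction'` contributes to the first document's score. `'keyword'` appears in exactly two of the four documents, so its IDF is log(2.5 / 2.5) = 0, and the second document contains `'extraction.'` with a trailing period, which does not match the query token exactly.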

In [38]:
sorted_indices = np.argsort(bm25_score)[::-1]

In [39]:
sorted_indices

array([0, 3, 2, 1])

In [40]:
reranked_documents = [top_4_document[i] for i in sorted_indices]

In [41]:
reranked_documents

['efficient keyword extraction enhances search accuracy',
 'Keywords are important for keyowrd-baseed search',
 'Document analysis involves extracking keywords',
 'understanding the document structure aids in keyword extraction.']
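
In the result above, three documents share a BM25 score of 0, so their relative order comes only from reversing the ascending `argsort`, which here happens to invert the embedding ranking. One option, sketched below as an assumption rather than as what this notebook does, is to break ties by the original embedding rank so that the first-stage order is preserved among documents with equal BM25 scores.

```python
# Values copied from the cells above.
bm25_score = [0.86282878, 0.0, 0.0, 0.0]
top_4_document = [
    "efficient keyword extraction enhances search accuracy",
    "understanding the document structure aids in keyword extraction.",
    "Document analysis involves extracking keywords",
    "Keywords are important for keyowrd-baseed search",
]

# Sort by descending BM25 score; for equal scores the smaller index
# (the better embedding rank) comes first.
order = sorted(range(len(bm25_score)), key=lambda i: (-bm25_score[i], i))
reranked_with_ties = [top_4_document[i] for i in order]
print(order)
print(reranked_with_ties)
```

With this tie-break, BM25 can only promote a document when it has real lexical evidence, and otherwise the embedding order stands.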