### Reranking
Reranking improves the accuracy and relevance of search results by reordering an initial list of results using a more sophisticated relevance analysis.

### How it works
1. A primary retrieval method, such as keyword search, BM25, or (as in this notebook) embedding similarity, finds a set of documents that might be relevant to the query.
2. A reranker analyzes those documents and computes a relevance score for each one.
3. The reranker reorders the documents from most relevant to least relevant.
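
The steps above can be sketched as a small two-stage function. This is a minimal illustration, not the notebook's implementation: `retrieve_then_rerank`, `word_overlap`, and `overlap_ratio` are hypothetical names, and the two toy scorers only stand in for a real first-stage retriever and a real reranker.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Keep the k best documents under first_stage, then reorder them with reranker."""
    # Stage 1: cheap scoring over the whole corpus, keep the top k candidates.
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc), reverse=True)[:k]
    # Stage 2: more expensive scoring, applied only to the k candidates.
    rescored = [(doc, reranker(query, doc)) for doc in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer (assumption): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy reranker (assumption): shared words divided by document length in words."""
    words = doc.lower().split()
    return word_overlap(query, doc) / len(words) if words else 0.0


corpus = [
    "keyword extraction improves search",
    "keyword extraction is one of many steps in a long document analysis pipeline",
    "cats sleep most of the day",
]
results = retrieve_then_rerank("keyword extraction", corpus, word_overlap, overlap_ratio, k=2)
for doc, score in results:
    print(f"{score:.2f}  {doc}")
```

The point of the split is cost: the first stage touches every document, so it has to be cheap, while the reranker only sees `k` candidates and can afford a heavier model.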

### Why it's useful
Reranking is a key step in information retrieval pipelines. It is especially important in Retrieval-Augmented Generation (RAG) systems, where it helps ensure that the LLM has access to the most relevant documents when generating responses.

### How it's used
Reranking can be added to many kinds of search systems to improve the relevance of their results. In RAG systems, it improves the quality of the documents used to generate the final output.
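
As a rough sketch of the RAG use case, the top reranked documents can be joined into the context of a prompt. The helper names, the prompt wording, and the scores below are placeholders chosen for illustration; none of them come from a specific library.

```python
from typing import List, Tuple


def build_context(reranked: List[Tuple[str, float]], k: int = 2) -> str:
    """Join the k highest-ranked documents into a numbered context block."""
    top = [doc for doc, _score in reranked[:k]]
    return "\n".join(f"[{i}] {doc}" for i, doc in enumerate(top, start=1))


def build_prompt(question: str, reranked: List[Tuple[str, float]], k: int = 2) -> str:
    """Hypothetical prompt template; adapt the wording to the LLM in use."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{build_context(reranked, k)}\n\n"
        f"Question: {question}"
    )


# (document, score) pairs, already sorted from most to least relevant.
# The scores are illustrative placeholders, not model output.
reranked = [
    ("Efficient keyword extraction enhances search accuracy.", 0.9),
    ("Machine learning algorithms can optimize keyword extraction methods.", 0.7),
    ("Document analysis involves extracting keywords.", 0.1),
]
prompt = build_prompt("How can keyword extraction be improved?", reranked, k=2)
print(prompt)
```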

In [1]:
documents = [
    "This is a list which containing sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings.",
    "Understanding document structure aids in keyword extraction.",
    "Efficient keyword extraction enhances search accuracy.",
    "Semantic similarity improves document retrieval performance.",
    "Machine learning algorithms can optimize keyword extraction methods."
]

In [2]:
!pip install --quiet sentence_transformers

In [3]:
from sentence_transformers import SentenceTransformer

In [4]:
# Load pre-trained Sentence Transformer model
model_name = 'sentence-transformers/paraphrase-xlm-r-multilingual-v1'

In [5]:
model = SentenceTransformer(model_name)


In [6]:
len(documents)

8

In [7]:
document_embeddings = model.encode(documents)

In [8]:
len(document_embeddings)

8

In [9]:
for i, embedding in enumerate(document_embeddings):
    print(f"Document {i+1} embedding: {embedding}")

Document 1 embedding: [ 0.10894679  0.07812068  0.11626568 -0.3191251   0.46890235  0.43514395
  0.01453739  0.4423876   0.297164   -0.18982704  0.07389052 -0.2786491
  0.21338163 -0.12077003  0.17891704 -0.0078989   0.04754863 -0.18204564
  0.34227112 -0.06994259 -0.1428874   0.57141256 -0.11153248 -0.17895402
  0.01523133  0.26105714 -0.20555836  0.05203114 -0.02810765  0.23873238
  0.01206972  0.0440492   0.02242316 -0.13895178 -0.7410038   0.2560101
  0.08149683  0.18820493 -0.4123769   0.11368614  0.28121158  0.05860889
 -0.1731878   0.33549133  0.21803682 -0.05090713 -0.0545779  -0.8738479
 -0.24082269  0.32006973  0.4476166   0.06347829  0.5357485   0.16607259
 -0.33196998  0.33393648  0.28615907 -0.5419566  -0.2713242   0.24881156
 -0.23919384 -0.46926272  0.13836573  0.3784289  -0.01304435  0.019906
  0.3236508   0.45857537  0.07600265  0.25299582 -0.4293894   0.1005193
 -0.330426   -0.6987646   0.010359    0.05666573  0.14731236 -0.47082353
  0.08063987  0.33870456 -0.2727815

In [10]:
query = "Natural language processing techniques enhance keyword extraction efficiency."

In [11]:
query_embedding = model.encode(query)

In [12]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(np.array([query_embedding]), document_embeddings)

In [13]:
similarities

array([[0.16948152, 0.45802274, 0.5675694 , 0.44123292, 0.6316117 ,
        0.75214124, 0.55035204, 0.74481654]], dtype=float32)
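
For reference, cosine similarity is the dot product of two vectors after each has been scaled to unit length. The sketch below reproduces that computation with plain NumPy on toy 2-D vectors (not real embeddings); `cosine_scores` is a hypothetical helper name.

```python
import numpy as np


def cosine_scores(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_units @ query_unit


# Toy vectors: same direction, orthogonal, and 45 degrees apart.
toy_query = np.array([1.0, 0.0])
toy_docs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
toy_similarities = cosine_scores(toy_query, toy_docs)
print(toy_similarities)
```

Because the vectors are normalized first, the length of an embedding does not affect the score; only its direction does.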

In [14]:
most_similar_index = np.argmax(similarities)

In [15]:
most_similar_index

np.int64(5)

In [16]:
most_similar_document = documents[most_similar_index]

In [17]:
most_similar_document

'Efficient keyword extraction enhances search accuracy.'

In [18]:
similarity_score = similarities[0][most_similar_index]

In [19]:
similarity_score

np.float32(0.75214124)

In [20]:
sorted_indices = np.argsort(similarities[0])[::-1]
sorted_indices

array([5, 7, 4, 2, 6, 1, 3, 0])
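
`np.argsort` returns indices in ascending order of score, so reversing the result with `[::-1]` gives the highest-scoring document first. A small sketch using the similarity values printed earlier; `top_k_indices` is a hypothetical helper, not part of NumPy.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, highest first."""
    return np.argsort(scores)[::-1][:k]


# Similarity values copied from the output above.
example_scores = np.array(
    [0.16948152, 0.45802274, 0.5675694, 0.44123292,
     0.6316117, 0.75214124, 0.55035204, 0.74481654],
    dtype=np.float32,
)
top_indices = top_k_indices(example_scores, k=4)
print(top_indices)
```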

In [21]:
ranked_documents = [(documents[i], similarities[0][i]) for i in sorted_indices]
ranked_documents

[('Efficient keyword extraction enhances search accuracy.',
  np.float32(0.75214124)),
 ('Machine learning algorithms can optimize keyword extraction methods.',
  np.float32(0.74481654)),
 ('Understanding document structure aids in keyword extraction.',
  np.float32(0.6316117)),
 ('Document analysis involves extracting keywords.', np.float32(0.5675694)),
 ('Semantic similarity improves document retrieval performance.',
  np.float32(0.55035204)),
 ('Keywords are important for keyword-based search.', np.float32(0.45802274)),
 ('Keyword-based search relies on sparse embeddings.', np.float32(0.44123292)),
 ('This is a list which containing sample documents.', np.float32(0.16948152))]

In [22]:
print("Ranked Documents:")
for rank, (document, similarity) in enumerate(ranked_documents, start=1):
    print(f"Rank {rank}: Document - '{document}', Similarity Score - {similarity}")


Ranked Documents:
Rank 1: Document - 'Efficient keyword extraction enhances search accuracy.', Similarity Score - 0.7521412372589111
Rank 2: Document - 'Machine learning algorithms can optimize keyword extraction methods.', Similarity Score - 0.7448165416717529
Rank 3: Document - 'Understanding document structure aids in keyword extraction.', Similarity Score - 0.631611704826355
Rank 4: Document - 'Document analysis involves extracting keywords.', Similarity Score - 0.567569375038147
Rank 5: Document - 'Semantic similarity improves document retrieval performance.', Similarity Score - 0.5503520369529724
Rank 6: Document - 'Keywords are important for keyword-based search.', Similarity Score - 0.45802274346351624
Rank 7: Document - 'Keyword-based search relies on sparse embeddings.', Similarity Score - 0.44123291969299316
Rank 8: Document - 'This is a list which containing sample documents.', Similarity Score - 0.16948151588439941


In [23]:
print("Top 4 Documents:")
for rank, (document, similarity) in enumerate(ranked_documents[:4], start=1):
    print(f"Rank {rank}: Document - '{document}', Similarity Score - {similarity}")

Top 4 Documents:
Rank 1: Document - 'Efficient keyword extraction enhances search accuracy.', Similarity Score - 0.7521412372589111
Rank 2: Document - 'Machine learning algorithms can optimize keyword extraction methods.', Similarity Score - 0.7448165416717529
Rank 3: Document - 'Understanding document structure aids in keyword extraction.', Similarity Score - 0.631611704826355
Rank 4: Document - 'Document analysis involves extracting keywords.', Similarity Score - 0.567569375038147


## Reranking with BM25

In [24]:
!pip install rank_bm25



In [25]:
from rank_bm25 import BM25Okapi

In [26]:
top_4_documents = [doc[0] for doc in ranked_documents[:4]]
top_4_documents

['Efficient keyword extraction enhances search accuracy.',
 'Machine learning algorithms can optimize keyword extraction methods.',
 'Understanding document structure aids in keyword extraction.',
 'Document analysis involves extracting keywords.']

In [27]:
tokenized_top_4_documents = [doc.split() for doc in top_4_documents]
tokenized_top_4_documents

[['Efficient', 'keyword', 'extraction', 'enhances', 'search', 'accuracy.'],
 ['Machine',
  'learning',
  'algorithms',
  'can',
  'optimize',
  'keyword',
  'extraction',
  'methods.'],
 ['Understanding',
  'document',
  'structure',
  'aids',
  'in',
  'keyword',
  'extraction.'],
 ['Document', 'analysis', 'involves', 'extracting', 'keywords.']]

In [28]:
tokenized_query = query.split()
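
Note that `str.split()` only splits on whitespace, so case and punctuation stay attached to the tokens. In the tokenized output above, the third document ends with `'extraction.'`, which does not match the query token `'extraction'`. One possible alternative, shown here only as a sketch, is a tokenizer that lowercases the text and drops punctuation. `normalize_tokens` is a hypothetical helper, and using it would change the BM25 scores shown in the following cells.

```python
import re
from typing import List


def normalize_tokens(text: str) -> List[str]:
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


sentence = "Understanding document structure aids in keyword extraction."
raw_tokens = sentence.split()
clean_tokens = normalize_tokens(sentence)
print(raw_tokens[-1])    # last token keeps the trailing period
print(clean_tokens[-1])  # last token without punctuation
```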

In [29]:
bm25 = BM25Okapi(tokenized_top_4_documents)

In [30]:
bm25

<rank_bm25.BM25Okapi at 0x79ca7eb4af90>

In [31]:
bm25_scores = bm25.get_scores(tokenized_query)

In [32]:
bm25_scores

array([0.1907998 , 0.16686672, 0.17803252, 0.        ])
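
The last score is exactly 0 because the fourth document shares no token with the query. To make the scoring itself less of a black box, here is a textbook-style BM25 sketch in plain Python. It is an illustration under stated assumptions, not `rank_bm25`'s implementation: the IDF variant and the `k1` and `b` values are choices made for this example, so its numbers will not match the output above.

```python
import math
from collections import Counter
from typing import List


def bm25_scores_sketch(
    query_tokens: List[str],
    corpus_tokens: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the query with a textbook BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Non-negative IDF variant (an assumption; other variants exist).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [["keyword", "extraction"], ["keyword", "search", "engine"], ["cooking", "recipes"]]
toy_bm25 = bm25_scores_sketch(["keyword", "extraction"], toy_corpus)
print(toy_bm25)
```

Rare terms get a higher IDF weight than common ones, and a document with no matching term scores 0, which is the same behavior seen for the fourth document above.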

In [34]:
sorted_indices2 = np.argsort(bm25_scores)[::-1]

In [35]:
sorted_indices2

array([0, 2, 1, 3])

In [36]:
reranked_documents = [(top_4_documents[i], bm25_scores[i]) for i in sorted_indices2]

In [37]:
reranked_documents

[('Efficient keyword extraction enhances search accuracy.',
  np.float64(0.19079979534096053)),
 ('Understanding document structure aids in keyword extraction.',
  np.float64(0.1780325227902643)),
 ('Machine learning algorithms can optimize keyword extraction methods.',
  np.float64(0.1668667199671815)),
 ('Document analysis involves extracting keywords.', np.float64(0.0))]

In [38]:
print("Rerank of top 4 Documents:")
for rank, (document, score) in enumerate(reranked_documents, start=1):
    print(f"Rank {rank}: Document - '{document}', BM25 Score - {score}")

Rerank of top 4 Documents:
Rank 1: Document - 'Efficient keyword extraction enhances search accuracy.', BM25 Score - 0.19079979534096053
Rank 2: Document - 'Understanding document structure aids in keyword extraction.', BM25 Score - 0.1780325227902643
Rank 3: Document - 'Machine learning algorithms can optimize keyword extraction methods.', BM25 Score - 0.1668667199671815
Rank 4: Document - 'Document analysis involves extracting keywords.', BM25 Score - 0.0


## Cross Encoder

In [39]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


In [40]:
top_4_documents

['Efficient keyword extraction enhances search accuracy.',
 'Machine learning algorithms can optimize keyword extraction methods.',
 'Understanding document structure aids in keyword extraction.',
 'Document analysis involves extracting keywords.']

In [41]:
# Pair the query with each candidate document for the cross-encoder
pairs = [[query, doc] for doc in top_4_documents]

In [42]:
pairs

[['Natural language processing techniques enhance keyword extraction efficiency.',
  'Efficient keyword extraction enhances search accuracy.'],
 ['Natural language processing techniques enhance keyword extraction efficiency.',
  'Machine learning algorithms can optimize keyword extraction methods.'],
 ['Natural language processing techniques enhance keyword extraction efficiency.',
  'Understanding document structure aids in keyword extraction.'],
 ['Natural language processing techniques enhance keyword extraction efficiency.',
  'Document analysis involves extracting keywords.']]

In [43]:
scores = cross_encoder.predict(pairs)
scores

array([ 3.1378708 ,  0.84216565, -2.9193    , -2.878192  ], dtype=float32)
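
The scores above are unbounded and two of them are negative, so they are best read relative to each other. If a value between 0 and 1 is more convenient, one option is to pass them through a sigmoid, which is monotonic and therefore keeps the ranking unchanged. This is a sketch of that idea, not something the notebook itself does; the result should not be read as a calibrated probability.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Map real-valued scores into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Cross-encoder scores copied from the output above.
raw_scores = np.array([3.1378708, 0.84216565, -2.9193, -2.878192], dtype=np.float32)
squashed = sigmoid(raw_scores)
print(squashed)
```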

In [44]:
scored_docs = zip(scores, top_4_documents)

In [45]:
scored_docs

<zip at 0x79ca6fdf7100>

In [46]:
reranked_document_cross_encoder = sorted(scored_docs, reverse=True)

In [47]:
reranked_document_cross_encoder

[(np.float32(3.1378708),
  'Efficient keyword extraction enhances search accuracy.'),
 (np.float32(0.84216565),
  'Machine learning algorithms can optimize keyword extraction methods.'),
 (np.float32(-2.878192), 'Document analysis involves extracting keywords.'),
 (np.float32(-2.9193),
  'Understanding document structure aids in keyword extraction.')]
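
Compared with the embedding ranking, the cross-encoder swaps the last two documents. One caveat about the cells above: `zip` returns a one-shot iterator, so `scored_docs` is empty once `sorted` has consumed it, and sorting tuples falls back to comparing the document strings when two scores are equal. A slightly more defensive variant, sketched with the values printed above and hypothetical variable names:

```python
import numpy as np

# Scores and documents copied from the cells above.
ce_scores = np.array([3.1378708, 0.84216565, -2.9193, -2.878192], dtype=np.float32)
ce_docs = [
    "Efficient keyword extraction enhances search accuracy.",
    "Machine learning algorithms can optimize keyword extraction methods.",
    "Understanding document structure aids in keyword extraction.",
    "Document analysis involves extracting keywords.",
]

# Materialize the pairs as a list so they can be reused after sorting,
# and sort on the score only so ties never compare the strings.
ce_pairs = list(zip(ce_scores.tolist(), ce_docs))
ce_ranked = sorted(ce_pairs, key=lambda pair: pair[0], reverse=True)
for score, doc in ce_ranked:
    print(f"{score:.3f}  {doc}")
```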

## Cohere

In [48]:
!pip install --quiet cohere

In [49]:
import cohere

In [51]:
from google.colab import userdata
Cohere_API = userdata.get('Cohere_API')
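
`google.colab.userdata` only exists inside Colab. When running the notebook elsewhere, one possible fallback is to read the key from an environment variable. The sketch below assumes the environment variable uses the same name as the Colab secret; `load_api_key` is a hypothetical helper.

```python
import os
from typing import Optional


def load_api_key(name: str) -> Optional[str]:
    """Return the secret from Colab if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: fall back to an environment variable of the same name.
        return os.environ.get(name)
    return userdata.get(name)


api_key = load_api_key("Cohere_API")
if api_key is None:
    print("No API key found; set the Cohere_API secret or environment variable.")
```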

In [52]:
co = cohere.Client(Cohere_API)

In [53]:
response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=top_4_documents,
    return_documents=True
)

In [54]:
print(response)



In [55]:
response.results[0].document.text

'Efficient keyword extraction enhances search accuracy.'

In [56]:
response.results[0].relevance_score

0.99411184

In [57]:
for result in response.results:
    print(f'text: {result.document.text} score: {result.relevance_score}')

text: Efficient keyword extraction enhances search accuracy. score: 0.99411184
text: Machine learning algorithms can optimize keyword extraction methods. score: 0.9129032
text: Understanding document structure aids in keyword extraction. score: 0.32885265
text: Document analysis involves extracting keywords. score: 0.02865267
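
The relevance scores drop sharply after the second document, so a downstream step might keep only results above a cutoff instead of a fixed top k. A minimal sketch using the scores printed above; the 0.5 threshold is an arbitrary example value and `filter_by_score` is a hypothetical helper, not part of the Cohere SDK.

```python
from typing import List, Tuple

# (text, relevance_score) pairs copied from the output above.
cohere_results: List[Tuple[str, float]] = [
    ("Efficient keyword extraction enhances search accuracy.", 0.99411184),
    ("Machine learning algorithms can optimize keyword extraction methods.", 0.9129032),
    ("Understanding document structure aids in keyword extraction.", 0.32885265),
    ("Document analysis involves extracting keywords.", 0.02865267),
]


def filter_by_score(results: List[Tuple[str, float]], threshold: float) -> List[Tuple[str, float]]:
    """Keep only results whose score is at or above threshold."""
    return [(text, score) for text, score in results if score >= threshold]


kept = filter_by_score(cohere_results, threshold=0.5)  # 0.5 is an arbitrary cutoff
for text, score in kept:
    print(f"{score:.3f}  {text}")
```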
