# Cross-encoder re-ranking

The basic idea is that we want to score the results from the retrived results (i.e., the vector database.) This way instead of just taking the bi-encoders probaility of the best results from the document, we can use the cross-encoder to check the similarity score against each retrieved chunk. Therefore we rerank the results we ultimitl want to send the LLM. 

Note: Cross-encoders are light weight.

## Installation

In [19]:
%pip install -q -r requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Connecting to the vector database

In [20]:
import chromadb
from helper_utils import load_chroma, word_wrap, project_embeddings
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
import numpy as np

In [21]:
chroma_client = chromadb.PersistentClient(
    path="data/chroma_db/",
    settings=Settings(),
    tenant=DEFAULT_TENANT,
    database=DEFAULT_DATABASE,
)
# Load the existing collection by its name
collection_name = 'microsoft_annual_report_2022'
chroma_collection = chroma_client.get_or_create_collection(name=collection_name)

# Count the number of items in the collection
count = chroma_collection.count()
print(f"Number of items in the collection '{collection_name}': {count}")

Number of items in the collection 'microsoft_annual_report_2022': 349


## Setup the embedding function

In [22]:
# Access the underlying SentenceTransformer model (Defaults)
embedding_function = SentenceTransformerEmbeddingFunction()
model = embedding_function.models
print(model)

{'all-MiniLM-L6-v2': SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)}


## Helper to print the retrieved results

In [24]:
def print_results_and_documents(results, retrieved_documents, word_wrap):
    """
    Prints keys and values from the results dictionary and documents with word wrapping.

    Args:
        results (dict): A dictionary where keys are strings and values are either strings or lists.
        retrieved_documents (list): A list of documents to be printed.
        word_wrap (function): A function to apply word wrapping to the documents.

    Returns:
        None
    """
    # Iterate through the dictionary and print each key with its associated value
    for key, value in results.items():
        print(f"{key}:")

        # Check if the value is a list and print its elements
        if isinstance(value, list):
            for i, item in enumerate(value):
                print(f"  Item {i+1}: {item}")
        else:
            # Directly print the value if it's not a list
            print(f"  {value}")

        print()  # Add a newline for better readability

    # Iterate through the list of documents and print each one with word wrapping
    #for document in retrieved_documents:
    #    print(word_wrap(document))
    #    print('\n')

## Retrieving the results

In [66]:
query = "What has been the investment in research and development?"

results = chroma_collection.query(query_texts=query,
                                   n_results=10, 
                                   include=['documents', 'embeddings', "distances"])

retrieved_documents = results['documents'][0]
#print_results_and_documents(results, retrieved_documents, word_wrap)

#for document in results['documents'][0]:
#    print(word_wrap(document))
#    print('')

Note:   
What were doing is asking for more results (10) so instead of just getting nearest neighbors we are getting a long tail.

## Setting up a cross_encoder

Sentence encoders are made up of two kinds of models (bi-encoders and cross-encoders)   
  
bi-encoders give us similaries (Euclidean and Cosine)   (e,g, all-MiniLM-L6-v2)   
>  bi-encoders do a nearest neighbor with the query against all the documents  
>  The query and the document (vector database) embedding are callcuated independantly  

Note: This is basically what you use in the classic RAG model. We query the vector database and specify n_results (how many chunks/documents we want back) based on the similarity score (i.e., highest to lowest). This is done agains the whole vector datbase.

cross-encoders give us a score via a classifier  (e.g, ms-marco-MiniLM-L-6-v2)        
>  cross-encoders compare the query to each document and return a score. In otherwords the query and the chunk are compared individualy (not the shole vector datbase). Each comparison returns a similarity metric.  
>  The query and chunck embeddings are calculated at the same time.    
>  The default for cross-encoders is cosine simialrity
>  The the list of similarity cores can be sorted based on distance to use as the new retrieval list to pass to the LLM.    

Note: This is basically what you use to rerank.

<https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2>

In [67]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

## Calculating the scores

Note: In this example the scores come back unsorted.

Also notice the encoder returns unbounded ranges and is not normilized (0 to 1). Beware of of simlarity scores. 

In [68]:
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:
    print(score)

Scores:
0.8762942
2.6445775
-0.2680337
-10.731592
-7.706603
-5.6469946
-4.2970343
-10.933231
-8.666394
-7.038426


In this example chunk#2 is the most similar with 2.64. #3 is 2 and #3 is 3. #7 is next. In fact after the first 3 the rest are probaly distractors.

Note: When you use corss-encoders you tend to ask for more n_results (in this example 10). This is becuase after you sort the socres you will only pass back the most similar chunks to the LLM.

## Re-ranking the results based on the scores

In [74]:
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)

# Combine scores with documents
scored_documents = list(zip(scores, retrieved_documents))

# Sort by scores in descending order
scored_documents.sort(key=lambda x: x[0], reverse=True)

print("New Chunk Ordering:")
for idx, (score, doc) in enumerate(scored_documents, start=1):
    print(f"Chunk {idx} with score: {score}")
    

New Chunk Ordering:
Chunk 1 with score: 2.6445775032043457
Chunk 2 with score: 0.8762941956520081
Chunk 3 with score: -0.2680337131023407
Chunk 4 with score: -4.29703426361084
Chunk 5 with score: -5.646994590759277
Chunk 6 with score: -7.038425922393799
Chunk 7 with score: -7.706603050231934
Chunk 8 with score: -8.666394233703613
Chunk 9 with score: -10.731592178344727
Chunk 10 with score: -10.933231353759766


Orignal scores:   
0.8762942
2.6445775
-0.2680337
-10.731592
-7.706603
-5.6469946
-4.2970343
-10.933231
-8.666394
-7.038426 
 
  Now if we process the top 5 we get a long tail with more relavant information.

We might only pass the first two chunks into the LLM. The other chunks might be distractors. 



# Re-ranking with Query Expansion

In [99]:
original_query = "What were the most important factors that contributed to increases in revenue?"
generated_queries = [
    "What were the major drivers of revenue growth?",
    "Were there any new product launches that contributed to the increase in revenue?",
    "Did any changes in pricing or promotions impact the revenue growth?",
    "What were the key market trends that facilitated the increase in revenue?",
    "Did any acquisitions or partnerships contribute to the revenue growth?"
]

Note: The combined queries will ask six questions and get six 10x chuncks each (60 documents)

In [100]:
queries = [original_query] + generated_queries

results = chroma_collection.query(query_texts=queries,
                                   n_results=10, 
                                   include=['documents', 'embeddings', "distances"])

retrieved_documents = results['documents'][0]
print_results_and_documents(results, retrieved_documents, word_wrap)


ids:
  Item 1: ['143', '166', '152', '210', '148', '149', '147', '141', '151', '319']
  Item 2: ['143', '152', '319', '147', '148', '210', '145', '144', '293', '166']
  Item 3: ['145', '127', '210', '149', '148', '139', '150', '141', '188', '320']
  Item 4: ['151', '145', '148', '149', '127', '143', '141', '147', '293', '331']
  Item 5: ['149', '148', '145', '143', '166', '151', '141', '147', '319', '293']
  Item 6: ['143', '152', '166', '145', '149', '148', '127', '262', '210', '194']

embeddings:
  Item 1: [[-0.01560105  0.00026607  0.03910113 ... -0.12766221 -0.0023229
  -0.01344453]
 [ 0.00745113 -0.10491197  0.00869199 ... -0.11494956  0.01285286
  -0.02947784]
 [ 0.08230967 -0.0701528   0.02381334 ... -0.15965647 -0.00994389
  -0.0155568 ]
 ...
 [-0.03467573 -0.02139045  0.04252068 ... -0.13420498  0.00859338
  -0.01510288]
 [ 0.02047033 -0.05500471  0.04936543 ... -0.132761   -0.01696662
   0.02401748]
 [ 0.01663858 -0.05533892  0.0223398  ... -0.09261022 -0.02912271
  -0.030962

## Re-ranking the long tail

First we get rid of duplicates. This gets us down from 60 chunks to less than 50. 

In [101]:
# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

unique_documents = list(unique_documents)
print(len(unique_documents)," Unique Documents")

48  Unique Documents


In [102]:
pairs = []
for doc in unique_documents:
    pairs.append([original_query, doc])

In [103]:
scores = cross_encoder.predict(pairs)


In [104]:
print("Scores:")
for score in scores:
    print(score)

Scores:
-9.072264
-8.111133
-8.0000925
-7.4521346
-7.731777
-8.733776
-7.0936494
-8.700952
-8.394533
-7.290472
-8.777536
-9.152418
-8.603168
-8.771975
-8.442549
-8.039276
-8.305858
-6.9356394
-8.212902
-8.831294
-8.321934
-9.139928
-8.321725
-9.103209
-7.958362
-8.125719
-7.6008224
-8.759289
-8.859444
-8.928579
-8.520397
-8.266382
-8.338873
-8.023626
-8.618933
-7.429163
-8.897442
-9.0392685
-8.415417
-8.560882
-8.742549
-8.211147
-5.9334097
-7.3718023
-6.7395134
-8.204199
-8.078914
-8.152823


In [None]:
# Combine scores with documents
scored_documents = list(zip(scores, pairs))

# Sort by scores in descending order
scored_documents.sort(key=lambda x: x[0], reverse=True)

print("New Chunk Ordering:")
for idx, (score, doc) in enumerate(scored_documents, start=1):
    print(f"Chunk {idx} with score: {score}")

New Chunk Ordering:
Chunk 1 with score: -5.933409690856934
Chunk 2 with score: -6.739513397216797
Chunk 3 with score: -6.935639381408691
Chunk 4 with score: -7.093649387359619
Chunk 5 with score: -7.290472030639648
Chunk 6 with score: -7.37180233001709
Chunk 7 with score: -7.429162979125977
Chunk 8 with score: -7.452134609222412
Chunk 9 with score: -7.600822448730469
Chunk 10 with score: -7.731777191162109
Chunk 11 with score: -7.958362102508545
Chunk 12 with score: -8.000092506408691
Chunk 13 with score: -8.023626327514648
Chunk 14 with score: -8.039276123046875
Chunk 15 with score: -8.078913688659668
Chunk 16 with score: -8.111132621765137
Chunk 17 with score: -8.12571907043457
Chunk 18 with score: -8.152823448181152
Chunk 19 with score: -8.204198837280273
Chunk 20 with score: -8.21114730834961
Chunk 21 with score: -8.212902069091797
Chunk 22 with score: -8.266382217407227
Chunk 23 with score: -8.30585765838623
Chunk 24 with score: -8.321724891662598
Chunk 25 with score: -8.321933746

At this point during our dev phase we would want to use a human in the middle approach to see where the relavent number of chunks should be passed to the LLM. We started out with 60 chunks, we reduced around 10 as duplicate. Now we need to figure what a good cutoff score would be to add code to decied what to pass to the LLM.