## Dartboard: Optimizing for Relevant Information Gain in Retrieval
In real-world applications, knowledge bases are often large and dense, so the most relevant results tend to carry redundant information and crowd other relevant information out of the top-k.

A practical solution is to **combine both relevance and diversity** into a single scoring function and directly optimize for it.
 
This is an implementation of the "dartboard" algorithm, as described in the following paper:

> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)
(very elegant method, recommended reading)

The paper presents three variations on the same core idea: one for hybrid RAG (dense and sparse), one using a cross-encoder, and a vanilla version. The vanilla version conveys the idea and is the one given here. If you have a hybrid RAG, you can compute cosine similarity on both the dense and the sparse vectors and combine them into a single similarity score. Switching to cross-encoder scores is straightforward too, but you may need to rescale the distances.  
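
For the hybrid case, one way to combine the two similarities is a simple weighted blend of the dense and sparse cosine distances. The sketch below is only an illustration under that assumption: the function names and the mixing weight `alpha` are hypothetical, and the paper's hybrid variant may combine the scores differently.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distance between the rows of a and the rows of b.

    Rows must be non-zero vectors, otherwise the normalization divides by zero.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def hybrid_distance_matrix(dense_a, dense_b, sparse_a, sparse_b, alpha=0.5):
    """Blend dense and sparse cosine distances.

    alpha is a hypothetical mixing weight: 1.0 uses only the dense vectors,
    0.0 uses only the sparse ones.
    """
    dense_part = cosine_distance_matrix(dense_a, dense_b)
    sparse_part = cosine_distance_matrix(sparse_a, sparse_b)
    return alpha * dense_part + (1.0 - alpha) * sparse_part

# Toy data: two documents with 2-d dense and 3-d sparse (term-weight) vectors
dense = [[1.0, 0.0], [0.0, 1.0]]
sparse = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
dists = hybrid_distance_matrix(dense, dense, sparse, sparse, alpha=0.5)
```

A matrix built this way could stand in for the document-to-document distances used by the dartboard scoring, and the same blend applied to the query row would give the query distances.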

Additionally, I've introduced weights to control the balance between diversity and relevance.  
In practice, this weighting can help avoid retrieving results that are diverse but not relevant enough.
The paper also has an official code implementation, and this code is based on it, but I think the version here is more readable, manageable, and production-ready.
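
Reading the weights off the `greedy_dartsearch` function defined later in this notebook, the score of adding candidate $i$ to the already selected set $S$ can be written as follows. The symbols $\ell$, $m_t$, $w_{\text{div}}$ and $w_{\text{rel}}$ are notation introduced here for illustration; they correspond to `lognorm`, `maxes`, `DIVERSITY_WEIGHT` and `RELEVANCE_WEIGHT` in the code.

$$
\ell(a, b) = -\log \sigma - \tfrac{1}{2}\log(2\pi) - \frac{d(a, b)^2}{2\sigma^2}
$$

$$
s(i \mid S) = \log \sum_{t} \exp\Big( w_{\text{div}} \cdot \max\big(m_t,\ \ell(i, t)\big) + w_{\text{rel}} \cdot \ell(q, t) \Big),
\qquad m_t = \max_{j \in S} \ell(j, t)
$$

Here $d$ is the cosine distance, $q$ is the query, and $t$ ranges over all fetched candidates. At each step the candidate with the highest $s(i \mid S)$ is added to $S$. Setting both weights to 1 gives the unweighted score.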

### Import libraries and environment variables

In [6]:
import os
import sys
from dotenv import load_dotenv
import numpy as np
from scipy.special import logsumexp

# Load environment variables from a .env file
load_dotenv()
# Make sure the OpenAI API key is set (comment out if not using OpenAI)
if not os.getenv('OPENAI_API_KEY'):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks
from helper_functions import *
from evaluation.evalute_rag import *


### Read Docs

In [3]:
path = "../data/Understanding_Climate_Change.pdf"

### Encode document

In [7]:
# Same as in simple_rag.ipynb, except that a dense dataset is simulated
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()
    documents=documents*5 # load every document 5 times to emulate a dense dataset

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings (Tested with OpenAI and Amazon Bedrock)
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)
    #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)

    # Create vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [8]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

In [9]:
# Same as in simple_rag.ipynb, except that it uses the underlying FAISS index directly (not the wrapper)
def idx_to_text(idx):
    docstore_id = chunks_vector_store.index_to_docstore_id[idx]
    document = chunks_vector_store.docstore.search(docstore_id)
    return document.page_content


def get_context(query,k=5):
    # regular top k retrieval
    q_vec=chunks_vector_store.embedding_function.embed_documents([query])
    _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k)

    vecs = chunks_vector_store.index.reconstruct_batch(indices[0])
    texts = [idx_to_text(i) for i in indices[0]]
    return q_vec,vecs,texts


In [21]:

test_query = "What is the main cause of climate change?"


### Regular top k retrieval

In [22]:
q_vec,vecs,texts=get_context(test_query,k=3)
show_context(texts)

Context 1:
driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is  essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human 
activities have intensified this natural process, leading to a warmer climate.  
Fossil Fuels  
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and 
natural gas used for electricity, heating, and transportation. The industrial revolution marked 
the beginning of a significant increase in fossil fuel consumption, which continues to rise 
today.  
Coal


Context 2:
driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: C

### As you can see, the top results repeat the same content. We want each result to contribute new information.
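
A quick way to put a number on this redundancy is to measure how many of the retrieved chunks are distinct. The helper below is a hypothetical sketch, not part of the original notebook, and it only catches exact repeats; overlapping but non-identical chunks still count as distinct.

```python
def distinct_fraction(texts):
    """Fraction of retrieved chunks that are not exact repeats of another chunk."""
    if not texts:
        return 0.0
    return len(set(texts)) / len(texts)

# Toy example: two of the three retrieved chunks are identical
example = ["chunk A", "chunk A", "chunk B"]
ratio = distinct_fraction(example)
```

Calling `distinct_fraction(texts)` on the output of `get_context` and on the output of the dartboard retrieval gives a rough before/after comparison.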

In [27]:

# Adjust these according to your needs, knowledge base density, etc. 
DIVERSITY_WEIGHT=1.0
RELEVANCE_WEIGHT=1.0
SIGMA=0.1

def get_context_with_dartboard(query,k=5):
    q_vec=chunks_vector_store.embedding_function.embed_documents([query])
    _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k*3) # fetch more than k to ensure we overcome density and use diversity

    vecs = chunks_vector_store.index.reconstruct_batch(indices[0])
    texts = [idx_to_text(i) for i in indices[0]]

    vecs=np.array(vecs)
    
    dists_mat = 1-np.dot(vecs,vecs.T) # cosine distance (1 - cosine similarity), assuming unit-norm embeddings. Other distance functions may work better; this can also be applied to cross-encoder scores.
    
    q_dists=1-np.dot(q_vec,vecs.T)
    
    texts,scores=get_dartboard(q_dists,dists_mat,texts,SIGMA,k)

    return texts,scores



def get_dartboard(qdists, distsmat, texts, sigma: float, k: int):
    sigma = max(sigma, 1e-5)  # clamp sigma to avoid division by zero (np.max would treat the second argument as an axis)
    qprobs = lognorm(qdists, sigma)
    ccprobmat = lognorm(distsmat, sigma)
    return greedy_dartsearch(qprobs, ccprobmat, texts, k)


def lognorm(dist, sigma):
    if sigma < 1e-9: 
        return -np.inf * dist
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)


def greedy_dartsearch(qprobs, dists_mat, texts, k):
    out_scores=[]
    top_idx = np.argmax(qprobs)
    dset = np.array([top_idx])
    maxes = dists_mat[top_idx]
    while len(dset) < k:
        newmaxes = np.maximum(maxes, dists_mat)

        logscores = newmaxes*DIVERSITY_WEIGHT + qprobs*RELEVANCE_WEIGHT
        scores = logsumexp(logscores, axis=1)
        scores[dset] = -np.inf
        best_idx = np.argmax(scores)
        best_score = scores[best_idx]  # scores are already in log space, so no extra log is needed
        maxes = newmaxes[best_idx]
        dset = np.append(dset, best_idx)
        out_scores.append(best_score)
    return [texts[i] for i in dset],out_scores
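
To see the greedy selection at work without an API key or a vector store, here is a minimal, self-contained sketch on a toy corpus. The 2-d vectors, the names `log_gaussian` and `greedy_select`, and the use of `np.logaddexp.reduce` in place of SciPy's `logsumexp` are assumptions made for illustration, and both weights are fixed at 1.

```python
import numpy as np

def log_gaussian(dist, sigma):
    """Log-density of a zero-mean Gaussian evaluated at the given distances."""
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)

def greedy_select(q_logp, cc_logp, k):
    """Greedy dartboard-style selection; returns the chosen indices in pick order."""
    chosen = [int(np.argmax(q_logp))]          # start from the most relevant item
    maxes = cc_logp[chosen[0]]                 # how well the chosen set covers each item
    while len(chosen) < k:
        # Row i holds the coverage of every item if candidate i were added
        new_maxes = np.maximum(maxes, cc_logp)
        scores = np.logaddexp.reduce(new_maxes + q_logp, axis=1)
        scores[chosen] = -np.inf               # never pick the same index twice
        best = int(np.argmax(scores))
        maxes = new_maxes[best]
        chosen.append(best)
    return chosen

# Toy corpus of unit vectors: documents 0 and 1 are exact duplicates
angles = np.radians([10.0, 10.0, 25.0, 60.0])
docs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
query = np.array([1.0, 0.0])

q_dist = 1 - docs @ query                      # cosine distance to the query
cc_dist = 1 - docs @ docs.T                    # pairwise cosine distances
sigma = 0.1

top_k = sorted(np.argsort(q_dist)[:3].tolist())   # plain top-3 by distance
selected = greedy_select(log_gaussian(q_dist, sigma),
                         log_gaussian(cc_dist, sigma), k=3)
```

On this toy data, plain top-3 retrieval returns both copies of the duplicated document, while the greedy selection keeps one copy and spends the remaining slots on documents 2 and 3, since a second copy adds no coverage.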


### Dartboard retrieval

In [29]:
texts,scores=get_context_with_dartboard(test_query,k=3)
show_context(texts)
# Now the top 3 results are no longer mere repetitions.

Context 1:
driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is  essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human 
activities have intensified this natural process, leading to a warmer climate.  
Fossil Fuels  
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and 
natural gas used for electricity, heating, and transportation. The industrial revolution marked 
the beginning of a significant increase in fossil fuel consumption, which continues to rise 
today.  
Coal


Context 2:
Most of these climate changes are attributed to very small variations in Earth's orbit tha