## Libraries Overview:

1. **AzureOpenAI**: Used to interact with Azure OpenAI for handling embeddings and chat completions.

2. **FAISS**: LangChain's vector store wrapper around the `faiss` library, used for fast similarity search and retrieval over embeddings.

3. **LangChain Libraries**:
   - **PyPDFLoader**: Loads PDF documents for processing.
   - **RecursiveCharacterTextSplitter**: Splits large texts into smaller chunks, with overlap, to improve retrieval and context handling.
   - **InMemoryDocstore**: Stores documents in memory, which is useful for quick retrieval in vector stores.
   - **Document and BaseRetriever**: Define document structures and retrieval interfaces for LangChain workflows.
   - **RetrievalQA**: A LangChain component for question-answering tasks that retrieves relevant documents and answers queries.

4. **faiss**: The underlying library that builds the index over high-dimensional vectors for efficient document retrieval.

5. **pydantic**: Handles data validation and model structures, especially when defining custom classes like `CustomAzureLLM`.

6. **numpy**: For numerical operations, particularly in handling embeddings.
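
To make "similarity search" concrete, here is a minimal pure-Python sketch of exact nearest-neighbor search by squared L2 distance, which is the kind of comparison a flat L2 index performs. The function names and the toy 3-dimensional vectors are hypothetical and only serve as illustration; the notebook itself delegates this work to `faiss`.

```python
from typing import List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(
    query_vec: List[float], vectors: List[List[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for the k vectors closest to query_vec."""
    scored = [(i, squared_l2(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical toy "embeddings", for illustration only
toy_vectors = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
toy_hits = nearest_neighbors([1.0, 0.0, 0.0], toy_vectors, k=2)
print(toy_hits)
```

A brute-force scan like this is linear in the number of vectors, which is acceptable for a single PDF but is the reason dedicated index libraries exist.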


In [1]:
from dotenv import load_dotenv
import os
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import LLM
from openai import AzureOpenAI
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore import InMemoryDocstore
from langchain.schema import BaseRetriever, Document
from langchain.chains import RetrievalQA
from typing import List, Any, Optional, Dict
from pydantic import BaseModel, Field
import faiss
import numpy as np
from sentence_transformers import CrossEncoder



In [2]:
load_dotenv('variables.env')
# Azure OpenAI configuration
azure_openai_api_key = os.getenv('AZURE_OPENAI_API_KEY')
azure_openai_api_version = os.getenv('AZURE_OPENAI_API_VERSION')
azure_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')

# Confirm the configuration loaded; never print the API key itself
print("API key loaded:", bool(azure_openai_api_key))
print(azure_openai_api_version)
print(azure_endpoint)

API key loaded: True
2024-02-15-preview
https://home-openai-01.openai.azure.com/


## CustomAzureEmbeddings Class:

This class allows us to interact with Azure OpenAI's `text-embedding-ada-002` model to generate embeddings for both documents and queries.

### Methods:
1. **`__init__`**: Initializes the class by setting up the Azure OpenAI client using an API key, API version, and the Azure endpoint.
  
2. **`embed_documents`**: Loops through a list of documents and calls the `embed_query` method to generate embeddings for each document.

3. **`embed_query`**: Sends a query or document to Azure OpenAI’s embeddings model and retrieves the embedding.
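
When experimenting with the rest of the pipeline, it can help to have an offline stand-in that exposes the same two methods without calling Azure. The sketch below is one such stand-in; `FakeEmbeddings` and its hash-based vectors are hypothetical and carry no semantic meaning, so they are only suitable for checking that the plumbing works.

```python
import hashlib
from typing import List


class FakeEmbeddings:
    """Hypothetical offline stand-in exposing embed_documents / embed_query."""

    def __init__(self, dimension: int = 8):
        # A SHA-256 digest has 32 bytes, so this sketch supports at most 32 dimensions
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self.dimension = dimension

    def embed_query(self, text: str) -> List[float]:
        # Derive a deterministic pseudo-embedding from a hash of the text
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.dimension]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Same pattern as the class described above: one embed_query call per text
        return [self.embed_query(text) for text in texts]


fake = FakeEmbeddings(dimension=4)
fake_vectors = fake.embed_documents(["climate", "biodiversity"])
print(len(fake_vectors), len(fake_vectors[0]))
```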


In [3]:
class CustomAzureEmbeddings(Embeddings):
    def __init__(self, api_key, api_version, azure_endpoint):
        self.client = AzureOpenAI(
            api_key=api_key,
            api_version=api_version,
            azure_endpoint=azure_endpoint
        )

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text):
        response = self.client.embeddings.create(
            model="text-embedding-ada-002",
            input=text
        )
        return response.data[0].embedding

## CustomAzureLLM Class:

This custom LLM class interacts with Azure OpenAI to handle chat completions using a GPT-4o deployment.

### Components:
1. **`__init__`**: Initializes the Azure OpenAI client for chat model interactions.

2. **`_call`**: Sends a prompt to the Azure OpenAI GPT-4o model and retrieves the generated response.

3. **`_llm_type`**: Specifies the type of the LLM being used (`"custom_azure_llm"`).

4. **`_identifying_params`**: Returns the deployment name to identify the specific model being used.


In [4]:
# Updated Custom Azure LLM class
class CustomAzureLLM(LLM, BaseModel):
    client: AzureOpenAI = Field(default=None)
    deployment_name: str
    api_key: str
    api_version: str
    azure_endpoint: str

    class Config:
        arbitrary_types_allowed = True

    def __init__(self, **data):
        super().__init__(**data)
        self.client = AzureOpenAI(
            api_key=self.api_key,
            api_version=self.api_version,
            azure_endpoint=self.azure_endpoint
        )

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        messages = [{"role": "user", "content": prompt}]
        response = self.client.chat.completions.create(
            model=self.deployment_name,
            messages=messages,
            temperature=0,
            stop=stop
        )
        return response.choices[0].message.content

    @property
    def _llm_type(self) -> str:
        return "custom_azure_llm"

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Get the identifying parameters."""
        return {"deployment_name": self.deployment_name}


## Reranking Function:

This function reranks a list of documents based on their relevance to the query, using an Azure OpenAI GPT-4o deployment as the judge.

### Steps:
1. **`prompt_template`**: A prompt is sent to GPT-4o to rate the relevance of each document on a scale from 1 to 10.
  
2. **Loop Through Documents**: For each document, the relevance score is determined by GPT-4o based on the query.

3. **Sort and Return Top Documents**: The documents are then sorted by their relevance score, and the top N documents are returned.
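
The scoring and sorting steps can be tried without any API call. The sketch below assumes the model replies are already available as strings; `parse_score` and `top_n_by_score` are hypothetical helper names. It takes the first number found in each reply, which is a simplification: a reply that repeats "on a scale of 1-10" before the rating would be misread.

```python
import re
from typing import List


def parse_score(reply: str, default: float = 0.0) -> float:
    """Extract the first number in a model reply, or fall back to a default."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else default


def top_n_by_score(items: List[str], replies: List[str], top_n: int = 3) -> List[str]:
    """Pair each item with its parsed score and keep the top_n highest."""
    scored = [(item, parse_score(reply)) for item, reply in zip(items, replies)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:top_n]]


# Hypothetical replies standing in for the LLM's ratings
sample_replies = ["Relevance Score: 4", "8", "no rating given"]
ranked = top_n_by_score(["doc A", "doc B", "doc C"], sample_replies, top_n=2)
print(ranked)
```

Falling back to a default score of 0 keeps the loop running when a reply cannot be parsed, at the cost of silently demoting that document.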


In [5]:
# Reranking function
def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    client = AzureOpenAI(
        api_key=azure_openai_api_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_endpoint
    )

    prompt_template = """On a scale of 1-10, rate the relevance of the following document to the query. Consider the specific context and intent of the query, not just keyword matches.
    Query: {query}
    Document: {doc}
    Relevance Score:"""

    scored_docs = []
    for doc in docs:
        messages = [
            {"role": "system", "content": "You are a helpful assistant that rates document relevance."},
            {"role": "user", "content": prompt_template.format(query=query, doc=doc.page_content)}
        ]
        response = client.chat.completions.create(
            model="gpt-4o",  # Replace with your actual GPT-4 deployment name
            messages=messages,
            temperature=0
        )
        result = response.choices[0].message.content.strip()
        try:
            score = float(result)
        except ValueError:
            score = 0  # Default score if parsing fails
        scored_docs.append((doc, score))

    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked_docs[:top_n]]


## CustomRetriever Class:

This class uses a FAISS vector store for initial document retrieval and then reranks the documents using the reranking function.

### Components:
1. **`vectorstore`**: The FAISS vector store that stores the document embeddings.

2. **`get_relevant_documents`**: Retrieves documents using similarity search and reranks them based on relevance to the query.


In [6]:
# Custom Retriever class
class CustomRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str, num_docs=2) -> List[Document]:
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        return rerank_documents(query, initial_docs, top_n=num_docs)




## Encode PDF and Create Vector Store:

This function handles the following tasks:
1. **Loads PDF**: Loads the text from the PDF file.
2. **Text Splitting**: Splits the document into smaller chunks for embedding.
3. **Generate Embeddings**: Embeds the text chunks using Azure OpenAI.
4. **Create FAISS Index**: Stores the embeddings in a FAISS vector store for similarity search.
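
To show what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width sliding window. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers natural separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper that only illustrates how overlap repeats the tail of one chunk at the head of the next.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


pieces = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)
```

With the function's defaults of `chunk_size=300` and `chunk_overlap=200`, consecutive chunks share most of their text, which explains why the retrieved chunks later in this notebook repeat the same sentences.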


In [7]:
# Function to encode PDF and create vector store
def encode_pdf(path, chunk_size=300, chunk_overlap=200):


    try:
        embeddings_client = CustomAzureEmbeddings(
            api_key=azure_openai_api_key,
            api_version=azure_openai_api_version,
            azure_endpoint=azure_endpoint
        )

        loader = PyPDFLoader(path)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        texts = text_splitter.split_documents(documents)
        text_list = [doc.page_content for doc in texts]

        embeddings = embeddings_client.embed_documents(text_list)
        # FAISS indexes operate on float32 vectors
        embeddings_array = np.array(embeddings, dtype="float32")

        dimension = embeddings_array.shape[1]
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings_array)

        docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(texts)})
        index_to_docstore_id = {i: str(i) for i in range(len(texts))}

        vectorstore = FAISS(
            embedding_function=embeddings_client,  # pass the Embeddings object, not a bare function
            index=index,
            docstore=docstore,
            index_to_docstore_id=index_to_docstore_id,
        )

        return vectorstore

    except Exception as e:
        print(f"An error occurred: {e}")
        raise


## Main Execution for Vector Store and QA System:

This section runs the complete pipeline for document retrieval and question-answering.

### Steps:
1. **Vector Store Creation**: The PDF is processed and stored in a FAISS vector store.
  
2. **Custom Retriever**: A custom retriever is created to rerank and retrieve relevant documents.

3. **Custom Azure LLM**: An Azure GPT-4o deployment is used to answer questions based on the retrieved context.

4. **RetrievalQA Chain**: Combines document retrieval and question-answering into one workflow.
  
5. **Query Execution**: Answers the query and prints the results along with the source documents.
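
The `"stuff"` chain type places all retrieved chunks into a single prompt ahead of the question. The sketch below illustrates that idea only; `build_stuffed_prompt` and its template wording are hypothetical and are not LangChain's actual prompt.

```python
from typing import List


def build_stuffed_prompt(question: str, contexts: List[str]) -> str:
    """Join all retrieved chunks into one context block ahead of the question."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks, for illustration only
stuffed = build_stuffed_prompt(
    "What drives recent climate change?",
    ["Greenhouse gases trap heat.", "Human activity increases emissions."],
)
print(stuffed)
```

Because every chunk goes into one prompt, the number of documents the retriever returns directly affects prompt length, which is one reason to rerank and keep only a few.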


In [8]:
# Main execution
if __name__ == "__main__":
    # Example usage
    pdf_path = './data/Understanding_Climate_Change.pdf'
    vectorstore = encode_pdf(pdf_path)

    # Create the custom retriever
    custom_retriever = CustomRetriever(vectorstore=vectorstore)

    # Create the custom Azure LLM
    azure_llm = CustomAzureLLM(
        api_key=azure_openai_api_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_endpoint,
        deployment_name="gpt-4o"  # Replace with your actual GPT-4 deployment name
    )

    # Create the RetrievalQA chain with the custom retriever and LLM
    qa_chain = RetrievalQA.from_chain_type(
        llm=azure_llm,
        chain_type="stuff",
        retriever=custom_retriever,
        return_source_documents=True
    )

    # Example query
    query = "What are the main impacts of climate change on biodiversity?"
    result = qa_chain({"query": query})

    print(f"Query: {query}")
    print(f"\nAnswer: {result['result']}")
    print("\nSource Documents:")
    for i, doc in enumerate(result['source_documents']):
        print(f"\nDocument {i+1}:")
        print(doc.page_content[:200] + "...")  # Print first 200 characters of each source document




Query: What are the main impacts of climate change on biodiversity?

Answer: The main impacts of climate change on biodiversity include disruptions to ecosystems, which can lead to the loss of species and habitats. Climate change can alter the distribution of species, affect migration patterns, and change the timing of biological events such as flowering and breeding. These changes can reduce biodiversity and weaken ecosystem resilience, making it harder for ecosystems to provide essential services and support human well-being.

Source Documents:

Document 1:
the impacts of climate change and build a resilient, equitable, and thriving world for future 
generations. The journey ahead requires dedication, creativity, and c ollective effort from all 
sectors ...

Document 2:
goals. Policies should promote synergies between biodiversity conservation and climate 
action.  
Chapter 10: Climate Change and Human Health  
Health Impacts  
Heat -Related Illnesses  
Rising temper...


## Function: retrieve_context_per_question

This function is designed to retrieve the most relevant context (documents) for a given question from a vector store.

### Parameters:
- **`question`**: The query or question for which the context needs to be retrieved.
- **`vectorstore`**: The FAISS vector store containing the document embeddings. This is where the relevant documents are searched from.
- **`k`**: The number of top relevant documents to retrieve. The default value is 5.

### Process:
1. **Create a Retriever**: 
   - The vector store is converted into a retriever using the `as_retriever()` method.
   - The number of documents retrieved is controlled by the `k` parameter.

2. **Retrieve Relevant Documents**:
   - The retriever searches for the most relevant documents for the given question.
   - The `get_relevant_documents()` method returns the documents, and their content is extracted.

3. **Prepare the Context**:
   - The content of each retrieved document is stored in a list, which is returned by the function.

### Output:
- The function returns a list containing the contents (context) of the most relevant documents, which can be used to answer the given question or for further processing.


In [9]:
def retrieve_context_per_question(question, vectorstore, k=5):
    """
    Retrieves relevant context for a given question using the vector store retriever.
    
    Parameters:
    - question (str): The question for which to retrieve relevant context.
    - vectorstore: The FAISS vector store to retrieve the documents from.
    - k (int): The number of top documents to retrieve (default is 5).
    
    Returns:
    - context (list of str): A list of the retrieved documents' contents.
    """
    # Create a retriever from the vector store
    retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    
    # Retrieve the relevant documents
    docs = retriever.get_relevant_documents(question)
    
    # Prepare the context by extracting the content of each retrieved document
    context = [doc.page_content for doc in docs]
    
    return context


## Steps for Querying the Vector Store and Printing Context:

### Step 2: Query the Vector Store for Relevant Context
1. **Define the Query**:
   - In this step, you define the query (or question) for which you want to retrieve relevant documents from the vector store.
   - The query in this case is `"What is the main cause of climate change?"`.

2. **Retrieve Context**:
   - The function `retrieve_context_per_question()` is called with the query and the `vectorstore` (the FAISS vector store created earlier).
   - This function returns the top relevant documents from the vector store as context chunks. The number of chunks retrieved is determined by the `k` value in the function (default is 5).

### Step 3: Print the Retrieved Context
1. **Print Each Chunk**:
   - The `for` loop iterates over each chunk of context returned by the `retrieve_context_per_question` function.
   - For each chunk, it prints the content of the chunk along with its index (starting from 1).

### Output:
- This code outputs the content of the top N (k) most relevant documents (chunks) that are retrieved from the vector store based on the query.
- The chunks represent different sections of documents that are relevant to the query.


In [10]:
# Step 2: Query the vector store for relevant context
query = "What is the main cause of climate change?"
context = retrieve_context_per_question(query, vectorstore)

# Step 3: Print the retrieved context
for i, chunk in enumerate(context):
    print(f"Chunk {i + 1}: {chunk}")



Chunk 1: Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is  essential
Chunk 2: Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous
Chunk 3: driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the
Chunk 4: provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly



## Method 1: LLM-based function to rerank the retrieved documents
In this step, we define a method for reranking documents using an LLM (GPT-4o). The function, `rerank_documents`, takes a query and a list of retrieved documents, asks the LLM to rate each document's relevance to the query, and then sorts the documents by score.

This approach allows for better reranking by considering the context of the query rather than just keyword matches.

### Step 3: Example Usage of the Reranking Function
In this step, we demonstrate how to use the `rerank_documents` function with a sample query about climate change. We first retrieve a set of documents with `vectorstore.similarity_search` and then rerank them using our LLM-based reranking method.

After reranking, we print the top reranked documents. The function also prints each document's raw LLM response and parsed score, so the ranking can be inspected.
#### Description:
- The query asks, "What are the impacts of climate change on biodiversity?"
- We retrieve 5 initial documents from the vector store using similarity search.
- The `rerank_documents` function then reranks these documents based on their relevance to the query.
- We print the top reranked documents (3 by default), along with the raw responses and parsed scores logged during reranking.

In [17]:
import re

def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    client = AzureOpenAI(
        api_key=azure_openai_api_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_endpoint
    )

    prompt_template = """On a scale of 1-10, rate the relevance of the following document to the query. Consider the specific context and intent of the query, not just keyword matches. Provide the numeric score followed by a brief explanation.

Query: {query}

Document: {doc}

Relevance Score (1-10) and brief explanation:"""

    scored_docs = []
    for i, doc in enumerate(docs):
        messages = [
            {"role": "system", "content": "You are a helpful assistant that rates document relevance."},
            {"role": "user", "content": prompt_template.format(query=query, doc=doc.page_content)}
        ]
        try:
            response = client.chat.completions.create(
                model="gpt-4o",  # Replace with your actual GPT-4 deployment name
                messages=messages,
                temperature=0
            )
            result = response.choices[0].message.content.strip()
            print(f"Document {i+1} raw response: {result}")
            
            # Extract the numeric score using regex
            score_match = re.search(r'Relevance Score:?\s*(\d+(?:\.\d+)?)', result)
            if score_match:
                score = float(score_match.group(1))
                print(f"Document {i+1} parsed score: {score}")
            else:
                print(f"Failed to parse score for Document {i+1}. Using default score.")
                score = 0
        except Exception as e:
            print(f"Error processing Document {i+1}: {str(e)}")
            score = 0
        
        scored_docs.append((doc, score))

    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    
    print("\nFinal Scores:")
    for i, (doc, score) in enumerate(reranked_docs):
        print(f"Document {i+1}: Score {score}")
    
    return [doc for doc, _ in reranked_docs[:top_n]]

In [18]:
# Define the query
query = "What are the impacts of climate change on biodiversity?"

# Retrieve the initial set of documents using vector similarity search
initial_docs = vectorstore.similarity_search(query, k=5)  # Using 5 documents for demonstration

# Apply the final improved LLM-based reranking function to reorder the documents based on relevance
reranked_docs = rerank_documents(query, initial_docs)

# Print the top reranked documents
print(f"\nQuery: {query}")
print("Top reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print the first 200 characters of each document

Document 1 raw response: Relevance Score: 4

Explanation: The document contains a chapter specifically titled "Climate Change and Biodiversity," which suggests that it may cover the impacts of climate change on biodiversity. However, the provided excerpt does not include any specific information or details about these impacts. The general context of the document seems to focus on broader themes of resilience and collective effort, rather than directly addressing the query. Therefore, while there is some potential relevance, the excerpt itself does not provide concrete information on the impacts of climate change on biodiversity.
Document 1 parsed score: 4.0
Document 2 raw response: Relevance Score: 2

Explanation: The document primarily focuses on the health impacts of climate change, specifically heat-related illnesses, rather than the impacts of climate change on biodiversity. While there is a brief mention of promoting synergies between biodiversity conservation and climate action, 

### Step 4: Create a Custom Retriever Based on Reranking
In this step, we implement a custom retriever that integrates our reranking function. This custom retriever will first perform an initial retrieval of documents using a vector store and then apply the LLM-based reranking function to refine the results before returning the most relevant documents.

The custom retriever is then used in a `RetrievalQA` chain with GPT-4o to answer questions based on the reranked documents.

#### Description:
- The `CustomRetriever` class extends the base retriever and integrates the reranking logic. It performs an initial vector search (`k=30`) and then reranks the documents based on relevance.
- The `RetrievalQA` chain is created using this custom retriever, allowing us to answer questions by retrieving and reranking the most relevant documents before passing them to GPT-4o for generating responses.

In [19]:
from pydantic import BaseModel, Field

# Custom Retriever class
class CustomRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str, num_docs=2) -> List[Document]:
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        return rerank_documents(query, initial_docs, top_n=num_docs)


# Create the custom retriever
custom_retriever = CustomRetriever(vectorstore=vectorstore)





### Step 5: Example Query with the Custom QA Chain
In this final step, we test our `qa_chain` with a sample query. The query is passed to the chain, which uses the custom retriever to retrieve and rerank the documents based on relevance. The top documents are then used by GPT-4o to generate an answer.

We print the answer along with the first 200 characters of each source document to see how the reranked documents were used.

#### Description:
- This cell runs a sample query against the `qa_chain` and retrieves the top documents using the custom reranking retriever.
- The answer generated by GPT-4o is printed, followed by a list of the source documents, giving insight into how the reranked documents contributed to the answer. This helps check whether the reranking step surfaces documents that are relevant to the query.

In [20]:
# Run an example query
result = qa_chain({"query": query})

# Print the question and the generated answer
print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")

# Print the relevant source documents used to generate the answer
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print the first 200 characters of each document


Document 1 raw response: Relevance Score: 4

Explanation: The document contains a chapter specifically titled "Climate Change and Biodiversity," which suggests that it may cover the impacts of climate change on biodiversity. However, the provided excerpt does not include any specific information or details about these impacts. The general language about building a resilient and equitable world is not directly relevant to the query. Therefore, while the document might contain relevant information in the specified chapter, the excerpt itself does not provide enough context to be highly relevant.
Document 1 parsed score: 4.0
Document 2 raw response: Relevance Score: 2

Explanation: The document primarily focuses on the health impacts of climate change, specifically heat-related illnesses, rather than the impacts of climate change on biodiversity. While there is a brief mention of policies promoting synergies between biodiversity conservation and climate action, the main content does not a

### Step 6: Demonstrating the Importance of Reranking
In this step, we create a small set of documents (chunks) with varying relevance to a query. These chunks contain similar statements about the capital of France, Paris. We will compare the results from a baseline vector search retrieval with the reranked results from our custom retriever.

This example demonstrates how reranking can improve the relevance of the results by providing contextually better matches to the query.

#### Description:
- The dataset consists of simple statements and sentences about Paris and France. Some chunks provide more context about Paris being the capital, while others are less informative.
- We run two retrieval methods: the baseline vector similarity search and the advanced reranked approach using our custom retriever.
- The output compares the top 2 documents retrieved by each approach. The reranked results should show that contextually richer documents (those providing more than just "the capital of France is...") are prioritized over simpler, less informative ones. This demonstrates why reranking is valuable in a retrieval-augmented generation pipeline.



## Define the Document Chunks:

1. **Chunks**: A set of sentences related to Paris and France, which simulate small sections of a document.
2. **Convert to Document Objects**: Each text chunk is converted into a `Document` object from the LangChain framework to represent structured content.


In [21]:
# Assuming CustomAzureEmbeddings, CustomAzureLLM, and CustomRetriever are defined as in the previous example

# Sample chunks containing different statements about Paris and France
chunks = [
    "The capital of France is great.",
    "The capital of France is huge.",
    "The capital of France is beautiful.",
    """Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. 
    I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.""", 
    "I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city."
]

# Convert each chunk into a Document object
docs = [Document(page_content=sentence) for sentence in chunks]

## Comparison of Baseline and Advanced Retrieval Techniques:

1. **Azure OpenAI Configuration**:
   - Configures the API key, version, and endpoint to interact with Azure OpenAI.
  
2. **CustomAzureEmbeddings**: 
   - Embeddings are generated using Azure OpenAI's `text-embedding-ada-002` model for each document.

3. **Baseline Retrieval**:
   - Using FAISS, the function performs a standard similarity search to find the top 2 documents that match the query.
   - It prints the result of this baseline search.

4. **Advanced Retrieval (Reranking)**:
   - After the initial similarity search, documents are reranked based on their relevance to the query using a `CustomRetriever`.
   - The function then prints the top reranked documents.


In [22]:




# Function to compare baseline and advanced (reranked) retrieval techniques
def compare_rag_techniques(query: str, docs: List[Document] = docs) -> None:


    # Create Azure embeddings
    embeddings = CustomAzureEmbeddings(
        api_key=azure_openai_api_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_endpoint
    )

    # Create vector store
    vectorstore = FAISS.from_documents(docs, embeddings)

    print("Comparison of Retrieval Techniques")
    print("==================================")
    print(f"Query: {query}\n")
    
    # Baseline Retrieval
    print("Baseline Retrieval Result:")
    baseline_docs = vectorstore.similarity_search(query, k=2)
    for i, doc in enumerate(baseline_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

    # Advanced Retrieval using Reranking
    print("\nAdvanced Retrieval Result:")
    custom_retriever = CustomRetriever(vectorstore=vectorstore)
    advanced_docs = custom_retriever.get_relevant_documents(query)
    for i, doc in enumerate(advanced_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

# Main execution
if __name__ == "__main__":
    # Query to demonstrate reranking
    query = "what is the capital of france?"
    compare_rag_techniques(query, docs)

Comparison of Retrieval Techniques
Query: what is the capital of france?

Baseline Retrieval Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.

Advanced Retrieval Result:
Document 1 raw response: Relevance Score: 3

Explanation: The document does mention that the capital of France is "great," which indirectly implies knowledge of the capital. However, it does not directly answer the query by stating that the capital of France is Paris. The response is vague and lacks the specific information requested.
Document 1 parsed score: 3.0
Document 2 raw response: Relevance Score: 8

Explanation: The document directly addresses the query by stating that the capital of France is beautiful. While it does not explicitly name the capital, it implies knowledge of the capital, which is Paris. The document is relevant but could be more precise by explicitly stating "The capital of France is Paris."
Document 2 parsed score: 8.0
Document 3 raw response
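
The log above shows each raw LLM response being turned into a numeric score (for example, `Relevance Score: 3` becomes `3.0`). How `CustomRetriever` does this is not shown in this section, so the following is only a minimal sketch of one possible approach. It assumes the response always contains a `Relevance Score: <number>` line; `parse_relevance_score` and its `default` fallback are hypothetical names introduced for illustration.

```python
import re

# Assumed response format, taken from the log above: "Relevance Score: <number>"
SCORE_PATTERN = re.compile(r"Relevance Score:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_relevance_score(raw_response: str, default: float = 0.0) -> float:
    """Extract the numeric score from an LLM response, or return `default` if none is found."""
    match = SCORE_PATTERN.search(raw_response)
    if match is None:
        return default
    return float(match.group(1))


raw = "Relevance Score: 8\n\nExplanation: The document directly addresses the query."
print(parse_relevance_score(raw))         # 8.0
print(parse_relevance_score("no score"))  # 0.0
```

Falling back to a default score keeps a single malformed response from aborting the whole reranking pass, although it does push that document to the bottom of the ranking.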

# Method 2: Cross-Encoder Models
### Cross-Encoder-Based Document Reranking
In this step, we use a pre-trained cross-encoder model from Hugging Face (`cross-encoder/ms-marco-MiniLM-L-6-v2`) to rerank the retrieved documents. A cross-encoder takes the query and a document together as a single input and outputs one relevance score for that pair.

We define an `AzureCrossEncoderRetriever` class that first retrieves documents using vector similarity search and then reranks them by their cross-encoder scores.

#### Description:
- This cell implements a cross-encoder-based reranking retriever. It first retrieves an initial set of `k` documents using vector similarity search.
- Each retrieved document is paired with the query and passed through the pre-trained cross-encoder model, which assigns a relevance score to each query-document pair.
- The documents are sorted by score in descending order, and the top `rerank_top_k` documents are returned. Because the cross-encoder reads the query and the document together, it can judge relevance more precisely than embedding similarity alone, at the cost of scoring every candidate pair.
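
The rerank step described above can be sketched without any model dependency. In this sketch, `toy_overlap_scorer` is a hypothetical stand-in for `cross_encoder.predict` (it simply counts shared words), and `rerank` is an illustrative helper, not part of LangChain or sentence-transformers. Only the pair-score-sort-slice structure is meant to carry over.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    docs: List[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int,
) -> List[str]:
    """Score every (query, doc) pair and return the top_k docs, best first."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def toy_overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical scorer: number of lowercase words shared by query and document."""
    scores = []
    for query, doc in pairs:
        shared = set(query.lower().split()) & set(doc.lower().split())
        scores.append(float(len(shared)))
    return scores


candidates = [
    "The capital of France is great.",
    "Paris is the capital of France",
    "Berlin is in Germany",
]
top = rerank("capital of france paris", candidates, toy_overlap_scorer, top_k=2)
print(top)  # ['Paris is the capital of France', 'The capital of France is great.']
```

Passing the scorer in as a function keeps the sorting logic independent of the model, so a real cross-encoder can be swapped in without touching `rerank`.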

In [23]:


# Assuming CustomAzureEmbeddings and CustomAzureLLM are defined as before

# Initialize the cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

class AzureCrossEncoderRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")
    cross_encoder: Any = Field(description="Cross-encoder model for reranking")
    k: int = Field(default=5, description="Number of documents to retrieve initially")
    rerank_top_k: int = Field(default=3, description="Number of documents to return after reranking")

    class Config:
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str) -> List[Document]:
        # Perform an initial retrieval using vector similarity search
        initial_docs = self.vectorstore.similarity_search(query, k=self.k)
        
        # Create pairs of query and document content for cross-encoder reranking
        pairs = [[query, doc.page_content] for doc in initial_docs]
        
        # Use the cross-encoder to predict relevance scores for each query-document pair
        scores = self.cross_encoder.predict(pairs)
        
        # Sort the documents by their cross-encoder scores in descending order
        scored_docs = sorted(zip(initial_docs, scores), key=lambda x: x[1], reverse=True)
        
        # Return the top reranked documents based on the cross-encoder scores
        return [doc for doc, _ in scored_docs[:self.rerank_top_k]]

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async retrieval not implemented")

# Function to compare baseline and cross-encoder reranking retrieval techniques
def compare_rag_techniques(query: str, docs: List[Document]) -> None:
    """Print the top results of plain similarity search and of cross-encoder reranking for `query`."""

    embeddings = CustomAzureEmbeddings(
        api_key=azure_openai_api_key,
        api_version=azure_openai_api_version,
        azure_endpoint=azure_endpoint
    )

    vectorstore = FAISS.from_documents(docs, embeddings)

    print("Comparison of Retrieval Techniques")
    print("==================================")
    print(f"Query: {query}\n")
    
    # Baseline Retrieval
    print("Baseline Retrieval Result:")
    baseline_docs = vectorstore.similarity_search(query, k=2)
    for i, doc in enumerate(baseline_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

    # Cross-Encoder Reranking Retrieval
    print("\nCross-Encoder Reranking Result:")
    cross_encoder_retriever = AzureCrossEncoderRetriever(
        vectorstore=vectorstore,
        cross_encoder=cross_encoder,
        k=5,
        rerank_top_k=2
    )
    reranked_docs = cross_encoder_retriever.get_relevant_documents(query)
    for i, doc in enumerate(reranked_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

# Sample chunks
chunks = [
    "The capital of France is great.",
    "The capital of France is huge.",
    "The capital of France is beautiful.",
    """Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. 
    I really enjoyed all the cities in France, but its capital with the Eiffel Tower is my favorite city.""", 
    "I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city."
]

# Convert each chunk into a Document object
docs = [Document(page_content=sentence) for sentence in chunks]

# Main execution
if __name__ == "__main__":
    query = "what is the capital of france?"
    compare_rag_techniques(query, docs)



Comparison of Retrieval Techniques
Query: what is the capital of france?

Baseline Retrieval Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.

Cross-Encoder Reranking Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.


### Step 8: Example Query Using Cross-Encoder Retriever
In this final step, we create an instance of `AzureCrossEncoderRetriever` and use it in a `RetrievalQA` chain to answer a query. The chain retrieves documents from the vector store, reranks them with the cross-encoder, and then uses GPT-4 (the `gpt-4o` deployment) to generate an answer. We print both the answer and the relevant source documents.

#### Description:
- This cell creates an instance of `AzureCrossEncoderRetriever`, which uses a cross-encoder model to rerank the retrieved documents.
- The `RetrievalQA` chain is set up with GPT-4 (the `gpt-4o` deployment) as the LLM and uses the cross-encoder retriever to supply the most relevant documents.
- The query asks about the impacts of climate change on biodiversity. The answer is generated by GPT-4, and the first 200 characters of each of the top 5 source documents (after reranking) are printed.

## 1. Initialize the Cross-Encoder Model:

- `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`: 
   - This model is used for reranking documents after they have been retrieved by the initial vector store search.
   - The model scores each document based on how well it answers the query. 
   - `ms-marco-MiniLM-L-6-v2` is a commonly used model for such reranking tasks.

## 2. Create the Cross-Encoder Retriever:

- `AzureCrossEncoderRetriever`: 
   - Combines a standard retriever (vector store search) with the cross-encoder for reranking.
   - First, it retrieves 10 documents (`k=10`) based on their vector similarity.
   - It then reranks the top 10 documents using the cross-encoder and selects the top 5 most relevant ones (`rerank_top_k=5`).
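
The interaction between `k` and `rerank_top_k` can be illustrated with a small sketch. The reranker only sees the first `k` candidates from the vector search, so a document ranked lower than `k` can never be returned, however well it would have scored. The function `two_stage_retrieve`, the document IDs, and the scores below are all made up for illustration.

```python
from typing import Dict, List


def two_stage_retrieve(
    vector_ranking: List[str],
    rerank_scores: Dict[str, float],
    k: int,
    rerank_top_k: int,
) -> List[str]:
    """Keep the first k vector-search hits, then reorder them by reranker score."""
    candidates = vector_ranking[:k]  # stage 1: only these reach the reranker
    reranked = sorted(candidates, key=lambda doc: rerank_scores[doc], reverse=True)
    return reranked[:rerank_top_k]   # stage 2: best-scoring candidates


# Hypothetical data: doc_d is ranked last by vector search but best by the reranker.
vector_ranking = ["doc_a", "doc_b", "doc_c", "doc_d"]
rerank_scores = {"doc_a": 0.2, "doc_b": 0.1, "doc_c": 0.3, "doc_d": 0.9}

print(two_stage_retrieve(vector_ranking, rerank_scores, k=2, rerank_top_k=1))  # ['doc_a']
print(two_stage_retrieve(vector_ranking, rerank_scores, k=4, rerank_top_k=1))  # ['doc_d']
```

A larger `k` gives the reranker more chances to recover a relevant document, but every extra candidate is another query-document pair to score.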

## 3. Set up the Azure LLM (GPT-4):

- `CustomAzureLLM`: 
   - A custom class to interact with Azure OpenAI's GPT-4 model.
   - The `api_key`, `api_version`, and `azure_endpoint` are specific to your Azure OpenAI resource.
   - `deployment_name="gpt-4o"`: Specifies the deployed GPT-4 model in your Azure OpenAI instance.

## 4. Create the RetrievalQA Chain:

- `RetrievalQA.from_chain_type`:
   - Combines document retrieval with question-answering.
   - The chain first uses the `cross_encoder_retriever` to find the most relevant documents, then passes those documents to `azure_llm` to generate the final answer.
   - `return_source_documents=True`: Ensures the source documents used to generate the answer are returned, so the user can verify the answer's provenance.
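
Conceptually, the `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below illustrates that idea only. `build_stuff_prompt` is a hypothetical helper, and its prompt wording is an assumption, not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate all documents into one context block followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What are the impacts of climate change on biodiversity?",
    ["First reranked chunk.", "Second reranked chunk."],
)
print(prompt)
```

Because every document goes into one prompt, the value of `rerank_top_k` directly controls how much context the LLM receives, which matters for models with limited context windows.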

## 5. Example Query:

- `query = "What are the impacts of climate change on biodiversity?"`: 
   - This query is used to test the system, where the retriever will find relevant documents, and GPT-4 will generate the final answer based on those documents.

## 6. Display Results:

- **Answer**: The output generated by GPT-4 based on the retrieved and reranked documents.
- **Source Documents**: The most relevant source documents used to answer the query.


In [24]:
from langchain.chains import RetrievalQA
from sentence_transformers import CrossEncoder

# Assuming you have already defined CustomAzureLLM, AzureCrossEncoderRetriever, and have the vectorstore ready

# Initialize the cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Create the cross-encoder retriever
cross_encoder_retriever = AzureCrossEncoderRetriever(
    vectorstore=vectorstore,
    cross_encoder=cross_encoder,
    k=10,  # Retrieve 10 documents initially
    rerank_top_k=5  # Return top 5 after reranking
)

# Set up the Azure LLM (GPT-4)
azure_llm = CustomAzureLLM(
    api_key=azure_openai_api_key,
    api_version=azure_openai_api_version,
    azure_endpoint=azure_endpoint,
    deployment_name="gpt-4o"  # Replace with your actual GPT-4 deployment name
)

# Create the RetrievalQA chain with the cross-encoder retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=azure_llm,
    chain_type="stuff",
    retriever=cross_encoder_retriever,
    return_source_documents=True
)

# Example query
query = "What are the impacts of climate change on biodiversity?"
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call qa_chain({...})

# Print the query and answer
print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")

# Print the relevant source documents
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print the first 200 characters of each document




Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity by altering terrestrial and marine ecosystems. In terrestrial ecosystems, it shifts habitat ranges, changes species distributions, and impacts ecosystem functions, leading to shifts in plant and animal species composition. These changes can result in a loss of biodiversity and disrupt ecological balance. In marine ecosystems, rising sea temperatures and other climate-related changes make them highly vulnerable, further affecting biodiversity.

Relevant source documents:

Document 1:
sectors of society.  
Chapter 9: Climate Change and Biodiversity  
Impact on Ecosystems  
Terrestrial Ecosystems  
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changin...

Document 2:
the impacts of climate change and build a resilient, equitable, and thriving world for future 
generations. The journey ahead requires dedication, creativity, and c ollective effort fr