<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Grant Glass](https://glassgrant.com) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email grantg@unc.edu.<br />
____

# `Large Language Models and Embeddings for Retrieval Augmented Generation: Day 3` `7/19/24`

This is lesson `3` of 3 in the educational series on `Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)`. This notebook focuses on advanced techniques for optimizing RAG systems.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* Language models
* Vector embeddings
* Retrieval Augmented Generation
* Performance optimization

**Audience:** `Learners`

**Use case:** `Tutorial`

This tutorial guides users through advanced techniques for optimizing Retrieval Augmented Generation systems.



**Difficulty:** `Advanced`

Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the specific optimization techniques for RAG systems.


**Completion time:** `90 minutes`

**Knowledge Required:** 
* Python programming (including object-oriented programming)
* Understanding of LLMs and embeddings (covered in Days 1 and 2)
* Basic knowledge of RAG systems (covered in Day 2)


**Knowledge Recommended:**
* Experience with natural language processing (NLP)
* Familiarity with information retrieval concepts

**Learning Objectives:**
After this lesson, learners will be able to:
1. Implement advanced retrieval techniques for RAG systems
2. Optimize prompt engineering for improved RAG performance
3. Develop strategies for handling long contexts in RAG
4. Implement and evaluate different reranking methods
5. Create a more sophisticated RAG pipeline integrating multiple optimization techniques


**Research Pipeline:**
1. Introduction to LLMs and their applications (Day 1)
2. Exploring embeddings and introduction to RAG (Day 2)
3. **Optimizing RAG systems for enhanced performance**
4. Applying optimized RAG in research contexts

___

# Required Python Libraries

* [OpenAI](https://github.com/openai/openai-python) for generating embeddings and interacting with GPT models
* [Pandas](https://pandas.pydata.org/) for data manipulation
* [NumPy](https://numpy.org/) for numerical operations
* [Scikit-learn](https://scikit-learn.org/) for similarity calculations and evaluation metrics
* [FAISS](https://github.com/facebookresearch/faiss) for efficient similarity search
* [NLTK](https://www.nltk.org/) for text preprocessing

## Install Required Libraries

In [None]:
### Install Libraries ###
!pip install openai pandas numpy scikit-learn faiss-cpu nltk

In [None]:
# Import Libraries

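# Import the OpenAI client class to interact with the OpenAI API, useful for tasks like text generation or semantic search.
# The client is created later with OpenAI(...), so the class itself must be imported.
from openai import OpenAI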

# Import pandas, a powerful data manipulation and analysis library for Python.
import pandas as pd

# Import numpy, a library for numerical operations on large, multi-dimensional arrays and matrices.
import numpy as np

# Import cosine_similarity from sklearn, a method for calculating the cosine similarity between vectors, useful in various machine learning tasks.
from sklearn.metrics.pairwise import cosine_similarity

# Import precision_score, recall_score, f1_score from sklearn for evaluating the accuracy of a classification.
from sklearn.metrics import precision_score, recall_score, f1_score

# Import faiss, a library for efficient similarity search and clustering of dense vectors.
import faiss

# Import nltk, a toolkit for natural language processing (NLP) tasks.
import nltk

# From nltk, import word_tokenize for splitting strings into words and stopwords for filtering out common words.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Import os, a module for interacting with the operating system, useful for file paths, environment variables, etc.
import os

# Import ast, whose literal_eval function safely converts embeddings stored as strings
# (for example, after a round trip through CSV) back into Python lists.
import ast

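# Download the 'punkt' tokenizer models. This is necessary for tokenizing words in text.
nltk.download('punkt')
# Recent NLTK releases (3.9 and later) load 'punkt_tab' instead; fetching both covers either case.
nltk.download('punkt_tab')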

# Download the list of stopwords to filter out common words that are usually irrelevant in NLP tasks.
nltk.download('stopwords')

# Required Data

We'll continue using the dataset from Day 2.

## Prepare Data

In [None]:
# Load the dataset
df = pd.read_csv('day2_dataset.csv')  # Assuming we saved the DataFrame at the end of Day 2
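
One practical detail when reloading saved data: a CSV file stores every cell as text, so a column of embedding vectors comes back as strings such as `"[0.1, 0.2]"` rather than lists of numbers. The sketch below illustrates this with a tiny in-memory CSV and made-up values. The column names `title` and `embedding` are assumed to match the Day 2 file, and `parse_embedding` is a hypothetical helper.

```python
import ast
import csv
import io

# A tiny in-memory stand-in for day2_dataset.csv (values are made up).
csv_text = 'title,embedding\n"Doc A","[0.1, 0.2, 0.3]"\n"Doc B","[0.4, 0.5, 0.6]"\n'

rows = list(csv.DictReader(io.StringIO(csv_text)))

def parse_embedding(cell):
    """Turn the text of a CSV cell back into a list of floats."""
    values = ast.literal_eval(cell)  # evaluates Python literals only, never arbitrary code
    return [float(v) for v in values]

# Straight after loading, each embedding is still a string.
print(type(rows[0]["embedding"]).__name__)

# After parsing, each embedding is a list of floats again.
embeddings = [parse_embedding(row["embedding"]) for row in rows]
print(embeddings[0])
```

This is why the retriever later in the notebook checks whether the first embedding is a string before stacking the vectors.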


# Introduction

In this final lesson of our LLMs with RAG workshop, we'll focus on advanced techniques for optimizing Retrieval Augmented Generation (RAG) systems. We'll explore methods to enhance retrieval accuracy, improve prompt engineering, handle long contexts, and implement reranking strategies.

Key topics we'll cover:
1. Advanced retrieval techniques
2. Optimizing prompt engineering
3. Handling long contexts
4. Implementing reranking methods
5. Building an optimized RAG pipeline

Let's begin by setting up our OpenAI API access and defining some utility functions:

## Configure the OpenAI client

To set up the client, we need an API key to send with our requests. Skip these steps if you already have one.

You can get an API key by following these steps:

1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)
2. [Generate an API key in your project](https://platform.openai.com/api-keys)
3. (RECOMMENDED, BUT NOT REQUIRED) [Set up your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)

In [None]:
## Set the API key and client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))
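
If the environment variable is missing, the placeholder string is passed along as the key, and the problem typically surfaces only later as an authentication error on the first request. A minimal sketch of an earlier check, using only the standard library; `require_api_key` is a hypothetical helper, not part of the OpenAI SDK:

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the given mapping, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or set it before creating the client."
        )
    return key

# Demonstration with stand-in dictionaries instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
try:
    require_api_key({})
except RuntimeError as err:
    print(f"Caught: {err}")
```

With this in place, the client could be created as `OpenAI(api_key=require_api_key())`.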

In [None]:
# Define a function to get the embedding of a given text.
# This function takes a text string and an optional model name as input.
def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newline characters with spaces in the text to ensure it's in a single line.
    text = text.replace("\n", " ")
    # Use the OpenAI API client to create an embedding for the text using the specified model.
    # The function returns the embedding of the first (and only) input text.
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Define a function to get a completion for a given prompt using GPT.
# This function takes a prompt string and an optional model name as input.
def get_completion(prompt, model="gpt-3.5-turbo"):
    # Structure the prompt into a format suitable for the OpenAI API, specifying the role as "user".
    messages = [{"role": "user", "content": prompt}]
    # Use the OpenAI API client to create a chat completion using the specified model.
    # The temperature is set to 0 to minimize randomness, so repeated calls give mostly (not strictly) identical output.
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    # Return the content of the first (and only) message in the response.
    return response.choices[0].message.content

# Print a message indicating that the OpenAI API is ready for use.
print("OpenAI API is ready.")

# Lesson

## 1. Advanced Retrieval Techniques

Let's implement a more sophisticated retrieval system using FAISS for efficient similarity search:

In [None]:
# Define a class for retrieving documents using FAISS (Facebook AI Similarity Search)
class FAISSRetriever:
    # Initialize the retriever with a DataFrame containing the documents and their embeddings
    def __init__(self, df):
        self.df = df  # Store the DataFrame
        self.index = None  # Initialize the FAISS index as None
        # Build the index only if the DataFrame is non-empty, has an 'embedding' column, and at least one embedding is non-null
        if not df.empty and 'embedding' in df.columns and not df['embedding'].isnull().all():
            self.build_index()  # Build the FAISS index
        else:
            # If conditions are not met, print a warning and do not build the index
            print("DataFrame is empty or does not contain valid embeddings. Index will not be built.")

    # Method to build the FAISS index
    def build_index(self):
        try:
            # Check if embeddings are stored as strings; if so, convert them to lists
            if isinstance(self.df['embedding'].iloc[0], str):
                # Convert string embeddings to lists using ast.literal_eval and stack them vertically
                embeddings = np.vstack([ast.literal_eval(emb) for emb in self.df['embedding'].values])
            else:
                # If embeddings are not strings, stack them vertically as they are
                embeddings = np.vstack(self.df['embedding'].values)
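        except (ValueError, SyntaxError) as e:
            # ast.literal_eval raises ValueError or SyntaxError on malformed strings;
            # np.vstack raises ValueError when the embeddings differ in length
            print(f"Error parsing or stacking embeddings: {e}")
            return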

        # Check if the embeddings are a 2D array as expected
        if embeddings.ndim != 2:
            print(f"Expected embeddings to be a 2D array, got {embeddings.ndim}D instead.")
            return

        # Create a FAISS index for L2 distance and add the embeddings
        self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings.astype('float32'))

    # Method to retrieve documents similar to a given query
    def retrieve(self, query, k=3):
        # Check if the index is built
        if self.index is None:
            print("Index is not built. Cannot retrieve documents.")
            return None
        # Convert the query to an embedding and search the index for the k nearest neighbors
        query_embedding = np.array([get_embedding(query)]).astype('float32')
        _, indices = self.index.search(query_embedding, k)
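        # Return the DataFrame rows corresponding to the indices of the retrieved documents
        return self.df.iloc[indices[0]]

# Create the retriever instance used by the examples in the following sections.
retriever = FAISSRetriever(df)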
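
`IndexFlatL2` performs an exact, brute-force search ranked by squared Euclidean (L2) distance. For vectors of unit length, squared L2 distance equals 2 − 2 × cosine similarity, so both measures produce the same ranking. The pure-Python sketch below checks that relationship on made-up vectors. It assumes the embeddings are unit-normalized, which is worth verifying for whichever embedding model you use.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", normalized to unit length.
docs = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.9, 0.1, 0.0])

# Rank document positions by ascending distance and by descending similarity.
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_l2, by_cosine)

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
for d in docs:
    assert math.isclose(squared_l2(query, d), 2 - 2 * cosine(query, d), abs_tol=1e-9)
```

If the vectors were not normalized, the two rankings could differ, and an inner-product index over normalized vectors would be the usual way to search by cosine similarity.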

## 2. Optimizing Prompt Engineering

Let's create a more sophisticated prompt template that includes multiple retrieved documents and encourages the model to synthesize information:

In [None]:
def create_optimized_prompt(query, retrieved_docs, max_tokens=16385, max_docs=3, max_length_per_doc=500):
    """
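    Create an optimized prompt with a limit on the total prompt length, the number of documents,
    and the length of each document. Despite its name, max_tokens is compared against len(prompt),
    so it (like max_length_per_doc) is a limit in characters, not model tokens.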
    """
    prompt = f"""Given the following query and relevant information from multiple sources, 
    provide a comprehensive and accurate answer. Synthesize the information from all sources,
    and if there are any contradictions or gaps in the information, point them out.

    Query: {query}

    Relevant Information:
    """
    
    doc_count = 0
    for i, doc in retrieved_docs.iterrows():
        if doc_count >= max_docs:
            break
        doc_text = doc['fullText'][:max_length_per_doc]  # Truncate document
        # Number sources by position; the iterrows() label i is the DataFrame index, not a rank
        prompt += f"\nSource {doc_count + 1} - {doc['title']}:\n{doc_text}\n"
        doc_count += 1
    
    prompt += "\nSynthesized Answer:"
    
    # Check if the prompt exceeds the limit (len() counts characters, not tokens)
    if len(prompt) > max_tokens:
        prompt = prompt[:max_tokens]  # Truncate the prompt to fit the character limit
    
    return prompt

# Example usage
query = "How do different authors explore the theme of societal control?"
retrieved = retriever.retrieve(query)
optimized_prompt = create_optimized_prompt(query, retrieved)
response = get_completion(optimized_prompt)
print(f"Query: {query}")
print(f"Optimized RAG Response: {response}")
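
Truncating a finished prompt from the end can cut off the closing `Synthesized Answer:` cue. An alternative is to reserve room for the fixed parts first and add sources only while they fit. A minimal sketch, assuming a budget measured in characters (an actual token count would need the model's tokenizer); `build_budgeted_prompt` and the sample documents are hypothetical:

```python
def build_budgeted_prompt(query, docs, max_chars=2000, max_chars_per_doc=500):
    """Assemble a prompt that never exceeds max_chars and always ends with the answer cue.

    docs is a list of (title, text) pairs, best match first.
    """
    header = (
        "Given the following query and relevant information from multiple sources, "
        "provide a comprehensive and accurate answer.\n\n"
        f"Query: {query}\n\nRelevant Information:\n"
    )
    footer = "\nSynthesized Answer:"

    # Reserve space for the fixed parts before adding any source.
    remaining = max_chars - len(header) - len(footer)
    if remaining < 0:
        raise ValueError("max_chars is too small for the fixed parts of the prompt")

    sections = []
    for number, (title, text) in enumerate(docs, start=1):
        section = f"\nSource {number} - {title}:\n{text[:max_chars_per_doc]}\n"
        if len(section) > remaining:
            break  # stop at the first source that does not fit
        sections.append(section)
        remaining -= len(section)

    return header + "".join(sections) + footer

docs = [("Doc A", "a" * 300), ("Doc B", "b" * 300), ("Doc C", "c" * 300)]
prompt = build_budgeted_prompt("What is societal control?", docs, max_chars=900)
print(len(prompt), prompt.endswith("Synthesized Answer:"))
```

With a 900-character budget, only the first two sample sources fit; the third is dropped whole rather than cut mid-sentence.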

## 3. Handling Long Contexts

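Long documents can overflow the model's context window. A common remedy is to split them into chunks with a sliding window; the cell below takes a simpler route and bounds the prompt directly. It repeats the prompt builder from Section 2, now with line-by-line comments, which truncates each document and caps the number of documents included: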

In [None]:
# Define a function to create an optimized prompt for text generation tasks.
# This function limits the total prompt length, the number of documents, and the length of each document.
def create_optimized_prompt(query, retrieved_docs, max_tokens=16385, max_docs=3, max_length_per_doc=500):
    """
    Create an optimized prompt with a limit on the total prompt length, the number of documents,
    and the length of each document. Despite its name, max_tokens is compared against len(prompt),
    so it (like max_length_per_doc) is a limit in characters, not model tokens.
    """
    # Start building the prompt with an introduction and the query.
    prompt = f"""Given the following query and relevant information from multiple sources, 
    provide a comprehensive and accurate answer. Synthesize the information from all sources,
    and if there are any contradictions or gaps in the information, point them out.

    Query: {query}

    Relevant Information:
    """
    
    # Initialize a counter for the number of documents added to the prompt.
    doc_count = 0
    # Iterate over the retrieved documents.
    for i, doc in retrieved_docs.iterrows():
        # Stop adding documents if the maximum number of documents has been reached.
        if doc_count >= max_docs:
            break
        # Truncate the document text to the maximum length per document.
        doc_text = doc['fullText'][:max_length_per_doc]
        # Add the document title and truncated text to the prompt.
        # Number sources by position; the iterrows() label i is the DataFrame index, not a rank.
        prompt += f"\nSource {doc_count + 1} - {doc['title']}:\n{doc_text}\n"
        # Increment the document counter.
        doc_count += 1
    
    # Append a section for the synthesized answer to the prompt.
    prompt += "\nSynthesized Answer:"
    
    # Check if the prompt exceeds the limit (len() counts characters, not tokens).
    if len(prompt) > max_tokens:
        # Truncate the prompt to fit the character limit.
        prompt = prompt[:max_tokens]
    
    # Return the constructed prompt.
    return prompt

# Example usage of the function.
# Define a query.
query = "How do different authors explore the theme of societal control?"
# Retrieve documents relevant to the query.
retrieved = retriever.retrieve(query)  # Assume 'retriever' is a previously defined object with a 'retrieve' method.
# Create an optimized prompt using the retrieved documents.
optimized_prompt = create_optimized_prompt(query, retrieved)
# Generate a response using the optimized prompt.
response = get_completion(optimized_prompt)
# Print the query and the generated response.
print(f"Query: {query}")
print(f"Optimized RAG Response: {response}")
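
A chunking strategy with a sliding window splits a long document into fixed-size pieces that overlap, so a passage that straddles a boundary still appears intact in at least one chunk. A minimal sketch that windows over whitespace-separated words; the function name and default sizes are assumptions, not values from the workshop:

```python
def sliding_window_chunks(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunks(sample, chunk_size=4, overlap=2):
    print(chunk)
```

Each chunk could then be embedded and indexed on its own, so retrieval returns the most relevant passage rather than the first 500 characters of a document.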

## 4. Implementing Reranking Methods

Let's implement a simple reranking method based on keyword matching:

In [None]:
# Define a function to rerank retrieved documents based on keyword overlap with the query.
def keyword_reranker(query, retrieved_docs, top_k=2):
    # Build the stopword set once so it is not rebuilt for every document.
    stop_words = set(stopwords.words('english'))
    # Tokenize the query, convert to lowercase, remove stopwords, and create a set of unique keywords.
    query_keywords = set(word_tokenize(query.lower())) - stop_words
    
    # Define a function to calculate the keyword score for a given text.
    # The score is the number of distinct query keywords present in the text.
    def keyword_score(text):
        # Tokenize the text, convert to lowercase, remove stopwords, and create a set of unique keywords.
        text_keywords = set(word_tokenize(text.lower())) - stop_words
        # Return the count of query keywords found in the text's keywords.
        return len(query_keywords.intersection(text_keywords))
    
    # Create a copy of the DataFrame to avoid SettingWithCopyWarning when modifying it.
    retrieved_docs_copy = retrieved_docs.copy()
    # Apply the keyword_score function to each document's fullText and store the results in a new column.
    retrieved_docs_copy['keyword_score'] = retrieved_docs_copy['fullText'].apply(keyword_score)
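    # Sort the documents by their keyword score in descending order and return the top_k documents.
    # A stable sort (mergesort) keeps the original retrieval order among documents with equal scores.
    return retrieved_docs_copy.sort_values('keyword_score', ascending=False, kind='mergesort').head(top_k)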

# Example usage of the function.
# Define a query.
query = "A story about slavery and its impact on society"
# Retrieve documents relevant to the query, assuming 'retriever' is a previously defined object.
retrieved = retriever.retrieve(query, k=5)
# Rerank the retrieved documents based on keyword overlap with the query.
reranked = keyword_reranker(query, retrieved)
# Print the top 2 reranked results, including their titles, full texts, and keyword scores.
print(f"Top 2 reranked results for '{query}':")
print(reranked[['title', 'fullText', 'keyword_score']])
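
To judge whether reranking helps, compare the ranked list before and after against a set of documents known to be relevant. The sketch below computes precision@k and recall@k in plain Python. The document IDs and relevance judgments are made up for illustration; in practice they would come from hand-labeled queries.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical relevance labels and rankings for a single query.
relevant = {"d2", "d4"}
before = ["d1", "d2", "d3", "d4", "d5"]  # order returned by the retriever
after = ["d2", "d4", "d1", "d3", "d5"]   # order after reranking

for name, ranking in [("before", before), ("after", after)]:
    p = precision_at_k(ranking, relevant, k=2)
    r = recall_at_k(ranking, relevant, k=2)
    print(f"{name}: precision@2={p:.2f} recall@2={r:.2f}")
```

Averaging these scores over a set of labeled queries gives a simple basis for comparing one reranking method against another.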

## 5. Building an Optimized RAG Pipeline

Now, let's put everything together into an optimized RAG pipeline:

In [None]:
# Define a class named OptimizedRAG for retrieving and generating answers using a Retriever-Generator approach.
class OptimizedRAG:
    # Initialize the OptimizedRAG object with a DataFrame containing documents and their embeddings.
    def __init__(self, df):
        # Create a FAISSRetriever instance using the provided DataFrame for document retrieval.
        self.retriever = FAISSRetriever(df)
    
    # Define a method to get an answer for a given query.
    def get_answer(self, query):
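        # Use the retriever to fetch initial candidate documents based on the query, limiting to top 5.
        retrieved = self.retriever.retrieve(query, k=5)
        # retrieve() returns None when the index could not be built; stop early in that case.
        if retrieved is None:
            return "No documents could be retrieved because the index was not built."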
        
        # Rerank the retrieved documents using keyword overlap with the query, selecting the top 3.
        reranked = keyword_reranker(query, retrieved, top_k=3)
        
        # Start building the prompt for the language model, including an instruction and the query.
        prompt = f"""Given the following query and relevant information from multiple sources, 
        provide a comprehensive and accurate answer. Synthesize the information from all sources,
        and if there are any contradictions or gaps in the information, point them out.

        Query: {query}

        Relevant Information:
        """
        
        # For each reranked document, append its title and a truncated version of its text to the prompt.
        for _, doc in reranked.iterrows():
            prompt += f"\nSource: {doc['title']}:\n{doc['fullText'][:500]}...\n"
        
        # Append a section to the prompt where the synthesized answer will be placed.
        prompt += "\nSynthesized Answer:"
        
        # Use the get_completion function to generate a response from the language model based on the prompt.
        response = get_completion(prompt)
        
        # Return the generated response.
        return response

# Create an instance of the OptimizedRAG system with a DataFrame 'df' containing documents and their embeddings.
rag_system = OptimizedRAG(df)

# Example usage of the OptimizedRAG system.
# Define a query about Frederick Douglass' views on slavery and freedom.
query = "How did Frederick Douglass' experiences shape his views on slavery and freedom?"
# Get an answer for the query using the OptimizedRAG system.
answer = rag_system.get_answer(query)
# Print the query and the optimized answer generated by the system.
print(f"Query: {query}")
print(f"Optimized RAG Answer: {answer}")
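
Because `get_answer` calls the OpenAI API at two points (embedding the query and generating the answer), it is awkward to test offline. One option is to pass each stage in as a function so that stubs can replace the API calls. A sketch under that assumption; `run_rag`, the stub functions, and the sample data are all hypothetical and stand in for `FAISSRetriever.retrieve`, `keyword_reranker`, and `get_completion`:

```python
def run_rag(query, retrieve, rerank, generate, k=5, top_k=3):
    """Retrieve k candidates, rerank down to top_k, build a prompt, and generate."""
    candidates = retrieve(query, k)
    if not candidates:
        return "No relevant documents were found."
    selected = rerank(query, candidates, top_k)
    sources = "".join(
        f"\nSource: {doc['title']}:\n{doc['fullText'][:500]}\n" for doc in selected
    )
    prompt = f"Query: {query}\n\nRelevant Information:\n{sources}\nSynthesized Answer:"
    return generate(prompt)

# Stubs standing in for the real retriever, reranker, and language model.
corpus = [
    {"title": "Doc A", "fullText": "freedom and slavery in autobiography"},
    {"title": "Doc B", "fullText": "whaling voyages at sea"},
]

def stub_retrieve(query, k):
    """Return the first k documents without any similarity search."""
    return corpus[:k]

def stub_rerank(query, docs, top_k):
    """Order documents by how many query words they share."""
    words = set(query.lower().split())

    def overlap(doc):
        return len(words & set(doc["fullText"].lower().split()))

    return sorted(docs, key=overlap, reverse=True)[:top_k]

def stub_generate(prompt):
    """Report how many sources reached the prompt instead of calling a model."""
    return f"(stub answer based on {prompt.count('Source:')} source(s))"

result = run_rag("slavery and freedom", stub_retrieve, stub_rerank, stub_generate, top_k=1)
print(result)
```

Swapping the stubs for the real functions leaves `run_rag` unchanged, which also makes it easier to try a different reranker without touching the rest of the pipeline.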

# Exercises

1. Implement a cross-encoder reranking method using a pre-trained model from the `sentence-transformers` library.

2. Develop a method to dynamically adjust the number of retrieved documents based on the complexity of the query.

3. Implement a simple caching mechanism to store and reuse embeddings and retrieved results for frequently asked questions.

4. Create a method to generate follow-up questions based on the initial RAG response, encouraging a more interactive and in-depth exploration of the topic.

# Conclusion

In this final lesson of our LLMs with RAG workshop, we've explored advanced techniques for optimizing Retrieval Augmented Generation systems. We've implemented efficient retrieval using FAISS, developed strategies for handling long contexts, created optimized prompts, and built a reranking method. Finally, we combined these techniques into a comprehensive RAG pipeline.

These optimization techniques can significantly enhance the performance of RAG systems, making them more accurate, efficient, and capable of handling complex queries and large knowledge bases.

As you continue to work with RAG systems, remember that ongoing experimentation and refinement are key to achieving the best results for your specific use case.

___
This concludes our LLMs with RAG workshop series. I hope you found these lessons informative and practical for your research and applications!