# Simple RAG (Retrieval-Augmented Generation) System

## Overview

This notebook implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying documents. It is adapted from a PDF-based example; here the documents are web pages crawled from the Unreal Engine documentation. An ingestion pipeline splits the documents into nodes, and these nodes are then used to build a vector index for retrieving relevant information.

## Key Components

1. Document loading (web pages crawled with FireCrawl)
2. Text chunking for manageable processing
3. Ingestion pipeline creation, with embeddings from a local Ollama model
4. Retriever setup for querying the processed documents
5. Evaluation of the RAG system

## Method Details

### Document Preprocessing

1. The pages are loaded using `FireCrawlWebReader`, which crawls the Unreal Engine documentation site.
2. The text is split into [nodes/chunks](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/) using [SentenceSplitter](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#sentencesplitter) with specified chunk size and overlap.
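
To make the effect of chunk size and overlap concrete, here is a minimal sliding-window sketch. It is not how `SentenceSplitter` works internally (that class also respects sentence boundaries and uses a real tokenizer); `chunk_tokens` is a hypothetical helper that treats list items as stand-ins for tokens.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int, chunk_overlap: int) -> List[list]:
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Integers stand in for tokens so the overlap is easy to see.
example_chunks = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

Each chunk repeats the last item of the previous one, so text that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.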

### Text Cleaning

A custom transformation `TextCleaner` is defined to clean the text by replacing tabs and space-newline sequences with spaces. It raised an error on the node type used here, so it is commented out of the ingestion pipeline.

### Ingestion Pipeline Creation

1. Ollama embeddings (`mxbai-embed-large`) are used to create vector representations of the text nodes.
2. A FAISS vector store is created as in the original example, but the index is built with the vector store built into LlamaIndex, which gave better results here.

### Retriever Setup

1. A retriever is configured to fetch the top 4 most relevant chunks for a given query; the query engine used for evaluation fetches the top 2.
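
As a rough illustration of what "top k most relevant chunks" means, the sketch below scores toy vectors against a query by cosine similarity and keeps the best `k`. The vectors, chunk ids, and the `top_k` helper are all made up for illustration; the real retriever works on embeddings produced by the embedding model.

```python
import heapq
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int) -> List[Tuple[float, str]]:
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    return heapq.nlargest(k, scored)


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_chunks = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], toy_chunks, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```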


## Key Features

1. Modular Design: The ingestion process is encapsulated in a single `IngestionPipeline` for easy reuse.
2. Configurable Chunking: Allows adjustment of chunk size and overlap.
3. Efficient Retrieval: Uses a vector index for similarity search; a FAISS store is also set up for comparison.
4. Evaluation: Includes a function to evaluate the RAG system's performance.

## Usage Example

The code includes a test query: "What's new in unreal 5.5?". This demonstrates how to use the retriever to fetch relevant context from the crawled documents.

## Evaluation

The system includes an `evaluate_rag` function that assesses performance with DeepEval. Three metrics are defined (a `GEval` correctness metric, `FaithfulnessMetric`, and `ContextualRelevancyMetric`), but only the correctness metric is passed to `evaluate`.

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
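3. Efficiency: Similarity search over embeddings stays fast on large collections, and FAISS can be swapped in for search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses an embedding model (here `mxbai-embed-large`, served locally by Ollama) for dense text representation.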

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

### Import libraries and environment variables

In [152]:
from typing import List
from llama_index.core import VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.schema import BaseNode, TransformComponent
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core.text_splitter import SentenceSplitter
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader
import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks

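# Must match the embedding model's output size; mxbai-embed-large produces 1024-dimensional vectors
EMBED_DIMENSION = 1024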

# Chunk settings differ substantially from the LangChain examples
# because LangChain measures chunk length in characters,
# while LlamaIndex measures it in tokens
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

Settings.llm = Ollama(model="gemma2:27b", request_timeout=300.0)

# Set embedding model on LlamaIndex global settings
Settings.embed_model = OllamaEmbedding(
    model_name="mxbai-embed-large",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)


### Load data from the Unreal Docs using the FireCrawl Web Reader

Because the site uses Cloudflare for DDoS protection, a local crawler won't work.\
Firecrawl took a bit of fiddling to get it to pull anything useful, but at least it was free.

In [41]:
loader = FireCrawlWebReader(
    api_key=os.environ["FIRECRAWL_API_KEY"],  # Read the key from the environment instead of hard-coding it (variable name is an assumption)
    mode="crawl",
    params={
        "limit": 100,
        "allowBackwardLinks": True,
        "includePaths": ["/*"]
    }
)
documents = loader.load_data(
    url="https://dev.epicgames.com/documentation/en-us/unreal-engine"
)

### Vector Store
This was included in Nir's example, but it actually gave worse results than using the vector store built in to LlamaIndex.

In [153]:
import faiss

# Create FaissVectorStore to store embeddings
faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)
vector_store = FaissVectorStore(faiss_index=faiss_index)
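
One possible factor when comparing the two stores, offered as an assumption rather than something this notebook establishes: `IndexFlatL2` ranks by Euclidean (L2) distance, while, to my understanding, the default in-memory store in LlamaIndex ranks by cosine similarity. The two orderings agree when every vector is unit-normalized, but can disagree otherwise. The toy vectors below are invented purely to show such a disagreement.

```python
import math
from typing import List


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance; smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


query = [1.0, 0.0]
docs = {
    "same_direction_long": [10.0, 0.0],  # identical direction, much larger norm
    "nearby_off_axis": [0.9, 0.5],       # close in space, different direction
}

best_by_l2 = min(docs, key=lambda name: squared_l2(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
print("L2 picks:", best_by_l2)
print("cosine picks:", best_by_cosine)
```

If the embeddings returned by the model are already normalized, this difference disappears, so it is worth checking the vector norms before drawing conclusions.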

### Text Cleaner Transformation
This was also included in the example, but it raises an error on our node type, so it is commented out of the pipeline below.


In [147]:
class TextCleaner(TransformComponent):
    
    """
    Transformation to be used within the ingestion pipeline.
    Cleans clutter from text.
    """
    def __call__(self, nodes, **kwargs) -> List[BaseNode]:
        
        for node in nodes:
            node.text = node.text.replace('\t', ' ') # Replace tabs with spaces
            node.text = node.text.replace(' \n', ' ') # Replace a space followed by a newline with a single space
            
        return nodes
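
The cleaning rules themselves can be checked on plain strings, independently of any node class. The helper below is a hypothetical `clean_text` function (not part of this notebook or of LlamaIndex) that applies the same two replacements as `TextCleaner`; it does not address the node-type error, it only shows what the transformation does to text.

```python
def clean_text(text: str) -> str:
    """Apply the same two replacements as TextCleaner to a plain string."""
    text = text.replace("\t", " ")   # tabs become spaces
    text = text.replace(" \n", " ")  # a space followed by a newline becomes one space
    return text


sample = "first\tsecond \nthird"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that a newline without a preceding space is left untouched, which matches the behavior of the class above.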

### Ingestion Pipeline

In [184]:
text_splitter = SentenceSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Create a pipeline with the defined document transformations
pipeline = IngestionPipeline(
    transformations=[
        # TextCleaner(),  # Disabled: raises an error on our node type
        text_splitter,
    ],
)

In [185]:
# Run pipeline and get generated nodes from the process
nodes = pipeline.run(documents=documents)

### Create retriever

In [186]:
vector_store_index = VectorStoreIndex(nodes, show_progress=True)
retriever = vector_store_index.as_retriever(similarity_top_k=4)

Generating embeddings: 100%|██████████| 411/411 [00:18<00:00, 22.57it/s]


### Test retriever

In [179]:
def show_context(context):
    """
    Display the contents of the provided context list.

    Args:
        context (list): A list of context items to be displayed.

    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i+1}:")
        print(c.metadata["url"])
        print(c.node.text)
        print("\n")

In [189]:
test_query = "What's new in unreal 5.5?"
context = retriever.retrieve(test_query)
show_context(context)

Context 1:
https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.2
Unreal Engine
5.2

- [Unreal Engine\\
5.5](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.5)
- [Unreal Engine\\
5.4](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.4)
- [Unreal Engine\\
5.3](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.3)
- [Unreal Engine\\
5.2](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.2)
- [Unreal Engine\\
5.1](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.1)
- [Unreal Engine\\
5.0](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=5.0)
- [Unreal Engine\\
4.27](https://dev.epicgames.com/documentation/en-us/unreal-engine/whats-new?application_version=4.27)

Table of Contents

![What's New](https:/

### Let's try a different indexing method:

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.

Unfortunately, it summarized only 28 of the 100 documents in roughly 11 hours, so I interrupted the run and scrapped this attempt.

In [167]:
from llama_index.core import DocumentSummaryIndex

index = DocumentSummaryIndex.from_documents(documents, show_progress=True)
query_engine = index.as_query_engine()

Parsing nodes: 100%|██████████| 100/100 [00:00<00:00, 172.53it/s]
Generating embeddings:  99%|█████████▉| 99/100 [09:36<00:05,  5.82s/it]
Summarizing documents:   0%|          | 0/100 [00:00<?, ?it/s]

current doc id: b2608a5c-482a-48d2-8e22-5b9d8a170308


Summarizing documents:   1%|          | 1/100 [13:36<22:26:37, 816.14s/it]

current doc id: bf6e3bd8-d1c9-4bd9-9842-b5eea95f918b


Summarizing documents:   2%|▏         | 2/100 [14:56<10:25:46, 383.13s/it]

current doc id: 18557c6f-cade-433a-945e-8c89b2778e2e


Summarizing documents:   3%|▎         | 3/100 [16:15<6:35:25, 244.59s/it] 

current doc id: 7b0fa4d7-4c70-469a-8f8f-038be1d3c426


Summarizing documents:   4%|▍         | 4/100 [19:27<5:57:59, 223.75s/it]

current doc id: b1a35f4f-f1e1-4562-8922-5eb390c2f7d4


Summarizing documents:   5%|▌         | 5/100 [26:33<7:49:40, 296.63s/it]

current doc id: d9c6663b-be49-4b2a-9aab-c7544249df3d


Summarizing documents:   6%|▌         | 6/100 [30:24<7:09:52, 274.39s/it]

current doc id: bba7d234-b876-4ed2-bd69-bbc0c4cb6d60


Summarizing documents:   7%|▋         | 7/100 [43:44<11:31:27, 446.10s/it]

current doc id: b93f2898-595f-46e3-b941-2e1468aa2bc9


Summarizing documents:   8%|▊         | 8/100 [54:43<13:07:49, 513.80s/it]

current doc id: 7e1bc365-3c16-469e-8e4b-df08b1ed1554


Summarizing documents:   9%|▉         | 9/100 [1:06:22<14:27:08, 571.75s/it]

current doc id: c496847e-26ec-4525-b4be-7aeab102208f


Summarizing documents:  10%|█         | 10/100 [1:22:19<17:15:56, 690.62s/it]

current doc id: 7b23ec49-2b06-403f-99f9-d21681e01afe


Summarizing documents:  11%|█         | 11/100 [1:45:56<22:34:31, 913.16s/it]

current doc id: 4b521bef-8b18-4473-b762-b295ed39599b


Summarizing documents:  12%|█▏        | 12/100 [2:02:24<22:52:41, 935.92s/it]

current doc id: 82e1215f-298b-4f31-bc74-ef51e1c764d6


Summarizing documents:  13%|█▎        | 13/100 [2:13:51<20:47:45, 860.52s/it]

current doc id: c270a9f3-de9f-422d-8e97-56e63056dd09


Summarizing documents:  14%|█▍        | 14/100 [2:37:29<24:34:30, 1028.73s/it]

current doc id: 375c9463-e134-4450-90fd-292863cd7848


Summarizing documents:  15%|█▌        | 15/100 [2:50:03<22:20:14, 946.05s/it] 

current doc id: 9e461a0d-4d7c-4862-acc3-d79949e86f6e


Summarizing documents:  16%|█▌        | 16/100 [2:55:55<17:54:12, 767.29s/it]

current doc id: 5fe0255c-ad2c-40f7-af18-4b944f02d7c5


Summarizing documents:  17%|█▋        | 17/100 [3:07:48<17:18:50, 750.97s/it]

current doc id: d3272e67-daab-450b-8863-8b8e84e24b0e


Summarizing documents:  18%|█▊        | 18/100 [3:09:04<12:28:47, 547.89s/it]

current doc id: 2839cc97-d8ca-41fe-bf34-f6e077ac2498


Summarizing documents:  19%|█▉        | 19/100 [3:20:09<13:07:12, 583.12s/it]

current doc id: 8d07c25f-c2e1-41a7-b4c6-5c560efc9ede


Summarizing documents:  20%|██        | 20/100 [3:30:53<13:22:06, 601.58s/it]

current doc id: 662c52cd-691f-40fb-8fe0-e6ba3025f6ed


Summarizing documents:  21%|██        | 21/100 [3:38:51<12:23:08, 564.41s/it]

current doc id: 40e52eb8-f4a5-4ce8-88e5-5b2a22a10b05


Summarizing documents:  22%|██▏       | 22/100 [4:23:20<25:54:50, 1196.03s/it]

current doc id: 07fbfda0-c224-470b-b233-37ff0bdfe89f


Summarizing documents:  23%|██▎       | 23/100 [5:01:01<32:24:53, 1515.50s/it]

current doc id: 1ef582b1-919a-4dd5-a793-e0adf8ba580d


Summarizing documents:  24%|██▍       | 24/100 [7:13:17<72:40:05, 3442.18s/it]

current doc id: 627b33ba-de25-4a8b-b96d-3adf76e2713d


Summarizing documents:  25%|██▌       | 25/100 [8:50:11<86:32:03, 4153.65s/it]

current doc id: 0480c785-dcce-44a6-9559-6badec254387


Summarizing documents:  26%|██▌       | 26/100 [9:05:55<65:35:15, 3190.75s/it]

current doc id: 773e2645-e559-4e71-8a2f-c02728a5b3f8


Summarizing documents:  27%|██▋       | 27/100 [9:10:54<47:06:35, 2323.23s/it]

current doc id: b194112f-a0be-4add-bb82-3c8ec0d17a3f


Summarizing documents:  28%|██▊       | 28/100 [10:41:31<65:08:50, 3257.36s/it]

current doc id: 6569a019-6b5c-4c3b-aac0-c512057059a5


Summarizing documents:  28%|██▊       | 28/100 [11:27:34<29:28:03, 1473.38s/it]


KeyboardInterrupt: 
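
To put the log above in perspective, the short calculation below takes the final progress line (28 of 100 documents after 11:27:34) and extrapolates linearly. This is only a rough estimate: the per-document time in the log varies widely, and `hms_to_seconds` is a small helper written for this sketch.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert 'H:MM:SS' or 'MM:SS' (as printed by the progress bar) to seconds."""
    parts = [int(p) for p in hms.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


elapsed_seconds = hms_to_seconds("11:27:34")  # elapsed time when the run was interrupted
docs_done = 28
docs_total = 100

seconds_per_doc = elapsed_seconds / docs_done
projected_total_hours = seconds_per_doc * docs_total / 3600

print(f"average: {seconds_per_doc:.0f} s per document")
print(f"projected total: {projected_total_hours:.1f} hours")
```

At that average rate the full crawl of 100 pages would have taken well over a day and a half, which is why the attempt was abandoned.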

### Let's see how well it performs:
First, let's see what pages were pulled.

In [192]:
for document in documents:
    print(document.metadata["url"])

https://dev.epicgames.com/documentation/en-us/unreal-engine/unreal-engine-5-5-documentation
https://dev.epicgames.com/en-US/indies/news/meet-the-epic-team-at-unreal-fest-seattle
https://dev.epicgames.com/en-US/indies/news/join-senscape-midnight-jam-2024-
https://dev.epicgames.com/en-US/indies/news/meet-the-team-behind-atre
https://dev.epicgames.com/en-US/indies/news
https://dev.epicgames.com/en-US
https://dev.epicgames.com/documentation/en-us/unreal-engine/unreal-engine-5-4-documentation?application_version=5.4
https://dev.epicgames.com/documentation/en-us/unreal-engine/unreal-engine-5-2-documentation?application_version=5.2
https://dev.epicgames.com/documentation/en-us/unreal-engine/unreal-engine-5-3-documentation?application_version=5.3
https://dev.epicgames.com/documentation/en-us/unreal-engine/unreal-engine-5-5-documentation?application_version=5.5
https://dev.epicgames.com/documentation/en-us/unreal-engine/creating-visual-effects-in-niagara-for-unreal-engine
https://dev.epicgames.

In [208]:
import json
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Set llm model for evaluation of the question and answers 
LLM_MODEL = "gemma2:27b"

# Define evaluation metrics
correctness_metric = GEval(
    name="Correctness",
    model=LLM_MODEL,
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)

faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model=LLM_MODEL,
    include_reason=False
)

relevance_metric = ContextualRelevancyMetric(
    threshold=1,
    model=LLM_MODEL,
    include_reason=True
)

def evaluate_rag(query_engine, num_questions: int = 5) -> None:
    """
    Evaluate the RAG system using predefined metrics.

    Args:
        query_engine: Query engine to ask questions and get answers along with retrieved context.
        num_questions (int): Number of questions to evaluate (default: 5).
    """
    
    
    # Load questions and answers from JSON file
    q_a_file_name = "q_a.json"
    with open(q_a_file_name, "r", encoding="utf-8") as json_file:
        q_a = json.load(json_file)

    questions = [qa["question"] for qa in q_a][:num_questions]
    ground_truth_answers = [qa["answer"] for qa in q_a][:num_questions]
    generated_answers = []
    retrieved_documents = []

    # Generate answers and retrieve documents for each question
    for question in questions:
        response = query_engine.query(question)
        context = [doc.text for doc in response.source_nodes]
        retrieved_documents.append(context)
        generated_answers.append(response.response)

    # Create test cases and evaluate
    test_cases = [
        LLMTestCase(
            input=question,
            expected_output=gt_answer,
            actual_output=generated_answer,
            retrieval_context=retrieved_document
        )
        for question, gt_answer, generated_answer, retrieved_document in zip(
            questions, ground_truth_answers, generated_answers, retrieved_documents
        )
    ]
    evaluate(
        test_cases=test_cases,
        metrics=[correctness_metric]
    )
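
Because each evaluation run calls the LLM, it can help to dry-run the data handling first. The sketch below assumes `q_a.json` is a list of objects with `question` and `answer` keys (inferred from how `evaluate_rag` indexes it) and uses a hypothetical `StubQueryEngine` that mimics only the `.query()`, `.response`, and `.source_nodes` attributes used above. The questions and answers are placeholders, not real evaluation data.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubNode:
    text: str


@dataclass
class StubResponse:
    response: str
    source_nodes: List[StubNode] = field(default_factory=list)


class StubQueryEngine:
    """Stand-in for a real query engine; returns canned answers without an LLM."""

    def query(self, question: str) -> StubResponse:
        return StubResponse(
            response=f"stub answer to: {question}",
            source_nodes=[StubNode(text="stub context")],
        )


def collect_cases(query_engine, q_a: list, num_questions: int) -> List[dict]:
    """Gather the same fields that evaluate_rag passes to LLMTestCase."""
    cases = []
    for qa in q_a[:num_questions]:
        response = query_engine.query(qa["question"])
        cases.append({
            "input": qa["question"],
            "expected_output": qa["answer"],
            "actual_output": response.response,
            "retrieval_context": [node.text for node in response.source_nodes],
        })
    return cases


# Placeholder content in the assumed q_a.json layout.
q_a = json.loads('[{"question": "Q1?", "answer": "A1"}, {"question": "Q2?", "answer": "A2"}]')
cases = collect_cases(StubQueryEngine(), q_a, num_questions=1)
print(cases)
```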

### Evaluate results

In [209]:
query_engine = vector_store_index.as_query_engine(similarity_top_k=2)
evaluate_rag(query_engine, num_questions=3)

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:45, ?test case/s]


AttributeError: 'str' object has no attribute 'truths'