# Building Agentic RAG with LlamaIndex

This project implements an agentic Retrieval-Augmented Generation (RAG) system that uses LlamaIndex and the OpenAI API to process and query a PDF document. The system has two query engines: summary_query_engine, which summarizes the document, and qa_query_engine, which answers specific questions. A Router Query Engine dynamically selects the better-suited engine for each user query.

# Pipeline:

1. **Setup Environment & Install Dependencies**
* Mount Google Drive
* Install llama-index, openai, pypdf, etc.
2. **Initialize OpenAI API & Models**
* Define the LLM (GPT-4) and the embedding model (text-embedding-ada-002).
* Set up Settings.llm and Settings.embed_model.
3. **Load and Process Documents**
* Read the PDF file (transformers.pdf in this example) using SimpleDirectoryReader.
* Build a vector index and a summary index from the loaded documents.
4. **Create Query Engines**
* summary_query_engine for summarization.
* qa_query_engine for answering specific questions.
5. **Configure the Router Query Engine**
* Uses LLM-based query selection (LLMSingleSelector).
* Dynamically routes the query to the correct engine.
6. **Query the System**
* User inputs a question.
* The router selects the best query engine.
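
The routing idea in steps 5 and 6 can be illustrated offline with a minimal sketch. Everything here is hypothetical and for illustration only: the `Tool` class, the tool descriptions, and the word-overlap scoring are assumptions, not part of LlamaIndex. The real LLMSingleSelector asks an LLM to read the tool descriptions and pick one; word overlap is only a toy substitute so the example runs without an API key.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Tool:
    """Hypothetical stand-in for a query-engine tool: a handler plus a description."""
    name: str
    description: str
    run: Callable[[str], str]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tool(query: str, tools: List[Tool]) -> int:
    """Return the index of the tool whose description shares the most words with the query.

    Toy replacement for an LLM-based selector; ties go to the earlier tool.
    """
    query_words = tokenize(query)
    scores = [len(query_words & tokenize(tool.description)) for tool in tools]
    return scores.index(max(scores))


def route(query: str, tools: List[Tool]) -> Tuple[str, str]:
    """Pick one tool for the query and return (tool name, tool answer)."""
    chosen = tools[select_tool(query, tools)]
    return chosen.name, chosen.run(query)


# Illustrative tools; the handlers just echo which engine would have run.
tools = [
    Tool(
        name="summary",
        description="summarize or give an overview of the whole document",
        run=lambda q: f"[summary engine] {q}",
    ),
    Tool(
        name="qa",
        description="answer specific questions about details and definitions in the document",
        run=lambda q: f"[qa engine] {q}",
    ),
]

print(route("Summarize the whole document", tools))
print(route("What are the specific details of self-attention?", tools))
```

The sketch shows why the tool descriptions matter: they are the only information the selector has about each engine, so vague or overlapping descriptions make routing less reliable.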

# References

This project is based on the course **"Building Agentic RAG with LlamaIndex"** by **DeepLearning.AI**, available at the following [link](https://learn.deeplearning.ai/courses/building-agentic-rag-with-llamaindex/).

## Setup

In [None]:
# Mounting to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd "YOUR-PATH-HERE"

In [None]:
%%capture
!pip install llama-index llama-index-llms-openai llama-index-embeddings-openai openai pypdf

In [None]:
!pip list | grep llama-index

llama-index                             0.12.19
llama-index-agent-openai                0.4.6
llama-index-cli                         0.4.0
llama-index-core                        0.12.19
llama-index-embeddings-openai           0.3.1
llama-index-indices-managed-llama-cloud 0.6.7
llama-index-llms-openai                 0.3.20
llama-index-multi-modal-llms-openai     0.4.3
llama-index-program-openai              0.3.1
llama-index-question-gen-openai         0.3.0
llama-index-readers-file                0.4.5
llama-index-readers-llama-parse         0.4.0


In [None]:
import os
# ServiceContext is deprecated in favor of Settings and is not used in this notebook
from llama_index.core import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.prompts import PromptTemplate

In [None]:
# Set OpenAI API key
import openai

openai.api_key = 'YOUR-OPENAI-API-KEY-HERE'

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
# Initialize OpenAI models (LLM & Embedding)
llm = OpenAI(model="gpt-4", temperature=0)
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

Settings.llm = llm
Settings.embed_model = embed_model

In [None]:
# Load and parse the PDF document
reader = SimpleDirectoryReader(input_files=["transformers.pdf"])
documents = reader.load_data()

In [None]:
# Create Vector Index (for Q&A)
vector_index = VectorStoreIndex.from_documents(documents)

In [None]:
# Create Summary Index
summary_index = SummaryIndex.from_documents(documents)
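
The two indexes differ mainly in what they hand to the LLM at query time: a vector index retrieves only the few chunks whose embeddings are most similar to the query, while a summary index passes along every chunk. The sketch below illustrates that difference with plain Python. The chunk names, the 3-dimensional "embeddings", and the `top_k=2` value are made up for illustration; real embeddings come from the embedding model and have many more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical chunk embeddings (assumed values, not produced by any model).
chunks: Dict[str, List[float]] = {
    "intro": [1.0, 0.0, 0.0],
    "attention": [0.0, 1.0, 0.0],
    "training": [0.0, 0.0, 1.0],
    "results": [0.5, 0.5, 0.0],
}


def vector_retrieve(query_embedding: Sequence[float], top_k: int = 2) -> List[str]:
    """Vector-index style retrieval: keep only the top_k most similar chunks."""
    ranked = sorted(
        chunks,
        key=lambda name: cosine_similarity(query_embedding, chunks[name]),
        reverse=True,
    )
    return ranked[:top_k]


def summary_retrieve() -> List[str]:
    """Summary-index style retrieval: return every chunk, in document order."""
    return list(chunks)


# A query embedding that points mostly along the "attention" direction.
query_embedding = [0.1, 0.9, 0.0]
print(vector_retrieve(query_embedding))  # the two chunks closest to the query
print(summary_retrieve())                # all four chunks
```

This is why the vector index suits targeted questions (cheap, focused context) and the summary index suits whole-document summaries (complete coverage, more LLM calls).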

In [None]:
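# Create Query Engines

# Summary engine: builds the answer hierarchically over all chunks;
# use_async=True allows the underlying LLM calls to run concurrently
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
# Q&A engine: answers from the chunks most similar to the query
qa_query_engine = vector_index.as_query_engine()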
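
As a rough illustration of what the "tree_summarize" response mode does, the sketch below combines chunks in groups, summarizes each group, and repeats until a single root summary remains. This is a simplified model under stated assumptions: the fixed `fan_in` grouping and the `toy_summarize` function are hypothetical stand-ins. LlamaIndex packs chunks according to the prompt size and uses an LLM call for each summarization step.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Reduce chunks to one summary by repeatedly summarizing groups of fan_in items."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")

    level = list(chunks)
    while True:
        # Summarize each consecutive group to form the next level of the tree.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


def toy_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM call: shows which inputs were merged instead of summarizing."""
    return "(" + " + ".join(texts) + ")"


# The nesting of parentheses makes the tree structure visible.
print(tree_summarize(["A", "B", "C"], toy_summarize))
```

Because the groups at each level are independent of one another, their summarization calls can run concurrently, which is the reason `use_async=True` is useful with this response mode.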


In [None]:
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to the document"
    ),
)

qa_tool = QueryEngineTool.from_defaults(
    query_engine=qa_query_engine,
    description=(
        "Useful for retrieving specific context from the document."
    ),
)

In [None]:
# Define a Router Query Engine with the selector

router_query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        qa_tool,
    ],
    verbose=True
)

In [None]:
# Test the query (expected to be routed to the summary engine)
import textwrap

query1 = "What is the summary of the document?"
response = router_query_engine.query(query1)

# Extract the response text (str() on a response object returns its text)
response_text = str(response)

# Wrap text to a readable width
wrapped_response = textwrap.fill(response_text, width=80)

# Print structured output
print("=" * 80)
print(f"**User Query:**\n{query1}\n")
print("=" * 80)
print("**Generated Response:**\n")
print(wrapped_response)
print("=" * 80)

Selecting query engine 0: The question asks for a summary of the document, which directly relates to choice 1 about summarization questions related to the document..
**User Query:**
What is the summary of the document?

**Generated Response:**

The document is a research paper titled "Attention Is All You Need" by a team
from Google Brain and Google Research. It introduces a new network architecture
known as the Transformer, which is based solely on attention mechanisms,
replacing the recurrent and convolutional layers typically used in sequence
transduction models. The Transformer model has demonstrated superior performance
in terms of quality, parallelizability, and training time compared to existing
models. It has been tested on two machine translation tasks, English-to-German
and English-to-French, and has achieved state-of-the-art results. The paper also
explores the benefits of self-attention layers and presents experiments on
English constituency parsing. The auth

In [None]:
# Test the query (expected to be routed to the Q&A engine)
query2 = "What is the self-attention?"
response = router_query_engine.query(query2)

# Extract the response text (str() on a response object returns its text)
response_text = str(response)

# Wrap text to a readable width
wrapped_response = textwrap.fill(response_text, width=80)

# Print structured output
print("=" * 80)
print(f"**User Query:**\n{query2}\n")
print("=" * 80)
print("**Generated Response:**\n")
print(wrapped_response)
print("=" * 80)

Selecting query engine 1: The question 'What is the self-attention?' is asking for a specific context or definition from the document, not a summary of the document..
**User Query:**
What is the self-attention?

**Generated Response:**

Self-attention, also known as scaled dot-product attention, is a mechanism where
the input consists of queries and keys of dimension dk, and values of dimension
dv. The dot products of the query with all keys are computed, each divided by
√dk, and a softmax function is applied to obtain the weights on the values. This
attention function is computed on a set of queries simultaneously, packed
together into a matrix Q. The keys and values are also packed together into
matrices K and V. The matrix of outputs is computed as: Attention(Q, K, V) =
softmax(QKT √dk )V. This mechanism is faster and more space-efficient in
practice, as it can be implemented using highly optimized matrix multiplication
code.
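
Both test cells above repeat the same wrapping and printing steps. One possible refactoring, sketched here, is a small helper that builds the formatted block once. The name `format_response` and its parameters are hypothetical; the layout mirrors the print statements used in the test cells.

```python
import textwrap


def format_response(query: str, response_text: str, width: int = 80) -> str:
    """Return the query and the wrapped response between horizontal rules."""
    rule = "=" * width
    wrapped = textwrap.fill(response_text, width=width)
    return "\n".join(
        [
            rule,
            f"**User Query:**\n{query}\n",
            rule,
            "**Generated Response:**\n",
            wrapped,
            rule,
        ]
    )


# Offline demonstration with placeholder text.
print(format_response("Example query?", "Example response text. " * 8, width=60))

# In the notebook, assuming router_query_engine is defined as above:
# print(format_response(query1, str(router_query_engine.query(query1))))
```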
