# 📘 Multi-Retriever RAG Chatbot using PDF, Wikipedia, and Web Search

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline that pulls information from multiple sources—PDF documents, Wikipedia, and web search—to answer user queries using OpenAI GPT-4.

---

🔍 **What this notebook does:**

- Extracts structured content (text, tables, images) from PDFs using Unstructured.io
- Converts extracted content into LangChain documents, embeds them with OpenAI embeddings, and indexes them in FAISS
- Uses Wikipedia for open-domain factual retrieval
- Uses Tavily for live web search (requires API key)
- Combines all retrievers into an ensemble using LangChain's `EnsembleRetriever`
- Feeds the retrieved context into a GPT-4-powered RAG chain to answer questions
- Evaluates the model's output using the Llumo SDK

---

💡 **Example Questions:**

- "Compare Artificial Intelligence and Machine Learning."
- "What is the latest trend in Generative AI?"
- "Summarize the contents of the uploaded PDF."


### Install dependencies

In [1]:
!pip install --quiet faiss-cpu pytesseract unstructured-client "unstructured[all-docs]"
!pip install --quiet langchain_openai langchain-community Wikipedia tavily-python
!apt-get -qq install poppler-utils tesseract-ocr libtesseract-dev


### 🔑 Set up API key

In [7]:
import os
from google.colab import userdata

# Option 1 (default): read the keys from Colab's secret store.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
# A free Tavily API key is available at https://app.tavily.com/
os.environ["TAVILY_API_KEY"] = userdata.get("TAVILY_API_KEY")

# Option 2: set the keys directly (uncomment if you are not using Colab secrets).
# os.environ["OPENAI_API_KEY"] = "Enter Your OPENAI_API_KEY"
# os.environ["TAVILY_API_KEY"] = "Enter Your TAVILY_API_KEY"
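The cell above depends on `google.colab`, so it fails outside Colab. A minimal sketch of a fallback, assuming you would rather be prompted for a missing key than hard-code it: the helper name `ensure_api_key` is hypothetical and not part of any library used in this notebook.

```python
import os
from getpass import getpass


def ensure_api_key(name, prompt=getpass):
    """Return os.environ[name], asking for the value only when it is missing."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical fallback for non-Colab environments: ask once, without echo.
        value = prompt(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_api_key("OPENAI_API_KEY")` and `ensure_api_key("TAVILY_API_KEY")` before building the retrievers leaves both variables set for the rest of the session.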



### 📄 Extract Structured Content from PDF with High-Resolution Layout Parsing


In [3]:
from unstructured.partition.pdf import partition_pdf

filename = "/content/Ml_sample.pdf" # Path to your PDF file

# Extract elements including images, tables, and structured text
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=True,               # Enable image extraction
    strategy="hi_res",                       # Use high-resolution parsing
    hi_res_model_name="yolox",               # YOLOX model for detecting layout
    infer_table_structure=True,               # Try to parse tables
    chunking_strategy="by_title",            # Split text by document headings
    max_characters=3000,
    combine_text_under_n_chars=200
)


### 🔍 Inspect Parsed Elements

In [4]:
# Analyze the types of elements extracted
from collections import Counter
category_counts = Counter(str(type(element)) for element in pdf_elements)
category_counts


Counter({"<class 'unstructured.documents.elements.CompositeElement'>": 47})
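Every element is a `CompositeElement`, most likely because `chunking_strategy="by_title"` merges the parsed text elements into chunks. The sketch below shows the same count keyed by short class name, which is easier to read than the full `<class '...'>` string. The two classes are stand-ins defined here so the example runs without `unstructured` installed.

```python
from collections import Counter


class CompositeElement:  # stand-in for unstructured.documents.elements.CompositeElement
    def __init__(self, text):
        self.text = text


class Table:  # stand-in for unstructured.documents.elements.Table
    def __init__(self, text):
        self.text = text


def count_by_class(elements):
    """Count elements by short class name instead of the full type repr."""
    return Counter(type(el).__name__ for el in elements)


sample = [CompositeElement("Intro"), CompositeElement("Methods"), Table("a | b")]
print(count_by_class(sample))  # Counter({'CompositeElement': 2, 'Table': 1})
```

In the notebook itself, `count_by_class(pdf_elements)` would replace the `str(type(element))` expression.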

### 📚 Create LangChain Documents

In [5]:
# Convert each element into a searchable Document
from langchain_core.documents import Document  # current import path; langchain.schema is the legacy one
documents = [Document(page_content=el.text, metadata={"source": filename}) for el in pdf_elements]
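The conversion above keeps only the file path as metadata and also turns empty chunks into documents. A hedged sketch of a slightly richer conversion follows. It assumes each element exposes `metadata.page_number`, as `unstructured` elements usually do. The `SimpleNamespace` objects stand in for real elements, and `element_to_record` and `build_records` are hypothetical helper names.

```python
from types import SimpleNamespace


def element_to_record(element, source):
    """Build the keyword arguments for a Document from one parsed element."""
    metadata = {"source": source}
    # Assumption: the element carries metadata.page_number (may be absent or None).
    page = getattr(getattr(element, "metadata", None), "page_number", None)
    if page is not None:
        metadata["page"] = page
    return {"page_content": element.text.strip(), "metadata": metadata}


def build_records(elements, source):
    """Convert elements to records, dropping chunks that contain no text."""
    return [
        element_to_record(el, source)
        for el in elements
        if el.text and el.text.strip()
    ]


elements = [
    SimpleNamespace(text="Machine learning basics", metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(text="   ", metadata=SimpleNamespace(page_number=2)),
]
records = build_records(elements, "/content/Ml_sample.pdf")
# In the notebook: documents = [Document(**record) for record in records]
```

Keeping the page number in the metadata makes it possible to cite where in the PDF a retrieved chunk came from.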


### 🧠 Embed with OpenAI + FAISS

In [8]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Embed the documents with OpenAI and index the vectors in FAISS
embeddings = OpenAIEmbeddings()
pdf_vectorstore = FAISS.from_documents(documents, embeddings)

# Expose the vector store as a retriever
pdf_retriever = pdf_vectorstore.as_retriever()
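At query time the retriever embeds the question and returns the stored vectors closest to it. The sketch below shows that nearest-neighbour step as a brute-force Euclidean search in plain Python. The two-dimensional vectors and helper names are made up for illustration, and the default `k=4` mirrors what appears to be the retriever's default. Real embeddings have far more dimensions, and FAISS performs the search much more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec, nearest first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: l2_distance(query_vec, doc_vecs[i]),
    )
    return ranked[:k]


doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [1, 2]
```

To change the number of results in the notebook, `as_retriever` accepts `search_kwargs={"k": ...}`.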

### ✅ Set Up Wikipedia Retriever to Fetch Top 3 Relevant Articles



In [9]:
from langchain_community.retrievers import WikipediaRetriever

def get_wiki_retriever():
    # Retrieves top 3 Wikipedia articles
    return WikipediaRetriever(top_k_results=3)

### 🌐 Create a Web Retriever Using Tavily Search


In [10]:

from langchain_community.retrievers.tavily_search_api import TavilySearchAPIRetriever

def get_web_retriever_tavily():
    return TavilySearchAPIRetriever(k=3)

### 🧠 Combine PDF, Wikipedia, and Web Search into a Unified Retriever

In [11]:
from langchain.retrievers import EnsembleRetriever

def get_combined_retriever():
    return EnsembleRetriever(
        retrievers=[
            pdf_retriever,
            get_wiki_retriever(),
            get_web_retriever_tavily()
        ],
        weights=[1.0, 0.7, 0.7]
    )
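`EnsembleRetriever` merges the three result lists with weighted Reciprocal Rank Fusion: each document earns `weight / (rank + c)` from every retriever that returned it, and documents are sorted by the summed score. The sketch below reimplements that idea on plain string IDs to show the effect of the `[1.0, 0.7, 0.7]` weights. It is an illustration, not LangChain's code. The document IDs are invented, and `c=60` is the constant LangChain uses by default as far as we know.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


pdf_hits = ["pdf_a", "shared", "pdf_b"]
wiki_hits = ["shared", "wiki_a"]
web_hits = ["web_a", "shared"]
fused = weighted_rrf([pdf_hits, wiki_hits, web_hits], weights=[1.0, 0.7, 0.7])
print(fused)  # ['shared', 'pdf_a', 'pdf_b', 'web_a', 'wiki_a']
```

A document returned by several retrievers rises to the top, and among single-source hits the PDF results outrank Wikipedia and web results at the same rank because of the higher weight.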



### 🤖 Create Prompt Template and RAG Chain

In [13]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # replaces the deprecated langchain.chat_models import

prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on the provided context, which may include PDFs, Wikipedia, and web search.

Question: {input}
Context: {context}
Answer:
""")


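# Load the LLM and set up the RAG chain
# ChatOpenAI() without arguments falls back to the library's default model, so request GPT-4 explicitly
llm = ChatOpenAI(model="gpt-4")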
combined_retriever = get_combined_retriever()

# Setup the complete RAG pipeline
rag_chain = (
    {"context": combined_retriever, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
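In this chain the retriever's output, a list of `Document` objects, is placed into `{context}` directly, so the prompt receives the list's string representation, metadata included. A common alternative is to flatten the documents into readable text first. The sketch below shows one way to do that. `Doc` is a stand-in for LangChain's `Document`, and `format_docs` and `max_chars` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:  # stand-in for a LangChain Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, max_chars=1500):
    """Join retrieved documents into one context string, labelling each source."""
    parts = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        # Truncate long chunks so one source cannot crowd out the others
        parts.append(f"[{i}] (source: {source})\n{doc.page_content[:max_chars]}")
    return "\n\n".join(parts)


docs = [Doc("alpha", {"source": "a.pdf"}), Doc("beta")]
print(format_docs(docs))
```

In the chain, this would be wired in as `{"context": combined_retriever | format_docs, "input": RunnablePassthrough()}`, since piping a runnable into a plain function wraps the function automatically.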



In [14]:
import pandas as pd

# Define the query
query = "What are the key takeaways of llumo ai?"

# Get the response from the RAG chain
response = rag_chain.invoke(query)

# Retrieve the context again for logging (this runs all three retrievers a second time)
retrieved_context = combined_retriever.invoke(query)

# Store data in a dictionary
data = {
    "Query": [query],
    "Context": [retrieved_context],
    "Output": [response]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Optional: Save to a CSV file
df.to_csv("rag_responses.csv", index=False)


                                     Query  \
0  What are the key takeaways of llumo ai?   

                                             Context  \
0  [page_content='P.MURALI\n\nP.MURALI\n\nAssista...   

                                              Output  
0  The key takeaways of LLumo AI include transfor...  
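The cell above retrieves twice: once inside `rag_chain.invoke` and once through `combined_retriever.invoke`. Because the web search is live, the logged context may differ from what the model actually saw. A minimal sketch of retrieving once and reusing the result follows. `answer_once` and the two stub functions are hypothetical; in the notebook they would wrap `combined_retriever.invoke` and a call to `prompt | llm | StrOutputParser()`.

```python
def answer_once(query, retrieve, generate):
    """Retrieve context a single time and reuse it for the answer and the log row."""
    docs = retrieve(query)
    # Use page_content when present, otherwise fall back to the object's string form
    context = "\n\n".join(getattr(d, "page_content", str(d)) for d in docs)
    return {"Query": query, "Context": context, "Output": generate(query, context)}


calls = []


def fake_retrieve(query):
    calls.append(query)  # record how often retrieval runs
    return ["chunk one", "chunk two"]


def fake_generate(query, context):
    return f"answer based on {len(context)} characters"


row = answer_once("demo question", fake_retrieve, fake_generate)
print(row["Output"])  # answer based on 20 characters
```

`pd.DataFrame([row])` then yields the same three columns as before, with the context stored as plain text rather than as `Document` objects.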


### Evaluate the Output with Llumo AI

---



In [15]:
!pip install llumo


In [21]:
from llumo import LlumoClient

client = LlumoClient(api_key="Enter Your Llumo API key")
# {{Query}} and {{Context}} match the DataFrame column names built above
prompt_template = "You are a helpful assistant that answers questions based on the provided context, which may include PDFs, Wikipedia, and web search.\nQuestion: {{Query}}\nContext: {{Context}}\nAnswer:"
res = client.evaluate(df, evals=["Context Utilization"], prompt_template=prompt_template, outputColName="Output")
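The `{{Query}}` and `{{Context}}` placeholders in the template share their names with DataFrame columns, which suggests Llumo fills them from those columns. That is an inference from this call, not something confirmed by Llumo's documentation. Under that assumption, a quick check before sending the data can catch a misspelled placeholder. `missing_columns` is a hypothetical helper.

```python
import re


def missing_columns(prompt_template, columns):
    """Return {{placeholder}} names in the template that have no matching column."""
    placeholders = re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt_template)
    return [name for name in placeholders if name not in columns]


template = "Question: {{Query}}\nContext: {{Context}}\nAnswer:"
print(missing_columns(template, ["Query", "Context", "Output"]))  # []
```

In the notebook, `missing_columns(prompt_template, df.columns)` should return an empty list before `client.evaluate` is called.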




In [22]:
res

| | Query | Context | Output | Context Utilization |
|---|---|---|---|---|
| 0 | What are the key takeaways of llumo ai? | [page_content='P.MURALI\n\nP.MURALI\n\nAssista... | The key takeaways of LLumo AI include transfor... | 100 |
