# 📘 Multi-Retriever RAG Chatbot using PDF, Wikipedia, and Web Search

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline that pulls information from multiple sources—PDF documents, Wikipedia, and web search—to answer user queries using OpenAI GPT-4.

---

🔍 **What this notebook does:**

- Extracts structured content (text, tables, images) from PDFs using Unstructured.io
- Converts extracted content into LangChain documents, embeds them with OpenAI embeddings, and indexes them in a FAISS vector store
- Uses Wikipedia for open-domain factual retrieval
- Uses Tavily for live web search (requires API key)
- Combines all retrievers into an ensemble using LangChain's `EnsembleRetriever`
- Feeds the retrieved context into a GPT-4-powered RAG chain to answer questions

---

💡 **Example Questions:**

- "Compare Artificial Intelligence and Machine Learning."
- "What is the latest trend in Generative AI?"
- "Summarize the contents of the uploaded PDF."


### Install dependencies

In [None]:
!pip install --quiet faiss-cpu pytesseract unstructured-client "unstructured[all-docs]"
!pip install --quiet langchain langchain_openai langchain-community tavily-python wikipedia
!apt-get -qq install poppler-utils tesseract-ocr libtesseract-dev

### 🔑 Set up API key

In [None]:
import os
from google.colab import userdata

# Option 1: read the keys stored securely as Colab secrets.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["TAVILY_API_KEY"] = userdata.get("TAVILY_API_KEY")  # A free Tavily key is available at https://app.tavily.com/

# Option 2: if you are not using Colab secrets, comment out the lines above and set the keys directly.
# os.environ["OPENAI_API_KEY"] = "Enter Your OPENAI_API_KEY"
# os.environ["TAVILY_API_KEY"] = "Enter Your TAVILY_API_KEY"
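
Both keys need to be present before the later cells run; otherwise the failure only surfaces at the first API call. A minimal sketch of an early check follows. The helper name `missing_keys` is hypothetical and not part of the original notebook.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical usage: report missing keys before building any retriever.
missing = missing_keys(["OPENAI_API_KEY", "TAVILY_API_KEY"])
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.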



### 📄 Extract Structured Content from PDF with High-Resolution Layout Parsing


In [None]:
from unstructured.partition.pdf import partition_pdf

filename = "/content/Ml_sample.pdf" # Path to your PDF file

# Extract elements including images, tables, and structured text
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=True,               # Enable image extraction
    strategy="hi_res",                       # Use high-resolution parsing
    hi_res_model_name="yolox",               # YOLOX model for detecting layout
    infer_table_structure=True,               # Try to parse tables
    chunking_strategy="by_title",            # Split text by document headings
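    max_characters=3000,                      # Hard upper limit on chunk length, in characters
    combine_text_under_n_chars=200            # Merge sections shorter than this into a neighboring chunk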
)


### 🔍 Inspect parsed elements

In [6]:
# Analyze the types of elements extracted
from collections import Counter
category_counts = Counter(str(type(element)) for element in pdf_elements)
category_counts


Counter({"<class 'unstructured.documents.elements.CompositeElement'>": 47})


### 📚 Create LangChain Documents

In [None]:
# Convert each element into a searchable Document
from langchain.schema import Document
documents = [Document(page_content=el.text, metadata={"source": filename}) for el in pdf_elements]


### 🧠 Embed with OpenAI + FAISS

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Embed the documents with OpenAI and index them in a FAISS vector store
embeddings = OpenAIEmbeddings()
pdf_vectorstore = FAISS.from_documents(documents, embeddings)

# Set up a retriever on top of the vector store
pdf_retriever = pdf_vectorstore.as_retriever()
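
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors lie closest to it. The toy sketch below illustrates that nearest-neighbour step with hand-written 2-D vectors and Euclidean distance. It is not how FAISS is implemented; `top_k_by_l2` is a hypothetical name, and the vectors and `k` values are chosen only for illustration.

```python
import math

def top_k_by_l2(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # Smallest distance first
    return [idx for _, idx in distances[:k]]

# Four toy "document embeddings" and one "query embedding".
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_by_l2([1.0, 0.0], doc_vecs, k=2))  # [1, 2]
```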

### ✅ Set Up Wikipedia Retriever to Fetch Top 3 Relevant Articles



In [None]:
from langchain_community.retrievers import WikipediaRetriever

def get_wiki_retriever():
    # Retrieves top 3 Wikipedia articles
    return WikipediaRetriever(top_k_results=3)

### 🌐 Create a Web Retriever Using Tavily Search


In [None]:

from langchain_community.retrievers.tavily_search_api import TavilySearchAPIRetriever

def get_web_retriever_tavily():
    return TavilySearchAPIRetriever(k=3)

### 🧠 Combine PDF, Wikipedia, and Web Search into a Unified Retriever

In [None]:
from langchain.retrievers import EnsembleRetriever

def get_combined_retriever():
    return EnsembleRetriever(
        retrievers=[
            pdf_retriever,
            get_wiki_retriever(),
            get_web_retriever_tavily()
        ],
        weights=[1.0, 0.7, 0.7]
    )
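
LangChain documents `EnsembleRetriever` as merging results with Reciprocal Rank Fusion, scaled by the per-retriever weights. The sketch below shows how weights such as `[1.0, 0.7, 0.7]` could influence the final order. It is an illustration rather than the library's code: the function name `weighted_rrf`, the document IDs, and the constant `c=60` are assumptions.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each hit adds weight / (rank + c) to its document."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked results from the three retrievers.
pdf_hits = ["pdf_a", "shared"]
wiki_hits = ["shared", "wiki_a"]
web_hits = ["web_a"]
print(weighted_rrf([pdf_hits, wiki_hits, web_hits], [1.0, 0.7, 0.7]))
```

A document returned by several retrievers accumulates score from each list, so `shared` outranks `pdf_a` here even though the PDF retriever carries the highest weight.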



### 🤖 Create Prompt Template and RAG Chain

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on the provided context, which may include PDFs, Wikipedia, and web search.

Question: {input}
Context: {context}
Answer:
""")


# Load the LLM and set up the RAG chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")  # GPT-4, as stated in the introduction
combined_retriever = get_combined_retriever()

# Setup the complete RAG pipeline
rag_chain = (
    {"context": combined_retriever, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
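
In the chain above, the retriever's output (a list of document objects) is placed into `{context}` as-is, so the prompt receives the list's string representation, metadata included. One common alternative is to join only the text of each document first. A minimal sketch, assuming a hypothetical helper named `format_docs` and using stand-in objects in place of real LangChain documents:

```python
from types import SimpleNamespace

def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs if doc.page_content)

# Stand-ins for LangChain Document objects, used only for illustration.
docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content=""),
    SimpleNamespace(page_content="Second chunk."),
]
print(format_docs(docs))
```

In the pipeline, such a helper would typically be composed as `combined_retriever | format_docs` for the `"context"` key.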

In [3]:
query = "What are the key takeaways of llumo ai"
response1 = rag_chain.invoke(query)
print(response1)

Some key takeaways of LLUMO AI include delivering exceptional AI solutions, optimizing performance, ensuring technology achieves excellence through innovative and tailored strategies, providing data upload made easy, offering advanced evaluation tools like RAGAs and LLUMO Eval LM for one-click evaluations, customizable KPIs, actionable insights for refining models, and tracking prompt performance on key metrics for production and development.


## Evaluate the output with LLUMO AI

---



In [7]:
# Evaluate the RAG output with the LLUMO AI API
import requests

User_Query = "What are the key takeaways of llumo ai"
llm_response = response1
inputs = {}

# Define the endpoint, headers, and payload
LLUMO_ENDPOINT = "https://app.llumo.ai/api/create-eval-analytics"
headers = {
    "Authorization": "Bearer {Your llumo api key}",  # Replace with your LLUMO API key, e.g. "Bearer A1B2C3"
    "Content-Type": "application/json"
}
payload = {
    "prompt": User_Query,
    "input": inputs,
    "output": llm_response,
    "analytics": ["Context Utilization"]  # Other analytics names include Confidence, Clarity, Context, etc.
}

# Make the API request
response = requests.post(LLUMO_ENDPOINT, json=payload, headers=headers)
try:
    result = response.json()  # Parse the JSON response
    print("statusCode : ", result['data']['statusCode'])
    print("message : ", result['data']['message'])
    # Extract the 'data' part, which holds the analytics scores and reasoning
    data = result.get('data', {})
    print("Analytics:", data)
except Exception as e:
    print(e)




statusCode :  200
message :  SUCCESS
Analytics: {'data': '{"analyticsScore": {"*the output should accurately reflect the core functionalities and benefits of llumo ai.": 85, "*the response should be well-structured and easy to understand, using clear and concise language.": 90, "*the output must be comprehensive, covering a wide range of llumo ai's features and capabilities.": 70, "*the output should be factually accurate and avoid any misleading or false information about llumo ai.": 95, "*the response should be relevant to the prompt and directly address the user's request for key takeaways.": 100, "overallScore": 88}, "reasoning": {"*the output should accurately reflect the core functionalities and benefits of llumo ai.": ["The output correctly identifies several key functionalities of LLUMO AI, such as delivering AI solutions, performance optimization, and providing evaluation tools.", "It also mentions important benefits like easy data upload, customizable KPIs, and actionable in