<a href="https://colab.research.google.com/github/Aryayayayaa/Multi-Doc-RAG/blob/main/Research_Agent_Gem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name - Multi-Document Research Agent (RAG + Planning)**    



# **Project Summary -**

This project is a multi-document research agent built in Python within a Google Colab environment. The system's core function is to intelligently answer complex, natural language questions by orchestrating a hybrid retrieval process that leverages both a local document repository and real-time web search capabilities. It uses an agentic framework to plan and execute multi-step reasoning, ensuring that the final output is a structured, comprehensive, and traceable report.

# **GitHub Link -**

https://github.com/Aryayayayaa/Multi-Doc-RAG/blob/main/Research_Agent_Gem.ipynb

# **Problem Statement**


The central challenge this project addresses is the need for a sophisticated AI research assistant that can synthesize information from disparate sources to provide accurate and context-rich answers. Traditional search methods are often limited to either a single document corpus or the public web. This project solves that limitation by building a system that can seamlessly access and combine knowledge from a private, local repository of documents with up-to-the-minute information retrieved from the internet, all while providing full traceability for every claim made.

# **Code -**


In [None]:
# @title Necessary packages
!pip install pypdf sentence-transformers faiss-cpu langchain-community openai openai-agents serpapi
!pip install google-search-results
# python-dotenv provides the `dotenv` module imported in the API-key cell below
!pip install langchain-openai python-dotenv

In [None]:
# @title Loading Dataset for Local Search
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# @title Set up API Keys as Environment Variables
import os
from google.colab import drive
from dotenv import load_dotenv

# Specify the path to your .env file in Google Drive
dotenv_path = '/content/drive/My Drive/RAG Agent Workspace/secrets.env'

# Check if the file exists before attempting to load
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path)
    print("API keys loaded from Google Drive.")
else:
    print("Warning: .env file not found. Please create one with your API keys.")
    print("Path expected: " + dotenv_path)

# You can now access your keys via os.getenv()
# For example:
# openai_key = os.getenv("OPENAI_API_KEY")
# serpapi_key = os.getenv("SERPAPI_API_KEY")
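
Because a missing key only surfaces later as a tool or LLM error, it can help to check for both keys right after loading. The following is a minimal sketch, not part of the original notebook: the helper name `missing_keys` is hypothetical, and the key names are the two mentioned in the comments above.

```python
import os

# Key names taken from the comments in the cell above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "SERPAPI_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing API keys: " + ", ".join(missing))
else:
    print("All required API keys are set.")
```

Passing `env` explicitly makes the helper easy to check against a plain dictionary without touching the real environment.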

In [None]:
# @title  Load and Chunk Documents

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import warnings

warnings.filterwarnings("ignore")

# Define the path to your documents in Google Drive
# You should place your 9 documents in a folder, for example: "local_docs"
# Adjust the path to where you have saved your files.
docs_path = '/content/drive/My Drive/RAG Agent Workspace/local_docs'

# Define the loader for your documents. We'll use PyPDFLoader for PDF files.
# If you have markdown (.md) or text (.txt) files, you would use a different loader like UnstructuredMarkdownLoader
# or TextLoader. The DirectoryLoader can handle multiple file types.
loader = DirectoryLoader(docs_path, glob="**/*.pdf", loader_cls=PyPDFLoader)

# Load the documents from the specified directory
print("Loading documents from Google Drive...")
documents = loader.load()
# PyPDFLoader returns one Document per PDF page, so this count is pages, not files.
print(f"Loaded {len(documents)} pages.")

# Initialize the text splitter to create chunks with a specific size and overlap
# A chunk_size of 500 characters and an overlap of 50 characters is a good starting point.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Split the loaded documents into chunks
print("Splitting documents into chunks...")
texts = text_splitter.split_documents(documents)
print(f"Created {len(texts)} text chunks.")

# The 'texts' variable now contains a list of all the document chunks
# and is ready for the next step.
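
To make the two splitter parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is an illustration only: `RecursiveCharacterTextSplitter` first splits on separators such as paragraph and line breaks and then merges pieces up to `chunk_size`, so its real chunks usually end at natural boundaries and are often shorter. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1200 characters of sample text
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The last 50 characters of each chunk repeat at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.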

In [None]:
# @title Generate Embeddings and Build FAISS Index

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings

# Initialize the Sentence Transformer model to create embeddings
# 'all-MiniLM-L6-v2' is a small, fast, and effective model for this task.
print("Initializing Sentence Transformer model...")
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Create a FAISS vector store from the document chunks and embeddings
# This is where the index is built and the document vectors are stored.
print("Creating FAISS vector store from chunks...")
vector_store = FAISS.from_documents(texts, embeddings)
print("Vector store created successfully.")
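
Conceptually, the vector store answers a query by embedding it and returning the chunks whose vectors are nearest. The sketch below shows that idea with a brute-force search in plain Python, assuming the wrapper's default Euclidean (L2) distance. The toy vectors and the names `l2_distance` and `top_k` are made up for illustration; FAISS performs the same comparison far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Toy 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_index = [
    ("chunk about retrieval", [0.9, 0.1, 0.0]),
    ("chunk about agents", [0.1, 0.9, 0.0]),
    ("chunk about billing", [0.0, 0.1, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

With this toy data the query vector lies closest to the "retrieval" chunk, followed by the "agents" chunk.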

In [None]:
# @title  Save the Index for Future Use

import os

# Define the path where you want to save your FAISS index
# This will save the index and its metadata to your Google Drive.
index_path = '/content/drive/My Drive/RAG Agent Workspace/faiss_index'

print(f"Saving FAISS index to {index_path}...")
vector_store.save_local(index_path)
print("FAISS index saved successfully.")

# To reload the index in a new session, you can use the following code:
# from langchain_community.vectorstores import FAISS
# from langchain_community.embeddings import SentenceTransformerEmbeddings
# embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# reloaded_vector_store = FAISS.load_local(index_path, embeddings, allow_dangerous_deserialization=True)
# print("FAISS index reloaded from Google Drive.")

In [None]:
# @title Create the Local Document Search Tool

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.agents import tool
import os

# Ensure the Sentence Transformer model is initialized again to reload the index
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Define the path where the FAISS index is saved
index_path = '/content/drive/My Drive/RAG Agent Workspace/faiss_index'

print("Attempting to load FAISS index from Google Drive...")
try:
    reloaded_vector_store = FAISS.load_local(index_path, embeddings, allow_dangerous_deserialization=True)
    print("FAISS index loaded successfully.")
except Exception as e:
    print(f"Error loading FAISS index: {e}")
    print("Please ensure you have run the previous cells to create and save the index.")
    reloaded_vector_store = None # Set to None to prevent subsequent errors

# Define a function to perform the local search with a proper docstring
@tool
def local_document_search(query: str) -> str:
    """
    Searches the local FAISS vector store for relevant document chunks.
    Use this tool to find information from the provided local document repository.
    """
    if reloaded_vector_store is not None:
        # Perform a similarity search and get the top 4 most relevant chunks
        docs = reloaded_vector_store.similarity_search(query, k=4)
        return "\n\n".join([doc.page_content for doc in docs])
    else:
        return "Local document search is not available. FAISS index not loaded."
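
The tool above returns only the chunk text, so the agent cannot tell which file a passage came from. One possible way to support the traceability goal is to label each chunk with its metadata before joining. The sketch below assumes the `source` and `page` metadata keys that `PyPDFLoader` attaches to each page; the `Chunk` class is a hypothetical stand-in for a LangChain `Document`, and `format_with_sources` is not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_with_sources(docs):
    """Join chunk texts, labelling each with its source file and page."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")  # PyPDFLoader page numbers are zero-based
        label = f"[{i}] {source}" + (f", page {page}" if page is not None else "")
        blocks.append(f"{label}\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [
    Chunk("FAISS stores dense vectors.", {"source": "local_docs/a.pdf", "page": 2}),
    Chunk("No metadata here."),
]
formatted = format_with_sources(docs)
print(formatted)
```

Inside `local_document_search`, the same function could replace the plain `"\n\n".join(...)` call, since real `Document` objects expose the same `page_content` and `metadata` attributes.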

In [None]:
# @title Create Web Search Tool

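# Ensure you have installed the necessary libraries.
# SerpAPIWrapper imports the `serpapi` module from the google-search-results package:
# pip install langchain langchain-community google-search-results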

import os
from langchain_community.utilities import SerpAPIWrapper
from langchain.agents import tool

# Check if the SerpAPI key is set as an environment variable
serpapi_api_key = os.environ.get("SERPAPI_API_KEY")
if not serpapi_api_key:
    raise ValueError("SERPAPI_API_KEY environment variable is not set. Please set it before running this cell.")

# Define the SerpAPIWrapper with the API key
try:
    search = SerpAPIWrapper(serpapi_api_key=serpapi_api_key)
except Exception as e:
    raise RuntimeError(f"Failed to initialize SerpAPIWrapper. Check your API key. Error: {e}")

# Define a function to perform the web search
@tool
def web_search(query: str) -> str:
    """
    Performs a real-time web search for the given query.
    Useful for finding up-to-date or general knowledge not available in local documents.
    """
    try:
        return search.run(query)
    except Exception as e:
        return f"An error occurred during the web search: {e}"

print("Web Search Tool 'web_search' created successfully.")

In [None]:
# @title Create the Agent and Agent Executor

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage
import os

# Check if the OpenAI API key is set
if "OPENAI_API_KEY" not in os.environ:
    print("Warning: OPENAI_API_KEY environment variable is not set.")
    print("Please run Cell 3 to load your API keys from the .env file.")

# Define the LLM to be used by the agent
# 'gpt-4o' is a powerful model that is great at tool use and reasoning.
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Define the tools the agent can use.
# Add both the local document search and the web search tools.
tools = [local_document_search, web_search]

# Create a prompt for the agent. This instructs the LLM on its role.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")  # Required: holds the agent's intermediate tool calls and their results
])

# Create the agent
# This binds the LLM and the tools together with the prompt.
agent = create_tool_calling_agent(llm, tools, prompt)

# Create the AgentExecutor
# This is the central runtime that will execute the agent's plan.
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print("Agent and AgentExecutor created. You are now ready to run queries.")
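
The generic system prompt leaves the choice and order of tools entirely to the model. If a local-first strategy with source attribution is wanted, the instructions can say so explicitly. The wording below is a hypothetical sketch, not the author's prompt; it keeps the same three-message layout as the cell above and uses plain tuples so it can be inspected without any library.

```python
# Hypothetical system prompt; the wording is an assumption, not the author's.
RESEARCH_SYSTEM_PROMPT = (
    "You are a research assistant with two tools. "
    "First call local_document_search. "
    "Call web_search only if the local results do not answer the question "
    "or the question needs recent information. "
    "In the final answer, state which tool each claim came from."
)

# Same message layout as the prompt in the cell above, as plain tuples.
prompt_messages = [
    ("system", RESEARCH_SYSTEM_PROMPT),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
]

for role, content in prompt_messages:
    print(f"{role}: {content}")
```

To try it, `prompt_messages` could be passed to `ChatPromptTemplate.from_messages` in place of the inline list. The prompt text deliberately contains no curly braces, which the template would otherwise treat as input variables.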

In [None]:
# @title Ask the Agent a Question
user_question = input("Please enter your question for the agent: ")

response = agent_executor.invoke({"input": user_question})
print(response["output"])

# **Applications/Usage**

This hybrid retrieval system can be used in any scenario where a combination of internal, domain-specific knowledge and up-to-date, public information is required. Key applications include:

* **Corporate Knowledge Management:** Employees can query internal company documents (e.g., policy manuals, project reports, training materials) and, if the information is insufficient, the system automatically uses a web search to find external best practices or competitive data.
* **Customer Support & Helpdesks:** The system can first search a local knowledge base of common issues and solutions. For novel or complex queries, it can then perform a web search to find relevant forums, product documentation, or recent bug fixes.
* **Legal & Medical Research:** Researchers can use this to search a private repository of case law or medical journals and seamlessly extend the search to the public web for new legislation or the latest research findings.





# **Recommendations**

To improve the system's performance and robustness in the short term, consider the following enhancements:

* **Add a Reranking Step:** The notebook currently passes the top 4 FAISS hits to the agent without reranking. Adding a reranking model (e.g., a CrossEncoder) to rescore the retrieved chunks would rank them by relevance more accurately and lead to more precise answers.
* **Add Error Handling & Fallbacks:** Implement a robust error-handling mechanism to gracefully manage cases where a tool fails (e.g., web search API is down). The system should be able to fall back to the local documents and inform the user of the issue.
* **Improve Local Data Refreshment:** Develop a simple script to periodically check for new documents in the Google Drive folder and update the FAISS index. This ensures the local knowledge base remains current without manual intervention.
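
The fallback idea from the list above can be sketched as a small wrapper that tries one search function and, if it raises, switches to another while telling the user what happened. This is a minimal sketch under stated assumptions: `search_with_fallback` and the two stub functions are hypothetical and stand in for the real web and local searches.

```python
def search_with_fallback(query, primary, fallback):
    """Try the primary search; on any exception, use the fallback and say so."""
    try:
        return primary(query)
    except Exception as exc:
        notice = f"Note: primary search failed ({exc}); showing fallback results."
        return notice + "\n\n" + fallback(query)


# Hypothetical stand-ins for the real search functions.
def failing_web_search(query):
    raise ConnectionError("web search API unavailable")


def stub_local_search(query):
    return f"local results for: {query}"


result = search_with_fallback("what is RAG?", failing_web_search, stub_local_search)
print(result)
```

Note that the notebook's `web_search` tool already catches exceptions and returns the error text as a string, so a wrapper like this would need to call the underlying `search.run` directly in order to see the failure.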

# **Conclusion**

This project successfully demonstrates the power of a hybrid retrieval-augmented generation (RAG) system. By intelligently combining a local document search with a web search tool, the system can provide comprehensive and well-cited answers that go beyond a single data source. The use of a multi-step agent planner and structured report generation ensures the output is not only accurate but also easy to understand and verify. This architecture serves as a solid foundation for building more complex, domain-specific AI research assistants.


# **Future Work**

To evolve this project into a more powerful and scalable solution, consider these long-term goals:

* **Multi-Agent Collaboration:** Instead of a single agent, implement a multi-agent system where different agents specialize in specific tasks (e.g., a "Local Search Agent" and a "Web Search Agent"). A "Planning Agent" can then orchestrate their collaboration to solve complex, multi-faceted problems.
* **Advanced Data Ingestion:** Develop a more sophisticated data pipeline that can ingest and process various data types beyond PDFs, such as web pages, emails, and database records. This would make the system more versatile for diverse use cases.
* **User Interface (UI) Development:** Create a web-based user interface to make the system accessible to non-technical users. The UI would provide a simple chat interface for queries and render the final Markdown report in a clean, readable format.