# DocuBrain - Multiple PDF Chat

## 1. Install Necessary Libraries

This cell installs or upgrades the required Python libraries for this project. This includes `langchain` and related packages for building LLM applications, `faiss-cpu` for efficient vector similarity search, `python-dotenv` for managing environment variables, and `pymupdf` for PDF text extraction.

In [None]:
%pip install --upgrade langchain langchain-community langchain-openai faiss-cpu python-dotenv pymupdf

## 2. Import Modules

This section imports all the necessary modules from various libraries used throughout the notebook. These include operating system utilities, dotenv for environment variables, `fitz` (PyMuPDF) for robust PDF handling, `CharacterTextSplitter` for breaking text into manageable chunks, `FAISS` for vector storage, `OpenAIEmbeddings` and `ChatOpenAI` for OpenAI's embedding and chat models, `ChatPromptTemplate` for defining chat prompts, `StrOutputParser` for parsing LLM output, and `RunnablePassthrough` for passing inputs directly through a LangChain chain.

In [None]:
import os
from dotenv import load_dotenv
import fitz
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

## 3. Configure OpenAI API Key

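This cell loads your OpenAI API key. `load_dotenv()` reads a local `.env` file if one exists, and the key is then read from the `OPENAI_API_KEY` environment variable. Do not hard-code the key in the notebook; keep it in `.env` or load it securely (e.g., from Colab secrets). The cell stops with a clear error if no key is found.

In [None]:
load_dotenv()  # read variables from a local .env file, if present
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set; add it to .env or the environment.")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY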

## 4. PDF Text Extraction Function

### `extract_text_pymupdf(pdf_path)`

This function extracts all textual content from a given PDF file using the `fitz` library (PyMuPDF). It opens the PDF, iterates through each page, extracts that page's text, and concatenates the results into a single string, which is returned for further processing. Pages without a text layer (for example, scanned images) contribute an empty string.

In [None]:
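def extract_text_pymupdf(pdf_path):
    """Return the text of every page in the PDF at pdf_path as one string."""
    with fitz.open(pdf_path) as doc:
        # Join once instead of growing a string page by page.
        return "".join(page.get_text() for page in doc)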

## 5. Text Chunking Function

### `get_text_chunks(text)`

This function divides a large string of text into smaller, overlapping segments called 'chunks'. It uses LangChain's `CharacterTextSplitter`, configured to split on newline characters and merge the resulting lines into chunks of at most 1000 characters (`chunk_size`), with up to 200 characters (`chunk_overlap`) shared between consecutive chunks. A single line longer than `chunk_size` is not split further and is kept as an oversized chunk. Chunking is a critical step for RAG systems: it prepares documents for embedding and retrieval, so that relevant context can be provided to the LLM without exceeding token limits.

In [None]:
def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)
    return text_splitter.split_text(text)
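
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a fixed-width sliding window. It is a simplified illustration and not `CharacterTextSplitter`'s actual algorithm, which splits on the separator and then merges pieces. The helper name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

With these toy settings, the last 4 characters of each chunk reappear at the start of the next one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.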

## 6. Batch Embedding Function

### `embed_chunks_batched(embeddings, text_chunks, batch_size=10)`

This function generates vector embeddings for a list of text chunks. It calls the `embeddings` model (e.g., `OpenAIEmbeddings`) on `batch_size` chunks at a time and accumulates the results into a single list, in the same order as `text_chunks`. Batching bounds the size of each request; a smaller `batch_size` means smaller requests but more of them.

In [None]:
def embed_chunks_batched(embeddings, text_chunks, batch_size=10):
    all_embeddings = []
    for i in range(0, len(text_chunks), batch_size):
        batch = text_chunks[i:i+batch_size]
        batch_embeddings = embeddings.embed_documents(batch)
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
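
The batching behavior can be checked without calling any API. The sketch below repeats the function and runs it against a stand-in model; `FakeEmbeddings` is a hypothetical class invented for this example, and its one-number "embedding" is just the text length.

```python
class FakeEmbeddings:
    """Hypothetical stand-in for an embeddings model; records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def embed_documents(self, texts):
        self.batch_sizes.append(len(texts))
        # Toy one-dimensional "embedding": the length of each text.
        return [[float(len(t))] for t in texts]


def embed_chunks_batched(embeddings, text_chunks, batch_size=10):
    all_embeddings = []
    for i in range(0, len(text_chunks), batch_size):
        batch = text_chunks[i:i + batch_size]
        all_embeddings.extend(embeddings.embed_documents(batch))
    return all_embeddings


fake = FakeEmbeddings()
chunks = [f"chunk {n}" for n in range(25)]
vectors = embed_chunks_batched(fake, chunks, batch_size=10)
print(fake.batch_sizes, len(vectors))
```

Twenty-five chunks with a `batch_size` of 10 result in three calls, the last one holding the 5 leftover chunks, and the output list stays aligned with the input order.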

## 7. Vector Store Creation Function

### `get_vectorstore(text_chunks, embeddings, embeddings_data=None)`

This function is responsible for creating a `FAISS` vector store, which is used for rapid similarity search over the embedded text data. It can operate in two modes: either it takes `text_chunks` and an `embeddings` model to generate embeddings on the fly and build the store, or it uses pre-computed `embeddings_data` directly, combining it with the `text_chunks` to construct the FAISS index. This flexibility allows for optimized vector store creation depending on whether embeddings are already available.

In [None]:
def get_vectorstore(text_chunks, embeddings, embeddings_data=None):
    if embeddings_data:
        return FAISS.from_embeddings(list(zip(text_chunks, embeddings_data)), embedding=embeddings)
    else:
        return FAISS.from_texts(texts=text_chunks, embedding=embeddings)
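
Conceptually, similarity search returns the stored chunks whose vectors lie closest to the query vector. The sketch below does this by brute force with squared Euclidean distance in plain Python. It only illustrates the idea and is not how FAISS is implemented; `nearest_chunks`, `squared_l2`, and the toy 2-D vectors are made up for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_chunks(query_vector, text_embeddings, k=2):
    """Return the k texts whose vectors are closest to query_vector.

    text_embeddings is a list of (text, vector) pairs, the same shape
    that get_vectorstore builds with zip(text_chunks, embeddings_data).
    """
    ranked = sorted(text_embeddings, key=lambda pair: squared_l2(query_vector, pair[1]))
    return [text for text, _ in ranked[:k]]


text_chunks = ["cats purr", "dogs bark", "stocks fell"]
embeddings_data = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]  # toy vectors, not real embeddings
pairs = list(zip(text_chunks, embeddings_data))
top = nearest_chunks([1.0, 0.0], pairs, k=2)
print(top)
```

The pairing step matters: `text_chunks` and `embeddings_data` must have the same length and order, otherwise a chunk is stored under another chunk's vector.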

## 8. Conversational Chain Setup Function

### `get_conversation_chain(vectorstore)`

This function constructs the retrieval chain that answers questions. It initializes a `ChatOpenAI` model with a `temperature` of 0 so that responses are as repeatable as possible. It then configures a `retriever` from the provided `vectorstore` that fetches the 4 chunks most similar to the user's query (`k=4`). A `ChatPromptTemplate` instructs the LLM to answer *only* from the given context and to say 'I don't know' when the context does not contain the answer. Finally, it builds a LangChain Expression Language (LCEL) chain that pipes the retriever output and the question into the prompt, then into the LLM, then into an output parser that returns the answer as a plain string. Note that the chain keeps no chat history: each question is answered independently.

In [None]:
def get_conversation_chain(vectorstore):
    llm = ChatOpenAI(temperature=0)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    template = """Answer based only on the following context. If the context doesn't contain the answer, say 'I don't know.'

Context:
{context}

Question: {question}

Answer:"""

    prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return chain
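
To see what text the LLM actually receives, the prompt assembly can be reproduced with plain string formatting. This is a sketch of the data flow only, with no LangChain involved; `FakeDoc` and `build_prompt` are hypothetical names, and the two sample documents are invented.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


TEMPLATE = """Answer based only on the following context. If the context doesn't contain the answer, say 'I don't know.'

Context:
{context}

Question: {question}

Answer:"""


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, retrieved_docs):
    # Mirrors the dict step of the chain: "context" comes from the
    # retriever output, "question" is passed through unchanged.
    return TEMPLATE.format(context=format_docs(retrieved_docs), question=question)


docs = [FakeDoc("Invoices are due in 30 days."), FakeDoc("Late fees are 2% per month.")]
prompt_text = build_prompt("When are invoices due?", docs)
print(prompt_text)
```

Printing the assembled prompt like this is a quick way to check whether a wrong answer comes from poor retrieval (the context lacks the fact) or from the model itself.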

## 9. Main Application Logic

### `main()`

This is the central function that orchestrates the entire workflow. It prompts the user for a comma-separated list of PDF file paths, extracts the text from each PDF, and divides the combined text into chunks. It then generates vector embeddings for these chunks using `OpenAIEmbeddings` and constructs a `FAISS` vector store from them. Finally, it sets up the chain and enters an interactive loop in which the user asks questions and the AI answers based on the content of the processed PDFs. The loop ends when the user types 'exit'.

In [None]:
def main():
    file_paths = input("Enter PDF file paths (comma separated): ").split(",")
    file_paths = [f.strip() for f in file_paths if f.strip()]
    raw_text = "\n".join([extract_text_pymupdf(fp) for fp in file_paths])
    text_chunks = get_text_chunks(raw_text)
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
    embeddings_data = embed_chunks_batched(embeddings, text_chunks)
    vectorstore = get_vectorstore(text_chunks, embeddings, embeddings_data)
    conversation = get_conversation_chain(vectorstore)
    print("PDF(s) processed. You can now ask questions. Type 'exit' to quit.")
    while True:
        user_question = input("Your question: ")
        if user_question.strip().lower() == 'exit':
            print("Goodbye!")
            break
        answer = conversation.invoke(user_question)
        print(f"Bot: {answer}")
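
`main()` passes every path the user typed straight to the extractor, so a mistyped path fails only once extraction begins. One possible pre-check, sketched here with the standard library only, separates existing files from missing ones before any processing. The helper name `split_existing_paths` and the file names are hypothetical.

```python
from pathlib import Path
import tempfile


def split_existing_paths(raw_input):
    """Parse comma-separated paths and separate existing files from missing ones."""
    candidates = [p.strip() for p in raw_input.split(",") if p.strip()]
    found = [p for p in candidates if Path(p).is_file()]
    missing = [p for p in candidates if not Path(p).is_file()]
    return found, missing


# Demonstration with one real temporary file and one path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "report.pdf"
    real.write_bytes(b"%PDF-1.4\n")
    found, missing = split_existing_paths(f"{real}, , {tmp}/absent.pdf")

print(found, missing)
```

A caller could report the `missing` list to the user and continue with `found`, or stop early when `found` is empty.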

## 10. Execute the Main Function

This cell calls `main()`, starting the DocuBrain chat loop. The `if __name__ == '__main__':` guard ensures that `main()` runs only when the code is executed directly rather than imported as a module. Inside a notebook, `__name__` is `'__main__'`, so running this cell always starts the loop; the guard matters only if the notebook is exported as a script.

In [None]:
if __name__ == '__main__':
    main()