<a href="https://colab.research.google.com/github/Indranil-R/rag-maester/blob/master/rag_maester.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG Maester
**Your AI Scholar**

Welcome to **RAG Maester**, an AI assistant built for academic work.
It uses **Retrieval-Augmented Generation (RAG)** to search its knowledge base and ground each response in the retrieved material, helping with university assignments and tasks.


In [1]:
import os
import requests

In [2]:
# Downloading the requirements file if it is not already present
if os.path.isfile("requirements.txt"):
  print("Requirements.txt already exists. Downloading modules...")
else:
  print("Requirements.txt doesn't exist. Downloading it from GitHub...")
  url = 'https://raw.githubusercontent.com/Indranil-R/rag-maester/refs/heads/master/requirements.txt'
  response = requests.get(url, timeout=30)
  response.raise_for_status()  # Fail early instead of saving an error page as requirements.txt

  with open('requirements.txt', 'w', encoding='utf-8') as file:
    file.write(response.text)
  print("File downloaded successfully.")

# !pip install -q -r requirements.txt  # Uncomment only if the dependencies are not installed yet

Requirements.txt already exists. Downloading modules...


## Importing all required third-party libraries

---



In [3]:
# google.colab is only importable inside Colab, where secrets are read through `userdata`
if os.getenv("COLAB_RELEASE_TAG"):
    from google.colab import userdata


from loguru import logger
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI


In [4]:
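# Setting up Google API key
if os.getenv("GOOGLE_API_KEY") is None:
    if os.getenv("COLAB_RELEASE_TAG"):
        os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
    else:
        raise RuntimeError("GOOGLE_API_KEY is not set. Export it before running this notebook.")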

## 1. Upload and Ingest Documents 📄

### Scan the docs directory for all available documents

In [5]:
# Fetch all file paths from a directory

def fetch_all_docs(docs_path: str) -> list[str]:
    docs_list = []
    if not os.path.isdir(docs_path):
        logger.warning(f"The path '{docs_path}' is not a valid directory or does not exist.")
        return []
    try:
        for item_name in os.listdir(docs_path):
            item_full_path = os.path.join(docs_path, item_name)
            if os.path.isfile(item_full_path):
                docs_list.append(item_full_path)
    except OSError as e:
        logger.error(f"Error accessing or reading directory '{docs_path}': {e}")
        return []
    return docs_list
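
The scan above returns every regular file in the directory, while the later ingestion step reads each one with a PDF loader. The following is a minimal sketch, assuming that only `.pdf` files should be ingested. `fetch_pdf_docs` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path


def fetch_pdf_docs(docs_path: str) -> list[str]:
    """Return the sorted paths of PDF files directly inside docs_path."""
    root = Path(docs_path)
    if not root.is_dir():
        # Same contract as fetch_all_docs: an invalid directory yields an empty list.
        return []
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Filtering by extension keeps stray files (for example a `.DS_Store` or a text file) from reaching the loader, where they would only produce logged errors.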

In [6]:
# Fetching all documents from the docs directory
documents_list = fetch_all_docs(os.path.join(os.getcwd(), "docs"))

logger.info(f"Total number of documents found: {len(documents_list)}")

2025-05-17 01:10:38.879 | INFO     | __main__:<cell line: 0>:4 - Total number of documents found: 3


### Split the documents into smaller chunks

In [7]:
# Clean text by removing predefined boilerplate phrases (plain, case-sensitive substring removal)

def clean_text(text: str) -> str:
    removal_phrases = [
        "(c) Amity University Online",
        "Notes",
        "Amity Directorate of Distance & Online Education",
        "Introduction to E-Governance"
    ]
    for phrase in removal_phrases:
        text = text.replace(phrase, "")
    return text.strip()
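
Because `clean_text` removes substrings wherever they occur, a phrase such as "Notes" is also stripped from the middle of ordinary sentences. The sketch below shows one possible alternative, under the assumption that the boilerplate appears on lines of its own (as page headers and margin labels often do after PDF extraction). `clean_text_by_line` is a hypothetical name, not part of the notebook.

```python
REMOVAL_PHRASES = {
    "(c) Amity University Online",
    "Notes",
    "Amity Directorate of Distance & Online Education",
    "Introduction to E-Governance",
}


def clean_text_by_line(text: str) -> str:
    """Drop only the lines that consist entirely of a boilerplate phrase."""
    kept = [line for line in text.splitlines() if line.strip() not in REMOVAL_PHRASES]
    return "\n".join(kept).strip()


sample = "Notes\nField Notes are kept by each researcher.\n(c) Amity University Online"
print(clean_text_by_line(sample))  # Field Notes are kept by each researcher.
```

With the same sample, the substring-based `clean_text` would return "Field  are kept by each researcher.", so the choice between the two depends on how the source PDFs lay out their headers.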


In [8]:
# Load a PDF from the 6th page onward, clean, and split into chunks

def load_and_split_pdf(doc_path):
    loader = PyPDFLoader(file_path=doc_path, mode="page")
    all_pages = loader.load()
    relevant_pages = all_pages[5:]
    for page in relevant_pages:
        page.page_content = clean_text(page.page_content)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=250,
        separators=["\n\n", "\n", ".", " "],
    )
    return text_splitter.split_documents(relevant_pages)
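
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at the listed separators); it only illustrates how consecutive chunks share `chunk_overlap` characters. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.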


In [9]:
# Process multiple PDF documents into cleaned, chunked outputs

def process_documents(documents_path_list: list[str]) -> list:
    all_processed_chunks = []
    for doc_path in documents_path_list:
        logger.info(f"Processing document: {doc_path}")
        try:
            single_doc_chunks = load_and_split_pdf(doc_path)
            if single_doc_chunks:
                all_processed_chunks.extend(single_doc_chunks)
                logger.info(f"Successfully processed and extracted {len(single_doc_chunks)} chunks from {doc_path}")
            else:
                logger.warning(f"No relevant chunks found in {doc_path}.")
        except FileNotFoundError:
            logger.error(f"File not found: {doc_path}. Please check the file path.")
        except Exception as e:
            logger.error(f"Error processing document {doc_path}: {e}")
    return all_processed_chunks


In [10]:
documents = process_documents(documents_list)

2025-05-17 01:10:38.932 | INFO     | __main__:process_documents:6 - Processing document: /content/docs/Introduction to Data Science F-CSIT359-S.pdf
2025-05-17 01:10:50.221 | INFO     | __main__:process_documents:11 - Successfully processed and extracted 961 chunks from /content/docs/Introduction to Data Science F-CSIT359-S.pdf
2025-05-17 01:10:50.224 | INFO     | __main__:process_documents:6 - Processing document: /content/docs/Introduction to E-Governance F-CSIT326 S.pdf
2025-05-17 01:10:54.708 | INFO     | __main__:process_documents:11 - Successfully processed and extracted 978 chunks from /content/docs/Introduction to E-Governance F-CSIT326 S.pdf
2025-05-17 01:10:54.712 | INFO     | __main__:process_documents:6 - Processing documen

## 2. Create Embeddings 🧠

In [11]:
# Creating the embedding function here

# Also tried the latest embedding model here :)
# embedding_fn = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-exp-03-07")
# "Resource has been exhausted": it's not free, so switching to a free one :(

embedding_fn = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [13]:
persist_directory = 'db'
os.makedirs(persist_directory, exist_ok=True)

# Creating the vector database and persisting it to disk
vectordb = Chroma.from_documents(documents, embedding=embedding_fn, persist_directory=persist_directory)

# To reopen an existing database instead of re-embedding the documents:
# vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding_fn)
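
The cell above rebuilds the store on every run, which re-embeds every chunk. One possible way to switch between the two `Chroma` calls automatically is sketched here. `load_or_create` is a hypothetical helper, and it assumes that a non-empty persist directory means a usable store already exists, which is not verified.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_create(persist_directory: str, load: Callable[[], T], create: Callable[[], T]) -> T:
    """Call load() if persist_directory already contains files, otherwise create()."""
    if os.path.isdir(persist_directory) and os.listdir(persist_directory):
        return load()
    os.makedirs(persist_directory, exist_ok=True)
    return create()


# Hypothetical usage with the notebook's own objects:
# vectordb = load_or_create(
#     persist_directory,
#     load=lambda: Chroma(persist_directory=persist_directory, embedding_function=embedding_fn),
#     create=lambda: Chroma.from_documents(
#         documents, embedding=embedding_fn, persist_directory=persist_directory
#     ),
# )
```

Passing the two constructors as callables keeps the helper independent of any particular vector store and avoids calling the embedding API when the store is only being reopened.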


### Creating the vector retriever

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 7})
retrieved_docs = retriever.invoke("What is benefit of Bitcoin?")
logger.debug(retrieved_docs[0])

2025-05-17 01:15:52.772 | DEBUG    | __main__:<cell line: 0>:3 - page_content='forms, including paper, hardware, and software ones. The user’s private key is used to 
sign transactions before they are broadcast to the network and validated by miners.
Because Bitcoin is decentralised, it is not governed by a bank or other centralised 
entity. Instead, it is kept up by a group of autonomous nodes connected by a network. 
As a result, censorship and manipulation cannot affect Bitcoin.
Cryptography is used by Bitcoin to safeguard transactions and regulate the 
issuance of new units of the currency. The system uses sophisticated mathematical 
techniques to ensure that transactions cannot be copied or faked, and transactions 
are signed with cryptographic keys that are specific to each user and are generated for 
each transaction.
Transaction fees, which are paid to miners to entice them to process the 
transaction, may apply to bitco
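
Conceptually, a similarity retriever with `k` set returns the `k` stored chunks whose embedding vectors are closest to the query's embedding. The toy sketch below ranks hand-written two-dimensional vectors by cosine similarity. It is an illustration only: the metric Chroma actually applies depends on how the collection is configured, and the vectors and names here are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, score) pairs with the highest similarity to query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up embeddings for three chunks.
toy_vectors = {
    "bitcoin chunk": [1.0, 0.0],
    "governance chunk": [0.0, 1.0],
    "cryptography chunk": [0.8, 0.6],
}
for name, score in top_k([1.0, 0.1], toy_vectors, k=2):
    print(f"{name}: {score:.3f}")
```

A larger `k` gives the language model more context to draw on but also more room for loosely related chunks to slip in.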

### Invoking the LLM to structure and return the response

In [21]:
logger.info("Initializing the Gemini LLM instance")
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0, max_tokens=500)

2025-05-17 01:21:27.232 | INFO     | __main__:<cell line: 0>:1 - Initializing the Gemini LLM instance


In [22]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "\n\n"
    "{context}"
    "\n\n"
    "Below are some examples showing a question and answer format:"
    """
    Question: The use of e-governance helps make all functions of the ____________ transparent.
              Question 1
              Answer a. retail.
              b. business.
              c. Both A & B.
              d. None of the above.

    Answer:  b. business.
                Because e-governance is a system that uses technology to improve the efficiency and transparency of government operations, making it easier for citizens to access information and services.


    Question: __________does not directly links to accountability.

              Question 2Answer
              a.
              Opaque.
              b.
              Transparency.
              c.
              Both A & B.
              d.
              None of the above.

    Answer:  a. Opaque.
                Because Opaque means not able to be seen through; not transparent. In the context of accountability, it suggests a lack of clarity or openness in processes or decisions, which does not directly link to accountability.



    Now, Answer the user question correctly given the example formats above:


    """
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
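
The prompt has two placeholders: `{context}`, which receives the retrieved chunks, and `{input}`, which receives the user's question. The plain-Python sketch below gives a rough picture of that substitution. In the notebook LangChain performs it; joining chunks with a blank line is an assumption about its default behaviour, and `stuff_context` is a hypothetical helper.

```python
def stuff_context(chunks: list[str], separator: str = "\n\n") -> str:
    """Join retrieved chunk texts into one block for the {context} slot."""
    return separator.join(chunk.strip() for chunk in chunks)


# Toy template with the same two placeholders as the notebook's prompt.
template = (
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {input}"
)

chunks = ["Bitcoin is decentralised.", "Transaction fees are paid to miners."]
filled = template.format(
    context=stuff_context(chunks),
    input="Who receives transaction fees?",
)
print(filled)
```

Because the template is filled by placeholder substitution, any literal curly braces added to the system prompt later would need to be doubled (`{{` and `}}`) so they are not mistaken for variables.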

In [23]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [19]:
response = rag_chain.invoke({"input": """
What is the advantage of Data Science?

Question 1Answer
a.
It is blurry

b.
Gives good salary

c.
A person can work on different approach

d.
It is very good defined
"""})
print(response["answer"])

Answer: c. A person can work on different approach.
Because data science is a multidisciplinary field that combines mathematics, statistics, artificial intelligence, and computer engineering to analyze vast volumes of data and derive useful insights for businesses. It enables the discovery of hidden patterns, the creation of prediction models, and better decision-making. Data science also helps in automating processes, constructing superior products, and evaluating opportunities.
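
Besides `"answer"`, the dictionary returned by the retrieval chain also carries the retrieved documents under `"context"`, which makes it possible to show where an answer came from. The sketch below assumes each document's metadata holds the `source` and `page` keys that the PDF loader records. `summarize_sources` and the `FakeDoc` stand-in are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document, used only for this example."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(context_docs) -> list[str]:
    """List each distinct source (and page index, if present) in retrieval order."""
    labels = []
    for doc in context_docs:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        label = source if page is None else f"{source} (page index {page})"
        if label not in labels:
            labels.append(label)
    return labels


docs = [
    FakeDoc("chunk one", {"source": "docs/a.pdf", "page": 12}),
    FakeDoc("chunk two", {"source": "docs/a.pdf", "page": 12}),
    FakeDoc("chunk three", {"source": "docs/b.pdf"}),
]
print(summarize_sources(docs))
# ['docs/a.pdf (page index 12)', 'docs/b.pdf']
```

In the notebook this would be called as `summarize_sources(response["context"])` after invoking the chain.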


## 3. Creating the UI

### 3.1. Using Gradio

In [24]:
import gradio as gr


def answer_question(question: str) -> str:
    response = rag_chain.invoke({"input": question})
    return response.get("answer", "No answer found")



app = gr.Interface(
    fn=answer_question,
    inputs=gr.Textbox(lines=10, label="Enter your question"),
    outputs=gr.Textbox(label="Answer"),
    title="RAG Question Answering"
)

app.launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://f23284466eb85b62bf.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
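
Since the app is reachable through a public link, it can help to guard the callback against empty input and against errors raised by the chain (for example an exhausted API quota). The following is one possible wrapper, not the notebook's implementation. `make_safe_answer_fn` is a hypothetical helper that accepts any callable shaped like `rag_chain.invoke`.

```python
from typing import Callable


def make_safe_answer_fn(invoke: Callable[[dict], dict]) -> Callable[[str], str]:
    """Wrap a chain's invoke function so the UI always receives a string."""

    def answer(question: str) -> str:
        if not question or not question.strip():
            return "Please enter a question."
        try:
            response = invoke({"input": question.strip()})
        except Exception as exc:
            # Show a short message in the UI instead of a stack trace.
            return f"Could not answer the question: {exc}"
        return response.get("answer", "No answer found")

    return answer


# Hypothetical usage with the notebook's chain:
# answer_question = make_safe_answer_fn(rag_chain.invoke)
```

Taking the invoke function as a parameter also makes the callback easy to exercise with a fake chain before wiring it into `gr.Interface`.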


