<a href="https://colab.research.google.com/github/Indranil-R/rag-maester/blob/master/rag_maester.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG Maester
**Your AI Scholar**

Welcome to **RAG Maester**, an AI assistant for academic work.
It uses **Retrieval-Augmented Generation (RAG)**: it searches its knowledge base for relevant passages and grounds its responses in them, to help with university assignments and tasks.


In [3]:
import os
import requests

In [4]:
# Downloading the required modules
if os.path.isfile("requirements.txt"):
  print("Requirements.txt already exists. Downloading modules...")
else:
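  print("Requirements.txt doesn't exist. Downloading from GitHub...")
  url = 'https://raw.githubusercontent.com/Indranil-R/rag-maester/refs/heads/master/requirements.txt'
  response = requests.get(url, timeout=30)
  response.raise_for_status()  # Stop here rather than saving an error page as requirements.txt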

  with open('requirements.txt', 'w', encoding='utf-8') as file:
    file.write(response.text)
  print("File downloaded successfully.")

# !pip install -q -r requirements.txt  # Uncomment only if the dependencies are not installed yet

Requirements.txt already exists. Downloading modules...


## Importing all required third-party libraries

---



In [60]:
from google.colab import userdata
from loguru import logger
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI


In [35]:
# Setting up Google API key
# os.environ only accepts strings, so check the secret before assigning it
api_key = userdata.get('GOOGLE_API_KEY')

if api_key:
  os.environ["GOOGLE_API_KEY"] = api_key
else:
  logger.error("Google API key is not set properly")
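
`google.colab.userdata` is only available inside Colab. For local runs, a fallback along these lines might work. This is a minimal sketch, assuming an already exported environment variable should take priority and that prompting with `getpass` is acceptable; `resolve_api_key` and its parameters are hypothetical names, not part of the notebook.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_api_key(
    name: str = "GOOGLE_API_KEY",
    env: Optional[Mapping[str, str]] = None,
    prompt_fn: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, else ask the user for it.

    Hypothetical helper: `env` and `prompt_fn` are injectable so the
    function can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        # Nothing exported: fall back to an interactive prompt
        key = prompt_fn(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} was not provided")
    return key
```

Raising on an empty key fails early, instead of letting the embedding or chat client fail later with a less obvious authentication error.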

## 1. Upload and Ingest Documents 📄

### Scan the docs directory for all available documents

In [7]:
# Fetch all file paths from a directory

def fetch_all_docs(docs_path: str) -> list[str]:
    docs_list = []
    if not os.path.isdir(docs_path):
        print(f"Warning: The path '{docs_path}' is not a valid directory or does not exist.")
        return []
    try:
        for item_name in os.listdir(docs_path):
            item_full_path = os.path.join(docs_path, item_name)
            if os.path.isfile(item_full_path):
                docs_list.append(item_full_path)
    except OSError as e:
        logger.error(f"Error accessing or reading directory '{docs_path}': {e}")
        return []
    return docs_list
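
Every path returned here is later handed to `PyPDFLoader`, so it may be worth collecting only PDF files. A minimal sketch, assuming a case-insensitive `.pdf` suffix is a good enough filter; `fetch_pdf_docs` is a hypothetical variant, not part of the notebook.

```python
from pathlib import Path


def fetch_pdf_docs(docs_path: str) -> list[str]:
    """Return sorted paths of the PDF files directly inside docs_path.

    Hypothetical variant of fetch_all_docs: it skips subdirectories and
    any file whose suffix is not .pdf (compared case-insensitively).
    """
    root = Path(docs_path)
    if not root.is_dir():
        return []
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Sorting the result keeps the processing order stable between runs, which `os.listdir` does not guarantee.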

In [36]:
# Fetching all documents from the docs directory
documents_list = fetch_all_docs(os.path.join(os.getcwd(), "docs"))

logger.info(f"Total number of documents found: {len(documents_list)}")

2025-05-16 22:31:17.248 | INFO     | __main__:<cell line: 0>:4 - Total number of documents found: 1


### Split the documents into smaller chunks

In [10]:
# Clean text by removing predefined phrases

def clean_text(text):
    removal_phrases = [
        "(c) Amity University Online",
        "Notes",
        "Amity Directorate of Distance & Online Education",
        "Introduction to E-Governance"
    ]
    for phrase in removal_phrases:
        text = text.replace(phrase, "")
    return text.strip()
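
Because `str.replace` removes a phrase wherever it occurs, a short phrase such as "Notes" is also stripped from ordinary sentences. A stricter variant could drop only lines that consist entirely of a boilerplate phrase. This is a sketch, assuming the PDF extraction puts each header or footer on its own line (not verified for this document); `clean_text_lines` and `BOILERPLATE_LINES` are hypothetical names.

```python
BOILERPLATE_LINES = {
    "(c) Amity University Online",
    "Notes",
    "Amity Directorate of Distance & Online Education",
    "Introduction to E-Governance",
}


def clean_text_lines(text: str) -> str:
    """Drop lines that consist only of a known boilerplate phrase."""
    kept = [
        line for line in text.splitlines()
        if line.strip() not in BOILERPLATE_LINES
    ]
    return "\n".join(kept).strip()


sample = "Notes\nNotes on e-governance policy\n(c) Amity University Online"
print(clean_text_lines(sample))  # prints: Notes on e-governance policy
```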


In [11]:
# Load a PDF from the 6th page onward, clean, and split into chunks

def load_and_split_pdf(doc_path):
    loader = PyPDFLoader(file_path=doc_path, mode="page")
    all_pages = loader.load()
    relevant_pages = all_pages[5:]
    for page in relevant_pages:
        page.page_content = clean_text(page.page_content)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=250,
        separators=["\n\n", "\n", ".", " "],
    )
    return text_splitter.split_documents(relevant_pages)
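
With `chunk_size=1500` and `chunk_overlap=250`, consecutive chunks may share up to 250 characters, so a sentence cut at a boundary still appears whole in one of them. The effect of the overlap can be illustrated with a simplified fixed-window splitter. This is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally prefers to break at the listed separators, so its real chunk boundaries will differ. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```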


In [12]:
# Process multiple PDF documents into cleaned, chunked outputs

def process_documents(documents_path_list: list[str]) -> list:
    all_processed_chunks = []
    for doc_path in documents_path_list:
        print(f"Processing document: {doc_path}")
        try:
            single_doc_chunks = load_and_split_pdf(doc_path)
            if single_doc_chunks:
                all_processed_chunks.extend(single_doc_chunks)
                print(f"Successfully processed and extracted {len(single_doc_chunks)} chunks from {doc_path}")
            else:
                print(f"No relevant chunks extracted from {doc_path}.")
        except FileNotFoundError:
            print(f"Error: Document not found at {doc_path}. Skipping this document.")
        except Exception as e:
            print(f"Error processing document {doc_path}: {e}. Skipping this document.")
    return all_processed_chunks


In [40]:
documents = process_documents(documents_list)

Processing document: /content/docs/Introduction to E-Governance F-CSIT326 S.pdf
Successfully processed and extracted 978 chunks from /content/docs/Introduction to E-Governance F-CSIT326 S.pdf


## 2. Create Embeddings 🧠

In [44]:
# Creating the embedding function here

# Also tried the latest embedding model here :)
# embedding_fn = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-exp-03-07")
# "Resource has been exhausted": it's not free, so switching to a free one :(

embedding_fn = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [45]:
vector_db = Chroma.from_documents(documents, embedding=embedding_fn)

In [47]:
logger.info("Embeddings created successfully")

2025-05-16 22:41:31.577 | INFO     | __main__:<cell line: 0>:1 - Embeddings created successfully


### Creating the vector retriever

In [59]:
retriever = vector_db.as_retriever(search_type="similarity", search_kwargs={"k": 5})
retrieved_docs = retriever.invoke("What is Digital Divide?")
logger.debug(retrieved_docs[0].page_content)

2025-05-16 22:46:10.455 | DEBUG    | __main__:<cell line: 0>:3 - the digital divide is “the difference between individuals, families, enterprises, and 
geographic regions with varying socio-economic levels in terms of their access to 
information and communication technologies (ICTs) and their usage of the internet for a 
wide range of activities.” 
It represents numerous variances between and between nations. Singh adds to 
this description, “[it] is not only about those who have access and those who do not; it is 
not only about the haves and the have-nots. It is about people becoming knowers and 
non-knowers, doers and non-doers, communicators with the rest of the world and non-
communicators.” The digital divide is viewed from two distinct theoretical approaches.   
The technological diffusion normalisation model predicts that, although 
technological expansion may be gradual at first, it will eventually follow a normalisati
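
A retriever with `search_type="similarity"` and `k=5` returns the five chunks whose embeddings lie closest to the query embedding. The ranking step can be sketched with toy 3-dimensional vectors. Cosine similarity is an assumption here (the metric Chroma actually applies depends on how the collection is configured), real vectors come from the embedding model, and all names below are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Toy index: (chunk text, made-up embedding)
toy_index = [
    ("digital divide definition", [0.9, 0.1, 0.0]),
    ("e-governance models", [0.1, 0.9, 0.0]),
    ("ICT access gaps", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
# ['digital divide definition', 'ICT access gaps']
```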

### Invoking the LLM to structure and return the response

In [61]:
logger.info("Initializing the Gemini LLM instance")
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.2, max_tokens=500)

2025-05-16 22:51:14.196 | INFO     | __main__:<cell line: 0>:1 - Initializing the Gemini LLM instance


In [62]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "\n\n"
    "{context}"
    "\n\n"
    "Below are some examples showing a question and answer format:"
    """
    Question: The use of e-governance helps make all functions of the ____________ transparent.
              a. retail.
              b. business.
              c. Both A & B.
              d. None of the above.

    Answer:  b. business.


    Question: __________ does not directly link to accountability.
              a. Opaque.
              b. Transparency.
              c. Both A & B.
              d. None of the above.

    Answer:  a. Opaque.


    Now, answer the user question correctly given the example formats above:
    """
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [63]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
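
Conceptually, the "stuff" strategy joins the retrieved chunks into one string and substitutes it for `{context}` in the system prompt before the model is called. A rough plain-Python sketch of that idea follows. It uses hypothetical names and a shortened template, and it is not LangChain's implementation.

```python
def stuff_context(chunks: list[str], separator: str = "\n\n") -> str:
    """Join retrieved chunk texts into one context string."""
    return separator.join(chunk.strip() for chunk in chunks)


def build_messages(template: str, chunks: list[str], question: str) -> list[tuple[str, str]]:
    """Fill the {context} slot and pair the result with the user's question."""
    system_text = template.format(context=stuff_context(chunks))
    return [("system", system_text), ("human", question)]


# Toy inputs, made up for illustration only
toy_template = "Use the following context to answer the question.\n\n{context}"
messages = build_messages(
    toy_template,
    ["E-governance improves transparency. ", " The digital divide limits access."],
    "What is Digital Divide?",
)
print(messages[0][1])
```

Since every retrieved chunk is placed in a single prompt, `k` and `chunk_size` together determine how much of the model's context window the retrieval step consumes.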

In [66]:
response = rag_chain.invoke({"input": """
The use of e-governance helps make all functions of the ____________ transparent.

a. retail
b. business
c. Both A & B
d. None of the above
"""})
print(response["answer"])

The answer is (b) Transparency. Transparency and the rule of law are essential for accountability. E-Government also makes government information more readily available to the public, increasing transparency.
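
To check which chunks an answer was based on, the retrieved documents can be inspected alongside it. This is a minimal sketch, assuming the dict returned by `rag_chain.invoke` carries the retrieved documents under a `"context"` key and that each document's metadata has `source` and `page` entries, as `PyPDFLoader` normally sets; `summarize_sources` is a hypothetical helper.

```python
def summarize_sources(response: dict, preview_chars: int = 80) -> list[str]:
    """Build one summary line per retrieved document in response['context']."""
    lines = []
    for doc in response.get("context", []):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        # Collapse whitespace so each preview fits on a single line
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{source} (page {page}): {preview}")
    return lines


# In the notebook, after rag_chain.invoke(...):
# for line in summarize_sources(response):
#     print(line)
```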


## 3. Creating the UI

### 3.1. Using Streamlit