## Loading the PDF directory

In [1]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

In [2]:
def load_documents():
    document_loader = PyPDFDirectoryLoader("Data")
    return document_loader.load()

In [3]:
documents = load_documents()            ## list of tuples, where tuple contains page_content & meta_data
print(documents[25].page_content)

MS-R.2.1 Minimum Qualifications for Admission to the M.S programme100
Candidates applying for the M.S programme in one of the following areas need to have any one of101
the minimum qualifications mentioned in the table below.102
Area Minimum Qualifications
Educational Qualifications Additional Qualifications
Engineering
B.E/ B.Tech/ 4 year online / any recog-
nised 4 year B.sc/ 4 year BS of IITs/
CFTIs /UGC or Master’s degree in
a relevant discipline, or equivalent.
301st Senate Res. No 5/2023
or
Associate Membership of the follow-
ing professional bodies of the discipline,
provided they have passed parts A and B
of the membership examinations: The
Institution of Engineers (India)(Civil,
Mechanical, Electrical and Electronics,
Electronics and Communications), The
Aeronautical Society of India, The In-
dian Institute of Metals, The Indian In-
stitute of Chemical Engineers, The In-
stitute of Electronics & Telecommunica-
tion Engineering and other professional
bodies approved by the Sena

## Splitting the pages into smaller Chunks

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

In [5]:
def split_documents(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len,              # Decides how the length of the chunk is calculated. len : character count; tiktoken: token based; lambda x : x.split(): word based
        is_separator_regex=False            # By default, the splitter uses a list of preferred string separators, eg. ["\n\n", "\n", " ", ""]. When false the separaters are treated as a plain string but when true split is done based on complex splitting logic.
    )
    return text_splitter.split_documents(documents)

## Normal seperator & Regex(Regular Expression) separators
## Normal separators are like saying: “Split the text wherever you see this exact substring. Example: split on "--" means you only cut when you see two hyphens next to each other.
## Regex separators are like saying: “Split the text wherever a pattern matches,” which can describe many possibilities compactly. Example: split on \s+ means “any run of whitespace (spaces, tabs, newlines)” — not a fixed string, but a pattern.

In [6]:
documents = load_documents()
chunks = split_documents(documents)             ## List of lists
print(chunks[10])

page_content='However, in the case of service officers under the control of Army / Navy / Airforce / DRDO, 
the selection will be through a central selection committee(s) with the Institute faculty serving 
on the selection committee. 
 
R.1.9 Vacancies, if required to be filled up after the admission date, will be decided by the 
Chairman, Senate, and reported to the Senate for post-facto approval. 
 
R.1.10 In all matters concerning the selection of candidates, the decision of the Chairman, Senate, or 
his / her nominee, viz. Chairman, M.Tech Admissions Committee, is final. 
 
R1.11 In addition to satisfying the conditions given in the information Brochure for M.Tech Admission 
sent along with the application forms, the selected candidates should satisfy the other 
admission requirements indicated in the offer letter of admission. Only then, they will be   
3' metadata={'producer': 'convertonlinefree.com', 'creator': 'convertonlinefree.com', 'creationdate': '2016-04-11T11:36:51+00:00

## Custom Indexing of the chunks

In [7]:
## This function is used to assign unique and tracable id to each chunk.

def define_chunk_ids(chunks):
    last_page_id = None
    current_chunk_index = 0
    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page_label")
        current_page_id = f"{source}:{page}"

        # Increment the chunk index for every chunk, regardless of the page
        current_chunk_index += 1

        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id
        chunk.metadata["id"] = chunk_id
    return chunks

### Create Embedding Function (illustrative)

#### To create the database and to extract data by querying the database.

This is saved as a python function in .py file for reuse. Here, we use `OllamaEmbeddings`

In [20]:
from langchain_huggingface import HuggingFaceEmbeddings
def get_embedding_function():
    embeddings = HuggingFaceEmbeddings(
        model_name = "sentence-transformers/all-miniLM-L6-v2",
        # trust_remote_code=True
    )
    return embeddings

In [9]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Creating the vector Database and Enabling Auto-addition of a new file

When a new file is added to the "data" directory, the program will detect this based on the index and add them without complete updation.

In [10]:
# from get_embedding_function import get_embedding_function
from langchain_chroma.vectorstores import Chroma

In [11]:
##  This function helps in adding document chunks to the chroma vector database
import shutil
CHROMA_PATH = "chroma_new"                                     # the directory where the Chroma database is stored or will be created.
def add_to_chromadb(chunks: list[Document]):
    db = Chroma(
        collection_name= "chunks", persist_directory=CHROMA_PATH, embedding_function=get_embedding_function()
    )

    chunks_with_ids = define_chunk_ids(chunks)

    existing_chunks = db.get(include =[])                   # nothing in include means documents, metadata & embedding won't be loaded but only ids will be loaded in vector store               
    existing_ids = set(existing_chunks["ids"])              # list of ids are converted to set for fast lookups (not the chunk id but the id that is created by default)
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    new_chunks = []
    for chunk in chunks_with_ids:
        if chunk.metadata["id"] not in existing_ids:
            new_chunks.append(chunk)                    ## list of chunks which will contain page_content & metadata
        
    if len(new_chunks):
        print(f"New {len(new_chunks)} documents added to the DB")
        new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
        shutil.rmtree('./chroma_db', ignore_errors=True)# List of chunk ids that  we have created
        db.add_documents(new_chunks, ids = new_chunk_ids)
        # db.persist()
        print("chunk embedded!")
    else:
        print("No documents to add!")



In [12]:
import argparse

In [13]:
parser = argparse.ArgumentParser()
parser.add_argument("--reset", action="store_true", help="Reset the database.")
args = parser.parse_known_args()
# if args.reset:
#     print("Clearing Database")
#     clear_database()
add_to_chromadb(chunks)

  from .autonotebook import tqdm as notebook_tqdm


Number of existing documents in DB: 515
No documents to add!


In [14]:
from langchain_chroma.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_ollama.llms import OllamaLLM

from get_embedding_function import get_embedding_function

In [15]:
CHROMA_PATH = "chroma_new"

sys_instructions = SystemMessagePromptTemplate.from_template(
    """You are an academic assistant chatbot for a university. 
You specialize in answering questions about the Ordinances and Regulations related to M.Tech, MS, and PhD programs. 

Your job is to:
- Provide **accurate**, **clear**, and **concise** responses using the information available in the university's official ordinance documents.
- **Stick strictly to the content** in the provided documents. If the answer is not found, say: "I'm sorry, that information isn't available in the current document."
- Explain terms in simple, student-friendly language when necessary.
- When questions are ambiguous, **ask for clarification** instead of guessing.
- Always maintain a **formal and helpful** tone.

The document includes topics like:
- Course structure and credits
- Registration and thesis submission rules
- Evaluation procedures and grading
- Leaves and attendance
- Program duration and extension policies
- Comprehensive exam and academic misconduct policies

You are not allowed to provide speculative advice or answer beyond what is present in the document.
Question: 
{question}
"""
)

rag_context = HumanMessagePromptTemplate.from_template("Answer the question based on the following context: {context} Question: {question}")


In [16]:
chat_prompt = ChatPromptTemplate.from_messages([sys_instructions, rag_context])
print(chat_prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='You are an academic assistant chatbot for a university. \nYou specialize in answering questions about the Ordinances and Regulations related to M.Tech, MS, and PhD programs. \n\nYour job is to:\n- Provide **accurate**, **clear**, and **concise** responses using the information available in the university\'s official ordinance documents.\n- **Stick strictly to the content** in the provided documents. If the answer is not found, say: "I\'m sorry, that information isn\'t available in the current document."\n- Explain terms in simple, student-friendly language when necessary.\n- When questions are ambiguous, **ask for clarification** instead of guessing.\n- Always maintain a **formal and helpful** tone.\n\nThe document includes topics like:\n- Course structure and credits\n- Regi

In [21]:
embedding_function = get_embedding_function()
data_base = Chroma(collection_name="chunks", persist_directory=CHROMA_PATH, embedding_function=embedding_function)

def query_rag(query_text: str):
    # Prepare the DB.
    embedding_function = get_embedding_function()
    db = data_base

    # Search the DB.
    results = db.similarity_search_with_score(query_text, k=5)
    # print(results)
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    # prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    # prompt = prompt_template.format(context=context_text, question=query_text)
    prompt = chat_prompt.format(context=context_text, question=query_text)
    # print(prompt)

    # Use OllamaLLM for generating the response text
    model = OllamaLLM(model="mistral")
    # model = OllamaLLM(model="gemma3:1b")

    response_text = model.invoke(prompt)

    sources = [doc.metadata.get("id", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    # formatted_response = f"Response: {response_text}"
    # print(formatted_response)
    return formatted_response

In [22]:
# generated with mistral
query_text= input("User: ")
print(query_text)

what is leave policy for PhDs


In [25]:
# generated with mistral
response= query_rag(query_text)
print("Response: ", response)

  return forward_call(*args, **kwargs)


Response:  Response:  Based on the provided document, the leave policy for PhD students at your university is as follows:

1. PhD students should apply to the Head of the Department for leave stating the reasons whenever they are not able to attend classes or project work. (R.15.1)
2. PhD students are eligible for 8 days of casual leave and 15 days of vacation leave per academic year. (R.15.2)
3. The unutilized leave from the first year cannot be carried over to the second year. (Not specified in a rule number but implied from the statement about long leave not being availed in the month of June following the second semester, as project work has to commence on that date.)
4. Medical leaves can be considered by the Dean(AR) to extend the registration period of the program for up to a maximum period of one year, provided it is duly certified by the Institute Hospital. (R.15.0)
Sources: ['Data\\m.tech-2015.pdf:12:40', 'Data\\PhD_Ordinance_updated-01-04-2024.pdf:12:260', 'Data\\m.tech-2015

In [26]:
# generated with mistral
query_text= input("User: ")
print(query_text)

What is the admission criteria for Phd and Mtech


In [27]:
# generated with mistral
response = query_rag(query_text)
print(response)

  return forward_call(*args, **kwargs)


Response:  Based on the provided document, here are the admission criteria for both PhD and M.Tech programs:

For **PhD** Program:
There are three modes of admission: Regular Ph.D, Direct Admission to the M.S+Ph.D Programme, and Upgraded Ph.D.

1. **Regular Ph.D**: Candidates applying for this program in engineering need to have any one of the following minimum qualifications:
   - M.E/M.Tech/M.S by Research in Engineering/5-year integrated Masters/Dual Degree in engineering.
   - 2 year M.Sc from IITs (entry through JAM) with a CGPA of ≥ 8.
   - B.S+M.S (5-year integrated) from CFTI with a CGPA of ≥ 8.

2. **Direct Admission to the M.S+Ph.D Programme**: The minimum qualifications for this mode are not specified in the document provided, but it is mentioned that they should be in relevant areas/disciplines as provided by the respective departments.

3. **Upgraded Ph.D**: Candidates registered for M.S/MTech/MSc at IITM are eligible for upgradation to the Ph.D program if they satisfy cer

In [28]:
# generated with mistral
query_text= input("User: ")
print(query_text)
response = query_rag(query_text)
print(response)

Admission Criteria for Btech


  return forward_call(*args, **kwargs)


Response:  Based on the provided document, the information about admission criteria for a Bachelor of Technology (B.Tech) program is not explicitly stated. However, the document does provide information related to the admission criteria for the M.S+Ph.D program, which includes external candidates with a proven research record and top 10% students from other institutions that have a specific MoU with IITM. These students can apply for direct admission to the M.S+Ph.D program in their 4th year, and the credits earned during the first year of this program will have equivalence to the 4th year of the B.Tech in their parent institution. The scholars are eligible for HTRA (Higher Research Assistance) for 5 years after completing their first year successfully at IITM and qualifying in GATE or without GATE for students from CFTIs with a CGPA ≥ 8 on a 10.0 point scale.

For further clarification, it would be best to consult the official ordinance document regarding the B.Tech admission criteria

In [29]:
query_text= input("User: ")
print(query_text)
response = query_rag(query_text)
print(response)

How long is the PhD degree


  return forward_call(*args, **kwargs)


Response:  Based on the information provided in the document, the minimum duration for a regular Ph.D program is 2 years and the maximum duration is 5 years from the date of registration to the date of submission of the thesis for full-time research scholars. However, the Dean's Committee (DC) may grant an extension of up to 2 more years to submit the thesis. Additionally, an additional year may be allowed for scholars in certain categories such as QIP, external, part-time, and staff. It is important to note that this timeline is indicative and specific time frames for different categories can be found in the respective sections of the document.
Sources: ['Data\\PhD_Ordinance_updated-01-04-2024.pdf:19:295', 'Data\\PhD_Ordinance_updated-01-04-2024.pdf:19:190', 'Data\\PhD_Ordinance_updated-01-04-2024.pdf:2:193', 'Data\\PhD_Ordinance_updated-01-04-2024.pdf:9:245', 'Data\\PhD_Ordinance_updated-01-04-2024.pdf:13:168']


In [30]:
# import string
# import difflib

# def normalize(text):
#     # Lowercase, strip punctuation, and collapse whitespace
#     text = text.lower().strip()
#     text = text.translate(str.maketrans('', '', string.punctuation))
#     return ' '.join(text.split())

# def evaluate_retrieval_metrics(retriever, queries, ground_truths, k=5):
#     """Calculate Recall@k and MRR for a set of queries."""
#     assert len(queries) == len(ground_truths), "Mismatch in query and ground truth counts"

#     recall_total = 0
#     reciprocal_ranks = []

#     for query, truth in zip(queries, ground_truths):
#         results = retriever.similarity_search_with_score(query, k=k)
#         retrieved_texts = [doc.page_content for doc, score in results]

#         normalized_truth = normalize(truth)
#         found = False

#         for rank, text in enumerate(retrieved_texts, 1):
#             if normalized_truth in normalize(text):
#                 recall_total += 1
#                 reciprocal_ranks.append(1 / rank)
#                 found = True
#                 break

#         if not found:
#             reciprocal_ranks.append(0)

#     recall_at_k = recall_total / len(queries)
#     mrr = sum(reciprocal_ranks) / len(queries)

#     print(f"Recall@{k}: {recall_at_k:.4f}")
#     print(f"MRR: {mrr:.4f}")

# def evaluate_retrieval_metrics(retriever, queries, ground_truths, k=5, threshold=0.5):
#     assert len(queries) == len(ground_truths), "Mismatch in query and ground truth counts"

#     recall_total = 0
#     reciprocal_ranks = []

#     for query, truth in zip(queries, ground_truths):
#         results = retriever.similarity_search_with_score(query, k=k)
#         normalized_truth = normalize(truth)

#         found = False

#         for rank, (doc, score) in enumerate(results, 1):
#             retrieved_text = normalize(doc.page_content)
#             similarity = difflib.SequenceMatcher(None, normalized_truth, retrieved_text).ratio()

#             if similarity > threshold:
#                 recall_total += 1
#                 reciprocal_ranks.append(1 / rank)
#                 found = True
#                 break

#         if not found:
#             reciprocal_ranks.append(0)

#     recall_at_k = recall_total / len(queries)
#     mrr = sum(reciprocal_ranks) / len(queries)

#     print(f"Recall@{k}: {recall_at_k:.4f}")
#     print(f"MRR: {mrr:.4f}")


from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm  # for progress bar

# Load sentence embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # small and fast

def semantic_similarity(text1, text2):
    """Compute cosine similarity between two texts using embeddings."""
    emb1 = embedding_model.encode(text1, convert_to_tensor=True)
    emb2 = embedding_model.encode(text2, convert_to_tensor=True)
    return util.pytorch_cos_sim(emb1, emb2).item()  # returns a float

def evaluate_mme(retriever, queries, ground_truths, k=5, threshold=0.5):
    """
    Evaluate retrieval using semantic similarity.
    Computes Recall@k and MRR.
    """
    assert len(queries) == len(ground_truths), "Mismatch in query and ground truth lengths"

    recall_total = 0
    reciprocal_ranks = []

    for query, ground_truth in tqdm(zip(queries, ground_truths), total=len(queries), desc="Evaluating"):
        results = retriever.similarity_search_with_score(query, k=k)

        found = False

        for rank, (doc, _) in enumerate(results, 1):
            similarity = semantic_similarity(ground_truth, doc.page_content)

            if similarity >= threshold:
                recall_total += 1
                reciprocal_ranks.append(1 / rank)
                found = True
                break

        if not found:
            reciprocal_ranks.append(0)

    recall_at_k = recall_total / len(queries)
    mrr = sum(reciprocal_ranks) / len(queries)

    print(f"\nMME Evaluation:")
    print(f"Recall@{k}: {recall_at_k:.4f}")
    print(f"MRR: {mrr:.4f}")

In [32]:
queries = [
    "What is the upgradation criteria from Mtech to PhD",
    "What is the criteria for admission in MS",
    "how many days of leave can PhD take in a year",
    "What is the attendance criteria to sit in the exam for Mtech students"
]

ground_truths = [
    "completed four courses during the first semester and obtained a CGPA ≥ 8.1",
    "should possess a B.E/B.Tech degree or its equivalent from a recognized institute",
    "Based on the provided document, PhD students are eligible for 8 days of casual leave and 15 days of vacation leave per academic year",
    "Based on the provided document, M.Tech students who have less than 85% attendance in a course are not permitted to sit for the end-semester exam without the permission of the Dean Academic Courses. This criterion is specified in R.14.3. It's important to note that this rule applies only to end-semester examinations, and students who have missed sessional assessments for valid reasons can apply for a makeup examination (R.20.1). If you have any further questions or need clarification on specific terms, feel free to ask"
]

evaluate_mme(retriever=data_base, queries=queries, ground_truths=ground_truths, k=5)


Evaluating: 100%|██████████| 4/4 [00:00<00:00,  8.50it/s]


MME Evaluation:
Recall@5: 0.7500
MRR: 0.7500





In [33]:
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain

# Create a windowed memory (last 5 turns)
memory = ConversationBufferWindowMemory(
    k=5,
    memory_key="chat_history",
    return_messages=True
)

# Create a conversational RAG chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=OllamaLLM(model="mistral"),
    retriever=data_base.as_retriever(),
    memory=memory,
    verbose=False
)

print("Chatbot with memory ready! Type 'exit' to quit.\n")
while True:
    query = input("You: ")
    if query.lower() in ["exit", "quit", "stop"]:
        print("Goodbye!")
        break
    response = qa_chain.invoke({"question": query})
    print("User", query)
    print("Bot:", response['answer'])


  memory = ConversationBufferWindowMemory(


Chatbot with memory ready! Type 'exit' to quit.



  return forward_call(*args, **kwargs)


User what are the duty that a PhD student perform for the stipend
Bot:  Based on the provided context, it is not explicitly stated what specific duties a Ph.D. student must perform to receive their stipend. However, some general responsibilities can be inferred from the regulations. For instance, the student is required to enroll every semester (PhD-R.9 Enrollment), observe disciplined and decorous behavior (PhD-R.23 Discipline), and follow the regulations as outlined in the rules (PhD-R.24 Power to Modify). Additionally, Ph.D. candidates employed in government R&D organizations, public-sector undertakings, or DST(DSIR) approved organizations with at least 2 years of relevant experience are required to submit a "No Objection Certificate" and a commitment letter from their parent organization (PhD-R.7.284). These requirements suggest that the student should be actively engaged in their research and adhere to certain standards of conduct and employment status to maintain their stipend.


  return forward_call(*args, **kwargs)


User stop
Bot:  Based on the provided context, there isn't a specific rule given regarding when a PhD student would cease receiving their stipend. However, the registration of a research scholar whose progress is not found to be satisfactory by the DC (PhD-R.19) or who has not submitted his/her thesis before the end of the maximum permissible period (PhD-R.16) will be canceled, which might imply that the stipend could be affected due to these reasons. But without explicit information about the stipend and its associated rules, it's hard to provide a definitive answer.


  return forward_call(*args, **kwargs)


User stop
Bot:  Based on the provided context, there is no explicit information about the conditions under which a PhD student's stipend might be terminated. The text focuses more on the rules related to registration cancellation and temporary withdrawal from the program. However, it is important to note that the cancellation of registration could potentially imply the termination of the stipend, as registration seems to be a prerequisite for receiving financial support. But without explicit information about the stipend and its terms, this answer can only be speculative.
Goodbye!
