<a href="https://colab.research.google.com/github/Spykabore15/students_RAG_project/blob/main/Simple_RAG_Students_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dependencies installation

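- langchain: main LLM orchestration framework
- langchain-community: community extensions and integrations (document loaders, Hugging Face embeddings, FAISS vector store connector)
- faiss-cpu: FAISS vector index (CPU build)
- sentence-transformers: Hugging Face models used to generate embeddings
- transformers, accelerate: loading and running the Hugging Face LLM
- pypdf: PDF parser to extract text from documents
- tqdm: progress bars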


In [None]:
!pip install -q \
  langchain \
  langchain-community==0.2.* \
  faiss-cpu \
  sentence-transformers \
  transformers \
  accelerate \
  pypdf==4.* \
  tqdm


### Step 2

Import the essential libraries for the RAG pipeline, including document ingestion, text splitting, embedding generation, and vector database management.

In [None]:
# Drive mounting
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Standard utilities
from pathlib import Path
import json
from typing import List

# LangChain components for document loading, splitting, embeddings, and the vector store
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# LangChain components for LLM orchestration
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

### Step 3
Load and Structure Core and Student Documents

In this step, we define utility functions to load and structure all the documents required for the RAG pipeline. This includes:

- Loading core reference PDFs such as projects, criteria, and mentors.
- Iterating through student folders to load individual reports, summaries, and metadata.
- Tagging each document with relevant metadata (like source, student, and category) to support context-aware retrieval later.

In [None]:
path ="/content/drive/MyDrive/TRAINING DATA"
DATA = Path(path)

In [None]:
# Core documents that serve as global reference material

CORE_PDFS = [
    DATA / "Projects.pdf",
    DATA / "criteria.pdf",
    DATA / "Mentors.pdf"
]

def load_core_pdfs() -> List[Document]:
  """
    Loads all core PDF documents (projects, criteria, mentors).
    Each page is converted into a LangChain Document with metadata attached.
  """

  docs = []
  for pdf in CORE_PDFS:
    if pdf.exists():
      # Load all pages from the PDF
      pages = PyPDFLoader(str(pdf)).load()
      # Add source metadata for traceability
      for p in pages:
        p.metadata.update({"source": pdf.name, "category": "core"})
      docs.extend(pages)
  # Return the pages of every core PDF, not only those of the last file
  return docs



In [None]:
def load_student_dirs() -> List[Document]:
  """
  Iterates through each student directory and loads:
    - report.pdf: main project report
    - summary.txt: short text summary (if it exists)
    - metadata.json: student metadata (if it exists)
  Each document is wrapped in a Document object with descriptive metadata.
  """
  docs = []
  students_dir = DATA / "students"
  if not students_dir.exists():
    return docs
  for student_dir in students_dir.iterdir():
    if not student_dir.is_dir():
      continue

    # Load student report
    report_pdf = student_dir / "report.pdf"
    if report_pdf.exists():
      pages = PyPDFLoader(str(report_pdf)).load()
      for p in pages:
        p.metadata.update({
            "source": f"{student_dir.name}/report.pdf",
            "student": student_dir.name,
            "category": "student_report"
        })
      docs.extend(pages)

    # Load student summary (TXT)
    summary_txt = student_dir / "summary.txt"
    if summary_txt.exists():
      tdocs = TextLoader(str(summary_txt), encoding= "utf-8").load()
      for d in tdocs:
        d.metadata.update({
          "source": f"{student_dir.name}/summary.txt",
          "student": student_dir.name,
          "category": "student_summary"
        })
      docs.extend(tdocs)

    # Load student metadata (JSON)
    meta_json = student_dir / "metadata.json"
    if meta_json.exists():
      try:
        meta = json.loads(meta_json.read_text(encoding="utf-8"))
        meta_doc = Document(
            page_content = json.dumps(meta, ensure_ascii=False, indent=2),
            metadata={
                "source": f"{student_dir.name}/metadata.json",
                "student": student_dir.name,
                "category": "student_metadata"
            }
        )
        docs.append(meta_doc)
      except Exception as e:
        print(f"Couldn't parse {meta_json}: {e}")

  return docs

In [None]:
# Combine all loaded documents from core and student sources
raw_docs = load_core_pdfs() + load_student_dirs()

# Display how many total document objects were loaded
print(f"Loaded {len(raw_docs)} raw documents (pages + text).")


Loaded 132 raw documents (pages + text).
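
A quick way to confirm that the tagging worked is to count the loaded documents per category value. The sketch below is a minimal illustration that uses plain dictionaries as stand-ins for each document's metadata; the helper name count_by_category and the sample entries are hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_by_category(metadatas):
    """Count documents per 'category' tag; untagged entries fall under 'unknown'."""
    return Counter(m.get("category", "unknown") for m in metadatas)

# Stand-in metadata dictionaries (hypothetical sample, not the real dataset)
sample = [
    {"source": "Mentors.pdf", "category": "core"},
    {"source": "Kenji/report.pdf", "student": "Kenji", "category": "student_report"},
    {"source": "Kenji/metadata.json", "student": "Kenji", "category": "student_metadata"},
    {"source": "Ahmed/report.pdf", "student": "Ahmed", "category": "student_report"},
]

counts = count_by_category(sample)
print(dict(counts))
```

With the notebook's own variables, the equivalent call would be count_by_category(d.metadata for d in raw_docs).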


### Step 4
Split Documents into Manageable Chunks

In this step, we use a recursive character text splitter to break down long documents into smaller, overlapping chunks. This ensures that each chunk is within token limits and still preserves contextual continuity for embeddings and retrieval.

- chunk_size=1000: Maximum number of characters in a single chunk.
- chunk_overlap=150: Overlap between consecutive chunks to maintain context across boundaries.
- separators: Defines the preferred order of splitting (paragraphs → lines → words → characters).

Finally, we print the number of generated chunks and preview one sample with its metadata for verification.
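
To make the effect of chunk_size and chunk_overlap concrete, here is a simplified fixed-window splitter in plain Python. It is only an illustration of the overlap idea and is not the algorithm used by RecursiveCharacterTextSplitter, which also respects the separator hierarchy; the function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=150):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

# Synthetic text of 2500 characters
text = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(text)
print(len(pieces), [len(p) for p in pieces])
```

Each window starts 850 characters after the previous one (1000 minus 150), so a sentence cut at one boundary still appears whole in the neighbouring chunk.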

In [None]:
# Initialize the RecursiveCharacterTextSplitter with chunking parameters.
# It tries the larger separators first (paragraphs, then lines, words, and characters)
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 150,
    separators = ["\n\n", "\n", " ", ""], # Logical split hierarchy (paragraph > line > word > char)
)

# Apply the splitter to the combined raw documents
chunks = splitter.split_documents(raw_docs)

# Display how many text chunks were created
print(f"Created {len(chunks)} chunks.")


Created 283 chunks.


In [None]:
# Preview the first chunk to verify splitting and metadata tagging
print(chunks[0].page_content[:300], "...\n", chunks[0].metadata)



FACULTY MENTORS AND AREAS OF EXPERTISE 2024 COMPUTER SCIENCE DEPARTMENT: Dr. Sarah Chen - Artificial Intelligence in Healthcare - Computer Vision - Medical Image Processing Office: CS Building 301 Email: schen@university.edu Dr. Mic ...
 {'source': 'Mentors.pdf', 'page': 0, 'category': 'core'}
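
PDF text extraction can leave irregular whitespace, such as extra newlines and spaces between words, which uses up part of each chunk's character budget. The following is a minimal sketch of a cleanup step that could run before splitting; the helper name normalize_whitespace is hypothetical and the notebook itself does not apply this step.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

# Example of text where every word sits on its own line
raw = "FACULTY\n \nMENTORS\n \nAND\n \nAREAS"
print(normalize_whitespace(raw))
```

One trade-off to keep in mind: collapsing every newline also removes the paragraph breaks that the splitter's "\n\n" separator relies on, so this is best suited to documents whose line breaks carry no structure.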


### Step 6
Generate Embeddings and Store Them in FAISS

In this step, we initialize the embedding model and FAISS vector database, then index all text chunks for retrieval. This forms the foundation of the RAG system — enabling semantic search and contextual grounding.

- Embeddings: We use Hugging Face's all-MiniLM-L6-v2 (384-dim) to convert text chunks into dense numerical vectors.
- FAISS: An efficient vector similarity search library developed by Meta, designed for fast nearest-neighbor retrieval over dense embeddings without requiring external database infrastructure.
- LangChain Vector Store: Wraps FAISS to simplify storing and retrieving document embeddings.
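
Conceptually, nearest-neighbor retrieval compares the query vector with every stored vector and keeps the closest ones. The sketch below shows that idea as a brute-force search in plain Python with toy 2-dimensional vectors, assuming Euclidean (L2) distance as the metric. It is not FAISS itself, and the function names are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_l2(query, vectors, k=4):
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]

# Toy vectors standing in for 384-dimensional chunk embeddings
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
nearest = top_k_l2([0.0, 0.0], vectors, k=2)
print(nearest)
```

A dedicated library performs the same comparison with optimized index structures, which matters once the number of chunks grows beyond what a Python loop can handle comfortably.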

In [None]:
# Initialize HuggingFace embedding model

embeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

# Build a LangChain vector store wrapper on top of FAISS
# This allows seamless integration between LangChain documents and HuggingFace embeddings
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding= embeddings
)


### Step 7
Build and Run the RAG Question-Answering Chain

In this step, we connect the retriever, language model, and prompt into a working Retrieval-Augmented Generation (RAG) pipeline.

The retriever fetches the most relevant chunks from FAISS, and the LLM uses them as context to answer the user’s question.

Key Components:

- LLM: An open-source model (mistralai/Mistral-7B-Instruct-v0.2) is run locally through a Hugging Face pipeline, so inference has no API cost.
- Retriever: Fetches the top-k (here k=4) semantically closest chunks from the vector store.
- PromptTemplate: Guides the LLM to answer using only the retrieved context, which reduces hallucination.
- Chain: Combines retrieval, prompt formatting, and LLM inference into a single callable pipeline.

At the end, a sample question is asked, and both the final answer and the retrieved chunks are displayed to show how the model grounded its response.
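
The way retrieved chunks reach the model can be sketched without any framework: the top-k chunks are joined into one context string and inserted into the prompt next to the question. This is a minimal illustration under that assumption; the template text, the helper name build_prompt, and the sample chunks are hypothetical.

```python
PROMPT = (
    "Answer using only the provided context.\n"
    "If the answer is not in the context, say \"I do not know\".\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
)

def build_prompt(question, chunks, k=4):
    """Join the top-k retrieved chunks and insert them into the prompt."""
    context = "\n\n".join(chunks[:k])
    return PROMPT.format(context=context, question=question)

# Stand-in chunks, already ordered from most to least relevant
chunks = ["chunk-1", "chunk-2", "chunk-3", "chunk-4", "chunk-5"]
prompt_text = build_prompt("Who works on robots?", chunks)
print(prompt_text)
```

Only the first four chunks appear in the prompt, which mirrors the k=4 setting of the retriever.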

In [None]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

# Build the text-generation pipeline
pipe = pipeline(
    "text-generation",
    model = "mistralai/Mistral-7B-Instruct-v0.2",
    temperature=0.001,
    max_new_tokens=200,
    return_full_text=False
)
# Initialize the llm
llm = HuggingFacePipeline(pipeline=pipe)

# Create a retriever from the FAISS vector store
# It fetches the top 4 most relevant chunks per query
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4})


Device set to use cuda:0


### Next Step: Implement Query Decomposition for Smarter Retrieval

In this step, we create a Query Decomposition Chain, a preprocessing layer that breaks a complex user question into smaller, more specific factual sub-queries. This can improve retrieval accuracy in RAG pipelines by targeting multiple aspects of the question (like skills, domain, or experience) instead of relying on a single broad query.
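
Models often return sub-queries as a numbered or bulleted list, so the raw text needs to be turned into a clean Python list before each sub-query is sent to the retriever. The sketch below shows one way to do that with a regular expression; the helper name parse_subqueries and the sample output are hypothetical, and real model output may need additional cleanup.

```python
import re

def parse_subqueries(raw_output):
    """Split model output into sub-queries, dropping blank lines and
    leading list markers such as '1.', '2)', '-', '*', or '•'."""
    subqueries = []
    for line in raw_output.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-•*])\s*", "", line).strip()
        if cleaned:
            subqueries.append(cleaned)
    return subqueries

# Hypothetical model output, used only to exercise the parser
sample_output = (
    "1. Which students have robotics experience?\n"
    "2) Which students worked on education projects?\n"
    "\n"
    "• Which students list teaching skills?"
)
parsed = parse_subqueries(sample_output)
print(parsed)
```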

In [None]:
# Import the components needed to build the decomposition chain
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Prompt template that instructs the model to break a complex user query into multiple factual sub-queries
template = """You are a Query Decomposition Assistant.
Your role is to break the input question into 3-5 short factual sub-queries that will retrieve the specified attributes implied by the main question such as:
skills, experience, domain knowledge, field of work, or relevant achievements.

Rules:
• Identify the core entities and the implied requirements in the question.
• Each sub-query must target one required attribute (e.g., skill, domain, experience).
• No generic questions (e.g., “What is the focus of the project?”).
• No procedural or meta-queries (e.g., “What are the criteria?”).
• No answering the question.

Input: {question}

Output:"""

# Create a PromptTemplate from the defined template
prompt_decomposition = PromptTemplate(
    input_variables=["question"],
    template=template)

# Chain the prompt -> LLM -> output parser -> string-to-list conversion.
# No retriever is involved here: the prompt only takes the question.
chain_decomposition = (
    prompt_decomposition
    | llm
    | StrOutputParser()
    | (lambda text: [line.strip() for line in text.split("\n") if line.strip()])
)

# Example complex user question
question = "Provide me the names of two students for upcoming project that uses robot for teaching the students"

# Invoke the chain to generate the list of decomposed sub-queries
questions = chain_decomposition.invoke({"question": question})

print(questions)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### Next Step: Create a Composite Prompt for Answer Generation

In this step, we define a structured prompt template that combines three information layers:

- Main question -> The user's current query
- Background Q&A pairs -> Any previously answered or related questions
- Context -> Retrieved chunks or supporting documents from the vector store

This composite prompt ensures the LLM has access to relevant background knowledge and contextual grounding before generating an answer. It is a key element in multi-turn or context-aware RAG systems.

In [None]:
template = """Here is the question you need to answer:

\n --- \n {question} \n --- \n

Here is any available background question + answer pairs:

\n --- \n {q_a_pairs} \n --- \n

Here is additional context relevant to the question:

\n --- \n {context} \n --- \n

Use the above context and any background question + answer pairs to answer the question: \n {question}
"""

# Wrap the raw template in a LangChain PromptTemplate
# so it can be used seamlessly with chains and LLMs
final_prompt = PromptTemplate(
    input_variables=["question", "q_a_pairs", "context"],
    template=template
)


### Final step: Build and Execute the Multi-Stage RAG Chain (Decomposition + Context Synthesis)

In this step, we connect everything into a two-stage reasoning pipeline that first answers the decomposed sub-queries and then synthesizes a final, context-aware answer. The pipeline leverages retrieved document chunks, previously answered Q&A pairs, and the composite prompt created earlier.

Workflow Overview:

- Format helper (format_qa_pair) → Formats each question–answer pair neatly for reuse in later prompts.
- Per-subquery chain (rag_chain) →
  - Retrieves context for each sub-question.
  - Uses the LLM to produce a factual answer.
  - Appends the result to the growing list of Q&A pairs.
- Final synthesis chain (final_rag_chain) →
  - Uses the original complex question plus all gathered Q&A pairs.
  - Retrieves global context again and asks the LLM to generate a consolidated final answer.

This design allows the model to reason over structured intermediate knowledge before forming the final output, which is intended to improve precision and reduce hallucinations.
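
The control flow of the two stages can be sketched independently of LangChain. In the minimal illustration below, retrieve and generate are hypothetical callables standing in for the retriever and the prompt-plus-LLM step, and the function name run_decomposed_rag is also an assumption made for this sketch.

```python
def run_decomposed_rag(original_question, sub_questions, retrieve, generate):
    """Answer each sub-question in turn, accumulate the Q&A pairs,
    then synthesize a final answer to the original question."""
    q_a_pairs = ""
    for q in sub_questions:
        context = retrieve(q)
        answer = generate(question=q, context=context, q_a_pairs=q_a_pairs)
        q_a_pairs += f"\n---\nQuestion: {q}\nAnswer: {answer}"
    # Final stage: retrieve for the original question and reuse all Q&A pairs
    context = retrieve(original_question)
    return generate(question=original_question, context=context, q_a_pairs=q_a_pairs)

# Stub components so the flow can be run without a model or a vector store
def stub_retrieve(q):
    return f"context for {q}"

def stub_generate(question, context, q_a_pairs):
    return f"answer to {question}"

final = run_decomposed_rag("main?", ["sub1?", "sub2?"], stub_retrieve, stub_generate)
print(final)
```

Because each sub-answer is appended before the next sub-question is processed, later sub-questions and the final synthesis both see everything answered so far.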

In [None]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

def format_qa_pair(question, answer):
    """Format a question-answer pair into a readable text block."""
    return f"Question: {question}\nAnswer: {answer}"

# The same llm as before is reused for the sub-answers and for the final synthesis

# Mini-RAG chain that answers one sub-question at a time
rag_chain = (
    {
        "context": itemgetter("question") | retriever,  # Retrieve context relevant to this sub-question
        "question": itemgetter("question"),             # Current sub-question
        "q_a_pairs": itemgetter("q_a_pairs")            # Previous Q&A context (if any)
    }
    | final_prompt                                      # Use the composite prompt structure
    | llm                                               # Generate sub-answer
    | StrOutputParser()                                 # Convert LLM output to plain string
)

# Initialize an empty string to accumulate Q&A pairs
q_a_pairs = ""

# Iterate through each decomposed sub-query
for q in questions:

    # Invoke the chain for the current sub-question
    answer = rag_chain.invoke({"question": q, "q_a_pairs": q_a_pairs})

    # Format and append the result for later synthesis
    q_a_pair = format_qa_pair(q, answer)
    q_a_pairs = q_a_pairs + "\n---\n" + q_a_pair


# Final RAG chain to synthesize the comprehensive answer
final_rag_chain = (
    {
        "context": itemgetter("original_question") | retriever,  # Retrieve global context
        "question": itemgetter("original_question"),              # Original complex question
        "q_a_pairs": itemgetter("q_a_pairs")                      # All sub-query Q&A pairs
    }
    | final_prompt                                                # Same composite prompt
    | llm                                                         # Synthesize final response
    | StrOutputParser()                                           # Parse to text
)

# Run the final synthesis stage
final_answer = final_rag_chain.invoke(
    {"original_question": question, "q_a_pairs": q_a_pairs}
)


In [None]:
# Import utilities for the prompt-based retrieval QA chain
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Define a strict prompt template
template = """
You are a helpful assistant that answers using only the provided context.
If the answer is not contained in the context, say "I do not know".

Context: {context}

Question: {question}

Other requirements: Structure the answer in a way that is easy to read and understand.

"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)




In [None]:
# Helper function to format retrieved docs with their metadata for readability
def format_docs(docs):
  return "\n\n".join(
      [f"[{i+1}] {d.page_content}\n (meta: {d.metadata})" for i, d in enumerate(docs)]
  )

# Create the full retrieval-augmented generation chain
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

In [None]:
# Example question that may match multiple documents
question = "Provide me a name of a student who can collaborate on a Robot and Air Quality project"

result = chain.invoke({"query": question})

print(result["result"])


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Based on the provided context, Dr. Emily Zhang from the Robotics Engineering department and Dr. Patricia Brown from the Environmental Science department have students working on projects related to robotics and air quality respectively. You may consider reaching out to the following students for potential collaboration:

1. Kenji Tanaka (student_id: STU2024004, email: kenji.tanaka@university.edu) from Dr. Emily Zhang's lab, working on an Autonomous Delivery Robot project.
2. David Miller (student_id: STU2024008, email: david.miller@university.edu) from Dr. Patricia Brown's lab, working on a Real-time Air Quality Monitoring Network project.

These students' projects might provide valuable insights and resources for a collaborative Robot and Air Quality project.


In [None]:
question = "Provide me names of three students who can collaborate on a Robot and Air Quality project"

result = chain.invoke({"query": question})

print(result["result"])


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Based on the context provided, the following faculty members have expertise in robotics and air quality: Dr. Emily Zhang (Robotics and Autonomous Systems) and Dr. Patricia Brown (Environmental Science). Three students who have previously worked on projects related to these areas and might be suitable for collaboration are:

1. Kenji Tanaka (Student ID: STU2024004, Department: Robotics Engineering, Project: Autonomous Delivery Robot with Obstacle Avoidance)
2. Ahmed Hassan (Student ID: STU2024006, Department: Urban Planning & Computer Science, Project: AI-Optimized Traffic Flow Management System)
3. David Miller (Student ID: STU2024008, Department: Environmental Science, Project: Real-time Air Quality Monitoring Network with Predictive Analytics)


In [None]:
# Retrieve and display the raw context chunks for transparency / debugging
docs = retriever.invoke(question)
print("\n--- Retrieved Chunks (for debugging) ---")
for i, d in enumerate(docs, 1):
    print(f"[{i}] from {d.metadata.get('source')}, p={d.metadata.get('page', 'NA')}")
    print(d.page_content[:300].strip(), "...\n")



--- Retrieved Chunks (for debugging) ---
[1] from Mentors.pdf, p=0
FACULTY MENTORS AND AREAS OF EXPERTISE 2024 COMPUTER SCIENCE DEPARTMENT: Dr. Sarah Chen - Artificial Intelligence in Healthcare - Computer Vision - Medical Image Processing Office: CS Building 301 Email: schen@university.edu Dr. Mic ...

[2] from Kenji/metadata.json, p=NA
{
  "student_id": "STU2024004",
  "name": "Kenji Tanaka",
  "email": "kenji.tanaka@university.edu",
  "department": "Robotics Engineering",
  "project_title": "Autonomous Delivery Robot with Obstacle Avoidance",
  "submission_date": "2024-01-22",
  "academic_year": "Second Year",
  "supervisor": "Dr ...

[3] from Ahmed/metadata.json, p=NA
{
  "student_id": "STU2024006",
  "name": "Ahmed Hassan",
  "email": "ahmed.hassan@university.edu",
  "department": "Urban Planning & Computer Science",
  "project_title": "AI-Optimized Traffic Flow Management System",
  "submission_date": "2024-01-2



**The complete pipeline is designed in another notebook with a Streamlit application.**