<a href="https://colab.research.google.com/github/Spykabore15/students_RAG_project/blob/main/Simple_RAG_Students_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dependencies installation

- langchain: main LLM orchestration framework
- langchain-community: community extensions and integrations (document loaders, Hugging Face embeddings, FAISS vector store connector)
- sentence-transformers: embedding models from Hugging Face
- transformers, accelerate: loading and running the Hugging Face LLM
- faiss-cpu: FAISS vector index (CPU build)
- pypdf: PDF parser to extract text from documents


In [1]:
from google.colab import drive
drive.mount('/content/drive')

!pip install -q \
  langchain \
  langchain-community==0.2.* \
  faiss-cpu \
  sentence-transformers \
  transformers \
  accelerate \
  pypdf==4.* \
  streamlit \
  pyngrok

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Step 2

Import the essential libraries for the RAG pipeline, covering document ingestion, text splitting, embedding generation, and vector database management.

In [2]:
# Standard utilities
from pathlib import Path
import json
from typing import List

# LangChain components for document loading, splitting, embeddings, and vector storage
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# LangChain components for LLM orchestration
from langchain_community.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

### Step 3
Load and Structure Core and Student Documents

In this step, we define utility functions to load and structure all the documents required for the RAG pipeline. This includes:

- Loading core reference PDFs such as projects, criteria, and mentors.
- Iterating through student folders to load individual reports, summaries, and metadata.
- Tagging each document with relevant metadata (such as source, student, and category) to support context-aware retrieval later.

In [3]:
path ="/content/drive/MyDrive/TRAINING DATA"
DATA = Path(path)

In [4]:
# Core documents that serve as global reference material

CORE_PDFS= [
    DATA / "Projects.pdf",
    DATA / "criteria.pdf",
    DATA / "Mentors.pdf"
]

def load_core_pdfs() -> List[Document]:
  """
    Loads all core PDF documents (projects, criteria, mentors).
    Each page is converted into a LangChain Document with metadata attached.
  """

  docs = []
  for pdf in CORE_PDFS:
    if pdf.exists():
      # Load all pages from the PDF
      pages= PyPDFLoader(str(pdf)).load()
      # Add source metadata for traceability
      for p in pages:
        p.metadata.update({"source": pdf.name, "category": "core"})
      docs.extend(pages)
  return docs



In [5]:
def load_student_dirs() -> List[Document]:
  """
    Iterates through each student directory and loads:
    - report.pdf: main project report
    - summary.txt: short text summary (if exists)
    - metadata.json: student metadata (if exists)
    Each file is wrapped in a Document object with descriptive metadata.
    """
  docs = []
  students_dir = DATA / "students"
  if not students_dir.exists():
    return docs
  for student_dir in students_dir.iterdir():
    if not student_dir.is_dir():
      continue

    # Load student report
    report_pdf = student_dir / "report.pdf"
    if report_pdf.exists():
      pages = PyPDFLoader(str(report_pdf)).load()
      for p in pages:
        p.metadata.update({
            "source": f"{student_dir.name}/report.pdf",
            "student": student_dir.name,
            "category": "student_report"
        })
      docs.extend(pages)

    # Load student summary (TXT)
    summary_txt = student_dir / "summary.txt"
    if summary_txt.exists():
      tdocs = TextLoader(str(summary_txt), encoding= "utf-8").load()
      for d in tdocs:
        d.metadata.update({
          "source": f"{student_dir.name}/summary.txt",
          "student": student_dir.name,
          "category": "student_summary"
        })
      docs.extend(tdocs)

    # Load student metadata (JSON)
    meta_json = student_dir / "metadata.json"
    if meta_json.exists():
      try:
        meta = json.loads(meta_json.read_text(encoding="utf-8"))
        meta_doc = Document(
            page_content = json.dumps(meta, ensure_ascii=False, indent=2),
            metadata={
                "source": f"{student_dir.name}/metadata.json",
                "student": student_dir.name,
                "category": "student_metadata"
            }
        )
        docs.append(meta_doc)
      except Exception as e:
        print(f"Couldn't parse {meta_json}: {e}")

  return docs

In [6]:
# Combine all loaded documents from core and student sources
raw_docs = load_core_pdfs() + load_student_dirs()

# Display how many total document objects were loaded
print(f"Loaded {len(raw_docs)} raw documents (pages + text).")


Loaded 134 raw documents (pages + text).
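
A quick way to sanity-check the metadata tagging is to count documents per category. The sketch below is a minimal illustration and not part of the original notebook: `count_by_category` is a hypothetical helper, and the `SimpleNamespace` objects (with made-up `alice` and `bob` folders) only stand in for LangChain `Document` objects so the example runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_category(docs):
    """Count documents per value of the "category" metadata key."""
    return Counter(d.metadata.get("category", "unknown") for d in docs)


# Stand-ins that mimic the `.metadata` attribute of a LangChain Document.
# The student folder names are hypothetical.
sample_docs = [
    SimpleNamespace(metadata={"source": "Projects.pdf", "category": "core"}),
    SimpleNamespace(metadata={"source": "alice/report.pdf", "category": "student_report"}),
    SimpleNamespace(metadata={"source": "alice/summary.txt", "category": "student_summary"}),
    SimpleNamespace(metadata={"source": "bob/report.pdf", "category": "student_report"}),
]

category_counts = count_by_category(sample_docs)
print(dict(category_counts))
```

In the notebook itself, the same helper could presumably be called as `count_by_category(raw_docs)`, since each loaded document carries a `category` key.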


### Step 4
Split Documents into Manageable Chunks

In this step, we use a recursive character text splitter to break long documents into smaller, overlapping chunks. This bounds the size of each chunk (measured in characters, which only approximates token counts) while preserving contextual continuity for embeddings and retrieval.

- chunk_size=1000: maximum number of characters in a single chunk.
- chunk_overlap=150: number of characters shared between consecutive chunks to maintain context across boundaries.
- separators: the preferred order of splitting (paragraphs → lines → words → characters).

Finally, we print the number of generated chunks and preview one sample with its metadata for verification.
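
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to cut at the separators listed above; `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, which is the same idea as the 150-character overlap configured in this step.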

In [7]:
# Initialize the RecursiveCharacterTextSplitter with chunking parameters.
# It prioritizes larger splits first (paragraphs, then lines, words, and characters)
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 150,
    separators = ["\n\n", "\n", " ", ""], # Logical split hierarchy (paragraph > line > word > char)
)

# Apply the splitter to the combined raw documents
chunks = splitter.split_documents(raw_docs)

# Display how many text chunks were created
print(f"Created {len(chunks)} chunks.")


Created 287 chunks.


In [8]:
# Preview the first chunk to verify splitting and metadata tagging
print(chunks[0].page_content[:300], "...\n", chunks[0].metadata)



UNIVERSITY
 
PROJECT
 
GUIDELINES
 
2024
 
 
CAPSTONE
 
PROJECT
 
REQUIREMENTS
 
 
1.
 
PROJECT
 
DURATION
 
AND
 
MILESTONES
 
   
-
 
Project
 
Duration:
 
12
 
weeks
 
minimum
 
   
-
 
Proposal
 
Submission:
 
Week
 
1
 
   
-
 
Mid-term
 
Review:
 
Week
 
6
 
   
-
 
Final
 
Submission:
 
Week
 ...
 {'source': 'Projects.pdf', 'page': 0, 'category': 'core'}
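
The preview above shows nearly every word on its own line, which is a common artifact of PDF text extraction. One possible remedy, sketched here under the assumption that line breaks in this corpus carry no meaning, is to collapse whitespace before splitting. `normalize_whitespace` is a hypothetical helper and is not used in the original notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "UNIVERSITY\n \nPROJECT\n \nGUIDELINES\n \n2024"
cleaned = normalize_whitespace(raw)
print(cleaned)  # UNIVERSITY PROJECT GUIDELINES 2024
```

Note the trade-off: this also removes paragraph breaks, so a splitter configured with `"\n\n"` and `"\n"` separators would fall back to splitting on spaces.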


### Step 6
Generate embeddings and store them in FAISS

In this step, we initialize the embedding model and the FAISS vector store, then index all text chunks for retrieval. This forms the foundation of the RAG system, enabling semantic search and contextual grounding.

- Embeddings: We use the Hugging Face model all-MiniLM-L6-v2 (384-dimensional) to convert text chunks into dense numerical vectors.
- FAISS: An efficient vector similarity search library developed by Meta, designed for fast nearest-neighbor retrieval over dense embeddings without requiring external database infrastructure.
- LangChain Vector Store: Wraps FAISS to simplify storing and retrieving document embeddings.
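
Conceptually, nearest-neighbor retrieval compares the query vector against every stored vector and keeps the closest ones. The brute-force sketch below assumes Euclidean (L2) distance and uses made-up 3-dimensional vectors in place of real 384-dimensional embeddings; `l2_distance` and `top_k` are hypothetical helpers, and FAISS itself uses far more efficient index structures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k):
    """Return the k (label, distance) pairs closest to query_vec."""
    scored = [(label, l2_distance(query_vec, vec)) for label, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings": the labels and numbers are illustrative only.
toy_index = [
    ("robotics report", [0.9, 0.1, 0.0]),
    ("air quality report", [0.1, 0.9, 0.0]),
    ("grading criteria", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0]

nearest = top_k(query, toy_index, k=2)
for label, dist in nearest:
    print(f"{label}: {dist:.3f}")
```

The query vector lies closest to the robotics vector, so that entry ranks first, followed by the air quality entry.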

In [None]:
# Initialize HuggingFace embedding model

embeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

# Build a LangChain vector store wrapper on top of FAISS
# This allows seamless integration between LangChain documents and HuggingFace embeddings
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding= embeddings
)

### Step 7
Build and Run the RAG Question-Answering Chain

In this step, we connect the retriever, language model, and prompt into a working Retrieval-Augmented Generation (RAG) pipeline.

The retriever fetches the most relevant chunks from FAISS, and the LLM uses them as context to answer the user’s question.

Key Components:

- LLM: An open-weight model (mistralai/Mistral-7B-Instruct-v0.2) run locally through a transformers pipeline, so inference incurs no API cost.
- Retriever: Fetches the top-k (here k=4) semantically closest chunks from the vector store.
- PromptTemplate: Instructs the LLM to answer only from the retrieved context, which reduces (but does not eliminate) hallucination.
- Chain: Combines retrieval, prompt formatting, and LLM inference into a single callable pipeline.

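At the end, sample questions are asked and the final answers are printed. Because the chain is built with `return_source_documents=True`, the retrieved chunks are also returned with each answer under `result["source_documents"]`, which shows how the model grounded its response.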

In [None]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline


# Build the text-generation pipeline
pipe = pipeline(
    "text-generation",
    model = "mistralai/Mistral-7B-Instruct-v0.2",
    do_sample=False,  # greedy decoding; temperature only takes effect when do_sample=True
    max_new_tokens=200,
    return_full_text=False
)
# Initialize the llm
llm = HuggingFacePipeline(pipeline=pipe)

# Create a retriever from the FAISS vector store
# It fetches the top 4 most relevant chunks per query
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4})

In [11]:
# Import utilities for the prompt and the retrieval QA chain
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Define a strict prompt template
template = """
You are a helpful assistant that answers using only the provided context.
If the answer is not contained in the context, say "I do not know".

Context: {context}

Question: {question}

Other requirements: Structure the answer in a way that is easy to read and understand.

"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)
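
For intuition, a "stuff"-style chain roughly joins the retrieved chunk texts and substitutes them, together with the question, into the template. The sketch below imitates that step with plain string formatting; `demo_template` and `build_prompt` are hypothetical names, the template is a shortened copy of the one above, and the real chain may differ in details.

```python
demo_template = (
    "You are a helpful assistant that answers using only the provided context.\n"
    'If the answer is not contained in the context, say "I do not know".\n\n'
    "Context: {context}\n\n"
    "Question: {question}\n"
)


def build_prompt(chunk_texts, question):
    """Join chunk texts with blank lines and substitute them into the template."""
    context = "\n\n".join(chunk_texts)
    return demo_template.format(context=context, question=question)


filled = build_prompt(
    ["Chunk A text.", "Chunk B text."],
    "Which chunk comes first?",
)
print(filled)
```

Printing the filled prompt this way can help when debugging why the model did or did not use a particular chunk.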




In [12]:
# Helper function to format retrieved docs with their metadata for readability
def format_docs(docs):
  return "\n\n".join(
      [f"[{i+1}] {d.page_content}\n(meta: {d.metadata})" for i, d in enumerate(docs)]
  )

# Create the full retrieval-augmented generation chain
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

In [13]:
# Example question that may match multiple documents
question = "Provide me a name of a student who can collaborate on a Robot and Air Quality project"

result = chain.invoke({"query": question})

print(result["result"])


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Based on the context provided, Dr. Emily Zhang from the Robotics Engineering department and Dr. Patricia Brown from the Environmental Science department have students working on robotics and air quality projects respectively. A potential student for collaboration could be Kenji Tanaka (student ID: STU2024004) from Dr. Emily Zhang's lab, as his project involves autonomous delivery robots and obstacle avoidance, which could potentially be integrated with an air quality monitoring system. Alternatively, David Miller (student ID: STU2024008) from Dr. Patricia Brown's lab could also be a potential collaborator, as his project focuses on real-time air quality monitoring with predictive analytics, which could benefit from the integration of robotics technology.
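
To see which chunks grounded an answer, the `source_documents` entry of the result can be listed next to it. The sketch below is a minimal, self-contained illustration: `summarize_sources` is a hypothetical helper, and `demo_result` is a made-up stand-in for the dictionary the chain returns when `return_source_documents=True`.

```python
from types import SimpleNamespace


def summarize_sources(result, preview_chars=80):
    """Return one line per retrieved chunk: rank, source, and a short preview."""
    lines = []
    for rank, doc in enumerate(result.get("source_documents", []), start=1):
        source = doc.metadata.get("source", "unknown")
        # Collapse whitespace so multi-line chunks fit on one line
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{rank}] {source}: {preview}")
    return lines


# Hypothetical result; the sources and texts are illustrative only.
demo_result = {
    "result": "Example answer.",
    "source_documents": [
        SimpleNamespace(
            page_content="Autonomous delivery\nrobot project.",
            metadata={"source": "student_a/report.pdf"},
        ),
        SimpleNamespace(
            page_content="Air quality monitoring.",
            metadata={"source": "student_b/summary.txt"},
        ),
    ],
}

source_lines = summarize_sources(demo_result)
print("\n".join(source_lines))
```

In the notebook, the same helper could presumably be applied to `result` after each question.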


In [14]:
question = "Provide me names of three students who can collaborate on a Robot and Air Quality project"

result = chain.invoke({"query": question})

print(result["result"])


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Based on the context provided, three students who can potentially collaborate on a Robot and Air Quality project are:

1. Kenji Tanaka (Email: kenji.tanaka@university.edu) - He is a second-year student in the Robotics Engineering department, working on an autonomous delivery robot project under the supervision of Dr. Emily Zhang. His project involves computer vision and path planning, which could be beneficial for a Robot and Air Quality project.

2. Dr. Michael Brown (Email: mbrown@university.edu) - Although he is a faculty member, his expertise in Internet of Things (IoT) and embedded systems could be valuable for developing the IoT infrastructure required to monitor and analyze air quality data.

3. Ahmed Hassan (Email: ahmed.hassan@university.edu) - A third-year student in the Urban Planning & Computer


Out-of-scope question to check that the model follows the prompt instructions

In [15]:
question = "Who is the president of France ?"

result = chain.invoke({"query": question})

print(result["result"])


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: I do not know. The context provided does not mention anything about the president of France.
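
A manual check like this can be turned into a small automated one. The sketch below assumes the refusal phrase from the prompt template ("I do not know") and uses a simple case-insensitive substring match; `is_refusal` is a hypothetical helper, and a substring test is only a rough heuristic.

```python
def is_refusal(answer, refusal_phrase="I do not know"):
    """Return True if the answer contains the refusal phrase (case-insensitive)."""
    return refusal_phrase.lower() in answer.lower()


out_of_scope_answer = (
    "Answer: I do not know. The context provided does not mention "
    "anything about the president of France."
)
in_scope_answer = "Answer: Based on the context provided, a potential student is Kenji Tanaka."

print(is_refusal(out_of_scope_answer))  # True
print(is_refusal(in_scope_answer))      # False
```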


**An improved and complete pipeline is implemented in a separate notebook with a Streamlit application.**