An institutional-grade, domain-agnostic hybrid AI system designed to automatically audit exam questions against university syllabi, map course outcomes, and classify question complexity using Bloom's Taxonomy.
The system has evolved from a naive semantic retriever into a robust, context-aware curriculum audit suite. Below are the core engineering milestones implemented:
- Phase 1: Parse & Preview (`/parse_curriculum`)
  - Autonomously segments multi-subject curriculum PDFs, text, or public URLs.
  - Dynamically extracts administrative and academic metadata: Department, Semester, Program, Subject Code, Subject Name, Elective Type, and a Metadata Confidence metric.
  - Extracts module lists and displays text previews, allowing faculty to inspect extracted syllabus blocks before vectorization.
- Phase 2: Selective Ingestion (`/ingest_selected`)
  - Faculty selects which parsed subjects to vectorize, preventing vector database pollution and saving compute.
- Boilerplate Stripper (`chunk_quality.py`): Evaluates the content-to-noise ratio of each text chunk. Automatically purges low-information administrative boilerplate (credit counts, lecture hours, textbooks, citation indexes, syllabus page headers/footers) prior to vector database insertion. This reduces vector database noise by >95%.
- Bibliographic Noise Gate (`_is_reference_entry`): Uses strict regular expressions and semantic publisher-keyword matching to identify and discard textbook lists, author credits, and standard citations masquerading as course topics.
- 80/20 Hybrid Matcher: Combines dense semantic retrieval (`SentenceTransformer` using the `multilingual-e5-base` model) with exact-match lexical overlap on technical terms in a hybrid score (80% semantic weight, 20% lexical weight).
- Dynamic Concept Expansion Boost (`concept_expander.py`): Uses an NLP pipeline (spaCy noun chunks, capitalized entities, acronyms) to build subject-local concept indices. When analyzing a question, the system semantically evaluates concept alignment and applies a boost (+0.12 for strong, +0.06 for moderate overlap) to resolve synonyms and academic paraphrasing (e.g., matching "eliminate redundancy" to "normalization") without hardcoded whitelists.
- Strict Semantic Thresholds: Applies strict similarity thresholds (0.90 and above is a Strong Match; below 0.72 is No Match) with early deterministic rejection. Questions that fail the gatekeeper threshold are rejected as `OUT_OF_CURRICULUM`, avoiding redundant LLM inference and preventing hallucinations.
- Dual-Gap Cross-Module Filtering: Keeps the top match and dynamically evaluates subsequent matches. An additional chunk is kept only if it belongs to the same module and falls within a 2% similarity gap, or belongs to a different module and falls within a 4% similarity gap. This keeps irrelevant chunks out while still capturing cross-module questions.
- Hybrid Two-Pass Bloom's Taxonomy Classifier (`bloom_classifier.py`): Combines fast, deterministic cognitive action-verb mapping (scanning in descending order: Create → Evaluate → Analyze → Apply → Understand → Remember) with a fallback WH-question heuristic router. The fallback handles complex exam question forms (e.g., "Why does X work?", "What are the trade-offs of Y?") that contain no recognizable action verb.
- Course Outcome (CO) & Program Outcome (PO) Mapper (`co_mapper.py`): Ingests course learning outcomes and semantically maps each exam question to the closest CO and PO, producing institutional alignment logs.
- Auto-Detect Context (`/detect_subject`): Allows faculty to paste a question paper or upload a PDF. The system extracts subject metadata via structured regex and queries ChromaDB (first 1000 characters) to semantically identify and automatically select the matching syllabus.
- Startup Hydration Engine: On startup, the server scans the persistent ChromaDB collection metadata tables and hydrates the in-memory syllabus index from them, ensuring consistency after backend restarts.
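The scoring rules in the list above (80/20 weighting, concept boost, and threshold gate) can be sketched in a few lines. This is a minimal illustration and not the project's actual code: the function names, the token-based overlap measure, the cap at 1.0, and the `MODERATE_MATCH` label for the middle band are all assumptions. Only the weights, boost values, thresholds, and the `STRONG_MATCH` / `OUT_OF_CURRICULUM` labels come from this README.

```python
import re

# Values stated in this README.
SEMANTIC_WEIGHT = 0.80
LEXICAL_WEIGHT = 0.20
CONCEPT_BOOST = {"strong": 0.12, "moderate": 0.06}
STRONG_MATCH_THRESHOLD = 0.90
NO_MATCH_THRESHOLD = 0.72


def lexical_overlap(question: str, chunk: str) -> float:
    """Fraction of distinct question tokens that also occur in the chunk.

    Assumption: a plain token-set overlap stands in for the real
    technical-term matcher.
    """
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_tokens = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    if not q_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens)


def hybrid_score(semantic: float, lexical: float, concept_overlap: str = "none") -> float:
    """Weighted 80/20 score plus the optional concept boost.

    Assumption: the result is capped at 1.0.
    """
    boost = CONCEPT_BOOST.get(concept_overlap, 0.0)
    return min(1.0, SEMANTIC_WEIGHT * semantic + LEXICAL_WEIGHT * lexical + boost)


def classify_match(score: float) -> str:
    """Apply the gatekeeper thresholds before any LLM call is made."""
    if score < NO_MATCH_THRESHOLD:
        return "OUT_OF_CURRICULUM"
    if score >= STRONG_MATCH_THRESHOLD:
        return "STRONG_MATCH"
    return "MODERATE_MATCH"  # hypothetical label for the middle band
```

In this sketch the semantic input would be the cosine similarity returned by the vector store, and a question is rejected deterministically whenever `classify_match` returns `OUT_OF_CURRICULUM`.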
The system is built as a highly decoupled, modular hybrid pipeline where vector embeddings handle retrieval, deterministic rules enforce bounds, and local LLM reasoning provides explainable justification.
graph TD
QP[Question Paper / Text Input] --> AD[Auto-Detect Context]
AD --> DB[ChromaDB Vector Store]
AD --> SM[Syllabus Metadata Match]
Q[Exam Question] --> HS[80/20 Hybrid Scoring]
SM --> HS
DB --> HS
HS --> CE[Concept Store Boost]
CE --> GF[Dual-Gap Module Filter]
GF --> RC[Post-Retrieval Clean Gate]
RC --> BT[Two-Pass Bloom Classifier]
RC --> CO[CO/PO Semantic Mapper]
RC & BT & CO --> LV[Grounding Validator / Local LLM]
LV --> Res[Enriched Question Report]
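The Two-Pass Bloom Classifier stage in the diagram can be illustrated with a small sketch. The verb lists, the WH patterns, and the default level are illustrative assumptions and are not taken from `bloom_classifier.py`. Only the descending scan order and the idea of a WH-question fallback come from this README.

```python
import re

# Pass 1: action verbs, scanned from the highest level down.
# The verb lists are illustrative assumptions.
BLOOM_VERBS = [
    ("Create", {"design", "construct", "develop", "formulate", "propose"}),
    ("Evaluate", {"evaluate", "justify", "critique", "assess"}),
    ("Analyze", {"analyze", "compare", "differentiate", "contrast"}),
    ("Apply", {"apply", "solve", "implement", "compute", "demonstrate"}),
    ("Understand", {"explain", "describe", "summarize", "discuss"}),
    ("Remember", {"define", "list", "state", "name", "identify"}),
]

# Pass 2: WH-question fallback, checked in order (assumed heuristics).
WH_PATTERNS = [
    (r"\btrade-?offs?\b", "Evaluate"),
    (r"^\s*why\b", "Analyze"),
    (r"^\s*how\b", "Understand"),
    (r"^\s*(what|who|when|where|which)\b", "Remember"),
]


def classify_bloom(question: str, default: str = "Understand") -> str:
    """Return a Bloom level using the verb pass first, then the WH fallback."""
    text = question.lower()
    tokens = set(re.findall(r"[a-z]+", text))
    for level, verbs in BLOOM_VERBS:
        if tokens & verbs:
            return level  # the highest matching level wins
    for pattern, level in WH_PATTERNS:
        if re.search(pattern, text):
            return level
    return default
```

Because the scan runs from Create down to Remember, a question such as "Design and explain a hash function" resolves to the higher level in this sketch.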
- Frontend: React (interactive, responsive faculty validation dashboard)
- Backend: Python, Flask (modular microservices architecture)
- Vector Database: ChromaDB (persistent local storage, HNSW cosine index)
- NLP / Embeddings: spaCy (`en_core_web_sm`), SentenceTransformers (`multilingual-e5-base`)
- Local Inference: Mistral 7B (deployed locally via llama-cpp for total privacy and zero API costs)
- Document Processing: pypdf, Advanced regex parsers
- Curriculum Segmentation: Large PDF or URL curricula are parsed into distinct subjects.
- Boilerplate Filtering: Raw text is chunked and evaluated. Low-information sentences are purged.
- Reference Book Filtering: Bibliographic listings are stripped out.
- Vector Embeddings: Valid chunks are embedded via `multilingual-e5-base` and stored in ChromaDB.
- Concept Indexing: Noun-phrase and capitalized concepts are indexed in the `concept_store` collection.
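The two filtering steps in the ingestion list above can be approximated with simple pattern checks. This is a sketch under stated assumptions: the regular expressions, the publisher list, the two-signal rule, and the function names are hypothetical and do not reproduce `chunk_quality.py` or `_is_reference_entry`.

```python
import re

# Assumed patterns for administrative boilerplate.
BOILERPLATE_PATTERNS = [
    r"\bcredits?\b\s*[:\-]?\s*\d",
    r"\b(lecture|contact|tutorial)\s+hours?\b",
    r"\bpage\s+\d+\b",
    r"\b(text\s?books?|reference\s+books?)\b",
]

# Assumed bibliographic signals: a year, an edition marker, a publisher name.
REFERENCE_PATTERNS = [
    r"\b(19|20)\d{2}\b",
    r"\b\d+(st|nd|rd|th)\s+ed(ition|\.)",
    r"\b(pearson|mcgraw[- ]hill|wiley|springer|prentice hall)\b",
]


def is_boilerplate(chunk: str) -> bool:
    """True if the chunk matches any administrative boilerplate pattern."""
    text = chunk.lower()
    return any(re.search(p, text) for p in BOILERPLATE_PATTERNS)


def looks_like_reference(chunk: str) -> bool:
    """True if at least two bibliographic signals appear together."""
    text = chunk.lower()
    hits = sum(1 for p in REFERENCE_PATTERNS if re.search(p, text))
    return hits >= 2


def filter_chunks(chunks: list[str]) -> list[str]:
    """Keep only chunks that pass both gates, ready for embedding."""
    return [c for c in chunks if not is_boilerplate(c) and not looks_like_reference(c)]
```

Requiring two bibliographic signals is a design choice in this sketch: a lone year or publisher name can legitimately appear in a course topic, while a year together with an edition marker or publisher is far more likely to be a citation.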
- Context Loading: Syllabus metadata is loaded.
- Hybrid Semantic Matching: Embeds question, retrieves top-k chunks, combines cosine similarity with exact keyword overlap.
- Concept Boosting: Evaluates local concept alignment, applying a boost if concepts overlap.
- Similarity Threshold Gate: Rejects early if the hybrid score is < 0.72.
- Dual-Gap Filtering: Retains relevant cross-module nodes, discarding peripheral noise.
- Bloom & CO/PO Evaluation: Determines cognitive difficulty and maps to academic outcomes.
- Explainable Validation: Local LLM uses retrieved grounding chunks to output a YES/NO validation and detailed justification.
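The Dual-Gap Filtering step above can be sketched as follows. Two points are assumptions, since this README does not spell them out: the gap is read as an absolute similarity difference from the top match (0.02 and 0.04), and "same module" means the same module as the top match. The `Chunk` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    module: str
    similarity: float


def dual_gap_filter(chunks: list[Chunk], same_gap: float = 0.02,
                    cross_gap: float = 0.04) -> list[Chunk]:
    """Keep the top chunk plus any chunk close enough to it.

    Same-module chunks must fall within `same_gap` of the top score;
    chunks from another module get the wider `cross_gap`.
    """
    if not chunks:
        return []
    ranked = sorted(chunks, key=lambda c: c.similarity, reverse=True)
    top = ranked[0]
    kept = [top]
    for chunk in ranked[1:]:
        gap = top.similarity - chunk.similarity
        limit = same_gap if chunk.module == top.module else cross_gap
        # Small tolerance so floating-point noise does not drop boundary cases.
        if gap <= limit + 1e-9:
            kept.append(chunk)
    return kept
```

The wider cross-module gap is what lets a question spanning two units keep evidence from both, while near-duplicates from the top module are trimmed more aggressively.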
Parses curriculum text, PDF files, or URLs and returns detected subject blocks without embedding.
- Request Type: Form Data or JSON
- Body Parameters:
  - `mode`: `"pdf"` | `"url"` | `"text"`
  - `file`: (If mode is `pdf`) Multipart PDF file upload
  - `url`: (If mode is `url`) String URL to fetch
  - `text`: (If mode is `text`) Raw curriculum text
- Response Format:
{
"parse_id": "a5b810da-...",
"segments": [
{
"syllabus_id": "IT-VIII-PEC-IT801B",
"curriculum_department": "Information Technology",
"department": "Information Technology",
"semester": "VIII",
"subject_code": "IT801B",
"subject_name": "Cryptography and Network Security",
"elective_type": "PEC",
"metadata_confidence": "High",
"modules": ["Unit I: ...", "Unit II: ..."],
"text_preview": "Syllabus content preview...",
"already_ingested": false
}
]
}
Vectorizes and embeds selected subjects from a prior parsing session.
- Request Type: JSON
- Body Parameters:
{
"parse_id": "a5b810da-...",
"syllabus_ids": ["IT-VIII-PEC-IT801B"],
"ingest_all": false
}
- Response Format:
{
"success": true,
"ingested": ["IT-VIII-PEC-IT801B"],
"skipped_duplicates": [],
"chunks_generated": 42
}
Directly downloads, parses, and embeds a syllabus from a URL.
- Request Type: JSON
- Body Parameters:
{
"url": "https://example.com/syllabus.pdf",
"department": "Information Technology",
"semester": "VIII",
"subject_code": "IT801B",
"subject_name": "Cryptography and Network Security"
}
Extracts metadata from pasted question papers or uploads to match a loaded syllabus.
- Request Type: Form Data or JSON
- Body Parameters:
  - `mode`: `"text"` | `"pdf"`
  - `text` / `file`: Raw text or PDF
- Response Format:
{
"success": true,
"metadata": {
"syllabus_id": "IT-VIII-PEC-IT801B",
"subject_code": "IT801B",
"subject_name": "Cryptography and Network Security",
"department": "Information Technology",
"semester": "VIII"
}
}
Performs comprehensive auditing, indexing, and validation for single or batch exam questions.
- Request Type: Form Data or JSON
- Body Parameters:
  - `mode`: `"text"` | `"pdf"`
  - `question` / `file`: Raw question string or PDF question paper
  - `syllabus_id`: The ID of the syllabus to validate against
  - `threshold`: Gatekeeper similarity threshold (defaults to `0.72`)
- Response Format (Single Mode):
{
"mode": "single",
"question": "Explain the working of RSA cryptosystem.",
"similarity_score": 0.92,
"is_in_syllabus": true,
"gatekeeper_passed": true,
"reason": "Successfully grounded in Unit III: Public Key Cryptography",
"retrieval_status": "MATCH_FOUND",
"match_strength": "STRONG_MATCH",
"match_type": "IN_CURRICULUM",
"modules_detected": ["Unit III: Public Key Cryptography"],
"bloom_level": "Understand",
"difficulty": "Easy",
"mapped_co": "CO2",
"mapped_pco": "PO2",
"llm_decision": "YES",
"llm_justification": "The question asks for the working of the RSA cryptosystem, which is explicitly covered in public key cryptography topics within Unit III.",
"llm_module": "Unit III: Public Key Cryptography",
"top_chunks": [
{
"text": "Unit III: Public Key Cryptography, RSA cryptosystem, Key generation, Encryption and Decryption algorithms.",
"similarity": 0.92,
"module": "Unit III: Public Key Cryptography"
}
]
}
Generates the nested metadata structures (Department → Semester → Subject) directly from vector storage.
- Response Format:
{
"departments": {
"Information Technology": {
"VIII": [
{
"syllabus_id": "IT-VIII-PEC-IT801B",
"subject_name": "Cryptography and Network Security",
"subject_code": "IT801B",
"elective_type": "PEC"
}
]
}
}
}
- Navigate to the Backend Directory:
    cd backend
- Set Up virtualenv & Activate:
    python -m venv .venv
    # Windows:
    .venv\Scripts\activate
    # Linux/macOS:
    source .venv/bin/activate
- Install Dependencies:
    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
- Run the Flask Server:
    python app.py
  The backend will run on `http://127.0.0.1:5000` and auto-hydrate itself.
- Navigate to the Frontend Directory:
    cd frontend
- Install Dependencies:
    npm install
- Run the React Dev Server:
    npm run dev
  The interactive dashboard will be hosted locally (e.g., `http://localhost:5173`).
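Once the backend is running, the `/ingest_selected` endpoint described earlier can be called from a script. The sketch below uses only the Python standard library and builds the JSON body documented above. The POST method, the helper names, and the 30-second timeout are assumptions, as this README does not state them.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # backend address from the setup steps


def build_ingest_request(parse_id: str, syllabus_ids: list[str],
                         ingest_all: bool = False,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON request for /ingest_selected.

    Assumption: the endpoint accepts POST with a JSON body.
    """
    payload = {
        "parse_id": parse_id,
        "syllabus_ids": list(syllabus_ids),
        "ingest_all": ingest_all,
    }
    return urllib.request.Request(
        url=f"{base_url}/ingest_selected",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Here `parse_id` is the value returned by a prior `/parse_curriculum` call, and `send(build_ingest_request(parse_id, ["IT-VIII-PEC-IT801B"]))` would return the ingestion summary shown in the response format above.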
This project proves that successful AI implementation is not about building the largest LLM prompt. It is about architecting deterministic, context-aware gates and pipelines around local AI components to guarantee data privacy, academic rigor, and zero-hallucination accuracy.