AI Curriculum & Syllabus Validator

An institutional-grade, domain-agnostic hybrid AI system designed to automatically audit exam questions against university syllabi, map course outcomes, and classify question complexity using Bloom's Taxonomy.

🚀 Key Achievements & Features

The system has evolved from a naive semantic retriever into a robust, context-aware curriculum audit suite. Below are the core engineering milestones implemented:

1. Curriculum-Driven Ingestion Flow (Two-Phase Ingestion)

Phase 1: Parse & Preview (/parse_curriculum)
- Autonomously segments multi-subject curriculum PDFs, text, or public URLs.
- Dynamically extracts administrative and academic metadata: Department, Semester, Program, Subject Code, Subject Name, Elective Type, and a Metadata Confidence metric.
- Extracts module lists and displays text previews, allowing faculty to inspect extracted syllabus blocks before vectorization.

Phase 2: Selective Ingestion (/ingest_selected)
- Faculty selects which parsed subjects to vectorize, preventing vector database pollution and saving compute.

2. Pre-Embedding Quality Gate & Noise Sanitization

Boilerplate Stripper (chunk_quality.py): Evaluates content-to-noise ratio in each text chunk. Automatically purges low-information administrative boilerplate (credit counts, lecture hours, textbooks, citation indexes, syllabus page headers/footers) prior to vector database insertion. This reduces vector database noise by >95%.
Bibliographic Noise Gate (_is_reference_entry): Uses strict regular expressions and publisher-keyword matching to identify and discard textbook lists, author credits, and standard citations masquerading as course topics.
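
As a rough illustration of the two gates above, the sketch below flags bibliographic lines and computes a content-to-noise ratio for a chunk. It is not the code in chunk_quality.py: the publisher list, the regular expressions, and the helper names `looks_like_reference_entry` and `content_ratio` are assumptions made for this example.

```python
import re

# Hypothetical publisher keywords; the project's real list is not shown in this README.
PUBLISHER_KEYWORDS = ("pearson", "mcgraw", "wiley", "springer", "prentice hall")

# Simple bibliographic signals: a 4-digit year, an edition marker, author initials.
_YEAR = re.compile(r"\b(19|20)\d{2}\b")
_EDITION = re.compile(r"\b\d+(st|nd|rd|th)\s+ed(ition|\.)?", re.IGNORECASE)
_INITIALS = re.compile(r"\b[A-Z]\.\s?(?:[A-Z]\.\s?)?[A-Z][a-z]+")

# Hypothetical administrative boilerplate markers.
_BOILERPLATE = re.compile(
    r"\b(credits?|lecture hours?|contact hours?|text ?books?|reference books?)\b",
    re.IGNORECASE,
)


def looks_like_reference_entry(line: str) -> bool:
    """True when a line carries enough bibliographic signals to be a citation."""
    has_publisher = any(k in line.lower() for k in PUBLISHER_KEYWORDS)
    signals = sum(bool(p.search(line)) for p in (_YEAR, _EDITION, _INITIALS))
    return (has_publisher and signals >= 1) or signals >= 2


def content_ratio(chunk: str) -> float:
    """Share of non-empty lines that are neither boilerplate nor citations."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    kept = [
        ln for ln in lines
        if not _BOILERPLATE.search(ln) and not looks_like_reference_entry(ln)
    ]
    return len(kept) / len(lines)
```

A chunk whose ratio falls below some cut-off (the cut-off itself is not stated in this README) would be dropped before embedding.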

3. Advanced Hybrid Retrieval & Score Gating

80/20 Hybrid Matcher: Combines dense semantic retrieval (SentenceTransformers with the multilingual-e5-base model) with exact-match lexical overlap of technical terms in a single hybrid score (80% semantic weight, 20% lexical weight).
Dynamic Concept Expansion Boost (concept_expander.py): Uses an NLP pipeline (spaCy noun chunks, capitalized entities, acronyms) to build subject-local concept indices. When analyzing a question, the system semantically evaluates concept alignment and applies a boost (+0.12 for strong, +0.06 for moderate overlap) to resolve synonyms and academic paraphrasing (e.g., matching "eliminate redundancy" to "normalization") without hardcoded whitelists.
Strict Semantic Thresholds: Applies strict similarity thresholds (a score of 0.90 or higher is a Strong Match; a score below 0.72 is No Match) with early deterministic rejection. Questions that fall below the gatekeeper threshold are rejected as OUT_OF_CURRICULUM before any LLM call, which avoids redundant inference and reduces the risk of hallucinated justifications.
Dual-Gap Cross-Module Filtering: Keeps the top match and evaluates each subsequent match against it. An additional chunk is kept only if it belongs to the same module and falls within a 2% similarity gap, or belongs to a different module and falls within a 4% similarity gap. This keeps irrelevant chunks out while still capturing questions that span modules.
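
The scoring and gating described in this section can be sketched as follows. The weights, boost values, and thresholds come from the text above; the tokenizer, the clipping to 1.0, adding the boost after the weighted blend, and the `PARTIAL_MATCH` label for scores between the two thresholds are assumptions for illustration only.

```python
import re

SEMANTIC_WEIGHT = 0.8   # 80/20 split described above
LEXICAL_WEIGHT = 0.2
STRONG_MATCH = 0.90     # at or above: strong match
GATE = 0.72             # below: rejected early


def lexical_overlap(question: str, chunk: str) -> float:
    """Fraction of distinct question tokens that also appear in the chunk."""
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_tokens = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    if not q_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens)


def hybrid_score(semantic: float, lexical: float, boost: float = 0.0) -> float:
    """Weighted blend plus an optional concept boost, clipped to 1.0 (assumption)."""
    return min(1.0, SEMANTIC_WEIGHT * semantic + LEXICAL_WEIGHT * lexical + boost)


def match_strength(score: float, gate: float = GATE) -> str:
    """Map a hybrid score to a label; the middle label is hypothetical."""
    if score < gate:
        return "OUT_OF_CURRICULUM"
    if score >= STRONG_MATCH:
        return "STRONG_MATCH"
    return "PARTIAL_MATCH"
```

For example, a cosine similarity of 0.9 with a lexical overlap of 0.5 blends to 0.8 × 0.9 + 0.2 × 0.5 = 0.82, which passes the gate but is not a strong match; a +0.12 concept boost would lift it to 0.94.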

4. Rich Pedagogical Mapping

Hybrid Two-Pass Bloom's Taxonomy Classifier (bloom_classifier.py): Combines fast, deterministic cognitive action-verb mapping (scanning from the highest level to the lowest: Create → Evaluate → Analyze → Apply → Understand → Remember) with a fallback WH-question heuristic router. The fallback handles exam question forms that carry no explicit action verb (e.g., "Why does X work?", "What are the trade-offs of Y?").
Course Outcome (CO) & Program Outcome (PO) Mapper (co_mapper.py): Ingests course learning outcomes and maps exam questions semantically to the closest CO and PO, producing institutional alignment logs.
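
A minimal sketch of the two-pass Bloom classification described above. The verb lists and the WH-question rules are illustrative assumptions, not the contents of bloom_classifier.py; only the level ordering and the idea of a WH fallback come from this README.

```python
import re

# Pass 1: action verbs, scanned from the highest level down (lists are illustrative).
BLOOM_VERBS = [
    ("Create", {"design", "develop", "construct", "formulate", "propose"}),
    ("Evaluate", {"evaluate", "justify", "critique", "assess", "recommend"}),
    ("Analyze", {"analyze", "compare", "differentiate", "distinguish", "examine"}),
    ("Apply", {"apply", "solve", "implement", "compute", "calculate"}),
    ("Understand", {"explain", "describe", "summarize", "discuss", "illustrate"}),
    ("Remember", {"define", "list", "state", "name", "identify"}),
]

# Pass 2: WH-question heuristics for questions with no action verb (hypothetical rules).
WH_RULES = [
    (re.compile(r"\btrade-?offs?\b"), "Evaluate"),
    (re.compile(r"^\s*why\b"), "Analyze"),
    (re.compile(r"^\s*how\b"), "Understand"),
    (re.compile(r"^\s*(what|which|who|when|where)\b"), "Remember"),
]


def classify_bloom(question: str) -> str:
    """Return a Bloom level, or "Unclassified" when neither pass matches."""
    text = question.lower()
    tokens = set(re.findall(r"[a-z]+", text))
    for level, verbs in BLOOM_VERBS:
        if tokens & verbs:
            return level
    for pattern, level in WH_RULES:
        if pattern.search(text):
            return level
    return "Unclassified"
```

Because the verb pass runs from Create downward, a question containing both "design" and "explain" is assigned the higher level.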

5. Automated Context Detection & Hydration

🪄 Auto-Detect Context (/detect_subject): Allows faculty to paste a question paper or upload a PDF. The system extracts subject metadata via structured regex and queries ChromaDB with the first 1000 characters of the input to semantically identify and automatically select the matching syllabus.
Startup Hydration Engine: On startup, the server automatically scans and hydrates the in-memory syllabus index from persistent ChromaDB collection meta tables, ensuring consistency after backend server restarts.
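
The regex half of Auto-Detect Context might look roughly like the sketch below. The patterns, the function name `extract_paper_metadata`, and the returned keys are assumptions for illustration; the README states only that structured regex is used and that the first 1000 characters are sent to ChromaDB.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: a code such as "IT801B" and a semester such as "Semester VIII".
SUBJECT_CODE = re.compile(r"\b([A-Z]{2,4})[- ]?(\d{3}[A-Z]?)\b")
SEMESTER = re.compile(r"\b(?i:sem(?:ester)?)\b\.?\s*[:\-]?\s*([IVX]+|\d+)\b")


def extract_paper_metadata(text: str, window: int = 1000) -> Dict[str, Optional[str]]:
    """Pull a subject code and semester from the head of a question paper."""
    head = text[:window]  # only this slice would be used for the semantic lookup
    code = SUBJECT_CODE.search(head)
    sem = SEMESTER.search(head)
    return {
        "subject_code": f"{code.group(1)}{code.group(2)}" if code else None,
        "semester": sem.group(1) if sem else None,
        "query_text": head,
    }
```

When the regex pass finds no subject code, the semantic query on `query_text` would be the only signal left for picking a syllabus.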

🏗️ Architecture Overview

The system is built as a highly decoupled, modular hybrid pipeline where vector embeddings handle retrieval, deterministic rules enforce bounds, and local LLM reasoning provides explainable justification.

graph TD
    QP[Question Paper / Text Input] --> AD[🪄 Auto-Detect Context]
    AD --> DB[ChromaDB Vector Store]
    AD --> SM[Syllabus Metadata Match]
    
    Q[Exam Question] --> HS[80/20 Hybrid Scoring]
    SM --> HS
    DB --> HS
    
    HS --> CE[Concept Store Boost]
    CE --> GF[Dual-Gap Module Filter]
    GF --> RC[Post-Retrieval Clean Gate]
    
    RC --> BT[Two-Pass Bloom Classifier]
    RC --> CO[CO/PO Semantic Mapper]
    
    RC & BT & CO --> LV[Grounding Validator / Local LLM]
    LV --> Res[Enriched Question Report]

Technical Stack

Frontend: React (interactive, responsive faculty validation dashboard)
Backend: Python, Flask (modular microservices architecture)
Vector Database: ChromaDB (persistent local storage, HNSW cosine index)
NLP / Embeddings: spaCy (en_core_web_sm), SentenceTransformers (multilingual-e5-base)
Local Inference: Mistral 7B (deployed locally via llama-cpp for total privacy and zero API costs)
Document Processing: pypdf, Advanced regex parsers

⚙️ How It Works (The Ingestion & Analysis Pipelines)

Phase A: Syllabus Ingestion Pipeline

Curriculum Segmentation: Large PDF or URL curricula are parsed into distinct subjects.
Boilerplate Filtering: Raw texts are chunked and evaluated. Low-information sentences are purged.
Reference Book Filtering: Bibliographic listings are stripped out.
Vector Embeddings: Valid chunks are embedded via multilingual-e5-base and stored in ChromaDB.
Concept Indexing: Noun-phrase and capitalized concepts are indexed in the concept_store collection.

Phase B: Question Analysis Pipeline

Context Loading: Syllabus metadata is loaded.
Hybrid Semantic Matching: Embeds question, retrieves top-k chunks, combines cosine similarity with exact keyword overlap.
Concept Boosting: Evaluates local concept alignment, applying a boost if concepts overlap.
Similarity Threshold Gate: Rejects early if the hybrid score is < 0.72.
Dual-Gap Filtering: Retains relevant cross-module nodes, discarding peripheral noise.
Bloom & CO/PO Evaluation: Determines cognitive difficulty and maps to academic outcomes.
Explainable Validation: Local LLM uses retrieved grounding chunks to output a YES/NO validation and detailed justification.
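
The Dual-Gap Filtering step can be sketched as below. It assumes each retrieved match is a dict with `similarity` and `module` keys (as in the `top_chunks` entries of the API response) and treats the 2% and 4% gaps as absolute score differences of 0.02 and 0.04; the README does not say whether the gaps are absolute or relative.

```python
from typing import Any, Dict, List


def dual_gap_filter(
    matches: List[Dict[str, Any]],
    same_module_gap: float = 0.02,
    cross_module_gap: float = 0.04,
) -> List[Dict[str, Any]]:
    """Keep the top match plus any match close enough to it in score."""
    ranked = sorted(matches, key=lambda m: m["similarity"], reverse=True)
    if not ranked:
        return []
    top = ranked[0]
    kept = [top]
    for match in ranked[1:]:
        gap = top["similarity"] - match["similarity"]
        # A chunk from another module is allowed the wider gap.
        limit = same_module_gap if match["module"] == top["module"] else cross_module_gap
        if gap <= limit:
            kept.append(match)
    return kept
```

With a top score of 0.92 in Unit III, a Unit III chunk at 0.89 is dropped (gap 0.03 exceeds 0.02) while a Unit IV chunk at 0.89 is kept (gap 0.03 is within 0.04).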

🔌 API Reference

1. Ingestion Endpoints

`POST /parse_curriculum`

Parses curriculum text, PDF files, or URLs and returns detected subject blocks without embedding.

Request Type: Form Data or JSON
Body Parameters:
- mode: "pdf" | "url" | "text"
- file: (If mode is pdf) Multipart PDF file upload
- url: (If mode is url) String URL to fetch
- text: (If mode is text) Raw curriculum text

Response Format:

{
  "parse_id": "a5b810da-...",
  "segments": [
    {
      "syllabus_id": "IT-VIII-PEC-IT801B",
      "curriculum_department": "Information Technology",
      "department": "Information Technology",
      "semester": "VIII",
      "subject_code": "IT801B",
      "subject_name": "Cryptography and Network Security",
      "elective_type": "PEC",
      "metadata_confidence": "High",
      "modules": ["Unit I: ...", "Unit II: ..."],
      "text_preview": "Syllabus content preview...",
      "already_ingested": false
    }
  ]
}

`POST /ingest_selected`

Vectorizes and embeds selected subjects from a prior parsing session.

Request Type: JSON
Body Parameters:

{
  "parse_id": "a5b810da-...",
  "syllabus_ids": ["IT-VIII-PEC-IT801B"],
  "ingest_all": false
}

Response Format:

{
  "success": true,
  "ingested": ["IT-VIII-PEC-IT801B"],
  "skipped_duplicates": [],
  "chunks_generated": 42
}
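
To connect the two phases from a client, a small helper can turn a `/parse_curriculum` response into an `/ingest_selected` request. This is a sketch under assumptions: the helper name `build_ingest_request` is hypothetical, the base URL is the local Flask address from the setup notes, and skipping segments marked `already_ingested` is a client-side choice rather than documented server behavior.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://127.0.0.1:5000"  # local backend address from the setup notes


def build_ingest_request(
    parse_response: Dict[str, Any], base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST /ingest_selected request."""
    # Select only the subjects that have not been vectorized yet.
    pending = [
        seg["syllabus_id"]
        for seg in parse_response["segments"]
        if not seg.get("already_ingested", False)
    ]
    body = {
        "parse_id": parse_response["parse_id"],
        "syllabus_ids": pending,
        "ingest_all": False,
    }
    return urllib.request.Request(
        f"{base_url}/ingest_selected",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)` once the backend is running.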

`POST /ingest_from_url`

Directly downloads, parses, and embeds a syllabus from a URL.

Request Type: JSON
Body Parameters:

{
  "url": "https://example.com/syllabus.pdf",
  "department": "Information Technology",
  "semester": "VIII",
  "subject_code": "IT801B",
  "subject_name": "Cryptography and Network Security"
}

2. Operational & Analysis Endpoints

`POST /detect_subject` (🪄 Auto-Detect Context)

Extracts metadata from pasted question papers or uploads to match a loaded syllabus.

Request Type: Form Data or JSON
Body Parameters:
- mode: "text" | "pdf"
- text / file: Raw text or PDF

Response Format:

{
  "success": true,
  "metadata": {
    "syllabus_id": "IT-VIII-PEC-IT801B",
    "subject_code": "IT801B",
    "subject_name": "Cryptography and Network Security",
    "department": "Information Technology",
    "semester": "VIII"
  }
}

`POST /analyze_question`

Performs comprehensive auditing, indexing, and validation for single or batch exam questions.

Request Type: Form Data or JSON
Body Parameters:
- mode: "text" | "pdf"
- question / file: Raw question string or PDF question paper
- syllabus_id: The ID of the syllabus to validate against
- threshold: Gatekeeper similarity threshold (defaults to 0.72)

Response Format (Single Mode):

{
  "mode": "single",
  "question": "Explain the working of RSA cryptosystem.",
  "similarity_score": 0.92,
  "is_in_syllabus": true,
  "gatekeeper_passed": true,
  "reason": "Successfully grounded in Unit III: Public Key Cryptography",
  "retrieval_status": "MATCH_FOUND",
  "match_strength": "STRONG_MATCH",
  "match_type": "IN_CURRICULUM",
  "modules_detected": ["Unit III: Public Key Cryptography"],
  "bloom_level": "Understand",
  "difficulty": "Easy",
  "mapped_co": "CO2",
  "mapped_pco": "PO2",
  "llm_decision": "YES",
  "llm_justification": "The question asks for the working of the RSA cryptosystem, which is explicitly covered in public key cryptography topics within Unit III.",
  "llm_module": "Unit III: Public Key Cryptography",
  "top_chunks": [
    {
      "text": "Unit III: Public Key Cryptography, RSA cryptosystem, Key generation, Encryption and Decryption algorithms.",
      "similarity": 0.92,
      "module": "Unit III: Public Key Cryptography"
    }
  ]
}
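
A client may want to sanity-check a single-mode report before showing it to faculty. The sketch below is one hypothetical policy, not part of the backend: it uses only field names from the response above and flags reports where the deterministic gate, the LLM decision, and the retrieved chunks disagree.

```python
from typing import Any, Dict, List


def review_flags(report: Dict[str, Any]) -> List[str]:
    """Return human-readable reasons a report may need manual review."""
    flags: List[str] = []
    if not report.get("gatekeeper_passed", False):
        flags.append("rejected by similarity gate")
    # The LLM verdict should agree with the final in-syllabus decision.
    llm_yes = report.get("llm_decision") == "YES"
    if llm_yes != bool(report.get("is_in_syllabus")):
        flags.append("LLM decision disagrees with is_in_syllabus")
    # The module named by the LLM should come from a retrieved chunk.
    top_chunks = report.get("top_chunks") or []
    if top_chunks and report.get("llm_module") not in {c.get("module") for c in top_chunks}:
        flags.append("LLM module not among retrieved chunks")
    return flags
```

An empty list means the three signals are consistent with each other; it does not by itself confirm that the question is in the syllabus.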

`GET /curriculum_hierarchy`

Generates the nested metadata structures (Department → Semester → Subject) directly from vector storage.

Response Format:

{
  "departments": {
    "Information Technology": {
      "VIII": [
        {
          "syllabus_id": "IT-VIII-PEC-IT801B",
          "subject_name": "Cryptography and Network Security",
          "subject_code": "IT801B",
          "elective_type": "PEC"
        }
      ]
    }
  }
}
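
The nested response is convenient for a tree view; for a table or dropdown it can be flattened. A minimal sketch, assuming the response shape shown above (the helper name `flatten_hierarchy` is hypothetical):

```python
from typing import Any, Dict, List, Tuple


def flatten_hierarchy(payload: Dict[str, Any]) -> List[Tuple[str, str, str, str]]:
    """Turn Department -> Semester -> Subject nesting into sorted rows."""
    rows: List[Tuple[str, str, str, str]] = []
    for department, semesters in payload.get("departments", {}).items():
        for semester, subjects in semesters.items():
            for subject in subjects:
                rows.append(
                    (department, semester, subject["subject_code"], subject["syllabus_id"])
                )
    return sorted(rows)
```

Each row carries the `syllabus_id` needed by `/analyze_question`.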

🚀 Setting Up the System

Backend Setup

Navigate to the Backend Directory:
```
cd backend
```

Create and Activate a Virtual Environment:
```
python -m venv .venv
# Windows:
.venv\Scripts\activate
# Linux/macOS:
source .venv/bin/activate
```

Install Dependencies:

```
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Run the Flask Server:
```
python app.py
```
The backend runs on http://127.0.0.1:5000 and hydrates its in-memory syllabus index from ChromaDB on startup.

Frontend Setup

Navigate to the Frontend Directory:
```
cd frontend
```
Install Dependencies:
```
npm install
```
Run the React Dev Server:
```
npm run dev
```
The interactive dashboard will be hosted locally (e.g., http://localhost:5173).

🔮 Key Takeaway

This project proves that successful AI implementation is not about building the largest LLM prompt. It is about architecting deterministic, context-aware gates and pipelines around local AI components to guarantee data privacy, academic rigor, and zero-hallucination accuracy.
