<a href="https://colab.research.google.com/github/Pradxpk-88/RAG/blob/main/RAG_Phase_2_02_semantic_chunker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Imports**

In [1]:
# Imports
import json
import re

**Load Cleaned Pages**

In [12]:
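# Define the save path within your Google Drive.
# NOTE: this value is a Drive folder URL appended to /content, not a real
# filesystem path. It is overwritten in the save cell before it is used.
SAVE_PATH = "/content/https://drive.google.com/drive/folders/1lcNu_R9sxFOqSpVr6v3P_notx0UCqks5"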

In [3]:
# Load cleaned pages (from Module 1)

file_name = "/content/cleaned_pages.json"  # upload the JSON file here

with open(file_name, "r", encoding="utf-8") as f:
    cleaned_pages = json.load(f)

print(f"Total cleaned pages loaded: {len(cleaned_pages)}")


Total cleaned pages loaded: 458
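The chunker later in this notebook reads `page["clean_text"]` and `page["page_number"]` from every entry, so a quick structural check right after loading can surface a malformed file early. The following is a minimal sketch; `validate_pages` is a hypothetical helper and the sample pages are made up, not taken from `cleaned_pages.json`.

```python
def validate_pages(pages):
    """Return a list of problems found in a list of page dicts (hypothetical helper)."""
    problems = []
    for i, page in enumerate(pages):
        if not isinstance(page, dict):
            problems.append(f"entry {i}: expected a dict, got {type(page).__name__}")
            continue
        if not isinstance(page.get("clean_text"), str):
            problems.append(f"entry {i}: missing or non-string 'clean_text'")
        if not isinstance(page.get("page_number"), int):
            problems.append(f"entry {i}: missing or non-integer 'page_number'")
    return problems


# Made-up sample data for illustration only
sample_pages = [
    {"page_number": 1, "clean_text": "1.1 Cement\nCement is a binder."},
    {"page_number": 2},  # clean_text is missing
]
print(validate_pages(sample_pages))
```

On the real data, an empty list from `validate_pages(cleaned_pages)` would suggest that every page has the two keys the chunker depends on.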


**Define Syllabus Metadata (FROZEN)**

In [4]:
# Syllabus-based UNIT mapping (authoritative)

SYLLABUS_UNIT_RULES = [

    # =========================
    # UNIT I — Civil Materials & Surveying
    # =========================
    {
        "keywords": [
            "BUILDING MATERIAL",
            "CEMENT",
            "CONCRETE",
            "STONE",
            "BRICK",
            "TIMBER",
            "AGGREGATE",
            "SURVEY",
            "SURVEYING",
            "CHAIN SURVEY",
            "COMPASS SURVEY",
            "LEVELLING",
            "CONTOUR"
        ],
        "unit": "UNIT I",
        "domain": "civil",
        "syllabus_status": "in_syllabus"
    },

    # =========================
    # UNIT II — Building Components & Foundations
    # =========================
    {
        "keywords": [
            "FOUNDATION",
            "BEARING CAPACITY",
            "FOOTING",
            "PILE",
            "RAFT",
            "SHALLOW FOUNDATION",
            "DEEP FOUNDATION",
            "BUILDING COMPONENT",
            "MASONRY",
            "COLUMN",
            "BEAM",
            "ROOF",
            "FLOOR",
            "LINTEL",
            "PLASTERING"
        ],
        "unit": "UNIT II",
        "domain": "civil",
        "syllabus_status": "in_syllabus"
    },

    # =========================
    # UNIT III — Mechanical Engineering Overview & Manufacturing
    # =========================
    {
        "keywords": [
            "MECHANICAL ENGINEERING",
            "MANUFACTURING",
            "FOUNDRY",
            "CASTING",
            "PATTERN",
            "MOULD",
            "WELDING",
            "ARC WELDING",
            "GAS WELDING",
            "LATHE",
            "MACHINING",
            "TURNING",
            "DRILLING",
            "MILLING",
            "FORGING",
            "PRESS WORK",
            "3D PRINTING",
            "ADDITIVE MANUFACTURING",
            "AUTOMATION"
        ],
        "unit": "UNIT III",
        "domain": "mechanical",
        "syllabus_status": "in_syllabus"
    },

    # =========================
    # UNIT IV — IC Engines & Power Plants
    # =========================
    {
        "keywords": [
            "IC ENGINE",
            "INTERNAL COMBUSTION",
            "PETROL ENGINE",
            "DIESEL ENGINE",
            "OTTO CYCLE",
            "DIESEL CYCLE",
            "TWO STROKE",
            "FOUR STROKE",
            "POWER PLANT",
            "STEAM POWER PLANT",
            "HYDRO POWER PLANT",
            "NUCLEAR POWER PLANT",
            "GAS TURBINE",
            "THERMAL POWER PLANT"
        ],
        "unit": "UNIT IV",
        "domain": "mechanical",
        "syllabus_status": "in_syllabus"
    },

    # =========================
    # UNIT V — Refrigeration & Air Conditioning
    # =========================
    {
        "keywords": [
            "REFRIGERATION",
            "REFRIGERATOR",
            "REFRIGERANT",
            "TON OF REFRIGERATION",
            "COP",
            "COEFFICIENT OF PERFORMANCE",
            "VAPOUR COMPRESSION",
            "VAPOUR ABSORPTION",
            "AIR CONDITIONING",
            "DOMESTIC REFRIGERATOR",
            "WINDOW AIR CONDITIONER",
            "SPLIT AIR CONDITIONER"
        ],
        "unit": "UNIT V",
        "domain": "mechanical",
        "syllabus_status": "tagged_shallow"
    }
]
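Several of these keywords are short enough to occur inside unrelated words: "COP" is a substring of "SCOPE", and "RAFT" of "CRAFT". If the rules are applied with a plain substring test, such titles can land in the wrong unit. One possible mitigation, sketched below, is to require each keyword to begin at a word start. `match_unit` and the reduced table `RULES_DEMO` are hypothetical names used only for this illustration.

```python
import re

# Reduced, illustrative rule table; the real rules are in SYLLABUS_UNIT_RULES.
RULES_DEMO = [
    {"keywords": ["MOULD", "CASTING"], "unit": "UNIT III"},
    {"keywords": ["COP", "REFRIGERATION"], "unit": "UNIT V"},
]


def match_unit(title, rules):
    """Return the unit of the first rule whose keyword starts a word in the title."""
    title_upper = title.upper()
    for rule in rules:
        for keyword in rule["keywords"]:
            # \b before the keyword requires a word start; there is no trailing \b,
            # so inflected forms such as "MOULDING" still match "MOULD".
            if re.search(r"\b" + re.escape(keyword), title_upper):
                return rule["unit"]
    return "UNMAPPED"


print(match_unit("Scope of Civil Engineering", RULES_DEMO))  # UNMAPPED
print(match_unit("Moulding", RULES_DEMO))                    # UNIT III
print(match_unit("COP of a Refrigerator", RULES_DEMO))       # UNIT V
```

A leading boundary alone still lets "COP" match "COPPER", so very short keywords may need a full-word match or a manual review list.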


**Section Heading Detection**

In [5]:
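# Section heading detector

# Matches a dotted section number followed by a title, e.g. "1.9.4 Pattern Making".
SECTION_PATTERN = re.compile(r'^(\d+(\.\d+)*)\s+([A-Za-z][A-Za-z\s\-&,]+)$')


def is_valid_section_heading(line: str) -> bool:
    line = line.strip()
    if not SECTION_PATTERN.match(line):
        return False

    # Reject lines with math symbols or trig function names.
    # Caveats: this is a plain substring test, so titles that merely contain
    # "sin", "cos" or "tan" (e.g. "Processing", "Importance") are rejected too,
    # and hyphenated titles are rejected even though SECTION_PATTERN allows "-".
    if any(sym in line for sym in ['=', '+', '-', '*', '/', 'sin', 'cos', 'tan']):
        return False

    return True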
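Because the detector tests for `sin`, `cos` and `tan` as substrings, a heading such as "1.2 Importance of Surveying" is rejected (it contains "tan"). A possible alternative, sketched here under the assumption that trig names should only disqualify a line when they stand alone as words, is shown below. `is_heading_word_aware` is a hypothetical variant, not the function used to produce the results in this notebook, and it also accepts hyphenated titles.

```python
import re

# Same pattern as SECTION_PATTERN above, repeated so the sketch is self-contained.
HEADING_RE = re.compile(r'^(\d+(\.\d+)*)\s+([A-Za-z][A-Za-z\s\-&,]+)$')
# Trig names only count when they stand alone as words.
TRIG_WORD_RE = re.compile(r'\b(sin|cos|tan)\b', re.IGNORECASE)


def is_heading_word_aware(line):
    """Hypothetical variant: reject trig function names only as whole words."""
    line = line.strip()
    # The pattern itself already excludes '=', '+', '*' and '/', because the
    # title part only allows letters, whitespace, '-', '&' and ','.
    if not HEADING_RE.match(line):
        return False
    return TRIG_WORD_RE.search(line) is None


for candidate in ["1.2 Importance of Surveying", "2.1 sin x", "3.4 Two-Stroke Engine", "x = 2 + 3"]:
    print(candidate, "->", is_heading_word_aware(candidate))
```

Switching detectors would change which lines start a new chunk, so the chunk count and size statistics would need to be regenerated.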

**Chunk Object Schema (For Clarity)**

In [6]:
# Chunk schema reference (for clarity)

"""
Chunk structure:
{
  "chunk_id": int,
  "text": str,
  "unit": str,
  "chapter": str,
  "section": str,
  "section_number": str,
  "domain": str,
  "syllabus_status": str,
  "page_start": int,
  "page_end": int
}
"""

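The same schema can also be written as a `typing.TypedDict`, which lets type checkers and small helper functions refer to the field names. This is a sketch of one way to do it; the `Chunk` class and `missing_fields` helper are hypothetical and are not used elsewhere in this notebook.

```python
from typing import TypedDict


class Chunk(TypedDict):
    chunk_id: int
    text: str
    unit: str
    chapter: str
    section: str
    section_number: str
    domain: str
    syllabus_status: str
    page_start: int
    page_end: int


def missing_fields(chunk):
    """Return the schema keys absent from a chunk dict, in sorted order."""
    return sorted(set(Chunk.__annotations__) - set(chunk))


# Made-up partial chunk for illustration
example = {"chunk_id": 1, "text": "Cement is a binder.", "unit": "UNIT I"}
print(missing_fields(example))
```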

**Core Chunking Logic**

In [7]:
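# Semantic chunker with syllabus mapping & noise cleanup

def assign_syllabus_metadata(section_title: str):
    """Return (unit, domain, syllabus_status) for the first rule with a keyword in the title."""
    title_upper = section_title.upper()

    # Substring match, first matching rule wins. Short keywords can also match
    # inside longer words (e.g. "COP" in "SCOPE").
    for rule in SYLLABUS_UNIT_RULES:
        if any(k in title_upper for k in rule["keywords"]):
            return rule["unit"], rule["domain"], rule["syllabus_status"]

    # Default fallback (safe)
    return "UNMAPPED", "unknown", "excluded"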


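def clean_noise(text: str) -> str:
    """Strip exam references and figure captions from chunk text."""
    # Remove exam references like [May, June 2009; Nov, Dec 2012]
    text = re.sub(r'\[[A-Za-z,\s0-9;]+\]', '', text)

    # Remove figure captions: from "Fig. N" up to the next period or newline.
    # On space-joined chunk text this can also remove words that follow a
    # caption without an intervening period.
    text = re.sub(r'Fig\.\s*\d+(\.\d+)?[^.\n]*', '', text)
    # Remove any remaining figure references, in any letter case
    text = re.sub(r'fig\.\s*\d+(\.\d+)?', '', text, flags=re.IGNORECASE)

    return text.strip()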


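def semantic_chunker(cleaned_pages):
    """Group page lines into one chunk per detected section heading.

    Lines before the first heading are skipped, and a heading followed
    directly by another heading yields no chunk. "chapter" mirrors "unit",
    and page_end is the page on which the next heading was found.
    """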
    chunks = []

    current_section = None
    current_section_number = None
    buffer = []
    page_start = None

    current_unit = None
    current_domain = None
    syllabus_status = None

    chunk_id = 1

    for page in cleaned_pages:
        for line in page["clean_text"].splitlines():
            line = line.strip()
            if not line:
                continue

            # SECTION DETECTION
            if is_valid_section_heading(line):
                # Flush previous chunk
                if buffer and current_section:
                    chunks.append({
                        "chunk_id": chunk_id,
                        "text": clean_noise(" ".join(buffer)),
                        "unit": current_unit,
                        "chapter": current_unit,
                        "section": current_section,
                        "section_number": current_section_number,
                        "domain": current_domain,
                        "syllabus_status": syllabus_status,
                        "page_start": page_start,
                        "page_end": page["page_number"]
                    })
                    chunk_id += 1
                    buffer = []

                match = SECTION_PATTERN.match(line)
                current_section_number = match.group(1)
                current_section = match.group(3)

                # Assign syllabus metadata from section title
                current_unit, current_domain, syllabus_status = assign_syllabus_metadata(current_section)

                page_start = page["page_number"]
                continue

            # Normal content
            if current_section:
                buffer.append(line)

    # Flush last chunk
    if buffer and current_section:
        chunks.append({
            "chunk_id": chunk_id,
            "text": clean_noise(" ".join(buffer)),
            "unit": current_unit,
            "chapter": current_unit,
            "section": current_section,
            "section_number": current_section_number,
            "domain": current_domain,
            "syllabus_status": syllabus_status,
            "page_start": page_start,
            "page_end": cleaned_pages[-1]["page_number"]
        })

    return chunks
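In the chunker above, `page_end` is taken from the page on which the next heading appears, so a section that ends at the bottom of one page is reported as running onto the following page. A simplified sketch of an alternative is to record the last page that actually contributed a line of text. `chunk_with_last_content_page` is a hypothetical, reduced version: it omits the syllabus metadata, the noise cleanup and the math-symbol check, and the demo pages are made up.

```python
import re

HEADING_RE = re.compile(r'^(\d+(\.\d+)*)\s+([A-Za-z][A-Za-z\s\-&,]+)$')


def chunk_with_last_content_page(pages):
    """Simplified chunker: page_end is the last page that contributed text."""
    chunks = []
    current = None  # chunk being built, or None before the first heading

    for page in pages:
        for raw_line in page["clean_text"].splitlines():
            line = raw_line.strip()
            if not line:
                continue
            match = HEADING_RE.match(line)
            if match:
                # Emit the previous section only if it collected body text
                if current and current["lines"]:
                    chunks.append(current)
                current = {
                    "section_number": match.group(1),
                    "section": match.group(3),
                    "page_start": page["page_number"],
                    "page_end": page["page_number"],
                    "lines": [],
                }
            elif current is not None:
                current["lines"].append(line)
                current["page_end"] = page["page_number"]

    if current and current["lines"]:
        chunks.append(current)
    return chunks


demo_pages = [
    {"page_number": 10, "clean_text": "1.1 Cement\nCement is a binder."},
    {"page_number": 11, "clean_text": "1.2 Concrete\nConcrete is a composite."},
]
for ch in chunk_with_last_content_page(demo_pages):
    print(ch["section_number"], ch["page_start"], ch["page_end"])
```

For the demo input, section 1.1 is reported on page 10 only, whereas the next-heading rule would report pages 10 to 11.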

**Run Chunker**

In [8]:
# Generate chunks

chunks = semantic_chunker(cleaned_pages)

print(f"Total chunks created: {len(chunks)}")


Total chunks created: 397


**Inspect Sample Chunks**

In [9]:
# Inspect a fixed slice of chunks (indices 40 to 69)

for ch in chunks[40:70]:
    print("="*80)
    print(ch["unit"], ch["section_number"], ch["section"])
    print("Pages:", ch["page_start"], "-", ch["page_end"])
    print(ch["text"][:800])

UNIT III 1.9.4 pattern Making
Pages: 62 - 63
Considering the importance of the pattern in the metal casting process, it is necessary to select proper equipment, machines, tools and instruments for pattern making. These requirements depend on the pattern materials used and the method of making. Wooden pattern may be hand worked or machine worked. For making wooden patterns, utensils such as work benches, carpenter’s vice, circular saw, band saw, wood planer, disc sander, pattern maker’s lathe, pattern milling machine, and wood boring machine are required. For machining a metal pattern, traditional machines such as the lathe, milling machine, drilling machine, shaper, planner, and grinding machine are used. Scope of Civil Engineering 1.47
UNIT III 1.10 MoUldINg
Pages: 63 - 63
Once the pattern of correct shape and size of the casting is prepared, it is necessary to make a cavity with the help of a medium. The process of making this cavity of the desired shape and size on a medium is calle
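The section titles in this sample ("pattern Making", "MoUldINg") carry inconsistent capitalization from the extracted text. If a uniform display form is wanted, one option is to normalize the casing after chunking. The sketch below is a hypothetical helper, not part of the pipeline above; it lowercases a small, assumed list of connecting words and capitalizes everything else, which also means acronyms such as "IC" would become "Ic" unless handled separately.

```python
# Assumed list of words kept lowercase when they are not the first word
SMALL_WORDS = {"of", "and", "the", "in", "for", "to", "a", "an"}


def normalize_title(title):
    """Return the title with each word capitalized, except small connecting words."""
    normalized = []
    for index, word in enumerate(title.split()):
        lower = word.lower()
        if index > 0 and lower in SMALL_WORDS:
            normalized.append(lower)
        else:
            normalized.append(lower.capitalize())
    return " ".join(normalized)


print(normalize_title("MoUldINg"))
print(normalize_title("pattern Making"))
print(normalize_title("scope OF civil engineering"))
```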

**Chunk Size Validation**

In [10]:
# Chunk size statistics

lengths = [len(ch["text"].split()) for ch in chunks]

print(f"Min words : {min(lengths)}")
print(f"Max words : {max(lengths)}")
print(f"Avg words : {sum(lengths)//len(lengths)}")


Min words : 0
Max words : 8179
Avg words : 320
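A minimum of 0 words means at least one chunk is empty after noise cleanup, and a maximum of 8179 words is likely too long for a single embedding input. One way to handle both, sketched below, is to drop empty chunks and split long ones into overlapping word windows while copying the metadata to each part. `split_by_words` and `resize_chunks` are hypothetical helpers, the added `"part"` key is not in the chunk schema, and the default limits of 400 and 50 words are arbitrary assumptions rather than tuned values.

```python
def split_by_words(text, max_words=400, overlap=50):
    """Split text into windows of at most max_words words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return pieces


def resize_chunks(chunks, max_words=400, overlap=50):
    """Drop empty chunks and split long ones, copying the metadata to each part."""
    resized = []
    for chunk in chunks:
        # An empty text yields no pieces, so the chunk is dropped
        for part_index, piece in enumerate(split_by_words(chunk["text"], max_words, overlap)):
            resized.append({**chunk, "text": piece, "part": part_index})
    return resized


# Made-up chunks for illustration
demo = [
    {"chunk_id": 1, "text": ""},
    {"chunk_id": 2, "text": " ".join(f"w{i}" for i in range(10))},
]
for ch in resize_chunks(demo, max_words=4, overlap=1):
    print(ch["chunk_id"], ch["part"], ch["text"])
```

Parts of one section keep the same `chunk_id`, so a downstream step that needs unique identifiers would have to combine `chunk_id` and `part`.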


**Save Chunks (For Embedding Module)**

In [16]:
# Save chunks

import os

# SAVE_PATH below is a placeholder. It is an absolute local path, not a path
# under a mounted Google Drive (typically /content/drive/MyDrive).
# Update it to your desired folder in Google Drive, for example:
# SAVE_PATH = "/content/drive/MyDrive/QIP_IIIT_A/Learning/Projects/rag_bcm/data"
SAVE_PATH = "/drive/folders/1lcNu_R9sxFOqSpVr6v3P_notx0UCqks5"  # Placeholder path

output_filename = os.path.join(SAVE_PATH, "semantic_chunks.json")

# Ensure the directory exists before attempting to write the file
os.makedirs(SAVE_PATH, exist_ok=True)

with open(output_filename, "w", encoding="utf-8") as f:
    json.dump(chunks, f, indent=2, ensure_ascii=False)

print(f"Saved {output_filename}")

Saved /drive/folders/1lcNu_R9sxFOqSpVr6v3P_notx0UCqks5/semantic_chunks.json
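Before handing the file to the embedding module, it can be worth confirming that the JSON reloads to exactly what was written, including non-ASCII characters such as the typographic apostrophes in the text. The sketch below does this round trip in a temporary directory with a made-up chunk; on the real data the same comparison would use `chunks` and `output_filename`.

```python
import json
import os
import tempfile

# Made-up chunk for illustration; note the non-ASCII apostrophe
demo_chunks = [{"chunk_id": 1, "text": "Carpenter’s vice", "page_start": 62, "page_end": 63}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "semantic_chunks.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo_chunks, f, indent=2, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    round_trip_ok = reloaded == demo_chunks

print("Round trip OK:", round_trip_ok)
```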
