<a href="https://colab.research.google.com/github/Pradxpk-88/RAG/blob/main/RAG_Phase_2_01_pdf_loader_and_cleaner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Environment Setup**

In [8]:
# Install dependencies
!pip install pymupdf



**Imports**

In [9]:
# Cell 2: Imports
import fitz  # PyMuPDF
import re
from collections import Counter

In [10]:
# Path of the uploaded PDF in the Colab runtime (/content is local runtime storage, not Google Drive)
SAVE_PATH = "/content/Basic Civil and Mechanical Engineering.pdf"

**Upload PDF to Colab**

In [11]:
from google.colab import files

uploaded = files.upload()

Saving Basic Civil and Mechanical Engineering.pdf to Basic Civil and Mechanical Engineering.pdf


**PDF Loader (Page-wise Extraction)**

In [12]:
def load_pdf_pagewise(pdf_path: str) -> list[dict]:
    """
    Extract raw text page-by-page from a PDF.

    Returns one dict per page with a 1-based "page_number" and its "raw_text".
    """
    pages = []

    # The context manager closes the document even if extraction raises
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            pages.append({
                "page_number": page_index + 1,  # 1-based indexing
                "raw_text": page.get_text("text"),
            })

    return pages


**Test PDF Loader (RAW INSPECTION)**

In [13]:
PDF_PATH = list(uploaded.keys())[0]
raw_pages = load_pdf_pagewise(PDF_PATH)

print(f"Total pages extracted: {len(raw_pages)}")
print("\n--- SAMPLE RAW PAGE ---\n")
print(raw_pages[16]["raw_text"][:1500])  # inspect page 17

Total pages extracted: 458

--- SAMPLE RAW PAGE ---

chapter 1
Scope of cIvIl eNgINeerINg
1.1 cIvIl eNgINeerINg
Civil Engineering is the field of engineering concerned with planning, design and 
construction for environmental control, development of natural resources, buildings, 
transportation facilities and other structures required for health, welfare, safety, 
employment and pleasure of mankind.
 
The main scope of civil engineering or the task of civil engineering is planning, 
designing, estimating, supervising construction, execution, and maintenance of structures 
like building, roads, bridges, dams, etc.
 
Population demographics along with increasing urbanization have facilitated the need 
for sustainable and efficient infrastructure solutions. Development in green buildings, 
sensor-embedded roads and buildings, geopolymer concrete, and water management 
will stimulate global civil engineering industry growth.
1.1.1 field of civil engineering
Civil engineering is a wide fiel

**Cleaner Utilities**

In [14]:
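def normalize_whitespace(text: str) -> str:
    """Normalize line endings, then collapse blank-line runs and repeated spaces/tabs."""
    # Treat "\r\n" as one line break so Windows line endings are not doubled
    text = re.sub(r'\r\n?', '\n', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]{2,}', ' ', text)
    return text.strip()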


def detect_repeated_lines(pages):
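    """
    Detect common headers and footers by frequency.

    A line counts as a header (footer) when it is the first (last) non-empty
    line on more than 50% of all pages. Pages with two or fewer non-empty
    lines are skipped.
    """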
    first_lines = []
    last_lines = []

    for p in pages:
        lines = [l.strip() for l in p["raw_text"].splitlines() if l.strip()]
        if len(lines) > 2:
            first_lines.append(lines[0])
            last_lines.append(lines[-1])

    header_candidates = Counter(first_lines)
    footer_candidates = Counter(last_lines)

    headers = {
        line for line, count in header_candidates.items()
        if count > len(pages) * 0.5
    }
    footers = {
        line for line, count in footer_candidates.items()
        if count > len(pages) * 0.5
    }

    return headers, footers
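
The frequency test in `detect_repeated_lines` only matches lines that are identical across pages, so a footer that embeds the page number (for example `Page 12`) would not reach the 50% threshold. The following is a minimal sketch of one possible workaround, assuming it is acceptable to replace digit runs with a placeholder before counting. `line_signature` and `detect_repeated_signatures` are hypothetical helpers, not part of this notebook, and the demo pages are made up.

```python
import re
from collections import Counter


def line_signature(line: str) -> str:
    """Collapse digit runs so 'Page 12' and 'Page 13' share one signature."""
    return re.sub(r"\d+", "#", line.strip())


def detect_repeated_signatures(pages, threshold: float = 0.5):
    """Return signatures of first/last lines seen on more than `threshold` of pages."""
    first, last = Counter(), Counter()
    for p in pages:
        lines = [l.strip() for l in p["raw_text"].splitlines() if l.strip()]
        if len(lines) > 2:
            first[line_signature(lines[0])] += 1
            last[line_signature(lines[-1])] += 1

    cutoff = len(pages) * threshold
    headers = {sig for sig, n in first.items() if n > cutoff}
    footers = {sig for sig, n in last.items() if n > cutoff}
    return headers, footers


# Synthetic pages (illustrative only): same header, numbered footer
demo_pages = [
    {"page_number": i,
     "raw_text": f"Basic Engineering\nBody text {i}\nMore text\nPage {i}"}
    for i in range(1, 5)
]
demo_headers, demo_footers = detect_repeated_signatures(demo_pages)
print(demo_headers, demo_footers)
```

To remove such lines, each page line would have to be compared by its signature rather than by its literal text.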

In [15]:
def normalize_heading_case(text: str) -> str:
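    """
    Fix mixed-case headings caused by PDF font issues.
    Applies ONLY to short, title-like lines: under 80 characters, with more
    than 40% of the characters uppercase. Headings below that ratio, such as
    "uNIt 3 buIldINg CoMPoNeNts ANd struCtures", are left unchanged.
    """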
    lines = text.splitlines()
    fixed_lines = []

    for line in lines:
        stripped = line.strip()

        # Heuristic: short line + many uppercase letters
        if (
            len(stripped) < 80 and
            sum(c.isupper() for c in stripped) > len(stripped) * 0.4
        ):
            fixed_lines.append(stripped.title())
        else:
            fixed_lines.append(line)

    return "\n".join(fixed_lines)
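
The 40% uppercase ratio misses headings such as `uNIt 3 buIldINg CoMPoNeNts ANd struCtures`, where only about a third of the characters are uppercase. A possible alternative, sketched here under the assumption that the font issue shows up as a lowercase letter directly followed by an uppercase one inside a word, is to count such words instead. `looks_case_scrambled` and `fix_scrambled_headings` are hypothetical names, and the `min_words` and `min_ratio` defaults are guesses that would need tuning against the full book.

```python
import re


def looks_case_scrambled(line: str, min_words: int = 2, min_ratio: float = 0.5) -> bool:
    """True when enough words contain a lowercase letter followed by an uppercase one."""
    # Ignore tokens without letters, such as section numbers like "4.1"
    words = [w for w in line.split() if any(c.isalpha() for c in w)]
    if not words:
        return False
    scrambled = sum(1 for w in words if re.search(r"[a-z][A-Z]", w))
    return scrambled >= min_words and scrambled / len(words) >= min_ratio


def fix_scrambled_headings(text: str, max_len: int = 80) -> str:
    """Title-case short lines flagged by looks_case_scrambled; leave the rest untouched."""
    fixed = []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) < max_len and looks_case_scrambled(stripped):
            fixed.append(stripped.title())
        else:
            fixed.append(line)
    return "\n".join(fixed)


sample = "uNIt 3 buIldINg CoMPoNeNts ANd struCtures\n4.1 Selection of Site 4.1"
print(fix_scrambled_headings(sample))
```

Requiring at least two scrambled words keeps a single legitimate mixed-case name (for example a publisher name) from triggering the rewrite.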

**Page Cleaner**

In [16]:
def clean_pages(pages):
    headers, footers = detect_repeated_lines(pages)
    repeated = headers | footers
    cleaned = []

    for p in pages:
        # Drop whole lines that match a detected header/footer or that hold
        # only a page number. Matching whole lines (instead of a substring
        # replace) leaves body text containing the same words untouched.
        kept = [
            line for line in p["raw_text"].splitlines()
            if line.strip() not in repeated
            and not re.fullmatch(r'\d+', line.strip())
        ]
        text = "\n".join(kept)

        text = normalize_whitespace(text)
        text = normalize_heading_case(text)

        cleaned.append({
            "page_number": p["page_number"],
            "clean_text": text
        })

    return cleaned

**Run Cleaner**

In [17]:
cleaned_pages = clean_pages(raw_pages)

print("Cleaning complete.")
print("\n--- SAMPLE CLEAN PAGE ---\n")
print(cleaned_pages[10]["clean_text"][:1500])

Cleaning complete.

--- SAMPLE CLEAN PAGE ---

Contents
xi
uNIt 3 buIldINg CoMPoNeNts ANd struCtures
 
4. Foundation 
4.1–4.31
 
4.1 Selection of Site 4.1
 
4.2 Substructure 4.2
 
4.3 Objectives of a Foundation 4.2
 
4.4 Site Inspection 4.3
 
4.5 Soils 4.3
 
4.6 Loads on Foundations 4.6
 
4.7 Essential Requirements of a Good Foundation 4.7
 
4.8 Types of Foundation 4.7
 
4.9 Caisson Foundation or Well Foundation 4.16
 
4.10 Failure of Foundations and Remedial Measures 4.17
 
4.11 Foundations for Machinery 4.18
 
4.12 Foundations for Special Structures 4.21
 
 Short-Answer Questions 4.30
 
 Exercises 4.31
 
5. Superstructure 
5.1–5.66
 
5.1 Introduction 5.1
 
5.2 Brick Masonry 5.1
 
5.3 Stone Masonry 5.9
 
5.4 RCC Structural Members 5.18
 
5.5 Columns 5.23
 
5.6 Lintels 5.25
 
5.7 Roofing 5.28
 
5.8 Flooring 5.40
 
5.9 Damp-Proofing 5.51
 
5.10 Plastering 5.54
 
5.11 Valuation 5.57
 
 Illustrative Examples 5.61
 
 Short-Answer Questions 5.63
 
 Exercises 5.65
 
6. Bridges 
6.1–6.18
 
6.

**Manual Validation**

In [18]:
# Cell 9: Manual inspection loop

for i in [10, 30, 60, 100]:
    print("=" * 90)
    print(f"PAGE {cleaned_pages[i]['page_number']}")
    print("=" * 90)
    print(cleaned_pages[i]["clean_text"][:1200])
    print("\n\n")

PAGE 11
Contents
xi
uNIt 3 buIldINg CoMPoNeNts ANd struCtures
 
4. Foundation 
4.1–4.31
 
4.1 Selection of Site 4.1
 
4.2 Substructure 4.2
 
4.3 Objectives of a Foundation 4.2
 
4.4 Site Inspection 4.3
 
4.5 Soils 4.3
 
4.6 Loads on Foundations 4.6
 
4.7 Essential Requirements of a Good Foundation 4.7
 
4.8 Types of Foundation 4.7
 
4.9 Caisson Foundation or Well Foundation 4.16
 
4.10 Failure of Foundations and Remedial Measures 4.17
 
4.11 Foundations for Machinery 4.18
 
4.12 Foundations for Special Structures 4.21
 
 Short-Answer Questions 4.30
 
 Exercises 4.31
 
5. Superstructure 
5.1–5.66
 
5.1 Introduction 5.1
 
5.2 Brick Masonry 5.1
 
5.3 Stone Masonry 5.9
 
5.4 RCC Structural Members 5.18
 
5.5 Columns 5.23
 
5.6 Lintels 5.25
 
5.7 Roofing 5.28
 
5.8 Flooring 5.40
 
5.9 Damp-Proofing 5.51
 
5.10 Plastering 5.54
 
5.11 Valuation 5.57
 
 Illustrative Examples 5.61
 
 Short-Answer Questions 5.63
 
 Exercises 5.65
 
6. Bridges 
6.1–6.18
 
6.1 Introduction 6.1
 
6.2 Necessity of B
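
Reading a handful of pages by eye can be complemented with a coarse numeric check. The sketch below, which assumes the `page_number` / `raw_text` / `clean_text` record layout used in this notebook, reports how many characters the cleaner removed and which pages ended up empty. `summarize_cleaning` is a hypothetical helper and the demo records are made up.

```python
def summarize_cleaning(raw_pages, cleaned_pages):
    """Compare raw and cleaned pages and report simple size statistics."""
    raw_by_number = {p["page_number"]: p["raw_text"] for p in raw_pages}
    empty_pages = []
    raw_chars = 0
    clean_chars = 0

    for p in cleaned_pages:
        raw_chars += len(raw_by_number[p["page_number"]])
        clean_chars += len(p["clean_text"])
        if not p["clean_text"].strip():
            empty_pages.append(p["page_number"])

    return {
        "pages": len(cleaned_pages),
        "empty_pages": empty_pages,
        "raw_chars": raw_chars,
        "clean_chars": clean_chars,
    }


# Made-up records for illustration
demo_raw = [
    {"page_number": 1, "raw_text": "Header\nSome body text\n1\n"},
    {"page_number": 2, "raw_text": "\n\n"},
]
demo_clean = [
    {"page_number": 1, "clean_text": "Some body text"},
    {"page_number": 2, "clean_text": ""},
]
report = summarize_cleaning(demo_raw, demo_clean)
print(report)
```

In the notebook this would be called as `summarize_cleaning(raw_pages, cleaned_pages)`; an unexpectedly large drop in character count or a long list of empty pages suggests the cleaner removed body text.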

**Save Output (For Next Module)**

In [21]:
import json
import os

# Write the cleaned pages as JSON into /content/ so the next module can load them.
# ensure_ascii=False keeps non-ASCII characters (such as en dashes) readable in the file.
output_filename = os.path.join("/content/", "cleaned_pages.json")

with open(output_filename, "w", encoding="utf-8") as f:
    json.dump(cleaned_pages, f, indent=2, ensure_ascii=False)

print(f"Saved {output_filename}")

Saved /content/cleaned_pages.json
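
The next module has to read this file back. A minimal sketch of a loader with a basic key check follows, assuming the record layout written above (`page_number`, `clean_text`). `load_cleaned_pages` is a hypothetical helper, and the round trip runs on a temporary file with made-up data rather than on `/content/cleaned_pages.json`.

```python
import json
import os
import tempfile


def load_cleaned_pages(path: str):
    """Load the JSON page list and check that each record has the expected keys."""
    with open(path, "r", encoding="utf-8") as f:
        pages = json.load(f)
    for p in pages:
        if not {"page_number", "clean_text"}.issubset(p):
            raise ValueError(f"Unexpected page record: {p!r}")
    return pages


# Round-trip demo on a temporary file (illustrative data only)
demo = [{"page_number": 1, "clean_text": "4. Foundation 4.1–4.31"}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cleaned_pages.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f, indent=2, ensure_ascii=False)
    reloaded = load_cleaned_pages(demo_path)

print(reloaded == demo)
```

Opening the file with `encoding="utf-8"` on both sides matters because `ensure_ascii=False` writes non-ASCII characters verbatim.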
