<div style="text-align: justify">

## Section 1. Introduction to the Problem/Task

**The Problem**
Navigating extensive legal and technical documents, such as the Philippine DOLE Occupational Safety and Health Standards (OSHS), presents a significant "information bottleneck." Finding specific compliance metrics, hazard guidelines, or equipment specifications via manual search is slow and prone to human error. Furthermore, while standard Large Language Models (LLMs) are highly capable conversational agents, they cannot be trusted with critical safety queries out of the box: they are prone to "hallucinating" technical facts and have no reliable knowledge of localized policy documents.

**Purpose and Domain Use Case**
The purpose of this project is to develop an LLM-powered chatbot tailored specifically to the domain of workplace safety policies and manuals. The intended use case is to serve as an interactive safety assistant for safety officers, employers, and workers. Users can query the system in natural language (e.g., "What are the required dimensions for a machine guard?") and the chatbot will instantly retrieve and synthesize the exact procedural guidelines and compliance protocols from the official DOLE OSHS text.

**Real-World Significance**
Building a retrieval-grounded conversational system (using a Retrieval-Augmented Generation, or RAG, pipeline) is critical for this application. By anchoring the LLM's responses to retrieved chunks of the official OSHS document, the system substantially reduces the risk of hallucination and makes each answer traceable to a citeable source passage. In a real-world setting, this system accelerates regulatory compliance, democratizes access to dense safety protocols, and ultimately helps mitigate workplace hazards by keeping accurate safety knowledge readily accessible.

</div>

## Section 2. Dataset Description

**Knowledge Source and Collection**
The primary knowledge source for this chatbot is the **Occupational Safety and Health Standards (OSHS) As Amended** handbook, issued by the Department of Labor and Employment (DOLE) of the Philippines. The document was acquired as a digital PDF (closed-corpus) and serves as the definitive legal and regulatory baseline for occupational safety in the country. 

**Dataset Structure**
* **Format:** Single PDF document (`Osh-Handbook.pdf`)
* **Domain:** Legal, Regulatory, and Occupational Health & Safety
* **Contents:** The document is highly structured, consisting of hierarchical legal frameworks (Rules, Sections, Sub-sections) alongside dense technical matrices (e.g., Threshold Limit Values for airborne contaminants, medical supply requirements).

**Preprocessing and Data Pipeline**
To ensure the LLM accurately retrieves and contextualizes the legal statutes without hallucination, standard naive chunking was discarded in favor of a **Structure-Aware Processing Pipeline**:

1.  **Document Cleaning:**
    * **Artifact Removal:** Page numbers, headers, and extraneous source tags (e.g., `--- PAGE X ---`) were stripped using regular expressions to reduce embedding noise.
    * **Hyphenation Merging:** Words split across line breaks by hyphens (e.g., "equip-ment") were systematically rejoined to maintain semantic integrity during vector search.
2.  **Handling Tables:**
    * Complex tables embedded within the PDF are extracted independently using `pdfplumber`. These tables are converted into Markdown format before embedding to preserve their row-column relationships, ensuring that specific numerical limits and chemical properties remain explicitly linked to their respective entities.
3.  **Structure-Aware Chunking & Metadata Tagging:**
    * The text is strictly partitioned using **Rule Numbers** (e.g., "Rule 1040") as the primary delimiters.
    * **Context Injection:** To prevent orphaned text chunks from losing their legal context, the specific Rule Number and Title are prepended as metadata to every sub-chunk generated from that section.
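
The chunking and context-injection steps above can be illustrated on a toy string. This is a minimal sketch rather than the notebook's pipeline: the sample wording, the helper names `split_by_rule` and `inject_context`, and the one-chunk-per-line simplification are all assumptions made for illustration.

```python
import re

# Toy text standing in for extracted OSHS pages (wording is illustrative, not quoted from the handbook)
sample_text = (
    "Rule 1040 Health and Safety Committee\n"
    "Every place of employment shall organize a committee.\n"
    "The committee shall meet regularly.\n"
    "Rule 1050 Notification and Keeping of Records\n"
    "Employers shall report accidents to the regional office."
)

def split_by_rule(text):
    """Split before each line starting with 'Rule NNNN'; return (rule_id, title, body) tuples."""
    # Prefix a newline so a heading on the very first line is split like the later ones
    sections = re.split(r"\n(?=Rule\s\d{4})", "\n" + text)
    results = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        title, _, body = section.partition("\n")
        match = re.match(r"Rule\s(\d{4})", title)
        rule_id = match.group(1) if match else "General"
        results.append((rule_id, title.strip(), body.strip()))
    return results

def inject_context(title, body_lines):
    """Prepend the rule title to every sub-chunk (here, one chunk per line for simplicity)."""
    return [f"[{title}]\n{line}" for line in body_lines if line.strip()]

chunks = []
for rule_id, title, body in split_by_rule(sample_text):
    chunks.extend(inject_context(title, body.split("\n")))

for chunk in chunks:
    print(chunk, end="\n\n")
```

Each printed chunk begins with its bracketed rule heading, so a chunk retrieved in isolation still identifies the rule it came from.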

## Section 2.1 Dataset Cleaning

#### Environment Setup and Imports
Run this cell first to install the required libraries and import the modules. `tabulate` is required by pandas to convert tables to Markdown (`DataFrame.to_markdown`).

In [None]:
# Install required libraries
!pip install pdfplumber langchain langchain-text-splitters pandas tabulate

# Import modules
import re
import pdfplumber
import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Define the file path (Ensure your PDF is uploaded to the Colab files section)
PDF_PATH = "Osh-Handbook.pdf"


#### Document Cleaning Utility
This cell defines the function used to strip out page numbers, source tags, and fix broken words.

In [None]:
def clean_text(text):
    """Removes PDF artifacts and merges hyphenated words."""
    if not text:
        return ""
    
    # Remove page artifacts like "--- PAGE 1 ---"
    text = re.sub(r'--- PAGE \d+ ---', '', text)
    
    # Merge hyphenated words across newlines (e.g., "work-\nplace" -> "workplace")
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)
    
    # Clean up excessive newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    return text.strip()
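
A quick check of the three cleaning steps on a made-up page string may help confirm the behavior. The sketch below repeats a compact copy of the function so it runs on its own; the sample text is hypothetical and not taken from the handbook.

```python
import re

def clean_text(text):
    """Compact copy of the cleaning steps, repeated so this example runs on its own."""
    if not text:
        return ""
    text = re.sub(r'--- PAGE \d+ ---', '', text)    # drop page markers
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)  # rejoin words split by a line-break hyphen
    text = re.sub(r'\n{3,}', '\n\n', text)          # collapse runs of blank lines
    return text.strip()

# Hypothetical raw page text containing all three artifact types
raw = "--- PAGE 12 ---\nAll equip-\nment shall be guarded.\n\n\n\nGuards shall be secured."
cleaned = clean_text(raw)
print(repr(cleaned))
```

One caveat of the hyphen rule: a genuinely hyphenated compound that happens to break at a line end (for example "self-\ncontained") is also joined into a single word, since the pattern cannot tell the two cases apart.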

#### Table Extraction
This cell handles extracting complex tables and converting them into Markdown so the LLM can understand the rows and columns.

In [None]:
def extract_tables_to_documents(pdf_path):
    print("Extracting tables...")
    table_documents = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for i, table in enumerate(tables):
                # Minimum content filter for tables
                if not table or len(table) < 2: 
                    continue 
                
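                # Harden headers: replace None/empty cells with a placeholder and flatten newlines
                raw_headers = [
                    str(col).replace("\n", " ").strip() if col else f"Col_{j}"
                    for j, col in enumerate(table[0])
                ]
                
                # Deduplicate headers if PDF parsing produced repeats (e.g., two columns named "Limit").
                # A running counter keeps names unique even when a header appears three or more times.
                seen_headers = {}
                headers = []
                for name in raw_headers:
                    count = seen_headers.get(name, 0)
                    seen_headers[name] = count + 1
                    headers.append(name if count == 0 else f"{name}_dup{count}")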
                
                try:
                    df = pd.DataFrame(table[1:], columns=headers).dropna(how='all')
                    df = df.fillna("") 
                    md_table = df.to_markdown(index=False)
                    
                    # Noise filter: skip tables whose Markdown rendering is too short to be useful
                    if len(md_table.strip()) < 50:
                        continue
                        
                    # Create structured LangChain Document
                    doc = Document(
                        page_content=f"[Table Extracted from Page {page_num + 1}]\n{md_table}",
                        metadata={
                            "source": "Osh-Handbook.pdf",
                            "page": page_num + 1,
                            "type": "table",
                            "table_index": i
                        }
                    )
                    table_documents.append(doc)
                except Exception as e:
                    print(f"Skipped broken table on page {page_num + 1}: {e}")
                
    print(f"Successfully extracted {len(table_documents)} table documents.")
    return table_documents
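
To make the table-to-Markdown idea concrete without any PDF or third-party dependency, here is a small sketch that works on the list-of-lists shape returned by `page.extract_tables()`. The helpers `normalize_headers` and `rows_to_markdown` are hypothetical names introduced for illustration, the table values are placeholders, and the hand-written renderer stands in for pandas' `DataFrame.to_markdown`.

```python
def normalize_headers(header_row):
    """Replace missing header cells with placeholders and suffix repeated names."""
    seen = {}
    headers = []
    for j, cell in enumerate(header_row):
        name = str(cell).replace("\n", " ").strip() if cell else f"Col_{j}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        headers.append(name if count == 0 else f"{name}_dup{count}")
    return headers

def rows_to_markdown(headers, rows):
    """Render a pipe table by hand (a stand-in for DataFrame.to_markdown)."""
    def fmt(cells):
        cleaned = ("" if c is None else str(c).replace("\n", " ") for c in cells)
        return "| " + " | ".join(cleaned) + " |"
    lines = [fmt(headers), "| " + " | ".join("---" for _ in headers) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)

# Hypothetical table: rows are lists of strings, with None for empty cells.
# The values are placeholders, not figures from the handbook.
toy_table = [
    ["Substance", "Limit", "Limit", None],
    ["Substance A", "10", "20", "note"],
    ["Substance B", None, "5", ""],
]

headers = normalize_headers(toy_table[0])
md_table = rows_to_markdown(headers, toy_table[1:])
print(md_table)
```

Keeping every value on the same row as its entity name is what lets a retrieved table chunk answer questions such as "what is the limit for Substance B" without relying on surrounding text.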

#### Text Extraction & Structure-Aware Chunking
This is the core logic. It reads the text, splits it by DOLE Rules, and prepends the Rule Title to every sub-chunk so context is never lost.

In [None]:
def process_dole_rules_to_documents(pdf_path):
    print("Extracting and cleaning text...")
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    
    text_documents = []
    seen_chunks = set()  # Deduplicate by (rule_id, chunk)
    
    # Carry the active rule across pages so continuation pages keep their legal context
    current_rule_id = "General"
    current_rule_title = "General OSHS Provision"
    
    print("Chunking rules and assigning metadata...")
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            extracted = page.extract_text()
            if not extracted:
                continue
            
            cleaned_page_text = clean_text(extracted)
            if not cleaned_page_text:
                continue
            
            # Split before any line that starts with "Rule NNNN"
            rule_splits = re.split(r'(?i)\n(?=Rule\s\d{4})', cleaned_page_text)
            
            for section in rule_splits:
                section = section.strip()
                if not section:
                    continue
                
                first_line = section.split('\n')[0].strip()
                rule_match = re.match(r'(?i)Rule\s(\d{4})', first_line)
                
                # Update the active rule before filtering, so a short heading still sets context
                if rule_match:
                    current_rule_id = rule_match.group(1)
                    current_rule_title = first_line
                
                # Minimum content filter
                if len(section) < 50:
                    continue
                
                # Sections without their own heading inherit the most recent rule
                rule_id = current_rule_id
                rule_title = current_rule_title
                
                sub_chunks = text_splitter.split_text(section)
                
                for chunk in sub_chunks:
                    normalized_chunk = chunk.strip()
                    
                    # Minimum content filter per chunk
                    if len(normalized_chunk.split()) < 10:  # Skip chunks with fewer than 10 words
                        continue
                    
                    # Deduplication check (rule-aware)
                    dedup_key = (rule_id, normalized_chunk)
                    if dedup_key in seen_chunks:
                        continue
                    seen_chunks.add(dedup_key)
                        
                    # Create LangChain Document with rich metadata
                    doc = Document(
                        page_content=f"[{rule_title}]\n{normalized_chunk}",
                        metadata={
                            "source": "Osh-Handbook.pdf",
                            "rule_id": rule_id,
                            "type": "text",
                            "page": page_num
                        }
                    )
                    text_documents.append(doc)
            
    print(f"Generated {len(text_documents)} structured text documents.")
    return text_documents

# Execution
tables_docs = extract_tables_to_documents(PDF_PATH)
text_docs = process_dole_rules_to_documents(PDF_PATH)
all_knowledge_base_docs = text_docs + tables_docs

# Preview the rich metadata
print("\n--- Document Object Preview ---")
if all_knowledge_base_docs:
    preview_index = min(50, len(all_knowledge_base_docs) - 1)
    print(f"Content: {all_knowledge_base_docs[preview_index].page_content[:100]}...")
    print(f"Metadata: {all_knowledge_base_docs[preview_index].metadata}")
else:
    print("No documents were generated. Check PDF path and extraction logic.")

#### Combine & Final Check
Run this cell to combine your extracted tables and text chunks into one unified knowledge base list, ready to be embedded and stored in ChromaDB in your next steps.

In [None]:
# Combine text and table documents
all_knowledge_base_docs = text_docs + tables_docs

print(f"Total Text Docs: {len(text_docs)}")
print(f"Total Table Docs: {len(tables_docs)}")
print(f"Total Combined Docs ready for Vector DB: {len(all_knowledge_base_docs)}")

# This list 'all_knowledge_base_docs' is what you will pass to your embedding model!
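
Before embedding, it can be useful to look at how the documents are distributed across types and rules. The sketch below is one possible sanity check, not part of the original notebook: `Doc` is a hypothetical stand-in that exposes only the `page_content` and `metadata` attributes used here, `summarize` is an illustrative helper name, and the sample documents are placeholders.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes used from LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize(docs):
    """Count documents by type and by rule_id (tables carry no rule_id, so they fall under 'n/a')."""
    by_type = Counter(d.metadata.get("type", "unknown") for d in docs)
    by_rule = Counter(d.metadata.get("rule_id", "n/a") for d in docs)
    return by_type, by_rule

# Placeholder documents shaped like the ones produced by the cells above
sample_docs = [
    Doc("[Rule 1040]\nsome text", {"type": "text", "rule_id": "1040", "page": 3}),
    Doc("[Rule 1040]\nmore text", {"type": "text", "rule_id": "1040", "page": 4}),
    Doc("[Table Extracted from Page 9]\n| a | b |", {"type": "table", "page": 9}),
]

by_type, by_rule = summarize(sample_docs)
print(dict(by_type))
print(dict(by_rule))
```

In the notebook, the same function could be called as `summarize(all_knowledge_base_docs)`. An unexpectedly large "General" count would suggest that the rule-heading pattern is missing sections of the PDF.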