# Legal Document Processing Pipeline - Text Refinement & JSON Conversion

## **Overview**
This notebook implements a comprehensive pipeline for processing legal documents (laws, decrees, and transitional provisions) from PDF format to structured JSON. The pipeline handles Mexican state legislation, performing text extraction, cleaning, structural analysis, and hierarchical parsing.

## **Processing Workflow**
1. **PDF Text Extraction** - Extract raw text from PDF documents
2. **Text Cleaning** - Clean and normalize extracted text  
3. **Document Splitting** - Separate decreto, ley, and transitorios sections
4. **JSON Structuring** - Parse hierarchical structure (libros, títulos, capítulos, secciones, artículos)
5. **Validation & Error Handling** - Detect structural issues and validate article sequences

## **Input/Output**
- **Input**: PDF files in `Raw/` directory with corresponding catalog CSV
- **Output**: Structured JSON files with hierarchical legal document representation
- **Logs**: Comprehensive error tracking and validation reports

---

## **Required Dependencies & Imports**

In [3]:
# Core Python libraries
from __future__ import annotations          # Enable forward type references
import json                                 # JSON serialization for output files
import re                                   # Regular expressions for text pattern matching
import statistics                           # Statistical calculations for text analysis

# Data structures and type hints
from dataclasses import dataclass, field    # Structured data classes for law metadata
from pathlib import Path                    # Cross-platform file system path handling
from typing import Dict, List, Optional, Tuple, Union, Set, Any  # Type annotations

# External libraries
#%pip install PyMuPDF pandas unidecode
import fitz                                 # PyMuPDF - PDF text extraction library
import pandas as pd                         # Data manipulation and CSV handling
from unidecode import unidecode             # Unicode normalization and accent removal

## **Directory Structure & Data Flow Configuration**

### **Input Directories**
- **`Raw/`** - Source PDF files containing legal documents
- **`index.csv`** - Catalog mapping file numbers to law metadata

### **Processing Pipeline Directories**
- **`temp/raw_txt/`** - Step 1: Raw text extracted from PDFs  
- **`temp/clean/`** - Step 2: Cleaned and normalized text
- **`leyes/`** - Step 3: Law sections (main legal content)
- **`decretos/`** - Step 3: Decree sections (government decisions)
- **`transitorios/`** - Step 3: Transitional provisions

### **Output Directories**  
- **`json/`** - Final structured JSON files
- **`errores/`** - Error logs and validation reports

This configuration ensures a clear separation of processing stages and enables easy debugging and quality control.

In [4]:
# ============== Directory Configuration ==============

# Base working directory (current notebook location)
BASE_DIR     = Path.cwd()

# === INPUT PATHS ===
RAW_DIR      = BASE_DIR / "Raw"                 # Source PDF files location
CATALOG_CSV  = RAW_DIR / "index.csv"           # Metadata catalog for PDFs

# === OUTPUT ROOT ===
OUTPUT_DIR   = BASE_DIR / "Refined"             # All processed outputs go here
TEMP_DIR     = BASE_DIR / "temp"                # Temporary processing files

# === INTERMEDIATE PROCESSING DIRECTORIES ===
RAW_TXT_DIR  = OUTPUT_DIR / TEMP_DIR / "raw_txt"   # Step 1: Raw PDF text extraction
CLEAN_DIR    = OUTPUT_DIR / TEMP_DIR / "clean"     # Step 2: Cleaned text files

# === DOCUMENT TYPE SEPARATION (Step 3 outputs) ===
LEY_DIR      = OUTPUT_DIR / "leyes"             # Main law content
DECR_DIR     = OUTPUT_DIR / "decretos"          # Government decree sections  
TRANS_DIR    = OUTPUT_DIR / "transitorios"      # Transitional provisions

# === FINAL OUTPUTS ===
JSON_DIR     = OUTPUT_DIR / "json"              # Structured JSON documents
ERRORES_DIR  = OUTPUT_DIR / "errores"           # Error logs and validation reports

# Create all necessary directories (parents=True creates nested paths)
for d in [OUTPUT_DIR, RAW_TXT_DIR, TEMP_DIR, CLEAN_DIR, LEY_DIR, DECR_DIR, TRANS_DIR, JSON_DIR, ERRORES_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print(f"Directory structure created successfully!")
print(f"Base directory: {BASE_DIR}")
print(f"Input PDFs: {RAW_DIR}")  
print(f"Catalog: {CATALOG_CSV}")
print(f"Final JSON output: {JSON_DIR}")

Directory structure created successfully!
Base directory: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14
Input PDFs: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Raw
Catalog: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Raw\index.csv
Final JSON output: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Refined\json


## **Utility Functions & Text Processing Tools**

### **Core Functionality**
- **Text Normalization** - Clean and standardize text for consistent processing
- **File I/O Operations** - Safe file writing with UTF-8 encoding
- **Error Logging** - Structured error reporting with contextual information  
- **Statistical Analysis** - Count lines, words, and characters for quality metrics
- **String Sanitization** - Create safe filenames and normalize accented characters

These utilities ensure robust text processing and provide comprehensive error tracking throughout the pipeline.

In [5]:
# ============== Text Processing Utilities ==============

def slugify(s: str) -> str:
    """
    Convert text to URL-safe slug format.
    - Removes accents using unidecode
    - Converts to lowercase
    - Replaces non-alphanumeric chars with underscores]
    - Collapses multiple underscores to single ones
    """
    s = unidecode(s).lower()                    # Remove accents, convert to lowercase
    s = re.sub(r"[^a-z0-9]+", "_", s)          # Replace non-alphanumeric with underscores
    return re.sub(r"_+", "_", s).strip("_") or "x"  # Clean up multiple underscores

def norm_lower(s: str) -> str:
    """Normalize text: remove accents, lowercase, collapse whitespace."""
    return re.sub(r"\s+", " ", unidecode(s).lower().strip())

def caps_line(s: str) -> str:
    """Convert text to uppercase and remove accents for header matching."""
    return unidecode(s).upper().strip()

def write_text(path: Path, content: str) -> None:
    """Safely write text content to file with UTF-8 encoding."""
    path.write_text(content, encoding="utf-8")

def write_error(base: str, kind: str, message: str, extra: Dict | None = None) -> None:
    """
    Log structured error information to JSON file.
    
    Args:
        base: File identifier (e.g., '0001')
        kind: Error category (e.g., 'catalog_missing', 'parse_error')
        message: Human-readable error description
        extra: Additional context data
    """
    rec = {"file": base, "kind": kind, "message": message}
    if extra:
        rec.update(extra)
    
    # Create safe filename for error log
    error_filename = f"{slugify(base)}_{slugify(kind)}.json"
    (ERRORES_DIR / error_filename).write_text(
        json.dumps(rec, ensure_ascii=False, indent=2), encoding="utf-8"
    )

def count_stats(text: str) -> Dict[str, int]:
    """
    Calculate text statistics for quality metrics.
    
    Returns:
        Dictionary with 'lines', 'words', and 'chars' counts
    """
    return {
        "lines": text.count("\n") + (1 if text else 0),  # Count newlines + 1
        "words": len(re.findall(r"\S+", text)),           # Count non-whitespace sequences
        "chars": len(text)                                # Total character count
    }

def _norm_caps(s: str) -> str:
    """
    Internal helper: normalize text to uppercase alphanumeric with spaces.
    Used for header pattern matching.
    """
    t = unidecode(s).upper()                    # Remove accents, uppercase
    t = re.sub(r"[^A-Z0-9]+", " ", t)          # Keep only letters/numbers
    return re.sub(r"\s+", " ", t).strip()       # Collapse whitespace

In [6]:
# ============== Law Metadata & Catalog Management ==============

@dataclass(frozen=True)
class LawMeta:
    """
    Immutable metadata container for legal documents.
    
    Attributes:
        num_est: State number identifier
        file_num: Zero-padded file number (e.g., '0001')
        law_name: Full name of the law
        link: Source URL or reference link
        first_two_caps: Auto-generated uppercase version of first two words
                       (used for document structure detection)
    """
    num_est: str
    file_num: str  
    law_name: str
    link: str
    first_two_caps: str = field(init=False)  # Computed automatically

    def __post_init__(self):
        """
        Automatically extract and normalize the first two words of law name.
        This is used for identifying the law title within document text.
        """
        # Tokenize and normalize the law name
        toks = [t for t in norm_lower(self.law_name).split() if t]
        first_two = " ".join(toks[:2]) if toks else ""
        
        # Set the computed field (frozen dataclass requires object.__setattr__)
        object.__setattr__(self, "first_two_caps", first_two.upper())

def load_catalog(path: Path) -> Dict[str, LawMeta]:
    """
    Load law metadata from CSV catalog file.
    
    Args:
        path: Path to catalog CSV file
        
    Returns:
        Dictionary mapping file_num to LawMeta objects
        
    Raises:
        ValueError: If required columns are missing from CSV
    """
    # Read CSV with string dtype to preserve leading zeros
    df = pd.read_csv(path, dtype=str, keep_default_na=False)
    df.columns = [c.lower() for c in df.columns]  # Normalize column names
    
    # Verify required columns exist
    req = {"num_est", "file_num", "law_name", "link"}
    miss = req - set(df.columns)
    if miss:
        raise ValueError(f"CSV missing required columns: {miss}")
    
    # Build catalog dictionary
    out: Dict[str, LawMeta] = {}
    for _, r in df.iterrows():
        meta = LawMeta(
            num_est=(r["num_est"] or "").strip(),
            file_num=(r["file_num"] or "").strip().zfill(4),  # Ensure 4-digit padding
            law_name=(r["law_name"] or "").strip(),
            link=(r["link"] or "").strip(),
        )
        out[meta.file_num] = meta
    
    return out

## **Step 1: PDF Text Extraction**

### **Purpose**
Extract raw text content from PDF legal documents using PyMuPDF (fitz). This step converts binary PDF files into plain text while preserving layout and structure as much as possible.

### **Process**
1. **PDF Reading** - Open each PDF file in the Raw directory
2. **Page-by-Page Extraction** - Extract text from each page sequentially  
3. **Text Concatenation** - Combine all pages with double newlines as separators
4. **File Output** - Save raw text to `temp/raw_txt/` directory
5. **Statistics Tracking** - Record file processing metrics and errors

### **Quality Control**
- Validates files against catalog metadata
- Tracks processing statistics (lines, words, characters)
- Logs missing files and extraction errors
- Generates manifest for downstream processing

In [7]:
def read_pdf(pdf_path: Path) -> str:
    """
    Extract text content from a PDF file using PyMuPDF.
    
    Args:
        pdf_path: Path to the PDF file to process
        
    Returns:
        Concatenated text content from all pages
        
    Note:
        - Processes pages sequentially to maintain document order
        - Adds double newlines between pages for section separation
        - Handles multi-page documents automatically
    """
    with fitz.open(pdf_path) as pdf_file:
        text_content = ""
        
        # Process each page in order
        for page_num in range(len(pdf_file)):
            page = pdf_file[page_num]
            text = page.get_text()                    # Extract plain text
            text_content += text + "\n\n"             # Add page separator

    return text_content

In [8]:
def step1_extract_raw(catalog_csv: Path = CATALOG_CSV) -> pd.DataFrame:
    """
    Step 1: Extract raw text from all PDF files in the Raw directory.
    
    Process:
    1. Load catalog metadata for file validation
    2. Find all PDF files in Raw directory  
    3. Extract text from each PDF using PyMuPDF
    4. Save raw text files to temp/raw_txt/ directory
    5. Generate processing statistics and manifest
    
    Args:
        catalog_csv: Path to catalog file for metadata validation
        
    Returns:
        DataFrame with processing statistics for each file
    """
    print("Starting Step 1: PDF Text Extraction...")
    
    # Load catalog for file validation and metadata
    catalog = load_catalog(catalog_csv)
    pdfs = sorted(RAW_DIR.glob("*.pdf"))
    
    print(f"Found {len(pdfs)} PDF files to process")
    print(f"Catalog contains {len(catalog)} entries")

    all_recs: List[Dict] = []
    
    for pdf_path in pdfs:
        # Extract base filename (e.g., '0001.pdf' -> '0001')
        base = pdf_path.stem.zfill(4) 
        
        # Validate against catalog
        if base not in catalog:
            write_error(base, "catalog_missing", 
                       "file_num not found in catalog", 
                       {"pdf": pdf_path.name})
            print(f"WARNING: {pdf_path.name} not found in catalog - skipping")
            continue

        # Extract text from PDF
        print(f"Processing {pdf_path.name}...")
        raw_layout = read_pdf(pdf_path)

        # Save raw text output
        raw_out = RAW_TXT_DIR / f"raw_{base}.txt"
        write_text(raw_out, raw_layout)

        # Record processing statistics
        rec_raw = {
            "stage": "raw",
            "source_pdf": pdf_path.name,
            "base": base
        }
        rec_raw.update(count_stats(raw_layout))  # Add line/word/char counts
        all_recs.append(rec_raw)

    # Create processing manifest
    df = pd.DataFrame(all_recs)
    if not df.empty:
        manifest_path = OUTPUT_DIR / "manifest_raw.csv"
        df.to_csv(manifest_path, index=False, encoding="utf-8")
        print(f"Manifest saved to {manifest_path}")
    
    print(f"Step 1 Complete: {len(df)} raw txt files written to {RAW_TXT_DIR}")
    return df

# === EXECUTE STEP 1 ===
print("=" * 60)
df_raw = step1_extract_raw(CATALOG_CSV)
print("=" * 60)

# Display processing summary
if not df_raw.empty:
    print("\n**Raw Text Extraction Summary:**")
    print(f"   Total files processed: {len(df_raw)}")
    print(f"   Average lines per file: {df_raw['lines'].mean():.1f}")
    print(f"   Average words per file: {df_raw['words'].mean():.1f}")
    print(f"   Total characters extracted: {df_raw['chars'].sum():,}")
    
df_raw.head()

Starting Step 1: PDF Text Extraction...
Found 301 PDF files to process
Catalog contains 301 entries
Processing 0001.pdf...
Processing 0002.pdf...
Processing 0003.pdf...
Processing 0004.pdf...
Processing 0005.pdf...
Processing 0006.pdf...
Processing 0007.pdf...
Processing 0008.pdf...
Processing 0009.pdf...
Processing 0010.pdf...
Processing 0011.pdf...
Processing 0012.pdf...
Processing 0013.pdf...
Processing 0014.pdf...
Processing 0015.pdf...
Processing 0016.pdf...
Processing 0017.pdf...
Processing 0018.pdf...
Processing 0019.pdf...
Processing 0020.pdf...
Processing 0021.pdf...
Processing 0022.pdf...
Processing 0023.pdf...
Processing 0024.pdf...
Processing 0025.pdf...
Processing 0026.pdf...
Processing 0027.pdf...
Processing 0028.pdf...
Processing 0029.pdf...
Processing 0030.pdf...
Processing 0031.pdf...
Processing 0032.pdf...
Processing 0033.pdf...
Processing 0034.pdf...
Processing 0035.pdf...
Processing 0036.pdf...
Processing 0037.pdf...
Processing 0038.pdf...
Processing 0039.pdf...
Pro

Unnamed: 0,stage,source_pdf,base,lines,words,chars
0,raw,0001.pdf,1,5445,52675,344041
1,raw,0002.pdf,2,20438,170028,1073667
2,raw,0003.pdf,3,2140,15764,105782
3,raw,0004.pdf,4,1253,7528,51263
4,raw,0005.pdf,5,1711,13228,89434


## **Step 2: Text Cleaning & Normalization**

### **Purpose**
Clean and normalize raw text extracted from PDFs to prepare for structural analysis. This step removes PDF artifacts, standardizes formatting, and improves text quality for downstream processing.

### **Cleaning Operations**
- **Line Break Normalization** - Convert Windows/Mac line endings to Unix format
- **Whitespace Consolidation** - Collapse multiple spaces and excessive line breaks
- **Paragraph Preservation** - Maintain paragraph structure while removing artifacts
- **Title Matching** - Remove duplicate title lines that match catalog metadata

### **Quality Improvements**
- Removes PDF extraction artifacts and formatting inconsistencies
- Preserves document structure and readability
- Standardizes text encoding and character representation
- Prepares text for accurate structural parsing in Step 3

In [9]:
# ============== Step 2 — Text Cleaning & Normalization (One-Paragraph-One-Line) ==============
# Objetivo:
# - Cada párrafo queda en UNA sola línea.
# - Preservar la separabilidad de encabezados (TÍTULO/CAPÍTULO/SECCIÓN/ART./TRANSITORIOS/DECRETO/NÚMERO)
#   y fracciones (romanas, numéricas, alfabéticas), tratándolos como inicios de nuevo párrafo.
# - Eliminar espacios redundantes, saltos de línea extra y artefactos comunes de extracción PDF.
# - Estandarizar Unicode (NFC, NBSP→espacio, quitar cero-width), dehyphenation segura.
# - Idempotente y listo para Step 3 (parsing estructural).
#
# Integra con utilidades de tu proyecto si existen (fallbacks seguros incluidos).

from __future__ import annotations
import re
import unicodedata
from pathlib import Path
from collections import Counter
from typing import Dict, List, Optional
import sys

# ----------------------- Integraciones del proyecto (con fallbacks) -----------------------
try:
    RAW_TXT_DIR
except NameError:
    RAW_TXT_DIR = Path("temp/raw_txt")

try:
    CLEAN_DIR
except NameError:
    CLEAN_DIR = Path("temp/clean")

try:
    OUTPUT_DIR
except NameError:
    OUTPUT_DIR = Path("Refined")

try:
    CATALOG_CSV
except NameError:
    CATALOG_CSV = Path("catalog.csv")

def _fallback_write_text(path: Path, text: str) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")

def _fallback_write_error(base, code, msg, extra=None) -> None:
    sys.stderr.write(f"[clean][{base}] {code}: {msg} | extra={extra}\n")

def _fallback_count_stats(text: str) -> Dict[str, int]:
    return {
        "lines": text.count("\n") + (0 if text.endswith("\n") else 1 if text else 0),
        "words": len(text.split()),
        "chars": len(text),
    }

write_text   = globals().get("write_text", _fallback_write_text)
write_error  = globals().get("write_error", _fallback_write_error)
count_stats  = globals().get("count_stats", _fallback_count_stats)
load_catalog = globals().get("load_catalog", None)  # opcional

# ---------------------------- Patrones estructurales clave ----------------------------
HEAD_RE = re.compile(
    r'^\s*(T[ÍI]TULO|CAP[ÍI]TULO|SECCI[ÓO]N|ART[ÍI]CULO|DECRETO|TRANSITORIOS?|N[ÚU]MERO)\b',
    re.IGNORECASE
)
# "Art." o "Art " como atajo de ARTÍCULO
ARTICLE_RE = re.compile(r'^\s*Art(?:[íi]culo|\.?)\b', re.IGNORECASE)

# Enumeraciones: romanas, numéricas, alfabéticas (a), b), etc.)
ENUM_RE = re.compile(
    r'^\s*((?:I{1,4}|V?I{0,3}|X{1,3})\s*(?:\.|-)|\d+\s*(?:\.|\)|-)|[A-Za-zÁÉÍÓÚÜÑ]\))'
)

# Contadores/pies de página típicos
PAGE_RE_1 = re.compile(r'^\s*[-–—]?\s*\d+\s*[-–—]?\s*$')
PAGE_RE_2 = re.compile(r'^\s*P[aá]gina\s+\d+(\s+de\s+\d+)?\s*$', re.IGNORECASE)

# Variante espaciada de TRANSITORIOS
TRANS_SPACED_RE = re.compile(r'^\s*T\s*R\s*A\s*N\s*S\s*I\s*T\s*O\s*R\s*I\s*O\s*S\s*$', re.IGNORECASE)

# ------------------------------ Utilidades de limpieza ------------------------------
def _normalize_unicode(s: str) -> str:
    s = s.replace('\r\n', '\n').replace('\r', '\n')
    s = s.replace('\ufeff', '')     # BOM
    s = s.replace('\u00A0', ' ')    # NBSP -> espacio
    s = re.sub('[\u200b-\u200d]', '', s)  # zero-width
    s = unicodedata.normalize('NFC', s)
    return s

def _strip_trailing_spaces(s: str) -> str:
    return "\n".join(ln.rstrip() for ln in s.split("\n"))

def _remove_page_counters_and_repeated_headers(s: str) -> str:
    # 1) Quitar folios/páginas
    lines = [ln for ln in s.split("\n") if not (PAGE_RE_1.match(ln) or PAGE_RE_2.match(ln))]
    # 2) Quitar encabezados/pies repetidos idénticos (>=3 veces) que no sean estructurales
    counts = Counter(l for l in lines if l.strip() and not HEAD_RE.match(l) and not ENUM_RE.match(l))
    repeated = {l for l, c in counts.items() if c >= 3}
    return "\n".join(l for l in lines if l not in repeated)

def _normalize_transitorios_spaced(s: str) -> str:
    out = []
    for ln in s.split("\n"):
        out.append("TRANSITORIOS" if TRANS_SPACED_RE.match(ln.strip()) else ln)
    return "\n".join(out)

def _is_paragraph_starter(ln: str) -> bool:
    if not ln.strip():
        return True
    return bool(HEAD_RE.match(ln) or ARTICLE_RE.match(ln) or ENUM_RE.match(ln))

def _join_hard_wrap(prev: str, cur: str) -> str:
    # Une respetando dehyphenation segura y un solo espacio
    if prev.endswith('-') and cur and cur[:1].islower():
        return prev[:-1] + cur
    return (prev + ' ' + cur).replace('  ', ' ')

def _paragraphs_to_one_line(s: str) -> str:
    """
    Recorre el texto línea por línea y construye párrafos de UNA línea.
    Empieza un nuevo párrafo cuando:
      - la línea está en blanco, o
      - coincide con HEAD/ARTICLE/ENUM (encabezados y fracciones).
    En caso contrario, concatena con la línea previa dentro del mismo párrafo.
    """
    out: List[str] = []
    acc: Optional[str] = None

    for raw_ln in s.split("\n"):
        ln = raw_ln.strip()
        if not ln:  # línea en blanco -> finaliza párrafo actual
            if acc is not None and acc.strip():
                out.append(acc.strip())
                acc = None
            continue

        if _is_paragraph_starter(ln):
            # Cierra el párrafo previo y empieza uno nuevo
            if acc is not None and acc.strip():
                out.append(acc.strip())
            acc = ln
        else:
            # Continúa el párrafo actual
            if acc is None:
                acc = ln
            else:
                acc = _join_hard_wrap(acc, ln)

    if acc is not None and acc.strip():
        out.append(acc.strip())

    # Sin líneas en blanco: una línea por párrafo.
    return "\n".join(out)

def _intraline_space_normalize(s: str) -> str:
    # Colapsa espacios internos y asegura separación mínima tras tokens legales comunes
    s = re.sub(r'[ \t]{2,}', ' ', s)
    s = re.sub(r'(^|\n)(Artículo\s+\d+(?:[º°]\.?|\.))\s*', r'\1\2 ', s, flags=re.IGNORECASE)
    s = re.sub(r'(^|\n)(Art\.\s*\d+\.?)\s*', r'\1\2 ', s, flags=re.IGNORECASE)
    s = re.sub(r'(^|\n)((?:I{1,4}|V?I{0,3}|X{1,3})\s*(?:\.|-))\s*', r'\1\2 ', s)
    s = re.sub(r'(^|\n)(\d+\s*(?:\.|\)|-))\s*', r'\1\2 ', s)
    s = re.sub(r'(^|\n)([A-Za-zÁÉÍÓÚÜÑ]\))\s*', r'\1\2 ', s)
    return s

# --------------------------------- API principal ---------------------------------
def clean_raw_text(raw: str, title_candidate: Optional[str] = None) -> str:
    """
    Limpieza conservadora con "una línea por párrafo".
    Pasos:
      1) Unicode & controles (NFC, NBSP→espacio, quitar zero-width, normalizar CR/CRLF)
      2) Quitar folios/pies de página y encabezados/pies repetidos (>=3)
      3) Normalizar variantes espaciadas de TRANSITORIOS
      4) Reflujo: construir párrafos de UNA línea (HEAD/ART./ENUM inician párrafo)
      5) Normalización intralínea y tidy final
    """
    txt = _normalize_unicode(raw)
    txt = _strip_trailing_spaces(txt)
    txt = _remove_page_counters_and_repeated_headers(txt)
    txt = _normalize_transitorios_spaced(txt)

    # *** clave: una línea por párrafo ***
    txt = _paragraphs_to_one_line(txt)

    # Ajustes finos de espacios
    txt = _intraline_space_normalize(txt)

    # Tidy final: idempotencia y salto final único
    txt = txt.strip() + "\n"
    return txt

# ------------------------------------ Driver ------------------------------------
def step2_clean_raw(catalog_csv: Path = CATALOG_CSV):
    """
    Step 2: Limpia y normaliza archivos raw -> clean_{base}.txt
    - Lee RAW_TXT_DIR / raw_*.txt
    - Escribe CLEAN_DIR / clean_{base}.txt
    - Manifiesto con estadísticas (si pandas disponible)
    - Logs claros por archivo
    """
    def log(msg: str, level: str = "INFO"):
        print(f"[{level}] {msg}", flush=True)

    print("Starting Step 2: Text Cleaning & Normalization (one paragraph = one line)...")
    CLEAN_DIR.mkdir(parents=True, exist_ok=True)

    catalog = None
    if load_catalog:
        try:
            catalog = load_catalog(catalog_csv)
            log(f"Catalog loaded: {catalog_csv}")
        except Exception as e:
            write_error("global", "catalog_load_failed", f"Could not load catalog: {e}", {"catalog_csv": str(catalog_csv)})
            log(f"Catalog load failed: {e}", level="WARN")

    raw_files = sorted(RAW_TXT_DIR.glob("raw_*.txt"))
    log(f"Found {len(raw_files)} raw text files in {RAW_TXT_DIR}")

    records: List[Dict] = []
    total_chars_raw = 0
    total_chars_clean = 0
    cleaned_count = 0
    skipped_count = 0
    failed_count = 0

    for raw_path in raw_files:
        base = raw_path.stem.replace("raw_", "")
        fname = raw_path.name

        if catalog is not None and base not in catalog:
            write_error(base, "catalog_missing", "file_num not found in catalog (clean stage)", {"raw_file": fname})
            log(f"{fname} → skipping (not in catalog)", level="WARN")
            skipped_count += 1
            continue

        try:
            log(f"Cleaning {fname} …")
            raw_text = raw_path.read_text(encoding="utf-8")
            meta = catalog[base] if catalog is not None else None
            title = getattr(meta, "law_name", None) if meta is not None else None

            cleaned = clean_raw_text(raw_text, title)

            out_path = CLEAN_DIR / f"clean_{base}.txt"
            write_text(out_path, cleaned)

            stats_raw = count_stats(raw_text)
            stats_clean = count_stats(cleaned)
            total_chars_raw += stats_raw.get("chars", 0)
            total_chars_clean += stats_clean.get("chars", 0)

            rec = {"stage": "clean", "source_raw": fname, "base": base, **stats_clean}
            records.append(rec)
            cleaned_count += 1

            log(f"Wrote {out_path.name} (lines={stats_clean.get('lines')}, words={stats_clean.get('words')}, chars={stats_clean.get('chars')})")

        except Exception as e:
            write_error(base, "clean_failed", f"Exception during cleaning: {e}", {"raw_file": fname})
            log(f"{fname} → cleaning failed: {e}", level="ERROR")
            failed_count += 1

    # Manifest opcional
    try:
        import pandas as pd
        if records:
            OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
            import pandas as pd
            df = pd.DataFrame(records)
            manifest_path = OUTPUT_DIR / "manifest_clean.csv"
            df.to_csv(manifest_path, index=False, encoding="utf-8")
            log(f"Manifest saved to {manifest_path}")
        else:
            df = pd.DataFrame()
    except Exception as e:
        log(f"Could not write manifest (pandas missing or error: {e})", level="WARN")
        try:
            import pandas as pd
            df = pd.DataFrame(records) if records else pd.DataFrame()
        except Exception:
            df = None

    # Resumen
    reduction = (1 - (total_chars_clean / total_chars_raw)) * 100 if total_chars_raw else 0.0
    log(f"Summary: cleaned={cleaned_count}, skipped={skipped_count}, failed={failed_count}")
    if total_chars_raw:
        log(f"Text size reduction: {reduction:.1f}%")
    log(f"Step 2 complete: {cleaned_count} file(s) written to {CLEAN_DIR}")

    return df if (df is not None and not df.empty) else None

# ------------------------------- Ejecución directa --------------------------------
if __name__ == "__main__":
    _ = step2_clean_raw(CATALOG_CSV)


Starting Step 2: Text Cleaning & Normalization (one paragraph = one line)...
[INFO] Catalog loaded: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Raw\index.csv
[INFO] Found 301 raw text files in c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\temp\raw_txt
[INFO] Cleaning raw_0001.txt …


[INFO] Wrote clean_0001.txt (lines=1477, words=52266, chars=333390)
[INFO] Cleaning raw_0002.txt …
[INFO] Wrote clean_0002.txt (lines=6019, words=171856, chars=1037046)
[INFO] Cleaning raw_0003.txt …
[INFO] Wrote clean_0003.txt (lines=652, words=15815, chars=102309)
[INFO] Cleaning raw_0004.txt …
[INFO] Wrote clean_0004.txt (lines=300, words=7379, chars=48703)
[INFO] Cleaning raw_0005.txt …
[INFO] Wrote clean_0005.txt (lines=459, words=13085, chars=86294)
[INFO] Cleaning raw_0006.txt …
[INFO] Wrote clean_0006.txt (lines=2802, words=92858, chars=560122)
[INFO] Cleaning raw_0007.txt …
[INFO] Wrote clean_0007.txt (lines=3228, words=77099, chars=496470)
[INFO] Cleaning raw_0008.txt …
[INFO] Wrote clean_0008.txt (lines=1240, words=41662, chars=261846)
[INFO] Cleaning raw_0009.txt …
[INFO] Wrote clean_0009.txt (lines=2219, words=69867, chars=424016)
[INFO] Cleaning raw_0010.txt …
[INFO] Wrote clean_0010.txt (lines=2182, words=65971, chars=432822)
[INFO] Cleaning raw_0011.txt …
[INFO] Wrote c

## **Step 3: Document Structure Splitting**  ESTE PASO HACERLO EN EL JSONPARSER

### **Purpose**
Intelligently split cleaned legal documents into their constituent parts: decree sections, main law content, and transitional provisions. This separation enables specialized processing of different document types.

### **Document Structure Recognition**
Mexican legal documents typically follow this structure:
1. **Decreto** - Government decree and preamble (before main law)
2. **Ley** - Main law content with hierarchical structure (títulos, capítulos, artículos)  
3. **Transitorios** - Transitional provisions and implementation rules

### **Splitting Algorithm**
- **Title Detection** - Identifies law title using catalog metadata (first two words)
- **Hierarchy Recognition** - Detects formal legal structure markers
- **Transitional Identification** - Locates "Transitorios" sections
- **Intelligent Fallback** - Handles documents that don't follow standard format

### **Quality Assurance**
- Validates structural patterns against expected legal document format
- Logs splitting decisions and potential issues
- Preserves content integrity across all document sections

In [10]:
# ============== Directory Configuration ==============

# Base working directory (current notebook location)
BASE_DIR     = Path.cwd()

# === INPUT PATHS ===


RAW_DIR      = BASE_DIR / "Raw"                 # Source PDF files location
CATALOG_CSV  = RAW_DIR / "index.csv"           # Metadata catalog for PDFs

# === OUTPUT ROOT ===
OUTPUT_DIR   = BASE_DIR / "Refined"             # All processed outputs go here
TEMP_DIR     = BASE_DIR / "temp"                # Temporary processing files

# === INTERMEDIATE PROCESSING DIRECTORIES ===
RAW_TXT_DIR  = OUTPUT_DIR / TEMP_DIR / "raw_txt"   # Step 1: Raw PDF text extraction
CLEAN_DIR    = OUTPUT_DIR / TEMP_DIR / "clean"     # Step 2: Cleaned text files

# === DOCUMENT TYPE SEPARATION (Step 3 outputs) ===
LEY_DIR      = OUTPUT_DIR / "leyes"             # Main law content
DECR_DIR     = OUTPUT_DIR / "decretos"          # Government decree sections  
TRANS_DIR    = OUTPUT_DIR / "transitorios"      # Transitional provisions

# === FINAL OUTPUTS ===
JSON_DIR     = OUTPUT_DIR / "json"              # Structured JSON documents
ERRORES_DIR  = OUTPUT_DIR / "errores"           # Error logs and validation reports

# Create all necessary directories (parents=True creates nested paths)
for d in [OUTPUT_DIR, RAW_TXT_DIR, TEMP_DIR, CLEAN_DIR, LEY_DIR, DECR_DIR, TRANS_DIR, JSON_DIR, ERRORES_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print(f"Directory structure created successfully!")
print(f"Base directory: {BASE_DIR}")
print(f"Input PDFs: {RAW_DIR}")  
print(f"Catalog: {CATALOG_CSV}")
print(f"Final JSON output: {JSON_DIR}")

Directory structure created successfully!
Base directory: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14
Input PDFs: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Raw
Catalog: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Raw\index.csv
Final JSON output: c:\Users\braul\Documents\_ITAMLaptop\Datalab\DataMakers\Leyes\14\Refined\json


In [11]:
# ============== Legal Document Structure Recognition ==============

# Hierarchical structure vocabulary for Mexican legal documents
# These terms appear in formal legal headers in order of hierarchy
HIER = [
    "disposiciones preliminares", "disposiciones generales",
    "libro", "titulo", "capitulo", "seccion",
    "articulo", "art.",                 # ← añade “art.”
    "capitulo unico", "seccion unica", "titulo preliminar",
]

# El resto del HIER_RE igual (usando norm_lower(line))


# Build regex pattern to match hierarchical headers
# Matches any of the hierarchy terms at the start of a line (case-insensitive)
HIER_RE = re.compile(
    r"^\s*(?:%s)\b" % "|".join(re.escape(h) for h in HIER),
    flags=re.I
)

# Flexible "Transitorios" pattern recognition
# Handles both spaced ("T R A N S I T O R I O S") and normal spelling
# Accounts for potential leading whitespace and case variations
TRANS_RE = re.compile(
    r"^\s*(?:"
    r"t\s*r\s*a\s*n\s*s\s*i\s*t\s*o\s*r\s*i\s*o\s*s"  # T R A N S I T O R I O S
    r"|t\s*r\s*a\s*n\s*s\s*i\s*t\s*o\s*r\s*i\s*o"     # T R A N S I T O R I O
    r"|transitorios"
    r"|transitorio"
    r"|articulos?\s+transitorios"                      # "Artículos Transitorios"
    r")\b",
    re.I,
)


print("Legal structure patterns loaded:")
print(f"   Hierarchy levels: {len(HIER)} types")
print(f"   Pattern matching: Case-insensitive with flexible spacing")
print(f"   Transitional sections: Supports both normal and spaced formatting")

Legal structure patterns loaded:
   Hierarchy levels: 11 types
   Pattern matching: Case-insensitive with flexible spacing
   Transitional sections: Supports both normal and spaced formatting


In [None]:

def _contains_allcaps_prefix(line: str, prefix_caps: str) -> bool:
    """
    True if the normalized-uppercase view of 'line' contains the two-word
    ALL-CAPS prefix 'prefix_caps' as a token span, accent-insensitive.
    """
    head = _norm_caps(line)
    pref = _norm_caps(prefix_caps)
    if not pref:
        return False
    # Whole-token boundaries: not preceded/followed by A–Z/0–9
    pat = re.compile(rf"(?<![A-Z0-9]){re.escape(pref)}(?![A-Z0-9])")
    return bool(pat.search(head))

def _next_nonempty_index(lines: List[str], j: int) -> Optional[int]:
    n = len(lines)
    while j < n and not lines[j].strip():
        j += 1
    return j if j < n else None

def find_decreto_ley_start_two_words(
    lines: List[str],
    first_two_caps: str,
    max_lookahead: int = 120,
) -> Optional[Tuple[int, int]]:
    n = len(lines)
    # 1) Fase "estricta"
    for i in range(n - 1):
        if _contains_allcaps_prefix(lines[i], first_two_caps):
            j = _next_nonempty_index(lines, i + 1)
            if j is not None and HIER_RE.search(norm_lower(lines[j].lstrip())):
                return i, j
            # 2) Fase "relajada": ventana
            start = (j if j is not None else i + 1)
            end = min(n, start + max_lookahead)
            for k in range(start, end):
                ln = norm_lower(lines[k].lstrip())
                if TRANS_RE.search(ln):  # si llegas primero a Transitorios, descarta este título
                    break
                if HIER_RE.search(ln):
                    return i, k
    return None


def first_transitorios_after(lines: List[str], start: int) -> Optional[int]:
    """
    Find the first index >= start+1 that looks like a 'Transitorios' heading,
    tolerant of leading spaces and both spaced/plain spellings.
    """
    for i in range(max(start + 1, 0), len(lines)):
        if TRANS_RE.search(norm_lower(lines[i].lstrip())):
            return i
    return None

def split_blocks_two_word_strict(cleaned_text: str, first_two_caps: str, base: str) -> Dict[str, str]:
    """
    Split a cleaned text into:
      - decreto: everything before the ALL-CAPS two-word title line
      - ley: from the first HIER line after that title, up to 'Transitorios' (if any)
      - transitorios: from 'Transitorios' to end (if present)
    If the (title -> header) pair is not found, everything (up to 'Transitorios') is put into 'decreto'.
    """
    lines = cleaned_text.splitlines()
    pair = find_decreto_ley_start_two_words(lines, first_two_caps)

    # Fallback: couldn't find the (title → header) pair
    if pair is None:
        t_idx = first_transitorios_after(lines, 0)
        if t_idx is not None:
            decreto = "\n".join(lines[:t_idx]).strip()
            tran    = "\n".join(lines[t_idx:]).strip()
        else:
            decreto = cleaned_text
            tran    = ""
        write_error(base, "two_word_pair_not_found",
                    "No (ALL-CAPS two-word title line → HIER) pair; ley not split.",
                    {"clean": f"clean_{base}.txt"})
        return {"decreto": decreto, "ley": "", "transitorios": tran}

    title_idx, ley_start = pair
    decreto = "\n".join(lines[:title_idx]).strip()

    t_idx = first_transitorios_after(lines, ley_start)
    if t_idx is not None and t_idx > ley_start:
        ley  = "\n".join(lines[ley_start: t_idx]).strip()
        tran = "\n".join(lines[t_idx:]).strip()
        
    else:
        ley  = "\n".join(lines[ley_start:]).strip()
        tran = ""
    return {"decreto": decreto, "ley": ley, "transitorios": tran}

In [None]:
# # === STEP 3 — Atomic: split ONE cleaned file into decreto/ley/transitorios (con banderas) ===
# from pathlib import Path
# from datetime import datetime

# # Asume que ya existen:
# # - load_catalog, LawMeta
# # - split_blocks_two_word_strict
# # - write_text, write_error (JSON estructurado)
# # - count_stats (devuelve dict con keys: lines, words, chars)

# # Directorios base (reutiliza los de tu pipeline si ya existen)
# CLEAN_DIR   = Path("temp/clean")     # clean_{file_num}.txt
# OUTPUT_DIR  = Path("Refined")        # carpeta base de salida

# DEC_DIR = OUTPUT_DIR / "decretos"
# LEY_DIR = OUTPUT_DIR / "leyes"
# TRA_DIR = OUTPUT_DIR / "transitorios"
# ERR_DIR = OUTPUT_DIR / "errores"
# MANIFEST_PARTS = OUTPUT_DIR / "manifest_parts.csv"

# for d in (DEC_DIR, LEY_DIR, TRA_DIR, ERR_DIR):
#     d.mkdir(parents=True, exist_ok=True)

# def _append_manifest_rows(rows: list[dict]) -> None:
#     """Append rows to manifest_parts.csv (crea encabezados si no existe)."""
#     import csv
#     cols = ["file_num","part","lines","words","chars","path","created_at"]
#     new_file = not MANIFEST_PARTS.exists()
#     with MANIFEST_PARTS.open("a", encoding="utf-8", newline="") as f:
#         w = csv.DictWriter(f, fieldnames=cols)
#         if new_file:
#             w.writeheader()
#         for r in rows:
#             w.writerow(r)

# def process_one_file(file_num: str, catalog: dict[str, "LawMeta"]) -> dict:
#     """
#     Procesa UN archivo clean_{file_num}.txt:
#       - lee el limpio
#       - hace split con split_blocks_two_word_strict
#       - escribe partes (decreto/ley/transitorios)
#       - anexa renglones al manifest_parts.csv
#       - retorna dict con 'ok' y 'paths' escritos
#     Con banderas de estado por consola y advertencias si faltan partes.
#     """
#     clean_path = CLEAN_DIR / f"clean_{file_num}.txt"
#     print(f"[INFO] Procesando {clean_path.name}...")

#     if not clean_path.exists():
#         msg = f"No existe {clean_path.name} para split Step 3"
#         print(f"[WARNING] {msg}")
#         write_error(file_num, "clean_missing", msg, {"clean_path": str(clean_path)})
#         return {"ok": False, "reason": "clean_missing"}

#     meta = catalog.get(file_num)
#     if not meta:
#         msg = "file_num no encontrado en catálogo (split stage)"
#         print(f"[WARNING] {msg} → {file_num}")
#         write_error(file_num, "catalog_missing", msg, {"file_num": file_num})
#         return {"ok": False, "reason": "catalog_missing"}

#     # Lee texto limpio y divide en partes
#     cleaned_text = clean_path.read_text(encoding="utf-8", errors="replace")
#     try:
#         parts = split_blocks_two_word_strict(
#             cleaned_text=cleaned_text,
#             first_two_caps=meta.first_two_caps,
#             base=file_num,
#         )
#     except Exception as e:
#         msg = f"Excepción al dividir {clean_path.name}: {type(e).__name__}: {e}"
#         print(f"[ERROR] {msg}")
#         write_error(file_num, "split_exception", msg, {"first_two_caps": meta.first_two_caps})
#         return {"ok": False, "reason": "split_exception"}

#     # Helper para escribir partes + recolectar stats
#     now = datetime.now().isoformat(timespec="seconds")
#     out_rows, out_paths = [], {}

#     def _emit(part_key: str, out_dir: Path):
#         content = (parts.get(part_key) or "").strip()
#         if not content:
#             print(f"[WARNING] {part_key.upper()} PARA {clean_path.name} NO ENCONTRADO")
#             write_error(file_num, f"{part_key}_missing",
#                         f"{part_key} vacío tras split",
#                         {"file": clean_path.name, "first_two_caps": meta.first_two_caps})
#             return
#         outp = out_dir / f"{file_num}.txt"
#         write_text(outp, content)
#         stats = count_stats(content)  # dict: lines, words, chars
#         out_rows.append({
#             "file_num": file_num,
#             "part": part_key,
#             "lines": stats["lines"],
#             "words": stats["words"],
#             "chars": stats["chars"],
#             "path": str(outp),
#             "created_at": now,
#         })
#         out_paths[part_key] = str(outp)
#         print(f"[OK] {part_key} guardado → {outp.name} "
#               f"({stats['lines']} líneas, {stats['words']} palabras, {stats['chars']} chars)")

#     # Intentar exportar todas las partes
#     _emit("decreto", DEC_DIR)
#     _emit("ley",      LEY_DIR)
#     _emit("transitorios", TRA_DIR)

#     if out_rows:
#         _append_manifest_rows(out_rows)
#         print(f"[INFO] Manifest actualizado para {clean_path.name} (+{len(out_rows)} filas)")
#     else:
#         print(f"[WARNING] Ninguna parte escrita para {clean_path.name}")

#     # Auditoría si no hubo 'ley'
#     if not (parts.get("ley") or "").strip():
#         msg = "Ley vacía tras split; revisar heurística o limpieza"
#         print(f"[WARNING] {msg} → {clean_path.name}")
#         write_error(file_num, "ley_missing", msg, {"first_two_caps": meta.first_two_caps})

#     print(f"[DONE] {clean_path.name} procesado.")
#     return {"ok": True, "paths": out_paths, "has_ley": bool((parts.get("ley") or "").strip())}


In [10]:


# === STEP 3 — Atomic: split ONE cleaned file into decreto/ley/transitorios (prints mínimos) ===
from pathlib import Path
from datetime import datetime

# Asume que ya existen en el entorno:
# - load_catalog, LawMeta
# - split_blocks_two_word_strict
# - write_text, write_error (JSON estructurado)
# - count_stats (devuelve dict con keys: lines, words, chars)

# Directorios base
CLEAN_DIR   = Path("temp/clean")     # clean_{file_num}.txt
OUTPUT_DIR  = Path("Refined")        # carpeta base de salida

DEC_DIR = OUTPUT_DIR / "decretos"
LEY_DIR = OUTPUT_DIR / "leyes"
TRA_DIR = OUTPUT_DIR / "transitorios"
ERR_DIR = OUTPUT_DIR / "errores"
MANIFEST_PARTS = OUTPUT_DIR / "manifest_parts.csv"

for d in (DEC_DIR, LEY_DIR, TRA_DIR, ERR_DIR):
    d.mkdir(parents=True, exist_ok=True)

def _append_manifest_rows(rows: list[dict]) -> None:
    """Append rows to manifest_parts.csv (crea encabezados si no existe)."""
    import csv
    cols = ["file_num","part","lines","words","chars","path","created_at"]
    new_file = not MANIFEST_PARTS.exists()
    with MANIFEST_PARTS.open("a", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        if new_file:
            w.writeheader()
        for r in rows:
            w.writerow(r)

def process_one_file(file_num: str, catalog: dict[str, "LawMeta"]) -> dict:
    """
    Procesa UN archivo clean_{file_num}.txt:
      - lee el limpio
      - split con split_blocks_two_word_strict
      - escribe partes (decr_/law_/tran_)
      - agrega filas al manifest (silencioso)
    Prints: solo INFO de procesamiento y ERROR cuando falta decreto/ley/transitorios.
    """
    clean_path = CLEAN_DIR / f"clean_{file_num}.txt"
    print(f"[INFO] Procesando {clean_path.name}...")

    if not clean_path.exists():
        # Error fatal: no hay clean; dejamos también registro estructurado
        write_error(file_num, "clean_missing",
                    f"No existe {clean_path.name} para split Step 3",
                    {"clean_path": str(clean_path)})
        # No añadimos más prints distintos a los solicitados
        return {"ok": False, "reason": "clean_missing"}

    meta = catalog.get(file_num)
    if not meta:
        write_error(file_num, "catalog_missing",
                    "file_num no encontrado en catálogo (split stage)",
                    {"file_num": file_num})
        return {"ok": False, "reason": "catalog_missing"}

    # Lee texto limpio y divide en partes
    cleaned_text = clean_path.read_text(encoding="utf-8", errors="replace")
    try:
        parts = split_blocks_two_word_strict(
            cleaned_text=cleaned_text,
            first_two_caps=meta.first_two_caps,
            base=file_num,
        )
    except Exception as e:
        write_error(file_num, "split_exception",
                    f"Excepción al dividir {clean_path.name}: {type(e).__name__}: {e}",
                    {"first_two_caps": meta.first_two_caps})
        return {"ok": False, "reason": "split_exception"}

    # Prefijos de salida por tipo de parte
    prefix_map = {
        "decreto": "decr",
        "ley": "law",
        "transitorios": "tran",
    }
    dir_map = {
        "decreto": DEC_DIR,
        "ley": LEY_DIR,
        "transitorios": TRA_DIR,
    }

    now = datetime.now().isoformat(timespec="seconds")
    out_rows, out_paths = [], {}

    def _emit(part_key: str):
        content = (parts.get(part_key) or "").strip()
        if not content:
            if part_key == "ley":
                # Falla real del split → error y registro estructurado
                print(f"[ERROR] {part_key} NO encontrado \u2192 {file_num}.txt")
                write_error(
                    file_num,
                    f"{part_key}_missing",
                    f"{part_key} vacío tras split",
                    {"file": clean_path.name, "first_two_caps": meta.first_two_caps},
                )
            else:
                # Falta esperada en muchos casos (decreto / transitorios) → solo INFO, sin write_error
                print(f"[INFO] {part_key} NO encontrado \u2192 {file_num}.txt")
            return

        out_dir = dir_map[part_key]
        outp = out_dir / f"{prefix_map[part_key]}_{file_num}.txt"
        write_text(outp, content)
        stats = count_stats(content)  # dict: lines, words, chars
        out_rows.append({
            "file_num": file_num,
            "part": part_key,
            "lines": stats["lines"],
            "words": stats["words"],
            "chars": stats["chars"],
            "path": str(outp),
            "created_at": now,
        })
        out_paths[part_key] = str(outp)


    # Exportar (silencioso si se guardó; solo avisa si falta)
    _emit("decreto")
    _emit("ley")
    _emit("transitorios")

    if out_rows:
        _append_manifest_rows(out_rows)

    return {"ok": True, "paths": out_paths, "has_ley": bool((parts.get("ley") or "").strip())}



#! VOLVER A CORRER ESTO !!!!!!!!!!!!!!!!!




In [11]:
# === STEP 3 — Orchestral: iterate over clean_*.txt and call atomic (con banderas) ===
from pathlib import Path
from time import perf_counter

CATALOG_CSV = Path("Raw/index.csv")
CLEAN_DIR   = Path("temp/clean")
ONLY_ONE    = None   # Ej. "0008" para solo ese; None = procesa todos

# Reusa load_catalog de Step 1/2 (retorna Dict[str, LawMeta])
catalog = load_catalog(CATALOG_CSV)

def step3_split_all(only_one: str | None = ONLY_ONE) -> dict:
    """
    Orquesta Step 3:
      - si only_one: procesa solo clean_{only_one}.txt
      - si None: procesa todos los clean_*.txt presentes
      - retorna resumen con totales y listas de procesados/fallidos
    """
    t0 = perf_counter()
    processed, failed = [], []

    print("=== Step 3 | Inicio: Split decreto / ley / transitorios ===")
    if only_one:
        clean_path = CLEAN_DIR / f"clean_{only_one}.txt"
        if not clean_path.exists():
            print(f"[WARNING] No existe {clean_path.name}; nada que procesar.")
            return {"total": 0, "processed": [], "failed": [only_one]}
        print(f"[INFO] Modo 'uno': procesando {clean_path.name}")
        res = process_one_file(only_one, catalog)
        (processed if res.get("ok") else failed).append(only_one)
    else:
        files = sorted(CLEAN_DIR.glob("clean_*.txt"))
        print(f"[INFO] Archivos detectados en {CLEAN_DIR}: {len(files)}")
        if not files:
            print("[WARNING] No hay archivos clean_*.txt para procesar.")
        for fp in files:
            fn = fp.stem.replace("clean_", "")
            res = process_one_file(fn, catalog)
            (processed if res.get("ok") else failed).append(fn)

    dt = perf_counter() - t0
    summary = {
        "total": len(processed) + len(failed),
        "processed": processed,
        "failed": failed,
        "seconds": round(dt, 2),
    }
    print(f"=== Step 3 | Fin: OK={len(processed)} | Fails={len(failed)} | "
          f"Total={summary['total']} | Tiempo={summary['seconds']}s ===")
    if failed:
        print("[WARNING] Archivos con fallo:", ", ".join(failed))
    return summary

# === Ejecutar (elige: uno o todos) ===
summary_step3 = step3_split_all(ONLY_ONE)


=== Step 3 | Inicio: Split decreto / ley / transitorios ===
[INFO] Archivos detectados en temp\clean: 301
[INFO] Procesando clean_0001.txt...
[INFO] Procesando clean_0002.txt...
[INFO] Procesando clean_0003.txt...
[INFO] Procesando clean_0004.txt...
[INFO] Procesando clean_0005.txt...
[INFO] Procesando clean_0006.txt...
[INFO] Procesando clean_0007.txt...
[INFO] Procesando clean_0008.txt...
[INFO] Procesando clean_0009.txt...
[INFO] Procesando clean_0010.txt...
[INFO] Procesando clean_0011.txt...
[INFO] Procesando clean_0012.txt...
[INFO] Procesando clean_0013.txt...
[INFO] Procesando clean_0014.txt...
[INFO] Procesando clean_0015.txt...
[INFO] Procesando clean_0016.txt...
[INFO] decreto NO encontrado → 0016.txt
[INFO] Procesando clean_0017.txt...
[INFO] Procesando clean_0018.txt...
[INFO] Procesando clean_0019.txt...
[INFO] decreto NO encontrado → 0019.txt
[INFO] transitorios NO encontrado → 0019.txt
[INFO] Procesando clean_0020.txt...
[INFO] Procesando clean_0021.txt...
[INFO] Proces

In [None]:

# # === STEP 3 — Split decreto / ley / transitorios ============================
# # Pega esta celda justo después de la celda 54

# from pathlib import Path
# from datetime import datetime
# import csv, re, unicodedata

# # ---- Config ----
# CLEAN_DIR   = Path("temp/clean")         # archivos: clean_{file_num}.txt
# CATALOG_CSV = Path("Raw/index.csv")      # columnas requeridas: file_num, law_name
# OUT_BASE    = Path("Refined")
# ONLY_ONE    = None                       # ej. "0008" para solo ese; None = procesa todos

# # ---- Fallbacks si no existen en celdas previas ----
# def _exists(name:str)->bool:
#     return name in globals()

# if not _exists("write_text"):
#     def write_text(path: Path, content: str):
#         path.parent.mkdir(parents=True, exist_ok=True)
#         path.write_text(content, encoding="utf-8")

# if not _exists("write_error"):
#     def write_error(base: str, kind: str, message: str, extra=None):
#         err_dir = OUT_BASE / "errores"
#         err_dir.mkdir(parents=True, exist_ok=True)
#         p = err_dir / f"{base}_errors.txt"
#         with p.open("a", encoding="utf-8") as f:
#             f.write(f"[{kind}] {message}")
#             if extra:
#                 f.write(f" | {extra}")
#             f.write("\n")

# if not _exists("count_stats"):
#     def count_stats(text: str):
#         lines = text.count("\n") + (1 if text else 0)
#         words = len(text.split())
#         chars = len(text)
#         return lines, words, chars

# # ---- Helpers locales ----
# def _strip_accents(s: str) -> str:
#     return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

# def _first_two_caps(law_name: str | None) -> str | None:
#     if not law_name:
#         return None
#     toks = [t for t in re.split(r"\W+", law_name) if t]
#     if not toks:
#         return None
#     return _strip_accents(" ".join(toks[:2])).upper()

# def _load_catalog(csv_path: Path) -> dict[str, tuple[str|None, str|None]]:
#     """
#     Devuelve: { file_num: (law_name, first_two_caps) }
#     """
#     mapping: dict[str, tuple[str|None, str|None]] = {}
#     with csv_path.open("r", encoding="utf-8") as f:
#         reader = csv.DictReader(f)
#         required = {"file_num","law_name"}
#         if not required.issubset(set(reader.fieldnames or [])):
#             raise ValueError("Raw/index.csv debe tener columnas: file_num, law_name")
#         for row in reader:
#             fn = (row.get("file_num") or "").strip()
#             ln = (row.get("law_name") or "").strip() or None
#             if fn:
#                 mapping[fn] = (ln, _first_two_caps(ln))
#     return mapping

# def _ensure_dirs():
#     (OUT_BASE / "decretos").mkdir(parents=True, exist_ok=True)
#     (OUT_BASE / "leyes").mkdir(parents=True, exist_ok=True)
#     (OUT_BASE / "transitorios").mkdir(parents=True, exist_ok=True)
#     (OUT_BASE / "errores").mkdir(parents=True, exist_ok=True)

# def _append_manifest(rows: list[dict]):
#     manifest = OUT_BASE / "manifest_parts.csv"
#     exists = manifest.exists()
#     cols = ["file_num","part","lines","words","chars","path","created_at"]
#     with manifest.open("a", encoding="utf-8", newline="") as f:
#         w = csv.DictWriter(f, fieldnames=cols)
#         if not exists:
#             w.writeheader()
#         for r in rows:
#             w.writerow(r)

# # ---- Núcleo de proceso por archivo ----
# def process_one(file_num: str, clean_path: Path, first_two_caps: str | None):
#     txt = clean_path.read_text(encoding="utf-8", errors="replace")

#     # Usa TU helper ya definido en celdas anteriores:
#     # split_blocks_two_word_strict(cleaned_text, first_two_caps, base)
#     parts = split_blocks_two_word_strict(txt, first_two_caps or "", base=file_num)

#     out_rows = []
#     now = datetime.now().isoformat(timespec="seconds")

#     def drop(part_key: str, folder: str):
#         content = (parts.get(part_key) or "").strip()
#         if not content:
#             return
#         outp = OUT_BASE / folder / f"{file_num}.txt"
#         write_text(outp, content)
#         ln, wd, ch = count_stats(content)
#         out_rows.append({
#             "file_num": file_num,
#             "part": part_key,
#             "lines": ln,
#             "words": wd,
#             "chars": ch,
#             "path": str(outp),
#             "created_at": now
#         })

#     drop("decreto", "decretos")
#     drop("ley", "leyes")
#     drop("transitorios", "transitorios")

#     if out_rows:
#         _append_manifest(out_rows)

#     # Si no se logró separar "ley", deja una nota para auditoría
#     if not parts.get("ley"):
#         write_error(file_num, "ley_missing",
#                     "Ley vacía tras split; revisar heurística o limpieza",
#                     {"first_two_caps": first_two_caps})

# # ---- Run ----
# _ensure_dirs()
# catalog = _load_catalog(CATALOG_CSV)

# if ONLY_ONE:
#     fp = CLEAN_DIR / f"clean_{ONLY_ONE}.txt"
#     if not fp.exists():
#         print(f"[WARN] No existe {fp}")
#     else:
#         _, ftc = catalog.get(ONLY_ONE, (None, None))
#         process_one(ONLY_ONE, fp, ftc)
# else:
#     for fp in sorted(CLEAN_DIR.glob("clean_*.txt")):
#         fn = fp.stem.replace("clean_","")
#         _, ftc = catalog.get(fn, (None, None))
#         process_one(fn, fp, ftc)

# print("Step 3 listo: decreto / ley / transitorios escritos en Refined/ y manifiesto actualizado.")

# #! REVISAR POR QUÉ NO FUNCIONAN LO DEL manifest_parts: lines, words, chars


Step 3 listo: decreto / ley / transitorios escritos en Refined/ y manifiesto actualizado.


## **Step 4: Hierarchical JSON Structure Generation**

### **Purpose**
Transform cleaned law text into structured JSON format that preserves the hierarchical organization of legal documents. This enables programmatic analysis, search, and processing of legal content.

### **Structural Parsing**
- **Hierarchy Detection** - Identifies libros, títulos, capítulos, secciones, artículos
- **Content Organization** - Builds nested tree structure reflecting legal document hierarchy  
- **Article Analysis** - Parses individual articles with content and annotations
- **Metadata Integration** - Includes source information and validation metadata

### **Advanced Features**
- **Automatic Repair** - Detects and fixes embedded article headers within content
- **Sequence Validation** - Identifies gaps or jumps in article numbering
- **Error Reporting** - Comprehensive logging of parsing issues and structural problems
- **Quality Metrics** - Statistical analysis of parsed content for validation

### **Output Format**
Structured JSON with nested hierarchy, article content, annotations, and comprehensive metadata for downstream processing and analysis.

### **Article Sequence Validation**

The parser includes intelligent detection of article numbering gaps (e.g., jumping from "Artículo 12" to "Artículo 15" without 13-14). This helps identify:

- **Missing Content** - Articles that may have been lost during PDF extraction
- **Structural Issues** - Formatting problems that affect article detection  
- **Document Quality** - Overall completeness of the legal document

The system automatically attempts repairs for embedded article headers and provides detailed reporting of any remaining gaps for manual review.

In [55]:
# -*- coding: utf-8 -*-
from __future__ import annotations

import re
import csv
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union, Set

from unidecode import unidecode


### **Configuration: Allowed Suffixes**

The `ALLOWED_SUFFIX` dictionary contains valid suffixes for each hierarchical level, derived from exploratory analysis of the legal corpus. 

**Future Enhancement**: This configuration should be generated dynamically for each legislature to account for:
- Regional variations in legal terminology
- Historical changes in numbering conventions  
- Document-specific structural patterns

**Current Implementation**: Hard-coded based on analysis of existing documents. See exploration notebooks for suffix derivation methodology.

### **Enhancement Opportunity: Typo-Resistant Suffix Matching**

**Current Limitation**: Exact string matching for legal suffixes may miss valid entries due to:
- OCR scanning errors in PDF extraction
- Typographical variations in source documents
- Accent mark inconsistencies

**Proposed Improvements**:
1. **Fuzzy String Matching** - Use edit distance algorithms for approximate matching
2. **Phonetic Matching** - Handle accent mark variations and similar sounds
3. **Machine Learning** - Train classifiers on known good/bad suffix patterns
4. **Manual Review Interface** - Flag uncertain matches for human validation

This would significantly improve parsing accuracy for lower-quality source documents.

In [56]:
# --------------------------------------------------------------------------------------
# Allowed suffixes and header patterns (unchanged lists provided by you)
# --------------------------------------------------------------------------------------

ALLOWED_SUFFIX = {
    "libro": ["cuarto","decimo","noveno","octavo","primero","quinto","segundo","septimo","sexto","tercero"],
    "titulo": ["catorce","cuarto","decimo","decimo bis","decimoctavo","decimocuarto","decimonoveno","decimoprimero","decimoquinto","decimosegundo","decimoseptimo","decimosexto","decimotercero","dieciseis","doce","duodecimo","i","ii","iii","iv","ix","noveno","octavo","octavo bis","once","preliminar","primero","quince","quinto","quinto bis","segundo","segundo bis","septimo","septimo bis","sexto","tercero","tercero bis","trece","trece bis","undecimo","unico","v","vi","vigesimo","vigesimocuarto","vigesimoprimero","vigesimosegundo","vigesimotercero","vii","viii","x","xi","xii","xiii","xiv","xv","xvi","especial"],
    "capitulo": ["cuarto","cuarto bis","decimo","duodecimo","especial","i","i bis","ii","ii bis","iii","iii bis","iii ter","iv","iv bis","iv ter","ix","ix bis","ix ter","noveno","octavo","primero","quinto","segundo","septimo","sexto","tercero","undecimo","unico","v","v bis","v ter","vi","vi bis","vigesimo","vii","vii bis","viii","viii bis","x","x bis","xi","xii","xii bis","xiii","xiii bis","xiv","xix","xv","xv bis","xv quater","xv ter","xvi","xvi bis","xvii","xviii","xx","xxi","xxii","xxiii","xxiv","xxix","xxv","xxvi","xxvii","xxviii", "decimocuarto","decimoquinto","decimosexto","decimoseptimo","decimooctavo","decimonoveno","decimoprimero","decimosegundo","decimotercero","vigesima"],
    "seccion": ["1a","2a","3a","4a","5a","6a","7a","8a","a","b","cuarta","decima","decima bis","decimo","i","ii","iii","iv","ix","novena","octava","primera","primera bis","quinta","segunda","segunda bis","septima","sexta","tercera","unica","v","vi","vii","vii bis","viii","x","xi","xii","xiii","xiv","xix","xv","xvi","xvii","xviii","xx","xxi","xxii","uno","dos","tres","cuatro","cinco","seis","siete","ocho","nueve","diez","once","decimoprimera","decimosegunda","decimotercera","decimocuarta","decimoquinta","decimosexta","decimoseptima","decimoctava","decimonovena","vigesima"],
}

ROMAN_RE = re.compile(r'^(?i:xxiv|xxiii|xxii|xxi|xx|xix|xviii|xvii|xvi|xv|xiv|xiii|xii|xi|x|ix|viii|vii|vi|v|iv|iii|ii|i)$')

HDR_WORDS = {
    "libro":    re.compile(r'^\s*(LIBRO)\b', re.IGNORECASE),
    "titulo":   re.compile(r'^\s*(T[ÍI]TULO)\b', re.IGNORECASE),
    "capitulo": re.compile(r'^\s*(CAP[ÍI]TULO)\b', re.IGNORECASE),
    "seccion":  re.compile(r'^\s*(SECCI[ÓO]N)\b', re.IGNORECASE),
}

LEVEL = {"libro": 1, "preliminar": 1, "titulo": 2, "capitulo": 3, "seccion": 4, "articulo": 5}

# Inline "Artículo" header finder used for repair (accepts "Artículo" or "Art.")
ARTICULO_INLINE_RE = re.compile(
    r'(?i)(?<!\w)(?:art[íi]culo|art\.)\s*'
    r'(?P<sufijo>'               # full suffix capture as text "7", "7 bis", "7-A"
    r'\d+(?:\s*(?:bis|ter|quater|quinquies|sexies|septies|octies|nonies|decies|undecies|duodecies|terdecies|[A-Za-z\-]+)?)?'
    r')\s*\.(?:-)?\s*'
)

# --------------------------------------------------------------------------------------
# Helpers
# --------------------------------------------------------------------------------------

def collapse_ws(s: str) -> str:
    return re.sub(r'\s+', ' ', s).strip()

def norm(s: str) -> str:
    return collapse_ws(unidecode(s).lower())

def edit_distance(a: str, b: str) -> int:
    la, lb = len(a), len(b)
    dp = list(range(lb+1))
    for i, ca in enumerate(a, 1):
        prev = dp[0]
        dp[0] = i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            cost = 0 if ca == cb else 1
            dp[j] = min(dp[j] + 1, dp[j-1] + 1, prev + cost)
            prev = cur
    return dp[lb]

def is_articulo_token(token: str) -> bool:
    return edit_distance(norm(token), "articulo") <= 2 or norm(token) in {"art", "art."}

def tokenize(s: str) -> List[str]:
    return [t for t in re.split(r'\s+', s.strip()) if t]

def allowed_suffix_for(tipo: str, candidate_tokens: List[str]) -> Tuple[Optional[str], int]:
    if not candidate_tokens:
        return None, 0
    raw = candidate_tokens[:3]
    cleaned = [re.sub(r'[.\-:—]+$', '', tok).strip() for tok in raw]

    allowed = set(ALLOWED_SUFFIX.get(tipo, []))
    allowed_norm = {norm(x) for x in allowed} | {norm(x.replace(' ', '')) for x in allowed}

    max_span = min(3, len(cleaned))
    for k in range(max_span, 0, -1):
        span_clean = cleaned[:k]
        as_is_clean = collapse_ws(' '.join(span_clean))
        as_norm = norm(as_is_clean)
        as_join_norm = norm(''.join(span_clean))

        if tipo == "seccion":
            if re.fullmatch(r'\d+[aª]$', as_norm) or re.fullmatch(r'\d+$', as_norm):
                return as_is_clean, k

        if tipo in ("libro", "titulo", "capitulo", "seccion"):
            if ROMAN_RE.match(as_norm) or re.fullmatch(r'\d+$', as_norm):
                return as_is_clean, k

        if as_norm in allowed_norm or as_join_norm in allowed_norm:
            return as_is_clean, k
    return None, 0

def split_header_rest(line: str, header_word_span: Tuple[int,int]) -> str:
    return line[header_word_span[1]:].strip()

def safe_filename(name: str) -> str:
    cleaned = re.sub(r'[<>:"/\\|?*\r\n\t]+', ' ', name)
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    if not cleaned or cleaned in {".", ".."}:
        cleaned = "ley"
    return cleaned

# -------------------------------- Inline notes logic ----------------------------------

NOTE_PARENS_RE = re.compile(r'\(([^()]*)\)')  # one level

def strip_inline_notes(line: str) -> Tuple[str, List[str]]:
    """
    Remove '(...)' segments ONLY if inside they contain 'no.' (case-insensitive).
    Return (cleaned_line, [notes_without_brackets]).
    """
    if not line:
        return line, []
    notes: List[str] = []
    kept_parts: List[str] = []
    idx = 0
    for m in NOTE_PARENS_RE.finditer(line):
        start, end = m.span()
        content = m.group(1) or ""
        if re.search(r'(?i)\bno\.', content, flags=re.IGNORECASE):
            kept_parts.append(line[idx:start])
            notes.append(collapse_ws(content.strip()))
            idx = end
        else:
            kept_parts.append(line[idx:end])
            idx = end
    kept_parts.append(line[idx:])
    cleaned = ''.join(kept_parts).strip()
    return cleaned, notes

# ----------------------------------- Data model ---------------------------------------

@dataclass
class Node:
    tipo: str
    sufijo: str
    nombre: Optional[str]
    nota: List[str] = field(default_factory=list)
    contenido: Union[str, List['Node']] = field(default_factory=list)
    start: int = 0
    end: int = 0
    line: int = 0
    level: int = 0
    header_line_text: Optional[str] = None

    def to_json_obj(self) -> Dict[str, Any]:
        base = {"tipo": self.tipo, "sufijo": self.sufijo}
        if self.tipo != "articulo":
            base["nombre"] = self.nombre if self.nombre is not None else None
            base["nota"] = self.nota[:] if self.nota else []
            base["contenido"] = [c.to_json_obj() for c in (self.contenido or [])]
        else:
            base["nota"] = self.nota[:] if self.nota else []
            base["contenido"] = self.contenido if isinstance(self.contenido, str) else ""
        return base

# -------------------------------- Counting utility ------------------------------------

def count_unidades(nodes: List[Node]) -> Dict[str, int]:
    counts = {"libros":0,"titulos":0,"capitulos":0,"secciones":0,"articulos":0}
    def _walk(n: Node):
        if n.tipo == "libro": counts["libros"] += 1
        elif n.tipo == "titulo": counts["titulos"] += 1
        elif n.tipo == "capitulo": counts["capitulos"] += 1
        elif n.tipo == "seccion": counts["secciones"] += 1
        elif n.tipo == "articulo": counts["articulos"] += 1
        if isinstance(n.contenido, list):
            for c in n.contenido: _walk(c)
    for n in nodes: _walk(n)
    return counts

# -------------------- Header detection & Artículo parsing -----------------------------

def detect_container_header(line: str) -> Optional[Tuple[str, Tuple[int,int]]]:
    for tipo, pat in HDR_WORDS.items():
        m = pat.match(line)
        if m:
            return tipo, m.span(1)
    return None

def parse_container_header(clean_line: str, tipo: str, header_span: Tuple[int,int]) -> Tuple[Optional[str], Optional[str]]:
    rest = split_header_rest(clean_line, header_span)
    if not rest:
        return None, None
    tokens = tokenize(rest)
    if not tokens:
        return None, None
    suffix, consumed = allowed_suffix_for(tipo, tokens)
    if not suffix:
        return None, None
    after_suffix = ' '.join(tokens[consumed:]).strip() if consumed < len(tokens) else ""
    nombre_inline = after_suffix if after_suffix else None
    return suffix, nombre_inline

def parse_articulo_header_and_body(lines: List[str], i: int, line_starts: List[int], full_text: str) -> Optional[Tuple[Node, int]]:
    raw_line = lines[i]
    line, header_notes = strip_inline_notes(raw_line)

    m = re.match(r'^\s*(\S+)', line)
    if not m:
        return None
    first = m.group(1)
    if not is_articulo_token(first):
        return None

    rest = line[m.end():]
    if not re.match(r'^\s*\d+', rest):
        return None

    term_idx = None
    k = 0
    while k < len(rest):
        if rest[k] == '.':
            k2 = k + 1
            if k2 < len(rest) and rest[k2] == '-':
                k2 += 1
            if k2 >= len(rest) or rest[k2].isspace():
                term_idx = k
                break
        k += 1
    if term_idx is None:
        return None

    candidate_suffix = rest[:term_idx].strip()
    if not re.search(r'\d', candidate_suffix):
        return None

    jstart = term_idx + 1
    if jstart < len(rest) and rest[jstart] == '-':
        jstart += 1
    while jstart < len(rest) and rest[jstart].isspace():
        jstart += 1
    after = rest[jstart:]

    body_lines: List[str] = []
    if after.strip():
        body_clean, body_notes = strip_inline_notes(after.rstrip())
        body_lines.append(body_clean)
        header_notes.extend(body_notes)

    j = i + 1
    while j < len(lines):
        candidate_raw = lines[j]
        cand_clean, cand_notes = strip_inline_notes(candidate_raw.rstrip())

        if detect_container_header(cand_clean) or (
            cand_clean.strip() and is_articulo_token(cand_clean.strip().split(' ', 1)[0])
        ):
            if cand_clean.strip() and is_articulo_token(cand_clean.strip().split(' ', 1)[0]):
                lrest = cand_clean.strip()[len(cand_clean.strip().split(' ', 1)[0]):]
                if not re.match(r'^\s*\d+', lrest):
                    header_notes.extend(cand_notes)
                    body_lines.append(cand_clean)
                    j += 1
                    continue
            break
        header_notes.extend(cand_notes)
        body_lines.append(cand_clean)
        j += 1

    start_char = line_starts[i]
    end_char = line_starts[j] if j < len(lines) else len(full_text)
    content_text = "\n".join([ln.rstrip() for ln in body_lines if ln.strip()]).strip()

    node = Node(
        tipo="articulo",
        sufijo=candidate_suffix,
        nombre=None,
        nota=[n for n in header_notes if n],
        contenido=content_text,
        start=start_char,
        end=end_char,
        line=i+1,
        level=LEVEL["articulo"],
        header_line_text=raw_line.strip()
    )
    return node, j

# ----------------------------- Article repair utilities --------------------------------

def parse_article_base_int(sufijo: str) -> Optional[int]:
    m = re.search(r'(\d+)', sufijo)
    if m:
        try:
            return int(m.group(1))
        except:
            return None
    return None

def parse_article_base_and_variant(sufijo: str) -> Tuple[Optional[int], str]:
    """
    Return (base_int, variant_text_normalized_without_spaces) where variant can be ''.
    """
    m = re.search(r'(\d+)\s*(.*)$', sufijo.strip())
    if not m:
        return None, ""
    base = parse_article_base_int(sufijo)
    tail = (m.group(2) or "").strip()
    tail_norm = norm(tail).replace(" ", "")
    return base, tail_norm  # '' if no variant

def find_embedded_article_headers(text: str) -> List[Tuple[int, int, str]]:
    """
    Return list of (start_idx, end_idx_after_header, sufijo_text) matches for inline headers.
    """
    matches: List[Tuple[int,int,str]] = []
    for mm in ARTICULO_INLINE_RE.finditer(text):
        start = mm.start()
        end = mm.end()
        suf = collapse_ws(mm.group("sufijo"))
        matches.append((start, end, suf))
    return matches

def split_embedded_articles_in_list(nodes: List[Node]) -> None:
    """
    Traverse a node list; for each artículo node whose content contains inline 'Artículo ...'
    headers, split them into separate article nodes IF AND ONLY IF the first embedded header
    matches the expected immediate sequence: base+1 OR same base with a non-empty variant.
    We accept a chain of subsequent embedded headers only if they continue +1 steps.
    """
    i = 0
    while i < len(nodes):
        node = nodes[i]
        # Recurse into containers first
        if node.tipo != "articulo" and isinstance(node.contenido, list):
            split_embedded_articles_in_list(node.contenido)

        if node.tipo == "articulo" and isinstance(node.contenido, str) and node.contenido:
            text = node.contenido
            emb = find_embedded_article_headers(text)
            if emb:
                base0, var0 = parse_article_base_and_variant(node.sufijo)
                if base0 is not None:
                    # Evaluate first embedded header
                    s0, e0, suf0 = emb[0]
                    b1, v1 = parse_article_base_and_variant(suf0)
                    ok_first = False
                    # Allowed start: base+1 or same base with non-empty variant (and not identical suffix)
                    if b1 is not None:
                        if b1 == base0 + 1:
                            ok_first = True
                        elif b1 == base0:
                            if v1 and norm(suf0) != norm(node.sufijo):
                                ok_first = True
                    if ok_first and s0 >= 1:
                        # Build a chain of accepted matches: consecutive +1 steps
                        accepted = [(s0, e0, suf0, b1)]
                        expected_next = b1 + 1
                        for k in range(1, len(emb)):
                            sk, ek, sufk = emb[k]
                            bk, vk = parse_article_base_and_variant(sufk)
                            if bk is None:
                                break
                            if bk == expected_next:
                                accepted.append((sk, ek, sufk, bk))
                                expected_next += 1
                            else:
                                # stop at first non-consecutive
                                break

                        # Perform split
                        new_nodes: List[Node] = []
                        # part before first embedded header stays in current node
                        before = text[:accepted[0][0]].rstrip()
                        node.contenido = before

                        # For each accepted embedded header, create a new Node with its body
                        for idx_acc, (sk, ek, sufk, bk) in enumerate(accepted):
                            body_start = ek
                            body_end = accepted[idx_acc + 1][0] if idx_acc + 1 < len(accepted) else len(text)
                            body = text[body_start:body_end].strip()
                            new_nodes.append(Node(
                                tipo="articulo",
                                sufijo=sufk,
                                nombre=None,
                                nota=[],
                                contenido=body,
                                start=0, end=0,
                                line=node.line,  # best-effort
                                level=LEVEL["articulo"],
                                header_line_text=f"Artículo {sufk}."
                            ))

                        # Insert new nodes right after the original
                        nodes[i+1:i+1] = new_nodes
                        # Skip past inserted items
                        i += len(new_nodes)
        i += 1

# ------------------------------------ Main parser -------------------------------------

def parse_law_text(text: str, issues: List[Dict[str,Any]]) -> Tuple[str, List[Node], Dict[str, List[str]]]:
    """
    Parse a full law TXT into a title and list of top-level nodes.
    Also returns per-tipo invalid suffixes encountered: {"libro":[...],"titulo":[...],...}
    """
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    lines = text.split('\n')

    title = next((ln.strip() for ln in lines if ln.strip()), "Sin título")

    line_starts = []
    pos = 0
    for ln in lines:
        line_starts.append(pos)
        pos += len(ln) + 1

    root_nodes: List[Node] = []
    stack: List[Node] = []

    # Track invalid suffix samples per tipo
    invalid_suffixes: Dict[str, Set[str]] = {k: set() for k in ["libro","titulo","capitulo","seccion"]}

    i = 0
    while i < len(lines):
        raw_line = lines[i]

        # Optional PRELIMINAR block at top
        if not root_nodes and not stack:
            prelim_clean, prelim_notes = strip_inline_notes(raw_line)
            if re.match(r'^\s*disposiciones\s+preliminares\b.*$', unidecode(prelim_clean), re.IGNORECASE):
                node = Node(
                    tipo="preliminar",
                    sufijo="",
                    nombre=prelim_clean.strip(),
                    nota=prelim_notes,
                    contenido=[],
                    start=line_starts[i],
                    end=0,
                    line=i+1,
                    level=LEVEL["preliminar"],
                    header_line_text=raw_line.strip()
                )
                root_nodes.append(node); stack.append(node)
                i += 1
                continue

        # 1) Try artículo first
        parsed_art = parse_articulo_header_and_body(lines, i, line_starts, text)
        if parsed_art:
            art_node, j = parsed_art
            parent = stack[-1] if stack else None
            if parent is None:
                root_nodes.append(art_node)
            else:
                if isinstance(parent.contenido, list):
                    parent.contenido.append(art_node)
                else:
                    parent.contenido = [art_node]
            i = j
            continue

        # 2) Try container header (strip notes first)
        clean_line, inline_notes = strip_inline_notes(raw_line)
        det = detect_container_header(clean_line)
        if det:
            tipo, hdr_span = det
            suffix, nombre_inline = parse_container_header(clean_line, tipo, hdr_span)

            if not suffix:
                # Capture candidate "invalid" suffix sample for this tipo
                rest = split_header_rest(clean_line, hdr_span)
                tokens = tokenize(rest)
                sample_tokens = [re.sub(r'[.\-:—,;]+$', '', t).strip() for t in tokens[:3]]
                sample = collapse_ws(' '.join([t for t in sample_tokens if t]))
                if sample:
                    invalid_suffixes.get(tipo, set()).add(sample)

                issues.append({
                    "location": f"line {i+1}",
                    "message": f"{tipo.title()} sin sufijo válido (línea ignorada)",
                    "issue_type": "warning",
                    "line_text": raw_line.strip()
                })
                i += 1
                continue

            # Name-on-next-line rule (strip notes there too)
            nombre = nombre_inline
            notes_for_node = list(inline_notes)
            if nombre is None:
                peek = i + 1
                while peek < len(lines) and not lines[peek].strip():
                    peek += 1
                if peek < len(lines):
                    next_line_clean, next_line_notes = strip_inline_notes(lines[peek].strip())
                    is_header = bool(detect_container_header(next_line_clean))
                    is_article_hdr = False
                    if next_line_clean.strip():
                        first_tok = next_line_clean.strip().split(' ', 1)[0]
                        is_article_hdr = is_articulo_token(first_tok) and re.match(
                            r'^\s*\d+', next_line_clean[len(first_tok):] or "")
                    if not (is_header or is_article_hdr):
                        nombre = next_line_clean if next_line_clean else None
                        if next_line_notes:
                            notes_for_node.extend(next_line_notes)

            node = Node(
                tipo=tipo,
                sufijo=suffix,
                nombre=nombre,
                nota=notes_for_node,
                contenido=[],
                start=line_starts[i],
                end=0,
                line=i+1,
                level=LEVEL[tipo],
                header_line_text=raw_line.strip()
            )

            while stack and stack[-1].level >= node.level:
                top = stack.pop()
                top.end = line_starts[i]

            if not stack:
                root_nodes.append(node)
            else:
                stack[-1].contenido.append(node)
            stack.append(node)
            i += 1
            continue

        # 3) Plain line: attach any inline notes to nearest open container
        if inline_notes and stack:
            for n in inline_notes:
                if n and n not in stack[-1].nota:
                    stack[-1].nota.append(n)
        i += 1

    # Close remaining containers at EOF
    for n in stack[::-1]:
        n.end = len(text)

    # ---- Post-parse normalization: split inline embedded artículo headers (auto-repair) ----
    split_embedded_articles_in_list(root_nodes)

    # Finalize nodes (end positions, note de-dup)
    def _finalize(n: Node):
        if n.end == 0:
            n.end = len(text)
        if isinstance(n.contenido, list):
            for c in n.contenido: _finalize(c)
        seen = set(); dedup = []
        for x in n.nota:
            if x not in seen:
                seen.add(x); dedup.append(x)
        n.nota = dedup

    for n in root_nodes:
        _finalize(n)

    # Convert invalid suffix sets to lists
    invalid_out = {k: sorted(list(v)) for k, v in invalid_suffixes.items() if v}

    return title, root_nodes, invalid_out

# ---------------------------- Article sequence validation -----------------------------

def validate_article_sequence(nodes: List[Node], file_issues: List[Dict[str,Any]]) -> List[Dict[str, Any]]:
    """
    Detect forward jumps in base article numbers (> +1).
    Returns a list of jump dicts for CSV export and logs a verbose warning per jump.
    (Runs AFTER auto-repair, so only genuine jumps remain.)
    """
    arts: List[Node] = []
    def _walk(n: Node):
        if n.tipo == "articulo":
            arts.append(n)
        elif isinstance(n.contenido, list):
            for c in n.contenido: _walk(c)
    for n in nodes: _walk(n)

    jumps: List[Dict[str,Any]] = []
    prev_base = None
    prev_node: Optional[Node] = None

    for a in arts:
        base = parse_article_base_int(a.sufijo)
        if base is None:
            prev_node = a if prev_node is None else prev_node
            continue
        if prev_base is None:
            prev_base = base
            prev_node = a
            continue
        if base > prev_base + 1:
            jump = {
                "prev_line": prev_node.line if prev_node else "",
                "prev_sufijo": prev_node.sufijo if prev_node else "",
                "prev_line_text": (prev_node.header_line_text or f"Artículo {prev_node.sufijo}.") if prev_node else "",
                "current_line": a.line,
                "current_sufijo": a.sufijo,
                "current_line_text": a.header_line_text or f"Artículo {a.sufijo}.",
                "prev_base": prev_base,
                "current_base": base,
                "delta": base - prev_base
            }
            jumps.append(jump)
            file_issues.append({
                "location": f"line {a.line}",
                "message": f"Secuencia de artículos salta de {prev_base} a {base}",
                "issue_type": "warning",
                "line_text": jump["current_line_text"]
            })
        if base >= prev_base:
            prev_base = base
            prev_node = a

    return jumps

# --------------------------------- I/O utilities -------------------------------------

def law_basename(path: Path) -> str:
    return path.stem

def inferred_title_from_file(path: Path, parsed_title: str) -> str:
    return path.stem

def write_json(data: Dict[str,Any], path: Path):
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")

def save_errors_to_file(file_issues: List[Dict[str,Any]], file_name: str, errores_dir: Path):
    if not file_issues:
        return
    errores_dir.mkdir(parents=True, exist_ok=True)
    error_file_path = errores_dir / f"{Path(file_name).stem}_errors.txt"
    with error_file_path.open("w", encoding="utf-8") as f:
        f.write(f"Errores y advertencias para: {file_name}\n")
        f.write("="*50 + "\n\n")
        for issue in file_issues:
            f.write(f"Ubicación: {issue.get('location', 'N/A')}\n")
            f.write(f"Tipo: {issue.get('issue_type', 'warning')}\n")
            f.write(f"Mensaje: {issue.get('message', '')}\n")
            if issue.get("line_text"):
                f.write(f"Línea: {issue['line_text']}\n")
            f.write("-"*30 + "\n")

# ------------------------------- Catalogue utilities ----------------------------------

def load_catalog(catalog_csv: Path) -> Dict[str, Dict[str,str]]:
    mapping: Dict[str, Dict[str,str]] = {}
    if catalog_csv and catalog_csv.exists():
        with catalog_csv.open("r", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                file_num = (row.get("file_num") or "").strip()
                if not file_num:
                    continue
                mapping[file_num] = {
                    "law_name": (row.get("law_name") or "").strip(),
                    "link": (row.get("link") or "").strip(),
                    "num_est": (row.get("num_est") or "").strip(),
                }
    return mapping

# ---------------------------------- Main pipeline -------------------------------------

def walk_and_process(
    ley_dir: Path,
    out_dir: Path,
    errores_dir: Optional[Path] = None,
    catalog_csv: Optional[Path] = None
) -> Dict[str, Any]:
    out_dir.mkdir(parents=True, exist_ok=True)
    if errores_dir:
        errores_dir.mkdir(parents=True, exist_ok=True)

    catalog = load_catalog(catalog_csv) if catalog_csv else {}

    issues_rows = []
    jump_rows = []

    manifest = {
        "processed_files": [],
        "files": {},
        "totals": {"libros":0,"titulos":0,"capitulos":0,"secciones":0,"articulos":0},
        "warnings": 0,
        "errors": 0
    }

    txt_files = sorted([p for p in ley_dir.glob("*.txt") if p.is_file()])
    for p in txt_files:
        raw = p.read_text(encoding="utf-8", errors="replace")
        file_issues: List[Dict[str,Any]] = []

        title, nodes, invalid_suffixes = parse_law_text(raw, file_issues)

        # Validate article sequence AFTER auto-repair; only genuine jumps remain
        jumps = validate_article_sequence(nodes, file_issues)
        jump_rows.extend([{
            "file": p.name,
            **jr
        } for jr in jumps])

        # Save verbose errors per file
        if errores_dir and file_issues:
            save_errors_to_file(file_issues, p.name, errores_dir)

        counts = count_unidades(nodes)

        # Determine output JSON filename via catalogue (file_num -> law_name)
        stem = p.stem  # expected like '0001'
        law_name = catalog.get(stem, {}).get("law_name") or inferred_title_from_file(p, title)

        # Output JSON name = file_num.json (strict)
        out_name = f"{stem}.json"
        final_obj = {
            "ley": law_name,
            "contenido": [n.to_json_obj() for n in nodes]
        }
        out_json_path = out_dir / out_name
        write_json(final_obj, out_json_path)

        # Write invalid suffixes JSON for this file (only if there are any)
        if invalid_suffixes:
            invalid_path_base = errores_dir if errores_dir else out_dir
            invalid_path = invalid_path_base / f"{stem}_invalid_suffixes.json"
            write_json(invalid_suffixes, invalid_path)

        # Collect issues to global CSV rows
        for it in file_issues:
            issues_rows.append({
                "file": p.name,
                "location": it.get("location",""),
                "issue_type": it.get("issue_type","warning"),
                "message": it.get("message",""),
                "line_text": it.get("line_text",""),
            })

        manifest["processed_files"].append(p.name)
        manifest["files"][p.name] = {
            "output": out_json_path.name,
            "law_name": law_name,
            "counts": counts,
            "issues": file_issues,
        }
        for k in counts:
            manifest["totals"][k] += counts[k]
        manifest["warnings"] += sum(1 for it in file_issues if it.get("issue_type") == "warning")
        manifest["errors"]   += sum(1 for it in file_issues if it.get("issue_type") == "error")

    # Write manifest + CSVs
    write_json(manifest, out_dir / "manifest.json")

    if errores_dir:
        # Parsing issues CSV (includes line text)
        with (errores_dir / "parsing_issues.csv").open("w", newline="", encoding="utf-8") as f:
            w = csv.DictWriter(f, fieldnames=["file","location","issue_type","message","line_text"])
            w.writeheader(); w.writerows(issues_rows)

        # Article jumps CSV
        with (errores_dir / "articulo_jumps.csv").open("w", newline="", encoding="utf-8") as f:
            w = csv.DictWriter(
                f,
                fieldnames=[
                    "file",
                    "prev_line","prev_sufijo","prev_line_text",
                    "current_line","current_sufijo","current_line_text",
                    "prev_base","current_base","delta"
                ]
            )
            w.writeheader(); w.writerows(jump_rows)

    return manifest

# ------------------------------ Error summary & summary -------------------------------

def create_error_summary(manifest: Dict[str, Any], errores_dir: Path):
    if not errores_dir:
        return
    summary_path = errores_dir / "error_summary.txt"
    with summary_path.open("w", encoding="utf-8") as f:
        f.write("RESUMEN CONSOLIDADO DE ERRORES Y ADVERTENCIAS\n")
        f.write("="*60 + "\n\n")
        f.write(f"Archivos procesados: {len(manifest.get('processed_files', []))}\n")
        f.write(f"Total de advertencias: {manifest.get('warnings', 0)}\n")
        f.write(f"Total de errores: {manifest.get('errors', 0)}\n\n")
        f.write("DETALLES POR ARCHIVO:\n")
        f.write("-"*40 + "\n")
        for fname, info in manifest.get("files", {}).items():
            issues = info.get("issues", [])
            if issues:
                f.write(f"\n📁 {fname}:\n")
                for issue in issues:
                    f.write(f"  • {issue.get('location', 'N/A')}: ")
                    f.write(f"[{issue.get('issue_type', 'warning').upper()}] ")
                    f.write(f"{issue.get('message', '')}\n")
                    if issue.get("line_text"):
                        f.write(f"    Línea: {issue['line_text']}\n")

        f.write(f"\n\nArchivos de error individuales guardados en: {errores_dir}\n")

def print_summary(manifest: Dict[str, Any]):
    print("="*72)
    print(" TXT → JSON Parsing Summary")
    print("="*72)
    print(f" Files processed : {len(manifest.get('processed_files', []))}")
    t = manifest.get("totals", {})
    print(f" Libros         : {t.get('libros',0)}")
    print(f" Títulos        : {t.get('titulos',0)}")
    print(f" Capítulos      : {t.get('capitulos',0)}")
    print(f" Secciones      : {t.get('secciones',0)}")
    print(f" Artículos      : {t.get('articulos',0)}")
    print(f" Warnings       : {manifest.get('warnings',0)}")
    print(f" Errors         : {manifest.get('errors',0)}")
    print("-"*72)
    for fname, info in manifest.get("files", {}).items():
        c = info.get("counts", {})
        n_warn = sum(1 for it in info.get("issues",[]) if it.get("issue_type") == "warning")
        n_err  = sum(1 for it in info.get("issues",[]) if it.get("issue_type") == "error")
        print(f" {fname} → {info.get('output')} ({info.get('law_name','')})")
        print(f"   L:{c.get('libros',0)} T:{c.get('titulos',0)} C:{c.get('capitulos',0)} S:{c.get('secciones',0)} A:{c.get('articulos',0)} | warn:{n_warn} err:{n_err}")
    print("="*72)


In [57]:
# ============== FINAL PROCESSING: JSON GENERATION & VALIDATION ==============

print("Starting Final Processing: JSON Structure Generation")
print("=" * 80)

# Execute the complete processing pipeline
# - Reads law text files from LEY_DIR  
# - Parses hierarchical structure (libros, títulos, capítulos, secciones, artículos)
# - Validates article sequences and structural integrity
# - Generates structured JSON output with comprehensive metadata
# - Creates detailed error reports and validation logs

manifest = walk_and_process(
    ley_dir=LEY_DIR,           # Input: Cleaned law text files
    out_dir=JSON_DIR,          # Output: Structured JSON files
    errores_dir=ERRORES_DIR,   # Logs: Error reports and validation
    catalog_csv=CATALOG_CSV    # Metadata: Law names and references
)

print("=" * 80)
print("PROCESSING COMPLETE - Generating Summary Reports")
print("=" * 80)

# Display comprehensive processing summary
print_summary(manifest)

# Generate consolidated error summary for quality review
create_error_summary(manifest, ERRORES_DIR)

print("\n **Pipeline Execution Complete!**")
print(f" JSON files: {JSON_DIR}")
print(f" Error reports: {ERRORES_DIR}")
print(f" Processing manifest: {JSON_DIR / 'manifest.json'}")

# Display final statistics
total_articles = manifest.get("totals", {}).get("articulos", 0)
total_warnings = manifest.get("warnings", 0)
total_errors = manifest.get("errors", 0)

print(f"\n **Final Results:**")
print(f"    Total articles parsed: {total_articles:,}")
print(f"    Warnings: {total_warnings}")
print(f"    Errors: {total_errors}")
print(f"    Success rate: {((total_articles - total_errors) / max(total_articles, 1) * 100):.1f}%")

Starting Final Processing: JSON Structure Generation
PROCESSING COMPLETE - Generating Summary Reports
 TXT → JSON Parsing Summary
 Files processed : 0
 Libros         : 0
 Títulos        : 0
 Capítulos      : 0
 Secciones      : 0
 Artículos      : 0
 Errors         : 0
------------------------------------------------------------------------

 **Pipeline Execution Complete!**
 JSON files: c:\Users\braul\Documents\OneDrive\Leyes\14\Refined\json
 Error reports: c:\Users\braul\Documents\OneDrive\Leyes\14\Refined\errores
 Processing manifest: c:\Users\braul\Documents\OneDrive\Leyes\14\Refined\json\manifest.json

 **Final Results:**
    Total articles parsed: 0
    Errors: 0
    Success rate: 0.0%
