# DOCX Table Extraction Walkthrough

This notebook demonstrates extracting structured tables from Word documents (.docx).

Key capabilities:
- **Merged cell handling**: python-docx returns duplicate `_tc` references for horizontally merged cells; the extractor deduplicates them
- **Compound headers**: hierarchical headers (metric labels spanning 2+ columns) are stacked with " / " separators
- **User-defined classification**: `classify_docx_tables()` accepts caller-supplied categories and keywords
- **Dynamic pivot values**: `extract_pivot_values()` reads years from headers instead of hardcoding them

The pipeline: DOCX → extract raw grids → detect header rows (merge-based) → build compound headers → render pipe-table markdown → `interpret_table()`
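
As a rough illustration of the last rendering step, the sketch below turns a list of column names and data rows into pipe-table markdown. `render_pipe_table` is a hypothetical helper written for this walkthrough, not the library's renderer, and the sample values are made up.

```python
def render_pipe_table(headers, rows):
    """Render column names and data rows as a pipe-table markdown string."""
    def fmt(cells):
        # Escape literal pipes so they are not read as column boundaries
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(headers), "|" + "|".join("---" for _ in headers) + "|"]
    lines.extend(fmt(r) for r in rows)
    return "\n".join(lines)


# Made-up sample data, for illustration only
headers = ["Category", "Revenue / Q1", "Revenue / Q2"]
rows = [["Widgets", 10, 12], ["Gadgets", 7, 9]]
table_md = render_pipe_table(headers, rows)
print(table_md)
```

The output has one header line, one separator line, and one line per data row, which is the shape the later cells print.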

## 1. Synthetic Files — The Agnostic API

Eight synthetic DOCX files in `inputs/docx/synthetic/` cover different table structures.
No domain knowledge is required: the API works on any DOCX.

In [None]:
import glob
import os

synth_files = sorted(glob.glob("inputs/docx/synthetic/*.docx"))
print(f"{len(synth_files)} synthetic DOCX files:")
for f in synth_files:
    # basename handles both "/" and "\\" path separators
    print(f"  {os.path.basename(f)}")

In [None]:
from pdf_ocr import extract_tables_from_docx, compress_docx_tables

# Basic extraction from a flat table (no merges)
tables = extract_tables_from_docx("inputs/docx/synthetic/flat.docx")
print(f"flat.docx: {len(tables)} table, {len(tables[0].column_names)} cols, {len(tables[0].data)} data rows")
print(f"Source format: {tables[0].source_format}")

## 2. Merged Cells and Compound Headers

When headers span multiple rows (horizontal merge for metric labels, vertical merge for label columns),
`compress_docx_tables()` builds compound column names joined by " / ".
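
The stacking idea can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `build_compound_headers` is a hypothetical name, and it assumes that an empty string marks the continuation of a horizontal span.

```python
def build_compound_headers(header_rows, sep=" / "):
    """Stack header rows into one name per column.

    Assumption: an empty string marks the continuation of a horizontal
    span, so every row except the last is forward-filled left to right.
    """
    filled = []
    for r, row in enumerate(header_rows):
        is_last = r == len(header_rows) - 1
        out, current = [], ""
        for cell in row:
            text = cell.strip()
            if text:
                current = text
            # The last row holds leaf labels, so it is never forward-filled
            out.append(text if is_last else current)
        filled.append(out)

    names = []
    for parts in zip(*filled):
        # Drop blanks and immediate repeats (e.g. vertically merged labels)
        kept = []
        for p in parts:
            if p and (not kept or kept[-1] != p):
                kept.append(p)
        names.append(sep.join(kept))
    return names


# Same layout as the hspan.docx comment in the next cell
header_rows = [
    ["Category", "Revenue", "", "Cost", ""],
    ["", "Q1", "Q2", "Q1", "Q2"],
]
compound = build_compound_headers(header_rows)
print(compound)
# ['Category', 'Revenue / Q1', 'Revenue / Q2', 'Cost / Q1', 'Cost / Q2']
```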

In [None]:
# hspan.docx: 2 header rows with horizontal spans
# Row 0: Category | Revenue (span 2) | Cost (span 2)
# Row 1:          | Q1 | Q2          | Q1 | Q2

results = compress_docx_tables("inputs/docx/synthetic/hspan.docx")
md, meta = results[0]
print(f"Rows: {meta['row_count']}, Cols: {meta['col_count']}")
print()
print(md)

In [None]:
# deep_hierarchy.docx: 3 header rows (Group / Sub / Year)

results = compress_docx_tables("inputs/docx/synthetic/deep_hierarchy.docx")
md, meta = results[0]
print(f"Rows: {meta['row_count']}, Cols: {meta['col_count']}")
print()
print(md)

In [None]:
# title_hspan.docx: title row ("QUARTERLY REPORT") + hierarchical headers

results = compress_docx_tables("inputs/docx/synthetic/title_hspan.docx")
md, meta = results[0]
print(f"Title: {meta['title']!r}")
print(f"Rows: {meta['row_count']}, Cols: {meta['col_count']}")
print()
print(md)

## 3. The Merged Cell Bug Fix

python-docx returns **duplicate `_tc` references** for horizontally merged cells. A cell with `gridSpan=2`
appears twice in `row.cells`, both pointing to the same XML element. Without deduplication, headers shift right.

The fix: track `id(cell._tc)` in a `seen_tcs` set and skip duplicates.
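
A self-contained sketch of that deduplication follows. `FakeTc`, `FakeCell`, and `dedupe_row` are hypothetical stand-ins so the example runs without python-docx; only the `_tc` attribute is modeled. Emitting an empty string for a repeated element is an assumption based on the "empty span continuation" shown in the grid output below.

```python
class FakeTc:
    """Stand-in for the XML element behind a cell (hypothetical)."""
    def __init__(self, text):
        self.text = text


class FakeCell:
    """Stand-in for a python-docx cell; only `_tc` and `text` are modeled."""
    def __init__(self, tc):
        self._tc = tc
        self.text = tc.text


def dedupe_row(cells):
    """Return cell texts, blanking cells whose `_tc` was already seen."""
    seen_tcs = set()
    texts = []
    for cell in cells:
        key = id(cell._tc)
        if key in seen_tcs:
            texts.append("")  # continuation of a horizontal span
            continue
        seen_tcs.add(key)
        texts.append(cell.text)
    return texts


# A merged cell shows up twice in the row, sharing one underlying element
category, revenue, cost = FakeTc("Category"), FakeTc("Revenue"), FakeTc("Cost")
row = [FakeCell(category), FakeCell(revenue), FakeCell(revenue),
       FakeCell(cost), FakeCell(cost)]

naive = [c.text for c in row]
deduped = dedupe_row(row)
print(naive)    # ['Category', 'Revenue', 'Revenue', 'Cost', 'Cost']
print(deduped)  # ['Category', 'Revenue', '', 'Cost', '']
```

Keying on `id()` is only safe while the underlying elements are still referenced, which holds here because the row keeps them alive for the duration of the loop.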

In [None]:
from docx import Document
from pdf_ocr.docx_extractor import _build_grid_from_table, _detect_header_rows_from_merges

# Show raw grid from hspan.docx — verify no duplicate text
doc = Document("inputs/docx/synthetic/hspan.docx")
grid, _ = _build_grid_from_table(doc.tables[0])
hc = _detect_header_rows_from_merges(doc.tables[0])

print(f"Grid: {len(grid)} rows x {len(grid[0])} cols, {hc} header rows")
for i, row in enumerate(grid):
    tag = "[header]" if i < hc else "[data]  "
    print(f"  row {i} {tag}: {row}")

print(f"\nRow 0 col 1 = {grid[0][1]!r}  (not duplicated)")
print(f"Row 0 col 2 = {grid[0][2]!r}  (empty span continuation)")

## 4. Classification with User-Defined Categories

`classify_docx_tables(path, categories)` matches header text against caller-supplied keywords.
No domain knowledge is hardcoded; you define your own categories.
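
One plausible way to do this kind of keyword matching is sketched below. It is not the library's implementation: `classify_header_text`, the substring-count scoring, and the `"other"` fallback label are all assumptions made for illustration.

```python
def classify_header_text(header_text, categories, default="other"):
    """Pick the category whose keywords appear most often in the header text.

    Matching is case-insensitive substring search; ties keep the first
    category in dictionary order, and zero hits fall back to `default`.
    """
    text = header_text.lower()
    best, best_hits = default, 0
    for name, keywords in categories.items():
        hits = sum(1 for kw in keywords if kw.lower() in text)
        if hits > best_hits:
            best, best_hits = name, hits
    return best


demo_categories = {
    "shipping": ["cargo", "port", "vessel"],
    "hr": ["employee", "salary", "department"],
}

print(classify_header_text("Vessel | Port | Cargo (t)", demo_categories))       # shipping
print(classify_header_text("Employee | Department | Salary", demo_categories))  # hr
print(classify_header_text("Name | Notes", demo_categories))                    # other
```

Substring matching is deliberately loose, so a short keyword such as "port" would also match "report"; whole-word matching is a reasonable alternative if that causes false positives.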

In [None]:
from pdf_ocr import classify_docx_tables

# multi_category.docx has 4 tables: shipping, HR, financial, inventory
categories = {
    "shipping": ["cargo", "port", "vessel"],
    "hr": ["employee", "salary", "department"],
    "finance": ["revenue", "expenses"],
    "inventory": ["stock", "warehouse"],
}

classes = classify_docx_tables("inputs/docx/synthetic/multi_category.docx", categories)
for c in classes:
    print(f"  Table {c['index']}: {c['category']:12s} title={c['title'] or '(none)':25s} {c['rows']}r x {c['cols']}c")

In [None]:
# Use classification to filter and compress only financial tables
finance_idx = [c["index"] for c in classes if c["category"] == "finance"]
results = compress_docx_tables("inputs/docx/synthetic/multi_category.docx", table_indices=finance_idx)

for md, meta in results:
    print(f"Title: {meta['title']!r}")
    print(md)
    print()

## 5. Dynamic Pivot Values

`extract_pivot_values()` reads years from the compressed markdown headers.
This avoids hardcoding years that change every release.
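
The idea can be approximated with a regular expression over the header line. This is a sketch under stated assumptions, not how `extract_pivot_values()` is implemented: `years_from_header` is a hypothetical helper, it assumes the first line of the markdown is the header row, and it only recognizes four-digit years starting with 19 or 20.

```python
import re


def years_from_header(md):
    """Return the distinct four-digit years found in the first line, sorted."""
    header = md.strip().splitlines()[0]
    found = re.findall(r"\b(?:19|20)\d{2}\b", header)
    return sorted(set(found))


# Made-up table, for illustration only
sample_md = (
    "| Region | Area / 2024 | Area / 2025 | Yield / 2024 | Yield / 2025 |\n"
    "|---|---|---|---|---|\n"
    "| North | 1 | 2 | 3 | 4 |"
)
sample_years = years_from_header(sample_md)
print(sample_years)  # ['2024', '2025']
```

Years are kept as strings here so they can be matched against header text directly.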

In [None]:
from pdf_ocr import extract_pivot_values

# Extract years from each synthetic file that has them
for name in ["hspan", "title_hspan", "deep_hierarchy", "unicode"]:
    results = compress_docx_tables(f"inputs/docx/synthetic/{name}.docx")
    md, _ = results[0]
    years = extract_pivot_values(md)
    print(f"{name:20s} → years={years}  last 2: {years[-2:]}")

## 6. Real-World Example — Russian Agricultural Reports

Six weekly DOCX reports from the Russian Ministry of Agriculture.
Each contains multiple tables (export summaries, harvest progress, planting progress).

In [None]:
DOCX_DIR = "inputs/docx/input"
docx_files = sorted(glob.glob(f"{DOCX_DIR}/*.docx"))

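def _first_match(token):
    """Return the first report whose path contains `token`, failing with a clear message."""
    matches = [f for f in docx_files if token in f]
    if not matches:
        raise FileNotFoundError(f"No DOCX file in {DOCX_DIR} matches {token!r}")
    return matches[0]

# Tokens mirror the month spellings used in the file names
labels = {
    "Apr 21-22": _first_match("Apr 21"),
    "Apr 25-26": _first_match("April 25"),
    "May 9-10": _first_match("May"),
    "Jun 20-21": _first_match("June"),
    "Jul 11-12": _first_match("July"),
    "Sep 2-3": _first_match("September"),
}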

print(f"{len(docx_files)} DOCX files:")
for label, path in labels.items():
    print(f"  {label}: {path.split('/')[-1][:60]}")

In [None]:
# Define domain-specific categories for Russian agricultural reports
ag_categories = {
    "harvest": ["area harvested", "yield", "collected", "bunker", "centner",
                "harvested area", "crop harvested"],
    "planting": ["spring crops", "moa target", "spring wheat", "spring barley",
                 "sown area", "planting", "sowing", "planted"],
    "export": ["export", "shipment", "ports", "fob", "vessel", "cargo"],
}

for label, path in labels.items():
    classes = classify_docx_tables(path, ag_categories)
    counts = {}
    for c in classes:
        counts[c["category"]] = counts.get(c["category"], 0) + 1
    print(f"{label:12s} ({len(classes):2d} tables): {dict(sorted(counts.items()))}")

In [None]:
# Show the WHEAT table from the July file
july_path = labels["Jul 11-12"]
july_classes = classify_docx_tables(july_path, ag_categories)
harvest_idx = [c["index"] for c in july_classes if c["category"] == "harvest"]

july_results = compress_docx_tables(july_path, table_indices=harvest_idx)
wheat_md, wheat_meta = [(md, m) for md, m in july_results if m["title"] == "WHEAT"][0]

print(f"WHEAT: {wheat_meta['row_count']} data rows, {wheat_meta['col_count']} cols")
print()
# Show the first 8 lines (header, separator, and the leading data rows)
for line in wheat_md.split("\n")[:8]:
    print(line)

## 7. Full Pipeline: DOCX to Structured Records

Run `interpret_table()` on the compressed markdown to extract structured records.
Years are read dynamically from headers via `extract_pivot_values()`.
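
To make the target shape concrete, the sketch below unpivots a wide table with compound headers into long records by hand. It only illustrates the record layout: `interpret_table()` delegates the mapping to a model, whereas `unpivot` is a hypothetical deterministic helper, and the region and values are made up.

```python
def unpivot(headers, rows, sep=" / "):
    """Turn 'metric / year' columns into one record per (region, metric, year)."""
    records = []
    for row in rows:
        region = row[0]
        for name, value in zip(headers[1:], row[1:]):
            # Split on the last separator so the year is always the final part
            metric, year = name.rsplit(sep, 1)
            records.append({
                "region": region,
                "metric": metric,
                "year": int(year),
                "value": float(value),
            })
    return records


# Made-up sample, for illustration only
wide_headers = ["Region", "Area harvested / 2024", "Area harvested / 2025"]
wide_rows = [["Altai", "10.5", "12.0"]]
long_records = unpivot(wide_headers, wide_rows)
for rec in long_records:
    print(rec)
```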

In [None]:
from pdf_ocr import interpret_table, interpret_tables, CanonicalSchema, to_records, to_pandas

# Dynamic year aliases from the actual headers
years = extract_pivot_values(wheat_md)
print(f"Years in headers: {years}")
year_aliases = years[-2:]  # last 2 years
print(f"Using aliases: {year_aliases}")

harvest_schema = CanonicalSchema.from_dict({
    "description": "Harvest progress by region, metric, and year",
    "columns": [
        {"name": "crop", "type": "string", "aliases": []},
        {"name": "region", "type": "string", "aliases": ["Region"]},
        {"name": "metric", "type": "string",
         "aliases": ["Area harvested", "collected", "Yield"]},
        {"name": "year", "type": "int", "aliases": year_aliases},
        {"name": "value", "type": "float", "aliases": []},
    ],
})

In [None]:
result = interpret_table(wheat_md, harvest_schema, model="openai/gpt-4o")

df_wheat = to_pandas(result, harvest_schema)
df_wheat.insert(0, "crop", "WHEAT")
print(f"WHEAT: {len(df_wheat)} records")
df_wheat.head(12)

In [None]:
import time

import pandas as pd

# All 4 July harvest tables — interpreted concurrently
texts = [md for md, _ in july_results]
titles = [meta["title"] or "Unknown" for _, meta in july_results]

t0 = time.perf_counter()
mapped_tables = interpret_tables(texts, harvest_schema, model="openai/gpt-4o")
elapsed = time.perf_counter() - t0

# Build combined DataFrame via to_pandas (typed columns, OCR coercion)
frames = []
for title, mapped in zip(titles, mapped_tables):
    df = to_pandas(mapped, harvest_schema)
    df.insert(0, "crop", title)
    print(f"  {title}: {len(df)} records")
    frames.append(df)

df_july = pd.concat(frames, ignore_index=True)
print(f"\nTotal July harvest records: {len(df_july)}  ({elapsed:.1f}s concurrent)")
df_july

### FROM → TO: Source Table vs Structured DataFrame

Side-by-side comparison of the raw DOCX pipe-table (as the LLM sees it) and the normalized pandas output.

In [None]:
from html import escape

from IPython.display import display, HTML

def _md_table_to_html(md_text: str, title: str) -> str:
    """Convert a pipe-table markdown string to an HTML table."""
    lines = [line.strip() for line in md_text.strip().split("\n") if line.strip()]
    # Skip separator lines (e.g. |---|---| or |:--|--:|)
    rows = [line for line in lines if not all(c in "-|: " for c in line)]
    style = "border:1px solid #ccc; padding:2px 6px; white-space:nowrap"
    parts = [f"<b>{escape(title)}</b><table style='font-size:11px; border-collapse:collapse; margin:4px 0'>"]
    for i, row in enumerate(rows):
        cells = [c.strip() for c in row.strip("|").split("|")]
        tag = "th" if i == 0 else "td"
        # Escape cell text so characters like < and & render literally
        parts.append("<tr>" + "".join(f"<{tag} style='{style}'>{escape(c)}</{tag}>" for c in cells) + "</tr>")
    parts.append("</table>")
    return "".join(parts)

# Show WHEAT: original pipe-table (FROM) alongside interpreted DataFrame (TO)
source_html = _md_table_to_html(wheat_md, "FROM: Raw DOCX pipe-table (WHEAT)")
df_display = df_july[df_july["crop"] == "WHEAT"].head(15)
to_html = f"<b>TO: Normalized DataFrame (WHEAT, first 15 rows)</b>{df_display.to_html(index=False)}"

display(HTML(
    "<div style='display:flex; gap:24px; align-items:flex-start'>"
    f"<div>{source_html}</div>"
    f"<div style='font-size:12px'>{to_html}</div>"
    "</div>"
))

## 8. September Harvest Tables

The September file has 8 harvest tables. The same schema handles them without changes.

In [None]:
sep_path = labels["Sep 2-3"]
sep_classes = classify_docx_tables(sep_path, ag_categories)
sep_harvest = [c["index"] for c in sep_classes if c["category"] == "harvest"]
sep_results = compress_docx_tables(sep_path, table_indices=sep_harvest)

print(f"September: {len(sep_results)} harvest tables")
for md, meta in sep_results:
    print(f"  Table {meta['table_index']}: {meta['title']!r} ({meta['row_count']} rows)")

# Interpret all 8 tables concurrently
sep_texts = [md for md, _ in sep_results]
sep_titles = [meta["title"] or "Unknown" for _, meta in sep_results]

t0 = time.perf_counter()
sep_mapped = interpret_tables(sep_texts, harvest_schema, model="openai/gpt-4o")
elapsed = time.perf_counter() - t0

sep_frames = []
for title, mapped in zip(sep_titles, sep_mapped):
    df = to_pandas(mapped, harvest_schema)
    df.insert(0, "crop", title)
    sep_frames.append(df)

df_sep = pd.concat(sep_frames, ignore_index=True)
print(f"\nTotal September harvest records: {len(df_sep)}  ({elapsed:.1f}s concurrent)")
df_sep

## 9. Compare to Expected Parquet

Load reference parquet files from `inputs/docx/output/` to compare structure and coverage.

In [None]:
try:
    import pandas as pd

    harvest_pq = "inputs/docx/output/ru_ag_min_raw_harvest_progress_20250724_08_56_15.parquet"
    planting_pq = "inputs/docx/output/ru_ag_min_raw_planting_progress_20250509_17_11_25.parquet"

    df_harvest = pd.read_parquet(harvest_pq)
    df_planting = pd.read_parquet(planting_pq)

    print("=== Harvest reference ===")
    print(f"Shape: {df_harvest.shape}")
    print(f"Columns: {list(df_harvest.columns)}")
    print(df_harvest.head())

    print("\n=== Planting reference ===")
    print(f"Shape: {df_planting.shape}")
    print(f"Columns: {list(df_planting.columns)}")
    print(df_planting.head())

except ImportError:
    print("pandas or a parquet engine is not available. Install with: pip install pandas pyarrow")
except FileNotFoundError as e:
    print(f"Reference file not found: {e}")