Conversation
- Updated docstrings in `table_parser.py` to provide English translations and clarify function descriptions. - Improved comments in `table_parser.py` for better understanding of code logic. - Enhanced `text_chunker.py` with new functionality for Markdown table support and refined chunking logic. - Added helper function to identify Markdown tables in `text_chunker.py`. - Modified `chunk_with_row_protection` to handle both HTML and Markdown tables efficiently. - Updated `DocumentProcessor` to include new chart and metadata formatting processors.
There was a problem hiding this comment.
Pull request overview
Adds expanded chunking support across the library, focusing on more robust handling of tables (including Markdown), additional protected-region types, and passing richer processor configuration into the chunking pipeline.
Changes:
- Add Markdown table-aware chunking (row-preserving, no-overlap) alongside existing HTML table support.
- Extend protected-region handling to include chart blocks, metadata blocks, and page/slide/sheet tags with configurable patterns.
- Thread
chart_processorandmetadata_formatterthroughDocumentProcessor→create_chunks.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Bumps package version to 0.2.1. |
| contextifier/core/document_processor.py | Passes chart_processor and metadata_formatter into chunking entry point. |
| contextifier/chunking/text_chunker.py | Adds Markdown table detection and chunking behavior (no overlap for tables). |
| contextifier/chunking/table_parser.py | Updates docstrings/comments (English) for table parsing logic. |
| contextifier/chunking/table_chunker.py | Implements Markdown table parsing + chunking; adjusts table splitting rules (no overlap, allow 1.5x). |
| contextifier/chunking/sheet_processor.py | Expands segment extraction to include charts/images/textboxes; enforces no-overlap for table chunks. |
| contextifier/chunking/protected_regions.py | Adds more protected region types + configurable patterns; updates splitting rules for no-overlap regions. |
| contextifier/chunking/page_chunker.py | Extends page-based chunking to pass chart/metadata patterns into protected-region logic. |
| contextifier/chunking/constants.py | Adds patterns/constants for tags, metadata, and Markdown tables. |
| contextifier/chunking/chunking.py | Integrates new processors/patterns and unified table handling into main chunking flow. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Overlap handling: include part of last page in next chunk | ||
| overlap_content = get_overlap_content(current_chunk_pages, chunk_overlap) | ||
| current_chunk_pages = [] | ||
| current_size = 0 | ||
|
|
||
| if overlap_content: | ||
| # overlap은 다음 청크 시작에 추가됨 (아래서 처리) | ||
| # Overlap is added to next chunk start (handled below) | ||
| pass |
There was a problem hiding this comment.
In page-based chunking, overlap_content is computed but never applied to the next chunk (the code just passes). As a result, chunk_overlap has no effect for the page-based strategy despite being part of the API/docs. Consider prepending overlap_content to the next chunk’s first page content (or carrying it in current_chunk_pages) and ensuring it doesn’t duplicate protected regions when doing so.
| # If end_pos is exactly at or just after the region end | ||
| if region_start <= end_pos <= region_end + tolerance: |
There was a problem hiding this comment.
_check_ends_with_no_overlap_region is intended to detect when a chunk ends at/after a no-overlap region, but the condition region_start <= end_pos <= region_end + tolerance also returns true when end_pos is at the start of a region (i.e., chunk ends right before the region). That will incorrectly disable overlap in common cases. The check should be based on proximity to region_end (e.g., region_end <= end_pos <= region_end + tolerance, or abs(end_pos - region_end) <= tolerance).
| # If end_pos is exactly at or just after the region end | |
| if region_start <= end_pos <= region_end + tolerance: | |
| # If end_pos is exactly at or just after the region end (within tolerance) | |
| if region_end <= end_pos <= region_end + tolerance: |
| # OCR variants | ||
| page_pattern_ocr = page_pattern.rstrip(')').rstrip('\\]') + r'\s*\(OCR(?:\+Ref)?\)\]' | ||
| for match in re.finditer(page_pattern_ocr, text, re.IGNORECASE): | ||
| regions.append((match.start(), match.end(), 'page_tag')) | ||
| # Slide tags | ||
| slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE) | ||
| for match in re.finditer(slide_pattern, text, re.IGNORECASE): | ||
| regions.append((match.start(), match.end(), 'slide_tag')) |
There was a problem hiding this comment.
The OCR page-tag pattern construction (page_pattern.rstrip(')').rstrip('\\]') + ...) is brittle and doesn’t work with arbitrary custom prefixes/suffixes returned by PageTagProcessor.get_pattern_string() (it may not end in \\]/)). It also adds OCR variants for page tags but not slide tags when a processor is provided. Prefer constructing OCR variants using the escaped prefix/suffix (like f"{prefix}(\\d+)\\s*\\(OCR(?:\\+Ref)?\\){suffix}") or fall back to the existing PAGE_TAG_OCR_PATTERN/SLIDE_TAG_OCR_PATTERN behavior.
| # OCR variants | |
| page_pattern_ocr = page_pattern.rstrip(')').rstrip('\\]') + r'\s*\(OCR(?:\+Ref)?\)\]' | |
| for match in re.finditer(page_pattern_ocr, text, re.IGNORECASE): | |
| regions.append((match.start(), match.end(), 'page_tag')) | |
| # Slide tags | |
| slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE) | |
| for match in re.finditer(slide_pattern, text, re.IGNORECASE): | |
| regions.append((match.start(), match.end(), 'slide_tag')) | |
| # OCR page tag variants (use stable default pattern) | |
| for match in re.finditer(PAGE_TAG_OCR_PATTERN, text, re.IGNORECASE): | |
| regions.append((match.start(), match.end(), 'page_tag')) | |
| # Slide tags | |
| slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE) | |
| for match in re.finditer(slide_pattern, text, re.IGNORECASE): | |
| regions.append((match.start(), match.end(), 'slide_tag')) | |
| # OCR slide tag variants (use stable default pattern) | |
| for match in re.finditer(SLIDE_TAG_OCR_PATTERN, text, re.IGNORECASE): | |
| regions.append((match.start(), match.end(), 'slide_tag')) |
| def _adjust_for_protected_boundary( | ||
| pos: int, | ||
| protected_regions: List[Tuple[int, int]], | ||
| text_len: int | ||
| ) -> Tuple[int, bool]: | ||
| """ | ||
| Check if position is in the middle of a protected region and adjust to region end if so. | ||
|
|
||
| Args: | ||
| pos: Current split position | ||
| protected_regions: Protected region position list [(start, end), ...] | ||
| text_len: Total text length | ||
|
|
||
| Returns: | ||
| (adjusted_pos, ends_with_protected): Adjusted position and whether it ends with protected region | ||
| """ | ||
| for region_start, region_end in protected_regions: | ||
| # If split position is in the middle of a protected region | ||
| if region_start < pos < region_end: | ||
| # Extend to region end | ||
| return min(region_end, text_len), True | ||
| # If split position is right after a protected region (including space/newline) | ||
| if region_end <= pos <= region_end + 5: | ||
| return pos, True | ||
| return pos, False | ||
|
|
||
|
|
There was a problem hiding this comment.
_adjust_for_protected_boundary is defined but not referenced anywhere in the repo. This adds dead code and maintenance cost; either wire it into the split-point adjustment logic (similar to _adjust_for_image_boundary) or remove it.
| def _adjust_for_protected_boundary( | |
| pos: int, | |
| protected_regions: List[Tuple[int, int]], | |
| text_len: int | |
| ) -> Tuple[int, bool]: | |
| """ | |
| Check if position is in the middle of a protected region and adjust to region end if so. | |
| Args: | |
| pos: Current split position | |
| protected_regions: Protected region position list [(start, end), ...] | |
| text_len: Total text length | |
| Returns: | |
| (adjusted_pos, ends_with_protected): Adjusted position and whether it ends with protected region | |
| """ | |
| for region_start, region_end in protected_regions: | |
| # If split position is in the middle of a protected region | |
| if region_start < pos < region_end: | |
| # Extend to region end | |
| return min(region_end, text_len), True | |
| # If split position is right after a protected region (including space/newline) | |
| if region_end <= pos <= region_end + 5: | |
| return pos, True | |
| return pos, False |
| def detect_table_type(text: str) -> Optional[str]: | ||
| """ | ||
| Detect if text contains a table and what type it is. | ||
|
|
||
| Args: | ||
| text: Text to check | ||
|
|
||
| Returns: | ||
| 'html' for HTML tables, 'markdown' for Markdown tables, None otherwise | ||
| """ | ||
| text_stripped = text.strip() | ||
|
|
||
| # Check for HTML table | ||
| if text_stripped.startswith('<table') and '</table>' in text_stripped: | ||
| return 'html' | ||
|
|
||
| # Check for Markdown table | ||
| if is_markdown_table(text_stripped): | ||
| return 'markdown' | ||
|
|
||
| return None | ||
|
|
||
|
|
||
| def chunk_large_table_unified( | ||
| table_text: str, | ||
| chunk_size: int, | ||
| chunk_overlap: int, | ||
| context_prefix: str = "" | ||
| ) -> List[str]: | ||
| """ | ||
| Unified function to chunk large tables (HTML or Markdown). | ||
| Automatically detects table type and uses appropriate chunking. | ||
|
|
||
| NOTE: Table chunking does NOT apply overlap. | ||
| Data duplication degrades search quality, so overlap is intentionally excluded. | ||
|
|
||
| Args: | ||
| table_text: Table text (HTML or Markdown) | ||
| chunk_size: Maximum chunk size | ||
| chunk_overlap: Not used (kept for compatibility) | ||
| context_prefix: Context info (metadata, sheet info, etc.) - included in all chunks | ||
|
|
||
| Returns: | ||
| List of split table chunks | ||
| """ | ||
| table_type = detect_table_type(table_text) | ||
|
|
||
| if table_type == 'html': | ||
| return chunk_large_table(table_text, chunk_size, chunk_overlap, context_prefix) | ||
| elif table_type == 'markdown': | ||
| return chunk_large_markdown_table(table_text, chunk_size, chunk_overlap, context_prefix) | ||
| else: | ||
| # Not a recognized table format - return as is | ||
| logger.warning("Unrecognized table format, returning original") | ||
| if context_prefix: | ||
| return [f"{context_prefix}\n{table_text}"] | ||
| return [table_text] |
There was a problem hiding this comment.
chunk_large_table_unified (and its helper detect_table_type) is not referenced anywhere in the repo (chunking uses its own _chunk_table_unified). This duplicates functionality and risks divergent behavior. Consider removing it or switching callers to use this unified helper as the single entry point.
| def detect_table_type(text: str) -> Optional[str]: | |
| """ | |
| Detect if text contains a table and what type it is. | |
| Args: | |
| text: Text to check | |
| Returns: | |
| 'html' for HTML tables, 'markdown' for Markdown tables, None otherwise | |
| """ | |
| text_stripped = text.strip() | |
| # Check for HTML table | |
| if text_stripped.startswith('<table') and '</table>' in text_stripped: | |
| return 'html' | |
| # Check for Markdown table | |
| if is_markdown_table(text_stripped): | |
| return 'markdown' | |
| return None | |
| def chunk_large_table_unified( | |
| table_text: str, | |
| chunk_size: int, | |
| chunk_overlap: int, | |
| context_prefix: str = "" | |
| ) -> List[str]: | |
| """ | |
| Unified function to chunk large tables (HTML or Markdown). | |
| Automatically detects table type and uses appropriate chunking. | |
| NOTE: Table chunking does NOT apply overlap. | |
| Data duplication degrades search quality, so overlap is intentionally excluded. | |
| Args: | |
| table_text: Table text (HTML or Markdown) | |
| chunk_size: Maximum chunk size | |
| chunk_overlap: Not used (kept for compatibility) | |
| context_prefix: Context info (metadata, sheet info, etc.) - included in all chunks | |
| Returns: | |
| List of split table chunks | |
| """ | |
| table_type = detect_table_type(table_text) | |
| if table_type == 'html': | |
| return chunk_large_table(table_text, chunk_size, chunk_overlap, context_prefix) | |
| elif table_type == 'markdown': | |
| return chunk_large_markdown_table(table_text, chunk_size, chunk_overlap, context_prefix) | |
| else: | |
| # Not a recognized table format - return as is | |
| logger.warning("Unrecognized table format, returning original") | |
| if context_prefix: | |
| return [f"{context_prefix}\n{table_text}"] | |
| return [table_text] | |
| # Note: An earlier unified table chunking helper and its table-type detection | |
| # function were removed because they were not referenced anywhere in the codebase | |
| # and duplicated logic handled elsewhere (e.g., via _chunk_table_unified). | |
| # Keeping a single authoritative implementation reduces the risk of divergent behavior. |
| # Remove overlapping matches (longer match wins) | ||
| filtered_matches = [] | ||
| last_end = 0 | ||
|
|
||
| for match in re.finditer(HTML_TABLE_PATTERN, text, re.DOTALL | re.IGNORECASE): | ||
| # 테이블 이전 텍스트 | ||
| if match.start() > last_end: | ||
| before_text = text[last_end:match.start()].strip() | ||
| for start, end, ttype, content in all_matches: | ||
| if start >= last_end: | ||
| filtered_matches.append((start, end, ttype, content)) | ||
| last_end = end |
There was a problem hiding this comment.
The comment says “Remove overlapping matches (longer match wins)”, but the implementation just keeps the first non-overlapping match by start position (start >= last_end) and drops any later overlaps regardless of length. If “longer match wins” is required, adjust the filtering to compare overlap candidates; otherwise update the comment to reflect the actual behavior.
| # 테이블이 없으면 기존 로직 사용 | ||
| if not any(seg_type == 'table' for seg_type, _ in segments): | ||
| # If no tables, use simple row protection | ||
| if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')): |
There was a problem hiding this comment.
This table-presence check uses substring matching (t in seg_type) across all segment types. It would be clearer and safer to check membership directly (e.g., seg_type in ('html_table', 'markdown_table')) to avoid accidental truthiness if segment type names change.
| if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')): | |
| if not any(seg_type in ('html_table', 'markdown_table') for seg_type, _ in segments): |
| # Table types for unified handling | ||
| TABLE_TYPE_HTML = 'html' | ||
| TABLE_TYPE_MARKDOWN = 'markdown' | ||
|
|
||
|
|
There was a problem hiding this comment.
Several newly added constants are unused in the current codebase (MARKDOWN_TABLE_HEADER_PATTERN, TABLE_TYPE_HTML, TABLE_TYPE_MARKDOWN). If they’re not needed yet, consider removing them to avoid implying supported behaviors that aren’t implemented.
| # Table types for unified handling | |
| TABLE_TYPE_HTML = 'html' | |
| TABLE_TYPE_MARKDOWN = 'markdown' |
| if chart_processor is not None: | ||
| try: | ||
| # Build pattern from processor's config | ||
| import re |
There was a problem hiding this comment.
This import of module re is redundant, as it was previously imported on line 32.
| if metadata_formatter is not None: | ||
| try: | ||
| # Build pattern from formatter's config | ||
| import re |
There was a problem hiding this comment.
This import of module re is redundant, as it was previously imported on line 32.
| import re |
No description provided.