Skip to content

feat: Empower Chunking Method#9

Merged
CocoRoF merged 3 commits intodeployfrom
main
Jan 23, 2026
Merged

feat: Empower Chunking Method#9
CocoRoF merged 3 commits intodeployfrom
main

Conversation

@CocoRoF
Copy link
Copy Markdown
Owner

@CocoRoF CocoRoF commented Jan 23, 2026

No description provided.

- Updated docstrings in `table_parser.py` to provide English translations and clarify function descriptions.
- Improved comments in `table_parser.py` for better understanding of code logic.
- Enhanced `text_chunker.py` with new functionality for Markdown table support and refined chunking logic.
- Added helper function to identify Markdown tables in `text_chunker.py`.
- Modified `chunk_with_row_protection` to handle both HTML and Markdown tables efficiently.
- Updated `DocumentProcessor` to include new chart and metadata formatting processors.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds expanded chunking support across the library, focusing on more robust handling of tables (including Markdown), additional protected-region types, and passing richer processor configuration into the chunking pipeline.

Changes:

  • Add Markdown table-aware chunking (row-preserving, no-overlap) alongside existing HTML table support.
  • Extend protected-region handling to include chart blocks, metadata blocks, and page/slide/sheet tags with configurable patterns.
  • Thread chart_processor and metadata_formatter through DocumentProcessorcreate_chunks.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
pyproject.toml Bumps package version to 0.2.1.
contextifier/core/document_processor.py Passes chart_processor and metadata_formatter into chunking entry point.
contextifier/chunking/text_chunker.py Adds Markdown table detection and chunking behavior (no overlap for tables).
contextifier/chunking/table_parser.py Updates docstrings/comments (English) for table parsing logic.
contextifier/chunking/table_chunker.py Implements Markdown table parsing + chunking; adjusts table splitting rules (no overlap, allow 1.5x).
contextifier/chunking/sheet_processor.py Expands segment extraction to include charts/images/textboxes; enforces no-overlap for table chunks.
contextifier/chunking/protected_regions.py Adds more protected region types + configurable patterns; updates splitting rules for no-overlap regions.
contextifier/chunking/page_chunker.py Extends page-based chunking to pass chart/metadata patterns into protected-region logic.
contextifier/chunking/constants.py Adds patterns/constants for tags, metadata, and Markdown tables.
contextifier/chunking/chunking.py Integrates new processors/patterns and unified table handling into main chunking flow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread contextifier/chunking/page_chunker.py Outdated
Comment on lines 203 to 210
# Overlap handling: include part of last page in next chunk
overlap_content = get_overlap_content(current_chunk_pages, chunk_overlap)
current_chunk_pages = []
current_size = 0

if overlap_content:
# overlap은 다음 청크 시작에 추가됨 (아래서 처리)
# Overlap is added to next chunk start (handled below)
pass
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In page-based chunking, overlap_content is computed but never applied to the next chunk (the code just passes). As a result, chunk_overlap has no effect for the page-based strategy despite being part of the API/docs. Consider prepending overlap_content to the next chunk’s first page content (or carrying it in current_chunk_pages) and ensuring it doesn’t duplicate protected regions when doing so.

Copilot uses AI. Check for mistakes.
Comment on lines +661 to +662
# If end_pos is exactly at or just after the region end
if region_start <= end_pos <= region_end + tolerance:
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_check_ends_with_no_overlap_region is intended to detect when a chunk ends at/after a no-overlap region, but the condition region_start <= end_pos <= region_end + tolerance also returns true when end_pos is at the start of a region (i.e., chunk ends right before the region). That will incorrectly disable overlap in common cases. The check should be based on proximity to region_end (e.g., region_end <= end_pos <= region_end + tolerance, or abs(end_pos - region_end) <= tolerance).

Suggested change
# If end_pos is exactly at or just after the region end
if region_start <= end_pos <= region_end + tolerance:
# If end_pos is exactly at or just after the region end (within tolerance)
if region_end <= end_pos <= region_end + tolerance:

Copilot uses AI. Check for mistakes.
Comment on lines +106 to +113
# OCR variants
page_pattern_ocr = page_pattern.rstrip(')').rstrip('\\]') + r'\s*\(OCR(?:\+Ref)?\)\]'
for match in re.finditer(page_pattern_ocr, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'page_tag'))
# Slide tags
slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE)
for match in re.finditer(slide_pattern, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'slide_tag'))
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OCR page-tag pattern construction (page_pattern.rstrip(')').rstrip('\\]') + ...) is brittle and doesn’t work with arbitrary custom prefixes/suffixes returned by PageTagProcessor.get_pattern_string() (it may not end in \\]/)). It also adds OCR variants for page tags but not slide tags when a processor is provided. Prefer constructing OCR variants using the escaped prefix/suffix (like f"{prefix}(\\d+)\\s*\\(OCR(?:\\+Ref)?\\){suffix}") or fall back to the existing PAGE_TAG_OCR_PATTERN/SLIDE_TAG_OCR_PATTERN behavior.

Suggested change
# OCR variants
page_pattern_ocr = page_pattern.rstrip(')').rstrip('\\]') + r'\s*\(OCR(?:\+Ref)?\)\]'
for match in re.finditer(page_pattern_ocr, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'page_tag'))
# Slide tags
slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE)
for match in re.finditer(slide_pattern, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'slide_tag'))
# OCR page tag variants (use stable default pattern)
for match in re.finditer(PAGE_TAG_OCR_PATTERN, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'page_tag'))
# Slide tags
slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE)
for match in re.finditer(slide_pattern, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'slide_tag'))
# OCR slide tag variants (use stable default pattern)
for match in re.finditer(SLIDE_TAG_OCR_PATTERN, text, re.IGNORECASE):
regions.append((match.start(), match.end(), 'slide_tag'))

Copilot uses AI. Check for mistakes.
Comment on lines +617 to +643
def _adjust_for_protected_boundary(
pos: int,
protected_regions: List[Tuple[int, int]],
text_len: int
) -> Tuple[int, bool]:
"""
Check if position is in the middle of a protected region and adjust to region end if so.

Args:
pos: Current split position
protected_regions: Protected region position list [(start, end), ...]
text_len: Total text length

Returns:
(adjusted_pos, ends_with_protected): Adjusted position and whether it ends with protected region
"""
for region_start, region_end in protected_regions:
# If split position is in the middle of a protected region
if region_start < pos < region_end:
# Extend to region end
return min(region_end, text_len), True
# If split position is right after a protected region (including space/newline)
if region_end <= pos <= region_end + 5:
return pos, True
return pos, False


Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_adjust_for_protected_boundary is defined but not referenced anywhere in the repo. This adds dead code and maintenance cost; either wire it into the split-point adjustment logic (similar to _adjust_for_image_boundary) or remove it.

Suggested change
def _adjust_for_protected_boundary(
pos: int,
protected_regions: List[Tuple[int, int]],
text_len: int
) -> Tuple[int, bool]:
"""
Check if position is in the middle of a protected region and adjust to region end if so.
Args:
pos: Current split position
protected_regions: Protected region position list [(start, end), ...]
text_len: Total text length
Returns:
(adjusted_pos, ends_with_protected): Adjusted position and whether it ends with protected region
"""
for region_start, region_end in protected_regions:
# If split position is in the middle of a protected region
if region_start < pos < region_end:
# Extend to region end
return min(region_end, text_len), True
# If split position is right after a protected region (including space/newline)
if region_end <= pos <= region_end + 5:
return pos, True
return pos, False

Copilot uses AI. Check for mistakes.
Comment thread contextifier/chunking/table_chunker.py Outdated
Comment on lines +829 to +885
def detect_table_type(text: str) -> Optional[str]:
"""
Detect if text contains a table and what type it is.

Args:
text: Text to check

Returns:
'html' for HTML tables, 'markdown' for Markdown tables, None otherwise
"""
text_stripped = text.strip()

# Check for HTML table
if text_stripped.startswith('<table') and '</table>' in text_stripped:
return 'html'

# Check for Markdown table
if is_markdown_table(text_stripped):
return 'markdown'

return None


def chunk_large_table_unified(
table_text: str,
chunk_size: int,
chunk_overlap: int,
context_prefix: str = ""
) -> List[str]:
"""
Unified function to chunk large tables (HTML or Markdown).
Automatically detects table type and uses appropriate chunking.

NOTE: Table chunking does NOT apply overlap.
Data duplication degrades search quality, so overlap is intentionally excluded.

Args:
table_text: Table text (HTML or Markdown)
chunk_size: Maximum chunk size
chunk_overlap: Not used (kept for compatibility)
context_prefix: Context info (metadata, sheet info, etc.) - included in all chunks

Returns:
List of split table chunks
"""
table_type = detect_table_type(table_text)

if table_type == 'html':
return chunk_large_table(table_text, chunk_size, chunk_overlap, context_prefix)
elif table_type == 'markdown':
return chunk_large_markdown_table(table_text, chunk_size, chunk_overlap, context_prefix)
else:
# Not a recognized table format - return as is
logger.warning("Unrecognized table format, returning original")
if context_prefix:
return [f"{context_prefix}\n{table_text}"]
return [table_text]
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunk_large_table_unified (and its helper detect_table_type) is not referenced anywhere in the repo (chunking uses its own _chunk_table_unified). This duplicates functionality and risks divergent behavior. Consider removing it or switching callers to use this unified helper as the single entry point.

Suggested change
def detect_table_type(text: str) -> Optional[str]:
"""
Detect if text contains a table and what type it is.
Args:
text: Text to check
Returns:
'html' for HTML tables, 'markdown' for Markdown tables, None otherwise
"""
text_stripped = text.strip()
# Check for HTML table
if text_stripped.startswith('<table') and '</table>' in text_stripped:
return 'html'
# Check for Markdown table
if is_markdown_table(text_stripped):
return 'markdown'
return None
def chunk_large_table_unified(
table_text: str,
chunk_size: int,
chunk_overlap: int,
context_prefix: str = ""
) -> List[str]:
"""
Unified function to chunk large tables (HTML or Markdown).
Automatically detects table type and uses appropriate chunking.
NOTE: Table chunking does NOT apply overlap.
Data duplication degrades search quality, so overlap is intentionally excluded.
Args:
table_text: Table text (HTML or Markdown)
chunk_size: Maximum chunk size
chunk_overlap: Not used (kept for compatibility)
context_prefix: Context info (metadata, sheet info, etc.) - included in all chunks
Returns:
List of split table chunks
"""
table_type = detect_table_type(table_text)
if table_type == 'html':
return chunk_large_table(table_text, chunk_size, chunk_overlap, context_prefix)
elif table_type == 'markdown':
return chunk_large_markdown_table(table_text, chunk_size, chunk_overlap, context_prefix)
else:
# Not a recognized table format - return as is
logger.warning("Unrecognized table format, returning original")
if context_prefix:
return [f"{context_prefix}\n{table_text}"]
return [table_text]
# Note: An earlier unified table chunking helper and its table-type detection
# function were removed because they were not referenced anywhere in the codebase
# and duplicated logic handled elsewhere (e.g., via _chunk_table_unified).
# Keeping a single authoritative implementation reduces the risk of divergent behavior.

Copilot uses AI. Check for mistakes.
Comment thread contextifier/chunking/text_chunker.py Outdated
Comment on lines +167 to +173
# Remove overlapping matches (longer match wins)
filtered_matches = []
last_end = 0

for match in re.finditer(HTML_TABLE_PATTERN, text, re.DOTALL | re.IGNORECASE):
# 테이블 이전 텍스트
if match.start() > last_end:
before_text = text[last_end:match.start()].strip()
for start, end, ttype, content in all_matches:
if start >= last_end:
filtered_matches.append((start, end, ttype, content))
last_end = end
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says “Remove overlapping matches (longer match wins)”, but the implementation just keeps the first non-overlapping match by start position (start >= last_end) and drops any later overlaps regardless of length. If “longer match wins” is required, adjust the filtering to compare overlap candidates; otherwise update the comment to reflect the actual behavior.

Copilot uses AI. Check for mistakes.
Comment thread contextifier/chunking/text_chunker.py Outdated
# 테이블이 없으면 기존 로직 사용
if not any(seg_type == 'table' for seg_type, _ in segments):
# If no tables, use simple row protection
if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')):
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table-presence check uses substring matching (t in seg_type) across all segment types. It would be clearer and safer to check membership directly (e.g., seg_type in ('html_table', 'markdown_table')) to avoid accidental truthiness if segment type names change.

Suggested change
if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')):
if not any(seg_type in ('html_table', 'markdown_table') for seg_type, _ in segments):

Copilot uses AI. Check for mistakes.
Comment thread contextifier/chunking/constants.py Outdated
Comment on lines 100 to 104
# Table types for unified handling
TABLE_TYPE_HTML = 'html'
TABLE_TYPE_MARKDOWN = 'markdown'


Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several newly added constants are unused in the current codebase (MARKDOWN_TABLE_HEADER_PATTERN, TABLE_TYPE_HTML, TABLE_TYPE_MARKDOWN). If they’re not needed yet, consider removing them to avoid implying supported behaviors that aren’t implemented.

Suggested change
# Table types for unified handling
TABLE_TYPE_HTML = 'html'
TABLE_TYPE_MARKDOWN = 'markdown'

Copilot uses AI. Check for mistakes.
Comment thread contextifier/chunking/chunking.py Outdated
if chart_processor is not None:
try:
# Build pattern from processor's config
import re
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import of module re is redundant, as it was previously imported on line 32.

Copilot uses AI. Check for mistakes.
Comment thread contextifier/chunking/chunking.py Outdated
if metadata_formatter is not None:
try:
# Build pattern from formatter's config
import re
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import of module re is redundant, as it was previously imported on line 32.

Suggested change
import re

Copilot uses AI. Check for mistakes.
@CocoRoF CocoRoF merged commit b71ac85 into deploy Jan 23, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants