feat: Empower Chunking Method by CocoRoF · Pull Request #9 · CocoRoF/Contextifier

CocoRoF · 2026-01-23T07:32:39Z

No description provided.

- Updated docstrings in `table_parser.py` to provide English translations and clarify function descriptions. - Improved comments in `table_parser.py` for better understanding of code logic. - Enhanced `text_chunker.py` with new functionality for Markdown table support and refined chunking logic. - Added helper function to identify Markdown tables in `text_chunker.py`. - Modified `chunk_with_row_protection` to handle both HTML and Markdown tables efficiently. - Updated `DocumentProcessor` to include new chart and metadata formatting processors.

Copilot

Pull request overview

Adds expanded chunking support across the library, focusing on more robust handling of tables (including Markdown), additional protected-region types, and passing richer processor configuration into the chunking pipeline.

Changes:

Add Markdown table-aware chunking (row-preserving, no-overlap) alongside existing HTML table support.
Extend protected-region handling to include chart blocks, metadata blocks, and page/slide/sheet tags with configurable patterns.
Thread chart_processor and metadata_formatter through DocumentProcessor → create_chunks.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
pyproject.toml	Bumps package version to 0.2.1.
contextifier/core/document_processor.py	Passes `chart_processor` and `metadata_formatter` into chunking entry point.
contextifier/chunking/text_chunker.py	Adds Markdown table detection and chunking behavior (no overlap for tables).
contextifier/chunking/table_parser.py	Updates docstrings/comments (English) for table parsing logic.
contextifier/chunking/table_chunker.py	Implements Markdown table parsing + chunking; adjusts table splitting rules (no overlap, allow 1.5x).
contextifier/chunking/sheet_processor.py	Expands segment extraction to include charts/images/textboxes; enforces no-overlap for table chunks.
contextifier/chunking/protected_regions.py	Adds more protected region types + configurable patterns; updates splitting rules for no-overlap regions.
contextifier/chunking/page_chunker.py	Extends page-based chunking to pass chart/metadata patterns into protected-region logic.
contextifier/chunking/constants.py	Adds patterns/constants for tags, metadata, and Markdown tables.
contextifier/chunking/chunking.py	Integrates new processors/patterns and unified table handling into main chunking flow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-23T07:41:40Z

+            # Overlap handling: include part of last page in next chunk
            overlap_content = get_overlap_content(current_chunk_pages, chunk_overlap)
            current_chunk_pages = []
            current_size = 0

            if overlap_content:
-                # overlap은 다음 청크 시작에 추가됨 (아래서 처리)
+                # Overlap is added to next chunk start (handled below)
                pass


In page-based chunking, overlap_content is computed but never applied to the next chunk (the code just passes). As a result, chunk_overlap has no effect for the page-based strategy despite being part of the API/docs. Consider prepending overlap_content to the next chunk’s first page content (or carrying it in current_chunk_pages) and ensuring it doesn’t duplicate protected regions when doing so.

Copilot · 2026-01-23T07:41:40Z

+        # If end_pos is exactly at or just after the region end
+        if region_start <= end_pos <= region_end + tolerance:


_check_ends_with_no_overlap_region is intended to detect when a chunk ends at/after a no-overlap region, but the condition region_start <= end_pos <= region_end + tolerance also returns true when end_pos is at the start of a region (i.e., chunk ends right before the region). That will incorrectly disable overlap in common cases. The check should be based on proximity to region_end (e.g., region_end <= end_pos <= region_end + tolerance, or abs(end_pos - region_end) <= tolerance).

Suggested change

# If end_pos is exactly at or just after the region end

if region_start <= end_pos <= region_end + tolerance:

# If end_pos is exactly at or just after the region end (within tolerance)

if region_end <= end_pos <= region_end + tolerance:

Copilot · 2026-01-23T07:41:41Z

+            # OCR variants
+            page_pattern_ocr = page_pattern.rstrip(')').rstrip('\\]') + r'\s*\(OCR(?:\+Ref)?\)\]'
+            for match in re.finditer(page_pattern_ocr, text, re.IGNORECASE):
+                regions.append((match.start(), match.end(), 'page_tag'))
+            # Slide tags
+            slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE)
+            for match in re.finditer(slide_pattern, text, re.IGNORECASE):
+                regions.append((match.start(), match.end(), 'slide_tag'))


The OCR page-tag pattern construction (page_pattern.rstrip(')').rstrip('\\]') + ...) is brittle and doesn’t work with arbitrary custom prefixes/suffixes returned by PageTagProcessor.get_pattern_string() (it may not end in \\]/)). It also adds OCR variants for page tags but not slide tags when a processor is provided. Prefer constructing OCR variants using the escaped prefix/suffix (like f"{prefix}(\\d+)\\s*\\(OCR(?:\\+Ref)?\\){suffix}") or fall back to the existing PAGE_TAG_OCR_PATTERN/SLIDE_TAG_OCR_PATTERN behavior.

Suggested change

# OCR variants

page_pattern_ocr = page_pattern.rstrip(')').rstrip('\\]') + r'\s*\(OCR(?:\+Ref)?\)\]'

for match in re.finditer(page_pattern_ocr, text, re.IGNORECASE):

regions.append((match.start(), match.end(), 'page_tag'))

# Slide tags

slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE)

for match in re.finditer(slide_pattern, text, re.IGNORECASE):

regions.append((match.start(), match.end(), 'slide_tag'))

# OCR page tag variants (use stable default pattern)

for match in re.finditer(PAGE_TAG_OCR_PATTERN, text, re.IGNORECASE):

regions.append((match.start(), match.end(), 'page_tag'))

# Slide tags

slide_pattern = page_tag_processor.get_pattern_string(PageTagType.SLIDE)

for match in re.finditer(slide_pattern, text, re.IGNORECASE):

regions.append((match.start(), match.end(), 'slide_tag'))

# OCR slide tag variants (use stable default pattern)

for match in re.finditer(SLIDE_TAG_OCR_PATTERN, text, re.IGNORECASE):

regions.append((match.start(), match.end(), 'slide_tag'))

Copilot · 2026-01-23T07:41:41Z

+def _adjust_for_protected_boundary(
+    pos: int,
+    protected_regions: List[Tuple[int, int]],
+    text_len: int
+) -> Tuple[int, bool]:
+    """
+    Check if position is in the middle of a protected region and adjust to region end if so.
+
+    Args:
+        pos: Current split position
+        protected_regions: Protected region position list [(start, end), ...]
+        text_len: Total text length
+
+    Returns:
+        (adjusted_pos, ends_with_protected): Adjusted position and whether it ends with protected region
+    """
+    for region_start, region_end in protected_regions:
+        # If split position is in the middle of a protected region
+        if region_start < pos < region_end:
+            # Extend to region end
+            return min(region_end, text_len), True
+        # If split position is right after a protected region (including space/newline)
+        if region_end <= pos <= region_end + 5:
+            return pos, True
+    return pos, False
+
+


_adjust_for_protected_boundary is defined but not referenced anywhere in the repo. This adds dead code and maintenance cost; either wire it into the split-point adjustment logic (similar to _adjust_for_image_boundary) or remove it.

Suggested change

def _adjust_for_protected_boundary(

pos: int,

protected_regions: List[Tuple[int, int]],

text_len: int

) -> Tuple[int, bool]:

"""

Check if position is in the middle of a protected region and adjust to region end if so.

Args:

pos: Current split position

protected_regions: Protected region position list [(start, end), ...]

text_len: Total text length

Returns:

(adjusted_pos, ends_with_protected): Adjusted position and whether it ends with protected region

"""

for region_start, region_end in protected_regions:

# If split position is in the middle of a protected region

if region_start < pos < region_end:

# Extend to region end

return min(region_end, text_len), True

# If split position is right after a protected region (including space/newline)

if region_end <= pos <= region_end + 5:

return pos, True

return pos, False

Copilot · 2026-01-23T07:41:41Z

+def detect_table_type(text: str) -> Optional[str]:
+    """
+    Detect if text contains a table and what type it is.
+
+    Args:
+        text: Text to check
+
+    Returns:
+        'html' for HTML tables, 'markdown' for Markdown tables, None otherwise
+    """
+    text_stripped = text.strip()
+
+    # Check for HTML table
+    if text_stripped.startswith('<table') and '</table>' in text_stripped:
+        return 'html'
+
+    # Check for Markdown table
+    if is_markdown_table(text_stripped):
+        return 'markdown'
+
+    return None
+
+
+def chunk_large_table_unified(
+    table_text: str,
+    chunk_size: int,
+    chunk_overlap: int,
+    context_prefix: str = ""
+) -> List[str]:
+    """
+    Unified function to chunk large tables (HTML or Markdown).
+    Automatically detects table type and uses appropriate chunking.
+
+    NOTE: Table chunking does NOT apply overlap.
+    Data duplication degrades search quality, so overlap is intentionally excluded.
+
+    Args:
+        table_text: Table text (HTML or Markdown)
+        chunk_size: Maximum chunk size
+        chunk_overlap: Not used (kept for compatibility)
+        context_prefix: Context info (metadata, sheet info, etc.) - included in all chunks
+
+    Returns:
+        List of split table chunks
+    """
+    table_type = detect_table_type(table_text)
+
+    if table_type == 'html':
+        return chunk_large_table(table_text, chunk_size, chunk_overlap, context_prefix)
+    elif table_type == 'markdown':
+        return chunk_large_markdown_table(table_text, chunk_size, chunk_overlap, context_prefix)
+    else:
+        # Not a recognized table format - return as is
+        logger.warning("Unrecognized table format, returning original")
+        if context_prefix:
+            return [f"{context_prefix}\n{table_text}"]
+        return [table_text]


chunk_large_table_unified (and its helper detect_table_type) is not referenced anywhere in the repo (chunking uses its own _chunk_table_unified). This duplicates functionality and risks divergent behavior. Consider removing it or switching callers to use this unified helper as the single entry point.

Suggested change

def detect_table_type(text: str) -> Optional[str]:

"""

Detect if text contains a table and what type it is.

Args:

text: Text to check

Returns:

'html' for HTML tables, 'markdown' for Markdown tables, None otherwise

"""

text_stripped = text.strip()

# Check for HTML table

if text_stripped.startswith('<table') and '</table>' in text_stripped:

return 'html'

# Check for Markdown table

if is_markdown_table(text_stripped):

return 'markdown'

return None

def chunk_large_table_unified(

table_text: str,

chunk_size: int,

chunk_overlap: int,

context_prefix: str = ""

) -> List[str]:

"""

Unified function to chunk large tables (HTML or Markdown).

Automatically detects table type and uses appropriate chunking.

NOTE: Table chunking does NOT apply overlap.

Data duplication degrades search quality, so overlap is intentionally excluded.

Args:

table_text: Table text (HTML or Markdown)

chunk_size: Maximum chunk size

chunk_overlap: Not used (kept for compatibility)

context_prefix: Context info (metadata, sheet info, etc.) - included in all chunks

Returns:

List of split table chunks

"""

table_type = detect_table_type(table_text)

if table_type == 'html':

return chunk_large_table(table_text, chunk_size, chunk_overlap, context_prefix)

elif table_type == 'markdown':

return chunk_large_markdown_table(table_text, chunk_size, chunk_overlap, context_prefix)

else:

# Not a recognized table format - return as is

logger.warning("Unrecognized table format, returning original")

if context_prefix:

return [f"{context_prefix}\n{table_text}"]

return [table_text]

# Note: An earlier unified table chunking helper and its table-type detection

# function were removed because they were not referenced anywhere in the codebase

# and duplicated logic handled elsewhere (e.g., via _chunk_table_unified).

# Keeping a single authoritative implementation reduces the risk of divergent behavior.

Copilot · 2026-01-23T07:41:42Z

+    # Remove overlapping matches (longer match wins)
+    filtered_matches = []
    last_end = 0
-
-    for match in re.finditer(HTML_TABLE_PATTERN, text, re.DOTALL | re.IGNORECASE):
-        # 테이블 이전 텍스트
-        if match.start() > last_end:
-            before_text = text[last_end:match.start()].strip()
+    for start, end, ttype, content in all_matches:
+        if start >= last_end:
+            filtered_matches.append((start, end, ttype, content))
+            last_end = end


The comment says “Remove overlapping matches (longer match wins)”, but the implementation just keeps the first non-overlapping match by start position (start >= last_end) and drops any later overlaps regardless of length. If “longer match wins” is required, adjust the filtering to compare overlap candidates; otherwise update the comment to reflect the actual behavior.

Copilot · 2026-01-23T07:41:42Z

-    # 테이블이 없으면 기존 로직 사용
-    if not any(seg_type == 'table' for seg_type, _ in segments):
+    # If no tables, use simple row protection
+    if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')):


This table-presence check uses substring matching (t in seg_type) across all segment types. It would be clearer and safer to check membership directly (e.g., seg_type in ('html_table', 'markdown_table')) to avoid accidental truthiness if segment type names change.

Suggested change

if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')):

if not any(seg_type in ('html_table', 'markdown_table') for seg_type, _ in segments):

Copilot · 2026-01-23T07:41:43Z

+# Table types for unified handling
+TABLE_TYPE_HTML = 'html'
+TABLE_TYPE_MARKDOWN = 'markdown'
+



Several newly added constants are unused in the current codebase (MARKDOWN_TABLE_HEADER_PATTERN, TABLE_TYPE_HTML, TABLE_TYPE_MARKDOWN). If they’re not needed yet, consider removing them to avoid implying supported behaviors that aren’t implemented.

Suggested change

# Table types for unified handling

TABLE_TYPE_HTML = 'html'

TABLE_TYPE_MARKDOWN = 'markdown'

Copilot · 2026-01-23T07:41:43Z

+    if chart_processor is not None:
+        try:
+            # Build pattern from processor's config
+            import re


This import of module re is redundant, as it was previously imported on line 32.

Copilot · 2026-01-23T07:41:43Z

+    if metadata_formatter is not None:
+        try:
+            # Build pattern from formatter's config
+            import re


This import of module re is redundant, as it was previously imported on line 32.

Suggested change

import re

…modules

CocoRoF added 2 commits January 23, 2026 16:31

feat: Bump version to 0.2.1 in pyproject.toml

7e60359

CocoRoF requested a review from Copilot January 23, 2026 07:32

Copilot started reviewing on behalf of CocoRoF January 23, 2026 07:32 View session

Copilot AI reviewed Jan 23, 2026

View reviewed changes

refactor: Remove unused functions and optimize chunking logic across …

53f4185

…modules

CocoRoF merged commit b71ac85 into deploy Jan 23, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Empower Chunking Method#9

feat: Empower Chunking Method#9
CocoRoF merged 3 commits intodeployfrom
main

CocoRoF commented Jan 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# If end_pos is exactly at or just after the region end
		if region_start <= end_pos <= region_end + tolerance:

	if not any(t in seg_type for seg_type, _ in segments for t in ('html_table', 'markdown_table')):
	if not any(seg_type in ('html_table', 'markdown_table') for seg_type, _ in segments):

	# Table types for unified handling
	TABLE_TYPE_HTML = 'html'
	TABLE_TYPE_MARKDOWN = 'markdown'

Conversation

CocoRoF commented Jan 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants