@shaohuzhang1
Contributor

fix: Docx segmented font title recognition

@f2c-ci-robot
f2c-ci-robot bot commented Apr 22, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@f2c-ci-robot
f2c-ci-robot bot commented Apr 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

if pt >= 30:
    for _value, index in zip(title_font_list, range(len(title_font_list))):
        if pt >= _value[0] and pt < _value[1]:
            return index + 1
Contributor Author

The changes in this diff include some minor improvements and corrections:

  1. The get_image_id function is defined at the top but only used further down, so it may not be needed at that position unless it is reused elsewhere.

  2. In the get_title_level function, three sets of conditional checks that essentially duplicated the check for pt >= 30 have been removed. Only the last condition remains useful.

  3. The list comprehension in title_font_list should likely include all available sizes rather than just the smaller ones, so that titles outside the given range are still covered.

  4. It is unclear why pt >= 16 or other specific conditions (such as < 36) were included in title_font_list for fonts larger than 30 points, since such sizes would always match the [30, 36] range.

Here's an improved version of the get_title_level function based on these considerations:

from docx.text.paragraph import Paragraph

def get_title_level(paragraph: Paragraph):
    # title_font_list is the module-level list of [low, high) point ranges.
    # The binary search assumes it is sorted ascending and non-overlapping.
    if len(paragraph.runs) == 1:
        font_size = paragraph.runs[0].font.size
        # font.size is None when the run inherits its size from a style
        if font_size is not None and font_size.pt >= 30:
            pt = font_size.pt
            left, right = 0, len(title_font_list) - 1
            while left <= right:
                mid = left + (right - left) // 2
                low, high = title_font_list[mid]
                if low <= pt < high:
                    return mid + 1
                elif pt < low:
                    right = mid - 1
                else:
                    left = mid + 1

    return 1  # Default level, typically H1
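For comparison, the linear range scan from the diff can be exercised without python-docx. The sketch below is illustrative only: the function name title_level_for_pt and the values in TITLE_FONT_LIST are hypothetical, and the real title_font_list in the PR may hold different ranges. It keeps the pt >= 30 threshold from the diff and treats each range as half-open, [low, high).

```python
# Hypothetical size ranges in points, each treated as [low, high).
# These values are assumptions for illustration, not the PR's actual list.
TITLE_FONT_LIST = [[36, 100], [30, 36]]


def title_level_for_pt(pt, ranges=TITLE_FONT_LIST):
    """Return the 1-based level of the first range containing pt, else None."""
    # Sizes below 30 pt (or missing sizes) are not treated as titles.
    if pt is None or pt < 30:
        return None
    # enumerate replaces zip(ranges, range(len(ranges))) from the diff.
    for index, (low, high) in enumerate(ranges):
        if low <= pt < high:
            return index + 1
    return None
```

Unlike the binary search, this scan does not depend on the order of the ranges, which may matter if the list is stored largest-first.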

Potential Optimization Suggestions:

  • For better readability and maintainability, separate out each case into different functions or methods.
  • If the number of title levels grows significantly, consider a precomputed lookup structure (for example, a dictionary keyed by size or a sorted list of range boundaries) instead of scanning the list on every call.
  • Ensure that the logic handles edge cases correctly, such as when no relevant paragraphs are found.
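One way the faster-lookup suggestion could look is a sorted boundary list searched with the standard library's bisect module, since a plain dictionary cannot answer range queries directly. This is a minimal sketch under stated assumptions: BOUNDARIES and title_level_bisect are hypothetical names, the boundary values are made up for illustration, and larger fonts are assumed to map to lower level numbers.

```python
from bisect import bisect_right

# Hypothetical ascending boundaries in points (illustration only).
# They describe two half-open ranges: [30, 36) and [36, 100).
BOUNDARIES = [30, 36, 100]


def title_level_bisect(pt, boundaries=BOUNDARIES):
    """Map a font size to a 1-based title level, or None if out of range."""
    if pt is None:
        return None
    # Number of boundaries that are <= pt.
    pos = bisect_right(boundaries, pt)
    # pos == 0: below the smallest boundary.
    # pos == len(boundaries): at or above the largest boundary.
    if pos == 0 or pos == len(boundaries):
        return None
    # The largest range gets level 1, the next one level 2, and so on.
    return len(boundaries) - pos
```

With only a handful of levels the linear scan is just as fast in practice, so this mainly pays off if the list of ranges grows.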

@shaohuzhang1 shaohuzhang1 merged commit 0c14306 into main Apr 22, 2025
4 of 5 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@fix_docx branch April 22, 2025 06:51