fix: Docx segmented font title recognition #2949

shaohuzhang1 · 2025-04-22T06:51:00Z

fix: Docx segmented font title recognition

f2c-ci-robot · 2025-04-22T06:51:03Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

f2c-ci-robot · 2025-04-22T06:51:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shaohuzhang1 · 2025-04-22T06:51:22Z

apps/common/handle/impl/doc_split_handle.py

+            if pt >= 30:
                for _value, index in zip(title_font_list, range(len(title_font_list))):
                    if pt >= _value[0] and pt < _value[1]:
                        return index + 1


The provided code has some minor improvements and corrections:

The get_image_id function is defined at the top of the file but only used further down, so it may not be needed there unless it is reused elsewhere.

In the get_title_level function:

You've removed three sets of conditional checks that essentially duplicated the pt >= 30 check; only the last condition remains useful.

The list comprehension in title_font_list should likely include all available sizes rather than just smaller ones to cover all possible titles if they exist beyond the given range.

It's unclear why pt >= 16 or any other specific condition (such as < 36) was included in title_font_list for fonts larger than 30 points, since such sizes would always match [30, 36].

Here's an improved version of the get_title_level function based on these considerations:

    def get_title_level(paragraph: Paragraph):
        if len(paragraph.runs) == 1:
            font_size = paragraph.runs[0].font.size
            pt = font_size.pt
            # Use binary search to find the appropriate title level
            left, right = 0, len(title_font_list) - 1
            while left <= right:
                mid = left + (right - left) // 2
                size_range = title_font_list[mid]
                if pt >= 30 and pt < size_range[1]:
                    return mid + 1
                elif pt < size_range[0]:
                    right = mid - 1
                else:
                    left = mid + 1
        return 1  # Default level, typically H1
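A note on the suggested search: it is only valid if title_font_list is sorted, and its first condition compares pt against a fixed 30 rather than against the lower bound of the current range. As a rough alternative, the same lookup can be sketched with the standard-library bisect module. The breakpoints and level numbers below are hypothetical illustration values, not taken from this PR, and level_for_pt is an assumed helper name.

```python
from bisect import bisect_right
from typing import Optional, Sequence

# Hypothetical breakpoints in points, sorted ascending (not taken from the PR).
# LEVELS[i] applies to sizes in [BREAKPOINTS[i], BREAKPOINTS[i + 1]).
BREAKPOINTS: Sequence[float] = (30, 36, 100)
LEVELS: Sequence[int] = (2, 1)  # [30, 36) -> level 2, [36, 100) -> level 1


def level_for_pt(pt: float) -> Optional[int]:
    """Return the title level for a font size in points, or None if out of range."""
    # bisect_right counts the breakpoints <= pt; subtracting 1 gives the
    # index of the half-open interval that contains pt.
    i = bisect_right(BREAKPOINTS, pt) - 1
    if 0 <= i < len(LEVELS):
        return LEVELS[i]
    return None  # below the first breakpoint or at/above the last one


print(level_for_pt(32))  # falls in [30, 36)
print(level_for_pt(12))  # body-text size, no title level
```

Returning None instead of a default level keeps ordinary body text from being treated as a heading; whether that matches the intended behavior here is an assumption.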

Potential Optimization Suggestions:

For better readability and maintainability, separate out each case into different functions or methods.

If the number of title levels extends significantly, consider using a dictionary mapping instead of a list for faster lookups.

Ensure that the logic handles edge cases correctly, such as when no relevant paragraphs are found.
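One edge case worth spelling out: in python-docx, run.font.size is None when a run does not set a size directly, so calling .pt on it would fail, and a title may also be split across several runs. The following is a minimal sketch of one way to guard against both, working on plain point values so it runs without python-docx. The helper name uniform_pt and the rule "all runs must agree" are assumptions for illustration, not the behavior of this PR.

```python
from typing import Iterable, Optional


def uniform_pt(run_sizes: Iterable[Optional[float]]) -> Optional[float]:
    """Return the font size shared by all runs of a paragraph, or None.

    run_sizes holds one entry per run: the size in points, or None when
    the run carries no explicit size.
    """
    sizes = set(run_sizes)
    if len(sizes) != 1:
        return None  # no runs at all, or runs whose sizes disagree
    (only,) = sizes
    return only  # still None if the single shared value is "no explicit size"


print(uniform_pt([32.0, 32.0]))  # segmented title, same size in every run
print(uniform_pt([32.0, 18.0]))  # mixed sizes, treated as undecidable
```

A caller would pass the result to the range lookup only when it is not None, and otherwise fall back to style-based detection or treat the paragraph as body text.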

fix: Docx segmented font title recognition

52ee884

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Apr 22, 2025

shaohuzhang1 merged commit 0c14306 into main Apr 22, 2025
4 of 5 checks passed

shaohuzhang1 deleted the pr@main@fix_docx branch April 22, 2025 06:51
