Fix ordering of closing tags of sequential entities #4268

Merged
merged 2 commits into from Dec 8, 2023

udf
Contributor

@udf udf commented Dec 7, 2023

#4201 breaks the unparsing of sequential entities: it puts the closing tag of the first entity after the opening tag of the next one:

from telethon.extensions import html

text, entities = html.parse('<strong>⚙️</strong><em>Settings</em>')
html_text = html.unparse(text, entities)
print(html_text)
# <strong>⚙️<em></strong>Settings</em>

With HTML the problem is only cosmetic, but with Markdown the output is actually broken: it ends up as **⚙️__**Settings__

I fixed this behaviour by prioritizing closing tags that occur at the same position as opening tags, so that the </strong> is inserted before the <em>.
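
To illustrate the ordering rule, here is a minimal sketch and not the actual Telethon code. `Entity` and `unparse_sketch` are hypothetical names, offsets are plain string indices (no UTF-16 handling or escaping), and entities are assumed to have a positive length and to be listed outer-first:

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical stand-in for a message entity."""
    offset: int
    length: int
    tag: str


def unparse_sketch(text, entities):
    insertions = []
    for index, entity in enumerate(entities):
        end = entity.offset + entity.length
        # Sort key is (position, kind, tie-break).
        # kind 0 = closing tag, kind 1 = opening tag, so at the same
        # position every closing tag is emitted before any opening tag.
        # Openings keep input order (outer first); closings use the
        # reverse order (inner first), which keeps nesting balanced.
        insertions.append((entity.offset, 1, index, f"<{entity.tag}>"))
        insertions.append((end, 0, -index, f"</{entity.tag}>"))
    insertions.sort(key=lambda item: item[:3])

    parts = []
    last = 0
    for position, _kind, _tie, markup in insertions:
        parts.append(text[last:position])
        parts.append(markup)
        last = position
    parts.append(text[last:])
    return "".join(parts)


sequential = [Entity(0, 2, "strong"), Entity(2, 8, "em")]
print(unparse_sketch("OKSettings", sequential))

nested = [Entity(0, 5, "em"), Entity(0, 5, "strong")]
print(unparse_sketch("hello", nested))
```

With this ordering the sequential case closes `</strong>` before opening `<em>`, while the nested case still closes the inner tag first.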


I added tests to check that nested entities still behave correctly. However, the HTML test fails because the HTML parser swaps the order of nested entities: it only records an entity when it sees the closing tag, so the inner entity is recorded first:

text, entities = html.parse('<em><strong>hello</strong></em>')
print(entities)
# [<telethon.tl.types.MessageEntityBold object at 0x7f82cfba0f50>, <telethon.tl.types.MessageEntityItalic object at 0x7f82cfba0f10>]
# bold entity occurs before the italic one, because it was closed first

This also means that repeatedly parsing and unparsing nested entities swaps their ordering:

text, entities = html.parse('<em><strong>hello</strong></em>')
html_text = html.unparse(text, entities)
print(html_text)
# tags have been swapped:
# <strong><em>hello</em></strong>

text, entities = html.parse(html_text)
html_text = html.unparse(text, entities)
print(html_text)
# tags have been swapped again:
# <em><strong>hello</strong></em>

(This causes the test to fail, because a parse/unparse round trip produces output that differs from the input.)

I added a commit to fix it, but I am not sure if it's acceptable. I think it could be fixed inside the parsing code directly, instead of sorting the entities afterwards.
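
As a rough sketch of what sorting afterwards could look like (an assumption for illustration, not necessarily what the commit does): reverse the closing-order list so outer entities come first, then rely on Python's stable sort to order by offset. `Entity` and `order_outer_first` are hypothetical names:

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical stand-in for a parsed message entity."""
    offset: int
    length: int
    tag: str


def order_outer_first(entities_in_closing_order):
    # Reversing puts later-closed (outer) entities first. sorted() is
    # stable, so entities that share an offset keep that outer-first
    # order, while everything else is ordered by where it starts.
    return sorted(reversed(entities_in_closing_order), key=lambda e: e.offset)


# '<em><strong>hello</strong></em>' as recorded on closing tags:
nested = [Entity(0, 5, "strong"), Entity(0, 5, "em")]
print([e.tag for e in order_outer_first(nested)])

# '<strong>OK</strong><em>Settings</em>' as recorded on closing tags:
sequential = [Entity(0, 2, "strong"), Entity(2, 8, "em")]
print([e.tag for e in order_outer_first(sequential)])
```

Sorting by offset and length alone would not be enough here, since both nested entities cover exactly the same range.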

@Lonami
Member

Lonami commented Dec 8, 2023

Appreciate the quality of the description and tests. Don't really care much about the implementation as long as it works.

@Lonami Lonami merged commit 3d58dc3 into LonamiWebs:v1 Dec 8, 2023