stand-alone * or _ lead to incorrect results #1300

Closed
only-dev-time opened this issue Oct 30, 2022 · 13 comments · Fixed by #1303
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.

Comments

@only-dev-time

A text with several stand-alone underscores is not converted correctly.

input

**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**

output

 <p><strong><em> _ </em> N _ <em> _ </em> _ <em> _ R S _ </em> N _</strong></p>

expected

<p><strong>_ _ _ N _ _ _ _ _ _ _ R S _ _ N _</strong></p>

It seems that underscores with the surrounding spaces are replaced by the placeholder.
In the original on https://daringfireball.net/projects/markdown/dingus it is converted correctly.

@waylan
Member

waylan commented Oct 31, 2022

I'm curious if the behavior changed here in #792 (version 3.2) or #629 (version 3.0). In other words, did we introduce this behavior in either of those changes, or has it always been this way? @facelessuser, you are the most familiar with the current code as you made those changes. Any thoughts? I won't have time to take a closer look until next week at the earliest.

@waylan waylan added the needs-confirmation The alleged behavior needs to be confirmed. label Oct 31, 2022
@facelessuser
Collaborator

This has nothing to do with the surrounding **; it is a question of what _ _ N _ _ _ _ R S _ _ N _ should be parsed as.

#629 did not cause this, I am certain of that.

Could #805 (the fix for #792) have caused this? It is tough to tell without knowing what it was doing prior to this. No tests were changed in that review, only new ones added. I'm skeptical whether this changed the behavior, but I'd have to check.

_ and * are handled in two separate passes; they've never had any direct awareness of each other, such that a _ would be ignored just because a surrounding * was parsed first. _ can also be parsed within words if "smart" is disabled.

Anyways, the first task would be to see if we broke any previous behavior.

If we did not break existing behavior, then this would be a new request. Then we would need to determine specifically what rule we are breaking (if any), and if so, see if we could add the necessary intelligence.

@facelessuser
Collaborator

As far as #805 (the fix for #792) being the cause, I see no evidence. I reverted the changes related to * and _ locally. The first session below is with the changes reverted, the second is the tip of master. Behavior is identical:

markdown git:(master) ✗ python3
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import markdown
>>> markdown.markdown('**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**')
'<p><strong><em> _ </em> N _ <em> _ </em> _ <em> _ R S _ </em> N _</strong></p>'
>>> exit()
markdown git:(master) ✗ python3
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import markdown
>>> markdown.markdown('**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**')
'<p><strong><em> _ </em> N _ <em> _ </em> _ <em> _ R S _ </em> N _</strong></p>'

@facelessuser
Collaborator

It is highly doubtful that #629 is the culprit either. The only patterns that were drastically changed were the link patterns. The bold and emphasis patterns began using the new processor, but their fundamental patterns did not change, nor did the sequencing of the extensions.

@facelessuser
Copy link
Collaborator

So bold and italic are handled in separate passes, as they always have been. At no point has Python Markdown ever refrained from parsing _ simply because it was surrounded by **. If this is desired, we'd have to come up with a new mechanism to allow it. I will leave that up to @waylan to consider.

@waylan
Member

waylan commented Nov 7, 2022

I finally had some time to look at this. It appears that we are ignoring this provision of the rules:

But if you surround an * or _ with spaces, it’ll be treated as a literal asterisk or underscore.

And that is a bug. That is also why I wondered if we had introduced a change in any of the more recent changes. Regardless of how the current behavior was introduced, we do need to fix this.

That said, the fix is not exactly clear. For example, the first and last underscore in the example input are not 'surrounded' by spaces as there is no space between the asterisk and the underscore in each instance. However, there are clearly instances where underscores which are surrounded by spaces are not being treated as literal underscores in the example.

To further complicate matters, we do sometimes properly ignore underscores surrounded by spaces. Consider this simple input:

Foo _ N _ Bar

... which properly renders as...

<p>Foo _ N _ Bar</p>

I haven't looked at the code to try to understand what is going on exactly, but something is amiss.

@waylan waylan added bug Bug report. core Related to the core parser code. confirmed Confirmed bug report or approved feature request. and removed needs-confirmation The alleged behavior needs to be confirmed. labels Nov 7, 2022
@facelessuser
Collaborator

This is not a problem with my betterem package:

>>> markdown.markdown('**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**', extensions=['pymdownx.betterem'])
'<p><strong>_ _ _ N _ _ _ _ _ _ _ R S _ _ N _</strong></p>'

I know in my package, this is done by rejecting a * or _ as a starting token if it is immediately followed by white space and rejecting * or _ as an ending token if it is immediately preceded by white space.
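
The rejection rule described above can be sketched with a simple regular expression. This is purely an illustration, not betterem's actual pattern; the `EM_RE` and `emphasize` names are hypothetical:

```python
import re

# Illustrative sketch of the rule above (not betterem's actual pattern):
# a * or _ can only open emphasis when not immediately followed by
# whitespace, and only close it when not immediately preceded by
# whitespace. The backreference \1 requires matching delimiters.
EM_RE = re.compile(r'(\*|_)(?!\s)(.+?)(?<!\s)\1')

def emphasize(text):
    """Wrap valid emphasis spans; leave space-adjacent tokens alone."""
    return EM_RE.sub(r'<em>\2</em>', text)

print(emphasize('_ _ _ N _ _ _ _'))  # unchanged: every _ is space-adjacent
print(emphasize('Foo _word_ Bar'))   # Foo <em>word</em> Bar
```

With this rule, none of the underscores in the issue's example can ever open or close an emphasis span, so they all remain literal.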

@only-dev-time
Author

To further complicate matters, we do sometimes properly ignore underscores surrounded by spaces. Consider this simple input:

Foo _ N _ Bar

I only checked the code once. It may sound a bit cheeky, but a single underscore is no problem.
The result only comes out wrong when there are several in a row.

Foo _ _ N _ _ bar

renders as

<p>Foo _ <em> N _ </em> bar</p>

1. Thought:

I wonder why, in the handler of the SimpleTextInlineProcessor, match group 1 (_) is replaced by a placeholder (this processor is obviously only used with the stand-alone _ or *).
If only the underscore (match group 3) were replaced, this correct result would be rendered:

<p>Foo _ _ N _ _ bar</p>

However, this modification gives this result for the example above:

<p><strong><em> _ _ N _ _ _ _ _ _ _ R S _ _ N </em></strong></p>

Now the first and last underscore after the double asterisks are interpreted as markdown tokens, even though they do not mark the beginning and end of a span of text.
Therefore my

2. Thought:

The original code renders

**_ N _**

to

<p><strong><em> N </em></strong></p>

which is also wrong, and leads me to suspect that something might be wrong with the joined sequences **_ and _** and __* and *__ and ___ and ***.
Perhaps @facelessuser's method would be helpful. However, I cannot find and test the extension right now.

@facelessuser
Collaborator

The extension is found in this extension pack: https://pypi.org/project/pymdown-extensions/.

Keep in mind, it may not match all corner cases if compared to some other parser, but it does aim to parse strong and italic cases in as sane a way as possible using the current methodology that Python Markdown employs.

@waylan
Member

waylan commented Nov 8, 2022

I know in my package, this is done by rejecting a * or _ as a starting token if it is immediately followed by white space and rejecting * or _ as an ending token if it is immediately preceded by white space.

Yeah, this is how most implementations actually work in practice. In fact, the Commonmark spec codifies this behavior in great detail. However, the original rules explicitly state that the character should be rejected as a token if it is 'surrounded' by spaces. Of course, the rule is vague and leaves out a bunch of edge cases (white space other than spaces, punctuation, etc).

That said, we can't get too hung up on the original rules as we explicitly don't follow them anyway with underscores to avoid middle-word emphasis. In the end we should probably strive to follow the behavior of the reference implementation with asterisks and copy that behavior to underscores minus middle-word emphasis.

@facelessuser
Copy link
Collaborator

I think it will be key to identify which missing rules specifically we are going to implement. Once we've identified that, I think we can move forward with a solution.

@waylan
Member

waylan commented Nov 9, 2022

The simple solution would be to adjust the NOT_STRONG_RE:

# stand-alone * or _
NOT_STRONG_RE = r'((^|\s)(\*|_)(\s|$))'

Note that that regex consumes the space both before and after the standalone character. That explains why the example input seems to match every other underscore (the space before was consumed by the previous match). However, if it did not consume the space (maybe by using a negative lookahead assertion, i.e. (?!\s)), then that space would still be available when attempting to match the next standalone character. Also, technically NOT_STRONG_RE should probably work on single, double and triple length asterisks and underscores. It currently only works on singles.
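
The difference between the consuming and non-consuming approach can be shown with a quick sketch (the `NOT_STRONG_NC_RE` name and exact non-consuming pattern here are hypothetical, just to illustrate the idea):

```python
import re

# Current pattern, as quoted above: consumes the whitespace on both sides.
NOT_STRONG_RE = r'((^|\s)(\*|_)(\s|$))'

# Hypothetical non-consuming variant using zero-width assertions, so the
# space between two adjacent standalone tokens remains available.
NOT_STRONG_NC_RE = r'((?:(?<=\s)|^)(\*|_)(?=\s|$))'

text = 'Foo _ _ N _ _ bar'
print(len(re.findall(NOT_STRONG_RE, text)))     # 2 -- every other _ is missed
print(len(re.findall(NOT_STRONG_NC_RE, text)))  # 4 -- all four are caught
```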

Another approach would be to build the whitespace checking into the various regexes directly (eliminating the need for NOT_STRONG_RE). I think the approach taken by PHP is interesting. Specifically, they only match a closing token when it is not immediately preceded by whitespace, and they only match an opening token if it is not immediately followed by whitespace, or by punctuation which is itself followed by whitespace. That last bit is what I find most interesting (although I wonder why a question mark was not included in the list of punctuation). If you use punctuation in prose according to normal grammar conventions, then it does the correct thing. If you are doing something outside the norm, then you may need to escape your standalone characters to avoid a match.

Upon further reflection, I think that perhaps the requirement that the character is not to be recognized as a token if it is 'surrounded' by whitespace is a red herring. The opening token not followed by whitespace and closing token not preceded by whitespace rule is going to cover any case in normal prose. Edge cases outside of that should be relatively rare and can use backslash escapes to avoid a match. And there is no backward compatibility issue here as users will have needed to use more backslash escapes in documents authored prior to any change we make.

There is the issue of document authors having left extra white space in an older document which previously rendered with emphasis but will not after the change. I think we can ignore that issue as the document was not properly formatted. The user only didn't notice because of a bug in our parser. However, as our parser is clearly acting in opposition to the rules, the author should have expected a future update to alter the behavior and edited their document accordingly.

So we have two possible paths forward:

  1. Update NOT_STRONG_RE
  2. Abandon NOT_STRONG_RE and update the various other regexes to incorporate negative lookahead/lookbehind assertions for whitespace.

I think the deciding factor here is what set of rules we want to follow. If we want to strictly match the original rules as written ('surround'), then we should go with option 1, but if we want to more closely match all other implementations (only disallow whitespace on one side of the token), then option 2 makes more sense. In the real world, I think either approach would be indistinguishable from the other in all but the most unusual edge cases. Therefore, I am inclined to accept whichever one is submitted in a PR so long as the tests demonstrate that it covers all of the expected cases.

@waylan
Member

waylan commented Nov 14, 2022

Btw, the minimum failing test is the following:

Foo * * * *

Bar _ _ _ _

which renders as

<p>Foo * <em> * </em></p>
<p>Bar _ <em> _ </em></p>

while it should render as

<p>Foo * * * *</p>
<p>Bar _ _ _ _</p>

By way of explanation, we must include something on the line to ensure it is not a horizontal rule (I used the words Foo and Bar, but anything other than an asterisk or underscore would work). And at a minimum 4 standalone asterisks or underscores must exist within a paragraph, with the first 2 in a pair and the second 2 in a pair. The first and third are properly identified as standalone. However, the current regex will consume the space immediately after each and cause the asterisk or underscore immediately after it to not be properly identified as standalone. Simply breaking up those pairs (with anything, even another space) will avoid the bug. This is a very narrow edge case which has never been reported until now, as it is very unlikely to ever be encountered in the real world. Regardless, it should be fixed.
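
The every-other-match effect can be reproduced directly with the regex quoted earlier in the thread (a minimal demonstration, not a change to the parser):

```python
import re

# The current space-consuming pattern quoted earlier in the thread.
NOT_STRONG_RE = r'((^|\s)(\*|_)(\s|$))'

# Only the first and third tokens match; each match consumes the trailing
# space, so the second and fourth tokens have no leading space left.
for text in ('Foo * * * *', 'Bar _ _ _ _'):
    print(len(re.findall(NOT_STRONG_RE, text)))  # 2, not 4
```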

Given how narrow of an edge case it is, I think probably the simple fix of updating NOT_STRONG_RE (option 1 in my previous comment) is the more sensible approach as it introduces less change to the codebase overall.

waylan added a commit to waylan/markdown that referenced this issue Nov 14, 2022
The `NOT_STRONG_RE` regex matches 1, 2, or 3 * or _ which are surrounded by
white space to prevent them from being parsed as tokens. However, the
surrounding white space should not be consumed by the regex, which is why
lookahead and lookbehind assertions are used. As `^` cannot be matched in a
lookbehind assertion, it is left outside the assertion, but as it is zero
length, that should not matter.

Tests added and/or updated to cover various edge cases. Fixes Python-Markdown#1300.
waylan added a commit that referenced this issue Nov 15, 2022
The `NOT_STRONG_RE` regex matches 1, 2, or 3 * or _ which are surrounded by
white space to prevent them from being parsed as tokens. However, the
surrounding white space should not be consumed by the regex, which is why
lookahead and lookbehind assertions are used. As `^` cannot be matched in a
lookbehind assertion, it is left outside the assertion, but as it is zero
length, that should not matter.

Tests added and/or updated to cover various edge cases. Fixes #1300.
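
A pattern matching the commit message's description can be sketched as follows: 1 to 3 consecutive * or _ surrounded by whitespace, using zero-width assertions, with `^` left outside the lookbehind. The exact pattern merged in #1303 may differ in detail:

```python
import re

# Sketch per the commit message above (the merged pattern may differ):
# match runs of 1-3 * or _ surrounded by whitespace, without consuming
# the whitespace. `^` cannot appear inside a lookbehind, so it is kept
# as an ordinary zero-length alternative outside the assertion.
NOT_STRONG_RE = r'((^|(?<=\s))(\*{1,3}|_{1,3})(?=\s|$))'

for text in ('Foo * * * *', 'Bar _ _ _ _', 'Baz ** __ ***'):
    print([m.group(3) for m in re.finditer(NOT_STRONG_RE, text)])
```

Because nothing outside the token itself is consumed, adjacent standalone tokens separated by a single space all match, which fixes the minimal failing test above.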