Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bold/Italic bug #792

Closed
Dave-ts opened this issue Feb 27, 2019 · 27 comments
Closed

Bold/Italic bug #792

Dave-ts opened this issue Feb 27, 2019 · 27 comments
Assignees
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request.

Comments

@Dave-ts
Copy link

Dave-ts commented Feb 27, 2019

I think I'm running up against another bold/italics bug. I did some quick searches and it looks like the other issues were considered resolved, sorry if I'm re-reporting on something already fixed that hasn't made it upstream yet.

Installed from pip current Python-Markdown version 3.0.1

The raw markdown line that breaks is:

This is text **bold *italic bold*** with more text

The output I'm getting is as follows:

<p>This is text <strong>bold *italic bold</strong>* with more text</p>

However, the following format does seem to work correctly.

This is text ***bold italic** italic* more text

The output is

<p>This is text <em><strong>bold italic</strong> italic</em> more text</p>
@waylan waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. invalid Invalid report (user error, upstream issue, etc). and removed confirmed Confirmed bug report or approved feature request. labels Feb 27, 2019
@waylan
Copy link
Member

waylan commented Feb 27, 2019

I can confirm I am seeing the same behavior in 3.0.1. However, this behavior does not exist in the latest code here on GitHub and will be available in the next release.

>>> import markdown
>>> markdown.version
'3.1.dev0'
>>> markdown.markdown('This is text ***bold italic** italic* more text')
'<p>This is text <em><strong>bold italic</strong> italic</em> more text</p>'

@waylan waylan closed this as completed Feb 27, 2019
@facelessuser
Copy link
Collaborator

@waylan, I don't think you tested is case properly. It wasn't This is text ***bold italic** italic* more text that failed, but This is text **bold *italic bold*** with more text.

@facelessuser facelessuser reopened this Feb 27, 2019
@facelessuser
Copy link
Collaborator

facelessuser commented Feb 27, 2019

But I will explain a little bit why this happens.

In the pattern This is text **bold *italic bold*** with more text, the parser Sees **bold *italic bold** and makes it <strong>bold *italic bold</strong>. This leaves two * in separate tags, and they cannot be resolved or you'd end up with some weird broken HTML.

The 3rd party pmdownx.betterem can handle this, but only when smart_enable is set to handle the star syntax (it only does "smart" for underscore by default).

>>> markdown.Markdown(extensions=["pymdownx.betterem"], extension_configs={"pymdownx.betterem": {"smart_enable": "all"}}).convert("This is text **bold *italic bold*** with more text")
'<p>This is text <strong>bold <em>italic bold</em></strong> with more text</p>'

Without smart enabled for *, it does what Python Markdown does by default:

>>> markdown.Markdown(extensions=["pymdownx.betterem"]).convert("This is text **bold *italic bold*** with more text")
'<p>This is text <strong>bold *italic bold</strong>* with more text</p>'

Based on how Python Markdown works, this behavior is as I would expect. One sure fire way to avoid such confusion is to use syntax like this:

This is text **bold _italic bold_** with more text

If not, something like pymdownx.betterem may be a way to handle this. Or, we'd have to do some similar logic over here.

@facelessuser
Copy link
Collaborator

If we find Python Markdown's behavior acceptable here, then we should just close this.

@waylan
Copy link
Member

waylan commented Feb 27, 2019

Oops, apparently I copied the wrong example when I tested that. And Babelmark clearly indicates we are in the minority here. This is a bug.

@waylan waylan added confirmed Confirmed bug report or approved feature request. and removed invalid Invalid report (user error, upstream issue, etc). labels Feb 27, 2019
@Dave-ts
Copy link
Author

Dave-ts commented Feb 27, 2019

I'm not sure how the parser works for the markdown as I've not been in the code but I'm currently writing a desktop browser/editor that uses markdown in the documents for a document store.

I added formatting to my editor as is typically done. It just makes the editor code more readable. I had my bold/italics formatting regex base assumptions that bold is always the inside two asterisks when working together with italics. I'm not sure if that makes sense for the markdown package or not but I'm parsing flawless formatting in my editor at the moment in terms of italics/bold and using an asterisk. I've yet to implement and support the underscore for formatting in the editor.

I'll take a look at the betterem options to resolve the issue since I'm already using the package anyways. Thank you for the feedback @facelessuser Either way, if the bug exists, we probably want to stomp on it. if it doesn't, I apologize for wasting your time.

Thank you to the whole team for the effort you put into this project.

@waylan
Copy link
Member

waylan commented Feb 27, 2019

I had my bold/italics formatting regex base assumptions that bold is always the inside two asterisks when working together with italics.

That is not a safe assumption to make. That's why @facelessuser suggests usign _ and * together so you can control which is inside, and which is outside.

This is text **bold _italic bold_** with more text

@Dave-ts
Copy link
Author

Dave-ts commented Feb 27, 2019

I like the mixed idea (using _ and *) personally but you know how it is to rely on the end user to do stuff :) I'm interested in conditions using only asterisks that would break the regex I'm using if you have any I'd love to test it. My use of the word assume is likely not the right word.

As I said I have not written code to format my editor based on the _ just yet, but here are my asterisk regex if you're willing to peak at them. So far I'm not finding many ways to break them. The difference I'm having at this point between my formatting and actual markdown/html output is that I allow cross paragraph formatting in the editor and the markdown conversion does not support this (I have just generated an issue for a bug fix on the application to change this).

Bold regex
r = r'\*\*(?!\*).*?\*\*'

Italics regex
r = r'(?<!\*)\*(?!\*).*?(?:(?<!\*)\*(?!\*)|\*{3})|(?<!\*)\*{3}(?!\*).*?(?:(?<!\*)\*(?!\*)|\*{3})'

I'm very new with regex so I'll take the feedback criticism no problem.

@facelessuser
Copy link
Collaborator

Basically, to fix this, we need to handle the opposite case of ***em*strong** which we already handle. We have nothing that catches **em*strong***.

I think even pymdownx.betterem doesn't have an explicit pattern for **em*strong***. It explicitly covers these two cases though:

***strong,em*strong**
***em,strong**em*

@Dave-ts , there is also whitespace considerations when determining if * or _ is a valid start/end to a pattern. I know that Python Markdown and pymdownx.betterem approach these a little different, but both take these into considerations.

References:

@Dave-ts
Copy link
Author

Dave-ts commented Feb 27, 2019

@Dave-ts , there is also whitespace considerations when determining if * or _ is a valid start/end to a pattern. I know that Python Markdown and pymdownx.betterem approach these a little different, but both take these into considerations.

References:

* BetterEm: https://github.com/facelessuser/pymdown-extensions/blob/master/pymdownx/betterem.py
* Python Markdown: https://github.com/Python-Markdown/markdown/blob/master/markdown/inlinepatterns.py#L116

Thank you for the references I'll have a peak right away.

@waylan
Copy link
Member

waylan commented Feb 27, 2019

Basically, to fix this, we need to handle the opposite case of ***em*strong** which we already handle. We have nothing that catches **em*strong***.

Right. Of course, ***em*strong** starts with *** and is easy to special case as we just added another pattern to handle this case only. The problem is that the start of **em*strong*** is indistinguishable from **strong**.

Perhaps we can better handle the last group at the end. Currently in the strong pattern **em*strong*** gets grouped as: (**)(em*strong)(**)*, but we want to group as (**)(em*strong*)(**). In other words, we want to group the *** at the end as *(**) not (**)*.

I was thinking we might do something like \*{2}(?!\*), but that doesn't work because once the first two don't match, then is tries to match the third one without the second and fails (in other words, a limitation of how the regex engine is implemented). In any event, I think this is the correct approach generally (with a better regex) as the reference implementation's behavior matches this (see Babelmark, also this). What's interesting is that the reference implementation is nearly alone in that behavior. Regardless, I think it is probably the correct behavior.

In other words, when matching the closing strong token, given a string of two or more asterisks, we must match the last two in the string.

@waylan
Copy link
Member

waylan commented Feb 27, 2019

Another way to solve this might be to simply add another pattern which special cases **strong***. It matches that and returns (**)(strong*)(**). Of course the actual contents (here represented as strong) could be anything.

@facelessuser
Copy link
Collaborator

Right. Of course, ***em*strong** starts with *** and is easy to special case as we just added another pattern to handle this case only. The problem is that the start of **em*strong*** is indistinguishable from **strong**.

I haven't played with this yet, but I don't think it is completely indistinguishable. It's a matter of order I believe. If you scan for **em*strong*** and come up empty, then you scan for **strong**.

Now I could be completely wrong here. Unfortunately, I don't remember if I tried this before and ran into issues, or like Python Markdown, just didn't think to put this case in.

Another way to solve this might be to simply add another pattern which special cases **strong***. It matches that and returns (**)(strong*)(**). Of course the actual contents (here represented as strong) could be anything.

This might work. I guess we'd have to run some tests to see.

@facelessuser
Copy link
Collaborator

So I believe I've solved this issue over in betterem: facelessuser/pymdown-extensions#516. I will probably port over a similar solution here when I get some time.

I opted for having pattern for **strong*em,strong** opposed to a pattern for **strong***. It just felt like there would be the potential for numerous corner cases with **strong***.

@facelessuser
Copy link
Collaborator

facelessuser commented Mar 2, 2019

I think something like this will work for Python Markdown:

# **strong*em***
EM_STRONG2_RE = r'(\*)\1(?!\1)(.+?)\1(?!\1)(.+?)\1{3}'

# __strong _em___
SMART_EM_STRONG2_RE = r'(?<!\w)(\_)\1(?!\1)(.+?)(?<!\w)\1(?!\1)(.+?)\1{3}(?!\w)'

Here we are basically requiring that the content of each doesn't start with the token, so if we had something like ***text*text*** or **text**text***, we'd skip targeting them and let the other patterns deal with it. So we really only handle actual cases of **text*text***.

With underscore, we have smart enabled by default, so we have the additional requirement that the nested _ is not preceded by a word character to continue with that "smart" logic. In the legacy_em extension, we'd replace this with the "dumb" logic.

It seems to work with basic testing. I'll upload a pull request once I've tested it more.

@facelessuser
Copy link
Collaborator

@waylan creating these new rules will of course throw off the current rule alignment we have going were each rule is done on a number divisible by 10: 90, 80, 70, etc. We would have to insert between, is that okay? Or do we want to consolidate some of the rules?

@waylan
Copy link
Member

waylan commented Mar 4, 2019

Or do we want to consolidate some of the rules?

What did you have in mind?

@waylan
Copy link
Member

waylan commented Mar 4, 2019

We would have to insert between, is that okay?

Yes, that is what I would expect.

@facelessuser
Copy link
Collaborator

What did you have in mind?

I'm not sure if it will cause problems to reduce all the * related patterns to a single pattern and all _ to a single pattern (we could), but maybe at the very least combine the special cases ***text**text* and **text*text*** etc?

We've given the impression that all the standard patterns are on boundaries of 10s and that you can define your custom patterns in between, but we have nowhere to put these new patterns, and may overwrite a position that is being used by a custom extension.

I figure we could use the new Pattern format, and in a single pattern loop through our different regex for *. We could do this to include all the related patterns or just a couple.

@facelessuser
Copy link
Collaborator

One problem would be that some people still use the deprecated method of inserting relative to a name, so if we consolidated all patterns, we might break some extensions. If we just append this new pattern to an existing pattern, you'd at least not break parity.

@waylan
Copy link
Member

waylan commented Mar 4, 2019

You just reminded me that the deprecation path for the old add method should probably be taken to the next step in the 3.1 release (raise an error rather than a warning). That being the case, I don't think that is a concern.

That said, I understand the concern about breaking the "step by 10" model we have now. In 3.0 we significantly altered the way inline parsing can work, while in practice we made very few actual changes. If you are suggesting taking this to the next step, then that seems like a reasonable approach. I like the idea that all strong processors are combined into one, if that is reasonable to accomplish.

@facelessuser
Copy link
Collaborator

That said, I understand the concern about breaking the "step by 10" model we have now. In 3.0 we significantly altered the way inline parsing can work, while in practice we made very few actual changes. If you are suggesting taking this to the next step, then that seems like a reasonable approach. I like the idea that all strong processors are combined into one, if that is reasonable to accomplish.

We could take it to the next step. I was being more conservative, but if we want to go all in and combine them, that could be done quite easily. Initially, we'd just use the new format and loop through our regular expression patterns and output the appropriate element based on what pattern matches. If we wanted to in the future, we could even rewrite to functionally parse the patterns (if there was some advantage), but I see no need to completely rewrite everything. I think the current patterns are probably fine for now, we can just group them into one pattern step.

@facelessuser
Copy link
Collaborator

Okay, it seems like the best way to consolidate patterns, if we were to go that direction, would be to maybe take a more functional approach. Trigger off * or ** and then functionally parse any nested * and build up the strong or em tags with the appropriate nested strong and em tags. Additionally considering smart semantics if enabled. So if you were given something like **text *italic* text *italic* text**, we would just trigger off ** and then resolve any * until we hit ** at the end. We would return <strong>text <em>italic</em> text <em>italic</em> text</strong> so we didn't have to handle it in multiple passes.

I may give this a shot as it may make a lot more sense and probably clean things up as well.

@facelessuser
Copy link
Collaborator

I didn't notice this before, but it looks like we don't properly apply smart logic always to underscores: test___smart_logic__. See https://johnmacfarlane.net/babelmark2/?text=test___smart_logic__.

# ***strongem*** or ***em*strong**
EM_STRONG_RE = r'(\*|_)\1{2}(.+?)\1(.*?)\1{2}'

# ***strong**em*
STRONG_EM_RE = r'(\*|_)\1{2}(.+?)\1{2}(.*?)\1'

To fix this, we'd need even more patterns, extras for underscore that employ smart logic.

We may need to do this in a better way...or just add more patterns that might conflict with plugins.

@waylan
Copy link
Member

waylan commented Mar 9, 2019

I didn't notice this before, but it looks like we don't properly apply smart logic always to underscores: test___smart_logic__.

Interesting edge case. For the record, we only apply "smart logic" to single underscores, not double or triple. Presumably the thinking is that single underscores are common enough in plain text that we should special case them, but double and triple underscores are not common enough and we always treat them the same as asterisks. Of course, the specific edge case above is handled by a doubt/triple regex, so it doesn't follow the "smart" underscore logic. And in the end, I'm not too worried about it as it only behaves that way because there are double and triple underscores with it. The point is that we only use the "smart underscores" when we are only dealing with all single underscore text.

I can't say for certain if that was a conscious decision of an accident that it turned out that way. I suppose a way to determine if it was intended behavior would be to find a test for it. If no such test exists, then it was an unforeseen side effect. Either way, I'm okay with it staying the way it is.

@facelessuser facelessuser self-assigned this Mar 11, 2019
@facelessuser
Copy link
Collaborator

I've actually worked out logistics and have implemented a prototype over in the BetterEm extension: https://github.com/facelessuser/pymdown-extensions/pull/516/files#diff-81791fadc501f2d0d876f760472c81b4R98.

Essentially, a similar inline processor will be used here: 1 for asterisks and 1 for underscores. All of the logic will be consolidated into a single processors for each.

@facelessuser
Copy link
Collaborator

This issue was fixed by #805

chipx86 added a commit to reviewboard/reviewboard that referenced this issue May 21, 2021
Some of the recent releases of Python-Markdown, along with some of the
recent fixes merged in from release-3.0.x, resulted in some regressions
in the expected Markdown rendering results in unit tests. These
regressions are, for the most part, not something users will need to
worry about.

Some of the regressions were in the XSS unit tests. These aren't
security regressions, just expected string regressions. Python Markdown
3.0 changed some of the parsing, supporting `"title"` in link URLs, and
these captured strings in JavaScript code, turning them into `title=`
attributes. Tests were updated for the new strings.

Python Markdown 3.2 added `<code>` elements to each line in a code
block, which new tests from release-3.0.x weren't accounting for. Since
we support 3.1 and 3.3+ (depending on the Python version), these testsn
ow check for the right string for the right version.

And finally, the one user-facing change is that nested bold/italic (in
the form of `*this is **a** test*` renders entirely as a series of
italic blocks on Python Markdown 3.2. The working alternative is
`*this is __a__ test*`. Tests were updated for this as well.
Unfortunately, there doesn't seem to be a way to restore prior
formatting, but hopefully this doesn't affect many people.

See Python-Markdown/markdown#792 for the
bold/italic changes in Python Markdown.

Testing Done:
Unit tests pass on all supported versions of Python, with all supported
versions of Python Markdown.

Reviewed at https://reviews.reviewboard.org/r/11611/
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request.
Projects
None yet
Development

No branches or pull requests

3 participants