Bold/Italic bug #792

Dave-ts · 2019-02-27T19:46:15Z

I think I'm running up against another bold/italics bug. I did some quick searches and it looks like the other issues were considered resolved, sorry if I'm re-reporting on something already fixed that hasn't made it upstream yet.

Installed from pip current Python-Markdown version 3.0.1

The raw markdown line that breaks is:

This is text **bold *italic bold*** with more text

The output I'm getting is as follows:

<p>This is text <strong>bold *italic bold</strong>* with more text</p>

However, the following format does seem to work correctly.

This is text ***bold italic** italic* more text

The output is

<p>This is text <em><strong>bold italic</strong> italic</em> more text</p>

The text was updated successfully, but these errors were encountered:

waylan · 2019-02-27T20:07:16Z

I can confirm I am seeing the same behavior in 3.0.1. However, this behavior does not exist in the latest code here on GitHub and will be available in the next release.

>>> import markdown
>>> markdown.version
'3.1.dev0'
>>> markdown.markdown('This is text ***bold italic** italic* more text')
'<p>This is text <em><strong>bold italic</strong> italic</em> more text</p>'

facelessuser · 2019-02-27T20:18:07Z

@waylan, I don't think you tested is case properly. It wasn't This is text ***bold italic** italic* more text that failed, but This is text **bold *italic bold*** with more text.

facelessuser · 2019-02-27T20:25:24Z

But I will explain a little bit why this happens.

In the pattern This is text **bold *italic bold*** with more text, the parser Sees **bold *italic bold** and makes it bold *italic bold. This leaves two * in separate tags, and they cannot be resolved or you'd end up with some weird broken HTML.

The 3rd party pmdownx.betterem can handle this, but only when smart_enable is set to handle the star syntax (it only does "smart" for underscore by default).

>>> markdown.Markdown(extensions=["pymdownx.betterem"], extension_configs={"pymdownx.betterem": {"smart_enable": "all"}}).convert("This is text **bold *italic bold*** with more text")
'<p>This is text <strong>bold <em>italic bold</em></strong> with more text</p>'

Without smart enabled for *, it does what Python Markdown does by default:

>>> markdown.Markdown(extensions=["pymdownx.betterem"]).convert("This is text **bold *italic bold*** with more text")
'<p>This is text <strong>bold *italic bold</strong>* with more text</p>'

Based on how Python Markdown works, this behavior is as I would expect. One sure fire way to avoid such confusion is to use syntax like this:

This is text **bold _italic bold_** with more text

If not, something like pymdownx.betterem may be a way to handle this. Or, we'd have to do some similar logic over here.

facelessuser · 2019-02-27T20:26:55Z

If we find Python Markdown's behavior acceptable here, then we should just close this.

waylan · 2019-02-27T20:36:05Z

Oops, apparently I copied the wrong example when I tested that. And Babelmark clearly indicates we are in the minority here. This is a bug.

Dave-ts · 2019-02-27T20:36:53Z

I'm not sure how the parser works for the markdown as I've not been in the code but I'm currently writing a desktop browser/editor that uses markdown in the documents for a document store.

I added formatting to my editor as is typically done. It just makes the editor code more readable. I had my bold/italics formatting regex base assumptions that bold is always the inside two asterisks when working together with italics. I'm not sure if that makes sense for the markdown package or not but I'm parsing flawless formatting in my editor at the moment in terms of italics/bold and using an asterisk. I've yet to implement and support the underscore for formatting in the editor.

I'll take a look at the betterem options to resolve the issue since I'm already using the package anyways. Thank you for the feedback @facelessuser Either way, if the bug exists, we probably want to stomp on it. if it doesn't, I apologize for wasting your time.

Thank you to the whole team for the effort you put into this project.

waylan · 2019-02-27T20:41:14Z

I had my bold/italics formatting regex base assumptions that bold is always the inside two asterisks when working together with italics.

That is not a safe assumption to make. That's why @facelessuser suggests usign _ and * together so you can control which is inside, and which is outside.

This is text **bold _italic bold_** with more text

Dave-ts · 2019-02-27T20:58:49Z

I like the mixed idea (using _ and *) personally but you know how it is to rely on the end user to do stuff :) I'm interested in conditions using only asterisks that would break the regex I'm using if you have any I'd love to test it. My use of the word assume is likely not the right word.

As I said I have not written code to format my editor based on the _ just yet, but here are my asterisk regex if you're willing to peak at them. So far I'm not finding many ways to break them. The difference I'm having at this point between my formatting and actual markdown/html output is that I allow cross paragraph formatting in the editor and the markdown conversion does not support this (I have just generated an issue for a bug fix on the application to change this).

Bold regex
r = r'\*\*(?!\*).*?\*\*'

Italics regex
r = r'(?<!\*)\*(?!\*).*?(?:(?<!\*)\*(?!\*)|\*{3})|(?<!\*)\*{3}(?!\*).*?(?:(?<!\*)\*(?!\*)|\*{3})'

I'm very new with regex so I'll take the feedback criticism no problem.

facelessuser · 2019-02-27T21:03:54Z

Basically, to fix this, we need to handle the opposite case of ***em*strong** which we already handle. We have nothing that catches **em*strong***.

I think even pymdownx.betterem doesn't have an explicit pattern for **em*strong***. It explicitly covers these two cases though:

***strong,em*strong**
***em,strong**em*

@Dave-ts , there is also whitespace considerations when determining if * or _ is a valid start/end to a pattern. I know that Python Markdown and pymdownx.betterem approach these a little different, but both take these into considerations.

References:

Dave-ts · 2019-02-27T21:05:56Z

@Dave-ts , there is also whitespace considerations when determining if * or _ is a valid start/end to a pattern. I know that Python Markdown and pymdownx.betterem approach these a little different, but both take these into considerations.

References:
* BetterEm: https://github.com/facelessuser/pymdown-extensions/blob/master/pymdownx/betterem.py
* Python Markdown: https://github.com/Python-Markdown/markdown/blob/master/markdown/inlinepatterns.py#L116

Thank you for the references I'll have a peak right away.

waylan · 2019-02-27T23:45:05Z

Basically, to fix this, we need to handle the opposite case of ***em*strong** which we already handle. We have nothing that catches **em*strong***.

Right. Of course, ***em*strong** starts with *** and is easy to special case as we just added another pattern to handle this case only. The problem is that the start of **em*strong*** is indistinguishable from **strong**.

Perhaps we can better handle the last group at the end. Currently in the strong pattern **em*strong*** gets grouped as: (**)(em*strong)(**)*, but we want to group as (**)(em*strong*)(**). In other words, we want to group the *** at the end as *(**) not (**)*.

I was thinking we might do something like \*{2}(?!\*), but that doesn't work because once the first two don't match, then is tries to match the third one without the second and fails (in other words, a limitation of how the regex engine is implemented). In any event, I think this is the correct approach generally (with a better regex) as the reference implementation's behavior matches this (see Babelmark, also this). What's interesting is that the reference implementation is nearly alone in that behavior. Regardless, I think it is probably the correct behavior.

In other words, when matching the closing strong token, given a string of two or more asterisks, we must match the last two in the string.

waylan · 2019-02-27T23:48:08Z

Another way to solve this might be to simply add another pattern which special cases **strong***. It matches that and returns (**)(strong*)(**). Of course the actual contents (here represented as strong) could be anything.

facelessuser · 2019-02-27T23:57:32Z

Right. Of course, ***em*strong** starts with *** and is easy to special case as we just added another pattern to handle this case only. The problem is that the start of **em*strong*** is indistinguishable from **strong**.

I haven't played with this yet, but I don't think it is completely indistinguishable. It's a matter of order I believe. If you scan for **em*strong*** and come up empty, then you scan for **strong**.

Now I could be completely wrong here. Unfortunately, I don't remember if I tried this before and ran into issues, or like Python Markdown, just didn't think to put this case in.

Another way to solve this might be to simply add another pattern which special cases **strong***. It matches that and returns (**)(strong*)(**). Of course the actual contents (here represented as strong) could be anything.

This might work. I guess we'd have to run some tests to see.

facelessuser · 2019-03-02T03:11:36Z

So I believe I've solved this issue over in betterem: facelessuser/pymdown-extensions#516. I will probably port over a similar solution here when I get some time.

I opted for having pattern for **strong*em,strong** opposed to a pattern for **strong***. It just felt like there would be the potential for numerous corner cases with **strong***.

facelessuser · 2019-03-02T16:47:23Z

I think something like this will work for Python Markdown:

# **strong*em***
EM_STRONG2_RE = r'(\*)\1(?!\1)(.+?)\1(?!\1)(.+?)\1{3}'

# __strong _em___
SMART_EM_STRONG2_RE = r'(?<!\w)(\_)\1(?!\1)(.+?)(?<!\w)\1(?!\1)(.+?)\1{3}(?!\w)'

Here we are basically requiring that the content of each doesn't start with the token, so if we had something like ***text*text*** or **text**text***, we'd skip targeting them and let the other patterns deal with it. So we really only handle actual cases of **text*text***.

With underscore, we have smart enabled by default, so we have the additional requirement that the nested _ is not preceded by a word character to continue with that "smart" logic. In the legacy_em extension, we'd replace this with the "dumb" logic.

It seems to work with basic testing. I'll upload a pull request once I've tested it more.

facelessuser · 2019-03-03T22:45:18Z

@waylan creating these new rules will of course throw off the current rule alignment we have going were each rule is done on a number divisible by 10: 90, 80, 70, etc. We would have to insert between, is that okay? Or do we want to consolidate some of the rules?

waylan · 2019-03-04T13:57:16Z

Or do we want to consolidate some of the rules?

What did you have in mind?

waylan · 2019-03-04T16:08:39Z

We would have to insert between, is that okay?

Yes, that is what I would expect.

facelessuser · 2019-03-04T16:09:15Z

What did you have in mind?

I'm not sure if it will cause problems to reduce all the * related patterns to a single pattern and all _ to a single pattern (we could), but maybe at the very least combine the special cases ***text**text* and **text*text*** etc?

We've given the impression that all the standard patterns are on boundaries of 10s and that you can define your custom patterns in between, but we have nowhere to put these new patterns, and may overwrite a position that is being used by a custom extension.

I figure we could use the new Pattern format, and in a single pattern loop through our different regex for *. We could do this to include all the related patterns or just a couple.

facelessuser · 2019-03-04T16:16:07Z

One problem would be that some people still use the deprecated method of inserting relative to a name, so if we consolidated all patterns, we might break some extensions. If we just append this new pattern to an existing pattern, you'd at least not break parity.

waylan · 2019-03-04T16:30:23Z

You just reminded me that the deprecation path for the old add method should probably be taken to the next step in the 3.1 release (raise an error rather than a warning). That being the case, I don't think that is a concern.

That said, I understand the concern about breaking the "step by 10" model we have now. In 3.0 we significantly altered the way inline parsing can work, while in practice we made very few actual changes. If you are suggesting taking this to the next step, then that seems like a reasonable approach. I like the idea that all strong processors are combined into one, if that is reasonable to accomplish.

facelessuser · 2019-03-04T16:50:05Z

That said, I understand the concern about breaking the "step by 10" model we have now. In 3.0 we significantly altered the way inline parsing can work, while in practice we made very few actual changes. If you are suggesting taking this to the next step, then that seems like a reasonable approach. I like the idea that all strong processors are combined into one, if that is reasonable to accomplish.

We could take it to the next step. I was being more conservative, but if we want to go all in and combine them, that could be done quite easily. Initially, we'd just use the new format and loop through our regular expression patterns and output the appropriate element based on what pattern matches. If we wanted to in the future, we could even rewrite to functionally parse the patterns (if there was some advantage), but I see no need to completely rewrite everything. I think the current patterns are probably fine for now, we can just group them into one pattern step.

facelessuser · 2019-03-09T17:21:52Z

Okay, it seems like the best way to consolidate patterns, if we were to go that direction, would be to maybe take a more functional approach. Trigger off * or ** and then functionally parse any nested * and build up the strong or em tags with the appropriate nested strong and em tags. Additionally considering smart semantics if enabled. So if you were given something like **text *italic* text *italic* text**, we would just trigger off ** and then resolve any * until we hit ** at the end. We would return text italic text italic text so we didn't have to handle it in multiple passes.

I may give this a shot as it may make a lot more sense and probably clean things up as well.

facelessuser · 2019-03-09T18:15:38Z

I didn't notice this before, but it looks like we don't properly apply smart logic always to underscores: test___smart_logic__. See https://johnmacfarlane.net/babelmark2/?text=test___smart_logic__.

# ***strongem*** or ***em*strong**
EM_STRONG_RE = r'(\*|_)\1{2}(.+?)\1(.*?)\1{2}'

# ***strong**em*
STRONG_EM_RE = r'(\*|_)\1{2}(.+?)\1{2}(.*?)\1'

To fix this, we'd need even more patterns, extras for underscore that employ smart logic.

We may need to do this in a better way...or just add more patterns that might conflict with plugins.

waylan · 2019-03-09T21:13:39Z

I didn't notice this before, but it looks like we don't properly apply smart logic always to underscores: test___smart_logic__.

Interesting edge case. For the record, we only apply "smart logic" to single underscores, not double or triple. Presumably the thinking is that single underscores are common enough in plain text that we should special case them, but double and triple underscores are not common enough and we always treat them the same as asterisks. Of course, the specific edge case above is handled by a doubt/triple regex, so it doesn't follow the "smart" underscore logic. And in the end, I'm not too worried about it as it only behaves that way because there are double and triple underscores with it. The point is that we only use the "smart underscores" when we are only dealing with all single underscore text.

I can't say for certain if that was a conscious decision of an accident that it turned out that way. I suppose a way to determine if it was intended behavior would be to find a test for it. If no such test exists, then it was an unforeseen side effect. Either way, I'm okay with it staying the way it is.

facelessuser · 2019-03-12T18:49:55Z

I've actually worked out logistics and have implemented a prototype over in the BetterEm extension: https://github.com/facelessuser/pymdown-extensions/pull/516/files#diff-81791fadc501f2d0d876f760472c81b4R98.

Essentially, a similar inline processor will be used here: 1 for asterisks and 1 for underscores. All of the logic will be consolidated into a single processors for each.

facelessuser · 2019-09-04T03:30:56Z

This issue was fixed by #805

Some of the recent releases of Python-Markdown, along with some of the recent fixes merged in from release-3.0.x, resulted in some regressions in the expected Markdown rendering results in unit tests. These regressions are, for the most part, not something users will need to worry about. Some of the regressions were in the XSS unit tests. These aren't security regressions, just expected string regressions. Python Markdown 3.0 changed some of the parsing, supporting `"title"` in link URLs, and these captured strings in JavaScript code, turning them into `title=` attributes. Tests were updated for the new strings. Python Markdown 3.2 added `<code>` elements to each line in a code block, which new tests from release-3.0.x weren't accounting for. Since we support 3.1 and 3.3+ (depending on the Python version), these testsn ow check for the right string for the right version. And finally, the one user-facing change is that nested bold/italic (in the form of `*this is **a** test*` renders entirely as a series of italic blocks on Python Markdown 3.2. The working alternative is `*this is __a__ test*`. Tests were updated for this as well. Unfortunately, there doesn't seem to be a way to restore prior formatting, but hopefully this doesn't affect many people. See Python-Markdown/markdown#792 for the bold/italic changes in Python Markdown. Testing Done: Unit tests pass on all supported versions of Python, with all supported versions of Python Markdown. Reviewed at https://reviews.reviewboard.org/r/11611/

waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. invalid Invalid report (user error, upstream issue, etc). and removed confirmed Confirmed bug report or approved feature request. labels Feb 27, 2019

waylan closed this as completed Feb 27, 2019

facelessuser reopened this Feb 27, 2019

waylan added confirmed Confirmed bug report or approved feature request. and removed invalid Invalid report (user error, upstream issue, etc). labels Feb 27, 2019

facelessuser self-assigned this Mar 11, 2019

waylan pushed a commit that referenced this issue Sep 3, 2019

Refactor em strong to consolidate code and fix issue #792

ce04b81

facelessuser closed this as completed Sep 4, 2019

waylan mentioned this issue Oct 31, 2022

stand-alone * or _ lead to incorrect results #1300

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bold/Italic bug #792

Bold/Italic bug #792

Dave-ts commented Feb 27, 2019

waylan commented Feb 27, 2019 •

edited

Loading

facelessuser commented Feb 27, 2019

facelessuser commented Feb 27, 2019 •

edited

Loading

facelessuser commented Feb 27, 2019

waylan commented Feb 27, 2019 •

edited

Loading

Dave-ts commented Feb 27, 2019

waylan commented Feb 27, 2019

Dave-ts commented Feb 27, 2019 •

edited

Loading

facelessuser commented Feb 27, 2019

Dave-ts commented Feb 27, 2019

waylan commented Feb 27, 2019

waylan commented Feb 27, 2019

facelessuser commented Feb 27, 2019

facelessuser commented Mar 2, 2019

facelessuser commented Mar 2, 2019 •

edited

Loading

facelessuser commented Mar 3, 2019

waylan commented Mar 4, 2019

waylan commented Mar 4, 2019

facelessuser commented Mar 4, 2019

facelessuser commented Mar 4, 2019

waylan commented Mar 4, 2019

facelessuser commented Mar 4, 2019

facelessuser commented Mar 9, 2019

facelessuser commented Mar 9, 2019

waylan commented Mar 9, 2019

facelessuser commented Mar 12, 2019

facelessuser commented Sep 4, 2019

Bold/Italic bug #792

Bold/Italic bug #792

Comments

Dave-ts commented Feb 27, 2019

waylan commented Feb 27, 2019 • edited Loading

facelessuser commented Feb 27, 2019

facelessuser commented Feb 27, 2019 • edited Loading

facelessuser commented Feb 27, 2019

waylan commented Feb 27, 2019 • edited Loading

Dave-ts commented Feb 27, 2019

waylan commented Feb 27, 2019

Dave-ts commented Feb 27, 2019 • edited Loading

facelessuser commented Feb 27, 2019

Dave-ts commented Feb 27, 2019

waylan commented Feb 27, 2019

waylan commented Feb 27, 2019

facelessuser commented Feb 27, 2019

facelessuser commented Mar 2, 2019

facelessuser commented Mar 2, 2019 • edited Loading

facelessuser commented Mar 3, 2019

waylan commented Mar 4, 2019

waylan commented Mar 4, 2019

facelessuser commented Mar 4, 2019

facelessuser commented Mar 4, 2019

waylan commented Mar 4, 2019

facelessuser commented Mar 4, 2019

facelessuser commented Mar 9, 2019

facelessuser commented Mar 9, 2019

waylan commented Mar 9, 2019

facelessuser commented Mar 12, 2019

facelessuser commented Sep 4, 2019

waylan commented Feb 27, 2019 •

edited

Loading

facelessuser commented Feb 27, 2019 •

edited

Loading

waylan commented Feb 27, 2019 •

edited

Loading

Dave-ts commented Feb 27, 2019 •

edited

Loading

facelessuser commented Mar 2, 2019 •

edited

Loading