Completes pull #1043 which fixes #1040 #1044

facelessuser · 2020-10-15T01:42:01Z

Completes the tests in #1043 which fixes #1040

waylan · 2020-10-15T13:30:15Z

Looks good. There is one more test case we should probably include:

<div markdown="1">*foo*</div><div markdown="1">*bar*</div>

Specifically a markdown="1" element in the tail of a markdown="1" element. That should render as:

<div><p><em>foo</em></p></div>
<div><p><em>bar</em></p></div>

I suspect that might cover the scenario there is no test coverage for in your patch.

facelessuser · 2020-10-15T13:48:59Z

Yeah, I'll add that test. I just wanted to make sure the direction was fine first.

facelessuser · 2020-10-15T14:16:48Z

I will most likely address #1045 here as well. I'll just make this the "fix HTML parsing" pull.

waylan · 2020-10-15T14:21:18Z

Your approach for handling tails is a little different than mine. In the core parser, I track the tail by setting self.intail at the end of a block...

markdown/markdown/htmlparser.py

Lines 150 to 155 in 607a091

    
           if blank_line_re.match(self.rawdata[self.line_offset + self.offset + len(text):]): 
        
               # Preserve blank line and end of raw block. 
        
               self._cache.append('\n') 
        
           else: 
        
               # More content exists after endtag. 
        
               self.intail = True

... and then checking the status of self.intail at the start of the next one.

markdown/markdown/htmlparser.py

Lines 123 to 126 in 607a091

    
           if self.md.is_block_level(tag) and (self.intail or (self.at_line_start() and not self.inraw)): 
        
               # Started a new raw block. Prepare stack. 
        
               self.inraw = True 
        
               self.cleandoc.append('\n')

The status gets reset on the first newline.

markdown/markdown/htmlparser.py

Lines 166 to 167 in 607a091

    
           if self.intail and '\n' in data: 
        
               self.intail = False

However, you are simply adding a blank line after the end of the block, which is a simpler solution. I don't recall why I didn't use that solution myself. I might not have considered it, or perhaps it was mucking up some edge cases. In any event, in the md_in_html extension I'm less concerned with altering insignificant whitespace than in the default behavior.

One thing I do wonder about is what if the blank lines you are checking for also contain whitespace. For example, you check for '\n\n', but what if instead the document contains '\n \n '? Maybe if you use blank_line_re as we do in markdown.htmlparser (see L150 above), that will address that concern.

waylan · 2020-10-15T14:31:27Z

I will most likely address #1045 here as well. I'll just make this the "fix HTML parsing" pull.

Fine by me. Regarding that issue. I believe the problem is that I never really fully tested a mix of markdown="1" and regular raw HTML (it was either all one or all the other). If you have the time to go through the extension's tests, please feel free to extend them to fully test the interaction between the two.

facelessuser · 2020-10-15T15:21:08Z

I'm going to have to work things out on md_in_html

We aren't tracking inraw properly, but when we do, and a block element with markdown="1" is being processed, we have trouble getting the children in order if they have mixed content.

We seem to store raw HTML data in separate structures as other data. I've got some tricky things to work through.

facelessuser · 2020-10-15T16:39:13Z

We are going to probably need to refactor treebuilder logic along with how we store all the HTML data in md_in_html.

The base class doesn't use treebuilder, but md_in_html does. When we weren't updating inraw correctly, everything went to treebuilder, but that caused escaping of raw HTML in some cases. But when we update inraw correctly, now some data gets stored in treebuilder and some data doesn't, and they get synced up all wrong in the parent element.

So fairly good refactor is in order to handle the complexity of this nested content. We may have to override more of the base class in md_in_html than we originally wanted, not sure yet.

This definitely won't be as quick a fix as I hoped.

waylan · 2020-10-15T17:42:52Z

The order of things in MarkdownInHtmlProcessor(BlockProcessor) is very important so as not to reorder elements. In addition to the many comments documenting why the specific order is used, notice a few calls to children.reverse(). For example here:

markdown/markdown/extensions/md_in_html.py

Lines 230 to 232 in 607a091

    
           children.reverse() 
        
           for child in children: 
        
               element.insert(0, child)

However, it sounds like you are suggesting that some elements are being reordered before we ever get that far. Not sure how that is happening.

facelessuser · 2020-10-15T18:05:38Z

It is only happening when I try to fix md_in_html's issue with #1045. The HTML content is being escaped because it isn't recognized as "raw" HTML. When I properly assign inraw the correct value (which doesn't happen in md_in_html it fixes the escape issue, but then reorders things incorrectly.

We keep processing HTML under a block element with markdown attribute, but we only have 1 treebuilder. Maybe I'm making it more complicated than it is.

facelessuser · 2020-10-15T18:36:23Z

Okay, I get what treebuilder does now, we were processing start and end tags with treebuilder.data() when we should have used treebuilder.start() or treebuilder.end(). I got the #1045 case fixed.

I slightly altered the case because I wanted to make sure Markdown parsed under the markdown="1" element.

import markdown
md = markdown.Markdown(extensions=["markdown.extensions.md_in_html"])
test = """
<div markdown="1">
**test**
<div>
**test**
<img src=""/>
<code>Test</code>
<span>**test**</span>
<p>Test 2</p>
</div>
</div>
"""
print(md.convert(test))

Results:

<div>
<p><strong>test</strong></p>
<div>
**test**
<img src=""/>
<code>Test</code>
<span>**test**</span>
<p>Test 2</p>
</div>
</div>

Unfortunately, it broke other cases 🤦 .

…re coverage

facelessuser · 2020-10-15T19:45:40Z

I have a working fix now. I'm not convinced the md_in_html parser is quite operating as intended, or maybe not quite how I expected compared to the base parser, but I've isolated some fixes that work without breaking everything else.

I'll still need more coverage cases, and I'd like to run some checks on some more corner cases, but this is a good start I think.

There shouldn't be elements in the cleandoc list. Those may exist in the stash, but won't turn up directly in cleandoc.

facelessuser · 2020-10-15T21:50:36Z

Coverage is now being met.

Before merging, I'd like to do some more investigation. I've flagged this pull as a "work in progress" until I've had a chance to gain more confidence in this area. I think I'm quickly coming up to speed on how all the inner workings come together in the new parser.

waylan · 2020-10-17T00:35:11Z

I finally had a chance to look at this. Looking good. 👍

facelessuser · 2020-10-17T14:21:48Z

@waylan So, I added a whitespace case that mixes newlines with spaces. I can't see a need (at least in md_in_html) to add the regex for newlines. Am I missing some kind of newline case? I'm just thinking that due to the way things get processed, this appears to be a non-issue.

waylan · 2020-10-17T19:47:09Z

That test seems fine to me. If that’s passing as-is, then that should be fine.

_{Sent with GitHawk}

facelessuser · 2020-10-18T01:12:24Z

Cool. I'll try and wrap up my testing by the end of the weekend. If I'm not done by then. I'm probably done anyways 🙃. From what I can tell, the big holes seemed filled now.

facelessuser · 2020-10-18T16:56:11Z

I'm going to call this done for now. I'm generally feeling good about the changes. They solve the current issues and don't break existing tests. I can't think of specific cases that I'm not covering, but if more are found, I gladly take another look.

Probably won't have much more time to look at this over the weekend, and I don't want to hold up a fix.

waylan and others added 3 commits October 14, 2020 16:07

Bring md_in_html behavior inline with base.

80d97b8

Fix for no line break case.

dcada15

Fix remaining tests

0827498

facelessuser mentioned this pull request Oct 15, 2020

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

Closed

greenape mentioned this pull request Oct 15, 2020

Image parsing failed greenape/mknotebooks#90

Closed

waylan mentioned this pull request Oct 15, 2020

Bring md_in_html behavior inline with base. #1043

Closed

Solution for Python-Markdown#1045 case and add additional test for mo…

1acef7e

…re coverage

Increase coverage and remove unnecessary code

938cfc5

There shouldn't be elements in the cleandoc list. Those may exist in the stash, but won't turn up directly in cleandoc.

facelessuser added the work-in-progress A partial solution. More changes will be coming. label Oct 15, 2020

facelessuser mentioned this pull request Oct 15, 2020

Inline elements getting parsed as text when doubly nested #1045

Closed

Add whitespace case

4a8efa6

facelessuser removed the work-in-progress A partial solution. More changes will be coming. label Oct 18, 2020

waylan merged commit 2766698 into Python-Markdown:master Oct 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Completes pull #1043 which fixes #1040 #1044

Completes pull #1043 which fixes #1040 #1044

facelessuser commented Oct 15, 2020

waylan commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

waylan commented Oct 15, 2020

waylan commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

waylan commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020 •

edited

facelessuser commented Oct 15, 2020 •

edited

waylan commented Oct 17, 2020

facelessuser commented Oct 17, 2020

waylan commented Oct 17, 2020

facelessuser commented Oct 18, 2020

facelessuser commented Oct 18, 2020

Completes pull #1043 which fixes #1040 #1044

Completes pull #1043 which fixes #1040 #1044

Conversation

facelessuser commented Oct 15, 2020

waylan commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

waylan commented Oct 15, 2020

waylan commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

waylan commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020 • edited

facelessuser commented Oct 15, 2020 • edited

waylan commented Oct 17, 2020

facelessuser commented Oct 17, 2020

waylan commented Oct 17, 2020

facelessuser commented Oct 18, 2020

facelessuser commented Oct 18, 2020

facelessuser commented Oct 15, 2020 •

edited

facelessuser commented Oct 15, 2020 •

edited