Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completes pull #1043 which fixes #1040 #1044

Merged
merged 6 commits into from Oct 19, 2020

Conversation

facelessuser
Copy link
Collaborator

Completes the tests in #1043 which fixes #1040

@waylan
Copy link
Member

waylan commented Oct 15, 2020

Looks good. There is one more test case we should probably include:

<div markdown="1">*foo*</div><div markdown="1">*bar*</div>

Specifically a markdown="1" element in the tail of a markdown="1" element. That should render as:

<div><p><em>foo</em></p></div>
<div><p><em>bar</em></p></div>

I suspect that might cover the scenario there is no test coverage for in your patch.

@facelessuser
Copy link
Collaborator Author

Yeah, I'll add that test. I just wanted to make sure the direction was fine first.

@facelessuser
Copy link
Collaborator Author

I will most likely address #1045 here as well. I'll just make this the "fix HTML parsing" pull.

@waylan
Copy link
Member

waylan commented Oct 15, 2020

Your approach for handling tails is a little different than mine. In the core parser, I track the tail by setting self.intail at the end of a block...

if blank_line_re.match(self.rawdata[self.line_offset + self.offset + len(text):]):
# Preserve blank line and end of raw block.
self._cache.append('\n')
else:
# More content exists after endtag.
self.intail = True

... and then checking the status of self.intail at the start of the next one.

if self.md.is_block_level(tag) and (self.intail or (self.at_line_start() and not self.inraw)):
# Started a new raw block. Prepare stack.
self.inraw = True
self.cleandoc.append('\n')

The status gets reset on the first newline.

if self.intail and '\n' in data:
self.intail = False

However, you are simply adding a blank line after the end of the block, which is a simpler solution. I don't recall why I didn't use that solution myself. I might not have considered it, or perhaps it was mucking up some edge cases. In any event, in the md_in_html extension I'm less concerned with altering insignificant whitespace than in the default behavior.

One thing I do wonder about is what if the blank lines you are checking for also contain whitespace. For example, you check for '\n\n', but what if instead the document contains '\n \n '? Maybe if you use blank_line_re as we do in markdown.htmlparser (see L150 above), that will address that concern.

@waylan
Copy link
Member

waylan commented Oct 15, 2020

I will most likely address #1045 here as well. I'll just make this the "fix HTML parsing" pull.

Fine by me. Regarding that issue. I believe the problem is that I never really fully tested a mix of markdown="1" and regular raw HTML (it was either all one or all the other). If you have the time to go through the extension's tests, please feel free to extend them to fully test the interaction between the two.

@facelessuser
Copy link
Collaborator Author

I'm going to have to work things out on md_in_html

We aren't tracking inraw properly, but when we do, and a block element with markdown="1" is being processed, we have trouble getting the children in order if they have mixed content.

We seem to store raw HTML data in separate structures as other data. I've got some tricky things to work through.

@facelessuser
Copy link
Collaborator Author

We are going to probably need to refactor treebuilder logic along with how we store all the HTML data in md_in_html.

The base class doesn't use treebuilder, but md_in_html does. When we weren't updating inraw correctly, everything went to treebuilder, but that caused escaping of raw HTML in some cases. But when we update inraw correctly, now some data gets stored in treebuilder and some data doesn't, and they get synced up all wrong in the parent element.

So fairly good refactor is in order to handle the complexity of this nested content. We may have to override more of the base class in md_in_html than we originally wanted, not sure yet.

This definitely won't be as quick a fix as I hoped.

@waylan
Copy link
Member

waylan commented Oct 15, 2020

The order of things in MarkdownInHtmlProcessor(BlockProcessor) is very important so as not to reorder elements. In addition to the many comments documenting why the specific order is used, notice a few calls to children.reverse(). For example here:

children.reverse()
for child in children:
element.insert(0, child)

However, it sounds like you are suggesting that some elements are being reordered before we ever get that far. Not sure how that is happening.

@facelessuser
Copy link
Collaborator Author

It is only happening when I try to fix md_in_html's issue with #1045. The HTML content is being escaped because it isn't recognized as "raw" HTML. When I properly assign inraw the correct value (which doesn't happen in md_in_html it fixes the escape issue, but then reorders things incorrectly.

We keep processing HTML under a block element with markdown attribute, but we only have 1 treebuilder. Maybe I'm making it more complicated than it is.

@facelessuser
Copy link
Collaborator Author

Okay, I get what treebuilder does now, we were processing start and end tags with treebuilder.data() when we should have used treebuilder.start() or treebuilder.end(). I got the #1045 case fixed.

I slightly altered the case because I wanted to make sure Markdown parsed under the markdown="1" element.

import markdown
md = markdown.Markdown(extensions=["markdown.extensions.md_in_html"])
test = """
<div markdown="1">
**test**
<div>
**test**
<img src=""/>
<code>Test</code>
<span>**test**</span>
<p>Test 2</p>
</div>
</div>
"""
print(md.convert(test))

Results:

<div>
<p><strong>test</strong></p>
<div>
**test**
<img src=""/>
<code>Test</code>
<span>**test**</span>
<p>Test 2</p>
</div>
</div>

Unfortunately, it broke other cases 🤦 .

@facelessuser
Copy link
Collaborator Author

facelessuser commented Oct 15, 2020

I have a working fix now. I'm not convinced the md_in_html parser is quite operating as intended, or maybe not quite how I expected compared to the base parser, but I've isolated some fixes that work without breaking everything else.

I'll still need more coverage cases, and I'd like to run some checks on some more corner cases, but this is a good start I think.

There shouldn't be elements in the cleandoc list. Those may exist in
the stash, but won't turn up directly in cleandoc.
@facelessuser facelessuser added the work-in-progress A partial solution. More changes will be coming. label Oct 15, 2020
@facelessuser
Copy link
Collaborator Author

facelessuser commented Oct 15, 2020

Coverage is now being met.

Before merging, I'd like to do some more investigation. I've flagged this pull as a "work in progress" until I've had a chance to gain more confidence in this area. I think I'm quickly coming up to speed on how all the inner workings come together in the new parser.

@waylan
Copy link
Member

waylan commented Oct 17, 2020

I finally had a chance to look at this. Looking good. 👍

@facelessuser
Copy link
Collaborator Author

@waylan So, I added a whitespace case that mixes newlines with spaces. I can't see a need (at least in md_in_html) to add the regex for newlines. Am I missing some kind of newline case? I'm just thinking that due to the way things get processed, this appears to be a non-issue.

@waylan
Copy link
Member

waylan commented Oct 17, 2020

That test seems fine to me. If that’s passing as-is, then that should be fine.

Sent with GitHawk

@facelessuser
Copy link
Collaborator Author

Cool. I'll try and wrap up my testing by the end of the weekend. If I'm not done by then. I'm probably done anyways 🙃. From what I can tell, the big holes seemed filled now.

@facelessuser facelessuser removed the work-in-progress A partial solution. More changes will be coming. label Oct 18, 2020
@facelessuser
Copy link
Collaborator Author

I'm going to call this done for now. I'm generally feeling good about the changes. They solve the current issues and don't break existing tests. I can't think of specific cases that I'm not covering, but if more are found, I gladly take another look.

Probably won't have much more time to look at this over the weekend, and I don't want to hold up a fix.

@waylan waylan merged commit 2766698 into Python-Markdown:master Oct 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object
2 participants