Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation of Treeprocessor: run() argument and stripTopLevelTags #949

Closed
nxg opened this issue Apr 26, 2020 · 12 comments
Closed

Documentation of Treeprocessor: run() argument and stripTopLevelTags #949

nxg opened this issue Apr 26, 2020 · 12 comments
Labels
bug Bug report. docs Related to the project documentation.

Comments

@nxg
Copy link

nxg commented Apr 26, 2020

One documentation glitch; one possible omission:

(a) The Treeprocessor documentation says that the run(root) method takes an ElementTree as argument. It appears (based on print("root={}".format(root))) that it is in fact given an Element (specifically a single <div> element). It also seems to demand an Element is returned, and not an ElementTree as documented.

(b) Within core.py, the Element returned from treeprocessor.run() (is first serialised and then) has its doc_tag removed if self.stripTopLevelTags is truthy. This self.stripTopLevelTags flag is not documented as far as I can see (googling for site:python-markdown.github.io striptopleveltags returns nothing), but setting it to False is essential if one wants to have a Treeprocessor restructure the tree.

I am of course reluctant to rely on any undocumented functionality.

(My motivation here was to have a Treeprocessor wrap the parsed tree in a html/body element: I thought I was getting something badly wrong until I found self.stripTopLevelTags and discovered that core.py was stripping what I'd added)

@waylan
Copy link
Member

waylan commented Apr 26, 2020

Ignoring Markdown for a moment, an ElementTree object only accepts a single child. The library assumes that all children are wrapped in a single parent element and that parent is the "root" or the only child of an ElementTree instance.

Conversely, Markdown can contain any number of elements at the root level. Therefore, we create a throwaway div element which contains the entire document. You should always only ever act on the children of the root div element.

As for why self.stripTopLevelTags exists and can potentially change the behavior, IIRC, it's there for historical reasons. We probably could have removed it in version 3.0 except that there is one possible use for it. The markdown.Markdown class was built with the idea that is could be subclassed. A subclass could implement a parser for something other than Markdown. Its always possible that such a subclass would need to retain the root element. For example, I believe restructured text requires a root element in the output.

However, for Markdown, no I don't expect you will ever find a reason for stripTopLevelTags to be different.

@nxg
Copy link
Author

nxg commented Apr 27, 2020

I appreciate why the single <div> element is there. But the documentation says that run receives and returns an ElementTree object, whereas in fact it receives and returns a (single) Element. I'm not saying that the library is behaving incorrectly, simply that the documentation doesn't seem to match it (unless there's a subtlety I'm missing). It's a minor documentation defect.

Responding to your final remark, in what I've written I need to set self.stripTopLevelTags to False. The very simple treeprocessor that I've put together depends on this: it wraps the passed <div> in a <html> (etc) Element, and returns that. Before I discovered self.stripTopLevelTags, the library was carefully undoing this (via string processing!), and leaving me with only the original <div> contents (to my great confusion).

Removing self.stripTopLevelTags would therefore be a significant loss of functionality from my point of view. Without it, the only way to output an HTML document is to manipulate the output of the Markdown library as text (ewwwww). Unless, again, there's some subtlety I'm missing.

@waylan
Copy link
Member

waylan commented Apr 27, 2020

the only way to output an HTML document is to manipulate the output of the Markdown library as text

This is what we expect to happen. If you are trying to output a complete HTML document, not a document fragment, then that is outside the scope of the Markdown library. I recommend using a templating library for that.

Regardless, if the documentation is in error, thanks for the report. I wonder if perhaps the issue has been addressed in the proposed changes in #946.

@waylan waylan added bug Bug report. docs Related to the project documentation. labels Apr 27, 2020
@waylan
Copy link
Member

waylan commented Apr 27, 2020

Regarding the actual issue, the documentation is not wrong from a certain perspective. The documentation refers to an "ElementTree object." In other words, an instance of an object from the ElementTree library. It usually would be an Element instance, but theoretically, it could be a Comment or something else (therefore we do not specifically say it is an Element object). The confusion stems from the fact that the ElementTree library also contains an object named ElementTree which you incorrectly assumed the documentation was referring to. However, given the context, that assumption is completely understandable.

So in the end, while it is not wrong, the documentation should be clarified.

And I just checked #946 and it appears that the author of that PR make the same mistake you did. They actually changed it so that it can only be interpreted incorrectly. In any event, I've noted the issue, so it should be resolved there.

@nxg
Copy link
Author

nxg commented Apr 27, 2020

Ah, I see what you mean!

But yes, the distinction between the ElementTree library and that library's ElementTree object is perhaps a little subtle, and it's very easy to read it the wrong way (backticks aren't clear enough to disambiguate it).

As a concrete suggestion, how about (for the Treeprocessors section:

over-ride the run method which takes one argument root (an object from the
ElementTree module, typically Element, but specifically not an ElementTree object),
and either modifies that root element
and returns None or returns a new ElementTree module object.

Regarding the other point, I'm no fan of templating libraries (largely because they manipulate XML as text). I initially expected the python-Markdown module to produce an ElementTree as output, rather than a string, or at least have that as an option. At the danger of turning a documentation issue into an enhancement request, it would be very useful if the module could optionally make available the result of self.parser.parseDocument(self.lines).getroot() in convert(), short-circuiting the subsequent treeprocessing and serialisation.

To be clear, what I would hope to receive from any parsing library is a parse tree. That python-Markdown can go ahead and serialise it is nice, but an extra. If I were planning to do any more with the parsed result than wrap-and-serialise (which at present I can very conveniently do with the Treeprocessor -- nice design), then the first thing I'd do is to re-parse the library's XML output.

@waylan
Copy link
Member

waylan commented Apr 27, 2020

I would hope to receive from any parsing library is a parse tree.

While understandable, that is not practical considering the post processors. Without those, which only run on the serialized text, the output would be incomplete. Note that at least one of the postprecossors is solely responsible for avoiding the serializer's escaping. Therefore, you should only ever use the final text as output.

We have explored other options. But they all required either using our own custom object or adding custom types to elementtree and altering the serializer to handle them. This has been discussed in detail elsewhere.

@nxg
Copy link
Author

nxg commented Apr 27, 2020

Ah, I see – the parser stashes found literal HTML into the text attribute of Element objects, using a private index of some sort, so that the ElementTree is in principle serialisable only by this module, so that it would not in general be useful to pass the tree out to the caller. I hadn't realised ElementTree blessed storing arbitrary objects there.

All that said, it is still useful in at least some cases (eg, mine!) to do some tree-manipulation in the Treeprocessor step, so I hope that you won't remove the self.stripTopLevelTags flag.

@waylan
Copy link
Member

waylan commented Apr 27, 2020

I hadn't realised ElementTree blessed storing arbitrary objects there.

It doesn't. We are just using plain text strings delimitated with uncommon Unicode characters for our index keys. Those Unicode characters ensure the strings are unique so we can find them easily for replacement after serialization. ElementTree doesn't care as they are just valid text.

@merriam
Copy link
Contributor

merriam commented May 2, 2020

Hmmm... @nxg, would a post processor work for you? Can you tell me more about what are you trying to accomplish?

I'll figure out how to document the run mistake. There will, someday, be a section discussing the finer points of what is and isn't guaranteed.

@nxg
Copy link
Author

nxg commented May 2, 2020

Hello @merriam. I'm doing something very simple: I'm just converting Markdown to standalone XHTML. I certainly could achieve this by printing out unbalanced XML fragments before and after the string I get from python-Markdown (I'd feel... dirty, but it wouldn't kill me).

I was very pleased to discover the Treeprocessor in the python-Markdown docs (this is the first time I've used this library). That way, I can do essentially all the application processing inside a Treeprocessor (this processing is almost trivial in my present case: simply wrapping the div element inside the rest of the XHTML). But it would have to be done inside there, because the way it's written, it appears that the tree that python-Markdown creates can be serialised only by python-Markdown, because of the self.md.htmlStash (which I appreciate is there because it solves other problems). That design suggests – though I haven't explored this – that it'd be difficult to do anything with the parse-tree that involved looking at text content; but that's relatively rare: shaking the tree is more common. Also, it'd be hard to take fragments of markdown from multiple sources and combine them into a separate document (one would have to either pass round chunks of serialised XML, or re-parse the serialised XML).

This is not a complaint; I'm describing only how I use the library, and not suggesting that you should change things. Python-Markdown is advertised as a markdown-to-serialised-string library, rather than a markdown-to-parse-tree library, and its design decisions are in the service of that goal (I'm not a particular fan of ElementTree, by the way – any parse-tree would do).

But that's OK: it does what I want in this case as long as I can set self.stripTopLevelTags to False.

I hope these remarks are helpful.

@waylan
Copy link
Member

waylan commented May 2, 2020

it appears that the tree that python-Markdown creates can be serialised only by python-Markdown, because of the self.md.htmlStash

Actually that has no bearing on things. As mentioned previously, the placeholders are just plain text and they are swapped out after the tree has been serialized to a string. In fact, we used to use ElementTree's built-in HTML serializer. However, there were a few instances where the general purpose sterilizer needed to be tweaked. In the end, it was easier to maintain our own simplified fork.

waylan added a commit to merriam/markdown that referenced this issue May 7, 2020
@waylan
Copy link
Member

waylan commented May 7, 2020

This has been fixed in 88f0b77 as part of #946.

@waylan waylan closed this as completed in 375db8c May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. docs Related to the project documentation.
Projects
None yet
Development

No branches or pull requests

3 participants