@facelessuser
Collaborator

@facelessuser facelessuser commented Jun 18, 2018

Serializer should only escape & in attributes if not part of &amp;. Ref #669

@facelessuser
Collaborator Author

@waylan, the Py33 tests are failing, most likely because Python 3.3 is end of life and no longer properly supported. Should we bother testing against it anymore?

@waylan
Member

waylan commented Jun 18, 2018

Yes, we should drop the PY33 tests.

Interesting "fix." While effective in addressing the issue raised in #669, I'm not sure that's what we should be doing. What test cases led you to conclude this is the correct behavior?

@facelessuser
Collaborator Author

What test cases led you to conclude this is the correct behavior?

Sure, consider these cases. In normal content we don't double-escape ampersands: if we have & it becomes &amp;, and if we have &amp; it remains &amp;. The only reason URLs don't get this treatment is that they are handled before we look for entities, since we need to essentially keep them raw.

I also compared it to Markdown.pl. It seems to use a general pattern, as something like &madeup; is treated the same way. Whether we should emulate that is something to consider. (I am aware that just because Markdown.pl behaves a certain way doesn't mean it's right.)

See examples here: https://johnmacfarlane.net/babelmark2/?text=%26%0A%0A%26amp%3B%0A%0A%5Btitle%5D(http%3A%2F%2Fexample.com%2F%3F%26test%26madeup%3B).

So, I think generally we are handling & in an inconsistent manner in URLs.

Now maybe that simply means that the behavior of & in a URL should be considered undefined. But let me show another case.

![this&amp;that](example.png)

Here this&amp;that is the user trying to do the right thing.

So now we get <p><img alt="this&amp;amp;that" src="example.png" /></p>, which is simply wrong.

Again this is because alt is handled in this case before entity normalization.
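
To make the difference concrete, here is a minimal sketch of naive versus entity-aware ampersand escaping. The function names and the regular expression are assumptions made for illustration; this is not the code in the Python-Markdown serializer.

```python
import re

# Assumed pattern: an "&" that is NOT followed by something that looks like
# an entity body (named, decimal, or hex) and a closing ";".
BARE_AMP_RE = re.compile(
    r'&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);)'
)


def naive_escape_amp(value):
    """Escape every "&", even one that already starts an entity."""
    return value.replace('&', '&amp;')


def entity_aware_escape_amp(value):
    """Escape only the "&" characters that do not start an entity-like token."""
    return BARE_AMP_RE.sub('&amp;', value)


print(naive_escape_amp('this&amp;that'))         # this&amp;amp;that
print(entity_aware_escape_amp('this&amp;that'))  # this&amp;that
print(entity_aware_escape_amp('this&that'))      # this&amp;that
```

Note that, like the general pattern discussed above, this sketch leaves something like &madeup; untouched because it has no knowledge of which entity names actually exist.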

The approach I'm taking now could very well be wrong, and I'm open to suggestions, but I do think we have a consistency problem here I'd like to solve. I have a new proposal to keep the discussion moving:

What if we "deamped" link attributes before we returned them and they went through the serializer? We could use the HTML lib to convert things like &amp; and &lt; back to & and <. Then let the serializer do what it always does: it would convert them back without double escaping, since we wouldn't allow things like &amp; in our URL to make it back to the serializer.
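
A rough sketch of that "deamp, then escape" idea, using only the standard library. The helper name deamp_then_escape is hypothetical, and html.escape stands in here for whatever the serializer actually does.

```python
import html


def deamp_then_escape(value):
    """Decode any existing entities, then escape the raw text exactly once."""
    raw = html.unescape(value)            # "&amp;" -> "&", "&lt;" -> "<"
    return html.escape(raw, quote=True)   # "&" -> "&amp;", "<" -> "&lt;"


print(deamp_then_escape('this&amp;that'))  # this&amp;that
print(deamp_then_escape('this&that'))      # this&amp;that
print(deamp_then_escape('a &lt; b'))       # a &lt; b
```

One caveat with this sketch: html.unescape follows HTML5 rules and may also decode some legacy entity names that lack a trailing semicolon, so a raw query string could be altered in ways the author did not intend.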

@facelessuser
Collaborator Author

facelessuser commented Jun 18, 2018

Considering that in Markdown content we use ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' to capture existing entities, I don't think the logic I placed in the serializer is much different. Maybe I should avoid \w so as not to include _.

We should possibly clean up ENTITY_RE as well: currently we treat &amp#; as valid, when # should probably only be allowed at the start.
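
The permissiveness is easy to check directly. The snippet below uses the ENTITY_RE string quoted above; the helper looks_like_entity is a name made up for this illustration.

```python
import re

# Pattern exactly as quoted in the comment above.
ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)'


def looks_like_entity(text):
    """Return True if the whole string matches ENTITY_RE."""
    return re.fullmatch(ENTITY_RE, text) is not None


# The character class allows "#" anywhere and "*" allows an empty body,
# so all of these are accepted, including the malformed ones.
for candidate in ('&amp;', '&#38;', '&amp#;', '&#;', '&;'):
    print(candidate, looks_like_entity(candidate))
```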

@waylan
Member

waylan commented Jun 18, 2018

I have a new proposal to keep the discussion moving:

What if we "deamped" link attributes before we returned them and they went through the serializer? We could use the HTML lib to convert things like &amp; and &lt; back to & and <. Then let the serializer do what it always does: it would convert them back without double escaping, since we wouldn't allow things like &amp; in our URL to make it back to the serializer.

I suppose that would work. The problem is that all third party extensions would need to never provide escaped strings in their output. I'm not comfortable with that. The serializer should just do the "right thing" with whatever it gets. My question was whether this fix is the "right thing" to do.

After thinking about this a little more and reviewing #270, I see we are encountering the same fundamental problem as there. The trick with these sorts of things is determining when to escape and when not to. If the user already escaped some text manually, then we shouldn't escape it a second time. But if the user has not escaped the string, then we should escape it. As pointed out, we already do this in plain text with amp escaping. So, yeah, this change brings consistency to the HTML attributes at least for amp escaping. I believe this may be the right approach after all.

@facelessuser
Collaborator Author

I'll clean up the regex as \w is incorrect in this case.

@facelessuser
Collaborator Author

Do we want to clean up our current ENTITY_RE to be a little less lazy? It's never going to be 100% spot on since it has no knowledge of what the actual entities are, but it should probably at least not match &; and &amp#;.
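
As one possible shape for a less lazy pattern, the sketch below requires a non-empty body and only allows # at the start. STRICTER_ENTITY_RE and is_entity_like are hypothetical names, and this is not necessarily the pattern that ended up in the project.

```python
import re

# Assumed stricter pattern: decimal reference, hex reference, or a name that
# starts with a letter. It still does not know which names are real entities.
STRICTER_ENTITY_RE = r'(&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);)'


def is_entity_like(text):
    """Return True if the whole string matches STRICTER_ENTITY_RE."""
    return re.fullmatch(STRICTER_ENTITY_RE, text) is not None


for candidate in ('&amp;', '&#38;', '&#x26;', '&madeup;', '&;', '&amp#;', '&#;'):
    print(candidate, is_entity_like(candidate))
```

With this pattern the first four candidates are accepted and the last three are rejected, while made-up names such as &madeup; still pass, which matches the limitation noted above.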

@waylan
Member

waylan commented Jun 18, 2018

Yeah, let's clean up ENTITY_RE. I'm wondering if we should add a few more tests while we're at it.

@facelessuser
Collaborator Author

Yeah, possibly some more tests would be good. I'll think about what to add.

@facelessuser
Collaborator Author

Most likely I won't get around to finishing this for a couple of weeks, as I'm about to go on vacation, but at least we've settled on an approach. I'll post a comment when I'm officially done with this.

@mitya57
Collaborator

mitya57 commented Jun 28, 2018

Yes, we should drop the PY33 tests.

I have now done this in master, hope that nobody minds.

@facelessuser
Collaborator Author

facelessuser commented Jul 24, 2018

I think I should be able to finish this up soon. I will rebase it off the latest changes to this area as I see the qname stuff has changed.

In general, we don't want to escape already escaped content, but with code content, we want literal representations of escaped content, so have code content explicitly escape its content before placing in AtomicStrings.
@facelessuser
Collaborator Author

I think I'm done with this.

@waylan waylan merged commit 59406c4 into Python-Markdown:master Jul 29, 2018