Fix double escaping of amp in attributes #670

facelessuser · 2018-06-18T01:18:25Z

Serializer should only escape & in attributes if not part of &amp;. Ref #669

facelessuser · 2018-06-18T01:25:52Z

@waylan, the Py33 tests are failing, most likely because Python 3.3 is end-of-life and no longer properly supported. Should we bother testing against it anymore?

waylan · 2018-06-18T11:56:08Z

Yes, we should drop the PY33 tests.

Interesting "fix." While effective in addressing the issue raised in #669, I'm not sure that's what we should be doing. What test cases led you to conclude this is the correct behavior?

facelessuser · 2018-06-18T13:12:35Z

> What test cases led you to conclude this is the correct behavior?

Sure, consider these cases. In normal content we don't double-escape ampersands: if we have & it becomes &amp;, and if we have &amp; it remains &amp;. The only reason URLs don't get this treatment is that they are handled before we look for entities, since we essentially need to keep them raw.

I also compared it to Markdown.pl. It seems to use a general pattern, as even something like &madeup; is treated as an entity and left alone. Whether we should emulate that is something to consider. (I am aware that just because Markdown.pl behaves a certain way doesn't mean it's right.)

See examples here: https://johnmacfarlane.net/babelmark2/?text=%26%0A%0A%26amp%3B%0A%0A%5Btitle%5D(http%3A%2F%2Fexample.com%2F%3F%26test%26madeup%3B).

So, I think generally we are handling & in an inconsistent manner in URLs.

Now maybe that simply means that the behavior of & in a URL should be considered undefined. But let me show another case.

![this&amp;that](example.png)

this&amp;that is the user trying to do the right thing.

So now we get <p><img alt="this&amp;amp;that" src="example.png" /></p> which is simply wrong.

Again this is because alt is handled in this case before entity normalization.
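To make the failure concrete, here is a minimal sketch of a serialization step that escapes every ampersand unconditionally. The helper name `naive_escape_attr` is hypothetical and only stands in for the behavior described above; it is not the actual Python-Markdown serializer code.

```python
def naive_escape_attr(value):
    """Escape every ampersand, even one that already starts an entity."""
    return value.replace("&", "&amp;")


# The author already escaped the ampersand in the alt text.
alt = "this&amp;that"

# The unconditional replace escapes it a second time.
print(naive_escape_attr(alt))  # this&amp;amp;that
```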

The approach I'm taking now could very well be wrong, and I'm open to suggestions, but I do think we have a consistency problem here I'd like to solve. I have a new proposal to keep the discussion moving:

What if we "deamped" link attributes before we returned them and they went through the serializer? We could use the HTML lib to convert things like &amp; and &lt; back to & and <. Then let the serializer do what it always does: it would convert them back without double escaping them, since we wouldn't allow things like &amp; in our URL to make it back to the serializer.
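A rough sketch of that "deamp" idea, assuming Python's standard `html` module is the "HTML lib" meant here. Both function names are hypothetical, and `serialize_attr` is only a stand-in for whatever escaping the real serializer performs.

```python
import html


def deamp(value):
    """Convert entities such as &amp; and &lt; back to raw characters."""
    return html.unescape(value)


def serialize_attr(value):
    """Stand-in for the serializer: escape raw characters exactly once."""
    return html.escape(value, quote=True)


# Escaped and unescaped input end up with the same single-escaped output.
print(serialize_attr(deamp("http://example.com/?a=1&amp;b=2")))
print(serialize_attr(deamp("http://example.com/?a=1&b=2")))
```

The catch with this design is that it only works if every producer of attribute values runs the unescape step before the value reaches the serializer.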

facelessuser · 2018-06-18T14:06:08Z

Considering that in Markdown content we use ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' to capture existing entities, I don't think the logic I placed in the serializer is much different. Maybe I should avoid \w so as not to include _.

It is possible we should clean up ENTITY_RE, as currently we'll treat &amp#; as valid when # should probably only count at the start.

waylan · 2018-06-18T14:13:07Z

> I have a new proposal to keep the discussion moving:
>
> What if we "deamped" link attributes before we returned them and they went through the serializer? We could use the HTML lib to convert things like &amp; and &lt; back to & and <. Then let the serializer do what it always does: it would convert them back without double escaping them, since we wouldn't allow things like &amp; in our URL to make it back to the serializer.

I suppose that would work. The problem is that all third party extensions would need to never provide escaped strings in their output. I'm not comfortable with that. The serializer should just do the "right thing" with whatever it gets. My question was whether this fix is the "right thing" to do.

After thinking about this a little more and reviewing #270, I see we are encountering the same fundamental problem as there. The trick with these sorts of things is determining when to escape and when not to. If the user already escaped some text manually, then we shouldn't escape it a second time. But if the user has not escaped the string, then we should escape it. As pointed out, we already do this in plain text with amp escaping. So, yeah, this change brings consistency to the HTML attributes at least for amp escaping. I believe this may be the right approach after all.
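The rule described here (escape a bare ampersand, leave one that already begins an entity) can be sketched with a negative lookahead. The pattern and the names `RE_AMP` and `escape_attr_amp` are assumptions made for illustration, not necessarily what was merged; the pattern sticks to ASCII letters and digits so that Unicode word characters and `_` are not treated as part of an entity.

```python
import re

# Match "&" only when it is NOT followed by something shaped like an entity:
# a decimal reference (#38), a hex reference (#x26), or an ASCII name (amp).
RE_AMP = re.compile(r'&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z0-9]+);)')


def escape_attr_amp(value):
    """Escape bare ampersands while leaving existing entities untouched."""
    return RE_AMP.sub('&amp;', value)


print(escape_attr_amp('this&that'))      # bare ampersand gets escaped
print(escape_attr_amp('this&amp;that'))  # existing entity is left alone
```

Like the content pattern discussed in this thread, this has no knowledge of which named entities actually exist, so something like &madeup; is also left as is.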

facelessuser · 2018-06-18T14:22:22Z

I'll clean up the regex as \w is incorrect in this case.

facelessuser · 2018-06-18T14:27:02Z

Do we want to clean up our current ENTITY_RE to be a little less lazy? It's never going to be 100% spot on since it has no knowledge of what actual entities are, but it should probably at least not match &; and &amp#;.
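For comparison, here is the lazy pattern quoted earlier in the thread next to one possible stricter version. `STRICT_ENTITY_RE` is an illustrative assumption (including its handling of hexadecimal references) and may differ from the pattern that was actually committed.

```python
import re

# The existing pattern: any run of "#", letters, or digits, possibly empty.
LAZY_ENTITY_RE = re.compile(r'(&[\#a-zA-Z0-9]*;)')

# Stricter sketch: "#" may only appear at the start, and the body is required.
STRICT_ENTITY_RE = re.compile(
    r'(&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z0-9]+);)'
)

for sample in ('&amp;', '&#38;', '&;', '&amp#;'):
    lazy = bool(LAZY_ENTITY_RE.fullmatch(sample))
    strict = bool(STRICT_ENTITY_RE.fullmatch(sample))
    print(f'{sample!r}: lazy={lazy} strict={strict}')
```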

waylan · 2018-06-18T15:19:00Z

Yeah, let's clean up ENTITY_RE. I'm wondering if we should add a few more tests while we're at it.

facelessuser · 2018-06-18T15:23:09Z

Yeah, possibly some more tests would be good. I'll think about what to add.

facelessuser · 2018-06-18T21:29:20Z

Most likely I won't get around to finishing this for a couple of weeks as I'm about to go on vacation, but at least we settled on an approach. I'll post a comment when I'm officially done with this.

mitya57 · 2018-06-28T20:47:14Z

> Yes, we should drop the PY33 tests.

I have now done this in master, hope that nobody minds.

facelessuser · 2018-07-24T15:01:17Z

I think I should be able to finish this up soon. I will rebase it off the latest changes to this area as I see the qname stuff has changed.

facelessuser · 2018-07-29T14:49:32Z

I think I'm done with this.

Fix double escaping of amp in attributes

2262418

Serializer should only escape & in attributes if not part of &amp;. Ref #669

Account for other entities

bb9d199

facelessuser mentioned this pull request Jun 18, 2018

Double escaping of ampersand in URLs #669

Closed

facelessuser added 3 commits June 18, 2018 08:29

Better regex for amp serialization

cded005

Avoid Unicode and `_` in amp detection.

Test files should start with test_ for discovery

8cb8104

Drop PY33 testing

03390ed

Less lazy entity pattern for general content

8eda0cc

facelessuser added 3 commits July 28, 2018 08:22

Merge branch 'master' into amp-escape

c5d36d8

Handle code content special

5d0bf2e

In general, we don't want to escape already escaped content, but with code content, we want literal representations of escaped content, so have code content explicitly escape its content before placing in AtomicStrings.

Test already escaped chars in qname

b38d040

Test already esca

waylan merged commit 59406c4 into Python-Markdown:master Jul 29, 2018

mitya57 mentioned this pull request Sep 25, 2018

Hexadecimal HTML entities are not rendered #712

Closed
