-
Notifications
You must be signed in to change notification settings - Fork 894
Fix double escaping of amp in attributes #670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Serializer should only escape & in attributes if not part of & Ref #669
|
@waylan, Py33 tests are failing, most likely due to it not being supported properly anymore as it is end of life. Should we bother testing for it anymore? |
|
Yes, we should drop the PY33 tests. Interesting "fix." While effective in addressing the issue raised in #669, I'm not sure that's what we should be doing. What test cases led you to conclude this is the correct behavior? |
Sure, consider these cases. In normal content we don't double escape amps. If we have I also compared it to Markdown.pl. It seemed to use a general pattern as something like See examples here: https://johnmacfarlane.net/babelmark2/?text=%26%0A%0A%26amp%3B%0A%0A%5Btitle%5D(http%3A%2F%2Fexample.com%2F%3F%26test%26madeup%3B). So, I think generally we are handling Now maybe that simply means that the behavior of
So now we get Again this is because The approach I'm taking now could very well be wrong, and I'm open to suggestions, but I do think we have a consistency problem here I'd like to solve. I have a new proposal to keep the discussion moving: What if we "deamped" link attributes before we returned them and they went through the serializaer? We could use the HTML lib to convert things like |
|
Considering that in Markdown content we use It is possible we should clean up |
I suppose that would work. The problem is that all third party extensions would need to never provide escaped strings in their output. I'm not comfortable with that. The serializer should just do the "right thing" with whatever it gets. My question was whether this fix is the "right thing" to do. After thinking about this a little more and reviewing #270, I see we are encountering the same fundamental problem as there. The trick with these sorts of things is determining when to escape and when not to. If the user already escaped some text manually, then we shouldn't escape it a second time. But if the user has not escaped the string, then we should escape it. As pointed out, we already do this in plain text with amp escaping. So, yeah, this change brings consistency to the HTML attributes at least for amp escaping. I believe this may be the right approach after all. |
|
I'll clean up the regex as |
|
Do we want to clean up our current |
Avoid Unicode and `_` in amp detection.
|
Yeah, let's clean up |
|
Yeah, possibly some more tests would be good. I'll think about what to add. |
|
Most likely I won't get around to finish this for a couple weeks as I'm about to go on vacation, but at least we settled on an approach. I'll post a comment when I'm officially done with this. |
I have now done this in master, hope that nobody minds. |
|
I think I should be able to finish this up soon. I will rebase it off the latest changes to this area as I see the qname stuff has changed. |
In general, we don't want to escape already escaped content, but with code content, we want literal representations of escaped content, so have code content explicitly escape its content before placing in AtomicStrings.
Test already esca
|
I think I'm done with this. |
Serializer should only escape
&in attributes if not part of&Ref #669