Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaped HTML tags are "un-escaped" when rendering HTML #52

Closed
sh-at-cs opened this issue Feb 19, 2024 · 8 comments · Fixed by #54
Closed

Escaped HTML tags are "un-escaped" when rendering HTML #52

sh-at-cs opened this issue Feb 19, 2024 · 8 comments · Fixed by #54

Comments

@sh-at-cs
Copy link

sh-at-cs commented Feb 19, 2024

Consider:

from mjml import mjml_to_html

result = mjml_to_html("""
<mjml>
  <mj-body>
    <mj-text>
      Pretty unsafe: &lt;script&gt;
    </mj-text>
  </mj-body>
</mjml>""")

print(result["html"])

In the resulting HTML output, the formerly HTML-escaped < (&lt;) and > (&gt) are "un-escaped", so the rendered HTML actually contains Pretty unsafe: <script>.

Why does this happen? This reverses the user's safety measures and can be dangerous.

The MJML reference implementation doesn't do this and correctly keeps such escape sequences untouched: https://mjml.io/try-it-live/fvvhZhdu9V

@sh-at-cs sh-at-cs changed the title Escaped HTML tags are "un-escaped" when rendering to HTML Escaped HTML tags are "un-escaped" when rendering HTML Feb 19, 2024
@sh-at-cs
Copy link
Author

sh-at-cs commented Feb 19, 2024

I tracked it down to the changes made in 84c495d (#45), where the formatter argument to BeautifulSoup's decode_contents was explicitly set to None. If it is set to "minimal" instead (the default and prior value), HTML escape sequences are left as they are.

But it seems like (part of) the whole point of #45 was to not leave HTML entities as they are so as to fix #44...

@caseyjhol Care to comment? 🙂

@caseyjhol
Copy link
Contributor

caseyjhol commented Feb 19, 2024

Ah I could've sworn I tested and accounted for this scenario, but clearly I missed the mark. I think the better approach then might be to limit the scope to mjStyle rather than all HTML.

@sh-at-cs
Copy link
Author

@caseyjhol Thanks for the quick response! You're right, you did add an example of a very similar situation:

&lt; Hello World &gt;

<div style="font-family:helvetica;font-size:20px;line-height:1;text-align:left;color:#F45E43;"><span style="border: 3px solid blue;"> &lt; Hello World &gt; </span></div>

But as it turns out, the entities are in fact "un-escaped" during rendering of the test MJML file as described in this issue - the reason the test passes anyway is that the space between &lt; and Hello makes HTMLCompare consider them equal to < Hello (the rendered output). Minimal example:

# no exception:
htmlcompare.assert_same_html("&lt; script &gt;", "< script >")
# exception:
htmlcompare.assert_same_html("&lt;script &gt;", "<script >")

I haven't checked HTMLCompare's code, but I think this is probably because putting a space between < and the tag name makes it stop being an HTML tag, so the < is in fact equivalent to &lt; in this context.

But that is exactly the case that people don't need to guard against with sanitization/escaping. So the fix for this issue should add another test case (or extend this one) with an escaped HTML tag like &lt;script&gt;.

@FelixSchwarz
Copy link
Owner

@sh-at-cs Thank you for reporting this issue, including the detailed analysis. I think this is a serious issue which we need to fix. I'll try to spend some time on this later - either on reviewing a solution or trying to fix this myself. When we merged #45 somehow the security implications completely escaped my attention.

@FelixSchwarz
Copy link
Owner

Unfortunately all could do yesterday was to add the test case in a new branch fix-escaped-html-tags.

How should we fix this issue? I see two approaches (just brainstorming here):

  • contents of <mj-style> should be kept as-is
  • possibly parse the contents using tinycss2 to ensure this is really just CSS

Would that solve the issue if we also remove the formatter=None? As far as I could see, css_inline will keep the escaped sequences (as it should).

@caseyjhol
Copy link
Contributor

Going to try to dig into this today a bit.

@FelixSchwarz
Copy link
Owner

FelixSchwarz commented Feb 21, 2024

I pushed some additional code in the branch fix-escaped-html-tags. Basically the idea is to do the unescape only for contents of <mj-style>.

I also added CSS parsing using tinycss2. My idea is that this would blow up if the contents would be invalid (e.g. other HTML code) but I did not test that. Not sure if it is worth the additional dependency because arbitrary CSS can influence the displayed contents...

Update: It seems like tinycss happily parses even completely invalid HTML but I think css_inline would remove that. Should we just assume that the <mj-style> does not contain untrusted user input? (and maybe add a section to the readme?)

@FelixSchwarz
Copy link
Owner

Also maybe you can also check the security advisory draft I created for this issue. Feel free to suggest additions and please check if you agree with the severity classification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants