Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HTML escaping configurable #25

Open
lorddoig opened this issue Apr 23, 2015 · 14 comments
Open

Make HTML escaping configurable #25

lorddoig opened this issue Apr 23, 2015 · 14 comments
Labels
category: parsing Correctness and Edge Cases. Hail the DOM type: bug Something isn't working as intended

Comments

@lorddoig
Copy link

I'm currently using as-hiccup to transform parsed HTML into a vector I can save and later render with Reagent. This works almost flawlessly except for the fact that as-hiccup escapes any TextNodes it comes across, meaning things like & show up as literal strings when I pipe it down to the browser.

Maybe a keyword argument is the way to go? Not sure how that works with protocols.

@davidsantiago
Copy link
Collaborator

Yeah, this is an encoding issue, I think. We've had issues with this before. When HTML is written out as a file for parsing or rendering, it's encoded to have those entities appear in that & encoded form. But that & is there to represent the character '&', when a web browser or something else reads the contents of that HTML file, it should give you an '&' there according to the HTML encoding. So there's this difference between the HTML-encoded form of the text and the "logical" form of the text, which is what we work with once it's parsed.

The issue is that Hiccup has chosen to represent the text in an HTML tree in the HTML-encoded form.

user=> (use 'hiccup.core)
nil
user=> (html [:span "&"])
"<span>&amp;</span>"
user=> (html [:span "&"])
"<span>&</span>"
user=> (html [:span "<Hm..."])
"<span><Hm...</span>"

As you can see, hiccup makes no change to the text it's given when it puts it out to the HTML. It's supposed to already be HTML-encoded. So we as hickory, in aiming to make a way to go from HTML to hiccup, have to do the HTML encoding ourselves when we convert something to hiccup form. But the entire time, we are working with the "logical" form of the text in memory:

user=> (use 'hickory.core)
nil
user=> (use 'hickory.render)
nil
user=> (as-hiccup (parse "&amp;"))
([:html {} [:head {}] [:body {} "&amp;"]])
user=> (as-hickory (parse "&amp;"))
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:type :element, :attrs nil, :tag :head, :content nil} {:type :element, :attrs nil, :tag :body, :content ["&"]}]}]}
user=> (hiccup-to-html (as-hiccup (parse "&amp;")))
"<html><head></head><body>&amp;</body></html>"
user=> (hickory-to-html (as-hickory (parse "&amp;")))
"<html><head></head><body>&amp;</body></html>"

I don't think that either approach is more correct than the other, but we did make different choices on this issue (I wasn't even aware of the difference at the time, until someone was having problems in the other direction earlier). I do like being able to work with the text in an HTML document without worrying about the HTML encoding issues until it is converted to another form, I think that's worth it. I'm not sure that an option as you suggest is the way to fix this, though. I think the real answer is just to be more careful about the encodings of the data as we work with it.

@lorddoig
Copy link
Author

If I pull <div>Tarts &amp; Vicars</div> from the API I referred to (which I have no control over), I need that to be [:div {} "Tarts & Vicars"] because with Reagent (via React) you have to jump through hoops to get a string interpreted as HTML; my end goal is the hiccup representation, not the call to hiccup-to-html. In parse-fragment, it comes out of jsoup correctly (for me) unescaped, and is then re-escaped in as-hiccup. My current solution is to reimplement the protocol myself, removing the escaping in the TextNode implementation.

Everything you've said makes good sense but I'm not sure it helps with this. Am I missing something?

Of course my little fork does the job but it's not ideal in terms of maintenance etc.

@davidsantiago
Copy link
Collaborator

Yeah, I understand. You're not missing anything, I just don't think hickory is the place to insert a fix for the confusion caused by the combination of technologies you're using. Hickory parses HTML correctly, and it outputs HTML correctly. It also outputs hiccup forms correctly, in the sense that they can be passed without modification to hiccup and it will render to the correct HTML. I do notice that Reagent does not claim to accept hiccup as input, but rather a "hiccup-like" format. I'm sure they have their reasons for not accepting the original hiccup format, but I don't consider supporting their format a goal.

What I would do is fix this in your project. Make your own protocol similar to HiccupRepresentable, say something like ReagentHiccuplikeRepresentable, and reimplement it on the jsoup nodes to give you the hiccup-like output you need in order to get those things speaking to each other correctly.

@lorddoig
Copy link
Author

Thanks for this, I fully understand and accept your decision that hickory isn't the place for this. However if you'll humour me I have one last (largely academic) argument to make.

Hickory parses HTML correctly

The fact somebody needs to parse HTML guarantees they want to transform it, but does not guarantee that they want to turn it back into HTML again later. Given that a parser's existence is only justified by the need to manipulate, that's what it should enable: the canonical form of "&" in Clojure is literally that, and that's what you should get. Being mindful of re-rendering is nice, but it's above-and-beyond the call of duty; enforcing rendering concerns at the parsing stage is in a sense failing to parse properly.

HTML --> hiccup format --> shuffle nodes --> HTML is a perfectly valid thing to do, but it's only one transformation in an almost infinite number of transformations that need not end in HTML:

HTML --> hiccup format --> lexical analysis --> DB is just as valid, but people wanting to do this will be tripped up by non-canonical strings in their "parsed" data.

The specific change required to solve this problem only brings good things: it's simple and short (~3/4 lines), is completely non-breaking, and (as above) completes core parsing functionality.

Thoughts?

@davidsantiago
Copy link
Collaborator

I actually don't find hiccup a great format for non-trivial manipulation of the HTML data. In fact, I wrote hickory (specifically the data format) because of the difficulties I was having in that use case. I went into some of those in my comments on issue #24. My recommendation is to do as much manipulation as possible on the hickory format, and then convert that to hiccup at the end. Of course, hiccup has other strengths, and it's a widely used format in the clojure world. Maybe Reagent should consider using it, but as things stand, the fact that Hickory doesn't produce data that Reagent wants to consume is due to its being a bystander to tech choices that others have made.

Now, if Reagent wanted to formalize and document its data format and press that forward as something useful outside of Reagent proper, it might make sense for Hickory to output it. However, I've found nothing like that. They link to the hiccup repo on their "Hiccup-like" link, and give some examples. I've done a number of libraries now that deal with common data formats that are formally specified to varying degrees, and it's too much work to pursue every minor embellishment that would help someone out with their data, even when there is a clear spec for the main data format. Clear inputs and outputs make these sorts of things much more manageable, and it keeps this code in a pretty reliable state.

Additionally, I don't know Node, I don't know React, and I don't know Reagent, so I don't want to get involved in maintaining such code unless it is likely to be of very high value. I wouldn't be likely to do a very good job of maintaining code so far beyond my skill set ether. However, you are well-positioned to maintain code that does this, because you have a project with the combination of data inputs and outputs, and the technology expertise, that is relevant here. Unless it's apparent that there is a wide need for this HTML -> analysis -> hiccup-like combination, it seems like this is a combination of needs fairly unique to your project. So it seems like the best thing to do would be to take your code that does this massaging and have that exist in your own project as a utility. Though, if you are really planning to manipulate the structured HTML data a good amount, I'd again suggest working with the hickory version, and then taking src/hickory/convert.cljx:hickory-to-hiccup and copying that to your-project:hickory-to-hiccup-like with the changes you need. If they formalize their input language, or it gains traction, we can re-evaluate.

@lorddoig
Copy link
Author

I just want to make clear that I'm not arguing for Reagent support as I agree that that is beyond the scope of Hickory, and I'm not selfish enough to drag this debate out in pursuit of a minor embellishment that benefits a tiny number of users. The only reason I mentioned it all was for context; the case I'm making, while obviously motivated by Reagent, stands independently of the specifics of this use-case.

I understand that the battle is lost, so to speak, and again I want to point out that this argument really is academic, but to be clear the point I am making is that the need to parse exists if and only if there is a need to do computation on it: if you don't need to do that, you don't need to parse it - end of story. Any parser that only partially parses it's input is not fully enabling this computation and, ergo, is not fully doing it's job (again a theoretical point, not scathing criticism.) If the hiccup format was no good for computation, but only rendering, then why include it in a parsing library? I mean whats the point of (hiccup-to-html (as-hiccup x))? By including it there's an implicit acknowledgement that somebody somewhere wants to compute/transform their data in hiccup form...which goes back to my original point about not fully enabling computation by having remnants of HTML hanging around inside strings. It's a loss of generality.

Anyhow - I get it. Ultimately I'm making the case for a kind of theoretical purity in what a parser "should be" :neckbeard: and you're quite rightly sticking up for pragmatic concerns. Nevertheless thanks for entertaining the notion. 😄

@arichiardi
Copy link

👍 I have this problem too and @lorddoig I was also thinking to zip descending the three unescaping...I will check yous branch for your solution. BTW I was pretty surprised when I saw all the escaped text, but after reading this I guess it was not an easy choice from hickory's side.
IMHO a configurable option would not interfere too much. In any case very nice tool, one liner conversion is always nice to have.

@fasiha
Copy link

fasiha commented Sep 18, 2016

Is there a full list of special entities that Hickory will encode as HTML entities, like &, <, etc.? (I’m fixing to regexp the stringy input to Hickory, replacing these characters with pre-escaped Unicode code points as suggested by React and not have to deal with entity-encoded Hiccup.)

Edit Well, never mind, that obviously won’t work since one can’t tell beforehand which " will be escaped and which won’t. 😫

Edit the second I was using Specter to slice and dice Hiccup (from Hickory), so here’s a stupid, expensive way to unescape Hiccup:

(require '[com.rpl.specter :as s])
(require 'goog.string)
(s/transform [(s/walker string?)] goog.string.unescapeEntities my-hiccup-vector)

@davidsantiago
Copy link
Collaborator

Here's the function Hickory uses for HTML escaping: https://github.com/davidsantiago/quoin/blob/master/src/quoin/text.clj#L4 It just does &, <, >, and ".

@raymcdermott
Copy link

I would also appreciate a configuration option.

I'm trying to put some syntax colour on Clojure forms, like this:

(pr-str (map hickory/as-hiccup
               (hickory/parse-fragment
                 (highlight-html edn-form))))

Parsing a form like this using glow/highlight-html

(highlight-html "(println \"foo\")")

emits

"<div class=\"syntax\"><pre><span class=\"s-exp\">(</span><span class=\"core-fn\">println</span> <span class=\"string\">\"foo\"</span><span class=\"s-exp\">)</span></pre></div>"

The 'problem' is that glow emits \"foo\" in the HTML - which is obviously the correct output.

After parsing by hickory it becomes

"([:div {:class \"syntax\"} [:pre {} [:span {:class \"s-exp\"} \"(\"] [:span {:class \"core-fn\"} \"println\"] \" \" [:span {:class \"string\"} \"&quot;foo&quot;\"] [:span {:class \"s-exp\"} \")\"]]])"

using the unescape trick from @fasiha make the output nicer but I lose the quotes on "foo" which means that I cannot properly represent the input form.

Ideally I would like to only unescape the TextNode but it seems like this is ruled out.

Anyway, I like the library and am adding this comment so that you have another data point.

raymcdermott added a commit to raymcdermott/reptile-tail that referenced this issue May 19, 2018
@eneroth
Copy link

eneroth commented Jul 31, 2018

I came here looking for a resolution for the same problem. Specifically, as above, with regards to syntax highlighting.

@ghost
Copy link

ghost commented Dec 4, 2018

As I understand from this ticket, Hickory produces correct Hiccup, I guess this is a bug or missing feature in Reagent, hence I asked for a fix there: reagent-project/reagent#413

@port19x
Copy link
Contributor

port19x commented Apr 14, 2023

2023-04-14_09-33

Preemptively encoding special characters seems like a terrible idea

@port19x port19x added type: bug Something isn't working as intended category: parsing Correctness and Edge Cases. Hail the DOM and removed type: feature request Feature requests category: ergonomics Quality of Life improvements labels Apr 14, 2023
@port19x port19x self-assigned this Apr 14, 2023
@port19x
Copy link
Contributor

port19x commented Apr 14, 2023

imho this has very little to do with reagent.
It's just an unintuitive design choice.
Who in their right mind would expect their strings to change when changing the html representation format?

@port19x port19x removed their assignment Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: parsing Correctness and Edge Cases. Hail the DOM type: bug Something isn't working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants