Support HTML syntax #636
Comments
Why not simply stick with polyglot? Sigil actually uses a slightly modified version of Google's gumbo html parser to do automatic html repair of both xhtml and html, injecting closing tags for non-void elements as needed, and intentionally serializes the result as polyglot (i.e., only acknowledged void tags are self-closed, etc.). In this fashion, only pure xml that adds a separate closing tag on a void element needs to be fixed before passing it to an html5-compliant parser like gumbo. The resulting xhtml/html5 will parse directly with any browser interface that follows the official html5 parsing rules, while at the same time being completely valid xhtml. Using our slightly modified gumbo-based parser (lightweight, few dependencies, fast, C code) we can happily accept and fix epub author mistakes on the fly and allow either xml-based or html-based tools downstream. We would be happy to make our changes to Google's gumbo available under any license needed, as I am the author of all of those changes. This seems like the only sane solution for epub3 support in Sigil. BTW, as a bonus, this approach allows standard javascript tools (jquery, etc.) to work as expected, since the same dom tree will be generated from the html5 or xhtml polyglot serialized document. Dialect-specific serialization just makes no sense when you have html4, html5, xml 1, xhtml 1.1, xhtml5, and even old mobi html 3.2 dialects running around in the ebook world. Simply throw a fast html5-compliant parser like gumbo at it and polyglot serialize the result. You get an xhtml version that works in any browser as html5. |
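To make the polyglot rule above concrete, here is a minimal sketch in Python. It is not Sigil's gumbo-based code: it assumes the markup has already been parsed into a tree (here with the standard-library `xml.etree.ElementTree`), ignores namespaces, and the helper name `polyglot` is hypothetical. It only illustrates the serialization rule that void elements are self-closed while every other element gets an explicit closing tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# The void elements defined by the HTML standard.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

def polyglot(el):
    """Serialize an element so that only void elements are self-closed."""
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in el.attrib.items())
    if el.tag in VOID:
        out = f"<{el.tag}{attrs}/>"
    else:
        inner = escape(el.text or "") + "".join(polyglot(c) for c in el)
        out = f"<{el.tag}{attrs}>{inner}</{el.tag}>"
    return out + escape(el.tail or "")

# Pure XML input: a void element written with a separate closing tag,
# and an empty non-void element written in self-closed form.
tree = ET.fromstring('<div><p>one<br></br>two</p><script src="a.js"/></div>')
print(polyglot(tree))
# <div><p>one<br/>two</p><script src="a.js"></script></div>
```

The output is well-formed XML, and an HTML5 parser reads it the same way, because `<br/>` is treated as a void element and `<script>` has a real end tag.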
Hi! I maintain the Standard Ebooks project, which produces ebooks using Epub 3.0.1 as the base format, with an emphasis on using as many of the rich opportunities for semantics and metadata that 3.0.1 provides. Switching to HTML5 syntax over XHTML is a great step forward. XHTML is a horror to work with and dropping it for a more flexible and friendly flavor of markup can't come soon enough. I am concerned, however, about the loss of epub:type. If we're switching to using the role attribute, then the vocabulary afforded by the W3C spec is already much thinner than the Epub semantic inflection vocabulary. As new ebooks are produced we'll be losing opportunities to add significant semantic information, like which parts of the work are front, body, or backmatter, while retaining redundant semantics, like "title" (which can already be implied by an Another plus to epub:type (and a grudging nod to XHTML) is that we can use other semantic vocabularies not defined in the official Epub spec. For example, Standard Ebooks goes to great lengths to add extensive semantics to all of the books we produce. We prefer the standard Epub 3.0.1 semantic vocabulary if it contains what we need; if not, we look at the greatly expanded z3998 vocabulary; and finally we have our own custom vocabulary, which is used as a sort of transitional vocabulary until we sort out schema.org. This means that within a single ebook, we can use a variety of standardized vocabularies to inflect sections as "poems", "songs", "letters", and so on. That's nice for two reasons:
So ultimately, I applaud the move to HTML5, but doing so at the expense of getting rich semantics would be a big loss. Ideally we would have a way to get the ease-of-use of HTML5 along with the richness of semantic inflection and the ability to use non-epub-spec semantics that the current epub:type definition allows. |
Closing this issue as it was resolved not to add HTML in the 3.1 revision |
The issue was discussed in a meeting on 2020-12-18
1. In preparation for html5 in epub3
See github issue #636.
Dave Cramer: Reading systems and xhtml vs html.
George Kerscher: DAISY contacted devs and asked about support of features.
Dave Cramer: Just trying to get a feel to understand cost.
Ivan Herman: A little worried that DAISY might end up in the middle of a discussion it doesn't want to be in.
George Kerscher: No problem, we can hand out the list, may need to ask a couple of people, but not a big deal.
Brady Duga: I don't think this is too early to start asking.
Laurent Le Meur: this is a big change and we really need to ask about it now.
Tzviya Siegman: Coming at this as publisher and dev experience from epubcheck.
Hadrien Gardeur: Worried about the consistency.
Dave Cramer: Good point, we aren't making a decision here, just want to understand where we are and where we might want to go.
Garth Conboy: HTML serialization won't work out of the box, but doesn't mean we are against it.
Dave Cramer: Let's get back on track. We aren't discussing the issue now, we are just preparing to discuss it.
Avneesh Singh: Also please remember there is a project to move epubcheck to the w3c html validator. |
Are you looking for input from epub development software like Sigil? If so, pure html5 has such lax parsing rules that it actually makes downstream xml-based tools harder to implement: a full browser parsing engine would be needed just to parse the resulting code, due to the "flexible" state-based parsing rules used by pure html5. Using a modified version of Google's gumbo parser, Sigil can happily take pure html5 with its lax rules and automatically create strict xhtml5-based output with no real cost to the end epub developer. Having Sigil generate a stricter xhtml variant of html5 allows all existing downstream xml toolchains to still be used. So simple open source tools and software already exist to take pure html5 with its lax parsing rules and create something more easily processed downstream. Not all downstream tools should need to implement the full html5 parser spec just to work properly. So instead of moving the source for epubs to pure html5, simply use freely available epub developer tools to create the proper strict syntax that makes the entire toolchain work. |
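A small illustration of the downstream problem described above, as a sketch only: it uses Python's standard-library XML parser as a stand-in for "an xml-based tool", and the helper name `is_well_formed` and both sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if a plain XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Valid under html5's lax rules: unclosed <br>, implied </p> end tags.
lax_html = "<html><body><p>one<br>two<p>three</body></html>"
# The same content after a repair step has made it strict xhtml.
strict_xhtml = "<html><body><p>one<br/>two</p><p>three</p></body></html>"

print(is_well_formed(lax_html))      # False
print(is_well_formed(strict_xhtml))  # True
```

An xml-only tool cannot read the first document at all, which is why the repair step has to happen somewhere before the content reaches it.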
Five years later, I got an email about activity in this issue, and I wanted to pop in to say that my view of XHTML has changed since my previous comment. Previously, I was focused on the annoyance of authoring it, compared to the laxer HTML5 parsing rules. XML-isms like namespaces were a pain point. But, after working on nearly 450 epubs at Standard Ebooks, my perspective has changed. Like @kevinhendricks stated above, XML is easier and faster to parse, and thus process by other programs. Since we already need an XML parser for the metadata file, programs that work with a whole epub would need to package an additional HTML5 parser instead of reusing the XML parser. The publishing industry is heavily invested in XML so being able to use a single parsing library to (for example) process both an OPDS feed and an epub is very valuable. XHTML gives us nice things like xpath and xslt for free. Pretty-printing/canonicalization is easier and has good support in many libraries and programs. I still think XML namespaces are an annoyance at the XML spec level, but in the context of the average epub, the annoyance is limited because 95% of the time the only non-default namespace will be the Importantly, canonicalized XHTML forces uniformity on the output. In a canonicalized document we can always expect to see So, if my opinion is worth anything, five years later I would like to reverse my previous position and suggest sticking with XHTML, but expanding it to the HTML5 element vocabulary. In other words something like XHTML5. epubcheck already allows HTML5 vocabulary like (My previous opinion on |
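As a rough sketch of the "single parsing library" point, the snippet below reads both a package metadata fragment and an XHTML content document with the same standard-library XML parser and its limited XPath support. The two documents are invented, deliberately incomplete fragments (the package lacks required attributes and metadata), and the `NS` prefix map is just a convenience for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by EPUB package and content documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xhtml": "http://www.w3.org/1999/xhtml",
    "epub": "http://www.idpf.org/2007/ops",
}

PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
  </metadata>
</package>"""

CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
  <body>
    <section epub:type="chapter"><h1>Chapter One</h1></section>
    <section epub:type="z3998:poem"><p>A line of verse.</p></section>
  </body>
</html>"""

# Same parser, same query style, for metadata and for content.
title = ET.fromstring(PACKAGE).find("opf:metadata/dc:title", NS).text

chapter = ET.fromstring(CHAPTER)
epub_type = f"{{{NS['epub']}}}type"  # qualified name of the epub:type attribute
types = [s.get(epub_type) for s in chapter.iterfind(".//xhtml:section", NS)]

print(title)  # Example Book
print(types)  # ['chapter', 'z3998:poem']
```

With HTML-syntax content, the second half would need a separate HTML5 parser, while the first half would still need the XML one.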
About our company's RS (named BinB):
No.
Our RS does not use XML parsers, so it does not depend on well-formed XML. On the other hand, to fully ingest EPUBs that use HTML syntax, it would need additional parsing development. We also have some satellite tools for processing EPUB files that use an XML parser. They require all content documents to be well-formed XML. |
That is already the case in 3.2. The relevant section §2.2 says:
And the references are to HTML5 XML syntax. The upcoming EPUB 3.3 draft takes this text over verbatim. |
(Copied from Laurent's email for an easier reference, with authorization.) CC @llemeurfr |
(Copied from Garth's email, with authorization.) Cc @GarthConboy |
Speaking about our Reading System (PUBLUS Reader for Android/iOS);
Yes. PUBLUS Reader uses Blink (Android) or WkWebKit (iOS).
Seems possible. |
In my personal opinion, the various satellite tools needed for delivery have a greater impact than the RS. |
I cannot speak for the Readium "mobile" implementations (iOS/Android), but here is my feedback from the Readium Desktop / Thorium point of view: support for non-XML HTML would require a thorough audit of a relatively large codebase, in order to identify code where we rely on the assumption that the markup is well-formed XML. This is by no means a complete analysis, but just off the top of my head:
To conclude: adding support for non-XML HTML is definitely within the realm of possibilities in Thorium / Readium-Desktop, but as with any non-trivial development task, this would require in-depth analysis and methodical regression testing (in other words, it wouldn't just be a case of "adding" HTML support, we would need to make sure that the existing XHTML support doesn't break when implementing dual support for XHTML / HTML in the various content processing modules). |
You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable. |
I don't remember the details, but indeed there were concerns that the XML-centric CFI processing model wouldn't work reliably with HTML DOM, due to subtleties in text encoding, character offsets / text node normalisation, element boundaries (e.g. self-closing tags), etc. "Polyglot Markup: A robust profile of the HTML5 vocabulary" PS: in Thorium / Readium-Desktop we do not make internal use of CFI (unlike the first / original incarnation of Readium SDK), instead we use our own optimised DOM-Range (de)serialisation technique. We can (and do) generate equivalent CFI expressions for our bookmarks / annotations, but we do this in a "vacuum" in the sense that we have no consuming API at the moment (in a future software iteration, we may produce CFI references in the context of interoperable W3C content annotations, if the need arises). |
that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted. |
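A simplified sketch of how such a distortion could arise. It assumes the well-known case where an HTML parser inserts an implied `tbody` around table rows; the second tree is hand-written to mimic that result rather than produced by a real HTML5 parser. The helper `element_steps` is hypothetical and models only the even-numbered element steps of a CFI path, ignoring text nodes, character offsets, and ID assertions.

```python
import xml.etree.ElementTree as ET

def element_steps(root, target):
    """Return CFI-style even-numbered child steps from root down to target."""
    def walk(node, path):
        if node is target:
            return path
        # The n-th child element is addressed by the step 2 * n.
        for i, child in enumerate(node, start=1):
            found = walk(child, path + [2 * i])
            if found is not None:
                return found
        return None
    return "".join(f"/{step}" for step in walk(root, []))

# The markup as authored, read by an XML parser.
authored = ET.fromstring(
    "<body><table><tr><td>cell</td></tr></table></body>")
# The same table with the tbody element an HTML parser would add.
parsed = ET.fromstring(
    "<body><table><tbody><tr><td>cell</td></tr></tbody></table></body>")

print(element_steps(authored, authored.find(".//td")))  # /2/2/2
print(element_steps(parsed, parsed.find(".//td")))      # /2/2/2/2
```

The same cell ends up with two different paths, so a reference computed against one tree would not resolve to the cell in the other.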
Apart from RS developers' advice, it is important to take into account the pressure that could come from the publishers' side. Big publishers are mostly using internal XML-based workflows. I say mostly because we know that there is also a small production of carefully crafted (X)HTML-based ebooks. Small publishers willing to create reflow EPUB seem to use desktop tools that produce EPUB via a save button. And the vast majority of EPUB FXL publishers are using InDesign. As a consequence, from what I hear, the pressure is quasi-null. |
Right, and I believe this is partly why we dropped reading system support for authored CFIs in 3.1 (aside from lack of anyone actually writing these things manually). It was preparatory for possibly adding HTML as they aren't webbish. The only reason I raised it was that I know there's a possible resurrection of CFI in the works, and this could prove problematic for an HTML syntax. Not a blocker, but a consideration for the viability of CFIs as currently written. |
I believe the parser would add some things even to the XML syntax of HTML. For example, tables without |
Ouch. As usual, @dauwhe is right... |
The issue was discussed in a meeting on 2021-05-27
3. HTML Serialization
See github issue #636.
Dave Cramer: i've made epubs that have used HTML that was not well-formed XHTML
Brady Duga: sounds like a lot of work, so maybe we should wait until there is need for it
Dave Cramer: in my mind epub 4 doesn't have to worry about backwards compatibility with epub 3, chief among which is support for XHTML
Shinya Takami (高見真也): i have no objection, but we have to consider compatibility with existing epubs
Dave Cramer: we would never mandate a change to HTML5
Wendy Reid: when I looked into it I specifically talked to ingestion side (where most of the problems would pop up)
Dave Cramer: comment from Daniel Weck was that Readium would experience several issues if we did this
Brady Duga: CFI was intended to work with the text, not with the DOM
Dave Cramer: the other argument that has been put forward for HTML is that HTML tools could be used, but i've never seen an example of one that works on HTML and breaks on XHTML
Brady Duga: if we're deferring, should it be closed?
Wendy Reid: i think the idea for epub 4 is that we start with clean slate
Wendy Reid: not resolved yet. Will return to this with tomorrow's group. |
The issue was discussed in a meeting on 2021-05-28 List of resolutions:
1. HTML serialization
See github issue #636.
Continuation of the discussion on the first vF2F meeting
Dave Cramer: time to talk about serialization of html5 - I used to be more of an advocate for this but reality is that this is going to require a fair amount of work for reading systems that depend on xml tools
Brady Duga: last night we discussed this and my position is that yes we can do this but it'll be a bit of work - we're not hearing demand for this from publishers so we'd rather spend the time on more important things
Tzviya Siegman: I agree with Dave. I think we need to defer this to another major version of epub. There are some things we've been thinking about doing in epubcheck that would make it easier to maintain. Some of that will be deferred until we have HTML epub.
Wendy Reid: self-publishing is a constituency we need to think about - we have a big community of them at kobo. I haven't heard these publishers expressing difficulty with xhtml because they have tools to create epubs or get outsourcers to do the work for them
George Kerscher: the word-to-epub add-in does an excellent job of making epubs. Google docs produces them but would be nice to have a facelift. Scholarly publishing content that is distributed in pdf is awful for disability. Wonder if the process of converting from pdf is easier or harder with xhtml.
Gregorio Pellegrino: what are the benefits of switching from xhtml to html - is it only non-closing tags?
Dave Cramer: some arguments are that it makes it easier to repurpose html content - most scripting libraries aren't tested with xhtml so may work better with html
Dan Lazin: this seems like the kind of thing we should defer to later - xhtml is losing popularity - we may be fine with it for a while but in the future we may need to change - put a note on the issue asking people to up-vote and explain why they need it so there is more info
Ivan Herman: example: I made a script to translate W3C TR documents into EPUB. It was very easy to find HTML parsers but when I need to generate XHTML from it I get syntactically incorrect markup. I had to spend much more time to find a library to convert the HTML to proper XHTML. The long-term evolution of tools works against XHTML
Brady Duga: most reading systems display their content as html, not xhtml. Moving to html removes an unnecessary intermediary step for us. I'm in favour of punting but we do this every single time
Avneesh Singh: from a management perspective with only 6 months remaining we need support in reading systems and epubcheck. we should have a compelling reason to move on this now given this timeline.
Ken Jones: the self-publishers I'm aware of are not writing any code - they use tools so they are not pushing for this. they will wait for their tools to update and use whatever is available
Dan Lazin: agree with Brady that most reading systems are using html internally - we could relax validation standards so the document is supposed to be xhtml but we allow syntactic invalidities - authors just want to dump html into the body, they don't care about the head
Ivan Herman: I would be scared to describe formally this kind of looseness. On the other hand, we could say that the reading systems spec could allow them to use html5 but formally the content spec says you must use xhtml5
Brady Duga: I don't think this would solve the problem - reading systems still process the xhtml as xml prior to display so having html in the body would cause us to reject
Dan Lazin: what I'm thinking is to use an attribute on the body tag so you could fork the pipeline or skip the step that requires xml conformance
Dave Cramer: I've gotten the sense from all the discussions that we have kicked this can for a long time and is not ideal but I think we've had good reasons for doing that. I don't see eagerness to change without a more compelling reason than we have supplied so far
Tzviya Siegman: we tried that in 3.1 and it didn't go well
Dave Cramer: we have the goal of making EPUB 3.3 a W3C REC but do we want to persist this incremental backwards compatible mode for 3.4, 3.5, etc. - we need to make a break at some point and not just have working groups to make editorial changes
Avneesh Singh: looking at W3C culture, a lot of work begins in the CG before getting a WG to formalize
Wendy Reid: I want to have some of these ideas incubated as we're running into the limits of EPUB 3 compatibility. we need to go big at some point.
Dave Cramer: are we prepared to defer this issue and hand it off to the CG to look at the long-term future of epub?
George Kerscher: is there a reason to keep the ncx?
Ivan Herman: we can't do anything about people including it - not part of epub 3
Gregorio Pellegrino: reading systems use it for compliance to allow old readers to open epub 3
Brady Duga: forbidding it doesn't help me as I have legacy epubs with it
Dave Cramer: would be different in an epub 4
Ivan Herman: you said defer this - what do you mean? are we closing in github for now or are we keeping it open for next version?
Dave Cramer: github issues are easy to find with labels. I would put a push-to-CG-closed label on it
Ivan Herman: in the earlier discussion last night there was a similar agreement and partial vote yielding the same result.
|
@wareid @dauwhe @shiestyle in accordance with the vF2F resolution, I have created a new label 'to-be-incubated-further' and used it for this issue before closing. |
See also #2259, which raises similar issues and arguments. |
And to be fair, Amazon is not an epub vendor and popup footnotes were specifically suggested in the original epub3 spec, so these arguments are specious at best. Epub3 is finally taking off, especially internationally, the tools to create them are many and free alternatives exist. Let's not break everything now. Slowly evolving the standard while nudging people in the right direction with epubcheck seems like the safest and best decision possible. |
Part of the alignment process with the open web platform is to begin supporting the HTML syntax of HTML in addition to the XHTML syntax.
Details are available in the following document:
https://docs.google.com/document/d/1m2XsQbYcYIRJ1CL2HojeU8XNXOluHk6g7AScM5hkrZg/edit
The proposal implemented for the first draft is to allow support for both HTML and XHTML syntax in content and require support for both syntaxes in reading systems.
The epub:type attribute will be superseded by the ARIA role attribute, but will remain available for backwards compatibility and for specifications whose semantics haven't been ported.
This issue will remain open past the first draft for comments.