Support HTML syntax #636

Closed
mattgarrish opened this issue Jan 10, 2016 · 25 comments
Labels
Spec-EPUB3 (The issue affects the core EPUB 3.3 Recommendation), To-be-incubated-further

Comments

@mattgarrish
Member

Part of the alignment process with the open web platform is to begin supporting the HTML syntax of HTML in addition to the XHTML syntax.

Details are available in the following document:
https://docs.google.com/document/d/1m2XsQbYcYIRJ1CL2HojeU8XNXOluHk6g7AScM5hkrZg/edit

The proposal implemented for the first draft is to allow support for both HTML and XHTML syntax in content and require support for both syntaxes in reading systems.

The epub:type attribute will be superseded by the ARIA role attribute, but will remain available for backwards compatibility and for specifications whose semantics haven't been ported.

This issue will remain open past the first draft for comments.

@mattgarrish mattgarrish added this to the EPUB 3.1 milestone Jan 10, 2016
@kevinhendricks

Why not simply stick with polyglot? Sigil actually uses a slightly modified version of Google's gumbo html parser to do automatic html repair of both xhtml and html, injecting closing tags for non-void elements as needed, and intentionally serializes things to polyglot (i.e. only acknowledged void tags are self-closed, etc). In this fashion only pure xml that adds a separate closing tag on a void element needs to be fixed before passing it to an html5 compliant parser like gumbo. The resulting xhtml/html5 will parse directly with any browser interface that follows the official html5 parsing rules, while at the same time being completely valid xhtml. Using our slightly modified gumbo based parser (lightweight, few dependencies, fast, C code) we can happily accept and fix epub author mistakes on the fly and allow either downstream xml or html based tools. We would be happy to make our changes to Google's gumbo available under any license needed as I am the author of all of those changes.

This seems like the only sane solution for epub3 support in Sigil. BTW, as a bonus, this approach allows standard javascript tools (jquery, etc) to work as expected since the same dom tree will be generated from the html5 or xhtml polyglot serialized document. Dialect specific serialization just makes no sense when you have html4, html5, xml 1, xhtml 1.1, xhtml5, and even old mobi html 3.2 dialects running around in the ebook world. Simply throw a fast html5 compliant parser like gumbo at it and polyglot serialize the result. You get an xhtml version that works in any browser as html5.
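
A minimal Python sketch of the polyglot serialization rule described above (void elements self-closed, every other element given an explicit closing tag). It is not Sigil's gumbo-based code: `serialize_polyglot` and the `VOID_ELEMENTS` set are hypothetical names, the input is assumed to be already well-formed and un-namespaced, and any content inside a void element is simply dropped.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed list of HTML void elements; adjust to the HTML version you target.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

def serialize_polyglot(elem):
    """Serialize an element so the same bytes work for XML and HTML parsers."""
    attrs = "".join(f" {name}={quoteattr(value)}" for name, value in elem.attrib.items())
    if elem.tag in VOID_ELEMENTS:
        # Void elements are always written in self-closed form.
        out = f"<{elem.tag}{attrs}/>"
    else:
        # Non-void elements always get a separate closing tag, even when empty.
        inner = escape(elem.text or "") + "".join(serialize_polyglot(c) for c in elem)
        out = f"<{elem.tag}{attrs}>{inner}</{elem.tag}>"
    return out + escape(elem.tail or "")

source = '<div><br></br><p/><img src="a.png"></img>tail</div>'
result = serialize_polyglot(ET.fromstring(source))
print(result)
```

The repair step (injecting missing closing tags into tag soup) is deliberately left out; that part needs a real HTML5 parser such as gumbo.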

@acabal

acabal commented Feb 14, 2016

Hi! I maintain the Standard Ebooks project, which produces ebooks using Epub 3.0.1 as the base format, with an emphasis on using as many of the rich opportunities for semantics and metadata that 3.0.1 provides.

Switching to HTML5 syntax over XHTML is a great step forward. XHTML is a horror to work with and dropping it for a more flexible and friendly flavor of markup can't come soon enough.

I am concerned, however, about the loss of epub:type. If we're switching to using the role attribute, then the vocabulary afforded by the W3C spec is already much thinner than the Epub semantic inflection vocabulary. As new ebooks are produced we'll be losing opportunities to add significant semantic information, like which parts of the work are front, body, or backmatter, while retaining redundant semantics, like "title" (which can already be implied by an <h#> tag--isn't <h2 role="title"> a bit redundant?).

Another plus to epub:type (and a grudging nod to XHTML) is that we can use other semantic vocabularies not defined in the official Epub spec. For example, Standard Ebooks goes to great lengths to add extensive semantics to all of the books we produce. We prefer the standard Epub 3.0.1 semantic vocabulary if it contains what we need; if not, we look at the greatly expanded z3998 vocabulary; and finally we have our own custom vocabulary, which is used as a sort of transitional vocabulary until we sort out schema.org. This means that within a single ebook, we can use a variety of standardized vocabularies to inflect sections as "poems", "songs", "letters", and so on.

That's nice for two reasons:

  1. It opens up a lot of fascinating possibilities for data crunching and machine processing ("find all books from 1890 that had letters in them"; "create a list of every unique ship name in Moby Dick"). This isn't a particularly practical goal in today's terms, but I think marking up ebooks as richly as we can is a noble nod towards our future.

    The thing with ebooks is that it's easy to add semantics during the proofing process, but really, really hard and time consuming to go back and do it later. Because of that, losing the ability to include these kinds of semantics would, in a practical sense, lock the door and throw away the key for interesting machine processing via rich semantics for years or decades.

  2. It gives us a hook for CSS styles. Consider the following snippet:

    <div epub:type="z3998:letter">
     <p>Dear sir...</p>
    </div>
    <p>That's all she wrote.</p>

    Not only do we get some nice semantic inflection there, but we can hook CSS to it like so:

    [epub|type~="z3998:letter"]{
     margin: 1em;
    }
    
    [epub|type~="z3998:letter"] + p{
     text-indent: 0;
    }

    Without that, we'd have to style with CSS classes, which nets us the same styling but without a semantic freebie, and leaves us at the mercy of unsemantic and unstandardized antipatterns like <p class="smcap">.
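
As a rough illustration of the machine-processing point, here is a minimal Python sketch that finds every element inflected as a letter. It uses only the standard library; `elements_with_type` and the sample document are hypothetical, and the epub namespace URI is assumed to be the usual `http://www.idpf.org/2007/ops`.

```python
from xml.etree import ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"

# Hypothetical sample document built around the snippet above.
doc = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
<body>
<div epub:type="z3998:letter"><p>Dear sir...</p></div>
<p>That's all she wrote.</p>
</body>
</html>"""

def elements_with_type(root, wanted):
    """Return elements whose space-separated epub:type list contains `wanted`."""
    attr = f"{{{EPUB_NS}}}type"
    return [el for el in root.iter() if wanted in el.get(attr, "").split()]

root = ET.fromstring(doc)
letters = elements_with_type(root, "z3998:letter")
print(len(letters), letters[0].tag)
```

Splitting the attribute on whitespace mirrors the `~=` matching used in the CSS selectors above.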

So ultimately, I applaud the move to HTML5, but doing so at the expense of getting rich semantics would be a big loss. Ideally we would have a way to get the ease-of-use of HTML5 along with the richness of semantic inflection and the ability to use non-epub-spec semantics that the current epub:type definition allows.

@mattgarrish mattgarrish removed this from the EPUB 3.1 milestone May 3, 2016
@mattgarrish
Member Author

Closing this issue, as it was resolved not to add HTML in the 3.1 revision.

@mattgarrish mattgarrish added Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation and removed Spec-General labels Nov 12, 2020
@dauwhe dauwhe reopened this Nov 18, 2020
@iherman
Member

iherman commented Dec 18, 2020

The issue was discussed in a meeting on 2020-12-18

  • no resolutions were taken

1. In preparation for html5 in epub3

See github issue #636.

Dave Cramer: Reading systems and xhtml vs html.
… Thanks to those that already responded to their uses of xhtml vs html.
… Other RS vendors, please provide this information.
… I have tried a few via side loading and some of them work, but want a proper answer from everyone.
… this is to inform discussion next year.

George Kerscher: DAISY contacted devs and asked about support of features.
… "we" [DAISY or the WG?] could take the list of known RSes and ask them about xhtml vs html.
… or do we just want a feel.

Dave Cramer: Just trying to get a feel to understand cost.
… so might be too soon.

Ivan Herman: A little worried that DAISY might end up in the middle of a discussion it doesn't want to be in.
… it would be helpful if those contacts could be disclosed to the WG so we could contact them.

George Kerscher: No problem, we can hand out the list, may need to ask a couple of people, but not a big deal.

Brady Duga: I don't think this is too early to start asking.
… there's not too many RSs in this group.
… it's worthwhile to check in and let people know we're having this discussion.
… we need to reach out to the community.

Laurent Le Meur: this is a big change and we really need to ask about it now.
… less RSes can even support html5. The issue may not be cost, but will be RSes left behind that will never support html5.

Tzviya Siegman: Coming at this as publisher and dev experience from epubcheck.
… don't want to muddy the waters about content, but would be helpful to understand if the issue is supporting two types of books, or if it is making tools, etc.
… want to know where the issues are.

Hadrien Gardeur: Worried about the consistency.
… We can either be more aligned with the web to make it easier to use html tools to build content, etc.
… or are we being conservative?
… We seem to be doing both.
… We have been very conservative, now we are not with this discussion.

Dave Cramer: Good point, we aren't making a decision here, just want to understand where we are and where we might want to go.

Garth Conboy: HTML serialization won't work out of the box, but doesn't mean we are against it.
… The issue about stuck in time RSes is a good one.
… As is the issue about direction.
… But we also need to talk to publishers and see if they care.
… Need to understand if RSes are willing to adapt, not understand what they do now.

Dave Cramer: Let's get back on track. We aren't discussing the issue now, we are just preparing to discuss it.

Avneesh Singh: Also please remember there is a project to move epubcheck to the w3c html validator.
… if that prototype works, then it solves the problem of epubcheck not working with html5.
… second, it is good to start collecting data now.

@kevinhendricks

Are you looking for input from epub development software like Sigil?

If so, pure html5 has such lax parsing rules that it actually makes downstream xml based tools harder to implement as a full browser parsing engine would be needed just to parse the resulting code due to the "flexible" state based parsing rules used by pure html5.

Using a modified version of Google's gumbo parser inside Sigil, Sigil can happily take pure html5 with its lax rules and automatically create strict xhtml5 based output with no real cost to the end epub developer. Having Sigil generate the stricter xhtml variant of html5 allows all existing downstream xml toolchains to still be used.

So simple open source tools and software already exist to take pure html5 with its lax parsing rules and create something more easily processed downstream. Not all downstream tools should need to implement the full html5 parser spec just to work properly.

So instead of moving the source for epubs to pure html5, simply use freely available epub developer tools to create the proper strict syntax that makes the entire toolchain work.

@acabal

acabal commented Dec 18, 2020

Five years later, I got an email about activity in this issue, and I wanted to pop in to say that my view of XHTML has changed since my previous comment. Previously, I was focused on the annoyance of authoring it, compared to the laxer HTML5 parsing rules. XML-isms like namespaces were a pain point.

But, after working on nearly 450 epubs at Standard Ebooks, my perspective has changed. As @kevinhendricks stated above, XML is easier and faster to parse, and thus easier for other programs to process. Since we already need an XML parser for the metadata file, programs that work with a whole epub would need to package an additional HTML5 parser instead of reusing the XML parser. The publishing industry is heavily invested in XML, so being able to use a single parsing library to (for example) process both an OPDS feed and an epub is very valuable.

XHTML gives us nice things like xpath and xslt for free. Pretty-printing/canonicalization is easier and has good support in many libraries and programs.

I still think XML namespaces are an annoyance at the XML spec level, but in the context of the average epub, the annoyance is limited because 95% of the time the only non-default namespace will be the epub namespace.

Importantly, canonicalized XHTML forces uniformity on the output. In a canonicalized document we can always expect to see <br/>, not <br> or <br></br> or anything else. This is useful not just for presentation but for the rare and unpleasant times when a program must massage XHTML using regexes or other naive string operations.
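
A minimal Python sketch of that uniformity, using the standard library's ElementTree serializer as a stand-in for whichever canonicalizer a project actually uses (an assumption; the exact spelling of an empty element depends on the tool). Three different authored spellings of a line break come out identical after a parse and re-serialize round trip.

```python
from xml.etree import ElementTree as ET

# Three authored spellings of the same empty element.
variants = [
    "<p>a<br/>b</p>",
    "<p>a<br></br>b</p>",
    "<p>a<br />b</p>",
]

# Round-trip each variant through the XML parser and serializer.
normalized = {ET.tostring(ET.fromstring(v), encoding="unicode") for v in variants}
print(normalized)
```

Because every variant collapses to one spelling, a later regex or string operation only has to handle that single form.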

So, if my opinion is worth anything, five years later I would like to reverse my previous position and suggest sticking with XHTML, but expanding it to the HTML5 element vocabulary. In other words something like XHTML5. epubcheck already allows HTML5 vocabulary like <section> so maybe the epub spec already allows for that, I don't have it in front of me right now.

(My previous opinion on epub:type still stands; it is very useful and it would be a pity to see it go in favor of a non-standard attribute and vocabulary.)

@toshiakikoike

About our company's RS (named BinB):

A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?

No.
Our RS does not use any web browser rendering engine. It uses our original rendering engine.

B. Is your reading system capable of ingesting such EPUBs? Does your toolchain depend on all content documents being well-formed XML?

Our RS does not use XML parsers, so it does not depend on well-formed XML. On the other hand, ingesting HTML syntax EPUBs completely would need additional development for parsing.

We also have some satellite tools for processing EPUB files that use an XML parser. They require all content documents to be well-formed XML.
For example, a preview (sample) file maker (my own tool).

@iherman
Member

iherman commented Dec 19, 2020

@acabal,

So, if my opinion is worth anything, five years later I would like to reverse my previous position and suggest sticking with XHTML, but expanding it to the HTML5 element vocabulary. In other words something like XHTML5

That is already the case in 3.2. The relevant section §2.2 says:

An XHTML Content Document has to meet the following basic requirements:

  • It MUST be an [HTML] document that conforms to the XHTML syntax.

And the references are to HTML5 XML syntax. The upcoming EPUB 3.3 draft takes this text over verbatim.

@iherman
Member

iherman commented Dec 19, 2020

Speaking about Readium toolkits (Mobile and Desktop) and the desktop Thorium Reader app:

A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?

Yes, as all Readium toolkits are based on major web rendering engines. They belong to what could be called the "Open Web RS profile".

The Readium Mobile iOS toolkit relies on Webkit (WKWebView).
The Readium Mobile Android toolkit relies on Chrome WebView.
The Readium Desktop toolkit relies on Chromium (via Electron.js).
And Thorium Reader relies on Readium Desktop.

B. Is your reading system capable of ingesting such EPUBs?

Yes they are.

To be sure, I created a very dirty EPUB from "wasteland" (which contains a unique spine item) by replacing the XHTML content by some random HTML tag soup. It would be good to have a proper sample, but until then ... it works on Thorium like a charm.

(Copied from Laurent's email for an easier reference, with authorization.)

CC @llemeurfr

@iherman
Member

iherman commented Dec 19, 2020

For Play Books.

A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?

Currently "no" -- epubcheck is part of the ingest pipeline and would fail such content. XML processing tools are currently used in the ingest pipeline.

B. Is your reading system capable of ingesting such EPUBs? Does your toolchain depend on all content documents being well-formed XML?

Such content would not currently get past the front door.

However, if EPUB 3.3 were to add HTML serialization, we would embark on the (non-trivial) effort to support non-XML content.

(Copied from Garth's email, with authorization.)

Cc @GarthConboy

@aRyoKuroda

Speaking about our Reading System (PUBLUS Reader for Android/iOS);

A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?

Yes. PUBLUS Reader uses Blink (Android) or WkWebKit (iOS).

B. Is your reading system capable of ingesting such EPUBs? Does your toolchain depend on all content documents being well-formed XML?

Seems possible.

@toshiakikoike

toshiakikoike commented Jan 14, 2021

In my personal opinion, the various satellite tools needed for delivery have a greater impact than the RS.
As mentioned in #636 (comment), I use an XML parser in the tool that creates a sample EPUB from a full EPUB.
When dealing with "HTML syntax" instead of "XHTML syntax", the scope of additional development will be larger because an XML parser cannot be used.
I think it would be better to hear not only from RS vendors but also from bookstore system vendors.
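
A minimal Python sketch of the tooling gap, using only the standard library: an XML parser rejects HTML-syntax markup outright, while an HTML tokenizer accepts it. The sample string and the `TagCollector` helper are hypothetical, and `html.parser` is only an event-based tokenizer, not a full HTML5 tree builder, so a real tool would need more than this.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# HTML syntax: unclosed <br>, and <p> left open at the end of the input.
soup = "<p>first line<br>second line"

def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

class TagCollector(HTMLParser):
    """Record start tags in the order the HTML tokenizer reports them."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(soup)
collector.close()

print(is_well_formed_xml(soup))  # the XML toolchain cannot read this input
print(collector.tags)            # the HTML tokenizer still sees both tags
```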

@dauwhe dauwhe added the Agenda+ F2F Possible agenda item for F2F label Apr 29, 2021
@danielweck
Member

I cannot speak for the Readium "mobile" implementations (iOS/Android), but here is my feedback from the Readium Desktop / Thorium point of view: support for non-XML HTML would require a thorough audit of a relatively large codebase, in order to identify code where we rely on the assumption that the markup is well-formed XML. This is by no means a complete analysis, but just from the top of my head:

  • XHTML DOM parsing: we usually instantiate parser APIs with explicit content types (e.g. application/xhtml+xml, as authored in the EPUB OPF package manifest items), but sometimes we may rely on implicit / default XML handling, depending on the library used to process documents.
  • XHTML DOM querying: we use several techniques to access and navigate the document object model, sometimes CSS selectors, or XPath, or plain web browser DOM APIs with explicit namespace handling (SVG, MathML, etc.).
  • XHTML DOM mutations: rendered content documents can be modified by the reading system, sometimes in a namespace-aware manner (e.g. insertion of annotation markup).
  • CSS selectors sometimes rely on namespace syntax (e.g. epub:type attribute matching), with fallback on syntactical conventions (e.g. namespace prefix as plain string of characters)
  • I cannot remember the details but I remember dealing with some idiosyncrasies related to polyglot XHTML5.

To conclude: adding support for non-XML HTML is definitely within the realm of possibilities in Thorium / Readium-Desktop, but as with any non-trivial development task, this would require in-depth analysis and methodical regression testing (in other words, it wouldn't just be a case of "adding" HTML support, we would need to make sure that the existing XHTML support doesn't break when implementing dual support for XHTML / HTML in the various content processing modules).
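
A minimal Python sketch of the namespace point, using standard-library parsers as stand-ins for browser DOM APIs (an assumption; Thorium's actual code is not shown here). An XML parser resolves `epub:type` to a namespaced attribute, whereas an HTML tokenizer reports the prefix as plain characters in the attribute name, which is why namespace-aware queries and plain-string fallbacks behave differently. `AttrCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

markup = (
    '<section xmlns:epub="http://www.idpf.org/2007/ops" '
    'epub:type="chapter"></section>'
)

# XML view: the prefix is resolved to the namespace URI.
xml_attrs = dict(ET.fromstring(markup).attrib)

class AttrCollector(HTMLParser):
    """Collect attributes of start tags as the HTML tokenizer sees them."""
    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs.update(dict(attrs))

collector = AttrCollector()
collector.feed(markup)
collector.close()
# HTML view: "epub:type" is just an attribute name containing a colon.
html_attrs = collector.attrs

print(xml_attrs)
print(html_attrs)
```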

@mattgarrish
Member Author

This is by no means a complete analysis, but just from the top of my head:

You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable.

@danielweck
Member

You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable.

I don't remember the details, but indeed there were concerns that the XML-centric CFI processing model wouldn't work reliably with HTML DOM, due to subtleties in text encoding, character offsets / text node normalisation, element boundaries (e.g. self closing tags), etc.
As an implementer, I would certainly anticipate weird edge cases in the path resolution logic (i.e. when converting DOM Ranges to CFI, and vice-versa).
In principle, "polyglot" (X)HTML5 helps mitigate this, but I suspect that in practice we would need to work around some XML / HTML discrepancies in web browsers.

"Polyglot Markup: A robust profile of the HTML5 vocabulary"
W3C Working Group Note 29 September 2015
https://www.w3.org/TR/html-polyglot/

PS: in Thorium / Readium-Desktop we do not make internal use of CFI (unlike the first / original incarnation of Readium SDK), instead we use our own optimised DOM-Range (de)serialisation technique. We can (and do) generate equivalent CFI expressions for our bookmarks / annotations, but we do this in a "vacuum" in the sense that we have no consuming API at the moment (in a future software iteration, we may produce CFI references in the context of interoperable W3C content annotations, if the need arises).

@iherman
Member

iherman commented May 27, 2021

This is by no means a complete analysis, but just from the top of my head:

You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable.

that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted.

@llemeurfr

Apart from RS developers' advice, it is important to take into account the pressure that could come from the publishers' side.

Big publishers are mostly using internal XML based workflows. I say mostly because we know that there is also a small production of carefully crafted (X)HTML based ebooks. Small publishers willing to create reflow EPUB seem to use desktop tools that produce EPUB via a save button. And the vast majority of EPUB FXL publishers are using InDesign.

As a consequence, from what I hear, the pressure is quasi-null.

@mattgarrish
Member Author

Ie, the meaning of CFI can definitely be distorted.

Right, and I believe this is partly why we dropped reading system support for authored CFIs in 3.1 (aside from lack of anyone actually writing these things manually). It was preparatory for possibly adding HTML as they aren't webbish.

The only reason I raised it was that I know there's a possible resurrection of CFI in the works, and this could prove problematic for an HTML syntax. Not a blocker, but a consideration for the viability of CFIs as currently written.

@dauwhe
Contributor

dauwhe commented May 27, 2021

that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted.

I believe the parser would add some things even to the XML syntax of HTML. For example, tables without tbody get one.
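
A minimal Python sketch of why an inserted tbody matters for CFIs. It assumes a simplified reading of the CFI scheme in which the nth child element of a node gets the step /2n, and it ignores text-node steps, ID assertions, and the package-document prefix. `element_steps` and both sample tables are hypothetical; the second table just models a DOM in which a tbody has been added.

```python
from xml.etree import ElementTree as ET

def element_steps(root, target_id):
    """Return CFI-style element steps from root to the element with target_id."""
    def walk(node, path):
        if node.get("id") == target_id:
            return path
        # Child elements are numbered 2, 4, 6, ... in document order.
        for index, child in enumerate(node, start=1):
            found = walk(child, f"{path}/{2 * index}")
            if found is not None:
                return found
        return None
    return walk(root, "")

# The table as authored, and the same table with a tbody inserted.
authored = ET.fromstring('<table><tr><td id="c">x</td></tr></table>')
with_tbody = ET.fromstring(
    '<table><tbody><tr><td id="c">x</td></tr></tbody></table>'
)

print(element_steps(authored, "c"))
print(element_steps(with_tbody, "c"))
```

The same cell ends up one step deeper in the second tree, so a path computed against one form does not resolve to the same place in the other.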

@iherman
Member

iherman commented May 28, 2021

that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted.

I believe the parser would add some things even to the XML syntax of HTML. For example, tables without tbody get one.

Ouch. As usual, @dauwhe is right...

@iherman
Member

iherman commented May 29, 2021

The issue was discussed in a meeting on 2021-05-27

  • no resolutions were taken

3. HTML Serialization

See github issue #636.

Dave Cramer: i've made epubs that have used HTML that was not well-formed XHTML
… sometimes it works
… but we hear that RS is sometimes built using XML toolchains, and would have to be reworked if they can't expect well-formed XHTML
… there are arguments in favor, but I have not felt large support from authors, publishers, or RS

Brady Duga: sounds like a lot of work, so maybe we should wait until there is need for it

Dave Cramer: in my mind epub 4 doesn't have to worry about backwards compatibility with epub 3, chief among which is support for XHTML

Shinya Takami (高見真也): i have no objection, but we have to consider compatibility with existing epubs
… so we must differentiate between HTML5 epub and XHTML5
… now might be a good time to start talking about how spec would have to change

Dave Cramer: we would never mandate a change to HTML5
… compatibility issues would arise where people begin to try to open new HTML5 epubs in older RSes
… has Kobo looked into this?

Wendy Reid: when I looked into it I specifically talked to ingestion side (where most of the problems would pop up)
… they said it probably wouldn't be too bad
… we'd have to add in new libraries for parsing HTML, and maybe do some additional validation
… not impossible, but work would be involved
… agree that we might have to identify HTML5 epubs separately
… speaking for Kobo, we have a long tail for device support

Dave Cramer: comment from Daniel Weck was that Readium would experience several issues if we did this
… also issue with epub:type, given that it is namespaced, it won't carry over to HTML serialization
… another issue was about CFI, which uses a very Xpath like syntax to point to places in XML files
… concern that that might break in HTML serialization
… but it might break in XHTML serialization too (e.g. parser inserting tbody element into the DOM)

Brady Duga: CFI was intended to work with the text, not with the DOM
… so yes, that issue could arise
… this is a hard topic because of how much work would be involved, and the lack of a clear reason to do it, especially vs all the other features we could be working on

Dave Cramer: the other argument that has been put forward for HTML is that HTML tools could be used, but i've never seen an example of one that works on HTML and breaks on XHTML

Proposed resolution: Defer HTML serialization to EPUB 4, close issue 636 (Wendy Reid)

Brady Duga: if we're deferring, should it be closed?

Brady Duga: +1

Shinya Takami (高見真也): +1

Wendy Reid: +1

Masakazu Kitahara: +1

Ben Schroeter: +1

Wendy Reid: i think the idea for epub 4 is that we start with clean slate

Dave Cramer:

Matthew Chan: +1

Marisa DeMeglio: +1

Wendy Reid: not resolved yet. Will return to this with tomorrow's group.

@iherman
Member

iherman commented May 29, 2021

The issue was discussed in a meeting on 2021-05-28

List of resolutions:


1. HTML serialization

See github issue #636.

Continuation of the discussion from the first vF2F meeting

Dave Cramer: time to talk about serialization of html5 - I used to be more of an advocate for this but reality is that this is going to require a fair amount of work for reading systems that depend on xml tools
… I don't see strong demand for a change like this from the authoring side - authors are not asking for it

Brady Duga: last night we discussed this and my position is that yes we can do this but it'll be a bit of work - we're not hearing demand for this from publishers so we'd rather spend the time on more important things
… thinking about self-publishers, while big publishers use xml, these publishers do more of their own work and use their own tools that generate html - not sure if it's true as they don't come to these meetings but should see if this is important to them

Tzviya Siegman: I agree with Dave. I think we need to defer this to another major version of epub. There are some things we've been thinking about doing in epubcheck that would make it easier to maintain. Some of that will be deferred until we have HTML epub.
… we talk a lot of things like identifiers and we don't have that in epub but would be nice to have

Wendy Reid: self-publishing is a constituency we need to think about - we have a big community of them at kobo. I haven't heard these publishers expressing difficulty with xhtml because they have tools to create epubs or getting outsourcers to do the work for them
… it is worth asking as the authors are innovative and see if they're missing out on anything

George Kerscher: the word-to-epub add-in does an excellent job of making epubs. Google docs produces them but would be nice to have a facelift. Scholarly publishing content that is distributed in pdf is awful for disability. Wonder if the process of converting from pdf is easier or harder with xhtml.

Gregorio Pellegrino: what are the benefits of switching from xhtml to html - is it only non-closing tags?

Dave Cramer: some arguments are that it makes it easier to repurpose html content - most scripting libraries aren't tested with xhtml so may work better with html
… it's more annoying to have to change html pages to xhtml to make them valid for epub

Dan Lazin: this seems like the kind of thing we should defer to later - xhtml is losing popularity - we may be fine with it for a while but in the future we may need to change - put on a note on the issue to up-vote and explain why they need it so there is more info
… I came to epub as a self-publisher and I would say that it was not at all difficult to do xhtml - InDesign just does it for you but I cleaned up some things by hand - self-publishers may not be fully tech savvy but are willing to do things that we don't expect

Ivan Herman: example: I made a script to translate W3C TR documents into EPUB. It was very easy to find HTML parsers but when I needed to generate XHTML from it I got syntactically incorrect markup. I had to spend much more time to find a library to convert the HTML to proper XHTML. The long-term evolution of tools works against XHTML
… the doctype, the xml declaration, etc. - the small things from our point of view to make it proper xhtml are complex. Today we may not want to do this but we need to keep the issue open so the community is aware we have to make this change eventually

Brady Duga: most reading systems display their content as html, not xhtml. Moving to html removes an unnecessary intermediary step for us. I'm in favour of punting but we do this every single time

Tzviya Siegman: +1 to duga - we can't keep kicking the can

Avneesh Singh: from a management perspective with only 6 months remaining we need support in reading systems and epubcheck. we should have a compelling reason to move on this now given this timeline.
… it looks safer not to do HTML at this time. we don't need to drop or delay it but we can move it to the community group to incubate it and get momentum. will give us a longer timeline to research before the next revision

Charles LaPierre: +1 to Avneesh suggestion to punt to CG.

Ken Jones: the self-publishers I'm aware of are not writing any code - they use tools so they are not pushing for this. they will wait for their tools to update and use whatever is available

Dan Lazin: agree with Brady that most reading systems are using html internally - we could relax validation standards so the document is supposed to be xhtml but we allow syntactic invalidities - authors just want to dump html into the body, they don't care about the head
… most reading systems will take whatever you throw at them so not complicated to display what is in the body

Ivan Herman: I would be scared to describe formally this kind of looseness. On the other hand, we could say that the reading systems spec could allow them to use html5 but formally the content spec says you must use xhtml5

Brady Duga: I don't think this would solve the problem - reading systems still process the xhtml as xml prior to display so having html in the body would cause us to reject

Dan Lazin: what I'm thinking is to use an attribute on the body tag so you could fork the pipeline or skip the step that requires xml conformance

Dave Cramer: I've gotten the sense from all the discussions that we have kicked this can for a long time, which is not ideal, but I think we've had good reasons for doing that. I don't see eagerness to change without a more compelling reason than we have supplied so far
… as Dan said, xhtml becomes less viable with time. I think at some point we have to do a bunch of changes at once - EPUB 4 - we're constrained on both sides by EPUB 3 and the inability to change requirements
… we're improving the spec but we're not changing how epub works
… we could also remove all the deprecated and unwanted parts that exist today

Tzviya Siegman: we tried that in 3.1 and it didn't go well
… we'd have to solve the namespace issue, etc. - something we'd all like to do but now is not the time
… maybe this is a community group topic - not html serialization but issues like how to get rid of epub:type, etc.

Dave Cramer: we have the goal of making EPUB 3.3 a W3C REC but do we want to persist this incremental backwards compatible mode for 3.4, 3.5, etc. - we need to make a break at some point and not just have working groups to make editorial changes

Avneesh Singh: looking at W3C culture, a lot of work begins in the CG before getting a WG to formalize

Wendy Reid: I want to have some of these ideas incubated as we're running into the limits of EPUB 3 compatibility. we need to go big at some point.
… I think we need a clear goal for the CG, not just to incubate ideas

Dave Cramer: are we prepared to defer this issue and hand it off to the CG to look at the long-term future of epub?

George Kerscher: is there a reason to keep the ncx?
… I know why it got in there: so that US publishers could provide it

Ivan Herman: we can't do anything about people including it - not part of epub 3

Gregorio Pellegrino: reading systems use it for compliance to allow old readers to open epub 3

Brady Duga: forbidding it doesn't help me as I have legacy epubs with it
… means more bookkeeping in the ingestion pipeline

Dave Cramer: would be different in an epub 4

Ivan Herman: you said defer this - what do you mean? are we closing in github for now or are we keeping it open for next version?

Dave Cramer: github issues are easy to find with labels. I would put a push-to-CG-closed label on it

Proposed resolution: Close issue 636, move discussion to the community group (Wendy Reid)

Deborah Kaplan: +1

Dan Lazin: +1

Ivan Herman: +1

Wendy Reid: +1

George Kerscher: +1

Gregorio Pellegrino: +1

Ken Jones: +1

Tzviya Siegman: +1

Matt Garrish: +1

Brady Duga: +1

Avneesh Singh: +1

Ben Schroeter: +1

Dave Cramer:

Charles LaPierre: +1

Ivan Herman: in the earlier discussion last night there was a similar agreement and partial vote yielding the same result.

Resolution #1: Close issue 636, move discussion to the community group

@iherman iherman added To-be-incubated-further and removed Agenda+ F2F Possible agenda item for F2F labels May 29, 2021
@iherman
Member

iherman commented May 29, 2021

@wareid @dauwhe @shiestyle in accordance with the vF2F resolution, I have created a new label 'to-be-incubated-further' and used it for this issue before closing.

@iherman
Member

iherman commented Apr 23, 2022

See also #2259, which raises similar issues and arguments.

@kevinhendricks

And to be fair, Amazon is not an epub vendor and popup footnotes were specifically suggested in the original epub3 spec, so these arguments are specious at best.
Moving to html will invalidate all existing epub readers and publisher tool chains, and will allow spaghetti html code that would require a full browser whatwg parser to do just about anything. Finally, enforcing xml syntax is not difficult and can be done by any decent serializer, and all browser engines support xhtml.
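
As a rough illustration of the serializer point, here is a toy sketch that re-serializes HTML-syntax markup as well-formed XML using only the Python standard library. It is an assumption-laden example, not how Sigil, gumbo, or epubcheck work: the class name `XhtmlBuilder` and the `VOID` set are hypothetical, and the code does not implement the HTML5 parsing rules (implied end tags, for instance, are not handled).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML void elements: they never take a closing tag in HTML syntax.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class XhtmlBuilder(HTMLParser):
    """Toy converter: tokenizes leniently, then emits well-formed XML."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "hidden") get their own name as value.
        fixed = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(tag, fixed)
        if tag in VOID:
            self.builder.end(tag)      # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                     # stray end tag: ignore it
        while self.open_tags:          # close anything left open inside
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:             # drop text outside the root element
            self.builder.data(data)

    def to_xml(self, markup: str) -> str:
        self.feed(markup)
        self.close()
        while self.open_tags:          # close elements never closed in input
            self.builder.end(self.open_tags.pop())
        return ET.tostring(self.builder.close(), encoding="unicode")


html_input = "<div><p>one<br>two</p><img src=a.png alt=x></div>"
xml_output = XhtmlBuilder().to_xml(html_input)
print(xml_output)
ET.fromstring(xml_output)  # raises ParseError if the output were not XML
```

Void elements come out self-closed and unquoted attribute values come out quoted, which is what lets a strict XML parser accept the result.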

Epub3 is finally taking off, especially internationally, the tools to create them are many and free alternatives exist. Let's not break everything now. Slowly evolving the standard while nudging people in the right direction with epubcheck seems like the safest and best decision possible.
