Revised Package Metadata proposal #642
Comments
Media overlays will require a new duration attribute to handle per-item durations that were formerly handled by meta/refines. |
Hi Matt—Is there an example of "linked records" in the repo or somewhere else that shows a short example of how metadata is moved from the OPF to a linked record ? |
The metadata group has a draft guide to common formats at http://www.idpf.org/epub/metadata/ There isn't a mapping guide to move dc elements to these. The topic of discrepancies in naming/structure was very briefly touched on on the metadata group's call today, but I don't know if the group will try to attempt cross mapping between metadata standards. I'll update if more happens on a future call. |
Excellent. Thanks! |
I think that we do not have backward compatibility issues, so I would think that for RDFa we should refer to HTML5 and not XHTML. This will be better in the long term. I.e., it should be metadata.html, and the media type text/html. Also, because we are talking about using Schema.org in this table and not any vocabulary in general, it is probably better to refer to RDFa 1.1 Lite, rather than RDFa 1.1 in general; RDFa 1.1 Lite has been developed in cooperation with the schema.org people after all, and it is way easier to use for end users. |
I believe we did that table before the html serialization was formally approved. I've added both options for review, but we should discuss with the group on a future call whether we move ahead with all or some of these entries (not just the schema.org, but mods and marcxml, too). I didn't really get much response to that question when I announced, but it was also back just before the holidays. We risk giving the perception that these are all widely used records and scaring people off when I'm not sure any get widely attached now and we don't really know which will come into common use in the future. |
We also risk looking like we are limiting the options to the handful listed
|
It is not clear to me the rationale that led to choosing the arbitrary subset of For example, certain users might be interested more in seeing With the proposed move of "richer metadata" to external records my feeling is that, in practice, creators will be discouraged to add said "richer metadata" to their ebooks, because a. their workflow will be more complex and b. independent reading systems will not support fetching and parsing said external meta (--- of course Readium will support whatever you decide to do...). Also, if you want to take a radical approach, go for it in full: go "no optional, no multiple". Select a subset of metadata you want, and for each metadatum, require exactly one element with one value (possibly empty). One might ask why only one And, for God's sake, at least deprecate the "not-blessed" |
The metadata was changed based on a survey of developers to determine what reading systems are actually using, not based on what might be nice to have or what could be useful one day use cases. That approach has existed for the last five years and has only led to confusion and complaints that most metadata is not used anywhere. The restriction to one identifier is because reading systems only use the unique identifier for identification. Allowing multiple titles hasn't led to their use in display, so publishers already are concatenating them, hence that restriction. The group was initially toying with the idea of a single creator field for the same reason, but, unlike the title, reading systems will sort and arrange by the separate creator names. They just sometimes concatenate names in ways the publishers don't want. I'm actually a bit surprised publisher ended up in the list and not contributors, but that's where surveying what is in use turned up some surprises. And there's no particular difference between using a DCMES element and the equivalent property in the meta tag, so it's not like the ability to express the other metadata is gone. I've already started a discussion on the working group list since the release that we need to be more explicit about the relationship between the two, and note how the restrictions apply not just to elements because it will affect how metadata translates to the new browser-friendly format (i.e., elements and properties will both translate to properties in json). There's also a proposal for the next cycle of the revision to allow nesting of meta elements and proper alignment with RDFa (Lite) so that a real framework exists for extensions, and so that the package document itself isn't restricted, only what is defined for use by all reading systems. 
Having that minimal metadata set retains compatibility with epub 3.0 reading systems that expect the elements, while shifting the rest to the meta element extension and external records. That was the goal of the group. |
I am a developer of Sigil, an open-source GPL epub editor that runs on Linux, Mac, and Windows. Sigil has long supported epub2 but has only recently started to add epub3 support. Of course, I just found this issue immediately after coding up a user-friendly tree-view-based GUI editor for epub3 metadata. Figures ... Please consider the user perspective of how users use ebook library software like calibre to sort and find their ebooks. Removing things like dc:description goes quite contrary to the needs of typical users. Furthermore, you are generating yet another non-backwards-compatible change and ignoring small/independent epub publishers in the process. And you are partially reinventing the wheel you just broke. Epub2 small publishers already knew how to use the opf attribute namespace to add opf:scheme, opf:file-as, opf:role directly to dc:creator to achieve what they needed. Epub3 then broke that with the refines nonsense and then added insult to injury by allowing chaining of refines (a sure anti-KISS darwin award winning idea if I ever saw one!). Now you are proposing to drop support for role and contributor. Where is the sanity in this? Why not stick to simple dc:* metadata and allow role, scheme, and file-as attributes directly on them (no attribute namespace prefix needed)? Display-sequence can be indicated by the sequence presented in the metadata, again simplifying things. This is rich enough for small publishers and users to actually use, would be well understood by current epub2 developers, and eliminates the need for refines and chained refines. Seems like a simple, logical choice to me. |
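To make the attribute-based idea above concrete, here is a minimal sketch of how a reading system might read creators if role and file-as sat directly on dc:creator. The un-prefixed attribute names, the sample metadata, and the `read_creators` helper are assumptions made for illustration; none of them comes from a published EPUB specification.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical package metadata: role and file-as sit directly on dc:creator,
# and display order is simply document order (no refines chains).
OPF_METADATA = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example Anthology</dc:title>
  <dc:creator role="edt" file-as="Doe, Jane">Jane Doe</dc:creator>
  <dc:creator role="aut" file-as="Roe, Richard">Richard Roe</dc:creator>
</metadata>
"""

def read_creators(xml_text):
    """Return creators in document order with their role and sort key."""
    root = ET.fromstring(xml_text)
    creators = []
    for el in root.findall(f"{{{DC}}}creator"):
        name = (el.text or "").strip()
        creators.append({
            "name": name,
            "role": el.get("role"),              # None when the attribute is absent
            "file_as": el.get("file-as", name),  # fall back to the display name
        })
    return creators

creators = read_creators(OPF_METADATA)
```

Every fact about a creator is available from a single element lookup, which is the simplification being argued for; with refines, the same data needs a second pass to join meta elements back to their targets by id.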
Given the growth in self-publishing, using epub or kindle built from epubs, don't you think that adding the voice of the ebook user, and small ebook publishers and even self-publishers, would be important to your working group? Asking developers from just the big publishing houses and other institutional interests is what made the epub3 spec the mess it turned out to be. Simplification is a laudable goal and removing non-html5 spec complicating pieces such as epub:type (hopefully replaced by "role"), epub:switch, epub:trigger, the need for namespaces everyplace, seems like a good idea that everyone will support. And simplifying metadata is good too as it turned into a dumping ground for anything not fitting in the opf, supply chain info, and special interest groups. I just feel your proposal to drop almost everything is not geared to users and small, independent publishers, and would force a meta property duplicate version of standard dc tags just to do what we used to do quite successfully in epub2 and simple dc metadata and some simple extra attributes like file-as, role, and scheme. |
We can't get rid of the meta tag, as it's needed for core epub functionality (fixed layout metadata, media overlay metadata, etc.). The problem of the dc elements and properties both translating to properties in json hasn't been fully addressed yet, and it could be a case for a return to allowing any dc: elements in the package and restricting the dc: properties for simplicity. It's too early to say, and I was only addressing the thinking of the metadata group that went into this proposed change. But this is why we've put out the editor's draft for review and comment. The feedback is appreciated. |
Matt, thank you for taking time to write the rationale behind the draft. However, I am still unconvinced. Kevin listed some use cases where some of the "forbidden" dc: elements are used. Let me just add two more examples --- with apologies for the self-citation.
In both cases the current EPUB 3.1 would force me to coerce the dc semantics or to move some pieces of information in an external record or in a meta essentially replicating the role of the original dc: element. Of course I am not happy with this, especially because I still do not see the "harm" that allowing all the optional/multiple dc: elements produce. On the other hand, I am the first person happy to see the current refine mechanism go, it never felt natural to me. A simpler, attribute-based mechanism for roles and machine-readable values would look more appealing to me as well. |
Don't forget, this draft is intended to be provocative, as noted at the top of the changes document. The working group is trying to gauge which features are actually in use, as there's a strong desire to move forward without so much baggage that complicates integration with the open web. But I'm also not out to argue that what is in the specification is right and unchangeable, as the ambition is not to force changes that aren't good for the ecosystem. I just wanted to give some clarification about how we ended up at that set. The refines attribute is an example of too much compromise, and that's the kind of change we're trying to avoid. |
I'm just dropping in to say that if you want to make backwards incompatible changes, please, don't do it in a point release. From glancing over your changes document, it seems to me that you want to make several breaking changes. That's great, EPUB 3 could do with some serious breaking. But name it EPUB 4. I really don't want to have to tell my users that calibre supports EPUB 3.1 but not EPUB 3. As for the proposed metadata changes. I'll say the following metadata fields are most often used by calibre users: title Make the implementation of a small set of fields (preferably the ones I listed above) dead simple and as backwards compatible as possible. People in the wild write all sorts of broken software, and EPUB does not help with its insane and completely unnecessary level of complexity. That means that EPUB using applications have to deal not just with an overly complicated spec but also dozens of broken implementations of it. |
In addition to the (very good) points raised by Kevin and Kovid, I'd like to add a sample use case of my own (albeit more narrow in scope): Anthologies / short story collections. Typically, such publications have one or more editors and a bunch of contributing authors. The canonical way to specify the contributors is:
In this case, the dc:contributor elements are almost as important as the dc:creator elements; relegating them to separate storage in a backwards incompatible fashion will in practice result in this metadata being rendered inaccessible to presentation. Please do not underestimate the importance of the distributed ecosystem of scattered, small-scale software. Like Kevin mentioned, there are more stakeholders in EPUB than the large publishing shops. Pre-ossifying the spec that way will impede grassroots adoption. |
cc @kovidgoyal @mihailim @kevinhendricks In addition to the work on OPF itself, there's also an on-going effort to design an OPF alternative that'll be used for unzipped EPUB on the Web (and potentially for EPUB 4). This effort is based on JSON-LD and I've tried to accommodate some of the needs expressed in these comments:
I'd like to get your opinion on the current proposal. The complete proposal for an OPF alternative is available at: https://github.com/dauwhe/epub31-bff Quick question for @kovidgoyal, for the series_index in Calibre, do you use an integer? schema.org supports both string and integer for the position in a series and I can find arguments for/against both of them. |
calibre uses a floating point number for series_index with a max precision of two digits. So you can have 1.01 to 1.99. I have found that this level of precision meets the needs of nearly all users. IMO it needs to be a numeric type: how does one order books in a series with a string type in the general case? Indeed, the very name of the field, series_index, indicates it needs to be numeric. |
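The ordering concern can be shown with a short sketch. The book titles and index values below are made up for illustration; the point is only how string comparison and numeric comparison disagree on the same data.

```python
# Made-up series entries; series_index is stored as a string on purpose.
books = [
    {"title": "Book Ten", "series_index": "10"},
    {"title": "Book Two", "series_index": "2"},
    {"title": "Interlude", "series_index": "1.5"},
    {"title": "Book One", "series_index": "1"},
]

# Lexicographic ordering compares character by character, so "10" sorts before "2".
by_string = [b["title"] for b in sorted(books, key=lambda b: b["series_index"])]

# Numeric ordering converts to float first, so fractional positions slot in naturally.
by_number = [b["title"] for b in sorted(books, key=lambda b: float(b["series_index"]))]

print(by_string)  # ['Book One', 'Interlude', 'Book Ten', 'Book Two']
print(by_number)  # ['Book One', 'Interlude', 'Book Two', 'Book Ten']
```

A string-typed position therefore forces every consumer to parse the value before sorting, whereas a numeric type can be sorted directly.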
From quickly looking through that gist, some comments;
|
"author": ["Jules Verne", "Alexandre Dumas"]
It works for both literals and objects:
"author": [
  {
    "name": "Jules Verne",
    "identifier": "http://isni.org/isni/0000000121400562",
    "sort_as": "Verne, Jules"
  }, {
    "name": "Alexandre Dumas",
    "identifier": "http://isni.org/isni/0000000121012885",
    "sort_as": "Dumas, Alexandre"
  }
]
That said, EPUB BFF will most likely align with OPF in 3.1 for uniqueness of some elements (one identifier, one title). It doesn't mean that you can't include more identifiers though, but you'll have to use extensions for that:
"identifier": "urn:uuid:2e37ec76-1242-4698-8cf7-b65747676c0f",
"http://schema.org/isbn": "9780000000001",
"https://calibre-ebook.com/internal_identifier": "18492"
|
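As a rough sketch of how a consumer might handle the extension approach described above, the snippet below separates core terms from extension properties keyed by a full URI. The rule "a key that looks like an absolute http(s) URI is an extension", the `split_extensions` helper, and the example.org key are assumptions for illustration, not part of the EPUB BFF proposal.

```python
import json

# Sample metadata: one core identifier plus extension properties keyed by URI.
# The example.org key is a hypothetical application-specific extension.
metadata = json.loads("""
{
  "title": "Example Title",
  "identifier": "urn:uuid:2e37ec76-1242-4698-8cf7-b65747676c0f",
  "http://schema.org/isbn": "9780000000001",
  "https://example.org/internal_identifier": "18492"
}
""")

def split_extensions(meta):
    """Split metadata into (core terms, extension properties keyed by URI)."""
    core, extensions = {}, {}
    for key, value in meta.items():
        # Assumed convention: absolute http(s) URIs mark extension properties.
        target = extensions if key.startswith(("http://", "https://")) else core
        target[key] = value
    return core, extensions

core, extensions = split_extensions(metadata)
```

A reading system that only understands the core set can ignore `extensions` entirely, while library software can still look up the additional identifiers it cares about.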
Luc On 26 Feb 2016, at 16:55, Hadrien Gardeur <notifications@github.com> wrote:
"creator": ["Jules Verne", "Alexandre Dumas"] It works for both literals and objects: "creator": [ That said, EPUB BFF will most likely align with OPF in 3.1 for uniqueness of some elements (one identifier, one title). It doesn't mean that you can't include more identifiers though, but you'll have to use extensions for that: "identifier": "urn:uuid:2e37ec76-1242-4698-8cf7-b65747676c0f",
— |
|
@kovidgoyal that's a fair point (1) but there are pretty big benefits for supporting both:
|
Syntax for the case of a single value is simply two extra characters:
But, you say you are going to restrict title and identifiers to not allow
See (1)
Yes, but there is an extra if statement required. And people that write I can't count the number of programs that have problems with XML |
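The "extra if statement" can be illustrated with a small normalising helper. This is only a sketch of what a consumer might write if a property may be a bare string, a single object, or an array of either; the `as_list` and `author_names` names are hypothetical.

```python
def as_list(value):
    """Normalise a property that may be missing, a single value, or an array."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]  # this branch is the extra check a consumer must remember

def author_names(metadata):
    """Return display names whether authors are literals or objects."""
    names = []
    for entry in as_list(metadata.get("author")):
        names.append(entry["name"] if isinstance(entry, dict) else entry)
    return names

single = author_names({"author": "Jules Verne"})
mixed = author_names({"author": [{"name": "Jules Verne"}, "Alexandre Dumas"]})
missing = author_names({})
```

If the format always used arrays, `as_list` would be unnecessary; allowing both forms moves that work into every parser, which is the trade-off under discussion.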
For namespaces, that won't be a problem anymore since the current proposal is to disallow additional context definition. You can only encounter two type of elements:
Regarding always using arrays vs allowing both strings/arrays, it's basically deciding between:
That said I'm not entirely sure what you're advocating for, are you saying that:
I guess it's probably 1 since 2 really doesn't make much sense (people would get false expectations if we start using an array for identifier or title). |
I really hope you are not encouraging people to write JSON by hand. That
Or to put it another way, reducing the probability and therefore number
Yes. Or do not allow any properties to have more than one value. Since |
I really don't think that we want to restrict authors, translators and such to a single element. It's feasible of course but a bad idea plus definitely a step backwards in terms of what you can do with EPUB metadata. I'll bring that point to the group to discuss it, as long as it's restricted to elements that are 1-* it's worth considering. |
Luckily epub3 has a set of predefined prefixes that need no xmlns definitions and are not allowed to be redefined. It helps to clean up the overhead of having to track url prefixed elements all around the parsed tree. And I for one am very glad you/calibre remap inconsistent prefixes to more established versions given how inconsistent many xml packages are with namespaced attributes especially. In general, I completely agree with you. Too much choice and not enough standardization simply makes software in the wild prone to bugs. And making backwards incompatible changes leads to many problems as well. It can take years for code to find all of the corner cases and handle them. I personally think a simple epub4 with polyglot xhtml/html5 as its base, removal of refines and return to simple dc metadata with extra attributes for file-as and role, standardized prefixes for namespaces, keeping the ncx, keeping the guide, stop farting with the recognized vocabulary, etc. would allow an epub 4 to flourish. Adding more "renditions", "collections", "distributable objects", and educational epubs just clutters things up and adds no real value. Standards makers/developers really do need to take a course in basic engineering 101 - KISS. |
@kevinhendricks Yes, namespaces are XML's most unfortunate feature. I have never come across any use of namespaces in any XML based application that could not have been solved in a simpler and more robust fashion in other ways. The problem with standards processes is that there are too many stakeholders and standards committees inevitably try to satisfy everyone, which usually means keeping things simple is a lost cause. Human nature I guess. @HadrienGardeur All the best, I hope you can convince the group. While I'm here, I'll just point out that calibre uses multiple values for the identifiers field as well. For example, it is used to store ids corresponding to a book from multiple sites: amazon, google books, goodreads, worldcat, etc. So for me, allowing only a single identifier will mean that calibre will have to use a custom identifiers field. |
After giving this some thought I'm convinced it's worth the effort to restrict what metadata can be included in the package document. Publishers will include whatever metadata they deem necessary, regardless of where it goes. This appears to be acknowledged by the working group because it's being suggested to move metadata out to other files besides the package document. But if we're explicitly allowing metadata outside of the package document, we haven't actually reduced metadata at all, and then reading systems are going to have to adapt their current epub parsing engines to deal with that change, for no noticeable gain. The metadata is still there somewhere, and it still has to be read somehow. (And publishers will have to adapt their workflows too.) In this scenario all we've done is make a headache for developers of reading systems and big-data processors of epubs, who have to adjust how they parse this big change in the spec; and we've made a headache for publishers, who have to adjust their publishing workflow to put their metadata elsewhere; and there's no benefit for human readers, who don't really care how or where metadata is stored, as long as the reading system displays it. It seems like ideological purity at the expense of almost everyone involved in the actual nuts and bolts production of ebooks and reading systems: shunting the problem of messy metadata from one file to another, but not actually solving it. I for one think having as much metadata in ebooks as possible is important. Not all of it has to be displayed, but a metadata-rich self-contained ebook file is important for machine processing and archival purposes. Maybe the metadata isn't in use today, but a decade or two from now having a metadata-rich self-contained file will be appreciated by archivists and readers on future platforms. 
Maybe instead of restricting metadata, the spec could pick an existing metadata schema to champion, like schema.org, instead of using a mess of DCMES/epub-specific definitions that each RS interprets in its own special way. A subset of that metadata could be considered "required" to be understood in a certain way by reading systems, and the rest could be ignored, but available to any person or software that has an interest. Re. JSON, I'm a little confused as to where that comes into play. If it's being suggested to include JSON as a metadata format in an epub file, I very strongly feel that the last thing we need in an epub is another language and format for ebook producers and RS developers to have to deal with. XML/HTML work just fine and are very well suited to describing the kind of metadata that a typical ebook includes. |
Sorry for being late to the party but I feel like this is one important question. @mattgarrish wrote
Are the (anonymized) results of this survey published somewhere? I've searched this morning but couldn't find anything about that, not even a link in the Google Docs. Rest assured it’s not about trust, some of us are just a little bit curious since those surveys are being referred to but it seems—and I might be wrong so if this is indeed the case, sorry for the inconvenience—we don't have access to the results. |
A summary of the result was written up here: |
There should be a formal update on this issue in the next week or so. The metadata group has been considering this feedback and a revised proposal would re-introduce the stripped elements with some additional attributes to replace functionality removed by taking away the refines attribute. It still needs to be vetted with the full working group, though. |
@mattgarrish Thank you very much. |
It looks like we won't be publishing the next editor's draft for a little while still, but to update this issue the full set of DC elements will be returned. The refines attribute will be superseded by a set of dedicated attributes that cover the key information that reading systems need (role, file-as, id-type and a few others). |
In moving to have the Package Document include only bibliographic metadata used by reading systems for display/sorting in bookshelves, with richer metadata delegated to formal records, the following changes have been proposed:
See also the proposal at https://docs.google.com/document/d/1okss2ictXwVqx7aQJ4ARi2ALl2GIovyz5nURE3QgmIA/edit#