Pagination bug for IA IIIF manifests displayed in Mirador puts spine on outside edge #50

amandelman · 2020-09-10T11:02:54Z

When loading IA IIIF manifests into Mirador 3's book view, a pagination error results in displaying the pages with the spine on the outside edge instead of in the middle. It looks like this could be because IA inserts an extra cover page showing calibration marks, which then throws off the pagination sequence, but I'm not sure.
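To illustrate the suspected cause, here is a minimal sketch of how one extra leading canvas shifts every spread. It assumes the book view shows the first canvas alone and then pairs the rest, which may not match Mirador's actual grouping logic exactly; `book_view_spreads` and the page labels are hypothetical.

```python
def book_view_spreads(canvases):
    """Group canvases as a two-page view might: first one alone, then pairs.

    Assumption: this mirrors a typical book view, not Mirador's exact code.
    """
    if not canvases:
        return []
    spreads = [[canvases[0]]]          # the cover is shown on its own
    rest = canvases[1:]
    for i in range(0, len(rest), 2):   # remaining canvases are paired up
        spreads.append(rest[i:i + 2])
    return spreads


pages = ["cover", "p1", "p2", "p3", "p4"]   # hypothetical page labels
expected = book_view_spreads(pages)
shifted = book_view_spreads(["color card"] + pages)

print(expected)  # [['cover'], ['p1', 'p2'], ['p3', 'p4']]
print(shifted)   # [['color card'], ['cover', 'p1'], ['p2', 'p3'], ['p4']]
```

With the extra canvas in front, each pair holds the two sides of one leaf rather than two facing pages, which would put the spine on the outside edge.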

Example manifests:

https://iiif.archivelab.org/iiif/americabeinglate00mont/manifest.json
https://iiif.archivelab.org/iiif/siguenseunosbrev01gilb/manifest.json
https://iiif.archivelab.org/iiif/arteyvocabulario00unkn/manifest.json

See what the Mirador team has to say about it: ProjectMirador/mirador#3244

mekarpeles · 2020-09-10T18:42:29Z

The Manifest generation process for Internet Archive's IIIF server is pretty vanilla --

The manifest is generated here:
https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L77-L178

And in this sub-section, we consider logic relevant to dealing with Internet Archive books/texts:
https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L138-L178

Furthermore, this section considers the individual pages of a text by making an http request to an existing archive.org Manifest service (used by our BookReader) which returns its own formatted (non-iiif) "manifest":
https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L169-L177

The json this IA Manifest service returns looks like:
https://api.archivelab.org/books/americabeinglate00mont/ia_manifest

I don't see an easy way from this data to identify / filter out color-cards.

In order to do this filtering, we'd need to add another http request to another/different endpoint to fetch its "scandata.xml". Every archive.org book item has a "scandata.xml" file. It "usually" (there are weird exceptions, e.g. multiple books living in the same item; such edge cases I don't have time to fix) lives at...

https://archive.org/download/:identifier/:identifier_scandata.xml

so, for instance, if the ID is americabeinglate00mont, we could fetch the scandata xml (which gives us more detailed information about page types) by fetching
https://archive.org/download/americabeinglate00mont/americabeinglate00mont_scandata.xml

So, fixing this issue isn't exactly trivial (but not difficult):

1. After the HTTP request for the IA manifest (~line 169), we'd add another request to fetch the scandata.xml.
2. We'd parse the scandata XML and identify color-card pages.
3. We'd filter out these color cards by leaf or page, using both the IA Manifest data and the scandata.xml.
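The steps above could be sketched roughly as follows. This is not code from resolver.py: the function names are hypothetical, and the scandata.xml layout (`page` elements with a `leafNum` attribute and a `pageType` child whose value is "Color Card") is an assumption that should be checked against a real file. The sample XML is illustrative, not fetched from an actual item.

```python
import xml.etree.ElementTree as ET


def scandata_url(identifier):
    """Build the usual scandata.xml location for an archive.org item."""
    return f"https://archive.org/download/{identifier}/{identifier}_scandata.xml"


def color_card_leaves(scandata_xml):
    """Return the leaf numbers whose pageType is 'Color Card' (assumed layout)."""
    root = ET.fromstring(scandata_xml)
    leaves = set()
    for page in root.iter("page"):
        page_type = (page.findtext("pageType") or "").strip().lower()
        leaf = page.get("leafNum")
        if page_type == "color card" and leaf is not None:
            leaves.add(int(leaf))
    return leaves


def drop_color_cards(leaf_nums, scandata_xml):
    """Filter a list of leaf numbers, keeping everything except color cards."""
    skip = color_card_leaves(scandata_xml)
    return [leaf for leaf in leaf_nums if leaf not in skip]


# Illustrative sample only; a real scandata.xml carries many more fields.
SAMPLE = """<book><pageData>
<page leafNum="0"><pageType>Color Card</pageType></page>
<page leafNum="1"><pageType>Cover</pageType></page>
<page leafNum="2"><pageType>Normal</pageType></page>
<page leafNum="3"><pageType>Color Card</pageType></page>
</pageData></book>"""

print(scandata_url("americabeinglate00mont"))
print(drop_color_cards([0, 1, 2, 3], SAMPLE))  # [1, 2]
```

In the real resolver, the XML string would come from an extra HTTP request to the URL built by `scandata_url`, and the leaf numbers from the IA Manifest response.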

amandelman · 2020-09-10T18:53:12Z

Thanks so much for your attention on this and the super thorough reply. Our backender on the project (@lucasmoeskops) just figured out a work-around by telling Mirador to move the first page to the end, but since we'd really like a cleaner solution than that, this is super helpful. Plus, I imagine there are other Mirador-based projects out there that would appreciate the fix too.

mekarpeles · 2020-09-10T22:32:37Z

Yes, @amandelman + @lucasmoeskops, be careful: not every book item on archive.org is/was digitized by the Internet Archive, so it is actually likely that an item will not have a color card. In these cases, moving the first page to the end won't work.

It may help to couple this approach with a heuristic. For example, https://archive.org/metadata/americabeinglate00mont lists a field called scanningcenter when an item was digitized at IA.

We already fetch this metadata as json in the code:
see: https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L80

metadata = resp.get("metadata", {})

Therefore, after line https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L138 you may wish to check....

if metadata.get('scanningcenter'):
    # remove color card pages (i.e. [0]th and [-1]th)
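A minimal sketch of that check, assuming the pages are available as a list and that IA-scanned items carry a color card at both ends, as the comment above suggests. `strip_color_cards` is a hypothetical helper, not a function in resolver.py.

```python
def strip_color_cards(pages, metadata):
    """Drop the first and last page when the item was scanned at an IA center.

    Assumption: the presence of 'scanningcenter' implies color cards at
    both ends, which is a heuristic and may not hold for every item.
    """
    if metadata.get("scanningcenter") and len(pages) > 2:
        return pages[1:-1]
    return list(pages)


ia_scanned = strip_color_cards(["card", "cover", "p1", "card"],
                               {"scanningcenter": "boston"})
other = strip_color_cards(["cover", "p1"], {})

print(ia_scanned)  # ['cover', 'p1']
print(other)       # ['cover', 'p1']
```

Items without a scanningcenter field pass through unchanged, so manifests for books digitized elsewhere keep all of their pages.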

This would probably be useful to all patrons and partners, so feel free to open a PR with this patch if it works reliably for you :)

Thank you!

jcmundy · 2020-09-11T15:14:31Z

Hi, I just wanted to comment that we've also just recently noticed that some volumes (not all) have an additional first canvas. I think it's the same issue that you are discussing here. In our work this seems to throw off the relationship between the canvas/page number and the OCR canvas/page number on those volumes. We've been working around it, but I'm happy to see a discussion about it coming up. Thanks!

mekarpeles · 2023-03-13T17:14:14Z

Similar to an email I received today: some partners are reporting that some manifests are being generated erroneously with a page 0 ($0). This may be because of color cards or just an off-by-one.

In Mirador: https://iiif.archivelab.org/iiif/floragraecasive3sibt
https://iiif.archivelab.org/iiif/floragraecasive3sibt/manifest.json

glenrobson added this to the IIIF v3 Update - First steps milestone Mar 13, 2023
