Pagination bug for IA IIIF manifests displayed in Mirador puts spine on outside edge #50

Open
amandelman opened this issue Sep 10, 2020 · 5 comments

Comments

@amandelman

When loading IA IIIF manifests into Mirador 3's book view, a pagination error results in displaying the pages with the spine on the outside edge instead of in the middle. It looks like this could be because IA inserts an extra cover page showing calibration marks, which then throws off the pagination sequence, but I'm not sure.

Example manifests:

https://iiif.archivelab.org/iiif/americabeinglate00mont/manifest.json
https://iiif.archivelab.org/iiif/siguenseunosbrev01gilb/manifest.json
https://iiif.archivelab.org/iiif/arteyvocabulario00unkn/manifest.json

See what the Mirador team has to say about it: ProjectMirador/mirador#3244
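
To illustrate why one extra leading canvas would flip the spreads, here is a minimal sketch. It assumes the viewer follows the IIIF "paged" convention (first canvas shown alone, every later pair shown as an opening); the function name and page labels are made up for illustration and are not taken from Mirador's code.

```python
def paged_spreads(canvases):
    """Group canvases into openings: first canvas alone, then pairs."""
    if not canvases:
        return []
    spreads = [(canvases[0],)]
    rest = canvases[1:]
    for i in range(0, len(rest), 2):
        spreads.append(tuple(rest[i:i + 2]))
    return spreads


# Hypothetical labels: "1v" is the back of leaf 1, "2r" the front of leaf 2.
book = ["front cover", "1v", "2r", "2v", "3r"]

# Without the extra canvas, each opening pairs a verso with the next recto.
print(paged_spreads(book))

# With a leading color card, every pair shifts by one, so the two sides
# of the same leaf end up side by side (spine on the outside edge).
print(paged_spreads(["color card"] + book))
```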

@mekarpeles
Member

The Manifest generation process for Internet Archive's IIIF server is pretty vanilla --

The manifest is generated here:
https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L77-L178

And in this sub-section, we consider logic relevant to dealing with Internet Archive books/texts:
https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L138-L178

Furthermore, this section considers the individual pages of a text by making an http request to an existing archive.org Manifest service (used by our BookReader) which returns its own formatted (non-iiif) "manifest":
https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L169-L177

The json this IA Manifest service returns looks like:
https://api.archivelab.org/books/americabeinglate00mont/ia_manifest

I don't see an easy way from this data to identify / filter out color-cards.

In order to do this filtering, we'd need to add another http request to another/different endpoint to fetch its "scandata.xml". Every archive.org book item has a "scandata.xml" file. It "usually" (there are weird exceptions, e.g. multiple books living in the same item; such edge cases I don't have time to fix) lives at...

https://archive.org/download/:identifier/:identifier_scandata.xml

so, for instance, if the ID is americabeinglate00mont, we could fetch the scandata xml (which gives us more detailed information about page types) by fetching
https://archive.org/download/americabeinglate00mont/americabeinglate00mont_scandata.xml
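
As a rough sketch, the URL pattern above could be built and fetched like this. The helper names are hypothetical (they are not in resolver.py), and this only covers the "usual" layout, not the multi-book edge cases mentioned above.

```python
from urllib.request import urlopen


def scandata_url(identifier):
    """Build the usual scandata.xml location for an archive.org item."""
    return "https://archive.org/download/{0}/{0}_scandata.xml".format(identifier)


def fetch_scandata(identifier, timeout=10):
    """Fetch the scandata XML as text (performs a network request)."""
    with urlopen(scandata_url(identifier), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


print(scandata_url("americabeinglate00mont"))
```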


So, fixing this issue isn't exactly trivial (but not difficult):

  1. after the http request for the ia manifest ~line 169, we'd add another request to fetch the scandata.xml
  2. we'd parse the scandata xml and identify the color-card pages
  3. we'd filter out these color cards by leaf or page by using both the IA Manifest data and the Scandata.xml
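
Steps 2 and 3 could be sketched roughly as follows. This assumes scandata.xml contains `<page leafNum="...">` elements with a `<pageType>` child whose value is "Color Card" for color cards; the sample XML, the function names, and the `leafNum` key on the page dicts are assumptions for illustration, so check them against a real scandata.xml and the real IA manifest data before relying on this.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the assumed scandata layout (not real data).
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><pageType>Color Card</pageType></page>
    <page leafNum="1"><pageType>Cover</pageType></page>
    <page leafNum="2"><pageType>Normal</pageType></page>
    <page leafNum="3"><pageType>Color Card</pageType></page>
  </pageData>
</book>"""


def color_card_leaves(scandata_xml):
    """Return the set of leaf numbers whose pageType is 'Color Card'."""
    root = ET.fromstring(scandata_xml)
    leaves = set()
    for page in root.iter("page"):
        leaf = page.get("leafNum")
        page_type = (page.findtext("pageType") or "").strip().lower()
        if leaf is not None and page_type == "color card":
            leaves.add(int(leaf))
    return leaves


def drop_color_cards(pages, scandata_xml, leaf_key="leafNum"):
    """Filter out page dicts whose leaf number is a color card.

    `leaf_key` is a hypothetical field name for the leaf number in each
    page dict; adjust it to whatever the IA manifest actually provides.
    """
    skip = color_card_leaves(scandata_xml)
    return [p for p in pages if int(p[leaf_key]) not in skip]


pages = [{"leafNum": n} for n in range(4)]
print(drop_color_cards(pages, SAMPLE_SCANDATA))
```

Matching by leaf number rather than by position avoids assuming the color cards are always first and last.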

@amandelman
Author

Thanks so much for your attention on this and the super thorough reply. Our backender on the project (@lucasmoeskops) just figured out a work-around by telling Mirador to move the first page to the end, but since we'd really like a cleaner solution than that, this is super helpful. Plus, I imagine there are other Mirador-based projects out there that would appreciate the fix too.

@mekarpeles
Member

Yes, @amandelman + @lucasmoeskops, be careful: not every book item on archive.org was digitized by the Internet Archive, so it is quite likely that a given item will not have a color card. In these cases, moving the first page to the end won't work.

Consider coupling this approach with a heuristic, such as the following:
https://archive.org/metadata/americabeinglate00mont lists a field called scanningcenter when an item was digitized at IA.

We already fetch this metadata as json in the code:
see: https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L80

metadata = resp.get("metadata", {})

Therefore, after https://github.com/ArchiveLabs/iiif.archivelab.org/blob/master/iiify/resolver.py#L138 you may wish to check:

if metadata.get('scanningcenter'):
    # remove color card pages (i.e. [0]th and [-1]th)
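
Fleshed out a little, the check might look like the sketch below. The function name and the `pages` list are hypothetical stand-ins for whatever resolver.py holds at that point, and the assumption that color cards sit at positions [0] and [-1] is only the heuristic described here, not something verified per item.

```python
def strip_color_cards(metadata, pages):
    """Drop the first and last page when the item was scanned at IA.

    Heuristic only: assumes IA-scanned items (those with a
    'scanningcenter' field) carry color cards at both ends.
    """
    if metadata.get("scanningcenter") and len(pages) > 2:
        return pages[1:-1]
    return list(pages)


# Made-up values for illustration.
ia_item = {"scanningcenter": "example-center"}
other_item = {}
pages = ["color card", "cover", "p1", "p2", "color card"]

print(strip_color_cards(ia_item, pages))
print(strip_color_cards(other_item, pages))
```

The length guard keeps very short items from being emptied out entirely.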

This would probably be useful to all patrons and partners, so feel free to open a PR with this patch if it works reliably for you :)

Thank you!

@jcmundy

jcmundy commented Sep 11, 2020

Hi, I just wanted to comment that we've also just recently noticed that some volumes (not all) have an additional first canvas. I think it's the same issue that you are discussing here. In our work this seems to throw off the relationship between the canvas/page number and the OCR canvas/page number on those volumes. We've been working around it, but I'm happy to see a discussion about it coming up. Thanks!

@mekarpeles
Member

This is similar to an email I received today: some partners are reporting that some manifests are being generated erroneously with a page 0 ($0). This may be because of color cards, or it may just be an off-by-one error.

In Mirador: https://iiif.archivelab.org/iiif/floragraecasive3sibt
https://iiif.archivelab.org/iiif/floragraecasive3sibt/manifest.json


4 participants