This is a simple Python 3 script to adjust the title page of a text document uploaded to the Internet Archive.
It just does the following:
- Downloads a
scandata.xml
file from an IA item. - Modifies the XML, updating the
pageType
of a specified page to "Title" and removing the "Title" anywhere else. - Uploads the XML file, replacing the old one.
- (Hopefully) cleans up the XML temp files.
In my work, I upload a lot of items to the Internet Archive, mostly digitized books and government documents. The IA has robust APIs, useful tools, and is generally a fantastic website to work with.
One minor frustration, though, is the way that the IA's algorithms attempt to automatically generate thumbnails and identify title pages for uploaded texts. Founder Brewster Kahle recently wrote a blog post[1] casting about for ways to improve the analysis of book covers. I don't have any ideas to help out there, but I do wish there were a better way to assert the desired title page of an uploaded text.
I found an archive.org discussion post[2] describing how this detection works, and the manual work-around for updating title pages is to use the method I've listed at the top.
I wrote this code in order to make that process less tedious.
You will need to be an item administrator to make this change. I am using the internetarchive library to handle downloads and uploads, so I am assuming that it is installed, and that you have logged in with your IA credentials (directions are in the internetarchive
documentation.)
You can run this in a command line to modify items one by one. For example, if you want to modify sampleitem and assert that its title page should be 6, you can enter:
> python iatp.py sampleitem 6
If it works, your results will be like this:
sampleitem: d- success
Generated new scandata.xml file and uploading...
Success!
For bulk work, you can work within Python, importing or copying my code:
>>> assert_title_page(sampleitem, 6)
The assert_title_page function doesn't handle directories or file cleanup. The underlying internetarchive
item download command will create a new directory for every identifier.
The existing download command also allows for a silent=True parameter to suppress the printed-out download information. You can pass that through in my command, too:
>>> assert_title_page(sampleitem, 6, silent=True)
- Uploading a new
scandata
file should prompt a derive process on the item, after which the new title page (and thumbnail) will take effect. - The asserted page numbers are technically leaf numbers (and listed as such in the XML), and they start with zero. In general, for left-to-right texts, odd numbers are on the left (verso) side and even numbers are the right (recto), the opposite of standard page numbers. A title page is thus likely to be an even leaf number for this purpose.
- There is no guarantee that the Internet Archive won't change their architecture at any point to render this ineffectual.
- I'm not directly trying to update thumbnails, just the title pages. However, it seems like the thumbnails are generally re-derived from this process, too.
- This has only been tested on my Windows work PC using Python 3.7.0.
Feel free to adapt, suggest additional ideas, or otherwise be in touch.