Canvas Indexer

A Flask web application that crawls Activity Streams for IIIF Canvases and offers a search API.

Project state

Canvas Indexer is being developed as part of the CODH's IIIF Curation Platform, but is meant to be a general IIIF tool. Because of this integration, development at this very early stage focuses on cr:Curation[1] type documents.[2] Nevertheless, all development is done with generality in mind.[3]

[1] http://codh.rois.ac.jp/iiif/curation/1#Curation
[2] The crawler currently only looks for canvases within cr:Curation documents (and not, for example, within sc:Manifests), and the search API offers dedicated parameters.
[3] The crawling process implements the IIIF Change Discovery API 0.1. Extending the indexing mechanism and search API to support IIIF documents within Activity Streams in general (or at least sc:Manifests as a first step) should be straightforward.

Setup

  • create virtual environment: $ python3 -m venv venv
  • activate virtual environment: $ source venv/bin/activate
  • install requirements: $ pip install -r requirements.txt

Config

| section | key | default | explanation |
| --- | --- | --- | --- |
| shared | db_uri | sqlite:////tmp/ci_tmp.db | a SQLAlchemy database URI (file system paths have to be absolute) |
| crawler | as_sources | [] | comma separated list of links to Activity Streams in the form of OrderedCollections |
| | interval | 3600 | crawl interval in seconds (a value <=0 deactivates automatic crawling) |
| | log_file | /tmp/ci_crawl_log.txt | file system path where the crawling details should be logged |
| | allow_orphan_canvases | false | set whether Canvases that are no longer associated with any parent elements in the index should still appear in search results |
| api | server_url | http://localhost:5005 | URL under which Canvas Indexer can be accessed (only needed when using bots, details below) |
| | bot_urls | [] | comma separated list of URLs to bots (only needed when using bots, details below) |
| | facet_label_sort_top | [] | comma separated list defining the beginning of the list returned for the /facets endpoint |
| | facet_label_sort_bottom | [] | comma separated list defining the end of the list returned for the /facets endpoint |
| | facet_value_sort_frequency | [] | comma separated list of facets to be sorted by frequency |
| | facet_value_sort_alphanum | [] | comma separated list of facets to be sorted alphanumerically |
| facet_value_sort_custom_<name> | label | | facet label for which a custom order is defined |
| | sort_top | | comma separated list defining the beginning |
| | sort_bottom | | comma separated list defining the end |
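
To illustrate how the comma separated values above might be consumed, here is a minimal sketch that reads a config with Python's standard configparser. It is not Canvas Indexer's own loading code. The example URLs, the helper names split_list and load_settings, and the assumption that keys listed without a section belong to the section named above them are all hypothetical.

```python
import configparser

# Hypothetical config contents; section and key names are taken from the
# table above, the Activity Stream URLs are placeholders.
EXAMPLE_CONFIG = """
[shared]
db_uri = sqlite:////tmp/ci_tmp.db

[crawler]
as_sources = https://example.org/as/collection.json, https://example.com/stream.json
interval = 3600
allow_orphan_canvases = false

[api]
server_url = http://localhost:5005
"""


def split_list(raw):
    """Split a comma separated config value into a list of non-empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def load_settings(text):
    """Parse the config text and convert values to suitable Python types."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "db_uri": parser.get("shared", "db_uri"),
        "as_sources": split_list(parser.get("crawler", "as_sources", fallback="")),
        "interval": parser.getint("crawler", "interval", fallback=3600),
        "allow_orphan_canvases": parser.getboolean(
            "crawler", "allow_orphan_canvases", fallback=False
        ),
        "server_url": parser.get("api", "server_url", fallback="http://localhost:5005"),
    }


settings = load_settings(EXAMPLE_CONFIG)
# An interval of 0 or less would mean automatic crawling is deactivated.
auto_crawl = settings["interval"] > 0
```

Keeping the list splitting in one helper means every comma separated key (as_sources, bot_urls, the facet sort lists) can be handled the same way.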

Run

$ source venv/bin/activate
$ python3 run.py [debug]

API

path: {base_url}/api
arguments:

| arg | default | explanation |
| --- | --- | --- |
| select | curation | set the type of search results to be returned to either canvas or curation |
| from | curation,canvas | set the type of metadata the search results should be based on to canvas, curation or a comma separated list of both |
| where | | search keyword |
| where_metadata_label | | used to search by a property+value pair; requires where_metadata_value |
| where_metadata_value | | used to search by a property+value pair; requires where_metadata_label |
| where_agent | human,machine | set the type of metadata creator to human, machine or a comma separated list of both |
| start | 0 | zero-based index in the list of all results from which to start listing results |
| limit | null (meaning no limit) | limit the number of results being returned |

example: {base_url}/api?select=canvas&from=canvas,curation&where=face
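
As a rough sketch of how a client might assemble such a request, the following builds the query URL with Python's standard library. The helper name build_search_url is hypothetical, and the base URL is assumed to be the default server_url from the config; no request is actually sent.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Return the search URL for the given query arguments.

    Arguments whose value is None are left out so the server-side
    defaults apply. Commas are kept unescaped so that list values
    such as "canvas,curation" stay readable.
    """
    query = urlencode(
        {key: value for key, value in params.items() if value is not None},
        safe=",",
    )
    return f"{base_url}/api?{query}"


# Assumed base URL: the default server_url from the config table.
url = build_search_url(
    "http://localhost:5005",
    {"select": "canvas", "from": "canvas,curation", "where": "face", "limit": None},
)
print(url)
```

The arguments are passed as a dictionary rather than as keyword arguments because from is a reserved word in Python.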

path: {base_url}/facets
returns a pregenerated overview of the indexed metadata facets

Crawler

  • The crawler can be configured to run periodically (see Config) or triggered manually by accessing {base_url}/crawl.
  • On its first run the crawler will go through an Activity Stream in its entirety. Subsequent runs will only regard Activities that occurred after the previous run.
  • In its current state the crawler indexes only the label-value pairs given in a IIIF resource's metadata property.
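
The incremental behaviour and the metadata indexing described above can be pictured with the following minimal sketch. It is not the crawler's actual code. It assumes that each Activity carries an ISO 8601 endTime timestamp and that metadata entries are plain label/value strings; the function names and sample data are hypothetical.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse an ISO 8601 timestamp such as '2018-03-10T10:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def new_activities(activities, last_crawl=None):
    """Return all Activities on the first run, later only the newer ones."""
    if last_crawl is None:
        return list(activities)
    return [a for a in activities if parse_time(a["endTime"]) > last_crawl]


def metadata_pairs(resource):
    """Collect (label, value) pairs from a resource's metadata property.

    Assumes plain string labels and values; entries lacking either key
    are skipped.
    """
    return [
        (entry["label"], entry["value"])
        for entry in resource.get("metadata", [])
        if "label" in entry and "value" in entry
    ]


# Hypothetical sample data.
stream = [
    {"type": "Create", "endTime": "2018-03-10T10:00:00Z"},
    {"type": "Update", "endTime": "2018-03-12T08:30:00Z"},
]
first_run = new_activities(stream)
later_run = new_activities(stream, datetime(2018, 3, 11, tzinfo=timezone.utc))
pairs = metadata_pairs({"metadata": [{"label": "tag", "value": "face"}]})
```

Comparing timezone-aware timestamps avoids errors when Activity Streams and the indexer run in different time zones.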

Bot integration

Canvas Indexer can be set up to send image URLs of the canvases it indexes to bots that return tags. These tags are then integrated in the index. Example code of a bot can be found in the folder bot_example.


Logo

The Canvas Indexer logo uses image content from 絵本花葛蘿 in the 日本古典籍データセット(国文研所蔵) provided by the Center for Open Data in the Humanities, used under CC-BY-SA 4.0. The Canvas Indexer logo is licensed under CC-BY-SA 4.0 by Tarek Saier. A high resolution version (4456×2326 px) can be downloaded here.

Support

Sponsored by the National Institute of Informatics.
Supported by the Center for Open Data in the Humanities, Joint Support-Center for Data Science Research, Research Organization of Information and Systems.