
READ ME - OAI Dissemination - Web services in COP - Aerial Photography - Image delivery - Metadata Formats - Text Corpora

Access to web services for text search, retrieval and other operations

The Royal Danish Library provides access to some text and language resources. Until recently, these resources were intended solely for users who come to a site with a browser to search, browse and read.

Recently we decided to complement these end-user services with various text APIs. We hope that they are useful for students and scholars alike, and that they can be seen as a contribution to the discussion of what kinds of web services and APIs are useful within digital humanities and literary computing.

The text resources are

The APIs described here come with the same caveats and legal restrictions as the other services described and, like them, are works in progress as public services. They are also by-products of our own services and front ends.

There are two kinds of services (and thus servers hosting the corresponding APIs):

  • text search service API
  • text retrieval service API

The meaning of "search service" is obvious; that of "text retrieval service" is somewhat less so. Snippet Server is our internal nickname for a set of web services that retrieve, transform and deliver text snippets to the front end or to other components using them.

To do anything useful, you need both the search and the retrieval APIs. With both, you can search to discover which works and snippets exist, and then retrieve and link to them.

Text encoding

Most texts come from collected works and are critical editions. All available data and metadata are in XML markup according to the Text Encoding Initiative (TEI) Guidelines.

Anchors, searchability and retrievability

The search system (which you cannot use just yet, see above) creates records corresponding to three levels:

  • volume
  • work
  • text item

where volume and work are defined as described above. When indexing, records are created by taking

  • metadata from the TEI header given the reference in the decls attribute
  • text from the appropriate level, and below.

All work and text item records contain the xml:id of the containing element, as well as the xml:id and page number of the preceding page break.
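To illustrate how an element relates to its preceding page break, here is a minimal Python sketch that walks a TEI document in document order and remembers the last <pb/> seen before the element with a given xml:id. The function name and the sample document are hypothetical and serve only to illustrate the idea; the indexer itself may well work differently.

```python
import xml.etree.ElementTree as ET

# Attribute name ElementTree uses for xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def preceding_page_break(root, target_id):
    """Return (xml:id, n) of the last <pb/> before the element whose
    xml:id is target_id, or None if no page break precedes it."""
    last_pb = None
    for elem in root.iter():  # iter() visits elements in document order
        if not isinstance(elem.tag, str):
            continue  # skip comments and processing instructions
        if elem.get(XML_ID) == target_id:
            return last_pb
        if elem.tag.rsplit("}", 1)[-1] == "pb":
            last_pb = (elem.get(XML_ID), elem.get("n"))
    raise KeyError(target_id)


# Hypothetical miniature TEI document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="3" xml:id="pb3"/>
<div xml:id="workid1"><p xml:id="p1">First paragraph</p>
<pb n="4" xml:id="pb4"/>
<p xml:id="p2">Second paragraph</p></div>
</body></text></TEI>"""

root = ET.fromstring(SAMPLE)
page_of_p1 = preceding_page_break(root, "p1")
page_of_p2 = preceding_page_break(root, "p2")
```

Here the first paragraph is preceded by the page break numbered 3 and the second by the one numbered 4.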

The text items are indexed in such a way that a search result can address a single

  • paragraph of prose
  • strophe in poetry
  • speech in a play

Please note that strophes occurring inside a speech are not recognised as poetry.
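The granularity described above can be sketched in Python as follows. This is an illustration and not the indexer's actual code: it assumes that paragraphs, strophes and speeches are encoded with the TEI elements <p>, <lg> and <sp>, and it simply stops descending once it has found one of them, so a strophe inside a speech is never reported on its own. The function name and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumed mapping from TEI element names to the units named in the text.
UNIT_KINDS = {"p": "paragraph of prose", "lg": "strophe", "sp": "speech"}


def addressable_units(elem):
    """Yield (element name, xml:id) for each outermost p, lg or sp."""
    for child in elem:
        name = child.tag.rsplit("}", 1)[-1]  # strip the TEI namespace
        if name in UNIT_KINDS:
            # Do not descend further: nested units are not addressed.
            yield name, child.get(XML_ID)
        else:
            yield from addressable_units(child)


# Hypothetical fragment: prose, a free-standing strophe, and a speech
# that itself contains a strophe.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0"><div>
<p xml:id="p1">Some prose.</p>
<lg xml:id="lg1"><l>A line of verse</l></lg>
<sp xml:id="sp1"><speaker>A</speaker>
  <lg xml:id="lg2"><l>A sung line</l></lg></sp>
</div></body>"""

units = list(addressable_units(ET.fromstring(SAMPLE)))
```

With this sample, lg2 is not listed separately because it sits inside the speech sp1. Note that the same rule would also hide strophes nested inside an outer <lg>, which the real index may treat differently.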

Typically, one volume contributes one volume record, anywhere from one to dozens of work records, and hundreds or thousands of text items. The records for works and text items import basic metadata and include

Connecting text with facsimile

Our digital library has users who are interested in viewing the printed text, if for no other reason than to check the original when there are OCR errors.

Facsimiles are delivered through our IIIF server. A page is turned whenever a page break, that is, a <pb/> element, occurs in the XML text. It looks like this:

 <pb n="4" facs="adl/heibergpa/heibergpa01/heibergpa1004" xml:id="idm140167182645744"/>

An image URI is constructed by prepending http://kb-images.kb.dk/public/ and appending /full/,750/0/native.jpg to the content of the facs attribute of the page break, resulting in a URI of this form:

http://kb-images.kb.dk/public/adl/grundtvig/grundtvig08/grun8136/full/,750/0/native.jpg
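The construction is plain string concatenation. A minimal Python sketch, using the prefix and suffix given above (the function and constant names are hypothetical):

```python
import xml.etree.ElementTree as ET

IMAGE_PREFIX = "http://kb-images.kb.dk/public/"
IMAGE_SUFFIX = "/full/,750/0/native.jpg"


def image_uri(facs):
    """Build the image URI for the value of a facs attribute."""
    return IMAGE_PREFIX + facs + IMAGE_SUFFIX


def image_uri_from_pb(pb_xml):
    """Parse a single <pb/> element and build the URI of its facsimile."""
    pb = ET.fromstring(pb_xml)
    return image_uri(pb.get("facs"))


# The page break shown earlier in this section.
PB = ('<pb n="4" facs="adl/heibergpa/heibergpa01/heibergpa1004" '
      'xml:id="idm140167182645744"/>')

uri_from_pb = image_uri_from_pb(PB)
uri_from_facs = image_uri("adl/grundtvig/grundtvig08/grun8136")
```

The /full/,750/0/native.jpg suffix follows the IIIF Image API pattern of region, size, rotation and quality, so changing 750 should request a different image height, provided the server accepts it.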

All images connected to a given snippet can be retrieved as an HTML document through the facsimile web service

http://labs.kb.dk/storage/adl/present.xq?c=texts&doc=grundtvig08val.xml&id=workid80553&op=facsimile

Text search

The search API is described in detail in a separate document.

A search result can be returned in JSON or XML format. Here is an example, where we search for works

  • whose title contains Jerusalem
  • that are written by Gustaf Munch-Petersen

Solr returns JSON or XML, and the content returned is the same in either format.

The simplest way to retrieve the data is to look for the url_ssi field. In the example linked to, it contains the value "texts/munp1.xml#workid72997", which is the concatenation of three variables:

  • collection (c) = texts
  • document (doc) = munp1.xml
  • id = workid72997

You can now construct the retrieval URI using the script present.xq and the three parameters:
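As a minimal Python sketch, the splitting and assembly could look like this. The base URL http://labs.kb.dk/storage/adl/ is an assumption taken from the facsimile example earlier in this document, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: same host and path as the facsimile example above.
BASE = "http://labs.kb.dk/storage/adl/"


def split_url_ssi(url_ssi):
    """Split 'collection/document#id' into the c, doc and id parameters."""
    path, _, fragment = url_ssi.partition("#")
    collection, _, doc = path.partition("/")
    return {"c": collection, "doc": doc, "id": fragment}


def retrieval_uri(url_ssi, script="present.xq", **extra):
    """Build a retrieval URI; extra parameters (such as op) are appended."""
    params = split_url_ssi(url_ssi)
    params.update(extra)
    return BASE + script + "?" + urlencode(params)


parts = split_url_ssi("texts/munp1.xml#workid72997")
uri = retrieval_uri("texts/munp1.xml#workid72997")
facsimile = retrieval_uri("texts/grundtvig08val.xml#workid80553",
                          op="facsimile")
```

With op="facsimile", the sketch reproduces the facsimile URI shown earlier, which suggests the parameter scheme is the same across operations.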

More on what you can do with the texts using the parameters below.

Retrieval APIs for our texts

There are several text retrieval scripts in the Snippet Server. The source code is free.

We concentrate on two of them. The first is present.xq, which we use for extracting snippets and transforming them. The HTML it produces consists of mere fragments that you can include in your own document as you see fit.

The alternative script, present-text.xq, does the same as present.xq, except that it delivers the snippet as pure text with neither XML nor HTML markup.

Virtually all scripts work in a similar way, with the following arguments.

Some more examples