Access to web services for text search, retrieval and other operations
The Royal Danish Library provides access to a number of text and language resources. Until recently, these resources were intended solely for users who visit a site with a browser to search, browse and read.
Recently we decided to complement these end-user services with various text APIs. We hope that they are useful for students and scholars alike, and that they can be seen as a contribution to the discussion of what kinds of web services and APIs are useful within digital humanities and literary computing.
The text resources are
- Archive for Danish Literature, ADL. The APIs described in this document apply to this data set. The literary texts used are available
- Danmark's Breve uses basically the same APIs, but we have not yet decided to release the API for this data set.
The APIs described here are provided with similar caveats and legal restrictions as the other services described and, like them, are work in progress as public services. They are also byproducts of our own services and front ends.
There are two kinds of services (and thus servers hosting the corresponding APIs)
- text search service API
- text retrieval service API
What a search service does is obvious; the text retrieval service is somewhat less so. Snippet Server is our internal nickname for a set of web services that retrieve, transform and deliver text snippets to the front end or to other components using them.
To do anything useful, you need both the search and the retrieval APIs. With both you can search and discover which works and snippets exist, and then retrieve and link to them.
Most texts are from collected works and are critical editions. All available data and metadata are in XML, marked up according to the Text Encoding Initiative (TEI) Guidelines.
Anchors, searchability and retrievability
The search system (which you cannot use just yet, see above) creates records corresponding to three levels
- volume
- work
- text item
where volume and work are defined as described above. When indexing, records are created taking
- metadata from the TEI header given the reference in the decls attribute
- text from the appropriate level, and below.
All work and text item records contain data on the xml:id of the containing element and the xml:id page number of the preceding page break.
The text items are indexed in such a way that a search result can address a single
- paragraph of prose
- strophe in poetry
- speech in a play
Please note that a strophe occurring inside a speech is not recognised as poetry.
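The addressable units above can be illustrated with a minimal Python sketch. It assumes that paragraphs, strophes and speeches correspond to the TEI elements p, lg and sp; the actual indexer may use other rules. The sample document and its xml:id values are made up for illustration. A speech is treated as one unit and not descended into, so a strophe inside it is not reported as poetry.

```python
import xml.etree.ElementTree as ET

# Attribute name under which ElementTree exposes xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: these TEI elements correspond to the three kinds of text item.
KINDS = {"p": "paragraph", "lg": "strophe", "sp": "speech"}


def text_items(element):
    """Yield (kind, xml:id) for each addressable unit below element.

    Recognised units are not descended into, so a strophe inside
    a speech is never reported on its own.
    """
    for child in element:
        local_name = child.tag.rsplit("}", 1)[-1]  # drop the namespace part
        kind = KINDS.get(local_name)
        if kind is None:
            yield from text_items(child)
        else:
            yield kind, child.get(XML_ID)


# Hypothetical sample; the ids are invented for this illustration.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <p xml:id="p1">Some prose.</p>
  <lg xml:id="lg1"><l>A line of verse</l></lg>
  <sp xml:id="sp1"><speaker>A</speaker>
    <lg xml:id="lg2"><l>A sung line</l></lg>
  </sp>
</div>"""

for kind, xml_id in text_items(ET.fromstring(SAMPLE)):
    print(kind, xml_id)
```

In this sample the strophe lg2 is not listed separately, because it sits inside the speech sp1.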
Typically one volume contributes one volume record, one to dozens of work records, and hundreds or thousands of text item records. The records for works and text items import basic metadata and include
Connecting text with facsimile
Our digital library has users who want to view the printed text, if for no other reason than to check the original when there are OCR errors.
Facsimiles are delivered through our IIIF server. A page is turned whenever one finds a page break in the XML text, that is, a <pb/> element. It looks like
<pb n="4" facs="adl/heibergpa/heibergpa01/heibergpa1004" xml:id="idm140167182645744"/>
An image URI is constructed by prepending http://kb-images.kb.dk/public/ and appending /full/,750/0/native.jpg to the content of the facs attribute of the page break, resulting in a URI of this form:
http://kb-images.kb.dk/public/adl/heibergpa/heibergpa01/heibergpa1004/full/,750/0/native.jpg
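The construction can be sketched in a few lines of Python. The prefix and suffix are the ones given above; the function names are only illustrative, and the page break is parsed as a stand-alone element rather than as part of a full TEI document.

```python
import xml.etree.ElementTree as ET

IIIF_PREFIX = "http://kb-images.kb.dk/public/"
IIIF_SUFFIX = "/full/,750/0/native.jpg"


def image_uri(facs: str) -> str:
    """Build the image URI for the value of a facs attribute."""
    return IIIF_PREFIX + facs + IIIF_SUFFIX


def image_uri_from_pb(pb_xml: str) -> str:
    """Parse a single <pb/> element and build the URI of its facsimile."""
    pb = ET.fromstring(pb_xml)
    return image_uri(pb.attrib["facs"])


PB = ('<pb n="4" facs="adl/heibergpa/heibergpa01/heibergpa1004" '
      'xml:id="idm140167182645744"/>')
print(image_uri_from_pb(PB))
```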
All images connected to a given snippet can be retrieved as an HTML document through the facsimile web service
The search API is described in detail in separate documents
- We use SOLR for searching
- SOLR has its own Common Query Parameters
- We provide a document about what search fields there are and how to use them.
Search results can be returned in JSON or XML format. Here is an example, where we search for works
- whose title contains Jerusalem
- that are written by Gustaf Munch-Petersen
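As a rough sketch of how such a query could be assembled in Python: q, wt and rows are standard Solr query parameters, but the field names below are hypothetical placeholders, not the names used in the actual index. The real names are listed in the document about search fields. No endpoint is assumed here; only the query string is built.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical field names, for illustration only.
TITLE_FIELD = "title_field"
AUTHOR_FIELD = "author_field"


def solr_query_string(title_word: str, author: str,
                      fmt: str = "json", rows: int = 10) -> str:
    """Build a Solr query string matching a title word and an author.

    fmt is passed as Solr's wt parameter ("json" or "xml").
    """
    query = f'{TITLE_FIELD}:{title_word} AND {AUTHOR_FIELD}:"{author}"'
    return urlencode({"q": query, "wt": fmt, "rows": rows})


qs = solr_query_string("Jerusalem", "Gustaf Munch-Petersen")
print(qs)
# Decoding the string again shows the query as Solr would receive it.
print(parse_qs(qs)["q"][0])
```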
The simplest way to retrieve the data is to look for the url_ssi. In the example linked to, it contains the value "texts/munp1.xml#workid72997", which is the concatenation of three variables
- collection (c) = texts
- document (doc) = munp1.xml
- id = workid72997
You can now construct the retrieval URI using the script present.xq and the three parameters:
More on what you can do with the texts using the parameters below.
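Splitting a url_ssi value into its three variables and turning them into a query string can be sketched as follows. This is a minimal Python illustration under two assumptions: the base address is borrowed from the present.xq examples elsewhere in this document, and the parameter names c, doc and id are those given in parentheses above. Whether the service at that address accepts exactly this combination is not confirmed here.

```python
from urllib.parse import urlencode

# Assumption: base address taken from the present.xq examples in this document.
PRESENT_XQ = "http://labs.kb.dk/storage/adl/present.xq"


def split_url_ssi(url_ssi: str) -> dict:
    """Split 'collection/document#id' into the parameters c, doc and id."""
    path, _, fragment = url_ssi.partition("#")
    collection, _, document = path.rpartition("/")
    return {"c": collection, "doc": document, "id": fragment}


def retrieval_uri(url_ssi: str) -> str:
    """Build a present.xq URI from a url_ssi value."""
    return PRESENT_XQ + "?" + urlencode(split_url_ssi(url_ssi))


print(split_url_ssi("texts/munp1.xml#workid72997"))
print(retrieval_uri("texts/munp1.xml#workid72997"))
```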
Retrieval APIs for our texts
There are several text retrieval scripts in the Snippet Server. The source code is free.
We concentrate on two. The first is present.xq, which we use for extracting snippets and transforming them. The HTML it produces consists of mere fragments that you can include in your own documents as you like.
The alternative script, present-text.xq, does the same as present.xq, except that it delivers the text as plain text with neither XML nor HTML markup.
Virtually all scripts work similarly, with the following arguments.
- doc -- the name of the document to be rendered or transformed. Here are some examples of doc names you can test
- op, targetOp -- op is the operation to be performed upon the document doc; targetOp is the operation used in the links that the service generates. Possible values of op and targetOp are
- 'render' which implies that doc is transformed into HTML.
- http://labs.kb.dk/storage/adl/present.xq?doc=aakjaer01val.xml&op=render&q=samlede with an argument q giving a search string to be highlighted in the text, in this case samlede
- 'solrize' which returns a solr ... document, which is ready to be sent to SOLR. Cf. http://labs.kb.dk/storage/adl/present.xq?doc=aakjaer01val.xml&op=solrize
- 'toc' returns an HTML table of contents
- http://labs.kb.dk/storage/adl/present.xq?doc=aakjaer01val.xml&op=toc If a 'toc' and a text generated through 'render' are included into one document, all internal links will work.
- http://labs.kb.dk/storage/adl/present.xq?doc=aakjaer01val.xml&op=toc&targetOp=render note the targetOp=render, which makes the toc script generate links to the rendered version of the doc. This is good for testing.
- id -- the id of a part inside the doc which is to be treated.
- q -- the query string; present.xq marks its hits in the text.
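The arguments above can be combined programmatically. The following Python sketch builds present.xq addresses from them; the function and its keyword names are illustrative only, and the base address is the one used in the examples in this document.

```python
from typing import Optional
from urllib.parse import urlencode

PRESENT_XQ = "http://labs.kb.dk/storage/adl/present.xq"
KNOWN_OPS = {"render", "solrize", "toc"}


def present_url(doc: str, op: str = "render",
                target_op: Optional[str] = None,
                part_id: Optional[str] = None,
                q: Optional[str] = None) -> str:
    """Build a present.xq URL from doc, op, targetOp, id and q."""
    if op not in KNOWN_OPS:
        raise ValueError(f"unknown op: {op!r}")
    params = {"doc": doc, "op": op}
    if target_op is not None:
        params["targetOp"] = target_op
    if part_id is not None:
        params["id"] = part_id
    if q is not None:
        params["q"] = q
    return PRESENT_XQ + "?" + urlencode(params)


# A table of contents whose links point to the rendered text.
print(present_url("aakjaer01val.xml", op="toc", target_op="render"))
# Rendered text with a search string to be highlighted.
print(present_url("aakjaer01val.xml", q="samlede"))
```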
Some more examples
- Holberg, vol 3, HTML: http://labs.kb.dk/storage/adl/present.xq?doc=holb03val.xml&op=render
- Holberg, vol 3, page 18: http://labs.kb.dk/storage/adl/present.xq?doc=holb03val.xml&op=render#s18
- The TOC of Den politiske Kandstøber http://labs.kb.dk/storage/adl/present.xq?doc=holb03val.xml&op=toc&targetOp=render&id=workid54980
- The TOC of Den politiske Kandstøber, Actus II http://labs.kb.dk/storage/adl/present.xq?doc=holb03val.xml&op=toc&targetOp=render&id=idm140583366846000
- Den politiske Kandstøber, Actus II http://labs.kb.dk/storage/adl/present.xq?doc=holb03val.xml&op=render&id=idm140583366846000
- A single 'speak' in that play,
- A TOC for a small work http://labs.kb.dk/storage/adl/present.xq?doc=aakjaer01val.xml&op=toc&targetOp=render&id=workid59384
- The page 27 (of the original volume) inside that work http://labs.kb.dk/storage/adl/present.xq?doc=aakjaer01val.xml&op=toc&targetOp=render&id=workid593843#s27
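Judging from the examples above, a page of the original volume is addressed with a fragment of the form #s followed by the page number. Under that assumption, a page anchor can be added to any of these addresses with a small Python helper (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit


def with_page_anchor(url: str, page: int) -> str:
    """Return url with its fragment set to the anchor of the given page."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=f"s{page}"))


print(with_page_anchor(
    "http://labs.kb.dk/storage/adl/present.xq?doc=holb03val.xml&op=render", 18))
```

Since the fragment is resolved by the browser and not by the server, this only has an effect when the page in question is part of the returned HTML.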