
[en] Batch download

Alexis Michaud edited this page Dec 9, 2019 · 6 revisions

French version: click here.

COCOON, the repository that hosts the Pangloss Collection, uses standard protocols and data models to facilitate interoperability and the reuse of resources for a range of purposes. Batch download is one of the uses that COCOON's design is intended to support. As an example, this page explains how to download all the resources in the Pangloss Collection that come with a transcription and annotation in XML format. The example aims to give a sense of the possibilities opened up by the standards used at COCOON (RDF, SPARQL, EDM, etc.).

Here is the SPARQL query; block-by-block comments follow.

PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ebucore: <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#>

SELECT DISTINCT ?audioFile ?textFile ?lg ?cho WHERE {
   ?aggr edm:aggregatedCHO  ?cho .

   ?cho a edm:ProvidedCHO.
   ?cho dc:subject ?lg FILTER regex(str(?lg), "^http://lexvo.org/id/iso639-3/")
   ?cho edm:isGatheredInto <http://cocoon.huma-num.fr/pub/COLLECTION_cocoon-af3bd0fd-2b33-3b0b-a6f1-49a7fc551eb1> .
   ?cho  dcterms:accessRights "Freely available for non-commercial use" .

   ?aggr edm:hasView ?transcript .
   ?transcript  dcterms:conformsTo <http://cocoon.huma-num.fr/pub/CHO_cocoon-49aefa90-8c1f-3ba8-a099-0ebefc6a2aa7> .

   ?transcript foaf:primaryTopic ?textFile .

   ?aggr edm:hasView ?recording .
   ?recording ebucore:sampleRate "22050" .
   ?recording foaf:primaryTopic ?audioFile .
}

Here are some block-by-block comments:

SELECT DISTINCT ?audioFile ?textFile ?lg ?cho WHERE {
   ?aggr edm:aggregatedCHO  ?cho .

   ?cho a edm:ProvidedCHO.
   ?cho dc:subject ?lg FILTER regex(str(?lg), "^http://lexvo.org/id/iso639-3/")
   ?cho edm:isGatheredInto <http://cocoon.huma-num.fr/pub/COLLECTION_cocoon-af3bd0fd-2b33-3b0b-a6f1-49a7fc551eb1> .
   ?cho  dcterms:accessRights "Freely available for non-commercial use" .

This selects all the resources in the Pangloss Collection that are freely accessible. That is a large set, but not quite all resources, because depositors prefer to keep some resources offline for a time, for various reasons. The resource identifier (?cho) and the target language (?lg) are retrieved in the process; the FILTER keeps only those dc:subject values that are Lexvo ISO 639-3 language identifiers.

   ?aggr edm:hasView ?transcript .
   ?transcript  dcterms:conformsTo <http://cocoon.huma-num.fr/pub/CHO_cocoon-49aefa90-8c1f-3ba8-a099-0ebefc6a2aa7> .

   ?transcript foaf:primaryTopic ?textFile .

This gets the URLs (?textFile) of the transcriptions in Pangloss format.

   ?aggr edm:hasView ?recording .
   ?recording ebucore:sampleRate "22050" .
   ?recording foaf:primaryTopic ?audioFile .

This gets the URLs (?audioFile) of the sound files sampled at 22.05 kHz.

SELECT DISTINCT ?audioFile ?textFile ?lg ?cho WHERE {

This selects the fields that are desired in the results.

It yields a table with all the URLs to be downloaded. The audioFile column is for the audio, and the textFile column is for the transcription. The other two columns contain the language identifier and the resource identifier. The resource identifier allows you to get all the rest of the metadata if desired.
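The result table can be turned into plain records for further processing. The following is a minimal sketch, assuming the results were saved in the standard SPARQL 1.1 Query Results JSON format (that JSON was requested from the endpoint is an assumption). The helper name `rows_from_sparql_json` is hypothetical, and the sample row is made up for illustration; it is not real COCOON data.

```python
import json

# Column names taken from the SELECT clause of the query above.
COLUMNS = ("audioFile", "textFile", "lg", "cho")


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON result document into a list of dicts, one per row."""
    payload = json.loads(text)
    rows = []
    for binding in payload["results"]["bindings"]:
        # Each cell looks like {"type": "uri", "value": "..."}; keep only the value.
        rows.append({col: binding[col]["value"] for col in COLUMNS if col in binding})
    return rows


# Made-up sample with a single row, for illustration only (not real COCOON data).
SAMPLE = json.dumps({
    "head": {"vars": list(COLUMNS)},
    "results": {"bindings": [{
        "audioFile": {"type": "uri", "value": "https://example.org/rec1.wav"},
        "textFile": {"type": "uri", "value": "https://example.org/rec1.xml"},
        "lg": {"type": "uri", "value": "http://lexvo.org/id/iso639-3/und"},
        "cho": {"type": "uri", "value": "https://example.org/cho1"},
    }]},
})

if __name__ == "__main__":
    for row in rows_from_sparql_json(SAMPLE):
        print(row["lg"], row["audioFile"], row["textFile"])
```

Each record keeps the four columns together, so the language and resource identifiers stay attached to the file URLs they describe.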

Once the query has been run at the SPARQL endpoint (https://cocoon.huma-num.fr/sparql), a program such as wget can take the resulting lists of URLs as input.
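As a rough sketch of this step, the code below builds a GET request URL for the endpoint and flattens result rows into a one-URL-per-line list of the kind that `wget -i` reads. It assumes the endpoint accepts the query in the standard `query` parameter of the SPARQL protocol; the function names and the example rows are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

ENDPOINT = "https://cocoon.huma-num.fr/sparql"  # endpoint given in the text


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in the 'query' parameter (assumed supported)."""
    return endpoint + "?" + urlencode({"query": query})


def wget_input_lines(rows):
    """Flatten result rows into one URL per line, the format 'wget -i' reads."""
    urls = []
    for row in rows:
        for column in ("audioFile", "textFile"):
            url = row.get(column)
            if url and url not in urls:  # skip duplicates, keep first-seen order
                urls.append(url)
    return "".join(url + "\n" for url in urls)


if __name__ == "__main__":
    # Hypothetical rows, shaped like the result table described above.
    example_rows = [
        {"audioFile": "https://example.org/a.wav", "textFile": "https://example.org/a.xml"},
    ]
    print(build_query_url("ASK {}"))
    print(wget_input_lines(example_rows), end="")
```

Saving the output of `wget_input_lines` to a file such as `urls.txt` and running `wget -i urls.txt` would then fetch each file in turn.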

To retrieve, after some weeks or months, the items that have been uploaded since the previous 'harvest', a simple solution is to restrict the search by date, by adding the following clause anywhere inside the WHERE block (the order of the clauses does not matter):

?cho  dcterms:available ?date FILTER (str(?date) > "2018-06-01").
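For repeated harvests, the clause can be added programmatically. The sketch below inserts it just before the final closing brace of a query string; `add_date_filter` is a hypothetical helper, and it assumes the query already declares the `dcterms` prefix and ends with the closing brace of its WHERE block.

```python
from datetime import date


def add_date_filter(query, since):
    """Insert a dcterms:available date clause before the final '}' of the query."""
    # Validate the date (raises ValueError if malformed) and normalise to YYYY-MM-DD.
    since = date.fromisoformat(since).isoformat()
    clause = '   ?cho dcterms:available ?date FILTER (str(?date) > "%s") .\n' % since
    head, brace, tail = query.rpartition("}")
    if not brace:
        raise ValueError("query has no closing brace")
    if head and not head.endswith("\n"):
        head += "\n"  # keep the new clause on its own line
    return head + clause + brace + tail


if __name__ == "__main__":
    # Shortened query for illustration; the full query above works the same way.
    short_query = "SELECT ?cho WHERE {\n   ?cho a edm:ProvidedCHO .\n}"
    print(add_date_filter(short_query, "2018-06-01"))
```

Comparing the dates as strings works here because dates written as YYYY-MM-DD sort in chronological order.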

[Text by Michel Jacobson, July 2018. Translated by Alexis Michaud. Updates to the CoCoON identifiers in December 2019.]