# OAI-PMH by record

Given a task file that lists repository URLs, metadata prefixes, and record identifiers,
this program fetches the metadata of those records from those repositories by means of
[OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html).

## Format of the task file

The task file is a plain text file.

Each line is one of the following:

* empty or whitespace only: will be ignored
* `@` as first character: the rest of the line is a *key=value* setting
* a record identifier

These are the valid keys:

* `set`: the name of the pseudo set that the following identifiers belong to
* `url`: the URL of the OAI-PMH endpoint of a repository
* `metadataPrefix`: the OAI-PMH `metadataPrefix` that selects the metadata format, such as `oai_dc`

Each record identifier is fetched from the most recent `url` encountered,
in the metadata format given by the most recent `metadataPrefix` encountered.

The identifiers form sets: every identifier belongs to the most recent `set` encountered.
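
As an illustration of these rules, here is a minimal sketch that groups the identifiers of a small task file by the settings in force when each one is read. The task text is hypothetical: the endpoint URL and the record identifiers are placeholders, and `group_task_lines` is a name made up for this example, not part of the program below.

```python
from collections import defaultdict

# Hypothetical task file: URL and identifiers are placeholders.
SAMPLE_TASK = """\
@url=https://example.org/oai
@metadataPrefix=oai_dc

@set=theo1
oai:example.org:record-1
oai:example.org:record-2

@set=theo2
oai:example.org:record-3
"""


def group_task_lines(text):
    """Return {(url, set, metadataPrefix): [identifier, ...]}."""
    state = {'url': None, 'set': None, 'metadataPrefix': None}
    groups = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty or whitespace-only lines are ignored
        if line.startswith('@'):
            key, _, value = line[1:].partition('=')
            if key in state:
                state[key] = value  # the most recent value wins
            continue
        # any other line is a record identifier
        groups[(state['url'], state['set'], state['metadataPrefix'])].append(line)
    return dict(groups)


for (url, set_name, prefix), ids in group_task_lines(SAMPLE_TASK).items():
    print(set_name, prefix, ids)
```

Because a setting stays in force until it is overridden, the `url` and `metadataPrefix` lines need to appear only once when all records come from the same repository in the same format.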

## Result of a task

When a task has run, the fetched record metadata ends up in a destination directory, with one subdirectory per set and one file per record.
Within each set, the files are named `r1.xml`, `r2.xml`, and so on, in the order in which the records are listed in the task file.

Each set directory also contains a file `index.txt`, which maps the file names (without the `.xml` extension) to the record identifiers.

The reason for this indirection is that record identifiers are not legal file names on every platform: the colons in OAI identifiers, for example, are not allowed in Windows file names.
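
To get from a result file back to its record identifier, the index can be read into a dictionary. The following is a minimal sketch, assuming the index has one tab-separated pair per line, as the program below writes it. The helper name `read_index` and the identifiers in the demonstration are made up for this example.

```python
import os
import tempfile


def read_index(path):
    """Return {file stem: record identifier} from a tab-separated index file."""
    mapping = {}
    with open(path) as fh:
        for line in fh:
            line = line.rstrip('\n')
            if not line:
                continue
            # split on the first tab only, so the identifier is kept whole
            stem, _, record_id = line.partition('\t')
            mapping[stem] = record_id
    return mapping


# Demonstration on a throwaway index file with placeholder identifiers.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, 'index.txt')
    with open(index_path, 'w') as fh:
        fh.write('r1\toai:example.org:record-1\n')
        fh.write('r2\toai:example.org:record-2\n')
    index = read_index(index_path)

print(index['r2'])  # the identifier of the record stored in r2.xml
```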

In [1]:
import os
from shutil import rmtree
from subprocess import run

In [2]:
TASK = '../input/theo.txt'
DEST = '../_temp/easy'
COMMAND = ('curl', '-o')

In [3]:
def cleanDir(path):
    # Start from an empty directory: remove any previous results first.
    if os.path.exists(path):
        rmtree(path)
    os.makedirs(path, exist_ok=True)

In [4]:
def readTask(path):
    repos = {}
    url = None
    prefix = None
    setName = None
    with open(path) as fh:
        for line in fh:
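            line = line.strip()
            if not line:
                continue  # skip empty and whitespace-only lines
            if line[0] == '@':
                # a setting line: @key=value (a missing '=' yields an empty value)
                (key, _, value) = line[1:].partition('=')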
                if key == 'url':
                    url = value
                elif key == 'metadataPrefix':
                    prefix = value
                elif key == 'set':
                    setName = value
                continue
            repos.setdefault(url, {}).setdefault(setName, {}).setdefault(prefix, []).append(line)
    return repos

In [7]:
def doTask(path, dest):
    # Fetch every record of the task into dest/<set>/r<n>.xml and write an index per set.
    repos = readTask(path)
    for (url, sets) in repos.items():
        for (setName, prefixes) in sets.items():
            recordMap = []
            setDest = f'{dest}/{setName}'
            cleanDir(setDest)
            for (prefix, recordIds) in prefixes.items():
                for rId in recordIds:
                    recordMap.append(rId)
                    rUrl = f'{url}?verb=GetRecord&identifier={rId}&metadataPrefix={prefix}'
                    rDest = f'{setDest}/r{len(recordMap)}.xml'
                    print(f'fetching {rId} ...')
                    run(COMMAND + (rDest, rUrl))
            with open(f'{setDest}/index.txt', 'w') as fh:
                for (i, rId) in enumerate(recordMap):
                    fh.write(f'r{i+1}\t{rId}\n')
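
`doTask` assembles the GetRecord request by plain string interpolation, which works for the identifiers used here. As a more defensive variant, the following minimal sketch percent-encodes the arguments with `urllib.parse.urlencode`, so that identifiers containing characters such as `/`, `?` or `&` cannot break the query string. The function name `get_record_url`, the endpoint URL, and the identifier are assumptions made for this example.

```python
from urllib.parse import urlencode


def get_record_url(base_url, record_id, prefix):
    """Build an OAI-PMH GetRecord request URL with percent-encoded arguments."""
    query = urlencode({
        'verb': 'GetRecord',
        'identifier': record_id,
        'metadataPrefix': prefix,
    })
    return f'{base_url}?{query}'


# Placeholder endpoint and identifier, for illustration only.
print(get_record_url('https://example.org/oai', 'oai:example.org:record/1', 'oai_dc'))
# → https://example.org/oai?verb=GetRecord&identifier=oai%3Aexample.org%3Arecord%2F1&metadataPrefix=oai_dc
```

The encoded form is equivalent to the plain one for the server, so such a helper could replace the f-string in `doTask` without changing which records are fetched.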

In [8]:
doTask(TASK, DEST)

fetching oai:easy.dans.knaw.nl:easy-dataset:4215 ...
fetching oai:easy.dans.knaw.nl:easy-dataset:25037 ...
fetching oai:easy.dans.knaw.nl:easy-dataset:30768 ...
fetching oai:easy.dans.knaw.nl:easy-dataset:32044 ...


In [9]:
!ls -lR {DEST}

total 0
drwxr-xr-x  5 dirk  staff  160 Mar 14 18:28 theo1
drwxr-xr-x  5 dirk  staff  160 Mar 14 18:28 theo2

../_temp/easy/theo1:
total 24
-rw-r--r--  1 dirk  staff    87 Mar 14 18:28 index.txt
-rw-r--r--  1 dirk  staff  2714 Mar 14 18:28 r1.xml
-rw-r--r--  1 dirk  staff  3303 Mar 14 18:28 r2.xml

../_temp/easy/theo2:
total 32
-rw-r--r--  1 dirk  staff    88 Mar 14 18:28 index.txt
-rw-r--r--  1 dirk  staff  5322 Mar 14 18:28 r1.xml
-rw-r--r--  1 dirk  staff  3593 Mar 14 18:28 r2.xml


In [10]:
!cat {DEST}/*/index.txt

r1	oai:easy.dans.knaw.nl:easy-dataset:4215
r2	oai:easy.dans.knaw.nl:easy-dataset:25037
r1	oai:easy.dans.knaw.nl:easy-dataset:30768
r2	oai:easy.dans.knaw.nl:easy-dataset:32044
