# OAI-PMH by record

[OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html) supports selective harvesting by *sets*.
However, this is only useful if the repository from which the records are harvested has organised its
records into sets. If it has not, there is no set to ask for, and a chosen collection of identifiers cannot be harvested that way.

The purpose of this program is to perform selective harvesting from repositories that do not support sets.
Hence, this must be considered a *work-around*.

We assume that the harvesting party already knows which records to harvest.
These target records must be listed in a *task file* (see below).

The records will be harvested by successive `GetRecord` commands to the repository.
The metadata of each harvested record, stripped of the OAI-PMH response envelope but otherwise unmodified, ends up in a destination directory that can be configured (or passed as an argument).
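
As an illustration of what a single `GetRecord` request looks like, here is a minimal sketch that builds the request URL with the standard library. The helper name `get_record_url` is hypothetical and not part of this program; the sketch also percent-encodes the parameters, which the program itself does not do.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix):
    """Build a GetRecord request URL (hypothetical helper, for illustration only)."""
    query = urlencode({
        'verb': 'GetRecord',
        'identifier': identifier,
        'metadataPrefix': metadata_prefix,
    })
    return f'{base_url}?{query}'


# Endpoint and identifier are taken from the example task file in this document.
print(get_record_url(
    'https://easy.dans.knaw.nl/oai/',
    'oai:easy.dans.knaw.nl:easy-dataset:4215',
    'oai_dc',
))
```

Encoding matters mostly for identifiers that contain characters such as `&` or `#`, which would otherwise change the meaning of the URL.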

This program is unsophisticated in that

* it unconditionally overwrites existing records with harvested ones, whether they have changed or not;
* it has no logic to deal with the various deletion policies of the repositories.

Having said that, the program does write a file listing the identifiers of the records that it was unable to harvest.

In short: given a task file with repository URLs, metadata prefixes, and record identifiers,
this program fetches the requested metadata for those records from those repositories by means of `GetRecord` requests.

## Format of the task file

The task file is a plain text file.

Every line is one of the following:

* empty or whitespace only: will be ignored
* `@` as first character: the rest is *key=value*
* a record identifier

These are the valid keys:

* `set`: the name of the pseudo set of identifiers
* `url`: the URL of the OAI-PMH endpoint of a repository
* `metadataPrefix`: the metadata format to request, e.g. `oai_dc`

Each record identifier is fetched from the most recent `url` encountered,
in the metadata format given by the most recent `metadataPrefix` encountered.

The identifiers are grouped into sets: every identifier belongs to the most recent `set` encountered.

## Example

```
@url=https://easy.dans.knaw.nl/oai/
@metadataPrefix=oai_dc

@set=theo1

oai:easy.dans.knaw.nl:easy-dataset:4215
oai:easy.dans.knaw.nl:easy-dataset:25037

@set=theo2

oai:easy.dans.knaw.nl:easy-dataset:30768
oai:easy.dans.knaw.nl:easy-dataset:32044
```
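
To make the "most recent key wins" rule concrete, the following sketch parses the example above from a string and lists the resulting requests. The function `parse_task` is a hypothetical illustration; it returns a flat list, whereas the program's own `readTask` reads from a file and builds a nested dictionary.

```python
EXAMPLE_TASK = """\
@url=https://easy.dans.knaw.nl/oai/
@metadataPrefix=oai_dc

@set=theo1

oai:easy.dans.knaw.nl:easy-dataset:4215
oai:easy.dans.knaw.nl:easy-dataset:25037

@set=theo2

oai:easy.dans.knaw.nl:easy-dataset:30768
oai:easy.dans.knaw.nl:easy-dataset:32044
"""


def parse_task(text):
    """Return a list of (url, set, metadataPrefix, identifier) tuples."""
    # the settings in force for the identifiers that follow
    current = {'url': None, 'set': None, 'metadataPrefix': None}
    requests = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # empty or whitespace-only line
        if line.startswith('@'):
            (key, _, value) = line[1:].partition('=')
            if key in current:
                current[key] = value  # a later value replaces an earlier one
            continue
        requests.append((current['url'], current['set'], current['metadataPrefix'], line))
    return requests


for (url, set_name, prefix, identifier) in parse_task(EXAMPLE_TASK):
    print(set_name, prefix, identifier)
```

Because `url` and `metadataPrefix` are given only once, they apply to all four identifiers, while the two `@set` lines split the identifiers into two groups of two.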

## Result of a task

When a task has run, the fetched record metadata ends up in the destination directory,
in one subdirectory per set and metadata prefix (`<set>/<metadataPrefix>`), one file per record.
Within each subdirectory the files are named `r1.xml`, `r2.xml`, and so on, in the order in which the records are listed in the task file.

Each subdirectory also contains a file `index.txt`, which maps the file names (without extension) to record identifiers; the set and metadata prefix follow from the directory.

The reason for this indirection is that the record ids are not always legal file names on all platforms.
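
A small sketch of the problem and of the indirection: OAI identifiers contain colons, which Windows does not accept in file names, so sequential names plus a lookup table are the safer choice. Both `is_portable_name` and `index_lines` are hypothetical helpers written for this illustration, not functions of the program.

```python
# Characters that Windows does not allow in file names.
RESERVED = set('<>:"/\\|?*')


def is_portable_name(name):
    """True if the name contains none of the reserved characters."""
    return not (set(name) & RESERVED)


def index_lines(record_ids):
    """Pair each identifier with a sequential, platform-safe file name."""
    return [f'r{n}\t{record_id}' for (n, record_id) in enumerate(record_ids, start=1)]


record_ids = [
    'oai:easy.dans.knaw.nl:easy-dataset:4215',
    'oai:easy.dans.knaw.nl:easy-dataset:25037',
]
print(is_portable_name(record_ids[0]))  # the identifier itself is not a safe file name
print(is_portable_name('r1.xml'))       # the sequential name is
for line in index_lines(record_ids):
    print(line)
```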

If there are identifiers in the task for which no metadata record in the desired metadata format could be harvested,
they are listed, together with the reason, in a file `missed.txt` in the same subdirectory.

In [25]:
import os
import sys
import collections
import re
from shutil import rmtree
from subprocess import run

In [2]:
TASK = '../input/theo.txt'
DEST = '../_temp/easy'
COMMAND = ('curl', '-o')

In [3]:
def cleanDir(path):
    if os.path.exists(path):
        rmtree(path)
    os.makedirs(path, exist_ok=True)  

In [4]:
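def readTask(path):
    """Read a task file.

    Returns a nested dict: url -> set name -> metadataPrefix -> list of record identifiers.
    """
    repos = {}
    url = None
    prefix = None
    setName = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip empty and whitespace-only lines
            if not line:
                continue
            # a directive line: @key=value
            if line[0] == '@':
                (key, value) = line[1:].split('=', maxsplit=1)
                if key == 'url':
                    url = value
                elif key == 'metadataPrefix':
                    prefix = value
                elif key == 'set':
                    setName = value
                continue
            # any other line is a record identifier; file it under the current settings
            repos.setdefault(url, {}).setdefault(setName, {}).setdefault(prefix, []).append(line)
    return repos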

In [51]:
metadataPat = re.compile('<metadata[^>]*>(.*)</metadata>', re.S)
errorPat = re.compile('''<error.*code=['"]([^'"]*)['"][^>]*>(.*)</error>''', re.S)

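def deliver(path):
    """Reduce the response saved at path to its metadata payload.

    Returns None on success, otherwise an error message.
    """
    with open(path) as fh: text = fh.read()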
    error = None
    if '</GetRecord>' in text and '</metadata>' in text:
        match = metadataPat.search(text)
        if match:
            text = match.group(1).strip()
            with open(path, 'w') as fh: fh.write(text)
        else:
            error = 'No metadata found'
    elif '</error>' in text:
        match = errorPat.search(text)
        if match:
            code = match.group(1)
            msg = match.group(2)
            error = f'{code}: {msg}'
        else:
            error = 'Could not parse error message'
    else:
        error = 'No record found and no error message found'
    return error
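
The cell above picks the response apart with regular expressions. As a possible alternative, and not the method this notebook uses, the same decision can be sketched with the standard library XML parser. The function `extract_metadata` is hypothetical, and it works on a string rather than on a file.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH response envelope.
NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}


def extract_metadata(response_text):
    """Return (metadata_xml, error); exactly one of the two is None."""
    try:
        root = ET.fromstring(response_text)
    except ET.ParseError as e:
        return (None, f'Could not parse response: {e}')
    # an OAI-PMH error is reported as an <error code="..."> child of the root
    error = root.find('oai:error', NS)
    if error is not None:
        message = (error.text or '').strip()
        return (None, f'{error.get("code")}: {message}')
    metadata = root.find('oai:GetRecord/oai:record/oai:metadata', NS)
    if metadata is None or len(metadata) == 0:
        return (None, 'No metadata found')
    # serialise the single child of <metadata>;
    # note that ElementTree may rename namespace prefixes when it does so
    return (ET.tostring(metadata[0], encoding='unicode').strip(), None)


SAMPLE_ERROR = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<error code="idDoesNotExist">No such record</error>'
    '</OAI-PMH>'
)
print(extract_metadata(SAMPLE_ERROR))
```

A parser is less sensitive to attribute order and whitespace than a regular expression, at the cost of not preserving the payload byte for byte.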

In [59]:
def doTask(path, dest):
    repos = readTask(path)
    totalRecords = 0
    totalMissed = 0
    for (url, sets) in repos.items():
        for (setName, prefixes) in sets.items():
            print(f'Harvesting {setName}')
            for (prefix, recordIds) in prefixes.items():
                missed = {}
                recordMap = []
                setDest = f'{dest}/{setName}/{prefix}'
                cleanDir(setDest)
                for rId in recordIds:
                    recordMap.append(rId)
                    n = len(recordMap)
                    m = len(missed)
                    rUrl = f'{url}?verb=GetRecord&identifier={rId}&metadataPrefix={prefix}'
                    rDest = f'{setDest}/r{n}.xml'
                    sys.stdout.write(f'\t{setName:<10} {prefix:<10}: {n:>5} ({m:>5} missed)\r')
                    # reset per record, so that the outcome of an earlier record cannot leak into this one
                    error = None
                    try:
                        run(COMMAND + (rDest, rUrl))
                        error = deliver(rDest)
                    except Exception as e:
                        print(e)
                        error = 'system error'
                    if error:
                        missed[rId] = error
                        # do not leave a partial or erroneous response behind
                        if os.path.exists(rDest):
                            os.unlink(rDest)
                print(f'\t{setName:<10} {prefix:<10}: {len(recordMap):>5} ({len(missed):>5} missed)')
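                # keep the original numbering, so that index entries match the file names
                # even when earlier records in the list were missed
                harvested = [(i + 1, r) for (i, r) in enumerate(recordMap) if r not in missed]
                if harvested:
                    with open(f'{setDest}/index.txt', 'w') as fh:
                        for (n, rId) in harvested:
                            fh.write(f'r{n}\t{rId}\n')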
                if missed:
                    with open(f'{setDest}/missed.txt', 'w') as fh:
                        for (rId, msg) in sorted(missed.items()):
                            fh.write(f'{rId}\t{msg}\n')
                totalRecords += len(recordMap)
                totalMissed += len(missed)
    print(f'Harvested {totalRecords:>5} records of which ({totalMissed:>5} missed)')

In [60]:
doTask(TASK, DEST)

Harvesting theo1
	theo1      oai_dc    :     3 (    1 missed)
Harvesting theo2
	theo2      oai_dc    :     2 (    0 missed)
	theo2      oai_dx    :     1 (    1 missed)
Harvested     6 records of which (    2 missed)


In [81]:
!find ../_temp/easy -type f -print \
    -exec echo "---------------------------------" \; \
    -exec echo "" \; \
    -exec head -n 5 {} \; \
    -exec echo "" \; \
    -exec echo "==================================" \; \
    -exec echo "" \;

../_temp/easy/theo2/oai_dc/r2.xml
---------------------------------

<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/dc.xsd">
   <dc:title>Religion and personality characteristics in two polders in the Netherlands 1962</dc:title>


../_temp/easy/theo2/oai_dc/r1.xml
---------------------------------

<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd http://purl.org/dc/elements/1.1/ http://dublincore.