Update README
* Add detailed description
* Add templates for 'Required' and 'Optional' environment variables
* Update 'Development' section to provide more details on testing
* Add new section: CLI commands
* Update CLI function 'help' descriptions
   * Use noun phrases for command arguments (excl. date and boolean command args)
jonavellecuerdo committed Jan 9, 2024
1 parent 15abb5e commit 312b5ee
Showing 2 changed files with 141 additions and 56 deletions.
156 changes: 120 additions & 36 deletions README.md
@@ -2,65 +2,149 @@

# oai-pmh-harvester

CLI app for harvesting from repositories using OAI-PMH.
OAI-PMH-Harvester is a Python CLI application for harvesting metadata from repositories (also known as "Data Providers") available through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) registry.

## Harvesting
## Development
- To preview a list of available Makefile commands: `make help`
- To install with dev dependencies: `make install`
- To update dependencies: `make update`
- To run unit tests: `make test`
- To lint the repo: `make lint`
- To run the app: `pipenv run oai --help`

To install and run tests:
### Running the application on your local machine

- `make install`
- `make test`
Create a virtual environment and install dev dependencies: `make install`.

To view available commands and main options:
Additional notes:

- `pipenv run oai --help`
1. To execute the steps below, you can use the following sample URL for an OAI-PMH repo: `https://aspace-staff-dev.mit.edu/oai`.

To run a harvest:
2. To write the output file to an S3 bucket, include S3 in the `-o/--output-file` argument.
* With AWS credentials:
```
-o s3://<AWS_KEY>:<AWS_SECRET_KEY>@<BUCKET_NAME>/<output-filename>.xml
```
* Without AWS credentials (if you have your credentials stored locally):
```
-o s3://<BUCKET_NAME>/<output-filename>.xml
```
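The two `-o` forms above differ only in whether credentials are embedded before the bucket name. As a rough illustration of that URI layout (not the harvester's own implementation, which is not shown here), the following sketch splits such a URI with the standard library. The names `S3Target` and `parse_s3_output` are hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class S3Target(NamedTuple):
    """Hypothetical container for the parts of an s3:// output URI."""

    bucket: str
    key: str
    access_key: Optional[str]
    secret_key: Optional[str]


def parse_s3_output(uri: str) -> S3Target:
    """Split an s3:// URI into bucket, object key, and optional credentials."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    # Credentials, if present, sit before the last "@" in the network location.
    userinfo, _, bucket = parts.netloc.rpartition("@")
    access_key, _, secret_key = userinfo.partition(":")
    return S3Target(
        bucket=bucket,
        key=parts.path.lstrip("/"),
        access_key=access_key or None,
        secret_key=secret_key or None,
    )


with_creds = parse_s3_output("s3://MYKEY:MYSECRET@my-bucket/out.xml")
without_creds = parse_s3_output("s3://my-bucket/out.xml")
```

A secret key that contains `/` would be cut off by this kind of URL parsing unless it is percent-encoded, which this sketch does not handle.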

- `pipenv run oai -h [host repo oai-pmh url] -o [path to output file] harvest [any additional desired options]`
#### With Docker

## Development
1. Run `make dist-dev` to build the Docker container image.

Clone the repo and install the dependencies using [Pipenv](https://docs.pipenv.org/):
2. To run a harvest, execute the following command in your terminal:
```
docker run -it --volume <local-file-path>:<docker-file-path> oai-pmh-harvester-dev -h <url-to-oai-pmh-repo> -o <docker-file-path>/<output-filename>.xml harvest <optional-command-args>
```

```bash
git clone git@github.com:MITLibraries/oai-pmh-harvester.git
cd oai-pmh-harvester
make install
```
**Note:** The `-v/--volume` argument mounts the \<local-file-path> in the current directory into the container at \<docker-file-path>, which allows us to view the generated output file in \<local-file-path>.


#### Without Docker

## Docker
1. To run a harvest, execute the following command in your terminal:

To build and run in docker:
```
pipenv run oai -h <url-to-oai-pmh-repo> -o <output-filename>.xml harvest <optional-command-args>
```

```bash
make dist-dev
docker run -it oaiharvester
## Environment variables

### Required

```shell
# Set to dev for local development. Terraform sets this to 'stage' and 'prod' in those environments.
WORKSPACE=dev
```

To run this locally in Docker while maintaining the ability to see the output file, you can do something like:
### Optional

```shell
# Only needed if a source has records that cause errors during a harvest; takes effect only when --method=get. The value must be a space-separated list of OAI-PMH record identifiers to skip during harvest.
RECORD_SKIP_LIST=<oai-pmh-id1> <oai-pmh-id2>

# Sets the interval for logging status updates as records are written to the output file. Defaults to 1000, which will log a status update for every thousandth record.
STATUS_UPDATE_INTERVAL=1000

```bash
docker run -it --volume '/FULL/PATH/TO/WHERE/YOU/WANT/FILES/tmp:/app/tmp' oaiharvester -h https://aspace-staff-dev.mit.edu/oai -o tmp/out.xml harvest -m oai_ead
# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
SENTRY_DSN=<sentry-dsn-for-oai-pmh-harvester>
```
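To make the required/optional split concrete, here is a minimal sketch of how these variables could be read in Python. It is an assumption-laden illustration, not the application's configuration code: `Settings` and `read_settings` are hypothetical names, and in the CLI itself `RECORD_SKIP_LIST` is wired to the `--skip-record` option through click's `envvar` argument.

```python
import os
from typing import Mapping, NamedTuple, Optional, Tuple


class Settings(NamedTuple):
    """Hypothetical view of the environment variables described above."""

    workspace: str
    record_skip_list: Tuple[str, ...]
    status_update_interval: int
    sentry_dsn: Optional[str]


def read_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables, failing only when the required one is missing."""
    workspace = env.get("WORKSPACE")
    if not workspace:
        raise RuntimeError("WORKSPACE is required (use 'dev' locally)")
    return Settings(
        workspace=workspace,
        # Space-separated identifiers become a tuple; unset means "skip nothing".
        record_skip_list=tuple(env.get("RECORD_SKIP_LIST", "").split()),
        # Falls back to the documented default of 1000.
        status_update_interval=int(env.get("STATUS_UPDATE_INTERVAL", "1000")),
        sentry_dsn=env.get("SENTRY_DSN") or None,
    )


local = read_settings({"WORKSPACE": "dev", "RECORD_SKIP_LIST": "oai:12345 oai:67890"})
```

Passing a plain dictionary, as in the last line, keeps the sketch easy to try without touching the real environment.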

## S3 Output
## CLI commands

You can save to s3 by passing an s3 url as the --output-file (-o) in a format like:
All CLI commands can be run with `pipenv run <COMMAND>`.

```bash
-o s3://AWS_KEY:AWS_SECRET_KEY@BUCKET_NAME/FILENAME.xml
### `oai`

```text
Usage: -c [OPTIONS] COMMAND [ARGS]...
Options:
-h, --host TEXT Hostname of server for an OAI-PMH compliant source.
[required]
-o, --output-file TEXT Filepath for generated output (either an XML file
with harvested metadata or a JSON file describing
set structure of an OAI-PMH compliant source). This
value can be a local filepath or an S3 URI.
[required]
-v, --verbose Pass to log at debug level instead of info
--help Show this message and exit.
Commands:
harvest Harvest command to retrieve records from an OAI-PMH compliant source.
setlist Create a JSON file describing the set structure of an OAI-PMH compliant source.
```
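The `setlist` command listed above is described as writing set names and specs to a JSON file. The sketch below shows one way a ListSets response could be reduced to that kind of JSON using only the standard library. The sample XML follows the OAI-PMH 2.0 response layout, but the set values, the function name `sets_to_json`, and the JSON key names are assumptions; the harvester's actual output shape is not shown here.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative response body; the set values are made up.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>collection_1</setSpec><setName>Collection One</setName></set>
    <set><setSpec>collection_2</setSpec><setName>Collection Two</setName></set>
  </ListSets>
</OAI-PMH>"""


def sets_to_json(list_sets_xml: str) -> str:
    """Extract each set's name and spec and serialize them as a JSON array."""
    root = ET.fromstring(list_sets_xml)
    sets = [
        {
            "set_name": s.findtext("oai:setName", default="", namespaces=OAI_NS),
            "set_spec": s.findtext("oai:setSpec", default="", namespaces=OAI_NS),
        }
        for s in root.iterfind("oai:ListSets/oai:set", OAI_NS)
    ]
    return json.dumps(sets, indent=2)


set_json = sets_to_json(SAMPLE_RESPONSE)
```

A real repository may return its sets across several responses linked by resumption tokens, which this single-response sketch ignores.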

If you have your credentials stored locally, you can omit the passed params like:
### `oai harvest`

```text
Usage: -c harvest [OPTIONS]
Harvest command to retrieve records from an OAI-PMH compliant source.
Options:
--method [get|list] Method for record retrieval. The 'list' method
is faster and should be used in most cases;
'get' method should be used for ArchivesSpace
due to errors retrieving a full record set with
the 'list' method. [default: list]
-m, --metadata-format TEXT Alternate metadata format for harvested records.
A record should only be returned if the format
specified can be disseminated from the item
identified by the value of the identifier
argument. [default: oai_dc]
-f, --from-date TEXT Filter for files modified on or after this date;
format YYYY-MM-DD.
-u, --until-date TEXT Filter for files modified on or before this date;
format YYYY-MM-DD.
-s, --set-spec TEXT SetSpec of set to be harvested. Limits harvest
to records in the provided set.
-sr, --skip-record TEXT Set of OAI-PMH identifiers for records to skip
during a harvest. Only works when --method=get.
Multiple identifiers can be provided using the
syntax: '-sr oai:12345 -sr oai:67890'. Values
can also be retrieved through the
RECORD_SKIP_LIST env var (see README for more
details).
--exclude-deleted Pass to exclude deleted records from harvest.
--help Show this message and exit.
```
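As an illustration of how the options above combine, the following sketch assembles a `harvest` invocation as an argument list, for example to pass to `subprocess.run`. It uses only the flags shown in the help text. The function name `build_harvest_command` is hypothetical, and rejecting `--skip-record` without `--method=get` is a choice made for this sketch rather than documented CLI behavior.

```python
import shlex
from typing import List, Optional, Sequence


def build_harvest_command(
    host: str,
    output_file: str,
    method: str = "list",
    metadata_format: str = "oai_dc",
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
    set_spec: Optional[str] = None,
    skip_records: Sequence[str] = (),
    exclude_deleted: bool = False,
) -> List[str]:
    """Build the argv for `pipenv run oai ... harvest ...`."""
    if skip_records and method != "get":
        # The help text says skipping only works with --method=get.
        raise ValueError("--skip-record only works when --method=get")
    cmd = ["pipenv", "run", "oai", "-h", host, "-o", output_file, "harvest"]
    cmd += ["--method", method, "-m", metadata_format]
    if from_date:
        cmd += ["-f", from_date]
    if until_date:
        cmd += ["-u", until_date]
    if set_spec:
        cmd += ["-s", set_spec]
    for identifier in skip_records:
        # The flag is repeated once per identifier.
        cmd += ["-sr", identifier]
    if exclude_deleted:
        cmd.append("--exclude-deleted")
    return cmd


command = build_harvest_command(
    "https://aspace-staff-dev.mit.edu/oai",
    "out.xml",
    method="get",
    metadata_format="oai_ead",
    skip_records=["oai:12345"],
)
print(shlex.join(command))
```

Building a list instead of a single string avoids shell-quoting problems when a value contains spaces.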

```bash
-o s3://BUCKET_NAME/FILENAME.xml
### `oai setlist`
```
Usage: -c setlist [OPTIONS]
Create a JSON file describing the set structure of an OAI-PMH compliant
source.
Uses the OAI-PMH ListSets verb to retrieve all sets from a repository, and
writes the set names and specs to a JSON output file.
Options:
--help Show this message and exit.
```


## ENV variables

- `RECORD_SKIP_LIST` = Required if a source has records that cause errors during harvest, otherwise those records will cause the harvest process to crash. Space-separated list of OAI-PMH record identifiers to skip during harvest, e.g. `RECORD_SKIP_LIST=record1 record2`. Note: this only works if the harvest method used is "get".
- `SENTRY_DSN` = Optional in dev. If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
- `STATUS_UPDATE_INTERVAL` = Optional. The transform process logs the # of records transformed every nth record (1000 by default). Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging.
- `WORKSPACE` = Required. Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
41 changes: 21 additions & 20 deletions harvester/cli.py
@@ -21,17 +21,19 @@
"-h",
"--host",
required=True,
help="Hostname of OAI-PMH server to harvest from, e.g. "
"https://dspace.mit.edu/oai/request.",
help="Hostname of server for an OAI-PMH compliant source.",
)
@click.option(
"-o",
"--output-file",
required=True,
help="Filepath to write output to. Can be a local filepath or an S3 URI, e.g. "
"S3://bucketname/filename.xml.",
help="Filepath for generated output (either an XML file with harvested metadata or "
"a JSON file describing set structure of an OAI-PMH compliant source). "
"This value can be a local filepath or an S3 URI.",
)
@click.option(
"-v", "--verbose", help="Pass to log at debug level instead of info", is_flag=True
)
@click.option("-v", "--verbose", help="Optional: enable debug output.", is_flag=True)
@click.pass_context
def main(ctx: click.Context, host: str, output_file: str, verbose: bool) -> None:
ctx.ensure_object(dict)
@@ -49,7 +51,7 @@ def main(ctx: click.Context, host: str, output_file: str, verbose: bool) -> None
"--method",
default="list",
show_default=True,
help="Record retrieval method to use. Default 'list' method is faster and should "
help="Method for record retrieval. The 'list' method is faster and should "
"be used in most cases; 'get' method should be used for ArchivesSpace due to "
"errors retrieving a full record set with the 'list' method.",
type=click.Choice(["get", "list"], case_sensitive=False),
@@ -59,29 +61,28 @@ def main(ctx: click.Context, host: str, output_file: str, verbose: bool) -> None
"--metadata-format",
default="oai_dc",
show_default=True,
help="Optional: specify alternate metadata format for harvested records (e.g. "
"mods, mets, oai_dc, qdc, ore).",
help="Alternate metadata format for harvested records. A record should only be "
"returned if the format specified can be disseminated from the item identified "
"by the value of the identifier argument.",
)
@click.option(
"-f",
"--from-date",
default=None,
help="Optional: starting date to harvest records from, in format YYYY-MM-DD. "
"Limits harvest to records added/updated on or after the provided date.",
help="Filter for files modified on or after this date; format YYYY-MM-DD.",
)
@click.option(
"-u",
"--until-date",
default=None,
help="Optional: ending date to harvest records from, in format YYYY-MM-DD. "
"Limits harvest to records added/updated on or before the provided date.",
help="Filter for files modified on or before this date; format YYYY-MM-DD.",
)
@click.option(
"-s",
"--set-spec",
default=None,
show_default=True,
help="Optional: SetSpec of set to be harvested. Limits harvest to records in the "
help="SetSpec of set to be harvested. Limits harvest to records in the "
"provided set.",
)
@click.option(
@@ -90,14 +91,14 @@ def main(ctx: click.Context, host: str, output_file: str, verbose: bool) -> None
envvar="RECORD_SKIP_LIST",
multiple=True,
show_default=True,
help="Optional: OAI-PMH identifier of record to skip during harvest. Only works if "
"the harvest method used is 'get'. Can be repeated to skip multiple records, e.g. "
"'-sr oai:12345 -sr oai:67890'. Can also be set via ENV variable, see README for "
"details.",
help="Set of OAI-PMH identifiers for records to skip during a harvest. Only works "
"when --method=get. Multiple identifiers can be provided using the syntax: "
"'-sr oai:12345 -sr oai:67890'. Values can also be retrieved through the "
"RECORD_SKIP_LIST env var (see README for more details).",
)
@click.option(
"--exclude-deleted",
help="Optional: exclude deleted records from harvest.",
help="Pass to exclude deleted records from harvest.",
is_flag=True,
)
@click.pass_context
@@ -111,7 +112,7 @@ def harvest(
skip_record: tuple[str] | None,
exclude_deleted: bool,
) -> None:
"""Harvest records from an OAI-PMH compliant source and write to an output file."""
"""Harvest command to retrieve records from an OAI-PMH compliant source."""
logger.info(
"OAI-PMH harvesting from source %s with parameters: method=%s, "
"metadata_format=%s, from_date=%s, until_date=%s, set=%s, skip_record=%s, "
@@ -162,7 +163,7 @@ def harvest(
@main.command()
@click.pass_context
def setlist(ctx: click.Context) -> None:
"""Get set info from an OAI-PMH compliant source and write to an output file.
"""Create a JSON file describing the set structure of an OAI-PMH compliant source.
Uses the OAI-PMH ListSets verb to retrieve all sets from a repository, and writes
the set names and specs to a JSON output file.
