Update README
jonavellecuerdo committed Jan 9, 2024
1 parent 15abb5e commit 755d07c
Showing 1 changed file with 48 additions and 38 deletions.
86 changes: 48 additions & 38 deletions README.md
@@ -2,65 +2,75 @@

# oai-pmh-harvester

CLI app for harvesting from repositories using OAI-PMH.
OAI-PMH-Harvester is a Python CLI application for harvesting metadata from repositories (also known as "Data Providers") available through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) registry.

## Harvesting
## Development
- To preview a list of available Makefile commands: `make help`
- To install with dev dependencies: `make install`
- To update dependencies: `make update`
- To run unit tests: `make test`
- To lint the repo: `make lint`
- To run the app: `pipenv run oai --help`

To install and run tests:
### Running the application on your local machine

- `make install`
- `make test`
Create a virtual environment and install dev dependencies: `make install`.

To view available commands and main options:
Additional notes:

- `pipenv run oai --help`
1. To execute the steps below, you can use the following sample URL for an OAI-PMH repo: `https://aspace-staff-dev.mit.edu/oai`.

To run a harvest:
2. To write the output file to an S3 bucket, include S3 in the `-o/--output-file` argument.
* With AWS credentials:
```
-o s3://<AWS_KEY>:<AWS_SECRET_KEY>@<BUCKET_NAME>/<output-filename>.xml
```
* Without AWS credentials (if you have your credentials stored locally):
```
-o s3://<BUCKET_NAME>/<output-filename>.xml
```
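
To make the two `s3://` forms above concrete, here is a minimal Python sketch that splits such a URL into its parts using only the standard library. The function name `parse_s3_output` is hypothetical and this is not the harvester's own code; it only illustrates how the credentials, bucket, and filename are laid out in the URL.

```python
from urllib.parse import urlsplit


def parse_s3_output(url: str) -> dict:
    """Split an s3:// output URL into credentials, bucket, and object key.

    Hypothetical helper for illustration only; not part of oai-pmh-harvester.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url!r}")
    return {
        "access_key": parts.username,  # None when credentials are omitted
        "secret_key": parts.password,  # None when credentials are omitted
        "bucket": parts.hostname,
        "key": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(parse_s3_output("s3://my-bucket/out.xml"))
```

One caveat under this layout: a secret key containing `/` would likely need percent-encoding before being embedded in the URL, which is one reason the form without credentials may be easier to work with.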

- `pipenv run oai -h [host repo oai-pmh url] -o [path to output file] harvest [any additional desired options]`
#### With Docker

## Development
1. Run `make dist-dev` to build the Docker container image.

Clone the repo and install the dependencies using [Pipenv](https://docs.pipenv.org/):
2. To run a harvest, execute the following command in your terminal:
```
docker run -it --volume <local-file-path>:<docker-file-path> oai-pmh-harvester-dev -h <url-to-oai-pmh-repo> -o <docker-file-path>/<output-filename>.xml harvest <optional-command-args>
```

```bash
git clone git@github.com:MITLibraries/oai-pmh-harvester.git
cd oai-pmh-harvester
make install
```
**Note:** The `-v/--volume` argument mounts the \<local-file-path> in the current directory into the container at \<docker-file-path>, which allows us to view the generated output file in \<local-file-path>.
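
The relationship between the volume mapping and the output path can be sketched in Python as follows. The helper name `docker_harvest_command` is hypothetical; the image name and flags are taken from the command shown in this README, and the example values (`/app/tmp`, `-m oai_ead`) come from the earlier version of these instructions. The point is that the `-o` path must sit inside the container-side directory for the file to appear in the local one.

```python
from pathlib import PurePosixPath


def docker_harvest_command(local_dir, container_dir, host_url, filename, extra_args=()):
    """Build the argument list for a harvest run in the dev container.

    Hypothetical helper for illustration; it only assembles arguments
    and does not invoke Docker.
    """
    # The output file is written inside the mounted container directory,
    # so it also shows up under local_dir on the host.
    output_path = PurePosixPath(container_dir) / filename
    return [
        "docker", "run", "-it",
        "--volume", f"{local_dir}:{container_dir}",
        "oai-pmh-harvester-dev",
        "-h", host_url,
        "-o", str(output_path),
        "harvest", *extra_args,
    ]


if __name__ == "__main__":
    cmd = docker_harvest_command(
        "/home/me/tmp",
        "/app/tmp",
        "https://aspace-staff-dev.mit.edu/oai",
        "out.xml",
        ["-m", "oai_ead"],
    )
    print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` if you wanted to launch the container from a script.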

## Docker

To build and run in docker:
#### Without Docker

```bash
make dist-dev
docker run -it oaiharvester
```
1. To run a harvest, execute the following command in your terminal:

To run this locally in Docker while maintaining the ability to see the output file, you can do something like:
```
pipenv run oai -h <url-to-oai-pmh-repo> -o <output-filename>.xml harvest <optional-command-args>
```

```bash
docker run -it --volume '/FULL/PATH/TO/WHERE/YOU/WANT/FILES/tmp:/app/tmp' oaiharvester -h https://aspace-staff-dev.mit.edu/oai -o tmp/out.xml harvest -m oai_ead
```
## Environment variables

### Required

## S3 Output
```shell
# Set to dev for local development; Terraform sets this to 'stage' and 'prod' in those environments.
WORKSPACE=dev

You can save to s3 by passing an s3 url as the --output-file (-o) in a format like:
# Required only if a source has records that cause errors during a harvest and --method=get. The value provided must be a space-separated list of OAI-PMH record identifiers to skip during harvest.
RECORD_SKIP_LIST=<oai-pmh-id1> <oai-pmh-id2>

```bash
-o s3://AWS_KEY:AWS_SECRET_KEY@BUCKET_NAME/FILENAME.xml
```

If you have your credentials stored locally, you can omit the passed params like:
### Optional

```shell
# Sets the interval for logging status updates as records are written to the output file. Defaults to 1000, which will log a status update for every thousandth record.
STATUS_UPDATE_INTERVAL=1000

```bash
-o s3://BUCKET_NAME/FILENAME.xml
# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
SENTRY_DSN=<sentry-dsn-for-oai-pmh-harvester>
```

## ENV variables

- `RECORD_SKIP_LIST` = Required if a source has records that cause errors during harvest, otherwise those records will cause the harvest process to crash. Space-separated list of OAI-PMH record identifiers to skip during harvest, e.g. `RECORD_SKIP_LIST=record1 record2`. Note: this only works if the harvest method used is "get".
- `SENTRY_DSN` = Optional in dev. If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
- `STATUS_UPDATE_INTERVAL` = Optional. The transform process logs the # of records transformed every nth record (1000 by default). Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging.
- `WORKSPACE` = Required. Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
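
As a rough illustration of how these variables could be read, the sketch below loads them from the environment with the defaults described above. The function name `load_settings` and the returned dictionary keys are assumptions made for this example, not the application's actual configuration code.

```python
import os


def load_settings(env=None):
    """Read the harvester's environment variables into a dict.

    Hypothetical helper for illustration; pass a dict as `env` to test
    without touching the real environment.
    """
    env = os.environ if env is None else env
    workspace = env.get("WORKSPACE")
    if not workspace:
        raise RuntimeError("WORKSPACE is required (use 'dev' for local development)")
    return {
        "workspace": workspace,
        # Space-separated OAI-PMH identifiers; empty list when unset.
        "record_skip_list": env.get("RECORD_SKIP_LIST", "").split(),
        # Log a status update every nth record; 1000 by default.
        "status_update_interval": int(env.get("STATUS_UPDATE_INTERVAL", "1000")),
        # Treat an unset or empty DSN as "Sentry disabled".
        "sentry_dsn": env.get("SENTRY_DSN") or None,
    }


if __name__ == "__main__":
    print(load_settings({"WORKSPACE": "dev", "RECORD_SKIP_LIST": "record1 record2"}))
```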
