Update README

MITLibraries · Jan 9, 2024 · 755d07c
1 parent 15abb5e
commit 755d07c
Showing 1 changed file with 48 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -2,65 +2,75 @@
 
 # oai-pmh-harvester
 
-CLI app for harvesting from repositories using OAI-PMH.
+OAI-PMH-Harvester is a Python CLI application for harvesting metadata from repositories (also known as "Data Providers") available through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) registry. 
 
-## Harvesting
+## Development
+- To preview a list of available Makefile commands: `make help`
+- To install with dev dependencies: `make install`
+- To update dependencies: `make update`
+- To run unit tests: `make test`
+- To lint the repo: `make lint`
+- To run the app: `pipenv run oai --help`
 
-To install and run tests:
+### Running the application on your local machine
 
-- `make install`
-- `make test`
+Create a virtual environment and install dev dependencies: `make install`. 
 
-To view available commands and main options:
+Additional notes: 
 
-- `pipenv run oai --help`
+1. To execute the steps below, you can use the following sample URL for an OAI-PMH repo: `https://aspace-staff-dev.mit.edu/oai`.
 
-To run a harvest:
+2. To write the output file to an S3 bucket, pass an S3 URL as the `-o/--output-file` argument.
+   * With AWS credentials: 
+      ```
+      -o s3://<AWS_KEY>:<AWS_SECRET_KEY>@<BUCKET_NAME>/<output-filename>.xml
+      ```
+   * Without AWS credentials (if you have your credentials stored locally):
+      ```
+      -o s3://<BUCKET_NAME>/<output-filename>.xml
+      ```
 
-- `pipenv run oai -h [host repo oai-pmh url] -o [path to output file] harvest [any additional desired options]`
+#### With Docker
 
-## Development
+1. Run `make dist-dev` to build the Docker container image.
 
-Clone the repo and install the dependencies using [Pipenv](https://docs.pipenv.org/):
+2. To run a harvest, execute the following command in your terminal:
+   ```
+   docker run -it --volume <local-file-path>:<docker-file-path> oai-pmh-harvester-dev -h <url-to-oai-pmh-repo> -o <docker-file-path>/<output-filename>.xml harvest <optional-command-args>
+   ```
 
-```bash
-git git@github.com:MITLibraries/oai-pmh-harvester.git
-cd oai-pmh-harvester
-make install
-```
+   **Note:** The `-v/--volume` argument mounts the \<local-file-path> in the current directory into the container at \<docker-file-path>, which allows us to view the generated output file in \<local-file-path>.
 
-## Docker
 
-To build and run in docker:
+#### Without Docker 
 
-```bash
-make dist-dev
-docker run -it oaiharvester
-```
+1. To run a harvest, execute the following command in your terminal:
 
-To run this locally in Docker while maintaining the ability to see the output file, you can do something like:
+   ```
+   pipenv run oai -h <url-to-oai-pmh-repo> -o <output-filename>.xml harvest <optional-command-args>
+   ```
 
-```bash
-docker run -it --volume '/FULL/PATH/TO/WHERE/YOU/WANT/FILES/tmp:/app/tmp' oaiharvester -h https://aspace-staff-dev.mit.edu/oai -o tmp/out.xml harvest -m oai_ead
-```
+## Environment variables
+
+### Required
 
-## S3 Output
+```shell
+# Set to 'dev' for local development. Terraform sets this to 'stage' and 'prod' in those environments.
+WORKSPACE=dev
 
-You can save to s3 by passing an s3 url as the --output-file (-o) in a format like:
+# Required only if a source has records that cause errors during a harvest. Only takes effect when --method=get. The value must be a space-separated list of OAI-PMH record identifiers to skip during harvest.
+RECORD_SKIP_LIST="<oai-pmh-id1> <oai-pmh-id2>"
 
-```bash
--o s3://AWS_KEY:AWS_SECRET_KEY@BUCKET_NAME/FILENAME.xml
 ```
 
-If you have your credentials stored locally, you can omit the passed params like:
+### Optional
+
+```shell
+# Sets the interval for logging status updates as records are written to the output file. Defaults to 1000, which will log a status update for every thousandth record.
+STATUS_UPDATE_INTERVAL=1000
 
-```bash
--o s3://BUCKET_NAME/FILENAME.xml
+# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
+SENTRY_DSN=<sentry-dsn-for-oai-pmh-harvester>
 ```
 
-## ENV variables
 
-- `RECORD_SKIP_LIST` = Required if a source has records that cause errors during harvest, otherwise those records will cause the harvest process to crash. Space-separated list of OAI-PMH record identifiers to skip during harvest, e.g. `RECORD_SKIP_LIST=record1 record2`. Note: this only works if the harvest method used is "get".
-- `SENTRY_DSN` = Optional in dev. If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
-- `STATUS_UPDATE_INTERVAL` = Optional. The transform process logs the # of records transformed every nth record (1000 by default). Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging.
-- `WORKSPACE` = Required. Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
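
For readers of this change, here is a rough sketch of how the four environment variables documented in the README could be read from Python. It is an illustration under stated assumptions, not the harvester's actual code: `load_settings`, `parse_skip_list`, and `parse_status_interval` are hypothetical names, and the only behavior taken from the README is that `WORKSPACE` is required, `RECORD_SKIP_LIST` is space-separated, and `STATUS_UPDATE_INTERVAL` defaults to 1000.

```python
import os


def parse_skip_list(raw):
    """Split a space-separated RECORD_SKIP_LIST value into identifiers."""
    # An unset or empty variable means no records are skipped.
    return raw.split() if raw else []


def parse_status_interval(raw, default=1000):
    """Return the logging interval as a positive integer, or the default."""
    if raw is None or not raw.strip():
        return default
    interval = int(raw)  # raises ValueError for non-integer input
    if interval <= 0:
        raise ValueError("STATUS_UPDATE_INTERVAL must be a positive integer")
    return interval


def load_settings(environ=None):
    """Collect the documented variables into a dict (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {
        # Required: a missing WORKSPACE raises KeyError.
        "workspace": environ["WORKSPACE"],
        "record_skip_list": parse_skip_list(environ.get("RECORD_SKIP_LIST")),
        "status_update_interval": parse_status_interval(
            environ.get("STATUS_UPDATE_INTERVAL")
        ),
        # Optional: None means Sentry monitoring stays disabled.
        "sentry_dsn": environ.get("SENTRY_DSN"),
    }
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise, for example `load_settings({"WORKSPACE": "dev", "RECORD_SKIP_LIST": "id1 id2"})`.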