geo-harvester

Geo-Harvester is a Python CLI application for harvesting, normalizing, and writing GIS and geospatial metadata, with a focus on providing this metadata for TIMDEX.

At a high level, this is accomplished by:

fetching metadata records generated by MIT (S3) or from OpenGeoMetadata (OGM) repositories (GitHub)
normalizing this metadata to the Aardvark metadata format
sending EventBridge events, based on the normalized metadata, so that other applications can move and copy data as needed

Development

To install with dev dependencies: make install
To update dependencies: make update
To run unit tests: make test
To lint the repo: make lint
To run the app: pipenv run harvester --help

Environment Variables

Required

SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
WORKSPACE=### Set to `dev` for local development. Terraform sets this to `stage` and `prod` in those environments.

Optional

S3_RESTRICTED_CDN_ROOT=### S3 bucket + prefix for the restricted CDN, e.g. 's3://<bucket>/path/to/restricted'
S3_PUBLIC_CDN_ROOT=### S3 bucket + prefix for the public CDN, e.g. 's3://<bucket>/path/to/public'
S3_TIMDEX_ALMA=### S3 bucket + prefix for previously extracted Alma source records, e.g. 's3://<timdex-extract-bucket>/alma'
GEOHARVESTER_SQS_TOPIC_NAME=### default value for CLI argument --sqs-topic-name
OGM_CONFIG_FILEPATH=### optional location for OGM configuration YAML
OGM_CLONE_ROOT_URL=### optional base URL or filepath for where to clone OGM repositories from
OGM_CLONE_ROOT_DIR=### optional location for where cloned repositories are saved locally
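
As an illustration of how the variables above might be consumed, here is a minimal sketch in Python. The `HarvesterConfig` class and `load_config` function are hypothetical names invented for this example and are not taken from the harvester codebase. The sketch assumes that only `WORKSPACE` must be present at startup and that an empty value means "not configured".

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HarvesterConfig:
    """Hypothetical container for a few of the documented variables."""

    workspace: str
    sentry_dsn: Optional[str]
    s3_restricted_cdn_root: Optional[str]
    s3_public_cdn_root: Optional[str]
    sqs_topic_name: Optional[str]


def load_config(env: Optional[Mapping[str, str]] = None) -> HarvesterConfig:
    """Build a config from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env

    workspace = env.get("WORKSPACE")
    if not workspace:
        # Assumption: a missing WORKSPACE is treated as a hard error.
        raise RuntimeError("WORKSPACE must be set, e.g. WORKSPACE=dev")

    return HarvesterConfig(
        workspace=workspace,
        # Empty or missing values are normalized to None ("not configured").
        sentry_dsn=env.get("SENTRY_DSN") or None,
        s3_restricted_cdn_root=env.get("S3_RESTRICTED_CDN_ROOT") or None,
        s3_public_cdn_root=env.get("S3_PUBLIC_CDN_ROOT") or None,
        sqs_topic_name=env.get("GEOHARVESTER_SQS_TOPIC_NAME") or None,
    )


# Usage with an explicit mapping, so the example does not depend on the shell.
config = load_config({"WORKSPACE": "dev"})
print(config.workspace)
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to exercise in unit tests.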

CLI Commands

All CLI commands can be run with pipenv run <COMMAND>.

`harvester`

Usage: -c [OPTIONS] COMMAND [ARGS]...

  Root harvester command that other sub-commands extend.

Options:
  -v, --verbose  Pass to log at debug level instead of info.
  -h, --help     Show this message and exit.

Commands:
  harvest  Harvest command with sub-commands for different sources.
  ping     Debug ping/pong command

`harvester ping`

Usage: -c ping [OPTIONS]

  Debug ping/pong command.

  This command is purely for debugging purposes to ensure docker container
  and/or application is functional and responsive before any meaningful
  business logic.

Options:
  -h, --help  Show this message and exit.

`harvester harvest`

Base command for harvests. Expects the sub-command `mit` or `ogm`.

Usage: -c harvest [OPTIONS] COMMAND [ARGS]...

  Harvest command with sub-commands for different sources.

Options:
  -t, --harvest-type [full|incremental]
                                  Type of harvest, may be: 'incremental' or
                                  'full'.
  -f, --from-date TEXT            filter for files modified on or after this
                                  date; format YYYY-MM-DD.
  -u, --until-date TEXT           filter for files modified before this date;
                                  format YYYY-MM-DD.
  -o, --output-file TEXT          Filepath to write single, combined JSONLines
                                  file of normalized MITAardvark metadata for
                                  ALL harvested records.  This is the expected
                                  format for the TIMDEX pipeline.
  -h, --help                      Show this message and exit.

Commands:
  mit  Harvest and normalize MIT geospatial metadata records.
  ogm  Harvest and normalize OpenGeoMetadata (OGM) geospatial metadata...
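
The `--from-date` and `--until-date` options describe a half-open window: files modified on or after the from date and strictly before the until date. A minimal sketch of that check follows. The function names `parse_cli_date` and `in_harvest_window` are hypothetical, and comparing whole dates rather than timestamps is an assumption made for this example; the harvester may compare modification times differently.

```python
from datetime import date, datetime
from typing import Optional


def parse_cli_date(value: Optional[str]) -> Optional[date]:
    """Parse a YYYY-MM-DD string; None means the bound was not provided."""
    if value is None:
        return None
    return datetime.strptime(value, "%Y-%m-%d").date()


def in_harvest_window(
    modified: date,
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
) -> bool:
    """Return True if `modified` falls within [from_date, until_date)."""
    start = parse_cli_date(from_date)
    end = parse_cli_date(until_date)
    if start is not None and modified < start:
        return False  # modified before the inclusive lower bound
    if end is not None and modified >= end:
        return False  # modified on or after the exclusive upper bound
    return True


# Usage: a file modified on the from date is kept, one on the until date is not.
print(in_harvest_window(date(2024, 1, 1), "2024-01-01", "2024-02-01"))
print(in_harvest_window(date(2024, 2, 1), "2024-01-01", "2024-02-01"))
```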

`harvester harvest mit`

Usage: -c harvest mit [OPTIONS]

  Harvest and normalize MIT geospatial metadata records.

Options:
  -i, --input-files TEXT          Directory location of source record zip
                                  files (may be local or s3). Defaults to env
                                  var S3_RESTRICTED_CDN_ROOT if not set.
                                  [required]
  -osd, --output-source-directory TEXT
                                  Directory to write source metadata for EACH
                                  harvested record file with naming convention
                                  '<identifier>.<format>.source.xml|json'.
                                  Defaults to env var S3_PUBLIC_CDN_ROOT if
                                  not set.
  -ond, --output-normalized-directory TEXT
                                  Directory to write normalized MITAardvark
                                  metadata for EACH harvested record file with
                                  naming convention
                                  '<identifier>.aardvark.normalized.json'.
                                  Defaults to env var S3_PUBLIC_CDN_ROOT if
                                  not set.
  -s, --sqs-topic-name TEXT       SQS topic name with messages capturing zip
                                  file modifications. Defaults to env var
                                  GEOHARVESTER_SQS_TOPIC_NAME if not set.
                                  [required]
  --preserve-sqs-messages         If set, SQS messages will remain in the
                                  queue after incremental harvest.
  --skip-eventbridge-events       If set, will skip sending EventBridge events
                                  to manage files in CDN.
  -h, --help                      Show this message and exit.
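
The naming conventions quoted in the help text above can be sketched as follows. The helper names (`source_filename`, `normalized_filename`, `join_root`) are hypothetical, as are the identifier `abc123`, the format label `fgdc`, and the bucket name in the usage lines; none of them come from the harvester codebase.

```python
def source_filename(identifier: str, metadata_format: str, extension: str) -> str:
    """Build '<identifier>.<format>.source.xml|json'."""
    if extension not in ("xml", "json"):
        raise ValueError(f"unsupported source extension: {extension!r}")
    return f"{identifier}.{metadata_format}.source.{extension}"


def normalized_filename(identifier: str) -> str:
    """Build '<identifier>.aardvark.normalized.json'."""
    return f"{identifier}.aardvark.normalized.json"


def join_root(root: str, filename: str) -> str:
    """Join an output root (local directory or s3:// prefix) and a filename."""
    return f"{root.rstrip('/')}/{filename}"


# Usage with made-up values.
root = "s3://example-bucket/path/to/public"
print(join_root(root, source_filename("abc123", "fgdc", "xml")))
print(join_root(root, normalized_filename("abc123")))
```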

`harvester harvest ogm`

Usage: -c harvest ogm [OPTIONS]

  Harvest and normalize OpenGeoMetadata (OGM) geospatial metadata records.

Options:
  --include-repositories TEXT  If set, limit to only these comma separated
                               list of repositories for harvest.
  --exclude-repositories TEXT  If set, exclude these comma separated list of
                               repositories from harvest.
  -h, --help                   Show this message and exit.
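
To illustrate how the comma-separated repository options might be interpreted, here is a minimal sketch. The function names and the repository names `repo-a`, `repo-b`, and `repo-c` are hypothetical. How the harvester combines the two options when both are set is not documented above; this sketch assumes the include list is applied first and the exclude list second.

```python
from typing import List, Optional


def parse_repository_list(value: Optional[str]) -> List[str]:
    """Split a comma-separated string, dropping blanks and extra whitespace."""
    if not value:
        return []
    return [name.strip() for name in value.split(",") if name.strip()]


def select_repositories(
    available: List[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Filter `available`, preserving its order."""
    included = parse_repository_list(include)
    excluded = set(parse_repository_list(exclude))
    # An empty include list means "no restriction".
    selected = [name for name in available if not included or name in included]
    return [name for name in selected if name not in excluded]


# Usage with made-up repository names.
repos = ["repo-a", "repo-b", "repo-c"]
print(select_repositories(repos, include="repo-a, repo-c"))
print(select_repositories(repos, exclude="repo-b"))
```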
