Implement ingest CLI command
Why these changes are being introduced:
Ingesting records is an essential feature of TIM and is how all records
will be indexed, whether manually or via automation. The ingest command
needs to create or update records in the correct index while enforcing
our business logic around index settings, naming, promotion, record
identifiers, etc.

How this addresses that need:
* Adds config and helper functions to:
  * Generate index settings from a file
  * Generate an index name from a source name using our naming
    convention
  * Get the source name from an index name
  * Iteratively parse records from a JSON file without loading the
    entire file into memory, as these files may be quite large
  * Generate an appropriate JSON request body using the syntax required
    by the OpenSearch bulk indexing API
* Adds functions in the opensearch module to:
  * Create an index using our index settings and mappings from config
  * Get the primary index for a source
  * Return either the primary index for a source or a new index for the
    source depending on supplied parameters and whether a primary index
    for the source already exists
  * Bulk index records from a JSON file to an index, logging indexing
    errors as they arise and returning summary result information when
    complete
* Updates the CLI ingest command to call the functions above as needed
  to index records and log summary results when complete.
* Adds tests and fixtures for all new functionality.
* Updates README to include all optional ENV variables.
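
As an illustration of two of the helpers listed above, here is a minimal sketch of index naming and bulk request body generation. It is not the code from this commit: the function names, the `<source>-<timestamp>` naming convention, and the `timdex_record_id` field are assumptions made for the example. Only the newline-delimited action/document layout reflects the OpenSearch bulk API itself.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional

# Assumed naming convention: "<source>-<UTC timestamp>". The real
# convention used by TIM is not shown in this commit message.
TIMESTAMP_FORMAT = "%Y-%m-%dt%H-%M-%S"


def generate_index_name(source: str, now: Optional[datetime] = None) -> str:
    """Build an index name from a source name (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return f"{source}-{now.strftime(TIMESTAMP_FORMAT)}"


def get_source_from_index(index_name: str) -> str:
    """Recover the source name by dropping the timestamp suffix.

    The assumed timestamp contributes exactly five hyphens to the
    index name, so splitting from the right keeps any hyphens that
    belong to the source name itself.
    """
    return index_name.rsplit("-", 5)[0]


def generate_bulk_lines(
    index: str,
    records: Iterable[dict],
    id_field: str = "timdex_record_id",  # assumed identifier field
) -> Iterator[str]:
    """Yield bulk API lines: one action line, then one document line."""
    for record in records:
        # The "index" action creates the document or replaces an
        # existing document that has the same _id.
        yield json.dumps({"index": {"_index": index, "_id": record[id_field]}})
        yield json.dumps(record)


def build_bulk_body(index: str, records: Iterable[dict]) -> str:
    """Join the lines into a newline-delimited body ending in a newline."""
    return "\n".join(generate_bulk_lines(index, records)) + "\n"
```

Because `generate_bulk_lines` is a generator, it can consume records from a streaming parser one at a time, which fits the goal of not loading the whole file into memory. The `ijson` package added to the Pipfile in this commit offers that kind of streaming, for example `ijson.items(file, "item")` over a top-level JSON array, though how the commit actually uses it is not shown here.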

Side effects of this change:
None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-76
hakbailey committed Sep 2, 2022
1 parent 99a014c commit 06afcc9
Showing 27 changed files with 3,724 additions and 107 deletions.
3 changes: 3 additions & 0 deletions Pipfile
@@ -6,14 +6,17 @@ name = "pypi"
[packages]
boto3 = "*"
click = "*"
ijson = "*"
opensearch-py = "*"
sentry-sdk = "*"
smart-open = {extras = ["s3"], version = "*"}

[dev-packages]
bandit = "*"
black = "*"
coverage = "*"
coveralls = "*"
freezegun = "*"
mypy = "*"
pylama = {extras = ["all"], version = "*"}
pytest = "*"
229 changes: 159 additions & 70 deletions Pipfile.lock

Large diffs are not rendered by default.

10 changes: 8 additions & 2 deletions README.md
@@ -4,10 +4,16 @@ TIMDEX! Index Manager (TIM) is a Python cli application for managing TIMDEX inde

## Required ENV

- - `OPENSEARCH_ENDPOINT` = Optional (can also be passed directly to the CLI via the `--url` option). If using a local Docker OpenSearch instance, this isn't needed. Otherwise set to OpenSearch instance endpoint _without_ the http scheme, e.g. `search-timdex-env-1234567890.us-east-1.es.amazonaws.com`
- - `SENTRY_DSN` = If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
  - `WORKSPACE` = Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
+
+ ## Optional ENV
+
+ - `AWS_REGION` = Only needed if AWS region changes from the default of us-east-1.
+ - `OPENSEARCH_ENDPOINT` = If using a local Docker OpenSearch instance, this isn't needed. Otherwise set to OpenSearch instance endpoint _without_ the http scheme, e.g. `search-timdex-env-1234567890.us-east-1.es.amazonaws.com`. Can also be passed directly to the CLI via the `--url` option.
+ - `OPENSEARCH_REQUEST_TIMEOUT` = Only used for OpenSearch requests that tend to take longer than the default timeout of 10 seconds, such as bulk or index refresh requests. Defaults to 30 seconds if not set.
+ - `SENTRY_DSN` = If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
+ - `STATUS_UPDATE_INTERVAL` = The ingest process logs the # of records indexed every nth record (1000 by default). Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging.

## Development

- To install with dev dependencies: `make install`