Use 181 read embeddings with tda #373
```python
@my_vcr.use_cassette(
    "opensearch/bulk_update_raises_bulk_operation_error_if_record_not_found.yaml"
)
def test_bulk_update_raises_bulk_operation_error_if_record_not_found(
    test_opensearch_client,
):
    updates = [
        {
            "timdex_record_id": "i-am-not-found",
            "title": "Materials Science & Engineering (UPDATED)",
        }
    ]
    with pytest.raises(BulkOperationError):
        tim_os.bulk_update(test_opensearch_client, "test-index", iter(updates))
```
@ghukill Do you think this is the appropriate response? Essentially, once it hits an update for a record not found in the OpenSearch index, the BulkOperationError is raised. 🤔
This is the result of carrying over the pattern in tim.opensearch.bulk_index, where if the response[0] is False and the error type is anything other than a mapping error, a custom exception is raised.
The same pattern is currently implemented in tim.opensearch.bulk_update. 🤔
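A rough sketch of that pattern may help illustrate the discussion. Note this is an assumption-laden stand-in, not the actual `tim.opensearch` code: the function name, the exception class, and the specific `mapper_parsing_exception` error-type string are all illustrative, and it operates on the `(ok, item)` tuples that opensearchpy's `streaming_bulk` helper yields:

```python
class BulkOperationError(Exception):
    """Hypothetical stand-in for tim's custom bulk exception."""


def process_bulk_responses(responses):
    """Tally bulk-update responses, raising on any non-mapping error.

    `responses` is an iterable of (ok, item) tuples, as yielded by
    opensearchpy's streaming_bulk with raise_on_error=False.
    """
    results = {"updated": 0, "errors": 0, "total": 0}
    for ok, item in responses:
        results["total"] += 1
        if ok:
            results["updated"] += 1
            continue
        # Assumed error-type string for a mapping error; anything else
        # raises immediately, exiting the loop.
        error_type = item.get("update", {}).get("error", {}).get("type")
        if error_type != "mapper_parsing_exception":
            raise BulkOperationError(f"Bulk update failed: {item}")
        results["errors"] += 1
    return results
```

Under this sketch, a "record not found" response (a `document_missing_exception`, say) raises on the first occurrence, which is the behavior the test above asserts.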
I say we leave it for now. I'd prefer to exit eagerly and often with explicit reasons why at first. Perhaps we'll find that we're okay with most documents getting embeddings even if there are some errors.
But I think in the early days, it'll be nice to know if any records we are trying to update don't exist.
FWIW, I do think the work in USE-273 which will limit to action=index for a run will help with this. Otherwise, anytime a record has action=delete for a run we can kind of assume it won't exist in Opensearch and will likely trigger this error.
Hmm, though tim.cli.bulk_update_embeddings handles the BulkOperationError, the log in line 396 will log:
{"updated": 0, "errors": 0, "total": 0}
Is this an issue?
And re:
it'll be nice to know if any records we are trying to update don't exist.
Is there a change request for this? Are you suggesting that tim.opensearch.bulk_update return a tuple of (<dict_of_counts>, <list_of_errors>)?
To your first question: I don't think it matters. We could move that log line into the try part so it only logs when successful, that might not be a bad small change. Or, to strive for doing one thing under a try block, we could have the except block explicitly exit the CLI with a non-zero exit code.
Actually... that might be a nice option. Yes, I'd propose that! If the embeddings fail, it'd be nice to have a non-zero exit code.
As for your second question, I just meant it'll be nice to know if we ever attempt to add an embedding for any record that doesn't exist. Specific ones aren't needed, I don't think. I do think that USE-273 might be a good time to revisit some of these things, when we see how the updated read methods change things (if at all).
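The non-zero-exit-code proposal could look something like the following minimal click sketch. This is not the actual `tim/cli.py` command; the `--fail` flag is purely illustrative, standing in for `tim_os.bulk_update` raising a `BulkOperationError`:

```python
import click


class BulkOperationError(Exception):
    """Hypothetical stand-in for tim's custom bulk exception."""


@click.command()
@click.option("--fail", is_flag=True, help="Simulate a failing bulk update.")
@click.pass_context
def bulk_update_embeddings(ctx, fail):
    try:
        if fail:  # stand-in for tim_os.bulk_update(...) raising
            raise BulkOperationError("record not found")
        # Only log completion when the bulk update actually succeeded
        results = {"updated": 1, "errors": 0, "total": 1}
        click.echo(f"Bulk update with embeddings complete: {results}")
    except BulkOperationError as exc:
        click.echo(f"Bulk update failed: {exc}", err=True)
        ctx.exit(1)  # non-zero exit code on failure
```

This keeps the "one thing under a try block" shape: the completion log only runs on success, and the except block exits the CLI with status 1.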
Applied the change to exit the CLI with a non-zero exit code when BulkOperationError is raised for the bulk_update_embeddings CLI command (see f2cdca7).
ghukill left a comment
Overall, really nice work! Looking good. I left a few comments. Marking this review as "Request changes" 90% for requiring that --run-id become a required CLI arg.
Other than that, my questions / comments are largely discussion and optional. Looks like an excellent first pass, and we should consider that USE-273 may smooth out some rough edges. It might be worth moving this PR to merge, and noting edge cases or improvements to be performed in that ticket.
tim/cli.py
Outdated
```python
    ),
)
@click.option("-d", "--run-date", help="Run date, formatted as YYYY-MM-DD.")
@click.option("-rid", "--run-id", help="Run ID.")
```
I think until we take another pass at TIM given updated capabilities of TDA (related Jira ticket), we should make this --run-id argument required.
I was confused for a while working through the PR, which originally had this command:
pipenv run tim bulk-update-embeddings -s libguides tests/fixtures/dataset
It ran without error, but was showing zero updated, created, errored, etc. Obviously, I wasn't thinking too deeply about it or I would have noticed that we weren't passing a run_id to select embeddings from the dataset for! But in retrospect, I think the CLI should have thrown an error if that's not provided.
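A hedged sketch of the requested change, assuming the option stays in its current click form (this is not the actual `tim/cli.py`, just the `required=True` pattern in isolation):

```python
import click


@click.command()
@click.option("-rid", "--run-id", required=True, help="Run ID.")
def bulk_update_embeddings(run_id):
    """With required=True, click exits with a usage error (exit code 2)
    and prints which option is missing if --run-id is omitted."""
    click.echo(f"run_id={run_id}")
```

Invoking the command without `--run-id` then fails loudly up front, rather than silently selecting zero embeddings from the dataset.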
tests/test_cli.py
Outdated
```python
    assert result.exit_code == EXIT_CODES["success"]
    assert (
        f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
        in caplog.text
    )
```
For this test, where the bulk updating returns an error > 0, is this the response we want?
Generally speaking, I think we want processes to continue even if 1 / 10k records had an error. At the same time, we want to know about those errors so we don't just paper over (ignore) them indefinitely.
To answer my own question: I think the CLI response code and the logging is good, because we are logging that error count. This would be an excellent use of AWS metrics and observability to then react to this error > 0 count outside the context of this CLI.
My issue may ultimately be with the test name, since I don't think we are "raising" an error in any meaningful way.
Also for consideration, a quick AI suggestion which I find compelling as well:
```python
@pytest.mark.parametrize(
    "bulk_update_return",
    [
        {"updated": 1, "errors": 0, "total": 1},
        {"updated": 0, "errors": 1, "total": 1},
    ],
)
@patch("tim.helpers.validate_bulk_cli_options")
@patch("tim.opensearch.bulk_update")
def test_bulk_update_embeddings_logs_complete(
    mock_bulk_update,
    mock_validate_bulk_cli_options,
    bulk_update_return,
    caplog,
    monkeypatch,
    runner,
):
    monkeypatch.delenv("TIMDEX_OPENSEARCH_ENDPOINT", raising=False)
    mock_bulk_update.return_value = bulk_update_return
    mock_validate_bulk_cli_options.return_value = "libguides"
    result = runner.invoke(
        main,
        [
            "bulk-update-embeddings",
            "--source",
            "libguides",
            "--run-id",
            "85cfe316-089c-4639-a5af-c861a7321493",
            "tests/fixtures/dataset",
        ],
    )
    assert result.exit_code == EXIT_CODES["success"]
    assert (
        f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
        in caplog.text
    )
```

What this communicates to me is that regardless of the success/error counts, the CLI response is basically the same.
Sharing notes from our discussion:
- The `bulk_index` and `bulk_update` methods in `tim/opensearch.py` will raise an error (`BulkIndexingError` and `BulkOperationError`, respectively) as soon as they encounter an error that is not due to a mapping parsing error, and exit the for loop. This is in contrast to the desired outcome noted in this comment.
- At the CLI level, the raised exceptions are handled by the `bulk_update` and `bulk_update_embeddings` CLI commands, but in different ways:
  - `bulk_update`: logs an error message with the `timdex_record_id` of the failing record but proceeds with the `delete` process, and logs a likely inaccurate `index_results`.
  - `bulk_update_embeddings`: logs an error message with the `timdex_record_id` of the failing record and exits the CLI (status code 1).

TLDR: These learnings highlighted that the bulk methods in `tim/opensearch.py` could benefit from a revisit/refactor to better align logging and error handling across the bulk methods.
Agreed! Thanks for summarizing.
* Fix cassette to handle method call to refresh index
* Use parameterized tests
* Remove comments
* Use logger.error when handling custom bulk exceptions
* Call ctx.exit in bulk_update_embeddings when BulkOperationError is raised
* Update method to log elapsed time
* Clean up log when refreshing index
@ghukill I undid the change to replace
ghukill left a comment
Ship it! Approved!
As you've noted in a comment, the completion of this new CLI command revealed some areas that we could tidy up and normalize with regards to bulk operations. As discussed, seems better to address that as a focused pass.
With that in mind, I do feel as though this PR and new CLI command will perform bulk updates with embeddings which is the intent here. As we move into testing, and we expose some bugs, those will be good opportunities to revisit bulk operations logging and error handling a bit more holistically.
Nice work! One step closer to embeddings.
Why these changes are being introduced:
* Now that TDA supports reading embeddings associated with TIMDEX records in the TIMDEX dataset, the stub CLI command can be completed. The OpenSearch mapping requires a new field to store embeddings.

How this addresses that need:
* Add 'embedding_full_record' field to OpenSearch mapping
* Add helper method to format embeddings as input JSON for OpenSearch client
* Update cli
* Use logger.error when handling custom bulk exceptions
* Call ctx.exit in bulk_update_embeddings when BulkOperationError is raised
* Clean up log when refreshing index

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-181
Purpose and background context
Now that TDA supports reading embeddings associated with TIMDEX records in the TIMDEX dataset, the `bulk_update_embeddings` CLI command can be completed. Additionally, this PR updates the OpenSearch mappings to enable indexing of embeddings into OpenSearch.

Notes

The `vcr` cassettes were created by running a local OpenSearch instance that contained the following:
* an index for `source="test-index"` that indexes the records from the TIMDEX dataset loaded from `tests/fixtures/data`
* `tests/fixtures/data/embeddings`

How can a reviewer manually see the effects of these changes?

* `source="libguides"` using TIMDEX dataset loaded from `tests/fixtures/dataset`
* `bulk_update_embeddings`:
* `embedding_full_record`:

Includes new or updated dependencies?

YES - To resolve vulnerabilities but primarily to update `timdex-dataset-api==3.7.2`

Changes expectations for external applications?

MAYBE - The decisions made here must be considered in future work to update query / search functions to use embeddings.
What are the relevant tickets?
https://github.com/MITLibraries/timdex-index-manager/tree/USE-181-read-embeddings-with-tda