
Conversation

@jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Dec 11, 2025

Purpose and background context

Now that TDA supports reading embeddings associated with TIMDEX records in the TIMDEX dataset, the bulk_update_embeddings CLI command can be completed. Additionally, this PR updates the OpenSearch mappings to enable indexing of embeddings into OpenSearch.

Notes

  • The number of lines changed is significant due to the update of dependencies.
  • Two new VCR cassettes were created by running a local OpenSearch instance containing an index for source="test-index", populated with the records from the TIMDEX dataset loaded from tests/fixtures/data.
    • I generated embeddings for this small dataset and stored them in tests/fixtures/data/embeddings.
    • Running the tests with VCR recorded the actual result of running TIM against this index.
      • Example: running this test updates the following document:
      {
        "_index": "test-index-2025-12-11t16-58-08",
        "_id": "libguides:guides-175846",
        "_score": 1,
        "_source": {
          "source": "LibGuides",
          "source_link": "https://libguides.mit.edu/materials",
          "timdex_record_id": "libguides:guides-175846",
          "title": "Materials Science & Engineering (UPDATED)",
          "citation": "Phoebe Ayers. Materials Science & Engineering. MIT Libraries. libguides. https://libguides.mit.edu/materials",
          "content_type": [
            "libguides"
          ],
          ...
        }
      }

How can a reviewer manually see the effects of these changes?

  1. Review unit tests.
  2. [OPTIONAL] Run TIM against a local OpenSearch instance.
    1. Open a new terminal (A)
    2. In terminal A, launch local instances of OpenSearch and OpenSearch Dashboards.
    3. Open another terminal (B)
    4. In terminal B, reindex source="libguides" using the TIMDEX dataset loaded from tests/fixtures/dataset:
      pipenv run tim reindex-source -s libguides tests/fixtures/dataset
    5. In your browser, go to OpenSearch Dashboards and navigate to Dev Tools.
    6. Run the following query to see records before running bulk_update_embeddings:
      GET libguides/_search
      
    7. In terminal B, bulk update documents with embeddings
      pipenv run tim bulk-update-embeddings -s libguides -rid 85cfe316-089c-4639-a5af-c861a7321493 tests/fixtures/dataset
      
    8. In your browser, rerun the query. You should see that documents are updated to include embedding_full_record:
{
        "_index": "libguides-2025-12-11t20-27-58",
        "_id": "libguides:guides-175853",
        "_score": 1,
        "_source": {
          "source": "LibGuides",
          "source_link": "https://libguides.mit.edu/news",
          "timdex_record_id": "libguides:guides-175853",
          ...
          "embedding_full_record": {
            "85": 0.09610562771558762,
            "202": 0.11397245526313782,
            "1758": 0.3144601583480835,
            "authored": 0.01303903479129076
           ...
          }
       }
}

Includes new or updated dependencies?

YES - To resolve vulnerabilities but primarily to update timdex-dataset-api==3.7.2

Changes expectations for external applications?

MAYBE - The decisions made here must be considered in future work to update query / search functions to use embeddings.

What are the relevant tickets?

https://github.com/MITLibraries/timdex-index-manager/tree/USE-181-read-embeddings-with-tda

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch from 8c8c58c to d0ad109 Compare December 11, 2025 18:39
@jonavellecuerdo jonavellecuerdo mentioned this pull request Dec 11, 2025
@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch from d0ad109 to 1c9ceeb Compare December 11, 2025 20:12
@coveralls

coveralls commented Dec 11, 2025

Pull Request Test Coverage Report for Build 20236928399

Details

  • 15 of 17 (88.24%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+6.7%) to 96.082%

Changes missing coverage:
  • tim/helpers.py: 2 of 4 changed/added lines covered (50.0%)

Totals:
  • Change from base Build 19441567561: 6.7%
  • Covered Lines: 466
  • Relevant Lines: 485

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch 3 times, most recently from 1389081 to 720b700 Compare December 12, 2025 13:48
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 12, 2025 13:50
@jonavellecuerdo jonavellecuerdo requested review from a team and ghukill December 12, 2025 13:50
@ghukill ghukill self-assigned this Dec 12, 2025
Comment on lines +557 to +566
@my_vcr.use_cassette(
    "opensearch/bulk_update_raises_bulk_operation_error_if_record_not_found.yaml"
)
def test_bulk_update_raises_bulk_operation_error_if_record_not_found(
    test_opensearch_client,
):
    updates = [
        {
            "timdex_record_id": "i-am-not-found",
            "title": "Materials Science & Engineering (UPDATED)",
        }
    ]
    with pytest.raises(BulkOperationError):
        tim_os.bulk_update(test_opensearch_client, "test-index", iter(updates))
Contributor Author

@ghukill Do you think this is the appropriate response? Essentially, once it hits an update for a record not found in the OpenSearch index, the BulkOperationError is raised. 🤔

This is the result of carrying over the pattern in tim.opensearch.bulk_index, where if the response[0] is False and the error type is anything other than a mapping error, a custom exception is raised.

The same pattern is currently implemented in tim.opensearch.bulk_update. 🤔
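A rough sketch of the pattern described here, for discussion purposes only (the names and response shapes are illustrative, not the actual tim.opensearch code):

```python
# Hypothetical sketch of the bulk error-handling pattern: mapping parsing
# errors are counted and tolerated, anything else fails fast.

class BulkOperationError(Exception):
    """Raised when a bulk item fails for a non-mapping reason."""


def bulk_update(responses):
    results = {"updated": 0, "errors": 0, "total": 0}
    for ok, item in responses:  # streaming_bulk-style (success_flag, item) pairs
        results["total"] += 1
        if not ok:
            error = item.get("update", {}).get("error", {})
            if error.get("type") == "mapper_parsing_exception":
                results["errors"] += 1  # tolerated: count and continue
                continue
            raise BulkOperationError(error)  # anything else: fail fast
        results["updated"] += 1
    return results
```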

Contributor

I say we leave it for now. I'd prefer to exit eagerly and often with explicit reasons why at first. Perhaps we'll find that we're okay with most documents getting embeddings even if there are some errors.

But I think in the early days, it'll be nice to know if any records we are trying to update don't exist.

FWIW, I do think the work in USE-273, which will limit to action=index for a run, will help with this. Otherwise, anytime a record has action=delete for a run we can kind of assume it won't exist in OpenSearch and will likely trigger this error.

Contributor Author

Hmm, though tim.cli.bulk_update_embeddings handles the BulkOperationError, the log in line 396 will log:

{"updated": 0, "errors": 0, "total": 0}

Is this an issue?

And re:

it'll be nice to know if any records we are trying to update don't exist.

Is there a change request for this? Are you suggesting that tim.opensearch.bulk_update return a tuple where...(<dict_of_counts>, <list_of_errors>)?
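To make the question concrete, a hypothetical sketch of that return shape (not the current tim.opensearch.bulk_update signature):

```python
# Hypothetical alternative: return (counts, errors) so the caller can log
# exactly which records failed instead of stopping at the first failure.

def bulk_update_collect(responses):
    counts = {"updated": 0, "errors": 0, "total": 0}
    errors = []
    for ok, item in responses:
        counts["total"] += 1
        if ok:
            counts["updated"] += 1
        else:
            counts["errors"] += 1
            errors.append(item)  # retain the failing item for later logging
    return counts, errors
```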

Contributor

To your first question: I don't think it matters. We could move that log line into the try part so it only logs when successful, that might not be a bad small change. Or, to strive for doing one thing under a try block, we could have the except block explicitly exit the CLI with a non-zero exit code.

Actually... that might be a nice option. Yes, I'd propose that! If the embeddings fail, it'd be nice to have a non-zero exit code.

As for your second question, I just meant it'll be nice to know if we ever attempt to add an embedding for any record that doesn't exist. Specific ones aren't needed, I don't think. I do think that USE-273 might be a good time to revisit some of these things, when we see how the updated read methods change things (if at all).
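A minimal sketch of the proposed change, with hypothetical names; in the real click-based command the `return 1` would be `ctx.exit(1)` inside the except block:

```python
import sys

class BulkOperationError(Exception):
    pass

def run_bulk_update_embeddings(bulk_update):
    """Hypothetical CLI command body: fail with a non-zero code on error."""
    try:
        results = bulk_update()
    except BulkOperationError as exc:
        print(f"Bulk update failed: {exc}", file=sys.stderr)
        return 1  # surface the failure to callers, CI, and metrics
    print(f"Bulk update with embeddings complete: {results}")
    return 0
```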

Contributor Author

Applied the change to exit the CLI with a non-zero exit code when the BulkOperationError is raised for bulk_update_embeddings CLI command (see f2cdca7)

Contributor

@ghukill ghukill left a comment

Overall, really nice work! Looking good. I left a few comments. Marking this review as "Request changes" 90% for requiring that --run-id become a required CLI arg.

Other than that, my questions / comments are largely discussion and optional. Looks like an excellent first pass, and we should consider that USE-273 may smooth out some rough edges. It might be worth moving this PR to merge, and noting edge cases or improvements to be performed in that ticket.

tim/cli.py Outdated
),
)
@click.option("-d", "--run-date", help="Run date, formatted as YYYY-MM-DD.")
@click.option("-rid", "--run-id", help="Run ID.")
Contributor

I think until we take another pass at TIM given updated capabilities of TDA (related Jira ticket), we should make this --run-id argument required.

I was confused for a while working through the PR, which originally had this command:

pipenv run tim bulk-update-embeddings -s libguides tests/fixtures/dataset

It ran without error, but was showing zero updated, created, errored, etc. Obviously, I wasn't thinking too deeply about it or I would have noticed that we weren't passing a run_id for which to select embeddings from the dataset! But in retrospect, I think the CLI should have thrown an error if that's not provided.
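A minimal sketch of the requested change, assuming the option names shown in the diff above (the surrounding options and validation in tim/cli.py are omitted):

```python
import click

# Hypothetical sketch: making --run-id required so the CLI errors out
# instead of silently updating zero records.

@click.command()
@click.option("-s", "--source", required=True, help="Source to update.")
@click.option("-rid", "--run-id", required=True, help="Run ID.")
@click.argument("dataset_location")
def bulk_update_embeddings(source, run_id, dataset_location):
    click.echo(f"updating embeddings for {source}, run {run_id}")
```

With `required=True`, click exits with a usage error before the command body runs when the option is missing.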

Comment on lines 325 to 329
assert result.exit_code == EXIT_CODES["success"]
assert (
f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
in caplog.text
)
Contributor

For this test, where the bulk update returns an error count > 0, is this the response we want?

Generally speaking, I think we want processes to continue even if 1 / 10k records had an error. At the same time, we want to know about those errors so we don't just paper over (ignore) them indefinitely.

To answer my own question: I think the CLI response code and the logging is good, because we are logging that error count. This would be an excellent use of AWS metrics and observability to then react to this error > 0 count outside the context of this CLI.

My issue may ultimately be the test name where I don't think we are "raising" an error in any meaningful way.

Contributor

Also for consideration, a quick AI suggestion which I find compelling as well:

@pytest.mark.parametrize(
    "bulk_update_return",
    [
        {"updated": 1, "errors": 0, "total": 1},
        {"updated": 0, "errors": 1, "total": 1},
    ],
)
@patch("tim.helpers.validate_bulk_cli_options")
@patch("tim.opensearch.bulk_update")
def test_bulk_update_embeddings_logs_complete(
    mock_bulk_update,
    mock_validate_bulk_cli_options,
    bulk_update_return,
    caplog,
    monkeypatch,
    runner,
):
    monkeypatch.delenv("TIMDEX_OPENSEARCH_ENDPOINT", raising=False)
    mock_bulk_update.return_value = bulk_update_return
    mock_validate_bulk_cli_options.return_value = "libguides"

    result = runner.invoke(
        main,
        [
            "bulk-update-embeddings",
            "--source",
            "libguides",
            "--run-id",
            "85cfe316-089c-4639-a5af-c861a7321493",
            "tests/fixtures/dataset",
        ],
    )

    assert result.exit_code == EXIT_CODES["success"]
    assert (
        f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
        in caplog.text
    )

What this communicates to me is that regardless of the success/error counts, the CLI response is basically the same.

Contributor Author

@jonavellecuerdo jonavellecuerdo Dec 12, 2025

Sharing notes from our discussion:

  • The bulk_index and bulk_update methods in tim/opensearch.py raise an error (BulkIndexingError and BulkOperationError, respectively) as soon as they encounter an error that is not due to a mapping parsing error, exiting the for loop.
  • This is in contrast to the desired outcome noted in this comment.
  • At the CLI level, the raised exceptions are handled by the bulk_update and bulk_update_embeddings CLI commands, but in different ways:
    • bulk_update:
      • Logs an error message with the timdex_record_id of the failing record but proceeds with the delete process.
      • Logs a likely inaccurate index_results.
    • bulk_update_embeddings:
      • Logs an error message with the timdex_record_id of the failing record and exits the CLI (status code 1).

TLDR: These learnings highlighted that the bulk methods in tim/opensearch.py could benefit from a revisit/refactor to better align logging and error handling across the bulk methods.

Contributor

Agreed! Thanks for summarizing.


jonavellecuerdo added a commit that referenced this pull request Dec 12, 2025
* Fix cassette to handle method call to refresh index
* Use parameterized tests
* Remove comments
jonavellecuerdo added a commit that referenced this pull request Dec 12, 2025
* Use logger.error when handling custom bulk exceptions
* Call ctx.exit in bulk_update_embeddings when BulkOperationError is raised
* Update method to log elapsed time
* Clean up log when refreshing index
@jonavellecuerdo
Contributor Author

@ghukill I undid the change to replace main.result_callback with ctx.call_on_close. 🙃 It seems there are several tests that rely on this way to log the elapsed time. Will tackle this another time!

Contributor

@ghukill ghukill left a comment

Ship it! Approved!

As you've noted in a comment, the completion of this new CLI command revealed some areas that we could tidy up and normalize with regards to bulk operations. As discussed, seems better to address that as a focused pass.

With that in mind, I do feel as though this PR and new CLI command will perform bulk updates with embeddings which is the intent here. As we move into testing, and we expose some bugs, those will be good opportunities to revisit bulk operations logging and error handling a bit more holistically.

Nice work! One step closer to embeddings.

Why these changes are being introduced:
* Now that TDA supports reading embeddings associated with
TIMDEX records in the TIMDEX dataset, the stub CLI command can
be completed. The OpenSearch mapping requires a new field to
store embeddings.

How this addresses that need:
* Add 'embedding_full_record' field to OpenSearch mapping
* Add helper method to format embeddings as input JSON for OpenSearch client
* Update cli
* Use logger.error when handling custom bulk exceptions
* Call ctx.exit in bulk_update_embeddings when BulkOperationError is raised
* Clean up log when refreshing index

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-181
@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch from d536f3e to fe79bd9 Compare December 15, 2025 15:04
@jonavellecuerdo jonavellecuerdo merged commit b5202f8 into main Dec 15, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-181-read-embeddings-with-tda branch December 15, 2025 15:13
