
Conversation

@jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Dec 11, 2025

Purpose and background context

Now that TDA supports reading embeddings associated with TIMDEX records in the TIMDEX dataset, the bulk_update_embeddings CLI command can be completed. Additionally, this PR updates the OpenSearch mappings to enable indexing of embeddings into OpenSearch.

Notes

  • The number of lines changed is significant due to the update of dependencies.
  • Two new VCR cassettes were created by running a local OpenSearch instance containing an index for source="test-index", populated with the records from the TIMDEX dataset loaded from tests/fixtures/data.
    • I generated embeddings for this small dataset and stored them in tests/fixtures/data/embeddings.
    • Running the tests with VCR recorded the actual result of running TIM against this index.
      • Example: running this test updates the following document:
      {
        "_index": "test-index-2025-12-11t16-58-08",
        "_id": "libguides:guides-175846",
        "_score": 1,
        "_source": {
          "source": "LibGuides",
          "source_link": "https://libguides.mit.edu/materials",
          "timdex_record_id": "libguides:guides-175846",
          "title": "Materials Science & Engineering (UPDATED)",
          "citation": "Phoebe Ayers. Materials Science & Engineering. MIT Libraries. libguides. https://libguides.mit.edu/materials",
          "content_type": [
            "libguides"
          ],
          ...
        }
      }

How can a reviewer manually see the effects of these changes?

  1. Review unit tests.
  2. [OPTIONAL] Run TIM against a local OpenSearch instance.
    1. Open a new terminal (A)
    2. In terminal A, launch local instances of OpenSearch and OpenSearch Dashboards.
    3. Open another terminal (B)
    4. In terminal B, reindex source="libguides" using the TIMDEX dataset loaded from tests/fixtures/dataset:
      pipenv run tim reindex-source -s libguides tests/fixtures/dataset
    5. In your browser, go to OpenSearch Dashboards and navigate to Dev Tools.
    6. Run the following query to see records before running bulk_update_embeddings:
      GET libguides/_search
      
    7. In terminal B, bulk update documents with embeddings
      pipenv run tim bulk-update-embeddings -s libguides -rid 85cfe316-089c-4639-a5af-c861a7321493 tests/fixtures/dataset
      
    8. In your browser, rerun the query. You should see that documents are updated to include embedding_full_record:
{
        "_index": "libguides-2025-12-11t20-27-58",
        "_id": "libguides:guides-175853",
        "_score": 1,
        "_source": {
          "source": "LibGuides",
          "source_link": "https://libguides.mit.edu/news",
          "timdex_record_id": "libguides:guides-175853",
          ...
          "embedding_full_record": {
            "85": 0.09610562771558762,
            "202": 0.11397245526313782,
            "1758": 0.3144601583480835,
            "authored": 0.01303903479129076
           ...
          }
       }
}

Includes new or updated dependencies?

YES - To resolve vulnerabilities but primarily to update timdex-dataset-api==3.7.2

Changes expectations for external applications?

MAYBE - The decisions made here must be considered in future work to update query / search functions to use embeddings.

What are the relevant tickets?

https://github.com/MITLibraries/timdex-index-manager/tree/USE-181-read-embeddings-with-tda

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch from 8c8c58c to d0ad109 Compare December 11, 2025 18:39
@jonavellecuerdo jonavellecuerdo mentioned this pull request Dec 11, 2025
@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch from d0ad109 to 1c9ceeb Compare December 11, 2025 20:12
@coveralls

coveralls commented Dec 11, 2025

Pull Request Test Coverage Report for Build 20236928399

Details

  • 15 of 17 (88.24%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+6.7%) to 96.082%

Changes missing coverage:
  • tim/helpers.py: 2 of 4 changed/added lines covered (50.0%)

Totals:
  • Change from base Build 19441567561: 6.7%
  • Covered Lines: 466
  • Relevant Lines: 485

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch 3 times, most recently from 1389081 to 720b700 Compare December 12, 2025 13:48
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 12, 2025 13:50
@jonavellecuerdo jonavellecuerdo requested review from a team and ghukill December 12, 2025 13:50
@ghukill ghukill self-assigned this Dec 12, 2025
Comment on lines +557 to +566
@my_vcr.use_cassette(
    "opensearch/bulk_update_raises_bulk_operation_error_if_record_not_found.yaml"
)
def test_bulk_update_raises_bulk_operation_error_if_record_not_found(
    test_opensearch_client,
):
    updates = [
        {
            "timdex_record_id": "i-am-not-found",
            "title": "Materials Science & Engineering (UPDATED)",
        }
    ]
    with pytest.raises(BulkOperationError):
        tim_os.bulk_update(test_opensearch_client, "test-index", iter(updates))
Contributor Author

@ghukill Do you think this is the appropriate response? Essentially, once it hits an update for a record not found in the OpenSearch index, the BulkOperationError is raised. 🤔

This is the result of carrying over the pattern in tim.opensearch.bulk_index, where if the response[0] is False and the error type is anything other than a mapping error, a custom exception is raised.

The same pattern is currently implemented in tim.opensearch.bulk_update. 🤔
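A rough sketch of the pattern described here, for discussion purposes only (the names and response shapes are illustrative, not the actual tim.opensearch code):

```python
# Hypothetical sketch of the bulk error-handling pattern: mapping parsing
# errors are counted and tolerated, anything else fails fast.

class BulkOperationError(Exception):
    """Raised when a bulk item fails for a non-mapping reason."""


def bulk_update(responses):
    results = {"updated": 0, "errors": 0, "total": 0}
    for ok, item in responses:  # streaming_bulk-style (success_flag, item) pairs
        results["total"] += 1
        if not ok:
            error = item.get("update", {}).get("error", {})
            if error.get("type") == "mapper_parsing_exception":
                results["errors"] += 1  # tolerated: count and continue
                continue
            raise BulkOperationError(error)  # anything else: fail fast
        results["updated"] += 1
    return results
```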

Contributor

I say we leave it for now. I'd prefer to exit eagerly and often with explicit reasons why at first. Perhaps we'll find that we're okay with most documents getting embeddings even if there are some errors.

But I think in the early days, it'll be nice to know if any records we are trying to update don't exist.

FWIW, I do think the work in USE-273, which will limit to action=index for a run, will help with this. Otherwise, anytime a record has action=delete for a run we can kind of assume it won't exist in OpenSearch and will likely trigger this error.

Contributor Author

Hmm, though tim.cli.bulk_update_embeddings handles the BulkOperationError, the log in line 396 will log:

{"updated": 0, "errors": 0, "total": 0}

Is this an issue?

And re:

it'll be nice to know if any records we are trying to update don't exist.

Is there a change request for this? Are you suggesting that tim.opensearch.bulk_update return a tuple where...(<dict_of_counts>, <list_of_errors>)?
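To make the question concrete, a hypothetical sketch of that return shape (not the current tim.opensearch.bulk_update signature):

```python
# Hypothetical alternative: return (counts, errors) so the caller can log
# exactly which records failed instead of stopping at the first failure.

def bulk_update_collect(responses):
    counts = {"updated": 0, "errors": 0, "total": 0}
    errors = []
    for ok, item in responses:
        counts["total"] += 1
        if ok:
            counts["updated"] += 1
        else:
            counts["errors"] += 1
            errors.append(item)  # retain the failing item for later logging
    return counts, errors
```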

Contributor

To your first question: I don't think it matters. We could move that log line into the try part so it only logs when successful, that might not be a bad small change. Or, to strive for doing one thing under a try block, we could have the except block explicitly exit the CLI with a non-zero exit code.

Actually... that might be a nice option. Yes, I'd propose that! If the embeddings fail, it'd be nice to have a non-zero exit code.

As for your second question, I just meant it'll be nice to know if we ever attempt to add an embedding for any record that doesn't exist. Specific ones aren't needed, I don't think. I do think that USE-273 might be a good time to revisit some of these things, when we see how the updated read methods change things (if at all).
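A minimal sketch of the proposed change, with hypothetical names; in the real click-based command the `return 1` would be `ctx.exit(1)` inside the except block:

```python
import sys

class BulkOperationError(Exception):
    pass

def run_bulk_update_embeddings(bulk_update):
    """Hypothetical CLI command body: fail with a non-zero code on error."""
    try:
        results = bulk_update()
    except BulkOperationError as exc:
        print(f"Bulk update failed: {exc}", file=sys.stderr)
        return 1  # surface the failure to callers, CI, and metrics
    print(f"Bulk update with embeddings complete: {results}")
    return 0
```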

Contributor Author

Applied the change to exit the CLI with a non-zero exit code when the BulkOperationError is raised for bulk_update_embeddings CLI command (see f2cdca7)

Contributor

@ghukill ghukill left a comment

Overall, really nice work! Looking good. I left a few comments. Marking this review as "Request changes" 90% for requiring that --run-id become a required CLI arg.

Other than that, my questions / comments are largely discussion and optional. Looks like an excellent first pass, and we should consider that USE-273 may smooth out some rough edges. It might be worth moving this PR to merge, and noting edge cases or improvements to be performed in that ticket.

tim/cli.py Outdated
),
)
@click.option("-d", "--run-date", help="Run date, formatted as YYYY-MM-DD.")
@click.option("-rid", "--run-id", help="Run ID.")
Contributor

I think until we take another pass at TIM given updated capabilities of TDA (related Jira ticket), we should make this --run-id argument required.

I was confused for a while working through the PR, which originally had this command:

pipenv run tim bulk-update-embeddings -s libguides tests/fixtures/dataset

It ran without error, but was showing zero updated, created, errored, etc. Obviously, I wasn't thinking too deeply about it or I would have noticed that we weren't passing a run_id for which to select embeddings from the dataset! But in retrospect, I think the CLI should have thrown an error if that's not provided.
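A minimal sketch of the requested change, assuming the option names shown in the diff above (the surrounding options and validation in tim/cli.py are omitted):

```python
import click

# Hypothetical sketch: making --run-id required so the CLI errors out
# instead of silently updating zero records.

@click.command()
@click.option("-s", "--source", required=True, help="Source to update.")
@click.option("-rid", "--run-id", required=True, help="Run ID.")
@click.argument("dataset_location")
def bulk_update_embeddings(source, run_id, dataset_location):
    click.echo(f"updating embeddings for {source}, run {run_id}")
```

With `required=True`, click exits with a usage error before the command body runs when the option is missing.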

Comment on lines 325 to 329
assert result.exit_code == EXIT_CODES["success"]
assert (
f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
in caplog.text
)
Contributor

For this test, where the bulk update returns an error count > 0, is this the response we want?

Generally speaking, I think we want processes to continue even if 1 / 10k records had an error. At the same time, we want to know about those errors so we don't just paper over (ignore) them indefinitely.

To answer my own question: I think the CLI response code and the logging is good, because we are logging that error count. This would be an excellent use of AWS metrics and observability to then react to this error > 0 count outside the context of this CLI.

My issue may ultimately be the test name where I don't think we are "raising" an error in any meaningful way.

Contributor

Also for consideration, a quick AI suggestion which I find compelling as well:

@pytest.mark.parametrize(
    "bulk_update_return",
    [
        {"updated": 1, "errors": 0, "total": 1},
        {"updated": 0, "errors": 1, "total": 1},
    ],
)
@patch("tim.helpers.validate_bulk_cli_options")
@patch("tim.opensearch.bulk_update")
def test_bulk_update_embeddings_logs_complete(
    mock_bulk_update,
    mock_validate_bulk_cli_options,
    bulk_update_return,
    caplog,
    monkeypatch,
    runner,
):
    monkeypatch.delenv("TIMDEX_OPENSEARCH_ENDPOINT", raising=False)
    mock_bulk_update.return_value = bulk_update_return
    mock_validate_bulk_cli_options.return_value = "libguides"

    result = runner.invoke(
        main,
        [
            "bulk-update-embeddings",
            "--source",
            "libguides",
            "--run-id",
            "85cfe316-089c-4639-a5af-c861a7321493",
            "tests/fixtures/dataset",
        ],
    )

    assert result.exit_code == EXIT_CODES["success"]
    assert (
        f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
        in caplog.text
    )

What this communicates to me is that regardless of the success/error counts, the CLI response is basically the same.

Contributor Author

@jonavellecuerdo jonavellecuerdo Dec 12, 2025

Sharing notes from our discussion:

  • The bulk_index and bulk_update methods in tim/opensearch.py raise an error (BulkIndexingError and BulkOperationError, respectively) as soon as they encounter an error that is not due to a mapping parsing error, exiting the for loop.
  • This is in contrast to the desired outcome noted in this comment.
  • At the CLI level, the raised exceptions are handled by the bulk_update and bulk_update_embeddings CLI commands, but in different ways:
    • bulk_update:
      • Logs an error message with the timdex_record_id of the failing record but proceeds with the delete process.
      • Logs a likely inaccurate index_results.
    • bulk_update_embeddings:
      • Logs an error message with the timdex_record_id of the failing record and exits the CLI (status code 1).

TLDR: These learnings highlighted that the bulk methods in tim/opensearch.py could benefit from a revisit/refactor to better align logging and error handling across the bulk methods.

Contributor

Agreed! Thanks for summarizing.


jonavellecuerdo added a commit that referenced this pull request Dec 12, 2025
* Fix cassette to handle method call to refresh index
* Use parameterized tests
* Remove comments
jonavellecuerdo added a commit that referenced this pull request Dec 12, 2025
* Use logger.error when handling custom bulk exceptions
* Call ctx.exit in bulk_update_embeddings when BulkOperationError is raised
* Update method to log elapsed time
* Clean up log when refreshing index
@jonavellecuerdo
Contributor Author

@ghukill I undid the change to replace main.result_callback with ctx.call_on_close. 🙃 It seems there are several tests that rely on this way to log the elapsed time. Will tackle this another time!

Contributor

@ghukill ghukill left a comment

Ship it! Approved!

As you've noted in a comment, the completion of this new CLI command revealed some areas that we could tidy up and normalize with regards to bulk operations. As discussed, seems better to address that as a focused pass.

With that in mind, I do feel as though this PR and new CLI command will perform bulk updates with embeddings which is the intent here. As we move into testing, and we expose some bugs, those will be good opportunities to revisit bulk operations logging and error handling a bit more holistically.

Nice work! One step closer to embeddings.

Why these changes are being introduced:
* Now that TDA supports reading embeddings associated with
TIMDEX records in the TIMDEX dataset, the stub CLI command can
be completed. The OpenSearch mapping requires a new field to
store embeddings.

How this addresses that need:
* Add 'embedding_full_record' field to OpenSearch mapping
* Add helper method to format embeddings as input JSON for OpenSearch client
* Update cli
* Use logger.error when handling custom bulk exceptions
* Call ctx.exit in bulk_update_embeddings when BulkOperationError is raised
* Clean up log when refreshing index

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-181
@jonavellecuerdo jonavellecuerdo force-pushed the USE-181-read-embeddings-with-tda branch from d536f3e to fe79bd9 Compare December 15, 2025 15:04
@jonavellecuerdo jonavellecuerdo merged commit b5202f8 into main Dec 15, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-181-read-embeddings-with-tda branch December 15, 2025 15:13
