Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# CODEOWNERS file (from GitHub template at
# https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners)
# Each line is a file pattern followed by one or more owners.

################################################################################
# These owners will be the default owners for everything in the repo. This is commented
# out in favor of using a team as the default (see below). It is left here as a comment
# to indicate the primary expert for this code.
# * @adamshire123

# Teams can be specified as code owners as well. Teams should be identified in
# the format @org/team-name. Teams must have explicit write access to the
# repository.
* @mitlibraries/dataeng

# We set the senior engineer in the team as the owner of the CODEOWNERS file as
# a layer of protection for unauthorized changes.
/.github/CODEOWNERS @ghukill
14 changes: 2 additions & 12 deletions .github/pull-request-template.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,15 +15,5 @@ YES | NO
### What are the relevant tickets?
- Include links to Jira Software and/or Jira Service Management tickets here.

### Developer
- [ ] All new ENV is documented in README
- [ ] All new ENV has been added to staging and production environments
- [ ] All related Jira tickets are linked in commit message(s)
- [ ] Stakeholder approval has been confirmed (or is not needed)

### Code Reviewer(s)
- [ ] The commit message is clear and follows our guidelines (not just this PR message)
- [ ] There are appropriate tests covering any new functionality
- [ ] The provided documentation is sufficient for understanding any new functionality introduced
- [ ] Any manual tests have been performed **or** provided examples have been verified
- [ ] New dependencies are appropriate or there were no changes
### Code review
* Code review best practices are documented [here](https://mitlibraries.github.io/guides/collaboration/code_review.html) and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,6 @@ repos:
types: ["python"]
- id: pip-audit
name: pip-audit
entry: pipenv run pip-audit --ignore-vuln GHSA-4xh5-x5gv-qwph
entry: pipenv run pip-audit
language: system
pass_filenames: false
5 changes: 3 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
### This is the Terraform-generated header for timdex-index-manager-dev ###
.PHONY: test
ECR_NAME_DEV:=timdex-index-manager-dev
ECR_URL_DEV:=222053980223.dkr.ecr.us-east-1.amazonaws.com/timdex-index-manager-dev
### End of Terraform-generated header ###
Expand Down Expand Up @@ -28,7 +29,7 @@ update: install # Update Python dependencies
######################

test: # Run tests and print a coverage report
pipenv run coverage run --source=tim -m pytest -vv
pipenv run coverage run --source=tim -m pytest -vv
pipenv run coverage report -m

coveralls: test # Write coverage data to an LCOV report
Expand All @@ -50,7 +51,7 @@ ruff: # Run 'ruff' linter and print a preview of errors
pipenv run ruff check .

safety: # Check for security vulnerabilities and verify Pipfile.lock is up-to-date
pipenv run pip-audit --ignore-vuln GHSA-4xh5-x5gv-qwph
pipenv run pip-audit
pipenv verify

lint-apply: black-apply ruff-apply # Apply changes with 'black' and resolve 'fixable errors' with 'ruff'
Expand Down
1,887 changes: 809 additions & 1,078 deletions Pipfile.lock

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions config/opensearch_mappings.json
Original file line number Diff line number Diff line change
Expand Up @@ -132,6 +132,9 @@
"edition": {
"type": "text"
},
"embedding_full_record": {
"type": "rank_features"
},
"file_formats": {
"type": "keyword",
"normalizer": "lowercase"
Expand Down
22 changes: 22 additions & 0 deletions tests/fixtures/cassettes/bulk_update_embeddings_success.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
interactions:
- request:
body: null
headers:
content-type:
- application/json
user-agent:
- opensearch-py/2.8.0 (Python 3.12.11)
method: GET
uri: http://localhost:9200/_cat/aliases?format=json
response:
body:
string: '[{"alias":"all-current","index":"libguides-2025-12-11t16-36-09","filter":"-","routing.index":"-","routing.search":"-","is_write_index":"-"},{"alias":"libguides","index":"libguides-2025-12-11t16-36-09","filter":"-","routing.index":"-","routing.search":"-","is_write_index":"-"},{"alias":"all-current","index":"test-index-2025-12-11t16-58-08","filter":"-","routing.index":"-","routing.search":"-","is_write_index":"-"},{"alias":"test-index","index":"test-index-2025-12-11t16-58-08","filter":"-","routing.index":"-","routing.search":"-","is_write_index":"-"},{"alias":".kibana","index":".kibana_1","filter":"-","routing.index":"-","routing.search":"-","is_write_index":"-"}]'
headers:
content-length:
- '671'
content-type:
- application/json; charset=UTF-8
status:
code: 200
message: OK
version: 1
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
interactions:
- request:
body: '{"update":{"_id":"i-am-not-found","_index":"test-index"}}

{"doc":{"timdex_record_id":"i-am-not-found","title":"Materials Science & Engineering
(UPDATED)"}}

'
headers:
Content-Length:
- '156'
content-type:
- application/json
user-agent:
- opensearch-py/2.8.0 (Python 3.12.11)
method: POST
uri: http://localhost:9200/_bulk
response:
body:
string: '{"took":9,"errors":true,"items":[{"update":{"_index":"test-index-2025-12-11t16-58-08","_id":"i-am-not-found","status":404,"error":{"type":"document_missing_exception","reason":"[i-am-not-found]:
document missing","index":"test-index-2025-12-11t16-58-08","shard":"0","index_uuid":"in04_JvQS5qqCvUXeZta_g"}}}]}'
headers:
content-length:
- '308'
content-type:
- application/json; charset=UTF-8
status:
code: 200
message: OK
version: 1
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
interactions:
- request:
body: '{"update":{"_id":"libguides:guides-175846","_index":"test-index"}}

{"doc":{"timdex_record_id":"libguides:guides-175846","title":"Materials Science
& Engineering (UPDATED)"}}

'
headers:
Content-Length:
- '174'
content-type:
- application/json
user-agent:
- opensearch-py/2.8.0 (Python 3.12.11)
method: POST
uri: http://localhost:9200/_bulk
response:
body:
string: '{"took":7,"errors":false,"items":[{"update":{"_index":"test-index-2025-12-11t16-58-08","_id":"libguides:guides-175846","_version":4,"result":"updated","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":7,"_primary_term":1,"status":200}}]}'
headers:
content-length:
- '245'
content-type:
- application/json; charset=UTF-8
status:
code: 200
message: OK
- request:
body: null
headers:
Content-Length:
- '0'
content-type:
- application/json
user-agent:
- opensearch-py/2.8.0 (Python 3.12.2)
method: POST
uri: http://localhost:9200/test-index/_refresh
response:
body:
string: '{"_shards":{"total":2,"successful":1,"failed":0}}'
headers:
content-length:
- '49'
content-type:
- application/json; charset=UTF-8
status:
code: 200
message: OK
version: 1
Binary file not shown.
69 changes: 68 additions & 1 deletion tests/test_cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,11 @@
import re
from unittest.mock import patch

import pytest
from freezegun import freeze_time

from tim.cli import main
from tim.errors import BulkIndexingError
from tim.errors import BulkIndexingError, BulkOperationError

from .conftest import EXIT_CODES, my_vcr

Expand Down Expand Up @@ -274,6 +275,72 @@ def test_bulk_update_with_source_raise_bulk_indexing_error(
)


@pytest.mark.parametrize(
"bulk_update_return",
[
{"updated": 1, "errors": 0, "total": 1},
{"updated": 0, "errors": 1, "total": 1},
],
)
@patch("tim.helpers.validate_bulk_cli_options")
@patch("tim.opensearch.bulk_update")
def test_bulk_update_embeddings_logs_complete(
mock_bulk_update,
mock_validate_bulk_cli_options,
bulk_update_return,
caplog,
monkeypatch,
runner,
):
monkeypatch.delenv("TIMDEX_OPENSEARCH_ENDPOINT", raising=False)
mock_bulk_update.return_value = bulk_update_return
mock_validate_bulk_cli_options.return_value = "libguides"

result = runner.invoke(
main,
[
"bulk-update-embeddings",
"--source",
"libguides",
"--run-id",
"85cfe316-089c-4639-a5af-c861a7321493",
"tests/fixtures/dataset",
],
)

assert result.exit_code == EXIT_CODES["success"]
assert (
f"Bulk update with embeddings complete: {json.dumps(mock_bulk_update())}"
in caplog.text
)


@patch("tim.helpers.validate_bulk_cli_options")
@patch("tim.opensearch.bulk_update")
def test_bulk_update_embeddings_exit_bulk_operation_error(
mock_bulk_update, mock_validate_bulk_cli_options, caplog, monkeypatch, runner
):
monkeypatch.delenv("TIMDEX_OPENSEARCH_ENDPOINT", raising=False)
mock_bulk_update.side_effect = BulkOperationError(
action="update", record="alma:0", index="index", error="exception"
)
mock_validate_bulk_cli_options.return_value = "libguides"

result = runner.invoke(
main,
[
"bulk-update-embeddings",
"--source",
"libguides",
"--run-id",
"85cfe316-089c-4639-a5af-c861a7321493",
"tests/fixtures/dataset",
],
)
assert result.exit_code == EXIT_CODES["error"]
assert "Bulk update with embeddings failed" in caplog.text


@patch("tim.opensearch.create_index")
@patch("tim.opensearch.promote_index")
@patch("tim.opensearch.get_index_aliases")
Expand Down
11 changes: 11 additions & 0 deletions tests/test_helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,17 @@ def test_generate_bulk_actions_delete():
}


def test_generate_bulk_actions_update():
records = [{"timdex_record_id": "12345"}]
actions = helpers.generate_bulk_actions("test-index", records, "update")
assert next(actions) == {
"_op_type": "update",
"_index": "test-index",
"_id": "12345",
"doc": {"timdex_record_id": "12345"},
}


def test_generate_bulk_actions_invalid_action_raises_error():
records = [{"timdex_record_id": "12345", "other_fields": "some_data"}]
actions = helpers.generate_bulk_actions("test-index", records, "wrong")
Expand Down
32 changes: 32 additions & 0 deletions tests/test_opensearch.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
from tim.config import PRIMARY_ALIAS
from tim.errors import (
AliasNotFoundError,
BulkOperationError,
IndexExistsError,
IndexNotFoundError,
)
Expand Down Expand Up @@ -532,3 +533,34 @@ def test_bulk_delete_logs_error_if_record_not_found(
"Record to delete 'i-am-not-found' was not found in index 'test-index'."
in caplog.text
)


@my_vcr.use_cassette("opensearch/bulk_update_updates_records.yaml")
def test_bulk_update_updates_records(test_opensearch_client):
updates = [
{
"timdex_record_id": "libguides:guides-175846",
"title": "Materials Science & Engineering (UPDATED)",
}
]
assert tim_os.bulk_update(test_opensearch_client, "test-index", iter(updates)) == {
"updated": 1,
"errors": 0,
"total": 1,
}


@my_vcr.use_cassette(
"opensearch/bulk_update_raises_bulk_operation_error_if_record_not_found.yaml"
)
def test_bulk_update_raises_bulk_operation_error_if_record_not_found(
test_opensearch_client,
):
updates = [
{
"timdex_record_id": "i-am-not-found",
"title": "Materials Science & Engineering (UPDATED)",
}
]
with pytest.raises(BulkOperationError):
tim_os.bulk_update(test_opensearch_client, "test-index", iter(updates))
Comment on lines +553 to +566
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill Do you think this is the appropriate response? Essentially, once it hits an update for a record not found in the OpenSearch index, the BulkOperationError is raised. 🤔

This is the result of carrying over the pattern in tim.opensearch.bulk_index, where if the response[0] is False and the error type is anything other than a mapping error, a custom exception is raised.

The same pattern is currently implemented in tim.opensearch.bulk_update. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say we leave it for now. I'd prefer to exit eagerly and often with explicit reasons why at first. Perhaps we'll find that we're okay with most documents getting embeddings even if there are some errors.

But I think in the early days, it'll be nice to know if any records we are trying to update don't exist.

FWIW, I do think the work in USE-273 which will limit to action=index for a run will help with this. Otherwise, anytime a record has action=delete for a run we can kind of assume it won't exist in Opensearch and will likely trigger this error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, though tim.cli.bulk_update_embeddings handles the BulkOperationError, the log in line 396 will log:

{"updated": 0, "errors": 0, "total": 0}

Is this an issue?

And re:

it'll be nice to know if any records we are trying to update don't exist.

Is there a change request for this? Are you suggesting that tim.opensearch.bulk_update return a tuple where...(<dict_of_counts>, <list_of_errors>)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your first question: I don't think it matters. We could move that log line into the try part so it only logs when successful, that might not be a bad small change. Or, to strive for doing one thing under a try block, we could have the except block explicitly exit the CLI with a non-zero exit code.

Actually... that might be a nice option. Yes, I'd propose that! If the embeddings fail, it'd be nice to have a non-zero exit code.

As for your second question, I just meant it'll be nice to know if we ever attempt to add an embedding for any record that doesn't exist. Specific ones aren't needed I don't think. I do think that USE-273 might be a good time to revisit some these things, when we see how the updated read methods change things (if at all).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied the change to exit the CLI with a non-zero exit code when the BulkOperationError is raised for bulk_update_embeddings CLI command (see f2cdca7)

Loading