refactor(medcat): CU-869b44wz8 Better internal components #219

mart-r · 2025-11-11T09:25:40Z

This PR is intended to improve internal components.

The issue is that the current setup requires every new NER or linking component to re-implement a tightly coupled procedure just to set ner_ents or linked_ents on the MutableDocument. This coupling was not well documented, and it made writing a new component a poor experience.

Thus, this PR attempts to improve the situation by:

Introducing an abstract base class that can be extended for new / other components
- This class does the setting of the NER or linked entities on the document
- It defines a predict_entities(MutableDocument, list[MutableEntity] | None) -> list[MutableEntity] method
  - This is the only method that needs to be implemented by the extending class
  - For NER, the incoming list is (generally) expected to be empty
  - For linking, this is the list that will be used to attempt the linking stage
- This class handles the setting of MutableDocument.ner_ents or MutableDocument.linked_ents
  - It does so automatically based on the component type
  - Though custom behaviour can be forced
    - E.g. the transformers NER component has written its output to linked_ents and will continue to do so
- The idea is for a linker or NER implementation to be oblivious of where it gets its input entities from or where it writes its output entities
  - This will be handled by the abstract class
Removing the side effects that set entities
- For linked_ents in medcat.utils.postprocessing.create_main_ann
- For ner_ents in medcat.components.ner.vocab_based_annotator.maybe_annotate_name
- The idea is to limit the number of places that directly interact with MutableDocument.ner_ents and MutableDocument.linked_ents
Making some changes to how ner_ents and linked_ents are used
- Along the way I discovered that these were not really used as expected
- I.e. the ner_ents list would be changed alongside linked_ents during linking
- This was because the decoupling of these 2 lists from v1 was incomplete
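
To illustrate the shape of the abstraction described above, here is a minimal sketch. It does not use the actual medcat classes: Entity, Document, EntityProvider, KeywordNER and DictLinker are hypothetical stand-ins made up for this example, and the "CUI_FEVER" identifier is invented. Only the predict_entities name and the ner_ents / linked_ents attributes come from the description.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:  # hypothetical stand-in for MutableEntity
    start: int
    end: int
    cui: Optional[str] = None


@dataclass
class Document:  # hypothetical stand-in for MutableDocument
    text: str
    ner_ents: list[Entity] = field(default_factory=list)
    linked_ents: list[Entity] = field(default_factory=list)


class EntityProvider(ABC):  # hypothetical name for the abstract base class
    # "ner" or "linking"; decides which list is read and which is written
    component_type: str = "ner"

    @abstractmethod
    def predict_entities(self, doc: Document,
                         ents: Optional[list[Entity]] = None) -> list[Entity]:
        """The only method a concrete component has to implement."""

    def __call__(self, doc: Document) -> Document:
        if self.component_type == "ner":
            # NER starts from an empty list and fills ner_ents
            doc.ner_ents = self.predict_entities(doc, [])
        else:
            # The linker receives a copy, so ner_ents is not changed by linking
            doc.linked_ents = self.predict_entities(doc, list(doc.ner_ents))
        return doc


class KeywordNER(EntityProvider):  # toy NER: finds every occurrence of one keyword
    component_type = "ner"

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword

    def predict_entities(self, doc, ents=None):
        found, pos = [], doc.text.find(self.keyword)
        while pos != -1:
            found.append(Entity(pos, pos + len(self.keyword)))
            pos = doc.text.find(self.keyword, pos + 1)
        return found


class DictLinker(EntityProvider):  # toy linker: looks the span text up in a dict
    component_type = "linking"

    def __init__(self, name2cui: dict[str, str]) -> None:
        self.name2cui = name2cui

    def predict_entities(self, doc, ents=None):
        linked = []
        for ent in ents or []:
            cui = self.name2cui.get(doc.text[ent.start:ent.end])
            if cui is not None:
                # Build a new entity rather than mutating the NER one
                linked.append(Entity(ent.start, ent.end, cui))
        return linked
```

In this sketch neither KeywordNER nor DictLinker touches ner_ents or linked_ents directly; only the base class __call__ does, which is the point of the change.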

PS:
This PR (currently) changes the signature for medcat.utils.postprocessing.create_main_ann and medcat.components.ner.vocab_based_annotator.maybe_annotate_name. The former now requires a list of the entities as input (instead of reading it from the document). The latter now requires a current ID since it won't read the length of the existing list.

The later versions address the above by:

Creating a new method alongside create_main_ann
- Called filter_linked_annotations
- This is used by everything internal
- The old method / signature remains, but a deprecation warning was added
Adding a workaround for maybe_annotate_name
- The method is able to produce an (almost certainly) unique ID if the old signature is used
- The start index is multiplied by 1000 and the length of the span is added to get this unique ID
- This signature should not be used in the native code, though; it is only there for backwards compatibility
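
A rough sketch of the two backwards-compatibility mechanisms above, using only the standard library. All function names here (fallback_entity_id, resolve_entity_id, filter_entities_sketch, legacy_filter_sketch) are hypothetical and are not the medcat signatures; the filtering body is a placeholder, and reading the entities from doc["ents"] is an assumption made purely for illustration.

```python
import warnings
from typing import Optional


def fallback_entity_id(start_index: int, span_length: int) -> int:
    """ID derived from the span itself: start index * 1000 + span length."""
    return start_index * 1000 + span_length


def resolve_entity_id(cur_id: Optional[int], start_index: int,
                      span_length: int) -> int:
    """Prefer the caller's ID; fall back only when none is given (old-style call)."""
    if cur_id is not None:
        return cur_id
    return fallback_entity_id(start_index, span_length)


def filter_entities_sketch(entities: list) -> list:
    """Stand-in for the new-style function: entities are passed in explicitly.

    The filtering itself is a placeholder (it only drops None values).
    """
    return [ent for ent in entities if ent is not None]


def legacy_filter_sketch(doc: dict) -> list:
    """Stand-in for the old-style entry point: reads from the document and warns."""
    warnings.warn(
        "legacy_filter_sketch is deprecated; "
        "pass the entities to filter_entities_sketch instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return filter_entities_sketch(doc["ents"])
```

The fallback ID can only collide if two spans share both start and length, or if a span is 1000 or more characters long (for example start 1 with length 1000 and start 2 with length 0 both give 2000), hence "almost certainly" unique.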


tomolopolis · 2025-11-11T09:25:44Z

Task linked: CU-869b44wz8 Improve internals for NER / linker component


tomolopolis

looks good - there's no edits required in ner/*.py, to use the new predict_ents? looks like they all use maybe_annotate_name, which didn't change?

Does this change remove the side effect of doc.ner_ents, doc.linked_ents, might be worth a test in here if it did?

mart-r · 2025-11-11T17:15:20Z

looks good - there's no edits required in ner/*.py, to use the new predict_ents? looks like they all use maybe_annotate_name, which didn't change?

There's a slight difference in how maybe_annotate_name works. It no longer saves the annotated entity into MutableDocument.ner_ents; instead it just returns it (if it exists). So if you're calling maybe_annotate_name in the new setup, you need to explicitly add the returned entity to MutableDocument.ner_ents later on.
With that said, if you've got a setup that works via __call__(MutableDocument) -> MutableDocument, it will still work (as long as it sets .ner_ents for the NER step and .linked_ents for the linking step).
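
As a rough illustration of that calling pattern (not the real API: Doc, annotate_if_present and run_ner are hypothetical stand-ins, and entities are plain dicts here), the caller collects whatever is returned and sets ner_ents itself in one place:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:  # hypothetical stand-in for MutableDocument
    text: str
    ner_ents: list = field(default_factory=list)


def annotate_if_present(doc: Doc, name: str, cur_id: int) -> Optional[dict]:
    """Hypothetical stand-in for maybe_annotate_name: returns the entity, never stores it."""
    start = doc.text.find(name)
    if start == -1:
        return None
    return {"id": cur_id, "start": start, "end": start + len(name)}


def run_ner(doc: Doc, names: list[str]) -> Doc:
    """A __call__-style step that does the setting of ner_ents explicitly."""
    found = []
    for name in names:
        # The caller supplies the current ID instead of it being read from the doc
        ent = annotate_if_present(doc, name, cur_id=len(found))
        if ent is not None:
            found.append(ent)
    doc.ner_ents = found  # the only place the document is modified
    return doc
```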

Does this change remove the side effect of doc.ner_ents, doc.linked_ents, might be worth a test in here if it did?

You're right, a few additional tests here would be beneficial. The existing test suite runs, so I'm reasonably confident everything is working as expected. But explicitly testing that these methods don't have side effects would certainly help. I'll get to that tomorrow.


mart-r added 12 commits November 10, 2025 17:07

CU-869b44wz8: Create new abstraction layer for entity providing compo…

623aa00

…nents (e.g NER and Linker)

CU-869b44wz8: Use new abstraction for linkers

fc5ee2d

CU-869b44wz8: Use new abstraaction for DeID

03b3ead

CU-869b44wz8: Fix setting of linker entities - do it all in one place

15a5ad8

Fix NER tests

8a91f9f

Fix postporcesing tests

7bc4dd5

CU-869b44wz8: Update NER components with new abstraction

bcd7f18

CU-869b44wz8: Fix issue with wrong base class

19f6db8

CU-869b44wz8: Add missing base class init call

6d0612c

CU-869b44wz8: Fix typo

3747dca

CU-869b44wz8: Avoid implicit use of doc.ner_ents

a9fa26a

CU-869b44wz8: Fix issue with entity IDs

c4583e0

mart-r added 6 commits November 11, 2025 10:10

Update tutorial with up to date example

4926d43

CU-869b44wz8: Fix issue with wrong base class in tutorial

7e202a4

CU-869b44wz8: Reinstate old signature of create_main_ann and use new one

0b0d698

CU-869b44wz8: Deprecate old create_main_ann method

4c4113b

CU-869b44wz8: Use correct syntax in tutorials for maybe_annotate_name

4a56bbd

CU-869b44wz8: Allow None for current ID and produce a unique ID if ne…

f7fe6e9

…eded

tomolopolis approved these changes Nov 11, 2025


mart-r added 5 commits November 12, 2025 10:28

CU-869b44wz8: Add entity to doc.ner_ents during annotate_name if no I…

dafc986

…D (i.e old API) is used to preserve previous functionality

CU-869b44wz8: Add a few tests for old and new API for maybe_annnotate…

82577ef

…_name

CU-869b44wz8: Fix old behaviour of create_main_ann

0723230

CU-869b44wz8: Add a few small tests fro create_main_ann and filter_li…

5dd5739

…nked_annotations

CU-869b44wz8: Add a baseline test

14fc9ae

mart-r merged commit 7442cec into main Nov 17, 2025
20 checks passed

mart-r deleted the refactor/medcat/CU-869b44wz8-better-internal-components branch November 17, 2025 10:42
