Skip to content

Conversation

@mart-r
Copy link
Collaborator

@mart-r mart-r commented Nov 11, 2025

This PR is intended to improve internal components.

The issue is that the current setup required a new NER or linking component to re-implement quite a coupled setup to actually set the ner_ents or linked_ents on the MutableDocument. This tight coupling wasn't well documented nor was it a good experience when writing a new component.

Thus, this PR attempts to improve the situation by:

  • Introducing an abstract base class that can be extended for new / other components
    • This class does the setting of the NER or linked entities on the document
    • It defines a predict_entities(MutableDocument, list[MutableEntity] | None) -> list[MutableEntity]
      • This is the only method that needs to be implemented by the extending class
      • For NER, the incoming list is (generally) expected to be empty
      • For linking, this is the list that will be used to attempt the linking stage
    • This class handles the setting of MutableDocument.ner_ents or MutableDocument.linked_ents
      • It does so automatically based on the component type
      • Though custom behaviour is able to be forced
        • I.e the transformers NER component has written its output to linked_ents and will continue to do so
    • The idea is to have an implementation of a linker or NER be oblivious of where it gets its import entities or where it writes its output entities
      • This will be handled by the abstract class
  • Removes the side effects setting entities
    • For linked_ents in medcat.utils.postprocessing.create_main_ann
    • For ner_ents in medcat.components.ner.vocab_based_annotator.maybe_annotate_name
    • The idea is to limit the number of places that directly interact with MutableDocument.ner_ents and MutableDocument.linked_ents
  • It also make some changes to how the ner_ents and linked_ents are used
    • Along the way I discovered that these were not really used as expected
    • I.e the ner_ents list would be changed alongside linked_ents during linking
    • This was because the decoupling of these 2 lists from v1 was incomplete

PS:
This PR (currently) changes the signature for medcat.utils.postprocessing.create_main_ann and medcat.components.ner.vocab_based_annotator.maybe_annotate_name. The former now requires a list of the entities as input (instead of reading it from the document). The latter now requires a current ID since it won't read the length of the existing list.

The later versions address the above by:

  • Creating a new method alongside create_main_ann
    • Called filter_linked_annotations
    • This is used by everything internal
    • The old method / signature remains, but a deprecation warning was added
  • Add a workaround for maybe_annotate_name
    • The method is able to produce a (almost certainly) unique ID if the old signature is used
    • The start index is multiplied by 1000 and the lenght of the span is added to get this unique ID
    • Though this signature should not be used in the native code, it's just there for backwards compatibility

@tomolopolis
Copy link
Member

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - there's no edits required in ner/*.py, to use the new predict_ents? looks like they all use maybe_annotate_name, which didn't change?

Does this change remove the side effect of doc.ner_ents, doc.linked_ents, might be worth a test in here if it did?

@mart-r
Copy link
Collaborator Author

mart-r commented Nov 11, 2025

looks good - there's no edits required in ner/*.py, to use the new predict_ents? looks like they all use maybe_annotate_name, which didn't change?

There's a slight difference in how maybe_annotate_name works. It now doesn't save the annotated entity into MutableDocument.ner_ents, instead it just returns it (if it exists). So if you're calling maybe_annotate_name, in the new setup, you'd need to explicitly add the name to MutableDocument.ner_ents later on.
With that said, if you've got a setup that works with the __call__(MutableDocument) -> MutableDocument it will still work (as long as it does the setting of .ner_ents for the NER step and linked_ents for the linking step).

Does this change remove the side effect of doc.ner_ents, doc.linked_ents, might be worth a test in here if it did?

You're right, a few additional tests here would be beneficial. The existing test suite runs so I'm reasonable confident everything is working as expected. But testing this explicitly (that these methods don't have side effects) would certainly be beneficial. I'll get to that tomorrow.

@mart-r mart-r merged commit 7442cec into main Nov 17, 2025
20 checks passed
@mart-r mart-r deleted the refactor/medcat/CU-869b44wz8-better-internal-components branch November 17, 2025 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants