
Conversation

@mart-r
Collaborator

@mart-r mart-r commented Nov 25, 2025

This PR adds a faster linker to the mix.

This faster linker (primary_name_only_linker) is designed to link a name only if
a) there's exactly 1 suitable concept, or
b) there's exactly 1 concept that considers the name a primary name

This results in faster linking. But it's also likely to reduce performance in cases where disambiguation is needed.
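
To make the decision rule concrete, here's a minimal sketch of the logic (the lookup structures and helper names are illustrative stand-ins, not the actual medcat internals):

```python
# Illustrative sketch of the primary-name-only rule; `name2cuis`,
# `name_status` and `passes_filter` are hypothetical stand-ins for
# the CDB lookups the real component uses.
from typing import Callable, Optional

def link_name(name: str,
              name2cuis: dict[str, list[str]],
              name_status: dict[tuple[str, str], str],
              passes_filter: Callable[[str], bool]) -> Optional[str]:
    candidates = name2cuis.get(name, [])
    if len(candidates) == 1:
        # (a) exactly one suitable concept -> link it immediately
        return candidates[0]
    # (b) otherwise keep only concepts that treat this name as primary
    #     (and that pass the configured CUI filters)
    primary = [cui for cui in candidates
               if name_status.get((name, cui)) == "P" and passes_filter(cui)]
    if len(primary) == 1:
        return primary[0]
    # zero or multiple primaries: no disambiguation is attempted
    return None
```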

I ran a few tests to look at the accuracy / throughput tradeoff:

| Dataset | Tokenizer | Configuration | Precision | Recall | F1 | Time (s) |
|---|---|---|---|---|---|---|
| COMETA | Spacy | Vector context | 0.9245 | 0.4521 | 0.6072 | 68.16 |
| COMETA | Spacy | Faster linker | 0.9266 | 0.4225 | 0.5804 | 51.64 |
| COMETA | Regex | Vector context | 0.9130 | 0.4136 | 0.5693 | 30.54 |
| COMETA | Regex | Faster linker | 0.9205 | 0.4108 | 0.5681 | 6.21 |
| 2023 Linking Challenge | Spacy | Vector context | 0.5353 | 0.3337 | 0.4112 | 75.40 |
| 2023 Linking Challenge | Spacy | Faster linker | 0.5934 | 0.2873 | 0.3871 | 48.05 |
| 2023 Linking Challenge | Regex | Vector context | 0.4522 | 0.3162 | 0.3722 | 117.55 |
| 2023 Linking Challenge | Regex | Faster linker | 0.5091 | 0.2862 | 0.3664 | 82.61 |

As we can see, for the COMETA dataset there's a clear benefit to running the faster components (tested for both the regex tokenizer and this new faster linker). You can improve throughput by an order of magnitude! And the accuracy cost isn't that big (up to around 3 points of recall - no change in precision).

However, the Linking Challenge dataset shows that the situation is quite a bit more nuanced. In this case, the regex tokenizer results in slower execution than its spacy counterpart. I'm not entirely sure what the underlying cause is here (especially since the regex tokenizer creates around 25% fewer entities across the dataset). But it's a good example of having to tailor the config to the specific use case.

EDIT: See the comment below for why the Linking Challenge dataset isn't seeing a speedup from the regex-based tokenizer, as well as for some overall inference speed investigation.

@tomolopolis
Member

Task linked: CU-869b9h7y6 Add simple/fast linker

@mart-r
Collaborator Author

mart-r commented Nov 26, 2025

I've gone ahead and looked at the inference speed for the regular and faster linker with both tokenizers. This was run over 400 MIMIC documents with no filter (this becomes important a little later). Here are the results:

| Linker | Tokenizer | Number of entities | Time spent |
|---|---|---|---|
| Normal | spacy | 259 383 | 168.48 s |
| Faster | spacy | 188 164 | 110.75 s |
| Normal | regex | 402 554 | 284.63 s |
| Faster | regex | 270 149 | 208.02 s |

As we can see, there's a clear speedup of between 27% (with regex) and 35% (with spacy) with the faster linker introduced here. However, it comes at the expense of 28% fewer entities with spacy and 33% fewer with regex.

Notably, the regex-based tokenizer is still quite a bit slower here than the spacy tokenizer, but now we can also clearly see why: the regex tokenizer produces a lot more entities. Many of these are (probably) false positives, but they still require processing throughout the pipe. And importantly, the prior metrics experiments were run with filtering enabled (only concepts of interest for a specific dataset were in the filter), so most of these extra entities were filtered out (at least for the COMETA dataset - more on that later) when computing the metrics.
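
As a quick back-of-envelope check of that claim (assuming a roughly constant per-entity cost, which is an assumption rather than a profiled result):

```python
# Per-entity cost from the table above; the constant-cost-per-entity
# model is an assumption, not a profiled result.
runs = {
    ("Normal", "spacy"): (259_383, 168.48),
    ("Faster", "spacy"): (188_164, 110.75),
    ("Normal", "regex"): (402_554, 284.63),
    ("Faster", "regex"): (270_149, 208.02),
}
for (linker, tok), (n_ents, secs) in runs.items():
    print(f"{linker:6s} + {tok:5s}: {1000 * secs / n_ents:.2f} ms/entity")
# All four configurations land in the same ~0.6-0.8 ms/entity range,
# consistent with the regex runs being slower mainly because they
# simply produce more entities.
```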

The last piece of the puzzle I wanted to explore and explain was the difference in performance between the COMETA dataset and the Linking Challenge dataset. So I ran inference only on the texts in these datasets (while using the filters within), and here are the results:

| Dataset | Linker | Tokenizer | Number of entities | Time spent |
|---|---|---|---|---|
| COMETA | Normal | spacy | 7 072 | 61.66 s |
| COMETA | Faster | spacy | 6 594 | 49.37 s |
| COMETA | Normal | regex | 6 608 | 26.61 s |
| COMETA | Faster | regex | 6 488 | 4.49 s |
| Linking Challenge | Normal | spacy | 32 152 | 77.13 s |
| Linking Challenge | Faster | spacy | 24 971 | 51.27 s |
| Linking Challenge | Normal | regex | 36 201 | 117.10 s |
| Linking Challenge | Faster | regex | 29 023 | 81.25 s |

We see again the same situation we did at the start: the COMETA dataset enjoys a brilliant increase in throughput / speed, this time even more than 10-fold (around 13.7x at most). However, we still see that the regex tokenizer is slower for the Linking Challenge dataset. It turns out this is down to how the datasets were prepared.
The COMETA dataset was prepared as around 14 000 separate projects, each with 1 document, 1 concept in the filter, and 1-4 annotations for that CUI (though for 97% of projects just 1).
The Linking Challenge dataset, on the other hand, was prepared as 1 big project with 5337 CUIs in the filter, since they are all relevant.

This explains a number of key things:

  1. The reason the performance is so high for the COMETA dataset is that the task is simple: look for 1 (and only 1) concept.
  2. The reason there's such a great speedup on the COMETA dataset is also clear: we're only looking for one concept at a time, and everything else is ignored.
  3. The reason there's no speedup for the Linking Challenge dataset with regex: the set of concepts it's looking for is much bigger, and the regex tokenizer picks up a lot more entities that need to be disambiguated / processed in some way (see the toy sketch below).
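
A toy illustration of points 1-3 (names and numbers made up; only the mechanism matters):

```python
# Toy illustration: with a tight (COMETA-style) filter almost every name
# resolves to a single surviving candidate, so the primary-name rule can
# short-circuit; with a broad (challenge-style) filter many candidates
# survive and still need disambiguation / further processing.
def surviving(candidates: list[str], cui_filter: set[str]) -> list[str]:
    return [cui for cui in candidates if cui in cui_filter]

candidates = ["C001", "C002", "C003"]  # concepts sharing one detected name
print(surviving(candidates, {"C002"}))         # ['C002'] -> linked immediately
print(surviving(candidates, set(candidates)))  # all 3 -> ambiguous, skipped
```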

So my takeaway from this would be:
For a simple task with a very tight filter, the regex tokenizer may work wonders for throughput.
But for more complex tasks there may be better options out there.

Member

@tomolopolis tomolopolis left a comment


LGTM - a couple of typos, logger nits and a broader registry comment

logger = logging.getLogger(__name__)


class OnlyPrimaryNamesLinker(Linker):
Member


for brevity - consider renaming this to PrimaryNameLinker or PNameLinker?

in StatusTypes.PRIMARY_STATUS and
cnf_l.filters.check_filters(cui))]
if not primary_cuis:
logger.info("No pimary CUIs for name %s", name)
Member


typo for primary

return
if len(primary_cuis) > 1:
logger.info(
"Ambiguous pimary CUIs for name %s: %s", name, primary_cuis)
Member


and here. Also should these info statements be debug?

# primary name only
"primary_name_only_linker": (
"medcat.components.linking.only_primary_name_linker",
"OnlyPrimaryNamesLinker.create_new_component"),
Member


thinking about the comp_registry more generally - should it just accept the component class? The module will be included in that, and when would it not be clazz.create_new_component?

Collaborator Author

@mart-r mart-r Nov 28, 2025


The reason I've set it up like this is for lazy loading. Don't want to import all the components and their internals if we aren't going to be using them.

There is the option to register classes - and that's what you do with a custom component. But the assumption there is that you're registering it in order to immediately use it.
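
For reference, the pattern boils down to something like this (an illustrative sketch, not the actual registry code):

```python
# Minimal sketch of the lazy-loading registry pattern: components are
# stored as (module_path, attribute_path) strings and only imported on
# first use. The structure here is illustrative, not medcat's internals.
import importlib

_REGISTRY: dict[str, tuple[str, str]] = {
    "primary_name_only_linker": (
        "medcat.components.linking.only_primary_name_linker",
        "OnlyPrimaryNamesLinker.create_new_component"),
}

def resolve(name: str):
    module_path, attr_path = _REGISTRY[name]
    obj = importlib.import_module(module_path)  # import happens only now
    for part in attr_path.split("."):           # walk e.g. Class.method
        obj = getattr(obj, part)
    return obj
```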

@mart-r mart-r merged commit e2d0940 into main Nov 28, 2025
20 checks passed
@mart-r mart-r deleted the feat/medcat/CU-869b9h7y6-add-faster-linker branch November 28, 2025 16:27