
Conversation

@mart-r
Collaborator

@mart-r mart-r commented Nov 25, 2025

This PR adds a faster linker to the mix.

This faster linker (primary_name_only_linker) is designed to link a name only if
a) there's exactly 1 suitable concept, or
b) there's exactly 1 concept that considers the name a primary name

This results in faster linking. But it's also likely to reduce performance in cases where disambiguation is needed.
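
To make the decision rule concrete, here's a minimal sketch of the logic (the lookup structures and helper names are illustrative stand-ins, not the actual medcat internals):

```python
# Illustrative sketch of the primary-name-only rule; `name2cuis`,
# `name_status` and `passes_filter` are hypothetical stand-ins for
# the CDB lookups the real component uses.
from typing import Callable, Optional

def link_name(name: str,
              name2cuis: dict[str, list[str]],
              name_status: dict[tuple[str, str], str],
              passes_filter: Callable[[str], bool]) -> Optional[str]:
    candidates = name2cuis.get(name, [])
    if len(candidates) == 1:
        # (a) exactly one suitable concept -> link it immediately
        return candidates[0]
    # (b) otherwise keep only concepts that treat this name as primary
    #     (and that pass the configured CUI filters)
    primary = [cui for cui in candidates
               if name_status.get((name, cui)) == "P" and passes_filter(cui)]
    if len(primary) == 1:
        return primary[0]
    # zero or multiple primaries: no disambiguation is attempted
    return None
```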

I ran a few tests to look at the accuracy / throughput tradeoff:

| Dataset | Tokenizer | Configuration | Precision | Recall | F1 | Time (s) |
|---|---|---|---|---|---|---|
| COMETA | Spacy | Vector context | 0.9245 | 0.4521 | 0.6072 | 68.16 |
| COMETA | Spacy | Faster linker | 0.9266 | 0.4225 | 0.5804 | 51.64 |
| COMETA | Regex | Vector context | 0.9130 | 0.4136 | 0.5693 | 30.54 |
| COMETA | Regex | Faster linker | 0.9205 | 0.4108 | 0.5681 | 6.21 |
| 2023 Linking Challenge | Spacy | Vector context | 0.5353 | 0.3337 | 0.4112 | 75.40 |
| 2023 Linking Challenge | Spacy | Faster linker | 0.5934 | 0.2873 | 0.3871 | 48.05 |
| 2023 Linking Challenge | Regex | Vector context | 0.4522 | 0.3162 | 0.3722 | 117.55 |
| 2023 Linking Challenge | Regex | Faster linker | 0.5091 | 0.2862 | 0.3664 | 82.61 |

As we can see, for the COMETA dataset there's a clear benefit to running the faster components (tested for both the regex tokenizer and this new faster linker). You can improve throughput by an order of magnitude! And the accuracy cost isn't that big (up to around 3 points of recall - no change in precision).

However, the Linking Challenge dataset shows that the situation is quite a bit more nuanced. In this case, the regex tokenizer results in slower execution than its spacy counterpart. I'm not entirely sure what the underlying cause is here (especially since the regex tokenizer creates around 25% fewer entities across the dataset). But it's a good example of having to tailor the config to the specific use case.

EDIT: See the comment below for why the Linking Challenge dataset isn't seeing a speedup from the regex-based tokenizer, as well as for some overall inference speed investigation.

@tomolopolis
Member

Task linked: CU-869b9h7y6 Add simple/fast linker

@mart-r
Collaborator Author

mart-r commented Nov 26, 2025

I've gone ahead and looked at the inference speed for the regular and faster linker with both tokenizers. This was run over 400 MIMIC documents with no filter (this becomes important a little later). Here are the results:

| Linker | Tokenizer | Number of entities | Time spent |
|---|---|---|---|
| Normal | spacy | 259 383 | 168.48 s |
| Faster | spacy | 188 164 | 110.75 s |
| Normal | regex | 402 554 | 284.63 s |
| Faster | regex | 270 149 | 208.02 s |

As we can see, there's a clear speedup of between 27% (with regex) and 35% (with spacy) with the faster linker introduced here. However, it comes at the expense of 28% fewer entities with spacy and 33% fewer with regex.

Notably, the regex-based tokenizer is still quite a bit slower here than the spacy tokenizer, but now we can also clearly see why: the regex tokenizer produces a lot more entities. Many of these are (probably) false positives, but they still require processing throughout the pipe. And importantly, the prior metrics experiments were run with filtering enabled (only concepts of interest for a specific dataset were in the filter), so most of these extra entities were filtered out (at least for the COMETA dataset - more on that later) when computing the metrics.
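
As a quick back-of-envelope check of that claim (assuming a roughly constant per-entity cost, which is an assumption rather than a profiled result):

```python
# Per-entity cost from the table above; the constant-cost-per-entity
# model is an assumption, not a profiled result.
runs = {
    ("Normal", "spacy"): (259_383, 168.48),
    ("Faster", "spacy"): (188_164, 110.75),
    ("Normal", "regex"): (402_554, 284.63),
    ("Faster", "regex"): (270_149, 208.02),
}
for (linker, tok), (n_ents, secs) in runs.items():
    print(f"{linker:6s} + {tok:5s}: {1000 * secs / n_ents:.2f} ms/entity")
# All four configurations land in the same ~0.6-0.8 ms/entity range,
# consistent with the regex runs being slower mainly because they
# simply produce more entities.
```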

The last piece of the puzzle I wanted to explore and explain was the difference in performance between the COMETA dataset and the Linking Challenge dataset. So I ran inference only on the texts in these datasets (while using the filters within), and here are the results:

| Dataset | Linker | Tokenizer | Number of entities | Time spent |
|---|---|---|---|---|
| COMETA | Normal | spacy | 7 072 | 61.66 s |
| COMETA | Faster | spacy | 6 594 | 49.37 s |
| COMETA | Normal | regex | 6 608 | 26.61 s |
| COMETA | Faster | regex | 6 488 | 4.49 s |
| Linking Challenge | Normal | spacy | 32 152 | 77.13 s |
| Linking Challenge | Faster | spacy | 24 971 | 51.27 s |
| Linking Challenge | Normal | regex | 36 201 | 117.10 s |
| Linking Challenge | Faster | regex | 29 023 | 81.25 s |

We see again the same situation we did at the start: the COMETA dataset enjoys a brilliant increase in throughput / speed, this time even more than 10-fold (around 13.7x at most). However, we still see that the regex tokenizer is slower for the Linking Challenge dataset. It turns out this is down to how the datasets were prepared.
The COMETA dataset was prepared as around 14 000 separate projects, each with 1 document, 1 concept in the filter, and 1-4 annotations for that CUI (though for 97% of projects just 1).
The Linking Challenge dataset, on the other hand, was prepared as 1 big project with 5337 CUIs in the filter, since they are all relevant.

This explains a number of key things:

  1. The reason the performance is so high for the COMETA dataset is that the task is simple: look for 1 (and only 1) concept.
  2. The reason there's such a great speedup on the COMETA dataset is also clear: we're only looking for one concept at a time, and everything else is ignored.
  3. The reason there's no speedup for the Linking Challenge dataset with regex: the set of concepts it's looking for is much bigger, and the regex tokenizer picks up a lot more entities that need to be disambiguated / processed in some way (see the toy sketch below).
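
A toy illustration of points 1-3 (names and numbers made up; only the mechanism matters):

```python
# Toy illustration: with a tight (COMETA-style) filter almost every name
# resolves to a single surviving candidate, so the primary-name rule can
# short-circuit; with a broad (challenge-style) filter many candidates
# survive and still need disambiguation / further processing.
def surviving(candidates: list[str], cui_filter: set[str]) -> list[str]:
    return [cui for cui in candidates if cui in cui_filter]

candidates = ["C001", "C002", "C003"]  # concepts sharing one detected name
print(surviving(candidates, {"C002"}))         # ['C002'] -> linked immediately
print(surviving(candidates, set(candidates)))  # all 3 -> ambiguous, skipped
```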

So my takeaway from this would be:
For a simple task with a very tight filter, the regex tokenizer may work wonders for throughput.
But for more complex tasks there may be better options out there.

Member

@tomolopolis tomolopolis left a comment


LGTM - a couple of typos, logger nits and a broader registry comment

logger = logging.getLogger(__name__)


class OnlyPrimaryNamesLinker(Linker):
Member


for brevity - consider renaming this to PrimaryNameLinker or PNameLinker?

in StatusTypes.PRIMARY_STATUS and
cnf_l.filters.check_filters(cui))]
if not primary_cuis:
logger.info("No pimary CUIs for name %s", name)
Member


typo for primary

return
if len(primary_cuis) > 1:
logger.info(
"Ambiguous pimary CUIs for name %s: %s", name, primary_cuis)
Member


and here. Also should these info statements be debug?

# primary name only
"primary_name_only_linker": (
"medcat.components.linking.only_primary_name_linker",
"OnlyPrimaryNamesLinker.create_new_component"),
Member


thinking about the comp_registry more generally - should it just accept the component class? The module will be included in that, and when would it not be clazz.create_new_component?

Collaborator Author

@mart-r mart-r Nov 28, 2025


The reason I've set it up like this is for lazy loading. Don't want to import all the components and their internals if we aren't going to be using them.

There is the option to register classes - and that's what you do with a custom component. But the assumption there is that you're registering it in order to immediately use it.
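
For reference, the pattern boils down to something like this (an illustrative sketch, not the actual registry code):

```python
# Minimal sketch of the lazy-loading registry pattern: components are
# stored as (module_path, attribute_path) strings and only imported on
# first use. The structure here is illustrative, not medcat's internals.
import importlib

_REGISTRY: dict[str, tuple[str, str]] = {
    "primary_name_only_linker": (
        "medcat.components.linking.only_primary_name_linker",
        "OnlyPrimaryNamesLinker.create_new_component"),
}

def resolve(name: str):
    module_path, attr_path = _REGISTRY[name]
    obj = importlib.import_module(module_path)  # import happens only now
    for part in attr_path.split("."):           # walk e.g. Class.method
        obj = getattr(obj, part)
    return obj
```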

@mart-r mart-r merged commit e2d0940 into main Nov 28, 2025
20 checks passed
@mart-r mart-r deleted the feat/medcat/CU-869b9h7y6-add-faster-linker branch November 28, 2025 16:27