feat(medcat):CU-869cy3xa0 Improve training by mart-r · Pull Request #414 · CogStack/cogstack-nlp

mart-r · 2026-04-21T15:34:22Z

This PR overhauls the training setup of MedCAT:

- It modifies the existing TrainableComponent protocol to also include a train_unsupervised method
  - And uses that instead of the "check config for train and run inference" approach to unsupervised training
- It allows all components that follow the TrainableComponent protocol to be trained in a supervised manner
  - Previously only the linker could be trained in a supervised manner
- It provides a few utility methods to allow training and evaluating components individually
  - I.e. dataset-aware components that enable training or evaluating only the NER or only the linker if/when required
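
To make the protocol change concrete, here is a minimal sketch of a structural protocol that carries both training entry points, plus a loop that trains every component that satisfies it. The names `TrainableSketch`, `CountingLinker`, `PlainTagger` and `train_all_unsupervised`, as well as the method signatures, are assumptions made for illustration; they are not MedCAT's actual definitions.

```python
from typing import Iterable, List, Protocol, runtime_checkable


@runtime_checkable
class TrainableSketch(Protocol):
    """Hypothetical stand-in for the TrainableComponent protocol."""

    def train_unsupervised(self, texts: List[str]) -> None: ...

    def train_supervised(self, dataset: dict) -> None: ...


class CountingLinker:
    """Toy component that only counts what it was asked to train on."""

    def __init__(self) -> None:
        self.unsup_docs = 0
        self.sup_calls = 0

    def train_unsupervised(self, texts: List[str]) -> None:
        self.unsup_docs += len(texts)

    def train_supervised(self, dataset: dict) -> None:
        self.sup_calls += 1


class PlainTagger:
    """Toy component with no training methods; it does not match the protocol."""


def train_all_unsupervised(components: Iterable[object],
                           texts: Iterable[str]) -> List[str]:
    """Train every protocol-conforming component and return their class names."""
    # Materialise the texts once so each component sees the full list,
    # even when the caller passes a generator.
    materialised = list(texts)
    trained = []
    for component in components:
        # runtime_checkable protocols only check that the methods exist.
        if isinstance(component, TrainableSketch):
            component.train_unsupervised(materialised)
            trained.append(type(component).__name__)
    return trained
```

The design point is that the trainer no longer needs to know which component it is dealing with: anything exposing both methods is picked up, and anything else is skipped.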

Example code snippets:

When only training the linker (NER is swapped for a dataset-aware component)

with dataset_aware_component(cat, CoreComponentType.ner, DATASET):
    trainer.train_supervised_raw(DATASET, nepochs=1)

When only training NER (the linker is swapped for a dataset-aware component)

with dataset_aware_component(cat, CoreComponentType.linking, self.DATASET):
    trainer.train_unsupervised([doc['text'] for proj in self.DATASET['projects'] for doc in proj['documents']], nepochs=1)

When doing evaluation / stats one component at a time

with dataset_aware_component(cat, CoreComponentType.ner, self.DATASET):
    fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts, examples = get_stats(
        cat, self.DATASET, do_print=False)
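
For readers unfamiliar with the returned metrics, per-CUI precision, recall and F1 are conventionally derived from true-positive, false-positive and false-negative counts. The sketch below assumes those counts are dictionaries keyed by CUI; the helper name `prf1_per_cui` is hypothetical, and this is not MedCAT's own implementation.

```python
from typing import Dict, Tuple


def prf1_per_cui(tps: Dict[str, int],
                 fps: Dict[str, int],
                 fns: Dict[str, int]) -> Dict[str, Tuple[float, float, float]]:
    """Return {cui: (precision, recall, f1)} from per-CUI counts."""
    scores = {}
    for cui in set(tps) | set(fps) | set(fns):
        tp = tps.get(cui, 0)
        fp = fps.get(cui, 0)
        fn = fns.get(cui, 0)
        # Guard every division so a CUI with no predictions or no gold
        # annotations scores 0.0 instead of raising ZeroDivisionError.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cui] = (precision, recall, f1)
    return scores
```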

adam-sutton-1992 · 2026-04-24T21:54:27Z

A few queries but I think it looks good. I might've missed these within the commits:

If you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose?

And one more above^^^

mart-r · 2026-04-25T05:33:00Z

If you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose?

The description already had 2 examples for this :)

The dataset-aware implementation can serve that purpose, because it replaces the specific component with another one (which isn't trainable, but that's kind of irrelevant since it's a different component) for the duration of the context manager.

But I think what makes it unclear is that in the example I've given it a dataset. Realistically, you could provide an empty dataset for it, i.e. like this:

# supervised
with dataset_aware_component(cat, CoreComponentType.ner, {"projects" : []}):
    trainer.train_supervised_raw(DATASET, nepochs=1)
# unsupervised
with dataset_aware_component(cat, CoreComponentType.ner, {"projects" : []}):
    trainer.train_unsupervised(["list", "of", "texts"], nepochs=1)
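
The "replace the component for the duration of the context manager" idea can be sketched in a few lines. This is an illustration under stated assumptions, not the real `dataset_aware_component`: `FakePipeline` and `swapped_component` are hypothetical names, and the real utility builds its replacement from the dataset instead of taking one as an argument.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class FakePipeline:
    """Hypothetical pipeline that stores its components by name."""

    def __init__(self, components: Dict[str, object]) -> None:
        self.components = dict(components)


@contextmanager
def swapped_component(pipeline: FakePipeline, name: str,
                      replacement: object) -> Iterator[object]:
    """Temporarily replace one named component, restoring it on exit."""
    original = pipeline.components[name]
    pipeline.components[name] = replacement
    try:
        yield replacement
    finally:
        # Runs on normal exit and on exceptions, so the original
        # (trainable) component is always put back.
        pipeline.components[name] = original
```

Because the replacement is a different object, any training call made inside the `with` block never reaches the original component, which is why the swapped-out component is left untrained.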

adam-sutton-1992

Just the one comment:

adam-sutton-1992 · 2026-05-12T14:44:56Z

+    def test_train_supervised_can_train_only_linker_when_ner_is_cheating(self):
+        ner = _TrainableNER()
+        linker = _TrainablePassThroughLinker()
+        cat = _FakeCat(self.DATASET, [ner, linker])
+        trainer = Trainer(cat.cdb, cat.__call__, cat.pipe)
+
+        with unittest.mock.patch("medcat.trainer.prepare_name", return_value={"abc": {}}):
+            with dataset_aware_component(cat, CoreComponentType.ner, self.DATASET):
+                trainer.train_supervised_raw(self.DATASET, disable_progress=True)
+
+        self.assertEqual(ner.sup_train_calls, 0)
+        self.assertEqual(linker.sup_train_calls, 1)
+
+    def test_train_supervised_can_train_only_ner_when_linker_is_cheating(self):
+        ner = _TrainableNER()
+        linker = _TrainablePassThroughLinker()
+        cat = _FakeCat(self.DATASET, [ner, linker])
+        trainer = Trainer(cat.cdb, cat.__call__, cat.pipe)
+
+        with unittest.mock.patch("medcat.trainer.prepare_name", return_value={"abc": {}}):
+            with dataset_aware_component(cat, CoreComponentType.linking, self.DATASET):
+                trainer.train_supervised_raw(self.DATASET, disable_progress=True)
+
+        self.assertEqual(ner.sup_train_calls, 1)
+        self.assertEqual(linker.sup_train_calls, 0)


Why not train both at the same time?

mart-r · 2026-05-13

You can. But the point is that you don't have to! I.e. flexibility.

adam-sutton-1992 · 2026-05-13

Sorry, I misread this. I assumed you intended for only one or the other.

github-actions Bot added 15 commits April 16, 2026 15:24

CU-869cy3yz9: Add unsupervised training method to trainable component protocol (d307dab)
CU-869cy3yz9: Follow the intercace for unsupervised trianing in context based linker (acf623b)
CU-869cy3xa0: Use new interface for self-supervised training (f204535)
CU-869cy3yz9: Fix fake pipe in tests (b1cc70a)
CU-869cy3yz9: Fix issue with unrealised generator (a76fde4)
CU-869cy3z45: Allow any component to be trained in an unsupervised manner (7d9adf5)
CU-869cy3z45: Remove unused import (4a71c1c)
CU-869cy3yz9: Add a few more tests for trainable components (4c25292)
CU-869cy3zb0: Add utilities to create a dataset-aware NER or linker component (9620d9b)
CU-869cy3zb0: Fix minor issues with new utilities (7463586)
CU-869cy3zb0: Fix minor order of operations issue (d782e0e)
CU-869cy3zb0: Add a few tests for training utilities (3e3ae16)
CU-869cy3zb0: Add a few missing doc strings (1e7e8cd)
CU-869cy3zb0: Add a few supervised training based tests (9185ae7)
CU-869cy3zb0: Fix import of Self (from typing extensions) (5df2870)

adam-sutton-1992 reviewed Apr 24, 2026

Comment thread medcat-v2/medcat/trainer.py

adam-sutton-1992 self-assigned this May 13, 2026

adam-sutton-1992 reviewed May 13, 2026

mart-r merged commit de36124 into main May 13, 2026
22 checks passed

mart-r deleted the feat/medcat/CU-869cy3xa0-specify-unsupervised-training-in-trainable-component-protocol branch May 13, 2026 13:26

mart-r merged 15 commits into main from feat/medcat/CU-869cy3xa0-specify-unsupervised-training-in-trainable-component-protocol
