feat(medcat):CU-869cy3xa0 Improve training #414
A few queries, but I think it looks good. I might've missed these within the commits: if you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose? And one more above^^^
The description already had 2 examples for this :) The dataset-aware implementation can serve that purpose, because it replaces the specific component with another one (which isn't trainable, but that's kind of irrelevant since it's a different component) for the duration of the context manager. But I think what makes it unclear is that in the example I've given it a dataset. Realistically, you could provide an empty dataset for it, i.e. like this:

    # supervised
    with dataset_aware_component(cat, CoreComponentType.ner, {"projects": []}):
        trainer.train_supervised_raw(DATASET, nepochs=1)
    # unsupervised
    with dataset_aware_component(cat, CoreComponentType.ner, {"projects": []}):
        trainer.train_unsupervised(["list", "of", "texts"], nepochs=1)
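To illustrate the swap-for-the-duration-of-the-context idea described above, here is a minimal, self-contained sketch. It is not the MedCAT implementation: `Component`, `Pipeline`, and `swapped_component` are hypothetical stand-ins, and the only behaviour taken from the discussion is that the original component is replaced inside the `with` block and put back afterwards.

```python
from __future__ import annotations

from contextlib import contextmanager
from typing import Iterator


class Component:
    """Hypothetical pipeline component that counts its training calls."""

    def __init__(self, name: str, trainable: bool = True) -> None:
        self.name = name
        self.trainable = trainable
        self.train_calls = 0

    def train(self) -> None:
        self.train_calls += 1


class Pipeline:
    """Hypothetical stand-in for a pipeline holding ordered components."""

    def __init__(self, components: list[Component]) -> None:
        self.components = components

    def train_all(self) -> None:
        # Only components that declare themselves trainable get trained.
        for comp in self.components:
            if comp.trainable:
                comp.train()


@contextmanager
def swapped_component(
    pipe: Pipeline, name: str, replacement: Component
) -> Iterator[Component]:
    """Temporarily replace the component called `name`; restore it on exit."""
    index = next(i for i, c in enumerate(pipe.components) if c.name == name)
    original = pipe.components[index]
    pipe.components[index] = replacement
    try:
        yield replacement
    finally:
        # Restore the original even if training raised an exception.
        pipe.components[index] = original


ner = Component("ner")
linker = Component("linking")
pipe = Pipeline([ner, linker])

# While the NER is swapped for a non-trainable stand-in, only the linker trains.
with swapped_component(pipe, "ner", Component("ner", trainable=False)):
    pipe.train_all()

print(ner.train_calls, linker.train_calls)  # 0 1
```

Once the `with` block exits, `pipe.components[0]` is the original `ner` object again, so later training calls reach it as usual.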
adam-sutton-1992
left a comment
Just the one comment:
    def test_train_supervised_can_train_only_linker_when_ner_is_cheating(self):
        ner = _TrainableNER()
        linker = _TrainablePassThroughLinker()
        cat = _FakeCat(self.DATASET, [ner, linker])
        trainer = Trainer(cat.cdb, cat.__call__, cat.pipe)

        with unittest.mock.patch("medcat.trainer.prepare_name", return_value={"abc": {}}):
            with dataset_aware_component(cat, CoreComponentType.ner, self.DATASET):
                trainer.train_supervised_raw(self.DATASET, disable_progress=True)

        self.assertEqual(ner.sup_train_calls, 0)
        self.assertEqual(linker.sup_train_calls, 1)

    def test_train_supervised_can_train_only_ner_when_linker_is_cheating(self):
        ner = _TrainableNER()
        linker = _TrainablePassThroughLinker()
        cat = _FakeCat(self.DATASET, [ner, linker])
        trainer = Trainer(cat.cdb, cat.__call__, cat.pipe)

        with unittest.mock.patch("medcat.trainer.prepare_name", return_value={"abc": {}}):
            with dataset_aware_component(cat, CoreComponentType.linking, self.DATASET):
                trainer.train_supervised_raw(self.DATASET, disable_progress=True)

        self.assertEqual(ner.sup_train_calls, 1)
        self.assertEqual(linker.sup_train_calls, 0)
Why not train both at the same time?
You can. But the point is that you don't have to! I.e., flexibility.
Sorry, I misread this. I assumed you intended for only one or the other.
This PR does an overhaul to the training setup of MedCAT:
- TrainableComponent protocol to also include a train_unsupervised method
- TrainableComponent protocol to be trained supervised

Example code snippets:
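The PR's own snippets are not reproduced here. As a rough illustration of the protocol change, the sketch below shows what a protocol with a `train_unsupervised` method could look like and how a trainer might pick out the components that satisfy it. Everything except the method name `train_unsupervised` is an assumption: `TrainableComponentSketch`, `train_supervised`, the argument types, and the two example classes are hypothetical and do not mirror MedCAT's real signatures.

```python
from __future__ import annotations

from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class TrainableComponentSketch(Protocol):
    """Hypothetical protocol; real MedCAT signatures may differ."""

    def train_supervised(self, examples: Iterable[tuple[str, str]]) -> None: ...

    def train_unsupervised(self, texts: Iterable[str]) -> None: ...


class CountingLinker:
    """Hypothetical component that satisfies the protocol structurally."""

    def __init__(self) -> None:
        self.sup_calls = 0
        self.unsup_calls = 0

    def train_supervised(self, examples: Iterable[tuple[str, str]]) -> None:
        self.sup_calls += 1

    def train_unsupervised(self, texts: Iterable[str]) -> None:
        self.unsup_calls += 1


class RuleBasedNER:
    """Hypothetical component with no training methods at all."""

    def __call__(self, text: str) -> str:
        return text


components = [RuleBasedNER(), CountingLinker()]

# isinstance() on a runtime_checkable Protocol only checks that the
# methods exist, not that their signatures match.
trainable = [c for c in components if isinstance(c, TrainableComponentSketch)]

for comp in trainable:
    comp.train_unsupervised(["list", "of", "texts"])

print(len(trainable), trainable[0].unsup_calls)  # 1 1
```

With this shape, any number of components can be trained in the same pass, and a component that lacks the methods is simply skipped.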