CU-8693qx9yp Deid chunking - hugging face pipeline approach #405

shubham-s-agarwal · 2024-02-26T13:59:51Z

Adding functionality for chunking documents that exceed the maximum number of tokens the model can process.

Used the Hugging Face pipeline functionality to perform chunking with overlap.
Added an attribute 'chunking_overlap_window' to the NER config to control the size of the overlap window (default value set to 5).
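
To illustrate what chunking with overlap means here, the following is a minimal, self-contained sketch. It is not the Hugging Face implementation and not the code in this PR; `chunk_with_overlap`, `max_len` and `overlap` are hypothetical names used only for illustration, with `overlap` playing the role of 'chunking_overlap_window'.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], max_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most max_len that share `overlap` tokens.

    Hypothetical helper for illustration only; real pipelines also have to
    account for special tokens and for merging predictions across windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not tokens:
        return []
    step = max_len - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return chunks


tokens = [f"t{i}" for i in range(12)]
chunks = chunk_with_overlap(tokens, max_len=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last `overlap` tokens of the previous one, so an entity that straddles a window boundary is more likely to be seen whole in at least one window.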

mart-r

Great find! It's much better for us to reuse work that someone else has already done. And that's what this seems to do!

With that said, do we need to specifically allow the addition of a config dict when loading the model?
The TransformersNER model gets saved and loaded along with its config already.
Now, for older models, this config would not have the new option set. But the config model has a default value for it, so when initialised from a previously saved instance, it should use the default where no value was loaded off disk.
In any case, we certainly shouldn't hijack the MetaCAT config dict. If we need this functionality for some reason, we'd need to create and use a new argument.
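
A rough sketch of the default fallback described above, using a standard-library dataclass as a stand-in for the real pydantic-based config. `GeneralSketch` and `load_general` are hypothetical names and do not exist in MedCAT; only the two field names and their defaults come from the diff in this PR.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class GeneralSketch:
    # Stand-in for the `General` section of the NER config (assumption).
    ner_aggregation_strategy: str = "simple"
    chunking_overlap_window: int = 5


def load_general(saved: Dict[str, Any]) -> GeneralSketch:
    """Build the config from a saved dict; missing keys keep their defaults."""
    known = {f.name for f in fields(GeneralSketch)}
    return GeneralSketch(**{k: v for k, v in saved.items() if k in known})


# A config saved before the option existed has no 'chunking_overlap_window'.
old = load_general({"ner_aggregation_strategy": "simple"})
# A config saved afterwards carries its own value.
new = load_general({"ner_aggregation_strategy": "simple", "chunking_overlap_window": 10})
print(old.chunking_overlap_window, new.chunking_overlap_window)
```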

If this is so we can inject a new value for chunking_overlap_window before the pipe is created, surely we should just set the config value and call TransformersNER.create_eval_pipe again? Though in any case, we may want to document this within the config entry so it's clear that simply changing the value does not change behaviour before the pipe is recreated.
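
The point about recreating the pipe can be sketched with toy classes. `ToyConfig`, `ToyPipe` and `ToyNER` are hypothetical stand-ins and not the MedCAT classes; only the method name `create_eval_pipe` is taken from the comment above, and the assumption is that the pipe captures the overlap value at creation time.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    chunking_overlap_window: int = 5


class ToyPipe:
    """Stand-in for a pipeline that captures its settings when it is built."""

    def __init__(self, overlap: int) -> None:
        self.overlap = overlap


class ToyNER:
    def __init__(self, config: ToyConfig) -> None:
        self.config = config
        self.create_eval_pipe()

    def create_eval_pipe(self) -> None:
        # The current config value is read only at this point.
        self.pipe = ToyPipe(overlap=self.config.chunking_overlap_window)


ner = ToyNER(ToyConfig())
ner.config.chunking_overlap_window = 10
stale = ner.pipe.overlap  # the existing pipe still uses the old value
ner.create_eval_pipe()
fresh = ner.pipe.overlap  # the recreated pipe picks up the new value
print(stale, fresh)
```

Under that assumption, changing the config alone has no effect until the pipe is rebuilt, which is why documenting this in the config entry would help.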

EDIT:
I noticed the multiprocessing DeID tests were failing due to taking too long / timing out. I'll take a look and see what the issue may be.

medcat/cat.py

medcat/ner/transformers_ner.py

medcat/config_transformers_ner.py

Added NER config in cat load function

tomolopolis

lgtm - just some comments to clarify please

tomolopolis · 2024-02-28T09:47:05Z

medcat/config_transformers_ner.py

@@ -13,6 +13,8 @@ class General(MixingConfig, BaseModel):
    """How many characters are piped at once into the meta_cat class"""
    ner_aggregation_strategy: str = 'simple'
    """Agg strategy for HF pipeline for NER"""
+    chunking_overlap_window: int = 5


empirically 5 is good?

Anthony mentioned he'd want it to be 5, as it gives a good trade-off between computational complexity and performance.
I feel 10 would be better, but 5 works as well.

medcat/ner/transformers_ner.py

mart-r · 2024-02-28T10:12:44Z

Just as a note here.
The GHA fails due to the deid multiprocessing taking too long. It times out at 3 minutes (normally the tests take between 5 and 20 seconds each).

I've isolated the issue to some changes in this branch. The test runs fine without these changes. But I don't know why it would have this effect. Especially since it seems to persist even when setting the value to 0, which should be the default.
I've opened the WIP PR #406 so I can run it in the GHA environment (I've yet to experience the issue locally).

EDIT:
Still looking into it by the way.

tomolopolis

lgtm

tomolopolis · 2024-02-28T16:21:39Z

Task linked: CU-8693qx9yp Fix chunking issues for De-ID

mart-r

Looks good to me.

* Cu 8693u6b4u tests continue on fail (#400)
* CU-8693u6b4u: Make sure failed/errored tests fail the main workflow
* CU-8693u6b4u: Attempt to fix deid multiprocessing, at least for GHA
* CU-8693u6b4u: Fix small docstring issue
* CU-8693v3tt6 SOMED opcs refset selection (#402)
* CU-8693v3tt6: Update refset ID for OPCS4 mappings in newer SNOMED releases
* CU-8693v3tt6: Add method to get direct refset mappings
* CU-8693v3tt6: Add tests to direct refset mappings method
* CU-8693v3tt6: Fix OPCS4 refset ID selection logic
* CU-8693v3tt6: Add test for OPCS4 refset ID selection
* CU-8693v6epd: Move typing imports away from pydantic (#403)
* CU-8693qx9yp Deid chunking - hugging face pipeline approach (#405)
* Pushing chunking update
* Update transformers_ner.py
* Pushing update to config
  Added NER config in cat load function
* Update cat.py
* Updating chunking overlap
* CU-8693qx9yp: Add warning for deid multiprocessing with (potentially) non-functioning chunking window
* CU-8693qx9yp: Fix linting issue

---------

Co-authored-by: mart-r <mart.ratas@gmail.com>

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>

shubham-s-agarwal added 2 commits February 26, 2024 11:38

Pushing chunking update

5fd7990

Update transformers_ner.py

ec0148b

shubham-s-agarwal added the enhancement (New feature or request) label Feb 26, 2024

shubham-s-agarwal requested review from tomolopolis and mart-r February 26, 2024 13:59

shubham-s-agarwal self-assigned this Feb 26, 2024

mart-r requested changes Feb 27, 2024

medcat/cat.py (outdated, resolved)

medcat/ner/transformers_ner.py (resolved)

medcat/config_transformers_ner.py (resolved)

shubham-s-agarwal added 2 commits February 27, 2024 12:18

Pushing update to config

d3e7de5

Added NER config in cat load function

Update cat.py

38c1dd6

mart-r mentioned this pull request Feb 27, 2024

Cu 8693y13a0 deid multi #406

Closed

tomolopolis approved these changes Feb 28, 2024

Updating chunking overlap

6898fa7

tomolopolis approved these changes Feb 28, 2024

mart-r added 2 commits February 28, 2024 16:05

CU-8693qx9yp: Add warning for deid multiprocessing with (potentially) non-functioning chunking window

ec2149b

CU-8693qx9yp: Fix linting issue

a0ef1cc

mart-r changed the title ~~Deid chunking - hugging face pipeline approach~~ CU-8693qx9yp Deid chunking - hugging face pipeline approach Feb 28, 2024

mart-r approved these changes Feb 28, 2024

mart-r merged commit 67f1126 into master Feb 28, 2024
5 checks passed

mart-r deleted the deid_chunking branch August 12, 2024 12:51
