Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CU-8693qx9yp Deid chunking - hugging face pipeline approach #405

Merged
merged 7 commits into from
Feb 28, 2024

Conversation

shubham-s-agarwal
Copy link
Collaborator

Adding functionality for chunking documents that exceed the maximum number of tokens the model can process.

  • Used hugging face pipeline functionality to perform chunking with overlap.
  • Added an attribute 'chunking_overlap_window' to the NER config to control the size of the overlap window (default value set to 5)

@shubham-s-agarwal shubham-s-agarwal added the enhancement New feature or request label Feb 26, 2024
@shubham-s-agarwal shubham-s-agarwal self-assigned this Feb 26, 2024
Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find! It's much better for us to use work something that someone else has done somewhere. And that's what this seems to do!

With that said, do we need to specifically allow the addition of a config dict when loading the model?
The TransformersNER model gets saved and loaded along with its config already.
Now, for older models, this config would not have the new option set. But the model for the config has a default value for it so when initialised from a previous instance, it should use the default where no value was loaded off disk.
In any case, we certainly shouldn't hijack the MetaCAT config dict. If we need this functionality for some reason, we'd need to create and use a new argument.

If this is so we can inject a new value for chunking_overlap_window before the pipe is created, surely we should just set the config value and call TransformersNER.create_eval_pipe again? Though in any case, we may want to document this within the config entry so it's clear that simply changing the value does not change behaviour before the pipe is recreated.

EDIT:
I noticed the multiprocessing DeID tests were failing due to taking too long / timing out. I'll take a look and see what the issue may be.

medcat/cat.py Outdated Show resolved Hide resolved
medcat/ner/transformers_ner.py Show resolved Hide resolved
medcat/config_transformers_ner.py Show resolved Hide resolved
@mart-r mart-r mentioned this pull request Feb 27, 2024
Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - just some comments to clarify please

@@ -13,6 +13,8 @@ class General(MixingConfig, BaseModel):
"""How many characters are piped at once into the meta_cat class"""
ner_aggregation_strategy: str = 'simple'
"""Agg strategy for HF pipeline for NER"""
chunking_overlap_window: int = 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empirically 5 is good?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anthony mentioned he'd want it to be 5 as it would have a good trade-off between computational complexity and performance.
I feel 10 would be better, but 5 works as well

medcat/ner/transformers_ner.py Show resolved Hide resolved
@mart-r
Copy link
Collaborator

mart-r commented Feb 28, 2024

Just as a note here.
The GHA fails due to the deid multiprocessing taking too long. It times out at 3 minutes (normally the tests take between 5 and 20 seconds each).

I've isolated the issue to some changes in this branch. The test runs fine without these changes. But I don't know why it would have this effect. Especially since it seems to persist even when setting the value to 0, which should be the default.
The WIP PR #406 (so I can run it in GHA environment - I've yet to experience issues locally).

EDIT:
Still looking into it by the way.

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mart-r mart-r changed the title Deid chunking - hugging face pipeline approach CU-8693qx9yp Deid chunking - hugging face pipeline approach Feb 28, 2024
@tomolopolis
Copy link
Member

Task linked: CU-8693qx9yp Fix chunking issues for De-ID

Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@mart-r mart-r merged commit 67f1126 into master Feb 28, 2024
5 checks passed
mart-r added a commit that referenced this pull request Feb 28, 2024
* Cu 8693u6b4u tests continue on fail (#400)

* CU-8693u6b4u: Make sure failed/errored tests fail the main workflow

* CU-8693u6b4u: Attempt to fix deid multiprocessing, at least for GHA

* CU-8693u6b4u: Fix small docstring issue

* CU-8693v3tt6 SOMED opcs refset selection (#402)

* CU-8693v3tt6: Update refset ID for OPCS4 mappings in newer SNOMED releases

* CU-8693v3tt6: Add method to get direct refset mappings

* CU-8693v3tt6: Add tests to direct refset mappings method

* CU-8693v3tt6: Fix OPCS4 refset ID selection logic

* CU-8693v3tt6: Add test for OPCS4 refset ID selection

* CU-8693v6epd: Move typing imports away from pydantic (#403)

* CU-8693qx9yp Deid chunking - hugging face pipeline approach (#405)

* Pushing chunking update

* Update transformers_ner.py

* Pushing update to config

Added NER config in cat load function

* Update cat.py

* Updating chunking overlap

* CU-8693qx9yp: Add warning for deid multiprocessing with (potentially) non-functioning chunking window

* CU-8693qx9yp: Fix linting issue

---------

Co-authored-by: mart-r <mart.ratas@gmail.com>

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
@mart-r mart-r deleted the deid_chunking branch August 12, 2024 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants