Pushing update for MetaCAT #155

shubham-s-agarwal · 2025-09-30T15:39:40Z

This includes:

When the category_name does not match the data (or any of the alternatives), ensuring that the correct category_name is shown in the exception instead of None
Performing undersampling only when two-phase learning is used
Updating the logic for category_value2id to ensure it covers all possible variations, including a partially filled mapping (with incorrect class names)
Adding a safety check that nclasses matches the number of classes found in the data; on a mismatch, an exception is raised because the model needs to be loaded again
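
For readers following along, here is a minimal sketch of the first and last items: resolving the category name against its alternatives, and checking nclasses against the mapping. The function names `resolve_category_name` and `check_nclasses`, their parameters, and the error wording are hypothetical and are not taken from data_utils.py.

```python
from typing import Dict, Iterable, List, Optional


def resolve_category_name(available: Iterable[str], category_name: str,
                          alternatives: Optional[List[str]] = None) -> str:
    """Return the first of category_name / alternatives present in the data.

    Hypothetical helper: illustrates the behaviour described above only.
    """
    available_set = set(available)
    for candidate in [category_name, *(alternatives or [])]:
        if candidate in available_set:
            return candidate
    # Report the name that was actually requested, never None.
    raise ValueError(
        f"Category '{category_name}' (alternatives: {alternatives or []}) "
        f"not found in data; available: {sorted(available_set)}")


def check_nclasses(nclasses: int, category_value2id: Dict[str, int]) -> None:
    """Raise if the configured class count disagrees with the mapping."""
    found = len(category_value2id)
    if found != nclasses:
        raise ValueError(
            f"nclasses is {nclasses} but {found} classes were found in the "
            "data; the model needs to be loaded again")
```

A call such as `resolve_category_name(["Status"], "Presence", ["Status"])` falls back to the alternative, while a name with no match raises with the requested name in the message.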

Includes changes to data_utils

mart-r

There's a question of whether we should change the API for the encode_category_values method, especially since it's used by other parts (e.g. the trainer) as well. At the very least, I'd like to have a test that makes sure its behaviour stays consistent. But ideally, I'd like to keep the API stable if we can.

The other comment I've left is about the indentation within the method. After all, code that is hard to read and/or maintain is one of the reasons we had (at least one of) the issues before. So, to avoid the same issue happening again, let's split this long method into multiple parts, each with its own separate scope and responsibility.
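
As an illustration of the kind of split suggested here, the sketch below pulls undersampling into its own helper and applies it only for two-phase learning. Everything in it is an assumption: the `(text, label_id)` sample shape, the helper names `undersample` and `prepare_training_data`, and the rule of capping each class at the size of the smallest class are for illustration and do not describe the actual data_utils.py code.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Assumed sample shape for this sketch: (text, label_id).
Sample = Tuple[str, int]


def undersample(samples: List[Sample]) -> List[Sample]:
    """Keep at most min-class-count samples per label, preserving order."""
    counts = Counter(label for _, label in samples)
    if not counts:
        return []
    limit = min(counts.values())
    kept: Dict[int, int] = defaultdict(int)
    result: List[Sample] = []
    for sample in samples:
        label = sample[1]
        if kept[label] < limit:
            result.append(sample)
            kept[label] += 1
    return result


def prepare_training_data(samples: List[Sample],
                          two_phase: bool) -> Tuple[List[Sample], List[Sample]]:
    """Return (full data, undersampled data); undersample only if two_phase."""
    undersampled = undersample(samples) if two_phase else []
    return samples, undersampled
```

Each helper has one responsibility, so the top-level function stays flat and the two-phase condition is visible in a single place.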

mart-r · 2025-09-30T18:28:51Z

v1/medcat/medcat/utils/meta_cat/data_utils.py

    return data_sampled


 def encode_category_values(data: Dict, existing_category_value2id: Optional[Dict] = None,


This method is used by the trainer:

cogstack-nlp/medcat-trainer/webapp/api/api/metrics.py

Line 311 in 6279c36

data, _, _ = encode_category_values(data, existing_category_value2id=category_value2id)

Now, it looks like this change doesn't change the API in a way that would break that (at least not immediately). However, I'd like to have some stability in our API.

Perhaps a test for this method to make sure the behaviour is consistent?
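
A rough sketch of what such a test could pin down, based only on the trainer call shown above: the function accepts `existing_category_value2id` as a keyword argument and returns three values. The stub `encode_category_values_stub`, its `(text, label)` data shape, and the checker `check_trainer_call_contract` are hypothetical; a real test would import the function from data_utils.py and pass it to the checker.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional, Tuple


def encode_category_values_stub(
        data: List[Tuple[str, str]],
        existing_category_value2id: Optional[Dict[str, int]] = None,
) -> Tuple[List[Tuple[str, int]], List[Any], Dict[str, int]]:
    """Placeholder with the same call shape as the trainer relies on."""
    mapping = dict(existing_category_value2id or {})
    encoded = []
    for text, label in data:
        # Placeholder id assignment; not the real encoding logic.
        mapping.setdefault(label, len(mapping))
        encoded.append((text, mapping[label]))
    return encoded, [], mapping


def check_trainer_call_contract(func: Callable[..., Any], data: Any,
                                mapping: Dict[str, int]) -> None:
    """Assert the call shape used by the trainer keeps working."""
    params = inspect.signature(func).parameters
    assert "existing_category_value2id" in params
    result = func(data, existing_category_value2id=mapping)
    # The trainer unpacks exactly three values: `data, _, _ = ...`
    assert isinstance(result, tuple) and len(result) == 3
```

The check deliberately says nothing about what the second and third return values contain, since the trainer discards them.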

v1/medcat/medcat/utils/meta_cat/data_utils.py

Creating helper functions for checking alternative class names and undersampling data

Changes for flake8

mart-r

Can we please add type hints as well?
You've already described the types in the docstrings, so adding them to the signature shouldn't be that much extra work.
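
For example, a helper whose docstring already names its types can carry the same information in the signature. The function `count_per_class` below is a made-up illustration and is not one of the helpers in this PR.

```python
from typing import Dict, List, get_type_hints


def count_per_class(labels: List[str]) -> Dict[str, int]:
    """Count how many samples carry each class label.

    Args:
        labels (List[str]): One class label per sample.

    Returns:
        Dict[str, int]: Number of samples per class label.
    """
    counts: Dict[str, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts


# With hints in the signature, tools (and tests) can read the types directly.
hints = get_type_hints(count_per_class)
```

With the hints in place, a type checker such as mypy can verify callers instead of relying on the docstring alone.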

v1/medcat/medcat/utils/meta_cat/data_utils.py

mart-r

Looking good!

* Pushing update for metacat
  Includes changes to data_utils
* Update data_utils.py
* Update data_utils.py
* Update data_utils.py
  Creating helper functions for checking alternative class names and undersampling data
* Update data_utils.py
* Update data_utils.py
  Changes for flake8
* Update data_utils.py
* Update data_utils.py

Pushing update for metacat

0fcc996

Includes changes to data_utils

shubham-s-agarwal requested review from mart-r and tomolopolis September 30, 2025 15:39

shubham-s-agarwal self-assigned this Sep 30, 2025

shubham-s-agarwal added 2 commits September 30, 2025 16:50

Update data_utils.py

006f190

Update data_utils.py

dc1a2bf

mart-r requested changes Sep 30, 2025

shubham-s-agarwal added 3 commits October 1, 2025 11:04

Update data_utils.py

089069e

Creating helper functions for checking alternative class names and undersampling data

Update data_utils.py

25153c0

Update data_utils.py

5828cb2

Changes for flake8

mart-r requested changes Oct 1, 2025

shubham-s-agarwal added 2 commits October 1, 2025 13:11

Update data_utils.py

ba6e71d

Update data_utils.py

5eaed0f

mart-r approved these changes Oct 1, 2025

mart-r merged commit 99f357f into main Oct 1, 2025
18 checks passed

mart-r deleted the metacat_fix branch October 1, 2025 15:35

mart-r mentioned this pull request Oct 2, 2025

Bug(medcat)CU-869aprnhg: Port meta cat fixes from v1 #162

Merged

2 tasks
