Skip to content

Conversation

@shubham-s-agarwal
Copy link
Contributor

This includes:

  • When the category_name does not match the data (and the alternatives), ensuring the correct category_name is shown in the exception instead of None
  • Performing under sampling only when 2 phase learning is used
  • Updating the logic for category_value2id to ensure it covers all possible variations, including partially filled mapping (with incorrect class names)
  • Performing safety check for nclasses and the number of classes found in data, in case of mismatch, exception is raised as model loading needs to be performed again

Includes changes to data_utils
Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a question of whether we should change the API for the encode_category_values method. Especially since it's used by other parts (e.g trainer) as well. At the very least, I'd like to have some test that makes sure it has consistent behaviour. But ideally, I'd like to keep API stable if we can.

The other comment I've left is with the indentation within the method. After all, hard to read and/or maintain code is one of the reasons we had (at least one of) the issues before. So in order to avoid the same issue happening again, let's split this long method into multiple parts - each with its own separate scope and responsibility.

return data_sampled


def encode_category_values(data: Dict, existing_category_value2id: Optional[Dict] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used by the trainer:

data, _, _ = encode_category_values(data, existing_category_value2id=category_value2id)

Now, it looks like this change doesn't change the API in a way that would break that (at least not immediately). However, I'd like to have some stability in our API.

Perhaps a test for this method to make sure the behaviour is consistent?

Creating helper functions for checking alternative class names and undersampling data
Changes for flake8
Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add type hints as well.
You've already described the types in the doc strings, so adding them to the signature shouldn't be that much extra work.

Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

@mart-r mart-r merged commit 99f357f into main Oct 1, 2025
18 checks passed
@mart-r mart-r deleted the metacat_fix branch October 1, 2025 15:35
mart-r pushed a commit that referenced this pull request Oct 1, 2025
* Pushing update for metacat

Includes changes to data_utils

* Update data_utils.py

* Update data_utils.py

* Update data_utils.py

Creating helper functions for checking alternative class names and undersampling data

* Update data_utils.py

* Update data_utils.py

Changes for flake8

* Update data_utils.py

* Update data_utils.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants