-
Notifications
You must be signed in to change notification settings - Fork 538
Add support for uneven splits for large data #494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for uneven data splits, which is a valuable addition for handling large datasets during fine-tuning. The changes to the public API in TabPFNClassifier and TabPFNRegressor are well-designed, using a keyword-only argument with a default that preserves existing behavior.
However, I've identified a critical bug in the implementation of the uneven splitting logic in src/tabpfn/utils.py which could lead to incorrect behavior. Additionally, the new functionality is not covered by tests, which is a significant gap. My review includes a suggested fix for the bug and a recommendation to add comprehensive test cases for the new code path.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for uneven data splits as an alternative to the existing equal-size splitting behavior. The new functionality allows chunks to be filled to maximum capacity during fine-tuning while preserving backward compatibility.
- Adds a new
equal_split_sizeparameter to control splitting behavior - Implements uneven splitting that creates chunks of
max_data_sizewith remainder handling - Updates all relevant APIs to support the new parameter while maintaining backward compatibility
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/tabpfn/utils.py | Adds core uneven splitting logic with equal_split_size parameter |
| src/tabpfn/base.py | Updates helper function to pass through the new parameter |
| src/tabpfn/classifier.py | Adds equal_split_size parameter to classifier's public API |
| src/tabpfn/regressor.py | Adds equal_split_size parameter to regressor's public API |
| tests/test_ft_utils.py | Updates all test calls to include the new parameter |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
noahho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, LGTM!
Motivation and Context
Add support for uneven splits for large datasets, ensuring that batches are filled during fine-tuning, while also preserving the current functionality of equal-size splits.
Public API Changes
How Has This Been Tested?
Checklist
CHANGELOG.md(if relevant for users).