@anuragg1209
Contributor

Motivation and Context

Add support for uneven splits for large datasets, ensuring that batches are filled during fine-tuning, while also preserving the current functionality of equal-size splits.

Public API Changes

  • No Public API changes
  • Yes, Public API changes (Details below)

How Has This Been Tested?


Checklist

  • The changes have been tested locally.
  • Documentation has been updated (if the public API or usage changes).
  • An entry has been added to CHANGELOG.md (if relevant for users).
  • The code follows the project's style guidelines.
  • I have considered the impact of these changes on the public API.

@anuragg1209 anuragg1209 requested review from Copilot and noahho and removed request for Copilot September 8, 2025 16:01
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for uneven data splits, which is a valuable addition for handling large datasets during fine-tuning. The changes to the public API in TabPFNClassifier and TabPFNRegressor are well-designed, using a keyword-only argument with a default that preserves existing behavior.
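The keyword-only pattern mentioned above can be sketched as follows. This is a minimal illustration, not the actual TabPFN signature: the function name `describe_split_mode` and the `max_data_size` default are hypothetical, and only the `equal_split_size` name comes from this PR.

```python
def describe_split_mode(
    max_data_size: int = 10_000,  # hypothetical default, for illustration only
    *,  # everything after the bare * must be passed by keyword
    equal_split_size: bool = True,  # default keeps the pre-existing behavior
) -> str:
    """Report which splitting mode a caller would get."""
    return "equal" if equal_split_size else "uneven"


# Existing callers that never pass the new argument are unaffected.
print(describe_split_mode())                        # equal
print(describe_split_mode(equal_split_size=False))  # uneven
```

Because the argument is keyword-only, a positional call such as `describe_split_mode(10_000, False)` raises `TypeError`, so the flag cannot be set by accident.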

However, I've identified a critical bug in the implementation of the uneven splitting logic in src/tabpfn/utils.py which could lead to incorrect behavior. Additionally, the new functionality is not covered by tests, which is a significant gap. My review includes a suggested fix for the bug and a recommendation to add comprehensive test cases for the new code path.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings September 8, 2025 16:14
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for uneven data splits as an alternative to the existing equal-size splitting behavior. The new functionality allows chunks to be filled to maximum capacity during fine-tuning while preserving backward compatibility.

  • Adds a new equal_split_size parameter to control splitting behavior
  • Implements uneven splitting that creates chunks of max_data_size with remainder handling
  • Updates all relevant APIs to support the new parameter while maintaining backward compatibility
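The difference between the two modes summarized above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the code in src/tabpfn/utils.py: the helper name `split_sizes` is hypothetical, and keeping the remainder as a final smaller chunk is an assumption about what "remainder handling" means.

```python
import math


def split_sizes(
    n_samples: int,
    max_data_size: int,
    *,
    equal_split_size: bool = True,
) -> list[int]:
    """Return chunk sizes for n_samples rows (hypothetical helper)."""
    if n_samples <= 0 or max_data_size <= 0:
        raise ValueError("n_samples and max_data_size must be positive")

    if equal_split_size:
        # Use the fewest chunks that respect max_data_size, then spread
        # the rows as evenly as possible across them.
        n_chunks = math.ceil(n_samples / max_data_size)
        base, extra = divmod(n_samples, n_chunks)
        return [base + 1] * extra + [base] * (n_chunks - extra)

    # Uneven mode: fill each chunk to max_data_size; any leftover rows
    # form one final, smaller chunk (assumed remainder handling).
    full, remainder = divmod(n_samples, max_data_size)
    return [max_data_size] * full + ([remainder] if remainder else [])


print(split_sizes(2500, 1000))                          # [834, 833, 833]
print(split_sizes(2500, 1000, equal_split_size=False))  # [1000, 1000, 500]
```

In both modes the sizes sum to `n_samples` and no chunk exceeds `max_data_size`; the uneven mode simply trades balanced chunks for chunks that are full wherever possible.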

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/tabpfn/utils.py | Adds core uneven splitting logic with equal_split_size parameter |
| src/tabpfn/base.py | Updates helper function to pass through the new parameter |
| src/tabpfn/classifier.py | Adds equal_split_size parameter to classifier's public API |
| src/tabpfn/regressor.py | Adds equal_split_size parameter to regressor's public API |
| tests/test_ft_utils.py | Updates all test calls to include the new parameter |

Collaborator

@noahho noahho left a comment

Great, LGTM!

@anuragg1209 anuragg1209 merged commit 648e3d8 into PriorLabs:main Sep 8, 2025
18 of 19 checks passed
oscarkey pushed a commit that referenced this pull request Nov 12, 2025
* Record copied public PR 494

* Add support for uneven splits for large data (#494)

(cherry picked from commit 648e3d8)

---------

Co-authored-by: mirror-bot <mirror-bot@users.noreply.github.com>
Co-authored-by: Anurag garg <50840934+anuragg1209@users.noreply.github.com>