Add support for uneven splits for large data #494

anuragg1209 · 2025-09-08T16:01:26Z

Motivation and Context

Add support for uneven splits for large datasets, ensuring that batches are filled during fine-tuning, while also preserving the current functionality of equal-size splits.

Public API Changes

No Public API changes
Yes, Public API changes (Details below)

How Has This Been Tested?

Checklist

The changes have been tested locally.
Documentation has been updated (if the public API or usage changes).
An entry has been added to CHANGELOG.md (if relevant for users).
The code follows the project's style guidelines.
I have considered the impact of these changes on the public API.

gemini-code-assist

Code Review

This pull request introduces support for uneven data splits, which is a valuable addition for handling large datasets during fine-tuning. The changes to the public API in TabPFNClassifier and TabPFNRegressor are well-designed, using a keyword-only argument with a default that preserves existing behavior.

However, I've identified a critical bug in the implementation of the uneven splitting logic in src/tabpfn/utils.py which could lead to incorrect behavior. Additionally, the new functionality is not covered by tests, which is a significant gap. My review includes a suggested fix for the bug and a recommendation to add comprehensive test cases for the new code path.
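To illustrate the "keyword-only argument with a default" pattern the review refers to, here is a minimal sketch. The function `prepare_datasets` and its return value are hypothetical stand-ins, not the actual TabPFN signatures; only the parameter name `equal_split_size` comes from this PR, and the default of `True` is an assumption based on the statement that existing behavior is preserved.

```python
def prepare_datasets(rows, max_data_size=10_000, *, equal_split_size=True):
    """Hypothetical stand-in for a public method that gained the new flag.

    Everything after the bare `*` must be passed by keyword, so existing
    positional call sites keep working and cannot set the flag by accident.
    """
    mode = "equal" if equal_split_size else "uneven"
    return {"n_rows": len(rows), "max_data_size": max_data_size, "mode": mode}


rows = list(range(25))

# Existing call sites are unchanged and keep the old (equal-size) behavior.
old_style = prepare_datasets(rows, 10)

# Opting in to uneven splits requires naming the argument explicitly.
new_style = prepare_datasets(rows, 10, equal_split_size=False)

# Passing the flag positionally is rejected with a TypeError.
try:
    prepare_datasets(rows, 10, False)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Making the flag keyword-only keeps call sites readable and leaves room to add further options later without breaking positional callers.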

src/tabpfn/utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Copilot

Pull Request Overview

This PR adds support for uneven data splits as an alternative to the existing equal-size splitting behavior. The new functionality allows chunks to be filled to maximum capacity during fine-tuning while preserving backward compatibility.

Adds a new equal_split_size parameter to control splitting behavior
Implements uneven splitting that creates chunks of max_data_size with remainder handling
Updates all relevant APIs to support the new parameter while maintaining backward compatibility
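The difference between the two modes summarized above can be sketched as follows. This is an illustrative helper written for this summary, not the code in src/tabpfn/utils.py: the function name `split_sizes` is hypothetical, and the equal-size branch assumes the data is divided into the smallest number of near-equal chunks that each fit within `max_data_size`.

```python
from __future__ import annotations

import math


def split_sizes(
    n_samples: int, max_data_size: int, *, equal_split_size: bool = True
) -> list[int]:
    """Return the chunk sizes for n_samples rows (hypothetical helper)."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if max_data_size <= 0:
        raise ValueError("max_data_size must be positive")
    if n_samples == 0:
        return []

    if equal_split_size:
        # Fewest chunks that fit the limit, with sizes differing by at most 1.
        n_chunks = math.ceil(n_samples / max_data_size)
        base, extra = divmod(n_samples, n_chunks)
        return [base + 1] * extra + [base] * (n_chunks - extra)

    # Uneven mode: fill every chunk to max_data_size, then keep the remainder
    # (if any) as a final, smaller chunk.
    n_full, remainder = divmod(n_samples, max_data_size)
    return [max_data_size] * n_full + ([remainder] if remainder else [])


equal_chunks = split_sizes(10, 4)                            # [4, 3, 3]
uneven_chunks = split_sizes(10, 4, equal_split_size=False)   # [4, 4, 2]
```

Both modes cover every row exactly once; they differ only in whether the leftover rows are spread across all chunks or collected in the last one.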

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


File	Description
src/tabpfn/utils.py	Adds core uneven splitting logic with `equal_split_size` parameter
src/tabpfn/base.py	Updates helper function to pass through the new parameter
src/tabpfn/classifier.py	Adds `equal_split_size` parameter to classifier's public API
src/tabpfn/regressor.py	Adds `equal_split_size` parameter to regressor's public API
tests/test_ft_utils.py	Updates all test calls to include the new parameter


src/tabpfn/utils.py

noahho

Great, LGTM!

* Record copied public PR 494
* Add support for uneven splits for large data (#494) (cherry picked from commit 648e3d8)
---------
Co-authored-by: mirror-bot <mirror-bot@users.noreply.github.com>
Co-authored-by: Anurag garg <50840934+anuragg1209@users.noreply.github.com>

Add support for uneven splits for large data

936f7c3

anuragg1209 requested review from Copilot and noahho and removed request for Copilot September 8, 2025 16:01

gemini-code-assist bot reviewed Sep 8, 2025

src/tabpfn/utils.py (resolved)

Update src/tabpfn/utils.py

17bec55

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Copilot AI review requested due to automatic review settings September 8, 2025 16:14

Copilot AI reviewed Sep 8, 2025

src/tabpfn/utils.py (resolved)

noahho approved these changes Sep 8, 2025

anuragg1209 merged commit 648e3d8 into PriorLabs:main Sep 8, 2025
18 of 19 checks passed
