@gbrgr gbrgr commented Nov 17, 2025

Closes RAI-43688
Closes RAI-43694

Upstream PR is here: apache#1824

@gbrgr gbrgr marked this pull request as ready for review November 18, 2025 07:38

// The _file column will be a RunEndEncoded array with the file path
if let Some(run_array) = file_col
    .as_any()
    .downcast_ref::<arrow_array::RunArray<arrow_array::types::Int32Type>>()

Perhaps just downcast to the int array and iterate through the values, instead of having to separate the run array's values from its run ends.

Also, shouldn't we avoid REE here?

Collaborator Author

Regarding dropping REE: no. I think this PR should keep REE, and if we want to disable it we should do that in a separate PR.

Ah, OK, so you'd follow up with that right after this PR?

Collaborator Author

Yes
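
For readers unfamiliar with the REE (run-end encoding) layout discussed in this thread, here is a minimal, std-only sketch of the idea. It is not the arrow-rs `RunArray` API: `RunEndEncoded`, `expand`, and `constant_file_column` are hypothetical names used only for illustration. It shows why a constant `_file` column is cheap under REE (a single run) and what a consumer without REE support would have to expand it into.

```rust
/// Hypothetical stand-in for a run-end-encoded column (not the arrow-rs type).
/// `run_ends[i]` is the exclusive logical end index of run `i`;
/// `values[i]` is the value repeated across that run.
struct RunEndEncoded {
    run_ends: Vec<i32>,
    values: Vec<String>,
}

impl RunEndEncoded {
    /// Logical row count: the last run end, or 0 for an empty column.
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Expand into one value per logical row (the "flat" representation).
    /// Assumes `run_ends` is non-negative and strictly increasing.
    fn expand(&self) -> Vec<&str> {
        let mut out = Vec::with_capacity(self.len());
        let mut start = 0usize;
        for (&end, value) in self.run_ends.iter().zip(&self.values) {
            let end = end as usize;
            out.extend(std::iter::repeat(value.as_str()).take(end - start));
            start = end;
        }
        out
    }
}

/// A constant `_file`-style column: one run that covers every row.
fn constant_file_column(path: &str, num_rows: i32) -> RunEndEncoded {
    RunEndEncoded {
        run_ends: vec![num_rows],
        values: vec![path.to_string()],
    }
}
```

With this layout, a batch of any size stores the file path once; disabling REE, as discussed above, would amount to materializing the result of `expand` instead.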

{
     record_batch_transformer_builder =
-        record_batch_transformer_builder.with_partition(partition_spec, partition_data);
+        record_batch_transformer_builder.with_partition(partition_spec, partition_data)?;
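
The change above makes `with_partition` fallible and propagates its error with `?`. As a rough sketch of that pattern (not the actual builder from this PR), the following uses a hypothetical `TransformerBuilder` with a made-up validation rule and `String` errors; only the shape of the call site, reassigning the builder from a `Result`-returning setter, is the point.

```rust
/// Hypothetical builder; the real one in this PR has different fields.
#[derive(Debug, Default)]
struct TransformerBuilder {
    partition: Option<(Vec<String>, Vec<String>)>,
}

impl TransformerBuilder {
    /// Fallible setter. The length check is an invented validation rule,
    /// standing in for whatever can fail in the real method.
    fn with_partition(
        mut self,
        spec_fields: Vec<String>,
        data: Vec<String>,
    ) -> Result<Self, String> {
        if spec_fields.len() != data.len() {
            return Err(format!(
                "partition spec has {} fields but data has {} values",
                spec_fields.len(),
                data.len()
            ));
        }
        self.partition = Some((spec_fields, data));
        Ok(self)
    }
}

/// Call site: the builder is consumed and reassigned, and `?` returns
/// early with the error instead of silently continuing.
fn configure(
    mut builder: TransformerBuilder,
    partition: Option<(Vec<String>, Vec<String>)>,
) -> Result<TransformerBuilder, String> {
    if let Some((spec, data)) = partition {
        builder = builder.with_partition(spec, data)?;
    }
    Ok(builder)
}
```

Because the setter takes `self` by value, a failed call drops the builder; that is acceptable here since the caller returns the error immediately.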

Do you want to do this for the incremental scan as well?

Collaborator Author

We do not have partition information in the incremental scan (yet? I am not sure whether it is needed at all).

Would we get wrong results because of it, for example if the table had partitions and some partition transforms?

Collaborator Author

I am not sure; the partition handling was only added recently. We could just add the same logic to the incremental tasks, but we first need to understand what the actual issue is.


for column_name in column_names.iter() {
    // Handle metadata columns (like "_file")
    if is_metadata_column_name(column_name) {

I think this already shows that it's a burden to maintain the same features in two places (not only the code but also the tests). At some point we should reconsider how to consolidate this.

/// The Arrow Field definition for the metadata column, or an error if not a metadata field
pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> {
    match field_id {
        RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())),

What about the other fields? Also, should they have static lazy fields?

Collaborator Author

It depends on what these fields are used for. Since this is a constant Arrow field, we can create it. I am leaving the others until they are actually used.

Maybe we should include it for the sake of #10 (comment). Otherwise it's really tactile and error-prone

pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> {
    match field_id {
        RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())),
        _ if is_metadata_field(field_id) => {

If we include other fields as suggested in the comment above, this will never fire. I think we have to extend is_metadata_field to include all metadata fields (a range), and that way we would throw this proper error.

Collaborator Author

Yeah I left implementation of the other fields out intentionally, hence this fires. The functions here do not have to return actual fields for all constants, rather metadata_columns.rs is in general a module for collecting all sorts of constants/helpers. It is for example not clear whether you would ever want a field for _pos, because this comes from the arrow reader. However, we still want to collect the constants here...

We talked offline and agreed to remove unused constants, and to introduce Lazy Fields for those that are used. Then we can address this comment, I think.
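
To make the shape of this discussion concrete, here is a std-only sketch of a lookup with a lazily initialized static field, a range-based `is_metadata_field`, and a distinct error for reserved ids that have no field yet. It is an illustration under assumptions, not the merged code: `FieldDef` stands in for an Arrow `Field`, errors are plain `String`s, `RESERVED_FIELD_ID_FILE` is assumed to be `i32::MAX - 1` (the id the Iceberg spec reserves for `_file`), and `RESERVED_FIELD_ID_MIN` is a hypothetical lower bound.

```rust
use std::sync::OnceLock;

/// Assumed value: the Iceberg spec reserves `i32::MAX - 1` for `_file`.
const RESERVED_FIELD_ID_FILE: i32 = i32::MAX - 1;
/// Hypothetical lower bound of the reserved metadata id range.
const RESERVED_FIELD_ID_MIN: i32 = i32::MAX - 200;

/// Stand-in for an Arrow `Field`; the real code would hold an `Arc<Field>`.
#[derive(Debug, PartialEq)]
struct FieldDef {
    name: &'static str,
    field_id: i32,
    nullable: bool,
}

/// Lazily initialized static field: built once, shared afterwards.
fn file_field() -> &'static FieldDef {
    static FILE_FIELD: OnceLock<FieldDef> = OnceLock::new();
    FILE_FIELD.get_or_init(|| FieldDef {
        name: "_file",
        field_id: RESERVED_FIELD_ID_FILE,
        nullable: false,
    })
}

/// Range check covering every reserved id, not only the implemented ones.
fn is_metadata_field(field_id: i32) -> bool {
    field_id >= RESERVED_FIELD_ID_MIN
}

/// Returns the field for implemented ids and distinguishes
/// "reserved but not implemented" from "not a metadata field at all".
fn get_metadata_field(field_id: i32) -> Result<&'static FieldDef, String> {
    match field_id {
        RESERVED_FIELD_ID_FILE => Ok(file_field()),
        _ if is_metadata_field(field_id) => Err(format!(
            "metadata field id {field_id} is reserved but has no field definition yet"
        )),
        _ => Err(format!("field id {field_id} is not a metadata field")),
    }
}
```

With a range-based check, the guard arm still fires for reserved ids that have no dedicated arm, which is the behavior the review comment above asks for.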

@vustef vustef left a comment

Thanks Gerald.
I'm not entirely clear on the listing of metadata column fields, but otherwise this looks good. We should follow up with disabling REE right after this PR, though, unless you want to pick the Arrow.jl battle to support REE there.

@gbrgr gbrgr merged commit c2ba373 into main Nov 19, 2025
18 checks passed
@gbrgr gbrgr deleted the feature/gb/file-column-inc branch November 19, 2025 07:51