Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support RANGE in queries Part 2: Arrow #1868

Merged
merged 42 commits into from
Apr 18, 2024

Conversation

Linchin
Copy link
Contributor

@Linchin Linchin commented Mar 23, 2024

This PR supports getting RANGE values in queries as Arrow. This is the second part of the RANGE queries PR, and should not be merged until part 1 is merged. Part 1 #1884 supports RANGE as JSON.

Part of #1724🦕

@Linchin Linchin requested a review from shollyman March 23, 2024 00:02
@Linchin Linchin requested review from a team as code owners March 23, 2024 00:02
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery API. labels Mar 23, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Mar 28, 2024
@@ -3503,7 +3503,10 @@ def test_to_dataframe_no_tqdm_no_progress_bar(self):
user_warnings = [
warning for warning in warned if warning.category is UserWarning
]
self.assertEqual(len(user_warnings), 0)
# Note: number of warnings is inconsistent across python versions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the assertion for number of warnings, because we get different number of them with different python versions. Same for line 3540

@@ -3534,7 +3537,10 @@ def test_to_dataframe_no_tqdm(self):
user_warnings = [
warning for warning in warned if warning.category is UserWarning
]
self.assertEqual(len(user_warnings), 1)
# Note: number of warnings is inconsistent across python versions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the assertion for number of warnings, because we get different number of them with different python versions.

@Linchin
Copy link
Contributor Author

Linchin commented Mar 29, 2024

Adding a csv file for system test due to RANGE is not supported in JSON with load jobs.

@Linchin Linchin changed the title feat: support RANGE in queries in json and arrow feat: support RANGE in queries Part 2: Arrow Apr 3, 2024
Copy link
Contributor

@shollyman shollyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together. Mostly minor comments and questions.

@@ -142,6 +142,12 @@ def bq_to_arrow_struct_data_type(field):
return pyarrow.struct(arrow_fields)


def bq_to_arrow_range_data_type(field):
element_type = field.element_type.upper()
arrow_element_type = _pyarrow_helpers.bq_to_arrow_scalars(element_type)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do validation here? None-check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, I will add a None-check here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it, as well as the unit tests in test__pandas_helpers.py.

@@ -274,6 +286,22 @@ def types_mapper(arrow_data_type):
elif time_dtype is not None and pyarrow.types.is_time(arrow_data_type):
return time_dtype

elif pyarrow.types.is_struct(arrow_data_type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle structs more generally here, or is that logic elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Indeed, our types mapper function doesn't seem to do any conversion for STRUCT or ARRAY. This function is used as the parameter types_mapper to pyarrow's Table.to_pandas(), allowing for customizable type mapping from pyarrow to pandas. I'm not entirely sure why we didn't provide struct/array mapping options, but I think it might be related to the fact that the fields of a struct can be of any type, so the conversion can become quite complicated.

# only supports upto pandas 1.3. If pandas.ArrowDtype is not
# present, we raise a warning and set range_date_dtype to None.
msg = (
"Unable ro find class ArrowDtype in pandas, setting "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ro/to/ here and in the other two msgs below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this!

@@ -103,6 +103,7 @@ class Model(proto.Message):

class ModelType(proto.Enum):
r"""Indicates the type of the Model."""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these formatting changes are benign, but it's not clear why they're in this PR. Maybe pull out unrelated cleanups into a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I will revert these changes

df = row_iterator.to_dataframe(create_bqstorage_client=False)

user_warnings = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with these tests, but are we losing some existing signal here? Or did test expectations change somehow? Its not clear how this related to the PR intent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we are losing some signals here - we no longer validate the number of warnings here. With this PR we will have different length of warnings depending on the version of pandas. I deleted the warnings check, because I didn't want to have pandas version hard-coded in our tests (Should I be concerned about this?). Another alternative I can think of is to check self.assertTrue(len(user_warnings) in {length_with_older_pandas, length_with_newer_pandas})- this way we will lose some signal but less. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all the supported versions emit at least one UserWarning we could still do an assert on the count being nonzero, but if we don't think this is a useful validation I'm fine removing it as well.

@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 15, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 16, 2024
@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 16, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 16, 2024
@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 16, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 16, 2024
Copy link
Contributor

@shollyman shollyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some followup comments and a very minor nit, otherwise LGTM. Thanks!

else:
raise ValueError(f"Unsupported range field type: {value}")
raise ValueError(f"Unsupported range field type: {field.element_type}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we change this to indicate the element type is unsupported, rather than "field type"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I'll change it to be consistent with the name of the field.

df = row_iterator.to_dataframe(create_bqstorage_client=False)

user_warnings = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all the supported versions emit at least one UserWarning we could still do an assert on the count being nonzero, but if we don't think this is a useful validation I'm fine removing it as well.

@Linchin
Copy link
Contributor Author

Linchin commented Apr 18, 2024

If all the supported versions emit at least one UserWarning we could still do an assert on the count being nonzero, but if we don't think this is a useful validation I'm fine removing it as well.

Indeed, I added back the test, and it will pass when len(warnings) == 0 or len(warnings) == 3, so we can cover both cases. I added a comment that explains why and when each value should be true. (Same for another warnings number check).

@Linchin Linchin merged commit 5251b5d into googleapis:main Apr 18, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants