Not able to apply push down filtering to ingestion time partitioned table #1005
Can you please specify the Spark and connector versions, and which connector it is (spark-bigquery-with-dependencies_* or spark-X.Y-bigquery)?
Sorry, knew I'd missed some information 🤦♂️ Spark: v3.3.1
Try the following filter:
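A minimal sketch of the kind of filter meant here, assuming it targets the _PARTITIONDATE pseudo column through the connector's filter option (the table name is hypothetical, and an existing SparkSession spark with the BigQuery connector on the classpath is assumed):

```python
# Sketch: filter on the partition pseudo column, passed through the
# connector's "filter" option so it is applied on the BigQuery side.
df = (
    spark.read.format("bigquery")
    .option("filter", "_PARTITIONDATE > '2023-05-21'")
    .load("my-project.my_dataset.my_table")
)
```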
Because I'm partitioning by month, the table doesn't have the _PARTITIONDATE pseudo column. Running the read statement with the above gives a long error.
@davidrabinowitz can you offer any more insights on this? I've now hit another issue, as I cannot read from tables where …
Hi @pricemg, I was facing a similar issue recently, where I had set up a Spark Thrift Server to query BigQuery tables (needed to join tables from different sources) and queries with filters on date/timestamp columns were failing with InvalidArgumentException. I noticed in the source code that the compileValue method, which parses the filters to be pushed down, handles DATE and TIMESTAMP type columns by wrapping them in the required string, i.e. date values become DATE '<value>' literals and timestamp values become TIMESTAMP '<value>' literals.
Here, in my case the objects passed were of type … Can you try this once and let me know if it works? If not, can you please share the logs containing details of the …
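A hedged sketch of the sort of thing worth trying from the Spark side: writing an explicit BigQuery-typed literal into the pushed filter yourself, so the generated query is typed correctly however Spark represents the value (table name hypothetical; TIMESTAMP '…' is standard BigQuery literal syntax):

```python
# Workaround sketch: put an explicit BigQuery TIMESTAMP literal in the
# "filter" option, which the connector applies on the BigQuery side.
df = (
    spark.read.format("bigquery")
    .option("filter", "_PARTITIONTIME >= TIMESTAMP '2023-05-01 00:00:00 UTC'")
    .load("my-project.my_dataset.my_table")
)
```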
Tried reproducing in 0.39.0; this issue is no longer present.
In trying to understand how pushdown filters work with the BigQuery connector, I realise I have no understanding of how the filters can be applied to a table to yield more performant reads. The only documentation can be found here, but I'm not having much luck with these examples.
The example in particular I'm considering is a table which is partitioned by ingestion time (by month, and so has the pseudo column _PARTITIONTIME [side note: I don't understand why only day-partitioned tables get the _PARTITIONDATE pseudo column instead, but I think that's a general BigQuery thing]) and also has some clustering applied. Some example code to create such a table would be:
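Something along these lines (all names hypothetical; monthly ingestion-time partitioning plus clustering):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned on ingestion time at MONTH granularity and
# clustered on one of its columns.
client.query(
    """
    CREATE TABLE `my-project.my_dataset.my_table` (
      id STRING,
      category STRING,
      value FLOAT64
    )
    PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, MONTH)
    CLUSTER BY category
    """
).result()
```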
I have then been running variations of
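Something like this sketch, using the hypothetical names from above:

```python
# The read being varied: load the table, then apply a filter either
# Spark-side (as here) or via the connector's "filter" option.
df = (
    spark.read.format("bigquery")
    .load("my-project.my_dataset.my_table")
    .filter("category = 'a'")
)
```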
to explore whether I'm understanding the push-down filters correctly, e.g. one variant vs another (see the comparison sketched below), which highlights an attempt at reading and whether the filters have been pushed down.
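A sketch of that comparison, assuming the variants differ only in whether a filter is applied; check the scan node of each plan for pushed-filter details (e.g. a PushedFilters entry) to see what reached the connector:

```python
# No filter: baseline plan.
spark.read.format("bigquery").load("my-project.my_dataset.my_table").explain()

# Spark-side filter: the scan node should now show the pushed filter.
(
    spark.read.format("bigquery")
    .load("my-project.my_dataset.my_table")
    .filter("category = 'a'")
    .explain()
)
```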
For example, I can see I can write queries like the two sketched below, which appear to push down the filter as desired when looking at the execution plan.
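Sketches of the shapes that do push down, using the hypothetical columns above:

```python
from pyspark.sql import functions as F

df = spark.read.format("bigquery").load("my-project.my_dataset.my_table")

# Filters on regular (including clustered) columns show up as pushed
# filters in the plan's scan node:
df.filter(F.col("category") == "a").explain()
df.filter("value > 10").explain()
```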
I am struggling, however, to perform the filtering on the pseudo partition column. This kind of query on the above generated table (sketched below) yields the typical long Spark error.
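A sketch of the failing shape, assuming the pseudo column is referenced as an ordinary DataFrame column:

```python
# Failing shape (sketch): referencing the ingestion-time pseudo column as
# if it were an ordinary column of the DataFrame.
df = spark.read.format("bigquery").load("my-project.my_dataset.my_table")
df.filter("_PARTITIONTIME > '2023-05-21'").show()
```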
Similarly, trying a query like '(_PARTITIONTIME > "2023-05-21")' yields the same errors as above. The only way I've been able to read a partition is using something like the following,
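A sketch, assuming this was the connector's "query" option, which requires views to be enabled and a dataset for materializing the query result (names hypothetical):

```python
# Read via a full SQL query instead of a table load. The "query" option
# needs viewsEnabled plus a materialization dataset.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "my_dataset")

df = spark.read.format("bigquery").option(
    "query",
    """
    SELECT *
    FROM `my-project.my_dataset.my_table`
    WHERE _PARTITIONTIME > TIMESTAMP '2023-05-21'
    """,
).load()
```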
but the execution plan doesn't show any push-down filtering occurring. However, am I maybe misunderstanding the usage of the full SQL query when reading the table, and in this instance the optimisation of reading from the partition is happening, just entirely outside of Spark?