Dynamic overwrite of partitions does not work as expected #103
Comments
Thanks for taking a look @davidrabinowitz. Just wanted to check in to get a sense of what timing might look like on this one. Thanks!
Hi @davidrabinowitz, do we have any update on this one?
What we want is for only those partitions present in the dfDayPartitionAgg dataframe to be overwritten, but the above code ends up overwriting all partitions in the table.
+1, this would be nice to get fixed for us also!
I'm working with the BigQuery team on this.
Hi @saurabh24292 @jasonflittner, it appears the BigQuery API does not support this yet, but they are aware of the issue. Once implemented, it will be supported in the connector as well. As a workaround, you can upload the data to a temporary (short-lived) table and use SQL to copy the data from that table into the relevant partitions (see the sketch below).
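A rough sketch of that workaround, assuming a hypothetical staging table gcp-project.dev.wiki_staging, the target table from the example below, and the google-cloud-bigquery client library for the copy step:

from google.cloud import bigquery

# Step 1: upload the new partitions to a short-lived staging table.
(wiki.write.format('bigquery')
    .option('table', 'gcp-project.dev.wiki_staging')
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging')
    .mode('overwrite')
    .save())

# Step 2: replace only the partitions present in the staging table.
# Wrapping both statements in a multi-statement transaction keeps the
# target table consistent if either statement fails.
client = bigquery.Client(project='gcp-project')
client.query("""
BEGIN TRANSACTION;
DELETE FROM `gcp-project.dev.wiki_page_views_spark_write`
WHERE date IN (SELECT DISTINCT date FROM `gcp-project.dev.wiki_staging`);
INSERT INTO `gcp-project.dev.wiki_page_views_spark_write`
SELECT * FROM `gcp-project.dev.wiki_staging`;
COMMIT TRANSACTION;
""").result()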
As a workaround, I set the write mode to "Append", so BigQuery will add new partitions to the table without deleting old partitions. If I need to delete a partition, in case of a reset, I can delete it manually first; a sketch follows.
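For example, a minimal sketch using the table from the example below (the partition value and the google-cloud-bigquery client call are illustrative):

from google.cloud import bigquery

# Append: new partitions are added, existing partitions stay untouched.
(wiki.write.format('bigquery')
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write')
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging')
    .mode('append')
    .save())

# On a reset, delete the affected partition before re-appending it.
bigquery.Client(project='gcp-project').query(
    "DELETE FROM `gcp-project.dev.wiki_page_views_spark_write` "
    "WHERE date = DATE '2021-01-01'"  # hypothetical partition to reset
).result()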
+1 for the request |
@davidrabinowitz Dear David, any update on this issue? We faced the same problem: spark-bigquery-with-dependencies_2.12 version 0.22.2 still wipes out all old partitions and just saves the new partitions, as opposed to doing a dynamic overwrite.
Any progress?
+1 for the request
+1 for the request
Thanks for the feedback.
+1 for the request |
Any update? I want to be able to overwrite partitions instead of appending, which creates duplicates.
+1, any news about this feature?
The workaround I found here suggests first deleting the partitions that the incoming dataframe would overwrite, then using append mode to write the dataframe (see the sketch below). Obviously this is not ideal, since the write can fail in the second step and leave you with an inconsistent table.
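A sketch of that delete-then-append sequence, assuming the incoming dataframe is named df and the target is the table from the example below:

from google.cloud import bigquery

# Step 1: delete the partitions that the incoming dataframe covers.
dates = [row['date'] for row in df.select('date').distinct().collect()]
client = bigquery.Client(project='gcp-project')
client.query(
    "DELETE FROM `gcp-project.dev.wiki_page_views_spark_write` "
    "WHERE date IN UNNEST(@dates)",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter('dates', 'DATE', dates)]
    ),
).result()

# Step 2: append the new data. If this step fails, the partitions deleted
# in step 1 are already gone, which is the inconsistency noted above.
(df.write.format('bigquery')
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write')
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging')
    .mode('append')
    .save())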
This feature is code-complete and will be available in the connector version 0.34.0.
In Spark, when you set spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") and then do an insert into a partitioned table in overwrite mode, only the partitions being inserted are overwritten; partitions that were not part of that write stick around untouched. When writing to BigQuery with this connector, however, the entire BigQuery table gets wiped out and only the newly inserted partitions show up. Can the connector be updated to support dynamic partition overwrites?
I am testing with gs://spark-lib/bigquery/spark-bigquery-latest.jar.
Thanks!
Example setup of this scenario:
Ran this on BigQuery directly:
CREATE OR REPLACE TABLE
`gcp-project.dev.wiki_page_views_spark_write`
(
wiki_project STRING,
wiki_page STRING,
wiki_page_views INT64,
date DATE
)
PARTITION BY date
OPTIONS (
partition_expiration_days=999999
)
Then, in the Spark session, enabled dynamic partition overwrite:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
Saving the data to BigQuery:
(wiki.write.format('bigquery')
    # fully qualified destination table
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write')
    .option('project', 'gcp-project')
    # GCS location used to stage the data before the BigQuery load
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging')
    .mode('overwrite')
    .save())