
Dynamic overwrite of partitions does not work as expected #103

Closed
jasonflittner opened this issue Jan 15, 2020 · 17 comments

@jasonflittner

In Spark, when you set spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") and then insert into a partitioned table in overwrite mode, only the partitions present in the inserted data are overwritten; partitions that are not part of that data stick around untouched. When writing to BigQuery with this connector, however, the entire table gets wiped out and only the newly inserted partitions show up. Can the connector be updated to support dynamic partition overwrites?

I am testing with gs://spark-lib/bigquery/spark-bigquery-latest.jar.

Thanks!

Example setup of this scenario:

Ran this on BigQuery directly:

CREATE OR REPLACE TABLE `gcp-project.dev.wiki_page_views_spark_write`
(
  wiki_project STRING,
  wiki_page STRING,
  wiki_page_views INT64,
  date DATE
)
PARTITION BY date
OPTIONS (
  partition_expiration_days = 999999
)

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")

Saving the data to BigQuery:

wiki.write.format('bigquery') \
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write') \
    .option('project', 'gcp-project') \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('overwrite') \
    .save()
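
For contrast, this is the behavior Spark's built-in file sources already provide with dynamic overwrite (a minimal sketch; the path and the two DataFrames are hypothetical):

# With a file source, overwrite mode plus dynamic partitionOverwriteMode
# replaces only the partitions present in the incoming DataFrame.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# First write creates the January partitions.
wiki_jan.write.partitionBy("date").mode("overwrite").parquet("/tmp/wiki")

# Second write replaces/adds only the February partitions;
# the January partitions remain untouched.
wiki_feb.write.partitionBy("date").mode("overwrite").parquet("/tmp/wiki")

This is the behavior being requested from the BigQuery writer.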

davidrabinowitz self-assigned this Jan 15, 2020
@jasonflittner
Author

Thanks for taking a look @davidrabinowitz. Just wanted to check in to get a sense of what timing might look like on this one. Thanks!

@saurabh24292

Hi @davidrabinowitz, do we have any update on this one?
We have a use case where we are writing from a DataFrame to BigQuery:

dfDayPartitionAgg.write.mode(SaveMode.Overwrite).format("bigquery")
  .option("table", "sample-project:sample_dataset.day_partitioned_table")
  .option("createDisposition", "CREATE_NEVER")
  .option("partitionField", "day")
  .save()

What we want is for only the partitions present in the dfDayPartitionAgg DataFrame to be overwritten, but the above code ends up overwriting all partitions in the table.

@jasonflittner
Author

+1, this would be nice to get fixed for us as well!

@davidrabinowitz
Member

I'm working with the BigQuery team on this.

@davidrabinowitz
Member

davidrabinowitz commented May 7, 2020

Hi @saurabh24292 @jasonflittner, it appears the BigQuery API does not support this yet, but the team is aware of the issue. Once it is implemented, it will be supported in the connector as well.

As a workaround, you can upload the data to a temporary (short-lived) table and use SQL to copy the data from that table into the relevant partitions.
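
A minimal sketch of that SQL step, assuming the data was first staged to a hypothetical gcp-project.dev.wiki_staging table with the same schema, and that the affected dates are known (the date list below is a placeholder). A single MERGE keeps the partition swap atomic:

-- Replace only the listed partitions, atomically.
MERGE `gcp-project.dev.wiki_page_views_spark_write` T
USING `gcp-project.dev.wiki_staging` S
ON FALSE
-- every staged row for the affected dates is inserted...
WHEN NOT MATCHED AND S.date IN ('2020-01-14', '2020-01-15') THEN
  INSERT ROW
-- ...and the old rows in those same partitions are deleted.
WHEN NOT MATCHED BY SOURCE AND T.date IN ('2020-01-14', '2020-01-15') THEN
  DELETE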

@AmineSagaama

As a workaround, I set the write mode to "Append", so BigQuery adds the new partitions to the table without deleting old ones. If I need to delete a partition, e.g. for a reset, I can use
bq rm 'dataset.table$20200614' to delete that specific partition of the table.
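
A minimal sketch of that append-based flow, reusing the (hypothetical) table from the issue description:

# Append-only write: existing partitions are never touched.
wiki.write.format('bigquery') \
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write') \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('append') \
    .save()

Any partition that needs to be rewritten is dropped beforehand with bq rm as above, then re-appended.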

@imakarsh

+1 for the request

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Jul 13, 2020
@timshen24

@davidrabinowitz Dear David, any update on this issue? We face the same problem: spark-bigquery-with-dependencies_2.12 version 0.22.2 still wipes out all old partitions and only saves the newly written ones, instead of performing a dynamic overwrite.

@mathfish

Any progress?

@vasu-arora

+1 for the request

@hkarkach-externe

+1 for the request

@allysonlm

> As a workaround, I set the write mode to "Append", so BigQuery adds the new partitions to the table without deleting old ones. If I need to delete a partition, e.g. for a reset, I can use bq rm 'dataset.table$20200614' to delete that specific partition of the table.

Thanks for the feedback.
Can anyone confirm whether append mode is actually supported together with partitionField?

@gitmstoute

+1 for the request

@kane-statsig

Any update? I want to be able to overwrite partitions instead of appending, which creates duplicates.

@arezki1990

arezki1990 commented Mar 5, 2023

+1, any news about this feature?

@khaledh

khaledh commented Apr 6, 2023

The workaround I found here suggests first deleting the partitions that the incoming DataFrame would overwrite, then writing the DataFrame in append mode. Obviously this is not ideal, since the second step can fail and leave you with an inconsistent table.
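
A sketch of that two-step workaround in PySpark, reusing the hypothetical table from the issue description (the client-side DELETE and all names are illustrative, and, as noted, the sequence is not atomic):

from google.cloud import bigquery

table = 'gcp-project.dev.wiki_page_views_spark_write'

# Step 1: delete the partitions the incoming DataFrame will replace.
dates = [row['date'] for row in wiki.select('date').distinct().collect()]
client = bigquery.Client()
client.query(
    f"DELETE FROM `{table}` WHERE date IN UNNEST(@dates)",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter('dates', 'DATE', dates)]
    ),
).result()

# Step 2: append the new data. If this step fails, the deleted
# partitions are already gone, hence the inconsistency risk.
wiki.write.format('bigquery') \
    .option('table', table) \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('append') \
    .save()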

@isha97
Member

isha97 commented Oct 27, 2023

This feature is code-complete and will be available in connector version 0.34.0.
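
Presumably, once on 0.34.0+, the standard Spark setting from the top of this thread will be honored; a hedged sketch, to be confirmed against the 0.34.0 release notes:

# Assumption: the connector now respects Spark's dynamic partition
# overwrite mode, so an overwrite-mode write replaces only the
# partitions present in the DataFrame.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

wiki.write.format('bigquery') \
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write') \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('overwrite') \
    .save()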
