
Dynamic overwrite of partitions does not work as expected #103

Closed
jasonflittner opened this issue Jan 15, 2020 · 17 comments

@jasonflittner

In Spark, when you set spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") and then insert into a partitioned table in overwrite mode, only the partitions present in the inserted data are overwritten; partitions that are not part of that data stick around untouched. When writing to BigQuery with this connector, however, the entire table gets wiped out and only the newly inserted partitions show up. Can the connector be updated to support dynamic partition overwrites?

I am testing with gs://spark-lib/bigquery/spark-bigquery-latest.jar.

Thanks!

Example setup of this scenario:

Ran this on BigQuery directly:

CREATE OR REPLACE TABLE `gcp-project.dev.wiki_page_views_spark_write`
(
  wiki_project STRING,
  wiki_page STRING,
  wiki_page_views INT64,
  date DATE
)
PARTITION BY date
OPTIONS (
  partition_expiration_days = 999999
)

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")

Saving the data to BigQuery:

wiki.write.format('bigquery') \
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write') \
    .option('project', 'gcp-project') \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('overwrite') \
    .save()
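
For contrast, this is the behavior Spark's built-in file sources already provide with dynamic overwrite (a minimal sketch; the path and the two DataFrames are hypothetical):

# With a file source, overwrite mode plus dynamic partitionOverwriteMode
# replaces only the partitions present in the incoming DataFrame.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# First write creates the January partitions.
wiki_jan.write.partitionBy("date").mode("overwrite").parquet("/tmp/wiki")

# Second write replaces/adds only the February partitions;
# the January partitions remain untouched.
wiki_feb.write.partitionBy("date").mode("overwrite").parquet("/tmp/wiki")

This is the behavior being requested from the BigQuery writer.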

davidrabinowitz self-assigned this Jan 15, 2020
@jasonflittner
Author

Thanks for taking a look @davidrabinowitz. Just wanted to check in to get a sense of what timing might look like on this one. Thanks!

@saurabh24292

Hi @davidrabinowitz, do we have any update on this one?
We have a use case where we are writing from a DataFrame to BigQuery:

dfDayPartitionAgg.write.mode(SaveMode.Overwrite).format("bigquery")
  .option("table", "sample-project:sample_dataset.day_partitioned_table")
  .option("createDisposition", "CREATE_NEVER")
  .option("partitionField", "day")
  .save()

What we want is for only the partitions present in the dfDayPartitionAgg DataFrame to be overwritten, but the above code ends up overwriting all partitions in the table.

@jasonflittner
Author

+1, this would be nice to get fixed for us as well!

@davidrabinowitz
Member

I'm working with the BigQuery team on this.

@davidrabinowitz
Member

davidrabinowitz commented May 7, 2020

Hi @saurabh24292 @jasonflittner, it appears the BigQuery API does not support this yet, but the team is aware of the issue. Once it is implemented, it will be supported in the connector as well.

As a workaround, you can upload the data to a temporary (short-lived) table and use SQL to copy the data from that table into the relevant partitions.
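
A minimal sketch of that SQL step, assuming the data was first staged to a hypothetical gcp-project.dev.wiki_staging table with the same schema, and that the affected dates are known (the date list below is a placeholder). A single MERGE keeps the partition swap atomic:

-- Replace only the listed partitions, atomically.
MERGE `gcp-project.dev.wiki_page_views_spark_write` T
USING `gcp-project.dev.wiki_staging` S
ON FALSE
-- every staged row for the affected dates is inserted...
WHEN NOT MATCHED AND S.date IN ('2020-01-14', '2020-01-15') THEN
  INSERT ROW
-- ...and the old rows in those same partitions are deleted.
WHEN NOT MATCHED BY SOURCE AND T.date IN ('2020-01-14', '2020-01-15') THEN
  DELETE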

@AmineSagaama

As a workaround, I set the write mode to "Append", so BigQuery adds the new partitions to the table without deleting old ones. If I need to delete a partition, e.g. for a reset, I can use
bq rm 'dataset.table$20200614' to delete that specific partition of the table.
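
A minimal sketch of that append-based flow, reusing the (hypothetical) table from the issue description:

# Append-only write: existing partitions are never touched.
wiki.write.format('bigquery') \
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write') \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('append') \
    .save()

Any partition that needs to be rewritten is dropped beforehand with bq rm as above, then re-appended.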

@imakarsh

+1 for the request

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Jul 13, 2020
@timshen24

@davidrabinowitz Dear David, any update on this issue? We face the same problem: spark-bigquery-with-dependencies_2.12 version 0.22.2 still wipes out all old partitions and only saves the newly written ones, instead of performing a dynamic overwrite.

@mathfish

Any progress?

@vasu-arora

+1 for the request

@hkarkach-externe

+1 for the request

@allysonlm

> As a workaround, I set the write mode to "Append", so BigQuery adds the new partitions to the table without deleting old ones. If I need to delete a partition, e.g. for a reset, I can use bq rm 'dataset.table$20200614' to delete that specific partition of the table.

Thanks for the feedback.
Can anyone confirm whether append mode is actually supported together with partitionField?

@gitmstoute

+1 for the request

@kane-statsig

Any update? I want to be able to overwrite partitions instead of appending, which creates duplicates.

@arezki1990

arezki1990 commented Mar 5, 2023

+1, any news about this feature?

@khaledh

khaledh commented Apr 6, 2023

The workaround I found here suggests first deleting the partitions that the incoming DataFrame would overwrite, then writing the DataFrame in append mode. Obviously this is not ideal, since the second step can fail and leave you with an inconsistent table.
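
A sketch of that two-step workaround in PySpark, reusing the hypothetical table from the issue description (the client-side DELETE and all names are illustrative, and, as noted, the sequence is not atomic):

from google.cloud import bigquery

table = 'gcp-project.dev.wiki_page_views_spark_write'

# Step 1: delete the partitions the incoming DataFrame will replace.
dates = [row['date'] for row in wiki.select('date').distinct().collect()]
client = bigquery.Client()
client.query(
    f"DELETE FROM `{table}` WHERE date IN UNNEST(@dates)",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter('dates', 'DATE', dates)]
    ),
).result()

# Step 2: append the new data. If this step fails, the deleted
# partitions are already gone, hence the inconsistency risk.
wiki.write.format('bigquery') \
    .option('table', table) \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('append') \
    .save()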

@isha97
Member

isha97 commented Oct 27, 2023

This feature is code-complete and will be available in connector version 0.34.0.
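
Presumably, once on 0.34.0+, the standard Spark setting from the top of this thread will be honored; a hedged sketch, to be confirmed against the 0.34.0 release notes:

# Assumption: the connector now respects Spark's dynamic partition
# overwrite mode, so an overwrite-mode write replaces only the
# partitions present in the DataFrame.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

wiki.write.format('bigquery') \
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write') \
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging') \
    .mode('overwrite') \
    .save()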
