Writing to BQ with this connector in append mode fails if a column in the target table has mode REQUIRED. Spark generally forces nullable=true on all columns, particularly when Parquet is involved, so reading a Parquet file (or creating a DataFrame from various other file formats) will by default produce every column as NULLABLE. Attempting to write such a DataFrame to BQ via the connector throws an error whenever the target table contains REQUIRED columns.
Would it be possible to update the connector to auto-convert columns that Spark treats as NULLABLE to REQUIRED whenever the BQ table schema requires it? Spark nowadays prefers treating all columns as nullable and offers no straightforward way to mark a column as required, and it forcibly assigns nullable=true when writing out to Parquet.
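For reference, here is a minimal sketch of the manual workaround this request would automate, assuming a DataFrame df read from Parquet, the spark session the pyspark shell provides, and a hypothetical required_cols list naming the columns BQ marks as REQUIRED. Spark exposes no direct setter for nullability, so the usual trick is to rebuild the schema and re-apply it through the RDD:

from pyspark.sql.types import StructField, StructType

def with_required(df, required_cols):
    # Copy each field, flipping nullable to False for the columns BQ requires.
    fields = [
        StructField(f.name, f.dataType, nullable=f.name not in required_cols)
        for f in df.schema.fields
    ]
    # Re-applying the schema via the RDD makes Spark accept the new nullability
    # flags; a plain select()/cast() would leave the columns nullable.
    return spark.createDataFrame(df.rdd, StructType(fields))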
I'm still seeing the issue. The BQ column in question is set to REQUIRED, and the write fails because that same column is NULLABLE in the Parquet file being read in:
# Was using:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
# Now using:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
New error message with spark-bigquery-with-dependencies_2.12-0.25.2.jar:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table .... Field ... has changed mode from REQUIRED to NULLABLE
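For context, a minimal sketch of the failing append, with hypothetical paths, bucket, and table names. The Parquet read forces every column to nullable, so the append trips BQ's schema check as soon as a target column is REQUIRED:

df = spark.read.parquet("gs://my-bucket/data.parquet")  # hypothetical path; all columns come back nullable
(df.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  # hypothetical bucket for the indirect write
    .mode("append")
    .save("my_project.my_dataset.my_table"))  # hypothetical table with a REQUIRED column
# -> BigQueryException: ... Field ... has changed mode from REQUIRED to NULLABLE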
See the Spark Parquet documentation on this forced-nullable behavior:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
"When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons."