Writing to BQ with this connector in append mode fails if a column in the target table has mode REQUIRED. Spark generally forces nullable=true on all columns, particularly when Parquet is involved, so reading a Parquet file (or creating a DataFrame from various other file formats) will by default produce every column as NULLABLE. Attempting to write such a DataFrame to BQ via the connector throws an error whenever the target table contains REQUIRED columns.
Would it be possible to update the connector to auto-convert columns that Spark treats as NULLABLE to REQUIRED whenever the BQ table schema requires it? Spark nowadays prefers treating all columns as nullable and offers no straightforward way to mark a column as required, and it forcibly assigns nullable=true when writing out to Parquet.
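For reference, here is a minimal sketch of the manual workaround this request would automate, assuming a DataFrame df read from Parquet, the spark session the pyspark shell provides, and a hypothetical required_cols list naming the columns BQ marks as REQUIRED. Spark exposes no direct setter for nullability, so the usual trick is to rebuild the schema and re-apply it through the RDD:

from pyspark.sql.types import StructField, StructType

def with_required(df, required_cols):
    # Copy each field, flipping nullable to False for the columns BQ requires.
    fields = [
        StructField(f.name, f.dataType, nullable=f.name not in required_cols)
        for f in df.schema.fields
    ]
    # Re-applying the schema via the RDD makes Spark accept the new nullability
    # flags; a plain select()/cast() would leave the columns nullable.
    return spark.createDataFrame(df.rdd, StructType(fields))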
I'm still seeing the issue. The BQ column in question is set to REQUIRED, and the write fails because that same column is NULLABLE in the Parquet file being read in:
# Was using:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
# Now using:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
New error message with spark-bigquery-with-dependencies_2.12-0.25.2.jar:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table .... Field ... has changed mode from REQUIRED to NULLABLE
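For context, a minimal sketch of the failing append, with hypothetical paths, bucket, and table names. The Parquet read forces every column to nullable, so the append trips BQ's schema check as soon as a target column is REQUIRED:

df = spark.read.parquet("gs://my-bucket/data.parquet")  # hypothetical path; all columns come back nullable
(df.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  # hypothetical bucket for the indirect write
    .mode("append")
    .save("my_project.my_dataset.my_table"))  # hypothetical table with a REQUIRED column
# -> BigQueryException: ... Field ... has changed mode from REQUIRED to NULLABLE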
See the Spark Parquet documentation on this forced-nullable behavior:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
"When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons."