Unable to write to BQ tables containing columns with REQUIRED=True #651

Closed
mattchatporter opened this issue May 30, 2022 · 3 comments

@mattchatporter commented May 30, 2022

Writing to BQ with this connector in Append mode does not work if a column in the target table has mode REQUIRED. Spark generally forces nullability on all columns, particularly when Parquet is involved, so reading in a Parquet file (or creating a DataFrame from various other file formats) produces a schema in which every column is NULLABLE. Attempting to write such a DataFrame via the connector to a table containing REQUIRED columns throws an error.
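
For illustration, a minimal sketch of the failure (the dataset, table, bucket, and column names below are hypothetical placeholders):

# Sketch: assume the target table my_dataset.my_table has a column `id` with mode REQUIRED.
df = spark.read.parquet("gs://my-bucket/data.parquet")
df.printSchema()  # every column comes back as nullable = true
df.write.format("bigquery") \
    .option("temporaryGcsBucket", "my-staging-bucket") \
    .mode("append") \
    .save("my_dataset.my_table")
# Fails: the connector presents `id` as NULLABLE, but the table expects REQUIRED.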

Would it be possible to update the connector to auto-convert columns that Spark treats as NULLABLE into REQUIRED when the BQ table declares them as required? Spark nowadays prefers to treat all columns as nullable and does not really encourage marking them as required; this is reflected both in the lack of a straightforward method for switching a Spark column to non-nullable and in the automatic conversion of all columns to nullable when Parquet files are read.

See the Spark documentation on this forced-nullability behavior:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
"When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons."

@suryasoma (Contributor)

Hello @mattchatporter, the issue is fixed and the fix will be available in the next release.

@suryasoma (Contributor)

Hey @mattchatporter, the fix is in the latest release, 0.25.2.
Thanks

@mattchatporter (Author)

I'm still seeing the issue. The BQ column in question is set to REQUIRED, the same column in the Parquet file being read is NULLABLE, and the attempt to write the DataFrame to BQ still fails:

# Was using:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
# Now using:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar

New error message with spark-bigquery-with-dependencies_2.12-0.25.2.jar:

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table .... Field ... has changed mode from REQUIRED to NULLABLE
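
For anyone reproducing this, one way to confirm the table-side mode is with the google-cloud-bigquery client (a sketch; the table name is a hypothetical placeholder):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")
for field in table.schema:
    print(field.name, field.mode)  # REQUIRED columns report mode "REQUIRED"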
