
Preserve nullability from Avro to Catalyst Schema #137

Closed
kevinwallimann opened this issue Jun 15, 2020 · 1 comment · Fixed by #140
Labels: bug (Something isn't working)


kevinwallimann commented Jun 15, 2020

Currently, in a Kafka-to-Kafka workflow (i.e. Avro -> Catalyst -> Avro, with the columnselectortransformer), all fields become nullable in the destination topic's schema.
Example:
Source schema

{
  "type": "record",
  "name": "pageviews",
  "namespace": "ksql",
  "fields": [
    {
      "name": "viewtime",
      "type": "long"
    }
  ]
}

is written as

{
  "type": "record",
  "name": "pageviews",
  "namespace": "ksql",
  "fields": [
    {
      "name": "viewtime",
      "type": ["long", "null"]
    }
  ]
}

Expected: Non-nullable fields in the source Avro schema should remain non-nullable in the destination schema, and nullable fields should of course stay nullable.
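In Avro, nullability is expressed as a union with the "null" type, as in the second schema above. A minimal sketch of how nullability can be read off a field's type (a hypothetical model for illustration, not ABRiS or Avro library code):

```scala
// Hypothetical sketch (not ABRiS code): in Avro, a field is nullable
// when its declared type is a union that includes "null".
sealed trait AvroType
case class Primitive(name: String) extends AvroType
case class Union(members: List[Primitive]) extends AvroType

def isNullable(t: AvroType): Boolean = t match {
  case Primitive(name) => name == "null"
  case Union(members)  => members.exists(_.name == "null")
}

// "long" alone is non-nullable; ["long", "null"] is nullable.
```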

Migration note
Making an existing nullable field non-nullable is a forward-compatible change (it is almost like adding a field).

kevinwallimann (Collaborator, Author) commented:

Analysis
The nullable property of za.co.absa.abris.avro.sql.AvroDataToCatalyst is hardcoded to true: https://github.com/AbsaOSS/ABRiS/blob/985fab1894826b4ea97a48d065a048b44a0ed180/src/main/scala/za/co/absa/abris/avro/sql/AvroDataToCatalyst.scala#L47
It is unclear why it isn't child.nullable, but that wouldn't change much in our case.

Since the expression wraps the actual Avro record, all fields become nullable when the wrapper expression is flattened like this:

dataFrame
      .select(from_confluent_avro(col("value"), schemaRegistrySettings) as 'data)
      .select("data.*")

The root cause of the nullability is that the value of a Kafka record can, in general, always be null. One use case for a null value is the tombstone record: https://www.confluent.io/blog/handling-gdpr-log-forget/
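The effect can be illustrated with a simplified sketch (hypothetical types, not Spark's Catalyst internals): when the wrapper expression is nullable, every field extracted from it must be marked nullable as well, because the whole record value may be null.

```scala
// Hypothetical sketch of the nullability propagation described above
// (simplified model for illustration, not Spark code).
case class Field(name: String, nullable: Boolean)

// If the parent struct expression is nullable, every flattened child
// column must also be nullable: the entire record may be null
// (e.g. a Kafka tombstone), so no child value is guaranteed to exist.
def flatten(parentNullable: Boolean, children: List[Field]): List[Field] =
  children.map(f => f.copy(nullable = f.nullable || parentNullable))
```

With parentNullable = true, a source field like viewtime (non-nullable long) still comes out nullable, which matches the observed behavior.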
