to_avro specify the key #291
Comments
The key needs its own `AbrisConfig` that points to the correct key schema. For some schema naming strategies, you need to set `isKey = true`. Examples of the config are here: https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md
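A minimal sketch of such a key-specific config, mirroring the full example later in this thread (the topic name and registry URL are placeholders):

```scala
import za.co.absa.abris.config.AbrisConfig

// Key-specific config: isKey = true makes ABRiS resolve the "<topic>-key"
// subject instead of the default "<topic>-value" one.
val toAvroConfigKey = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("my_topic", isKey = true) // "my_topic" is a placeholder
  .usingSchemaRegistry("http://localhost:8081")
```

This is a config fragment only; it needs the ABRiS dependency and a running Schema Registry to do anything.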
Do you have a full and working example? Indeed, I was following https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md, and also for the schema of the key:
but then it fails with:
when trying to use the Avro string type for the string key column.
When using:
for a topic
If it was one config, how would
For example from https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md:
If one follows the (default) naming convention for the topic:
schema. Where would you specify here whether it is a key or a value?
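For reference, the TopicNameStrategy derives the Schema Registry subject from the topic name plus a `-key` or `-value` suffix; a tiny illustration (the helper name is made up for this sketch, it is not an ABRiS API):

```scala
// Hypothetical helper mirroring Confluent's TopicNameStrategy:
// the subject is "<topic>-key" for keys and "<topic>-value" for values.
def topicNameSubject(topic: String, isKey: Boolean): String =
  s"$topic-${if (isKey) "key" else "value"}"

println(topicNameSubject("metrics_per_brand", isKey = false)) // metrics_per_brand-value
println(topicNameSubject("metrics_per_brand", isKey = true))  // metrics_per_brand-key
```

So the key/value distinction lives in the subject name, which is why the config has to be told which one it is building.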
As mentioned elsewhere in the documentation, the value schema is the default. If you look at the code, you will see that the method signature is: `def andTopicNameStrategy(topicName: String, isKey: Boolean = false)` So if you want to use the key schema, you must set `isKey = true`.
Thanks. But when setting two different configurations:
the error still occurs. For an input data frame of:
As you can see, the original:
is transformed into a single key and value column.
Here you go with a full and self-contained example. import spark.implicits._
val aggedDf = Seq(("foo", 1.0, 1.0), ("bar", 2.0, 2.0)).toDF("brand", "rating_mean", "duration_mean")
aggedDf.printSchema
aggedDf.show
+-----+-----------+-------------+
|brand|rating_mean|duration_mean|
+-----+-----------+-------------+
| foo| 1.0| 1.0|
| bar| 2.0| 2.0|
+-----+-----------+-------------+
import za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils
import za.co.absa.abris.avro.read.confluent.SchemaManagerFactory
import org.apache.avro.Schema
import za.co.absa.abris.avro.read.confluent.SchemaManager
import za.co.absa.abris.avro.registry.SchemaSubject
import za.co.absa.abris.avro.functions.to_avro
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import za.co.absa.abris.config.{AbrisConfig, ToAvroConfig}
// generate schema for all columns in a dataframe
val valueSchema = AvroSchemaUtils.toAvroSchema(aggedDf)
val keySchema = AvroSchemaUtils.toAvroSchema(aggedDf.select($"brand".alias("key_brand")), "key_brand")
val schemaRegistryClientConfig = Map(AbrisConfig.SCHEMA_REGISTRY_URL -> "http://localhost:8081")
val t = "metrics_per_brand_spark222xx"
val schemaManager = SchemaManagerFactory.create(schemaRegistryClientConfig)
// register schema with topic name strategy
def registerSchema1(schemaKey: Schema, schemaValue: Schema, schemaManager: SchemaManager, schemaName:String): Int = {
schemaManager.register(SchemaSubject.usingTopicNameStrategy(schemaName, true), schemaKey)
schemaManager.register(SchemaSubject.usingTopicNameStrategy(schemaName, false), schemaValue)
}
registerSchema1(keySchema, valueSchema, schemaManager, t)
val toAvroConfig4 = AbrisConfig
.toConfluentAvro
.downloadSchemaByLatestVersion
.andTopicNameStrategy(t)
.usingSchemaRegistry("http://localhost:8081")
val toAvroConfig4Key = AbrisConfig
.toConfluentAvro
.downloadSchemaByLatestVersion
.andTopicNameStrategy(t, isKey = true)
.usingSchemaRegistry("http://localhost:8081")
def writeDfToAvro(keyAvroConfig: ToAvroConfig, toAvroConfig: ToAvroConfig)(dataFrame:DataFrame) = {
// this is the key! need to keep the key to guarantee temporal ordering
val availableCols = dataFrame.columns
val allColumns = struct(availableCols.head, availableCols.tail: _*)
dataFrame.select(to_avro($"brand", keyAvroConfig).alias("key_brand"), to_avro(allColumns, toAvroConfig).alias("value"))
// dataFrame.select($"brand".alias("key_brand"), to_avro(allColumns, toAvroConfig) as 'value)
}
val aggedAsAvro = aggedDf.transform(writeDfToAvro(toAvroConfig4Key, toAvroConfig4))
aggedAsAvro.printSchema
root
|-- key_brand: binary (nullable = true)
|-- value: binary (nullable = false)
aggedAsAvro.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", t).save()
Oh, OK, now I understand. Your problem is not ABRiS.
You need to rename the key column to `key`: Spark's Kafka sink only picks up columns named `key` and `value`.
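A sketch of that fix applied to the `writeDfToAvro` helper from the example above (same configs as before; untested against a live registry, so treat it as illustrative):

```scala
import org.apache.spark.sql.{DataFrame, functions => F}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.ToAvroConfig

// Same as writeDfToAvro above, but the key column is aliased to "key":
// Spark's Kafka sink writes only the columns named "key" and "value".
def writeDfToAvroFixed(keyAvroConfig: ToAvroConfig, valueAvroConfig: ToAvroConfig)(df: DataFrame): DataFrame = {
  val allColumns = F.struct(df.columns.head, df.columns.tail: _*)
  df.select(
    to_avro(F.col("brand"), keyAvroConfig).alias("key"), // "key", not "key_brand"
    to_avro(allColumns, valueAvroConfig).alias("value")
  )
}
```

With the column named `key`, the messages land in Kafka with the Avro-encoded brand as the message key instead of null.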
Indeed. Many thanks.
The commercial (Databricks) edition allows specifying a key when writing Avro, like: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/avro-dataframe
Sadly, this was never added to open-source Spark: apache/spark#31771.
How can I specify a key using ABRiS?
See ABRiS/src/main/scala/za/co/absa/abris/examples/ConfluentKafkaAvroWriter.scala, line 62 at commit c5301ee.
When trying to specify a key, I either get errors or no results (a nulled-out key),
for a dataframe
df
with the following schema, where both the key and value schemas are registered.
When now writing to Kafka (for a dataframe with a key column and an Avro value column), the key of the messages is set to null. How can I get the key to actually be written as well?