Description
Hello, question for the audience. How should we handle calculated fields in a Dataset map function? Do we need a manual A -> B class mapping, or can we use something generic like a Row -> Row map and then to<T>() to handle the field mappings? I can see it becoming tedious to pass every single property from A to B. Some of the classes I need to process have 20+ properties.
For example, I would like to accept fields as String?, but in later data frames I want to convert them to Int?. Maybe it would look something like the code below. I know this example is small, but keep in mind that I don't want to pass all 20+ properties to another class constructor. Maybe the example below should use sealed classes for the possible errors? I'm not too opinionated about how this should be handled. I know the following code cannot work since RDDs are immutable, but it would be nice to have some kind of convenience like this to work around that.
data class Client(val age: String?)
data class ClientCalculated(val age: Int?, val errorMessage: String?)
fun litNullAsString() = functions.lit(null).cast(DataTypes.StringType)
listOf(Client("30"), Client("thirty"))
    .toDS()
    .withColumn("errorMessage", litNullAsString())
    .map {
        val age = it.getAs<String?>("age")?.toIntOrNull()
        it["age"] = age
        it["errorMessage"] = if (age == null) {
            "age is invalid"
        } else {
            null
        }
        it
    }.to<ClientCalculated>()
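
To make the sealed-class idea a bit more concrete, here is a rough sketch in plain Kotlin with no Spark dependency. The names Parsed, parseAge and toCalculated are hypothetical and only for illustration, not part of any library. It still uses the manual A -> B mapping I would like to avoid, so treat it as a baseline for discussion rather than a proposed solution.

```kotlin
// Plain-Kotlin sketch; Parsed, parseAge and toCalculated are hypothetical names.
data class Client(val age: String?)
data class ClientCalculated(val age: Int?, val errorMessage: String?)

// Possible outcomes of converting a raw field, modelled as a sealed hierarchy.
sealed class Parsed<out T> {
    data class Ok<T>(val value: T) : Parsed<T>()
    data class Invalid(val message: String) : Parsed<Nothing>()
}

// Converts the raw String? into an Int, or reports why that failed.
// A missing (null) age is treated as invalid, matching the example above.
fun parseAge(raw: String?): Parsed<Int> {
    val age = raw?.toIntOrNull()
    return if (age == null) Parsed.Invalid("age is invalid") else Parsed.Ok(age)
}

// Manual A -> B mapping: every property has to be passed explicitly,
// which is the part that becomes tedious with 20+ properties.
fun Client.toCalculated(): ClientCalculated =
    when (val result = parseAge(age)) {
        is Parsed.Ok -> ClientCalculated(result.value, null)
        is Parsed.Invalid -> ClientCalculated(null, result.message)
    }

fun main() {
    val results = listOf(Client("30"), Client("thirty")).map { it.toCalculated() }
    results.forEach { println(it) }
}
```

Assuming Dataset.map accepts a lambda like this, toCalculated could presumably be called inside it, but each target property would still need to be listed by hand.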