Question About Mapping #185

Open
@bkenn

Description

Hello, question for the audience. How should we handle the calculation of fields in a dataset map function? Do we need a manual A -> B class mapping, or can we use something generic like a Row -> Row map and then handle the field mappings there? I can see it becoming tedious to pass every single property from A to B. Some of the classes I need to process have 20+ properties.

For example, I would like to accept fields as String?, but in later data frames I want to convert them to Int?. Maybe it would be something like what is below. I know this example is small, but keep in mind I don't want to pass all 20+ properties to another class constructor. Maybe in the example below we should use sealed classes for possible errors? I'm not too opinionated about how this should be handled. I know the following code cannot work since RDDs are immutable, but it would be nice to have some kind of convenience like the below to work around that.

data class Client(val age: String?)

data class ClientCalculated(val age: Int?, val errorMessage: String?)

fun litNullAsString() = functions.lit(null).cast(DataTypes.StringType)

 listOf(Client("30"), Client("thirty"))
                .toDS()
                .withColumn("errorMessage", litNullAsString())
                .map {
                    val age = it.getAs<String?>("age")?.toIntOrNull()
                    it["age"] = age
                    it["errorMessage"] = if (age == null) {
                        "age is invalid"
                    } else {
                        null
                    }
                    it
                }.to<ClientCalculated>()
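
For what it's worth, one way to avoid enumerating all 20+ properties is to stay at the column level and only touch the fields that change: overwrite the column with a cast (which yields null for values it cannot parse) and derive the error message with a conditional column expression. This is only a minimal sketch, assuming the same toDS()/to<ClientCalculated>() extensions and session setup as in the snippet above; col, when, and lit come from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

val calculated = listOf(Client("30"), Client("thirty"))
    .toDS()
    // overwrite only the column that changes; the other 20+ properties ride along untouched
    .withColumn("age", functions.col("age").cast(DataTypes.IntegerType))
    // cast returns null for unparseable values, so a null age here means missing or invalid
    .withColumn(
        "errorMessage",
        functions.`when`(functions.col("age").isNull(), functions.lit("age is invalid"))
    )
    .to<ClientCalculated>()

Since when(...) without an otherwise() evaluates to null for non-matching rows, errorMessage stays null for valid ages, which matches the map-based version above.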
