dataFrameOf doesn't recognize @DataSchema instances #177

Jolanrensen · 2022-10-05T16:03:45Z

Given:

@DataSchema
data class Location(
    val name: String,
    val gps: Gps?,
)

@DataSchema
data class Gps(
    val latitude: Double,
    val longitude: Double,
)

val a: DataFrame<Location> = listOf(
    Location("Home", Gps(0.0, 0.0)),
    Location("Away", null),
).toDataFrame()
// ⌌----------------------------------------------------------⌍
// |  | name:String| gps:{latitude:Double?, longitude:Double?}|
// |--|------------|------------------------------------------|
// | 0|        Home|  { latitude:0.000000, longitude:0.0000...|
// | 1|        Away|                                       { }|
// ⌎----------------------------------------------------------⌏
// 
// name: String
// gps:
//     latitude: Double?
//     longitude: Double?

val b: AnyFrame = dataFrameOf("name", "gps")(
    "Home", Gps(0.0, 0.0),
    "Away", null,
)
// ⌌-------------------------------------------------⌍
// |  | name:String|                         gps:Gps?|
// |--|------------|---------------------------------|
// | 0|        Home| Gps(latitude=0.0, longitude=0.0)|
// | 1|        Away|                             null|
// ⌎-------------------------------------------------⌏
// 
// name: String
// gps: Gps?

One would expect a and b to give the same result. However, in a, gps is recognized as a Group, while in b it is recognized as a Value.

2x2xplz · 2022-10-30T20:56:35Z

I'm not on the JetBrains team, so take this with a grain of salt, but I don't agree with your assumption that a and b should give the same result. Every row in a is specifically defined to be a Location (and can reference its schema), whereas b never states the intended type.

The docs for dataFrameOf state: "Returns DataFrame with given column names and values." It seems consistent with those docs that b would define the second column as values (of Gps), and not a ColumnGroup. Meanwhile, toDataFrame's docs mention "perform deep object graph traversal and convert nested objects into ColumnGroups and FrameColumns", so it's not surprising to see independent nested columns for latitude and longitude.

Even if it had the capability to do so, I wouldn't expect the program to search through all of your available DataSchemas to see which is the best fit... you have to specify the schema somewhere. And now that I say that, it would actually be potentially dangerous to have the program just assume that every pair of String + Gps is automatically a Location, without you specifying that.

Just my 2 cents, I'm still learning more about the dataframe API every day.

Jolanrensen · 2022-10-31T11:46:28Z

I'm not saying it should recognize it as a Location, but rather as a group like { latitude:0.0, longitude:0.0 }. But I see your point about being able to read it as a "value". It's a valid point!

2x2xplz · 2022-11-01T05:08:33Z

It's interesting, a is actually represented by 3 columns -- name, gps/latitude and gps/longitude. .toDataFrame() breaks down Location into its two properties and then it breaks down the Gps type into its two properties and creates nested columns for each. That's actually pretty cool. dataFrameOf() is simpler, it just creates a single value per column, so in this case that value's type is Gps. But it doesn't perform the extra step of creating nested columns from Gps.
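
To make the difference concrete, here is a minimal sketch of performing that extra step by hand with the existing unfold operation. It assumes the unfold overload that takes column names; the helper name unfoldGps is made up for illustration, and this is a workaround, not the behaviour change requested in this issue.

```kotlin
import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema
data class Gps(
    val latitude: Double,
    val longitude: Double,
)

// Hypothetical helper: builds `b` as in the issue, then unfolds "gps" explicitly.
fun unfoldGps(): AnyFrame {
    // dataFrameOf stores each Gps instance as a plain value in the "gps" column.
    val b = dataFrameOf("name", "gps")(
        "Home", Gps(0.0, 0.0),
        "Away", null,
    )
    // Assumption: unfold(vararg columns: String) converts the objects in that
    // column into nested columns, one per property of Gps.
    return b.unfold("gps")
}

fun main() {
    // Print the schema to compare it with the schema of `a` above.
    println(unfoldGps().schema())
}
```

The top-level column names stay the same either way; only the kind of the gps column should change.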

Jolanrensen · 2022-11-01T10:25:07Z

Exactly! That's why I thought it would (or should) behave the same. Unless you specifically want it to be a Value Column

nikitinas · 2022-12-20T08:00:36Z

Currently .toDataFrame allows configuring object traversal, including depth, but dataFrameOf doesn't. If dataFrameOf unfolds all objects marked with @DataSchema into ColumnGroups by default, there should be an option to fall back to ValueColumns, because there is no universal way to pack column groups back into objects (although it can be supported for data classes).

So, I agree with the suggestion to unpack @DataSchema objects by default, but let's add some DSL for dataFrameOf configuration to make it flexible:

@DataSchema
data class A(val value: Int)

@DataSchema
data class B(val c: C)

@DataSchema
data class C(val value: Int)

val b: AnyFrame = dataFrameOf("a","b")(
    A(1), B(C(2)),
    A(3), null,
) {
  depth = 1 // default depth of object traversal: unfold only top-level objects
  unfold<C>() // force unfolding of `C`
  preserve<A>() // prevent `A` from unfolding
}

// ⌌-------------------------------------------------⌍
// |  | a:A        |         b:{c:{value: Int?}}     |
// |--|------------|---------------------------------|
// | 0| A(value=1) | { c: {value:2}}                 |
// | 1| A(value=3) | { c: {value: null}}             |
// ⌎-------------------------------------------------⌏

The default depth will be null, which means that all @DataSchema objects should be unfolded and all other objects should be preserved.
depth = 0 means that no objects should be unfolded and the @DataSchema annotation should be ignored.

nikitinas · 2022-12-20T08:21:58Z

.toDataFrame and unfold should have the same behaviour and configuration DSLs
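
For reference, the traversal configuration that .toDataFrame already offers could look roughly like the sketch below. It assumes the properties(maxDepth = ...) { preserve<T>() } form shown in the toDataFrame docs; the function name keepGpsAsValue is hypothetical, and the classes are the ones from the original report.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema
data class Gps(
    val latitude: Double,
    val longitude: Double,
)

@DataSchema
data class Location(
    val name: String,
    val gps: Gps?,
)

// Hypothetical helper: converts Locations but asks the traversal to keep Gps as-is.
fun keepGpsAsValue(): DataFrame<Location> =
    listOf(
        Location("Home", Gps(0.0, 0.0)),
        Location("Away", null),
    ).toDataFrame {
        // Assumption: properties(maxDepth) accepts a traversal DSL block.
        properties(maxDepth = 1) {
            // Store Gps objects without converting them into a ColumnGroup.
            preserve<Gps>()
        }
    }

fun main() {
    // Print the schema to see how the gps column is represented.
    println(keepGpsAsValue().schema())
}
```

A dataFrameOf or unfold DSL along the lines proposed above would presumably mirror this shape.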

pacher · 2023-01-13T18:28:45Z

Just my +1 for configuration option.

"Unless you specifically want it to be a Value Column"

But how can I achieve this if it just unfolds everything by default?

Use case: I have a data class representing a set of parameters for a measurement. First I want to group results by parameters, but later, after some processing, I would like to unfold the parameter set into distinct columns of parameters. So it makes sense to me to annotate my parameter set with DataSchema.

+1 @nikitinas I can always call unfold on a column, but I can't fold it back if it is unfolded automagically for me.

Jolanrensen added this to the 0.9.0 milestone Nov 28, 2022

Jolanrensen mentioned this issue Nov 29, 2022

Column creation of @DataSchema instances to produce ColumnGroups #197

Open

Jolanrensen removed this from the 0.9.0 milestone Dec 1, 2022

Jolanrensen added this to the 0.10.0 milestone Jan 18, 2023

Jolanrensen added the enhancement and invalid labels Jan 18, 2023

Jolanrensen modified the milestones: 0.10.0, 0.11.0 Apr 6, 2023

zaleslaw added the research label Apr 25, 2023

zaleslaw modified the milestones: 0.11.0, Backlog Apr 25, 2023

zaleslaw assigned Jolanrensen Apr 25, 2023

zaleslaw removed the invalid label Apr 8, 2024

Jolanrensen mentioned this issue Apr 26, 2024

Missing tests for unfold #677

Open
