Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataFrameOf doesn't recognize @DataSchema instances #177

Open
Jolanrensen opened this issue Oct 5, 2022 · 7 comments
Open

dataFrameOf doesn't recognize @DataSchema instances #177

Jolanrensen opened this issue Oct 5, 2022 · 7 comments
Assignees
Labels
enhancement New feature or request research
Milestone

Comments

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Oct 5, 2022

given

@DataSchema
data class Location(
    val name: String,
    val gps: Gps?,
)

@DataSchema
data class Gps(
    val latitude: Double,
    val longitude: Double,
)

val a: DataFrame<Location> = listOf(
    Location("Home", Gps(0.0, 0.0)),
    Location("Away", null),
).toDataFrame()
// ⌌----------------------------------------------------------⌍
// |  | name:String| gps:{latitude:Double?, longitude:Double?}|
// |--|------------|------------------------------------------|
// | 0|        Home|  { latitude:0.000000, longitude:0.0000...|
// | 1|        Away|                                       { }|
// ⌎----------------------------------------------------------⌏
// 
// name: String
// gps:
//     latitude: Double?
//     longitude: Double?

val b: AnyFrame = dataFrameOf("name", "gps")(
    "Home", Gps(0.0, 0.0),
    "Away", null,
)
// ⌌-------------------------------------------------⌍
// |  | name:String|                         gps:Gps?|
// |--|------------|---------------------------------|
// | 0|        Home| Gps(latitude=0.0, longitude=0.0)|
// | 1|        Away|                             null|
// ⌎-------------------------------------------------⌏
// 
// name: String
// gps: Gps?

One would expect a and b to give the same result. However, in a gps is recognized as a Group and in the second as a Value.

@2x2xplz
Copy link

2x2xplz commented Oct 30, 2022

I'm not on the JetBrains team, so take this with a grain of salt, but I don't agree with your assumption that a and b should give the same result. Every row in a is specifically defined to be a Location (and can reference its schema), b never states the intended type.

The docs for dataFrameOf state: Returns DataFrame with given column names and values. It seems consistent with those docs that b would define the second column as values (of Gps), and not a ColumnGroup. Meanwhile toDataFrame's docs mention perform deep object graph traversal and convert nested objects into ColumnGroups and FrameColumns, so not surprising to see independent nested cols for latitude and longitude.

Even if it had the capability to do so, wouldn't expect the program to search through all of your available DataSchemas to see which is the best fit... you have to specify the schema somewhere. And now that I say that, it would actually be potentially dangerous to have the program just assume that every pair of String + Gps is automatically a Location, without you specifying that.

Just my 2 cents, I'm still learning more about the dataframe API every day.

@Jolanrensen
Copy link
Collaborator Author

I'm not saying it should recognize it as a Location, but like a { latitude:0.0, longitude:0.0 }. But I see your point to be able to read it as "value". It's a valid point!

@2x2xplz
Copy link

2x2xplz commented Nov 1, 2022

It's interesting, a is actually represented by 3 columns -- name, gps/latitude and gps/longitude. .toDataFrame() breaks down Location into its two properties and then it breaks down the Gps type into its two properties and creates nested columns for each. That's actually pretty cool. dataFrameOf() is simpler, it just creates a single value per column, so in this case that value's type is Gps. But it doesn't perform the extra step of creating nested columns from Gps.

image

image

@Jolanrensen
Copy link
Collaborator Author

Exactly! That's why I thought it would (or should) behave the same. Unless you specifically want it to be a Value Column

@nikitinas
Copy link
Contributor

nikitinas commented Dec 20, 2022

Currently .toDataFrame allows to configure objects traversal, including depth, but dataFrameOf doesn't. If dataFrameOf unfolds all objects marked with @DataShema into ColumnGroups by default, there should be an option to fallback to ValueColumns, because there is no universal way to pack column groups back into objects (although it can be supported for data classes).

So, I agree with suggestion to unpack @DataSchema objects by default, but let's add some DSL for dataFrameOf configuration to make it flexible:

@DataSchema
data class A(val value: Int)

@DataSchema
data class B(val c: C)

@DataSchema
data class C(val value: Int)

val b: AnyFrame = dataFrameOf("a","b")(
    A(1), B(C(2)),
    A(3), null,
) {
  depth = 1 // default depth of object traversal: unfold only top-level objects
  unfold<C>() // force unfolding of `C`
  preserve<A>() // prevent `A` from unfolding
}

// ⌌-------------------------------------------------⌍
// |  | a:A        |         b:{c:{value: Int}}      |
// |--|------------|---------------------------------|
// | 0| A(value=1) | { c: {value:2}}                 |
// | 1| A(value=3) | { c: {value: null}}             |
// ⌎-------------------------------------------------⌏

Default depth will be null that means that all @DataSchema objects should be unfolded and all other objects should be preserved.
depth = 0 means that no objects should be unfolded and @DataSchema annotation should be ignored.

@nikitinas
Copy link
Contributor

nikitinas commented Dec 20, 2022

.toDataFrame and unfold should have the same behaviour and configuration DSLs

@pacher
Copy link
Contributor

pacher commented Jan 13, 2023

Just my +1 for configuration option.

Unless you specifically want it to be a Value Column

But how can I achieve this if it just unfolds everything by default?

Use case: I have a data class representing a set of parameters for a measurement. So first I want to group results by parameters, but later after some processing I would like to unfold parameter set to distinct columns of parameters. So it make sense to me to annotate my parameter set with DataSchema.

+1 @nikitinas I can always call unfold on a column, but I can't fold it back if it is unfolded automagically for me.

@Jolanrensen Jolanrensen added this to the 0.10.0 milestone Jan 18, 2023
@Jolanrensen Jolanrensen added enhancement New feature or request invalid This doesn't seem right labels Jan 18, 2023
@Jolanrensen Jolanrensen modified the milestones: 0.10.0, 0.11.0 Apr 6, 2023
@zaleslaw zaleslaw modified the milestones: 0.11.0, Backlog Apr 25, 2023
@zaleslaw zaleslaw removed the invalid This doesn't seem right label Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request research
Projects
None yet
Development

No branches or pull requests

5 participants