Column creation of @DataSchema instances to produce ColumnGroups #197

Jolanrensen · 2022-11-29T16:06:58Z

Based on #177. See this issue for more details and motivation.

Might be related to dataFrameOf() overloads in #113

With these changes:

dataFrameOf("name", "gps")(
    "Home", Gps(0.0, 0.0),
    "Away", null,
)
// ⌌----------------------------------------------------------⌍
// |  | name:String| gps:{latitude:Double?, longitude:Double?}|
// |--|------------|------------------------------------------|
// | 0|        Home|  { latitude:0.000000, longitude:0.0000...|
// | 1|        Away|                                       { }|
// ⌎----------------------------------------------------------⌏
// 
// name: String
// gps:
//     latitude: Double?
//     longitude: Double?

instead of

// ⌌-------------------------------------------------⌍
// |  | name:String|                         gps:Gps?|
// |--|------------|---------------------------------|
// | 0|        Home| Gps(latitude=0.0, longitude=0.0)|
// | 1|        Away|                             null|
// ⌎-------------------------------------------------⌏
// 
// name: String
// gps: Gps?

Also changes the behavior of columnOf(Gps(...)) to produce a ColumnGroup now. To be specific, if a new column is created from values where all values are either null or the same @DataSchema annotated type instance, a ColumnGroup will be formed instead of the previous ValueColumn.

Would this be acceptable behavior? Better than before? Worse?
@koperagen
@nikitinas

creating a column of @dataschema instances with type guessing will now create a GroupColumn

koperagen · 2022-12-07T18:17:14Z

I reckon convert also gives special treatment to values of types DataRow, DataFrame
i.e.

val df = dataFrameOf("abc")(123)
df.convert { abc }.with { DataRow.read("something.json") }

makes abc a ColumnGroup.

The same goes for

df.convert { abc }.with { DataFrame.read() }

here abc is FrameColumn (DataColumn<DataFrame<>>).

Now, i believe, types which declarations are marked with @DataSchema are mapped by marker extractors in KSP and REPL. I.e.

@DataSchema
class Name(
    val firstName: String
    val lastName: String
)

@DataSchema
class Person(
    val name: Name / DataRow<Name> 
    val names: List<Name> // DataFrame<Name>
)

both return types are identical from markers extractors perpective. It's easier to create values of Name / List.
Now the only operation that considers values of type Name / List as DataRow / DataFrame is toDataFrame.

So if dataFrameOf sees Name as DataRow and makes ColumnGroup from it, should convert do the same?

Jolanrensen · 2022-12-08T14:24:27Z

I didn't completely understand it at first, but I agree. If dataFrameOf() should recognize @DataSchema instances and turn them into DataRow or DataFrame (for lists), then so should convert. If convert uses the columnOf calls this is already what happens using this PR, but I'll need to check that.

Btw. I don't think I've implemented the recognition of List<T> -> DataFrame<T> where T is annotated with @DataSchema. Will need to check that too.

nikitinas · 2022-12-20T08:18:22Z

I think convert should not unfold objects by default, because unfolding may require additional configuration of traversal depth, excludes etc. If we add another DSL for such configuration, convert will be overcomplicated. That's why I suggest to unify dataFrameOf, .toDataFrame and unfold behaviour and configuration, but disable object unfolding in convert as it can be done by separate unfold operation if needed.

In some use cases interfaces marked with @DataSchema should not be unfolded, because they are required as objects for some other APIs. In these cases DataFrame is used not for visual representation, but as an internal storage for data manipulation operations. At the same time there may be another case, when same objects need to be unfolded for better table representation in Jupyter, for example. So, this behaviour should be configurable. I have a lot of such scenarios in my applications of DataFrame.

nikitinas · 2022-12-20T08:31:36Z

See my comment for details

guessColumnType -> createColumnGuessingType.

03492b4

creating a column of @dataschema instances with type guessing will now create a GroupColumn

Jolanrensen added enhancement New feature or request invalid This doesn't seem right labels Nov 29, 2022

Jolanrensen mentioned this pull request Apr 26, 2024

Missing tests for unfold #677

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Column creation of @DataSchema instances to produce ColumnGroups #197

Column creation of @DataSchema instances to produce ColumnGroups #197

Jolanrensen commented Nov 29, 2022

koperagen commented Dec 7, 2022

Jolanrensen commented Dec 8, 2022

nikitinas commented Dec 20, 2022 •

edited

nikitinas commented Dec 20, 2022 •

edited

Column creation of @DataSchema instances to produce ColumnGroups #197

Are you sure you want to change the base?

Column creation of @DataSchema instances to produce ColumnGroups #197

Conversation

Jolanrensen commented Nov 29, 2022

koperagen commented Dec 7, 2022

Jolanrensen commented Dec 8, 2022

nikitinas commented Dec 20, 2022 • edited

nikitinas commented Dec 20, 2022 • edited

nikitinas commented Dec 20, 2022 •

edited

nikitinas commented Dec 20, 2022 •

edited