Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column creation of @DataSchema instances to produce ColumnGroups #197

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Jolanrensen
Copy link
Collaborator

Based on #177. See this issue for more details and motivation.

Might be related to dataFrameOf() overloads in #113

With these changes:

dataFrameOf("name", "gps")(
    "Home", Gps(0.0, 0.0),
    "Away", null,
)
// ⌌----------------------------------------------------------⌍
// |  | name:String| gps:{latitude:Double?, longitude:Double?}|
// |--|------------|------------------------------------------|
// | 0|        Home|  { latitude:0.000000, longitude:0.0000...|
// | 1|        Away|                                       { }|
// ⌎----------------------------------------------------------⌏
// 
// name: String
// gps:
//     latitude: Double?
//     longitude: Double?

instead of

// ⌌-------------------------------------------------⌍
// |  | name:String|                         gps:Gps?|
// |--|------------|---------------------------------|
// | 0|        Home| Gps(latitude=0.0, longitude=0.0)|
// | 1|        Away|                             null|
// ⌎-------------------------------------------------⌏
// 
// name: String
// gps: Gps?

Also changes the behavior of columnOf(Gps(...)) to produce a ColumnGroup now. To be specific, if a new column is created from values where all values are either null or the same @DataSchema annotated type instance, a ColumnGroup will be formed instead of the previous ValueColumn.

Would this be acceptable behavior? Better than before? Worse?
@koperagen
@nikitinas

creating a column of @dataschema instances with type guessing will now create a GroupColumn
@Jolanrensen Jolanrensen added enhancement New feature or request invalid This doesn't seem right labels Nov 29, 2022
@koperagen
Copy link
Collaborator

I reckon convert also gives special treatment to values of types DataRow, DataFrame
i.e.

val df = dataFrameOf("abc")(123)
df.convert { abc }.with { DataRow.read("something.json") }

makes abc a ColumnGroup.

The same goes for

df.convert { abc }.with { DataFrame.read() }

here abc is FrameColumn (DataColumn<DataFrame<>>).

Now, i believe, types which declarations are marked with @DataSchema are mapped by marker extractors in KSP and REPL. I.e.

@DataSchema
class Name(
    val firstName: String
    val lastName: String
)

@DataSchema
class Person(
    val name: Name / DataRow<Name> 
    val names: List<Name> // DataFrame<Name>
)

both return types are identical from markers extractors perpective. It's easier to create values of Name / List.
Now the only operation that considers values of type Name / List as DataRow / DataFrame is toDataFrame.

So if dataFrameOf sees Name as DataRow and makes ColumnGroup from it, should convert do the same?

@Jolanrensen
Copy link
Collaborator Author

I didn't completely understand it at first, but I agree. If dataFrameOf() should recognize @DataSchema instances and turn them into DataRow or DataFrame (for lists), then so should convert. If convert uses the columnOf calls this is already what happens using this PR, but I'll need to check that.

Btw. I don't think I've implemented the recognition of List<T> -> DataFrame<T> where T is annotated with @DataSchema. Will need to check that too.

@nikitinas
Copy link
Contributor

nikitinas commented Dec 20, 2022

I think convert should not unfold objects by default, because unfolding may require additional configuration of traversal depth, excludes etc. If we add another DSL for such configuration, convert will be overcomplicated. That's why I suggest to unify dataFrameOf, .toDataFrame and unfold behaviour and configuration, but disable object unfolding in convert as it can be done by separate unfold operation if needed.

In some use cases interfaces marked with @DataSchema should not be unfolded, because they are required as objects for some other APIs. In these cases DataFrame is used not for visual representation, but as an internal storage for data manipulation operations. At the same time there may be another case, when same objects need to be unfolded for better table representation in Jupyter, for example. So, this behaviour should be configurable. I have a lot of such scenarios in my applications of DataFrame.

@nikitinas
Copy link
Contributor

nikitinas commented Dec 20, 2022

See my comment for details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request invalid This doesn't seem right
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants