Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework of dfs functions to recursively (changing return type of recurse-able functions) #363

Merged
merged 36 commits into from
May 22, 2023

Conversation

Jolanrensen
Copy link
Collaborator

@Jolanrensen Jolanrensen commented Apr 22, 2023

Same as #361 but using a different tactic.

Largest changes:

Streamlined behavior of all, cols, groups etc. such that if it's called on

  • a SingleColumn<*> containing a single ColumnGroup (such as DataFrame), the operation will run on the children of that ColumnGroup
  • else, or a normal ColumnSet<*>, it will simply run on the columns inside.
    This is enforced using the new allInternal() function combined with isSingleColumnWithGroup. I think this simply enforces expected behavior, since all tests work with it (and break upon a slight modification of this function) and more importantly, the calls seem to make sense. Let me know what you think.

Introducing TransformableColumnSet and TransformableSingleColumn. A wrapper around the original ones that provide a way to transformResolve as well as the normal resolve. transformResolve takes a ColumnSetTransformer that can get executed before resolving the ColumnSet or SingleColumn. A ColumnSetTransformer contains implementations for both transform and transformSingle, similar to the functions in Utils.kt. Currently, the only implementation of this is RecursivelyTransformer, which is configured and created in the TransformableColumnSet<>.recursively() calls.
Not all ColumnSets are transformable. Functions that allow reconfiguration with a ColumnSetTransformer need to explicitly state that in their return type. In principle, any function that uses transform {} can automatically opt-in to the TransformableColumnSet return-type if that makes sense for them. Currently, I added the ability to cols {}, all(), groups {}, children {}, filter {}, nameContains(), endsWith(), startsWith(), except(), and withoutNulls().
I added TransformableSingleColumn to first {}, last {}, and single {} (which uses singleWithTransformerImpl to pass on the transformer to the right place).
Of course, if other functions are applicable to modification calls like .recursively() I'd gladly hear it.

.recursively() / .rec() is the replacement of dfs {}, dfsOf<>{}, and allDfs(). because the name was confusing. I put deprecations in place and replaced all calls in the library to recursively. Some things to note:

  • dfs {}, called on either a ColumnSet or SingleColumn would exclude the top-level of columns but still include column groups in the end-result, so a proper replacement of that call would be cols {}.recursively() if called on a SingleColumn and cols {}.recursively(includeTopLevel = false) if called on a normal ColumnSet. Of course rec() can be used as shorthand.
  • dfsOf<>(), is replaced similarly to either colsOf<>().rec() or colsOf<>().rec(includeTopLevel = false).
  • allDfs() defaulted includeGroups to false, so be careful, since this is now true. It can be replaced either with all().recursively() or allRecursively() (or allRec()), again taking into account includeTopLevel and includeGroups.

Other smaller changes:

  • Renamed occurrences of "Dfs" where "DataFrames" was meant as such, to avoid confusion
  • Renamed:
    • inner functions named dfs to depthFirstSearch where depth first search was actually meant
    • TreeNode.dfs {} -> TreeNode.allChildren {} (+ other dfs functions)
    • TreeNode.topDfs {} -> TreeNode.topmostChildren {} (+ other topDfs functions)
    • Iterable<ColumnWithPath<*>>.dfs() -> flattenRecursively()
    • Some unavoidable linting in the files I touched
  • Updated korro to 0.1.4
  • Removed ColumnPath.something overloads in ColumnsSelectionDsl since they are handled by SingleColumn<*>
  • Added docs to ColumnSet and SingleColumn
  • Reworked tests for ColumnsSelectionDsl
  • Added recursively tests

@Jolanrensen Jolanrensen added the enhancement New feature or request label Apr 22, 2023
@Jolanrensen Jolanrensen added this to the 0.11.0 milestone Apr 22, 2023
…zed versions on which transformers like "recursively" can be called.
topDfs -> topmostChildren
top -> roots
# Conflicts:
#	examples/jupyter-notebooks/titanic/Titanic.ipynb
@Jolanrensen Jolanrensen marked this pull request as ready for review April 26, 2023 13:16
# Conflicts:
#	core/generated-sources/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/api/Modify.kt
#	core/generated-sources/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/api/TestBase.kt
#	core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/api/Modify.kt
#	core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/api/TestBase.kt
#	docs/StardustDocs/topics/ColumnSelectors.md
@Jolanrensen Jolanrensen requested a review from zaleslaw May 2, 2023 10:36
while (node != null) {
node.data = null
node = node.parent
}
}
val dfs = fullTree.topDfs { it.data != null }
return dfs.map { it.data!!.addPath(it.pathFromRoot()) }
val subtrees = fullTree.topmostChildren { it.data != null }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topmostChildren?
topDFS is a bad name, but topmostChildren leads me to the mist again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would you suggest?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As my doc explains it takes the tree and returns first the children, closest to the root of the tree, where the predicate holds. It doesn't matter whether the children of those nodes also hold, those are ignored. So since we now have allChildren { something == true }, I thought topmostChildren { something == true }. It can also be viewed as subtree-roots where the predicate holds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koperagen Any suggestions? :)

@@ -37,7 +37,7 @@ public fun <T, C> DataFrame<T>.replace(columns: Iterable<ColumnReference<C>>): R

public fun <T> DataFrame<T>.replaceAll(
vararg valuePairs: Pair<Any?, Any?>,
columns: ColumnsSelector<T, *> = { allDfs() },
columns: ColumnsSelector<T, *> = { all().recursively(includeGroups = false) },
Copy link
Collaborator

@koperagen koperagen May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has the default value for includeGroups changed? I would argue that one of the main usages of dfsOf is to convert / replace columns with some type and groups are not needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that it's because of unification of overloads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, by default groups were included in dfs {}, but excluded in allDfs() if I recall correctly. I had to choose a sensible default. But if you use colsOf<Int>().rec(), then includeGroups irrelevant anyways, since Int is not a Group.

Copy link
Collaborator

@koperagen koperagen May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is interesting and a bit confusing. I think this parameter only exists for allDfs. For normal dfs it would be dfs { !it.isColumnGroup() }. Now it's also present for rec, recursively. What if you remove it from them and leave only for allRec with the same default value (true)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.rec() can not only be called on all(), you can also call first { }.rec(includeGroups = false), so I think keeping it in rec/recursively too would be best.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have a function that can give all valueColumns right? If we did we could have valueCols {}.recursively() or valueColsRecursively() but that might get a bit long.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus it's not just ColumnGroups that get included right? also FrameColumns and ValueColumns. So, just all of them. I think that's a good default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koperagen Actually I do want to add valueCols {} and frameCols {} since we already have groups {} as well. Probably in another PR though. Then you can call valueCols().rec() if you want. Sounds okay?

Copy link
Collaborator

@koperagen koperagen May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a nice addition, not 100% related to this problem though. For this allRecursively function with previous default, something like "allLeafs" might fit better. I think it was named like that before and renamed later

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also take it in another direction eventually, similar to colsOf<Type>() have colsOf(vararg ColumnKind) {}.rec(), that might be clearer compared to users having to figure out what a "leaf" means.

# Conflicts:
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/convert.kt
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/describe.kt
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/inferType.kt
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/reorder.kt
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/columns/ColumnWithPath.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/describe.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/inferType.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/reorder.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/columns/ColumnWithPath.kt
@koperagen
Copy link
Collaborator

Another thing that i noticed now is that it's hard to tell what rec does for first { it.type == typeOf<Int> }.rec(). Will it give the same result as dfs { it.type == typeOf<Int> }.first()?

@Jolanrensen
Copy link
Collaborator Author

Jolanrensen commented May 5, 2023

Another thing that i noticed now is that it's hard to tell what rec does for first { it.type == typeOf<Int> }.rec(). Will it give the same result as dfs { it.type == typeOf<Int> }.first()?

Yes, it works the same. You can read it like a sentence: "Select the first column of type Int recursively", which sort of makes sense.
Recursively modifies the call before such that it runs on all columns recursively instead of just on the top-level.

Basically:
allRec().first { it.type == typeOf<Int>() } == first { it.type == typeOf<Int>() }.rec()

@Jolanrensen Jolanrensen mentioned this pull request May 8, 2023
@Jolanrensen Jolanrensen mentioned this pull request May 10, 2023
33 tasks
# Conflicts:
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/rename.kt
#	core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/flatten.kt
#	core/generated-sources/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/move.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/rename.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/flatten.kt
#	core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/move.kt
@Jolanrensen Jolanrensen merged commit c3031c4 into master May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants