Optimize Iterable<Map>.toDataFrame conversion #1635

Merged
koperagen merged 1 commit into master from maps-to-dataframe on Dec 12, 2025

@koperagen
Collaborator

@koperagen koperagen commented Dec 11, 2025

Currently, every Map -> DataRow conversion hits a lot of heavy reflection calls, such as type.isSubtypeOf<AnyRow?>() and the same check for AnyFrame.
Collecting the values first and then inferring the type for all of them at once gives a visible improvement.

17 s -> 1.5 s
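
The idea can be sketched with the standard library alone: gather the keys in first-seen order, fill one value list per key, then look at each column's values once instead of inspecting every row. This is only a minimal sketch and not the code from this PR; `collectColumns` and `inferKind` are hypothetical names, and `inferKind` is a simplified stand-in for the library's real type inference.

```kotlin
// Hypothetical helper: turns row-oriented maps into column-oriented lists.
// Keys are collected in first-seen order; a row that lacks a key contributes null.
fun collectColumns(rows: List<Map<String, Any?>>): Map<String, List<Any?>> {
    val allKeys = linkedSetOf<String>()
    for (row in rows) allKeys.addAll(row.keys)

    val columns = LinkedHashMap<String, MutableList<Any?>>()
    for (key in allKeys) columns[key] = ArrayList(rows.size)

    for (row in rows) {
        for ((key, values) in columns) values.add(row[key])
    }
    return columns
}

// Hypothetical stand-in for type inference: runs once per column, not once per row.
fun inferKind(values: List<Any?>): String {
    val classes = values.filterNotNull().map { it::class }.toSet()
    val base = if (classes.size == 1) classes.first().simpleName ?: "Any" else "Any"
    return if (values.any { it == null }) "$base?" else base
}

fun main() {
    val rows = listOf(
        mapOf("a" to 1, "b" to "x"),
        mapOf("a" to 2, "c" to true),
    )
    for ((name, values) in collectColumns(rows)) {
        println("$name: ${inferKind(values)} = $values")
    }
}
```

With this layout the number of type checks grows with the number of columns rather than with rows times columns, which is presumably where most of the saving comes from.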

Code I used for testing:

val rowCount: Int = 1_000_000
val columnCount: Int = 10
val columns = (0 until columnCount).map { "col$it" }
val maps: List<Map<String, Any?>> = (0 until rowCount).map { rowIdx ->
    columns.associate { col ->
        col to when (columns.indexOf(col) % 4) {
            0 -> rowIdx
            1 -> "value_$rowIdx"
            2 -> rowIdx * 1.5
            3 -> rowIdx % 2 == 0
            else -> null
        }
    }
}

maps.toDataFrame()
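
The PR does not say how the timings were taken. One simple way to reproduce such a measurement is a small wall-clock helper like the one below; `timed` is a hypothetical name, and a single run like this includes JVM warm-up, so it is only good for rough before/after comparisons.

```kotlin
// Hypothetical helper: runs `block` once, prints the wall-clock time, returns the result.
fun <T> timed(label: String, block: () -> T): T {
    val start = System.nanoTime()
    val result = block()
    val elapsedMs = (System.nanoTime() - start) / 1_000_000
    println("$label took $elapsedMs ms")
    return result
}

fun main() {
    // Stand-in workload so the sketch runs without the library.
    // With DataFrame on the classpath this would be:
    // timed("toDataFrame") { maps.toDataFrame() }
    val total = timed("stand-in workload") { (1..1_000_000).sumOf { it.toLong() } }
    println(total)
}
```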

Fixes #90

… Map->Row, collect values directly into columns
@koperagen koperagen added this to the 1.0.0-Beta5 milestone Dec 11, 2025
@koperagen koperagen self-assigned this Dec 11, 2025
@koperagen koperagen added the performance label (Something related to how fast the library can handle data) Dec 11, 2025
val list = asList()
if (list.isEmpty()) return DataFrame.empty()

val allKeys = linkedSetOf<String>()
Collaborator

:o TIL what a linked HashSet is

Collaborator

though, it seems mutableSetOf produces the same thing in Kotlin :) so I thought all sets behaved this way. Still it's nice to be expressive :D
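
A quick check of that observation, as a minimal sketch: `keyOrder` is a hypothetical helper that collects map keys with whichever set factory is passed in. Kotlin documents `mutableSetOf` and `linkedSetOf` as preserving iteration order, while `hashSetOf` gives no such guarantee.

```kotlin
// Hypothetical helper: collects the keys of all rows using the given set factory.
fun keyOrder(rows: List<Map<String, Any?>>, newSet: () -> MutableSet<String>): List<String> {
    val keys = newSet()
    for (row in rows) keys.addAll(row.keys)
    return keys.toList()
}

fun main() {
    val rows = listOf(mapOf("b" to 1, "a" to 2), mapOf("c" to 3, "a" to 4))
    println(keyOrder(rows) { linkedSetOf() })  // first-seen order
    println(keyOrder(rows) { mutableSetOf() }) // same first-seen order
    println(keyOrder(rows) { hashSetOf() })    // order is not guaranteed
}
```

So `linkedSetOf` changes nothing functionally here; it just makes the reliance on insertion order explicit to the reader.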

@Jolanrensen
Collaborator

Jolanrensen commented Dec 12, 2025

I wonder... could we do the same trick with Iterable<DataRow<*>>.toDataFrame()?

Currently, it's only optimized when all DataRows originate from the same DF; otherwise it just calls iterable.map { it.toDataFrame() }.concat() :(

Actually, we can just call map { it.toMap() }.toDataFrame().cast() now XD

@koperagen
Collaborator Author

> Actually, we can just call map { it.toMap() }.toDataFrame().cast() now XD

Lol. How the tables have turned! Yes, I'm pleasantly surprised by how relatively straightforward this is, at least without recursive value conversion. With DataRow we probably need to create nested column groups, so it will be a bit trickier.

@Jolanrensen
Collaborator

Jolanrensen commented Dec 12, 2025

> > Actually, we can just call map { it.toMap() }.toDataFrame().cast() now XD
>
> Lol. How the tables have turned! Yes, I'm pleasantly surprised by how relatively straightforward this is, at least without recursive value conversion. With DataRow we probably need to create nested column groups, so it will be a bit trickier.

Ah yes, of course! Let's do that separately :) I like this recent performance-improvement trend ;P

@koperagen koperagen merged commit a227b94 into master Dec 12, 2025
6 checks passed
@koperagen koperagen deleted the maps-to-dataframe branch March 9, 2026 17:39
Labels

performance Something related to how fast the library can handle data

Development

Successfully merging this pull request may close these issues.

Create DataFrame from list of rows where each row is Map

2 participants