Add progress tracking and memory estimation for JDBC module #1571

zaleslaw · 2025-11-13T15:12:27Z

This commit introduces progress tracking functionality for the JDBC module, including both simple and detailed modes, which are configurable via system properties.

Added new tests, configuration class (JdbcConfig), and memory estimation utilities to ensure efficient handling of large datasets.

Closes #455

This commit introduces progress tracking functionality for the JDBC module, including both simple and detailed modes configurable via system properties. Added new tests, configuration class (`JdbcConfig`), and memory estimation utilities to ensure efficient handling of large datasets.

Copilot

Pull Request Overview

This PR introduces comprehensive progress tracking and memory estimation functionality for the JDBC module, enabling safer and more observable database query operations.

Key Changes:

Adds configurable progress tracking (simple and detailed modes) via system properties
Implements memory estimation utilities to predict data loading requirements before full query execution
Introduces JdbcSafeDataLoading API for automatic memory limit management with configurable actions (throw, warn, or apply limit)

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
JdbcSafeLoadingTest.kt	Tests for safe loading functionality including memory limit handling and callbacks
JdbcProgressTest.kt	Tests for progress tracking configuration via system properties
JdbcProgressOutputTest.kt	Tests for verifying progress and warning log output
readJdbc.kt	Integrates progress tracker into the core data loading pipeline
progressTracker.kt	Implements three progress tracker variants: NoOp, Simple, and Detailed with memory estimation
MemoryEstimator.kt	Provides memory estimation using sample rows and COUNT(*) queries
MemoryEstimate.kt	Data class for memory estimation results with utility methods
JdbcSafeLoading.kt	Public API for safe JDBC loading with automatic memory checking and limit application
JdbcConfig.kt	Configuration object for JDBC module behavior via system properties and environment variables

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-13T15:18:08Z

dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/MemoryEstimator.kt

+private fun formatBytes(bytes: Long): String = when {
+    bytes < 1024 -> "$bytes bytes"
+    bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0)
+    bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0))
+    else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0))
+}


The formatBytes function is duplicated across three files: progressTracker.kt (line 296), MemoryEstimator.kt (line 203), and JdbcSafeLoading.kt (line 414). This function should be extracted to a common utility file to follow the DRY (Don't Repeat Yourself) principle and ensure consistency.

Suggested change

private fun formatBytes(bytes: Long): String = when {

bytes < 1024 -> "$bytes bytes"

bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0)

bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0))

else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0))

}

Copilot · 2025-11-13T15:18:08Z

dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/JdbcSafeLoading.kt

+private fun formatBytes(bytes: Long): String = when {
+    bytes < 1024 -> "$bytes bytes"
+    bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0)
+    bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0))
+    else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0))
+}


The formatBytes function is duplicated across three files: progressTracker.kt (line 296), MemoryEstimator.kt (line 203), and JdbcSafeLoading.kt (line 414). This function should be extracted to a common utility file to follow the DRY (Don't Repeat Yourself) principle and ensure consistency.

Suggested change

private fun formatBytes(bytes: Long): String = when {

bytes < 1024 -> "$bytes bytes"

bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0)

bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0))

else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0))

}

Copilot · 2025-11-13T15:18:09Z

dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/MemoryEstimator.kt

+        dbType: DbType,
+    ): Long {
+        return try {
+            connection.prepareStatement("SELECT COUNT(*) FROM $tableName").use { stmt ->


The tableName is directly interpolated into the SQL query string without proper identifier quoting. This could lead to SQL injection vulnerabilities or query failures for table names with special characters. Consider using dbType.quoteIdentifier(tableName) or constructing the query through dbType.buildSelectTableQueryWithLimit() and then wrapping it with COUNT(*), similar to how it's done in estimateQueryRowCount.

Suggested change

connection.prepareStatement("SELECT COUNT(*) FROM $tableName").use { stmt ->

connection.prepareStatement("SELECT COUNT(*) FROM ${dbType.quoteIdentifier(tableName)}").use { stmt ->

Copilot · 2025-11-13T15:18:09Z

dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/progressTracker.kt

+
+    private fun estimateClobSize(clob: Clob): Long {
+        return try {
+            clob.length() * 2L


The estimateClobSize function doesn't limit the CLOB length like estimateBlobSize does (which uses coerceAtMost(10 * 1024 * 1024)). This could result in excessively large memory estimates for very large CLOBs. Consider adding a similar constraint: clob.length().coerceAtMost(10 * 1024 * 1024) * 2L

Suggested change

clob.length() * 2L

clob.length().coerceAtMost(10 * 1024 * 1024) * 2L

Copilot · 2025-11-13T15:18:09Z

dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/JdbcSafeLoading.kt

+ * val df = JdbcSafeDataLoading.load {
+ *     maxMemoryGb = 2.0
+ *     onExceed = ExceedAction.THROW
+ *
+ *     onEstimate = { estimate ->
+ *         println("Estimated: ${estimate.humanReadable}")
+ *     }
+ *
+ *     readSqlTable(config, "huge_table")
+ * }
+ * ```


The example code in the documentation is incorrect. The configuration properties (maxMemoryGb, onExceed, onEstimate) should be set in a configure lambda parameter, not inside the main block. The correct usage should be:

val df = JdbcSafeDataLoading.load( configure = { maxMemoryGb = 2.0 onExceed = ExceedAction.THROW onEstimate = { estimate -> println("Estimated: ${estimate.humanReadable}") } } ) { readSqlTable(config, "huge_table") }

Suggested change

* val df = JdbcSafeDataLoading.load {

* maxMemoryGb = 2.0

* onExceed = ExceedAction.THROW

*

* onEstimate = { estimate ->

* println("Estimated: ${estimate.humanReadable}")

* }

*

* readSqlTable(config, "huge_table")

* }

* ```

* val df = JdbcSafeDataLoading.load(

* configure = {

* maxMemoryGb = 2.0

* onExceed = ExceedAction.THROW

* onEstimate = { estimate ->

* println("Estimated: ${estimate.humanReadable}")

* }

* }

* ) {

* readSqlTable(config, "huge_table")

* }

Copilot · 2025-11-13T15:18:10Z

...e-jdbc/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/h2/tracking/JdbcSafeLoadingTest.kt

+                """.trimIndent()
+            )
+
+            // Insert 10000 rows (~10KB each = ~100MB total)


The comment incorrectly states "~10KB each = ~100MB total" but the code creates 1KB per row (VARCHAR(1000) with 1000 characters). For 10000 rows at 1KB each, the total should be approximately 10MB, not 100MB.

Suggested change

// Insert 10000 rows (~10KB each = ~100MB total)

// Insert 10000 rows (~1KB each = ~10MB total)

Copilot · 2025-11-13T15:18:10Z

dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/progressTracker.kt

+private fun formatBytes(bytes: Long): String = when {
+    bytes < 1024 -> "$bytes bytes"
+    bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0)
+    bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0))
+    else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0))
+}


The formatBytes function is duplicated across three files: progressTracker.kt (line 296), MemoryEstimator.kt (line 203), and JdbcSafeLoading.kt (line 414). This function should be extracted to a common utility file to follow the DRY (Don't Repeat Yourself) principle and ensure consistency.

zaleslaw requested a review from Copilot November 13, 2025 15:12

Copilot started reviewing on behalf of zaleslaw November 13, 2025 15:13 View session

Copilot finished reviewing on behalf of zaleslaw November 13, 2025 15:16

Copilot AI reviewed Nov 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add progress tracking and memory estimation for JDBC module #1571

Add progress tracking and memory estimation for JDBC module #1571

zaleslaw commented Nov 13, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Copilot AI Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	connection.prepareStatement("SELECT COUNT(*) FROM $tableName").use { stmt ->
	connection.prepareStatement("SELECT COUNT(*) FROM ${dbType.quoteIdentifier(tableName)}").use { stmt ->

	clob.length() * 2L
	clob.length().coerceAtMost(10 * 1024 * 1024) * 2L

	// Insert 10000 rows (~10KB each = ~100MB total)
	// Insert 10000 rows (~1KB each = ~10MB total)

Add progress tracking and memory estimation for JDBC module #1571

Are you sure you want to change the base?

Add progress tracking and memory estimation for JDBC module #1571

Conversation

zaleslaw commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zaleslaw commented Nov 13, 2025 •

edited

Loading