-
Notifications
You must be signed in to change notification settings - Fork 76
Add progress tracking and memory estimation for JDBC module #1571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This commit introduces progress tracking functionality for the JDBC module, including both simple and detailed modes configurable via system properties. Added new tests, configuration class (`JdbcConfig`), and memory estimation utilities to ensure efficient handling of large datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces comprehensive progress tracking and memory estimation functionality for the JDBC module, enabling safer and more observable database query operations.
Key Changes:
- Adds configurable progress tracking (simple and detailed modes) via system properties
- Implements memory estimation utilities to predict data loading requirements before full query execution
- Introduces
JdbcSafeDataLoadingAPI for automatic memory limit management with configurable actions (throw, warn, or apply limit)
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| JdbcSafeLoadingTest.kt | Tests for safe loading functionality including memory limit handling and callbacks |
| JdbcProgressTest.kt | Tests for progress tracking configuration via system properties |
| JdbcProgressOutputTest.kt | Tests for verifying progress and warning log output |
| readJdbc.kt | Integrates progress tracker into the core data loading pipeline |
| progressTracker.kt | Implements three progress tracker variants: NoOp, Simple, and Detailed with memory estimation |
| MemoryEstimator.kt | Provides memory estimation using sample rows and COUNT(*) queries |
| MemoryEstimate.kt | Data class for memory estimation results with utility methods |
| JdbcSafeLoading.kt | Public API for safe JDBC loading with automatic memory checking and limit application |
| JdbcConfig.kt | Configuration object for JDBC module behavior via system properties and environment variables |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private fun formatBytes(bytes: Long): String = when { | ||
| bytes < 1024 -> "$bytes bytes" | ||
| bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0) | ||
| bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0)) | ||
| else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0)) | ||
| } |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formatBytes function is duplicated across three files: progressTracker.kt (line 296), MemoryEstimator.kt (line 203), and JdbcSafeLoading.kt (line 414). This function should be extracted to a common utility file to follow the DRY (Don't Repeat Yourself) principle and ensure consistency.
| private fun formatBytes(bytes: Long): String = when { | |
| bytes < 1024 -> "$bytes bytes" | |
| bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0) | |
| bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0)) | |
| else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0)) | |
| } |
| private fun formatBytes(bytes: Long): String = when { | ||
| bytes < 1024 -> "$bytes bytes" | ||
| bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0) | ||
| bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0)) | ||
| else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0)) | ||
| } |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formatBytes function is duplicated across three files: progressTracker.kt (line 296), MemoryEstimator.kt (line 203), and JdbcSafeLoading.kt (line 414). This function should be extracted to a common utility file to follow the DRY (Don't Repeat Yourself) principle and ensure consistency.
| private fun formatBytes(bytes: Long): String = when { | |
| bytes < 1024 -> "$bytes bytes" | |
| bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0) | |
| bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0)) | |
| else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0)) | |
| } |
| dbType: DbType, | ||
| ): Long { | ||
| return try { | ||
| connection.prepareStatement("SELECT COUNT(*) FROM $tableName").use { stmt -> |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tableName is directly interpolated into the SQL query string without proper identifier quoting. This could lead to SQL injection vulnerabilities or query failures for table names with special characters. Consider using dbType.quoteIdentifier(tableName) or constructing the query through dbType.buildSelectTableQueryWithLimit() and then wrapping it with COUNT(*), similar to how it's done in estimateQueryRowCount.
| connection.prepareStatement("SELECT COUNT(*) FROM $tableName").use { stmt -> | |
| connection.prepareStatement("SELECT COUNT(*) FROM ${dbType.quoteIdentifier(tableName)}").use { stmt -> |
|
|
||
| private fun estimateClobSize(clob: Clob): Long { | ||
| return try { | ||
| clob.length() * 2L |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The estimateClobSize function doesn't limit the CLOB length like estimateBlobSize does (which uses coerceAtMost(10 * 1024 * 1024)). This could result in excessively large memory estimates for very large CLOBs. Consider adding a similar constraint: clob.length().coerceAtMost(10 * 1024 * 1024) * 2L
| clob.length() * 2L | |
| clob.length().coerceAtMost(10 * 1024 * 1024) * 2L |
| * val df = JdbcSafeDataLoading.load { | ||
| * maxMemoryGb = 2.0 | ||
| * onExceed = ExceedAction.THROW | ||
| * | ||
| * onEstimate = { estimate -> | ||
| * println("Estimated: ${estimate.humanReadable}") | ||
| * } | ||
| * | ||
| * readSqlTable(config, "huge_table") | ||
| * } | ||
| * ``` |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example code in the documentation is incorrect. The configuration properties (maxMemoryGb, onExceed, onEstimate) should be set in a configure lambda parameter, not inside the main block. The correct usage should be:
val df = JdbcSafeDataLoading.load(
configure = {
maxMemoryGb = 2.0
onExceed = ExceedAction.THROW
onEstimate = { estimate ->
println("Estimated: ${estimate.humanReadable}")
}
}
) {
readSqlTable(config, "huge_table")
}| * val df = JdbcSafeDataLoading.load { | |
| * maxMemoryGb = 2.0 | |
| * onExceed = ExceedAction.THROW | |
| * | |
| * onEstimate = { estimate -> | |
| * println("Estimated: ${estimate.humanReadable}") | |
| * } | |
| * | |
| * readSqlTable(config, "huge_table") | |
| * } | |
| * ``` | |
| * val df = JdbcSafeDataLoading.load( | |
| * configure = { | |
| * maxMemoryGb = 2.0 | |
| * onExceed = ExceedAction.THROW | |
| * onEstimate = { estimate -> | |
| * println("Estimated: ${estimate.humanReadable}") | |
| * } | |
| * } | |
| * ) { | |
| * readSqlTable(config, "huge_table") | |
| * } |
| """.trimIndent() | ||
| ) | ||
|
|
||
| // Insert 10000 rows (~10KB each = ~100MB total) |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment incorrectly states "~10KB each = ~100MB total" but the code creates 1KB per row (VARCHAR(1000) with 1000 characters). For 10000 rows at 1KB each, the total should be approximately 10MB, not 100MB.
| // Insert 10000 rows (~10KB each = ~100MB total) | |
| // Insert 10000 rows (~1KB each = ~10MB total) |
| private fun formatBytes(bytes: Long): String = when { | ||
| bytes < 1024 -> "$bytes bytes" | ||
| bytes < 1024 * 1024 -> "%.1f KB".format(bytes / 1024.0) | ||
| bytes < 1024 * 1024 * 1024 -> "%.1f MB".format(bytes / (1024.0 * 1024.0)) | ||
| else -> "%.2f GB".format(bytes / (1024.0 * 1024.0 * 1024.0)) | ||
| } |
Copilot
AI
Nov 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formatBytes function is duplicated across three files: progressTracker.kt (line 296), MemoryEstimator.kt (line 203), and JdbcSafeLoading.kt (line 414). This function should be extracted to a common utility file to follow the DRY (Don't Repeat Yourself) principle and ensure consistency.
This commit introduces progress tracking functionality for the JDBC module, including both simple and detailed modes, which are configurable via system properties.
Added new tests, configuration class (
JdbcConfig), and memory estimation utilities to ensure efficient handling of large datasets.Closes #455