#805 Enable the option to cache VRL indexes for better performance when the same files are processed multiple times #806
Conversation
… same files need to be processed multiple times.
…ut the option to README.
Walkthrough

Added an in-memory index caching option for variable-length record processing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Parser as CobolParametersParser
    participant Reader
    participant IndexBuilder
    participant Cache as indexCache
    User->>Parser: parse parameters (enable_index_cache=true)
    Parser->>Parser: validate flags (enable_indexes vs enable_index_cache)
    Parser->>Reader: set isIndexCachingAllowed=true
    Reader->>IndexBuilder: buildIndex(files, reader, sqlContext, cachingAllowed=true)
    alt File indexes in cache
        IndexBuilder->>Cache: lookup by file path
        Cache-->>IndexBuilder: return cached SparseIndexEntry[]
    else Not cached
        IndexBuilder->>IndexBuilder: generate indexes for nonCachedFiles
        IndexBuilder->>Cache: store sorted SparseIndexEntry[] per file path
    end
    IndexBuilder->>IndexBuilder: createIndexRDD(combine cached + new)
    IndexBuilder-->>Reader: return RDD[SparseIndexEntry]
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ 5 checks passed
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
37-50: Index cache key and scope risk silently wrong indexes across readers/sessions

The global `indexCache: ConcurrentHashMap[String, Array[SparseIndexEntry]]` is keyed only by `filePath`. Index contents, however, depend on the `VarLenReader` configuration (copybook, RDW settings, `fileStartOffset`/`fileEndOffset`, record extractor, occurs mappings, etc.) and on the file contents. Reusing cached entries for the same path but a different reader configuration, or after the file has changed, will silently produce incorrect indexes, and the cache is shared JVM-wide across Spark sessions.

Consider:

- Including a reader/file signature in the cache key (e.g., `case class IndexCacheKey(filePath: String, readerSignature: String, startOffset: Long, endOffset: Long, fileSize: Long)`), where `readerSignature` is derived from the relevant `ReaderParameters` fields.
- Optionally scoping the cache per Spark application (e.g., by including `sparkContext.applicationId` in the key) or providing an explicit clear/eviction mechanism, so long-lived drivers that touch many distinct files don't accumulate unbounded index state.
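The composite-key idea above can be sketched in plain JVM terms: a small value class whose `equals`/`hashCode` cover the path, reader signature, offsets, and file size makes a lookup under a different reader configuration miss the cache. The `IndexCacheKey` name and its fields follow the review's suggestion and are illustrative, not the project's actual API.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical composite cache key, per the review suggestion: two readers
// that differ in any field produce distinct keys, hence distinct cache entries.
final class IndexCacheKey {
    final String filePath;
    final String readerSignature; // e.g. a hash of the relevant ReaderParameters fields
    final long startOffset;
    final long endOffset;
    final long fileSize;

    IndexCacheKey(String filePath, String readerSignature,
                  long startOffset, long endOffset, long fileSize) {
        this.filePath = filePath;
        this.readerSignature = readerSignature;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.fileSize = fileSize;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IndexCacheKey)) return false;
        IndexCacheKey k = (IndexCacheKey) o;
        return startOffset == k.startOffset && endOffset == k.endOffset
                && fileSize == k.fileSize
                && filePath.equals(k.filePath)
                && readerSignature.equals(k.readerSignature);
    }

    @Override
    public int hashCode() {
        return Objects.hash(filePath, readerSignature, startOffset, endOffset, fileSize);
    }
}

public class CacheKeyDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<IndexCacheKey, long[]> cache = new ConcurrentHashMap<>();
        IndexCacheKey a = new IndexCacheKey("/data/file1.dat", "sigA", 0, 0, 1024);
        IndexCacheKey b = new IndexCacheKey("/data/file1.dat", "sigB", 0, 0, 1024);
        cache.put(a, new long[]{0, 512});
        // Same path, different reader signature: no false cache hit.
        System.out.println(cache.containsKey(a)); // true
        System.out.println(cache.containsKey(b)); // false
    }
}
```

In Scala the same effect falls out of a `case class` key for free, since case classes generate structural `equals`/`hashCode`.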
🧹 Nitpick comments (7)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
173-196: Good test coverage for index caching, but consider verifying cache hits

The test correctly verifies that cached and non-cached results are identical. However, it doesn't verify that caching actually occurred (e.g., that the second read used the cache).
Consider adding a verification mechanism to confirm the cache was used on the second read. This could be done by:
- Checking execution time (second read should be faster)
- Adding a cache hit counter that can be inspected
- Logging cache hits and checking logs
This would provide stronger confidence that the caching mechanism is working as intended, not just producing correct results.
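The second bullet (a cache hit counter) is easy to make testable; a counter alongside the cache lets an assertion confirm the second identical read was served from memory. The `CountingCache` wrapper and its `hits()` method below are invented for illustration and are not part of Cobrix.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical cache wrapper that counts hits, so a test can assert
// that a repeated read did not recompute the value.
class CountingCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final AtomicLong hits = new AtomicLong();

    V getOrCompute(K key, Function<K, V> compute) {
        V cached = map.get(key);
        if (cached != null) {
            hits.incrementAndGet(); // cache hit
            return cached;
        }
        V value = compute.apply(key); // cache miss: compute and store
        map.put(key, value);
        return value;
    }

    long hits() {
        return hits.get();
    }
}

public class CacheHitDemo {
    public static void main(String[] args) {
        CountingCache<String, int[]> cache = new CountingCache<>();
        cache.getOrCompute("/data/f.dat", p -> new int[]{1, 2, 3}); // miss
        cache.getOrCompute("/data/f.dat", p -> new int[]{1, 2, 3}); // hit
        System.out.println(cache.hits()); // 1
    }
}
```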
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (3)
120-174: Caching currently bypassed for locality-aware indexing and may surprise users

`buildIndex` passes `cachingAllowed` only into the non-local variable-length path:

```scala
case reader: VarLenReader if reader.isIndexGenerationNeeded && localityParams.improveLocality && isDataLocalitySupported(fs) =>
  // locality path – cachingAllowed ignored
case reader: VarLenReader =>
  // non-local path – cachingAllowed used
  buildIndexForVarLenReader(filesList, reader, sqlContext, cachingAllowed)
```

With defaults (`improve_locality = true` on HDFS), `enable_index_cache` will effectively be ignored and indexes will still be recomputed on every run for locality-aware indexing.

If this is intentional, it's worth documenting that caching only applies when locality is disabled or unsupported. Otherwise, consider extending `buildIndexForVarLenReaderWithFullLocality` to reuse cached `SparseIndexEntry`s (and only recompute block locations) so caching works for the common HDFS+locality case as well.
127-174: Minor improvements to caching path: typos and defensive guard

Two small nits and a safety tweak:

- Comment typo on Line 127: `"chached and teh"` → `"cached and the"`.
- `createIndexRDD(cachedIndexes ++ newIndexes, sqlContext)` assumes there is at least one index; if that ever isn't true, `parallelize` will be called with `numPartitions = 0` and fail at runtime.

You can make this more robust and keep behavior unchanged for non-empty inputs by guarding the partition count:

```diff
 private def createIndexRDD(indexes: Array[SparseIndexEntry], sqlContext: SQLContext): RDD[SparseIndexEntry] = {
   val indexCount = indexes.length
-
-  val numPartitions = Math.min(indexCount, Constants.maxNumPartitions)
+  val numPartitions = Math.max(1, Math.min(indexCount, Constants.maxNumPartitions))
```
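The clamp suggested in the diff is plain integer arithmetic and can be sanity-checked standalone; the constant 200 below merely stands in for `Constants.maxNumPartitions`, whose real value lives in the project and is an assumption here.

```java
public class PartitionClampDemo {
    // Assumed stand-in for Constants.maxNumPartitions.
    static final int MAX_NUM_PARTITIONS = 200;

    // Mirrors the guarded expression from the review diff:
    // at least 1 partition even for an empty index array, capped above.
    static int numPartitions(int indexCount) {
        return Math.max(1, Math.min(indexCount, MAX_NUM_PARTITIONS));
    }

    public static void main(String[] args) {
        System.out.println(numPartitions(0));    // 1  (empty input no longer yields 0 partitions)
        System.out.println(numPartitions(5));    // 5
        System.out.println(numPartitions(5000)); // 200 (capped)
    }
}
```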
381-388: Index RDD no longer cached in Spark; confirm this is acceptable

The old non-local path used `repartitionIndexes`, which did `indexRDD.repartition(...).cache()`. The new helper returns a plain `parallelize`d RDD without caching:

```scala
private def createIndexRDD(indexes: Array[SparseIndexEntry], sqlContext: SQLContext): RDD[SparseIndexEntry] = {
  val indexCount = indexes.length

  val numPartitions = Math.min(indexCount, Constants.maxNumPartitions)

  sqlContext.sparkContext.parallelize(indexes, numPartitions)
}
```

If downstream code runs multiple actions on `indexes` within a single read, this may re-scan or reshuffle more than before. If you still want Spark-level reuse for a single scan, consider either:

- Calling `.cache()` here, or
- Reusing `repartitionIndexes` (possibly generalized) to keep the existing behavior.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilderSpec.scala (1)
67-107: Tests cover new signatures but not cache-hit behavior

The updated tests correctly pass `cachingAllowed = false`, preserving existing behavior. However, none of the specs exercise the new caching path (e.g., repeated indexing of the same file with `cachingAllowed = true`).

Consider adding a focused test that:

- Calls `buildIndexForVarLenReader` twice on the same file with `cachingAllowed = true`,
- Asserts that the second call does not re-run `generateIndexEntry` (via a spy/metric) and returns the same index entries, and
- Verifies that cached indexes are remapped to the correct `fileId` when `FileWithOrder.order` changes.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1)
65-71: Derive `indexCachingAllowed` from reader properties instead of wiring it as a constructor arg

The new `indexCachingAllowed: Boolean` parameter is immediately passed through to `IndexBuilder.buildIndex(...)`, while the same information is already conceptually part of the reader configuration (`ReaderParameters.isIndexCachingAllowed`).

To reduce duplication and the risk of call-site mismatches, consider:

- Dropping the explicit `indexCachingAllowed` constructor parameter, and
- Inside `CobolRelation`, deriving it from the reader, e.g.:

```scala
private lazy val indexes: RDD[SparseIndexEntry] = {
  val cachingAllowed = cobolReader match {
    case v: VarLenReader => v.getReaderProperties.isIndexCachingAllowed
    case _ => false
  }
  IndexBuilder.buildIndex(filesList, cobolReader, sqlContext, cachingAllowed)(localityParams)
}
```

This keeps the public API leaner and tightly couples the caching decision to the actual reader config.
Also applies to: 77-78
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
120-123: Align `enable_index_cache` semantics with `enable_indexes`

The new plumbing for `enable_index_cache` looks consistent and defaults are sensible (`false`). One edge case to watch:

```scala
isUsingIndex = params.getOrElse(PARAM_ENABLE_INDEXES, "true").toBoolean,
isIndexCachingAllowed = params.getOrElse(PARAM_ENABLE_INDEX_CACHE, "false").toBoolean,
```

This allows `enable_index_cache = true` while `enable_indexes = false`, yielding `isIndexGenerationNeeded = false` but `isIndexCachingAllowed = true` in `ReaderParameters`. Unless all downstream call sites explicitly gate caching on `isIndexGenerationNeeded`, this can lead to unnecessary index building and cache population for readers that won't actually use indexes.

Two possible tightenings:

- Define caching as a derived flag:

```scala
val useIndexes = params.getOrElse(PARAM_ENABLE_INDEXES, "true").toBoolean
val useIndexCache = useIndexes && params.getOrElse(PARAM_ENABLE_INDEX_CACHE, "false").toBoolean
...
isUsingIndex = useIndexes,
isIndexCachingAllowed = useIndexCache,
```

- Optionally, fail fast or at least log a warning when `enable_index_cache = true` but `enable_indexes = false`, since caching has no effect in that configuration.

Also applies to: 385-386, 421-422, 508-509
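The derived-flag gating can be checked with a few lines of plain JVM code. The option names match the PR (`enable_indexes`, `enable_index_cache`); the `resolveFlags` helper and its defaults are illustrative, not the parser's actual API.

```java
import java.util.Map;

public class DerivedFlagDemo {
    // Mirrors the review's suggested gating: the cache flag only takes
    // effect when indexes are enabled at all.
    static boolean[] resolveFlags(Map<String, String> params) {
        boolean useIndexes = Boolean.parseBoolean(params.getOrDefault("enable_indexes", "true"));
        boolean useIndexCache = useIndexes
                && Boolean.parseBoolean(params.getOrDefault("enable_index_cache", "false"));
        return new boolean[]{useIndexes, useIndexCache};
    }

    public static void main(String[] args) {
        // Cache requested but indexes disabled: caching is derived to false.
        boolean[] flags = resolveFlags(Map.of("enable_indexes", "false", "enable_index_cache", "true"));
        System.out.println(flags[0] + " " + flags[1]); // false false

        // Indexes left at their default (true), cache enabled: caching is honored.
        flags = resolveFlags(Map.of("enable_index_cache", "true"));
        System.out.println(flags[0] + " " + flags[1]); // true true
    }
}
```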
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- README.md (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (4 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/VariableLengthParameters.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (4 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (2 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelationSpec.scala (3 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilderSpec.scala (5 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelationSpec.scala (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
  - sqlContext (126-126)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
  - buildIndex (51-69)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (1)
  - withTempBinFile (60-69)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilderSpec.scala (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
  - buildIndex (51-69)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/LocalityParameters.scala (3)
  - LocalityParameters (21-21), LocalityParameters (23-30), extract (24-29)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
  - buildIndex (51-69)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/internal/Logging.scala (1)
  - logger (33-38)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (2)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
  - getReaderProperties (367-456)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/common/Constants.scala (1)
  - Constants (20-81)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1)
  - getOrElse (78-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (9)
README.md (1)
1605-1605: LGTM! Clear documentation for the new caching feature.

The documentation clearly explains when to use this option and what it does.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (2)
86-86: Logging level change: WARNING → INFO

The logging level for end-of-stream messages has been changed from WARNING to INFO. This is appropriate since reaching the end of a stream is normal behavior, not a warning condition. However, this change appears unrelated to the index caching feature.
Was this logging change intentional, or should it be in a separate commit?
101-101: Consistent logging level change

Similar to line 86, this end-of-stream logging is now at INFO level, which is more appropriate for normal operational messages.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelationSpec.scala (1)
67-68: Test updated correctly for new parameter

The `indexCachingAllowed = false` parameter is added consistently across all test cases, maintaining existing behavior while supporting the new API.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)
102-102: LGTM! Backward-compatible API extension

The new `isIndexCachingAllowed` parameter is added with a safe default of `false`, ensuring backward compatibility. The caching feature is opt-in, which is appropriate for a performance optimization that changes runtime behavior.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/VariableLengthParameters.scala (1)
37-37: LGTM! Parameter properly documented and positioned

The new `isIndexCachingAllowed` parameter is well-documented and logically placed after the related `isUsingIndex` parameter. Note that unlike `ReaderParameters`, this parameter doesn't have a default value, requiring callers to explicitly provide it.

Also applies to: 60-60
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)
61-64: LGTM! Proper parameter extraction with safe defaults

The `indexCachingAllowed` flag is correctly extracted from variable-length parameters with an appropriate default of `false` when variable-length parameters are not present.
69-70: Constructor call updated correctly

The `indexCachingAllowed` parameter is properly threaded through to the `CobolRelation` constructor.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (1)
221-221: LGTM! IndexBuilder call updated for caching support

The `buildIndex` call is correctly updated to pass `readerParameters.isIndexCachingAllowed`, enabling the index caching mechanism when configured.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (5 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (4 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala
🧰 Additional context used
🧬 Code graph analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (2)
  - getOrElse (78-81), contains (66-69)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (4)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (2)
166-170: LGTM - Cached index retrieval correctly updates fileId.

The logic correctly retrieves cached indexes and updates the `fileId` to match the current file ordering. The `.map(ind => ind.copy(fileId = f.order))` ensures each cached index entry is updated without modifying the original cached array.

Note: For files with very large numbers of index entries, this mapping operation could have performance implications. Consider monitoring if this becomes a bottleneck.
381-388: LGTM - Clean helper method for RDD creation.

The new `createIndexRDD` helper method is well-structured and correctly handles edge cases (empty index arrays) by ensuring at least 1 partition with `Math.max(1, ...)`.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
122-122: LGTM - Parameter constant properly defined.

The new `PARAM_ENABLE_INDEX_CACHE` constant is well-placed in the "Indexed multisegment file processing" section and follows the existing naming conventions.
385-385: LGTM - Parameter correctly propagated through configuration layers.

The `isIndexCachingAllowed` parameter is properly threaded through the configuration hierarchy:

- Parsed from parameters with a safe default of `false` (line 508)
- Propagated from `VariableLengthParameters` to `ReaderParameters` (line 421)
- Fallback case correctly defaults to `false` (line 385)

Also applies to: 421-421, 508-508
...-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (resolved)
```scala
 * In a nutshell, ideally, there will be as many partitions as there are indexes.
 */
private[cobol] object IndexBuilder extends Logging {
  private val indexCache = new ConcurrentHashMap[String, Array[SparseIndexEntry]]()
```
Add cache eviction policy and file modification tracking.

The global indexCache has no eviction policy and will grow unboundedly as new files are processed, potentially causing memory exhaustion in long-running Spark applications. Additionally, the cache is keyed solely by file path without tracking file modification times, so if a file is modified, stale cached indexes will be returned.

Consider these improvements:

- Add a bounded cache with LRU eviction (e.g., using Guava's `CacheBuilder` or Caffeine)
- Include file modification timestamp or checksum in the cache key
- Add a configurable maximum cache size or TTL
- Consider adding cache statistics logging

Example with size-bounded cache:
Example with size-bounded cache:
+import com.google.common.cache.{Cache, CacheBuilder}
+import java.util.concurrent.TimeUnit
+
-private val indexCache = new ConcurrentHashMap[String, Array[SparseIndexEntry]]()
+private val indexCache: Cache[String, Array[SparseIndexEntry]] = CacheBuilder.newBuilder()
+ .maximumSize(1000) // configurable
+ .expireAfterAccess(1, TimeUnit.HOURS) // configurable
+ .build()Then update get/put operations accordingly.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
around line 49, the global indexCache is an unbounded ConcurrentHashMap which
can grow indefinitely and does not account for file modifications; replace it
with a bounded LRU/TTL cache (e.g., Guava CacheBuilder or Caffeine) configured
via a new configurable max size and optional TTL, and change cache keys to
include file modification timestamp (or checksum) so modified files miss the
cache; update all get/put usages to the new cache API, add configuration hooks
for size/TTL, and optionally log basic cache statistics (hits/evictions) for
monitoring.
```diff
+    // Splitting between files for which indexes are cached and the list of files for which indexes are not cached
+    val cachedFiles = if (cachingAllowed) {
+      filesList.filter(f => indexCache.containsKey(f.filePath))
+    } else {
+      Array.empty[FileWithOrder]
+    }
+
-    val filesRDD = sqlContext.sparkContext.parallelize(filesList, filesList.length)
+    val nonCachedFiles = filesList.diff(cachedFiles)

-    val indexRDD = filesRDD.mapPartitions(
-      partition => {
-        partition.flatMap(row => {
-          generateIndexEntry(row, sconf.value, reader)
-        })
-      }).cache()
-
-    repartitionIndexes(indexRDD)
+    // Getting indexes for files for which indexes are not in the cache
+    val newIndexes = if (nonCachedFiles.length > 0) {
+      if (reader.getReaderProperties.enableSelfChecks) {
+        selfCheckForIndexCompatibility(reader, nonCachedFiles.head.filePath, conf)
+      }
+
+      val filesRDD = sqlContext.sparkContext.parallelize(nonCachedFiles, nonCachedFiles.length)
+      filesRDD.mapPartitions(
+        partition => {
+          partition.flatMap(row => {
+            generateIndexEntry(row, sconf.value, reader)
+          })
+        }).collect()
+    } else {
+      Array.empty[SparseIndexEntry]
+    }
+
+    // Storing new indexes in the cache
+    if (cachingAllowed && newIndexes.length > 0) {
+      newIndexes.groupBy(_.fileId).foreach { case (fileId, indexEntries) =>
+        val filePathOpt = filesList.find(_.order == fileId).map(_.filePath)
+
+        filePathOpt.foreach { filePath =>
+          logger.info(s"Index stored to cache for file: $filePath.")
+          indexCache.put(filePath, indexEntries.sortBy(_.offsetFrom))
+        }
+      }
+    }
+
+    // Getting indexes for files for which indexes are in the cache
+    val cachedIndexes = cachedFiles.flatMap { f =>
+      logger.info("Index fetched from cache for file: " + f.filePath)
+      indexCache.get(f.filePath)
+        .map(ind => ind.copy(fileId = f.order))
+    }
+
+    // Creating the final RDD with all indexes
+    createIndexRDD(cachedIndexes ++ newIndexes, sqlContext)
+  }
```
🧩 Analysis chain
Verify cache key includes file modification tracking.
The caching logic filters files using only containsKey(f.filePath) without considering whether the file has been modified since the index was cached. This means if a VRL file is modified (records added/removed/changed), the stale cached index will be used, leading to incorrect data processing.
Run the following script to check if file modification time is tracked elsewhere in the codebase:
If modification time is not tracked, the cache key should be changed from just the file path to include modification metadata, e.g., s"${filePath}_${modificationTime}".
🏁 Script executed:

```bash
#!/bin/bash
# Search for file modification time tracking in the codebase
rg -n "getModificationTime|lastModified|FileStatus" --type scala -C 3
```

Length of output: 8054
Now I'll examine the IndexBuilder.scala file to understand the cache structure and identify whether modification time is tracked:
🏁 Script executed:

```bash
#!/bin/bash
# Examine the IndexBuilder.scala file around the cache logic
cat spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala | head -180 | tail -80
```

Length of output: 3010

🏁 Script executed:

```bash
#!/bin/bash
# Find FileWithOrder class definition
fd -t f "FileWithOrder" spark-cobol/src/main/scala/
```

Length of output: 89

🏁 Script executed:

```bash
#!/bin/bash
# Examine FileWithOrder class definition
cat spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/types/FileWithOrder.scala
```

Length of output: 778

🏁 Script executed:

```bash
#!/bin/bash
# Find indexCache declaration and initialization
rg -n "indexCache" spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala -B 5 -A 2 | head -50
```

Length of output: 1494

🏁 Script executed:

```bash
#!/bin/bash
# Search for any file validation or modification checks in the index building process
rg -n "getModificationTime|lastModified|getLen|FileStatus" spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/ -C 3
```

Length of output: 867
Add file modification time tracking to cache key to prevent stale index usage.
The cache uses only filePath as the key and does not consider whether files have been modified. If a VRL file is modified (records added/removed/changed) between index generations, the stale cached index will be returned without any validation, leading to incorrect data processing.
Suggested fix:
Include file modification time in the cache key:
```scala
// Line 129 - change from:
filesList.filter(f => indexCache.containsKey(f.filePath))
// to:
filesList.filter(f => {
  val fileStatus = fileSystem.getFileStatus(new Path(f.filePath))
  val cacheKey = s"${f.filePath}_${fileStatus.getModificationTime}"
  indexCache.containsKey(cacheKey)
})

// Line 160 - update cache storage to use same key format:
val fileStatus = fileSystem.getFileStatus(new Path(filePath))
val cacheKey = s"${filePath}_${fileStatus.getModificationTime}"
indexCache.put(cacheKey, indexEntries.sortBy(_.offsetFrom))

// Line 168 - update cache retrieval to use same key format:
val fileStatus = fileSystem.getFileStatus(new Path(f.filePath))
val cacheKey = s"${f.filePath}_${fileStatus.getModificationTime}"
indexCache.get(cacheKey)
```

Alternatively, store the modification time alongside cached data and validate it on retrieval.
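The path-plus-mtime keying can be demonstrated with the JDK's file API. The Hadoop `FileSystem`/`getModificationTime` calls are what the review suggests for the real code; `java.nio.file` merely stands in here, and the `cacheKey` helper is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class MtimeCacheKeyDemo {
    // Builds a cache key that changes whenever the file is modified,
    // so stale cached indexes naturally miss the cache.
    static String cacheKey(Path file) throws IOException {
        long mtime = Files.getLastModifiedTime(file).toMillis();
        return file.toString() + "_" + mtime;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("vrl", ".dat");
        try {
            String keyBefore = cacheKey(tmp);
            // Simulate a modification by bumping the timestamp.
            Files.setLastModifiedTime(tmp,
                FileTime.fromMillis(Files.getLastModifiedTime(tmp).toMillis() + 1000));
            String keyAfter = cacheKey(tmp);
            System.out.println(keyBefore.equals(keyAfter)); // false: modified file gets a new key
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```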
🤖 Prompt for AI Agents
In
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
around lines 127-174, the cache currently keys only by filePath which allows
stale indexes to be used after file modifications; update the cache logic to
include file modification time in the cache key (or store and validate
modification time alongside the cached entry). Specifically: when selecting
cachedFiles (line ~129) call the FileSystem to get the file status and build a
cacheKey using filePath + modificationTime and check
indexCache.containsKey(cacheKey); when storing newIndexes (around lines
~160-168) get the file status for the filePath, build the same cacheKey and put
the sorted indexEntries under that cacheKey; when fetching cachedIndexes (around
lines ~168+) compute the cacheKey from f.filePath and its modificationTime and
read indexCache.get(cacheKey) (or alternatively store modification time with
entries and validate it on retrieval). Use
org.apache.hadoop.fs.FileSystem.getFileStatus(new Path(...)).getModificationTime
for the timestamp.
Force-pushed from 6fb06ee to 5e5dd76 (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
37-49: Cache implementation needs eviction policy and file modification tracking.

The previous review comments on this file identified two major issues with the cache implementation that remain unaddressed:

1. Unbounded cache growth: The `ConcurrentHashMap` has no size limit or eviction policy, which can exhaust memory in long-running Spark applications as more unique files are processed.
2. No file modification tracking: Cache keys are based solely on file path without checking modification times. If a VRL file is modified between reads, stale cached indexes will be returned, leading to incorrect data processing.
Please refer to the detailed suggestions in the previous review comments (lines 49 and 127-174 from past reviews) which recommend:
- Using a bounded cache with LRU eviction (e.g., Guava's CacheBuilder or Caffeine)
- Including file modification timestamp in the cache key
- Adding configurable cache size limits and TTL
- Adding cache statistics logging
Based on learnings, past review feedback should be addressed before merging.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (5 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (4 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
🧰 Additional context used
🧬 Code graph analysis (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (2)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
  - buildIndex (51-69)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
  - getReaderProperties (367-456)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (5)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
21-21: LGTM!

The import correctly brings in the parameter constants needed for the error message assertion in the new validation test.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1)
68-69: LGTM!

Formatting adjustment improves readability by separating the implicit SQLContext parameter.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (3)
51-69: LGTM! The signature change correctly adds the `cachingAllowed` parameter and appropriately propagates it to `buildIndexForVarLenReader`. The routing logic correctly applies caching only to the variable-length reader path, where index caching is implemented.
127-173: Caching logic flow is correct, subject to addressing past review concerns. The implementation correctly:
- Partitions files into cached and non-cached sets (lines 128-134)
- Generates indexes only for non-cached files (lines 137-151)
- Stores newly generated indexes grouped by fileId and keyed by filePath (lines 154-162)
- Retrieves cached indexes and updates fileId to match current file order (lines 166-170)
- Combines both sets into the final RDD (line 173)
The fileId update (line 169) is particularly important because cached entries retain the fileId from their original indexing, which may differ from the current file order.
However, the major issues flagged in previous review comments (unbounded cache growth and missing file modification tracking) still need to be addressed before this implementation is production-ready.
Based on learnings, past review feedback on cache architecture should be resolved.
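The re-stamping concern above can be shown with a minimal stand-in for `SparseIndexEntry` (the case class, field values, and demo object here are illustrative, not the real Cobrix types): cached entries keep the `fileId` assigned when the index was first built, so a later run with a different file ordering must overwrite it.

```scala
// Minimal stand-in: cached entries carry the fileId from the run that built
// them; a subsequent run must re-stamp them with the current file order.
case class IndexEntry(fileId: Int, offsetFrom: Long)

object FileIdRemapDemo extends App {
  // Entries cached when this file was fourth (fileId = 3) in the listing.
  val cached = Array(IndexEntry(fileId = 3, offsetFrom = 0L),
                     IndexEntry(fileId = 3, offsetFrom = 1024L))

  val currentOrder = 0 // in this run the file is listed first
  val remapped = cached.map(_.copy(fileId = currentOrder))

  println(remapped.map(_.fileId).toList)     // List(0, 0)
  println(remapped.map(_.offsetFrom).toList) // List(0, 1024)
}
```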
381-388: LGTM! The helper method correctly creates an RDD from an array of indexes with appropriate partition sizing (capped at `maxNumPartitions` with a minimum of 1) and helpful logging.
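As a small illustration of that sizing rule (the exact expression is not quoted in the review, so this formula is inferred from the description; the object and method names are illustrative):

```scala
object PartitionCountDemo extends App {
  // Inferred rule: one partition per index entry, capped at maxNumPartitions,
  // with a floor of 1 so an empty index still yields a valid partition count.
  def indexPartitionCount(indexCount: Int, maxNumPartitions: Int): Int =
    math.max(1, math.min(indexCount, maxNumPartitions))

  println(indexPartitionCount(0, 200))    // 1   (floor applies)
  println(indexPartitionCount(50, 200))   // 50
  println(indexPartitionCount(5000, 200)) // 200 (cap applies)
}
```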
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (outdated comment, resolved)
```scala
"work for data with offsets and indexes and index cache" in {
  withTempBinFile("record_length_mapping", ".tmp", dataWithFileOffsets) { tempFile =>
    val expected = """{"SEG_ID":"A","TEXT":"123"},{"SEG_ID":"B","TEXT":"123456"},{"SEG_ID":"C","TEXT":"1234567"}"""

    val df = spark.read
      .format("cobol")
      .option("copybook_contents", copybook)
      .option("record_format", "F")
      .option("record_length_field", "SEG-ID")
      .option("file_start_offset", 1)
      .option("file_end_offset", 2)
      .option("input_split_records", "2")
      .option("enable_index_cache", "true")
      .option("pedantic", "true")
      .option("record_length_map", """{"A":4,"B":7,"C":8}""")
      .load(tempFile)

    val actualInitial = df.orderBy("SEG_ID").toJSON.collect().mkString(",")
    val actualCached = df.orderBy("SEG_ID").toJSON.collect().mkString(",")

    assert(actualInitial == expected)
    assert(actualCached == expected)
  }
}
```
Test may not properly verify index caching behavior.
The test reads from the same DataFrame instance twice (lines 191-192). In Spark, since DataFrames are lazy and the execution plan is determined at planning time (when the DataFrame is created), both collect() calls likely use the same plan without re-triggering index generation. This means the test may not actually verify that cached indexes are retrieved and used on subsequent reads.
To properly test caching, create a new DataFrame for the second read:
```diff
   val actualInitial = df.orderBy("SEG_ID").toJSON.collect().mkString(",")
-  val actualCached = df.orderBy("SEG_ID").toJSON.collect().mkString(",")
+
+  // Create a new DataFrame to trigger index retrieval from cache
+  val df2 = spark.read
+    .format("cobol")
+    .option("copybook_contents", copybook)
+    .option("record_format", "F")
+    .option("record_length_field", "SEG-ID")
+    .option("file_start_offset", 1)
+    .option("file_end_offset", 2)
+    .option("input_split_records", "2")
+    .option("enable_index_cache", "true")
+    .option("pedantic", "true")
+    .option("record_length_map", """{"A":4,"B":7,"C":8}""")
+    .load(tempFile)
+
+  val actualCached = df2.orderBy("SEG_ID").toJSON.collect().mkString(",")
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```scala
"work for data with offsets and indexes and index cache" in {
  withTempBinFile("record_length_mapping", ".tmp", dataWithFileOffsets) { tempFile =>
    val expected = """{"SEG_ID":"A","TEXT":"123"},{"SEG_ID":"B","TEXT":"123456"},{"SEG_ID":"C","TEXT":"1234567"}"""

    val df = spark.read
      .format("cobol")
      .option("copybook_contents", copybook)
      .option("record_format", "F")
      .option("record_length_field", "SEG-ID")
      .option("file_start_offset", 1)
      .option("file_end_offset", 2)
      .option("input_split_records", "2")
      .option("enable_index_cache", "true")
      .option("pedantic", "true")
      .option("record_length_map", """{"A":4,"B":7,"C":8}""")
      .load(tempFile)

    val actualInitial = df.orderBy("SEG_ID").toJSON.collect().mkString(",")

    // Create a new DataFrame to trigger index retrieval from cache
    val df2 = spark.read
      .format("cobol")
      .option("copybook_contents", copybook)
      .option("record_format", "F")
      .option("record_length_field", "SEG-ID")
      .option("file_start_offset", 1)
      .option("file_end_offset", 2)
      .option("input_split_records", "2")
      .option("enable_index_cache", "true")
      .option("pedantic", "true")
      .option("record_length_map", """{"A":4,"B":7,"C":8}""")
      .load(tempFile)

    val actualCached = df2.orderBy("SEG_ID").toJSON.collect().mkString(",")

    assert(actualInitial == expected)
    assert(actualCached == expected)
  }
}
```
🤖 Prompt for AI Agents
In
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
around lines 174 to 197, the test uses the same DataFrame instance for both
reads so both collect() calls use the same planned execution and do not verify
index caching behavior; change the test to create a new DataFrame (re-run
spark.read with the same options and tempFile) for the second read (or otherwise
materialize and unpersist then re-create the DataFrame) so the second collect()
runs a fresh read that will exercise retrieving from the index cache and allow
valid assertions.
```scala
"throw an exception when index caching is requested while indexes are turned off" in {
  withTempBinFile("record_length_mapping", ".tmp", dataWithFileOffsets) { tempFile =>
    val expected = """{"SEG_ID":"A","TEXT":"123"},{"SEG_ID":"B","TEXT":"123456"},{"SEG_ID":"C","TEXT":"1234567"}"""

    val ex = intercept[IllegalArgumentException] {
      spark.read
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("record_format", "F")
        .option("record_length_field", "SEG-ID")
        .option("enable_indexes", "false")
        .option("enable_index_cache", "true")
        .option("pedantic", "true")
        .option("record_length_map", """{"A":4,"B":7,"C":8}""")
        .load(tempFile)
    }

    assert(ex.getMessage == s"When '$PARAM_ENABLE_INDEXES' = false, '$PARAM_ENABLE_INDEX_CACHE' cannot be true.")
  }
}
```
Remove unused variable.
Line 201 declares `expected` but it's never used in the test. The test only validates the exception message, making this variable unnecessary.
Apply this diff:

```diff
 "throw an exception when index caching is requested while indexes are turned off" in {
   withTempBinFile("record_length_mapping", ".tmp", dataWithFileOffsets) { tempFile =>
-    val expected = """{"SEG_ID":"A","TEXT":"123"},{"SEG_ID":"B","TEXT":"123456"},{"SEG_ID":"C","TEXT":"1234567"}"""
-
     val ex = intercept[IllegalArgumentException] {
```

📝 Committable suggestion

```scala
"throw an exception when index caching is requested while indexes are turned off" in {
  withTempBinFile("record_length_mapping", ".tmp", dataWithFileOffsets) { tempFile =>
    val ex = intercept[IllegalArgumentException] {
      spark.read
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("record_format", "F")
        .option("record_length_field", "SEG-ID")
        .option("enable_indexes", "false")
        .option("enable_index_cache", "true")
        .option("pedantic", "true")
        .option("record_length_map", """{"A":4,"B":7,"C":8}""")
        .load(tempFile)
    }

    assert(ex.getMessage == s"When '$PARAM_ENABLE_INDEXES' = false, '$PARAM_ENABLE_INDEX_CACHE' cannot be true.")
  }
}
```
🤖 Prompt for AI Agents
In
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
around lines 199 to 218, there is an unused local val expected declared (line
~201) that is never referenced; remove that declaration to clean up the test and
avoid dead code, keeping the rest of the test unchanged so it still only
intercepts and asserts the IllegalArgumentException message.
Force-pushed: 5e5dd76 to e4fec96
Actionable comments posted: 0
♻️ Duplicate comments (4)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1)
76-76: LGTM - Critical issue from previous review has been resolved. The call now correctly passes `isIndexCachingAllowed` to control caching behavior, rather than the previously incorrect `isIndexGenerationNeeded`. This ensures the `enable_index_cache` parameter works as intended.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
957-959: LGTM - Error message corrected from previous review. The validation correctly prevents enabling index caching when indexes are disabled, and the error message now properly references both `PARAM_ENABLE_INDEXES` and `PARAM_ENABLE_INDEX_CACHE`.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (2)
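A minimal sketch of that guard (the option names come from the error message quoted in the tests; the object and method names here are illustrative, not the actual parser code):

```scala
// Hypothetical sketch of the described validation: index caching is only
// meaningful when index generation itself is enabled.
object IndexOptionValidation {
  def validateIndexOptions(enableIndexes: Boolean, enableIndexCache: Boolean): Unit =
    if (!enableIndexes && enableIndexCache)
      throw new IllegalArgumentException(
        "When 'enable_indexes' = false, 'enable_index_cache' cannot be true.")
}

object ValidationDemo extends App {
  import IndexOptionValidation._

  validateIndexOptions(enableIndexes = true, enableIndexCache = true)   // ok
  validateIndexOptions(enableIndexes = false, enableIndexCache = false) // ok
  try {
    validateIndexOptions(enableIndexes = false, enableIndexCache = true)
  } catch {
    case e: IllegalArgumentException => println(e.getMessage)
  }
}
```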
49-49: Cache lacks eviction policy and will grow unboundedly. The global `indexCache` has no size limit or TTL, which can cause memory exhaustion in long-running Spark applications as more files are processed. Consider using a bounded cache with LRU eviction (e.g., Guava's `CacheBuilder` or Caffeine) or adding a configurable maximum cache size.
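One dependency-free way to bound such a cache is a JDK `LinkedHashMap` in access order with `removeEldestEntry`. This sketch is illustrative only: the class name, `maxEntries`, and the `String` values are stand-ins for Cobrix's real cache of `SparseIndexEntry` arrays.

```scala
import java.util.{Collections, LinkedHashMap => JLinkedHashMap, Map => JMap}

// Illustrative LRU-bounded, thread-safe cache keyed by file path.
class BoundedIndexCache[V](maxEntries: Int) {
  private val underlying =
    new JLinkedHashMap[String, V](16, 0.75f, /* accessOrder = */ true) {
      // Evict the least-recently-accessed entry once the cap is exceeded.
      override def removeEldestEntry(eldest: JMap.Entry[String, V]): Boolean =
        size() > maxEntries
    }
  private val cache = Collections.synchronizedMap(underlying)

  def put(key: String, value: V): Unit = cache.put(key, value)
  def get(key: String): Option[V] = Option(cache.get(key))
  def size: Int = cache.size()
}

object BoundedIndexCacheDemo extends App {
  val cache = new BoundedIndexCache[String](maxEntries = 2)
  cache.put("/data/file1.dat", "index1")
  cache.put("/data/file2.dat", "index2")
  cache.get("/data/file1.dat")           // touch file1; file2 becomes eldest
  cache.put("/data/file3.dat", "index3") // evicts file2

  println(cache.get("/data/file2.dat"))  // None
  println(cache.get("/data/file1.dat"))  // Some(index1)
  println(cache.size)                    // 2
}
```

Guava's `CacheBuilder` or Caffeine would additionally offer time-based expiry, which this sketch does not attempt.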
127-174: Cache keys lack file modification tracking, risking stale index usage. The cache uses only `filePath` as the key (lines 129, 160, 168) without considering file modification time. If a VRL file is modified between runs, the cached index will be returned without validation, potentially causing incorrect data processing. Include file modification time in the cache key or validate modification time on cache retrieval:
```scala
// When filtering cached files (around line 129):
val fs = new Path(filesList.head.filePath).getFileSystem(conf)
val cachedFiles = if (cachingAllowed) {
  filesList.filter { f =>
    val fileStatus = fs.getFileStatus(new Path(f.filePath))
    val cacheKey = s"${f.filePath}_${fileStatus.getModificationTime}"
    indexCache.containsKey(cacheKey)
  }
} else {
  Array.empty[FileWithOrder]
}

// When storing (around line 160):
val fileStatus = fs.getFileStatus(new Path(filePath))
val cacheKey = s"${filePath}_${fileStatus.getModificationTime}"
indexCache.put(cacheKey, indexEntries.sortBy(_.offsetFrom))

// When retrieving (around line 168):
val fileStatus = fs.getFileStatus(new Path(f.filePath))
val cacheKey = s"${f.filePath}_${fileStatus.getModificationTime}"
indexCache.get(cacheKey).map(ind => ind.copy(fileId = f.order))
```
🧹 Nitpick comments (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
835-837: Validation added appears unrelated to PR objectives. This validation ensures `record_length_map` requires `record_length_field`, which is reasonable but doesn't appear to be part of the index caching feature described in the PR objectives. Was this intentionally included?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (6 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1 hunk)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (4 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
🧰 Additional context used
🧬 Code graph analysis (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (2)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1): `buildIndex` (51-69)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1): `getReaderProperties` (367-456)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/common/Constants.scala (1): `Constants` (20-81)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (2): `getOrElse` (78-81), `contains` (66-69)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (5)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
381-388: LGTM - Clean helper method for index RDD creation. The `createIndexRDD` helper centralizes index RDD assembly with proper partitioning logic, making the code more maintainable.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (4)
122-122: LGTM - New parameter constant added. The `PARAM_ENABLE_INDEX_CACHE` constant is properly defined and follows the existing naming conventions.
385-385: LGTM - Correct default for fallback case. Setting `isIndexCachingAllowed = false` in the fallback `VariableLengthParameters` is the safe default.
421-421: LGTM - Parameter propagation is correct. The `isIndexCachingAllowed` field is properly propagated from `varLenParams` to `ReaderParameters`.
508-508: LGTM - Parameter parsing implemented correctly. The `enable_index_cache` option is parsed with a safe default of `false`, ensuring caching is opt-in.
Closes #805
Summary by CodeRabbit
New Features
Improvements
Tests