#803 Fix possible cases when Hadoop file streams are opened but not closed. #804
Conversation
Walkthrough

The PR optimizes resource management across multiple reader and iterator classes by implementing lazy stream opening in FileStreamer, precomputing recordExtractor options for reuse, and adding proper try/finally blocks to ensure stream closure when processing files.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant FileStreamer
    participant BufferedFSDataInputStream
    participant FileSystem as Hadoop FileSystem
    Client->>FileStreamer: new FileStreamer(filePath)
    Note over FileStreamer: wasOpened = false<br/>bufferedStream = null<br/>fileSize = lazy val
    Client->>FileStreamer: next() or size
    FileStreamer->>FileStreamer: ensureOpened()
    alt Stream not yet opened
        FileStreamer->>FileSystem: getHadoopPath
        FileStreamer->>BufferedFSDataInputStream: create & open
        Note over FileStreamer: wasOpened = true
    end
    FileStreamer->>BufferedFSDataInputStream: read data
    BufferedFSDataInputStream-->>FileStreamer: data
    Client->>FileStreamer: close()
    FileStreamer->>BufferedFSDataInputStream: close()
    Note over BufferedFSDataInputStream: @throws[IOException]
    FileStreamer->>FileStreamer: wasOpened = true
```
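For readers who prefer code to diagrams, here is a minimal sketch of the lazy-opening flow shown above. The field and method names (`wasOpened`, `bufferedStream`, `ensureOpened`) follow the diagram; the class skeleton and the plain `InputStream` type are simplifying assumptions, not the actual FileStreamer implementation.

```scala
import java.io.{Closeable, InputStream}

// Simplified stand-in for FileStreamer over BufferedFSDataInputStream.
class LazyStreamer(openStream: () => InputStream) extends Closeable {
  private var wasOpened = false
  private var bufferedStream: InputStream = _

  // Opens the underlying stream once, and only when data is actually requested.
  private def ensureOpened(): Unit = {
    if (!wasOpened) {
      bufferedStream = openStream()
      wasOpened = true
    }
  }

  def next(numberOfBytes: Int): Array[Byte] = {
    ensureOpened()
    val buf = new Array[Byte](numberOfBytes)
    val bytesRead = bufferedStream.read(buf)
    if (bytesRead <= 0) Array.emptyByteArray else buf.take(bytesRead)
  }

  override def close(): Unit = {
    if (bufferedStream != null) {
      bufferedStream.close()
      bufferedStream = null
    }
    wasOpened = true // a closed instance is never reopened
  }
}
```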
```mermaid
sequenceDiagram
    participant IndexBuilder
    participant RecordExtractor
    participant DataStream
    participant HeaderStream
    IndexBuilder->>DataStream: open
    IndexBuilder->>HeaderStream: open
    IndexBuilder->>RecordExtractor: create(dataStream, headerStream)
    Note over IndexBuilder: try block
    RecordExtractor->>DataStream: read first record
    IndexBuilder->>IndexBuilder: compute offset
    alt Second record exists
        IndexBuilder->>DataStream: open (second)
        IndexBuilder->>HeaderStream: open (second)
        IndexBuilder->>RecordExtractor: create(dataStream2, headerStream2)
        RecordExtractor->>DataStream: read second record
        IndexBuilder->>IndexBuilder: validate match
        IndexBuilder->>DataStream: close (second)
        IndexBuilder->>HeaderStream: close (second)
    end
    Note over IndexBuilder: finally block
    IndexBuilder->>DataStream: close (initial)
    IndexBuilder->>HeaderStream: close (initial)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
112-118: Clarify the `wasOpened` flag assignment in `close()`.

Setting `wasOpened = true` at line 114 inside `close()` appears unusual. If the stream is being closed, it must have been opened first (via `ensureOpened()`), so `wasOpened` should already be `true`. The guard condition `bufferedStream != null && !bufferedStream.isClosed` suggests the stream was opened, making this assignment redundant. If the intent is to mark that we attempted to use the stream, this should be clarified or reconsidered.

Consider this alternative:

```diff
 override def close(): Unit = {
   if (bufferedStream != null && !bufferedStream.isClosed) {
-    wasOpened = true
     bufferedStream.close()
     bufferedStream = null
   }
 }
```

The `wasOpened` flag should already be `true` if we're executing this branch, as `ensureOpened()` is the only method that initializes `bufferedStream` and it sets the flag.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
175-196: Close `headerStream` in `generateIndex` when `recordExtractor` is empty, matching the pattern in `getRecordIterator`.

The `generateIndex` method calls `recordExtractor(0L, dataStream, headerStream)` inline (lines 179, 186) without checking if it returns `None`. When it does return `None`, the `headerStream` is never closed, unlike in `getRecordIterator` (lines 104-106) which explicitly closes it.

Since `IndexGenerator.sparseIndexGenerator` does not receive `headerStream` as a parameter, cleanup responsibility belongs to the caller. Apply the same pattern:

```scala
val recordExtractorOpt = recordExtractor(0L, dataStream, headerStream)

if (recordExtractorOpt.isEmpty) {
  headerStream.close()
}

recordExtractorOpt match {
  case Some(field) => IndexGenerator.sparseIndexGenerator(...)
  case None       => IndexGenerator.sparseIndexGenerator(...)
}
```
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
124-129: Lazy opening implementation is correct for single-threaded usage.

The `ensureOpened()` method correctly defers stream initialization until first use.

Note that this implementation is not thread-safe (no synchronization on the `wasOpened` check-then-act pattern). However, based on the class documentation stating it's "stateful" and "not reusable", concurrent access doesn't appear to be an intended use case.

If thread-safety becomes a concern in the future, consider adding explicit documentation:

```diff
  * @param filePath   String containing the fully qualified path to the file.
  * @param fileSystem Underlying Hadoop file system.
+ * @note This class is not thread-safe and should only be accessed from a single thread.
  * @throws IllegalArgumentException if the file is not found in the underlying file system.
  */
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (4 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- `size` (54-54)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- `close` (112-118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (6)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (2)
43-50: Excellent lazy initialization pattern to prevent connection pool exhaustion.

The deferred opening of the stream until first use directly addresses the PR objective of preventing S3 HTTP connection pool timeouts. The explanatory comments clearly document the rationale.

Note that `fileSize` being lazy means accessing the `size` property will now throw `FileNotFoundException` for non-existent files, which is a behavioral change properly reflected in the updated tests.
73-73: LGTM - correct placement of `ensureOpened()`.

Calling `ensureOpened()` at the start of `next()` ensures the stream is opened exactly when needed, just before the first read operation.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (2)
45-50: Test correctly validates lazy opening behavior.

The updated test properly verifies that accessing `size` on a non-existent file now throws `FileNotFoundException` (instead of during construction). This aligns with the lazy evaluation of `fileSize` in FileStreamer.
52-56: Excellent test coverage for the lazy opening feature.

This new test case is crucial for validating that creating a `FileStreamer` instance for a non-existent file doesn't throw an exception as long as the stream is never actually used. This directly validates the PR's goal of preventing unnecessary stream opening and connection pool exhaustion.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (2)
103-107: Stream lifecycle verification confirms safe cleanup pattern.

The verification shows that all RawRecordExtractor implementations close `headerStream` in their constructor—none retain references to it afterward. This means the code pattern is safe:
- No extractor produced (isEmpty): headerStream closed by VarLenNestedReader (line 106)
- Extractor produced (isDefined): headerStream closed by the extractor's constructor before any iteration
The dual-cleanup concern is unfounded because the stream is either handled by VarLenNestedReader or transferred to the extractor—never both. The implementation correctly addresses the PR objective of preventing stream leaks.
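To make the ownership rule above concrete, here is a small sketch under the same assumptions: the extractor closes the header stream in its constructor; otherwise the reader closes it. The types and names are simplified stand-ins, not the actual Cobrix signatures.

```scala
object HeaderStreamOwnership {
  // Simplified stand-ins for the real stream and extractor types.
  trait HeaderStream { def close(): Unit }

  class HeaderAwareExtractor(headerStream: HeaderStream) {
    // The extractor takes ownership: it closes the header stream in its
    // constructor, so the caller must not close it again.
    headerStream.close()
  }

  def buildExtractor(headerStream: HeaderStream, extractorNeeded: Boolean): Option[HeaderAwareExtractor] = {
    val extractorOpt =
      if (extractorNeeded) Some(new HeaderAwareExtractor(headerStream)) // ownership transferred
      else None

    // No extractor produced: the stream was never handed over, so the caller closes it.
    if (extractorOpt.isEmpty) {
      headerStream.close()
    }

    extractorOpt
  }
}
```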
113-113: Verification confirms iterator signatures are correctly updated.

All constructor signatures match the Option type being passed:

- `VarLenHierarchicalIterator` accepts `rawRecordExtractor: Option[RawRecordExtractor]` at line 47
- `VarLenNestedIterator` accepts `recordExtractor: Option[RawRecordExtractor]` at line 47
- `VRLRecordReader` accepts `recordExtractor: Option[RawRecordExtractor]` at line 44

Both iterators correctly pass the Option parameter downstream to `VRLRecordReader`. No inconsistencies found.
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
44-51: Lazy-open and `wasOpened` semantics are sound, but `next(0)` can still open the stream unnecessarily

The lazy initialization via `ensureOpened()` and the `wasOpened` flag correctly ensure:

- The underlying `BufferedFSDataInputStream` is opened only once.
- After `close()`, the stream is not reopened (non-reusable instance behavior is preserved).
- Unused `FileStreamer` instances (never calling `size`/`totalSize`/`next`) do not open the underlying file at all.

One small behavioral nit is that `next()` calls `ensureOpened()` unconditionally, even when `numberOfBytes <= 0`. In that case you:

- Open the underlying stream (S3 connection) in `ensureOpened()`.
- Immediately return an empty array from the `numberOfBytes <= 0` branch without closing it here.

This is unlikely on current call sites, but it slightly undermines the "never open if the stream is effectively unused" goal and can surprise callers that treat `next(0)` as a no-op.

A low-impact refinement would be to guard the open with the basic sanity checks, e.g.:

```diff
-  override def next(numberOfBytes: Int): Array[Byte] = {
-    ensureOpened()
-    val actualBytesToRead = if (maximumBytes > 0) {
+  override def next(numberOfBytes: Int): Array[Byte] = {
+    if (numberOfBytes <= 0) {
+      return new Array[Byte](0)
+    }
+
+    val actualBytesToRead = if (maximumBytes > 0) {
       Math.min(maximumBytes - byteIndex + startOffset, numberOfBytes).toInt
     } else {
       numberOfBytes
     }

-    if (numberOfBytes <= 0) {
-      new Array[Byte](0)
-    } else if (actualBytesToRead <=0 || bufferedStream == null || bufferedStream.isClosed) {
+    if (actualBytesToRead <= 0) {
+      close()
+      return new Array[Byte](0)
+    }
+
+    ensureOpened()
+    if (bufferedStream == null || bufferedStream.isClosed) {
       close()
       new Array[Byte](0)
     } else {
       // existing read logic
```

This keeps the new lazy-open behavior and non-reusability guarantees while avoiding unnecessary opens on zero-byte requests.
Also applies to: 61-75, 114-130
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
e57a669 to 34d0965 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
25-37: Scaladoc `@throws` no longer matches actual behavior

The Scaladoc still documents `@throws IllegalArgumentException` if the file is not found, but the implementation and tests now rely on Hadoop throwing a `FileNotFoundException` (e.g., via `getContentSummary` or when opening the stream) on first size/read access, not on construction.

Consider updating the Scaladoc to:

- Reference `FileNotFoundException` instead of `IllegalArgumentException`, and
- Clarify that the exception is raised when `size`/`totalSize`/`next` are first invoked, not when the instance is created.
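A possible shape for the updated Scaladoc (the wording is illustrative only; the real parameter list and class description live in FileStreamer.scala):

```scala
/**
  * Provides a streaming view over a Hadoop file, opening the underlying stream lazily.
  *
  * @param filePath   String containing the fully qualified path to the file.
  * @param fileSystem Underlying Hadoop file system.
  * @note The file is not accessed on construction; the underlying stream is opened when
  *       `size`, `totalSize` or `next` is first invoked.
  * @throws java.io.FileNotFoundException if the file does not exist when it is first accessed.
  */
```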
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
215-273: Self-check logic is safer, but secondary streams would benefit from `try`/`finally`

The revised self-check correctly:

- Uses the first extractor to read the first and (optionally) second record,
- Captures the offset of the second record,
- Re-opens new streams at that offset and validates that the second extractor returns the same record,
- Ensures the primary `dataStream`/`headerStream` are always closed via `finally`.

For the secondary streams (`dataStream2`, `headerStream2`), they are only closed at the end of `extractorOpt2.foreach`. If `extractor2.hasNext` or `extractor2.next()` throws unexpectedly, those streams may remain open. Consider wrapping their use in a `try`/`finally` to guarantee closure:

```diff
-      val (dataStream2, headerStream2, _) = getStreams(filePath, offset, endOffset, config)
-      val extractorOpt2 = reader.asInstanceOf[ReaderVarLenNestedReader[_]].recordExtractor(1, dataStream2, headerStream2)
-
-      extractorOpt2.foreach { extractor2 =>
-        if (!extractor2.hasNext) {
-          // ...
-        }
-        // ...
-        dataStream2.close()
-        headerStream2.close()
-      }
+      val (dataStream2, headerStream2, _) = getStreams(filePath, offset, endOffset, config)
+      try {
+        val extractorOpt2 = reader.asInstanceOf[ReaderVarLenNestedReader[_]].recordExtractor(1, dataStream2, headerStream2)
+
+        extractorOpt2.foreach { extractor2 =>
+          if (!extractor2.hasNext) {
+            // ...
+          }
+          // ...
+        }
+      } finally {
+        dataStream2.close()
+        headerStream2.close()
+      }
```

This keeps the self-check logic intact while making resource handling fully robust, including in error paths.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (4 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (4 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
🧰 Additional context used
🧬 Code graph analysis (4)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
- `recordExtractor` (54-89)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- `close` (113-119)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (1)
- `close` (41-47)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
- `recordExtractor` (54-89)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (3)
- `offset` (59-59)
- `next` (73-111)
- `close` (113-119)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/mocks/FixedRecordExtractorNoIndex.scala (3)
- `offset` (36-36)
- `hasNext` (38-38)
- `next` (40-55)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (1)
- `close` (41-47)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
size(55-55)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/reader/memorystream/TestByteStream.scala (1)
- `next` (34-48)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
🔇 Additional comments (4)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (1)
52-88: Precomputing `recordExtractorOpt` and closing unused `headerStream` looks correct

Computing the extractor once and closing `headerStream` when no extractor is created is a good way to avoid leaking header streams while keeping the iterator construction logic unchanged. Passing the cached `recordExtractorOpt` into both iterator variants keeps behavior consistent and avoids redundant work.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (1)
19-55: Tests now align with lazy-open and `FileNotFoundException` behavior

Expecting `FileNotFoundException` on the first `size` access and allowing construction to succeed for an unused non-existent file matches the new lazy-opening semantics in `FileStreamer`. The added test also guards against regressions in that behavior.
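A minimal sketch of what such tests might look like; the spec style, the local-filesystem setup, and the `FileStreamer` constructor arguments here are assumptions for illustration, not the actual FileStreamerSpec code.

```scala
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.scalatest.wordspec.AnyWordSpec

import za.co.absa.cobrix.spark.cobol.source.streaming.FileStreamer

class FileStreamerLazyOpenSpec extends AnyWordSpec {
  private val fs = FileSystem.get(new Configuration()) // local filesystem for the sketch

  "FileStreamer" should {
    "throw FileNotFoundException on the first size access for a missing file" in {
      val streamer = new FileStreamer("/path/that/does/not/exist.dat", fs)
      assertThrows[FileNotFoundException] {
        streamer.size
      }
    }

    "not fail on construction when the stream is never used" in {
      // Creating the instance alone must not open the underlying file.
      new FileStreamer("/path/that/does/not/exist.dat", fs)
    }
  }
}
```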
155-171: Index generation now reliably closes both data and header streams

Wrapping `reader.generateIndex(...)` in a `try`/`finally` that closes both `inputStream` and `headerStream` ensures that Hadoop/S3 streams are released even when index generation fails, which is important for avoiding connection pool exhaustion.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
44-52: Lazy opening and EOF handling in `next`/`close` look consistent with the stream contract

The combination of:

- `wasOpened` + `ensureOpened()` for on-demand creation of `BufferedFSDataInputStream`,
- `lazy val fileSize` for metadata-based sizing that doesn't open the data stream,
- `next(...)` honoring `maximumBytes`, returning:
  - an empty array for non-positive requests or when no bytes remain / the stream is closed,
  - a shorter array and closing the stream when fewer than `numberOfBytes` are available,
- `close()` preventing any subsequent reopen (by setting `wasOpened = true`),

provides predictable, non-reusable stream semantics and ensures underlying Hadoop/S3 resources are closed once the data (or the configured window via `maximumBytes`) has been fully consumed or `close()` is called explicitly.

Also applies to: 73-111, 113-130
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
234-235: Optional: Simplify by removing redundant mid-flow close.

Lines 234-235 close `dataStream` and `headerStream`, which are then closed again in the finally block (lines 271-272). While safe due to idempotent close operations, this pattern is confusing and unnecessary. The finally block already ensures these streams are closed in all execution paths.

Consider removing lines 234-235 and relying solely on the finally block for cleanup:

```diff
           record = extractor.next() // Saving the record to check later
-
-          dataStream.close()
-          headerStream.close()

           // Getting new streams and record extractor that points directly to the second record
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (4 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (4 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala
🧰 Additional context used
🧬 Code graph analysis (5)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- `size` (56-56)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
- `recordExtractor` (54-89)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (3)
- `offset` (60-60)
- `next` (74-113)
- `close` (115-122)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/mocks/FixedRecordExtractorNoIndex.scala (3)
- `offset` (36-36)
- `hasNext` (38-38)
- `next` (40-55)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (1)
- `close` (43-50)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (1)
- `close` (43-50)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- `close` (115-122)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/index/IndexGenerator.scala (2)
- `IndexGenerator` (30-171)
- `sparseIndexGenerator` (32-141)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (1)
- `close` (43-50)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/common/Constants.scala (1)
- `Constants` (20-44)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
- `recordExtractor` (54-89)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/BufferedFSDataInputStream.scala (1)
- `close` (43-50)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
- `close` (115-122)
🔇 Additional comments (5)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala (1)
45-134: LGTM: Lazy opening correctly prevents resource leaks.

The lazy opening mechanism is well-implemented and directly addresses the PR objective:

- Streams are only opened when `next()` is called via `ensureOpened()`
- Unused streams never trigger expensive S3 file operations
- The `wasOpened` flag correctly prevents reopening after `close()`
- `size` and `totalSize` use a lazy `fileSize` without opening the stream

This ensures header streams and other auxiliary streams that may never be read are never opened, preventing the HTTP connection pool exhaustion described in issue #803.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamerSpec.scala (1)
45-56: LGTM: Tests correctly validate lazy opening behavior.

The test updates properly verify the lazy opening semantics:

- Lines 45-50: Accessing `size` on a non-existent file now throws `FileNotFoundException` (when the lazy `fileSize` is evaluated), correctly replacing the previous `IllegalArgumentException` expectation.
- Lines 52-56: The new test confirms that merely constructing a `FileStreamer` for a non-existent file doesn't throw an exception, validating that file access is deferred until actual use.

These tests ensure the lazy opening feature works as intended and prevent regression.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/reader/VarLenNestedReader.scala (1)
56-88: LGTM: Precomputed extractor with early header stream closure.The refactoring optimizes resource usage and prevents leaks:
Line 57: Computing
recordExtractorOptonce at the start avoids redundant evaluations and ensures consistency across both iterator construction paths.Lines 58-60: Closing
headerStreamimmediately when the extractor isNoneis a key improvement—if the header stream won't be used by the iterator, it's released right away rather than held open unnecessarily.Lines 68, 80: Both iterator constructors receive the precomputed extractor, maintaining consistent behavior.
This change directly addresses the PR objective of preventing S3 connection pool exhaustion by ensuring header streams are closed when not needed.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (2)
103-130: LGTM: Consistent extractor precomputation in getRecordIterator.The changes mirror the pattern from the Spark module variant:
- Lines 104-107: Precomputing
recordExtractorOptand immediately closingheaderStreamwhen empty prevents unnecessary resource retention.- Lines 113, 123: Both iterator types receive the precomputed extractor, ensuring consistency and efficiency.
This aligns with the broader PR goal of proper stream lifecycle management to prevent S3 connection pool exhaustion.
174-201: LGTM: Optimized extractor handling in generateIndex.The refactoring improves efficiency and resource management in index generation:
Lines 174-177: Computing
recordExtractorOptonce at the start (withstartingRecordNumber = 0) and immediately closingheaderStreamwhen empty prevents resource leaks during index generation.Lines 184, 195: Both
IndexGenerator.sparseIndexGeneratorcall sites now receive the precomputed extractor, eliminating redundant extractor creation and ensuring consistent behavior across the segmented and non-segmented index generation paths.This completes the consistent pattern of precomputed extractors with early header stream closure across both the cobol-parser and spark-cobol modules.
34d0965 to 5287283 (Compare)
This is especially related to header streams.
Closes #803
Summary by CodeRabbit
New Features

- `size` and `totalSize` properties for file streaming operations.

Bug Fixes

Tests