
#841 Add support for file offsets in in-place processors #844

Merged
yruslan merged 4 commits into master from
feature/841-add-support-for-file-offsets-in-processors
May 4, 2026

Conversation

@yruslan
Collaborator

@yruslan yruslan commented May 4, 2026

Closes #841

Summary by CodeRabbit

  • New Features

    • Support for file start/end offset options so processing can skip bytes at file start or end and respect per-file bounds across processing modes.
  • Documentation

    • Changelog updated for upcoming 2.10.4 release.
  • Tests

    • Added tests validating offsets for both in-place and variable-length processing and verifying generated binary/JSON outputs.

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 091a1dc6-1067-46ae-b6e4-9e779c88db51

📥 Commits

Reviewing files that changed from the base of the PR and between a0c7207 and e6ce57c.

📒 Files selected for processing (3)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala
✅ Files skipped from review due to trivial changes (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala

Walkthrough

This PR threads file_start_offset and file_end_offset through the processing pipeline: FSStream now accepts offsets and enforces bounded reads; CobolProcessor and SparkCobolProcessor propagate readerParameters to per-file streamers; FileUtils adds a helper to compute Hadoop read sizes; tests and the README were updated.

Changes

File Offset Processing Support

Layer / File(s) Summary
Stream API Shape
cobol-parser/src/main/scala/.../reader/stream/FSStream.scala
FSStream constructor now accepts fileStartOffset and fileEndOffset, computes effectiveSize, and reports size/totalSize relative to offsets.
Stream Read Enforcement
cobol-parser/src/main/scala/.../reader/stream/FSStream.scala
next(numberOfBytes) caps reads to remaining effective bytes, closes and returns empty array at effective end; copyStream() preserves offsets; added skipFully helper.
Processor Wiring
cobol-parser/src/main/scala/.../processor/CobolProcessor.scala
CobolProcessorLoader.save passes readerParameters.fileStartOffset/fileEndOffset into FSStream; builder helper methods (getCobolSchema, getReaderParameters, getOptions) had their private[processor] qualifier removed.
Spark Integration
spark-cobol/src/main/scala/.../SparkCobolProcessor.scala
save() builds a CobolProcessor to capture readerParameters; getFileProcessorRdd/processListOfFiles now accept readerParameters; per-file processing computes maximumBytes from offsets and constructs FileStreamer(inputFile, startOffset, maximumBytes).
Hadoop Read Size Utility
spark-cobol/src/main/scala/.../utils/FileUtils.scala
Added getHadoopFileReadSize(...) to resolve filesystem, detect compression, and compute a clamped read size based on start/end offsets.
Index Builder Integration
spark-cobol/src/main/scala/.../source/index/IndexBuilder.scala
Replaced manual read-size logic with FileUtils.getHadoopFileReadSize(...) when fileEndOffset > 0.
Tests & Docs
cobol-parser/src/test/.../CobolProcessorBuilderSuite.scala, spark-cobol/src/test/.../SparkCobolProcessorSuite.scala, README.md
Added tests for file_start_offset=3 / file_end_offset=2 for both InPlace and ToVariableLength strategies; corrected a test title; added changelog entry for v2.10.4 referencing PR #841.
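The bounded-read behavior summarized in the table can be illustrated with a standalone sketch. This is not the actual Cobrix `FSStream` code; only the names `fileStartOffset`, `fileEndOffset`, `effectiveSize`, and `next` come from the walkthrough above, everything else is assumed:

```scala
import java.io.ByteArrayInputStream

// Simplified model of the offset handling described above (not the real
// FSStream): reads begin after fileStartOffset and stop before the last
// fileEndOffset bytes of the file.
class BoundedStream(data: Array[Byte], fileStartOffset: Long, fileEndOffset: Long) {
  private val in = new ByteArrayInputStream(data)
  // Clamp so offsets larger than the file yield an empty stream, not a negative size
  val effectiveSize: Long = math.max(0L, data.length - fileStartOffset - fileEndOffset)
  private var bytesRead = 0L
  in.skip(fileStartOffset) // reliable here because the data is in memory

  def next(numberOfBytes: Int): Array[Byte] = {
    val remaining = effectiveSize - bytesRead       // cap reads to remaining effective bytes
    val toRead = math.min(numberOfBytes.toLong, remaining).toInt
    if (toRead <= 0) return Array.emptyByteArray    // effective end reached
    val buf = new Array[Byte](toRead)
    val n = in.read(buf)
    bytesRead += n
    buf.take(n)
  }
}

val data = "HDRpayload!XX".getBytes("UTF-8") // 3-byte header, 2-byte trailer
val s = new BoundedStream(data, 3, 2)
println(new String(s.next(100), "UTF-8"))    // -> payload!
```

Requesting more bytes than the effective size returns only the bounded slice, which is the contract the walkthrough describes for `next(numberOfBytes)`.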

Sequence Diagram

sequenceDiagram
    participant Builder as Builder
    participant CobolProc as CobolProcessor
    participant FSStream as FSStream
    participant FileStreamer as FileStreamer
    participant FS as FileSystem

    Builder->>CobolProc: build with file_start_offset,<br/>file_end_offset options
    CobolProc->>CobolProc: extract readerParameters (offsets)

    rect rgba(100,150,255,0.5)
    CobolProc->>FSStream: FSStream(file, startOffset, endOffset)
    FSStream->>FS: seek to startOffset (skipFully)
    FS-->>FSStream: byte chunks
    FSStream->>CobolProc: data until effectiveSize reached
    end

    rect rgba(150,200,100,0.5)
    CobolProc->>FileStreamer: compute maximumBytes via FileUtils
    FileStreamer->>FS: FileStreamer(file, startOffset, maximumBytes)
    FS-->>FileStreamer: bounded read
    FileStreamer->>CobolProc: data within bounds
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hopping through bytes, I bound the trail,
start at three, end at two — no extra tail.
Streams now skip the fluff with nimble paws,
trimmed and tidy, precise without a cause.
Cheers from the rabbit, who loves neat file laws.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title '#841 Add support for file offsets in in-place processors' clearly and specifically summarizes the main change: adding file offset support for in-place COBOL file processors.
Linked Issues check ✅ Passed The PR fully implements the requirements from #841: file_start_offset and file_end_offset options are now supported in both CobolProcessor and SparkCobolProcessor, with comprehensive changes to FSStream, ReaderParameters passing, and file-slicing logic.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing file offset support: FSStream modifications, CobolProcessor parameter passing, SparkCobolProcessor integration, FileUtils helper, and comprehensive test coverage. No extraneous modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@github-actions

github-actions Bot commented May 4, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.3% -0.06% 🍏
Files changed 75.9% 🍏

File Coverage
CobolProcessor.scala 80.9% 🍏
FSStream.scala 80.48% -19.05% 🍏

@github-actions

github-actions Bot commented May 4, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 83.34% -0.06% 🍏
Files changed 94.59% 🍏

File Coverage
SparkCobolProcessor.scala 96.81% 🍏
IndexBuilder.scala 96.6% 🍏
FileUtils.scala 83.54% -1.05% 🍏

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala (1)

163-193: ⚡ Quick win

Add a Spark InPlace case too.

This regression test still runs the ToVariableLength branch, so it won't catch a break in the actual in-place Spark processor path that this PR is about. A matching CobolProcessingStrategy.InPlace case would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala`
around lines 163 - 193, The test currently exercises only
CobolProcessingStrategy.ToVariableLength; add a parallel case that sets
withProcessingStrategy(CobolProcessingStrategy.InPlace) (using the same
SerializableRawRecordProcessor implementation, file_start_offset/file_end_offset
options, load/save and subsequent read/assert steps) so the in-place Spark
processing path is covered; replicate the sequence that writes outputFile, reads
binary bytes and JSON via spark.read (same options: copybook_contents,
record_format "V", is_rdw_big_endian "true", pedantic "true") and assert the
outputData and actual JSON equal the same expected values to ensure the InPlace
branch is tested.
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala (1)

94-112: ⚡ Quick win

Add an InPlace regression case alongside this one.

This new coverage only exercises CobolProcessingStrategy.ToVariableLength, while the feature request is specifically about the in-place processor path. A sibling assertion for CobolProcessingStrategy.InPlace would make sure CobolProcessorInPlace can't regress unnoticed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala`
around lines 94 - 112, Add a sibling test that exercises the InPlace processing
path: duplicate the existing case that builds via CobolProcessor.builder but set
.withProcessingStrategy(CobolProcessingStrategy.InPlace) and use the same
RawRecordProcessor, options ("file_start_offset","file_end_offset") and
input/output files; then assert the returned count and the output binary
contents match the expected bytes to ensure CobolProcessorInPlace is covered and
cannot regress.
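The trimming semantics such an InPlace regression test would assert can be checked in isolation. The helper below is hypothetical, not code from the suite; it only models the arithmetic implied by file_start_offset=3 and file_end_offset=2:

```scala
// Expected effect of file_start_offset and file_end_offset on a file's bytes:
// the processor should only ever see the middle slice.
def visibleSlice(fileBytes: Array[Byte], startOffset: Int, endOffset: Int): Array[Byte] = {
  // Clamp the end so that oversized offsets produce an empty slice, not an error
  val end = math.max(startOffset, fileBytes.length - endOffset)
  fileBytes.slice(startOffset, end)
}

val file = Array[Byte](1, 2, 3, 10, 20, 30, 40, 99, 98) // 3-byte header, 2-byte trailer
println(visibleSlice(file, 3, 2).mkString(","))          // -> 10,20,30,40
```

An InPlace test would feed a file with a known header and trailer through the processor and assert the output matches this slice.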
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala`:
- Around line 21-36: The FSStream constructor currently uses
bytesStream.skip(fileStartOffset) which is unreliable and computes effectiveSize
that can be negative; change it to loop until the requested fileStartOffset
bytes are actually skipped (repeatedly calling bytesStream.skip(remaining) and
if skip returns 0, read and discard a single byte to advance or break on EOF) to
ensure the stream position is correct, and clamp effectiveSize to be
non-negative by computing effectiveSize = Math.max(0L, fileSize -
fileStartOffset - fileEndOffset); ensure size and totalSize return that clamped
effectiveSize and that next() uses the clamped value to avoid premature EOF
behavior.

In `@README.md`:
- Around line 2014-2015: Update the changelog entry by increasing the heading
level from "#### 2.10.4 will be released soon." to "### 2.10.4 will be released
soon." to remove the MD001 warning, and correct the link target in the bullet
(currently "[`#841`](.../pull/841)") to point to the intended object—either change
to "[`#841`](.../issues/841)" if referencing the issue or to
"[`#844`](.../pull/844)" if referencing the PR that introduced the change; edit
the matching line containing the heading text and the link text in README.md
accordingly.
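The skip loop requested in the FSStream comment above can be sketched over a plain `java.io.InputStream`. This is an illustrative version of the pattern, not the actual patch:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// InputStream.skip may skip fewer bytes than requested (even zero), so loop,
// fall back to read() to force the stream forward, and stop cleanly on EOF.
// Returns the number of bytes actually skipped.
def skipFully(in: InputStream, bytesToSkip: Long): Long = {
  var remaining = bytesToSkip
  while (remaining > 0) {
    val skipped = in.skip(remaining)
    if (skipped > 0) {
      remaining -= skipped
    } else {
      if (in.read() < 0) return bytesToSkip - remaining // EOF reached early
      remaining -= 1
    }
  }
  bytesToSkip
}

val in = new ByteArrayInputStream("abcdef".getBytes("UTF-8"))
println(skipFully(in, 4))  // -> 4
println(in.read().toChar)  // -> e
```

With this loop the stream position after construction is guaranteed, which is the precondition for the clamped `effectiveSize` bookkeeping the comment also asks for.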


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4240e36-697e-4c0e-92ed-329177a75dfb

📥 Commits

Reviewing files that changed from the base of the PR and between a0f4553 and a0c7207.

📒 Files selected for processing (8)
  • README.md
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/FileUtils.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala

Comment thread README.md
Comment on lines +2014 to +2015
- #### 2.10.4 will be released soon.
- [#841](https://github.com/AbsaOSS/cobrix/pull/841) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the changelog heading and reference target.

This entry introduces the MD001 warning (## Changelog jumps straight to ####), and the link points to pull/841 even though this change is PR 844 closing issue 841. Bump the heading to ### and point the link at the intended object (issues/841 or pull/844).

📝 Suggested edit
-- #### 2.10.4 will be released soon.
-   - [`#841`](https://github.com/AbsaOSS/cobrix/pull/841) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
+### 2.10.4 will be released soon.
+- [`#844`](https://github.com/AbsaOSS/cobrix/pull/844) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 2014-2014: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


@yruslan yruslan merged commit 9e407ff into master May 4, 2026
7 checks passed
@yruslan yruslan deleted the feature/841-add-support-for-file-offsets-in-processors branch May 4, 2026 12:15


Development

Successfully merging this pull request may close these issues.

Add support for file start and end offsets for EBCDIC files processor in place
