
#841 Add support for file offsets in in-place processors #844

Merged
yruslan merged 4 commits into master from
feature/841-add-support-for-file-offsets-in-processors
May 4, 2026

Conversation

@yruslan
Collaborator

@yruslan yruslan commented May 4, 2026

Closes #841

Summary by CodeRabbit

  • New Features

    • Support for file start/end offset options so processing can skip bytes at file start or end and respect per-file bounds across processing modes.
  • Documentation

    • Changelog updated for upcoming 2.10.4 release.
  • Tests

    • Added tests validating offsets for both in-place and variable-length processing and verifying generated binary/JSON outputs.

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 091a1dc6-1067-46ae-b6e4-9e779c88db51

📥 Commits

Reviewing files that changed from the base of the PR and between a0c7207 and e6ce57c.

📒 Files selected for processing (3)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala
✅ Files skipped from review due to trivial changes (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala

Walkthrough

This PR threads file_start_offset and file_end_offset through the processing pipeline: FSStream now accepts offsets and enforces bounded reads; CobolProcessor and SparkCobolProcessor propagate readerParameters to per-file streamers; FileUtils adds a helper to compute Hadoop read sizes; tests and the README were updated.

Changes

File Offset Processing Support

Layer / File(s) Summary
Stream API Shape
cobol-parser/src/main/scala/.../reader/stream/FSStream.scala
FSStream constructor now accepts fileStartOffset and fileEndOffset, computes effectiveSize, and reports size/totalSize relative to offsets.
Stream Read Enforcement
cobol-parser/src/main/scala/.../reader/stream/FSStream.scala
next(numberOfBytes) caps reads to remaining effective bytes, closes and returns empty array at effective end; copyStream() preserves offsets; added skipFully helper.
Processor Wiring
cobol-parser/src/main/scala/.../processor/CobolProcessor.scala
CobolProcessorLoader.save passes readerParameters.fileStartOffset/fileEndOffset into FSStream; builder helper methods (getCobolSchema, getReaderParameters, getOptions) had their private[processor] qualifier removed.
Spark Integration
spark-cobol/src/main/scala/.../SparkCobolProcessor.scala
save() builds a CobolProcessor to capture readerParameters; getFileProcessorRdd/processListOfFiles now accept readerParameters; per-file processing computes maximumBytes from offsets and constructs FileStreamer(inputFile, startOffset, maximumBytes).
Hadoop Read Size Utility
spark-cobol/src/main/scala/.../utils/FileUtils.scala
Added getHadoopFileReadSize(...) to resolve filesystem, detect compression, and compute a clamped read size based on start/end offsets.
Index Builder Integration
spark-cobol/src/main/scala/.../source/index/IndexBuilder.scala
Replaced manual read-size logic with FileUtils.getHadoopFileReadSize(...) when fileEndOffset > 0.
Tests & Docs
cobol-parser/src/test/.../CobolProcessorBuilderSuite.scala, spark-cobol/src/test/.../SparkCobolProcessorSuite.scala, README.md
Added tests for file_start_offset=3 / file_end_offset=2 for both InPlace and ToVariableLength strategies; corrected a test title; added changelog entry for v2.10.4 referencing PR #841.
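The bounded-read behavior summarized in the table can be illustrated with a standalone sketch. This is not the actual Cobrix `FSStream` code; only the names `fileStartOffset`, `fileEndOffset`, `effectiveSize`, and `next` come from the walkthrough above, everything else is assumed:

```scala
import java.io.ByteArrayInputStream

// Simplified model of the offset handling described above (not the real
// FSStream): reads begin after fileStartOffset and stop before the last
// fileEndOffset bytes of the file.
class BoundedStream(data: Array[Byte], fileStartOffset: Long, fileEndOffset: Long) {
  private val in = new ByteArrayInputStream(data)
  // Clamp so offsets larger than the file yield an empty stream, not a negative size
  val effectiveSize: Long = math.max(0L, data.length - fileStartOffset - fileEndOffset)
  private var bytesRead = 0L
  in.skip(fileStartOffset) // reliable here because the data is in memory

  def next(numberOfBytes: Int): Array[Byte] = {
    val remaining = effectiveSize - bytesRead       // cap reads to remaining effective bytes
    val toRead = math.min(numberOfBytes.toLong, remaining).toInt
    if (toRead <= 0) return Array.emptyByteArray    // effective end reached
    val buf = new Array[Byte](toRead)
    val n = in.read(buf)
    bytesRead += n
    buf.take(n)
  }
}

val data = "HDRpayload!XX".getBytes("UTF-8") // 3-byte header, 2-byte trailer
val s = new BoundedStream(data, 3, 2)
println(new String(s.next(100), "UTF-8"))    // -> payload!
```

Requesting more bytes than the effective size returns only the bounded slice, which is the contract the walkthrough describes for `next(numberOfBytes)`.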

Sequence Diagram

sequenceDiagram
    participant Builder as Builder
    participant CobolProc as CobolProcessor
    participant FSStream as FSStream
    participant FileStreamer as FileStreamer
    participant FS as FileSystem

    Builder->>CobolProc: build with file_start_offset,<br/>file_end_offset options
    CobolProc->>CobolProc: extract readerParameters (offsets)

    rect rgba(100,150,255,0.5)
    CobolProc->>FSStream: FSStream(file, startOffset, endOffset)
    FSStream->>FS: seek to startOffset (skipFully)
    FS-->>FSStream: byte chunks
    FSStream->>CobolProc: data until effectiveSize reached
    end

    rect rgba(150,200,100,0.5)
    CobolProc->>FileStreamer: compute maximumBytes via FileUtils
    FileStreamer->>FS: FileStreamer(file, startOffset, maximumBytes)
    FS-->>FileStreamer: bounded read
    FileStreamer->>CobolProc: data within bounds
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hopping through bytes, I bound the trail,
start at three, end at two — no extra tail.
Streams now skip the fluff with nimble paws,
trimmed and tidy, precise without a cause.
Cheers from the rabbit, who loves neat file laws.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title '#841 Add support for file offsets in in-place processors' clearly and specifically summarizes the main change: adding file offset support for in-place COBOL file processors.
Linked Issues check ✅ Passed The PR fully implements the requirements from #841: file_start_offset and file_end_offset options are now supported in both CobolProcessor and SparkCobolProcessor, with comprehensive changes to FSStream, ReaderParameters passing, and file-slicing logic.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing file offset support: FSStream modifications, CobolProcessor parameter passing, SparkCobolProcessor integration, FileUtils helper, and comprehensive test coverage. No extraneous modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@github-actions

github-actions Bot commented May 4, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.3% -0.06% 🍏
Files changed 75.9% 🍏

File Coverage
CobolProcessor.scala 80.9% 🍏
FSStream.scala 80.48% -19.05% 🍏

@github-actions

github-actions Bot commented May 4, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 83.34% -0.06% 🍏
Files changed 94.59% 🍏

File Coverage
SparkCobolProcessor.scala 96.81% 🍏
IndexBuilder.scala 96.6% 🍏
FileUtils.scala 83.54% -1.05% 🍏

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala (1)

163-193: ⚡ Quick win

Add a Spark InPlace case too.

This regression test still runs the ToVariableLength branch, so it won't catch a break in the actual in-place Spark processor path that this PR is about. A matching CobolProcessingStrategy.InPlace case would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala`
around lines 163 - 193, The test currently exercises only
CobolProcessingStrategy.ToVariableLength; add a parallel case that sets
withProcessingStrategy(CobolProcessingStrategy.InPlace) (using the same
SerializableRawRecordProcessor implementation, file_start_offset/file_end_offset
options, load/save and subsequent read/assert steps) so the in-place Spark
processing path is covered; replicate the sequence that writes outputFile, reads
binary bytes and JSON via spark.read (same options: copybook_contents,
record_format "V", is_rdw_big_endian "true", pedantic "true") and assert the
outputData and actual JSON equal the same expected values to ensure the InPlace
branch is tested.
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala (1)

94-112: ⚡ Quick win

Add an InPlace regression case alongside this one.

This new coverage only exercises CobolProcessingStrategy.ToVariableLength, while the feature request is specifically about the in-place processor path. A sibling assertion for CobolProcessingStrategy.InPlace would make sure CobolProcessorInPlace can't regress unnoticed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala`
around lines 94 - 112, Add a sibling test that exercises the InPlace processing
path: duplicate the existing case that builds via CobolProcessor.builder but set
.withProcessingStrategy(CobolProcessingStrategy.InPlace) and use the same
RawRecordProcessor, options ("file_start_offset","file_end_offset") and
input/output files; then assert the returned count and the output binary
contents match the expected bytes to ensure CobolProcessorInPlace is covered and
cannot regress.
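The trimming semantics such an InPlace regression test would assert can be checked in isolation. The helper below is hypothetical, not code from the suite; it only models the arithmetic implied by file_start_offset=3 and file_end_offset=2:

```scala
// Expected effect of file_start_offset and file_end_offset on a file's bytes:
// the processor should only ever see the middle slice.
def visibleSlice(fileBytes: Array[Byte], startOffset: Int, endOffset: Int): Array[Byte] = {
  // Clamp the end so that oversized offsets produce an empty slice, not an error
  val end = math.max(startOffset, fileBytes.length - endOffset)
  fileBytes.slice(startOffset, end)
}

val file = Array[Byte](1, 2, 3, 10, 20, 30, 40, 99, 98) // 3-byte header, 2-byte trailer
println(visibleSlice(file, 3, 2).mkString(","))          // -> 10,20,30,40
```

An InPlace test would feed a file with a known header and trailer through the processor and assert the output matches this slice.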
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala`:
- Around line 21-36: The FSStream constructor currently uses
bytesStream.skip(fileStartOffset) which is unreliable and computes effectiveSize
that can be negative; change it to loop until the requested fileStartOffset
bytes are actually skipped (repeatedly calling bytesStream.skip(remaining) and
if skip returns 0, read and discard a single byte to advance or break on EOF) to
ensure the stream position is correct, and clamp effectiveSize to be
non-negative by computing effectiveSize = Math.max(0L, fileSize -
fileStartOffset - fileEndOffset); ensure size and totalSize return that clamped
effectiveSize and that next() uses the clamped value to avoid premature EOF
behavior.

In `@README.md`:
- Around line 2014-2015: Update the changelog entry by increasing the heading
level from "#### 2.10.4 will be released soon." to "### 2.10.4 will be released
soon." to remove the MD001 warning, and correct the link target in the bullet
(currently "[`#841`](.../pull/841)") to point to the intended object—either change
to "[`#841`](.../issues/841)" if referencing the issue or to
"[`#844`](.../pull/844)" if referencing the PR that introduced the change; edit
the matching line containing the heading text and the link text in README.md
accordingly.
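The skip loop requested in the FSStream comment above can be sketched over a plain `java.io.InputStream`. This is an illustrative version of the pattern, not the actual patch:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// InputStream.skip may skip fewer bytes than requested (even zero), so loop,
// fall back to read() to force the stream forward, and stop cleanly on EOF.
// Returns the number of bytes actually skipped.
def skipFully(in: InputStream, bytesToSkip: Long): Long = {
  var remaining = bytesToSkip
  while (remaining > 0) {
    val skipped = in.skip(remaining)
    if (skipped > 0) {
      remaining -= skipped
    } else {
      if (in.read() < 0) return bytesToSkip - remaining // EOF reached early
      remaining -= 1
    }
  }
  bytesToSkip
}

val in = new ByteArrayInputStream("abcdef".getBytes("UTF-8"))
println(skipFully(in, 4))  // -> 4
println(in.read().toChar)  // -> e
```

With this loop the stream position after construction is guaranteed, which is the precondition for the clamped `effectiveSize` bookkeeping the comment also asks for.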


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4240e36-697e-4c0e-92ed-329177a75dfb

📥 Commits

Reviewing files that changed from the base of the PR and between a0f4553 and a0c7207.

📒 Files selected for processing (8)
  • README.md
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/FileUtils.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala

Comment thread README.md
Comment on lines +2014 to +2015
- #### 2.10.4 will be released soon.
- [#841](https://github.com/AbsaOSS/cobrix/pull/841) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the changelog heading and reference target.

This entry introduces the MD001 warning (## Changelog jumps straight to ####), and the link points to pull/841 even though this change is PR 844 closing issue 841. Bump the heading to ### and point the link at the intended object (issues/841 or pull/844).

📝 Suggested edit
-- #### 2.10.4 will be released soon.
-   - [`#841`](https://github.com/AbsaOSS/cobrix/pull/841) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
+### 2.10.4 will be released soon.
+- [`#844`](https://github.com/AbsaOSS/cobrix/pull/844) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 2014-2014: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


@yruslan yruslan merged commit 9e407ff into master May 4, 2026
7 checks passed
@yruslan yruslan deleted the feature/841-add-support-for-file-offsets-in-processors branch May 4, 2026 12:15


Development

Successfully merging this pull request may close these issues.

Add support for file start and end offsets for EBCDIC files processor in place
