
#797 Add support for variable-sized OCCURS when writing EBCDIC files #830

Open
yruslan wants to merge 5 commits into master from feature/797-add-support-for-variable-size-occurs

Conversation

@yruslan
Collaborator

@yruslan yruslan commented Mar 9, 2026

Closes #797

Summary by CodeRabbit

  • New Features

    • Writer now supports OCCURS DEPENDING ON with dependency tracking for dynamic arrays and improved RDW handling
    • Added a lightweight holder to track dependee field offsets
    • Validation API expanded to validate copybook/source configurations for reading and writing
  • Bug Fixes

    • String rendering now shows actual field names (no camelCase conversions)
    • Validation tightened: unsupported OCCURS mappings and multi-segment write options produce clearer errors; variable-size OCCURS with fixed records now warns
  • Tests

    • Added unit and integration tests for OCCURS, OCCURS DEPENDING ON, variable-size scenarios, and parameter validation
  • Chores

    • Minor CI message formatting and documentation updates

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Walkthrough

Refactors writer AST and RDD processing to support OCCURS and OCCURS DEPENDING ON by tracking dependee fields through a dependeeMap; updates writer parameter validation and validation entrypoints for writing; adjusts copybook AST string rendering to use raw names; adds tests and minor CI/build tweaks.
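As context for the walkthrough, this is roughly what a variable-sized OCCURS looks like in a copybook. The sketch below is illustrative only: the copybook layout and field names are invented, and it assumes the cobol-parser module's CopybookParser.parseTree API.

```scala
import za.co.absa.cobrix.cobol.parser.CopybookParser

object OccursDependingOnParse {
  def main(args: Array[String]): Unit = {
    // Hypothetical copybook: DEPT-CNT is the dependee field controlling
    // how many DEPT elements are present in each record.
    val copybook =
      """       01  COMPANY.
        |           05  COMPANY-ID   PIC X(10).
        |           05  DEPT-CNT     PIC 9(1).
        |           05  DEPT         PIC X(8)
        |               OCCURS 0 TO 5 TIMES DEPENDING ON DEPT-CNT.
        |""".stripMargin

    val parsed = CopybookParser.parseTree(copybook)
    // Without variable-size handling the record size is computed from the
    // maximum OCCURS count (10 + 1 + 5 * 8 bytes here); the writer changes
    // in this PR instead trim trailing elements per row when
    // variable_size_occurs is enabled.
    println(s"Static record size: ${parsed.getRecordSize} bytes")
  }
}
```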

Changes

Cohort / File(s) Summary
AST String Representation
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala, cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala
toString implementations now render the raw field name and no longer camel-case redefined names; the camelCased field and the camelCase() helper were removed.
Dependee Model
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/DependingOnField.scala
New case class DependingOnField(cobolField: Primitive, var baseOffset: Int) introduced to represent OCCURS DEPENDING ON targets.
Writer AST Types
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
Added PrimitiveDependeeField node; PrimitiveArray and GroupArray now accept optional dependingOn references; added documentation.
AST Construction & RDD Processing
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
Major refactor: added getAst(copybook) normalization; constructWriterAst and builder methods now accept and populate a dependeeMap; processRDD/writeToBytes updated to support variable-length OCCURS, RDW length adjustments, and dependee resolution; new PrimitiveDependeeField handling wired through generation and writing.
Writer Validation
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
Added SLF4J logger and new overload validateOrThrow(parameters: Map[String,String], hadoopConf); new validateParametersForWriting(readerParameters) enforces disallowed write options (e.g., occurs_mappings, multi-segment) and warns on incompatible combos.
Streaming Entrypoint
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala
Deserializer now collects parameters into a Map and calls new validateOrThrow(parameters, hadoopConf) before reader initialization.
Tests — Writer & Validation
spark-cobol/src/test/scala/.../NestedWriterSuite.scala, .../VariableLengthEbcdicWriterSuite.scala, .../CobolParametersValidatorSuite.scala
Added copybooks and tests for OCCURS and OCCURS DEPENDING ON (including variable-length occurs), added validator tests for writing constraints, updated expectations and new duplicate-dependee-name test.
CI & Build
.github/workflows/jacoco_check.yml, build.sbt
Minor CI message string-quoting changes and a single-line build manifest change.
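To make the dependeeMap idea in the table above concrete, here is a minimal, self-contained sketch. CountField stands in for the PR's DependingOnField(cobolField: Primitive, var baseOffset: Int); everything else is invented for illustration.

```scala
import scala.collection.mutable

object DependeeMapSketch {
  // Simplified stand-in for DependingOnField: the mutable baseOffset is
  // updated per row before the dependee value is encoded.
  final case class CountField(name: String, var baseOffset: Int)

  def main(args: Array[String]): Unit = {
    val dependeeMap = mutable.Map.empty[String, CountField]

    // Registered once while the writer AST is constructed:
    dependeeMap("DEPT-CNT") = CountField("DEPT-CNT", baseOffset = -1)

    // Per row while writing: record where the count field sits, then
    // write the actual number of array elements at that offset.
    val spec = dependeeMap("DEPT-CNT")
    spec.baseOffset = 10         // offset computed for the current record
    val elementsInRow = 3        // size of the DEPT array in this row
    println(s"encode $elementsInRow at offset ${spec.baseOffset} for ${spec.name}")
  }
}
```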

Sequence Diagram

sequenceDiagram
    participant User
    participant Writer
    participant Combiner
    participant Builder
    participant DepMap
    participant Processor

    User->>Writer: combine(rdd, copybook, schema)
    Writer->>Combiner: constructWriterAst(copybook, schema)
    Combiner->>DepMap: create empty dependeeMap
    Combiner->>Builder: buildGroupField(..., dependeeMap)
    Builder->>Builder: buildPrimitiveNode(..., dependeeMap)
    alt field has DEPENDING ON
        Builder->>DepMap: register DependingOnField(name -> spec)
        Builder-->>Combiner: emit PrimitiveDependeeField / arrays with dependingOn
    end
    Combiner-->>Writer: return root WriterAst
    Writer->>Processor: processRDD(..., variableSizeOccurs)
    Processor->>Processor: iterate rows, call writeToBytes(ast,..., variableLengthOccurs)
    rect rgba(100,200,100,0.5)
        Processor->>DepMap: consult dependeeMap for dependee values
        Processor->>Processor: compute dynamic OCCURS sizes, write arrays accordingly
    end
    Processor-->>Writer: RDD[Array[Byte]]
    Writer-->>User: output RDD
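End to end, the flow in the diagram corresponds roughly to a write invocation like the following sketch. The option names (record_format, variable_size_occurs, copybook) come from this PR's discussion; the copybook path and data are invented.

```scala
import org.apache.spark.sql.SparkSession

object VariableOccursWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("variable-occurs-write-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented data: each row carries a differently sized DEPT array.
    val df = Seq(
      ("C001", Seq("SALES", "HR")),
      ("C002", Seq("ENG"))
    ).toDF("COMPANY_ID", "DEPT")

    // Assumed behavior per the walkthrough above: 'V' adds RDW headers,
    // and variable_size_occurs trims OCCURS arrays to the actual row size.
    df.write
      .format("cobol")
      .option("copybook", "/path/to/company.cpy")   // invented path
      .option("record_format", "V")
      .option("variable_size_occurs", "true")
      .save("/output/company_ebcdic")
  }
}
```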

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

🐰 I hop through copybooks, names honest and plain,
I map the fields that tell arrays how to grow,
I tuck their specs in a map, ready for the main,
I stitch bytes in order, row by row,
A rabbit’s nibble, and the writer sings hello.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Out of Scope Changes check (⚠️ Warning): Most changes align with the OCCURS DEPENDING ON feature; however, unrelated refactoring includes camelCase removal from Primitive/Statement toString and JaCoCo workflow message formatting. Resolution: separate the camelCase refactoring and workflow formatting changes into a dedicated commit or PR to maintain clear scope alignment with the #797 feature request.
  • Linked Issues check (❓ Inconclusive): The PR addresses the primary coding requirement from #797: support for OCCURS and OCCURS DEPENDING ON in EBCDIC writing. VB file support and CP037 code page support are partially addressed or incomplete. Resolution: verify that VB file support and CP037 code page support are implemented; the PR appears to focus primarily on OCCURS DEPENDING ON handling without clear evidence of the other two requirements.

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main feature: adding support for variable-sized OCCURS when writing EBCDIC files, which directly corresponds to the primary changes in the codebase.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@github-actions

github-actions bot commented Mar 9, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.4% -0.12% 🍏
Files changed 0%

File Coverage
Primitive.scala 71.43% -4.8%
Statement.scala 51.87% -17.65%

@github-actions

github-actions bot commented Mar 9, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 83.77% -6.59% 🍏
Files changed 35.63%

File Coverage
DependingOnField.scala 100%
WriterAst.scala 93.51%
NestedRecordCombiner.scala 88.37% -1.86% 🍏
CobolParametersValidator.scala 85.77% -0.82% 🍏
CobolStreamer.scala 0% -24.89%


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala (1)

119-145: ⚠️ Potential issue | 🟠 Major

Reject variable_size_occurs for fixed-length writes.

Now that variable_size_occurs is allowed for writing, NestedRecordCombiner.processRDD() can emit shorter byte arrays for record_format = F. Without RDW or any other framing, that turns the output into an undecodable flat file as soon as one record shrinks. Please fail this combination here, or keep fixed-length writes untrimmed.

⚙️ Proposed validation
   if (readerParameters.recordFormat != RecordFormat.FixedLength && readerParameters.recordFormat != RecordFormat.VariableLength) {
     issues += s"Only '${RecordFormat.FixedLength}' and '${RecordFormat.VariableLength}' values for 'record_format' are supported for writing, " +
       s"provided value: '${readerParameters.recordFormat}'"
   }
+
+  if (readerParameters.variableSizeOccurs &&
+    readerParameters.recordFormat != RecordFormat.VariableLength) {
+    issues += "'variable_size_occurs' is supported for writing only when 'record_format' is 'V'"
+  }
 
   if (readerParameters.occursMappings.nonEmpty) {
     issues += "OCCURS mapping option ('occurs_mappings') is not supported for writing"
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala`
around lines 119 - 145, The validator must reject the combination of
fixed-length output and variable-size OCCURS; update
validateParametersForWriting (in CobolParametersValidator) to add a check that
if readerParameters.recordFormat == RecordFormat.FixedLength and the
variable-size-occurs option is enabled (readerParameters.variableSizeOccurs /
variable_size_occurs flag on ReaderParameters), append an error like
"variable_size_occurs is not supported for fixed-length writes" to issues so the
IllegalArgumentException is thrown; this ensures fixed-length writes remain
untrimmed and prevents emitting shorter byte arrays.
🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala (1)

134-160: Add a VB rejection regression here too.

This negative-path test currently proves FB is rejected, but not VB. Since issue #797 explicitly mentions VB, a dedicated record_format = "VB" assertion would make the intended writer contract much clearer and protect it from regressions.

Based on learnings, variable-block format writing will not be implemented in the writer; only VariableLength uses RDW headers.
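A sketch of the suggested VB assertion might look like this. The copybook, data, and suite scaffolding are invented; the expected message text is quoted from the validator change discussed in this PR.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class VbRejectionSuiteSketch extends AnyFunSuite {
  test("writer rejects record_format = VB") {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Minimal invented copybook and data for the negative-path write.
    val copybook =
      """       01  RECORD.
        |           05  A  PIC X(4).
        |""".stripMargin
    val df = Seq("ABCD").toDF("A")
    val path = java.nio.file.Files.createTempDirectory("vb_test").toString + "/out"

    val ex = intercept[IllegalArgumentException] {
      df.write
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("record_format", "VB")
        .save(path)
    }
    assert(ex.getMessage.contains("provided value: 'VB'"))
  }
}
```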

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala`
around lines 134 - 160, The test in VariableLengthEbcdicWriterSuite ("throw an
exception on unexpected output record format") only asserts that "FB" is
rejected but misses asserting that "VB" is rejected; add a second negative-path
check (or extend the existing intercept) that writes with
.option("record_format", "VB") using the same df/path setup and asserts the
thrown IllegalArgumentException message includes the same "Only 'F' and 'V'
values for 'record_format' are supported for writing, provided value: 'VB'" text
so VB regressions are caught (refer to the test block variables df, path and the
intercept/exception handling used in the current test).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 418-426: The null-array branches in NestedRecordCombiner are
leaving DEPENDING ON fields as raw zeros in fixed-length mode because
Copybook.setPrimitiveField is only invoked when variableLengthOccurs is true;
update both null branches (around the blocks that now return 0 and
cobolField.binaryProperties.actualSize, also the similar case at lines 459-466)
to always call Copybook.setPrimitiveField(spec.cobolField, ar, 0,
fieldStartOffsetOverride = spec.baseOffset) for each spec in dependingOn before
returning, so the dependee PrimitiveDependeeField gets an encoded COBOL zero
even when the array is null and variableLengthOccurs is false.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b88d503c-8012-4362-bbf1-6c0ad43aa41d

📥 Commits

Reviewing files that changed from the base of the PR and between a76495d and 94beb4b.

📒 Files selected for processing (8)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/DependingOnField.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

370-373: Mutable state in shared AST node is safe but fragile.

DependingOnField.baseOffset is mutated at line 372 for every row within the same partition. This works correctly because:

  1. Rows are processed sequentially within mapPartitions
  2. Each row gets a fresh byte array
  3. baseOffset is always updated before being read in the subsequent dependingOn.foreach calls

However, this design is fragile—if the processing model ever changes to concurrent row handling, this will silently corrupt data. Consider adding a brief comment explaining why this mutation is safe.

📝 Suggested documentation
       // ── Primitive which has an OCCURS DEPENDS ON ─────────────────────────────
       case PrimitiveDependeeField(spec) =>
+        // NOTE: baseOffset is mutated here for each row. This is safe because rows
+        // are processed sequentially within mapPartitions, and the offset is always
+        // updated before being read in subsequent array-element writes.
         spec.baseOffset = currentOffset
         spec.cobolField.binaryProperties.actualSize
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 370 - 373, The mutation of DependingOnField.baseOffset inside the
PrimitiveDependeeField match is fragile; add a concise comment near the
assignment (spec.baseOffset = currentOffset) explaining that this is safe
because rows are processed sequentially within mapPartitions, each row uses a
fresh byte array, and baseOffset is always updated before any
dependingOn.foreach reads it, and note that concurrent processing would break
this invariant; reference PrimitiveDependeeField, DependingOnField.baseOffset,
mapPartitions, and dependingOn.foreach in the comment so future maintainers
understand the assumption.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7883dcea-55d1-4916-8681-7c58385e6b5c

📥 Commits

Reviewing files that changed from the base of the PR and between e08fbb5 and bfa90d6.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidatorSuite.scala

}.getOrElse {
log.error(s"Field '$path${p.name}' is not found in Spark schema. Will be replaced by filler.")
Filler(p.binaryProperties.actualSize)
// Dependee fields need not to be defines in Spark schema.


⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

"defines" should be "defined".

✏️ Fix typo
-      // Dependee fields need not to be defines in Spark schema.
+      // Dependee fields need not be defined in Spark schema.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      // Dependee fields need not to be defines in Spark schema.
+      // Dependee fields need not be defined in Spark schema.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
at line 281, Fix the comment typo inside NestedRecordCombiner.scala near the
comment above the dependee handling: change "defines" to "defined" in the
comment "// Dependee fields need not to be defines in Spark schema." so it reads
"// Dependee fields need not to be defined in Spark schema." to correct the
spelling in the NestedRecordCombiner class.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala (1)

467-476: Reduce coupling to AST layout and exact exception wording.

This test depends on children(5) and the full message with parser line numbers, so harmless AST or wording refactors will break it even if duplicate-dependee validation still works. Prefer selecting the second CNT1 by name and asserting only the stable part of the error.

♻️ Suggested cleanup
-      val ast = parsedCopybook.ast
-      val children = ast.children.head.asInstanceOf[Group].children
-      val cnt2 = children(5).asInstanceOf[Primitive].withUpdatedIsDependee(true)
-      children(5) = cnt2
+      val record = parsedCopybook.ast.children.head.asInstanceOf[Group]
+      val cnt1Indexes = record.children.zipWithIndex.collect {
+        case (p: Primitive, idx) if p.name.equalsIgnoreCase("CNT1") => idx
+      }
+      assert(cnt1Indexes.size == 2)
+      val secondCnt1Index = cnt1Indexes(1)
+      record.children(secondCnt1Index) =
+        record.children(secondCnt1Index).asInstanceOf[Primitive].withUpdatedIsDependee(true)

       val ex = intercept[IllegalArgumentException] {
         NestedRecordCombiner.constructWriterAst(parsedCopybook, df.schema)
       }

-      assert(ex.getMessage == "Duplicate field name 'CNT1' found in copybook. Field names must be unique (case-insensitive) when OCCURS DEPENDING ON is used. Already found a dependee field with the same name at line 4, current field line number: 10.")
+      assert(ex.getMessage.contains("Duplicate field name 'CNT1' found in copybook."))
+      assert(ex.getMessage.contains("Field names must be unique (case-insensitive) when OCCURS DEPENDING ON is used."))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala`
around lines 467 - 476, The test in NestedWriterSuite is brittle because it
relies on AST index children(5) and exact exception text; update the test so it
finds the second CNT1 node by name (e.g., walk parsedCopybook.ast to locate the
Primitive node whose name equals "CNT1" and then call withUpdatedIsDependee on
that node) instead of using children(5), and change the assertion on the thrown
IllegalArgumentException from exact string equality to a contains check for the
stable fragment (for example verify ex.getMessage contains "Duplicate field name
'CNT1'") while keeping the call to
NestedRecordCombiner.constructWriterAst(parsedCopybook, df.schema).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 889a3459-b227-4f39-aa31-c97cb5aed499

📥 Commits

Reviewing files that changed from the base of the PR and between bfa90d6 and 01da263.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala

Comment on lines +153 to +156
if (readerParameters.variableSizeOccurs &&
readerParameters.recordFormat == RecordFormat.FixedLength) {
log.warn("Option 'variable_size_occurs=true' is used with 'record_format=F' which means records can have variable length. It is highly recommended to use 'record_format=V' instead.")
}


⚠️ Potential issue | 🟠 Major

Reject variable_size_occurs for fixed-length writes.

This combination can emit variable-sized records while spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala:45-76 only adds RDW framing for RecordFormat.VariableLength. With record_format=F, downstream consumers lose unambiguous record boundaries, so this should fail validation instead of warning.

🛑 Suggested validation change
-    if (readerParameters.variableSizeOccurs &&
-      readerParameters.recordFormat == RecordFormat.FixedLength) {
-      log.warn("Option 'variable_size_occurs=true' is used with 'record_format=F' which means records can have variable length. It is highly recommended to use 'record_format=V' instead.")
-    }
+    if (readerParameters.variableSizeOccurs &&
+      readerParameters.recordFormat == RecordFormat.FixedLength) {
+      issues += "Option 'variable_size_occurs=true' requires 'record_format=V' for writing"
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala`
around lines 153 - 156, The validator currently only logs a warning when
readerParameters.variableSizeOccurs is true with readerParameters.recordFormat
== RecordFormat.FixedLength; change this to reject the configuration by
throwing/returning a validation error in CobolParametersValidator (instead of
log.warn) when variableSizeOccurs is set with RecordFormat.FixedLength,
referencing the same flags (variableSizeOccurs and recordFormat) and
RecordFormat.FixedLength to locate the check; include a clear error message
stating that variable-size occurs cannot be used with fixed-length writes
because RDW framing is only added for RecordFormat.VariableLength (see
NestedRecordCombiner handling), so validation must fail.

Comment on lines +42 to +47
val parameters: Map[String, String] = Map[String, String](
PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
)

CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Apache Spark's Scala API, what does SparkConf.get(key) do when the key is missing, and how does that differ from SparkConf.getOption(key)?

💡 Result:

  • SparkConf.get(key) returns the String value if set, but if the key is missing it throws java.util.NoSuchElementException(key). [1]
  • SparkConf.getOption(key) returns an Option[String]: Some(value) if present, otherwise None (no exception). [1]

Sources:
[1] Apache Spark source (SparkConf.scala, v3.5.6) (github.com)


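The difference the query result describes can be shown with a tiny standalone snippet. The configuration keys below are made up for illustration; Cobrix's real PARAM_COPYBOOK_PATH and PARAM_SOURCE_PATH constants hold the actual key names.

```scala
import org.apache.spark.SparkConf

object SparkConfLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false)
      .set("example.copybook.path", "/copybooks/company.cpy")

    // getOption never throws: Some(value) when present, None when absent.
    println(conf.getOption("example.copybook.path"))
    println(conf.getOption("example.source.path"))

    // get throws java.util.NoSuchElementException for a missing key,
    // which is why calling it before validation masks the validator's
    // intended error messages.
    try conf.get("example.source.path")
    catch {
      case e: java.util.NoSuchElementException =>
        println(s"missing key: ${e.getMessage}")
    }
  }
}
```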

🏁 Script executed:

# Locate and inspect CobolStreamer.scala around lines 42-47
fd "CobolStreamer.scala" --type f

Repository: AbsaOSS/cobrix

Length of output: 1855


🏁 Script executed:

# Inspect CobolParametersValidator to understand error message design
fd "CobolParametersValidator.scala" --type f

Repository: AbsaOSS/cobrix

Length of output: 163


🏁 Script executed:

# Find CobolStreamer.scala
fd "CobolStreamer.scala" --path "*/spark-cobol/*"

Repository: AbsaOSS/cobrix

Length of output: 273


🏁 Script executed:

# Read CobolStreamer.scala around lines 42-47
find . -path "*/spark-cobol/*" -name "CobolStreamer.scala" -exec cat -n {} \; | sed -n '35,55p'

Repository: AbsaOSS/cobrix

Length of output: 1236


🏁 Script executed:

# Read CobolParametersValidator to understand error messages
cat -n spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala | head -150

Repository: AbsaOSS/cobrix

Length of output: 8759


Don't call SparkConf.get() before validation.

These lookups throw NoSuchElementException before CobolParametersValidator.validateOrThrow() runs (line 47). Users see a raw config exception instead of the validator's intended error messages for missing required parameters.

Suggested fix
   implicit class Deserializer(`@transient` val ssc: StreamingContext) extends Serializable {
-    val parameters: Map[String, String] = Map[String, String](
-      PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
-      PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
-    )
+    private val conf = ssc.sparkContext.getConf
+    val parameters: Map[String, String] =
+      Seq(PARAM_COPYBOOK_PATH, PARAM_SOURCE_PATH)
+        .flatMap(key => conf.getOption(key).map(value => key -> value))
+        .toMap

     CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
📝 Committable suggestion


Suggested change
-    val parameters: Map[String, String] = Map[String, String](
-      PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
-      PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
-    )
-    CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
+    private val conf = ssc.sparkContext.getConf
+    val parameters: Map[String, String] =
+      Seq(PARAM_COPYBOOK_PATH, PARAM_SOURCE_PATH)
+        .flatMap(key => conf.getOption(key).map(value => key -> value))
+        .toMap
+    CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala`
around lines 42 - 47, The code currently calls
ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH) and get(PARAM_SOURCE_PATH)
which throw NoSuchElementException before
CobolParametersValidator.validateOrThrow runs; change the parameter extraction
in CobolStreamer: replace direct get(...) calls with getOption(...) and supply a
safe default (e.g., .getOrElse("") ) when building the parameters Map for
PARAM_COPYBOOK_PATH and PARAM_SOURCE_PATH so that
CobolParametersValidator.validateOrThrow(parameters,
ssc.sparkContext.hadoopConfiguration) can run and produce its intended
validation errors instead of a raw exception.



Development

Successfully merging this pull request may close these issues.

ASCII to EBCDIC support with OCCURS and OCCURS+DEPENDING ON

1 participant