
#797 Add support for variable-sized OCCURS when writing EBCDIC files #830

Open
yruslan wants to merge 5 commits into master from feature/797-add-support-for-variable-size-occurs

Conversation

@yruslan
Collaborator

@yruslan yruslan commented Mar 9, 2026

Closes #797

Summary by CodeRabbit

  • New Features

    • Writer now supports OCCURS DEPENDING ON with dependency tracking for dynamic arrays and improved RDW handling
    • Added a lightweight holder to track dependee field offsets
    • Validation API expanded to validate copybook/source configurations for reading and writing
  • Bug Fixes

    • String rendering now shows actual field names (no camelCase conversions)
    • Validation tightened: unsupported OCCURS mappings and multi-segment write options produce clearer errors; variable-size OCCURS with fixed records now warns
  • Tests

    • Added unit and integration tests for OCCURS, OCCURS DEPENDING ON, variable-size scenarios, and parameter validation
  • Chores

    • Minor CI message formatting and documentation updates

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Walkthrough

Refactors writer AST and RDD processing to support OCCURS and OCCURS DEPENDING ON by tracking dependee fields through a dependeeMap; updates writer parameter validation and validation entrypoints for writing; adjusts copybook AST string rendering to use raw names; adds tests and minor CI/build tweaks.
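As context for the walkthrough, this is roughly what a variable-sized OCCURS looks like in a copybook. The sketch below is illustrative only: the copybook layout and field names are invented, and it assumes the cobol-parser module's CopybookParser.parseTree API.

```scala
import za.co.absa.cobrix.cobol.parser.CopybookParser

object OccursDependingOnParse {
  def main(args: Array[String]): Unit = {
    // Hypothetical copybook: DEPT-CNT is the dependee field controlling
    // how many DEPT elements are present in each record.
    val copybook =
      """       01  COMPANY.
        |           05  COMPANY-ID   PIC X(10).
        |           05  DEPT-CNT     PIC 9(1).
        |           05  DEPT         PIC X(8)
        |               OCCURS 0 TO 5 TIMES DEPENDING ON DEPT-CNT.
        |""".stripMargin

    val parsed = CopybookParser.parseTree(copybook)
    // Without variable-size handling the record size is computed from the
    // maximum OCCURS count (10 + 1 + 5 * 8 bytes here); the writer changes
    // in this PR instead trim trailing elements per row when
    // variable_size_occurs is enabled.
    println(s"Static record size: ${parsed.getRecordSize} bytes")
  }
}
```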

Changes

Cohort / File(s) Summary
AST String Representation
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala, cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala
toString implementations now render the raw field name and no longer camel-case redefined names; the camelCased field and the camelCase() helper were removed.
Dependee Model
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/DependingOnField.scala
New case class DependingOnField(cobolField: Primitive, var baseOffset: Int) introduced to represent OCCURS DEPENDING ON targets.
Writer AST Types
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
Added PrimitiveDependeeField node; PrimitiveArray and GroupArray now accept optional dependingOn references; added documentation.
AST Construction & RDD Processing
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
Major refactor: added getAst(copybook) normalization; constructWriterAst and builder methods now accept and populate a dependeeMap; processRDD/writeToBytes updated to support variable-length OCCURS, RDW length adjustments, and dependee resolution; new PrimitiveDependeeField handling wired through generation and writing.
Writer Validation
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
Added SLF4J logger and new overload validateOrThrow(parameters: Map[String,String], hadoopConf); new validateParametersForWriting(readerParameters) enforces disallowed write options (e.g., occurs_mappings, multi-segment) and warns on incompatible combos.
Streaming Entrypoint
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala
Deserializer now collects parameters into a Map and calls new validateOrThrow(parameters, hadoopConf) before reader initialization.
Tests — Writer & Validation
spark-cobol/src/test/scala/.../NestedWriterSuite.scala, .../VariableLengthEbcdicWriterSuite.scala, .../CobolParametersValidatorSuite.scala
Added copybooks and tests for OCCURS and OCCURS DEPENDING ON (including variable-length occurs), added validator tests for writing constraints, updated expectations and new duplicate-dependee-name test.
CI & Build
.github/workflows/jacoco_check.yml, build.sbt
Minor CI message string-quoting changes and a single-line build manifest change.
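To make the dependeeMap idea in the table above concrete, here is a minimal, self-contained sketch. CountField stands in for the PR's DependingOnField(cobolField: Primitive, var baseOffset: Int); everything else is invented for illustration.

```scala
import scala.collection.mutable

object DependeeMapSketch {
  // Simplified stand-in for DependingOnField: the mutable baseOffset is
  // updated per row before the dependee value is encoded.
  final case class CountField(name: String, var baseOffset: Int)

  def main(args: Array[String]): Unit = {
    val dependeeMap = mutable.Map.empty[String, CountField]

    // Registered once while the writer AST is constructed:
    dependeeMap("DEPT-CNT") = CountField("DEPT-CNT", baseOffset = -1)

    // Per row while writing: record where the count field sits, then
    // write the actual number of array elements at that offset.
    val spec = dependeeMap("DEPT-CNT")
    spec.baseOffset = 10         // offset computed for the current record
    val elementsInRow = 3        // size of the DEPT array in this row
    println(s"encode $elementsInRow at offset ${spec.baseOffset} for ${spec.name}")
  }
}
```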

Sequence Diagram

sequenceDiagram
    participant User
    participant Writer
    participant Combiner
    participant Builder
    participant DepMap
    participant Processor

    User->>Writer: combine(rdd, copybook, schema)
    Writer->>Combiner: constructWriterAst(copybook, schema)
    Combiner->>DepMap: create empty dependeeMap
    Combiner->>Builder: buildGroupField(..., dependeeMap)
    Builder->>Builder: buildPrimitiveNode(..., dependeeMap)
    alt field has DEPENDING ON
        Builder->>DepMap: register DependingOnField(name -> spec)
        Builder-->>Combiner: emit PrimitiveDependeeField / arrays with dependingOn
    end
    Combiner-->>Writer: return root WriterAst
    Writer->>Processor: processRDD(..., variableSizeOccurs)
    Processor->>Processor: iterate rows, call writeToBytes(ast,..., variableLengthOccurs)
    rect rgba(100,200,100,0.5)
        Processor->>DepMap: consult dependeeMap for dependee values
        Processor->>Processor: compute dynamic OCCURS sizes, write arrays accordingly
    end
    Processor-->>Writer: RDD[Array[Byte]]
    Writer-->>User: output RDD
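End to end, the flow in the diagram corresponds roughly to a write invocation like the following sketch. The option names (record_format, variable_size_occurs, copybook) come from this PR's discussion; the copybook path and data are invented.

```scala
import org.apache.spark.sql.SparkSession

object VariableOccursWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("variable-occurs-write-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented data: each row carries a differently sized DEPT array.
    val df = Seq(
      ("C001", Seq("SALES", "HR")),
      ("C002", Seq("ENG"))
    ).toDF("COMPANY_ID", "DEPT")

    // Assumed behavior per the walkthrough above: 'V' adds RDW headers,
    // and variable_size_occurs trims OCCURS arrays to the actual row size.
    df.write
      .format("cobol")
      .option("copybook", "/path/to/company.cpy")   // invented path
      .option("record_format", "V")
      .option("variable_size_occurs", "true")
      .save("/output/company_ebcdic")
  }
}
```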

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

🐰 I hop through copybooks, names honest and plain,
I map the fields that tell arrays how to grow,
I tuck their specs in a map, ready for the main,
I stitch bytes in order, row by row,
A rabbit’s nibble, and the writer sings hello.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Out of Scope Changes check (⚠️ Warning): Most changes align with the OCCURS DEPENDING ON feature; however, unrelated refactoring includes camelCase removal from Primitive/Statement toString and JaCoCo workflow message formatting. Resolution: separate the camelCase refactoring and workflow formatting changes into a dedicated commit or PR to maintain clear scope alignment with the #797 feature request.
  • Linked Issues check (❓ Inconclusive): The PR addresses the primary coding requirement from #797: support for OCCURS and OCCURS DEPENDING ON in EBCDIC writing. VB file support and CP037 code page support are partially addressed or incomplete. Resolution: verify that VB file support and CP037 code page support are implemented; the PR appears to focus primarily on OCCURS DEPENDING ON handling without clear evidence of the other two requirements.

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main feature: adding support for variable-sized OCCURS when writing EBCDIC files, which directly corresponds to the primary changes in the codebase.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@github-actions

github-actions bot commented Mar 9, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.4% -0.12% 🍏
Files changed 0%

File Coverage
Primitive.scala 71.43% -4.8%
Statement.scala 51.87% -17.65%

@github-actions

github-actions bot commented Mar 9, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 83.77% -6.59% 🍏
Files changed 35.63%

File Coverage
DependingOnField.scala 100%
WriterAst.scala 93.51%
NestedRecordCombiner.scala 88.37% -1.86% 🍏
CobolParametersValidator.scala 85.77% -0.82% 🍏
CobolStreamer.scala 0% -24.89%


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala (1)

119-145: ⚠️ Potential issue | 🟠 Major

Reject variable_size_occurs for fixed-length writes.

Now that variable_size_occurs is allowed for writing, NestedRecordCombiner.processRDD() can emit shorter byte arrays for record_format = F. Without RDW or any other framing, that turns the output into an undecodable flat file as soon as one record shrinks. Please fail this combination here, or keep fixed-length writes untrimmed.

⚙️ Proposed validation
   if (readerParameters.recordFormat != RecordFormat.FixedLength && readerParameters.recordFormat != RecordFormat.VariableLength) {
     issues += s"Only '${RecordFormat.FixedLength}' and '${RecordFormat.VariableLength}' values for 'record_format' are supported for writing, " +
       s"provided value: '${readerParameters.recordFormat}'"
   }
+
+  if (readerParameters.variableSizeOccurs &&
+    readerParameters.recordFormat != RecordFormat.VariableLength) {
+    issues += "'variable_size_occurs' is supported for writing only when 'record_format' is 'V'"
+  }
 
   if (readerParameters.occursMappings.nonEmpty) {
     issues += "OCCURS mapping option ('occurs_mappings') is not supported for writing"
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala`
around lines 119 - 145, The validator must reject the combination of
fixed-length output and variable-size OCCURS; update
validateParametersForWriting (in CobolParametersValidator) to add a check that
if readerParameters.recordFormat == RecordFormat.FixedLength and the
variable-size-occurs option is enabled (readerParameters.variableSizeOccurs /
variable_size_occurs flag on ReaderParameters), append an error like
"variable_size_occurs is not supported for fixed-length writes" to issues so the
IllegalArgumentException is thrown; this ensures fixed-length writes remain
untrimmed and prevents emitting shorter byte arrays.
🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala (1)

134-160: Add a VB rejection regression here too.

This negative-path test currently proves FB is rejected, but not VB. Since issue #797 explicitly mentions VB, a dedicated record_format = "VB" assertion would make the intended writer contract much clearer and protect it from regressions.

Based on learnings, variable-block format writing will not be implemented in the writer; only VariableLength uses RDW headers.
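A sketch of the suggested VB assertion might look like this. The copybook, data, and suite scaffolding are invented; the expected message text is quoted from the validator change discussed in this PR.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class VbRejectionSuiteSketch extends AnyFunSuite {
  test("writer rejects record_format = VB") {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Minimal invented copybook and data for the negative-path write.
    val copybook =
      """       01  RECORD.
        |           05  A  PIC X(4).
        |""".stripMargin
    val df = Seq("ABCD").toDF("A")
    val path = java.nio.file.Files.createTempDirectory("vb_test").toString + "/out"

    val ex = intercept[IllegalArgumentException] {
      df.write
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("record_format", "VB")
        .save(path)
    }
    assert(ex.getMessage.contains("provided value: 'VB'"))
  }
}
```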

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala`
around lines 134 - 160, The test in VariableLengthEbcdicWriterSuite ("throw an
exception on unexpected output record format") only asserts that "FB" is
rejected but misses asserting that "VB" is rejected; add a second negative-path
check (or extend the existing intercept) that writes with
.option("record_format", "VB") using the same df/path setup and asserts the
thrown IllegalArgumentException message includes the same "Only 'F' and 'V'
values for 'record_format' are supported for writing, provided value: 'VB'" text
so VB regressions are caught (refer to the test block variables df, path and the
intercept/exception handling used in the current test).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 418-426: The null-array branches in NestedRecordCombiner are
leaving DEPENDING ON fields as raw zeros in fixed-length mode because
Copybook.setPrimitiveField is only invoked when variableLengthOccurs is true;
update both null branches (around the blocks that now return 0 and
cobolField.binaryProperties.actualSize, also the similar case at lines 459-466)
to always call Copybook.setPrimitiveField(spec.cobolField, ar, 0,
fieldStartOffsetOverride = spec.baseOffset) for each spec in dependingOn before
returning, so the dependee PrimitiveDependeeField gets an encoded COBOL zero
even when the array is null and variableLengthOccurs is false.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b88d503c-8012-4362-bbf1-6c0ad43aa41d

📥 Commits

Reviewing files that changed from the base of the PR and between a76495d and 94beb4b.

📒 Files selected for processing (8)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/DependingOnField.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

370-373: Mutable state in shared AST node is safe but fragile.

DependingOnField.baseOffset is mutated at line 372 for every row within the same partition. This works correctly because:

  1. Rows are processed sequentially within mapPartitions
  2. Each row gets a fresh byte array
  3. baseOffset is always updated before being read in the subsequent dependingOn.foreach calls

However, this design is fragile—if the processing model ever changes to concurrent row handling, this will silently corrupt data. Consider adding a brief comment explaining why this mutation is safe.

📝 Suggested documentation
       // ── Primitive which has an OCCURS DEPENDS ON ─────────────────────────────
       case PrimitiveDependeeField(spec) =>
+        // NOTE: baseOffset is mutated here for each row. This is safe because rows
+        // are processed sequentially within mapPartitions, and the offset is always
+        // updated before being read in subsequent array-element writes.
         spec.baseOffset = currentOffset
         spec.cobolField.binaryProperties.actualSize
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 370 - 373, The mutation of DependingOnField.baseOffset inside the
PrimitiveDependeeField match is fragile; add a concise comment near the
assignment (spec.baseOffset = currentOffset) explaining that this is safe
because rows are processed sequentially within mapPartitions, each row uses a
fresh byte array, and baseOffset is always updated before any
dependingOn.foreach reads it, and note that concurrent processing would break
this invariant; reference PrimitiveDependeeField, DependingOnField.baseOffset,
mapPartitions, and dependingOn.foreach in the comment so future maintainers
understand the assumption.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7883dcea-55d1-4916-8681-7c58385e6b5c

📥 Commits

Reviewing files that changed from the base of the PR and between e08fbb5 and bfa90d6.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidatorSuite.scala

}.getOrElse {
log.error(s"Field '$path${p.name}' is not found in Spark schema. Will be replaced by filler.")
Filler(p.binaryProperties.actualSize)
// Dependee fields need not to be defines in Spark schema.


⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

"defines" should be "defined".

✏️ Fix typo
-      // Dependee fields need not to be defines in Spark schema.
+      // Dependee fields need not be defined in Spark schema.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      // Dependee fields need not to be defines in Spark schema.
+      // Dependee fields need not be defined in Spark schema.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
at line 281, Fix the comment typo inside NestedRecordCombiner.scala near the
comment above the dependee handling: change "defines" to "defined" in the
comment "// Dependee fields need not to be defines in Spark schema." so it reads
"// Dependee fields need not to be defined in Spark schema." to correct the
spelling in the NestedRecordCombiner class.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala (1)

467-476: Reduce coupling to AST layout and exact exception wording.

This test depends on children(5) and the full message with parser line numbers, so harmless AST or wording refactors will break it even if duplicate-dependee validation still works. Prefer selecting the second CNT1 by name and asserting only the stable part of the error.

♻️ Suggested cleanup
-      val ast = parsedCopybook.ast
-      val children = ast.children.head.asInstanceOf[Group].children
-      val cnt2 = children(5).asInstanceOf[Primitive].withUpdatedIsDependee(true)
-      children(5) = cnt2
+      val record = parsedCopybook.ast.children.head.asInstanceOf[Group]
+      val cnt1Indexes = record.children.zipWithIndex.collect {
+        case (p: Primitive, idx) if p.name.equalsIgnoreCase("CNT1") => idx
+      }
+      assert(cnt1Indexes.size == 2)
+      val secondCnt1Index = cnt1Indexes(1)
+      record.children(secondCnt1Index) =
+        record.children(secondCnt1Index).asInstanceOf[Primitive].withUpdatedIsDependee(true)

       val ex = intercept[IllegalArgumentException] {
         NestedRecordCombiner.constructWriterAst(parsedCopybook, df.schema)
       }

-      assert(ex.getMessage == "Duplicate field name 'CNT1' found in copybook. Field names must be unique (case-insensitive) when OCCURS DEPENDING ON is used. Already found a dependee field with the same name at line 4, current field line number: 10.")
+      assert(ex.getMessage.contains("Duplicate field name 'CNT1' found in copybook."))
+      assert(ex.getMessage.contains("Field names must be unique (case-insensitive) when OCCURS DEPENDING ON is used."))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala`
around lines 467 - 476, The test in NestedWriterSuite is brittle because it
relies on AST index children(5) and exact exception text; update the test so it
finds the second CNT1 node by name (e.g., walk parsedCopybook.ast to locate the
Primitive node whose name equals "CNT1" and then call withUpdatedIsDependee on
that node) instead of using children(5), and change the assertion on the thrown
IllegalArgumentException from exact string equality to a contains check for the
stable fragment (for example verify ex.getMessage contains "Duplicate field name
'CNT1'") while keeping the call to
NestedRecordCombiner.constructWriterAst(parsedCopybook, df.schema).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 889a3459-b227-4f39-aa31-c97cb5aed499

📥 Commits

Reviewing files that changed from the base of the PR and between bfa90d6 and 01da263.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala

Comment on lines +153 to +156
if (readerParameters.variableSizeOccurs &&
readerParameters.recordFormat == RecordFormat.FixedLength) {
log.warn("Option 'variable_size_occurs=true' is used with 'record_format=F' which means records can have variable length. It is highly recommended to use 'record_format=V' instead.")
}


⚠️ Potential issue | 🟠 Major

Reject variable_size_occurs for fixed-length writes.

This combination can emit variable-sized records while spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala:45-76 only adds RDW framing for RecordFormat.VariableLength. With record_format=F, downstream consumers lose unambiguous record boundaries, so this should fail validation instead of warning.

🛑 Suggested validation change
-    if (readerParameters.variableSizeOccurs &&
-      readerParameters.recordFormat == RecordFormat.FixedLength) {
-      log.warn("Option 'variable_size_occurs=true' is used with 'record_format=F' which means records can have variable length. It is highly recommended to use 'record_format=V' instead.")
-    }
+    if (readerParameters.variableSizeOccurs &&
+      readerParameters.recordFormat == RecordFormat.FixedLength) {
+      issues += "Option 'variable_size_occurs=true' requires 'record_format=V' for writing"
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala`
around lines 153 - 156, The validator currently only logs a warning when
readerParameters.variableSizeOccurs is true with readerParameters.recordFormat
== RecordFormat.FixedLength; change this to reject the configuration by
throwing/returning a validation error in CobolParametersValidator (instead of
log.warn) when variableSizeOccurs is set with RecordFormat.FixedLength,
referencing the same flags (variableSizeOccurs and recordFormat) and
RecordFormat.FixedLength to locate the check; include a clear error message
stating that variable-size occurs cannot be used with fixed-length writes
because RDW framing is only added for RecordFormat.VariableLength (see
NestedRecordCombiner handling), so validation must fail.

Comment on lines +42 to +47
val parameters: Map[String, String] = Map[String, String](
PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
)

CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Apache Spark's Scala API, what does SparkConf.get(key) do when the key is missing, and how does that differ from SparkConf.getOption(key)?

💡 Result:

  • SparkConf.get(key) returns the String value if set, but if the key is missing it throws java.util.NoSuchElementException(key). [1]
  • SparkConf.getOption(key) returns an Option[String]: Some(value) if present, otherwise None (no exception). [1]

Sources:
[1] Apache Spark source (SparkConf.scala, v3.5.6) (github.com)


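The difference the query result describes can be shown with a tiny standalone snippet. The configuration keys below are made up for illustration; Cobrix's real PARAM_COPYBOOK_PATH and PARAM_SOURCE_PATH constants hold the actual key names.

```scala
import org.apache.spark.SparkConf

object SparkConfLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false)
      .set("example.copybook.path", "/copybooks/company.cpy")

    // getOption never throws: Some(value) when present, None when absent.
    println(conf.getOption("example.copybook.path"))
    println(conf.getOption("example.source.path"))

    // get throws java.util.NoSuchElementException for a missing key,
    // which is why calling it before validation masks the validator's
    // intended error messages.
    try conf.get("example.source.path")
    catch {
      case e: java.util.NoSuchElementException =>
        println(s"missing key: ${e.getMessage}")
    }
  }
}
```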

🏁 Script executed:

# Locate and inspect CobolStreamer.scala around lines 42-47
fd "CobolStreamer.scala" --type f

Repository: AbsaOSS/cobrix

Length of output: 1855


🏁 Script executed:

# Inspect CobolParametersValidator to understand error message design
fd "CobolParametersValidator.scala" --type f

Repository: AbsaOSS/cobrix

Length of output: 163


🏁 Script executed:

# Find CobolStreamer.scala
fd "CobolStreamer.scala" --path "*/spark-cobol/*"

Repository: AbsaOSS/cobrix

Length of output: 273


🏁 Script executed:

# Read CobolStreamer.scala around lines 42-47
find . -path "*/spark-cobol/*" -name "CobolStreamer.scala" -exec cat -n {} \; | sed -n '35,55p'

Repository: AbsaOSS/cobrix

Length of output: 1236


🏁 Script executed:

# Read CobolParametersValidator to understand error messages
cat -n spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala | head -150

Repository: AbsaOSS/cobrix

Length of output: 8759


Don't call SparkConf.get() before validation.

These lookups throw NoSuchElementException before CobolParametersValidator.validateOrThrow() runs (line 47). Users see a raw config exception instead of the validator's intended error messages for missing required parameters.

Suggested fix
   implicit class Deserializer(`@transient` val ssc: StreamingContext) extends Serializable {
-    val parameters: Map[String, String] = Map[String, String](
-      PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
-      PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
-    )
+    private val conf = ssc.sparkContext.getConf
+    val parameters: Map[String, String] =
+      Seq(PARAM_COPYBOOK_PATH, PARAM_SOURCE_PATH)
+        .flatMap(key => conf.getOption(key).map(value => key -> value))
+        .toMap

     CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
📝 Committable suggestion


Suggested change
-    val parameters: Map[String, String] = Map[String, String](
-      PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
-      PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
-    )
-    CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
+    private val conf = ssc.sparkContext.getConf
+    val parameters: Map[String, String] =
+      Seq(PARAM_COPYBOOK_PATH, PARAM_SOURCE_PATH)
+        .flatMap(key => conf.getOption(key).map(value => key -> value))
+        .toMap
+    CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala`
around lines 42 - 47, The code currently calls
ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH) and get(PARAM_SOURCE_PATH)
which throw NoSuchElementException before
CobolParametersValidator.validateOrThrow runs; change the parameter extraction
in CobolStreamer: replace direct get(...) calls with getOption(...) and supply a
safe default (e.g., .getOrElse("") ) when building the parameters Map for
PARAM_COPYBOOK_PATH and PARAM_SOURCE_PATH so that
CobolParametersValidator.validateOrThrow(parameters,
ssc.sparkContext.hadoopConfiguration) can run and produce its intended
validation errors instead of a raw exception.



Development

Successfully merging this pull request may close these issues.

ASCII to EBCDIC support with OCCURS and OCCURS+DEPENDING ON

1 participant