#826 Add support for writing VRL files with RDW headers (#827)
Conversation
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough
Adds RDW (4-byte Record Description Word) writing support and writer-side parameter validation; changes Copybook.setPrimitiveField to accept configuredStartOffset and fieldStartOffsetOverride; sets concise RecordFormat string codes; integrates RDW-aware encoding in BasicRecordCombiner and adds variable-length EBCDIC writer tests.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant DefaultSource as DefaultSource
    participant Validator as CobolParametersValidator
    participant Combiner as BasicRecordCombiner
    participant Copybook as Copybook
    DefaultSource->>Validator: validateParametersForWriting(readerParameters)
    activate Validator
    Validator->>Validator: check recordFormat, variable_size_occurs, offsets
    alt invalid
        Validator-->>DefaultSource: throw IllegalArgumentException
    else valid
        Validator-->>DefaultSource: return
    end
    deactivate Validator
    DefaultSource->>Combiner: combine(records, readerParameters)
    activate Combiner
    Combiner->>Combiner: Detect RDW & endianness
    alt RDW present
        Combiner->>Combiner: Compute record length, validate
        Combiner->>Combiner: Write 4-byte RDW header (big/little-endian)
        Combiner->>Combiner: set startOffset = 4
    else No RDW
        Combiner->>Combiner: set startOffset = 0
    end
    loop for each field
        Combiner->>Copybook: setPrimitiveField(field, outBytes, value, configuredStartOffset, fieldStartOffsetOverride)
        activate Copybook
        Copybook->>Copybook: use override if > 0 else configuredStartOffset
        Copybook-->>Combiner: field bytes written
        deactivate Copybook
    end
    Combiner-->>DefaultSource: return output byte array
    deactivate Combiner
```
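The "Write 4-byte RDW header" step in the flow above can be sketched as a small standalone function. This is a minimal illustration only: `RdwSketch` and `encodeRdw` are hypothetical names, not the Cobrix API, and the byte layout follows the ranges stated later in this review (big-endian packs the length into the first two bytes with a 65535 cap; little-endian stores a 4-byte little-endian integer).

```scala
object RdwSketch {
  // Encode a record length into a 4-byte RDW header.
  // Assumptions (from the review discussion, not verified against Cobrix):
  // - big-endian: length in bytes 0-1, bytes 2-3 reserved as zero, max 0xFFFF
  // - little-endian: length as a 4-byte little-endian integer, max 0xFFFFFFFF
  def encodeRdw(recordLength: Long, bigEndian: Boolean): Array[Byte] = {
    val bytes = new Array[Byte](4)
    if (bigEndian) {
      require(recordLength >= 0 && recordLength <= 0xFFFFL,
        s"RDW length $recordLength out of range for big-endian mode")
      bytes(0) = ((recordLength >> 8) & 0xFF).toByte
      bytes(1) = (recordLength & 0xFF).toByte
    } else {
      require(recordLength >= 0 && recordLength <= 0xFFFFFFFFL,
        s"RDW length $recordLength out of range")
      var i = 0
      while (i < 4) {
        bytes(i) = ((recordLength >> (8 * i)) & 0xFF).toByte
        i += 1
      }
    }
    bytes
  }
}
```

After the header, the combiner writes field bytes at `startOffset = 4`, which is why the diagram branches on RDW presence before field encoding.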
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala (1)
85-100: Test intent and configuration are slightly out of sync.
The test name says “RDW adjustment,” but no rdw_adjustment option is set. Either rename the test or add that option so coverage matches the name.
Suggested minimal fix (rename for accuracy):

```diff
- "write simple variable -record-length EBCDIC data files with little-endian RDWs and RDW adjustment" in {
+ "write simple variable-record-length EBCDIC data files with little-endian RDWs and record-length-includes-RDW" in {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala` around lines 85 - 100, The test name claims “RDW adjustment” but the write options do not set rdw_adjustment; either update the test description string in the test case starting with "write simple variable -record-length EBCDIC data files with little-endian RDWs and RDW adjustment" to remove the “RDW adjustment” phrase, or add the missing writer option by calling .option("rdw_adjustment", "<value>") on the DataFrameWriter (alongside .option("is_rdw_big_endian", "false") and .option("is_rdw_part_of_record_length", "true")) so the configuration matches the test name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala`:
- Around line 82-95: Validate recordLength computed from
cobolSchema.getRecordSize + adjustment1 + adjustment2 before writing into ar:
ensure it is non-negative and within allowed range for the chosen header format
(if isRdwBigEndian then recordLength must be <= 0xFFFF; if little-endian then
must be within 0..0xFFFFFFFF), and throw a clear IllegalArgumentException (or
similar) referencing recordLength and the offending adjustments when the check
fails. Do this validation in the method that contains the ar[] writes (the block
using recordLength and isRdwBigEndian) so serialization never writes
truncated/negative RDW values.
---
Nitpick comments:
In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala`:
- Around line 85-100: The test name claims “RDW adjustment” but the write
options do not set rdw_adjustment; either update the test description string in
the test case starting with "write simple variable -record-length EBCDIC data
files with little-endian RDWs and RDW adjustment" to remove the “RDW adjustment”
phrase, or add the missing writer option by calling .option("rdw_adjustment",
"<value>") on the DataFrameWriter (alongside .option("is_rdw_big_endian",
"false") and .option("is_rdw_part_of_record_length", "true")) so the
configuration matches the test name.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/recordformats/RecordFormat.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)
81-93: Move RDW validation outside the map for efficiency.
The RDW length computation and validation depend only on cobolSchema and readerParameters, which are constant across all rows. Performing this validation inside df.rdd.map repeats the same check for every row.
♻️ Suggested refactor:

```diff
  val startOffset = if (hasRdw) 4 else 0
+ val recordLength = if (hasRdw) {
+   val recordLengthLong = cobolSchema.getRecordSize.toLong + adjustment1.toLong + adjustment2.toLong
+   if (recordLengthLong < 0) {
+     throw new IllegalArgumentException(
+       s"Invalid RDW length $recordLengthLong. Check 'is_rdw_part_of_record_length' and 'rdw_adjustment'."
+     )
+   }
+   if (isRdwBigEndian && recordLengthLong > 0xFFFFL) {
+     throw new IllegalArgumentException(
+       s"RDW length $recordLengthLong exceeds 65535 and cannot be encoded in big-endian mode."
+     )
+   }
+   recordLengthLong.toInt
+ } else 0
+
  df.rdd.map { row =>
    val ar = new Array[Byte](size)
    if (hasRdw) {
-     val recordLengthLong = cobolSchema.getRecordSize.toLong + adjustment1.toLong + adjustment2.toLong
-     if (recordLengthLong < 0) {
-       throw new IllegalArgumentException(
-         s"Invalid RDW length $recordLengthLong. Check 'is_rdw_part_of_record_length' and 'rdw_adjustment'."
-       )
-     }
-     if (isRdwBigEndian && recordLengthLong > 0xFFFFL) {
-       throw new IllegalArgumentException(
-         s"RDW length $recordLengthLong exceeds 65535 and cannot be encoded in big-endian mode."
-       )
-     }
-     val recordLength = recordLengthLong.toInt
      if (isRdwBigEndian) {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala` around lines 81 - 93, The RDW length calculation and validations currently inside the df.rdd.map should be hoisted out and performed once before the map: compute recordLengthLong = cobolSchema.getRecordSize.toLong + adjustment1.toLong + adjustment2.toLong, validate it (throw on negative or on >0xFFFFL when isRdwBigEndian), then set recordLength = recordLengthLong.toInt and capture recordLength (and any flags like hasRdw) in local vals used inside BasicRecordCombiner’s mapping logic so the per-row map uses the precomputed recordLength instead of recalculating and revalidating for every row.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala`:
- Line 65: Update the RDW detection in BasicRecordCombiner by expanding the
hasRdw condition to include RecordFormat.VariableBlock; specifically modify the
logic that computes hasRdw from checking only readerParameters.recordFormat ==
RecordFormat.VariableLength to also consider readerParameters.recordFormat ==
RecordFormat.VariableBlock so VariableBlock files get RDW headers just like
VariableLength files (look for the hasRdw variable and the
readerParameters.recordFormat / RecordFormat enum references).
---
Nitpick comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala`:
- Around line 81-93: The RDW length calculation and validations currently inside
the df.rdd.map should be hoisted out and performed once before the map: compute
recordLengthLong = cobolSchema.getRecordSize.toLong + adjustment1.toLong +
adjustment2.toLong, validate it (throw on negative or on >0xFFFFL when
isRdwBigEndian), then set recordLength = recordLengthLong.toInt and capture
recordLength (and any flags like hasRdw) in local vals used inside
BasicRecordCombiner’s mapping logic so the per-row map uses the precomputed
recordLength instead of recalculating and revalidating for every row.
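The hoisting pattern described above can be illustrated without Spark: compute and validate the invariant once, then capture it in the per-row closure. This is a sketch under stated assumptions — `HoistSketch` and its `combine` signature are illustrative names, not the Cobrix writer, and the payload bytes are left zeroed for brevity.

```scala
object HoistSketch {
  def combine(recordSize: Int, adjustment: Int, rows: Seq[String]): Seq[Array[Byte]] = {
    // Hoisted outside the per-row loop: computed and validated exactly once,
    // since recordSize and adjustment do not vary between rows.
    val recordLengthLong = recordSize.toLong + adjustment.toLong
    require(recordLengthLong >= 0 && recordLengthLong <= 0xFFFFL,
      s"Invalid RDW length $recordLengthLong")
    val recordLength = recordLengthLong.toInt

    rows.map { _ =>
      // 4-byte big-endian RDW header followed by a zeroed payload
      // (field encoding omitted for brevity).
      val out = new Array[Byte](4 + recordLength)
      out(0) = ((recordLength >> 8) & 0xFF).toByte
      out(1) = (recordLength & 0xFF).toByte
      out
    }
  }
}
```

In the real combiner the closure runs on executors, so hoisting also moves the failure to driver-side job setup instead of a task failure on the first row.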
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala
🚧 Files skipped from review as they are similar to previous changes (1)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala
Quoted code context (BasicRecordCombiner.scala):

```scala
    }

    val size = cobolSchema.getRecordSize
    val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# First, find the RecordFormat enum definition
fd -e scala -x grep -l "enum.*RecordFormat\|object RecordFormat" {} \;
```
Repository: AbsaOSS/cobrix
Length of output: 151

🏁 Script executed:

```shell
#!/bin/bash
# Search for VariableBlock usage in the codebase
rg "VariableBlock" --type scala -B2 -A2
```
Repository: AbsaOSS/cobrix
Length of output: 13006

🏁 Script executed:

```shell
#!/bin/bash
# Check how RDW is handled in reader parameters across the codebase
rg "hasRdw|rdw" --type scala -i -B2 -A2 | head -100
```
Repository: AbsaOSS/cobrix
Length of output: 13042

🏁 Script executed:

```shell
cat -n spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala | sed -n '55,85p'
```
Repository: AbsaOSS/cobrix
Length of output: 1375

🏁 Script executed:

```shell
# Search for how VariableBlock is handled in BasicRecordCombiner context
rg "VariableBlock|VariableLength|hasRdw" spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala -B2 -A2
```
Repository: AbsaOSS/cobrix
Length of output: 681

🏁 Script executed:

```shell
# Check if there are other uses of RDW checks for VariableBlock vs VariableLength
rg "recordFormat.*==.*Variable" spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/ -A1
```
Repository: AbsaOSS/cobrix
Length of output: 372
Add VariableBlock to the RDW check.
Line 65 should also check for RecordFormat.VariableBlock since VB files use RDW headers for individual records within blocks, similar to V format files. Change:

```scala
val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength
```

to:

```scala
val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength ||
  readerParameters.recordFormat == RecordFormat.VariableBlock
```

Without this, VariableBlock files will be written without proper RDW headers, causing downstream reading failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala`
at line 65, Update the RDW detection in BasicRecordCombiner by expanding the
hasRdw condition to include RecordFormat.VariableBlock; specifically modify the
logic that computes hasRdw from checking only readerParameters.recordFormat ==
RecordFormat.VariableLength to also consider readerParameters.recordFormat ==
RecordFormat.VariableBlock so VariableBlock files get RDW headers just like
VariableLength files (look for the hasRdw variable and the
readerParameters.recordFormat / RecordFormat enum references).
This is not applicable here. The Cobol writer is not going to support variable-block headers.
@yruslan, thank you for the clarification! I understand that variable-block format support is intentionally not included in the writer implementation. I've noted this design decision for future reference.
✏️ Learnings added
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 827
File: spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala:65-65
Timestamp: 2026-03-03T08:34:52.931Z
Learning: In the COBOL writer (BasicRecordCombiner), variable-block (VB) format is not supported for writing. Only VariableLength format uses RDW headers in the writer; VariableBlock format will not be implemented.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 799
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala:83-103
Timestamp: 2025-11-13T14:28:55.864Z
Learning: In Cobrix, RDW (Record Descriptor Word) length field contains only the payload size, not including the 4-byte RDW header itself. This is a historical convention. When reading files with RDW that includes the header size, users should use `option("rdw_adjustment", "-4")`.
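The learning above states the Cobrix convention: the RDW length field holds only the payload size, and `rdw_adjustment = -4` compensates when a file's RDW includes the 4-byte header itself. The arithmetic can be shown with a tiny helper — `RdwAdjustmentSketch` and `effectivePayloadLength` are illustrative names, not part of Cobrix:

```scala
object RdwAdjustmentSketch {
  // Payload size the reader should use, given the raw length stored in the
  // RDW and the configured rdw_adjustment. For example, a file whose RDW
  // stores header + payload (104 for a 100-byte record) needs adjustment -4.
  def effectivePayloadLength(storedRdwLength: Int, rdwAdjustment: Int): Int =
    storedRdwLength + rdwAdjustment
}
```

With the default adjustment of 0 the stored value is used as-is, matching the historical payload-only convention.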
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala:435-436
Timestamp: 2026-02-11T08:01:53.018Z
Learning: In `cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`, the `extractHierarchicalRecord` method hardcodes `generateCorruptedFields = false` because validation in `CobolParametersParser` throws an `IllegalArgumentException` when `generate_corrupted_fields` is used with segment-children (hierarchical data), preventing users from enabling this unsupported combination.
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala:50-50
Timestamp: 2026-02-11T08:15:51.206Z
Learning: Rename the field from _corrupted_fields to _corrupt_fields to align with Apache Spark's naming convention (e.g., _corrupt_record). Apply this change consistently across the codebase where the field is defined or referenced (including cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala) and update any tests or serialization logic that rely on the old name.
Actionable comments posted: 1
♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)
65-65: ⚠️ Potential issue | 🟠 Major
Handle VariableBlock explicitly in RDW detection (or reject it upstream).
hasRdw only matches RecordFormat.VariableLength. If VariableBlock is allowed by writer validation, records from this combiner will miss RDW handling.
💡 Proposed fix:

```diff
- val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength
+ val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength ||
+   readerParameters.recordFormat == RecordFormat.VariableBlock
```

To verify whether this is already blocked upstream, run:

```shell
#!/bin/bash
# Verify writer-side validation and local RDW gating for VariableBlock.
fd 'CobolParametersValidator.scala' -e scala -x rg -nP -C3 'VariableBlock|VariableLength|recordFormat|rdw|writer' {}
rg -nP --type scala -C3 'hasRdw|RecordFormat\.VariableLength|RecordFormat\.VariableBlock' spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala` at line 65, The RDW detection in BasicRecordCombiner currently sets hasRdw only when readerParameters.recordFormat == RecordFormat.VariableLength, which misses RecordFormat.VariableBlock; update the logic handling hasRdw (or add an explicit guard) to treat VariableBlock appropriately (either include RecordFormat.VariableBlock in the RDW-enabled branch or assert/reject VariableBlock upstream during writer validation), and ensure any downstream RDW-dependent code paths reference the updated hasRdw flag or the new explicit check so VariableBlock records are either processed correctly or blocked with a clear validation error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala`:
- Around line 78-90: The code in BasicRecordCombiner computes recordLengthLong
and then narrows it to an Int (recordLength), which can silently overflow; add a
guard before the toInt conversion that throws an IllegalArgumentException if
recordLengthLong > Int.MaxValue (and keep the existing big-endian > 65535
check), so that when isRdwBigEndian is false and the little-endian path uses
4-byte RDW it still fails fast for lengths exceeding Int.MaxValue instead of
producing a wrapped recordLength value.
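The overflow the comment above warns about is easy to demonstrate: `Long.toInt` in Scala silently keeps only the low 32 bits, so a guard must run before the narrowing. A minimal sketch of such a guard (`NarrowingSketch` and `toIntChecked` are illustrative names, not the Cobrix code):

```scala
object NarrowingSketch {
  // Fail fast instead of letting Long -> Int narrowing wrap silently.
  // E.g. 4294967301L (0x100000005) would otherwise narrow to 5.
  def toIntChecked(recordLengthLong: Long): Int = {
    if (recordLengthLong < 0 || recordLengthLong > Int.MaxValue)
      throw new IllegalArgumentException(
        s"Record length $recordLengthLong does not fit into an Int")
    recordLengthLong.toInt
  }
}
```

Placing this check next to the existing big-endian 65535 check keeps both failure modes as clear IllegalArgumentExceptions rather than corrupted RDW values.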
---
Duplicate comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala`:
- Line 65: The RDW detection in BasicRecordCombiner currently sets hasRdw only
when readerParameters.recordFormat == RecordFormat.VariableLength, which misses
RecordFormat.VariableBlock; update the logic handling hasRdw (or add an explicit
guard) to treat VariableBlock appropriately (either include
RecordFormat.VariableBlock in the RDW-enabled branch or assert/reject
VariableBlock upstream during writer validation), and ensure any downstream
RDW-dependent code paths reference the updated hasRdw flag or the new explicit
check so VariableBlock records are either processed correctly or blocked with a
clear validation error.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
Closes #826
Summary by CodeRabbit
New Features
Bug Fixes / Validation
Tests
Style / Docs