#481 Fix ASCII control characters handling policy. Add 'keep_all' string trimming policy.
yruslan committed Mar 25, 2022
1 parent 746a199 commit df9d579
Showing 12 changed files with 199 additions and 92 deletions.
24 changes: 14 additions & 10 deletions README.md
@@ -1203,17 +1203,17 @@ Again, the full example is available at

##### Data parsing options

-| Option (usage example) | Description |
-| ------------------------------------------ |:----------------------------------------------------------------------------- |
-| .option("string_trimming_policy", "both") | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`. |
-| .option("ebcdic_code_page", "common") | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, `cp875`. `*_extended` code pages supports non-printable characters that converts to ASCII codes below 32. |
-| .option("ebcdic_code_page_class", "full.class.specifier") | Specifies a user provided class for a custom code page to UNICODE conversion. |
-| .option("ascii_charset", "US-ASCII") | Specifies a charset to use to decode ASCII data. The value can be any charset supported by `java.nio.charset`: `US-ASCII` (default), `UTF-8`, `ISO-8859-1`, etc. |
-| .option("is_utf16_big_endian", "true") | Specifies if UTF-16 encoded strings (`National` / `PIC N` format) are big-endian (default). |
-| .option("floating_point_format", "IBM") | Specifies a floating-point format. Available options: `IBM` (default), `IEEE754`, `IBM_little_endian`, `IEEE754_little_endian`. |
+| Option (usage example) | Description |
+| ------------------------------------------ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| .option("string_trimming_policy", "both") | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`, `keep_all`. `keep_all` - keeps control characters when decoding ASCII text files |
+| .option("ebcdic_code_page", "common") | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, `cp875`. `*_extended` code pages supports non-printable characters that converts to ASCII codes below 32. |
+| .option("ebcdic_code_page_class", "full.class.specifier") | Specifies a user provided class for a custom code page to UNICODE conversion. |
+| .option("ascii_charset", "US-ASCII") | Specifies a charset to use to decode ASCII data. The value can be any charset supported by `java.nio.charset`: `US-ASCII` (default), `UTF-8`, `ISO-8859-1`, etc. |
+| .option("is_utf16_big_endian", "true") | Specifies if UTF-16 encoded strings (`National` / `PIC N` format) are big-endian (default). |
+| .option("floating_point_format", "IBM") | Specifies a floating-point format. Available options: `IBM` (default), `IEEE754`, `IBM_little_endian`, `IEEE754_little_endian`. |
| .option("variable_size_occurs", "false") | If `false` (default) fields that have `OCCURS 0 TO 100 TIMES DEPENDING ON` clauses always have the same size corresponding to the maximum array size (e.g. 100 in this example). If set to `true` the size of the field will shrink for each field that has less actual elements. |
-| .option("occurs_mapping", "{\"FIELD\": {\"X\": 1}}") | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping. |
-| .option("improved_null_detection", "false") | If `true`, values that contain only 0x0 ror DISPLAY strings and numbers will be considered `null`s instead of empty strings. |
+| .option("occurs_mapping", "{\"FIELD\": {\"X\": 1}}") | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping. |
+| .option("improved_null_detection", "false") | If `true`, values that contain only 0x0 ror DISPLAY strings and numbers will be considered `null`s instead of empty strings. |
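
A minimal sketch of how the new `keep_all` value from the table might be supplied together with `ascii_charset`. The object name `KeepAllOptionsSketch` is hypothetical. The commented-out reader call is an assumption: it presumes Spark and Cobrix are on the classpath and that the data source is registered as `cobol`, neither of which is shown in this diff.

```scala
object KeepAllOptionsSketch {
  // Option names and values are the ones listed in the table above.
  val parsingOptions: Map[String, String] = Map(
    "string_trimming_policy" -> "keep_all", // keep control characters, no trimming
    "ascii_charset"          -> "US-ASCII"  // default charset per the table
  )

  // Assumed usage (not runnable here; copybook and input path omitted):
  //   spark.read.format("cobol").options(parsingOptions) ...
}
```

Collecting the options in a `Map` keeps the parsing configuration in one place, so switching between `both` and `keep_all` is a single-value change.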

##### Modifier options

@@ -1398,6 +1398,10 @@ at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:608)
A: Update hadoop dll to version 3.2.2 or newer.

## Changelog
+- #### 2.4.10 will be released soon.
+- [#481](https://github.com/AbsaOSS/cobrix/issues/481) ASCII control characters are now ignored instead of being replaced with spaces.
+A new string trimming policy (`keep_all`) allows keeping all control characters in strings (including `0x00`).
+
- #### 2.4.9 released 4 March 2022.
- [#474](https://github.com/AbsaOSS/cobrix/issues/474) Fix numeric decoder of unsigned DISPLAY format. The decoder made more strict and does not allow sign
overpunching for unsigned numbers.
@@ -50,16 +50,15 @@ class AsciiStringDecoderWrapper(trimmingType: Int, asciiCharsetName: String, imp
// Filter out all special characters
val buf = new ArrayBuffer[Byte](bytes.length)
while (i < bytes.length) {
-if (bytes(i) >= 0 && bytes(i) < 32 /* Special characters are masked */ )
-buf.append(32)
-else
+if (trimmingType == KeepAll || bytes(i) >= 32 || bytes(i) < 0) {
buf.append(bytes(i))
+}
i = i + 1
}

val str = new String(buf.toArray, charset)

-if (trimmingType == TrimNone) {
+if (trimmingType == TrimNone || trimmingType == KeepAll) {
str
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(str)
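
The byte filter changed in this hunk can be sketched in isolation as follows. This is a simplified illustration, not the class from the commit: the object name `AsciiControlFilterSketch` and the boolean `keepAll` flag (standing in for `trimmingType == KeepAll`) are hypothetical, and trimming is left out.

```scala
import java.nio.charset.StandardCharsets

object AsciiControlFilterSketch {
  // Drops ASCII control bytes (0..31) unless keepAll is set.
  // Negative bytes (values above 127) are always passed on to the charset decoder.
  def decode(bytes: Array[Byte], keepAll: Boolean): String = {
    val kept = bytes.filter(b => keepAll || b >= 32 || b < 0)
    new String(kept, StandardCharsets.US_ASCII)
  }
}
```

With `keepAll = false` the control bytes disappear from the result rather than turning into spaces, which matches the updated expectations in the spec further down.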
@@ -99,6 +99,7 @@ object DecoderSelector {
case TrimLeft => StringDecoders.TrimLeft
case TrimRight => StringDecoders.TrimRight
case TrimBoth => StringDecoders.TrimBoth
+case KeepAll => StringDecoders.KeepAll
}
}

@@ -31,6 +31,7 @@ object StringDecoders {
val TrimLeft = 2
val TrimRight = 3
val TrimBoth = 4
+val KeepAll = 5

// Characters used for HEX conversion
private val HEX_ARRAY = "0123456789ABCDEF".toCharArray
@@ -55,7 +56,7 @@ object StringDecoders {
i = i + 1
}

-if (trimmingType == TrimNone) {
+if (trimmingType == TrimNone || trimmingType == KeepAll ) {
buf.toString
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(buf.toString)
@@ -81,13 +82,15 @@ object StringDecoders {
var i = 0
val buf = new StringBuffer(bytes.length)
while (i < bytes.length) {
-if (bytes(i) < 32 /*Special and high order characters are masked*/ )
-buf.append(' ')
-else
+if (trimmingType == KeepAll || bytes(i) >= 32) {
buf.append(bytes(i).toChar)
+} else if (bytes(i) < 0) {
+buf.append(' ')
+}
i = i + 1
}
-if (trimmingType == TrimNone) {
+
+if (trimmingType == TrimNone || trimmingType == KeepAll) {
buf.toString
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(buf.toString)
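
Unlike the charset-based wrapper, this decoder still masks high-order bytes with a space while dropping control bytes. A minimal sketch of the three branches, assuming a hypothetical object `SimpleAsciiDecodeSketch` and a boolean `keepAll` flag in place of `trimmingType == KeepAll`, with trimming omitted:

```scala
object SimpleAsciiDecodeSketch {
  def decode(bytes: Array[Byte], keepAll: Boolean): String = {
    val buf = new StringBuilder(bytes.length)
    bytes.foreach { b =>
      if (keepAll || b >= 32) {
        buf.append(b.toChar) // printable byte, or any byte under keepAll
      } else if (b < 0) {
        buf.append(' ')      // high-order byte is masked with a space
      }
      // remaining case: 0 <= b < 32, the control byte is dropped
    }
    buf.toString
  }
}
```

For example, the bytes `1, 65, -1, 66, 31` decode to `"A B"` without `keepAll`: the two control bytes vanish and the high-order byte becomes a space.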
@@ -116,7 +119,7 @@ object StringDecoders {
new String(bytes, StandardCharsets.UTF_16LE)
}

-if (trimmingType == TrimNone) {
+if (trimmingType == TrimNone || trimmingType == KeepAll) {
utf16Str
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(utf16Str)
@@ -19,7 +19,7 @@ package za.co.absa.cobrix.cobol.parser.policies
object StringTrimmingPolicy extends Enumeration {
type StringTrimmingPolicy = Value

-val TrimNone, TrimLeft, TrimRight, TrimBoth = Value
+val TrimNone, TrimLeft, TrimRight, TrimBoth, KeepAll = Value

def withNameOpt(s: String): Option[Value] = {
val exactNames = values.find(_.toString == s)
@@ -33,6 +33,8 @@ object StringTrimmingPolicy extends Enumeration {
Some(TrimRight)
} else if (sLowerCase == "both") {
Some(TrimBoth)
+} else if (sLowerCase == "keep_all") {
+Some(KeepAll)
} else {
None
}
@@ -56,7 +56,7 @@ class AsciiStringDecoderWrapperSpec extends WordSpec {
val str = "\u0001\u0005A\u0008\u0010B\u0015\u001F"
val decoder = new AsciiStringDecoderWrapper(TrimNone, "ASCII", false)

-assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == " A B ")
+assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == "AB")
}

"support left trimming" in {
@@ -81,7 +81,14 @@ class AsciiStringDecoderWrapperSpec extends WordSpec {
val str = "\u0002\u0004A\u0007\u000FB\u0014\u001E"
val decoder = new AsciiStringDecoderWrapper(TrimBoth, "ASCII", false)

-assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == "A B")
+assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == "AB")
+}
+
+"be able to decode strings when keep_all is the trimming policy" in {
+val str = "\u0002\u0004A\u0007\u000FB\u0014\u001E"
+val decoder = new AsciiStringDecoderWrapper(KeepAll, "ASCII", false)
+
+assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == str)
}

"be serializable and deserializable" in {
6 changes: 3 additions & 3 deletions data/test17_expected/test17d.txt
@@ -1,6 +1,6 @@
{"File_Id":0,"Record_Id":2,"SEGMENT_ID":"C","COMPANY_ID":"9377942526","STATIC_DETAILS":{"COMPANY_NAME":"Joan Q & Z","ADDRESS":"10 Sandton, Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"92714306","TAXPAYER_NUM":959592241},"CONTACTS":[{"PHONE_NUMBER":"+(277) 944 44 55","CONTACT_PERSON":"Janiece Newcombe"}]}}
{"File_Id":0,"Record_Id":6,"SEGMENT_ID":"C","COMPANY_ID":"3483483977","STATIC_DETAILS":{"COMPANY_NAME":"Robotrd Inc.","ADDRESS":"2 Park ave., Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"","TAXPAYER_NUM":31195396},"CONTACTS":[{"PHONE_NUMBER":"+(174) 970 97 54","CONTACT_PERSON":"Tyesha Debow"},{"PHONE_NUMBER":"+(848) 832 61 68","CONTACT_PERSON":"Mindy Celestin"},{"PHONE_NUMBER":"+(455) 184 13 39","CONTACT_PERSON":"Mabelle Winburn"}]}}
-{"File_Id":0,"Record_Id":7,"SEGMENT_ID":"C","COMPANY_ID":"7540764401","STATIC_DETAILS":{"COMPANY_NAME":"Eqartion Inc.","ADDRESS":"871A Forest ave., Toronto","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"6 H","TAXPAYER_NUM":87432264},"CONTACTS":[]}}
+{"File_Id":0,"Record_Id":7,"SEGMENT_ID":"C","COMPANY_ID":"7540764401","STATIC_DETAILS":{"COMPANY_NAME":"Eqartion Inc.","ADDRESS":"871A Forest ave., Toronto","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"6H","TAXPAYER_NUM":87432264},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":8,"SEGMENT_ID":"C","COMPANY_ID":"4413124035","STATIC_DETAILS":{"COMPANY_NAME":"Xingzhoug","ADDRESS":"74 Qing ave., Beijing","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"2f","TAXPAYER_NUM":50803302},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":12,"SEGMENT_ID":"C","COMPANY_ID":"9546291887","STATIC_DETAILS":{"COMPANY_NAME":"ZjkLPj","ADDRESS":"5574, Tokyo","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"73538919","TAXPAYER_NUM":926102835},"CONTACTS":[{"PHONE_NUMBER":"+(300) 252 33 17","CONTACT_PERSON":"Carrie Celestin"},{"PHONE_NUMBER":"+(907) 101 70 64","CONTACT_PERSON":"Edyth Deveau"},{"PHONE_NUMBER":"+(694) 918 17 44","CONTACT_PERSON":"Jene Norgard"}]}}
{"File_Id":0,"Record_Id":15,"SEGMENT_ID":"C","COMPANY_ID":"9168453994","STATIC_DETAILS":{"COMPANY_NAME":"Test Bank","ADDRESS":"1 Garden str., London","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"82573513","TAXPAYER_NUM":942814519},"CONTACTS":[{"PHONE_NUMBER":"+(768) 691 44 85","CONTACT_PERSON":"Timika Bourke"},{"PHONE_NUMBER":"+(695) 918 33 16","CONTACT_PERSON":"Lynell Riojas"}]}}
@@ -49,12 +49,12 @@
{"File_Id":0,"Record_Id":155,"SEGMENT_ID":"C","COMPANY_ID":"9898799886","STATIC_DETAILS":{"COMPANY_NAME":"Joan Q & Z","ADDRESS":"10 Sandton, Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"i","TAXPAYER_NUM":93022636},"CONTACTS":[{"PHONE_NUMBER":"+(576) 960 82 65","CONTACT_PERSON":"Carrie Maxim"},{"PHONE_NUMBER":"+(211) 823 44 73","CONTACT_PERSON":"Carrie Batman"},{"PHONE_NUMBER":"+(121) 202 45 80","CONTACT_PERSON":"Cliff Gagliano"},{"PHONE_NUMBER":"+(675) 313 76 46","CONTACT_PERSON":"Gabriele Hisle"}]}}
{"File_Id":0,"Record_Id":159,"SEGMENT_ID":"C","COMPANY_ID":"1542972569","STATIC_DETAILS":{"COMPANY_NAME":"ZjkLPj","ADDRESS":"5574, Tokyo","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"62949671","TAXPAYER_NUM":909261108},"CONTACTS":[{"PHONE_NUMBER":"+(759) 249 16 51","CONTACT_PERSON":"Estelle Thorpe"},{"PHONE_NUMBER":"+(66) 307 32 55","CONTACT_PERSON":"Cliff Deveau"},{"PHONE_NUMBER":"+(710) 445 38 90","CONTACT_PERSON":"Sulema Debow"}]}}
{"File_Id":0,"Record_Id":162,"SEGMENT_ID":"C","COMPANY_ID":"5492257935","STATIC_DETAILS":{"COMPANY_NAME":"ECSRONO","ADDRESS":"123/B Prome str., Denver","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"_","TAXPAYER_NUM":67540319},"CONTACTS":[{"PHONE_NUMBER":"+(168) 809 90 63","CONTACT_PERSON":"Alona Celestin"},{"PHONE_NUMBER":"+(845) 120 90 31","CONTACT_PERSON":"Estelle Flatt"}]}}
-{"File_Id":0,"Record_Id":167,"SEGMENT_ID":"C","COMPANY_ID":"2366383436","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"\" Z","TAXPAYER_NUM":35788122},"CONTACTS":[{"PHONE_NUMBER":"+(515) 716 22 11","CONTACT_PERSON":"Alona Shapiro"},{"PHONE_NUMBER":"+(649) 897 62 54","CONTACT_PERSON":"Wilbert Tumlin"},{"PHONE_NUMBER":"+(180) 179 20 17","CONTACT_PERSON":"Deshawn Thorpe"},{"PHONE_NUMBER":"+(12) 730 88 41","CONTACT_PERSON":"Sulema Batman"}]}}
+{"File_Id":0,"Record_Id":167,"SEGMENT_ID":"C","COMPANY_ID":"2366383436","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"\"Z","TAXPAYER_NUM":35788122},"CONTACTS":[{"PHONE_NUMBER":"+(515) 716 22 11","CONTACT_PERSON":"Alona Shapiro"},{"PHONE_NUMBER":"+(649) 897 62 54","CONTACT_PERSON":"Wilbert Tumlin"},{"PHONE_NUMBER":"+(180) 179 20 17","CONTACT_PERSON":"Deshawn Thorpe"},{"PHONE_NUMBER":"+(12) 730 88 41","CONTACT_PERSON":"Sulema Batman"}]}}
{"File_Id":0,"Record_Id":171,"SEGMENT_ID":"C","COMPANY_ID":"3002677167","STATIC_DETAILS":{"COMPANY_NAME":"ABCD Ltd.","ADDRESS":"74 Lawn ave., New York","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"` I","TAXPAYER_NUM":56661321},"CONTACTS":[{"PHONE_NUMBER":"+(372) 400 84 96","CONTACT_PERSON":"Eliana Godfrey"},{"PHONE_NUMBER":"+(128) 167 19 48","CONTACT_PERSON":"Suk Debow"},{"PHONE_NUMBER":"+(824) 681 73 76","CONTACT_PERSON":"Wilbert Mork"}]}}
{"File_Id":0,"Record_Id":176,"SEGMENT_ID":"C","COMPANY_ID":"3086612212","STATIC_DETAILS":{"COMPANY_NAME":"ECSRONO","ADDRESS":"123/B Prome str., Denver","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"78635498","TAXPAYER_NUM":926430771},"CONTACTS":[{"PHONE_NUMBER":"+(272) 831 90 52","CONTACT_PERSON":"Otelia Benally"},{"PHONE_NUMBER":"+(816) 337 55 41","CONTACT_PERSON":"Mindy Boehme"},{"PHONE_NUMBER":"+(508) 154 21 13","CONTACT_PERSON":"Timika Sauve"},{"PHONE_NUMBER":"+(335) 303 80 26","CONTACT_PERSON":"Timika Flatt"}]}}
{"File_Id":0,"Record_Id":180,"SEGMENT_ID":"C","COMPANY_ID":"1600426180","STATIC_DETAILS":{"COMPANY_NAME":"Pear GMBH.","ADDRESS":"107 Labe str., Berlin","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"E","TAXPAYER_NUM":43447109},"CONTACTS":[{"PHONE_NUMBER":"+(768) 461 89 92","CONTACT_PERSON":"Cliff Debow"},{"PHONE_NUMBER":"+(395) 386 85 35","CONTACT_PERSON":"Gabriele Deveau"},{"PHONE_NUMBER":"+(267) 618 38 57","CONTACT_PERSON":"Deshawn Bourke"}]}}
{"File_Id":0,"Record_Id":184,"SEGMENT_ID":"C","COMPANY_ID":"6926861847","STATIC_DETAILS":{"COMPANY_NAME":"Xingzhoug","ADDRESS":"74 Qing ave., Beijing","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"65111659","TAXPAYER_NUM":909455665},"CONTACTS":[{"PHONE_NUMBER":"+(347) 457 79 19","CONTACT_PERSON":"Cassey Mackinnon"},{"PHONE_NUMBER":"+(176) 205 63 71","CONTACT_PERSON":"Alona Newcombe"},{"PHONE_NUMBER":"+(348) 375 95 34","CONTACT_PERSON":"Starr Maxim"}]}}
{"File_Id":0,"Record_Id":186,"SEGMENT_ID":"C","COMPANY_ID":"9452676140","STATIC_DETAILS":{"COMPANY_NAME":"Pear GMBH.","ADDRESS":"107 Labe str., Berlin","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"","TAXPAYER_NUM":97056018},"CONTACTS":[{"PHONE_NUMBER":"+(123) 240 91 88","CONTACT_PERSON":"Willis Thorpe"}]}}
-{"File_Id":0,"Record_Id":190,"SEGMENT_ID":"C","COMPANY_ID":"8581179565","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"W i","TAXPAYER_NUM":89592681},"CONTACTS":[{"PHONE_NUMBER":"+(518) 461 10 86","CONTACT_PERSON":"Otelia Flatt"},{"PHONE_NUMBER":"+(697) 268 10 81","CONTACT_PERSON":"Wilbert Lepe"},{"PHONE_NUMBER":"+(548) 150 86 82","CONTACT_PERSON":"Suk Maxim"}]}}
+{"File_Id":0,"Record_Id":190,"SEGMENT_ID":"C","COMPANY_ID":"8581179565","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"Wi","TAXPAYER_NUM":89592681},"CONTACTS":[{"PHONE_NUMBER":"+(518) 461 10 86","CONTACT_PERSON":"Otelia Flatt"},{"PHONE_NUMBER":"+(697) 268 10 81","CONTACT_PERSON":"Wilbert Lepe"},{"PHONE_NUMBER":"+(548) 150 86 82","CONTACT_PERSON":"Suk Maxim"}]}}
{"File_Id":0,"Record_Id":191,"SEGMENT_ID":"C","COMPANY_ID":"7590246923","STATIC_DETAILS":{"COMPANY_NAME":"Joan Q & Z","ADDRESS":"10 Sandton, Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"\\","TAXPAYER_NUM":17652973},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":195,"SEGMENT_ID":"C","COMPANY_ID":"2521796035","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"","TAXPAYER_NUM":84541629},"CONTACTS":[{"PHONE_NUMBER":"+(295) 174 64 72","CONTACT_PERSON":"Estelle Wallingford"},{"PHONE_NUMBER":"+(173) 201 14 38","CONTACT_PERSON":"Doretha Shapiro"},{"PHONE_NUMBER":"+(756) 614 38 41","CONTACT_PERSON":"Suk Benally"}]}}
