NUL characters in US ASCII getting converted to space character, i.e. " ", instead of empty value. #481

rohitavantsa · 2022-03-21T13:15:20Z

We have a US ASCII fixed-length file.
File contents:
1234   t ----> this row has three spaces
4567NULNULNULf -----> this row has 3 NUL characters

CopyBook Contents:

01 tablename
05 record_ID PIC x(3)
05 record_status PIC x(3)
05 record_flag PIC x(1)

Expected output:

[Row(record_ID='1234', record_status='   ', record_flag='t'),
Row(record_ID='4567', record_status='', record_flag='f')]

Actual output:
[Row(record_ID='1234', record_status='   ', record_flag='t'),
Row(record_ID='4567', record_status='   ', record_flag='f')]

We expect an empty value, but instead we are getting three spaces. In our on-prem data this field is an empty value. Can you please help us understand why we are seeing this behavior in this scenario?

@yruslan Can you please help us on this.

yruslan · 2022-03-21T13:23:25Z

Hi, thanks for the issue report. Could you please add:

An example US ASCII file with NUL characters.
The code snippet you are using to read the file.

Btw, does this option help with removing the extra spaces: .option("string_trimming_policy", "both")?

rohitavantsa · 2022-03-22T11:31:37Z

Hi @yruslan

string_trimming_policy is set to none in our case because we need to preserve spaces while reading the file.

Here are the options we are using to read the file:
spark.read.format('cobol').options(**{'copybook_contents': copybook_contents, 'encoding': 'ASCII', 'ebcdic_code_page': 'CP037', 'string_trimming_policy': 'none', 'debug_ignore_file_size': 'true'}).load('filepath')

Please find the Sample file below:
sampleUS-ASCII file.txt

Please open this file in Notepad++ to see the NUL characters.
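
For readers without an editor that displays control characters, the NUL bytes can also be located programmatically. The following is a minimal sketch that rebuilds the two sample rows described in this issue as byte strings (a hypothetical reconstruction, not the attached file itself) and reports where the 0x00 bytes sit. The helper name `nul_offsets` is made up for illustration.

```python
from typing import List

# Hypothetical reconstruction of the two sample rows from this issue.
records = [
    b"1234   t",            # status field holds three 0x20 space characters
    b"4567\x00\x00\x00f",   # status field holds three 0x00 NUL characters
]


def nul_offsets(record: bytes) -> List[int]:
    """Return the byte offsets of every NUL (0x00) in a record."""
    return [i for i, b in enumerate(record) if b == 0x00]


for rec in records:
    # bytes.hex(sep) needs Python 3.8+; it prints one hex pair per byte.
    print(rec.hex(" "), "-> NUL offsets:", nul_offsets(rec))
```

To inspect a real file, the same helper can be applied to the bytes returned by `open(path, "rb").read()`.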

Expected Output:
The field containing NULNULNUL should appear as an empty string instead of '   ' (three spaces), which is what we currently get in our dataframe. The on-prem system provides this field as '' (an empty string).

yruslan · 2022-03-22T11:54:00Z

Hi,

Before looking deeper please try:

Removing 'ebcdic_code_page':'CP037' since it is applicable only for EBCDIC and
Adding .option("improved_null_detection", "true")

ASCII charset is set using this option:
.option("ascii_charset", "US_ASCII") (UTF_8 is the default)
(you can specify a different charset, of course)
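
The suggestions above could be collected into a single set of reader options, roughly as sketched here. The option names `ascii_charset`, `improved_null_detection`, `string_trimming_policy`, `encoding` and `debug_ignore_file_size` come from this thread; the `copybook_contents` key is assumed to be the option intended in the reader snippet earlier in the thread, and the helper name `cobrix_ascii_options` is hypothetical.

```python
def cobrix_ascii_options(copybook_contents: str) -> dict:
    """Build reader options for an ASCII file, following the advice in this thread.

    'ebcdic_code_page' is deliberately left out because it only applies to EBCDIC.
    """
    return {
        "copybook_contents": copybook_contents,  # assumed option name
        "encoding": "ASCII",
        "ascii_charset": "US_ASCII",
        "string_trimming_policy": "none",
        "improved_null_detection": "true",
        "debug_ignore_file_size": "true",
    }


# Usage with an existing SparkSession named `spark` (not executed here):
# df = (spark.read.format("cobol")
#       .options(**cobrix_ascii_options(copybook))
#       .load("filepath"))
```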

rohitavantsa · 2022-03-22T12:09:01Z

Sure @yruslan will try this.

rohitavantsa · 2022-03-23T12:32:25Z

Hi @yruslan

We have tried removing the option 'ebcdic_code_page':'CP037' and adding .option("improved_null_detection", "true"), but it is still not working as we expect.

To be more clear:
The NUL character I am referring to is the hex value \x00, which is not being read properly. When reading, we expect an empty field but get space characters instead. The file I have provided contains these NUL characters.
Could you try it and check whether this is the normal behavior or whether a fix is needed?

Thanks in advance

yruslan · 2022-03-23T14:43:06Z

Currently, all characters that are lower than 0x20 are replaced by spaces. If all characters in a field are 0x00, and improved_null_detection = true, the field becomes null.
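
The behavior described above can be modeled in a few lines of plain Python. This is only an illustrative sketch of the stated rules, not Cobrix's actual decoder; the function name `decode_ascii_field_current` is hypothetical, and trimming policies are ignored.

```python
from typing import Optional


def decode_ascii_field_current(raw: bytes,
                               improved_null_detection: bool = False) -> Optional[str]:
    """Model of the described behavior for one string field.

    - If every byte is 0x00 and improved_null_detection is on, return None.
    - Otherwise every byte below 0x20 (including 0x00) becomes a space.
    """
    if improved_null_detection and len(raw) > 0 and all(b == 0x00 for b in raw):
        return None
    return "".join(" " if b < 0x20 else chr(b) for b in raw)


# The NUL-filled status field from the sample file:
print(repr(decode_ascii_field_current(b"\x00\x00\x00")))        # '   '
print(repr(decode_ascii_field_current(b"\x00\x00\x00", True)))  # None
```

Under this model the NUL-filled field can come out as three spaces or as null, but never as the empty string the reporter expects.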

Will check your file. Probably the correct behavior for ASCII would be not replacing lower characters with spaces and always skipping 0x00. This is something that needs to be implemented on our side.
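
A rough sketch of the alternative described above (drop 0x00 bytes, leave other low characters untouched) might look like the following. It illustrates the idea only; it is an assumption that the eventual fix behaves this way, and the function name `decode_ascii_field_proposed` is hypothetical.

```python
def decode_ascii_field_proposed(raw: bytes) -> str:
    """Skip NUL bytes and keep every other byte, control characters included."""
    without_nul = bytes(b for b in raw if b != 0x00)
    # errors="replace" guards against bytes outside the 7-bit ASCII range.
    return without_nul.decode("ascii", errors="replace")


print(repr(decode_ascii_field_proposed(b"\x00\x00\x00")))  # ''
print(repr(decode_ascii_field_proposed(b"   ")))           # '   '
```

With this rule a field of three NULs decodes to an empty string while a field of three real spaces is preserved, which matches the output requested in the issue.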

rohitavantsa · 2022-03-23T19:03:45Z

Sure thanks.


yruslan · 2022-03-24T08:11:15Z

This should be fixed in this branch:
https://github.com/AbsaOSS/cobrix/tree/bugfix/481-ignore-control-characters

You can test it by building that branch.


rohitavantsa · 2022-03-25T12:54:36Z

Thanks @yruslan . This fix is helping us resolve the issue.

yruslan · 2022-03-25T13:44:32Z

Great! It will be released as a new version sometime next week


rohitavantsa added the question Further information is requested label Mar 21, 2022

yruslan added a commit that referenced this issue Mar 24, 2022

#481 Fix ASCII control characters handling policy. Add 'keep_all' string trimming policy.

a88554d

yruslan added a commit that referenced this issue Mar 24, 2022

#481 Fix ASCII control characters handling policy. Add 'keep_all' string trimming policy.

e4006e3

yruslan added a commit that referenced this issue Mar 24, 2022

#481 Fix ASCII control characters handling policy. Add 'keep_all' string trimming policy.

e9fb855

yruslan added a commit that referenced this issue Mar 25, 2022

#481 Fix ASCII control characters handling policy. Add 'keep_all' string trimming policy.

df9d579

yruslan closed this as completed Apr 19, 2022
