Reading the mainframe files data as text field irrespective of the copybook data type #291

codealways · 2020-05-14T15:46:35Z

Background

When a value does not match the data type declared in the copybook, Cobrix reads it as null.

Feature

Can we read every field as text, irrespective of the copybook data type, so that no data is lost while reading? It would be just like reading a CSV file with every column typed as string.

yruslan · 2020-05-14T20:01:29Z

This is a tricky question. The simple answer is no, we currently do not support this, and I'm not sure how to support it in a way that would be helpful.

COBOL has a system of types and formats, and some of these types admit byte values that have no meaning (no semantic mapping).

Let's consider a few approaches.

1. Decoding data type format. For instance, binary-coded decimal (BCD) decoding converts the HEX bytes 54 33 21 to the number 543321. So if the copybook defines a field as a BCD-encoded number, Cobrix converts it to a number. But a value such as 54 A3 21 is invalid from the BCD perspective, since 'A' is not a decimal digit. There is no mapping from that value to a number.
2. Leaving raw values. If a field is defined as a BCD number but we always put raw values into the field, the BCD number '54 33 21' will be converted to a string of 3 characters: 0x54, 0x33, 0x21. A string can hold any byte sequence, but the field loses its meaning as a number.
3. Leaving raw values only for incorrect values. We can write numbers as numbers when decoding is possible and write raw values otherwise. But in this case, we lose the ability to identify which values are correct.
4. Additional debugging fields. We have .option("debug", "true"), which generates additional fields containing the original data in HEX encoding. We can add an option to create these fields not as HEX, but as actual raw values. This might be a viable solution.
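The strict-decoding case described above can be illustrated outside of Cobrix. The following is a minimal Python sketch, not Cobrix's actual decoder: `decode_unsigned_bcd` is a hypothetical helper that assumes unsigned packed BCD with two digits per byte and returns `None` when a nibble is not a decimal digit.

```python
from typing import Optional


def decode_unsigned_bcd(raw: bytes) -> Optional[int]:
    """Decode unsigned packed BCD (two decimal digits per byte).

    Hypothetical helper for illustration only. Returns None when the
    input is empty or any nibble is not a decimal digit.
    """
    if not raw:
        return None
    digits = []
    for byte in raw:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            return None  # e.g. nibble 0xA has no decimal meaning
        digits.append(str(high))
        digits.append(str(low))
    return int("".join(digits))


valid = decode_unsigned_bcd(bytes.fromhex("543321"))    # 543321
invalid = decode_unsigned_bcd(bytes.fromhex("54A321"))  # None
print(valid, invalid)
```

The second call has no numeric answer, which is the situation in which a typed read ends up with null.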

Please, try the debug option. Let us know if it is sufficient for your use case or you would prefer having raw values as debugging fields.

codealways · 2020-05-15T08:35:04Z

Thanks for the quick reply. I tried option #4 and set debug=true. When converting the HEX value to ASCII, do I always need to use the character set Cp037? Cp037 is one of the EBCDIC character sets used on IBM mainframes. Are there other EBCDIC character sets that we might encounter?

codealways · 2020-05-15T09:47:29Z

For the packed decimal (COMP-3) field S9(2)V999 and the value 12.000, we get the HEX 00000000012000C. Converting it to a string with the character set Cp037 does not work as intended. I suppose I need to look up the field data type in the copybook and convert accordingly.

Here C means the value is positive, and (999) means 3 digits after the decimal point. Please suggest whether there is any other way to get the raw value in Cobrix.

yruslan · 2020-05-15T11:35:08Z

The raw values are presented exactly the same way as they were in the original file, no encoding conversion is happening.

I'm trying to understand what you want to achieve. Could you please provide a made-up example? Something like: "for fields having raw values so and so we want values to be so and so".

codealways · 2020-05-15T14:34:34Z

For example, my copybook is as below:

    01 StudentDetail.
       10 Name PIC A(10).
       10 ID   PIC S9(4) COMP.
       10 Mark PIC S9(10)V999 COMP-3.

After enabling the debug option while loading the EBCDIC file, I get the following dataframe:

| Name | Name_debug | ID | ID_debug | Mark | Mark_debug |
|------|------------|----|----------|------|------------|
| John | D1D6C8D5 | 1 | 0001 | 15.000 | 0000000015000C |

Looking at the Mark_debug column: if I want to reverse engineer 0000000015000C to 15.000, how should I know whether the actual ASCII value in Mark is 15.000 or 015.000? Or is it the case that in the EBCDIC binary format the data cannot come as 015.000? That is, could the Mark column contain the following, which is also valid data and aligned with the data type?

| Name | Name_debug | ID | ID_debug | Mark | Mark_debug |
|------|------------|----|----------|------|------------|
| John | D1D6C8D5 | 1 | 0001 | 015.000 | 0000000015000C |

yruslan · 2020-05-15T14:46:34Z

> So if you will see the Mark_debug column if I want to reverse engineer the 0000000015000C to 15.00 then how should I know whether the actual ASCII value in MARK as 15.000 or 015.000.

The difference between 15.000 and 015.000 is a matter of how the incoming binary data is interpreted. The copybook does not describe whether a leading zero should be present when the value is displayed. Since the copybook does not describe it, Cobrix cannot do much about it.

Another way to look at it is that 15.000 and 015.000 are the same value, since they semantically map to the same mathematical number 15.
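For anyone post-processing the debug column, here is a minimal Python sketch of how a packed decimal HEX string could be turned back into a number. It is not part of Cobrix: `decode_comp3_hex` is a hypothetical helper, the scale has to be taken from the copybook (3 for `PIC S9(10)V999`), and only the common sign nibbles C, D, and F are handled.

```python
from decimal import Decimal
from typing import Optional


def decode_comp3_hex(hex_value: str, scale: int) -> Optional[Decimal]:
    """Decode a COMP-3 (packed decimal) value given as a HEX string.

    Hypothetical helper. `scale` is the number of digits after the
    implied decimal point. Returns None for malformed input.
    """
    digits, sign = hex_value[:-1], hex_value[-1:].upper()
    if not (digits.isascii() and digits.isdigit()):
        return None  # a non-decimal nibble such as 'A' among the digits
    if sign not in ("C", "D", "F"):
        return None  # only the common sign nibbles are handled here
    unscaled = int(digits)
    if sign == "D":
        unscaled = -unscaled
    # Shift the decimal point left by `scale` positions.
    return Decimal(unscaled).scaleb(-scale)


mark = decode_comp3_hex("0000000015000C", scale=3)
print(mark)  # 15.000
# Leading zeros do not change the number.
print(Decimal("015.000") == Decimal("15.000"))  # True
```

The packed bytes carry digits and a sign only, so any leading zeros in the rendered text are a formatting decision made after decoding.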

codealways · 2020-05-15T14:50:39Z

That's correct, but our requirement is to read the raw value as it is. In this scenario, if I read the Mark field as a text field, then 15.000 and 015.000 are two different things. So basically we need to read the value as is, without manipulating anything.

yruslan · 2020-05-15T15:16:25Z

Okay. I have a question for you too. What in the copybook says that 15.000 is modified while 015.000 is the original value?

codealways · 2020-05-15T16:37:25Z

No, it is not mentioned. But I am assuming that if the value in my data file is 015.000, 15.000, or 0015.000, in all cases the HEX-encoded value will be 0000000015000C. So, per my requirement, how should I get the actual raw value from the HEX-encoded one?
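The assumption above can be checked with a small Python sketch of COMP-3 packing. This is an illustration under stated assumptions, not Cobrix code: `encode_comp3_hex` is a hypothetical helper, and for `PIC S9(10)V999` the field has 13 digits in total with a scale of 3.

```python
from decimal import Decimal


def encode_comp3_hex(text: str, total_digits: int, scale: int) -> str:
    """Pack a decimal string into signed COMP-3 and return it as HEX.

    Hypothetical helper. Fractional digits beyond `scale` are truncated.
    """
    value = Decimal(text)
    # Drop the decimal point: 15.000 with scale 3 becomes 15000.
    unscaled = int(abs(value).scaleb(scale))
    digits = str(unscaled).rjust(total_digits, "0")
    if len(digits) > total_digits:
        raise ValueError("value does not fit the field")
    sign = "D" if value < 0 else "C"
    packed = digits + sign
    if len(packed) % 2:
        packed = "0" + packed  # pad to a whole number of bytes
    return packed


for text in ("15.000", "015.000", "0015.000"):
    print(text, "->", encode_comp3_hex(text, total_digits=13, scale=3))
```

All three spellings produce 0000000015000C, so the packed field cannot tell them apart.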

yruslan · 2020-05-15T18:15:45Z

Just a suggestion: 15.000 should count as the raw value in your example, that is, the number with all leading zeros removed. Cobrix won't ever unpack a COMP-3 encoded value as '015.000'.

But then again, your requirement depends on what you mean by 'raw value', and your notion of 'raw value' depends on your requirements. So it is completely up to you.

codealways · 2020-05-15T18:31:04Z

> Cobrix won't ever unpack COMP-3 encoded value as '015.000'

Then it should be fine. For now, with version 2.0.7, I am planning to use the debug option and reverse engineer the HEX value to the raw value so that I can use the data as is, without any changes.

In the future, it would be really good if raw values could be provided instead of HEX. When adding debug fields in the addDebugFields function, I guess val debugDataType = AlphaNumeric(s"X($size)", size, None, None, None) will keep the raw value. Please suggest.

yruslan · 2020-05-15T18:42:18Z

Yes, we can easily add an option to generate raw values for debugging. Adding it to the backlog.

> AlphaNumeric(s"X($size)", size, None, None, None) will keep the raw value. Please suggest.

Yes, and also you need to change this method to return raw values instead of HEX:

cobrix/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala

Line 122 in a4b24ce

def decodeHex(bytes: Array[Byte]): String = {

codealways · 2020-05-15T18:44:20Z

Thanks a lot. It was great having this discussion with you. Do you want me to close this issue, or will it be tracked as part of the backlog?

yruslan · 2020-05-15T19:00:39Z

No problem 😄

Let's leave this issue open. I'll use it to make the change to support raw values.

codealways · 2020-05-17T14:32:13Z

It seems that for the packed decimal field 10 Mark PIC S9(10)V999 COMP-3

| Mark | Mark_debug |
|------|------------|
| 15.000 | 0000000015000C |

Cobrix does not convert to actual HEX. The value in Mark_debug looks as if the data was converted to be stored in an ASCII environment. Please confirm.

yruslan · 2020-05-18T05:51:51Z

Data in Mark_debug should contain HEX of the raw data without any conversion.

Remember, BCD is a binary encoding format. The notion of character encoding (EBCDIC vs ASCII) is not applicable here.
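A short Python sketch may make the distinction concrete, using the HEX values from the example earlier in the thread. It relies only on Python's built-in `cp037` codec; other EBCDIC codecs such as `cp500` and `cp1140` also ship with Python, and which one applies depends on how the file was produced.

```python
# HEX values copied from the debug columns in the example above.
name_raw = bytes.fromhex("D1D6C8D5")
mark_raw = bytes.fromhex("0000000015000C")

# A text field: the EBCDIC code page maps each byte to a character.
name_text = name_raw.decode("cp037")
print(name_text)  # JOHN

# A packed decimal field: decoding it as text yields control
# characters, because the bytes hold digit nibbles, not characters.
mark_text = mark_raw.decode("cp037")
print(repr(mark_text))
print(any(ch.isdigit() for ch in mark_text))  # False
```

Note that D1D6C8D5 decodes to uppercase JOHN under cp037. For the Mark field, the digits are already readable in the HEX string itself, which is why no code page is involved.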

codealways · 2020-05-18T06:29:25Z

Got it, Thanks

codealways added the enhancement New feature or request label May 14, 2020

codealways changed the title ~~Reading the mainframe files data all as text field irrespective of the copybook data type~~ Reading the mainframe files data as text field irrespective of the copybook data type May 14, 2020

yruslan added the accepted Accepted for implementation label May 15, 2020

yruslan self-assigned this May 15, 2020

yruslan added a commit that referenced this issue May 29, 2020

#291 Add an ability to generate raw binary debugging debug fields.

b7f016a

yruslan added a commit that referenced this issue May 29, 2020

#291 Add documentation for the new feature.

a5a57db

yruslan added a commit that referenced this issue May 29, 2020

#291 Add an ability to generate raw binary debugging debug fields.

6a1143e

yruslan added a commit that referenced this issue May 29, 2020

#291 Add documentation for the new feature.

3d6fc2d

yruslan added a commit that referenced this issue May 29, 2020

#291 Fix failing unit test.

e463a8e

yruslan added a commit that referenced this issue May 29, 2020

#291 Fix a warning.

f69d9de

yruslan added a commit that referenced this issue May 29, 2020

#291 Add an ability to generate raw binary debugging debug fields.

257d51c

yruslan added a commit that referenced this issue May 29, 2020

#291 Add documentation for the new feature.

64d619f

yruslan added a commit that referenced this issue May 29, 2020

#291 Fix failing unit test.

76698ed

yruslan added a commit that referenced this issue May 29, 2020

#291 Fix a warning.

e785ad9

yruslan closed this as completed Jun 12, 2020
