Unable to parse a mainframe copybook that has a COBOL datatype containing BBBB (blank spaces) #734

@suryagits

Describe the bug

We are using CoBrix with PySpark and executing it on AWS EMR.
We have the EBCDIC file and its corresponding copybook in an AWS S3 bucket. When we try to parse the EBCDIC file using the copybook, we get an error.

Error message :
py4j.protocol.Py4JJavaError : An error occurred while calling o2021.load : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException : Syntax error in the copybook at line 29 : Invalid input 'BBBB' at position 29:45

Code snippet that caused the issue

import logging

file_path = f's3://{s3_bucket}/{ebcdic_file_path}'
try:
    # Parentheses let the reader chain span several lines.
    df = (
        spark.read
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("encoding", "ebcdic")
        .option("schema_retention_policy", "collapse_root")
        .option("generate_record_id", True)
        .load(file_path)
    )
except Exception as e:
    log_message = f'spark job failed with error : {e}'
    logging.error(log_message)
    raise

Expected behavior

We expected Cobrix to successfully parse the EBCDIC file's record column using the copybook, which has a field with the datatype 'PIC 9(06)BBBB'.

Context

PySpark Jar dependencies :

  • cobol-parser_2.12-2.6.7.jar
  • hadoop-lzo-0.4.3.jar
  • scodec-bits_2.12-1.1.12.jar
  • scodec-core_2.12-1.11.4.jar
  • spark-cobol_2.12-2.6.7.jar

Operating system: AWS EMR (Linux image)

Copybook (if possible)

                    15 EL02-267-COLNAME-A
                      20 EL02-267-COLNAME-B
                                                       PIC X(19).
                      .........
                      .........
                      .........
                      20 EL02-267-COLNAME-C  REDEFINES
                                    EL02-267-COLNAME-D
                                                       PIC 9(06)BBBB. (This is what is causing the issue we suppose)
GP5WHB        20 FILLER                 pic X(285).                      CLEAN-UP
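
For context: in a COBOL PICTURE string, B is an insertion character that stands for one blank position, so PIC 9(06)BBBB describes a 10-character field (six digits followed by four spaces). Until the parser accepts this, one possible workaround is to rewrite such clauses as plain alphanumeric fields of the same width before passing the copybook to Cobrix. The following is only a minimal sketch under that assumption; the helper names picture_length and replace_blank_insertion are hypothetical and not part of Cobrix, and it assumes the field can be read as a string and uses the default DISPLAY usage.

```python
import re

# One picture symbol with an optional repeat count, e.g. "9(06)" or "B".
# Only a small subset of picture symbols is handled in this sketch.
_SYMBOL = re.compile(r"([9XABZ0])(?:\((\d+)\))?", re.IGNORECASE)

# A PIC clause whose picture string contains at least one 'B'.
_PIC_WITH_B = re.compile(
    r"\bPIC(?:TURE)?\s+([9XAZ0B()\d]*B[9XAZ0B()\d]*)", re.IGNORECASE
)


def picture_length(picture: str) -> int:
    """Return the number of character positions a picture string describes."""
    total = 0
    pos = 0
    while pos < len(picture):
        match = _SYMBOL.match(picture, pos)
        if match is None:
            raise ValueError(f"Unsupported picture string: {picture!r}")
        # A symbol without "(n)" occupies one position.
        total += int(match.group(2)) if match.group(2) else 1
        pos = match.end()
    return total


def replace_blank_insertion(copybook: str) -> str:
    """Rewrite PIC clauses that contain 'B' as PIC X(n) of the same width."""
    def _as_alphanumeric(match: re.Match) -> str:
        return f"PIC X({picture_length(match.group(1))})"

    return _PIC_WITH_B.sub(_as_alphanumeric, copybook)


if __name__ == "__main__":
    line = "   20 EL02-267-COLNAME-C  PIC 9(06)BBBB."
    print(replace_blank_insertion(line))
```

The rewritten text could then be passed as the copybook_contents option. The field would arrive as a string with trailing blanks, so any numeric conversion would have to happen afterwards in Spark.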

Attach a small data file that can help reproduce the issue, if possible: We need to check whether this is feasible, because the data is confidential. We will get back on this.

Labels: bug