validate doesn't flag a data file with only LF #499

rchenatjpl · 2022-05-10T08:05:18Z

The attached data file has only LF but of course should have CRLF. validate passes it.

% validate -t ../M7_217_044546_N.xml
PDS Validate Tool Report
Configuration:
Version 2.1.5-SNAPSHOT
Date 2022-05-10T08:01:15Z
Parameters:
Targets [file:/Users/rchen/Desktop/M7_217_044546_N.xml]
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Max Errors 100000
Registered Contexts File /Users/rchen/PDS4tools/validate/resources/registered_context_products.json
Product Level Validation Results
PASS: file:/Users/rchen/Desktop/M7_217_044546_N.xml
1 product validation(s) completed
Summary:
0 error(s)
0 warning(s)
Product Validation Summary:
1 product(s) passed
0 product(s) failed
0 product(s) skipped
Referential Integrity Check Summary:
0 check(s) passed
0 check(s) failed
0 check(s) skipped
End of Report
Completed execution in 3190 ms
%
%
% validate -V
gov.nasa.pds:validate
Version 2.1.5-SNAPSHOT
Release Date: 2022-02-10 21:20:46

rchenatjpl · 2022-05-10T08:05:29Z

Archive.zip

jordanpadams · 2022-05-10T17:40:51Z

@rchenatjpl shouldn't this be valid as of CCB-264? #292

rchenatjpl · 2022-05-10T23:21:04Z

Oh, I forgot about that. I guess the question then is what validate should do if the data file has only LF but the label says record_delimiter is CRLF, which was the only value available for the selected IM version.

jordanpadams · 2022-05-11T01:38:30Z

@rchenatjpl copy. That is a bug then

al-niessner · 2022-12-21T21:46:17Z

@jordanpadams @nutjob4life @tloubrieu-jpl

The failure is obscure and needs clarification. It starts here:

validate/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/FileReferenceValidationRule.java

Lines 429 to 439 in 1979b01

    
    ProblemType problemType = this.documentUtil.getProblemType(doctype);
    // Is is possible that there's no corresponding problemType. Must check for
    // null-ness before calling checkGenericDocument() function.
    if (problemType == null) {
      LOG.error(
          "FileReferenceValidationRule:Cannot retrieve ProblemType from provided doctype {}",
          doctype);
    } else {
      return this.checkGenericDocument(target, urlRef, fileObject, filename, parent,
          directory, documentStandardId, doctype, problemType);
    }

In the XML file the doctype resolves to "". When the problemType is retrieved at line 429 with doctype being "", null is returned. That is an inherent part of the call at:

validate/src/main/java/gov/nasa/pds/tools/util/DocumentUtil.java

Lines 117 to 137 in 1979b01

    
    public ProblemType getProblemType(String docType) {
      ProblemType problemType = null;
      if (!this.classInitialized) {
        // Only initialize this class once of the two lists' content.
        this.initialize();
      }
      // Iterating through docTypeList and check if docType contains singleDocType.
      // Note that everything is changed to lower cases for comparison.
      int ii = 0;
      for (String singleDocType : this.docTypeList) {
        if (docType.toLowerCase().contains(singleDocType.toLowerCase())) {
          problemType = this.problemTypeList.get(ii);
          // Once we have found a matching value, there's no need to continue looping as
          // it will be fetching the wrong ProblemType if we continue.
          break;
        }
        ii++;
      }
      LOG.debug("getProblemType:docType,problemType {},{}", docType, problemType);
      return (problemType);

because "" is not among the doctypes set up by the class's initialize(). It registers a bunch like EXCEL, HTML, GIF, etc., but not blank, so the function returns null. Then at line 432 the null is detected and the generic document test is not performed (not to say that the LF would have been caught, but it was never given the chance).
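
To make the failure mode concrete, here is a minimal, self-contained sketch of the same substring lookup. The class name `DocTypeLookupSketch`, the three registered doctypes, and the problem type names are assumptions for illustration only; just the matching logic mirrors the `getProblemType()` excerpt above. With an empty doctype, no registered name is contained in it, so the loop falls through and null comes back.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DocumentUtil; the list contents are illustrative only.
final class DocTypeLookupSketch {
  private static final List<String> DOC_TYPES = Arrays.asList("EXCEL", "HTML", "GIF");
  private static final List<String> PROBLEM_TYPES =
      Arrays.asList("NON_EXCEL_FILE", "NON_HTML_FILE", "NON_GIF_FILE");

  // Same shape as the excerpt: the first registered doctype contained in docType wins.
  static String getProblemType(String docType) {
    String lowered = docType.toLowerCase();
    for (int i = 0; i < DOC_TYPES.size(); i++) {
      if (lowered.contains(DOC_TYPES.get(i).toLowerCase())) {
        return PROBLEM_TYPES.get(i);
      }
    }
    return null; // nothing matched, which is what happens for ""
  }

  public static void main(String[] args) {
    System.out.println(getProblemType("Microsoft Excel")); // matches the EXCEL entry
    System.out.println(getProblemType(""));                // no entry matches
  }
}
```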

So,

1. Is the example XML malformed because the doctype is ""?
   a. Should validate report an unknown doctype error rather than return null?
   b. Should there be a "" doctype?
2. The 429-439 block of code is wrong and it should always do the generic test despite not having a doctype but
   a. use parent problemType
   b. also modify generic document handler to ignore problemType arg and use a listener
3. This ticket is in error because the doctype is allowed to be "" and this is the correct behavior.

Please clarify which is the actual problem (XML and/or code) and what the actual problem is (malformed XML and/or not detecting previous error).

jordanpadams · 2022-12-22T22:36:11Z

✅ The 429-439 block of code is wrong and it should always do the generic test despite not having a doctype but
a. use parent problemType
b. also modify generic document handler to ignore problemType arg and use a listener

I actually think that doctype code was kind of a wild ride that we wound up not really relying on too much. In this case, we should continue to perform validation.

jordanpadams · 2022-12-22T22:36:37Z

@al-niessner ☝️

al-niessner · 2022-12-23T20:02:08Z

@jordanpadams @nutjob4life @tloubrieu-jpl

Yum, I do not think this is the right route. I think the XML is not sufficient. I applied a patch so that doctype = "" resolves to a problemType, and the checking now sinks a little deeper:

validate/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/FileReferenceValidationRule.java

Lines 838 to 846 in 15c37d4

    
    if (documentStandardId != null) {
      mimeTypeIsCorrectFlag = this.documentsChecker.isMimeTypeCorrect(textName, documentStandardId);
      LOG.debug("handleGenericDocument:textName,documentStandardId,mimeTypeIsCorrectFlag {},{},{}",
          textName, documentStandardId, mimeTypeIsCorrectFlag);
    } else {
      mimeTypeIsCorrectFlag = true; // Set to true even though the label does not have the
                                    // documentStandardId set
                                    // to anything.
    }

The problem here is that documentStandardId is null, causing the else branch to be executed instead of the actual check at line 839, not that it would matter to this ticket. If you look just below the highlighted block, it is obvious that the only check is the MIME type, which is not too thrilling. The point is, if we are checking a random file, then how do we know it should have LFCR instead of LF or CR, or just randomly placed bytes of those values because it is a binary file? Without a proper doctype and mime type to tell us that lines should end LFCR and that LF or CR will not appear otherwise in the file then how can a check for them be meaningful?

al-niessner · 2022-12-23T20:34:28Z

@jordanpadams @nutjob4life @tloubrieu-jpl

Just ran the cucumber tests and 8 of them fail because they now pass files that do not resolve a doctype. It seems this is a much larger design problem rather than a single line of code.

jordanpadams · 2022-12-23T21:13:38Z

@al-niessner which tests are failing and for which ticket? I would not be surprised if this was a larger design issue...

Without a proper doctype and mime type to tell us that lines should end LFCR and that LF or CR will not appear otherwise in the file then how can a check for them be meaningful?

that is actually not true. LF and CRLF are defined in the labels:

    <Table_Character>
      <offset unit="byte">80</offset>
      <records>25</records> 
      <record_delimiter>Carriage-Return Line-Feed</record_delimiter>
      <Record_Character>
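
As a rough illustration of reading that value out of a label, the sketch below uses the JDK's built-in DOM parser. The class name `RecordDelimiterLookupSketch` is hypothetical, and the embedded fragment is a closed-off, simplified copy of the excerpt above (a real label has namespaces and many more elements); this is not how validate itself reads labels.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of validate.
final class RecordDelimiterLookupSketch {
  /** Returns the text of the first record_delimiter element, or null if there is none. */
  static String recordDelimiter(String labelXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(labelXml.getBytes(StandardCharsets.UTF_8)));
      NodeList nodes = doc.getElementsByTagName("record_delimiter");
      return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    } catch (Exception e) {
      throw new IllegalArgumentException("label could not be parsed as XML", e);
    }
  }

  public static void main(String[] args) {
    // Simplified fragment: Record_Character is omitted so the XML is well-formed.
    String fragment = "<Table_Character>"
        + "<offset unit=\"byte\">80</offset>"
        + "<records>25</records>"
        + "<record_delimiter>Carriage-Return Line-Feed</record_delimiter>"
        + "</Table_Character>";
    System.out.println(recordDelimiter(fragment));
  }
}
```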

this mimetype and doctype thing is totally arbitrary. as I noted, it was a bit of a rabbit hole we went down to try to do some guessing to check that the document type defined in the label actually makes sense with the file name defined (e.g. label says Excel file and it ends in xls/xlsx). In the end, unless something looks totally out of the ordinary here, we should just ignore this check and keep going. This error should only ever be thrown when something is glaringly wrong (e.g. label says Excel file and it is a PDF).

so for this case, I imagine the doctype check doesn't know what to do with a .dat file name suffix, which is why it returns "", but then the code should just continue to do checks and no longer do that specific doctype check.

hopefully that makes sense. will try to poke at it some more.
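
A minimal sketch of that control flow, assuming nothing about validate's real classes: every name here (`ContinueWithoutDoctypeSketch`, `validateReference`, the `Runnable` checks) is hypothetical. The only point it illustrates is that an unresolved problem type skips the doctype-specific check and never stops the remaining checks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are validate's real API.
final class ContinueWithoutDoctypeSketch {
  /** Runs the doctype check only when a problem type resolved, then always runs the rest. */
  static List<String> validateReference(String problemType, Runnable doctypeCheck,
      List<Runnable> remainingChecks) {
    List<String> notes = new ArrayList<>();
    if (problemType == null) {
      notes.add("doctype check skipped: no ProblemType resolved");
    } else {
      doctypeCheck.run();
      notes.add("doctype check ran with " + problemType);
    }
    for (Runnable check : remainingChecks) {
      check.run(); // never short-circuited by the doctype outcome
    }
    return notes;
  }
}
```

With a null problem type the returned notes record the skip, while the table and delimiter checks passed in `remainingChecks` still run.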

al-niessner · 2022-12-23T21:22:03Z

@jordanpadams

This is a valid file reference test not a table test. Are you saying all undefined doctype files are text tables?

The tests that are failing are HTML, JPEG, GIF, EXCEL, etc (8 of them) that somehow leave doctype empty while specifying enough to be one of those 8 types. The error demonstrates that the block

validate/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/FileReferenceValidationRule.java

Lines 429 to 439 in 1979b01

    
    ProblemType problemType = this.documentUtil.getProblemType(doctype);
    // Is is possible that there's no corresponding problemType. Must check for
    // null-ness before calling checkGenericDocument() function.
    if (problemType == null) {
      LOG.error(
          "FileReferenceValidationRule:Cannot retrieve ProblemType from provided doctype {}",
          doctype);
    } else {
      return this.checkGenericDocument(target, urlRef, fileObject, filename, parent,
          directory, documentStandardId, doctype, problemType);
    }

throws an error when problemType is still null. It is clear that they are positive tests of a fail result. Not sure of the actual test numbers as JUnit is hiding them very well from me.

al-niessner · 2022-12-23T21:26:06Z

@jordanpadams

How can you tell that a file should have LFCR or CRLF? If given a JPEG does it need to have CRLF in it or just EOF? In other words, what in the XML tells us that this referenced file should end with CRLF instead of LF or a stream of bytes until EOF?

al-niessner · 2022-12-23T21:29:28Z

@jordanpadams

I think I am understanding it better now. The original error, doctype being "" and forcing problemType to null, is the base error. It then prevents the table checks from taking place when we think they should, or else the table checks themselves are not correct. Is that about right?

jordanpadams · 2022-12-23T21:34:14Z

@al-niessner

How can you tell that a file should have LFCR or CRLF?

this is described in the label and this check is already happening elsewhere in the code.

for example, in the test data attached above, the label has:

    <Table_Character>
      <offset unit="byte">80</offset>
      <records>25</records> 
      <record_delimiter>Carriage-Return Line-Feed</record_delimiter>
      <Record_Character>

so this only applies to products where you can define record_delimiter. from the IM Spec, record_delimiter is defined in:

Checksum_Manifest (and any objects that extend Checksum_Manifest)
Stream_Text (and any objects that extend Stream_Text)
Table_Character (and any objects that extend Table_Character)
Table_Delimited (and any objects that extend Table_Delimited)
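
For a sense of what such a label-driven check could look like, here is a hedged sketch that walks a given number of records from a byte offset and compares each terminator against the label's record_delimiter. The class and method names are hypothetical, the "Line-Feed" spelling of the LF-only value is an assumption, and records are found by scanning for LF rather than by a fixed record length; validate's actual implementation may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not validate's implementation.
final class RecordDelimiterCheckSketch {
  /**
   * Returns 1-based numbers of records whose terminator does not match the label.
   * "Carriage-Return Line-Feed" requires CRLF; any other value is treated as bare LF.
   */
  static List<Integer> mismatchedRecords(byte[] data, int offset, int records, String delimiter) {
    boolean expectCrLf = "Carriage-Return Line-Feed".equals(delimiter);
    List<Integer> bad = new ArrayList<>();
    int pos = offset;
    for (int record = 1; record <= records; record++) {
      int lf = pos;
      while (lf < data.length && data[lf] != '\n') {
        lf++;
      }
      if (lf >= data.length) {
        bad.add(record); // ran out of data before finding a terminator; stop here
        break;
      }
      boolean hasCr = lf > pos && data[lf - 1] == '\r';
      if (hasCr != expectCrLf) {
        bad.add(record);
      }
      pos = lf + 1;
    }
    return bad;
  }

  public static void main(String[] args) {
    byte[] lfOnly = "a,b\nc,d\n".getBytes(StandardCharsets.US_ASCII);
    System.out.println(mismatchedRecords(lfOnly, 0, 2, "Carriage-Return Line-Feed"));
  }
}
```

An LF-only file checked against a label that declares CRLF reports every record, which is the situation described in this ticket.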

jordanpadams · 2022-12-23T21:35:25Z

the file reference check happens, which may throw an error or may not, but we should move onto doing the next check. nothing within that class should be a fatal error, stopping the remaining validation checks.

msbentley · 2022-12-23T21:43:48Z

I was thinking about this the other day - it could, theoretically, be possible that for some horribly esoteric reason a single file contains multiple tables with different line endings, right? :-/ And it is definitely possible that we have binary content mixed with ASCII, so I guess this check has to be at the data object level?

al-niessner · 2022-12-23T22:21:28Z

@jordanpadams

Found it. Now there is a table line counting problem. Will fix soon.

jordanpadams · 2022-12-30T16:29:07Z

so I guess this check has to be at the data object level?

@msbentley yes! our code was refactored a bit to hopefully more readily support validation of things like this at the data object level. hopefully that works as expected.
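
To illustrate what checking at the data object level means, the sketch below scopes the line-ending test to each object's own byte range and skips binary objects entirely. Everything here is hypothetical (`PerObjectDelimiterSketch`, the `DataObject` holder, a null delimiter marking binary content, the "Line-Feed" spelling); it is not the refactored validate code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-object scoping; not validate's real classes.
final class PerObjectDelimiterSketch {
  /** One data object inside a file; a null delimiter marks binary content. */
  static final class DataObject {
    final String name;
    final int offset;
    final int length;
    final String delimiter;

    DataObject(String name, int offset, int length, String delimiter) {
      this.name = name;
      this.offset = offset;
      this.length = length;
      this.delimiter = delimiter;
    }
  }

  /** Returns names of objects containing a line ending that contradicts their own delimiter. */
  static List<String> objectsWithWrongEndings(byte[] file, List<DataObject> objects) {
    List<String> failing = new ArrayList<>();
    for (DataObject obj : objects) {
      if (obj.delimiter == null) {
        continue; // binary: CR and LF byte values carry no meaning here
      }
      boolean expectCrLf = "Carriage-Return Line-Feed".equals(obj.delimiter);
      int end = Math.min(file.length, obj.offset + obj.length);
      for (int i = obj.offset; i < end; i++) {
        if (file[i] != '\n') {
          continue;
        }
        boolean hasCr = i > obj.offset && file[i - 1] == '\r';
        if (hasCr != expectCrLf) {
          failing.add(obj.name);
          break;
        }
      }
    }
    return failing;
  }
}
```

Because each object carries its own range and delimiter, a file mixing a CRLF table, a binary block, and an LF table can pass as long as every object matches its own label entry.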

miguelp1986 · 2023-03-09T21:39:45Z

Testrail link: https://cae-testrail.jpl.nasa.gov/testrail/index.php?/cases/view/1273418

rchenatjpl added bug needs:triage labels May 10, 2022

rchenatjpl assigned jordanpadams May 10, 2022

jordanpadams added B13.0 s.medium and removed needs:triage labels May 10, 2022

jordanpadams removed their assignment May 10, 2022

jordanpadams added B13.1 p.must-have labels Dec 20, 2022

al-niessner mentioned this issue Dec 21, 2022

B13.1 Fix Must-Have Priority Bugs #578

Closed

jordanpadams added the sprint-backlog label Dec 23, 2022

jordanpadams assigned al-niessner Dec 23, 2022

al-niessner mentioned this issue Dec 23, 2022

issue 499: not checking table EOL #583

Merged

jordanpadams closed this as completed Dec 29, 2022

jordanpadams removed the sprint-backlog label Feb 23, 2023

jordanpadams removed the B13.0 label Mar 6, 2023

miguelp1986 added the i&t.done label Mar 9, 2023
