Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate doesn't flag a data file with only LF #499

Closed
rchenatjpl opened this issue May 10, 2022 · 19 comments
Closed

validate doesn't flag a data file with only LF #499

rchenatjpl opened this issue May 10, 2022 · 19 comments
Assignees

Comments

@rchenatjpl
Copy link

The attached data file has only LF but of course should have CRLF. validate passes it.

% validate -t ../M7_217_044546_N.xml
PDS Validate Tool Report
Configuration:
Version 2.1.5-SNAPSHOT
Date 2022-05-10T08:01:15Z
Parameters:
Targets [file:/Users/rchen/Desktop/M7_217_044546_N.xml]
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Max Errors 100000
Registered Contexts File /Users/rchen/PDS4tools/validate/resources/registered_context_products.json
Product Level Validation Results
PASS: file:/Users/rchen/Desktop/M7_217_044546_N.xml
1 product validation(s) completed
Summary:
0 error(s)
0 warning(s)
Product Validation Summary:
1 product(s) passed
0 product(s) failed
0 product(s) skipped
Referential Integrity Check Summary:
0 check(s) passed
0 check(s) failed
0 check(s) skipped
End of Report
Completed execution in 3190 ms
%
%
% validate -V
gov.nasa.pds:validate
Version 2.1.5-SNAPSHOT
Release Date: 2022-02-10 21:20:46

@rchenatjpl rchenatjpl added bug Something isn't working needs:triage labels May 10, 2022
@rchenatjpl
Copy link
Author

Archive.zip

@jordanpadams
Copy link
Member

@rchenatjpl shouldn't this be valid as of CCB-264? #292

@rchenatjpl
Copy link
Author

oh, I forgot about that. I guess the problem is then what should validate do if the data file has only LF, but the label says record_delimiter is CRLF, which was the only value available for the selected IM model

@jordanpadams
Copy link
Member

@rchenatjpl copy. That is a bug then

@al-niessner
Copy link
Contributor

al-niessner commented Dec 21, 2022

@jordanpadams @nutjob4life @tloubrieu-jpl

The failure is obscure and needs clarification. It starts here:

ProblemType problemType = this.documentUtil.getProblemType(doctype);
// Is is possible that there's no corresponding problemType. Must check for
// null-ness before calling checkGenericDocument() function.
if (problemType == null) {
LOG.error(
"FileReferenceValidationRule:Cannot retrieve ProblemType from provided doctype {}",
doctype);
} else {
return this.checkGenericDocument(target, urlRef, fileObject, filename, parent,
directory, documentStandardId, doctype, problemType);
}

In the XML file the doctype resolves to "". when at line 429 the problemType is retrieved with doctype being "" a null is returned. It is an inherent part of the call at:

public ProblemType getProblemType(String docType) {
ProblemType problemType = null;
if (!this.classInitialized) {
// Only initialize this class once of the two lists' content.
this.initialize();
}
// Iterating through docTypeList and check if docType contains singleDocType.
// Note that everything is changed to lower cases for comparison.
int ii = 0;
for (String singleDocType : this.docTypeList) {
if (docType.toLowerCase().contains(singleDocType.toLowerCase())) {
problemType = this.problemTypeList.get(ii);
// Once we have found a matching value, there's no need to continue looping as
// it will be fetching the wrong ProblemType if we continue.
break;
}
ii++;
}
LOG.debug("getProblemType:docType,problemType {},{}", docType, problemType);
return (problemType);

because "" is not part of the initialize() of the class. Have a bunch like EXCEL, HTML, GIF, etc. Not blank so the function returns null. Now at line 432, the null is detected and the generic document test is not performed (not to say that the LF would have been caught but it was never given the chance).

So,

  1. Is the example XML malformed because the doctype is ""?
    a. Should validate not return null but rather unknown doctype error report?
    b. Should there be a "" doctype?
  2. The 429-439 block of code is wrong and it should always do the generic test despite not having a doctype but
    a. use parent problemType
    b. also modify generic document handler to ignore problemType arg and use a listener
  3. This ticket is in error because the doctype is allowed to be "" and this is the correct behavior.

Please clarify which is the actual problem (XML and/or code) and what the actual problem is (malformed XML and/or not detecting previous error).

@jordanpadams
Copy link
Member

✅ The 429-439 block of code is wrong and it should always do the generic test despite not having a doctype but
a. use parent problemType
b. also modify generic document handler to ignore problemType arg and use a listener

I actually think that doctype code was kind of a wild ride that we wound up not really relying on too much. In this case, we should continue to perform validation.

@jordanpadams
Copy link
Member

@al-niessner ☝️

@al-niessner
Copy link
Contributor

@jordanpadams @nutjob4life @tloubrieu-jpl

Yum, I do not think this is the right route. I think the XML is not sufficient. Applied patch to have problemType for the doctype = "" and it now sinks checking a little deeper:

if (documentStandardId != null) {
mimeTypeIsCorrectFlag = this.documentsChecker.isMimeTypeCorrect(textName, documentStandardId);
LOG.debug("handleGenericDocument:textName,documentStandardId,mimeTypeIsCorrectFlag {},{},{}",
textName, documentStandardId, mimeTypeIsCorrectFlag);
} else {
mimeTypeIsCorrectFlag = true; // Set to true even though the label does not have the
// documentStandardId set
// to anything.
}

The problem here is that documentStandardId = null causing the else portion to be execute instead of the actual check at line 839 not that it would matter to this ticket. If you look just down from the highlighted block it is obvious that the only check is the mime type - not too thrilling. Point is, if we are checking a random file then how do we know it should have LFCR instead of LF or CR or just randomly placed bytes of those values because it is a binary file. Without a proper doctype and mime type to tell us that lines should end LFCR and that LF or CR will not appear otherwise in the file then how can a check for them be meaningful?

@al-niessner
Copy link
Contributor

@jordanpadams @nutjob4life @tloubrieu-jpl

Just ran the cucumber tests and 8 of them fail because they now pass files that do not resolve a doctype. It seems there is a much a larger design problem and less a line of code.

@jordanpadams
Copy link
Member

@al-niessner which tests are failing and for which ticket? I would not be surprised if this was a larger design issue...

Without a proper doctype and mime type to tell us that lines should end LFCR and that LF or CR will not appear otherwise in the file then how can a check for them be meaningful?

that is actually not true. LF and CRLF are defined in the labels:

    <Table_Character>
      <offset unit="byte">80</offset>
      <records>25</records> 
      <record_delimiter>Carriage-Return Line-Feed</record_delimiter>
      <Record_Character>

this mimetype and doctype thing is totally arbitrary. as I noted, it was a bit of a rabbithole we went down to try to do some guessing to check that the document type defined in the label actually makes sense with the file name defined (e.g. label says Excel file and it ends in xls/xlsx). In the end, unless something looks totally out of the ordinary here, we should just ignore this check and keep going. It is only something glaring wrong where this error should ever be thrown (e.g. label says Excel file and it is a PDF).

so for this case, I imagine the doctype check doesn't know what to do with a .dat file name suffix, which is why it returns "", but then the code should just continue to do checks and no longer do that specific doctype check.

hopefully that makes sense. will try to poke at it some more.

@al-niessner
Copy link
Contributor

@jordanpadams

This is a valid file reference test not a table test. Are you saying all undefined doctype files are text tables?

The tests that are failing are HTML, JPEG, GIF, EXCEL, etc (8 of them) that somehow leave doctype empty while specifying enough to be one of those 8 types. The error demonstrates that the block

ProblemType problemType = this.documentUtil.getProblemType(doctype);
// Is is possible that there's no corresponding problemType. Must check for
// null-ness before calling checkGenericDocument() function.
if (problemType == null) {
LOG.error(
"FileReferenceValidationRule:Cannot retrieve ProblemType from provided doctype {}",
doctype);
} else {
return this.checkGenericDocument(target, urlRef, fileObject, filename, parent,
directory, documentStandardId, doctype, problemType);
}

throws an error when productType is still null. It is clear that they are positive tests of a fail result. Not sure of the actual tests numbers as JUnit is hiding them very well from me.

@al-niessner
Copy link
Contributor

@jordanpadams

How can you tell that a file should have LFCR or CRLF? If given a JPEG does it need to have CRLF in it or just EOF? In other words, what in the XML is telling that this reference file should end with CRLF instead of LF or a stream of bytes until EOF?

@al-niessner
Copy link
Contributor

@jordanpadams

I think I am understanding it better now. The original error that doctype is "" forcing problemType to null is causing the base error. It then prevents the table checks from taking place when we think they should or that the table checks are not correct. Is that about right?

@jordanpadams
Copy link
Member

jordanpadams commented Dec 23, 2022

@al-niessner

How can you tell that a file should have LFCR or CRLF?

this is described in the label and this check is already happening elsewhere in the code.

for example, in the test data attached above, the label has:

    <Table_Character>
      <offset unit="byte">80</offset>
      <records>25</records> 
      <record_delimiter>Carriage-Return Line-Feed</record_delimiter>
      <Record_Character>

so this only applies to products where you can define record_delimiter. from the IM Spec, record_delimiter is defined in:

@jordanpadams
Copy link
Member

the file reference check happens, which may throw an error or may not, but we should move onto doing the next check. nothing within that class should be a fatal error, stopping the remaining validation checks.

@msbentley
Copy link

I was thinking about this the other day - it could, theoretically, be possible that for some horribly esoteric reason a single file contains multiple tables with different line endings, right? :-/ And definitely possible that we have binary content mixed with ASCII, so I guess this check as to be at the data object level?

@al-niessner
Copy link
Contributor

@jordanpadams

Found it. Now there is a table line counting problem. Will fix soon.

@jordanpadams
Copy link
Member

so I guess this check as to be at the data object level?

@msbentley yes! our code was refactored a bit to hopefully enable more ready support validation of things like this at the data object level. hopefully that works as expected.

@miguelp1986
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants