File checker - reduce size of text comparison #546

IanMayo · 2023-11-20T18:11:48Z

The file-checker is failing files, but the target content is actually present.

The tester reports:

Couldn't find source text from xxx.html in target document yyy.dita
Source text:
2.       A mid-life upgrade of these installations was announced in 2011.  ANCHORS was upgraded with a new heat 
pump and ACME waste cleanser.  BRAVO is to be upgraded with NOBLE IOT data system and DRAGO halon drench, 
as well as a new Drago Mills EMS system

This paragraph of text is present in the published dita. The text looks identical, including the non-breaking space chars and the erroneous double spaces between words.

To avoid the above false error, I think we should trim the block of text. I guess if the target contains, say, 30 successive matching chars from the source then it is valid.

I'll come back tomorrow and see if I can spot a pattern.

Aah, there is something. In the source html for another file there is a &deg; marker, but in the dita it is a literal degree symbol (°). I guess we should strip these out of both strings before comparing.
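A rough sketch of what cleaning both strings before comparing could look like. It decodes entities and folds whitespace rather than deleting characters, which is a swapped-in approach, not what check_files.py currently does. The helper name `normalise` is hypothetical, and it assumes the mismatches come only from HTML entities and whitespace variants.

```python
import html
import re
import unicodedata


def normalise(text: str) -> str:
    """Hypothetical helper: reduce a string to a comparable form."""
    # Decode HTML entities such as &deg; and &nbsp; into the characters they name.
    text = html.unescape(text)
    # NFKC folds compatibility characters, e.g. a non-breaking space (U+00A0)
    # becomes an ordinary space.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (double spaces, line breaks) into one space.
    return re.sub(r"\s+", " ", text).strip()


# Both sides are normalised before the containment check.
source = "heading 10&deg;  north&nbsp;east"
target = "<p>heading 10°\u00a0north east</p>"
print(normalise(source) in normalise(target))
```

If this were applied to both the source and the target text, the degree marker and the double spaces would no longer cause a mismatch on their own.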


robintw · 2023-11-22T14:35:22Z

Are you still intending to look for more patterns for when this is failing? Or should I go ahead and change the logic for how we check?

And, if I do change the logic, do you mean if any 30 successive characters anywhere in the string match then we count it as a match? I suspect that's actually significantly more computationally intensive to do, as we'd have to loop over every possible 30-char long substring and check each one until we find one that matches. Is that worth it?
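For reference, the exhaustive version I have in mind would be something like the sketch below. The function name and the default of 30 are illustrative only, not taken from check_files.py. In the worst case it runs one substring search over the target for every window in the source.

```python
def any_window_matches(source: str, target: str, size: int = 30) -> bool:
    """Return True if any `size`-char slice of source occurs in target."""
    # A source shorter than the window can only be checked as a whole.
    if len(source) <= size:
        return source in target
    # Slide a window of `size` chars over the source, one position at a time,
    # stopping at the first slice that is found in the target.
    return any(
        source[start:start + size] in target
        for start in range(len(source) - size + 1)
    )


source = "2.\u00a0A mid-life upgrade of these installations was announced in 2011."
target = "<p>2. A mid-life upgrade of these installations was announced in 2011.</p>"
print(any_window_matches(source, target))
```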

IanMayo · 2023-11-22T14:46:38Z

For the 30 chars, I thought we could simplify the test - by taking a random 30 char long block of text from the string, and if it is also present in the target then it's valid. Matching a longer string seems to invite more chances for false negatives (through different whitespace or the presence of special characters).
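Something along these lines is what I mean, as a minimal sketch only. The function name, the `rng` parameter and the handling of short strings are my assumptions, not existing code in check_files.py.

```python
import random


def random_block_present(source, target, size=30, rng=None):
    """Check one randomly chosen `size`-char block of source against target."""
    # Allow a seeded random.Random to be passed in so a run can be repeated.
    rng = rng if rng is not None else random
    # If the source is no longer than the block, compare the whole string.
    if len(source) <= size:
        return source in target
    # Pick a random start so that the block fits inside the source.
    start = rng.randrange(len(source) - size + 1)
    return source[start:start + size] in target


source = "ANCHORS was upgraded with a new heat pump and ACME waste cleanser."
target = "<p>" + source + "</p>"
print(random_block_present(source, target, rng=random.Random(0)))
```

A single random block can still land on the part of the text that differs, so this reduces false negatives rather than removing them.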

I've just pulled the value of 30 out of the air, since it's longer than "Characteristics" or some other long word that could occur multiple times.

I won't get a chance to look for a deeper understanding of why I'm getting false negatives for a few days - but if you're ok with reducing the length of the text being compared, then that itself may reduce the false negatives.


IanMayo added the enhancement New feature or request label Nov 20, 2023

IanMayo assigned robintw Nov 20, 2023

robintw added a commit that referenced this issue Nov 22, 2023

Change check_files.py script to get a random 30 char substring of the text and check that exists in the target file. Fixes #546

a09ae5d

robintw mentioned this issue Nov 22, 2023

Change check_files.py script to check a random 30 char substring of the text #554

Merged

IanMayo closed this as completed in #554 Nov 23, 2023
