This repository has been archived by the owner on Apr 17, 2024. It is now read-only.

File checker - reduce size of text comparison #546

Closed
IanMayo opened this issue Nov 20, 2023 · 2 comments · Fixed by #554

Comments

@IanMayo
Contributor

IanMayo commented Nov 20, 2023

The file-checker is failing files, but the target content is actually present.

The tester reports:

Couldn't find source text from xxx.html in target document yyy.dita
Source text:
2.       A mid-life upgrade of these installations was announced in 2011.  ANCHORS was upgraded with a new heat 
pump and ACME waste cleanser.  BRAVO is to be upgraded with NOBLE IOT data system and DRAGO halon drench, 
as well as a new Drago Mills EMS system

This paragraph of text is present in the published dita. The text looks identical, including the non-breaking space chars and the erroneous double spaces between words.

To avoid the above false error, I think we should trim the block of text being compared. I guess if the target contains, say, 30 successive matching chars from the source then it is valid.

I'll come back tomorrow and see if I can spot a pattern.

Aah, there is something. In the source html for another file there is a &deg; marker, but in the dita it is a literal degree symbol (°). I guess we should strip these out of both strings before comparing.
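A rough sketch of the clean-up step suggested above, assuming the checker is written in Python. This is not the project's actual code, and the helper name `normalise_text` is hypothetical. Instead of stripping the characters out, it decodes HTML entities and folds whitespace, so that an entity in the source and the literal character in the target compare as equal:

```python
import html
import re
import unicodedata


def normalise_text(text: str) -> str:
    """Hypothetical helper: reduce text to a form that is safe to compare."""
    # Decode HTML entities such as &deg; or &nbsp; into their Unicode characters.
    text = html.unescape(text)
    # NFKC folds compatibility characters; a no-break space becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace (including newlines) into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Both the source and the target text would need to pass through the same function before the comparison for this to help.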

@IanMayo IanMayo added the enhancement New feature or request label Nov 20, 2023
@robintw
Collaborator

robintw commented Nov 22, 2023

Are you still intending to look for more patterns for when this is failing? Or should I go ahead and change the logic for how we check?

And, if I do change the logic, do you mean if any 30 successive characters anywhere in the string match then we count it as a match? I suspect that's actually significantly more computationally intensive to do, as we'd have to loop over every possible 30-char long substring and check each one until we find one that matches. Is that worth it?
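For reference, a minimal sketch of the exhaustive version described above, assuming Python. The function name `any_window_matches` is hypothetical and not taken from the repository. It slides a fixed-size window along the source and stops at the first window found in the target:

```python
def any_window_matches(source: str, target: str, size: int = 30) -> bool:
    """Hypothetical check: does any `size`-char run of source occur in target?"""
    # A source shorter than the window can only be compared as a whole.
    if len(source) <= size:
        return source in target
    # Try each window in turn; any() stops at the first match.
    last_start = len(source) - size
    return any(source[i:i + size] in target for i in range(last_start + 1))
```

The worst case (no match at all) runs one substring search per window, so the cost grows with the length of the source multiplied by the cost of searching the target. Whether that matters depends on how large the documents are.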

@IanMayo
Contributor Author

IanMayo commented Nov 22, 2023

For the 30 chars, I thought we could simplify the test - by taking a random 30 char long block of text from the string, and if it is also present in the target then it's valid. Matching a longer string seems to invite more chances for false negatives (through different whitespace or the presence of special characters).

I've just pulled the value of 30 out of the air, since it's longer than "Characteristics" or some other long word that could occur multiple times.

I won't get a chance to look for a deeper understanding of why I'm getting false negatives for a few days - but if you're ok with reducing the length of the text being compared, then that itself may reduce the false negatives.
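The simplified test could look something like the sketch below, assuming Python. The name `random_block_matches` and the optional `rng` parameter are made up for illustration; `rng` is only there so a run can be reproduced:

```python
import random
from typing import Optional


def random_block_matches(
    source: str,
    target: str,
    size: int = 30,
    rng: Optional[random.Random] = None,
) -> bool:
    """Hypothetical check: is one randomly chosen block of source in target?"""
    rng = rng or random.Random()
    # A source shorter than the block size is compared as a whole.
    if len(source) <= size:
        return source in target
    # Pick a random start so that the block fits inside the source.
    start = rng.randrange(len(source) - size + 1)
    return source[start:start + size] in target
```

This does a single substring search per paragraph, but the chosen block can still land on a stretch where whitespace or special characters differ, so it reduces the false negatives without ruling them out.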
