config_archive.xml mistaken regular expressions #257

@kdraeder


The short version:

h\.bgc\..*?.?[_\d+]+\.nc$ should be replaced by:
h\.bgc\..*(\._\d+)?\.nc$

ic.[-\d(_\d)?+]+(\._\d*)?\.nc(\.\d*)?$ should be replaced by:
ic\.[-\d_]+(\._\d*)?\.nc(\.\d*)?$


I believe that the target of [_\d+] in h\.bgc\..*?.?[_\d+]+\.nc$ is the optional instance number, e.g. ._0001.
But inside a [] character class the + is a literal character, not a quantifier (1 or more \d).

In addition, the + following the [_\d+] requires at least one such character (a +, _, or \d) immediately before .nc,
so an instance number is effectively mandatory, which is not the goal. I tested that the existing RE:
matches CASE.bgc.moredescription._0001.nc
but not CASE.bgc.moredescription.nc
(Putting an _ between 'more' and 'description' short-circuits the match, which is another flaw of the [_\d+])
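
The behaviour described above can be checked with a short Python sketch. The file names are hypothetical (a CASE.pop. prefix is assumed so that the h\.bgc\. part of the RE has something to match), and Python's re module is assumed to behave like the RE engine the archiver uses.

```python
import re

# Existing RE from config_archive.xml, as quoted in this issue.
existing = re.compile(r"h\.bgc\..*?.?[_\d+]+\.nc$")

# Hypothetical file names; only the part from "h.bgc." onward matters.
with_instance = "CASE.pop.h.bgc.moredescription._0001.nc"
without_instance = "CASE.pop.h.bgc.moredescription.nc"
plus_sign = "CASE.pop.h.bgc.moredescription.+.nc"  # literal + before .nc

# Record whether each name is matched by the existing RE.
matches = {name: existing.search(name) is not None
           for name in (with_instance, without_instance, plus_sign)}

for name, ok in matches.items():
    print(f"{name}: {'match' if ok else 'no match'}")
```

The third name is included only to show that a literal + is accepted by the character class.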

One final suggestion: .bgc\..*?.? is hard to parse and may be needlessly complex.
The .bgc. part is clear, but .*? is a lazy (non-greedy) expression that matches as few characters as possible, starting from 0.
If nothing follows .bgc., then it matches 0 characters. The [_\d+] would make it match all of the characters up to
the _, \d, or +. But the optional single character .? between those makes it hard to predict what will match.
My best guess is that it makes the parser look ahead for a _, \d, or +, then look for a character before it,
and then assign any remaining characters before the optional character to the .*?.

I think that the [_\d+] should be replaced by an optional group (in which the + is a metacharacter)
and the .*?.? should be replaced by .*:
h\.bgc\..*(\._\d+)?\.nc$
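
A minimal Python sketch of the proposed RE, using hypothetical file names (the CASE.pop. prefix and the .bak suffix are assumptions for illustration):

```python
import re

# Proposed replacement RE from this issue.
proposed = re.compile(r"h\.bgc\..*(\._\d+)?\.nc$")

# Hypothetical file names.
with_instance = "CASE.pop.h.bgc.moredescription._0001.nc"
without_instance = "CASE.pop.h.bgc.moredescription.nc"
wrong_suffix = "CASE.pop.h.bgc.moredescription.nc.bak"

results = {name: proposed.search(name) is not None
           for name in (with_instance, without_instance, wrong_suffix)}

for name, ok in results.items():
    print(f"{name}: {'match' if ok else 'no match'}")

# What the optional group captured for the name with an instance number.
instance_group = proposed.search(with_instance).group(1)
print("optional group captured:", instance_group)
```

With Python's re, the names with and without an instance number both match. The greedy .* consumes the instance number before the optional group is tried, so the group itself captures nothing; this does not change which names match.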


Farther down I see ic.[-\d(_\d)?+]+(\._\d*)?\.nc
The leading ic. should probably be ic\. (an escaped, literal dot).
Outside a [ ] list, ?+ would mean 'matches the previous token between zero and one times, as many times as possible,
without giving back (possessive)'.
But it's inside a [ ] list, so the RE tester I'm using makes it look like the group syntax is ignored
and the metacharacters are disabled:

- matches the character - with index 45 decimal (2D hex or 55 octal) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
(_ matches a single character in the list (_ (case sensitive)
   ( matches the character ( with index 40 decimal (28 hex or 50 octal) literally (case sensitive)
   _ matches the character _ with index 95 decimal (5F hex or 137 octal) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
)?+ matches a single character in the list )?+ (case sensitive)
   ) matches the character ) with index 41 decimal (29 hex or 51 octal) literally (case sensitive)
   ? matches the character ? with index 63 decimal (3F hex or 77 octal) literally (case sensitive)
   + matches the character + with index 43 decimal (2B hex or 53 octal) literally (case sensitive)

The existing list does match strings of -, _, and digits, but so does ic\.[-\d_]+\., which is simpler.
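
The difference between the two character lists can be enumerated directly. The sketch below is in Python, under the assumption that its re module treats the list the same way as the tester above; the file names are hypothetical.

```python
import re

old_class = re.compile(r"[-\d(_\d)?+]")   # list from the existing RE
new_class = re.compile(r"[-\d_]")         # simplified list

# Printable ASCII characters accepted by each list.
printable = [chr(i) for i in range(32, 127)]
accepted_old = {c for c in printable if old_class.fullmatch(c)}
accepted_new = {c for c in printable if new_class.fullmatch(c)}

# Characters that only the existing list accepts.
only_old = "".join(sorted(accepted_old - accepted_new))
print("accepted only by the existing list:", only_old)

# Full REs: existing (unescaped dot after ic) vs. proposed (ic\.).
old_full = re.compile(r"ic.[-\d(_\d)?+]+(\._\d*)?\.nc(\.\d*)?$")
new_full = re.compile(r"ic\.[-\d_]+(\._\d*)?\.nc(\.\d*)?$")

good_name = "CASE.ic.2000-01-01_00000.nc"   # hypothetical
odd_name = "CASE.icX2000-01-01.nc"          # hypothetical: X where the dot should be

for name in (good_name, odd_name):
    print(name, bool(old_full.search(name)), bool(new_full.search(name)))
```

The only characters the existing list accepts beyond the simplified one are the literal (, ), + and ?. The second name shows the effect of the unescaped dot: the existing RE accepts any character after ic, while the proposed RE requires a literal dot.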


Labels: bug (Something isn't working)
