Description
The short version:
h\.bgc\..*?.?[_\d+]+\.nc$ should be replaced by:
h\.bgc\..*(\._\d+)?\.nc$
ic.[-\d(_\d)?+]+(\._\d*)?\.nc(\.\d*)?$ should be replaced by
ic\.[-\d_]+(\._\d*)?\.nc(\.\d*)?$
I believe that the target of [_\d+] in h\.bgc\..*?.?[_\d+]+\.nc$ is the optional instance number, e.g. ._0001.
The [] around _\d+ makes the + a literal character instead of a quantifier (one or more \d).
In addition, the + following the [_\d+] means that there must be an instance number (actually, one or more of +, _, or \d),
which is not the goal. I tested that the existing RE:
matches CASE.bgc.moredescription._0001.nc
but not CASE.bgc.moredescription.nc
(Putting an _ between 'more' and 'description' short-circuits the match, which is another flaw of the [_\d+])
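The test above can be reproduced with a short script. This is only a sketch using Python's re module; the file names are hypothetical, with "h." inserted before "bgc" so that the leading h\. of the pattern has something to match.

```python
import re

# Existing pattern, copied from the issue text.
EXISTING = re.compile(r"h\.bgc\..*?.?[_\d+]+\.nc$")

# Hypothetical file names (assumption: "h." precedes "bgc" in real names).
names = [
    "CASE.h.bgc.moredescription._0001.nc",  # with an instance number
    "CASE.h.bgc.moredescription.nc",        # without an instance number
    "CASE.h.bgc.moredescription.+.nc",      # a literal + in place of the number
]

# Record whether the pattern finds a match in each name.
results = {name: bool(EXISTING.search(name)) for name in names}

for name, matched in results.items():
    print(f"{name}: {'match' if matched else 'no match'}")
```

With these inputs the name without an instance number is rejected, while the name containing only a literal + before .nc is accepted, which is consistent with the + inside [] being treated as an ordinary character.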
One final suggestion is that .bgc\..*?.? is hard to parse and maybe needlessly complex.
The .bgc. part is clear, but .*? is a lazy/non-greedy expression that matches the minimum of 0 or more characters.
If nothing follows .bgc., then it matches 0 characters. The [_\d+] would make it match all of the characters up to
the _, \d, or +. But the optional single character .? between those makes it hard to predict what will match.
My best guess is that it makes the parser look ahead for a _, \d, or +, then look for a character before it,
and then assign any remaining characters before the optional character to the .*?.
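One way to check that guess is to wrap each piece in a capture group and inspect what it consumed. The sketch below does this in Python; the PROBE pattern is the existing one with parentheses added purely for inspection, split_parts is a hypothetical helper, and the file names are made up.

```python
import re

# Existing pattern with capture groups added around .*?, .?, and [_\d+]+
# (the groups are for inspection only and are not in the original RE).
PROBE = re.compile(r"h\.bgc\.(.*?)(.?)([_\d+]+)\.nc$")


def split_parts(name):
    """Return what (.*?), (.?), and ([_\\d+]+) each matched, or None."""
    m = PROBE.search(name)
    return m.groups() if m else None


# Hypothetical file names.
print(split_parts("CASE.h.bgc.moredescription._0001.nc"))
print(split_parts("CASE.h.bgc._0001.nc"))
```

In Python this gives ('moredescription', '.', '_0001') for the first name and ('', '_', '0001') for the second. In the second case the optional .? consumes the underscore itself, which illustrates how hard it is to predict which part of the name each piece will claim.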
I think that the [_\d+] should be replaced by an optional group (in which the + is a metacharacter)
and the .*?.? should be replaced by .*:
h\.bgc\..*(\._\d+)?\.nc
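A quick check of the suggested replacement, again as a Python sketch with hypothetical file names (the $ anchor from the short version at the top is included):

```python
import re

# Suggested replacement pattern from this issue.
PROPOSED = re.compile(r"h\.bgc\..*(\._\d+)?\.nc$")

# Hypothetical file names, with and without an instance number.
with_instance = PROPOSED.search("CASE.h.bgc.moredescription._0001.nc")
without_instance = PROPOSED.search("CASE.h.bgc.moredescription.nc")

print(bool(with_instance), bool(without_instance))
# Show what the optional group captured for the name with an instance number.
print(with_instance.group(1))
```

Both names match. One thing to be aware of: because .* is greedy, it absorbs the ._0001 as well, so the optional group captures nothing (group(1) is None) even when an instance number is present. That is fine if the pattern is only used to accept or reject file names, but the pattern would need further adjustment if the instance number ever had to be extracted.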
Farther down I see ic.[-\d(_\d)?+]+(\._\d*)?\.nc.
ic. should probably be ic\., since an unescaped . matches any character.
?+ seems to mean 'matches the previous token between zero and one times, as many times as possible,
without giving back (possessive)'.
But it's inside a [ ] list, so the RE tester I'm using makes it look like the group syntax is ignored
and the metacharacters are disabled:
- matches the character - with index 45 decimal (2D hex, 55 octal) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
(_ matches a single character in the list (_ (case sensitive)
  ( matches the character ( with index 40 decimal (28 hex, 50 octal) literally (case sensitive)
  _ matches the character _ with index 95 decimal (5F hex, 137 octal) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
)?+ matches a single character in the list )?+ (case sensitive)
  ) matches the character ) with index 41 decimal (29 hex, 51 octal) literally (case sensitive)
  ? matches the character ? with index 63 decimal (3F hex, 77 octal) literally (case sensitive)
  + matches the character + with index 43 decimal (2B hex, 53 octal) literally (case sensitive)
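The same conclusion can be cross-checked outside the RE tester. The sketch below uses Python's re module (an assumption; the tester may use a different flavor) to list which printable ASCII characters the bracket expression accepts on its own.

```python
import re
import string

# The bracket expression from the ic pattern, taken in isolation.
CLASS = re.compile(r"[-\d(_\d)?+]")

# Collect every printable ASCII character that the class accepts.
accepted = {ch for ch in string.printable if CLASS.fullmatch(ch)}

print("".join(sorted(accepted)))
```

The accepted set is the ten digits plus the literal characters - _ ( ) ? and +, so the parentheses and the ?+ do not act as a group or a quantifier here.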
It does seem to find strings of -, _, and integers, but so does ic\.[-\d_]+\., which is simpler.
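To compare the two ic patterns from the short version side by side, here is a small Python sketch. The file names are hypothetical and were chosen to show where the patterns agree and where they differ.

```python
import re

# Existing and suggested patterns, copied from the short version above.
OLD_IC = re.compile(r"ic.[-\d(_\d)?+]+(\._\d*)?\.nc(\.\d*)?$")
NEW_IC = re.compile(r"ic\.[-\d_]+(\._\d*)?\.nc(\.\d*)?$")

# Hypothetical file names.
names = [
    "CASE.ic.2000-01-01_00000.nc",
    "CASE.ic.2000-01-01_00000._0001.nc",
    "CASE.ic.2000-01-01_00000.nc.1",
    "CASE.icX2000-01-01.nc",   # no literal dot after "ic"
    "CASE.ic.2000(+?).nc",     # stray (, +, ?, ) characters
]

# Map each name to (matched by old pattern, matched by new pattern).
comparison = {
    name: (bool(OLD_IC.search(name)), bool(NEW_IC.search(name)))
    for name in names
}

for name, (old, new) in comparison.items():
    print(f"{name}: old={old} new={new}")
```

Both patterns accept the first three names. Only the existing pattern accepts the last two, because of the unescaped . after ic and the literal ( ) ? + inside the bracket expression.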