Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iris incorrectly failing valid netcdf names #5929

Open
ScottWales opened this issue Apr 29, 2024 · 3 comments
Open

Iris incorrectly failing valid netcdf names #5929

ScottWales opened this issue Apr 29, 2024 · 3 comments
Labels
Good First Issue A good issue to take on if you're just getting started with Iris development Type: Bug

Comments

@ScottWales
Copy link

🐛 Bug Report

Iris is incorrectly validating netcdf variable names, rejecting opening valid files

How To Reproduce

Steps to reproduce the behaviour:

import netCDF4 as nc
import iris

root = nc.Dataset("sample.nc", "w", format="NETCDF4")
# Valid variable name according to netCDF rules
root.createVariable("a(x)", "f4")
root.close()

cubes = iris.load("sample.nc")

Iris fails in iris/common/mixin.py", line 177 with error ValueError: 'a(x)' is not a valid NetCDF variable name.

Expected behaviour

Iris should safely load the variable with a valid NetCDF name.

Iris matches names against a regex at https://github.com/SciTools/iris/blob/main/lib/iris/common/metadata.py#L41-L43, referencing an outdated link to netCDF documentation. If validation is performed on reading NetCDF files it should follow the current format rules:

nc_name_re = re.compile(r"""
    ^([a-zA-Z0-9_]|[^\x00-\x7F]) # Start with ASCII word or multibyte unicode
    ([^\x00-\x1F/\x7F-\xFF]|[^\x00-\x7F])* # Continue with non-control ASCII or multibyte unicode
    (?<!\s)$ # Don't end with whitespace
    """,
    re.VERBOSE)

Alternately validation could be performed only when writing out files, or left to the NetCDF library.

Unidata give this guidance on netcdf names in the file format documentation:

Note on names: Earlier versions of the netCDF C-library reference implementation enforced a more restricted set of characters in creating new names, but permitted reading names containing arbitrary bytes. This specification extends the permitted characters in names to include multi-byte UTF-8 encoded Unicode and additional printing characters from the US-ASCII alphabet. The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_' (reserved for special names with meaning to implementations, such as the “_FillValue” attribute). Subsequent characters may also include printing special characters, except for '/' which is not allowed in names. Names that have trailing space characters are also not permitted.

The netCDF4 library will give an error if you try to create a variable with an invalid name, e.g. one that ends with whitespace:

import netCDF4 as nc
root = nc.Dataset("sample.nc", "w", format="NETCDF4")
# Invalid netCDF variable name, raises exception `RuntimeError: NetCDF: Name contains illegal characters`
root.createVariable("invalid ", "f4")
root.close()

Environment

  • OS & Version: Rocky Linux 8
  • Iris Version: 3.7.0

Additional context

Full stack trace
$ python sample_iris.py 2>&1 | sed 's:".*/python:".../python:'
Traceback (most recent call last):
  File "/scratch/hc46/saw562/tmp/pete/sample_iris.py", line 13, in <module>
    cubes = iris.load("sample.nc")
  File ".../python3.10/site-packages/iris/__init__.py", line 324, in load
    return _load_collection(uris, constraints, callback).merged().cubes()
  File ".../python3.10/site-packages/iris/__init__.py", line 290, in _load_collection
    result = _CubeFilterCollection.from_cubes(cubes, constraints)
  File ".../python3.10/site-packages/iris/cube.py", line 106, in from_cubes
    for cube in cubes:
  File ".../python3.10/site-packages/iris/__init__.py", line 271, in _generate_cubes
    for cube in iris.io.load_files(part_names, callback, constraints):
  File ".../python3.10/site-packages/iris/io/__init__.py", line 236, in load_files
    for cube in handling_format_spec.handler(
  File ".../python3.10/site-packages/iris/fileformats/netcdf/loader.py", line 576, in load_cubes
    cube = _load_cube(engine, cf, cf_var, cf.filename)
  File ".../python3.10/site-packages/iris/fileformats/netcdf/loader.py", line 285, in _load_cube
    engine.activate()
  File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/engine.py", line 97, in activate
    run_actions(self)
  File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/actions.py", line 521, in run_actions
    action_default(engine)  # This should run the default rules.
  File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/actions.py", line 74, in inner
    rule_name = func(engine, *args, **kwargs)
  File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/actions.py", line 90, in action_default
    hh.build_cube_metadata(engine)
  File ".../python3.10/site-packages/iris/fileformats/_nc_load_rules/helpers.py", line 397, in build_cube_metadata
    cube.var_name = cf_var.cf_name
  File ".../python3.10/site-packages/iris/common/mixin.py", line 177, in var_name
    raise ValueError(emsg.format(name))
ValueError: 'a(x)' is not a valid NetCDF variable name.
@trexfeathers trexfeathers added the Good First Issue A good issue to take on if you're just getting started with Iris development label Apr 29, 2024
@pp-mo
Copy link
Member

pp-mo commented Apr 30, 2024

Thanks @ScottWales for raising this.
I certainly do see what you mean, and I know what it sounds like, but in fact I think this error is more of a mis-speak than a total mistake.
What it should really say is "is not a valid NetCDF -CF variable name".
I.E. it is applying the rules given here

So, Iris generally follows CF conventions, and in many cases requires them. We consider "CF strictness" as generally quite an important feature of Iris, though we do also have quite a few workarounds for common violations (like invalid "standard_name" and "units" attributes, and dimension variables in "coordinates" attributes). The appropriate level of this has also been a subject of some considerable discussion in recent times.

Unfortunately, we don't have a workaround for this problem.
I think it might even be a bit tricky to devise a solid automatic scheme for replacing non-compliant variable names, since the CF rules are rather highly fussy, not allowing "%" for instance.

Obviously we can fix this error message to give the proper context, but it remains a fact that Iris is rather strict about this + it is hard to work around.
If practical, you might consider using "ncdata" to fix this, which at least avoids a need to copy or modify the original file. It's very much a kind of problem which that was designed for.
Something like this should work ...

from ncdata.netcdf4 import from_nc4
from ncdata.iris import to_iris
ncds = from_nc4("sample.nc")
ncds.variables.rename("a(x)", "a_x")
cubes = to_iris(ncds)
print(cubes)

I tested that on your minimal example + it does work for me.

Of course, in an actual case it may not be so simple, as you may need to correct references to the variable elsewhere, like e.g. data::coordinates = "a(x) b(y)" -- though I suppose that wouldn't be good CF either, so perhaps it's not a likely case.

Does this help at all @ScottWales ?

@pp-mo
Copy link
Member

pp-mo commented May 1, 2024

Ok update on this -- I was a bit wrong...

It seems that the current restrictions in Iris don't actually follow the CF rules, but something slightly similar, which is in fact the rules given for "Classic" NetCDF -- effectively NetCDF3.
The statement for that (which I think the Iris regexp you noted is implementing) can now be found here in the current Netcdf User Guide : Permitted Characters in NetCDF Names

However, as also stated in that para, NetCDF4 removes most of those restrictions
-- though quite what "as well as other special characters" means is still a bit in question.
This is what we see in your experiments creating variables.
Following that method, I find -- purely from experiment :

  • "\n" and "\t" are forbidden
  • " " (space) is disallowed at the ends, but allowed in the middle
  • characters including at least (I have checked) =;:.()[]{}%$! are all OK but not at the start
  • a "/" is generally errored but in some places may be ignored
    ( e.g. "b/" seems valid but results in a variable called just "b", "cc/a" results in "a" at least when there is no group "cc").

Actually though, I would still say that ideally Iris probably should be enforcing the CF rules -- which would actually be more restrictive than what we currently have.

In general , I'd suggest a reasonable goal that the cf-checker will not object to anything saved from Iris : As we don't currently have any special checks on save, that is just what we allow for setting a "cube.var_name", which is exactly the scope of the regexp and the error which you have noted.

Just to confuse things further, though, the CF community are currently discussing the possibility of allowing basically any valid netcdf names after all (!)

@pp-mo
Copy link
Member

pp-mo commented May 1, 2024

Iris probably should be enforcing the CF rules -- which would actually be more restrictive than what we currently have

( But practically, that would be a painful journey involving breaking changes + deprecations )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good First Issue A good issue to take on if you're just getting started with Iris development Type: Bug
Projects
Status: No status
Development

No branches or pull requests

3 participants