Properly handle unicode values passed to cmor.axis in Python #616

mauzey1 · 2020-11-05T22:21:13Z

Fixes #612

When string values are passed to cmor.axis, they are now treated as a NumPy array of dtype numpy.unicode_.

From https://numpy.org/doc/stable/reference/arrays.dtypes.html

Note that str refers to either null terminated bytes or unicode strings depending on the Python version. In code targeting both Python 2 and 3 np.unicode_ should be used as a dtype for strings.

The C code has been modified to properly copy string values from a NumPy array into a C array to be processed by cmor_axis.
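For readers who want a feel for what that copy involves, here is a rough Python-side sketch. It is not the C code from this PR: the helper name `to_c_string_buffer` is hypothetical, and the layout (fixed-width slots, one spare byte for the terminator) is an assumption based on the commit titles about zero-terminated bytes and measuring the max string length.

```python
import numpy as np


def to_c_string_buffer(values):
    """Pack strings into one flat buffer of fixed-width, NUL-terminated slots.

    Hypothetical helper for illustration only; it is not part of CMOR.
    """
    # Unicode arrays store 4 bytes per character, so encode to UTF-8 first.
    arr = np.asarray(values, dtype=np.str_)
    encoded = np.char.encode(arr, "utf-8")

    # Slot width = longest encoded value plus one byte for the trailing NUL.
    width = encoded.dtype.itemsize + 1
    padded = encoded.astype(f"S{width}")

    # NumPy pads short values with NUL bytes, so every slot ends in b"\x00".
    return padded.tobytes(), width


buffer, width = to_c_string_buffer(["a", "bcd"])
```

Measuring the width after encoding matters because a non-ASCII character can occupy more than one byte in UTF-8, so the character count alone would undersize the slots.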


durack1 · 2020-11-05T22:25:36Z

@mauzey1 I wonder if explicitly decoding to unicode e.g.

string.decode('utf-8')

Would be a good way to force all types?

mauzey1 · 2020-11-06T00:23:20Z

@durack1
The current line below should already be performing this conversion.

coord_vals = numpy.array(coord_vals, numpy.unicode_)

I could make the conversion more explicit by using

if data_type == 'S':
    coord_vals = numpy.char.decode(coord_vals, encoding='utf-8')
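As a standalone illustration of that explicit decode, the sketch below uses made-up axis labels and checks the array's dtype kind rather than the `data_type` variable from the snippet above, which is an assumption about how the check could be written outside of cmor.axis. It also uses `numpy.str_`, since the `numpy.unicode_` alias was removed in NumPy 2.0.

```python
import numpy as np

# Hypothetical axis labels supplied as byte strings (dtype kind 'S').
coord_vals = np.array([b"lat", b"lon", b"plev"])

if coord_vals.dtype.kind == "S":
    # Decode element-wise; invalid UTF-8 raises UnicodeDecodeError
    # instead of being passed through silently.
    coord_vals = np.char.decode(coord_vals, encoding="utf-8")

# After decoding, the array holds unicode strings (dtype kind 'U').
is_unicode = np.issubdtype(coord_vals.dtype, np.str_)
```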

durack1 · 2020-11-06T00:28:00Z

@mauzey1 my comment was just that. I found that in generating HTML content from the contributed contents for the CMIP6_CVs, all manner of weird non-standard (and non-UTF-8) characters were sneaking in (presumably from folks copying characters out of Word, or some other rich text software), and the parsers weren't expecting these characters and consequently barfed.

I think the same is true of CMOR, so catching non-UTF-8 characters and throwing an explicit error (if a decode function can't handle them) would be my preferred, more bulletproof approach.
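A minimal sketch of that kind of strict check is shown below. The function name `decode_axis_value` and the choice of `ValueError` are hypothetical, not what CMOR does; the sample bytes stand in for a Windows-1252 curly quote of the sort that gets pasted out of Word.

```python
def decode_axis_value(raw, encoding="utf-8"):
    """Return raw as str, raising ValueError if the bytes are not valid UTF-8.

    Hypothetical helper for illustration only; it is not part of CMOR.
    """
    if isinstance(raw, str):
        # Already text, nothing to decode.
        return raw
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Fail loudly and name the offending value.
        raise ValueError(
            f"axis value {raw!r} is not valid {encoding}: {exc}"
        ) from exc


good = decode_axis_value(b"caf\xc3\xa9")  # valid UTF-8 bytes for "café"

try:
    decode_axis_value(b"\x93quoted\x94")  # cp1252 curly quotes, invalid as UTF-8
    rejected = False
except ValueError:
    rejected = True
```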

durack1

@mauzey1 happy to green light your contribution, it looks good, but was just wanting to make the statement about non-UTF-8 chars for the record

mauzey1 · 2020-11-06T02:03:35Z

@durack1 After some experimenting with unicode, I decided to change the code to explicitly convert the strings to UTF-8.

durack1 · 2020-11-07T00:58:30Z

@mauzey1 I think that was a wise tweak, hopefully, it catches those fringe cases

mauzey1 added 9 commits August 17, 2020 13:14

In cmor.axis, convert unicode strings to zero-terminated bytes for coordinate values.

16c7c35

Merge branch 'master' into 612_cmor_axis_unicode_values

e627833

Merge branch 'master' into 612_cmor_axis_unicode_values

be54b3c

Move handling of Python unicode characters into C code.

ca21764

Update tables

7ee62f3

Update xcode to 11.4.0 and clang_osx-64 to 10.0.1

9294824

Merge branch 'master' into 612_cmor_axis_unicode_values

aaf2ffc

Treat string axis values as dtype numpy.unicode_, and fix the index value used when copying string axis values

f0a02e7

Measuring max string length for axis values has been moved into the C code

c97309b

mauzey1 requested review from durack1 and wachsylon November 5, 2020 22:21

durack1 approved these changes Nov 6, 2020


Convert bytestrings to UTF-8 in axis values in cmor.axis

02beb05

mauzey1 merged commit 5a3905c into master Nov 6, 2020

mauzey1 deleted the 612_cmor_axis_unicode_values branch November 6, 2020 14:55
