Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle unicode values passed to cmor.axis in Python #616

Merged
merged 10 commits into from
Nov 6, 2020

Conversation

mauzey1
Copy link
Collaborator

@mauzey1 mauzey1 commented Nov 5, 2020

Fixes #612

When string values get passed to cmor.axis, they will be treated as a Numpy array of type numpy.unicode_.

From https://numpy.org/doc/stable/reference/arrays.dtypes.html

Note that str refers to either null terminated bytes or unicode strings depending on the Python version. In code targeting both Python 2 and 3 np.unicode_ should be used as a dtype for strings.

The C code has been modified to properly copy string values from a Numpy array to a C array to be processed by cmor_axis.

@durack1
Copy link
Contributor

durack1 commented Nov 5, 2020

@mauzey1 I wonder if explicitly decoding to unicode e.g.

string.decode('utf-8')

Would be a good way to force all types?

@mauzey1
Copy link
Collaborator Author

mauzey1 commented Nov 6, 2020

@durack1
The current line below should already be performing this conversion.

coord_vals = numpy.array(coord_vals, numpy.unicode_)

I could make the conversion more explicit by using

if data_type == 'S':
    coord_vals = numpy.char.decode(coord_vals, encoding='utf-8')

@durack1
Copy link
Contributor

durack1 commented Nov 6, 2020

@mauzey1 my comment was just that. I found that in generating html content from the contributed contents for the CMIP6_CVs all manner of weird non-standard (and non UTF-8) characters were sneaking in (presumably from folks copying characters out of Word, or some other rich text software), and the parsers weren't expecting these characters and consequently barfed.

I think this is true to CMOR, so catching non UTF-8 characters and throwing an explicit error (if a decode function can't handle things) would be my more bulletproof preference

Copy link
Contributor

@durack1 durack1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mauzey1 happy to green light your contribution, it looks good, but was just wanting to make the statement about non-UTF-8 chars for the record

@mauzey1
Copy link
Collaborator Author

mauzey1 commented Nov 6, 2020

@durack1 After some experimenting with unicode, I decided to change the code to explicitly convert the strings to UTF-8.

@mauzey1 mauzey1 merged commit 5a3905c into master Nov 6, 2020
@mauzey1 mauzey1 deleted the 612_cmor_axis_unicode_values branch November 6, 2020 14:55
@durack1
Copy link
Contributor

durack1 commented Nov 7, 2020

@mauzey1 I think that was a wise tweak, hopefully, it catches those fringe cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cmor.axis is not handling unicode string values correctly in Python
2 participants