Support Python 2 basestring for schema fields #92

Sheldon-Smith · 2018-09-13T15:48:17Z

We have encountered an issue with validating schema fields when the supplied value is a Python 2.7 byte string instead of a unicode object. In nti.externalization.internalization.fields.py,

287    try:
288        if isinstance(value, text_type) and IFromUnicode_providedBy(field):
289            value = field.fromUnicode(value)  # implies validation
290        else:
291            field.validate(value)

In Python 2, the type check here fails for a base string type. This seems like a functional difference from the expected behavior in Python 3. Should we consider comparing against string_types here rather than text_type? We are seeing this as an issue in cases where the value is a byte string int or bool supplied through a 3rd party library.

The text was updated successfully, but these errors were encountered:

jamadden · 2018-09-13T16:04:11Z

On Python 2, str is emphatically not a text type; it's best to think of str by its alias bytes. On Python 2, only unicode is a text type. On Python 3, str is a text/unicode object and bytes is a different type.

Passing a bytes object to fromUnicode is usually guaranteed to break the field. IFromUnicode specifically documents a unicode/text input, so implementations are well within their rights to expect that.

This has been text_type all the way back to at least 1.0a1, so this is not a change. The use of basestring on Python 2 is almost always a code smell. It's important to understand and know the distinction between text and bytes and know which you are working with at any given time for any given API.

We are seeing this as an issue in cases where the value is a byte string int or bool supplied through a 3rd party library.

I would suggest that's a problem in the way the data is being read, then. The supplied IExternalObjectIO in this package read bytestrings in and produce text for what is a string in the input:

>>> from nti.externalization.representation import JsonRepresenter
>>> rep = JsonRepresenter()
>>> rep.load(b'1')
1
>>> rep.load(b'"a string"')
u'a string'
>>>

cutz · 2018-09-13T17:01:00Z

I think we are running in to this "optimization" in lxml.

In Python 2, lxml's API returns byte strings for plain ASCII text values, be it for tag names or text in Element content. This is the same behaviour as known from ElementTree. The reasoning is that ASCII encoded byte strings are compatible with Unicode strings in Python 2, but consume less memory (usually by a factor of 2 or 4) and are faster to create because they do not require decoding. Plain ASCII string values are very common in XML, so this optimisation is generally worth it.

I'm not sure what to do about it yet, but I think that's why we are getting byte strings back when all other indications should be that we get unicode strings (unicode string input, with appropriate encoding in the xml doc).

jamadden · 2018-09-13T17:20:27Z

If so, I would expect the consumer of the lxml parsing data to transform into text as needed values that are intended to be text. That can be done in an XML schema agnostic way if you can assume that all PCDATA (element content) is text (doesn't it have to be?). (I don't think tag names would matter for this, but if they happen to specify field names, that's fine, field names are supposed to be native strings anyway.)

cutz · 2018-09-14T17:08:11Z

Yes, I agree. Unfortunately were wrapped around a 3rd party library doing the parsing/consumption so it's going to take some work arounds on our side. I think it is doable though and it does seem like the proper place to do it. I think this can probably be closed.

jamadden · 2018-09-14T17:41:05Z

I can propose an IFromNativeString interface for upstream zope.schema and implement it for the numeric fields if that would help. The code here would add a new branch elif isinstance(value, str) and IFromNativeString.providedBy(field): value = field.fromNativeString(value).

cutz · 2018-09-14T20:18:17Z

I'd be in favor of that if we feel like it's a reasonable and good idea. Assuming they would be open to that.

jamadden · 2018-09-14T20:37:08Z

I was thinking about that some more, and I've come to the conclusion it's actually fairly problematic. The reason is that on Python 3 (which is now the dialect version used by the majority according to the most recent Tiobe rankings), native string and unicode are the same thing. An object would have to implement both interfaces and methods, which is redundant and ugly:

class IFromNativeString(Interface):
    def fromNativeString(value):
       ...

@interface.implementer(IFromUnicode,IFromNativeString)
class Python3StrIsUnicode(object):

   def fromUnicode(self, value):
       # existing code

  if PY3:
      fromNativeString = fromUnicode
  else:
     def fromNativeString(self, value):
          # Python 2, do something with a bytestring

I thought of two alternatives. One would be a IFromByteString interface. Byte strings and unicode are distinct on all platforms; byte strings happen to be native strings on Python 2, but I don't think that matters because we're not defining type aliases. I could then make Number implement that.

@implementer(IFromUnicode,IFromByteString)
class Number(Field):
   def fromByteString(self, bytestring):
        return self.fromUnicode(bytestring.decode('utf-8')

That's more general, and for things like Bytes and BytesLine could save a decoding step.

The other alternative is to ... actually, I can't remember right now. I've been pulled away several times while writing this up and I don't remember what I was thinking before.

jamadden · 2018-09-19T18:00:56Z

#93 ought to resolve this if you want to take a look. (cf OpenNTI/nti.schema#34)

Add support for IFromBytes. Fixes #92

jamadden mentioned this issue Sep 18, 2018

Add IFromBytes zopefoundation/zope.schema#71

Merged

jamadden closed this as completed in 5b4ad1f Sep 20, 2018

jamadden added a commit that referenced this issue Sep 20, 2018

Merge pull request #93 from NextThought/issue92

f56e27d

Add support for IFromBytes. Fixes #92

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Python 2 basestring for schema fields #92

Support Python 2 basestring for schema fields #92

Sheldon-Smith commented Sep 13, 2018

jamadden commented Sep 13, 2018

cutz commented Sep 13, 2018

jamadden commented Sep 13, 2018

cutz commented Sep 14, 2018

jamadden commented Sep 14, 2018

cutz commented Sep 14, 2018

jamadden commented Sep 14, 2018

jamadden commented Sep 19, 2018

Support Python 2 basestring for schema fields #92

Support Python 2 basestring for schema fields #92

Comments

Sheldon-Smith commented Sep 13, 2018

jamadden commented Sep 13, 2018

cutz commented Sep 13, 2018

jamadden commented Sep 13, 2018

cutz commented Sep 14, 2018

jamadden commented Sep 14, 2018

cutz commented Sep 14, 2018

jamadden commented Sep 14, 2018

jamadden commented Sep 19, 2018