Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Python 2 basestring for schema fields #92

Closed
Sheldon-Smith opened this issue Sep 13, 2018 · 8 comments
Closed

Support Python 2 basestring for schema fields #92

Sheldon-Smith opened this issue Sep 13, 2018 · 8 comments

Comments

@Sheldon-Smith
Copy link

We have encountered an issue with validating schema fields when the supplied value is a Python 2.7 byte string instead of a unicode object. In nti.externalization.internalization.fields.py,

287    try:
288        if isinstance(value, text_type) and IFromUnicode_providedBy(field):
289            value = field.fromUnicode(value)  # implies validation
290        else:
291            field.validate(value)

In Python 2, the type check here fails for a base string type. This seems like a functional difference from the expected behavior in Python 3. Should we consider comparing against string_types here rather than text_type? We are seeing this as an issue in cases where the value is a byte string int or bool supplied through a 3rd party library.

@jamadden
Copy link
Member

On Python 2, str is emphatically not a text type; it's best to think of str by its alias bytes. On Python 2, only unicode is a text type. On Python 3, str is a text/unicode object and bytes is a different type.

Passing a bytes object to fromUnicode is usually guaranteed to break the field. IFromUnicode specifically documents a unicode/text input, so implementations are well within their rights to expect that.

This has been text_type all the way back to at least 1.0a1, so this is not a change. The use of basestring on Python 2 is almost always a code smell. It's important to understand and know the distinction between text and bytes and know which you are working with at any given time for any given API.

We are seeing this as an issue in cases where the value is a byte string int or bool supplied through a 3rd party library.

I would suggest that's a problem in the way the data is being read, then. The supplied IExternalObjectIO in this package read bytestrings in and produce text for what is a string in the input:

>>> from nti.externalization.representation import JsonRepresenter
>>> rep = JsonRepresenter()
>>> rep.load(b'1')
1
>>> rep.load(b'"a string"')
u'a string'
>>>

@cutz
Copy link

cutz commented Sep 13, 2018

I think we are running in to this "optimization" in lxml.

In Python 2, lxml's API returns byte strings for plain ASCII text values, be it for tag names or text in Element content. This is the same behaviour as known from ElementTree. The reasoning is that ASCII encoded byte strings are compatible with Unicode strings in Python 2, but consume less memory (usually by a factor of 2 or 4) and are faster to create because they do not require decoding. Plain ASCII string values are very common in XML, so this optimisation is generally worth it.

I'm not sure what to do about it yet, but I think that's why we are getting byte strings back when all other indications should be that we get unicode strings (unicode string input, with appropriate encoding in the xml doc).

@jamadden
Copy link
Member

If so, I would expect the consumer of the lxml parsing data to transform into text as needed values that are intended to be text. That can be done in an XML schema agnostic way if you can assume that all PCDATA (element content) is text (doesn't it have to be?). (I don't think tag names would matter for this, but if they happen to specify field names, that's fine, field names are supposed to be native strings anyway.)

@cutz
Copy link

cutz commented Sep 14, 2018

Yes, I agree. Unfortunately were wrapped around a 3rd party library doing the parsing/consumption so it's going to take some work arounds on our side. I think it is doable though and it does seem like the proper place to do it. I think this can probably be closed.

@jamadden
Copy link
Member

I can propose an IFromNativeString interface for upstream zope.schema and implement it for the numeric fields if that would help. The code here would add a new branch elif isinstance(value, str) and IFromNativeString.providedBy(field): value = field.fromNativeString(value).

@cutz
Copy link

cutz commented Sep 14, 2018

I'd be in favor of that if we feel like it's a reasonable and good idea. Assuming they would be open to that.

@jamadden
Copy link
Member

I was thinking about that some more, and I've come to the conclusion it's actually fairly problematic. The reason is that on Python 3 (which is now the dialect version used by the majority according to the most recent Tiobe rankings), native string and unicode are the same thing. An object would have to implement both interfaces and methods, which is redundant and ugly:

class IFromNativeString(Interface):
    def fromNativeString(value):
       ...

@interface.implementer(IFromUnicode,IFromNativeString)
class Python3StrIsUnicode(object):

   def fromUnicode(self, value):
       # existing code

  if PY3:
      fromNativeString = fromUnicode
  else:
     def fromNativeString(self, value):
          # Python 2, do something with a bytestring

I thought of two alternatives. One would be a IFromByteString interface. Byte strings and unicode are distinct on all platforms; byte strings happen to be native strings on Python 2, but I don't think that matters because we're not defining type aliases. I could then make Number implement that.

@implementer(IFromUnicode,IFromByteString)
class Number(Field):
   def fromByteString(self, bytestring):
        return self.fromUnicode(bytestring.decode('utf-8')

That's more general, and for things like Bytes and BytesLine could save a decoding step.

The other alternative is to ... actually, I can't remember right now. I've been pulled away several times while writing this up and I don't remember what I was thinking before.

@jamadden
Copy link
Member

#93 ought to resolve this if you want to take a look. (cf OpenNTI/nti.schema#34)

jamadden added a commit that referenced this issue Sep 20, 2018
Add support for IFromBytes. Fixes #92
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants