Support for PEP3131 (Non-ASCII Identifiers) #160

Open
wants to merge 6 commits into
base: master

DylanLukes

In the process of using Baron for some research on source pulled from hundreds/thousands of repositories on GitHub, I've found that in many cases Baron is unable to tokenize/parse source containing non-ASCII identifiers.

Non-ASCII identifiers have been supported since Python 3.0, as specified by PEP 3131.

This pull request includes some very small changes that allow Baron to handle non-ASCII identifiers:

  • Replace native re module with a dependency on the regex module.
    • This is because regex supports Unicode character property classes.
  • Replace the regex for NAME tokens:
    • Before: [a-zA-Z_]\w*
    • After: [\p{XID_Start}_]\p{XID_Continue}*
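To see why the old NAME pattern misses these identifiers, here is a minimal sketch that uses only the standard library and assumes Python 3. The names `OLD_NAME` and `old_pattern_accepts` are made up for this illustration and are not Baron's code. `str.isidentifier()` serves as a rough reference for what Python 3 itself accepts (note that it also returns True for keywords). The new `\p{XID_Start}` pattern is not exercised here because it needs the third-party regex module.

```python
import re

# The NAME pattern before this PR, anchored so it must match the whole string.
OLD_NAME = re.compile(r"[a-zA-Z_]\w*\Z")


def old_pattern_accepts(name):
    """Return True if the pre-PR NAME regex matches the entire string."""
    return OLD_NAME.match(name) is not None


# Compare the old pattern against Python 3's own notion of an identifier.
for candidate in ["foo", "_bar", "aβ", "β", "가사"]:
    print(candidate, old_pattern_accepts(candidate), candidate.isidentifier())
```

On Python 3, `\w` is already Unicode-aware for str patterns, so the old pattern only fails when the first character is non-ASCII; that is the case the `[a-zA-Z_]` class cannot cover.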

I have checked that all tests pass without regression, and have added another simple test:

def test_name_unicode():
    match('β', 'NAME')
    match('가사', 'NAME')

Note:

PEP3131 states:

The identifier syntax is <XID_Start> <XID_Continue>*.

However, this seems to be an error, as XID_Start does not contain _ by default (though the Unicode specifications suggest that a Start class could or should contain it).
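The underscore point can be checked with the standard library. The sketch below assumes Python 3 and only looks at Unicode general categories (letters plus letter numbers), which approximates the category-based part of ID_Start and is not the full XID_Start derivation. `LETTER_LIKE_CATEGORIES` and `in_letter_like_category` are hypothetical names for this example.

```python
import unicodedata

# General categories that the Unicode ID_Start property is built from
# (before Other_ID_Start additions). This is an approximation, not XID_Start.
LETTER_LIKE_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def in_letter_like_category(ch):
    """Return True if the character's general category is a letter or letter number."""
    return unicodedata.category(ch) in LETTER_LIKE_CATEGORIES


# "_" is connector punctuation (Pc), so a category-based start class excludes it,
# even though Python accepts it as the first character of an identifier.
for ch in ["_", "a", "β", "가"]:
    print(repr(ch), unicodedata.category(ch), in_letter_like_category(ch), ch.isidentifier())
```

This is why the new pattern lists `_` explicitly next to `\p{XID_Start}` in the character class.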

@DylanLukes
Author

Looks like there's a failing test on 2.7, will fix.

The 2.6 failure is unrelated to this PR:

$ curl -sSf --retry 5 -o python-2.6.tar.bz2 ${archive_url}
curl: (22) The requested URL returned error: 404 Not Found

@DylanLukes
Author

Alright, tests now all pass on 2.7 and up! I ended up making them conditional on the Python version, as it turns out the derived Unicode categories differ between Python 2 and Python 3.

That is, "α" is matched by "\p{XID_Start}" on Python 3, but not on Python 2.

In summary: this set of changes adds support for Python 3's Unicode identifiers... but only if you're using Python 3.
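For readers wondering what "conditional on the Python version" might look like, here is a rough sketch. It is not the test code from this PR: `unicode_identifier_tests_enabled` is a hypothetical guard, and `str.isidentifier()` stands in for Baron's `match` test helper, which is not reproduced here.

```python
import sys
import unicodedata


def unicode_identifier_tests_enabled():
    """Hypothetical guard: only run the non-ASCII NAME checks on Python 3."""
    return sys.version_info[0] >= 3


def test_name_unicode():
    if not unicode_identifier_tests_enabled():
        return  # treated as skipped on Python 2
    # Stand-in assertions; the real tests call Baron's match() helper instead.
    assert "β".isidentifier()
    assert "가사".isidentifier()


# The bundled Unicode database version also varies between interpreters.
print(sys.version_info[0], unicodedata.unidata_version)
test_name_unicode()
```

A test runner's own skip mechanism would report the skip more visibly than an early return; the early return is used here only to keep the sketch dependency-free.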

2 participants