edlib and unicode in python3 #89

Closed
christian-storm opened this issue Aug 18, 2017 · 4 comments

Comments

@christian-storm

Hi Martin,

I've been playing around with your great edlib library in Python. Wow, is it ever fast!

I'm trying to use edlib for text comparison and am getting some odd behavior with Unicode.
I've copied my little test program below so you can see what I mean. The issue stems from the fact that the text is encoded into bytes (in edlib.pyx) before aligning. Would passing the Python string as Unicode instead of bytes alleviate this issue? I'm not sure whether you have any thoughts on how one might add Unicode support.

Output of test program:
a and b both ascii
a: Testing What
b: Testing_What
cigar: 7=1X4= expected cigar: 7=1X4=

a is ascii and b is ascii with one unicode/multi-byte character
a: Testing What
b: Testing✐What
cigar: 7=1X2D2=2I expected cigar: 7=1X4=

a and b are ascii with one unicode/multi-byte character in the same position
a: Testing✑What
b: Testing✐What
cigar: 9=1X2= expected cigar: 7=1X4=

edlib_test.py

import edlib

a = "Testing What"
b = "Testing_What"

result = edlib.align(a, b, task="path")
print("a and b both ascii")
print("a: {}".format(a))
print("b: {}".format(b))
print("cigar: {} expected cigar: {}\n".format(result['cigar'], '7=1X4='))

a = "Testing What"
b = "Testing✐What"

result = edlib.align(a, b, task="path")
print("a is ascii and b is ascii with one unicode/multi-byte character")
print("a: {}".format(a))
print("b: {}".format(b))
print("cigar: {} expected cigar: {}\n".format(result['cigar'], '7=1X4='))

a = "Testing✑What"
b = "Testing✐What"

result = edlib.align(a, b, task="path")
print("a and b are ascii with one unicode/multi-byte character in the same position")
print("a: {}".format(a))
print("b: {}".format(b))
print("cigar: {} expected cigar: {}\n".format(result['cigar'], '7=1X4='))
@Martinsos
Owner

Martinsos commented Aug 21, 2017

Hi @christian-storm, thank you for reaching out and thank you for the kind words :).
Ah yes, the problem here is the multibyte characters. Edlib's core is written in C++, and it takes each sequence as an array of chars. The Python package transforms the Python string into a sequence of bytes and then passes that to the C++ code, since one char is one byte.

So the problem is that Edlib assumes one byte is one character, and there is currently no way around that. When you give it a multibyte character, it treats each of that character's bytes as an independent character.
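To make the mismatch concrete, here is a small sketch (standard library only, no edlib involved) that compares the length of the test strings in characters with their length in UTF-8 bytes. The helper name `char_and_byte_len` is made up for this illustration.

```python
def char_and_byte_len(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for s in ("Testing What", "Testing_What", "Testing✐What"):
    chars, nbytes = char_and_byte_len(s)
    # For pure ASCII text the two counts agree; a multibyte character
    # makes the byte count larger than the character count.
    print("{!r}: {} characters, {} bytes".format(s, chars, nbytes))
```

Any aligner that works byte by byte sees the longer byte sequence, which is why the reported cigar no longer lines up with the characters on screen.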

One way to handle this right now is to transform your alphabet: if you can map those multibyte characters to free single-byte characters that you are not otherwise using, you can run edlib and get correct results. However, if you are already using all the single-byte characters, there is currently no way around it.
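A minimal sketch of that alphabet mapping, under some assumptions: the helper `map_to_ascii` is hypothetical (not part of edlib), and it remaps only to unused ASCII code points so that the result stays one byte per character even after the UTF-8 encoding step mentioned above. That limits the combined alphabet to 127 distinct symbols.

```python
def map_to_ascii(*texts):
    """Replace each non-ASCII character with an unused ASCII character.

    The same mapping is applied to every text, so equal characters stay
    equal and different characters stay different.
    """
    used = set().union(*map(set, texts))
    non_ascii = sorted(ch for ch in used if ord(ch) >= 128)
    # Candidate replacements: ASCII code points that appear in none of the texts.
    free = [chr(c) for c in range(1, 128) if chr(c) not in used]
    if len(non_ascii) > len(free):
        raise ValueError("not enough unused single-byte characters")
    table = str.maketrans(dict(zip(non_ascii, free)))
    return [t.translate(table) for t in texts]


a, b = map_to_ascii("Testing✑What", "Testing✐What")
# Both mapped strings are now one byte per character.
print(len(a.encode("utf-8")), len(b.encode("utf-8")))
```

The mapped strings can then be passed to `edlib.align(a, b, task="path")` in place of the originals. The cigar string refers to positions, not to the characters themselves, so no reverse mapping is needed to interpret it.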

In the future, I could improve Edlib so that it does not take an array of chars. Instead, it would take an array of objects that have an equality operator defined on them, making it more general. However, that is not a small change, so I can't really promise anything at the moment.
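To illustrate what "an array of objects with an equality operator" buys, here is a plain dynamic-programming edit distance in pure Python. This is not how Edlib computes distances and it is far slower; it is only a sketch showing that an algorithm which needs nothing but `==` on elements handles Unicode strings (and lists of arbitrary items) without any byte-level surprises. The function name `edit_distance` is made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of comparable elements."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = cur
    return prev[-1]


# Python 3 strings iterate by code point, so each of ✑ and ✐ is one element.
print(edit_distance("Testing✑What", "Testing✐What"))
# The same function works on lists of words or any other comparable items.
print(edit_distance(["a", "quick", "fox"], ["a", "slow", "fox"]))
```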

Hope that helps! By the way, there was a very similar issue where I explained the idea of mapping multibyte characters to single-byte characters in more detail; you can check it out here: #79.

@christian-storm
Author

Thanks Martinsos. I'll keep my fingers crossed that you find some time :) I'd dig in myself if I weren't so rusty at C++.

@Martinsos
Owner

Thanks :). By the way, what are you using edlib for? Is mapping the alphabet to single-byte characters not an option for you?

@christian-storm
Author

Apologies for not responding. I was looking for a faster replacement for Python's difflib.
