Functions not handling accented chars properly #9

Closed
ChrisMuir opened this issue Apr 1, 2018 · 3 comments
Labels
bug

Comments


ChrisMuir commented Apr 1, 2018

Testing this on a Mac and a PC gives different results.

library(refinr)
vect <- c("César Moreira Nuñez", "cesar moreira nunez")

On the PC:

key_collision_merge(vect)
#> "César Moreira Nuñez" "César Moreira Nuñez" # This is the correct output
n_gram_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"

On the Mac:

key_collision_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"
n_gram_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"

The expected output for all four calls above is c("César Moreira Nuñez", "César Moreira Nuñez").

This issue is possibly related to issue #58 from the rOpenSci pkg tokenizers (and the reprex above was stolen from that issue).

Both the Mac and PC are running R v3.4.4, and here are the locale and encoding settings for each:

PC:

Sys.getlocale()
#> "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

getOption("encoding")
#> "native.enc"

Mac:

Sys.getlocale()
#> en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

getOption("encoding")
#> native.enc
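
The two locales differ in their LC_CTYPE encoding (Windows-1252 on the PC, UTF-8 on the Mac), so it can help to check how each machine actually stores the test strings. The following is a minimal diagnostic sketch using only base R; the variable name `diag` is made up for illustration, and the strings are written with `\u` escapes so the script itself does not depend on the file encoding.

```r
# Test strings from the reprex, written with Unicode escapes.
vect <- c("C\u00e9sar Moreira Nu\u00f1ez", "cesar moreira nunez")

# Collect the pieces of state that tend to matter for encoding bugs.
diag <- list(
  ctype_locale = Sys.getlocale("LC_CTYPE"),      # character-type locale only
  utf8_locale  = isTRUE(l10n_info()[["UTF-8"]]), # is the session locale UTF-8?
  declared     = Encoding(vect),                 # encoding mark on each string
  n_chars      = nchar(enc2utf8(vect), type = "chars"),
  utf8_bytes   = lapply(enc2utf8(vect), charToRaw) # raw bytes after UTF-8 conversion
)

str(diag)
```

Comparing this output between the two machines shows whether the inputs already differ before any refinr function is called, or whether the divergence only appears inside the transliteration step.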

ChrisMuir commented Apr 1, 2018

On the PC, the issue with function n_gram_merge() seems to be fixed simply by rearranging some of the operations within function get_fingerprint_ngram() and changing when the input strings are run through function iconv():

Current:

...<snip code>...
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
vect <- iconv(vect, to = "ASCII//TRANSLIT")

Updated:

...<snip code>...
vect <- iconv(vect, to = "ASCII//TRANSLIT")
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
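
To illustrate the "transliterate first, then split" ordering, here is a rough base-R sketch of an n-gram fingerprint. It is not refinr's internal code: `ngram_fingerprint_sketch()` is a hypothetical name, the n-gram step is a plain R stand-in for `cpp_get_char_ngrams()`, and the thread does not say whether the final fix looks like this. The result of `iconv()` with `ASCII//TRANSLIT` still depends on the platform's iconv implementation.

```r
# Hypothetical stand-in for an n-gram fingerprint; NOT refinr's actual code.
ngram_fingerprint_sketch <- function(x, numgram = 2L) {
  # 1. Transliterate whole strings first, while every accented letter is
  #    still a single intact character. Output is platform dependent.
  x <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")

  # 2. Lowercase, then keep only ASCII letters and digits. Because this runs
  #    after step 1, it also drops any ASCII punctuation that the
  #    transliteration may have emitted (such as ' or ~).
  x <- gsub("[^a-z0-9]", "", tolower(x))

  # 3. Split into single characters and build overlapping n-grams.
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    if (anyNA(chars)) return(NA_character_)  # iconv() could not convert
    n <- length(chars) - numgram + 1L
    if (n < 1L) return(paste(chars, collapse = ""))
    grams <- vapply(seq_len(n), function(i) {
      paste(chars[i:(i + numgram - 1L)], collapse = "")
    }, character(1))
    # 4. De-duplicate, sort in a locale-independent order, and join.
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1))
}

# ASCII-only inputs behave the same on every platform.
ngram_fingerprint_sketch(c("Cesar Moreira Nunez", "cesar moreira nunez"))
```

Two strings would be merge candidates when their fingerprints are identical, which is why the accent handling in step 1 has to behave the same way for every input.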

ChrisMuir commented Apr 1, 2018

The purpose of using iconv() within both get_fingerprint_ functions is to strip all accent marks from characters. The issue on the Mac is that iconv() (with to = "ASCII//TRANSLIT") does not achieve this; it emits the accent as a separate ASCII character instead:

vect <- c("César Moreira Nuñez", "cesar moreira nunez")
iconv(vect, to = "ASCII//TRANSLIT")
#> "C'esar Moreira Nu~nez" "cesar moreira nunez"

I could use stringi::stri_trans_general() in place of iconv(), but would rather not depend on stringi just for one function.
Here's an example, from the Mac:

stringi::stri_trans_general(vect, "Latin-ASCII")
#> "Cesar Moreira Nunez" "cesar moreira nunez"
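
For comparison, a dependency-free alternative is possible with base R's chartr() and an explicit lookup table. The sketch below is only an illustration under stated assumptions: `strip_accents_sketch()` is a hypothetical helper, not part of refinr, and it covers just the Latin-1 accented letters listed in the table, so it is far less complete than stringi's "Latin-ASCII" transform.

```r
# Hypothetical helper: map a fixed set of accented Latin-1 letters to ASCII.
# Characters outside the table are left unchanged.
strip_accents_sketch <- function(x) {
  accented <- paste0(
    "\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5",  # a with grave, acute, circumflex, tilde, diaeresis, ring
    "\u00e7",                                # c with cedilla
    "\u00e8\u00e9\u00ea\u00eb",              # e variants
    "\u00ec\u00ed\u00ee\u00ef",              # i variants
    "\u00f1",                                # n with tilde
    "\u00f2\u00f3\u00f4\u00f5\u00f6",        # o variants
    "\u00f9\u00fa\u00fb\u00fc",              # u variants
    "\u00fd\u00ff",                          # y variants
    "\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5",  # upper-case A variants
    "\u00c7",                                # upper-case C with cedilla
    "\u00c8\u00c9\u00ca\u00cb",              # upper-case E variants
    "\u00cc\u00cd\u00ce\u00cf",              # upper-case I variants
    "\u00d1",                                # upper-case N with tilde
    "\u00d2\u00d3\u00d4\u00d5\u00d6",        # upper-case O variants
    "\u00d9\u00da\u00db\u00dc",              # upper-case U variants
    "\u00dd"                                 # upper-case Y with acute
  )
  plain <- paste0(
    "aaaaaa", "c", "eeee", "iiii", "n", "ooooo", "uuuu", "yy",
    "AAAAAA", "C", "EEEE", "IIII", "N", "OOOOO", "UUUU", "Y"
  )
  # chartr() maps the i-th character of `accented` to the i-th of `plain`.
  chartr(accented, plain, enc2utf8(x))
}

vect <- c("C\u00e9sar Moreira Nu\u00f1ez", "cesar moreira nunez")
strip_accents_sketch(vect)
#> "Cesar Moreira Nunez" "cesar moreira nunez"
```

The trade-off is maintenance: every additional script or diacritic needs a new table entry, which is exactly the work that stringi::stri_trans_general() already does.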
ChrisMuir changed the title from "Functions not handling accepted chars properly" to "Functions not handling accented chars properly" Apr 1, 2018
ChrisMuir added a commit that referenced this issue Apr 1, 2018
ChrisMuir added the bug label Apr 5, 2018
ChrisMuir added a commit that referenced this issue Apr 22, 2018

ChrisMuir commented Apr 23, 2018

Fixed in commit 3c0625b.

ChrisMuir closed this Apr 23, 2018