Functions not handling accented chars properly #9

Closed
ChrisMuir opened this issue Apr 1, 2018 · 3 comments

Testing this on a Mac and a PC gives different results.

library(refinr)
vect <- c("César Moreira Nuñez", "cesar moreira nunez")

On the PC:

key_collision_merge(vect)
#> "César Moreira Nuñez" "César Moreira Nuñez" # This is the correct output
n_gram_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"

On the Mac:

key_collision_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"
n_gram_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"

The expected output for all four calls above is c("César Moreira Nuñez", "César Moreira Nuñez").

This issue is possibly related to issue #58 from the rOpenSci pkg tokenizers (and the reprex above was stolen from that issue).

Both the Mac and the PC are running R v3.4.4. Here are the locale and encoding settings for each:

PC:

Sys.getlocale()
#> "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

getOption("encoding")
#> "native.enc"

Mac:

Sys.getlocale()
#> en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

getOption("encoding")
#> native.enc
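
When comparing the two machines, it can also help to check how R has marked the strings themselves, not just the session locale. The following is a rough diagnostic sketch, not part of refinr: `encoding_report()` is a hypothetical helper name, and the input is written with `\u` escapes so the example does not depend on the encoding of the script file.

```r
# Hypothetical helper (not part of refinr): collect the facts that
# commonly influence what iconv() returns on a given machine.
encoding_report <- function(x) {
  list(
    declared    = Encoding(x),                    # per-string encoding mark
    valid_utf8  = validUTF8(x),                   # are the bytes valid UTF-8?
    ctype       = Sys.getlocale("LC_CTYPE"),      # character-type locale
    utf8_locale = isTRUE(l10n_info()[["UTF-8"]])  # is the session locale UTF-8?
  )
}

# \u escapes keep the source file pure ASCII.
vect <- c("C\u00e9sar Moreira Nu\u00f1ez", "cesar moreira nunez")
report <- encoding_report(vect)
str(report)
```

The `ctype` and `utf8_locale` entries are expected to differ between the PC and the Mac above, so the printed report is best compared side by side rather than against a fixed expected value.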

ChrisMuir commented Apr 1, 2018

On the PC, the problem with n_gram_merge() appears to be fixed by reordering the operations in get_fingerprint_ngram() so that the input strings are passed through iconv() before they are split into characters and n-grams, rather than after:

Current:

...<snip code>...
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
vect <- iconv(vect, to = "ASCII//TRANSLIT")

Updated:

...<snip code>...
vect <- iconv(vect, to = "ASCII//TRANSLIT")
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
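
The reordered steps can be sketched in plain R as follows. This is only an illustration under assumptions: `char_ngrams()` is a hypothetical pure-R stand-in for the package's internal `cpp_get_char_ngrams()` (whose real behavior is not shown in this thread), `ngram_key()` is an invented name, and the explicit `enc2utf8()` / `from = "UTF-8"` pair is my addition to make the input encoding unambiguous. Transliterating first means iconv() is given ordinary strings rather than whatever structure the n-gram step returns.

```r
# Hypothetical stand-in for refinr's internal cpp_get_char_ngrams():
# builds overlapping character n-grams from a vector of single chars.
char_ngrams <- function(chars, numgram = 2) {
  n <- length(chars)
  if (n < numgram) return(character(0))
  vapply(
    seq_len(n - numgram + 1),
    function(i) paste(chars[i:(i + numgram - 1)], collapse = ""),
    character(1)
  )
}

# Sketch of the "Updated" ordering: transliterate, then split, then n-gram.
ngram_key <- function(x, numgram = 2) {
  x <- tolower(enc2utf8(x))
  x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # whole strings first
  chars <- strsplit(x, "", fixed = TRUE)                 # then split to chars
  vapply(
    chars,
    function(ch) {
      grams <- unique(char_ngrams(ch, numgram))
      paste(sort(grams, method = "radix"), collapse = "")  # locale-independent sort
    },
    character(1)
  )
}

ngram_key(c("Cesar", "cesar"))
```

With ASCII-only input the two keys are identical. What accented input produces still depends on the platform's iconv implementation.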


ChrisMuir commented Apr 1, 2018

The purpose of using iconv() within both get_fingerprint_ functions is to strip all accent marks from characters. The issue on the Mac is that iconv() (with to = "ASCII//TRANSLIT") does not achieve this:

vect <- c("César Moreira Nuñez", "cesar moreira nunez")
iconv(vect, to = "ASCII//TRANSLIT")
#> "C'esar Moreira Nu~nez" "cesar moreira nunez"

I could use stringi::stri_trans_general() in place of iconv(), but would rather not add a dependency on stringi for a single function.
For comparison, here is its output on the Mac:

stringi::stri_trans_general(vect, "Latin-ASCII")
#> "Cesar Moreira Nunez" "cesar moreira nunez"
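
One possible base-R workaround, sketched here as an assumption and not necessarily what the eventual fix does, is to keep iconv() and then delete the stray ASCII marks (such as ' and ~) that some iconv implementations emit in place of the accent. `strip_accents()` is a hypothetical name. The obvious trade-off is that genuine apostrophes and similar characters are removed as well, so this only makes sense where punctuation is discarded anyway.

```r
# Hypothetical helper (not the package's actual fix): transliterate to
# ASCII, then drop characters that some iconv implementations insert
# as stand-ins for accent marks.
strip_accents <- function(x) {
  x <- enc2utf8(x)
  out <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
  # Removes ' ` ^ ~ and " wherever they occur, including legitimate ones.
  gsub("['`^~\"]", "", out)
}

# Applied to output of the kind the Mac produced above:
strip_accents("C'esar Moreira Nu~nez")
```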

ChrisMuir changed the title from "Functions not handling accepted chars properly" to "Functions not handling accented chars properly" on Apr 1, 2018
ChrisMuir added the bug label on Apr 5, 2018
ChrisMuir commented

Fixed in commit 3c0625b.
