Functions not handling accented chars properly #9
On the PC, the issue with function
Current:
...<snip code>...
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
vect <- iconv(vect, to = "ASCII//TRANSLIT")
Updated:
...<snip code>...
vect <- iconv(vect, to = "ASCII//TRANSLIT")
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
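The reordering in the updated snippet can be illustrated with a small base R sketch. It is not the package's code: `translit_sketch()` and `char_ngrams_sketch()` are hypothetical stand-ins for `iconv(vect, to = "ASCII//TRANSLIT")` and `cpp_get_char_ngrams()`, and the lookup table covers only the two accented characters from the example name. It shows only that transliterating the whole string before splitting means every n-gram is built from ASCII characters.

```r
# Hypothetical stand-in for iconv(x, to = "ASCII//TRANSLIT"). A fixed
# lookup table is used here because iconv output differs by platform.
translit_sketch <- function(x) {
  from <- c("\u00e9", "\u00f1")  # e-acute, n-tilde
  to <- c("e", "n")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Hypothetical stand-in for cpp_get_char_ngrams(): character n-grams
# of a single string, built in plain R.
char_ngrams_sketch <- function(x, numgram = 2) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  n <- length(chars) - numgram + 1
  if (n < 1) {
    return(character(0))
  }
  vapply(
    seq_len(n),
    function(i) paste(chars[i:(i + numgram - 1)], collapse = ""),
    character(1)
  )
}

# Same order as the updated snippet: transliterate first, then split.
name <- "C\u00e9sar Moreira Nu\u00f1ez"
ngrams <- char_ngrams_sketch(translit_sketch(name), numgram = 2)
```

With this ordering the splitting step never sees a non-ASCII character, so the n-grams no longer depend on how the platform encodes the accented input.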
The purpose of using
vect <- c("César Moreira Nuñez", "cesar moreira nunez")
iconv(vect, to = "ASCII//TRANSLIT")
#> "C'esar Moreira Nu~nez" "cesar moreira nunez"
I could use
stringi::stri_trans_general(vect, "Latin-ASCII")
#> "Cesar Moreira Nunez" "cesar moreira nunez"
Fixed in commit 3c0625b.
I'm testing this on a Mac and a PC and getting different results.
On the PC:
On the Mac:
The expected output for all four functions above is
c("César Moreira Nuñez", "César Moreira Nuñez").
This issue is possibly related to issue #58 from the rOpenSci pkg tokenizers (and the reprex above was stolen from that issue).
Both the Mac and PC are running R v3.4.4, and here are the locale and encoding settings for each:
PC:
Mac: