Pander can not encode UTF-8 in rows or columns #280

Closed
NaserMonsefi opened this issue Oct 3, 2016 · 13 comments

@NaserMonsefi

NaserMonsefi commented Oct 3, 2016

Hi,

I was using pander with a matrix that has UTF-8 column names and realized that pander cannot render them. I dug a little deeper and noticed that pander has no problem with UTF-8 characters anywhere except in row or column names. I also noticed that pander converts them from UTF-8 to latin1, but for some reason this doesn't happen for row or column names. I made a small matrix to test this, and it looks like this:

image

The encoding for this data shows that the first two entries are UTF-8 (β), where the beta has a longer tail, and the other two are latin1 (ß), where the tail is chopped. The same is true for the row names and column names.

image
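
For readers who cannot see the screenshots, here is a minimal sketch of how such encodings could be inspected. The matrix `m` is a hypothetical stand-in built for illustration, not the original object from the report:

```r
# Hypothetical stand-in for the matrix shown in the screenshots.
beta_utf8 <- "\u03b2"                            # Greek beta, marked as UTF-8
sharp_s   <- iconv("\u00df", "UTF-8", "latin1")  # sharp s, marked as latin1

m <- matrix(
  c(beta_utf8, beta_utf8, sharp_s, sharp_s),
  nrow = 2,
  dimnames = list(c(beta_utf8, sharp_s), c(beta_utf8, sharp_s))
)

# Declared encoding of the cells and of the row/column names.
cell_enc <- Encoding(m)
row_enc  <- Encoding(rownames(m))
col_enc  <- Encoding(colnames(m))

print(cell_enc)
print(row_enc)
print(col_enc)
```

`Encoding()` only reports the mark attached to each string, so it shows how R will interpret the bytes rather than what the bytes are.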

Now if it is passed to pander, it looks as follows:

image

First, pander converted all the UTF-8 (β) in the matrix body to latin1 (ß) and printed them. But for some reason this doesn't happen for row and column names: pander was only able to print the latin1 (ß) correctly there.
My first question is: how can I make sure that pander prints UTF-8 in the row and column names as well? Ideally it would also pass them through as UTF-8, not latin1, both in the matrix body and in the row and column names.

Thanks,
Naser

@daroczig
Member

daroczig commented Oct 3, 2016

This report seems similar to #228. Are you on Windows? Can you please share your devtools::session_info(), and also the data object, e.g. via dput?

@NaserMonsefi
Author

Thanks a lot for getting back to me so quickly. Here is the session info:
image
I am afraid that dput will mess up the Unicode characters, so I uploaded the RDS file here:
https://www.dropbox.com/s/t1u20gybxirrmt1/data_utf8.RDS?dl=0
Hopefully this works,

Yours,
Naser

@daroczig
Member

daroczig commented Oct 7, 2016

Thanks for the details! Running it here works OK:

pander 280

Although I'm on Linux and using a UTF-8 locale. Can you please also try setting the locale to UTF-8? pander doesn't do any specific character encoding updates, so I suspect this issue is due to the local config instead. E.g. what happens if you update the Encoding of the object? Any help is highly appreciated here, as I don't have access to Windows on a regular basis.
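
A rough sketch of how the session locale could be checked and a UTF-8 locale requested. The locale name "en_US.UTF-8" is an assumption; names are platform-dependent and the request may simply fail, in which case `Sys.setlocale()` returns an empty string:

```r
# Inspect the current character-type locale and whether the session is UTF-8.
ctype   <- Sys.getlocale("LC_CTYPE")
is_utf8 <- l10n_info()[["UTF-8"]]
cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 session:", is_utf8, "\n")

# Try to switch to a UTF-8 locale; the name below is an assumed example.
res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "en_US.UTF-8"))
if (identical(res, "")) {
  message("Requested locale is not available on this system")
} else {
  cat("LC_CTYPE is now:", res, "\n")
}
```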

@NaserMonsefi
Author

You are absolutely correct, it seems to be a Windows problem. It worked on my Linux VirtualBox VM.
None of the English locales worked either (although they are supposed to be UTF-8).
I guess that on Windows I might change the encoding of the data to native (latin1) before using pander.
Yours,
Naser

@NaserMonsefi
Author

I think I found the cause of the problem.
If I change the encoding with Encoding like this, it gives the same wrong output for UTF-8 (β) (forcing the encoding to latin1, which is native):
image

But if I use the enc2native function instead, it doesn't produce the weird characters, and all characters are in the latin1 (ß) form.
image

So my guess is that pander somehow uses enc2native for the data in the matrix but uses Encoding for the row and column names when converting to native, which creates the incorrect characters.
This sort of works as a workaround: it seems you cannot get UTF-8 characters through pander on Windows, but you can still convert them to native and then use pander.
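
The difference between the two approaches can be sketched as follows. This is only an illustration of base R behaviour, not a claim about what pander does internally: assigning with `Encoding<-` changes the mark on a string without touching its bytes, while `enc2native()` actually converts the string.

```r
x <- "\u03b2"   # Greek beta; its UTF-8 bytes are ce b2

# Relabelling: the bytes stay the same, only the declared encoding changes.
relabelled <- x
Encoding(relabelled) <- "latin1"
same_bytes <- identical(charToRaw(relabelled), charToRaw(x))

# Read as latin1, the two bytes become two separate characters.
garbled <- enc2utf8(relabelled)

# Converting: the result depends on the session's native encoding,
# so no particular output is claimed here.
converted <- enc2native(x)

print(same_bytes)
print(garbled)
print(converted)
```

Relabelling is only appropriate when the mark is wrong and the bytes are right; when the bytes themselves need to change, a converting function such as `enc2native()` or `iconv()` is the safer choice.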

Yours,
Naser

@daroczig
Member

daroczig commented Oct 7, 2016

Might be related to some internal Rcpp stuff, but AFAIK we pass all headers + table body to the same functions. cc @RomanTsegelskyi for confirmation

BTW can you please let me know, @NaserMonsefi, how you created this data.frame? This Windows behaviour (like in #228) of having different encodings for the table header and content really freaks me out.

@NaserMonsefi
Author

I originally noticed the problem when importing a data set using read.delim:

read.delim('..data.csv', sep = ',', stringsAsFactors = FALSE, encoding = 'UTF-8', check.names = F)

The file is encoded in UTF-8 and has header names with the UTF-8 beta in them.
Of course, if I use check.names = T, the names end up with encoding "unknown" and more wrong characters.
I think I found a solution for my case, as mentioned above, but I don't know what is causing it at the OS level.
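
A minimal sketch of that workaround applied to imported data. The temporary file, its contents, and the helper `to_native()` are all hypothetical and exist only for this illustration:

```r
# Write a small UTF-8 CSV with a Greek beta in the header (illustrative data).
tmp <- tempfile(fileext = ".csv")
writeLines(enc2utf8(c("\u03b2_1,value", "a,1")), tmp, useBytes = TRUE)

df <- read.delim(tmp, sep = ",", stringsAsFactors = FALSE,
                 encoding = "UTF-8", check.names = FALSE)

# Hypothetical helper: convert column names and character columns
# to the session's native encoding before printing.
to_native <- function(df) {
  names(df) <- enc2native(names(df))
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2native)
  df
}

native_df <- to_native(df)
print(Encoding(names(df)))
print(names(native_df))

# The converted object could then be passed on, e.g.:
# pander::pander(native_df)
unlink(tmp)
```

In a UTF-8 session the conversion should leave the strings effectively unchanged, so the helper only matters where the native encoding differs, as in the Windows setup described above.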
Yours,
Naser

@nbarrowman

I have been having the same problem, also on Windows. Thanks Naser, enc2native also worked for me.

@awfrankwils

@daroczig
Member

daroczig commented Sep 9, 2018

I tested #326 in a Windows VM and it seems to do the trick, but please confirm.

daroczig added a commit that referenced this issue Jan 21, 2019
@daroczig
Member

Should be fixed with the above commit.

@billdenney

@daroczig, I just had the same issue. Is there a way that I could help in some way to release a new version of pander with this fix (and all others that have been made)?

@daroczig
Member

@billdenney you mean a CRAN release? I will need to look into the CI builder, as it seems to be failing, and do a general check-up on the package ... I have not really touched it for a while. I can hopefully do that in a few weeks, but would appreciate any help from someone running all the tests and R CMD check using the dev version of R etc. and creating a PR for a CRAN release.
