Table width and number of characters #18

Closed
huashan opened this issue Mar 6, 2013 · 10 comments

@huashan

huashan commented Mar 6, 2013

I see in helper.R that Pander uses nchar() to determine column width from the number of characters in a string. That is not suitable for CJK characters, so I'd suggest using nchar(x, type='type') to handle them.

@daroczig
Member

daroczig commented Mar 6, 2013

Thank you @huashan for reporting this issue. Please give me a hand with the solution, though, as I have no experience with CJK characters.

The type argument of nchar can only be one of c("bytes", "chars", "width"). Did you mean chars there instead of type? AFAIK that's the default, though.

Could you please also post an example here so that I can test?

@huashan
Author

huashan commented Mar 6, 2013

The default for type is chars.
I simply replaced all the nchar() calls with nchar(x, type='bytes') in helper.R, and the results are what I expected.

@daroczig
Member

daroczig commented Mar 6, 2013

Right, so there is no point in replacing nchar(x) with nchar(x, type='chars').
But for CJK, would you need bytes instead? Or am I missing the point?

@huashan
Author

huashan commented Mar 6, 2013

You need to use bytes or width to deal with CJK characters.

@daroczig
Member

daroczig commented Mar 6, 2013

Cool, thanks a lot for making this clear to me.
I will dig deeper into this within the next few days and will definitely update the package. Hopefully today, but we will see.

@daroczig
Member

daroczig commented Mar 6, 2013

Just did some testing (sorry, I have no idea what the character below is, but it looks cool):

> nchar('乂')
[1] 1
> nchar('乂', 'byte')
[1] 3
> nchar('乂', 'width')
[1] 2

So I decided to use width, which falls back to chars if needed. Anyway, please see the before/after test below and verify:

Before:

> pander(data.frame(x=1:3, y=c('xxx','乂乂乂', 'yyy')))

-------
 x   y 
--- ---
 1  xxx

 2  乂乂乂

 3  yyy
-------

After:

> pander(data.frame(x=1:3, y=c('xxx','乂乂乂', 'yyy')))

----------
 x    y   
--- ------
 1   xxx  

 2  乂乂乂

 3   yyy  
----------

Thanks again for reporting this issue. I would love to hear some feedback on whether this works or whether more tweaks are needed.
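
For reference, the padding step behind the "After" table can be sketched roughly as follows. This is not pander's actual code: pad_center is a hypothetical helper, strrep() assumes R 3.3.0 or later, and the character is written as the escape "\u4e42" only so that the snippet does not depend on the file encoding.

```r
# Hypothetical helper (not part of pander): center each cell in a column
# of the given display width, measuring with type = "width" so that
# double-width CJK characters count as two columns.
pad_center <- function(x, width) {
  w <- nchar(x, type = "width")
  total <- pmax(width - w, 0)   # spaces still needed per cell
  left <- total %/% 2           # extra space, if any, goes to the right
  right <- total - left
  paste0(strrep(" ", left), x, strrep(" ", right))
}

cells <- c("xxx", "\u4e42\u4e42\u4e42", "yyy")
col_width <- max(nchar(cells, type = "width"))
padded <- pad_center(cells, col_width)
```

Every element of padded then has the same display width, which is what keeps the column borders aligned.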

@huashan
Author

huashan commented Mar 7, 2013

Thanks Daroczig!

Another related issue is strwrap() and CJK characters. strwrap() wraps strings at occurrences of whitespace; however, CJK text has no whitespace between words. My solution is to treat each CJK character as a possible break point, like whitespace, and then split the string at the specified width parameter. In this case, the width parameter for this hacked strwrap() should be set to half of the expected width when dealing with CJK characters, and that would be a little bit cumbersome. Or we have to treat each CJK character as two and then split at the first whitespace.

daroczig added a commit that referenced this issue Mar 8, 2013
daroczig added a commit that referenced this issue Mar 8, 2013
@daroczig
Member

daroczig commented Mar 8, 2013

Thanks a lot!

First I tried to fix this the way you described as "a little bit cumbersome", since it was easier to implement. In the end I was not pleased with this method: a cell might mix Latin and CJK characters, and then even a word made of Latin characters could be split. Just imagine a cell with Latin text and a few Unicode characters: it could be split at any character, not just at whitespace, which is not good.

So I worked on the second option too: currently the script checks the real width of each word and splits the text at whitespace based on nchar(..., type = 'width'). Please verify whether this works.


Demo:

> library(pander)
> x <- data.frame(x='1乂2乂 12345678 1234567 1234 12 1 1 1 1 123112 3乂4乂5乂6')
> pander(x)
----------------------------
             x              
----------------------------
1乂2乂 12345678 1234567 1234
12 1 1 1 1 123112 3乂4乂5乂6
----------------------------
> panderOptions('table.split.cells',10)
> pander(x)

----------
    x     
----------
  1乂2乂  
 12345678 
 1234567  
1234 12 1 
  1 1 1   
  123112  
3乂4乂5乂6
----------
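
The whitespace-based, width-aware wrapping described above could look roughly like this. It is only a sketch under assumptions, not the code from the commits: wrap_by_width is a hypothetical name, the text is split on single spaces, and lines are filled greedily. A word wider than the limit is kept whole on its own line.

```r
# Hypothetical greedy wrapper: break only at spaces, but measure each
# candidate line by display width rather than by character count.
wrap_by_width <- function(text, width) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]          # drop empties from repeated spaces
  lines <- character(0)
  current <- ""
  for (word in words) {
    candidate <- if (nzchar(current)) paste(current, word) else word
    if (!nzchar(current) || nchar(candidate, type = "width") <= width) {
      current <- candidate               # word still fits on this line
    } else {
      lines <- c(lines, current)         # close the line, start a new one
      current <- word
    }
  }
  if (nzchar(current)) lines <- c(lines, current)
  lines
}

txt <- paste("1\u4e422\u4e42 12345678 1234567 1234 12 1 1 1 1 123112",
             "3\u4e424\u4e425\u4e426")
wrapped <- wrap_by_width(txt, 10)
```

Centering each element of wrapped is a separate step; this sketch only decides where the line breaks go.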

@huashan
Author

huashan commented Mar 27, 2013

panderOptions('table.split.cells', 4)
x <- data.frame(x='1乂2乂 1234 1234 1234 12 1 1 1 1 123 3乂4乂5乂6')
pander(x)
+------------+
| x |
+============+
| 1乂2乂 |
| 1234 |
| 1234 |
| 1234 |
| 12 1 |
| 1 1 |
| 1 |
| 123 |
| 3乂4乂5乂6 |
+------------+

CJK characters can be split at any position, so the last few lines should be expected to be:
| 3乂4乂|
|5乂6 |
+------------+

or:
| 3乂4|
|乂5乂|
| 6 |
+------------+
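
To make the request concrete, splitting a single word at any character position by display width might be sketched as below. split_by_width is a hypothetical helper, not something pander provides, and it deliberately ignores the mixed-script concern: it would also cut through a run of Latin characters.

```r
# Hypothetical helper: cut one word into pieces whose display width does
# not exceed `width`, breaking between any two characters.
split_by_width <- function(word, width) {
  chars <- strsplit(word, "")[[1]]
  widths <- nchar(chars, type = "width")
  pieces <- character(0)
  current <- ""
  used <- 0
  for (i in seq_along(chars)) {
    if (used > 0 && used + widths[i] > width) {
      pieces <- c(pieces, current)       # next character would overflow
      current <- ""
      used <- 0
    }
    current <- paste0(current, chars[i])
    used <- used + widths[i]
  }
  if (nzchar(current)) pieces <- c(pieces, current)
  pieces
}

pieces <- split_by_width("3\u4e424\u4e425\u4e426", 4)
```

Pasting the pieces back together gives the original word, so such a step could run after whitespace wrapping, only on words that are still too wide.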

@daroczig
Member

Hm, that's a feature not a bug based on the last commit :)

Joking aside, in your last comment you wrote that "Or we have to treat each CJK character as two and then split at the first whitespace." So I implemented that, as it seems pretty hard to check whether CJK or any other Unicode characters are present, and other characters would probably not allow breaks between them.

So @huashan, please verify whether handling CJK chars as double-width and breaking them only at whitespace works, or whether we need some more magic here.
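
One rough heuristic for the detection problem, offered only as a sketch: flag a string when any single character is wider than one column. has_wide_chars is a hypothetical helper. It relies on R's own width table, so it catches double-width characters in general rather than CJK specifically, and it says nothing about where a break is linguistically allowed.

```r
# Hypothetical heuristic: TRUE for strings that contain at least one
# character whose display width exceeds one column.
has_wide_chars <- function(x) {
  vapply(strsplit(x, ""), function(chars) {
    any(nchar(chars, type = "width") > nchar(chars, type = "chars"))
  }, logical(1))
}

flags <- has_wide_chars(c("abc", "1\u4e422", ""))
```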
