[WIP] initial xlsb attempts #688

Merged
merged 77 commits into from
Sep 2, 2023

Conversation

JanMarvin
Owner

@JanMarvin JanMarvin commented Jul 12, 2023

This works about as well as can be expected for an initial approach. I would love to get some feedback prior to merging. How does it work on real-life examples?

Known not working:

  • dxfs styles (not sure when, or if, this will ever be implemented)
  • conditional formatting (depends on dxfs)
  • data validation
  • pivot tables (also unsure)
  • some complex array formulas
  • big endian (not supported)

Otherwise it is supposed to work.

outdated

Initially I was assuming that the file would be somehow encrypted, but it is not and it is actually documented. I do not understand the documentation, but that is a me problem ...

For our final approach I want to do the following:

  • unzip the xlsb file.
  • convert the bin files to basic xml files

This way our usual functions should still work as expected.
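
A minimal sketch of the planning step behind this two-stage approach, assuming the archive's part names are already known. The helper `plan_bin_conversion()` is hypothetical and not part of openxlsx2; it only pairs each `.bin` part with the `.xml` path a converter would write to.

```r
# Hypothetical helper (not part of openxlsx2): given the part names of an
# unzipped xlsb archive, pair every binary part with its xml target path.
plan_bin_conversion <- function(parts) {
  is_bin <- grepl("\\.bin$", parts)
  data.frame(
    bin = parts[is_bin],
    xml = sub("\\.bin$", ".xml", parts[is_bin]),
    stringsAsFactors = FALSE
  )
}

# Part names as they appear elsewhere in this thread, plus one xml part
parts <- c(
  "xl/workbook.bin",
  "xl/sharedStrings.bin",
  "xl/worksheets/sheet1.bin",
  "[Content_Types].xml"
)
plan <- plan_bin_conversion(parts)
print(plan)
```

For a real file, `parts` could come from `unzip(path, list = TRUE)$Name`.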

This should be good to go soon. Remaining issues:

  • formulas are still broken. They are stored in reverse Polish notation and I still have to come up with a way to handle this. I'd really like to see working, or at least semi-working, formulas so that show_formula works and, ideally, opening the workbook with wb$open() does not require cleaning up the formulas.
    • In addition internal and external references are shared via an index in the workbook.
    • shared formulas are another beast to tackle
    • array formulas?
    • table formulas?
  • there are still unencoded binary files like pivot tables and pivot caches, maybe slicers and others as well? (I might need a few rainy days before I start working on this)
  • the code needs some further cleanups (possibly reordering of functions)
  • there are still a few things that are not 1:1 with the xlsx variant, like styled strings. But there is a limit to what I need. I might as well add a warning that xlsb files are not on par with xlsx files.

earlier attempts

The parser is coming along quite nicely. So far I'm able to read sharedStrings.bin, workbook.bin and worksheets/sheet.bin. Therefore all logical, numeric and character data should be ready for export. The formula parser looks a bit tricky, but the basic format is readable. External references are still a task.

# example file
url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
file <- "/tmp/openxlsx2_example.xlsb.zip"
curl::curl_download(url, file, mode = "wb")
unzip("/tmp/openxlsx2_example.xlsb.zip", exdir = "/tmp")
unzip("/tmp/openxlsx2_example.xlsb", exdir = "/tmp/test")

# reading plain shared strings works
openxlsx2:::sst("/tmp/test/xl/sharedStrings.bin", "/tmp/sst.txt", 0)

# reading the sheet names from the workbook also works
openxlsx2:::workbook("/tmp/test/xl/workbook.bin", "/tmp/wb.txt", 0)

# this is still broken
# at the beginning of every row is a 17 byte long block I don't understand and therefore I'm unable to work around it.
# I can skip over it in the first row, but I do not know when the second row begins and therefore I fail
openxlsx2:::worksheet("/tmp/test/xl/worksheets/sheet1.bin", "/tmp/ws.txt", 0)

Finally something to show. Reading the binary of the example file works. The output below is created using the conversion from bin to xml. It does not yet contain formulas, might not be the most stable reader yet (it might not even work with any other file :)), and it still lacks all style information. Pivot tables, charts, etc. are not read either: all of these are also binary and need conversion (unlikely to happen, but hey, the entire approach was unlikely a few weeks ago).

library(openxlsx2)

# example file
url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
file <- "/tmp/openxlsx2_example.xlsb.zip"
curl::curl_download(url, file, mode = "wb")
unzip("/tmp/openxlsx2_example.xlsb.zip", exdir = "/tmp")

wb <- wb_load("/tmp/openxlsx2_example.xlsb")
#> ... lots of debug output ...

# woha!
wb_to_df(wb)
#>     Var1 Var2 NA  Var3  Var4  Var5         Var6    Var7       Var8
#> 3   TRUE    1 NA     1     a 45075 3209324 This  #DIV/0 0.06059028
#> 4   TRUE   NA NA #NUM!     b 45069         <NA>       0 0.58538194
#> 5   TRUE    2 NA  1.34     c 44958         <NA> #VALUE! 0.95905093
#> 6  FALSE    2 NA  <NA> #NUM!    NA         <NA>       2 0.72561343
#> 7  FALSE    3 NA  1.56     e    NA         <NA>    <NA>         NA
#> 8  FALSE    1 NA   1.7     f 44987         <NA>     2.7 0.36525463
#> 9     NA   NA NA  <NA>  <NA>    NA         <NA>    <NA>         NA
#> 10 FALSE    2 NA    23     h 45284         <NA>      25         NA
#> 11 FALSE    3 NA  67.3     i 45285         <NA>       3         NA
#> 12    NA    1 NA   123  <NA> 45138         <NA>     122         NA

working example

library(openxlsx2)
url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
file <- "/tmp/openxlsx2_example.xlsb.zip"
curl::curl_download(url, file, mode = "wb")
unzip("/tmp/openxlsx2_example.xlsb.zip", exdir = "/tmp")

wb <- wb_load("/tmp/openxlsx2_example.xlsb")
#> [1] "/tmp/Rtmp5898IA/_openxlsx_wb_load89cf3f379133/xl/workbook.bin"
#> [1] "/tmp/Rtmp5898IA/_openxlsx_wb_load89cf3f379133/xl/workbook.xml"
#> ProductVersion: 3843: 32768: 0
#> ProductVersion: 4096: 32768: 0
#> ProductVersion: 4098: 32768: 0
#> ProductVersion: 3843: 0: 0
#> ProductVersion: 4352: 0: 0
#> <fills>
#> <fill>
#> <fill>
#> </fills>
#> <borders>
#> <borders>
#> </borders>
#> [1]  1 11  1  9
#> BrtCellError: 544
#> BrtCellError: 745
#> BrtCellError: 841
#> [1]  5 37  2 13

# reading works
wb_to_df(wb)
#>     Var1 Var2 NA  Var3  Var4       Var5         Var6    Var7     Var8
#> 3   TRUE    1 NA     1     a 2023-05-29 3209324 This #DIV/0! 01:27:15
#> 4   TRUE   NA NA #NUM!     b 2023-05-23         <NA>       0 14:02:57
#> 5   TRUE    2 NA  1.34     c 2023-02-01         <NA> #VALUE! 23:01:02
#> 6  FALSE    2 NA  <NA> #NUM!       <NA>         <NA>       2 17:24:53
#> 7  FALSE    3 NA  1.56     e       <NA>         <NA>    <NA>     <NA>
#> 8  FALSE    1 NA   1.7     f 2023-03-02         <NA>     2.7 08:45:58
#> 9     NA   NA NA  <NA>  <NA>       <NA>         <NA>    <NA>     <NA>
#> 10 FALSE    2 NA    23     h 2023-12-24         <NA>      25     <NA>
#> 11 FALSE    3 NA  67.3     i 2023-12-25         <NA>       3     <NA>
#> 12    NA    1 NA   123  <NA> 2023-07-31         <NA>     122     <NA>

# formulas not yet
wb_to_df(wb, show_formula = TRUE)
#>     Var1 Var2 NA  Var3  Var4       Var5         Var6               Var7
#> 3   TRUE    1 NA     1     a 2023-05-29 3209324 This         E3\n0\n/\n
#> 4   TRUE   NA NA #NUM!     b 2023-05-23         <NA>               C4\n
#> 5   TRUE    2 NA  1.34     c 2023-02-01         <NA>            #VALUE!
#> 6  FALSE    2 NA  <NA> #NUM!       <NA>         <NA>        C6\nE6\n+\n
#> 7  FALSE    3 NA  1.56     e       <NA>         <NA>               <NA>
#> 8  FALSE    1 NA   1.7     f 2023-03-02         <NA>        C8\nE8\n+\n
#> 9     NA   NA NA  <NA>  <NA>       <NA>         <NA>               <NA>
#> 10 FALSE    2 NA    23     h 2023-12-24         <NA>    C10\nE10\nSUM\n
#> 11 FALSE    3 NA  67.3     i 2023-12-25         <NA> C11\nE3\nPRODUCT\n
#> 12    NA    1 NA   123  <NA> 2023-07-31         <NA>      E12\nC12\n-\n
#>        Var8
#> 3  01:27:15
#> 4  14:02:57
#> 5  23:01:02
#> 6  17:24:53
#> 7      <NA>
#> 8  08:45:58
#> 9      <NA>
#> 10     <NA>
#> 11     <NA>
#> 12     <NA>

# https://en.wikipedia.org/wiki/Reverse_Polish_notation
fmls <- wb_to_df(wb, show_formula = TRUE)$Var7
message(fmls[1])
#> E3
#> 0
#> /
message(fmls[9])
#> C11
#> E3
#> PRODUCT
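
A minimal sketch of how newline-separated token strings like the ones above could be turned back into infix formulas with a stack. This is not the converter that ended up in the package. `rpn_to_infix()` is a hypothetical name, any all-uppercase token is assumed to be a function, and every function is assumed to take two arguments because the token stream shown here carries no argument count.

```r
# Hypothetical helper: convert "E3\n0\n/\n" style RPN into an infix string.
rpn_to_infix <- function(rpn, fun_arity = 2L) {
  tokens <- strsplit(rpn, "\n", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  operators <- c("+", "-", "*", "/", "^")

  stack <- character()   # partial infix expressions
  compound <- logical()  # TRUE if the entry was built by an operator

  for (tok in tokens) {
    n <- length(stack)
    if (tok %in% operators) {
      stopifnot(n >= 2L)
      # parenthesise operands that are themselves operator expressions
      lhs <- if (compound[n - 1L]) paste0("(", stack[n - 1L], ")") else stack[n - 1L]
      rhs <- if (compound[n]) paste0("(", stack[n], ")") else stack[n]
      keep <- seq_len(n - 2L)
      stack <- c(stack[keep], paste0(lhs, tok, rhs))
      compound <- c(compound[keep], TRUE)
    } else if (grepl("^[A-Z]+$", tok)) {
      # assumed: all-uppercase tokens are functions with fun_arity arguments
      stopifnot(n >= fun_arity)
      keep <- seq_len(n - fun_arity)
      args <- stack[(n - fun_arity + 1L):n]
      stack <- c(stack[keep], paste0(tok, "(", paste(args, collapse = ","), ")"))
      compound <- c(compound[keep], FALSE)
    } else {
      # operand: cell reference or constant
      stack <- c(stack, tok)
      compound <- c(compound, FALSE)
    }
  }
  stopifnot(length(stack) == 1L)
  stack
}

rpn_to_infix("E3\n0\n/\n")
rpn_to_infix("C11\nE3\nPRODUCT\n")
```

Tracking which stack entries came from an operator is what allows parentheses to be added only where precedence could otherwise change the meaning.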

@JanMarvin JanMarvin added the enhancement 😀 New feature or request label Jul 12, 2023
@JanMarvin JanMarvin added this to the future milestone Jul 12, 2023
@JanMarvin
Owner Author

The worksheet parser is coming along. But this is probably the most confusing format I've ever come across. Formulas are not characters, ... c'mon.

@JanMarvin JanMarvin force-pushed the xlsb branch 3 times, most recently from 3532125 to 4d8c9e5 Compare July 26, 2023 23:24
@JanMarvin JanMarvin changed the title [early WIP] initial xlsb attempts [WIP] initial xlsb attempts Jul 27, 2023
@JanMarvin
Owner Author

@barracuda156 in case you're still working on the big endian stuff, could you please run the code below and see if it works for you? Building it will throw a bunch of unused variable warnings etc. and it's still work in progress, but I want to know if this also works on big endian.

Basically this adds a (limited) converter from xlsb bin files to xml. Since there is no Excel for big endian systems, it will only ever be a one-way conversion: from little endian to big endian, on big endian systems. Microsoft uses a lot of flag bits; I read them as bytes and shift and mask, or use bitfields, to access them. I'm not sure whether this works well with swapped bytes.

I would appreciate it if you could give it a try.

# install this branch
remotes::install_github("JanMarvin/openxlsx2#688")

library(openxlsx2)

# example file
url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
file <- paste0(tempdir(), "/openxlsx2_example.xlsb.zip")
curl::curl_download(url, file, mode = "wb")
unzip(file, exdir = tempdir())

wb <- wb_load(paste0(tempdir(), "/openxlsx2_example.xlsb"))

wb_to_df(wb)
#>     Var1 Var2 NA  Var3  Var4       Var5         Var6    Var7     Var8
#> 3   TRUE    1 NA     1     a 2023-05-29 3209324 This #DIV/0! 01:27:15
#> 4   TRUE   NA NA #NUM!     b 2023-05-23         <NA>       0 14:02:57
#> 5   TRUE    2 NA  1.34     c 2023-02-01         <NA> #VALUE! 23:01:02
#> 6  FALSE    2 NA  <NA> #NUM!       <NA>         <NA>       2 17:24:53
#> 7  FALSE    3 NA  1.56     e       <NA>         <NA>    <NA>     <NA>
#> 8  FALSE    1 NA   1.7     f 2023-03-02         <NA>     2.7 08:45:58
#> 9     NA   NA NA  <NA>  <NA>       <NA>         <NA>    <NA>     <NA>
#> 10 FALSE    2 NA    23     h 2023-12-24         <NA>      25     <NA>
#> 11 FALSE    3 NA  67.3     i 2023-12-25         <NA>       3     <NA>
#> 12    NA    1 NA   123  <NA> 2023-07-31         <NA>     122     <NA>

@barracuda156

@JanMarvin Sure, will check it and let you know.

@barracuda156

I have built and installed from 074b262 and get this:

> library(openxlsx2)
> url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
> file <- paste0(tempdir(), "/openxlsx2_example.xlsb.zip")
> curl::curl_download(url, file, mode = "wb")
> unzip(file, exdir = tempdir())
> wb <- wb_load(paste0(tempdir(), "/openxlsx2_example.xlsb"))
Error in if (sheets$typ[i] == "chartsheet") { : 
  missing value where TRUE/FALSE needed

@JanMarvin
Owner Author

Thanks for the attempt! It finished the first conversion attempt in workbook_bin(), but the output did not look as expected. You could run it with

wb <- wb_load(paste0(tempdir(), "/openxlsx2_example.xlsb"), debug = TRUE)

This will print the path to the converted workbook.xml file. If you open this file in an editor, it should contain the converted xml; if you could paste it here, it might give me a hint. But obviously I'm at the guesswork stage and don't want to steal your time. I'll have to think about it a bit more.

I'm expecting something like this:

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2">
<bookViews>
<workbookView xWindow="0" yWindow="760" windowWidth="30240" windowHeight="17580" activeTab="0"/>
</bookViews>
<sheets>
<sheet r:id="rId1" state="visible" sheetId="1" name="Sheet1"/>
<sheet r:id="rId2" state="visible" sheetId="2" name="Sheet2"/>
</sheets>
</workbook>

@JanMarvin
Owner Author

I assume my approach does not work because of their bit flags. Say we have an int16_t whose bytes are 01 02; byte-swapped, it becomes 02 01. That works if the entire int16_t is a single number, but it is different when flags are stored inside it. If the first bit is of interest to me, I cannot swap and assume the same order. If I expect fn nn nn nn (f is a flag and n is numeric, and can be zero), then after swapping the flag byte is no longer in the first position.
I assume that I have to read it as is, in little endian, and only swap after the flags have been separated. Still, since I end up with, let's say, 15 bits of an int16, I might have to shift it before I can swap it.

Unfortunately that's quite a bit of work and not as easy as I hoped
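
To illustrate the problem in R terms (the package itself does this in C++), here is a minimal sketch that decodes two bytes with an explicit byte order before separating a flag from the number. The layout, with bit 15 as the flag and bits 0 to 14 as the value, is an assumption chosen for illustration and not a specific xlsb record.

```r
# Two bytes as stored in the file (little endian): the 16-bit value 0x8201
bytes <- as.raw(c(0x01, 0x82))

# Decode with an explicit byte order so the result is the same on any host
value <- readBin(bytes, what = "integer", size = 2L,
                 signed = FALSE, endian = "little")

# Assumed layout: bit 15 is a flag, bits 0-14 hold the number
flag   <- bitwShiftR(value, 15L)
number <- bitwAnd(value, 0x7FFFL)

# Reading the same bytes as big endian moves the flag bit elsewhere,
# so the same shift no longer finds it
swapped    <- readBin(bytes, what = "integer", size = 2L,
                      signed = FALSE, endian = "big")
wrong_flag <- bitwShiftR(swapped, 15L)

print(c(flag = flag, number = number, wrong_flag = wrong_flag))
```

Decoding with a fixed byte order first and masking afterwards avoids having to reason about where each flag bit lands after a swap.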

@barracuda156

barracuda156 commented Aug 10, 2023

Sorry for the delay. I ran it with debug and got a funny output:

output
> library(openxlsx2)
> url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
> file <- paste0(tempdir(), "/openxlsx2_example.xlsb.zip")
> curl::curl_download(url, file, mode = "wb")
> unzip(file, exdir = tempdir())
> wb <- wb_load(paste0(tempdir(), "/openxlsx2_example.xlsb"), debug = TRUE)
[1] "/var/folders/rD/rDeCM6SDHv8daLCecrRmrU+++TI/-Tmp-//RtmpCr89Jd/_openxlsx_wb_loadb2872addb47a/xl/workbook.bin"
[1] "/var/folders/rD/rDeCM6SDHv8daLCecrRmrU+++TI/-Tmp-//RtmpCr89Jd/_openxlsx_wb_loadb2872addb47a/xl/workbook.xml"
.
<workbook>
wb-loop: 131: 0: 3
.
<fileVersion>
wb-loop: 128: 50: 56
.
<workbookProperties>
wb-loop: 153: 12: 71
.
<BrtACBegin>
ProductVersion: 3843: 32768: 0
wb-loop: 37: 6: 79
.
⼀唀猀攀爀猀⼀樀愀渀洀愀爀瘀椀渀最愀爀戀甀猀稀甀猀⼀匀漀甀爀挀攀⼀
wb-loop: 2071: 70: 152
.
<BrtACEnd>
wb-loop: 38: 0: 154
.
<BrtACBegin>
ProductVersion: 4096: 32768: 0
wb-loop: 37: 6: 162
.
<BrtRevisionPtr>
wb-loop: 3073: 116: 281
.
<BrtACEnd>
wb-loop: 38: 0: 283
.
<workbookViews>
wb-loop: 135: 0: 286
.
<BrtACBegin>
ProductVersion: 4098: 32768: 0
wb-loop: 37: 6: 294
.
<BrtUID>
wb-loop: 3072: 16: 313
.
<BrtACEnd>
wb-loop: 38: 0: 315
.
<workbookView>
wb-loop: 158: 29: 347
.
</workbookViews>
wb-loop: 136: 0: 350
.
<sheets>
wb-loop: 143: 0: 353
.
<sheet>
sheet vis: 0: 1: 爀䤀搀㄀
wb-loop: 156: 36: 392
.
<sheet>
sheet vis: 0: 2: 爀䤀搀㈀
wb-loop: 156: 36: 431
.
</sheets>
wb-loop: 144: 0: 434
.
<calcPr>
wb-loop: 157: 26: 463
.
<fileRecovery>
wb-loop: 155: 1: 467
.
<ext>
ProductVersion: 3843: 0: 0
wb-loop: 35: 4: 473
.
<BrtWorkBookPr15>
wb-loop: 2091: 1: 477
.
</ext>
wb-loop: 36: 0: 479
.
<ext>
ProductVersion: 4352: 0: 0
wb-loop: 35: 4: 485
.
<calcs>
wb-loop: 5095: 0: 488
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀刀䐀
wb-loop: 5097: 40: 531
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀匀椀渀最氀攀
wb-loop: 5097: 48: 582
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀䘀嘀
wb-loop: 5097: 40: 625
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀䌀一䴀吀䴀
wb-loop: 5097: 46: 674
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀䰀䔀吀开圀䘀
wb-loop: 5097: 48: 725
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀䰀䄀䴀䈀䐀䄀开圀䘀
wb-loop: 5097: 54: 782
.
<calc>
0: 洀椀挀爀漀猀漀昀琀⸀挀漀洀㨀䄀刀刀䄀夀吀䔀堀吀开圀䘀
wb-loop: 5097: 60: 845
.
</calcs>
wb-loop: 5096: 0: 848
.
</ext>
wb-loop: 36: 0: 850
.
</workbook>
wb-loop: 132: 0: 853
.
278: 0
.
611: 4
<fonts>
.
43: 39
[1]   3   1   0   0   0   0 255
.
43: 43
[1]   3   1   0   0   0   0 255
.
43: 53
[1]   2 255   0   0   0   0 255
.
37: 6
Unhandled Style: 37: 6 @ 153
.
1025: 0
Unhandled Style: 1025: 0 @ 162
.
38: 0
Unhandled Style: 38: 0 @ 164
.
612: 0
</fonts>
.
603: 4
.
45: 68
.
45: 68
.
604: 0
.
613: 4
.
46: 51
.
614: 0
.
626: 4
.
47: 16
.
627: 0
.
617: 4
.
47: 16
.
47: 16
.
47: 16
.
47: 16
.
47: 16
.
618: 0
.
619: 4
.
37: 6
Unhandled Style: 37: 6 @ 517
.
3072: 16
Unhandled Style: 3072: 16 @ 526
.
38: 0
Unhandled Style: 38: 0 @ 544
.
48: 24
.
620: 0
.
505: 4
.
506: 0
.
508: 80
Unhandled Style: 508: 80 @ 586
.
509: 0
Unhandled Style: 509: 0 @ 669
.
35: 4
Unhandled Style: 35: 4 @ 671
.
1131: 0
Unhandled Style: 1131: 0 @ 678
.
1142: 42
Unhandled Style: 1142: 42 @ 681
.
1143: 0
Unhandled Style: 1143: 0 @ 726
.
1132: 0
Unhandled Style: 1132: 0 @ 729
.
36: 0
Unhandled Style: 36: 0 @ 731
.
35: 4
Unhandled Style: 35: 4 @ 733
.
2096: 0
Unhandled Style: 2096: 0 @ 740
.
2098: 50
Unhandled Style: 2098: 50 @ 743
.
2099: 0
Unhandled Style: 2099: 0 @ 796
.
2097: 0
Unhandled Style: 2097: 0 @ 799
.
36: 0
Unhandled Style: 36: 0 @ 801
.
279: 0
.
159: 8
.
19: 13
嘀愀爀㄀
.
19: 13
嘀愀爀㈀
.
19: 13
嘀愀爀㌀
.
19: 13
嘀愀爀㐀
.
19: 7
愀
.
19: 7
戀
.
19: 7
挀
.
19: 7
攀
.
19: 7
昀
.
19: 7
栀
.
19: 7
椀
.
19: 13
嘀愀爀㔀
.
19: 13
嘀愀爀㘀
.
19: 13
嘀愀爀㜀
.
19: 29
㌀㈀ 㤀㌀㈀㐀 吀栀椀猀
.
19: 13
嘀愀爀㠀
.
19: 11
洀瀀最
.
19: 11
挀礀氀
.
19: 13
搀椀猀瀀
.
19: 9
栀瀀
.
19: 13
搀爀愀琀
.
19: 9
眀琀
.
19: 13
焀猀攀挀
.
19: 9
瘀猀
.
19: 9
愀洀
.
19: 13
最攀愀爀
.
19: 13
挀愀爀戀
.
19: 23
䴀愀稀搀愀 刀堀㐀
.
19: 31
䴀愀稀搀愀 刀堀㐀 圀愀最
.
19: 25
䐀愀琀猀甀渀 㜀㄀ 
.
19: 33
䠀漀爀渀攀琀 㐀 䐀爀椀瘀攀
.
19: 39
䠀漀爀渀攀琀 匀瀀漀爀琀愀戀漀甀琀
.
19: 19
嘀愀氀椀愀渀琀
.
19: 25
䐀甀猀琀攀爀 ㌀㘀 
.
19: 23
䴀攀爀挀 ㈀㐀 䐀
.
19: 21
䴀攀爀挀 ㈀㌀ 
.
19: 21
䴀攀爀挀 ㈀㠀 
.
19: 23
䴀攀爀挀 ㈀㠀 䌀
.
19: 25
䴀攀爀挀 㐀㔀 匀䔀
.
19: 25
䴀攀爀挀 㐀㔀 匀䰀
.
19: 27
䴀攀爀挀 㐀㔀 匀䰀䌀
.
19: 41
䌀愀搀椀氀氀愀挀 䘀氀攀攀琀眀漀漀搀
.
19: 43
䰀椀渀挀漀氀渀 䌀漀渀琀椀渀攀渀琀愀氀
.
19: 39
䌀栀爀礀猀氀攀爀 䤀洀瀀攀爀椀愀氀
.
19: 21
䘀椀愀琀 ㄀㈀㠀
.
19: 27
䠀漀渀搀愀 䌀椀瘀椀挀
.
19: 33
吀漀礀漀琀愀 䌀漀爀漀氀氀愀
.
19: 31
吀漀礀漀琀愀 䌀漀爀漀渀愀
.
19: 37
䐀漀搀最攀 䌀栀愀氀氀攀渀最攀爀
.
19: 27
䄀䴀䌀 䨀愀瘀攀氀椀渀
.
19: 25
䌀愀洀愀爀漀 娀㈀㠀
.
19: 37
倀漀渀琀椀愀挀 䘀椀爀攀戀椀爀搀
.
19: 23
䘀椀愀琀 堀㄀ⴀ㤀
.
19: 31
倀漀爀猀挀栀攀 㤀㄀㐀ⴀ㈀
.
19: 29
䰀漀琀甀猀 䔀甀爀漀瀀愀
.
19: 33
䘀漀爀搀 倀愀渀琀攀爀愀 䰀
.
19: 29
䘀攀爀爀愀爀椀 䐀椀渀漀
.
19: 31
䴀愀猀攀爀愀琀椀 䈀漀爀愀
.
19: 25
嘀漀氀瘀漀 ㄀㐀㈀䔀
.
160: 0
Error in if (sheets$typ[i] == "chartsheet") { : 
  missing value where TRUE/FALSE needed
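
The `wb-loop` lines in the output above appear to print, for each record, its id, its payload size, and the read position after the record (for example `131: 0: 3`, then `128: 50: 56`). A minimal sketch of decoding such a record header, assuming the variable-length encoding described in the MS-XLSB documentation: seven value bits per byte, with the high bit set when another byte follows, up to two bytes for the id and four for the size. The function names are hypothetical and this is not the package's C++ reader.

```r
# Hypothetical helper: read a little endian base-128 integer from a raw vector.
read_varint <- function(bytes, pos, max_bytes) {
  value <- 0L
  for (i in seq_len(max_bytes)) {
    b <- as.integer(bytes[pos])
    pos <- pos + 1L
    # the low 7 bits carry the value, least significant group first
    value <- bitwOr(value, bitwShiftL(bitwAnd(b, 0x7FL), 7L * (i - 1L)))
    # a cleared high bit marks the last byte
    if (bitwAnd(b, 0x80L) == 0L) break
  }
  list(value = value, pos = pos)
}

# Hypothetical helper: record id (1-2 bytes) followed by record size (1-4 bytes).
read_record_header <- function(bytes, pos = 1L) {
  id   <- read_varint(bytes, pos, 2L)
  size <- read_varint(bytes, id$pos, 4L)
  list(id = id$value, size = size$value, data_start = size$pos)
}

# First three bytes that would produce "wb-loop: 131: 0: 3"
hdr <- read_record_header(as.raw(c(0x83, 0x01, 0x00)))
print(unlist(hdr))
```

Because every byte is handled individually, this part of the reader does not depend on the host byte order.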

@JanMarvin
Owner Author

Thanks. Just as expected, some parts work, but others not so much. It looks like the std::u16string to std::string conversion might be broken too. 😞

I followed this guide here and, 15 minutes after your last post, I am now able to reproduce your output locally (s390x emulation seems to work better than sparc64 and ppc). Maybe I'll find a way to run rstudio-server in this docker container so that it will be a little easier to play around. If u16string conversion is indeed broken, I unfortunately do not really see a way forward. But we'll see, Rome wasn't built in a day.
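
The garbled strings in the debug output look like UTF-16 text decoded with the wrong byte order. A minimal R sketch that reproduces the effect for the string "Var1", assuming the bin files store text as UTF-16LE; the package does its conversion in C++, so this only illustrates the symptom.

```r
# "Var1" as UTF-16LE bytes
bytes <- as.raw(c(0x56, 0x00, 0x61, 0x00, 0x72, 0x00, 0x31, 0x00))

# iconv() accepts a list of raw vectors; naming the byte order explicitly
# gives the same result on little and big endian hosts
ok <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")

# Interpreting the same bytes with the opposite byte order yields the
# CJK-looking characters seen in the debug output
swapped <- iconv(list(bytes), from = "UTF-16BE", to = "UTF-8")

print(ok)
print(swapped)
```

The code points of `swapped` are U+5600, U+6100, U+7200 and U+3100, i.e. each original code unit with its two bytes exchanged.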

@JanMarvin
Owner Author

For reference, the workbook.xml file currently looks like this:

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2">
<bookViews>
<workbookView xWindow="0" yWindow="760" windowWidth="30240" windowHeight="17580" activeTab="0" />
</bookViews>
<sheets>
<sheet r:id="爀䤀搀㄀" state="visible" sheetId="1" name="匀栀攀攀琀㄀"/>
<sheet r:id="爀䤀搀㈀" state="visible" sheetId="2" name="匀栀攀攀琀㈀"/>
</sheets>
</workbook>

@JanMarvin
Owner Author

Looks like string conversion is working.
Import still fails, but it fails better!

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2">
<bookViews>
<workbookView xWindow="0" yWindow="760" windowWidth="30240" windowHeight="17580" activeTab="0" />
</bookViews>
<sheets>
<sheet r:id="rId1" state="visible" sheetId="1" name="Sheet1"/>
<sheet r:id="rId2" state="visible" sheetId="2" name="Sheet2"/>
</sheets>
</workbook>

@JanMarvin
Owner Author

It works! 🎉

> wb <- wb_load("/tmp/openxlsx2_example.xlsb")
> wb_to_df(wb)
    Var1 Var2 NA  Var3  Var4       Var5         Var6    Var7     Var8
3   TRUE    1 NA     1     a 2023-05-29 3209324 This #DIV/0! 01:27:15
4   TRUE   NA NA #NUM!     b 2023-05-23         <NA>       0 14:02:57
5   TRUE    2 NA  1.34     c 2023-02-01         <NA> #VALUE! 23:01:02
6  FALSE    2 NA  <NA> #NUM!       <NA>         <NA>       2 17:24:53
7  FALSE    3 NA  1.56     e       <NA>         <NA>    <NA>     <NA>
8  FALSE    1 NA   1.7     f 2023-03-02         <NA>     2.7 08:45:58
9     NA   NA NA  <NA>  <NA>       <NA>         <NA>    <NA>     <NA>
10 FALSE    2 NA    23     h 2023-12-24         <NA>      25     <NA>
11 FALSE    3 NA  67.3     i 2023-12-25         <NA>       3     <NA>
12    NA    1 NA   123  <NA> 2023-07-31         <NA>     122     <NA>

We had at least two remaining stringsAsFactors = FALSE issues. Obviously nobody has tried to run openxlsx2 with R 3.x. That's sorted now (I didn't want to build current R; my love for big endian knows limits), and I get the expected table!
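
For context, `data.frame()` converted character columns to factors by default before R 4.0.0, so code that should behave the same on R 3.x has to request character columns explicitly. A minimal sketch with a hypothetical `sheets` table; the column name `typ` is taken from the error message earlier in this thread, the values are made up.

```r
# Explicit stringsAsFactors = FALSE keeps the columns as character vectors
# on R 3.x as well as on R >= 4.0.0
sheets <- data.frame(
  name = c("Sheet1", "Sheet2"),
  typ  = c("worksheet", "worksheet"),
  stringsAsFactors = FALSE
)

print(is.character(sheets$typ))
print(sheets$typ[1] == "chartsheet")
```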

@JanMarvin
Owner Author

Final remarks for today. Loading the entire worksheet and constructing a data frame also works with my largest public test cases (I got the link from the readxlsb issue tracker. The files contain some data from US oil and gas rigs).

system("curl https://rigcount.bakerhughes.com/static-files/42d55143-821b-4c37-a49e-79e4c1525d9a -o /tmp/north_america_rotary_rig_count_jan_2000_-_current.xlsb")
system("curl https://rigcount.bakerhughes.com/static-files/14b72078-1d87-4bf3-8cdf-53ec5d99b5e7 -o /tmp/north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb")

Unfortunately my assumption regarding flags is also true. Since the order of the bytes changes on big endian, I cannot simply swap the bytes and still have the flags at the same positions; I have to swap the bytes once the bit regions have been separated. There are various cases where the xlsb format uses some bit flag magic, and it is probably simply coincidence that it looks like it's working. Because of this, the following values are not imported correctly.

In the first file there are hidden rows and rows with different heights; these are missing in the big endian workbook:

wb$worksheets[[1]]$sheet_data$row_attr$ht
wb$worksheets[[1]]$sheet_data$row_attr$hidden

I have not yet decided how to continue in this regard. It is wonderful that it is working so far and that I'm able to test it locally, but printing a big warning that "reading xlsb on big endian is not supported and results might not be reliable" is also a step forward. After all, the big endian audience for my package is obviously negligibly small, and xlsb files are not that common either.
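
A minimal sketch of such a warning, using base R's `.Platform$endian`. The function name `warn_if_big_endian()` is hypothetical, and where the package would actually emit the warning is not decided in this thread.

```r
# Hypothetical helper: warn once when running on a big endian host.
warn_if_big_endian <- function() {
  if (.Platform$endian == "big") {
    warning(
      "reading xlsb on big endian is not supported and results might not be reliable",
      call. = FALSE
    )
  }
  # return the detected byte order invisibly so callers can branch on it
  invisible(.Platform$endian)
}

endian <- warn_if_big_endian()
print(endian)
```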

@barracuda156

@JanMarvin Is this format exclusively Microsoft’s? There are several opensource Office apps, and they should support BE systems. (And there are current BE systems, not just old macOS and Solaris.)

@JanMarvin
Owner Author

@barracuda156, yes, it's the modern and documented version of their previously undocumented xls files. Basically, xlsb files are zipped archives in which certain potentially large XML parts are stored as binary files. Their XLSX format is only zipped XML parts, but unpacked XML files can easily take a couple of hundred MB, where binary files only take a few MB.

And even though I asked you to test this with big endian, I'm not going to get guilt-tripped into wasting a few days adding full big endian support. Obviously there are still big endian systems out there, but I have no access to any and won't have access to one. Development in the docker environment is just a failsafe solution for me. That's why my interest is rather low. But after all this is open source, and if some IBM z system owner is in dire need, they should have access to developers who can patch this, and I would be happy to apply a patch. After all, there are other and potentially better solutions to the issue at hand, like converting xlsb to xlsx with Excel. :)

@barracuda156

@JanMarvin

And even though I asked you to test this with big endian

Should I test the most recent commit by the way?

@JanMarvin JanMarvin merged commit d503ebd into main Sep 2, 2023
9 checks passed
@JanMarvin JanMarvin deleted the xlsb branch September 2, 2023 10:18
Labels
enhancement 😀 New feature or request
2 participants