[WIP] initial xlsb attempts #688
The worksheet parser is coming along. But this is probably the most confusing format I've ever come across. Formulas are not characters, ... c'mon. |
@barracuda156 in case you're still working on the big endian stuff, could you please run the code below and see if it works for you? Building it will throw a bunch of unused variable warnings etc. and it's still work in progress, but I want to know if this also works on big endian. Basically this adds a (limited) xlsb bin file to xml converter. Since there is no Excel that supports big endian, it will only be a one way conversion, from little to big endian on big endian systems. Microsoft is using a lot of flag bits, and I read them as bytes and shift & mask, or use bitfields, to access them. Not sure if this works well with swapped bytes. Would appreciate it if you could give it a try.
# install this branch
remotes::install_github("JanMarvin/openxlsx2#688")
library(openxlsx2)
# example file
url <- "https://github.com/JanMarvin/openxlsx2/files/11996061/openxlsx2_example.xlsb.zip"
file <- paste0(tempdir(), "/openxlsx2_example.xlsb.zip")
curl::curl_download(url, file, mode = "wb")
unzip(file, exdir = tempdir())
wb <- wb_load(paste0(tempdir(), "/openxlsx2_example.xlsb"))
wb_to_df(wb)
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7 Var8
#> 3 TRUE 1 NA 1 a 2023-05-29 3209324 This #DIV/0! 01:27:15
#> 4 TRUE NA NA #NUM! b 2023-05-23 <NA> 0 14:02:57
#> 5 TRUE 2 NA 1.34 c 2023-02-01 <NA> #VALUE! 23:01:02
#> 6 FALSE 2 NA <NA> #NUM! <NA> <NA> 2 17:24:53
#> 7 FALSE 3 NA 1.56 e <NA> <NA> <NA> <NA>
#> 8 FALSE 1 NA 1.7 f 2023-03-02 <NA> 2.7 08:45:58
#> 9 NA NA NA <NA> <NA> <NA> <NA> <NA> <NA>
#> 10 FALSE 2 NA 23 h 2023-12-24 <NA> 25 <NA>
#> 11 FALSE 3 NA 67.3 i 2023-12-25 <NA> 3 <NA>
#> 12 NA 1 NA 123 <NA> 2023-07-31 <NA> 122 <NA> |
@JanMarvin Sure, will check it and let you know. |
I have built and installed from 074b262 and get this:
|
Thanks for the attempt! It finished the first conversion attempt in
wb <- wb_load(paste0(tempdir(), "/openxlsx2_example.xlsb"), debug = TRUE)
This will print the path to the converted
I'm expecting something like this:
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2">
<bookViews>
<workbookView xWindow="0" yWindow="760" windowWidth="30240" windowHeight="17580" activeTab="0"/>
</bookViews>
<sheets>
<sheet r:id="rId1" state="visible" sheetId="1" name="Sheet1"/>
<sheet r:id="rId2" state="visible" sheetId="2" name="Sheet2"/>
</sheets>
</workbook> |
I assume my approach does not work because with their bit flags we have something like this an int16_t looking like this. Unfortunately, that's quite a bit of work and not as easy as I hoped. |
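To illustrate the concern about flag bits and byte order, here is a minimal C++ sketch. It is not the code from this branch: the function names and the flag layout (`hidden` in bit 0, an outline level in bits 1 to 3) are hypothetical and not taken from the xlsb specification. It contrasts assembling a 16-bit little endian value from its bytes with arithmetic, which gives the same result on every host, against a plain `memcpy`, which yields the host's byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assemble a 16-bit little endian value from two bytes in file order.
// This uses only arithmetic on byte values, so the result is the same
// on little and big endian hosts.
inline uint16_t read_u16_le(const unsigned char* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

// What a plain memcpy into a uint16_t yields: the value in host byte
// order. On a big endian host the two bytes end up swapped.
inline uint16_t read_u16_host(const unsigned char* p) {
  uint16_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}

// Hypothetical flag layout (an assumption for illustration only):
//   bit 0     : hidden
//   bits 1..3 : outline level
inline bool flag_hidden(uint16_t flags) { return (flags & 0x0001u) != 0; }
inline unsigned flag_outline(uint16_t flags) { return (flags >> 1) & 0x7u; }

// Small self-check: 0x05 0x00 encodes hidden = 1 and outline = 2.
inline void demo() {
  const unsigned char raw[2] = {0x05, 0x00};
  const uint16_t flags = read_u16_le(raw);
  assert(flags == 0x0005);
  assert(flag_hidden(flags));
  assert(flag_outline(flags) == 2);
}
```

With `read_u16_host` the same two bytes become `0x0500` on a big endian machine, so a mask written for `0x0005` would silently read the wrong bits.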
Sorry for the delay. I ran with debug and get a funny output:
|
Thanks. Just as expected, parts work, but others not so much. Looks like std::u16string to std::string conversion might be broken too. 😞 I followed this guide here and 15 minutes after your last post I am now able to reproduce your output locally (s390x emulation seems to work better than sparc64 and ppc). Maybe I'll find a way to run rstudio-server in this docker container so that it will be a little easier to play around. If u16string conversion is indeed broken, unfortunately I do not really see a way forward. But we'll see, Rome wasn't built in a day |
For reference, the workbook.xml file currently looks like this: <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2">
<bookViews>
<workbookView xWindow="0" yWindow="760" windowWidth="30240" windowHeight="17580" activeTab="0" />
</bookViews>
<sheets>
<sheet r:id="爀䤀搀" state="visible" sheetId="1" name="匀栀攀攀琀"/>
<sheet r:id="爀䤀搀㈀" state="visible" sheetId="2" name="匀栀攀攀琀㈀"/>
</sheets>
</workbook>
|
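The garbled attribute values above are consistent with UTF-16 code units being read with swapped bytes: `爀` is U+7200, which is `r` (U+0072) with its two bytes exchanged. The following C++ sketch shows one way to decode UTF-16 bytes to UTF-8 without depending on the host byte order. It is an assumption-laden illustration, not the conversion used in this branch; `append_utf8` and `utf16_bytes_to_utf8` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to a UTF-8 encoded string.
inline void append_utf8(std::string& out, uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Decode UTF-16 bytes to UTF-8. Each code unit is assembled from two
// bytes with shifts, so the host byte order does not matter. Passing
// little_endian = false reproduces the byte-swapped misreading.
// Unpaired surrogates are passed through unchecked in this sketch.
inline std::string utf16_bytes_to_utf8(const unsigned char* p,
                                       std::size_t n_bytes,
                                       bool little_endian = true) {
  auto unit = [&](std::size_t i) -> uint32_t {
    return little_endian ? (p[i] | (p[i + 1] << 8))
                         : ((p[i] << 8) | p[i + 1]);
  };
  std::string out;
  for (std::size_t i = 0; i + 1 < n_bytes; i += 2) {
    uint32_t cp = unit(i);
    // Combine a high surrogate with the following low surrogate.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < n_bytes) {
      const uint32_t lo = unit(i + 2);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        i += 2;
      }
    }
    append_utf8(out, cp);
  }
  return out;
}

// Self-check: "rId2" stored as UTF-16LE decodes correctly, and the
// byte-swapped reading yields U+7200 U+4900 U+6400 U+3200 instead.
inline void demo() {
  const unsigned char rid2[8] = {0x72, 0x00, 0x49, 0x00,
                                 0x64, 0x00, 0x32, 0x00};
  assert(utf16_bytes_to_utf8(rid2, 8, true) == "rId2");
  assert(utf16_bytes_to_utf8(rid2, 8, false) ==
         "\xE7\x88\x80\xE4\xA4\x80\xE6\x90\x80\xE3\x88\x80");
}
```

Decoding byte by byte like this avoids reinterpreting the buffer as `char16_t`, which is where a big endian host would pick up the swapped values.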
Looks like string conversion is working.
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2">
<bookViews>
<workbookView xWindow="0" yWindow="760" windowWidth="30240" windowHeight="17580" activeTab="0" />
</bookViews>
<sheets>
<sheet r:id="rId1" state="visible" sheetId="1" name="Sheet1"/>
<sheet r:id="rId2" state="visible" sheetId="2" name="Sheet2"/>
</sheets>
</workbook>
|
It works! 🎉
> wb <- wb_load("/tmp/openxlsx2_example.xlsb")
> wb_to_df(wb)
Var1 Var2 NA Var3 Var4 Var5 Var6 Var7 Var8
3 TRUE 1 NA 1 a 2023-05-29 3209324 This #DIV/0! 01:27:15
4 TRUE NA NA #NUM! b 2023-05-23 <NA> 0 14:02:57
5 TRUE 2 NA 1.34 c 2023-02-01 <NA> #VALUE! 23:01:02
6 FALSE 2 NA <NA> #NUM! <NA> <NA> 2 17:24:53
7 FALSE 3 NA 1.56 e <NA> <NA> <NA> <NA>
8 FALSE 1 NA 1.7 f 2023-03-02 <NA> 2.7 08:45:58
9 NA NA NA <NA> <NA> <NA> <NA> <NA> <NA>
10 FALSE 2 NA 23 h 2023-12-24 <NA> 25 <NA>
11 FALSE 3 NA 67.3 i 2023-12-25 <NA> 3 <NA>
12 NA 1 NA 123 <NA> 2023-07-31 <NA> 122 <NA>
We had at least two remaining |
Final remarks for today. Loading the entire worksheet and constructing a data frame also works with my largest public test cases (I got the link from the
system("curl https://rigcount.bakerhughes.com/static-files/42d55143-821b-4c37-a49e-79e4c1525d9a -o /tmp/north_america_rotary_rig_count_jan_2000_-_current.xlsb")
system("curl https://rigcount.bakerhughes.com/static-files/14b72078-1d87-4bf3-8cdf-53ec5d99b5e7 -o /tmp/north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb")
Unfortunately my assumption regarding flags is also true. Since the order of the bytes changes on big endian, I cannot simply swap the bytes and still have the flags at the same positions, but have to swap the bytes once the bit regions have been separated. There are various cases where the xlsb format uses some bit flag magic, and it is probably simply coincidence that it looks like it's working. Because of this the following values are not imported correctly. In the upper file there are hidden rows and rows with different heights; these are missing in the big endian workbook:
wb$worksheets[[1]]$sheet_data$row_attr$ht
wb$worksheets[[1]]$sheet_data$row_attr$hidden
I have not yet decided how to continue in this regard. It is wonderful that it is working so far and that I'm able to test it locally, but printing a big warning that "reading xlsb on big endian is not supported and results might not be reliable" is also a step forward. After all, the big endian audience for my package is obviously negligibly small and |
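As a rough sketch of why row height and hidden state can go missing: the layout of C++ bitfields is implementation-defined, so a struct of bitfields overlaid on file bytes is not guaranteed to line up on a big endian ABI even after the integer has been byte-swapped. One way to sidestep this is to bring each integer into host order first and then pull the fields out with explicit shifts and masks. The record layout below (a 16-bit height in twips followed by 16 bits of flags, with `hidden` in bit 4) and all names are hypothetical, chosen only for illustration; they are not the actual row record of the xlsb format or the code in this branch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True when the host stores the least significant byte first.
inline bool host_is_little_endian() {
  const uint16_t probe = 1;
  unsigned char first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Exchange the two bytes of a 16-bit value.
inline uint16_t bswap16(uint16_t v) {
  return static_cast<uint16_t>((v << 8) | (v >> 8));
}

// Load a little endian uint16_t from a buffer and bring it into host
// order, swapping only on big endian hosts.
inline uint16_t load_u16_le(const unsigned char* p) {
  uint16_t v;
  std::memcpy(&v, p, sizeof v);
  return host_is_little_endian() ? v : bswap16(v);
}

// Decoded row attributes (hypothetical subset).
struct RowAttr {
  double ht;         // row height in points
  bool hidden;       // row is hidden
  unsigned outline;  // outline level
};

// Hypothetical 4-byte record: u16 height in twips, then u16 flags with
//   bits 0..2 : outline level
//   bit 4     : hidden
// The fields are extracted with shifts and masks on the host-order
// value instead of relying on a bitfield struct.
inline RowAttr parse_row(const unsigned char* p) {
  const uint16_t height_twips = load_u16_le(p);
  const uint16_t flags = load_u16_le(p + 2);
  RowAttr r;
  r.ht = height_twips / 20.0;  // assumes 20 twips per point
  r.outline = flags & 0x7u;
  r.hidden = ((flags >> 4) & 0x1u) != 0;
  return r;
}

// Self-check: 300 twips = 15 pt, flags 0x0012 = outline 2 and hidden.
inline void demo() {
  const unsigned char rec[4] = {0x2C, 0x01, 0x12, 0x00};
  const RowAttr r = parse_row(rec);
  assert(r.ht == 15.0);
  assert(r.hidden);
  assert(r.outline == 2);
}
```

Because the masks operate on the numeric value rather than on memory layout, the same extraction code should give identical results on both kinds of host, at the cost of writing out every field position by hand.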
@JanMarvin Is this format exclusively Microsoft’s? There are several opensource Office apps, and they should support BE systems. (And there are current BE systems, not just old macOS and Solaris.) |
@barracuda156, yes, it's the modern and documented version of their previously undocumented xls files. Basically xlsb files are zipped binary files for certain potentially large XML parts. Their XLSX format is only zipped XML parts, but unpacked XML files can easily take a couple of hundred MB, where binary files only take a few MB. And even though I asked you to test this with big endian, I'm not going to get guilt tripped into wasting a few days adding full big endian support. Obviously there are still systems out there that are big endian, but ... I have no access to any and won't have access to one. Development in the docker environment is just a failsafe solution to me. That's why my interest is rather low. But after all this is open source, and if some IBM z system owner is in dire need, they should have access to developers that can patch this and I would be happy to apply a patch. After all there are other and potentially better solutions to the issue at hand. Like converting xlsb with Excel to xlsx. :) |
Should I test the most recent commit by the way? |
…ferences. works for the very huge sample size of n = 1 and with exactly one external workbook reference :)
…xternalReferences>
This works as good as it gets for an initial approach. Would love to get some feedback prior to merge. How does this work for some real life examples?
dxfs
styles (not sure if when / if this is ever going to be implemented)dxfs
)Otherwise it is supposed to work.
outdated
Initially I was assuming that the file would be somehow encrypted, but it is not and it is actually documented. I do not understand the documentation, but that is a me problem ...
For our final approach I want to do the following:
This way our usual functions should still work as expected.
This should be good to go soon. Remaining issues:
showFormula works and ideally opening the workbook with wb$open() should not require cleaning of the formulas.
xlsb are not on par with the xlsx files.
earlier attempts
The parser is coming along quite nicely. So far I'm able to read sharedStrings.bin, workbook.bin and worksheets/sheet.bin. Therefore all logical, numeric and character data should be ready for export. The formula parser looks a bit tricky, but the basic format is readable. External references are still a task.
Finally something to show. Reading the binary of the example file works. The output below is created using the conversion from bin to xml. This does not yet contain formulas, might not be the most stable reader yet (it might not even work with any other file :)), and it still lacks all style information, pivot tables, charts etc. All of this is also binary and needs conversion (unlikely to happen, but hey, the entire approach was unlikely a few weeks ago).
working example