Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Template Filling, Styles & Page Setups #75

Closed
diginfo opened this issue Jun 27, 2014 · 33 comments
Closed

Template Filling, Styles & Page Setups #75

diginfo opened this issue Jun 27, 2014 · 33 comments

Comments

@diginfo
Copy link

diginfo commented Jun 27, 2014

I am trying to replace an implementation of PHPExcel with js-xlsx under node js as I am no longer using php server-side.

I have a series of excel templates that are pre-formatted with colours and styling, and also page size, margins, print areas etc etc, and then I simply add text to the relevant cells, under PHPexcel it works very well.

I discovered js-xlsx a few hours ago and thought I had found the solution, but it appears that I am unable to duplicate the original file, only with the cell values changed, whilst maintaining everything else in the original file.

var XLSX = require('xlsx');
var wb = XLSX.readFile('/usr/share/nodejs/dwap/public/xls/qplan.xlsx',{cellStyles:true});
XLSX.writeFile(wb,'/usr/share/nodejs/dwap/public/xls/out.xlsx',{bookSST:true});

I have used the simple code above to read in a formatted file and then save it again, but all of the styling is lost.

Is it possible to achieve this kind of functionality with js-xlsx and if so what am I doing wrong ?

@SheetJSDev
Copy link
Contributor

js-xlsx currently rebuilds styles and other metadata (unfortunately, XLSB and XLS and XML2003 have different ways of storing the style information, so the only uniform way to handle this is to parse and write out the data). The style metadata is woefully incomplete (contributed by @elad, who specifically needed certain cell background information) and should improve over time. The biggest challenge is finding a way that works across the various formats.

If you know the input templates will be XLSX, it's pretty simple to hack the parser and writer to use the original stylesheet:

  1. save the raw style xml to a field in the workbook: Near the end of the parse_zip function add a line like:
if(dir.style) out.RawStyle = getzipdata(zip, dir.style.replace(/^\//,''));
  1. Force the cell parser to store the original style: Add it to the cell object before storing:
p.Style = cell.s;
  1. force the cell writer to use the overridden style: Change the line that tries to finds the style:
o.s = cell.RawStyle !== undefined ? cell.RawStyle : get_cell_style(opts.cellXfs, cell, opts);
  1. force write_zip to use the raw style field if present: where the styles are produced change the line so that it reads:
    zip.file(f, wb.RawStyle !== undefined ? wb.RawStyle : write_sty(wb, f, opts));

(after making the changes, run make and make dist to produce the xlsx.js file). Disclaimer: I haven't tested this.

The reason this hack and the bold workaround aren't in the main code is because I don't think it's the right approach going forward.

Try this and let me know how it goes

@elad
Copy link

elad commented Jun 27, 2014

Just to simplify a bit, because I had to dive into this code without knowing it myself :)

What js-xlsx does is transform a file to an internal representation, and all operations are done on that internal representation. When you want to save the changes, it needs to export the data back. This means that js-xlsx doesn't "go to cell A1 and change value and saves," but rather parses the entire file, modifies the data, and rewrites the entire file.

Files have several "subfiles" that store the various sheets, styles, and so on. Some of those aren't parsed, and others are parsed only partially. When I added minimal style parsing, for example, the styles weren't being parsed at all. Because I only needed a specific subset, styles are partially parsed just to give the background color information.

What's becoming obvious though is that there has to be some format to hold this data that properly reflects what Excel is capable of. Otherwise we'll just pollute the data structures and they'll become a maintenance headache - if I understand what @SheetJSDev is getting at.

(This is partly my fault, because I lack the time to give this proper thought, but I have a feeling that what we'll end up doing is populate the .h HTML field so it reflects what's in the cell, maybe add some common flags for things like bold/italics/underline in addition to the existing colors, and as suggested above, write back the style on export.)

@diginfo
Copy link
Author

diginfo commented Jun 27, 2014

Thanks for helping, I modified the 3 files and recompiled, this resulted in a partial success.

Cells with data in them are formatted, but formatting (such as background shading) is removed from cells without data (text) in them.

Merged cells were not merged any more.

The print margins etc did not get copied across.

Maybe I am expecting too much, but this would be a really fabulous solution if I was able to do all of my formatting in the original file and then just update the cells with data.

I just tried the python xlrd/xlwt library and that did not copy any styles either.

@SheetJSDev
Copy link
Contributor

@elad I didn't mean to suggest you were sloppy or cut corners! I was contextualizing where things stand and giving credit where it is due. You had a need for extracting the cell background and scratched the itch :)

@diginfo to be clear, though, you can literally modify the parsed cell structure by changing the cell's v attribute (and it will be written)

What's becoming obvious though is that there has to be some format to hold this data that properly reflects what Excel is capable of. Otherwise we'll just pollute the data structures and they'll become a maintenance headache

Besides the maintenance headache, a uniform representation lets you convert between formats seamlessly. There are some cool tests that read XLS and write XLSX

...

Cells with data in them are formatted, but formatting (such as background shading) is removed from cells without data (text) in them

Set sheetStubs to true in the parser options object.

Merged cells were not merged any more.

They are being parsed but that data is not being written. That's a pretty straightforward fix in the writer. The !merges field is an array of ranges, which can be stringified with the helper function encode_range

The print margins etc did not get copied across.

These are worksheet-level properties, which are not parsed at the moment.

Maybe I am expecting too much, but this would be a really fabulous solution if I was able to do all of my formatting in the original file and then just update the cells with data.

You aren't expecting too much :) The challenge is finding the best approach that will extend to the other formats, and it's better to spend a few minutes or hours or days thinking before adding something that we'll regret later. As the old adage goes:

Weeks of programming save hours of planning

@diginfo
Copy link
Author

diginfo commented Jun 27, 2014

Thanks for the explanation, I understand the need to come up with a solution that can handle all scenarios & stylesheets.

Personally, my only requirement is for a excel-based reporting tool and probably the only thing I will ever do is to create a template and style & format it so that data can be later added to the relevant cells.

This will allow me to easily create pretty reports that can be either downloaded or rendered to pdf.

I understand your best approach, and that this is not something that may happen imminently, but Is this something I can further hack in the next few days to achieve what I need before the official method and enhancement is released ?

PS: I have discovered that the 'formatting_info=True' option in xlrd for python does not appear to work for xlsx files, so it looks like that is no longer an option.

@SheetJSDev
Copy link
Contributor

But Is this something I can further hack in the next few days to achieve what I need before the official method and enhancement is released ?

Fortunately, this shouldn't take more than an hour :)

The missing features that you identified are worksheet-level properties, which should be found in the individual sheetN.xml files within the xlsx.

If you have access to vim, you can open the file using vim and it will show you a zip.vim view. Alternatively, rename the xlsx file to .zip and unzip.

Let's take https://github.com/SheetJS/test_files/blob/master/merge_cells.xlsx?raw=true as an example.

$ vim merge_cells.xlsx

The page margins is in the worksheet xml, so let's take a peek there:

It's easier to understand if we clean it up a bit. In vim, the mantra is :%!xmllint --format - (pass the data through xmllint).

The relevant block is at the bottom:

<pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5"/>

We want to take this raw text and add it to the output. To do that, the steps are straightforward:

  1. Figure out how to grab the text. Since this is coming from the worksheet xml, the relevant parser is parse_ws_xml. Regular expressions are your friend here. For example, that tag can be found with a really dirty match tag:
var pageMargins = data.match(/<pageMargins[^>]*>/)
  1. Save the text to a field in the spreadsheet. Always use a name starting with '!' -- other utility functions that just iterate through cells (like the current formulae function) know to skip those entries.

For example:

if(pageMargins != null) s['!pageMargins'] = pageMargins; 
  1. Add the text to the output. The dual function for parse_ws_xml is write_ws_xml so look there. The text shows up after sheetData in the worksheet xml, so it should be written there:
if(ws['!pageMargins']) o.push(ws['!pageMargins']);

Hopefully this is helpful and hopefully you finish within an hour :) I'd like to think the code organization is sufficiently straightforward that hacking should not require too much trickery.

@diginfo
Copy link
Author

diginfo commented Jun 27, 2014

Thanks, but where do I add this:

if(pageMargins != null) s['!pageMargins'] = pageMargins;

@SheetJSDev
Copy link
Contributor

@diginfo near the end of parse_ws_xml. https://github.com/SheetJS/js-xlsx/blob/master/bits/67_wsxml.js#L129-130 is a good spot

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

OK, I think I have figured it out:

function parse_ws_xml(data, opts, rels) {
    --
    /* PACHACK */
    var pageMargins = data.match(/<pageMargins[^>]*>/);
    if(pageMargins != null) s['!pageMargins'] = pageMargins;

    var pageSetup = data.match(/<pageSetup[^>]*>/);
    if(pageSetup != null) s['!pageSetup'] = pageSetup; 

var write_ws_xml = function(idx, opts, wb) {
    --
    /* PACHACK */
    if(ws['!pageMargins']) o.push(ws['!pageMargins']);
    if(ws['!pageSetup']) o.push(ws['!pageSetup']);

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

Unfortunately, this killed the output file which is corrupted and cannot be read:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>Repair Result to out 04944.xml</logFileName><summary>Errors were detected in file 'SYSDATA:Users:PAC:Downloads:out.xlsx'</summary><removedParts summary="Following is a list of removed parts:"><removedPart>Replaced Part: /xl/worksheets/sheet1.xml with XML error.  Load error. Line 2, column 361.</removedPart></removedParts></recoveryLog>

@SheetJSDev
Copy link
Contributor

Load error. Line 2, column 361.

This is somewhat informative (certainly better than the cannot parse errors in XLSB with no indication). It's possible that pageSetup or pageMargins has children. Can you put the xml somewhere (e.g. http://hastebin.com/) so I can take a look? If it is confidential, send me an email

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

OK, but which XML ?- do you mean the output xlsx file or the sheet1.xml file that was from the source ?

@SheetJSDev
Copy link
Contributor

XLSX is fine, but the error is reported to be in /xl/worksheets/sheet1.xml (so that should be sufficient)

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

This is the file extracted from the source xlsx :http://hastebin.com/fexawijizu.xml

@SheetJSDev
Copy link
Contributor

I wanted to see the file being generated by js-xlsx -- the load error "Load error. Line 2, column 361." tells you where to look

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

OK, this is the output: http://hastebin.com/umuwadukeb.xml

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

This is the output without the additional pageMargins & pageSetup which loads successfully: http://hastebin.com/riqaxuwaqe.xml

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

Hi;

Thought I would send you the source file.

Kind Regards

PeterC

On 28/6/14 8:46 am, SheetJSDev wrote:

I wanted to see the file being generated by js-xlsx -- the load error
|"Load error. Line 2, column 361."| tells you where to look


Reply to this email directly or view it on GitHub
#75 (comment).

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

Would it not be possible to clone the original file, then change the worksheet values and save the file ?

I dont need to change or edit anything else

@SheetJSDev
Copy link
Contributor

It's not that simple because the you have to rebuild the shared string table.

I'm surprised that js-xlsx generated that xml because it always generates with standalone="yes" (https://github.com/SheetJS/js-xlsx/blob/master/bits/22_xmlutils.js#L102)

Can you share the broken xlsx that was generated?

@diginfo
Copy link
Author

diginfo commented Jun 28, 2014

Attached are the source and output files, not sure if you can receive
them ??

PeterC

On 28/6/14 10:34 am, SheetJSDev wrote:

It's not that simple because the you have to rebuild the shared string
table.

I'm surprised that js-xlsx generated that xml because it always
generates with |standalone="yes"|
(https://github.com/SheetJS/js-xlsx/blob/master/bits/22_xmlutils.js#L102)

Can you share the broken xlsx that was generated?


Reply to this email directly or view it on GitHub
#75 (comment).

@SheetJSDev
Copy link
Contributor

Github doesn't support attachments. Send to sheetjs@gmail.com

Sent from my iPad

On Jun 27, 2014, at 11:03 PM, diginfo notifications@github.com wrote:

Attached are the source and output files, not sure if you can receive
them ??

PeterC

On 28/6/14 10:34 am, SheetJSDev wrote:

It's not that simple because the you have to rebuild the shared string
table.

I'm surprised that js-xlsx generated that xml because it always
generates with |standalone="yes"|
(https://github.com/SheetJS/js-xlsx/blob/master/bits/22_xmlutils.js#L102)

Can you share the broken xlsx that was generated?


Reply to this email directly or view it on GitHub
#75 (comment).


Reply to this email directly or view it on GitHub.

@DigitalMachinist
Copy link

@SheetJSDev

Just to plug this issue, I have need of this very same feature in the near future and it would be nice to know when this is implemented and/or if there is now an established workaround for this issue as a result of @diginfo's correspondence with you.

In my case, having this feature is non-essential, but I'm sure it would please my clients to be getting their Excel reports output from the server with the same appearance that they're used to in the past.

Thanks for investing the time in helping diginfo through his hack. I may be in a similar position in the future.

@SheetJSDev
Copy link
Contributor

@sivyr This is still in the pipeline. No one has forgotten :)

The current hangup actually has nothing to do with XLSX. There is a parallel project for XLS and other pre-2007 formats: https://github.com/SheetJS/js-xls . Right now, you can do some really cool things like read XLS files and generate XLSX files. In order to keep the representations aligned, we have to make sure that a unified style representation can be generated from the different file types. I haven't settled on that format yet.

I think @elad said it best:

there has to be some format to hold this data that properly reflects what Excel is capable of. Otherwise we'll just pollute the data structures and they'll become a maintenance headache

@elad
Copy link

elad commented Jul 17, 2014

I was talking to friend today and they brought up the idea of looking at how Excel macros work to access this type of data. For example, take a look at http://msdn.microsoft.com/en-us/library/cc296089(v=office.12).aspx for colors.

Maybe we should use the same variables and structure exposed for macro development?

I was also thinking we could simplify by restricting styling to an entire cell. (I'm not sure but Excel macros might be doing that already.)

@SheetJSDev
Copy link
Contributor

I was also thinking we could simplify by restricting styling to an entire cell. (I'm not sure but Excel macros might be doing that already.)

VBA macros actually lets you set styling on a character-by-character basis. For example:

Sheets("Sheet1").Range("B3").Characters(16,9).Font.Italic = True

If we wanted to make a function library, this may look like:

workbook.GetSheet("Sheet1").GetRange("B3").GetChars(15,9).GetFont.SetFormat({italic:true})

Any thoughts for what this function interface should look like?

@elad
Copy link

elad commented Jul 17, 2014

You're right, per-character styling is possible.

The only thing holding me back is real world use cases. If I had to guess, I would bet entire-cell styling is much more common than per-character styling. You'd see bold headlines, colored rows, and so on. Unlikely that styling on a per-character basis would have any meaning other than simple styling. Considering Excel is used to carry data, not text, the meaning is much more important to me than style, at least in the sense of how easy it is to extract programmatically.

That said, you might want to have SheetJS support everything Excel can - this would make it a much stronger tool. Focusing on cell contents (that is, ignoring GetSheet and GetRange), I think something that maps style to character range is sensible. So a style definition would look something like:

{
    bold: true,
    font: 'Arial',
    size: 16,
    fg_color: '#000000',
    bg_color: '#ffffff',
   ...
}

The way this would be mapped to character ranges by way of beginning index. There can be no overlaps I imagine, so there would be an offset-to-style map. If the entire cell has a single style, there would be only one mapping, from 0. I don't know if I'd break the style to multiple functions... I'd probably treat it the same way as jQuery treats CSS - simple get/set.

Makes sense...?

@SheetJSDev
Copy link
Contributor

If I understood correctly, what you are describing is similar to text run concept that XLSB uses: http://msdn.microsoft.com/en-us/library/dd952238.aspx. Each "run" is defined implicitly by specifying the start index and the style info.

Fortunately, translating from the XLSX rich text format to XLSB style is easy. The hard part, of course, is going to/from HTML (in particular, converting potentially messy HTML back to the rich text)

you might want to have SheetJS support everything Excel can

That is the goal :) It's clear that completely different groups of people designed the various formats (as an example, some formats use "Creator" to refer to the author while others use "Author"), so part of the challenge is trying to make sense of inconsistent standards and approaches.

@bmavity
Copy link

bmavity commented Aug 20, 2014

Have the line numbers changed for the above workaround? It may be the coffee not kicking in yet, but I am having trouble finding a function that has both a p variable and cell variable in the 67_wsxml.js file.

I fully understand the desire to avoid hacks and have a common format, but maybe you could have a temporary hack branch with all these one liners in place that people could clone and use while you work it out. I can send a pull request if you walk me through the setup today.

@SheetJSDev
Copy link
Contributor

@bmavity the line numbers and variable names changed. The new location was https://github.com/SheetJS/js-xlsx/blob/master/bits/67_wsxml.js#L231-L232 and the new name for the cell variable is tag (The old code was being deoptimized in v8 due to variable polymorphism, so they were renamed)

@bmavity
Copy link

bmavity commented Aug 20, 2014

Great, thanks. I'll see what I can do. :)

@clholzin
Copy link

Is there any new development for keeping raw template styles once file is written?

@SheetJSDev
Copy link
Contributor

We offer this in the Pro compendium. Since companies have paid for the features already, it would be unfair to them if we turned around and made it available as open source. We have a longer comment in a gist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants