Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for parsing Comments #43

Merged
merged 5 commits into from
Jan 22, 2014
Merged

Conversation

kinwahlai
Copy link

Thanks for a great library!
Here is my best attempt at contributing support to parse comments.
I used an existing file already in the project for the regressions tests.
I realize I coded the changes directly in the built xlsx.
Let me know if you want me to change something.

Thanks!

Comments parts listed in the [Content Types] are parsed.
Sheets's relationships are parsed.
Comments parts are correlated to their corresponding sheets parts.
Comments's contents are added to the ref'ed cells.
Rich text styling properties are currently ignored.

For example:

{
  "!ref": "A1:B3",
  "A1": {
    "v": 1,
    "t": "n"
  },
  "B1": {
    "v": "one",
    "t": "s",
    "r": "one",
    "c": [
      { "a": "Yegor Kozlov",
       "t": [ "Yegor Kozlov:",
              "\r\nfirst cell" ]
      }
    ]
  }
}

@SheetJSDev
Copy link
Contributor

The test failed because I moved the POI test cases to a separate file. I pushed changes to the test_files repo and this repo's tests.lst -- can you pull the changes, make init then make test?

Comments parts listed in the [Content Types] are parsed.
Sheets's relationships are parsed.
Comments parts are correlated to their corresponding sheets parts.
Comments's contents are added to the ref'ed cells.
Rich text styling properties are currently ignored.

For example:
{
  "!ref": "A1:B3",
  "A1": {
    "v": 1,
    "t": "n"
  },
  "B1": {
    "v": "one",
    "t": "s",
    "r": "one",
    "c": [
      { "a": "Yegor Kozlov",
       "t": [ "Yegor Kozlov:",
              "\r\nfirst cell" ]
      }
    ]
  }
}
@kinwahlai
Copy link
Author

OK thanks for your attention. We rebased our branch on the top of your latest master.
We are working on putting the changes into the bits files to support the tests as expected by Travis.

@SheetJSDev
Copy link
Contributor

@kinwahlai I can reconcile that later. According to https://travis-ci.org/SheetJS/js-xlsx/jobs/17037467, Travis is complaining because the file SimpleWithComments.xlsx was renamed to apachepoi_SimpleWithComments.xlsx. The fix is in the test script:

var wb = XLSX.readFile('./test_files/apachepoi_SimpleWithComments.xlsx');

Update test xlsx file to apachepoi_SimpleWithComments.xlsx
@SheetJSDev
Copy link
Contributor

Hmm @KinWah @kinwahlai I thought I made a comment yesterday but that seemed to have disappeared ...

Can you take a look at https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx ? In the first sheet, there is a rich-text comment and the raw text is broken up. It may make more sense to concatenate rather than push the raw comments.

Also, do you need the rich text as well? There is some basic HTML conversion (search for 18.4.7 in the code) which really should be moved out of the shared string table function

    <comment ref="C1" authorId="0">
      <text>
        <r>
          <rPr>
            <b/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve">I </t>
        </r>
        <r>
          <rPr>
            <b/>
            <i/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>really</t>
        </r>
        <r>
          <rPr>
            <b/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve"> </t>
        </r>
        <r>
          <rPr>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve">hope that </t>
        </r>
        <r>
          <rPr>
            <i/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>xlsx</t>
        </r>
        <r>
          <rPr>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve"> </t>
        </r>
        <r>
          <rPr>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>decides not</t>
        </r>
        <r>
          <rPr>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve"> </t>
        </r>
        <r>
          <rPr>
            <i/>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t xml:space="preserve">to use </t>
        </r>
        <r>
          <rPr>
            <b/>
            <i/>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>magic</t>
        </r>
        <r>
          <rPr>
            <b/>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t xml:space="preserve"> like rPr</t>
        </r>
      </text>
    </comment>

@kinwahlai
Copy link
Author

Hi @SheetJSDev; well in our usecase we need to know the internal structure of the comments. And we are not interested by the styling as html.
For example: to find out that a google-spreadsheet's note matches a comment we make sure we see single rich text in the comment. We put some JSON in some of those notes and we dont want the html styling to appear in there.
When we see multiple rich texts we discard them as they are not notes.

In fact we have the same requirement with number formats: we need to know about the number-formats as defined in the spreadsheet and we are not interested in the formatting of the raw values for them.

For the sake of consistency with the rest of the code we can certainly help with doing to the comments the same type of things than what happens for the other rich texts.

What do you think?

@SheetJSDev
Copy link
Contributor

@kinwahlai my point was according to the spec, the text tags don't have author information. It's at the outer comment tag level (see my annotations below):

<comment ref="C1" authorId="0">   <-- AUTHOR IS IDENTIFIED HERE
      <text>  <--- this element does not have the author id
        <r> <-- this element does not have the author id
          <rPr>
             ...
          </rPr>
          <t xml:space="preserve">I </t> <-- this element does not have the author id
        </r>

Hence I would think it makes sense to combine the t tags. I have not played with google docs much, so I don't know how they do it (and my comment may not make sense for gdocs). Can you share an example file with comments from multiple users?


Also, Excel generates an empty tag for comments with no text. See the comments_stress_test.xlsx file (https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx):

<comment ref="B2" authorId="0"><text/>

You need to guard for the possibility that x.match(/<text>([^\u2603]*)<\/text>/m) is null

@hmalphettes
Copy link
Contributor

Hi @SheetJSDev, @kinwahlai and I are pairing on this code.
The example sheet we were looking at in the openxml spec had 2 rich texts; I mistook it for the author and body of the comment.
Thanks for clarifying this. In fact we are parsing the author as expected.

Thanks for noting the bug regarding the potentially null text element that would crash the parser.

OK to concatenate the comments as a first step and to pull out the code that applies the comments styles as html styles. It does not really match our usecase and I think it would make our life harder if we were trying to generate xlsx files from the generated model.
But we don't know better than that for now at the moment ;-)

We will add a few more commits to that branch for your kind review then.

@SheetJSDev
Copy link
Contributor

@kinwahlai @hmalphettes Keep in mind that the goal here is to find a solution that makes "sense". There is no standard solution here, which is why discussion is good :)

If you use the Excel UI "Insert Comment" option, it inserts the author name followed by a colon (in bold). However, since that is actual content in the comment, you can change it or remove it (and that's how you generate the empty comment in the sample sheet).

I think it would make our life harder if we were trying to generate xlsx files from the generated model.

Google docs does some wild stuff with the comments. A friend and I tried to make individual comments as well as comment "conversation", and apparently gdocs is not writing author information using the author tag (authors are identified using a dash followed by the name):

    <comment ref="A1" authorId="0">
      <text>
        <t xml:space="preserve">Foo bar

baz qux
    -Sheet JS
This is a conversation
    -Dev SheetJS</t>
      </text>
    </comment>

screenshot

complete comments.xml from exported XLSX:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <authors>
    <author/>
  </authors>
  <commentList>
    <comment ref="A2" authorId="0">
      <text>
        <t xml:space="preserve">5 6 7 8
    -Sheet JS</t>
      </text>
    </comment>
    <comment ref="A2" authorId="0">
      <text>
        <t xml:space="preserve">1 2 3 4
    -Dev SheetJS</t>
      </text>
    </comment>
    <comment ref="A1" authorId="0">
      <text>
        <t xml:space="preserve">Foo bar

baz qux
    -Sheet JS
This is a conversation
    -Dev SheetJS</t>
      </text>
    </comment>
  </commentList>
</comments>

@hmalphettes
Copy link
Contributor

@SheetJSDev thanks for trying all these examples.
And thanks for reminding us of the bigger cause.

In our narrow usecase what we wanted to access were google's 'notes'
Once saved as an xlsx, they show up like this:

<?xml version="1.0" encoding="UTF-8"?>
<comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <authors>
    <author />
  </authors>
  <commentList>
    <comment ref="A1" authorId="0">
      <text>
        <t xml:space="preserve">{stc_datasource:'c3b00170-642b-11e3-949a-0800200c9a66',stc_object:'stc_application'}</t>
      </text>
    </comment>
  </commentList>
</comments>

The simplest case of them all so far.

@SheetJSDev
Copy link
Contributor

@hmalphettes @kinwahlai Playing with gdocs a bit more:

  1. google docs comments and notes are plaintext. The UI is a text field and the generated xml only uses t tags

  2. each comment has only one text tag. I can't seem to generate a case where gdocs generates a comment with multiple text tag children.

  3. Google treats conversations and comments differently (notes and comments are separated with ---- when a note is present, but do not appear in a cell with only comments):

<!-- with note -->
<t xml:space="preserve">note with comment test

----
5 6 7 8
    -Sheet JS
----
1 2 3 4
    -Dev SheetJS</t>

<!-- without note -->
<t xml:space="preserve">Foo bar

baz qux
    -Sheet JS
This is a conversation
    -Dev SheetJS</t>
  1. As a result, you probably will need to do post-processing in your app, and I think the concatenation comment still applies for the individual t tags within a given text tag.

@hmalphettes
Copy link
Contributor

OK @SheetJSDev thanks again for looking into more spreadsheets. I am convinced.

I have pulled up the code that parses the rich text for sst and reused it to parse the comments.
It does the concatenation and passes the raw values along.
That would work fine for our use case.

Here is what the comment of the test xlsx looks like now:

{ a: 'Yegor Kozlov',
  t: 'Yegor Kozlov:\r\nfirst cell',
  raw: '<r><rPr><b/><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/></rPr><t>Yegor Kozlov:</t></r><r><rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/></rPr><t xml:space="preserve">\r\nfirst cell</t></r>',
  r: '<span style="font-weight: bold;">Yegor Kozlov:</span><span style=""><br/>first cell</span>' }

Let us know!

SheetJSDev added a commit that referenced this pull request Jan 22, 2014
@SheetJSDev SheetJSDev merged commit e8d14dd into SheetJS:master Jan 22, 2014
@SheetJSDev
Copy link
Contributor

@hmalphettes @kinwahlai Thanks for the patches! Now for the difficult part (figuring out how to do the same thing for https://github.com/SheetJS/js-xls)

@hmalphettes
Copy link
Contributor

Oh boy xls! Sorry we are putting this new feature on your plate.
Cheers!
Hugues

@SheetJS SheetJS locked and limited conversation to collaborators Feb 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants