This repository has been archived by the owner on Aug 22, 2023. It is now read-only.

To emit valid XML saxpath needs to correctly escape XML entities in attribute values #15

Closed
LeDominik opened this issue Sep 25, 2015 · 3 comments

Comments

@LeDominik

Given XML in the following style (as provided by the Stack Exchange archives):

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="3" PostTypeId="1" AcceptedAnswerId="6" CreationDate="2011-07-12T18:44:18.650"
  Score="33" ViewCount="6210" 
  Body="&lt;p&gt;Why do we use a permutation table in the first step of &lt;a href=&quot;http://en.wikipedia.org/wiki/Data_Encryption_Standard&quot;&gt;DES algorithm&lt;/a&gt; and one at the end of algorithm?&lt;/p&gt;&#xA;" 
  OwnerUserId="12" LastEditorUserId="10" LastEditDate="2011-09-27T21:21:58.213" 
  LastActivityDate="2012-06-13T16:41:23.283" 
  Title="What are the benefits of the two permutation tables in DES?" 
  Tags="&lt;block-cipher&gt;&lt;des&gt;&lt;permutation&gt;" 
  AnswerCount="3" CommentCount="0" FavoriteCount="8" />

  <row [...]

Matching on /posts/row works fine. However, sax decodes the entities in all attribute values (such as the long Body here) before handing the events over to saxpath's recorder, so the recorder's string rewriting produces invalid (not well-formed) XML. I monkey-patched it with the following custom recorder; the five predefined XML 1.0 entities suffice to produce valid XML:

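var util = require('util');

// XMLrecorder is the saxpath recorder being patched (its import is not shown in this snippet).
var MyRecorder = function () { XMLrecorder.call(this); };
util.inherits(MyRecorder, XMLrecorder);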

function escapeXml(unsafe) {
    return unsafe.replace(/[<>&'"]/g, function (c) {
        switch (c) {
            case '<': return '&lt;';
            case '>': return '&gt;';
            case '&': return '&amp;';
            case '\'': return '&apos;';
            case '"': return '&quot;';
        }
    });
}

MyRecorder.prototype.onOpenTag = function(node) {
    var id;
    var attribute;
    for (id in this.streams) {
        if (this.streams.hasOwnProperty(id)) {
            this.streams[id] += '<' + node.name;
            for (attribute in node.attributes) {
                this.streams[id] += ' ' + attribute;
                this.streams[id] += '="' + escapeXml(node.attributes[attribute]) + '"';
            }
            this.streams[id] += '>';
        }
    }
};

var streamer = new saxpath.SaXPath(saxParser, '/posts/row', new MyRecorder());
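
To make the failure mode concrete, here is a small self-contained sketch, independent of sax and saxpath, of what happens to the decoded Tags value from the sample row when it is written back with and without escaping. `serializeAttribute` is a hypothetical helper used only for illustration; it is not part of either library.

```javascript
// Escape the five predefined XML entities (same idea as escapeXml above).
function escapeXml(unsafe) {
    var entities = { '<': '&lt;', '>': '&gt;', '&': '&amp;', '\'': '&apos;', '"': '&quot;' };
    return unsafe.replace(/[<>&'"]/g, function (c) { return entities[c]; });
}

// Hypothetical helper: serialize one attribute, optionally escaping its value.
function serializeAttribute(name, value, escape) {
    return ' ' + name + '="' + (escape ? escapeXml(value) : value) + '"';
}

// The value as the parser reports it after decoding Tags="&lt;block-cipher&gt;...".
var decodedTags = '<block-cipher><des><permutation>';

var naive = '<row' + serializeAttribute('Tags', decodedTags, false) + '/>';
var escaped = '<row' + serializeAttribute('Tags', decodedTags, true) + '/>';

console.log(naive);   // raw '<' inside an attribute value: not well-formed
console.log(escaped); // entities restored: well-formed again
```

The naive string contains a literal `<` inside the attribute value, which an XML parser must reject; the escaped string restores the entities from the source document.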

I think sax is behaving correctly here, and IMHO text nodes need the same escaping when they are written back out 😄
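
For text nodes the rules are slightly narrower than for attribute values: `&` and `<` must always be escaped, while `>` only has to be escaped when it follows `]]`, and quotes can stay as they are. A minimal sketch, assuming a hypothetical `escapeXmlText` helper (not part of saxpath) that escapes every `>` for simplicity:

```javascript
// Hypothetical helper: escape character data for use inside an element.
// Quotes are left alone because they are harmless in text content.
function escapeXmlText(text) {
    return text.replace(/[<>&]/g, function (c) {
        switch (c) {
            case '<': return '&lt;';
            case '>': return '&gt;';
            default: return '&amp;'; // the only remaining match is '&'
        }
    });
}

var sample = 'a < b && c ]]> d "quoted"';
var escapedText = escapeXmlText(sample);
console.log(escapedText);
```

A recorder's text handler would call something like this before appending to its streams; the exact handler name depends on the recorder's interface, which this thread does not show.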

Any thoughts?

@StevenLooman
Owner

Thank you for reporting this. I'll look at it shortly.

@StevenLooman
Owner

0.6.3 was released with the fix.

@LeDominik
Author

Works like a charm 👍
