STW-XML-PARSE is yet another xml parser. Somewhere between XMLisp and Plump, it aims to be both performant and efficient while mapping xml elements to corresponding class instances and attributes to slots. Using the MOP, a metaclass ELEMENT-CLASS is defined, with its concommitant slots of type XML-DIRECT-SLOT DEFINITION.
Use quickload to call (ql:quickload :stw-xml-parse). If not found in the quickload repositories, add to your local-projects directory.
When parsing a document a root node of type DOCUMENT-NODE is created and returned, containing the slots: CHILD-NODES, DOCUMENT, and FILE.
All nodes within a document are of the type DOM-NODE. Some are also of the type ELEMENT-NODE, a sub-class of DOM-NODE, whilst others, (e.g. TEXT-NODE) are not. All dom nodes have the slot PARENT-NODE.
The class ELEMENT-NODE has the subclasses, BRANCH-NODE, LEAF-NODE, and CONTENT-NODE. All dom nodes have the slot PARENT-NODE, but only branch nodes have the slot CHILD-NODES.
Nodes of the type LEAF-NODE correspond to the void elements of xml, containing attributes but no content. They typically appear in the form <element />. As such they have no child-nodes.
Any element that has or can have child nodes is a BRANCH-NODE. All branch nodes have the slot CHILD-NODES, which is a list.
The somewhat ironically named attribute nodes are so named, as their slots may correspond to elements instead of attributes. When a node inherits from ATTRIBUTE-NODE, parsing proceeds by seeking the slot on the PARENT-NODE with the same name as the name of the child class with which it will be assigned.
Content Nodes have content that is read but not evaluated, (e.g. the <script> or <title> elements of html).
Text nodes are decoded during parsing and encoded during printing. The option of whether to store whitespace is proferred, with the boolean PRESERVE-WHITESPACE.
All nodes of type SGML-NODE, including !DOCTYPE, !–, ![CDATA[, etc, are read without being evaluated.
To define a node use the macro DEFINE-ELEMENT-NODE, which returns an instance of ELEMENT-CLASS. Unless specified otherwise, all classes of type ELEMENT-CLASS are assumed sub classes of BRANCH-NODE.
(define-element-node div ()
(id
(html-class :attribute "class" :initarg :class :initform nil :reader html-class)))
=> #<ELEMENT-CLASS XML.PARSE::DIV>
(define-element-node a ()
(href target))
=>#<ELEMENT-CLASS XML.PARSE::A>The metaclass ELEMENT-CLASS provides a number of slots that apply, or can be applied, to a node. First and foremost, the slot ELEMENT which with the reader CLASS->ELEMENT maps the class to an element. Conversely, elements are mapped to classes in the hash table ELEMENT-CLASS-MAP.
As xml elements are case sensitive, the slot ACCEPTED-CASE is provided, and accepts :upper, :lower and :any keywords, (default :ANY).
All attributes are indexed in a stateful character based trie, imported from stw-utils. The reader attribute->slot will return the slot, however, for expedience and performance, a list of slot, slot name, slot type, the attribute, and length of attribute name, are all stored and returned from a positive search of the SLOT-INDEX trie.
Both the slots PERMITTED-CONTENT and PERMITTED-PARENTS are provided as a means to tie strict conformance rules to an XML document parser/printer, and to provide the necessary data to produce a DTD. However, at present, neither the parser nor the printer enforce these rules.
Use the :attribute initarg to specify the attribute a slot maps. A string, it can therefore represent the case, package or entirely unrelated name, thus avoiding conflicts with Common Lisp symbols, amongst other things. If an attribute is not assiged, the name of the slot is parsed and the accepted-case determined. The resulting attribute name is stored in the class slot SLOT-INDEX for quick retrieval.
The initarg :expected-value allows the specification of accepted results. Useful for conformance checking and DTD formation.
Catch or flag deprecated or obsolete elements. A keyword, it defaults to :active. Useful, once again, for conformance, or for document upgrading.
Parsing a file or string is straightforward:
(parse-document <pathname>)) or
(parse-document <string>))Parsing occurs via the class initialization function INITIALIZE-INSTANCE, and uses the various functions such as READ-CONTENT, READ-ATTRIBUTES, READ-SUBELEMENTS, etc. Each, relies on the reader functions, READ-UNTIL or READ-AND-DECODE, as provided by the lexer in STW-UTILS.
To parse, simply call the function PARSE-DOCUMENT with either a file path designator or string.
(parse-document #P"<document-pathname>")
=> #<DOCUMENT-NODE {1001FF8763}>PARSE-DOCUMENT contains the optional argument PARSER, PRESERVE-WHITESPACE and ELEMENT-CLASS-MAP, which each default to #’READ-ELEMENT, NIL, and ELEMENT-CLASS-MAP, respectively.
When a child element is encountered, the method BIND-CHILD-NODE is invoked, accepting both the parent node and child node as arguments. At its most basic parent node is set as the child node’s parent node, and the child node pushed to the CHILD-NODES slot of parent node. But more complex interactions can be readily devised with method specialization.
The special variable mode accepts three possible keyword settings, that correspond to three courses of action:
- :verbose => ‘warn
- :strict => ‘error
- :silent => NIL
Two conditions are provided:
When there is no class associated with an element the error CLASS-NOT-FOUND-ERROR is invoked. The options then are to enter the debugger or bind the error to the restart ASSIGN-GENERIC-NODE. When MODE is :SILENT or :VERBOSE this process is automatically handled, albeit in the latter case with a warning printed to STANDARD-OUTPUT.
In the event of an attribute having no slot, two restarts are made available, ASSIGN-SLOT-TO-ATTRIBUTE and IGNORE-MISSING-SLOT. This error is not handled automatically and will land in the debugger if not handled.
XML attributes cannot contain multiple values. When encountered the error MULTIPLE-VALUE-ERROR is thrown. The restarts USE-FIRST-FOUND-VALUE and IGNORE-ATTRIBUTE are provided.
When a CLASS-NOT-FOUND-ERROR is invoked, the restart ASSIGN-GENERIC-NODE is made available. When invoked, parsing continues, but with type GENERIC-NODE.
When a CLASS-NOT-FOUND-ERROR is invoked, the restart ASSIGN-TEXT-NODE is made available. When invoked, parsing continues, but with type TEXT-NODE.
Another option for handling a CLASS-NOT-FOUND-ERROR is the restart IGNORE-NODE, which will consume the element tag until the specified closing character, and push the element-name to the STRAY-TAGS list, so it can be caught during parsing.
Advances past both the attribute name and value and continues onwards.
An interactive restart. Provide the slot definition name for an existing slot in the relevant class. The restart will bind the attribute to slot in the SLOT-INDEX trie of the class.
A restart that results in the the first value in a multiple value attribute, assigned to a slot, while the rest are skipped over.
A restart that results in an attribute being entirely ignored when multiple values are encountered.
Alongside standard parsing, invoking the function (set-reader) will create an altered READTABLE and bind it to the global variable READTABLE. SET-READER has an optional READER argument, a function, (default #’READ-XML), which is bound to the character #\<.
When (get-macro-character #\<) returns true, print-object calls the method serialize-object, thus printing an xml representation of a class and its children. By default the boolean PRINT-CHILDNODES => T. By setting it to nil, the representation is truncated.
Now when inspecting <a href='/some-url'>url</a> we see:
#<XML.PARSE:DOCUMENT-NODE {1003EAC093}>
--------------------
Class: #<STANDARD-CLASS XML.PARSE:DOCUMENT-NODE>
--------------------
Group slots by inheritance [ ]
Sort slots alphabetically [X]
All Slots:
[ ] CHILD-NODES = (#<XML.TEST::A {1003EAC183}>)
[ ] DOCUMENT = "<a href='/some-url'>url</a>
"
[ ] FILE = NIL
Similarly:
(make-instance 'a :href "/some-url" :child-nodes (make-instance 'text-node :text "url"))
=> <a href='/some-url'>url</a>To close the reader and return to the initial READTABLE call (remove-reader).
(make-instance 'a :href "/some-url" :child-nodes (make-instance 'text-node :text "url"))
=> #<A {1004547EC3}>Accepts a node. Returns a unique copy. Uses the read feature of STW-XML-PARSE to write and then read the node back in. All non-constant values are distinct from the original.
Accepts a node, ancestor (type), and limiting node (type). Returns the first node that matches the type of the ancestor node.
Recursively walk a trie. If predicate is matched collect the node. With optional from-end.
Return all comments.
Return all text nodes for node, with optional filter. Filter if supplied must be a function that accepts one text-node as an argument.
Return all text-nodes for node where the parent-node is of a type specified in the argument parents.
Return all text nodes containing token.
Return all text nodes containing any of the specified tokens (&rest tokens).
Return all text nodes containing all tokens (&rest tokens).
Return all elements with the specified tagname.
Return the first element that contains attribute.
Return the first element that contains all attributes (&rest attributes).
Return all elements that contain attribute.
Return all elements that contain all attributes.
Return the first element that contains the attribute and one-of of attribute-values (&rest attribute-values).
Return the first element that contains the attribute and each of attribute-values (&rest attribute-values).
Return all elements that contain the attribute and any of attribute-values (&rest attribute-values).
Return all elements that contain the attribute and all of attribute-values (&rest attribute-values).
Return the next element in nodelist.
Return the previous element in nodelist.
Given a starting-node and a predicate, query-select returns the first node that matches.
Given a starting-node and a predicate, query-select-all returns all matching nodes.
Remove node from nodelist.
Add node to nodelist.
Insert node-to-insert before node in nodelist.
Insert node-to-insert after node in nodelist.
Return first child in node.
Return last child in node.
Return first child of type in node.
Return last child of type in node.
- Provide optional rule checks on, e.g. PERMITTED-PARENTS, or PERMITTED-CONTENTS, to ensure conformance.
- Create DTD parser / printer.