An XML to DOM style parser; parsing elements and attributes into CLOS objects.

LiamHowley/stw-xml-parse

STW-XML-PARSE

Introduction

STW-XML-PARSE is yet another XML parser. Sitting somewhere between XMLisp and Plump, it aims to be both performant and efficient while mapping XML elements to corresponding class instances and attributes to slots. Using the MOP, a metaclass ELEMENT-CLASS is defined, with its concomitant slots of type XML-DIRECT-SLOT-DEFINITION.

To Load

Load the system with Quicklisp: (ql:quickload :stw-xml-parse). If it is not found in the Quicklisp repositories, add it to your local-projects directory.

Basic Model

Document Node

When parsing a document a root node of type DOCUMENT-NODE is created and returned, containing the slots: CHILD-NODES, DOCUMENT, and FILE.

Dom Nodes

All nodes within a document are of the type DOM-NODE. Some are also of the type ELEMENT-NODE, a subclass of DOM-NODE, whilst others (e.g. TEXT-NODE) are not. All dom nodes have the slot PARENT-NODE.

Element Nodes

The class ELEMENT-NODE has the subclasses BRANCH-NODE, LEAF-NODE, and CONTENT-NODE. All dom nodes have the slot PARENT-NODE, but only branch nodes have the slot CHILD-NODES.

Leaf Nodes

Nodes of the type LEAF-NODE correspond to the void elements of xml, containing attributes but no content. They typically appear in the form <element />. As such they have no child-nodes.

Branch Nodes

Any element that has or can have child nodes is a BRANCH-NODE. All branch nodes have the slot CHILD-NODES, which is a list.

Attribute Nodes

Attribute nodes are somewhat ironically named: their slots may correspond to elements instead of attributes. When a node inherits from ATTRIBUTE-NODE, parsing proceeds by looking up the slot on the PARENT-NODE whose name matches the name of the child class; the child is then assigned to that slot.

Content Nodes

Content nodes have content that is read but not evaluated (e.g. the <script> or <title> elements of HTML).

Text Nodes and Whitespace Nodes

Text nodes are decoded during parsing and encoded during printing. Whether whitespace is stored is controlled by the boolean PRESERVE-WHITESPACE.

SGML Nodes

All nodes of type SGML-NODE, including !DOCTYPE, !--, ![CDATA[, etc., are read without being evaluated.

Defining a Class

To define a node, use the macro DEFINE-ELEMENT-NODE, which returns an instance of ELEMENT-CLASS. Unless specified otherwise, all classes of type ELEMENT-CLASS are assumed to be subclasses of BRANCH-NODE.

(define-element-node div ()
  (id
   (html-class :attribute "class" :initarg :class :initform nil :reader html-class)))
=> #<ELEMENT-CLASS XML.PARSE::DIV>

(define-element-node a ()
  (href target))
=> #<ELEMENT-CLASS XML.PARSE::A>

Class Slots

element-class

The metaclass ELEMENT-CLASS provides a number of slots that apply, or can be applied, to a node. First and foremost is the slot ELEMENT, which, with the reader CLASS->ELEMENT, maps the class to an element. Conversely, elements are mapped to classes in the hash table ELEMENT-CLASS-MAP.

accepted-case

As XML elements are case sensitive, the slot ACCEPTED-CASE is provided. It accepts the keywords :upper, :lower and :any (default :any).

slot-index

All attributes are indexed in a stateful character-based trie, imported from stw-utils. The reader attribute->slot returns the slot; however, for expedience and performance, a positive search of the SLOT-INDEX trie returns a list of the slot, the slot name, the slot type, the attribute, and the length of the attribute name.
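To make the lookup concrete, here is a minimal, standalone sketch of a character-based trie keyed on attribute names. It is not the stw-utils implementation and it is not stateful in the way the text above describes; every name in it (TRIE-NODE, TRIE-INSERT, TRIE-FIND, *SKETCH-INDEX*) is hypothetical, and the stored payload is only a loose imitation of the list described above.

```lisp
;; Standalone illustration only; these names do not come from stw-utils.
(defstruct (trie-node (:conc-name tn-))
  (children (make-hash-table :test #'eql)) ; character -> child TRIE-NODE
  (value nil))                             ; payload stored at a complete key

(defun trie-insert (root key value)
  "Store VALUE under the string KEY, creating intermediate nodes as needed."
  (let ((node root))
    (loop for ch across key
          do (setf node
                   (or (gethash ch (tn-children node))
                       (setf (gethash ch (tn-children node))
                             (make-trie-node)))))
    (setf (tn-value node) value)))

(defun trie-find (root key)
  "Return the payload stored under KEY, or NIL when KEY is absent."
  (let ((node root))
    (loop for ch across key
          do (setf node (gethash ch (tn-children node)))
             (unless node
               (return-from trie-find nil)))
    (tn-value node)))

;; Usage: index two attribute names with a (slot-name attribute length) payload.
(defparameter *sketch-index* (make-trie-node))
(trie-insert *sketch-index* "class" (list 'html-class "class" 5))
(trie-insert *sketch-index* "href" (list 'href "href" 4))
```

A successful search returns everything stored for the attribute in one step, which is the point of keeping a list rather than only the slot at each terminal node.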

permitted-content and permitted-parents

Both the slots PERMITTED-CONTENT and PERMITTED-PARENTS are provided as a means to tie strict conformance rules to an XML document parser/printer, and to provide the necessary data to produce a DTD. However, at present, neither the parser nor the printer enforces these rules.

XML-DIRECT-SLOT-DEFINITION Slots

attribute

Use the :attribute initarg to specify the attribute a slot maps to. As a string, it can represent a particular case, a package, or an entirely unrelated name, thus avoiding conflicts with Common Lisp symbols, amongst other things. If an attribute is not assigned, the name of the slot is parsed and the accepted-case determined. The resulting attribute name is stored in the class slot SLOT-INDEX for quick retrieval.

expected-value

The initarg :expected-value allows the specification of accepted results. Useful for conformance checking and DTD formation.

status

Catch or flag deprecated or obsolete elements. A keyword, it defaults to :active. Useful, once again, for conformance, or for document upgrading.

Parsing

Parsing a file or string is straightforward:

(parse-document <pathname>) or

(parse-document <string>)

Parsing occurs via the class initialization function INITIALIZE-INSTANCE, and uses functions such as READ-CONTENT, READ-ATTRIBUTES, and READ-SUBELEMENTS. Each relies on the reader functions READ-UNTIL or READ-AND-DECODE, as provided by the lexer in STW-UTILS.

To parse, simply call the function PARSE-DOCUMENT with either a file path designator or string.

(parse-document #P"<document-pathname>")
=> #<DOCUMENT-NODE {1001FF8763}>

PARSE-DOCUMENT accepts the optional arguments PARSER, PRESERVE-WHITESPACE and ELEMENT-CLASS-MAP, which default to #'READ-ELEMENT, NIL, and ELEMENT-CLASS-MAP, respectively.

Binding Nodes

When a child element is encountered, the method BIND-CHILD-NODE is invoked, accepting both the parent node and the child node as arguments. At its most basic, the parent node is set as the child node's PARENT-NODE, and the child node is pushed onto the CHILD-NODES slot of the parent node. More complex interactions can be readily devised with method specialization.
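The binding behaviour described above can be sketched in plain CLOS. This is a standalone illustration under stated assumptions, not the library's BIND-CHILD-NODE: the classes SKETCH-NODE, SKETCH-LIST and SKETCH-ITEM, the generic function BIND-CHILD, and the rule that a list accepts only items are all hypothetical.

```lisp
;; Standalone illustration only; none of these definitions come from STW-XML-PARSE.
(defclass sketch-node ()
  ((parent-node :initform nil :accessor parent-node)
   (child-nodes :initform nil :accessor child-nodes)))

(defclass sketch-list (sketch-node) ())
(defclass sketch-item (sketch-node) ())

(defgeneric bind-child (parent child)
  (:documentation "Attach CHILD to PARENT and return CHILD."))

;; Basic behaviour: record the parent, then push the child onto the parent's list.
(defmethod bind-child ((parent sketch-node) (child sketch-node))
  (setf (parent-node child) parent)
  (push child (child-nodes parent))
  child)

;; Specialization: a SKETCH-LIST refuses anything that is not a SKETCH-ITEM,
;; then defers to the basic method above.
(defmethod bind-child ((parent sketch-list) (child sketch-node))
  (unless (typep child 'sketch-item)
    (error "~S only accepts SKETCH-ITEM children." parent))
  (call-next-method))
```

Because the child is pushed, the most recently bound child sits at the head of the list in this sketch; a parser built this way would need to reverse the list once an element is closed if document order matters.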

Modes, Conditions and Restarts

Modes

The special variable *mode* accepts three keyword settings, which correspond to three courses of action:

  1. :verbose => 'warn
  2. :strict => 'error
  3. :silent => NIL
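The mapping in the list above can be sketched as a small dispatch on the mode keyword. This is a standalone illustration, not the library's code: *SKETCH-MODE*, MODE-ACTION and REPORT-PROBLEM are hypothetical names, and only the three keywords and their actions are taken from the list.

```lisp
;; Standalone illustration only; the library's own variable and helpers may differ.
(defvar *sketch-mode* :silent
  "One of :VERBOSE, :STRICT or :SILENT.")

(defun mode-action (mode)
  "Map a mode keyword to the function used to report a problem, or NIL."
  (ecase mode
    (:verbose 'warn)
    (:strict 'error)
    (:silent nil)))

(defun report-problem (format-control &rest args)
  "Warn, signal an error, or do nothing, depending on *SKETCH-MODE*."
  (let ((action (mode-action *sketch-mode*)))
    (when action
      (apply action format-control args))))
```

Rebinding the variable with LET, for example (let ((*sketch-mode* :strict)) ...), changes the behaviour for the dynamic extent of a single parse without touching the global default.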

Three conditions are provided:

CLASS-NOT-FOUND-ERROR

When there is no class associated with an element, the error CLASS-NOT-FOUND-ERROR is signalled. The options then are to enter the debugger or to handle the error by invoking the restart ASSIGN-GENERIC-NODE. When *mode* is :SILENT or :VERBOSE this process is handled automatically, albeit in the latter case with a warning printed to *STANDARD-OUTPUT*.

SLOT-NOT-FOUND-ERROR

In the event of an attribute having no slot, two restarts are made available: ASSIGN-SLOT-TO-ATTRIBUTE and IGNORE-MISSING-SLOT. This error is not handled automatically and will land in the debugger unless a handler is established.

MULTIPLE-VALUE-ERROR

XML attributes cannot contain multiple values. When multiple values are encountered, the error MULTIPLE-VALUE-ERROR is signalled. The restarts USE-FIRST-FOUND-VALUE and IGNORE-ATTRIBUTE are provided.

ASSIGN-GENERIC-NODE

When a CLASS-NOT-FOUND-ERROR is invoked, the restart ASSIGN-GENERIC-NODE is made available. When invoked, parsing continues, but with type GENERIC-NODE.
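The interplay between an error and a restart of this kind follows the standard Common Lisp pattern of RESTART-CASE at the signalling site and HANDLER-BIND plus INVOKE-RESTART in the caller. The sketch below is standalone and only imitates the shape: the condition SKETCH-CLASS-NOT-FOUND and the functions SKETCH-LOOKUP and LOOKUP-LENIENTLY are hypothetical, and the restart symbol here lives in the current package rather than in the library's.

```lisp
;; Standalone illustration only; the condition and functions are not the library's.
(define-condition sketch-class-not-found (error)
  ((element :initarg :element :reader missing-element))
  (:report (lambda (condition stream)
             (format stream "No class is associated with element ~S."
                     (missing-element condition)))))

(defun sketch-lookup (name known-names)
  "Return NAME when it is known; otherwise signal an error, offering a restart."
  (restart-case
      (if (member name known-names :test #'string=)
          name
          (error 'sketch-class-not-found :element name))
    (assign-generic-node ()
      :report "Continue, treating the element as a generic node."
      (list :generic-node name))))

(defun lookup-leniently (name known-names)
  "Look NAME up, falling back to the generic-node restart instead of the debugger."
  (handler-bind ((sketch-class-not-found
                   (lambda (condition)
                     (declare (ignore condition))
                     (invoke-restart 'assign-generic-node))))
    (sketch-lookup name known-names)))
```

With the real library, the handler would name the library's own condition and restart symbols; the automatic handling under the :SILENT and :VERBOSE modes presumably amounts to a handler of this shape being established for you.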

ASSIGN-TEXT-NODE

When a CLASS-NOT-FOUND-ERROR is invoked, the restart ASSIGN-TEXT-NODE is made available. When invoked, parsing continues, but with type TEXT-NODE.

IGNORE-NODE

Another option for handling a CLASS-NOT-FOUND-ERROR is the restart IGNORE-NODE, which will consume the element tag until the specified closing character, and push the element-name to the STRAY-TAGS list, so it can be caught during parsing.

IGNORE-MISSING-SLOT

Advances past both the attribute name and value and continues onwards.

ASSIGN-SLOT-TO-ATTRIBUTE

An interactive restart. Provide the slot definition name for an existing slot in the relevant class. The restart will bind the attribute to the slot in the SLOT-INDEX trie of the class.

USE-FIRST-FOUND-VALUE

A restart that assigns the first value of a multiple-value attribute to the slot and skips over the rest.

IGNORE-ATTRIBUTE

A restart that results in an attribute being entirely ignored when multiple values are encountered.

Reading and Printing

Alongside standard parsing, invoking the function (set-reader) will create an altered readtable and bind it to the global variable *READTABLE*. SET-READER has an optional READER argument, a function (default #'READ-XML), which is bound to the character #\<.

When (get-macro-character #\<) returns true, print-object calls the method serialize-object, thus printing an xml representation of a class and its children. By default the boolean PRINT-CHILDNODES => T. By setting it to nil, the representation is truncated.

Now when inspecting <a href='/some-url'>url</a> we see:

#<XML.PARSE:DOCUMENT-NODE {1003EAC093}>
--------------------
Class: #<STANDARD-CLASS XML.PARSE:DOCUMENT-NODE>
--------------------
 Group slots by inheritance [ ]
 Sort slots alphabetically  [X]

All Slots:
[ ]  CHILD-NODES = (#<XML.TEST::A {1003EAC183}>)
[ ]  DOCUMENT    = "<a href='/some-url'>url</a>
"
[ ]  FILE        = NIL

Similarly:

(make-instance 'a :href "/some-url" :child-nodes (make-instance 'text-node :text "url"))
 => <a href='/some-url'>url</a>

To close the reader and return to the initial *READTABLE*, call (remove-reader).

(make-instance 'a :href "/some-url" :child-nodes (make-instance 'text-node :text "url"))
 => #<A {1004547EC3}>

Query Functions

clone-node

Accepts a node. Returns a unique copy. Uses the read feature of STW-XML-PARSE to write and then read the node back in. All non-constant values are distinct from the original.

find-ancestor-node

Accepts a node, ancestor (type), and limiting node (type). Returns the first node that matches the type of the ancestor node.

walk-tree

Recursively walk a tree. If the predicate is matched, collect the node. Accepts an optional from-end argument.
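As a rough picture of what a recursive, collecting walk looks like, here is a standalone sketch over a toy node structure. It is not the library's WALK-TREE and does not model the from-end option; SKETCH-EL, SKETCH-WALK and *SKETCH-TREE* are hypothetical names.

```lisp
;; Standalone illustration only; the library's WALK-TREE has its own signature.
(defstruct (sketch-el (:conc-name el-))
  name
  (children nil))

(defun sketch-walk (node predicate)
  "Walk NODE depth-first in document order, collecting nodes that satisfy PREDICATE."
  (let ((found '()))
    (labels ((visit (n)
               (when (funcall predicate n)
                 (push n found))
               (dolist (child (el-children n))
                 (visit child))))
      (visit node))
    (nreverse found)))

;; Usage: a <ul> with two <li> children, the second containing an <a>.
(defparameter *sketch-tree*
  (make-sketch-el
   :name "ul"
   :children (list (make-sketch-el :name "li")
                   (make-sketch-el
                    :name "li"
                    :children (list (make-sketch-el :name "a"))))))

(defun sketch-names (tag)
  "Names of all nodes in *SKETCH-TREE* whose name equals TAG."
  (mapcar #'el-name
          (sketch-walk *sketch-tree*
                       (lambda (n) (string= (el-name n) tag)))))
```

Functions such as query-select-all or get-elements-by-tagname can be thought of as this kind of walk with a particular predicate supplied.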

retrieve-comments

Return all comments.

retrieve-text-nodes

Return all text nodes for node, with optional filter. Filter if supplied must be a function that accepts one text-node as an argument.

retrieve-text-nodes-from-parents

Return all text-nodes for node where the parent-node is of a type specified in the argument parents.

retrieve-text-nodes-with-token

Return all text nodes containing token.

retrieve-text-nodes-with-tokens

Return all text nodes containing any of the specified tokens (&rest tokens).

retrieve-text-nodes-with-all-tokens

Return all text nodes containing all tokens (&rest tokens).

get-elements-by-tagname

Return all elements with the specified tagname.

get-element-with-attribute

Return the first element that contains attribute.

get-element-with-attributes

Return the first element that contains all attributes (&rest attributes).

get-elements-with-attribute

Return all elements that contain attribute.

get-elements-with-attributes

Return all elements that contain all attributes.

get-element-with-attribute-value

Return the first element that contains the attribute and any one of attribute-values (&rest attribute-values).

get-element-with-attribute-values

Return the first element that contains the attribute and each of attribute-values (&rest attribute-values).

get-elements-with-attribute-value

Return all elements that contain the attribute and any of attribute-values (&rest attribute-values).

get-elements-with-attribute-values

Return all elements that contain the attribute and all of attribute-values (&rest attribute-values).

get-next-sibling

Return the next element in nodelist.

get-previous-sibling

Return the previous element in nodelist.

query-select

Given a starting-node and a predicate, query-select returns the first node that matches.

query-select-all

Given a starting-node and a predicate, query-select-all returns all matching nodes.

remove-node

Remove node from nodelist.

add-node

Add node to nodelist.

insert-before

Insert node-to-insert before node in nodelist.

insert-after

Insert node-to-insert after node in nodelist.

first-child

Return first child in node.

last-child

Return last child in node.

first-of-type

Return first child of type in node.

last-of-type

Return last child of type in node.

To Do

  • Provide optional rule checks on, e.g., PERMITTED-PARENTS or PERMITTED-CONTENT, to ensure conformance.
  • Create DTD parser / printer.
