The Counting Robot
The "Counting Robot" is a simple tool for getting counts of phenomena in documents encoded with XML. This script creates tab-delimited reports from any selection of XML content, using an XPath you provide. These reports are easily read in any text editor or spreadsheet software.
The XQuery 'count-sets-library.xql' is a companion tool for querying and manipulating the reports generated by the Counting Robot. The XQuery provides a library of functions which put multiple Robot reports into conversation with each other using set theory.
The Robot in practice
The Women Writers Project has found the Counting Robot useful for exploration and research, as well as for everyday maintenance of our TEI-encoded texts.
We've used the Robot to answer questions such as:
- What is the highest number of page breaks in a paragraph?
- What are the most referenced titles in Women Writers Online?*
- What values of @type are we using on
- What non-English languages are used across WWO? How are they distributed across authors? Across centuries?*
- Which authors mention <title>s inside dramatic speech acts?*
*: For these examples and more, see Sarah Connell's "The Text is Variety" lecture notes.
Using the Robot
The Counting Robot is the XQuery script counting-robot.xq, located in this folder. It is meant as a personal workspace; you can customize it to suit your needs. That said, if you've cloned the wwp-public-code-share GitHub repository, you probably shouldn't change this file directly. Instead, make a copy to work in, so that you can still easily retrieve future changes to the original.
When working with the Robot, there are only three components that you'll need to pay close attention to:
- the variable(s) referencing your XML documents;
- the query that selects the values to count; and
- the XML namespace declarations.
In addition, there are two variables used to toggle specific behavior by the Robot:
XML document references
The XQuery script contains a declared variable called $DESCRIPTIVE_NAME, which is just a template for you to use when defining your XML corpora. You can create as many variables referencing XML documents as you like, and reference the one(s) you need in your query.
Here is an example section with multiple variables:
declare variable $blazingWorld := doc('file:/Users/ashleyclark/WWP/textbase/distribution/cavendish.blazing.xml');
declare variable $wwo := collection('file:///Users/ashleyclark/WWP/textbase/distribution?select=*.xml');
declare variable $wwo_ondeck := collection('file:///Users/ashleyclark/WWP/textbase/on_deck?select=*.xml');
declare variable $wwo_construction := collection('file:///Users/ashleyclark/WWP/textbase/under_construction?select=*.xml');
declare variable $wwo_all := ( $wwo | $wwo_ondeck | $wwo_construction );
Each line contains a uniquely named variable (e.g. $wwo), defined as an XML document, a collection of documents, or a union of documents/collections:
( $VARIABLE-1 | $VARIABLE-2 | doc('FILE-PATH') | collection('DIRECTORY-PATH') )
Paths can be relative to the XQuery, absolute, or URLs. Each space in a file or directory name should be replaced with its percent-encoded form "%20" (e.g. "hello world.xml" should be referenced as "hello%20world.xml").
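As a small illustration of the escaping rule above, the standard function encode-for-uri() can produce the "%20" form for a single file or directory name. The file name here is only an example. Note that encode-for-uri() also escapes slashes, so this sketch applies it to one path segment at a time rather than to a whole path.

```xquery
xquery version "3.0";

(: Example file name containing a space; not a real file. :)
let $fileName := 'hello world.xml'
(: encode-for-uri() percent-encodes the space. Apply it to a single
   path segment only, since it would also escape any "/" characters. :)
return encode-for-uri($fileName)
(: returns "hello%20world.xml" :)
```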
When the Counting Robot starts working, it counts the values stored in the variable $query. This will usually be an XPath or XQuery expression referencing one of the corpus variables you've defined.
For example, the default $query defined in the XQuery script is:
declare variable $query := $DESCRIPTIVE_NAME//text//title;
This translates to: "From the XML documents defined in the variable $DESCRIPTIVE_NAME, get all <title> elements that are descendants of <text>, and store them in a variable named $query."
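To make the idea of "counting the values of a query" concrete, here is a minimal, self-contained sketch. It is not the Counting Robot's actual code. The inline document, the whitespace normalization, and the "count, tab, value" row layout are all assumptions made for illustration.

```xquery
xquery version "3.0";

(: A tiny stand-in corpus in no namespace, for illustration only. :)
declare variable $sample :=
  <doc>
    <text>
      <p><title>Paradise Lost</title> and <title>Hamlet</title></p>
      <p><title>Paradise Lost</title></p>
    </text>
  </doc>;

(: Same shape as the default query: all <title> descendants of <text>. :)
declare variable $query := $sample//text//title;

(: Group identical values, count each group, and emit one
   tab-delimited row per distinct value, most frequent first. :)
for $item in $query
let $value := normalize-space($item)
group by $value
let $count := count($item)
order by $count descending, $value
return concat($count, '&#9;', $value)
(: returns two rows: "2<TAB>Paradise Lost" and "1<TAB>Hamlet" :)
```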
Namespaces are unique identifiers which specify the flavor of XML being used. For example, TEI is a flavor of XML created by the Text Encoding Initiative. It has a namespace of http://www.tei-c.org/ns/1.0, which is usually referred to as tei for short.
The TEI namespace declaration might look like this in an XML document:
<TEI xmlns="http://www.tei-c.org/ns/1.0"> <!-- ... --> </TEI>
Or it might look like this:
<tei:TEI xmlns:tei="http://www.tei-c.org/ns/1.0"> <!-- ... --> </tei:TEI>
The TEI namespace declaration looks like this in the XQuery script:
declare namespace tei="http://www.tei-c.org/ns/1.0";
With this last declaration present, you can use the namespace prefix tei in your $query XPath to refer to any element within the TEI namespace.
The Counting Robot defines the TEI and WWP namespace prefixes for you, but it also includes a special namespace declaration:
declare default element namespace "http://www.wwp.northeastern.edu/ns/textbase";
This tells the Robot to assume, whenever it sees an unprefixed element name in an XPath, that the element is in the WWP namespace. So, this XPath:
$query := $testDoc//text//title;
is functionally the same as this XPath:
$query := $testDoc//wwp:text//wwp:title;
Neither will match <title> elements that are in any other namespace, or in no namespace at all.
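The effect of the default element namespace can be checked with a small, self-contained query. This is only a sketch. The inline TEI fragment is made up for the test, and the namespace declarations mirror the ones described above.

```xquery
xquery version "3.0";

declare namespace tei="http://www.tei-c.org/ns/1.0";
declare namespace wwp="http://www.wwp.northeastern.edu/ns/textbase";
declare default element namespace "http://www.wwp.northeastern.edu/ns/textbase";

(: A made-up fragment whose elements are all in the TEI namespace. :)
let $teiDoc :=
  <tei:TEI>
    <tei:text>
      <tei:title>Example</tei:title>
    </tei:text>
  </tei:TEI>
return (
  (: Unprefixed names resolve to the WWP namespace, so nothing matches. :)
  count($teiDoc//text//title),
  (: Prefixed names point at the TEI namespace, so the title is found. :)
  count($teiDoc//tei:text//tei:title)
)
(: returns 0, 1 :)
```

If a query unexpectedly returns nothing, comparing the namespace of the documents with the default element namespace of the script is a good first check.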
You should feel free to change the default namespace and/or add namespace prefix declarations.
The Counting Robot uses XQuery version 3.0. Since it is assumed you'll be working extensively in the declared variables, the XQuery script has no external parameters to set. See our general documentation for details on setting up an XQuery transformation.
The set theory library
Unlike the Counting Robot, count-sets-library.xql is a library of XQuery functions, rather than a standalone XQuery. The functions can be used to combine and manipulate multiple reports from the Counting Robot:
ctab:get-union-of-reports( ($filenameA, $filenameB, $ETC) )
- the union of reports A through N in a sequence (including adding up the counts)
ctab:get-union-of-rows( ($rowA1, $rowB1, $ETC) )
- the union of all rows in a sequence (including adding up the counts)
ctab:get-intersection-of-reports( ($filenameA, $filenameB, $ETC) )
- the intersection of reports A through N, that is, only the data values which occur in every report (including adding up the counts)
- all data values in report(s) A where there isn't a corresponding value in report(s) B
- both A and B can be a sequence of filenames rather than a single string; the union of those sequences will be applied automatically
ctab:get-set-difference-of-rows( ($rowA1, $rowA2, $ETC), ($rowB1, $rowB2, $ETC) )
- all data values in the sequence of rows A where there isn't a corresponding value in sequence of rows B
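The set operations described above can be illustrated with a standalone sketch. This is not the library's implementation and does not call the ctab functions. The local: functions, the sample rows, and the "count, tab, value" row layout are assumptions made for illustration.

```xquery
xquery version "3.0";

declare variable $tab := '&#9;';

(: Made-up report rows, assumed to be "count TAB value". :)
declare variable $rowsA := ('3' || $tab || 'Paradise Lost', '1' || $tab || 'Hamlet');
declare variable $rowsB := ('2' || $tab || 'Hamlet', '5' || $tab || 'Odyssey');

(: Union: every value in any row, with counts of repeated values summed. :)
declare function local:union($rows as xs:string*) as xs:string* {
  for $row in $rows
  let $value := substring-after($row, $tab)
  group by $value
  let $total := sum($row ! xs:integer(substring-before(., $tab)))
  order by $total descending, $value
  return $total || $tab || $value
};

(: Intersection: only values present in both A and B, counts summed. :)
declare function local:intersection($a as xs:string*, $b as xs:string*) as xs:string* {
  let $valuesA := $a ! substring-after(., $tab)
  let $valuesB := $b ! substring-after(., $tab)
  return local:union(
    ($a, $b)[substring-after(., $tab) = $valuesA
             and substring-after(., $tab) = $valuesB]
  )
};

(: Set difference: rows of A whose value has no counterpart in B. :)
declare function local:difference($a as xs:string*, $b as xs:string*) as xs:string* {
  let $valuesB := $b ! substring-after(., $tab)
  return $a[not(substring-after(., $tab) = $valuesB)]
};

(
  local:union(($rowsA, $rowsB)),       (: 5 Odyssey, 3 Hamlet, 3 Paradise Lost :)
  local:intersection($rowsA, $rowsB),  (: 3 Hamlet :)
  local:difference($rowsA, $rowsB)     (: 3 Paradise Lost :)
)
```

In this sketch "Hamlet" appears in both sets of rows, so its counts (1 and 2) are added together in the union and the intersection, while the difference keeps the row from A unchanged.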