# Python XPath 2.0 with Elementpath

elementpath provides XPath 1.0 and XPath 2.0 selectors for Python's `ElementTree` and `lxml.etree` libraries.

We can use elementpath to move an `lxml` project from XPath 1.0 to XPath 2.0. This is a great help to existing projects that use `lxml`.

Usage is simple: import elementpath and call `elementpath.select()`. For a path that selects elements from an `lxml` tree, it returns a list of `lxml.etree._Element` objects.

**Prerequisite**

This article assumes you already know XPath 1.0.

# Installing

To install elementpath, just run the following pip command:

```bash
pip install elementpath
```

Now we can read in our `cd_catalog.xml` file (an example file from W3Schools) and test XPath 2.0.

You just replace `root.xpath(xpath)` with `elementpath.select(root, xpath)`. When the path selects elements, the items in the returned list are still `lxml.etree._Element` objects.

```python
import elementpath
from lxml import etree

# this file is downloaded from:
# https://www.w3schools.com/xml/cd_catalog.xml
with open("cd_catalog.xml", "rb") as f:
    xml_str = f.read()

root = etree.XML(xml_str)
e = elementpath.select(root, "CD[1]")
print(type(e[0]))
```

```
<class 'lxml.etree._Element'>
```


# Speed

```python
root.xpath("CD[position()<5]/TITLE/text()")
```

```
['Empire Burlesque', 'Hide your heart', 'Greatest Hits', 'Still got the blues']
```

```python
elementpath.select(root, "CD[position()<5]/TITLE/text()")
```

```
['Empire Burlesque', 'Hide your heart', 'Greatest Hits', 'Still got the blues']
```

```python
%timeit root.xpath("CD[position()<5]/TITLE/text()")
```

```
26.1 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

```python
%timeit elementpath.select(root, "CD[position()<5]/TITLE/text()")
```

```
1.01 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```


In this test, elementpath is roughly **40 times slower** than lxml. That is because elementpath is implemented in pure Python, while lxml is a Cython binding to the C library libxml2.

For existing projects built on `lxml`, you should use it only when you need XPath 2.0 features.

For new projects, you should consider [saxonc](https://www.saxonica.com/saxon-c/index.xml).
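
`%timeit` only exists inside IPython and Jupyter. In a plain script, a similar measurement can be sketched with the standard library's `timeit` module. The helper `per_call_seconds` and the stand-in `run_query` are hypothetical names for this illustration, and the query runs against a generated sample tree with the standard library's `ElementTree` so that the snippet has no third-party dependencies.

```python
import timeit
import xml.etree.ElementTree as ET

# Generated sample data (an assumption for this sketch, not cd_catalog.xml).
SAMPLE = (
    "<CATALOG>"
    + "".join(f"<CD><TITLE>Title {i}</TITLE></CD>" for i in range(26))
    + "</CATALOG>"
)
root = ET.fromstring(SAMPLE)


def per_call_seconds(func, repeat=5, number=1000):
    """Return the best average time per call of func, in seconds."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def run_query():
    # Stand-in for the call you want to time,
    # e.g. elementpath.select(root, "CD[position()<5]/TITLE/text()").
    return [title.text for title in root.findall("CD/TITLE")[:4]]


elapsed = per_call_seconds(run_query)
print(f"{elapsed * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats reduces the influence of other processes on the result.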

# XPath 2.0 Types

In XPath 1.0, there are only four types:

* node-set
* boolean
* number (floating-point)
* string

XPath 2.0, in contrast, has dozens of them:

![XPath 2.0 Types](https://www.w3.org/TR/xquery-operators/type-hierarchy.png)

You can read the [official documentation](https://www.w3.org/TR/xquery-operators) for details.

# XPath 2.0 Workflow Control

The biggest upgrade, in my mind, is the workflow control introduced in XPath 2.0. It makes XPath 2.0 feel more like a programming language, and a programming language is what we are most familiar with. With workflow control in XPath 2.0, we can remove the corresponding workflow control from our Python code, which is slow. In a project that keeps its XPath expressions in configuration, this can be a game changer.

## If Statement

The if statement structure is:

```
if (condition)
    then ... 
    else ...
```

In the if statement, the else clause is mandatory: the XPath 2.0 grammar requires it. If we do not want to return anything, we can return the empty sequence `()` from the else clause.

The following XPath compares the prices of two CDs and returns the title of the more expensive one.

```python
elementpath.select(root, "if (CD[1]/PRICE>CD[2]/PRICE) then CD[1]/TITLE/text() else CD[2]/TITLE/text()")
```

```
['Empire Burlesque']
```

```python
# compares two values, returns the first one, only if it is the smaller one.
elementpath.select(root, "if (CD[1]/PRICE<CD[2]/PRICE) then CD[1]/TITLE/text() else ()")
```

```
[]
```

```python
# compares two values, returns the smaller one.
elementpath.select(root, "if (CD[1]/PRICE<CD[2]/PRICE) then CD[1]/TITLE/text() else CD[2]/TITLE/text()")
```

```
['Hide your heart']
```
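
For comparison, here is a sketch of the Python-side branching that such an expression replaces. It uses the standard library's `ElementTree` and a two-CD sample written by hand for this illustration (an assumption, not the real `cd_catalog.xml`); the helper `price` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hand-written sample (assumption: a stand-in for cd_catalog.xml).
SAMPLE = """
<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><PRICE>10.90</PRICE></CD>
  <CD><TITLE>Hide your heart</TITLE><PRICE>9.90</PRICE></CD>
</CATALOG>
"""
root = ET.fromstring(SAMPLE.strip())


def price(cd):
    """Read a CD's PRICE child as an exact decimal number."""
    return Decimal(cd.findtext("PRICE"))


first, second = root.findall("CD")[:2]

# if (CD[1]/PRICE < CD[2]/PRICE) then CD[1]/TITLE/text() else CD[2]/TITLE/text()
cheaper = first if price(first) < price(second) else second
cheaper_title = cheaper.findtext("TITLE")

# if (CD[1]/PRICE < CD[2]/PRICE) then CD[1]/TITLE/text() else ()
# The empty sequence () becomes an empty list here.
only_if_first_cheaper = (
    [first.findtext("TITLE")] if price(first) < price(second) else []
)

print(cheaper_title, only_if_first_cheaper)
```

Like XPath's `if`, Python's conditional expression always needs both branches, which is why the "return nothing" case has to spell out an empty result.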

## For Statement

In the for statement, you have to define a variable to represent the item in the list. The statement structure is:


> for $x in *list* return ...


```python
elementpath.select(root, "for $x in CD[position()<5] return concat($x/TITLE/text(), ' by ', $x/ARTIST/text())")
```

```
['Empire Burlesque by Bob Dylan',
 'Hide your heart by Bonnie Tyler',
 'Greatest Hits by Dolly Parton',
 'Still got the blues by Gary Moore']
```
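
The `for ... return` expression maps closely onto a Python list comprehension. The sketch below shows that correspondence with the standard library's `ElementTree` and a hand-written four-CD sample (an assumption for this illustration, not the real file).

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumption: a stand-in for cd_catalog.xml).
SAMPLE = """
<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD>
  <CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>
  <CD><TITLE>Greatest Hits</TITLE><ARTIST>Dolly Parton</ARTIST></CD>
  <CD><TITLE>Still got the blues</TITLE><ARTIST>Gary Moore</ARTIST></CD>
</CATALOG>
"""
root = ET.fromstring(SAMPLE.strip())

# for $x in CD[position()<5]
#     return concat($x/TITLE/text(), ' by ', $x/ARTIST/text())
titles_by_artist = [
    f"{cd.findtext('TITLE')} by {cd.findtext('ARTIST')}"
    for cd in root.findall("CD")[:4]  # position() < 5 means the first four CDs
]

print(titles_by_artist)
```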

## Quantifiers

You can test whether some or every item in a sequence satisfies whatever condition you want.


> some/every $x in *list* satisfies *condition*


```python
elementpath.select(root, "some $x in CD/ARTIST/text() satisfies $x='Bob Dylan'")
```

```
True
```

```python
# xpath 1.0 equivalent for the above
root.xpath("count(CD/ARTIST[text()='Bob Dylan'])>0")
```

```
True
```

```python
elementpath.select(root, "every $x in CD/ARTIST/text() satisfies $x='Bob Dylan'")
```

```
False
```

```python
# xpath 1.0 equivalent for the above
root.xpath("count(CD/ARTIST[text()='Bob Dylan'])=count(CD)")
```

```
False
```

```python
elementpath.select(root, "every $x in CD/ARTIST/text() satisfies string-length($x)>2")
```

```
True
```

```python
elementpath.select(root, "every $x in CD/ARTIST/text() satisfies string-length($x)>4")
```

```
False
```
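
In Python terms, `some` behaves like `any()` and `every` behaves like `all()`. A minimal sketch of that correspondence, using the standard library's `ElementTree` and a short hand-written artist list (an assumption for this illustration, not the real file):

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumption: a stand-in for cd_catalog.xml).
SAMPLE = """
<CATALOG>
  <CD><ARTIST>Bob Dylan</ARTIST></CD>
  <CD><ARTIST>Bonnie Tyler</ARTIST></CD>
  <CD><ARTIST>Many</ARTIST></CD>
</CATALOG>
"""
root = ET.fromstring(SAMPLE.strip())
artists = [artist.text for artist in root.findall("CD/ARTIST")]

# some $x in CD/ARTIST/text() satisfies $x='Bob Dylan'
some_dylan = any(name == "Bob Dylan" for name in artists)

# every $x in CD/ARTIST/text() satisfies $x='Bob Dylan'
every_dylan = all(name == "Bob Dylan" for name in artists)

# every $x in CD/ARTIST/text() satisfies string-length($x)>2
every_longer_than_2 = all(len(name) > 2 for name in artists)

# every $x in CD/ARTIST/text() satisfies string-length($x)>4
every_longer_than_4 = all(len(name) > 4 for name in artists)

# Over an empty sequence, `every` is true and `some` is false,
# the same convention as all([]) and any([]).
empty_every, empty_some = all([]), any([])

print(some_dylan, every_dylan, every_longer_than_2, every_longer_than_4)
```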

# XPath 2.0 Sequence Operations

In XPath 2.0, every value is a flat, ordered sequence.

## Union

To join two sequences, we can simply use the comma operator. Strictly speaking this is concatenation: unlike the `union` (`|`) operator, the comma keeps repeated items.

```python
elementpath.select(root, "(CD[position()<2], CD[position()<3])")
```

```
[<Element CD at 0x1b27f326a40>,
 <Element CD at 0x1b27f326a40>,
 <Element CD at 0x1b27f326840>]
```
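
The difference between the comma operator and a duplicate-free union can be sketched in plain Python. The code uses the standard library's `ElementTree` with a hand-written sample (an assumption, not the real file), and the helper `union` is a hypothetical name that mimics the `union` (`|`) operator by comparing node identity.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumption: a stand-in for cd_catalog.xml).
SAMPLE = """
<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE></CD>
  <CD><TITLE>Hide your heart</TITLE></CD>
  <CD><TITLE>Greatest Hits</TITLE></CD>
</CATALOG>
"""
root = ET.fromstring(SAMPLE.strip())
cds = root.findall("CD")

first_one = cds[:1]  # CD[position()<2]
first_two = cds[:2]  # CD[position()<3]

# Comma operator: plain concatenation, duplicates are kept.
concatenated = first_one + first_two


def union(*sequences):
    """Drop duplicate nodes (by identity) and return them in document order."""
    wanted = {id(node) for seq in sequences for node in seq}
    return [node for node in root.iter() if id(node) in wanted]


merged = union(first_one, first_two)

print(len(concatenated), len(merged))
```

The first two items of `concatenated` are the same object, which matches the repeated memory address in the output above.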

## Intersect

Suppose we need the CDs from `Greatest Hits` to `One night only`. In XPath 1.0 this is very tricky, but in XPath 2.0 it is much simpler: find all following siblings of `Greatest Hits`, find all preceding siblings of `One night only`, and intersect the two.

```python
xpath="""
CD[TITLE/text()='Greatest Hits']/following-sibling::CD
intersect
CD[TITLE/text()='One night only']/preceding-sibling::CD
"""
elementpath.select(root, xpath)
```

```
[<Element CD at 0x1b27f3a2d80>, <Element CD at 0x1b27f3a2e80>]
```

This solution does not include the two CDs themselves, so we need to union each of them with its siblings.

```python
xpath="""
CD[TITLE/text()='Greatest Hits']/(self::CD|following-sibling::CD)
intersect
CD[TITLE/text()='One night only']/(self::CD|preceding-sibling::CD)
"""
elementpath.select(root, xpath)
```

```
[<Element CD at 0x1b27f3a2a40>,
 <Element CD at 0x1b27f3a2d80>,
 <Element CD at 0x1b27f3a2e80>,
 <Element CD at 0x1b27f3a2f00>]
```

```python
%timeit elementpath.select(root, xpath)
```

```
8.82 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

At almost 9 ms per call, this is still too heavy.
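
If the cost matters and the selection does not have to live in an XPath string, the same range can be taken with an ordinary list slice. This is only a sketch of that alternative: it uses the standard library's `ElementTree` and a hand-written sample (an assumption, not the real file), and it assumes both titles exist and are unique.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumption: a stand-in for cd_catalog.xml).
SAMPLE = """
<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE></CD>
  <CD><TITLE>Hide your heart</TITLE></CD>
  <CD><TITLE>Greatest Hits</TITLE></CD>
  <CD><TITLE>Still got the blues</TITLE></CD>
  <CD><TITLE>Eros</TITLE></CD>
  <CD><TITLE>One night only</TITLE></CD>
  <CD><TITLE>Sylvias Mother</TITLE></CD>
</CATALOG>
"""
root = ET.fromstring(SAMPLE.strip())

cds = root.findall("CD")
titles = [cd.findtext("TITLE") for cd in cds]

# list.index raises ValueError if a title is missing.
start = titles.index("Greatest Hits")
end = titles.index("One night only")

# Inclusive range, like the self::CD | ...-sibling::CD version.
selected = cds[start:end + 1]
selected_titles = [cd.findtext("TITLE") for cd in selected]

print(selected_titles)
```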

## Except

The `except` operator filters out the items we do not want.

```python
elementpath.select(root, "count(CD)")
```

```
26
```

```python
elementpath.select(root, "count(CD except CD[YEAR/text()<1990])")
```

```
12
```
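
`except` is a difference on nodes, compared by identity rather than by content. A rough Python sketch of the same idea, using the standard library's `ElementTree` and a hand-written four-CD sample (an assumption, not the real 26-CD file):

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumption: a stand-in for cd_catalog.xml).
SAMPLE = """
<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><YEAR>1985</YEAR></CD>
  <CD><TITLE>Hide your heart</TITLE><YEAR>1988</YEAR></CD>
  <CD><TITLE>Greatest Hits</TITLE><YEAR>1982</YEAR></CD>
  <CD><TITLE>Still got the blues</TITLE><YEAR>1990</YEAR></CD>
</CATALOG>
"""
root = ET.fromstring(SAMPLE.strip())
cds = root.findall("CD")

# CD[YEAR/text()<1990]
before_1990 = [cd for cd in cds if int(cd.findtext("YEAR")) < 1990]

# CD except CD[YEAR/text()<1990]
# Nodes are excluded by identity, and document order is preserved.
excluded_ids = {id(cd) for cd in before_1990}
remaining = [cd for cd in cds if id(cd) not in excluded_ids]
remaining_titles = [cd.findtext("TITLE") for cd in remaining]

print(len(cds), len(remaining), remaining_titles)
```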

# XPath 2.0 Functions
## Numeric Functions
First, the return type of numeric functions differs from XPath 1.0. As the following example shows, `floor(3.2)` produces a `Decimal` with elementpath's XPath 2.0 but a float with lxml's XPath 1.0.

```python
elementpath.select(root, "floor(3.2)")
```

```
Decimal('3')
```

```python
# in xpath 1.0, the return type is float
root.xpath("floor(3.2)")
```

```
3.0
```

Some functions are new in XPath 2.0, like `abs` and `round-half-to-even`.

```python
elementpath.select(root, "abs(-3.5)")
```

```
Decimal('3.5')
```

```python
elementpath.select(root, "round-half-to-even(-3.5)")
```

```
Decimal('-4')
```

```python
elementpath.select(root, "round-half-to-even(-2.5)")
```

```
Decimal('-2')
```
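
The `Decimal` objects in these results are Python's standard `decimal.Decimal`, so the same rounding rules can be reproduced without XPath. The sketch below mirrors `floor`, `abs` and `round-half-to-even` with `Decimal.quantize`; the helper names are hypothetical, and this is an illustration of the rounding modes rather than elementpath's actual implementation.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN


def floor_decimal(value):
    """Round towards negative infinity, like floor() on an xs:decimal."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_FLOOR)


def round_half_to_even(value):
    """Round to the nearest integer; ties go to the even neighbour."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


floored = floor_decimal("3.2")              # floor(3.2)
absolute = abs(Decimal("-3.5"))             # abs(-3.5)
tie_odd = round_half_to_even("-3.5")        # tie between -4 and -3
tie_even = round_half_to_even("-2.5")       # tie between -3 and -2

# Decimals are exact where binary floats are not.
exact_sum = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
float_sum = 0.1 + 0.2 == 0.3

print(floored, absolute, tie_odd, tie_even, exact_sum, float_sum)
```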

## String Functions
New string functions are added in XPath 2.0, such as `string-join`, `upper-case` and `lower-case`. The examples below also use `normalize-space`, which already exists in XPath 1.0.

```python
elementpath.select(root, "string-join(CD/TITLE/text(), ', ')")
```

```
'Empire Burlesque, Hide your heart, Greatest Hits, Still got the blues, Eros, One night only, Sylvias Mother, Maggie May, Romanza, When a man loves a woman, Black angel, 1999 Grammy Nominees, For the good times, Big Willie style, Tupelo Honey, Soulsville, The very best of, Stop, Bridge of Spies, Private Dancer, Midt om natten, Pavarotti Gala Concert, The dock of the bay, Picture book, Red, Unchain my heart'
```

```python
elementpath.select(root, "normalize-space('abc\nd')")
```

```
'abc d'
```

```python
elementpath.select(root, "upper-case('abCD')")
```

```
'ABCD'
```

```python
elementpath.select(root, "lower-case('abCD')")
```

```
'abcd'
```

```python
elementpath.select(root, "normalize-space('as\ndf')")
```

```
'as df'
```
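
Each of these functions has a close counterpart among Python's string methods, which may help when deciding whether work belongs in the XPath or in Python. A short sketch with a hand-picked list of titles (an assumption for this illustration):

```python
# A few titles written out by hand (assumption: not read from the file).
titles = ["Empire Burlesque", "Hide your heart", "Greatest Hits"]

# string-join(CD/TITLE/text(), ', ')
joined = ", ".join(titles)

# normalize-space('abc\nd'): trim both ends, collapse inner whitespace runs.
# Note: str.split() treats any Unicode whitespace as a separator, while
# XPath only treats space, tab, CR and LF as whitespace.
normalized = " ".join("abc\nd".split())

# upper-case('abCD') and lower-case('abCD')
upper = "abCD".upper()
lower = "abCD".lower()

print(joined, normalized, upper, lower, sep=" | ")
```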

## Regular Expression

Regular expression support is another great addition in XPath 2.0. It matters less than workflow control, because in XPath 1.0 we can still use regular expressions through [EXSLT](http://exslt.org/), whereas workflow control is simply not available in XPath 1.0.


XPath 2.0 supports the `i`, `m`, `s` and `x` flags, while EXSLT supports the `g` and `i` flags.

```python
elementpath.select(root, r'matches("hello world", "hello\sworld")')
```

```
True
```

```python
elementpath.select(root, 'matches("aabbcc", "aa")')
```

```
True
```

```python
elementpath.select(root, 'replace("aabbcc", "aa", "AA")')
```

```
'AAbbcc'
```

```python
elementpath.select(root, r'matches("aaa aaa aaa", "([a-z]*) \1 \1")')
```

```
True
```

**Back-references** are written differently in the pattern and in the replacement. In the pattern they are `\1`, `\2`, and so on, but in the replacement they are `$1`, `$2`, and so on.

The following code finds a word repeated three times and makes it four times.

```python
elementpath.select(root, r'replace("Lucy is a good good good girl", "([a-z]*) \1 \1", "$1 $1 $1 $1")')
```

```
'Lucy is a good good good good girl'
```
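
For readers coming from Python's `re` module, the main difference is the replacement syntax: Python writes `\1` in both the pattern and the replacement, where XPath writes `$1` in the replacement. A sketch of the same two operations in plain Python:

```python
import re

pattern = r"([a-z]*) \1 \1"

# matches("aaa aaa aaa", "([a-z]*) \1 \1")
# XPath's matches() looks for the pattern anywhere in the string,
# so re.search (not re.match or re.fullmatch) is the closest counterpart.
found = re.search(pattern, "aaa aaa aaa") is not None

# replace(..., "([a-z]*) \1 \1", "$1 $1 $1 $1")
# Python uses \1 in the replacement where XPath uses $1.
replaced = re.sub(pattern, r"\1 \1 \1 \1", "Lucy is a good good good girl")

print(found, replaced)
```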

**fn:tokenize**

Most other languages call this operation split.

```python
elementpath.select(root, r'fn:tokenize("abracadabra", "(ab)|(a)")')
```

```
['', 'r', 'c', 'd', 'r', '']
```
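
Python's `re.split` gives the same tokens, with one caveat worth knowing when porting patterns: capturing groups in the separator are added to Python's result, so the groups have to be dropped or made non-capturing. A short sketch:

```python
import re

# Same separator as above, written without capturing groups.
tokens = re.split(r"ab|a", "abracadabra")

# With capturing groups, re.split also returns what each group captured
# (or None for a group that did not take part in the match).
with_groups = re.split(r"(ab)|(a)", "abracadabra")

print(tokens)
print(with_groups[:3])
```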

## Sequence Functions

```python
elementpath.select(root, r'fn:remove(("a", "b", "c"), 1)')
```

```
['b', 'c']
```

```python
elementpath.select(root, r'fn:distinct-values(("a", "b", "c", "a"))')
```

```
['a', 'b', 'c']
```

```python
elementpath.select(root, r'fn:reverse(("a", "b", "c", "d"))')
```

```
['d', 'c', 'b', 'a']
```

```python
elementpath.select(root, r'subsequence(("a", "b", "c", "d"), 2, 2)')
```

```
['b', 'c']
```

```python
elementpath.select(root, r'exactly-one(("a"))')
```

```
'a'
```
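
The detail most likely to trip up Python users here is that XPath positions are 1-based. The sketch below re-creates the functions on Python lists to make the index shift explicit; the helper names are hypothetical, and the helpers assume integer positions of 1 or more rather than covering every edge case in the specification.

```python
def remove(items, position):
    """Like fn:remove: drop the item at a 1-based position."""
    return [x for i, x in enumerate(items, start=1) if i != position]


def subsequence(items, start, length):
    """Like fn:subsequence for integer start >= 1: 1-based start, then length items."""
    return items[start - 1:start - 1 + length]


def exactly_one(items):
    """Like fn:exactly-one: return the single item or raise an error."""
    if len(items) != 1:
        raise ValueError("expected exactly one item")
    return items[0]


removed = remove(["a", "b", "c"], 1)

# dict.fromkeys keeps first occurrences in their original order.
distinct = list(dict.fromkeys(["a", "b", "c", "a"]))

reversed_items = ["a", "b", "c", "d"][::-1]
middle = subsequence(["a", "b", "c", "d"], 2, 2)
single = exactly_one(["a"])

print(removed, distinct, reversed_items, middle, single)
```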

## Datetime Functions

XPath 2.0 also adds many date and time functions and operators. The following example adds 1 year and 2 months to 2000-10-30.

```python
elementpath.select(root, 'xs:date("2000-10-30") + xs:yearMonthDuration("P1Y2M")')
```

```
[Date10(2001, 12, 30)]
```
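
Python's standard `datetime` module has no year-and-month duration, which is one reason this XPath feature is convenient. A minimal sketch of the arithmetic behind it follows; `add_months` is a hypothetical helper, and clamping the day to the end of a shorter month is an assumption of this sketch rather than a statement about elementpath's behaviour.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Add a number of months to a date, clamping the day if needed."""
    # Count months from year 0 so that divmod handles year rollover.
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# xs:date("2000-10-30") + xs:yearMonthDuration("P1Y2M")
# P1Y2M is 1 year and 2 months, i.e. 14 months.
result = add_months(date(2000, 10, 30), 1 * 12 + 2)

# Day clamping: 31 January plus one month lands on the last day of February.
clamped = add_months(date(2001, 1, 31), 1)

print(result, clamped)
```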

# Conclusion

XPath 2.0 brings a lot of new features to XPath, including workflow control and a bunch of new functions, which will make your life much easier.

XPath 2.0 is not the newest XPath; the latest version is XPath 3.1. My next article will cover XPath 3.1 with `saxonc`, which you can use in your new projects. For legacy projects built on `lxml`, you will have to use `elementpath`.

This article is just a brief introduction to XPath 2.0 in Python. In day-to-day work, you will keep consulting the official documentation, just as a student keeps a dictionary at hand.

# Reference

[What's New in XPath 2.0](https://www.xml.com/pub/a/2002/03/20/xpath2.html)

[XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition)](https://www.w3.org/TR/xquery-operators/)

[elementpath documentation](https://elementpath.readthedocs.io/)

