Changelog

What's Changed

BREAKING: Complete xpath module rewrite by @James-LG in #24
- Fixed #17: Allow the selection of text with xpath expressions. e.g. //div/text()
- Fixed #15: Allow the selection of attributes with xpath expressions. e.g. //a/@href
- Fixed the behaviour of indexes in xpath expressions. e.g. //div/span[1]
- The new implementation follows the official XPath specification as closely as possible.

Full Changelog: v0.5.1...v0.6.0

v0.5.x -> 0.6.0 Migration Guide

A quick guide to upgrading through some of the major breaking changes introduced in v0.6.0.

Item Type

The biggest change is the return type. Previously, results were a list of items, each either an HtmlTag or an HtmlText. Items are now XpathItem values, a more deeply nested type that follows the XPath data model specification.

Below is an overview of the returned item type XpathItem:

/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    ///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),

    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),

    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}

/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),

    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}

/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),

    /// A namespace node.
    NamespaceNode(NamespaceNode),
}

/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,

    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}

/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),

    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),

    /// A processing instruction node.
    PINode(PINode),

    /// A comment node.
    CommentNode(CommentNode),

    /// A text node.
    TextNode(TextNode),
}
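
Because the variants are nested, reaching a text node or an attribute takes several levels of matching. The following is a minimal sketch of that pattern using simplified stand-in enums. The names Item, TreeNodeData and describe are hypothetical and exist only for this illustration: payloads are reduced to plain strings, lifetimes and the processing instruction variant are omitted, and none of this is the crate's actual definition.

```rust
// Simplified stand-ins for the enums above; NOT the skyscraper definitions.
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Node(Node),
    Function,
    AnyAtomicType(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Node {
    TreeNode(TreeNodeData),
    NonTreeNode(NonTreeNode),
}

#[derive(Debug, Clone, PartialEq)]
enum NonTreeNode {
    Attribute { name: String, value: String },
    Namespace(String),
}

#[derive(Debug, Clone, PartialEq)]
enum TreeNodeData {
    Document,
    Element(String),
    Comment(String),
    Text(String),
}

/// Describe an item by matching through each layer of the hierarchy.
fn describe(item: &Item) -> String {
    match item {
        Item::Node(Node::TreeNode(TreeNodeData::Element(name))) => {
            format!("element <{}>", name)
        }
        Item::Node(Node::TreeNode(TreeNodeData::Text(text))) => {
            format!("text {:?}", text)
        }
        // Document and comment nodes fall through to this arm.
        Item::Node(Node::TreeNode(other)) => format!("other tree node {:?}", other),
        Item::Node(Node::NonTreeNode(NonTreeNode::Attribute { name, value })) => {
            format!("attribute {}={:?}", name, value)
        }
        Item::Node(Node::NonTreeNode(NonTreeNode::Namespace(uri))) => {
            format!("namespace {}", uri)
        }
        Item::Function => "function item".to_string(),
        Item::AnyAtomicType(value) => format!("atomic value {:?}", value),
    }
}
```

A single nested pattern such as Item::Node(Node::TreeNode(TreeNodeData::Text(text))) reaches the innermost payload in one arm, which is often easier to read than three separate match statements.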

Xpath Item Tree

To support the new XpathItem type, xpath expressions must now be applied to an XpathItemTree rather than an HtmlDocument.

XpathItemTree implements From<&HtmlDocument>, so you can generate an XpathItemTree from a reference to an HtmlDocument. Note that this is a fairly expensive operation, so perform it only once per HtmlDocument where possible.

let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;
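
The convert-once advice can be sketched as follows. Doc and Tree are hypothetical stand-ins for HtmlDocument and XpathItemTree, and count stands in for applying an expression; the sketch only illustrates the shape of a From<&T> conversion that is built a single time and then shared by reference across several queries.

```rust
// Stand-in for HtmlDocument (hypothetical, for illustration only).
struct Doc {
    tags: Vec<String>,
}

// Stand-in for XpathItemTree (hypothetical, for illustration only).
struct Tree {
    names: Vec<String>,
}

impl From<&Doc> for Tree {
    /// Pretend this is the expensive step: it walks the whole document.
    fn from(doc: &Doc) -> Self {
        Tree {
            names: doc.tags.iter().map(|tag| tag.to_lowercase()).collect(),
        }
    }
}

/// Stand-in for applying one expression to the tree.
fn count(tree: &Tree, name: &str) -> usize {
    tree.names.iter().filter(|n| n.as_str() == name).count()
}

/// Run several queries against one document, converting it only once.
fn count_all(doc: &Doc, names: &[&str]) -> Vec<usize> {
    let tree = Tree::from(doc); // built once, outside the loop
    names.iter().map(|name| count(&tree, name)).collect()
}
```

Calling Tree::from(doc) inside the closure would still compile, but it would repeat the conversion for every query.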

Getting Text

Text nodes are a type of TreeNode. You can either match on the item or use the convenient as_[variant] functions.

Other changes:

- The functions to retrieve text were renamed: get_text is now text, and get_all_text is now all_text.
- The function now returns a String rather than an Option<String>.

- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&xpath_item_tree);
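
To show why the as_[variant] calls chain with the ? operator, here is a hand-written sketch of such accessors on simplified stand-in types. In the crate they come from the EnumExtract derive; the exact generated signatures and error type are not shown in this guide, so WrongVariant, the Result<&T, _> return type, and item_text are assumptions made for this illustration.

```rust
use std::fmt;

// Hypothetical error for this sketch; the crate's own error type may differ.
#[derive(Debug, PartialEq)]
struct WrongVariant(&'static str);

impl fmt::Display for WrongVariant {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "item is not a {}", self.0)
    }
}

impl std::error::Error for WrongVariant {}

// Simplified stand-ins, not the skyscraper types.
#[derive(Debug)]
enum Item {
    Node(Node),
    Atomic(String),
}

#[derive(Debug)]
enum Node {
    Tree(TreeNode),
    NonTree,
}

#[derive(Debug)]
struct TreeNode {
    text: String,
}

impl Item {
    /// Borrow the inner node, or report which variant was expected.
    fn as_node(&self) -> Result<&Node, WrongVariant> {
        match self {
            Item::Node(node) => Ok(node),
            _ => Err(WrongVariant("Node")),
        }
    }
}

impl Node {
    /// Borrow the inner tree node, or report which variant was expected.
    fn as_tree_node(&self) -> Result<&TreeNode, WrongVariant> {
        match self {
            Node::Tree(tree_node) => Ok(tree_node),
            _ => Err(WrongVariant("TreeNode")),
        }
    }
}

/// Each `?` returns early with the error if the item has the wrong shape.
fn item_text(item: &Item) -> Result<String, WrongVariant> {
    Ok(item.as_node()?.as_tree_node()?.text.clone())
}
```

Compared with nested match statements, the chained form keeps the happy path on one line and leaves the caller to decide how to handle an item of the wrong kind.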

Getting Attributes

Attribute nodes are a type of NonTreeNode. You can now either select these directly in the xpath expression using the attribute axis (new feature), or you can get them from an ElementNode.

- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();

Alternatively, select the attribute node directly with an xpath expression:

- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;

Releases: James-LG/Skyscraper
