Commit

Merge pull request #24 from James-LG/james/nom
BREAKING: Complete xpath module rewrite
James-LG committed Jan 3, 2024
2 parents e1cb26f + d38ebd4 commit d33595c
Showing 91 changed files with 12,865 additions and 2,832 deletions.
2 changes: 1 addition & 1 deletion .devcontainer/Dockerfile
@@ -1,6 +1,6 @@
# See here for image contents: https://github.com/microsoft/vscode-dev-containers/tree/v0.187.0/containers/rust/.devcontainer/base.Dockerfile

FROM mcr.microsoft.com/vscode/devcontainers/rust:0-1-bullseye
FROM mcr.microsoft.com/devcontainers/rust:1-1-bullseye

# [Optional] Uncomment this section to install additional packages.
# RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
13 changes: 9 additions & 4 deletions Cargo.toml
@@ -1,8 +1,8 @@
[package]
name = "skyscraper"
version = "0.5.1"
version = "0.6.0"
authors = ["James La Novara-Gsell <james.lanovara.gsell@gmail.com>"]
edition = "2018"
edition = "2021"
description = "XPath for HTML web scraping"
license = "MIT"
readme = "README.md"
@@ -14,15 +14,20 @@ categories = ["parsing"]

[dependencies]
indextree = "4.3.1"
lazy_static = "1.4.0"
thiserror = "1.0.30"
thiserror = "1.0.52"
indexmap = "2.0.0"
log = "0.4.19"
nom = "7.1.3"
ordered-float = "4.2.0"
once_cell = "1.19.0"
enum-extract-macro = "0.1.1"
enum-extract-error = "0.1.1"

[dev-dependencies]
criterion = "0.5.1"
mockall = "0.12.0"
indoc = "2"
proptest = "1.3.1"

[[bench]]
name = "benchmarks"
97 changes: 70 additions & 27 deletions README.md
@@ -7,6 +7,11 @@

Rust library to scrape HTML documents with XPath expressions.

> This library is still at major version 0 because many XPath features are not implemented yet and hit `todo!` calls.
> If you encounter one that you feel should be prioritized, open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).
>
> See the [Supported XPath Features](#supported-xpath-features) section for details.
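
Because `todo!` panics at runtime instead of returning an error, a caller that evaluates untrusted expressions may want to contain that panic. The following is a minimal sketch using only the standard library. `evaluate_unsupported` and `try_evaluate` are hypothetical names invented for illustration, not part of Skyscraper's API, and the approach assumes the binary is built with the default `panic = "unwind"` setting.

```rust
use std::panic;

// Hypothetical stand-in for a library call that reaches an unimplemented
// feature. The "not-yet-supported" marker is made up for this sketch.
fn evaluate_unsupported(expr: &str) -> usize {
    if expr.contains("not-yet-supported") {
        todo!("feature not implemented")
    }
    expr.len()
}

// Runs the call and converts a panic into an Err value.
fn try_evaluate(expr: &str) -> Result<usize, String> {
    panic::catch_unwind(|| evaluate_unsupported(expr))
        .map_err(|_| format!("expression hit an unimplemented feature: {expr}"))
}

fn main() {
    assert!(try_evaluate("//div").is_ok());
    assert!(try_evaluate("//div/not-yet-supported").is_err());
}
```

The panic message is still printed to standard error by the default panic hook; only the unwinding is stopped at the `catch_unwind` boundary.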
## HTML Parsing

Skyscraper has its own HTML parser implementation. The parser outputs a
@@ -48,34 +53,72 @@ assert_eq!(parent_node, parent_of_child1);

## XPath Expressions

Skyscraper is capable of parsing XPath strings and applying them to HTML
documents.
Skyscraper is capable of parsing XPath strings and applying them to HTML documents.

Below is a basic XPath example. See the [docs](https://docs.rs/skyscraper/latest/skyscraper/xpath/index.html) for more examples.

```rust
use skyscraper::{html, xpath};
// Parse the html text into a document.
let html_text = r##"
<div>
<div class="foo">
<span some_attr="value">yes</span>
</div>
<div class="bar">
<span>no</span>
</div>
</div>
"##;
let document = html::parse(html_text)?;

// Parse and apply the xpath.
let expr = xpath::parse("//div[@class='foo']/span")?;
let results = expr.apply(&document)?;
assert_eq!(1, results.len());

// Get text from the node
let text = results[0].get_text(&document).expect("text missing");
assert_eq!("yes", text);
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let html_text = r##"
    <html>
        <body>
            <div>Hello world</div>
        </body>
    </html>"##;

// Get attributes from the node
let attributes = results[0].get_attributes(&document).expect("no attributes");
assert_eq!("value", attributes["some_attr"]);
    let document = html::parse(html_text)?;
    let xpath_item_tree = XpathItemTree::from(&document);
    let xpath = xpath::parse("//div")?;

    let item_set = xpath.apply(&xpath_item_tree)?;

    assert_eq!(item_set.len(), 1);

    let mut items = item_set.into_iter();

    let item = items
        .next()
        .unwrap();

    let element = item
        .as_node()?
        .as_tree_node()?
        .data
        .as_element_node()?;

    assert_eq!(element.name, "div");
    Ok(())
}
```

### Supported XPath Features

Below is a non-exhaustive list of the features that are currently supported.

1. Basic XPath steps: `/html/body/div`, `//div/table//span`
1. Attribute selection: `//div/@class`
1. Text selection: `//div/text()`
1. Wildcard node selection: `//body/*`
1. Predicates:
    1. Attributes: `//div[@class='hi']`
    1. Indexing: `//div[1]`
1. Functions:
    1. `fn:root()`
1. Forward axes:
    1. Child: `child::*`
    1. Descendant: `descendant::*`
    1. Attribute: `attribute::*`
    1. DescendantOrSelf: `descendant-or-self::*`
    1. (more coming soon)
1. Reverse axes:
    1. Parent: `parent::*`
    1. (more coming soon)
1. Treat expressions: `/html treat as node()`

This should cover most XPath use-cases.
If your use case requires an unimplemented feature,
please open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).
8 changes: 8 additions & 0 deletions proptest-regressions/xpath/grammar/terminal_symbols.txt
@@ -0,0 +1,8 @@
# Seeds for failure cases proptest has generated in the past. It is
# automatically read and these particular cases re-run before any
# novel cases are generated.
#
# It is recommended to check this file in to source control so that
# everyone who runs the test benefits from these saved cases.
cc 20048a5e9b79bcb027e54a2942e2bff1c34738f45885647a114e6666affd5c8b # shrinks to s = ""
cc 1a8c030e6f3cd967369afade668ea42b6f21dc2dfc006f9865557baad40bf590 # shrinks to s = "\""
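
The two seeds above shrink to the empty string and a lone double quote, both typical edge cases for a recognizer of quoted literals. The diff does not show which property failed, so the following is only an illustrative sketch under that assumption: a hand-rolled recognizer for the XPath `StringLiteral` production (doubled quotes act as escapes), written with the standard library alone. It is not the crate's `nom`-based parser, and `string_literal` is a hypothetical name.

```rust
// Recognizes an XPath string literal at the start of `input`.
// Returns the unescaped contents and the remaining input, or None if the
// input does not start with a complete literal.
fn string_literal(input: &str) -> Option<(String, &str)> {
    let mut chars = input.char_indices();
    let (_, quote) = chars.next()?; // empty input: not a literal
    if quote != '"' && quote != '\'' {
        return None;
    }
    let mut value = String::new();
    while let Some((i, c)) = chars.next() {
        if c != quote {
            value.push(c);
            continue;
        }
        // A doubled quote is an escaped quote; a single one ends the literal.
        let rest = &input[i + c.len_utf8()..];
        if rest.starts_with(quote) {
            value.push(quote);
            chars.next(); // skip the second quote of the escape pair
        } else {
            return Some((value, rest));
        }
    }
    None // ran out of input before the closing quote
}

fn main() {
    // The two regression inputs are rejected without panicking.
    assert_eq!(string_literal(""), None);
    assert_eq!(string_literal("\""), None);
    // Well-formed literals, including escaped quotes, are accepted.
    assert_eq!(string_literal("'it''s'"), Some(("it's".to_string(), "")));
    assert_eq!(string_literal("\"a\"\"b\" rest"), Some(("a\"b".to_string(), " rest")));
}
```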
7 changes: 7 additions & 0 deletions proptest-regressions/xpath/grammar/xml_names.txt
@@ -0,0 +1,7 @@
# Seeds for failure cases proptest has generated in the past. It is
# automatically read and these particular cases re-run before any
# novel cases are generated.
#
# It is recommended to check this file in to source control so that
# everyone who runs the test benefits from these saved cases.
cc 521c97a97f04f52ff7e083b8395ebc30c03b00c664dea089d23b430716979c05 # shrinks to s = "À"
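
The seed above shrinks to `À` (U+00C0), the first code point of the `[#xC0-#xD6]` range in the XML `NameStartChar` production, so it is a plausible boundary case for a name parser. As a hedged sketch, and not the crate's actual implementation, the check below encodes the `NameStartChar` ranges from the XML 1.0 grammar without the colon, as used for `NCName`. The function name `is_ncname_start_char` is an assumption made for illustration.

```rust
// NameStartChar ranges from the XML 1.0 grammar, minus ':'
// (an NCName may not contain a colon).
fn is_ncname_start_char(c: char) -> bool {
    matches!(c,
        'A'..='Z' | '_' | 'a'..='z'
        | '\u{C0}'..='\u{D6}' | '\u{D8}'..='\u{F6}' | '\u{F8}'..='\u{2FF}'
        | '\u{370}'..='\u{37D}' | '\u{37F}'..='\u{1FFF}'
        | '\u{200C}'..='\u{200D}' | '\u{2070}'..='\u{218F}'
        | '\u{2C00}'..='\u{2FEF}' | '\u{3001}'..='\u{D7FF}'
        | '\u{F900}'..='\u{FDCF}' | '\u{FDF0}'..='\u{FFFD}'
        | '\u{10000}'..='\u{EFFFF}')
}

fn main() {
    // The regression input is a valid first character of a name.
    assert!(is_ncname_start_char('À'));
    // The multiplication and division signs sit in the gaps between ranges.
    assert!(!is_ncname_start_char('\u{D7}'));
    assert!(!is_ncname_start_char('\u{F7}'));
    // Digits and the colon cannot start an NCName.
    assert!(!is_ncname_start_char('1'));
    assert!(!is_ncname_start_char(':'));
}
```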