
# selectolax

The fastest HTML parser for Python (that I know of).\
Many times faster than Beautiful Soup.

https://github.com/rushter/selectolax

# What's Behind It?
Uses the Modest backend by default. \
Modest is an HTML parser written entirely in C.\
https://github.com/lexborisov/modest\
Lexbor is the other option.\
https://github.com/lexbor/lexbor\
The same author wrote Lexbor more recently, but the selectolax readme says it's still in beta.\
The Lexbor readme says it's stable and that we should use it. Lexbor also seems to be better maintained, with lots of recent commits from its author. Most examples here use Modest, but we'll do a speed test at the end with Lexbor too.


# View method definitions
selectolax uses Cython to wrap Modest and Lexbor.
The docstrings are in the .pxi and .pyx files.

HTMLParser - https://github.com/rushter/selectolax/blob/master/selectolax/parser.pyx

Node - https://github.com/rushter/selectolax/blob/master/selectolax/modest/node.pxi

Selection - https://github.com/rushter/selectolax/blob/master/selectolax/modest/selection.pxi


In [44]:
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from parsel import Selector
from selectolax.lexbor import LexborHTMLParser
from selectolax.parser import HTMLParser

In [2]:
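# Fetch the webpage used in the examples below
r = requests.get("https://en.wikipedia.org/wiki/John_D._Rockefeller", timeout=30)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page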

# HTMLParser

In [3]:
tree = HTMLParser(r.text)
tree, type(tree)

(<HTMLParser chars=390526>, selectolax.parser.HTMLParser)

``` 
.css()
```
Returns a list of Nodes

In [4]:
tree.css("h2")  # -> list[Node]

[<Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>,
 <Node h2>]

In [5]:
for node in tree.css("h2"):
    print(node.text())

Contents
Early life
Pre-Standard Oil career
Standard Oil
Colorado Fuel and Iron
Personal life
Philanthropy
Florida home
Illnesses and death
Legacy
See also
Explanatory notes
Citations
General bibliography
External links


``` 
.css_first()
```
Returns a Node

In [6]:
tree.css_first("h2")  # -> Node

<Node h2>

In [7]:
tree.css_first("h2").text()

'Contents'

```
.select()
```
Returns a Selection object (the class itself is named Selector)

In [8]:
tree.select("div")

<selectolax.parser.Selector at 0x7fd369296700>

Chain selection to search further down

In [9]:
tree_1 = HTMLParser(
    """
    <div>
        <a class="thisone" href="www.google.com">a tag!</a>
        <span>span tag!</span>
    </div>
"""
)
tree_1.select("div").css(".thisone").matches

[<Node a>]

Tree Attributes

In [10]:
tree.head, tree.body, tree.root, tree.input_encoding

(<Node head>, <Node body>, <Node html>, 'UTF-8')

Tree checks

In [11]:
tree.scripts_contain("Namespace"), tree.css_matches("a")

(True, True)

Tags
```
.tags()
```

Tags vs select 

In [12]:
%%timeit
tree.tags("a")

100 µs ± 460 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [13]:
%%timeit
tree.select("a").matches

145 µs ± 857 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Tags vs select with more specific attributes

Getting just the tags is faster, but when you need to search for something more specific, you'll save time by using .css() or .select().

In [14]:
%%timeit
[node for node in tree.tags("a") if node.attributes.get('href') == 'https://web.archive.org/web/20110722233117/http://www.cpu.edu.ph/infocen/October05.htm']

1.44 ms ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [15]:
%%timeit
[node for node in tree.select(".mw-references-columns a[href]").matches if node.attributes.get('href') == 'https://web.archive.org/web/20110722233117/http://www.cpu.edu.ph/infocen/October05.htm']

449 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


# Nodes

In [16]:
node = tree.css("div.hatnote")[0]
node

<Node div>

```html
<div role="note" class="hatnote navigation-not-searchable">For other people named John D. Rockefeller, see
    <a href="/wiki/John_D._Rockefeller_(disambiguation)" class="mw-disambig"
        title="John D. Rockefeller (disambiguation)">John D. Rockefeller (disambiguation)</a>.
</div>
```

A Node shares many methods and attributes with the tree: .css(), .css_first(), .select(), etc.

In [17]:
node.css("a")[0], node.css_first("a"), node.select("a")

(<Node a>, <Node a>, <selectolax.parser.Selector at 0x7fd3692a6380>)

```
.attributes
```

In [18]:
node.attributes

{'role': 'note', 'class': 'hatnote navigation-not-searchable'}

In [19]:
node.attributes["role"]

'note'

```
.text()
```

In [20]:
node.text()

'For other people named John D. Rockefeller, see John D. Rockefeller (disambiguation).'

In [21]:
node.text(deep=False)

'For other people named John D. Rockefeller, see .'

Relatives

In [22]:
node.parent, node.child, node.last_child

(<Node div>, <Node -text>, <Node -text>)

In [23]:
node.next, node.prev

(<Node -text>, <Node style>)

In [24]:
for i in node.iter():
    print(i)

<Node a>


In [25]:
for i in node.traverse():
    print(i)

<Node div>
<Node a>
<Node p>
<Node style>
<Node table>
<Node tbody>
<Node tr>
<Node th>
<Node div>
<Node tr>
<Node td>
<Node a>
<Node img>
<Node div>
<Node tr>
<Node th>
<Node td>
<Node div>
<Node br>
<Node span>
<Node span>
<Node br>
<Node div>
<Node a>
<Node tr>
<Node th>
<Node td>
<Node span>
<Node br>
<Node div>
<Node a>
<Node tr>
<Node th>
<Node td>
<Node a>
<Node style>
<Node span>
<Node a>
<Node span>
<Node span>
<Node span>
<Node span>
<Node span>
<Node span>
<Node span>
<Node span>
<Node span>
<Node tr>
<Node th>
<Node td>
<Node style>
<Node div>
<Node ul>
<Node li>
<Node li>
<Node tr>
<Node th>
<Node td>
<Node a>
<Node div>
<Node a>
<Node a>
<Node a>
<Node a>
<Node a>
<Node tr>
<Node th>
<Node td>
<Node a>
<Node tr>
<Node th>
<Node td>
<Node div>
<Node div>
<Node a>
<Node div>
<Node div>
<Node abbr>
<Node wbr>
<Node tr>
<Node th>
<Node td>
<Node link>
<Node div>
<Node ul>
<Node li>
<Node a>
<Node li>
<Node li>
<Node a>
<Node li>
<Node a>
<Node li>
<Node a>
<Node tr>
<Node th>

# Selection
Allows you to chain CSS selectors together and implements some convenience methods.

```
.matches
```

In [26]:
# same as tree.css("#As_a_bookkeeper")
tree.select("#As_a_bookkeeper").matches

[<Node span>]

```
.text_contains()
```

In [27]:
tree.select("h2").text_contains("Florida home").matches

[<Node h2>]

```
.css()
```
Returns a Selection object.

This is different from ```tree.css()```, which returns a list of Nodes.

The only use I can think of is chaining it after another Selection method.


In [28]:
tree.select("div").text_contains("Rockefeller").css(".thumbcaption").matches[0].text()

"Rockefeller's birthplace in Richford, New York"

## Advanced

In [29]:
%%timeit
(
    tree.select("tr:has(th.infobox-label)")
    .text_contains("Died")
    .css("td.infobox-data") # Selection css() method returns a selector, not a node or node list 
    .matches[0]
    .text(deep=False)
)

139 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [43]:
%%timeit
[node.next.text(deep=False) for node in tree.select("th.infobox-label").text_contains("Died").matches][0]

90.9 µs ± 901 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


# Quick Examples

Get links

In [31]:
[node.attrs["href"] for node in tree.css("a[href]")]

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Special:Search',
 '/w/index.php?title=Special:CreateAccount&returnto=John+D.+Rockefeller',
 '/w/index.php?title=Special:UserLogin&returnto=John+D.+Rockefeller',
 '/w/index.php?title=Special:CreateAccount&returnto=John+D.+Rockefeller',
 '/w/index.php?title=Special:UserLogin&returnto=John+D.+Rockefeller',
 '/wiki/Help:Introduction',
 '/wiki/Special:MyTalk',
 '/wiki/Special:MyContributions',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Special:RecentChanges',
 '/wiki/Wikipedia:File_upload_wizard',
 '/wiki/Special:WhatLinksHere/John_D._Rockefeller',
 '/wiki/Special:Recen
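
Most of the extracted values are relative (`/wiki/...`), fragment-only (`#bodyContent`), or protocol-relative (`//en.wikipedia.org/...`). A minimal sketch of turning them into absolute URLs with the standard library's `urljoin`, assuming the page URL requested earlier is the base:

```python
from urllib.parse import urljoin

# Assumed base: the page URL fetched at the top of the notebook
base_url = "https://en.wikipedia.org/wiki/John_D._Rockefeller"

# A few href values copied from the output above
hrefs = [
    "#bodyContent",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
]

absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```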

# Comparison to other libraries

Parse 1000 HTML pages and find all the links on each page.

In [45]:
def bsoup_default(all_html):
    """Beautiful Soup with html.parser parser"""
    for html in all_html:
        soup = BeautifulSoup(html, "html.parser")
        [link["href"] for link in soup.find_all("a", href=True)]


def bsoup_lxml(all_html):
    """Beautiful Soup with lxml parser
    pip install lxml
    """
    for html in all_html:
        soup = BeautifulSoup(html, "lxml")
        [link["href"] for link in soup.find_all("a", href=True)]


def selectolax_default(all_html):
    """Selectolax using Modest Engine"""
    for html in all_html:
        tree = HTMLParser(html)
        [node.attrs["href"] for node in tree.css("a[href]")]


def selectolax_lexbor(all_html):
    """Selectolax using lexbor Engine"""
    for html in all_html:
        tree = LexborHTMLParser(html)
        [node.attrs["href"] for node in tree.css("a[href]")]


def parsel_default(all_html):
    """Parsel with its default settings"""
    for html in all_html:
        selector = Selector(html)
        selector.css("a[href]::attr(href)").getall()


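def load_all_files():
    """Load every saved page from the files directory.

    There are about 5000 random webpages saved there; the directory is gitignored.
    """
    all_html = []
    for file_path in Path("files").iterdir():
        # Explicit encoding so the result does not depend on the platform default;
        # undecodable bytes are replaced rather than raising
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            all_html.append(f.read())
    return all_html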

In [33]:
all_html = load_all_files()

In [34]:
%%time
bsoup_default(all_html)

CPU times: user 18.4 s, sys: 9.85 ms, total: 18.4 s
Wall time: 18.4 s


In [35]:
%%time
bsoup_lxml(all_html)



CPU times: user 13.7 s, sys: 19.9 ms, total: 13.8 s
Wall time: 13.8 s


In [36]:
%%time
parsel_default(all_html)

CPU times: user 1.78 s, sys: 8 µs, total: 1.78 s
Wall time: 1.77 s


In [37]:
%%time
selectolax_default(all_html)

CPU times: user 506 ms, sys: 9.99 ms, total: 516 ms
Wall time: 514 ms


In [46]:
%%time
selectolax_lexbor(all_html)

CPU times: user 400 ms, sys: 0 ns, total: 400 ms
Wall time: 398 ms
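
`%%time` is an IPython magic, so the cells above only run in a notebook. A rough sketch of the same comparison as a plain script using `time.perf_counter`; `time_parsers`, `LinkCollector`, and `stdlib_links` are hypothetical names, and the standard-library `html.parser` baseline is only there to keep the snippet self-contained:

```python
import time
from html.parser import HTMLParser as StdlibHTMLParser


class LinkCollector(StdlibHTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(
                value for name, value in attrs if name == "href" and value is not None
            )


def stdlib_links(all_html):
    """Standard-library baseline with the same shape as the functions above."""
    for html in all_html:
        collector = LinkCollector()
        collector.feed(html)


def time_parsers(parsers, all_html, repeats=3):
    """Return the best wall-clock time in seconds for each named function."""
    results = {}
    for name, func in parsers.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func(all_html)
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)
    return results


# Made-up sample pages; swap in the real list from load_all_files()
sample_pages = ['<p><a href="/one">one</a> <a href="/two">two</a></p>'] * 100
results = time_parsers({"stdlib html.parser": stdlib_links}, sample_pages)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1000:.2f} ms")
```

With the functions defined earlier, the call would look like `time_parsers({"selectolax": selectolax_default, "lexbor": selectolax_lexbor}, all_html)`.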


# Problems

* No XPath support
* The API is a little wonky (compared to parsel)