# Snooty

_A domain-specific language for scraping HTML._

![snooty mascots](snooty.jpg)

I've been thinking about how to streamline the process of creating a scraper. (By "scraper", I only mean a program that converts HTML to structured data, not a program that crawls the web.) One of the hassles is that there are multiple languages for traversing HTML documents (e.g. CSS and XPath), each with features the other lacks, and none of them is really designed for scraping.

CSS selectors are designed for stylesheets (obviously!) and lack some important features, like selecting text or attribute nodes. They have other irritating limitations, too: a selector can match a node that is the next sibling of some other match, but it cannot match a node that is the previous sibling.

XPath is more powerful, but the syntax is quite confusing. It is also a language designed for XML, which leads to awkward constructions when parsing HTML, like `a[contains(@href,"image")]`, which selects anchors whose `href` contains the string `image`.
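To make the comparison concrete, here is a minimal sketch using lxml, the same library the prototype in this notebook relies on. The sample markup and variable names are made up for illustration. The three queries cover the things the CSS discussion above says are missing (attribute nodes, text nodes, previous siblings), and the first one also uses the `contains()` predicate.

```python
from lxml import etree

# Hypothetical sample markup, kept well-formed so the XML parser accepts it.
page = etree.fromstring(
    '<div>'
    '<h2>Gallery</h2>'
    '<p>Intro text</p>'
    '<a href="/image/1.png">first</a>'
    '<a href="/about.html">about</a>'
    '</div>'
)

# Attribute nodes: the href of every anchor whose href contains "image".
image_hrefs = page.xpath('//a[contains(@href, "image")]/@href')

# Text nodes: the text directly inside each paragraph.
paragraph_text = page.xpath('//p/text()')

# Previous sibling: the tag of the element immediately before each paragraph.
before_paragraph = [e.tag for e in page.xpath('//p/preceding-sibling::*[1]')]

print(image_hrefs, paragraph_text, before_paragraph)
```

The queries work, but none of them reads as naturally as a CSS selector does, which is roughly the trade-off described above.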

The [Parsel library](https://parsel.readthedocs.io/) makes it pretty easy to go back and forth between XPath and CSS selectors, and so in a sense this solves a part of the problem. 

But another part of the problem is how to safely run scraping code. Imagine that we wanted to add scraping capabilities to [Starbelly](https://starbelly.readthedocs.io/en/latest/). If Starbelly runs user-provided Python code, then the user can take full control of the server, because Python is basically impossible to sandbox.
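One well-known illustration of why restricting Python is so hard: even with every builtin removed, an expression can climb from a literal to `object` and enumerate the classes loaded in the interpreter. The snippet below is a harmless sketch of that first step only. The names `restricted_globals`, `payload` and `leaked_classes` are made up for the example.

```python
# Attempt a "sandbox" by evaluating an expression with no builtins at all.
restricted_globals = {"__builtins__": {}}

# The payload needs no names, only attribute access on a tuple literal:
# tuple -> object -> every direct subclass of object.
payload = "().__class__.__base__.__subclasses__()"

leaked_classes = eval(payload, restricted_globals)

# The "sandboxed" expression now holds references to real interpreter classes.
print(len(leaked_classes) > 0)
```

From a list like this, an attacker can typically look for a class that leads back to file or process access, which is why blocklist-style sandboxes for Python tend to fail.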

The Splash project ran into this same dilemma. As a result, they chose Lua as their scripting language, specifically because it is a powerful language that is also easy to sandbox. But Lua is a pretty unusual language. It takes time to learn, and it does not have the same library support as Python, e.g. libraries like Parsel.

Another drawback to Python is that scraping code written in it is difficult to debug. It is helpful to visualize what a scraper is doing or debug parts of it inside a web browser, but the web browser can't run Python code. In practice, I end up converting from Python to JavaScript, debugging in a browser, then converting back to Python and putting it into my scraper.

One solution to this problem may be to write the scrapers in JavaScript, which would be easy to debug in a browser. JavaScript is also obviously capable of running in a sandbox (that's what it does inside the browser, after all). But we still wouldn't have access to libraries like Parsel or `dateutil.parser`.

Snooty is a possible middle ground: a scripting language specifically written for scraping HTML. Untrusted code can safely be executed, and the language can be designed to maximize productivity for scraping tasks. This notebook contains a simple prototype.

In [33]:
import locale
from lxml import etree
from pypeg2 import (List, attr, csl, endl, maybe_some, name, parse,
                    restline, word)

In [55]:
class Comment:
    # A "#" followed by arbitrary text up to the end of the line.
    grammar = '#', attr('text', restline), endl

class LocaleStatement:
    # Example: locale en_US
    grammar = 'locale', attr('locale', word)

    def __repr__(self):
        return 'LocaleStatement[{}]'.format(self.locale)

class Selector:
    # Example: p -> a (an <a> that is a direct child of a <p>)
    grammar = attr('parent', word), '->', attr('child', word)

    def __str__(self):
        return 'Selector[parent={} child={}]'.format(self.parent, self.child)

class SelectorStatement:
    # Example: selector paragraph = p -> a
    grammar = 'selector', name(), '=', attr('selector', Selector)

    def __str__(self):
        return 'SelectorStatement[{} = {}]'.format(self.name, self.selector)

class DottedName(List):
    # Example: foo.paragraph, parsed into ['foo', 'paragraph']
    grammar = csl(str, separator='.')

class ExportStatement:
    # Example: export foo.paragraph = paragraph
    grammar = 'export', attr('name', DottedName), '=', attr('selector', str)

class Script(List):
    # A script is any sequence of the statements defined above.
    grammar = maybe_some([
        Comment,
        ExportStatement,
        LocaleStatement,
        SelectorStatement
    ])

In [96]:
def run(code, html):
    # Parse the Snooty source (the `code` argument) into statement objects.
    ast = parse(code, Script)
    # The sample document is well-formed, so the XML parser accepts it.
    doc = etree.fromstring(html)
    selectors = {}
    export = {}

    for stmt in ast:
        if isinstance(stmt, LocaleStatement):
            locale.setlocale(locale.LC_ALL, stmt.locale)
        elif isinstance(stmt, SelectorStatement):
            # Remember the selector so later exports can refer to it by name.
            selectors[stmt.name] = stmt.selector
        elif isinstance(stmt, Comment):
            pass
        elif isinstance(stmt, ExportStatement):
            # Walk (and create) the nested dicts for every part of the dotted
            # name except the last, which becomes the key for the results.
            exp = export
            for name in stmt.name[:-1]:
                exp = exp.setdefault(name, {})
            sel = selectors[stmt.selector]
            exp[stmt.name[-1]] = [etree.tostring(e) for e in
                doc.xpath('//{}/{}'.format(sel.parent, sel.child))]
        else:
            print(stmt)

    print(export)

In [97]:
doc = r'''<!DOCTYPE html>
<html>
<body>
    <h1>A Heading</h1>
    <p>A paragraph. It contains <a href='/foo'>a link.</a></p>
    <ol>
        <li>Link <a href='/1.html'>one</a></li>
        <li>Link <a href='/2.html'>two</a></li>
        <li>Link <a href='/3.html'>three</a></li>
    </ol>
</body>
</html>'''

In [98]:
scraper = r'''
# You can define a locale used for parsing numbers and dates:
locale en_US

# The "selector" command defines a selection of HTML
# nodes.
selector paragraph = p -> a
selector list_items = li -> a

# The export command defines data elements that should
# be produced as the result of scraping.
export foo.paragraph = paragraph
export foo.list = list_items
'''

In [99]:
run(scraper, doc)

{'foo': {'paragraph': [b'<a href="/foo">a link.</a>'], 'list': [b'<a href="/1.html">one</a>', b'<a href="/2.html">two</a>', b'<a href="/3.html">three</a>']}}
