imajes / parsley forked from fizx/parsley

A library for expressing markup-to-data transformations; see parselets.com for examples.

name age message
file .gitignore Thu Mar 12 15:27:52 -0700 2009 ignore json artifacts [kyle]
file AUTHORS Tue Dec 09 15:14:15 -0800 2008 project cleanup [Kyle Maxwell]
file ChangeLog Mon Nov 24 18:47:51 -0800 2008 snapshot [Kyle Maxwell]
file INSTALL
file INTRO Tue Mar 03 21:31:24 -0800 2009 tentative rename, etc [Kyle Maxwell]
file Makefile.am
file Makefile.in
file NEWS Mon Nov 24 18:47:51 -0800 2008 snapshot [Kyle Maxwell]
file PAPER Tue Mar 03 21:31:24 -0800 2009 tentative rename, etc [Kyle Maxwell]
file Portfile Fri Mar 20 11:05:57 -0700 2009 refactor works, still have quote error [Kyle Maxwell]
file Portfile.in Fri Mar 20 11:05:57 -0700 2009 refactor works, still have quote error [Kyle Maxwell]
file README.C-LANG
file README.markdown Fri Mar 20 11:05:57 -0700 2009 refactor works, still have quote error [Kyle Maxwell]
file TODO Tue Mar 03 21:31:24 -0800 2009 tentative rename, etc [Kyle Maxwell]
file VERSION Sat Jan 03 01:19:50 -0800 2009 port working [Kyle Maxwell]
file aclocal.m4 Fri Mar 13 16:05:24 -0700 2009 fixed bus errors on invalid expressions [Kyle Maxwell]
file bootstrap.sh Thu Mar 12 14:21:07 -0700 2009 no cache in git [kyle]
file config.guess Fri Mar 13 08:32:55 -0700 2009 copy in libtool deps [Kyle Maxwell]
file config.sub Fri Mar 13 08:32:55 -0700 2009 copy in libtool deps [Kyle Maxwell]
file configure Sun Mar 15 08:21:02 -0700 2009 unbundle json [Kyle Maxwell]
file configure.ac Sun Mar 15 08:21:02 -0700 2009 unbundle json [Kyle Maxwell]
file depcomp Sat Jan 03 16:27:11 -0800 2009 i hope this is the right autotools [Kyle Maxwell]
file functions.c
file functions.h Tue Mar 03 21:31:24 -0800 2009 tentative rename, etc [Kyle Maxwell]
file install-sh Fri Jan 02 19:34:26 -0800 2009 extras [Kyle Maxwell]
file libtool Thu Mar 19 18:00:27 -0700 2009 failing test case for escaped double quotes [Kyle Maxwell]
file ltmain.sh Sun Mar 15 07:54:22 -0700 2009 fixed libtool [Kyle Maxwell]
file missing Sat Jan 03 16:27:11 -0800 2009 i hope this is the right autotools [Kyle Maxwell]
file parsed_xpath.c
file parsed_xpath.h
file parser.c
file parser.h
file parser.y
file parsley.c
file parsley.h
file parsley_main.c
file parsleyc_main.c Fri Mar 20 11:05:57 -0700 2009 refactor works, still have quote error [Kyle Maxwell]
file regexp.c Thu Mar 12 14:06:05 -0700 2009 clear out as many gcc warnings as I can [Kyle Maxwell]
file regexp.h Thu Mar 12 14:06:05 -0700 2009 clear out as many gcc warnings as I can [Kyle Maxwell]
file scanner.c Thu Mar 12 12:54:05 -0700 2009 don't ignore bison artifacts [Kyle Maxwell]
file scanner.l Mon Mar 09 17:15:45 -0700 2009 WIP [Kyle Maxwell]
directory test/
directory tmp/ Sun Mar 08 23:03:51 -0700 2009 WIP, moved parser to %union [Kyle Maxwell]
file util.c
file util.h
file xml2json.c Thu Mar 12 16:48:22 -0700 2009 no leak [Kyle Maxwell]
file xml2json.h Sun Mar 15 08:21:02 -0700 2009 unbundle json [Kyle Maxwell]
file y.tab.c
file y.tab.h
file ylwrap Fri Jan 02 19:34:26 -0800 2009 extras [Kyle Maxwell]
README.markdown

Overview 

parsley is a simple language for extracting data from XML-like documents (including HTML). parsley is:

  1. Blazing fast -- typical HTML parses take under 50 ms.
  2. Easy to write and understand -- parsley uses your current knowledge of JSON, CSS, and XPath.
  3. Powerful -- parsley understands full XPath, including standard and user-defined functions.

Examples

A simple script, or "parselet", looks like this:

{
  "title": "h1",
  "links(a)": [
    {
      "text": ".",
      "href": "@href"
    }
  ]
}

This returns JSON or XML output with the same structure. Applying this parselet to http://www.yelp.com/biz/amnesia-san-francisco yields either:

{
  "title": "Amnesia",
  "links": [
    {
      "href": "\/",
      "text": "Yelp"
    },
    {
      "href": "\/",
      "text": "Welcome"
    },
    {
      "href": "\/signup?return_url=%2Fuser_details",
      "text": " About Me"
    },
    .....
  ]
}

or equivalently:

<parsley:root>
  <title>Amnesia</title>
  <links>
    <parsley:group>
      <href>/</href>
      <text>Yelp</text>
    </parsley:group>
    <parsley:group>
      <href>/</href>
      <text>Welcome</text>
    </parsley:group>
    <parsley:group>
      <href>/signup?return_url=%2Fuser_details</href>
      <text> About Me</text>
    </parsley:group>
    .....
  </links>
</parsley:root>      
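
The pairing between the two outputs can be sketched in Python. This is an illustration inferred only from the sample above, not parsley's own serializer. The helper name to_xml is hypothetical, and wrapping every array entry in a parsley:group element is an assumption drawn from the example.

```python
import xml.etree.ElementTree as ET

def to_xml(value, element):
    """Append `value` under `element`, following the JSON/XML pairing in the sample."""
    if isinstance(value, dict):
        # Each key becomes a child element of the same name.
        for key, child in value.items():
            to_xml(child, ET.SubElement(element, key))
    elif isinstance(value, list):
        # Assumption: every array entry is wrapped in parsley:group.
        for item in value:
            to_xml(item, ET.SubElement(element, "parsley:group"))
    else:
        element.text = str(value)

# A shortened version of the sample output.
data = {"title": "Amnesia", "links": [{"href": "/", "text": "Yelp"}]}

# The "parsley:" prefix is written literally, as in the sample, with no
# namespace declaration.
root = ET.Element("parsley:root")
to_xml(data, root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```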

This parselet could also have been expressed as:

{
  "title": "h1",
  "links(a)": [
    {
      "text": ".",
      "href": "@href"
    }
  ]
}

The "a" in links(a) is a "key selector" -- an explicit grouping (with scope) for the array: each element matching "a" produces one array entry, and the expressions inside the braces are evaluated relative to that element. You can use any XPath 1.0 or CSS3 expression as a value or a key selector; parsley tries to figure out which one you are using. You can also use CSS selectors inside XPath functions -- "substring-after(h1>a, ':')" is a valid expression.
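
To make the grouping concrete, here is a hand-written Python emulation of the example parselet using only the standard library. It is a sketch of the semantics, not how parsley is implemented. PAGE is a made-up, well-formed stand-in for a real page, and text_of and apply_links_parselet are hypothetical names.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; not the real Yelp markup.
PAGE = """
<html>
  <body>
    <h1>Amnesia</h1>
    <a href="/">Yelp</a>
    <a href="/signup">About Me</a>
  </body>
</html>
"""

def text_of(node):
    """Concatenate all text inside a node, like the XPath string value of '.'."""
    return "".join(node.itertext())

def apply_links_parselet(root):
    """Emulate {"title": "h1", "links(a)": [{"text": ".", "href": "@href"}]}."""
    result = {"title": text_of(root.find(".//h1"))}
    # Key selector "a": one group per matched element. "." and "@href"
    # are then evaluated relative to that element.
    result["links"] = [
        {"text": text_of(a), "href": a.get("href")}
        for a in root.iter("a")
    ]
    return result

result = apply_links_parselet(ET.fromstring(PAGE))
print(json.dumps(result, indent=2))
```

Without the key selector there would be no per-link scope for "." and "@href" to refer to, which is why the grouping and the scope go together.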

Variables

You can use $foo to access the value of the key "foo" in the current scope (i.e. nested curly brace depth). Also available are $parent.foo, $parent.parent.foo, $root.foo, $root.foo.bar, etc.
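
One way to picture these lookups is as a chain of scopes, one per curly-brace level. The Python sketch below is only a mental model under that assumption; Scope and resolve are hypothetical names, and parsley's actual resolution rules (for example, how a key literally named "parent" would be treated) are not covered here.

```python
class Scope:
    """One curly-brace level of a parselet, linked to its enclosing level."""

    def __init__(self, values, parent=None):
        self.values = values
        self.parent = parent

    def root(self):
        scope = self
        while scope.parent is not None:
            scope = scope.parent
        return scope

def resolve(reference, scope):
    """Resolve '$foo', '$parent.foo', '$root.foo.bar' and similar."""
    parts = reference.lstrip("$").split(".")
    if parts[0] == "root":
        scope, parts = scope.root(), parts[1:]
    while parts and parts[0] == "parent":
        scope, parts = scope.parent, parts[1:]
    value = scope.values
    for key in parts:
        value = value[key]  # Remaining segments walk into nested keys.
    return value

# Three nested levels with made-up keys.
top = Scope({"title": "Amnesia", "foo": {"bar": "baz"}})
link = Scope({"text": "Yelp"}, parent=top)
inner = Scope({"note": "x"}, parent=link)

print(resolve("$text", link))                 # current scope
print(resolve("$parent.title", link))         # one level up
print(resolve("$parent.parent.title", inner)) # two levels up
print(resolve("$root.foo.bar", inner))        # from the outermost level
```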

Custom Functions

You can write custom functions in XSLT (I'd also like to support C and JavaScript). They look like:

<func:function name="user:excited">
   <xsl:param name="input" />
   <func:result select="concat($input, '!!!!!!!')" />
</func:function>

If you run:

{
  "title": "user:excited(h1)"
}

on the Yelp page, you'll get:

{
  "title": "Amnesia!!!!!!!"
}
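
For readers less familiar with XSLT, the body of user:excited amounts to a one-line string concatenation. The Python analogue below only shows what the function computes; parsley itself takes the XSLT definition, and the Python name excited is hypothetical.

```python
def excited(value):
    """Python analogue of the XSLT body: concat($input, '!!!!!!!')."""
    return value + "!" * 7  # seven exclamation marks, as in the XSLT

# "Amnesia" is the h1 text reported in the example output.
title = excited("Amnesia")
print(title)
```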