thadguidry edited this page Jan 7, 2013 · 9 revisions

Using Jython as your Expression Language

Full docs on the Jython language are at its official site http://www.jython.org.

Note: The Jython extension has been bundled with OpenRefine since 2.1. Before that it was an extension which needed to be installed separately.

Note: You can use almost any Python file (.py or .pyc) that is compatible with the bundled Jython 2.5.1 by dropping it into the path. For instance, download and extract BeautifulSoup.py, drop it in, and use it to parse HTML and extract tags or content with Jython as your expression language in OpenRefine. Since Jython runs on the Java Virtual Machine, you can even import Java libraries and use them!

OpenRefine now has most of the Jsoup.org library built in for parsing HTML, working with its elements, and extracting content.

Built-in GREL Jsoup functions

Remember to restart OpenRefine after adding libraries, so that the new Jython/Python libraries are initialized during Butterfly's startup.

Using Jython and BeautifulSoup to handle Entity Extraction and HTML markup removal

A few HTML parsing Python libraries to experiment with:

  1. HTMLParser (bundled with Jython in OpenRefine)
  2. BeautifulSoup
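As a rough illustration of the bundled HTMLParser option, the sketch below removes markup from a string and keeps only the text. The names TextCollector and strip_tags are hypothetical, and the try/except import is only there so the same code also runs on Python 3, where the module is called html.parser. It does not attempt to decode entities such as &amp;.

```python
try:
    from HTMLParser import HTMLParser    # module name in Jython 2.5 / Python 2
except ImportError:
    from html.parser import HTMLParser   # module name in Python 3


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.parts.append(data)


def strip_tags(value):
    # `value` plays the role of the current cell's content.
    if value is None:
        return None
    collector = TextCollector()
    collector.feed(value)
    collector.close()
    return "".join(collector.parts)
```

Inside OpenRefine there is no need for the outer strip_tags function: the same statements would go straight into the expression box, ending with a return statement, as described further down.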

A few XML parsing Python libraries:

  1. ElementTree (bundled with Jython in OpenRefine)
  2. lxml will NOT work in Jython. It relies on C bindings for CPython (regular Python), and OpenRefine runs on Jython / Java only, with no CPython interpreter built in.
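A minimal sketch of the ElementTree option, assuming a cell holds a small, well-formed XML fragment. The function name first_title and the element name title are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def first_title(value):
    # `value` plays the role of the current cell's content.
    if value is None:
        return None
    root = ET.fromstring(value)
    node = root.find("title")   # first direct child named "title", or None
    if node is None:
        return None
    return node.text
```

Malformed XML makes ET.fromstring raise an exception, so real cell data may need a try/except around the parse.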

Expressions in Jython must have a return statement:

  return value[1:-1]
  return rowIndex%2
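One way to picture the return requirement is to treat the expression as the body of a function that receives value, rowIndex and the other variables as arguments. The sketch below models the two expressions above that way so they can be run outside OpenRefine; the function names are hypothetical, and the wrapping is an assumption about how the expression is evaluated.

```python
def strip_first_and_last(value):
    # Same body as the first expression: drop the first and last character.
    return value[1:-1]


def row_parity(rowIndex):
    # Same body as the second expression: 0 for even row indexes, 1 for odd.
    return rowIndex % 2
```

For example, the first expression turns a quoted cell such as "abc" (quotes included) into abc, and the second can be used to tell alternating rows apart.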

Fields have to be accessed using the bracket operator rather than the dot operator:

  return cells["col1"]["value"]
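The bracket lookups can be chained across columns. In the sketch below, plain dictionaries stand in for the cells variable so the code runs outside OpenRefine; the column names col1 and col2, the sample data, and the function name full_name are all hypothetical.

```python
# Hypothetical stand-in for `cells`: plain dicts, not the real OpenRefine object.
cells = {
    "col1": {"value": "Ada"},
    "col2": {"value": "Lovelace"},
}


def full_name(cells):
    # Bracket access for both the column and its "value" field.
    first = cells["col1"]["value"]
    last = cells["col2"]["value"]
    return first + " " + last
```

In an actual expression only the body is needed, ending with the return statement.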

To access the Levenshtein distance between the reconciled value and the cell value, use the recon field of the cell variable:

  return cell["recon"]["features"]["nameLevenshtein"]

To return the lower case of value (if the value is not null):

  if value is not None:
    return value.lower()
  return None