Extracting & Converting HTML markup
At some point you may have a dataset where one or more of the columns contain HTML markup. The markup tags may interfere with data extraction/cleansing, word counts, etc.
It's very easy to create a new column with the HTML tags removed using OpenRefine's support for regular expressions.
Attribution note: the regular expression below comes from Phil Haack's blog.
Assume you have a column Col1 containing HTML markup. Click on the dropdown next to Col1 and choose "Edit Column > Add Column Based On This Column...". Pick a name for the new column, and use the following expression:
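A minimal sketch of such an expression, using a deliberately simplified pattern rather than the attributed one. It assumes that no attribute value contains a literal > character, and it leaves entities such as &amp; untouched:

```text
value.replace(/<[^>]+>/, "")
```

Here replace() is applied with a regular expression, so every tag-like run from < to the next > is replaced with an empty string. Markup that breaks the assumption above (for example, a > inside a quoted attribute) would need a stricter pattern.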
Press OK, and you'll soon have a new column with the plain text extracted from Col1.
Extract HTML attributes, text, links with integrated GREL Jsoup commands
WARNING: The built-in HTML GREL commands return org.jsoup.nodes objects by default, not strings. Append .toString() where needed so that a string is written into the Refine cells. Otherwise the preview in the Expression Editor looks fine, BUT the Refine cells show no data when you apply the expression!
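As a rough illustration of the warning, assuming the cell holds an HTML page with at least one <div> whose id is content (a hypothetical page layout), the following selects the first matching element and converts it to a string so that it can be stored in a cell:

```text
value.parseHtml().select("div#content")[0].toString()
```

Without the trailing .toString(), the expression evaluates to a jsoup element object rather than a string.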
Useful Common Examples:
Extract all the <table> rows from a <div ID=content>:
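One possible way to write this, assuming the rows sit inside a <table> nested in the <div>. The variable name row and the newline separator are illustrative choices:

```text
forEach(value.parseHtml().select("div#content table tr"), row, row.toString()).join("\n")
```

select() takes a jsoup selector, forEach() turns each matched row into its HTML string, and join() collapses the resulting array into a single string that can be stored in a cell.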
Extract text contained within a tag
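A minimal sketch, using h1 as a placeholder for whatever tag you need and assuming at least one such tag exists in the cell:

```text
value.parseHtml().select("h1")[0].htmlText()
```

htmlText() already returns a string, so no .toString() suffix is needed here. If the selector matches nothing, indexing with [0] produces an error instead of text.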
For full documentation on the integrated jsoup commands, refer to jsoup's own selector syntax reference: http://jsoup.org/cookbook/extracting-data/selector-syntax .
Simple web scraping (a basic web scraper) can be built in OpenRefine with a repeatable pattern. For example, starting from a column that holds an HTML page with a bunch of links, you can add a new column that loops over the page (using forEach()) and extracts all the <a href=> links matching a regex pattern, such as links containing a digit (\d+). Joining the resulting array with a separator string lets you split the values afterwards and run Fetch URLs on every extracted link, scraping even more data off each linked page with a second Fetch URLs pass.
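The pattern above could be sketched as follows. This assumes the cell holds the fetched HTML, writes the digit test as [0-9]+ inside jsoup's [attr~=regex] attribute selector (which avoids backslash escaping inside the GREL string), and uses "--SPLITME--" as an arbitrary separator:

```text
forEach(value.parseHtml().select("a[href~=[0-9]+]"), link, link.htmlAttr("href")).join("--SPLITME--")
```

htmlAttr("href") returns the attribute exactly as written in the page, so relative links would need the site's base URL prepended before a Fetch URLs pass.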
After adding that new column with the above GREL snippet, choose Edit Cells -> Split multi-valued cells... and enter the same separator string that you passed to join(), such as "SplitCharsGoHere", "--SPLITME--", or "|||||". Any string works, as long as it does not occur in the extracted links.
After the Split multi-valued cells operation finishes, you should be left in records mode (rather than rows mode), with a bunch of URLs extracted from EACH original HTML record. You can then perform another Fetch URLs operation on all of them.