Assume you have a column Col1 containing HTML markup. Click the
dropdown next to Col1 and choose "Edit Column > Add Column Based On
This Column...". Pick a name for the new column, and enter an
expression that strips the markup from the cell value.
Press OK, and you'll soon have a new column with the plain text
extracted from Col1.
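As a minimal sketch (not necessarily the expression this page
originally showed), the following GREL should return the plain text of
the markup in each Col1 cell. It assumes the built-in jsoup-based HTML
functions are available in your version of Refine:

```grel
value.parseHtml().select("body")[0].htmlText()
```

A rougher fallback that needs no HTML functions is
value.replace(/<[^>]+>/, ""), which simply deletes anything that looks
like a tag and leaves entities such as &amp; untouched.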
Extract HTML attributes, text, links with integrated GREL Jsoup commands
WARNING: When working with the built-in HTML GREL commands, append
.toString() wherever the result has to be stored as a string in a
Refine cell (by default these commands return org.jsoup.nodes
objects). Otherwise the preview in the Expression Editor will look
fine, but the Refine cells will be empty after you apply the
expression!
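To illustrate the warning, here is a hedged sketch in which the h1
selector is only an assumed example. Without the final .toString(),
the expression ends on a jsoup element rather than a string:

```grel
value.parseHtml().select("h1")[0].toString()
```

Functions such as htmlText() and htmlAttr() already return strings, so
expressions that end with one of them should not need the suffix.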
Note: Version 2.1 now includes HTML functions (thanks, Iain Sproat!)
built upon jsoup.org, which is a Java
library built on
Useful Common Examples:
Extract all the <table> rows from a <div ID=content>:
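One way this could be written, as a sketch under the assumption that
the div has id="content" and that "|||||" is an acceptable separator
for your data:

```grel
forEach(value.parseHtml().select("div#content table tr"), row, row.toString()).join("|||||")
```

Each matching row is converted to its HTML string, and the rows are
joined into a single cell value that can be split again later.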
Simple web scraping can be done in OpenRefine with a pattern. For
example, you can add a new column based on a column holding an HTML
page that has a bunch of links. In the expression you loop (using
forEach()) over all the <a href=> links, keep the ones that match a
regex pattern (such as links containing a digit, \d+), and join the
resulting array with a separator of your choice. You can then split
the joined value and run Fetch URLs on all the extracted links to
scrape even more data from each linked page.
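The pattern described above can be sketched as follows. This is an
illustration rather than the page's original snippet; the variable
names a and u and the "--SPLITME--" separator are arbitrary choices:

```grel
filter(
  forEach(value.parseHtml().select("a[href]"), a, a.htmlAttr("href")),
  u,
  isNotNull(u.match(/.*?(\d+).*/))
).join("--SPLITME--")
```

forEach() collects the href of every link, filter() keeps only the
URLs that contain at least one digit (match() returns null when the
pattern does not match the whole string), and join() produces a single
string that can be split into one URL per row afterwards.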
After adding the new column with a GREL expression like the one
described above, choose Edit Cells -> Split multi-valued cells... and
enter the same separator that you passed to join(), such as
"SplitCharsGoHere", "--SPLITME--", or "|||||".
After the Split multi-valued cells operation finishes, you should be
left in records mode (rather than rows mode), with a bunch of URLs
extracted from EACH original HTML record. You can then perform another
Fetch URLs operation on all of them.