Understanding Regular Expressions
A regular expression is a string that describes a text pattern occurring in other strings. Consider the text
Once   upon a  time in    the land of  King   Arthur
The number of spaces between consecutive words varies. To change the text so that there is exactly one space between any two consecutive words, we want to replace every continuous sequence of one or more spaces (no matter how many) with exactly one space. We can use a regular expression to describe "any continuous sequence of one or more spaces (no matter how many)".
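As a rough illustration of the idea, here is a minimal Python sketch (Python is used only for illustration; the helper name `collapse_spaces` and the sample spacing are assumptions, not part of OpenRefine). The pattern ` +` means "one or more space characters".

```python
import re

def collapse_spaces(text):
    # " +" matches any run of one or more space characters;
    # each run is replaced by exactly one space.
    return re.sub(r" +", " ", text)

sample = "Once   upon a  time in    the land of  King   Arthur"
print(collapse_spaces(sample))
# prints: Once upon a time in the land of King Arthur
```

The same pattern should behave identically in Java regular expression syntax, which is what OpenRefine uses.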
OpenRefine and the General Refine Expression Language (GREL) support regular expressions in the syntax of Java regular expressions: see Java Regex Tutorial.
Jython supported Regex
You can also use Jython regular expressions instead of GREL functions, for example in a Custom Text Facet, with something like this:
import re
g = re.search(ur"\u2014 (.*),\s*BWV", value)
return g.group(1)
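The snippet above raises an error when the pattern is not found, because `re.search` then returns `None`. A minimal sketch of a more defensive variant, written for standalone Python 3 rather than Jython (so the `ur` prefix becomes `r`), might look like this. The function name `extract_title` and the sample value are hypothetical.

```python
import re

def extract_title(value):
    # \u2014 is an em dash. (.*) is a capturing group: it captures the text
    # between "em dash + space" and the final ", BWV" in the value.
    m = re.search(r"\u2014 (.*),\s*BWV", value)
    # re.search returns None when nothing matches, so guard before group(1).
    return m.group(1) if m else None

sample = "Bach \u2014 Toccata and Fugue in D minor, BWV 565"  # hypothetical cell value
print(extract_title(sample))          # prints: Toccata and Fugue in D minor
print(extract_title("no dash here"))  # prints: None
```

In an OpenRefine Jython expression the body would end with `return` instead of being wrapped in a function.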
Clojure supported Regex
To get the n-th element of the returned sequence, you can use the nth function.
(nth (re-find #"\u2014 (.*),\s*BWV" value) 1)
GREL supported Regex
To write a regular expression inside a GREL expression, wrap it between a pair of forward slashes /. For example, in
value.replace(/\s+/, " ")
the regular expression is \s+, which matches one or more whitespace characters.
Elsewhere in OpenRefine, i.e., not within a GREL expression, do not use slashes to wrap regular expressions.
GREL functions that support Regex:
To describe a pattern, you typically need concepts like something repeating again and again, or some things A and B occurring interchangeably. For example, if in a scientific paper you ever see a sequence of letters A, C, T, G, you know that's DNA being discussed. As a first try, we might formulate this regular expression to match a DNA sequence: [ACTG]+
The square brackets mean "one of the characters inside" and the + sign means "occurring one or more times". Let's use that to test whether a cell's text value contains any DNA sequence:
That almost works, except that it also finds single letters A, T, C, G, double-letter substrings like AT, 3-letter substrings like CAT, TAG, ... and the movie name GATTACA. To avoid these "false positives" (matches that are not wanted), we have to make the regular expression stricter by changing "occurring one or more times" to something like "occurring at least 8 times": [ACTG]{8,}
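The difference between the loose and the strict pattern can be sketched in Python as follows. The names `LOOSE`, `STRICT` and `contains`, and the sample strings, are assumptions made for illustration. The threshold of 8 is the one suggested in the text.

```python
import re

LOOSE = re.compile(r"[ACTG]+")      # one or more of A, C, T, G
STRICT = re.compile(r"[ACTG]{8,}")  # at least 8 of A, C, T, G in a row

def contains(pattern, text):
    # search() looks for the pattern anywhere inside the text.
    return pattern.search(text) is not None

print(contains(LOOSE, "The CAT sat"))          # True  (false positive)
print(contains(STRICT, "The CAT sat"))         # False
print(contains(STRICT, "GATTACA"))             # False (only 7 letters)
print(contains(STRICT, "primer GATTACAGGCT"))  # True
```

Any fixed threshold is a trade-off: a higher number removes more false positives but also misses short genuine sequences.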
DNA sequences don't follow a very tricky pattern. Let's consider something a bit more complicated: decimal numbers. These are valid decimal numbers:
We can describe a decimal number as follows:
- It optionally starts with a sign (minus or plus)
- Then it consists of a sequence of one or more digits
- Then, optionally, if it has a decimal part, it contains
  - a period, followed by
  - one or more digits
We can express that as this regular expression: [-+]?[0-9]+(\.[0-9]+)?
The question mark ? denotes an optional character or group of characters; parentheses group several characters so that ? applies to all of them. The period . needs to be escaped by preceding it with a backslash \. A digit is denoted as the range of characters from 0 to 9, written [0-9].
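The description of a decimal number above can be checked with a short Python sketch. The names `DECIMAL` and `is_decimal` and the test strings are assumptions for illustration; `fullmatch` is used so that the whole string, not just a part of it, has to fit the pattern.

```python
import re

# optional sign, one or more digits, optional (period + one or more digits)
DECIMAL = re.compile(r"[-+]?[0-9]+(\.[0-9]+)?")

def is_decimal(text):
    # fullmatch() succeeds only if the entire string matches the pattern.
    return DECIMAL.fullmatch(text) is not None

for candidate in ["42", "-3.14", "+0.5", "3.", ".5", "1.2.3", "abc"]:
    print(candidate, is_decimal(candidate))
# 42, -3.14 and +0.5 are accepted; 3., .5, 1.2.3 and abc are rejected
```

Note that this pattern deliberately rejects forms such as `.5` or `3.`, because the description requires digits on both sides of the period.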
We have encoded a digit as [0-9] but there is a shortcut: \d. There are several shortcuts:
- \d, same as [0-9]
- \D, same as [^0-9], meaning any character other than a digit
- \s, meaning any whitespace character (space, tab, new line, carriage return)
- \S, meaning any character other than whitespace characters
- \w, same as [a-zA-Z_0-9], meaning any character that can be part of a word (where the meaning of "word" is more liberal than an English word)
- \W, meaning any non-word character
- ., meaning any single character (by default, any character except a line terminator)
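A small Python sketch of the shortcuts listed above follows; the sample text is made up. One assumption to note: in Python 3, `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given, so the flag is used here to match the definitions in the list.

```python
import re

text = "Room 42, floor_3\tEast wing"  # made-up sample; \t is a tab

# \d+ : runs of digits
digits = re.findall(r"\d+", text, flags=re.ASCII)
# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text, flags=re.ASCII)
# \s+ : split on runs of whitespace (spaces and the tab)
parts = re.split(r"\s+", text)
# \D  : remove every character that is not a digit
digits_only = re.sub(r"\D", "", text, flags=re.ASCII)

print(digits)       # ['42', '3']
print(words)        # ['Room', '42', 'floor_3', 'East', 'wing']
print(parts)        # ['Room', '42,', 'floor_3', 'East', 'wing']
print(digits_only)  # 423
```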
Note that any character that has a special meaning within regular expressions must be escaped when you want to match it literally. For example, if you want to say "a period character" rather than "any character", then you must write \. because just writing . means "any character". Similarly,
- + must be written as \+
- ? must be written as \?
- [ must be written as \[
- and so forth.
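The effect of escaping can be seen in a short Python sketch. The sample text and the helper name `count_matches` are assumptions for illustration. `re.escape` is Python's standard helper for escaping a whole literal string; it is shown here only as a convenience and is not an OpenRefine function.

```python
import re

text = "Is 1+1=2? See [note]. Price: 3.50"  # made-up sample

def count_matches(pattern, s):
    # Number of non-overlapping matches of pattern in s.
    return len(re.findall(pattern, s))

print(count_matches(r"\.", text))  # 2  (literal periods only)
print(count_matches(r".", text))   # 33 (unescaped: every character)
print(count_matches(r"\+", text))  # 1
print(count_matches(r"\?", text))  # 1
print(count_matches(r"\[", text))  # 1

# Escaping a whole literal string at once:
literal = "1+1=2?"
print(re.search(re.escape(literal), text) is not None)  # True
```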