Understanding Regular Expressions

kolen edited this page Oct 3, 2014 · 13 revisions

Understanding regular expressions

Introduction

A regular expression is a string that describes a text pattern occurring in other strings. Consider the text

  Once upon    a  time  in the   land of  King Arthur

The number of spaces between consecutive words vary. In order to change the text so that there is precisely 1 space between any 2 consecutive words, conceptually, we want to replace any continuous sequence of one or more spaces (no matter how many) with exactly one space. We can use a regular expression to describe "any continuous sequence of one or more spaces (no matter how many)".

OpenRefine and OpenRefine Expression Language (GREL) support regular expressions in the syntax of Java regular expressions: see Java Regex Tutorial.

Jython supported Regex

You can also use Jython Regex instead of GREL functions and use a Custom Text Facet, etc with something like this for example:

import re
g = re.search(ur"\u2014 (.*),\s*BWV", value)
return g.group(1)

Clojure supported Regex

Clojure uses the same regex engine as Java and can be invoked with re-find, re-matches, etc. You can use #"pattern" reader macro. See Regex support.

To get n-th element of returned sequence, you can use nth function.

(nth (re-find #"\u2014 (.*),\s*BWV" value) 1)

GREL supported Regex

To write a regular expression inside a GREL expression, wrap it between a pair of forward slashes /. For example, in

  value.replace(/\s+/, " ")

the regular expression is

  \s+

Note for advanced users: That (wrapping with forward slashes) is similar to how you write regular expressions in Javascript. On the other hand, in Java, there is no native syntax for regular expressions, so you would have to write them as strings and escape them properly (e.g., "\\s+"). So even though the regular expression syntax in OpenRefine follows that of Java (due to implementation in Java), how you write a regular expression within a GREL expression is similar to Javascript, for convenience.

Elsewhere in OpenRefine, i.e., not within a GREL expression, do not use slashes to wrap regular expressions.

GREL functions that support Regex:
  1. replace
  2. match
  3. partition
  4. rpartition
  5. split

Basic Examples

To describe a pattern, you typically need concepts like something repeating again and again, or some things A and B occurring interchangeably. For example, if in a scientific paper you ever see a sequence of letters A, C, T, G, you know that's DNA being discussed. As the first try, we might formulate this regular expression to match a DNA sequence

  [ACTG]+

The square brackets mean "one of the characters inside" and the + sign means "occurring one or more times". Let's use that to test whether a cell's text value contains any DNA sequence:

  isNotNull(value.match(/[ACTG]+/))

That almost works, except that it also finds single letters A, T, C, G, double-letter substrings like AT, 3-letter substrings like CAT, TAG, ... and the movie name GATTACA. If we were to avoid these "false positives" (matches that are not wanted), then we have to make the regular expression stricter, by changing "occurring one or more times" to something like "occurring at least 8 or more times":

  [ACTG]{8,}+

DNA sequences don't follow a very tricky pattern. Let's consider something a bit more complicated: decimal numbers. These are valid decimal numbers:

  • 120.0232898
  • -0.2909
  • +2.343

We can describe a decimal number as follows:

  • It optionally starts with a sign (minus or plus)
  • Then it consists of a sequence of one or more digits
  • Then optionally if it has a decimal part, it contains
    • a period, followed by
    • one or more digits

We can express that as this regular expression

  [-+]?[0-9]+(\.[0-9]+)?

The question mark ? denotes an optional character or group of characters. The period . needs to be escaped by preceding it with a backslash \. A digit is denoted as the range of character from 0 to 9.

We have encoded a digit as [0-9] but there is a shortcut: \d. There are several shortcuts:

  • \d, same as [0-9]
  • \D, same as [^0-9], meaning any character other than a digit
  • \s, meaning any whitespace character (space, tab, new line, carriage return)
  • \S, meaning any character other than whitespace characters
  • \w, same as [a-zA-Z_0-9], meaning any character that can be part of a word (where the meaning of "word" is more liberal than an English word)
  • \W, meaning any non-word character
  • . meaning any single character

Note that any character that is used to describe patterns within regular expressions must be escaped. For example, if you want to say "a period character" rather than "any character", then you must write \. because just writing . means "any character". Similarly,

  • + must be written as \+
  • ? must be written as \?
  • [ must be written as \[
  • and so forth.
Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.