GREL String Functions

Thad Guidry edited this page Jun 23, 2016 · 21 revisions
Clone this wiki locally

String functions supported by OpenRefine Expression Language (GREL)

See also: All GREL Functions.

Basic

length(string s)

Returns the length of s as a number.

Testing String Characteristics

startsWith(string s, string sub)

Returns boolean indicating whether s starts with sub. For example, startsWith("food", "foo") returns true, whereas startsWith("food", "bar") returns false. You could also write the first case as "food".startsWith("foo").

endsWith(string s, string sub)

Returns boolean indicating whether s ends with sub. For example, endsWith("food", "ood") returns true, whereas endsWith("food", "odd") returns false. You could also write the first case as "food".endsWith("ood").

contains(string s, string sub)

Returns boolean indicating whether s contains sub. For example, contains("food", "oo") returns true whereas contains("food", "ee") returns false. You could also write the first case as "food".contains("oo").

Basic String Modification

Case Conversion

toLowercase(string s)

Returns s converted to lowercase.

toUppercase(string s)

Returns s converted to uppercase.

toTitlecase(string s)

Returns s converted to titlecase. For example, toTitlecase("Once upon a midnight dreary") returns the string Once Upon A Midnight Dreary.

Trimming

trim(string s) and strip(string s)

Returns a copy of the string, with leading and trailing whitespace removed. For example, trim(" island ") returns the string island.

chomp(string s, string sep)

Returns a copy of s with sep removed from the end if s ends with sep; otherwise, just returns s. For example, chomp("hardly", "ly") and chomp("hard", "ly") both return the string hard.

Substring

substring(s, number from, optional number to)

Returns the substring of s starting from character index from and upto character index to. If to is omitted, it's understood as the end of the string s. For example, substring("profound", 3) returns the string found, and substring("profound", 2, 4) returns the string of.

Character indexes start from zero. Negative character indexes are understood as counting from the end of the string. For example, substring("profound", 1, -1) returns the string rofoun.

slice(s, number from, optional number to)

See substring function above.

get(o, number or string from, optional number to)

See substring function above.

Find and Replace

indexOf(string s, string sub)

Returns the index of sub first ocurring in s as a character index; or -1 if s does not contain sub. For example, indexOf("internationalization", "nation") returns 5, whereas indexOf("internationalization", "world") returns -1.

lastIndexOf(string s, string sub)

Returns the index of sub last ocurring in s as a character index; or -1 if s does not contain sub. For example, lastIndexOf("parallel", "a") returns 3 (pointing at the second character "a").

replace(string s, string f, string r)

Returns the string obtained by replacing f with r in s. f can be a regular expression, in which case r can also contain capture groups declared in f.

For a simple example, replace("The cow jumps over the moon and moos", "oo", "ee") returns the string The cow jumps over the meen and mees.

More info on Regex

replaceChars(string s, string f, string r)

Returns the string obtained by replacing any character in s that is also in f with the character r. For example, replaceChars("commas , and semicolons ; are separators", ",;", "**") returns the string commas ** and semicolons ** are separators.

match(string s, regexp p)

Attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups. For example, match("230.22398, 12.3480", /.*(\d\d\d\d)/) returns an array of 1 string: 3480. match("230.22398, 12.3480", /.*\.(\d+).*\.(\d+)/) returns an array of 2 strings: 22398 and 3480.

Another match() example:

isNotNull(value.match(/(a.c)/))

returns True or False if value is like abc or azc , etc but would not match a value of ac. Remember to use Parentheses when using match() to denote the Capture Group or Groups inside your Regex expression so that you can get output when needed otherwise an empty array [] is shown for no match.

Example: Cell contains value:

hello 123456 goodbye

value.match(/\d{6}/) = null

value.match(/.*\d{6}.*/) = [ ] (empty array)

value.match(/.*(\d{6}).*/) = [ "123456" ] (array with one captured value)

value.match(/(.*)(\d{6})(.*)/) = [ "hello ", "123456", " goodbye" ] (array with three captured values)

More info on Regex

And a nice online learning & testing tool for Regex: RegExr

String Parsing and Splitting

toNumber(o)

Returns o converted to a number.

split(s, sep)

Returns the array of strings obtained by splitting s at wherever sep is found in it. sep can be either a string or a regular expression. For example, split("fire, water, earth, air", ",") returns the array of 4 strings: "fire", " water", " earth" , and " air". The double quotation marks are shown here only to highlight the fact that the spaces are retained.

More info on Regex

splitByLengths(string s, number n1, number n2, ...)

Returns the array of strings obtained by splitting s into substrings with the given lengths. For example, splitByLengths("internationalization", 5, 6, 3) returns an array of 3 strings: inter, nation, and ali.

smartSplit(string s, optional string sep)

Returns the array of strings obtained by splitting s by the separator sep. Handles quotes properly. Guesses tab or comma separator if sep is not given. Also, value.escape('javascript') is useful for previewing unprintable chars prior to using smartSplit.

smartSplit(value,"\n")  //split cell at Carriage Return or New Line char

splitByCharType(s)

Returns an array of strings obtained by splitting s into groups of consecutive characters where the characters within each group share the same unicode type, and consecutive groups differ in their unicode types.

for example the string HenryCTaylor will be split as follow

  • splitByCharType(value)[0] will return H
  • splitByCharType(value)[1] will return enry
  • splitByCharType(value)[2] will return CT
  • splitByCharType(value)[3] will return aylor

partition(string s, string or regex frag, optional boolean omitFragment)

Returns an array of strings [ a, frag, b ] where a is the substring within s before the first occurrence of frag in s, and b is the substring after frag in s. For example, partition("internationalization", "nation") returns 3 strings: inter, nation, and alization. If s does not contain frag, it returns an array of [ s, "", "" ] (the first string is the original s and the second and third strings are empty).

If omitFragment is true, frag is not returned. That is, the result is an array of only 2 elements.

Another Example using Regex with partition():

value.partition(/c.e/)[1] used on a cell with a value of "abcdefgh" will output "cde"

"abcdefgh".partition(/d..g/)[0] will output "abc", since "abc" is the 1st part [0], not the middle part [1] which is "defg", or the tail or last part [-1] which is "h".

More info on Regex

rpartition(string s, string or regex frag, optional boolean omitFragment)

Returns an array of strings [ a, frag, b ] where a is the substring within s before the last occurrence of frag in s, and b is the substring after frag in s. For example, partition("parallel", "a") returns 3 strings: par, a, and llel. If s does not contain frag, it returns an array of [ s, "", "" ] (the first string is the original s and the second and third strings are empty).

More info on Regex

Encoding and Hashing

diff(o1, o2, optional string timeUnit)

For strings, returns the portion where they differ. For dates, it returns the difference in given time units.

escape(string s, string mode)

Escapes s in the given escaping mode: html, xml, csv, url, javascript.

unescape(string s, string mode)

Unescapes s in the given escaping mode: html, xml, csv, url, javascript.

md5(string s)

Returns the MD5 hash of s.

sha1(string s)

Returns the SHA-1 hash of s.

phonetic(string s, string encoding)

Returns the a phonetic encoding of s, string encoding to use). Example:

  • phonetic(value,'doublemetaphone')
  • phonetic(value,'metaphone')
  • phonetic(value,'metaphone3')
  • phonetic(value,'soundex')
  • phonetic(value,'cologne')

reinterpret(string s, string encoder)

Returns s reinterpreted thru the given encoder. Supported encodings here.

fingerprint(string s)

Returns the fingerprint of s, a derived string that aims to be a more canonical form of it (this is mostly useful for finding clusters of strings related to the same information).

ngram(string s, number n)

Returns an array of the word ngrams of s.

ngramFingerprint(string s, number n)

Returns the n-gram fingerprint of s.

unicode(string s)

Returns an array of strings describing each character of s in their full unicode notation.

unicodeType(string s)

Returns an array of strings describing each character of s in their full unicode notation.

Freebase Specific

mqlKeyQuote(string s)

Quotes s so that it can be used as a Freebase key. More info: MQL key escaping

mqlKeyUnquote(string key)

Unquotes the MQL quoted string key back to its original form. More info: MQL key escaping