# Regex Functionality
Demonstration of regex functionality in arkouda `Strings`

The `CHPL_RE2` flag must be set to use this functionality. After setting it, rebuild Chapel.
```
export CHPL_RE2=bundled
```
This applies to Chapel v1.25.0. For v1.24, set the `CHPL_REGEXP` flag instead (`export CHPL_REGEXP=re2`).

The regex functionality uses Chapel's `regex` module, which is built on Google's `re2`. `re2` sacrifices some features, such as lookahead and lookbehind, in exchange for guarantees that searches complete in linear time with respect to the size of the input and in a fixed amount of stack space.
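
Because lookahead is unavailable, a pattern that relies on it usually has to be rewritten with a capture group. The sketch below uses plain Python `re` (which does accept lookahead) so both forms can be compared side by side. The sample text and variable names are made up for illustration and are not part of Arkouda.

```python
import re

text = "price: 42USD, weight: 7kg, cost: 13USD"

# Lookahead form: digits that are followed by "USD".
# Python's re accepts this, but RE2-based engines reject (?=...).
with_lookahead = re.findall(r"\d+(?=USD)", text)

# RE2-friendly form: consume "USD" and keep only the captured digits.
with_capture = re.findall(r"(\d+)USD", text)

print(with_lookahead)
print(with_capture)
```

Both lists contain the same digits here. The capture-group form consumes the trailing text, so it is not a drop-in replacement when matches are allowed to overlap.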

In [None]:
import arkouda as ak
ak.connect()
import re

In [None]:
strings = ak.array(['1_2___', '____', '3', '__4___5____6___7'])
pattern = '_+'

## ak.Match object
`ak.Match` objects are returned from `Strings.search`, `Strings.match`, and `Strings.fullmatch`

In [None]:
strings

In [None]:
print(f"strings.search('{pattern}'):\n{strings.search(pattern)}\n")
print(f"strings.match('{pattern}'):\n{strings.match(pattern)}\n")
print(f"strings.fullmatch('{pattern}'):\n{strings.fullmatch(pattern)}")
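
As a point of reference, the three calls differ only in where the match is allowed to sit. A minimal sketch with Python's `re` on the same sample data (plain lists, no Arkouda server involved):

```python
import re

samples = ['1_2___', '____', '3', '__4___5____6___7']
pattern = '_+'

# search: a match anywhere in the string
found_anywhere = [re.search(pattern, s) is not None for s in samples]
# match: the match must begin at the start of the string
found_at_start = [re.match(pattern, s) is not None for s in samples]
# fullmatch: the entire string must match
found_whole = [re.fullmatch(pattern, s) is not None for s in samples]

print(found_anywhere)  # [True, True, False, True]
print(found_at_start)  # [False, True, False, True]
print(found_whole)     # [False, True, False, False]
```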

The `ak.Match` object mimics the `re.Match` object applied to every element in `Strings`

In [None]:
ak_search = strings.search(pattern)
re_search = [re.search(pattern, strings[i]) for i in range(strings.size)]
for i in range(strings.size):
    print(f"ak_search[{i}]: {ak_search[i]}")
    print(f"re_search[{i}]: {re_search[i]}\n")

In [None]:
print(f"ak_search.matched(): {ak_search.matched()}")
print(f"ak_search.start(): {ak_search.start()}")
print(f"ak_search.end(): {ak_search.end()}")
print(f"ak_search.find_matches(): {ak_search.find_matches()}")

In [None]:
# check that ak.Match methods line up with re methods applied to every element of Strings
matched = ak.all(ak_search.matched() == ak.array([m is not None for m in re_search]))
start = ak.all(ak_search.start() == ak.array([m.start() for m in re_search if m is not None]))
end = ak.all(ak_search.end() == ak.array([m.end() for m in re_search if m is not None]))
find_matches = ak.all(ak_search.find_matches() == ak.array([m.string[m.start():m.end()] for m in re_search if m is not None]))

print(f"matched: {matched}")
print(f"start: {start}")
print(f"end: {end}")
print(f"find_matches: {find_matches}")

Unlike `re.Match`, `ak.Match` does not return the matches by default, to avoid flooding the client. The `find_matches` function returns a new `Strings` object containing only the matches. Setting the `return_match_origins` flag also returns the indices of the original `Strings` object where the matches were found.

In [None]:
print(f"ak_search.find_matches(return_match_origins=True): {ak_search.find_matches(return_match_origins=True)}")
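
The relationship between the matches and their origins can be sketched in plain Python. This is an illustration of the idea with `re.search`, assuming the origins are indices into the original array. It is not Arkouda's implementation.

```python
import re

samples = ['1_2___', '____', '3', '__4___5____6___7']
pattern = '_+'

matches = []  # first match of each string that has one
origins = []  # index of the string each match came from
for i, s in enumerate(samples):
    m = re.search(pattern, s)
    if m is not None:
        matches.append(m.group())
        origins.append(i)

print(matches)  # ['_', '____', '__']
print(origins)  # [0, 1, 3]
```

The string at index 2 (`'3'`) has no match, so it contributes nothing to either list. The origins are what allow each match to be tied back to its source string.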

`ak.Match` objects support capture groups which can be accessed using `.group()`

In [None]:
tug_of_war = ak.array(["Isaac Newton, physicist", '<---calculus--->', 'Gottfried Leibniz, mathematician'])

ak_captures = tug_of_war.search(r"(\w+) (\w+)")
print(f"ak_captures.group() = {ak_captures.group()}")
print(f"ak_captures.group(1) = {ak_captures.group(1)}")
print(f"ak_captures.group(2) = {ak_captures.group(2)}")

re_captures = [re.search("(\\w+) (\\w+)", tug_of_war[i]) for i in range(tug_of_war.size)]
print(f"re agree? {ak.all(ak_captures.group(2) == ak.array([m.group(2) for m in re_captures if m is not None]))}")
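
For readers less familiar with group numbering, here is a small plain-Python sketch on the same data: group 0 is the whole match, and groups 1 and 2 are the parenthesized sub-patterns in order. Strings without a match are skipped, mirroring the `if m is not None` filter used above.

```python
import re

samples = ["Isaac Newton, physicist", "<---calculus--->", "Gottfried Leibniz, mathematician"]

# Keep only the strings where the pattern matched
found = [m for m in (re.search(r"(\w+) (\w+)", s) for s in samples) if m is not None]

whole = [m.group(0) for m in found]   # entire match
first = [m.group(1) for m in found]   # first capture group
second = [m.group(2) for m in found]  # second capture group

print(whole)   # ['Isaac Newton', 'Gottfried Leibniz']
print(first)   # ['Isaac', 'Gottfried']
print(second)  # ['Newton', 'Leibniz']
```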

## Split
`split` will return a new `Strings` split by occurrences of `pattern` up to `maxsplit` times. If the `return_segments` flag is set, a mapping between original strings and new array elements will also be returned

In [None]:
strings

In [None]:
pattern = '_+'
maxsplit = 2
print(f"strings.split('{pattern}', maxsplit={maxsplit}, return_segments=True):\n{strings.split(pattern, maxsplit=maxsplit, return_segments=True)}")

This mimics `re.split` applied to each element of `strings`.

In [None]:
split, split_map = strings.split(pattern, maxsplit=maxsplit, return_segments=True)
for i in range(strings.size):
    print(f"strings[{i}]: '{strings[i]}'")
    print(f"re.split = {re.split(pattern, strings[i], maxsplit=maxsplit)}")
    print(f"ak.split = {split[split_map[i]:split_map[i + 1]] if i != strings.size - 1 else split[split_map[i]:]}\n")
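
The flat-array-plus-segments layout can be sketched in plain Python. The helper name `split_with_segments` is hypothetical, and the sketch assumes each segment entry is the offset in the flat array where that string's pieces begin, as the slicing above suggests.

```python
import re

def split_with_segments(strings, pattern, maxsplit=0):
    """Split every string and record where each string's pieces start."""
    flat = []      # all pieces, back to back
    segments = []  # starting offset in `flat` for each input string
    for s in strings:
        segments.append(len(flat))
        flat.extend(re.split(pattern, s, maxsplit=maxsplit))
    return flat, segments

flat, segments = split_with_segments(
    ['1_2___', '____', '3', '__4___5____6___7'], '_+', maxsplit=2
)
print(flat)      # ['1', '2', '', '', '', '3', '', '4', '5____6___7']
print(segments)  # [0, 3, 5, 6]
```

Empty strings appear wherever a delimiter sits at the start or end of an input, which is the same behavior as `re.split`.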

## Findall
The `findall` function returns a new `Strings` object containing only the matches. Setting the `return_match_origins` flag also returns the indices of the original `Strings` object where the matches were found.

In [None]:
strings

In [None]:
pattern = '_+'
print(f"strings.findall('{pattern}', return_match_origins=True):\n{strings.findall(pattern, return_match_origins=True)}")

In [None]:
# Verify the results of findall match re
for i in range(strings.size):
    print(re.findall(pattern, strings[i]))
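
One use for the origins is counting matches per original string. The following plain-Python sketch builds a flat match list and an origin list with `re.findall`, then tallies them. It assumes the origins are indices into the original array and does not call Arkouda.

```python
import re
from collections import Counter

samples = ['1_2___', '____', '3', '__4___5____6___7']
pattern = '_+'

matches, origins = [], []
for i, s in enumerate(samples):
    for m in re.findall(pattern, s):
        matches.append(m)
        origins.append(i)  # every match remembers its source string

per_string = Counter(origins)

print(matches)  # ['_', '___', '____', '__', '___', '____', '___']
print(origins)  # [0, 0, 1, 3, 3, 3, 3]
print([per_string[i] for i in range(len(samples))])  # [2, 1, 0, 4]
```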

## Sub
`sub` returns a new `Strings` obtained by replacing up to `count` of the leftmost non-overlapping occurrences of `pattern` in each string with the replacement `repl`.

`subn` returns the same `Strings` as `sub`, along with a pdarray containing the number of substitutions made per string.

In [None]:
strings

In [None]:
pattern = '_+'
repl = '-------'
count = 2
ak_sub = strings.sub(pattern, repl, count)
re_sub = [re.sub(pattern, repl, strings[i], count) for i in range(strings.size)]
print(f"ak_sub: {ak_sub}")
print(f"re_sub: {re_sub}")
print(f"re agree? {ak.all(ak_sub == ak.array(re_sub))}")

The default `count=0` replaces all occurrences of `pattern` with `repl`.

In [None]:
strings.subn(pattern, '-')
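
For comparison, Python's `re.subn` shows the effect of `count` on both the result and the number of substitutions. This is a plain-Python sketch on the same sample data.

```python
import re

samples = ['1_2___', '____', '3', '__4___5____6___7']
pattern = '_+'

# At most two replacements per string
limited = [re.subn(pattern, '-', s, count=2) for s in samples]
# count omitted (defaults to 0): replace every occurrence
unlimited = [re.subn(pattern, '-', s) for s in samples]

print(limited)    # [('1-2-', 2), ('-', 1), ('3', 0), ('-4-5____6___7', 2)]
print(unlimited)  # [('1-2-', 2), ('-', 1), ('3', 0), ('-4-5-6-7', 4)]
```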

## Substring Search
With `regex=True`, `contains`, `startswith`, and `endswith` each return a boolean array indicating whether each element contains, starts with, or ends with a match of the regex pattern.

In [None]:
strings

In [None]:
print(f"strings.contains('{pattern}', regex=True):\n{strings.contains(pattern, regex=True)}\n")
print(f"strings.startswith('{pattern}', regex=True):\n{strings.startswith(pattern, regex=True)}\n")
print(f"strings.endswith('{pattern}', regex=True):\n{strings.endswith(pattern, regex=True)}")
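
One way to picture these three checks is as the same pattern with different anchoring. The plain-Python sketch below expresses that idea with `re`. The anchoring trick for the ends-with case (`\Z`) is an assumption of this sketch and says nothing about how Arkouda implements it.

```python
import re

samples = ['1_2___', '____', '3', '__4___5____6___7']
pattern = '_+'

# contains: a match anywhere
contains = [re.search(pattern, s) is not None for s in samples]
# startswith: a match anchored at the beginning
starts = [re.match(pattern, s) is not None for s in samples]
# endswith: a match that runs up to the very end of the string
ends = [re.search(rf"(?:{pattern})\Z", s) is not None for s in samples]

print(contains)  # [True, True, False, True]
print(starts)    # [False, True, False, True]
print(ends)      # [True, True, False, False]
```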

## Peel
Peel off one or more delimited fields from each string (similar to `str.partition`), returning two new arrays of strings.

In [None]:
under = ak.array(['one_two', 'three_____four____five', 'six'])
under

In [None]:
print(f"under.peel('{pattern}', includeDelimiter=True, regex=True):\n{under.peel(pattern, includeDelimiter=True, regex=True)}")
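
A rough plain-Python picture of peeling a single field is sketched below. The helper `peel_once` is hypothetical, and the sketch rests on two assumptions: a string with no delimiter is returned whole in the second array, and `include_delimiter=True` keeps the delimiter on the end of the left piece. Compare against the output of the cell above before relying on either.

```python
import re

def peel_once(s, pattern, include_delimiter=False):
    """Split s at the first regex match into (left, right)."""
    m = re.search(pattern, s)
    if m is None:
        # Assumption: no delimiter means nothing is peeled off
        return '', s
    left_end = m.end() if include_delimiter else m.start()
    return s[:left_end], s[m.end():]

under = ['one_two', 'three_____four____five', 'six']
pairs = [peel_once(s, '_+', include_delimiter=True) for s in under]
left = [p[0] for p in pairs]
right = [p[1] for p in pairs]

print(left)   # ['one_', 'three_____', '']
print(right)  # ['two', 'four____five', 'six']
```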

## Flatten
Given an array of strings where each string encodes a variable-length sequence delimited by a common substring, flattening offers a method for unpacking the sequences into a flat array of individual elements

In [None]:
under

In [None]:
print(f"under.flatten('{pattern}', return_segments=True, regex=True):\n{under.flatten(pattern, return_segments=True, regex=True)}")
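
Going the other way, the segments make it possible to regroup the flat array into one sequence per original string. This plain-Python sketch assumes each segment is the starting offset of a string's elements in the flat array. The flat array is built here with `re.split` purely for illustration.

```python
import re

under = ['one_two', 'three_____four____five', 'six']
pattern = '_+'

# Build a flat array and its segment offsets
flat, segments = [], []
for s in under:
    segments.append(len(flat))
    flat.extend(re.split(pattern, s))

# Regroup: append the total length as a final bound so the last
# string needs no special case
bounds = segments + [len(flat)]
regrouped = [flat[bounds[i]:bounds[i + 1]] for i in range(len(segments))]

print(flat)       # ['one', 'two', 'three', 'four', 'five', 'six']
print(segments)   # [0, 2, 5]
print(regrouped)  # [['one', 'two'], ['three', 'four', 'five'], ['six']]
```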

## Find Locations
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of the matches.

In [None]:
strings

In [None]:
print(f"strings.find_locations('{pattern}'):\n{strings.find_locations(pattern)}")

In [None]:
pattern = r'\d'
print(f"strings.find_locations(r'{pattern}'):\n{strings.find_locations(pattern)}")
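
The three returned arrays can be pictured with a plain-Python sketch built on `re.finditer`. The helper `locate_matches` is hypothetical. In this sketch the start positions are relative to each individual string, and whether Arkouda reports them the same way should be checked against the output above.

```python
import re

def locate_matches(strings, pattern):
    """Return (matches per string, start of each match, length of each match)."""
    counts, starts, lengths = [], [], []
    for s in strings:
        spans = [m.span() for m in re.finditer(pattern, s)]
        counts.append(len(spans))
        starts.extend(start for start, _ in spans)
        lengths.extend(end - start for start, end in spans)
    return counts, starts, lengths

samples = ['1_2___', '____', '3', '__4___5____6___7']
counts, starts, lengths = locate_matches(samples, '_+')

print(counts)   # [2, 1, 0, 4]
print(starts)   # [1, 3, 0, 0, 3, 7, 12]
print(lengths)  # [1, 3, 4, 2, 3, 4, 3]
```

The counts array has one entry per input string, while the starts and lengths arrays have one entry per match, so the counts are needed to tell which matches belong to which string.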

`cached_regex_patterns` shows which regex patterns have been cached for that `Strings`

In [None]:
strings.cached_regex_patterns()

In [None]:
strings.purge_cached_regex_patterns()

In [None]:
strings.cached_regex_patterns()

## Shutdown

In [None]:
ak.shutdown()