# Regular Expressions

This notebook introduces some basic regular expression functions from Python's `re` module.

In [None]:
import re

some_text = "This is some text that needs processing #regex"
word_regex = re.compile(r"\w+")
ht_at_regex = re.compile(r"([@#])\w+")

In [None]:
print(some_text)

In [None]:
tokens = word_regex.findall(some_text)
print(tokens)

In [None]:
tokens = ht_at_regex.findall(some_text)
print(tokens)

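Note that if the regex contains a capturing group `()`, then `findall` returns only the text captured by the group, not the full match (with several groups, it returns a tuple per match). Use `search` or `finditer` instead to get match objects, whose `group(0)` is the whole match (see below).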
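
The difference can be seen side by side in a small sketch. The sample string and the variable names here are made up for illustration; the non-capturing group `(?:...)` is one way to keep the grouping while still getting full matches from `findall`.

```python
import re

sample = "Ask @alice about #regex"  # hypothetical sample text

grouped = re.compile(r"([@#])\w+")      # capturing group around the prefix
ungrouped = re.compile(r"(?:[@#])\w+")  # same pattern with a non-capturing group

group_only = grouped.findall(sample)
full_from_iter = [m.group(0) for m in grouped.finditer(sample)]
full_from_findall = ungrouped.findall(sample)

print(group_only)         # ['@', '#']
print(full_from_iter)     # ['@alice', '#regex']
print(full_from_findall)  # ['@alice', '#regex']
```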

In [None]:
match = ht_at_regex.search(some_text)
print(match)

In [None]:
print(match.group(0))

In [None]:
print(match.group(1))

In [None]:
iterator = word_regex.finditer(some_text)

In [None]:
for match in iterator:
    print(match)

In [None]:
replaced = word_regex.sub("word", some_text)
print(replaced)

In [None]:
tagged = word_regex.sub(r"\g<0>_word", some_text)
print(tagged)

In [None]:
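def reverse(match):
    # Called once per match; the returned string replaces the matched text.
    return match.group()[::-1]

# Named reversed_text so the built-in reversed() is not shadowed.
reversed_text = word_regex.sub(reverse, some_text)
print(reversed_text)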

## Unicode and regex
Python's `re` module is Unicode-aware, but its Unicode support is limited.

In [None]:
text = "The café served pizzas with jalapeños"
word_regex = re.compile(r"\w+")
word_regex.findall(text)

In [None]:
ascii_word_regex = re.compile(r"\w+", re.ASCII)
ascii_word_regex.findall(text)

Many other Unicode features exist for regular expressions (see https://www.regular-expressions.info/unicode.html), but most of them aren't available in Python's standard `re` module (see https://www.regular-expressions.info/refunicode.html).
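
Two of these limits can be observed with the standard library alone. The sketch below is an illustration rather than part of the original notebook: the helper `compiles_with_re` and the sample word are hypothetical, and it assumes Python 3.7 or later, where `re` rejects unknown escapes such as `\p`.

```python
import re

def compiles_with_re(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

word_ok = compiles_with_re(r"\w+")         # ordinary word pattern
property_ok = compiles_with_re(r"\p{L}+")  # Unicode property escape

# "n" followed by U+0303 COMBINING TILDE instead of the single character "ñ".
decomposed = "jalapen\u0303os"
decomposed_tokens = re.findall(r"\w+", decomposed)

print(word_ok)            # True
print(property_ok)        # False
print(decomposed_tokens)  # ['jalapen', 'os']
```

The combining mark is not a word character for `re`, so the word is split in two.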

Fortunately, another regex library is available: https://pypi.org/project/regex/. It is backwards-compatible with `re` (so you can use the same functions), but offers much more functionality.

In [None]:
import regex
word_regex = regex.compile(r"\w+")

In [None]:
word_regex.findall(text)

In [None]:
word_regex = regex.compile(r"\p{L}+")

In [None]:
word_regex.findall(text)

In [None]:
# Same sentence, but the "ñ" is now written as "n" followed by U+0303 (combining tilde).
text = "The café served pizzas with jalapen\u0303os"

In [None]:
word_regex.findall(text)

In [None]:
combiner_regex = regex.compile(r"\p{M}+")

In [None]:
combiner_regex.findall(text)

In [None]:
word_regex = regex.compile(r"(?:\p{L}\p{M}*)+")

In [None]:
word_regex.findall(text)
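
For comparison, the idea behind `(?:\p{L}\p{M}*)+` (letters, each optionally followed by combining marks) can be approximated without the `regex` package. This is only a rough sketch under the assumption that the general categories reported by `unicodedata` are a good enough stand-in for `\p{L}` and `\p{M}`; the function name `letter_mark_words` is hypothetical.

```python
import unicodedata

def letter_mark_words(text):
    """Collect runs of letters, where each letter may be followed by combining marks."""
    words, current = [], []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            current.append(ch)            # a letter starts or extends a word
        elif category.startswith("M") and current:
            current.append(ch)            # a mark only extends an existing word
        else:
            if current:                   # anything else ends the current word
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words

sample = "The caf\u00e9 served pizzas with jalapen\u0303os"
words = letter_mark_words(sample)
print(words)
```

With this approach the last token keeps its combining tilde instead of being split at it.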

## Emoji
A very large number of emoji are available. The full list is at http://www.unicode.org/emoji/charts/full-emoji-list.html

In [None]:
skintones = "\U0001F596\U0001F596\U0001F3FE"

In [None]:
print(skintones)

In [None]:
emoji = "\U0001F46E\U0001F3FC\u200D\u2640\uFE0F"

In [None]:
print(emoji)

In [None]:
print(list(emoji))
print(len(emoji))
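
To see why a single visible emoji has a length greater than one, it can help to name each code point. The sketch below uses the standard `unicodedata` module; the variable names are illustrative, and the exact names returned depend on the Unicode database shipped with your Python version.

```python
import unicodedata

# Same sequence as the emoji string above, repeated so the block is self-contained.
officer = "\U0001F46E\U0001F3FC\u200D\u2640\uFE0F"

# Fall back to a placeholder if a code point has no name in this Unicode database.
names = [unicodedata.name(ch, "<unnamed>") for ch in officer]

for ch, name in zip(officer, names):
    print(f"U+{ord(ch):04X}  {name}")
```

The sequence is a base emoji, a skin tone modifier, a zero width joiner, a gender sign, and a variation selector, which together render as one glyph.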

In [None]:
emojis = skintones + emoji
print(emojis)

In [None]:
grapheme_regex = regex.compile(r"\X")

In [None]:
grapheme_regex.findall(emojis)

In [None]:
hits = grapheme_regex.findall(emojis)
for hit in hits:
    print(hit)

More Unicode fun:

- https://norasandler.com/2017/11/02/Around-the-with-Unicode.html
- https://blog.jonnew.com/posts/poo-dot-length-equals-two