# Social Data Science PhD course
22nd of November, 2021

Signe Sørensen and Thomas Arildsen, CLAAUDIA

# Text processing with regular expressions

What are regular expressions?

> A **regular expression** (shortened as **regex** or **regexp**; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

\[[Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)\]

So, a "regex" is a sort of special text string that specifies a pattern you want to search for.

A simpler example that may be familiar to some is "wildcard" text strings:

    some*
    
- Matches: "something", "some", "somewhere", etc.
- Often used when listing / searching for files:

In [None]:
!ls *.ipynb

"Wildcard" text strings are just a simple example of specifying text patterns.
- Regular expressions are more general than this
- Regular expressions are a combination of *literal characters* and *meta-characters*
- Literal characters: for example ordinary letters that should be taken as-is: "b"
- Meta-characters: typically special or punctuation characters that serve a functional role in the regex: "." (matches any 1 character)
  
      b.

- As a pattern for a whole string, this matches "be", "by", "b5", "b:", etc. *Not* "beer", "banana", etc.
- Notice how the meta-character "." matches only 1 character. It must be combined with more meta-characters to match longer string parts.
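
The difference between a wildcard and the regex `b.` can be sketched as follows. This is only an illustration: the word list is made up, `fnmatch` is Python's standard module for shell-style wildcards, and `re.fullmatch` is used here on the assumption that the pattern should describe the whole string.

```python
import fnmatch
import re

words = ["be", "by", "b5", "beer", "banana"]  # made-up example words

# Shell-style wildcard: "*" stands for any run of characters (including none).
wildcard_hits = [w for w in words if fnmatch.fnmatchcase(w, "b*")]

# Regex: "." stands for exactly one character; fullmatch requires the
# whole string to fit the pattern.
regex_hits = [w for w in words if re.fullmatch(r"b.", w)]

print(wildcard_hits)  # all five words
print(regex_hits)     # ['be', 'by', 'b5']
```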

# Regular expressions in Python

Python has a module with functionality for working with regular expressions: `re`
- Part of the Python standard library, so no need to install anything extra

In [None]:
import re

Let us first see how to use regular expressions in Python, and then build up what regular expressions can actually look like.

In [None]:
re.search?

We can use `re.search` to search for a particular regex in a text string.


## How to write regex patterns in Python

`\` in Python strings is used to "escape" certain special characters.
- For example, we must type `'\\'` to get a backslash or `'\n'` to get a new line - try it:

In [None]:
print('\\')

In [None]:
print('Hi\nthere')

In regex, `\` is also used to denote special sequences - as we shall see shortly.
- This can "collide" with Python's own escaping of characters in text strings, which makes it cumbersome to write special sequences in regex.
- Solution: use Python's "raw text strings" for regex:

In [None]:
print(r"Text with \ in it")

The `r` prefix suppresses Python's special handling of the backslash.
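
To make the effect of raw strings concrete, here is a small sketch comparing ordinary and raw strings. The variable names and the example path are made up for illustration.

```python
import re

newline = '\n'       # one character: a line break
raw_newline = r'\n'  # two characters: a backslash followed by "n"
print(len(newline), len(raw_newline))  # 1 2

# A regex that matches one literal backslash is written \\ in regex syntax.
raw_pattern = r'\\'     # raw string: type the regex as-is
plain_pattern = '\\\\'  # ordinary string: every backslash doubled again
print(raw_pattern == plain_pattern)  # True

path = 'C:\\temp'  # the text C:\temp
backslash_span = re.search(raw_pattern, path).span()
print(backslash_span)  # (2, 3)
```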

## Searching with regex

Now we are ready to try some regex searching. What does `re.search` return?

In [None]:
re.search(r'word', 'Text with words in it ;-)')

What happens when a pattern is not found?

In [None]:
re.search(r'needle', 'haystack')
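
The two cases above can be handled in code roughly like this. It is a minimal sketch; the names `found` and `missing` are only for illustration.

```python
import re

found = re.search(r'word', 'Text with words in it ;-)')
missing = re.search(r'needle', 'haystack')

# A successful search returns a match object, which is always "truthy".
if found:
    print(found.group())  # the matched text: word
    print(found.span())   # (start, end) indices: (10, 14)

# An unsuccessful search returns None, so check before calling .group().
print(missing is None)  # True
```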

Regex searches are case-sensitive by default:

In [None]:
re.search(r'a', "A string with some a's in it")

Unless you tell them not to:

In [None]:
re.search(r'a', "A string with some a's in it", flags=re.I)
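
As a sketch of what the flag changes, the two searches can be compared by where they match. `re.IGNORECASE` is the long name for `re.I`; the variable names are illustrative.

```python
import re

text = "A string with some a's in it"

sensitive = re.search(r'a', text)
insensitive = re.search(r'a', text, flags=re.IGNORECASE)

print(sensitive.span())    # (19, 20): the first lower-case "a"
print(insensitive.span())  # (0, 1): the leading "A"
print(re.I == re.IGNORECASE)  # True
```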

What happens when there are multiple matches?

In [None]:
re.search(r'a', 'A woman walking her dog')

We have to use another variant when we want all matches:

In [None]:
re.findall(r'a', 'A woman walking her dog')
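
`re.findall` returns only the matched strings. If the positions are needed as well, `re.finditer` yields one match object per match. A small sketch, with illustrative variable names:

```python
import re

text = 'A woman walking her dog'

all_matches = re.findall(r'a', text)
positions = [m.start() for m in re.finditer(r'a', text)]

print(all_matches)  # ['a', 'a']
print(positions)    # [5, 9]
```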

# Regex syntax

Now we know a little better how `re.search` behaves. Time to look closer at how to write regex.
- So far, we have seen simple examples with literal characters

      r'needle'
      
- The real strength of regex comes from meta-characters, or special characters, and special sequences

`.`

In [None]:
re.search(r'.', 'haystack')

We have seen this one before; it matches any 1 character (except a newline, by default).

Multiple characters; zero or more (`*`):

In [None]:
re.search(r'.*', 'haystack')

In [None]:
re.search(r'a*', 'haystack')

In [None]:
re.search(r'ay*', 'haystack')

In [None]:
re.search(r'ay*', 'aardvark')

One or more (`+`):

In [None]:
re.search(r'a+', 'haystack')

In [None]:
re.search(r'a+', 'aardvark')

In [None]:
re.search(r'ay+', 'haystack')

In [None]:
re.search(r'ay+', 'aardvark')

Zero or one (`?`):

In [None]:
re.search(r'ay?', 'aardvark')

In [None]:
re.search(r'a?', 'aardvark')

In [None]:
re.search(r'ay+', 'ayyyyyy')

In [None]:
re.search(r'ay?', 'ayyyyyy')

Specifying the number of repetitions (`{m}` or `{m,n}`):

In [None]:
re.search(r'ay{3}', 'ayyyyyy')

In [None]:
re.search(r'ay{3,5}', 'ayy')
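
The quantifier examples above can be summarised in one sketch. `first_match` is a hypothetical helper written for this illustration, not part of `re`. Note that the quantifiers are greedy (they take as much as they can) and that `*` and `?` are satisfied by an empty match.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: return the matched text, or None if nothing matched."""
    match = re.search(pattern, text)
    return match.group() if match else None

print(repr(first_match(r'a*', 'haystack')))      # '' (zero "a"s at position 0)
print(repr(first_match(r'ay*', 'haystack')))     # 'ay'
print(repr(first_match(r'a+', 'aardvark')))      # 'aa'
print(repr(first_match(r'ay+', 'aardvark')))     # None (at least one "y" required)
print(repr(first_match(r'ay?', 'ayyyyyy')))      # 'ay'
print(repr(first_match(r'ay{3}', 'ayyyyyy')))    # 'ayyy'
print(repr(first_match(r'ay{3,5}', 'ayyyyyy')))  # 'ayyyyy'
print(repr(first_match(r'ay{3,5}', 'ayy')))      # None (fewer than 3 "y"s)
```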

What if we want to match a special character literally, for example `?`, `*`, or `+`? "Escape" it with `\`:

In [None]:
re.search(r'\*', 'A string with *s in it?')

In [None]:
re.search(r'\?', 'A string with *s in it?')
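
When the text to search for comes from a variable, escaping by hand is error-prone. One option is `re.escape` from the standard library, which escapes the special characters for you. A minimal sketch, where `literal` is a made-up search text:

```python
import re

text = 'A string with *s in it?'
literal = 'it?'  # made-up text we want to find exactly as written

escaped = re.search(re.escape(literal), text)
print(escaped.group())  # it?

# Unescaped, "?" makes the preceding "t" optional, so the first "i" matches.
unescaped = re.search(literal, text)
print(unescaped.group())  # i
```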

To be more selective than `.` (any character), use character sets (`[]`):

In [None]:
re.search(r'[irt]', 'A string with *s in it?')

In [None]:
re.search(r'[irt]+', 'A string with *s in it?')

Negate contents with `^`:

In [None]:
re.search(r'[^irt]', 'A string with *s in it?')

Character ranges (`[ - ]`):

In [None]:
re.search(r'[a-z]+', 'A string with *s in it?')

In [None]:
re.search(r'[a-z ]+', 'A string with *s in it?')
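
Combining sets and ranges with `re.findall` gives a quick way to pull pieces out of a string. The following sketch reuses the example string from above; the variable names are illustrative. Inside a set, characters such as `*` and `?` lose their special meaning.

```python
import re

text = 'A string with *s in it?'

first_run = re.search(r'[irt]+', text).group()
lowercase_runs = re.findall(r'[a-z]+', text)
with_spaces = re.search(r'[a-z ]+', text).group()
specials = re.findall(r'[*?]', text)  # "*" and "?" are literal inside []

print(first_run)          # tri
print(lowercase_runs)     # ['string', 'with', 's', 'in', 'it']
print(repr(with_spaces))  # ' string with '
print(specials)           # ['*', '?']
```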

Either/or (`|`):

In [None]:
re.search(r't|r', 'Some string')

In [None]:
re.search(r'(t|r)+', 'Some string')

Grouping `()`:

In [None]:
match = re.search(r'([a-z]+)', 'Some string')
match

In [None]:
match.groups()

In [None]:
match = re.search(r'([a-z]+).+?([a-z]+)', 'Some string')
match

In [None]:
match.groups()
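
The `?` after `.+` in the pattern above makes the quantifier "lazy": it takes as few characters as possible instead of as many as possible. The sketch below compares the two variants on the same string; the names `lazy` and `greedy` are illustrative.

```python
import re

text = 'Some string'

lazy = re.search(r'([a-z]+).+?([a-z]+)', text)
greedy = re.search(r'([a-z]+).+([a-z]+)', text)

print(lazy.groups())    # ('ome', 'string')
print(greedy.groups())  # ('ome', 'g'): ".+" swallowed ' strin'

# group(0) is the whole match; group(1), group(2), ... are the groups.
print(lazy.group(0))  # ome string
print(lazy.group(2))  # string
```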

## Character classes

Specifying a word like `r'[A-Za-z]+'` may be a bit cumbersome.
- Quicker way of specifying character classes like words (`\w` matches "word" *characters*; `[a-zA-Z0-9_]` in ASCII):

In [None]:
re.search(r'\w', "String with words in it")

In [None]:
re.search(r'\w+', "String with words in it")

Most character classes have an *opposite* class as well (the class letter in capital):

In [None]:
re.search(r'\W', "String with words in it")

Decimal digits (`\d`):

In [None]:
re.search(r'\d', "A string with 7 and 13 in it")
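
Character classes combine with quantifiers and `re.findall` in the same way as character sets. A short sketch using the string from the previous cell, with illustrative variable names:

```python
import re

text = "A string with 7 and 13 in it"

numbers = re.findall(r'\d+', text)
words = re.findall(r'\w+', text)
first_non_word = re.search(r'\W', text).group()

print(numbers)               # ['7', '13']
print(words)                 # ['A', 'string', 'with', '7', 'and', '13', 'in', 'it']
print(repr(first_non_word))  # ' '
```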

See more character classes and regex features [in the documentation](https://docs.python.org/3.8/library/re.html#index-24).

Regex search supports Unicode characters in general:

In [None]:
re.search(r'💚', "Lots of ❤️🧡💛💚💙💜🤎 in many colours")

## Substituting and splitting

Regex can also be used for more things than just searching:
- Substitution:

In [None]:
re.sub(r'simple', r'complicated', 'This is a simple example')

Incorporating matched groups:

In [None]:
match = re.search(r'(tast)(.+)(chees)', 'This is a tasty cheese')
match.groups()

In [None]:
re.sub(r'(tast)(.+)(chees)', r'\3\2\1', 'This is a tasty cheese')
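
Splitting works along the same lines with `re.split`, which cuts a string wherever the pattern matches. The sketch below repeats the substitution above and adds two splits; the input strings for the splits are made up for illustration.

```python
import re

# Substitution: \1, \2, \3 in the replacement refer to the matched groups.
swapped = re.sub(r'(tast)(.+)(chees)', r'\3\2\1', 'This is a tasty cheese')
print(swapped)  # This is a cheesy taste

# Splitting on any run of whitespace.
words = re.split(r'\s+', 'Split   on\tany whitespace')
print(words)  # ['Split', 'on', 'any', 'whitespace']

# Splitting on commas or semicolons, ignoring surrounding spaces.
fields = re.split(r'\s*[,;]\s*', 'apples, pears;plums ; figs')
print(fields)  # ['apples', 'pears', 'plums', 'figs']
```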