# Regular expressions

> _"One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings. This practical language is used in every computer language, word processor, and text processing tool [...]. Formally, a regular expression is an algebraic notation for characterizing a set of strings. They are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through."_

Source: "Speech and Language Processing" by Jurafsky and Martin, [chapter 2](https://web.stanford.edu/~jurafsky/slp3/2.pdf).

## What are regular expressions

Regular expressions (regex) are **patterns that describe sequences of characters** to look for in a string. They let us extract information from text more flexibly than exact string comparison does.

Typical uses of regular expressions:
* Searching for patterns in text
* Replacing strings with others
* Validating text: is the input well-formatted?

To use regular expressions in Python, we import the built-in ```re``` module:

In [None]:
import re

## Some basic regex symbols:
 
* ```.``` : matches any single character (except a newline, by default).
* ```*``` : the preceding character repeats zero or more times.
* ```+``` : the preceding character repeats one or more times.
* ```^``` : marks the start of the string.
* ```$``` : marks the end of the string.
* ```?``` : indicates that the preceding character is optional (zero or one occurrence).
* ```\d``` : matches any digit (0-9).
* ```\b``` : indicates a word boundary (i.e. the beginning or end of a word).
* ```\.``` : matches a literal full stop.
* ```[A-Z]``` : matches any uppercase letter (A-Z).
* ```[a-z]``` : matches any lowercase letter (a-z).
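
The symbols above can be tried out one at a time. The following is a minimal sketch: the sample strings are made up for illustration, and it uses `re.search`, a standard `re` function not otherwise covered here, which looks for the pattern anywhere in the string.

```python
import re

# Each tuple is (pattern, text); the sample texts are invented for illustration.
examples = [
    (r'c.t', 'cat'),                # . matches any single character
    (r'ab*c', 'ac'),                # * allows zero repetitions of 'b'
    (r'ab+c', 'abbbc'),             # + needs at least one 'b'
    (r'colou?r', 'color'),          # ? makes the 'u' optional
    (r'^\d\d$', '42'),              # ^ and $ anchor two digits to the whole string
    (r'\bcat\b', 'a cat sat'),      # \b requires 'cat' to be a separate word
    (r'[A-Z][a-z]+\.', 'Hello.'),   # capital, lowercase letters, literal full stop
]

# re.search returns a match object (truthy) or None (falsy).
results = [bool(re.search(pattern, text)) for pattern, text in examples]
print(results)

# Counter-examples: these patterns should NOT match.
no_b = re.search(r'ab+c', 'ac')                # + needs at least one 'b'
inside_word = re.search(r'\bcat\b', 'concatenate')  # 'cat' is inside a word
print(no_b, inside_word)
```

Every entry in `results` should be `True`, while both counter-examples should give `None`.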

## Some useful functions:
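* `re.match(pattern, str)`: does a regular expression match at the beginning of the string? (To require a match of the whole string, end the pattern with ```$``` or use `re.fullmatch`.)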
* `re.findall(pattern, str)`: find all matches of a regular expression in a string.
* `re.sub(pattern, replacement, str)`: replace all instances of the found pattern in the string by the replacement.

### Matching with `re.match(pattern, str)`

In [None]:
import re

str1 = "ACT 1, SCENE 12"
pattern = r'^ACT \d+, SCENE \d+$'

if re.match(pattern, str1):
    print(str1)
else:
    print("There's no match")
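
The pattern above matches the whole string only because it is anchored with ```^``` and ```$```. As a small sketch of what happens without anchors, the snippet below compares `re.match` with two other standard `re` functions, `re.fullmatch` and `re.search`; the variable names are illustrative only.

```python
import re

text = "ACT 1, SCENE 12"
pattern_open = r'ACT \d+'   # no ^ or $ anchors

# re.match only checks the start of the string, so trailing text is allowed.
starts_with_act = re.match(pattern_open, text) is not None

# re.fullmatch requires the pattern to cover the entire string.
whole_is_act = re.fullmatch(pattern_open, text) is not None

# re.match does not scan ahead: "SCENE" is not at position 0.
scene_at_start = re.match(r'SCENE \d+', text) is not None

# re.search looks for the first match anywhere in the string.
scene_anywhere = re.search(r'SCENE \d+', text) is not None

print(starts_with_act, whole_is_act, scene_at_start, scene_anywhere)
# True False False True
```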

### Matching and capturing with ```.group()```

Wrapping part of a regular expression in parentheses creates a group that captures the text it matches, so that we can reuse it later. We retrieve a captured group with the ```.group()``` method of the match object, passing the number of the group we want (counting from 1) as the argument.

Example:

In [None]:
import re

str1 = "ACT 1, SCENE 12"
pattern = r'^ACT (\d+), SCENE (\d+)$'

# Keep the match object so the pattern is only evaluated once.
match = re.match(pattern, str1)
if match:
    print("I am only interested in knowing the SCENE, which is number... " + match.group(2))
else:
    print("There's no match")
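
Beyond a single numbered group, a match object offers a few related accessors. This is a minimal sketch using the same pattern and string as above; `.group(0)` and `.groups()` are standard match-object methods, while the variable names are chosen for illustration.

```python
import re

m = re.match(r'^ACT (\d+), SCENE (\d+)$', "ACT 1, SCENE 12")

if m:
    whole = m.group(0)   # the entire matched text
    act = m.group(1)     # first parenthesised group
    scene = m.group(2)   # second parenthesised group
    both = m.groups()    # all groups at once, as a tuple of strings
    print(whole, act, scene, both)
    # ACT 1, SCENE 12 1 12 ('1', '12')
```

Captured groups are always strings, so a number such as the scene needs `int(scene)` before it can be used in arithmetic.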

### Finding all occurrences in text with ```re.findall(pattern, str)```

In [None]:
import re

str1 = """It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity."""
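# The pattern is case-sensitive and expects a trailing comma, so the first
# line ("It ...") and the last line (which ends with ".") are not matched.
pattern = r'it was the (.+) of (.+),'

# With two groups, findall returns a list of (group 1, group 2) tuples.
matches = re.findall(pattern, str1)
for match in matches:
    print(match)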
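
With the pattern above, `re.findall` returns four tuples: the first line starts with a capital "It" and the last line ends with a full stop, so neither matches. If the goal is to capture all six lines, one possible variant (a sketch, not the only way to do it) accepts either punctuation mark and ignores case through the standard `re.IGNORECASE` flag. The names `text`, `pattern_all` and `pairs` are illustrative.

```python
import re

text = """It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity."""

# [,.] accepts either a comma or a full stop at the end of the line;
# re.IGNORECASE lets "it" also match the capitalised "It".
pattern_all = r'it was the (.+) of (.+)[,.]'

pairs = re.findall(pattern_all, text, flags=re.IGNORECASE)
for first, second in pairs:
    print(first, '->', second)

print(len(pairs))  # 6
```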

### Replacing with ```re.sub(pattern, replacement, str)```

In [None]:
import re

str1 = """It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity."""
pattern = r'\bwas\b'

print(re.sub(pattern, "is not", str1))
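
The replacement string can also reuse captured groups, and the number of substitutions can be limited. The sketch below uses two standard features of `re.sub`: backreferences such as ```\1``` in the replacement, and the `count` argument. The sample sentences are made up, and ```\w``` (a word character) is a symbol not listed above.

```python
import re

line = "it was the age of wisdom,"

# \1 and \2 in the replacement refer to the first and second captured groups.
swapped = re.sub(r'the (\w+) of (\w+)', r'the \2 of \1', line)
print(swapped)      # it was the wisdom of age,

# count=1 limits the substitution to the first match only.
first_only = re.sub(r'\bwas\b', "is not", "it was what it was", count=1)
print(first_only)   # it is not what it was
```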