<a href="https://colab.research.google.com/github/SCS-Technology-and-Innovation/DSLP/blob/main/DSLP_M04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Patterns

It is often crucial to detect content that matches certain **patterns**: everything that might be a date, a temperature, a proper name, etc.

Identifying such elements allows for a smoother and more thorough process in tasks such as (machine-assisted) translation or localization.

The most powerful tool for this task is the **regular expression** (often called *regex* for short).

In [None]:
content = "The eldest cat, called Puff, was born on May 3rd, 2014, in a cardboard box under a tree. The youngest one, whom we call Happy, was born sometime in July of 2019... I found him on the 15th of the month as a newborn kitten!"

Let's start with an easy task: locate all four-digit numbers (since those might be years).

In [None]:
from re import search

search('[0-9][0-9][0-9][0-9]', content)

<re.Match object; span=(50, 54), match='2014'>

When we use `search`, we only get the first match.

In [None]:
from re import findall

findall('[0-9][0-9][0-9][0-9]', content)

['2014', '2019']
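If the positions of all matches are needed as well, `finditer` from the same `re` module yields one match object per hit, like the one `search` returned. A minimal sketch, where the variable name `years` is only illustrative:

```python
from re import finditer

content = "The eldest cat, called Puff, was born on May 3rd, 2014, in a cardboard box under a tree. The youngest one, whom we call Happy, was born sometime in July of 2019... I found him on the 15th of the month as a newborn kitten!"

# Each match object carries the matched text (group) and its position (span).
years = [(m.group(), m.span()) for m in finditer('[0-9][0-9][0-9][0-9]', content)]
print(years)
```

Slicing `content` with any of the reported spans gives back the matched text.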

Each pair of square brackets `[]` defines a set (here the range `0-9`) of characters permitted at one position, and since four positions are specified, this regex matches four-character substrings. Instead of repeating the brackets four times, we can use curly braces `{}` to state how many consecutive characters from the bracketed range we want.

In [None]:
findall('[0-9]{4}', content)

['2014', '2019']
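One caveat: `{4}` only asks for four digits in a row, so the pattern also fires inside longer numbers. A possible refinement, sketched here on a made-up sample string, is to surround the pattern with the word-boundary marker `\b`:

```python
from re import findall

# Hypothetical sample text, not part of the story above.
sample = "Order 123456 was shipped in 2014."

loose = findall(r'[0-9]{4}', sample)       # also grabs the first four digits of 123456
strict = findall(r'\b[0-9]{4}\b', sample)  # only stand-alone four-digit numbers

print(loose)   # ['1234', '2014']
print(strict)  # ['2014']
```

The `r` prefix makes the pattern a raw string, so Python passes the backslash through to the regex engine unchanged.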

If I wanted to find possible dates, I could look for words that **start** with one or two digits followed by either *st*, *nd*, *rd*, or *th*.

To find the start of a word, we can require a whitespace character right before it, which is written as `\s` in the regex.

Placing two numbers, separated by a comma, within the curly braces gives the minimum and maximum number of repetitions; `{1,2}` states that either one or two digits are fine.

The parentheses `()` form a *group* within which we provide four options (alternatives) by using the `|` symbol as an "**or**" operator.

In [None]:
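# raw string (r'...') so that the backslash reaches the regex engine as-is
# note: with a group in the pattern, findall returns the group's text only
findall(r'\s[0-9]{1,2}(st|nd|rd|th)', content)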

['rd', 'th']
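To get the whole ordinal rather than just its suffix, one option is a non-capturing group `(?:...)`, which groups the alternatives without changing what `findall` returns. The sketch below also shows a variant that swaps `\s` for the word boundary `\b`, which keeps the leading space out of the result; the variable names are only illustrative.

```python
from re import findall

content = "The eldest cat, called Puff, was born on May 3rd, 2014, in a cardboard box under a tree. The youngest one, whom we call Happy, was born sometime in July of 2019... I found him on the 15th of the month as a newborn kitten!"

# (?:...) groups the alternatives but does not capture them.
with_space = findall(r'\s[0-9]{1,2}(?:st|nd|rd|th)', content)

# \b matches a position (a word boundary), not a character.
without_space = findall(r'\b[0-9]{1,2}(?:st|nd|rd|th)\b', content)

print(with_space)     # [' 3rd', ' 15th']
print(without_space)  # ['3rd', '15th']
```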

Finding names could be based on an initial capital letter followed by *at least one* (achieved with `+`) lowercase letter. To get only words that start with a capital letter, we match one or more whitespace characters at the start of the pattern.

In [None]:
findall(r'\s+[A-Z][a-z]+', content)

[' Puff', ' May', ' The', ' Happy', ' July']
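The matches above carry their leading space, and the very first word of the text is skipped because nothing precedes it. A possible alternative, shown as a sketch, is the word boundary `\b`, which matches a position rather than a character:

```python
from re import findall

content = "The eldest cat, called Puff, was born on May 3rd, 2014, in a cardboard box under a tree. The youngest one, whom we call Happy, was born sometime in July of 2019... I found him on the 15th of the month as a newborn kitten!"

# \b marks the start of a word without consuming any whitespace.
capitalized = findall(r'\b[A-Z][a-z]+', content)
print(capitalized)  # ['The', 'Puff', 'May', 'The', 'Happy', 'July']
```

The single-letter word "I" is still not matched, since the pattern asks for at least one lowercase letter after the capital.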

Since the first word of a sentence is also capitalized, we could try to pick only those whitespaces that are **not** preceded by punctuation (negation is achieved by placing `^` at the start of a bracketed set).

Let's first learn to match punctuation:

The symbol `.` has a special meaning in regex: it matches any single character (that is not the newline character). Hence we have to escape this meaning by adding a backslash `\` in front of it.

Similarly, `?` means "*zero times or one time*" so it needs to be escaped as well.

In [None]:
findall(r'(\.|,|\?|!)', content)

[',', ',', ',', ',', '.', ',', ',', '.', '.', '.', '!']
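When every alternative is a single character, a bracketed set is a more compact way to express the same choice; inside brackets, `.` and `?` lose their special meaning, so no backslashes are needed. A short sketch comparing both forms:

```python
from re import findall

content = "The eldest cat, called Puff, was born on May 3rd, 2014, in a cardboard box under a tree. The youngest one, whom we call Happy, was born sometime in July of 2019... I found him on the 15th of the month as a newborn kitten!"

with_group = findall(r'(\.|,|\?|!)', content)  # alternatives inside a group
with_set = findall(r'[.,?!]', content)         # the same choice as a character set

print(with_group == with_set)  # True
print(len(with_set))           # 11
```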

Now, let's try finding **anything but** punctuation. That will be a lot of matches so we can look at the count instead of the listing.

In [None]:
# a leading ^ negates the set; inside brackets neither | nor escaping is needed
len(findall('[^.,?!]', content))

210
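A word of caution about negated sets: inside square brackets, `(`, `|`, and `)` are ordinary characters rather than grouping or "or" operators, so a set written like a group also excludes parentheses and vertical bars. The sketch below uses a made-up sample string to show the difference from the plain set `[^.,?!]`:

```python
from re import findall

# Hypothetical sample text containing parentheses.
sample = "Puff (the cat) purrs."

group_like = findall(r'[^(\.|,|\?|!)]', sample)  # also rejects '(' and ')'
plain_set = findall(r'[^.,?!]', sample)          # rejects punctuation only

print(len(group_like))  # 18
print(len(plain_set))   # 20
```

On the cat story both forms give the same count, because that text contains no parentheses or vertical bars.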

Finally, we are ready to match non-punctuation followed by whitespace and then a capitalized word.

In [None]:
matches = findall(r'[^.,?!]\s+[A-Z][a-z]+', content)
matches

['d Puff', 'n May', 'l Happy', 'n July']

In [None]:
for m in matches:
    print(m[2:])  # skip the non-punctuation character and the space

Puff
May
Happy
July


Note that each match now includes the non-punctuation character that precedes the whitespace. As done above, we can easily clean it out, along with the space, by cropping the first two characters of each match.
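An alternative to slicing, sketched here, is to wrap the name part of the pattern in a group so that `findall` returns only that part; this also keeps working when more than one whitespace character separates the words. The variable name `names` is only illustrative.

```python
from re import findall

content = "The eldest cat, called Puff, was born on May 3rd, 2014, in a cardboard box under a tree. The youngest one, whom we call Happy, was born sometime in July of 2019... I found him on the 15th of the month as a newborn kitten!"

# Only the parenthesized part of each match is returned by findall.
names = findall(r'[^.,?!]\s+([A-Z][a-z]+)', content)
print(names)  # ['Puff', 'May', 'Happy', 'July']
```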

Now, it will not be easy to separate months from names since some people are named April, for example. Natural language processing (NLP) tools can (somewhat precisely) achieve that by tagging the words in terms of their role within the sentence, but we won't get into that today.

To learn more about the numerous mechanisms with which to build regular expressions, there is an excellent [Real Python Regex tutorial](https://realpython.com/regex-python/). If you want to practice formulating regexes and then testing whether certain strings match those patterns, [regex101.com](https://regex101.com/) is cool and handy (pick Python as your "flavor" on the left sidebar to match the syntax; many tools include regexes, but the precise syntax tends to vary from one tool to another).