<h1>Basic regular expression patterns</h1>
<p>
Regular expressions (regexes) are patterns that describe sets of strings. Most programming languages and many text tools support them for searching, validating, and transforming text. A regex engine interprets the pattern syntax; the core syntax is similar across engines, although details differ between implementations.
</p>
<p>
In this Jupyter notebook, we introduce regular expressions using Python's re module. The notebook presents the foundational elements of regex, such as character classes, quantifiers, and anchors, and shows how they combine into more complex patterns. It also introduces more advanced features, including grouping, capturing, and look-ahead assertions, for handling trickier text-processing scenarios.
</p>


In [1]:
import numpy as np    # Import the numpy library to enable numerical operations on arrays and matrices.
import re             # Import the re module to utilize regular expressions.

<h2>Concatenation</h2>
<p>The simplest pattern is a sequence of literal characters, each of which must match in order. Here we search for a specific word:</p>

In [2]:
re.findall(r"woodchunks", "I gathered a pile of woodchunks to keep the campfire burning.")

['woodchunks']

<p>The following code doesn't find any matches because of the case difference:</p>

In [3]:
re.findall(r"Woodchunks", "I gathered a pile of woodchunks to keep the campfire burning.")

[]

<p>We can use square brackets [] to specify a disjunction of characters: [Ww] matches either W or w.</p>

In [4]:
re.findall(r"[Ww]oodchunks", "I gathered a pile of woodchunks to keep the campfire burning.")

['woodchunks']
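
<p>As an alternative to listing both cases in a character class, Python's re module accepts the re.IGNORECASE flag. The sketch below contrasts the two approaches; the sample sentences are made up for illustration.</p>

```python
import re

text = "Woodchunks burn slowly, so I gathered more woodchunks."

# Character class: only the first letter may be upper or lower case.
first_letter_only = re.findall(r"[Ww]oodchunks", text)
print(first_letter_only)   # ['Woodchunks', 'woodchunks']

# re.IGNORECASE: every letter in the pattern matches in either case.
any_case = re.findall(r"woodchunks", "WOODCHUNKS and woodchunks", flags=re.IGNORECASE)
print(any_case)            # ['WOODCHUNKS', 'woodchunks']
```

<p>The character class is the narrower tool: it would not match WOODCHUNKS, whereas the flag relaxes case for the whole pattern.</p>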

<p>Square brackets can also define a range of characters. The dash - within the brackets specifies a range of characters, like [a-z] for lowercase letters or [b-g] for a subset of them</p>

In [5]:
re.findall(r"[A-Z]", "The Greeks enter Troy using the Trojan Horse on April 24, 1184")

['T', 'G', 'T', 'T', 'H', 'A']

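<p>Character classes can be concatenated like any other pattern element. Here, [A-Z][a-z] matches an uppercase letter immediately followed by a lowercase letter:</p>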

In [9]:
re.findall(r"[A-Z][a-z]", "The Greeks enter Troy using the Trojan Horse on April 24, 1184")

['Th', 'Gr', 'Tr', 'Tr', 'Ho', 'Ap']

<p>If you place a caret ^ at the beginning of a square bracket expression, it negates the character class. For example, [^a] matches any single character except a. The pattern below matches every character that is neither a letter nor a space:</p>

In [12]:
re.findall(r"[^A-Za-z ]", "The Greeks enter Troy using the Trojan Horse on April 24, 1184")

['2', '4', ',', '1', '1', '8', '4']

<p>This negation only applies when the caret is the first symbol after the opening square bracket</p>

In [14]:
re.findall(r"[a^b]", "look up a^b now")

['a', '^', 'b']
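
<p>The position rule can be checked side by side. The following sketch compares a caret in the middle of the class, an escaped caret in first position, and an unescaped caret in first position; the variable names are only illustrative.</p>

```python
import re

text = "look up a^b now"

# Caret in the middle of the class: a literal '^'.
literal_middle = re.findall(r"[a^b]", text)
print(literal_middle)    # ['a', '^', 'b']

# Escaped caret in first position: still a literal '^'.
literal_escaped = re.findall(r"[\^ab]", text)
print(literal_escaped)   # ['a', '^', 'b']

# Unescaped caret in first position: negation, so only '^' is left from "a^b".
negated = re.findall(r"[^ab]", "a^b")
print(negated)           # ['^']
```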

<p>To introduce optional elements, such as an optional s in woodchunks, you use the question mark ?. It signifies that the preceding character may appear zero or one time, making it optional</p>

In [15]:
re.findall(r"woodchunks?", "I gathered a pile of woodchunks to keep the campfire burning.")

['woodchunks']
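
<p>The sentence above only contains the plural form, so it does not show the optional s at work. A minimal sketch with a made-up sentence that contains both forms:</p>

```python
import re

text = "One woodchunk fell, so I stacked the other woodchunks."

# 's?' matches zero or one 's', so both the singular and the plural are found.
both_forms = re.findall(r"woodchunks?", text)
print(both_forms)    # ['woodchunk', 'woodchunks']

# Without the '?', the 's' is required and only the plural matches.
plural_only = re.findall(r"woodchunks", text)
print(plural_only)   # ['woodchunks']
```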

<p>The wildcard operator . (a dot) matches any single character except a newline, so it can stand in for any character at a given position.</p>

In [16]:
re.findall(r"beg.n", "begin begining began begun")

['begin', 'begin', 'began', 'begun']
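
<p>By default the dot does not match a newline character. The sketch below, using a contrived string, shows this and how the re.DOTALL flag changes it.</p>

```python
import re

text = "beg\nn begin"

# Default: '.' does not match '\n', so "beg\nn" is skipped.
default_dot = re.findall(r"beg.n", text)
print(default_dot)   # ['begin']

# re.DOTALL: '.' matches any character, including '\n'.
dotall_dot = re.findall(r"beg.n", text, flags=re.DOTALL)
print(dotall_dot)    # ['beg\nn', 'begin']
```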

<h2>Anchors - match positions</h2>
<p>Anchors are not characters that match content directly, but instead, they match positions within string data</p>

<h3>1) Caret ^</h3>
<p>Outside square brackets, the caret ^ matches the beginning of a string, so the rest of the pattern must match starting at the very first character. In the example below, only the cat that starts the string is found.</p>

In [17]:
re.findall(r"^cat", "catapult truncate category")

['cat']

<h3>2) Dollar sign $</h3>
<p>The dollar sign is used to match the end of a string. It checks that the characters preceding it are at the end of the string.</p>

In [18]:
re.findall(r"dog$", "The dog race is bulldog")

['dog']
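
<p>Both ^ and $ refer to the whole string by default. With the re.MULTILINE flag they also match at the start and end of each line. A small sketch on a made-up three-line string:</p>

```python
import re

lines = "cat and dog\ncat alone\nhot dog"

# Default: '^' only matches at the start of the string, '$' at its end.
start_default = re.findall(r"^cat", lines)
end_default = re.findall(r"dog$", lines)
print(start_default, end_default)   # ['cat'] ['dog']

# re.MULTILINE: '^' and '$' also match at every line start and line end.
start_multi = re.findall(r"^cat", lines, flags=re.MULTILINE)
end_multi = re.findall(r"dog$", lines, flags=re.MULTILINE)
print(start_multi, end_multi)       # ['cat', 'cat'] ['dog', 'dog']
```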

<h3>3) Word boundary \b</h3>
<p>This anchor matches positions where a word character is next to a non-word character, including start and end of a string if it starts or ends with a word character.</p>

In [19]:
re.findall(r"\bcat\b", "the cat sat on the caterpillar")

['cat']
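
<p>To see what the boundaries contribute, the sketch below runs the same text with no boundary, with a boundary on both sides, and with a boundary on the left only.</p>

```python
import re

text = "the cat sat on the caterpillar"

# No boundaries: "cat" is also found inside "caterpillar".
no_boundary = re.findall(r"cat", text)
print(no_boundary)     # ['cat', 'cat']

# Boundaries on both sides: only the standalone word matches.
whole_word = re.findall(r"\bcat\b", text)
print(whole_word)      # ['cat']

# Boundary on the left only: every word that starts with "cat".
starts_with = re.findall(r"\bcat\w*", text)
print(starts_with)     # ['cat', 'caterpillar']
```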

<h3>4) Non-word boundary \B</h3>
<p>The opposite of \b, this anchor matches positions between two word characters or two non-word characters. </p>

In [20]:
re.findall(r"\Bcat\B", "educational category")

['cat']

<h2>Kleene Operators</h2>
<p>In regular expressions, Kleene operators are quantifiers that repeat the preceding element. There are two of them:</p>

<h3>1) Kleene Star *</h3>
<p>The Kleene Star * matches zero or more occurrences of the preceding character or regular expression.</p>

In [21]:
re.findall(r"aa*", "a abaabbababb abababab aaa bbb aaaa acbd ")

['a', 'a', 'aa', 'a', 'a', 'a', 'a', 'a', 'a', 'aaa', 'aaaa', 'a']

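<p>Similarly, the pattern [ab]* means zero or more occurrences of either 'a' or 'b'. Because zero occurrences are allowed, the pattern also matches the empty string at positions where neither letter appears, which is why the output contains empty strings.</p>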

In [41]:
re.findall(r"[ab]*", "a abaabbababb abababab aaa bbb aaaa acbd ")

['a',
 '',
 'abaabbababb',
 '',
 'abababab',
 '',
 'aaa',
 '',
 'bbb',
 '',
 'aaaa',
 '',
 'a',
 '',
 'b',
 '',
 '',
 '']

<h3>2) Kleene Plus +</h3>
<p>The Kleene Plus + matches one or more occurrences of the preceding character or regular expression. Unlike the Kleene Star, it requires at least one occurrence.</p>

In [42]:
re.findall(r"[0-9]+", "The Greeks enter Troy using the Trojan Horse on April 24")

['24']
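
<p>The difference between the two operators shows up when both are run on the same text. In this sketch the star version also returns empty strings wherever no digit is present, while the plus version returns only the digit runs.</p>

```python
import re

text = "April 24, 1184"

# '+' needs at least one digit, so only the two numbers are returned.
plus_matches = re.findall(r"[0-9]+", text)
print(plus_matches)   # ['24', '1184']

# '*' also accepts zero digits, so empty matches appear between the numbers.
star_matches = re.findall(r"[0-9]*", text)
non_empty = [m for m in star_matches if m]
print(non_empty)      # ['24', '1184']
print("" in star_matches)   # True
```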

<h2>Disjunction</h2>
<p>Disjunction in regex is the equivalent of logical OR in programming. The pipe | matches either the pattern on its left or the pattern on its right.</p>

In [22]:
re.findall(r"cat|dog", "I have a cat and a dog.")

['cat', 'dog']

<p>Parentheses group part of a pattern and, by default, capture the text that the group matches. When a pattern contains one capturing group, re.findall returns the text of that group rather than the whole match:</p>

In [43]:
re.findall(r"\w+(ing|ed)", "playing cat bingo loved singing")

['ing', 'ing', 'ed', 'ing']

<p>To get the whole match, word and suffix together, make the group non-capturing by placing ?: right after the opening parenthesis:</p>

In [44]:
re.findall(r"\w+(?:ing|ed)", "playing cat bingo loved singing")

['playing', 'bing', 'loved', 'singing']
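
<p>If both the whole match and the captured suffix are needed, one option is re.finditer, which yields match objects instead of strings. The sketch below also adds a trailing \b so that the suffix must end the word, which is one possible way to avoid the partial match inside bingo.</p>

```python
import re

text = "playing cat bingo loved singing"

# group(0) is the whole match, group(1) is the captured suffix.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"\w+(ing|ed)", text)]
print(pairs)
# [('playing', 'ing'), ('bing', 'ing'), ('loved', 'ed'), ('singing', 'ing')]

# Requiring a word boundary after the suffix drops the match inside "bingo".
whole_words = re.findall(r"\w+(?:ing|ed)\b", text)
print(whole_words)   # ['playing', 'loved', 'singing']
```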

<h2>Example of Nested Groups</h2>

<p>Suppose you have a string containing dates, and you want to extract different parts of these dates</p>

In [25]:
dates = "19-02-1978 22-11-2015 30/09/2021"

<p>First, let’s extract dates that strictly follow the <b>dd-mm-yyyy</b> format:</p>

In [45]:
print(r"Pattern: \d{2}-\d{2}-\d{4}")   # raw string, so \d is not treated as a string escape
print(re.findall(r"\d{2}-\d{2}-\d{4}", dates))

Pattern: \d{2}-\d{2}-\d{4}
['19-02-1978', '22-11-2015']


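<p>To break down the dates into day-month and year, you can use nested parentheses. Groups are numbered by the position of their opening parenthesis, so the outer group comes first in each result tuple:</p>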

In [27]:
print(r"Pattern: ((\d{2}-\d{2})-(\d{4}))")   # raw string, so \d is not treated as a string escape
print(re.findall(r"((\d{2}-\d{2})-(\d{4}))", dates))

Pattern: ((\d{2}-\d{2})-(\d{4}))
[('19-02-1978', '19-02', '1978'), ('22-11-2015', '22-11', '2015')]


<p>To accommodate different separators between day, month, and year, replace the literal hyphen with the character class [-/]:</p>

In [28]:
print(r"Pattern: ((\d{2}[-/]\d{2})[-/](\d{4}))")   # raw string, so \d is not treated as a string escape
print(re.findall(r"((\d{2}[-/]\d{2})[-/](\d{4}))", dates))

Pattern: ((\d{2}[-/]\d{2})[-/](\d{4}))
[('19-02-1978', '19-02', '1978'), ('22-11-2015', '22-11', '2015'), ('30/09/2021', '30/09', '2021')]
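
<p>The group numbering can be inspected on a single match object. The sketch below uses re.search on the same dates string. It then shows, as an optional refinement that is not part of the example above, a backreference that forces both separators in a date to be the same; the string mixed is a made-up test input.</p>

```python
import re

dates = "19-02-1978 22-11-2015 30/09/2021"

# Groups are numbered left to right by their opening parenthesis.
m = re.search(r"((\d{2}[-/]\d{2})[-/](\d{4}))", dates)
print(m.group(0))   # '19-02-1978'  (whole match)
print(m.group(1))   # '19-02-1978'  (outer group)
print(m.group(2))   # '19-02'       (first inner group)
print(m.group(3))   # '1978'        (second inner group)

# [-/] alone accepts mixed separators; \1 repeats whatever group 1 matched.
mixed = "19-02-1978 22/11-2015 30/09/2021"
consistent = [x.group(0) for x in re.finditer(r"\d{2}([-/])\d{2}\1\d{4}", mixed)]
print(consistent)   # ['19-02-1978', '30/09/2021']
```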


<h2>Greediness and Laziness in Quantifiers</h2>
<p>Quantifiers are greedy by default: they match as much text as possible while still allowing the overall pattern to match. Adding ? after a quantifier makes it lazy, so *? matches as little text as possible.</p>

In [29]:
re.findall(r"<.*?>", "<p>This is a paragraph.</p> <p>This is another paragraph.</p>")

['<p>', '</p>', '<p>', '</p>']
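
<p>For contrast, the sketch below runs the greedy version of the same tag pattern, followed by a negated character class that gives the same result as the lazy quantifier on this input.</p>

```python
import re

html = "<p>This is a paragraph.</p> <p>This is another paragraph.</p>"

# Greedy: '.*' runs to the last '>' in the string, giving one long match.
greedy = re.findall(r"<.*>", html)
print(len(greedy))   # 1

# Lazy: '.*?' stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)
print(lazy)          # ['<p>', '</p>', '<p>', '</p>']

# A negated class avoids backtracking past '>' and yields the same tags here.
negated_class = re.findall(r"<[^>]*>", html)
print(negated_class == lazy)   # True
```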

<h2>Substitution, Capture Groups and ELIZA</h2>
<p>In regular expressions, parentheses ( ) are used for grouping parts of a pattern and capturing the matched text. Each group of parentheses captures the corresponding part of the matched string and stores it in a numbered register. In re.sub, the replacement string can refer to these registers as \1, \2, and so on.</p>

In [32]:
docs = ["I am feeling tired", "I am feeling nervous", "I am feeling happy"]
for doc in docs:
    print("User: ",doc)
    print("ELIZA: ",re.sub(r"I am feeling (.+)", r"Why are you feeling \1?", doc))

User:  I am feeling tired
ELIZA:  Why are you feeling tired?
User:  I am feeling nervous
ELIZA:  Why are you feeling nervous?
User:  I am feeling happy
ELIZA:  Why are you feeling happy?
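
<p>An ELIZA-style responder usually tries several patterns in turn. The following is a minimal sketch of that idea, not a full ELIZA: the rule list, the second rule, the respond function, and the fallback reply are all hypothetical names chosen for illustration.</p>

```python
import re

# Hypothetical rule list: (pattern, reply template) pairs tried in order.
RULES = [
    (r"I am feeling (.+)", r"Why are you feeling \1?"),
    (r"I need (.+)", r"Why do you need \1?"),
]

def respond(utterance, default="Please tell me more."):
    """Return the reply of the first rule that matches the whole utterance."""
    for pattern, template in RULES:
        match = re.fullmatch(pattern, utterance)
        if match:
            # Match.expand fills \1, \2, ... in the template from the groups.
            return match.expand(template)
    return default

print(respond("I am feeling tired"))   # Why are you feeling tired?
print(respond("I need a holiday"))     # Why do you need a holiday?
print(respond("Hello"))                # Please tell me more.
```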


<p>We can use more than one capture group to substitute multiple parts of the text</p>

In [35]:
doc = "My name is John and I live in New York"
result = re.sub(r"My name is (.+) and I live in (.+)", r"Hello, \1 how is the weather at \2!", doc)
print("User: ",doc)
print("ELIZA: ",result)

User:  My name is John and I live in New York
ELIZA:  Hello, John how is the weather at New York!
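
<p>With several groups, numbered references can become hard to read. Python's re module also supports named groups, written (?P&lt;name&gt;...) in the pattern and \g&lt;name&gt; in the replacement. The sketch below rewrites the same substitution with the illustrative group names name and city.</p>

```python
import re

doc = "My name is John and I live in New York"

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
pattern = r"My name is (?P<name>.+) and I live in (?P<city>.+)"
result = re.sub(pattern, r"Hello, \g<name> how is the weather at \g<city>!", doc)
print(result)   # Hello, John how is the weather at New York!

# The same groups are available by name on a match object.
captured = re.search(pattern, doc).groupdict()
print(captured)   # {'name': 'John', 'city': 'New York'}
```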


<h3>Non-capturing Groups</h3>
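<p>Sometimes we use parentheses just for grouping, without needing the text they match to be stored for later use. This can be done with a non-capturing group, written (?:...). A non-capturing group does not receive a number, so in the example below \1 refers to the text after the optional word.</p>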

In [39]:
doc = "I am feeling quite tired but excited"
result = re.sub(r"I am feeling (?:quite )?(.+)", r"What makes you feel \1?", doc)
print("User: ",doc)
print("ELIZA: ",result)

User:  I am feeling quite tired but excited
ELIZA:  What makes you feel tired but excited?
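
<p>The effect on group numbering can be checked directly with Match.groups(). This sketch compares a capturing and a non-capturing version of the same pattern.</p>

```python
import re

doc = "I am feeling quite tired but excited"

# Capturing group: "quite " takes group 1 and the rest moves to group 2.
capturing = re.search(r"I am feeling (quite )?(.+)", doc).groups()
print(capturing)       # ('quite ', 'tired but excited')

# Non-capturing group: "quite " is matched but not stored, so the rest is group 1.
non_capturing = re.search(r"I am feeling (?:quite )?(.+)", doc).groups()
print(non_capturing)   # ('tired but excited',)
```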
