# **Regular Expressions**

This notebook helps students improve their coding skills and provides code examples for regular expressions (also called “RegEx”).

Each topic in the notebook is presented in two parts:
- Definition
- Code examples

To improve your understanding, you can change the strings in the examples and execute the code again.





## **1- Introduction**

Regular expressions in Python are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. The most significant ones are:
- ``match()``:$\qquad$ Determine if the regular expression matches at the beginning of the string.<br>
- ``search()``:$\qquad$ Scan a string, looking for any location where this regular expression matches.<br>
- ``findall()``:$\qquad$ Find all substrings where the regular expression matches, and return them as a list.<br>
- ``finditer()``:$\qquad$ Find all substrings where the regular expression matches, and return them as an iterator.<br>
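
As a minimal sketch of how these four methods differ, the snippet below applies them to one sample string. The pattern `an` and the sample sentence are illustrative assumptions and are not part of the original examples.

```python
import re

# Illustrative pattern and text (assumptions for this sketch)
pattern = re.compile('an')
text = 'banana and mango'

# match(): succeeds only if the pattern matches at index 0
m = pattern.match(text)

# search(): returns the first match anywhere in the string
s = pattern.search(text)

# findall(): returns all non-overlapping matches as a list of strings
all_matches = pattern.findall(text)

# finditer(): returns an iterator of match objects; collect their spans
spans = [mo.span() for mo in pattern.finditer(text)]

print(m)            # None, because the text starts with 'b'
print(s.span())     # (1, 3)
print(all_matches)  # ['an', 'an', 'an', 'an']
print(spans)        # [(1, 3), (3, 5), (7, 9), (12, 14)]
```

`findall()` is convenient when only the matched text is needed, while `finditer()` also gives access to the position of every match.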

#### **Content:**
This notebook shows examples for the following groups of metacharacters:
- Anchors
- Quantifiers
- Disjunctions
- Character Classes

Further Python documentation:
- Regular expression syntax in the library ``re``: https://docs.python.org/3/library/re.html
- Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html#regex-howto

## **2- Anchors mark a position in the string**

Anchors do not match any character. Instead, they match a position before, after, or between characters. They can be used to **“anchor”** the regex match at a certain position. 

<ul>
<li><b>$\text{^}$ :   means the beginning of the string </b></li>
<ul>
<li>The caret <b>^</b> matches the position before the first character in the string. </li>
<li> Applying <b>^a</b> to <b>abc</b> matches <b>a</b>. </li>
<li><b>^b</b> does not match <b>abc</b> at all, because the <b>b</b> cannot be matched right after the start of the string, matched by <b>^</b>. </li>
</ul>
<li><b>$\text{\$}$:   means the end of the string</b></li>
<ul>

<li>Similarly, <b>$\$$</b> matches right after the last character in the string. <b>c$\$$</b> matches <b>c</b> in <b>abc</b>, while <b>a$\$$</b> does not match at all.</li>
</ul>
</ul>

To work with regular expressions, you need to import the <b>re</b> module. The <b>re</b> module provides a set of powerful regular expression facilities, which allow you to quickly check whether a given string starts with a match for a given pattern (using the match function), or contains such a match anywhere (using the search function).
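
The anchor behaviour described above can be checked directly on the string `abc`. This is only a small sketch; the variable names are chosen for illustration.

```python
import re

# '^a' matches because 'a' directly follows the start of the string
caret_a = re.search('^a', 'abc')

# '^b' fails because 'b' is not at the start of the string
caret_b = re.search('^b', 'abc')

# 'c$' matches because 'c' is directly followed by the end of the string
c_dollar = re.search('c$', 'abc')

# 'a$' fails because 'a' is not the last character
a_dollar = re.search('a$', 'abc')

print(caret_a.span())   # (0, 1)
print(caret_b)          # None
print(c_dollar.span())  # (2, 3)
print(a_dollar)         # None
```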

#### **Code Examples:**

<br></br>
<b>1. </b>   The following example shows how to use the <b>^</b> anchor to find out whether a string <b>begins</b> with the word "banana":

In [None]:
# import the 're' module
import re

# Define the regular expression by using '^'
reg = re.compile('^banana')

# Define strings for the search function:
p1 = reg.search("are bananas cheap?")
p2 = reg.search("bananas are cheap.")

# Show results:
if p1:
    # Print the result:
    print("The string 'p1' begins with '" + p1.group()+"'.")

    # Print the start index and the (exclusive) end index of the match
    print("index of first and last matching character in the sentence 'p1': \t start: {} \t end: {}".format(p1.start(), p1.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p1.span()))

else:
    print("The string 'p1' does not begin with 'banana'.")

if p2:
    # Print the string which matches the regular expression
    print("The string 'p2' begins with '" + p2.group()+"'.")

    # Print the start index and the (exclusive) end index of the match
    print("index of first and last matching character in the sentence 'p2': \t start: {} \t end: {}".format(p2.start(), p2.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p2.span()))
    
else:
    print("The string 'p2' does not begin with 'banana'.")


The string 'p1' does not begin with 'banana'.
The string 'p2' begins with 'banana'.
index of first and last matching character in the sentence 'p2': 	 start: 0 	 end: 6
index of first and last matching character as tuple: 	(0, 6)


<br></br>
<b>2.</b>  The following example shows how to use the <b>$</b> anchor to find out whether a string <b>ends</b> with the word "banana":

In [None]:
# import the 're' module
import re

# Define the regular expression by using '$'
reg = re.compile('banana$')

# Define strings for the search function:
p1 = reg.search("I have a banana")
p2 = reg.search("have a banana and an apple")

# Show results:
if p1:
    # Print the result:
    print("The string 'p1' ends with '" + p1.group()+"'.")

    # Print the start index and the (exclusive) end index of the match
    print("index of first and last matching character in the sentence 'p1': \t start: {} \t end: {}".format(p1.start(), p1.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p1.span()))

else:
    print("The string 'p1' does not end with 'banana'.")

if p2:
    # Print the string which matches the regular expression
    print("The string 'p2' ends with '" + p2.group()+"'.")

    # Print the start index and the (exclusive) end index of the match
    print("index of first and last matching character in the sentence 'p2': \t start: {} \t end: {}".format(p2.start(), p2.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p2.span()))
    
else:
    print("The string 'p2' does not end with 'banana'.")



The string 'p1' ends with 'banana'.
index of first and last matching character in the sentence 'p1': 	 start: 9 	 end: 15
index of first and last matching character as tuple: 	(9, 15)
The string 'p2' does not end with 'banana'.


<br></br>
<b>3.</b>  The following example shows how to combine the <b>^</b> and <b>$</b> anchors to find an exact match for the string “banana”:

In [None]:
# import the 're' module
import re

# Define the regular expression by using '^' and '$'
reg = re.compile('^banana$')
 
# Define strings for the search function:
p1=reg.search('bananas are cheap')
p2=reg.search('I have a banana')
p3=reg.search('I have a banana and an apple')
p4=reg.search('banana')

print(p1)

print(p2)

print(p3)

print(p4.group())

None
None
None
banana


#### **Alternative method: Use module functions for matches in the string variable**

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.<br>
The ``re`` module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full-featured methods for compiled regular expressions: they are shortcuts that do not require you to compile a pattern object first. Most non-trivial applications use the compiled form.

#### **Code Examples:**

<br></br>
<b>1.</b> The following example shows an alternative coding style which combines these two statements into a single call:


*   ``reg = re.compile('^banana')``
*   ``p = reg.search('bananas are cheap')``



In [None]:
# import the 're' module
import re

# Alternative method: Use the module function search() for matches in the string variable
# re.search('pattern_to_be_searched', 'string')

p = re.search('^banana', 'bananas are cheap')

# Show results:
if p:
    # Print the string which matches the pattern
    print('matching string:\t' + p.group())
    
    # Print the start index and the (exclusive) end index of the match
    print('index of first and last matching character in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p.span()))
else:
    print("The string does not begin with 'banana'")

matching string:	banana
index of first and last matching character in the sentence: 	 start: 0 	 end: 6
index of first and last matching character as tuple: 	(0, 6)


<br></br>
<b>2.</b> In all the examples above, we have used the search() function. We can alternatively use the match() function. The following example compares the match() and the search() functions:


In [None]:
# For this pattern and string, match() returns the same result as search().
#
# NOTE:
# The match() function only checks if the regular expression matches at the beginning of the string,
# while the search() function scans the whole string for a match.
# It is important to keep this distinction in mind.
# If the match does not start at index 0, the match() function returns None.

# For more detailed comparison of the search() and  match() functions: https://docs.python.org/3/library/re.html#search-vs-match

p_match_pattern = re.match('^banana', 'bananas are cheap')
p_search_pattern = re.search('^banana', 'bananas are cheap')

if p_match_pattern:
  print('match function matching string: ' + p_match_pattern.group())

if p_search_pattern:
  print('search function matching string: ' + p_search_pattern.group())

match function matching string: banana
search function matching string: banana
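
The two functions only return different results when the pattern occurs somewhere other than the beginning of the string. The following sketch drops the `^` anchor and uses a sentence in which "banana" appears in the middle; the sentence is reused from the first anchor example, the variable names are illustrative.

```python
import re

text = 'are bananas cheap?'

# search() scans the whole string and finds 'banana' at index 4
found_by_search = re.search('banana', text)

# match() only tries index 0, where the string starts with 'are'
found_by_match = re.match('banana', text)

print(found_by_search.span())  # (4, 10)
print(found_by_match)          # None
```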


<a id='001'></a>

## **3- Quantifiers indicate the number of repetitions of the previous character**


<ul>

<li><b>Match multiple characters (zero or more) </b></li>
<ul>
<li>A regular expression followed by an asterisk <b>(*)</b> matches zero or more repetitions of the expression directly in front of the asterisk. If several matches are possible, the first match in the string is used. </li>
<li>For example, the regular expression below matches "cl", "col", "cool", "coool", …, "coooooooooool":</li>
<b>co*l</b>
</ul>

<li><b>Find one or more characters</b></li>
<ul>
<li>A regular expression followed by a plus sign <b>(+)</b> matches one or more repetitions of the expression directly in front of the plus sign. If several matches are possible, the first match in the string is used.</li>
<li>For example, the regular expression below matches "col", "cool", …, "cooooooooooool":</li>
<b>co+l</b>
</ul>

<li><b>Find zero or one character</b></li>
<ul>
<li>A regular expression followed by a question mark <b>(?)</b> matches zero or one occurrence of the expression directly in front of the question mark.</li>
<li>For example, the regular expression below matches "apple" and "apples":</li>
<b>apples?</b>
</ul>

<li><b>Define range quantifier</b></li>
<ul>
<li>The opening and closing curly brackets <b>{ }</b> are used as range quantifiers. The minimum and maximum number of repetitions are written between the curly brackets, separated by a comma.</li>
<li>For example, the regular expression below matches "aa" or "aaa":</li>
<b>a{2,3}</b>
</ul>

</ul>
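
The four example patterns from the list above can be tried out with a short sketch. It uses `re.fullmatch()`, which only succeeds if the whole string matches the pattern; the helper `matching()` and the candidate strings are hypothetical names and values chosen for illustration.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that match the pattern from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# '*': zero or more 'o'
star = matching('co*l', ['cl', 'col', 'cool', 'cal'])

# '+': one or more 'o'
plus = matching('co+l', ['cl', 'col', 'cool'])

# '?': zero or one 's'
optional = matching('apples?', ['apple', 'apples', 'appless'])

# '{2,3}': two or three 'a'
ranged = matching('a{2,3}', ['a', 'aa', 'aaa', 'aaaa'])

print(star)      # ['cl', 'col', 'cool']
print(plus)      # ['col', 'cool']
print(optional)  # ['apple', 'apples']
print(ranged)    # ['aa', 'aaa']
```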



#### **Code Examples:**

<br></br>
<b>1.</b> The following example shows how to use the <b>?</b> quantifier to match the strings “banana” and "bnana":


In [None]:
# import the 're' module
import re

# Define the regular expression by using '?' which means 0 or 1 repetitions
reg = re.compile('ba?nana')

# Define strings for the search function and print the matching patterns for each string:
print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana'))

banana
bnana
None


<br></br>
<b>2.</b> The following example shows how to use the <b>+</b> quantifier to find a match for the string “banana” and other variations where the first letter 'a' occurs one or more times:

In [1]:
# import the 're' module
import re

# Define the regular expression by using '+' which means 1 or more repetitions
reg = re.compile('ba+nana')

# Define strings for the search function and print the matching patterns for each string:
print(reg.search('banana').group())
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

banana
None
baaaaanana


<br></br>
<b>3.</b> The following example shows how to use the <b>*</b> quantifier to find a match for the string “banana” and other variations where the first letter 'a' is missing or occurs one or more times:

In [None]:
# import the 're' module
import re

# Define the regular expression by using '*' which means 0 or more repetitions
reg = re.compile('ba*nana')

# Define strings for the search function and print the matching patterns for each string:
print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana').group())

banana
bnana
baaaaanana


<br></br>
<b>4.</b> The following example shows how to use the <b>{ }</b> quantifier to find a run of the letter 'a' that repeats at least 2 times and at most 7 times. Note that the pattern only contains the letter 'a', so only the run of 'a' letters is returned, not the whole word:

In [2]:
# import the 're' module
import re

# Define the regular expression by using '{}'
# a{2,7} means that the letter 'a' must repeat at least 2 times and at maximum 7 times for the string to match the pattern
reg = re.compile('a{2,7}')

# Define strings for the search function and print the matching patterns for each string:
print(reg.search('banana'))
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

None
None
aaaaa
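
To match the whole word with a bounded number of 'a' letters after the 'b', the range quantifier can be embedded in the full pattern. The following is a sketch under that assumption; the pattern `ba{2,7}nana` and the test strings are not part of the original example.

```python
import re

reg = re.compile('ba{2,7}nana')

# Build the test strings programmatically to make the number of 'a' letters explicit
one_a = 'b' + 'a' * 1 + 'nana'     # 'banana'
two_a = 'b' + 'a' * 2 + 'nana'     # 'baanana'
five_a = 'b' + 'a' * 5 + 'nana'    # 'baaaaanana'
eight_a = 'b' + 'a' * 8 + 'nana'   # more repetitions than the allowed maximum

print(reg.search(one_a))           # None: only one 'a' follows the 'b'
print(reg.search(two_a).group())   # baanana
print(reg.search(five_a).group())  # baaaaanana
print(reg.search(eight_a))         # None: eight 'a' letters exceed the maximum of 7
```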


## **4- Disjunctions represent a logical OR**

In order to match one string <b>OR</b> another, the disjunction operator is used.

Disjunctions between single characters are written in square brackets <b>'[ ]'</b>; alternatives of any length are separated by the pipe sign <b>'|'</b>.

For example, <b>[aA]pple</b> matches both ‘apple’ and ‘Apple’.
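
As a small sketch, the bracket form and the pipe form of this disjunction can be compared on a few candidate words. The word list and variable names are assumptions made for illustration.

```python
import re

words = ['apple', 'Apple', 'APPLE']

# Square brackets: exactly one character out of the listed set
with_brackets = [w for w in words if re.fullmatch('[aA]pple', w)]

# Pipe sign: one of the complete alternatives
with_pipe = [w for w in words if re.fullmatch('apple|Apple', w)]

print(with_brackets)  # ['apple', 'Apple']
print(with_pipe)      # ['apple', 'Apple']
```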

#### **Code Examples:**

<br></br>
<b>1.</b> The following examples show how to use square brackets <b>[ ]</b> and the pipe sign <b>|</b> to match the strings “banana”, "bonana" and "bunana":

In [None]:
# 'b[aou]nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b[aou]nana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


In [None]:
# 'b(a|o|u)nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b(a|o|u)nana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


In [None]:
# 'banana|bonana|bunana' matches strings which contain either 'banana', 'bonana' or 'bunana'.
reg = re.compile('banana|bonana|bunana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


## **5- Character classes represent certain groups of characters**

* ``\d``: matches digits (for ASCII text the same as ``[0-9]``)
* ``\w``: matches alphanumeric characters and underscores (for ASCII text the same as ``[0-9A-Za-z_]``)
* ``\s``: matches white space characters
* ``.``: matches any character except a newline
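
The following sketch applies each of these character classes to one sample string with `findall()` and `search()`. The sample string and the variable names are assumptions chosen for illustration.

```python
import re

text = 'Room 42, floor_3'

# \d: every single digit
digits = re.findall(r'\d', text)

# \w+: runs of alphanumeric characters and underscores
words = re.findall(r'\w+', text)

# \s: every white space character
spaces = re.findall(r'\s', text)

# '.': any character, here the first three characters of the string
first_three = re.search(r'...', text).group()

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3']
print(spaces)       # [' ', ' ']
print(first_three)  # Roo
```

The patterns are written as raw strings (`r'...'`) so that Python passes the backslash to the `re` module unchanged.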

In [None]:
# This regular expression matches strings which consist of 'Hello', a space and exactly one following word.
# The raw string (r'...') keeps Python from treating '\w' as a string escape sequence.
reg = re.compile(r'^Hello \w+$')

print(reg.search('Hello World').group())
print(reg.search('Hello new World'))

Hello World
None


In [None]:
# This expression matches strings which start with 'Hello' followed by at least one more character
reg = re.compile('^Hello.+$')

print(reg.search('Hello World').group())
print(reg.search('Hello new World').group())

Hello World
Hello new World


## **6- Find regular expressions in files**

In [None]:
# Create sample text file

text ="This is a sample text which contains several characters.\n\
For example, I bought cheap bananas today."

# Write content in output file
with open("/content/regText.txt", 'w') as file:
  file.write(text)

In [None]:
# Find the expression 'text' in sample file 'regText.txt'

!grep text '/content/regText.txt'

This is a sample text which contains several characters.


In [None]:
# Print regText.txt with the first 'banana' in each line replaced by 'test'
# (without the -i option, sed does not change the file itself)

!sed -e 's/banana/test/' '/content/regText.txt'

This is a sample text which contains several characters.
For example, I bought cheap tests today.
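
The same two steps can also be sketched in pure Python with the `re` module. This is not a drop-in replacement for `grep` and `sed`, only an illustration of the idea; the temporary directory is an assumption so that the sketch runs outside of an environment that provides `/content`.

```python
import os
import re
import tempfile

text = ("This is a sample text which contains several characters.\n"
        "For example, I bought cheap bananas today.")

# Write the sample text to a file in a temporary directory (assumed location)
path = os.path.join(tempfile.mkdtemp(), 'regText.txt')
with open(path, 'w') as file:
    file.write(text)

# Read the file back line by line
with open(path) as file:
    lines = file.read().splitlines()

# Similar to grep: keep the lines in which the pattern occurs
grep_result = [line for line in lines if re.search('text', line)]

# Similar to sed 's/banana/test/': replace the first occurrence in each line
sed_result = [re.sub('banana', 'test', line, count=1) for line in lines]

print(grep_result)
print(sed_result)
```

As with the `sed` call above, the file on disk stays unchanged; only the lines held in memory are modified.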

Copyright © 2021 IU International University of Applied Sciences

Sources:
- https://www.regular-expressions.info/anchors.html