# **Regular Expressions**

Regular expressions (also called “regex”) are a powerful and widely supported pattern language for searching strings.

Regular expressions in Python are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. The most significant ones are:
- ``match()``:$\qquad$ Determine if the regular expression matches at the beginning of the string.<br>
- ``search()``:$\qquad$ Scan a string, looking for any location where this regular expression matches.<br>
- ``findall()``:$\qquad$ Find all substrings where the regular expression matches, and return them as a list.<br>
- ``finditer()``:$\qquad$ Find all substrings where the regular expression matches, and return them as an iterator.<br>
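
The difference between these four methods is easiest to see when they are applied to the same input. The following is a minimal sketch; the sample sentence and the variable names are made up for illustration and are not used elsewhere in this notebook.

```python
import re

# Hypothetical sample sentence, chosen only for illustration
sentence = "I like bananas, and bananas are cheap"
pattern = re.compile("banana")

# match() only succeeds if the pattern matches at index 0
at_start = pattern.match(sentence)

# search() returns the first match anywhere in the string
first = pattern.search(sentence)

# findall() returns all non-overlapping matches as a list of strings
all_matches = pattern.findall(sentence)

# finditer() yields one match object per match; here we collect the spans
spans = [m.span() for m in pattern.finditer(sentence)]

print(at_start)       # None
print(first.span())   # (7, 13)
print(all_matches)    # ['banana', 'banana']
print(spans)          # [(7, 13), (20, 26)]
```

``findall()`` is convenient when only the matched text is needed, while ``finditer()`` keeps the positions available through the match objects.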

This notebook shows examples for the following meta-character settings:
- Anchors
- Quantifiers
- Disjunctions
- Character Classes

Further Python documentation:
- Regular expression syntax in the library ``re``: https://docs.python.org/3/library/re.html
- Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html#regex-howto

In [None]:
# Load 're' module for all following code cells
import re

## **Anchors mark a position in the string**

- $\text{^}$ : $\quad$ means the beginning of the string
- $\text{\$}$: $\quad$  means the end of the string

In [None]:
# Define regular expression
reg = re.compile('^banana')

# Does the sentence "bananas are cheap" begin with "banana"?
p = reg.search("bananas are cheap")

# Show results:
if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())
    
    # Print the start index and the (exclusive) end index of the match
    print('start and end index of the match in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))

    # Print the start and end index as a tuple:
    print('start and end index of the match as tuple: \t{}'.format(p.span()))
else:
    print("no match")


matching string:	banana
start and end index of the match in the sentence: 	 start: 0 	 end: 6
start and end index of the match as tuple: 	(0, 6)


In [None]:
# Check if the sentence "I have a banana" ends with "banana"
reg = re.compile('banana$')

p = reg.search("I have a banana")

# Show results:

if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())

    # Print the start index and the (exclusive) end index of the match
    print('start and end index of the match in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))

    # Print the start and end index as a tuple:
    print('start and end index of the match as tuple: \t{}'.format(p.span()))
else:
    print("no match")

matching string:	banana
start and end index of the match in the sentence: 	 start: 9 	 end: 15
start and end index of the match as tuple: 	(9, 15)


In [None]:
# '^' means the beginning of a string
# '$' means the end of a string

reg = re.compile('^banana$')
 
# Do the following strings consist of exactly 'banana' (beginning and end at once)?
print(reg.search('bananas are cheap'))

print(reg.search('I have a banana'))

print(reg.search('banana').group())

None
None
banana
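
A closely related option is the method ``fullmatch()``, which only reports a match if the whole string matches the pattern. The sketch below assumes the same three test strings as above; the variable names are illustrative. It also shows one subtle difference: ``$`` additionally matches just before a newline at the end of the string, while ``fullmatch()`` does not accept the extra newline.

```python
import re

reg = re.compile("banana")

# The same three test strings as in the cell above
candidates = ["bananas are cheap", "I have a banana", "banana"]
full = [bool(reg.fullmatch(s)) for s in candidates]
print(full)  # [False, False, True]

# '$' also matches just before a trailing newline ...
anchored = re.search("^banana$", "banana\n")
# ... whereas fullmatch() requires the pattern to cover the entire string
strict = reg.fullmatch("banana\n")

print(anchored is not None)  # True
print(strict)                # None
```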


#### Alternative method: Use module functions for matches in string variable

Besides the methods of compiled pattern objects, the ``re`` module defines several functions, constants, and an exception.<br>
Some of the functions, such as ``re.search()`` and ``re.match()``, are simplified versions of the full-featured methods for compiled regular expressions: they take the pattern as their first argument. Most non-trivial applications use the compiled form.

In [None]:
# Alternative method: Use the module function search() for matches in string variable
p = re.search('^banana', 'bananas are cheap')

# Show results:
if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())
    
    # Print the start index and the (exclusive) end index of the match
    print('start and end index of the match in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))

    # Print the start and end index as a tuple:
    print('start and end index of the match as tuple: \t{}'.format(p.span()))
else:
    print("no match")

matching string:	banana
start and end index of the match in the sentence: 	 start: 0 	 end: 6
start and end index of the match as tuple: 	(0, 6)


In [None]:
# Here, the function match() returns the same result as search().
#
# Note:
# match() only checks whether the regular expression matches at the beginning of the string,
# while search() scans the whole string for a match.
# If a match does not start at index 0, match() does not report it.
#
# See also "search() vs. match()": https://docs.python.org/3/library/re.html#search-vs-match

p = re.match('^banana', 'bananas are cheap')

if p:
  print('matching string: ' + p.group())

matching string: banana
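
The two functions only differ when the match does not start at index 0. A minimal sketch of that case, reusing the sentence "I have a banana" from the anchor examples (the variable names are illustrative):

```python
import re

sentence = "I have a banana"

# match() is tied to index 0, so a match later in the string is not reported
m = re.match("banana", sentence)

# search() scans the whole string and finds the match starting at index 9
s = re.search("banana", sentence)

print(m)         # None
print(s.span())  # (9, 15)
```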


## **Quantifiers indicate the number of repetitions of the preceding character or group**

- $\text{*}$: $\quad$  means 0 or more repetitions
- $\text{+}$: $\quad$ means 1 or more repetitions
- $\text{?}$: $\quad$ means 0 or 1 repetitions
- If more precise quantifiers are needed, the number of repetitions can be written in curly brackets, e.g. ``{3}`` (exactly 3), ``{2,}`` (2 or more) or ``{2,7}`` (between 2 and 7).

In [None]:
# '?' means 0 or 1 repetitions
reg = re.compile('ba?nana')

print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana'))

banana
bnana
None


In [None]:
# '+' means 1 or more repetitions
reg = re.compile('ba+nana')

print(reg.search('banana').group())
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

banana
None
baaaaanana


In [None]:
# '*' means 0 or more repetitions
reg = re.compile('ba*nana')

print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana').group())

banana
bnana
baaaaanana


In [None]:
# 'a{2,7}' means that the letter 'a' must repeat at least 2 times and 
# at maximum 7 times for the string to match the pattern
reg = re.compile('a{2,7}')

print(reg.search('banana'))
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

None
None
aaaaa
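
A quantifier applies to the element directly before it, which can also be a group in parentheses instead of a single character. The following sketch illustrates this together with the other curly-bracket forms; the test strings are made up for this example.

```python
import re

# In 'ba(na)+' the quantifier '+' repeats the whole group '(na)'
grouped = re.compile(r"ba(na)+")
print(grouped.search("bananana").group())   # bananana

# '{3}' means exactly 3 repetitions of the preceding 'a'
exact = re.compile(r"ba{3}nana")
print(exact.search("baaanana").group())     # baaanana
print(exact.search("baanana"))              # None

# '{2,}' means 2 or more repetitions of the preceding 'a'
at_least = re.compile(r"ba{2,}nana")
print(at_least.search("baanana").group())   # baanana
print(at_least.search("banana"))            # None
```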


## **Disjunctions represent a logical OR**

A disjunction between single characters is written as a character set in square brackets ``[ ]``; a disjunction between arbitrary sub-patterns is written with the pipe sign ``|``.

In [None]:
# 'b[aou]nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b[aou]nana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


In [None]:
# 'b(a|o|u)nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b(a|o|u)nana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


In [None]:
# 'banana|bonana|bunana' matches strings which contain either 'banana', 'bonana' or 'bunana'.
reg = re.compile('banana|bonana|bunana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana
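
The parentheses in ``b(a|o|u)nana`` matter, because the pipe sign separates everything to its left from everything to its right. The sketch below uses a deliberately wrong pattern (``ba|onana``, invented for this illustration) to show the effect, and also shows that the chosen alternative can be read from the group.

```python
import re

# Without parentheses, '|' splits the whole pattern: 'ba' OR 'onana'
loose = re.compile(r"ba|onana")

# With parentheses, only the second character varies
grouped = re.compile(r"b(a|o)nana")

print(loose.search("bonana").group())    # onana
print(grouped.search("bonana").group())  # bonana

# group(1) returns the alternative that matched inside the parentheses
print(grouped.search("bonana").group(1)) # o
```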


## **Character classes represent certain groups of characters**

* $\text{\d}\qquad$ :  matches a digit; equivalent to ``[0-9]`` for ASCII text
* $\text{\w}\qquad$ :  matches an alphanumeric character or an underscore; equivalent to ``[0-9A-Za-z_]`` for ASCII text
* $\text{\s}\qquad$ :  matches a white space character (e.g. space, tab, newline)
* $.\qquad$ :  matches any character except a newline

In Python 3, ``\d`` and ``\w`` also match non-ASCII Unicode digits and letters unless the flag ``re.ASCII`` is set.

In [None]:
# This regular expression matches strings which consist of 'Hello', a space and exactly one word.
# The raw string r'...' keeps Python from interpreting '\w' as a string escape sequence.
reg = re.compile(r'^Hello \w+$')

print(reg.search('Hello World').group())
print(reg.search('Hello new World'))

Hello World
None


In [None]:
# This expression matches strings which start with 'Hello', followed by at least one more character
reg = re.compile('^Hello.+$')

print(reg.search('Hello World').group())
print(reg.search('Hello new World').group())

Hello World
Hello new World
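
Character classes are often combined with ``findall()`` or ``split()`` to extract parts of a string. A minimal sketch, assuming a made-up sample sentence; it also shows that ``.`` skips newlines unless the flag ``re.DOTALL`` is passed.

```python
import re

# Hypothetical sample sentence, chosen only for illustration
sentence = "3 bananas cost 2 euros"

# '\d+' finds runs of digits, '\w+' finds runs of word characters
digits = re.findall(r"\d+", sentence)
words = re.findall(r"\w+", sentence)

# '\s+' can serve as a separator for splitting at white space
parts = re.split(r"\s+", sentence)

print(digits)  # ['3', '2']
print(words)   # ['3', 'bananas', 'cost', '2', 'euros']
print(parts)   # ['3', 'bananas', 'cost', '2', 'euros']

# '.' does not match a newline by default, but does with re.DOTALL
no_dot = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

print(no_dot)                   # None
print(with_dotall is not None)  # True
```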


## **Find regular expressions in files**

In [None]:
# Create sample text file

text = "This is a sample text which contains several characters.\n\
For example, I bought cheap bananas today."

# Write the content to the output file
with open("/content/regText.txt", 'w') as file:
    file.write(text)

In [None]:
# Find the expression 'text' in sample file 'regText.txt'

!grep text '/content/regText.txt'

This is a sample text which contains several characters.


In [None]:
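# Replace the first occurrence of 'banana' in each line with 'test'.
# Without the option -i, sed only prints the result and leaves regText.txt unchanged.

!sed -e 's/banana/test/' '/content/regText.txt'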

This is a sample text which contains several characters.
For example, I bought cheap tests today.
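
The same two steps can also be sketched in pure Python with the ``re`` module, which may be useful outside of a notebook where the ``!`` shell commands are not available. This is only a rough equivalent of the ``grep`` and ``sed`` calls above, not a full reimplementation; the temporary directory is an assumption made so that the example runs anywhere.

```python
import os
import re
import tempfile

text = ("This is a sample text which contains several characters.\n"
        "For example, I bought cheap bananas today.")

# Write the sample text to a file in a temporary directory (assumed location)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regText.txt")
    with open(path, "w") as f:
        f.write(text)

    # Roughly what 'grep text' does: keep the lines that contain a match
    pattern = re.compile("text")
    with open(path) as f:
        matching_lines = [line.rstrip("\n") for line in f if pattern.search(line)]

    # Roughly what "sed 's/banana/test/'" does: replace the first match per line
    with open(path) as f:
        replaced = [re.sub("banana", "test", line.rstrip("\n"), count=1) for line in f]

print(matching_lines)
print(replaced)
```

As with ``sed`` without ``-i``, the file itself is not modified; only the lines read into memory are changed.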

Copyright © 2021 IU International University of Applied Sciences