# Boundary Matchers

Consider a scenario where you want to find all occurrences of `and`, `or`, and `the` in the given text.

In [1]:
import re
from utils import highlight_regex_matches

In [2]:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, 
remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
"""

In [3]:
pattern = re.compile("and|or|the")

In [4]:
pattern.findall(txt)

['or',
 'the',
 'and',
 'or',
 'the',
 'and',
 'the',
 'and',
 'the',
 'the',
 'the',
 'or',
 'and',
 'or',
 'or']

In [5]:
highlight_regex_matches(pattern, txt)


L[42m[1mor[0mem Ipsum is simply dummy text of [42m[1mthe[0m printing [42m[1mand[0m typesetting industry. 
L[42m[1mor[0mem Ipsum has been [42m[1mthe[0m industry's st[42m[1mand[0mard dummy text ever since [42m[1mthe[0m 1500s, 
when an unknown printer took a galley of type [42m[1mand[0m scrambled it to make a type specimen book. 
It has survived not only five centuries, but also [42m[1mthe[0m leap into electronic typesetting, 
remaining essentially unchanged. 
It was popularised in [42m[1mthe[0m 1960s with [42m[1mthe[0m release of Letraset sheets containing L[42m[1mor[0mem Ipsum passages, 
[42m[1mand[0m m[42m[1mor[0me recently with desktop publishing software like Aldus PageMaker including versions of L[42m[1mor[0mem Ipsum.



In [6]:
pattern.search(txt)

<re.Match object; span=(2, 4), match='or'>

There is a slight problem with the above pattern: `and`, `or`, and `the` are also matched when they occur inside longer words (for example, the `or` in `Lorem` and the `and` in `standard`), whereas we only want to match them when they appear as standalone words.

### What is the solution?

The solution is to use this pattern:

`\b(and|or|the)\b`

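where `\b` is a metacharacter that matches at a position called a **word boundary**: the point between a word character (`\w`) and a non-word character, or the start or end of the input when it is adjacent to a word character.

Metacharacters like this, which match a position in the input rather than a character, are called **Boundary Matchers**.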

**Note:** In an ordinary Python string literal, `\b` is the escape sequence for the backspace character. To pass it to the regex engine as a metacharacter, escape the backslash itself (`\\b`) or write the pattern as a raw string, e.g. `r"\b(and|or|the)\b"`.
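
To make the difference concrete, here is a minimal sketch comparing three spellings of the pattern. The sample sentence and the variable names are made up for illustration; only standard `re` calls are used.

```python
import re

# In a normal string literal, "\b" is the single backspace character (0x08).
print("\b" == "\x08")    # True

# Escaping the backslash, or using a raw string, gives the two characters
# backslash + "b", which the regex engine reads as a word boundary.
print("\\b" == r"\b")    # True

sample = "the other and band"  # hypothetical sample text

escaped = re.compile("\\b(and|or|the)\\b")
raw = re.compile(r"\b(and|or|the)\b")
unescaped = re.compile("\b(and|or|the)\b")  # searches for literal backspaces

escaped_hits = escaped.findall(sample)
raw_hits = raw.findall(sample)
unescaped_hits = unescaped.findall(sample)

print(escaped_hits)      # ['the', 'and']
print(raw_hits)          # ['the', 'and']
print(unescaped_hits)    # []
```

The unescaped version compiles without error, which is why this mistake is easy to miss: it simply never matches ordinary text.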

In [7]:
print("\\b(and|or|the)\\b")

\b(and|or|the)\b


In [8]:
pattern = re.compile("\\b(and|or|the)\\b")
print(pattern)

re.compile('\\b(and|or|the)\\b')


In [9]:
re.findall(pattern, txt)

['the', 'and', 'the', 'the', 'and', 'the', 'the', 'the', 'and']

In [10]:
highlight_regex_matches(pattern, txt)


Lorem Ipsum is simply dummy text of **the** printing **and** typesetting industry.
Lorem Ipsum has been **the** industry's standard dummy text ever since **the** 1500s,
when an unknown printer took a galley of type **and** scrambled it to make a type specimen book.
It has survived not only five centuries, but also **the** leap into electronic typesetting,
remaining essentially unchanged.
It was popularised in **the** 1960s with **the** release of Letraset sheets containing Lorem Ipsum passages,
**and** more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.



The following table lists the boundary matchers available in Python's `re` module:

<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Matcher</th>
    <th>Description</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>^</td>
    <td>Matches at the beginning of the input; with the re.M flag, also at the beginning of each line</td>
</tr>
    
<tr>
    <td>$</td>
    <td>Matches at the end of the input (or just before a final newline); with the re.M flag, also at the end of each line</td>
</tr>

<tr>
    <td>\b</td>
    <td>Matches a word boundary</td>
</tr>

<tr>
    <td>\B</td>
    <td>Matches the opposite of \b: any position that is not a word boundary</td>
</tr>

<tr>
    <td>\A</td>
    <td>Matches the beginning of the input</td>
</tr>

<tr>
    <td>\Z</td>
    <td>Matches the end of the input</td>
</tr>
</tbody>
</table>
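
As a rough illustration of the table, the sketch below runs each matcher against a small made-up string. The input text and variable names are assumptions chosen for the demonstration.

```python
import re

text = "one two\nthree four\n"  # hypothetical two-line input ending in a newline

# ^ and \A match at the start of the input; with re.M, ^ also matches after each newline.
caret_default = re.findall(r"^\w+", text)
caret_multiline = re.findall(r"^\w+", text, flags=re.M)
slash_a = re.findall(r"\A\w+", text, flags=re.M)
print(caret_default)     # ['one']
print(caret_multiline)   # ['one', 'three']
print(slash_a)           # ['one']

# $ matches at the very end or just before a final newline; \Z only at the very end.
dollar_default = re.findall(r"\w+$", text)
dollar_multiline = re.findall(r"\w+$", text, flags=re.M)
slash_z = re.findall(r"\w+\Z", text)
print(dollar_default)    # ['four']
print(dollar_multiline)  # ['two', 'four']
print(slash_z)           # [] because the input ends with a newline

# \b needs a word/non-word transition; \B needs the absence of one.
standalone = re.findall(r"\bcat\b", "cat concatenate")
embedded = re.findall(r"\Bcat\B", "cat concatenate")
print(standalone)        # ['cat'] (the standalone word)
print(embedded)          # ['cat'] (the one inside "concatenate")
```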

### Example 1

Consider a scenario where we want to find all the lines in the given text which **start** with the pattern `Name:`.

In [11]:
txt = """
Name:
Age: 0
Roll No.: 15
Grade: S

Name: Ravi
Age: -1
Roll No.: 123 Name: ABC
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G
"""

In [12]:
pattern = re.compile("Name:")

In [13]:
highlight_regex_matches(pattern, txt)


[42m[1mName:[0m
Age: 0
Roll No.: 15
Grade: S

[42m[1mName:[0m Ravi
Age: -1
Roll No.: 123 [42m[1mName:[0m ABC
Grade: K

[42m[1mName:[0m Ram
Age: N/A
Roll No.: 1
Grade: G



In [14]:
pattern = re.compile("^Name:")
highlight_regex_matches(pattern, txt)


Name:
Age: 0
Roll No.: 15
Grade: S

Name: Ravi
Age: -1
Roll No.: 123 Name: ABC
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G



In [15]:
pattern = re.compile("^Name:", flags=re.M)
highlight_regex_matches(pattern, txt)


[42m[1mName:[0m
Age: 0
Roll No.: 15
Grade: S

[42m[1mName:[0m Ravi
Age: -1
Roll No.: 123 Name: ABC
Grade: K

[42m[1mName:[0m Ram
Age: N/A
Roll No.: 1
Grade: G



In [16]:
pattern = re.compile("^Name:.*", flags=re.M)
highlight_regex_matches(pattern, txt)


[42m[1mName:[0m
Age: 0
Roll No.: 15
Grade: S

[42m[1mName: Ravi[0m
Age: -1
Roll No.: 123 Name: ABC
Grade: K

[42m[1mName: Ram[0m
Age: N/A
Roll No.: 1
Grade: G



In [17]:
pattern = re.compile(r"^Name: \w+", flags=re.M)  # raw string, so \w is not treated as a string escape
highlight_regex_matches(pattern, txt)


Name:
Age: 0
Roll No.: 15
Grade: S

[42m[1mName: Ravi[0m
Age: -1
Roll No.: 123 Name: ABC
Grade: K

[42m[1mName: Ram[0m
Age: N/A
Roll No.: 1
Grade: G



In [18]:
pattern = re.compile(r"^Name: \w+", flags=re.M)

In [19]:
pattern.findall(txt)

['Name: Ravi', 'Name: Ram']

> `re.M` (short for `re.MULTILINE`) is a flag that makes `^` and `$` match at the beginning and end of every line, not only at the beginning and end of the whole input.
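
The effect of the flag can be checked on a tiny string that, like `txt` above, begins with a newline. This is only a sketch; the sample string and variable names are invented for the demonstration.

```python
import re

# Hypothetical input that starts with a newline, like txt in the cells above.
sample = "\nName: Ravi\nAge: 20\nName: Ram\n"

no_flag = re.search(r"^Name:", sample)
with_flag = re.search(r"^Name:", sample, flags=re.MULTILINE)

print(no_flag)            # None: the input starts with "\n", not "Name:"
print(with_flag.span())   # (1, 6): ^ also matches right after the first newline

# Start offsets of every line that begins with "Name:".
starts = [m.start() for m in re.finditer(r"^Name:", sample, flags=re.M)]
print(starts)             # [1, 20]

# re.M is just the short alias of re.MULTILINE.
print(re.M == re.MULTILINE)  # True
```

This also explains why the cell without `re.M` highlighted nothing: the only place `^` can match there is position 0, which is immediately followed by a newline.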

### Example 2

Find all the sentences in the given text (one per line) that do not end with a full stop (`.`).

In [20]:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages
More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."""

In [21]:
pattern = re.compile(r"^.*[^.]$", flags=re.M)  # inside [...], the dot needs no backslash

In [22]:
pattern.findall(txt)

["Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!",
 'It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages']

In [23]:
highlight_regex_matches(pattern, txt)


Lorem Ipsum is simply dummy text of the printing and typesetting industry.
[42m[1mLorem Ipsum has been the industry's standard dummy text ever since the 1500s![0m
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
[42m[1mIt was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages[0m
More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
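
One subtlety of this pattern, offered as a hedged aside: the negated class `[^.]` can also match a newline, so a line that ends with a full stop and is followed by a blank line may still be reported. The sketch below uses a made-up string to show this, a possible variant `[^.\n]` that avoids it, and a plain-Python cross-check with `str.splitlines` and `str.endswith`. The variable names are illustrative only.

```python
import re

# Hypothetical text: a line ending in ".", then a blank line, then a line without ".".
sample = "Ends with a dot.\n\nNo dot here"

original = re.compile(r"^.*[^.]$", flags=re.M)
stricter = re.compile(r"^.*[^.\n]$", flags=re.M)  # also excludes the newline

original_hits = original.findall(sample)
stricter_hits = stricter.findall(sample)
print(original_hits)  # ['Ends with a dot.\n', 'No dot here']
print(stricter_hits)  # ['No dot here']

# Cross-check without regex: keep non-empty lines that do not end with ".".
plain_hits = [line for line in sample.splitlines() if line and not line.endswith(".")]
print(plain_hits)     # ['No dot here']
```

The text in this example has no blank lines between sentences, so both patterns should give the same result there.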


![](images/memes/meme16.png)

<h2>\b metacharacter</h2>
`\b` is the word-boundary metacharacter; it lets you isolate whole words.

Like `^` and `$`, it matches a location and does not consume any characters.
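
A small sketch of what "no consumption" means in practice. The input string is invented for illustration.

```python
import re

text = " cat cat "  # hypothetical input: two words separated by single spaces

# \b matches a position, so the match itself is empty.
first_boundary = re.search(r"\b", text)
print(first_boundary.span())   # (1, 1)

# Spaces are consumed by the match, so the shared middle space
# can only belong to one match.
with_spaces = re.findall(r" cat ", text)
print(with_spaces)             # [' cat ']

# \b consumes nothing, so both words are found.
with_boundaries = re.findall(r"\bcat\b", text)
print(with_boundaries)         # ['cat', 'cat']
```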


In [2]:
import re
string = 'cat catherine catholic  wildcat copycat uncatchable'
pattern = re.compile('cat')

In [3]:
re.findall(pattern, string)

['cat', 'cat', 'cat', 'cat', 'cat', 'cat']

In [4]:
#using space

In [5]:
string = 'cat catherine catholic  wildcat copycat uncatchable'

In [6]:
pattern = re.compile(' cat ')

In [7]:
re.findall(pattern, string)

[]

In [8]:
pattern = re.compile('cat ')
re.findall(pattern, string)

['cat ', 'cat ', 'cat ']

In [9]:
pattern = re.compile(r'\bcat\b')
re.findall(pattern, string)

['cat']

In [10]:
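# Be careful with periods (dots) and other non-alphanumeric characters.
#   \w matches word characters such as [A-Za-z0-9_];  \W matches the rest, e.g. + : @ ^ %
# A period is a non-word character.
# A word boundary needs a word character on one side and a
# non-word character (or the start/end of the string) on the other.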
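
The rule in the comments above can be written out by hand and compared with what `\b` reports. This is a minimal sketch: `word_boundaries` is a hypothetical helper written for this illustration, not part of the `re` module.

```python
import re

def is_word_char(ch):
    """True if ch is a single word character as understood by the regex engine."""
    return re.fullmatch(r"\w", ch) is not None

def word_boundaries(text):
    """Return every index where exactly one neighbour is a word character."""
    positions = []
    for i in range(len(text) + 1):
        before = i > 0 and is_word_char(text[i - 1])
        after = i < len(text) and is_word_char(text[i])
        if before != after:
            positions.append(i)
    return positions

sample = ".cat @dog"  # hypothetical input
by_hand = word_boundaries(sample)
by_regex = [m.start() for m in re.finditer(r"\b", sample)]

print(by_hand)   # [1, 4, 6, 9]
print(by_regex)  # [1, 4, 6, 9]
```

Position 0 (before the `.`) and position 5 (between the space and `@`) are not boundaries, because neither neighbour is a word character.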

In [11]:
string = '.cat catherine catholic  wildcat copycat uncatchable'

In [12]:
pattern = re.compile(r'\bcat\b')
re.findall(pattern, string)

['cat']

In [13]:
string = '@cat cat catherine catholic  wildcat copycat uncatchable'

In [14]:
pattern = re.compile(r'\bcat\b')
re.findall(pattern, string)

['cat', 'cat']

In [15]:
string = '@moondra2017.org'
string2 = '@moondra'
string3 = 'Python@moondra'
string4 = '@moondra_python'

# We only want '@moondra' and '@moondra_python' (string2 and string4)

In [16]:
pattern = re.compile(r'\b@[\w]+\b')    # no good: there is no word boundary before a leading @
re.search(pattern, string)

In [17]:
pattern = re.compile(r'\B@[\w]+\b')    # _ is included in \w
re.search(pattern, string)            # This matches, but it is not perfect: the '.org' part is ignored

<re.Match object; span=(0, 12), match='@moondra2017'>

In [31]:
pattern = re.compile(r'\B@[\w]+\b(?!\.)')
re.findall(pattern, string)

[]

In [30]:
pattern = re.compile(r'[a-z]?@[\w]+\b\.\w+')
re.findall(pattern, string)

['@moondra2017.org']

In [32]:
pattern = re.compile(r'[a-z]@[\w]+.\w+')
re.findall(pattern, string)

[]

In [21]:
string = '@moondra2017.org'
string2 = '@moondra @moondra @moondra'
string3 = 'Python@moondra'
string4 = '@moondra_python'
pattern = re.compile(r'\B@[\w]+$')    # This works: the handle must run to the end of the string
re.search(pattern, string)

In [22]:
pattern = re.compile(r'\B@[\w]+$') 
re.findall(pattern, string2)

['@moondra']

In [23]:
pattern = re.compile(r'\B@[\w]+$')
re.search(pattern, string4)

<re.Match object; span=(0, 15), match='@moondra_python'>
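
To summarise the experiments above, the sketch below checks the final pattern against all four original strings at once. It also tries a lookaround-based variant, which is only a hypothetical alternative (not the method used in the cells above), for the case where several handles appear in one string. It assumes that a handle followed by a `.` should always be rejected.

```python
import re

handles = ["@moondra2017.org", "@moondra", "Python@moondra", "@moondra_python"]

# Final pattern from the cells above: no word character before @,
# and the handle must run to the end of the string.
anchored = re.compile(r"\B@\w+$")
anchored_results = [bool(anchored.search(s)) for s in handles]
print(anchored_results)    # [False, True, False, True]

# Hypothetical alternative: lookarounds instead of $, so that several
# handles in one string can all be found.
lookaround = re.compile(r"(?<!\w)@\w+(?![\w.])")
lookaround_results = [bool(lookaround.search(s)) for s in handles]
many = lookaround.findall("@moondra @moondra @moondra")
print(lookaround_results)  # [False, True, False, True]
print(many)                # ['@moondra', '@moondra', '@moondra']
```

With `$`, only the last handle in `'@moondra @moondra @moondra'` is returned, as the `findall` cell above shows. The lookaround variant trades that anchoring for the ability to match handles in the middle of a string.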