# Regular Expression

A Regular Expression (RegEx) is a sequence of characters that defines a search pattern.

### MetaCharacters
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

**[] . ^ $ * + ? {} () \ |**

#### 1. [] - Square brackets

Square brackets specify a set of characters you wish to match.

In [1]:
import re

In [2]:
pattern = '[abc]'
string = 'abyss'
re.search(pattern, string)

<re.Match object; span=(0, 1), match='a'>

Here, [abc] matches if the string contains any of the characters a, b or c.

We can also specify a range of characters using - inside square brackets.

* [a-e] is the same as [abcde].
* [1-4] is the same as [1234].
* [0-39] is the same as [01239].

We can complement (invert) the character set by placing the caret symbol ^ right after the opening square bracket.

* [^abc] means any character except a or b or c.
* [^0-9] means any non-digit character.

In [12]:
# [a-e] is the same as [abcde].
print(re.search(r'[a-e]','boy'))

# [1-4] is the same as [1234].
print(re.search(r'[1-4]','26 Avenue Street.'))

# [0-39] is the same as [01239].
print(re.search(r'[0-39]','7th Avenue 3rd Street.'))

# [^abc] means any character except a or b or c.
print(re.search(r'[^abc]','Street.'))

# [^0-9] means any non-digit character.
print(re.search(r'[^0-9]','7th Avenue Street.'))

<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 1), match='2'>
<re.Match object; span=(11, 12), match='3'>
<re.Match object; span=(0, 1), match='S'>
<re.Match object; span=(1, 2), match='t'>
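
The cells above only show the first match. As a small sketch of how a range and a negated set behave over a whole string, the following uses re.findall() (covered later in this document); the sample text is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'Flat 26, 7th Avenue'

# [0-9] is a range: it matches every single digit in the text
digits = re.findall(r'[0-9]', text)

# [^a-z0-9] is a negated set: anything that is NOT a lowercase letter or a digit
others = re.findall(r'[^a-z0-9]', text)

print(digits)  # ['2', '6', '7']
print(others)  # ['F', ' ', ',', ' ', ' ', 'A']
```

Note that uppercase letters, spaces and punctuation all fall into the negated set because none of them is listed inside the brackets.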


#### 2. . - Period

A period matches any single character (except newline '\n').

In [11]:
print(re.search(r'.', '1a'))
print(re.search(r'.', 'a'))
print(re.search(r'.', 'abc'))

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>


#### 3. ^ - Caret

The caret symbol ^ is used to check if a string starts with a certain pattern.

In [13]:
print(re.search(r'^a', 'a'))
print(re.search(r'^a', 'ball'))
print(re.search(r'^ab', 'abcd'))
print(re.search(r'^ab', 'apple'))

<re.Match object; span=(0, 1), match='a'>
None
<re.Match object; span=(0, 2), match='ab'>
None


#### 4. $ - Dollar

The dollar symbol $ is used to check if a string ends with a certain pattern.

In [14]:
print(re.search(r'a$', 'a'))
print(re.search(r'a$', 'ball'))
print(re.search(r'cd$', 'abcd'))
print(re.search(r'cd$', 'apple'))

<re.Match object; span=(0, 1), match='a'>
None
<re.Match object; span=(2, 4), match='cd'>
None
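
The two anchors can also be combined. The sketch below, using made-up test strings, shows that a pattern wrapped in ^ and $ has to account for the string from its first character to its last.

```python
import re

# ^ pins the match to the start, $ pins it to the end
pattern = r'^abcd$'

exact = re.search(pattern, 'abcd')      # whole string is 'abcd'
too_long = re.search(pattern, 'abcde')  # extra character after 'abcd'
prefixed = re.search(pattern, 'xabcd')  # extra character before 'abcd'

print(exact)     # <re.Match object; span=(0, 4), match='abcd'>
print(too_long)  # None
print(prefixed)  # None
```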


#### 5. * - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

In [16]:
# Looking for 0 or more a
print(re.search(r'ma*n', 'man'))
print(re.search(r'ma*n', 'mn'))
print(re.search(r'ma*n', 'woman'))
print(re.search(r'ma*n', 'maaana'))

<re.Match object; span=(0, 3), match='man'>
<re.Match object; span=(0, 2), match='mn'>
<re.Match object; span=(2, 5), match='man'>
<re.Match object; span=(0, 5), match='maaan'>


#### 6. + - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

In [17]:
# Looking for at least 1 a
print(re.search(r'ma+n', 'man'))
print(re.search(r'ma+n', 'mn'))
print(re.search(r'ma+n', 'woman'))
print(re.search(r'ma+n', 'maaana'))

<re.Match object; span=(0, 3), match='man'>
None
<re.Match object; span=(2, 5), match='man'>
<re.Match object; span=(0, 5), match='maaan'>


#### 7. ? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

In [18]:
# Looking for min 0 or max 1 a
print(re.search(r'ma?n', 'man'))
print(re.search(r'ma?n', 'mn'))
print(re.search(r'ma?n', 'woman'))
print(re.search(r'ma?n', 'maaana'))

<re.Match object; span=(0, 3), match='man'>
<re.Match object; span=(0, 2), match='mn'>
<re.Match object; span=(2, 5), match='man'>
None


#### 8. {} - Braces

Consider this syntax: {n,m}. This means at least n, and at most m, repetitions of the pattern to its left.

In [20]:
# Looking for min 2 and max 3 a
print(re.search(r'a{2,3}', 'abc dat'))
print(re.search(r'a{2,3}', 'abc daat'))
print(re.search(r'a{2,3}', 'aabc daat'))
print(re.search(r'a{2,3}', 'aabc daaaat'))

None
<re.Match object; span=(5, 7), match='aa'>
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(0, 2), match='aa'>
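
The outputs above never have more than two a's available at the first match position, so they do not show that braces are greedy. The following sketch uses made-up strings to illustrate that, and also the related forms {n} (exactly n) and {n,} (at least n), which are standard syntax in Python's re module.

```python
import re

# Greedy: four a's are available, so {2,3} takes three of them, not two
greedy = re.search(r'a{2,3}', 'daaaat')
print(greedy)  # <re.Match object; span=(1, 4), match='aaa'>

# {n} means exactly n repetitions; five a's give two non-overlapping pairs
exact = re.findall(r'a{2}', 'aaaaa')
print(exact)  # ['aa', 'aa']

# {n,} means n or more repetitions, with no upper limit
at_least = re.search(r'a{2,}', 'daaaat')
print(at_least)  # <re.Match object; span=(1, 5), match='aaaa'>
```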


#### 9. | - Alternation

The vertical bar | is used for alternation (the "or" operator): a|b matches either a or b.

In [21]:
# Looking for a or b present in the string
print(re.search(r'a|b', 'cde'))
print(re.search(r'a|b', 'ade'))
print(re.search(r'a|b', 'acdbea'))

None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>


#### 10. () - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains either a or b or c followed by xz.

In [27]:
print(re.search(r'(a|b|c)xz', 'cdexz'))
print(re.search(r'(a|b|c)xz', 'axz'))
print(re.search(r'(a|b|c)xz', 'acdbea'))
print(re.search(r'(a|b|c)xz', 'acdbxzsd'))

None
<re.Match object; span=(0, 3), match='axz'>
None
<re.Match object; span=(3, 6), match='bxz'>


#### 11. \ - Backslash

Backslash \ is used to escape various characters, including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

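If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is not treated in a special way. Do not do this for letters or digits: a backslash followed by a letter or digit may form a special sequence such as \d or \b.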
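
As a minimal sketch of the difference escaping makes, the cell below searches a made-up string for a literal dollar sign and a literal period, with and without the backslash.

```python
import re

# Hypothetical sample text
text = 'price: $a or $b'

# Unescaped, $ is the end-of-string anchor, so nothing can follow it
unescaped = re.search(r'$a', text)
print(unescaped)  # None

# Escaped, \$ matches a literal dollar sign
escaped = re.search(r'\$a', text)
print(escaped)  # <re.Match object; span=(7, 9), match='$a'>

# The same applies to the period: . matches any character, \. only a literal dot
loose = re.search(r'3.14', '3x14')
strict = re.search(r'3\.14', '3x14')
print(loose)   # <re.Match object; span=(0, 4), match='3x14'>
print(strict)  # None
```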

### Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

* \A - Matches if the specified characters are at the start of a string.

In [28]:
print(re.search(r'\Athe', 'the sun'))
print(re.search(r'\Athe', 'In the sun'))

<re.Match object; span=(0, 3), match='the'>
None


* \b - Matches if the specified characters are at the beginning or end of a word.

In [29]:
print(re.search(r'\bfoo', 'football'))
print(re.search(r'\bfoo', 'a football'))
print(re.search(r'\bfoo', 'afootball'))
print(re.search(r'foo\b', 'the foo'))
print(re.search(r'foo\b', 'the afoo test'))
print(re.search(r'foo\b', 'the afootest'))

<re.Match object; span=(0, 3), match='foo'>
<re.Match object; span=(2, 5), match='foo'>
None
<re.Match object; span=(4, 7), match='foo'>
<re.Match object; span=(5, 8), match='foo'>
None


* \B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

In [30]:
print(re.search(r'\Bfoo', 'football'))
print(re.search(r'\Bfoo', 'a football'))
print(re.search(r'\Bfoo', 'afootball'))
print(re.search(r'foo\B', 'the foo'))
print(re.search(r'foo\B', 'the afoo test'))
print(re.search(r'foo\B', 'the afootest'))

None
None
<re.Match object; span=(1, 4), match='foo'>
None
None
<re.Match object; span=(5, 8), match='foo'>


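* \d - Matches any decimal digit. Equivalent to [0-9] for ASCII text (in Python 3 string patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set).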

In [31]:
print(re.search(r'\d', '12abc3'))
print(re.search(r'\d', 'afootest'))

<re.Match object; span=(0, 1), match='1'>
None


* \D - Matches any character that is not a decimal digit. Equivalent to [^0-9] for ASCII text.

In [32]:
print(re.search(r'\D', '12abc3'))
print(re.search(r'\D', 'afootest'))

<re.Match object; span=(2, 3), match='a'>
<re.Match object; span=(0, 1), match='a'>


* \s - Matches any whitespace character. Equivalent to [ \t\n\r\f\v] for ASCII text.

In [33]:
print(re.search(r'\s', 'Python RegEx'))
print(re.search(r'\s', 'PythonRegEx'))

<re.Match object; span=(6, 7), match=' '>
None


* \S - Matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v] for ASCII text.

In [35]:
print(re.search(r'\S', 'a b'))
print(re.search(r'\S', ' '))

<re.Match object; span=(0, 1), match='a'>
None


* \w - Matches any word character: letters, digits and the underscore _. Equivalent to [a-zA-Z0-9_] for ASCII text.

In [36]:
print(re.search(r'\w', '12&": ;c'))
print(re.search(r'\w', '%"> !'))
print(re.search(r'\w', ' '))
print(re.search(r'\w', 'abc123'))

<re.Match object; span=(0, 1), match='1'>
None
None
<re.Match object; span=(0, 1), match='a'>


* \W - Matches any non-word character. Equivalent to [^a-zA-Z0-9_] for ASCII text.

In [37]:
print(re.search(r'\W', '12&": ;c'))
print(re.search(r'\W', '%"> !'))
print(re.search(r'\W', ' '))
print(re.search(r'\W', 'abc123'))

<re.Match object; span=(2, 3), match='&'>
<re.Match object; span=(0, 1), match='%'>
<re.Match object; span=(0, 1), match=' '>
None


* \Z - Matches if the specified characters are at the end of a string.

In [38]:
print(re.search(r'Python\Z', 'I like Python'))
print(re.search(r'Python\Z', 'I like Python Programming'))
print(re.search(r'Python\Z', 'Python is fun.'))

<re.Match object; span=(7, 13), match='Python'>
None
None


**Tip:** To build and test regular expressions, you can use RegEx tester tools such as <a href="https://regex101.com/">regex101</a>. Such a tool helps not only with writing regular expressions but also with learning them.

## Python RegEx

Python has a module named re to work with regular expressions. To use it, we need to import the module.

In [39]:
import re

#### 1. re.findall()
The re.findall() method returns a list of strings containing all matches. If the pattern is not found, re.findall() returns an empty list.

In [40]:
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']
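
One detail worth knowing: when the pattern contains capturing groups (the parentheses described earlier), re.findall() returns the text of the groups instead of the whole match. The sketch below reuses the string from the cell above; the variable names are only illustrative.

```python
import re

string = 'hello 12 hi 89. Howdy 34'

# No groups: each item is the whole match
numbers = re.findall(r'\d+', string)
print(numbers)  # ['12', '89', '34']

# One group: each item is the text captured by that group
words = re.findall(r'(\w+) \d+', string)
print(words)  # ['hello', 'hi', 'Howdy']

# Two groups: each item is a tuple with one entry per group
pairs = re.findall(r'(\w+) (\d+)', string)
print(pairs)  # [('hello', '12'), ('hi', '89'), ('Howdy', '34')]
```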


#### 2. re.split()
The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred. If the pattern is not found, re.split() returns a list containing the original string.

In [41]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

['Twelve:', ' Eighty nine:', '.']


We can pass the maxsplit argument to the re.split() method. It's the maximum number of splits that will occur. The default value of maxsplit is 0, meaning all possible splits.

In [42]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string, maxsplit=1) 
print(result)

['Twelve:', ' Eighty nine:89.']


#### 3. re.sub()
The re.sub() method returns a string in which every match of the pattern is replaced with the replacement text (the replace variable below). If the pattern is not found, re.sub() returns the original string.

In [43]:
# the trailing backslash continues the literal; the only newline is the explicit \n
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

abc12de23f456


We can pass count as an additional argument to the re.sub() method. If omitted, it defaults to 0, which replaces all occurrences.

In [44]:
# the trailing backslash continues the literal; the only newline is the explicit \n
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

new_string = re.sub(pattern, replace, string, count=1) 
print(new_string)

abc12de 23 
 f45 6


#### 4. re.subn()
The re.subn() method is similar to re.sub() except it returns a tuple of 2 items containing the new string and the number of substitutions made.

In [45]:
# the trailing backslash continues the literal; the only newline is the explicit \n
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

('abc12de23f456', 4)


#### 5. re.search()
The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.

In [46]:
string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

pattern found inside the string


Here, match contains a match object.

#### 6. re.fullmatch(pattern, string, flags=0)
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

In [59]:
string = "Python is fun"
string1 = "Python 3 is fun"

# check if 'nonnumeric' string
match = re.fullmatch(r'\D+', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

# check if 'nonnumeric' string
match1 = re.fullmatch(r'\D+', string1)

if match1:
  print("pattern found inside the string")
else:
  print("pattern not found")  

pattern found inside the string
pattern not found


#### 7. re.escape(pattern)
Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

In [60]:
print(re.escape('http://www.python.org'))

http://www\.python\.org
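
To see why this matters in practice, the sketch below searches for a made-up literal that happens to contain metacharacters, first as a raw pattern and then after passing it through re.escape(). The literal and the text are hypothetical examples.

```python
import re

# Hypothetical literal that contains the metacharacters . and *
literal = 'a.b*c'
text = 'axbbc and a.b*c'

# Used directly as a pattern, . and * keep their special meaning
raw_hit = re.search(literal, text)
print(raw_hit)  # <re.Match object; span=(0, 5), match='axbbc'>

# After re.escape(), only the exact characters a.b*c can match
safe_hit = re.search(re.escape(literal), text)
print(safe_hit)  # <re.Match object; span=(10, 15), match='a.b*c'>
```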


## Match object
We can list the methods and attributes of a match object using the dir() function.

Some of the commonly used methods and attributes of match objects are:

#### 1. match.group()
The group() method returns the part of the string where there is a match.

In [47]:
string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

801 35


Our pattern (\d{3}) (\d{2}) has two subgroups (\d{3}) and (\d{2}). We can get the part of the string of these parenthesized subgroups. Here's how:

In [48]:
print(match.group(1))
print(match.group(2))
print(match.group(1, 2))
print(match.groups())

801
35
('801', '35')
('801', '35')


#### 2. match.start()
The start() method returns the index of the start of the matched substring.

In [50]:
match.start()

2

#### 3. match.end()
The end() method returns the index just past the end of the matched substring.

In [51]:
match.end()

8

#### 4. match.span()
The span() method returns a tuple containing the start and end index of the matched part.

In [52]:
match.span()

(2, 8)
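
As a small sketch of how these indexes relate to the original string, the cell below rebuilds the same match and slices the string with the values from span(). It also passes a group number to start(), end() and span(), which these methods accept as an optional argument.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with start and end gives back the matched text
start, end = match.span()
matched_text = string[start:end]
print(matched_text)  # 801 35

# With a group number, the positions refer to that subgroup only
group1_span = (match.start(1), match.end(1))
group2_span = match.span(2)
print(group1_span)  # (2, 5)
print(group2_span)  # (6, 8)
```

Because the end index is exclusive, string[start:end] needs no adjustment to recover the match.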

#### 5. match.re and match.string
The re attribute of a match object returns the regular expression object that produced it. Similarly, the string attribute returns the string that was passed to the search.

In [53]:
print(match.re)
print(match.string)

re.compile('(\\d{3}) (\\d{2})')
39801 356, 2102 1111


## Using r prefix before RegEx
When the r or R prefix is placed before a string literal, it marks a raw string. For example, '\n' is a new line whereas r'\n' means two characters: a backslash \ followed by n.

Backslash \ is used to escape various characters, including all metacharacters. With the r prefix, Python treats \ as a normal character in the string literal, so the backslash reaches the RegEx engine unchanged.

In [55]:
string = '\n and \r are \n escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

['\n', '\r', '\n']
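
The difference is easy to check directly. This sketch compares the lengths of a normal and a raw literal, and shows that a raw pattern is the same string as a normal one with every backslash doubled; the variable names are only illustrative.

```python
import re

# A normal literal turns \n into one newline character
plain_len = len('\n')
# A raw literal keeps the backslash and the n as two characters
raw_len = len(r'\n')
print(plain_len, raw_len)  # 1 2

# Without the r prefix, each backslash meant for the RegEx engine must be doubled
same_pattern = ('\\d+' == r'\d+')
print(same_pattern)  # True

digits = re.findall(r'\d+', 'a1b22')
print(digits)  # ['1', '22']
```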
