# Regular Expressions - re module

Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the **re** module. REs are used to specify rules for the set of strings you want to match. This set can contain anything you like: English sentences, email addresses, URLs, and much more. REs can also be used to modify strings or split them apart in various ways.

Note that not all string processing can be done with REs, and sometimes the RE itself turns out to be very complicated. In these cases, you may be better off writing Python code to process the string; it may be slower than an elaborate RE, but it will probably be easier to understand.

Let's start with simple patterns.

## RE characters matching

Most letters and characters simply match themselves. For example, the **RE hello** will match the **string hello** exactly. You can enable case-insensitive mode so that the RE hello also matches Hello, heLLo, and so on.

Some other characters called **metacharacters** do not match themselves. They have a special role which affects other portions of the RE by repeating them or changing their meaning.

Here is the complete list of metacharacters. Most of them are discussed below.
       
       
                                         . ^ $ * + ? { } [ ] \ | ( )
                 
In Python, to match a RE against a string, import the re module and call the match() function, which takes two arguments: the RE pattern and the string to be matched.
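
As a minimal sketch of the two points above (literal characters matching themselves, and case-insensitive mode), the snippet below calls re.match() on illustrative strings. The variable names are only for this example; re.IGNORECASE is the standard flag of the re module that switches on case-insensitive matching.

```python
import re

# A literal pattern matches only the exact same characters.
exact = re.match(r'hello', 'hello')
wrong_case = re.match(r'hello', 'HeLLo')

# re.IGNORECASE (short form: re.I) turns on case-insensitive matching.
any_case = re.match(r'hello', 'HeLLo', re.IGNORECASE)

print(exact.group())      # hello
print(wrong_case)         # None
print(any_case.group())   # HeLLo
```

The flag is passed as an optional third argument, so the pattern itself does not have to change.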


### Matching character class [ ]

The characters [ and ] are used in a RE to specify a character class, which is the set of characters you want to match.

Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'.

**Example 1**:

The RE [abc] will match **any one of the characters** 'a', 'b', or 'c'. To express the same set with a range, use [a-c].

In [2]:
import re

# matches char 'a' in string 'a'
re.match(r'[abc]', 'a')

<_sre.SRE_Match object; span=(0, 1), match='a'>

As you can see from the output, the match() function gives three pieces of information about the match:

 - **re.match()**: returns a Match object. 
 
 
 - **span**: is a tuple containing the (start, end) positions of the match.
 
 
 - **match**: is the string that matches the RE pattern.
 

So in the above example, we check whether the string 'a' matches the RE [abc], and it does, because [abc] matches any of the characters 'a', 'b', or 'c'. The printed result shows span=(0, 1), which means the match starts at position 0 and ends just before position 1: the end position is exclusive, so a one-character match at position 0 has the span (0, 1) even though the string has no character at position 1. span will become clearer in later examples.
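
One way to see what the span means is to use it as a slice of the original string. The following is a small sketch; the names text, start, and end are chosen only for this illustration.

```python
import re

text = 'a'
m = re.match(r'[abc]', text)

# span() returns (start, end); end is exclusive, just like in slicing.
start, end = m.span()
print(start, end)        # 0 1
print(text[start:end])   # a
```

Slicing the string with the span always gives back the matched text.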

**Example 2**:


The RE [a-z] will match any single lowercase letter.

In [3]:
# matches char 'h' in string 'hello'
re.match(r'[a-z]', 'hello')

<_sre.SRE_Match object; span=(0, 1), match='h'>

**Notice that the match is only the character 'h': the RE [a-z] matches a single character, so the engine stops after the first one.**

In [10]:
# no match here, 'Hello' doesn't start with lowercase letter
print(re.match(r'[a-z]', 'Hello'))

None


In [11]:
# no match here, '123hello' doesn't start with a lowercase letter
print(re.match(r'[a-z]', '123hello'))

None


**NOTE**: most metacharacters are inactive inside character classes. For example, [ijk\$] will match any of the characters 'i', 'j', 'k', or '\$'. '\$' is usually a metacharacter, but inside a character class it loses its special meaning.



In [9]:
# matches char '$' in string '$aaa'
print(re.match(r'[ijk$]', '$aaa'))

<_sre.SRE_Match object; span=(0, 1), match='$'>
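
The sketch below illustrates the note with a few more characters. It also shows the exceptions that keep a special role inside a class: '-' between two characters (a range) and '^' in first position (negation). The variable names are illustrative only.

```python
import re

# '.', '*' and '+' are ordinary characters inside a class.
plus = re.match(r'[.*+]', '+1')
print(plus.group())       # +

# '-' between two characters still denotes a range ...
in_range = re.match(r'[a-c]', 'b')
print(in_range.group())   # b

# ... so put it first or last to match a literal hyphen.
hyphen = re.match(r'[ac-]', '-')
print(hyphen.group())     # -

# '^' is only special when it is the first character of the class.
caret = re.match(r'[a^]', '^')
print(caret.group())      # ^
```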


### Negated character class [^ ]

You can match any character NOT listed within a class by using the '^' symbol as the first character of the class. This is known as **_complementing_** the set of matching characters. Let's take some examples:

**Example 1**:

[^5] will match any character except '5'. In this example, we write the code slightly differently: we assign the string to a variable **s**, store the result of calling match() from the **re** module in **m**, and print it. When the match succeeds, **m** is a Match object; otherwise it is None.

In [18]:
s = '5'

# '5' is excluded by [^5],
# so there is no match and m is None
m = re.match(r'[^5]', s)

print(m)

None


In [13]:
s = '123'

# m is a Match object
# '1' will be matched with [^5]
m = re.match(r'[^5]', s)

print(m)

<_sre.SRE_Match object; span=(0, 1), match='1'>


The match is only the character '1' from s, and the start and end positions of the span are 0 and 1 respectively. Try a couple of examples yourself to make sure you understand.
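
As a starting point for your own experiments, here is a small sketch that tries the negated class [^0-9] against a few hand-picked strings. Note that a negated class still has to consume one character, so it cannot match an empty string. The names candidates and results are illustrative.

```python
import re

candidates = ['x', '7', ' ', '']
results = {}

for candidate in candidates:
    # True when the first character exists and is not a digit
    results[candidate] = re.match(r'[^0-9]', candidate) is not None
    print(repr(candidate), results[candidate])
```

Only 'x' and the space produce a match; '7' is a digit, and the empty string has no character to match.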

### Escape metacharacter \

Perhaps the most important metacharacter in REs is the backslash, \\.

As you saw earlier with Python string literals, the backslash can be followed by various characters to signal special sequences. For example, '\n' in a string starts a new line when the string is printed:

In [19]:
print("Hi there!\nHow are you doing today?")

Hi there!
How are you doing today?


The backslash is also used to escape metacharacters so you can match them literally in patterns. For example, if you need to match a \[ or a \\, precede it with a backslash to remove its special meaning: \\[ or \\\\.

In [22]:
print(re.match(r'\[10:20\]', '[10:20]'))

<_sre.SRE_Match object; span=(0, 7), match='[10:20]'>


Note the matched string is '[10:20]', and the match spans from position 0 (the character '[') to position 7 (just after the character ']'), where the matching process stopped.
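
Because Python string literals also use the backslash, patterns are usually written as raw strings (the r prefix used throughout this section). The sketch below matches a literal backslash with and without a raw string, and uses re.escape(), a standard re function, to add the backslashes automatically. The sample path and variable names are made up for illustration.

```python
import re

path = r'C:\temp'   # this text contains a single backslash

# In a raw string, the RE \\ matches one literal backslash.
raw = re.match(r'C:\\temp', path)
print(raw.group() == path)       # True

# Without the r prefix, every backslash has to be doubled again.
not_raw = re.match('C:\\\\temp', path)
print(not_raw.group() == path)   # True

# re.escape() escapes the metacharacters of a literal string for you.
literal = '[10:20]'
escaped = re.match(re.escape(literal), literal)
print(escaped.group())           # [10:20]
```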

### Predefined sets of characters

These sets begin with a backslash '\\' and are used to define sets of digits, letters, and so on.

#### \w

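Matches any alphanumeric character: a lowercase or uppercase letter, a digit, or the underscore \_. For ASCII text this is equivalent to the class [a-zA-Z0-9_]; with ordinary (Unicode) string patterns, \w also matches letters and digits from other scripts unless the re.ASCII flag is used.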

In [24]:
print(re.match(r'\w', 'python'))

<_sre.SRE_Match object; span=(0, 1), match='p'>


In [26]:
print(re.match(r'\w', '_init_'))

<_sre.SRE_Match object; span=(0, 1), match='_'>


#### \W

Matches any **non-alphanumeric** character. This is equivalent to the class [^a-zA-Z0-9_].

In [29]:
print(re.match(r'\W', '_init_'))

None


#### \d

Matches any decimal digit; this is equivalent to the class [0-9]


In [28]:
print(re.match(r'\d', '357'))

<_sre.SRE_Match object; span=(0, 1), match='3'>


#### \D

Matches any **non-digit** character; this is equivalent to the class [^0-9]


In [30]:
print(re.match(r'\D', '357'))

None


#### \s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. Notice the space in the pattern.

In [31]:
# matches '\n'
print(re.match(r'\s', '\n'))

<_sre.SRE_Match object; span=(0, 1), match='\n'>


In [33]:
# matches the space ' '
print(re.match(r'\s', ' '))

<_sre.SRE_Match object; span=(0, 1), match=' '>


#### \S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

In [35]:
# will not match anything
print(re.match(r'\S', ' \t \n'))

None


#### The dot . 

The RE **.** (the dot) matches anything except the newline character. You can use it if you want to match "any character".

**NOTE**: by default **.** does not match the newline; passing the re.DOTALL flag makes it match any character, including the newline.

In [12]:
print(re.match(r'.', 'fhdjhd77^&&*'))

<_sre.SRE_Match object; span=(0, 1), match='f'>
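
To illustrate the note about re.DOTALL, the following sketch matches the dot against a string that starts with a newline, first without and then with the flag. The variable names are illustrative.

```python
import re

text = '\nabc'

# By default the dot refuses to match a newline.
default = re.match(r'.', text)
print(default)                 # None

# With re.DOTALL the dot matches any character, newline included.
dotall = re.match(r'.', text, re.DOTALL)
print(repr(dotall.group()))    # '\n'
```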


You can combine different sets of characters in one RE. For example, [\s,\d] is a character class that will match any whitespace character, or ',' or any digit. 

In [13]:
print(re.match(r'[\s,\d]', '6, '))
print(re.match(r'[\s,\d]', ' 3,0'))

<_sre.SRE_Match object; span=(0, 1), match='6'>
<_sre.SRE_Match object; span=(0, 1), match=' '>
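
The predefined sets can also be placed one after another to describe a longer piece of text. The sketch below uses made-up sample strings to match a time-like and a date-like prefix; it is only an illustration of combining the sets, not a robust way to validate times or dates.

```python
import re

# two digits, a colon, two digits
time_match = re.match(r'\d\d:\d\d', '09:45 departure')
print(time_match.group())   # 09:45

# three word characters, one whitespace character, two digits
date_match = re.match(r'\w\w\w\s\d\d', 'Mar 07, cloudy')
print(date_match.group())   # Mar 07

# inside a class: one digit or a literal dot
dot_or_digit = re.match(r'[\d.]', '.5')
print(dot_or_digit.group())  # .
```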


## Repeated patterns

There are five different ways to check for repeated patterns:

### 1. * 

The * metacharacter specifies that the previous character can be matched **zero or more times**, instead of exactly once.

In [46]:
print(re.match(r'ca*t', 'ct'))      # 0 'a'
print(re.match(r'ca*t', 'cat'))     # 1 'a'
print(re.match(r'ca*t', 'caat'))    # 2 'a'
print(re.match(r'ca*t', 'caaat'))   # 3 'a'
print(re.match(r'ca*t', 'caaaat'))  # 4 'a'

<_sre.SRE_Match object; span=(0, 2), match='ct'>
<_sre.SRE_Match object; span=(0, 3), match='cat'>
<_sre.SRE_Match object; span=(0, 4), match='caat'>
<_sre.SRE_Match object; span=(0, 5), match='caaat'>
<_sre.SRE_Match object; span=(0, 6), match='caaaat'>


**The * is greedy: when a pattern can repeat, the matching engine matches as many repetitions as possible. In the example above, the RE matched the whole string 'caaaat' because the engine consumed all four 'a' characters, not just one.**
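
Greediness becomes more visible when the repeated element is the dot. As a sketch, the snippet below compares the greedy `.*` with its non-greedy form `.*?` (a ? placed after the repetition makes it match as few characters as possible). The sample string and variable names are illustrative.

```python
import re

html = '<b>bold</b>'

# Greedy: .* grabs as much as it can, up to the last '>'.
greedy = re.match(r'<.*>', html)
print(greedy.group())   # <b>bold</b>

# Non-greedy: .*? stops at the first '>' that lets the RE succeed.
lazy = re.match(r'<.*?>', html)
print(lazy.group())     # <b>
```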

### 2. +

The **+** metacharacter specifies that the previous character can be matched **one or more times**.

In [50]:
print(re.match(r'bo+k', 'bk'))      # 0 'o'
print(re.match(r'bo+k', 'bok'))     # 1 'o'
print(re.match(r'bo+k', 'book'))    # 2 'o'
print(re.match(r'bo+k', 'boook'))   # 3 'o'
print(re.match(r'bo+k', 'booook'))  # 4 'o'

None
<_sre.SRE_Match object; span=(0, 3), match='bok'>
<_sre.SRE_Match object; span=(0, 4), match='book'>
<_sre.SRE_Match object; span=(0, 5), match='boook'>
<_sre.SRE_Match object; span=(0, 6), match='booook'>


Notice the first matching attempt wasn't successful because **+** needs at least one 'o' in the string.

### 3. ?

The **?** metacharacter specifies that the previous character can be matched **zero or one time**.

In [51]:
print(re.match(r'bo?k', 'bk'))      # 0 'o'
print(re.match(r'bo?k', 'bok'))     # 1 'o'
print(re.match(r'bo?k', 'book'))    # 2 'o'


<_sre.SRE_Match object; span=(0, 2), match='bk'>
<_sre.SRE_Match object; span=(0, 3), match='bok'>
None


### 4. {m}

Specifies that the previous character should be matched **exactly m times**. Fewer repetitions cause the entire RE not to match, while extra repetitions in the string are simply left unmatched: the engine stops after the first m.

For example, a{6} will match exactly six 'a' characters, but not five.

In [9]:
import re
# no match, less than 6 'b'
print(re.match(r'ab{6}', 'abbb')) 

# match, exactly 6 'b'
print(re.match(r'ab{6}', 'abbbbbb'))


# matches 'a' and the first 6 'b'
print(re.match(r'ab{6}', 'abbbbbbbbbbbbbb'))

None
<_sre.SRE_Match object; span=(0, 7), match='abbbbbb'>
<_sre.SRE_Match object; span=(0, 7), match='abbbbbb'>


### 5. {m,n}

Specifies that the previous character should be matched **from m to n times**, where m is the minimum number of repetitions and n is the maximum number. 

- Omitting m, when we write only {,n}, means we want to match a pattern minimum of zero times and maximum n times.


- Omitting n, when we write only {m,}, means we want to match a pattern minimum m times but with no maximum.

In [6]:
# no match, less than 3 'b'
print(re.match(r'ab{3,5}', 'ab')) 

# matches 'a' and 3 'b' (minimum)
print(re.match(r'ab{3,5}', 'abbb'))


# matches 'a' and the first 5 'b'
print(re.match(r'ab{3,5}', 'abbbbbbbbb'))

None
<_sre.SRE_Match object; span=(0, 4), match='abbb'>
<_sre.SRE_Match object; span=(0, 6), match='abbbbb'>


## Methods of match object

The match object has some useful methods that give us information about the match.

### 1. group()

Returns the substring that was matched by the RE.

In [66]:
m = re.match(r'ab{3,5}', 'abbbbbbbbb')
m.group()

'abbbbb'

### 2. start() and end()

Return the starting and the ending positions of the match.

In [69]:
print(m.start())
print(m.end())

0
6
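
Since match() returns None when nothing matches, calling group() directly on the result can raise an AttributeError. A minimal sketch of a defensive pattern follows; the helper name first_match is hypothetical and not part of the re module. The sketch also uses span(), which returns start() and end() together as a tuple.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text)
    if m is None:
        return None
    return m.group()

print(first_match(r'ab{3,5}', 'abbbbbbbbb'))   # abbbbb
print(first_match(r'ab{3,5}', 'ab'))           # None

# span() combines start() and end() in one tuple
m = re.match(r'ab{3,5}', 'abbbbbbbbb')
print(m.span())                                # (0, 6)
```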


## RE applications

REs are very useful in many areas, such as searching for certain patterns in a text, modifying strings in a text, and finding all occurrences of a pattern in a text.

Let's explore these topics to show the power of RE.

### Searching for patterns

One of the most common uses of REs is looking for particular patterns in text. To do that, we will use the search() function, which searches for the first occurrence of the RE pattern within a string.

**NOTE**: The difference between **match** and **search** is:

 - **match()**: Checks for a match only at the beginning of the string.
 
 
 - **search()**: Scans through a string looking for the first location where the RE pattern produces a match.
      - Returns a corresponding match object (instance) if there is a match. 
      - Returns None if no position in the string matches the pattern.
      - This is what Perl does by default.
 

**The search() function returns a match object on success, None on failure**. 
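
The difference between the two functions can be sketched with the string '123hello' from an earlier example, where the first lowercase letter is not at the beginning. The variable names are illustrative.

```python
import re

text = '123hello'

# match() only looks at the beginning of the string.
at_start = re.match(r'[a-z]', text)
print(at_start)                      # None

# search() scans forward until the pattern matches somewhere.
found = re.search(r'[a-z]', text)
print(found.group(), found.span())   # h (3, 4)
```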

In this example, we provide a RE that defines three subgroups of matching patterns, each enclosed in ( ). We will use two methods:

- group(num): This method returns the entire match if no num argument is supplied, or the subgroup numbered num otherwise.


- groups(): This method returns all matching subgroups in a tuple (empty if there weren't any).



In [45]:
import re

line = "Cats are smarter than dogs!"

# RE is (.*) are (.*?) (.*)
# there are 3 groups (.*) (.*?) (.*)
# (.*) means match any character zero or more times (greedy)
# (.*?) means match any character zero or more times, but as few as possible (non-greedy)
# the literal text ' are ' and the space match themselves

search_obj = re.search(r'(.*) are (.*?) (.*)', line)

if search_obj:
    print("search_obj.group()  : ", search_obj.group())
    print("search_obj.group(1) : ", search_obj.group(1))
    print("search_obj.group(2) : ", search_obj.group(2))
    print("search_obj.group(3) : ", search_obj.group(3))
    print("search_obj.groups() : ", search_obj.groups())
else:
    print("Nothing found!!")

search_obj.group()  :  Cats are smarter than dogs!
search_obj.group(1) :  Cats
search_obj.group(2) :  smarter
search_obj.group(3) :  than dogs!
search_obj.groups() :  ('Cats', 'smarter', 'than dogs!')
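
To see what the non-greedy group contributes, the sketch below runs the same search with a greedy second group, (.*) instead of (.*?). The second group then swallows as much as it can, and the split between groups 2 and 3 moves to the last space. The variable name greedy_obj is illustrative.

```python
import re

line = "Cats are smarter than dogs!"

# Same RE as above, but the second group is greedy.
greedy_obj = re.search(r'(.*) are (.*) (.*)', line)

print(greedy_obj.groups())   # ('Cats', 'smarter than', 'dogs!')
```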


### Finding all patterns in a text

You can use the re.findall() function to find all the non-overlapping occurrences of a pattern in a string. It returns them as a list of strings.

**Example**:

In [3]:
import re

line = 'I love Python because Python is easy'
f = re.findall(r'Python', line)
print(f)

['Python', 'Python']
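
findall() is more interesting with a pattern that can match different substrings. As a sketch, the snippet below collects every run of digits from a made-up sentence, and shows that the result is an empty list when nothing matches.

```python
import re

order = 'Order 66 shipped 3 items on day 12'

# \d+ matches one or more consecutive digits
numbers = re.findall(r'\d+', order)
print(numbers)    # ['66', '3', '12']

nothing = re.findall(r'\d+', 'no digits here')
print(nothing)    # []
```

The matches come back as strings, so they still need int() if you want to do arithmetic with them.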


###  Split text based on a pattern

You already know the split() method of strings. The **re** module has its own **split()** function, which separates a string into multiple substrings based on a RE pattern. It returns a list of the substrings, without the text matched by the **pattern** (RE) used to split the string.

This example shows how to split a line of characters based on different patterns. 

In [36]:
import re

url = 'http://www.domain_name.com'
f = re.split(r'www', url)
print(f)

['http://', '.domain_name.com']


In [38]:
f = re.split(r'//', url)
print(f)

['http:', 'www.domain_name.com']


Notice how the text matched by the pattern, for example 'www' or '//', is not included in the list.
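
The advantage over the string split() method is that the separator can be a real pattern. The sketch below splits on a character class, reusing the URL from the example above plus a made-up list of items; maxsplit is the optional re.split() argument that limits the number of splits.

```python
import re

url = 'http://www.domain_name.com'

# one or more of ':', '/' or '.' act as a single separator
parts = re.split(r'[:/.]+', url)
print(parts)     # ['http', 'www', 'domain_name', 'com']

items = 'a, b;c,  d'

# a comma or semicolon followed by optional whitespace
fields = re.split(r'[,;]\s*', items)
print(fields)    # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split
first_only = re.split(r'[,;]\s*', items, maxsplit=1)
print(first_only)   # ['a', 'b;c,  d']
```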


## Great!

Now you have a solid background on how to define and use regular expressions in Python. You can build on these skills to develop more useful patterns to apply in your programs.