## Regular Expressions
A regular expression is a sequence of characters, called the pattern, that describes the substrings you want to find in a given string.

For example, suppose you have a dataset of customer reviews of your restaurant and you want to extract the emojis from the reviews, because they are a good predictor of the sentiment of a review.

As another example, virtual assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a query or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times. Having found these, the assistant can automatically make a booking or prompt you to call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They will help you clean and handle your text much more effectively.

### Let's import the regular expression library in Python.

In [4]:
import re

Let's do a quick search using a pattern.

In [None]:
re.search(pattern, string)

In [2]:
re.search('Ravi', 'Ravi is an exceptional student!')

<_sre.SRE_Match object; span=(0, 4), match='Ravi'>

In [3]:
# print output of re.search()
match = re.search('Ravi', 'Ravi is an exceptional student!')
print(match.group())

Ravi


Let's define a function to match regular expression patterns

In [5]:
def find_pattern(text, patterns):
    # run the search once and reuse the result
    match = re.search(patterns, text)
    if match:
        return match
    else:
        return 'Not Found!'

In [6]:
string =  'The roots of education are bitter, but the fruit is sweet.'
pattern = 'education'

result = find_pattern(string, pattern)
print(result.group())

education


### Quantifiers

In [9]:
# '*': Zero or more 
print(find_pattern("ac", "ab*"))
print(find_pattern("abc", "ab*"))
print(find_pattern("abbc", "ab*"))
print(find_pattern("cv", "ab*"))
print(find_pattern("cav", "ab*"))

<_sre.SRE_Match object; span=(0, 1), match='a'>
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 3), match='abb'>
Not Found!
<_sre.SRE_Match object; span=(1, 2), match='a'>


In [6]:
# '?': Zero or one (tells whether a pattern is absent or present)
print(find_pattern("ac", "ab?"))
print(find_pattern("abc", "ab?"))
print(find_pattern("abbc", "ab?"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 2), match='ab'>


In [7]:
# '+': One or more
print(find_pattern("ac", "ab+"))
print(find_pattern("acb", "ab+"))
print(find_pattern("abc", "ab+"))
print(find_pattern("abbc", "ab+"))

Not Found!
Not Found!
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 3), match='abb'>


In [11]:
# {n}: Matches if a character is present exactly n number of times
print(find_pattern("abbc", "ab{2}"))
print(find_pattern("aabbc", "ab{2}"))

<_sre.SRE_Match object; span=(0, 3), match='abb'>
<_sre.SRE_Match object; span=(1, 4), match='abb'>


In [9]:
# {m,n}: Matches if a character is present from m to n number of times
print(find_pattern("aabbbbbbc", "ab{3,5}"))   # matches 'a' followed by 3 to 5 'b's
print(find_pattern("aabbbbbbc", "ab{7,10}"))  # matches 'a' followed by 7 to 10 'b's
print(find_pattern("aabbbbbbc", "ab{,10}"))   # matches 'a' followed by at most 10 'b's (zero is allowed)
print(find_pattern("aabbbbbbc", "ab{10,}"))   # matches 'a' followed by at least 10 'b's

<re.Match object; span=(1, 7), match='abbbbb'>
Not Found!
<re.Match object; span=(0, 1), match='a'>
Not Found!


There are four variants of the quantifier that you just saw:

1. {m,n}: Matches the preceding character at least ‘m’ and at most ‘n’ times.
2. {m,}: Matches the preceding character ‘m’ or more times, i.e. there is no upper limit on the occurrences of the preceding character.
3. {,n}: Matches the preceding character from zero to ‘n’ times, i.e. only the upper limit on the occurrences of the preceding character is fixed.
4. {n}: Matches if the preceding character occurs exactly ‘n’ times.

<div class="alert alert-block alert-info">
<b>Note:</b> When specifying the {m,n} notation, do not put a space after the comma, i.e. use {m,n} rather than {m, n}. Python's re module treats the version with a space as literal text rather than as a quantifier.
</div>
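
The effect of the stray space can be checked directly. The following is a minimal sketch, assuming Python's built-in re module; the sample strings are made up for illustration.

```python
import re

# With no space, {1,3} is a quantifier: 'a' followed by one to three 'b's.
print(re.search("ab{1,3}", "abbbb").group())      # abbb

# With a space, Python's re module reads '{1, 3}' as literal characters,
# so the pattern no longer matches a run of 'b's ...
print(re.search("ab{1, 3}", "abbbb"))             # None

# ... and only matches text that literally contains '{1, 3}'.
print(re.search("ab{1, 3}", "ab{1, 3}").group())  # ab{1, 3}
```

Because the spaced form fails silently instead of raising an error, it is worth testing a new quantifier on a string you know should match.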

An interesting thing to note is that the {m,n} quantifier can replace the ‘?’, ‘*’ and ‘+’ quantifiers. That is because:

* '?' is equivalent to zero or once, or {0,1}
* '*' is equivalent to zero or more times, or {0,}
* '+' is equivalent to one or more times, or {1,}
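
These equivalences can be verified with a short check. The sketch below assumes Python's re module; `matched_text` is a hypothetical helper and the sample strings are chosen only for illustration.

```python
import re

# Pairs of (shorthand, explicit {m,n} form) taken from the list above.
equivalents = [("ab?", "ab{0,1}"), ("ab*", "ab{0,}"), ("ab+", "ab{1,}")]
samples = ["ac", "abc", "abbc", "cv"]  # illustrative inputs

def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

for short, explicit in equivalents:
    for text in samples:
        # Both spellings should pick out exactly the same substring.
        assert matched_text(short, text) == matched_text(explicit, text)
    print(short, "behaves like", explicit)
```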

If you put parentheses around some characters, the quantifier looks for repetitions of the whole group of characters rather than repetitions of just the preceding character. This concept is called grouping in regular expression jargon. For example, the pattern ‘(abc){1,3}’ will match the following strings:

* abc
* abcabc
* abcabcabc

Similarly, the pattern (010)+ will match:

* 010
* 010010
* 010010010
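
To see exactly which strings a grouped pattern accepts, one option is re.fullmatch(), which succeeds only when the whole string fits the pattern. This is a small sketch assuming Python 3.4 or later (where re.fullmatch() is available); the extra test strings are made up.

```python
import re

# The group (abc) may repeat one to three times, but not four.
for text in ["abc", "abcabc", "abcabcabc", "abcabcabcabc", "ab"]:
    print(text, bool(re.fullmatch("(abc){1,3}", text)))
# abc True, abcabc True, abcabcabc True, abcabcabcabc False, ab False

# Without parentheses the quantifier applies only to the preceding character 'c'.
print(bool(re.fullmatch("abc{1,3}", "abccc")))    # True: 'ab' followed by three 'c's
print(bool(re.fullmatch("abc{1,3}", "abcabc")))   # False

# The same idea with '+': one or more repetitions of '010'.
print(bool(re.fullmatch("(010)+", "010010010")))  # True
```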

Comprehension: Grouping

Description

Write a regular expression which matches a string where '23' occurs one or more times, followed by '78' occurring one or more times.

Sample positive matches (should match all of these):
2378
23237878
232323237878

Sample negative matches (shouldn't match any of these):
23
78
23378
223378
22337788

In [5]:
string = '2323233787878'
pattern = '(23){1,}(78){1,}'

result = re.search(pattern, string)


# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if result != None:
    print(True)
else:
    print(False)

False


### Pipe operator

The pipe operator, written ‘|’, is used as an OR operator. It is usually placed inside parentheses, which limit how much of the pattern the alternatives cover. For example, the pattern ‘(d|g)one’ will match both the strings ‘done’ and ‘gone’. The pipe operator says that the position inside the parentheses can be either ‘d’ or ‘g’.

 

Similarly, the pattern ‘(ICICI|HDFC) Bank’ will match the strings ‘ICICI Bank’ and ‘HDFC Bank’. You can also use quantifiers after the parentheses as usual, even when there is a pipe operator inside. Not only that, there can be any number of pipe operators inside the parentheses. The pattern ‘(0|1|2){2}’ means 'exactly two occurrences of either of 0, 1 or 2', and it will match these strings - ‘00’, ‘01’, ‘02’, ‘10’, ‘11’, ‘12’, ‘20’, ‘21’ and ‘22’.
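
The patterns described above can be tried out as follows. This is an illustrative sketch assuming Python's re module; the sample text is made up, and finditer() is used so that the full matched text is printed.

```python
import re

text = "done, gone, none"   # made-up sample text

# (d|g)one: the alternatives are limited to the single position inside the parentheses.
print([m.group() for m in re.finditer("(d|g)one", text)])    # ['done', 'gone']

# Without parentheses the pipe splits the whole pattern: 'd' OR 'gone'.
print([m.group() for m in re.finditer("d|gone", text)])      # ['d', 'gone']

# A quantifier after the parentheses repeats the whole choice.
print(bool(re.fullmatch("(0|1|2){2}", "21")))   # True
print(bool(re.fullmatch("(0|1|2){2}", "23")))   # False: '3' is not one of the alternatives

print(bool(re.fullmatch("(ICICI|HDFC) Bank", "HDFC Bank")))  # True
```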

Lastly, you will often find yourself in situations where you need to mention characters such as ‘?’, ‘*’, ‘+’, ‘(‘, ‘)’, ‘{‘, etc. in your regular expressions. These are called special characters since they have special meanings when they appear inside a regex pattern (as you have already seen).

 

Suppose you want to extract all the questions from a document, and you assume that all questions end with a question mark - ‘?’. So you would need to use the ‘?’ in the regular expression. Now, you already know that ‘?’ has a special meaning in regular expressions. So, how do you tell the regular expression engine that you want to match the question mark literally in the sentence, rather than as a special character (which it is by default)?

 

In situations such as these, you’ll need to use escape sequences. The escape sequence, denoted by a backslash ‘\’, is used to escape the special meaning of the special characters. To match a question mark literally, you need to use '\?' (this is called escaping the character).

 

Let's take another example - if you want to match the addition symbol in a string, you can’t use the pattern ‘+’. You need to escape the ‘+’ operator and the pattern that you’re going to use in this case is ‘\+’. 

<div class="alert alert-block alert-info">
<b>Note:</b> The '\' itself is a special character, and to match the ‘\’ character literally, you need to escape it too. You can use the pattern ‘\\’ to match a literal backslash.

</div>

In [6]:
string = '11*2'
pattern = r'(\*)'  # raw string, so the backslash reaches the regex engine unchanged

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)

True
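
A few more escaping cases, as a hedged sketch using Python's re module. The sample strings are invented; raw strings (r'...') are used so that Python itself does not interpret the backslashes, and re.escape() is the standard-library helper that escapes special characters for you.

```python
import re

# A literal '?' and a literal '+' each need a backslash.
print(re.search(r"\?", "Ready?").span())            # (5, 6)
print(re.search(r"2\+2", "What is 2+2").group())    # 2+2

# Unescaped, '+' is a quantifier: '2+2' means "one or more 2s, then a 2".
print(re.search("2+2", "What is 2+2"))              # None
print(re.search("2+2", "222").group())              # 222

# A literal backslash is written as '\\' in the pattern.
print(re.search(r"\\", r"C:\temp").span())          # (2, 3)

# re.escape() builds a pattern that matches the given text literally.
tricky = "1+1=2?"
print(re.search(re.escape(tricky), "Is 1+1=2? Yes.").group())   # 1+1=2?
```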


### Regex Flags

A flag changes how the regex engine interprets a pattern. For example, if you want your regex to ignore the case of the text, you can pass the 're.I' flag. Similarly, the 're.M' flag makes the anchors '^' and '$' match at the start and end of every line rather than only at the start and end of the whole string, which is useful when the input text has multiple lines. You can pass these flags to the re.search() function. The syntax to pass multiple flags is:

In [8]:
re.search(pattern, string, flags=re.I | re.M)

<_sre.SRE_Match object; span=(2, 3), match='*'>
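
The effect of each flag is easier to see on a multi-line string. The following is a small sketch assuming Python's re module; the three-line sample text is made up for illustration.

```python
import re

text = "first line\nSecond Line\nTHIRD LINE"   # made-up multi-line sample

# re.I: ignore case while matching.
print(re.findall("line", text))                  # ['line']
print(re.findall("line", text, re.I))            # ['line', 'Line', 'LINE']

# re.M: '$' also matches at the end of every line, not only at the end of the string.
print(re.findall("line$", text))                 # []
print(re.findall("line$", text, re.M))           # ['line']

# Both flags combined with '|'.
print(re.findall("line$", text, flags=re.I | re.M))     # ['line', 'Line', 'LINE']
print(re.findall("^second", text, flags=re.I | re.M))   # ['Second']
```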

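The re.compile() function compiles a regular expression pattern into a pattern object that you can store and reuse, which can make repeated searches a little faster. The re module also caches recently used patterns internally, so the gain is usually small. You need to pass the regex pattern to the re.compile() function.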

In [9]:
# without re.compile() function
result = re.search("a+", "abc")
result

<_sre.SRE_Match object; span=(0, 1), match='a'>

In [10]:
# using the re.compile() function
pattern = re.compile("a+")
result = pattern.search("abc")
result

<_sre.SRE_Match object; span=(0, 1), match='a'>

### Anchors

In [10]:
# '^': Indicates start of a string
# '$': Indicates end of string

print(find_pattern("James", "^J"))   # matches: the string starts with 'J'
print(find_pattern("Pramod", "^J"))  # not found: the string does not start with 'J'
print(find_pattern("India", "a$"))   # matches: the string ends with 'a'
print(find_pattern("Japan", "a$"))   # not found: the string ends with 'n', not 'a'


<re.Match object; span=(0, 1), match='J'>
Not Found!
<re.Match object; span=(4, 5), match='a'>
Not Found!


In [17]:
print(find_pattern("010", "^01*0$"))  
print(find_pattern("0110", "^01*0$"))  
print(find_pattern("01111111110", "^01*0$"))   
print(find_pattern("‘00’", "^01*0$"))  # not found: the curly quotes are part of the string, so it neither starts nor ends with '0'

<_sre.SRE_Match object; span=(0, 3), match='010'>
<_sre.SRE_Match object; span=(0, 4), match='0110'>
<_sre.SRE_Match object; span=(0, 11), match='01111111110'>
Not Found!


### Wildcard

In [11]:
# '.': Matches any single character except a newline
print(find_pattern("a", "."))
print(find_pattern("#", "."))


<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='#'>
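
One detail worth checking is how the wildcard treats newlines. A minimal sketch, assuming Python's re module and its re.DOTALL flag (not used elsewhere in this notebook):

```python
import re

# '.' stands for exactly one character.
print(re.search("a.c", "abc").group())   # abc
print(re.search("a.c", "ac"))            # None: there is no character between 'a' and 'c'

# By default '.' does not match a newline ...
print(re.search("a.c", "a\nc"))          # None

# ... unless the re.DOTALL flag is passed.
print(re.search("a.c", "a\nc", re.DOTALL).group() == "a\nc")   # True
```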


### Character sets

In [12]:
# Now we will look at '[' and ']'.
# They're used for specifying a character class, which is a set of characters that you wish to match.
# Characters can be listed individually as follows
print(find_pattern("a", "[abc]"))

# Or a range of characters can be indicated by giving two characters and separating them by a '-'.
print(find_pattern("c", "[a-c]"))  # same as above

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='c'>


In [20]:
# '^' at the start of a character set negates the set - this is a different meaning from the '^' anchor
print(find_pattern("a", "[^abc]"))  # matches any character other than a, b or c; 'a' is excluded, so nothing is found
print(find_pattern("d", "[^abc]"))

Not Found!
<_sre.SRE_Match object; span=(0, 1), match='d'>


### Character sets
| Pattern  | Matches                                                  |
|----------|----------------------------------------------------------|
| [abc]    | Matches an a, b or c character                           |
| [abcABC] | Matches an a, A, b, B, c or C character                  |
| [a-z]    | Matches any character from a to z, including a and z     |
| [A-Z]    | Matches any character from A to Z, including A and Z     |
| [a-zA-Z] | Matches any letter from a to z, in lower or upper case   |
| [0-9]    | Matches any digit from 0 to 9                            |
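
The character sets in the table can be combined with quantifiers to pull pieces out of a string. This is an illustrative sketch assuming Python's re module; the sample sentence is made up.

```python
import re

text = "Order #A17 shipped to Block C, Flat 204."   # made-up sample text

print(re.findall("[0-9]+", text))         # ['17', '204']
print(re.findall("[A-Z]", text))          # ['O', 'A', 'B', 'C', 'F']
print(re.findall("[a-zA-Z]+", text))      # ['Order', 'A', 'shipped', 'to', 'Block', 'C', 'Flat']

# A negated set: everything that is not a letter, a digit or a space.
print(re.findall("[^a-zA-Z0-9 ]", text))  # ['#', ',', '.']
```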

### Meta sequences

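| Pattern  | Equivalent to    |
|----------|------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |

For Python 3 string patterns these equivalences are exact only when the re.ASCII flag is used; without it, \s, \d and \w also match the corresponding Unicode whitespace, digit and word characters.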

In [23]:
print(find_pattern('S ourodip', r"\s+"))  # raw strings keep the backslash intact for the regex engine
print(find_pattern('Souro123', r"\d+"))

<_sre.SRE_Match object; span=(1, 2), match=' '>
<_sre.SRE_Match object; span=(5, 8), match='123'>
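
The remaining meta sequences from the table behave in the same way. A short sketch, assuming Python's re module; the sample sentence is invented for illustration.

```python
import re

text = "Call 555-0199 at 9am!"   # made-up sample text

print(re.findall(r"\d+", text))   # ['555', '0199', '9']
print(re.findall(r"\w+", text))   # ['Call', '555', '0199', 'at', '9am']
print(re.findall(r"\S+", text))   # ['Call', '555-0199', 'at', '9am!']
print(re.findall(r"\W", text))    # [' ', '-', ' ', ' ', '!']
```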


### Greedy vs non-greedy regex

In [14]:
print(find_pattern("aabbbbbb", "ab{3,5}")) # GREEDY: 'a' followed by as many 'b's as possible, between 3 and 5

<re.Match object; span=(1, 7), match='abbbbb'>


In [15]:
print(find_pattern("aabbbbbb", "ab{3,5}?")) # NON-GREEDY: 'a' followed by as few 'b's as possible, between 3 and 5

<re.Match object; span=(1, 5), match='abbb'>


In [16]:
# Example of HTML code
print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>


In [17]:
# Example of HTML code
print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 6), match='<HTML>'>


In [26]:
string = \
"""<html>
<head>
<title> My amazing webpage </title>
</head>
<body> Welcome to my webpage! </body>
</html> """

Write a non-greedy regular expression to match the contents of only the first tag (the <html> tag in this case) including the angle brackets.

In [29]:
# regex pattern: '*?' is the lazy form of '*', so the match stops at the first '>'
pattern = "<.*?>"

# check whether pattern is present in string or not
result = re.search(pattern, string, re.M)  # re.M makes '^' and '$' match at every line; it does not affect this pattern

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if (result != None) and (len(result.group()) <= 6):
    print(True)
    print(result)
else:
    print(False)

True
<_sre.SRE_Match object; span=(0, 6), match='<html>'>


To use a pattern in a non-greedy way, you can just put a question mark at the end of any of the following quantifiers that you’ve studied so far:

* '*'
* '+'
* '?'
* {m,n}
* {m,}
* {,n}
* {n}

 

The lazy versions of the above greedy quantifiers are:

* '*?'
* '+?'
* '??'
* {m,n}?
* {m,}?
* {,n}?
* {n}? (behaves the same as {n}, since the count is fixed)
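
The difference between each greedy quantifier and its lazy counterpart can be compared side by side. The sketch below assumes Python's re module; the sample string and the specific bounds (2 and 4) are chosen only for illustration.

```python
import re

text = "abbbbb"   # 'a' followed by five 'b's

# (greedy pattern, lazy pattern); the comments show what each one matches in text.
pairs = [
    ("ab*", "ab*?"),          # abbbbb vs a
    ("ab+", "ab+?"),          # abbbbb vs ab
    ("ab?", "ab??"),          # ab     vs a
    ("ab{2,4}", "ab{2,4}?"),  # abbbb  vs abb
    ("ab{2,}", "ab{2,}?"),    # abbbbb vs abb
    ("ab{,4}", "ab{,4}?"),    # abbbb  vs a
]

for greedy, lazy in pairs:
    greedy_match = re.search(greedy, text).group()
    lazy_match = re.search(lazy, text).group()
    print(greedy, "->", greedy_match, "|", lazy, "->", lazy_match)
```

In every pair the lazy form stops at the shortest text that still satisfies the quantifier's lower bound.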

### The five most commonly used re functions

* match(): Determines whether the RE matches at the beginning of the string
* search(): Scans through a string, looking for any location where the RE matches
* findall(): Finds all the substrings where the RE matches, and returns them as a list
* finditer(): Finds all the substrings where the RE matches, and returns them as an iterator of match objects
* sub(): Finds all the substrings where the RE matches, and substitutes them with the given string

In [18]:
# this function uses re.match(), which only matches at the beginning of the string; let's see how it differs from re.search()
def match_pattern(text, patterns):
    # run the match once and reuse the result
    match = re.match(patterns, text)
    if match:
        return match
    else:
        return 'Not found!'

In [19]:
print(find_pattern("abbc", "b+"))

<re.Match object; span=(1, 3), match='bb'>


In [20]:
print(match_pattern("abbc", "b+"))

Not found!


In [31]:
## Example usage of the sub(regular_expression, replacement_string, input_text) function. Replace 'Road' with 'Rd'.

street = '21 Ramakrishna Road'
print(re.sub('Road', 'Rd', street))

21 Ramakrishna Rd


In [32]:
print(re.sub(r'R\w+', 'Rd', street))  # replaces every word starting with 'R', including 'Ramakrishna'

21 Rd Rd


In [33]:
# '(?!Ramakrishna)' is a negative lookahead: it rejects any match that would start at 'Ramakrishna'
print(re.sub(r'(?!Ramakrishna)R\w+', 'Rd', street))

21 Ramakrishna Rd


In [23]:
## Example usage of finditer(). Find all occurrences of the word 'festival' in the given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'
for match in re.finditer(pattern, text):
    print('START -', match.start(), 'END -', match.end())

START - 12 END - 20
START - 42 END - 50


In [38]:
# Example usage of findall(). In the given URL find all dates
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12/"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
print(re.findall(date_regex, url))

[('2017', '10', '28'), ('2017', '05', '12')]


In [25]:
## Exploring Groups
m1 = re.search(date_regex, url)
print(m1.group())  ## print the entire match (same as group(0))

/2017/10/28/


In [26]:
print(m1.group(1)) # - Print first group

2017


In [27]:
print(m1.group(2)) # - Print second group

10


In [28]:
print(m1.group(3)) # - Print third group

28


In [29]:
print(m1.group(0)) # - Print zero or the default group

/2017/10/28/
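
When all the groups are needed at once, the match object's groups() method returns them as a tuple, and group() also accepts several group numbers. A minimal sketch assuming Python's re module; the path below is a shortened, made-up stand-in for the URL used above.

```python
import re

date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
path = "/formula-1/2017/10/28/mexican-grand-prix/"   # shortened, made-up path

m = re.search(date_regex, path)

# All captured groups in one tuple.
print(m.groups())         # ('2017', '10', '28')

# Tuple unpacking gives each part a name.
year, month, day = m.groups()
print(year, month, day)   # 2017 10 28

# group() with several numbers returns just those groups.
print(m.group(1, 3))      # ('2017', '28')
```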


## Use cases 

The following code demonstrates how regex can be used for a file search operation. Say you have a list of folders and filenames called 'items' and you want to extract (or read) only some specific files, say images.

In [47]:
# items contains all the files and folders of current directory
items = ['photos', 'documents', 'videos', 'image001.jpg','image002.jpg','image005.jpg', 'wallpaper.jpg',
         'flower.jpg', 'earth.jpg', 'monkey.jpg', 'image002.png']

# create an empty list to store resultant files
images = []

# regex pattern to extract files that end with '.jpg'
pattern = r".*\.jpg$"

for item in items:
    if re.search(pattern, item):
        images.append(item)

# print result
print(images)

['image001.jpg', 'image002.jpg', 'image005.jpg', 'wallpaper.jpg', 'flower.jpg', 'earth.jpg', 'monkey.jpg']


In [43]:
# items contains all the files and folders of current directory
items = ['photos', 'documents', 'videos', 'image001.jpg','image002.jpg','image005.jpg', 'wallpaper.jpg',
         'flower.jpg', 'earth.jpg', 'monkey.jpg', 'image002.png']

# create an empty list to store resultant files
images = []

# regex pattern to extract files that start with 'image' and end with '.jpg'
pattern = r"image.*\.jpg$"

for item in items:
    if re.search(pattern, item):
        images.append(item)

# print result
print(images)

['image001.jpg', 'image002.jpg', 'image005.jpg']


You saw how to search for specific file names using regular expressions. Similarly, they can be used to extract features from text such as the ones listed below:

* Extracting dates
* Extracting emails
* Extracting phone numbers, and other patterns
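
As a rough illustration of these use cases, the sketch below extracts dates, emails and phone numbers from a made-up sentence. It assumes Python's re module; the three patterns are deliberately simple assumptions that fit this sample only and will not cover every real-world date, email or phone format.

```python
import re

# Made-up sample text; the address and the number are placeholders.
text = ("Book a table on 12/05/2023 or 2023-06-01. "
        "Write to bookings@example.com or call 080-5550-1234.")

# Simplified patterns (assumptions for this sketch, not general-purpose validators).
date_regex = r"\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}"   # dd/mm/yyyy or yyyy-mm-dd
email_regex = r"[\w.+-]+@[\w-]+\.[\w.-]+"                 # name@domain.tld
phone_regex = r"\d{3}-\d{4}-\d{4}"                        # 3-4-4 digit groups

print(re.findall(date_regex, text))    # ['12/05/2023', '2023-06-01']
print(re.findall(email_regex, text))   # ['bookings@example.com']
print(re.findall(phone_regex, text))   # ['080-5550-1234']
```

For production use, the patterns would need to be adapted to the formats that actually occur in your data.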