Regular Expressions

In [None]:
import re
# The re module functions covered in this notebook:
#   re.match()
#   re.search()
#   re.findall()
#   re.split()
#   re.sub()
#   re.compile()

re.match() method
This method finds a match only if it occurs at the start of the string. For example,
calling match() on the string ‘AV Analytics ESET AV’ with the pattern
‘AV’ will match. However, if we look for ‘Analytics’ instead, the pattern will not match, because ‘Analytics’ is not at the start of the string. Let’s try it in Python now.

In [1]:
import re

In [2]:
result = re.match(r'AV', 'AV Analytics ESET AV')
print (result)

<re.Match object; span=(0, 2), match='AV'>


In [10]:
result = re.match(r'AV', 'AV Analytics ESET AV')
print (result.group(0))

AV


In [8]:
result = re.match(r'Analytics', 'AV Analytics ESET AV')
print (result)

None


In [7]:
result = re.match(r'AV', 'AV Analytics ESET AV')
print (result.start())
print (result.end())

0
2
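
Because re.match() returns None when the pattern is not found at the start, calling .group() on the result directly can raise an AttributeError. The following is a minimal sketch of guarding against that; the helper name leading_token is hypothetical and only used for illustration.

```python
import re

def leading_token(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    # re.match() returns None on failure, so check before calling .group()
    return m.group(0) if m else None

text = 'AV Analytics ESET AV'
print(leading_token(r'AV', text))         # AV
print(leading_token(r'Analytics', text))  # None
```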


re.search() method
It is similar to match(), but it does not restrict matches to the
beginning of the string: it scans the string and returns the first match
found anywhere. Unlike the previous method, searching for the pattern
‘Analytics’ here returns a match.

In [11]:
result = re.search(r'Analytics', 'AV Analytics ESET AV')
print (result.group(0))

Analytics
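
As a small side-by-side sketch of the difference described above, the same pattern can be passed to both functions on the same string. The variable names are illustrative only.

```python
import re

text = 'AV Analytics ESET AV'

at_start = re.match(r'Analytics', text)   # only tries position 0
anywhere = re.search(r'Analytics', text)  # scans the whole string

print(at_start)          # None
print(anywhere.span())   # (3, 12)
```

The span shows where in the string the first match was found.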


re.findall (pattern, string)
It returns a list of all non-overlapping matches of the pattern, wherever
they occur in the string. If we use findall() to search for ‘AV’ in the given string, it returns both occurrences of AV. While searching a string,
re.findall() can often stand in for re.search() and re.match(), but note
that it returns a list of strings rather than a match object, so use
search() or match() when you need the position or groups of a single match.

In [12]:
result = re.findall(r'AV', 'AV Analytics ESET AV')
print (result)

['AV', 'AV']
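
Since findall() returns plain strings, the positions of the matches are lost. One possible way to recover them, sketched here, is re.finditer(), which is not used elsewhere in this notebook and yields a match object for every occurrence.

```python
import re

text = 'AV Analytics ESET AV'

found = re.findall(r'AV', text)     # list of matched strings
missing = re.findall(r'XYZ', text)  # empty list when nothing matches
positions = [m.start() for m in re.finditer(r'AV', text)]

print(found)      # ['AV', 'AV']
print(missing)    # []
print(positions)  # [0, 18]
```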


re.split(pattern, string, [maxsplit=0])
This method helps to split the string by the occurrences of a given pattern.



In [14]:
result=re.split(r'y','Analytics')
print(result)

['Anal', 'tics']


Above, we have split the string “Analytics” by “y”. The split() method has
another argument, “maxsplit”, with a default value of zero. In that case it
performs every split that can be made; if we give maxsplit a positive value, the string is split at most that many times. Let’s look at the example below:

In [16]:
import re
result=re.split(r's','Analytics eset')
print(result)

['Analytic', ' e', 'et']


Without maxsplit, it has performed all the splits that can be done by the
pattern “s”.

In [17]:
result=re.split(r's','Analytics eset',maxsplit=1)
print(result)

['Analytic', ' eset']
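
The split pattern does not have to be a single character. As a hedged sketch, the example below splits a made-up record string (not taken from the original data) on several delimiters at once, with and without maxsplit.

```python
import re

# Hypothetical input used only for illustration
record = 'name: Joe; lang = Python, Ruby'

# Split on ':', ';', '=' or ',' together with any surrounding whitespace
fields = re.split(r'\s*[:;=,]\s*', record)
print(fields)     # ['name', 'Joe', 'lang', 'Python', 'Ruby']

# With maxsplit=1 only the first delimiter is used
first_two = re.split(r'\s*[:;=,]\s*', record, maxsplit=1)
print(first_two)  # ['name', 'Joe; lang = Python, Ruby']
```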



re.sub(pattern, repl, string)
It searches for a pattern and replaces each match with a new substring. If the
pattern is not found, the string is returned unchanged.

In [19]:
result=re.sub(r'Ruby','Python','Joe likes Ruby')
print (result)

Joe likes Python
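
Two further features of re.sub() may be useful here: the optional count argument limits the number of replacements, and the replacement string can refer to captured groups with \1, \2 and so on. A minimal sketch, using an extended version of the sentence above as assumed input:

```python
import re

text = 'Joe likes Ruby, Ruby, Ruby'

replace_all = re.sub(r'Ruby', 'Python', text)
replace_one = re.sub(r'Ruby', 'Python', text, count=1)
print(replace_all)  # Joe likes Python, Python, Python
print(replace_one)  # Joe likes Python, Ruby, Ruby

# \1 and \2 in the replacement refer to the first and second groups
swapped = re.sub(r'(\w+) likes (\w+)', r'\2 is liked by \1', 'Joe likes Ruby')
print(swapped)      # Ruby is liked by Joe
```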



re.compile() method
We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. This also lets us search with the same pattern again without rewriting it.

In [20]:
pattern=re.compile('XSS')
result=pattern.findall('XSS is Cross Site Scripting, XSS')
print (result)
result2=pattern.findall('XSS is Cross Site Scripting, SQLi is Sql Injection')
print (result2)

['XSS', 'XSS']
['XSS']
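
A compiled pattern object offers the same methods as the module-level functions (search, findall, sub and others) and can carry flags such as re.IGNORECASE. The sketch below reuses one compiled pattern across several lines; the list of lines is made up for illustration.

```python
import re

# Compile once, with a flag, and reuse the object
pattern = re.compile(r'xss', re.IGNORECASE)

lines = [
    'XSS is Cross Site Scripting',
    'SQLi is Sql Injection',
    'Reflected xss',
]

hits = [line for line in lines if pattern.search(line)]
print(hits)      # ['XSS is Cross Site Scripting', 'Reflected xss']

redacted = pattern.sub('[redacted]', lines[2])
print(redacted)  # Reflected [redacted]
```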



Until now, we have looked at various regular expression methods using a
constant pattern (fixed characters). But what if we do not have a constant search pattern and want to return a specific set of characters (defined by a rule) from a string? Don’t be intimidated.
This can easily be solved by defining an expression with the help of pattern operators (meta and literal characters). Here are the most commonly used operators for building an expression that represents the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.
Operators     Description
.             Matches any single character except newline ‘\n’.
?             Matches 0 or 1 occurrence of the pattern to its left.
+             Matches 1 or more occurrences of the pattern to its left.
*             Matches 0 or more occurrences of the pattern to its left.
\w            Matches an alphanumeric character or underscore; \W (upper case W) matches any other character.
\d            Matches a digit [0-9]; \D (upper case D) matches a non-digit.
\s            Matches a single whitespace character (space, newline, return, tab, form feed); \S (upper case S) matches any non-whitespace character.
\b            Matches the boundary between a word and a non-word character; \B is the opposite of \b.
[..]          Matches any single character in the square brackets; [^..] matches any single character not in the square brackets.
\             Escapes characters with special meaning, e.g. \. to match a period or \+ for a plus sign.
^ and $       Match the start and end of the string respectively.
{n,m}         Matches at least n and at most m occurrences of the preceding expression; written as {,m}, it matches from 0 up to m occurrences.
a|b           Matches either a or b.
( )           Groups regular expressions and captures the matched text.
\t, \n, \r    Match tab, newline, return.
For more details on the meta characters “(”, “)”, “|” and others, you
can refer to this link (https://docs.python.org/2/library/re.html).
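
As a quick, hedged illustration of several operators from the table, the snippet below applies them to a made-up sample string (not part of the original notebook).

```python
import re

# Hypothetical sample text used only for illustration
sample = 'Order 66: 3 apples + 12 pears. Total=15'

digits = re.findall(r'\d+', sample)                  # one or more digits
two_plus = re.findall(r'\d{2,}', sample)             # at least two digits
either = re.findall(r'apples|pears', sample)         # alternation
literal_dot = re.findall(r'\.', sample)              # escaped period
p_words = re.findall(r'\bp\w+', sample)              # words starting with 'p'
optional_u = re.findall(r'colou?r', 'color colour')  # 'u' is optional
consonants = re.findall(r'[^aeiou]', 'pears')        # negated character class

print(digits)       # ['66', '3', '12', '15']
print(two_plus)     # ['66', '12', '15']
print(either)       # ['apples', 'pears']
print(literal_dot)  # ['.']
print(p_words)      # ['pears']
print(optional_u)   # ['color', 'colour']
print(consonants)   # ['p', 'r', 's']
```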
Now, let’s understand the pattern operators by looking at the below examples.

In [21]:
result=re.findall(r'.','Python is the best scripting language')
print (result)

['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', 's', 'c', 'r', 'i', 'p', 't', 'i', 'n', 'g', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']


In [22]:
result=re.findall(r'\w','Python is the best scripting language')
print (result)

['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 't', 'h', 'e', 'b', 'e', 's', 't', 's', 'c', 'r', 'i', 'p', 't', 'i', 'n', 'g', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']


In [23]:
result=re.findall(r'\w*','Python is the best scripting language')
print (result)

['Python', '', 'is', '', 'the', '', 'best', '', 'scripting', '', 'language', '']


Here the result also contains empty strings, because “*” matches zero or more
occurrences of the pattern to its left and so also matches at positions with
no word characters. To remove the empty strings we will go with “+”.

In [25]:
result=re.findall(r'\w+','Python is the best scripting language')
print (result)

['Python', 'is', 'the', 'best', 'scripting', 'language']


In [26]:
result=re.findall(r'^\w+','Python is the best scripting language')
print (result)

['Python']


If we use “$” instead of “^”, it returns the word at the end of
the string. Let’s look at it.

In [27]:
result=re.findall(r'\w+$','Python is the best scripting language')
print (result)

['language']
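
By default “^” and “$” anchor to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of each line. The sketch below assumes a two-line variant of the example sentence.

```python
import re

# Assumed two-line version of the example sentence
text = 'Python is the best\nscripting language'

first_default = re.findall(r'^\w+', text)
first_per_line = re.findall(r'^\w+', text, re.MULTILINE)
last_per_line = re.findall(r'\w+$', text, re.MULTILINE)

print(first_default)   # ['Python']
print(first_per_line)  # ['Python', 'scripting']
print(last_per_line)   # ['best', 'language']
```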


* Problem 2: Return the first two characters of each word *
Solution-1 Extract every consecutive pair of characters, excluding spaces
(using “\w”). Note that this returns all pairs within each word, not only the first two characters.

In [28]:
result=re.findall(r'\w\w','Python is the best')
print (result)

['Py', 'th', 'on', 'is', 'th', 'be', 'st']


Solution-2 Extract the two consecutive characters at the start of each
word boundary (using “\b”)

In [34]:
result=re.findall(r'\b\w.','Python is the best')
print (result)

['Py', 'is', 'th', 'be']
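
One caveat with the pattern “\b\w.”: the “.” matches any character, so a one-letter word pulls in the space that follows it. A possible alternative, sketched here on an assumed sentence containing a one-letter word, is “\b\w{1,2}”, which takes at most two word characters.

```python
import re

# Assumed sentence containing the one-letter word 'a'
text = 'Python is a scripting language'

with_dot = re.findall(r'\b\w.', text)
with_range = re.findall(r'\b\w{1,2}', text)

print(with_dot)    # ['Py', 'is', 'a ', 'sc', 'la']
print(with_range)  # ['Py', 'is', 'a', 'sc', 'la']
```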


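* Problem 3: Return the domain type of the given email-ids *
Solution-1 Extract all characters after “@”. On its own, “\w+” stops at the first “.”, so the second attempt below also matches the period and the characters after it.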

In [36]:
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.com,test.first@strategicsec.com, first.test@rest.biz')
print (result)

['@gmail', '@test', '@strategicsec', '@rest']


In [45]:
# \. matches a literal period; a bare . would match any character
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.com,test.first@strategicsec.com, first.test@rest.biz')
print (result)

['@gmail.com', '@test.com', '@strategicsec.com', '@rest.biz']


Solution – 2 Extract only the domain type using “( )”

In [38]:
# Only the text captured by ( ) is returned; \. matches a literal period
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.com,test.first@strategicsec.com, first.test@rest.biz')
print (result)

['com', 'com', 'com', 'biz']
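
Whether the period is escaped matters once the input contains an address without a dot after the “@”. The sketch below uses an assumed list of addresses, including the made-up entry root@localhost, to show the difference between “.” (any character) and “\.” (a literal period).

```python
import re

# Assumed input; 'root@localhost' has no period after the '@'
emails = 'abc.test@gmail.com, xyz@test.com, first.test@rest.biz, root@localhost'

unescaped = re.findall(r'@\w+.(\w+)', emails)   # '.' matches any character
escaped = re.findall(r'@\w+\.(\w+)', emails)    # '\.' matches only a period

print(unescaped)  # ['com', 'com', 'biz', 't']
print(escaped)    # ['com', 'com', 'biz']
```

With the unescaped pattern, the engine backtracks inside “localhost” and reports a spurious one-letter result.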



* Problem 4: Return the dates from the given string *
Here we will use “\d” to extract digits.
Solution:

In [40]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Joe 34-3456 12-05-2007, XYZ 56-453211-11-2016, ABC 67-8945 12-01-2009')
print (result)

['12-05-2007', '11-11-2016', '12-01-2009']


In [41]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Joe 34-3456 12-05-2007, XYZ 56-453211-11-2016, ABC 67-8945 12-01-2009')
print (result)

['2007', '2016', '2009']
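
When a pattern contains more than one group, findall() returns a list of tuples, one tuple per match. The following sketch captures all three parts of each date from a shortened version of the string above; which part is the day and which the month is not stated in the source, so the parts are left unnamed.

```python
import re

# Shortened version of the example string
text = 'Joe 34-3456 12-05-2007, ABC 67-8945 12-01-2009'

# Three groups, so each match becomes a 3-tuple of strings
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)  # [('12', '05', '2007'), ('12', '01', '2009')]

# The last element of each tuple is the four-digit year
years = [p[2] for p in parts]
print(years)  # ['2007', '2009']
```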


* Problem 5: Return all words of a string that start with a vowel *
Solution-1 Return each word

In [43]:
result=re.findall(r'\w+','Python is the best')
print (result)

['Python', 'is', 'the', 'best']
