# Regular Expression

In the last few years, there has been a dramatic shift toward using general-purpose programming languages for data science and machine learning. This was not always the case: a decade ago, the idea would have been met with a lot of skepticism!

This means that more people and organizations are using tools like Python and JavaScript to meet their data needs. This is where regular expressions become very useful. They are a standard way of cleaning and wrangling text data in most of these tools. Whether you are extracting specific parts of text from web pages, making sense of Twitter data, or preparing your data for text mining, regular expressions are a strong choice for these tasks.

Given their applicability, it makes sense to know them and use them appropriately.

## What is a Regular Expression and how is it used?

Simply put, a regular expression is a sequence of characters mainly used to find and replace patterns in a string or file. Regular expressions are supported by most programming languages, including Python, Perl, R, Java and many others, so learning them pays off in multiple ways. Regular expressions use two types of characters:

a) Meta characters: As the name suggests, these characters have a special meaning, similar to * in a wildcard.

b) Literals (like a,b,1,2…)
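To make the distinction concrete, here is a small sketch (the sample text is made up for illustration) that uses the same dot first as a meta character and then, escaped with a backslash, as a literal:

```python
import re

text = 'a.c abc a-c'

# "." is a meta character here: it matches any single character except newline.
wildcard_hits = re.findall(r'a.c', text)

# Escaping the dot turns it into a literal period.
literal_hits = re.findall(r'a\.c', text)

print(wildcard_hits)  # ['a.c', 'abc', 'a-c']
print(literal_hits)   # ['a.c']
```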

In Python, the module “re” provides support for regular expressions, so you need to import it before you can use regular expressions in Python.

Use this code:

import re

The most common uses of regular expressions are:

1. Search a string (search and match)
2. Find a string (findall)
3. Break a string into substrings (split)
4. Replace part of a string (sub)

Let’s look at the methods that the “re” library provides to perform these tasks.

## What are the various methods of Regular Expressions?

The ‘re’ package provides multiple methods to perform queries on an input string. Here are the most commonly used methods, which I will discuss:

1. re.match()
2. re.search()
3. re.findall()
4. re.split()
5. re.sub()
6. re.compile()

Let’s look at them one by one.

### re.match(pattern, string): 

This method finds a match only if it occurs at the start of the string. For example, calling match() on the string ‘AV Analytics Vidhya AV’ and looking for the pattern ‘AV’ will match. However, if we look for only ‘Analytics’, the pattern will not match. Let’s perform it in Python now.

In [1]:
import re

In [2]:
result = re.match(r'AV', 'AV Analytics Vidhya AV')

print(result)

<re.Match object; span=(0, 2), match='AV'>


In [3]:
# below code for getting the match string

result = re.match(r'AV', 'AV Analytics Vidhya AV')

print(result.group(0))

AV


In [4]:
# trying a pattern which is not at the start of the string

result = re.match(r'Analytics', 'AV Analytics Vidhya AV')

print(result)

None


In [5]:
# There are methods like start() and end() to know the start and end position of matching pattern in the string.

result = re.match(r'AV', 'AV Analytics Vidhya AV')

print(result.start())
print(result.end())

0
2
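
Because match() returns None when nothing matches, calling group(), start() or end() on the result without a check raises an AttributeError. The sketch below shows one way to guard against that; the helper name `first_match_text` is hypothetical and not part of the “re” module.

```python
import re

def first_match_text(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    result = re.match(pattern, text)
    # re.match() returns None when the pattern does not match at position 0,
    # so check before calling group(), start(), or end().
    if result is None:
        return None
    return result.group(0)

print(first_match_text(r'AV', 'AV Analytics Vidhya AV'))         # AV
print(first_match_text(r'Analytics', 'AV Analytics Vidhya AV'))  # None
```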


### re.search(pattern, string):

It is similar to match(), but it doesn’t restrict us to finding matches at the beginning of the string only. Unlike the previous method, here searching for the pattern ‘Analytics’ will return a match. Note that search() only returns the first occurrence of the search pattern.

In [7]:
result = re.search(r'Analytics' , 'AV Analytics Vidhya AV')

print(result.group(0))

Analytics
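
As a quick illustration of these points, the sketch below reuses the same sample string to show that search() reports only the first occurrence, that the match object carries the position of the match, and that the result is None when the pattern is absent (the pattern ‘Python’ is just an example of a missing word):

```python
import re

text = 'AV Analytics Vidhya AV'

# 'AV' occurs twice, but search() stops at the first occurrence.
first_av = re.search(r'AV', text)
print(first_av.span())  # (0, 2)

# The match does not have to be at the start of the string.
found = re.search(r'Analytics', text)
print(found.start(), found.end())  # 3 12

# No occurrence anywhere in the string gives None.
missing = re.search(r'Python', text)
print(missing)  # None
```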


### re.findall(pattern, string):

It returns a list of all non-overlapping matches of the pattern. It has no constraints on searching from the start or end. If we use findall() to search for ‘AV’ in the given string, it will return both occurrences of AV. When you only need the matched text, re.findall() can cover many of the cases handled by re.search() and re.match(). Keep in mind, however, that it returns plain strings rather than match objects, so methods such as start() and end() are not available on its results.

In [8]:
result = re.findall(r'AV', 'AV Analytics Vidhya AV')

print(result)

['AV', 'AV']
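
Since findall() gives back only the matched text, the position of each match is lost. If the positions matter, one option is re.finditer() from the same module, which yields a match object per occurrence. A minimal sketch on the same string:

```python
import re

text = 'AV Analytics Vidhya AV'

# finditer() yields one match object per occurrence, so start() and end() are available.
spans = [(m.group(0), m.start(), m.end()) for m in re.finditer(r'AV', text)]

print(spans)  # [('AV', 0, 2), ('AV', 20, 22)]
```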


### re.split(pattern, string, [maxsplit=0]):

This method splits the string at the occurrences of the given pattern.

In [10]:
result = re.split(r'y', 'Analytics')

result

['Anal', 'tics']

In [11]:
# without maxsplit (default 0), the string is split at every occurrence

result = re.split(r'i', 'Analytics Vidhya')

print(result)

['Analyt', 'cs V', 'dhya']


In [12]:

# maxsplit=1 stops after the first split
result = re.split(r'i', 'Analytics Vidhya', maxsplit=1)

print(result)

['Analyt', 'cs Vidhya']
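
The pattern passed to split() does not have to be a single character. The sketch below, with made-up input strings, splits on runs of whitespace and also shows that wrapping the pattern in a capturing group keeps the delimiter in the result:

```python
import re

# Split on runs of whitespace rather than on a single character.
words = re.split(r'\s+', 'Analytics   Vidhya  AV')
print(words)  # ['Analytics', 'Vidhya', 'AV']

# A capturing group in the pattern keeps the delimiter in the result.
with_delimiter = re.split(r'(i)', 'Analytics Vidhya', maxsplit=1)
print(with_delimiter)  # ['Analyt', 'i', 'cs Vidhya']
```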


### re.sub(pattern, repl, string):


It searches for a pattern and replaces each occurrence with a new substring. If the pattern is not found, the string is returned unchanged.

In [13]:
result = re.sub(r'India', 'the world', 'AV is the largest Analytics community of India')

print(result)

AV is the largest Analytics community of the world
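
Two further options of re.sub() are often handy: the optional `count` argument limits how many occurrences are replaced, and the replacement string can refer back to captured groups with `\1`, `\2` and so on. A short sketch with illustrative inputs:

```python
import re

# count=1 replaces only the first occurrence.
only_first = re.sub(r'AV', 'Analytics Vidhya', 'AV AV AV', count=1)
print(only_first)  # Analytics Vidhya AV AV

# Backreferences reuse the captured groups: reorder dd-mm-yyyy into yyyy-mm-dd.
reordered = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\2-\1', '12-05-2007')
print(reordered)  # 2007-05-12

# When the pattern is not found, the input comes back unchanged.
unchanged = re.sub(r'India', 'the world', 'no match here')
print(unchanged)  # no match here
```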


### re.compile(pattern):

We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. It also lets us search with the same pattern again without rewriting it.

In [16]:
pattern = re.compile('AV')

result = pattern.findall('AV Analytics Vidhya AV')

print(result)

['AV', 'AV']


In [17]:
result2 = pattern.findall('AV is largest analytics community of India')

print(result2)

['AV']
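
A compiled pattern exposes the same operations as the module-level functions (findall, search, sub and others), and re.compile() also accepts optional flags. The sketch below assumes a case-insensitive search is wanted, which the examples above do not require:

```python
import re

# re.IGNORECASE makes the compiled pattern match 'AV', 'av', 'Av', ...
pattern = re.compile(r'av', re.IGNORECASE)

all_hits = pattern.findall('AV Analytics Vidhya av')
print(all_hits)  # ['AV', 'av']

# The same object can be reused with other methods.
replaced = pattern.sub('X', 'AV and av')
print(replaced)  # X and X

no_hit = pattern.search('Vidhya')
print(no_hit)  # None
```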


### What are the most commonly used operators?

Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help to build an expression representing the required characters in a string or file. They are commonly used in web scraping and text mining to extract the required information.

| # | Operator | Description |
|---|---|---|
| 1 | `.` | Matches any single character except newline `\n`. |
| 2 | `?` | Matches 0 or 1 occurrence of the pattern to its left. |
| 3 | `+` | Matches 1 or more occurrences of the pattern to its left. |
| 4 | `*` | Matches 0 or more occurrences of the pattern to its left. |
| 5 | `\w` | Matches an alphanumeric character (letters, digits, and underscore), whereas `\W` (upper case W) matches a non-alphanumeric character. |
| 6 | `\d` | Matches a digit [0-9], and `\D` (upper case D) matches a non-digit. |
| 7 | `\s` | Matches a single whitespace character (space, newline, return, tab, form feed), and `\S` (upper case S) matches any non-whitespace character. |
| 8 | `\b` | Matches the boundary between a word and a non-word character; `\B` is the opposite of `\b`. |
| 9 | `[..]` | Matches any single character in the square brackets, and `[^..]` matches any single character not in the square brackets. |
| 10 | `\` | Escapes characters with a special meaning, e.g. `\.` to match a period or `\+` to match a plus sign. |
| 11 | `^` and `$` | Match the start and end of the string, respectively. |
| 12 | `{n,m}` | Matches at least n and at most m occurrences of the preceding expression. Written as `{,m}`, it matches from 0 up to m occurrences. |
| 13 | `a\|b` | Matches either a or b. |
| 14 | `( )` | Groups regular expressions and returns the matched text. |
| 15 | `\t`, `\n`, `\r` | Match tab, newline, and return, respectively. |
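
A few of these operators can be tried out side by side. The sketch below uses made-up sample strings; each comment names the operator being exercised:

```python
import re

sample = 'Order 66 shipped on 2019-05-19 to box_7, cost $4.50'

digit_runs = re.findall(r'\d+', sample)                # "+": one or more digits
two_to_four = re.findall(r'\d{2,4}', sample)           # "{n,m}": between 2 and 4 digits
price = re.findall(r'\$\d+\.\d+', sample)              # "\$" and "\.": escaped literals
spellings = re.findall(r'colou?r', 'color colour')     # "?": optional "u"
pets = re.findall(r'cat|dog', 'cat dog bird')          # "|": either alternative
edges = re.findall(r'^\w+|\w+$', 'first middle last')  # "^" and "$": string anchors

print(digit_runs)   # ['66', '2019', '05', '19', '7', '4', '50']
print(two_to_four)  # ['66', '2019', '05', '19', '50']
print(price)        # ['$4.50']
print(spellings)    # ['color', 'colour']
print(pets)         # ['cat', 'dog']
print(edges)        # ['first', 'last']
```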


Now, let’s understand the pattern operators by looking at the examples below.

### Some Examples of Regular Expressions

#### Problem 1: Return the first word of a given string

In [18]:
## Extract each character (using “.”)

result = re.findall(r'.', 'AV is largest Analytics community of India')

print(result)

['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']


In [19]:
# Extract each character, excluding spaces (using “\w”)

result = re.findall(r'\w','AV is largest Analytics community of India')

print(result)

['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']


In [20]:
## Extract each word (using “*” or “+“)

result = re.findall(r'\w*','AV is largest Analytics community of India')

print(result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']


In [21]:
## Use “+” instead of “*” to drop the empty strings

result = re.findall(r'\w+','AV is largest Analytics community of India')

print(result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [22]:
## Extract the first word (using “^“)

# Extract first word
result = re.findall(r'^\w+','AV is largest Analytics community of India')

print(result)

['AV']


In [25]:
# Extract last word

result = re.findall(r'\w+$','AV is largest Analytics community of India')

print(result)

['India']


#### Problem 2: Return the first two characters of each word

In [26]:
## Extract every pair of consecutive characters in each word, excluding spaces (using “\w“)

result = re.findall(r'\w\w', 'AV is largest Analytics community of India')

print(result)

['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']


In [28]:
## Extract the two consecutive characters at the start of each word (using “\b“)

result = re.findall(r'\b\w.','AV is largest Analytics community of India')

print(result)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']


#### Problem 3: Return the domain type of the given email IDs

In [31]:
## Extract all characters after “@”

result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')

print(result)

['@gmail', '@test', '@analyticsvidhya', '@rest']


In [32]:
## To include ".com", ".in" and so on, escape the dot ("\.") so that it matches a literal period.

result = re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')

print (result)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']


In [33]:
## Extract only the domain type using “( )”

result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')

print(result)

['com', 'in', 'com', 'biz']


#### Problem 4: Return date from given string

In [34]:
## we will use “\d” to extract digit.

result = re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')

print(result)

['12-05-2007', '11-11-2011', '12-01-2009']


In [36]:
## If you want to extract only the year, parentheses “( )” will help again.

# Extract year
result = re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')

print(result)

['2007', '2011', '2009']


In [37]:
# Extract month
result = re.findall(r'\d{2}-(\d{2})-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')

print(result)

['05', '11', '01']
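
When the pattern contains more than one pair of parentheses, findall() returns a list of tuples with one entry per group. Assuming the dates are in day-month-year order, as the month example above suggests, all three parts can be pulled out in one pass:

```python
import re

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# Three groups: each match becomes a (day, month, year) tuple.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)  # [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]

# Unpack the tuples to work with individual fields.
years = [year for _day, _month, year in parts]
print(years)  # ['2007', '2011', '2009']
```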


#### Problem 5: Return all words of a string that start with a vowel

In [38]:
## Return each word

result = re.findall(r'\w+','AV is largest Analytics community of India')

print(result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [39]:
## Return words starting with a vowel (using [])

result = re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')

print(result)

['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']


In [40]:
## Above you can see that it has returned “argest” and “ommunity” from the middle of words.
## To drop these two, we need to use “\b” for the word boundary.


## words starting with a vowel only
result = re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')

print(result)

['AV', 'is', 'Analytics', 'of', 'India']


In [41]:
# Extract only words starting with a consonant

result = re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')

print(result)

[' is', ' largest', ' Analytics', ' community', ' of', ' India']


In [42]:
## Above you can see that it has returned words with a leading space. To drop the space from the output, include a space inside the square brackets [].

result = re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')

print(result)

['largest', 'community']
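
As an alternative to listing both upper and lower case vowels, the re.IGNORECASE flag can be passed to findall(). Note also that “\w+” needs at least one character after the first letter, so one-letter words are skipped; “\w*” includes them. A small sketch (the second sentence is made up for illustration):

```python
import re

text = 'AV is largest Analytics community of India'

# The flag makes [aeiou] match upper case vowels too.
vowel_words = re.findall(r'\b[aeiou]\w+', text, flags=re.IGNORECASE)
print(vowel_words)  # ['AV', 'is', 'Analytics', 'of', 'India']

# "\w*" also accepts one-letter words such as "I".
with_short_words = re.findall(r'\b[aeiou]\w*', 'I ate an apple', flags=re.IGNORECASE)
print(with_short_words)  # ['I', 'ate', 'an', 'apple']
```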


#### Problem 6: Validate a phone number (the phone number must have 10 digits and start with 8 or 9)

In [43]:
li = ['9873415255','987341-525','98734x5255']

for val in li:
    if re.match(r'[8-9]{1}[0-9]{9}', val) and len(val) == 10:
        print('Valid Phone Number')
    else:
        print('Not a valid phone number')

Valid Phone Number
Not a valid phone number
Not a valid phone number
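
The length check is needed above because match() only anchors the pattern at the start of the string. A possible alternative is fullmatch(), which requires the whole string to match, so the pattern alone enforces the length. The helper name `is_valid_phone` and the two extra test numbers are illustrative assumptions:

```python
import re

# First digit 8 or 9, followed by exactly nine more digits.
PHONE_PATTERN = re.compile(r'[89][0-9]{9}')

def is_valid_phone(value):
    """Return True if `value` is exactly 10 digits and starts with 8 or 9."""
    # fullmatch() succeeds only when the entire string matches the pattern.
    return PHONE_PATTERN.fullmatch(value) is not None

numbers = ['9873415255', '987341-525', '98734x5255', '98734152550', '7873415255']
checks = [is_valid_phone(n) for n in numbers]
print(checks)  # [True, False, False, False, False]
```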


#### Problem 7: Split a string with multiple delimiters

In [44]:
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").

result = re.split(r'[;,\s]', line)

print(result)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']


In [45]:
## We can also use re.sub() to replace each of these delimiters with a single space " ".

line = 'asdf fjdk;afed,fjek,asdf,foo'

result = re.sub(r'[;,\s]',' ', line)

print(result)

asdf fjdk afed fjek asdf foo
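
If delimiters can appear next to each other (for example a comma followed by a space), splitting on a single character produces empty strings. Adding “+” to the character class treats a run of delimiters as one. The input line below is a made-up variation of the one above:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,   foo'

# One delimiter at a time: adjacent delimiters leave empty strings behind.
single = re.split(r'[;,\s]', line)

# One or more delimiters at a time: runs are treated as a single separator.
merged = re.split(r'[;,\s]+', line)

print('' in single)  # True
print(merged)        # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```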


#### Problem 8: Retrieve Information from HTML file

In [57]:
html_table = '<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr> <tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr> <tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr> <tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr> <tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr> <tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr> <tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>'

# Skip the first cell of each row (the rank) and capture the two names that follow.
result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html_table)

# Row 7 is not matched because its "<td HTML>" tag differs from the literal "<td>" in the pattern.
print(result)

[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia')]
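
The pattern above expects every cell to open with a bare `<td>` tag. As a rough sketch, assuming cells may carry attributes, `<td[^>]*>` accepts anything up to the closing `>` of the tag. The sample row and its `class` attribute are invented for illustration, and regular expressions remain fragile on real-world HTML, where a dedicated parser is usually the safer choice.

```python
import re

row = '<tr align="center"><td>7</td> <td class="name">Michael</td> <td>Emily</td></tr>'

# "[^>]*" matches any attributes inside the opening tag, including none at all.
cells = re.findall(r'<td[^>]*>(\w+)</td>', row)

print(cells)  # ['7', 'Michael', 'Emily']
```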


In [63]:
## We can read an HTML page using urllib.request (see the code below).

from urllib.request import urlopen

response = urlopen("http://www.google.com")
html = response.read()
print(html)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/logos/doodles/2019/india-elections-2019-5842323225706496-l.png" itemprop="image"><meta content="India Elections 2019 - Phase 7" property="twitter:title"><meta content="India Elections 2019 - Phase 7 #GoogleDoodle" property="twitter:description"><meta content="India Elections 2019 - Phase 7 #GoogleDoodle" property="og:description"><meta content="summary_large_image" property="twitter:card"><meta content="@GoogleDoodles" property="twitter:site"><meta content="https://www.google.com/logos/doodles/2019/india-elections-2019-5842323225706496-2x.png" property="twitter:image"><meta content="https://www.google.com/logos/doodles/2019/india-elections-2019-5842323225706496-2x.png" property="og:image"><meta content="1400" property="og:image:width"><meta content="420" property="og:image:height"><title>Google</title><script nonc