# Regular Expressions
## Pattern Operators

What if the search pattern is not constant, and we want to return a specific set of characters (defined by a rule) from a string? This can be solved by building an expression from pattern operators (metacharacters and literal characters).

### Most Commonly Used Operators
Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help build an expression to represent the required characters in a string or file. They are commonly used in web scraping and text mining to extract the required information.

Operators &nbsp; | &nbsp; &nbsp; Description
-------------------------------------
<pre>
.           | Matches any single character except newline ‘\n’.
?           | Matches 0 or 1 occurrence of the pattern to its left.
+           | Matches 1 or more occurrences of the pattern to its left.
*           | Matches 0 or more occurrences of the pattern to its left.
\w          | Matches an alphanumeric character or underscore, whereas \W (upper case W) matches any other character.
\d          | Matches a digit [0-9], and \D (upper case D) matches a non-digit.
\s          | Matches a single whitespace character (space, newline, return, tab, form feed), and \S (upper case S)
            | matches any non-whitespace character.
\b          | Matches the boundary between a word and a non-word character, and \B is the opposite of \b.
[..]        | Matches any single character in the square brackets, and [^..] matches any single character not in the
            | square brackets.
\           | Escapes characters with special meaning, e.g. \. to match a period or \+ to match a plus sign.
^ and $     | ^ and $ match the start and end of the string respectively.
{n,m}       | Matches at least n and at most m occurrences of the preceding expression; written as {,m}, it
            | matches from 0 up to m occurrences.
a|b         | Matches either a or b.
( )         | Groups part of an expression and captures the text it matches.
\t, \n, \r  | Match tab, newline, return.
</pre>
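
A few of the operators in the table are not exercised by the problems that follow, so here is a minimal sketch of how they behave. The sample strings and variable names are made up for illustration only.

```python
import re

# Made-up sample text, used only to illustrate the operators
text = "color colour colouur, cat or dog, a1 b22 c333"

# "?" makes the preceding "u" optional (0 or 1 occurrence)
optional_u = re.findall(r'colou?r', text)

# "{2,3}" matches two or three digits in a row
two_or_three_digits = re.findall(r'\d{2,3}', text)

# "a|b" matches either alternative
pets = re.findall(r'cat|dog', text)

# "\." matches a literal period, so "3x5" is not picked up
literal_dot = re.findall(r'\d\.\d', "3.5 and 3x5")

print(optional_u)           # ['color', 'colour']
print(two_or_three_digits)  # ['22', '333']
print(pets)                 # ['cat', 'dog']
print(literal_dot)          # ['3.5']
```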

### Problem 1: Return the first word of a given string

#### Solution-1  Extract each character (using “.”)

In [2]:
import re
result=re.findall(r'.','AV is largest Analytics community of India')

In [2]:
result

['A',
 'V',
 ' ',
 'i',
 's',
 ' ',
 'l',
 'a',
 'r',
 'g',
 'e',
 's',
 't',
 ' ',
 'A',
 'n',
 'a',
 'l',
 'y',
 't',
 'i',
 'c',
 's',
 ' ',
 'c',
 'o',
 'm',
 'm',
 'u',
 'n',
 'i',
 't',
 'y',
 ' ',
 'o',
 'f',
 ' ',
 'I',
 'n',
 'd',
 'i',
 'a']

Above, the spaces are also extracted. To avoid that, use “\w” instead of “.”.

#### Solution-2  Extract each character (using “\w”)

In [4]:
result=re.findall(r'\w','AV is largest Analytics community of India')

In [5]:
result

['A',
 'V',
 'i',
 's',
 'l',
 'a',
 'r',
 'g',
 'e',
 's',
 't',
 'A',
 'n',
 'a',
 'l',
 'y',
 't',
 'i',
 'c',
 's',
 'c',
 'o',
 'm',
 'm',
 'u',
 'n',
 'i',
 't',
 'y',
 'o',
 'f',
 'I',
 'n',
 'd',
 'i',
 'a']

#### Solution-3  Extract each word (using “ * ” or “+”)

In [6]:
result=re.findall(r'\w*','AV is largest Analytics community of India')

In [7]:
result

['AV',
 '',
 'is',
 '',
 'largest',
 '',
 'Analytics',
 '',
 'community',
 '',
 'of',
 '',
 'India',
 '']

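Again, the output contains unwanted entries. The empty strings appear because “ * ” allows zero occurrences of the pattern to its left, so “\w*” also matches the empty string at each space and at the end of the string. To remove them, we will go with “+”.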

In [8]:
result=re.findall(r'\w+','AV is largest Analytics community of India')

In [9]:
result

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

#### Solution-4  Extract the first word (using “^”)

In [10]:
result=re.findall(r'^\w+','AV is largest Analytics community of India')

In [11]:
result

['AV']

If we use “$” instead of “^”, it returns the last word of the string. Let’s look at it.

In [12]:
result=re.findall(r'\w+$','AV is largest Analytics community of India')

In [13]:
result

['India']
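
When only one word is needed, a match object can return it as a plain string instead of a one-element list. The following is a small sketch of that alternative using re.match() and re.search(); the variable names are illustrative.

```python
import re

text = 'AV is largest Analytics community of India'

# re.match only tries to match at the start of the string,
# so no "^" anchor is needed here
first = re.match(r'\w+', text)
first_word = first.group() if first else None

# re.search scans forward; the "$" anchor pins the match to the end
last = re.search(r'\w+$', text)
last_word = last.group() if last else None

print(first_word)  # AV
print(last_word)   # India
```

Both functions return None when nothing matches, which is why the result is checked before calling group().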

### Problem 2: Return the first two characters of each word

#### Solution-1  Extract consecutive two characters of each word, excluding spaces (using “\w”)

In [14]:
result=re.findall(r'\w\w','AV is largest Analytics community of India')

In [15]:
result

['AV',
 'is',
 'la',
 'rg',
 'es',
 'An',
 'al',
 'yt',
 'ic',
 'co',
 'mm',
 'un',
 'it',
 'of',
 'In',
 'di']

#### Solution-2  Extract the two characters at the start of each word (using “\b”)

In [16]:
result=re.findall(r'\b\w.','AV is largest Analytics community of India')

In [17]:
result

['AV', 'is', 'la', 'An', 'co', 'of', 'In']
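
One caveat with “\b\w.”: the “.” accepts any character, so a one-letter word pulls in the character that follows it. The sketch below uses a made-up string to show the difference from “\b\w\w”.

```python
import re

# Made-up string containing a one-letter word
text = 'A big idea'

# "." matches the space after the one-letter word "A"
with_dot = re.findall(r'\b\w.', text)

# Two "\w" in a row require both characters to be word characters
with_word_chars = re.findall(r'\b\w\w', text)

print(with_dot)         # ['A ', 'bi', 'id']
print(with_word_chars)  # ['bi', 'id']
```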

### Problem 3: Return the domain type of given email-ids

#### Solution-1  Extract all characters after “@”

In [18]:
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')

In [19]:
result

['@gmail', '@test', '@analyticsvidhya', '@rest']

Above, you can see that the “.com” and “.in” parts are not extracted. To include them, we extend the pattern as below. The dot is escaped as “\.” so that it matches only a literal period.

In [20]:
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')

In [21]:
result

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']

#### Solution-2  Extract only the domain type using “( )”

In [22]:
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')

In [23]:
result

['com', 'in', 'com', 'biz']
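
The dot between the two “\w+” parts is meant to be a literal period. As a quick check on a made-up string, this sketch compares an unescaped “.” (any character) with an escaped “\.” (period only).

```python
import re

# Made-up string: "@home" is not part of an email address
text = 'ping @home tonight, or mail abc@test.in'

# Unescaped "." also matches the space after "@home"
unescaped = re.findall(r'@\w+.\w+', text)

# Escaped "\." only matches a real period
escaped = re.findall(r'@\w+\.\w+', text)

print(unescaped)  # ['@home tonight', '@test.in']
print(escaped)    # ['@test.in']
```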

### Problem 4: Return date from given string
Here we will use “\d” to extract digits.

In [24]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')

In [25]:
result

['12-05-2007', '11-11-2011', '12-01-2009']

If you want to extract only the year, parentheses “( )” will again help you.

In [27]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')

In [28]:
result

['2007', '2011', '2009']
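
With more than one pair of parentheses, re.findall() returns one tuple per match, with one item per group. The sketch below captures all three parts of each date; the names “parts” and “years” are illustrative.

```python
import re

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# Three groups: each match becomes a 3-tuple of strings
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)

# Pick the last group out of each tuple to get the years
years = [year for _, _, year in parts]

print(parts)  # [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]
print(years)  # ['2007', '2011', '2009']
```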

### Problem 5: Return all words of a string that start with a vowel

#### Solution-1  Return each word

In [29]:
result=re.findall(r'\w+','AV is largest Analytics community of India')

In [30]:
result

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

#### Solution-2  Return text that starts with a vowel (using “[]”)

In [31]:
result=re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')

In [32]:
result

['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']

Above you can see that it has returned “argest” and “ommunity” from the middle of words. To drop these two, we need to use “\b” for the word boundary.

#### Solution-3  Return words that start with a vowel (using “\b”)

In [33]:
result=re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')

In [34]:
result

['AV', 'is', 'Analytics', 'of', 'India']

In a similar way, we can extract words that start with consonants by using “^” inside the square brackets.

In [35]:
result=re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')

In [36]:
result

[' is', ' largest', ' Analytics', ' community', ' of', ' India']

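Above you can see that the returned words start with a space, because “[^aeiouAEIOU]” matches any character that is not a vowel, including a space. To drop these from the output, include a space inside the square brackets “[]”.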

In [37]:
result=re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')

In [38]:
result

['largest', 'community']
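
Excluding the space handles this sentence, but other non-word characters such as hyphens would still slip through. One possible variation, shown here on a made-up string, is to put “\W” inside the negated set so that every non-word character is excluded. Digits and underscores still count as word characters in this sketch.

```python
import re

# Made-up string with hyphens between words
text = 'state-of-the-art community'

# Excluding only the space still lets "-" start a match
space_only = re.findall(r'\b[^aeiouAEIOU ]\w+', text)

# Adding "\W" to the negated set excludes every non-word character
stricter = re.findall(r'\b[^\WaeiouAEIOU]\w+', text)

print(space_only)  # ['state', '-of', '-the', '-art', 'community']
print(stricter)    # ['state', 'the', 'community']
```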

### Problem 6: Validate a phone number (the phone number must have 10 digits and start with 8 or 9)
We have a list of phone numbers in the list “li”, and here we will validate them using regular expressions.

In [40]:
li=['9999999999','999999-999','99999x9999']
for val in li:
    if re.match(r'[8-9]{1}[0-9]{9}',val) and len(val) == 10:
        print ('yes')
    else:
        print ('no')

yes
no
no
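
The length check is needed above because re.match() only anchors the pattern at the start of the string. As an alternative sketch, re.fullmatch() requires the whole string to match, so the pattern alone enforces the length. The two extra test numbers are made up for illustration.

```python
import re

# The three numbers from above, plus two made-up ones:
# an 11-digit number and a number starting with 7
numbers = ['9999999999', '999999-999', '99999x9999', '99999999999', '7999999999']

# First digit 8 or 9, followed by exactly nine more digits
pattern = re.compile(r'[89][0-9]{9}')

# fullmatch succeeds only if the entire string fits the pattern
results = ['yes' if pattern.fullmatch(n) else 'no' for n in numbers]

print(results)  # ['yes', 'no', 'no', 'no', 'no']
```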


### Problem 7: Split a string with multiple delimiters

In [41]:
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
result

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

We can also use the method re.sub() to replace each of these delimiters with a single space “ ”.

In [42]:
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
result

'asdf fjdk afed fjek asdf foo'
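
The pattern “[;,\s]” treats every delimiter separately, so two delimiters in a row produce an empty string when splitting. A small sketch on a made-up variant of the line shows how adding “+” treats a run of delimiters as one.

```python
import re

# Made-up variant of the line with repeated delimiters
line = 'asdf  fjdk; afed,fjek,,asdf,foo'

# One split per delimiter: empty strings appear between neighbours
single = re.split(r'[;,\s]', line)

# "+" treats a run of delimiters as a single separator
runs = re.split(r'[;,\s]+', line)

# The same idea with re.sub collapses each run into one space
normalized = re.sub(r'[;,\s]+', ' ', line)

print(single)      # ['asdf', '', 'fjdk', '', 'afed', 'fjek', '', 'asdf', 'foo']
print(runs)        # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(normalized)  # asdf fjdk afed fjek asdf foo
```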

### Problem 8: Retrieve Information from HTML file
I want to extract information from an HTML file (see the sample data below). Here we need to extract the information between <td> and </td>, except for the first numerical index. I have assumed here that the HTML code below is stored in a string <b>str</b>.

Sample HTML file (str)
<pre>
"<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>"
"<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>"
"<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>"
"<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>"
"<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>"
"<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>"
"<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>"
</pre>
<strong>Solution:</strong>

In [44]:
str = '<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr><tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr><tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr><tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr><tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr><tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr><tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>'
result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)

In [45]:
result

[('Noah', 'Emma'),
 ('Liam', 'Olivia'),
 ('Mason', 'Sophia'),
 ('Jacob', 'Isabella'),
 ('William', 'Ava'),
 ('Ethan', 'Mia')]
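
Row 7 is missing from the result because its second cell is written as “<td HTML>”, which the literal “<td>” in the pattern does not match. If such rows should be included, one possible approach is to allow optional attributes inside the opening tag with “[^>]*”. The sketch below uses only the last two rows and a hypothetical variable name “html”. For anything beyond simple, regular markup, a dedicated HTML parser is usually the safer choice.

```python
import re

# Last two rows of the sample data, stored in a hypothetical variable "html"
html = ('<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>'
        '<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>')

# Original pattern: requires the opening tag to be exactly "<td>"
strict = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)

# "[^>]*" allows any characters other than ">" inside the opening tag
tolerant = re.findall(
    r'<td[^>]*>\w+</td>\s<td[^>]*>(\w+)</td>\s<td[^>]*>(\w+)</td>', html)

print(strict)    # [('Ethan', 'Mia')]
print(tolerant)  # [('Ethan', 'Mia'), ('Michael', 'Emily')]
```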