# Learn to Use Regular Expressions (RegEx)

Python has a built-in module to work with regular expressions called **re**. Some of the commonly used methods from the **re** module are listed below:

1. re.match()

2. re.search()

3. re.findall()

4. re.sub()

<br>

Let us look at each method with the help of an example.

**1. re.match()**

The re.match function checks for a match only at the beginning of the string. It returns a match object on success and None on failure.

In [1]:
# import re library
import re

#match a word at the beginning of a string

result = re.match('Divyansh','Divyansh is lazy!') 
print(result)

result_2 = re.match('is','Divyansh is lazy!') 
print(result_2)

<re.Match object; span=(0, 8), match='Divyansh'>
None


Since the output of *re.match()* is a match object, we use the *group()* method of that object to get the matched text.

In [2]:
print(result.group())  # returns the text matched by the pattern

Divyansh
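
Because *re.match()* returns None when the pattern is not found at the start of the string, calling *group()* on a failed match raises an AttributeError. A minimal sketch of guarding against that is given here; the helper name `matched_text_or_none` is hypothetical and only used for illustration.

```python
import re

def matched_text_or_none(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    result = re.match(pattern, text)
    if result is None:
        # No match at the beginning of the string, so there is nothing to extract
        return None
    return result.group()

matched = matched_text_or_none('Divyansh', 'Divyansh is lazy!')
missed = matched_text_or_none('is', 'Divyansh is lazy!')

print(matched)
print(missed)
```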


<br>

**2. re.search()**

Scans the entire string and matches the first occurrence of a pattern.

In [3]:
# search for the pattern "founded" in a given string
result = re.search('founded','Sid founded this company, he also founded that company')
print(result.group())

founded
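
To make the difference between the two methods concrete, the sketch below runs both on the same sentence. *re.match()* fails because the string does not start with "founded", while *re.search()* finds the first occurrence and reports where it sits through the *span()* method of the match object.

```python
import re

text = 'Sid founded this company, he also founded that company'

at_start = re.match('founded', text)    # only looks at the beginning of the string
anywhere = re.search('founded', text)   # scans the whole string for the first hit

print(at_start)          # None
print(anywhere.span())   # (4, 11)
```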


<br>

**3. re.findall()**

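It returns all non-overlapping occurrences of the pattern in the string as a list. I would recommend *re.findall()* as a general-purpose choice, since it can cover many uses of *re.search()* and *re.match()*. Note, however, that it returns a list of strings rather than a match object.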

In [4]:
result = re.findall('founded','Sid founded this company, he also founded that company')  
print(result)

['founded', 'founded']
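
*re.findall()* gives back plain strings, so the positions of the matches are lost. When the positions matter, *re.finditer()* from the same module yields one match object per occurrence. A small sketch, reusing the sentence from the example above:

```python
import re

text = 'Sid founded this company, he also founded that company'

# finditer yields a match object for every non-overlapping occurrence
positions = [m.start() for m in re.finditer('founded', text)]

print(positions)
```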


**4. re.sub()**

This method returns a string in which every matched occurrence is replaced with a new text string.

In [5]:
result = re.sub('he', 'Sid', 'Sid founded this company, he also founded that company')  
print(result)

Sid founded this company, Sid also founded that company


In [6]:
result = re.sub('he', '', 'Sid founded this company, he also founded that company')  
print(result)

Sid founded this company,  also founded that company
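
By default *re.sub()* replaces every occurrence. The sketch below shows two related options from the **re** module: the *count* argument, which limits the number of replacements, and *re.subn()*, which also reports how many replacements were made. The replacement word "startup" is just an illustrative choice.

```python
import re

text = 'Sid founded this company, he also founded that company'

# Replace only the first occurrence
first_only = re.sub('company', 'startup', text, count=1)

# subn returns a tuple: (new string, number of replacements made)
new_text, n_replaced = re.subn('company', 'startup', text)

print(first_only)
print(new_text)
print(n_replaced)
```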


## Special sequences

1. **\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [7]:
# Check if there is any word that ends with "ics"
x = re.findall(r"ics\b", "Analytics is cool")
print(x)

['ics']


In [11]:
# Check if there is any word that starts with "Ana"
x = re.findall(r"\bAna", "Analytics is cool")
print(x)

['Ana']


In [9]:
# Check if there is any word that ends with "ics"
x = re.findall(r"ics\b", "Analyticz is coolics")
print(x)

['ics']
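
Placing **\b** on both sides of a pattern restricts the match to whole words. The sentence in this sketch is a made-up example chosen because "is" also hides inside other words.

```python
import re

text = "This is his thesis"

# Without boundaries, "is" is also found inside "This", "his" and "thesis"
loose = re.findall(r"is", text)

# With \b on both sides, only the standalone word "is" matches
whole_word = re.findall(r"\bis\b", text)

print(len(loose))
print(whole_word)
```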


2. **\d** returns a match wherever the string contains a digit (0-9).

In [12]:
str = "2 million monthly visits in Jan'19."

# Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)

print(x)

['2', '1', '9']


In [13]:
str = "2 million monthly visits in Jan'19."

# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' keeps extracting digits until a non-digit character is encountered
x = re.findall(r"\d+", str)

print(x)

['2', '19']


We can infer that **\d+** matches one or more consecutive occurrences of **\d** until a non-matching character is found, whereas **\d** matches one digit at a time.
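
Since **\d+** returns each run of digits as a string, a common follow-up step is converting the runs to integers. A short sketch on the same sentence:

```python
import re

text = "2 million monthly visits in Jan'19."

single_digits = re.findall(r"\d", text)    # one match per digit
digit_runs = re.findall(r"\d+", text)      # one match per run of digits

# The matches are strings, so convert them before doing arithmetic
numbers = [int(run) for run in digit_runs]

print(single_digits)
print(digit_runs)
print(numbers)
```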

3. **\w** matches word characters: letters (a-z and A-Z), digits (0-9), and the underscore _ character.


In [14]:
str = "2 million monthly visits!"

x = re.findall(r"\w+", str)

print(x)

['2', 'million', 'monthly', 'visits']
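
The underscore counts as a word character, but the hyphen does not. The sketch below uses a made-up string to show where **\w+** splits.

```python
import re

text = "user_name42 logged-in!"

# The underscore and digits stay inside the token; "-" and "!" act as separators
tokens = re.findall(r"\w+", text)

print(tokens)   # ['user_name42', 'logged', 'in']
```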


## Metacharacters

Metacharacters are characters that have a special meaning inside a pattern.

1. **(.)** matches any character (except newline character)

In [15]:
str = "rohan and ronit recently published a research paper!" 

# search for a string that starts with "ro", followed by 1 character
x = re.findall("ro.", str)

print(x)

['roh', 'ron']


In [16]:
# search for a string that starts with "ro", followed by three characters
x2 = re.findall("ro...", str)

print(x2)

['rohan', 'ronit']
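
The "except newline" part of the rule can be seen in a small sketch. The test string is a made-up example with a line break right after "ro". Passing the *re.DOTALL* flag makes the dot match the newline as well.

```python
import re

text = "ro\nhan rohan"

# By default "." does not match "\n", so the first "ro" is skipped
default = re.findall(r"ro.", text)

# With re.DOTALL the dot matches any character, including "\n"
with_dotall = re.findall(r"ro.", text, flags=re.DOTALL)

print(default)       # ['roh']
print(with_dotall)   # ['ro\n', 'roh']
```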


2. **(^)** matches at the start of the string

In [25]:
str = "Data Science"

#Check if the string starts with 'Data':
x = re.findall("^Data", str)

if (x):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
print(x)  

Yes, the string starts with 'Data'
['Data']


In [18]:
# try with a different string
str2 = "Big Data"

#Check if the string starts with 'Data':
x2 = re.findall("^Data", str2)

if (x2):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x2)  

No match
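
By default **^** only matches at the very start of the whole string. For text with several lines, the *re.MULTILINE* flag lets **^** match at the start of every line. A sketch with a made-up three-line string:

```python
import re

text = "Data Science\nBig Data\nData Engineering"

# Only the very beginning of the string counts
start_of_string = re.findall(r"^Data", text)

# With re.MULTILINE, the beginning of each line counts
start_of_lines = re.findall(r"^Data", text, flags=re.MULTILINE)

print(start_of_string)   # ['Data']
print(start_of_lines)    # ['Data', 'Data']
```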


3. **($)** matches at the end of the string

In [19]:
str = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

Yes, the string ends with 'Science'


In [20]:
str = "Big Data"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

No match
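
Using **^** and **$** together anchors the pattern at both ends, so the whole string has to match. The sketch below also shows *re.fullmatch()*, which is one way to express a similar check without writing the anchors.

```python
import re

# The pattern must cover the string from start to end
exact = re.findall(r"^Data Science$", "Data Science")
partial = re.findall(r"^Data Science$", "Big Data Science")

# fullmatch succeeds only when the entire string matches the pattern
same_with_fullmatch = re.fullmatch(r"Data Science", "Data Science") is not None

print(exact)                 # ['Data Science']
print(partial)               # []
print(same_with_fullmatch)   # True
```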


4. (*) matches zero or more occurrences of the pattern to its left.

In [21]:
str = "easy easssy eay eaty"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)

print(x)

['easy', 'easssy', 'eay']
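
For comparison, the sketch below runs the same string through two related quantifiers: **+** (one or more) and **?** (zero or one). Both are standard regular expression syntax, although they are not covered in the list above.

```python
import re

text = "easy easssy eay eaty"

zero_or_more = re.findall(r"eas*y", text)   # any number of "s", including none
one_or_more = re.findall(r"eas+y", text)    # at least one "s"
zero_or_one = re.findall(r"eas?y", text)    # at most one "s"

print(zero_or_more)   # ['easy', 'easssy', 'eay']
print(one_or_more)    # ['easy', 'easssy']
print(zero_or_one)    # ['easy', 'eay']
```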


## Sets

1. A set is a group of characters inside a pair of square brackets [ ]. It matches any single character listed in the brackets.

In [22]:
str = "Divyansh is lazy!"

#Check for the characters a, n, s, or z, in the above string
x = re.findall("[ansz]", str)

print(x)

['a', 'n', 's', 's', 'a', 'z']


In [24]:
str = "Divyansh is lazy!"

#Check for the lowercase characters between a and l, in the above string
x = re.findall("[a-l]", str)

print(x)

['i', 'a', 'h', 'i', 'l', 'a']


2. **[^]** matches any character that is not listed after the ^

In [27]:
str = "Divyansh is lazy!"

#Find every character other than a, n, z, or s

x = re.findall("[^anzs]", str)

print(x)

['D', 'i', 'v', 'y', 'h', ' ', 'i', ' ', 'l', 'y', '!']


In [26]:
str = "@Divyansh"

x = re.findall("[^@]", str)

print(x)

['D', 'i', 'v', 'y', 'a', 'n', 's', 'h']
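
Sets can list individual characters, use ranges, or combine several ranges, and a leading ^ negates the whole set. A short sketch on the same sentence:

```python
import re

text = "Divyansh is lazy!"

vowels = re.findall(r"[aeiou]", text)                   # any one of the listed characters
capitals = re.findall(r"[A-Z]", text)                   # a range of characters
not_letter_or_space = re.findall(r"[^a-zA-Z ]", text)   # negated set with two ranges and a space

print(vowels)                # ['i', 'a', 'i', 'a']
print(capitals)              # ['D']
print(not_letter_or_space)   # ['!']
```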


---
## Solve Some Queries

Let us try solving some queries that we are likely to come across while working with real-world text datasets.



### Eliminating Unwanted Terms

In [28]:
str = "@DA wnats to create  a Data Science community #DA!!"

# remove every character that is not a letter or a space
x = re.sub("[^a-zA-Z ]", "",str)

print(x)

DA wnats to create  a Data Science community DA


In [29]:
str = "@DA wnats to create  a Data Science community #DA!!"

# remove words that start with a special character
# [^a-zA-Z ] matches one character that is not a letter or a space
# \w matches any alphanumeric character or underscore
# + repeats the preceding element one or more times
x = re.sub(r"[^a-zA-Z ]\w+", "", str)

print(x)

 wnats to create  a Data Science community !!
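
Removing characters often leaves stray or doubled spaces behind, as in the outputs above. One possible follow-up step is to collapse runs of whitespace with **\s+** and trim the ends; this is a sketch of that idea, not part of the original cleaning steps.

```python
import re

text = "@DA wnats to create  a Data Science community #DA!!"

# Step 1: keep only letters and spaces
letters_only = re.sub(r"[^a-zA-Z ]", "", text)

# Step 2: collapse any run of whitespace into a single space and trim the ends
collapsed = re.sub(r"\s+", " ", letters_only).strip()

print(collapsed)
```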


### Finding Email IDs

In [30]:
str = 'Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'
  
# [a-zA-Z0-9._-]+ matches the part before @ (letters, digits, dot, underscore, hyphen)
# \w+ matches the domain name and \.com matches a literal ".com"
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', str)
  
# Printing of List 
print(x) 

['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']
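
The pattern above only finds addresses ending in ".com". The sketch below loosens the domain part so that other endings are found too. The pattern is a simplified assumption for illustration and does not cover every valid email address; the addresses in the test string are made up.

```python
import re

text = ('Send a mail to rohan.1997@gmail.com, priya@yahoo.co.in '
        'and team@example.org about the meeting @2PM')

# Local part, then "@", then a domain label followed by one or more ".xx" endings.
# (?:...) is a non-capturing group, so findall still returns the whole match.
email_pattern = r'[a-zA-Z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+'

emails = re.findall(email_pattern, text)

print(emails)
```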
