### Python Built-in Module for Regular Expressions
Python has a built-in module for working with regular expressions called `re`. Some commonly used functions from this module are:

- re.match()
- re.search()
- re.findall()

In [29]:
### re.match(pattern, string)
### The re.match function tries to match the pattern only at the beginning of the string. It returns a match object on success and None on failure.

import re

#match a word at the beginning of a string

result = re.match('Analytics',r'Analytics Vidhya is the largest data science community of India')
print(result)

<re.Match object; span=(0, 9), match='Analytics'>


In [30]:
print(result.group()) #returns the matched substring

Analytics


In [31]:
result_2 = re.match('largest',r'Analytics Vidhya is the largest data science community of India')
print(result_2)

None


In [33]:
### re.search(pattern, string)
### Matches the first occurrence of a pattern in the entire string(and not just at the beginning).

result = re.search('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result.group())

founded


In [32]:
### re.findall(pattern, string)
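### It returns all non-overlapping occurrences of the pattern in the string as a list of strings. Use it when you need every occurrence; unlike re.match() and re.search(), it returns plain strings rather than match objects, so it does not report where a match was found.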

result = re.findall('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

['founded', 'founded']
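
The three functions differ in what they return as well as in where they look. The following is a minimal sketch reusing the sentence above; the variable names (`text`, `m_match`, `m_search`, `all_hits`) are illustrative and not part of the original notebook.

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

# re.match only tries at position 0, so it finds nothing here.
m_match = re.match("founded", text)

# re.search scans the string and returns a match object for the first hit.
m_search = re.search("founded", text)

# re.findall returns a plain list of strings, without position information.
all_hits = re.findall("founded", text)

print(m_match)          # None
print(m_search.span())  # (10, 17)
print(all_hits)         # ['founded', 'founded']
```

A match object carries the position of the hit (`span()`), which the list returned by `re.findall()` does not.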


## Special Sequences in Regular Expressions

In [34]:
## \b
## \b matches at a word boundary, that is, at the beginning or at the end of a word.

# named str_ so that the built-in str type is not shadowed
str_ = r'Analytics Vidhya is the largest Analytics community of India'

#Check if there is any word that ends with "est"

x = re.findall(r"est\b", str_)
print(x)

['est']
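
The same boundary also works at the start of a word, or on both sides for whole-word matches. This is a small sketch under the assumption that the sentence above is reused; `starts` and `whole` are illustrative names.

```python
import re

text = "Analytics Vidhya is the largest Analytics community of India"

# \b before the pattern anchors it to the start of a word.
starts = re.findall(r"\bAn\w*", text)

# \b on both sides matches the whole word only, not "is" inside a longer word.
whole = re.findall(r"\bis\b", text)

print(starts)  # ['Analytics', 'Analytics']
print(whole)   # ['is']
```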


In [38]:
### \d
### \d matches any single digit (0-9), so findall returns one entry per digit in the string.

str = "2 million monthly visits in Jan'19."

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", str)   # raw string keeps the backslash intact
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['2', '1', '9']
Yes, there is at least one match!


In [41]:
### \d
### \d matches any single digit (0-9), so findall returns one entry per digit in the string.

str_ = "2 million monthly visits in Jan'19."

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", str_)   # raw string keeps the backslash intact
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['2', '1', '9']
Yes, there is at least one match!


In [42]:
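# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' matches one or more consecutive digits, so each run of digits is returned as a single string; a run ends at the first non-digit character, not only at a space

x = re.findall(r"\d+", str_)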
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")
 

['2', '19']
Yes, there is at least one match!


In [43]:
### \D
### \D matches any single character that is not a digit. It is the opposite of \d.

str_ = "2 million monthly visits in Jan'19."

#Find every character that is not a digit (0-9):

x = re.findall(r"\D", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[' ', 'm', 'i', 'l', 'l', 'i', 'o', 'n', ' ', 'm', 'o', 'n', 't', 'h', 'l', 'y', ' ', 'v', 'i', 's', 'i', 't', 's', ' ', 'i', 'n', ' ', 'J', 'a', 'n', "'", '.']
Yes, there is at least one match!


In [44]:
#Find runs of one or more consecutive non-digit characters:

x = re.findall(r"\D+", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[" million monthly visits in Jan'", '.']
Yes, there is at least one match!


In [45]:
### \w
### \w matches word characters only (letters a-z and A-Z, digits 0-9, and the underscore _ character)

str_ = "2 million monthly visits!"

#returns a match for every run of word characters (letters, digits 0-9, and the underscore _ character)

x = re.findall(r"\w+",str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")
    

['2', 'million', 'monthly', 'visits']
Yes, there is at least one match!


In [46]:
### \W
### \W matches every non-word character, that is, anything other than a letter, a digit, or the underscore. It is the opposite of \w.

str_ = "2 million monthly visits9!"

#returns a match at every NON-word character (such as "!", "?", or white-space):

x = re.findall(r"\W", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")
    

[' ', ' ', ' ', '!']
Yes, there is at least one match!


# Metacharacters in Regular Expression

In [47]:
## (.) matches any single character (except the newline character)

str_ = "rohan and rohit recently published a research paper!"

#Search for "ro" followed by a fixed number of arbitrary characters (one per dot)

x = re.findall("ro.", str_)           #matches "ro" plus exactly one more character
x2 = re.findall("ro...", str_)        #matches "ro" plus exactly three more characters

print(x)
print(x2)

['roh', 'roh']
['rohan', 'rohit']
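
The newline exception mentioned above can be seen directly. This is a minimal sketch; the test string and the names `default_hits` and `dotall_hits` are made up for illustration, and `re.DOTALL` is the standard flag that lets the dot match a newline as well.

```python
import re

text = "ro\nhan"  # "ro" followed directly by a newline

# By default the dot refuses to match the newline, so nothing is found.
default_hits = re.findall(r"ro.", text)

# With re.DOTALL the dot matches any character, including the newline.
dotall_hits = re.findall(r"ro.", text, flags=re.DOTALL)

print(default_hits)  # []
print(dotall_hits)   # ['ro\n']
```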


In [48]:
### (^) starts with
### It checks whether the string starts with the given pattern or not.

str_ = "Data Science"

#Check if the string starts with 'Data':

x = re.findall("^Data", str_)

if (x):
    print("Yes, the string starts with 'Data'")
else:
    print("No match")

# try with a different string

str2 = "Big Data"

#Check if the string starts with 'Data':

x2 = re.findall("^Data", str2)

if (x2):
    print("Yes, the string starts with 'Data'")
else:
    print("No match")
    

Yes, the string starts with 'Data'
No match


In [49]:
### ($) ends with
### It checks whether the string ends with the given pattern or not.

str_ = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", str_)

if (x):
    print("Yes, the string ends with 'Science'")
else:
    print("No match")

Yes, the string ends with 'Science'


In [50]:
### (*) matches for zero or more occurrences of the pattern to the left of it
str_ = "easy easssy eay ey"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y

x = re.findall("eas*y", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['easy', 'easssy', 'eay']
Yes, there is at least one match!


In [51]:
### (+) matches one or more occurrences of the pattern to the left of it
### Check if the string contains "ea" followed by 1 or more "s" characters and ends with y

x = re.findall("eas+y", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['easy', 'easssy']
Yes, there is at least one match!


In [52]:
### (?) matches zero or one occurrence of the pattern to the left of it.
x = re.findall("eas?y",str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['easy', 'eay']
Yes, there is at least one match!


In [53]:
### (|) either or
str_ = "Analytics Vidhya is the largest data science community of India"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['data', 'India']
Yes, there is at least one match!


In [54]:
# try with a different string

str_ = "Analytics Vidhya is one of the largest data science communities"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str_)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['data']
Yes, there is at least one match!


> End of Program

____

# Regular Expressions

A regular expression, “regex” for short, is a pattern describing a certain amount of text (Goyvaerts). 
This tool lets us filter and extract information from a text document without having to read it manually. Regex is especially helpful when tokenizing text, because the pattern defines the rules for where strings are split into tokens. The regular expressions library contains many useful tools:

## Ranges — [A-Z] [A-Za-z0-9]
Instead of typing every letter in the alphabet, we can use a range from a-z. 

The left pattern above, [A-Z], will match any single uppercase letter. The pattern on the right, [A-Za-z0-9], will match any single uppercase letter, lowercase letter, or digit.
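
As a quick illustration of both ranges, here is a minimal sketch; the sample sentence and the names `upper` and `alnum_runs` are invented for this example.

```python
import re

text = "Order A7 shipped to Zone-b on 2019-01-15"

# [A-Z] matches one uppercase letter at a time.
upper = re.findall(r"[A-Z]", text)

# [A-Za-z0-9]+ matches runs of letters and digits; "-" and spaces end a run.
alnum_runs = re.findall(r"[A-Za-z0-9]+", text)

print(upper)       # ['O', 'A', 'Z']
print(alnum_runs)  # ['Order', 'A7', 'shipped', 'to', 'Zone', 'b', 'on', '2019', '01', '15']
```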

## Character Classes — \w \W \d
Character classes are, in a way, similar to ranges. Think of them as shorthand for commonly used ranges. 

The \w class matches any word character (a letter, a digit, or the underscore), \W matches any character that is not a word character, and \d matches any digit.
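
The sketch below shows the three classes side by side and also illustrates the tokenizing use case, since `\w+` splits text into word-like tokens. The sample string and variable names are assumptions made for this example; the comparison with `[0-9]` holds for plain ASCII text such as this.

```python
import re

text = "user_42 paid $19.99!"

# \w+ yields runs of letters, digits, and underscores (a simple tokenizer).
word_tokens = re.findall(r"\w+", text)

# \W yields every character that is not a word character.
non_word = re.findall(r"\W", text)

# \d+ yields runs of digits; on ASCII text it behaves like [0-9]+.
digits = re.findall(r"\d+", text)
digits_by_range = re.findall(r"[0-9]+", text)

print(word_tokens)                # ['user_42', 'paid', '19', '99']
print(non_word)                   # [' ', ' ', '$', '.', '!']
print(digits)                     # ['42', '19', '99']
print(digits == digits_by_range)  # True
```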

## Groups — (A-Z0-9)
Groups look similar to ranges but are more specific: parentheses group a sequence of characters instead of listing alternatives. 
The pattern above, (A-Z0-9), matches only that exact literal sequence, "A-Z0-9".
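
To make the difference from ranges concrete, here is a minimal sketch. The sample strings and names (`literal`, `no_hit`, `m`) are illustrative; the last pattern, which places ranges inside groups, is an added example and not taken from the text above.

```python
import re

# Parentheses do not create a range: (A-Z0-9) matches the literal text "A-Z0-9".
literal = re.search(r"(A-Z0-9)", "code A-Z0-9 here")
no_hit = re.search(r"(A-Z0-9)", "code B7 here")

# Ranges inside groups let each group capture one part of the match.
m = re.search(r"([A-Z]+)-([0-9]+)", "ticket ABC-123 closed")

print(literal.group(1))  # A-Z0-9
print(no_hit)            # None
print(m.groups())        # ('ABC', '123')
```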

## Quantifiers — {*} {+} {?}
Quantifiers work hand in hand with groups, letting us specify how many times a group needs to appear in order to match. 

The quantifiers themselves are written without the braces used in the heading: * matches a group that occurs 0 or more times, + 1 or more times, and ? 0 or 1 times.
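
The sketch below applies each quantifier to the group (ab). It is only an illustration: the token list and the helper `matching` are hypothetical, and `re.fullmatch` is used so that a token counts only when the whole token fits the pattern.

```python
import re

tokens = ["x", "xab", "xabab"]

def matching(pattern):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

print(matching(r"x(ab)*"))  # ['x', 'xab', 'xabab']  (0 or more "ab")
print(matching(r"x(ab)+"))  # ['xab', 'xabab']       (1 or more "ab")
print(matching(r"x(ab)?"))  # ['x', 'xab']           (0 or 1 "ab")
```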

> End of Program