# Extract Information Using Regular Expressions (RegEx)

The first thing I want to start with is the notion of a raw string.

The **r** prefix is used to create a raw string. A Python raw string treats the backslash (\\) as a literal character rather than as the start of an escape sequence.



Let us see some examples!

In [2]:
# normal string vs raw string
path = "C:\desktop\nathan"  #string
print("string:",path)

string: C:\desktop
athan



In [3]:
path= r"C:\desktop\nathan"  #raw string
print("raw string:",path)

raw string: C:\desktop\nathan


So, it is recommended to use raw strings when writing regular expression patterns.
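To make the difference concrete, here is a minimal sketch comparing a normal string and a raw string, and showing why the raw form suits regex patterns. The variable names (`normal`, `raw`, `digits`) and the sample text are only illustrative.

```python
import re

normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n

print(len(normal), len(raw))

# With a raw string the backslash reaches the re module untouched,
# so the pattern below is read as the digit class \d.
digits = re.findall(r"\d", "room 42")
print(digits)
```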

Python has a built-in module to work with regular expressions called **re**. Some of the commonly used methods from the **re** module are listed below:

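1. re.match(): checks whether the pattern matches at the beginning of the string

2. re.search(): finds the first occurrence of the pattern anywhere in the string

3. re.findall(): returns all occurrences of the pattern in the string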

<br>

Let us look at each method with the help of an example.

**1. re.match()**

The re.match function checks for a match only at the beginning of the string. It returns a match object on success and None on failure.

In [4]:
import re
#match a word at the beginning of a string 

result = re.match('Analytics',r'Analytics Vidhya is the largest data science community of India')
print(result)

result_2 = re.match('largest',r'Analytics Vidhya is the largest data science community of India') 
print(result_2)

<re.Match object; span=(0, 9), match='Analytics'>
None


Since the output of re.match is a match object, we use the *group()* method of the match object to get the matched text.

In [5]:
print(result.group())  #returns the matched text

Analytics
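Because re.match returns None on failure, calling *group()* directly on a failed match raises an AttributeError. A small sketch of guarding against that is shown here; the helper name `match_at_start` is hypothetical and not part of the **re** module.

```python
import re

text = 'Analytics Vidhya is the largest data science community of India'

def match_at_start(pattern, text):
    """Hypothetical helper: return the matched text, or None if there is no match."""
    m = re.match(pattern, text)
    # Only call group() when a match object was actually returned
    return m.group() if m is not None else None

print(match_at_start('Analytics', text))
print(match_at_start('largest', text))
```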


<br>

**2. re.search()**

Returns a match object for the first occurrence of a pattern anywhere in the string, or None if the pattern is not found.

In [6]:
#search for a pattern 'founded' in a given string 

result = re.search('founded',r'Andrew NG founded Coursera.He also founded deeplearning.ai')
print(result.group())

founded
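The match object returned by re.search also records where the match was found. The sketch below reads the position with *start()* and *end()*; the variable names are illustrative.

```python
import re

sentence = 'Andrew NG founded Coursera. He also founded deeplearning.ai'

m = re.search('founded', sentence)

# re.search returns None when nothing is found, so check before using the result
if m is not None:
    print(m.group())           # the matched text
    print(m.start(), m.end())  # position of the first occurrence only
    print(sentence[m.start():m.end()])
```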


<br>

**3. re.findall()**

It returns all non-overlapping occurrences of the pattern in the string as a list. *re.findall()* can cover many uses of *re.search()* and *re.match()*, but note that it returns plain strings rather than match objects, and an empty list when nothing matches.

In [7]:
result = re.findall('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')  
print(result)

['founded', 'founded']
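As a rough comparison of the return types, the following sketch runs re.findall and re.search on the same sentence. The word 'acquired' is just an example of a pattern that does not occur.

```python
import re

sentence = 'Andrew NG founded Coursera. He also founded deeplearning.ai'

all_hits = re.findall('founded', sentence)   # list of strings, no positions
first_hit = re.search('founded', sentence)   # a single match object (or None)
no_hits = re.findall('acquired', sentence)   # empty list when nothing matches

print(all_hits)
print(first_hit.group())
print(no_hits)
```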


### Special sequences

1. **\A**	returns a match if the specified pattern is at the beginning of the string.

In [8]:
str = r'Analytics Vidhya is the largest data science community of India'

#'Vidhya' is not at the start of the string, so no match is expected
x = re.findall(r'\AVidhya', str)

print(x)

[]
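For contrast, here is a small sketch in which the pattern does sit at the beginning of the string, next to the failing case. The variable names are illustrative.

```python
import re

text = 'Analytics Vidhya is the largest data science community of India'

at_start = re.findall(r'\AAnalytics', text)   # 'Analytics' opens the string
not_at_start = re.findall(r'\AVidhya', text)  # 'Vidhya' appears later, so no match

print(at_start)
print(not_at_start)
```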



2. **\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [9]:
#checks if there is any word that ends with 'est'

x= re.findall(r'est\b',str)
print(x)

['est']


It returns the last three characters of the word "largest".
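The same special sequence can also anchor a pattern to the beginning of a word. A minimal sketch, using 'com' as an arbitrary example fragment:

```python
import re

text = 'Analytics Vidhya is the largest data science community of India'

starts_with_com = re.findall(r'\bcom', text)  # 'com' at the start of a word
ends_with_com = re.findall(r'com\b', text)    # 'com' at the end of a word

print(starts_with_com)
print(ends_with_com)
```

Only "community" begins with 'com', and no word in the sentence ends with it, so the second list is empty.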

3. **\B**	returns a match where the specified pattern is present, but NOT at the beginning (or at the end) of a word.

In [10]:
str = r'Analytics Vidhya is the largest data science community of India'

x = re.findall(r"\Ben", str)

print(x)

['en']


4. **\d** matches any digit (numbers from 0-9)

In [14]:
str = "2 million monthly visits in Jan'19."

#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', '1', '9']
Yes, there is at least one match!



Note that **\d** matches one digit at a time. With the + quantifier, **\d+** matches one or more consecutive occurrences of **\d**, continuing until a non-matching character is found, and returns the whole run as a single match.
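A short sketch of the difference between the two patterns, reusing the sentence from the cell above (the variable names are illustrative):

```python
import re

text = "2 million monthly visits in Jan'19."

single_digits = re.findall(r"\d", text)   # one match per digit
digit_runs = re.findall(r"\d+", text)     # one match per run of consecutive digits

print(single_digits)
print(digit_runs)
```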

5. **\D** matches any character that is not a digit.

In [15]:
str = "2 million monthly visits in Jan'19."

#Return every character that is not a digit (numbers from 0-9):
x = re.findall(r"\D", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', 'm', 'i', 'l', 'l', 'i', 'o', 'n', ' ', 'm', 'o', 'n', 't', 'h', 'l', 'y', ' ', 'v', 'i', 's', 'i', 't', 's', ' ', 'i', 'n', ' ', 'J', 'a', 'n', "'", '.']
Yes, there is at least one match!



In [16]:
str = "2 million monthly visits'19"

#Return each run of consecutive non-digit characters:

x = re.findall(r"\D+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[" million monthly visits'"]
Yes, there is at least one match!



6. **\w** matches any word character (letters from a-z and A-Z, digits from 0-9, and the underscore _ character)

In [17]:
str = "2 million monthly visits!"

#returns a match at every word character (letters a-z and A-Z, digits from 0-9, and the underscore _ character)

x = re.findall(r"\w",str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', 'm', 'i', 'l', 'l', 'i', 'o', 'n', 'm', 'o', 'n', 't', 'h', 'l', 'y', 'v', 'i', 's', 'i', 't', 's']
Yes, there is at least one match!
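Combined with the + quantifier, the same sequence extracts whole words instead of single characters. A minimal sketch, with illustrative variable names:

```python
import re

text = "2 million monthly visits!"

chars = re.findall(r"\w", text)    # one match per word character
words = re.findall(r"\w+", text)   # one match per run of word characters

print(words)
```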




7. **\W** matches every non-word character, that is, anything other than letters, digits, and the underscore.

In [19]:
str = "2 million monthly visits9!"

#returns a match at every NON word character (anything other than letters, digits, and underscore, such as "!", "?", white-space etc.):

x = re.findall(r"\W", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ', '!']
Yes, there is at least one match!
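Since the underscore counts as a word character, **\W** skips it while still catching punctuation and spaces. The sample string below is made up purely to illustrate this.

```python
import re

text = "snake_case, kebab-case!"

non_word = re.findall(r"\W", text)  # the underscore is NOT returned

print(non_word)
```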



## Metacharacters

Metacharacters are characters that have a special meaning inside a pattern.

1. **(.)** matches any character (except newline character)
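A small sketch of the dot in action, assuming default flags; the sample string is invented and contains a newline to show the one character the dot skips.

```python
import re

text = "cat cot c\nt cut"

# 'c', then any single character except a newline, then 't'
hits = re.findall(r"c.t", text)

print(hits)
```

The sequence "c\nt" is not returned because, by default, the dot does not match a newline.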