# Extract Information Using Regular Expressions (RegEx)

The first thing I want to introduce is the notion of a raw string.

The **r** prefix creates a raw string. A Python raw string treats the backslash (\\) as a literal character instead of the start of an escape sequence.



Let us see some examples!

In [2]:
# normal string vs raw string
path = "C:\desktop\nathan"  #string
print("string:",path)

string: C:\desktop
athan




In [3]:
path= r"C:\desktop\nathan"  #raw string
print("raw string:",path)

raw string: C:\desktop\nathan


So, it is always recommended to use raw strings while dealing with regular expressions. 
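
As a quick check of the difference, the sketch below compares a normal string and a raw string written with the same characters. The variable names are illustrative only.

```python
# "\n" in a normal string is a single newline character;
# in a raw string it stays as two characters: a backslash and the letter n.
normal = "a\nb"
raw = r"a\nb"

print(len(normal))      # 3
print(len(raw))         # 4
print(raw == "a\\nb")   # True: the raw string equals the escaped form
```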

Python has a built-in module to work with regular expressions called **re**. Some of the commonly used methods from the **re** module are listed below:

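1. re.match(): checks for a match only at the beginning of the string.

2. re.search(): finds the first occurrence of the pattern anywhere in the string.

3. re.findall(): returns all occurrences of the pattern as a list.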

<br>

Let us look at each method with the help of an example.

**1. re.match()**

The re.match function looks for the pattern only at the beginning of the string. It returns a match object on success and `None` on failure.

In [35]:
import re
#match a word at the beginning of a string 

result = re.match('Analytics',r'Analytics Vidhya is the largest data science community of India')
print(result)

result_2 = re.match('largest',r'Analytics Vidhya is the largest data science community of India') 
print(result_2)

<re.Match object; span=(0, 9), match='Analytics'>
None


Since the output of re.match is a match object, we use the *group()* method of the match object to get the matched text.

In [5]:
print(result.group())  # returns the text matched by the whole pattern

Analytics
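
Because re.match returns `None` when nothing matches, calling *group()* on a failed match raises an `AttributeError`. One way to guard against that is sketched below; the helper name `match_at_start` is hypothetical and not part of the **re** module.

```python
import re

text = "Analytics Vidhya is the largest data science community of India"

def match_at_start(pattern, text):
    """Return the matched text if `pattern` matches at the start of `text`, else None."""
    m = re.match(pattern, text)
    # Only call group() when a match object was actually returned
    return m.group() if m is not None else None

hit = match_at_start("Analytics", text)
miss = match_at_start("largest", text)

print(hit)   # Analytics
print(miss)  # None
```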


<br>

**2. re.search()**

Finds the first occurrence of a pattern anywhere in the string.

In [6]:
#search for a pattern 'founded' in a given string 

result = re.search('founded',r'Andrew NG founded Coursera.He also founded deeplearning.ai')
print(result.group())

founded
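
The match object returned by re.search also records where the match was found. The short sketch below reads the position of the first occurrence in the same sentence; only the first "founded" is reported, even though the word appears twice.

```python
import re

text = "Andrew NG founded Coursera.He also founded deeplearning.ai"

m = re.search("founded", text)

print(m.group())  # founded
print(m.start())  # 10 (index of the first matched character)
print(m.end())    # 17 (index just past the last matched character)
print(m.span())   # (10, 17)
```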


<br>

**3. re.findall()**

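It returns all non-overlapping occurrences of the pattern in the string as a list. I recommend *re.findall()* as a general-purpose choice for extraction, since with a suitable pattern it can cover the use cases of both *re.search()* and *re.match()*. Note that it returns strings rather than match objects.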

In [7]:
result = re.findall('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')  
print(result)

['founded', 'founded']
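
To make the differences between the three functions concrete, the sketch below runs the same pattern through each of them. It also uses `re.finditer`, which this notebook does not otherwise cover, as one way to get positions for every occurrence.

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

at_start = re.match("founded", text)      # None: the text does not begin with 'founded'
first = re.search("founded", text)        # match object for the first occurrence only
all_hits = re.findall("founded", text)    # list of matched strings, without positions

# re.finditer yields one match object per occurrence, so positions are available
starts = [m.start() for m in re.finditer("founded", text)]

print(at_start)       # None
print(first.span())   # (10, 17)
print(all_hits)       # ['founded', 'founded']
print(starts)         # [10, 36]
```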


## Special Sequences

1. **\A** returns a match if the specified pattern is at the beginning of the string.

In [8]:
str = r'Analytics Vidhya is the largest data science community of India'

x = re.findall(r'\AVidhya', str)

print(x)

[]




2. **\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [9]:
# check if any word ends with 'est'

x= re.findall(r'est\b',str)
print(x)

['est']


It returns the last three characters of the word "largest".
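
The position of **\b** in the pattern decides which side of the word is checked. The following sketch, using a made-up sentence, counts matches for the same letters with the boundary placed at the end, at the start, and on both sides.

```python
import re

text = "this is his list"

anywhere = re.findall(r"is", text)        # every 'is', inside or outside words
at_end = re.findall(r"is\b", text)        # 'is' only where a word ends
at_start = re.findall(r"\bis", text)      # 'is' only where a word begins
whole_word = re.findall(r"\bis\b", text)  # 'is' only as a complete word

print(len(anywhere), len(at_end), len(at_start), len(whole_word))  # 4 3 1 1
```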

3. **\B** returns a match where the specified pattern is present, but NOT at the beginning (or at the end) of a word.

In [10]:
str = r'Analytics Vidhya is the largest data science community of India'

x = re.findall(r"\Ben", str)

print(x)

['en']


4. **\d** returns a match where the string contains digits (numbers from 0-9)

In [11]:
str = "2 million monthly visits in Jan'19."

# Find every digit (0-9) in the string:
x = re.findall(r"\d", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', '1', '9']
Yes, there is at least one match!




Note that **\d+** matches one or more consecutive occurrences of **\d** until a non-matching character is found, whereas **\d** matches one digit at a time.
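
The difference between **\d** and **\d+** is easiest to see side by side. This small sketch applies both patterns to the same string used above.

```python
import re

text = "2 million monthly visits in Jan'19."

single_digits = re.findall(r"\d", text)   # one list entry per digit
digit_runs = re.findall(r"\d+", text)     # one list entry per run of consecutive digits

print(single_digits)  # ['2', '1', '9']
print(digit_runs)     # ['2', '19']
```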

5. **\D** returns a match at every character that is not a digit.

In [12]:
str = "2 million monthly visits in Jan'19."

# Find every character that is not a digit (0-9):
x = re.findall(r"\D", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', 'm', 'i', 'l', 'l', 'i', 'o', 'n', ' ', 'm', 'o', 'n', 't', 'h', 'l', 'y', ' ', 'v', 'i', 's', 'i', 't', 's', ' ', 'i', 'n', ' ', 'J', 'a', 'n', "'", '.']
Yes, there is at least one match!




In [13]:
str = "2 million monthly visits'19"

# Find every run of consecutive non-digit characters:

x = re.findall(r"\D+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[" million monthly visits'"]
Yes, there is at least one match!




6. **\w** matches word characters only (letters a to z and A to Z, digits from 0-9, and the underscore _ character)

In [14]:
str = "2 million monthly visits!"

# returns a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _ character)

x = re.findall(r"\w", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', 'm', 'i', 'l', 'l', 'i', 'o', 'n', 'm', 'o', 'n', 't', 'h', 'l', 'y', 'v', 'i', 's', 'i', 't', 's']
Yes, there is at least one match!



7. **\W** returns a match at every non-word character (anything that **\w** does not match).

In [16]:
str = "2 million monthly visits9!"

# returns a match at every NON-word character (anything other than letters, digits, and the underscore, such as "!", "?", or whitespace):

x = re.findall(r"\W", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ', '!']
Yes, there is at least one match!




## Metacharacters

Metacharacters are characters with a special meaning inside a pattern.

1. **(.)** matches any character (except newline character)

In [17]:
str = "rohan and rohit recently published a research paper!" 

# Search for "ro" followed by any one character, and for "ro" followed by any three characters

x = re.findall("ro.", str)
x2 = re.findall("ro...", str)

print(x)
print(x2)

['roh', 'roh']
['rohan', 'rohit']
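
The "except newline" part of the dot's behaviour can be checked directly. The sketch below uses a made-up string containing a newline, and the `re.DOTALL` flag, which makes the dot match newlines as well.

```python
import re

text = "a\nb a-b"

default = re.findall("a.b", text)             # the dot skips the newline
dotall = re.findall("a.b", text, re.DOTALL)   # with DOTALL the dot matches it too

print(default)  # ['a-b']
print(dotall)   # ['a\nb', 'a-b']
```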


2. **(^)** starts with

In [18]:
str = "Data Science"

#Check if the string starts with 'Data':
x = re.findall("^Data", str)

if (x):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x)  

Yes, the string starts with 'Data'


In [19]:
# try with a different string
str2 = "Big Data"

#Check if the string starts with 'Data':
x2 = re.findall("^Data", str2)

if (x2):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x2)  

No match


3. **($)** ends with

In [20]:
str = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

Yes, the string ends with 'Science'


4. **(\*)** matches zero or more occurrences of the pattern to the left of it

In [21]:
str = "easy easssy eay ey"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['easy', 'easssy', 'eay']
Yes, there is at least one match!


5. **(+)** matches one or more occurrences of the pattern to the left of it

In [22]:
#Check if the string contains "ea" followed by 1 or more "s" characters and ends with y 
x = re.findall("eas+y", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['easy', 'easssy']
Yes, there is at least one match!


6. **(?)** matches zero or one occurrence of the pattern to the left of it.

In [23]:
x = re.findall("eas?y",str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['easy', 'eay']
Yes, there is at least one match!


7. **(|)** either or

In [24]:
str = "Analytics Vidhya is the largest data science community of India"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['data', 'India']
Yes, there is at least one match!


In [25]:
# try with a different string
str = "Analytics Vidhya is one of the largest data science communities"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['data']
Yes, there is at least one match!


## Sets

1. A set is a bunch of characters inside a pair of square brackets [ ] with a special meaning.

In [26]:
str = 'Analytics Vidhya is the largest data science community of India'

#check for the characters y,d,or h, in the above string 
x = re.findall('[ydh]', str)

print(x)

if(x):
    print('Yes, there is at least one match')
else:
    print('No match')

['y', 'd', 'h', 'y', 'h', 'd', 'y', 'd']
Yes, there is at least one match


In [28]:
str = 'Analytics Vidhya is the largest data science community in India'

# check for the characters between a and g in the above string

x= re.findall('[a-g]', str)
print(x)

if(x):
    print('Yes, there is at least one match')
else:
    print('No match')

['a', 'c', 'd', 'a', 'e', 'a', 'g', 'e', 'd', 'a', 'a', 'c', 'e', 'c', 'e', 'c', 'd', 'a']
Yes, there is at least one match


<br>

Let's solve a problem.

In [None]:
str = "Mars' average distance from the Sun is roughly 230 million km and its orbital period is 687 (Earth) days."

# extract the numbers that start with a digit from 0 to 4 in the above string
x = re.findall(r"\b[0-4]\d+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")
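
The cell above is shown without output. As a sketch of what the pattern does, the code below runs it on the same sentence and on a second, made-up sentence chosen to show why the leading **\b** matters. Note that a single-digit number such as "3" is skipped, because **\d+** requires at least one more digit after the first.

```python
import re

mars = "Mars' average distance from the Sun is roughly 230 million km and its orbital period is 687 (Earth) days."
orders = "Orders 230, 5230 and 41 arrived; 3 are pending."  # illustrative text

mars_hits = re.findall(r"\b[0-4]\d+", mars)

# With \b the number must START with 0-4; without it, the tail of 5230 also matches
with_boundary = re.findall(r"\b[0-4]\d+", orders)
without_boundary = re.findall(r"[0-4]\d+", orders)

print(mars_hits)         # ['230']
print(with_boundary)     # ['230', '41']
print(without_boundary)  # ['230', '230', '41']
```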

2. **[^]** matches any character other than the ones listed after ^

In [29]:
str = "Analytics Vidhya is the largest data sciece community of India"

# Find every character other than y, d, or h

x = re.findall("[^ydh]", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['A', 'n', 'a', 'l', 't', 'i', 'c', 's', ' ', 'V', 'i', 'a', ' ', 'i', 's', ' ', 't', 'e', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'c', 'e', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', ' ', 'o', 'f', ' ', 'I', 'n', 'i', 'a']
Yes, there is at least one match!


3. **[a-zA-Z0-9]** matches any single alphanumeric character. With a leading ^ inside the brackets, it matches any character that is not alphanumeric.

In [33]:
str = "@AV Largest Data Science community #AV!!"

# extract words that start with a special character
x = re.findall(r"[^a-zA-Z0-9 ]\w+", str)

print(x)

['@AV', '#AV']




## Solve Complex Queries

Let us try solving some complex queries using regex.

### Extracting Email IDs

In [31]:
str = 'Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'
  
# \w matches any alphanumeric character or underscore
# + repeats the preceding element one or more times
# A simpler pattern such as r'\w+@\w+\.com' would cut off 'rohan.' because \w does not match a dot
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', str)
  
# Printing of List 
print(x) 

['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']
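
The pattern above only finds addresses ending in ".com". A possible generalisation to other endings is sketched below. The sample addresses are made up, and the pattern is a deliberate simplification that does not cover every valid email address.

```python
import re

text = "Contact a.b@example.org, c_d@mail.co.in or e@site.com today"

# local part, '@', a domain label, then one or more '.label' parts of 2+ letters.
# (?:...) is a non-capturing group, so findall still returns the whole match.
pattern = r"[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z]{2,})+"

emails = re.findall(pattern, text)
print(emails)  # ['a.b@example.org', 'c_d@mail.co.in', 'e@site.com']
```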




### Extracting Dates

In [37]:
text = "London Olympic 2012 was held from 2012-07-27 to 2012/08/12."

# '\d{4}' repeats '\d' exactly 4 times; '.' matches any single separator character
match = re.findall(r"\d{4}.\d{2}.\d{2}", text)
print(match)

['2012-07-27', '2012/08/12']
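
Since "." matches any character, the date pattern above can also match a plain run of digits. The sketch below, using a made-up string, compares it with a stricter variant that only accepts "-" or "/" as separators.

```python
import re

text = "ID 2012107127 and 2012-07-27"

loose = re.findall(r"\d{4}.\d{2}.\d{2}", text)          # '.' accepts digits as separators
strict = re.findall(r"\d{4}[-/]\d{2}[-/]\d{2}", text)   # only '-' or '/' are accepted

print(loose)   # ['2012107127', '2012-07-27']
print(strict)  # ['2012-07-27']
```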




In [38]:
text="London Olympic 2012 was held from 27 Jul 2012 to 12-Aug-2012."

match = re.findall(r'\d{2}.\w{3}.\d{4}', text)

print(match)

['27 Jul 2012', '12-Aug-2012']




In [39]:
# extract dates with varying lengths
text="London Olympic 2012 was held from 27 July 2012 to 12 August 2012."

# '\w{3,10}' repeats '\w' 3 to 10 times
match = re.findall(r'\d{2}.\w{3,10}.\d{4}', text)

print(match)

['27 July 2012', '12 August 2012']
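
When the day, month, and year are needed separately, parentheses can be added around each part. With capture groups in the pattern, *re.findall()* returns one tuple per match instead of the full string. A minimal sketch on the same text:

```python
import re

text = "London Olympic 2012 was held from 27 July 2012 to 12 August 2012."

# Each pair of parentheses captures one part of the date
parts = re.findall(r"(\d{2}) (\w{3,10}) (\d{4})", text)

print(parts)  # [('27', 'July', '2012'), ('12', 'August', '2012')]

for day, month, year in parts:
    print(year, month, day)
```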




## Extracting Title from Names - Titanic Dataset

In [40]:
import pandas as pd

# load dataset
data=pd.read_csv("titanic.csv")



FileNotFoundError: [Errno 2] No such file or directory: 'titanic.csv'
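
The cell above stops at loading the file, so the extraction step itself is not shown. As a rough sketch of the idea, the code below applies a regular expression to a few names written in a "Surname, Title. Given names" layout. The sample names, the pattern, and the assumption that the dataset's names follow this layout are illustrative and not taken from this notebook.

```python
import re

# Hypothetical sample names in "Surname, Title. Given names" form
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

# A comma, optional whitespace, then letters (captured) followed by a literal dot
title_pattern = re.compile(r",\s*([A-Za-z]+)\.")

titles = []
for name in names:
    m = title_pattern.search(name)
    # Keep None when a name does not follow the expected layout
    titles.append(m.group(1) if m else None)

print(titles)  # ['Mr', 'Mrs', 'Miss']
```

With the dataset loaded in pandas, the same pattern could presumably be applied to the name column through `Series.str.extract`; the column name would need to be checked against the actual file.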