## Text Extraction using RegEx
Regular expressions are a powerful tool for matching and extracting character patterns from text.

In [1]:
import re

In [2]:
sen = "Machine learning is a growing technology which enables computers to \
       learn automatically from past data. We make machine to do specific task \
       without explicitly programming. Machine learning uses various algorithms \
       for building mathematical models and making predictions using historical data or information."

#### findall()
findall() returns a list of all non-overlapping matches of the given pattern in the string.

In [3]:
re.findall('Machine', sen)
# return a list of all the occurrences of the word 'Machine' in our string

['Machine', 'Machine']

In [4]:
re.findall('Machine', sen, flags=re.IGNORECASE)
# re.IGNORECASE - ignores the case while performing the search.

['Machine', 'machine', 'Machine']

In [5]:
# Search for multiple patterns with | (the spaces around | are matched literally)
re.findall('technology | learning', sen)

[' learning', 'technology ', ' learning']

In [6]:
re.findall('resource | learning | from', sen)

[' learning ', ' from', ' learning ']
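If the stray spaces in the matches above are unwanted, one option is to drop them from the pattern and anchor each alternative with `\b` instead. The following is a minimal sketch under that assumption; the `sample` string and variable names are illustrative, not part of the original notebook.

```python
import re

sample = "Machine learning is a growing technology which learns from data."

# Spaces inside the pattern are literal, so they appear in the matches.
spaced = re.findall('technology | learning', sample)

# \b anchors each alternative at a word boundary instead of relying on spaces.
whole_words = re.findall(r'\b(?:technology|learning)\b', sample)

print(spaced)
print(whole_words)
```

The non-capturing group `(?:...)` keeps `findall` returning the full match rather than a group.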

In [7]:
# Extract only alphabetic words (digits and punctuation are skipped)
sen1 = "Her age is 20. She learned driving at 18. I taught her."
pattern = '[a-z]+'
# [a-z] matches lowercase letters only, so 'Her' yields 'er' and 'I' is skipped.
# The + character matches 1 or more repetitions of the preceding expression.
re.findall(pattern, sen1)

['er', 'age', 'is', 'he', 'learned', 'driving', 'at', 'taught', 'her']

In [8]:
re.findall(pattern, sen1, flags=re.IGNORECASE)

['Her', 'age', 'is', 'She', 'learned', 'driving', 'at', 'I', 'taught', 'her']

In [9]:
# Same result without IGNORECASE: list both cases in the character class
pattern = '[a-zA-Z]+'
re.findall(pattern, sen1)

['Her', 'age', 'is', 'She', 'learned', 'driving', 'at', 'I', 'taught', 'her']
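When the same pattern is applied several times, as in the cells above, it can be compiled once with `re.compile` and reused. This is a small sketch of that idea; the names `word_re`, `words`, and `first_word` are illustrative.

```python
import re

sen1 = "Her age is 20. She learned driving at 18. I taught her."

# Compile once with the flag attached, then reuse the pattern object.
word_re = re.compile(r'[a-z]+', flags=re.IGNORECASE)

words = word_re.findall(sen1)              # all alphabetic runs, any case
first_word = word_re.search(sen1).group()  # text of the first match only

print(words)
print(first_word)
```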

In [10]:
# Search for a specific pattern: @mentions
sen2 = "@Harish please complete the task. @Navin review it."

In [11]:
# The parentheses form a capture group, so findall returns only the name after '@'
pattern = '@([a-zA-Z]+)'
re.findall(pattern, sen2)

['Harish', 'Navin']

In [12]:
pattern = '@[a-zA-Z]+'
re.findall(pattern, sen2)

['@Harish', '@Navin']
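The difference between the two results comes from how `findall` treats capture groups: with no group it returns the whole match, with one group it returns the group's text, and with several groups it returns a tuple per match. A short sketch, using the same sentence and illustrative variable names:

```python
import re

sample = "@Harish please complete the task. @Navin review it."

# No group: the whole match is returned.
full = re.findall(r'@[a-zA-Z]+', sample)

# One group: only the captured text is returned.
names = re.findall(r'@([a-zA-Z]+)', sample)

# Two groups: one tuple per match, one item per group.
pairs = re.findall(r'(@)([a-zA-Z]+)', sample)

print(full)
print(names)
print(pairs)
```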

In [13]:
## Search for email addresses in the text data
emailtext = 'ok take this to the next level from a google@datamites.com productivity standpoint zed each info@ernet.com time you send an infoi@gmail.com email'

In [14]:
# \. matches a literal dot; an unescaped . would match any character
re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+', emailtext)

['google@datamites.com', 'info@ernet.com', 'infoi@gmail.com']

In [15]:
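# search() returns a Match object for the first occurrence only (raw string avoids escape warnings)
re.search(r'\S+@\S+', emailtext)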

<re.Match object; span=(38, 58), match='google@datamites.com'>

In [16]:
re.findall(r'\S+@\S+', emailtext)

['google@datamites.com', 'info@ernet.com', 'infoi@gmail.com']
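`re.search` returns a Match object rather than a list, and the `\S+@\S+` pattern accepts any non-whitespace characters, including trailing punctuation. The sketch below illustrates both points on a made-up `sample` string (an assumption, chosen so that an address is followed by a period).

```python
import re

sample = "Write to ask@web.com. Or try hi5@connect.com"

# search() stops at the first match; group() and span() expose its details.
first = re.search(r'\S+@\S+', sample)
first_text = first.group()
first_span = first.span()

# \S+ also swallows the sentence-ending period after the first address.
loose = re.findall(r'\S+@\S+', sample)

# A stricter pattern limits each part to letters and digits around a literal dot.
strict = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+', sample)

print(first_text, first_span)
print(loose)
print(strict)
```

Neither pattern validates addresses fully; they are only rough extractors for text like this.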

In [17]:
## searching numbers in text data
text = "This is session on 23 of this 68685 month"

In [18]:
re.findall(r'\d', text)  ## single digits

['2', '3', '6', '8', '6', '8', '5']

In [19]:
re.findall(r'\d+',text)

['23', '68685']

In [20]:
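## Search for strings of numbers separated by a special character (in this case, IP addresses)
# \b matches a word boundary; (?:[0-9]{1,3}\.){3} matches three blocks of 1-3 digits, each followed by '.';
# [0-9]{1,3} matches the last block, which has no trailing '.'
# The current text contains no IP address, so the result is empty.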
s = re.findall(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b', text)
s

[]
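The empty list is expected, because that text holds only plain numbers. To see the pattern match, here is a minimal sketch on two illustrative strings (`no_ip` and `with_ip` are assumptions made for this example).

```python
import re

ip_pattern = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'

no_ip = "This is session on 23 of this 68685 month"
with_ip = "12.45.65.78 is flagged, and so is 999.999.999.999"

# Plain numbers do not have the four dot-separated blocks the pattern needs.
found_none = re.findall(ip_pattern, no_ip)

# Both dotted quads match; the pattern checks shape, not the 0-255 range.
found_ips = re.findall(ip_pattern, with_ip)

print(found_none)
print(found_ips)
```

Because each block is only checked for 1 to 3 digits, values such as 999 pass; range validation would need an extra step.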

In [21]:
text = 'The cyber security has become one of the most important ascept of the business. \
$550 million has been invested research. 12.45.65.78 is one the most spammed inject ips. \
ask@web.com. $600 million wasted. save safe. sale. hi5@connect.com'

In [22]:
re.findall('sa.e',text)

['save', 'safe', 'sale']

In [23]:
re.findall('safe|saze',text)

['safe']
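The `.` in `sa.e` matches any single character except a newline, while `|` lists exact alternatives. A character class sits between the two: it allows one character from a chosen set. The sketch below compares them on an illustrative string.

```python
import re

sample = "save safe sale same sa-e"

# . accepts any character in the third position, including '-'.
any_char = re.findall(r'sa.e', sample)

# [vfl] accepts only one of the listed letters.
char_class = re.findall(r'sa[vfl]e', sample)

# | matches only the alternatives spelled out in full.
alternation = re.findall(r'safe|sale', sample)

print(any_char)
print(char_class)
print(alternation)
```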

In [24]:
# $ is a special character (end of string), so it is escaped to match a literal dollar sign
re.findall(r'\$550', text)

['$550']

In [25]:
re.findall(r'\d+', text)

['550', '12', '45', '65', '78', '600', '5']
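`\d+` picks up every run of digits, so the IP address is split into blocks and the 5 in the email address is included. If only the dollar amounts are wanted, one possible approach is to require a literal `$` before the digits and capture the number. A sketch on a shortened, illustrative version of the text:

```python
import re

sample = ("$550 million has been invested. 12.45.65.78 is an ip. "
          "$600 million wasted. hi5@connect.com")

# Every digit run, regardless of context.
all_digits = re.findall(r'\d+', sample)

# Only digits that directly follow a literal '$'; the group drops the '$' itself.
amounts = re.findall(r'\$(\d+)', sample)

print(all_digits)
print(amounts)
```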

In [26]:
re.findall('[a-z]+',text)

['he',
 'cyber',
 'security',
 'has',
 'become',
 'one',
 'of',
 'the',
 'most',
 'important',
 'ascept',
 'of',
 'the',
 'business',
 'million',
 'has',
 'been',
 'invested',
 'research',
 'is',
 'one',
 'the',
 'most',
 'spammed',
 'inject',
 'ips',
 'ask',
 'web',
 'com',
 'million',
 'wasted',
 'save',
 'safe',
 'sale',
 'hi',
 'connect',
 'com']

In [27]:
# \. matches a literal dot before 'com'
re.findall(r'[a-z0-9]+@[a-z0-9]+\.com', text)

['ask@web.com', 'hi5@connect.com']
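In an email pattern, the dot before `com` should be written as `\.`, because an unescaped `.` matches any character. The sketch below shows the difference on an illustrative string containing a malformed address (`hi5@connectXcom` is made up for this example).

```python
import re

sample = "ask@web.com hi5@connectXcom"

# Unescaped '.' accepts the 'X', so the malformed address slips through.
unescaped = re.findall(r'[a-z0-9]+@[a-z0-9]+.com', sample)

# Escaped '\.' requires a real dot before 'com'.
escaped = re.findall(r'[a-z0-9]+@[a-z0-9]+\.com', sample)

print(unescaped)
print(escaped)
```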