### Natural Language Processing

#### Regular expressions

A regular expression (regex) is a sequence of characters that defines a search pattern for matching text.

- re: Python's built-in module for working with regular expressions
- Functions:
     - match: checks for a match only at the beginning of the string
     - search: checks for a match anywhere in the string (this is what Perl does by default)
     - fullmatch: checks whether the entire string matches the pattern
     - findall: returns a list of all non-overlapping matches
     - split: returns a list of the pieces of the string, split at each match
     - sub: replaces every match with a replacement string (use count to limit how many)
    
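- Character sets:
    - [  ]	Any one character from the set in the brackets
    - [^ ]	Any one character not in the set in the brackets
    - .	Any character except a newline
- Quantifiers:
    - '*'	Zero or more occurrences
    - '+'	One or more occurrences
    - ?	Zero or one occurrence
    - {n}	Exactly n occurrences; {m,n} allows between m and n
- Non-Greedy Match:
    - *?, +?, ??	Match as few characters as possible (the plain forms are greedy and match as many as possible)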
  
- Anchors:
    - ^	Starts with	
    - $	Ends with
    
- Capture and group:
    - ()	Groups part of a pattern and captures the text it matched
- Others:
    - |	Either or
    - \	Signals a special sequence (\w - any word character: letter, digit, or underscore; \d - digit; \D - non-digit; \s - whitespace)
    
- More Information: https://docs.python.org/3/library/re.html
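
The difference between greedy and non-greedy quantifiers is easiest to see side by side. The snippet below is a minimal sketch; the sample strings are made up for illustration and are not part of the original notebook.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative sample string

# Greedy: .* grabs as much as possible, so the match runs to the last '>'
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' that completes a match
lazy = re.findall(r"<.*?>", html)

# {m,n} bounds the repetition count: here, runs of 2 to 3 digits
digits = re.findall(r"\d{2,3}", "7 42 123 98765")

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(digits)   # ['42', '123', '987', '65']
```

Note that the five-digit run is consumed in chunks: the first match takes three digits, and the remaining two form a second match.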

In [1]:
import re

In [31]:
#Find occurrence of characters

text = "It rained in Spain"

#Only first instance
x = re.search("ai", text)
print(x)

#All instances 
X = re.findall("ai", text)
print(X)

<re.Match object; span=(4, 6), match='ai'>
['ai', 'ai']
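
For comparison, here is a short sketch of how match, fullmatch, and split behave on the same sentence. The patterns passed to fullmatch and split are chosen only for illustration.

```python
import re

text = "It rained in Spain"

at_start = re.match("ai", text)         # None: "ai" does not occur at position 0
anywhere = re.search("ai", text)        # finds the first "ai", inside "rained"
whole = re.fullmatch(r"[\w ]+", text)   # succeeds only if the pattern covers the whole string
parts = re.split(r"\s+", text)          # split on runs of whitespace

print(at_start)            # None
print(anywhere.span())     # (4, 6)
print(whole is not None)   # True
print(parts)               # ['It', 'rained', 'in', 'Spain']
```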


In [36]:
#Ignore case 

a = re.search('[a-z]+', 'PYTHON is great!')

b = re.search('[a-z]+', 'PYTHON is great!', re.I)   # re.I (re.IGNORECASE) makes the match case-insensitive


print("a = ", a)
print("b = ", b)

a =  <re.Match object; span=(7, 9), match='is'>
b =  <re.Match object; span=(0, 6), match='PYTHON'>


In [3]:
#Find occurrence of dates

text = '''My name is Sachin. I was born on 07/12/1980 in Mumbai.
I graduated from University of Georgia on 07-28-2019.'''

#The regex pattern to extract dates
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"   # MM, DD, YYYY separated by "/" or "-"

#Will return all the strings that are matched
dates = re.findall(pattern, text)

print(dates)

['07/12/1980', '07-28-2019']


In [4]:
# Finding all Adverbs in a text (ends with ly)

text = "He was carefully disguised but captured quickly by police."
pattern = r"\w+ly\b"   # one or more word characters ending in "ly" at a word boundary

re.findall(pattern, text)

['carefully', 'quickly']

In [39]:
# Capturing all hashtag in a tweet

text = "#AI is a powerful #technology that can be utilized in any #industry."

pattern = r'#\S+'            # '#' followed by any run of non-whitespace characters, so trailing punctuation is included
    
re.findall(pattern, text)

['#AI', '#technology', '#industry.']
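
The last hashtag above keeps its trailing period because \S matches any non-whitespace character. One way to avoid that, sketched here as an alternative rather than as the notebook's own solution, is to use \w, which stops at punctuation.

```python
import re

text = "#AI is a powerful #technology that can be utilized in any #industry."

loose = re.findall(r'#\S+', text)    # \S+ takes every non-whitespace character
strict = re.findall(r'#\w+', text)   # \w+ takes only letters, digits, and underscores

print(loose)    # ['#AI', '#technology', '#industry.']
print(strict)   # ['#AI', '#technology', '#industry']
```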

In [7]:
# Try it out yourself
text = 'sun #plants #!wood% ##arebaba#tey   travel#blessed    #weed das#$#!@D!AAAA'

pattern = r'#+[a-zA-Z0-9]+'   # one possible answer: one or more '#', then one or more letters or digits
re.findall(pattern, text)

['#plants', '##arebaba', '#tey', '#blessed', '#weed']

In [48]:
#Find occurrence of emails
## re.findall() returns a list of all the found email strings

text = 'purple alice-b@google.com, blah monkey bob@fb.com dishwasher'

pattern = r'[\w.-]+@[\w.-]+'   # word characters, dots, or hyphens on both sides of "@"

emails = re.findall(pattern, text) 
for email in emails:
    print(email)

alice-b@google.com
bob@fb.com


In [25]:
# Find occurrence of emails
## create groups to extract parts of a matched string.

text = 'purple alice-b@google.com monkey dishwasher'

pattern = r'([\w.-]+)@([\w.-]+)'    # () creates groups

match = re.search(pattern, text)        
if match:
    print(match.group())   ## (the whole match)
    print(match.group(1))  ## (the username, group 1)
    print(match.group(2))  ## (the host, group 2)

alice-b@google.com
alice-b
google.com
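
Groups also change what findall returns: when the pattern contains more than one group, each result is a tuple of the group contents. The sketch below shows this together with named groups; the group names user and host are chosen here for illustration.

```python
import re

text = 'purple alice-b@google.com, blah monkey bob@fb.com dishwasher'

# With two groups in the pattern, findall returns (username, host) tuples
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

# Named groups, written (?P<name>...), can be read back by name instead of by number
m = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', text)

print(pairs)             # [('alice-b', 'google.com'), ('bob', 'fb.com')]
print(m.group('user'))   # alice-b
print(m.group('host'))   # google.com
```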


In [None]:
#Replace the domain name with UpGrad

text = 'aaa@gmail.com bbb@hotmail.com ccc@apple.com'


print(re.sub('gmail|hotmail|apple', 'UpGrad', text))

In [40]:
# replacing parts of strings
text = 'aaa@gmail.com bbb@hotmail.com ccc@apple.com'

#Replace the lowercase letters before @ with info
print(re.sub('[a-z]*@', 'info@', text))    

info@gmail.com info@hotmail.com info@apple.com
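
Two further options of re.sub are worth knowing: a backreference such as \1 reuses captured text in the replacement, and the count argument limits how many matches are replaced. This is a minimal sketch; example.com is a placeholder domain, not something from the original notebook.

```python
import re

text = 'aaa@gmail.com bbb@hotmail.com ccc@apple.com'

# \1 refers back to group 1 (the username), so only the host part changes
same_host = re.sub(r'([\w.-]+)@[\w.-]+', r'\1@example.com', text)

# count=1 limits the substitution to the first match
first_only = re.sub(r'[a-z]+@', 'info@', text, count=1)

print(same_host)    # aaa@example.com bbb@example.com ccc@example.com
print(first_only)   # info@gmail.com bbb@hotmail.com ccc@apple.com
```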


***