# Hands on NLP using Python

## 2. Regular Expression

## S4.V19. Introduction to Regular Expression

## S4.V20. Finding patterns in text-Part 1

In [1]:
import re

sentence = 'He was born in year 1996'

re.match(r'.*', sentence) # `.*` matches the entire string

<_sre.SRE_Match object; span=(0, 24), match='He was born in year 1996'>

`.` matches any character except a newline. `*` repeats the preceding element 0 or more times.

`+` repeats the preceding element 1 or more times.

In [2]:
# difference between match and search
print(re.match(r'\d+', sentence)) # None: match only tries the pattern at the beginning of the string
re.search(r'\d+', sentence) # search scans the whole string and returns the first match

None


<_sre.SRE_Match object; span=(20, 24), match='1996'>
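The match object becomes more useful once its contents are extracted. The following is a minimal sketch reusing the `sentence` string from the cell above; the variable name `year_match` is only illustrative.

```python
import re

sentence = 'He was born in year 1996'

# re.match only tries the pattern at position 0, so it finds nothing here.
print(re.match(r'\d+', sentence))

# re.search scans forward and returns the first match it finds.
year_match = re.search(r'\d+', sentence)
if year_match:
    print(year_match.group())  # the matched text
    print(year_match.span())   # (start, end) offsets within the string
```

Checking the result before calling `group()` avoids an `AttributeError` when the pattern does not match and `None` is returned.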

In [3]:
print(re.match(r'[A-Za-z]', sentence)) # matches only the 1st letter
print(re.match(r'[A-Za-z]+', sentence)) # matches the 1st word
print(re.match(r'\w+', sentence)) # same result here, but \w also matches digits and underscore

<_sre.SRE_Match object; span=(0, 1), match='H'>
<_sre.SRE_Match object; span=(0, 2), match='He'>
<_sre.SRE_Match object; span=(0, 2), match='He'>
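`[A-Za-z]+` and `\w+` give the same result on this sentence, but `\w` is the broader class: it also accepts digits and the underscore. A small sketch of the difference, using a made-up string `token` that is not part of the original notebook:

```python
import re

token = 'user_42 logged in'  # hypothetical example string

letters_only = re.match(r'[A-Za-z]+', token)
word_chars = re.match(r'\w+', token)

print(letters_only.group())  # stops at the underscore
print(word_chars.group())    # continues through '_' and the digits
```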


In [4]:
sent = 'abb'
print(re.match(r'ab*', sent))
re.match(r'ab?', sent)

<_sre.SRE_Match object; span=(0, 3), match='abb'>


<_sre.SRE_Match object; span=(0, 2), match='ab'>

`*` matches 0 or more occurrences, so it consumes both `b` characters and the match is `abb`.

`?` makes `b` optional, so `b` may occur 0 or 1 time. It therefore matches only the first of the two `b` characters.
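To compare the three quantifiers side by side, the sketch below applies `ab*`, `ab+`, and `ab?` to strings with zero, one, and two `b` characters. The list name `rows` is illustrative only.

```python
import re

rows = []
for text in ['a', 'ab', 'abb']:
    star = re.match(r'ab*', text)      # zero or more b
    plus = re.match(r'ab+', text)      # one or more b (may fail)
    optional = re.match(r'ab?', text)  # zero or one b
    rows.append((
        text,
        star.group(),
        plus.group() if plus else None,
        optional.group(),
    ))

for row in rows:
    print(row)
```

Only `ab+` can fail here, because it requires at least one `b`.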

## S4.V21. Finding patterns in text-Part 2

In [5]:
import re

sent = 'He was born in year 1996'
re.match(r'[a-zA-Z]+', sent)

<_sre.SRE_Match object; span=(0, 2), match='He'>

In [6]:
sent = '1996 was the year he was born'
print(re.match(r'[a-zA-Z]+', sent))

None


**Problem:** Now `match` can't find the pattern. `match` only tries the pattern at the beginning of the string, and since the string begins with a digit rather than a letter, the output is `None`.

**Solution:** Use `search`, which scans through the entire string.

In [7]:
sent = '1996 was the year he was born'
print(re.search(r'[a-zA-Z]+', sent)) # works and returns the first match

<_sre.SRE_Match object; span=(5, 8), match='was'>


In [8]:
# start anchor: check whether the string starts with a digit
sent = '1996 was the year he was born'

if re.search(r'^\d+', sent):
    print('Match')
else:
    print('No Match')

Match


`^` anchors the pattern to the beginning of the string.

In [9]:
# end anchor: check whether the string ends with something specific
sent = '1996 was the year he was born'

if re.search(r'born$', sent):
    print('Match')
else:
    print('No Match')

Match


`$` anchors the pattern to the end of the string.
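The two anchors can be combined to test a whole string rather than just one end of it. A minimal sketch, assuming a helper named `is_all_digits` (hypothetical, not part of the original notebook):

```python
import re

def is_all_digits(text):
    """Return True when the pattern ^\\d+$ matches text."""
    # ^ pins the match to the start and $ to the end, so nothing
    # other than digits may appear in between.
    # Note: $ also matches just before a trailing newline.
    return re.search(r'^\d+$', text) is not None

print(is_all_digits('1996'))
print(is_all_digits('1996 was the year he was born'))
```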

## S4.V22. Substituting patterns in text

In [41]:
import re

sent = 'I love Avengers'
print(re.sub(r'Avengers','Justice League', sent))

I love Justice League


In [42]:
# re.sub replaces every occurrence by default
sent = 'I love Avengers Avengers'
print(re.sub(r'Avengers','Justice League', sent))

I love Justice League Justice League


In [44]:
# matching is case-sensitive by default

# try to replace every letter with 0
sent = 'I love Avengers'
print(re.sub(r'[a-z]','0', sent))

I 0000 A0000000


Interestingly, only the lowercase letters were replaced, not the uppercase ones, because the pattern `[a-z]` matches lowercase letters only. We can fix this by using `[a-zA-Z]`.

Alternatively, pass the case-insensitive flag: `flags=re.I`.

In [45]:
sent = 'I love Avengers'
print(re.sub(r'[a-z]','0', sent, flags=re.I)) # with case insensitivity, every letter becomes 0

0 0000 00000000


In [47]:
# we can also limit how many replacements are made with count. Default is all
sent = 'I love Avengers'
print(re.sub(r'[a-z]','0', sent, count=5, flags=re.I))

0 0000 Avengers
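Passing `count` by keyword keeps it from being confused with `flags`, which is the next positional parameter of `re.sub`. When the number of replacements matters, the standard-library function `re.subn` returns it alongside the new string. A short sketch; the variable names are illustrative:

```python
import re

sent = 'I love Avengers'

# Limit the substitution to the first five matches.
first_five = re.sub(r'[a-z]', '0', sent, count=5, flags=re.I)
print(first_five)

# re.subn returns a (new_string, number_of_replacements) tuple.
replaced, n_replaced = re.subn(r'[a-z]', '0', sent, flags=re.I)
print(replaced, n_replaced)
```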


## S4.V23. Shorthand character class

In [82]:
import re

sent1 = 'Welcome to the Year 2018'
sent2 = "Just ~% +++--- arrived @Jack's place. #fun"
sent3 = 'I                love                  you'

# remove digits from sent1
sent1_modified = re.sub(r'\d','', sent1)
print(sent1_modified)

Welcome to the Year 


In [83]:
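# remove specific punctuation characters (the hyphen is escaped so it is taken literally)
sent2_modified = re.sub(r"[@#~%+\-.']",'', sent2)
print(sent2_modified)

# easier way: replace every non-word character (not a letter, digit, or underscore) with a space
sent2_modified = re.sub(r"\W",' ', sent2)
print(sent2_modified)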

Just   arrived Jacks place fun
Just           arrived  Jack s place   fun
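Inside a character class, an unescaped hyphen between two characters defines a range, which can silently match more than intended. The sketch below contrasts a class where `+-\.` is read as the range from `+` to `.` (which covers `+`, `,`, `-`, and `.`) with one where the hyphen is escaped. The variable names are illustrative.

```python
import re

sent2 = "Just ~% +++--- arrived @Jack's place. #fun"

# '+-\.' is parsed as a range from '+' to '.', i.e. the characters + , - .
range_form = re.sub(r"[@#~%+-\.']", '', sent2)

# Escaping the hyphen (or placing it last) makes it a literal character.
literal_form = re.sub(r"[@#~%+\-.']", '', sent2)

print(range_form == literal_form)       # same result on this input

# The difference shows up when the text contains a comma.
with_range = re.sub(r"[+-\.]", '', 'a,b')
with_literal = re.sub(r"[+\-.]", '', 'a,b')
print(with_range, with_literal)
```

On `sent2` both forms agree only because the sentence contains no comma.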


In [84]:
# replacing 1 or MORE spaces with a single space to keep space between words consistent
sent2_modified = re.sub(r"\s+",' ', sent2_modified)
print(sent2_modified)

Just arrived Jack s place fun


In [85]:
# now we need to remove the single letter s
sent2_modified = re.sub(r"\s+[a-zA-Z]\s",' ', sent2_modified)
print(sent2_modified)

Just arrived Jack place fun


In [86]:
# collapse each run of whitespace into a single space
sent3_modified = re.sub(r"\s+",' ', sent3)
print(sent3_modified)

I love you


In [89]:
# replace a word together with its surrounding whitespace
sent3_modified = re.sub(r"\s+love\s+",' hate ', sent3) # or use r"\s+[A-Za-z]+\s+"
print(sent3_modified)

I hate you


## S4.V25. Preprocessing using RegEx

In [92]:
import re

# clean up noisy text: punctuation, digits, stray single letters, extra whitespace

X = ['This is a wolf#scary',
     'Welcome to the jungle #missing',
     '11322 the number to know',
     'Remember the name s - John',
     'I love you']

for i in range(len(X)):
    X[i] = re.sub(r'\W', ' ', X[i]) # remove non-word characters
    X[i] = re.sub(r'\d', ' ', X[i]) # remove digits
    X[i] = re.sub(r'\s+[A-Za-z]\s+', ' ', X[i]) # remove single letters
    # X[i] = re.sub(r'\s+[a-z]\s+', ' ', X[i], flags=re.I) # equivalent, using the case-insensitive flag
    X[i] = re.sub(r'\s+', ' ', X[i]) # collapse multiple spaces into a single space
    X[i] = re.sub(r'^\s', '', X[i]) # strip a leading space
    X[i] = re.sub(r'\s$', '', X[i]) # strip a trailing space
print(X) # clean list of sentences

['This is wolf scary', 'Welcome to the jungle missing', 'the number to know', 'Remember the name John', 'I love you']
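The same steps can be packaged as a reusable function. This is a minimal sketch rather than a definitive implementation: the name `clean_text` is hypothetical, and `str.strip()` stands in for the two anchored substitutions, which is equivalent here because the whitespace has already been collapsed to single spaces.

```python
import re

def clean_text(text):
    """Apply the cleaning steps from the cell above to one string."""
    text = re.sub(r'\W', ' ', text)              # non-word characters -> space
    text = re.sub(r'\d', ' ', text)              # digits -> space
    text = re.sub(r'\s+[A-Za-z]\s+', ' ', text)  # drop isolated single letters
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return text.strip()                          # trim leading/trailing space

X = ['This is a wolf#scary',
     'Welcome to the jungle #missing',
     '11322 the number to know',
     'Remember the name s - John',
     'I love you']

cleaned = [clean_text(sentence) for sentence in X]
print(cleaned)
```

A single letter at the very start or end of a string (such as the `I` in `'I love you'`) is kept, because the pattern requires whitespace on both sides.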
