## Regular Expressions

Let's look at the aforementioned recipes in detail.

### Regular expression – learning to use *, +, and ?

We start off with a recipe that elaborates on the use of the *, +, and ? operators in regular expressions. These shorthand operators are quantifiers (sometimes loosely called wildcards): zero or more (*), one or more (+), and zero or one (?).

In [1]:
import re

In [3]:
def text_match(text, patterns):
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return 'Not matched!'
    

In [4]:
print(text_match("ac", "ab?"))
print(text_match("abc", "ab?"))
print(text_match("abbc", "ab?"))

Found a match!
Found a match!
Found a match!


In [5]:
print(text_match("ac", "ab*"))
print(text_match("abc", "ab*"))
print(text_match("abbc", "ab*"))

Found a match!
Found a match!
Found a match!


In [6]:
print(text_match("ac", "ab+"))
print(text_match("abc", "ab+"))
print(text_match("abbc", "ab+"))

Not matched!
Found a match!
Found a match!


In [7]:
print(text_match("abbc", "ab{2}"))

Found a match!


In [8]:
print(text_match("aabbbbc", "ab{3,5}?"))

Found a match!
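
Because text_match only reports whether a match exists, every quantifier above looks alike on most inputs. As a minimal sketch (the `cases` list and `matched` dictionary are illustrative names, not part of the recipe), printing the matched substring shows how the quantifiers differ, including the lazy `{3,5}?` form:

```python
import re

# Each tuple is (pattern, text); group() returns the exact substring matched.
cases = [
    (r"ab?", "abbc"),
    (r"ab*", "abbc"),
    (r"ab+", "abbc"),
    (r"ab{2}", "abbc"),
    (r"ab{3,5}", "aabbbbc"),   # greedy: takes as many b's as allowed
    (r"ab{3,5}?", "aabbbbc"),  # lazy: stops at the minimum of three b's
]

matched = {}
for pattern, text in cases:
    m = re.search(pattern, text)
    matched[pattern] = m.group() if m else None
    print(f"{pattern!r:12} on {text!r:10} -> {matched[pattern]!r}")
```

The last two lines of output differ only in the trailing ?, which switches the quantifier from greedy ('abbbb') to lazy ('abbb').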


### Regular expression – learning to use $ and ^, and the non-start and non-end of a word

In [9]:
print(text_match("abc","^a.*c$"))

Found a match!


Let's look at this pattern, ^a.*c$. It means: the text starts with a, followed by
zero or more of any character, and ends with c.

In default mode the dot matches any character except a newline, so .* means
zero or more occurrences of any non-newline character.
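
The newline exception can be seen with a short sketch. The sample string and the variable names here are assumptions made for illustration; `re.DOTALL` is the standard flag that lets the dot match a newline as well:

```python
import re

text = "a\nc"  # illustrative input with a newline between a and c

# Default mode: '.' cannot cross the newline, so the pattern fails.
default_match = re.search(r"^a.*c$", text)
print(default_match)  # None

# With re.DOTALL, '.' also matches '\n', so the whole string matches.
dotall_match = re.search(r"^a.*c$", text, re.DOTALL)
print(repr(dotall_match.group()))  # 'a\nc'
```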

In [10]:
print("Begin with a word")
print(text_match("Tuffy eats pie, Loki eats peas!", r"^\w+"))

Begin with a word
Found a match!


\w matches any alphanumeric character or underscore. The pattern says: starting
at the beginning of the text (^), match one or more (+) word characters (\w).

In [11]:
print("End with a word and optional punctuation")
print(text_match("Tuffy eats pie, Loki eats peas!", r"\w+\S*?$"))

End with a word and optional punctuation
Found a match!


The pattern means one or more occurrences of \w, followed by zero or more
occurrences of \S, anchored at the end of the input text ($). To understand \S
(capital S), we must first understand \s, which matches any whitespace
character. \S is the complement of \s: it matches any non-whitespace character.
Coming right after \w+, it picks up trailing punctuation such as the final !.
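
To see which part of the text the end-anchored pattern actually captures, the following sketch prints the matched substring for a few inputs. The extra sample strings and the `results` dictionary are assumptions added for illustration:

```python
import re

pattern = r"\w+\S*?$"
samples = [
    "Tuffy eats pie, Loki eats peas!",  # ends with a word plus punctuation
    "Tuffy eats pie, Loki eats peas",   # ends with a bare word
    "Tuffy eats pie, Loki eats peas! ", # ends with whitespace
]

results = {}
for text in samples:
    m = re.search(pattern, text)
    results[text] = m.group() if m else None
    print(repr(text), "->", repr(results[text]))
```

The first two inputs yield 'peas!' and 'peas'. The third yields None, because neither \w nor \S can match the trailing space that sits just before the end of the text.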

In [12]:
print("Finding a word that contains a character, not at the start or end of the word")
print(text_match("Tuffy eats pie, Loki eats peas!", r"\Bu\B"))

Finding a word that contains a character, not at the start or end of the word
Found a match!


To decode this pattern: \B is the complement of \b. \b matches the empty
string at the beginning or end of a word, and we have already seen what a word
is (a run of \w characters). Hence \B matches only inside a word, so \Bu\B
matches a u that is neither the first nor the last character of its word; here,
that is the u in Tuffy.
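
A short sketch makes the boundary behaviour concrete. The words "up" and "menu" and the `\b\w+u\w+\b` pattern are assumptions chosen for illustration; they are not part of the original recipe:

```python
import re

text = "Tuffy eats pie, Loki eats peas!"

# \Bu\B matches only the letter itself, at its position inside the word.
inner = re.search(r"\Bu\B", text)
print(inner.group(), inner.start())  # u 1

# A u at the start or end of a word sits on a word boundary, so \B fails.
at_start = re.search(r"\Bu\B", "up")
at_end = re.search(r"\Bu\B", "menu")
print(at_start, at_end)  # None None

# To retrieve the whole word containing an inner u, spell out the word:
words = re.findall(r"\b\w+u\w+\b", text)
print(words)  # ['Tuffy']
```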

### Searching multiple literal strings and substring occurrences

In [13]:
patterns = ['Tuffy', 'Pie', 'Loki']
text = 'Tuffy eats pie, Loki eats peas'

In [14]:
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern, text))
    if re.search(pattern, text):
        print('Found!')
    else:
        print('Not Found!')

Searching for "Tuffy" in "Tuffy eats pie, Loki eats peas" ->
Found!
Searching for "Pie" in "Tuffy eats pie, Loki eats peas" ->
Not Found!
Searching for "Loki" in "Tuffy eats pie, Loki eats peas" ->
Found!
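
"Pie" is not found above because regular expression matching is case-sensitive by default. If a case-insensitive search is what you want, one option is the standard `re.IGNORECASE` flag. This is a minimal sketch; the variable names are illustrative:

```python
import re

text = 'Tuffy eats pie, Loki eats peas'

# Case-sensitive (default): the capital P does not match the lowercase p.
strict = re.search('Pie', text)
print(strict)  # None

# Case-insensitive: the same literal now matches 'pie'.
relaxed = re.search('Pie', text, re.IGNORECASE)
print(relaxed.group(), relaxed.span())  # pie (11, 14)
```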


In [16]:
text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'

In [17]:
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' %(text[s:e],s,e))

Found "festival" at 12:20
Found "festival" at 42:50


### Learning to create a date regex and a set of characters or ranges of characters

In [22]:
url="http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewis1/"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'

In [25]:
print("Date found in the URL :", re.findall(date_regex, url))

Date found in the URL : [('2017', '10', '28')]
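
The captured groups are plain strings. If a real date object is needed, they can be converted with the standard `datetime.date` class. The sketch below assumes the URL segments are in year/month/day order, which the pattern itself does not guarantee:

```python
import re
from datetime import date

url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewis1/"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'

found = None
m = re.search(date_regex, url)
if m:
    # Assumption: the three groups are year, month, and day, in that order.
    year, month, day = map(int, m.groups())
    found = date(year, month, day)  # raises ValueError for impossible dates

print(found.isoformat() if found else "no date in URL")  # 2017-10-28
```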


In [26]:
def is_allowed_specific_char(string):
    # Search for any character outside the allowed set: letters, digits, and '.'
    char_re = re.compile(r'[^a-zA-Z0-9.]')
    return not char_re.search(string)

In [27]:
print(is_allowed_specific_char("ABCDEFabcdef123450."))
print(is_allowed_specific_char("*&%@#!}{"))

True
False
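
One detail in character ranges deserves care: a range is defined by character codes, so A-z (lowercase z) is wider than A-Z. It also covers the six ASCII characters between Z and a, namely [, \, ], ^, _ and the backtick. The following sketch compares the two ranges; the names `loose` and `strict` and the sample string are illustrative:

```python
import re

loose = re.compile(r'[^a-zA-z0-9.]')   # 'A-z' silently allows [ \ ] ^ _ `
strict = re.compile(r'[^a-zA-Z0-9.]')  # letters, digits, and '.' only

sample = "file_name[1]"

loose_hit = loose.search(sample)
strict_hit = strict.search(sample)

print(loose_hit)           # None: every character falls inside the loose set
print(strict_hit.group())  # _  (first character outside the strict set)
```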


### Find all five-character words and make abbreviations in some sentences

In [28]:
street = '21 Ramkrishna Road'
print(re.sub('Road', 'Rd', street))

21 Ramkrishna Rd


In [29]:
text = 'Diwali is a festival of light, Holi is a festival of color!'
print(re.findall(r"\b\w{5}\b", text))

['light', 'color']


### Learning to write your own regex tokenizer

In [30]:
raw = "I am big! It's the pictures that got small."
print(re.split(r' +', raw))

['I', 'am', 'big!', "It's", 'the', 'pictures', 'that', 'got', 'small.']


In [31]:
print(re.split(r'\W+', raw))

['I', 'am', 'big', 'It', 's', 'the', 'pictures', 'that', 'got', 'small', '']


In [32]:
print(re.findall(r'\w+|\S\w*', raw))

['I', 'am', 'big', '!', 'It', "'s", 'the', 'pictures', 'that', 'got', 'small', '.']


### Learning to write your own regex stemmer

In [33]:
def stem(word):
    # Lazily capture the stem, then an optional known suffix anchored at the end
    stem_part, _suffix = re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', word)[0]
    return stem_part

In [34]:
raw = "Keep your friends close, but your enemies closer."
tokens = re.findall(r'\w+|\S\w*', raw)
print(tokens)

['Keep', 'your', 'friends', 'close', ',', 'but', 'your', 'enemies', 'closer', '.']


In [35]:
for t in tokens:
    print("'"+stem(t)+"'")

'Keep'
'your'
'friend'
'close'
','
'but'
'your'
'enem'
'closer'
'.'
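
To see why the stemmer behaves this way, it helps to look at both capture groups. The stem group is lazy (.*?), so it grows only until one of the listed suffixes can reach the end of the word. This sketch uses an illustrative helper, `split_suffix`, which is not part of the recipe, and adds "running" as an extra sample word:

```python
import re

SUFFIX_RE = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def split_suffix(word):
    """Return (stem, suffix); suffix is None when no listed suffix applies."""
    return SUFFIX_RE.match(word).groups()

pairs = {w: split_suffix(w) for w in ["friends", "enemies", "closer", "running"]}
for word, (stem_part, suffix) in pairs.items():
    print(word, "->", repr(stem_part), repr(suffix))
```

"enemies" splits into 'enem' and 'ies' rather than 'enemie' and 's', because the lazy stem stops at the earliest position where any suffix fits. "closer" has no listed suffix, so it is returned whole.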
