## Regular Expressions

#### Regular Expressions Part I

A regular expression (regex) provides a concise and flexible means of matching strings of text, such as particular characters, words, or patterns of characters. It is written in a formal language that a regular expression processor interprets. In practice, regular expressions work like very powerful "wild card" expressions for matching and parsing strings.

###### re.search() 
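Use this to check whether a string contains a match for a regular expression, similar to using the find() method for strings. It returns a match object if the pattern is found anywhere in the string and None otherwise, so the result can be used directly as a true/false test.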
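As a small illustration of what re.search() hands back (the sample sentence and the variable names are made up for this sketch), the call gives a match object when the pattern is found and None when it is not, which is why it can sit directly in an if test:

```python
import re

line = 'The military said it had rescued 120 people from a school.'

found = re.search('from', line)    # a match object, which is truthy
missing = re.search('xyz', line)   # None, which is falsy

if found:
    # group() is the matched text, start() is the index where it begins
    print(found.group(), found.start())
print(missing)
```

The value of found.start() is the same index that line.find('from') would return.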

###### re.findall() 
Use this to extract the portions of a string that match a regular expression, similar to a combination of find() and slicing. It returns a list of all matching substrings, so it is the right choice when we want the matches themselves rather than a yes/no answer.
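A minimal sketch of the difference from find() plus slicing, using a made-up sentence: find() locates one known substring at a time, while re.findall() collects every match of a pattern into a list (an empty list if nothing matches).

```python
import re

text = 'Rescued 120 people and freed 107 prisoners.'

# Manual approach: we must already know what we are looking for.
start = text.find('120')
manual = text[start:start + 3]

# Pattern approach: every run of digits is returned as a list of strings.
numbers = re.findall('[0-9]+', text)
nothing = re.findall('[0-9]+', 'no digits here')

print(manual)
print(numbers)
print(nothing)
```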

###### Using re.search() like find()


In [4]:
with open(r'random-text.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if line.find('from') >= 0:
            print(line)
            
# str.find() returns the lowest index where the substring occurs, or -1 if it is not found

About a hundred extremists from the Maute group mounted attacks across Marawi.
The military said it had rescued 120 people from a school and a hospital.
Mujiv Hataman, governor of the Autonomous Region in Mindanao, said the terrorists freed 107 prisoners from the local prison, among them Maute rebels.
“Anyone now holding a gun, confronting government with violence, my orders are spare no one, let us solve the problems of Mindanao once and for all,” said Duterte, who is from the island, after cutting short a visit to Russia and returning to Manila.


In [5]:
import re

with open(r'random-text.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('from', line):
            # re.search() returns a match object (truthy) or None (falsy)
            print(line)

About a hundred extremists from the Maute group mounted attacks across Marawi.
The military said it had rescued 120 people from a school and a hospital.
Mujiv Hataman, governor of the Autonomous Region in Mindanao, said the terrorists freed 107 prisoners from the local prison, among them Maute rebels.
“Anyone now holding a gun, confronting government with violence, my orders are spare no one, let us solve the problems of Mindanao once and for all,” said Duterte, who is from the island, after cutting short a visit to Russia and returning to Manila.


###### Examples of various re patterns

In [6]:
with open(r'random-text.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if line.startswith('from'):
            print(line)

In [7]:
import re

with open(r'random-text.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('^from', line):
            print(line)

^X.*  : Match the character X at the start of the line, followed by any character (the dot) repeated zero or more times (the asterisk)

(e.g.) X-Sieve CMU Sieve 2.3


^X-\S+ :  Match the characters X- at the start of the line, followed by any non-whitespace character (\S) one or more times (+)

(e.g.) X-DSPAM-Result Innocent
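To see how the two patterns differ, here is a small sketch that runs both against the two example lines plus a third, hypothetical line that starts with X but has no hyphen. The list name and the third line are assumptions added for illustration.

```python
import re

lines = [
    'X-Sieve CMU Sieve 2.3',
    'X-DSPAM-Result Innocent',
    'X marks the spot',          # hypothetical line, not from the data
]

results = []
for line in lines:
    loose = bool(re.search('^X.*', line))      # any line starting with X
    strict = bool(re.search(r'^X-\S+', line))  # must start with X- plus non-whitespace
    results.append((loose, strict))
    print(line, loose, strict)
```

The looser pattern accepts all three lines, while the stricter one rejects the line without the hyphen.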

In [8]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print(y)

# [0-9]+  means one or more digits

['2', '19', '42']


In [9]:
z = re.findall('[AEIOU]+', x)
print(z)

# [AEIOU]+   means one or more uppercase vowels

[]
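The empty list appears because the sentence contains no uppercase vowels. As a sketch of two common ways around that, the character class can list lowercase letters, or the re.IGNORECASE flag can be passed so case is ignored:

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

lower = re.findall('[aeiou]+', x)                    # lowercase vowels only
anycase = re.findall('[AEIOU]+', x, re.IGNORECASE)   # same class, case ignored

print(lower)
print(anycase)
```

For this sentence both calls return the same list, since every vowel in it is lowercase.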


In [16]:
import re
a = 'From: Using the : character'
b = re.findall('^F.+?:', a)
print(b)

# ^F.+?:   means match the character F at the start of the line,
# followed by one or more characters (non-greedy), up to a colon
# Greedy matching finds the longest string that fits the pattern
# Adding a question mark (?) after + or * makes the matching non-greedy (shortest match)

['From:']
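For comparison, a short sketch of the same pattern with and without the question mark, using the same string. The greedy version pushes outward to the last colon on the line, while the non-greedy version stops at the first one.

```python
import re

a = 'From: Using the : character'

greedy = re.findall('^F.+:', a)    # .+ grabs as much as possible
lazy = re.findall('^F.+?:', a)     # .+? grabs as little as possible

print(greedy)   # ['From: Using the :']
print(lazy)     # ['From:']
```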


In [17]:
m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

n = re.findall(r'\S+@\S+', m)  # raw string keeps the backslashes intact
print(n)

# \S+@\S+   means at least one non-whitespace char followed by the @ char 
# and then again, at least one non-whitespace char (greedy)

['stephen.marquard@uct.ac.za']


In [18]:
m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

p = re.findall(r'^From (\S+@\S+)', m)
print(p)

# ^From (\S+@\S+)  means the match must start with "From" and a space,
# but only the part inside the parentheses is extracted and returned

['stephen.marquard@uct.ac.za']
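The effect of parentheses on re.findall() can be sketched with three variations on the same line: no group returns the whole match, one group returns just that group, and two groups return a tuple per match. The variable names here are only for illustration.

```python
import re

m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

whole = re.findall(r'^From \S+@\S+', m)       # no group: the entire match
one = re.findall(r'^From (\S+@\S+)', m)       # one group: only the address
two = re.findall(r'^From (\S+)@(\S+)', m)     # two groups: (user, host) tuples

print(whole)
print(one)
print(two)
```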


###### Extracting a host name - using find and string slicing (Method 1)

In [19]:
m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
atpos = m.find('@')
print(atpos)

21


In [22]:
sppos = m.find(' ', atpos)
# find the first space (' ') at or after index atpos
# signature: str.find(substring, start, end), where start and end are optional

print(sppos)

31


In [23]:
host = m[atpos+1 : sppos]
print(host)

uct.ac.za
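The three steps above can be collected into one function. This is only a sketch: the name host_from_line is hypothetical, and the handling of a missing '@' or a missing trailing space is an assumption added here, since find() returns -1 in those cases and the plain slice would silently give a wrong result.

```python
def host_from_line(line):
    """Return the text between '@' and the next space, or None if there is no '@'."""
    atpos = line.find('@')
    if atpos == -1:
        return None
    sppos = line.find(' ', atpos)
    if sppos == -1:
        # No space after the address: take everything up to the end of the line.
        sppos = len(line)
    return line[atpos + 1:sppos]


m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
print(host_from_line(m))
print(host_from_line('user@example.org'))
print(host_from_line('no address here'))
```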


###### Extracting a host name - using split() twice (Method 2)

In [24]:
m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
words = m.split() # split on runs of whitespace
email = words[1] # the second element is the email address
pieces = email.split('@') # split the address into user and host at the @
print(pieces[1])

uct.ac.za


###### Extracting a host name - using a regular expression (Method 3)

In [26]:
import re

m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
r = re.findall('@([^ ]*)', m)

print(r)
# @([^ ]*)  means look through the string until you find an @ sign,
# then match zero or more (*) non-blank characters ([^ ]) and extract them

['uct.ac.za']


###### Extracting a host name - using an anchored regular expression (Method 4)

In [29]:
import re
m = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
i = re.findall('^From .*@([^ ]*)', m)
print(i)

# ^From .*@([^ ]*)  means match a string that starts with "From" and a space,
# followed by any number of characters up to an @ sign,
# then extract zero or more non-blank characters after the @

['uct.ac.za']
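The practical difference between Method 3 and Method 4 shows up on lines that contain an @ sign but do not start with "From ". In the sketch below the second line is a hypothetical header invented for the comparison.

```python
import re

from_line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
other_line = 'Reply-To: someone@example.com'   # hypothetical line

loose = '@([^ ]*)'            # Method 3: any @ anywhere in the line
anchored = '^From .*@([^ ]*)' # Method 4: only lines starting with "From "

print(re.findall(loose, from_line), re.findall(anchored, from_line))
print(re.findall(loose, other_line), re.findall(anchored, other_line))
```

The anchored pattern is the more selective of the two, which helps when only the "From " lines of a file should contribute host names.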


In [32]:
import re
c = 'We just received $10.00 for cookies.'
d = re.findall(r'\$[0-9.]+', c)
print(d)

# \$[0-9.]+  means a literal dollar sign (the backslash escapes it)
# followed by one or more digits or periods ([0-9.]+)

['$10.00']
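To show why the backslash matters, here is a sketch comparing the escaped and unescaped patterns. Without the backslash, $ is the end-of-line anchor, so nothing can follow it and the result is empty. The last two lines, which capture only the digits and convert them to a number, are an extra step added for illustration.

```python
import re

c = 'We just received $10.00 for cookies.'

escaped = re.findall(r'\$[0-9.]+', c)     # literal dollar sign
unescaped = re.findall('$[0-9.]+', c)     # $ means "end of line" here

# Capture only the amount and convert it to a float.
amount_text = re.findall(r'\$([0-9.]+)', c)[0]
amount = float(amount_text)

print(escaped)
print(unescaped)
print(amount)
```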
