# REGEX

In [14]:
import re

In [15]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [16]:
mo = phoneNumRegex.search('My number is 415-555-4242.')

In [17]:
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


In [11]:
mo.group()

'415-555-4242'

In [14]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

In [15]:
mo = phoneNumRegex.search('My number is 415-555-4242.')

In [16]:
mo.group(1)

'415'

In [17]:
mo.group(2)

'555-4242'

In [18]:
mo.group(0)

'415-555-4242'

In [19]:
mo.groups()

('415', '555-4242')

In [20]:
areaCode, mainNumber = mo.groups()

In [21]:
print(areaCode)

415


In [22]:
print(mainNumber)

555-4242


#### Matching Multiple Groups with the Pipe

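The | character is called a <i>pipe</i>. You can use it anywhere you want to match one of several expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'. When both occur in the searched string, the first occurrence is returned as the Match object.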

In [23]:
heroRegex = re.compile(r'Batman|Tina Fey')

In [24]:
mo1 = heroRegex.search('Batman and Tina Fey')

In [25]:
mo1.group()

'Batman'

In [26]:
mo2 = heroRegex.search('Tina Fey and Batman')

In [27]:
mo2.group()

'Tina Fey'

In [28]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')

In [29]:
mo = batRegex.search('Batmobile lost a wheel')

In [30]:
mo.group()

'Batmobile'

In [31]:
mo.group(1)

'mobile'

#### Optional Matching with the Question Mark

The ? character flags the group that precedes it as an optional part of the pattern. 

In [32]:
batRegex = re.compile(r'Bat(wo)?man')

In [33]:
mo1 = batRegex.search('The Adventures of Batman')

In [34]:
mo1.group()

'Batman'

In [35]:
mo2 = batRegex.search('The Adventures of Batwoman')

In [36]:
mo2.group()

'Batwoman'

Using the earlier phone number example, one can make the regex look for phone numbers that do or do not have an area code.

In [37]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

In [38]:
mo1 = phoneRegex.search('My number is 415-555-4242')

In [39]:
mo1.group()

'415-555-4242'

In [40]:
mo2 = phoneRegex.search('My number is 555-4242')

In [41]:
mo2.group()

'555-4242'

#### Matching One or More with the Plus

In [42]:
batRegex = re.compile(r'Bat(wo)+man')

In [43]:
mo1 = batRegex.search('The Adventures of Batwoman')

In [44]:
mo1.group()

'Batwoman'

In [45]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')

In [46]:
mo2.group()

'Batwowowowoman'

In [47]:
mo3 = batRegex.search('The Adventures of Batman')

In [48]:
mo3 == None

True

#### Matching Specific Repetitions with Braces

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in braces. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

In [49]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')

In [50]:
mo1.group()

'HaHaHa'

In [51]:
mo2 = haRegex.search('Ha')

In [52]:
mo2 == None

True

#### Greedy and Non-Greedy Matching

In [53]:
greedyHaRegex = re.compile(r'(Ha){3,5}')

In [54]:
mo1 = greedyHaRegex.search('HaHaHaHaHa')

In [55]:
mo1.group()

'HaHaHaHaHa'

In [59]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')

In [60]:
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')

In [61]:
mo2.group()

'HaHaHa'

#### findall() Method

In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string.

In [6]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [7]:
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')

In [8]:
mo.group()

'415-555-9999'

In [9]:
mo2 = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

In [10]:
mo2

['415-555-9999', '212-555-0000']

In [11]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

In [12]:
momo = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

In [13]:
momo

[('415', '555', '9999'), ('212', '555', '0000')]

In [14]:
cell, work = phoneNumRegex.findall('Cell: 415-555-9890 Work: 212-598-0190')

In [15]:
work

('212', '598', '0190')

In [16]:
cell

('415', '555', '9890')

<p>When findall() is called on a regex with no groups, the method returns a list of string matches, such as ['415-555-9999', '212-555-0000'].</p>
<p>When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].</p>

### Character Class

Shorthand character classes such as \d, \w, and \s are used for shortening regular expressions.

In [17]:
xmasRegex = re.compile(r'\d+\s\w+')

In [18]:
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

In [5]:
xmasRegex1 = re.compile(r'\d+\s\w+')

In [6]:
xmasRegex1.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter, digit, or underscore characters (\w+). The findall() method returns all matching strings of the regex pattern in a list.
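To see what the shorthand classes save, here is a small sketch that writes the same pattern out longhand. The expanded classes and the names `shorthand` and `longhand` are assumptions for illustration; they are equivalent only for plain ASCII text, since \d and \w also match non-ASCII digits and letters in Python 3 string patterns.

```python
import re

text = '12 drummers, 11 pipers, 10 lords'

# Shorthand classes, as used in the cells above.
shorthand = re.compile(r'\d+\s\w+')

# The same pattern with each shorthand class spelled out (ASCII-only assumption).
longhand = re.compile(r'[0-9]+[ \t\n\r\f\v][a-zA-Z0-9_]+')

shorthand_matches = shorthand.findall(text)
longhand_matches = longhand.findall(text)

print(shorthand_matches)                      # ['12 drummers', '11 pipers', '10 lords']
print(shorthand_matches == longhand_matches)  # True
```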

#### Making Your Own Character Classes

You can define your own character class using square brackets.

In [87]:
vowelRegex = re.compile(r'[aeiouAEIOU]')

In [91]:
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

<p>You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and digits.</p>
<p>Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] will match digits 0 to 5 and a period. You do not need to write it as [0-5\.].</p>
<p>By placing a caret character (^) just after the character class's opening bracket, you can make a <i>negative character class</i>. A negative character class will match all the characters that are not in the character class.</p>
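As a quick check of the range-and-period behaviour described above, the following sketch compares [0-5.] with its escaped form. The sample sentence and the variable names are made up for illustration.

```python
import re

sample = 'Version 3.14.9 costs 60 dollars.'

# A range plus a literal period: matches the digits 0 through 5, or '.'
lowDigitOrDot = re.compile(r'[0-5.]')
found = lowDigitOrDot.findall(sample)
print(found)  # ['3', '.', '1', '4', '.', '0', '.']

# Escaping the period inside the brackets is allowed but changes nothing.
escapedFound = re.compile(r'[0-5\.]').findall(sample)
print(found == escapedFound)  # True
```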

In [89]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')

In [92]:
consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

### Caret and Dollar Sign Characters

In [93]:
beginsWithHello = re.compile(r'^Hello')

In [94]:
beginsWithHello.search('Hello, world!')

<re.Match object; span=(0, 5), match='Hello'>

In [96]:
beginsWithHello.search('He said hello.') == None

True

In [97]:
endsWithNumber = re.compile(r'\d$')

In [98]:
endsWithNumber.search('Your number is 42')

<re.Match object; span=(16, 17), match='2'>

In [99]:
endsWithNumber.search('Your number is forty two.')

In [105]:
wholeStringIsNum = re.compile(r'^\d+$')

In [106]:
wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [107]:
wholeStringIsNum.search('12345xyz67890') == None

True

In [108]:
wholeStringIsNum.search('12 34567890') == None

True

### Wildcard Character

The dot (.) character in a regular expression is called a wildcard and will match any character except for a newline.

In [109]:
atRegex = re.compile(r'.at')

In [110]:
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

In [114]:
atRegex2 = re.compile(r'\.at')

In [115]:
atRegex2.findall('The cat in the hat sat on the flat mat at home.')

[]

#### Matching Everything with Dot-Star

In [116]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')

In [117]:
mo = nameRegex.search('First Name: Al Last Name: Sweigart')

In [118]:
mo.group(1)

'Al'

In [119]:
mo.group(2)

'Sweigart'

The dot means 'any single character except a newline', and the star means 'zero or more of the preceding character', so together (.*) matches any run of text up to the end of the line.

In [120]:
nongreedyRegex = re.compile(r'<.*?>')

In [121]:
mo = nongreedyRegex.search('<To serve man> for dinner.')

In [122]:
mo.group()

'<To serve man>'

In [123]:
greedyRegex = re.compile(r'<.*>')

In [126]:
mo = greedyRegex.search('<To serve man> for dinner.>')

In [127]:
mo.group()

'<To serve man> for dinner.>'

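Both regexes roughly translate to 'Match an opening angle bracket, followed by anything, followed by a closing angle bracket.' The non-greedy version (.*?) stops at the first closing bracket it can, while the greedy version (.*) extends to the last one.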

#### Matching Newlines with the Dot Character

In [128]:
noNewLineRegex = re.compile('.*')

In [139]:
noNewLineRegex.search(
    'Serve the public trust.\nProtect the innocent.' 
    '\nUphold the law.').group()

'Serve the public trust.'

In [138]:
newlineRegex = re.compile('.*', re.DOTALL)

In [140]:
newlineRegex.search('Serve the public trust.\nProtect the innocent'
                   '\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent\nUphold the law.'

#### Case-Insensitive Matching

In [141]:
regex1 = re.compile('RoboCop')

In [142]:
regex2 = re.compile('ROBOCOP')

In [143]:
regex3 = re.compile('robOcop')

In [144]:
regex4 = re.compile('RobocOp')

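Sometimes, though, you only care about matching the letters regardless of case. To make a regex case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().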

In [145]:
robocop = re.compile(r'robocop', re.I)

In [146]:
robocop.search('RoboCop is part man, part machine, all cop.').group()

'RoboCop'

In [148]:
robocop.search('ROBOCOP protects the innocent').group()

'ROBOCOP'

In [149]:
robocop.search('Al, why does your programming book '
               'talk about robocop so much?').group()

'robocop'

#### Substituting Strings with the sub() Method

The sub() method for Regex objects is passed two arguments. The first is the replacement string for any matches. The second is the string to search, that is, the text the regular expression is applied to. The sub() method returns a new string with the substitutions applied.

In [150]:
namesRegex = re.compile(r'Agent \w+')

In [151]:
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents '
               'to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

In [156]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')

In [160]:
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that ' 
                    'Agent Eve knew Agent Bob was a double agent')

'A**** told C**** that E**** knew B**** was a double agent'

## PROJECT: PHONE NUMBER AND EMAIL ADDRESS EXTRACTOR

<p>Your phone and email address extractor will need to do the following:</p>
    <ol>
        <li>Get the text off the clipboard.</li>
        <li>Find all phone numbers and email addresses in the text.</li>
        <li>Paste them onto the clipboard.</li>
    </ol>
<p>Now you can start thinking about how this might work in code. The code will need to do the following:</p>
    <ol>
    <li>Use the pyperclip module to copy and paste strings.</li>
    <li>Create two regexes, one for matching phone numbers and the other for matching email addresses.</li>
    <li>Find all matches, not just the first match, of both regexes.</li>
    <li>Neatly format the matched strings into a single string to paste.</li>
    <li>Display some kind of message if no matches were found in the text.</li>
    </ol>
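The steps above can be sketched roughly as follows. This is a minimal illustration, not a finished extractor: the two patterns, the function name `extractContacts`, and the sample text are assumptions, and the clipboard step is left as comments because pyperclip is a third-party module that may not be installed.

```python
import re

# Hypothetical patterns for illustration; real phone and email formats vary widely.
phoneRegex = re.compile(r'(?:\d{3}-)?\d{3}-\d{4}')   # optional area code, then 555-4242 style
emailRegex = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')


def extractContacts(text):
    """Return every phone number and email address in text, one per line."""
    # findall() returns all matches; the patterns use no capturing groups,
    # so each result is a plain string.
    matches = phoneRegex.findall(text) + emailRegex.findall(text)
    if not matches:
        return 'No phone numbers or email addresses found.'
    return '\n'.join(matches)


# With pyperclip installed, the clipboard steps would look like:
#   text = pyperclip.paste()
#   pyperclip.copy(extractContacts(text))
sample = 'Call 415-555-4242 or 555-0199, or write to al@example.com.'
result = extractContacts(sample)
print(result)
```

Keeping the matching logic in a function that takes a plain string makes it easy to test without touching the clipboard.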

#### <i>Step 1: Create a Regex for Phone Numbers</i>

In [7]:
network_part = "192.168.2."
host_parts = [20, 40, 60]

In [9]:
for idx in range(len(host_parts)):
    host_part = host_parts[idx]
    ip = network_part + str(host_part)
    print('The router IP is: ' + ip)

The router IP is: 192.168.2.20
The router IP is: 192.168.2.40
The router IP is: 192.168.2.60


### Project Work: Date Detection

Write a regular expression that can detect dates in the <i>DD/MM/YYYY</i> format. Assume that the days range from 01 to 31, the months range from 01 to 12, and the years range from 1000 to 2999. Note that if the day or month is a single digit, it'll have a leading zero.

The regular expression doesn't have to detect correct days for each month or for leap years; it will accept nonexistent dates like 31/02/2020 or 31/04/2021. Then store these strings into variables named <b>month, day</b> and <b>year</b>, and write additional code that can detect if it is a valid date. April, June, September, and November have 30 days, February has 28 days, and the rest of the months have 31 days. February has 29 days in leap years. Leap years are every year evenly divisible by 4, except for years evenly divisible by 100, unless the year is also evenly divisible by 400.

In [22]:

#Step 1: Get dates to verify
    
getdate = input('Enter date: ')
    

Enter date: 04/06/2020


In [43]:
# Create the date regular expression (re.VERBOSE allows whitespace and comments)
dateRegex = re.compile(r'''
    (0[1-9]|[12]\d|3[01])   # day: 01 to 31
    /
    (0[1-9]|1[0-2])         # month: 01 to 12
    /
    ([12]\d{3})             # year: 1000 to 2999
''', re.VERBOSE)


In [44]:
getdate

'04/06/2020'

In [45]:
di = dateRegex.search(getdate)

In [46]:
di.groups()

('04', '06', '2020')
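The validation step described in the task could then be sketched as below. This is one possible approach under stated assumptions: the helper names `isLeapYear` and `isValidDate` are made up for illustration, the block defines its own copy of the date pattern so it runs on its own, and it uses fullmatch() so that extra text around the date is rejected.

```python
import re

# DD/MM/YYYY with day 01-31, month 01-12, year 1000-2999.
dateRegex = re.compile(r'(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/([12]\d{3})')


def isLeapYear(year):
    # Divisible by 4, except century years, unless also divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def isValidDate(text):
    mo = dateRegex.fullmatch(text)
    if mo is None:
        return False  # not in DD/MM/YYYY form at all
    day, month, year = (int(part) for part in mo.groups())
    if month in (4, 6, 9, 11):
        daysInMonth = 30
    elif month == 2:
        daysInMonth = 29 if isLeapYear(year) else 28
    else:
        daysInMonth = 31
    return day <= daysInMonth


print(isValidDate('04/06/2020'))  # True
print(isValidDate('31/02/2020'))  # False
print(isValidDate('29/02/2100'))  # False
```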