In [None]:
if not True:
    print("hi")

In [None]:
if not False:
    print("hi")

In [None]:
print(r"xs\nd")

### Matching Regex Objects 

- The patterns r'\d\d\d-\d\d\d-\d\d\d\d' and r'\d{3}-\d{3}-\d{4}' are equivalent: both match three digits, a hyphen, three digits, a hyphen, and four digits.

In [None]:
import re
Check = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = Check.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

In [8]:
mo

<re.Match object; span=(13, 25), match='415-555-4242'>

In [9]:
mo.group()

'415-555-4242'
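
Note that search() returns None when the pattern is not found, so calling group() on the result would raise an AttributeError. A minimal sketch of a guard follows; the helper name find_phone_number is hypothetical and not part of the re module.

```python
import re

# Hypothetical helper; the name and return convention are illustrative only.
PHONE_REGEX = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    mo = PHONE_REGEX.search(text)
    if mo is None:      # search() found nothing
        return None
    return mo.group()   # the full matched text

print(find_phone_number('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone_number('No number here.'))             # None
```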

## More Pattern Matching with Regular Expressions


#### Grouping with Parentheses

In [8]:
import re

In [9]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

In [10]:
mo.group(1)

'415'

In [11]:
mo.group(2)

'555-4242'

In [12]:
mo.group()

'415-555-4242'

In [13]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
mo.group(1)

'(415)'

In [14]:
mo.group(2)

'555-4242'
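
To retrieve every group in one call, a Match object also provides groups(), which returns a tuple that can be unpacked with multiple assignment. A short sketch reusing the pattern from the first grouping example (the variable names areaCode and mainNumber are illustrative):

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

# groups() returns all captured groups as a tuple, in order.
areaCode, mainNumber = mo.groups()
print(areaCode)     # 415
print(mainNumber)   # 555-4242
```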

## Matching Multiple Groups with the Pipe

In [15]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [16]:
mo2 = heroRegex.search('Tina Fey and Batman.')
mo2.group()

'Tina Fey'

In [17]:
# When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text is returned as the Match object.


In [19]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()

'Batmobile'

In [20]:
mo.group(1)

'mobile'

In [22]:
"""match any of the strings 'Batman', 
'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it 
would be nice if you could specify that prefix only once"""

"match any of the strings 'Batman', \n'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it \nwould be nice if you could specify that prefix only once"

## Optional Matching with the Question Mark

In [24]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [26]:
mo2 = batRegex.search('The Adventures of Batwoman') 
mo2.group()

'Batwoman'

In [28]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo1.group()

'415-555-4242'

In [29]:
mo2 = phoneRegex.search('My number is 555-4242')
mo2.group()

'555-4242'

## Matching Zero or More with the Star

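The star (\*) means match zero or more: the group that precedes the star can occur any number of times in the text, including not at all.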

In [30]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [31]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [32]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

## Matching One or More with the Plus

The plus (\+) means match one or more: the group that precedes the plus must appear at least once.

In [33]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [34]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [35]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 == None

True

## Matching Specific Repetitions with Curly Brackets

- The following pairs of regular expressions match identical patterns:
```python
(Ha){3}
(Ha)(Ha)(Ha)
```
```python
(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))
```

(Ha){,5} matches zero to five instances of Ha, and (Ha){3,} matches three or more instances.

In [37]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()



'HaHaHa'

In [38]:
mo2 = haRegex.search('Ha')
mo2 == None

True
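
The equivalences listed in this section can be checked directly. The sketch below uses re.fullmatch() rather than search(), because search() with the spelled-out alternation stops at the first alternative that succeeds and so may return shorter text than the curly-bracket form. The helper name same_language and the sample strings are assumptions made for illustration.

```python
import re

def same_language(pattern_a, pattern_b, samples):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

# '', 'Ha', 'HaHa', ... up to seven repetitions of 'Ha'
samples = ['Ha' * n for n in range(8)]

print(same_language(r'(Ha){3}', r'(Ha)(Ha)(Ha)', samples))  # True
print(same_language(
    r'(Ha){3,5}',
    r'((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))',
    samples))                                               # True

# Repetition counts accepted by (Ha){,5}
accepted = [n for n in range(8) if re.fullmatch(r'(Ha){,5}', 'Ha' * n)]
print(accepted)  # [0, 1, 2, 3, 4, 5]
```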

## Greedy and Nongreedy Matching

- (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa'
- Python’s regular expressions are greedy by default: in ambiguous situations they match the longest string possible

In [1]:
import re

In [3]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [5]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

The nongreedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.
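
The question mark therefore has two meanings in regular expressions: marking a group as optional, or declaring a nongreedy match. The same trailing question mark also makes the plus and the open-ended curly brackets nongreedy. A brief sketch (the variable names are illustrative):

```python
import re

text = 'HaHaHaHaHa'

greedyPlus = re.search(r'(Ha)+', text).group()
lazyPlus = re.search(r'(Ha)+?', text).group()
lazyOpenEnded = re.search(r'(Ha){3,}?', text).group()

print(greedyPlus)      # HaHaHaHaHa
print(lazyPlus)        # Ha
print(lazyOpenEnded)   # HaHaHa
```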

## The findall() Method

In [7]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

In [8]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

In [9]:
# If there are groups in the regular expression
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') 
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

1. When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].

2. When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
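
If parentheses are needed only for structure and you still want findall() to return whole matches, one option is a non-capturing group, written (?:...). This sketch goes slightly beyond the examples above and contrasts the three cases; the variable names are illustrative.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

noGroups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
withGroups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
# (?:...) groups without capturing, so findall() still returns strings.
nonCapturing = re.compile(r'(?:\d\d\d-){2}\d\d\d\d')

print(noGroups.findall(text))      # ['415-555-9999', '212-555-0000']
print(withGroups.findall(text))    # [('415', '555', '9999'), ('212', '555', '0000')]
print(nonCapturing.findall(text))  # ['415-555-9999', '212-555-0000']
```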

## Character Classes

![image.png](attachment:image.png)

In [8]:
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge , 3 2   ')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge',
 '3 2']

## Making Your Own Character Classes

In [9]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class matches all the characters that are not in the character class.

In [3]:
vowelRegex = re.compile(r'[^aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

- The character class [a-zA-Z0-9] matches all lowercase letters, uppercase letters, and digits: inside the brackets, a hyphen defines a range of characters.
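
A short sketch of the range syntax, using a made-up sample string. The second pattern negates the same ranges (plus the space character) with a caret, so it finds only the punctuation.

```python
import re

sample = 'Order #42: 3 x USB-C cables @ $9.99'

alnumRegex = re.compile(r'[a-zA-Z0-9]+')     # runs of letters and digits
otherRegex = re.compile(r'[^a-zA-Z0-9 ]')    # anything else except a space

print(alnumRegex.findall(sample))
# ['Order', '42', '3', 'x', 'USB', 'C', 'cables', '9', '99']
print(otherRegex.findall(sample))
# ['#', ':', '-', '@', '$', '.']
```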


## The Caret and Dollar Sign Characters

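A caret (^) at the start of a regex means the match must occur at the beginning of the searched text.

A dollar sign ($) at the end of a regex means the string must end with the pattern.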

In [5]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')

<re.Match object; span=(0, 5), match='Hello'>

In [6]:
beginsWithHello.search('He said hello.') == None

True

In [7]:
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')

<re.Match object; span=(16, 17), match='2'>

In [8]:
endsWithNumber.search('Your number is forty two.') == None

True

In [16]:
# The r'^\d+$' regular expression string matches strings that both begin and end with one or more numeric characters, that is, strings made up entirely of digits
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [13]:
wholeStringIsNum.search('12345xyz67890') == None

True

In [14]:
wholeStringIsNum.search('12 34567890') == None

True
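
One detail worth knowing: in Python, $ also matches just before a newline at the very end of the string, so the pattern ^\d+$ accepts a string of digits followed by a single trailing newline. If that matters, re.fullmatch() (available in Python 3.4 and later) is a stricter alternative. A minimal sketch:

```python
import re

wholeStringIsNum = re.compile(r'^\d+$')

# $ matches at the end of the string or just before a trailing newline.
searchAcceptsNewline = wholeStringIsNum.search('123\n') is not None
print(searchAcceptsNewline)    # True

# fullmatch() requires the pattern to cover the entire string.
fullmatchRejectsNewline = re.fullmatch(r'\d+', '123\n') is None
print(fullmatchRejectsNewline)  # True
print(re.fullmatch(r'\d+', '123') is not None)  # True
```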

## The Wildcard Character

The . (or dot) character in a regular expression is called a wildcard and will 
match any character except for a newline. 

In [18]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

## Matching Everything with Dot-Star

Sometimes you will want to match everything and anything. You can use the dot-star (.*) to stand in for that “anything.” Remember that the 
dot character means “any single character except the newline,” and the 
star character means “zero or more of the preceding character.”

In [20]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')

In [22]:
mo.group(1)

'Al'

In [23]:
mo.group(2)

'Sweigart'

The dot-star uses greedy mode: it will always try to match as much text as 
possible. To match any and all text in a nongreedy fashion, use the dot, star, 
and question mark (.*?).

In [24]:
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man>'

In [26]:
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man> for dinner.>'

## Matching Newlines with the Dot Character

In [29]:
noNewlineRegex = re.compile('.*')  # the dot stops at the first newline
mo = noNewlineRegex.search('<To serve man> \nfor dinner.>')
mo.group()

'<To serve man> '

In [30]:
newlineRegex = re.compile('.*', re.DOTALL)  # re.DOTALL lets the dot match newlines too
mo = newlineRegex.search('<To serve man> \nfor dinner.>')
mo.group()

'<To serve man> \nfor dinner.>'

## Review of Regex Symbols

![image.png](attachment:image.png)

## Case-Insensitive Matching

Normally, regular expressions match text with the exact casing you specify.

To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile().

In [32]:
robocop = re.compile(r'robocop', re.I)
robocop.search('RoboCop is part man, part machine, all cop.').group()


'RoboCop'

In [33]:
robocop.search('ROBOCOP protects the innocent.').group()

'ROBOCOP'

In [34]:
robocop.search('Al, why does your programming book talk about robocop so much?').group()

'robocop'

## Substituting Strings with the sub() Method

The sub() method for Regex objects is 
passed two arguments. The first argument is a string to replace any matches. 
The second is the string to search, in which the substitutions are made. The sub() method returns 
a new string with the substitutions applied.

In [35]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

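Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), \1, \2, \3, and so on stand for the text of group 1, 2, 3, and so on. For example, to censor the names of the secret agents by 
showing just the first letters of their names, you could use the 
regex Agent (\w)\w* and pass r'\1***' as the first argument to sub().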

In [48]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1***', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A*** told C*** that E*** knew B*** was a double agent.'

In [41]:
agentNamesRegex = re.compile(r'Agent (\w)(\w)\w*')
agentNamesRegex.sub(r'\2****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'l**** told a**** that v**** knew o**** was a double agent.'
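
The same backreference technique works for partial masking of other data. The sketch below hides all but the last four digits of a phone number; the pattern, variable names, and sample text are illustrative. It also shows the \g<1> form, which keeps the group reference unambiguous when a digit must directly follow it.

```python
import re

maskRegex = re.compile(r'\d{3}-\d{3}-(\d{4})')
masked = maskRegex.sub(r'***-***-\1', 'Cell: 415-555-9999 Work: 212-555-0000')
print(masked)   # Cell: ***-***-9999 Work: ***-***-0000

# \g<1>0 means "group 1 followed by a literal 0"; \10 would mean group 10.
timesTen = re.sub(r'(\d+)', r'\g<1>0', 'x=5')
print(timesTen)  # x=50
```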

## Managing Complex Regexes

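Complicated regular expressions are easier to read when spread over multiple lines with comments. You can tell the re.compile() function 
to ignore whitespace and comments inside the regular expression string. 
This “verbose mode” can be enabled by passing the variable re.VERBOSE as 
the second argument to re.compile().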

In [2]:
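# Compact form: hard to read on a single line.
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

# Equivalent verbose form: whitespace and # comments inside the pattern are ignored.
# The dot in ext\. is escaped so that it matches only a literal period.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
    )''', re.VERBOSE)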
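
A quick way to gain confidence that a verbose rewrite matches the compact original is to run both against a few sample strings. This is only a sketch: the helper first_match and the sample strings are assumptions, and the dot in ext. is escaped in both patterns.

```python
import re

compactRegex = re.compile(
    r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

verboseRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
    )''', re.VERBOSE)

def first_match(regex, text):
    """Return the text of the first match, or None if nothing matches."""
    mo = regex.search(text)
    return mo.group() if mo else None

samples = ['Call 415-555-4242 today', '(415) 555-4242 x123',
           '555.4242 ext. 99', 'no number']

for s in samples:
    # Both forms should agree on every sample.
    print(first_match(compactRegex, s) == first_match(verboseRegex, s))
```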

## Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

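The re.compile() function takes only a single value as its second argument. To use re.VERBOSE together with other options, you can get around this limitation by combining the re.IGNORECASE, re.DOTALL, and 
re.VERBOSE variables using the pipe character (|), which in this context is 
known as the bitwise or operator.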

In [3]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

In [4]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
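
To see all three options take effect at once, here is a small sketch with a made-up pattern and input. The comments and indentation inside the pattern are allowed by re.VERBOSE, the casing is ignored by re.IGNORECASE, and the dot matches the newline because of re.DOTALL.

```python
import re

greetingRegex = re.compile(r'''
    hello   # first word, any casing
    .       # any single character, including a newline
    world   # second word
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

mo = greetingRegex.search('HELLO\nWorld')
print(mo.group() == 'HELLO\nWorld')   # True

# Without re.DOTALL the dot cannot match the newline, so there is no match.
withoutDotall = re.compile(r'hello.world', re.IGNORECASE)
print(withoutDotall.search('HELLO\nWorld') is None)   # True
```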

## Phone Number and Email Address Extractor

The phone number and email address extractor needs to do the following:

- Get the text off the clipboard.
- Find all phone numbers and email addresses in the text.
- Paste them onto the clipboard.

Now you can start thinking about how this might work in code. The 
code will need to do the following:

- Use the pyperclip module to copy and paste strings.
- Create two regexes, one for matching phone numbers and the other for matching email addresses.
- Find all matches, not just the first match, of both regexes.
- Neatly format the matched strings into a single string to paste.
- Display some kind of message if no matches were found in the text.



In [27]:
# Phone number and email address extractor
import re

import pyperclip  # third-party module for clipboard access

# Regex for phone numbers
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                 # area code
    (\s|-|\.)?                         # separator
    (\d{3})                            # first 3 digits
    (\s|-|\.)                          # separator
    (\d{4})                            # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

# Regex for email addresses
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    # groups[1] is the area code, groups[3] the first 3 digits, groups[5] the last 4.
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':  # groups[8] holds the extension digits, if any
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])  # groups[0] is the whole email address

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

Copied to clipboard:
800-420-7240
415-863-9900
415-863-9950
info@nostarch.com
media@nostarch.com
academic@nostarch.com
help@nostarch.com
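
For testing without a clipboard, the matching logic can be separated from pyperclip. The following is a sketch under that assumption, not a replacement for the program above: the function name extract_contacts, the constant names, and the sample text are hypothetical, and the dot in ext. is escaped.

```python
import re

PHONE_REGEX = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                 # area code
    (\s|-|\.)?                         # separator
    (\d{3})                            # first 3 digits
    (\s|-|\.)                          # separator
    (\d{4})                            # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

EMAIL_REGEX = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

def extract_contacts(text):
    """Return formatted phone numbers followed by email addresses found in text."""
    matches = []
    for groups in PHONE_REGEX.findall(text):
        phone_num = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':              # extension digits, if present
            phone_num += ' x' + groups[8]
        matches.append(phone_num)
    for groups in EMAIL_REGEX.findall(text):
        matches.append(groups[0])        # the whole email address
    return matches

sample = 'Sales: 800.420.7240, support 415-863-9900 ext 12, info@nostarch.com'
print(extract_contacts(sample))
# ['800-420-7240', '415-863-9900 x12', 'info@nostarch.com']
```

Keeping the clipboard calls outside the function means the regexes can be exercised with plain strings, and pyperclip.paste() and pyperclip.copy() can be wrapped around it afterwards.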
