# For this program to work, you have to have something copied to your clipboard (CTRL+C).
For an example, copy the text of the page below:

https://nostarch.com/contactus/


# Goal: Find every phone number and email address in a long web page or document

The same approach also works for other kinds of text patterns.

# Steps: Create Our Road Map (Big-Picture Thinking)
* Get the text off the clipboard.
* Find all phone numbers and email addresses in the text.
* Paste them onto the clipboard.

Now you can start thinking about how this might work in code. The code will need to do the following:

* Use the pyperclip module to copy and paste strings.
* Create two regexes, one for matching phone numbers and the other for matching email addresses.
* Find all matches, not just the first match, of both regexes.
* Neatly format the matched strings into a single string to paste.
* Display some kind of message if no matches were found in the text.
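
The road map above can be sketched as two small functions before any real regexes exist. This is a minimal sketch, not the program's final code: `find_all_matches` and `format_result` are hypothetical names, the two toy patterns are simplified stand-ins for the regexes built in the next steps, and the clipboard calls are left out so the logic can be tried on a plain string.

```python
import re

# Toy stand-ins (assumptions) for the real phone and email regexes.
toy_phone = re.compile(r'\d{3}-\d{4}')
toy_email = re.compile(r'[\w.]+@[\w.]+\.com')


def find_all_matches(text, patterns):
    """Collect every match of every pattern, not just the first one."""
    found = []
    for pattern in patterns:
        for match in pattern.finditer(text):
            found.append(match.group(0))
    return found


def format_result(found):
    """Join matches into one string, or report that nothing matched."""
    if found:
        return '\n'.join(found)
    return 'No phone numbers or email addresses found'


sample = 'Call 555-1234 or 555-9876, or write to bob@example.com.'
print(format_result(find_all_matches(sample, [toy_phone, toy_email])))
print(format_result(find_all_matches('nothing here', [toy_phone, toy_email])))
```

Keeping the matching and formatting separate from the clipboard makes each piece easy to try on its own; the real program swaps in `pyperclip.paste()` for the sample string and `pyperclip.copy()` for the first `print`.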

# Import the Modules Needed

In [1]:
import pyperclip, re

# Step 1: Create a Regex for Phone Numbers

In [2]:
phoneRegex = re.compile(r"""(
(\d{3}|\(\d{3}\))? # area code
(\s|-|\.)? # separator
(\d{3}) # first three digits
(\s|-|\.) # separator
(\d{4}) # last four digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
)""", re.VERBOSE)

#### Breakdown of the code above:
The phone number begins with an optional area code, so the area code group is followed by a question mark. Since the area code can be just three digits (that is, \d{3}) or three digits within parentheses (that is, \(\d{3}\)), you should have a pipe joining those parts. You can add the regex comment # area code to this part of the multiline string to help you remember what (\d{3}|\(\d{3}\))? is supposed to match.

The phone number separator character can be a space (\s), hyphen (-), or period (.), so these parts should also be joined by pipes. The next few parts of the regular expression are straightforward: three digits, followed by another separator, followed by four digits. The last part is an optional extension made up of any number of spaces followed by ext, x, or ext., followed by two to five digits.
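
To see the pieces described above at work, here is a small sketch that runs the pattern against a few made-up numbers. It assumes the form of the regex used in the combined listing at the end (one outer group around the whole pattern and a required second separator), with the dot in `ext\.` escaped so it only matches a literal period. The names `phone_regex` and `samples` are illustrative and not part of the program.

```python
import re

phone_regex = re.compile(r"""(
    (\d{3}|\(\d{3}\))?                # area code, with or without parentheses
    (\s|-|\.)?                        # optional separator
    (\d{3})                           # first three digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # optional extension
)""", re.VERBOSE)

# Made-up sample numbers in several formats.
samples = ['415-863-9900', '(415) 863-9900', '415.863.9900 ext. 12',
           '863-9900', '8639900']

for sample in samples:
    match = phone_regex.search(sample)
    if match is None:
        print(f'{sample!r}: no match')
    else:
        # group(2) is the area code and group(9) the extension digits;
        # both are None when that optional part is absent.
        print(f'{sample!r}: area={match.group(2)!r}, ext={match.group(9)!r}')
```

The last sample has no separator before the final four digits, so it does not match.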

# Step 2: Create a Regex for Email Addresses

In [3]:
emailRegex = re.compile(r"""(
[a-zA-Z0-9._%+-]+ #username->1
@                # the @ symbol->2
[a-zA-Z0-9.-]+   # domain name->3
(\.[a-zA-Z]{2,4})# dot something->last part
)""", re.VERBOSE)

#### Breakdown of the code above:
The username part of the email address ➊ is one or more characters that can be any of the following: lowercase and uppercase letters, numbers, a dot, an underscore, a percent sign, a plus sign, or a hyphen. You can put all of these into a character class: [a-zA-Z0-9._%+-].

The domain and username are separated by an @ symbol ➋. The domain name ➌ has a slightly less permissive character class with only letters, numbers, periods, and hyphens: [a-zA-Z0-9.-]. Last comes the “dot-com” part (technically known as the top-level domain), which can really be dot-anything. This regex limits it to between two and four letters.

The format for email addresses has a lot of weird rules. This regular expression won’t match every possible valid email address, but it’ll match almost any typical email address you’ll encounter.
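
A short sketch of what this pattern does and does not catch. The sample text and the names `email_regex` and `found` are made up for illustration; the pattern itself is the one above.

```python
import re

email_regex = re.compile(r"""(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
)""", re.VERBOSE)

text = ('Write to info@nostarch.com or first.last+tag@mail.example.org; '
        '"user@" alone is skipped.')

# findall returns one tuple per match; index 0 holds the whole address.
found = [groups[0] for groups in email_regex.findall(text)]
print(found)

# A top-level domain longer than four letters is only partly matched.
partial = email_regex.search('curator@art.museum').group(0)
print(partial)
```

The second `print` illustrates the limitation mentioned above: the {2,4} limit cuts the address off after four letters of the top-level domain.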

# Step 3: Find All Matches in the Clipboard Text

In [4]:
text = str(pyperclip.paste())

matches = [] # 1
for groups in phoneRegex.findall(text): # 2
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != "":
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text): # 3
    matches.append(groups[0])

#### Breakdown of the code above:
When a regex has groups, findall() returns one tuple for each match, and each tuple contains a string for each group in the regular expression. When the whole pattern is wrapped in an outer group, as emailRegex is, index 0 of the tuple holds the entire match, so that is the one you are interested in.

As you can see at ➊, you’ll store the matches in a list variable named matches. It starts off as an empty list, and a couple of for loops add to it. For the email addresses, you append group 0 of each match ➌. For the matched phone numbers, you don’t want to just append group 0. While the program detects phone numbers in several formats, you want the phone number appended to be in a single, standard format. The phoneNum variable contains a string built from groups 1, 3, 5, and 8 of the matched text ➋. (These groups are the area code, first three digits, last four digits, and extension.)
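
The way findall() changes its return value when a pattern has groups is easy to miss, so here is a minimal sketch with a deliberately simplified pattern (an assumption for illustration, not the program's phoneRegex). The names `plain`, `grouped`, and `normalized` are hypothetical.

```python
import re

text = 'Call 555-1234 or 555.9876'

# No groups: findall returns a list of strings.
plain = re.findall(r'\d{3}[-.]\d{4}', text)

# With groups: findall returns a list of tuples, one string per group.
# Index 0 is the outer group, so it holds the whole match.
grouped = re.findall(r'((\d{3})[-.](\d{4}))', text)

# Rebuild every number in one standard format from the inner groups.
normalized = ['-'.join([groups[1], groups[2]]) for groups in grouped]

print(plain)
print(grouped)
print(normalized)
```

The same idea is what lets the program accept several phone formats and still append every number in one hyphenated form.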

# Step 4: Join the Matches into a String for the Clipboard

In [5]:
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

No phone numbers or email addresses found


# Put It All Together and Run the Code in One Go

In [14]:
import pyperclip, re

phoneRegex = re.compile(r"""(
(\d{3}|\(\d{3}\))? # area code
(\s|-|\.)? # separator
(\d{3}) # first three digits
(\s|-|\.) # separator
(\d{4}) # last four digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
)""", re.VERBOSE)

emailRegex = re.compile(r"""(
[a-zA-Z0-9._%+-]+ #username->1
@                # the @ symbol->2
[a-zA-Z0-9.-]+   # domain name->3
(\.[a-zA-Z]{2,4})# dot something->last part
)""", re.VERBOSE)

text = str(pyperclip.paste())

matches = [] # 1
for groups in phoneRegex.findall(text): # 2
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != "":
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
    
for groups in emailRegex.findall(text): # 3
    matches.append(groups[0])
    
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

Copied to clipboard:
800-420-7240
415-863-9900
415-863-9950
info@nostarch.com
media@nostarch.com
academic@nostarch.com
conferences@nostarch.com
info@nostarch.com
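
The output above lists info@nostarch.com twice, once for each time it was matched in the copied text. If repeats are not wanted, one possible addition (not part of the program above) is to drop them while keeping first-seen order before joining. The list below is a shortened, hand-typed copy of the matches for illustration, and `unique_matches` is a hypothetical name.

```python
# Shortened copy of the matches list, including the repeated address.
matches = ['800-420-7240', 'info@nostarch.com',
           'media@nostarch.com', 'info@nostarch.com']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats without reordering the rest.
unique_matches = list(dict.fromkeys(matches))
print('\n'.join(unique_matches))
```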


# Practice questions on flash cards can be found here