# Pattern matching with regular expressions

You may be familiar with searching for text by pressing CTRL-F and typing in the words you’re looking for. Regular expressions go one step further: they allow you to specify a pattern of text to search for. You may not know a business’s exact email address, but you do know it will consist of one or more names, possibly with some special characters between them, followed by an at symbol, then a domain name and a top-level domain (.com, .se). This is how you, as a human, know an email address when you see one: dennis.hedlund@klarna.com is an email address, but dennis,hedlund!@klarna.not.valid is not.

Regular expressions are helpful, but not many non-programmers know about them even though most modern text editors and word processors, such as Microsoft Word or OpenOffice, have find and find-and-replace features that can search based on regular expressions. Regular expressions are huge time-savers, not just for software users but also for programmers. In fact, tech writer Cory Doctorow argues that even before teaching programming, we should be teaching regular expressions:

>“Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.”[1]

## Finding Patterns of Text Without Regular Expressions

Say you want to find a phone number in a string. You know the pattern: three numbers, a hyphen, three numbers, a hyphen, and four numbers. Here’s an example: 415-555-4242.

Let’s use a function named isPhoneNumber() to check whether a string matches this pattern, returning either True or False.

In [5]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdigit():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdigit():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdigit():
            return False
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False


The isPhoneNumber() function has code that does several checks to see whether the string in text is a valid phone number. If any of these checks fail, the function returns False. 

You would have to add even more code to find this pattern of text in a larger string. <br>Let's modify our code a little:

In [6]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdigit():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdigit():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdigit():
            return False
    return True

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done


On each iteration of the for loop, a new chunk of 12 characters from message is assigned to the variable chunk. For example, on the first iteration i is 0 and chunk is 'Call me at 4'. We then pass that chunk to isPhoneNumber() to check whether it is a valid phone number.<br>

While the string in message is short in this example, it could be millions of characters long and the program would still run in less than a second. A similar program that finds phone numbers using regular expressions would also run in less than a second, but regular expressions make it quicker to write these programs.

## Finding Patterns of Text with Regular Expressions

The previous phone number–finding program works, but it uses a lot of code to do something limited: The isPhoneNumber() function is 17 lines but can find only one pattern of phone numbers. What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail to validate them. You could add yet more code for these additional patterns, but there is an easier way.

Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral from 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.

But regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets `{3}` after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format. <br>
Example:

In [7]:
import re
def isPhoneNumber(text):
    # Returns a list of every phone number found in text (empty if none).
    phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
    phoneNumbers = phoneNumRegex.findall(text)
    return phoneNumbers

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
phoneNumbers = isPhoneNumber(message)
print(phoneNumbers)

['415-555-1011', '415-555-9999']


## Creating Regex Objects

All the regex functions in Python are in the re module. To use it, we first need to import it by putting `import re` at the top of our code.

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

To create a Regex object that matches the phone number pattern, pass the pattern to re.compile() and store the result in a variable:

In [10]:
import re
# Option 1: spell out every single digit the pattern should match.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Option 2: use {n} to repeat the preceding \d exactly n times.
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')

Now the phoneNumRegex variable contains a Regex object.
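
A Regex object does nothing on its own until you search a string with it. As a minimal sketch (the sample strings are only illustrative), the search() method returns a Match object when the pattern is found and None when it is not, and the Match object's group() method returns the matched text:

```python
import re

phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')

# search() returns a Match object for the first match, or None if there is none.
mo = phoneNumRegex.search('My number is 415-555-4242.')
if mo is not None:
    print('Phone number found: ' + mo.group())

# This string contains no phone number, so search() returns None.
no_match = phoneNumRegex.search('Moshi moshi')
print(no_match)
```

Checking for None before calling group() avoids an AttributeError when the pattern is not found.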

Escape characters in Python use the backslash (\). The string value '\n' represents a single newline character, not a backslash followed by a lowercase n. You need to enter the escape character \\ to print a single backslash. So '\\n' is the string that represents a backslash followed by a lowercase n. However, by putting an r before the first quote of the string value, you can mark the string as a raw string, which does not escape characters.

Since regular expressions frequently use backslashes in them, it is convenient to pass raw strings to the re.compile() function instead of typing extra backslashes. Typing r'\d\d\d-\d\d\d-\d\d\d\d' is much easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.
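
The difference between ordinary and raw strings can be checked directly. The following is a small sketch (the variable names are chosen only for illustration) showing that a raw string keeps its backslashes, and that the raw and doubled-backslash spellings of the phone number pattern are the same string:

```python
# '\n' is one character (a newline); r'\n' is two (a backslash and an n).
newline_length = len('\n')
raw_length = len(r'\n')
print(newline_length)
print(raw_length)

# Both spellings produce exactly the same pattern string.
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(raw_pattern == escaped_pattern)
```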

## Grouping with Parentheses

Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. 

In [11]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group(1))
print(mo.group(2))
print(mo.group(0))

415
555-4242
415-555-4242
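
To retrieve all the groups at once, the Match object also has a groups() method (note the plural), which returns a tuple. The sketch below reuses the pattern above. The second pattern is an illustrative assumption showing that parentheses escaped with a backslash match literal parentheses in the text instead of creating a group:

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

# groups() returns a tuple with the text of every group, in order.
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

# \( and \) match literal parentheses; the unescaped pair still forms a group.
parenRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo2 = parenRegex.search('My phone number is (415) 555-4242.')
print(mo2.group(1))
```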


## Matching Multiple Groups with the Pipe

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object. <br>
Example:

In [13]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())

mo2 = heroRegex.search('Tina Fey and Batman.')
print(mo2.group())

Batman
Tina Fey


<b>NOTE</b><br>
You can find all matching occurrences with the findall() method.
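
As a brief sketch of the note above, calling findall() on the same pattern returns a list of every match in the order it appears in the string, rather than only the first one:

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

# findall() returns a list of all non-overlapping matches, left to right.
first = heroRegex.findall('Batman and Tina Fey.')
print(first)

second = heroRegex.findall('Tina Fey and Batman.')
print(second)
```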

You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done with parentheses. <br>
Example:

In [15]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile


The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

# Full list of regexp symbols available for matching

* The ? matches zero or one of the preceding group.<br>
    * Example: 'Bat(man)?' matches Bat and Batman

* The * matches zero or more of the preceding group.<br>
    * Example: 'Bat(man)*' matches Bat, Batman, Batmanman, Batmanmanman and so on.
    
* The + matches one or more of the preceding group.
    * Example: 'Bat(man)+' matches Batman, Batmanman, Batmanmanman and so on, but not Bat on its own.

* The {n} matches exactly n of the preceding group.
    * Example: 'Bat(man){2}' matches Batmanman and nothing else.

* The {n,} matches n or more of the preceding group.
    * Example: 'Bat(man){2,}' matches Batmanman, Batmanmanman and so on, but not Bat or Batman.

* The {,m} matches 0 to m of the preceding group.
    * Example: 'Bat(man){,2}' matches Bat, Batman and Batmanman.

* The {n,m} matches at least n and at most m of the preceding group.
    * Example: 'Bat(man){1,2}' matches Batman and Batmanman, but not Bat.

* {n,m}? or *? or +? performs a nongreedy match of the preceding group.

* ^spam means the string must begin with spam.

* spam$ means the string must end with spam.

* The . matches any character, except newline characters.

* \d, \w, and \s match a digit, word, or space character, respectively.

* \D, \W, and \S match anything except a digit, word, or space character, respectively.

* [abc] matches any character between the brackets (such as a, b, or c).
    * You can define "sets" of characters: 
    * [a-z] matches lower case characters a to z
    * [A-Z] matches upper case characters A to Z
    * [0-9] matches digits 0 to 9 
    * [a-zA-Z0-9] matches a to z, A to Z and 0 to 9. 
    * [/$] matches the characters / and $. Inside brackets most special characters, including $, are treated literally, so they do not need to be escaped with \

* [^abc] matches any character that isn’t between the brackets.
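
A few of the symbols in the list can be tried out as follows. This is a small sketch with made-up sample strings, not an exhaustive demonstration. Each result is stored in a variable so it can be inspected:

```python
import re

# ? is greedy: the optional group is included when it is present.
optional = re.search(r'Bat(man)?', 'The Batman returns').group()
print(optional)                      # Batman

# + needs at least one repetition, so 'Bat' alone does not match.
one_or_more = re.search(r'Bat(man)+', 'Bat')
print(one_or_more)                   # None

# {3,5} is greedy by default; adding ? makes it match as little as possible.
greedy = re.search(r'(Ha){3,5}', 'HaHaHaHaHa').group()
nongreedy = re.search(r'(Ha){3,5}?', 'HaHaHaHaHa').group()
print(greedy)                        # HaHaHaHaHa
print(nongreedy)                     # HaHaHa

# ^ anchors the match to the start of the string, $ to the end.
starts = re.search(r'^spam', 'spam and eggs') is not None
ends = re.search(r'spam$', 'spam and eggs') is not None
print(starts, ends)                  # True False

# Character classes: \d+ finds runs of digits, [^aeiou] finds non-vowels.
numbers = re.findall(r'\d+', 'Room 101, floor 3')
non_vowels = re.findall(r'[^aeiou]', 'regex')
print(numbers)                       # ['101', '3']
print(non_vowels)                    # ['r', 'g', 'x']
```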

<b>Task 1:</b> Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9). 
<br>You will need to use re.match() https://docs.python.org/2/library/re.html#re.match<br>
Try validating string1 and string2.


In [None]:
import re
string1 = "Pneumonoultramicroscopicsilicovolcanoconiosis"
string2 = "Hello World!"
#Insert code here


<b>Task 2:</b> Write a Python program that validates an email address.<br>
A valid email address has four parts:
* Recipient name
    * Uppercase and lowercase letters in English (A-Z, a-z)
    * Digits from 0 to 9
    * One or more words separated by . or -
* @ symbol
    * Must come after the recipient name
* Domain name
    * Uppercase and lowercase letters in English (A-Z, a-z)
    * Digits from 0 to 9 
    * A hyphen (-)
    * A period (.)  (used to identify a sub-domain; for example,  email.domainsample)
* Top-level domain (match only one of the following top-level domains)
    * .com
    * .net
    * .org
    * .se
    
<br><b>Talk to your neighbor and try to solve this exercise together; two heads are better than one.</b>

In [18]:
#Insert code here
