## Pattern Matching with Regular Expressions

In the big data era, we often need to search for or retrieve useful information every day. In general, **information retrieval** (IR) is the activity of obtaining relevant information from a (commonly massive) collection of information resources. Searches can be based on full-text or other content-based indexing (as in a search engine). Information retrieval is the science of searching for information in a document, searching for documents themselves, and searching for the metadata that describes data, as well as searching databases of texts (text retrieval), videos (video retrieval), images (image retrieval), or sounds (music retrieval).

There are many ways to find useful information. The manual approach (e.g., searching a webpage by pressing CTRL-F and entering the targeted word) is often used, but it is only practical when the amount of information to be searched is small.

Sometimes, we need to find exact information (e.g., a name). More often, we need to find information that follows certain rules or a pattern (e.g., an email address). Let's start with a simple example.



## Search a string

The Python string method find() determines whether the substring str occurs in a string, or in the part of the string between the starting index beg and the ending index end if they are given. Please note that find() is case sensitive.

Syntax:
```
string.find(str, beg=0, end=len(string))
```
Parameters:
```
str − The substring to be searched for.

beg − The starting index; by default it is 0.

end − The ending index; by default it is equal to the length of the string.
```
Return Value:
```
The index of the first occurrence if found, and -1 otherwise.
```

In [None]:
msg = "We are learning Python online this semester."
target = "online"
index = msg.find(target)
if index != -1:
    print("The target \"" + target + "\" found at the index: " + str(index))
else: print("Not found!")

The target "online" found at the index: 23

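The optional start and end indexes and the case sensitivity mentioned above are easy to overlook. The following is a minimal sketch, reusing the same message; the variable names `first_e` and `next_e` are introduced here only for illustration.

```python
msg = "We are learning Python online this semester."

# find() is case sensitive: lower-case "python" does not occur in msg.
print(msg.find("python"))          # -1
# Lower-casing the message first is one simple way to ignore case.
print(msg.lower().find("python"))  # 16

# The optional start index skips the part of the string before it.
first_e = msg.find("e")              # 1, the "e" in "We"
next_e = msg.find("e", first_e + 1)  # 5, the "e" in "are"
print(first_e, next_e)

# With an end index, only msg[0:20] is searched.
print(msg.find("online", 0, 20))   # -1, because "online" starts at index 23
```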

There are many different ways to check whether a target is contained in a given string. Do you still remember the "in" and "not in" operators? Please note that the result is a Boolean value: either **True** or **False**.

In [None]:
msg = "We are learning Python online this semester."
target = "online"
if target in msg:
    print("Found")
else: print("Not found!")

Found


## Pattern Matching Algorithm

Sometimes, you need to search for a type of information rather than one exact, specific piece of information. For example, we may need to find any or all phone numbers, IP addresses, or email addresses in a given text. The challenge here is how to describe what we want to search for, which is the so-called "**pattern**".

Let's start with a relatively easy example: finding a North American phone number, such as 309-438-8000. Here, we simply assume that all phone numbers start with three digits (i.e., the area code), followed by a hyphen "-", then another three digits, followed by another hyphen, and finally the last four digits. Such a description of what a phone number looks like is actually a definition of a pattern.

Clearly, finding information that matches a pattern in a given string is much more challenging than finding an exactly matched substring. We need to develop a specific pattern matching algorithm for such a task. Please note that developing such an algorithm has nothing to do with Python. We can implement an algorithm in different programming languages. Of course, we will use Python here.

### How shall we develop an algorithm?

An algorithm is simply the sequence of steps (just like a recipe for cooking) that solves a problem. But how do we find these steps? This is one of the most critical skills every professional must master.

There is no formula we can always follow to find such a solution. However, there are some good approaches or methods we can follow. One of them is called "Divide-and-Conquer". This divide-and-conquer technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g., quicksort), pattern matching, and syntactic analysis (e.g., top-down parsers).

#### Divide and Conquer

In computer science, divide and conquer is an algorithm design paradigm based on multi-branched recursion. A divide-and-conquer approach works by recursively breaking down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.
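
To make the divide, conquer, and combine steps concrete, here is a minimal sketch of merge sort, a classic divide-and-conquer sorting algorithm. It is chosen here only as an illustration (it is not the quicksort mentioned earlier), and the function name `merge_sort` is introduced for this example.

```python
def merge_sort(items):
    """Return a new sorted list using divide and conquer."""
    # Base case: a list with 0 or 1 items is already sorted.
    if len(items) <= 1:
        return list(items)

    # Divide: split the list into two halves and sort each one recursively.
    middle = len(items) // 2
    left = merge_sort(items[:middle])
    right = merge_sort(items[middle:])

    # Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged

print(merge_sort([8, 3, 5, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```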

Understanding and designing divide-and-conquer algorithms is a complex skill that requires a good understanding of the nature of the underlying problem to be solved. It is often necessary to replace the original problem with a more general or complicated problem in order to initialize the recursion, and there is no systematic method for finding the proper generalization. The good news is that for many problems, like finding a phone number, our intuition is usually sufficient for tackling such a challenge. Let's give it a try!



### The Solution without Regular Expression

We denote the algorithm, or function, for finding a phone number as **findPhoneNumber**. The basic idea is to divide the original problem into multiple subproblems, each of which verifies one condition. If all conditions are satisfied, we can claim that we have found a phone number.

Let's take a close look at a phone number (e.g., 309-438-8000) to see all the intuitive conditions a phone number has to satisfy:
1.   A phone number is a 12-character long string;
2.   The first three characters are digits;
3.   The fourth character is a hyphen;
4.   The next three characters are digits;
5.   The eighth character is a hyphen;
6.   The last four characters are digits.

Obviously, each subproblem above is much easier to solve. Let's confirm this by testing the following examples.





In [None]:
#Check the length of a string
msg = "A phone number is a 12-character long string;"
print("The length of this message is: " + str(len(msg)))

The length of this message is: 45


In [None]:
#Check if a single char is a digit
msg = "1"
if msg.isdecimal():
    print("It is a digit.")
else: print("It is NOT a digit.")

It is a digit.


In [None]:
#Check if all three characters are all digits
msg = "309"
result = True
for c in msg:
    if not c.isdecimal():
        result = False #as long as one char is NOT digit, return False.

if result: 
    print("All three characters are digits")
else:     print("NOT all three characters are digits")


All three characters are digits


In [None]:
msg = "-"
result = False
if msg == "-":
    result = True

if result:
    print("It's a hyphen.")
else: print("It's NOT a hyphen.")

It's a hyphen.


So far, it seems we have solved all the subproblems (breaking the problem down is the "divide" step; solving each piece is the "conquer" step). Turning back to the original problem, how shall we find all phone numbers in a given string?

Since we don't know where a phone number appears in the string, we have to check each 12-character chunk (often called a fixed-width observation window), starting from the beginning of the string and sliding the window through the whole string. String slicing makes this work easy.



In [None]:
msg = "0123456789ABCDEF"
for i in range(len(msg) - 11):
    print(msg[i:i+12])

0123456789AB
123456789ABC
23456789ABCD
3456789ABCDE
456789ABCDEF


In [None]:
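def findPhoneNumber(text):
    #Condition 1: the chunk must be exactly 12 characters long
    if len(text) != 12:
        return False
    #Condition 2: the first three characters are digits
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    #Condition 3: the fourth character is a hyphen
    if text[3] != '-':
        return False
    #Condition 4: the next three characters are digits
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    #Condition 5: the eighth character is a hyphen
    if text[7] != '-':
        return False
    #Condition 6: the last four characters are digits
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True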

In [None]:
msg = "My office number is 309-438-8000. My cell phone is 288-888-8888."
for i in range(len(msg) - 11):
    chunk = msg[i:i+12]    #named chunk rather than str, so the built-in str() is not overwritten
    if findPhoneNumber(chunk):
        print("A phone number found: " + chunk)

A phone number found: 309-438-8000
A phone number found: 288-888-8888


Although it took quite a lot of effort, we seem to have got the job done. One drawback is that we have to hard-code all the searching rules. Accordingly, if the searching rules change, we have to modify the code. For example, if a phone number can also be written as "309.438.8000" or "(309)-438-8000" or "(309)438-8000", we would have to make a big change to the original code. Is there a better way to do this job? Fortunately, yes: we can use a **Regular Expression**.

## Regular Expression
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language. You may regard a regular expression as a special sequence of characters, written in a specialized syntax, that defines a search pattern to help you match or find such a pattern in other strings.

The regular expression language is independent of any specific programming language. However, its functionality has been incorporated or embedded into most (if not all) high-level programming languages, such as Python, Java, Perl, and C#. More specifically, regular expressions are embedded inside Python and made available through the **re** module.

Using this little language, you specify the rules for the set of possible strings that you want to match. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

In the following, we will use the same example above to show how we can find a phone number using Regular Expression. 


### How to define a phone number pattern with Regular Expression

As a different language, a regular expression has its own symbols and syntax. For example, a phone number can be defined via a regular expression as: \d\d\d-\d\d\d-\d\d\d\d. We can guess that \d stands for a digit.

Please note that the backslash "\" is commonly used in regular expressions. Since the backslash is also used to form escape characters in Python strings (e.g., \n stands for a new line), we need to use raw strings to represent regular expressions. In a raw string, Python does not process escape sequences, so every backslash that appears in the string is kept as a literal character. You can place an r before the beginning quotation mark of a string to make it a raw string. For example:
```
phoneNumberRE = r"\d\d\d-\d\d\d-\d\d\d\d"
```
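To see what the r prefix changes, here is a small sketch comparing an ordinary string with a raw string; the variable names `plain` and `raw` are introduced only for this illustration.

```python
plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash followed by the letter n
print(len(plain), len(raw))   # 1 2

# Without the r prefix, every backslash would have to be doubled.
print("\\d" == r"\d")         # True
```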
Let's take a look at the complete implementation:




In [None]:
import re

msg = "My office number is 309-438-8000. My cell phone is 288-888-8888."

phoneNumberRE = r"\d\d\d-\d\d\d-\d\d\d\d"
phoneNumberREObj = re.compile(phoneNumberRE)
matchList = phoneNumberREObj.findall(msg)

for match in matchList:
    print("A phone number found: " + match)

A phone number found: 309-438-8000
A phone number found: 288-888-8888


The code above can also be written in a more concise way (it is essentially the same):

In [None]:
import re

msg = "My office number is 309-438-8000. My cell phone is 288-888-8888."

matchList = re.findall(r"\d\d\d-\d\d\d-\d\d\d\d", msg)

for match in matchList:
    print("A phone number found: " + match)

A phone number found: 309-438-8000
A phone number found: 288-888-8888


Using a regular expression in Python usually involves the following steps:
1.   Import the regex module with **import re**.
2.   Define the targeted pattern using a raw string: **phoneNumberRE = r"\d\d\d-\d\d\d-\d\d\d\d"**
3.   Create a Regex object with the re.compile() function: **phoneNumberREObj = re.compile(phoneNumberRE)**
4.   Pass the string you want to search into one of the Regex object’s methods (e.g., findall, search, match):  **matchList = phoneNumberREObj.findall(msg)**
5.   Process the result. Different methods of a Regex object return different data types (e.g., findall returns a list, while search and match return a Match object or None), so the result should be handled accordingly.




## Common Methods in a RE Object

The commonly used methods of a RE object are:
*   **findall()**: Performs a global search over the whole string. Returns a list containing all matches, or an empty list if no matches are found.
*   **search()**: Returns a Match object if there is a match anywhere in the string, or None otherwise.
*   **match()**: Returns a Match object if there is a match at the beginning of the string, or None otherwise.



In [None]:
import re

msg1 = "My office number is 309-438-8000. My cell phone is 288-888-8888."
msg2 = "309-438-8000 is my office number. My cell phone is 288-888-8888."

#In the RE below, the three embedded () defines three subgroups; each subgroup is also given an optional name (e.g., areaCode)
phoneNumberRE = r"(?P<areaCode>\d\d\d)-(?P<branchID>\d\d\d)-(?P<lineID>\d\d\d\d)" 
phoneNumberREObj = re.compile(phoneNumberRE)
matchedObj1 = phoneNumberREObj.match(msg1)
matchedObj2 = phoneNumberREObj.search(msg1)
matchedObj3 = phoneNumberREObj.match(msg2)

if matchedObj1 is not None:
    print("A phone number found: " + matchedObj1.group())
else: print("No phone number found.")

if matchedObj2:             #The same as: matchedObj2 is not None
    print("A phone number found: " + matchedObj2.group())
else: print("No phone number found.")

if matchedObj3 is not None:
    print("A phone number found: " + matchedObj3.group())
else: print("No phone number found.")

print(matchedObj3.group())          #return the complete match
print(matchedObj3.group(0))         #return the complete match
print(matchedObj3.groups())         #return all matched subgroups
print(len(matchedObj3.groups()))    #return the total number of matched subgroups
print(matchedObj3.group(1))         #return 1st matched subgroups
print(matchedObj3.group(2))         #return 2nd matched subgroups
print(matchedObj3.group(3))         #return 3rd matched subgroups
print(matchedObj3.groupdict())      #return subgroup dict key:value items

No phone number found.
A phone number found: 309-438-8000
A phone number found: 309-438-8000
309-438-8000
309-438-8000
('309', '438', '8000')
3
309
438
8000
{'areaCode': '309', 'branchID': '438', 'lineID': '8000'}


### Match Object

A Regex object’s search() or match() method searches the string it is passed for a match to the regex. Both methods return None if the pattern is not found (for match(), if it is not found at the beginning of the string). If the pattern is found, they return a Match object, whose **group()** method (equivalent to **group(0)**) returns the actual matched text from the searched string. As the example above shows, we can further define multiple subgroups using "()" in a regular expression, and each subgroup can also be given a name using "?P\<subgroupName>". group(k) returns the value of the kth subgroup.

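Besides numbered access, a Match object also lets us look up a named subgroup by its name and ask where the match was found. The following is a minimal sketch using the same named pattern as above; the shortened message and the variable name `matched` are assumptions made for this illustration.

```python
import re

msg = "My office number is 309-438-8000."
phoneNumberRE = r"(?P<areaCode>\d\d\d)-(?P<branchID>\d\d\d)-(?P<lineID>\d\d\d\d)"
matched = re.search(phoneNumberRE, msg)

if matched is not None:
    # A named subgroup can be read by name as well as by number.
    print(matched.group("areaCode"))           # 309, same as matched.group(1)
    # start() and end() give the position of the match in the searched string.
    print(matched.start(), matched.end())      # 20 32
    print(msg[matched.start():matched.end()])  # 309-438-8000
```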

## Regular Expression Quantifiers

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we’ll write REs in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

Repetition quantifiers (\*, +, ?, {m,n}, etc.) cannot be directly nested. This avoids ambiguity with the lazy or non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.

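The nesting rule can be checked with a short sketch. It uses fullmatch(), which succeeds only if the whole string matches the pattern; the variable name `six_as` is introduced only for this illustration.

```python
import re

# The inner repetition a{6} is wrapped in a non-capturing group (?:...),
# so the outer * can be applied to it.
six_as = re.compile(r"(?:a{6})*")

print(bool(six_as.fullmatch("a" * 12)))  # True: 12 is a multiple of 6
print(bool(six_as.fullmatch("a" * 8)))   # False: 8 is not a multiple of 6
print(bool(six_as.fullmatch("")))        # True: zero repetitions are allowed

# Writing the two quantifiers back to back is rejected by the re module.
try:
    re.compile(r"a{6}*")
except re.error as err:
    print("Invalid pattern:", err)
```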



### Metacharacters

| Character | Description | Example |
| --- | --- | --- |
| [] | A set of characters | "[a-m]" |
| \ | Signals a special sequence (can also be used to escape special characters) | "\d" |
| . | Any character (except newline character) | "he..o" |
| ^ | Starts with | "^hello" |
| \$ | Ends with | "world\$" |
| * | Zero or more occurrences | "aix*" |
| + | One or more occurrences | "aix+" |
| {} | Exactly the specified number of occurrences | "al{2}" |
| \| | Either or | "Python\|Java" |



In [None]:
import re

txt = "Python is interesting."

#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)


['h', 'i', 'i', 'e', 'e', 'i', 'g']


In [None]:
import re

txt = "The IP address is 192.168.1.1"

#Find all digit characters (each digit is a separate match):

x = re.findall(r"\d", txt)
print(x)

['1', '9', '2', '1', '6', '8', '1', '1']


In [None]:
import re

txt = "hello Python"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

x = re.findall("he..o", txt)
print(x)

['hello']


In [None]:
import re

txt = "hello Python"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if (x):             #findall() returns a list; an empty list (no match) is treated as False
  print("Yes, the string starts with 'hello'")
else:
  print("No match")

Yes, the string starts with 'hello'


In [None]:
import re

txt = "hello Python"

#Check if the string ends with 'Python':

x = re.findall("Python$", txt)
if (x):
  print("Yes, the string ends with 'Python'")
else:
  print("No match")

Yes, the string ends with 'Python'


In [None]:
import re

txt = "Python is useful for professionals in networking and cybersecurity fields."

#Check if the string contains "on" followed by 0 or more "x" characters:

x = re.findall("onx*", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")


['on', 'on']
Yes, there is at least one match!


In [None]:
import re

txt = "Python is useful for professionals in networking and cybersecurity fields."

#Check if the string contains "on" followed by 1 or more "x" characters:

x = re.findall("onx+", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [None]:
import re

txt = "Python course is offered in this semester."

#Check if the string contains "o" followed by exactly two "f" characters:

x = re.findall("of{2}", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['off']
Yes, there is at least one match!


In [None]:
import re

txt = "You can learn either Java or Python!"

#Check if the string contains either "Java" or "Python":

x = re.findall("Java|Python", txt)

print(x)

if (x):         
  print("Yes, there is at least one match!")
else:
  print("No match")

['Java', 'Python']
Yes, there is at least one match!

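Putting several of the metacharacters above together, the earlier phone number pattern can be shortened, and it can also accept a dot as the separator. This is only a sketch: the sample message is made up for illustration, and the pattern does not handle the parenthesized forms such as "(309)438-8000".

```python
import re

msg = "Office: 309-438-8000, cell: 288.888.8888, fax: 309-438-80."

# {3} and {4} replace the repeated \d; [-.] accepts either a hyphen or a dot.
phoneNumberRE = r"\d{3}[-.]\d{3}[-.]\d{4}"
matchList = re.findall(phoneNumberRE, msg)

print(matchList)   # ['309-438-8000', '288.888.8888']
```

The incomplete number at the end of the message is not reported because only two digits follow its second hyphen.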

### Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

| Character | Description | Example |
| --- | --- | --- |
| \A | Matches if the specified characters are at the beginning of the **string** | r"\AThe" |
| \b | Matches where the specified characters are at the beginning or at the end of a **word** | r"\bain" or r"ain\b"|
| \B | Matches where the specified characters are present, but NOT at the beginning (or at the end) of a **word** | r"\Bain" or r"ain\B" |
| \d | Matches any digit (numbers from 0-9) | r"\d" |
| \D | Matches any character that is NOT a digit | r"\D" |
| \s | Matches any white space character | r"\s" |
| \S | Matches any character that is NOT a white space character | r"\S" |
| \w | Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character) | r"\w" |
| \W | Matches any character that is NOT a word character | r"\W" |
| \Z | Matches if the specified characters are at the end of the **string** | r"ood\Z" |


In [None]:
import re

txt = "Python is powerful!"

#Check if the string starts with "Python":

x = re.findall(r"\APython", txt)

print(x)

if (x):
  print("Yes, there is a match!")
else:
  print("No match")


['Python']
Yes, there is a match!


In [None]:
import re

txt = "I love Python programming."

#Check if "Py" is present at the beginning of a WORD. Please note, it's the beginning of a word, NOT of the whole string.

x = re.findall(r"\bPy", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Py']
Yes, there is at least one match!


In [None]:
import re

txt = "You can use PyCharm to program Python"

#Check if "Py" is present, but NOT at the end of a word:

x = re.findall(r"Py\B", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Py', 'Py']
Yes, there is at least one match!


In [None]:
import re

txt = "You can use PyCharm to program Python"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [None]:
import re

txt = "You can use PyCharm to program Python123"

#Find every pair of consecutive non-digit characters (no numbers from 0-9):

x = re.findall(r"\D\D", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Yo', 'u ', 'ca', 'n ', 'us', 'e ', 'Py', 'Ch', 'ar', 'm ', 'to', ' p', 'ro', 'gr', 'am', ' P', 'yt', 'ho']
Yes, there is at least one match!


In [None]:
import re

txt = "You can use PyCharm to program Python"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ', ' ', ' ', ' ']
Yes, there is at least one match!


In [None]:
import re

txt = "You can use PyCharm to program Python"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Y', 'o', 'u', 'c', 'a', 'n', 'u', 's', 'e', 'P', 'y', 'C', 'h', 'a', 'r', 'm', 't', 'o', 'p', 'r', 'o', 'g', 'r', 'a', 'm', 'P', 'y', 't', 'h', 'o', 'n']
Yes, there is at least one match!


In [None]:
import re

txt = "Python is good"

#Return a match at every pair of word characters (letters a-z and A-Z, digits 0-9, and the underscore _ character):

x = re.findall(r"\w\w", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Py', 'th', 'on', 'is', 'go', 'od']
Yes, there is at least one match!


In [None]:
import re

txt = "Python is good"

#Return a match at every NON word character (anything other than letters, digits, and the underscore, such as "!", "?", and white space):

x = re.findall(r"\W", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ']
Yes, there is at least one match!


In [None]:
import re

txt = "Python is good"

#Check if the string ends with "ood":

x = re.findall(r"ood\Z", txt)

print(x)

if (x):
  print("Yes, there is a match!")
else:
  print("No match")

['ood']
Yes, there is a match!


### Sets
A set is a group of characters inside a pair of square brackets [] with a special meaning:

| Set | Description |
| --- | --- |
| [arn] | Matches any one of the specified characters (a, r, or n) |
| [a-n] | Matches any lower case character alphabetically between a and n |
| [^arn] | Matches any character EXCEPT a, r, and n |
| [0123] | Matches any one of the specified digits (0, 1, 2, or 3) |
| [0-9] | Matches any digit between 0 and 9 |
| [0-5][0-9] | Matches any two-digit number from 00 to 59 |
| [a-zA-Z] | Matches any letter between a and z, lower case OR upper case |


In [None]:
import re

txt = "Python class starts at 11:00 and 17:00"

#Check if the string has any a, s, or t characters:

x = re.findall("[ast]", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['t', 'a', 's', 's', 's', 't', 'a', 't', 's', 'a', 't', 'a']
Yes, there is at least one match!


In [None]:
import re

txt = "Python class starts at 11:00 and 17:00"

#Check if the string has any characters between a and t:

x = re.findall("[a-t]", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['t', 'h', 'o', 'n', 'c', 'l', 'a', 's', 's', 's', 't', 'a', 'r', 't', 's', 'a', 't', 'a', 'n', 'd']
Yes, there is at least one match!


In [None]:
import re

txt = "Python class starts at 11:00 and 17:00"

#Check if the string has other characters than a, s, or t:

x = re.findall("[^ast]", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['P', 'y', 'h', 'o', 'n', ' ', 'c', 'l', ' ', 'r', ' ', ' ', '1', '1', ':', '0', '0', ' ', 'n', 'd', ' ', '1', '7', ':', '0', '0']
Yes, there is at least one match!


In [None]:
import re

txt = "Python class starts at 11:00 and 17:00"

#Check if the string has any 0, 1, 2, or 3 digits:

x = re.findall("[0123]", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['1', '1', '0', '0', '1', '0', '0']
Yes, there is at least one match!


In [None]:
import re

txt = "Python class starts at 11:00 and 17:00"

#Find every run of one or more consecutive digits:

x = re.findall("[0-9]+", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")


['11', '00', '17', '00']
Yes, there is at least one match!


In [None]:
import re

txt = "Python class starts at 11:00 and 17:00"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall("[0-5][0-9]", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['11', '00', '17', '00']
Yes, there is at least one match!


In [None]:
import re

txt = "We are learning Python 123 now."

#Check if the string has any characters from a to z lower case, and A to Z upper case:

x = re.findall("[a-zA-Z]", txt)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['W', 'e', 'a', 'r', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'P', 'y', 't', 'h', 'o', 'n', 'n', 'o', 'w']
Yes, there is at least one match!

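Sets become more useful when combined. As a rough sketch, the class times in the sample string can be pulled out with two sets for the hour, a colon, and two sets for the minutes. The variable names `timeRE` and `times` are introduced only for this illustration, and the pattern is deliberately loose.

```python
import re

txt = "Python class starts at 11:00 and 17:00"

# Two digits for the hour, a colon, then minutes from 00 to 59.
# Caveat: [0-2][0-9] also accepts impossible hours such as 29.
timeRE = r"[0-2][0-9]:[0-5][0-9]"
times = re.findall(timeRE, txt)

print(times)   # ['11:00', '17:00']
```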

### Greedy and Lazy (Non-greedy) Matching

Using regular expressions, we may define a pattern that can be matched in different ways in a string.

* Greedy: As Many As Possible (longest match)

By default, a regex quantifier (e.g., \*) tells the regex engine to match as many instances of its quantified token or subpattern as possible. This behavior is called **greedy**.

For instance, take the + quantifier. It allows the engine to match one or more of the token it quantifies: \d+ can therefore match one or more digits. But "one or more" is rather vague: in the string 123, "one or more digits" (starting from the left) could be 1, 12 or 123. Which of these does \d+ match?

Because by default quantifiers are greedy, \d+ matches as many digits as possible, i.e. 123. For the match attempt that starts at a given position, a greedy quantifier gives you the longest match.

* Lazy: As Few As Possible (shortest match)

In contrast to the standard greedy quantifier, which eats up as many instances of the quantified token as possible, a **lazy** (sometimes called Non-greedy or reluctant) quantifier tells the engine to match as few of the quantified tokens as needed. A regular quantifier is made lazy by appending a ? question mark to it.

Since the * quantifier allows the engine to match zero or more characters, \w*?E tells the engine to match zero or more word characters, but as few as needed—which might be none at all — then to match an E. In the string 123EEE, starting from the very left, "zero or more word characters then an E" could be 123E, 123EE or 123EEE. Which of these does \w*?E match?

Because the *? quantifier is lazy, \w*? matches as few characters as possible to allow the overall match attempt to succeed, i.e. 123 — and the overall match is 123E. For the match attempt that starts at a given position, a lazy quantifier gives you the shortest match.

Let's look at an example:

In [None]:
import re

htmlHead  = '<h1>TITLE</h1>'
result1 = re.search(r'<.*>', htmlHead)
print(result1.group())
result2 = re.search(r'<.*?>', htmlHead)
print(result2.group())

<h1>TITLE</h1>
<h1>


In [None]:
import re

text  = 'hahahahaha'    #named text rather than str, so the built-in str() is not overwritten
result1 = re.search(r'(ha){1,5}', text)
print(result1.group())
result2 = re.search(r'(ha){1,5}?', text)
print(result2.group())

hahahahaha
ha


In [None]:
import re

text  = 'hahaha something else in the middle haha'
result1 = re.search(r'(ha).*(ha)', text)
print(result1.group())
result2 = re.search(r'(ha).*?(ha)', text)
print(result2.group())

hahaha something else in the middle haha
haha

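The two cases described in prose earlier in this section, \d+ on 123 and \w*?E on 123EEE, can be checked in the same way. This is a small sketch; the variable names are introduced only for illustration.

```python
import re

# Greedy: \d+ takes as many digits as it can.
greedy = re.search(r"\d+", "123")
print(greedy.group())    # 123

# Lazy: \d+? stops after the minimum of one digit.
lazy = re.search(r"\d+?", "123")
print(lazy.group())      # 1

# Lazy: \w*? grows one character at a time until an E can follow.
lazyE = re.search(r"\w*?E", "123EEE")
print(lazyE.group())     # 123E

# Greedy: \w* runs as far as possible, leaving only the last E.
greedyE = re.search(r"\w*E", "123EEE")
print(greedyE.group())   # 123EEE
```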

## Developing a Regular Expression

In many applications, we need to identify or validate users' email addresses. The user name part before the symbol "@" may contain uppercase and/or lowercase letters, numbers, and several legitimate special characters (e.g., _ and .). According to the RE syntax we have learned, the pattern for the user name in an email address, including the "@", is "[a-zA-Z0-9\\_\\.]+@".

Right after the "@" comes an organization name, which ends with a dot "."; this can be represented in RE as "[a-zA-Z0-9]+\\."

Finally, an email address ends with a top-level domain name, such as com, edu, or net. In this example, we only consider these three domains, which can be represented as "(com|edu|net)".


In [None]:
import re
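regExp = r"[a-zA-Z0-9\_\.]+@[a-zA-Z0-9]+\.(com|edu|net)"
userEmail_input = input()
#fullmatch() requires the whole input to match the pattern;
#search() would also accept an input that merely contains an address
if (re.fullmatch(regExp, userEmail_input)):
    print("This is a valid email.")
else: print("This is an invalid email.")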

ytang@ilstuedu
This is an invalid email.

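One point to keep in mind when validating input: re.search() succeeds if the pattern occurs anywhere in the string, whereas re.fullmatch() requires the entire string to match. The sketch below compares the two on a few made-up addresses, which are assumptions for illustration only. It uses the same pattern as above with a non-capturing group for the domain, and the names `EMAIL_RE`, `samples`, and `results` are introduced here.

```python
import re

# Same pattern as above; (?:...) groups the domains without capturing them.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.]+@[a-zA-Z0-9]+\.(?:com|edu|net)")

samples = [
    "ytang@ilstu.edu",        # well formed
    "ytang@ilstuedu",         # missing the dot before the domain
    "ytang@ilstu.education",  # only the prefix "ytang@ilstu.edu" fits the pattern
]

results = []
for email in samples:
    found_anywhere = EMAIL_RE.search(email) is not None
    whole_string = EMAIL_RE.fullmatch(email) is not None
    results.append((found_anywhere, whole_string))
    print(email, found_anywhere, whole_string)
```

The third sample passes the search() test but fails the fullmatch() test, which is why fullmatch() is the safer choice for validation.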