## Regular Expression (Regex)

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that defines a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

**Source:** Wikipedia

Regular expressions are extremely useful in extracting information from text such as code, log files, spreadsheets, or even documents. And while there is a lot of theory behind formal languages, the following lessons and examples will explore the more practical uses of regular expressions so that you can use them as quickly as possible.

**Source:** [RegExOne](https://regexone.com/)

The best way to dive into **regex** is to go through examples.

In [4]:
# How would you search text and take substrings using plain Python?

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

index = text.find("Loyola")   # index of the first occurrence, or -1 if absent

print(text[index:])

# How would you replace text in a string using plain Python?

newtext = text.replace("Management", "Mgt")
print(newtext)


Loyola Schools, Ateneo de Manila University
BS Mgt Engineering, John Gokongwei School of Mgt, Loyola Schools, Ateneo de Manila University


In [5]:
# demo: import re

import re

In [11]:
# demo: findall

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

print(re.findall("BS",text))
print(re.findall("Schools",text))
print(re.findall("School",text))
print(re.findall("Management",text))
print(re.findall("Engineering",text))


['BS']
['Schools']
['School', 'School']
['Management', 'Management']
['Engineering']


### Search

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

In [17]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

try:
    print(re.search(r'BS', text).group())
    print(re.search(r'Management',text).group())
    print(re.search(r'BS Management',text).group())
    print(re.search(r'Engineering',text).group())
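    print(re.search(r'engineering',text).group())   # no match: the search is case-sensitive

except AttributeError:   # re.search returned None, so .group() failed
    print("Error")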




BS
Management
BS Management
Engineering
Error
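
The `try`/`except` above works because calling `.group()` on `None` raises `AttributeError`. A minimal sketch of an alternative is to check the return value of `re.search` directly. The helper name `first_match` is made up for illustration:

```python
import re

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

def first_match(pattern, string):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, string)
    return m.group() if m is not None else None

print(first_match(r"Engineering", text))   # Engineering
print(first_match(r"engineering", text))   # None (the search is case-sensitive)

# A zero-length match is still a match object, not None.
empty = re.search(r"x*", text)
print(empty is not None, repr(empty.group()))   # True ''
```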


### Match

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The `group()` method of a match object returns the string matched by the regular expression.

In [19]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

try:
    print(re.match(r'BS', text).group())
    print(re.match(r'BS Management',text).group())
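    print(re.match(r'Management',text).group())    # fails here: text does not start with "Management"
    print(re.match(r'Engineering',text).group())   # never reached
    print(re.match(r'engineering',text).group())   # never reached

except AttributeError:   # re.match returned None, so .group() failed
    print("Error")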




BS
BS Management
Error
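
Because `re.match` only tries the pattern at position 0, the same pattern can succeed with `re.search` and fail with `re.match`. A small sketch of the difference, using the match object's `start()` method to show where the match was found (the variable names are illustrative):

```python
import re

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

found_anywhere = re.search(r"Management", text)
found_at_start = re.match(r"Management", text)

print(found_anywhere.start())   # 3: "Management" begins right after "BS "
print(found_at_start)           # None: the string does not start with "Management"
```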


**Wildcard**

**.**  - A period. Matches any single character except newline character.

In [27]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

print(re.search("BS.Management",text).group())
print(re.search("Management.Engineer...",text).group())
print(re.search(".anagement.Engineer...",text).group())
print(re.findall("Manage....",text))
print(re.findall("School.",text))



BS Management
Management Engineering
Management Engineering
['Management', 'Management']
['School ', 'Schools']


**\w** - Lowercase w. Matches any single letter, digit or underscore.

In [29]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

print(re.search(r"School\w",text).group())
print(re.findall(r"School\w",text))




Schools
['Schools']


**\W** - Uppercase w. Matches any character not part of \w (lowercase w).

In [30]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

print(re.search(r"School\W",text).group())
print(re.findall(r"School\W",text))




School 
['School ']


**\s** - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.

In [49]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"\s",course))




[' ', ' ', ' ', ' ', ' ', '\n', ' ', ' ', ' ', ' ']


**\S** - Uppercase s. Matches any character not part of \s (lowercase s).

In [48]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"\S",course))
print(re.findall(r"\S\S",course))
print(re.findall(r"\S\S\S",course))






['I', 'T', 'M', 'G', 'T', '2', '5', '.', '0', '3', 'I', 'n', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n', 'T', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 'y', 'A', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g', ',', 'Q', 'u', 'a', 'n', 't', 'i', 't', 'a', 't', 'i', 'v', 'e', 'M', 'e', 't', 'h', 'o', 'd', 's', 'a', 'n', 'd', 'I', 'n', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n', 'T', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 'y']
['IT', 'MG', 'T2', '5.', '03', 'In', 'fo', 'rm', 'at', 'io', 'Te', 'ch', 'no', 'lo', 'gy', 'Ap', 'pl', 'ic', 'at', 'io', 'Pr', 'og', 'ra', 'mm', 'in', 'g,', 'Qu', 'an', 'ti', 'ta', 'ti', 've', 'Me', 'th', 'od', 'an', 'In', 'fo', 'rm', 'at', 'io', 'Te', 'ch', 'no', 'lo', 'gy']
['ITM', 'GT2', '5.0', 'Inf', 'orm', 'ati', 'Tec', 'hno', 'log', 'App', 'lic', 'ati', 'Pro', 'gra', 'mmi', 'ng,', 'Qua', 'nti', 'tat', 'ive', 'Met', 'hod', 'and', 'Inf', 'orm', 'ati', 'Tec', 'hno', 'log']


**\n** - Lowercase n. Matches newline.

**\r** - Lowercase r. Matches return.

**\d** - Lowercase d. Matches decimal digit 0-9.

In [47]:

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"\n",course))
print(re.findall(r"\r",course))
print(re.findall(r"\d",course))
print(re.findall(r"\d\d",course))






['\n']
[]
['2', '5', '0', '3']
['25', '03']


**^** - Caret. Matches a pattern at the start of the string.

In [46]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

try:
    print(re.search(r"^ITM",course).group())
    print(re.search(r"^Programming",course).group())
except AttributeError:   # re.search returned None, so .group() failed
    print("Error")




ITM
Error


**$** - Dollar sign. Matches a pattern at the end of the string.

In [45]:

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

try:
    print(re.search(r"Technology$",course).group())
    print(re.search(r"Engineering$",course).group())
except AttributeError:   # re.search returned None, so .group() failed
    print("Error")






Technology
Error


**[abc]** - Matches a or b or c.  
**[12]** - Matches 1 or 2.

**[a-zA-Z0-9]** - Matches any single character in the ranges a to z, A to Z, or 0 to 9.

Characters that are not within a range can be matched by complementing the set. 
If the first character of the set is ^, all the characters that are not in the set will be matched. For example, `[^0-9]` matches any single character that is not a digit.

In [59]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"[abc]",course))
print(re.findall(r"[ITM]",course))
print(re.findall(r"[2503]",course))
print(re.findall(r"[2503][2503]",course))
print(re.findall(r"[a-z]",course))
print(re.findall(r"[a-z][a-z][a-z]",course))
print(re.findall(r"[0-9][0-9]",course))
print(re.findall(r"[0-9][0-9]\.[0-9][0-9]",course))



['a', 'c', 'c', 'a', 'a', 'a', 'a', 'a', 'a', 'c']
['I', 'T', 'M', 'T', 'I', 'T', 'M', 'I', 'T']
['2', '5', '0', '3']
['25', '03']
['n', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 'y', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g', 'u', 'a', 'n', 't', 'i', 't', 'a', 't', 'i', 'v', 'e', 'e', 't', 'h', 'o', 'd', 's', 'a', 'n', 'd', 'n', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 'y']
['nfo', 'rma', 'tio', 'ech', 'nol', 'ogy', 'ppl', 'ica', 'tio', 'rog', 'ram', 'min', 'uan', 'tit', 'ati', 'eth', 'ods', 'and', 'nfo', 'rma', 'tio', 'ech', 'nol', 'ogy']
['25', '03']
['25.03']
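
The cell above does not exercise the complemented set, so here is a short sketch of it. It reuses the `course` string; the specific patterns and the variable names are chosen only for illustration:

```python
import re

course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

# [^...] matches any single character that is NOT listed in the set.
non_alnum = re.findall(r"[^a-zA-Z0-9]", course)
print(non_alnum)

# Runs of characters that are neither letters nor whitespace.
non_letter_runs = re.findall(r"[^a-zA-Z\s]+", course)
print(non_letter_runs)   # ['25.03', ',']
```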


**\A** - Uppercase a. Matches only at the start of the string. Unlike `^`, it does not match at the start of each line when the `re.MULTILINE` flag is set.

In [62]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"\A[abc]",course))
print(re.findall(r"\A[ITM]",course))
print(re.findall(r"\A[2503]",course))






[]
['I']
[]
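
The difference between `^` and `\A` only shows up with the `re.MULTILINE` flag, which the cells above do not use. A sketch, assuming that flag and using illustrative variable names:

```python
import re

course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

# With re.MULTILINE, ^ also matches right after each newline...
line_starts = re.findall(r"^\w+", course, re.MULTILINE)
print(line_starts)    # ['ITMGT25', 'Quantitative']

# ...while \A still matches only at the very start of the string.
string_start = re.findall(r"\A\w+", course, re.MULTILINE)
print(string_start)   # ['ITMGT25']
```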


**\b** - Lowercase b. Matches only at the beginning or end of a word.

In [66]:
text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"\b[abc]",course))
print(re.findall(r"\b[ITM]",course))
print(re.findall(r"\b[ITMP]",course))
print(re.findall(r"\b[ITM][ITM]",course))







['a']
['I', 'I', 'T', 'M', 'I', 'T']
['I', 'I', 'T', 'P', 'M', 'I', 'T']
['IT']
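
The cell above only places `\b` at the start of a pattern. A sketch of the other placements, at the end of a pattern and on both sides of a word (the patterns and variable names are illustrative):

```python
import re

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"

# Without boundaries, "School" is also found inside "Schools".
anywhere = re.findall(r"School", text)
print(anywhere)      # ['School', 'School']

# \b on both sides keeps only the standalone word.
whole_word = re.findall(r"\bSchool\b", text)
print(whole_word)    # ['School']

# \b at the end of the pattern: words ending in "ing".
ing_words = re.findall(r"\w+ing\b", text)
print(ing_words)     # ['Engineering']
```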


### Repetitions

Writing out a long pattern one character at a time quickly becomes tedious. Fortunately, the `re` module handles repetitions using the following special characters:

`+` - Matches one or more repetitions of the element to its left.

In [68]:

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"[ITM]+",course))
print(re.findall(r"[Tech]+",course))







['ITM', 'T', 'I', 'T', 'M', 'I', 'T']
['T', 'T', 'Tech', 'c', 'e', 'e', 'h', 'Tech']


`*` - Matches zero or more repetitions of the element to its left.

In [69]:


text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"[ITM]*",course))
print(re.findall(r"[Tech]*",course))







['ITM', '', 'T', '', '', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', 'T', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'M', '', '', '', '', '', '', '', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', 'T', '', '', '', '', '', '', '', '', '', '']
['', 'T', '', '', 'T', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Tech', '', '', '', '', '', '', '', '', '', '', '', '', 'c', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'e', '', '', 'e', '', 'h', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Tech', '', '', '', '', '', '', '']


`?` - Matches zero or one occurrence of the element to its left.

In [70]:

text = "BS Management Engineering, John Gokongwei School of Management, Loyola Schools, Ateneo de Manila University"
course = "ITMGT25.03 Information Technology Application Programming, \nQuantitative Methods and Information Technology"

print(re.findall(r"[ITM]?",course))
print(re.findall(r"[Tech]?",course))








['I', 'T', 'M', '', 'T', '', '', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', 'T', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'M', '', '', '', '', '', '', '', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', 'T', '', '', '', '', '', '', '', '', '', '']
['', 'T', '', '', 'T', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'T', 'e', 'c', 'h', '', '', '', '', '', '', '', '', '', '', '', '', 'c', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'e', '', '', 'e', '', 'h', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'T', 'e', 'c', 'h', '', '', '', '', '', '', '']


But what if you want to check for an exact number of repetitions?

One example is checking the validity of a phone number in an application. The `re` module handles this as well, using the following syntax:

`{x}` - Repeat exactly x number of times.

`{x,}` - Repeat at least x times or more.

`{x,y}` - Repeat at least x times but no more than y times. Write it without a space after the comma.

In [71]:
# Extract possible phone number(s) from text

contact_info = "Joseph Benjamin R. Ilagan jbilagan@ateneo.edu 0917-9999999, 0918-9999999"

print(re.findall("[0-9]{4}-[0-9]{7}", contact_info))




['0917-9999999', '0918-9999999']
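
The cell above uses only the `{x}` form. A sketch of all three forms side by side, on a made-up string of digit runs (the `digits` string and the variable names are hypothetical):

```python
import re

digits = "7 42 917 2024 09179999999"   # hypothetical sample data

exactly_three = re.findall(r"\b[0-9]{3}\b", digits)
print(exactly_three)    # ['917']

three_or_more = re.findall(r"\b[0-9]{3,}\b", digits)
print(three_or_more)    # ['917', '2024', '09179999999']

two_to_four = re.findall(r"\b[0-9]{2,4}\b", digits)
print(two_to_four)      # ['42', '917', '2024']
```

The `\b` anchors keep a longer run of digits from being counted as a shorter one.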


In [76]:
# Extract email address(es) from text

contact_info = "Joseph Benjamin R. Ilagan jbilagan@ateneo.edu 0917-9999999, 0918-9999999"

# Raw string, so that \. reaches the regex engine as an escaped period
print(re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+", contact_info))

['jbilagan@ateneo.edu']


#### Groups and Groupings

Suppose you are validating email addresses and want to check the user name and host separately.

This is when the group feature of regular expressions comes in handy. It allows you to pick out parts of the matching text.

Parts of a regular expression pattern enclosed in parentheses `()` are called **groups**. The parentheses do not change what the expression matches; they form groups within the matched sequence. You have been using the `group()` method all along in this tutorial's examples. Calling `match.group()` with no argument still returns the whole matched text, while `match.group(1)`, `match.group(2)`, and so on return the text captured by each group, numbered from left to right.

The `+` and `*` quantifiers are said to be greedy.

In [82]:
contact_info = "Joseph Benjamin R. Ilagan jbilagan@ateneo.edu 0917-9999999, 0918-9999999"

# Raw strings, so that \. reaches the regex engine as an escaped period
print(re.search(r"([a-zA-Z0-9]+)@[a-zA-Z0-9]+\.[a-z]+", contact_info).group())
print(re.search(r"([a-zA-Z0-9]+)@([a-zA-Z0-9]+)\.([a-z]+)", contact_info).group(1))
print(re.search(r"([a-zA-Z0-9]+)@([a-zA-Z0-9]+)\.([a-z]+)", contact_info).group(2))
print(re.search(r"([a-zA-Z0-9]+)@([a-zA-Z0-9]+)\.([a-z]+)", contact_info).group(3))


jbilagan@ateneo.edu
jbilagan
ateneo
edu
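
Calling `re.search` once and reusing the match object avoids repeating the pattern. A sketch using the match object's `groups()` method, which returns all captured groups as a tuple; the variable names are illustrative:

```python
import re

contact_info = "Joseph Benjamin R. Ilagan jbilagan@ateneo.edu 0917-9999999, 0918-9999999"

email = re.search(r"([a-zA-Z0-9]+)@([a-zA-Z0-9]+)\.([a-z]+)", contact_info)

if email is not None:
    user, host, tld = email.groups()
    print(user, host, tld)    # jbilagan ateneo edu
    print(email.group(0))     # jbilagan@ateneo.edu (group 0 is the whole match)
```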


#### Greedy vs. Non-Greedy Matching

When a quantifier matches as much of the string as possible, the match is said to be greedy. This is the default behavior of `+`, `*`, and `?`, but it is not always what you want. Appending `?` to a quantifier (as in `+?` or `*?`) makes it non-greedy, so it matches as little as possible.
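
A sketch of the difference on a made-up HTML-like string, where `.+` is greedy and `.+?` is its non-greedy counterpart (the `html` string and the variable names are hypothetical):

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # hypothetical sample data

# Greedy: .+ runs all the way to the last ">" in the string.
greedy = re.findall(r"<.+>", html)
print(greedy)    # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
```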

#### Replace text


In [86]:

# Reformat the phone number prefix(es) in the text
contact_info = "Joseph Benjamin R. Ilagan jbilagan@ateneo.edu 0917-9999999, 0918-9999999"


print(re.sub(r"(09[0-9][0-9])-",r"(\1) ",contact_info))
print(re.sub(r"(09[0-9][0-9])-",r"+63\1 ",contact_info))
print(re.sub(r"\+630",r"+63", re.sub(r"(09[0-9][0-9])-",r"+63\1 ",contact_info)))


Joseph Benjamin R. Ilagan jbilagan@ateneo.edu (0917) 9999999, (0918) 9999999
Joseph Benjamin R. Ilagan jbilagan@ateneo.edu +630917 9999999, +630918 9999999
Joseph Benjamin R. Ilagan jbilagan@ateneo.edu +63917 9999999, +63918 9999999


### IMPORTANT

Please go through the tutorials and sample puzzles [here](https://regexcrossword.com/). We won't have time for a full-blown lecture on regex, but the topic is too important for you not to cover through self-study because future assignments and tests will depend heavily on it.