# Functions in the re module

The Python `re` module has several functions for searching and modifying strings using regular expressions. We describe a few of them here. See the [re module documentation](https://docs.python.org/3/library/re.html?highlight=re#module-re) for the complete list.

In [1]:
import re

## re.findall


The function `re.findall(pattern, string, flags=0)` returns a list of all non-overlapping matches of the `pattern` in the `string`. The optional third argument, `flags`, can be used to specify flags for the regular expression.

For example, here we find all sequences of digits in a string:

In [2]:
text1 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."
re.findall(r"\d+", text1)

['57', '100', '171', '3']
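The `flags` argument can be illustrated with a small sketch. The string `sample` below is made up for this illustration; `re.IGNORECASE` is one of the standard flags of the `re` module.

```python
import re

# hypothetical sample string, used only for this illustration
sample = "Python, python and PYTHON"

# without flags the search is case-sensitive
case_sensitive = re.findall(r"python", sample)

# with re.IGNORECASE the letter case is ignored
case_insensitive = re.findall(r"python", sample, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```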

### re.findall with match groups

In some cases we are interested only in a part of a match. For example, we may want to find all dollar amounts in the format "$57", but we are interested only in the numeric value "57". Such situations can be handled using match groups (i.e. parts of a regular expression enclosed in parentheses). If we create a match group in the pattern, then the whole pattern will be matched, but only the value of the match group will be returned:

In [3]:
# search for sequences of digits starting with "$", 
# but return digits only
re.findall(r"\$(\d+)", text1)

['57', '171']

If the pattern includes more than one match group, `re.findall` will return a list of tuples with values of the match groups:

In [4]:
text2 = "This class starts at 9:30, and ends at 10:15"
# find all tuples in the form (hours, minutes)
re.findall(r"(\d+):(\d+)", text2)

[('9', '30'), ('10', '15')]
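Note that the values of match groups are always returned as strings. As a minimal sketch, the tuples can be converted to integers with a list comprehension; the names `pairs`, `times`, and `minutes` are chosen here only for illustration.

```python
import re

text2 = "This class starts at 9:30, and ends at 10:15"

# each match is a tuple of strings: (hours, minutes)
pairs = re.findall(r"(\d+):(\d+)", text2)

# convert the strings to integers
times = [(int(h), int(m)) for h, m in pairs]

# express each time as minutes after midnight
minutes = [60 * h + m for h, m in times]

print(times)
print(minutes)
```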

### Non-capturing match groups

It often happens that we need to create a match group in a regular expression just for the purpose of specifying what should be matched, and not because we want to retrieve its value. Such match groups can be specified using the format `(?:...)`. Their values will not be returned by `re.findall`.

In [5]:
# flight itinerary
from textwrap import dedent
text3 = """
       BUF 11:30 PM =>=>=>=> EWR 12:45 PM
       EWR 7:45 PM =>=>=>=>=> LHR 6:55 AM
       """
text3 = dedent(text3).strip()
print(text3)

BUF 11:30 PM =>=>=>=> EWR 12:45 PM
EWR 7:45 PM =>=>=>=>=> LHR 6:55 AM


In [6]:
# find all flight arrivals and departures
re.findall(r"""(.+?)        # match departure
               \ (?:=>)+ \  # match, but not capture, the =>=> part
               (.+)         # match arrival"""  , text3, re.X)

[('BUF 11:30 PM', 'EWR 12:45 PM'), ('EWR 7:45 PM', 'LHR 6:55 AM')]
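To see what the non-capturing group changes, the following sketch runs the same pattern twice on a single itinerary line, once with `(=>)+` and once with `(?:=>)+`. The variable names are chosen here for illustration.

```python
import re

line = "BUF 11:30 PM =>=>=>=> EWR 12:45 PM"

# capturing group: the arrow part shows up in the result
# (a repeated group keeps the value of its last repetition)
with_capture = re.findall(r"(.+?) (=>)+ (.+)", line)

# non-capturing group: the arrow part is matched but not returned
without_capture = re.findall(r"(.+?) (?:=>)+ (.+)", line)

print(with_capture)
print(without_capture)
```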

## re.sub

The function `re.sub(pattern, repl, string, count=0, flags=0)` finds matches for the `pattern` in the `string` and replaces them with the `repl` string. The `count` argument specifies the maximum number of replacements to be performed. The default value `count=0` means that all matches should be replaced. The `flags` argument can specify regular expression flags.

In [29]:
text4 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."
# replace all sequences of digits by the string "(NUMBER)"
new_text = re.sub(r"\d+", r"(NUMBER)", text4)
# print results
print(f"\033[1mORIGINAL TEXT:\033[0m\n{text4}\n")
print(f"\033[1mNEW TEXT:\033[0m\n{new_text}")

ORIGINAL TEXT:
This costs $57 for a 100 lbs box, so $171 for 3 boxes.

NEW TEXT:
This costs $(NUMBER) for a (NUMBER) lbs box, so $(NUMBER) for (NUMBER) boxes.
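The effect of the `count` argument can be sketched as follows. Here only the first two sequences of digits are replaced; the name `first_two` is chosen for illustration.

```python
import re

text4 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."

# replace at most two matches, scanning from the left
first_two = re.sub(r"\d+", "(NUMBER)", text4, count=2)

print(first_two)
```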


The function `re.sub` is more flexible than the example above suggests, since the value of the replacement string can depend on the value of the match being replaced. To make use of this, we need to specify one or more match groups in the pattern. Each capturing match group is automatically given a label `\1`, `\2`, `\3`, etc. (with `\1` denoting the leftmost match group). When these labels are used in the replacement string, they are replaced by the values of the corresponding match groups.

**Examples.**

In [26]:
text4 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."
# add decimals to all prices
new_text = re.sub(r"(\$\d+)", r"\1.00", text4)
# print results
print(f"\033[1mORIGINAL TEXT:\033[0m\n{text4}\n")
print(f"\033[1mNEW TEXT:\033[0m\n{new_text}")

ORIGINAL TEXT:
This costs $57 for a 100 lbs box, so $171 for 3 boxes.

NEW TEXT:
This costs $57.00 for a 100 lbs box, so $171.00 for 3 boxes.


In [30]:
text5 = "Flight itinerary:\nBUF 11:30 PM =>=>=> EWR 12:45 PM =>=>=> LHR 6:55 AM"
# reformat the itinerary
new_text = re.sub(r"""(.+?)           # \1 first airport
                      \ (?:=>)+\      #    the =>=> part
                      (.+)            # \2 second airport
                      \ (?:=>)+\      #    the =>=> part
                      (.+)            # \3 third airport
                      """, 
                  r"Dep.: \1; Arr.: \2\nDep.: \2; Arr.: \3", text5, flags=re.X)
# print results
print(f"\033[1mORIGINAL TEXT:\033[0m\n{text5}\n")
print(f"\033[1mNEW TEXT:\033[0m\n{new_text}")

ORIGINAL TEXT:
Flight itinerary:
BUF 11:30 PM =>=>=> EWR 12:45 PM =>=>=> LHR 6:55 AM

NEW TEXT:
Flight itinerary:
Dep.: BUF 11:30 PM; Arr.: EWR 12:45 PM
Dep.: EWR 12:45 PM; Arr.: LHR 6:55 AM
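One detail worth knowing about the numeric labels: when a group reference must be followed directly by a digit, a label such as `\1` becomes ambiguous, since `\10` would be read as a reference to group 10. The `re` module provides the alternative syntax `\g<1>` for such cases. The sketch below, which is an illustration and not part of the original examples, appends a zero to each price.

```python
import re

text4 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."

# \g<1> refers to the first match group; the literal 0 follows it
new_text = re.sub(r"\$(\d+)", r"$\g<1>0", text4)

print(new_text)
```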
