The purpose of this notebook is to explore capturing groups and lookaround.

Preferred website to test Regular Expressions: https://regex101.com/

# Capturing Groups

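Capturing groups are numbered sequentially, from left to right, in the order their opening parentheses appear. A backreference such as \1 matches the same text that the corresponding group captured.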

In [1]:
import re

In [2]:
result = re.search(r"(ABC)(DEF)(XYZ)\1\2\3", "ABCDEFXYZABCDEFXYZ")
result[0]

'ABCDEFXYZABCDEFXYZ'
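As a minimal sketch of how the numbering can be inspected, the match object also exposes each group on its own. The variable names below (m, whole, numbered, second) are illustrative and not part of the original notebook.

```python
import re

m = re.search(r"(ABC)(DEF)(XYZ)\1\2\3", "ABCDEFXYZABCDEFXYZ")

# Group 0 is the whole match; groups 1 to 3 follow the order
# of the opening parentheses in the pattern.
whole = m.group(0)
numbered = m.groups()   # tuple holding groups 1, 2 and 3
second = m[2]           # shorthand for m.group(2)

print(whole)
print(numbered)
print(second)
```

Indexing the match with 0 returns the full match, which is why the cells in this notebook display result[0].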

Capturing groups can also be nested inside one another. Numbering still follows the opening parentheses, so an outer group gets a lower number than the groups it contains.

In [3]:
result = re.search(r"(ABC(DEF))(XYZ)\1\2\3", "ABCDEFXYZABCDEFDEFXYZ")
result[0]

'ABCDEFXYZABCDEFDEFXYZ'
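To make the outside-in numbering concrete, the following sketch lists what each group captured and where. The names m, nested and spans are illustrative.

```python
import re

m = re.search(r"(ABC(DEF))(XYZ)\1\2\3", "ABCDEFXYZABCDEFDEFXYZ")

# Group 1 is the outer group, group 2 is the group nested inside it,
# and group 3 is the next group that opens after both.
nested = m.groups()
spans = [m.span(i) for i in (1, 2, 3)]

print(nested)
print(spans)
```

Group 2 lies entirely inside the span of group 1, which is why the backreferences \1\2 together match ABCDEF followed by a second DEF.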

# Lookaround

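Lookaround lets a pattern require that certain text appears (or does not appear) before or after the match without including that text in the result. Lookaround assertions are zero-width: they check the surrounding text but consume no characters.
First we'll look at positive lookahead.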

In [4]:
result = re.search(r".*(?=ABC)", "DEFABC")
result[0]

'DEF'

Next is positive lookbehind.

In [5]:
result = re.search(r"(?<=ABC).*", "ABCDEF")
result[0]

'DEF'
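The sketch below contrasts a lookahead with an ordinary pattern to show that the asserted text is checked but not consumed. It also illustrates a restriction of Python's re module: a lookbehind must match text of a fixed length. The variable names are illustrative.

```python
import re

text = "DEFABC"

# Without lookaround, ABC becomes part of the match.
consuming = re.search(r".*ABC", text)
# With a positive lookahead, ABC must follow but is left out.
lookahead = re.search(r".*(?=ABC)", text)

# A positive lookbehind starts the match right after ABC.
lookbehind = re.search(r"(?<=ABC).*", "ABCDEF")

# Python's re module rejects lookbehinds of variable length.
try:
    re.compile(r"(?<=AB+)C")
    variable_width_allowed = True
except re.error:
    variable_width_allowed = False

print(consuming[0], consuming.span())
print(lookahead[0], lookahead.span())
print(lookbehind[0], lookbehind.span())
print(variable_width_allowed)
```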

Now let's combine the two to search a series of files for HTML opening and closing title tags and print the text between them.

In [18]:
import os

# Directory that holds the saved HTML pages
directory = './htmls'

# Walk through every file under that directory
for root, dirs, files in os.walk(directory):
    for filename in files:
        filepath = os.path.join(root, filename)
        with open(filepath, encoding="utf8") as f:
            for line in f:
                result = re.search(r"(?<=<[Tt][Ii][Tt][Ll][Ee]>).*(?=</[Tt][Ii][Tt][Ll][Ee]>\s?)", line)
                if result:
                    print(result[0] + " - " + filename)
                    break

Cal Poly Humboldt - cph.txt
Cal Poly | Learn by Doing - cpslo.txt
Home – Chico State - csuc.txt
California State University, East Bay - csueb.txt
Home - California State University, Fresno - csuf.txt
Cal Maritime Home - csum.txt
Home | California State University Monterey Bay - csumb.txt
California State University, Northridge - csun.txt
California State University, Sacramento | Sacramento State - csusa.txt
Sonoma State University - csuso.txt
Home | California State University Stanislaus - csustan.txt
San Francisco State University - sfsu.txt
San José State University - sjsu.txt
서울대학교 2023학년도 입학전형 일정 안내 - snu.txt
清华大学 - tu.txt
UC Davis | UC Davis - ucd.txt
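The character classes such as [Tt] spell out case-insensitivity by hand. One possible alternative, sketched here on a made-up in-memory line rather than on the files in ./htmls, is to pass the re.IGNORECASE flag. The names sample, title_pattern and title are hypothetical.

```python
import re

# Hypothetical sample line, not taken from the files in ./htmls.
sample = "<TITLE>Example University</Title>"

# re.IGNORECASE makes <title> match any mix of upper and lower case.
title_pattern = re.compile(r"(?<=<title>).*(?=</title>)", re.IGNORECASE)

m = title_pattern.search(sample)
title = m[0] if m else None
print(title)
```

Like the loop above, this only finds a title whose opening and closing tags sit on the same line.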


There are also negative lookahead and negative lookbehind, which succeed only when the given text is absent. Let's first take a look at negative lookahead.

In [7]:
result = re.search(r"\d{3}(?!DEF)", "oaihsdghj123ABC")
result[0]

'123'

In [8]:
result = re.search(r"\d{3}(?!DEF)", "oaihsdghj123DEF")
result

Notice that the cell above has no output. The only run of three digits, 123, is immediately followed by DEF, so the negative lookahead fails, re.search returns None, and the notebook displays nothing for None. Next, let's consider negative lookbehind.

In [9]:
result = re.search(r"(?<!ABC)\d\d\d", "ljkandsfvb159")
result[0]

'159'

In [10]:
result = re.search(r"(?<!ABC)\d\d\d", "ljkandsABC159")
result

Again there is no output: the three digits are preceded by ABC, so the negative lookbehind fails and re.search returns None.
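Two practical points follow from these cells, sketched below with illustrative variable names. A failed search returns None, so indexing it would raise a TypeError unless it is checked first. Negative lookaround also only rejects one position, so with a longer run of digits or a variable-length quantifier the engine can still find a match nearby.

```python
import re

# Guard against None before indexing the result.
m = re.search(r"(?<!ABC)\d\d\d", "ljkandsABC159")
found = m[0] if m is not None else None

# With four digits, the match simply starts one character later,
# where the preceding three characters are no longer ABC.
shifted = re.search(r"(?<!ABC)\d\d\d", "ljkandsABC1590")

# With \d+ the engine backtracks from 123 to 12, because 12 is
# followed by 3 rather than by DEF.
partial = re.search(r"\d+(?!DEF)", "123DEF")

print(found)
print(shifted[0])
print(partial[0])
```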