Regular Expressions

In [1]:
regular_text = "The phone number found written in a sheet of paper in the crime scene was 408-567-1223, this is not good, this number belongs to the agent Dillon pushing to many pencils, you son of a bitch!!!! (see predator)"

In [2]:
"phone" in regular_text

True

In [3]:
import re

In [6]:
pattern = "phone"
match = re.search(pattern, regular_text)

In [7]:
match.span()

(4, 9)

In [8]:
match.start()

4

In [9]:
match.end()

9
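The span values behave like ordinary slice indices: the start is inclusive and the end is exclusive. A minimal sketch, using a hypothetical `sample` string that is not part of the cells above:

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "Call the phone on the desk"
m = re.search("phone", sample)

# start() is inclusive and end() is exclusive, like a normal Python slice.
start, end = m.span()
found = sample[start:end]

print(found)               # phone
print(found == m.group())  # True
```

Slicing the original text with the span gives back exactly the matched substring, which is the same value `m.group()` returns.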

To find every occurrence of the pattern in a text, we use re.findall, or re.finditer when we also need the position of each match.

In [10]:
n_text = "Yeah th phone was not mine, the phone was stolen but the real phone is lost"
pattern = "phone"
match = re.search(pattern, n_text)
match.span()

(8, 13)

In [11]:
all_matches = re.findall(pattern, n_text)
all_matches

['phone', 'phone', 'phone']

In [12]:
iterator = re.finditer(pattern, n_text)
for m in iterator:
    print(f"Span: {m.span()}")


Span: (8, 13)
Span: (32, 37)
Span: (62, 67)
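One detail worth keeping in mind: when nothing matches, `re.search` returns `None`, so calling `.span()` on the result raises an `AttributeError`. The sketch below guards against that; the helper name `find_span` is an assumption made up for this example.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if absent."""
    m = re.search(pattern, text)
    return m.span() if m is not None else None

print(find_span("phone", "the phone rang"))  # (4, 9)
print(find_span("phone", "nothing here"))    # None

# findall never returns None; with no matches it gives an empty list.
print(re.findall("phone", "nothing here"))   # []
```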


# Patterns

So far we've learned how to search for a basic string. What about more complex cases, such as finding a telephone number or an email address in a large string of text?

We could just use the search method if we knew the exact phone number or email address, but what if we don't? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

Let's begin!

## Identifiers for Characters in Patterns

Kinds of characters, such as a digit or a whitespace character, have codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'

Placing the r in front of the string makes it a raw string, so Python passes each \ in the pattern through to the regular expression engine instead of treating it as the start of a string escape sequence.
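To see why the raw prefix matters, compare `\b` with and without it. In a normal string `\b` is the backspace character, while in a raw string it stays as two characters that the regex engine reads as a word boundary. A small sketch (the variable names `plain` and `raw` are only for illustration):

```python
import re

# "\b" in a normal string is the single backspace character.
# r"\b" keeps the backslash and the b, which regex reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))                 # 5 7
print(re.search(plain, "a cat sat"))        # None
print(re.search(raw, "a cat sat").group())  # cat
```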

Below is a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (including underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
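As a quick check, the example rows in the table above can be verified with `re.fullmatch`, which succeeds only when the whole string fits the pattern. This is a small sketch; the `examples` and `results` names are just for illustration.

```python
import re

# Each tuple is (pattern, example match) copied from the table above.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),   # the underscore counts as a \w character
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

results = [re.fullmatch(p, t) is not None for p, t in examples]
print(all(results))  # True
```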

For example:

In [14]:
regular_text

'The phone number found written in a sheet of paper in the crime scene was 408-567-1223, this is not good, this number belongs to the agent Dillon pushing to many pencils, you son of a bitch!!!! (see predator)'

In [15]:
pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

In [17]:
is_match = re.search(pattern, regular_text)
is_match

<re.Match object; span=(74, 86), match='408-567-1223'>

In [18]:
print(f"Span: {is_match.span()}")

Span: (74, 86)


In [19]:
is_match.group()


'408-567-1223'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many occurrences we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

Let's rewrite our pattern using these quantifiers:

In [20]:
pattern = r"\d{3}-\d{3}-\d{4}"

In [22]:
my_match = re.search(pattern, regular_text)

In [24]:
my_match

<re.Match object; span=(74, 86), match='408-567-1223'>

In [25]:
my_match.group()

'408-567-1223'

In [26]:
my_match.span()

(74, 86)
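Quantifiers are greedy by default, meaning they take as many characters as the pattern allows. A short sketch on a made-up string of numbers shows how the different quantifiers behave:

```python
import re

numbers = "7 42 12345"  # hypothetical sample text

# {2,4} needs at least two digits and stops after four.
print(re.findall(r"\d{2,4}", numbers))  # ['42', '1234']

# + takes one or more digits, as many as are available.
print(re.findall(r"\d+", numbers))      # ['7', '42', '12345']

# ? makes the preceding character optional.
print(re.findall(r"plurals?", "plural plurals"))  # ['plural', 'plurals']

# * allows zero occurrences, so the missing B is fine.
print(re.fullmatch(r"A*B*C*", "AAACC") is not None)  # True
```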

Suppose we want to extract just the area code. We can wrap parts of the pattern in parentheses to create groups:

In [27]:
pattern_group = r"(\d{3})-(\d{3})-(\d{4})"
new_match = re.search(pattern_group, regular_text)

In [30]:
# group() returns the whole match; group(1), group(2), ... return the
# parts captured by each pair of parentheses, numbered from 1
print(f"Phone number: {new_match.group()}")
print(f"Area code: {new_match.group(1)}")
print(f"Last 4 digits: {new_match.group(3)}")

Phone number: 408-567-1223
Area code: 408
Last 4 digits: 1223
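Two related tools may be handy here: `groups()` returns all captured groups at once, and the `(?P<name>...)` syntax gives a group a name. The sketch below uses a made-up sentence, and the group names `area`, `prefix` and `line` are assumptions chosen for this example.

```python
import re

text = "Reach me at 408-567-1223 tomorrow"  # hypothetical sample text
named_pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
m = re.search(named_pattern, text)

print(m.groups())       # ('408', '567', '1223')
print(m.group("area"))  # 408
print(m.groupdict())    # {'area': '408', 'prefix': '567', 'line': '1223'}
```

Named groups can still be read by number, so `m.group(1)` and `m.group("area")` return the same value.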


In [32]:
# The pipe operator | matches either the pattern on its left or the one on its right
re.search(r"man|woman", "This is a man but he looks like a woman among the man")

<re.Match object; span=(10, 13), match='man'>
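`re.search` stops at the first match, which is why only the first 'man' is reported. With `re.findall` every alternative that matches is collected. The alternatives are also tried from left to right at each position, so their order can matter when one is a prefix of another. A small sketch (the cat/catalog strings are made up for illustration):

```python
import re

sentence = "This is a man but he looks like a woman among the man"
print(re.findall(r"man|woman", sentence))  # ['man', 'woman', 'man']

# At the same starting position, the first alternative that fits wins.
print(re.search(r"cat|catalog", "catalog").group())  # cat
print(re.search(r"catalog|cat", "catalog").group())  # catalog
```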

In [33]:
# The wildcard operator . matches any single character except a newline
re.findall(r".at", "The cat in the hat sat like an expat")

['cat', 'hat', 'sat', 'pat']

In [36]:
# The ^ anchor matches only at the start of the string
re.search(r"^\d", "1 starts with a digit")

<re.Match object; span=(0, 1), match='1'>

In [37]:
# The $ anchor matches only at the end of the string (or just before a trailing newline)
re.search(r"\d$", "ends with a digit 22")

<re.Match object; span=(19, 20), match='2'>
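By default `^` and `$` refer to the start and end of the whole string. If the text has several lines and each line should be checked, the `re.MULTILINE` flag makes the anchors apply at every line. A minimal sketch with a made-up three-line string:

```python
import re

lines = "1 apple\n22 pears\nno digits here"  # hypothetical sample text

# Without the flag, ^ only matches at the very start of the string.
print(re.findall(r"^\d+", lines))                      # ['1']

# With re.MULTILINE, ^ also matches right after each newline.
print(re.findall(r"^\d+", lines, flags=re.MULTILINE))  # ['1', '22']
```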

In [38]:
# Exclusion: [^...] matches any single character that is NOT listed inside the brackets
phrase = "There are 3 numbers 12 inside this 4 sentence"
re.findall(r"[^\d]", phrase)

['T',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e']

In [40]:
re.findall(r"[^\d]+", phrase)

['There are ', ' numbers ', ' inside this ', ' sentence']

In [44]:
remove_punctuation = "This is a long story!!! but contains many punctuation, . How to remove it??"
my_list = re.findall(r"[^!.?, ]+", remove_punctuation)

In [46]:
my_list

['This',
 'is',
 'a',
 'long',
 'story',
 'but',
 'contains',
 'many',
 'punctuation',
 'How',
 'to',
 'remove',
 'it']

In [47]:
" ".join(my_list)

'This is a long story but contains many punctuation How to remove it'
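An alternative worth knowing is `re.sub`, which replaces every match with another string. The sketch below removes anything that is neither a word character nor whitespace, then collapses the leftover spaces. It is one possible approach, not the only one, and the `noisy` string is made up for this example.

```python
import re

noisy = "This is a long story!!! but contains, . punctuation??"

# Delete every character that is not a word character or whitespace.
no_punct = re.sub(r"[^\w\s]", "", noisy)

# Removing punctuation can leave double spaces, so normalise them.
clean = " ".join(no_punct.split())
print(clean)  # This is a long story but contains punctuation
```

Unlike listing each punctuation mark inside the brackets, `[^\w\s]` also catches marks that were not anticipated, such as `;` or `:`.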

In [48]:
# Find words joined by a hyphen: word characters, a hyphen, more word characters
phrase = "Only find the longest hyphen-words. Where are the long-ish dash words?"
re.findall(r"\w+-\w+", phrase)

['hyphen-words', 'long-ish']
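The pattern above expects exactly one hyphen, so a word with several hyphens gets split into pieces. A possible extension, sketched here on a made-up sentence, repeats the "hyphen plus word" part with a non-capturing group `(?:...)`:

```python
import re

text = "a state-of-the-art tool"  # hypothetical sample text

# One hyphen only: the long word is split in two.
print(re.findall(r"\w+-\w+", text))       # ['state-of', 'the-art']

# (?:-\w+)+ repeats the hyphenated part one or more times.
print(re.findall(r"\w+(?:-\w+)+", text))  # ['state-of-the-art']
```

The group is written as `(?:...)` rather than `(...)` because `re.findall` returns only the captured groups when a pattern contains capturing parentheses.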