# Step-up Your RegEx Game in Python

source article: https://towardsdatascience.com/step-up-your-regex-game-in-python-1ec20c5d65f

Author: [James Briggs](https://towardsdatascience.com/@jamescalam?source=post_page-----1ec20c5d65f----------------------)

Near the top of this page is my name and, in the HTML, my username. It looks like this:

In [1]:
s = '<a href="/@jamescalam?source=post_page-----22e4e63463af----------------------" class="cg ch au av aw ax ay az ba bb it be ck cl" rel="noopener">James Briggs</a>'

Say we are interested in pulling the username from the HTML of the page. We've taken all ``<a>`` elements and will identify the username using RegEx.

We could do something messy like this:

In [9]:
if '<a href="/@' in s:
    username = s.replace('<a href="/', '')
    username = username.split('?source')[0]
username

'@jamescalam'

But using __look-behind__ and __look-ahead__ assertions, we get much more dynamic, flexible logic like so:

In [10]:
import re

In [11]:
# look-behind: must follow '<a href="/'; look-ahead: must be followed by '?source'
match = re.search(r'(?<=<a href="/)@.*(?=\?source)', s)
if match:
    username = match.group()
username

'@jamescalam'

## Look-Behind
The look-behind assertion tells our regex to **assert** that any potential match is **preceded** by the pattern given to the assertion. Breaking the term down:

 * **Look-behind**: we are looking behind our pattern, at the text that precedes it.
 
 * **Assertion**: we are asserting the presence of this other pattern, but we are not matching it (that is, it is not included in the output text).
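
To compare a regex with and without the assertion, here is a minimal sketch. The sample string and variable names are made up for illustration and do not come from the article. Without the look-behind, the prefix is part of the match; with it, the prefix must be present but is left out of the result.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "user: alice"

# Without look-behind: the prefix "user: " is part of the match
without_assertion = re.search(r"user: \w+", text).group()

# With look-behind: "user: " must precede the match but is not returned
with_assertion = re.search(r"(?<=user: )\w+", text).group()

print(without_assertion)  # user: alice
print(with_assertion)     # alice
```

Note that Python's ``re`` module requires the pattern inside a look-behind to have a fixed width, so something like ``(?<=user: *)`` would raise an error.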


## In Practice

For this, we'll use the [Enron email dataset](https://www.kaggle.com/wcukierski/enron-email-dataset?select=emails.csv). We have two columns, ``file`` and ``message``. If we take a look at the first instance of the ``message`` column we find:

In [12]:
import pandas as pd

In [13]:
data = pd.read_csv("data/emails.csv")

In [14]:
print(data['message'].iloc[0])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


From this data, we would like to extract ``Message-ID`` and ``Date``.

In [15]:
# we could write something like this:
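def boring_stract(msg):
    msg = msg.splitlines()
    # first line: keep the two numeric parts of the Message-ID
    msg_id = "".join(msg[0].replace('Message-ID: <', "").split(".")[:2])
    # second line: drop the "Date: " prefix and the trailing " (PDT)"
    time = msg[1].replace("Date: ", "")[:-6]
    return msg_id, time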

Now, this looks pretty awful and is going to take a long time to run. It's also not dynamic at all: what if for some reason a field is missing in the data, or ``Date`` and ``From`` switch positions? The code would break.
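
The fragility is easy to reproduce. The sketch below uses a shortened, hypothetical header (not a real row from the dataset) in which ``From`` and ``Date`` have swapped places. In this case the positional logic does not even raise an error; it quietly returns the wrong line.

```python
# Hypothetical, shortened header in which From and Date have swapped places
msg = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
)

lines = msg.splitlines()

# Positional logic assumes the second line is always the Date field
time = lines[1].replace("Date: ", "")[:-6]

# No error is raised; we get a truncated From line instead of a date
print(time)
print(time.startswith("From:"))  # True
```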

This is where we use regex, relying heavily on look-ahead and look-behind assertions. Let's compare code with and without regex:

In [26]:
data['message'].iloc[0].splitlines()

['Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>',
 'Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)',
 'From: phillip.allen@enron.com',
 'To: tim.belden@enron.com',
 'Subject: ',
 'Mime-Version: 1.0',
 'Content-Type: text/plain; charset=us-ascii',
 'Content-Transfer-Encoding: 7bit',
 'X-From: Phillip K Allen',
 'X-To: Tim Belden <Tim Belden/Enron@EnronXGate>',
 'X-cc: ',
 'X-bcc: ',
 "X-Folder: \\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
 'X-Origin: Allen-P',
 'X-FileName: pallen (Non-Privileged).pst',
 '',
 'Here is our forecast',
 '',
 ' ']

**Message-ID**

In [33]:
msg = data['message'].iloc[0]

# No RegEx
msg = msg.splitlines()
msg_id = msg[0].replace('Message-ID: <', '')
msg_id = msg_id.split('.')[:2]
msg_id = ''.join(msg_id)
msg_id

'187829811075855378110'

In [35]:
msg = data['message'].iloc[0]

# With RegEx:
# dots are escaped so they match a literal "." rather than any character
msg_id = re.search(r'(?<=Message-ID: <)\d+\.\d+(?=\.)', msg).group()
msg_id

'18782981.1075855378110'

**Date**

In [44]:
msg = data['message'].iloc[0]

# No RegEx
msg = msg.splitlines()
time = msg[1].replace('Date: ',"")[:-6]
time

'Mon, 14 May 2001 16:39:00 -0700'

In [39]:
msg = data['message'].iloc[0]

# With RegEx:
time = re.search(r'(?<=Date: ).*(?= \(\w\w\w\))', msg).group()
time

'Mon, 14 May 2001 16:39:00 -0700'
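
As a rough sketch of how the two patterns might be reused across many messages, they can be compiled once and wrapped in a helper. The names ``MSG_ID_RE``, ``DATE_RE`` and ``extract_fields`` are hypothetical and not from the article, and returning ``None`` for a missing field is one possible design choice among several.

```python
import re

# Compile once so the patterns can be reused across many messages
MSG_ID_RE = re.compile(r"(?<=Message-ID: <)\d+\.\d+(?=\.)")
DATE_RE = re.compile(r"(?<=Date: ).*(?= \(\w\w\w\))")


def extract_fields(msg):
    """Return (msg_id, time); either value is None if its pattern is absent."""
    id_match = MSG_ID_RE.search(msg)
    date_match = DATE_RE.search(msg)
    msg_id = id_match.group() if id_match else None
    time = date_match.group() if date_match else None
    return msg_id, time


# Shortened copy of the header shown earlier
sample = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
)

print(extract_fields(sample))
# ('18782981.1075855378110', 'Mon, 14 May 2001 16:39:00 -0700')
print(extract_fields("Subject: no headers here"))
# (None, None)
```

On the full dataset this could be applied with something like ``data['message'].apply(extract_fields)``, though that has not been run here.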

In both the Message-ID and Date regexes, we begin with ``(?<= )`` and end with ``(?= )``, the positive look-behind and look-ahead assertions respectively. For the example above, we output the following for ``msg_id`` and ``time``: