

Regular expressions, also called regex, are a syntax, or rather a small language, for searching, extracting, and manipulating specific string patterns in a larger text. They are widely used in projects that involve text validation, NLP, and text mining.

# Context:
Fraudulent e-mails contain criminally deceptive information, usually with the intent of convincing the recipient to give the sender a large amount of money. Perhaps the best known type of fraudulent e-mails is the Nigerian Letter or “419” Fraud.

# Content:
This dataset is a collection of more than 2,500 "Nigerian" Fraud Letters, dating from 1998 to 2007.

These emails are in a single text file. Each e-mail has a header which includes the following information:

Return-Path: address the email was sent from

X-Sieve: the X-Sieve host (always cmu-sieve 2.0)

Message-Id: a unique identifier for each message

From: the message sender (sometimes blank)

Reply-To: the email address to which replies will be sent

To: the email address to which the e-mail was originally sent (some are truncated for anonymity)

Date: Date e-mail was sent

Subject: Subject line of e-mail

X-Mailer: The platform the e-mail was sent from

MIME-Version: The Multipurpose Internet Mail Extension version

Content-Type: type of content & character encoding

Content-Transfer-Encoding: encoding in bits

X-MIME-Autoconverted: the type of autoconversion done

Status: r (read) and o (opened)

In [0]:
#First, we’ll prepare the data set by opening the text file in read-only mode and reading it. We’ll also assign its contents to a variable, fh (for “file handle”).



In [9]:
#Now, suppose we want to find out who the emails are from. We could try raw Python on its own:

fh = open(r"fradulent_emails.txt", "r",encoding="utf8", errors='ignore').read()
for line in fh.split("\n"):
    if "From:" in line:
        print(line)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "Maryam Abacha" <m_abacha03@www.com>
From: Kuta David <davidkuta@postmark.net>
From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>
From: "William Drallo" <william2244drallo@maktoob.com>
From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>
From: "Tunde  Dosumu" <barrister_td@lycos.com>
From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>
From: "Dr.Sam jordan" <sjordan@diplomats.com>
From: p_brown2@lawyer.com
Subject: From: Barrister Peter Brown
From: mic_k1@post.com
From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com>
From: "MRS MARIAM ABACHA" <elixwilliam@usa.com>
From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>
From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>
From: "Victor Aloma" <victorloma@netscape.net>
From: "Victor Aloma" <victorloma@netscape.net>
From: "JAMES NGOLA

But that’s not giving us exactly what we want: the substring test also catches lines such as "Subject: From: Barrister Peter Brown", where "From:" appears in the middle of the line. We could fix this in plain Python, but instead, let’s use Python’s re module and do it with regular expressions!

We’ll start by importing Python’s re module. Then, we’ll use a function called re.findall() that returns a list of all instances of a pattern we define in the string we’re looking at.


In [0]:
import re

for line in re.findall("From:.*", fh):
    print(line)

This is essentially the same length as our raw Python, but that’s because it’s a very simple example. The more you’re trying to do, the more effort Python regex is likely to save you.

Before we move on, let’s take a closer look at re.findall(). This function takes two arguments in the form of re.findall(pattern, string). Here, pattern represents the substring we want to find, and string represents the main string we want to find it in. The main string can consist of multiple lines. In this case, we’re having it search through all of fh, the file with our selected emails.

The .* is shorthand for "any run of characters up to the end of the line." We take it apart piece by piece further on.

Common Python Regex Patterns
The pattern we used with re.findall() above contains a fully spelled-out string, "From:". This is useful when we know precisely what we’re looking for, right down to the actual letters and whether or not they’re upper or lower case. If we don’t know the exact format of the strings we want, we’d be lost. Fortunately, regex has basic patterns that account for this scenario. Let’s look at the ones we use in this tutorial:

\w matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -.

\d matches digits, which means 0-9.

\s matches whitespace characters, which include the tab, new line, carriage return, and space characters.

\S matches non-whitespace characters.

. matches any character except the new line character \n.

With these regex patterns in hand, you’ll quickly understand our code above as we go on to explain it.
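
To make the list above concrete, here is a minimal sketch that applies each pattern to one short string. The string and the variable names are made up for illustration and do not come from the email corpus.

```python
import re

# Hypothetical sample text; "\t" is a real tab character.
sample = "Order_42 shipped on 2002-10-31\tto box-7"

words = re.findall(r"\w+", sample)   # runs of letters, digits, and underscores
digits = re.findall(r"\d+", sample)  # runs of digits
spaces = re.findall(r"\s", sample)   # individual whitespace characters
chunks = re.findall(r"\S+", sample)  # runs of non-whitespace characters
box = re.findall(r"box.", sample)    # "box" followed by any one character

print(words)
print(digits)
print(spaces)
print(chunks)
print(box)
```

Note that \w+ splits "2002-10-31" and "box-7" at the dashes, while \S+ keeps them whole. That difference is why the email pattern later in this tutorial leans on \S.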

In [0]:
#We can now explain the use of .* in the line re.findall("From:.*", text) above. Let’s look at . first:
for line in re.findall("From:.", fh):
    print(line)

By adding a . next to From:, we look for one additional character next to it. Because . looks for any character except \n, it captures the space character, which we cannot see. We can try more dots to verify this.

In [0]:
for line in re.findall("From:...........", fh):
    print(line)

It looks like adding dots does acquire the rest of the line for us. But, it’s tedious and we don’t know how many dots to add. This is where the asterisk symbol, *, comes in.

* matches zero or more instances of the pattern on its left, so it lets us look for repeats. By default, * is “greedy”: it consumes as many characters as it can while still letting the overall pattern match. Adding a question mark, as in *?, makes it “non-greedy” or “lazy,” so it consumes as few characters as possible.
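
The difference between a greedy and a lazy quantifier is easiest to see side by side. The following is a small sketch on a made-up line with two quoted names (the line is hypothetical, not taken from the corpus).

```python
import re

# Hypothetical line containing two quoted names.
line = 'From: "A" <a@x.com> and "B" <b@y.com>'

# Greedy: .* runs from the first quotation mark to the LAST one on the line.
greedy = re.findall(r'".*"', line)

# Lazy: .*? stops at the NEXT quotation mark, so each name is matched separately.
lazy = re.findall(r'".*?"', line)

print(greedy)
print(lazy)
```

On the From: lines in this data set there is only one quoted name per line, so the greedy form behaves well, but the lazy form is the safer habit when a delimiter can repeat.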

In [0]:
for line in re.findall("From:.*", fh):
    print(line)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "Maryam Abacha" <m_abacha03@www.com>
From: Kuta David <davidkuta@postmark.net>
From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>
From: "William Drallo" <william2244drallo@maktoob.com>
From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>
From: "Tunde  Dosumu" <barrister_td@lycos.com>
From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>
From: "Dr.Sam jordan" <sjordan@diplomats.com>
From: p_brown2@lawyer.com
From: Barrister Peter Brown
From: mic_k1@post.com
From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com>
From: "MRS MARIAM ABACHA" <elixwilliam@usa.com>
From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>
From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>
From: "Victor Aloma" <victorloma@netscape.net>
From: "Victor Aloma" <victorloma@netscape.net>
From: "JAMES NGOLA" <james_

Because * matches zero or more instances of the pattern indicated on its left, and . is on its left here, we are able to acquire all the characters in the From: field until the end of the line. This prints out the full line with beautifully succinct code.

We might even go further and isolate only the name. Let’s use re.findall() to return a list of lines containing the pattern "From:.*" as we’ve done before. We’ll assign it to the variable match for neatness. Next, we’ll iterate through the list. In each cycle, we’ll execute re.findall again, matching the first quotation mark to pick out just the name:

In [17]:
match = re.findall("From:.*", fh)

for line in match:
    print(re.findall('\".*\"', line))

['"MR. JAMES NGOLA."']
['"Mr. Ben Suleman"']
['"PRINCE OBONG ELEME"']
['"PRINCE OBONG ELEME"']
['"Maryam Abacha"']
[]
['"Barrister tunde dosumu"']
['"William Drallo"']
['"MR USMAN ABDUL"']
['"Tunde  Dosumu"']
[]
['"Dr.Sam jordan"']
[]
[]
[]
['"COL. MICHAEL BUNDU"']
['"MRS MARIAM ABACHA"']
['" DR. ANAYO AWKA "']
['" DR. ANAYO AWKA "']
['"Victor Aloma"']
['"Victor Aloma"']
['"JAMES NGOLA"']
['"MARTIN CHIME"']
['"Mr George Mboro"']
['"MARTIN  CHIME"']
['"MARTIN  CHIME"']
[]
[]
['"ADE WILLIAMS"']
[]
['"MRS. M SESE-SEKO"']
['"obina okoro"']
['"DR. JAMES  ALBERT"']
[]
[]
['"MR FRED OBI."']
['"mbeki ngumeni"']
[]
['"Mr. David Agu"']
['"bell.idr bell.idr"']
['"idris.bello idris.bello"']
[]
['"CLEMENT APUTE"']
['"MR GODWIN IGBUNU"']
['"bell.idr bell.idr"']
['"Mr. David Agu"']
['"deborah kabila"']
[]
[]
['"MOHAMMED BELLO (Ph.D)"']
['"PETER WILLIAMS"']
['"Oliveira Savimbi"']
['"Mr. Daniel Osondu"']
[]
[]
['"Ruka"']
['"IDRIS ADAMU"']
['"DR.  SANNI  SABO"']
['"DR.  SANNI  SABO"']
['"Ibrahim Galadim

Notice that we put a backslash in front of each quotation mark. The backslash is a special character used for escaping other special characters. For instance, when we want a quotation mark to be part of a string rather than the character that ends it, we escape it with a backslash like this: \". Here the pattern is wrapped in single quotes, so the double quotes inside it would also work unescaped. The escapes become necessary when the pattern itself is delimited by double quotes: written as "".*"", the Python interpreter would read a period and an asterisk between two empty strings. It would produce an error and break the script.

After the first quotation mark is matched, .* acquires all the characters in the line until the next quotation mark, also escaped in the pattern. This gets us just the name, within quotation marks. The name is also printed within square brackets because re.findall returns matches in a list.

#email address

In [0]:
match = re.findall("From:.*", fh)

for line in match:
    print(re.findall(r"\w\S*@.*\w", line))

Emails always contain an @ symbol, so we start with it. The part of the email before the @ symbol might contain alphanumeric characters, which means \w is required. However, because some emails contain a period or a dash, that’s not enough. We add \S to look for non-whitespace characters. But \w\S will get only two characters, so we add * to look for repetitions. The front part of the pattern thus looks like this: \w\S*@.

In [0]:
for line in match:
    print(re.findall("@.*", line))

The domain name usually contains alphanumeric characters, periods, and sometimes a dash, so a . will do. To match more than one character, we follow it with a *, which greedily matches any characters up to the end of the line.

If we look at the line closely, we see that each email is encapsulated within angle brackets, < and >. Our pattern, .*, includes the closing bracket, >. Let’s remedy it:

In [0]:
for line in match:
    print(re.findall(r"@.*\w", line))


Email addresses end with an alphanumeric character, so we cap the pattern with \w. So, after the @ symbol we have .*\w, which means that the pattern we want is a group of any type of characters ending with an alphanumeric character. This excludes >.

Our full email address pattern thus looks like this: \w\S*@.*\w.

In [0]:
for line in match:
    print(re.findall(r"\w\S*@.*\w", line))
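
The pattern above relies on the address being the last thing on the line. As a hedged sketch, the block below runs it on one line taken from the output earlier and on one made-up line with trailing text, and compares it with a stricter character-class pattern. The second line and the name strict_pattern are assumptions for illustration only.

```python
import re

tutorial_pattern = r"\w\S*@.*\w"
# Stricter alternative (an assumption, not the tutorial's pattern):
# only letters, digits, underscores, dots, and dashes on both sides of @.
strict_pattern = r"[\w.-]+@[\w.-]+"

line_1 = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'
# Hypothetical line with extra text after the closing angle bracket.
line_2 = 'From: Kuta David <davidkuta@postmark.net> (via list)'

loose_1 = re.findall(tutorial_pattern, line_1)
loose_2 = re.findall(tutorial_pattern, line_2)
strict_1 = re.findall(strict_pattern, line_1)
strict_2 = re.findall(strict_pattern, line_2)

print(loose_1, strict_1)
print(loose_2, strict_2)
```

Both patterns agree on the first line. On the second, the greedy .* in the tutorial pattern runs on to the last alphanumeric character of the line, while the stricter pattern stops at the end of the address.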

#Common Python Regex Functions
re.findall() is undeniably useful, but it’s not the only built-in function that’s available to us in re:

re.search()

re.split()

re.sub()

re.search()
While re.findall() matches all instances of a pattern in a string and returns them in a list, re.search() matches the first instance of a pattern in a string, and returns it as a re match object.

In [27]:
match = re.search("From:.*", fh)
print(type(match))
print(type(match.group()))
print(match)
print(match.group())

<class '_sre.SRE_Match'>
<class 'str'>
<_sre.SRE_Match object; span=(190, 244), match='From: "MR. JAMES NGOLA." <james_ngola2002@maktoob>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>


Like re.findall(), re.search() also takes two arguments. The first is the pattern to match, and the second is the string to find it in. Here, we’ve assigned the results to the match variable for neatness.

Because re.search() returns a re match object, we can’t display the name and email address by printing it directly. Instead, we have to call its group() method first. We’ve printed both their types out in the code above. As we can see, group() returns the matched text as a string.

We can also see that printing match displays properties beyond the string itself, whereas printing match.group() displays only the string.
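
A match object carries more than the matched text, and re.search() returns None when nothing matches. The sketch below shows both on a short header; the To: address and the variable names are made up for illustration.

```python
import re

first_line = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'
# Hypothetical two-line header built from the line above.
header = first_line + "\nTo: someone@example.com"

match = re.search(r"From:.*", header)
if match is not None:
    print(match.group())  # the matched text as a string
    print(match.span())   # (start, end) positions of the match in header

# No Subject: line exists here, so re.search() returns None.
missing = re.search(r"Subject:.*", header)
print(missing)
```

Calling .group() on None raises an AttributeError, which is why the longer script later in this tutorial checks for None before using each result.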

#re.split()
Suppose we need a quick way to get the domain name of the email addresses. We could do it with three regex operations, like so:

In [0]:
address = re.findall("From:.*", fh)
for item in address:
    for line in re.findall(r"\w\S*@.*\w", item):
        username, domain_name = re.split("@", line)
        print("{}, {}".format(username, domain_name))

The first line is familiar. We return a list of strings, each containing the contents of the From: field, and assign it to a variable. Next, we iterate through the list to find the email addresses. At the same time, we iterate through the email addresses and use the re module’s split() function to snip each address in half, with the @ symbol as the delimiter. Finally, we print it.
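
Here is a minimal sketch of re.split() on its own, using one address from the output above and one made-up list of addresses. With a fixed delimiter such as @, the plain string method split() gives the same result; re.split() earns its keep when the delimiter is itself a pattern.

```python
import re

address = "james_ngola2002@maktoob.com"

# Fixed delimiter: re.split() and str.split() agree.
username, domain_name = re.split(r"@", address)
same_as_str_split = address.split("@") == [username, domain_name]

# Pattern delimiter: a comma or semicolon followed by optional whitespace.
# The address list is hypothetical.
parts = re.split(r"[;,]\s*", "a@x.com; b@y.com,c@z.com")

print(username, domain_name)
print(same_as_str_split)
print(parts)
```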

#re.sub()
Another handy re function is re.sub(). As the function name suggests, it substitutes parts of a string. An example:

In [29]:
sender = re.search("From:.*", fh)
address = sender.group()
email = re.sub("From", "Email", address)
print(address)
print(email)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Email: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>


We’ve already seen the tasks on the first and second lines before. On the third line, we apply re.sub() on address, which is the full From: field in the email header.

re.sub() takes three arguments. The first is the substring to substitute, the second is a string we want in its place, and the third is the main string itself.
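
As a short sketch of the three arguments in action, the block below repeats the substitution from the cell above, then shows two variations: a pattern as the first argument, and the optional count argument that limits how many replacements are made. The strings "Tunde  Dosumu" and "banana" are just convenient examples.

```python
import re

address = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# re.sub(pattern, replacement, string)
email_line = re.sub(r"From", "Email", address)

# The first argument is a regex, so it can match more than a fixed substring:
# here, any run of whitespace is collapsed into one space.
collapsed = re.sub(r"\s+", " ", "Tunde  Dosumu")

# count=1 replaces only the first occurrence.
first_only = re.sub(r"a", "A", "banana", count=1)

print(email_line)
print(collapsed)
print(first_only)
```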

#Sorting Emails with Python Regex and Pandas
Our corpus is a single text file containing thousands of emails.

We’ll use regex and pandas to sort the parts of each email into appropriate categories so that the corpus can be more easily read or analysed.

We’ll sort each email into the following categories:

sender_name

sender_email

recipient_email

recipient_name

date_sent

subject

email_body




Each of these categories will become a column in our pandas dataframe (i.e., our table). This will make it easier for us to work on and analyze each column individually.

It’s worth reiterating that regular expressions allow us to write more concise code, which is easier to read, check, and adapt. On a handful of emails there’s not much difference, but if you try processing the entire corpus with and without regex, you’ll start to see the advantages!

#Importing
To start, let’s import the libraries we’ll need and get our file opened again.

In addition to re and pandas, we’ll import Python’s email package as well, which will help with the body of the email. The body of the email is rather complicated to work with using regex alone. It might even require enough cleaning up to warrant its own tutorial. So, we’ll use the well-developed email package to save some time and let us focus on learning regex.
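
For orientation, here is a minimal sketch of what the email package does with a single message. The raw message is made up for illustration; real emails in the corpus carry many more headers and may be multipart, in which case get_payload() returns a list of parts rather than a string.

```python
import email

# Hypothetical raw message: headers, a blank line, then the body.
raw = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
)

message = email.message_from_string(raw)

subject = message["Subject"]   # header lookup by name
body = message.get_payload()   # body text for a simple, non-multipart message

print(subject)
print(body)
```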

In [0]:
import re
import pandas as pd
import email

emails = []

fh = open(r"fradulent_emails.txt", "r",encoding="utf8", errors='ignore').read()

We’ve also created an empty list, emails, which will store dictionaries. Each dictionary will contain the details of each email.

Now, let’s begin applying regex!

In [2]:
contents = re.split(r"From r", fh)
#contents
contents.pop(0)

''

We use the re module’s split function to split the entire chunk of text in fh into a list of separate emails, which we assign to the variable contents. This is important because we want to work on the emails one by one, by iterating through the list with a for loop. But, how do we know to split by the string "From r"?

We know this because we looked into the file before we wrote the script. We didn’t have to peruse the thousands of emails in there. Just the first few, to see what the structure of the data looks like. Whenever possible, it’s good to get your eyes on the actual data before you start working with code, as you’ll often discover useful features like this.

Emails start with “From r”

Disorganized data like this may require a lot of cleaning up. For instance, even though we count 3,977 emails in this set using the full script we’re about to construct for this tutorial, there are actually more. Some emails actually are not preceded by "From r", and so are not counted separately. (However, for the purposes of brevity, we’ll proceed as if that issue has already been fixed and all emails are separated by "From r".)

Notice also that we use contents.pop(0) to get rid of the first element in the list. That’s because a "From r" string precedes the first email. When that string is split, it produces an empty string at index 0. The script we’re about to write is designed for emails. If we try to use it on an empty string, it might throw errors. Getting rid of the empty string lets us keep these errors from breaking our script.
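
The effect of the split and the pop can be seen on a tiny stand-in corpus. The two "emails" below are invented and far shorter than the real ones; only the "From r" separator follows the file described above.

```python
import re

# Hypothetical miniature corpus with two emails, each preceded by "From r".
corpus = (
    "From r  Wed Oct 30\nFrom: a@x.com\n\nBody one\n"
    "From r  Thu Oct 31\nFrom: b@y.com\n\nBody two\n"
)

contents = re.split(r"From r", corpus)
print(len(contents))       # 3: an empty string plus the two emails

leading = contents.pop(0)  # remove and keep the element before the first separator
print(repr(leading))
print(len(contents))       # 2
```

Because the text starts with the separator, everything "before" it is the empty string, and that is the element pop(0) discards.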

#Getting Every Name and Address With a For Loop

In [0]:
for item in contents:
    # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    # Find sender's email address and name.

    # Step 1: find the whole line beginning with "From:".
    sender = re.search(r"From:.*", item)

In [12]:
type(sender)

NoneType

With Step 1, we find the entire From: field using the re.search() function. The . means any character except \n, and * extends it to the end of the line. We then assign this to the variable sender.

But, data isn’t always straightforward. It can contain surprises. For instance, what if there’s no From: field? The script would throw an error and break. We pre-empt errors from this scenario in Step 2.

In [0]:
for item in contents:
    # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    # Find sender's email address and name.

    # Step 1: find the whole line beginning with "From:".
    sender = re.search(r"From:.*", item)

    # Step 2: find the email address and name.
    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None

In [0]:
print("sender type: " + str(type(sender)))

sender type: <class 'NoneType'>


We’ll use a different tactic for the name. Each name is bounded by the colon, :, of the substring "From:" on the left, and by the opening angle bracket, <, of the email address on the right. Hence, we use :.*< to find the name. We get rid of : and < from each result in a moment.

Now, let’s print out the results of our code to see how they look.

Recall that if there is no From: field, sender would have the value of None, and so too would s_email and s_name. Hence, we have to check for this scenario again so that the script doesn’t break unexpectedly. Let’s see how to construct the code with s_email first.

In [17]:
for item in contents:
    # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    # Find sender's email address and name.

    # Step 1: find the whole line beginning with "From:".
    sender = re.search(r"From:.*", item)

    # Step 2: find the email address and name.
    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None

    # Step 3A: assign email address as string to a variable.
    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None

    # Add email address to dictionary.
    emails_dict["sender_email"] = sender_email

    # Step 3B: strip the colon, the angle bracket, and the whitespace around the name.
    if s_name is not None:
        sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
    else:
        sender_name = None

    # Add sender's name to dictionary.
    emails_dict["sender_name"] = sender_name
    print(emails_dict)

{'sender_email': 'james_ngola2002@maktoob.com', 'sender_name': '"MR. JAMES NGOLA."'}
{'sender_email': 'bensul2004nng@spinfinder.com', 'sender_name': '"Mr. Ben Suleman"'}
{'sender_email': 'obong_715@epatra.com', 'sender_name': '"PRINCE OBONG ELEME"'}
{'sender_email': 'obong_715@epatra.com', 'sender_name': '"PRINCE OBONG ELEME"'}
{'sender_email': 'm_abacha03@www.com', 'sender_name': '"Maryam Abacha"'}
{'sender_email': 'davidkuta@postmark.net', 'sender_name': 'Kuta David'}
{'sender_email': 'tunde_dosumu@lycos.com', 'sender_name': '"Barrister tunde dosumu"'}
{'sender_email': 'william2244drallo@maktoob.com', 'sender_name': '"William Drallo"'}
{'sender_email': 'abdul_817@rediffmail.com', 'sender_name': '"MR USMAN ABDUL"'}
{'sender_email': 'barrister_td@lycos.com', 'sender_name': '"Tunde  Dosumu"'}
{'sender_email': 'temijohnson2@rediffmail.com', 'sender_name': 'MR TEMI JOHNSON'}
{'sender_email': 'sjordan@diplomats.com', 'sender_name': '"Dr.Sam jordan"'}
{'sender_email'

In [0]:
emails_dict

{'sender_email': None, 'sender_name': None}

Again, we have match objects. Every time we apply re.search() to strings, it produces match objects. We have to turn them into string objects.


Then, we use the re module’s re.sub() function twice before assigning the string to a variable. First, we remove the colon and any whitespace characters between it and the name. We do this by substituting :\s* with an empty string "". Then, we remove whitespace characters and the angle bracket on the other side of the name, again substituting \s*< with an empty string. Finally, after assigning the string to sender_name, we add it to the dictionary.
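
The two substitutions are easier to follow when unrolled on a single line. This sketch uses one From: line from the output earlier in the notebook; the intermediate variable names (raw_name, no_colon) are introduced here only for illustration.

```python
import re

sender_line = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# Everything from the colon up to the opening angle bracket.
s_name = re.search(r":.*<", sender_line)
raw_name = s_name.group()

# First substitution: drop the colon and the whitespace after it.
no_colon = re.sub(r":\s*", "", raw_name)

# Second substitution: drop the whitespace and the angle bracket at the end.
sender_name = re.sub(r"\s*<", "", no_colon)

print(repr(raw_name))
print(repr(no_colon))
print(repr(sender_name))
```

The backslash in \s matters: without it, the pattern looks for the literal letter s, and the spaces around the name are left in place.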

In [0]:
#Now that we’ve found the sender’s email address and name, we do exactly the same set of steps 
#to acquire the recipient’s email address and name for the dictionary.

In [38]:
for item in contents:
# First two lines again so that Jupyter runs the code.
  emails_dict = {}

# Find sender's email address and name.

    # Step 1: find the whole line beginning with "From:".
  sender = re.search(r"From:.*", item)
  # Step 2: find the email address and name.
  if sender is not None:
    s_email = re.search(r"\w\S*@.*\w", sender.group())
    s_name = re.search(r":.*<", sender.group())
  else:
    s_email = None
    s_name = None

# Step 3A: assign email address as string to a variable.
  if s_email is not None:
      sender_email = s_email.group()
  else:
      sender_email = None
# Add email address to dictionary.
  emails_dict["sender_email"] = sender_email

  if s_name is not None:
      sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
  else:
      sender_name = None

# Add sender's name to dictionary.
  emails_dict["sender_name"] = sender_name
  
  

  recipient = re.search(r"To:.*", item)
  if recipient is not None:
    r_email = re.search(r"\w\S*@.*\w", recipient.group())
    #print (r_email.group())
    r_name = re.search(r":.*<", recipient.group())
    #print(r_name.group())
  else:
    r_email = None
    r_name = None

  if r_email is not None:
    #print(r_email.group())
    recipient_email = r_email.group()
  else:
    recipient_email = None

  emails_dict["recipient_email"] = recipient_email

  if r_name is not None:
    
    recipient_name = re.sub("\s*<", "", re.sub(":\s*", "", r_name.group()))
    print(recipient_name)
  else:
    recipient_name = None

  emails_dict["recipient_name"] = recipient_name


" resbob@al-islam.com"

"=?iso-8859-1?Q?lawrence.smith?="

"elisabth  lamine"
"webmaster@aclweb.org"

"R@M"
"grace abba-gana"

"astrada@kukamail.com"

"godfrey eche3"
"godfrey eche3"
"s.kaa@katamail.com"
=?iso-8859-1?Q?makathy?=
Rrrrr
Rrrrr
Rrrrr

Rrrrr
Rrrrr
Rrrrr

"SI2003@kukamail.com"
llavanwa@UM>,rossah@UM>,








"zainabuba@katamail.com"
"patelnikes4you"
"patelnikes4you"
"patelnikes4you"




"mjaja02"

"jeffersila04"
"dr_frances"
"dr_frances"
"markduke2004"

"williams\.onwa"

"larry_ed1"

"Shadak Shari"
"Shadak Shari"
"Shadak Shari"
rrrrr
"info@crack-ag.com"
"Joey"
"rrrrr"


Masesela Dikgale


"rrrrr"
rrrrr
"rrrrr"
"rrrrr"
rrrrr
"kenneth masuku."
"danbrown"
"SESAY MASSAQUOE."
"SESAY MASSAQUOE."






rrrrr




rrrrr
"lucas"
rrrrr
rrrrr
rrrrr
rrrrr
rrrrr
"mohammad urban"
"mohammad urban"
rrrrr


"benard_mcarthy@wanadoo.es"
"benard_mcarthy@wanadoo.es"
rrrrr


"KOJO WILLIAM"


rrrrr
"webmaster"
"webmaster"

"peterkhan@latinmail.com"
"webmaster"
"webmaster"
"webmaster"
"webmaster"


In [37]:
#Getting the Date of the Email

for item in contents:
    # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    # Find the whole Date: field, then the "day month year" part inside it.
    date_field = re.search(r"Date:.*", item)
    if date_field is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    else:
        date = None

    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None

    emails_dict["date_sent"] = date_sent
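
To see what the date pattern picks out, here is a small sketch on a single Date: line. The line is made up, modeled on the "31 Oct 2002" value that appears in the results further down; the conversion with datetime.strptime is an optional extra, not part of the tutorial's script, and its %b directive assumes English month abbreviations in the current locale.

```python
import re
from datetime import datetime

# Hypothetical header line in the usual email date layout.
date_line = "Date: Thu, 31 Oct 2002 02:38:20 +0000"

# \d+ (day), whitespace, \w+ (month name), whitespace, \d+ (year)
date = re.search(r"\d+\s\w+\s\d+", date_line)
date_sent = date.group() if date is not None else None
print(date_sent)

# Optional: turn the string into a datetime object for sorting or filtering.
parsed = datetime.strptime(date_sent, "%d %b %Y")
print(parsed.year, parsed.month, parsed.day)
```

"Thu," is skipped because the pattern must start with a digit, and the match stops before the time because a colon is neither a digit nor whitespace.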

In [0]:
import re
import pandas as pd
import email

emails = []


contents = re.split(r"From r",fh)
contents.pop(0)

for item in contents:
    emails_dict = {}

    sender = re.search(r"From:.*", item)

    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None

    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None

    emails_dict["sender_email"] = sender_email

    if s_name is not None:
        sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
    else:
        sender_name = None

    emails_dict["sender_name"] = sender_name

    recipient = re.search(r"To:.*", item)

    if recipient is not None:
        r_email = re.search(r"\w\S*@.*\w", recipient.group())
        r_name = re.search(r":.*<", recipient.group())
    else:
        r_email = None
        r_name = None

    if r_email is not None:
        recipient_email = r_email.group()
    else:
        recipient_email = None
    emails_dict["recipient_email"] = recipient_email

    if r_name is not None:
        recipient_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", r_name.group()))
    else:
        recipient_name = None

    emails_dict["recipient_name"] = recipient_name

    date_field = re.search(r"Date:.*", item)

    if date_field is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    else:
        date = None

    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None

    emails_dict["date_sent"] = date_sent

    subject_field = re.search(r"Subject: .*", item)

    if subject_field is not None:
        subject = re.sub(r"Subject: ", "", subject_field.group())
    else:
        subject = None

    emails_dict["subject"] = subject

    # The parsed body is replaced with the placeholder "email body here"
    # so that the full email is not displayed.

    full_email = email.message_from_string(item)
    body = full_email.get_payload()
    emails_dict["email_body"] = "email body here"

    emails.append(emails_dict)
# Print number of dictionaries, and hence, emails, in the list.
print("Number of emails: " + str(len(emails)))

print("\n")

# Print first item in the emails list to see how it looks.
for key, value in emails[0].items():
    print(str(key) + ": " + str(value))

Number of emails: 3977


sender_email: james_ngola2002@maktoob.com
sender_name: "MR. JAMES NGOLA."
recipient_email: james_ngola2002@maktoob.com
recipient_name: None
date_sent: 31 Oct 2002
subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
email_body: email body here
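
In the first record printed above, recipient_email is identical to sender_email. One possible explanation, offered as an assumption rather than something checked against the corpus, is that the unanchored pattern To:.* also matches the tail of a Reply-To: header that comes earlier in the message. The sketch below reproduces that effect on a made-up header and shows one way to avoid it with ^ and the re.MULTILINE flag.

```python
import re

# Hypothetical header in which Reply-To: comes before To:.
header = (
    "From: a@x.com\n"
    "Reply-To: a@x.com\n"
    "To: webmaster@aclweb.org\n"
    "Subject: Hi"
)

# Unanchored: the first "To:" found is the one inside "Reply-To:".
loose = re.search(r"To:.*", header).group()

# Anchored: with re.MULTILINE, ^ matches at the start of every line,
# so only a line that begins with "To:" qualifies.
anchored = re.search(r"^To:.*", header, flags=re.MULTILINE).group()

print(loose)
print(anchored)
```

The same anchoring would also keep From:.* from matching a "Subject: From: ..." line, as seen in the very first output of this notebook.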


##Manipulating Data With Pandas

In [0]:
import pandas as pd
# Module imported above, imported again as reminder.
emails_df = pd.DataFrame(emails)

In [0]:
emails_df.head()

Unnamed: 0,sender_email,sender_name,recipient_email,recipient_name,date_sent,subject,email_body
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",james_ngola2002@maktoob.com,,31 Oct 2002,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,email body here
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",R@M,,31 Oct 2002,URGENT ASSISTANCE /RELATIONSHIP (P),email body here
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",obong_715@epatra.com,,31 Oct 2002,GOOD DAY TO YOU,email body here
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",webmaster@aclweb.org,,31 Oct 2002,GOOD DAY TO YOU,email body here
4,m_abacha03@www.com,"""Maryam Abacha""",m_abacha03@www.com,,1 Nov 2002,I Need Your Assistance.,email body here


We can also find precisely what we want. For instance, we can find all the emails sent from a particular domain name. However, let’s learn a new regex pattern to improve our precision in finding the items we want.

The pipe symbol, |, looks for characters on either side of itself. For instance, a|b looks for either a or b.

| might seem to do the same as [ ], but they really are different. Suppose we want to match either "crab", "lobster", or "isopod". Using crab|lobster|isopod would make more sense than [crablobsterisopod], wouldn’t it? The former would look for each whole word, whereas the latter would look for every single letter.
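
A quick sketch makes the contrast visible. The sample text is made up and includes "cab", which is built entirely from letters in the three words but is not one of them.

```python
import re

text = "crab cab lobster isopod"

# Alternation: match any one of the three whole words.
whole_words = re.findall(r"crab|lobster|isopod", text)

# Character class: match any single character from the set.
letters = re.findall(r"[crablobsterisopod]", text)

# Character class with +: any run of those letters, so "cab" slips through.
runs = re.findall(r"[crablobsterisopod]+", text)

print(whole_words)
print(letters[:4])
print(runs)
```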

In [0]:
emails_df[emails_df["sender_email"].str.contains("epatra|spinfinder",na=False)]

Unnamed: 0,sender_email,sender_name,recipient_email,recipient_name,date_sent,subject,email_body
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",R@M,,31 Oct 2002,URGENT ASSISTANCE /RELATIONSHIP (P),email body here
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",obong_715@epatra.com,,31 Oct 2002,GOOD DAY TO YOU,email body here
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",webmaster@aclweb.org,,31 Oct 2002,GOOD DAY TO YOU,email body here
567,kaladah@epatra.com,,webmaster@aclweb.org,,24 Nov 2003,KALADA HART,email body here
584,rharare1@spinfinder.com,robert harare,R@M,,01 Dec 2003,Investment Opportunity,email body here


In [0]:
# Step 1: find the index where the "sender_email" column contains "@spinfinder.com".
index = emails_df[emails_df["sender_email"].str.contains(r"\w\S*@spinfinder\.com",na=False)].index.values

In [0]:
# Step 2: use the index to find the value of the cell in the "sender_email" column.
# The result is returned as pandas Series object
address_Series = emails_df.loc[index, "sender_email"]
print(address_Series)
print(type(address_Series))

1      bensul2004nng@spinfinder.com
584         rharare1@spinfinder.com
Name: sender_email, dtype: object
<class 'pandas.core.series.Series'>


In [0]:
# Step 3: extract the email address, which is the first element of the Series object (index label 1).
address_string = address_Series.iloc[0]
print(address_string)
print(type(address_string))

bensul2004nng@spinfinder.com
<class 'str'>


In [0]:
# Step 4: find the value of the "email_body" column where the "sender email" column is address_string.
print(emails_df[emails_df["sender_email"] == address_string]["email_body"].values)

['email body here']


In [0]:
##Regex Groups

Group Extraction
The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

In [0]:
# Named "text" rather than "str" so the built-in str() is not shadowed.
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com
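
Groups also change what re.findall() returns, and they can be given names. The following is a brief sketch on a made-up string of two addresses; the group names user and host are chosen here for illustration.

```python
import re

text = "abc.test@gmail.com, xyz@test.in"

# With two groups in the pattern, findall() returns a list of tuples,
# one tuple per match, one item per group.
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)
print(pairs)

# Named groups: (?P<name>...) lets us fetch a group by name instead of number.
match = re.search(r"(?P<user>[\w.-]+)@(?P<host>[\w.-]+)", text)
if match:
    print(match.group("user"))
    print(match.group("host"))
```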


#In Class

#Problem 1: Return each character of a given string, excluding spaces


'Intellee is an academy for getting your IT skills upgraded'

In [19]:
import re
result=re.findall(r'\w','Intellee is an academy for getting your IT skills upgraded')
print (result)

['I', 'n', 't', 'e', 'l', 'l', 'e', 'e', 'i', 's', 'a', 'n', 'a', 'c', 'a', 'd', 'e', 'm', 'y', 'f', 'o', 'r', 'g', 'e', 't', 't', 'i', 'n', 'g', 'y', 'o', 'u', 'r', 'I', 'T', 's', 'k', 'i', 'l', 'l', 's', 'u', 'p', 'g', 'r', 'a', 'd', 'e', 'd']


#Problem 2: Return the first and last words of a given string, excluding spaces

In [20]:
input_string = 'Intellee is an academy for getting your IT skills upgraded'
result=re.findall(r'^\w+',input_string)
print (result)
result=re.findall(r'\w+$',input_string)
print (result)

['Intellee']
['upgraded']


#Problem 2: Return the first two characters of each word

In [21]:
result=re.findall(r'\b\w\w',input_string)
print (result)

['In', 'is', 'an', 'ac', 'fo', 'ge', 'yo', 'IT', 'sk', 'up']


#Problem 3: Return the domain of an email ID

In [28]:
# Capture everything after the @ that looks like name.suffix; the dot is escaped so it matches a literal period.
result=re.findall(r'@(\w+\.\w+)','abc.test@gmail.com, xyz@test.in, test.first@rbc.com, first.test@rest.biz')
print (result)

['gmail.com', 'test.in', 'rbc.com', 'rest.biz']


#Problem 4: Return date from given string

In [0]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print (result)

['12-05-2007', '11-11-2011', '12-01-2009']


#Problem 5: Return all words of a string that start with a vowel

In [0]:
result=re.findall(r'\b[aeiouAEIOU]\w+',input_string)
print (result) 

['Intellee', 'is', 'an', 'academy', 'IT', 'upgraded']


#Problem 6: Validate a phone number (phone number must have 10 digits and start with 7 or 4)

In [30]:
import re
li=['7057967656','413-705-7966','4137057966']
for val in li:
    # 10 digits starting with 4 or 7; dashes are allowed between the 3-3-4 digit groups.
    if re.fullmatch(r'[47]\d{2}-?\d{3}-?\d{4}', val):
        print ('yes')
    else:
        print ('no')

yes
yes
yes
