# Spammer: Harvesting emails and phone numbers
Extract phone numbers and email addresses from web pages (HTML).

In [1]:
from __future__ import print_function

import numpy as np
import os
import pandas as pd
import re

## Data
The data is split into training and test sets. Each file is an HTML page with embedded e-mail addresses and phone numbers.

In [2]:
TRAIN_FOLDER = 'dev'
TEST_FOLDER = 'test'
TRAIN_LABELS_FILE = 'devGOLD'

In [3]:
train_files = os.listdir(TRAIN_FOLDER)
print(sorted(train_files))

['aiken', 'ashishg', 'balaji', 'bgirod', 'cheriton', 'christos', 'dabo', 'dlwh', 'dm', 'engler', 'eroberts', 'fedkiw', 'hager', 'hanrahan', 'horowitz', 'jks', 'jurafsky', 'jure', 'knuth', 'koller', 'kosecka', 'kunle', 'lam', 'latombe', 'levoy', 'manning', 'nass', 'nick', 'nickm']


These are the HTML files in our training set. Let's load all the raw text.

In [4]:
# Dictionary comprehension FTW
train_data = {fn:open(os.path.join(TRAIN_FOLDER, fn), 'r').readlines() for fn in train_files}
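
The one-liner above leaves every file handle open and relies on the platform's default text encoding. A minimal sketch of a more defensive loader follows; `read_lines` and `load_folder` are hypothetical helper names, and the `iso-8859-1` default is an assumption based on the charset several of the pages declare in their `<meta>` tags, not something the data set documents.

```python
import io
import os
import tempfile

def read_lines(folder, file_name, encoding='iso-8859-1'):
    """Read one file and close the handle as soon as we are done."""
    # Assumption: Latin-1 is a safe default because it can decode any byte.
    with io.open(os.path.join(folder, file_name), 'r', encoding=encoding) as f:
        return f.readlines()

def load_folder(folder):
    """Map each file name in `folder` to its list of lines."""
    return {fn: read_lines(folder, fn) for fn in sorted(os.listdir(folder))}

# Tiny self-contained demo on a throwaway folder (not the real 'dev' data).
demo_folder = tempfile.mkdtemp()
with io.open(os.path.join(demo_folder, 'demo'), 'w', encoding='iso-8859-1') as f:
    f.write(u'Email: someone @ example.edu\n')

demo_data = load_folder(demo_folder)
print(demo_data)
```

With the real data, `train_data = load_folder(TRAIN_FOLDER)` would produce the same dictionary shape as the comprehension above.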

Ok, great. Now for each training file we have the raw HTML. Let's take a look at the training labels next.

In [5]:
with open(TRAIN_LABELS_FILE, 'r') as f:
    for line in f.readlines()[:5]:
        print(line)

ashishg	e	ashishg@stanford.edu

ashishg	e	rozm@stanford.edu

ashishg	p	650-723-1614

ashishg	p	650-723-4173

ashishg	p	650-814-1478



We can see that our labels come in a whitespace-delimited file. From context we can tell that the first column is the training file name, the second column is the type ('e' for email, 'p' for phone number), and the third column is the extracted value. This is a good format for a pandas data frame, so let's load the label data using pandas.

In [6]:
train_labels = pd.read_csv(TRAIN_LABELS_FILE, sep=r'\s+',  # split on runs of whitespace
                           header=None, names=['file_name','type','value'])
print(train_labels.head())

  file_name type                 value
0   ashishg    e  ashishg@stanford.edu
1   ashishg    e     rozm@stanford.edu
2   ashishg    p          650-723-1614
3   ashishg    p          650-723-4173
4   ashishg    p          650-814-1478


Great, looking good! Split the data frame up by type for our own use and we're good to go.

In [7]:
train_email_labels = train_labels[train_labels.type == 'e']
print(train_email_labels.head())
print(train_email_labels.shape[0])

   file_name type                     value
0    ashishg    e      ashishg@stanford.edu
1    ashishg    e         rozm@stanford.edu
5     balaji    e       balaji@stanford.edu
9   cheriton    e  cheriton@cs.stanford.edu
10  cheriton    e       uma@cs.stanford.edu
28


In [8]:
train_phone_labels = train_labels[train_labels.type == 'p']
print(train_phone_labels.head())
print(train_phone_labels.shape[0])

  file_name type         value
2   ashishg    p  650-723-1614
3   ashishg    p  650-723-4173
4   ashishg    p  650-814-1478
6    bgirod    p  650-723-4539
7    bgirod    p  650-724-3648
41


## Train our model
Let's start developing our algorithm. We will be using cascades of regular expressions. So instead of having a machine learning model learn from the training data through some sort of optimization, I will manually update the patterns to fit the training data, and hope they generalize to the test data.

### Email Patterns
Let's start by looking at lines with emails. We'll use the ashishg file to debug our work.

In [9]:
train_email_labels[train_email_labels.file_name == 'ashishg']

  file_name type                 value
0   ashishg    e  ashishg@stanford.edu
1   ashishg    e     rozm@stanford.edu


In [10]:
for line in train_data['ashishg']:
    print(line)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title> Ashish Goel </title>

<style type="text/css">

<!--

.style1 {font-size: 10px}

.style3 {font-family: Arial, Helvetica, sans-serif}

.style5 {font-family: Arial; font-size: 10px; }

.style6 {font-family: Verdana, Arial, Helvetica, sans-serif}

.style7 {font-family: Arial}

.style8 {font-size: 10pt}

.style9 {font-family: Arial; font-size: 10pt; }

body {

	background-color: #FFFFFF;

}

-->

</style>

</head>



<body>

<p>&nbsp;</p>

<table width="697" border="0">

  <tr>

    <td width="144"><img src="goelsmallc.png" width="144" height="163"></td>

    <td width="543"><p class="MsoNormal style1"><strong><span class="MsoNormal"><span

style='font-size:14.0pt;font-family:Arial'><b>Ashish Goel<o:p></o:p><br>

    </b></span></span></strong><span style='font-size:10.0pt;font-family:Arial'

Manually inspecting the file, we find the two lines containing the emails:

Email: ashishg @ stanford.edu<o:p></o:p><br>

and

Admin asst: Roz Morf, Terman 405, (650)723-4173, rozm @ stanford.edu</span></p>

Note the spaces around "@". Let's take a crack at a regular expression to capture these emails, and turn them into our standard form.

In [11]:
local_pattern = r'([a-zA-Z]+)'  # capture group around local mailbox name
at_pattern = r' *@ *'  # optional spaces around '@' symbol
domain_pattern = r'([a-zA-Z]+\.[a-zA-Z]+)'  # capture group around domain name

email_pattern = local_pattern + at_pattern + domain_pattern
    
def grep_emails(lines):
    global email_pattern
    
    email_regex = re.compile(email_pattern)
    
    for line in lines:
        captured_groups = email_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                local_name = captured_group[0]
                domain_name = captured_group[1]
                email = local_name.lower() + '@' + domain_name.lower()
                yield email, line

In [12]:
for email, _ in grep_emails(train_data['ashishg']):
    print(email)

ashishg@stanford.edu
rozm@stanford.edu


Great! One file down. How many other training examples does our regex catch?

In [13]:
def extract_training_emails():
    file_names = []
    emails = []
    for fn in train_data.keys():
        for email, _ in grep_emails(train_data[fn]):
            file_names.append(fn)
            emails.append(email)
    df = pd.DataFrame({'file_name':file_names, 'value':emails})
    df = df.drop_duplicates()
    df = df.reset_index(drop=True)
    return df
            
training_emails = extract_training_emails()
print(training_emails)

   file_name                 value
0      kunle   kunle@ogun.stanford
1      kunle  darlene@csl.stanford
2    manning   manning@cs.stanford
3    manning   dbarros@cs.stanford
4     engler        engler@lcs.mit
5     fedkiw    fedkiw@cs.stanford
6   hanrahan  hanrahan@cs.stanford
7       nick  parlante@cs.stanford
8    kosecka        kosecka@cs.gmu
9   eroberts  eroberts@cs.stanford
10   latombe   latombe@cs.stanford
11   latombe   asandra@cs.stanford
12   latombe   liliana@cs.stanford
13  cheriton  cheriton@cs.stanford
14  cheriton       uma@cs.stanford
15      nass     nass@stanford.edu
16      dabo      dabo@cs.stanford
17   ashishg  ashishg@stanford.edu
18   ashishg     rozm@stanford.edu
19    balaji   balaji@stanford.edu


Whoa, that's a lot! Any false positives?

In [14]:
def print_email_false_positives():
    print("False Positives:")

    for fn in train_data.keys():
        for email, line in grep_emails(train_data[fn]):
            labels_subset = train_email_labels[train_email_labels.file_name == fn]
            if email not in labels_subset.value.values:
                print("%s: %s from line \"%s\"" % (fn, email, line))
                
print_email_false_positives()

False Positives:
kunle: kunle@ogun.stanford from line "HREF="mailto:kunle@ogun.stanford.edu">kunle@ogun.stanford.edu</A><BR>
"
kunle: kunle@ogun.stanford from line "HREF="mailto:kunle@ogun.stanford.edu">kunle@ogun.stanford.edu</A><BR>
"
kunle: darlene@csl.stanford from line "HREF="mailto:darlene@csl.stanford.edu">darlene@csl.stanford.edu</A>
"
kunle: darlene@csl.stanford from line "HREF="mailto:darlene@csl.stanford.edu">darlene@csl.stanford.edu</A>
"
manning: manning@cs.stanford from line "    <td><a href="mailto:manning@cs.stanford.edu">manning@cs.stanford.edu</a></td>
"
manning: manning@cs.stanford from line "    <td><a href="mailto:manning@cs.stanford.edu">manning@cs.stanford.edu</a></td>
"
manning: dbarros@cs.stanford from line "    <a href="mailto:dbarros@cs.stanford.edu">dbarros@cs.stanford.edu</a>
"
manning: dbarros@cs.stanford from line "    <a href="mailto:dbarros@cs.stanford.edu">dbarros@cs.stanford.edu</a>
"
manning: manning@cs.stanford from line "<a href="mailto:manning@cs.

Hmm. These don't look like false positives. There are emails there, we're just not extracting them correctly. A quick glance shows the most common error we're making is not processing domain names with two periods. Let's fix that.

In [15]:
domain_pattern = r'([a-zA-Z]+(?:\.[a-zA-Z]+)+)'  # use non-capture group
email_pattern = local_pattern + at_pattern + domain_pattern

print_email_false_positives()

False Positives:
nick: parlante@cs.stanford.edu from line "EMail: <A HREF="mailto:nick.parlante@cs.stanford.edu">nick.parlante@cs.stanford.edu</a><br>
"
nick: parlante@cs.stanford.edu from line "EMail: <A HREF="mailto:nick.parlante@cs.stanford.edu">nick.parlante@cs.stanford.edu</a><br>
"


Ok, great, that removed all but two of the errors, and both come from the same line. Our regex is not correctly picking out local mailbox names with periods in them. We can easily fix that.

In [16]:
local_pattern = r'([a-zA-Z]+(?:\.[a-zA-Z]+)*)'  # add optional periods to local name
email_pattern = local_pattern + at_pattern + domain_pattern

print_email_false_positives()

False Positives:
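
To see what each revision changed, the three versions of the pattern can be run side by side on one of the addresses from the outputs above. This is only a sketch: `V1`, `V2`, `V3`, and `first_email` are names made up for the comparison, not part of the notebook.

```python
import re

AT = r' *@ *'
# V1: single-dot domain, V2: multi-dot domain, V3: dots in the local part too.
V1 = r'([a-zA-Z]+)' + AT + r'([a-zA-Z]+\.[a-zA-Z]+)'
V2 = r'([a-zA-Z]+)' + AT + r'([a-zA-Z]+(?:\.[a-zA-Z]+)+)'
V3 = r'([a-zA-Z]+(?:\.[a-zA-Z]+)*)' + AT + r'([a-zA-Z]+(?:\.[a-zA-Z]+)+)'

def first_email(pattern, text):
    """Return the first match as 'local@domain', or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    local_name, domain_name = match.groups()
    return local_name.lower() + '@' + domain_name.lower()

sample = 'mailto:nick.parlante@cs.stanford.edu'
results = [first_email(p, sample) for p in (V1, V2, V3)]
for result in results:
    print(result)
# parlante@cs.stanford
# parlante@cs.stanford.edu
# nick.parlante@cs.stanford.edu

# Why the inner groups are non-capturing: findall returns one tuple slot per
# capturing group, and a repeated capturing group keeps only its last repeat.
capturing = re.findall(r'([a-z]+(\.[a-z]+)+)', 'cs.stanford.edu')
non_capturing = re.findall(r'([a-z]+(?:\.[a-z]+)+)', 'cs.stanford.edu')
print(capturing)      # [('cs.stanford.edu', '.edu')]
print(non_capturing)  # ['cs.stanford.edu']
```

Keeping exactly two capturing groups is what lets `grep_emails` read `captured_group[0]` and `captured_group[1]` without caring how many dots the address has.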


Great, all fixed. And to answer the earlier question: no, our pattern is pretty specific and hasn't matched any non-email text. What about false negatives: which emails have we failed to pick up on?

In [17]:
def print_email_false_negatives():
    print("False Negatives:")

    training_emails = extract_training_emails()
    common = train_email_labels.merge(training_emails, on=['file_name', 'value'])
    print(train_email_labels[(~train_email_labels.file_name.isin(common.file_name)) & \
                       (~train_email_labels.value.isin(common.value))][['file_name', 'value']].to_string(index=False))

print_email_false_negatives()

False Negatives:
file_name                          value
    dlwh              dlwh@stanford.edu
   hager               hager@cs.jhu.edu
     jks      jks@robotics.stanford.edu
jurafsky          jurafsky@stanford.edu
     lam            lam@cs.stanford.edu
   levoy      ada@graphics.stanford.edu
   levoy  melissa@graphics.stanford.edu
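
One caveat about the filter in `print_email_false_negatives`: it combines two separate `isin` tests with `&`, so as written it appears to report a label only when its file has no correct match at all. A file where one address is found and another is missed would then stay hidden. A stricter check is an anti-join on the `(file_name, value)` pair. The sketch below uses small toy frames built from addresses in the output above; which ones count as "found" is invented for illustration, and `false_negatives` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-ins for train_email_labels and the extracted emails.
labels = pd.DataFrame({
    'file_name': ['levoy', 'levoy', 'ashishg'],
    'value': ['ada@graphics.stanford.edu',
              'melissa@graphics.stanford.edu',
              'ashishg@stanford.edu'],
})
extracted = pd.DataFrame({
    'file_name': ['levoy', 'ashishg'],
    'value': ['ada@graphics.stanford.edu', 'ashishg@stanford.edu'],
})

def false_negatives(labels, extracted):
    """Rows of `labels` with no exact (file_name, value) match in `extracted`."""
    merged = labels.merge(extracted.drop_duplicates(),
                          on=['file_name', 'value'],
                          how='left', indicator=True)
    # indicator=True adds a '_merge' column; 'left_only' marks unmatched labels.
    missed = merged[merged['_merge'] == 'left_only']
    return missed[['file_name', 'value']].reset_index(drop=True)

missed = false_negatives(labels, extracted)
print(missed.to_string(index=False))
```

In this toy case the second levoy address is reported even though the first one was found, which the two-`isin` filter would not do.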


Ok, good. This narrows down which files we need to look at. Let's start with dlwh.

In [18]:
for line in train_data['dlwh']:
    print(line)

d-l-w-h-@-s-t-a-n-f-o-r-d-.-e-d-u



Well that just seems petty. We'll have to add dashes as acceptable characters in our regular expression. Standardizing the matched string inside the regex would take some really funky nested capture groups, so let's not do that with regex at all and just remove dashes from the email after the match. This requires us to rewrite our grep function.

In [19]:
local_pattern = r'([-a-zA-Z]+(?:\.[-a-zA-Z]+)*)'  # add dashes
domain_pattern = r'([-a-zA-Z]+(?:\.[-a-zA-Z]+)+)'  # add dashes

email_pattern = local_pattern + at_pattern + domain_pattern

def grep_emails(lines):
    global email_pattern
    
    email_regex = re.compile(email_pattern)
    
    for line in lines:
        captured_groups = email_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                local_name = captured_group[0]
                domain_name = captured_group[1]
                email = local_name.lower() + '@' + domain_name.lower()
                email = email.replace("-", "")  # special case :(
                yield email, line

In [20]:
print_email_false_positives()
print_email_false_negatives()

False Positives:
False Negatives:
file_name                          value
   hager               hager@cs.jhu.edu
     jks      jks@robotics.stanford.edu
jurafsky          jurafsky@stanford.edu
     lam            lam@cs.stanford.edu
   levoy      ada@graphics.stanford.edu
   levoy  melissa@graphics.stanford.edu
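
Stripping every dash fixes the dlwh file, but it would also rewrite a legitimately hyphenated address. A possible refinement, sketched here under the assumption that the obfuscation always puts a dash between every pair of characters, is to strip dashes only in that case. `strip_spelled_out_dashes` is a hypothetical helper, not part of `grep_emails`, and `jane@my-site.example` is a made-up address.

```python
import re

# Matches strings where single characters alternate with dashes: x-y-z
SPELLED_OUT = re.compile(r'^(?:[^-]-)+[^-]$')

def strip_spelled_out_dashes(email):
    """Remove dashes only when every character is separated by one."""
    if SPELLED_OUT.match(email):
        return email.replace('-', '')
    return email

obfuscated = 'd-l-w-h-@-s-t-a-n-f-o-r-d-.-e-d-u'
hyphenated = 'jane@my-site.example'

print(strip_spelled_out_dashes(obfuscated))  # dlwh@stanford.edu
print(strip_spelled_out_dashes(hyphenated))  # jane@my-site.example
print(hyphenated.replace('-', ''))           # jane@mysite.example
```

Whether the extra rule is worth it depends on how common hyphenated domains are in the pages being scraped; none show up in the training labels above.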


Ok, great. We handled that special case. Unfortunately that didn't knock out any of the other false negatives, so we'll have to look at the hager file next.

In [21]:
for line in train_data['hager']:
    print(line)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>Dr. Gregory D. Hager - Dept of Computer Science - The Johns Hopkins University</title>

<link rel="stylesheet" href="default.css" type="text/css" />

<script type="text/javascript" src="http://www.jhu.edu/~homepage/scripts/rollover.js"></script>

<script language="JavaScript" type="text/javascript" src="http://www.jhu.edu/~homepage/scripts/globals.js"></script>

<script language="VBScript" type="text/vbscript" src="http://www.jhu.edu/~homepage/scripts/vb.txt"></script>

<script language="JavaScript1.1" type="text/javascript" src="http://www.jhu.edu/~homepage/scripts/detect.js"></script>

</head>



<body>

    

<table width="100%"  border="0" cellspacing="0" cellpadding="0" summary="this is the layout table that centers the page layout">

	<tr>

		<td class="leftside">&nbsp;</

The email is coming from this tricky line:

<li> <a href="https://www.cs.jhu.edu">hager at cs dot jhu dot edu</a></li>

So we need to parse 'at' as '@' (sometimes), and 'dot' as '.' (sometimes). This will be tricky.

In [22]:
at_pattern = r'(?: *@ *| +at +)'  # non-capture group
domain_pattern = r'([-a-zA-Z]+(?:(?:\.| +dot +)[-a-zA-Z]+)+)'  # add dashes

email_pattern = local_pattern + at_pattern + domain_pattern

def grep_emails(lines):
    global email_pattern
    
    email_regex = re.compile(email_pattern)
    
    for line in lines:
        captured_groups = email_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                local_name = captured_group[0]
                domain_name = captured_group[1]
                email = local_name.lower() + '@' + domain_name.lower()
                email = email.replace("-", "")
                email = re.sub(r'\bdot\b', '.', email)  # another special case, 'dot' -> '.'
                email = email.replace(" ", "")
                yield email, line
                
print_email_false_positives()
print_email_false_negatives()

False Positives:
jure: server@cs.stanford.edu from line "<address>Apache/2.2.4 (Fedora) Server at cs.stanford.edu Port 80</address>
"
False Negatives:
file_name                          value
     jks      jks@robotics.stanford.edu
jurafsky          jurafsky@stanford.edu
   levoy      ada@graphics.stanford.edu
   levoy  melissa@graphics.stanford.edu
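
Tracing the two lines quoted above through the same pattern and clean-up steps shows why they are hard to tell apart: to the regex, both are just "word, at, dotted name". The `normalize` helper below is a hypothetical wrapper around the post-processing lines from the cell, written out only for this trace.

```python
import re

local_pattern = r'([-a-zA-Z]+(?:\.[-a-zA-Z]+)*)'
at_pattern = r'(?: *@ *| +at +)'
domain_pattern = r'([-a-zA-Z]+(?:(?:\.| +dot +)[-a-zA-Z]+)+)'
email_regex = re.compile(local_pattern + at_pattern + domain_pattern)

def normalize(local_name, domain_name):
    """Same clean-up steps as the cell above, applied to one match."""
    email = local_name.lower() + '@' + domain_name.lower()
    email = email.replace('-', '')
    email = re.sub(r'\bdot\b', '.', email)  # 'cs dot jhu' -> 'cs . jhu'
    return email.replace(' ', '')           # 'cs . jhu'   -> 'cs.jhu'

lines = [
    '<li> <a href="https://www.cs.jhu.edu">hager at cs dot jhu dot edu</a></li>',
    '<address>Apache/2.2.4 (Fedora) Server at cs.stanford.edu Port 80</address>',
]
found = [normalize(*groups)
         for line in lines
         for groups in email_regex.findall(line)]
print(found)  # ['hager@cs.jhu.edu', 'server@cs.stanford.edu']
```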


Oh no! Now that we're translating 'x at y' to 'x@y', we mistakenly match 'Server at cs.stanford.edu' as the email "server@cs.stanford.edu". Honestly, I don't think we can avoid this though. It's only from the surrounding context of the line that we can tell this is a false positive. Our spammer list will just have to live with this. On the bright side, we knocked out two of the false negatives with that change (hager and lam). Next up is the file jks.

In [23]:
for line in train_data['jks']:
    print(line)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:st1="urn:schemas-microsoft-com:office:smarttags" xmlns="http://www.w3.org/TR/REC-html40"><head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<meta name="ProgId" content="Word.Document">

<meta name="Generator" content="Microsoft Word 9">

<meta name="Originator" content="Microsoft Word 9">

<link rel="File-List" href="./index_files/filelist.xml">

<link rel="Edit-Time-Data" href="./index_files/editdata.mso"><!--[if !mso]>

<style>

v\:* {behavior:url(#default#VML);}

o\:* {behavior:url(#default#VML);}

w\:* {behavior:url(#default#VML);}

.shape {behavior:url(#default#VML);}

</style>

<![endif]-->







<title>Kenneth Salisbury's Home Page</title><!--[if gte mso 9]><xml>

 <o:DocumentProperties>

  <o:Author>Kenneth Salisbury</o:Author>

  <o:Las

The offending line is:

<p class="MsoNormal" style="margin-left: 3.45pt;">jks at robotics;stanford;edu<br>&nbsp;</p>

We just have to add semicolons as a special-case separator, just like 'dot'.

In [24]:
domain_pattern = r'([-;a-zA-Z]+(?:(?:\.| +dot +|;)[-;a-zA-Z]+)+)'  # add semicolons

email_pattern = local_pattern + at_pattern + domain_pattern

def grep_emails(lines):
    global email_pattern
    
    email_regex = re.compile(email_pattern)
    
    for line in lines:
        captured_groups = email_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                local_name = captured_group[0]
                domain_name = captured_group[1]
                email = local_name.lower() + '@' + domain_name.lower()
                email = email.replace("-", "")
                email = email.replace(";", ".")  # another special case, ';' -> '.'
                email = re.sub(r'\bdot\b', '.', email)
                email = email.replace(" ", "")
                yield email, line
                
print_email_false_positives()
print_email_false_negatives()

False Positives:
jure: server@cs.stanford.edu from line "<address>Apache/2.2.4 (Fedora) Server at cs.stanford.edu Port 80</address>
"
False Negatives:
file_name                          value
jurafsky          jurafsky@stanford.edu
   levoy      ada@graphics.stanford.edu
   levoy  melissa@graphics.stanford.edu


Phew. Almost there. Next up is the jurafsky file.

In [25]:
for line in train_data['jurafsky']:
    print(line)

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">

<html>

<head>

   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<META name="description" content="Home Page for Dan Jurafsky at Stanford University">

<META name="keywords" content="Jurafsky, jurafsky, computational linguistics, speech recognition">

   <title>Dan Jurafsky - Home Page</title>

</head>



<script type="text/javascript"><!-- 

function obfuscate( domain, name ) { document.write('<a href="mai' + 

'lto:' + name + '@' + domain + '">' + name + '@' + domain + '</' + 'a>'); }

// --></script>





<style type="text/css">

<!--



body {

  font-size: 11pt;

font-family: "Frutiger Linotype", Calibri, Georgia, "Garamond Book", "New Baskerville", "Times New Roman", Times, serif;

}



.header {

font-size: 120%;

font-family: "Frutiger Linotype", Calibri, "Myriad Web", "Myriad Roman", "Gill Sans", "Gill Sans MT", Arial, Helvetica, sans-serif;

}



.column_title{

margin-bottom: 10px;

}

Wow. This author really does not want their email to be scraped. They define a JavaScript function:

function obfuscate( domain, name ) { document.write('<a href="mai' + 

'lto:' + name + '@' + domain + '">' + name + '@' + domain + '</' + 'a>'); }

and later list their email with it: "obfuscate('stanford.edu','jurafsky');"

Should we respect their wishes? (Yes) Nah, let's add a special case for their function.

In [26]:
def grep_emails(lines):
    global local_pattern, at_pattern, domain_pattern
    
    email_pattern = local_pattern + at_pattern + domain_pattern
    email_regex = re.compile(email_pattern)
    
    obfuscate_pattern = r"obfuscate\( *'" + domain_pattern + \
                        r"' *, *'" + local_pattern + r"' *\) *;"
    obfuscate_regex = re.compile(obfuscate_pattern)
    
    for line in lines:
        captured_groups = email_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                local_name = captured_group[0]
                domain_name = captured_group[1]
                email = local_name.lower() + '@' + domain_name.lower()
                email = email.replace("-", "")
                email = email.replace(";", ".")
                email = re.sub(r'\bdot\b', '.', email)
                email = email.replace(" ", "")
                yield email, line
                
        # Add checking of obfuscation regex
        obfuscate_captured_groups = obfuscate_regex.findall(line)
        if len(obfuscate_captured_groups) != 0:
            for captured_group in obfuscate_captured_groups:
                local_name = captured_group[1]
                domain_name = captured_group[0]
                email = local_name.lower() + '@' + domain_name.lower()
                email = email.replace("-", "")
                email = email.replace(";", ".")
                email = re.sub(r'\bdot\b', '.', email)
                email = email.replace(" ", "")
                yield email, line
                
print_email_false_positives()
print_email_false_negatives()

False Positives:
jure: server@cs.stanford.edu from line "<address>Apache/2.2.4 (Fedora) Server at cs.stanford.edu Port 80</address>
"
False Negatives:
file_name                          value
   levoy      ada@graphics.stanford.edu
   levoy  melissa@graphics.stanford.edu
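
The special case is easier to follow in isolation. The snippet below rebuilds only the obfuscation regex from the cell above and applies it to the call quoted earlier. Note the argument order: the function takes the domain first, so the capture groups come back as `(domain, local)` and have to be swapped.

```python
import re

local_pattern = r'([-a-zA-Z]+(?:\.[-a-zA-Z]+)*)'
domain_pattern = r'([-;a-zA-Z]+(?:(?:\.| +dot +|;)[-;a-zA-Z]+)+)'

obfuscate_regex = re.compile(
    r"obfuscate\( *'" + domain_pattern + r"' *, *'" + local_pattern + r"' *\) *;")

call = "obfuscate('stanford.edu','jurafsky');"
definition = "function obfuscate( domain, name ) { document.write('<a href=\"mai' + "

groups = obfuscate_regex.findall(call)
print(groups)  # [('stanford.edu', 'jurafsky')]

domain_name, local_name = groups[0]  # domain is captured first
email = local_name + '@' + domain_name
print(email)  # jurafsky@stanford.edu

# The function definition has no quoted arguments, so it is not matched.
print(obfuscate_regex.findall(definition))  # []
```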


One more file.

In [27]:
for line in train_data['levoy']:
    print(line)

<html>

<head>

<title>

Marc Levoy's Home Page

</title>

</head>

<body>



<h1>Marc Levoy</h1>



<table>

<tr>

<td width=225>

<!--

<a href=Marc_Museum2.jpg>

<img src="Marc_Museum2_35-light.gif"></a>

-->

<img src="marc-kamikochi-jul06-cbalessh.jpg"></a>

<br>

Professor,

<br>

jointly appointed in<br>

Computer Science and<br>

Electrical Engineering

</td>



<td width=375>

<dl>

<dt>	<b>Affiliations:</b>

<dd>	<a href="http://graphics.stanford.edu">

	Computer Graphics Laboratory</a>

<dd>	<a href="http://csl.stanford.edu">

	Computer Systems Laboratory</a>

<dd>	<a href="http://cs.stanford.edu">

	Computer Science Department</a>

<dd>	<a href="http://ee.stanford.edu/ee.html">

	Electrical Engineering Department</a>

<dd>	<a href="http://soe.stanford.edu/">

	School of Engineering</a>

<dd>	<a href="http://www.stanford.edu/">

	Stanford University</a>

<dt>	<b>Office:</b>

<dd>	Gates Computer Science Building<br>

	Room 366, Wing 3B

<dd>	Stanford University

<dd>	Stanford

`&#x40;` is an HTML character reference (hexadecimal 40) that the browser renders as "@", so the raw HTML never contains the symbol itself. We'll add it to our pattern.

In [28]:
at_pattern = r'(?: *@ *| +at +| *&#x40; *)'  # non-capture group

print_email_false_positives()
print_email_false_negatives()

False Positives:
jure: server@cs.stanford.edu from line "<address>Apache/2.2.4 (Fedora) Server at cs.stanford.edu Port 80</address>
"
False Negatives:
Empty DataFrame
Columns: [file_name, value]
Index: []
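
Adding `&#x40;` to the pattern covers this one spelling, but the same character can also be written as `&#64;`. A more general option, sketched here under the assumption that Python 3 is available (`html.unescape` does not exist in Python 2), is to decode character references before matching. The `raw_lines` strings are hypothetical examples in the style of the levoy page, not lines copied from it.

```python
import html
import re

email_regex = re.compile(
    r'([a-zA-Z]+(?:\.[a-zA-Z]+)*) *@ *([a-zA-Z]+(?:\.[a-zA-Z]+)+)')

raw_lines = [
    'ada&#x40;graphics.stanford.edu',      # hexadecimal reference
    'melissa&#64;graphics.stanford.edu',   # decimal reference
]

decoded = [html.unescape(line) for line in raw_lines]
print(decoded)
# ['ada@graphics.stanford.edu', 'melissa@graphics.stanford.edu']

emails = [local + '@' + domain
          for line in decoded
          for local, domain in email_regex.findall(line)]
print(emails)
# ['ada@graphics.stanford.edu', 'melissa@graphics.stanford.edu']
```

Decoding first keeps the at-sign pattern small, at the cost of one extra pass over every line.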


Yay, we've fit all of the email patterns in our training data! Yippee!

### Phone Patterns
Oh no, I almost forgot about the phone numbers. Maybe this will go faster now that we have a little experience.

In [29]:
# Copy all the email functions, just change names to phone
def grep_phones(lines):
    global phone_pattern

    phone_regex = re.compile(phone_pattern)
    
    for line in lines:
        captured_groups = phone_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                area_code = captured_group[0]
                prefix = captured_group[1]
                line_num = captured_group[2]
                phone_num = area_code + '-' + prefix + '-' + line_num
                phone_num = re.sub(r"\(|\)| +", "", phone_num)
                yield phone_num, line
                
def print_phone_false_positives():
    print("False Positives:")

    for fn in train_data.keys():
        for phone_num, line in grep_phones(train_data[fn]):
            labels_subset = train_phone_labels[train_phone_labels.file_name == fn]
            if phone_num not in labels_subset.value.values:
                print("%s: %s from line \"%s\"" % (fn, phone_num, line))
                
def print_phone_false_negatives():
    print("False Negatives:")

    file_names = []
    phone_nums = []
    for fn in train_data.keys():
        for phone_num, _ in grep_phones(train_data[fn]):
            file_names.append(fn)
            phone_nums.append(phone_num)

    training_phone_nums = pd.DataFrame({'file_name':file_names, 'value':phone_nums})
    training_phone_nums = training_phone_nums.drop_duplicates()
    training_phone_nums = training_phone_nums.reset_index(drop=True)

    common = train_phone_labels.merge(training_phone_nums, on=['file_name', 'value'])
    print(train_phone_labels[(~train_phone_labels.file_name.isin(common.file_name)) & \
                       (~train_phone_labels.value.isin(common.value))][['file_name', 'value']].to_string(index=False))


Let's take a first stab at the pattern before we look at the files.

In [30]:
phone_pattern = r"(\(?[0-9]{3}\)?)(?:-| ) *([0-9]{3})(?:-| ) *([0-9]{4})"

print_phone_false_positives()
print_phone_false_negatives()

False Positives:
False Negatives:
file_name         value
 ashishg  650-723-1614
 ashishg  650-723-4173
 ashishg  650-814-1478
horowitz  650-725-3707


Wow! Nice, just two files with special cases to look at.

In [31]:
for line in train_data['ashishg']:
    print(line)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title> Ashish Goel </title>

<style type="text/css">

<!--

.style1 {font-size: 10px}

.style3 {font-family: Arial, Helvetica, sans-serif}

.style5 {font-family: Arial; font-size: 10px; }

.style6 {font-family: Verdana, Arial, Helvetica, sans-serif}

.style7 {font-family: Arial}

.style8 {font-size: 10pt}

.style9 {font-family: Arial; font-size: 10pt; }

body {

	background-color: #FFFFFF;

}

-->

</style>

</head>



<body>

<p>&nbsp;</p>

<table width="697" border="0">

  <tr>

    <td width="144"><img src="goelsmallc.png" width="144" height="163"></td>

    <td width="543"><p class="MsoNormal style1"><strong><span class="MsoNormal"><span

style='font-size:14.0pt;font-family:Arial'><b>Ashish Goel<o:p></o:p><br>

    </b></span></span></strong><span style='font-size:10.0pt;font-family:Arial'

Ok, the line "Phone: (650)814-1478 [Cell], Fax: (650)723-1614"

doesn't have a separator between the area code and the prefix. Should we allow missing separators in general? 9876543210 could be a phone number or just a number. I suppose as long as the total length is correct, we can increase our recall by accepting some false positives.

In [32]:
phone_pattern = r"(\(?[0-9]{3}\)?)(?:-| |) *([0-9]{3})(?:-| |) *([0-9]{4})"

print_phone_false_positives()
print_phone_false_negatives()

False Positives:
bgirod: 134-217-6593 from line "	mso-font-signature:-1342176593 1775729915 48 0 524447 0;}
"
bgirod: 177-572-9915 from line "	mso-font-signature:-1342176593 1775729915 48 0 524447 0;}
"
bgirod: 134-217-6593 from line "	mso-font-signature:-1342176593 1775729915 48 0 524447 0;}
"
bgirod: 177-572-9915 from line "	mso-font-signature:-1342176593 1775729915 48 0 524447 0;}
"
bgirod: 161-904-3898 from line "	mso-list-template-ids:1619043898;}
"
bgirod: 194-257-0626 from line "	{mso-list-id:1942570626;
"
manning: 157-586-0368 from line "<a href="http://www.amazon.com/exec/obidos/ASIN/1575860368/">Ergativity:
"
latombe: 870-145-1791 from line "	mso-font-signature:-536870145 1791491579 18 0 131231 0;}
"
latombe: 161-061-1985 from line "	mso-font-signature:-1610611985 1107304683 0 0 159 0;}
"
latombe: 110-730-4683 from line "	mso-font-signature:-1610611985 1107304683 0 0 159 0;}
"
latombe: 107-371-7157 from line "	mso-font-signature:-520082689 -1073717157 41 0 66047 0;}
"
latombe

Whoa, that's a lot more false positives than expected. But no more false negatives! So we have a question of balance here between recall and precision. We could call it a day, but let's try getting more specialized: we will fit our model more closely to the training data (overfit?) by allowing a missing separator between the area code and prefix only when the area code is wrapped in parentheses.

In [33]:
phone_pattern = r"(\([0-9]{3}\)(?:-| |)|[0-9]{3}(?:-| +))([0-9]{3})(?:-| +)([0-9]{4})"

def grep_phones(lines):
    global phone_pattern

    phone_regex = re.compile(phone_pattern)
    
    for line in lines:
        captured_groups = phone_regex.findall(line)
        if len(captured_groups) != 0:
            for captured_group in captured_groups:
                area_code = captured_group[0]
                prefix = captured_group[1]
                line_num = captured_group[2]
                area_code = area_code.replace("-", "")  # special case
                phone_num = area_code + '-' + prefix + '-' + line_num
                phone_num = re.sub(r"\(|\)| +", "", phone_num)
                yield phone_num, line
                
print_phone_false_positives()
print_phone_false_negatives()

False Positives:
False Negatives:
Empty DataFrame
Columns: [file_name, value]
Index: []
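
The difference between the last two patterns shows up clearly on two strings taken from the outputs above: the ashishg phone line and one of the bgirod style-sheet lines. In this sketch `LOOSE` and `STRICT` are just labels for the patterns from In [32] and In [33], and `phones` is a hypothetical helper that repeats the normalization from `grep_phones`.

```python
import re

LOOSE = r"(\(?[0-9]{3}\)?)(?:-| |) *([0-9]{3})(?:-| |) *([0-9]{4})"
STRICT = r"(\([0-9]{3}\)(?:-| |)|[0-9]{3}(?:-| +))([0-9]{3})(?:-| +)([0-9]{4})"

def phones(pattern, text):
    """Return every match in `text`, normalized to ddd-ddd-dddd."""
    found = []
    for area_code, prefix, line_num in re.findall(pattern, text):
        area_code = area_code.replace('-', '')  # separator captured in group 1
        phone_num = area_code + '-' + prefix + '-' + line_num
        found.append(re.sub(r"\(|\)| +", "", phone_num))
    return found

phone_line = 'Phone: (650)814-1478 [Cell], Fax: (650)723-1614'
css_line = 'mso-list-template-ids:1619043898;}'

print(phones(LOOSE, phone_line))   # ['650-814-1478', '650-723-1614']
print(phones(LOOSE, css_line))     # ['161-904-3898']
print(phones(STRICT, phone_line))  # ['650-814-1478', '650-723-1614']
print(phones(STRICT, css_line))    # []
```

The strict pattern keeps both real numbers because the parentheses stand in for the missing separator, while a bare run of ten digits no longer qualifies.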


Ok, that was much easier than the emails. Who knows how the performance will be on the test set though. We probably overfit our training set.

## Test our model
Let's see how well the algorithm we've developed works on the test data.

In [34]:
test_files = os.listdir(TEST_FOLDER)
print(sorted(test_files))

['ok', 'ouster', 'pal']


In [35]:
test_data = {fn:open(os.path.join(TEST_FOLDER, fn), 'r').readlines() \
              for fn in test_files}

In [36]:
for fn in test_data.keys():
    for email, line in grep_emails(test_data[fn]):
        print("Found email \"%s\" in file \"%s\" in line \"%s\"" % \
              (email, fn, line))
    for phone_num, line in grep_phones(test_data[fn]):
        print("Found phone number \"%s\" in file \"%s\" in line \"%s\"" % \
              (phone_num, fn, line))

Found phone number "650-723-9753" in file "ok" in line "								Phone: (650) 723-9753<br>
"
Found phone number "650-725-1449" in file "ok" in line "								Fax: (650) 725-1449<br>
"
Found email "science@u.c" in file "ouster" in line "He was Professor of Computer Science at U.C. Berkeley from
"
Found phone number "650-725-9046" in file "pal" in line "            Phone: +1 650 725 9046<BR>
"


Ouch, ok yeah, we see a clear false positive: "Computer Science at U.C. Berkeley" became the email "science@u.c". We could further refine the model now, but that would turn our test set into a validation set used for tuning, and we would have to go scrape a new test set. Since I can't do that or regenerate the sets, we'll have to stop here. For a spammer, recall matters most: a false positive just means a call nobody answers or an email nobody receives. Spammers like to cast a wide net anyway, so not focusing too much on precision makes sense.
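
To put numbers on the precision and recall trade-off, both can be computed from sets of `(file_name, value)` pairs. The sketch below uses a toy example assembled from values that appear earlier in the notebook (two correct ashishg addresses plus the jure false positive). It is not a measurement of the full training or test set, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (file_name, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / float(len(predicted)) if predicted else 0.0
    recall = true_positives / float(len(gold)) if gold else 0.0
    return precision, recall

# Toy example only: two gold labels, three predictions.
gold = {
    ('ashishg', 'ashishg@stanford.edu'),
    ('ashishg', 'rozm@stanford.edu'),
}
predicted = gold | {('jure', 'server@cs.stanford.edu')}

precision, recall = precision_recall(predicted, gold)
print("precision=%.2f recall=%.2f" % (precision, recall))
# precision=0.67 recall=1.00
```

A spammer-style objective tolerates the lower precision here because every real address was still recovered.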