# Contact extraction : PDF parsing and regular expressions

Author: Nikola LOHINSKI

Github: https://github.com/NikolaLohinski/pdf_contact_extraction

<div class="alert alert-info" style="margin-top: 15px">
In this notebook, you will learn how to :
<br>
<ul style="list-style: none; padding: 0;">
<li>[&#128279;](#1-PDF-parsing-with-pdfminer) Cast a PDF document to text data in python using the open source library **pdfminer** ;</li>
<li>[&#128279;](#2-Data-extraction-with-Regular-Expressions) Retrieve phone numbers and emails from a text document using **regular expressions** ;</li>
<li>[&#128279;](#3-Contact-extraction) **Extract contacts** from a PDF non structured document using the two previous points.</li>
</ul>
</div>


<div class="alert alert-warning">
To run this notebook, you will need :
<br>
<ul>
<li>to run on **python $\geq$ 3.5.2**;</li>
<li>to have **pdfminer** installed. Run '**pip install pdfminer.six==20170720**' in the python environment you are using for this notebook.</li>
<li>to have **pyenchant** installed. Run '**pip install pyenchant**' in the python environment you are using for this notebook</li>
</ul>
</div>

Let's start with the imports :

In [2]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import re
import enchant
from functools import reduce

## 1 PDF parsing with `pdfminer`

References :
- [pdfminer](https://github.com/pdfminer/pdfminer.six) on Github
- [example of usage](https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python) from which this adapted example was built

Nothing fancy here: we are just converting PDF files back to text format using `pdfminer`.
<div class="alert alert-danger">
**Important :** Bear in mind that the following method only applies to digital files, not scans. It works on PDF files generated by software such as Microsoft Word, OpenOffice or Gedit, and not on documents that have been printed and then scanned, since a scan contains an image of the text rather than the text itself.
</div>

The following function shows how to use `pdfminer` to convert a PDF file to a string :

In [3]:
def convert_pdf_to_txt(path_to_file):
    """Extract the text of a digital (non-scanned) PDF file as a single string."""
    rsrcmgr = PDFResourceManager()
    # In-memory buffer that receives the extracted text
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # The file is closed automatically when leaving the `with` block
    with open(path_to_file, 'rb') as file_reader:
        pages = PDFPage.get_pages(
            file_reader,
            set(),        # page numbers to keep; an empty set keeps every page
            maxpages=0,   # 0 means no limit on the number of pages
            password="",
            caching=True,
            check_extractable=True
        )
        for page in pages:
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

Let's test this function out with a file :

In [4]:
path_2_file = 'Fake_PDF.pdf'
text = convert_pdf_to_txt(path_2_file)
print(text)

 

Company, Inc. 

 

Fake business trip report 

 

REF: ABC-1999-12-A  

 

 

 

 

 

         DATE: 13/12/1999 

Name 

Venste Bergspiel 

Company 

F.I.R.M.I.N  Germany 

Contact 

Business manager 

 

 

+49 6 03 89 92 99 

venste.bergspiel@company.de 

Visit date  Monday 13th December 1999 

Topic 

Business plan review 

 

1.  General 

Cats  is  a  sung-through  British musical composed by Andrew Lloyd Webber, based on Old Possum's Book 

of Practical Cats by T. S. Eliot, and produced by Cameron Mackintosh. The musical tells the story of a tribe of cats 

called the Jellicles and the night they  make what is known as "the Jellicle choice" and decide which cat will ascend 

to  the  Heaviside  Layer  and  come  back  to  a  new  life.  Cats  introduced  the  song  standard  "Memory".  The  first 

performance  of Cats was in 1981. 

Directed  by  Trevor  Nunn  and  choreographed  by  Gillian  Lynne,  Cats  first  opened  in  the West End in 1981 

and  then  with  the  same 
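The converted text above contains many blank lines and runs of spaces. As a minimal sketch of how it could be tidied before any pattern matching, the hypothetical helper `tidy_extracted_text` below (not part of `pdfminer`, and not used in the rest of this notebook) collapses repeated spaces and drops empty lines. The sample string is a shortened copy of the output above.

```python
import re


def tidy_extracted_text(raw_text):
    """Collapse runs of spaces and drop blank lines (hypothetical helper)."""
    cleaned_lines = []
    for line in raw_text.split('\n'):
        # Collapse any run of spaces or tabs into a single space
        collapsed = re.sub(r'[ \t]+', ' ', line).strip()
        # Keep only the lines that still contain something
        if collapsed:
            cleaned_lines.append(collapsed)
    return cleaned_lines


sample = ' \n\nCompany, Inc. \n\n \n\nF.I.R.M.I.N  Germany \n\n+49 6 03 89 92 99 \n'
print(tidy_extracted_text(sample))
# ['Company, Inc.', 'F.I.R.M.I.N Germany', '+49 6 03 89 92 99']
```

Returning a list of lines rather than a single string keeps line indices available, which is convenient when looking at the lines surrounding a match.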

## 2 Data extraction with Regular Expressions

References:
- [Tutorial on RegEx](https://www.regular-expressions.info/)
- [Other tutorial on RegEx](https://docs.oracle.com/javase/tutorial/essential/regex/)
- [Online tester and cheat sheet](https://regexr.com/) used to build the following RegEx

Regular expressions are a way to match patterns in text data. We define the pattern of characters we are looking for, and the RegEx engine finds the content matching that pattern in a given text. For example, let's look at the following piece of text:
<ul style="list-style: none; font-style: italic">
    <li>... and thank you for your time and patience with my request.</li>
    <li>Best regards,</li>
    <li><br></li>
    <li>Venste Bergspiel</li>
    <li>Executive assistant</li>
    <li>email: venste.bergspiel@ssicju.ra</li>
    <li>tel: 12 34 56 78 90</li>
    <li><br></li>
    <li>PS: I hope this does not ...</li>
</ul>

Let's say we are looking for an email address in this text. An email address is defined by :
- a series of words, separated by dots or dashes $\rightarrow$ **`venste.bergspiel`**
- an `@` character $\rightarrow$ **`@`**
- a new series of words, possibly separated by dots or dashes $\rightarrow$ **`ssicju`**
- a dot $\rightarrow$ **`.`**
- a single word of 2 to 3 characters followed by a space or a newline character $\rightarrow$ **`ra`**

This leads to the following regular expression :

<div class="text-center">
<span style="border-radius: 3px; background-color:#f7f7f7; border: 1px solid #ababab; padding: 5px 10px; font-size: 20px; font-family: Lucida">
    (\w(.|-)?)+\@(\w(.|-)?)+(\.\w{2,3}(\s|\n))+
</span>
</div>
- `\w{x,y}` means we are looking for `x` to `y` consecutive word characters (letters, digits or underscore). If `x` is omitted the minimum is 0, and if `y` is omitted there is no upper bound
- `(.|-)?` is meant to allow an optional dot or dash. Note that an unescaped `.` matches any character except a newline ; `(\.|-)?` is the strict form
- `(...)+` means that there is one or more of `...`
- `\@` means there must be the `@` character
- `\.` means there must be the `.` character
- `(\s|\n)` means there must be a whitespace or a newline character (`\s` already includes `\n`)
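To make these building blocks concrete, here is a small sketch using Python's `re.fullmatch`. The helper `matches` is a hypothetical convenience function introduced only for this illustration, and the tiny patterns are toy examples rather than the email RegEx itself.

```python
import re


def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


print(matches(r'a.b', 'a.b'))        # True
print(matches(r'a.b', 'a b'))        # True: an unescaped '.' accepts any character
print(matches(r'a\.b', 'a b'))       # False: '\.' accepts a literal dot only
print(matches(r'a(\.|-)?b', 'ab'))   # True: the separator is optional
print(matches(r'a(\.|-)?b', 'a-b'))  # True: a dash is accepted
print(matches(r'\w{2,3}', 'ra'))     # True: 2 word characters
print(matches(r'\w{2,3}', 'rand'))   # False: 4 word characters is too many
```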

Let's test it out :

In [5]:
# The regular expression defining the pattern
expression = r'(\w(.|-)?)+\@(\w(.|-)?)+(\.\w{2,3}(\s|\n))+'
# The text to analyse
text = '''
...and thank you for your time and patience with my request.
Best regards,

Venste Bergspiel
Executive assistant
email: venste.bergspiel@ssicju.ra
tel: 12 34 56 78 90

PS: I hope this does not ... 
'''
# First compile the expression
regex = re.compile(expression)
# Then find matches
matches = regex.finditer(text)
# Finally output them
print('Text to analyse :')
print(text)
print('------------------------------------------------------------\n')
print('Testing RegEx: {}\n'.format(expression))
print('Found the following matches :')
for i, m in enumerate(matches):
    print('{}. {}'.format(i + 1, m.group()))

Text to analyse :

...and thank you for your time and patience with my request.
Best regards,

Venste Bergspiel
Executive assistant
email: venste.bergspiel@ssicju.ra
tel: 12 34 56 78 90

PS: I hope this does not ... 

------------------------------------------------------------

Testing RegEx: (\w(.|-)?)+\@(\w(.|-)?)+(\.\w{2,3}(\s|\n))+

Found the following matches :
1. venste.bergspiel@ssicju.ra



The RegEx above works on this example, but its nested optional groups can make it slow on longer texts. Here is a less readable but more efficient alternative for email addresses, built from character classes :

<div class="text-center" style="margin-top: 15px">
<span style="border-radius: 3px; background-color:#f7f7f7; border: 1px solid #ababab; padding: 5px 10px; font-size: 20px; font-family: Lucida">
    [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
</span>
</div>
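As a quick check, the character-class RegEx can be applied to the same signature with `findall`. This is only a sketch: the names `EMAIL_REGEX` and `find_emails` are introduced here for illustration.

```python
import re

# Character-class email pattern shown above
EMAIL_REGEX = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')


def find_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_REGEX.findall(text)


signature = '''
Venste Bergspiel
Executive assistant
email: venste.bergspiel@ssicju.ra
tel: 12 34 56 78 90
'''
print(find_emails(signature))            # ['venste.bergspiel@ssicju.ra']
print(find_emails('Write to a@b.com.'))  # ['a@b.com.']
```

The second call shows one limitation of this pattern: because the last character class accepts dots, a full stop ending a sentence is captured as part of the address. Stripping trailing punctuation from the matches is one possible workaround.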

Now let's try to do the same for a phone number. Starting simple, we will for now consider a phone number to be 5 groups of 2 digits, separated by spaces. This leads to the following RegEx:

<div class="text-center">
<span style="border-radius: 3px; background-color:#f7f7f7; border: 1px solid #ababab; padding: 5px 10px; font-size: 20px; font-family: Lucida">
    (\d{2}(\s|\n)){5}
</span>
</div>
- `\d` means we are looking for a digit
- `(...){x}` means we want `(...)` to be repeated exactly `x` times

Let's test it out on the same example :

In [6]:
# The regular expression defining the pattern
expression = r'(\d{2}(\s|\n)){5}'
# The text to analyse
text = '''
...and thank you for your time and patience with my request.
Best regards,

Venste Bergspiel
Executive assistant
email: venste.bergspiel@ssicju.ra
tel: 12 34 56 78 90

PS: I hope this does not ... 
'''
# First compile the expression
regex = re.compile(expression)
# Then find matches
matches = regex.finditer(text)
# Finally output them
print('Text to analyse :')
print(text)
print('------------------------------------------------------------\n')
print('Testing RegEx: {}\n'.format(expression))
print('Found the following matches :')
for i, m in enumerate(matches):
    print('{}. {}'.format(i + 1, m.group()))

Text to analyse :

...and thank you for your time and patience with my request.
Best regards,

Venste Bergspiel
Executive assistant
email: venste.bergspiel@ssicju.ra
tel: 12 34 56 78 90

PS: I hope this does not ... 

------------------------------------------------------------

Testing RegEx: (\d{2}(\s|\n)){5}

Found the following matches :
1. 12 34 56 78 90
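One detail is easy to miss in the printed result: since every pair of digits must be followed by `(\s|\n)`, the match also contains the whitespace after the last pair. The short sketch below makes this visible with `repr` and shows that the pattern no longer matches when nothing follows the last pair. The variable name `pairs` is only illustrative.

```python
import re

pairs = re.compile(r'(\d{2}(\s|\n)){5}')

match = pairs.search('tel: 12 34 56 78 90\n')
print(repr(match.group()))          # '12 34 56 78 90\n'
print(repr(match.group().strip()))  # '12 34 56 78 90'

# Without any whitespace after the last pair, the pattern does not match
print(pairs.search('tel: 12 34 56 78 90'))  # None
```

Calling `.strip()` on the match is a simple way to obtain a clean phone number.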



The example above shows how to match a specific format of phone number. Unfortunately, there is no generic RegEx to match any type of phone number from any country, taking into account the country code and the additional non-essential digits. We suggest below a specific RegEx that can be used to match a significant number of European phone numbers :

<div class="text-center" style="margin-top: 15px">
<span style="border-radius: 3px; background-color:#f7f7f7; border: 1px solid #ababab; padding: 5px 10px; font-size: 12px; font-family: Lucida; line-height:35px">
    (\s|\n)(`\`+\d{2}(`\`(0`\`))?((\s|-)\d{3}){3}|`\`+\d{2}(`\`(0`\`))?(\s|-)(`\`(0`\`))?\d((\s|-)\d{2}){4}|`\`+\d{2}(`\`(0`\`))?\d{9}|(\d{2}(\-|\s)){4}\d{2}|\d{10})(\s|\n)
</span>
</div>

This barely human-readable expression matches the following formats :
- 0345678912
- 03 45 67 89 12
- 03-45-67-89-12
- +12345678912
- +12 3 45 67 89 12
- +12-3-45-67-89-12
- +12 345 678 912
- +12-345-678-912
- +12(0)345678912
- +12(0) 345 678 912
- +12(0)-345-678-912
- +12 (0)3 45 67 89 12
- +12-(0)3-45-67-89-12

<div class="alert alert-danger">
**Important** : This RegEx could of course be adapted for other purposes and simplified, but has overall good performance on the tested data.
</div>
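A simple way to gain confidence in such a long expression is to run it against the list of formats it is supposed to cover. The sketch below does this with `fullmatch`, using the pattern without its surrounding `(\s|\n)` delimiters so that each sample can be tested on its own. The names `PHONE_REGEX`, `formats` and `unmatched` are illustrative.

```python
import re

# Phone pattern without the leading and trailing (\s|\n) delimiters
PHONE_REGEX = re.compile(
    r'(\+\d{2}(\(0\))?((\s|-)\d{3}){3}'
    r'|\+\d{2}(\(0\))?(\s|-)(\(0\))?\d((\s|-)\d{2}){4}'
    r'|\+\d{2}(\(0\))?\d{9}'
    r'|(\d{2}(\-|\s)){4}\d{2}'
    r'|\d{10})'
)

formats = [
    '0345678912',
    '03 45 67 89 12',
    '03-45-67-89-12',
    '+12345678912',
    '+12 3 45 67 89 12',
    '+12-3-45-67-89-12',
    '+12 345 678 912',
    '+12-345-678-912',
    '+12(0)345678912',
    '+12(0) 345 678 912',
    '+12(0)-345-678-912',
    '+12 (0)3 45 67 89 12',
    '+12-(0)3-45-67-89-12',
]

# Collect the formats that the pattern fails to match entirely
unmatched = [f for f in formats if PHONE_REGEX.fullmatch(f) is None]
print(unmatched)  # []
```

Keeping such a list next to the RegEx makes it easier to adapt the pattern later without silently losing a supported format.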

## 3 Contact extraction

Now is the time to dive into contact extraction. We have seen that we can take a PDF file and extract text from it, and that we can find phone numbers and emails in a string using RegEx. Now we can combine all of those in order to determine contacts in a document.

The idea is to work sequentially line by line and build a contact list dynamically. The algorithm does the following steps :
- convert the PDF to raw text ;
- go line by line and look for phone numbers ;
- keep line numbers of matches ;
- filter contacts with a list of already known phone numbers ;
- look 5 lines after and 5 lines before the match for email addresses ;
- look 5 lines after and 5 lines before the match for words outside of dictionary to determine names ;
- consider that matches close in line numbers represent the same person ;
- filter contacts with a list of already known phone numbers, name and/or email addresses.
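Two of these steps, looking at the lines around a match and grouping matches that are close to each other, can be sketched independently of the PDF. The helpers `neighbouring_window` and `group_close_matches` below are hypothetical illustrations written for this explanation, with an assumed radius and gap of 5 lines. They are a sketch of the idea rather than the exact code used in this notebook.

```python
def neighbouring_window(lines, index, radius=5):
    """Return the lines from `radius` before to `radius` after `index`, inclusive."""
    start = max(index - radius, 0)
    # Slice ends are exclusive, hence the +1; slicing past the end of a list is safe
    end = index + radius + 1
    return lines[start:end]


def group_close_matches(line_numbers, max_gap=5):
    """Group line numbers that are at most `max_gap` lines apart."""
    groups = []
    for number in sorted(line_numbers):
        if groups and number - groups[-1][-1] <= max_gap:
            # Close enough to the previous match: assume the same person
            groups[-1].append(number)
        else:
            groups.append([number])
    return groups


lines = ['line {}'.format(i) for i in range(20)]
window = neighbouring_window(lines, 10)
print(window[0], window[-1], len(window))          # line 5 line 15 11
print(group_close_matches([10, 12, 53, 75, 78]))   # [[10, 12], [53], [75, 78]]
```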


In [7]:
# Import PDF and convert to a string
text = convert_pdf_to_txt('Fake_PDF.pdf')
# Convert to list of lines
lines = text.split('\n')
# Remove lines without words by keeping only lines that contain a word character (\w already covers digits)
lines_filtered = [l for l in lines if re.search(r'\w', l)]
# Build RegEx for phone numbers
tel_regex = re.compile(r'(\+\d{2}(\(0\))?((\s|-)\d{3}){3}|\+\d{2}(\(0\))?(\s|-)(\(0\))?\d((\s|-)\d{2}){4}|\+\d{2}(\(0\))?\d{9}|(\d{2}(\-|\s)){4}\d{2}|\d{10})')
tels = list()
for i, l in enumerate(lines_filtered):
    tel_match = tel_regex.finditer(l)
    for m in tel_match:
        phone = m.group()
        tels.append((i, str(phone)))
        print('Line {} :\t{}'.format(i, phone))

Line 10 :	+49 6 03 89 92 99
Line 53 :	1236885398
Line 75 :	+33 6 12 34 56 78


As shown above, we have now extracted all the phone numbers from the PDF file, together with their line index in the filtered text. Note that one of the matches (`1236885398`, line 53) is a plain 10-digit number that is not actually a contact's phone number.

Now we can go through neighbouring lines and determine the contact name and / or email if available.

In [8]:
dictionary = enchant.Dict("en_US")
punctuation = ('.', ',', '(', ')', '+', ';', '"', '\\', '/', '|')
email_regex = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
contacts = list()
for tel in tels:
    contact = {
        'name': None,
        'tel': None,
        'mail': None
    }
    line = tel[0]
    phone = tel[1]
    contact['tel'] = [phone]
    emails = list()
    # Window around the match: 5 lines before and up to 4 lines after (slice ends are exclusive)
    start = max(line - 5, 0)
    neighbouring_lines = lines_filtered[start:min(line + 5, len(lines_filtered) - 1)]
    # look for emails and keep the one closest to the phone number
    closest_mail_distance = None
    for i, l in enumerate(neighbouring_lines):
        # i is relative to the window, so shift it back to an absolute line index
        distance = abs(start + i - line)
        for m in email_regex.finditer(l):
            if closest_mail_distance is None or distance < closest_mail_distance:
                closest_mail_distance = distance
                contact['mail'] = [m.group()]
    # convert lines to list of ordered words to better filter them
    ordered_pieces = reduce(
        lambda words, line: words + line.split(' '),
        neighbouring_lines,
        list()
    )
    # keep pieces that start with a word character but not with a digit
    ordered_none_digits = list(filter(lambda p: re.match(r'\w', p) is not None and re.match(r'\d', p) is None, ordered_pieces))
    # filter words with punctuation
    ordered_words = list(filter(lambda p: all([c not in punctuation for c in p]), ordered_none_digits))
    # Finally keep only words that are not in dictionary
    words = list(filter(lambda w: not dictionary.check(w), ordered_words))
    # This should give a name in the end hopefully
    contact['name'] = ' '.join(words) if len(words) > 0 else None
    # Finally check if we have not found already the contact, in which case we have to add the
    # new phone number to the previous contact
    previous_contact = next(filter(lambda c: c.get('name') == contact.get('name'), contacts), None)
    if previous_contact is None:
        contacts.append(contact)
    else:
        previous_contact['tel'].append(phone)
# Finally filter contacts that have no email and no name. Those are probably mis-matches
contacts = list(filter(lambda c: c.get('name') is not None or c.get('mail') is not None, contacts))
contacts

[{'mail': ['venste.bergspiel@company.de'],
  'name': 'Venste Bergspiel',
  'tel': ['+49 6 03 89 92 99']},
 {'mail': None, 'name': 'Tocs Yelldir', 'tel': ['+33 6 12 34 56 78']}]

We finally extract 2 contacts from the original PDF.
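The algorithm also mentions filtering contacts against a list of already known phone numbers, a step that the cells above do not spell out. Here is a minimal sketch of one way to do it, assuming that "filtering" means dropping contacts whose number is already known. The helpers `normalise_phone` and `drop_known_contacts` are hypothetical, and the list of known numbers is invented for the example.

```python
import re


def normalise_phone(number):
    """Drop spaces, dashes and the optional '(0)' so that formats compare equal."""
    return re.sub(r'\(0\)|[\s-]', '', number)


def drop_known_contacts(contacts, known_numbers):
    """Keep only the contacts that have no phone number in `known_numbers`."""
    known = {normalise_phone(n) for n in known_numbers}
    return [
        c for c in contacts
        if not any(normalise_phone(t) in known for t in c['tel'])
    ]


contacts = [
    {'name': 'Venste Bergspiel', 'tel': ['+49 6 03 89 92 99'],
     'mail': ['venste.bergspiel@company.de']},
    {'name': 'Tocs Yelldir', 'tel': ['+33 6 12 34 56 78'], 'mail': None},
]
# Assumed list of already known numbers, written in a different format
remaining = drop_known_contacts(contacts, ['+33-6-12-34-56-78'])
print([c['name'] for c in remaining])  # ['Venste Bergspiel']
```

Normalising before comparing matters because the same number can be written in several of the formats matched by the phone RegEx.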