Assignment Details:
 
Prerequisites for the code assignment:
1. Working experience with Python and an NLP library such as spaCy.
2. Working experience with machine learning.
 
 
Extract the following details from the given PDF.
1. Extract the following details from the Applicant and Agent details sections of the PDF:
A. Person names
B. Address
C. Company
E.g., for Assignment-1.pdf:
   name="Amin",
   Company name="Britbuild Properties Ltd",
   Address ="166 Weir Road, London, United Kingdom"
 
2. Extract the proposed materials from the Materials / Proposed Building section.
E.g., for Assignment-1.pdf:
materials : ["Stock brickwork", "Zinc cladding", "Aluminium clad timber glazed windows", "Aluminium clad timber doors"]

Importing Libraries

In [1]:
import re
import pdfplumber
import camelot
import json
import spacy
from spacy.matcher import Matcher

Parsing the whole text from the PDF

In [2]:
text = ''
with pdfplumber.open('Assignment-1.pdf') as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page without text in some
        # pdfplumber versions, so fall back to an empty string
        text += '\n' + (page.extract_text() or '')

In [3]:
print(text)


Application for Planning Permission.
Town and Country Planning Act 1990
Publication of applications on planning authority websites.
Please note that the information provided on this application form and in supporting documents may be published on the Authority’s website. If
you require any further clarification, please contact the Authority’s planning department.
1. Site Address
Number
Suffix
Property name Land Adjoining Norbury Railway Station
Address line 1 Norbury Avenue
Address line 2 Norbury
Address line 3
Town/city London
Postcode SW16 3RW
Description of site location must be completed if postcode is not known:
Easting (x) 530727
Northing (y) 169734
Description
2. Applicant Details
Title Mr
First name
Surname Amin
Company name Britbuild Properties Ltd
Address line 1 166 Weir Road
Address line 2
Address line 3
Town/city London
Country United Kingdom
Planning Portal Reference: PP-09239967
2. Applicant Details
Postcode
Are you an agent acting on behalf of the applicant? Yes   No
Pr

Extracting the text portions containing the applicant and agent details from the parsed text using the spaCy Matcher

In [4]:
nlp = spacy.load('en_core_web_lg')

In [5]:
matcher_applicant= Matcher(nlp.vocab)
matcher_agent= Matcher(nlp.vocab)

In [6]:
pattern_applicant=[{'ORTH':'Applicant'},
          {'ORTH':'Details'},
          {'ORTH':'\n'},
          {'IS_ASCII': True,'OP':'*'},
          {'ORTH':'\n'},
          {'ORTH': 'Postcode'},
          {'ORTH':{'REGEX': "^[a-zA-Z0-9]*\n"}}]

pattern_agent=[{'ORTH':'Agent'},
          {'ORTH':'Details'},
          {'ORTH':'\n'},
          {'IS_ASCII': True,'OP':'*'},
          {'ORTH':'\n'},
          {'ORTH': 'Postcode'},
          {'ORTH':{'REGEX': r".*"}},
          {'ORTH':{'REGEX': r".*"}},
          {'ORTH':'\n'}]

In [7]:
# Patterns are passed as a list; the older add(key, None, pattern) form
# is not accepted by spaCy 3
matcher_applicant.add('Applicant_detail', [pattern_applicant])
matcher_agent.add('Agent_detail', [pattern_agent])

In [8]:
doc=nlp(text)

In [9]:
matches_applicant = matcher_applicant(doc)
matches_agent = matcher_agent(doc)

applicant_text=''
agent_text=''

In [10]:
for match_id, start,end in matches_applicant :
    matched_span = doc[start:end]
    applicant_text+=matched_span.text
    
for match_id, start,end in matches_agent :
    matched_span = doc[start:end]
    agent_text+=matched_span.text    

In [11]:
applicant_text

'Applicant Details\nTitle Mr\nFirst name\nSurname Amin\nCompany name Britbuild Properties Ltd\nAddress line 1 166 Weir Road\nAddress line 2\nAddress line 3\nTown/city London\nCountry United Kingdom\nPlanning Portal Reference: PP-09239967\n2. Applicant Details\nPostcode\n'

In [12]:
agent_text

'Agent Details\nTitle Miss\nFirst name Ellen\nSurname Creegan\nCompany name Iceni Projects\nAddress line 1 This is the Space\nAddress line 2 68 Quay Street\nAddress line 3\nTown/city Manchester\nCountry\nPostcode M3 3EJ\n'
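As a cross-check on the Matcher result, the same slice can be approximated with a plain regular expression. This is only a sketch: `slice_section` is a hypothetical helper that does not appear in the notebook, and `sample` is a shortened copy of the parsed text printed earlier. It assumes a section runs from its heading to the first line that starts with `Postcode`.

```python
import re

# Shortened copy of the parsed text shown earlier in this notebook.
sample = (
    "1. Site Address\nPostcode SW16 3RW\n"
    "2. Applicant Details\nTitle Mr\nFirst name\nSurname Amin\n"
    "Company name Britbuild Properties Ltd\nAddress line 1 166 Weir Road\n"
    "Address line 2\nAddress line 3\nTown/city London\n"
    "Country United Kingdom\nPlanning Portal Reference: PP-09239967\n"
    "2. Applicant Details\nPostcode\n"
    "Are you an agent acting on behalf of the applicant? Yes   No\n"
)

def slice_section(text, heading):
    """Hypothetical helper: return the text from `heading` up to and
    including the first Postcode line after it, or '' if not found."""
    pattern = re.escape(heading) + r"\n.*?\nPostcode[^\n]*\n"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(0) if match else ''

applicant_slice = slice_section(sample, "Applicant Details")
print(repr(applicant_slice))
```

The lazy `.*?` stops at the first `Postcode` line after the heading, so the `Postcode SW16 3RW` line of the Site Address section, which comes before the heading, is not picked up.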

Extracting the Material Table from the PDF using camelot

In [13]:

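# flavor='stream' infers columns from the whitespace between text chunks
tables = camelot.read_pdf('Assignment-1.pdf', flavor='stream', pages='3')
print(len(tables))
# The second table detected on the page is the one used for the materials
df1 = tables[1].df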

2


In [14]:
df1.rename(columns = {0:'DESC',1:'MATERIAL'}, inplace = True) 

In [15]:
df1[df1['DESC']=='Description of proposed materials and finishes:']

    DESC                                             MATERIAL
4   Description of proposed materials and finishes:  Stock brickwork
7   Description of proposed materials and finishes:  Zinc cladding
10  Description of proposed materials and finishes:  Aluminium clad timber glazed windows
13  Description of proposed materials and finishes:  Aluminium clad timber doors


In [16]:
proposed_material=list(df1[df1['DESC']=='Description of proposed materials and finishes:']['MATERIAL'])
proposed_material

['Stock brickwork',
 'Zinc cladding',
 'Aluminium clad timber glazed windows',
 'Aluminium clad timber doors']
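The filter above relies on an exact match of the label cell. A slightly more forgiving version, sketched here on plain tuples rather than on the DataFrame, strips whitespace and skips empty material cells. The `proposed_materials` helper and the two filler rows are assumptions made for illustration; the four labelled rows are copied from the table output above.

```python
LABEL = "Description of proposed materials and finishes:"

# (DESC, MATERIAL) pairs. The four material rows come from the table output
# above; the rows marked "hypothetical" are filler to show what is skipped.
rows = [
    ("Walls", ""),                      # hypothetical
    (LABEL, "Stock brickwork"),
    (LABEL, "Zinc cladding"),
    (LABEL + " ", ""),                  # hypothetical: label without a value
    (LABEL, "Aluminium clad timber glazed windows"),
    (LABEL, "Aluminium clad timber doors"),
]

def proposed_materials(rows, label=LABEL):
    """Hypothetical helper: keep non-empty MATERIAL cells whose DESC cell
    equals `label` once surrounding whitespace is removed."""
    return [
        material.strip()
        for desc, material in rows
        if desc.strip() == label and material.strip()
    ]

materials = proposed_materials(rows)
print(materials)
```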

A function to extract specific information from the text using Regex

In [17]:
def find_detail(text):
    """Return the name, company and address found in one details section."""
    def grab(pattern):
        # Join every named group of the match; '' if the pattern does not match
        match = re.match(pattern, text)
        return ''.join(match.groupdict().values()) if match else ''

    f_name = grab(r".*\nTitle(?P<title>.*)\nFirst name(?P<f_name>.*)\nSurname(?P<s_name>.*)\n.*")
    c_name = grab(r"[a-zA-Z0-9\n\s]*\nCompany name(?P<c_name>.*)\n")
    add = grab(r"[a-zA-Z0-9\n\s]*\nAddress line 1(?P<a_1>.*)\nAddress line 2(?P<a_2>.*)\nAddress line 3(?P<a_3>.*)\nTown/city(?P<city>.*)\nCountry(?P<country>.*)")
    p_code = grab(r"[.a-zA-Z0-9\n\s/\-:]*\nPostcode(?P<p_code>.*)\n")

    # The captured values keep the space that follows each label
    return {'Full Name': f_name, 'Company Name': c_name, 'Full Address': add + p_code}

In [18]:
#find_detail(applicant_text)
#find_detail(agent_text)

Creating the final output and saving it as a JSON file

In [19]:
final_output={'Agent Details':find_detail(agent_text),'Applicant Details': find_detail(applicant_text),'Proposed Material':proposed_material}

In [20]:
final_output

{'Agent Details': {'Full Name': ' Miss Ellen Creegan',
  'Company Name': ' Iceni Projects',
  'Full Address': ' This is the Space 68 Quay Street Manchester M3 3EJ'},
 'Applicant Details': {'Full Name': ' Mr Amin',
  'Company Name': ' Britbuild Properties Ltd',
  'Full Address': ' 166 Weir Road London United Kingdom'},
 'Proposed Material': ['Stock brickwork',
  'Zinc cladding',
  'Aluminium clad timber glazed windows',
  'Aluminium clad timber doors']}
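The values above keep the space that follows each label, and the address parts are joined without separators, whereas the assignment example shows "166 Weir Road, London, United Kingdom". One possible way to tidy this up is sketched below. `field_value` and `tidy_address` are hypothetical helpers, not part of the notebook, and they assume every field sits on its own line as `<label> <value>`.

```python
import re

# Applicant section text as printed earlier in this notebook.
applicant_text = (
    'Applicant Details\nTitle Mr\nFirst name\nSurname Amin\n'
    'Company name Britbuild Properties Ltd\nAddress line 1 166 Weir Road\n'
    'Address line 2\nAddress line 3\nTown/city London\n'
    'Country United Kingdom\nPlanning Portal Reference: PP-09239967\n'
    '2. Applicant Details\nPostcode\n'
)

ADDRESS_LABELS = ['Address line 1', 'Address line 2', 'Address line 3',
                  'Town/city', 'Country', 'Postcode']

def field_value(text, label):
    """Hypothetical helper: stripped value after `label` on its line, or ''."""
    pattern = r"^" + re.escape(label) + r"[ \t]*(.*)$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else ''

def tidy_address(text):
    """Hypothetical helper: join the non-empty address parts with ', '."""
    parts = [field_value(text, label) for label in ADDRESS_LABELS]
    return ', '.join(part for part in parts if part)

print(field_value(applicant_text, 'Surname'))
print(tidy_address(applicant_text))
```

Using `[ \t]*` rather than `\s*` after the label keeps an empty field, such as `Address line 2`, from swallowing the following line.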

In [21]:
# Serialise the dictionary straight into the file
with open("output.json", "w") as outfile:
    json.dump(final_output, outfile)
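To confirm that the saved file can be read back unchanged, a small round-trip check along these lines could be used. It is a sketch only: `save_and_reload` is a hypothetical helper, the dictionary is a shortened stand-in for `final_output`, and the temporary directory and `indent=2` are choices made here, not part of the notebook.

```python
import json
import os
import tempfile

# Shortened stand-in for final_output; values copied from the output above.
final_output = {
    'Applicant Details': {'Full Name': ' Mr Amin',
                          'Company Name': ' Britbuild Properties Ltd'},
    'Proposed Material': ['Stock brickwork', 'Zinc cladding'],
}

def save_and_reload(data, path):
    """Hypothetical helper: write `data` as indented JSON, then read it back."""
    with open(path, 'w', encoding='utf-8') as outfile:
        json.dump(data, outfile, indent=2, ensure_ascii=False)
    with open(path, encoding='utf-8') as infile:
        return json.load(infile)

# Write to a temporary directory so an existing output.json is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(final_output, os.path.join(tmp_dir, 'output.json'))

print(reloaded == final_output)
```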