# <center><span style='color:red'>**Entity Recognition for OCR using text data**  </span></center>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    filepath = dirname
    list_filename = filenames
    #for filename in filenames:
    #    print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Accessing the directory with all the files and all the filenames

In [2]:
print(filepath)
print(list_filename)

/kaggle/input/text-extraction-for-ocr/ImageAndXML_Data
['2084022143.tif', '524483561+-3562_ocr.xml', '11295049_ocr.xml', '2029377724_ocr.xml', '2063216131.tif', '2028729078.tif', '99380808_0809.tif', '0060027778_gt.xml', 'ti01410980_gt.xml', '2063207059_7062_gt.xml', '86619053_ocr.xml', '11237705_gt.xml', '2072957946.tif', '2029377724.tif', '2070238500.tif', '0001139626.tif', '0060053761_gt.xml', '2044696237_gt.xml', '91781209.tif', '81615715_ocr.xml', '2015020047_gt.xml', '0011899826_gt.xml', '2044402070_ocr.xml', 'ti17120743.tif', '2063209074_9077_gt.xml', '83312574_2575_ocr.xml', '518253707+-3715.tif', '92873320_3322_gt.xml', 'ti16310917_ocr.xml', '2041388921_ocr.xml', '2041222841.tif', '83553545_gt.xml', '2084022167_gt.xml', '2041063035.tif', '2080724769_gt.xml', '508725601+-5603_gt.xml', '2074417069.tif', '2071382979_ocr.xml', '2071217430_ocr.xml', '2073684118_gt.xml', '514967102+-7103_gt.xml', 'ti17120431.tif', '92185588_ocr.xml', '2070424201.tif', '2071413187_gt.xml', '207248407

## XML Parsing
XML in Python is usually parsed either with **Beautiful Soup on top of the lxml parser or with the ElementTree API** (this notebook uses `lxml.etree`). Choose whichever suits your needs better.

Our aim is to read the text from the XML files and collect it in a single text file. We first extract the text from one file, then loop over all the files.

In [3]:
#For making list of all the *_OCR.xml files in the data folder use glob or fnmatch+os.listdir()
import fnmatch

file_list = []
for file in list_filename:
    if fnmatch.fnmatch(file,'*_ocr.xml'):
        file_list.append(file)    
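
The comment in the cell above also mentions `glob`. A minimal sketch of that route follows; the helper name `list_ocr_xml`, the temporary directory and the file names are made up for illustration and only stand in for the dataset folder.

```python
import glob
import os
import tempfile

def list_ocr_xml(directory):
    """Return the sorted base names of all *_ocr.xml files in `directory`."""
    # glob.escape protects any glob metacharacters in the directory path itself.
    pattern = os.path.join(glob.escape(directory), '*_ocr.xml')
    return sorted(os.path.basename(path) for path in glob.glob(pattern))

# Hypothetical stand-in for the dataset folder, so the sketch runs anywhere.
demo_dir = tempfile.mkdtemp()
for name in ['a_ocr.xml', 'a_gt.xml', 'a.tif', 'b_ocr.xml']:
    open(os.path.join(demo_dir, name), 'w').close()

ocr_files = list_ocr_xml(demo_dir)
print(ocr_files)
```

Unlike the `fnmatch` loop, `glob` returns full paths, so the sketch strips them back to base names to stay comparable with `file_list`.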

In [4]:
# Open one file and look at its contents:
with open(os.path.join(dirname, file_list[0]), 'r') as f:
    data = f.read()  # the with block closes the file automatically

In [5]:
print(data)

<?xml version="1.0" encoding="utf-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata>
    <Creator>ABBYY FineReader Engine 11 + alto2page.xslt 2018.11.09</Creator>
    <Created>2019-01-23T00:00:00</Created>
    <LastChange>2019-01-23T00:00:00</LastChange>
  </Metadata>
  <Page imageFilename="524483561+-3562.tif" imageHeight="1000" imageWidth="754">
    <TextRegion id="Page1_TopMargin">
      <Property key="Margin" value="Top"/>
      <Coords points="0,0 754,0 754,3 0,3"/>
    </TextRegion>
    <TextRegion id="Page1_LeftMargin">
      <Property key="Margin" value="Left"/>
      <Coords points="0,3 34,3 34,962 0,962"/>
    </TextRegion>
    <TextRegion id="Page1_RightMargin">
      <Property key="Margin" value="Right"/>
      <Coords points="716,3 754,3 754,962 716,962"/>
    </TextRegion>
    <TextRegion id="Page1_BottomMargin">
      <Property key="Margin" value="Bottom"/>
      <Coords points="0,962 754,962 754,1000 0,1000"/>
    </TextReg

## Observations:

In [6]:
# The structure of xml(for our interest) is:
# <TextRegion>
#     <TextLine>
#         <Word>
#             <Unicode> Text </Unicode>
#         </Word>
#     </TextLine>
# </TextRegion>

Each text region (`<TextRegion>`) is a block of information of a single type, such as an address or a description. A block may be split across several lines (`<TextLine>`).

While reading, we group words by text line and by text region so that the sequential order of the text is preserved. By contrast, collecting every text node from the XML in one flat pass discards this grouping.

Text can be extracted from the XML line by line, block by block, or all at once. Line-wise text is shorter and carries less sequential context, so it is better suited to regex matching. Block-wise text or the whole text keeps more of the sequence and is therefore better suited to unsupervised learning methods (supervised learning is not an option here because the data is unlabeled).

NOTE: We currently ignore the OCR confidence values in the XML. For a better model, a confidence threshold should be chosen (or tuned) and applied.
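
As a rough illustration of such a threshold, the sketch below keeps only words whose confidence reaches a chosen value. It is not part of this notebook's pipeline: the sample XML is made up, the `conf` attribute on `<TextEquiv>` is an assumption about how the confidence is stored, and `confident_words` is a hypothetical helper. The standard library's `xml.etree.ElementTree` is used so the example runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Made-up PAGE-style fragment; the `conf` attribute on <TextEquiv> is assumed.
SAMPLE = """
<TextLine>
  <Word id="w1"><TextEquiv conf="0.95"><Unicode>INVOICE</Unicode></TextEquiv></Word>
  <Word id="w2"><TextEquiv conf="0.20"><Unicode>n0ttfij</Unicode></TextEquiv></Word>
  <Word id="w3"><TextEquiv conf="0.88"><Unicode>1996</Unicode></TextEquiv></Word>
</TextLine>
""".strip()

def confident_words(root, threshold=0.5):
    """Return the text of words whose TextEquiv confidence is >= threshold."""
    kept = []
    for equiv in root.iter('TextEquiv'):
        conf = float(equiv.get('conf', '0'))  # a missing confidence counts as 0
        unicode_node = equiv.find('Unicode')
        if conf >= threshold and unicode_node is not None and unicode_node.text:
            kept.append(unicode_node.text)
    return kept

root = ET.fromstring(SAMPLE)
words = confident_words(root, threshold=0.5)
print(words)
```

The threshold of 0.5 is arbitrary; in practice it would be tuned against a few documents with known text.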

Parsing the XML file with lxml's `etree`:

In [7]:
from lxml import etree, objectify

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(os.path.join(dirname, file_list[0]), parser)  # pass the parser so its options take effect
root = tree.getroot()

# Every tag in this file carries the XML namespace as a '{namespace}' prefix,
# which makes it harder to address the tags we want, so we strip the prefix.
for elem in root.iter():
    if not hasattr(elem.tag, 'find'):
        continue  # comments and processing instructions have non-string tags
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]
objectify.deannotate(root, cleanup_namespaces=True)
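
An alternative to stripping the namespaces is to keep them and qualify the tag names in the lookup. The sketch below shows this with the standard library's `xml.etree.ElementTree` on a made-up miniature document; only the namespace URI is taken from the file printed above, everything else is illustrative.

```python
import xml.etree.ElementTree as ET

PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'

# Made-up miniature document in the same namespace as the file printed above.
SAMPLE = f"""
<PcGts xmlns="{PAGE_NS}">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>SENT</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>BY</Unicode></TextEquiv></Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()

root = ET.fromstring(SAMPLE)

# Tags are stored as '{namespace}local-name', so qualify the name explicitly.
texts = [node.text for node in root.iter(f'{{{PAGE_NS}}}Unicode')]

# Equivalent lookup through a prefix map passed to findall().
ns = {'pc': PAGE_NS}
texts_via_prefix = [node.text for node in root.findall('.//pc:Unicode', ns)]

print(texts)
```

Keeping the namespaces leaves the tree untouched, at the cost of slightly longer lookups; stripping them, as done above, makes the later cells shorter.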

In [8]:
# Collect every tag in the XML into tag_list
tag_list = []
for element in root.iter():
    tag_list.append(element.tag)

In [9]:
## check the output if its in the same order as we need it
# for element in root.iter('Word'):
#     print('block and line info = ' + str(element.attrib))
#     for child in element:
#         #print(child.tag)
#         if child.tag == 'TextEquiv':
#             print('for text confidence score =' + str(child.attrib))
#             for grandchild in child:
#                 if grandchild.tag == 'Unicode':
#                     print('text = '+str(grandchild.text))

In [10]:
# getting the information in the format in which we need it:
import re
prev_block_Page = ''
prev_block_Block = ''
prev_block_Line = ''
prev_block_Word = ''
sentence = ''
sentence2 = ''
sentence3 = ''
block = []
line = []
for element in root.iter('Unicode'):
    same_page = same_block = same_line = next_word = False
    parent_node = next(element.iterancestors('Word'))
    block_list = parent_node.attrib['id'].split('_')
    if(prev_block_Page == block_list[0] or prev_block_Page == ''):
        same_page = True
    if(prev_block_Block == block_list[1] or prev_block_Block == ''):
        same_block = True        
    if(prev_block_Line == block_list[2] or prev_block_Line == ''):
        same_line = True
    if(prev_block_Word == int(block_list[3][1:])-1 or prev_block_Word == ''):
        next_word = True
                    
    # Replace runs of unwanted characters with a space and add a trailing separator.
    cleaned = re.sub("[^0-9a-zA-Z:,]+", ' ', element.text) + ' '

    # Line level: words on the same line of the same block form one sentence.
    # These short sentences are the ones to scan for keywords such as 'Date'.
    if same_line and same_block:
        sentence3 = sentence3 + cleaned
    else:
        line.append(sentence3)
        sentence3 = cleaned

    # Block level: all words of the same block form one sentence.
    if same_block:
        sentence = sentence + cleaned
    else:
        block.append(sentence)
        sentence = cleaned

    # Document level: all the text of the file in one string.
    sentence2 = sentence2 + cleaned
        
    prev_block_Page = block_list[0]
    prev_block_Block = block_list[1]
    prev_block_Line = block_list[2]
    prev_block_Word = int(block_list[3][1:])
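# The text of the last line and block is still held in sentence3 and sentence,
# so the two lists below stop one entry short of the full document.
print(line)
print(block)
print(sentence2)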

['SENT BY:R  J  REYNOLDS TOBACCO: 12 22 99   4:16PM   9722318699  336 741 4606:1 3  3 ']
['SENT BY:R  J  REYNOLDS TOBACCO: 12 22 99   4:16PM   9722318699  336 741 4606:1 3  3 ']
SENT BY:R  J  REYNOLDS TOBACCO: 12 22 99   4:16PM   9722318699  336 741 4606:1 3  3 52448 3561 
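
The line-wise and block-wise grouping can also be sketched with `itertools.groupby`, which emits every group, the final one included. This is an illustration under assumptions rather than the notebook's method: the word ids and texts are made up, and the id layout `page_block_line_word` simply mirrors the `split('_')` used above.

```python
import re
from itertools import groupby

# Hypothetical (word id, text) pairs in document order; all values are made up.
words = [
    ('p1_b1_l1_w1', 'SENT'),
    ('p1_b1_l1_w2', 'BY:'),
    ('p1_b1_l2_w1', '12/22/99'),
    ('p1_b2_l1_w1', '52448'),
    ('p1_b2_l1_w2', '3561'),
]

def clean(text):
    """Replace runs of characters outside 0-9, a-z, A-Z, ':' and ',' with a space."""
    return re.sub(r'[^0-9a-zA-Z:,]+', ' ', text).strip()

def group_text(words, depth):
    """Join cleaned words that share the first `depth` id parts (2 = block, 3 = line)."""
    key = lambda item: tuple(item[0].split('_')[:depth])
    return [' '.join(clean(text) for _, text in group)
            for _, group in groupby(words, key=key)]

lines = group_text(words, depth=3)
blocks = group_text(words, depth=2)
print(lines)
print(blocks)
```

`groupby` only merges consecutive items, so the words must already be in reading order, as they are when iterating over the XML.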


In [11]:
filename = 'beautiful_data.txt'
with open(filename, 'a') as write_txt_to_file:
    write_txt_to_file.write(sentence2)

In [12]:
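import csv
field = ['TEXT']  # column name, kept for reference; no header row is written here
row = [sentence]  # writerow expects a sequence of fields, so wrap the string in a list
filename = 'beautiful_data.csv'
with open(filename, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)  # writer object for CSV output
    csvwriter.writerow(row)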
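
A quick check of how `csv.writer.writerow` treats its argument, written to in-memory buffers so nothing touches the disk. The sample text is made up.

```python
import csv
import io

text = 'SENT'

# A bare string is iterated character by character: one field per character.
buffer_chars = io.StringIO()
csv.writer(buffer_chars).writerow(text)

# Wrapping the string in a list writes it as a single field.
buffer_field = io.StringIO()
csv.writer(buffer_field).writerow([text])

print(repr(buffer_chars.getvalue()))
print(repr(buffer_field.getvalue()))
```

To store a whole sentence in one CSV cell, it therefore has to be passed as a one-element list.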

In [13]:
###########################################################################################
###########                                                                     ###########
########### This is the summary of complete process, looping over all the files ###########                      
###########                                                                     ###########
###########################################################################################
#Do the whole process on the list of all the files:
#Reading the file saving the contents in data and proceeding 

txt_filename = 'beautiful_data_summary.txt'
for xml_file in file_list:
    new_sentence = ''
    parser = etree.XMLParser(remove_blank_text=True)
    xml_file_path = os.path.join(dirname, xml_file)
    tree = etree.parse(xml_file_path, parser)  # pass the parser so its options take effect
    root = tree.getroot()
    # remove the namespaces
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'):
            continue
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i+1:]
    objectify.deannotate(root, cleanup_namespaces=True)

    for element in root.iter('Unicode'):
        # `or ''` guards against empty <Unicode> elements, whose text is None
        new_sentence = new_sentence + re.sub("[^0-9a-zA-Z:,]+", ' ', element.text or '') + ' '
    #print('new_sentence = '+new_sentence)
    # append the text of this file as one line of the summary file
    with open(txt_filename, 'a') as write_txt_to_file:
        write_txt_to_file.write(new_sentence)
        write_txt_to_file.write('\n')
    

In [14]:
import pandas as pd
pd.set_option('display.max_colwidth',3000)
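# Each non-empty line of the summary file is one document; load it as a single row in column 0.
with open('./beautiful_data_summary.txt') as f:
    df = pd.DataFrame([row_text.rstrip('\n') for row_text in f if row_text.strip()])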
df.head(7)



Unnamed: 0,0
0,SENT BY:R J REYNOLDS TOBACCO: 12 22 99 4:16PM 9722318699 336 741 4606:1 3 3 52448 3561
1,"March 30, 1964 From: To: Attention: Statement of Chargos Professor T D Sterling College of Medicine Eden and Bethosda Avenues Cincinnati, Ohio Tobacco Industry Research Committee 150 East 42 Street New York 17, New York Dr R C Hockott Consultation, March 10, 1964 Airplane fare Hotel accomodation Taxis and food Honorarium 97 13 14 00 20 00 200 00 331 13 Review and Experimental Design Tea days at 200 per day 2000 00 Consultation with Dr Eugene Saenger 200 00 2200 00 TOTAL 2531 13"
2,"Covington Burling 1201 Pennsylvania Avenue N W Washington D C 20044 4358 Dec 20 1991 Mr John Rupp OC Testing in Buildings Reference: M P To the sampling and analysis of buildings for their OC content as per Gray Robertson s memo dated February 21, 1991 Buildings sampled in Pecan Per 1991 In ISA: 21 In Australia: 5 Total buildings: 26 VCCs 26 1,200 per building 31,200 00 Total price now due for payment 31 200 00 For HBI Inc CR: dh 2029377724"
3,"INVOICE September 1, 1996 Lorillard Corporation P O Box 21688 Greensboro, North Carolina 27420 1688 Attention: Dr Alex W Spears September assessment for CIAR Less monthly adjustment for Dunn research project 1996 contract Net amount due 54,487 00 mm 1099 Winicison Road Suite 280 Linthlctun Maryland 21090 2216 4JO 684 3777 Fax 410 684 3729 86619053"
4,"LT ll 70 TO o o NEW YORK 10017, LORILLARD 200 BAST 42nd STREET MEDICAL ASSOCIATION EDUCATION AND RESEARCH FOUNDATION 535 NORTH DEARBORN STREET 19 71 60610 t f U 1970 and 1971 installment 196,500 each for research into the alleged relationship of smoking to health 393 000 00 393 000 00 1 1 1 1 81615715 7 i t r t i M f S i"
5,"errvaoat Wamatn Pier 5 San fVanc scc California S4II1 Te eot cx e 5 955 2 X Canie Lan3os n To e 279738 Wi ASP LANDOR ASSOCIATES S a eg c Oes n Con iitHoto INVOICE invoice No 08347 Date 30 AUGUST 1985 Project No 018023 To PHILIP MORRIS INCORP 120 PARK AVENUE NEW YORK, N Y 10017 ATTN:MR FRED DELLA CROSSE n m V w I FOR PROFESSIONAL SERVICES RENDERED IN CONNECTION WITH DESIGN DEVELOPMENT FOR: VIRGINIA SLIMS ULTRA LIGHTS 100 S FLIP TOP 30X REG MEN BUDGET: 18,500 OR 20 O C PRESENTATION 21,345 00 9,250 00 12,095 00 PROFESSIONAL SERVICES FEES LESS PREVIOUSLY BILLED BALANCE NOTE: ADDITIONAL OUT OF POCKET EXPENSES TO BE BILLED UPON RECEIPT OF SUPPLIER INVOICES TOTAL AMOUNT THIS INVOICE 12,095 00 2044402070"
6,"ORIGINAL INVOICE NT RESEARCH INSTITUTE 7622 Lorillard Inc Mr Charles Gaworski P O Box 21688 Greensboro NC 27420 PLEASE REFER TO OUR INVOICE NUMBER AND REMIT TO: P O BOX 92003 CHICAGO, ILLINOIS 60675 FOR INQUIRIES CALL: 312 567 4136 DATE PROJECT NO ACCOUNT NUMBER CONTRACT OR P O NO TERMS 10 25 99 LL 08712 001 1112 02 000 P O 21293 NET CASH increase in Cost for Smoking Machine Maintenance and additional Pulmonary Physiology Measurements 15 500 00 DEPT 870C ACCT nottfij n 1333 DATE 8331257"


## <font color='red'> **Setting up SPACY Pipeline**</font>
We now have the text from the XML files. The next step is to extract the entities. <br>
 We use spaCy to extract the relevant information.

In [15]:
import spacy
import spacy.cli
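# The "en" shortcut is a spaCy 2.x feature (the version used here);
# spaCy 3.x requires the full package name "en_core_web_sm" instead.
spacy.cli.download("en")
nlp = spacy.load("en")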

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/opt/conda/lib/python3.7/site-packages/en_core_web_sm -->
/opt/conda/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [16]:
# Detect dates and names and collect them per document in dictionaries.
# Writing them straight into a Series would not work: when a document has
# several matches, only the last one would be kept.
date_dict = {}
#invoice_no = {}
customer_name_dict = {}
#total_amt = {}
for i in range(0, df.shape[0]):
  date_list = []
  name_list = []
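  # df.iloc[i, :] is a Series, so str(text) also contains its index label and
  # the "Name: ..., dtype: ..." footer (hence the leading '0    ' in the results
  # below); df.iloc[i, 0] would pass only the document text.
  text = df.iloc[i,:]
  doc = nlp(str(text))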
  for ent in doc.ents:
      if ent.label_ == 'DATE':
          date_list.append(ent.text)
          #date_dict[i] = ent.text
          #df['Date'] = ent.text
      elif ent.label_ == 'PERSON':
          name_list.append(ent.text)

  date_dict[i] = date_list        
  customer_name_dict[i] = name_list  

### **Unfiltered Results**:
The code above returns unfiltered results (for example, the dates include genuine dates along with other numbers tagged as dates). The results need further filtering; one approach is discussed below.

In [17]:
import pprint
#pprint.pprint(customer_name_dict)
pprint.pprint(date_dict)

{0: ['12 22 99', '9722318699', '3 52448 3561'],
 1: ['0    March 30, 1964', 'New York 17', 'March 10, 1964', '2531 13'],
 2: ['1201', '20044', '4358', 'February 21, 1991', '2029377724'],
 3: ['September 1, 1996',
     '21688',
     '1688',
     'September',
     'monthly',
     '1996',
     '1099'],
 4: ['70', '10017', '19 71', '1970', '1971'],
 5: ['10017'],
 6: ['21688', '1333', '8331257'],
 7: ['August 30, 1904', 'MS 38851', '0S 90b'],
 8: ['January 2, 1994', '1875', '20006', '87', '2041388921'],
 9: ['07631', 'April 13, 1994', '3784', '30 DAYS 2071382979'],
 10: ['5122',
      '1560',
      '2432 89 01',
      '07 01 96',
      'July, 1996',
      '07 01',
      '6,666 67'],
 11: ['1038553', '10016', '016 40'],
 12: ['68008', '68008', '03 29 96', '21', '22 22 5 25', '1064'],
 13: ['14 2803', 'jUNE 005'],
 14: ['January 9', 'March 15', '30 Days'],
 15: ['7566', 'the month of May, 1989'],
 16: ['4074', 'September 30, 1994'],
 17: ['10013', '1984', '20810 70', '00 20810', '3121 61 176

### **POC for filtering results**:
The following code shows how results can be filtered by running the pipeline a second time on each candidate. Several arbitrary values tagged as dates in the previous result are labelled 'CARDINAL' here. We reuse the same pipeline for this pass; for better results, a second pipeline built on a different model should be used.

In [18]:
# Re-run the pipeline on each date candidate of the first three documents.
for count, (key, candidates) in enumerate(date_dict.items(), start=1):
    print(key)
    print(candidates)
    for token in candidates:
        doc = nlp(str(token))
        for ent in doc.ents:
            print(ent.text, ent.label_)
    if count > 2:
        break
    print('******************************************************************************************')

0
['12 22 99', '9722318699', '3 52448 3561']
12 22 99 CARDINAL
3 52448 CARDINAL
******************************************************************************************
1
['0    March 30, 1964', 'New York 17', 'March 10, 1964', '2531 13']
0    March 30, 1964 DATE
New York 17 GPE
March 10, 1964 DATE
2531 13 DATE
******************************************************************************************
2
['1201', '20044', '4358', 'February 21, 1991', '2029377724']
1201 DATE
February 21, 1991 DATE
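
A complementary, rule-based filter can be sketched with plain regular expressions. This is only an illustration: the patterns and the helper names `looks_like_date` and `filter_dates` are assumptions, not something spaCy provides, and they deliberately miss partial formats such as 'January 9'. The candidate lists are copied from the output above.

```python
import re

MONTH_NAMES = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
               'august', 'september', 'october', 'november', 'december']
# Full names first, then three-letter abbreviations: 'january|jan|february|feb|...'
MONTHS = '|'.join(f'{m}|{m[:3]}' for m in MONTH_NAMES)

DATE_PATTERNS = [
    # 'March 30, 1964' or 'July, 1996'
    re.compile(rf'\b(?:{MONTHS})\.?\s*(?:\d{{1,2}})?,?\s*\d{{4}}\b', re.IGNORECASE),
    # '30 AUGUST 1985'
    re.compile(rf'\b\d{{1,2}}\s+(?:{MONTHS})\.?,?\s+\d{{4}}\b', re.IGNORECASE),
    # '12 22 99', '07/01/96', '03-29-1996' (the whole candidate must match)
    re.compile(r'^\d{1,2}[ /-]\d{1,2}[ /-]\d{2}(?:\d{2})?$'),
]

def looks_like_date(candidate):
    """Return True if the candidate matches one of the date patterns."""
    text = candidate.strip()
    return any(pattern.search(text) for pattern in DATE_PATTERNS)

def filter_dates(candidates):
    """Keep only the candidates that look like dates."""
    return [c for c in candidates if looks_like_date(c)]

doc0 = filter_dates(['12 22 99', '9722318699', '3 52448 3561'])
doc2 = filter_dates(['1201', '20044', '4358', 'February 21, 1991', '2029377724'])
print(doc0)
print(doc2)
```

Such rules are strict where the statistical model is permissive, so the two approaches can be combined: the model proposes candidates and the patterns decide which ones to keep.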


#### **Matching the custom patterns**

We load spaCy's `Matcher` and create a matcher object from it. Patterns are added to `matcher`, and the `matcher` object is then applied to a spaCy doc (containing the text data) in order to find the custom patterns.



In [19]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)
df[0] = df[0].astype('string')  # astype returns a new Series, so assign it back
print(type(df.iloc[5][0]))

<class 'str'>


In [20]:
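# Month name, day and year as three consecutive tokens (defined for reference, not added to the matcher below)
date_pattern3 = [{'LOWER': {'REGEX': r'^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)'}},
                 {'LOWER': {'REGEX': r'^\d{1,2}$'}},
                 {'LOWER': {'REGEX': r'^\d{2,4}$'}}]
# A token whose lowercase form is exactly one of the three-letter abbreviations
date_pattern4 = [{'LOWER': {'IN': ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']}}]

# spaCy 2.x signature: add(key, on_match, *patterns); spaCy 3.x expects add(key, [pattern]).
matcher.add('Custom_Date', None, date_pattern4)
doc = nlp(df.iloc[5][0])  # same document as in the cells around this one
matches = matcher(doc)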

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    #print(match_id, string_id, start, end, span.text)
    print(start, end, span.text)

#### **Call_back function**

In [21]:
date_pattern1 = [{'LOWER':{"REGEX":'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)'}},{}]
                           

def callback_method(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    entity = doc[start:end]
    print(entity.text)
    
matcher = Matcher(nlp.vocab)
matcher.add('Date_Cust2', callback_method,date_pattern1)
doc = nlp(df.iloc[5][0])
matcher(doc)    

AUGUST 1985


[(3056582165036000932, 46, 48)]

**Hope this helps you get started with the dataset and gives you an idea of what to do and how to do it. Do share your own ideas and methods for entity extraction from such files.**