# Problem Statement:
## 1. Given a new document, classify each of its lines as an addressline or a non-addressline.

In [1]:
from src import DataStats
from pprint import pprint
from src import check_repeated_data
from src.utility import jsonl_reader
from src.utility import inspection_full_matching
from src.utility import inspection_partial_matching

I0725 16:27:15.942118 140005237073728 file_utils.py:41] PyTorch version 1.5.0+cu101 available.
I0725 16:27:16.759424 140005237073728 file_utils.py:57] TensorFlow version 2.2.0 available.


In [2]:
dataset = jsonl_reader('dataset/sample_dataset.jsonl')

From an initial look at the dataset, the ground truth does not appear to be ground truth (labeled data) in the traditional sense. It is more a statement that the document contains these two addresses; we do not necessarily have their spans in the document text.

Hence, the main objective of this notebook is to explore the data and, based on these observations, create a labeled dataset for my further modelling experiments.

In [3]:
# Some stats on our dataset
Z = DataStats(dataset)
Z.stats

Number of the documents where both entities are present: 3/105
Number of the documents where at least one are present: 29/105
Number of the documents where vendor address are present: 25/105
Number of the documents where buyer address are present: 7/105


In [4]:
# Let's check the documents where the ground truth is not found

# Example_1
gt_not_found = Z.not_found
document = gt_not_found[0]
# Print the given ground truth and the corresponding matches found in the document
inspection_full_matching(document)

{'buyer_address': {'region': {'x1': 55, 'x2': 632, 'y1': 473, 'y2': 535},
                   'text': '26 Theres South Rd CaliFornia'},
 'vendor_address': {'region': {'x1': 44, 'x2': 944, 'y1': 198, 'y2': 239},
                    'text': '6t76et Kaduty Loop, #5-00,caliFomia 526974'}}

['Koss TRADING PVT LTD iho 8431 TAX INVOICE lor(1',
 'Tax Invoice No: 2020926 3059219/',
 '6t6et Kaduty Loop, #5-00,califomia 526974 Date 21.06.15 Avócél :11.36-1.30',
 'Tel: 12975 0198 Fax: 1297569 0266 A/C No 2SE0071 WIT',
 'UEN: 198502422N']


In [5]:
# Example_2
inspection_full_matching(gt_not_found[10], n=7)

{'buyer_address': {'region': {'x1': 147, 'x2': 408, 'y1': 794, 'y2': 853},
                   'text': '8 Marina View Caliofornia 93301'},
 'vendor_address': {'region': {'x1': 866, 'x2': 1513, 'y1': 3249, 'y2': 3333},
                    'text': '4461 Cordova Street\n'
                            'Vancouver, British Columbia, V6B 1E1'}}

['8 Marina View GST Reg NO: 2016138210',
 '#34-01 Goodwin PCG Pte Ltd',
 '93301',
 'California',
 'TRAINING DETAILS']


In [6]:
#Example_3
document = dataset[86]
pprint(document['ground_truth'])
pprint([i['text'] for i in document['document']])

{'buyer_address': {'region': {'x1': 420, 'x2': 810, 'y1': 347, 'y2': 456},
                   'text': 'Tellas South Road\nCalifornia 560 1327'},
 'vendor_address': {'region': {'x1': 815, 'x2': 1697, 'y1': 3249, 'y2': 3317},
                    'text': '20 ANGMIOLA INDIC INDUSTRIAL PARK 2A #05-09 RMC '
                            'TECHLINK \n'
                            'CALIFORNIA 567761'}}
['TAX INVOICE',
 'No.: T11512/19',
 'Date : 29-12-15',
 'TO: FANNY Chemistry',
 'Tellas South Road',
 'California 560 1327',
 'GST Reg. No:',
 '5303 1 1930',
 'GST REG. NO: 20-0306066H',
 'Attn: Finance Department Sales Person: Linda',
 'Terms: Final',
 'Tel: 6861 1773 Fax: 6862 3327',
 'S/N DESCRIPTION CLAIM % QUANTITY UNIT RATE AMOUNT S$',
 'Project: Production & Warehouse Floor @ 26 Tuas West Rd Singapore 638382',
 'As per confirmation on your PO No.53610479-000 OM dated 06.07.2015',
 '1) Vector High Performance Floor. 100% 1 lot $ 101,871.59 $ 101,871.59',
 'Coating at Prod & W. House',
 'Origi

### Some Initial Observations:
- The ground-truth text is not present verbatim in the dataset.
- Example_1 ***(Easy to tackle)*** shows that one possible reason is OCR error. If OCR fails, even on a single character, we cannot find the exact string. In the example above, the ground truth contains **6t76et Kaduty Loop, \#5-00,caliFomia 526974** whereas the document contains **6t6et Kaduty Loop, #5-00,califomia 526974 Date 21.06.15 Avócél :11.36-1.30**. Apart from letter case (**caliFomia** vs **califomia**), only the first token does not match.
- Example_2 ***(Bit complex scenario)*** shows that the ground truth can come from multiple lines, and that address lines are not contiguous in the document object. For example, the buyer address is given as **8 Marina View Caliofornia 93301**, but in the actual document it is spread over multiple lines, as shown below:
    1. 8 Marina View GST Reg NO: 2016138210
    2. #34-01 Goodwin PCG Pte Ltd
    3. 93301
    4. California
    5. TRAINING DETAILS 
- Example_3 **(Bit weird)** The ground truth is almost non-existent in the document. The ground truth is **20 ANGMIOLA INDIC INDUSTRIAL PARK 2A #05-09 RMC**, whereas after OCR we have **20 ANG MORA 88-FRRIE TECHPA98H**. Almost any string-matching algorithm will fail here.
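
To make the Example_1 observation concrete, here is a minimal sketch of OCR-tolerant matching using the standard library's `difflib.SequenceMatcher`. It is not the matching logic inside `DataStats`; the function name `best_window_similarity` and the sliding-window approach are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def best_window_similarity(ground_truth: str, line: str) -> float:
    """Best similarity between the ground truth and any token window of the line.

    Hypothetical helper: the window has as many tokens as the ground truth,
    so trailing non-address tokens on the line do not dilute the score.
    """
    gt_tokens = ground_truth.lower().split()
    line_tokens = line.lower().split()
    n = len(gt_tokens)
    if n == 0 or not line_tokens:
        return 0.0
    target = " ".join(gt_tokens)
    best = 0.0
    # Slide a window of n tokens over the line (one window if the line is shorter).
    for start in range(max(1, len(line_tokens) - n + 1)):
        window = " ".join(line_tokens[start:start + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


# Strings taken from Example_1 above.
gt = "6t76et Kaduty Loop, #5-00,caliFomia 526974"
line = "6t6et Kaduty Loop, #5-00,califomia 526974 Date 21.06.15 Avócél :11.36-1.30"
other = "Tel: 12975 0198 Fax: 1297569 0266 A/C No 2SE0071 WIT"

print(round(best_window_similarity(gt, line), 3))
print(round(best_window_similarity(gt, other), 3))
```

A threshold on this score would recover Example_1 despite the single-character OCR error, but it would not help with Example_3, where the OCR text shares almost nothing with the ground truth.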
    
### What Next?
- We can look a bit further into string matching.
- The dataset also contains bounding-box information. Let's see whether that is useful.
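
One way the bounding boxes might be used is to test how much of a line's box lies inside the ground-truth `region`. The sketch below assumes that each document line carries a box with the same `x1`, `y1`, `x2`, `y2` keys as the ground truth; that schema is an assumption, and `overlap_fraction` is a hypothetical helper, not part of `src`.

```python
def overlap_fraction(line_box: dict, gt_box: dict) -> float:
    """Fraction of the line's box area that lies inside the ground-truth region."""
    width = min(line_box["x2"], gt_box["x2"]) - max(line_box["x1"], gt_box["x1"])
    height = min(line_box["y2"], gt_box["y2"]) - max(line_box["y1"], gt_box["y1"])
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    line_area = (line_box["x2"] - line_box["x1"]) * (line_box["y2"] - line_box["y1"])
    return (width * height) / line_area if line_area > 0 else 0.0


# Ground-truth buyer region from Example_1; the line boxes are made up.
gt_region = {"x1": 55, "x2": 632, "y1": 473, "y2": 535}
inside = {"x1": 60, "x2": 600, "y1": 480, "y2": 530}
half = {"x1": 55, "x2": 632, "y1": 504, "y2": 566}
outside = {"x1": 44, "x2": 944, "y1": 198, "y2": 239}

print(overlap_fraction(inside, gt_region))
print(overlap_fraction(half, gt_region))
print(overlap_fraction(outside, gt_region))
```

Lines whose overlap exceeds some threshold could then be treated as address-line candidates even when string matching fails, provided the regions and the OCR line boxes share a coordinate system.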

In [7]:
Z1 = DataStats(dataset, 'partial-match')

In [8]:
Z1.stats

Number of the documents where both entities are present: 105/105
Number of the documents where at least one are present: 105/105
Number of the documents where vendor address are present: 105/105
Number of the documents where buyer address are present: 105/105
Number of the documents where multi line vendor addresses are present: 55/105
Number of the documents where multi line buyer addresses are present: 93/105


In [9]:
#Example_4
document = gt_not_found[10]
lines = [i['text'] for i in  document['document']]
inspection_partial_matching(document)

vendor_address
Ground Truth: 4461 Cordova Street
Vancouver, British Columbia, V6B 1E1

Actual Present Text: Bank Address : 4461 Cordova Street, Vancouver, British
Account Number : Columbia, V6B 1E1, 60-46031-44591
4461 Cordova Street
Vancouver, British Columbia, V6B 151
**********
buyer_address
Ground Truth: 8 Marina View Caliofornia 93301

Actual Present Text: 8 Marina View GST Reg NO: 2016138210
93301
California
**********


In [10]:
#Example_5
document = gt_not_found[2]
lines = [i['text'] for i in  document['document']]
inspection_partial_matching(document)

vendor_address
Ground Truth: 940 Nancy Street #27N, NC Buildıng North Carolina

Actual Present Text: 940 Nancy Street #27N, NC Building North Carolina 27530 Tel: +65 52589 6144 Fax: +65 7820 4311
4940 Nancy Street The Federal Banking Corporation Ltd NC Branch
#27N, NC Building 2470 Nancy Street
84 North Carolina 27537
office: 8C 4940 Nancy Street #27N, NC Building North Carolina 27530 Registration No.: 199782100D
**********
buyer_address
Ground Truth: 4290 Victoria Court
Fort Fairfield
Maine 04742

Actual Present Text: 4290 Vic ria Court DATE OF INVOICE 21-Feb-17
Fort Fairfield
Maine 04742 ACCOUNT DETAILS
**********


In [11]:
#Example_6
document = gt_not_found[21]
lines = [i['text'] for i in  document['document']]
inspection_partial_matching(document)

vendor_address
Ground Truth: 32 eper Pay Cedar Rd #07
02A Da Jin Factory Buidlding California Singapore 520136

Actual Present Text: California 039780
32 Pec per Pay Ledar Ad #07. LO2A Da Jin Factory Building California ingapore 520136 . Tel: 6281 7520, Fax: 6284 1259, Email: sales@pypemedia.com.sg
**********
buyer_address
Ground Truth: Oke Temarsek Avenue
21st Floor Millenial Tower
California 039780

Actual Present Text: Oke Temarsek Avenue
21st Floor Millenial Tower
California 039780
32 Pec per Pay Ledar Ad #07. LO2A Da Jin Factory Building California ingapore 520136 . Tel: 6281 7520, Fax: 6284 1259, Email: sales@pypemedia.com.sg
**********


In [12]:
#Example_7
document = dataset[10]
lines = [i['text'] for i in  document['document']]
document['ground_truth']

{'buyer_address': {'text': '4290 Victoria Court\nFort FairfieLd\nMaine 04742',
  'region': {'x1': 311, 'y1': 866, 'x2': 646, 'y2': 995}},
 'vendor_address': {'text': 'Farrell+Gould Project Pvt Ltd\n4940 Nancy Street\n#27N, NC Building\nNorth Carolina 27530',
  'region': {'x1': 314, 'y1': 2640, 'x2': 772, 'y2': 2812}}}

In [19]:
# Let's check whether the ground truth is repeated across documents
both_address_repeating = check_repeated_data(dataset)
buyer_address_repeating = check_repeated_data(dataset, ['buyer_address'])
vendor_address_repeating = check_repeated_data(dataset, ['vendor_address'])

print(f'Buyer address is repeating: {len(buyer_address_repeating)}/{len(dataset)}')
print(f'Vendor address is repeating: {len(vendor_address_repeating)}/{len(dataset)}')
print(f'Vendor and Buyer address are repeating: {len(both_address_repeating)}/{len(dataset)}')

Buyer address is repeating: 69/105
Vendor address is repeating: 58/105
Vendor and Buyer address are repeating: 46/105


### Data Observation:
1. In Example_4 and Example_5, the vendor address is repeated multiple times within the actual document.
2. One more interesting scenario (Example_6) involves multi-line matching: partial data from the ground truth, **California 039780**, matches **32 Pec per Pay Ledar Ad #07. LO2A Da Jin Factory Building California ingapore 520136 . Tel: 6281 7520, Fax: 6284 1259, Email: sales@pypemedia.com.sg**.
3. In the previous scenario, only one token partially matches the actual line, but that line does contain an address. Ideally, if we are doing **Address** vs **Non-Address** classification, the ground truth should also include this line: **32 Pec per Pay Ledar Ad #07. LO2A Da Jin Factory Building California ingapore 520136 . Tel: 6281 7520, Fax: 6284 1259, Email: sales@pypemedia.com.sg**.
4. In Example_7, the ground-truth vendor address **'Farrell+Gould Project Pvt Ltd\n4940 Nancy Street\n#27N, NC Building\nNorth Carolina 27530'** also contains the organization name.
5. The text data comes from invoices, and invoices from the same vendor (**mostly**) share the same data and format. Of the 105 invoices, 58 have a repeated vendor address, which suggests that the data comes from roughly 47 different invoice types.
6. Vendor and buyer addresses repeat together in 46 documents, which means that at the text level the data has very **little variance**.
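
The repetition counts above come from `check_repeated_data`, whose implementation is not shown in this notebook. As a rough illustration of the idea only, the sketch below counts documents whose ground-truth text also occurs in another document. The function `repeated_documents` and its whitespace and case normalisation are assumptions; the `ground_truth` layout follows the output of Example_7.

```python
from collections import Counter


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def repeated_documents(dataset, keys=("buyer_address", "vendor_address")):
    """Return documents whose ground-truth text for `keys` appears in more than one document."""
    def signature(doc):
        return tuple(normalise(doc["ground_truth"][k]["text"]) for k in keys)

    counts = Counter(signature(doc) for doc in dataset)
    return [doc for doc in dataset if counts[signature(doc)] > 1]


# Toy data for illustration; only the buyer address repeats.
toy = [
    {"ground_truth": {"buyer_address": {"text": "4290 Victoria Court\nFort Fairfield\nMaine 04742"},
                      "vendor_address": {"text": "A Street"}}},
    {"ground_truth": {"buyer_address": {"text": "4290 Victoria Court\nFort FairfieLd\nMaine 04742"},
                      "vendor_address": {"text": "B Street"}}},
    {"ground_truth": {"buyer_address": {"text": "8 Marina View"},
                      "vendor_address": {"text": "C Street"}}},
]

print(len(repeated_documents(toy, ["buyer_address"])), "/", len(toy))
```

Low variance of this kind matters for modelling: if train and test splits share the same addresses, a classifier may memorise them rather than learn what an address line looks like.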


### Overall Problem Observation:
1. Example_6 shows that a document can contain an address that is neither the vendor address nor the buyer address. What if we first build an address classification/identification step that finds all the addresses in the document, and then process them further?
2. Once we identify the lines that contain an address, we need to focus on address extraction. For example, in Example_5 the line `office: 8C 4940 Nancy Street #27N, NC Building North Carolina 27530 Registration No.: 199782100D` contains an address, but it also contains non-address tokens.
3. As we have seen in the previous examples, the ground truth is present either in one line or across multiple lines, and those lines may carry extra tokens. As such, it makes sense to convert the 2-class classification (**addressline vs non-addressline**) into a 3-class classification (**full-addressline vs partial-addressline vs non-addressline**). The additional class might help with the later task of identifying the buyer and the vendor. Reframing the problem this way could be useful when we actually parse out the address: a line identified as partial tells us there are junk tokens to clean up, whereas a full line mostly needs no clean-up.
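
The 3-class idea can be sketched as a simple token-coverage rule. This is only an illustration under stated assumptions: the function `label_line`, the `\w+` tokenisation, and the 0.8 threshold are all hypothetical choices, and exact token matching does not tolerate OCR errors.

```python
import re


def tokens(text: str) -> list:
    """Lower-cased alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def label_line(line: str, address_lines: list, full_threshold: float = 0.8) -> str:
    """Label a document line by the share of its tokens found in the ground-truth address."""
    line_tokens = tokens(line)
    if not line_tokens:
        return "non-addressline"
    address_tokens = {t for a in address_lines for t in tokens(a)}
    hits = sum(t in address_tokens for t in line_tokens)
    if hits == 0:
        return "non-addressline"
    coverage = hits / len(line_tokens)
    return "full-addressline" if coverage >= full_threshold else "partial-addressline"


# Buyer address and document lines taken from Example_5.
buyer = ["4290 Victoria Court", "Fort Fairfield", "Maine 04742"]
for text in ["Fort Fairfield",
             "Maine 04742 ACCOUNT DETAILS",
             "4290 Vic ria Court DATE OF INVOICE 21-Feb-17",
             "TRAINING DETAILS"]:
    print(label_line(text, buyer), "|", text)
```

Note that a single shared token (such as **California** in Example_6) is enough to produce a partial label under this rule, so a real labeling pass would likely need a minimum hit count or fuzzy token comparison.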

### What Next:
1. Create the 2-class (**addressline vs non-addressline**) and 3-class (**full-addressline vs partial-addressline vs non-addressline**) classification datasets.