# Understanding and loading the inputs
To design a solution to the proposed challenge, we first analyze the input OCR files to understand how they are structured. Each JSON file contains a `pages` key that holds an array, and each object in that array has a `fullTextAnnotation` key whose `text` attribute holds the plain text extracted from the page. We then create a class to represent a single input JSON, along with a function to parse it into a more manageable format.
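
As a rough illustration of the structure described above, the sketch below joins the `text` of every page into a single string. The helper name `extract_text` and the sample dictionary are hypothetical; the actual `OCR` class in `src.classes` may be organized differently.

```python
def extract_text(ocr_json):
    """Join the plain text of every page in a parsed OCR JSON (hypothetical helper)."""
    pages = ocr_json.get("pages", [])
    # Pages without a `fullTextAnnotation` contribute an empty string.
    texts = [page.get("fullTextAnnotation", {}).get("text", "") for page in pages]
    return "\n".join(texts)


# Made-up sample that mirrors the structure described above.
sample = {
    "pages": [
        {"fullTextAnnotation": {"text": "JUMBO\nCARRERA 98 No 16-50"}},
        {"fullTextAnnotation": {"text": "TIQUETE J212 341304"}},
    ]
}
print(extract_text(sample))
```

Returning an empty string for missing keys keeps the helper usable on incomplete files, at the cost of hiding malformed inputs; raising an error instead would be an equally reasonable choice.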

Now we test the custom class against the input OCR files located in the `./ocr` folder.

In [1]:
import json
from glob import glob
from src.classes import OCR

ocrs = glob("ocr/*.json")

object_ocrs = []
for ocr in ocrs:
    with open(ocr, 'r', encoding='utf-8') as text_ocr:
        json_ocr = json.load(text_ocr)
        object_ocr = OCR(json_ocr)
        object_ocrs.append(object_ocr)

When we print the first ticket:

In [2]:
print(object_ocrs[0])



# Field extraction
Next, we look for the patterns needed to extract each of the desired output fields.

In [3]:
test_subject = object_ocrs[1]  # Ticket 1

## Date
The date appears towards the end of the receipt in `DD/MM/YYYY` format, preceded by a timestamp. The following regex pattern is suggested:  

`(?<=[01][0-9]:[0-5]\d).*(([0-2]|(3))(?(3)[01]|\d)/(0|(1))(?(5)[0-2]|[1-9])/(1|(2))(?(7)0|9)(?(7)[0-3]|[7-9])\d)`  

A positive lookbehind matches the timestamp (considering the value constraints for the format used, 00:00 AM/PM). Any characters between the timestamp and the date are skipped, and the date itself is captured with a group. The date pattern also enforces reasonable value constraints through conditional groups, with years ranging from `1970` to `2039`.

In [4]:
test_subject.set_date()
test_subject.date

'04/11/2021'
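
For readers who find conditional groups hard to follow, the same value constraints can be approximated with plain alternation. This is a simplified sketch and not the pattern used by `set_date`; the function name `find_date` and the sample strings are assumptions made for illustration.

```python
import re

# Timestamp lookbehind, a skipped gap, then the captured date.
# Day 00-31, month 01-12, year 1970-2039, written with plain alternation
# instead of conditional groups.
DATE_PATTERN = re.compile(
    r"(?<=[01][0-9]:[0-5]\d).*"
    r"((?:[0-2]\d|3[01])/(?:0[1-9]|1[0-2])/(?:19[7-9]\d|20[0-3]\d))"
)


def find_date(text):
    """Return the first date that follows a timestamp, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


print(find_date("10:32 AM 04/11/2021"))      # made-up line with a timestamp
print(find_date("no timestamp 04/11/2021"))  # lookbehind fails, so None
```

Because the lookbehind requires a timestamp somewhere earlier on the same line, dates that appear elsewhere in the receipt are ignored.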

## Store address
This value appears after the keyword `JUMBO`, which is used together with the key phrase `VENDEDOR ELECTRO` as anchors to extract it. The suggested regex pattern is:  

`(?<=JUMBO).*[^\n]((.*\s)*)(?=VENDEDOR ELECTRO)`

A positive lookbehind finds the keyword, the non-newline characters that follow it (the name of the store) are skipped, and everything after that, up to the key phrase `VENDEDOR ELECTRO`, is captured with a group.  

_Note_: This pattern causes a backtracking issue if it is run against the full annotation, so a slicing approach is taken instead: the pattern is applied only to the region spanning the first match for the keyword `JUMBO` and the next 10 lines.

In [5]:
test_subject.set_store_address()
test_subject.store_address

'CARRERA 98 No 16-50'
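
The slicing approach can be sketched as follows. The function name `extract_store_address`, the `window` parameter, and the sample text are assumptions; the actual `set_store_address` method may differ.

```python
import re

ADDRESS_PATTERN = re.compile(r"(?<=JUMBO).*[^\n]((.*\s)*)(?=VENDEDOR ELECTRO)")


def extract_store_address(text, window=10):
    """Apply the address pattern only to the lines around the first `JUMBO`."""
    start = text.find("JUMBO")
    if start == -1:
        return None
    # Keep the line holding the keyword plus the next `window` lines.
    lines = text[start:].split("\n")
    region = "\n".join(lines[: window + 1])
    match = ADDRESS_PATTERN.search(region)
    if match is None:
        return None
    # Collapse the captured newlines and spaces into single spaces.
    return " ".join(match.group(1).split())


# Made-up annotation; only the address line comes from the ticket above.
sample = (
    "HEADER\nJUMBO STORE NAME\nCARRERA 98 No 16-50\n"
    "VENDEDOR ELECTRO\n7702406000150\nAzucar refinado\n"
)
print(extract_store_address(sample))
```

Limiting the region bounds the amount of text the nested quantifier `(.*\s)*` can backtrack over, which is what makes the pattern practical.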

## Invoice number
Assuming that the ticket number located at the bottom of the receipt (format `J\d{3} \d{6}`) is the invoice number, the following pattern is used to retrieve it:  

`(?<=TIQUETE)\s?(J\d{3})\s?(?=(\d{6}))`

This pattern looks for the keyword `TIQUETE`, then matches and captures the pattern `J\d{3}`, and finally uses a positive lookahead to match and capture the pattern `\d{6}`. The invoice number is then returned as a formatted string built from captured groups 1 and 2.

In [6]:
test_subject.set_invoice_number()
test_subject.invoice_number

'J212 341304'
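
A minimal sketch of how the two captured groups can be combined is shown below. The function name `find_invoice_number` is hypothetical, and the sample string is built from the output above.

```python
import re

INVOICE_PATTERN = re.compile(r"(?<=TIQUETE)\s?(J\d{3})\s?(?=(\d{6}))")


def find_invoice_number(text):
    """Return the invoice number as 'Jddd dddddd', or None if not found."""
    match = INVOICE_PATTERN.search(text)
    if match is None:
        return None
    # Group 1 is the `J` block; group 2 is captured inside the lookahead.
    return f"{match.group(1)} {match.group(2)}"


print(find_invoice_number("TIQUETE J212 341304"))  # J212 341304
```

Since `\s?` also matches a newline, the same call works when the OCR output places the two blocks on separate lines.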

## Subtotal
A general pattern to extract this value was not found at the time of solving the challenge. Alternatively, it can be computed from the extracted values:  

`SUBTOTAL = SUM(LINE_ITEMS_TOTALS)`

The code implementation of this calculation falls beyond the scope of the challenge.
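
Although the implementation is out of scope, the formula itself is simple enough to illustrate. The sketch below assumes line items shaped like the dictionaries produced later in this notebook; the function name `compute_subtotal` and the shortened sample list are hypothetical.

```python
def compute_subtotal(line_items):
    """Sum the totals of all line items, skipping items without a total."""
    return sum(
        item["total"] for item in line_items if item.get("total") is not None
    )


# Shortened sample using three of the line items extracted in this notebook.
items = [
    {"sku": "7702406000150", "description": "Azucar refinado", "total": 8590},
    {"sku": "7702020212052", "description": "DONAREPA PERLAD", "total": 3690},
    {"sku": "7705491102020", "description": "PANELA EXTRA RE", "total": 4390},
]
print(compute_subtotal(items))  # 16670
```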

## Total
A general pattern to extract this value was not found at the time of solving the challenge. As with the subtotal, it could be computed from the extracted values, provided that a pattern to extract the rate of each tax code is built and used. Since this calculation is beyond the scope of the challenge, neither the code to perform it nor the code to extract the rates (%) for the tax codes is implemented.

## Line items
To extract the information for each line item, we take the slice of the annotation between the keywords `VENDEDOR ELECTRO` and `NRO. CUENTA`. The pattern defined for each attribute is then run against this slice.
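
One possible way to take that slice is sketched below. The function name `slice_line_items` and the fallback behavior when a keyword is missing are assumptions, not the notebook's implementation.

```python
def slice_line_items(text, start_key="VENDEDOR ELECTRO", end_key="NRO. CUENTA"):
    """Return the text between the two keywords (hypothetical helper)."""
    start = text.find(start_key)
    if start == -1:
        return ""
    start += len(start_key)
    end = text.find(end_key, start)
    # If the closing keyword is missing, fall back to the end of the text.
    return text[start:] if end == -1 else text[start:end]


# Made-up annotation with a single line item.
sample = "JUMBO\nVENDEDOR ELECTRO\n7702406000150 Azucar refinado\nNRO. CUENTA\n"
print(repr(slice_line_items(sample)))
```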

### SKU and Description
Each line item starts with its `SKU` and a `description`. The pattern suggested to retrieve them is:  

`(\d{13})\s?((\w*[^\n])*)`

This pattern matches and captures a 13-digit number to retrieve the `SKU`, then matches and captures the run of non-newline characters that follows (including spaces) to retrieve the description.

In [7]:
test_subject.set_line_items_SKU_and_description()
test_subject.line_items

[{'sku': '7702406000150', 'description': 'Azucar refinado'},
 {'sku': '7702406000150', 'description': 'Azucar refinado'},
 {'sku': '7702406000150', 'description': 'Azucar refinado'},
 {'sku': '7702406000150', 'description': 'Azucar refinado'},
 {'sku': '7702020212052', 'description': 'DONAREPA PERLAD'},
 {'sku': '7702020212052', 'description': 'DONAREPA PERLAD'},
 {'sku': '7705491102020', 'description': 'PANELA EXTRA RE'},
 {'sku': '7705491102020', 'description': 'PANELA EXTRA RE'},
 {'sku': '7702026020507', 'description': 'Servilleta FAMI'},
 {'sku': '7702010225123', 'description': 'FABULOSO LAVAND'},
 {'sku': '7701018005089', 'description': 'ACEITE OLEOCALI'},
 {'sku': '7500435126823', 'description': 'Shampoo H&S lim'},
 {'sku': '7622201772840', 'description': 'Gelatina ROYAL'},
 {'sku': '7509546672557', 'description': 'Desodorante SPE'},
 {'sku': '7702026193003', 'description': 'Papel hig. FAMIL'}]
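
The sketch below shows how the pattern can be turned into a list of dictionaries like the one above. The function name `find_skus` is hypothetical, and the sample region reuses two of the items shown in the output.

```python
import re

SKU_PATTERN = re.compile(r"(\d{13})\s?((\w*[^\n])*)")


def find_skus(region):
    """Return one dictionary per SKU found in the line items region."""
    return [
        {"sku": match.group(1), "description": match.group(2)}
        for match in SKU_PATTERN.finditer(region)
    ]


region = "\n7702406000150 Azucar refinado\n7702020212052 DONAREPA PERLAD\n"
for item in find_skus(region):
    print(item)
```

`finditer` is used instead of `findall` so that the inner repeated group, which only exists to drive the repetition, does not end up in the result.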

### Tax code
To retrieve this value, we start by detecting possible tax codes based on two signals: the frequency with which they appear, and their presence in the tax descriptions after the taxation table. The pattern to achieve this is:  

`(((?<=\d\s)|(?<!.))([A-Z])\n)|(([A-Z])(?=\=))`

The first alternative matches single capital letters at the end of a line that are either preceded by a digit and a whitespace character or are the only character on the line. The second alternative retrieves the tax codes that are explicitly mentioned in the receipt by matching single capital letters followed by an equals sign.
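
A minimal sketch of how the two alternatives can be separated is shown below. The function name `detect_tax_codes` and the sample text (including the `A=19%` line) are made up for illustration; how the two signals are finally combined is left open here.

```python
import re
from collections import Counter

TAX_CODE_PATTERN = re.compile(r"(((?<=\d\s)|(?<!.))([A-Z])\n)|(([A-Z])(?=\=))")


def detect_tax_codes(text):
    """Return (frequency of candidate codes, codes declared with '=')."""
    counts = Counter()  # letters after an amount or alone on a line
    declared = set()    # letters followed by an equals sign
    for match in TAX_CODE_PATTERN.finditer(text):
        if match.group(3):
            counts[match.group(3)] += 1
        elif match.group(5):
            declared.add(match.group(5))
    return counts, declared


# Made-up fragment: three amounts, one code alone on a line, one declaration.
sample = "8590 A\n3690 A\n4390\nN\nA=19%\n"
counts, declared = detect_tax_codes(sample)
print(counts, declared)
```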

### Total
Using the tax codes retrieved before, we can delimit the slice of the annotation most likely to hold the list of totals for each line item. The step-by-step procedure is:  
- Look for any amounts that match the pattern `^(\d*\s)(?=([TAX_CODE])$)`.
- Find the index of the first match searching top to bottom and of the first match searching bottom to top (that is, the first and last matches).
- Within the slice delimited by those indexes, capture any other feasible amount even if it doesn't have a tax code.
- To check the coherence of the added codeless amounts, traverse the array comparing with the modulo operator.
- If the length of the resulting list equals the number of SKUs retrieved, match the elements checking that the uniqueness across both lists is preserved.
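
The first three steps can be sketched as follows. This is only an illustration under stated assumptions: the function name `collect_totals` is hypothetical, `[TAX_CODE]` is expanded into a character class of the known codes, `\d+` is used instead of `\d*` so that a bare tax code is not read as an amount, and the modulo coherence check and the matching against SKUs are left out.

```python
import re


def collect_totals(lines, tax_codes):
    """Return (amount, tax_code) pairs between the first and last coded amounts."""
    if not tax_codes:
        return []
    code_class = "".join(re.escape(code) for code in sorted(tax_codes))
    coded = re.compile(rf"^(\d+\s)(?=([{code_class}])$)")
    bare = re.compile(r"^\d+$")  # amount without a tax code

    # Indexes of the lines holding an amount followed by a known tax code.
    hits = [i for i, line in enumerate(lines) if coded.match(line)]
    if not hits:
        return []

    totals = []
    for line in lines[hits[0] : hits[-1] + 1]:
        match = coded.match(line)
        if match:
            totals.append((int(match.group(1)), match.group(2)))
        elif bare.match(line):
            totals.append((int(line), None))
    return totals


# Made-up lines; the amounts reuse values from the outputs in this notebook.
lines = ["VENDEDOR ELECTRO", "8590 A", "3690 A", "4390", "6590 N", "NRO. CUENTA"]
print(collect_totals(lines, {"A", "N"}))
```

Amounts without a tax code are kept with `None`, which mirrors the `'tax_code': None` entries in the notebook output.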

In [8]:
test_subject.set_line_items_tax_codes_and_totals()
test_subject.line_items

[{'sku': '7702406000150',
  'description': 'Azucar refinado',
  'total': 8590,
  'tax_code': 'A'},
 {'sku': '7702406000150',
  'description': 'Azucar refinado',
  'total': 8590,
  'tax_code': 'A'},
 {'sku': '7702406000150',
  'description': 'Azucar refinado',
  'total': 8590,
  'tax_code': 'A'},
 {'sku': '7702406000150',
  'description': 'Azucar refinado',
  'total': 8590,
  'tax_code': 'A'},
 {'sku': '7702020212052',
  'description': 'DONAREPA PERLAD',
  'total': 3690,
  'tax_code': 'A'},
 {'sku': '7702020212052',
  'description': 'DONAREPA PERLAD',
  'total': 3690,
  'tax_code': 'A'},
 {'sku': '7705491102020',
  'description': 'PANELA EXTRA RE',
  'total': 4390,
  'tax_code': None},
 {'sku': '7705491102020',
  'description': 'PANELA EXTRA RE',
  'total': 4390,
  'tax_code': None},
 {'sku': '7702026020507',
  'description': 'Servilleta FAMI',
  'total': 6590,
  'tax_code': 'N'},
 {'sku': '7702010225123',
  'description': 'FABULOSO LAVAND',
  'total': 6590,
  'tax_code': 'F'},
 {'sku':