Once you have submitted a PDF to one of Microsoft's OCR services (the Read API or Form Recognizer) and saved the response, you will have a JSON file. This file walks you through the structure of that output. 

In [2]:
import json

In [6]:
with open("test_gazette_read.txt") as f: 
    data = json.load(f)

## READ API

The Gazette used below is Vol. CXXI-No. 176, dated 27th December 2019. You can find the original on the Connected Africa website [here](https://data.connectedafrica.net/entities/410092.40e70c195e3249476cc0da8eb7854ac112236e7e). For brevity, we have only included the first two pages of the Gazette. 

The sections below walk through the nested data structure layer by layer. 

### 1. Outer structure

* `status`: If you got a successful output, this will be "succeeded". If not, this will include an error message describing why the Read API failed on this input. (This is usually due to a link that does not point to PDF data or a PDF that is not sized correctly.) 
* `createdDateTime`: Date and time when the call was submitted.
* `lastUpdatedDateTime`: Date and time when the data finished processing and was retrieved from Microsoft's server. 
* `analyzeResult`: This is what you care about.

In [8]:
data.keys()

dict_keys(['status', 'createdDateTime', 'lastUpdatedDateTime', 'analyzeResult'])
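Before digging into `analyzeResult`, it can be worth checking `status` first. The following is a minimal sketch, assuming only the four top-level keys listed above; the helper name `get_analyze_result` and the sample dictionary are hypothetical and are not part of the Read API.

```python
def get_analyze_result(response):
    """Return response['analyzeResult'] if the call succeeded, else raise."""
    status = response.get("status")
    if status != "succeeded":
        raise ValueError(f"OCR call did not succeed (status: {status!r})")
    return response["analyzeResult"]


# Made-up sample that mirrors the top-level keys shown above.
sample_response = {
    "status": "succeeded",
    "createdDateTime": "2020-08-01T00:00:00Z",
    "lastUpdatedDateTime": "2020-08-01T00:00:05Z",
    "analyzeResult": {"version": "3.0.0", "readResults": []},
}

analyze_result = get_analyze_result(sample_response)
print(analyze_result["version"])
```

With the real file, `get_analyze_result(data)` would play the same role as `data['analyzeResult']`, but fail loudly on an unsuccessful call.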

### 2. analyzeResult

`analyzeResult` contains two key/value pairs: 
* The version of the Read API that was used to generate the output (this may become relevant as updates are released which improve the OCR). 
* The actual results of the analysis

In [10]:
data['analyzeResult'].keys()

dict_keys(['version', 'readResults'])

As of August 2020, we used the Read API version 3.0.0: 

In [11]:
data['analyzeResult']['version']

'3.0.0'

### 3. readResults: actual text of the PDF!

`readResults` will give you a list of all of the pages in the PDF. (Each element represents a single page.) 

In [14]:
pages_list = data['analyzeResult']['readResults']
print("readResults type: " + str(type(pages_list)))
print("number of pages: " + str(len(pages_list)))

readResults type: <class 'list'>
number of pages: 2


### 4. A single page

A single page is represented as another dictionary: 
* `page`: the page number (starts at 1, unlike the list index, which starts at 0)
* `angle`: the angle of the page -- Read API guesses and corrects rotated pages
* `width`: width of the page
* `height`: height of the page
* `unit`: unit of width/height measurement (default is inch) 
* `lines`: a list of all "lines" on the page

In [21]:
second_page = pages_list[1]
second_page.keys()

dict_keys(['page', 'angle', 'width', 'height', 'unit', 'lines'])

In [24]:
print("Page index: " + str(second_page['page']))
print("Height: " + str(second_page['height']))

Page index: 2
Height: 11.6633
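To get an overview of every page at once, one option is to collect the page-level fields into a small summary. This is a sketch under the assumption that each page has the keys listed above; `summarize_pages` is a hypothetical helper and the sample pages are made up (only the key names come from the real output).

```python
def summarize_pages(read_results):
    """Return one (page number, width, height, unit, line count) tuple per page."""
    return [
        (p["page"], p["width"], p["height"], p["unit"], len(p["lines"]))
        for p in read_results
    ]


# Made-up pages with the same keys as the real output.
sample_pages = [
    {"page": 1, "angle": 0, "width": 8.26, "height": 11.66,
     "unit": "inch", "lines": []},
    {"page": 2, "angle": 0.2, "width": 8.26, "height": 11.66,
     "unit": "inch",
     "lines": [{"boundingBox": [1.0, 0.5, 1.3, 0.5, 1.3, 0.6, 1.0, 0.6],
                "text": "4942", "words": []}]},
]

for row in summarize_pages(sample_pages):
    print(row)
```

Note that the `page` value is one higher than the position in the list: `pages_list[1]` is page 2.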


### 5. Lines on the page

Each element of the `lines` list is a dictionary with the following information:
* `boundingBox`: a flat list of eight numbers, the x-y coordinates of each of the four corners of the box where the line is located on the page. 
* `text`: the text on that line
* `words`: a list of all of the words on that line, with additional information for each of those words.

Note that a single "line" does not read across the entire page. When the Read API encounters whitespace between two blocks of text, it will split those blocks into two separate lines. This is crucial; this means that the Read API identifies separate columns. 
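The bounding box is what makes column detection possible. Below is a minimal sketch of how one might unpack a box and guess a line's column. It assumes a two-column page split at the horizontal midpoint, which is a simplification; the helper names and the page width of 8.26 are hypothetical. The two bounding boxes are copied from the sample output in this file, where the corners appear to run top-left, top-right, bottom-right, bottom-left.

```python
def corners(bounding_box):
    """Pair up the flat 8-number list as [(x, y), ...] for the four corners."""
    return list(zip(bounding_box[0::2], bounding_box[1::2]))


def extent(bounding_box):
    """Return (left, top, right, bottom) of the axis-aligned box around the corners."""
    xs = bounding_box[0::2]
    ys = bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def column_of(line, page_width):
    """Guess 'left' or 'right' from where the line's horizontal centre falls.

    Assumes a two-column layout split at the page midpoint.
    """
    left, _, right, _ = extent(line["boundingBox"])
    return "left" if (left + right) / 2 < page_width / 2 else "right"


notice = {"boundingBox": [4.3842, 5.7733, 5.7111, 5.7633,
                          5.7145, 5.8867, 4.3842, 5.8933],
          "text": "GAZETTE NOTICE NO. 12160"}
name = {"boundingBox": [1.0836, 5.97, 1.9737, 5.9667,
                        1.9771, 6.0767, 1.0836, 6.0833],
        "text": "Paul Obiero Olang*"}

PAGE_WIDTH = 8.26  # assumed value for illustration, in inches
for line in (notice, name):
    print(line["text"], "->", column_of(line, PAGE_WIDTH))
```

Lines that straddle the midpoint, such as a centred page title, would need separate handling in a real pipeline.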

### 6. Words in the line

Each element of the `words` list is (another) dictionary with the following information: 
* `boundingBox`: the bounding box for that particular word
* `text`: the text of that word
* `confidence`: the confidence rating (0-1) that this word was read correctly. 

We found that approximately 0.1% of words in all of the Gazettes we processed had a confidence rating below 0.7, and 0.006% had a confidence rating below 0.5. The Read API calculates confidence in part based on an English lexicon, and most of the words with relatively lower confidence ratings were names of people, organizations, and places. 
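A quick way to inspect low-confidence words yourself is to walk the pages, lines, and words and keep those below a threshold. This is a sketch, not the code we used to compute the figures above; `low_confidence_words` is a hypothetical helper, and the sample page and its confidence values are made up.

```python
def low_confidence_words(read_results, threshold=0.7):
    """Yield (page number, word text, confidence) for words below the threshold."""
    for page in read_results:
        for line in page["lines"]:
            for word in line["words"]:
                if word["confidence"] < threshold:
                    yield page["page"], word["text"], word["confidence"]


# Made-up page: one confident word and one doubtful word.
sample_pages = [
    {"page": 1, "lines": [
        {"text": "GAZETTE Exampleville", "words": [
            {"text": "GAZETTE", "confidence": 0.984},
            {"text": "Exampleville", "confidence": 0.42},
        ]},
    ]},
]

for page_number, text, confidence in low_confidence_words(sample_pages):
    print(page_number, text, confidence)
```

With the real file, `low_confidence_words(pages_list)` would list the words worth checking by hand.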

**Here are the first three lines of the second page**. We can see here the page number and the title of the page ("THE KENYA GAZETTE" + the date of publication). 

In [27]:
second_page['lines'][:3]

[{'boundingBox': [1.0502,
   0.4667,
   1.3403,
   0.47,
   1.3403,
   0.5933,
   1.0469,
   0.5867],
  'text': '4942',
  'words': [{'boundingBox': [1.0469,
     0.4667,
     1.3347,
     0.4698,
     1.3326,
     0.5931,
     1.0469,
     0.5882],
    'text': '4942',
    'confidence': 0.983}]},
 {'boundingBox': [3.4907, 0.4633, 4.9643, 0.4633, 4.9643, 0.6, 3.494, 0.6033],
  'text': 'THE KENYA GAZETTE',
  'words': [{'boundingBox': [3.4939,
     0.4641,
     3.755,
     0.467,
     3.7555,
     0.6037,
     3.495,
     0.6062],
    'text': 'THE',
    'confidence': 0.983},
   {'boundingBox': [3.7829,
     0.4673,
     4.2771,
     0.4691,
     4.2765,
     0.601,
     3.7834,
     0.6035],
    'text': 'KENYA',
    'confidence': 0.986},
   {'boundingBox': [4.3051,
     0.4691,
     4.9577,
     0.4642,
     4.9558,
     0.6022,
     4.3045,
     0.6009],
    'text': 'GAZETTE',
    'confidence': 0.985}]},
 {'boundingBox': [6.1746,
   0.4667,
   7.3982,
   0.4567,
   7.4015,
   0.6033,
   6

And here are a few lines from the middle of the page. They contain text from Gazette notices, which we can divide and knit together using the geometric information in the bounding boxes: 

In [100]:
middle = len(second_page['lines'])//2
second_page['lines'][middle:middle + 4]

[{'boundingBox': [4.3842,
   5.7733,
   5.7111,
   5.7633,
   5.7145,
   5.8867,
   4.3842,
   5.8933],
  'text': 'GAZETTE NOTICE NO. 12160',
  'words': [{'boundingBox': [4.4006,
     5.7766,
     4.8182,
     5.7808,
     4.811,
     5.893,
     4.3932,
     5.8921],
    'text': 'GAZETTE',
    'confidence': 0.984},
   {'boundingBox': [4.8409,
     5.7808,
     5.1902,
     5.7787,
     5.1833,
     5.8923,
     4.8338,
     5.893],
    'text': 'NOTICE',
    'confidence': 0.983},
   {'boundingBox': [5.213,
     5.7783,
     5.3952,
     5.7751,
     5.3884,
     5.8913,
     5.2061,
     5.8922],
    'text': 'NO.',
    'confidence': 0.983},
   {'boundingBox': [5.418,
     5.7746,
     5.714,
     5.7663,
     5.7073,
     5.8888,
     5.4112,
     5.8911],
    'text': '12160',
    'confidence': 0.986}]},
 {'boundingBox': [1.0836,
   5.97,
   1.9737,
   5.9667,
   1.9771,
   6.0767,
   1.0836,
   6.0833],
  'text': 'Paul Obiero Olang*',
  'words': [{'boundingBox': [1.0919,
     5.9705,


Feel free to explore the Gazette a bit more, and see the walkthrough notebooks for a description of how we knit text together and divide sections from each other. 
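As a rough illustration of what "knitting" can mean, the sketch below sorts lines into reading order for a two-column page: left column top to bottom, then right column. It is not the approach from the walkthrough notebooks, only a simplified stand-in. `knit_columns` and the page width of 8.26 are hypothetical, and the first sample line is made up; the other two lines are taken from the output above.

```python
def knit_columns(lines, page_width):
    """Join line texts in reading order: left column top to bottom, then right."""
    def reading_key(line):
        xs = line["boundingBox"][0::2]
        ys = line["boundingBox"][1::2]
        centre_x = (min(xs) + max(xs)) / 2
        column = 0 if centre_x < page_width / 2 else 1
        # Sort by column first, then by the top edge of the line.
        return column, min(ys)

    return "\n".join(line["text"] for line in sorted(lines, key=reading_key))


sample_lines = [
    {"boundingBox": [4.3842, 5.7733, 5.7111, 5.7633,
                     5.7145, 5.8867, 4.3842, 5.8933],
     "text": "GAZETTE NOTICE NO. 12160"},
    {"boundingBox": [1.0836, 5.97, 1.9737, 5.9667,
                     1.9771, 6.0767, 1.0836, 6.0833],
     "text": "Paul Obiero Olang*"},
    # Made-up line placed higher up in the left column.
    {"boundingBox": [1.08, 5.5, 2.5, 5.5, 2.5, 5.6, 1.08, 5.6],
     "text": "(made-up left-column line)"},
]

print(knit_columns(sample_lines, page_width=8.26))
```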

## FORM RECOGNIZER: LAYOUT API

The Layout API is more expensive than the Read API, but it has the significant advantage of being able to read tables. 

Below is a walkthrough of the Form Recognizer API with the Gazette Vol. CXXI--No. 55, dated May 3rd, 2019, special issue. For brevity, we will only be looking at page 12, which contains a table. A copy is available [here](https://data.connectedafrica.net/entities/240946.0a895cc8dc72d87e618c8565f64c8a41c6d015fa#page=12). 

In [37]:
with open("test_formrec_layout_api_output.txt") as f:
    data = json.load(f)

Note that the Layout API output contains all of the information that the Read API output contains, which is described above. However, there is **one additional key/value pair in the "analyzeResult" dictionary**: 

In [66]:
data['analyzeResult'].keys()

dict_keys(['version', 'readResults', 'pageResults'])

This `pageResults` element is the information that the Form Recognizer API provides that the Read API doesn't. 

### 1. Information in readResults: same as that in the Read API.  

In [53]:
page_12 = data['analyzeResult']['readResults'][0]['lines']

(Again, the JSON that we are drawing from only contains the output of the 12th page.) 

When we look at what is contained in `readResults`, we get the same information that the Read API would have given us. 

Here are a few arbitrary lines from the page. They happen to correspond to cells in the second row of the table (21. Tana Water Services Board, 6,600,000.00). Each one is associated with a `boundingBox`, the geometric coordinates of where it is located on the page.

In [55]:
page_12[7:10]

[{'boundingBox': [1.6505,
   1.0029,
   1.7783,
   1.0029,
   1.7783,
   1.0783,
   1.6505,
   1.0783],
  'text': '21.',
  'words': [{'boundingBox': [1.6505,
     1.0029,
     1.7783,
     1.0029,
     1.7783,
     1.0783,
     1.6505,
     1.0783],
    'text': '21.',
    'confidence': 1}]},
 {'boundingBox': [2.0724,
   1.0021,
   3.294,
   1.0021,
   3.294,
   1.0787,
   2.0724,
   1.0787],
  'text': 'Tana Water Services Board',
  'words': [{'boundingBox': [2.0724,
     1.0044,
     2.292,
     1.0044,
     2.292,
     1.0782,
     2.0724,
     1.0782],
    'text': 'Tana',
    'confidence': 1},
   {'boundingBox': [2.3211,
     1.0044,
     2.5918,
     1.0044,
     2.5918,
     1.0783,
     2.3211,
     1.0783],
    'text': 'Water',
    'confidence': 1},
   {'boundingBox': [2.6244,
     1.0021,
     2.9913,
     1.0021,
     2.9913,
     1.0787,
     2.6244,
     1.0787],
    'text': 'Services',
    'confidence': 1},
   {'boundingBox': [3.0259,
     1.0021,
     3.294,
     1.0021,
  

### 2. The `pageResults` list: an additional representation of each page, with a list of the tables on that page

In [68]:
page_results = data['analyzeResult']['pageResults']

Like `readResults`, `pageResults` is a list, in which each element is a dictionary representing a single page that contains a table. 

In [69]:
type(page_results)

list

In [70]:
page_12_results = page_results[0]

The dictionary that represents a page contains two keys: 
* `page`: the page number (indexing begins at 1)
* `tables`: a list of the tables on the page

In [80]:
print("Keys for the page: " + str(page_12_results.keys()))
print("Page number: " + str(page_12_results['page']))
print("Number of tables on the page: " + str(len(page_12_results['tables'])))

Keys for the page: dict_keys(['page', 'tables'])
Page number: 1
Number of tables on the page: 1


### 3. A table on a page

A table is represented as a dictionary with the following keys: 
* `rows`: the number of rows in the table
* `columns`: the number of columns in the table
* `cells`: information about the location and contents of each cell in the table

The API reports 78 rows for this table (see the output below). If you look at the original PDF, however, you will notice that the table actually has 81 rows (numbered 20-100). This suggests that the Form Recognizer API is not perfectly dividing cells. 

In [81]:
table = page_12_results['tables'][0]

In [87]:
print("Keys for the table: " + str(table.keys()))
print("Number of rows: " + str(table['rows']))
print("Number of columns: " + str(table['columns']))

Keys for the table: dict_keys(['rows', 'columns', 'cells'])
Number of rows: 78
Number of columns: 3


### 4. Cells in a table

Each cell is a dictionary with: 
* `rowIndex`: index of the row (starts at 0)
* `columnIndex`: index of the column (starts at 0)
* `text`: text in the cell
* `boundingBox`: bounding box (list of x-y coordinates of each of the four corners for that cell)
* `elements`: a list of paths to the matching words in the `readResults` list, each of the form `#/readResults/[page index]/lines/[line index]/words/[word index]`. 

Here is what an arbitrary cell in a table looks like: 

In [92]:
table['cells'][10]

{'rowIndex': 3,
 'columnIndex': 1,
 'text': 'Nairobi Centre for International Arbitration',
 'boundingBox': [2.0356,
  1.2322,
  5.2356,
  1.2322,
  5.2356,
  1.3589,
  2.0356,
  1.3589],
 'elements': ['#/readResults/0/lines/14/words/0',
  '#/readResults/0/lines/14/words/1',
  '#/readResults/0/lines/14/words/2',
  '#/readResults/0/lines/14/words/3',
  '#/readResults/0/lines/14/words/4']}

If we wanted to find the elements in the table as they are represented by the `readResults` portion, we could follow the path outlined in `elements` to get the text: 

In [96]:
data['analyzeResult']['readResults'][0]['lines'][14]['words'][0]

{'boundingBox': [2.0719, 1.2588, 2.407, 1.2588, 2.407, 1.335, 2.0719, 1.335],
 'text': 'Nairobi',
 'confidence': 1}

In [97]:
data['analyzeResult']['readResults'][0]['lines'][14]['words'][1]

{'boundingBox': [2.4409,
  1.2595,
  2.7314,
  1.2595,
  2.7314,
  1.3353,
  2.4409,
  1.3353],
 'text': 'Centre',
 'confidence': 1}
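Rather than typing each path out by hand, the lookup can be automated. The sketch below assumes every entry in `elements` has the `#/readResults/.../lines/.../words/...` form shown above; `resolve_element` and `cell_words` are hypothetical helpers, and the sample data is a cut-down, made-up stand-in (so the line index is 0 rather than 14).

```python
def resolve_element(analyze_result, pointer):
    """Follow a '#/readResults/0/lines/14/words/0' style path into analyze_result."""
    parts = pointer.split("/")
    if parts[0] != "#":
        raise ValueError(f"unexpected path format: {pointer!r}")
    node = analyze_result
    for part in parts[1:]:
        # Numeric parts index into lists; everything else is a dictionary key.
        node = node[int(part)] if part.isdigit() else node[part]
    return node


def cell_words(analyze_result, cell):
    """Return the text of each word that a table cell points to."""
    return [resolve_element(analyze_result, p)["text"] for p in cell["elements"]]


# Cut-down stand-in for data['analyzeResult'].
sample_result = {
    "readResults": [
        {"lines": [
            {"text": "Nairobi Centre",
             "words": [{"text": "Nairobi", "confidence": 1},
                       {"text": "Centre", "confidence": 1}]},
        ]},
    ],
}
sample_cell = {"elements": ["#/readResults/0/lines/0/words/0",
                            "#/readResults/0/lines/0/words/1"]}

print(cell_words(sample_result, sample_cell))
```

With the real file, `cell_words(data['analyzeResult'], table['cells'][10])` would be the equivalent call.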

We did not develop an approach for knitting tables together and including them in the text of a page, unfortunately. However, we believe that this would be relatively simple to do, given the information provided in the `pageResults` dictionary and the ease with which one can cross-reference its elements with the `readResults` output. 
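As a starting point, here is one way the cells could be laid out as a grid using only `rows`, `columns`, `rowIndex`, `columnIndex`, and `text`. This is an untested-on-real-data sketch rather than something we built: `table_to_grid` is a hypothetical helper, the sample table is made up from cell texts seen above (with invented row indices), and the sketch ignores any cell that might span several rows or columns.

```python
def table_to_grid(table):
    """Lay the cells out as a rows x columns list of lists of strings."""
    grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid


# Made-up two-row table; positions without a cell stay as empty strings.
sample_table = {
    "rows": 2,
    "columns": 3,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "text": "21."},
        {"rowIndex": 0, "columnIndex": 1, "text": "Tana Water Services Board"},
        {"rowIndex": 0, "columnIndex": 2, "text": "6,600,000.00"},
        {"rowIndex": 1, "columnIndex": 1,
         "text": "Nairobi Centre for International Arbitration"},
    ],
}

for row in table_to_grid(sample_table):
    print(" | ".join(row))
```

From a grid like this, the rows could be written out as CSV or placed into the page text using the cells' bounding boxes.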