### Overall Process

If you *can* use the Form Recognizer Layout API for all of the Gazettes, that is preferable. 

If you can't use the Form Recognizer Layout API for all of the Gazettes: 
* Get a list of all gazette pages that are likely to contain tables, using the column number estimator.<sup>1</sup> If the estimate for a page is `None` or more than 2 columns, the page likely contains a table. 
* Get the full PDF of each of those gazettes.
* Filter for the flagged page(s) (remember that PDF reader objects index pages from zero). 
* Send the PDF data for the filtered pages to the Form Recognizer API.
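
The page-selection rule in the first step can be sketched on its own, independent of the gazette files. This is a minimal illustration, not the project's implementation: `likely_table_pages` is a hypothetical helper name, and the input list stands in for the per-page estimates that the column number estimator would return.

```python
from typing import List, Optional


def likely_table_pages(column_estimates: List[Optional[int]]) -> List[int]:
    """Return zero-based indices of pages whose estimate suggests a table.

    A page is flagged when the estimate is None (no clear column layout)
    or greater than 2 (more columns than a normal gazette page).
    """
    return [
        i
        for i, num in enumerate(column_estimates)
        if num is None or num > 2
    ]


# Made-up estimates for a five-page gazette (illustrative values only).
estimates = [2, None, 2, 3, 1]
print(likely_table_pages(estimates))
```

The returned indices are zero-based, so they can be used directly with a PDF reader that also indexes pages from zero.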

<sup>1</sup>There are limitations with the column estimation strategy. In particular, if a page contains one small table (e.g., 1-2 rows) and is otherwise a two-column page, the column estimator will not identify it as containing tables. 

A possible improvement would be to filter pages for Notices of interest that are anticipated to contain tables. For example, a project interested in the Land Act could find pages containing a Land Act announcement. 
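
One way to sketch that improvement is a plain text search over each page. This assumes the saved JSON has the structure used elsewhere in this notebook (`readResults` pages, each with a `lines` list) and that each line carries its string under a `text` key. `pages_mentioning` is a hypothetical helper, and a real filter would probably need something more robust than substring matching.

```python
from typing import Any, Dict, List


def pages_mentioning(read_results: List[Dict[str, Any]], phrase: str) -> List[int]:
    """Return zero-based indices of pages whose text contains `phrase`.

    Lines are joined with spaces before matching, so a phrase that is
    split across two consecutive lines is still found.
    """
    needle = phrase.lower()
    hits = []
    for i, page in enumerate(read_results):
        page_text = " ".join(line.get("text", "") for line in page.get("lines", []))
        if needle in page_text.lower():
            hits.append(i)
    return hits


# Toy pages standing in for data_json['analyzeResult']['readResults'].
toy_pages = [
    {"lines": [{"text": "THE KENYA GAZETTE"}]},
    {"lines": [{"text": "THE LAND"}, {"text": "ACT"}]},
    {"lines": [{"text": "Probate and Administration"}]},
]
print(pages_mentioning(toy_pages, "Land Act"))
```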

Once you have the tables: 
* Knit the tables together using the information provided: 
    + If `resp_json` is the JSON form of the response object, then `resp_json['analyzeResult']['pageResults'][0]['tables']` gives the tables on page 0.
    + For each piece of table text, the response gives its `rowIndex` and `columnIndex`, its bounding box, and the text itself.
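
As a rough sketch of what knitting could look like, the code below turns one table into a grid of strings and then joins consecutive grids that have the same number of columns. Several things here are assumptions rather than confirmed details: that each table stores its entries in a `cells` list, that every cell has `rowIndex`, `columnIndex`, and `text`, and that matching column counts is a good enough signal that a table continues across pages. The function names are hypothetical.

```python
from typing import Any, Dict, List

Grid = List[List[str]]


def table_to_grid(table: Dict[str, Any]) -> Grid:
    """Place each cell's text at [rowIndex][columnIndex]; gaps stay empty."""
    cells = table.get("cells", [])  # assumed key name
    n_rows = max((c["rowIndex"] for c in cells), default=-1) + 1
    n_cols = max((c["columnIndex"] for c in cells), default=-1) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["rowIndex"]][c["columnIndex"]] = c["text"]
    return grid


def knit_grids(grids: List[Grid]) -> List[Grid]:
    """Naive heuristic: append a grid to the previous one if widths match."""
    knitted: List[Grid] = []
    for grid in grids:
        same_width = (
            bool(knitted) and bool(knitted[-1]) and bool(grid)
            and len(knitted[-1][0]) == len(grid[0])
        )
        if same_width:
            knitted[-1].extend(row[:] for row in grid)
        else:
            knitted.append([row[:] for row in grid])
    return knitted


# Toy table in the assumed shape of one entry of ...['pageResults'][0]['tables'].
toy_table = {
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "text": "Name"},
        {"rowIndex": 0, "columnIndex": 1, "text": "Plot No."},
        {"rowIndex": 1, "columnIndex": 0, "text": "A. Example"},
    ]
}
print(table_to_grid(toy_table))
```

The bounding boxes are ignored here. They would likely be needed to decide whether two tables on consecutive pages really belong together.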
    
Unfortunately, we did not have time to implement this process fully. We hope that the code below, together with the explanation of the Form Recognizer Layout API output in the `additional_walkthroughs` folder, will give someone else a straightforward starting point for implementing it. 

In [2]:
import os
import json
import requests

from helpers import json_extraction as je
from helpers import write_urls as wu 

ROUTETOROOTDIR = '/home/dssg-cfa/notebooks/dssg-cfa-public/'
IMPORTSCRIPTSDIR = ROUTETOROOTDIR + "util/py_files"
os.chdir(IMPORTSCRIPTSDIR)
import orderingText

ke_gazettes = "/home/dssg-cfa/ke-gazettes/"
filenames = os.listdir(ke_gazettes)

In [3]:
# Pseudocode to build off of:

likely_tables = []
for fn in filenames[:5]: 
    with open(ke_gazettes + fn) as f:
        data_json = json.load(f)
    content = data_json['analyzeResult']['readResults']
    pg_lst = []
    for i, page in enumerate(content):
        num = orderingText.getNumCols(page['lines'])
        # an estimate of None or more than 2 columns suggests a table
        if num is None or num > 2:
            if i == 0 and len(content) == 1: 
                continue # skip table of contents 
            pg_lst.append(i)
            
    # POSSIBLE PSEUDOCODE: 
    # access source database metadata, using our map from the DSSG filenames
    # use this metadata to access a URL  
    # call form recognizer API on the page, passing it the URL and the pg_lst (indices)
        # json_extraction.call_form_rec_layout_api() will be helpful for this
    # append results of form recognizer API to the JSON, with page numbers attached 
    # re-save the JSON under the same filename -- now including Form Recognizer results
    likely_tables.append({fn: pg_lst})
        
likely_tables

[{'gazette-ke-vol-cx-no-100-dated-19-december-2008-special': []},
 {'gazette-ke-vol-cvii-no-34-dated-13-may-2005': [0, 35]},
 {'gazette-ke-vol-cxiv-no-45-dated-23-may-2012-special': []},
 {'gazette-ke-vol-cxviii-no-163-dated-23-december-2016': [1,
   2,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   31]},
 {'gazette-ke-vol-cxix-no-94-dated-10-july-2017-special': [1]}]
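
The last two pseudocode steps (attaching the Form Recognizer results to the JSON with page numbers, then re-saving) might look roughly like the sketch below. It is only an illustration: `attach_layout_results` and the `layoutResults` key are hypothetical names, and the shape of `results_by_page` (one response per original page index) is an assumption about how the API calls would be organized.

```python
import copy
import json
from typing import Any, Dict


def attach_layout_results(
    data_json: Dict[str, Any],
    results_by_page: Dict[int, Any],
    key: str = "layoutResults",  # hypothetical key name, not part of the API
) -> Dict[str, Any]:
    """Return a copy of data_json with layout results stored under `key`.

    Each entry records the zero-based page index it came from, so the
    tables can later be matched back to the pages of the original gazette.
    """
    merged = copy.deepcopy(data_json)
    merged[key] = [
        {"page": page, "result": results_by_page[page]}
        for page in sorted(results_by_page)
    ]
    return merged


# Toy data standing in for a saved gazette JSON and two API responses.
gazette = {"analyzeResult": {"readResults": [{"lines": []}, {"lines": []}]}}
merged = attach_layout_results(gazette, {1: {"tables": []}, 0: {"tables": []}})
print(json.dumps(merged["layoutResults"]))
```

To re-save, the merged dictionary could be written back to the original filename with `json.dump`. Returning a copy rather than mutating the input keeps the loaded data unchanged if a later step fails.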