### Run PDF Visual Ingestor step by step
This notebook shows examples of how to use the visual ingestor. You will learn how to:
1. Run the nlmatics-modified tika server and view its raw output
2. Parse that raw output with the visual ingestor

Before proceeding, ensure that you have a tika server running:
1. Install the latest version of Java
2. Run `java -jar /jars/tika-server-standard-nlm-modified-2.4.1_v4.jar`
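
Before running the cells that follow, it can help to confirm that something is actually listening on the server port. The sketch below is a minimal check, not part of nlm-ingestor: the helper name `tika_server_is_up` is hypothetical, and it assumes the server answers `GET /version` as a stock Tika server does. It only tells you that a server is reachable, not that it is the nlmatics-modified build.

```python
import urllib.error
import urllib.request


def tika_server_is_up(endpoint="http://localhost:9998", timeout=5):
    """Return True if GET <endpoint>/version answers with HTTP 200.

    Hypothetical helper for this notebook; assumes the server exposes /version.
    """
    url = endpoint.rstrip("/") + "/version"
    # the server is local, so bypass any proxy configured in the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, timeout, or an HTTP error status
        return False
```

Calling `tika_server_is_up()` with no arguments checks the default local endpoint; pass a different URL if your server runs elsewhere.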

In [10]:
import os
import sys
from tika import parser
from IPython.display import display, HTML
from bs4 import BeautifulSoup

src_dirs = ['../']

for src_dir in src_dirs:
    module_path = os.path.abspath(os.path.join(src_dir))
    if module_path not in sys.path:
        sys.path.append(module_path)


%load_ext autoreload

from nlm_ingestor.ingestor import patterns, line_parser
from nlm_ingestor.ingestor.visual_ingestor import visual_ingestor, table_parser, indent_parser, block_renderer, order_fixer

%autoreload 2

### Make sure your own tika server is running and set its URL here

If the nlm-modified tika server is not running, the tika Python library starts a default server, which lacks the nlmatics modifications that the rest of this notebook relies on.

Before proceeding, ensure that the nlm-modified tika server is running and, if you plan to use OCR, that tesseract is installed.

In [12]:
os.environ["TIKA_SERVER_ENDPOINT"] = "http://localhost:9998"

The following code:
1. takes a PDF from the local file system,
2. uses the nlmatics-modified tika server to parse it, and
3. displays the result in the browser

In [13]:
doc_loc = '/Users/ambikasukla/projects/data/sample-8k.pdf'
# doc_loc = '/Users/ambikasukla/Downloads/scansmpl.pdf'

# OCR is off by default because it is slow; set needs_ocr = True to parse scanned PDFs
needs_ocr = False
timeout = 3000
if not needs_ocr:
    headers = {
        "X-Tika-OCRskipOcr": "true",
    }
else:
    print("ocr")
    headers = {
        "X-Tika-OCRskipOcr": "false",
        "X-Tika-OCRoutputType": "hocr",
        "X-Tika-OCRocrEngineMode": "3",
        "X-Tika-PDFExtractInlineImages": "false",
        "X-Tika-Timeout-Millis": str(100*timeout),
        "X-Tika-OCRtimeoutSeconds": str(timeout),
    }
# the request is the same in both cases; only the headers differ
parsed = parser.from_file(doc_loc, xmlContent=True, requestOptions={'headers': headers, 'timeout': timeout})

html_str = parsed["content"]

# optionally you can store the raw html locally and view it in the browser
# html_loc = '/mnt/c/Users/ambik/Downloads/orig-html.html'
# with open(html_loc, "w") as f:
#     f.write(html_str)
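
If you want to keep the raw output around, the optional save step above can be wrapped in a small function. This is only a sketch: `save_html` is a hypothetical helper, not something nlm-ingestor provides, and it assumes you want UTF-8 output and are happy for missing parent directories to be created.

```python
from pathlib import Path


def save_html(html_str, out_path):
    """Write html_str to out_path as UTF-8 and return the absolute path.

    Hypothetical helper; creates parent directories if they do not exist.
    """
    path = Path(out_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html_str, encoding="utf-8")
    return path.resolve()


# Example use inside the notebook (the path is a placeholder):
# saved = save_html(html_str, "~/Downloads/orig-html.html")
# import webbrowser; webbrowser.open(saved.as_uri())
```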

In [14]:
display(HTML(html_str))
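
Rendering the full raw output can be unwieldy for long documents, so a quick structural summary is sometimes easier to read. The sketch below uses only the standard library and rests on two assumptions: pages are `div` elements with class `page` (as the later BeautifulSoup cell also expects), and text sits in `p` elements. `PageCounter` and `summarize_tika_xhtml` are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class PageCounter(HTMLParser):
    """Count <div class="page"> and <p> start tags in an (X)HTML string."""

    def __init__(self):
        super().__init__()
        self.pages = 0
        self.paragraphs = 0

    def handle_starttag(self, tag, attrs):
        # class may hold several space-separated names; match "page" exactly
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "page" in classes:
            self.pages += 1
        elif tag == "p":
            self.paragraphs += 1


def summarize_tika_xhtml(html_str):
    """Return page and paragraph counts for the raw tika output."""
    counter = PageCounter()
    counter.feed(html_str)
    counter.close()
    return {"pages": counter.pages, "paragraphs": counter.paragraphs}


# Example use inside the notebook:
# summarize_tika_xhtml(html_str)
```

A page count of zero here would suggest the response did not come from a server that emits per-page `div` elements, which is worth checking before moving on to the visual ingestor.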

The following code:
- takes the HTML returned by the nlmatics-modified tika server and parses it with BeautifulSoup
- passes the resulting page elements to visual_ingestor, which converts them into the format that you see in llmsherpa
- displays the resulting HTML

In [15]:
soup = BeautifulSoup(str(parsed), "html.parser")
pages = soup.find_all("div", class_='page')
ocr_page = soup.find_all('div', class_="ocr_page", id='page_1')

block_renderer.HTML_DEBUG = True
visual_ingestor.LINE_DEBUG = False
indent_parser.LEVEL_DEBUG = False
indent_parser.NO_INDENT = False
visual_ingestor.MIXED_FONT_DEBUG = False
table_parser.TABLE_DEBUG = False
visual_ingestor.BLOCK_DEBUG = False
table_parser.TABLE_COL_DEBUG = False
table_parser.TABLE_HG_DEBUG = False
table_parser.TABLE_BOUNDS_DEBUG = False
visual_ingestor.HF_DEBUG = False
order_fixer.REORDER_DEBUG = False
visual_ingestor.MERGE_DEBUG = False
table_parser.TABLE_2_COL_DEBUG = False

# you can also let the visual ingestor only parse select pages
# but note that this will cause the document statistics to be incorrect
# and behaviour may not be consistent with what you see when you parse more pages or the entire document

# parsed_doc = visual_ingestor.Doc(pages[23:27], [])
parsed_doc = visual_ingestor.Doc(pages, [])

# optionally you can save the html to the file system and view it in a browser
# html_loc = '/mnt/c/Users/ambik/Downloads/orig-small-html.html'
# parsed_doc.html_str = parsed_doc.html_str.replace("\xa0", " ")
# with open(html_loc, "w") as f:
#     f.write(parsed_doc.html_str)
display(HTML(parsed_doc.html_str))