# Spark OCR usage example
* Load example PDF
* Preview it
* Recognize text

## Add an init script to install a fresh version of Tesseract
Note: If this is the first time the script is added on a Databricks account, restart the cluster afterwards.

In [3]:
from sparkocr.databricks import create_init_script_for_tesseract
create_init_script_for_tesseract()

## Check tesseract installation
Tesseract 4.1.1 is required.

In [5]:
%sh
tesseract -v
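
The version check can also be done from Python, so that a later cell can fail early on a mismatch. The following is a minimal sketch: the helper names are hypothetical, the sample banner text is an assumed example of what `tesseract -v` prints, and `installed_tesseract_version` only works where the `tesseract` binary is on the PATH.

```python
import re
import subprocess

REQUIRED_VERSION = (4, 1, 1)  # version named in this notebook


def parse_tesseract_version(output):
    """Extract (major, minor, patch) from the text printed by `tesseract -v`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no tesseract version found in output")
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version():
    """Run `tesseract -v` and parse its banner (needs tesseract on the PATH)."""
    proc = subprocess.run(["tesseract", "-v"], capture_output=True, text=True)
    # Some builds print the banner to stderr, so both streams are inspected.
    return parse_tesseract_version(proc.stdout + proc.stderr)


# Assumed sample banner, used here so the sketch runs without tesseract installed.
sample = "tesseract 4.1.1\n leptonica-1.79.0\n"
print(parse_tesseract_version(sample) == REQUIRED_VERSION)
```

On the cluster, comparing `installed_tesseract_version()` against `REQUIRED_VERSION` would give the same check against the real binary.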

## Import OCR transformers and utils

In [7]:
from sparkocr.transformers import *
from sparkocr.databricks import display_images, OCR_MODEL_DIR
from pyspark.ml import PipelineModel

## Define OCR transformers and pipeline
* Transform the binary PDF data to the Image schema using the `PdfToImage` transformer. More details about the Image schema are [here](https://nlp.johnsnowlabs.com/docs/en/ocr#image-schema).
* Recognize text using [TesseractOcr](https://nlp.johnsnowlabs.com/docs/en/ocr#tesseractocr) transformer.

In [9]:
def pipeline():
    
    # Transform the PDF document to the struct image format
    pdf_to_image = PdfToImage()
    pdf_to_image.setInputCol("content")
    pdf_to_image.setOutputCol("image")
    pdf_to_image.setResolution(200)

    # Run tesseract OCR
    ocr = TesseractOcr()
    ocr.setInputCol("image")
    ocr.setOutputCol("text")
    ocr.setConfidenceThreshold(65)
    ocr.setTessdata(OCR_MODEL_DIR)
    
    pipeline = PipelineModel(stages=[
        pdf_to_image,
        ocr
    ])
    
    return pipeline
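
The `setResolution(200)` call controls how large each rendered page image is. Assuming the value is in dots per inch (the usual meaning, not confirmed in this notebook) and assuming a US Letter page, the rough pixel dimensions can be estimated as below. The helper name is hypothetical and the sketch only illustrates how image size grows with resolution.

```python
def page_size_in_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter (8.5 x 11 in) is an assumed page size, used for illustration only.
for dpi in (100, 200, 300):
    width, height = page_size_in_pixels(8.5, 11, dpi)
    # The pixel count grows with the square of the resolution.
    print(dpi, width, height, width * height)
```

Higher values give the OCR step more detail to work with, at the cost of larger images to hold in memory.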

## Copy example files from OCR resources to DBFS

In [11]:
import pkg_resources
import shutil, os
ocr_examples = "/dbfs/FileStore/examples"
resources = pkg_resources.resource_filename('sparkocr', 'resources')
if not os.path.exists(ocr_examples):
  shutil.copytree(resources, ocr_examples)
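
The `os.path.exists` guard matters because `shutil.copytree` raises an error by default when the destination directory already exists, so re-running the cell without the guard would fail. A minimal sketch of the same pattern, using throwaway temporary directories in place of the DBFS paths (the helper name and file names are hypothetical):

```python
import os
import shutil
import tempfile


def copy_tree_once(src, dst):
    """Copy src to dst unless dst already exists; return True if a copy was made."""
    if os.path.exists(dst):
        return False
    shutil.copytree(src, dst)
    return True


# Demonstration with temporary directories instead of DBFS paths.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "resources")
    os.makedirs(os.path.join(src, "ocr", "pdfs"))
    with open(os.path.join(src, "ocr", "pdfs", "sample.pdf"), "wb") as f:
        f.write(b"%PDF-1.4")
    dst = os.path.join(tmp, "examples")
    first = copy_tree_once(src, dst)   # destination missing: copy happens
    second = copy_tree_once(src, dst)  # destination present: copy is skipped
    print(first, second)
```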

In [12]:
%fs ls /FileStore/examples/ocr/pdfs

| path | name | size |
|---|---|---|
| dbfs:/FileStore/examples/ocr/pdfs/.DS_Store | .DS_Store | 6148 |
| dbfs:/FileStore/examples/ocr/pdfs/alexandria_multi_page.pdf | alexandria_multi_page.pdf | 70556 |
| dbfs:/FileStore/examples/ocr/pdfs/fonts.pdf | fonts.pdf | 11601 |
| dbfs:/FileStore/examples/ocr/pdfs/multiplepages/ | multiplepages/ | 0 |
| dbfs:/FileStore/examples/ocr/pdfs/rotated/ | rotated/ | 0 |
| dbfs:/FileStore/examples/ocr/pdfs/test_document.pdf | test_document.pdf | 693743 |


## Read PDF document as binary file from DBFS

In [14]:
pdf_example = '/FileStore/examples/ocr/pdfs/test_document.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()
display(pdf_example_df)

path,modificationTime,length,content
dbfs:/FileStore/examples/ocr/pdfs/test_document.pdf,2020-04-01T08:59:25.000+0000,693743,JVBERi0xLjQgCjEgMCBvYmoKPDwKL1BhZ2VzIDIgMCBSCi9UeXBlIC9DYXRhbG9nCj4+CmVuZG9iagoyIDAgb2JqCjw8Ci9UeXBlIC9QYWdlcwovS2lkcyBbIDMgMCBSIDE3IDAgUiBdCi9Db3VudCAyCj4+CmVuZG9iagozIDA= (truncated)
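
The `content` value in the output above appears to be the raw file bytes rendered as base64. As a quick sanity check, decoding its first characters should reveal the PDF header. A minimal sketch with a hypothetical helper name; only the first 12 characters of the displayed string are used:

```python
import base64


def looks_like_pdf(content):
    """Return True if the bytes start with the PDF header marker."""
    return content.startswith(b"%PDF-")


# First 12 characters of the base64 text shown in the `content` column.
prefix = base64.b64decode("JVBERi0xLjQg")
print(prefix, looks_like_pdf(prefix))
```

The same check could be applied to the real `content` bytes before sending files through the pipeline, to skip inputs that are not PDFs.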


## Preview the PDF using the _display_images_ function

In [16]:
display_images(PdfToImage().setOutputCol("image").transform(pdf_example_df), limit=3)

## Run the OCR pipeline

In [18]:
result = pipeline().transform(pdf_example_df).cache()

## Display results

In [20]:
display(result.select("pagenum", "text", "confidence"))

pagenum,text,confidence
0,"Patient Nam Financial Numbe Random Hospital Date of Birth Patient Location H&P | Chief Complaint Arthritis ‘ | Shoriness of breath CHF - CDHQ‘SS'I'\'E hearl failure Chronic kidney discasc History of Present lliness Chroni¢ venous insulficiency Edcma Y . . ‘ Gl bleeding Patient is an Bd-year-old male wilh a past medical history of hypertension, HFpEF las Glaucoma known EF 53%, mild to moderate TR, puimonary hypertension, permanent atrial Goul tibrillation on Eliquis, history of Gl blesd, CK-k48, and anemia who presenls with full wesks vperiension oi ¢eneralized fatigue and fecling unwell. He also notes some shortness of bresth and Peptic ulcer worsening dyspnea wilh minimal exerlion. His major complaints are shoulder and joinl Peripheral ncuropathy pains. diffusely. He also compizins of ""bene pain'. He denics having any fevers or chills. Peripheral vascular diszase ¢ denies having any chest pain, palpitalicns. He dznigs any worse exlremily Pulmonary hyperiension swelling than his baseline. He states ha's been compliant with his medications. Although Tricuspid r'egurqilalion he stales he ran out of his Eliquis & few wesks ago. He denies having any blood in his Historical ) stools or mcicna, although he doas 1ake iron pills and states his stools arc irequently black. mlt,.’,mq data H:s hemeglobin is at baseline. Procedure/Surqgical Histor Twelve-lead EKG showing atrial fibrillation, RBBB, LAFB, PVC. Chest x-ray showing new duodenal resection, duodenojcjunostomy. small right creater than left pleural effusions with mild pulmonary vascular congestion. BNP small bowel enterolomy, removal of foreign increased to 2200, up fram 1900. Tropeain 0.03. Renal funclion &t baselina. IHemaoglobin object and repair oi enterotomy (05/21/2014), colonoscopy (12/10/2013), egd (12/09/2013), H/O endoscopy (07/2013), H/O colenoscopy She normally takes 80 mg of oral Lasix daily. He was given 80 mg of IV Lasix in the ED. {03/2013), pifonidal cys! 
removal at base of He is currently net negative closa to 1 L. He is still en 2 L nasal cannula. spinc (1981), laser eye surgery ior glaucoma. lesions on small intesline closed up. Home Medications  ! S 5 _ ) Home A 10 system review of sysiems was complelad and negative except as documented in HPI. allopurinal 300 mg oral fablet, 360 MG= 1 Physical Exam TAB, PO. Daily . alenolol 25 mg oral 1ablel, 25 MG= 1 TAB, Vitals & WMeasurements PO, Daily T:36.8 °C (Cral) TMIN: 36.8 ""C ({Oral) TMAX: 37.0 °C (Cral} HR: 54 RR: 17 (:f‘l!Oﬂ]lH“dOHE 25 mg oral tablel, 25 MG= BP: 14063 WT: iC0.3KG 1 TABR. PO, M/W/F Pulse Ox: 100 % Oxygen: 2 L/min via Nasal Cannula Combigan 0.2%-0.5% ophthalraic GENERAL: no acule.distress solution, 1 DROP, Both Eyes, Q12H HEAD: normecephalic Eliquis 5 mg oral lablel, 3 KMG= 1 TAB. EYESTEARS/NOSE/MTHROAT: pupils are equal. normal oropharynx PO, BID NECK: normal inspection l2rrous sullate 325 mg (65 mg elemental RESPIRATORY: no respiralory distress, no rales on my exam iron) oral tablet, 325 MG= 1 TAB, PO, CARDIOVASCULAR: irregular. brady. no murmurs, rubs or gallcps Daily ABDOIEN: soft, non-tender Lasix 80 mg oral 1ablet. 80 MG= 1 TAB. EXTREMITIES: Bilzteral chronic vEnous siasis chanoes PO, BID NEUROLOGIC: alert and orieniec x 3. no gross motar ¢r seasary daficils omeprazolc 20 mg oral delayed relcasc Assessment/Plan 5 capsufeéggs.‘dG:“‘- E,ip' IP%L’?BO Acute on chronic diastolic CHF (congestive heart failure) c(r)c:::.‘cx RS EraliEELC 100 Acute on chronic diastolic heari failure exacerbation. Small pleural effusions dilaterally with mild pulmonary vascular congsslion on chest x-ray, slight €levalion in BNEF. Welll conlinue 1 more day of IV diuresis with 80 mg IV Lagix. Me may have had a viral infection ) which precipitated this. We'll add Tylencl ior his joint pains. Continue atenclol and Daily B chlorthalidene. serlraline 50 Mg oral tablel, 75 MG= 1.5 TAB. PQ. 
Daily potassium chlorida 20 mEq oral 1ablet, extended releass, 20 MEQ= 1 TAB, PO, Printed : 7/17/2017 13:01 EDT Page 16 oi 42 PowerChart",77.49104619026184
1,"Palient Name Financial Number Date of Birth Patient Location Random Hospital Tylenol for pain. Palienl also lakes Percoce! al home, will add (his ¢n. Chronic Kidney disease At baselne. Monitor while disresing. Hypertension Blood pressures within tolerabls ranges. Pulmonary hypertension Tricuspid regurgitation Iild-to-moderaic on echocardiogram last year Attending physician note-the patient was interviewed and examined. The appropriatc mlormation in power charl was reviewed. Thz patienl was discussed wilh Dr, Parsad. Paticat may have & mild degree of heart failure. He and his wife were morc concerned 'with his peripherzl edema. He has underlying renal insufliciency as well. We'll Iry 10 diuress him 10 his ""dry"" weight. We will then try to adjust his medications to XKcep him within a natrow range of [hat weighl. We will stop his atenolol this point since he s relalively bradycardic and obscrve his heart rate on the cardiac monitor. He will progress with his care and aclivily as tolerated. Printed : 7/17;2017 13:01 EDT Page 17 oi 42  Vitamin D2 50,600 intl units (1.25 mg) oral capsule, 1 TAB, PO, Weexkly-Tue Allergies shelliish (coui) sulfa drug (maculogapulear rash) Social Histary Ever Smoked Tobacco: Former Smoker Alcoho! use - frequency: Nong Drug use: Never Lab Results 102 07/16/17 05:30 lo 07/16/17 05:30 \L10.1/ L34 BMP GLU NA K CL TOTAL CO2 BUN CRT ANION GAP CA CBC with diff WBC HGB HCT RBC MCV MCH MCHC RDW MPV 07/16/17 05:30 102 mg/dL 143 MMOL/L 3.6 MMOL/L 98 MMOL/L 40 MMOL/L 26 mg/dL 1.23 mg/dL S 7.9 mg/dL 07/16/17 05:30 3.4/nl 10.1 G/DL 32.4 % 3.41 /PL 95.0 FL 29.6 pg 31.2 % 15.9 % 10.7 FL PowarChart",72.95114339192709
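
The per-page `confidence` values can be used to flag pages that may need manual review. The sketch below works on plain Python tuples holding the two confidence values from the output above; the `PageResult` type, the `pages_below` helper, and the review threshold of 75 are assumptions made for illustration.

```python
from typing import NamedTuple


class PageResult(NamedTuple):
    pagenum: int
    confidence: float


def pages_below(results, threshold):
    """Return the page numbers whose confidence is below `threshold`."""
    return [r.pagenum for r in results if r.confidence < threshold]


# Confidence values copied from the result table; the texts are omitted.
results = [
    PageResult(0, 77.49104619026184),
    PageResult(1, 72.95114339192709),
]

# 75.0 is an arbitrary review threshold chosen for this example.
print(pages_below(results, 75.0))
```

On the Spark DataFrame itself, the equivalent would be a `filter` on the `confidence` column before displaying or collecting the rows.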


## Clear cache

In [22]:
result.unpersist()
pdf_example_df.unpersist()