# Extract text from scanned documents with Spark OCR

This notebook illustrates how to:
* Load an example PDF
* Preview it
* Recognize text

## Import OCR transformers and utils

In [0]:
from sparkocr.transformers import *
from sparkocr.databricks import display_images
from pyspark.ml import PipelineModel

## Define OCR transformers and pipeline
* Transform the PDF binary data to the Image schema using `PdfToImage`, the PDF counterpart of [BinaryToImage](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#binarytoimage). More details about the Image schema are [here](https://nlp.johnsnowlabs.com/docs/en/ocr_structures#image-schema).
* Recognize text using [ImageToText](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#imagetotext) transformer.

In [0]:
def pipeline():
    
    # Transform PDF document to struct image format
    pdf_to_image = PdfToImage()
    pdf_to_image.setInputCol("content")
    pdf_to_image.setOutputCol("image")
    pdf_to_image.setResolution(200)
    pdf_to_image.setPartitionNum(8)

    # Run OCR
    ocr = ImageToText()
    ocr.setInputCol("image")
    ocr.setOutputCol("text")
    ocr.setConfidenceThreshold(65)
    
    # Assemble the stages into a pipeline model
    return PipelineModel(stages=[
        pdf_to_image,
        ocr
    ])
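
To get a feel for what `setResolution(200)` implies, the sketch below estimates the raster size of a rendered page. It assumes the resolution is expressed in dots per inch, and it uses a US Letter page (8.5 × 11 in) purely as a hypothetical example, since the page size of the test document is not stated here. `page_pixels` is an illustrative helper, not part of Spark OCR.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

# Hypothetical US Letter page at the resolution used in the pipeline (assumed DPI)
letter_200 = page_pixels(8.5, 11, 200)
# Same page at a higher resolution, for comparison
letter_300 = page_pixels(8.5, 11, 300)

print(letter_200)
print(letter_300)
```

A higher resolution generally gives the OCR engine more detail to work with, at the cost of larger images and more memory and time per page.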

## Copy example files from OCR resources to DBFS

In [0]:
import pkg_resources
import shutil, os
ocr_examples = "/dbfs/FileStore/examples"
resources = pkg_resources.resource_filename('sparkocr', 'resources')
if not os.path.exists(ocr_examples):
  shutil.copytree(resources, ocr_examples)

In [0]:
%fs ls /FileStore/examples/ocr/pdfs

path,name,size
dbfs:/FileStore/examples/ocr/pdfs/.DS_Store,.DS_Store,6148
dbfs:/FileStore/examples/ocr/pdfs/alexandria_multi_page.pdf,alexandria_multi_page.pdf,70556
dbfs:/FileStore/examples/ocr/pdfs/fonts.pdf,fonts.pdf,11601
dbfs:/FileStore/examples/ocr/pdfs/multiplepages/,multiplepages/,0
dbfs:/FileStore/examples/ocr/pdfs/rotated/,rotated/,0
dbfs:/FileStore/examples/ocr/pdfs/test_document.pdf,test_document.pdf,693743
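
The copy step writes to `/dbfs/FileStore/examples`, while the listing shows `dbfs:/FileStore/examples/...`. Both refer to the same location: on Databricks, `/dbfs` is the local mount of the `dbfs:/` root. A minimal sketch of that mapping follows; the helper name `local_to_dbfs_uri` is hypothetical and not a Databricks API.

```python
def local_to_dbfs_uri(local_path):
    """Map a local /dbfs/... path to the equivalent dbfs:/... URI."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"not a /dbfs path: {local_path!r}")
    # Drop the local mount prefix and prepend the DBFS scheme
    return "dbfs:/" + local_path[len(prefix):]

uri = local_to_dbfs_uri("/dbfs/FileStore/examples/ocr/pdfs/test_document.pdf")
print(uri)
```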


## Read PDF document as binary file from DBFS

In [0]:
pdf_example = '/FileStore/examples/ocr/pdfs/test_document.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()
display(pdf_example_df)

path,modificationTime,length,content
dbfs:/FileStore/examples/ocr/pdfs/test_document.pdf,2020-04-01T08:59:25.000+0000,693743,JVBERi0xLjQgCjEgMCBvYmoKPDwKL1BhZ2VzIDIgMCBSCi9UeXBlIC9DYXRhbG9nCj4+CmVuZG9iagoyIDAgb2JqCjw8Ci9UeXBlIC9QYWdlcwovS2lkcyBbIDMgMCBSIDE3IDAgUiBdCi9Db3VudCAyCj4+CmVuZG9iagozIDA= (truncated)
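
The `content` column holds the raw file bytes, which appear base64-encoded in the output above. Decoding the first few characters of that value is a quick sanity check that the file really is a PDF. The snippet below is a standalone sketch using only the standard library:

```python
import base64

# First 12 characters of the content value shown above
prefix = "JVBERi0xLjQg"

header = base64.b64decode(prefix)
is_pdf = header.startswith(b"%PDF-")

print(header)
print(is_pdf)
```

The decoded bytes start with the PDF magic marker followed by the version, here `%PDF-1.4`.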


## Preview PDF using _display_images_ function

In [0]:
display_images(PdfToImage().setOutputCol("image").transform(pdf_example_df), limit=3)

## Run the OCR pipeline

In [0]:
result = pipeline().transform(pdf_example_df).cache()

## Display results

In [0]:
display(result.select("pagenum", "text", "confidence"))

pagenum,text,confidence
0,"Patient Nam Financial Numbe Random Hospital Date of Girth Patient Location H & P Chief Complaint Arthritis | Shortness of breath CHF - Congestive heart failure Chronic kidney disease History of Present Iliness Chronic venous insufficiency Edema ar GI bleeding Patient is an 64-year-old male wilh a past medical history of hypertension, HFpEF las Glaucoma known EF 55%c, mild to moderate TA, pulmonary hypertension, permanent atrial Gout fibrillation on Eliquis, history of GI blesd, CK-I48, and anemia who presents with full weeks ypertension oi ccneralized fatigue and fcoling unwell. He also notes some shortness oi Dreath and Peptic ulcer worsening dyspnea wilh minimal exertion. His major complaints are shoulder ard joint Peripheral neuropathy pains. diffuscly. He also complains of ""bone pain’. He denics having any fevers or chills. Peripheral vascular disease e denies having any chest pain, palpitations. He denies any worse extremity Pulmonary hypertension swelling than his baseline. He states he’s been compliant with his medications. Although Tricuspid regurgitation he stales he ran out of his Eliquis & few weeks ago. He denies having any blood in his Historical . stools or mc!cna, although he does takc iron pills and states his stools arc irequently black. ~ No qualifying data His hemeglobin is al baseline. Procedure/Surqgical Histor Twelve-lead EKG showing atrial fibrillation, RBBB, LAFB, PVC. Chest x-ray showing new = duodenal resection, duodenojcjunostomy. smail right creater than left pleural effusions with mild pulmonary vascular congestion. BNP small bowel enterolomy, removal of foreign increased to 2800, up fram 1900. Tropoain 0.03. Renal function at baseline. Hemoglobin object and repair oi enterotomy (05/2 1/20 14). colonoscopy (12/10/2013), egd (12/09/2013), H/O endoscopy (07/2013), HO colonoscopy She normally takes 80 mg of oral Lasix daily. He was given 80 mg of IV Lasix in the ED. 
(03/2013), pifonidal cyst removal at base of He is currently net nogative close to 1 L. He is stillon 2 L nasal cannula. spine (1981), laser eye surgery ior glaucoma. lesions on small intestine closed up. Home Medications ai baseline.  ! Ss 5 ; Home A 10 system review of sysiems was completed and negative except as documented in HPI. allopurinol 300 mg oral lable, 360 MG= 1 Physical Exam TAB, PO. Daily alenolol 25 mg oral iablel, 25 MG= 1 TAB, Vilals & Measurements PO, Daily T: 36.8 °C (Oral) TMIN: 36.8 ""C (Oral) TMAX: 37.0 “C (Oral) HR: 54 RR: 17 evantialidens 25 mg oral tablet, 25 MG= BP: 140°63 WT: 100.3 KG 1 TAB. PO, MAGE Pulse Ox: 100 % Oxygen: 2 Limin via Nasal Cannula Combigan 0.2%-0.5% ophthalraic GENERAL: no acute distress solution, 1 DROP, Both Eyes, Q12H HEAD: normecephalic Eliquis 5 mg oral lablet, 5 MG= 1 TAB, EYES‘EARS/NOSE/THAROAT: gupils are equal. normal oropharynx PO, BID NECK: normal inspection lerrous sulfate 925 mg (65 nig elemental RESPIRATORY: no respiratory distress, no rales on my exam iron) oral tablet, 325 MG= 1 TAB, PO, CARDIOVASCULAR: irregular. brady. no murmurs. rubs or gallops Daily ABDOIAEN: soft, non-tendes Lasix 80 mg oral tabict. 80 MG= 1 TAB. EXTREMITIES: Bilateral chronic venous stasis changes PO, BID NEUROLOGIC: alert and aosieniec x 3. no gross motor or sensary deficits omeprazole 20 mg oral delayed rcleasc AssessmenvPlan 5 See a os Acute on chronic diastolic CHF (congestive heart failure) ""CAM WOES URSIN ye Heat Acute on chronic diastolic heart failure exacerbation. Smail pleural effusions dilaterally with mild pulmonary vascular congestion on chest x-ray, slighi elevation in BNR. We'll continue 1 more day af IY diuresis with 20 mg IV Lasix. He may have had 2 viral infection ; which precipitated this. We'll add Tylenol for his joint pains. Continue atenolol and Daily _ chlorthalidone. sertraline 50 mg oral tablel, 75 MG= 1,5 TAB, PO. Daily parameters. Continue Eliquis for stroke prevention. 
No evidence oj tieeding, hemog!abin : i at baseline lriamcinglone 0.1% lopical oiniment, 1 oe APP. Topical, Daily potassium chloride 20 mEq oral tablet, extended release, 20 MEQ= 1 TAB, PO, Printed: 7/17/2017 13:01 EDT Page 16 of 42 PowerChart",85.12532162666321
1,"Patient Name Financial Number Date of Birth Patient Location Random Hospital H & P Anemia At baseline Arthritis Tylenol for pain. Patient also takes Percocet alt home, will add this cn. Chronic kidney disease AY baseline. tMonitor while diuresing. Hypertension Blood pressures within tolerable ranges. Pulmonary hypertension Tricuspid regurgitation Wild-to-moderaic on echocardiogram last year Attending physician note-the patient was interviewed and examined. The appropriaic information in power chart was reviewed. The patient was discussed wilh Dr, Persad. Patient may have & mild degree of heart failurc. He and his wife were morc concernce with Ins peripheral edema. He has underlying renal insufficiency as well. We'll (ry to diuress him 10 his “dry"" weight. We will then try to adjust hie medications to keep him within a natrow range of hat weight. We will stop his atenolol this point since he ts relatively bradycardic and observe his heart rate on the cardiac monitor. He will progress with his care and aclivily as tolerated. Printed: 7/17/2017 13:01 EDT Page 17 of 42  Vitamin D2 $0,000 intl units (1.25 mg) oral capsule, 1 TAB, PO, Weexly-Tue Allergies shelliisn (cout) sulfa drug (maculopapular rash) Social History Ever Smoked Tobacco: Former Smoker Alcohol use - frequency: None Drug use: Never Lab Results O7/16/97 05:30 to O7/16/17 05:30 102 07/16/17 05:30 to 07/16/17 05:30 L 125 fL 32.4 \ BMP GLU NA K CL TOTAL CO2 BUN CRT ANION GAP CA CBC with diff WBC HGB HOT RBC VICV MICH MCHC RDW MPV O7/16/17 05:30 102 mg/dL 143 MMOL/L 3.6 MMOL/L 98 MMOL/L 40 MMOL/L 26 mg/dL. 1.23 mg/dL 5 7.9 mg/dL 07/16/17 05:30 3.4/ nl 10.1 G/DL 32.4% 3.41 /PL 95.0 FL 29.6 pg 31.2 % 15.9 % 10.7 FL PowerChart",78.96080513000489
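
One way to act on the `confidence` column is to flag pages that fall below a chosen quality bar for manual review. The sketch below uses plain Python on the two values shown above; the threshold of 80 and the helper name `pages_below` are arbitrary assumptions for illustration.

```python
# (pagenum, confidence) pairs copied from the output above
page_confidence = [(0, 85.12532162666321), (1, 78.96080513000489)]

def pages_below(rows, threshold):
    """Return the page numbers whose confidence is below the threshold."""
    return [pagenum for pagenum, confidence in rows if confidence < threshold]

# Hypothetical review threshold; tune it for your documents
needs_review = pages_below(page_confidence, 80.0)
print(needs_review)
```

On the DataFrame itself, the same check would be a `filter` on the `confidence` column.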


## Clear cache

In [0]:
result.unpersist()
pdf_example_df.unpersist()