![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/visual-nlp/2.1.Pdf_processing.ipynb)

## Blogposts and videos

- [Text Detection in Spark OCR](https://medium.com/spark-nlp/text-detection-in-spark-ocr-dcd8002bdc97)

- [Table Detection & Extraction in Spark OCR](https://medium.com/spark-nlp/table-detection-extraction-in-spark-ocr-50765c6cedc9)

- [Extract Tabular Data from PDF in Spark OCR](https://medium.com/spark-nlp/extract-tabular-data-from-pdf-in-spark-ocr-b02136bc0fcb)

- [Signature Detection in Spark OCR](https://medium.com/spark-nlp/signature-detection-in-spark-ocr-32f9e6f91e3c)

- [GPU image pre-processing in Spark OCR](https://medium.com/spark-nlp/gpu-image-pre-processing-in-spark-ocr-3-1-0-6fc27560a9bb)

- [How to Setup Spark OCR on UBUNTU - Video](https://www.youtube.com/watch?v=cmt4WIcL0nI)


**More examples here**

https://github.com/JohnSnowLabs/spark-ocr-workshop

To get a trial license, please go to:

https://www.johnsnowlabs.com/install/

### Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install(refresh_install=True, visual=True)

In [None]:
from johnsnowlabs import nlp, visual
import pandas as pd

# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start(visual=True)

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8356 (8).json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.2.2, 💊Spark-Healthcare==5.2.1, 🕶Spark-OCR==5.1.2, running on ⚡ PySpark==3.1.2


In [None]:
import pkg_resources

from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

## Read PDFs into a DataFrame and display them

In [None]:
pdf_path = visual.pkg_resources.resource_filename('sparkocr', 'resources/ocr/pdfs/*.pdf')
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

visual.display_pdf(pdf_example_df)

## Define a pipeline to extract text from searchable PDFs and OCR text from scanned PDFs

In [None]:
def pipeline():

    # If the PDF has a text layer, extract the text directly
    pdf_to_text = visual.PdfToText() \
        .setInputCol("content") \
        .setOutputCol("text") \
        .setSplitPage(True) \
        .setExtractCoordinates(True) \
        .setStoreSplittedPdf(True)

    # If the PDF is image-based (scanned), render it as an image
    pdf_to_image = visual.PdfToImage() \
        .setInputCol("content") \
        .setOutputCol("image") \
        .setKeepInput(True)

    # Run OCR
    ocr = visual.ImageToText() \
        .setInputCol("image") \
        .setOutputCol("text") \
        .setConfidenceThreshold(60)

    pipeline = PipelineModel(stages=[
        pdf_to_text,
        pdf_to_image,
        ocr
    ])

    return pipeline
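
The comments in the cell above describe a fallback: use the embedded text when a PDF has one, otherwise render the page and run OCR on it. The plain-Python sketch below illustrates that control flow only. It is not how Spark OCR implements it internally, and `extract_page_text`, `fake_embedded_reader`, and `fake_ocr` are hypothetical stand-ins invented for this illustration.

```python
from typing import Callable, Optional


def extract_page_text(
    page: bytes,
    read_embedded_text: Callable[[bytes], Optional[str]],
    run_ocr: Callable[[bytes], str],
) -> str:
    """Return the embedded text when present, otherwise fall back to OCR."""
    embedded = read_embedded_text(page)
    if embedded and embedded.strip():
        return embedded
    return run_ocr(page)


# Hypothetical stand-ins: a "searchable" page is marked with a TEXT: prefix.
def fake_embedded_reader(page: bytes) -> Optional[str]:
    return page.decode("utf-8") if page.startswith(b"TEXT:") else None


def fake_ocr(page: bytes) -> str:
    return "ocr-result"


searchable = extract_page_text(b"TEXT:hello", fake_embedded_reader, fake_ocr)
scanned = extract_page_text(b"\x89scan", fake_embedded_reader, fake_ocr)
```

In the notebook's pipeline, both `PdfToText` and `ImageToText` write to the same `text` column, so later cells can read a single column regardless of which path produced the text.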

## Run the pipeline and show the results

In [None]:
result = pipeline().transform(pdf_example_df).cache()
result.show()

+--------------------+--------------------+------+--------------------+--------------------+----------------+---------------+--------------------+--------------------+-----------+-------+-----------+--------------------+---------+
|                path|    modificationTime|length|                text|           positions|height_dimension|width_dimension|             content|               image|total_pages|pagenum|documentnum|          confidence|exception|
+--------------------+--------------------+------+--------------------+--------------------+----------------+---------------+--------------------+--------------------+-----------+-------+-----------+--------------------+---------+
|file:/usr/local/l...|2024-02-06 19:52:...| 70556|Alexandria was fo...|[{[{A, 72.024, 76...|             792|            612|[25 50 44 46 2D 3...|                null|          0|      0|          0|-1.79769313486231...|     null|
|file:/usr/local/l...|2024-02-06 19:52:...|693743|Patient Name
Fina...|[{[{P

## Display the text using a pandas DataFrame

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

result.select("text").toPandas().style.set_properties(**{'white-space': 'pre-wrap', 'text-align': 'left'})

Unnamed: 0,text
0,"Alexandria was founded around a small, ancient Egyptian town c. 332 BC by Alexander the Great,[4] king of Macedon and leader of the Greek League of Corinth, during his conquest of the Achaemenid Empire. Alexandria became an important center of Hellenistic civilization and remained the capital of Ptolemaic Egypt and Roman and Byzantine Egypt for almost 1,000 years, until the Muslim conquest of Egypt in AD 641, when a new capital was founded at Fustat (later absorbed into Cairo). Hellenistic Alexandria was best known for the Lighthouse of Alexandria (Pharos), one of the Seven Wonders of the Ancient World; its Great Library (the largest in the ancient world); and the Necropolis, one of the Seven Wonders of the Middle Ages. Alexandria was at one time the second most powerful city of the ancient Mediterranean region, after Rome. Ongoing maritime archaeology in the harbor of Alexandria, which began in 1994, is revealing details of Alexandria both before the arrival of Alexander, when a city named Rhacotis existed there, and during the Ptolemaic dynasty."
1,"Patient Name Financial Number Date of Girth Patient Location Random Hospital  H & P Anemia Vitamin D2 $0,000 intl units (1.25 ma) oral ALDASeIne capsule, 1 TAS, PO, V/eexly-Tue Arthritis Allergies Tylenol for pain. Patient also takes Percocet al home, will add this on. Chronic kidney disease AY baseline. Monitor while divresing. Hypertension Blood pressures within tolerable ranges. Pulmonary hypertension Tricuspid regurgitation ild-to-moderaie on echocardiogram last year sholliish (cout) sulfa drug (maculopapular rash)  Social History Ever Smoked tobacco: Former Smoker Alcohol use - frequency; None Drug use: Never Lab Results 07/10/77 05:30 to O7/16/17 05:30  Attending physician note-the patient was interviewed and examined. The appropriatc information in power chart was reviewed. The patient was discussed wilh Dr, Persad. 143 1L 981H 26? Patient may have @ mild degree oj heart failure. He and his wife were morc concernee with a Ins peripheral edema. He has underlying renal insufficiency as well. We'll try to diurese him to his “dry"" weight. We will then try to adjust his medications to kcep him within a narrow range of [hat weight. We will stop his atenolol this point since he is relatively bradycardic anc observe his heart rate onthe cardiac monitor. He will progress with his care and aclivily as tolerated. 102 07/16/17 05:30 to O7/ 16/17 05:30 fL 32.4 \ Printed: 7/1 7/2017 13:01 EDT Page 17 of 42 BMP GLU NA K CL TOTAL COZ BUN CRT ANION GAP CA CBC with diff WBC HGB HCT RBC MCV MICH MCHC RDW MPV 07/16/17 05:30 102 mg/dL 143 MMOL/L 3.6 MMOL/L 98 MMOL/L 40 MMOL/L 26 mg/dL. 1.23 mg/dL 5 7.9maQ/dL 07/16/17 05:30 3.4/ nl 10.1 G/DL 32.4 “Yo 3.41 /PL 95.0 FL 29.6 pg 31.2 % 15,9 %o 10.7 FL PowerChart"
2,"8 i , . ! 9 i , . ! 10 i , . ! 11 i , . ! 12 i , . ! 13 i , . ! 14 i , . !"
3,"Patient Nam Financial Numbe Random Hospital Date of Birth Patient Location  Chief Complaint Shortness of breath History of Present Illness  Patient is an 84-year-old male wilh a past medical history of hypertension, HFpEF last known EF 55%, mild to moderate TA, pulmonary hypertension, permanent atrial fibrillation on Eliquis, history of GI blesd, CK-IM8, and anemia who presents with full weeks oi ccncralized fatigue and fecling unwell. He also notes some shortness oi Breath and worsening dyspnea willy minimal exerlion. His major complaints are shoulder and joint pains. diffusely. He also complains of ""bone pain’. He denics having any fevers or crills. e demes having any chest pain, palpitalicns, He denies any worse extremity swelling than his baseline. He states he’s been compliant with his mcdications. Although he stales he ran out of his Eliquis @ few weeks ago. He denies having any blood in his Stools or mc!ena, although he does take iron pills and states his stools arc irequently black. His hemoglobin is al baseline. Twelve-lead EKG showing atrial fibrillation, RBBB, LAFB, PVC. Chest x-ray showing new small right creater than left pleural effusions with mild pulmonary vascular congestion. BNP increased to 2800, up fram 1900. Tropoain 0.03. Renal function at baseline. Hemoaglapin at baseline. She normally takes 80 mq of oral Lasix daily. He was given 80 mg of IV Lasix in the ED. He is currently net negative close to 1 L. He is still on 2 L nasal cannula. ! Ss 5 A 10 system roview af systems was completed and negative except as documented in HPI. Physical Exam  Vitals & Measurements T: 36.8 °C (Oral) TMIN: 36.8 ""C (Oral) TMAX: 37.0 °C {Oral} HR: 54 RR: 7 BP: 140/63 WT: 100.3 KG Pulse Ox: 100 % Oxygen: 2 Limin via Nasal Cannula GENERAL: no acute distress HEAD: normecephalic EYES‘EARS/‘NOSE/THAOAT: nupils are equal. normal oropharynx NECK: normal inspection RESPIRATORY: no respiratory distress, no rales on my exam CARDIOVASCULAR: irregular. brady. 
no murmurs, rubs or galleps ABDOMEN: soft, non-tendes EXTREMITIES: Bilateral chronic venous stasis changes NEUROLOGIC: alert and osieniec x 3. no gross motar or sensory deaficils  Acute on chronic diastolic CHF (congestive heart failure) Acute on chronic diastolic heart failure exacerbation. Small pleural effusions dilaterally with mild pulmonary vascular congesiion on chest x-ray, slight elevation in BNR. We'll continue 1 more day af IV diuresis with &0 mg IV Lasix. He may have had a viral infection which precipilated this. We'll add Tylenol jor his joint paias. Continue atenclof and chiorthalidone. AF - Atrial fibrillation Permanent atrial fibrillation. Rates bradycardic in the &0s. Continue atenolol with hola parameters. Coniinue Eliquis for stroke prevention. No evidence oj bleeding, hemog'abin at baseline.  Printed: 7/17/2017 13:07 EDT Page 16 of 42  Arincitis CHF - Congestive heart failure Chronic kidney disease Chronic venous insufficiency Edema GI bleeding Glaucoma Goul Hypertension Peptic ulcer Peripheral ncuropathy Peripheral vascular disease Pulmonary hypertension Tricuspid regurgitation Historical No qualifying data Procedure/Surgical History duodenal resection, duodenojcjunostomy. small bowel enterolomy, removal of foreign object and repair oj enterotomy (05/2 1/20 14), colonoscopy (12/10/2013), egd (1209/2013), H/O endoscopy (07/2013), H’O colonoscopy (03/2013), pilonidal cyst removal at base of spine (1981), laser eye surgery ior glaucoma. lesions on small intestine closed up.  Home Medications Home allopurinol 300 mg oral tablet, 300 MG= 1 TAB, PO. Daily atenolol 25 mg oral iablel, 25 MG= 1 TAB, PO, Daily chtorthalidone 25 mg oral tablet, 25 MG= 1 TAB, PO, MiWOF Combigan 0.2%-0.5% ophthalmic solution, 1 DROP, Both Eyes, Q12H Eliquis 5 mg oral lablet, 5 MG= 1 TAB, PO, BID lerrous sulfate 925 mg (65 nig elemental iron) oral tablet, 325 MG= 1 TAB, PO,  Daily Lasix 80 mg oral tabic:. 80 MG= | TAB. 
PO, BID omeprazole 20 mg oral delayed scicasc capsule, 20 MG= 1 CAP, PO, BID Percocei 5/325 oral tablet. | TAB, PO. QAM potassium chloride 20 mEq oral tablet, extended release, 20 MEQO= 1 TAB, PO, Daily serlraline 50 mg oral tablet, 75 MG= 1,5 TAB, PQ. Daiiy triamcinolone 0.71% lopical cream, 1 APP, Topical, Daily lriamcmolone 0.1% lopical ominient, 1 APP. Topical, Daily PowerChart"
4,"Alexandria is the second-largest city in Egypt and a major economic centre, extending about 32 km (20 mi) along the coast of the Mediterranean Sea in the north central part of the country. Its low elevation on the Nile delta makes it highly vulnerable to rising sea levels. Alexandria is an important industrial center because of its natural gas and oil pipelines from Suez. Alexandria is also a popular tourist destination."
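
Because the pipeline sets `setSplitPage(True)`, the result holds one row per page. If one string per document is more convenient, the pages can be regrouped by `path` and ordered by `pagenum`. A minimal sketch in plain Python, using made-up rows that only mirror those column names (after `toPandas()`, the same idea could be applied to the real data):

```python
from collections import defaultdict

# Hypothetical rows mirroring the result columns: path, pagenum, text
rows = [
    {"path": "a.pdf", "pagenum": 1, "text": "second page"},
    {"path": "a.pdf", "pagenum": 0, "text": "first page"},
    {"path": "b.pdf", "pagenum": 0, "text": "only page"},
]


def join_pages(rows):
    """Group page texts by document path and join them in page order."""
    pages = defaultdict(list)
    for row in rows:
        pages[row["path"]].append((row["pagenum"], row["text"]))
    return {
        path: "\n".join(text for _, text in sorted(items))
        for path, items in pages.items()
    }


documents = join_pages(rows)
```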
