![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings_JSL/6.1.SparkOcrStreamingPDF.ipynb)

## Spark OCR Streaming

## Blogposts and videos

- [Text Detection in Spark OCR](https://medium.com/spark-nlp/text-detection-in-spark-ocr-dcd8002bdc97)

- [Table Detection & Extraction in Spark OCR](https://medium.com/spark-nlp/table-detection-extraction-in-spark-ocr-50765c6cedc9)

- [Extract Tabular Data from PDF in Spark OCR](https://medium.com/spark-nlp/extract-tabular-data-from-pdf-in-spark-ocr-b02136bc0fcb)

- [Signature Detection in Spark OCR](https://medium.com/spark-nlp/signature-detection-in-spark-ocr-32f9e6f91e3c)

- [GPU image pre-processing in Spark OCR](https://medium.com/spark-nlp/gpu-image-pre-processing-in-spark-ocr-3-1-0-6fc27560a9bb)

- [How to Setup Spark OCR on UBUNTU - Video](https://www.youtube.com/watch?v=cmt4WIcL0nI)


**More examples here**

https://github.com/JohnSnowLabs/spark-ocr-workshop

### Colab Setup

In [23]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
!pip install -q johnsnowlabs

In [24]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving healthcare_nlp_training_license_jan23.json to healthcare_nlp_training_license_jan23 (1).json


In [25]:
from johnsnowlabs import nlp, visual, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
nlp.install(refresh_install=True, visual=True)

👌 Detected license file /content/healthcare_nlp_training_license_jan23.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.4-py3-none-any.whl
Downloading 🐍+🕶 Python Library spark_ocr-4.2.4-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.4.jar
Downloading 🫘+🕶 Java Library spark-ocr-assembly-4.2.4.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Everything is already installed, no changes made


In [26]:
import pyspark
import json
import os

## Initialization of the Spark session

In [27]:
from johnsnowlabs import visual, nlp
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start(visual=True)

Spark Session already created, some configs may not take.
👌 Detected license file /content/healthcare_nlp_training_license_jan23.json


In [28]:
from pyspark.ml import PipelineModel
# import only the function used below instead of a wildcard import
from pyspark.sql.functions import current_timestamp

In [29]:
# set the path to the folder with your PDFs here
dataset_path = "/content/*.pdf"

In [30]:
# read one file to infer the schema
pdfs_df = spark.read.format("binaryFile").load(dataset_path).limit(1)

## Define OCR pipeline

In [31]:
# Transform binary to image
pdf_to_image = visual.PdfToImage()
pdf_to_image.setOutputCol("image")

# Run OCR on each page image
ocr = visual.ImageToText()
ocr.setInputCol("image")
ocr.setOutputCol("text")
ocr.setConfidenceThreshold(60)

# OCR pipeline
pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr
])

## Define streaming pipeline and start it
Note: each start erases the previous results.

In [None]:
# number of files processed in one micro-batch
maxFilesPerTrigger = 4

# read files as stream
pdf_stream_df = spark.readStream \
.format("binaryFile") \
.schema(pdfs_df.schema) \
.option("maxFilesPerTrigger", maxFilesPerTrigger) \
.load(dataset_path)

# process files using OCR pipeline
result = pipeline.transform(pdf_stream_df).withColumn("timestamp", current_timestamp())

# store results to memory table
query = result.writeStream \
 .format('memory') \
 .queryName('result') \
 .start()
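
Because `start()` returns immediately and the OCR work continues in the background, it can help to wait until the stream has caught up before querying the `result` table. The sketch below is one possible way to do that. It assumes `query.status` returns a dict with the keys `isDataAvailable` and `isTriggerActive`, as PySpark's `StreamingQuery.status` does; the helper name `wait_until_idle` and its parameters are hypothetical and not part of Spark or Spark OCR.

```python
import time


def wait_until_idle(query, timeout_s=60.0, poll_s=1.0):
    """Poll a streaming query until it reports no pending data and no active trigger.

    Hypothetical helper: `query` is any object exposing a `status` dict with the
    keys 'isDataAvailable' and 'isTriggerActive'.
    Returns True if the query became idle before the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = query.status
        if not status["isDataAvailable"] and not status["isTriggerActive"]:
            return True
        time.sleep(poll_s)
    return False
```

A call such as `wait_until_idle(query, timeout_s=120)` would then precede the cells that read the `result` table. Note that a query may briefly look idle right after `start()`, before its first trigger fires, so treat the return value as a hint rather than a guarantee. PySpark also provides `query.processAllAvailable()`, which blocks until all available input has been processed and is documented as intended for testing.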

In [34]:
# get the progress of the streaming job
query.lastProgress

{'id': '90359645-7a20-452b-91ab-e65cfc501430',
 'runId': '1817bbd7-be6f-4b91-bd4c-ca1ef610a42b',
 'name': None,
 'timestamp': '2023-01-23T14:34:44.551Z',
 'batchId': 1,
 'numInputRows': 0,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 0.0,
 'durationMs': {'latestOffset': 3, 'triggerExecution': 3},
 'stateOperators': [],
 'sources': [{'description': 'FileStreamSource[file:/content/*.pdf]',
   'startOffset': {'logOffset': 0},
   'endOffset': {'logOffset': 0},
   'numInputRows': 0,
   'inputRowsPerSecond': 0.0,
   'processedRowsPerSecond': 0.0}],
 'sink': {'description': 'FileSink[results/]', 'numOutputRows': -1}}
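
The progress report above is verbose, so a small helper can pull out the fields that matter when monitoring a job. This is a minimal sketch, assuming `query.lastProgress` returns a plain dict shaped like the output above, or `None` when no progress update has been reported yet. The function name `summarize_progress` and the key `triggerMs` are made up for illustration.

```python
def summarize_progress(progress):
    """Return a compact summary of a streaming progress dict, or None if there is none.

    Hypothetical helper: expects the dict layout shown in the `lastProgress` output.
    """
    if progress is None:  # no progress update has been reported yet
        return None
    return {
        "batchId": progress["batchId"],
        "numInputRows": progress["numInputRows"],
        "processedRowsPerSecond": progress["processedRowsPerSecond"],
        # time spent executing the last trigger, in milliseconds
        "triggerMs": progress.get("durationMs", {}).get("triggerExecution"),
        "sink": progress["sink"]["description"],
    }
```

Calling `summarize_progress(query.lastProgress)` on the report above would give a batch id of 1, zero input rows, and a trigger time of 3 ms.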

In [None]:
# run this to stop the streaming job
query.stop()

## Show results from 'result' table
Remember to upload at least one PDF file to the /content folder in Colab.

In [35]:
# count of processed records (number of processed pages in results)
spark.table("result").count() 

1

In [36]:
# show results
spark.table("result").select("timestamp","pagenum", "path", "text").show(10)

+--------------------+-------+--------------------+--------------------+
|           timestamp|pagenum|                path|                text|
+--------------------+-------+--------------------+--------------------+
|2023-01-23 13:48:...|      0|file:/content/noi...| 

 

 

ne Pa a ...|
+--------------------+-------+--------------------+--------------------+



## Run streaming job for storing results to disk

In [37]:
# format: could also be parquet or csv
# path: output directory on the file system
query = result.select("text").writeStream \
 .format('text') \
 .option("path", "results/") \
 .option("checkpointLocation", "checkpointDir") \
 .start()

In [38]:
# get the progress of the streaming job
query.lastProgress

In [None]:
# run this to stop the streaming job
query.stop()

## Read results from disk

In [40]:
!cp /content/noised.pdf /content/noised_1.pdf

In [42]:
results = spark.read.format("text").load("results/*.txt")
results.sample(.1).show(truncate=False)

+----------------------------------------------------+
|value                                               |
+----------------------------------------------------+
|er ‘Sample No. _ 5031 -: JS BD oats                 |
|. Cartons --- OLD GOLD STRAIGHT                     |
|. =, Requirements: Markings-~- Sample number on each|
|Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee - |
|                                                    |
| , BLEND CASING RECASING                            |
|                                                    |
|                                                    |
|                                                    |
|                                                    |
|Laboratory “----- One Tray .                        |
|| | le 4 68 fb                                      |
|Filter Production--- -- , .                         |
|Shipping ----------- Tot _                          |
+----------------------------------------------------+
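
The sampled output contains many blank lines, so a quick check of how much text was actually written can be useful. The sketch below inspects the output folder with plain Python instead of Spark. It assumes the results live on the local file system (as in Colab) and that the sink writes `.txt` part files directly under `results/`, matching the `results/*.txt` glob used above. Both function names are hypothetical.

```python
from pathlib import Path


def list_result_files(results_dir="results"):
    """Return the sorted .txt part files found directly under results_dir."""
    return sorted(Path(results_dir).glob("*.txt"))


def count_nonblank_lines(paths):
    """Count the lines that contain at least one non-whitespace character."""
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total
```

For example, `count_nonblank_lines(list_result_files("results"))` gives the number of recognized text lines across all output files.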



## Clean up the results and checkpoint folders

In [None]:
%%bash
rm -r -f results
rm -r -f checkpointDir
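
Outside a notebook, where the `%%bash` cell magic is not available, the same cleanup can be done in plain Python. This is a minimal sketch using the standard library; the function name `clean_folders` is hypothetical, and the default folder names are the ones used in this notebook.

```python
import shutil


def clean_folders(folders=("results", "checkpointDir")):
    """Delete each folder and its contents, ignoring folders that do not exist."""
    for folder in folders:
        # ignore_errors=True mirrors the -f flag of `rm -r -f`
        shutil.rmtree(folder, ignore_errors=True)
```

Stop the streaming query with `query.stop()` before calling `clean_folders()`, so that the job is no longer writing to these folders.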