![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark OCR


## Blogposts and videos

- [How to Setup Spark OCR on UBUNTU - Video](https://www.youtube.com/watch?v=cmt4WIcL0nI)

- [Installing Spark NLP and Spark OCR in air-gapped networks (offline mode)
](https://medium.com/spark-nlp/installing-spark-nlp-and-spark-ocr-in-air-gapped-networks-offline-mode-f42a1ee6b7a8)

- [Table Detection & Extraction in Spark OCR](https://medium.com/spark-nlp/table-detection-extraction-in-spark-ocr-50765c6cedc9)

- [Signature Detection in Spark OCR](https://medium.com/spark-nlp/signature-detection-in-spark-ocr-32f9e6f91e3c)

- [GPU image pre-processing in Spark OCR](https://medium.com/spark-nlp/gpu-image-pre-processing-in-spark-ocr-3-1-0-6fc27560a9bb)

**More examples here**

https://github.com/JohnSnowLabs/spark-ocr-workshop

**Setup**

In [1]:
import sparkocr
import sys
from pyspark.sql import SparkSession
from sparkocr import start
import base64
from sparkocr.transformers import *
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F
from sparkocr.enums import *
from sparkocr.utils import display_images, display_image

VBox()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
0,application_1698694675566_0001,pyspark,idle,Link,Link,,✔


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

SparkSession available as 'spark'.


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

## Pdf to Text


In [2]:
%%sh
wget -q -O /tmp/sample_doc.pdf    http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf

In [3]:
%%sh
hdfs dfs -copyFromLocal /tmp/sample_doc.pdf /user/hadoop/sample.pdf

In [4]:
# Transform PDF document to images per page
pdf_to_image = PdfToImage()\
      .setInputCol("content")\
      .setOutputCol("image")\
      .setResolution(100)

# Run OCR
ocr = ImageToText()\
      .setInputCol("image")\
      .setOutputCol("text")\
      .setConfidenceThreshold(65)
      # .setKeepLayout(True) \ # to preserve the layout of the input

pdf_to_text_pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr
])

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [5]:
pdf = pdf = '/user/hadoop/sample.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [6]:
result = pdf_to_text_pipeline.transform(pdf_example_df).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [7]:
result.select("pagenum","text", "confidence").show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------------------+-----------------+
|pagenum|                text|       confidence|
+-------+--------------------+-----------------+
|      0|digital RE.\n\nAS...|89.96996718186598|
+-------+--------------------+-----------------+

In [8]:
result.select("text").collect()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

[Row(text='digital RE.\n\nASX ANNOUNCEMENT\n3 November 2017\n\nNotice Pursuant to Paragraph 708A(5)(e) of the Corporations Act\n2001 ("Act")\n\nDigitalX Limited (ASX:DCC) (DCC or the Company) confirms that the Company has today\nissued 620,000 Fully Paid Ordinary Shares (Shares) upon exercise of 620,000 Unlisted\nOptions exercisable at $0.0324 Expiring 14 September 2019 and 3,725,000 Shares upon\nexercise of 3,725,000 Unlisted Incentive Options exercisable at $0.08 expiring 10 February\n2018.\n\nThe Act restricts the on-sale of securities issued without disclosure, unless the sale is exempt\nunder section 708 or 708A of the Act. By giving this notice, a sale of the Shares noted above\nwill fall within the exemption in section 708A(5) of the Act.\n\nThe Company hereby notifies ASX under paragraph 708A(5)(e) of the Act that:\n(a) the Company issued the Shares without disclosure to investors under Part 6D.2 of\n\nthe Act;\n\n(b) _asatthe date of this notice, the Company has complied with 

In [9]:
print("\n".join([row.text for row in result.select("text").collect()]))


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

digital RE.

ASX ANNOUNCEMENT
3 November 2017

Notice Pursuant to Paragraph 708A(5)(e) of the Corporations Act
2001 ("Act")

DigitalX Limited (ASX:DCC) (DCC or the Company) confirms that the Company has today
issued 620,000 Fully Paid Ordinary Shares (Shares) upon exercise of 620,000 Unlisted
Options exercisable at $0.0324 Expiring 14 September 2019 and 3,725,000 Shares upon
exercise of 3,725,000 Unlisted Incentive Options exercisable at $0.08 expiring 10 February
2018.

The Act restricts the on-sale of securities issued without disclosure, unless the sale is exempt
under section 708 or 708A of the Act. By giving this notice, a sale of the Shares noted above
will fall within the exemption in section 708A(5) of the Act.

The Company hereby notifies ASX under paragraph 708A(5)(e) of the Act that:
(a) the Company issued the Shares without disclosure to investors under Part 6D.2 of

the Act;

(b) _asatthe date of this notice, the Company has complied with the provisions of Chapter
2M of th

###  With Skew Correction

In [10]:
from sparkocr.utils import display_image
from sparkocr.metrics import score

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [11]:
def ocr_pipeline(skew_correction=False):

    # Transforrm PDF document to images per page
    pdf_to_image = PdfToImage()\
          .setInputCol("content")\
          .setOutputCol("image")\
          .setResolution(100)

    # Image skew corrector
    skew_corrector = ImageSkewCorrector()\
          .setInputCol("image")\
          .setOutputCol("corrected_image")\
          .setAutomaticSkewCorrection(skew_correction)

    # Run OCR
    ocr = ImageToText()\
          .setInputCol("corrected_image")\
          .setOutputCol("text")

    pipeline_ocr = PipelineModel(stages=[
        pdf_to_image,
        skew_corrector,
        ocr
    ])

    return pipeline_ocr

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [12]:
%%sh
wget -q -O /tmp/400_rot.pdf  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ocr/400_rot.pdf

In [13]:
%%sh
hdfs dfs -copyFromLocal /tmp/400_rot.pdf /user/hadoop/

In [14]:
pdf_rotated_df = spark.read.format("binaryFile").load('/user/hadoop/400_rot.pdf').cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [15]:
pdf_pipeline = ocr_pipeline(False)

result = pdf_pipeline.transform(pdf_rotated_df).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [16]:
result.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+-------+--------------------+-----------+-------+-----------+--------------------+-----------------+---------+--------------------+--------------------+
|                path|    modificationTime| length|               image|total_pages|pagenum|documentnum|     corrected_image|       confidence|exception|                text|           positions|
+--------------------+--------------------+-------+--------------------+-----------+-------+-----------+--------------------+-----------------+---------+--------------------+--------------------+
|hdfs://ip-172-31-...|2023-10-30 20:25:...|2240141|{hdfs://ip-172-31...|          1|      0|          0|{hdfs://ip-172-31...|92.89510854085286|         |FOREWORD\n\nElect...|[{[{FOREWORD\n\n,...|
+--------------------+--------------------+-------+--------------------+-----------+-------+-----------+--------------------+-----------------+---------+--------------------+--------------------+

In [17]:
result.select("pagenum").collect()[0].pagenum

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

0

In [18]:
display_image(result.select("image").collect()[0].image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/400_rot.pdf
    Resolution: 100 dpi
    Width: 826 px
    Height: 1169 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=826x1169 at 0x7FA9554D3D90>

### Display recognized text without skew correction


In [19]:
result.select("pagenum","text", "confidence").show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------------------+-----------------+
|pagenum|                text|       confidence|
+-------+--------------------+-----------------+
|      0|FOREWORD\n\nElect...|92.89510854085286|
+-------+--------------------+-----------------+

In [20]:
print("\n".join([row.text for row in result.select("text").collect()]))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

FOREWORD

Electronic design engineers are the true idea men of the electronic
industries. They create ideas and use them in their designs, they stimu-
late ideas in other designers, and they borrow and adapt ideas from
others. One could almost say they feed on and grow on ideas,

torial content has reflected this awareness. Each issue is literally a col-
lection of useful ideas. In one section, however, special attention has

unique, ingenious, and often very simple ideas that readers have found
useful, sometimes as parts of larger designs and sometimes as aids in

To encourage this exchange of ideas, ELECTRONIC DESIGN
has been sponsoring an IFD Award program. Readers are asked to
vote on the ideas they find most useful in the IFD section of
ELECTRONIC DESIGN. Awards are made to the idea getting the
most votes in an issue, and from the issue winners a grand prize of
$1,000 is awarded for the best “Idea of the Year.”

It is difficult to categorize ideas for designers; they are often
use

### Display results with skew correction

In [21]:
pdf_pipeline_corrected = ocr_pipeline(True)

corrected_result = pdf_pipeline_corrected.transform(pdf_rotated_df).cache()

print("\n".join([row.text for row in corrected_result.select("text").collect()]))


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

FOREWORD

Electronic design engineers are the true idea men of the electronic
industries. They create ideas and use them in their designs, they stimu-
late ideas in other designers, and they borrow and adapt ideas from
others. One could almost say they feed on and grow on ideas.

ELECTRONIG DESIGN has recognized this need and its edi-
torial content has reflected this awareness. Each issue is literally a cal-
lection of useful ideas. In one section, however, special attention has
been devoted to providing a forum for the exchange of ideas between
readers~a section called “Ideas For Design.” Here are presented clever,
unique, ingenious, and often very simple ideas that readers have found
useful, sometimes as parts of larger designs and sometimes as aids in
measuring the parameters or testing the effectiveness of their designs.
Many are quite simple “little” ideas, but experienced designers know
that good little ideas make the good large design possible.

To encourage this exchange of id

In [22]:
corrected_result.select("pagenum","text", "confidence").show()


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------------------+-----------------+
|pagenum|                text|       confidence|
+-------+--------------------+-----------------+
|      0|FOREWORD\n\nElect...|92.86057472229004|
+-------+--------------------+-----------------+

### Display skew corrected images

In [23]:
display_image(corrected_result.select("corrected_image").collect()[0].corrected_image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/400_rot.pdf
    Resolution: 100 dpi
    Width: 866 px
    Height: 1197 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=866x1197 at 0x7FA9554DC110>

### Compute score and compare
Read original text and calculate scores for both results.

In [24]:
%%sh
wget -q -O /tmp/400.txt   https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ocr/400.txt

In [25]:
%%sh
hdfs dfs -copyFromLocal /tmp/400.txt  /user/hadoop/400_1.txt

In [26]:
%%sh
hdfs dfs -ls  /user/hadoop/

Found 3 items
-rw-r--r--   1 emr-notebook hdfsadmingroup       2669 2023-10-30 20:25 /user/hadoop/400_1.txt
-rw-r--r--   1 emr-notebook hdfsadmingroup    2240141 2023-10-30 20:25 /user/hadoop/400_rot.pdf
-rw-r--r--   1 emr-notebook hdfsadmingroup     212973 2023-10-30 20:24 /user/hadoop/sample.pdf


In [27]:
pdf_rotated_text = spark.read.text('/user/hadoop/400_1.txt')

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [28]:
df = pdf_rotated_text.toPandas()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [29]:
with open('400_1.txt', 'w') as file:
    for i in range(len(df)):
        file.write(df['value'].iloc[i]+'\n')
with open('400_1.txt', 'r') as file:
    print(file.read())

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

FOREWORD
Electronic design engineers are the true idea men of the electronic
industries. They create ideas and use them in their designs, they stimu-
late ideas in other designers, and they borrow and adapt ideas from
others. One could almost say they feed on and grow on ideas.

ELECTRONIC DESIGN has recognized this need and its edi-
torial content has reflected this awareness. Each issue is literally a col-
lection of useful ideas. In one section, however, special attention has
been devoted to providing a forum for the exchange of ideas between
readers—a section called “Ideas For Design.” Here are presented clever,
unique, ingenious, and often very simple ideas that readers have found
useful, sometimes as parts of larger designs and sometimes as aids in
measuring the parameters or testing the effectiveness of their designs.
Many are quite simple “little” ideas, but experienced designers know
that good little ideas make the good large design possible.

To encourage this exchange of ide

In [30]:
detected = "\n".join([row.text for row in result.collect()])
corrected_detected = "\n".join([row.text for row in corrected_result.collect()])

# read original text
pdf_rotated_text = open('400_1.txt', "r").read()

# compute scores
detected_score = score(pdf_rotated_text, detected)
corrected_score = score(pdf_rotated_text, corrected_detected)

#  print scores
print("Score without skew correction: {0}".format(detected_score))
print("Score with skew correction: {0}".format(corrected_score))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Score without skew correction: 0.6008403361344538
Score with skew correction: 0.9502458210422812

## Reading multiple pdfs from folder

In [31]:
pdf_path = "/user/hadoop/*.pdf"

pdfs = spark.read.format("binaryFile").load(pdf_path).cache()
#images = spark.read.format("binaryFile").load('text_with_noise.png').cache()

pdfs.count()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

2

In [32]:
# Transforrm PDF document to images per page
pdf_to_image = PdfToImage()\
      .setInputCol("content")\
      .setOutputCol("image")\
      .setResolution(100)

# Run OCR
ocr = ImageToText()\
      .setInputCol("image")\
      .setOutputCol("text")\
      .setConfidenceThreshold(65)\
      .setIgnoreResolution(False)

ocr_pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr
])


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [33]:
results = ocr_pipeline.transform(pdfs)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [34]:
results.columns

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

['path', 'modificationTime', 'length', 'image', 'total_pages', 'pagenum', 'documentnum', 'confidence', 'exception', 'text', 'positions']

In [35]:
results.select('path','confidence','text').show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+-----------------+--------------------+
|                path|       confidence|                text|
+--------------------+-----------------+--------------------+
|hdfs://ip-172-31-...|92.89510854085286|FOREWORD\n\nElect...|
|hdfs://ip-172-31-...|89.96996718186598|digital RE.\n\nAS...|
+--------------------+-----------------+--------------------+

### Recognize text from PDFs and store results to PDF with text layout

In [36]:
from sparkocr.utils import display_pdf_file

# Transforrm PDF document to images per page
pdf_to_image = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setResolution(100)

# Run OCR and render results to PDF
ocr = ImageToTextPdf() \
    .setInputCol("image") \
    .setOutputCol("pdf_page")

# Assemble multipage PDF
pdf_assembler = PdfAssembler() \
    .setInputCol("pdf_page") \
    .setOutputCol("pdf")

pdf_pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr,
    pdf_assembler
])

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [37]:
%%sh
wget -q -O /tmp/sample_doc.pdf http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf

In [38]:
%%sh
hdfs dfs -copyFromLocal /tmp/sample_doc.pdf /user/hadoop/

In [39]:
pdf = '/user/hadoop/sample_doc.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [40]:
pdf_example_df.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|hdfs://ip-172-31-...|2023-10-30 20:25:...|212973|[25 50 44 46 2D 3...|
+--------------------+--------------------+------+--------------------+

In [41]:
result = pdf_pipeline.transform(pdf_example_df)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [42]:
result.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+---------+
|                path|                 pdf|exception|
+--------------------+--------------------+---------+
|hdfs://ip-172-31-...|[25 50 44 46 2D 3...|         |
+--------------------+--------------------+---------+

In [43]:
pdf = result.select("pdf").head().pdf

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [44]:
with open("searchable.pdf", "wb") as pdfFile:
  pdfFile.write(pdf)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

99303

## Image processing after reading a pdf

In [45]:
from sparkocr.enums import *

# Read binary as image
pdf_to_image = PdfToImage()\
  .setInputCol("content")\
  .setOutputCol("image")\
  .setResolution(100)

# Binarize using adaptive tresholding
binarizer = ImageAdaptiveThresholding()\
  .setInputCol("image")\
  .setOutputCol("binarized_image")\
  .setBlockSize(91)\
  .setOffset(50)

# Apply morphology opening
opening = ImageMorphologyOperation()\
  .setKernelShape(KernelShape.SQUARE)\
  .setOperation(MorphologyOperationType.OPENING)\
  .setKernelSize(3)\
  .setInputCol("binarized_image")\
  .setOutputCol("opening_image")

# Remove small objects
remove_objects = ImageRemoveObjects()\
  .setInputCol("opening_image")\
  .setOutputCol("corrected_image")\
  .setMinSizeObject(130)

# Image Layout Analyzer for detect regions
image_layout_analyzer = ImageLayoutAnalyzer()\
  .setInputCol("corrected_image")\
  .setOutputCol("region")\

draw_regions = ImageDrawRegions()\
  .setInputCol("corrected_image")\
  .setInputRegionsCol("region")\
  .setOutputCol("image_with_regions")

# Run tesseract OCR for corrected image
ocr_corrected = ImageToText()\
  .setInputCol("corrected_image")\
  .setOutputCol("corrected_text")\
  .setPositionsCol("corrected_positions")\
  .setConfidenceThreshold(65)

# Run OCR for original image
ocr = ImageToText()\
  .setInputCol("image")\
  .setOutputCol("text")

# OCR pipeline
image_pipeline = PipelineModel(stages=[
    pdf_to_image,
    binarizer,
    opening,
    remove_objects,
    image_layout_analyzer,
    draw_regions,
    ocr,
    ocr_corrected
])

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [46]:
%%sh
wget -q -O /tmp/noised.pdf https://raw.githubusercontent.com/JohnSnowLabs/spark-ocr-workshop/master/jupyter/data/pdfs/noised.pdf

In [47]:
%%sh
hdfs dfs -copyFromLocal /tmp/noised.pdf /user/hadoop/

In [48]:
image_df = spark.read.format("binaryFile").load('/user/hadoop/noised.pdf').cache()
image_df.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+-------+--------------------+
|                path|    modificationTime| length|             content|
+--------------------+--------------------+-------+--------------------+
|hdfs://ip-172-31-...|2023-10-30 20:25:...|2115939|[25 50 44 46 2D 3...|
+--------------------+--------------------+-------+--------------------+

In [49]:
result = image_pipeline.transform(image_df).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [50]:
for r in result.distinct().collect():
    print("Original: %s" % r.path)
    display_image(r.image)

    print("Corrected: %s" % r.path)
    display_image(r.corrected_image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Original: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf
    Resolution: 100 dpi
    Width: 826 px
    Height: 1169 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=826x1169 at 0x7FA9554E3350>
Corrected: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf
    Resolution: 100 dpi
    Width: 826 px
    Height: 1169 px
    Mode: ImageType.TYPE_BYTE_BINARY
    Number of channels: 1
<PIL.Image.Image image mode=1 size=826x1169 at 0x7FA9554E3350>

### Results with original image

In [51]:
from termcolor import colored

grouped_results = result.groupBy("path", "pagenum").agg(F.concat_ws("", F.collect_list("text")).alias("text"))
for row in grouped_results.collect():
    print(colored("Filename:\n%s , page: %d" % (row.path, row.pagenum), "red"))
    print("Recognized text:\n%s" % row.text)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Filename:
hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf , page: 0
Recognized text:
 

Dat

 

 

 

 

 

Sample No. __5031 : 23 ;
* Original request made by Mr. C. b. Tucker, Jr. "on “ipojes
C Sample specifications written by John H. M. Bohlken
_ BLEND, CASING _—-RECASING FINAL FLAVOR MENTHOL FLAVOR

“ OLD GOLD STRAIGHT Tobacco Blend

 

Control for Sample No. 5030

   

 

OLD GOLD STRAIGHT

  
  

 

85 mm.
Circumference. 25.3. mm. 5
Paper ~~. Ecusta 556
Firnness OLD GOLD STRAIGHT .

 

OLD GOLD STRAIGHT :
OLD GOLD STRAIGHT  Wrappings: .

    

 

Labels OLD GOLD STRAIGHT
C . Filter Length. OLD GOLD STRAIGHT, Closures: Standard Blue
~ Tear Tape-- Gold
. Cartons OLD GOLD STRAIGHT

  

=. Requirement: 5 Markings-~

One Tray .

Sample number on each
pack and carton

 

bf a :
ple" 4.6: Lo
boratory Analysis: —~ 7 <A V/s . ew +
Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — 7-/¢-CF- @.( ssi/ek
Responsibility:

### Results with corrected image


In [52]:
grouped_results = result.groupBy("path", "pagenum").agg(F.concat_ws("", F.collect_list("corrected_text")).alias("corrected_text"))
for row in grouped_results.collect():
    print(colored("Filename:\n%s , page: %d" % (row.path, row.pagenum), "red"))
    print("Recognized text:\n%s" % row.corrected_text)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Filename:
hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf , page: 0
Recognized text:

In [53]:
result.columns

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

['path', 'modificationTime', 'length', 'image', 'total_pages', 'pagenum', 'documentnum', 'binarized_image', 'opening_image', 'corrected_image', 'region', 'image_with_regions', 'confidence', 'text', 'positions', 'confidence', 'exception', 'corrected_text', 'corrected_positions']

### Abbyy output

In [54]:
abbyy = """-----
% Date: 7/16/68
X*: I; * • ■ Sample No. 5031___ — .*
•* Original request made by _____Mr. C. L. Tucker, Jr. on
Sample specifications written by
BLEND CASING RECASING
OLD GOLD STRAIGHT Tobacco Blend
Control for Sample No. 5030
John H. M. Bohlken
FINAL FLAVOR
) 7/10/68
MENTHOL FLAVOR
• Cigarettes; * . .v\ . /,*, *, S •
Brand --------- OLD GOLD STRAIGHT -V . ••••
; . L e n g t h ------- — 85 mm. . : '
Circumference-- 25.3 mm. • ' *;. • •
P a p e r ---------- Ecusta 556 • * .
F i r m n e s s---- —— OLD GOLD STRAIGHT . ! •■'
D r a w ___________ OLD GOLD STRAIGHT
W e i g h t --------- 0LD GOLD STRAIGHT Wrappings: « -
Tipping Paper — — *
p H n f —. — — _ _ ~ L a b e l s ----OLD GOLD STRAIGHT
( • Filter Length-- . — Closures--- Standard Blue .
^ ^ ; • Tear Tape— Gold
Cartons --- OLD GOLD STRAIGHT
s Requirements: . - •' • Markings-- Sample number on each
• pack and carton Laboratory----- One Tray .
O t h e r s --------- * , s • • . 4
Laboratory A n a l ysis^ I " '/***• * 7 ' ^ ^
Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — J-ZZ-Zf'- (£. / •
Responsibility;
Tobacco B l e n d ------Manufacturing - A. Kraus . . * -
Filter Production--- —
• Making & P a c k i n g---Product Development , John H. M. Bohlken
Shipping -----------
Reports:
t
Written by — John H. M. Bohlken
Original to - Mr. C. L. Tucker, Jr.
Copies t o ---Dr. A. W. Spears
• 9 ..
"""

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

### Display original and corrected images with regions


In [55]:
for r in result.select("path","image","image_with_regions").distinct().collect():
    print("Original: %s" % r.path)
    display_image(r.image)

    print("Corrected: %s" % r.path)
    display_image(r.image_with_regions)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Original: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf
    Resolution: 100 dpi
    Width: 826 px
    Height: 1169 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=826x1169 at 0x7FA955468F10>
Corrected: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/noised.pdf
    Resolution: 0 dpi
    Width: 826 px
    Height: 1169 px
    Mode: ImageType.TYPE_BYTE_BINARY
    Number of channels: 3
<PIL.Image.Image image mode=1 size=826x1169 at 0x7FA955468110>

## Image (or Natural Scene) to Text

In [56]:
%%sh
wget -q -O /tmp/text_with_noise.png https://raw.githubusercontent.com/JohnSnowLabs/spark-ocr-workshop/master/jupyter/data/images/text_with_noise.png

In [57]:
%%sh
hdfs dfs -copyFromLocal /tmp/text_with_noise.png /user/hadoop/text_with_noise.png

In [58]:
image_df = spark.read.format("binaryFile").load('/user/hadoop/text_with_noise.png').cache()

# Read binary as image
binary_to_image = BinaryToImage()
binary_to_image.setInputCol("content")
binary_to_image.setOutputCol("image")

# Scale image
scaler = ImageScaler()
scaler.setInputCol("image")
scaler.setOutputCol("scaled_image")
scaler.setScaleFactor(2.0)

# Binarize using adaptive tresholding
binarizer = ImageAdaptiveThresholding()
binarizer.setInputCol("scaled_image")
binarizer.setOutputCol("binarized_image")
binarizer.setBlockSize(71)
binarizer.setOffset(65)

remove_objects = ImageRemoveObjects()
remove_objects.setInputCol("binarized_image")
remove_objects.setOutputCol("cleared_image")
remove_objects.setMinSizeObject(400)
remove_objects.setMaxSizeObject(4000)

# Run OCR
ocr = ImageToText()
ocr.setInputCol("cleared_image")
ocr.setOutputCol("text")
ocr.setConfidenceThreshold(50)
ocr.setIgnoreResolution(False)

# OCR pipeline
noisy_pipeline = PipelineModel(stages=[
    binary_to_image,
    scaler,
    binarizer,
    remove_objects,
    ocr
])


result = noisy_pipeline \
.transform(image_df) \
.cache()


for r in result.distinct().collect():
    print("Original: %s" % r.path)
    display_image(r.image)
    print("Binarized")
    display_image(r.binarized_image)
    print("Removing objects")
    display_image(r.cleared_image)


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Original: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/text_with_noise.png

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/text_with_noise.png
    Resolution: 95 dpi
    Width: 1095 px
    Height: 134 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=1095x134 at 0x7FA9554CF490>
Binarized

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/text_with_noise.png
    Resolution: 95 dpi
    Width: 2190 px
    Height: 268 px
    Mode: ImageType.TYPE_BYTE_BINARY
    Number of channels: 1
<PIL.Image.Image image mode=1 size=2190x268 at 0x7FA955530DD0>
Removing objects

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/text_with_noise.png
    Resolution: 95 dpi
    Width: 2190 px
    Height: 268 px
    Mode: ImageType.TYPE_BYTE_BINARY
    Number of channels: 1
<PIL.Image.Image image mode=1 size

In [59]:
print("\n".join([row.text for row in result.select("text").collect()]))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Su la ba e de la g ande saue de Zeus a Olympe Phid as avai
pe ene les Douze D eux En ele Sole | (Hel os) et la Lune (Selene)
esdouze dv n es g oupees deux adeux s o donna en ens x couples

### Text from Scene

In [60]:
%%sh
wget -q -O tmp/natural_scene.jpeg https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ocr/natural_scene.jpeg

In [61]:
%%sh
hdfs dfs -copyFromLocal tmp/natural_scene.jpeg /user/hadoop/natural_scene.jpeg

In [62]:
image_df = spark.read.format("binaryFile").load('/user/hadoop/natural_scene.jpeg').cache()

# Apply morphology opening
morpholy_operation = ImageMorphologyOperation()
morpholy_operation.setKernelShape(KernelShape.DISK)
morpholy_operation.setKernelSize(5)
morpholy_operation.setOperation("closing")
morpholy_operation.setInputCol("cleared_image")
morpholy_operation.setOutputCol("corrected_image")

# Run OCR
ocr = ImageToText()
ocr.setInputCol("corrected_image")
ocr.setOutputCol("text")
ocr.setConfidenceThreshold(50)
ocr.setIgnoreResolution(False)

# OCR pipeline
scene_pipeline = PipelineModel(stages=[
    binary_to_image,
    scaler,
    binarizer,
    remove_objects,
    morpholy_operation,
    ocr
])

result = scene_pipeline \
.transform(image_df) \
.cache()


for r in result.distinct().collect():
    print("Original: %s" % r.path)
    display_image(r.image)
    print("Binarized")
    display_image(r.binarized_image)
    print("Removing objects")
    display_image(r.cleared_image)
    print("Morphology closing")
    display_image(r.corrected_image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Original: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/natural_scene.jpeg

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/natural_scene.jpeg
    Resolution: 0 dpi
    Width: 640 px
    Height: 480 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=640x480 at 0x7FA955464910>
Binarized

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/natural_scene.jpeg
    Resolution: 0 dpi
    Width: 1280 px
    Height: 960 px
    Mode: ImageType.TYPE_BYTE_BINARY
    Number of channels: 1
<PIL.Image.Image image mode=1 size=1280x960 at 0x7FA955464910>
Removing objects

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/natural_scene.jpeg
    Resolution: 0 dpi
    Width: 1280 px
    Height: 960 px
    Mode: ImageType.TYPE_BYTE_BINARY
    Number of channels: 1
<PIL.Image.Image image mode=1 size=1280x960

## DOCX Processing (version 1.10.0)

### Read DOCX document as binary file

In [63]:
%%sh
wget -q -O /tmp/doc2.docx https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/visual-nlp/data/doc2.docx

In [64]:
%%sh
hdfs dfs -copyFromLocal /tmp/doc2.docx /user/hadoop/doc2.docx

In [65]:
doc_example_df = spark.read.format("binaryFile").load("/user/hadoop/doc2.docx").cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

### DocxtoText

#### Extract text using DocToText transformer

In [66]:
from sparkocr.transformers import *

doc_to_text = DocToText()
doc_to_text.setInputCol("content")
doc_to_text.setOutputCol("text")

result = doc_to_text.transform(doc_example_df)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

#### Display result DataFrame

In [67]:
result.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+------+--------------------+---------+-------+
|                path|    modificationTime|length|                text|exception|pagenum|
+--------------------+--------------------+------+--------------------+---------+-------+
|hdfs://ip-172-31-...|2023-10-30 20:26:...| 33260|Sample Document\n...|     null|      0|
+--------------------+--------------------+------+--------------------+---------+-------+

#### Display extracted text

In [68]:
print("\n".join([row.text for row in result.select("text").collect()]))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Sample Document
This document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should be completely accessible using assistive technologies such as screen readers.
Headings
There are eight section headings in this document. At the beginning, "Sample Document" is a level 1 heading. The main section headings, such as "Headings" and "Lists" are level 2 headings. The Tables section contains two sub-headings, "Simple Table" and "Complex Table," which are both level 3 headings.
Lists
The following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, "Tables," contains a nested unordered (bulleted) list with two items.
Headings 
Lists 
Links 
Images 
Tables 
Simple Tables 
Complex Tables 
Columns 
Links
In web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word documents or PDFs:
Top of this Page
Sample Document
Sa

### DocxToTextTable
#### (Extracting table data from Microsoft DOCX documents)

#### Preview document using DocToPdf and PdfToImage transformers

In [69]:
image_df = PdfToImage().setResolution(100).transform(DocToPdf().setOutputCol("content").transform(doc_example_df))
for r in image_df.select("image").collect():
    display_image(r.image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/doc2.docx
    Resolution: 100 dpi
    Width: 849 px
    Height: 1100 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=849x1100 at 0x7FA955486850>

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/doc2.docx
    Resolution: 100 dpi
    Width: 849 px
    Height: 1100 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=849x1100 at 0x7FA955464E10>

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/doc2.docx
    Resolution: 100 dpi
    Width: 849 px
    Height: 1100 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=849x1100 at 0x7FA955464E10>

#### Extract text using DocToText transformer

In [70]:
doc_to_table = DocToTextTable()
doc_to_table.setInputCol("content")
doc_to_table.setOutputCol("tables")

result = doc_to_table.transform(doc_example_df)

result.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+------+--------------------+---------+
|                path|    modificationTime|length|              tables|exception|
+--------------------+--------------------+------+--------------------+---------+
|hdfs://ip-172-31-...|2023-10-30 20:26:...| 33260|{{0, 0, 0.0, 0.0,...|     null|
|hdfs://ip-172-31-...|2023-10-30 20:26:...| 33260|{{1, 0, 0.0, 0.0,...|     null|
+--------------------+--------------------+------+--------------------+---------+

In [71]:
result.select(result["tables.chunks"].getItem(3)["chunkText"]).show(truncate=False)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-----------------------------------------+
|tables.chunks AS chunks#9817[3].chunkText|
+-----------------------------------------+
|[Window-Eyes, 214, 12%]                  |
|[NVDA, 238, 14%, 105, 9% ]               |
+-----------------------------------------+

#### Display extracted data in JSON format

In [72]:
import json
df_json = result.select("tables").toJSON()
for row in df_json.collect():
    print(json.dumps(json.loads(row), indent=4))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

{
    "tables": {
        "area": {
            "index": 0,
            "page": 0,
            "x": 0.0,
            "y": 0.0,
            "width": 0.0,
            "height": 0.0,
            "score": 0.0,
            "label": "0",
            "angle": 0.0
        },
        "chunks": [
            [
                {
                    "chunkText": "Screen Reader",
                    "x": 0.0,
                    "y": 0.0,
                    "width": 90.0,
                    "height": 0.0
                },
                {
                    "chunkText": "Responses",
                    "x": 0.0,
                    "y": 0.0,
                    "width": 95.75,
                    "height": 0.0
                },
                {
                    "chunkText": "Share",
                    "x": 0.0,
                    "y": 0.0,
                    "width": 95.75,
                    "height": 0.0
                }
            ],
            [
                {
              

## Text to Pdf

In [73]:
def pipeline():
    # Transforrm PDF document to images per page
    pdf_to_image = PdfToImage() \
        .setInputCol("content") \
        .setOutputCol("image") \
        .setResolution(100) \
        .setKeepInput(True)

    # Run OCR
    ocr = ImageToText() \
        .setInputCol("image") \
        .setOutputCol("text") \
        .setConfidenceThreshold(60) \
        .setIgnoreResolution(False) \
        .setPageSegMode(PageSegmentationMode.SPARSE_TEXT)

    # Render results to PDF
    textToPdf = TextToPdf() \
        .setInputCol("positions") \
        .setInputImage("image") \
        .setOutputCol("pdf")

    pipeline = PipelineModel(stages=[
        pdf_to_image,
        ocr,
        textToPdf
    ])

    return pipeline

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [74]:
# !wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ocr/MT_00.pdf

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [75]:
%%sh
wget -q -O /tmp/test_document.pdf https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ocr/test_document.pdf

In [76]:
%%sh
hdfs dfs -copyFromLocal /tmp/test_document.pdf /user/hadoop/test_document.pdf

In [77]:
# pdf_example_df = spark.read.format("binaryFile").load('MT_00.pdf').cache()
pdf_example_df = spark.read.format("binaryFile").load('/user/hadoop/test_document.pdf').cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [78]:
result = pipeline().transform(pdf_example_df).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [79]:
result.columns

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

['path', 'text', 'pdf', 'exception']

In [80]:
display_image(PdfToImage().setResolution(100).transform(pdf_example_df).select("image").collect()[0].image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/test_document.pdf
    Resolution: 100 dpi
    Width: 1674 px
    Height: 2205 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=1674x2205 at 0x7FA955482D90>

In [81]:
# Store results to pdf file
pdf = result.select("pdf").head().pdf

pdfFile = open("result.pdf", "wb")

pdfFile.write(pdf)

pdfFile.close()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [82]:
# Convert pdf to image and display¶

image_df = PdfToImage() \
    .setInputCol("pdf") \
    .setOutputCol("image") \
    .setResolution(100) \
    .transform(result.select("pdf", "path"))

for r in image_df.collect():
    display_image(r.image)


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/test_document.pdf
    Resolution: 100 dpi
    Width: 1777 px
    Height: 2444 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=1777x2444 at 0x7FA9553B0B50>

    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/test_document.pdf
    Resolution: 100 dpi
    Width: 1777 px
    Height: 2333 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=1777x2333 at 0x7FA9553B0E90>

## Working with PPT Documents

### Read PPT document

In [83]:
%%sh
wget -q -O /tmp/Spark_NLP_NER.pptx https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ocr/Spark_NLP_NER.pptx

In [84]:
%%sh
hdfs dfs -copyFromLocal /tmp/Spark_NLP_NER.pptx /user/hadoop/Spark_NLP_NER.pptx

In [85]:
ppt_example_df = spark.read.format("binaryFile").load('/user/hadoop/Spark_NLP_NER.pptx').cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [86]:
#Read PPT document as binary file¶

# convert PPT to PDF
pdf_df = PptToPdf() \
    .setOutputCol("content") \
    .transform(ppt_example_df)

# Convert PDF to image for display
image_df = PdfToImage() \
    .setImageType(ImageType.TYPE_3BYTE_BGR) \
    .setResolution(100) \
    .transform(pdf_df)

display_images(image_df)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/Spark_NLP_NER.pptx
    Resolution: 100 dpi
    Width: 826 px
    Height: 1169 px
    Mode: ImageType.TYPE_3BYTE_BGR
    Number of channels: 3
<PIL.Image.Image image mode=RGB size=826x1169 at 0x7FA955498DD0>

### Extracting table data from PPT documents

In [87]:
from sparkocr.transformers import *
from sparkocr.utils import display_images, display_tables, display_pdf
from pyspark.sql.functions import collect_list,col

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [88]:
# Preview document using PptToPdf and PdfToImage transformers¶
image_df = PptToPdf().setOutputCol("content").transform(ppt_example_df)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [89]:
#Extract tables from PPT using PptToTextTable transformer¶

ppt_to_table = PptToTextTable()
ppt_to_table.setInputCol("content")
ppt_to_table.setOutputCol("table")

result = ppt_to_table.transform(ppt_example_df).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [90]:
result.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+--------------------+-------+--------------------+---------+-------+
|                path|    modificationTime| length|               table|exception|pagenum|
+--------------------+--------------------+-------+--------------------+---------+-------+
|hdfs://ip-172-31-...|2023-10-30 20:26:...|4490997|{{0, 0, 304.01157...|     null|      0|
+--------------------+--------------------+-------+--------------------+---------+-------+

In [91]:
display_tables(result)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Filename: Spark_NLP_NER.pptx
Page:     0
Table:    0
6
                     col0       col1   col2  col3     col4          col5
0             NLP Feature  Spark NLP  spaCy  NLTK  CoreNLP  Hugging Face
1            Tokenization        Yes    Yes   Yes      Yes           Yes
2   Sentence segmentation        Yes    Yes   Yes      Yes            No
3                Steeming        Yes    Yes   Yes      Yes            No
4           Lemmatization        Yes    Yes   Yes      Yes            No
5             POS tagging        Yes    Yes   Yes      Yes            No
6      Entity recognition        Yes    Yes   Yes      Yes           Yes
7              Dep parser        Yes    Yes   Yes      Yes            No
8            Text matcher        Yes    Yes    No       No            No
9            Date matcher        Yes     No    No       No            No
10     Sentiment detector        Yes     No   Yes      Yes           Yes
11    Text classification        Yes    Yes   Yes       No           

## Dicom to Image

In [92]:
%%sh
wget -q -O /tmp/dicom_5.dcm https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/master/jupyter/data/dicom/deidentify-medical-2.dcm

In [93]:
%%sh
hdfs dfs -copyFromLocal /tmp/dicom_5.dcm  /user/hadoop/dicom_5.dcm 

In [94]:
dicom_path = '/user/hadoop/dicom_5.dcm'

# Read dicom file as binary file
dicom_df = spark.read.format("binaryFile").load(dicom_path)

dicomToImage = DicomToImage() \
  .setInputCol("content") \
  .setOutputCol("image") \
  .setMetadataCol("meta")

data = dicomToImage.transform(dicom_df)

for image in data.collect():
      display_image(image.image)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/dicom_5.dcm
    Resolution: 0 dpi
    Width: 914 px
    Height: 985 px
    Mode: ImageType.TYPE_BYTE_GRAY
    Number of channels: 1
<PIL.Image.Image image mode=L size=914x985 at 0x7FA955475A90>

In [95]:
# Extract text from image
ocr = ImageToText() \
    .setInputCol("image") \
    .setOutputCol("text") \
    .setIgnoreResolution(False) \
    .setOcrParams(["preserve_interword_spaces=0"])

print("\n".join([row.text for row in ocr.transform(data).select("text").collect()]))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Fr: 1/19 Name: Good, Guy
Acc#: AccessionNumber PID: 125-98-445

Study Date: go 6/4 999 Sex: PatientSex
_ DOB:,08/02/1929

pur

 

LEADTOOLS

## Spark OCR for recognize text and store results to HOCR

In [96]:
# Transforrm PDF document to images per page
pdf_to_image = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setResolution(100) \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

# Run OCR
ocr = ImageToHocr() \
    .setInputCol("image") \
    .setOutputCol("hocr") \
    .setIgnoreResolution(False)

document_assembler = HocrDocumentAssembler() \
    .setInputCol("hocr") \
    .setOutputCol("document")

tokenizer = HocrTokenizer() \
    .setInputCol("hocr") \
    .setOutputCol("token") \

draw_annotations = ImageDrawAnnotations() \
    .setInputCol("image") \
    .setInputChunksCol("token") \
    .setOutputCol("image_with_annotations") \
    .setFilledRect(False) \
    .setFontSize(10) \
    .setRectColor(Color.red)

pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr,
    document_assembler,
    tokenizer,
    draw_annotations
])

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [97]:
%%sh
wget -q -O /tmp/test_document.pdf https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/visual-nlp/data/budget.pdf

In [98]:
%%sh
hdfs dfs -copyFromLocal /tmp/test_document.pdf /user/hadoop/test_document.pdf

copyFromLocal: `/user/hadoop/test_document.pdf': File exists


CalledProcessError: Command 'b'hdfs dfs -copyFromLocal /tmp/test_document.pdf /user/hadoop/test_document.pdf\n'' returned non-zero exit status 1.

In [99]:
pdf_example_df = spark.read.format("binaryFile").load("/user/hadoop/test_document.pdf").cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [100]:
result = pipeline.transform(pdf_example_df).cache()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [107]:
result.select("pagenum", "hocr").show(truncate=140)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------------------------------------------------------------------------------------------------------------------------------------------+
|pagenum|                                                                                                                                        hocr|
+-------+--------------------------------------------------------------------------------------------------------------------------------------------+
|      0|  <div class='ocr_page' id='page_1' title='image "unknown"; bbox 0 0 1674 2205; ppageno 0; scan_res 100 100'>\n   <div class='ocr_carea' ...|
|      1|  <div class='ocr_page' id='page_1' title='image "unknown"; bbox 0 0 1691 2199; ppageno 0; scan_res 100 100'>\n   <div class='ocr_carea' ...|
+-------+--------------------------------------------------------------------------------------------------------------------------------------------+

In [102]:
display_images(result, "image_with_annotations", width=1000)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    Image #0:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/test_document.pdf
    Resolution: 0 dpi
    Width: 1674 px
    Height: 2205 px
    Mode: ImageType.TYPE_3BYTE_BGR
    Number of channels: 3
<PIL.Image.Image image mode=RGB size=1674x2205 at 0x7FA955482F90>

    Image #1:
    Origin: hdfs://ip-172-31-20-250.us-east-2.compute.internal:8020/user/hadoop/test_document.pdf
    Resolution: 0 dpi
    Width: 1691 px
    Height: 2199 px
    Mode: ImageType.TYPE_3BYTE_BGR
    Number of channels: 3
<PIL.Image.Image image mode=RGB size=1691x2199 at 0x7FA955422090>

In [103]:
from IPython.core.display import display, HTML
display(HTML(result.select("hocr").collect()[0].hocr))

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<IPython.core.display.HTML object>

## Text Detection in an Image using Regex Patterns

In [None]:
import pkg_resources
from pyspark.ml import PipelineModel
import pyspark.sql.functions as f
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_images

In [None]:
%%sh
wget -q -O tmp/image.jpg https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/visual-nlp/data/020_Yas_patella.jpg

In [None]:
%%sh
hdfs dfs -copyFromLocal tmp/image.jpg /user/hadoop/image.jpg

In [None]:
image_df = spark.read.format("binaryFile").load("/user/hadoop/image.jpg")

display_images(BinaryToImage().setImageType(ImageType.TYPE_3BYTE_BGR).transform(image_df), "image")

In [None]:
table_detector = ImageTableDetector \
        .pretrained("general_model_table_detection_v2", "en", "public/ocr/models")\
        .setInputCol("image")\
        .setOutputCol("region")\
        .setScoreThreshold(0.9)\
        .setApplyCorrection(True)\
        .setScaleWidthToCol("width_dimension")\
        .setScaleHeightToCol("height_dimension")

In [None]:
spark

In [None]:
binary_to_image = BinaryToImage()
binary_to_image.setImageType(ImageType.TYPE_3BYTE_BGR)

text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr")
text_detector.setInputCol("image")
text_detector.setOutputCol("text_regions")
text_detector.setSizeThreshold(10)
text_detector.setScoreThreshold(0.9)
text_detector.setLinkThreshold(0.4)
text_detector.setTextThreshold(0.2)
text_detector.setWidth(1512)


draw_regions = ImageDrawRegions()
draw_regions.setInputCol("image")
draw_regions.setInputRegionsCol("text_regions")
draw_regions.setOutputCol("image_with_regions")
draw_regions.setRectColor(Color.green)
draw_regions.setRotated(True)

pipeline = PipelineModel(stages=[
    binary_to_image,
    text_detector,
    draw_regions
])

In [None]:
result =  pipeline.transform(image_df).cache()
display_images(result, "image_with_regions")

In [None]:
# Define a new alphabet

symbols = """:$&(){}[]?/\\!><@=#-;,%_“.|'`"*#^+~€"""
numbers = "0123456789"
englishAlphabet = "abcdefghijklmnopqrstuvwxyz"
special = "β¢£©®—"

chars = symbols + numbers + englishAlphabet + englishAlphabet.upper() +special

with open('./tmp/custom_alphabet.txt', 'w') as alphabet_file:
    alphabet_file.write(chars)

In [None]:
entities =[
    {
        "id": "ref",
        "label": "REF",
        "patterns": ["\\d{4}-\\d{2}-\\d{3}"],
        "regex": True
    },
    {
        "id": "date",
        "label": "DATE",
        "patterns": ["\\d{4}-\\d{2}-\\d{2}"],
        "regex": True
    },
    {
        "id": "lot",
        "label": "LOT",
        "patterns": ["\\d{7}"],
        "regex": True
    }
]

with open('./tmp/entities.json', 'w') as jsonfile:
    json.dump(entities, jsonfile)

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *

splitter = ImageSplitRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("text_regions") \
    .setOutputCol("text_image") \
    .setDropCols(["image"]) \
    .setExplodeCols(["text_regions"]) \
    .setRotated(True) \
    .setImageType(ImageType.TYPE_BYTE_GRAY)

ocr = ImageToText() \
    .setInputCol("text_image") \
    .setOutputCol("text") \
    .setPageSegMode(PageSegmentationMode.SINGLE_WORD) \
    .setIgnoreResolution(False)

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

entityRuler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource("entities.json") \
    .setAlphabetResource("./custom_alphabet.txt")

pipeline_nlp = Pipeline().setStages([
    splitter,
    ocr,
    documentAssembler,
    tokenizer,
    entityRuler
])

text_result = pipeline_nlp.fit(result).transform(result).cache()

In [None]:
text_result.selectExpr("explode(entities)").show(truncate=False)

In [None]:
print(("").join([x.text for x in text_result.select("text").collect()]))