![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ocr/PDF_TEXT_NER.ipynb)

# Recognize entities in scanned PDFs

To run this yourself, you will need to upload your **Spark OCR** license keys to the notebook. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload license keys, open the file explorer on the left side of the screen and upload `workshop_license_keys.json` to the folder that opens.

For more in-depth tutorials: https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter

## 1. Colab Setup

Read license key

In [1]:
import json
import os

from google.colab import files

license_keys = files.upload()
os.rename(list(license_keys.keys())[0], 'spark_ocr.json')

with open('spark_ocr.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

Saving spark_nlp_for_healthcare_spark_ocr_3565.json to spark_nlp_for_healthcare_spark_ocr_3565.json
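The later cells refer to several values from the license file by name (`PUBLIC_VERSION`, `JSL_VERSION`, `SECRET`, `OCR_VERSION`, `SPARK_OCR_SECRET`). A minimal sketch of a sanity check is shown below; the helper `missing_license_keys` is hypothetical and not part of any John Snow Labs library, and the key list is only inferred from how this notebook uses the values.

```python
import json

# Key names inferred from the install and session-start cells of this notebook;
# adjust the tuple if your license file uses different names.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET",
                 "OCR_VERSION", "SPARK_OCR_SECRET")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required key names that are absent or empty."""
    return [key for key in required if not license_keys.get(key)]


# Placeholder values for illustration only, not real credentials.
example = json.loads('{"PUBLIC_VERSION": "3.3.4", "SECRET": "xxx"}')
print(missing_license_keys(example))
```

Running a check like this right after `json.load` can surface an incomplete license file before the `pip install` commands fail with a less obvious error.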


Install Dependencies

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.0.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip -q install --upgrade spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark OCR
! pip install spark-ocr==$OCR_VERSION\+spark30 --extra-index-url=https://pypi.johnsnowlabs.com/$SPARK_OCR_SECRET --upgrade

<b><h1><font color='darkred'>!!! ATTENTION !!!</font></h1></b>

<b>After running the previous cell, <font color='darkred'>RESTART the COLAB RUNTIME</font> and then continue.</b>

Importing Libraries

In [1]:
import json, os

with open("spark_ocr.json", 'r') as f:
  license_keys = json.load(f)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

# Defining license key-value pairs as local variables
locals().update(license_keys)

In [2]:
import pandas as pd
import numpy as np
import os

#Pyspark Imports
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import functions as F

# Necessary imports from Spark OCR library
import sparkocr
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_image, to_pil_image
from sparkocr.metrics import score
import pkg_resources

# import sparknlp packages
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp_jsl
from sparknlp_jsl.annotator import *


Start Spark Session

In [3]:
# Start spark
spark = sparkocr.start(secret=SPARK_OCR_SECRET, 
                       nlp_version=PUBLIC_VERSION,
                       nlp_secret=SECRET,
                       nlp_internal=JSL_VERSION
                       )
spark

Spark version: 3.0.2
Spark NLP version: 3.3.4
Spark OCR version: 3.9.1



## 2. Download and read the scanned image
**To process a PDF instead, download it and use the `PdfToImage` transformer in place of `BinaryToImage` in the pipeline.**

In [4]:
!wget https://www.reneelab.com/wp-content/uploads/sites/2/2015/11/target-500x600.png -O 1.jpg

--2022-01-10 17:25:24--  https://www.reneelab.com/wp-content/uploads/sites/2/2015/11/target-500x600.png
Resolving www.reneelab.com (www.reneelab.com)... 172.66.43.113, 172.66.40.143, 2606:4700:3108::ac42:2b71, ...
Connecting to www.reneelab.com (www.reneelab.com)|172.66.43.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [image/png]
Saving to: ‘1.jpg’

1.jpg                   [ <=>                ] 141.88K  --.-KB/s    in 0.03s   

2022-01-10 17:25:24 (5.35 MB/s) - ‘1.jpg’ saved [145284]



In [5]:
image_df = spark.read.format("binaryFile").load('./1.jpg').cache()
image_df.show()

+-------------------+-------------------+------+--------------------+
|               path|   modificationTime|length|             content|
+-------------------+-------------------+------+--------------------+
|file:/content/1.jpg|2016-12-19 13:28:45|145284|[89 50 4E 47 0D 0...|
+-------------------+-------------------+------+--------------------+
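The `content` column above begins with `89 50 4E 47 0D 0...`, which is the start of the PNG file signature, so the file saved as `1.jpg` is actually a PNG. The pipeline in this notebook evidently reads it anyway, but when mixing PDFs and images it can help to check the bytes rather than trust the extension. The sketch below uses only the standard library; `sniff_format` is a hypothetical helper, not a Spark OCR function.

```python
# Well-known leading bytes ("magic numbers") for a few file formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
}


def sniff_format(content: bytes) -> str:
    """Guess a file format from its first bytes; 'unknown' if no match."""
    for magic, name in SIGNATURES.items():
        if content.startswith(magic):
            return name
    return "unknown"


# The full 8-byte PNG signature, matching the prefix shown in the table above.
header = bytes.fromhex("89504E470D0A1A0A")
print(sniff_format(header))  # png
```

With a result like this, a PDF could be routed to a `PdfToImage`-based pipeline and an image to the `BinaryToImage` one.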



## 3. Construct OCR & NLP pipelines

OCR Pipeline

In [6]:
# To load a PDF instead of an image, use PdfToImage:
#pdf_to_image = PdfToImage() \
#            .setInputCol("content") \
#            .setOutputCol("image_raw") \
#            .setKeepInput(True)

# Read binary as image
binary_to_image = BinaryToImage()
binary_to_image.setInputCol('content')
binary_to_image.setOutputCol('image')

# Scale image
scaler = ImageScaler()
scaler.setInputCol('image')
scaler.setOutputCol('scaled_image')
scaler.setScaleFactor(2.0)

# Binarize using adaptive thresholding
binarizer = ImageAdaptiveThresholding()
binarizer.setInputCol('scaled_image')
binarizer.setOutputCol('binarized_image')
binarizer.setBlockSize(91)
binarizer.setOffset(70)

# Remove extraneous objects from image
remove_objects = ImageRemoveObjects()
remove_objects.setInputCol('binarized_image')
remove_objects.setOutputCol('cleared_image')
remove_objects.setMinSizeObject(30)
remove_objects.setMaxSizeObject(4000)

# Apply morphological closing
morpholy_operation = ImageMorphologyOperation()
morpholy_operation.setKernelShape(KernelShape.DISK)
morpholy_operation.setKernelSize(1)
morpholy_operation.setOperation('closing')
morpholy_operation.setInputCol('cleared_image')
morpholy_operation.setOutputCol('corrected_image')

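# Extract text with OCR. Note: this stage reads 'binarized_image', so the output of the
# cleanup stages above ('corrected_image') is not used unless you change the input column.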
ocr = ImageToText()
ocr.setInputCol('binarized_image')
ocr.setOutputCol('text')
ocr.setConfidenceThreshold(50)
ocr.setIgnoreResolution(False)

# Create pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    scaler,
    binarizer,
    remove_objects,
    morpholy_operation,
    ocr])
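To give some intuition for the `setBlockSize` and `setOffset` parameters of the binarization stage, here is a minimal pure-Python sketch of local-mean adaptive thresholding. It is a generic illustration under stated assumptions (threshold = local mean minus offset, window clipped at the borders), not the implementation used by `ImageAdaptiveThresholding`, whose internals are not described in this notebook.

```python
def adaptive_threshold(gray, block_size, offset):
    """Binarize a 2D list of grayscale values (0-255).

    A pixel becomes 255 when it is brighter than (local mean - offset),
    where the mean is taken over a block_size x block_size window
    clipped at the image borders; otherwise it becomes 0.
    """
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = sum(window) / len(window) - offset
            out[y][x] = 255 if gray[y][x] > threshold else 0
    return out


# A bright 3x3 patch with one dark "ink" pixel in the middle.
patch = [[200, 200, 200],
         [200, 50, 200],
         [200, 200, 200]]
for row in adaptive_threshold(patch, block_size=3, offset=10):
    print(row)
```

Because the threshold follows the local neighborhood, uneven lighting across a scanned page affects the result less than a single global threshold would. In this sketch, a larger block averages over more of the page and a larger offset lowers the threshold, so more pixels come out white.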



NLP Pipeline containing **Spell Correction** and **NER**

In [7]:
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")
    
embeddings = WordEmbeddingsModel.pretrained('glove_100d').\
                    setInputCols(["document", 'checked']).\
                    setOutputCol("embeddings")

public_ner = NerDLModel.pretrained('onto_100', 'en') \
          .setInputCols(["document", "token", "embeddings"]) \
          .setOutputCol("ner")

ner_converter = NerConverter() \
                .setInputCols(["document", "token", "ner"]) \
                  .setOutputCol("ner_chunk")

nlp_pipeline =  Pipeline(stages=[documentAssembler, 
                                tokenizer,
                                spellModel,
                                embeddings,
                                public_ner,
                                ner_converter])

spellcheck_dl download started this may take some time.
Approximate size to download 111.4 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
[OK!]


## 4. Run OCR pipeline

In [8]:
result = pipeline.transform(image_df).cache()

## 5. Visualize Results

Display result dataframe

In [9]:
result.select("text", "confidence").show()

+--------------------+----------------+
|                text|      confidence|
+--------------------+----------------+
|ADVERTISEMENT.

T...|91.8821029663086|
+--------------------+----------------+



Display recognized text

In [10]:
result_arr = []
for r in result.distinct().collect():
  print(r.text)
  result_arr.append(r.text)

ADVERTISEMENT.

Tuts publication of the Works of Jonn Kwox, it is
supposed, will extend to Five Volumes. It was thought
advisable to commence the series with his History of
the Reformation in Scotland, as the work of greatest
importance. The next volume will thus contain the
Third and Fourth Books, which continue the History to
the year 1564; at which period his historical labours
may be considered to terminate. But the Fifth Book,
forming a sequel to the History, and published under
his name in 1644, will also be included. His Letters
and Miscellancous Writings will be arranged in the
subsequent volumes, as nearly as possible in chronolo-
gical order; each portion being introduced by a separate
avtice, respecting the manuscript or printed copies from
which they have been taken.

It may perhaps be expected that a Life of the Author
thould have been prefixed to this volume. The Life of
Knox, by Dr. M-Crig, is however a work so universally
known, and of so much historical value, as to su

## 6. Run NLP pipeline

In [11]:
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlp_pipeline.fit(empty_df)
df = spark.createDataFrame(pd.DataFrame({"text":result_arr}))
nlp_result = pipelineModel.transform(df)
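The OCR text keeps the page's line breaks, including words hyphenated across lines (for example `chronolo-` followed by `gical`). The notebook passes the text to the NLP pipeline unchanged; as an optional preprocessing idea, the sketch below rejoins such words with a regular expression. `join_hyphenated_lines` is a hypothetical helper, and the heuristic is an assumption: it would also remove the hyphen from a genuine compound that happens to break at a line end.

```python
import re


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words split by a hyphen at a line end, e.g. 'chronolo-' + 'gical'."""
    # A word character, a hyphen, a newline, then another word character:
    # drop the hyphen and the newline, keep both characters.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "as nearly as possible in chronolo-\ngical order"
print(join_hyphenated_lines(sample))
```

If used, it could be applied to each string in `result_arr` before the DataFrame `df` is built. Hyphens inside a line, as in `M-Crig`, are left untouched.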

## 7. Visualize NLP results

Contextual Spell Correction

In [12]:
nlp_result.select(F.explode(F.arrays_zip('token.result', 'checked.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("original"),
        F.expr("cols['1']").alias("corrected")).show(truncate=False)

+-------------+-------------+
|original     |corrected    |
+-------------+-------------+
|ADVERTISEMENT|ADVERTISEMENT|
|.            |.            |
|Tuts         |puts         |
|publication  |publication  |
|of           |of           |
|the          |the          |
|Works        |Works        |
|of           |of           |
|Jonn         |John         |
|Kwox         |Knox         |
|,            |,            |
|it           |it           |
|is           |is           |
|supposed     |supposed     |
|,            |,            |
|will         |will         |
|extend       |extend       |
|to           |to           |
|Five         |Five         |
|Volumes      |Volumes      |
+-------------+-------------+
only showing top 20 rows
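Most rows in the table above are unchanged, so it can be easier to review only the tokens the spell checker altered. The sketch below does this in plain Python on the first ten rows copied from the output above; `changed_tokens` is a hypothetical helper, and in practice the pairs would come from collecting the Spark DataFrame rather than being typed in.

```python
# First ten (original, corrected) pairs copied from the table above.
pairs = [
    ("ADVERTISEMENT", "ADVERTISEMENT"),
    (".", "."),
    ("Tuts", "puts"),
    ("publication", "publication"),
    ("of", "of"),
    ("the", "the"),
    ("Works", "Works"),
    ("of", "of"),
    ("Jonn", "John"),
    ("Kwox", "Knox"),
]


def changed_tokens(token_pairs):
    """Keep only the pairs where the spell checker changed the token."""
    return [(orig, fixed) for orig, fixed in token_pairs if orig != fixed]


for orig, fixed in changed_tokens(pairs):
    print(f"{orig} -> {fixed}")
```

Reviewing the changes this way shows both useful fixes (`Jonn` to `John`, `Kwox` to `Knox`) and a questionable one: `Tuts` becomes `puts`, although the scanned sentence most likely begins with "This".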



NER 

In [13]:

nlp_result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
        F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)


+--------------------------------------+-----------+
|chunk                                 |ner_label  |
+--------------------------------------+-----------+
|the Works of Jonn Kwox                |WORK_OF_ART|
|Five Volumes                          |WORK_OF_ART|
|History of
the Reformation            |WORK_OF_ART|
|Scotland                              |GPE        |
|Third                                 |ORDINAL    |
|Fourth                                |ORDINAL    |
|History to
the year 1564              |DATE       |
|labours                               |ORG        |
|Fifth                                 |ORDINAL    |
|History                               |WORK_OF_ART|
|1644                                  |DATE       |
|His Letters
and Miscellancous Writings|WORK_OF_ART|
|a Life of the Author                  |WORK_OF_ART|
|The Life of
Knox                      |WORK_OF_ART|
|Dr                                    |PERSON     |
|M-Crig                                |PERSON