![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/5.Spark_OCR.ipynb.ipynb)

# Spark OCR

### Work in progress: this notebook will be enriched.

## Colab Setup

In [None]:
import json

with open('license_keys.json') as f_in:
    license_keys = json.load(f_in)

license_keys.keys()

In [None]:
# template for license_keys.json

{'secret':"xxx",
'SPARK_NLP_LICENSE': 'aaa',
'JSL_OCR_LICENSE': 'bbb',
'AWS_ACCESS_KEY_ID':"ccc",
'AWS_SECRET_ACCESS_KEY':"ddd",
'JSL_OCR_SECRET':"eee"}
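
Before installing anything, it may help to confirm that the loaded keys contain the entries this notebook reads later (`JSL_OCR_SECRET` and `JSL_OCR_LICENSE`). The following is a minimal sketch under that assumption; `REQUIRED_KEYS`, `missing_license_keys`, and the sample values are hypothetical names introduced here for illustration and are not part of Spark OCR.

```python
import json

# Keys this notebook reads further down; adjust if your setup differs (assumption).
REQUIRED_KEYS = ("JSL_OCR_SECRET", "JSL_OCR_LICENSE")


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required key names that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]


# Stand-in for the parsed contents of license_keys.json (placeholder values only).
sample = json.loads('{"JSL_OCR_SECRET": "eee", "JSL_OCR_LICENSE": ""}')
missing = missing_license_keys(sample)
print(missing)
```

In the notebook itself, the same helper could be called on `license_keys` right after loading the file, so that a missing entry is reported before the install and session start cells run.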

In [None]:
ocr_version = '1.2.0'

secret = license_keys['JSL_OCR_SECRET']

%pip install spark-ocr==$ocr_version --user --extra-index-url=https://pypi.johnsnowlabs.com/$secret --upgrade

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

In [None]:
import sparkocr
import sys
from pyspark.sql import SparkSession
from sparkocr import start
import os

os.environ['JSL_OCR_LICENSE'] = license_keys['JSL_OCR_LICENSE']

spark = start(secret=secret)
spark

In [None]:
!wget 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'


--2020-04-10 14:45:06--  http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf
Resolving www.asx.com.au (www.asx.com.au)... 203.15.147.66
Connecting to www.asx.com.au (www.asx.com.au)|203.15.147.66|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf [following]
--2020-04-10 14:45:07--  https://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf
Connecting to www.asx.com.au (www.asx.com.au)|203.15.147.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 212973 (208K) [application/pdf]
Saving to: ‘43nyyw9r820c6r.pdf’


2020-04-10 14:45:08 (487 KB/s) - ‘43nyyw9r820c6r.pdf’ saved [212973/212973]



In [None]:
import base64
from sparkocr.transformers import *
from pyspark.ml import PipelineModel

def pipeline():
    
    # Transform PDF document to images per page
    pdf_to_image = PdfToImage()
    pdf_to_image.setInputCol("content")
    pdf_to_image.setOutputCol("image")

    # Run tesseract OCR
    ocr = TesseractOcr()
    ocr.setInputCol("image")
    ocr.setOutputCol("text")
    ocr.setConfidenceThreshold(65)
    
    pipeline = PipelineModel(stages=[
        pdf_to_image,
        ocr
    ])
    
    return pipeline

In [None]:
pdf = '43nyyw9r820c6r.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf).cache()

In [None]:
result = pipeline().transform(pdf_example_df).cache()


In [None]:
result.select("pagenum","text", "confidence").show()

+-------+--------------------+-----------------+
|pagenum|                text|       confidence|
+-------+--------------------+-----------------+
|      0|ASX ANNOUNCEMENT
...|95.18117046356201|
+-------+--------------------+-----------------+



In [None]:
result.select("pagenum","text", "confidence").show()

+-------+--------------------+-----------------+
|pagenum|                text|       confidence|
+-------+--------------------+-----------------+
|      0|ASX ANNOUNCEMENT
...|95.26571559906006|
+-------+--------------------+-----------------+
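
The `confidence` column shown above can be used to flag pages that may need manual review. The sketch below works on plain `(pagenum, confidence)` pairs, which is the shape one would get from something like `result.select("pagenum", "confidence").collect()`. The helper name `low_confidence_pages`, the cutoff of 90, and the sample values are assumptions made for illustration, not values from this run.

```python
def low_confidence_pages(rows, threshold=90.0):
    """Return page numbers whose OCR confidence is below `threshold`."""
    return [pagenum for pagenum, confidence in rows if confidence < threshold]


# Hypothetical (pagenum, confidence) pairs standing in for collected rows.
rows = [(0, 95.18), (1, 71.4), (2, 88.0)]
flagged = low_confidence_pages(rows)
print(flagged)
```

A suitable cutoff depends on the documents being processed, so the default here is only a starting point.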



In [None]:
result.select("text").collect()

[Row(text='ASX ANNOUNCEMENT\n3 November 2017\n\nNotice Pursuant to Paragraph 708A(5)(e) of the Corporations Act\n2001 ("Act")\n\nDigitalX Limited (ASX:DCC) (DCC or the Company) confirms that the Company has today\nissued 620,000 Fully Paid Ordinary Shares (Shares) upon exercise of 620,000 Unlisted\nOptions exercisable at $0.0324 Expiring 14 September 2019 and 3,725,000 Shares upon\nexercise of 3,725,000 Unlisted Incentive Options exercisable at $0.08 expiring 10 February\n2018.\n\nThe Act restricts the on-sale of securities issued without disclosure, unless the sale is exempt\nunder section 708 or 708A of the Act. By giving this notice, a sale of the Shares noted above\nwill fall within the exemption in section 708A(5) of the Act.\n\nThe Company hereby notifies ASX under paragraph 708A(5)(e) of the Act that:\n(a) the Company issued the Shares without disclosure to investors under Part 6D.2 of\nthe Act;\n(b) as at the date of this notice, the Company has complied with the provisions of 

In [None]:
print("\n".join([row.text for row in result.select("text").collect()]))


ASX ANNOUNCEMENT
3 November 2017

Notice Pursuant to Paragraph 708A(5)(e) of the Corporations Act
2001 ("Act")

DigitalX Limited (ASX:DCC) (DCC or the Company) confirms that the Company has today
issued 620,000 Fully Paid Ordinary Shares (Shares) upon exercise of 620,000 Unlisted
Options exercisable at $0.0324 Expiring 14 September 2019 and 3,725,000 Shares upon
exercise of 3,725,000 Unlisted Incentive Options exercisable at $0.08 expiring 10 February
2018.

The Act restricts the on-sale of securities issued without disclosure, unless the sale is exempt
under section 708 or 708A of the Act. By giving this notice, a sale of the Shares noted above
will fall within the exemption in section 708A(5) of the Act.

The Company hereby notifies ASX under paragraph 708A(5)(e) of the Act that:
(a) the Company issued the Shares without disclosure to investors under Part 6D.2 of
the Act;
(b) as at the date of this notice, the Company has complied with the provisions of Chapter
2M of the Act as they 