# Example of using Spark OCR with an HTTP source

## Install the spark-ocr Python package
Specify either the path to `spark-ocr-assembly-[version].jar` or a `secret`.

In [9]:
secret = ""
version = secret.split("-")[0]
spark_ocr_jar_path = "../../target/scala-2.11"
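
The cell above takes the package version from the part of the secret before the first hyphen. A minimal sketch of that step, assuming a secret shaped like `<version>-<key>`; the helper name `version_from_secret` and the example value are hypothetical placeholders, not a real secret:

```python
def version_from_secret(secret: str) -> str:
    """Return the text before the first hyphen, as the cell above does."""
    return secret.split("-")[0]


# Hypothetical placeholder, used only to illustrate the split.
example_secret = "1.0.0-0123456789abcdef"
print(version_from_secret(example_secret))

# An empty secret yields an empty version string.
print(repr(version_from_secret("")))
```

With an empty `secret`, `version` is also empty, so the PyPI install cell most likely needs a real secret to resolve a package version; the local jar path is the alternative.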

In [None]:
# Install from PyPI using the secret
%pip install spark-ocr==$version --extra-index-url=https://pypi.johnsnowlabs.com/$secret --force-reinstall
%pip install requests

In [None]:
# or install from local path
# %pip install ../dist/spark-ocr-[version].tar.gz

## Initialize the Spark session

In [10]:
from pyspark.sql import SparkSession
from sparkocr import start

spark = start(secret=secret, jar_path=spark_ocr_jar_path)
spark

SparkConf Configured, Starting to listen on port: 60321
JAR PATH:/usr/local/lib/python3.7/site-packages/sparkmonitor/listener.jar


## Import OCR transformers

In [19]:
import requests
import io
from sparkocr.transformers import *
from pyspark.ml import PipelineModel

## Define OCR transformers and pipeline

In [18]:
def pipeline():
    
    # Convert each page of the PDF document to an image
    pdf_to_image = PdfToImage()
    pdf_to_image.setInputCol("content")
    pdf_to_image.setOutputCol("image")

    # Run Tesseract OCR on each page image
    ocr = TesseractOcr()
    ocr.setInputCol("image")
    ocr.setOutputCol("text")
    ocr.setConfidenceThreshold(65)
    
    pipeline = PipelineModel(stages=[
        pdf_to_image,
        ocr
    ])
    
    return pipeline

## Read PDF document as binary file

In [26]:
url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
# Download the PDF over HTTP and fail early on an error status
response = requests.get(url)
response.raise_for_status()
my_raw_data = response.content
# Wrap the raw bytes in a bytearray so Spark stores them as a binary column
pdf_example_df = spark.createDataFrame([("file1", bytearray(my_raw_data))], ("path", "content"))
pdf_example_df.show()

+-----+--------------------+
| path|             content|
+-----+--------------------+
|file1|[25 50 44 46 2D 3...|
+-----+--------------------+
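
The `content` column starts with the bytes `25 50 44 46 2D`, which is ASCII for `%PDF-`, the signature at the start of a PDF file. Since an HTTP source can return something other than a PDF (an HTML error page, for example), a quick check on the downloaded bytes may be worthwhile before running OCR. A minimal sketch; the helper name `looks_like_pdf` is hypothetical and not part of Spark OCR:

```python
PDF_MAGIC = b"%PDF-"  # hex 25 50 44 46 2D


def looks_like_pdf(data) -> bool:
    """Return True if the payload begins with the PDF signature.

    Accepts bytes or bytearray. Only the header is checked, so this
    does not prove that the file is a valid PDF.
    """
    return data.startswith(PDF_MAGIC)


print(looks_like_pdf(b"%PDF-1.4\n"))
print(looks_like_pdf(b"<html><body>Not Found</body></html>"))
```

In the notebook this could be called as `looks_like_pdf(my_raw_data)` before creating the DataFrame.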



## Run the OCR pipeline

In [27]:
result = pipeline().transform(pdf_example_df).cache()

## Display results

In [28]:
result.select("pagenum", "text", "confidence").show()

+-------+--------------------+-----------------+
|pagenum|                text|       confidence|
+-------+--------------------+-----------------+
|      0|ASX ANNOUNCEMENT
...|91.96627892388238|
+-------+--------------------+-----------------+



### Display recognized text

In [29]:
print("\n".join([row.text for row in result.select("text").collect()]))

ASX ANNOUNCEMENT
3 November 2017

Notice Pursuant to Paragraph 708A(5)(e) of the Corporations Act
2001 ("Act")

DigitalX Limited (ASX:DCC) (DCC or the Company) confirms that the Company has today
issued 620,000 Fully Paid Ordinary Shares (Shares) upon exercise of 620,000 Unlisted
Options exercisable at $0.0324 Expiring 14 September 2019 and 3,725,000 Shares upon
exercise of 3,725,000 Unlisted Incentive Options exercisable at $0.08 expiring 10 February
2018.

The Act restricts the on-sale of securities issued without disclosure, unless the sale is exempt
under section 708 or 708A of the Act. By giving this notice, a sale of the Shares noted above
will fall within the exemption in section 708A(5) of the Act.

The Company hereby notifies ASX under paragraph 708A(5)(e) of the Act that:
(a) the Company issued the Shares without disclosure to investors under Part 6D.2 of
the Act;
(b) as at the date of this notice, the Company has complied with the provisions of Chapter
2M of the Act as they 
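
The example document has a single page, so the join above is straightforward. For multi-page PDFs, the order of rows returned by `collect()` is not something to rely on without an explicit sort, so sorting by `pagenum` before joining is a reasonable precaution. A minimal sketch using named tuples as stand-ins for Spark rows; the helper name `join_pages` is hypothetical:

```python
from collections import namedtuple

# Stand-in for a Spark Row with the two columns used here.
Page = namedtuple("Page", ["pagenum", "text"])


def join_pages(rows, separator="\n"):
    """Join page texts in ascending page order."""
    ordered = sorted(rows, key=lambda row: row.pagenum)
    return separator.join(row.text for row in ordered)


# Rows deliberately out of order to show the effect of the sort.
rows = [Page(1, "second page"), Page(0, "first page")]
print(join_pages(rows))
```

Against the real result, the same helper could be applied to `result.select("pagenum", "text").collect()`, assuming both columns are present as in the schema shown at the end of this notebook.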

## Clear cache

In [30]:
result.unpersist()

DataFrame[path: string, image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,resolution:int,data:binary>, pagenum: int, confidence: double, positions: array<struct<mapping:array<struct<c:string,p:int,x:float,y:float,width:float,height:float,fontSize:int>>>>, exception: string, text: string]