# Example of Pretrained Pipelines
Pretrained pipelines are predefined recipes packaged as Visual NLP pipelines. Each recipe comes with a set of stages and parameters that accomplish a specific task.

## Install the spark-ocr Python package
Specify either the path to `spark-ocr-assembly-[version].jar` or the `secret`.

In [None]:
secret = ""
license = ""
version = secret.split("-")[0]
spark_ocr_jar_path = "../../target/scala-2.12"
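
The cell above derives `version` with `secret.split("-")[0]`, which silently yields an empty string when `secret` is empty. The sketch below is one way to make that step fail loudly. The helper name `version_from_secret` is hypothetical, and it assumes the secret has the shape `<version>-<token>`, which is only inferred from the split in the cell above.

```python
def version_from_secret(secret: str) -> str:
    """Return the version prefix of a secret shaped like '<version>-<token>'.

    The '<version>-<token>' shape is an assumption inferred from the notebook's
    own secret.split("-")[0]; it is not taken from any official documentation.
    """
    version, sep, token = secret.partition("-")
    if not sep or not version or not token:
        raise ValueError("expected a secret of the form '<version>-<token>'")
    return version


# Hypothetical placeholder value, not a real credential.
print(version_from_secret("5.1.0-abcdef"))
```

With this check in place, an unset secret raises a `ValueError` before the `pip install` line is ever built from an empty version.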

In [None]:
# Install from PyPI using the secret
#%pip install spark-ocr==$version --extra-index-url=https://pypi.johnsnowlabs.com/$secret --upgrade

## Initialization of spark session

In [None]:
from pyspark.sql import SparkSession
from sparkocr import start
import os

if license:
    os.environ['SPARK_OCR_LICENSE'] = license

spark = start(secret=secret, jar_path=spark_ocr_jar_path)
spark

Spark version: 3.4.1
Spark NLP version: 5.1.2
Spark OCR version: 5.1.0


## Load Pretrained Pipelines

### mixed_scanned_digital_pdf
This pipeline handles a mix of scanned PDFs (containing images) and digital PDFs (containing digital text). The output is returned in a DataFrame column that contains the text from both sources.

A single PDF file can even mix digital and scanned pages.

Other options:
* __mixed_scanned_digital_pdf_image_cleaner__: same as above, but also cleans noise from images.
* __mixed_scanned_digital_pdf_skew_correction__: same as above, but also corrects page rotation.
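
The choice between the three variants listed above can be written down as a small lookup. This is only a sketch: the function name `mixed_pdf_pipeline_name` and its two flags are hypothetical, and only the pipeline names themselves come from this notebook. No variant that does both cleaning and skew correction is listed here, so the sketch rejects that combination rather than guessing a name.

```python
def mixed_pdf_pipeline_name(clean_noise: bool = False, fix_skew: bool = False) -> str:
    """Map two hypothetical flags to one of the pipeline names listed above."""
    if clean_noise and fix_skew:
        # This notebook lists no combined variant, so do not invent one.
        raise ValueError("no combined cleaner + skew correction variant is listed")
    if clean_noise:
        return "mixed_scanned_digital_pdf_image_cleaner"
    if fix_skew:
        return "mixed_scanned_digital_pdf_skew_correction"
    return "mixed_scanned_digital_pdf"


print(mixed_pdf_pipeline_name(fix_skew=True))
```

The returned string would then be passed as the first argument of `PretrainedPipeline`, in the same way as in the next cell.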

In [None]:
from pyspark.ml import PipelineModel
from sparkocr.pretrained import *

mixed_pdf_pipeline = PretrainedPipeline('mixed_scanned_digital_pdf', 'en', 'clinical/ocr')

mixed_scanned_digital_pdf download started this may take some time.
Approx size to download 6.7 KB
[OK!]


### Call the pipeline
The `mixed_pdfs` folder contains two PDF files: one scanned and one digital. You can open them to verify.

In [None]:
pdf_path = 'mixed_pdfs'
!ls mixed_pdfs

immortal_image.pdf  immortal_text.pdf
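
The `!ls` shell escape only works inside IPython. A plain-Python equivalent, useful when the same check runs in a script, might look like the sketch below. The helper `list_pdfs` is hypothetical, and the demo uses a throwaway directory with stand-in file names so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a temporary directory; the file names are stand-ins, not the notebook's data.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scanned.pdf", "digital.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    print(list_pdfs(tmp))
```

In the notebook itself, `list_pdfs(pdf_path)` would be the call that corresponds to `!ls mixed_pdfs`.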


Run the pipeline on the folder and display the resulting DataFrame.

In [None]:
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = mixed_pdf_pipeline.transform(pdf_example_df)
result

To avoid truncation, call `collect()` on just the `text` column:

In [None]:
result.select("text").collect()
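
Because the folder holds more than one file, it can help to group the collected text by source file. The sketch below assumes the result keeps the `path` column that Spark's `binaryFile` source provides, so that `result.select("path", "text").collect()` returns rows with `.path` and `.text` attributes. The helper `text_by_path` and the `FakeRow` stand-in are hypothetical, and the demo strings are placeholders rather than real OCR output.

```python
from collections import defaultdict, namedtuple


def text_by_path(rows):
    """Group row.text by row.path, joining the pieces in the order they arrive."""
    pieces = defaultdict(list)
    for row in rows:
        pieces[row.path].append(row.text)
    return {path: "\n".join(texts) for path, texts in pieces.items()}


# Stand-in for collected Spark rows, used only so the sketch runs without Spark.
FakeRow = namedtuple("FakeRow", ["path", "text"])
demo_rows = [
    FakeRow("a.pdf", "placeholder page 1"),
    FakeRow("a.pdf", "placeholder page 2"),
    FakeRow("b.pdf", "placeholder page 1"),
]
for path, text in text_by_path(demo_rows).items():
    print(path, repr(text))
```

The pieces are joined in arrival order, so if page order matters the rows should be sorted before they are passed in.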

### image_handwritten_transformer_extraction
The next example runs transformer-based OCR on handwritten text.
Other similar options:

* __image_printed_transformer_extraction__: OCR for printed text in images.
* __pdf_printed_transformer_extraction__: OCR for printed text in PDFs.
* __pdf_handwritten_transformer_extraction__: OCR for handwritten text in PDFs.
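
The four transformer pipelines named in this section differ along two axes: the input type and the kind of writing. A lookup table makes that explicit. This is a sketch, and the names `TRANSFORMER_PIPELINES` and `transformer_pipeline_name` are hypothetical. Only the pipeline names come from this notebook.

```python
# (input type, kind of writing) -> pipeline name as listed in this notebook.
TRANSFORMER_PIPELINES = {
    ("image", "handwritten"): "image_handwritten_transformer_extraction",
    ("image", "printed"): "image_printed_transformer_extraction",
    ("pdf", "printed"): "pdf_printed_transformer_extraction",
    ("pdf", "handwritten"): "pdf_handwritten_transformer_extraction",
}


def transformer_pipeline_name(source: str, writing: str) -> str:
    """Return the pipeline name for a given input type and kind of writing."""
    key = (source.lower(), writing.lower())
    if key not in TRANSFORMER_PIPELINES:
        raise ValueError(f"no pipeline listed for {key}")
    return TRANSFORMER_PIPELINES[key]


print(transformer_pipeline_name("image", "handwritten"))
```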

In [None]:
from pyspark.ml import PipelineModel
from sparkocr.pretrained import *

image_handwritten_transformer_extraction = PretrainedPipeline('image_handwritten_transformer_extraction', 'en', 'clinical/ocr')

### Load image and display it

In [None]:
from pyspark.ml import PipelineModel
import pyspark.sql.functions as f
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_images

imagePath = "./data/handwritten/handwritten_example.jpg"
image_df = spark.read.format("binaryFile").load(imagePath)
display_images(BinaryToImage().transform(image_df), "image")

### Display results

In [None]:
result = image_handwritten_transformer_extraction.transform(image_df).cache()
print("".join(row.text for row in result.select("text").collect()))

This is an example of handwritten
sex .
Let's # check the performance ?
I hope it will be awesome .
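
The recognized text puts a space before each punctuation mark (`awesome .`, `performance ?`). If that matters downstream, a small post-processing step can close the gap. The sketch below is plain Python, not part of the library, and the helper name `tidy_punctuation` is hypothetical.

```python
import re

# One or more spaces or tabs immediately before a punctuation mark.
_SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([.,;:!?])")


def tidy_punctuation(text: str) -> str:
    """Remove spaces that precede punctuation, leaving line breaks untouched."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


print(tidy_punctuation("I hope it will be awesome ."))
```

The rule is deliberately narrow: it does not correct misrecognized words, and it would also rewrite spaced-out numbers such as `1 . 85`, so it suits running text rather than tables.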


### LightPipeline

In [None]:
image_handwritten_transformer_extraction.model.stages

[BinaryToImage_4a25c0442190,
 IMAGE_TEXT_DETECTOR_57486d9529bc,
 IMAGE_TO_TEXT_V2_1b4d60a5f4a9]

In [None]:
from sparkocr.base import LightPipeline
lp = LightPipeline(image_handwritten_transformer_extraction.model)

In [None]:
%%time
lp.fromLocalPath(imagePath)

CPU times: user 11.4 ms, sys: 733 µs, total: 12.1 ms
Wall time: 5.33 s


[{'image': ImageOutput(path: handwritten_example.jpg, exception: None),
  'text_regions': [Coordinate (x: 2277.3103, y: 747.747, width: 303.97388, height: 4112.281),
   Coordinate (x: 572.09106, y: 1044.0774, width: 227.29648, height: 556.3352),
   Coordinate (x: 2184.8506, y: 1496.3049, width: 318.13586, height: 3854.7915),
   Coordinate (x: 1889.187, y: 1996.0546, width: 306.62164, height: 3152.0547)],
  'text': Annotation(image_to_text, 0, 99, This is an example of handwritten
  sex .
  Let's # check the performance ?
  I hope it will be awesome ., Map(), [])}]
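
The `%%time` output above shows about 12 ms of CPU time against 5.33 s of wall time. A likely explanation is that the reported CPU time covers only the Python process, while the OCR itself runs in the JVM. The sketch below reproduces the same kind of gap with the standard library, using `time.sleep` as a stand-in for work done outside the Python process. The helper `timed` is hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, cpu_seconds, wall_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, cpu, wall


# Sleeping consumes wall-clock time but almost no CPU time in this process.
_, cpu, wall = timed(time.sleep, 0.2)
print(f"CPU {cpu * 1000:.1f} ms, wall {wall:.2f} s")
```

For calls like `lp.fromLocalPath(imagePath)`, the wall time is therefore the figure that reflects how long the caller waits.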

### digital_pdf_table_extractor

In [None]:
from pyspark.ml import PipelineModel
from sparkocr.pretrained import *

digital_pdf_table_extractor = PretrainedPipeline('digital_pdf_table_extractor', 'en', 'clinical/ocr')

In [None]:
pdfPath = "./data/tab_pdfs/budget.pdf"
df = spark.read.format("binaryFile").load(pdfPath)

In [None]:
from sparkocr.utils import display_pdf_file
display_pdf_file(pdfPath)

In [None]:
from sparkocr.utils import display_tables
result = digital_pdf_table_extractor.transform(df)
display_tables(result, table_col = "tables", table_index_col = "table_index")

Filename: budget.pdf
Page:     0
Table:    0
11


| | col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Ministry / Demand | Revenue | Plan Capital | Total | Revenue | Non - Plan Capital | Total | Total Plan & Non - Plan | Page No . | |
| 1 | MINISTRY OF AGRICULTURE | | 28130 . 48 | 67 . 52 | 28198 . 00 | 2863 . 09 | 1 . 85 | 2864 . 94 | 31062 . 94 | | |
| 2 | Department of Agriculture and Cooperation | | 22260 . 55 | 48 . 45 | 22309 . 00 | 342 . 51 | 0 . 74 | 343 . 25 | 22652 . 25 | | 1 - 10 |
| 3 | Department of Agricultural Research and Education | | 3715 . 00 | . . . | 3715 . 00 | 2429 . 39 | . . . | 2429 . 39 | 6144 . 39 | | 11 - 13 |
| 4 | Department of Animal Husbandry , Dairying and ... | | 2154 . 93 | 19 . 07 | 2174 . 00 | 91 . 19 | 1 . 11 | 92 . 30 | 2266 . 30 | | 14 - 19 |
| 5 | DEPARTMENT OF ATOMIC ENERGY | | 1779 . 00 | 4101 . 00 | 5880 . 00 | 3710 . 84 | 855 . 75 | 4566 . 59 | 10446 . 59 | | |
| 6 | Atomic Energy | | 1483 . 00 | 3427 . 00 | 4910 . 00 | 2971 . 25 | 855 . 75 | 3827 . 00 | 8737 . 00 | | 20 - 25 |
| 7 | Nuclear Power Schemes | | 296 . 00 | 674 . 00 | 970 . 00 | 739 . 59 | . . . | 739 . 59 | 1709 . 59 | | 26 - 27 |
| 8 | MINISTRY OF CHEMICALS AND FERTILISERS | | 360 . 83 | 153 . 17 | 514 . 00 | 73104 . 46 | 0 . 09 | 73104 . 55 | 73618 . 55 | | |
| 9 | Department of Chemicals and Petrochemicals | | 171 . 49 | 35 . 51 | 207 . 00 | 63 . 67 | 0 . 01 | 63 . 68 | 270 . 68 | | 28 - 30 |
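
The extracted amounts arrive as strings with spaces around the decimal point (`28130 . 48`), and empty amounts appear as `. . .`. Before doing arithmetic on them, they need to be normalized. The sketch below is one way to do that in plain Python. The helper `parse_amount` is hypothetical and is meant for the amount columns only, not for page ranges such as `1 - 10`.

```python
import re
from decimal import Decimal


def parse_amount(cell: str):
    """Turn a cell like '28130 . 48' into a Decimal; return None for '' or '. . .'."""
    compact = re.sub(r"\s+", "", cell)
    if not compact or set(compact) == {"."}:
        return None
    return Decimal(compact)


# Row 1 of the table above: plan revenue plus plan capital gives the plan total.
print(parse_amount("28130 . 48") + parse_amount("67 . 52"))  # prints 28198.00
```

`Decimal` is used instead of `float` so that sums like the one above can be compared exactly against the extracted totals.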
