![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter/Cards/SparkOcrPretrainedPipelinesImageHandwrittenTransformerExtraction.ipynb)

# Example of Pretrained Pipelines

Pretrained Pipelines are predefined recipes packaged as Visual NLP pipelines. Each recipe comes with a set of stages and parameters chosen to accomplish a specific task.

## Blog posts and videos

- [Text Detection in Spark OCR](https://medium.com/spark-nlp/text-detection-in-spark-ocr-dcd8002bdc97)

- [Table Detection & Extraction in Spark OCR](https://medium.com/spark-nlp/table-detection-extraction-in-spark-ocr-50765c6cedc9)

- [Extract Tabular Data from PDF in Spark OCR](https://medium.com/spark-nlp/extract-tabular-data-from-pdf-in-spark-ocr-b02136bc0fcb)

- [Signature Detection in Spark OCR](https://medium.com/spark-nlp/signature-detection-in-spark-ocr-32f9e6f91e3c)

- [GPU image pre-processing in Spark OCR](https://medium.com/spark-nlp/gpu-image-pre-processing-in-spark-ocr-3-1-0-6fc27560a9bb)

- [How to Setup Spark OCR on UBUNTU - Video](https://www.youtube.com/watch?v=cmt4WIcL0nI)


**More examples here**

https://github.com/JohnSnowLabs/spark-ocr-workshop

### Colab Setup

In [1]:
import json, os
import sys

if 'google.colab' in sys.modules:
    from google.colab import files

    if 'spark_ocr.json' not in os.listdir():
        license_keys = files.upload()
        os.rename(list(license_keys.keys())[0], 'spark_ocr.json')

with open('spark_ocr.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

Saving spark_nlp_for_healthcare_spark_ocr_9387.json to spark_nlp_for_healthcare_spark_ocr_9387.json
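The cell above trusts that the uploaded JSON contains every key the later cells read (`SPARK_OCR_SECRET`, `PUBLIC_VERSION`, `OCR_VERSION`). A minimal sketch of an early check follows; the helper name `load_license` and the `REQUIRED_KEYS` tuple are hypothetical and not part of Spark OCR. Note also that `locals().update(...)` only creates variables when run at the top level of a notebook or module, so code inside a function is better served by reading from the returned dictionary.

```python
import json
from pathlib import Path

# Keys this notebook reads later. Assumption: a given license file may hold more.
REQUIRED_KEYS = ("SPARK_OCR_SECRET", "PUBLIC_VERSION", "OCR_VERSION")


def load_license(path, required=REQUIRED_KEYS):
    """Load the license JSON and fail early if an expected key is absent."""
    keys = json.loads(Path(path).read_text())
    missing = [name for name in required if name not in keys]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return keys
```

With this in place, `license_keys = load_license('spark_ocr.json')` would stop with a readable message before any install step runs with an empty version or secret.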


In [2]:
!pip install transformers

# Installing pyspark and spark-nlp
%pip install --upgrade -q pyspark==3.2.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark OCR
#! pip uninstall spark-ocr -y
%pip install spark-ocr==$OCR_VERSION --extra-index-url=https://pypi.johnsnowlabs.com/$SPARK_OCR_SECRET --upgrade

  Preparing metadata (setup.py) ... done
  Building wheel for pyspark (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/5.4.0-f40a4114fd59c8d06434c58c9e28fa076aa4af9e
Collecting spark-ocr==5.4.0
  Downloading https://pypi.johnsnowlabs.com/5.4.0-f40a4114fd59c8d06434c58c9e28fa076aa4af9e/spark-ocr/spark-ocr-5.4.0.tar.gz (42.6 MB)

<h1><font color='darkred'>!!! ATTENTION !!!</font></h1>

<b>After running the previous cell, <font color='darkred'>RESTART the COLAB RUNTIME</font> and then continue.</b>

### Initialize Spark session

In [1]:
import json, os

with open("spark_ocr.json", 'r') as f:
  license_keys = json.load(f)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

# Defining license key-value pairs as local variables
locals().update(license_keys)
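`os.environ` accepts only string values, so `os.environ.update(license_keys)` raises a `TypeError` if the JSON ever holds a number or `null`. The sketch below is one defensive way to export the entries, assuming such values could appear; the helper name `export_license` and the `RETRIES` key in the usage line are hypothetical.

```python
import os


def export_license(keys, env=os.environ):
    """Copy license entries into an environment mapping, casting values to str."""
    for name, value in keys.items():
        if value is None:
            continue  # skip empty entries rather than exporting the text "None"
        env[name] = str(value)
    return env
```

For example, `export_license({"PUBLIC_VERSION": "5.4.0", "RETRIES": 3}, env={})` returns a plain dictionary with `"3"` stored as a string, which makes the behaviour easy to check without touching the real environment.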

In [2]:
import pkg_resources

from pyspark.ml import PipelineModel
import pyspark.sql.functions as f

from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import *
from sparkocr.metrics import score

In [3]:
# Start spark
spark = start(secret=SPARK_OCR_SECRET, nlp_version=PUBLIC_VERSION)

Spark version: 3.2.1
Spark NLP version: 5.4.0
Spark OCR version: 5.4.0



## Load Pretrained Pipelines


In [4]:
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('digital_pdf_table_extractor', 'en', 'clinical/ocr')

digital_pdf_table_extractor download started this may take some time.
Approx size to download 264.9 MB
[OK!]


## Call the pipeline

In [5]:
pdf_path = '/content/BiomedPap_bio-202402-0013-3.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = pipeline.transform(pdf_example_df)
result

| path | modificationTime | length | hocr | height_dimension | width_dimension | pagenum | image | total_pages | tmp_pagenum | documentnum | table_regions | tables | exception | table_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| file:/content/Bio... | 2024-08-07 09:24:... | 54028 | `<div title="bbox ...` | 841 | 595 | 0 | {file:/content/Bi... | 1 | 0 | 0 | {0, 0, 54.70161, ... | {{-1, -1, 54.7016... |  | 0 |


In [6]:
display_images(result, "image", width=1000)

Output hidden; open in https://colab.research.google.com to view.

In [7]:
display_tables(result, table_col = "tables", table_index_col = "table_index")

Filename: BiomedPap_bio-202402-0013-3.pdf
Page: 0
Table: 0
Number of Columns: 3


| col0 | col1 | col2 |
|---|---|---|
| Empty | cTnI ( ng / L ) ( Architect , Abbott ) | cTnT ( ng / L ) ( Cobas , Roche ) |
| Case 1 | Empty | Empty |
| First sample ( before hospitalisation ) | 1782 | 7 |
| Samples during hospitalisation | 1741 , 3520 and 3622 | 34 ( after coronary angiography ) |
| After hospitalisation | 395 ( 3 years after hospitalisation ) 360 ( 4 years after hospitalisation ) 536 ( 5 years after hospitalisation ) |  |
| Case 2 | Empty | Empty |
| June 25 | 107 | – |
| July 2 | 835 | – |
| July 28 | – | 8 |
| August 25 | 439 | Empty |
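The extracted cells are free text: several measurements can share one cell, and parenthesised notes such as "( 3 years after hospitalisation )" contain digits that are not measurements. A minimal post-processing sketch, assuming only the integer values outside parentheses are wanted, could look like this. The helper name `measurements` is hypothetical and not part of Spark OCR.

```python
import re

# Matches one parenthesised annotation, e.g. "( after coronary angiography )".
PAREN = re.compile(r"\([^)]*\)")


def measurements(cell):
    """Return the integers in a cell, ignoring parenthesised annotations."""
    without_notes = PAREN.sub(" ", cell)
    return [int(tok) for tok in re.findall(r"\d+", without_notes)]


# Cell text copied from the table output above.
print(measurements("1741 , 3520 and 3622"))               # [1741, 3520, 3622]
print(measurements("34 ( after coronary angiography )"))  # [34]
print(measurements("Empty"))                              # []
```

Placeholder cells such as `Empty` and `–` yield an empty list, so callers can tell "no value" apart from a value of zero.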
