# Example of Pretrained Pipelines
Pretrained Pipelines are predefined recipes in the form of Visual NLP pipelines. Each recipe comes with a set of stages and parameters that accomplish a specific task.

## Install spark-ocr python package
You need to specify either the path to `spark-ocr-assembly-[version].jar` or the `secret`.

In [5]:
secret = ""
license = ""
version = secret.split("-")[0]
spark_ocr_jar_path = "../target/scala-2.12/"
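The choice between a `secret` and a local jar path can be sketched as below. `resolve_install_source` is a hypothetical helper, not part of `sparkocr`, and the secret layout (version before the first hyphen) is assumed only from the `secret.split("-")[0]` line in the cell above.

```python
def resolve_install_source(secret: str, jar_path: str) -> dict:
    """Decide whether to install from a secret or a local jar (hypothetical helper)."""
    if secret:
        # Assumption: the version is the part of the secret before the first
        # hyphen, mirroring `secret.split("-")[0]` in the cell above.
        return {"source": "secret", "version": secret.split("-")[0]}
    if jar_path:
        return {"source": "jar", "jar_path": jar_path}
    raise ValueError("Set either `secret` or the path to the spark-ocr assembly jar.")
```

With both values empty the helper raises a `ValueError`, so a missing configuration is reported before the Spark session is started.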

## Initialization of spark session

In [6]:
from pyspark.sql import SparkSession
from sparkocr import start
import sys
import os

if license:
    os.environ['JSL_OCR_LICENSE'] = license

spark = start(secret=secret, jar_path=spark_ocr_jar_path)
spark

Spark version: 3.1.2
Spark NLP version: 5.1.1
Spark OCR version: 5.1.0rc3



## Load Pretrained Pipelines

### mixed_scanned_digital_pdf
This pipeline handles a mix of scanned PDFs (containing images) and digital PDFs (containing digital text). The output is returned in a DataFrame column that contains the text from both sources.

A single PDF file can even mix digital and scanned pages.

Other options:

* __mixed_scanned_digital_pdf_image_cleaner__: same as above, but also cleans noise from the images.
* __mixed_scanned_digital_pdf_skew_correction__: same as above, but also corrects page rotation.
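As a small illustration, the three pipeline names listed above can be kept in a lookup table so the variant is chosen in one place. The short keys (`plain`, `image_cleaner`, `skew_correction`) and the helper name are made up for this sketch; only the pipeline names come from the list above.

```python
# Pipeline names are taken from the list above; the short keys are hypothetical.
MIXED_PDF_PIPELINES = {
    "plain": "mixed_scanned_digital_pdf",
    "image_cleaner": "mixed_scanned_digital_pdf_image_cleaner",
    "skew_correction": "mixed_scanned_digital_pdf_skew_correction",
}


def mixed_pdf_pipeline_name(variant: str = "plain") -> str:
    """Return the pretrained pipeline name for the requested variant."""
    try:
        return MIXED_PDF_PIPELINES[variant]
    except KeyError:
        known = ", ".join(sorted(MIXED_PDF_PIPELINES))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}") from None
```

The returned string would be passed as the first argument to `PretrainedPipeline`, in the same way the literal name is passed in the cell that follows.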

In [7]:
from pyspark.ml import PipelineModel
from sparkocr.pretrained import *

mixed_pdf_pipeline = PretrainedPipeline('mixed_scanned_digital_pdf', 'en', 'clinical/ocr')

mixed_scanned_digital_pdf download started this may take some time.
Approx size to download 6.7 KB
[OK!]


### Call the pipeline
First, list the `mixed_pdfs` folder. It contains two PDF files: one is scanned and the other is digital. You can open them yourself to verify.

In [4]:
pdf_path = './mixed_pdfs'
!ls mixed_pdfs

immortal_image.pdf  immortal_text.pdf
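Outside a notebook, where the `!ls` shell escape is not available, the same listing can be produced in plain Python. This is a minimal sketch using only the standard library; `list_pdfs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def list_pdfs(folder: str) -> List[str]:
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.pdf") if p.is_file())
```

Calling `list_pdfs("./mixed_pdfs")` on the folder above would be expected to return the two file names shown in the listing.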


Run the pipeline and display the result as a DataFrame:

In [5]:
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = mixed_pdf_pipeline.transform(pdf_example_df)
result

path,modificationTime,length,text,positions,height_dimension,width_dimension,content,image,total_pages,pagenum,documentnum,confidence,exception
file:/home/jose/s...,2023-04-13 11:12:...,243543,...,"[{[{w, 0, 14.4, 3...",383.0,284.0,[25 50 44 46 2D 3...,,0,0,0,-1.79769313486231...,
file:/home/jose/s...,2023-04-13 11:12:...,90047,would have been a...,[{[{would have be...,841.8897705078125,595.2755737304688,[25 50 44 46 2D 3...,{file:/home/jose/...,1,0,0,95.82769730511833,


To avoid truncation, call `collect()` on just the `text` column:

In [6]:
result.select("text").collect()

[Row(text='                                                                       \n   would    have   been   a liberation,    a joy,  and   a fiesta.     \n   He sensed that had he been able to choose or                        \n   dream     his death    that  night,   this  is the  death   he      \n   would    have   dreamed     or  chosen.                             \n       Dahlmann firmly grips the knife, which he                       \n   may    have   no  idea  how   to  manage,     and   steps  out      \n   into  the  plains.                                                  \n                                                                       \n                                                                       \n                                                                       \n   The     Aleph                                                       \n   (1949)                                                              \n                                        
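The text returned by `collect()` is padded with many spaces. If that spacing is not needed downstream, it can be collapsed with a few lines of plain Python. This is a sketch, not part of the pipeline: `tidy_ocr_text` is a hypothetical helper, and the sample string is a shortened excerpt of the output above.

```python
import re


def tidy_ocr_text(text: str) -> str:
    """Collapse runs of spaces or tabs and drop blank lines from padded OCR text."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "      \n"
    "   would    have   been   a liberation,    a joy,  and   a fiesta.     \n"
    "   He sensed that had he been able to choose or      \n"
)
print(tidy_ocr_text(sample))
# would have been a liberation, a joy, and a fiesta.
# He sensed that had he been able to choose or
```

Collapsing whitespace discards any horizontal layout the padding encodes, so keep the raw text if column alignment matters.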

### image_handwritten_transformer_extraction
As another example, this pipeline runs transformer-based OCR on handwritten text.

Similar options:

* __image_printed_transformer_extraction__: OCR printed texts contained in images.
* __pdf_printed_transformer_extraction__: OCR printed texts contained in PDFs.
* __pdf_handwritten_transformer_extraction__: OCR handwritten texts contained in PDFs.
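The four transformer pipelines named in this section follow one naming pattern: input type, then text style, then `_transformer_extraction`. A minimal sketch that builds the name from those two choices is shown below; `transformer_pipeline_name` is a hypothetical helper, and it only covers the four names mentioned here.

```python
def transformer_pipeline_name(source: str, style: str) -> str:
    """Build one of the four transformer OCR pipeline names listed in this section."""
    if source not in {"image", "pdf"}:
        raise ValueError(f"source must be 'image' or 'pdf', got {source!r}")
    if style not in {"printed", "handwritten"}:
        raise ValueError(f"style must be 'printed' or 'handwritten', got {style!r}")
    return f"{source}_{style}_transformer_extraction"
```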

In [None]:
from pyspark.ml import PipelineModel
from sparkocr.pretrained import *

image_handwritten_transformer_extraction = PretrainedPipeline('image_handwritten_transformer_extraction', 'en', 'clinical/ocr')

### Load image and display it

In [None]:
from pyspark.ml import PipelineModel
import pyspark.sql.functions as f
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_images

imagePath = "./data/handwritten/handwritten_example.jpg"
image_df = spark.read.format("binaryFile").load(imagePath)
display_images(BinaryToImage().transform(image_df), "image")

### Display results

In [5]:
result = image_handwritten_transformer_extraction.transform(image_df).cache()
print("".join(row.text for row in result.select("text").collect()))

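NameError: name 'image_handwritten_transformer_extraction' is not defined

The error above appears because the cell that loads the pipeline was not run in this session (it is marked `In [None]`). Run that cell first, then re-run this one.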

### digital_pdf_table_extractor

In [9]:
from pyspark.ml import PipelineModel
from sparkocr.pretrained import *

digital_pdf_table_extractor = PretrainedPipeline('digital_pdf_table_extractor', 'en', 'clinical/ocr')

digital_pdf_table_extractor download started this may take some time.
Approx size to download 267.1 MB
[OK!]


In [11]:
pdfPath = "/home/jose/spark-ocr/python/sparkocr/resources/ocr/pdfs/tabular-pdf/f1120.pdf"
df = spark.read.format("binaryFile").load(pdfPath)

In [5]:
from sparkocr.utils import display_pdf_file
display_pdf_file(pdfPath)

In [12]:
from sparkocr.utils import display_tables
result = digital_pdf_table_extractor.transform(df)
display_tables(result, table_col = "tables", table_index_col = "table_index")

Filename: f1120.pdf
Page: 0
Table: 0
Number of Columns: 18


col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17
Form 1120,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,"For calendar year 2019 U . S or . tax Corporation year beginning Income , 2019 Tax , ending Return",Empty,Empty,Empty,", 20",Empty,Empty,OMB No . 1545 - 0123,Empty
Form Department of the Treasury,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,"For calendar ▶ year Go to 2019 www or . irs tax . gov year / Form1120 beginning for instructions , and 2019 the , ending latest information . , 20",Empty,Empty,Empty,Empty,Empty,Empty,2019,
Internal Revenue Service,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,▶ Go to www . irs . gov / Form1120 for instructions and the latest information .,,,,,,,,
A Check if :,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Name,Empty,Empty,Empty,Empty,Empty,Empty,B Employer identification number,
1a Consolidated ( attach Form 851 return ) .,Empty,TYPE,,,,,,,,,,,,,,,
b Life dated / nonlife return consoli . . - .,Empty,OR,Empty,Empty,Empty,Empty,Empty,Empty,"Number , street , and room or suite no . If a P . O . box , see instructions .",Empty,Empty,Empty,Empty,C Date incorporated,,,
2 Personal ( attach Sch holding . PH ) co . . .,Empty,PRINT,Empty,Empty,Empty,Empty,Empty,Empty,"City or town , state or province , country , and ZIP or foreign postal code",Empty,Empty,Empty,Empty,D Total assets ( see instructions ),,,
3 Personal ( see instructions service ) corp . . .,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,$,,,
4 Schedule M - 3 attached,Empty,Empty,E Check if : ( 1 ),Initial return,Empty,( 2 ),Empty,Empty,( 4 ),Empty,Empty,Empty,Empty,Address change,,,
1a,Gross receipts or sales .,Empty,Empty,Empty,Empty,Empty,Empty,1a,Empty,50000.00,,,,,,,


Filename: f1120.pdf
Page: 1
Table: 0
Number of Columns: 5


col0,col1,col2,col3,col4
Schedule C,"Dividends instructions , ) Inclusions , and Special Deductions ( see",( a ) Dividends inclusions and,( b ) %,( c ) Special ( a ) × deductions ( b )
1,Dividends from less - than - 20 % - owned domestic corporations ( other than debt - financed stock ),234,50,
2,Dividends from 20 % - or - more - owned domestic corporations ( other than debt - financed stock ),324123,65,
3,Dividends on certain debt - financed stock of domestic and foreign corporations,324,instructions see,
4,Dividends on certain preferred stock of less - than - 20 % - owned public utilities,234,23 . 3,
5,Dividends on certain preferred stock of 20 % - or - more - owned public utilities .,42134,26 . 7,
6,Dividends from less - than - 20 % - owned foreign corporations and certain FSCs,4234,50,
7,Dividends from 20 % - or - more - owned foreign corporations and certain FSCs,4234,65,
8,Dividends from wholly owned foreign subsidiaries,42348987,100,
9,Subtotal . Add lines 1 through 8 . See instructions for limitations .,987,instructions see,
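In the table above, decimal values come out with spaces around the point (`23 . 3`, `26 . 7`). A small post-processing sketch can rejoin them. `join_split_decimals` is a hypothetical helper, and the rule (digit, space, dot, space, digit) is an assumption that could also join two unrelated numbers separated by a full stop, so check it against your own data.

```python
import re

# Matches " . " only when it sits between two digits, e.g. "23 . 3".
_SPLIT_DECIMAL = re.compile(r"(?<=\d) \. (?=\d)")


def join_split_decimals(cell: str) -> str:
    """Rejoin decimal numbers that were split into 'digits . digits'."""
    return _SPLIT_DECIMAL.sub(".", cell)
```

Sentence-ending periods such as `8 . See instructions` are left untouched because no digit follows the dot.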


Filename: f1120.pdf
Page: 2
Table: 0
Number of Columns: 11


col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,Check if the corporation is a member of a controlled group ( attach Schedule O ( Form 1120 ) ) . See instructions ▶,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty
2,Income tax . See instructions,Empty,Empty,Empty,2,Empty,Empty,432.13,,
3,Base erosion minimum tax amount ( attach Form 8991 ) .,Empty,Empty,Empty,Empty,3,Empty,Empty,34.32,
4,Add lines 2 and 3 .,Empty,Empty,Empty,Empty,Empty,4,Empty,Empty,4.23
5 a,5a,3.21,,,,,,,,
b,5b,3.12,,,,,,,,
c,5c,2.13,,,,,,,,
d,5d,3.24,,,,,,,,
e,5e,5.11,,,,,,,,
6,Total credits . Add lines 5a through 5e,Empty,6,Empty,Empty,Empty,Empty,Empty,Empty,23.41


Filename: f1120.pdf
Page: 3
Table: 0
Number of Columns: 4


col0,col1,col2,col3
( i ) Name of Entity,Identification ( ii ) Employer Number ( if any ),( iii Organization ) Country of,"Percentage ( iv ) Maximum Owned in Profit , Loss , or Capital"


Filename: f1120.pdf
Page: 3
Table: 1
Number of Columns: 4


col0,col1,col2,col3
( i ) Name of Corporation,Identification ( ii ) Employer Number ( if any ),( iii Incorporation ) Country of,Owned ( iv ) Percentage in Voting Stock


Filename: f1120.pdf
Page: 4
Table: 0
Number of Columns: 8


col0,col1,col2,col3,col4,col5,col6,col7
13,,"Are the corporation’s total receipts ( page 1 , line 1a , plus lines 4 through 10 ) for the tax year and its total assets at the end of the tax year less than $ 250 , 000 ? If “Yes , ” the corporation is not required to complete Schedules L , M - 1 , and M - 2 . Instead , enter the total amount of cash distributions and the book value of property distributions ( other than cash ) made during the tax year ▶ $",,43214.32,Empty,Yes,No
14,,"Is the corporation required to file Schedule UTP ( Form 1120 ) , Uncertain Tax Position Statement ? See instructions If “Yes , ” complete and attach Schedule UTP .",,,,,
15a,Empty,Did the corporation make any payments in 2019 that would require it to file Form ( s ) 1099 ?,,,,,
b,Empty,"If “Yes , ” did or will the corporation file required Form ( s ) 1099 ? .",,,,,
16,,"During this tax year , did the corporation have an 80 % - or - more change in ownership , including a change due to redemption of its own stock ?",,,,,
17,,"During or subsequent to this tax year , but before the filing of this return , did the corporation dispose of more than 65 % ( by value ) of its assets in a taxable , non - taxable , or tax deferred transaction ?",,,,,
18,,Did the corporation receive assets in a section 351 transfer in which any of the transferred assets had a fair market basis or fair market value of more than $ 1 million ? .,,,,,
19,,"During the corporation’s tax year , did the corporation make any payments that would require it to file Forms 1042 and 1042 - S under chapter 3 ( sections 1441 through 1464 ) or chapter 4 ( sections 1471 through 1474 ) of the Code ? .",,,,,
20,Empty,Is the corporation operating on a cooperative basis ? .,,,,,
21,,"During the tax year , did the corporation pay or accrue any interest or royalty for which the deduction is not allowed under section 267A ? See instructions If “Yes , ” enter the total amount of the disallowed deductions ▶ $",8576857.0,,,,


Filename: f1120.pdf
Page: 5
Table: 0
Number of Columns: 15


col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14
1,Net income ( loss ) per books . . . . . .,Empty,Empty,Empty,3242,Empty,Empty,7,Empty,Income recorded on books this year,Empty,Empty,Empty,Empty
2,Federal income tax per books . . . . .,Empty,Empty,Empty,34,Empty,Empty,Empty,Empty,not included on this return ( itemize ) :,,,,
3,Excess of capital losses over capital gains,Empty,Empty,Empty,42342,Empty,Empty,Empty,Empty,Tax - exempt interest $,Empty,4353,,
4,Income subject to tax not recorded on books this year ( itemize ) :,234 42342,Empty,Empty,Empty,Empty,Empty,8,Empty,Deductions on this return not charged,,,,
5,Expenses recorded on books this year not deducted on this return ( itemize ) : a Depreciation . . . . $ b Charitable contributions . $ c Travel and entertainment . $,4234 4234 42364536 5426524,,,543,,,9,,against book income this year ( itemize ) : a Depreciation . . $ b Charitable contributions $ Add lines 7 and 8 . . . . . .,53425 5345,5344535 5345,5342,
6,Add lines 1 through 5 . . . . . . . . Schedule M - 2,,,,,,,10,"Analysis of Unappropriated Retained Earnings per Books ( Line 25 , Schedule L )","Income ( page 1 , line 28 ) —line 6 less line 9",,,,
1,. . . . .,Empty,Empty,Empty,Empty,5,Empty,Empty,Empty,Distributions : a Cash . . . . .,Empty,Empty,Empty,5432
2,Net income ( loss ) per books . . . . . .,Empty,b Stock,Empty,Empty,Empty,Empty,Empty,Empty,. . . .,Empty,Empty,Empty,52345
3,Other increases ( itemize ) :,,,54352143,,,6 7,,,c Property . . . . Other decreases ( itemize ) : Add lines 5 and 6 . . . . . .,,,,2354534 2345
4,"Add lines 1 , 2 , and 3 . . . . . . . .",Empty,Empty,Empty,Empty,Empty,Empty,8,Empty,Balance at end of year ( line 4 less line 7 ),,,,


Filename: f1120.pdf
Page: 5
Table: 1
Number of Columns: 20


col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
Form 1120 ( 2019 ),Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Page 6,Empty
Schedule L,Balance Sheets per Books Assets,,,,( a ),Beginning of tax year,,( b ),,,,,( c ),End of tax year,Empty,( d ),,,
1,Cash . . . . . . . . . . . . 2a Trade notes and accounts receivable . . . b Less allowance for bad debts . . . . .,412.34 (,,534.24,,),413241.23 43214.3,Empty,Empty,12341234,(,Empty,Empty,),Empty,Empty,87.64,,
3,Inventories . . . . . . . . . . .,Empty,Empty,Empty,Empty,Empty,41.32,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,34.76,,
4,. . . . .,Empty,Empty,Empty,Empty,Empty,4312,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,5697.87,,
5,Tax - exempt securities ( see instructions ) . .,Empty,Empty,Empty,Empty,Empty,432412.34,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,0.74,,
6,Other current assets ( attach statement ) . .,Empty,Empty,Empty,Empty,Empty,43.24,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,474.76,,
7,Loans to shareholders . . . . . . .,Empty,Empty,Empty,Empty,Empty,421.34,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,47.75,,
8,Mortgage and real estate loans . . . . .,Empty,Empty,Empty,Empty,Empty,98.76,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,73.56,,
9,Other investments ( attach statement ) . . . 10a Buildings and other depreciable assets b Less accumulated depreciation . . . . . 11a Depletable assets . . . . . . . . . b Less accumulated depletion . . . . . .,( (,345.64 42.14 142985.25 1234,,,) ),457 42 2341,,,43.24 476.54 42.34 344.32,,( (,,) ),,,25.46 57.43 876,,
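Many cells in the tables above are shown as `Empty` or are blank. When only the filled cells of a row are of interest, a row can be compacted as sketched below. This works on one comma-separated line as displayed above, not on the pipeline's `tables` column; `compact_row` is a hypothetical helper, and treating `Empty` as a placeholder is an assumption based on the displayed output.

```python
import csv
from typing import List


def compact_row(line: str, placeholder: str = "Empty") -> List[str]:
    """Parse one comma-separated row and keep only the cells that hold content."""
    cells = next(csv.reader([line]))  # csv handles quoted cells that contain commas
    stripped = (cell.strip() for cell in cells)
    return [cell for cell in stripped if cell and cell != placeholder]


row = "1a,Gross receipts or sales .,Empty,Empty,Empty,Empty,Empty,Empty,1a,Empty,50000.00,,,,,,,"
print(compact_row(row))
# ['1a', 'Gross receipts or sales .', '1a', '50000.00']
```

Dropping cells loses the column positions, so this suits quick inspection rather than reconstructing the original table layout.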
