![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ocr/PDF_TO_TEXT.ipynb)

# PDF to Text

To run this yourself, you will need to upload your **Spark OCR** license keys to the notebook. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload license keys, open the file explorer on the left side of the screen and upload `workshop_license_keys.json` to the folder that opens.

For more in-depth tutorials: https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter

## 1. Colab Setup

Upload the license keys file. The cell below prompts for the file, saves it as `spark_ocr.json`, and loads its key-value pairs.

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()
os.rename(list(license_keys.keys())[0], 'spark_ocr.json')

with open('spark_ocr.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
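
Later cells rely on `SPARK_OCR_SECRET`, `OCR_VERSION`, and `PUBLIC_VERSION` being present in the uploaded file. A minimal sketch of an early check, assuming those are the only keys this notebook needs; the helper name `missing_license_keys` and the example values are hypothetical and not part of Spark OCR:

```python
# Key names taken from the later cells of this notebook.
REQUIRED_KEYS = ("SPARK_OCR_SECRET", "OCR_VERSION", "PUBLIC_VERSION")


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required key names that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]


# Placeholder values for illustration only, not a real license.
example = {"SPARK_OCR_SECRET": "placeholder-secret", "OCR_VERSION": "3.9.1"}
print(missing_license_keys(example))  # ['PUBLIC_VERSION']
```

In the notebook, the same call could be made on `license_keys` right after `json.load`, so that a missing key is reported before the install step rather than as an empty `$VARIABLE` in a `pip` command.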

Install Dependencies

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.0.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark OCR
! pip install spark-ocr==$OCR_VERSION\+spark30 --extra-index-url=https://pypi.johnsnowlabs.com/$SPARK_OCR_SECRET --upgrade
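
In IPython, `$NAME` inside a `!` shell line is replaced with the value of the Python variable `NAME`, which is why the license keys were loaded as variables first. The sketch below shows roughly what the second command expands to; the function `build_ocr_install_args` and the sample values are hypothetical and for illustration only:

```python
def build_ocr_install_args(ocr_version, secret, spark_suffix="spark30"):
    """Build the pip argument list equivalent to the Spark OCR install line."""
    # The "+spark30" local version label selects the build for Spark 3.0.x.
    requirement = f"spark-ocr=={ocr_version}+{spark_suffix}"
    index_url = f"https://pypi.johnsnowlabs.com/{secret}"
    return ["pip", "install", requirement,
            f"--extra-index-url={index_url}", "--upgrade"]


# Placeholder secret; a real one comes from the license file.
args = build_ocr_install_args("3.9.1", "placeholder-secret")
print(" ".join(args))
```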

<h1><font color='darkred'>!!! ATTENTION !!!</font></h1>

<b>After running the previous cell, <font color='darkred'>RESTART the COLAB RUNTIME</font> and then continue.</b>

Importing Libraries

In [1]:
import json, os

with open("spark_ocr.json", 'r') as f:
  license_keys = json.load(f)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

# Defining license key-value pairs as local variables
locals().update(license_keys)

In [2]:
import pandas as pd
import numpy as np
import os

#Pyspark Imports
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

# Necessary imports from Spark OCR library
import sparkocr
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_image, to_pil_image
from sparkocr.metrics import score
import pkg_resources

# import sparknlp packages
from sparknlp.annotator import *
from sparknlp.base import *

Start Spark Session

In [3]:
spark = sparkocr.start(secret=SPARK_OCR_SECRET, 
                       nlp_version=PUBLIC_VERSION
                       )

Spark version: 3.0.2
Spark NLP version: 3.3.4
Spark OCR version: 3.9.1



## 2. Download and read a PDF file

In [4]:
!wget http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf -O sample.pdf

--2022-01-10 17:32:18--  http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
Resolving unec.edu.az (unec.edu.az)... 144.76.199.105
Connecting to unec.edu.az (unec.edu.az)|144.76.199.105|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7945 (7.8K) [application/pdf]
Saving to: ‘sample.pdf’


2022-01-10 17:32:18 (389 MB/s) - ‘sample.pdf’ saved [7945/7945]



In [5]:
image_df = spark.read.format("binaryFile").load('sample.pdf').cache()
image_df.show()

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/content/sam...|2014-12-29 12:10:50|  7945|[25 50 44 46 2D 3...|
+--------------------+-------------------+------+--------------------+
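
The `content` column holds the raw file bytes. The leading bytes shown above, `25 50 44 46 2D`, are the ASCII characters `%PDF-`, the signature that PDF files start with. A small sketch of that check in plain Python; the helper `looks_like_pdf` is hypothetical and not part of Spark OCR:

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF signature '%PDF-'."""
    return content.startswith(b"%PDF-")


# The first five bytes from the `content` column in the output above.
header = bytes.fromhex("25 50 44 46 2D")
print(header.decode("ascii"))   # %PDF-
print(looks_like_pdf(header))   # True
```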



## 3. Construct the OCR pipeline

In [6]:
pdf_to_image = PdfToImage() \
            .setInputCol("content") \
            .setOutputCol("image_raw") \
            .setKeepInput(True)

# Transform image to the binary color model
binarizer = ImageBinarizer() \
            .setInputCol("image_raw") \
            .setOutputCol("image") \
            .setThreshold(130)
# Run OCR for each region
ocr = ImageToText() \
            .setInputCol("image") \
            .setOutputCol("text") \
            .setIgnoreResolution(False) \
            .setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
            .setConfidenceThreshold(60)

# Render the recognized text, with its positions, to a PDF document
textToPdf = TextToPdf() \
            .setInputCol("positions") \
            .setInputImage("image") \
            .setInputText("text") \
            .setOutputCol("pdf") \
            .setInputContent("content")
# OCR pipeline
pipeline = PipelineModel(stages=[
            pdf_to_image,
            binarizer,
            ocr,
            textToPdf
        ])
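
To give a feel for what the `setThreshold(130)` setting means, here is a plain-Python sketch of global thresholding on grayscale pixel values. It assumes that pixels strictly above the threshold become white and all others become black; the actual `ImageBinarizer` implementation may treat the boundary or the output encoding differently, and the function `binarize` is hypothetical:

```python
def binarize(gray_rows, threshold=130):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white)."""
    # Assumption: values strictly greater than the threshold become white.
    return [[255 if pixel > threshold else 0 for pixel in row]
            for row in gray_rows]


# A tiny 2x4 "image" with dark text pixels and a light background.
sample = [[0, 130, 131, 255],
          [40, 200, 90, 180]]
print(binarize(sample))  # [[0, 0, 255, 255], [0, 255, 0, 255]]
```

A lower threshold turns more mid-gray pixels white, which can thin out faint strokes; a higher one keeps more of them black at the risk of merging noise into the text.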

## 4. Run OCR pipeline

In [7]:
result = pipeline.transform(image_df).cache()

## 5. Visualize Results

Display result dataframe

In [8]:
result.select("text").show()

+--------------------+
|                text|
+--------------------+
|[Adobe Acrobat PD...|
+--------------------+



Display text

In [None]:
result_arr = []
for r in result.distinct().collect():
  for page in r.text:
    print(page)
    result_arr.append(page)

Adobe Acrobat PDF Files

Adobe® Portable Document Format (PDF) is a universal file format that preserves all

of the fonts, formatting, colours and graphics of any source document, regardless of

the application and platform used to create it.

Adobe PDF is an ideal format for electronic document distribution as it overcomes the

problems commonly encountered with electronic file sharing.

Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat

Reader. Recipients of other file formats sometimes can't open files because they

don't have the applications used to create the documents.

PDF files always print correctly on any printing device.

PDF files always display exactly as created, regardless of fonts, software, and

operating systems. Fonts, and graphics are not lost due to platform, software, and

version incompatibilities.

The free Acrobat Reader is easy to download and can be freely distributed by

anyone.

Compact PDF files are smaller than their source fi