![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ocr/PDF_TO_TEXT.ipynb)

# PDF to Text

To run this yourself, you will need to upload your **Spark OCR** license keys to the notebook. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload license keys, open the file explorer on the left side of the screen and upload `workshop_license_keys.json` to the folder that opens.

For more in-depth tutorials, see https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter

## 1. Colab Setup

Install the correct version of Pillow and restart the runtime

In [1]:
# Install the Pillow version this notebook expects
import os
import PIL

if PIL.__version__ != '6.2.1':
    print('Installing the correct version of Pillow. The kernel will restart automatically.')
    !pip install --upgrade pillow==6.2.1
    # Hard-restart the runtime so the new Pillow version is loaded
    os.kill(os.getpid(), 9)
else:
    print('Correct Pillow detected')

Correct Pillow detected


Read license key

In [2]:
import os
import json

with open('workshop_license_keys.json') as f:
    license_keys = json.load(f)

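print(license_keys.keys())

secret = license_keys['JSL_OCR_SECRET']
# Expose the same license value under both environment variable names
os.environ['SPARK_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
os.environ['JSL_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
# The Spark OCR version is the part of the secret before the first hyphen
version = secret.split("-")[0]
print('Spark OCR Version:', version)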

dict_keys(['JSL_OCR_SECRET', 'SPARK_OCR_LICENSE'])
Spark OCR Version: 1.5.0
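
The cell above takes the Spark OCR version from the part of the secret that precedes the first hyphen. A minimal sketch of the same step with basic validation follows, assuming the keys file contains the two fields printed above. `read_ocr_keys` is a hypothetical helper, not part of Spark OCR, and the sample values are made up for illustration.

```python
import json

# Field names as printed by the cell above.
REQUIRED_KEYS = ("JSL_OCR_SECRET", "SPARK_OCR_LICENSE")


def read_ocr_keys(raw_json):
    """Parse license JSON text and return (secret, license, version).

    Hypothetical helper for illustration; not part of Spark OCR.
    """
    keys = json.loads(raw_json)
    missing = [k for k in REQUIRED_KEYS if k not in keys]
    if missing:
        raise KeyError("Missing license fields: " + ", ".join(missing))
    secret = keys["JSL_OCR_SECRET"]
    # Same rule as the notebook cell: the version precedes the first hyphen.
    version = secret.split("-")[0]
    return secret, keys["SPARK_OCR_LICENSE"], version


# Made-up values; a real file holds the keys issued with the license.
sample = json.dumps({"JSL_OCR_SECRET": "1.5.0-abc123",
                     "SPARK_OCR_LICENSE": "dummy-license"})
secret_demo, license_demo, version_demo = read_ocr_keys(sample)
print("Spark OCR Version:", version_demo)
```

Failing early on a missing field gives a clearer error than a bare `KeyError` raised later, in the middle of the install step.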


Install Dependencies

In [3]:
# Install Java
!apt-get update
!apt-get install -y openjdk-8-jdk
!java -version

# Install pyspark, SparkOCR, and SparkNLP
!pip install --ignore-installed -q pyspark==2.4.4
# Install Spark OCR from PyPI using the secret
!python -m pip install --upgrade spark-ocr==$version  --extra-index-url https://pypi.johnsnowlabs.com/$secret
# or install from local path
# %pip install --user ../../python/dist/spark-ocr-[version].tar.gz
!pip install --ignore-installed -q spark-nlp==2.5.2


Importing Libraries

In [4]:
import pandas as pd
import numpy as np
import os

# PySpark imports
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

# Necessary imports from Spark OCR library
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_image, to_pil_image
from sparkocr.metrics import score
import pkg_resources

# import sparknlp packages
from sparknlp.annotator import *
from sparknlp.base import *

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]


Start Spark Session

In [5]:
spark = start(secret=secret)
spark

## 2. Download and read a PDF file

In [6]:
!wget http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf -O sample.pdf

--2020-08-10 18:12:23--  http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
Resolving unec.edu.az (unec.edu.az)... 176.9.78.164
Connecting to unec.edu.az (unec.edu.az)|176.9.78.164|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7945 (7.8K) [application/pdf]
Saving to: ‘sample.pdf’


2020-08-10 18:12:24 (16.9 MB/s) - ‘sample.pdf’ saved [7945/7945]



In [7]:
image_df = spark.read.format("binaryFile").load('sample.pdf').cache()
image_df.show()

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/content/sam...|2014-12-29 12:10:50|  7945|[25 50 44 46 2D 3...|
+--------------------+-------------------+------+--------------------+
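
The `content` column holds the raw bytes of the file. The hex preview `25 50 44 46 2D` decodes to `%PDF-`, the signature that begins a PDF file. A small sketch for checking that signature follows; `looks_like_pdf` is a hypothetical helper, not a Spark OCR function.

```python
# The five bytes that open a PDF file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content):
    """Return True if the byte string starts with the PDF signature."""
    return bytes(content[:len(PDF_MAGIC)]) == PDF_MAGIC


# The first five bytes shown in the `content` preview above.
preview = bytes.fromhex("25 50 44 46 2D")
print(preview)
print(looks_like_pdf(preview))
print(looks_like_pdf(b"\x89PNG\r\n"))
```

In the notebook, the same check could be applied to the `content` value of a collected row before running the OCR pipeline.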



## 3. Construct the OCR pipeline

In [8]:
pdf_to_image = PdfToImage() \
            .setInputCol("content") \
            .setOutputCol("image_raw") \
            .setKeepInput(True)

# Transform image to the binary color model
binarizer = ImageBinarizer() \
            .setInputCol("image_raw") \
            .setOutputCol("image") \
            .setThreshold(130)

# Run OCR on the binarized image
ocr = ImageToText() \
            .setInputCol("image") \
            .setOutputCol("text") \
            .setIgnoreResolution(False) \
            .setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
            .setConfidenceThreshold(60)

# Render text with positions to a PDF document
textToPdf = TextToPdf() \
            .setInputCol("positions") \
            .setInputImage("image") \
            .setInputText("text") \
            .setOutputCol("pdf") \
            .setInputContent("content")
# OCR pipeline
pipeline = PipelineModel(stages=[
            pdf_to_image,
            binarizer,
            ocr,
            textToPdf
        ])
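
The binarizer stage reduces each page image to two levels using the value passed to `setThreshold(130)`. The following pure-Python sketch shows global thresholding on a tiny grayscale grid, under the assumption that pixels above the threshold become white and all others black. The exact comparison and output values used by `ImageBinarizer` are not stated in this notebook, so treat this as an illustration of the idea rather than the library's implementation. `binarize` is a hypothetical function.

```python
def binarize(gray_rows, threshold=130):
    """Map each grayscale pixel (0-255) to 0 or 255 by a global threshold.

    Assumption for illustration: values strictly above the threshold
    become white (255); everything else becomes black (0).
    """
    return [[255 if px > threshold else 0 for px in row] for row in gray_rows]


# A 2x3 toy "image" with values on both sides of the threshold.
page = [[12, 129, 130],
        [131, 200, 255]]
print(binarize(page))
```

A lower threshold turns more pixels white, which can thin out faint strokes, while a higher one keeps more dark pixels, including background noise.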

## 4. Run OCR pipeline

In [9]:
result = pipeline.transform(image_df).cache()

## 5. Visualize Results

Display the result DataFrame

In [10]:
result.select("text").show()

+--------------------+
|                text|
+--------------------+
|[Adobe Acrobat PD...|
+--------------------+



Display text

In [11]:
# Collect the recognized text and print it page by page
result_arr = []
for r in result.distinct().collect():
    for page in r.text:
        print(page)
        result_arr.append(page)

Adobe Acrobat PDF Files

Adobe® Portable Document Format (PDF) is a universal file format that preserves all

of the fonts, formatting, colours and graphics of any source document, regardless of

the application and platform used to create it.

Adobe PDF is an ideal format for electronic document distribution as it overcomes the

problems commonly encountered with electronic file sharing.

Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat

Reader. Recipients of other file formats sometimes can't open files because they

don't have the applications used to create the documents.

PDF files always print correctly on any printing device.

PDF files always display exactly as created, regardless of fonts, software, and

operating systems. Fonts, and graphics are not lost due to platform, software, and

version incompatibilities.

The free Acrobat Reader is easy to download and can be freely distributed by

anyone.

Compact PDF files are smaller than their source fi