![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_for_img_pdf_docx_files.ipynb)

# OCR for PDFs, Images and Docx files
In this notebook we will use OCR to extract text from haiku poems stored in `PDF`, `PNG` and `DOCX` files.


| NLU Spell            | Transformer Class                                                                       |
|----------------------|-----------------------------------------------------------------------------------------|
| nlu.load(`img2text`) | [ImageToText](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#imagetotext) |              
| nlu.load(`pdf2text`) | [PdfToText](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#pdftotext)     |              
| nlu.load(`doc2text`) | [DocToText](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#doctotext)     |              


When your NLU pipeline contains an OCR spell, the `predict()` method accepts the following inputs:

- a `string` pointing to a folder or to a file
- a `list`, `numpy array` or `Pandas Series` containing paths pointing to folders or files 
- a `Pandas Dataframe` or `Spark Dataframe` containing a column named `path` which has one path entry per row pointing to folders or files

For every path in the input passed to the `predict()` method, NLU will distinguish between two cases:
1. If the path points to a `file`, NLU will apply the OCR transformers to it, provided the file type is supported by the currently loaded OCR pipeline.
2. If the path points to a `folder`, NLU will recursively search the folder and its subfolders for files whose types are supported by the loaded OCR pipeline.

NLU checks the file endings, e.g. `.pdf`, `.png` or `.docx`, to determine whether the OCR models can be applied.
If your files lack these endings, NLU will not process them.
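
The path handling described above can be sketched in plain Python. This is only an illustration of the described behaviour, not NLU's actual implementation: the helper `collect_ocr_files` and the table `SUFFIXES_BY_SPELL` are hypothetical names, the suffix lists are copied from this notebook, and case-insensitive suffix matching is an assumption.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Hypothetical suffix table, copied from the suffixes listed in this notebook.
# It is NOT taken from NLU's source code.
SUFFIXES_BY_SPELL = {
    "pdf2text": {".pdf"},
    "img2text": {".jpeg", ".png", ".bmp", ".wbmp", ".gif", ".jpg", ".tiff"},
    "doc2text": {".docx"},
}


def collect_ocr_files(
    paths: Union[str, Path, Iterable[Union[str, Path]]], spell: str
) -> List[Path]:
    """Return every file under `paths` whose suffix matches the given spell."""
    suffixes = SUFFIXES_BY_SPELL[spell]
    # A single string or Path is treated like a one-element list.
    if isinstance(paths, (str, Path)):
        paths = [paths]

    found: List[Path] = []
    for entry in paths:
        entry = Path(entry)
        if entry.is_file():
            # Case 1: the path points to a file.
            candidates = [entry]
        elif entry.is_dir():
            # Case 2: the path points to a folder, so search it recursively.
            candidates = sorted(p for p in entry.rglob("*") if p.is_file())
        else:
            candidates = []
        # Assumption: suffixes are compared case-insensitively.
        found.extend(p for p in candidates if p.suffix.lower() in suffixes)
    return found
```

Calling `collect_ocr_files('/content', 'pdf2text')` would list the PDF files such a search is expected to pick up, which can help explain why a file without a matching ending is skipped.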


In [None]:
%%capture
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

import nlu


## Authorize your environment for OCR
You need a Spark OCR license to use OCR spells. A license is available for [free here](https://www.johnsnowlabs.com/spark-nlp-try-free/).

Either upload the JSON credentials file to the `/content` folder of Google Colab or manually pass the credentials to `nlu.auth`.
For more details on how to authorize your environment, check out the [OCR documentation page](https://nlu.johnsnowlabs.com/docs/en/nlu_for_ocr#authorize-via-providing-string-parameters).

In [None]:
%%capture
import nlu
# Alternatively, upload a secrets.json file and pass the path to nlu.auth()
# nlu.auth('/content/spark_nlp_ocr_hc.json')

AWS_ACCESS_KEY_ID = 'Your Credentials'
AWS_SECRET_ACCESS_KEY = 'Your Credentials'
OCR_SECRET = 'Your Credentials'
JSL_SECRET = 'Your Credentials'
OCR_LICENSE = "Your Credentials"
SPARK_NLP_LICENSE = 'Your Credentials'
# this will automatically install the OCR library and NLP Healthcare library when credentials are provided
nlu.auth(SPARK_NLP_LICENSE,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,JSL_SECRET, OCR_LICENSE, OCR_SECRET)
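
To avoid hard-coding secrets in a notebook, the six values could be read from environment variables instead. The following is a minimal sketch under stated assumptions: the helper `load_credentials` is hypothetical, and the environment variable names simply reuse the Python variable names from the cell above. The argument order follows the `nlu.auth(...)` call shown in this notebook.

```python
import os

# Assumption: each credential is stored in an environment variable with the
# same name as the Python variable used above. Order follows the nlu.auth call.
CREDENTIAL_NAMES = [
    "SPARK_NLP_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "JSL_SECRET",
    "OCR_LICENSE",
    "OCR_SECRET",
]


def load_credentials(env=None):
    """Return the credentials in nlu.auth order, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return [env[name] for name in CREDENTIAL_NAMES]


# Usage (commented out because it needs real credentials and the nlu package):
# import nlu
# nlu.auth(*load_credentials())
```

Failing early with the list of missing names makes a misconfigured environment easier to spot than a failed installation later on.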

## Download some `PDF` files we can apply OCR on


In [None]:
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/pdf/haiku.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/pdf/Compiling_a_Curriculum_Vitae.pdf


## Load the `pdf2text` spell and pass a PDF file or a directory which contains PDF files.
When given a directory, NLU will recursively search it for files which have the `.pdf` suffix in their name and apply the OCR model to them.

In [None]:
df = nlu.load('pdf2text').predict('/content/haiku.pdf')
print(df.iloc[0].values[0])

“Lighting One Candle” by Yosa Buson
The light of a candle
Is transferred to another candle—
Spring twilight



## Download some `image` files we can apply OCR on


In [None]:
%%capture
# Download some image files
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/images/100_dollar.jpg
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/images/50_dollar.jpg
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/images/haiku.png

## Load the `img2text` spell and pass an image file or a directory which contains image files.
When given a directory, NLU will recursively search it for files which have a `.jpeg`, `.png`, `.bmp`, `.wbmp`, `.gif`, `.jpg` or `.tiff` suffix in their name and apply the OCR model to them.



In [None]:
nlu.load('img2text').predict('/content/haiku.png')

Unnamed: 0,text
0,“The Old Pond” by Matsuo Basho\nAn old silent ...


## Download some Docx files


In [None]:
%%capture
# Download some DOCX files
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/docx/haiku.docx

## Load the `doc2text` spell and pass a DOCX file or a directory which contains DOCX files.
When given a directory, NLU will recursively search it for files which have the `.docx` suffix in their name and apply the OCR model to them.



In [None]:
df = nlu.load('doc2text').predict('/content/haiku.docx')
print(df.iloc[0].values[0])


“In a Station of the Metro” by Ezra Pound
The apparition of these faces in the crowd;
Petals on a wet, black bough.




## OCR examples with alternative input types

### Predict on array of paths

In [None]:
import pandas as pd 
pdf_paths = ['/content/Compiling_a_Curriculum_Vitae.pdf','/content/haiku.pdf',]
nlu.load('pdf2text' ).predict(pdf_paths)

Unnamed: 0,text
0,\nA Curriculum Vitae \nAlso called a CV or ...
1,\nCurriculum Vitae Format \nYour Contact In...
2,\nProfessional Memberships \nInterests \n \...
3,"\nJohn Smith \nStreet, City, State, Zip \n..."
4,"\nAwards and Honors: \n Treldar Scholar, 2..."
5,\n \n
6,“Lighting One Candle” by Yosa Buson\nThe light...


### Predict on dataframe containing a `path` column

In [None]:
# create a dataframe with paths
pdf_paths = ['/content/Compiling_a_Curriculum_Vitae.pdf','/content/haiku.pdf',]
df = pd.DataFrame({'path':pdf_paths})
df

Unnamed: 0,path
0,/content/Compiling_a_Curriculum_Vitae.pdf
1,/content/haiku.pdf


In [None]:
# Process dataframe with paths
nlu.load('pdf2text' ).predict(df)


Unnamed: 0,text
0,\nA Curriculum Vitae \nAlso called a CV or ...
1,\nCurriculum Vitae Format \nYour Contact In...
2,\nProfessional Memberships \nInterests \n \...
3,"\nJohn Smith \nStreet, City, State, Zip \n..."
4,"\nAwards and Honors: \n Treldar Scholar, 2..."
5,\n \n
6,“Lighting One Candle” by Yosa Buson\nThe light...


## Stack OCR + NLP models

You can combine OCR spells with any other NLU spell.
This lets you apply any NLP model directly to the text extracted by the OCR model.

In [None]:
%%capture
# Let's download an image containing text with named entities
! wget https://github.com/JohnSnowLabs/nlu/raw/3.4.0rc1/tests/datasets/ocr/images/presidents.png


In [None]:
# Extract entities in the text of every file
df = nlu.load('img2text ner' ).predict('/content/presidents.png', output_level='chunk')
df[['entities_ner','entities_ner_class','entities_ner_confidence']]

onto_recognize_entities_sm download started this may take some time.
Approx size to download 160.1 MB
[OK!]


Unnamed: 0,entities_ner,entities_ner_class,entities_ner_confidence
0,Four,CARDINAL,0.9978
0,"(William Henry Harrison, Zachary Taylor,\nWarr...",PERSON,0.771225
0,"Franklin D. Roosevelt),",PERSON,0.7377667
0,"(Abraham\nLincoln, James A. Garfield, William ...",PERSON,0.71588576
0,"John F. Kennedy),",PERSON,0.8516
0,"(Richard Nixon,",PERSON,0.64785
0,9,CARDINAL,0.8247
0,John Tyler,PERSON,0.7595
0,first,ORDINAL,0.8554
0,10,CARDINAL,0.8789
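
Entity chunks extracted from OCR text often carry stray punctuation and line breaks, as in `Franklin D. Roosevelt),` above. The sketch below shows one possible clean-up step. The helpers `clean_chunk` and `select_entities` are hypothetical and not part of NLU, and the sample rows are copied from the output above. With a real result, rows could be obtained from the three displayed columns, for example via `df[[...]].itertuples(index=False)`.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, float]  # (entity text, entity class, confidence)


def clean_chunk(chunk: str) -> str:
    """Collapse whitespace and line breaks, then strip stray brackets and commas."""
    text = " ".join(chunk.split())
    return text.strip("(),; ")


def select_entities(
    rows: Iterable[Row], label: str = "PERSON", min_confidence: float = 0.7
) -> List[str]:
    """Keep cleaned chunks of one class whose confidence reaches the threshold."""
    selected = []
    for text, entity_class, confidence in rows:
        if entity_class == label and float(confidence) >= min_confidence:
            selected.append(clean_chunk(text))
    return selected


# Sample rows copied from the (non-truncated) output above.
rows = [
    ("Four", "CARDINAL", 0.9978),
    ("Franklin D. Roosevelt),", "PERSON", 0.7377667),
    ("John F. Kennedy),", "PERSON", 0.8516),
    ("(Richard Nixon,", "PERSON", 0.64785),
    ("John Tyler", "PERSON", 0.7595),
    ("first", "ORDINAL", 0.8554),
]

people = select_entities(rows)
print(people)  # ['Franklin D. Roosevelt', 'John F. Kennedy', 'John Tyler']
```

The threshold of `0.7` is an arbitrary illustration value; here it drops the `(Richard Nixon,` chunk, whose confidence is 0.64785.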
