![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/table_extraction.ipynb)

# Table extraction for PDFs, PPTs and Docx files
In this notebook we will use OCR to extract tables in text format from `PDF`, `PPT` and `DOCX` files.


| NLU Spell            | Transformer Class                                                                       |
|----------------------|-----------------------------------------------------------------------------------------|
| nlu.load(`pdf2table`) | [PdfToTextTable](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#pdftotexttable) |              
| nlu.load(`ppt2table`) | [PptToTextTable](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#ppttotexttable)     |              
| nlu.load(`doc2table`) | [DocToTextTable](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#doctotexttable)     |              


When your NLU pipeline contains an OCR spell, the `predict()` method accepts the following inputs:

- a `string` pointing to a folder or to a file
- a `list`, `numpy array` or `Pandas Series` containing paths pointing to folders or files 
- a `Pandas Dataframe` or `Spark Dataframe` containing a column named `path` which has one path entry per row pointing to folders or files

For every path passed to the `predict()` method, NLU distinguishes between two cases:
1. If the path points to a `file`, NLU applies the OCR transformers to it, provided the file type is supported by the currently loaded OCR pipeline.
2. If the path points to a `folder`, NLU recursively searches the folder and its subfolders for files whose types are supported by the loaded OCR pipeline.

NLU checks the file endings (e.g. `.pdf`, `.doc`, `.ppt`) to determine whether the OCR models can be applied.
If your files lack these endings, NLU will not process them.
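
The file-versus-folder rule above can be sketched in plain Python. This is only an illustration of the described behaviour, not NLU's actual implementation: the helper name `collect_ocr_files`, the `PDF_SUFFIXES` set and the case-insensitive suffix match are all assumptions made for this example.

```python
from pathlib import Path

# Assumed suffix set for illustration; the real set depends on the loaded OCR pipeline.
PDF_SUFFIXES = {".pdf"}


def collect_ocr_files(paths, suffixes=PDF_SUFFIXES):
    """Hypothetical helper: expand files and folders into a sorted list of matching file paths."""
    # Accept a single path as well as a collection of paths.
    if isinstance(paths, (str, Path)):
        paths = [paths]

    found = []
    for entry in paths:
        p = Path(entry)
        if p.is_dir():
            # Folder: look at every file in the folder and all of its subfolders.
            candidates = [c for c in p.rglob("*") if c.is_file()]
        elif p.is_file():
            # Single file: it is the only candidate.
            candidates = [p]
        else:
            # Path does not exist: nothing to process.
            candidates = []
        # Keep only files whose ending is in the supported set (assumed case-insensitive).
        found.extend(c for c in candidates if c.suffix.lower() in suffixes)
    return sorted(str(f) for f in found)
```

Calling `collect_ocr_files("/content")` would then list every `.pdf` file under `/content`, while a file without a supported ending is silently skipped, mirroring the behaviour described above.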


In [40]:
%%capture
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

import nlu


## Authorize your environment for OCR
OCR spells require a Spark OCR license, which is available for [free here](https://www.johnsnowlabs.com/spark-nlp-try-free/).

Either upload the JSON credentials file to the `/content` folder of Google Colab or manually pass the credentials to `nlu.auth`.  
For more details on how to authorize your environment, check out the [OCR documentation page](https://nlu.johnsnowlabs.com/docs/en/nlu_for_ocr#authorize-via-providing-string-parameters).

In [None]:
import nlu
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

# this will automatically install the OCR library and NLP Healthcare library when credentials are provided
nlu.auth('spark_jsl.json')

## Download some `PDF` files we can apply OCR on


In [42]:
! wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/pdf/tables.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/pdf/insurance.pdf

--2022-07-16 15:24:18--  https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/pdf/tables.pdf
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-07-16 15:24:18 ERROR 404: Not Found.

--2022-07-16 15:24:18--  https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/pdf/insurance.pdf
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-07-16 15:24:18 ERROR 404: Not Found.



## Create a helper function to print tables easily

In [43]:
def print_tables(results):
  for i, df in enumerate(results):
    print(f"Table {i+1}:")
    print(df)
    print("\n")

## Load the `pdf2table` spell
As a first step, we run the spell on a simple `pdf` file; then we run it on a more complex `pdf` that contains multiple tables.

In [44]:
results = nlu.load('pdf2table').predict('/content/tables.pdf')
print_tables(results)

Table 1:
     mpg cyl   disp   hp  drat     wt   qsec vs am gear
1   21.0   6  160.0  110  3.90  2.620  16.46  0  1    4
2   21.0   6  160.0  110  3.90  2.875  17.02  0  1    4
3   22.8   4  108.0   93  3.85  2.320  18.61  1  1    4
4   21.4   6  258.0  110  3.08  3.215  19.44  1  0    3
5   18.7   8  360.0  175  3.15  3.440  17.02  0  0    3
6   18.1   6  225.0  105  2.76  3.460  20.22  1  0    3
7   14.3   8  360.0  245  3.21  3.570  15.84  0  0    3
8   24.4   4  146.7   62  3.69  3.190  20.00  1  0    4
9   22.8   4  140.8   95  3.92  3.150  22.90  1  0    4
10  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4
11  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4
12  16.4   8  275.8  180  3.07  4.070  17.40  0  0    3
13  17.3   8  275.8  180  3.07  3.730  17.60  0  0    3
14  15.2   8  275.8  180  3.07  3.780  18.00  0  0    3
15  10.4   8  472.0  205  2.93  5.250  17.98  0  0    3
16  10.4   8  460.0  215  3.00  5.424  17.82  0  0    3
17  14.7   8  440.0  230  3.23  5.345  

## Run the `pdf2table` spell on a more complex PDF file
Here we have a real PDF file with 8 tables. Let's run the `pdf2table` spell on it!

In [45]:
results = nlu.load('pdf2table').predict('/content/insurance.pdf')
print_tables(results)

Table 1:
     IV-565  \
1    IV-568   
2    IV-570   
3    IV-575   
4    IV-580   

  Group Life Insurance Definition and Group\rLife Insurance Standard Provisions Model\rAct  \
1          Military Sales Practices Model Regulation                                         
2  Advertisements of Life Insurance and\rAnnuitie...                                         
3  Life and Health Insurance Policy Language\rSim...                                         
4         Life Insurance Disclosure Model Regulation                                         

  This model act sets forth group life insurance standard provisions. It defines group life\rinsurance and contains provisions relating to limits of group life insurance, notice of\rcompensation, dependent coverage, standard provisions, and a supplementary provision\rrelating to conversion privileges.  \
1  The purpose of this model regulation is to set...                                                                                      

## Table extraction examples with alternative input types

### Predict on a list of paths

In [46]:
pdf_paths = ['/content/tables.pdf','/content/insurance.pdf',]
results = nlu.load('pdf2table').predict(pdf_paths)
print_tables(results)

Table 1:
     mpg cyl   disp   hp  drat     wt   qsec vs am gear
1   21.0   6  160.0  110  3.90  2.620  16.46  0  1    4
2   21.0   6  160.0  110  3.90  2.875  17.02  0  1    4
3   22.8   4  108.0   93  3.85  2.320  18.61  1  1    4
4   21.4   6  258.0  110  3.08  3.215  19.44  1  0    3
5   18.7   8  360.0  175  3.15  3.440  17.02  0  0    3
6   18.1   6  225.0  105  2.76  3.460  20.22  1  0    3
7   14.3   8  360.0  245  3.21  3.570  15.84  0  0    3
8   24.4   4  146.7   62  3.69  3.190  20.00  1  0    4
9   22.8   4  140.8   95  3.92  3.150  22.90  1  0    4
10  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4
11  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4
12  16.4   8  275.8  180  3.07  4.070  17.40  0  0    3
13  17.3   8  275.8  180  3.07  3.730  17.60  0  0    3
14  15.2   8  275.8  180  3.07  3.780  18.00  0  0    3
15  10.4   8  472.0  205  2.93  5.250  17.98  0  0    3
16  10.4   8  460.0  215  3.00  5.424  17.82  0  0    3
17  14.7   8  440.0  230  3.23  5.345  

### Predict on a DataFrame containing a `path` column

In [47]:
# create a dataframe with paths
import pandas as pd 

pdf_paths = ['/content/tables.pdf','/content/insurance.pdf',]
df = pd.DataFrame({'path':pdf_paths})
df

                     path
0     /content/tables.pdf
1  /content/insurance.pdf


In [48]:
# Process dataframe with paths
results = nlu.load('pdf2table').predict(df)
print_tables(results)

Table 1:
     mpg cyl   disp   hp  drat     wt   qsec vs am gear
1   21.0   6  160.0  110  3.90  2.620  16.46  0  1    4
2   21.0   6  160.0  110  3.90  2.875  17.02  0  1    4
3   22.8   4  108.0   93  3.85  2.320  18.61  1  1    4
4   21.4   6  258.0  110  3.08  3.215  19.44  1  0    3
5   18.7   8  360.0  175  3.15  3.440  17.02  0  0    3
6   18.1   6  225.0  105  2.76  3.460  20.22  1  0    3
7   14.3   8  360.0  245  3.21  3.570  15.84  0  0    3
8   24.4   4  146.7   62  3.69  3.190  20.00  1  0    4
9   22.8   4  140.8   95  3.92  3.150  22.90  1  0    4
10  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4
11  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4
12  16.4   8  275.8  180  3.07  4.070  17.40  0  0    3
13  17.3   8  275.8  180  3.07  3.730  17.60  0  0    3
14  15.2   8  275.8  180  3.07  3.780  18.00  0  0    3
15  10.4   8  472.0  205  2.93  5.250  17.98  0  0    3
16  10.4   8  460.0  215  3.00  5.424  17.82  0  0    3
17  14.7   8  440.0  230  3.23  5.345  

### Pass a directory which contains PDF files
NLU will recursively search the folder for files whose names end in `.pdf` and apply the OCR model to each of them.

In [49]:
results = nlu.load('pdf2table').predict("/content")
print_tables(results)

Table 1:
    Volume/\rModel #                                     Title of Model  \
1            III-440   Insurance Holding Company System\rRegulatory Act   
2            III-450  Insurance Holding Company System Model\rRegula...   

                                         Description  \
1  This model act includes requirements pertainin...   
2  This model regulation sets forth rules and pro...   

  Related Charts: State Laws on Insurance Topics    
1                                                   
2                                                   


Table 2:
     mpg cyl   disp   hp  drat     wt   qsec vs am gear
1   21.0   6  160.0  110  3.90  2.620  16.46  0  1    4
2   21.0   6  160.0  110  3.90  2.875  17.02  0  1    4
3   22.8   4  108.0   93  3.85  2.320  18.61  1  1    4
4   21.4   6  258.0  110  3.08  3.215  19.44  1  0    3
5   18.7   8  360.0  175  3.15  3.440  17.02  0  0    3
6   18.1   6  225.0  105  2.76  3.460  20.22  1  0    3
7   14.3   8  360.0  245  3.21  3.

## Use the `ppt2table` spell

### Let's download the test `ppt` file first

In [50]:
!wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/ppt/tables.ppt

--2022-07-16 15:24:38--  https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/ppt/tables.ppt
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-07-16 15:24:38 ERROR 404: Not Found.



### Load the spell and predict on the test file

In [51]:
results = nlu.load('ppt2table').predict("/content/tables.ppt")
print_tables(results)

Table 1:
                          mpg cyl   disp   hp  drat     wt   qsec vs am gear  \
1             Mazda RX4  21.0   6  160.0  110  3.90  2.620  16.46  0  1    4   
2         Mazda RX4 Wag  21.0   6  160.0  110  3.90  2.875  17.02  0  1    4   
3            Datsun 710  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4   
4        Hornet 4 Drive  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3   
5     Hornet Sportabout  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3   
6               Valiant  18.1   6  225.0  105  2.76  3.460  20.22  1  0    3   
7            Duster 360  14.3   8  360.0  245  3.21  3.570  15.84  0  0    3   
8             Merc 240D  24.4   4  146.7   62  3.69  3.190  20.00  1  0    4   
9              Merc 230  22.8   4  140.8   95  3.92  3.150  22.90  1  0    4   
10             Merc 280  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4   
11            Merc 280C  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4   
12           Merc 450SE  16.4  

## Use the `doc2table` spell

### Let's download the test `docx` file first

In [52]:
!wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/doc/tables.docx

--2022-07-16 15:24:40--  https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/doc/tables.docx
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-07-16 15:24:40 ERROR 404: Not Found.



### Load the spell and predict on the test file

In [53]:
results = nlu.load('doc2table').predict("/content/tables.docx")
print_tables(results)