![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/table_extraction.ipynb)

# Table extraction for PDFs, PPTs and Docx files
In this notebook we will extract tables as text from `PDF`, `PPT` and `DOCX` files using OCR.


| NLU Spell            | Transformer Class                                                                       |
|----------------------|-----------------------------------------------------------------------------------------|
| nlu.load(`pdf2table`) | [PdfToTextTable](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#pdftotexttable) |              
| nlu.load(`ppt2table`) | [PptToTextTable](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#ppttotexttable)     |              
| nlu.load(`doc2table`) | [DocToTextTable](https://nlp.johnsnowlabs.com/docs/en/ocr_pipeline_components#doctotexttable)     |              


When your NLU pipeline contains an OCR spell, the `predict()` method accepts the following inputs:

- a `string` pointing to a folder or to a file
- a `list`, `numpy array` or `Pandas Series` containing paths pointing to folders or files 
- a `Pandas DataFrame` or `Spark DataFrame` containing a column named `path` which has one path entry per row pointing to folders or files

For every path passed to `predict()`, NLU distinguishes between two cases:
1. If the path points to a `file`, NLU applies the OCR transformers to it, provided the file type is supported by the currently loaded OCR pipeline.
2. If the path points to a `folder`, NLU recursively searches the folder and its subfolders for files whose type is supported by the loaded OCR pipeline.

NLU checks the file extension (e.g. `.pdf`, `.doc`, `.ppt`) to determine whether the OCR models can be applied.
If your files lack these extensions, NLU will not process them.
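
The file-versus-folder rule described above can be sketched in plain Python. This is only an illustration of the described behaviour, not NLU's actual implementation: the helper name `resolve_paths`, its `extensions` parameter and the case-insensitive extension match are all assumptions made for this example.

```python
from pathlib import Path

def resolve_paths(path, extensions=frozenset({".pdf"})):
    """Return the files a path would contribute, filtered by extension.

    Hypothetical helper: mirrors the rule described in this notebook,
    it is not part of the NLU API.
    """
    p = Path(path)
    if p.is_file():
        # Case 1: a single file is kept only if its extension matches.
        return [str(p)] if p.suffix.lower() in extensions else []
    if p.is_dir():
        # Case 2: a folder is searched recursively, including subfolders.
        return sorted(
            str(f) for f in p.rglob("*")
            if f.is_file() and f.suffix.lower() in extensions
        )
    # Paths that do not exist contribute nothing.
    return []
```

Running such a check before calling `predict()` is a quick way to see which files a folder would contribute, for example `resolve_paths('/content')`.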


In [None]:
%%capture
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu

## Authorize your environment for OCR
You need a Spark OCR license to use OCR spells, which is available for [free here](https://www.johnsnowlabs.com/spark-nlp-try-free/) 

Either upload the JSON credentials file to the `/content` folder of Google Colab or manually pass the credentials to `nlu.auth`.      
For more details on how to authorize your environment, check out the [OCR documentation page](https://nlu.johnsnowlabs.com/docs/en/nlu_for_ocr#authorize-via-providing-string-parameters)

In [None]:
import nlu
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

# this will automatically install the OCR library and NLP Healthcare library when credentials are provided
nlu.auth('spark_jsl.json')

## Download some `PDF` files we can apply OCR on


In [None]:
! wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/pdf/tables.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/pdf/insurance.pdf

## Create a helper function to print tables easily

In [3]:
def print_tables(results):
  """Print every extracted table (a pandas DataFrame) with a 1-based label."""
  for i, df in enumerate(results, start=1):
    print(f"Table {i}:")
    print(df)
    print("\n")

## Load the `pdf2table` spell
First we will run the spell on a simple `pdf` file, then we will run it on a more complex `pdf` that contains multiple tables.

In [18]:
results = nlu.load('pdf2table').predict('/content/tables.pdf')
print_tables(results)

Table 1:
     mpg cyl   disp   hp  drat     wt   qsec vs am gear
1   21.0   6  160.0  110  3.90  2.620  16.46  0  1    4
2   21.0   6  160.0  110  3.90  2.875  17.02  0  1    4
3   22.8   4  108.0   93  3.85  2.320  18.61  1  1    4
4   21.4   6  258.0  110  3.08  3.215  19.44  1  0    3
5   18.7   8  360.0  175  3.15  3.440  17.02  0  0    3
6   18.1   6  225.0  105  2.76  3.460  20.22  1  0    3
7   14.3   8  360.0  245  3.21  3.570  15.84  0  0    3
8   24.4   4  146.7   62  3.69  3.190  20.00  1  0    4
9   22.8   4  140.8   95  3.92  3.150  22.90  1  0    4
10  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4
11  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4
12  16.4   8  275.8  180  3.07  4.070  17.40  0  0    3
13  17.3   8  275.8  180  3.07  3.730  17.60  0  0    3
14  15.2   8  275.8  180  3.07  3.780  18.00  0  0    3
15  10.4   8  472.0  205  2.93  5.250  17.98  0  0    3
16  10.4   8  460.0  215  3.00  5.424  17.82  0  0    3
17  14.7   8  440.0  230  3.23  5.345  

In [19]:
# predict() returns a list of pandas DataFrames, one per extracted table, which you can inspect by index

In [20]:
results[0]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear
1,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4
2,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4
3,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4
4,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3
5,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3
6,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3
7,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3
8,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4
9,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4
10,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4


In [21]:
results[1]

Unnamed: 0,Sepal.Length Sepal.Width,Petal.Length,Petal.Width,Species,Unnamed: 5
1,5.1,3.5,1.4,0.2 setosa,
2,4.9,3.0,1.4,0.2 setosa,
3,4.7,3.2,1.3,0.2 setosa,
4,4.6,3.1,1.5,0.2 setosa,
5,5.0,3.6,1.4,0.2 setosa,
6,5.4,3.9,1.7,0.4 setosa,
7,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
8,145 6.7,3.3,5.7,2.5,virginica
9,146 6.7,3.0,5.2,2.3,virginica
10,147 6.3,2.5,5.0,1.9,virginica


In [22]:
results[2]

Unnamed: 0,supp
1,VC
2,VC
3,VC
4,VC
5,VC
6,VC
7,VC
8,VC
9,VC
10,VC
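
Since each extracted table is a pandas DataFrame, the results can also be written to disk for later use. The sketch below is one possible way to do that; the helper name `save_tables`, the `prefix` parameter and the file naming scheme are assumptions for illustration, and the only library call relied on is `DataFrame.to_csv`.

```python
from pathlib import Path

def save_tables(tables, out_dir, prefix="table"):
    """Write each table to <out_dir>/<prefix>_<n>.csv and return the paths.

    Hypothetical helper, not part of NLU. `tables` is expected to be the
    list of pandas DataFrames returned by predict().
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, table in enumerate(tables, start=1):
        target = out / f"{prefix}_{n}.csv"
        # index=False drops the 1-based row labels shown in the printed output.
        table.to_csv(target, index=False)
        written.append(str(target))
    return written
```

For example, `save_tables(results, '/content/extracted')` would create one CSV file per table in a folder named `extracted` (a folder name chosen here only for illustration).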


## Run the `pdf2table` spell on more complex pdf file
Here we have a real PDF file with 8 tables. Let's run the `pdf2table` spell on it!

In [5]:
results = nlu.load('pdf2table').predict('/content/insurance.pdf')
print_tables(results)

Table 1:
  0_ Volume/\rModel #                                 Title of Model  \
1             III-360            Consumer Credit Insurance Model Act   
2             III-365  Credit Personal Property Insurance Model\rAct   
3             III-370    Consumer Credit Insurance Model\rRegulation   
4             III-375            Creditor-Placed Insurance Model Act   

                                         Description  \
1  This model act helps promote the public welfar...   
2  This model act 1) Promotes the public welfare ...   
3  This model regulation helps protect the intere...   
4  This model act creates a legal framework withi...   

  Related Charts: State Laws on Insurance Topics 5_  
1                                                    
2                                                    
3      PC-50 - Terrorism and War Risks Exclusion     
4                                                    


Table 2:
  0_  III-385  \
1     III-390   
2     III-395   

  Model Regulati
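
Some of the extracted headers and cells above contain `\r` where the PDF wrapped text inside a table cell. Assuming these are real carriage-return characters in the strings, a small cleanup step like the following could normalise them. The helper name `clean_header` is hypothetical and not part of NLU.

```python
import re

def clean_header(name):
    """Collapse carriage returns, newlines and repeated spaces into one space."""
    # \s+ matches any run of whitespace, including \r and \n.
    return re.sub(r"\s+", " ", str(name)).strip()
```

It could then be applied to a table's column names with `df.columns = [clean_header(c) for c in df.columns]`.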

## Table extraction examples with alternative input types

### Predict on array of paths

In [6]:
pdf_paths = ['/content/tables.pdf', '/content/insurance.pdf']
results = nlu.load('pdf2table').predict(pdf_paths)
print_tables(results)

Table 1:
  0_ Volume/\rModel #                                 Title of Model  \
1             III-360            Consumer Credit Insurance Model Act   
2             III-365  Credit Personal Property Insurance Model\rAct   
3             III-370    Consumer Credit Insurance Model\rRegulation   
4             III-375            Creditor-Placed Insurance Model Act   

                                         Description  \
1  This model act helps promote the public welfar...   
2  This model act 1) Promotes the public welfare ...   
3  This model regulation helps protect the intere...   
4  This model act creates a legal framework withi...   

  Related Charts: State Laws on Insurance Topics 5_  
1                                                    
2                                                    
3      PC-50 - Terrorism and War Risks Exclusion     
4                                                    


Table 2:
  0_  III-385  \
1     III-390   
2     III-395   

  Model Regulati

### Predict on dataframe containing a `path` column

In [7]:
# create a dataframe with paths
import pandas as pd 

pdf_paths = ['/content/tables.pdf','/content/insurance.pdf',]
df = pd.DataFrame({'path':pdf_paths})
df

Unnamed: 0,path
0,/content/tables.pdf
1,/content/insurance.pdf


In [8]:
# Process dataframe with paths
results = nlu.load('pdf2table').predict(df)
print_tables(results)

Table 1:
  0_ Volume/\rModel #                                 Title of Model  \
1             III-360            Consumer Credit Insurance Model Act   
2             III-365  Credit Personal Property Insurance Model\rAct   
3             III-370    Consumer Credit Insurance Model\rRegulation   
4             III-375            Creditor-Placed Insurance Model Act   

                                         Description  \
1  This model act helps promote the public welfar...   
2  This model act 1) Promotes the public welfare ...   
3  This model regulation helps protect the intere...   
4  This model act creates a legal framework withi...   

  Related Charts: State Laws on Insurance Topics 5_  
1                                                    
2                                                    
3      PC-50 - Terrorism and War Risks Exclusion     
4                                                    


Table 2:
  0_  III-385  \
1     III-390   
2     III-395   

  Model Regulati

### Pass a directory which contains PDF files
NLU will recursively search the folder for files with the `.pdf` suffix in their name and apply the OCR model to each of them.

In [9]:
results = nlu.load('pdf2table').predict("/content")
print_tables(results)

Table 1:
  0_ Volume/\rModel #                                 Title of Model  \
1             III-360            Consumer Credit Insurance Model Act   
2             III-365  Credit Personal Property Insurance Model\rAct   
3             III-370    Consumer Credit Insurance Model\rRegulation   
4             III-375            Creditor-Placed Insurance Model Act   

                                         Description  \
1  This model act helps promote the public welfar...   
2  This model act 1) Promotes the public welfare ...   
3  This model regulation helps protect the intere...   
4  This model act creates a legal framework withi...   

  Related Charts: State Laws on Insurance Topics 5_  
1                                                    
2                                                    
3      PC-50 - Terrorism and War Risks Exclusion     
4                                                    


Table 2:
  0_  III-385  \
1     III-390   
2     III-395   

  Model Regulati

## Use `ppt2table` spell

### Let's download the test `ppt` file first

In [None]:
!wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/ppt/tables.ppt

### Load the spell and predict on the test file

In [11]:
results = nlu.load('ppt2table').predict("/content/tables.ppt")
print_tables(results)

Table 1:
                          mpg cyl   disp   hp  drat     wt   qsec vs am gear  \
1             Mazda RX4  21.0   6  160.0  110  3.90  2.620  16.46  0  1    4   
2         Mazda RX4 Wag  21.0   6  160.0  110  3.90  2.875  17.02  0  1    4   
3            Datsun 710  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4   
4        Hornet 4 Drive  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3   
5     Hornet Sportabout  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3   
6               Valiant  18.1   6  225.0  105  2.76  3.460  20.22  1  0    3   
7            Duster 360  14.3   8  360.0  245  3.21  3.570  15.84  0  0    3   
8             Merc 240D  24.4   4  146.7   62  3.69  3.190  20.00  1  0    4   
9              Merc 230  22.8   4  140.8   95  3.92  3.150  22.90  1  0    4   
10             Merc 280  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4   
11            Merc 280C  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4   
12           Merc 450SE  16.4  

## Use `doc2table` spell

### Let's download the test `DOC/DOCX` file first

In [None]:
!wget https://github.com/JohnSnowLabs/nlu/raw/4.0.0/tests/datasets/ocr/docx_with_table/doc2.docx


### Load the spell and predict on the test file

In [13]:
results = nlu.load('doc2table').predict("/content/doc2.docx")
print_tables(results)

Table 1:
   Screen Reader Responses Share
1           JAWS       853   49%
2           NVDA       238   14%
3    Window-Eyes       214   12%
4  System Access       181   10%
5      VoiceOver       159    9%


Table 2:
                   May 2012  September 2010
1  Screen Reader  Responses           Share
2           JAWS        853             49%
3           NVDA        238             14%
4    Window-Eyes        214             12%
5  System Access        181             10%
6      VoiceOver        159              9%


