## Table Extraction From Documents

### Text Analysis

Most RAG systems rely on the text content from documents.

### Structured Data

Some industries, such as finance, insurance, and accounting, rely on structured data such as tables embedded in otherwise unstructured documents such as PDFs.

To support question answering, it's important to be able to extract the tables from these documents.


#### Inherent Structure:

Some documents contain table structure information that you can use to extract the data. Examples include HTML and Word documents. For such documents, we can use rule-based parsers to extract table information.
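For a format that already carries table markup, a rule-based parser can be quite small. The sketch below uses only Python's standard-library `html.parser` to collect the cell text of each `<tr>` row. The `TableRowParser` class, the `extract_rows` helper, and the sample HTML are illustrative assumptions, and the sketch does not handle nested tables, `rowspan`/`colspan`, or header cells that sit outside a `<tr>`.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample input, not taken from the example PDF.
sample_html = """
<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q1</td><td>120</td></tr>
  <tr><td>Q2</td><td>135</td></tr>
</table>
"""

rows = extract_rows(sample_html)
print(rows)
```

Because the structure is explicit in the markup, no model inference is needed here; the rules simply follow the tags.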

#### Inference Required:

Some documents do not contain table structure details. When working with such documents, the table structure has to be inferred, using techniques such as:

- **Table Transformers**

> These are models that identify the bounding boxes of table cells and convert the output into HTML format.

> **Steps used:**
> 1. Identify the table using a document layout detection (DLD) model
> 2. Pass the identified table to the table transformer model

Read more [here](https://arxiv.org/pdf/2203.01017).

- **Vision Transformers**

> Vision transformers work as we have already seen; the only difference when working with tables is that the output is in HTML format.

- **OCR Postprocessing**

> This involves running OCR on the table and then using rule-based or statistical parsers to extract the table information from the result.

Once the table information has been extracted using one of these techniques, it can be exported to HTML to preserve the table structure.
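To make the OCR post-processing idea concrete, here is a minimal rule-based sketch. It assumes an OCR engine has already produced words together with the (x, y) position of their top-left corner; the `words` data, the `row_tol` threshold, and both function names are hypothetical. Words are grouped into rows when their y positions are close, ordered by x within each row, and the result is written out as HTML. It also assumes one word per cell, which real tables rarely guarantee, so a practical parser would need column clustering as well.

```python
from html import escape


def group_into_rows(words, row_tol=5):
    """Group (text, x, y) words into rows, top to bottom and left to right.

    A word joins the current row when its y is within row_tol of the
    y of the first word placed in that row.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(y - rows[-1]["y"]) <= row_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order each row's cells by x position and keep only the text.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_html(rows):
    """Render rows as an HTML table, treating the first row as the header."""
    lines = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if i == 0 else "td"
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical OCR output: slightly misaligned words from a 3x2 table.
words = [
    ("Rate", 120, 11), ("Dose", 10, 10),
    ("0.5", 10, 40), ("2.81", 121, 42),
    ("1.0", 11, 70), ("1.50", 120, 69),
]

rows = group_into_rows(words)
html_out = rows_to_html(rows)
print(html_out)
```

The fixed `row_tol` is the kind of hand-tuned rule that makes this approach brittle on skewed scans, which is one reason statistical parsers and the model-based techniques above exist.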



In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.staging.base import dict_to_elements

import os

In [3]:
s = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

In [4]:
filename = "./example_datasets/embedded-images-tables.pdf"

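# Read the PDF and wrap it in the SDK's file container
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# hi_res partitioning uses a layout detection model (yolox here).
# An empty skip_infer_table_types list means no file type is excluded
# from table inference, and pdf_infer_table_structure=True requests the
# table structure as HTML in the element metadata.
req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    skip_infer_table_types=[],
    pdf_infer_table_structure=True,
)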

try:
    resp = s.general.partition(req)
    elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

In [5]:
tables = [el for el in elements if el.category == "Table"]

In [6]:
tables[0].text

'Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409 —0.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440 1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 s NO 03233 0.0540 —0.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05 305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919'

In [7]:
table_html = tables[0].metadata.text_as_html

In [8]:
from io import StringIO
from lxml import etree

# Parse the table HTML and pretty-print it to inspect the structure
parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())

<table>
  <thead>
    <th>Inhibitor concentration (g)</th>
    <th>be (V/dec)</th>
    <th>ba (V/dec)</th>
    <th>Ecorr (V)</th>
    <th>icorr (AJcm?)</th>
    <th>Polarization resistance (Q)</th>
    <th>Corrosion rate (mmj/year)</th>
  </thead>
  <tr>
    <td/>
    <td>0.0335</td>
    <td>0.0409</td>
    <td>&#8212;0.9393</td>
    <td>0.0003</td>
    <td>24.0910</td>
    <td>2.8163</td>
  </tr>
  <tr>
    <td>NO</td>
    <td>1.9460</td>
    <td>0.0596</td>
    <td>&#8212;0.8276</td>
    <td>0.0002</td>
    <td>121.440</td>
    <td>1.5054</td>
  </tr>
  <tr>
    <td/>
    <td>0.0163</td>
    <td>0.2369</td>
    <td>&#8212;0.8825</td>
    <td>0.0001</td>
    <td>42121</td>
    <td>0.9476</td>
  </tr>
  <tr>
    <td>s</td>
    <td>03233</td>
    <td>0.0540</td>
    <td>&#8212;0.8027</td>
    <td>5.39E-05</td>
    <td>373.180</td>
    <td>0.4318</td>
  </tr>
  <tr>
    <td/>
    <td>0.1240</td>
    <td>0.0556</td>
    <td>&#8212;0.5896</td>
    <td>5.46E-05</td>
    <td>305.650</td>
   

In [9]:
from IPython.display import HTML
HTML(table_html)

| Inhibitor concentration (g) | be (V/dec) | ba (V/dec) | Ecorr (V) | icorr (AJcm?) | Polarization resistance (Q) | Corrosion rate (mmj/year) |
| --- | --- | --- | --- | --- | --- | --- |
|  | 0.0335 | 0.0409 | —0.9393 | 0.0003 | 24.091 | 2.8163 |
| NO | 1.946 | 0.0596 | —0.8276 | 0.0002 | 121.44 | 1.5054 |
|  | 0.0163 | 0.2369 | —0.8825 | 0.0001 | 42121.0 | 0.9476 |
| s | 3233.0 | 0.054 | —0.8027 | 5.39e-05 | 373.18 | 0.4318 |
|  | 0.124 | 0.0556 | —0.5896 | 5.46e-05 | 305.65 | 0.3772 |
| = 5 | 0.0382 | 0.0086 | —0.5356 | 1.24e-05 | 246.08 | 0.0919 |
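
The extracted cells are still strings and carry OCR quirks, such as the em dash that stands in for a minus sign in the Ecorr column. A small post-processing pass can normalise these before the values are used in calculations. The `to_number` helper below is a hypothetical sketch and not part of the `unstructured` library; it only handles the dash substitution and returns anything it cannot parse unchanged.

```python
def to_number(cell):
    """Return the cell as a float when it parses cleanly, else the original string."""
    # OCR often emits an em dash (U+2014) or en dash (U+2013) for a minus sign.
    cleaned = cell.strip().replace("\u2014", "-").replace("\u2013", "-")
    try:
        return float(cleaned)
    except ValueError:
        return cell


# One row of cell strings as they appear in the extracted table.
row = ["NO", "1.9460", "0.0596", "\u20140.8276", "0.0002", "121.440", "1.5054"]
values = [to_number(cell) for cell in row]
print(values)
```

A pass like this cannot repair every recognition error: a cell such as `03233`, where the decimal point was lost, still parses as a valid but wrong number, so extracted values are worth checking against the source document.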
