# Extract a PDF table

- Unstructured uses PyPDF and python-pptx to extract text from documents.
- The table is provided in two formats: image and text.

Note that these two libraries give slightly different results.

## Data 

A table from a PDF document, in two formats.

The table is the following:


![table](../docs/images/figure1.png)


**Our mission:**

Extract the structured information from this table.

In [2]:
import sys
sys.path.append('../')

In [3]:
from config.llm import GPT_4
from utils import pretty_print_element, ask_question
from unstructured.partition.pdf import partition_pdf

In [4]:
pdf_path = "../data/pdf/table.pdf"

The PDF contains two tables: 
- One is image based
- The other is text based

![table_image](../docs/images/pdf_table.png)

In [5]:
elements = partition_pdf(
    filename=pdf_path,
    infer_table_structure=True,
    strategy="hi_res",
    include_page_breaks=True,
    chunking_strategy='auto',
)



## Detect elements

Unstructured detects the table using OCR and extracts its text.
It also provides an HTML output.

In [6]:
for element in elements:
    pretty_print_element(element)

-------- Element --------
Element type: <class 'unstructured.documents.elements.Title'>
Element text: Table format ingestion
-------- Element --------
Element type: <class 'unstructured.documents.elements.PageBreak'>
Element text: 
-------- Element --------
Element type: <class 'unstructured.documents.elements.Header'>
Element text: Tableau
-------- Element --------
Element type: <class 'unstructured.documents.elements.Table'>
Element text: Provider Model GPT-4 (8K) (Azure) OpenAI GPT-4 Turbo GPT-3.5-turbo Google Vertex AI Gemini Pro 1 token ~= 4 chars PaLM 2 input price per 1k Token $0.03000 $0.01000 $0.00050 $0.00100 $0.00200 output price per 1K Token $0.06000 $0.03000 $0.00150 $0.00200 $0.00200 input price per 1M Token $30.00 $10.00 $0.50 $1.00 $2.00 output price per 1M Token $60.00 $30.00 $1.50 $2.00 $2.00
HTML table: <table><thead><th>. Provider</th><th>Model</th><th>input price per 1k Token |</th><th>output price per 1K Token</th><th>input price | | per 1M Token</th><th>output pr
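
To keep only the tables from `elements`, one option is to filter on the element category and read the HTML from the element metadata. The sketch below assumes each element exposes a `category` attribute and a `metadata.text_as_html` field, which is how Unstructured elements are usually accessed when `infer_table_structure=True`. `collect_table_html` is a hypothetical helper, and the stub objects only stand in for real elements so the snippet runs without the PDF.

```python
from types import SimpleNamespace


def collect_table_html(elements):
    """Return (text, html) pairs for every element categorised as a table."""
    tables = []
    for element in elements:
        if getattr(element, "category", None) != "Table":
            continue
        metadata = getattr(element, "metadata", None)
        # text_as_html may be missing if table structure was not inferred.
        html = getattr(metadata, "text_as_html", None)
        tables.append((element.text, html))
    return tables


# Stub elements standing in for the output of partition_pdf (assumption).
stub_elements = [
    SimpleNamespace(category="Title", text="Table format ingestion",
                    metadata=SimpleNamespace()),
    SimpleNamespace(category="Table", text="Provider Model ...",
                    metadata=SimpleNamespace(text_as_html="<table>...</table>")),
]

tables = collect_table_html(stub_elements)
for text, html in tables:
    print(text, "->", html)
```

With the real `elements` list, the same call would return one pair per detected table.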

The output for the **text based table** HTML looks like: 

<table><thead><th>. Provider</th><th>Model</th><th>input price per 1k Token |</th><th>output price per 1K Token</th><th>input price | | per 1M Token</th><th>output price per 1M Token</th></thead><tr><td rowspan="3">(Azure) OpenAl</td><td>GPT-4 (8K)</td><td>0.03000</td><td>0.06000</td><td>$30.00</td><td>$60.00</td></tr><tr><td></td><td>GPT-4 Turbo</td><td>0.01000</td><td>0.03000</td><td>$10.00</td><td>$30.00</td></tr><tr><td></td><td>GPT-3.5-turbo</td><td>0.00050</td><td>0.00150</td><td>0.50</td><td>1.50</td></tr><tr><td>Google Vertex Al</td><td>Gemini Pro</td><td>0.00100</td><td>0.00200</td><td>1.00</td><td>2.00</td></tr><tr><td>1 token ~= 4 chars</td><td>PaLM 2</td><td>0.00200</td><td>0.00200</td><td>2.00</td><td>2.00</td></tr></table>

The corresponding text is: 

```
Provider Model GPT-4 (8K) (Azure) OpenAI GPT-4 Turbo GPT-3.5-turbo Google Vertex AI Gemini Pro 1 token ~= 4 chars PaLM 2 input price per 1k Token $0.03000 $0.01000 $0.00050 $0.00100 $0.00200 output price per 1K Token $0.06000 $0.03000 $0.00150 $0.00200 $0.00200 input price per 1M Token $30.00 $10.00 $0.50 $1.00 $2.00 output price per 1M Token $60.00 $30.00 $1.50 $2.00 $2.00
```

-------- 

The output for the **image based table** OCR text is:

```
5 input price output price input price | output price per rover Witstel per 1k Token | per 1K Token | per 1M Token 1M Token GPT-4 (8K) $0.03000 $0.06000 $30.00 $60.00 GPT-4 Turbo $0.01000 $0.03000 $10.00 $30.00 (Azure) OpenAl GPT-3.5-turbo $0.00050 $0.00150 $0.50 $1.50 Gemini Pro $0.00100 $0.00200 $1.00 $2.00 PaLM 2 $0.00200 $0.00200 $2.00 $2.00 Google Vertex Al 1 token ~= 4 chars
```

## Extraction conclusion 

- Extracting a table from a PDF works reliably only if the table is stored as a text-based table.
- If the table is stored as an image, OCR is used, but the result is imperfect.
- Rewriting the HTML table using GPT-4 could be a better solution than using a deterministic approach such as Pandas or BeautifulSoup.
- A multimodal model could be a solution for extracting the table from the image format.
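
To illustrate what a deterministic approach involves, here is a minimal sketch that parses an excerpt of the extracted HTML. It uses the standard-library `html.parser` rather than Pandas or BeautifulSoup so that it runs without extra dependencies. `TableParser` and the record field names are hypothetical, and the excerpt contains only the header and the first two body rows.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect <th> cells as headers and each <tr> of <td> cells as a row."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._cell = None  # text chunks of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            target = self.headers if tag == "th" else self.rows[-1]
            target.append(text)
            self._cell = None


# Excerpt of the extracted HTML: header plus the first two body rows.
html_excerpt = (
    "<table><thead><th>. Provider</th><th>Model</th>"
    "<th>input price per 1k Token |</th><th>output price per 1K Token</th>"
    "<th>input price | | per 1M Token</th><th>output price per 1M Token</th></thead>"
    '<tr><td rowspan="3">(Azure) OpenAl</td><td>GPT-4 (8K)</td><td>0.03000</td>'
    "<td>0.06000</td><td>$30.00</td><td>$60.00</td></tr>"
    "<tr><td></td><td>GPT-4 Turbo</td><td>0.01000</td><td>0.03000</td>"
    "<td>$10.00</td><td>$30.00</td></tr></table>"
)

parser = TableParser()
parser.feed(html_excerpt)

records = []
provider = None
for row in parser.rows:
    # The extraction leaves an empty provider cell under the rowspan,
    # so carry the last non-empty provider forward.
    provider = row[0] or provider
    records.append(
        {
            "provider": provider,
            "model": row[1],
            # Some cells keep the "$" sign and some lose it; strip it if present.
            "input_per_1k": float(row[2].lstrip("$")),
            "output_per_1k": float(row[3].lstrip("$")),
        }
    )

for record in records:
    print(record)
```

Even on this excerpt the parser needs table-specific rules (forward-filling the provider, stripping `$`), and the OCR misspelling `OpenAl` and the stray `|` characters in the headers are carried through untouched. On the full table, the `1 token ~= 4 chars` footnote would also land in the provider column. This is the kind of cleanup an LLM rewrite might handle more gracefully.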

Notes: 

- Table extraction from a PDF file is done with the PyPDF library and Tesseract. The model used by the `hi_res` strategy can be changed with the `hi_res_model_name` parameter.

# GPT-4 - Answer from table

In [7]:
# English: "Can you sort the LLMs of the Azure and GCP clouds by their cost per thousand tokens?"
question = "Peux-tu trier les LLM des cloud Azure et GCP selon leur coûts par milliers de token ?"

prompt_text_with_html_input = """With the following informations:
    ----------
    {extracted_table}
    ----------
    Answer the following question:
    ----------
    {question}
    ----------
    """

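
The `ask_question` helper imported from `utils` is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below fills the template with `str.format` and passes the result to a model. `ask_question_sketch` and `echo_llm` are hypothetical names, and the model is assumed to be a plain callable taking a string and returning a string, which may differ from the real `GPT_4` object.

```python
def ask_question_sketch(llm, prompt_text, extracted_table, question):
    """Fill the prompt template and send it to `llm` (a callable str -> str)."""
    prompt = prompt_text.format(extracted_table=extracted_table, question=question)
    return llm(prompt)


# Same layout as the notebook template, shortened for the example.
template = """With the following informations:
----------
{extracted_table}
----------
Answer the following question:
----------
{question}
----------
"""


def echo_llm(prompt):
    """Stand-in model that returns the prompt it receives (assumption)."""
    return prompt


filled = ask_question_sketch(
    llm=echo_llm,
    prompt_text=template,
    extracted_table="<table><tr><td>GPT-4 Turbo</td><td>0.01000</td></tr></table>",
    question="Which model is listed?",
)
print(filled)
```

Because the template is filled with `str.format`, any literal `{` or `}` in the template itself would need to be doubled. The table and question are passed as arguments, so braces inside them are left alone.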


## Text based table
### Using the HTML format from the text based table

In [8]:
html_formated_table = """
    <table><thead><th>. Provider</th><th>Model</th><th>input price per 1k Token |</th><th>output price per 1K Token</th><th>input price | | per 1M Token</th><th>output price per 1M Token</th></thead><tr><td rowspan="3">(Azure) OpenAl</td><td>GPT-4 (8K)</td><td>0.03000</td><td>0.06000</td><td>$30.00</td><td>$60.00</td>
    </tr><tr><td></td><td>GPT-4 Turbo</td><td>0.01000</td><td>0.03000</td><td>$10.00</td><td>$30.00</td></tr><tr><td></td><td>GPT-3.5-turbo</td>
    <td>0.00050</td><td>0.00150</td><td>0.50</td><td>1.50</td></tr><tr><td>Google Vertex Al</td><td>Gemini Pro</td>
    <td>0.00100</td><td>0.00200</td><td>1.00</td><td>2.00</td></tr><tr><td>1 token ~= 4 chars</td><td>PaLM 2</td><td>0.00200</td>
    <td>0.00200</td><td>2.00</td><td>2.00</td></tr></table>
"""

In [9]:
answer = ask_question(
    llm=GPT_4,
    prompt_text=prompt_text_with_html_input,
    extracted_table=html_formated_table,
    question=question
)

print(answer)

Voici les LLM des cloud Azure et GCP triés par coût par milliers de token, du moins cher au plus cher :

1. GPT-3.5-turbo (Azure) : 0.00050$ en entrée, 0.00150$ en sortie
2. Gemini Pro (Google Vertex AI) : 0.00100$ en entrée, 0.00200$ en sortie
3. PaLM 2 (1 token ~= 4 chars) : 0.00200$ en entrée, 0.00200$ en sortie
4. GPT-4 Turbo (Azure) : 0.01000$ en entrée, 0.03000$ en sortie
5. GPT-4 (8K) (Azure) : 0.03000$ en entrée, 0.06000$ en sortie
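
The ranking above can be checked deterministically against the per-1K-token prices read from the table. This is a small verification sketch, not part of the original notebook. The dictionary is typed in by hand from the table, and models are sorted by input price, then output price.

```python
# Per-1K-token prices in USD, copied from the table: (input, output).
prices_per_1k = {
    "GPT-4 (8K)": (0.03, 0.06),
    "GPT-4 Turbo": (0.01, 0.03),
    "GPT-3.5-turbo": (0.0005, 0.0015),
    "Gemini Pro": (0.001, 0.002),
    "PaLM 2": (0.002, 0.002),
}

# Tuples compare element by element: input price first, then output price.
ranking = sorted(prices_per_1k, key=lambda model: prices_per_1k[model])

for rank, model in enumerate(ranking, start=1):
    price_in, price_out = prices_per_1k[model]
    print(f"{rank}. {model}: ${price_in:.5f} in, ${price_out:.5f} out")
```

The order matches the GPT-4 answer obtained from the HTML input, and so do the quoted prices. The only flaw in that answer is the provider of PaLM 2, which is reported as `1 token ~= 4 chars` because the footnote sits in the provider column of the extracted HTML.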


### Using the extracted text from the text based table

In [10]:
text_based_extracted_table = """Provider Model GPT-4 (8K) (Azure) OpenAI GPT-4 Turbo GPT-3.5-turbo Google Vertex AI Gemini Pro 1 token ~= 4 chars PaLM 2 input price per 1k Token $0.03000 $0.01000 $0.00050 $0.00100 $0.00200 output price per 1K Token $0.06000 $0.03000 $0.00150 $0.00200 $0.00200 input price per 1M Token $30.00 $10.00 $0.50 $1.00 $2.00 output price per 1M Token $60.00 $30.00 $1.50 $2.00 $2.00"""

In [11]:
answer = ask_question(
    llm=GPT_4,
    prompt_text=prompt_text_with_html_input,
    extracted_table=text_based_extracted_table,
    question=question
)

print(answer)

Selon les informations fournies, voici le classement des modèles de Language Learning Model (LLM) des fournisseurs cloud Azure et Google Cloud Platform (GCP) en fonction de leur coût par millier de tokens :

1. Google Vertex AI Gemini Pro : $2.00 pour 1M de tokens en entrée et $2.00 pour 1M de tokens en sortie.
2. Azure GPT-4 (8K) : $30.00 pour 1M de tokens en entrée et $60.00 pour 1M de tokens en sortie.

Notez que le coût est basé sur le total des coûts d'entrée et de sortie.


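This answer is not valid: the flattened text loses the table structure, so the model returns only two entries, quotes per-1M-token prices instead of per-1K prices, and assigns Gemini Pro the PaLM 2 price ($2.00 instead of $1.00 per 1M input tokens).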
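
To see why the flattened text is hard to use, note that it lists the table column by column. The sketch below pulls out the dollar amounts with a regular expression and regroups them into rows. It only works because the number of rows (`n_rows = 5`) is supplied by hand, which is precisely the structural information the text no longer carries.

```python
import re

text_based_extracted_table = """Provider Model GPT-4 (8K) (Azure) OpenAI GPT-4 Turbo GPT-3.5-turbo Google Vertex AI Gemini Pro 1 token ~= 4 chars PaLM 2 input price per 1k Token $0.03000 $0.01000 $0.00050 $0.00100 $0.00200 output price per 1K Token $0.06000 $0.03000 $0.00150 $0.00200 $0.00200 input price per 1M Token $30.00 $10.00 $0.50 $1.00 $2.00 output price per 1M Token $60.00 $30.00 $1.50 $2.00 $2.00"""

# Every price in the text, in reading order (column-major).
prices = [float(p) for p in re.findall(r"\$(\d+\.\d+)", text_based_extracted_table)]

n_rows = 5  # assumption: known only from looking at the original table
columns = [prices[i:i + n_rows] for i in range(0, len(prices), n_rows)]

# Transpose the four price columns back into one tuple per model.
rows = list(zip(*columns))
for row in rows:
    print(row)
```

The prices can be recovered this way, but the model and provider names at the start of the text are interleaved with no delimiter, so they cannot be matched to rows without prior knowledge of the table.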

## Image based table
### Using the OCR extracted text from the table image

In [12]:
ocr_extracted_text = """5 input price output price input price | output price per rover Witstel per 1k Token | per 1K Token | per 1M Token 1M Token GPT-4 (8K) $0.03000 $0.06000 $30.00 $60.00 GPT-4 Turbo $0.01000 $0.03000 $10.00 $30.00 (Azure) OpenAl GPT-3.5-turbo $0.00050 $0.00150 $0.50 $1.50 Gemini Pro $0.00100 $0.00200 $1.00 $2.00 PaLM 2 $0.00200 $0.00200 $2.00 $2.00 Google Vertex Al 1 token ~= 4 chars"""

In [13]:
answer = ask_question(
    llm=GPT_4,
    prompt_text=prompt_text_with_html_input,
    extracted_table=ocr_extracted_text,
    question=question
)

print(answer)

Selon les informations fournies, le classement des LLM des cloud Azure et GCP selon leur coût par millier de tokens est le suivant :

1. OpenAI GPT-3.5-turbo (Azure) : $0.50 par 1K Token
2. Gemini Pro : $1.00 par 1K Token
3. PaLM 2 : $2.00 par 1K Token
4. GPT-4 Turbo : $10.00 par 1K Token
5. GPT-4 (8K) : $30.00 par 1K Token
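
In this answer the ordering matches the table, but every price quoted "per 1K Token" is actually the table's input price per 1M tokens. A quick unit check makes the factor of 1000 explicit. The sketch below is a verification aid rather than part of the original notebook, with both dictionaries typed in by hand from the table and from the answer above.

```python
# Input price per 1K tokens in USD, from the table.
table_input_per_1k = {
    "GPT-3.5-turbo": 0.0005,
    "Gemini Pro": 0.001,
    "PaLM 2": 0.002,
    "GPT-4 Turbo": 0.01,
    "GPT-4 (8K)": 0.03,
}

# Prices the model reported as "per 1K Token" from the OCR text.
reported_per_1k = {
    "GPT-3.5-turbo": 0.50,
    "Gemini Pro": 1.00,
    "PaLM 2": 2.00,
    "GPT-4 Turbo": 10.00,
    "GPT-4 (8K)": 30.00,
}

# Ratio between the reported and the true per-1K price, rounded to an integer.
ratios = {
    model: round(reported_per_1k[model] / table_input_per_1k[model])
    for model in reported_per_1k
}

for model, ratio in ratios.items():
    print(f"{model}: reported price is {ratio}x the table's per-1K price")
```

Every ratio is 1000, which confirms that the model picked up the per-1M column: the OCR text garbles the header row, so the column labels can no longer be matched to the values.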


## Conclusion 

- Use the HTML format of the table whenever possible.
- OCR output is imperfect, and the answer derived from it is not valid.

![Summary](../docs/images/figure2.png)