# Extract a PPTX table

- Unstructured uses python-pptx to extract the text from PPTX documents.
- The table is tested in two formats: image and text.

For information, these two libraries give slightly different results.

## Data 

A table from a PPTX document, in two formats.

The table is the following:


![table](../docs/images/figure1.png)


**Our mission:**

Extract the structured information from this table.

In [2]:
import sys
sys.path.append('../')

In [3]:
from unstructured.partition.pptx import partition_pptx
from utils import pretty_print_element

In [4]:
pptx_path = "../data/pptx/table.pptx"

The PPTX contains two tables: 
- One is image based
- The other is text based

![table_image](../docs/images/pdf_table.png)

In [6]:
elements = partition_pptx(
    filename=pptx_path,
    infer_table_structure=True,
    strategy="hi_res",
    include_page_breaks=True,
    chunking_strategy='auto',
)

## Detect elements

Unstructured detects the text based table and extracts its text.
It also provides an HTML output of the table.
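
The HTML output is attached to each table element's metadata. A minimal sketch of how it could be collected, assuming the elements expose a `category` attribute and a `metadata.text_as_html` field as Unstructured elements do; `table_html` is a hypothetical helper name, not part of the library:

```python
def table_html(elements):
    """Return the HTML rendering of every Table element that has one."""
    html_tables = []
    for element in elements:
        # Unstructured tags table elements with the category "Table".
        if getattr(element, "category", None) != "Table":
            continue
        # text_as_html is only filled when the table structure was inferred.
        html = getattr(element.metadata, "text_as_html", None)
        if html:
            html_tables.append(html)
    return html_tables
```

With the `elements` list returned by `partition_pptx`, `table_html(elements)` would return one HTML string per detected table.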

In [7]:
for element in elements:
    pretty_print_element(element)

-------- Element --------
Element type: <class 'unstructured.documents.elements.Title'>
Element text: Table format ingestion
-------- Element --------
Element type: <class 'unstructured.documents.elements.PageBreak'>
Element text: 
-------- Element --------
Element type: <class 'unstructured.documents.elements.Title'>
Element text: Tableau
-------- Element --------
Element type: <class 'unstructured.documents.elements.Table'>
Element text: Provider            Model          input price per 1k Token    output price per 1K Token    input price per 1M Token    output price per 1M Token
(Azure)             GPT-4 (8K)     $0.03000                    $0.06000                     $30.00                      $60.00
OpenAI
                    GPT-4 Turbo    $0.01000                    $0.03000                     $10.00                      $30.00
                    GPT-3.5-turbo  $0.00050                    $0.00150                     $0.50                       $1.50
Google Vertex AI    Gem

The HTML output for the **text based table** looks like:

<table>
<thead>
<tr><th>Provider  </th><th>Model        </th><th>input price per 1k Token  </th><th>output price per 1K Token  </th><th>input price per 1M Token  </th><th>output price per 1M Token  </th></tr>
</thead>
<tbody>
<tr><td>(Azure)
OpenAI           </td><td>GPT-4 (8K)   </td><td>$0.03000                  </td><td>$0.06000                   </td><td>$30.00                    </td><td>$60.00                     </td></tr>
<tr><td>          </td><td>GPT-4 Turbo  </td><td>$0.01000                  </td><td>$0.03000                   </td><td>$10.00                    </td><td>$30.00                     </td></tr>
<tr><td>          </td><td>GPT-3.5-turbo</td><td>$0.00050                  </td><td>$0.00150                   </td><td>$0.50                     </td><td>$1.50                      </td></tr>
<tr><td>Google Vertex AI

1 token ~= 4 chars           </td><td>Gemini Pro   </td><td>$0.00100                  </td><td>$0.00200                   </td><td>$1.00                     </td><td>$2.00                      </td></tr>
<tr><td>          </td><td>PaLM 2       </td><td>$0.00200                  </td><td>$0.00200                   </td><td>$2.00                     </td><td>$2.00                      </td></tr>
</tbody>
</table>

The corresponding text is: 

```
Provider            Model          input price per 1k Token    output price per 1K Token    input price per 1M Token    output price per 1M Token
(Azure)             GPT-4 (8K)     $0.03000                    $0.06000                     $30.00                      $60.00
OpenAI
                    GPT-4 Turbo    $0.01000                    $0.03000                     $10.00                      $30.00
                    GPT-3.5-turbo  $0.00050                    $0.00150                     $0.50                       $1.50
Google Vertex AI    Gemini Pro     $0.00100                    $0.00200                     $1.00                       $2.00

1 token ~= 4 chars
                    PaLM 2         $0.00200                    $0.00200                     $2.00                       $2.00
```

-------- 

The output for the **image based table** is empty: Unstructured does not run OCR on the images of a PPTX file.
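
One possible workaround is to pull the pictures out of the presentation and run OCR on them separately. The sketch below assumes python-pptx, Pillow and pytesseract are installed; `picture_blobs`, `tesseract_ocr` and `ocr_pictures` are hypothetical helper names. It only looks at top-level shapes with an embedded image, and it returns raw text, not a table structure.

```python
import io


def picture_blobs(presentation):
    """Yield (slide_number, image_bytes) for every top-level picture shape."""
    for number, slide in enumerate(presentation.slides, start=1):
        for shape in slide.shapes:
            # In python-pptx only picture shapes expose an `image` attribute.
            # Linked (non-embedded) pictures are not handled by this sketch.
            image = getattr(shape, "image", None)
            if image is not None:
                yield number, image.blob


def tesseract_ocr(blob):
    """Run Tesseract on raw image bytes and return the recognized text."""
    # Imported lazily so the helpers above can be used without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(blob)))


def ocr_pictures(presentation, ocr=tesseract_ocr):
    """Return a list of (slide_number, recognized_text) pairs."""
    return [(number, ocr(blob)) for number, blob in picture_blobs(presentation)]
```

Usage would look like `ocr_pictures(Presentation(pptx_path))`, with `Presentation` imported from `pptx`.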

## Extraction conclusion 

- Extracting a table from a PPTX is possible only if the table is stored as a native (text based) table.
- If the table is an image, Unstructured does not run OCR on it, but Tesseract or another OCR library can be used instead.
- Rewriting the HTML table using GPT-4 could be a better solution than a deterministic approach such as Pandas or BeautifulSoup.
- A multimodal model could be a solution to extract the table from the image format.

Notes: 

- Table extraction from a PDF file relies on the PyPDF library and Tesseract. It can be adjusted with the `hi_res_model_name` parameter.
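
To give an idea of what the deterministic approach mentioned in the conclusion involves, here is a small sketch that uses only the standard library `html.parser` (swapped in for Pandas or BeautifulSoup to keep it dependency-free). `TableRows` and `parse_table` are hypothetical names. The blank provider cells, which come from merged cells in the slide, have to be forward-filled by hand.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse the line breaks and padding found inside the cells.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def parse_table(html):
    """Return one dict per body row, keyed by the header cells."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    records, provider = [], ""
    for cells in body:
        provider = cells[0] or provider  # forward-fill merged provider cells
        records.append(dict(zip(header, [provider, *cells[1:]])))
    return records
```

On the table above, the note `1 token ~= 4 chars` would still end up inside the Google provider cell, so table-specific cleanup remains necessary. This is the kind of fragility that makes the LLM rewrite attractive.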

# GPT-4 - Answer from table

In [8]:
from utils import ask_question
from config.llm import GPT_4

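# In English: "Can you sort the LLMs of the Azure and GCP clouds by their cost per thousand tokens?"
question = "Peux-tu trier les LLM des cloud Azure et GCP selon leur coûts par milliers de token ?"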

prompt_text_with_html_input = """With the following informations:
    ----------
    {extracted_table}
    ----------
    Answer the following question:
    ----------
    {question}
    ----------
    """

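
The `ask_question` helper comes from the project's `utils` module and is not shown in this notebook. As a rough illustration of how the two placeholders of the template are presumably filled before the call to the model, here is a sketch using plain `str.format`; `render_prompt` is a hypothetical name and the actual helper may work differently.

```python
def render_prompt(prompt_text, extracted_table, question):
    """Fill the {extracted_table} and {question} placeholders of the template."""
    return prompt_text.format(extracted_table=extracted_table, question=question)


# Same template as in the cell above, repeated so the sketch runs on its own.
template = """With the following informations:
    ----------
    {extracted_table}
    ----------
    Answer the following question:
    ----------
    {question}
    ----------
    """

print(render_prompt(template, "<table>...</table>", "Which model is the cheapest?"))
```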


## Text based table
### Using the HTML format from the text based table

In [9]:
html_formated_table = """
<table>
<thead>
<tr><th>Provider  </th><th>Model        </th><th>input price per 1k Token  </th><th>output price per 1K Token  </th><th>input price per 1M Token  </th><th>output price per 1M Token  </th></tr>
</thead>
<tbody>
<tr><td>(Azure)
OpenAI           </td><td>GPT-4 (8K)   </td><td>$0.03000                  </td><td>$0.06000                   </td><td>$30.00                    </td><td>$60.00                     </td></tr>
<tr><td>          </td><td>GPT-4 Turbo  </td><td>$0.01000                  </td><td>$0.03000                   </td><td>$10.00                    </td><td>$30.00                     </td></tr>
<tr><td>          </td><td>GPT-3.5-turbo</td><td>$0.00050                  </td><td>$0.00150                   </td><td>$0.50                     </td><td>$1.50                      </td></tr>
<tr><td>Google Vertex AI

1 token ~= 4 chars           </td><td>Gemini Pro   </td><td>$0.00100                  </td><td>$0.00200                   </td><td>$1.00                     </td><td>$2.00                      </td></tr>
<tr><td>          </td><td>PaLM 2       </td><td>$0.00200                  </td><td>$0.00200                   </td><td>$2.00                     </td><td>$2.00                      </td></tr>
</tbody>
</table>
"""

In [10]:
answer = ask_question(
    llm=GPT_4,
    prompt_text=prompt_text_with_html_input,
    extracted_table=html_formated_table,
    question=question
)

print(answer)

Voici les LLM des cloud Azure et GCP triés par leur coût par millier de tokens :

1. GPT-3.5-turbo (Azure) : $0.00050 (input), $0.00150 (output)
2. Gemini Pro (Google Vertex AI) : $0.00100 (input), $0.00200 (output)
3. PaLM 2 (Google Vertex AI) : $0.00200 (input), $0.00200 (output)
4. GPT-4 Turbo (Azure) : $0.01000 (input), $0.03000 (output)
5. GPT-4 (8K) (Azure) : $0.03000 (input), $0.06000 (output)


### Using the extracted text from the text based table

In [11]:
text_based_extracted_table = """Provider            Model          input price per 1k Token    output price per 1K Token    input price per 1M Token    output price per 1M Token
(Azure)             GPT-4 (8K)     $0.03000                    $0.06000                     $30.00                      $60.00
OpenAI
                    GPT-4 Turbo    $0.01000                    $0.03000                     $10.00                      $30.00
                    GPT-3.5-turbo  $0.00050                    $0.00150                     $0.50                       $1.50
Google Vertex AI    Gemini Pro     $0.00100                    $0.00200                     $1.00                       $2.00

1 token ~= 4 chars
                    PaLM 2         $0.00200                    $0.00200                     $2.00                       $2.00
"""

In [12]:
answer = ask_question(
    llm=GPT_4,
    prompt_text=prompt_text_with_html_input,
    extracted_table=text_based_extracted_table,
    question=question
)

print(answer)

Voici les LLM des cloud Azure et GCP triés par leur coût par millier de tokens :

1. Google Vertex AI Gemini Pro : $0.00100 (input) - $0.00200 (output)
2. Azure GPT-4 (8K) : $0.03000 (input) - $0.06000 (output)
3. OpenAI GPT-4 Turbo : $0.01000 (input) - $0.03000 (output)
4. OpenAI GPT-3.5-turbo : $0.00050 (input) - $0.00150 (output)
5. PaLM 2 : $0.00200 (input) - $0.00200 (output)


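The answer is not valid: the models are not sorted by price and the provider labels are inconsistent, because the table structure is lost in the plain-text extraction.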
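
As a sanity check, the two answers can be compared with a ranking computed directly from the input prices per 1k tokens. In this sketch the dictionary is transcribed by hand from the table and the two lists reproduce the order of the models in the two GPT-4 answers above; `is_sorted_by_price` is a hypothetical helper.

```python
# Input price per 1k tokens, transcribed by hand from the table.
input_price_per_1k = {
    "GPT-4 (8K)": 0.03,
    "GPT-4 Turbo": 0.01,
    "GPT-3.5-turbo": 0.0005,
    "Gemini Pro": 0.001,
    "PaLM 2": 0.002,
}


def is_sorted_by_price(models, prices=input_price_per_1k):
    """Return True if the models appear in ascending order of price."""
    ranked = [prices[model] for model in models]
    return ranked == sorted(ranked)


# Order of the models in the answer obtained from the HTML table.
html_answer = ["GPT-3.5-turbo", "Gemini Pro", "PaLM 2", "GPT-4 Turbo", "GPT-4 (8K)"]
# Order of the models in the answer obtained from the extracted text.
text_answer = ["Gemini Pro", "GPT-4 (8K)", "GPT-4 Turbo", "GPT-3.5-turbo", "PaLM 2"]

print(is_sorted_by_price(html_answer))  # True
print(is_sorted_by_price(text_answer))  # False
```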

## Conclusion 

- Prefer the HTML format of the table, which preserves the structure. The extracted text can be used as a fallback, but it produced an invalid answer in this test.
- OCR is not implemented for image based tables and has to be done with Tesseract or another OCR library.

![Summary](../docs/images/figure3.png)