## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

Another way to try out the `unstructured` library is by running a docker container -- compatible with either Intel/AMD or Apple Silicon! Check out the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [None]:
# Installation as shown in the package documentation
!apt-get -qq install poppler-utils tesseract-ocr
%pip install -q --user --upgrade pillow
%pip install -q unstructured["all-docs"]==0.12.5
# %pip install -q --upgrade unstructured

In [None]:
# !apt-get -qq install poppler-utils tesseract-ocr
# %pip install -q --user --upgrade pillow
%pip install -q unstructured[pdf,image,local-inference]==0.13.3
%pip install libmagic==1.0
%pip install poppler-utils==0.1.0
# %pip install tesseract-ocr
%pip install pandoc==2.3
# %pip install -q --upgrade unstructured


In [None]:
!apt-get -qq install poppler-utils tesseract-ocr

In [None]:
# !pip show tesseract-ocr

See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) to find the example docs used in this tutorial. You can also upload your own files: click “Choose Files” in the left panel, then select the file and upload it to Colab.

In [None]:
!mkdir -p example-docs
# Download example-10k.html and layout-parser-paper-fast.pdf
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### PDF Parsing

Two strategies for parsing PDF documents are compared here: "fast" and "hi_res" (`partition_pdf` also accepts "auto" and "ocr_only"). The default strategy is "hi_res".

If your main objective is extracting text from a "clean" PDF (i.e. one that does not include text in images that requires OCR), go with the "fast" option.

Otherwise, if your PDF may have images with text to extract, or you want better-structured Elements that characterize the text items within the document more accurately, go with the "hi_res" option.

Naturally, "fast" is faster than "hi_res" -- by an order of magnitude!
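
The rule of thumb above can be written down as a small helper. This is only a sketch: `choose_pdf_strategy` and its two flags are hypothetical names used for illustration and are not part of `unstructured`. The returned string is meant to be passed as the `strategy` argument of `partition_pdf`.

```python
def choose_pdf_strategy(has_text_in_images: bool, needs_rich_structure: bool = False) -> str:
    """Hypothetical helper: pick a partition_pdf strategy from the rule of thumb above."""
    # Text embedded in images needs OCR, and better-structured elements need
    # layout detection; the text above recommends "hi_res" for both cases.
    if has_text_in_images or needs_rich_structure:
        return "hi_res"
    # A clean, text-only PDF can use the much faster option.
    return "fast"


print(choose_pdf_strategy(has_text_in_images=False))
print(choose_pdf_strategy(has_text_in_images=True))
```

If you are unsure whether a PDF contains text inside images, starting with "hi_res" is the safer choice, at the cost of speed.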

In [None]:
# https://medium.com/unstructured-io/mastering-table-extraction-revolutionize-your-earnings-reports-analysis-with-ai-1bc32c22720e


# https://unstructured-io.github.io/unstructured/core/partition.html

In [30]:
from unstructured.partition.pdf import partition_pdf
# from unstructured.partition.auto import partition
filename = "example-docs/pdf example.pdf"
# filename = 'example-docs/pdf example - footer with words.pdf'
# filename = 'example-docs/pdf example - simple footer at bottom.pdf'
# filename = 'example-docs/Q3FY24-CFO-Commentary.pdf'

elements_fast = partition_pdf(filename, strategy='fast')

elements = partition_pdf(filename,
                         strategy='hi_res', # "auto", "hi_res", "ocr_only", and "fast".
                         infer_table_structure=True,
                         model_name = "yolox",
                        #  include_page_breaks=True,
                        #  extract_element_types=["Table"],
                        #  extract_images_in_pdf=True,
                        #  extract_image_block_output_dir="example-docs",
                        #  languages=["eng", "pt"]
                         )

In [None]:
from collections import Counter
display(Counter(type(element) for element in elements))

Let's display the type and text of the elements in the document:

In [77]:
display(*[(element.category, element.text) for element in elements])

('Header', 'ACME ltda Company solutions technology')

('Image', 'ACME')

('NarrativeText',
 "The FIFA World Cup, often simply called the World Cup, is an international association football competition between the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body.")

('NarrativeText', 'Table in a human readable format')

('Table',
 'Date Account manager Registration number 24 April 2024 Fernando FGT#344')

('NarrativeText', 'Table in a not so much human readable format')

('Table', 'Product Property Person Joe Doe Date 25 April 2024')

('NarrativeText', 'Table in a tabular format')

('Table', 'Id 1 2 Product A B Metric 11 87')

('NarrativeText', 'My company ABCD#3487RE')

('UncategorizedText', '1')

('Header', 'ACME ltda Company solutions technology')

('NarrativeText', 'This is to clean non ascii characters')

('NarrativeText', '\\x88This text contains®non-ascii characters!●')

('NarrativeText', 'This is ordered bullets bullets')

('NarrativeText',
 '1 This is a very important point 1.1 That is another point 1.2 Followed by a super important 2 That is other thing')

('NarrativeText', 'This is bullets')

('NarrativeText',
 'This is a very important point o That is another point o Followed by a super important')

('ListItem', 'That is other thing')

('NarrativeText',
 "A settlement was established in the area by the Gaels during or before the 7th century,[13] followed by the Vikings. As the Kingdom of Dublin grew, it became Ireland's principal settlement by the 12th century Anglo-Norman invasion of Ireland.[13] The city expanded rapidly from the 17th century and was briefly the second largest in the British Empire and sixth largest in Western Europe after the Acts")

('NarrativeText', 'My company ABCD#3487RE')

('UncategorizedText', '2')

('Header', 'ACME ltda Company solutions technology')

('NarrativeText',
 'of Union in 1800.[14] Following independence in 1922, Dublin became the capital of the Irish Free State, renamed Ireland in 1937. As of 2018, the city was listed by the Globalization and World Cities Research Network (GaWC) as a global city, with a ranking of "Alpha minus", which placed it among the top thirty cities in the world.')

('Image', '== Microsoft')

('NarrativeText', 'My company ABCD#3487RE')

('UncategorizedText', '3')

In [78]:
display(*[(element.category, element.text) for element in elements_fast])

('Header', 'ACME ltda Company solutions technology')

('NarrativeText',
 "The FIFA World Cup, often simply called the World Cup, is an international association football competition between the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body.")

('Title', 'Table in a human readable format')

('Title', 'Date Account manager Registration number')

('Title', '24 April 2024 Fernando FGT#344')

('Title', 'Table in a not so much human readable format')

('Title', 'Product Property')

('Title', 'Person Joe Doe')

('Title', 'Date 25 April 2024')

('Title', 'Table in a tabular format')

('Title', 'Id 1 2')

('Title', 'Product A B')

('Title', 'Metric 11 87')

('Footer', 'My company ABCD#3487RE')

('UncategorizedText', '1')

('Header', 'ACME ltda Company solutions technology')

('NarrativeText', 'This is to clean non ascii characters')

('Title', '\\x88This text contains®non-ascii characters!●')

('NarrativeText', 'This is ordered bullets bullets')

('NarrativeText',
 '1 This is a very important point 1.1 That is another point 1.2 Followed by a super important 2 That is other thing')

('NarrativeText', 'This is bullets')

('ListItem',
 'This is a very important point o That is another point o Followed by a super important')

('ListItem', 'That is other thing')

('NarrativeText',
 "A settlement was established in the area by the Gaels during or before the 7th century,[13] followed by the Vikings. As the Kingdom of Dublin grew, it became Ireland's principal settlement by the 12th century Anglo-Norman invasion of Ireland.[13] The city expanded rapidly from the 17th century and was briefly the second largest in the British Empire and sixth largest in Western Europe after the Acts")

('Footer', 'My company ABCD#3487RE')

('UncategorizedText', '2')

('Header', 'ACME ltda Company solutions technology')

('NarrativeText',
 'of Union in 1800.[14] Following independence in 1922, Dublin became the capital of the Irish Free State, renamed Ireland in 1937. As of 2018, the city was listed by the Globalization and World Cities Research Network (GaWC) as a global city, with a ranking of "Alpha minus", which placed it among the top thirty cities in the world.')

('Footer', 'My company ABCD#3487RE')

('UncategorizedText', '3')
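
The two outputs above label the same text differently. A minimal sketch of how to see that, using a few `(category, text)` pairs copied by hand from the first page of each output as stand-in data (with real results you would build the pairs from `elements` and `elements_fast`):

```python
from collections import Counter

# (category, text) pairs copied from the first page of the two outputs above
hi_res_sample = [
    ("NarrativeText", "Table in a human readable format"),
    ("Table", "Date Account manager Registration number 24 April 2024 Fernando FGT#344"),
    ("NarrativeText", "My company ABCD#3487RE"),
    ("UncategorizedText", "1"),
]
fast_sample = [
    ("Title", "Table in a human readable format"),
    ("Title", "Date Account manager Registration number"),
    ("Title", "24 April 2024 Fernando FGT#344"),
    ("Footer", "My company ABCD#3487RE"),
    ("UncategorizedText", "1"),
]

# How many elements of each category did each strategy produce?
hi_res_counts = Counter(category for category, _ in hi_res_sample)
fast_counts = Counter(category for category, _ in fast_sample)
print(hi_res_counts)
print(fast_counts)

# Texts that both strategies extracted verbatim but labeled differently
fast_by_text = {text: category for category, text in fast_sample}
relabeled = {
    text: (category, fast_by_text[text])
    for category, text in hi_res_sample
    if text in fast_by_text and fast_by_text[text] != category
}
for text, (hi_res_cat, fast_cat) in relabeled.items():
    print(f"{text!r}: hi_res={hi_res_cat}, fast={fast_cat}")
```

In this sample only "hi_res" produces a `Table` element, while only "fast" labels the footer line as `Footer`.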

In [79]:
# get footer and uncategorized text (page numbers) found by the "fast" strategy
from unstructured.staging.base import convert_to_dict

elements_dict = convert_to_dict(elements_fast)
extracted_elements_fast = []
for el in elements_dict:
  if el["type"] in ['UncategorizedText', 'Footer']:
    extracted_elements_fast.append(el["text"])

# de-duplicate; note that a set does not preserve order
extracted_elements_fast = list(set(extracted_elements_fast))
extracted_elements_fast

['2', '3', 'My company ABCD#3487RE', '1']
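
`list(set(...))` returns the unique values in arbitrary order, which is why the list above is not in page order. If a stable order matters, `dict.fromkeys` is one alternative. The sketch below uses the footer and page-number texts copied from the "fast" output above as stand-in data:

```python
# Footer and page-number texts in document order, copied from the "fast" output above
footer_texts = [
    "My company ABCD#3487RE", "1",
    "My company ABCD#3487RE", "2",
    "My company ABCD#3487RE", "3",
]

# dict keys are unique and keep insertion order, so this de-duplicates
# while preserving the order in which each text was first seen
unique_in_order = list(dict.fromkeys(footer_texts))
print(unique_in_order)
```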

In [76]:
# get footer only, kept in its own variable so extracted_elements_fast is not overwritten
extracted_elements_fast_footer = list({el.text for el in elements_fast
                                       if el.category == 'Footer'})
extracted_elements_fast_footer

['My company ABCD#3487RE']

In [80]:
from unstructured.staging.base import convert_to_dict
import re
from io import StringIO
from unstructured.cleaners.core import (clean, clean_extra_whitespace,
                                        clean_non_ascii_chars)
from unstructured.staging.base import convert_to_dataframe
import pandas as pd

# element categories to drop from the extracted text
categories_to_remove = ['Header',
                        'UncategorizedText',
                        'Image',
                        'Footer',
                        ]


def clean_html_tag(x):
  # replace every HTML tag with a single space
  return re.sub(r'<.*?>', ' ', x)


def remove_table_index(x):
  # drop the row index that pandas prints at the start of each line
  return re.sub(r'\n\d+', '\n', x)

elements_dict = convert_to_dict(elements)
extracted_elements = []
for el in elements_dict:
  if el["type"] not in categories_to_remove:
    if el["type"] == "Table":
      table_html = el["metadata"]["text_as_html"]

      # just for debug: table as html
      extracted_elements.append('html table: ' + table_html)

      # table as plain text: strip the tags from text_as_html, then clean
      extracted_elements.append(clean_extra_whitespace(
                                  clean_non_ascii_chars(
                                      clean_html_tag(
                                          table_html).replace("    ",'; ').replace("   ","")
                                    )))

      # table as a series of strings that look like a table
      # (StringIO avoids the pandas warning about passing literal HTML)
      table_as_str = str(pd.read_html(StringIO(table_html)))
      table_as_str = table_as_str[1:-1]  # strip the surrounding list brackets
      extracted_elements.append(remove_table_index(table_as_str))

    else:
      extracted_elements.append(clean_extra_whitespace(
                                  clean_non_ascii_chars(
                                      el["text"])
                                  ))



print("\n\n".join(extracted_elements))
# extracted_elements

The FIFA World Cup, often simply called the World Cup, is an international association football competition between the senior men's national teams of the members of the Fdration Internationale de Football Association (FIFA), the sport's global governing body.

Table in a human readable format

html table: <table><thead><th>Date</th><th>24 April 2024</th></thead><tr><td>Account manager</td><td>Fernando</td></tr><tr><td>Registration number</td><td>FGT#344</td></tr></table>

Date 24 April 2024; Account manager Fernando; Registration number FGT#344

                  Date 24 April 2024
      Account manager      Fernando
  Registration number       FGT#344

Table in a not so much human readable format

html table: <table><tr><td>Property</td><td>Joe Doe</td><td>25 April 2024</td></tr></table>

Property Joe Doe 25 April 2024

          0        1              2
  Property  Joe Doe  25 April 2024

Table in a tabular format

html table: <table><thead><th>Id</th><th>Product</th><th>Metric</th
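
The `replace` calls in the cell above rely on counting the spaces left behind by stripped tags, which depends on how many tags sit between two cells. As a sketch of an alternative, the standard library's `html.parser` can collect the cell text row by row. `TableRowCollector` is a hypothetical helper written for this illustration, not part of `unstructured`; the HTML string is the first table's `text_as_html`, copied from the output above.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Hypothetical helper: collect <th>/<td> cell text, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._current = []      # cells of the row being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._in_cell = True
            self._current.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._current[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag in ("tr", "thead") and self._current:
            # A row ends at </tr>; the header cells in this table sit directly
            # inside <thead>, so </thead> closes a row as well.
            self.rows.append(self._current)
            self._current = []


# text_as_html of the first table, copied from the output above
table_html = (
    "<table><thead><th>Date</th><th>24 April 2024</th></thead>"
    "<tr><td>Account manager</td><td>Fernando</td></tr>"
    "<tr><td>Registration number</td><td>FGT#344</td></tr></table>"
)

collector = TableRowCollector()
collector.feed(table_html)
collector.close()

# One row per "; "-separated group, cells separated by a space
table_text = "; ".join(" ".join(row) for row in collector.rows)
print(table_text)
```

For this table the result is the same semicolon-separated line that the cell above prints, but the row boundaries come from the markup rather than from whitespace counts.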

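Remove the footer. The "hi_res" strategy labeled the footer text as `NarrativeText`, so it survived the category filter above; drop every extracted element whose text matches the footer and page-number strings collected from the "fast" strategy: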

In [81]:
final_list = []
for el in extracted_elements:
  if el not in extracted_elements_fast:
    final_list.append(el)


final_list = "\n\n".join(final_list)
print(final_list)

The FIFA World Cup, often simply called the World Cup, is an international association football competition between the senior men's national teams of the members of the Fdration Internationale de Football Association (FIFA), the sport's global governing body.

Table in a human readable format

html table: <table><thead><th>Date</th><th>24 April 2024</th></thead><tr><td>Account manager</td><td>Fernando</td></tr><tr><td>Registration number</td><td>FGT#344</td></tr></table>

Date 24 April 2024; Account manager Fernando; Registration number FGT#344

                  Date 24 April 2024
      Account manager      Fernando
  Registration number       FGT#344

Table in a not so much human readable format

html table: <table><tr><td>Property</td><td>Joe Doe</td><td>25 April 2024</td></tr></table>

Property Joe Doe 25 April 2024

          0        1              2
  Property  Joe Doe  25 April 2024

Table in a tabular format

html table: <table><thead><th>Id</th><th>Product</th><th>Metric</th