## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

Another way to try out the `unstructured` library is to run it in a Docker container, which is compatible with both Intel/AMD and Apple Silicon. Check out the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [None]:
# Install commands as given in the package documentation
!apt-get -qq install poppler-utils tesseract-ocr
%pip install -q --user --upgrade pillow
%pip install -q unstructured["all-docs"]==0.12.5
# %pip install -q --upgrade unstructured

In [None]:
# !apt-get -qq install poppler-utils tesseract-ocr
%pip install -q --user --upgrade pillow
%pip install -q unstructured[pdf,image,local-inference]==0.13.3
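# libmagic, poppler-utils, tesseract-ocr, and pandoc are system packages,
# so they are installed with apt-get rather than pip
!apt-get -qq install libmagic-dev poppler-utils tesseract-ocr pandoc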
# %pip install -q --upgrade unstructured

In [None]:
!apt-get -qq install poppler-utils tesseract-ocr

In [None]:
# !pip show tesseract-ocr

See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) for the example documents used in this tutorial. You can also upload your own files: click “Choose Files” in the left panel, then select the file and upload it to Colab.

In [None]:
!mkdir -p example-docs
# Download example-10k.html and layout-parser-paper-fast.pdf
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### PDF Parsing

This section covers two strategies for parsing PDF documents: "fast" and "hi_res". The default strategy is "hi_res".

If your main objective is extracting text from a "clean" PDF (i.e. one that does not include text in images that requires OCR), go with the "fast" option.

Otherwise, if your PDF may contain images with text to extract, or you prefer better-structured Elements that characterize the text items in the document more precisely, go with the "hi_res" option.

Naturally, "fast" is faster than "hi_res" -- by an order of magnitude!
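The rule of thumb above can be condensed into a small decision helper. This is only a minimal sketch: `choose_pdf_strategy` and its two boolean parameters are hypothetical names introduced for illustration and are not part of `unstructured`. Only the returned strings "fast" and "hi_res" come from the text.

```python
def choose_pdf_strategy(has_text_in_images: bool, wants_rich_elements: bool = False) -> str:
    """Return a strategy name following the rule of thumb described above.

    Hypothetical helper, not part of the unstructured API.
    """
    # "hi_res" is worth its cost when text sits inside images (OCR needed)
    # or when better-structured elements are preferred.
    if has_text_in_images or wants_rich_elements:
        return "hi_res"
    # A clean, text-only PDF can take the quicker path.
    return "fast"


print(choose_pdf_strategy(has_text_in_images=False))
print(choose_pdf_strategy(has_text_in_images=True))
```

The returned string could then be passed as the `strategy` argument of `partition_pdf`.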

In [None]:
# Further reading
# Table extraction walkthrough:
# https://medium.com/unstructured-io/mastering-table-extraction-revolutionize-your-earnings-reports-analysis-with-ai-1bc32c22720e
# Partitioning reference:
# https://unstructured-io.github.io/unstructured/core/partition.html

In [None]:
from unstructured.partition.pdf import partition_pdf
# from unstructured.partition.auto import partition

# Path to the PDF to parse. The download cell above only saves
# example-docs/layout-parser-paper-fast.pdf, so upload this file first
# or point filename at a PDF you already have.
filename = "example-docs/pdf example.pdf"
# filename = 'example-docs/pdf example - footer with words.pdf'
# filename = 'example-docs/pdf example - simple footer at bottom.pdf'
# filename = 'example-docs/Q3FY24-CFO-Commentary.pdf'

# elements = partition_pdf(filename, strategy='fast')

elements = partition_pdf(
    filename,
    strategy="hi_res",  # accepted values: "auto", "hi_res", "ocr_only", "fast"
    infer_table_structure=True,  # tables also get an HTML rendering in metadata (text_as_html)
    model_name="yolox",
    # include_page_breaks=True,
    # extract_element_types=["Table"],
    # extract_images_in_pdf=True,
    # extract_image_block_output_dir="example-docs",
    # languages=["eng", "pt"],
)

In [None]:
from collections import Counter
display(Counter(type(element) for element in elements))
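The cell above counts element classes, so the `Counter` keys are class objects. Counting on the `category` attribute instead gives plain string labels that line up with the category names used for filtering later in this notebook. Here is a minimal sketch of that variant; `StandInElement` and the sample data are hypothetical stand-ins so the snippet runs without a parsed document.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StandInElement:
    """Hypothetical stand-in exposing the two attributes the notebook reads."""
    category: str
    text: str


# Made-up sample data, only for illustration.
sample_elements = [
    StandInElement("Title", "Quarterly Report"),
    StandInElement("NarrativeText", "Revenue grew during the quarter."),
    StandInElement("NarrativeText", "Costs were flat."),
    StandInElement("Table", "Item Value"),
]

# Count elements per category label, most frequent first.
category_counts = Counter(el.category for el in sample_elements)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

With real data, `sample_elements` would be replaced by the `elements` list returned by `partition_pdf`.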

Let's display the type and text of the elements in the document:

In [None]:
display(*[(element.category, element.text) for element in elements])

In [None]:
categories_to_remove = ['Header',
                        'UncategorizedText',
                        'Image',
                        'Footer',
                        # 'Title'
                        ]
# result = "\n\n".join([str(el) for el in elements if el.category not in categories_to_remove])
# print(result)
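The commented-out lines above filter elements by category and join what remains. The sketch below shows that idea on made-up data, assuming that `str(el)` yields the element text, as the commented code implies. `StandInElement` is a hypothetical stand-in class, and a set is used instead of a list only to make the membership test explicit.

```python
class StandInElement:
    """Hypothetical stand-in: has a category and renders as its text."""

    def __init__(self, category: str, text: str) -> None:
        self.category = category
        self.text = text

    def __str__(self) -> str:
        return self.text


categories_to_remove = {"Header", "UncategorizedText", "Image", "Footer"}

# Made-up sample data, only for illustration.
sample_elements = [
    StandInElement("Header", "ACME Corp - Confidential"),
    StandInElement("Title", "Quarterly Report"),
    StandInElement("NarrativeText", "Revenue grew during the quarter."),
    StandInElement("Footer", "Page 1"),
]

# Keep only elements whose category is not in the removal set,
# then join their text with blank lines in between.
kept = [el for el in sample_elements if el.category not in categories_to_remove]
result = "\n\n".join(str(el) for el in kept)
print(result)
```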

In [None]:
# Pipeline:
#   1. convert the elements to dicts
#   2. for tables, keep both the HTML rendering and a flattened text version
#   3. strip HTML tags, non-ASCII characters, and extra whitespace

from unstructured.staging.base import convert_to_dict
import re
from unstructured.cleaners.core import (clean_extra_whitespace,
                                        clean_non_ascii_chars)


def clean_html_tag(raw_html):
  """Replace every HTML tag in raw_html with a single space."""
  # return re.compile(r'<[^>]+>').sub(' ', raw_html)
  return re.sub(r'<.*?>', ' ', raw_html)


elements_dict = convert_to_dict(elements)
extracted_elements = []
for el in elements_dict:
  if el["type"] not in categories_to_remove:
    # text_as_html is only present when table structure was inferred,
    # so check for it before reading it.
    if el["type"] == "Table" and "text_as_html" in el["metadata"]:
      table_html = el["metadata"]["text_as_html"]
      extracted_elements.append('html table: ' + table_html)
      # Flatten the table: runs of four spaces left by stripped tags become "; ".
      extracted_elements.append(clean_extra_whitespace(
                                  clean_non_ascii_chars(
                                      clean_html_tag(table_html).replace("    ", '; ').replace("   ", "")
                                    )))

    # Every kept element, tables included, also contributes its plain text.
    extracted_elements.append(clean_extra_whitespace(
                                  clean_non_ascii_chars(
                                      el["text"]
                                  )))


extracted_elements = "\n\n".join(extracted_elements)
print(extracted_elements)
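To see what the tag-stripping step does to a table, here is a minimal standard-library sketch on a made-up HTML string. `strip_tags` mirrors `clean_html_tag` above, while `collapse_whitespace` is a hypothetical stand-in for `clean_extra_whitespace`. The sample table is an assumption for illustration; real `text_as_html` output may be structured differently, in which case the four-space row heuristic would behave differently too.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(raw_html: str) -> str:
    """Replace every HTML tag with a single space."""
    return TAG_PATTERN.sub(" ", raw_html)


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up table, only for illustration.
sample_html = (
    "<table><tr><td>Revenue</td><td>100</td></tr>"
    "<tr><td>Cost</td><td>60</td></tr></table>"
)

stripped = strip_tags(sample_html)
# In this sample a row boundary (</td></tr><tr><td>) leaves four spaces,
# so replacing four spaces with "; " marks the row breaks.
with_row_breaks = stripped.replace("    ", "; ")
flattened = collapse_whitespace(with_row_breaks)
print(flattened)
```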

