## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

Another way to try out the `unstructured` library is to run it in a Docker container, which is compatible with both Intel/AMD and Apple Silicon. Check out the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [1]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q unstructured["all-docs"]==0.12.5
# NOTE: you may also upgrade to the latest version with the command below,
#       though a more recent version of unstructured will not have been tested with this notebook
# %pip install -q --upgrade unstructured

Selecting previously unselected package poppler-utils.
(Reading database ... 121752 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.3_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.3) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.1.1-2.1build1_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2.1build1) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
Setting up poppler-utils (22.02.0-2ubuntu0.3) ...
Setting up tess

See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) for the example documents used in this tutorial. You can also upload your own files: click “Choose Files” in the left panel, then select the file and upload it to Colab.

In [2]:
!mkdir -p example-docs
# Download example-10k.html and layout-parser-paper-fast.pdf
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

--2024-04-23 15:48:12--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘example-docs/example-10k.html’


2024-04-23 15:48:12 (35.7 MB/s) - ‘example-docs/example-10k.html’ saved [2456707/2456707]

--2024-04-23 15:48:12--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172270 (168K) [app

In [3]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### HTML Parsing

You can parse an HTML document using the following workflow:

In [None]:
from unstructured.documents.html import HTMLDocument

doc = HTMLDocument.from_file("example-docs/example-10k.html")

# This is how you would use a document from your Google Drive
"""
from google.colab import drive
drive.mount('/content/drive/')
doc = HTMLDocument.from_file("drive/MyDrive/your-filename.html")
"""

The third page of output looks like the following:

In [None]:
print(doc.pages[2])

In [None]:
doc.pages[2].elements

You can see that the parser successfully differentiated between titles and narrative text.
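To work with that distinction programmatically, one option is to group element texts by their type. The sketch below is a minimal illustration that assumes each element has already been converted to a dict with `"type"` and `"text"` keys (the shape that `convert_to_dict` from `unstructured.staging.base` is used with later in this notebook). The helper name `group_text_by_type` and the sample records are hypothetical and are not real parser output.

```python
from collections import defaultdict


def group_text_by_type(records):
    """Group element texts by their "type" field, preserving document order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["type"]].append(record["text"])
    return dict(grouped)


# Hypothetical records for illustration only (not real parser output).
sample_records = [
    {"type": "Title", "text": "Risk Factors"},
    {"type": "NarrativeText", "text": "Our business depends on several factors."},
    {"type": "Title", "text": "Legal Proceedings"},
]

grouped = group_text_by_type(sample_records)
print(grouped["Title"])
```

With real data, the records would come from something like `convert_to_dict(doc.pages[2].elements)` rather than a hand-written list.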

### PDF Parsing

There are two strategies available for parsing PDF documents: "fast" and "hi_res". The default strategy is "hi_res".

If your main objective is extracting text from a "clean" PDF (i.e., one that does not include text in images that requires OCR), go with the "fast" option.

Otherwise, if your PDF may contain images with text to extract, or you prefer better-structured Elements that more accurately characterize the text items within the document, go with the "hi_res" option.

Naturally, "fast" is faster than "hi_res" -- by an order of magnitude!
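The rule of thumb above can be written down as a small helper. This is only a sketch of the decision described in this section: the function name `choose_pdf_strategy` and its two flags are hypothetical and not part of the `unstructured` API.

```python
def choose_pdf_strategy(needs_ocr: bool, needs_rich_structure: bool = False) -> str:
    """Return "hi_res" when OCR or better-structured elements are needed, else "fast"."""
    if needs_ocr or needs_rich_structure:
        return "hi_res"
    return "fast"


# A "clean" PDF with no text embedded in images:
strategy = choose_pdf_strategy(needs_ocr=False)
print(strategy)
```

The returned string would then be passed along as the `strategy` argument, for example `partition_pdf(filename, strategy=strategy)`.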

In [95]:
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.auto import partition
filename = "example-docs/pdf example.pdf"
# filename = 'example-docs/Q3FY24-CFO-Commentary.pdf'

# infer_table_structure=True also populates metadata.text_as_html for Table elements
elements = partition_pdf(
    filename,
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox",
)


# elements = partition(filename=filename,
#                      strategy='hi_res',
#            )

# elements_fast = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="fast")

In [112]:
# pipeline
# save elements to json
# get table as text and text_as_html
# clean tags

import re

from unstructured.staging.base import convert_to_dict

def cleanhtml(raw_html):
    # Replace every HTML tag with a space, leaving only the text content
    return re.sub(r"<[^>]+>", " ", raw_html)

data = convert_to_dict(elements)
extracted_elements = []
for entry in data:
    if entry["type"] == "Table":
        # Keep both the HTML rendering and the plain text of each table
        extracted_elements.append(entry["metadata"]["text_as_html"])
        extracted_elements.append(entry["text"])

extracted_elements = [cleanhtml(item) for item in extracted_elements]
extracted_elements
# with open("example-docs/nvidia-yolox.txt", 'w') as output_file:
#     for element in extracted_elements:
#         output_file.write(element + "\n\n")

['   Account manager  Joe Doe    Registration policy  AC#23535POD   ',
 'Insurance date Account manager Registration policy 22 April 2024 Joe Doe AC#23535POD',
 '   Insurance date  Account manager  Reg policy    22 April 2023  Joe Doe  ASDW#2343   ',
 'Insurance date 22 April 2023 Account manager Joe Doe Reg policy ASDW#2343']
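Stripping tags with a regular expression flattens each table into a single string, so the row and column boundaries are lost. If those boundaries matter, an alternative is to parse the `text_as_html` value into rows of cells. The following is a minimal sketch using only the standard library's `html.parser`; the names `TableRowParser` and `table_html_to_rows` are hypothetical, and `sample_html` is a hand-written stand-in assumed to resemble a `text_as_html` value, not actual output of the library.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _close_cell(self):
        # Store the finished cell (if any) in the current row
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None


def table_html_to_rows(table_html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(table_html)
    parser.close()
    return parser.rows


# Hand-written stand-in for a text_as_html value (assumed shape).
sample_html = (
    "<table>"
    "<tr><td>Insurance date</td><td>Account manager</td><td>Reg policy</td></tr>"
    "<tr><td>22 April 2023</td><td>Joe Doe</td><td>ASDW#2343</td></tr>"
    "</table>"
)

rows = table_html_to_rows(sample_html)
for row in rows:
    print(row)
```

In the notebook, `table_html_to_rows(entry["metadata"]["text_as_html"])` could replace the `cleanhtml` call for `Table` entries when a row-by-row structure is preferred over flat text.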

In [108]:
# import json
# import re
# from unstructured.staging.base import elements_to_json
# elements_to_json(elements, filename=f"example-docs/nvidia.json")


# def cleanhtml(raw_html):
#   return re.compile(r'<[^>]+>').sub(' ', raw_html)
#   # cleantext = re.sub(re.compile('<.*?>'), '', raw_html)
#   # return cleantext

# def process_json_file(input_filename):
#     # Read the JSON file
#     with open(input_filename, 'r') as file:
#         data = json.load(file)

#     # Iterate over the JSON data and extract required table elements
#     extracted_elements = []
#     for entry in data:
#         if entry["type"] == "Table":
#             extracted_elements.append(entry["metadata"]["text_as_html"])
#             extracted_elements.append(entry["text"])

#     extracted_elements =  [cleanhtml(item) for item in extracted_elements]
#     with open("example-docs/nvidia-yolox.txt", 'w') as output_file:
#         for element in extracted_elements:
#             output_file.write(element + "\n\n")

# process_json_file("example-docs/nvidia.json")

In [62]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")
# The composition of elements can be different for elements derived with the "fast" strategy
# display(Counter(type(element) for element in elements_fast))

Counter({unstructured.documents.elements.Header: 2,
         unstructured.documents.elements.NarrativeText: 6,
         unstructured.documents.elements.Table: 2,
         unstructured.documents.elements.Image: 2})




Let's display the type and text of some of the elements in the document:

In [None]:
display(*[(element.category, element.text) for element in elements])

You can see that the parser also successfully differentiated between titles and narrative text in a PDF file. However, be aware that element classification is still improving as the library evolves, tends to be more accurate with the "hi_res" strategy, and may not always be correct.

Now we can join the elements and print the extracted text from the PDF:

In [64]:
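categories_to_remove = ["Header", "UncategorizedText", "Image", "Footer"]
result = "\n\n".join(str(el) for el in elements if el.category not in categories_to_remove)
print(result)

# NOTE: the footers are not removed here. They still appear in the output below
# (e.g. "This is a footer 1"), so they were not categorized as "Footer".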

The FIFA World Cup, often simply called the World Cup, is an international association football competition between the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body.

This is a more human readable table format

Insurance date Account manager Registration policy 22 April 2024 Joe Doe AC#23535POD

This is another attempt

Insurance date 22 April 2023 Account manager Joe Doe Reg policy ASDW#2343

This is a footer 1

To date, the final of the World Cup has only been contested by teams from the UEFA (Europe) and CONMEBOL (South America) confederations. European nations have won twelve titles, while South American nations have won ten. Only three teams from outside these two continents have ever

This is a footer 2
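The footer lines survive the category filter in the output above, which suggests they were not labeled `Footer`. One possible workaround is to filter on the text itself. The sketch below is a heuristic for this particular sample document only: the pattern `FOOTER_PATTERN` and the helper `drop_footer_like` are hypothetical, and a real document would need its own rule.

```python
import re

# Hypothetical pattern matching the footer lines of this sample PDF only.
FOOTER_PATTERN = re.compile(r"^This is a footer \d+$")


def drop_footer_like(texts, pattern=FOOTER_PATTERN):
    """Return the texts that do not look like a footer according to the pattern."""
    return [text for text in texts if not pattern.match(text.strip())]


texts = [
    "This is another attempt",
    "This is a footer 1",
    "This is a footer 2",
]
print(drop_footer_like(texts))
```

In the notebook, the same check could be added to the list comprehension that builds `result`, next to the `categories_to_remove` condition.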


In [121]:
import re

from unstructured.staging.base import convert_to_dict

def cleanhtml(raw_html):
    # Replace every HTML tag with a space, leaving only the text content
    return re.sub(r"<[^>]+>", " ", raw_html)

data = convert_to_dict(elements)
extracted_elements = []
for entry in data:
    if entry["type"] not in categories_to_remove:
        # Tables contribute two entries: the HTML rendering, then the plain text
        if entry["type"] == "Table":
            extracted_elements.append(entry["metadata"]["text_as_html"] + " - " + entry["type"])
        # Suffix each entry with its element type so the classification is visible
        extracted_elements.append(entry["text"] + " - " + entry["type"])

extracted_elements = [cleanhtml(item) for item in extracted_elements]

extracted_elements = "\n\n".join(extracted_elements)
print(extracted_elements)

The FIFA World Cup, often simply called the World Cup, is an international association football competition between the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body. - NarrativeText

This is a more human readable table format - NarrativeText

   Account manager  Joe Doe    Registration policy  AC#23535POD    - Table

Insurance date Account manager Registration policy 22 April 2024 Joe Doe AC#23535POD - Table

This is another attempt - NarrativeText

   Insurance date  Account manager  Reg policy    22 April 2023  Joe Doe  ASDW#2343    - Table

Insurance date 22 April 2023 Account manager Joe Doe Reg policy ASDW#2343 - Table

This is a footer 1 - NarrativeText

To date, the final of the World Cup has only been contested by teams from the UEFA (Europe) and CONMEBOL (South America) confederations. European nations have won twelve titles, while South American nations have won ten. Only three te