Set up a local unstructured-api environment with a Docker container

`docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0`

In [1]:
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

client = UnstructuredClient(
    server_url="http://localhost:8000",
    api_key_auth="",  # No API key is needed because the local container is used instead of the SaaS API.
)

Load a PDF file from your local disk. For example, you can use the sample document below.

https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/interface-config-guide-p93.pdf

In [2]:
filename = "example/multi-column.pdf"

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

files

Files(content=b'%PDF-1.5\n%\x8f\n77 0 obj\n<< /Filter /FlateDecode /Length 6090 >>\nstream\n... [binary PDF content truncated]
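
The dumped content starts with the `%PDF-` header, which suggests a cheap sanity check before sending a file to the API. The sketch below is only an illustration: `looks_like_pdf` is a hypothetical helper, not part of `unstructured_client`, and it only inspects the leading bytes rather than validating the whole file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the standard PDF header marker."""
    # Hypothetical helper: checks only the first five bytes, nothing else.
    return data.startswith(b"%PDF-")


# Illustrative byte strings; in the notebook you would pass files.content.
pdf_like = looks_like_pdf(b"%PDF-1.5\n%\x8f\n77 0 obj")
not_pdf = looks_like_pdf(b"plain text, not a PDF")
print(pdf_like, not_pdf)
```

In the notebook this could be called as `looks_like_pdf(files.content)` right after the file is read.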

Then set up the Unstructured API parameters.

In [3]:
req = shared.PartitionParameters(
    files=files,
    chunking_strategy="by_title",
    strategy='hi_res',
    split_pdf_page=True,
    coordinates=True,  # Example setting only. If you use split_pdf_page, the hi_res strategy is recommended.
)

In [4]:
try:
    resp = client.general.partition(req)
    print("Handled results :", len(resp.elements))
except Exception as e:
    print("Exception :", e)

INFO: Splitting PDF by page on client. Using 5 threads when calling API.
INFO: Set UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS env var if you want to change that.
Handled results : 165
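
The log output above mentions the `UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS` environment variable. A minimal sketch of setting it from Python follows, assuming the client reads the variable when the request is made; the value `"10"` is an arbitrary illustration, not a recommended setting.

```python
import os

# Set this before calling client.general.partition(...).
# "10" is an arbitrary example value; the log above shows 5 threads in use by default.
os.environ["UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS"] = "10"

split_threads = os.environ["UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS"]
print("Split call threads:", split_threads)
```

Environment variables are always strings, so the thread count is stored as `"10"` rather than the integer `10`.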


Run the same request without split_pdf_page and compare the runtime.

Splitting by page takes less time when the PDF has many long pages.
(The example file takes 27.3s with splitting and 1m5.6s without.)
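
One way to measure such runtimes yourself is a small timing wrapper. The sketch below is an assumption about how the comparison could be made, not necessarily how the numbers above were obtained; `timed` is a hypothetical helper, and `sum` stands in for the real API call so the block runs without a server.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in call so the example is self-contained.
# In the notebook this would be: timed(client.general.partition, req)
result, elapsed = timed(sum, [1, 2, 3])
print(f"result={result}, elapsed={elapsed:.6f}s")
```

Running the wrapper once with `split_pdf_page=True` and once without gives two elapsed values that can be compared directly.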

In [12]:
req = shared.PartitionParameters(
    files=files,
    chunking_strategy="by_title",
    strategy='hi_res',
    # split_pdf_page=True,
    coordinates=True,  # Example setting only. If you use split_pdf_page, the hi_res strategy is recommended.
)

In [13]:
try:
    resp = client.general.partition(req)
    print("Handled results :", len(resp.elements))
except Exception as e:
    print("Exception :", e)

Handled results : 163


In [None]:
for element in resp.elements:
    print(element.keys())
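
Beyond printing the keys, it can help to summarize what kinds of elements came back. The following is a minimal sketch, assuming each element is a dict with a `"type"` key; `count_element_types` is a hypothetical helper and the sample data is invented for illustration, not taken from the response above.

```python
from collections import Counter


def count_element_types(elements):
    """Count elements by their "type" value, using "unknown" when the key is missing."""
    return Counter(el.get("type", "unknown") for el in elements)


# Invented sample data; in the notebook you would pass resp.elements instead.
sample_elements = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Some body text."},
    {"type": "NarrativeText", "text": "More body text."},
    {"text": "An element without a type key."},
]

type_counts = count_element_types(sample_elements)
print(type_counts)
```

Comparing these counts between the split and non-split runs may help explain why the two requests returned slightly different element totals.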