## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

In [1]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q "unstructured[local-inference]==0.5.2" layoutparser
# Optional: upgrade to the latest versions instead (untested with this notebook)
# %pip install -q --upgrade unstructured layoutparser
%pip install -q "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

Selecting previously unselected package poppler-utils.
(Reading database ... 122518 files and directories currently installed.)
Preparing to unpack .../poppler-utils_0.86.1-0ubuntu1.1_amd64.deb ...
Unpacking poppler-utils (0.86.1-0ubuntu1.1) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.1.1-2build2_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2build2) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Setting up poppler-utils (0.86.1-0ubuntu1.1) ...
Setting up tesseract-ocr (4.1.1-2b

See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) for the example documents used in this tutorial. You can also upload your own files to Colab: click “Choose Files” in the left panel, then select the file to upload.

In [None]:
!mkdir -p example-docs
# Download example-10k.html and layout-parser-paper-fast.pdf
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

In [2]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### HTML Parsing

You can parse an HTML document using the following workflow:

In [None]:
from unstructured.documents.html import HTMLDocument

doc = HTMLDocument.from_file("example-docs/example-10k.html")

# To use a document from your Google Drive instead, uncomment the lines below:
# from google.colab import drive
# drive.mount('/content/drive/')
# doc = HTMLDocument.from_file("drive/MyDrive/your-filename.html")

The third page of the parsed document looks like this:

In [None]:
print(doc.pages[2])

SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS

This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.

Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking

In [None]:
doc.pages[2].elements

[<unstructured.documents.html.HTMLNarrativeText at 0x7f2b9d6b9190>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7f2b9d6b91f0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7f2b9d6b91c0>]

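Each element carries a type assigned by the parser, which is designed to distinguish titles from narrative text. In the output above, all three elements on this page, including the all-caps heading, were classified as `HTMLNarrativeText`, so check the element types on your own documents before relying on them.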

### PDF Parsing

You can use the following workflow to parse PDF documents.

In [3]:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("/content/20181024_bill.pdf")

Downloading model_final.pth:   0%|          | 0.00/330M [00:00<?, ?B/s]

Downloading (…)50_FPN_3x/config.yml:   0%|          | 0.00/5.37k [00:00<?, ?B/s]
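
To summarize what a parse produced, you can tally the elements by class. The following is a minimal sketch, assuming each element's Python class name reflects its type. `count_element_types` and the two stand-in classes are hypothetical and exist only so the example runs without the library installed.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element's class name to its count."""
    return Counter(type(el).__name__ for el in elements)


# Hypothetical stand-in classes (not the library's own element classes)
class Title:
    pass


class NarrativeText:
    pass


sample_elements = [Title(), NarrativeText(), NarrativeText()]
counts = count_element_types(sample_elements)
print(counts)
```

With real output you would pass the list returned by `partition_pdf`, for example `count_element_types(elements)`.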

The following cell prints the text of the first element, with newlines and spaces removed:

In [14]:
print(elements[0].to_dict()['text'].replace('\n','').replace(' ',''))

’晩1:唖wl細側rf附潅ﾙ寺一驚鶴：黄徽‘熊Ijii)〔〃()I〕･伸’|蛾i(1負‘ゞI鋤〔,！E「I言■＝一二F．‐l‐．､ぎぎ,1lii卑价釣篭患价ieoF126oooxci:_20az724‘洲‘i噌砿.州嫡弧i1．i分3‘(cid:631)』'’'''1.-1t.,''1;l'』‘』fj〔.．↑･1ﾋ調Iﾉ:鰈iiI1汽罪:12(j11分2911F’．､71’』1.f1r(,サ茎#1,’1W'''1:|}(cid:631)241傍241.''1|､>側i･I'j;'‘111:(cid:631)(cid:631)』i(cid:631)il:‘‘!『洲:鼠:tjミ“I价“･･j･『》';,11t｛)罰＄･I:｜l《；則；veh#22s“3Aa72211,.jiA(cid:631)．＃浄､1331“i;(cid:631)ｾﾙ:(cid:631)瀬：1<jJ,0nos+Hi‘卜(cid:631)l(cid:631)!'”ﾙz(cid:631)皇;(cid:631)−2(.!((cid:631)〕(cid:631)!#1ﾘ(cid:631)：llr了(cid:631)11(cid:631)『軍、｝‘(cid:631):'崎i‘i‐:;j2a~mnMj'(cid:631)‘1iHii’ｲ,ﾐｻ↑（j'“加澱ti2f，


In [17]:
[element.to_dict()['text'].replace('\n','').replace(' ','') for element in elements]

["’晩1:唖wl細側rf附潅ﾙ寺一驚鶴：黄徽‘熊Ijii)〔〃()I〕･伸’|蛾i(1負‘ゞI鋤〔,！E「I言■＝一二F．‐l‐．､ぎぎ,1lii卑价釣篭患价\x0c\x0cieoF126oooxci\x0c:_\x0c\x0c\x0c\x0c\x0c\x0c20\x0c\x0ca\x0cz\x0c\x0c\x0c\x0c\x0c724\x0c\x0c\x0c\x0c‘洲‘i噌砿.州嫡弧i1．i分3‘(cid:631)』'’'''1.-1t.,''1;l'』‘』fj〔.．↑･1ﾋ調Iﾉ:鰈iiI1汽罪:12(j11分2911F’．､71’』1.f1r(,サ茎#1,’1W'''1:|}(cid:631)241傍241.''1|､>側i･I'j;'‘111:(cid:631)(cid:631)』i(cid:631)il:‘‘!『洲:鼠:tjミ“I价“･･j･『》';,11t｛)罰＄･I:｜l《；則；\x0c\x0c\x0c\x0c\x0c\x0cveh\x0c#2\x0c\x0c\x0c\x0c\x0c\x0c2s“3\x0c\x0cA\x0ca7\x0c\x0c\x0c\x0c\x0c\x0c22\x0c\x0c11,.jiA(cid:631)．＃浄､1331“i;(cid:631)ｾﾙ:(cid:631)瀬：1<jJ,0\x0c\x0cnos+H\x0c\x0ci\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c‘卜(cid:631)l(cid:631)!'”ﾙz(cid:631)皇;(cid:631)−2(.!((cid:631)〕(cid:631)!#1ﾘ(cid:631)：llr了(cid:631)11(cid:631)『軍、｝‘(cid:631):'崎i‘i‐:;j\x0c2a~\x0c\x0cmn\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0cMj'(cid:631)‘1iHii’ｲ,ﾐｻ↑（j'“加澱ti2f，",
 "晩1:唖wl細側rf附潅ﾙ寺一驚鶴：黄徽‘熊Ijii)〔〃()I〕･伸’|蛾i(1負‘ゞI鋤〔,！E「I言■＝一二F．‐l‐．､ぎぎ,1lii卑价釣篭患价‘洲‘i噌砿.州嫡弧i1．i分3‘(cid:631)』'’'''1.-1t.,''1;l'』‘』fj〔.．↑･1ﾋ調I",
 '麺侭−−鴨漁督情悌
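
The list above still contains form-feed characters (`\x0c`), and removing every space also merges adjacent words in text that uses spaces as separators. A minimal sketch of an alternative cleanup follows. `normalize_text` is a hypothetical helper, not part of `unstructured`; it collapses any run of whitespace into a single space instead of deleting it.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace (spaces, newlines, form feeds) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample string containing the kinds of whitespace seen in the output
raw = "RECHNUNG\n\x0c\x0cDatum:   14.09.18 "
cleaned = normalize_text(raw)
print(repr(cleaned))
```

Whether to keep or drop spaces depends on the language of the document, so treat this as one option rather than a replacement for the cells above.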

In [18]:
from unstructured.partition.pdf import partition_pdf

test_elements = partition_pdf("/content/airFlight.pdf")

In [None]:
[element.to_dict()['text'].replace('\n','').replace(' ','') for element in test_elements]

In [None]:
len(test_elements)

24

In [None]:
[str(el) for el in test_elements]

In [22]:
new_elems = partition_pdf('/content/20180915_bill.pdf')

In [28]:
new_elems[1].to_dict()

{'element_id': '648fb4e5bfb3bd39a925a58e48133bf2',
 'coordinates': [[23.798675537109375, 210.62965393066406],
  [563.2022705078125, 210.62965393066406],
  [563.2022705078125, 628.3871459960938],
  [23.798675537109375, 628.3871459960938]],
 'text': 'RECHNUNG Rechnungs.-Nr. 474081 Datum: 14.09.18 z!mme٢ Anre!se Abre!se Se!te Benutzer ID 539 09.09.18 14.09.18 lofi KUE Gastname :Herr Jens Walter Datum Beschreibung Belastung Entlastung Übernachtung exklusive Frühstück* Übernachtung exklusive Frühstück* Übernachtung exklusive Frühstück* Übernachtung exklusive Frühstück* Übernachtung exklusive Frühstück* Mastercard IFC 09.09.18 110.00 10.09.18 11.09.18 110.00 110.00 12.09.18 13.09.18 110.00 110.00 14.09.18 550.00 Umsatzsteuer Detail Total inkl. MwSt. MwSt. 7% * Total Netto EUR 514.02 514.02 550.00 550.00 MwSt. EUR 35.98 35.98 Brutto EUR 550.00 550.00 Saldo 0.00 EUR Finanzamt: Hamburg Mitte steuernummer: 48/741/01228 Kreditkartsnöetaits Vertragsnummer : 154694832 Beleg Nr. Transaktionsbetrag :
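
The `coordinates` entry in the dictionary above lists four corner points. A minimal sketch of turning them into a bounding box follows, assuming each point is an `[x, y]` pair. The source does not state the unit or the origin of the coordinate system, and `bounding_box` is a hypothetical helper.

```python
def bounding_box(coordinates):
    """Return (x_min, y_min, x_max, y_max) for a list of [x, y] points."""
    xs = [point[0] for point in coordinates]
    ys = [point[1] for point in coordinates]
    return min(xs), min(ys), max(xs), max(ys)


# Corner points copied from the element dictionary shown above
coords = [
    [23.798675537109375, 210.62965393066406],
    [563.2022705078125, 210.62965393066406],
    [563.2022705078125, 628.3871459960938],
    [23.798675537109375, 628.3871459960938],
]

x_min, y_min, x_max, y_max = bounding_box(coords)
width = x_max - x_min
height = y_max - y_min
print(f"width={width:.2f}, height={height:.2f}")
```

A box that spans most of the page, as this one does, suggests the whole bill was detected as a single element.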

In [30]:
def extract_text(file_path):
    # Partition the PDF at file_path into elements
    pdf_elems = partition_pdf(file_path)
    # Collect each element's text with newlines and spaces removed
    text_data = [element.to_dict()['text'].replace('\n','').replace(' ','') for element in pdf_elems]
    return text_data

In [31]:
bill1 = extract_text('/content/20180915_bill.pdf')

In [None]:
bill1
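
Once the text is extracted, individual fields can be pulled out with regular expressions. The sketch below is illustrative only: `parse_bill_header` is a hypothetical helper, and its patterns are assumptions based on the single bill shown earlier (`Rechnungs.-Nr.` followed by digits, `Datum:` followed by a `DD.MM.YY` date). Other bills may need different patterns.

```python
import re

# Patterns assumed from the one bill shown earlier; \s* tolerates removed spaces
INVOICE_RE = re.compile(r"Rechnungs\.-Nr\.\s*(\d+)")
DATE_RE = re.compile(r"Datum:\s*(\d{2}\.\d{2}\.\d{2})")


def parse_bill_header(text: str) -> dict:
    """Extract the invoice number and date from bill text, or None if absent."""
    invoice = INVOICE_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "invoice_number": invoice.group(1) if invoice else None,
        "date": date.group(1) if date else None,
    }


# Start of the bill text from the earlier output, with spaces removed
sample = "RECHNUNGRechnungs.-Nr.474081Datum:14.09.18"
header = parse_bill_header(sample)
print(header)
```

Because `extract_text` returns one string per element, you would apply the helper to each entry, for example `[parse_bill_header(t) for t in bill1]`.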