<a href="https://colab.research.google.com/github/AreebAhmad-02/Rags-using-qdrant/blob/main/Unstructured_io_pdf_qdrant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

Another way to try out the `unstructured` library is to run it in a Docker container, which is compatible with both Intel/AMD and Apple Silicon. Check out the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [None]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q unstructured["all-docs"]==0.12.5
# NOTE: you may also upgrade to the latest version with the command below,
#       though a more recent version of unstructured will not have been tested with this notebook
# %pip install -q --upgrade unstructured

Selecting previously unselected package poppler-utils.
(Reading database ... 121913 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.4_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.4) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.1.1-2.1build1_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2.1build1) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
Setting up poppler-utils (22.02.0-2ubuntu0.4) ...
Setting up tess

See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) to find the example docs used in this tutorial. You can also upload your own files by clicking “Choose Files” in the left panel, then selecting the file to upload to Colab.

In [None]:
!mkdir -p example-docs
# Download example-10k.html and layout-parser-paper-fast.pdf

!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

--2024-06-10 15:13:52--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘example-docs/example-10k.html’


2024-06-10 15:13:52 (152 MB/s) - ‘example-docs/example-10k.html’ saved [2456707/2456707]

--2024-06-10 15:13:52--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172270 (168K) [appl

In [None]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### HTML Parsing

You can parse an HTML document using the following workflow:

In [None]:
from unstructured.documents.html import HTMLDocument

doc = HTMLDocument.from_file("example-docs/example-10k.html")

# To use a document from your Google Drive instead, uncomment the lines below:
# from google.colab import drive
# drive.mount('/content/drive/')
# doc = HTMLDocument.from_file("drive/MyDrive/your-filename.html")

The third page of output looks like the following:

In [None]:
print(doc.pages[2])

SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS

This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.

Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking

In [None]:
doc.pages[2].elements

[<unstructured.documents.html.HTMLTitle at 0x7fc2ebf462c0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7fc2ebf46410>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7fc2ebf464a0>]

You can see that the parser successfully differentiated between titles and narrative text.
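As a rough sketch of how that distinction could be used downstream, the snippet below separates a page's titles from its body text. The classes are hypothetical stand-ins that only mimic the class names seen in the output above, so the snippet runs without the library; `is_kind` and `split_titles_and_body` are illustrative helpers, not part of `unstructured`. It assumes, as the names suggest, that the HTML-specific classes derive from `Title` and `NarrativeText`, and the paragraph strings are placeholders.

```python
from dataclasses import dataclass


# Hypothetical stand-ins that mirror the class names in the output above.
@dataclass
class Title:
    text: str


@dataclass
class NarrativeText:
    text: str


class HTMLTitle(Title):
    pass


class HTMLNarrativeText(NarrativeText):
    pass


def is_kind(element, kind):
    """True if the element's class, or any base class, is named `kind`."""
    return any(cls.__name__ == kind for cls in type(element).__mro__)


def split_titles_and_body(elements):
    """Return (title texts, narrative texts) for a list of elements."""
    titles = [el.text for el in elements if is_kind(el, "Title")]
    body = [el.text for el in elements if is_kind(el, "NarrativeText")]
    return titles, body


# Placeholder page with the same shape as doc.pages[2].elements.
page = [
    HTMLTitle("SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS"),
    HTMLNarrativeText("First paragraph (placeholder)."),
    HTMLNarrativeText("Second paragraph (placeholder)."),
]

titles, body = split_titles_and_body(page)
print(titles)
print(len(body), "narrative paragraphs")
```

With the library installed, an `isinstance` check against the element classes is the more direct form; matching on class names here is only a way to keep the sketch self-contained.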

### PDF Parsing

There are two strategies available for parsing PDF documents: "fast" and "hi_res". The default strategy is "hi_res".

If your main objective is extracting text from a "clean" PDF (i.e. one that does not include text in images that require OCR), go with the "fast" option.

Otherwise, if your PDF may have images with text to extract, or you prefer better-structured Elements that more accurately characterize the text items within the document, go with the "hi_res" option.

Naturally, "fast" is faster than "hi_res" -- by an order of magnitude!
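The guidance above can be written down as a small decision rule. This is only a sketch: `choose_pdf_strategy` and its two parameters are hypothetical names introduced here, not part of the `unstructured` API. The returned string is the value you would pass as the `strategy` argument of `partition_pdf`.

```python
def choose_pdf_strategy(has_text_in_images, needs_rich_structure=False):
    """Return the strategy name suggested by the guidance above.

    has_text_in_images: the PDF contains text that only OCR can recover.
    needs_rich_structure: better-structured Elements matter more than speed.
    """
    if has_text_in_images or needs_rich_structure:
        return "hi_res"
    return "fast"


# A clean, text-based PDF where speed matters most.
print(choose_pdf_strategy(has_text_in_images=False))

# A PDF with scanned pages or text embedded in images.
print(choose_pdf_strategy(has_text_in_images=True))
```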

In [None]:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("/content/datadocs/The_Alchemist.pdf")



In [None]:
elements_fast = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="fast")

Let's examine the types of elements returned for both the "hi_res" and "fast" strategies:

In [None]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")
# The composition of elements can be different for elements derived with the "fast" strategy
display(Counter(type(element) for element in elements_fast))

Counter({unstructured.documents.elements.Title: 435,
         unstructured.documents.elements.NarrativeText: 3539,
         unstructured.documents.elements.Text: 31,
         unstructured.documents.elements.Address: 1})




Counter({unstructured.documents.elements.Text: 9,
         unstructured.documents.elements.Title: 4,
         unstructured.documents.elements.NarrativeText: 8,
         unstructured.documents.elements.ListItem: 4})
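Because the two documents differ greatly in size, raw counts are hard to compare side by side. One option, sketched below, is to convert the counts into shares of the total. `type_shares` is a hypothetical helper, and the two dataclasses are stand-ins used only so the snippet runs without the library; the function itself needs nothing more than a list of objects, so it should work equally on `elements` or `elements_fast`.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical stand-ins for the library's element classes.
@dataclass
class Title:
    text: str


@dataclass
class NarrativeText:
    text: str


def type_shares(elements):
    """Map each element class name to its fraction of all elements."""
    counts = Counter(type(el).__name__ for el in elements)
    total = sum(counts.values())
    # most_common() orders the result from most to least frequent.
    return {name: count / total for name, count in counts.most_common()}


# Placeholder elements: one title followed by three paragraphs.
sample = [Title("ONE"), NarrativeText("a"), NarrativeText("b"), NarrativeText("c")]

shares = type_shares(sample)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```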

Let's display the type and text of some of the elements in the document:

In [None]:
print(type(elements[0]))

<class 'unstructured.documents.elements.Title'>


In [None]:
display(*[(type(element), element.text) for element in elements[10:13]])

(unstructured.documents.elements.Title, 'ONE')

(unstructured.documents.elements.NarrativeText,
 'The boy’s name was Santiago. Dusk was')

(unstructured.documents.elements.NarrativeText, 'falling as the…')

You can see that the parser also successfully differentiated between titles and narrative text in a PDF file. However, be aware that element classification is still improving as the library evolves, tends to be more accurate with the "hi_res" strategy, and may not always be correct.
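Narrative text often arrives as short fragments, as in the `ONE` example above, where one sentence is split across two elements. A minimal sketch of one way to regroup them: start a new section at every title and join the fragments that follow it. `group_by_title` is a hypothetical helper, the dataclasses are stand-ins so the snippet runs without the library, and the sample strings are copied from the outputs shown in this notebook. Since classification is not always correct, a misclassified title would split or merge sections here.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the library's element classes.
@dataclass
class Title:
    text: str


@dataclass
class NarrativeText:
    text: str


def group_by_title(elements):
    """Return (title, body) pairs, joining the text that follows each title."""
    sections = []
    current_title, fragments = None, []
    for el in elements:
        if type(el).__name__ == "Title":
            # Close the previous section before starting a new one.
            if current_title is not None or fragments:
                sections.append((current_title, " ".join(fragments)))
            current_title, fragments = el.text, []
        else:
            fragments.append(el.text)
    if current_title is not None or fragments:
        sections.append((current_title, " ".join(fragments)))
    return sections


sample = [
    Title("ONE"),
    NarrativeText("The boy’s name was Santiago. Dusk was"),
    NarrativeText("falling as the…"),
    Title("TWO"),
    NarrativeText("The boy had been working for the crystal"),
    NarrativeText("merchant for…"),
]

sections = group_by_title(sample)
for title, body in sections:
    print(f"{title}: {body}")
```

Text that appears before the first title is kept in a section whose title is `None`, so nothing is silently dropped.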

Now we can join the elements and print the text extracted from the PDF:

In [None]:
print("\n\n".join([str(el) for el in elements]))

THE ALCHEMIST

PAULO COELHO

TRANSLATED BY ALAN R. CLARKE

Contents

INTRODUCTION

I remember receiving a letter from the

American publisher Harper Collins…

PROLOGUE

The alchemist picked up a book that someone

in the…

ONE

The boy’s name was Santiago. Dusk was

falling as the…

TWO

The boy had been working for the crystal

merchant for…

EPILOGUE

The boy reached the small, abandoned

church just as night…

ABOUT THE AUTHOR

INTERNATIONAL ACCLAIM

BOOKS BY PAULO COELHO

CREDITS

COVER

COPYRIGHT

ABOUT THE PUBLISHER

TEN YEARS ON

I REMEMBER RECEIVING A LETTER FROM THE AMERICAN publisher Harper

Collins that said that: “reading The Alchemist was like getting up at dawn and seeing the sun rise while the rest of the world still slept.” I

went outside, looked up at the sky, and thought to myself: “So, the book is going to be published in English!” At the time, I was

struggling to establish myself as a writer and to follow my path

despite all the voices telling me it was impossibl