## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

Another way to try out the `unstructured` library is by running a docker container -- compatible with either Intel/AMD or Apple Silicon! Check out the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [1]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q unstructured["all-docs"]==0.12.5
# NOTE: you may also upgrade to the latest version with the command below,
#       though a more recent version of unstructured will not have been tested with this notebook
# %pip install -q --upgrade unstructured

/bin/bash: line 1: apt-get: command not found
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) to find example docs used in this tutorial. You can also upload your own files by clicking on “Choose Files” on the left panel then select and upload the file to Colab.

In [2]:
!mkdir -p example-docs
# Install example-10k.html and layout-parser-paper.pdf
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

--2024-05-23 22:45:01--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘example-docs/example-10k.html’


2024-05-23 22:45:03 (1.31 MB/s) - ‘example-docs/example-10k.html’ saved [2456707/2456707]

--2024-05-23 22:45:03--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 2

In [1]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/dikshant/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/dikshant/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

### HTML Parsing

You can parse an HTML document using the following workflow:

In [None]:
from unstructured.documents.html import HTMLDocument

doc = HTMLDocument.from_file("example-docs/example-10k.html")

# This is how you would use a document from your google Drive
"""
from google.colab import drive
drive.mount('/content/drive/')
doc = HTMLDocument.from_file("drive/MyDrive/your-filename.html")
"""

'\nfrom google.colab import drive\ndrive.mount(\'/content/drive/\')\ndoc = HTMLDocument.from_file("drive/MyDrive/your-filename.html")\n'

The third page of output looks like the following:

In [None]:
print(doc.pages[2])

SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS

This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.

Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking

In [None]:
doc.pages[2].elements

[<unstructured.documents.html.HTMLTitle at 0x7fc2ebf462c0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7fc2ebf46410>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7fc2ebf464a0>]

You can see that the parser successfully differentiated between titles and narrative text.

### PDF Parsing

There are two strategies availalbe for parsing PDF documents: "fast" and "hi_res." The default strategy is "hi_res"

If your main objective is extracting text from a "clean" PDF, i.e. one that does not include text in images that require OCR), go with the "fast" option.

Otherwise, if your PDF may have images with text to extract, or, you prefer to have better structured Elements that better characterize the text items within the document, go with with the "hi_res" option.

Naturally, "fast" is faster than "hi_res" -- by an order of magnitude!

In [2]:
from unstructured.partition.pdf import partition_pdf

# elements = partition_pdf("tata.pdf")

elements_fast = partition_pdf("/home/dikshant/BOSCH/Round1/tata.pdf",
    chunking_strategy="by_title",
    strategy="fast",
    max_characters=1500,
    overlap=300,
    overlap_all= True
  )

ModuleNotFoundError: No module named 'unstructured.partition'; 'unstructured' is not a package

Let's examine the types of elements returned for both the "hi_res" and "fast" strategies:

In [19]:
from collections import Counter

# display(Counter(type(element) for element in elements))
# print("")
# The composition of elements can be different for elements derived with the "fast" strategy
display(Counter(type(element) for element in elements_fast))

Counter({unstructured.documents.elements.CompositeElement: 692})

In [None]:
display(*[(type(element), element.text) for element in elements_fast])

Let's display the type and text of some of the elements in the document:

You can see that the parser also successfully differentiated between titles and narrative text from a PDF file. However, be aware that element classification is improving as the library evolves, tends to be more accurate with the "hi_res" strategy, and may not always correct.

Now we can join the elements and print the extracted texts from the PDF

In [9]:
print("\n\n".join([str(el) for el in elements]))

Dikshant Khandelwal Roll No.:CS22BTECH11017 B.Tech - Compute Science and Engineering Indian Institute Of Technology, Hyderabad

+91-7014393414 cs22btech11017@iith.ac.in dikkpsd@gmail.com Github | linkedin

Education

Degree/Certificate B.Tech. CSE Senior Secondary Secondary

Institute/Board Indian Institute of Technology, Hyderabad Noble Kingdom Public School Noble Kingdom Public School

CGPA/Percentage 9.65 96.2% 96.83%

Year 2022-Present 2022 2020

Achievements

JEE Advanced , Secured AIR 608 • JEE Mains, Secured AIR 1193

2022 2022

Experience

– Teaching Assistant

JUL23-NOV23

∗Worked as a TA for the course Discrete Mathematics

Skills & Interests

– Programming Languages: Python, C/C++, Dart, Kotlin, Rust, Javscript – Frameworks: Flutter, ExpressJS, FastAPI, Tensorflow, Selenium – Interests: Android Developement, Backend Developement, Reinforcement Learning, Web Scraping – Operating Systems: Windows, Linux – Soft Skills: Communication, team-work, problem-solving

Projects

– Stac

In [2]:
! pip install PyMuPDF Pillow

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.4-cp38-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.3 (from PyMuPDF)
  Using cached PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.4-cp38-none-manylinux2014_x86_64.whl (3.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.4/3.4 MB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m0m
[?25hUsing cached PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.8 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.4 PyMuPDFb-1.24.3


In [11]:
# STEP 1 
# import libraries 
import fitz 
import io 
import os
from PIL import Image 

# STEP 2 
# file path you want to extract images from 
file = "/home/dikshant/BOSCH/Round1/tata.pdf"

# open the file 
pdf_file = fitz.open(file) 

# STEP 3 
output_dir = './images'
# iterate over PDF pages 
for page_index in range(len(pdf_file)): 

	# get the page itself 
	page = pdf_file[page_index] 
	image_list = page.get_images()

	# printing number of images found in this page 
	if image_list: 
		print( 
			f"[+] Found a total of {len(image_list)} images in page {page_index}") 
	else: 
		print("[!] No images found on page", page_index) 
	for img_index, img in enumerate(page.get_images(), start=1): 

		# get the XREF of the image 
		xref = img[0] 
		# extract the image bytes 
		base_image = pdf_file.extract_image(xref)
		image_bytes = base_image["image"] 
		image = Image.open(io.BytesIO(image_bytes))

    	# save image
		image_filename = f"page{page_index+1}_img{img_index}.png"
		image.save(os.path.join(output_dir, image_filename))
		print(f"[*] Image saved as {image_filename}")
		# get the image extension 


[+] Found a total of 4 images in page 0
[*] Image saved as page1_img1.png
[*] Image saved as page1_img2.png
[*] Image saved as page1_img3.png
[*] Image saved as page1_img4.png
[+] Found a total of 1 images in page 1
[*] Image saved as page2_img1.png
[+] Found a total of 1 images in page 2
[*] Image saved as page3_img1.png
[+] Found a total of 1 images in page 3
[*] Image saved as page4_img1.png
[!] No images found on page 4
[!] No images found on page 5
[!] No images found on page 6
[!] No images found on page 7
[+] Found a total of 2 images in page 8
[*] Image saved as page9_img1.png
[*] Image saved as page9_img2.png
[+] Found a total of 2 images in page 9
[*] Image saved as page10_img1.png
[*] Image saved as page10_img2.png
[+] Found a total of 1 images in page 10
[*] Image saved as page11_img1.png
[+] Found a total of 2 images in page 11
[*] Image saved as page12_img1.png
[*] Image saved as page12_img2.png
[+] Found a total of 1 images in page 12
[*] Image saved as page13_img1.png
[

OSError: broken data stream when reading image file