## Set Up the Environment

In [1]:
%run setup.ipynb

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that loads data from a pre-configured source and returns it as a list of documents.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which yields documents one at a time so that data is brought into memory only as needed.
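The difference between the two methods can be sketched without LangChain at all. The snippet below is only an illustration, not LangChain's implementation: `SimpleDocument` and `LineLoader` are hypothetical stand-ins, and having `load` simply collect whatever `lazy_load` yields is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: a piece of text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each line of a string into one document."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document at a time; nothing is built until it is requested.
        for number, line in enumerate(self.text.splitlines()):
            yield SimpleDocument(line, {"source": self.source, "line": number})

    def load(self) -> List[SimpleDocument]:
        # Eager variant: materialise every document in a list.
        return list(self.lazy_load())


loader = LineLoader("first line\nsecond line\nthird line")

docs = loader.load()               # all documents at once
first = next(loader.lazy_load())   # only the first document is created

print(len(docs))
print(first.page_content, first.metadata)
```

The lazy variant matters most for large sources, where holding every document in memory at once would be wasteful.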

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).


### PDF Loaders

[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your use case and is best found through experimentation.

Here we will see how to load PDF documents into the LangChain `Document` format.

We download a research paper to experiment with.

If the following command fails, you can download the paper manually from http://arxiv.org/pdf/2103.15348.pdf, save it as `layoutparser_paper.pdf`, and upload it using the file upload option in the left-hand panel in Colab.

### PyPDFLoader

Here we load a PDF using `pypdf` into a list of documents, where each document contains the page content and metadata that includes the page number. Typically, each PDF page becomes one document.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../docs/layoutparser_paper.pdf")
pages = loader.load()

print(pages[0].page_content)
print(pages[0].metadata)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 (  ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processin

In [3]:
len(pages)

16

In [5]:
from pprint import pprint

In [6]:
pprint(pages[0])

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'author': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../../docs/layoutparser_paper.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (\x00 ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprima

In [7]:
pprint(pages[0].page_content)

('LayoutParser: A Uniﬁed Toolkit for Deep\n'
 'Learning Based Document Image Analysis\n'
 'Zejiang Shen1 (\x00 ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles '
 'Germain\n'
 'Lee4, Jacob Carlson3, and Weining Li5\n'
 '1 Allen Institute for AI\n'
 'shannons@allenai.org\n'
 '2 Brown University\n'
 'ruochen zhang@brown.edu\n'
 '3 Harvard University\n'
 '{melissadell,jacob carlson}@fas.harvard.edu\n'
 '4 University of Washington\n'
 'bcgl@cs.washington.edu\n'
 '5 University of Waterloo\n'
 'w422li@uwaterloo.ca\n'
 'Abstract. Recent advances in document image analysis (DIA) have been\n'
 'primarily driven by the application of neural networks. Ideally, research\n'
 'outcomes could be easily deployed in production and extended for further\n'
 'investigation. However, various factors like loosely organized codebases\n'
 'and sophisticated model conﬁgurations complicate the easy reuse of im-\n'
 'portant innovations by a wide audience. Though there have been on-going\n'
 'eﬀorts to improve
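The raw page text above still carries typical PDF extraction artefacts: typographic ligatures (`ﬁ`, `ﬀ`), a stray `\x00` character, and words hyphenated across line breaks (`im-\nportant`). A minimal cleanup sketch using only the standard library could look like the following. The helper name `clean_pdf_text` is hypothetical, and the hyphen rule is a heuristic that will also merge genuinely hyphenated words such as `on-\ngoing`.

```python
import re
import unicodedata


def clean_pdf_text(text: str) -> str:
    """Normalise a few common PDF extraction artefacts (hypothetical helper)."""
    # NFKC folds compatibility characters such as the "fi" ligature (U+FB01)
    # into their plain ASCII letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop NUL characters left behind by glyphs that could not be mapped.
    text = text.replace("\x00", "")
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text


# Lines taken from the page content shown above; \ufb01 is the "fi" ligature.
sample = (
    "LayoutParser: A Uni\ufb01ed Toolkit for Deep\n"
    "Zejiang Shen1 (\x00 ), Ruochen Zhang2\n"
    "complicate the easy reuse of im-\n"
    "portant innovations by a wide audience."
)

cleaned = clean_pdf_text(sample)
print(cleaned)
```

In the notebook, the same function could be applied to `pages[0].page_content` before passing the text on to later steps.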

In [8]:
print(pages[0].metadata)

{'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'author': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../../docs/layoutparser_paper.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}
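The metadata is a plain dictionary, so its values can be processed with ordinary Python. As a small sketch, the snippet below parses the ISO-formatted `creationdate` and turns the zero-based `page` index into a human-readable position. The dictionary is a subset of the output above, copied by hand so that the example runs on its own.

```python
from datetime import datetime

# Subset of the metadata printed above, copied by hand for a standalone example.
metadata = {
    "creationdate": "2021-06-22T01:27:10+00:00",
    "total_pages": 16,
    "page": 0,
    "page_label": "1",
}

# The timestamp is ISO 8601 with a UTC offset, which fromisoformat can parse.
created = datetime.fromisoformat(metadata["creationdate"])
print(created.date(), created.utcoffset())

# "page" is zero-based, so add one to match the printed page label.
position = f"page {metadata['page'] + 1} of {metadata['total_pages']}"
print(position)
```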


### PyMuPDFLoader

This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages. Internally, it uses the `pymupdf` library.

In [9]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("../../docs/layoutparser_paper.pdf")
pages = loader.load()

print(pages[0].page_content)
print(pages[0].metadata)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing

In [10]:
len(pages)

16

In [11]:
print(pages[0])

page_content='LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural langu

In [12]:
pages[0].metadata

{'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': '../../docs/layoutparser_paper.pdf', 'file_path': '../../docs/layoutparser_paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'modDate': 'D:20210622012710Z', 'creationDate': 'D:20210622012710Z', 'page': 0}
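To see how the metadata of the two loaders differs, the key sets can be compared directly. The sketch below uses the keys copied by hand from the `PyPDFLoader` and `PyMuPDFLoader` outputs in this notebook; other library versions may report different keys.

```python
# Metadata keys copied by hand from the two outputs shown in this notebook.
pypdf_keys = {
    "producer", "creator", "creationdate", "author", "keywords", "moddate",
    "ptex.fullbanner", "subject", "title", "trapped", "source",
    "total_pages", "page", "page_label",
}
pymupdf_keys = {
    "producer", "creator", "creationdate", "source", "file_path",
    "total_pages", "format", "title", "author", "subject", "keywords",
    "moddate", "trapped", "modDate", "creationDate", "page",
}

# Set difference gives the keys that only one of the loaders reported.
only_pypdf = sorted(pypdf_keys - pymupdf_keys)
only_pymupdf = sorted(pymupdf_keys - pypdf_keys)

print("Only in PyPDFLoader:  ", only_pypdf)
print("Only in PyMuPDFLoader:", only_pymupdf)
```

Both loaders share the core fields (`source`, `page`, `total_pages`), which is usually what downstream code relies on.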

In [13]:
print(pages[0].page_content)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing

### UnstructuredPDFLoader


[Unstructured.io](https://unstructured-io.github.io/unstructured/) supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's [`UnstructuredPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

In [14]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('../../docs/layoutparser_paper.pdf')
data = loader.load()

print(data[0].page_content)
print(data[0].metadata)

1 2 0 2

n u J

1 2

]

V C . s c [

2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a

LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5

1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify d
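The odd fragments at the top of this output come from the vertical arXiv stamp in the page margin, which the parser reads character by character from bottom to top. The sketch below reverses that order to recover the stamp. The helper name `decode_reversed_stamp` is hypothetical, and the blocks are copied by hand from the output above.

```python
def decode_reversed_stamp(blocks):
    """Undo the bottom-to-top reading of a vertical margin stamp (hypothetical helper)."""
    words = []
    # The blocks appear last-to-first, so walk them in reverse.
    for block in reversed(blocks):
        # Each block holds space-separated single characters in reverse order.
        words.append("".join(reversed(block.split())))
    return " ".join(words)


# Text blocks copied from the start of the loader output above.
blocks = [
    "1 2 0 2",
    "n u J",
    "1 2",
    "]",
    "V C . s c [",
    "2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a",
]

stamp = decode_reversed_stamp(blocks)
print(stamp)
```

In practice such margin text is usually noise for downstream use, so it may be simpler to filter it out than to decode it.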

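Next, we load the PDF with more complex parsing: table detection and chunking by sections.

This kind of parsing depends on system packages (poppler and Tesseract OCR in the cell below). For installing them on a Databricks cluster, refer to https://community.databricks.com/t5/data-engineering/trying-to-use-pdf2image-on-databricks/td-p/12914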


In [None]:
# #install poppler on the cluster (should be done by init scripts)
# def install_ocr_on_nodes():
#     """
#     install poppler on the cluster (should be done by init scripts)
#     """
#     # from pyspark.sql import SparkSession
#     import subprocess
#     num_workers = max(1,int(spark.conf.get("spark.databricks.clusterUsageTags.clusterWorkers")))
#     command = "sudo rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* && sudo apt-get clean && sudo apt-get update && sudo apt-get install poppler-utils tesseract-ocr -y" 
#     def run_subprocess(command):
#         try:
#             output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
#             return output.decode()
#         except subprocess.CalledProcessError as e:
#             raise Exception("An error occurred installing OCR libs:"+ e.output.decode())
#     #install on the driver
#     run_subprocess(command)
#     def run_command(iterator):
#         for x in iterator:
#             yield run_subprocess(command)
#     # spark = SparkSession.builder.getOrCreate()
#     data = spark.sparkContext.parallelize(range(num_workers), num_workers) 
#     # Use mapPartitions to run command in each partition (worker)
#     output = data.mapPartitions(run_command)
#     try:
#         output.collect()
#         print("OCR libraries installed")
#     except Exception as e:
#         print(f"Couldn't install on all nodes: {e}")
#         raise

In [19]:
# install_ocr_on_nodes()

In [1]:
# takes 3-4 mins on Colab
# re-import the loader so this cell also runs after a kernel restart
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('../../docs/layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=3800, # preferred size of chunks
                               combine_text_under_n_chars=2000, # smaller chunks < 2000 chars will be combined into a larger chunk
                               mode='elements')
data = loader.load()

In [17]:
data

[Document(metadata={'source': '../../docs/layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2025-05-30T10:16:46', 'page_number': 1, 'orig_elements': 'eJzNWm2P2zYS/iu8/dQeTJWkSEraL5c0Aa7ppXdFm16LpkHAl5GtRJYMvexmG/S/35CSnHW9zcEBHPhDEM+IQ3KeeZf25fsrqGELzfC68lfX5Mo6mZemAFoImVLJXUlzXnrKOTCW8UynQlytyNUWBuPNYFDm/ZVr285XjRmgj3Rt7tpxeL2Bar0ZkCMEYygzs28rP2yQy7PI3bVVMwS5ly+lSuSKKCUT+WpFZpKzVCQ80JyxpHiAMQkg46q/6wfYBj2+r95B/ePOOLj6Ax94GMANVdu8drXp+9e7rrW4jCVKS81xQVnVMNztIMp+/91VvG6zHs066vTyCpr11avI7YfX29ZXZQURMcGEokzRlL3g7Jrra6mD9A4lXzfj1kIXdA2XGOBdQOOKE0EY/mvISL4lgXpF/kuekIT0QXK5xjdgPAqj5J+NVBg0jpSaWu8llR6AWps76jOhlNU5y4w5s5E4E2ki7ltJZklxYKUjRpT4qJk+pxXcfax/ahwCs2676nfwL8KKB2D3zgijMwwL8IpKqxW1KWBsiNQyyVJl2PlhX1BdaK2nYNmjfMSIEhcD+8uTYWeC5xoyQ53SlkqmHS1KxakT2luNOpQCzg57IQ+8nQt+6NxHjEniYmAXJ8Murcstl44KrdDbC3R0hLegplSp0NzZVJ0PdsUSFkFUCQuozrRS6UTn+YP0tP5yQMdSlXCVyvxGENP9Ut2cbAXOGXfeMpqaAhOPsJjqjTGUq8yJQoDiOj+38+99e6FlkWQHzn/EiBIXY4fTYS/KVBUKD/Ei11RmnlObsxx/

In [20]:
len(data)

16

In [21]:
[doc.metadata['category'] for doc in data]

['CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement']
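With longer documents, a tally is easier to read than the full list. The sketch below counts categories with `collections.Counter`. The `metadatas` list is a hand-built stand-in that mirrors the output above (16 chunks, all `CompositeElement`), and `count_categories` is a hypothetical helper; in the notebook you would pass `[doc.metadata for doc in data]` instead.

```python
from collections import Counter


def count_categories(metadatas):
    """Tally the 'category' field across metadata dicts (hypothetical helper)."""
    return Counter(m.get("category", "unknown") for m in metadatas)


# Stand-in for [doc.metadata for doc in data], mirroring the output above.
metadatas = [{"category": "CompositeElement"} for _ in range(16)]

counts = count_categories(metadatas)
print(counts.most_common())
```

Here every chunk is a `CompositeElement`, which is consistent with the `by_title` chunking strategy merging the individual elements into section-level chunks.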

In [22]:
pprint(data[0])

Document(metadata={'source': '../../docs/layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2025-05-30T10:16:46', 'page_number': 1, 'orig_elements': 'eJzNWm2P2zYS/iu8/dQeTJWkSEraL5c0Aa7ppXdFm16LpkHAl5GtRJYMvexmG/S/35CSnHW9zcEBHPhDEM+IQ3KeeZf25fsrqGELzfC68lfX5Mo6mZemAFoImVLJXUlzXnrKOTCW8UynQlytyNUWBuPNYFDm/ZVr285XjRmgj3Rt7tpxeL2Bar0ZkCMEYygzs28rP2yQy7PI3bVVMwS5ly+lSuSKKCUT+WpFZpKzVCQ80JyxpHiAMQkg46q/6wfYBj2+r95B/ePOOLj6Ax94GMANVdu8drXp+9e7rrW4jCVKS81xQVnVMNztIMp+/91VvG6zHs066vTyCpr11avI7YfX29ZXZQURMcGEokzRlL3g7Jrra6mD9A4lXzfj1kIXdA2XGOBdQOOKE0EY/mvISL4lgXpF/kuekIT0QXK5xjdgPAqj5J+NVBg0jpSaWu8llR6AWps76jOhlNU5y4w5s5E4E2ki7ltJZklxYKUjRpT4qJk+pxXcfax/ahwCs2676nfwL8KKB2D3zgijMwwL8IpKqxW1KWBsiNQyyVJl2PlhX1BdaK2nYNmjfMSIEhcD+8uTYWeC5xoyQ53SlkqmHS1KxakT2luNOpQCzg57IQ+8nQt+6NxHjEniYmAXJ8Murcstl44KrdDbC3R0hLegplSp0NzZVJ0PdsUSFkFUCQuozrRS6UTn+YP0tP5yQMdSlXCVyvxGENP9Ut2cbAXOGXfeMpqaAhOPsJjqjTGUq8yJQoDiOj+38+99e6FlkWQHzn/EiBIXY4fTYS/KVBUKD/Ei11RmnlObsxx/l

In [23]:
print(data[0].page_content)

1 2 0 2 n u J 1 2 ] V C . s

c

[

2

2103.15348v2 arXiv

v

8

4

3

5

1

.

3

0

1

2

:

v

i

X

r

a

LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson’, and Weining Li®

1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to impro

In [27]:
print(data[5].metadata)

{'source': '../../docs/layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2025-05-30T10:16:46', 'page_number': 5, 'orig_elements': 'eJzNWW1v3LgR/ivEFi2S1lIoinrzXYpekwZn4BK4ifuljrGgSGqXtVbaEyXb2+T+e2dIyd61N2liYI2DEccacUjOPDPDZ6jzTzNd65Vu+rlRs2MyS3Ulq1TTQCj4xRlNglLwKtBpVKSppHlOs9kRma10L5ToBeh8msm27ZRpRK+te67Fph36+VKbxbIHCWOUgs4ovjaqX4I0ypx03ZqmR73z8zjjITsiEU3isLg4IneCPApTFOT0SwKnApKZ3dher9CWU3Oj6w9rIfXsN3ihdK9lb9pmLmth7XzdtSUMo2FBCw7vK1PrfrPWTvX07cztuFkMYuHMOp/pZjG7cFLbz1etMpXRzmmMsiQAT8X0LKLHUXrMU9Reg+a8GVal7mBUgnvo9Q06ZBaHEfnFuYO8nrZF3rZK1xY1p22cmb52m78Pk1aCqlzSQJdcBFwpFZRZFgc5lzItATFepQeEKQ5z8Dk4Pcw9TF4QF0XIUBDFnIbxXolXeiRQPI4OAVRtmkun+Wlme9GBkxulb0AQ5/EWaCzHwUNX44M0vQ473TAaJRWsAxg7nHYn4HR7gojdn2CpQT9bCXu5TzvJ8+2YSe5rXw+gXXh3dW3DZr9dfD3sTpox6k5FZ3V3RATxaJMVxh7pxaW2IFStHDDaiFnBZESArCGmWcNA0Siy0I3uMIxQ39ietBXpYA8IQi06UrY38K5qO9IvNUzaLXRPZNv0OGWnFwCtDclr83GoSkp1h+Kqa1ek74QyiLyoCYTuslX2iBjUqQ3MCDmitF7jVFdtPYwDGz107r/+uu0uLYGdLTUuDXteiWYQdb0hcsA

In [25]:
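from IPython.display import HTML

# 'text_as_html' is not present on every chunk (it is missing for this one), so use .get() to avoid a KeyError
HTML(data[5].metadata.get('text_as_html', '<i>No text_as_html in this chunk</i>'))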

Load using raw unstructured.io APIs for PDFs

In [None]:
from unstructured.partition.pdf import partition_pdf

# Get elements - takes 3-4 mins
raw_pdf_elements = partition_pdf(
    # Same file as in the loader cells above
    filename="../../docs/layoutparser_paper.pdf",
    strategy='hi_res',
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use a layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles mark the start of a sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the titles
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # hard limit of 4000 chars per chunk, start a new chunk after 3800 chars,
    # and combine chunks smaller than 2000 chars into a larger chunk
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="./",
)

In [None]:
len(raw_pdf_elements)

In [None]:
raw_pdf_elements

In [None]:
raw_pdf_elements[5].to_dict()

Convert into LangChain `Document` format

In [None]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in raw_pdf_elements]
lc_docs[5]