# Exploring Document Loaders in LangChain

## Install LangChain and OpenAI dependencies

In [5]:
!pip install langchain==0.3.11
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11

Collecting langchain==0.3.11
  Downloading langchain-0.3.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langsmith<0.3,>=0.1.17 (from langchain==0.3.11)
  Downloading langsmith-0.2.11-py3-none-any.whl.metadata (14 kB)
Collecting numpy<2,>=1.22.4 (from langchain==0.3.11)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading langchain-0.3.11-py3-none-any.whl (1.0 MB)
Downloading langsmith-0.2.11-py3-none-any.whl (326 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)

In [50]:
# takes 2 - 5 mins to install on Colab
!pip install "unstructured[all-docs]==0.14.0"



In [1]:
# install OCR dependencies for unstructured
!sudo apt-get install -y tesseract-ocr
!sudo apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 186 kB of archives.
After this operation, 697 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.8 [186 kB]
Fetched 186 kB in 48s (3,888 B/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline


In [1]:
!pip install jq==1.7.0
!pip install pypdf==4.2.0
!pip install pymupdf==1.24.4



## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that enables the loading of data as documents from a pre-configured source.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which allows data to be loaded into memory gradually as needed.

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
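
The difference between eager and lazy loading can be sketched in plain Python. The `SimpleDocument` and `LineLoader` classes below are hypothetical stand-ins written for illustration, not LangChain's own classes; LangChain loaders expose the same idea through their `load()` and `lazy_load()` methods.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each line of a text file into a document."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document at a time, so only one line is held in memory.
        with open(self.file_path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                yield SimpleDocument(
                    page_content=line.rstrip("\n"),
                    metadata={"source": self.file_path, "line": line_number},
                )

    def load(self) -> list[SimpleDocument]:
        # Eager variant: materialize every document in a list.
        return list(self.lazy_load())


# Write a small sample file to load.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

loader = LineLoader(sample_path)
all_docs = loader.load()          # list holding all three documents

lazy_iter = loader.lazy_load()
first_doc = next(lazy_iter)       # only the first line has been read so far
lazy_iter.close()                 # release the file handle before cleanup

os.remove(sample_path)
print(len(all_docs), first_doc.page_content)
```

Lazy loading is mainly useful when the source is too large to hold in memory at once, since each document can be processed and discarded before the next one is read.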


### Text Loader

The simplest loader reads in a file as text and places it all into one document.



In [3]:
!curl -o README.md https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md



In [2]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./README.md")
doc = loader.load()

In [3]:
len(doc)

1

In [4]:
type(doc[0])

In [5]:
print(doc[0].page_content[:100])

<picture>
  <source media="(prefers-color-scheme: light)" srcset="docs/static/img/logo-dark.svg">
  


### Markdown Loader

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

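This section shows how to load Markdown documents into LangChain `Document` objects that we can use in our pipelines and chains. We first load the whole file as a single document (`mode='single'`), then split it into elements (`mode="elements"`).

Download the NLTK packages if needed: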

In [6]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [7]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./README.md", mode='single')
docs = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [8]:
len(docs)

1

In [9]:
type(docs[0])

In [10]:
print(docs[0].page_content[:100])

[!NOTE]
Looking for the JS/TS library? Check out LangChain.js.

LangChain is a framework for buildin


In [11]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./README.md", mode="elements")
docs = loader.load()

In [12]:
len(docs)

18

In [13]:
docs[:10]

[Document(metadata={'source': './README.md', 'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md', 'category': 'NarrativeText', 'element_id': 'ada4984f55e3bfe7057f8abd1b24a809'}, page_content='[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.'),
 Document(metadata={'source': './README.md', 'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md', 'category': 'NarrativeText', 'element_id': 'd9dd2676fd3ef14eba932c9da5c5636b'}, page_content='LangChain is a framework for building LLM-powered applications. It helps you chain\ntogether interoperable components and third-party integrations to simplify AI\napplication development —  all while future-proofing decisions as the underlying\ntechnology evolves.'),
 Document(metadata={'source': './README.md', 'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'f

In [14]:
from collections import Counter
Counter([doc.metadata['category'] for doc in docs])

Counter({'NarrativeText': 7, 'Title': 4, 'ListItem': 7})

Comparing Unstructured.io's own loader (`partition_md`) with the LangChain wrapper API:

In [15]:
from unstructured.partition.md import partition_md

docs = partition_md(filename="./README.md")

In [16]:
len(docs)

18

In [17]:
docs[:10]

[<unstructured.documents.elements.NarrativeText at 0x78bc92b9c790>,
 <unstructured.documents.elements.NarrativeText at 0x78bc8c57a310>,
 <unstructured.documents.elements.Title at 0x78bc8c579ed0>,
 <unstructured.documents.elements.NarrativeText at 0x78bc8c57ab90>,
 <unstructured.documents.elements.Title at 0x78bc8c57ac50>,
 <unstructured.documents.elements.NarrativeText at 0x78bc8c57add0>,
 <unstructured.documents.elements.NarrativeText at 0x78bc8c57ae50>,
 <unstructured.documents.elements.Title at 0x78bc8c57af10>,
 <unstructured.documents.elements.NarrativeText at 0x78bc8c57afd0>,
 <unstructured.documents.elements.NarrativeText at 0x78bc8c57b050>]

In [18]:
docs[0].to_dict()

{'type': 'NarrativeText',
 'element_id': 'ada4984f55e3bfe7057f8abd1b24a809',
 'text': '[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.',
 'metadata': {'last_modified': '2025-05-29T11:31:12',
  'languages': ['eng'],
  'filetype': 'text/markdown',
  'file_directory': '.',
  'filename': 'README.md'}}

In [19]:
docs[1].to_dict()

{'type': 'NarrativeText',
 'element_id': 'd9dd2676fd3ef14eba932c9da5c5636b',
 'text': 'LangChain is a framework for building LLM-powered applications. It helps you chain\ntogether interoperable components and third-party integrations to simplify AI\napplication development —  all while future-proofing decisions as the underlying\ntechnology evolves.',
 'metadata': {'last_modified': '2025-05-29T11:31:12',
  'languages': ['eng'],
  'filetype': 'text/markdown',
  'file_directory': '.',
  'filename': 'README.md'}}

In [20]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in docs]
lc_docs[:10]

[Document(metadata={'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md'}, page_content='[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.'),
 Document(metadata={'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md'}, page_content='LangChain is a framework for building LLM-powered applications. It helps you chain\ntogether interoperable components and third-party integrations to simplify AI\napplication development —  all while future-proofing decisions as the underlying\ntechnology evolves.'),
 Document(metadata={'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md'}, page_content='bash\npip install -U langchain'),
 Document(metadata={'last_modified': '2025-05-29T11:31:12', 'languages': ['eng'], 'parent_id': 'd096b8fd4a734
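
Note that the manual conversion above drops the `source`, `category` and `element_id` fields that the LangChain wrapper includes in its metadata. A minimal sketch of carrying them over, using the dictionary shape printed by `to_dict()` earlier. The helper name `element_dict_to_fields` is an assumption made for illustration, and plain dictionaries stand in for `Document` objects.

```python
def element_dict_to_fields(element: dict, source: str) -> dict:
    """Build page_content and metadata from an element's to_dict() output."""
    metadata = dict(element["metadata"])  # copy so the input is not mutated
    metadata["source"] = source
    metadata["category"] = element["type"]
    metadata["element_id"] = element["element_id"]
    return {"page_content": element["text"], "metadata": metadata}


# Sample input copied from the docs[0].to_dict() output shown earlier.
element = {
    "type": "NarrativeText",
    "element_id": "ada4984f55e3bfe7057f8abd1b24a809",
    "text": "[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.",
    "metadata": {
        "last_modified": "2025-05-29T11:31:12",
        "languages": ["eng"],
        "filetype": "text/markdown",
        "file_directory": ".",
        "filename": "README.md",
    },
}

fields = element_dict_to_fields(element, source="./README.md")
print(fields["metadata"]["category"])
```

The resulting `page_content` and `metadata` values could then be passed to `Document(page_content=..., metadata=...)` in place of the simpler conversion.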

### CSV Loader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

LangChain implements a CSV Loader that will load CSV files into a sequence of `Document` objects. Each row of the CSV file is converted to one document.
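
The row-to-document mapping can be illustrated with the standard library alone. This is a simplified sketch of the idea, not `CSVLoader`'s actual implementation; the function name `rows_to_documents` is hypothetical and dictionaries stand in for `Document` objects.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into one document-like dict of 'column: value' lines."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # Each row becomes a block of "column: value" lines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Small illustrative sample; the source path is only a label here.
sample_csv = "Property_ID,City,Bedrooms\n101,Springfield,3\n102,Rivertown,2\n"
documents = rows_to_documents(sample_csv, source="./data.csv")
print(documents[0]["page_content"])
```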

In [21]:
import pandas as pd

# Create a DataFrame with some dummy real estate data
data = {
    'Property_ID': [101, 102, 103, 104, 105],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '321 Maple St', '654 Cedar St'],
    'City': ['Springfield', 'Rivertown', 'Laketown', 'Hillside', 'Sunnyvale'],
    'State': ['CA', 'TX', 'FL', 'NY', 'CO'],
    'Zip_Code': [98765, 87654, 76543, 65432, 54321],
    'Bedrooms': [3, 2, 4, 3, 5],
    'Bathrooms': [2, 1, 3, 2, 4],
    'Listing_Price': [500000, 350000, 600000, 475000, 750000]
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

In [22]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="./data.csv")
docs = loader.load()

In [23]:
docs

[Document(metadata={'source': './data.csv', 'row': 0}, page_content='Property_ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip_Code: 98765\nBedrooms: 3\nBathrooms: 2\nListing_Price: 500000'),
 Document(metadata={'source': './data.csv', 'row': 1}, page_content='Property_ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip_Code: 87654\nBedrooms: 2\nBathrooms: 1\nListing_Price: 350000'),
 Document(metadata={'source': './data.csv', 'row': 2}, page_content='Property_ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip_Code: 76543\nBedrooms: 4\nBathrooms: 3\nListing_Price: 600000'),
 Document(metadata={'source': './data.csv', 'row': 3}, page_content='Property_ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\nZip_Code: 65432\nBedrooms: 3\nBathrooms: 2\nListing_Price: 475000'),
 Document(metadata={'source': './data.csv', 'row': 4}, page_content='Property_ID: 105\nAddress: 654 Cedar St\nCity: Sunnyvale\nState: CO\nZip_Code: 54321\nBedrooms: 5\nBathrooms

In [24]:
docs[0]

Document(metadata={'source': './data.csv', 'row': 0}, page_content='Property_ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip_Code: 98765\nBedrooms: 3\nBathrooms: 2\nListing_Price: 500000')

In [25]:
print(docs[0].page_content)

Property_ID: 101
Address: 123 Elm St
City: Springfield
State: CA
Zip_Code: 98765
Bedrooms: 3
Bathrooms: 2
Listing_Price: 500000


`CSVLoader` accepts a `csv_args` keyword argument whose entries are passed on to Python's `csv.DictReader`.

In [26]:
loader = CSVLoader(file_path="./data.csv",
                   csv_args={
                      "delimiter": ",",
                      "quotechar": '"',
                      "fieldnames": ["Property ID", "Address", "City", "State",
                                     "Zip Code", "Bedrooms", "Bathrooms", "Price"],
                   },
                  )
docs = loader.load()

In [27]:
docs

[Document(metadata={'source': './data.csv', 'row': 0}, page_content='Property ID: Property_ID\nAddress: Address\nCity: City\nState: State\nZip Code: Zip_Code\nBedrooms: Bedrooms\nBathrooms: Bathrooms\nPrice: Listing_Price'),
 Document(metadata={'source': './data.csv', 'row': 1}, page_content='Property ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip Code: 98765\nBedrooms: 3\nBathrooms: 2\nPrice: 500000'),
 Document(metadata={'source': './data.csv', 'row': 2}, page_content='Property ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip Code: 87654\nBedrooms: 2\nBathrooms: 1\nPrice: 350000'),
 Document(metadata={'source': './data.csv', 'row': 3}, page_content='Property ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip Code: 76543\nBedrooms: 4\nBathrooms: 3\nPrice: 600000'),
 Document(metadata={'source': './data.csv', 'row': 4}, page_content='Property ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\nZip Code: 65432\nBedrooms: 3\nBathrooms: 2\nP
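
In the output above, row 0 holds the original header names as values. This happens because `csv.DictReader` treats the first line of the file as data when `fieldnames` is given. A minimal standard-library sketch reproducing the effect, together with one possible way to skip the header (the sample text and variable names are illustrative):

```python
import csv
import io

sample_csv = "Property_ID,City\n101,Springfield\n102,Rivertown\n"
custom_names = ["Property ID", "City"]

# With explicit fieldnames, the header line is returned as the first data row.
rows_with_header = list(
    csv.DictReader(io.StringIO(sample_csv), fieldnames=custom_names)
)

# One way to drop it: advance the file object past the header line first.
handle = io.StringIO(sample_csv)
next(handle)  # skip the original header line
rows_without_header = list(csv.DictReader(handle, fieldnames=custom_names))

print(rows_with_header[0])
print(rows_without_header[0])
```

With `CSVLoader`, a comparable result can likely be obtained by discarding the first returned document (`docs[1:]`), keeping in mind that the `row` metadata of the remaining documents then starts at 1.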

`UnstructuredCSVLoader`, by contrast, loads the entire CSV file as a single table in one document:

In [28]:
from langchain_community.document_loaders import UnstructuredCSVLoader

loader = UnstructuredCSVLoader("./data.csv")
docs = loader.load()

In [29]:
len(docs)

1

In [30]:
docs[0]

Document(metadata={'source': './data.csv'}, page_content='\n\n\nProperty_ID\nAddress\nCity\nState\nZip_Code\nBedrooms\nBathrooms\nListing_Price\n\n\n101\n123 Elm St\nSpringfield\nCA\n98765\n3\n2\n500000\n\n\n102\n456 Oak St\nRivertown\nTX\n87654\n2\n1\n350000\n\n\n103\n789 Pine St\nLaketown\nFL\n76543\n4\n3\n600000\n\n\n104\n321 Maple St\nHillside\nNY\n65432\n3\n2\n475000\n\n\n105\n654 Cedar St\nSunnyvale\nCO\n54321\n5\n4\n750000\n\n\n')

### JSON Loader

[JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

[JSON Lines](https://jsonlines.org/) is a file format where each line is a valid JSON value.

LangChain implements a [JSONLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html) to convert JSON and JSONL data into LangChain `Document` objects. It uses a specified [`jq` schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

It relies on the `jq` Python package, which we installed above.

In [31]:
import json

# Sample chat-export data used to demonstrate the JSON loader
data = {
    'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_meeting.jpg'},
    'is_still_participant': True,
    'joinable_mode': {'link': '', 'mode': 1},
    'magic_words': [],
    'messages': [
        {'content': 'See you soon!',
         'sender_name': 'User B',
         'timestamp_ms': 1675597571851},
        {'content': 'Thanks for the update! See you then.',
         'sender_name': 'User A',
         'timestamp_ms': 1675597435669},
        {'content': 'Actually, the green one is sold out.',
         'sender_name': 'User B',
         'timestamp_ms': 1675596277579},
        {'content': 'I was hoping to purchase the green one!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595140251},
        {'content': 'I’m really interested in the green one, not the red!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595109305},
        {'content': 'Here’s the $150 for it.',
         'sender_name': 'User B',
         'timestamp_ms': 1675595068468},
        {'photos': [{'creation_timestamp': 1675595059,
                     'uri': 'image_of_the_item.jpg'}],
         'sender_name': 'User B',
         'timestamp_ms': 1675595060730},
        {'content': 'It typically sells for at least $200 online',
         'sender_name': 'User B',
         'timestamp_ms': 1675595045152},
        {'content': 'How much are you asking?',
         'sender_name': 'User A',
         'timestamp_ms': 1675594799696},
        {'content': 'Good morning! $50 is far too low.',
         'sender_name': 'User B',
         'timestamp_ms': 1675577876645},
        {'content': 'Hello! I’m interested in the item you posted. I can offer $50. Let me know if that works for you. Thanks!',
         'sender_name': 'User A',
         'timestamp_ms': 1675549022673}
    ],
    'participants': [{'name': 'User A'}, {'name': 'User B'}],
    'thread_path': 'inbox/User A and User B chat',
    'title': 'User A and User B chat'
}

# Save the data to a JSON file
with open('chat_data.json', 'w') as file:
    json.dump(data, file, indent=4)


To load the full data as a single document

In [32]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.',
    text_content=False)

data = loader.load()

In [33]:
len(data)

1

In [34]:
data

[Document(metadata={'source': '/content/chat_data.json', 'seq_num': 1}, page_content='{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos": [{"creation_timestamp": 1675595059, "uri": "image_of_the_item.jpg"}],

Suppose we are interested in extracting the values under the `messages` key of the JSON data. The `jq` schema `.messages[]` yields one document per message:

In [35]:
loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.messages[]',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/content/chat_data.json', 'seq_num': 1}, page_content='{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 2}, page_content='{"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 3}, page_content='{"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 4}, page_content='{"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 5}, page_content='{"content": "I\\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}'),
 Document(met

Suppose we are interested in extracting only the `content` field of each entry under the `messages` key. Messages without a `content` field, such as the photo message, produce a document with an empty `page_content`:

In [36]:
loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/content/chat_data.json', 'seq_num': 1}, page_content='See you soon!'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 2}, page_content='Thanks for the update! See you then.'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 3}, page_content='Actually, the green one is sold out.'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 4}, page_content='I was hoping to purchase the green one!'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 5}, page_content='I’m really interested in the green one, not the red!'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 6}, page_content='Here’s the $150 for it.'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 7}, page_content=''),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 8}, page_content='It typically sells for at least $200 online'),
 Document(metadata={'source': '/conten
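
For readers unfamiliar with `jq`, the two schemas used above correspond roughly to the following plain-Python traversals. This is only an analogy for what the expressions select, not how `JSONLoader` is implemented, and the trimmed sample data is illustrative.

```python
import json

chat = {
    "messages": [
        {"content": "See you soon!", "sender_name": "User B"},
        {"photos": [{"uri": "image_of_the_item.jpg"}], "sender_name": "User B"},
        {"content": "How much are you asking?", "sender_name": "User A"},
    ]
}

# Rough equivalent of the jq schema '.messages[]': one item per message object,
# serialized back to a JSON string.
messages = [json.dumps(message) for message in chat["messages"]]

# Rough equivalent of '.messages[].content': a missing key gives None
# (jq gives null), which we map to an empty string here.
contents = [message.get("content") for message in chat["messages"]]
page_contents = ["" if content is None else content for content in contents]

print(page_contents)
```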

### PDF Loaders

[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your use case and is best found through experimentation.

Here we will see how to load PDF documents into the LangChain `Document` format.

We download a research paper to experiment with.

If the following command fails, you can download the paper manually from http://arxiv.org/pdf/2103.15348.pdf, save it as `layoutparser_paper.pdf`, and upload it using the file upload option in the left-hand panel of Colab.

In [37]:
!wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

--2025-05-29 11:52:20--  http://arxiv.org/pdf/2103.15348.pdf
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.67.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2103.15348 [following]
--2025-05-29 11:52:20--  http://arxiv.org/pdf/2103.15348
Reusing existing connection to arxiv.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 4686220 (4.5M) [application/pdf]
Saving to: ‘layoutparser_paper.pdf’


2025-05-29 11:52:20 (53.3 MB/s) - ‘layoutparser_paper.pdf’ saved [4686220/4686220]



#### PyPDFLoader

Here we load a PDF using `pypdf` into a list of documents, where each document contains the page content and metadata that includes the page number. Typically, each PDF page becomes one document.

In [38]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./layoutparser_paper.pdf")
pages = loader.load()



In [39]:
len(pages)

16

In [40]:
pages[0]

Document(metadata={'source': './layoutparser_paper.pdf', 'page': 0}, page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusab

In [41]:
print(pages[0].page_content)

LayoutParser : A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1(  ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
shannons@allenai.org
2Brown University
ruochen zhang@brown.edu
3Harvard University
{melissadell,jacob carlson }@fas.harvard.edu
4University of Washington
bcgl@cs.washington.edu
5University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing an

In [42]:
print(pages[4].page_content)

LayoutParser : A Uniﬁed Toolkit for DL-Based DIA 5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset Base Model1Large Model Notes
PubLayNet [38] F / M M Layouts of modern scientiﬁc documents
PRImA [3] M - Layouts of scanned modern magazines and scientiﬁc reports
Newspaper [17] F - Layouts of scanned US newspapers from the 20th century
TableBank [18] F F Table region on modern scientiﬁc and business document
HJDataset [31] F / M - Layouts of history Japanese documents
1For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀ between accuracy
vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101
backbones [ 13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [ 28] (F) and Mask
R-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained
using the ResNet 101 backbone. The platform is maintained 

#### PyMuPDFLoader

This is the fastest of the PDF parsing options. It returns one document per page, each with detailed metadata about the PDF and its pages. It uses the `pymupdf` library internally.

In [30]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./layoutparser_paper.pdf")
pages = loader.load()

In [44]:
len(pages)

16

In [45]:
pages[0]

Document(metadata={'source': './layoutparser_paper.pdf', 'file_path': './layoutparser_paper.pdf', 'page': 0, 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210622012710Z', 'modDate': 'D:20210622012710Z', 'trapped': ''}, page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily de

In [46]:
pages[0].metadata

{'source': './layoutparser_paper.pdf',
 'file_path': './layoutparser_paper.pdf',
 'page': 0,
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.21',
 'creationDate': 'D:20210622012710Z',
 'modDate': 'D:20210622012710Z',
 'trapped': ''}
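
The `creationDate` and `modDate` values use the PDF date string format (`D:YYYYMMDDHHmmSS` followed by a timezone marker). A minimal sketch of converting such a value to a `datetime`, assuming the simple UTC form with a trailing `Z` seen in the metadata above. The helper name `parse_pdf_date` is hypothetical, and other offset forms are deliberately not handled.

```python
from datetime import datetime, timezone


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date such as 'D:20210622012710Z' (UTC form only)."""
    if not (value.startswith("D:") and value.endswith("Z")):
        raise ValueError(f"unsupported PDF date format: {value!r}")
    timestamp = value[2:-1]  # strip the 'D:' prefix and the 'Z' suffix
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Value copied from the metadata output above.
created = parse_pdf_date("D:20210622012710Z")
print(created.isoformat())
```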

In [47]:
print(pages[0].page_content)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing

In [48]:
print(pages[4].page_content)

LayoutParser: A Uniﬁed Toolkit for DL-Based DIA
5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset
Base Model1 Large Model
Notes
PubLayNet [38]
F / M
M
Layouts of modern scientiﬁc documents
PRImA [3]
M
-
Layouts of scanned modern magazines and scientiﬁc reports
Newspaper [17]
F
-
Layouts of scanned US newspapers from the 20th century
TableBank [18]
F
F
Table region on modern scientiﬁc and business document
HJDataset [31]
F / M
-
Layouts of history Japanese documents
1 For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀbetween accuracy
vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101
backbones [13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [28] (F) and Mask
R-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained
using the ResNet 101 backbone. The platform is maintained and

#### UnstructuredPDFLoader

[Unstructured.io](https://unstructured-io.github.io/unstructured/) supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's [`UnstructuredPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

Load PDF as a single document - no complex parsing

In [55]:
# Install pdfminer.six with a compatible version
!pip install pdfminer.six==20221105

Collecting pdfminer.six==20221105
  Downloading pdfminer.six-20221105-py3-none-any.whl.metadata (4.0 kB)
Downloading pdfminer.six-20221105-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
  Attempting uninstall: pdfminer.six
    Found existing installation: pdfminer.six 20250327
    Uninstalling pdfminer.six-20250327:
      Successfully uninstalled pdfminer.six-20250327
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.6 requires pdfminer.six==20250327, but you have pdfminer-six 20221105 which is incompatible.
Successfully installed pdfminer.six-20221105


In [1]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('./layoutparser_paper.pdf')
data = loader.load()

  from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4


In [2]:
len(data)

1

In [3]:
print(data[0].page_content[:1000])

1 2 0 2

n u J

1 2

]

V C . s c [

2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a

LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5

1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify d

Load PDF with complex parsing, table detection and chunking by sections

In [4]:
# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('./layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=3800, # preferred size of chunks
                               combine_text_under_n_chars=2000, # smaller chunks < 2000 chars will be combined into a larger chunk
                               mode='elements')
data = loader.load()

yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/115M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/46.8M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
len(data)

18

In [6]:
[doc.metadata['category'] for doc in data]

['CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'Table',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'Table',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement']
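
A quick way to summarize a category list like the one above is to count the categories and note where the tables sit. The sketch below uses a hand-built stand-in list that mirrors the 18-item output (it is not the real `data` object); with the real loader output you would build the list from `doc.metadata['category']` instead.

```python
from collections import Counter

# Stand-in for `[doc.metadata['category'] for doc in data]`; the order mirrors
# the 18-item output above (Table chunks at positions 5 and 9).
categories = ["CompositeElement"] * 18
categories[5] = "Table"
categories[9] = "Table"

# Count how many chunks fall into each category
category_counts = Counter(categories)

# Record the positions of the table chunks so they can be inspected later
table_positions = [i for i, c in enumerate(categories) if c == "Table"]

print(category_counts)
print(table_positions)
```

The positions can then be used to index into `data`, as the cells below do with `data[5]`.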

In [7]:
data[0]

Document(metadata={'source': './layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2023-01-23T09:15:33', 'page_number': 1, 'orig_elements': 'eJzNWG1vFDkS/iu++QSn6d5+f8mXI4B0sJu9Q5C9PS2LIrddM2Pobrdsd8KA9r/fY/ckDIRdCaRE9yUTl12ul+epKs+8/riingYa3YWSqxO2EjwRXdOIKE2LPCo2XR61CdWRqETeZUnX1kWxWrPVQI5L7jh0Pq6E1kaqkTuyYd3zvZ7dxY7UducgybIkgc5BfKWk20Ga1kE6aTU6r/f6dVHGxZqVbRbXb9bssKyTPG78Mk2SuL29Xo5DsLJ762jwUbxQ76l/NXFBqz+wsVE9uf1EYevFz6vgy7id+TY4/HpF43b1Jkituxi0VBtFIR1ZkuVRkkZZfp60J2l5kudee4LmxTgPHRkfiLfh6L0PdZWyjCUs86euTf4yCqRmq436QPLcn4PCl4lPsqZsaJNHoizKqCjLImrbkqK6S7NM1HXWNN0dJ76qyrj8lHgQoIzbo0zfEiwKf5l6SY6EU3q8EEiuvZiM7nAsicsiyat7xmZkM/uReYTesP+wJyxmlgmYOMLqGXEJxa8AVHSbmrKmiyoObIqu6KIuLbuoIVQFz8uultUdA5QmbRFnRwiVZRlXxwjdEiwa/zfVkbFL1rCC5awEDjE+k4DHCeSK/ZcZxr+5coSUskxaGYmsllHRNGhZDeEP7/IWXSsps83dAYOSAI3ztI7zAMyyLpI2TgIOWV3GyVcEi8b31k5R1/eM3FnIxAtuLJkTdsp+GdXv86ZLUpLsXOv+nXJsow17SjSxM+JmVOOWPeYW+0+1mD1g7PkAA+x05P3eKnuM9LlyPX0N3U7yksuqiiiRhL5YVBHvkjraiF

In [8]:
print(data[0].page_content)

1 2 0 2

n u J 1 2 ] V C . s c [

2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a

LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson’, and Weining Li®

1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify deep learn

In [9]:
data[5]

Document(metadata={'source': './layoutparser_paper.pdf', 'last_modified': '2023-01-23T09:15:33', 'text_as_html': "<table><thead><th>Dataset</th><th>| Base Model'|</th><th>| Notes</th></thead><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>", 'table_as_cells': [{'x': 0, 'y': 0, 'w': 1, 'h': 1, 'content': 'Dataset'}, {'x': 0, 'y': 1, 'w': 1, 'h': 1, 'content': 'PubLayNet B8]|'}, {'x': 0, 'y': 2, 'w': 1, 'h': 1, 'content': 'PRImA'}, {'x': 0, 'y': 3, 'w': 1, 'h': 1, 'content': 'Newspaper'}, {'x': 0, 'y': 4, 'w': 1, 'h': 1, 'content': 'TableBank'}, {'x': 0, 'y': 

In [10]:
data[5].page_content

'Dataset Base Model1 Large Model Notes PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31] F / M M F F F / M M - - F - Layouts of modern scientiﬁc documents Layouts of scanned modern magazines and scientiﬁc reports Layouts of scanned US newspapers from the 20th century Table region on modern scientiﬁc and business document Layouts of history Japanese documents'

In [11]:
from IPython.display import HTML

HTML(data[5].metadata['text_as_html'])

Dataset,| Base Model'|,| Notes
PubLayNet B8]|,F/M,Layouts of modern scientific documents
PRImA,M,Layouts of scanned modern magazines and scientific report
Newspaper,F,Layouts of scanned US newspapers from the 20th century
TableBank,F,Table region on modern scientific and business document
HJDataset,F/M,Layouts of history Japanese documents
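
Besides rendering it, the `text_as_html` string can be turned into plain Python rows for further processing. The following is a minimal sketch using only the standard library; the `TableRows` class is a hypothetical helper written for this illustration, and the HTML is a shortened copy of the value shown above (header plus two rows).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = []          # cells of the row currently being read
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tr"):
            self._row = []
        elif tag in ("th", "td"):
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag in ("thead", "tr") and self._row:
            self.rows.append(self._row)
            self._row = []


# Shortened copy of the `text_as_html` metadata value (OCR noise kept as-is)
html = (
    "<table><thead><th>Dataset</th><th>| Base Model'|</th><th>| Notes</th></thead>"
    "<tr><td>PRImA</td><td>M</td>"
    "<td>Layouts of scanned modern magazines and scientific report</td></tr>"
    "<tr><td>Newspaper</td><td>F</td>"
    "<td>Layouts of scanned US newspapers from the 20th century</td></tr></table>"
)

parser = TableRows()
parser.feed(html)
parser.close()

for row in parser.rows:
    print(row)
```

Note that the OCR artifacts in the header cells (such as `| Base Model'|`) pass through unchanged, so some cleanup is likely needed before using the rows downstream.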


Load using raw unstructured.io APIs for PDFs

In [12]:
from unstructured.partition.pdf import partition_pdf

# Get elements - takes 3-4 mins
raw_pdf_elements = partition_pdf(
    filename="./layoutparser_paper.pdf",
    strategy='hi_res',
    # Unstructured first finds embedded image blocks; here they are not extracted (False)
    extract_images_in_pdf=False,
    # Use the layout model (YOLOX) to get bounding boxes (for tables) and find titles;
    # a title marks the start of a sub-section of the document
    infer_table_structure=True,
    # Post-processing: aggregate text into chunks once the titles are known
    chunking_strategy="by_title",
    # Chunking parameters:
    #   max_characters: maximum size of a chunk
    #   new_after_n_chars: preferred chunk size; try to start a new chunk after 3800 chars
    #   combine_text_under_n_chars: combine chunks smaller than 2000 chars into a larger chunk
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="./",
)
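
To build intuition for how the three size parameters interact, here is a deliberately simplified sketch of title-based chunking in plain Python. It is an illustration only, not Unstructured's implementation: the function `chunk_by_title_sketch` and the `(category, text)` input format are assumptions made for this example, and the real chunker handles more cases (for instance, tables come out as their own chunks in the output above, and oversized elements are not split here).

```python
def chunk_by_title_sketch(elements, max_characters,
                          new_after_n_chars, combine_text_under_n_chars):
    """Group (category, text) pairs into chunks, loosely imitating 'by_title'."""
    chunks, current = [], ""
    for category, text in elements:
        joined = f"{current} {text}".strip()
        # A title opens a new chunk unless the current chunk is still "small"
        starts_section = (category == "Title"
                          and len(current) >= combine_text_under_n_chars)
        # Hard limit: never let a chunk grow past max_characters
        too_long = len(joined) > max_characters
        # Soft limit: once the preferred size is reached, close the chunk
        soft_full = len(current) >= new_after_n_chars
        if current and (starts_section or too_long or soft_full):
            chunks.append(current)
            current = text
        else:
            current = joined
    if current:
        chunks.append(current)
    return chunks


# Tiny limits so the behavior is visible on toy data
elements = [
    ("Title", "A"), ("NarrativeText", "a" * 8),
    ("Title", "B"), ("NarrativeText", "b" * 8),
    ("Title", "C"), ("NarrativeText", "c" * 35),
    ("NarrativeText", "d" * 20),
]
chunks = chunk_by_title_sketch(elements, max_characters=40,
                               new_after_n_chars=30,
                               combine_text_under_n_chars=15)
print([len(c) for c in chunks])
```

In this toy run the short section "A" is merged with section "B" because it is under the combine threshold, the title "C" starts a fresh chunk, and the last paragraph is pushed into its own chunk because appending it would exceed the hard limit.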

In [13]:
len(raw_pdf_elements)

18

In [14]:
raw_pdf_elements

[<unstructured.documents.elements.CompositeElement at 0x7f95956b9890>,
 <unstructured.documents.elements.CompositeElement at 0x7f959386f390>,
 <unstructured.documents.elements.CompositeElement at 0x7f95823bf050>,
 <unstructured.documents.elements.CompositeElement at 0x7f96bc86de90>,
 <unstructured.documents.elements.CompositeElement at 0x7f96babb5410>,
 <unstructured.documents.elements.Table at 0x7f96bc893450>,
 <unstructured.documents.elements.CompositeElement at 0x7f9595433e90>,
 <unstructured.documents.elements.CompositeElement at 0x7f95954fbf90>,
 <unstructured.documents.elements.CompositeElement at 0x7f96be6c8050>,
 <unstructured.documents.elements.Table at 0x7f9582420e10>,
 <unstructured.documents.elements.CompositeElement at 0x7f95968bab90>,
 <unstructured.documents.elements.CompositeElement at 0x7f95968b9150>,
 <unstructured.documents.elements.CompositeElement at 0x7f95968b8bd0>,
 <unstructured.documents.elements.CompositeElement at 0x7f95968b9250>,
 <unstructured.documents.ele

In [15]:
raw_pdf_elements[5].to_dict()

{'type': 'Table',
 'element_id': '720a11c7a3fa16628248e6b9613d2c2d',
 'text': 'Dataset Base Model1 Large Model Notes PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31] F / M M F F F / M M - - F - Layouts of modern scientiﬁc documents Layouts of scanned modern magazines and scientiﬁc reports Layouts of scanned US newspapers from the 20th century Table region on modern scientiﬁc and business document Layouts of history Japanese documents',
 'metadata': {'last_modified': '2023-01-23T09:15:33',
  'text_as_html': "<table><thead><th>Dataset</th><th>| Base Model'|</th><th>| Notes</th></thead><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></t

Convert into LangChain `Document` format

In [16]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in raw_pdf_elements]
lc_docs[5]

Document(metadata={'last_modified': '2023-01-23T09:15:33', 'text_as_html': "<table><thead><th>Dataset</th><th>| Base Model'|</th><th>| Notes</th></thead><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>", 'table_as_cells': [{'x': 0, 'y': 0, 'w': 1, 'h': 1, 'content': 'Dataset'}, {'x': 0, 'y': 1, 'w': 1, 'h': 1, 'content': 'PubLayNet B8]|'}, {'x': 0, 'y': 2, 'w': 1, 'h': 1, 'content': 'PRImA'}, {'x': 0, 'y': 3, 'w': 1, 'h': 1, 'content': 'Newspaper'}, {'x': 0, 'y': 4, 'w': 1, 'h': 1, 'content': 'TableBank'}, {'x': 0, 'y': 5, 'w': 1, 'h': 1, 'content': 'HJDatas
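
Once chunks are in this shape, a common next step is to separate table chunks from text chunks, since only table chunks carry a `text_as_html` entry in their metadata in the outputs above. The sketch below uses plain dictionaries as hypothetical stand-ins for `Document` objects (with abbreviated content), so it runs without LangChain installed; with real documents you would check `doc.metadata` instead of `c["metadata"]`.

```python
# Hypothetical stand-ins for converted chunks: plain dicts with the same two
# fields as a LangChain Document (page_content and metadata). Content is abbreviated.
chunks = [
    {"page_content": "LayoutParser: A Unified Toolkit ...", "metadata": {}},
    {"page_content": "Dataset Base Model ...",
     "metadata": {"text_as_html": "<table>...</table>"}},
    {"page_content": "Abstract. Recent advances ...", "metadata": {}},
]

# Table chunks are identified by the presence of the `text_as_html` key
table_chunks = [c for c in chunks if "text_as_html" in c["metadata"]]
text_chunks = [c for c in chunks if "text_as_html" not in c["metadata"]]

print(len(table_chunks), "table chunk(s),", len(text_chunks), "text chunk(s)")
```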

### Microsoft Office Document Loaders

The Microsoft Office suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.

[Unstructured.io](https://docs.unstructured.io/open-source/introduction/overview) provides a variety of document loaders to load MS Office documents. Check them out [here](https://docs.unstructured.io/open-source/core-functionality/partitioning).

Here we will leverage LangChain's [`UnstructuredWordDocumentLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader.html) to load data from an MS Word document.

In [17]:
!gdown 1DEz13a7k4yX9yFrWaz3QJqHdfecFYRV-

Downloading...
From: https://drive.google.com/uc?id=1DEz13a7k4yX9yFrWaz3QJqHdfecFYRV-
To: /content/Quantum Computing.docx
  0% 0.00/11.4k [00:00<?, ?B/s]100% 11.4k/11.4k [00:00<00:00, 21.6MB/s]


Load Word doc as a single document

In [19]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader('./Quantum Computing.docx')
data = loader.load()

In [20]:
len(data)

1

In [21]:
data[0].page_content[:1000]

'The Rise of Quantum Computing: A New Era of Innovation\n\nFor decades, classical computing has driven technological advancements, but the limitations of traditional binary processing are becoming evident as the world demands more computational power. Enter quantum computing—a revolutionary approach that leverages the principles of quantum mechanics to solve complex problems at unprecedented speeds.\n\nUnderstanding Quantum Computing\n\nUnlike classical computers that process information using bits (0s and 1s), quantum computers use qubits, which can exist in multiple states simultaneously due to superposition. This unique property allows quantum systems to process vast amounts of data in parallel, making them exponentially more powerful for specific tasks.\n\nAnother key principle, entanglement, enables qubits to be interconnected, meaning the state of one qubit is dependent on another, regardless of distance. This drastically enhances processing efficiency and speed, paving the way f

Load Word doc with complex parsing and section-based chunks

In [22]:
loader = UnstructuredWordDocumentLoader('./Quantum Computing.docx',
                                        strategy='fast',
                                        chunking_strategy="by_title",
                                        max_characters=3000, # max limit of a document chunk
                                        new_after_n_chars=2500, # preferred document chunk size
                                        mode='elements')
data = loader.load()

In [23]:
len(data)

2

In [24]:
data[0]

Document(metadata={'source': './Quantum Computing.docx', 'emphasized_text_contents': ['Understanding Quantum Computing', 'qubits', 'superposition', 'entanglement', 'Applications Transforming Industries', 'Drug Discovery & Healthcare', 'Financial Modeling', 'Cybersecurity & Cryptography', 'post-quantum cryptography', 'Climate Modeling & Sustainability', 'AI & Machine Learning'], 'emphasized_text_tags': ['b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'], 'file_directory': '.', 'filename': 'Quantum Computing.docx', 'languages': ['eng'], 'last_modified': '2025-02-19T14:45:48', 'orig_elements': 'eJzVWN9v20YS/lcWfjj0ANGgKIqi8pam7Z2BpsDd+Z6awhjuDqWFqV1muZSjFPe/3zdLyVbiNHEBP9QvhrU/ZndmvvnmW/76+wV3vGMXb6y5eKUuuGhMsdKccbnkrGyrMmtMu8qo5Vy3y7Zu5s3FTF3sOJKhSNjz+4WmyBsfDjeG+7jFUI4Vre34xtjAOmJKbF9eHIcd7VgG/jWSi+NOvfG7fozWbS6N1x9kVUduM9KGByz79YLd5uK3NDrEm503trWcblvkxTLLi2y+vp6Xr8rlq7K++B8WRv4QZf56y+rfdmDlW/XosFfqtfqF79SPgWT+yjm/p2i9kwvEQ5+ueG1jx2Lz80AZU5UF8TxrFg0CtTCU0bI02brOmyWX89as25cTqJ98UIY1G

In [27]:
data[1]

Document(metadata={'source': './Quantum Computing.docx', 'emphasized_text_contents': ['Challenges & The Road Ahead', 'Quantum Revolution'], 'emphasized_text_tags': ['b', 'b'], 'file_directory': '.', 'filename': 'Quantum Computing.docx', 'languages': ['eng'], 'last_modified': '2025-02-19T14:45:48', 'orig_elements': 'eJzVVE1v3DYQ/SsDHXpaGdJa8q58C1KgOQVo4FsaLIbkSCJCkQo/1tkG/e+dke0kSAK3vTUXQeR8v/c4bz9V5Gghn0/WVLdQ7YdWazJUj+pg6u6mbWtFStfjtel7vR/401Y7qBbKaDAjx3yqNGaaQrycDK155quWPWhZZ0z2TzKnTB/zSQefuU5i89vq5YzOkZ8owS9wNxO8CWjgxUxoqnc/CM44PQSqzTxaRydjI+nMZaXvq+rx2uNCcvF7QZ/LAi/DspZs/XRlgv4oXg79VJBLbwm5iS2lw5RPSzB2tPSARLPv62Zft8Nd2912/W13rP5iR+lH7M+NIG6XdevjzmZHEvgt0qi00Qqx3uORkR411YPiv/1xVF3fHa7x5vDPSDf/EzR+pbTaTGBzgjWGxSbawYfHqvqpKoyoGa9kJ8+ZNVtBf4ZxB9ZrV4z4fSjKZkgZlXU2X3ZAMYbIiaJMaYPfAXoDDuNEddLoCGaM5h4jgaEzubAK1FfwKtzzMe7g3uYZRIPWl1ASqEj4Ps8xlGlOXBlSWSmygyk6f26Bm8oR15UMcNH0UDWHNbgwcf/u+xG5EoG0sYSzpNEuJIocAyhOC9ewPPEFzhYVt/2UYCy5RLr6WjqvMUbM9kx3AvIPJKSH3qDWTU1D39TdsVX1gHxsu2G81t3BYNP9PBL6LTB

### Directory Loaders

LangChain's [`DirectoryLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects.

In [28]:
!wget -O 'Vision Transformers.pdf' 'https://arxiv.org/pdf/2010.11929.pdf'

--2025-05-29 12:11:49--  https://arxiv.org/pdf/2010.11929.pdf
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2010.11929 [following]
--2025-05-29 12:11:49--  http://arxiv.org/pdf/2010.11929
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3743814 (3.6M) [application/pdf]
Saving to: ‘Vision Transformers.pdf’


2025-05-29 12:11:49 (43.5 MB/s) - ‘Vision Transformers.pdf’ saved [3743814/3743814]



We first define the specific loader that LangChain should use for each file type. The mapping follows this format:

```
loaders = {
  'file_format_extension' : (LoaderClass, LoaderKeywordArguments)
}
```

Where:

- `file_format_extension` can be anything like `.docx`, `.pdf`, etc.
- `LoaderClass` is a specific data loader like `PyMuPDFLoader`
- `LoaderKeywordArguments` are any specific keyword arguments that need to be passed into that loader at runtime
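
The mapping above can be read as "look up the extension, then build the loader with its keyword arguments". A minimal sketch of that lookup follows; `FakePDFLoader`, `FakeWordLoader`, and `build_loader` are hypothetical stand-ins invented for this illustration so the snippet runs without any loader libraries installed.

```python
from pathlib import Path


class FakePDFLoader:
    """Hypothetical stand-in for a loader class such as PyMuPDFLoader."""

    def __init__(self, file_path, **kwargs):
        self.file_path = file_path
        self.kwargs = kwargs


class FakeWordLoader(FakePDFLoader):
    """Hypothetical stand-in for UnstructuredWordDocumentLoader."""


# Same shape as the format described above: extension -> (class, kwargs)
loaders = {
    ".pdf": (FakePDFLoader, {}),
    ".docx": (FakeWordLoader, {"mode": "elements",
                               "chunking_strategy": "by_title"}),
}


def build_loader(file_path):
    """Pick the loader class by file extension and pass its kwargs through."""
    loader_cls, loader_kwargs = loaders[Path(file_path).suffix]
    return loader_cls(file_path, **loader_kwargs)


word_loader = build_loader("./Quantum Computing.docx")
print(type(word_loader).__name__, word_loader.kwargs)
```

Keeping the class and its arguments together in one tuple means adding support for a new file type is a one-line change to the dictionary.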

In [31]:
from langchain_community.document_loaders import PyMuPDFLoader, UnstructuredWordDocumentLoader

# Define a dictionary to map file extensions to their respective loaders
loaders = {
    '.pdf': (PyMuPDFLoader, {}),
    '.docx': (UnstructuredWordDocumentLoader, {
        'strategy': 'fast',
        'chunking_strategy': 'by_title',
        'max_characters': 3000,     # max limit of a document chunk
        'new_after_n_chars': 2500,  # preferred document chunk size
        'mode': 'elements',
    }),
}

`DirectoryLoader` accepts a `loader_cls` argument, which defaults to `UnstructuredFileLoader`. We can instead pass one of the loaders defined above in `loader_cls`, and any keyword arguments for that loader in `loader_kwargs`.

We can also show a progress bar by setting `show_progress=True`.

We can use the `glob` parameter to control which files to load based on file patterns.
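
To see which files a pattern such as `**/*.pdf` selects, the sketch below builds a throwaway directory and applies the pattern with `pathlib`. It assumes the pattern is interpreted the way `pathlib.Path.glob` interprets it; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Create empty placeholder files (hypothetical names)
    for name in ("paper.pdf", "notes.docx", "nested/appendix.pdf"):
        (root / name).touch()

    # `**/` matches the top-level directory and every sub-directory
    pdf_matches = sorted(p.relative_to(root).as_posix()
                         for p in root.glob("**/*.pdf"))
    docx_matches = sorted(p.relative_to(root).as_posix()
                          for p in root.glob("**/*.docx"))

print(pdf_matches)
print(docx_matches)
```

Because `**/` also descends into sub-directories, the PDF pattern picks up `nested/appendix.pdf` as well as the top-level file.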

Here we create two separate loaders, one for Word documents and one for PDFs.

In [32]:
from langchain_community.document_loaders import DirectoryLoader

# Define a function to create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type][0],
        loader_kwargs=loaders[file_type][1],
        show_progress=True
    )

# Create DirectoryLoader instances for each file type
pdf_loader = create_directory_loader('.pdf', './')
docx_loader = create_directory_loader('.docx', './')

# Load the files
pdf_documents = pdf_loader.load()
docx_documents = docx_loader.load()

100%|██████████| 2/2 [00:00<00:00,  7.99it/s]
100%|██████████| 1/1 [00:00<00:00, 17.15it/s]


In [33]:
len(pdf_documents)

38

In [34]:
pdf_documents[18]

Document(metadata={'source': 'Vision Transformers.pdf', 'file_path': 'Vision Transformers.pdf', 'page': 18, 'total_pages': 22, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210604001958Z', 'modDate': 'D:20210604001958Z', 'trapped': ''}, page_content='Published as a conference paper at ICLR 2021\nwe perform timing of inference speed for the main models of interest, on a TPUv3 accelerator; the\ndifference between inference and backprop speed is a constant model-independent factor.\nFigure 12 (left) shows how many images one core can handle per second, across various input sizes.\nEvery single point refers to the peak performance measured across a wide range of batch-sizes. As\ncan be seen, the theoretical bi-quadratic scaling of ViT with image size only barely starts happening\nfor the largest models at the largest resolutions.\nAnother quantity of interest is the largest

In [35]:
len(docx_documents)

2