**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Exploring Document Loaders in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [1]:
!pip install -qq langchain==0.3.11
!pip install -qq langchain-openai==0.2.12
!pip install -qq langchain-community==0.3.11
!pip install -qq jq==1.7.0
!pip install -qq pypdf==4.2.0
!pip install -qq PyMuPDF==1.24.5

In [2]:
# takes 2 - 5 mins to install on Colab
!pip install -qq "unstructured[all-docs]==0.14.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-community 0.3.11 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.2.6 which is incompatible.
langchain-chroma 0.1.4 requires numpy<2,>=1; python_version < "3.12", but you have numpy 2.2.6 which is incompatible.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.2 which is incompatible.
langchain-huggingface 0.3.0 requires huggingface-hub>=0.30.2, but you have huggingface-hub 0.27.1 which is incompatible.
langchain-huggingface 0.3.0 requires langchain-core<1.0.0,>=0.3.65, but you have langchain-core 0.3.63 which is incompatible.
langchain 0.3.11 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.2.6 which is incompatible.

In [3]:
!pip install -qq -U pydantic

In [4]:
!pip install -qq pytesseract
!pip install -qq pdf2image

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that enables the loading of data as documents from a pre-configured source.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which allows data to be loaded into memory gradually as needed.

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
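
The difference between eager and lazy loading can be illustrated without LangChain at all. The sketch below is a minimal stand-in, not LangChain's implementation: `SimpleDocument` and `LineLoader` are hypothetical names invented for this example, and the only assumption carried over from the text above is that a loader exposes `load` and, optionally, a lazy variant.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each string in `lines` into one document."""

    def __init__(self, lines: list[str], source: str = "in-memory"):
        self.lines = lines
        self.source = source

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield documents one at a time, so only the current one needs to be in memory.
        for number, text in enumerate(self.lines):
            yield SimpleDocument(text, {"source": self.source, "line": number})

    def load(self) -> list[SimpleDocument]:
        # Eager variant: materialize everything that lazy_load() produces.
        return list(self.lazy_load())


loader = LineLoader(["first line", "second line", "third line"])
all_docs = loader.load()              # list with all three documents
first_doc = next(loader.lazy_load())  # only the first document is created
```

With a large source, iterating over the lazy variant lets each document be processed and discarded before the next one is read, which is the main reason to prefer it over building the full list.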


### Text Loader

The simplest loader reads in a file as text and places it all into one document.

In [18]:
# Import the TextLoader from langchain_community.document_loaders
# This loader is used to read text files and convert them into Document objects

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../../docs/dummy.txt")
doc = loader.load()

In [22]:
print(f"\nThe number of documents : {len(doc)}\n")


The number of documents : 1



In [20]:
type(doc)

list

In [29]:
# Each item in the list is a langchain_core.documents.base.Document
type(doc[0])

langchain_core.documents.base.Document

In [28]:
print(f"\nType of the first document: {type(doc[0])}\n")


Type of the first document: <class 'langchain_core.documents.base.Document'>



In [19]:
# Print the entire document object to see its structure and content
# This will show us the Document object with its page_content and metadata

print(f"The whole documents are: \n{doc}\n")

The whole documents are: 
[Document(metadata={'source': '../../docs/dummy.txt'}, page_content='Quod equidem non reprehendo;\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura iure responderit non esse verum aliunde finem beate vivendi, a se principia rei gerendae peti; Quae enim adhuc protulisti, popularia sunt, ego autem a te elegantiora desidero. Duo Reges: constructio interrete. Tum Lucius: Mihi vero ista valde probata sunt, quod item fratri puto. Bestiarum vero nullum iudicium puto. Nihil enim iam habes, quod ad corpus referas; Deinde prima illa, quae in congressu solemus: Quid tu, inquit, huc? Et homini, qui ceteris animantibus plurimum praestat, praecipue a natura nihil datum esse dicemus?\n\nIam id ipsum absurdum, maximum malum neglegi. Quod ea non occurrentia fingunt, vincunt Aristonem; Atqui perspicuum est hominem e corpore animoque constare, cum primae sint animi partes, secundae corporis. Fieri, inquam, Triari, nullo pacto potest, ut non dicas, quid non 

In [33]:
print(f"The first document is: \n{doc[0]}\n")

The first document is: 
page_content='Quod equidem non reprehendo;
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura iure responderit non esse verum aliunde finem beate vivendi, a se principia rei gerendae peti; Quae enim adhuc protulisti, popularia sunt, ego autem a te elegantiora desidero. Duo Reges: constructio interrete. Tum Lucius: Mihi vero ista valde probata sunt, quod item fratri puto. Bestiarum vero nullum iudicium puto. Nihil enim iam habes, quod ad corpus referas; Deinde prima illa, quae in congressu solemus: Quid tu, inquit, huc? Et homini, qui ceteris animantibus plurimum praestat, praecipue a natura nihil datum esse dicemus?

Iam id ipsum absurdum, maximum malum neglegi. Quod ea non occurrentia fingunt, vincunt Aristonem; Atqui perspicuum est hominem e corpore animoque constare, cum primae sint animi partes, secundae corporis. Fieri, inquam, Triari, nullo pacto potest, ut non dicas, quid non probes eius, a quo dissentias. Equidem e Cn. An dubium est, 

In [34]:
print(f"Content of first document : \n{doc[0].page_content}\n")

Content of first document : 
Quod equidem non reprehendo;
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura iure responderit non esse verum aliunde finem beate vivendi, a se principia rei gerendae peti; Quae enim adhuc protulisti, popularia sunt, ego autem a te elegantiora desidero. Duo Reges: constructio interrete. Tum Lucius: Mihi vero ista valde probata sunt, quod item fratri puto. Bestiarum vero nullum iudicium puto. Nihil enim iam habes, quod ad corpus referas; Deinde prima illa, quae in congressu solemus: Quid tu, inquit, huc? Et homini, qui ceteris animantibus plurimum praestat, praecipue a natura nihil datum esse dicemus?

Iam id ipsum absurdum, maximum malum neglegi. Quod ea non occurrentia fingunt, vincunt Aristonem; Atqui perspicuum est hominem e corpore animoque constare, cum primae sint animi partes, secundae corporis. Fieri, inquam, Triari, nullo pacto potest, ut non dicas, quid non probes eius, a quo dissentias. Equidem e Cn. An dubium est, quin virt

In [15]:
print(f"The first document's metadata is: \n{doc[0].metadata}\n")

The first document's metadata is: 
{'source': '../../docs/dummy.txt'}



In [35]:
print(doc[0].page_content[:100])

Quod equidem non reprehendo;
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura 


### Markdown Loader

* Markdown is a lightweight markup language for creating formatted text in a plain-text editor.
* `UnstructuredMarkdownLoader` loads Markdown files as LangChain `Document` objects that can be used in pipelines and chains.
* In `single` mode, the loader returns the whole file as one document.


#### Download nltk packages if needed

In [37]:
# Solution for SSL certificate verification errors when downloading NLTK data
import nltk
import ssl

try:
    # This bypasses SSL certificate verification (for development/testing environments)
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

# Now try downloading the NLTK data
try:
    print("Attempting to download NLTK data...")
    nltk.download('punkt_tab')
    nltk.download('averaged_perceptron_tagger_eng')
    print("✅ NLTK data downloaded successfully!")
except Exception as e:
    print(f"❌ Error downloading NLTK data: {e}")
    print("🔄 Trying alternative approach...")
    
    # Alternative: Download older/stable versions of the packages
    try:
        nltk.download('punkt')
        nltk.download('averaged_perceptron_tagger')
        print("✅ Alternative NLTK data downloaded successfully!")
    except Exception as e2:
        print(f"❌ Alternative download also failed: {e2}")
        print("\n🔧 Manual solutions:")
        print("1. Update certificates: /Applications/Python\\ 3.x/Install\\ Certificates.command")
        print("2. Use conda: conda install nltk")
        print("3. Download manually from: https://www.nltk.org/data.html")
        
        # Last resort: Check if the data already exists
        try:
            import nltk.data
            nltk.data.find('tokenizers/punkt')
            print("✅ NLTK punkt data already available!")
        except LookupError:
            print("❌ NLTK punkt data not found. Manual installation required.")
            
        try:
            import nltk.data
            nltk.data.find('taggers/averaged_perceptron_tagger')
            print("✅ NLTK averaged_perceptron_tagger data already available!")
        except LookupError:
            print("❌ NLTK averaged_perceptron_tagger data not found. Manual installation required.")

Attempting to download NLTK data...


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/sourav.banerjee/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/sourav.banerjee/nltk_data...


✅ NLTK data downloaded successfully!


[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


In [49]:

# Use LangChain's UnstructuredMarkdownLoader to load a Markdown file
# ("../../docs/README.md") as a list of Document objects.
# The resulting documents can be passed to downstream retrieval steps.

from langchain_community.document_loaders import UnstructuredMarkdownLoader

# mode='single' loads the entire Markdown file as one Document object
# instead of one Document per element (title, paragraph, list item, ...).
# Use it when the whole file should be treated as a single unit
# that keeps its full context.

loader = UnstructuredMarkdownLoader("../../docs/README.md", mode='single')
docs = loader.load()

In [50]:
print(f"\nThe number of documents : {len(docs)}\n")


The number of documents : 1



In [51]:
print(f"\nType of the first document: {type(docs[0])}\n")


Type of the first document: <class 'langchain_core.documents.base.Document'>



In [52]:
type(docs[0])

langchain_core.documents.base.Document

In [53]:
print(docs[0].metadata)

{'source': '../../docs/README.md'}


In [54]:
print(docs[0].page_content[:100])

🦜️🔗 LangChain

⚡ Build context-aware reasoning applications ⚡

Looking for the JS/TS library? Check 


#### Load document and separate based on elements

In [55]:
loader = UnstructuredMarkdownLoader("../../docs/README.md", mode="elements")
docs = loader.load()

In [56]:
print(f"\nThe number of documents : {len(docs)}\n")


The number of documents : 63



In [57]:
docs[:10]

[Document(metadata={'source': '../../docs/README.md', 'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md', 'category': 'Title', 'element_id': '200b8a7d0dd03f66e4f13456566d2b3a'}, page_content='🦜️🔗 LangChain'),
 Document(metadata={'source': '../../docs/README.md', 'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md', 'category': 'NarrativeText', 'element_id': '80d06543c0c2b75ca147f3509e518a47'}, page_content='⚡ Build context-aware reasoning applications ⚡'),
 Document(metadata={'source': '../../docs/README.md', 'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md', 'category': 'NarrativeText', 'element_id': 'd6827

In [58]:
from collections import Counter
Counter([doc.metadata['category'] for doc in docs])

Counter({'ListItem': 26, 'Title': 20, 'NarrativeText': 17})

In [59]:
docs[0].metadata

{'source': '../../docs/README.md',
 'last_modified': '2025-05-30T10:16:46',
 'languages': ['eng'],
 'filetype': 'text/markdown',
 'file_directory': '../../docs',
 'filename': 'README.md',
 'category': 'Title',
 'element_id': '200b8a7d0dd03f66e4f13456566d2b3a'}

In [60]:
docs[0].page_content

'🦜️🔗 LangChain'

In [61]:
docs[1].metadata

{'source': '../../docs/README.md',
 'last_modified': '2025-05-30T10:16:46',
 'languages': ['eng'],
 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a',
 'filetype': 'text/markdown',
 'file_directory': '../../docs',
 'filename': 'README.md',
 'category': 'NarrativeText',
 'element_id': '80d06543c0c2b75ca147f3509e518a47'}

In [62]:
docs[1].page_content

'⚡ Build context-aware reasoning applications ⚡'
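
The metadata above shows that a `NarrativeText` element carries a `parent_id` equal to the `element_id` of its `Title`. The following sketch shows one way such links could be used to regroup elements into sections. It works on plain dictionaries rather than real loader output: the records, their shortened IDs, and the `Quick Install` entry are hypothetical and only imitate the shape of the metadata shown above.

```python
from collections import defaultdict

# Hypothetical element records shaped like the metadata shown above.
elements = [
    {"text": "LangChain", "category": "Title", "element_id": "t1"},
    {"text": "Build context-aware reasoning applications",
     "category": "NarrativeText", "element_id": "n1", "parent_id": "t1"},
    {"text": "Looking for the JS/TS library?",
     "category": "NarrativeText", "element_id": "n2", "parent_id": "t1"},
    {"text": "Quick Install", "category": "Title", "element_id": "t2"},
    {"text": "pip install langchain",
     "category": "NarrativeText", "element_id": "n3", "parent_id": "t2"},
]

# Map each title's element_id to its text.
titles = {e["element_id"]: e["text"] for e in elements if e["category"] == "Title"}

# Collect child texts under the title whose element_id matches their parent_id.
sections = defaultdict(list)
for element in elements:
    parent = element.get("parent_id")
    if parent in titles:
        sections[titles[parent]].append(element["text"])

for title, texts in sections.items():
    print(title, "->", texts)
```

The same idea should carry over to the `Document` objects returned in `elements` mode by reading `doc.metadata.get("parent_id")` and `doc.metadata["element_id"]`, although elements without a `parent_id` need their own handling.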

#### Comparing Unstructured.io loaders vs LangChain wrapper API

In [64]:
from unstructured.partition.md import partition_md

docs = partition_md(filename="../../docs/README.md")

In [65]:
len(docs)

63

In [70]:
docs[:10]

[<unstructured.documents.elements.Title at 0x307623b90>,
 <unstructured.documents.elements.NarrativeText at 0x30762cfd0>,
 <unstructured.documents.elements.NarrativeText at 0x30762ef50>,
 <unstructured.documents.elements.NarrativeText at 0x30762ebd0>,
 <unstructured.documents.elements.Title at 0x30762d2d0>,
 <unstructured.documents.elements.Title at 0x30762d010>,
 <unstructured.documents.elements.Title at 0x307121410>,
 <unstructured.documents.elements.Title at 0x305b11010>,
 <unstructured.documents.elements.NarrativeText at 0x305b13fd0>,
 <unstructured.documents.elements.Title at 0x305b11cd0>]

In [71]:
docs[0].to_dict()

{'type': 'Title',
 'element_id': '200b8a7d0dd03f66e4f13456566d2b3a',
 'text': '🦜️🔗 LangChain',
 'metadata': {'last_modified': '2025-05-30T10:16:46',
  'languages': ['eng'],
  'filetype': 'text/markdown',
  'file_directory': '../../docs',
  'filename': 'README.md'}}

In [72]:
docs[1].to_dict()

{'type': 'NarrativeText',
 'element_id': '80d06543c0c2b75ca147f3509e518a47',
 'text': '⚡ Build context-aware reasoning applications ⚡',
 'metadata': {'last_modified': '2025-05-30T10:16:46',
  'languages': ['eng'],
  'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a',
  'filetype': 'text/markdown',
  'file_directory': '../../docs',
  'filename': 'README.md'}}

In [73]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in docs]
lc_docs[:10]

[Document(metadata={'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md'}, page_content='🦜️🔗 LangChain'),
 Document(metadata={'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md'}, page_content='⚡ Build context-aware reasoning applications ⚡'),
 Document(metadata={'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md'}, page_content='Looking for the JS/TS library? Check out LangChain.js.'),
 Document(metadata={'last_modified': '2025-05-30T10:16:46', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../docs', 'filename': 'README.md'}, page_conte

### CSV Loader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

LangChain implements a CSV Loader that will load CSV files into a sequence of `Document` objects. Each row of the CSV file is converted to one document.

In [75]:
import pandas as pd

# Create a DataFrame with some dummy real estate data
data = {
    'Property_ID': [101, 102, 103, 104, 105],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '321 Maple St', '654 Cedar St'],
    'City': ['Springfield', 'Rivertown', 'Laketown', 'Hillside', 'Sunnyvale'],
    'State': ['CA', 'TX', 'FL', 'NY', 'CO'],
    'Zip_Code': [98765, 87654, 76543, 65432, 54321],
    'Bedrooms': [3, 2, 4, 3, 5],
    'Bathrooms': [2, 1, 3, 2, 4],
    'Listing_Price': [500000, 350000, 600000, 475000, 750000]
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('../../docs/data.csv', index=False)

In [77]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="../../docs/data.csv")
docs = loader.load()

In [78]:
docs

[Document(metadata={'source': '../../docs/data.csv', 'row': 0}, page_content='Property_ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip_Code: 98765\nBedrooms: 3\nBathrooms: 2\nListing_Price: 500000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 1}, page_content='Property_ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip_Code: 87654\nBedrooms: 2\nBathrooms: 1\nListing_Price: 350000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 2}, page_content='Property_ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip_Code: 76543\nBedrooms: 4\nBathrooms: 3\nListing_Price: 600000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 3}, page_content='Property_ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\nZip_Code: 65432\nBedrooms: 3\nBathrooms: 2\nListing_Price: 475000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 4}, page_content='Property_ID: 105\nAddress: 654 Cedar St\nCity: Sunnyvale\nState

In [81]:
print(docs[0].page_content)

Property_ID: 101
Address: 123 Elm St
City: Springfield
State: CA
Zip_Code: 98765
Bedrooms: 3
Bathrooms: 2
Listing_Price: 500000
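
The `column: value` layout printed above can be approximated with the standard library alone. This is a rough sketch of the row-to-document mapping, not `CSVLoader`'s actual implementation: the documents are plain dictionaries, and the CSV text is a shortened, in-memory version of the real estate data.

```python
import csv
import io

# Two rows in the same layout as the real estate example (shortened to three columns).
raw_csv = "Property_ID,City,State\n101,Springfield,CA\n102,Rivertown,TX\n"

row_documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(raw_csv))):
    # Join the fields as "column: value" lines, producing one document per CSV row.
    text = "\n".join(f"{column}: {value}" for column, value in row.items())
    row_documents.append({"page_content": text, "metadata": {"row": row_number}})

print(row_documents[0]["page_content"])
```

Because every row becomes its own document, each record can later be retrieved independently of the rest of the file.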


`CSVLoader` accepts a `csv_args` keyword argument whose entries are passed to Python's `csv.DictReader`. See the [`csv` module](https://docs.python.org/3/library/csv.html) documentation for the supported arguments.

In [82]:
loader = CSVLoader(file_path="../../docs/data.csv",
                   csv_args={
                      "delimiter": ",",
                      "quotechar": '"',
                      "fieldnames": ["Property ID", "Address", "City", "State",
                                     "Zip Code", "Bedrooms", "Bathrooms", "Price"],
                   },
                  )
docs = loader.load()

In [83]:
docs

[Document(metadata={'source': '../../docs/data.csv', 'row': 0}, page_content='Property ID: Property_ID\nAddress: Address\nCity: City\nState: State\nZip Code: Zip_Code\nBedrooms: Bedrooms\nBathrooms: Bathrooms\nPrice: Listing_Price'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 1}, page_content='Property ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip Code: 98765\nBedrooms: 3\nBathrooms: 2\nPrice: 500000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 2}, page_content='Property ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip Code: 87654\nBedrooms: 2\nBathrooms: 1\nPrice: 350000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 3}, page_content='Property ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip Code: 76543\nBedrooms: 4\nBathrooms: 3\nPrice: 600000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 4}, page_content='Property ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\n

In [84]:
print(docs[0].page_content)

Property ID: Property_ID
Address: Address
City: City
State: State
Zip Code: Zip_Code
Bedrooms: Bedrooms
Bathrooms: Bathrooms
Price: Listing_Price


In [85]:
print(docs[1].page_content)

Property ID: 101
Address: 123 Elm St
City: Springfield
State: CA
Zip Code: 98765
Bedrooms: 3
Bathrooms: 2
Price: 500000
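
In the output above, row 0 contains the original column names rather than a property. This follows from how `csv.DictReader` works: when `fieldnames` is supplied, the reader treats every line of the file as data, including the header line. The sketch below reproduces the effect with the standard library and shows one way to drop the header; the two-column CSV text is a shortened, hypothetical version of the example file.

```python
import csv
import io

raw_csv = "Property_ID,City\n101,Springfield\n102,Rivertown\n"
new_names = ["Property ID", "City"]

# With explicit fieldnames, DictReader treats every line as data, header included.
rows_with_header = list(csv.DictReader(io.StringIO(raw_csv), fieldnames=new_names))

# Consuming the first line before creating the reader drops the original header.
stream = io.StringIO(raw_csv)
next(stream)
rows_without_header = list(csv.DictReader(stream, fieldnames=new_names))

print(rows_with_header[0])
print(rows_without_header[0])
```

When loading through `CSVLoader` with custom `fieldnames`, the simplest equivalent is probably to discard the first document, for example by keeping `docs[1:]`.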


#### Compare with unstructured.io

`UnstructuredCSVLoader` loads the entire CSV file as a single table, so the result is one `Document`.

In [86]:
from langchain_community.document_loaders import UnstructuredCSVLoader

loader = UnstructuredCSVLoader("../../docs/data.csv")
docs = loader.load()

In [87]:
len(docs)

1

In [88]:
print(docs[0])

page_content='


Property_ID
Address
City
State
Zip_Code
Bedrooms
Bathrooms
Listing_Price


101
123 Elm St
Springfield
CA
98765
3
2
500000


102
456 Oak St
Rivertown
TX
87654
2
1
350000


103
789 Pine St
Laketown
FL
76543
4
3
600000


104
321 Maple St
Hillside
NY
65432
3
2
475000


105
654 Cedar St
Sunnyvale
CO
54321
5
4
750000


' metadata={'source': '../../docs/data.csv'}


In [89]:
print(docs[0].page_content)





Property_ID
Address
City
State
Zip_Code
Bedrooms
Bathrooms
Listing_Price


101
123 Elm St
Springfield
CA
98765
3
2
500000


102
456 Oak St
Rivertown
TX
87654
2
1
350000


103
789 Pine St
Laketown
FL
76543
4
3
600000


104
321 Maple St
Hillside
NY
65432
3
2
475000


105
654 Cedar St
Sunnyvale
CO
54321
5
4
750000





In [90]:
print(docs[0].metadata)

{'source': '../../docs/data.csv'}


### JSON Loader

[JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

[JSON Lines](https://jsonlines.org/) is a file format where each line is a valid JSON value.

LangChain implements a [JSONLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html) to convert JSON and JSONL data into LangChain `Document` objects. It uses a specified [`jq` schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

It relies on the `jq` Python package. See the [jq manual](https://jqlang.github.io/jq/manual/) for detailed documentation of the `jq` syntax.

In [92]:
import json

# Sample chat data used to build the JSON file
data = {
    'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_meeting.jpg'},
    'is_still_participant': True,
    'joinable_mode': {'link': '', 'mode': 1},
    'magic_words': [],
    'messages': [
        {'content': 'See you soon!',
         'sender_name': 'User B',
         'timestamp_ms': 1675597571851},
        {'content': 'Thanks for the update! See you then.',
         'sender_name': 'User A',
         'timestamp_ms': 1675597435669},
        {'content': 'Actually, the green one is sold out.',
         'sender_name': 'User B',
         'timestamp_ms': 1675596277579},
        {'content': 'I was hoping to purchase the green one!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595140251},
        {'content': 'I’m really interested in the green one, not the red!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595109305},
        {'content': 'Here’s the $150 for it.',
         'sender_name': 'User B',
         'timestamp_ms': 1675595068468},
        {'photos': [{'creation_timestamp': 1675595059,
                     'uri': 'image_of_the_item.jpg'}],
         'sender_name': 'User B',
         'timestamp_ms': 1675595060730},
        {'content': 'It typically sells for at least $200 online',
         'sender_name': 'User B',
         'timestamp_ms': 1675595045152},
        {'content': 'How much are you asking?',
         'sender_name': 'User A',
         'timestamp_ms': 1675594799696},
        {'content': 'Good morning! $50 is far too low.',
         'sender_name': 'User B',
         'timestamp_ms': 1675577876645},
        {'content': 'Hello! I’m interested in the item you posted. I can offer $50. Let me know if that works for you. Thanks!',
         'sender_name': 'User A',
         'timestamp_ms': 1675549022673}
    ],
    'participants': [{'name': 'User A'}, {'name': 'User B'}],
    'thread_path': 'inbox/User A and User B chat',
    'title': 'User A and User B chat'
}

# Save the sample data to a JSON file
with open('../../docs/chat_data.json', 'w') as file:
    json.dump(data, file, indent=4)


In [94]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path="../../docs/chat_data.json",
                    jq_schema='.',
                    text_content=False)
docs = loader.load()

In [95]:
len(docs)

1

In [96]:
print(docs[0].page_content)
print(docs[0].metadata)

{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos": [{"creation_timestamp": 1675595059, "uri": "image_of_the_item.jpg"}], "sender_name": "User B", "timestamp_ms": 1675595060730}, {"content": "It typically sell

In [98]:
print(docs[0])

page_content='{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos": [{"creation_timestamp": 1675595059, "uri": "image_of_the_item.jpg"}], "sender_name": "User B", "timestamp_ms": 1675595060730}, {"content": "It 

In [99]:
docs

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 1}, page_content='{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos":

Suppose we want to extract the values under the `messages` key of the JSON data:

In [100]:
loader = JSONLoader(
    file_path='../../docs/chat_data.json',
    jq_schema='.messages[]',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 1}, page_content='{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 2}, page_content='{"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 3}, page_content='{"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 4}, page_content='{"content": "I was hoping to purchase the green one!", "sender_nam

Suppose we want to extract only the `content` field of each entry under the `messages` key:

In [101]:
loader = JSONLoader(
    file_path='../../docs/chat_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 1}, page_content='See you soon!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 2}, page_content='Thanks for the update! See you then.'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 3}, page_content='Actually, the green one is sold out.'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 4}, page_content='I was hoping to purchase the green one!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 5}, page_content='I’m really interested in the green one, not the red!'),
 Document(metadata={'sou
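
For readers unfamiliar with `jq`, the two expressions used above can be read as ordinary Python iteration. The sketch below is only an analogy built on a shortened copy of the chat data; it does not call `jq` or `JSONLoader`. It also highlights a detail of the sample data: the message that contains `photos` has no `content` key, so `.messages[].content` yields `null` for it (`None` in the Python analogy).

```python
chat = {
    "title": "User A and User B chat",
    "messages": [
        {"content": "See you soon!", "sender_name": "User B"},
        {"photos": [{"uri": "image_of_the_item.jpg"}], "sender_name": "User B"},
        {"content": "How much are you asking?", "sender_name": "User A"},
    ],
}

# Rough equivalent of the jq expression `.messages[]`: one item per message.
messages = list(chat["messages"])

# Rough equivalent of `.messages[].content`: a missing key becomes None (jq: null).
contents = [message.get("content") for message in messages]

# Keep only the messages that actually carry text.
text_only = [text for text in contents if text is not None]

print(contents)
print(text_only)
```

How `JSONLoader` renders such a missing value is not visible in the truncated output above, so it is worth inspecting the document for the photo message before indexing the results.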

#### Basic JSON Loading
Before choosing a `jq_schema`, read the raw file with Python's built-in `json` module to inspect its structure:

In [108]:
from pprint import pprint

In [109]:
file_path = '../../docs/facebook_chat.json'
with open(file_path, "r") as file:
    data = json.load(file)

pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
   

#### Using JSONLoader for Structured Retrieval
Use `jq_schema` to describe the data structure and extract only the required fields (schema-based retrieval).

In [110]:
loader = JSONLoader(
    file_path='../../docs/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 1}, page_content='Bye!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 2}, page_content='Oh no worries! Bye'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 3}, page_content='No Im sorry it was my mistake, the blue one is not for sale'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 4}, page_content='I thought you were selling the blue one!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 5}, page_content='Im not interested in this bag. Im interested in the blue one!')
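
To build intuition for what the `jq_schema` expression selects, the sketch below reproduces `.messages[].content` in plain Python. The inline sample is made up for illustration and only mirrors the shape of `facebook_chat.json`; `JSONLoader` itself relies on the `jq` package rather than on code like this.

```python
import json

# Hypothetical sample that mirrors the structure of facebook_chat.json.
raw = """
{
  "messages": [
    {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
    {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669}
  ]
}
"""
chat = json.loads(raw)

# '.messages[]' iterates over the list; '.content' picks one field from each item.
contents = [message["content"] for message in chat["messages"]]

# seq_num in the loader output is a 1-based position, mimicked here with enumerate.
for seq_num, text in enumerate(contents, start=1):
    print(seq_num, text)
```

Each selected value becomes the `page_content` of one `Document`, which is why the loader output above contains one document per message.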

#### Processing JSON Lines (JSONL): 
Set `json_lines=True` to handle files in which each line is a separate JSON object. The `jq_schema` is then applied to each line individually.

In [114]:
# Example - JSON (Processing JSON Lines)

loader = JSONLoader(
    file_path='../../docs/facebook_chat_messages.jsonl',
    jq_schema=".",
    text_content=False,
    json_lines=True
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 1}, page_content='{"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 2}, page_content='{"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no worries! Bye"}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 3}, page_content='{"sender_name": "User 2", "timestamp_ms": 1675596277579, "content": "No Im sorry it was my mistake, the blue one is not for sale"}')]
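
As a rough illustration of the JSON Lines format itself, the following sketch writes a few records to a temporary `.jsonl` file and reads them back one line at a time. The records and the helper names `write_jsonl` and `read_jsonl` are hypothetical; they are not part of LangChain.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records, shaped like the lines of facebook_chat_messages.jsonl.
records = [
    {"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"},
    {"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no worries! Bye"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Parse each non-empty line as an independent JSON document."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    write_jsonl(sample_path, records)
    loaded = read_jsonl(sample_path)

print(loaded)
```

Because every line is a complete JSON document, a `jq_schema` of `.` returns the whole record for that line, as in the output above.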


In [118]:
# Example - JSON Lines (extract a single field with jq_schema)

loader = JSONLoader(
    file_path='../../docs/facebook_chat_messages.jsonl',
    jq_schema='.sender_name',
    text_content=False,
    json_lines=True
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 1}, page_content='User 2'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 2}, page_content='User 1'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 3}, page_content='User 2')]


In [116]:
# Example - JSON (Use jq_schema='.' and content_key for simpler extraction)

loader = JSONLoader(
    file_path='../../docs/facebook_chat_messages.jsonl',
    jq_schema='.',
    content_key="sender_name",
    text_content=False,
    json_lines=True
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 1}, page_content='User 2'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 2}, page_content='User 1'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 3}, page_content='User 2')]


#### Adding Metadata from JSON: 
Pass a custom `metadata_func` to copy additional fields from each record into the document metadata, which improves data context and traceability.

In [120]:
# Example - JSON (Adding Metadata from JSON)

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata

loader = JSONLoader(
    file_path='../../docs/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func # Add metadata from JSON
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}, page_content='Bye!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}, page_content='Oh no worries! Bye'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}, page_content='No Im sorry it was my mistake, the blue one is not for sale'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}, page_content='I thought you were selling the blue one!'),
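
Since `metadata_func` is an ordinary Python function, it can be tried out without a loader. The sketch below calls it on a hypothetical record and a hypothetical base metadata dictionary that resembles the `source` and `seq_num` fields seen in the output above.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    # Copy selected fields from the JSON record into the metadata dictionary.
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata

# Hypothetical inputs for illustration only.
record = {"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}
base_metadata = {"source": "facebook_chat.json", "seq_num": 1}

# Pass a copy so the original dictionary stays untouched in this experiment.
enriched = metadata_func(record, dict(base_metadata))
print(enriched)
```

Using `record.get(...)` rather than `record[...]` means a record without one of these keys yields `None` instead of raising a `KeyError`.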

### PDF Loaders

[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your use case and is best found through experimentation.

Here we will see how to load PDF documents into the LangChain `Document` format.

We use a research paper to experiment with.

If an automated download fails, you can download the paper manually from http://arxiv.org/pdf/2103.15348.pdf, save it as `layoutparser_paper.pdf`, and upload it in Colab using the upload files option on the left.
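
One way to script the download is sketched below with the standard library only. The helper name `download_file` is hypothetical, and the example call is left commented out because it needs network access.

```python
import shutil
import urllib.request
from pathlib import Path

def download_file(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # Create the target directory if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path

# Example call (requires network access):
# download_file("http://arxiv.org/pdf/2103.15348.pdf", "layoutparser_paper.pdf")
```

The cells below read the file from `../../docs/layoutparser_paper.pdf`, so adjust the destination path to match your own directory layout.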

#### PyPDFLoader

Here we load a PDF using `pypdf` into a list of documents, where each document contains the page content and metadata with the page number. Typically, each PDF page becomes one document.

In [121]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../docs/layoutparser_paper.pdf")
pages = loader.load()

print(pages[0].page_content)
print(pages[0].metadata)

LayoutParser : A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1(  ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
shannons@allenai.org
2Brown University
ruochen zhang@brown.edu
3Harvard University
{melissadell,jacob carlson }@fas.harvard.edu
4University of Washington
bcgl@cs.washington.edu
5University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing an

In [122]:
len(pages)

16

In [123]:
pprint(pages[0])

Document(metadata={'source': '../../docs/layoutparser_paper.pdf', 'page': 0}, page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to impro

In [124]:
pprint(pages[0].page_content)

('LayoutParser : A Uniﬁed Toolkit for Deep\n'
 'Learning Based Document Image Analysis\n'
 'Zejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles '
 'Germain\n'
 'Lee4, Jacob Carlson3, and Weining Li5\n'
 '1Allen Institute for AI\n'
 'shannons@allenai.org\n'
 '2Brown University\n'
 'ruochen zhang@brown.edu\n'
 '3Harvard University\n'
 '{melissadell,jacob carlson }@fas.harvard.edu\n'
 '4University of Washington\n'
 'bcgl@cs.washington.edu\n'
 '5University of Waterloo\n'
 'w422li@uwaterloo.ca\n'
 'Abstract. Recent advances in document image analysis (DIA) have been\n'
 'primarily driven by the application of neural networks. Ideally, research\n'
 'outcomes could be easily deployed in production and extended for further\n'
 'investigation. However, various factors like loosely organized codebases\n'
 'and sophisticated model conﬁgurations complicate the easy reuse of im-\n'
 'portant innovations by a wide audience. Though there have been on-going\n'
 'eﬀorts to improve reu

In [125]:
print(pages[0].metadata)

{'source': '../../docs/layoutparser_paper.pdf', 'page': 0}
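
The extracted text above contains typographic ligatures (for example in "Uniﬁed"), a stray `\x00` character, and words hyphenated across line breaks ("im-" / "portant"). A minimal post-processing sketch, assuming such cleanup is wanted before further use, could look like the following. The helper `clean_pdf_text` is hypothetical and not part of LangChain or `pypdf`.

```python
import re
import unicodedata

def clean_pdf_text(text: str) -> str:
    """Normalize ligatures, drop NUL characters, and rejoin hyphenated line breaks."""
    # NFKC maps compatibility characters such as the 'fi' ligature (U+FB01) to plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Remove NUL characters that some PDFs leave in the extracted text.
    text = text.replace("\x00", "")
    # Rejoin words that were split with a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text

sample = "model con\ufb01gurations complicate the easy reuse of im-\nportant innovations"
print(clean_pdf_text(sample))
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end, and NFKC changes other compatibility characters as well (superscript digits, for instance), so treat this as a starting point rather than a general solution.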


#### PyMuPDFLoader

This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages. It uses the `pymupdf` library internally.

In [126]:
pip install --upgrade pymupdf

Collecting pymupdf
  Using cached pymupdf-1.26.3-cp39-abi3-macosx_11_0_arm64.whl.metadata (3.4 kB)
Using cached pymupdf-1.26.3-cp39-abi3-macosx_11_0_arm64.whl (22.4 MB)
Installing collected packages: pymupdf
  Attempting uninstall: pymupdf
    Found existing installation: PyMuPDF 1.24.5
    Uninstalling PyMuPDF-1.24.5:
      Successfully uninstalled PyMuPDF-1.24.5
Successfully installed pymupdf-1.26.3
Note: you may need to restart the kernel to use updated packages.


In [127]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("../../docs/layoutparser_paper.pdf")
pages = loader.load()

print(pages[0].page_content)
print(pages[0].metadata)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing

In [128]:
len(pages)

16

In [131]:
print(pages[0])

page_content='LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural langu

In [132]:
pages[0].metadata

{'source': '../../docs/layoutparser_paper.pdf',
 'file_path': '../../docs/layoutparser_paper.pdf',
 'page': 0,
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.21',
 'creationDate': 'D:20210622012710Z',
 'modDate': 'D:20210622012710Z',
 'trapped': ''}
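
The `creationDate` and `modDate` values are PDF date strings rather than Python datetimes. The sketch below converts the form seen here (`D:YYYYMMDDHHMMSSZ`) into a timezone-aware `datetime`. The helper `parse_pdf_date` is hypothetical and deliberately handles only this UTC form; PDF dates with explicit offsets or truncated fields are rejected.

```python
from datetime import datetime, timezone

def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date of the form D:YYYYMMDDHHMMSSZ (UTC only)."""
    if not (value.startswith("D:") and value.endswith("Z")):
        raise ValueError(f"Unsupported PDF date format: {value!r}")
    # Strip the 'D:' prefix and the trailing 'Z', then parse the 14 digits.
    parsed = datetime.strptime(value[2:-1], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)

print(parse_pdf_date("D:20210622012710Z").isoformat())
```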

In [133]:
print(pages[0].page_content)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing

In [134]:
print(pages[4].page_content)

LayoutParser: A Uniﬁed Toolkit for DL-Based DIA
5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset
Base Model1 Large Model
Notes
PubLayNet [38]
F / M
M
Layouts of modern scientiﬁc documents
PRImA [3]
M
-
Layouts of scanned modern magazines and scientiﬁc reports
Newspaper [17]
F
-
Layouts of scanned US newspapers from the 20th century
TableBank [18]
F
F
Table region on modern scientiﬁc and business document
HJDataset [31]
F / M
-
Layouts of history Japanese documents
1 For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀbetween accuracy
vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101
backbones [13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [28] (F) and Mask
R-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained
using the ResNet 101 backbone. The platform is maintained and

#### UnstructuredPDFLoader

[Unstructured.io](https://unstructured-io.github.io/unstructured/) supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's [`UnstructuredPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

In [22]:
pip install "unstructured==0.10.25" "pdfminer.six==20221105"

Collecting unstructured==0.10.25
  Downloading unstructured-0.10.25-py3-none-any.whl.metadata (25 kB)
Collecting pdfminer.six==20221105
  Using cached pdfminer.six-20221105-py3-none-any.whl.metadata (4.0 kB)
Downloading unstructured-0.10.25-py3-none-any.whl (1.7 MB)
Using cached pdfminer.six-20221105-py3-none-any.whl (5.6 MB)
Installing collected packages: unstructured, pdfminer.six
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.7 requires pdfminer.six==20250506, but you have pdfminer-six 20221105 which is incompatible.
Successfully installed pdfminer.six-20221105 unstructured-0.1

In [23]:
# # Advanced UnstructuredPDFLoader with comprehensive error handling
# import importlib
# import sys

# def test_unstructured_pdf_loader():
#     """Test UnstructuredPDFLoader with comprehensive error handling and fallbacks."""
#     try:
#         # Check if pdfminer.six is properly installed
#         try:
#             import pdfminer  # the pdfminer.six distribution installs the module `pdfminer`
#             print("✅ pdfminer.six is available")
#         except ImportError as e:
#             print(f"❌ pdfminer.six import error: {e}")
#             return False
            
#         # Check if unstructured can be imported
#         try:
#             from langchain_community.document_loaders import UnstructuredPDFLoader
#             print("✅ UnstructuredPDFLoader imported successfully")
#         except ImportError as e:
#             print(f"❌ UnstructuredPDFLoader import error: {e}")
#             return False
        
#         # Try to load the PDF
#         print("🔄 Attempting to load PDF with UnstructuredPDFLoader...")
#         loader = UnstructuredPDFLoader('../../docs/layoutparser_paper.pdf')
#         data = loader.load()
        
#         print(f"✅ Successfully loaded PDF with {len(data)} document(s)")
#         print(f"📄 First document content preview: {data[0].page_content[:200]}...")
#         print(f"📊 First document metadata: {data[0].metadata}")
#         return True
        
#     except Exception as e:
#         print(f"❌ Error loading PDF with UnstructuredPDFLoader: {e}")
#         print(f"🔍 Error type: {type(e).__name__}")
        
#         # Provide specific solutions based on error type
#         error_str = str(e).lower()
#         if "psparser" in error_str or "pdfminer" in error_str:
#             print("\n🔧 This is a pdfminer dependency conflict. Solutions:")
#             print("1. Restart your kernel after running the previous cell")
#             print("2. Clear Python import cache")
#             print("3. Use alternative PDF loaders (PyPDFLoader, PyMuPDFLoader)")
#         elif "ssl" in error_str or "certificate" in error_str:
#             print("\n🔧 This is an SSL/certificate issue. Try the NLTK SSL fix above.")
#         elif "permission" in error_str:
#             print("\n🔧 This is a file permission issue. Check file accessibility.")
        
#         return False

# def use_alternative_pdf_loader():
#     """Fallback to reliable PDF loaders if UnstructuredPDFLoader fails."""
#     print("\n🔄 Using reliable alternative: PyPDFLoader")
#     try:
#         from langchain_community.document_loaders import PyPDFLoader
#         loader = PyPDFLoader("../../docs/layoutparser_paper.pdf")
#         pages = loader.load()
#         print(f"✅ Successfully loaded {len(pages)} pages with PyPDFLoader")
#         print(f"📄 First page preview: {pages[0].page_content[:200]}...")
#         return pages
#     except Exception as e:
#         print(f"❌ PyPDFLoader also failed: {e}")
#         return None

# # Try UnstructuredPDFLoader first
# success = test_unstructured_pdf_loader()

# # If it fails, use alternative
# if not success:
#     print("\n" + "="*60)
#     print("🔄 FALLBACK: Using alternative PDF loader...")
#     print("="*60)
#     alternative_data = use_alternative_pdf_loader()

### 🔧 Troubleshooting PDF Loading Issues

If you encounter import errors with `UnstructuredPDFLoader`, here are the most common solutions:

#### **Root Cause:**
The errors `cannot import name 'PSSyntaxError'` or `cannot import name 'psparser'` typically occur because of:
- Conflicting `pdfminer` packages (old vs new versions)
- Python import cache holding onto old module references
- Incomplete package installations

#### **Complete Fix Process:**

1. **First, run the dependency cleanup cell above** ☝️
2. **Restart your Jupyter kernel** (Important!)
   - In Jupyter: `Kernel` → `Restart`
   - In VS Code: `Ctrl+Shift+P` → `Jupyter: Restart Kernel`
3. **Re-run the UnstructuredPDFLoader test cell**

#### **If Still Failing - Manual Environment Reset:**

```bash
# Complete environment cleanup (run in terminal)
pip uninstall -y pdfminer pdfminer.six pdfminer3k pycryptodome cryptography unstructured
pip cache purge
# Quote the requirements so the shell does not treat '>' as a redirect or '[...]' as a glob
pip install "pdfminer.six==20231228" "cryptography>=3.1"
pip install "unstructured[pdf]==0.14.0"
```

#### **Alternative PDF Loaders (Reliable Options):**

If `UnstructuredPDFLoader` continues to fail, use these proven alternatives:

```python
# Option 1: PyPDFLoader (Fast, Simple)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('../../docs/layoutparser_paper.pdf')
pages = loader.load()

# Option 2: PyMuPDFLoader (Fastest, Rich Metadata)
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader('../../docs/layoutparser_paper.pdf')
pages = loader.load()

# Option 3: PDFPlumberLoader (Great for Tables)
# pip install pdfplumber
from langchain_community.document_loaders import PDFPlumberLoader
loader = PDFPlumberLoader('../../docs/layoutparser_paper.pdf')
pages = loader.load()
```

#### **Why These Errors Happen:**
- The `unstructured` library has complex dependencies
- Different Python environments may have conflicting packages
- Import caching can hold onto old references even after package updates
- The solution is to clean everything and restart fresh
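
Before cleaning everything, it can help to see which of the potentially conflicting distributions are actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the list of distribution names are assumptions chosen to match the packages mentioned in this section.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names taken from the cleanup commands in this section.
candidates = ["pdfminer", "pdfminer.six", "pdfminer3k", "unstructured"]
for name, ver in installed_versions(candidates).items():
    print(f"{name}: {ver or 'not installed'}")
```

If more than one `pdfminer` variant shows a version, that is a likely source of the import errors described above.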



In [24]:
# # 🔄 Advanced Fix: Clear Python Import Cache (Alternative to Kernel Restart)
# # Run this cell if you can't restart your kernel but still have import issues

# import sys
# import importlib

# def clear_import_cache():
#     """Clear Python import cache for pdfminer-related modules."""
#     modules_to_clear = []
    
#     # Find all pdfminer-related modules in sys.modules
#     for module_name in list(sys.modules.keys()):
#         if any(pkg in module_name.lower() for pkg in ['pdfminer', 'unstructured']):
#             modules_to_clear.append(module_name)
    
#     # Remove them from cache
#     for module_name in modules_to_clear:
#         if module_name in sys.modules:
#             del sys.modules[module_name]
#             print(f"🗑️  Cleared cache for: {module_name}")
    
#     # Clear import caches
#     importlib.invalidate_caches()
#     print(f"✅ Cleared {len(modules_to_clear)} cached modules")
#     print("🔄 Now try importing UnstructuredPDFLoader again")

# print("🧹 Clearing Python import cache...")
# clear_import_cache()


In [26]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('../../docs/layoutparser_paper.pdf')
data = loader.load()

print(data[0].page_content)
print(data[0].metadata)

ImportError: cannot import name 'add_chunking_strategy' from 'unstructured.chunking.title' (/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/.venv/lib/python3.11/site-packages/unstructured/chunking/title.py)

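Next, load the PDF with complex parsing, table detection, and chunking by sections.

The helper below installs the `poppler-utils` and `tesseract-ocr` system packages on a Databricks cluster. For background, refer to https://community.databricks.com/t5/data-engineering/trying-to-use-pdf2image-on-databricks/td-p/12914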


In [17]:
#install poppler on the cluster (should be done by init scripts)
def install_ocr_on_nodes():
    """
    install poppler on the cluster (should be done by init scripts)
    """
    # from pyspark.sql import SparkSession
    import subprocess
    num_workers = max(1,int(spark.conf.get("spark.databricks.clusterUsageTags.clusterWorkers")))
    command = "sudo rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* && sudo apt-get clean && sudo apt-get update && sudo apt-get install poppler-utils tesseract-ocr -y" 
    def run_subprocess(command):
        try:
            output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
            return output.decode()
        except subprocess.CalledProcessError as e:
            raise Exception("An error occurred installing OCR libs:"+ e.output.decode())
    #install on the driver
    run_subprocess(command)
    def run_command(iterator):
        for x in iterator:
            yield run_subprocess(command)
    # spark = SparkSession.builder.getOrCreate()
    data = spark.sparkContext.parallelize(range(num_workers), num_workers) 
    # Use mapPartitions to run command in each partition (worker)
    output = data.mapPartitions(run_command)
    try:
        output.collect()
        print("OCR libraries installed")
    except Exception as e:
        print(f"Couldn't install on all nodes: {e}")
        # Re-raise with the original traceback intact
        raise
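
Outside Databricks, a quick local check can confirm whether the same system tools are reachable before attempting a `hi_res` load. This sketch assumes the relevant executables are `pdftoppm` (shipped with `poppler-utils`) and `tesseract` (shipped with `tesseract-ocr`); the helper name `find_binaries` is hypothetical.

```python
import shutil

# pdftoppm ships with poppler-utils; tesseract ships with tesseract-ocr.
REQUIRED_BINARIES = ("pdftoppm", "tesseract")

def find_binaries(names=REQUIRED_BINARIES):
    """Return a mapping of executable name to its path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in find_binaries().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```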

In [19]:
# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('../../docs/layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=3800, # preferred size of chunks
                               combine_text_under_n_chars=2000, # smaller chunks < 2000 chars will be combined into a larger chunk
                               mode='elements')
data = loader.load()


ModuleNotFoundError: No module named 'matplotlib'