# PDFPlumber

Like the PyMuPDF loader, `PDFPlumberLoader` returns one Document per page, and each Document carries detailed metadata about the PDF and its pages.

## Overview
### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: |  :---: |
| [PDFPlumberLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFPlumberLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Native Async Support |
| :---: | :---: | :---: |
| PDFPlumberLoader | ✅ | ❌ |

## Setup

### Credentials

No credentials are needed to use this loader.

If you want automated, best-in-class tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [None]:
# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community**.

In [5]:
%pip install -qU langchain_community

## Initialization

The loader also depends on the `pdfplumber` package. Install it, then instantiate the loader with the path to a PDF file:

In [9]:
!pip install -qU pdfplumber

In [10]:
from langchain_community.document_loaders import PDFPlumberLoader

# Path to the PDF to load; replace it with your own file.
loader = PDFPlumberLoader("/content/drive/MyDrive/NLP_System/Mlops_book.pdf")

## Load

In [11]:
docs = loader.load()
docs[0]

Document(metadata={'source': '/content/drive/MyDrive/NLP_System/Mlops_book.pdf', 'file_path': '/content/drive/MyDrive/NLP_System/Mlops_book.pdf', 'page': 0, 'total_pages': 616, 'Title': 'Practical MLOps', 'Author': 'Noah Gift & Alfredo Deza', 'Creator': '', 'Producer': 'ConvertAPI', 'CreationDate': "D:20240924203520+00'00'", 'ModDate': "D:20240924203530+00'00'"}, page_content='\n')

In [12]:
print(docs[0].metadata)

{'source': '/content/drive/MyDrive/NLP_System/Mlops_book.pdf', 'file_path': '/content/drive/MyDrive/NLP_System/Mlops_book.pdf', 'page': 0, 'total_pages': 616, 'Title': 'Practical MLOps', 'Author': 'Noah Gift & Alfredo Deza', 'Creator': '', 'Producer': 'ConvertAPI', 'CreationDate': "D:20240924203520+00'00'", 'ModDate': "D:20240924203530+00'00'"}
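The `CreationDate` and `ModDate` values in the metadata above are raw PDF date strings. The following is a minimal sketch of converting such a string into a `datetime`; `parse_pdf_date` is a hypothetical helper name, not part of LangChain or pdfplumber, and it assumes the full-precision `D:YYYYMMDDHHmmSS` form with an optional UTC offset, as seen in the output above.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Matches full-precision PDF date strings such as "D:20240924203520+00'00'".
_PDF_DATE = re.compile(
    r"^D:(?P<stamp>\d{14})"
    r"(?:(?P<sign>[+\-Z])(?:(?P<hh>\d{2})'?(?:(?P<mm>\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None if it does not match the assumed form."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    naive = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    sign = match["sign"]
    if sign is None:
        return naive  # no offset given, so the result stays timezone-naive
    offset = timedelta(hours=int(match["hh"] or 0), minutes=int(match["mm"] or 0))
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20240924203520+00'00'")
print(created.isoformat())  # 2024-09-24T20:35:20+00:00
```

With a loaded document, the same call would be `parse_pdf_date(docs[0].metadata["CreationDate"])`, assuming the key is present for your PDF.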


## Lazy Load

In [13]:
page = []
for doc in loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []
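
Note that the loop above only acts on full groups of 10, so any pages left over at the end stay in `page` unprocessed. One possible way to handle this is a small batching generator that also yields the final partial batch. This is a sketch using only the standard library; `batched` is a hypothetical helper name, and `range(25)` stands in for `loader.lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): any iterable works here.
sizes = [len(batch) for batch in batched(range(25), 10)]
print(sizes)  # [10, 10, 5]
```

With the loader, this would read `for batch in batched(loader.lazy_load(), 10): ...`, with the paged operation (for example an index upsert) placed in the loop body.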

In [16]:
print(dir(page[0]))  # Replace page[0] with any valid Document object


['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_model__', '__pydantic_serializer__', '__pydantic_validator__', '__reduce__', '__reduce_ex__', '__replace__', ...

In [25]:
import json

# Convert Document objects to a serializable format
serializable_pages = [{"content": doc.page_content} for doc in page]

# Save the content to a JSON file
with open("/content/drive/MyDrive/processed_pages.json", "w") as file:
    json.dump(serializable_pages, file, indent=4)

print("Document content has been saved to processed_pages.json.")


Document content has been saved to processed_pages.json.
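
The cell above stores only `page_content`, so the per-page metadata is lost. If you also want to keep it, a sketch along the following lines may help. `documents_to_records` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that merely expose the same `page_content` and `metadata` attributes as a LangChain Document, so the example runs without any PDF.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def documents_to_records(docs: Iterable[Any]) -> List[Dict[str, Any]]:
    """Turn objects exposing `page_content` and `metadata` into plain dicts."""
    return [
        {"content": doc.page_content, "metadata": dict(doc.metadata)}
        for doc in docs
    ]


# Stand-in objects (assumption: real Documents expose the same two attributes).
sample = [
    SimpleNamespace(page_content="first page", metadata={"page": 0, "total_pages": 2}),
    SimpleNamespace(page_content="second page", metadata={"page": 1, "total_pages": 2}),
]

# default=str is a fallback for metadata values that json cannot encode natively.
text = json.dumps(documents_to_records(sample), indent=4, default=str)
restored = json.loads(text)
print(restored[1]["metadata"]["page"])  # 1
```

Keeping the `page` number next to the content makes it possible to trace any later NLP result back to its source page.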


In [24]:
# Assuming you are using some document loader or extractor
page = []
for doc in loader.lazy_load():  # Replace this with your own loader
    page.append(doc)
    if len(page) >= 10:  # Or however many pages you want to process at once
        break  # Just for testing purposes, processes the first 10 pages

# Now, check the contents of the page list
print(len(page))  # Verify there are elements
if len(page) > 0:
    print(type(page[0]))  # Check the type of the first element
    print(page[0])  # View the content of the first element


10
<class 'langchain_core.documents.base.Document'>
page_content='
' metadata={'source': '/content/drive/MyDrive/NLP_System/Mlops_book.pdf', 'file_path': '/content/drive/MyDrive/NLP_System/Mlops_book.pdf', 'page': 0, 'total_pages': 616, 'Title': 'Practical MLOps', 'Author': 'Noah Gift & Alfredo Deza', 'Creator': '', 'Producer': 'ConvertAPI', 'CreationDate': "D:20240924203520+00'00'", 'ModDate': "D:20240924203530+00'00'"}


In [None]:
import json

# Convert Document objects to a serializable format
serializable_pages = [{"content": doc.page_content} for doc in page]

# Save the content to a JSON file
with open("processed_pages.json", "w") as file:
    json.dump(serializable_pages, file, indent=4)

print("Document content has been saved to processed_pages.json.")



In [18]:
import json

# Load the JSON data from the saved file
with open("processed_pages.json", "r") as file:
    loaded_pages = json.load(file)

# Print the loaded data to verify it
for idx, record in enumerate(loaded_pages):
    # Show the first 500 characters of each page for quick inspection
    print(f"Page {idx + 1} content: {record['content'][:500]}...")


Page 1 content: Mechanical Turk data labeling, Mechanical Turk Data Labeling
SQS queue, defined, Key Terms
SSH access, Running a Container
statistics, descriptive, Descriptive Statistics and Normal Distributions-
Descriptive Statistics and Normal Distributions
supervised machine learning, Machine Learning Key Concepts
swagger, defined, Key Terms
T
Taleb, Nassim, Focus on Prediction Accuracy Versus the Big Picture
target dataset, baseline dataset versus, Monitoring Drift with AWS
SageMaker
task tracking (in tech...
Page 2 content: technical project management, Technical Project Management-Task Tracking
as DevOps best practice, DevOps and MLOps
project plan, Project Plan
task tracking, Task Tracking
weekly demo, Weekly Demo
technology certifications (see certifications)
TensorFlow
converting into ONNX, Convert TensorFlow into ONNX-Convert
TensorFlow into ONNX
TFHub, TFHub
TensorFlow Developer Certificate, GCP
TensorFlow Playground, Optimization
TensorFlow Processing Unit (see TPU)
"10X b

In [19]:
# Load the JSON data
with open("processed_pages.json", "r") as file:
    loaded_pages = json.load(file)

# Access content for NLP processing
for record in loaded_pages:
    print(record["content"])  # Process each page's content


Mechanical Turk data labeling, Mechanical Turk Data Labeling
SQS queue, defined, Key Terms
SSH access, Running a Container
statistics, descriptive, Descriptive Statistics and Normal Distributions-
Descriptive Statistics and Normal Distributions
supervised machine learning, Machine Learning Key Concepts
swagger, defined, Key Terms
T
Taleb, Nassim, Focus on Prediction Accuracy Versus the Big Picture
target dataset, baseline dataset versus, Monitoring Drift with AWS
SageMaker
task tracking (in technical project management), Task Tracking
technical communication (DevOps best practice), DevOps and MLOps
technical portfolio, building a, Building a Technical Portfolio for MLOps-
Getting a Job: Don’t Storm the Castle, Walk in the Backdoor
project example: cloud native ML application or API, Project: Build
Cloud Native ML Application or API
project example: Docker and Kubernetes container project, Project:
Docker and Kubernetes Container Project
project example: edge ML solution, Project: Build
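
Some pages carry no extractable text. The first Document loaded earlier, for example, has a `page_content` of just `'\n'`. Before NLP processing it can be useful to drop such records. The sketch below assumes the `{"content": ...}` record shape used in the JSON file above; `drop_blank_pages` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List


def drop_blank_pages(pages: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose "content" contains non-whitespace text."""
    return [page for page in pages if page.get("content", "").strip()]


# Stand-in records mirroring the structure written to processed_pages.json.
sample_pages = [
    {"content": "\n"},
    {"content": "SQS queue, defined, Key Terms"},
    {"content": "   "},
]
kept = drop_blank_pages(sample_pages)
print(len(kept))  # 1
```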

## API reference

For detailed documentation of all PDFPlumberLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFPlumberLoader.html