# Document loaders

---

Alejandro Ricciardi (Omegapy)  
created date: 01/23/2024   
[GitHub](https://github.com/Omegapy)  

Credit: [LangChain](https://python.langchain.com/docs/expression_language/)

<br>

--- 

 
Projects Description:  
**LangChain** is a framework for developing applications powered by language models.  
**In this project:** This project is a series of LangChain document loaders for LLMs tutorials on Jupyter Notebook.  
The tutorials are a series LangChain Python code examples from the https://python.langchain.com/ website.

Specifically from the section [Document loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/).

⚠️ **Info**: Head to [Integrations]( https://python.langchain.com/docs/integrations/document_loaders/) for documentation on built-in document loader integrations with 3rd-party tools.

Use document loaders to load data from a source as Document's. A ```Document``` is a piece of text and associated metadata. For example, there are document loaders for loading a simple ```.txt``` file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

Document loaders provide a "load" method for loading data as documents from a configured source. They optionally implement a "lazy load" as well for lazily loading data into memory.

<p></p>
<b style="font-size:15;">
⚠️ This project requires an OpenAI key.
</b>


##### Project Map  
- [API Keys](#api-keys)  
- [Getting started (.txt)](@getting-started)
- [CSV](#csv)
    - [Base Example](#base-example-csv)
    - [Customizing the CSV parsing and loading](#customizing-the-csv-parsing-and-loading)
    - [Specify a column to identify the document source](#specify-a-column-to-identify-the-document-source)
- [File Directory](#file-directory)
    - [Base Example](#base-example-file-directory)
    - [Show a progress bar](#show-a-progress-bar)
    - [Use multithreading](#use-multithreading)
    - [Change loader class](#change-loader-class)
        - [Base Example](#base-example-change-loader-class)
        - [Load Python Source Code](#load-python-source-code)
    - [Auto-detect file encodings with TextLoader](#auto-detect-file-encodings-with-textloader)
        - [A. Default Behavior](#a-default-behavior)
        - [B. Silent fail](#b-silent-fail)
        - [C. Auto detect encodings](#c-auto-detect-encodings)
- [HTML](#html)
- [JSON](#json)
    - [Base Example (JSON)](#base-example-json)
    - [Using JSONLoader (Linux Only)](#using-jsonloader-linux-only)
    - [Using JSONLoader for Windows()](#using-jsonloader-for-windows)
- [Markdown](#markdown)
- [Using PyPDF](#using-pypdf)
- [Extracting images](#extracting-images)
- [Using Unstructured](#using-unstructured)
- [Using PyPDFium2](#using-pypdfium2)
- [Using PDFMiner](#using-pdfminer)
- [Using PDFMiner to generate HTML text](#using-pdfminer-to-generate-html-text)
- [PyPDF Directory](#pypdf-directory)
- [Using PDFPlumber](#using-pdfplumber)
- [Using AmazonTextractPDFParser](#Using AmazonTextractPDFParser)

<br>

---


#### API Keys

In [52]:
import os
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ.get("OPEN_AI_KEY")

---
## Getting started


<br>

---

In [53]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/Portfolio Solutions for a Business Enterprise-Wide Upgrade.txt")
loader.load()

[Document(page_content="\n\n\n\nPortfolio: Solutions for a Business Enterprise-Wide Upgrade\nAlejandro Ricciardi\nColorado State University Global\nCSC300: Operating Systems and Architecture\nJoe Rangitsch\nAugust 6, 2023\n\n\n\n      Portfolio: Solutions for a Business Enterprise-Wide Upgrade\n      In a rapidly advancing technological landscape, to remain competitive, businesses need to adapt by modernizing their systems. In this portfolio essay, I take on the role of a consultant for a local business, that was asked to propose an enterprise-wide upgrade solution that includes operating systems, mass storage, virtualization, and security for a company. The company currently has a mix of operating systems, including several legacy machines. Additionally, the company does not currently use virtual machines but is strongly considering them. Furthermore, the company's core business is software testing, but it is considering offering a storage solution. In this paper, I outline the advant

[Project Map](#project-map)

---

---
## CSV
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

<br>

---

### Base Example (CSV)

In [54]:
from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='./data/mlb_teams_2012.csv')
data = loader.load()
data

[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', metadata={'source': './data/mlb_teams_2012.csv', 'row': 0}),
 Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', metadata={'source': './data/mlb_teams_2012.csv', 'row': 1}),
 Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', metadata={'source': './data/mlb_teams_2012.csv', 'row': 2}),
 Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 3}),
 Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 4}),
 Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 5}),
 Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', metadata={'source': './data/mlb_teams_2012.csv', 'row': 6}),
 Doc

### Customizing the CSV parsing and loading

See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information of what csv args are supported.

In [55]:
loader = CSVLoader(file_path='./data/mlb_teams_2012.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['MLB Team', 'Payroll in millions', 'Wins']
})

data = loader.load()
data


[Document(page_content='MLB Team: Team\nPayroll in millions: "Payroll (millions)"\nWins: "Wins"', metadata={'source': './data/mlb_teams_2012.csv', 'row': 0}),
 Document(page_content='MLB Team: Nationals\nPayroll in millions: 81.34\nWins: 98', metadata={'source': './data/mlb_teams_2012.csv', 'row': 1}),
 Document(page_content='MLB Team: Reds\nPayroll in millions: 82.20\nWins: 97', metadata={'source': './data/mlb_teams_2012.csv', 'row': 2}),
 Document(page_content='MLB Team: Yankees\nPayroll in millions: 197.96\nWins: 95', metadata={'source': './data/mlb_teams_2012.csv', 'row': 3}),
 Document(page_content='MLB Team: Giants\nPayroll in millions: 117.62\nWins: 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 4}),
 Document(page_content='MLB Team: Braves\nPayroll in millions: 83.31\nWins: 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 5}),
 Document(page_content='MLB Team: Athletics\nPayroll in millions: 55.37\nWins: 94', metadata={'source': './data/mlb_teams_2012.

### Specify a column to identify the document source
Use the ```source_column``` argument to specify a source for the document created from each row. Otherwise file_path will be used as the source for all documents created from the CSV file.

This is useful when using documents loaded from CSV files for chains that answer questions using sources.

In [56]:
loader = CSVLoader(file_path='./data/mlb_teams_2012.csv', source_column="Team")

data = loader.load()
data

[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', metadata={'source': 'Nationals', 'row': 0}),
 Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', metadata={'source': 'Reds', 'row': 1}),
 Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', metadata={'source': 'Yankees', 'row': 2}),
 Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', metadata={'source': 'Giants', 'row': 3}),
 Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', metadata={'source': 'Braves', 'row': 4}),
 Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', metadata={'source': 'Athletics', 'row': 5}),
 Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', metadata={'source': 'Rangers', 'row': 6}),
 Document(page_content='Team: Orioles\n"Payroll (millions)": 81.43\n"Wins": 93', metadata={'source': 'Orioles', 'row': 7}),
 Docume

[Project Map](#project-map)

---

---
## File Directory
This covers how to load all documents in a directory.

Under the hood, by default this uses the [UnstructuredLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file).

<br>

---

### Base Example (File Directory)

In [57]:
from langchain_community.document_loaders import DirectoryLoader

We can use the ```glob``` parameter to control which files to load. 
Note that here it doesn't load the ```.rst``` file or the .```html``` files.

In [58]:
loader = DirectoryLoader('../', glob="**/*.txt") 

docs = loader.load() # pip install unstructured, note pip install "unstructured[md]" not available on Window Pro PC

len(docs)

ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (p:\Projects\LLM-Frameworks-Tutorials\LangChain Tutorials\Tutorials from Langchain\.venv\Lib\site-packages\pdfminer\utils.py)

[Project Map](#project-map)

---

### Show a progress bar

**By default a progress bar will not be shown.** 
To show a progress bar, install the tqdm library (e.g. ```pip install tqdm```), and set the show_progress parameter to ```True```.

In [59]:
loader = DirectoryLoader('../', glob="**/*.txt", show_progress=True)
docs = loader.load()

  0%|          | 0/5 [00:00<?, ?it/s]

ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (p:\Projects\LLM-Frameworks-Tutorials\LangChain Tutorials\Tutorials from Langchain\.venv\Lib\site-packages\pdfminer\utils.py)

[Project Map](#project-map)

---

### Use multithreading

By default the loading happens in one thread. In order to utilize several threads set the ```use_multithreading``` flag to ```true```.

No multithreading

In [62]:
%%timeit
loader = DirectoryLoader('../', glob="**/*.txt")
docs = loader.load()

ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (p:\Projects\LLM-Frameworks-Tutorials\LangChain Tutorials\Tutorials from Langchain\.venv\Lib\site-packages\pdfminer\utils.py)

 Multithreading

In [60]:
%%timeit
loader = DirectoryLoader('../', glob="**/*.txt", use_multithreading=True)
docs = loader.load()

9.51 ms ± 838 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


[Project Map](#project-map)

---

### Change loader class
By default this uses the ```UnstructuredLoader``` class. However, you can change up the type of loader pretty easily.

##### Base Example (Change loader class)

In [61]:
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.txt", loader_cls=TextLoader)

docs = loader.load()

len(docs)

RuntimeError: Error loading ..\Retrieval\data\example-non-utf8.txt

##### Load Python Source Code
If you need to load Python source code files, use the ```PythonLoader```.

In [63]:
from langchain_community.document_loaders import PythonLoader

loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)

docs = loader.load()

len(docs)

15

[Project Map](#project-map)

---

### Auto-detect file encodings with TextLoader

In this example we will see some strategies that can be useful when loading a large list of arbitrary files from a directory using the ```TextLoader``` class.

First to illustrate the problem, let's try to load multiple texts with arbitrary encodings.

In [64]:
path = './data'
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)

##### A. Default Behavior

In [65]:
loader.load()

RuntimeError: Error loading data\example-non-utf8.txt

The file example-non-utf8.txt uses a different encoding, so the load() function fails with a helpful message indicating which file failed decoding.

With the default behavior of TextLoader any failure to load any of the documents will fail the whole loading process and no documents are loaded.

##### B. Silent fail
We can pass the parameter ```silent_errors``` to the ```DirectoryLoader``` to skip the files which could not be loaded and continue the load process.

The example-non-utf8.txt will not be loaded in docs

In [66]:
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()

# the example-non-utf8.txt will not be loaded in docs

Error loading file data\example-non-utf8.txt: Error loading data\example-non-utf8.txt


In [67]:
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

['data\\An Overview of Thread Implementation in Multi-Threaded Computer Systems.txt',
 'data\\Portfolio Solutions for a Business Enterprise-Wide Upgrade.txt',
 'data\\whatsapp_chat.txt',
 'data\\fake_discord_data\\output.txt']

[Project Map](#project-map)

---

##### C. Auto detect encodings
We can also ask ```TextLoader``` to auto detect the file encoding before failing, by passing the ```autodetect_encoding``` to the loader class.

This will load example-non-utf8.txt without generating an error

In [68]:
text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()

In [69]:
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

['data\\An Overview of Thread Implementation in Multi-Threaded Computer Systems.txt',
 'data\\example-non-utf8.txt',
 'data\\Portfolio Solutions for a Business Enterprise-Wide Upgrade.txt',
 'data\\whatsapp_chat.txt',
 'data\\fake_discord_data\\output.txt']

[Project Map](#project-map)

---

---
## HTML
The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.

This covers how to load HTML documents into a document format that we can use downstream.

<br>

---

### Base Example (HTML)

In [70]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("data/fake-content.html")
data = loader.load()

data

 20%|██        | 1/5 [00:36<02:27, 36.95s/it]


[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': 'data/fake-content.html'})]

### Loading HTML with BeautifulSoup4

We can also use ```BeautifulSoup4``` to load HTML documents using the ```BSHTMLLoader```. This will extract the text from the HTML ```into page_content```, and the ```page title``` as title ```into metadata```.

In [71]:
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("data/fake-content.html")
data = loader.load()
data

[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'data/fake-content.html', 'title': 'Test Title'})]

[Project Map](#project-map)

---

---
## JSON
[JSON](https://en.wikipedia.org/wiki/JSON) (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

[JSON Lines](https://jsonlines.org/) is a file format where each line is a valid JSON value.

The ```JSONLoader``` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files. It uses the jq python package `(Only for Linux OSes)`. Check this [manual](https://jqlang.github.io/jq/manual/) for a detailed documentation of the ```jq``` syntax.This covers how to load HTML documents into a document format that we can use downstream.

<br>

---



### Base Example (JSON)

In [72]:
# jq works only with linux OSes
#!pip install jq

In [73]:
import json
from pathlib import Path
from pprint import pprint


file_path='data/facebook_chat.json'
data = json.loads(Path(file_path).read_text())

pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
   

### Using JSONLoader (Linux Only)

Suppose we are interested in extracting the values under the ```content``` field within the ```messages``` key of the JSON data. This can easily be done through the ```JSONLoader``` as shown below.

In [74]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()

pprint(data)

ImportError: jq package not found, please install it with `pip install jq`

### Using JSONLoader for Windows

credit: [ddematheu](https://github.com/langchain-ai/langchain/issues/4396)

In [75]:
import json
from pathlib import Path
from typing import List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        
    def create_documents(self, processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

In [76]:
loader = JSONLoader('data/facebook_chat.json')
loader.load()

pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
   

[Project Map](#project-map)

---

---
## Markdown
[Markdown](https://en.wikipedia.org/wiki/Markdown) is a lightweight markup language for creating formatted text using a plain-text editor.

This covers how to load Markdown documents into a document format that we can use downstream.

<br>

---



In [77]:
!pip install unstructured > /dev/null

The system cannot find the path specified.


In [78]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

markdown_path = "README.md"
loader = UnstructuredMarkdownLoader(markdown_path)

data = loader.load()
data

[Document(page_content='\ufeff-----------------------------------------------------------------------------------------------------------------------------\n\nRetrieval\n\nAlejandro Ricciardi (Omegapy)\n\ncreated date: 01/23/2024\n\nProjects Description:\n\nThis project is a series of LangChain data retrieval for LLMs tutorials on Jupyter Notebook.\n\nThe tutorials are a series LangChain Python code examples from the https://python.langchain.com/ website\n\nSpecifically from the section Retrieval\n\n⚠️ This project requires an OpenAI key.\n\nRequirements:\n\nPython\n\nJupyter Notebook\n\nLangChain \n-\n\nOpenAI API Key\n\nMy Links:\n\nGitHub\n\nFacebook\n\nTwitter\n\nInstagram\n\nRetrieval\n\nMany LLM applications require user-specific data that is not part of the model\'s training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.\n\nLangChain

[Project Map](#project-map)

---

---
## PDF
[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

This covers how to load PDF documents into the Document format that we use downstream.

<br>

---



### Using PyPDF

Load PDF using pypdf into array of documents, where each document contains the page content and metadata with page number.

In [79]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/layout-parser-paper.pdf")
pages = loader.load_and_split()

pages[0]

Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in 

An advantage of this approach is that documents can be retrieved with page numbers.

We want to use ```OpenAIEmbeddings``` so we have to get the OpenAI API Key.


Note: FAISS (Facebook AI Similarity Search) is primarily developed for Unix-based systems.
```FAISS.from_documents(pages, OpenAIEmbeddings())``` will generate an error on Windows OSes



In [80]:
from langchain_community.vectorstores import FAISS 
from langchain_openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])

ImportError: Could not import faiss python package. Please install it with `pip install faiss-gpu` (for CUDA supported GPU) or `pip install faiss-cpu` (depending on Python version).

### Extracting images
Using the ```rapidocr-onnxruntime``` package we can extract images as text as well:

In [None]:
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True) # pip install rapidocr-onnxruntime
pages = loader.load()
pages[4].page_content

### Using MathPix
you need a mathpix_api_key
Inspired by Daniel Gross's https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21

In [81]:
from langchain_community.document_loaders import MathpixPDFLoader

loader = MathpixPDFLoader("data/layout-parser-paper.pdf")
data = loader.load()

ValueError: Did not find mathpix_api_key, please add an environment variable `MATHPIX_API_KEY` which contains it, or pass `mathpix_api_key` as a named parameter.

[Project Map](#project-map)

---

### Using Unstructured

In [82]:
!pip install pdf2image
!pip install pdfminer.six



In [83]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("data/layout-parser-paper.pdf")
data = loader.load()
data[0]

ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (p:\Projects\LLM-Frameworks-Tutorials\LangChain Tutorials\Tutorials from Langchain\.venv\Lib\site-packages\pdfminer\utils.py)

### Using PyPDFium2

In [None]:
!pip install pypdfium2

In [None]:
from langchain_community.document_loaders import PyPDFium2Loader

loader = PyPDFium2Loader("data/layout-parser-paper.pdf")

data = loader.load()

[Project Map](#project-map)

---

### Using PDFMiner

In [None]:
!pip install pdfminer.six

In [84]:
from langchain_community.document_loaders import PDFMinerLoader

loader = PDFMinerLoader("data/layout-parser-paper.pdf")

data = loader.load()

ImportError: `pdfminer` package not found, please install it with `pip install pdfminer.six`

### Using PDFMiner to generate HTML text
This can be helpful for chunking texts semantically into sections as the output html content can be parsed via BeautifulSoup to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.

In [85]:
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

loader = PDFMinerLoader("data/layout-parser-paper.pdf")

data = loader.load()[0]   # entire PDF is loaded as a single Document

from bs4 import BeautifulSoup
soup = BeautifulSoup(data.page_content,'html.parser')
content = soup.find_all('div')

import re
cur_fs = None
cur_text = ''
snippets = []   # first collect all snippets that have the same font size
for c in content:
    sp = c.find('span')
    if not sp:
        continue
    st = sp.get('style')
    if not st:
        continue
    fs = re.findall('font-size:(\d+)px',st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text,cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)

  fs = re.findall('font-size:(\d+)px',st)


ImportError: `pdfminer` package not found, please install it with `pip install pdfminer.six`

In [86]:
from langchain.docstore.document import Document
cur_idx = -1
semantic_snippets = []
# Assumption: headings have higher font size than their respective content
for s in snippets:
    # if current snippet's font size > previous section's heading => it is a new heading
    if not semantic_snippets or s[1] > semantic_snippets[cur_idx].metadata['heading_font']:
        metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
        metadata.update(data.metadata)
        semantic_snippets.append(Document(page_content='',metadata=metadata))
        cur_idx += 1
        continue

    # if current snippet's font size <= previous section's content => content belongs to the same section (one can also create
    # a tree like structure for sub sections if needed but that may require some more thinking and may be data specific)
    if not semantic_snippets[cur_idx].metadata['content_font'] or s[1] <= semantic_snippets[cur_idx].metadata['content_font']:
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata['content_font'] = max(s[1], semantic_snippets[cur_idx].metadata['content_font'])
        continue

    # if current snippet's font size > previous section's content but less than previous section's heading than also make a new
    # section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
    metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
    metadata.update(data.metadata)
    semantic_snippets.append(Document(page_content='',metadata=metadata))
    cur_idx += 1

NameError: name 'snippets' is not defined

[Project Map](#project-map)

---

### PyPDF Directory
Load PDFs from directory

In [87]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("data/")

docs = loader.load()
docs

[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in

### Using PDFPlumber
Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and returns one document per page.

In [88]:
from langchain_community.document_loaders import PDFPlumberLoader

loader = PyPDFium2Loader("data/layout-parser-paper.pdf")

data = loader.load()
data[0]

Document(page_content='LayoutParser: A Unified Toolkit for Deep\r\nLearning Based Document Image Analysis\r\nZejiang Shen\r\n1\r\n(\r\n\ufffe), Ruochen Zhang\r\n2\r\n, Melissa Dell\r\n3\r\n, Benjamin Charles Germain\r\nLee\r\n4\r\n, Jacob Carlson\r\n3\r\n, and Weining Li\r\n5\r\n1 Allen Institute for AI\r\nshannons@allenai.org 2 Brown University\r\nruochen zhang@brown.edu 3 Harvard University\r\n{melissadell,jacob carlson\r\n}@fas.harvard.edu\r\n4 University of Washington\r\nbcgl@cs.washington.edu 5 University of Waterloo\r\nw422li@uwaterloo.ca\r\nAbstract. Recent advances in document image analysis (DIA) have been\r\nprimarily driven by the application of neural networks. Ideally, research\r\noutcomes could be easily deployed in production and extended for further\r\ninvestigation. However, various factors like loosely organized codebases\r\nand sophisticated model configurations complicate the easy reuse of im\ufffeportant innovations by a wide audience. Though there have been on-goi

[Project Map](#project-map)

---

### Using AmazonTextractPDFParser
The ```AmazonTextractPDFLoad``` calls the [Amazon Textract Service](https://aws.amazon.com/textract/) to convert PDFs into a Document structure. The loader does pure OCR at the moment, with more features like layout support planned, depending on demand. Single and multi-page documents are supported with up to 3000 pages and 512 MB of size.

For the call to be successful an AWS account is required, similar to the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) requirements.

Besides the AWS configuration, it is very similar to the other PDF loaders, while also supporting JPEG, PNG and TIFF and non-native PDF formats.

In [89]:
!pip install amazon-textract-caller



In [90]:
from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("data/images.jpg")
documents = loader.load()

ImportError: Could not import amazon-textract-caller or amazon-textract-textractor python package. Please install it with `pip install amazon-textract-caller` & `pip install amazon-textract-textractor`.

[Project Map](#project-map)
