**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Exploring Document Loaders in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [0]:
!pip install -q langchain==0.3.11
!pip install -q langchain-openai==0.2.12
!pip install -q langchain-community==0.3.11
!pip install -q jq==1.7.0
!pip install -q pypdf==4.2.0
!pip install -q PyMuPDF==1.24.5

In [0]:
# For Windows
# !winget install jqlang.jq 

In [0]:
# takes 2 - 5 mins to install on Colab
!pip install -q "unstructured[all-docs]==0.14.0"

After installing `unstructured` above, restart your session when the popup below appears. If it does not appear, go to `Runtime` and select `Restart Session`.

![](https://i.imgur.com/UOBaotk.png)

In [0]:
# install OCR dependencies for unstructured
!sudo apt-get install -y tesseract-ocr
!sudo apt-get install -y poppler-utils

In [0]:
# !pip install -q pytesseract
# !pip install -q pdf2image

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that enables the loading of data as documents from a pre-configured source.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which allows data to be loaded into memory gradually as needed.

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
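
As a rough illustration of the two loading styles above, the sketch below uses a hypothetical `SimpleDocument` dataclass as a stand-in for LangChain's `Document` (it is not the library class) and contrasts an eager `load` with a generator-based `lazy_load`. The function names and the sample lines are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def lazy_load(lines: list[str]) -> Iterator[SimpleDocument]:
    """Yield one document at a time, so only one is built per iteration."""
    for i, line in enumerate(lines):
        yield SimpleDocument(page_content=line, metadata={"line_number": i})


def load(lines: list[str]) -> list[SimpleDocument]:
    """Eager variant: materialize every document in a list."""
    return list(lazy_load(lines))


sample_lines = ["first line", "second line", "third line"]

all_docs = load(sample_lines)               # everything in memory at once
first_doc = next(lazy_load(sample_lines))   # only the first document is built

print(len(all_docs))
print(first_doc)
```

The lazy variant matters mostly for large sources, where building every document up front would use more memory than processing them one by one.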


### Text Loader

The simplest loader reads in a file as text and places it all into one document.



In [0]:
# !curl -o README.md https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md

In [0]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./docs/dummy.txt")
doc = loader.load()

In [0]:
len(doc)

In [0]:
print(f"All loaded documents: \n{doc}\n")

In [0]:
print(f"\nThe number of documents : {len(doc)}\n")

In [0]:
type(doc[0])

In [0]:
print(f"\n Type of first document : {type(doc[0])}\n")

In [0]:
print(f"Content of first document : \n{doc[0].page_content}\n")

In [0]:
print(f"Metadata of first document : \n{doc[0].metadata}")

In [0]:
print(doc[0].page_content[:100])

### Markdown Loader

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

This section shows how to load Markdown documents into LangChain `Document` objects that we can use in our pipelines and chains.

Load the whole document as a single `Document` (`mode='single'`).

Download the NLTK packages first if needed:

In [0]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

In [0]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./docs/README.md", mode='single')
docs = loader.load()

In [0]:
len(docs)

In [0]:
type(docs[0])

In [0]:
print(docs[0].metadata)

In [0]:
print(docs[0].page_content[:100])

Load the document and split it into elements (`mode="elements"`), producing one `Document` per element:

In [0]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./docs/README.md", mode="elements")
docs = loader.load()

In [0]:
len(docs)

In [0]:
docs[:10]

In [0]:
from collections import Counter
Counter([doc.metadata['category'] for doc in docs])

In [0]:
docs[0].metadata

Compare the raw Unstructured.io partitioning API with the LangChain wrapper:

In [0]:
from unstructured.partition.md import partition_md

docs = partition_md(filename="./docs/README.md")

In [0]:
len(docs)

In [0]:
docs[:10]

In [0]:
docs[0].to_dict()

In [0]:
docs[1].to_dict()

In [0]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in docs]
lc_docs[:10]

### CSV Loader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

LangChain implements a CSV Loader that will load CSV files into a sequence of `Document` objects. Each row of the CSV file is converted to one document.
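
To make the row-to-document mapping concrete, here is a minimal standard-library sketch of the idea: each CSV row is rendered as `column: value` lines. This approximates the text the loader produces and is not LangChain's implementation; `rows_to_texts` and the sample data are made up for the example.

```python
import csv
import io


def rows_to_texts(csv_text: str) -> list[str]:
    """Render each CSV row as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{key}: {value}" for key, value in row.items())
        for row in reader
    ]


sample_csv = "Property_ID,City,Bedrooms\n101,Springfield,3\n102,Rivertown,2\n"
texts = rows_to_texts(sample_csv)

print(len(texts))
print(texts[0])
```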

In [0]:
import pandas as pd

# Create a DataFrame with some dummy real estate data
data = {
    'Property_ID': [101, 102, 103, 104, 105],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '321 Maple St', '654 Cedar St'],
    'City': ['Springfield', 'Rivertown', 'Laketown', 'Hillside', 'Sunnyvale'],
    'State': ['CA', 'TX', 'FL', 'NY', 'CO'],
    'Zip_Code': [98765, 87654, 76543, 65432, 54321],
    'Bedrooms': [3, 2, 4, 3, 5],
    'Bathrooms': [2, 1, 3, 2, 4],
    'Listing_Price': [500000, 350000, 600000, 475000, 750000]
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('./docs/data.csv', index=False)

In [0]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="./docs/data.csv")
docs = loader.load()

In [0]:
docs

In [0]:
docs[0]

In [0]:
print(docs[0].page_content)

`CSVLoader` accepts a `csv_args` keyword argument that customizes the arguments passed to Python's `csv.DictReader`. See the [`csv` module](https://docs.python.org/3/library/csv.html) documentation for more information on which arguments are supported.

In [0]:
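# Note: data.csv already has a header row, so with explicit fieldnames
# csv.DictReader reads that header line as the first data record (docs[0]).
loader = CSVLoader(file_path="./docs/data.csv",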
                   csv_args={
                      "delimiter": ",",
                      "quotechar": '"',
                      "fieldnames": ["Property ID", "Address", "City", "State",
                                     "Zip Code", "Bedrooms", "Bathrooms", "Price"],
                   },
                  )
docs = loader.load()

In [0]:
docs

In [0]:
print(docs[0].page_content)

In [0]:
print(docs[1].page_content)
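
One consequence of overriding `fieldnames` is worth checking when the file already has a header row: `csv.DictReader` then treats the header line as ordinary data. The sketch below shows this with the standard library alone; the sample CSV text is an assumption for illustration.

```python
import csv
import io

sample_csv = "Property_ID,City\n101,Springfield\n102,Rivertown\n"

# Default behaviour: the first line supplies the field names.
default_rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Explicit fieldnames: the first line is read as a data record.
renamed_rows = list(
    csv.DictReader(io.StringIO(sample_csv), fieldnames=["Property ID", "City"])
)

print(len(default_rows), len(renamed_rows))
print(renamed_rows[0])
```

If the header row should not become a document, one option is to skip the first loaded document after calling `loader.load()`.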

In contrast, `UnstructuredCSVLoader` (backed by Unstructured.io) loads the entire CSV as a single table:

In [0]:
from langchain_community.document_loaders import UnstructuredCSVLoader

loader = UnstructuredCSVLoader("./docs/data.csv")
docs = loader.load()

In [0]:
len(docs)

In [0]:
docs[0]

In [0]:
print(docs[0].page_content)

### JSON Loader

[JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

[JSON Lines](https://jsonlines.org/) is a file format where each line is a valid JSON value.

LangChain implements a [JSONLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html) to convert JSON and JSONL data into LangChain `Document` objects. It uses a specified [`jq` schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

It uses the `jq` Python package. See [this manual](https://jqlang.github.io/jq/manual/) for detailed documentation of the `jq` syntax.

In [0]:
import json

# Sample chat data dictionary
data = {
    'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_meeting.jpg'},
    'is_still_participant': True,
    'joinable_mode': {'link': '', 'mode': 1},
    'magic_words': [],
    'messages': [
        {'content': 'See you soon!',
         'sender_name': 'User B',
         'timestamp_ms': 1675597571851},
        {'content': 'Thanks for the update! See you then.',
         'sender_name': 'User A',
         'timestamp_ms': 1675597435669},
        {'content': 'Actually, the green one is sold out.',
         'sender_name': 'User B',
         'timestamp_ms': 1675596277579},
        {'content': 'I was hoping to purchase the green one!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595140251},
        {'content': 'I’m really interested in the green one, not the red!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595109305},
        {'content': 'Here’s the $150 for it.',
         'sender_name': 'User B',
         'timestamp_ms': 1675595068468},
        {'photos': [{'creation_timestamp': 1675595059,
                     'uri': 'image_of_the_item.jpg'}],
         'sender_name': 'User B',
         'timestamp_ms': 1675595060730},
        {'content': 'It typically sells for at least $200 online',
         'sender_name': 'User B',
         'timestamp_ms': 1675595045152},
        {'content': 'How much are you asking?',
         'sender_name': 'User A',
         'timestamp_ms': 1675594799696},
        {'content': 'Good morning! $50 is far too low.',
         'sender_name': 'User B',
         'timestamp_ms': 1675577876645},
        {'content': 'Hello! I’m interested in the item you posted. I can offer $50. Let me know if that works for you. Thanks!',
         'sender_name': 'User A',
         'timestamp_ms': 1675549022673}
    ],
    'participants': [{'name': 'User A'}, {'name': 'User B'}],
    'thread_path': 'inbox/User A and User B chat',
    'title': 'User A and User B chat'
}

# Save the modified data to a JSON file
with open('./docs/chat_data.json', 'w') as file:
    json.dump(data, file, indent=4)


To load the full data as a single document

In [0]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./docs/chat_data.json',
    jq_schema='.',
    text_content=False)

data = loader.load()

In [0]:
len(data)

In [0]:
data

Suppose we are interested in extracting the values under the `messages` key of the JSON data

In [0]:
loader = JSONLoader(
    file_path='./docs/chat_data.json',
    jq_schema='.messages[]',
    text_content=False)

data = loader.load()
data

Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data

In [0]:
loader = JSONLoader(
    file_path='./docs/chat_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data
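
For readers new to `jq`, the three schemas used above can be read as plain Python traversals. The sketch below mirrors them on a small made-up record; it does not call `jq` or `JSONLoader`, and the sample data is an assumption for illustration.

```python
sample = {
    "title": "User A and User B chat",
    "messages": [
        {"content": "See you soon!", "sender_name": "User B"},
        {"photos": [{"uri": "image_of_the_item.jpg"}], "sender_name": "User B"},
        {"content": "How much are you asking?", "sender_name": "User A"},
    ],
}

# '.' selects the whole value: one result.
whole = [sample]

# '.messages[]' iterates the array: one result per message.
messages = list(sample["messages"])

# '.messages[].content' takes the field from each message;
# jq yields null (None here) when the key is missing.
contents = [message.get("content") for message in sample["messages"]]

print(len(whole), len(messages))
print(contents)
```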

### Basic JSON Loading
The cells below show several loading options using `./docs/facebook_chat.json`. Start by reading the raw file with Python's `json` module:

In [0]:
import pprint

In [0]:
file_path = './docs/facebook_chat.json'
with open(file_path, "r") as file:
    data = json.load(file)

pprint.pprint(data)

### Using JSONLoader for Structured Retrieval
Use `jq_schema` to specify the data structure and extract only the required fields (schema-based retrieval).

In [0]:
loader = JSONLoader(
    file_path='./docs/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data

### Processing JSON Lines (JSONL)
Set `json_lines=True` to load files where each line is a separate JSON object.

In [0]:
# Example - JSON (Processing JSON Lines)

loader = JSONLoader(
    file_path='./docs/facebook_chat_messages.jsonl',
    jq_schema='.content',
    text_content=False,
    json_lines=True
)

data = loader.load()
data

In [0]:
# Example - JSON (Use jq_schema='.' and content_key for simpler extraction)

loader = JSONLoader(
    file_path='./docs/facebook_chat_messages.jsonl',
    jq_schema='.',
    content_key="sender_name",
    text_content=False,
    json_lines=True
)

data = loader.load()
data
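
The `.jsonl` file used above is not created in this notebook. As a sketch of the format, the snippet below writes a few records as JSON Lines and reads them back with the standard library. The file name `sample_messages.jsonl` and the records are assumptions for illustration.

```python
import json
import os
import tempfile

records = [
    {"sender_name": "User B", "content": "See you soon!"},
    {"sender_name": "User A", "content": "Thanks for the update!"},
]

# Hypothetical output path inside a temporary directory.
jsonl_path = os.path.join(tempfile.mkdtemp(), "sample_messages.jsonl")

# JSON Lines: one complete JSON value per line.
with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading back: parse each non-empty line independently.
with open(jsonl_path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(parsed[0]["content"])
```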

### Adding Metadata from JSON
Pass a custom `metadata_func` to copy additional fields from each record into the document metadata, which adds context and traceability.

In [0]:
# Example - JSON (Adding Metadata from JSON)

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata

loader = JSONLoader(
    file_path='./docs/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func # Add metadata from JSON
)

data = loader.load()
data
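
To see what `metadata_func` contributes, it can be called directly on one record. In this sketch the starting metadata (`source`, `seq_num`) is an assumption about the defaults `JSONLoader` supplies, and the record is a made-up sample.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    # Copy selected fields from the JSON record into the document metadata.
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


sample_record = {
    "content": "See you soon!",
    "sender_name": "User B",
    "timestamp_ms": 1675597571851,
}

# Assumed shape of the metadata the loader passes in before customization.
base_metadata = {"source": "./docs/facebook_chat.json", "seq_num": 1}

# Pass a copy so the original defaults stay untouched.
enriched = metadata_func(sample_record, dict(base_metadata))
print(enriched)
```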

### PDF Loaders

[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your use case and is best found through experimentation.

Here we will see how to load PDF documents into the LangChain `Document` format

We download a research paper to experiment with

If the following command fails, download the paper manually from http://arxiv.org/pdf/2103.15348.pdf, save it as `layoutparser_paper.pdf`, and upload it using the file upload option in the left panel of Colab.

In [0]:
# !wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

#### PyPDFLoader

Here we load a PDF using `pypdf` into a list of documents, where each document contains the page content and metadata with the page number. Typically, each PDF page becomes one document.

In [0]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./docs/layoutparser_paper.pdf")
pages = loader.load()

In [0]:
len(pages)

In [0]:
pages[0]

In [0]:
print(pages[0].page_content)

In [0]:
print(pages[0].metadata)

In [0]:
print(pages[4].page_content)

#### PyMuPDFLoader

This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages. It uses the `pymupdf` library internally.

In [0]:
%pip install --upgrade pymupdf

In [0]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./docs/layoutparser_paper.pdf")
pages = loader.load()

In [0]:
len(pages)

In [0]:
pages[0]

In [0]:
pages[0].metadata

In [0]:
print(pages[0].page_content)

In [0]:
print(pages[4].page_content)

#### UnstructuredPDFLoader

[Unstructured.io](https://unstructured-io.github.io/unstructured/) supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's [`UnstructuredPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

Load PDF as a single document - no complex parsing

In [0]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('./docs/layoutparser_paper.pdf')
data = loader.load()

In [0]:
len(data)

In [0]:
print(data[0].page_content[:1000])

Load PDF with complex parsing, table detection and chunking by sections

In [0]:
# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('./docs/layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=3800, # preferred size of chunks
                               combine_text_under_n_chars=2000, # smaller chunks < 2000 chars will be combined into a larger chunk
                               mode='elements')
data = loader.load()

In [0]:
len(data)

In [0]:
[doc.metadata['category'] for doc in data]

In [0]:
data[0]

In [0]:
print(data[0].page_content)

In [0]:
data[5]

In [0]:
data[5].page_content

In [0]:
from IPython.display import HTML

HTML(data[5].metadata['text_as_html'])

Load using raw unstructured.io APIs for PDFs

In [0]:
from unstructured.partition.pdf import partition_pdf

# Get elements - takes 3-4 mins
raw_pdf_elements = partition_pdf(
    filename="./docs/layoutparser_paper.pdf",
    strategy='hi_res',
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Attempt to start a new chunk after 3800 chars
    # Attempt to combine sections smaller than 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="./",
)

In [0]:
len(raw_pdf_elements)

In [0]:
raw_pdf_elements

In [0]:
raw_pdf_elements[5].to_dict()

Convert into LangChain `Document` format

In [0]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in raw_pdf_elements]
lc_docs[5]

### Microsoft Office Document Loaders

The Microsoft Office suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.

[Unstructured.io](https://docs.unstructured.io/open-source/introduction/overview) provides a variety of document loaders to load MS Office documents. Check them out [here](https://docs.unstructured.io/open-source/core-functionality/partitioning).

Here we will leverage LangChain's [`UnstructuredWordDocumentLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader.html) to load data from a MS Word document.

In [0]:
# !gdown 1DEz13a7k4yX9yFrWaz3QJqHdfecFYRV-

Load word doc as a single document

In [0]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader('./docs/Intel Strategy.docx')
data = loader.load()

In [0]:
len(data)

In [0]:
data[0].page_content[:1000]

Load word doc with complex parsing and section based chunks

In [0]:
loader = UnstructuredWordDocumentLoader('./docs/Intel Strategy.docx',
                                        strategy='fast',
                                        chunking_strategy="by_title",
                                        max_characters=3000, # max limit of a document chunk
                                        new_after_n_chars=2500, # preferred document chunk size
                                        mode='elements')
data = loader.load()

In [0]:
data

In [0]:
len(data)

In [0]:
print(data[0])

In [0]:
print(data[0].page_content)

In [0]:
print(data[0].metadata)

In [0]:
data[0]

In [0]:
data[1]

In [0]:
data[2]

In [0]:
data[3]

In [0]:
data[4]

### Directory Loaders

LangChain's [`DirectoryLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects.

In [0]:
# !wget -O 'Vision Transformers.pdf' 'https://arxiv.org/pdf/2010.11929.pdf'

We first define the loaders that LangChain should use for each file type. We follow this format:

```
loaders = {
  'file_format_extension' : (LoaderClass, LoaderKeywordArguments)
}
```

Where:

- `file_format_extension` can be anything like `.docx`, `.pdf`, etc.
- `LoaderClass` is a specific data loader like `PyMuPDFLoader`
- `LoaderKeywordArguments` are any specific keyword arguments that need to be passed into that loader at runtime

In [0]:
# Define a dictionary to map file extensions to their respective loaders
loaders = {
    '.pdf': (PyMuPDFLoader, {}),
    '.docx': (UnstructuredWordDocumentLoader, {'strategy': 'fast',
                                              'chunking_strategy' : 'by_title',
                                              'max_characters' : 3000, # max limit of a document chunk
                                              'new_after_n_chars' : 2500, # preferred document chunk size
                                              'mode' : 'elements'
                                              })
}

`DirectoryLoader` accepts a `loader_cls` argument, which defaults to `UnstructuredFileLoader`. We can instead pass one of the loaders defined above in `loader_cls`, and pass any keyword arguments for that loader in `loader_kwargs`.

We can also show a progress bar by setting `show_progress=True`.

We can use the `glob` parameter to control which files to load based on file patterns.

Here we create two separate loaders, one for Word documents and one for PDFs.

In [0]:
from langchain_community.document_loaders import DirectoryLoader

# Define a function to create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type][0],
        loader_kwargs=loaders[file_type][1],
        show_progress=True
    )

# Create DirectoryLoader instances for each file type
pdf_loader = create_directory_loader('.pdf', './docs')
docx_loader = create_directory_loader('.docx', './docs')

# Load the files
pdf_documents = pdf_loader.load()
docx_documents = docx_loader.load()

In [0]:
len(pdf_documents)

In [0]:
pdf_documents

In [0]:
pdf_documents[18]

In [0]:
len(docx_documents)

In [0]:
docx_documents
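
The `glob` pattern built in `create_directory_loader` is a recursive file pattern. As a rough check of which files a pattern such as `**/*.pdf` would select, the sketch below applies the same pattern with `pathlib` to a temporary directory. The file names are made up, and this is not a call into `DirectoryLoader` itself.

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "nested").mkdir()

# Hypothetical files standing in for the contents of ./docs
for name in ["paper.pdf", "notes.docx", "nested/slides.pdf", "nested/readme.txt"]:
    (root / name).touch()


def match(file_type: str) -> list[str]:
    """Return relative paths matching the recursive pattern for one extension."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(f"**/*{file_type}")
    )


pdf_files = match(".pdf")
docx_files = match(".docx")
print(pdf_files)
print(docx_files)
```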

### YouTube Transcript Loader
You can load the transcript of a YouTube video from its URL. First, install the libraries below. To keep video information such as the title and description in the documents, set `add_video_info=True`; otherwise set it to `False`.

In [0]:
%pip install --upgrade --quiet  youtube-transcript-api
%pip install --upgrade --quiet  pytube

In [0]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=LAfrShnpVIk",
    add_video_info=False,
)

result = loader.load()
print(f"Without Video Info: {result}\n\n")

# loader = YoutubeLoader.from_youtube_url(
#     "https://www.youtube.com/watch?v=ZL-cwYRMPjI",
#     add_video_info=True,
# )
# result = loader.load()
# print(f"With Video Info: {result}")

To change how the transcript is split, use the `TranscriptFormat` class. The example below creates documents that each contain about 10 seconds of the transcript; adjust `chunk_size_seconds` to suit your needs.

In [0]:
from langchain_community.document_loaders.youtube import TranscriptFormat

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=TKCMw0utiak",
    add_video_info=False,
    transcript_format=TranscriptFormat.CHUNKS,
    chunk_size_seconds=10,
)
print("\n\n".join(map(repr, loader.load())))
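
As a simplified picture of time-based chunking, the sketch below groups transcript segments into fixed 10-second windows by start time. The segment shape (`text`, `start`) and the `chunk_by_seconds` helper are assumptions for illustration; LangChain's own chunking logic may differ in detail.

```python
from collections import defaultdict


def chunk_by_seconds(segments: list[dict], chunk_size_seconds: int) -> list[dict]:
    """Group segments into windows of chunk_size_seconds based on start time."""
    windows: dict[int, list[str]] = defaultdict(list)
    for segment in segments:
        index = int(segment["start"] // chunk_size_seconds)
        windows[index].append(segment["text"])
    return [
        {"start_seconds": index * chunk_size_seconds, "text": " ".join(texts)}
        for index, texts in sorted(windows.items())
    ]


# Made-up transcript segments with start times in seconds.
sample_segments = [
    {"text": "hello and welcome", "start": 0.0},
    {"text": "to this video", "start": 4.2},
    {"text": "today we cover loaders", "start": 10.5},
    {"text": "let's begin", "start": 19.9},
]

chunks = chunk_by_seconds(sample_segments, chunk_size_seconds=10)
for chunk in chunks:
    print(chunk)
```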

### Scraping data from URLs

We can scrape data from URLs in various ways, such as HTML loaders, `RecursiveUrlLoader`, FireCrawl, and others. `RecursiveUrlLoader` recursively scrapes the child links of a root URL and converts the pages into `Document` objects.

By default, `RecursiveUrlLoader` returns the page content with HTML tags included. The function below strips the tags and returns the text content. You can also pass it as the `extractor` argument of `RecursiveUrlLoader`.

In [0]:
%pip install -qU langchain-community beautifulsoup4
%pip install lxml

In [0]:
from langchain_community.document_loaders import RecursiveUrlLoader
import re
from bs4 import BeautifulSoup

# create a RecursiveUrlLoader instance for a specific URL
loader = RecursiveUrlLoader(
    "https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search/",
    timeout=10,
    max_depth=2,
)

docs = loader.load()
print(len(docs))
print(docs[0].page_content)  # see the content

In [0]:
# function for removing HTML tags
def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

content = bs4_extractor(docs[0].page_content)
print(f"After removing the html tags: {content}")

# define the extractor in the RecursiveUrlLoader Class
loader = RecursiveUrlLoader("https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search/", extractor=bs4_extractor)
docs = loader.load()
print(docs[1].page_content)
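
If `lxml` or BeautifulSoup is unavailable, a tag-stripping extractor can also be sketched with the standard library's `html.parser`. The `stdlib_extractor` function below is a hypothetical alternative with the same `str -> str` shape as `bs4_extractor`; it is simpler and will not match BeautifulSoup's output exactly.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def stdlib_extractor(html: str) -> str:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of blank lines, as bs4_extractor does.
    return re.sub(r"\n\n+", "\n\n", "".join(parser.parts)).strip()


sample_html = (
    "<html><body><h1>Title</h1>\n\n\n"
    "<p>Hello <b>world</b></p><script>x=1</script></body></html>"
)
print(stdlib_extractor(sample_html))
```

Because it takes a string and returns a string, a function of this shape could be passed as the `extractor` argument in the same way as `bs4_extractor`.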

### Research Paper Data Loader
arXiv is an open-access archive of over 2 million scholarly articles in the fields of physics, mathematics, computer science, finance, statistics, electrical engineering, systems science, economics, biology, and so on. This makes arXiv a useful source of data across many domains for our applications.

To access arXiv data, we need to install the `langchain-community`, `arxiv`, and `pymupdf` packages. PyMuPDF converts the PDF files downloaded from arXiv into text.

In [0]:
%pip install -qU langchain-community arxiv pymupdf

In [0]:

from langchain_community.document_loaders import ArxivLoader

# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="Yolov8",
    load_max_docs=2,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)

docs = loader.load()
print(docs[0])

# get the summary of the paper
docs = loader.get_summaries_as_docs()
docs[0]

### Custom Data Loader
If no loader class exists for your data type, you can write a custom document loader that converts your data into LLM-ready documents. The example below subclasses `BaseLoader`, the base class for turning raw data into documents, to load a file and create one document per line. Each document contains text and metadata.

We define `CustomDocumentLoader`, create a text file, and then load that file through the custom loader.

In [0]:
from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            # enumerate supplies the zero-based line number for the metadata
            for line_number, line in enumerate(f):
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )

# create a text file
with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

# create instance of Custom Document Loader with the text file
loader = CustomDocumentLoader("./meow.txt")

# load the data and convert into Documents
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)