## Set Up the Environment

In [3]:
%run setup.ipynb

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that reads data from a pre-configured source and returns it as a list of documents.
- **Lazy Load Option:** Some loaders also implement `lazy_load`, which yields documents one at a time so that data is brought into memory gradually, as needed.
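
The difference between eager and lazy loading can be sketched without LangChain at all. The block below is a simplified illustration, not LangChain's implementation: `ToyDocument` and `ToyLineLoader` are hypothetical stand-ins invented for this example, and only the method names `load` and `lazy_load` mirror the loader interface described above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: a piece of text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyLineLoader:
    """Hypothetical loader: each non-empty line of a string becomes one document."""

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[ToyDocument]:
        # Generator: documents are built one at a time, only when requested.
        for line_number, line in enumerate(self.text.splitlines()):
            if line.strip():
                yield ToyDocument(line, {"source": self.source, "line": line_number})

    def load(self) -> List[ToyDocument]:
        # Eager variant: materialize everything lazy_load would yield.
        return list(self.lazy_load())


loader = ToyLineLoader("first line\n\nsecond line", source="example.txt")
all_docs = loader.load()              # whole list in memory at once
first_doc = next(loader.lazy_load())  # only the first document is built
```

In this sketch `load` is just `list(lazy_load())`, which is one common way to relate the two; a real loader may implement them differently.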

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).


### Directory Loaders

LangChain's [`DirectoryLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects.

We first define which loader LangChain should use for each file type. The mapping follows this format:

```
loaders = {
  'file_format_extension' : (LoaderClass, LoaderKeywordArguments)
}
```

Where:

- `file_format_extension` is a file extension such as `.docx` or `.pdf`.
- `LoaderClass` is a specific document loader, such as `PyMuPDFLoader`.
- `LoaderKeywordArguments` is a dictionary of keyword arguments that need to be passed to that loader at runtime.

In [5]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

In [6]:
# Define a dictionary to map file extensions to their respective loaders
# Each key is a file extension (e.g., '.pdf', '.docx')
# Each value is a tuple: (LoaderClass, LoaderKeywordArguments)
#   - LoaderClass: The class used to load that file type
#   - LoaderKeywordArguments: Dictionary of keyword arguments for the loader

loaders = {
    # For PDF files, use PyMuPDFLoader with no additional arguments
    '.pdf': (PyMuPDFLoader, {}),

    # For DOCX files, use UnstructuredWordDocumentLoader with specific options:
    #   - 'strategy': 'fast' (use fast parsing)
    #   - 'chunking_strategy': 'by_title' (split document by title)
    #   - 'max_characters': 3000 (maximum characters per document chunk)
    #   - 'new_after_n_chars': 2500 (preferred chunk size before starting a new chunk)
    #   - 'mode': 'elements' (parse document as elements)
    '.docx': (
        UnstructuredWordDocumentLoader,
        {
            'strategy': 'fast',
            'chunking_strategy': 'by_title',
            'max_characters': 3000,      # max limit of a document chunk
            'new_after_n_chars': 2500,   # preferred document chunk size
            'mode': 'elements'
        }
    )
}

`DirectoryLoader` accepts a `loader_cls` argument, which defaults to `UnstructuredFileLoader`. We can instead pass one of the loaders defined above as `loader_cls`, and any keyword arguments for that loader go in the `loader_kwargs` argument.

We can also show a progress bar by setting `show_progress=True`.

We can use the `glob` parameter to control which files are loaded, based on file name patterns.
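
To see what a pattern such as `**/*.pdf` selects, it can help to try it with the standard library first. The sketch below assumes that `DirectoryLoader` resolves its `glob` pattern the same way `pathlib` does; the helper `match_files` and the file names are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def match_files(directory: str, pattern: str) -> List[str]:
    """Return the sorted relative paths of files under `directory` matching `pattern`."""
    root = Path(directory)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob(pattern)
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Hypothetical files: two PDFs (one in a subdirectory) and one Word document.
    for name in ("a.pdf", "b.docx", "nested/c.pdf"):
        (root / name).touch()

    recursive_pdfs = match_files(tmp, "**/*.pdf")  # any depth
    top_level_pdfs = match_files(tmp, "*.pdf")     # top level only
    word_files = match_files(tmp, "**/*.docx")

print(recursive_pdfs)
print(top_level_pdfs)
print(word_files)
```

The leading `**/` is what makes the pattern descend into subdirectories; without it, only files directly inside the given directory match.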

Here we create two separate loaders: one for Word documents and one for PDFs.

In [8]:
from langchain_community.document_loaders import DirectoryLoader

# Define a function to create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type][0],
        loader_kwargs=loaders[file_type][1],
        show_progress=True
    )

# Create DirectoryLoader instances for each file type
pdf_loader = create_directory_loader('.pdf', '../../docs')
docx_loader = create_directory_loader('.docx', '../../docs')

# Load the files
pdf_documents = pdf_loader.load()
docx_documents = docx_loader.load()

100%|██████████| 7/7 [00:01<00:00,  6.52it/s]
100%|██████████| 1/1 [00:03<00:00,  3.35s/it]


In [9]:
len(pdf_documents)

108
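
The document count is much larger than the number of PDF files because, as the `page` and `total_pages` metadata fields in the loaded documents suggest, each page appears to be loaded as its own `Document`. A quick way to check this is to count documents per `source`. The sketch below is self-contained and uses hypothetical stand-in objects and file names; the helper `count_documents_per_source` is not part of LangChain, but it should work on any list of objects that expose a `metadata` dictionary, such as `pdf_documents`.

```python
from collections import Counter
from types import SimpleNamespace


def count_documents_per_source(documents):
    """Count how many documents share each 'source' metadata value."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)


# Hypothetical stand-ins for loaded documents (one entry per page).
sample_documents = [
    SimpleNamespace(metadata={"source": "paper_a.pdf", "page": 0}, page_content="..."),
    SimpleNamespace(metadata={"source": "paper_a.pdf", "page": 1}, page_content="..."),
    SimpleNamespace(metadata={"source": "paper_b.pdf", "page": 0}, page_content="..."),
]

per_source = count_documents_per_source(sample_documents)
for source, count in sorted(per_source.items()):
    print(f"{source}: {count} document(s)")
```

Calling `count_documents_per_source(pdf_documents)` in the notebook would then show how the total is distributed across the individual files.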

In [10]:
pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': '../../docs/layoutparser_paper.pdf', 'file_path': '../../docs/layoutparser_paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'modDate': 'D:20210622012710Z', 'creationDate': 'D:20210622012710Z', 'page': 0}, page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been

In [11]:
pdf_documents[18]

Document(metadata={'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-12-03T01:48:07+00:00', 'source': '../../docs/cnn_paper.pdf', 'file_path': '../../docs/cnn_paper.pdf', 'total_pages': 11, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2015-12-03T01:48:07+00:00', 'trapped': '', 'modDate': 'D:20151203014807Z', 'creationDate': 'D:20151203014807Z', 'page': 2}, page_content='Introduction to Convolutional Neural Networks\n3\nmore suited for image-focused tasks - whilst further reducing the parameters\nrequired to set up the model.\nOne of the largest limitations of traditional forms of ANN is that they tend to\nstruggle with the computational complexity required to compute image data.\nCommon machine learning benchmarking datasets such as the MNIST database\nof handwritten digits are suitable for most forms of ANN, due to its relatively\nsmall image dimensionality of just 28 × 28. With this dataset a si

In [12]:
len(docx_documents)

4

In [13]:
docx_documents

[Document(metadata={'source': '../../docs/Intel Strategy.docx', 'emphasized_text_contents': ['The Superpowers', 'Pervasive Connectivity', 'Ubiquitous compute'], 'emphasized_text_tags': ['b', 'b', 'b'], 'file_directory': '../../docs', 'filename': 'Intel Strategy.docx', 'last_modified': '2025-05-30T10:16:46', 'orig_elements': 'eJzNV2tvG7kV/SuEPnUBjaq3JffTwkhTA4usgTgtit2FwSHvjAjPkLMkR8pk0f/ec8mRH4lTJGiBGjBkic/7OPfcw1/+mFBDLdl4Z/TkUkyqi0213K90sb0gVaz323WxV/NFsVpVF+WS1mW5kpOpmLQUpZZRYs8fEyUj1c4Pd5q6eMDQHCsq09CdNp5UxBSfPZv9GX/aqTAZ561siWeubaRGvI+eDxpmWPKRlzQyxLvWaVMZStYt58tNMd8Uq/ntYn652F6ut5N/YWGkj/HLc/iIOHTphg92NNJ8In3Ly7Hvc+f1bq23u4tdsdpu58V6U5aFVHJZLGlH8916vadq/3qd//lIXsQDCd4pKjqJgaQPU5EumApnSbgqrTg53+hf++V8sQ+iNHVN2KEOpmvlPfGWg8Q4kRWwygYTjbPG1iK6k/Q6CCla50lwDBTC540Ssuu8k+qA86UVN1fFODETtwcThOFNJ2qUawmbjzQV1kXY1AyictluzHXSDqLsYxoz1rqj5LvxVURSB+saVw8CxuEouBTkEMSNjOItNQEGkp+Kqzc/jy7PxI9KOa+z5eJg2mm+BxbWadC0FM6n8f8AizWGjH16+UEeSXQNXNX5YJ6XonM5MOJ0IATDIILwvCGkQdaUbkohVqHvyHfuhMimAc3WcKQfc8GJY