# Loading Files with LangChain

The simple LangChain loaders used here cannot read legacy `.doc` files, so we concentrate on `.pdf` and `.docx` files first. We assume that the files contain no images or tables.


In [19]:
import os
import pathlib
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader, Docx2txtLoader
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

DATA_PATH = os.getenv("DATA_PATH")

In [20]:
def get_path(name: str) -> Path:
    """Return the first file or directory named `name` below DATA_PATH.

    Args:
        name (str): File or directory name to search for (recursively).

    Returns:
        Path: Path to the first match; a `WindowsPath` on Windows,
            a `PosixPath` elsewhere.

    Raises:
        StopIteration: If nothing below DATA_PATH matches `name`.
    """
    return next(Path(DATA_PATH).rglob(name))
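
The helper above fails in two ways that are hard to read: `next()` raises a bare `StopIteration` when nothing matches, and `Path(None)` raises a `TypeError` when `DATA_PATH` is not set. A minimal sketch of a stricter variant follows; the name `find_path` and its `root` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional, Union


def find_path(name: str, root: Optional[Union[str, Path]]) -> Path:
    """Hypothetical stricter variant of get_path with explicit errors."""
    if not root:
        raise ValueError("root is empty; is DATA_PATH set in the .env file?")
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"{root_path} is not a directory")
    # sorted() makes the choice deterministic when several entries share a name
    matches = sorted(root_path.rglob(name))
    if not matches:
        raise FileNotFoundError(f"no entry named {name!r} below {root_path}")
    return matches[0]
```

With this variant, a call such as `find_path(file_name, DATA_PATH)` either returns a path or explains what went wrong.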

## PDF Loader

LangChain offers two simple ways to load `.pdf` files: it can load a single PDF file directly, or it can load every PDF file in a directory.

### Single File

In [21]:
file_name = "ark_021_-_geschaeftsordnung_des_beirats.pdf"
path = get_path(file_name)

docs = PyPDFLoader(path).load()

In [22]:
# docs
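
`PyPDFLoader` returns one `Document` per page, each with the page text in `page_content` and a `metadata` dict that holds the `source` path and a zero-based `page` number. Printing `docs` directly is noisy, so a short overview can help. The sketch below uses a hypothetical helper, `summarize_pages`, and stand-in objects with the same two attributes, so it runs without a PDF file.

```python
from types import SimpleNamespace


def summarize_pages(docs, preview_chars=60):
    """Return one (source, page, preview) tuple per Document-like object."""
    rows = []
    for doc in docs:
        source = doc.metadata.get("source")
        page = doc.metadata.get("page")  # zero-based for PyPDFLoader
        # Collapse line breaks and repeated spaces before truncating
        preview = " ".join(doc.page_content.split())[:preview_chars]
        rows.append((source, page, preview))
    return rows


# Stand-ins that mimic the attributes of LangChain Document objects
fake_docs = [
    SimpleNamespace(page_content="First page\ntext", metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="Second   page", metadata={"source": "a.pdf", "page": 1}),
]
for row in summarize_pages(fake_docs):
    print(row)
```

In the notebook, `summarize_pages(docs)` can be called on the list returned by the loader.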

### Directory

In [23]:
dir_name = "sgb_i"
path = get_path(dir_name)

docs = PyPDFDirectoryLoader(path).load()

In [24]:
# docs
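
`PyPDFDirectoryLoader` returns a single flat list that contains the pages of all PDF files it finds, so the `source` entry in each `metadata` dict is what tells the files apart. As a quick sanity check, the pages can be counted per file. The helper name `pages_per_file` and the sample file names are assumptions for illustration only.

```python
from collections import Counter
from pathlib import Path
from types import SimpleNamespace


def pages_per_file(docs):
    """Count Document-like objects per source file name."""
    return Counter(Path(doc.metadata["source"]).name for doc in docs)


# Stand-ins that mimic the metadata of loaded PDF pages (hypothetical names)
fake_docs = [
    SimpleNamespace(page_content="...", metadata={"source": "sgb_i/par_1.pdf", "page": 0}),
    SimpleNamespace(page_content="...", metadata={"source": "sgb_i/par_1.pdf", "page": 1}),
    SimpleNamespace(page_content="...", metadata={"source": "sgb_i/par_2.pdf", "page": 0}),
]
counts = pages_per_file(fake_docs)
for file_name, n_pages in sorted(counts.items()):
    print(file_name, n_pages)
```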

## DOCX Loader

Unlike the PDF loaders, which return one `Document` object per page, `Docx2txtLoader` returns a list containing a single `Document` object for the whole file. It does not extract page numbers.

In [25]:
file_name = "aktive_leistungen_bei_darlehensweiser_passiver_leistungsgewaehrung.docx"

path = get_path(file_name)
docs = Docx2txtLoader(path).load()

In [26]:
# docs
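
Since the two file types need different loaders, the choice can be made from the file suffix. The following is a minimal sketch under that assumption; `pick_loader_name`, `load_file` and the mapping are hypothetical names, and only the two loaders already used in this notebook are covered. Any other suffix, including `.doc`, is rejected.

```python
from pathlib import Path

# Loader class names from langchain_community.document_loaders, keyed by suffix
LOADER_NAME_BY_SUFFIX = {".pdf": "PyPDFLoader", ".docx": "Docx2txtLoader"}


def pick_loader_name(path):
    """Return the loader class name for a file, based on its suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_NAME_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type {suffix!r}: {path}") from None


def load_file(path):
    """Load one file with the matching loader (needs langchain_community)."""
    # Imported lazily so pick_loader_name works without LangChain installed
    from langchain_community import document_loaders

    loader_cls = getattr(document_loaders, pick_loader_name(path))
    # str() keeps this compatible with loader versions that expect a string path
    return loader_cls(str(path)).load()
```

In the notebook, `load_file(get_path(file_name))` would then work for both the `.pdf` and the `.docx` example.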