# Loading Files with Unstructured.io

Unstructured can load many different file formats; see the [partitioning documentation](https://docs.unstructured.io/open-source/core-functionality/partitioning) for the full list.

In [31]:
import os
import pathlib
from pathlib import Path
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

DATA_PATH = os.getenv("DATA_PATH")
POPPLER_PATH = os.getenv("POPPLER_PATH")
TESSERACT_PATH = os.getenv("TESSERACT_PATH")

In [32]:
def get_path(name: str) -> pathlib.Path:
    """Find a file or directory inside the data directory.

    Args:
        name (str): File or directory name (or glob pattern) to search for

    Returns:
        pathlib.Path: First match found by a recursive search below DATA_PATH
    """
    return next(Path(DATA_PATH).rglob(name))
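
When nothing matches, `next()` on the empty `rglob` iterator raises a bare `StopIteration`, which is hard to interpret if a file name is misspelled or `DATA_PATH` is unset. A minimal sketch of a more defensive lookup is shown here; the name `find_in_dir` and its explicit `root` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_in_dir(root: str, name: str) -> Path:
    """Return the first path below `root` whose name matches `name`.

    Hypothetical helper: same recursive lookup as get_path, but with
    explicit error messages instead of a bare StopIteration.
    """
    if not root:
        raise ValueError("root is empty; is DATA_PATH set in your .env file?")
    # next(..., None) returns None instead of raising when there is no match
    match = next(Path(root).rglob(name), None)
    if match is None:
        raise FileNotFoundError(f"{name!r} not found below {root!r}")
    return match
```

It could be called as `find_in_dir(DATA_PATH, file_name)` in place of `get_path(file_name)`.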

## All-in-One Loader

- **Pros**:
    - no need to know the file format in advance
    - easy to use
- **Cons**:
    - not as fast as a specialized loader
    - not as flexible as a specialized loader
    - more dependencies
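
One way to act on this trade-off is to use a specialized loader when the suffix is known and fall back to the all-in-one loader otherwise. The following is only a sketch under that assumption: the helper names `pick_loader` and `load` are hypothetical, and the table lists only the loaders used in this notebook. The imports happen lazily, so only the dependencies of the chosen loader are needed at call time.

```python
from importlib import import_module
from pathlib import Path
from typing import Tuple

# Suffix -> (module, function); only the loaders used in this notebook.
SPECIALIZED = {
    ".pdf": ("unstructured.partition.pdf", "partition_pdf"),
    ".docx": ("unstructured.partition.docx", "partition_docx"),
    ".doc": ("unstructured.partition.doc", "partition_doc"),
}
FALLBACK = ("unstructured.partition.auto", "partition")


def pick_loader(path: str) -> Tuple[str, str]:
    """Return the (module, function) names for a file, based on its suffix."""
    return SPECIALIZED.get(Path(path).suffix.lower(), FALLBACK)


def load(path: str, **kwargs):
    """Import the chosen loader lazily and call it with the file name."""
    module_name, func_name = pick_loader(path)
    loader = getattr(import_module(module_name), func_name)
    return loader(filename=str(path), **kwargs)
```

For example, `load(str(path), languages=["deu"])` would route a `.pdf` file to `partition_pdf` and an unknown suffix to `partition`.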

In [33]:
from unstructured.partition.auto import partition

file_name = "ark_021_-_geschaeftsordnung_des_beirats.pdf"
path = get_path(file_name)

ele = partition(
    str(path),
    strategy="hi_res",  # layout-model based parsing: slower, but more accurate
    languages=["deu"],  # language used for OCR: German
)

In [34]:
ele[:5]

[<unstructured.documents.elements.Header at 0x220d4195390>,
 <unstructured.documents.elements.Title at 0x220d43a1050>,
 <unstructured.documents.elements.Text at 0x220b707ca50>,
 <unstructured.documents.elements.Title at 0x220d41771d0>,
 <unstructured.documents.elements.NarrativeText at 0x220d417e0d0>]
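
The default representation shows only each element's class and memory address. A small sketch for a more readable overview follows; it assumes that every element exposes its content through a `text` attribute, and the helper names `summarize` and `count_types` are hypothetical.

```python
from collections import Counter


def summarize(elements, limit=5):
    """Return (type name, text preview) pairs for the first `limit` elements."""
    rows = []
    for el in elements[:limit]:
        # Fall back to an empty string if an element has no text
        text = getattr(el, "text", "") or ""
        rows.append((type(el).__name__, text[:60]))
    return rows


def count_types(elements):
    """Count how often each element class occurs in the document."""
    return Counter(type(el).__name__ for el in elements)
```

In the notebook this could be used as `for kind, preview in summarize(ele): print(kind, preview)`, and `count_types(ele)` gives a quick impression of the document structure.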

## PDF Loader

In [35]:
from unstructured.partition.pdf import partition_pdf

file_name = "ark_021_-_geschaeftsordnung_des_beirats.pdf"
path = get_path(file_name)

ele = partition_pdf(
    str(path),
    strategy="hi_res",  # layout-model based parsing: slower, but more accurate
    languages=["deu"],  # language used for OCR: German
)

In [36]:
ele[:5]

[<unstructured.documents.elements.Header at 0x220d4176b90>,
 <unstructured.documents.elements.Title at 0x220d4176a90>,
 <unstructured.documents.elements.Text at 0x220d2a1ed50>,
 <unstructured.documents.elements.Title at 0x220b7052710>,
 <unstructured.documents.elements.NarrativeText at 0x220d424b490>]

## DOCX Loader

In [37]:
from unstructured.partition.docx import partition_docx

file_name = "aktive_leistungen_bei_darlehensweiser_passiver_leistungsgewaehrung.docx"
path = get_path(file_name)

ele = partition_docx(
    str(path),
    strategy="hi_res",
    languages=["deu"],
)

In [38]:
ele[:5]

[<unstructured.documents.elements.Title at 0x220b3d71f50>,
 <unstructured.documents.elements.Title at 0x220d443f590>,
 <unstructured.documents.elements.NarrativeText at 0x220d43a3810>,
 <unstructured.documents.elements.NarrativeText at 0x220d4178210>,
 <unstructured.documents.elements.NarrativeText at 0x220d432ba50>]

## DOC Loader

This loader uses [LibreOffice](https://www.libreoffice.org/) to convert the file to DOCX and then passes the result to the DOCX loader, so LibreOffice must be installed on your system.

- **Installation**:
    - install LibreOffice
    - add the directory that contains `soffice.exe` (`../program`) to your system `PATH`
    - restart your PC so that the new `PATH` takes effect
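
Before calling the DOC loader, it can help to check that the executable is actually visible to the Python process. The sketch below uses the standard library's `shutil.which`; the helper name `find_executable` is hypothetical, and the second candidate name `libreoffice` is an assumption for non-Windows systems.

```python
import shutil


def find_executable(candidates=("soffice", "libreoffice")):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None
```

If `find_executable()` returns `None`, the `PATH` change has most likely not been picked up yet.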

In [39]:
from unstructured.partition.doc import partition_doc

file_name = "54_SGB_I_Pfaendung_20130402.999.doc"
path = get_path(file_name)

ele = partition_doc(
    str(path),
    strategy="hi_res",
    languages=["deu"],
)

In [40]:
ele[:5]

[<unstructured.documents.elements.Header at 0x220d418fb10>,
 <unstructured.documents.elements.Title at 0x220d41848d0>,
 <unstructured.documents.elements.NarrativeText at 0x220d43f34d0>,
 <unstructured.documents.elements.NarrativeText at 0x220d448d510>,
 <unstructured.documents.elements.NarrativeText at 0x220d421e110>]