Uses unstructured.io library to batch process documents in a directory specified by command line - with progress saving.

FerrenF/bulkTextPartition

Bulk Text Extractor

This Python script extracts text from documents in several formats and splits it into segments using the unstructured library from unstructured.io.

Features

  • Extracts text from PDF, MOBI, EPUB, and DJVU files.
  • Splits documents into segments based on specified settings.
  • Saves extracted segments as JSON files.
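
As a rough illustration of the first and last features, the sketch below finds supported documents in a directory and writes a list of segments to a JSON file. It uses only the standard library. The names find_documents and save_segments and the JSON record layout are assumptions made for this example, not taken from the script itself.

```python
import json
from pathlib import Path
from typing import Iterable, List

# Extensions taken from the feature list above.
SUPPORTED_EXTENSIONS = {".pdf", ".mobi", ".epub", ".djvu"}


def find_documents(directory: Path) -> List[Path]:
    """Return supported documents directly inside `directory`, sorted by path."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def save_segments(segments: Iterable[str], out_path: Path) -> int:
    """Write segments to `out_path` as a JSON list and return how many were written."""
    # Hypothetical record layout: one object per segment with its position and text.
    records = [{"index": i, "text": text} for i, text in enumerate(segments)]
    out_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

The extension check is case-insensitive, so a file named book.PDF is picked up as well.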

Requirements

  • Python 3.x
  • unstructured (installed with the all-docs extra)

Usage

  1. Install dependencies: pip install "unstructured[all-docs]" (the quotes stop shells such as zsh from interpreting the square brackets)
  2. Clone or download the repository.
  3. Run the script: python bulk_text_extractor.py <directory>
    • Replace <directory> with the path to your documents directory.
  4. The script will process each document and save extracted segments in a subdirectory within the specified directory.
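
The processing loop in step 4, combined with the progress saving mentioned in the repository description, could look roughly like the sketch below. This is not the script's actual code: the subdirectory name, the progress file name, and the extract callback are hypothetical and stand in for whatever the script really uses.

```python
import json
from pathlib import Path
from typing import Callable, List, Set

OUTPUT_SUBDIR = "extracted"      # hypothetical output subdirectory name
PROGRESS_FILE = "progress.json"  # hypothetical progress file name


def load_progress(out_dir: Path) -> Set[str]:
    """Return the names of documents already processed in earlier runs."""
    path = out_dir / PROGRESS_FILE
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def mark_done(out_dir: Path, done: Set[str], name: str) -> None:
    """Record `name` as processed and persist the full set immediately."""
    done.add(name)
    (out_dir / PROGRESS_FILE).write_text(json.dumps(sorted(done)), encoding="utf-8")


def process_directory(directory: Path, extract: Callable[[Path], List[str]]) -> List[str]:
    """Run `extract` on each unprocessed file; return the names handled in this run."""
    out_dir = directory / OUTPUT_SUBDIR
    out_dir.mkdir(exist_ok=True)
    done = load_progress(out_dir)
    processed = []
    for doc in sorted(p for p in directory.iterdir() if p.is_file()):
        if doc.name in done:
            continue  # finished in a previous run
        segments = extract(doc)
        (out_dir / f"{doc.name}.json").write_text(json.dumps(segments), encoding="utf-8")
        mark_done(out_dir, done, doc.name)
        processed.append(doc.name)
    return processed
```

Writing the progress file after every document means an interrupted run loses at most the document that was in flight, and a second run over the same directory skips everything already recorded.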

Options

  • You can modify the BulkTextExtract class to customize settings such as the chunking strategy and page-break handling.
  • Refer to the unstructured documentation for more advanced functionality.
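
To make the kind of settings involved more concrete, here is a minimal sketch of how chunking and page-break options might be bundled and passed to unstructured's partition function. ExtractSettings and extract_segments are hypothetical names and do not reflect the real fields of BulkTextExtract. The keyword arguments chunking_strategy, max_characters, and include_page_breaks are used here on the assumption that the installed unstructured version accepts them, so check its documentation before relying on them.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ExtractSettings:
    """Hypothetical settings bundle; not the actual BulkTextExtract fields."""
    chunking_strategy: Optional[str] = "by_title"  # None disables chunking
    max_characters: int = 1500
    include_page_breaks: bool = False

    def to_partition_kwargs(self) -> Dict[str, Any]:
        """Translate the settings into keyword arguments for partition()."""
        kwargs: Dict[str, Any] = {"include_page_breaks": self.include_page_breaks}
        if self.chunking_strategy is not None:
            kwargs["chunking_strategy"] = self.chunking_strategy
            kwargs["max_characters"] = self.max_characters
        return kwargs


def extract_segments(path: str, settings: ExtractSettings) -> List[str]:
    """Partition one document with unstructured and return each segment's text."""
    # Imported lazily so the settings class works without unstructured installed.
    from unstructured.partition.auto import partition

    elements = partition(filename=path, **settings.to_partition_kwargs())
    return [element.text for element in elements]
```

Keeping the options in one object makes it easy to try a different chunking setup without touching the processing loop.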

License

MIT License
