GitHub - FerrenF/bulkTextPartition: Uses unstructured.io library to batch process documents in a directory specified by command line

Bulk Text Extractor

This Python script extracts text from documents in various formats and splits it into segments using the unstructured-io library.

1. Install dependencies: pip install "unstructured[all-docs]" (the quotes keep shells such as zsh from interpreting the square brackets).
2. Clone or download the repository.
3. Run the script: python bulk_text_extractor.py <directory>
   - Replace <directory> with the path to your documents directory.

The script processes each document and saves the extracted segments in a subdirectory of the specified directory.
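The batch flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `OUTPUT_SUBDIR`, `find_documents`, `extract_segments`, and `process_directory` are hypothetical, and the output folder name and one-text-file-per-document layout are assumptions. The only library call used is `partition` from `unstructured.partition.auto`.

```python
from pathlib import Path

# Hypothetical folder name; the script's real output location is not documented here.
OUTPUT_SUBDIR = "extracted_segments"


def find_documents(directory: Path) -> list[Path]:
    """Return the regular files directly inside `directory`, sorted by name."""
    return sorted(p for p in directory.iterdir() if p.is_file())


def extract_segments(path: Path) -> list[str]:
    """Partition one document with unstructured and return each element's text."""
    # Imported lazily so the helpers above work without the library installed.
    from unstructured.partition.auto import partition

    return [str(element) for element in partition(filename=str(path))]


def process_directory(directory: Path, extract=extract_segments) -> list[Path]:
    """Write one text file of segments per document and return the files written."""
    out_dir = directory / OUTPUT_SUBDIR
    out_dir.mkdir(exist_ok=True)
    written = []
    for doc in find_documents(directory):
        segments = extract(doc)
        # Keep the original extension in the name so a.pdf and a.docx do not collide.
        target = out_dir / (doc.name + ".txt")
        target.write_text("\n\n".join(segments), encoding="utf-8")
        written.append(target)
    return written
```

Calling `process_directory(Path("my_docs"))` would then write the segments under `my_docs/extracted_segments/`. The `extract` parameter exists only so the file handling can be tried out with a stand-in function.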

You can modify the BulkTextExtract class to customize settings like chunking strategy, page break handling, etc.
Refer to the unstructured-io documentation for more advanced functionalities.
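As an illustration of the kind of settings mentioned above, here is a hedged sketch of how chunking and page-break options might be collected and passed to unstructured's `partition` function. `ExtractSettings` and `partition_with` are hypothetical names, not the repository's `BulkTextExtract` class, and the defaults shown are assumptions; check the unstructured-io documentation for the keyword arguments your installed version accepts.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractSettings:
    """Hypothetical settings holder; not the repository's BulkTextExtract class."""

    chunking_strategy: Optional[str] = "by_title"  # e.g. "basic", "by_title", or None
    include_page_breaks: bool = False
    max_characters: int = 500  # assumed chunk size limit

    def to_partition_kwargs(self) -> dict:
        """Build keyword arguments for partition(), dropping chunk options when chunking is off."""
        kwargs = asdict(self)
        if self.chunking_strategy is None:
            kwargs.pop("chunking_strategy")
            kwargs.pop("max_characters")
        return kwargs


def partition_with(path: str, settings: ExtractSettings):
    """Partition one file using the given settings."""
    # Imported lazily so the settings class can be used without the library installed.
    from unstructured.partition.auto import partition

    return partition(filename=path, **settings.to_partition_kwargs())
```

For example, `partition_with("report.pdf", ExtractSettings(chunking_strategy="basic", include_page_breaks=True))` would request basic chunking and keep page-break elements.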

MIT License
