DocDump

A package to extract text from common document types

DocDump makes it easy to extract raw text and document metadata from a range of commonly used document types: Word, PDF, PowerPoint, Excel, and plain text (txt). It is a wrapper around several existing packages: PyPDF2, openpyxl, python-docx, and python-pptx.

DocDump extracts all text as a single string, and does not preserve text structure. This makes it a useful tool in a natural language processing or search pipeline.

DocDump does not perform any preprocessing or normalisation of the extracted text.
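
Because the text comes back unnormalised, any cleanup is left to the caller. The following is a minimal sketch of one possible cleanup step using only the Python standard library. The `normalise` helper is hypothetical and not part of DocDump, and the steps it applies (Unicode folding, whitespace collapsing, lowercasing) are only one reasonable choice for a search or NLP pipeline.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Hypothetical helper (not part of DocDump): basic cleanup of extracted text."""
    # Fold compatibility characters (ligatures, non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace, including newlines and tabs, into one space.
    text = re.sub(r"\s+", " ", text)
    # Trim the ends and lowercase so later matching is case-insensitive.
    return text.strip().lower()
```

It would typically be applied to the extracted string, for example normalise(document.text).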

Getting Started

DocDump requires Python 3.7 or later.

Installation

pip install docdump

Usage

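from docdump import doc_reader

# Load a document from its file path
document = doc_reader("sampleFile.docx")

text_dump = document.text        # all extracted text as a single string
metadata = document.metadata     # document metadata
filetype = document.filetype     # type of the document
absolute_path = document.path    # absolute path to the file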
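
The single-file usage above extends naturally to a whole folder. The sketch below is illustrative only: `dump_folder` and the `SUPPORTED` set are hypothetical names, and the extensions are assumed from the formats listed in this README rather than confirmed against the package. The reader is passed in as an argument so the function has no hard dependency on DocDump.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed extensions, inferred from the formats named in this README.
SUPPORTED = {".docx", ".pdf", ".pptx", ".xlsx", ".txt"}


def dump_folder(folder: str, reader: Callable) -> Dict[str, str]:
    """Return {file path: extracted text} for supported files directly in `folder`."""
    texts = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # `reader` is expected to behave like doc_reader: it takes a path
            # string and returns an object exposing a `.text` attribute.
            texts[str(path)] = reader(str(path)).text
    return texts
```

With DocDump installed, this could be called as dump_folder("reports", doc_reader), where "reports" is a placeholder folder name.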

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Grant Holtes - gwholtes@gmail.com

Project Link: https://github.com/Gholtes/docdump
