streamsx.document

This toolkit extracts text and metadata from documents in binary formats such as PDF, Word, and other Office formats. For this purpose, the toolkit implements a DocumentSource operator.

The DocumentSource operator uses multiple third-party and open source document extraction technologies, and can be extended with additional commercial or proprietary extractors. The operator automatically determines the document's MIME type and delegates the extraction request to the appropriate extractor plugin.

Out of the box the toolkit provides the following extractors:

Apache Tika – The primary extractor for binary documents such as Office documents (Word, PowerPoint, Excel), HTML files, etc.
PDFBox – Acrobat PDF files
TrueZIP – ZIP, JAR, TAR, GZ, GZIP, and other archive files
JUnrar – RAR files
Plain Text – Text files of various encodings (ASCII, UTF-8, UTF-16, local encodings)
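
The detect-then-delegate behavior described above can be illustrated with a minimal sketch. This is not the toolkit's code or API: the magic-byte table, `detect_mime_type`, `ExtractorRegistry`, and the stub extractors are all hypothetical names made up for illustration, and real MIME detection (for example in Apache Tika) is far more thorough.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical, heavily simplified magic-byte table (assumption, not the
# toolkit's detection logic). Note that Office Open XML files are also ZIP
# containers, so a real detector has to look inside the archive.
MAGIC_PREFIXES: List[Tuple[bytes, str]] = [
    (b"%PDF-", "application/pdf"),
    (b"Rar!\x1a\x07", "application/x-rar-compressed"),
    (b"PK\x03\x04", "application/zip"),
]

Extractor = Callable[[bytes], str]


def detect_mime_type(data: bytes) -> str:
    """Guess a MIME type from leading bytes, falling back to a UTF-8 check."""
    for prefix, mime in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return mime
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


class ExtractorRegistry:
    """Maps MIME types to extractor plugins, with a fallback extractor."""

    def __init__(self, fallback: Extractor) -> None:
        self._by_mime: Dict[str, Extractor] = {}
        self._fallback = fallback

    def register(self, mime: str, extractor: Extractor) -> None:
        self._by_mime[mime] = extractor

    def extract(self, data: bytes) -> str:
        # Detect the type first, then delegate to the matching plugin.
        mime = detect_mime_type(data)
        return self._by_mime.get(mime, self._fallback)(data)


def make_stub(name: str) -> Extractor:
    """Build a placeholder extractor that only reports which plugin ran."""
    return lambda data: f"{name} handled {len(data)} bytes"


# Hypothetical wiring that mirrors the extractor list above.
registry = ExtractorRegistry(fallback=make_stub("general-purpose extractor"))
registry.register("application/pdf", make_stub("PDF extractor"))
registry.register("application/zip", make_stub("archive extractor"))
registry.register("application/x-rar-compressed", make_stub("RAR extractor"))
registry.register("text/plain", make_stub("plain-text extractor"))
```

Calling `registry.extract(payload)` on a byte string routes it to whichever stub is registered for the detected type. Adding another extractor in this sketch only requires one more `register` call, which is the kind of plugin-style extension the paragraph above describes.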

The toolkit's home page is available at: http://ibmstreams.github.io/streamsx.document/
