Mode 1527 - Ported & updated text extractors from 2.x #430

Merged
merged 4 commits into from Jul 3, 2012


hchiorean
Member

The Tika text extractor is now located in extractors/modeshape-extractor-tika. In addition to porting the text extractors, a few important changes were made:

  • the text extractor API was updated to work with the 3.x API
  • text extraction, if configured, is triggered preemptively by the binary storage, and the extracted text is saved in that storage for subsequent access
  • the AS7 integration and configuration mechanism for text extractors was implemented

Horia Chiorean added 4 commits June 29, 2012 15:50
…nd updated the binary store to extract the text and mime-type of binary values

Working on this exposed how fragile, lock-wise, it is to work with the SharedLockingInputStream (FileSystemBinaryStore). Therefore, I've updated the mime-type detection so that mark & reset are avoided as much as possible, and so that streams are closed after each detector finishes with them.
The Tika version was bumped to 1.1, which also required updating the POI version to 3.8.
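
To illustrate the detection approach described above, here is a minimal sketch in which each detector gets its own freshly opened stream that is closed before the next detector runs, so no mark & reset is needed. The `Detector` and `StreamSource` interfaces and the two magic-byte detectors are hypothetical and are not ModeShape's actual API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;

public class MimeTypeDetectionSketch {

    /** Hypothetical detector contract (not the real ModeShape interface). */
    public interface Detector {
        /** Returns a MIME type, or null if this detector cannot decide. */
        String detect(InputStream stream) throws IOException;
    }

    /** Hypothetical source that opens a new stream over the same content each time. */
    public interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Illustrative detectors that only look at the leading magic bytes. */
    public static final Detector PNG =
        stream -> startsWith(stream, new byte[] {(byte) 0x89, 'P', 'N', 'G'}) ? "image/png" : null;
    public static final Detector PDF =
        stream -> startsWith(stream, new byte[] {'%', 'P', 'D', 'F'}) ? "application/pdf" : null;

    public static String detectMimeType(StreamSource source, List<Detector> detectors) {
        for (Detector detector : detectors) {
            // A fresh stream per detector: nothing to mark or reset, and the
            // stream (plus any lock it holds) is released before the next one runs.
            try (InputStream stream = source.open()) {
                String mimeType = detector.detect(stream);
                if (mimeType != null) {
                    return mimeType;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return null; // no detector recognized the content
    }

    static boolean startsWith(InputStream stream, byte[] magic) throws IOException {
        for (byte expected : magic) {
            int actual = stream.read();
            if (actual == -1 || (byte) actual != expected) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off in this sketch is that the content is opened once per detector, which costs more I/O than a single marked stream but keeps each lock short-lived.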
…th the indexing process.

 This meant that a few changes were needed:
 - the text extractors configuration has been updated to resemble that of the sequencers
 - the binary store interface has been updated to be able to store and retrieve extracted text for a given binary (source) value
 - the TextExtractors class was changed to become the entry point into text extraction
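
As a rough illustration of a binary store that can store and retrieve extracted text for a given binary value, the sketch below keeps the text in a map keyed by a content hash and runs the extractor when the value is stored. The class, its method names, the in-memory maps and the use of a SHA-1 hex key are all assumptions made for the example, not the interface changed in this pull request.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ExtractedTextStoreSketch {

    private final Map<String, byte[]> content = new ConcurrentHashMap<>();
    private final Map<String, String> extractedText = new ConcurrentHashMap<>();
    // Hypothetical extractor hook; null means no extractor is configured.
    private final Function<byte[], String> extractor;

    public ExtractedTextStoreSketch(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Stores the bytes and, if an extractor is configured, the text extracted from them. */
    public String storeValue(byte[] bytes) {
        String key = sha1Hex(bytes);
        content.put(key, bytes.clone());
        if (extractor != null) {
            String text = extractor.apply(bytes);
            if (text != null) {
                extractedText.put(key, text);
            }
        }
        return key;
    }

    /** Returns the previously extracted text, or null if none was stored for this key. */
    public String getExtractedText(String key) {
        return extractedText.get(key);
    }

    static String sha1Hex(byte[] bytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(bytes)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the text is keyed by the binary's content rather than by a node, later readers such as indexing can look it up without running the extractor again.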
…ptively by the binary storage, when a binary value is created.

For this to be possible, the context of the extractor cannot contain any node-specific information. Also, this exposed an issue with the SharedLockingInputStream: if the stream is closed in the "read" methods, Tika's parsers will keep reading it over and over (effectively reopening it each time), causing either OOM errors or duplicate text. Therefore, the "close" call has been removed from the read methods.
…tractors. To validate the configuration, an Arquillian integration test was added as well.
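
The re-reading problem can be reproduced with a toy model. The class below is a hypothetical stand-in for a lazily opened stream, not the real SharedLockingInputStream: when `closeOnEof` is set, hitting end-of-stream closes the delegate, so the next `read()` reopens it and starts from the first byte again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class LazyStreamSketch extends InputStream {

    private final byte[] data;
    private final boolean closeOnEof;
    private InputStream delegate; // opened lazily on the first read

    public LazyStreamSketch(byte[] data, boolean closeOnEof) {
        this.data = data.clone();
        this.closeOnEof = closeOnEof;
    }

    @Override
    public int read() throws IOException {
        if (delegate == null) {
            delegate = new ByteArrayInputStream(data); // (re)open from the start
        }
        int b = delegate.read();
        if (b == -1 && closeOnEof) {
            close(); // problematic: a later read() silently reopens the stream
        }
        return b;
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();
            delegate = null;
        }
    }

    /** Models a consumer that polls read() once more after seeing end-of-stream. */
    public static int bytesSeenWithRetry(InputStream in) {
        try {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            while (in.read() != -1) { // retry after EOF
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With three bytes of data, the retrying consumer sees six bytes when `closeOnEof` is true and three when it is false. Leaving the close to the caller means the stream keeps reporting end-of-stream until it is explicitly closed.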
@rhauch rhauch merged commit 164303a into ModeShape:master Jul 3, 2012