Henkei was forked from Yomu, which is no longer maintained.
Here are some of the formats supported:
- Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
- OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
- Apple iWork Formats
- Rich Text Format (.rtf)
- Portable Document Format (.pdf)
For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.
Text, metadata and MIME type information can be extracted by calling Henkei.read directly:
require 'henkei'
data = File.read 'sample.pages'
text = Henkei.read :text, data
metadata = Henkei.read :metadata, data
mimetype = Henkei.read :mimetype, data
Henkei is backward compatible with Yomu, so existing calls keep working:
text = Yomu.read :text, data
Reading text from a given filename
Create a new instance of Henkei and pass a filename.
henkei = Henkei.new 'sample.pages'
text = henkei.text
Reading text from a given URL
This is useful for reading remote files, like documents hosted on Amazon S3.
henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
text = henkei.text
Reading text from a stream
Henkei can also read from a stream or any object that responds to read, including file uploads from Ruby on Rails or Sinatra.
post '/:name/:filename' do
  henkei = Henkei.new params[:data][:tempfile]
  henkei.text
end
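What "responds to read" means can be sketched without Henkei itself. The helper `readable_source?` below is hypothetical and not part of the Henkei API; it only illustrates the duck-typing check that makes File, Tempfile and StringIO objects candidates for stream input.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper (not part of Henkei): true when the object exposes
# a `read` method, which is the only requirement stated above.
def readable_source?(object)
  object.respond_to?(:read)
end

upload = Tempfile.new('upload')          # comparable to an uploaded tempfile
buffer = StringIO.new('in-memory bytes') # a stream that never touches disk

readable_source?(upload)       # true
readable_source?(buffer)       # true
readable_source?('sample.pdf') # false: a plain String has no read method

upload.close! # close and delete the temporary file
```

A String argument is therefore handled as a filename or URL, as in the earlier sections, rather than as a stream.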
Reading metadata
Metadata is returned as a hash.
henkei = Henkei.new 'sample.pages'
henkei.metadata['Content-Type'] #=> "application/vnd.apple.pages"
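Because the metadata is an ordinary Ruby hash, a missing key yields nil when read with `[]`. The sketch below shows one way to read a field with a fallback. The method name `content_type_of` and the fallback value are assumptions made for illustration, and the sample hash merely stands in for `henkei.metadata`.

```ruby
# Hypothetical helper: reads the Content-Type entry from a metadata hash,
# falling back to a generic type when the key is absent.
def content_type_of(metadata, default = 'application/octet-stream')
  metadata.fetch('Content-Type', default)
end

# Stand-in for henkei.metadata; the pair is taken from the example above.
sample = { 'Content-Type' => 'application/vnd.apple.pages' }

content_type_of(sample) # "application/vnd.apple.pages"
content_type_of({})     # "application/octet-stream"
```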
Reading MIME types
The MIME type is returned as a MIME::Type object.
henkei = Henkei.new 'sample.docx'
henkei.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
henkei.mimetype.extensions #=> ['docx']
Installation and Dependencies
Henkei packages the Apache Tika application jar and requires a working JRE to run.
Check that you either have the JAVA_HOME environment variable set, or that java is in your path.
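The check described above can be sketched in plain Ruby. This is not how Henkei locates Java internally; `find_java` is a hypothetical helper that assumes a Unix-like system where the executable is named `java`, and it takes the environment as a parameter only so it can be tried against a fake one.

```ruby
# Hypothetical helper: returns the path of a usable `java` executable,
# or nil if neither JAVA_HOME nor PATH leads to one.
def find_java(env = ENV)
  candidates = []

  # JAVA_HOME, when set, is expected to contain bin/java.
  java_home = env['JAVA_HOME'].to_s
  candidates << File.join(java_home, 'bin', 'java') unless java_home.empty?

  # Otherwise look for `java` in each non-empty PATH entry.
  env['PATH'].to_s.split(File::PATH_SEPARATOR).reject(&:empty?).each do |dir|
    candidates << File.join(dir, 'java')
  end

  candidates.find { |path| File.file?(path) && File.executable?(path) }
end
```

Calling `find_java` with no arguments inspects the real environment. A nil result suggests that a JRE needs to be installed or JAVA_HOME set before using Henkei.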
Add this line to your application's Gemfile:
gem 'henkei'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install henkei
Add the JVM Buildpack to your Heroku project:
$ heroku buildpacks:add heroku/jvm --index 1 -a YOUR_APP_NAME
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Create tests and make them pass
- Commit your changes (git commit -am 'Added some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request