Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

Henkei 変形

Henkei is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit.

Henkei was forked from Yomu, which is no longer maintained.

Here are some of the formats supported:

  • Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
  • OpenDocument Formats (.odt, .ods, .odp)
  • Apple iWork Formats
  • Rich Text Format (.rtf)
  • Portable Document Format (.pdf)

For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.


Usage

Text, metadata and MIME type information can be extracted by calling Henkei.read directly:

require 'henkei'

data = File.read 'sample.pages'
text = Henkei.read :text, data
metadata = Henkei.read :metadata, data
mimetype = Henkei.read :mimetype, data
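
When all three results are needed for the same document, the separate calls can be gathered into one hash. The following is a minimal sketch, not part of Henkei: extract_all is a hypothetical helper, and it assumes a reader object that responds to read(type, data), which is how Henkei.read is called.

```ruby
# Hypothetical helper, not part of Henkei: gathers text, metadata and MIME type
# for one document into a single hash.
# `reader` is any object responding to read(type, data); in an application
# this would be Henkei itself (assumption).
def extract_all(data, reader)
  %i[text metadata mimetype].map { |type| [type, reader.read(type, data)] }.to_h
end
```

With the gem loaded, this would be called as extract_all(File.read('sample.pages'), Henkei). Passing the reader in as an argument also makes it easy to substitute a stub in tests, where Java may not be available.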

Henkei is backward compatible with Yomu:

text = Yomu.read :text, data

Reading text from a given filename

Create a new instance of Henkei and pass a filename.

henkei = Henkei.new 'sample.pages'
text = henkei.text

Reading text from a given URL

This is useful for reading remote files, like documents hosted on Amazon S3.

henkei = Henkei.new ''
text = henkei.text

Reading text from a stream

Henkei can also read from a stream or any object that responds to read, including file uploads from Ruby on Rails or Sinatra.

post '/:name/:filename' do
  henkei = Henkei.new params[:data][:tempfile]
  henkei.text
end
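
Since a filename, a URL and a stream are all accepted, it can help to check what kind of input you are holding before handing it over, for example to reject a bad upload early. The sketch below is an assumption-laden illustration and not Henkei's own detection logic: input_kind is a hypothetical helper, and the categories it returns are made up for this example.

```ruby
require 'uri'

# Hypothetical pre-check, not Henkei's own logic: classifies a String or
# IO-like object as :stream, :file, :url or :unknown.
def input_kind(input)
  # Anything that responds to read is treated as a stream.
  return :stream if input.respond_to?(:read)

  string = input.to_s
  return :file if File.file?(string)

  uri = begin
    URI.parse(string)
  rescue URI::InvalidURIError
    nil
  end
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes match here.
  return :url if uri.is_a?(URI::HTTP) && uri.host

  :unknown
end
```

A caller could then raise its own error for :unknown instead of passing the value on.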

Reading metadata

Metadata is returned as a hash.

henkei = Henkei.new 'sample.pages'
henkei.metadata['Content-Type'] #=> "application/"
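
Because the metadata is a plain hash, ordinary Hash methods apply. Which keys are present depends on the document and on the bundled Tika version, so a lookup that tries several candidate keys can be convenient. This is a minimal sketch: first_present is a hypothetical helper, the sample hash stands in for henkei.metadata, and every key other than 'Content-Type' is an assumption.

```ruby
# Hypothetical helper, not part of Henkei: returns the value of the first
# key that is present and non-empty in the metadata hash, or nil.
def first_present(metadata, *keys)
  keys.each do |key|
    value = metadata[key]
    return value unless value.nil? || value.to_s.empty?
  end
  nil
end

# Stand-in for henkei.metadata so the sketch runs without Java or the gem.
# The 'dc:title' key and both values are assumptions for illustration.
sample = { 'Content-Type' => 'application/pdf', 'dc:title' => 'Quarterly report' }

content_type = first_present(sample, 'Content-Type')
title = first_present(sample, 'title', 'dc:title')
author = first_present(sample, 'Author') # nil, the key is absent
```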

Reading MIME types

MIME type is returned as a MIME::Type object.

henkei = Henkei.new 'sample.docx'
henkei.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
henkei.mimetype.extensions #=> ['docx']
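
One practical use of the detected type is to spot files whose name does not match their content, such as a PDF uploaded with a .docx name. The following sketch assumes only that extensions is an array of strings, as in the example above; extension_matches? is a hypothetical helper and not part of Henkei.

```ruby
# Hypothetical check, not part of Henkei: does the file name's extension
# appear in the extensions reported for the detected MIME type?
def extension_matches?(filename, extensions)
  ext = File.extname(filename).delete_prefix('.').downcase
  !ext.empty? && extensions.map(&:downcase).include?(ext)
end

# Stand-in for henkei.mimetype.extensions so the sketch runs without the gem.
detected_extensions = ['docx']

extension_matches?('report.DOCX', detected_extensions) # the name agrees with the content
extension_matches?('report.pdf', detected_extensions)  # the name disagrees with the content
```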

Installation and Dependencies

Java Runtime

Henkei packages the Apache Tika application jar and requires a working JRE. Make sure that either the JAVA_HOME environment variable is set or java is on your PATH.
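
The two conditions can be checked from Ruby before the first extraction, for instance in an initializer. This is a rough sketch under stated assumptions: find_java and JAVA_EXE are hypothetical names, and the lookup order (JAVA_HOME/bin first, then PATH) is a choice made for this example, not a description of how Henkei locates Java.

```ruby
require 'rbconfig'

# Name of the java executable on this platform.
JAVA_EXE = RbConfig::CONFIG['host_os'].match?(/mswin|mingw|cygwin/) ? 'java.exe' : 'java'

# Hypothetical helper, not part of Henkei: returns the path of the first
# executable java found under JAVA_HOME/bin or on PATH, or nil if none exists.
# `env` defaults to ENV; a plain Hash can be passed for testing.
def find_java(env = ENV)
  dirs = []
  java_home = env['JAVA_HOME'].to_s
  dirs << File.join(java_home, 'bin') unless java_home.empty?
  dirs.concat(env['PATH'].to_s.split(File::PATH_SEPARATOR).reject(&:empty?))

  dirs.map { |dir| File.join(dir, JAVA_EXE) }
      .find { |path| File.file?(path) && File.executable?(path) }
end
```

An application could call find_java at boot and fail with a clear message when it returns nil, rather than failing later on the first document.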


Gems

Add this line to your application's Gemfile:

gem 'henkei'

And then execute:

$ bundle

Or install it yourself as:

$ gem install henkei


Heroku

Add the JVM Buildpack to your Heroku project:

$ heroku buildpacks:add heroku/jvm --index 1 -a YOUR_APP_NAME


Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Create tests and make them pass (rake test)
  4. Commit your changes (git commit -am 'Added some feature')
  5. Push to the branch (git push origin my-new-feature)
  6. Create a new Pull Request