Plugins::Attachments: Add an attachments plugin (support parsing various file formats) #92

Closed
@kimchy

Using the new plugins system, implement the attachments plugin. It adds a mapping type called attachment, which accepts the binary content of an attachment (base64 encoded) to index.

Installation is simple: download the plugin zip file and place it under the plugins directory within the installation. When building from source, the plugin will be under build/distributions/plugins. Once placed in the installation, the attachment mapper type is automatically supported.

Using the attachment type is simple: in your mapping JSON, map a JSON element as attachment, for example:

{
    person : {
        properties : {
            "myAttachment" : { type : "attachment" }
        }
    }
}

In this case, the JSON to index can be:

{
    myAttachment : "... base64 encoded attachment ..."
}
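
A minimal sketch of producing such a document on the client side, in Python. The helper name build_attachment_document and the sample bytes are made up for illustration and are not part of the plugin; only the myAttachment field name comes from the mapping above.

```python
import base64
import json


def build_attachment_document(raw_bytes, field_name="myAttachment"):
    """Return a JSON string whose single field holds the base64-encoded bytes.

    `field_name` must match the element mapped as type "attachment".
    """
    # b64encode returns bytes; decode to ASCII text so it can be placed in JSON.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field_name: encoded})


if __name__ == "__main__":
    # Stand-in bytes; a real attachment would be read from a file in binary mode.
    sample = b"hello attachment"
    print(build_attachment_document(sample))
```

The resulting string is what would be sent as the document body when indexing.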

The attachment type not only indexes the content of the doc, but also automatically adds metadata about the attachment (when available). The supported metadata fields are: date, title, author, and keywords. They can be queried using the "dot notation", for example: myAttachment.author.
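
As an illustration of the dot notation, the sketch below derives the full field names for the four metadata fields listed above. The helper metadata_field_paths is hypothetical and only does string formatting; it does not talk to the server.

```python
# Metadata names as listed in this issue.
METADATA_FIELDS = ("date", "title", "author", "keywords")


def metadata_field_paths(attachment_field):
    """Return the dot-notation names of the metadata fields of an attachment field."""
    return [f"{attachment_field}.{name}" for name in METADATA_FIELDS]


if __name__ == "__main__":
    for path in metadata_field_paths("myAttachment"):
        print(path)
```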

Both the metadata and the actual content are simple core type mappers (string, date, ...), so they can be controlled in the mappings. For example:

{
    person : {
        properties : {
            "file" : { 
                type : "attachment",
                fields : {
                    file : {index : "no"},
                    date : {store : "yes"},
                    author : {analyzer: "myAnalyzer"}
                }
            }
        }
    }
}

In the above example, the actual content is mapped under the field name file, and we choose not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, and there is no need to specify their type (like string or date) since it is already known.
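
For clients that build mappings programmatically, the same structure can be assembled as a dictionary and serialized to strict JSON. This is only a sketch: build_attachment_mapping is a made-up helper, and the option values are copied from the example above rather than checked against the plugin.

```python
import json


def build_attachment_mapping(type_name, field_name, fields=None):
    """Build a mapping dict with one attachment field.

    `fields` optionally overrides settings for the content and metadata
    sub-fields, using the same shape as the example in this issue.
    """
    attachment = {"type": "attachment"}
    if fields:
        # Copy so later changes to the caller's dict do not leak into the mapping.
        attachment["fields"] = dict(fields)
    return {type_name: {"properties": {field_name: attachment}}}


if __name__ == "__main__":
    mapping = build_attachment_mapping(
        "person",
        "file",
        {
            "file": {"index": "no"},              # content: not indexed
            "date": {"store": "yes"},             # metadata: stored
            "author": {"analyzer": "myAnalyzer"}, # metadata: custom analyzer
        },
    )
    print(json.dumps(mapping, indent=4))
```

Calling the helper without fields yields the simple mapping from the first example.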

The plugin uses Apache Tika (http://lucene.apache.org/tika/) to parse attachments, so many formats are supported. They are listed here: http://lucene.apache.org/tika/0.6/formats.html.
