bug: ensure tika extracts embedded doc with a stream #1165

Open
pirhoo opened this issue Aug 28, 2023 · 2 comments

Comments

@pirhoo
Member

pirhoo commented Aug 28, 2023

No description provided.

pirhoo changed the title from "bug: ensure tika extract embedded doc with a stream" to "bug: ensure tika extracts embedded doc with a stream" on Aug 28, 2023
@bamthomas
Collaborator

The result of the extraction is stored in the TikaDocumentSource class:

public class TikaDocumentSource {
    public final Metadata metadata;
    public final byte[] content;

    public TikaDocumentSource(final Metadata metadata, final byte[] content) {
        this.metadata = metadata;
        this.content = content;
    }
}

It is used in the SourceExtractor class.

The content should be stored as an InputStream (or a subclass) rather than as a byte array.
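
A minimal sketch of what a stream-based holder could look like. The class name `StreamingDocumentSource`, the `copyTo` helper, and the type parameter `M` are hypothetical and not part of the project; `M` stands in for Tika's `Metadata` so the snippet compiles without Tika on the classpath. How SourceExtractor would actually produce and consume the stream is not covered here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stream-based counterpart of TikaDocumentSource.
// M is a placeholder for the metadata type (Tika's Metadata in the real class).
class StreamingDocumentSource<M> implements Closeable {
    public final M metadata;
    // The content is exposed as a stream instead of a fully materialized byte[].
    public final InputStream content;

    StreamingDocumentSource(final M metadata, final InputStream content) {
        this.metadata = metadata;
        this.content = content;
    }

    // Copies the remaining content to `out` and returns the number of bytes copied.
    // InputStream.transferTo (Java 9+) copies through an internal buffer,
    // so the whole document is never held in a single array here.
    long copyTo(final OutputStream out) throws IOException {
        return content.transferTo(out);
    }

    // The holder owns the stream, so closing the holder closes the stream.
    @Override
    public void close() throws IOException {
        content.close();
    }
}
```

Because a stream can only be read once and must be closed, callers would typically wrap the holder in a try-with-resources block; that ownership rule is a design choice of this sketch, not something stated in the issue.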

@mvanzalu
Contributor

mvanzalu commented Sep 5, 2023

No solution for the disk issue for the moment.
