tika

tika is a Dart wrapper around the Apache Tika command-line interface. It uses dart:io to spawn a local Tika process, stream extracted text back as Stream<String>, and optionally collect the entire document into a single String.

This package assumes Apache Tika is already installed on the host machine or is available through java -jar /path/to/tika-app.jar.
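
The spawn-and-stream approach described above can be sketched with dart:io alone. This is an illustrative sketch under stated assumptions, not the package's actual implementation: tikaCommand and extractText are hypothetical names, and the only Tika flag used is --text.

```dart
import 'dart:convert';
import 'dart:io';

/// Hypothetical helper (not part of the package): builds the command line for
/// a PATH-based `tika` install, or for `java -jar` when [jarPath] is given.
List<String> tikaCommand(String documentPath, {String? jarPath}) {
  final List<String> prefix = jarPath == null
      ? <String>['tika']
      : <String>['java', '-jar', jarPath];
  return <String>[...prefix, '--text', documentPath];
}

/// Hypothetical helper: spawns the process and yields decoded stdout chunks
/// as they arrive, then fails if the process exits with a non-zero code.
Stream<String> extractText(String documentPath, {String? jarPath}) async* {
  final List<String> command = tikaCommand(documentPath, jarPath: jarPath);
  final String executable = command.first;
  final List<String> arguments = command.sublist(1);

  final Process process = await Process.start(executable, arguments);

  // Collect stderr concurrently so the child cannot stall on a full pipe.
  final Future<String> errors = process.stderr.transform(utf8.decoder).join();

  // Forward stdout chunk by chunk instead of buffering the whole document.
  yield* process.stdout.transform(utf8.decoder);

  final int code = await process.exitCode;
  if (code != 0) {
    throw ProcessException(executable, arguments, await errors, code);
  }
}
```

Calling extractText('/srv/documents/report.pdf') would run tika --text on that file, while passing jarPath: '/opt/tika/tika-app.jar' would switch to the java -jar form.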

Features

  • Extract document text through the tika CLI.
  • Stream stdout as Stream<String> for server-side pipelines.
  • Stream extracted text directly into a file without buffering the full payload in memory.
  • Read the full extracted text with a single Future<String>.
  • Support both PATH-based installs and explicit java -jar tika-app.jar execution.

Getting Started

Add the package to your project:

dart pub add tika

Your runtime environment must also have:

  • A Java runtime.
  • Apache Tika installed and available on PATH, or a downloaded tika-app.jar.

Usage

Use the default constructor when tika is already available on PATH:

import 'package:tika/tika.dart';

Future<void> main() async {
  TikaClient tika = TikaClient();

  String text = await tika.readText(
    documentPath: '/srv/documents/report.pdf',
  );

  print(text);
}

Stream text chunks directly from the Tika process:

import 'dart:io';

import 'package:tika/tika.dart';

Future<void> main() async {
  TikaClient tika = TikaClient();

  await for (String chunk in tika.streamText(
    documentPath: '/srv/documents/invoice.docx',
  )) {
    stdout.write(chunk);
  }
}

Write large extracted payloads straight to disk:

import 'dart:io';

import 'package:tika/tika.dart';

Future<void> main() async {
  TikaClient tika = TikaClient();

  await tika.streamToFile(
    documentPath: '/srv/documents/archive.pdf',
    file: File('/srv/output/archive.txt'),
  );
}
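
To illustrate why writing to disk this way avoids holding the full text in memory, here is a minimal sketch of piping text chunks into a file with dart:io. It is independent of the package's internals; writeChunksToFile is a hypothetical name, and any Stream<String> can stand in for the extracted text.

```dart
import 'dart:convert';
import 'dart:io';

/// Hypothetical helper: writes each text chunk to [file] as it arrives.
Future<void> writeChunksToFile(Stream<String> chunks, File file) async {
  final IOSink sink = file.openWrite();
  try {
    // The sink consumes the stream chunk by chunk, so the full text is
    // never assembled into a single in-memory string.
    await sink.addStream(chunks.transform(utf8.encoder));
  } finally {
    // close() flushes any pending data and releases the file handle.
    await sink.close();
  }
}
```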

Use the jar constructor when you want to run an explicit Tika jar:

import 'package:tika/tika.dart';

Future<void> main() async {
  TikaClient tika = TikaClient.jar(
    jarPath: '/opt/tika/tika-app.jar',
  );

  String text = await tika.readText(
    documentPath: '/srv/documents/contract.pdf',
  );

  print(text);
}

Ubuntu Docker Setup

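Apache Tika 3.3.0 is published by the Apache project as tika-app-3.3.0.jar, and the CLI documentation shows the supported java -jar tika-app.jar --text command shape. A minimal Ubuntu-based Docker image needs the packages below; the same apt packages apply to the Debian-based dart:stable image used in the example: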

  • default-jre-headless
  • curl
  • ca-certificates

Example Dockerfile snippet:

FROM dart:stable

ARG TIKA_VERSION=3.3.0

RUN apt-get update \
 && apt-get install -y --no-install-recommends \
    default-jre-headless \
    curl \
    ca-certificates \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /opt/tika \
 && curl -fsSL "https://archive.apache.org/dist/tika/${TIKA_VERSION}/tika-app-${TIKA_VERSION}.jar" \
    -o /opt/tika/tika-app.jar \
 && printf '#!/bin/sh\nexec java -jar /opt/tika/tika-app.jar "$@"\n' > /usr/local/bin/tika \
 && chmod +x /usr/local/bin/tika

The wrapper script installs a tika command in /usr/local/bin, so tika --text /path/to/file.pdf works inside the image and TikaClient() can use the default executable without extra configuration.

If you prefer not to create a shell wrapper, point the package at the jar directly:

TikaClient tika = TikaClient.jar(
  jarPath: '/opt/tika/tika-app.jar',
);

macOS Local Testing

For local development on macOS, the easiest path is Homebrew:

brew install tika

This installs the tika executable and pulls in the required Java dependency. After installation, verify it is available:

tika --version

Then your local Dart app can use:

TikaClient tika = TikaClient();
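
The tika --version check can also be run from Dart before constructing a client, for example to decide between the default constructor and the jar constructor. A minimal sketch using only dart:io; respondsToVersion is a hypothetical helper, not part of the package:

```dart
import 'dart:io';

/// Hypothetical helper: returns true when `<executable> --version` can be
/// started and exits with code 0.
Future<bool> respondsToVersion(String executable) async {
  try {
    final ProcessResult result =
        await Process.run(executable, <String>['--version']);
    return result.exitCode == 0;
  } on ProcessException {
    // Thrown when the executable cannot be found or started.
    return false;
  }
}
```

If respondsToVersion('tika') completes with false, falling back to TikaClient.jar(...) with an explicit jar path is one option.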

Notes

  • This package shells out to an installed binary and does not bundle Apache Tika itself.
  • If the process cannot be started, or if Tika exits with a non-zero code, the package throws TikaException.
  • streamToFile() uses streamText() internally so very large extracted text can be written incrementally instead of building one giant in-memory string.
  • streamLines() is available when line-by-line consumption is easier than raw text chunks.
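
The distinction between raw chunks and lines matters because chunk boundaries follow the process's output buffering rather than line endings, so a single line can arrive split across two chunks. A minimal sketch of regrouping chunks into lines with LineSplitter from dart:convert, independent of how streamLines() is actually implemented (chunksToLines is a hypothetical name):

```dart
import 'dart:convert';

/// Hypothetical helper: regroups arbitrary text chunks into complete lines.
/// A partial line at the end of one chunk is carried over to the next.
Stream<String> chunksToLines(Stream<String> chunks) {
  return chunks.transform(const LineSplitter());
}
```

Given the chunks 'ab\nc', 'd\n' and 'e', this yields the lines 'ab', 'cd' and 'e'.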
