Configuring Tesseract OCR for TikaOnDotNet #62

Open
LeeBear35 opened this issue Aug 29, 2016 · 8 comments

LeeBear35 commented Aug 29, 2016

The hope here is to get TikaOnDotNet fully configured to use Tesseract OCR for text extraction from images. Support for Tesseract was added with Tika .93, and we are now in the midst of validating the latest release, Tika 1.13.1. A large part of that validation centers on Tika's ability to handle certain types of PDF files. Note that the handling of TIFF images in PDFBox has changed because of licensing that does not comply with the Apache license.

So here is hoping that if we cannot read it one way, we might be able to read it using another.

The first step has been to extend Kevin's TextExtractor so that metadata can be passed in to assist the parsing. That set of extensions is here:

public static class TikaOnDotNetExtensions
{
    private static TikaConfig config = TikaConfig.getDefaultConfig();
    public static TextExtractionResult Extract(this TextExtractor te, byte[] data, string filePath, string ContentType)
    {
      TextExtractionResult result = te.Extract
        (
          metadata =>
          {
            metadata.add("resourceName", System.IO.Path.GetFileName(filePath));
            metadata.add("FilePath", filePath);
            try
            {
              if (!ContentType.Equals("application/octet-stream", StringComparison.CurrentCultureIgnoreCase))
              {
                metadata.add("Content-Type", ContentType);
              }
              else
              {
                Detector detector = config.getDetector();
                using (org.apache.tika.io.TikaInputStream inputStream = org.apache.tika.io.TikaInputStream.@get(data, metadata))
                {
                  MediaType foundType = detector.detect(inputStream, metadata);
                  if (!foundType.toString().Equals("application/octet-stream", StringComparison.CurrentCultureIgnoreCase))
                  {
                    metadata.add("Content-Type", foundType.toString());
                  }
                }
              }
            }
            catch (Exception)
            {
              // Rethrow with "throw;" so the original stack trace is preserved.
              throw;
            }


            return TikaInputStream.get(data, metadata);
          }
        );

      return result;
    }

    public static TextExtractionResult Extract(this TextExtractor te, byte[] data, string filePath)
    {
      return te.Extract(data, filePath, "application/octet-stream");
    }
}

The next step has been to dump the configuration to confirm how Tika is configured and what changes might need to be made. The dump routine was added to the class above:

    public static string TikaConfigDump()
    {
      StringBuilder retVal = new StringBuilder();

      retVal.AppendFormat("{0}\t{1}\n\n", "Version", (new org.apache.tika.Tika(config)).toString());


      retVal.AppendLine("\nDetectors");

      CompositeDetector configDetector = (CompositeDetector)config.getDetector();
      var detectors = configDetector.getDetectors().toArray();
      foreach (Detector detector in detectors)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)detector).getClass().getName());

        if (detector.GetType() == typeof(CompositeDetector))
        {
          // List the nested detector's own children, not the top-level list again.
          var subDetectors = ((CompositeDetector)detector).getDetectors().toArray();
          foreach (Detector subDetector in subDetectors)
          {
            retVal.AppendFormat("\t\t{0}\n", ((java.lang.Object)subDetector).getClass().getName());
          }
        }
      }

      retVal.AppendLine("\nParsers");

      CompositeParser configParser = (CompositeParser)config.getParser();
      var parsers = configParser.getAllComponentParsers().toArray();
      foreach (Parser parser in parsers)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)parser).getClass().getName());

        var parserTypes = parser.getSupportedTypes(new ParseContext()).toArray();
        foreach (MediaType mediaType in parserTypes)
        {
          retVal.AppendFormat("\t\t{0}\n", mediaType.toString());
        }
      }

      org.apache.tika.language.translate.Translator translator = config.getTranslator();
      if (translator.isAvailable())
      {
        retVal.AppendFormat("Translator {0}\n", ((java.lang.Object)translator).getClass().getName());
      }

      return retVal.ToString();
    }

On my system, using the default configuration provided by Kevin, the setup looks like this:

Version Apache Tika 1.13

Detectors
org.apache.tika.parser.microsoft.POIFSContainerDetector
org.apache.tika.parser.pkg.ZipContainerDetector
org.gagravarr.tika.OggDetector
org.apache.tika.mime.MimeTypes

Parsers
org.apache.tika.parser.asm.ClassParser
application/java-vm
org.apache.tika.parser.audio.AudioParser
audio/x-wav
audio/basic
audio/x-aiff
org.apache.tika.parser.audio.MidiParser
application/x-midi
audio/midi
org.apache.tika.parser.chm.ChmParser
application/vnd.ms-htmlhelp
application/x-chm
application/chm
org.apache.tika.parser.code.SourceCodeParser
text/x-c++src
text/x-groovy
text/x-java-source
org.apache.tika.parser.crypto.Pkcs7Parser
application/pkcs7-signature
application/pkcs7-mime
org.apache.tika.parser.dif.DIFParser
application/dif+xml
org.apache.tika.parser.dwg.DWGParser
image/vnd.dwg
org.apache.tika.parser.epub.EpubParser
application/x-ibooks+zip
application/epub+zip
org.apache.tika.parser.executable.ExecutableParser
application/x-msdownload
application/x-sharedlib
application/x-elf
application/x-object
application/x-executable
application/x-coredump
org.apache.tika.parser.external.CompositeExternalParser
org.apache.tika.parser.feed.FeedParser
application/atom+xml
application/rss+xml
org.apache.tika.parser.font.AdobeFontMetricParser
application/x-font-adobe-metric
org.apache.tika.parser.font.TrueTypeParser
application/x-font-ttf
org.apache.tika.parser.gdal.GDALParser
application/x-gsc
image/x-ozi
application/x-pds
image/eir
application/x-usgs-dem
application/aaigrid
application/x-bag
application/elas
application/x-rs2
application/x-tsx
application/x-lcp
image/geotiff
application/x-mbtiles
application/x-cappi
application/x-netcdf
application/x-gsag
application/x-epsilon
application/x-ace2
application/jaxa-pal-sar
image/x-pcraster
application/x-msgn
image/arg
application/x-hdf
image/x-mff
application/x-kro
image/x-hdf5-image
image/x-dimap
image/x-srp
image/big-gif
application/x-envi
application/x-cosar
application/x-ntv2
image/bmp
application/x-doq2
application/x-bt
application/x-kml
application/x-gmt
application/x-rst
application/vrt
application/pcisdk
application/x-ctg
application/x-e00-grid
application/x-rik
image/ida
image/x-mff2
application/sdts-raster
application/x-snodas
image/jp2
image/sar-ceos
application/terragen
application/x-wcs
application/leveller
application/x-ingr
application/x-gtx
image/sgi
application/x-pnm
image/raster
application/fits
application/x-r
image/gif
application/x-envi-hdr
application/x-http
application/x-rmf
application/x-ecrg-toc
application/aig
application/x-rpf-toc
image/adrg
application/x-srtmhgt
application/x-generic-bin
application/jdem
image/x-airsar
application/x-webp
application/x-ngs-geoid
application/x-pcidsk
image/x-fujibas
application/x-wms
application/x-map
image/ceos
application/xpm
application/x-zmap
image/envisat
application/x-ers
application/x-doq1
application/x-isis2
application/x-nwt-grd
application/x-ppi
image/ilwis
application/x-isis3
application/x-nwt-grc
application/x-blx
application/gff
application/x-ndf
image/jpeg
application/x-geo-pdf
application/x-l1b
image/fit
application/x-gsbg
application/x-sdat
application/x-ctable2
application/x-grib
application/x-coasp
application/x-dipex
application/grass-ascii-grid
image/fits
application/x-til
application/x-dods
image/png
application/x-gxf
application/x-gs7bg
application/x-cpg
application/x-lan
application/x-xyz
image/bsb
application/x-p-aux
application/dted
application/x-rasterlite
image/nitf
image/hfa
application/x-fast
application/x-los-las
org.apache.tika.parser.geo.topic.GeoParser
application/geotopic
org.apache.tika.parser.geoinfo.GeographicInformationParser
text/iso19139+xml
org.apache.tika.parser.grib.GribParser
application/x-grib2
org.apache.tika.parser.hdf.HDFParser
application/x-hdf
org.apache.tika.parser.html.HtmlParser
text/html
application/vnd.wap.xhtml+xml
application/x-asp
application/xhtml+xml
org.apache.tika.parser.image.BPGParser
image/bpg
image/x-bpg
org.apache.tika.parser.image.ICNSParser
image/icns
org.apache.tika.parser.image.ImageParser
image/png
image/vnd.wap.wbmp
image/bmp
image/x-xcf
image/gif
image/x-icon
image/x-ms-bmp
org.apache.tika.parser.image.PSDParser
image/vnd.adobe.photoshop
org.apache.tika.parser.image.TiffParser
image/tiff
org.apache.tika.parser.image.WebPParser
image/webp
org.apache.tika.parser.iptc.IptcAnpaParser
text/vnd.iptc.anpa
org.apache.tika.parser.isatab.ISArchiveParser
application/x-isatab
org.apache.tika.parser.iwork.IWorkPackageParser
application/vnd.apple.keynote
application/vnd.apple.iwork
application/vnd.apple.numbers
application/vnd.apple.pages
org.apache.tika.parser.jdbc.SQLite3Parser
org.apache.tika.parser.journal.JournalParser
application/pdf
org.apache.tika.parser.jpeg.JpegParser
image/jpeg
org.apache.tika.parser.mail.RFC822Parser
message/rfc822
org.apache.tika.parser.mat.MatParser
application/x-matlab-data
org.apache.tika.parser.mbox.MboxParser
application/mbox
org.apache.tika.parser.mbox.OutlookPSTParser
application/vnd.ms-outlook-pst
org.apache.tika.parser.microsoft.JackcessParser
application/x-msaccess
org.apache.tika.parser.microsoft.OfficeParser
application/x-tika-msoffice-embedded; format=ole10_native
application/msword
application/vnd.visio
application/vnd.ms-project
application/x-tika-msworks-spreadsheet
application/x-mspublisher
application/vnd.ms-powerpoint
application/x-tika-msoffice
application/sldworks
application/x-tika-ooxml-protected
application/vnd.ms-excel
application/vnd.ms-outlook
org.apache.tika.parser.microsoft.OldExcelParser
application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
org.apache.tika.parser.microsoft.TNEFParser
application/vnd.ms-tnef
application/x-tnef
application/ms-tnef
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
application/vnd.ms-word.document.macroenabled.12
application/vnd.ms-excel.addin.macroenabled.12
application/x-tika-ooxml
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.ms-powerpoint.addin.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.ms-powerpoint.slideshow.macroenabled.12
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.ms-powerpoint.presentation.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.slideshow
application/vnd.ms-excel.template.macroenabled.12
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-word.template.macroenabled.12
org.apache.tika.parser.mp3.Mp3Parser
audio/mpeg
org.apache.tika.parser.mp4.MP4Parser
video/x-m4v
application/mp4
video/3gpp
video/3gpp2
video/quicktime
audio/mp4
video/mp4
org.apache.tika.parser.netcdf.NetCDFParser
application/x-netcdf
org.apache.tika.parser.ocr.TesseractOCRParser
org.apache.tika.parser.odf.OpenDocumentParser
application/x-vnd.oasis.opendocument.presentation
application/vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.image
application/vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.sun.xml.writer
application/x-vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.image
application/x-vnd.oasis.opendocument.text
application/x-vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.text
application/vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.text-master
application/vnd.oasis.opendocument.presentation
application/x-vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.text-master
org.apache.tika.parser.pdf.PDFParser
application/pdf
org.apache.tika.parser.pkg.CompressorParser
application/zlib
application/x-gzip
application/x-bzip2
application/x-compress
application/x-java-pack200
application/gzip
application/x-bzip
application/x-xz
org.apache.tika.parser.pkg.PackageParser
application/x-tar
application/java-archive
application/x-archive
application/zip
application/x-cpio
application/x-tika-unix-dump
application/x-7z-compressed
org.apache.tika.parser.pkg.RarParser
application/x-rar-compressed
org.apache.tika.parser.pot.PooledTimeSeriesParser
org.apache.tika.parser.rtf.RTFParser
application/rtf
org.apache.tika.parser.txt.TXTParser
text/plain
org.apache.tika.parser.video.FLVParser
video/x-flv
org.apache.tika.parser.xml.DcXMLParser
application/xml
image/svg+xml
org.apache.tika.parser.xml.FictionBookParser
application/x-fictionbook+xml
org.gagravarr.tika.FlacParser
audio/x-oggflac
audio/x-flac
org.gagravarr.tika.OggParser
audio/ogg
application/kate
application/ogg
video/daala
video/x-ogguvs
video/x-ogm
audio/x-oggpcm
video/ogg
video/x-dirac
video/x-oggrgb
video/x-oggyuv
org.gagravarr.tika.OpusParser
audio/opus
audio/ogg; codecs=opus
org.gagravarr.tika.SpeexParser
audio/ogg; codecs=speex
audio/speex
org.gagravarr.tika.TheoraParser
video/theora
org.gagravarr.tika.VorbisParser
audio/vorbis

The next set of steps will be configuring and testing Tesseract prior to integrating it in Tika.

@KevM
Owner

KevM commented Aug 30, 2016

Thanks for creating this issue and looking into exposing this potentially useful feature of Tika and Tesseract.

@LeeBear35
Author

After installing Tesseract, I used pbrush (Microsoft Paint) to create a test image containing the text Hello World and saved it as bmp, gif, jpg, png and tif.

As a baseline, I ran these files through Tesseract directly to make sure that Hello World was the text extracted from each file. The GIF file failed because the drive I was running Tesseract on did not have a TMP directory at its root. Tesseract should use the system temporary directory; this is a bug in the current release.

After a couple of false starts I was finally able to get it working correctly. Here are the steps:
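
The baseline check described here can be scripted rather than run by hand. The following is a minimal sketch, assuming a Tesseract build that writes the recognized text to standard output when the output base is given as stdout. The class and method names are hypothetical and are not part of Tesseract or TikaOnDotNet.

```csharp
using System;
using System.Diagnostics;

public static class TesseractBaseline
{
    // Builds the command-line arguments "<image> stdout", quoting the image
    // path so that paths containing spaces survive.
    public static string BuildArguments(string imagePath)
    {
        return "\"" + imagePath + "\" stdout";
    }

    // Runs the Tesseract executable on one image and returns the recognized text.
    // Assumption: this Tesseract build prints the text to standard output
    // when the output base is "stdout".
    public static string RunOcr(string tesseractExe, string imagePath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = tesseractExe,
            Arguments = BuildArguments(imagePath),
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }

    // True when the OCR output contains the expected phrase, ignoring case.
    public static bool ContainsExpected(string ocrText, string expected)
    {
        return ocrText != null
            && ocrText.IndexOf(expected, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Calling RunOcr once per test file (bmp, gif, jpg, png, tif) and passing each result to ContainsExpected with "Hello World" would reproduce the manual check, and a failure such as the GIF case would show up as a missing match.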

  1. Create a TesseractOCRConfig object.
  2. Call setTesseractPath on that object, passing in the installation path for Tesseract.
  3. On the ParseContext object, call set, passing in typeof(TesseractOCRConfig) and the config object.

Parsing then uses the Tesseract parser.

I refactored Kevin's TextExtractor so that it can be called using:

TikaOnDotNet.TextExtractionOCR.TextExtractor textExtractor = new TikaOnDotNet.TextExtractionOCR.TextExtractor();
textExtractor.TesseractPath = @"E:\Tesseract";
TextExtractionResult Actual = textExtractor.Extract(buffer, testFile, mimeType);

Here is the entire class:

using System;
using System.Linq;
using java.io;
using javax.xml.transform;
using javax.xml.transform.sax;
using javax.xml.transform.stream;
using org.apache.tika.io;
using org.apache.tika.metadata;
using org.apache.tika.parser;
using Exception = System.Exception;
using TikaOnDotNet.TextExtraction;
using org.apache.tika.config;
using org.apache.tika.detect;
using org.apache.tika.mime;
using org.apache.tika.parser.ocr;

namespace TikaOnDotNet.TextExtractionOCR
{
  public interface ITextExtractor
  {
    /// <summary>
    /// Extract text from a given filepath.
    /// </summary>
    /// <param name="filePath">File path to be extracted.</param>
    TextExtractionResult Extract(string filePath);

    /// <summary>
    /// Extract text from a byte[]. This is a good way to get data from arbitrary sources.
    /// </summary>
    /// <param name="data">A byte array of data which will have its text extracted.</param>
    TextExtractionResult Extract(byte[] data);

    /// <summary>
    /// Extract text from a byte[]. This is a good way to get data from arbitrary sources.
    /// </summary>
    /// <param name="data">A byte array of data which will have its text extracted.</param>
    /// <param name="filePath">A string containing the file name to help the detector determine the proper parser</param>
    /// <param name="ContentType">A string that has the mime type to help the detector determine the correct parser to use</param>
    TextExtractionResult Extract(byte[] data, string filePath, string ContentType);

    /// <summary>
    /// Extract text from a URI. Time to create your very own web spider.
    /// </summary>
    /// <param name="uri">URL which will have its text extracted.</param>
    TextExtractionResult Extract(Uri uri);

    /// <summary>
    /// Under the hood we are using Tika, which is a Java project. Tika wants a java.io.InputStream. The other overloads eventually call this Extract, giving this method a Func.
    /// </summary>
    /// <param name="streamFactory">A Func which takes a Metadata object and returns an InputStream.</param>
    /// <returns></returns>
    TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory);
  }

  public class TextExtractor : ITextExtractor
  {
    private static TikaConfig config = TikaConfig.getDefaultConfig();
    private TesseractOCRConfig tesseractOCRConfig;
    // Instance field: the path belongs with the per-instance tesseractOCRConfig.
    private string tesseractPath = string.Empty;
    public string TesseractPath 
    { 
      get { return tesseractPath; } 
      set 
      { 
        tesseractPath = value;
        tesseractOCRConfig = new TesseractOCRConfig();
        //todo: validate directory and tesseract.exe at location
        tesseractOCRConfig.setTesseractPath(tesseractPath);
      } 
    }
    public bool IsOCRPathEnabled 
    { 
      get { return tesseractOCRConfig != null; } 
      set
      {
        if (value)
        {
          tesseractOCRConfig = new TesseractOCRConfig();
          tesseractOCRConfig.setTesseractPath(tesseractPath);
        }
        else
        {
          tesseractOCRConfig = null;
        }
      }
    }
    public TextExtractionResult Extract(string filePath)
    {
      try
      {
        var inputStream = new FileInputStream(filePath);
        return Extract(metadata =>
        {
          var result = TikaInputStream.get(inputStream);
          metadata.add("FilePath", filePath);
          return result;
        });
      }
      catch (Exception ex)
      {
        throw new TextExtractionException("Extraction of text from the file '{0}' failed.".ToFormat(filePath), ex);
      }
    }
    public TextExtractionResult Extract(byte[] data)
    {
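      // Pass the octet-stream type so the detector runs instead of an empty Content-Type being recorded.
      return Extract(data, string.Empty, org.apache.tika.mime.MimeTypes.OCTET_STREAM);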
    }

    public TextExtractionResult Extract(byte[] data, string filePath, string ContentType)
    {
      TextExtractionResult result = Extract
        (
          metadata =>
          {
            metadata.add(org.apache.tika.metadata.TikaMetadataKeys.__Fields.RESOURCE_NAME_KEY, System.IO.Path.GetFileName(filePath));
            metadata.add(org.apache.tika.metadata.TikaMimeKeys.__Fields.TIKA_MIME_FILE, filePath);
            try
            {
              if (!ContentType.Equals(org.apache.tika.mime.MimeTypes.OCTET_STREAM, StringComparison.CurrentCultureIgnoreCase))
              {
                metadata.add(org.apache.tika.metadata.HttpHeaders.__Fields.CONTENT_TYPE, ContentType);
              }
              else
              {
                Detector detector = config.getDetector();
                using (org.apache.tika.io.TikaInputStream inputStream = org.apache.tika.io.TikaInputStream.@get(data, metadata))
                {
                  MediaType foundType = detector.detect(inputStream, metadata);
                  if (!foundType.toString().Equals(org.apache.tika.mime.MimeTypes.OCTET_STREAM, StringComparison.CurrentCultureIgnoreCase))
                  {
                    metadata.add(org.apache.tika.metadata.HttpHeaders.__Fields.CONTENT_TYPE, foundType.toString());
                  }
                }
              }
            }
            catch (Exception)
            {
              // Rethrow with "throw;" so the original stack trace is preserved.
              throw;
            }

            return TikaInputStream.get(data, metadata);
          }
        );

      return result;
    }

    public TextExtractionResult Extract(Uri uri)
    {
      var jUri = new java.net.URI(uri.ToString());
      return Extract(metadata =>
      {
        var result = TikaInputStream.get(jUri, metadata);
        metadata.add("Uri", uri.ToString());
        return result;
      });
    }

    public TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory)
    {
      try
      {
        var parser = new AutoDetectParser();
        var metadata = new Metadata();
        var outputWriter = new StringWriter();
        var parseContext = new ParseContext();

        if (IsOCRPathEnabled)
        {
          parseContext.set(typeof(TesseractOCRConfig), tesseractOCRConfig);
        }

        //use the base class type for the key or parts of Tika won't find a usable parser
        parseContext.set(typeof(Parser), parser);

        using (var inputStream = streamFactory(metadata))
        {
          try
          {
            parser.parse(inputStream, getTransformerHandler(outputWriter), metadata, parseContext);
          }
          finally
          {
            inputStream.close();
          }
        }

        return AssembleExtractionResult(outputWriter.ToString(), metadata);
      }
      catch (Exception ex)
      {
        throw new TextExtractionException("Extraction failed.", ex);
      }
    }

    private static TextExtractionResult AssembleExtractionResult(string text, Metadata metadata)
    {
      var metaDataResult = metadata.names()
        .ToDictionary(name => name, name => string.Join(", ", metadata.getValues(name)));

      var contentType = metaDataResult["Content-Type"];

      return new TextExtractionResult
      {
        Text = text,
        ContentType = contentType,
        Metadata = metaDataResult
      };
    }

    private static TransformerHandler getTransformerHandler(Writer output)
    {
      var factory = (SAXTransformerFactory)TransformerFactory.newInstance();
      var transformerHandler = factory.newTransformerHandler();

      transformerHandler.getTransformer().setOutputProperty(OutputKeys.METHOD, "text");
      transformerHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");

      transformerHandler.setResult(new StreamResult(output));
      return transformerHandler;
    }
    public static string TikaConfigDump()
    {
      System.Text.StringBuilder retVal = new System.Text.StringBuilder();

      retVal.AppendFormat("{0}\t{1}\n", "Version", (new org.apache.tika.Tika(config)).toString());

      retVal.AppendLine("\nDetectors");

      CompositeDetector configDetector = (CompositeDetector)config.getDetector();
      var detectors = configDetector.getDetectors().toArray();
      foreach (Detector detector in detectors)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)detector).getClass().getName());

        if (detector.GetType() == typeof(CompositeDetector))
        {
          // List the nested detector's own children, not the top-level list again.
          var subDetectors = ((CompositeDetector)detector).getDetectors().toArray();
          foreach (Detector subDetector in subDetectors)
          {
            retVal.AppendFormat("\t\t{0}\n", ((java.lang.Object)subDetector).getClass().getName());
          }
        }
      }

      retVal.AppendLine("\nParsers");

      CompositeParser configParser = (CompositeParser)config.getParser();
      var parsers = configParser.getAllComponentParsers().toArray();
      foreach (Parser parser in parsers)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)parser).getClass().getName());

        var parserTypes = parser.getSupportedTypes(new ParseContext()).toArray();
        foreach (MediaType mediaType in parserTypes)
        {
          retVal.AppendFormat("\t\t{0}\n", mediaType.toString());
        }
      }

      org.apache.tika.language.translate.Translator translator = config.getTranslator();
      if (translator.isAvailable())
      {
        retVal.AppendFormat("Translator {0}\n", ((java.lang.Object)translator).getClass().getName());
      }

      retVal.AppendFormat("\nFallback Parser: {0}\n", configParser.getFallback());

      return retVal.ToString();
    }
  }
}
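
The todo comment in the TesseractPath setter (validate the directory and tesseract.exe at that location) could be handled by a small check along these lines. This is only a sketch: the helper name is hypothetical, it is not part of TikaOnDotNet, and it assumes the executable is named tesseract.exe on Windows or tesseract elsewhere.

```csharp
using System;
using System.IO;

public static class TesseractPathCheck
{
    // Hypothetical helper: returns true when the directory exists and
    // contains a file with one of the expected executable names.
    public static bool LooksLikeTesseractInstall(string directory)
    {
        if (string.IsNullOrWhiteSpace(directory) || !Directory.Exists(directory))
        {
            return false;
        }

        // Assumption: "tesseract.exe" on Windows, "tesseract" on other systems.
        return File.Exists(Path.Combine(directory, "tesseract.exe"))
            || File.Exists(Path.Combine(directory, "tesseract"));
    }
}
```

The setter could call this before creating the TesseractOCRConfig and throw an ArgumentException when it returns false, so that a wrong path fails immediately instead of silently producing no OCR text.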

@KevM
Owner

KevM commented Sep 1, 2016

Would you like to submit a PR with this? I can work with you to get this capability into the text extractor.

@LeeBear35
Author

LeeBear35 commented Sep 1, 2016

I would be happy to.

@KevM
Owner

KevM commented Dec 12, 2016

I'd like to discuss this feature addition a bit. @Sicos1997 was nice enough to roll this feature into PR #72, creating a separate ITextExtractor implementation that works with Tesseract to OCR images and, optionally, PDFs.

Unfortunately, it looks like the Tika integration with Tesseract requires an executable (not a library) to be installed. Here are the Windows instructions (https://github.com/tesseract-ocr/tesseract/wiki#windows):

An unofficial installer for windows for Tesseract 3.05-dev is available from Tesseract at UB Mannheim. This includes the training tools.
An installer for the old version 3.02 is available for Windows from our download page. This includes the English training data. If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the 'tessdata' directory, probably C:\Program Files\Tesseract OCR\tessdata.

I see a few problems just getting Tesseract installed:

  • There is no supported Windows installer.
  • There are multiple steps even if you use the unsupported installer.
  • There are language-specific steps to take.

None of this is turnkey. So how do we test it? Here is a possible plan...

Add a Tesseract TextExtractor

  • Add a separate Tesseract enabled implementation of ITextExtractor.
  • Write solid setup documentation for Tesseract, linking to the official docs and providing way-finding for common problems (it sounds like they do happen, and I don't want to get into the Tesseract support business).
  • But... How do we know it is working?

Testing Concerns

I am not sure how to have our AppVeyor CI test the Tesseract integration. There is no Chocolatey package.

The biggest hurdle I see to having support for this feature is:

  • Figure out how to automate getting the Tesseract executable ready for test runs locally and on our CI server. One way to do this might be to create a Chocolatey package for Tesseract as Appveyor nicely supports installing Chocolatey packages.

The main reason I don't want to move forward is that I don't want to manually test this feature. So, until we can automate it, I don't want to add it. If someone who is using Tika + Tesseract now via .NET were to step up and help out with the automation, I would be happy to work with you on it.
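
One small piece of that automation could be a guard that lets Tesseract-dependent tests detect whether the executable is available before they run. The following is a rough sketch, not tied to any test framework. The class and method names are hypothetical, and it assumes the installer (or a future package) puts the executable on the PATH.

```csharp
using System;
using System.IO;

public static class TesseractLocator
{
    // Searches each directory in a PATH-style string for one of the given
    // executable names and returns the first full path found, or null.
    public static string FindExecutable(string pathVariable, params string[] executableNames)
    {
        if (string.IsNullOrEmpty(pathVariable))
        {
            return null;
        }

        string[] directories = pathVariable.Split(
            new[] { Path.PathSeparator }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string directory in directories)
        {
            // PATH entries are sometimes wrapped in quotes; strip them first.
            string cleaned = directory.Trim().Trim('"');
            foreach (string name in executableNames)
            {
                string candidate = Path.Combine(cleaned, name);
                if (File.Exists(candidate))
                {
                    return candidate;
                }
            }
        }

        return null;
    }

    // Convenience wrapper that reads the current process's PATH.
    // Assumption: the executable is named tesseract.exe or tesseract.
    public static string FindTesseract()
    {
        return FindExecutable(
            Environment.GetEnvironmentVariable("PATH"), "tesseract.exe", "tesseract");
    }
}
```

A test fixture could call FindTesseract once and mark the OCR tests as ignored when it returns null, which keeps local runs green on machines without Tesseract while still exercising the feature on a CI image that has it installed.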

Another option

If someone really wants this feature but is not willing to do the automation required, we could start a new NuGet package and let someone own the manual testing it would require. I am also happy to facilitate that direction. That said, a Chocolatey package seems like an equivalent route.

@LeeBear35
Author

LeeBear35 commented Dec 12, 2016 via email

@KevM
Owner

KevM commented Dec 12, 2016

Thanks, it is useful to see how you got it working.

@delagoutte-wanao

Hello,
I tried LeeBear35's code and it works only with Tesseract 3.05, not with version 4.
Has anyone been able to make TikaOnDotNet work with Tesseract 4?
Do you think the problem could be the version of Tika that is deployed with TikaOnDotNet?
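
When comparing behavior across Tesseract releases, it helps to record which version the configured executable actually reports. The sketch below is a diagnostic aid only and does not address the version 4 problem itself. The class and method names are hypothetical, and it assumes the first line of the version banner looks like "tesseract 3.05.00dev" or "tesseract v4.0.0".

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class TesseractVersion
{
    // Extracts the major version from banner text such as
    // "tesseract 3.05.00dev" or "tesseract v4.0.0".
    // Returns null when no version number can be recognized.
    public static int? ParseMajor(string versionOutput)
    {
        if (string.IsNullOrEmpty(versionOutput))
        {
            return null;
        }

        Match match = Regex.Match(
            versionOutput, @"tesseract\s+v?(\d+)\.", RegexOptions.IgnoreCase);
        return match.Success ? int.Parse(match.Groups[1].Value) : (int?)null;
    }

    // Runs "<tesseractExe> --version" and parses the banner.
    public static int? DetectMajor(string tesseractExe)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = tesseractExe,
            Arguments = "--version",
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Assumption: depending on the release, the banner may be written
            // to stdout or stderr, so both are captured. The banner is short,
            // so reading the streams one after the other is acceptable here.
            string output = process.StandardOutput.ReadToEnd()
                + process.StandardError.ReadToEnd();
            process.WaitForExit();
            return ParseMajor(output);
        }
    }
}
```

Logging the result of DetectMajor next to each extraction run would make it clear whether a failure correlates with the Tesseract version or with something else in the setup.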
