Configuring Tesseract OCR for TikaOnDotNet #62
Comments
Thanks for creating this issue and looking into exposing this potentially useful feature of Tika and Tesseract.
After installing Tesseract I used Paint (pbrush) to create a test image containing "Hello World" and saved it as BMP, GIF, JPG, PNG and TIF. As a baseline I ran these files through Tesseract to make sure that "Hello World" was the text extracted from each file. The GIF file failed because the drive I was running Tesseract on did not have a TMP directory at the root. Tesseract should be using the system temporary directory, but this is a bug in the current release. After a couple of false starts I was finally able to get it working correctly. Here are the steps:
Parsing then starts using that Tesseract parser. I refactored Kevin's TextExtractor so that it can be called using:
Here is the entire class:
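The baseline check described above (running each test image through the Tesseract executable before Tika is involved) could be scripted roughly as follows. This is only a sketch: the class and method names are hypothetical, and it assumes the usual `tesseract <image> <outputbase>` command line, where Tesseract writes its result to `<outputbase>.txt`.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper; none of these names come from TikaOnDotNet or Tika.
public static class TesseractBaseline
{
    // Builds the arguments for "tesseract <image> <outputbase>", quoting both paths.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\"";
    }

    // Collapses all runs of whitespace so trailing blank lines do not affect the comparison.
    public static string Normalize(string text)
    {
        return string.Join(" ", text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }

    // True when the recognized text equals the expected text after whitespace normalization.
    public static bool Matches(string recognized, string expected)
    {
        return string.Equals(Normalize(recognized), Normalize(expected), StringComparison.Ordinal);
    }

    // Runs the Tesseract executable on one image and returns the text it wrote.
    public static string Recognize(string tesseractExe, string imagePath)
    {
        string outputBase = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        var startInfo = new ProcessStartInfo
        {
            FileName = tesseractExe,
            Arguments = BuildArguments(imagePath, outputBase),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException("tesseract exited with code " + process.ExitCode);
            }
        }
        string outputFile = outputBase + ".txt"; // Tesseract appends ".txt" itself
        try { return File.ReadAllText(outputFile); }
        finally { File.Delete(outputFile); }
    }
}
```

A caller would loop over the five test files (for example `hello.bmp`, `hello.gif`, `hello.jpg`, `hello.png`, `hello.tif`, all placeholder names) and check `TesseractBaseline.Matches(TesseractBaseline.Recognize(exePath, file), "Hello World")`. If only the GIF case fails, the missing TMP directory mentioned above would be the first thing to check.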
Would you like to submit a PR with this and I can work with you to get this
I would be happy to.
I'd like to discuss this feature addition a bit. @Sicos1997 was nice enough to roll this feature into PR #72 creating a separate Unfortunately, it looks like the Tika integration with Tesseract requires an executable (not a library) to be installed. Here are the [Windows instructions](https://github.com/tesseract-ocr/tesseract/wiki#windows).
I see a few problems just getting Tesseract installed:
None of this is turnkey. So how do we test it? Here is a possible plan...
Add a Tesseract TextExtractor
Testing Concerns
I am not sure how to have our AppVeyor CI test the Tesseract integration. There is no Chocolatey package. The biggest hurdle I see to having support for this feature is:
The main reason I don't want to move forward is that I don't want to manually test this feature. So, until we can automate it, I don't want to add it. If someone who is using Tika + Tesseract now via .NET were to step up and help out with the automation, I would be happy to work with you on it.
Another option
If someone really wants this feature but is not willing to do the automation required, we could start a new NuGet package and let someone own the manual testing it would require. I am also happy to facilitate that direction. That said, it seems like a Chocolatey package is an equivalent route.
Kevin,
I was able to get Tesseract working with Tika on .NET. It really is not hard at all. The Windows installer basically extracts the files to a folder; then you just have to tell Tika where Tesseract is installed. For testing you can open Paint, type in "Hello World", save it in the various image formats, and then run the files through to ensure basic Tesseract OCR is working. Here is the call that I make and the class extension that I implemented to set the path. Note: the harder problem I had was turning off Tesseract, because it sets environment variables and they are used even when the application does not set the path. The quickest way to turn off Tesseract OCR was to rename the Tesseract folder so Tika could not find it. Hope this helps.
// Load path for tesseract from PRGX.MT.Tika.dll.config
TikaOnDotNet.TextExtractionOCR.TextExtractor.EnableAppSettingsTesseractPath(TesseractInstallPath);
Class extension to the Tika on Dot Net code
using System.IO;
using org.apache.tika.config;      // TikaConfig (Tika's Java packages appear as .NET namespaces via IKVM)
using org.apache.tika.parser.ocr;  // TesseractOCRConfig

public class TextExtractor : ITextExtractor
{
    private static TikaConfig config = TikaConfig.getDefaultConfig();
    private static TesseractOCRConfig tesseractOCRConfig = null;
    private static string tesseractPath = string.Empty;

    // Validates the install path and, when one is given, builds the OCR config for it.
    public static void EnableAppSettingsTesseractPath(string TesseractInstallPath)
    {
        TesseractPath = TesseractInstallPath;
        if (string.IsNullOrWhiteSpace(tesseractPath))
        {
            tesseractOCRConfig = null;
            tesseractPath = string.Empty;
        }
        else
        {
            tesseractOCRConfig = new TesseractOCRConfig();
            tesseractOCRConfig.setTesseractPath(tesseractPath);
            tesseractOCRConfig.setTimeout(240);
        }
    }

    // Folder that contains tesseract.exe; an empty value disables OCR.
    public static string TesseractPath
    {
        get { return tesseractPath; }
        set
        {
            if (!string.IsNullOrEmpty(value))
            {
                if (!Directory.Exists(value))
                {
                    throw new DirectoryNotFoundException(string.Format("Tesseract Directory not found: {0}", value));
                }
                if (!System.IO.File.Exists(Path.Combine(value, "tesseract.exe")))
                {
                    // Report the path that was just checked, not the previously stored one.
                    throw new System.IO.FileNotFoundException(string.Format("Could not find tesseract.exe at {0}", value));
                }
                tesseractPath = value;
            }
            else
            {
                tesseractPath = string.Empty;
                tesseractOCRConfig = null;
            }
        }
    }

    // True while an OCR config exists; setting it rebuilds or clears that config.
    public static bool IsOCRPathEnabled
    {
        get { return tesseractOCRConfig != null; }
        set
        {
            if (value)
            {
                tesseractOCRConfig = new TesseractOCRConfig();
                tesseractOCRConfig.setTesseractPath(tesseractPath);
            }
            else
            {
                tesseractOCRConfig = null;
            }
        }
    }

    // The ITextExtractor members are not shown in this excerpt.
}
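The note above about Tesseract staying active through environment variables suggests a quick diagnostic before resorting to renaming the folder: list which directories on the search path contain the executable. The sketch below assumes `PATH` is the variable involved (the comment does not say which variables the installer sets), and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical diagnostic helper; not part of TikaOnDotNet.
public static class TesseractDiscovery
{
    // Returns every directory in a PATH-style string that contains the named executable.
    public static List<string> FindOnPath(string pathVariable, string exeName)
    {
        var hits = new List<string>();
        if (string.IsNullOrEmpty(pathVariable))
        {
            return hits;
        }
        string[] entries = pathVariable.Split(new[] { Path.PathSeparator }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string entry in entries)
        {
            string dir = entry.Trim().Trim('"');
            if (dir.Length == 0)
            {
                continue;
            }
            try
            {
                if (File.Exists(Path.Combine(dir, exeName)))
                {
                    hits.Add(dir);
                }
            }
            catch (ArgumentException)
            {
                // Malformed PATH entry; skip it.
            }
        }
        return hits;
    }

    // Assumption: PATH is the environment variable through which Tesseract is still found.
    public static List<string> FindTesseractInEnvironment()
    {
        return FindOnPath(Environment.GetEnvironmentVariable("PATH"), "tesseract.exe");
    }
}
```

If `FindTesseractInEnvironment()` returns any directories, Tesseract may still be reachable even when `TesseractPath` is empty, which would match the behavior described above.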
Thanks, it is useful to see how you got it working.
Hello,
The hope here is to get TikaOnDotNet fully configured to access Tesseract OCR for text extraction from images. With Tika .93 support for Tesseract was added, and we are now in the midst of validating the latest release, Tika 1.13.1. A big set of validations centers on Tika's ability to handle certain types of PDF files. It should be noted that TIFF image support in PDFBox has changed because of licensing issues that are not in compliance with the Apache license.
So here is hoping that if we cannot read it one way, we might be able to read it using another.
The first step has been to extend Kevin's TextExtractor so that metadata can be passed in to assist the parsing. That set of extensions is here:
The next step has been to dump the configuration to confirm how Tika is configured and what changes might need to be made. The dump routine was added to the class above:
On my system, using the default configuration provided by Kevin, you can see the setup below:
Version Apache Tika 1.13
Detectors
org.apache.tika.parser.microsoft.POIFSContainerDetector
org.apache.tika.parser.pkg.ZipContainerDetector
org.gagravarr.tika.OggDetector
org.apache.tika.mime.MimeTypes
Parsers
org.apache.tika.parser.asm.ClassParser
application/java-vm
org.apache.tika.parser.audio.AudioParser
audio/x-wav
audio/basic
audio/x-aiff
org.apache.tika.parser.audio.MidiParser
application/x-midi
audio/midi
org.apache.tika.parser.chm.ChmParser
application/vnd.ms-htmlhelp
application/x-chm
application/chm
org.apache.tika.parser.code.SourceCodeParser
text/x-c++src
text/x-groovy
text/x-java-source
org.apache.tika.parser.crypto.Pkcs7Parser
application/pkcs7-signature
application/pkcs7-mime
org.apache.tika.parser.dif.DIFParser
application/dif+xml
org.apache.tika.parser.dwg.DWGParser
image/vnd.dwg
org.apache.tika.parser.epub.EpubParser
application/x-ibooks+zip
application/epub+zip
org.apache.tika.parser.executable.ExecutableParser
application/x-msdownload
application/x-sharedlib
application/x-elf
application/x-object
application/x-executable
application/x-coredump
org.apache.tika.parser.external.CompositeExternalParser
org.apache.tika.parser.feed.FeedParser
application/atom+xml
application/rss+xml
org.apache.tika.parser.font.AdobeFontMetricParser
application/x-font-adobe-metric
org.apache.tika.parser.font.TrueTypeParser
application/x-font-ttf
org.apache.tika.parser.gdal.GDALParser
application/x-gsc
image/x-ozi
application/x-pds
image/eir
application/x-usgs-dem
application/aaigrid
application/x-bag
application/elas
application/x-rs2
application/x-tsx
application/x-lcp
image/geotiff
application/x-mbtiles
application/x-cappi
application/x-netcdf
application/x-gsag
application/x-epsilon
application/x-ace2
application/jaxa-pal-sar
image/x-pcraster
application/x-msgn
image/arg
application/x-hdf
image/x-mff
application/x-kro
image/x-hdf5-image
image/x-dimap
image/x-srp
image/big-gif
application/x-envi
application/x-cosar
application/x-ntv2
image/bmp
application/x-doq2
application/x-bt
application/x-kml
application/x-gmt
application/x-rst
application/vrt
application/pcisdk
application/x-ctg
application/x-e00-grid
application/x-rik
image/ida
image/x-mff2
application/sdts-raster
application/x-snodas
image/jp2
image/sar-ceos
application/terragen
application/x-wcs
application/leveller
application/x-ingr
application/x-gtx
image/sgi
application/x-pnm
image/raster
application/fits
application/x-r
image/gif
application/x-envi-hdr
application/x-http
application/x-rmf
application/x-ecrg-toc
application/aig
application/x-rpf-toc
image/adrg
application/x-srtmhgt
application/x-generic-bin
application/jdem
image/x-airsar
application/x-webp
application/x-ngs-geoid
application/x-pcidsk
image/x-fujibas
application/x-wms
application/x-map
image/ceos
application/xpm
application/x-zmap
image/envisat
application/x-ers
application/x-doq1
application/x-isis2
application/x-nwt-grd
application/x-ppi
image/ilwis
application/x-isis3
application/x-nwt-grc
application/x-blx
application/gff
application/x-ndf
image/jpeg
application/x-geo-pdf
application/x-l1b
image/fit
application/x-gsbg
application/x-sdat
application/x-ctable2
application/x-grib
application/x-coasp
application/x-dipex
application/grass-ascii-grid
image/fits
application/x-til
application/x-dods
image/png
application/x-gxf
application/x-gs7bg
application/x-cpg
application/x-lan
application/x-xyz
image/bsb
application/x-p-aux
application/dted
application/x-rasterlite
image/nitf
image/hfa
application/x-fast
application/x-los-las
org.apache.tika.parser.geo.topic.GeoParser
application/geotopic
org.apache.tika.parser.geoinfo.GeographicInformationParser
text/iso19139+xml
org.apache.tika.parser.grib.GribParser
application/x-grib2
org.apache.tika.parser.hdf.HDFParser
application/x-hdf
org.apache.tika.parser.html.HtmlParser
text/html
application/vnd.wap.xhtml+xml
application/x-asp
application/xhtml+xml
org.apache.tika.parser.image.BPGParser
image/bpg
image/x-bpg
org.apache.tika.parser.image.ICNSParser
image/icns
org.apache.tika.parser.image.ImageParser
image/png
image/vnd.wap.wbmp
image/bmp
image/x-xcf
image/gif
image/x-icon
image/x-ms-bmp
org.apache.tika.parser.image.PSDParser
image/vnd.adobe.photoshop
org.apache.tika.parser.image.TiffParser
image/tiff
org.apache.tika.parser.image.WebPParser
image/webp
org.apache.tika.parser.iptc.IptcAnpaParser
text/vnd.iptc.anpa
org.apache.tika.parser.isatab.ISArchiveParser
application/x-isatab
org.apache.tika.parser.iwork.IWorkPackageParser
application/vnd.apple.keynote
application/vnd.apple.iwork
application/vnd.apple.numbers
application/vnd.apple.pages
org.apache.tika.parser.jdbc.SQLite3Parser
org.apache.tika.parser.journal.JournalParser
application/pdf
org.apache.tika.parser.jpeg.JpegParser
image/jpeg
org.apache.tika.parser.mail.RFC822Parser
message/rfc822
org.apache.tika.parser.mat.MatParser
application/x-matlab-data
org.apache.tika.parser.mbox.MboxParser
application/mbox
org.apache.tika.parser.mbox.OutlookPSTParser
application/vnd.ms-outlook-pst
org.apache.tika.parser.microsoft.JackcessParser
application/x-msaccess
org.apache.tika.parser.microsoft.OfficeParser
application/x-tika-msoffice-embedded; format=ole10_native
application/msword
application/vnd.visio
application/vnd.ms-project
application/x-tika-msworks-spreadsheet
application/x-mspublisher
application/vnd.ms-powerpoint
application/x-tika-msoffice
application/sldworks
application/x-tika-ooxml-protected
application/vnd.ms-excel
application/vnd.ms-outlook
org.apache.tika.parser.microsoft.OldExcelParser
application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
org.apache.tika.parser.microsoft.TNEFParser
application/vnd.ms-tnef
application/x-tnef
application/ms-tnef
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
application/vnd.ms-word.document.macroenabled.12
application/vnd.ms-excel.addin.macroenabled.12
application/x-tika-ooxml
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.ms-powerpoint.addin.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.ms-powerpoint.slideshow.macroenabled.12
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.ms-powerpoint.presentation.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.slideshow
application/vnd.ms-excel.template.macroenabled.12
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-word.template.macroenabled.12
org.apache.tika.parser.mp3.Mp3Parser
audio/mpeg
org.apache.tika.parser.mp4.MP4Parser
video/x-m4v
application/mp4
video/3gpp
video/3gpp2
video/quicktime
audio/mp4
video/mp4
org.apache.tika.parser.netcdf.NetCDFParser
application/x-netcdf
org.apache.tika.parser.ocr.TesseractOCRParser
org.apache.tika.parser.odf.OpenDocumentParser
application/x-vnd.oasis.opendocument.presentation
application/vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.image
application/vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.sun.xml.writer
application/x-vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.image
application/x-vnd.oasis.opendocument.text
application/x-vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.text
application/vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.text-master
application/vnd.oasis.opendocument.presentation
application/x-vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.text-master
org.apache.tika.parser.pdf.PDFParser
application/pdf
org.apache.tika.parser.pkg.CompressorParser
application/zlib
application/x-gzip
application/x-bzip2
application/x-compress
application/x-java-pack200
application/gzip
application/x-bzip
application/x-xz
org.apache.tika.parser.pkg.PackageParser
application/x-tar
application/java-archive
application/x-archive
application/zip
application/x-cpio
application/x-tika-unix-dump
application/x-7z-compressed
org.apache.tika.parser.pkg.RarParser
application/x-rar-compressed
org.apache.tika.parser.pot.PooledTimeSeriesParser
org.apache.tika.parser.rtf.RTFParser
application/rtf
org.apache.tika.parser.txt.TXTParser
text/plain
org.apache.tika.parser.video.FLVParser
video/x-flv
org.apache.tika.parser.xml.DcXMLParser
application/xml
image/svg+xml
org.apache.tika.parser.xml.FictionBookParser
application/x-fictionbook+xml
org.gagravarr.tika.FlacParser
audio/x-oggflac
audio/x-flac
org.gagravarr.tika.OggParser
audio/ogg
application/kate
application/ogg
video/daala
video/x-ogguvs
video/x-ogm
audio/x-oggpcm
video/ogg
video/x-dirac
video/x-oggrgb
video/x-oggyuv
org.gagravarr.tika.OpusParser
audio/opus
audio/ogg; codecs=opus
org.gagravarr.tika.SpeexParser
audio/ogg; codecs=speex
audio/speex
org.gagravarr.tika.TheoraParser
video/theora
org.gagravarr.tika.VorbisParser
audio/vorbis
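A dump like the one above is easier to review if it is read back into a parser-to-MIME-type map. The sketch below is a hypothetical reader for this text layout, not the dump routine itself: it assumes that lines containing `/` are MIME types and that every other line in the Parsers section is a parser class name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reader for the text dump shown above; not part of Tika or TikaOnDotNet.
public static class TikaDumpReader
{
    // Maps each parser class name in the "Parsers" section to the MIME types listed under it.
    public static Dictionary<string, List<string>> ReadParsers(IEnumerable<string> lines)
    {
        var parsers = new Dictionary<string, List<string>>();
        bool inParsers = false;
        List<string> current = null;
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0) { continue; }
            if (line == "Parsers") { inParsers = true; continue; }
            if (line == "Detectors") { inParsers = false; current = null; continue; }
            if (!inParsers) { continue; }

            if (line.Contains("/"))
            {
                // MIME type: attach it to the most recent parser.
                if (current != null) { current.Add(line); }
            }
            else
            {
                // Parser class name: start a new, initially empty, type list.
                current = new List<string>();
                parsers[line] = current;
            }
        }
        return parsers;
    }

    // Returns the parsers that list no MIME types at all.
    public static List<string> ParsersWithoutTypes(Dictionary<string, List<string>> parsers)
    {
        var result = new List<string>();
        foreach (KeyValuePair<string, List<string>> pair in parsers)
        {
            if (pair.Value.Count == 0) { result.Add(pair.Key); }
        }
        return result;
    }
}
```

Applied to the dump above, `ParsersWithoutTypes` would report four parsers with no MIME types: `CompositeExternalParser`, `SQLite3Parser`, `TesseractOCRParser` and `PooledTimeSeriesParser`. One plausible reading, offered as an inference rather than something the dump states, is that the Tesseract parser advertises no types because Tika has not yet located a Tesseract installation.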
The next set of steps will be configuring and testing Tesseract prior to integrating it in Tika.