Programming with SDK
Although it is possible to run SmartDocumentor entirely on built-in workers, it is common to have to adapt the workflows to the client's needs by customizing a worker or even creating a new one.
Next we will explore common methods used with the SmartDocumentor SDK.
SmartDocumentor facilitates reading several types of files and extracting pages from those files, most commonly .TIFF and .PDF files.
To read a page from one of these files, we create an ImageDocument and set the values that control what is extracted.
The next example shows how to extract one page from a multi-page PDF file:
using (var imageFile = new ImageDocument(PdfFileName))
{
    var result = imageFile.GetImageByIndex(new PageBitmapRequest()
    {
        // Return the path of a temp file instead of an in-memory bitmap
        ImageResultMode = PageBitmapRequest.PageBitmapImageModeResult.Filename,
        // Zero-based: 0 is the first page
        PageIndex = 0,
    });
    var pageFile = result.PageFilename;
}
To create a new ImageDocument object it is possible to pass a path to a file, a Stream or a Bitmap.
The method GetImageByIndex returns the information for one page of the document.
PageBitmapRequest sets the parameters used to extract the image and, if any, the post-processing conditions.
In the example above we pass two values to get the first page as a path to the extracted image. If neither is passed, PageIndex defaults to the first page (index 0) and ImageResultMode defaults to Bitmap.
Note:
PageBitmapImageModeResult can be:
Bitmap (image), Pix (pointer), or Filename (temp file).
Other default values for the PageBitmapRequest:
AddBorder - false
AutoCleanBlackBorders - false
AutoDeWarp - false
AutoDeskew - false
AutoDetectOrientation - false
AutoDetectOrientationMethod - Fast
AutoEqualize - false
AutoInvert - false
BorderWidth - 0
ColorizeSettings - Default
ConvertToBitonal - false
ConvertToBitonalCustomParams - null
ConvertToBitonalMethod - DynamicOtsu
ConvertToBitonalThreshold - 0
ConvertToColorDepth - Color
FlipHorizontal - false
FlipVertical - false
ImageResultMode - Bitmap
MorphMaskOperation - And
MorphMaskSequence - null
NoImageCloning - false
Noise - 0
NormalizeToPaperFormatBeforeProcessing - None
PageCorrections - Default
PageCropSettings - Default
PageIndex - 0
PdfDocumentPageDPI - 300
PdfExtractImageFromImageList - false
PdfExtractTextWords - false
RemovePunchHoles - false
RotationMode - 0
Size - Default
Tag - null
Thickness - 0
This is the same process used by the DocumentPreProcessingWorker, so these parameters can also be set in the task workspace configuration.
<Step From="FileImportedFromFolder" Using="DocumentPreProcessingWorker" To="PreProcessedCompleted">
    <SettingList>
        <Setting Name="AutoDeskew" Value="True" />
        <Setting Name="CleanBlackBorders" Value="False" />
        <Setting Name="AutoInvert" Value="False" />
        <Setting Name="AutoDetectOrientation" Value="False" />
        <Setting Name="ConvertToBitonal" Value="False" />
        <Setting Name="Thickness" Value="0" />
        <Setting Name="Noise" Value="0" />
        <Setting Name="DPI" Value="300" />
        <Setting Name="DeleteOriginalDocument" Value="True" />
    </SettingList>
</Step>
To extract the text from a PDF file that contains embedded text, use the ImageDocument and set the PdfExtractTextWords parameter of the PageBitmapRequest to true.
The example shows how to try to extract the text and get the content.
// imageDocument is an open ImageDocument; pageIndex is the one-based page number
var request = new PageBitmapRequest
{
    PageIndex = pageIndex - 1,
    AutoDeskew = true,
    ConvertToBitonal = false,
    PdfExtractTextWords = true
};
using (PageBitmapResponse response = imageDocument.GetImageByIndex(request))
{
    // PdfOcrJobResult is null when no embedded text was extracted
    var result = response.PdfOcrJobResult;
    if (result != null)
    {
        var text = result.Text;
    }
}
If the PDF file does not contain embedded text, or if we need to run OCR on an image, we need to use the OCR engine. For example, we could add the following to the previous code, inside the using block where response is still available:
using (IOcrEngine ocrEngine = OcrFactory.CreateOcrEngineAuto())
{
    // response is the PageBitmapResponse returned by GetImageByIndex
    var ocrJobReq = new OcrJobRequest
    {
        Image = response.PageImage,
        Language = "Portuguese",
        EntityTextMode = OcrEntityTextMode.Word,
        PageNumber = 1
    };
    OcrJobResult ocrResult = ocrEngine.DoOcr(ocrJobReq);
    if (ocrResult != null)
    {
        var text = ocrResult.Text;
    }
}
Once again, if we want to do this using SmartDocumentor's workers, this option is already built into the OCRExtractionWorker. For a PDF file, the following settings check whether the PDF contains embedded text and, if it does not, run OCR; the text from the first page is saved in the task.
<Step From="ToProcess" Using="OCRExtractionWorker" To="OCRCompleted">
    <SettingList>
        <Setting Name="OcrSavePageText" Value="True" />
        <Setting Name="PdfExtractTextWords" Value="True" />
        <Setting Name="OcrPageRange" Value="1" />
    </SettingList>
</Step>
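The "embedded text first, OCR only if needed" decision described above can be sketched in plain C#. This is only an illustration of the control flow, not the worker's actual implementation: the PageTextReader class and GetPageText method are hypothetical names, the two delegates stand in for the SDK calls shown earlier (reading PdfOcrJobResult.Text and calling DoOcr), and treating whitespace-only text as "no embedded text" is an assumption.

```csharp
using System;

// Hypothetical helper, not part of the SmartDocumentor SDK.
public static class PageTextReader
{
    // readEmbeddedText: returns the embedded PDF text, or null if there is none.
    // runOcr: runs OCR on the page image and returns the recognized text.
    public static string GetPageText(Func<string> readEmbeddedText, Func<string> runOcr)
    {
        string embedded = readEmbeddedText();

        // Assumption: empty or whitespace-only text counts as "no embedded text".
        if (!string.IsNullOrWhiteSpace(embedded))
        {
            return embedded;
        }

        // OCR is invoked only when the embedded text is missing.
        return runOcr();
    }
}
```

Because the OCR delegate is only invoked when the embedded text is missing, a PDF that already carries text never pays the cost of starting the OCR engine.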
One common request is to read or separate documents by barcode. With SmartDocumentor we can reuse the logic from the previous examples and add the BarcodeEngine to read the barcodes found on a page.
using (BarcodeEngine barcodeEngine = new BarcodeEngine())
{
    using (var imageFile = new ImageDocument(fileName))
    {
        var req = new PageBitmapRequest()
        {
            PageIndex = 1,
            ImageResultMode = PageBitmapRequest.PageBitmapImageModeResult.Filename,
            AutoDeskew = true
        };
        using (var response = imageFile.GetImageByIndex(req))
        {
            IBarcodeInfo[] barcodes = barcodeEngine.ReadBarcode(response.PageFilename);
        }
    }
}
This returns an array with the barcodes found, including the text of each barcode and its confidence level.
If you need to split a file that comes from a scanner or from a folder, the FolderMonitorWorker is already able to do that.
<Step Using="FolderMonitorWorker" To="FileImportedFromFolder">
    <SettingList>
        <Setting Name="Folders" Value="\\localhost\input" />
        <Setting Name="FilePatterns" Value="*.tif|*.tiff|*.pdf|*.jpg|*.png" />
        <Setting Name="DocSeparationEnabled" Value="true" />
        <Setting Name="DocSeparationMethod" Value="Barcodes" />
        <Setting Name="DocSeparationBarcodeType" Value="Code128" />
        <Setting Name="DocSeparationBarcodeValue" Value="CODE" />
    </SettingList>
</Step>
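To make the idea of barcode separation more concrete, the sketch below groups pages into documents based on the barcode values read on each page. It does not use the SmartDocumentor SDK: the DocumentSplitter class and SplitByBarcode method are hypothetical, and the rule that a separator page starts a new document and is kept as its first page is an assumption. FolderMonitorWorker may handle separator pages differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the SmartDocumentor SDK.
public static class DocumentSplitter
{
    // pageBarcodes[i] holds the barcode texts read on page i (zero-based).
    // Returns one list of page indexes per resulting document.
    public static List<List<int>> SplitByBarcode(string[][] pageBarcodes, string separatorValue)
    {
        var documents = new List<List<int>>();
        List<int> current = null;

        for (int page = 0; page < pageBarcodes.Length; page++)
        {
            // A page is a separator if any of its barcodes matches the value.
            bool isSeparator = Array.IndexOf(pageBarcodes[page], separatorValue) >= 0;

            // Assumption: start a new document on the first page
            // and on every separator page, keeping the separator page.
            if (current == null || isSeparator)
            {
                current = new List<int>();
                documents.Add(current);
            }
            current.Add(page);
        }
        return documents;
    }
}
```

For example, with four pages where pages 0 and 2 carry a barcode with the value "CODE", calling SplitByBarcode with "CODE" as the separator yields two documents: pages 0 and 1, and pages 2 and 3. The per-page barcode values could come from the ReadBarcode call shown earlier.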
Copyright © DevScope