Programming with the SDK

Although it is possible to run SmartDocumentor entirely on built-in workers, it is common to have to adapt the workflows to a client's needs by customizing a worker or even creating a new one.

The sections below cover common tasks with the SmartDocumentor SDK.

How to read a page from a file

SmartDocumentor can read several types of files and extract pages from them, most commonly .TIFF and .PDF files.

To read a page from one of these files, we create an ImageDocument and set the values that describe what we want to extract.

The next example shows how to extract one page from a multi-page PDF file:

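using (var imageFile = new ImageDocument(PdfFileName))
{
	// Request a single page and ask for the result as a temporary file on disk.
	var result = imageFile.GetImageByIndex(new PageBitmapRequest()
	{
		ImageResultMode = PageBitmapRequest.PageBitmapImageModeResult.Filename,
		PageIndex = 0 // the default value 0 is the first page, so the index is zero-based
	});

	// Path to the extracted page image.
	var pageFile = result.PageFilename;
}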

An ImageDocument can be created from a file path, a Stream, or a Bitmap.

The method GetImageByIndex returns the information for one page of the document.

PageBitmapRequest sets the parameters used to extract the image and, if any are needed, the post-processing options.

The example above sets two values: PageIndex selects the page, and ImageResultMode asks for the path of a file containing the extracted image. If neither is set, PageIndex defaults to 0 (the first page) and ImageResultMode defaults to Bitmap.
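The defaults described on this page suggest that PageIndex is zero-based, since the default of 0 is the first page. Assuming that holds, a small helper like the one below keeps the conversion from a 1-based page number in one place. PageNumbers and ToPageIndex are hypothetical names for illustration and are not part of the SDK.

```csharp
using System;

static class PageNumbers
{
    // Hypothetical helper, not an SDK API: converts a 1-based page number
    // (as people usually count pages) to a zero-based index.
    public static int ToPageIndex(int pageNumber)
    {
        if (pageNumber < 1)
            throw new ArgumentOutOfRangeException(nameof(pageNumber), "Page numbers start at 1.");
        return pageNumber - 1;
    }
}
```

The result could then be assigned to PageIndex when building a PageBitmapRequest.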

Note:

PageBitmapImageModeResult can be Bitmap (an image), Pix (a pointer), or Filename (a temporary file).

Other default values for the PageBitmapRequest:

AddBorder - false
AutoCleanBlackBorders - false
AutoDeWarp - false
AutoDeskew - false
AutoDetectOrientation - false
AutoDetectOrientationMethod - Fast
AutoEqualize - false
AutoInvert - false
BorderWidth - 0
ColorizeSettings - Default
ConvertToBitonal - false
ConvertToBitonalCustomParams - null
ConvertToBitonalMethod - DynamicOtsu
ConvertToBitonalThreshold - 0
ConvertToColorDepth - Color
FlipHorizontal - false
FlipVertical - false
ImageResultMode - Bitmap
MorphMaskOperation - And
MorphMaskSequence - null
NoImageCloning - false
Noise - 0
NormalizeToPaperFormatBeforeProcessing - None
PageCorrections - Default
PageCropSettings - Default
PageIndex - 0
PdfDocumentPageDPI - 300
PdfExtractImageFromImageList - false
PdfExtractTextWords - false
RemovePunchHoles - false
RotationMode - 0
Size - Default
Tag - null
Thickness - 0

The DocumentPreProcessingWorker uses the same process, so these parameters can also be set in the task workspace configuration:

<Step From="FileImportedFromFolder" Using="DocumentPreProcessingWorker" To="PreProcessedCompleted">
  <SettingList>
	<Setting Name="AutoDeskew" Value="True" />
	<Setting Name="CleanBlackBorders" Value="False" />
	<Setting Name="AutoInvert" Value="False" />
	<Setting Name="AutoDetectOrientation" Value="False" />
	<Setting Name="ConvertToBitonal" Value="False" />
	<Setting Name="Thickness" Value="0" />
	<Setting Name="Noise" Value="0" />
	<Setting Name="DPI" Value="300" />
	<Setting Name="DeleteOriginalDocument" Value="True" />
  </SettingList>
</Step>

How to extract text from an image or PDF

To extract the text from a PDF file that contains embedded text, use the ImageDocument and set the PdfExtractTextWords parameter of the PageBitmapRequest to true.

The example below shows how to request the text and read its content:

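// pageIndex is assumed here to be a 1-based page number, hence the "- 1".
var request = new PageBitmapRequest
{
	PageIndex = pageIndex - 1,
	AutoDeskew = true,
	ConvertToBitonal = false,
	PdfExtractTextWords = true
};

// imageDocument is an ImageDocument opened as in the previous section.
using (PageBitmapResponse response = imageDocument.GetImageByIndex(request))
{
	var result = response.PdfOcrJobResult;
	// text stays null when no embedded text result is available.
	string text = result != null ? result.Text : null;
}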

If the PDF file does not contain embedded text, or if we need to extract text from an image, we need to use the OCR engine. For example, we could add the following to the previous code, inside the using block where response is available:

using (IOcrEngine ocrEngine = OcrFactory.CreateOcrEngineAuto())
{
	// Run OCR on the page image returned by GetImageByIndex.
	var ocrJobReq = new OcrJobRequest
	{
		Image = response.PageImage,
		Language = "Portuguese",
		EntityTextMode = OcrEntityTextMode.Word,
		PageNumber = 1
	};

	OcrJobResult ocrResult = ocrEngine.DoOcr(ocrJobReq);
	// ocrText stays null when the engine returns no result.
	string ocrText = ocrResult != null ? ocrResult.Text : null;
}
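The two approaches above can be combined so that OCR runs only when no embedded text is found. The sketch below shows just that fallback logic. TextExtraction and ExtractText are hypothetical names, not SDK APIs, and the two text sources are passed in as delegates so the block compiles without referencing the SDK.

```csharp
using System;

static class TextExtraction
{
    // Returns the embedded text when there is any; otherwise falls back to OCR.
    // Hypothetical helper: both sources are supplied by the caller as delegates.
    public static string ExtractText(Func<string> readEmbeddedText, Func<string> runOcr)
    {
        string embedded = readEmbeddedText();
        if (!string.IsNullOrWhiteSpace(embedded))
            return embedded;

        // No usable embedded text: call the OCR delegate only now,
        // so it is skipped entirely when embedded text exists.
        return runOcr() ?? string.Empty;
    }
}
```

With the objects from the examples above, the two delegates might be `() => response.PdfOcrJobResult?.Text` and `() => ocrEngine.DoOcr(ocrJobReq)?.Text`.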

Once again, if we want to do this with SmartDocumentor's workers, this option is already built into the OCRExtractionWorker. For a PDF file, the following settings check whether the PDF contains embedded text and, if it does not, run OCR. The text result from the first page is saved in the task.

<Step From="ToProcess" Using="OCRExtractionWorker" To="OCRCompleted">
  <SettingList>
	<Setting Name="OcrSavePageText" Value="True" />
	<Setting Name="PdfExtractTextWords" Value="True" />
	<Setting Name="OcrPageRange" Value="1" />
  </SettingList>
</Step>

How to read barcodes

One common request is to read barcodes or to separate documents by barcode. With SmartDocumentor we can use the same logic as in the previous examples and add the BarcodeEngine to check the barcodes found on a page.

using (BarcodeEngine barcodeEngine = new BarcodeEngine())
{
	using (var imageFile = new ImageDocument(fileName))
	{
		var req = new PageBitmapRequest()
		{
			PageIndex = 1,
			ImageResultMode = PageBitmapRequest.PageBitmapImageModeResult.Filename,
			AutoDeskew = true
		};

		using (var response = imageFile.GetImageByIndex(req))
		{
			IBarcodeInfo[] barcodes = barcodeEngine.ReadBarcode(response.PageFilename);
		}
	}
}

This returns an array of IBarcodeInfo objects with the barcodes found, the text of each barcode, and its confidence level.

If you need to split a file that comes in from the scanner or from a folder, the FolderMonitorWorker can already do that:

<Step Using="FolderMonitorWorker" To="FileImportedFromFolder">
  <SettingList>
	<Setting Name="Folders" Value="\\localhost\input" />
	<Setting Name="FilePatterns" Value="*.tif|*.tiff|*.pdf|*.jpg|*.png" />
	<Setting Name="DocSeparationEnabled" Value="true" />
	<Setting Name="DocSeparationMethod" Value="Barcodes" />
	<Setting Name="DocSeparationBarcodeType" Value="Code128" />
	<Setting Name="DocSeparationBarcodeValue" Value="CODE" />
  </SettingList>
</Step>
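How FolderMonitorWorker separates documents internally is not described here. As a rough illustration of the idea only, the sketch below groups page indexes into documents, given the barcode text read from each page. BarcodeSplitter and Split are hypothetical names, and the choice to drop the separator page and start a new document after it is an assumption.

```csharp
using System;
using System.Collections.Generic;

static class BarcodeSplitter
{
    // pageBarcodes[i] holds the barcode text read from page i, or null when none was found.
    // Returns groups of zero-based page indexes, one group per document.
    // Assumption: a page whose barcode equals separatorValue ends the current
    // document and is itself left out of the output.
    public static List<List<int>> Split(IReadOnlyList<string> pageBarcodes, string separatorValue)
    {
        var documents = new List<List<int>>();
        var current = new List<int>();

        for (int i = 0; i < pageBarcodes.Count; i++)
        {
            if (string.Equals(pageBarcodes[i], separatorValue, StringComparison.Ordinal))
            {
                // Close the document collected so far, skipping empty ones.
                if (current.Count > 0)
                    documents.Add(current);
                current = new List<int>();
                continue;
            }
            current.Add(i);
        }

        // Keep the trailing pages after the last separator.
        if (current.Count > 0)
            documents.Add(current);
        return documents;
    }
}
```

For example, with page barcodes `null, "CODE", null, null, "CODE", null` and the separator value "CODE", the sketch yields three groups of page indexes: [0], [2, 3], and [5].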

Contact us

Address: R. de Passos Manuel 223 3°, 4000-385 Porto, Portugal

Email: support@devscope.net

Phone: +351 22 375 1350

Working Days/Hours: Mon-Fri, 9:00-19:00
