With the steps below I copied zones out of sensitive documents into their own documents and deleted the more unusual names. This leaves me with documents that contain only a first name or only a last name. The documents are sorted alphabetically, so the link between first name and last name is broken.
My documents have zones of many different sizes, at locations all over the page. I created the zones manually. Now I want to cut all of these zones out of the documents and make new documents of exactly the same size, with a zone that is big enough to read ALL of them.
This runs in a script locator on any document. It works in KTA Transformation Designer by pressing F7 on any document, because it loops through all the xdocs in the same folder as the xdoc being tested.
In my case, it found that the widest field was 1963 pixels and the tallest field was 436 pixels.
Private Sub SL_Dim_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
   'Find the widest and the tallest field across all xdocs in the folder of the xdoc being tested.
   Dim FileName As String, Field As CscXDocField, Path As String, F As Long, Truth As New CscXDocument, I As Long, Alt As CscXDocFieldAlternative
   Dim W As Long, H As Long
   Path = Left(pXDoc.FileName, InStrRev(pXDoc.FileName,"\")) ' folder of the current xdoc, including the trailing backslash
   ChDir Path
   FileName = Dir("*.xdc")
   While FileName <> ""
      Truth.Load(FileName) ' Load the XDoc from the file system.
      For F=0 To Truth.Fields.Count-1
         Set Field=Truth.Fields(F)
         If Field.PageIndex>-1 And Field.Text<>"" Then ' this field has coordinates and text
            If Field.Width>W Then W=Field.Width
            If Field.Height>H Then H=Field.Height
         End If
      Next
      FileName = Dir() ' next xdoc in the folder
   Wend
   'Return the two maxima as subfields W and H of a single alternative.
   Set Alt= pLocator.Alternatives.Create
   Alt.Confidence=1
   With Alt.SubFields.Create("W")
      .Text=CStr(W)
      .Confidence=1
   End With
   With Alt.SubFields.Create("H")
      .Text=CStr(H)
      .Confidence=1
   End With
End Sub
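The logic of that loop can be checked outside Transformation Designer. The following is a minimal Python sketch, not part of the project script. The `Zone` class and `max_zone_size` function are hypothetical names that stand in for the xdoc fields. It shows that the width and the height are maximised independently, so the widest and the tallest zone do not have to be the same field.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Hypothetical stand-in for one xdoc field."""
    page_index: int  # -1 means the field has no location on a page
    text: str
    width: int
    height: int


def max_zone_size(zones):
    """Return (max width, max height) over all zones that have a location and text."""
    w = h = 0
    for z in zones:
        if z.page_index > -1 and z.text != "":
            w = max(w, z.width)
            h = max(h, z.height)
    return w, h
```

Fields without coordinates or without text are ignored, exactly as in the script, so an empty placeholder field cannot inflate the result.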
Export all Fields as their own TIF files, with the field value as the file name and every TIF image EXACTLY the same size.
Each document is now exactly one zone with width = 1963 and height = 436. Each zone is padded by 20 pixels on all four sides, so every exported image is 2003 x 476 pixels.
This means one single AZL with size (1963,436) can read them all perfectly.
I can now benchmark the OCR engine.
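The size arithmetic can be sketched in a few lines of Python. This is only an illustration of the numbers above. The function names `padded_canvas` and `zone_rect_on_canvas` are hypothetical and do not exist in the Kofax API.

```python
def padded_canvas(max_width, max_height, padding):
    """Size of every exported image: the largest zone plus padding on all four sides."""
    return max_width + 2 * padding, max_height + 2 * padding


def zone_rect_on_canvas(field_width, field_height, padding):
    """Where a zone lands on the canvas as (left, top, right, bottom).

    Every zone is pasted at (padding, padding), so smaller zones simply
    leave more empty space to the right and below.
    """
    return padding, padding, padding + field_width, padding + field_height
```

With the values from my documents, `padded_canvas(1963, 436, 20)` gives a canvas of 2003 x 476 pixels, and even the largest zone ends 20 pixels short of the right and bottom edges.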
' Class script: Document
Private Sub SL_Dim_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
   'Export every FirstName and LastName field of every xdoc in the folder as its own TIF image.
   Dim FileName As String, Field As CscXDocField, Path As String, F As Long, Truth As New CscXDocument, I As Long, Alt As CscXDocFieldAlternative
   Dim W As Long, H As Long
   Randomize 'Seed the random number generator with the current time
   Path = Left(pXDoc.FileName, InStrRev(pXDoc.FileName,"\")) ' folder of the current xdoc, including the trailing backslash
   ChDir Path
   FileName = Dir("*.xdc")
   While FileName <> ""
      Truth.Load(FileName) ' Load the XDoc from the file system.
      For F=0 To Truth.Fields.Count-1
         Set Field=Truth.Fields(F)
         If Field.PageIndex>-1 And Field.Text<>"" Then ' this field has coordinates and text
            Select Case Field.Name
               Case "FirstName", "LastName"
                  'Width 1963 and height 436 are the maxima found in the first pass; padding is 20 pixels.
                  Document_ExportField(Truth, F, 1963,436, 20, "C:\temp\out")
            End Select
         End If
      Next
      FileName = Dir() ' next xdoc in the folder
   Wend
   'W and H are not computed in this pass, so both subfields are returned as 0.
   Set Alt= pLocator.Alternatives.Create
   Alt.Confidence=1
   With Alt.SubFields.Create("W")
      .Text=CStr(W)
      .Confidence=1
   End With
   With Alt.SubFields.Create("H")
      .Text=CStr(H)
      .Confidence=1
   End With
End Sub
Private Sub Document_ExportField(ByVal XDoc As CASCADELib.CscXDocument, FieldId As Long, Width As Long, Height As Long, Padding As Long, Path As String)
   'Create a tif image for the field with index FieldId. Every exported tif has exactly the same width and height and the same padding around the zone.
   'This makes it easy for an AZL to read all of them.
   Dim FileName As String, Field As CscXDocField, I As Long, Image As CscImage
   Set Field=XDoc.Fields(FieldId) ' use the FieldId argument to select the field to export
   If InStr(Field.Text, "/") >0 Then Exit Sub ' File name would have an illegal "/" in it, so skip.
   If Field.PageIndex>-1 And Field.Text<>"" Then ' this field has coordinates and text
      Set Image=New CscImage
      Image.CreateImage(CscImgColFormatBinary,Width+Padding*2,Height+Padding*2,Image.XResolution,Image.YResolution) ' Make a Black&White image
      'Copy the zone from the binarized page image to position (Padding, Padding) of the new image.
      Image.CopyRect(XDoc.CDoc.Pages(Field.PageIndex).GetImage.BinarizeWithVRS(),Field.Left,Field.Top,Padding,Padding,Field.Width,Field.Height)
      I=0
      Do 'increment I if the file already exists
         I=I+1
         FileName=Path & "\" & Field.Text & "_" & Format(I,"00") & ".tif"
      Loop While File_Exists (FileName)
      Image.Save(FileName)
   End If
End Sub
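Function File_Exists(file As String) As Boolean
   'True if the path exists and is not a folder. GetAttr raises an error for a missing file, which leaves the result False.
   On Error GoTo ErrorHandler
   Return (GetAttr(file) And vbDirectory) = 0
   Exit Function
   ErrorHandler:
End Function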
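The file naming scheme is the part that is easiest to get wrong, because the same name occurs many times and the field value becomes part of the path. Below is a minimal Python sketch of the same idea, written outside the project for illustration. The names `is_safe_name` and `next_free_name` are hypothetical. Note one deliberate difference: the script above only skips values containing "/", while this sketch rejects every character that Windows forbids in file names.

```python
import os

# Characters that Windows does not allow in a file name.
ILLEGAL_CHARS = set('\\/:*?"<>|')


def is_safe_name(text):
    """True if the field value can be used as a file name stem."""
    return text != "" and not any(c in ILLEGAL_CHARS for c in text)


def next_free_name(folder, text, exists=os.path.exists):
    """Return folder/text_NN.tif with the lowest two-digit counter that is not taken yet.

    `exists` is injectable so the function can be tried without touching the disk.
    """
    i = 0
    while True:
        i += 1
        name = os.path.join(folder, f"{text}_{i:02d}.tif")
        if not exists(name):
            return name
```

The counter starts at 01, so the first SMITH becomes `SMITH_01.tif`, the second `SMITH_02.tif`, and the text left of the underscore is always the true field value.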
I added an Advanced Zone Locator (AZL) to the project, gave it the first image as the sample, set the registration to None and added one Text Zone that covers the entire document.
I then extracted all of the documents (CTRL-A, F7).
I added the Name field to the Details
and can now see all the OCR results with the documents
Here you can see a wrong OCR result with confidence of 54% in the Extraction Results Window.
These results contain errors, but we can see the correct text in the file name.
The following script in the document class will replace the OCR text with the correct text from the filename and set the confidence to 100%.
Private Sub Document_AfterExtract(ByVal pXDoc As CASCADELib.CscXDocument)
   'Overwrite the OCR result of the Name field with the truth stored in the file name.
   Dim Filename As String, Field As CscXDocField
   Set Field=pXDoc.Fields.ItemByName("Name")
   Filename = Mid(pXDoc.FileName,InStrRev(pXDoc.FileName,"\")+1) ' the filename is everything after the last backslash
   Field.Text=UCase(Left(Filename,InStr(Filename,"_")-1)) ' True field value is everything left of _ in the file name
   Field.Confidence=1.00 ' confidence = 100%
   Field.ExtractionConfident=True 'Make the green check mark in the Extraction Results Window
End Sub
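The string handling in that script can be tried out separately. Here is a minimal Python sketch of the same parsing, with the hypothetical function name `truth_from_filename`. It is an illustration, not the project code, and it differs in one labeled way: it splits at the last underscore rather than the first, on the assumption that the counter suffix is always the final `_NN` part.

```python
import ntpath


def truth_from_filename(path):
    """Recover the true field value from a path such as C:\\temp\\out\\Smith_01.tif.

    ntpath is used so that backslash paths are split correctly on any operating system.
    Raises ValueError if the file name contains no underscore.
    """
    name = ntpath.basename(path)          # everything after the last backslash
    stem = name[:name.rindex("_")]        # everything left of the last underscore
    return stem.upper()
```

For file names produced by the export step both variants give the same answer. They only differ if a field value itself contains an underscore.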
I then re-extracted everything (CTRL-A, F7), and now every document is perfect!
Note the green check mark ExtractionConfident=true and confidence =100%
I saved all the documents
and made a backup of the folder containing this perfect truth (because it is so easy to mess this up and lose it all!).
Remove this script before continuing, otherwise the benchmark will look perfect 😊.
I Right-clicked on the test set and selected Use as Benchmark Set.
I have 425 documents and the benchmark takes longer than 1 minute to run, so I made a tiny test set to check that the benchmark is working correctly before I run the full thing! I created a Document Subset called small.
I dragged a few documents into the small subset
I set the small subset to be the Default Document Subset
and saved the Benchmark Set.
I ran Extraction Benchmark from the Process Menu on this tiny set of 6 documents.
In the results I see that I don't have optimal results. I would like to know what the best choice is for the confidence threshold, which defaults to 80%.
Now I am ready to run the Threshold Optimizer to find the best threshold value!
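To make clear what a threshold trades off, here is a minimal Python sketch. It is not how the Threshold Optimizer works internally, and the sample data is made up for illustration rather than taken from my benchmark. It assumes a field is accepted when its confidence is greater than or equal to the threshold. The function name `count_outcomes` is hypothetical.

```python
def count_outcomes(results, threshold):
    """Count (accepted and correct, accepted but wrong, rejected) for one threshold.

    results: iterable of (confidence, is_correct) pairs, confidence in 0..1.
    """
    accepted_ok = accepted_wrong = rejected = 0
    for confidence, is_correct in results:
        if confidence >= threshold:
            if is_correct:
                accepted_ok += 1
            else:
                accepted_wrong += 1   # the dangerous case: a wrong value passes unchecked
        else:
            rejected += 1             # goes to manual validation, whether right or wrong
    return accepted_ok, accepted_wrong, rejected


# Made-up illustration data: (OCR confidence, OCR text equals the truth)
sample = [(0.95, True), (0.54, False), (0.85, True),
          (0.82, False), (0.70, True), (0.99, True)]

for t in (0.80, 0.90):
    print(t, count_outcomes(sample, t))
```

In this made-up sample, raising the threshold from 0.80 to 0.90 removes the one wrong value that was accepted, at the cost of sending two more documents to manual validation.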