Project4: Post-processing of OCR (Optical Character Recognition) text files

Background

OCR is a technology that enables you to convert different types of documents (e.g., scanned documents, PDF files, or images captured by a digital camera) into editable and searchable data. An OCR system consists of a pre-processing module, a word recognition module, and a post-processing module, which are pipelined together to turn a scanned document into searchable text.


Team members (group 6)

  • HyunBin Yoo
  • Guanren Wang
  • Andy Huang
  • Feng Su
  • Ying Jin

Project summary

  • In this project, we created an OCR post-processing procedure to enhance Tesseract OCR output. For detection we used the D-3 method, which involves feature engineering and training a Support Vector Machine (SVM) classifier. First, we labelled the Tesseract output against the ground truth, ignoring files whose line counts differed and lines whose word counts differed; as an improvement, we could have found a way to include all files and all lines. Each feature is built by a separate function, and a Buildfeature function aggregates them into a feature matrix (a sketch of this pipeline appears after this list). The matrix was then fed into the SVM for training, which took about 30 minutes. As a result, the weighted average precision is 0.83, the weighted average recall is 0.84, and the weighted average f1-score is 0.83.

  • For the correction part, we used the C-3 method, which corrects words containing exactly one typo among those flagged on the test data by the detection step. First, we used edit distance to find each typo's potential correction candidates, then we used the Bayesian combination rule to choose the most probable correction (candidate generation and the channel model are sketched after this list). The formula we used to calculate pr(t|c) is as follows:

    pr(t|c) = del[c_{p-1}, c_p] / chars[c_{p-1}, c_p]    if deletion
              add[c_{p-1}, t_p] / chars[c_{p-1}]         if insertion
              sub[t_p, c_p] / chars[c_p]                 if substitution
              rev[c_p, c_{p+1}] / chars[c_p, c_{p+1}]    if reversal
  • We should be careful with the cases where the correction position p equals 0, since we then have no information about c_{p-1}; following our professor's advice, we use the number of words in the training set as the denominator instead, which is also reasonable. The method we use to calculate pr(c) is ELE (Expected Likelihood Estimation), and candidates are ranked by pr(c) * pr(t|c), which is proportional to the posterior probability (see the prior/posterior sketch after this list).

  • We evaluated our algorithm using precision and recall at both the word level and the character level (a small evaluation sketch follows this list). The results show that we improved word precision from 0.67 to 0.77, word recall from 0.66 to 0.75, character precision from 0.94 to 0.96, and character recall from 0.91 to 0.94, a significant improvement over the raw Tesseract output.
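
A minimal sketch of the detection pipeline described above, assuming Python with scikit-learn. The feature set, the function names (word_features and build_features standing in for the Buildfeature function), and the toy tokens and labels are illustrative assumptions, not the repository's actual code.

    # Sketch of the D-3 detection step: hand-crafted word features fed to an SVM.
    # Feature choices and data here are placeholders for illustration only.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    VOWELS = set("aeiouAEIOU")

    def word_features(word):
        """A few simple numeric features for one OCR token (illustrative only)."""
        n = max(len(word), 1)
        return [
            len(word),                                  # token length
            sum(c.isdigit() for c in word) / n,         # digit ratio
            sum(not c.isalnum() for c in word) / n,     # punctuation ratio
            sum(c in VOWELS for c in word) / n,         # vowel ratio
        ]

    def build_features(tokens):
        """Aggregate per-word features into a feature matrix (one row per token)."""
        return np.array([word_features(t) for t in tokens])

    # tokens: words from the Tesseract output; labels: 1 if the aligned
    # ground-truth word differs (i.e. the token is an OCR error), else 0.
    tokens = ["hello", "he1lo", "w0rld", "world"]
    labels = [0, 1, 1, 0]

    X = build_features(tokens)
    clf = SVC(kernel="rbf", class_weight="balanced")
    clf.fit(X, labels)
    print(classification_report(labels, clf.predict(X)))   # weighted avg P/R/F1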
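
The candidate generation and the channel model pr(t|c) can be sketched as follows. The confusion-count tables (del_ct, add_ct, sub_ct, rev_ct), the character-count tables, the '#' word-boundary marker, and the toy counts are illustrative placeholders, not the exact data structures used in lib/. The p == 0 branch falls back to the number of words in the training set, as described in the third bullet above.

    # Sketch: generate edit-distance-1 candidates and score pr(t|c) using the
    # four confusion matrices (deletion, insertion, substitution, reversal).
    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def edits1(word):
        """All strings exactly one edit away from `word`."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        inserts = [L + c + R for L, R in splits for c in ALPHABET]
        subs = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
        revs = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        return set(deletes + inserts + subs + revs) - {word}

    def channel_prob(typo, cand, del_ct, add_ct, sub_ct, rev_ct,
                     chars1, chars2, n_words):
        """pr(typo | cand) for a single-edit error (assumes typo != cand)."""
        p = 0                                   # first position where they differ
        while p < min(len(typo), len(cand)) and typo[p] == cand[p]:
            p += 1
        prev = cand[p - 1] if p > 0 else "#"    # '#' = word boundary when p == 0
        if len(typo) == len(cand) - 1:          # deletion of cand[p]
            denom = chars2[(prev, cand[p])] if p > 0 else n_words
            return del_ct[(prev, cand[p])] / max(denom, 1)
        if len(typo) == len(cand) + 1:          # insertion of typo[p]
            denom = chars1[prev] if p > 0 else n_words
            return add_ct[(prev, typo[p])] / max(denom, 1)
        if typo[p + 1:] == cand[p + 1:]:        # substitution of cand[p] by typo[p]
            return sub_ct[(typo[p], cand[p])] / max(chars1[cand[p]], 1)
        # otherwise: adjacent transposition (reversal) of cand[p], cand[p+1]
        return rev_ct[(cand[p], cand[p + 1])] / max(chars2[(cand[p], cand[p + 1])], 1)

    # Toy example: "hello" -> "ello" is a deletion at position 0, so the
    # denominator falls back to the number of words in the training set.
    del_ct = Counter({("#", "h"): 2})
    add_ct, sub_ct, rev_ct = Counter(), Counter(), Counter()
    chars1, chars2 = Counter({"h": 100}), Counter({("#", "h"): 50})
    print(channel_prob("ello", "hello", del_ct, add_ct, sub_ct, rev_ct,
                       chars1, chars2, n_words=1000))          # 0.002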
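
The ELE prior pr(c) and the ranking by pr(c) * pr(t|c) can then be sketched like this. The corpus counts, the use of the number of distinct words as the vocabulary size, and the uniform dummy channel in the example are placeholder assumptions.

    # Sketch of the prior pr(c) via Expected Likelihood Estimation (ELE) and
    # the final ranking of candidates by pr(c) * pr(t|c).
    from collections import Counter

    def prior_ele(cand, corpus_counts, vocab_size):
        """ELE-smoothed prior: (count + 0.5) / (N + 0.5 * V)."""
        n_tokens = sum(corpus_counts.values())
        return (corpus_counts[cand] + 0.5) / (n_tokens + 0.5 * vocab_size)

    def best_correction(typo, candidates, corpus_counts, channel):
        """Return the candidate maximizing pr(c) * pr(t|c)."""
        vocab = len(corpus_counts)
        return max(candidates,
                   key=lambda c: prior_ele(c, corpus_counts, vocab) * channel(typo, c))

    # Tiny usage example with made-up counts and a dummy (uniform) channel model.
    counts = Counter({"hello": 120, "help": 60, "hell": 15})
    pick = best_correction("helo", ["hello", "help", "hell"],
                           counts, channel=lambda t, c: 1e-4)
    print(pick)   # "hello" wins on the prior when the channel is uniform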
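
Finally, a minimal sketch of the evaluation, assuming word-level and character-level precision and recall are computed from multiset overlaps between the ground truth and the OCR (or corrected) output; the exact counting behind the numbers reported above lives in the repository's evaluation code.

    # Sketch of the evaluation: precision and recall at the word and character
    # level, using multiset (Counter) intersections.
    from collections import Counter

    def precision_recall(truth_tokens, output_tokens):
        """Precision/recall of an output token multiset against the ground truth."""
        truth, out = Counter(truth_tokens), Counter(output_tokens)
        overlap = sum((truth & out).values())          # multiset intersection
        precision = overlap / max(sum(out.values()), 1)
        recall = overlap / max(sum(truth.values()), 1)
        return precision, recall

    truth = "the quick brown fox".split()
    ocr = "the qu1ck brown fox".split()

    print(precision_recall(truth, ocr))                        # word level
    print(precision_recall(list("".join(truth)),               # character level
                           list("".join(ocr))))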

Contribution statement

HyunBin Yoo and Guanren Wang discussed how to label the data for SVM training and which features would be appropriate for the SVM, and developed the detection SVM model. HyunBin Yoo implemented the detection algorithm in Python. Guanren Wang checked the detection code and pointed out mistakes. Andy Huang, Feng Su and Ying Jin discussed and explored the correction model and designed and carried out the evaluation. All team members contributed to the GitHub repository and prepared the presentation. As the presenter, HyunBin Yoo created the PowerPoint slides. All team members approve the work presented in our GitHub repository, including this contribution statement.

Following suggestions by Rich FitzJohn (@richfitz), this folder is organized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.

About

In this project, we implemented the detection algorithm (D-3 in the folder doc/paper) and the correction algorithm (C-3 in the folder doc/paper) for post-processing of OCR output.
