KaixuanZ/PR1956

Due to copyright issues, we publish the images of PR1954 instead of PR1956 (their data and methods are nearly the same).

This project is mainly based on our NeurIPS 2019 workshop paper, Information Extraction from Text Region with Complex Tabular Structure.

Repo Structure

Documents/

Files explaining the methods of preprocessing and classification

Visualization/

Code for visualizing results

OCR/

Code for using Google Cloud Vision API

Preprocessing/

Code for preprocessing pipeline

CNN/, GraphicalModel/, and Postprocessing/

Code for classification pipeline

Dataset

Introduction

Please see the second section of our paper. Note that PR1954 does not have the class Table.

Number of raw scans: 684
Page bounding box: included
ROI bounding box: included
Column bounding box: included
Row bounding box: included
Row classification: included

Download

AWS CLI is required for downloading data from AWS.

Code for downloading all the data of PR1954.

AWS S3 Directory Structure

| Data | AWS S3 Path |
| --- | --- |
| Raw Image | s3://harvardaha-raw-data/personnel-records/1954/scans/firm/ |
| Labeled Data | s3://harvardaha-results/personnel-records/1954/labeled_data/ |
| Segmentation Results | s3://harvardaha-results/personnel-records/1954/seg/firm/ |
| CNN Output | s3://harvardaha-results/personnel-records/1954/prob/firm/ |
| CRF Output | s3://harvardaha-results/personnel-records/1954/cls/CRF/firm/ |
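
As a rough illustration of how the paths in the table above could be fetched with the AWS CLI, the sketch below builds one `aws s3 sync` command per S3 prefix and prints it. This is not the repository's own download script. The local directory layout (`PR1954/<name>`) and the dictionary keys are hypothetical choices made for this example, and whether the buckets require credentials is not stated here, so the commands may need adjusting (for example with the AWS CLI's `--no-sign-request` option if the data is publicly readable).

```python
import shlex

# S3 prefixes copied from the table above. The short keys are
# hypothetical names chosen for this example.
S3_PATHS = {
    "raw_images": "s3://harvardaha-raw-data/personnel-records/1954/scans/firm/",
    "labeled_data": "s3://harvardaha-results/personnel-records/1954/labeled_data/",
    "segmentation": "s3://harvardaha-results/personnel-records/1954/seg/firm/",
    "cnn_output": "s3://harvardaha-results/personnel-records/1954/prob/firm/",
    "crf_output": "s3://harvardaha-results/personnel-records/1954/cls/CRF/firm/",
}


def build_sync_command(s3_uri, local_dir):
    """Return the argument list for `aws s3 sync <s3_uri> <local_dir>`."""
    return ["aws", "s3", "sync", s3_uri, local_dir]


def build_all_commands(local_root="PR1954"):
    """Map each dataset part to its sync command.

    The local layout <local_root>/<name> is an assumption, not the
    layout the repository's scripts expect.
    """
    return {
        name: build_sync_command(uri, f"{local_root}/{name}")
        for name, uri in S3_PATHS.items()
    }


if __name__ == "__main__":
    # Only print the commands. To actually download, pass each list to
    # subprocess.run(cmd, check=True) on a machine with the AWS CLI installed.
    for cmd in build_all_commands().values():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the destination paths before any data is transferred.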

Classification Accuracy

| Book | Method | Accuracy |
| --- | --- | --- |
| PR1954 | CNN | 98.4% |
| PR1954 | CNN+CRF | 99.7% |
| PR1956 | CNN | 95.6% |
| PR1956 | CNN+CRF | 96.8% |
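
To give some intuition for why adding a CRF on top of per-row CNN probabilities can raise accuracy, here is a minimal sketch of Viterbi decoding for a linear-chain model over consecutive rows. It is an illustration under stated assumptions, not the code in GraphicalModel/: the two abstract classes, the toy probabilities, and the transition matrix are all made up for this example, and the actual model may use different features and structure.

```python
import math


def viterbi_decode(log_unary, log_transition):
    """Return the highest-scoring class sequence for a linear chain.

    log_unary[t][k]      : log-probability of class k for row t
    log_transition[i][j] : log-score of moving from class i to class j
    """
    num_classes = len(log_unary[0])
    score = list(log_unary[0])
    backpointers = []
    for row_scores in log_unary[1:]:
        new_score, pointers = [], []
        for j in range(num_classes):
            # Best previous class for ending in class j at this row.
            best_i = max(
                range(num_classes),
                key=lambda i: score[i] + log_transition[i][j],
            )
            pointers.append(best_i)
            new_score.append(score[best_i] + log_transition[best_i][j] + row_scores[j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final class.
    path = [max(range(num_classes), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy per-row probabilities (hypothetical): the third row is ambiguous.
cnn_probs = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.8, 0.2], [0.9, 0.1]]
# Toy transition matrix (hypothetical): neighbouring rows tend to share a class.
transition = [[0.9, 0.1], [0.1, 0.9]]

log_unary = [[math.log(p) for p in row] for row in cnn_probs]
log_transition = [[math.log(p) for p in row] for row in transition]

independent = [max(range(2), key=lambda k: row[k]) for row in cnn_probs]
smoothed = viterbi_decode(log_unary, log_transition)
print("per-row argmax:", independent)
print("chain decoding:", smoothed)
```

In this toy case the per-row argmax labels the ambiguous third row differently from its neighbours, while the chain decoding keeps it consistent with them, because two class switches cost more than the small gain in that row's probability.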

Environment

Anaconda

Use Anaconda to create a virtual environment with the command below:

conda env create -f environment.yml

Demo

If you have downloaded the whole dataset and want to reproduce the results, please read the files in Documents/ and run the corresponding shell scripts one by one. Remember to modify the input/output paths.

If you have downloaded the whole dataset and want to visualize the results, run the corresponding shell script in Visualization/. Remember to modify the input/output paths.

Demo code that downloads one data sample and visualizes the results.
