Due to copyright issues, we publish images of PR1954 instead of PR1956 (their data and methods are nearly the same).
This project is mainly based on our NeurIPS 2019 workshop paper *Information Extraction from Text Region with Complex Tabular Structure*.
- Files explaining the methods of preprocessing and classification
- Code for visualizing results
- Code for using the Google Cloud Vision API
- Code for the preprocessing pipeline
- Code for the classification pipeline
Please see the second section of our paper. Note that PR1954 does not have the class Table.
Item | Status |
---|---|
Number of Raw Scans | 684 |
Page Bounding Box | Included |
ROI Bounding Box | Included |
Column Bounding Box | Included |
Row Bounding Box | Included |
Row Classification | Included |
The AWS CLI is required to download the data from AWS.
Code for downloading all the data of PR1954 is provided.
Data | AWS S3 Path |
---|---|
Raw Image | s3://harvardaha-raw-data/personnel-records/1954/scans/firm/ |
Labeled Data | s3://harvardaha-results/personnel-records/1954/labeled_data/ |
Segmentation Results | s3://harvardaha-results/personnel-records/1954/seg/firm/ |
CNN Output | s3://harvardaha-results/personnel-records/1954/prob/firm/ |
CRF Output | s3://harvardaha-results/personnel-records/1954/cls/CRF/firm/ |
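Assuming each S3 prefix in the table above is mirrored locally with `aws s3 sync`, the sketch below builds the corresponding download commands. The local directory layout (`./pr1954/...`) is an assumption, not part of the repository:

```python
import subprocess

# S3 prefixes taken verbatim from the table above.
S3_PREFIXES = {
    "raw": "s3://harvardaha-raw-data/personnel-records/1954/scans/firm/",
    "labeled": "s3://harvardaha-results/personnel-records/1954/labeled_data/",
    "seg": "s3://harvardaha-results/personnel-records/1954/seg/firm/",
    "prob": "s3://harvardaha-results/personnel-records/1954/prob/firm/",
    "cls": "s3://harvardaha-results/personnel-records/1954/cls/CRF/firm/",
}

def sync_command(name, dest_root="./pr1954"):
    """Return the AWS CLI command that mirrors one S3 prefix locally."""
    return ["aws", "s3", "sync", S3_PREFIXES[name], f"{dest_root}/{name}/"]

def download_all(dest_root="./pr1954"):
    """Run `aws s3 sync` for every prefix (requires the AWS CLI on PATH)."""
    for name in S3_PREFIXES:
        subprocess.run(sync_command(name, dest_root), check=True)
```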
Book | Method | Accuracy |
---|---|---|
PR1954 | CNN | 98.4% |
PR1954 | CNN+CRF | 99.7% |
PR1956 | CNN | 95.6% |
PR1956 | CNN+CRF | 96.8% |
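The gain from CNN+CRF comes from decoding the row labels of a page jointly rather than taking the CNN's per-row argmax. A minimal sketch of that idea using pure-Python Viterbi decoding; the two-label setup and transition scores below are illustrative assumptions, not the trained parameters from the paper:

```python
def viterbi(emission_logp, transition_logp):
    """Most likely label sequence given per-row log-probabilities from a
    classifier (emissions) and pairwise transition scores (the CRF part).

    emission_logp: list of rows, each a list of per-label log-probabilities.
    transition_logp: transition_logp[i][j] scores moving from label i to j.
    """
    n_rows, n_labels = len(emission_logp), len(emission_logp[0])
    score = list(emission_logp[0])  # best score ending in each label
    back = []                       # backpointers, one list per step
    for t in range(1, n_rows):
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transition_logp[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transition_logp[best_i][j]
                             + emission_logp[t][j])
        score, back = new_score, back + [ptr]
    # Trace the best path backwards from the highest-scoring final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With emissions that favor an isolated label flip in the middle row and transitions that penalize switching, the decoder smooths the sequence into a consistent run, which is the kind of correction behind the accuracy differences in the table above.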
Use Anaconda to create a virtual environment with the command below:

```shell
conda env create -f environment.yml
```
If you have downloaded the whole dataset and want to reproduce the results, please read the files in Documents/
and run the corresponding shell scripts one by one. Remember to modify the input/output paths.
If you have downloaded the whole dataset and want to visualize the results, run the corresponding shell scripts in Visualization/. Remember to modify the input/output paths.
Demo code that downloads one sample and visualizes the results.
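As a rough sketch of the coordinate arithmetic such a visualization involves, the snippet below turns row spans inside a detected column into absolute crop rectangles on the page. The (x, y, w, h) box convention is an assumption; the actual schema of the segmentation output may differ:

```python
def row_crops(column_box, row_y_spans):
    """Convert rows of one column into absolute crop rectangles.

    column_box: (x, y, w, h) of the column in page coordinates (assumed format).
    row_y_spans: (top, bottom) offsets of each row relative to the column top.
    Returns one (left, top, right, bottom) rectangle per row, suitable for
    cropping or for drawing row bounding boxes on the scan.
    """
    x, y, w, _h = column_box
    return [(x, y + top, x + w, y + bottom) for top, bottom in row_y_spans]
```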