Due to copyright issues, we publish images of PR1954 instead of PR1956 (their data and methods are nearly the same).
This project is mainly based on our NeurIPS 2019 workshop paper *Information Extraction from Text Region with Complex Tabular Structure*.
- Files explaining the methods of preprocessing and classification
- Code for visualizing results
- Code for using the Google Cloud Vision API
- Code for the preprocessing pipeline
- Code for the classification pipeline
Please see the second section of our paper. Note that PR1954 does not have the class Table.
Item | Status |
---|---|
Number of Raw Scans | 684 |
Page Bounding Box | Included |
ROI Bounding Box | Included |
Column Bounding Box | Included |
Row Bounding Box | Included |
Row Classification | Included |
The AWS CLI is required to download the data from AWS.
Code for downloading all the data of PR1954 is provided.
Data | AWS S3 Path |
---|---|
Raw Image | s3://harvardaha-raw-data/personnel-records/1954/scans/firm/ |
Labeled Data | s3://harvardaha-results/personnel-records/1954/labeled_data/ |
Segmentation Results | s3://harvardaha-results/personnel-records/1954/seg/firm/ |
CNN Output | s3://harvardaha-results/personnel-records/1954/prob/firm/ |
CRF Output | s3://harvardaha-results/personnel-records/1954/cls/CRF/firm/ |
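Assuming each S3 prefix in the table above is mirrored locally with `aws s3 sync`, the sketch below builds the corresponding download commands. The local directory layout (`./pr1954/...`) is an assumption, not part of the repository:

```python
import subprocess

# S3 prefixes taken verbatim from the table above.
S3_PREFIXES = {
    "raw": "s3://harvardaha-raw-data/personnel-records/1954/scans/firm/",
    "labeled": "s3://harvardaha-results/personnel-records/1954/labeled_data/",
    "seg": "s3://harvardaha-results/personnel-records/1954/seg/firm/",
    "prob": "s3://harvardaha-results/personnel-records/1954/prob/firm/",
    "cls": "s3://harvardaha-results/personnel-records/1954/cls/CRF/firm/",
}

def sync_command(name, dest_root="./pr1954"):
    """Return the AWS CLI command that mirrors one S3 prefix locally."""
    return ["aws", "s3", "sync", S3_PREFIXES[name], f"{dest_root}/{name}/"]

def download_all(dest_root="./pr1954"):
    """Run `aws s3 sync` for every prefix (requires the AWS CLI on PATH)."""
    for name in S3_PREFIXES:
        subprocess.run(sync_command(name, dest_root), check=True)
```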
Book | Method | Accuracy |
---|---|---|
PR1954 | CNN | 98.4% |
PR1954 | CNN+CRF | 99.7% |
PR1956 | CNN | 95.6% |
PR1956 | CNN+CRF | 96.8% |
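The gain from CNN+CRF comes from decoding the row labels of a page jointly rather than taking the CNN's per-row argmax. A minimal sketch of that idea using pure-Python Viterbi decoding; the two-label setup and transition scores below are illustrative assumptions, not the trained parameters from the paper:

```python
def viterbi(emission_logp, transition_logp):
    """Most likely label sequence given per-row log-probabilities from a
    classifier (emissions) and pairwise transition scores (the CRF part).

    emission_logp: list of rows, each a list of per-label log-probabilities.
    transition_logp: transition_logp[i][j] scores moving from label i to j.
    """
    n_rows, n_labels = len(emission_logp), len(emission_logp[0])
    score = list(emission_logp[0])  # best score ending in each label
    back = []                       # backpointers, one list per step
    for t in range(1, n_rows):
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transition_logp[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transition_logp[best_i][j]
                             + emission_logp[t][j])
        score, back = new_score, back + [ptr]
    # Trace the best path backwards from the highest-scoring final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With emissions that favor an isolated label flip in the middle row and transitions that penalize switching, the decoder smooths the sequence into a consistent run, which is the kind of correction behind the accuracy differences in the table above.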
Use Anaconda to create a virtual environment with the command below:

```shell
conda env create -f environment.yml
```
If you have downloaded the whole dataset and want to reproduce the results, please read the files in Documents/
and run the corresponding shell scripts one by one. Remember to modify the input/output paths.
If you have downloaded the whole dataset and want to visualize the results, run the corresponding shell scripts in Visualization/. Remember to modify the input/output paths.
Demo code that downloads one sample and visualizes the results.
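As a rough sketch of the coordinate arithmetic such a visualization involves, the snippet below turns row spans inside a detected column into absolute crop rectangles on the page. The (x, y, w, h) box convention is an assumption; the actual schema of the segmentation output may differ:

```python
def row_crops(column_box, row_y_spans):
    """Convert rows of one column into absolute crop rectangles.

    column_box: (x, y, w, h) of the column in page coordinates (assumed format).
    row_y_spans: (top, bottom) offsets of each row relative to the column top.
    Returns one (left, top, right, bottom) rectangle per row, suitable for
    cropping or for drawing row bounding boxes on the scan.
    """
    x, y, w, _h = column_box
    return [(x, y + top, x + w, y + bottom) for top, bottom in row_y_spans]
```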