The University of Nebraska-Lincoln's (UNL) Aida digital libraries research team and the Library of Congress (LC) collaborated on a "summer of machine learning" in 2019 to explore machine learning techniques for extending the accessibility of digital collections. The UNL team developed a number of prototype explorations over multiple iterations to investigate a range of questions and issues related to digital materials, the LC's collections, and machine learning practices in cultural heritage organizations. The team employed a variety of machine learning approaches, from back-propagation neural network classifiers to deep learning models such as convolutional neural networks. More specifically, the projects use VGG16, ResNeXt, dhSegment, and a fusion network combining ResNeXt and U-Net.
This repository includes the code developed and used across the team's explorations.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
For Exploration - Document Segmentation, the required software systems and libraries are:
- Anaconda >= 4.3
- Python >= 3.6
- TensorFlow 1.13
- CUDA 10.0 [if training on GPU]
- imageio >= 2.5
- pandas >= 0.24.2
- shapely >= 1.6.4
- scikit-learn >= 0.20.3
- scikit-image >= 0.15.0
- opencv-python >= 4.0.1
- tqdm >= 4.31.1
- sacred 0.7.4
- requests >= 2.21.0
- click >= 7.0
For Exploration - Graphic Element Classification and Text Extraction and Exploration - Digitization Type Differentiation, the required software systems and libraries are:
- Python 3.7
- MXNet 1.5
- CUDA 10.0 [if training on GPU]
- Matplotlib 3.1.1
- opencv-python 4.1
- numpy 1.17
For Exploration - Document Type Classification, the required software systems and libraries are:
- Anaconda >= 4.3
- Python >= 3.6
- TensorFlow 1.13
- CUDA 10.0 [if training on GPU]
- opencv-python >= 4.0.1
- numpy >= 1.16.2
- scikit-learn >= 0.20.3
- scikit-image >= 0.15.0
- matplotlib >= 1.4.3
- pandas >= 0.24.2
- seaborn 0.9.0
For Exploration - Document Image Quality Assessment, the required software systems and libraries are:
- Python 3.7
- scipy 1.3.1
- opencv-python 4.1
- scikit-image 0.15
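The version constraints listed above can be checked programmatically before running an exploration. The following is a minimal sketch and not part of the repository: the helper names `parse_version` and `satisfies` are hypothetical, and it handles only the two forms used in the lists above, `>= X.Y.Z` and a bare `X.Y`. Treating a bare version as a prefix match (so that `1.13.1` satisfies `1.13`) is an assumption about how the lists are meant to be read.

```python
def parse_version(text):
    """Turn '0.24.2' into (0, 24, 2); a non-numeric suffix ends the parse."""
    parts = []
    for piece in text.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, requirement):
    """Check an installed version against '>= X.Y.Z' or a bare 'X.Y'."""
    requirement = requirement.strip()
    have = parse_version(installed)
    if requirement.startswith(">="):
        want = parse_version(requirement[2:])
        # Pad with zeros so that (1, 17) compares like (1, 17, 0).
        width = max(len(have), len(want))
        have_padded = have + (0,) * (width - len(have))
        want_padded = want + (0,) * (width - len(want))
        return have_padded >= want_padded
    # Bare version: assumed to mean "this release series" (prefix match).
    want = parse_version(requirement)
    return have[:len(want)] == want
```

For example, `satisfies("0.24.2", ">= 0.24.2")` checks a pandas version against the Document Segmentation list. The installed version string can be read from the output of `pip show <package>`.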
Step-by-step instructions on how to install required software systems and libraries for each project
For Exploration - Document Segmentation
- Download Python 3.6 from https://www.python.org/downloads/
- Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
- Install Anaconda or Miniconda (installation procedure)
- Open Terminal (for macOS), Command-Line (for Windows)
- Go to the codebase/Exploration - Document Segmentation folder
- Create a virtual environment and activate it
conda create -n segmentation python=3.6
source activate segmentation
- Install packages
python setup.py install
For Exploration - Graphic Element Classification and Text Extraction and Exploration - Digitization Type Differentiation
- Download Python 3.7 from https://www.python.org/downloads/
- Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
- Run the downloaded installers
- Open Terminal (for macOS), Command-Line (for Windows)
- Install MXNet
pip install 'mxnet-cu100==1.5.1'
- Install Matplotlib
python -m pip install -U 'matplotlib==3.1.1'
- Install opencv-python
pip install 'opencv-python==4.1.*'
- Install numpy
pip install 'numpy==1.17'
For Exploration - Document Type Classification
- Download Python 3.6 from https://www.python.org/downloads/
- Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
- Install Anaconda or Miniconda (installation procedure)
- Open Terminal (for macOS), Command-Line (for Windows)
- Go to the codebase/Exploration - Document Type Classification folder
- Create a virtual environment and activate it
conda create -n classification python=3.6
source activate classification
- Install packages
python setup.py install
For Exploration - Document Image Quality Assessment
- Download Python 3.7 from https://www.python.org/downloads/
- Run the downloaded installer
- Open Terminal (for macOS), Command-Line (for Windows)
- Install scipy
pip install 'scipy==1.3.1'
- Install opencv-python
pip install 'opencv-python==4.1.*'
- Install scikit-image
pip install 'scikit-image==0.15'
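Once the packages are installed, a quick import check can confirm that the environment is usable before any demo is run. The sketch below is illustrative and not part of the repository; the function name `missing_modules` and the list `QUALITY_ASSESSMENT_MODULES` are hypothetical. Note that import names differ from pip package names: opencv-python is imported as `cv2`, scikit-image as `skimage`, and scikit-learn as `sklearn`.

```python
import importlib.util

# Import names (not pip names) for the Document Image Quality Assessment
# requirements: scipy, opencv-python (cv2), scikit-image (skimage).
QUALITY_ASSESSMENT_MODULES = ["scipy", "cv2", "skimage"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(QUALITY_ASSESSMENT_MODULES)
    if absent:
        print("Missing modules:", ", ".join(absent))
    else:
        print("All listed modules were found.")
```

The same function can be called with the import names of any other exploration, for example `["mxnet", "matplotlib", "cv2", "numpy"]` for the MXNet-based projects.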
Exploration - Document Segmentation:
- Download all files in demo/Exploration - Document Segmentation https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Segmentation
- Install the required software and libraries
- Download one of the following datasets: (1) https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/ENP_500 or (2) https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/difficulty_collection, for the segmentation or clustering task, respectively
- Copy the downloaded folder to the downloaded 'Exploration - Document Segmentation' folder
- Run one of the following commands, depending on the task
# Activate virtual environment
source activate segmentation
# For segmentation task
python demo_segmentation.py
# For clustering task
python demo_clustering.py
Exploration - Graphic Element Classification and Text Extraction:
- Download all files in demo/Exploration - Graphic Element Classification and Text Extraction https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Graphic%20Element%20Classification%20and%20Text%20Extraction
- Install the required software and libraries
- Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/BeyondWord_orginal_resolution
- Copy the downloaded folder to the downloaded 'Exploration - Graphic Element Classification and Text Extraction' folder
- Run the evaluation script
python eval.py
Exploration - Document Type Classification:
- Download all files in demo/Exploration - Document Type Classification https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Type%20Classification
- Install the required software and libraries
- Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/suffrage_1002
- Copy the downloaded folder to the downloaded 'Exploration - Document Type Classification' folder
- Run the demonstration script
# Activate virtual environment
source activate classification
# Run demonstration
python demo_classification.py
Exploration - Digitization Type Differentiation:
- Download all files in demo/Exploration - Digitization Type Differentiation https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Digitization%20Type%20Differentiation
- Install the required software and libraries
- Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/micrpfilm_scanning
- Copy the downloaded folder to the downloaded 'Exploration - Digitization Type Differentiation' folder
- Run the evaluation script
python eval.py
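Each demo above expects the downloaded dataset folder to sit inside the downloaded demo folder. A small pre-flight check can catch a misplaced or empty folder before a script is launched. This is a minimal sketch under that assumption; the function name `dataset_ready` is hypothetical, and the demo scripts themselves may look for the data in a different way.

```python
import os


def dataset_ready(demo_dir, dataset_name):
    """Return True if demo_dir/dataset_name is a non-empty directory."""
    dataset_path = os.path.join(demo_dir, dataset_name)
    if not os.path.isdir(dataset_path):
        return False
    # An empty folder usually means the download or copy did not finish.
    return len(os.listdir(dataset_path)) > 0
```

For example, `dataset_ready("Exploration - Document Type Classification", "suffrage_1002")` should return `True` after the copy step for the Document Type Classification demo.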
Please read the README file inside each project folder for a description of each end-to-end test.
- Exploration - Document Segmentation: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Segmentation
- Exploration - Graphic Element Classification and Text Extraction: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Graphic%20Element%20Classification%20and%20Text%20Extraction
- Exploration - Document Type Classification: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Document%20Type%20Classification
- Exploration - Document Image Quality Assessment: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Document%20Image%20Quality%20Assessment
- Exploration - Digitization Type Differentiation: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Digitization%20Type%20Differentiation
- Python - The programming language
- CUDA Toolkit - Enable GPU for model training
- MXNet - Deep learning framework
- TensorFlow - Deep learning framework
- MATLAB - Math, graphics, programming platform
| | Inputs | Technique | Output | Reports |
|---|---|---|---|---|
| Exploration - Document Segmentation (segmentation) | ENP_500 (European historical newspaper); Beyond Words | U-Net | 5-class pixel-level segmented image | Progress report - Chulwoo Pack - 07312019.pdf; Progress report - Chulwoo Pack - 08052019.pdf |
| Exploration - Document Clustering (clustering) | ENP_500 (European historical newspaper) | t-SNE | Clustered manifold | Progress report - Chulwoo Pack - 09232019.pdf |
| Exploration - Graphic Element Classification and Text Extraction | ENP_500 (European historical newspaper); Beyond Words | U-NeXt | Predicted region segmentation | Progress report - Yi Liu - 07302019.pdf |
| Exploration - Document Type Classification | suffrage_1002 (LoC Suffrage campaign) | U-Net | Type of document image: handwritten, typed, or mixed | Progress report - Chulwoo Pack - 08132919; Progress report - Chulwoo Pack - 08202019.pdf |
| Exploration - Document Image Quality Assessment | Civil War Campaign | DIQA | Four quality scores | Progress report - Yi Liu - 08122019.pdf; Progress report - Yi Liu - 09052019.pdf |
| Exploration - Document Image Quality Assessment | difficulty_collection (LoC Manuscript/Mixed material) | U-Net, DIQA | Visual difficulty correlation | Progress report - Chulwoo Pack - 10312019.pdf |
| Exploration - Digitization Type Differentiation | Civil War Campaign | ResNeXt | Classification as microfilm or scanning | Progress report - Yi Liu - 09052019.pdf; Progress report - Yi Liu - 10292019.pdf |
- Yi Liu - research associate and developer
- Chulwoo (Mike) Pack - research associate and developer
- Elizabeth Lorang - senior adviser
- Leen-Kiat Soh - senior adviser
- Ashlyn Stewart - research assistant
This project is licensed under the GPL License. See the LICENSE file for details.