DAIRE (Deep Archival Image Retrieval Engine)

DAIRE (Deep Archival Image Retrieval Engine) is an image exploration tool based on latent representations derived from neural networks, which allows scholars to "query" using an image of interest to rapidly find related images within a web archive. More details can be found in our paper:

Tobi Adewoye, Xiao Han, Nick Ruest, Ian Milligan, Samantha Fritz, and Jimmy Lin. Content-Based Exploration of Archival Images Using Neural Networks. Proceedings of the 20th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2020), August 2020.

A live demo is available at http://daire.cs.uwaterloo.ca/, running on images from the "EnchantedForest" neighborhood of GeoCities. This repo holds the code that runs that demo.

Installation

If you haven't set up the Archives Unleashed Toolkit, follow the instructions here.

Use the Toolkit to extract image information and place the parquet files in data/images/:

spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
    --extractor ImageInformationExtractor --input /path/to/warcs/* \
    --output /path/to/daire/data/images --output-format parquet

Use the Toolkit to extract the image graph and place the parquet files in data/imagegraph/:

spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
    --extractor ImageGraphExtractor --input /path/to/warcs/* \
    --output /path/to/daire/data/imagegraph --output-format parquet
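
If you prefer to drive the extraction from Python, the command can be assembled as an argument list. This is a minimal sketch, not part of DAIRE: the helper name `build_extract_command` is hypothetical, and the paths are the same placeholders used above. The glob is expanded in Python because no shell is involved when the list is passed to `subprocess.run`.

```python
import glob
import shlex


def build_extract_command(jar, extractor, warc_glob, output_dir):
    """Return the spark-submit argument list for one Toolkit extraction run.

    Hypothetical helper: it only mirrors the command line shown above.
    """
    # Expand the pattern ourselves, as the shell would for /path/to/warcs/*.
    # If nothing matches, pass the pattern through unchanged.
    inputs = sorted(glob.glob(warc_glob)) or [warc_glob]
    return [
        "spark-submit",
        "--class", "io.archivesunleashed.app.CommandLineAppRunner",
        str(jar),
        "--extractor", extractor,
        "--input", *inputs,
        "--output", str(output_dir),
        "--output-format", "parquet",
    ]


if __name__ == "__main__":
    cmd = build_extract_command(
        "path/to/aut-fatjar.jar",
        "ImageInformationExtractor",
        "/path/to/warcs/*",
        "/path/to/daire/data/images",
    )
    # Print the command for review; to execute it, pass the list to
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

The same helper can be reused for the image graph step by changing the extractor name and the output directory.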

Install DAIRE dependencies:

pip install -r requirements.txt

Preprocessing the Data

Set save_image=True in script/extract-all-parquets-multi.py, then run both preprocessing scripts:

python script/extract-all-parquets-multi.py
python script/extract-parquets-url-to-name.py

This will save the images to img/ and generate full_info.txt and url_to_name.txt, along with some intermediate files.
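
Before building the index, it can help to confirm that preprocessing produced the outputs named above. The following is a minimal sketch, assuming the outputs are written relative to the directory you ran the scripts from; the function name `missing_outputs` is hypothetical and not part of DAIRE.

```python
from pathlib import Path

# Output names taken from the description above.
EXPECTED_OUTPUTS = ("img", "full_info.txt", "url_to_name.txt")


def missing_outputs(root="."):
    """Return the expected preprocessing outputs that are absent or empty."""
    root = Path(root)
    missing = []
    for name in EXPECTED_OUTPUTS:
        path = root / name
        if path.is_dir():
            # A directory counts only if it contains at least one entry.
            ok = any(path.iterdir())
        elif path.is_file():
            # A file counts only if it is non-empty.
            ok = path.stat().st_size > 0
        else:
            ok = False
        if not ok:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_outputs(".")
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All preprocessing outputs are present.")
```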

Generate the HNSW index:

python script/index-hnsw.py

The resulting index will be saved in bin/<index_number>.bin and bin/<index_number>.txt.

In future runs, you can load from an index as follows:

python script/index-hnsw.py <index_number>
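
To see which index numbers are available to load, you can scan bin/ for matching .bin and .txt pairs. This is a sketch under two assumptions: that <index_number> is a non-negative integer, and that an index is usable only when both files exist. The function name `available_indexes` is hypothetical.

```python
from pathlib import Path


def available_indexes(bin_dir="bin"):
    """Return sorted index numbers that have both <n>.bin and <n>.txt."""
    numbers = []
    for bin_file in Path(bin_dir).glob("*.bin"):
        stem = bin_file.stem
        # Keep only purely numeric names with a companion .txt file.
        if stem.isdecimal() and bin_file.with_suffix(".txt").is_file():
            numbers.append(int(stem))
    return sorted(numbers)


if __name__ == "__main__":
    found = available_indexes("bin")
    if found:
        print("Available indexes:", found)
        print(f"Load the newest with: python script/index-hnsw.py {found[-1]}")
    else:
        print("No complete index found in bin/.")
```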

Running the App

The front-end is built with TypeScript and React. To make changes, follow the steps in the ui/ directory here.

Finally, start up the Flask server:

python server.py

Resources

How do I scale HNSW to more images (10^6, 10^7, 10^8)? See the discussion in this GitHub issue.
