This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

archivesunleashed/daire

DAIRE (Deep Archival Image Retrieval Engine)

DAIRE (Deep Archival Image Retrieval Engine) is an image exploration tool based on latent representations derived from neural networks, which allows scholars to "query" using an image of interest to rapidly find related images within a web archive. More details can be found in our paper:

A live demo is available at http://daire.cs.uwaterloo.ca/, running on images from the "EnchantedForest" neighborhood of GeoCities. This repo holds the code that runs that demo.

Installation

If you haven't set up the Archives Unleashed Toolkit, follow the instructions here.

Use the Toolkit to extract image information and place the parquet files in data/images/:

spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
    --extractor ImageInformationExtractor --input /path/to/warcs/* \
    --output /path/to/daire/data/images --output-format parquet

Use the Toolkit to extract the image graph and place the parquet files in data/imagegraph/:

spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
    --extractor ImageGraphExtractor --input /path/to/warcs/* \
    --output /path/to/daire/data/imagegraph --output-format parquet
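Before moving on, it can help to confirm that both extractions actually produced output. The following is a minimal sketch, not part of DAIRE: the helper names `count_parquet_files` and `missing_extractions` are hypothetical, and it assumes Spark wrote files ending in `.parquet` directly into `data/images/` and `data/imagegraph/`.

```python
from pathlib import Path


def count_parquet_files(directory):
    """Return the number of *.parquet files directly inside `directory`."""
    path = Path(directory)
    if not path.is_dir():
        return 0
    return sum(1 for _ in path.glob("*.parquet"))


def missing_extractions(root="."):
    """List the expected data directories that hold no parquet files."""
    # Directory names come from the steps above; the check itself is an assumption.
    expected = ["data/images", "data/imagegraph"]
    return [d for d in expected if count_parquet_files(Path(root) / d) == 0]


if __name__ == "__main__":
    missing = missing_extractions()
    if missing:
        print("No parquet files found in:", ", ".join(missing))
    else:
        print("Both extractions look present.")
```

Run it from the root of the DAIRE checkout; an empty result from `missing_extractions()` means both directories contain at least one parquet file.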

Install DAIRE dependencies:

pip install -r requirements.txt

Preprocessing the Data

Set save_image=True in script/extract-all-parquets-multi.py, then run both scripts:

python script/extract-all-parquets-multi.py
python script/extract-parquets-url-to-name.py

This will save the images to img/ and generate full_info.txt and url_to_name.txt, along with some intermediate files.

Generate the HNSW index:

python script/index-hnsw.py

The resulting index will be saved in bin/<index_number>.bin and bin/<index_number>.txt.
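For intuition about what the index answers, the sketch below runs the exact brute-force nearest-neighbour search that an HNSW index approximates at much lower cost. It is not DAIRE's code and does not use an HNSW library: the cosine distance metric, the toy three-dimensional feature vectors, the file names, and the function names are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def nearest_images(query, features, k=3):
    """Exact k-nearest-neighbour search by scanning every stored vector."""
    scored = [(cosine_distance(query, vec), name) for name, vec in features.items()]
    scored.sort()  # smallest distance first
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # Toy 3-dimensional "features"; real ones would come from a neural network.
    features = {
        "cat.gif": [0.9, 0.1, 0.0],
        "kitten.gif": [0.8, 0.2, 0.1],
        "castle.jpg": [0.0, 0.1, 0.9],
    }
    print(nearest_images([1.0, 0.0, 0.0], features, k=2))
    # prints ['cat.gif', 'kitten.gif']
```

The scan touches every vector on every query, so its cost grows linearly with the collection. An HNSW graph gives up exactness to avoid that cost.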

In future runs, you can load from an index as follows:

python script/index-hnsw.py <index_number>
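To see which index numbers are available to load, a small helper can scan bin/ for complete pairs of files. This is a sketch under one assumption: the index number is the file name without its extension, as the bin/<index_number>.bin and bin/<index_number>.txt pattern suggests. The name `available_indexes` is hypothetical.

```python
from pathlib import Path


def available_indexes(bin_dir="bin"):
    """Return stems that have both a .bin and a .txt file in `bin_dir`."""
    path = Path(bin_dir)
    if not path.is_dir():
        return []
    stems = {p.stem for p in path.glob("*.bin")}
    # Keep only indexes whose companion .txt file is also present.
    return sorted(s for s in stems if (path / f"{s}.txt").is_file())


if __name__ == "__main__":
    for index_number in available_indexes():
        print(f"python script/index-hnsw.py {index_number}")
```

Run from the repository root, it prints one ready-to-use load command per complete index.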

Running the App

The front-end is built with TypeScript and React. To make changes, follow the steps in the ui/ directory here.

Finally, start up the Flask server:

python server.py

Resources

How do I scale HNSW to more images (10^6, 10^7, 10^8)? See the discussion in this GitHub issue.