# Early 19C Illustration Metadata: Final Report

My ACS project has successfully concluded with the creation of a large and novel dataset of illustration metadata. The dataset was produced in four stages using two specially retrained convolutional neural networks as well as one standard model (InceptionV3).

The key deliverables of this project are the following:

- A csv file identifying all illustrated pages from HathiTrust volumes from the early 19C
- A nearest-neighbors index (and utilities) for finding similar images

I will discuss the four stages briefly before turning to a discussion and some examples. All listed files are included in the project's [Zenodo repository](TODO) unless stated otherwise. 

## Classification

We began by identifying all Google-digitized volumes published during the years 1800-1850 (inclusive). These **500,013** volumes are contained in the file `google_ids_1800-1850.txt.gz`, which is a subset of the July 2019 [Hathifile](https://www.hathitrust.org/hathifiles). The Hathifile fields include basic publication information and are listed in `hathi_field_list.txt`.
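
As a minimal sketch of how these two files might be read together: it assumes the gzipped subset keeps the tab-delimited, header-less layout of the Hathifiles and that `hathi_field_list.txt` holds one field name per line. Neither assumption is confirmed in this report, and the function names are hypothetical.

```python
import csv
import gzip


def read_field_names(fields_path):
    """Read one field name per line (assumed layout of hathi_field_list.txt)."""
    with open(fields_path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def iter_hathifile_rows(data_path, field_names):
    """Yield one dict per volume from a gzipped, tab-delimited Hathifile subset."""
    with gzip.open(data_path, mode="rt", encoding="utf-8", newline="") as fh:
        # QUOTE_NONE: treat quotation marks inside titles as ordinary characters
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield dict(zip(field_names, row))
```

Streaming rows through a generator keeps memory use flat, which matters for a file describing roughly half a million volumes.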

From this comprehensive set of early nineteenth-century volumes, we identified all potentially illustrated pages using OCR-derived metadata. We then applied a retrained CNN model to filter out noisy candidate pages.

This model is built with TensorFlow and is located here: `model1`. Code for interacting with the model is here.

My midpoint report describes the early steps in greater detail and can be found [here](https://wiki.htrc.illinois.edu/display/COM/A+Half-Century+of+Illustrated+Pages%3A+ACS+Lab+Notes).

## Region of interest (ROI) extraction

There are **2,584,888** total ROIs.

## Dimensionality reduction
## Indexing and visualization

## Discussion

Talk about users and applications as well as challenges (so much time for indexing steps). Future work. Need for image infrastructure and distributed workers.

## Using the data

This section presents Python code for working with the dataset. Note that HathiTrust APIs are used only sparingly. In general, it is much more efficient to download metadata in bulk and parse those files instead of making API calls.

In [166]:
import pandas as pd
import numpy as np
import os, random, re, sys
from annoy import AnnoyIndex
from glob import glob

In [169]:
# try the full index... it is VERY fast!
# It's just that building the index is linear: https://markroxor.github.io/gensim/static/notebooks/annoytutorial.html
# N.B. `f` is the dimensionality of the stored vectors and must be defined before this cell runs
u2 = AnnoyIndex(f, 'angular')
u2.load('../_app-files/early-19C-illustrations_full-index.ann')

True
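
For intuition about what the index approximates: Annoy's `'angular'` metric is the Euclidean distance between normalized vectors, that is, sqrt(2(1 - cos(u, v))). The sketch below is a brute-force stand-in on toy vectors, not the project's code. The function names and the vectors are illustrative only, and zero-length vectors are not handled.

```python
import math


def angular_distance(u, v):
    """Distance used by Annoy's 'angular' metric: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # clamp to guard against tiny negative values from floating-point rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def brute_force_nns(vectors, item, k):
    """Return the ids of the k vectors closest to vectors[item], nearest first."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: angular_distance(vectors[item], vectors[i]))
    return ranked[:k]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(brute_force_nns(toy_vectors, 0, 3))  # prints [0, 1, 2]
```

A scan like this touches every vector on every query, which is why an approximate index is the practical choice for millions of ROIs.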

In [170]:
df_meta = pd.read_csv('../_app-files/early-19C-illustrations_metadata.csv')

In [173]:
df_meta.columns

Index(['htid', 'page_seq', 'page_label', 'crop_no', 'vector_path'], dtype='object')
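
The `htid` and `page_seq` columns are enough to open a page in the HathiTrust viewer, since both values appear in the viewer's URL. A minimal sketch, assuming the page-turner URL pattern `https://babel.hathitrust.org/cgi/pt?id=<htid>&seq=<page_seq>`; the helper name is hypothetical.

```python
from urllib.parse import urlencode


def hathitrust_page_url(htid, page_seq):
    """Build a HathiTrust page-turner URL for one page of a volume."""
    # urlencode percent-escapes characters such as '$' that occur in some htids
    query = urlencode({"id": htid, "seq": page_seq})
    return f"https://babel.hathitrust.org/cgi/pt?{query}"


print(hathitrust_page_url("uiug.30112003448526", 28))
# prints https://babel.hathitrust.org/cgi/pt?id=uiug.30112003448526&seq=28
```

This builds a link only; it makes no API call, in keeping with the bulk-metadata approach described above.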

In [171]:
def htid_page_seq_nns_metadata(htid, seq_num, nns_index, df_meta, k):
    """Given a target htid:page_seq and metadata dataframe, gives metadata for k neighboring images"""
    
    # get the row index for the page in question
    idx = df_meta[(df_meta['htid'] == htid) & (df_meta['page_seq'] == seq_num)].index
    
    # fail early with a clear message if the page is not in the table
    if len(idx) == 0:
        raise ValueError(f"No ROI found for {htid}:{seq_num}")
    
    # multiple crops alert: only the first crop is used as the query
    if len(idx) > 1:
        print("Multiple crops for this page_seq")
        
    # the nearest neighbor ROI indices
    nns = nns_index.get_nns_by_item(idx[0], k)
    
    print(nns)
    
    # return rows from metadata table matching these indices
    return df_meta.iloc[nns]

In [172]:
# explain how this stuff can be found through HT viewer! it's in the URL
# end writeup with rousing call for this metadata to be derived for ALL hathi years + every time a new
# set of scans gets uploaded
univ_history = htid_page_seq_nns_metadata('uiug.30112003448526', 28, u2, df_meta, 10)
print(univ_history)

[1617907, 254046, 1613603, 1269805, 1844016, 1341309, 1724782, 1407896, 2438718, 1454741]
                        htid  page_seq    page_label  crop_no  \
1617907  uiug.30112003448526        28  inline_image        0   
254046        ucm.5321309033       347  inline_image        0   
1613603  uiug.30112048888058       326  inline_image        0   
1269805        chi.097881099        75  inline_image        0   
1844016       uc1.c046857802       166  inline_image        1   
1341309         chi.79355181       551  inline_image        0   
1724782         uc1.$b557159       190   plate_image        0   
1407896   njp.32101063578338        46   plate_image        0   
2438718           hvd.hn5cxz        19  inline_image        0   
1454741   njp.32101080155110        80   plate_image        0   

                                            vector_path  
1617907  uiug/31042/uiug.30112003448526_00000028_00.npy  
254046          ucm/5193/ucm.5321309033_00000347_00.npy  
1613603  uiug/31485/

In [133]:
# DONE: ask Boris if htids are encoded with the standard utility methods
# In the meta_csv, htids are NOT encoded; in vectors.txt and anywhere they are in file format, they ARE encoded


# find way to look up TITLE of work
# do montage of ENTIRE publisher set (really want t-SNE, but maybe Damon has something like that?)
# don't even need vectors, per se, for t-SNE on tractable subsets
# compare Hawthorne balloon (my McNeil paper) with Parley's moon and stars (sadly, added too late to make project)
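
The `vector_path` values shown in the output above appear to follow the pattern `<htid>_<8-digit page_seq>_<2-digit crop_no>.npy`. The sketch below parses a path on that assumption, which is inferred from the two complete rows only. The function and pattern names are hypothetical. Per the note above, the htid in a file name is in its encoded form, so it may differ from the `htid` column for ids containing special characters.

```python
import re
from pathlib import PurePosixPath

# assumed file name layout: <encoded htid>_<page_seq>_<crop_no>.npy
VECTOR_NAME = re.compile(r"^(?P<htid>.+)_(?P<seq>\d{8})_(?P<crop>\d{2})\.npy$")


def parse_vector_path(vector_path):
    """Split a vector_path into (encoded htid, page_seq, crop_no)."""
    name = PurePosixPath(vector_path).name
    match = VECTOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected vector_path layout: {vector_path}")
    return match["htid"], int(match["seq"]), int(match["crop"])


print(parse_vector_path("uiug/31042/uiug.30112003448526_00000028_00.npy"))
# prints ('uiug.30112003448526', 28, 0)
```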

## Acknowledgements

Thank you Ryan and Boris, especially.

## Appendix: Project assets

The deliverables are available via Zenodo:

- A csv file with basic metadata
- A nearest-neighbors index

The code used for the data processing is on [GitHub](https://github.com/htrc/ACS-krewson).