# Early 19C Illustration Metadata: Final Report

My ACS project has concluded successfully with the creation of a large, novel dataset of illustration metadata. The dataset was produced in four stages using two specially retrained convolutional neural networks as well as one standard model (InceptionV3).

The key deliverables of this project are the following:

- A csv file identifying all illustrated pages from HathiTrust volumes from the early 19C
- A nearest-neighbors index (and utilities) for finding images similar to a given image

I will discuss the four stages briefly before turning to a discussion and some examples. All listed files are included in the project's [Zenodo repository](TODO) unless stated otherwise. 

## Classification

We began by identifying all Google-digitized volumes published during the years 1800-1850 (inclusive). These **500,013** volumes are contained in the file `google_ids_1800-1850.txt.gz`, which is a subset of the July 2019 [Hathifile](https://www.hathitrust.org/hathifiles). The Hathifile fields include basic publication information and are listed in `hathi_field_list.txt`.
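As a rough illustration of how such a subset might be drawn from a Hathifile, the sketch below streams a gzipped, tab-separated file and keeps rows whose `rights_date_used` falls in 1800-1850 (inclusive). It is not the project's actual selection code. The `digitization_agent_code` column name, the `google` value, the header row, and the sample rows are all assumptions made for this example. Real Hathifiles ship without a header, so the field names would come from `hathi_field_list.txt` instead.

```python
import csv
import gzip
import io

def select_volumes(lines, first_year=1800, last_year=1850, agent="google"):
    """Yield htids for rows in the inclusive year range digitized by `agent`."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        year = row.get("rights_date_used") or ""
        if not year.isdigit():
            continue  # skip rows with a missing or malformed date
        in_range = first_year <= int(year) <= last_year
        # column name and value are assumptions, not confirmed by the report
        if in_range and row.get("digitization_agent_code") == agent:
            yield row["htid"]

# Tiny in-memory stand-in for a gzipped Hathifile (hypothetical rows).
sample = (
    "htid\trights_date_used\tdigitization_agent_code\n"
    "mdp.001\t1800\tgoogle\n"
    "mdp.002\t1851\tgoogle\n"
    "mdp.003\t1825\tia\n"
    "mdp.004\t1850\tgoogle\n"
)
buf = io.BytesIO(gzip.compress(sample.encode("utf-8")))
with gzip.open(buf, "rt", encoding="utf-8", newline="") as fp:
    selected = list(select_volumes(fp))

print(selected)
```

Streaming row by row keeps memory use flat, which matters when the input is a full Hathifile rather than a pre-filtered subset.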

From this comprehensive set of early-nineteenth-century volumes, we found all potentially illustrated pages using OCR-derived metadata. We then applied a retrained CNN model to filter out noisy candidate pages.

This model is built with Tensorflow and is located here: `model1`. Code for interacting with the model is here.

My midpoint report describes the early steps in greater detail and can be found [here](https://wiki.htrc.illinois.edu/display/COM/A+Half-Century+of+Illustrated+Pages%3A+ACS+Lab+Notes).

## Region of interest (ROI) extraction

There are **2,584,888** total ROIs.

## Dimensionality reduction
## Indexing and visualization

## Discussion

Talk about users and applications as well as challenges (so much time for indexing steps). Future work. Need for image infrastructure and distributed workers.

## Using the data

This section presents Python code for working with the dataset. Note that HathiTrust APIs are used only sparingly. In general, it is much more efficient to download metadata in bulk and parse those files instead of making API calls.

In [70]:
import pandas as pd
import numpy as np
import os, random, re, sys
from annoy import AnnoyIndex
from glob import glob
from pathlib import Path

In [19]:
# the volumes used in the ACS project
HATHIFILE = "google_ids_1800-1850.txt.gz"

# corrected field names file. See also:
# https://www.hathitrust.org/hathifiles_description
HATHICOLS = "hathifiles/hathi_field_list.txt"

In [3]:
def search_hathifile(ht_file, col_file):
    """
    Return the rows of the Hathifile whose imprint matches the hard-coded query
    """

    # Use iterative method to scale to full hathifiles
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')
        num_cols = len(col_names)

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields
        dtype={
            'htid': 'str',
            'rights_date_used': 'object',
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
        iterator=True,
        chunksize=5000,
        # on_bad_lines replaces error_bad_lines (deprecated in pandas 1.3, removed in 2.0)
        on_bad_lines='skip')

    # collect matching rows per chunk and concatenate once at the end
    matches = []
    for chunk in iter_csv:

        # hard code query: use a basic regex with a non-capturing group
        # find: "Munroe, Francis", "Munroe and Francis", "Munroe & Francis"
        conditions = chunk['imprint'].str.contains(
            r"\bMunroe(?:,| and| &) Francis\b",
            na=False,
            flags=re.IGNORECASE)
        matches.append(chunk[conditions])

    # nothing matched (or the file was empty): return an empty frame
    if not matches:
        return pd.DataFrame(columns=col_names)
    # concatenate valid rows, idx doesn't matter
    return pd.concat(matches, ignore_index=True)

In [4]:
df = search_hathifile(HATHIFILE, HATHICOLS)

In [5]:
df.shape

(360, 26)

In [14]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py

def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path

In [42]:
stubby_ids = [id_to_stubbytree(htid) for htid in df.htid.values]

In [59]:
len(stubby_ids)

360

In [43]:
stubby_ids[:5]

['mdp\\31331\\mdp.39015038731918',
 'mdp\\31197\\mdp.39015010791476',
 'mdp\\31198\\mdp.39015010791484',
 'mdp\\31199\\mdp.39015010791492',
 'mdp\\31190\\mdp.39015010791500']
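The directory names in the output above come from taking every third character of the encoded volume id. The sketch below reproduces that step on two ids from the list. It uses `posixpath` so that the separator is always `/`, unlike the Windows run shown here. The helper names `encode_volid` and `stubby_dir` are introduced only for this example.

```python
import posixpath

def encode_volid(volid):
    # same character substitutions as _id_encode in the notebook
    return volid.replace(":", "+").replace("/", "=").replace(".", ",")

def stubby_dir(htid):
    """Return the 'library/stub' directory for a HathiTrust id."""
    libid, volid = htid.split(".", 1)
    # every third character of the encoded volume id names the subdirectory
    return posixpath.join(libid, encode_volid(volid)[::3])

first = stubby_dir("mdp.39015038731918")
second = stubby_dir("mdp.39015010791476")
print(first)   # mdp/31331
print(second)  # mdp/31197
```

Because `os.path.join` follows the host platform, paths written out on Windows (as in `munroe.csv`) may need their separators normalized before being compared with the forward-slash `vector_path` values in the metadata CSV.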

In [44]:
rois_dir = Path(os.path.abspath('../_app-files/roi-vectors/vectors'))

In [54]:
# for each volume, find associated .npy vectors within stubbytree directory
munroe_francis = {}

for stubby_id in stubby_ids:
    vol_vectors = glob(os.path.join(rois_dir, stubby_id + "*.npy"))
    if len(vol_vectors) != 0:
        munroe_francis[stubby_id] = vol_vectors

In [57]:
total = 0
for vec_list in munroe_francis.values():
    total += len(vec_list)

In [60]:
# 1477 vectors from 118 illustrated volumes (out of 360 total)
# dimensionality of the vectors is (1,1000) per np.shape
total

1477

In [65]:
with open('munroe.csv', 'w') as fp:
    for stubby_id in munroe_francis.keys():
        fp.write(os.path.normpath(stubby_id))
        fp.write('\n')

In [76]:
# Modified from: https://github.com/spotify/annoy

f = 1000
t = AnnoyIndex(f, 'angular')
i = 0

# Find all vectors per volume and index them from 0
for k,v in munroe_francis.items():
    for vec in v:
        # each file holds a (1, 1000) array; Annoy expects a flat vector of length f
        item = np.load(vec).ravel()
        t.add_item(i, item)
        i += 1

# Try with 1000 trees
t.build(1000)
t.save('test.ann')

True

In [77]:
u = AnnoyIndex(f, 'angular')
u.load('test.ann')
print(u.get_nns_by_item(0, 10))

[0, 614, 979, 905, 968, 177, 900, 955, 893, 677]
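Annoy returns approximate neighbors, so on a small subset like this one it can be useful to compare against an exact search. The sketch below is a brute-force check in NumPy, not part of the project code. It uses the angular distance that Annoy documents for its `'angular'` metric, sqrt(2 * (1 - cos(u, v))). The function name and the toy vectors are made up for the example.

```python
import numpy as np

def brute_force_nns(vectors, query_idx, k):
    """Exact k nearest neighbours of one row under angular distance."""
    # normalise rows so that the dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = unit @ unit[query_idx]
    # clip guards against tiny negative values from floating-point error
    dist = np.sqrt(np.clip(2.0 * (1.0 - cos), 0.0, None))
    order = np.argsort(dist, kind="stable")[:k]
    return [(int(i), float(dist[i])) for i in order]

# toy 2-D vectors standing in for the 1000-dimensional ROI vectors
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbours = brute_force_nns(vectors, 0, 3)
print([i for i, _ in neighbours])
```

As with `get_nns_by_item`, the query item is its own nearest neighbor at distance zero. The exact search is quadratic over the whole collection, so it only makes sense for spot checks on a few thousand vectors.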


In [79]:
# great news! building that tree took about 20s on my laptop
# really need to have data frame with index for each vector; keep name similar to .ann file
# say t-SNE and all that can be future directions (clustering)
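Annoy stores only integer item ids, so the mapping from id to vector file has to be kept separately. One possible approach, sketched below, is a gzipped sidecar list named after the `.ann` file, with one path per line in insertion order. The `_list.txt.gz` suffix mirrors the file name that appears next to the full index in the directory listing. The one-path-per-line layout, the helper names, and the sample paths are assumptions.

```python
import gzip
import os
import tempfile

def sidecar_path(ann_path):
    """Name the item list after the index file it belongs to."""
    root, _ext = os.path.splitext(ann_path)
    return root + "_list.txt.gz"

def save_item_list(ann_path, vector_paths):
    # line number (from 0) corresponds to the Annoy item id
    with gzip.open(sidecar_path(ann_path), "wt", encoding="utf-8") as fp:
        for path in vector_paths:
            fp.write(path + "\n")

def load_item_list(ann_path):
    with gzip.open(sidecar_path(ann_path), "rt", encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in fp]

# hypothetical vector paths, in the order they would be added to the index
paths = [
    "mdp/31331/mdp.39015038731918_00000012_00.npy",
    "mdp/31197/mdp.39015010791476_00000007_00.npy",
]
with tempfile.TemporaryDirectory() as tmp:
    ann = os.path.join(tmp, "test.ann")
    save_item_list(ann, paths)
    restored = load_item_list(ann)

print(restored[1])
```

With such a list loaded, `restored[i]` gives the file behind item id `i` returned by `get_nns_by_item`.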

In [80]:
!dir ..\_app-files

 Volume in drive C is Windows
 Volume Serial Number is C2C5-01EE

 Directory of C:\Users\stephen-krewson\Documents\_app-files

07/28/2020  03:33 PM    <DIR>          .
07/28/2020  03:33 PM    <DIR>          ..
08/21/2019  04:13 PM    <DIR>          19c-book-illustrations
08/21/2019  04:24 PM    12,357,353,554 19c-book-illustrations-2.zip
08/03/2019  02:54 PM    12,359,693,970 19c-book-illustrations.zip
11/05/2018  01:50 PM       234,493,010 combined_data.json
10/02/2018  02:53 PM        17,152,239 default_graph.pb
06/29/2019  10:14 AM            27,142 Dissertation.zip
07/28/2020  03:30 PM    18,015,031,444 early-19C-illustrations_full-index.ann
07/28/2020  12:54 PM        11,207,602 early-19C-illustrations_full-index_list.txt.gz
06/29/2019  10:17 AM            13,054 evangelical-black-atlantic-report.zip
11/25/2018  03:15 PM       298,068,462 FULL_mhl_1770-1879.json
11/10/2018  12:50 PM       265,513,072 FULL_mhl_1800-1879.json
07/03/2019  04:13 PM    <DIR>          HUMS 304b [2015]
0

In [81]:
# try the full index... it is VERY fast!
# It's just that building the index is linear: https://markroxor.github.io/gensim/static/notebooks/annoytutorial.html
u2 = AnnoyIndex(f, 'angular')
u2.load('../_app-files/early-19C-illustrations_full-index.ann')

Wall time: 17.1 ms


True

In [159]:
# forward slashes avoid invalid escape sequences such as "\_" and "\e"
df_meta = pd.read_csv('../_app-files/early-19C-illustrations_metadata.csv')

In [165]:
def htid_page_seq_nns_metadata(htid, seq_num, nns_index, df_meta, k):
    """Given a target htid:page_seq and metadata dataframe, return metadata for the k nearest neighboring images"""

    # get the row index for the page in question
    idx = df_meta[(df_meta['htid'] == htid) & (df_meta['page_seq'] == seq_num)].index

    # fail clearly if no ROI was extracted from this page
    if len(idx) == 0:
        raise KeyError(f"No ROI found for {htid} at page_seq {seq_num}")

    # multiple crops alert: only the first crop is used as the query
    if len(idx) > 1:
        print("Multiple crops for this page_seq")

    # the nearest neighbor ROI indices
    nns = nns_index.get_nns_by_item(int(idx[0]), k)

    print(nns)

    # return rows from the metadata table matching these indices
    return df_meta.iloc[nns]

In [164]:
# explain how this stuff can be found through HT viewer! it's in the URL
# end writeup with rousing call for this metadata to be derived for ALL hathi years + every time a new
# set of scans gets uploaded
univ_history = htid_page_seq_nns_metadata('uiug.30112003448526', 28, u2, df_meta, 10)
print(univ_history)

[1617907, 254046, 1613603, 1269805, 1844016, 1341309, 1724782, 1407896, 2438718, 1454741]
                        htid  page_seq    page_label  crop_no  \
1617907  uiug.30112003448526        28  inline_image        0   
254046        ucm.5321309033       347  inline_image        0   
1613603  uiug.30112048888058       326  inline_image        0   
1269805        chi.097881099        75  inline_image        0   
1844016       uc1.c046857802       166  inline_image        1   
1341309         chi.79355181       551  inline_image        0   
1724782         uc1.$b557159       190   plate_image        0   
1407896   njp.32101063578338        46   plate_image        0   
2438718           hvd.hn5cxz        19  inline_image        0   
1454741   njp.32101080155110        80   plate_image        0   

                                            vector_path  
1617907  uiug/31042/uiug.30112003448526_00000028_00.npy  
254046          ucm/5193/ucm.5321309033_00000347_00.npy  
1613603  uiug/31485/
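The `htid` and `page_seq` values used in the query above correspond to the `id` and `seq` parameters in a HathiTrust page-viewer URL. The sketch below converts in both directions. The `babel.hathitrust.org/cgi/pt` endpoint is written from memory of the public viewer and should be treated as an assumption. The function names are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# assumed page-viewer endpoint; check against a live HathiTrust URL
VIEWER = "https://babel.hathitrust.org/cgi/pt"

def viewer_url(htid, page_seq):
    """Build a viewer link for one page of one volume."""
    return VIEWER + "?" + urlencode({"id": htid, "seq": page_seq})

def ids_from_viewer_url(url):
    """Recover (htid, page_seq) from a viewer link."""
    query = parse_qs(urlparse(url).query)
    return query["id"][0], int(query["seq"][0])

url = viewer_url("uiug.30112003448526", 28)
print(url)
print(ids_from_viewer_url(url))
```

`urlencode` percent-escapes characters such as `$`, which appear in some htids (for example `uc1.$b557159` in the output above), and `parse_qs` reverses that escaping.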

In [133]:
# DONE: ask Boris if htids are encoded with the standard utility methods
# In the meta_csv, htids are NOT encoded; in vectors.txt and anywhere they are in file format, they ARE encoded


# find way to look up TITLE of work
# do montage of ENTIRE publisher set (really want t-SNE, but maybe Damon has something like that?)
# don't even need vectors, per se, for t-SNE on tractable subsets
# compare Hawthorne balloon (my McNeil paper) with Parley's moon and stars (sadly, added too late to make project)
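Since htids are plain in the metadata CSV but encoded wherever they appear in file names, going from a vector file back to its metadata row requires decoding. The sketch below parses a `vector_path` into `(htid, page_seq, crop_no)`. The `<id>_<8-digit seq>_<2-digit crop>.npy` layout is inferred from the sample output in this notebook and is an assumption, as are the helper names.

```python
import posixpath

def decode_volid(volid):
    # inverse of the substitutions applied by _id_encode
    return volid.replace("+", ":").replace("=", "/").replace(",", ".")

def parse_vector_path(vector_path):
    """Split a vector file path into (htid, page_seq, crop_no)."""
    # accept Windows-style separators as well
    name = posixpath.basename(vector_path.replace("\\", "/"))
    if name.endswith(".npy"):
        name = name[: -len(".npy")]
    # split from the right so underscores inside an id are preserved
    clean_id, seq, crop = name.rsplit("_", 2)
    libid, volid = clean_id.split(".", 1)
    return libid + "." + decode_volid(volid), int(seq), int(crop)

parsed = parse_vector_path("uiug/31042/uiug.30112003448526_00000028_00.npy")
print(parsed)
```

The resulting tuple can be matched against the `htid`, `page_seq`, and `crop_no` columns of the metadata CSV.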

## Acknowledgements

Thank you Ryan and Boris, especially.

## Appendix: Project assets

The deliverables are available via Zenodo:

- A csv file with basic metadata
- A nearest-neighbors index

The code used for the data processing is on [GitHub](https://github.com/htrc/ACS-krewson).