# Example 3: Visualize all illustrations for a given publisher

In this case, the publisher is the firm Munroe & Francis, active in Boston from c. 1800 to 1860.

## Step 1: Search Hathifile for publisher

Hathifiles can be very large, so we read them in chunks and search each chunk for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and a firm's name can be written in slightly different ways (e.g. '&' vs. 'and').

In [40]:
import pandas as pd
import numpy as np
import os, random, re, sys
from glob import glob
from annoy import AnnoyIndex

In [3]:
# the volumes used in the ACS project
HATHIFILE = "google_ids_1800-1850.txt.gz"
HATHICOLS = "hathifiles/hathi_field_list.txt"

In [9]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    """
    Given a hathifile and field names, return dataframe of rows
    where search_col contains search_expr (a regex)
    """
    # Use iterative method to scale to full hathifiles
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')
        num_cols = len(col_names)

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields
        dtype={
            'htid': 'str',
            'rights_date_used': 'object',
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
        iterator=True,
        chunksize=5000,
        # skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False)
        on_bad_lines='skip')

    # collect the matching rows of each chunk and concatenate once at the end
    matches = []
    for chunk in iter_csv:
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)
        matches.append(chunk[condition])

    if not matches:
        return pd.DataFrame(columns=col_names)
    # hathifile idx has no relation to Neighbor tree: ignore
    return pd.concat(matches, ignore_index=True)

In [12]:
# find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis" (using a non-capturing group)
search_col = 'imprint'
search_expr = r"\bMunroe(?:,| and| &) Francis\b"

df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(360, 26)
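
Before scanning a whole Hathifile, it can help to check which spellings the expression actually accepts. The sketch below runs the same regex against a handful of hand-written imprint strings; the sample strings are hypothetical and are not taken from the Hathifile.

```python
import re

# Same expression as in the search above
search_expr = r"\bMunroe(?:,| and| &) Francis\b"
pattern = re.compile(search_expr, flags=re.IGNORECASE)

# Hypothetical imprint strings, written by hand for illustration
samples = [
    "Munroe and Francis [etc.]",
    "Munroe & Francis",
    "MUNROE, FRANCIS and Parker",
    "J. Munroe and Company",
    "Munroe  and  Francis",  # double spaces
]

# Split the samples into those the expression finds and those it misses
matched = [s for s in samples if pattern.search(s)]
missed = [s for s in samples if not pattern.search(s)]
print(matched)
print(missed)
```

The last sample is missed because the expression expects exactly one space on either side of the connective; if the imprint field turns out to contain irregular whitespace, replacing the literal spaces with `\s+` is one option.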

In [17]:
# show a few results -- just the search field and the date published
df[[search_col, 'rights_date_used']].head(10)

| | imprint | rights_date_used |
|---|---|---|
| 0 | Munroe and Francis [etc.] | 1804.0 |
| 1 | Munroe and Francis [etc.] | 1807.0 |
| 2 | Munroe and Francis [etc.] | 1808.0 |
| 3 | Munroe and Francis [etc.] | 1809.0 |
| 4 | Munroe and Francis [etc.] | 1809.0 |
| 5 | Munroe and Francis [etc.] | 1805.0 |
| 6 | Munroe and Francis [etc.] | 1806.0 |
| 7 | Munroe and Francis [etc.] | 1810.0 |
| 8 | Munroe and Francis [etc.] | 1810.0 |
| 9 | Munroe and Francis [etc.] | 1811.0 |


## Step 2: Find search result matches in illustration metadata

We have a bunch of `htid`s from the Hathifile, but many of them will not contain any illustrations. To narrow down our set of results, we need to look up the `htid`s in our illustration metadata. This can be done with the main CSV file or with the vectors.tar file. Either way, the goal is to get a list of all image or vector files corresponding to specific regions of interest (illustrations) for the volumes returned in our search.
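
For the CSV route, the lookup amounts to a membership test on `htid`. Here is a minimal sketch, assuming the metadata CSV has one row per illustration and an `htid` column. The column names (`htid`, `page_seq`) and the inline sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io

# Hypothetical stand-in for the illustration metadata CSV
metadata_csv = io.StringIO(
    "htid,page_seq\n"
    "mdp.0001,12\n"
    "mdp.0001,40\n"
    "uc1.0002,7\n"
)

# In the notebook this would be set(df.htid.values)
search_htids = {"mdp.0001", "nyp.0003"}

# Group illustration rows by volume, keeping only volumes from the search
rois_by_volume = {}
for row in csv.DictReader(metadata_csv):
    if row["htid"] in search_htids:
        rois_by_volume.setdefault(row["htid"], []).append(row["page_seq"])

print(rois_by_volume)
```

Volumes from the search that have no illustrations (here `nyp.0003`) simply never appear in the result.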

If you want to work with the vectors in `vectors.tar`, you will need to convert each `htid` to a path in HTRC's stubbytree format.

In [32]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path

In [51]:
stubby_ids = [id_to_stubbytree(htid) for htid in df.htid.values]
len(stubby_ids)

360
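
To see what these paths look like, here is a compact, self-contained sketch of the same conversion applied to two identifiers. Both `htid` values are made up, and `stubby_path` is an illustrative stand-in for `id_to_stubbytree` without the optional suffixes.

```python
import os

def stubby_path(htid):
    """Illustrative re-statement of id_to_stubbytree, without suffixes."""
    libid, volid = htid.split(".", 1)
    # sanitize characters that are awkward in filenames
    volid_clean = volid.replace(":", "+").replace("/", "=").replace(".", ",")
    # the directory name is every third character of the cleaned volume id
    return os.path.join(libid, volid_clean[::3], f"{libid}.{volid_clean}")

# Hypothetical identifiers: one plain barcode, one ARK-style id
plain = stubby_path("mdp.39015012345678")
ark = stubby_path("uc2.ark:/13960/t4mk66f1d")
print(plain)
print(ark)
```

The ARK-style identifier shows why the encoding step matters: the `:` and `/` characters would otherwise be read as part of the filesystem path.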

In [62]:
# N.B. this assumes the vectors.tar file has been extracted to this directory
# You also need to be careful about filepath conventions; best to use os.path
VEC_DIR = os.path.abspath("../_app-files/roi-vectors/vectors")

In [64]:
# for each volume, find associated .npy vectors within stubbytree directory -- store in dictionary
munroe_francis = {}
for stubby_id in stubby_ids:
    vol_path = os.path.join(VEC_DIR, stubby_id + "*.npy")
    vol_vectors = glob(vol_path)
    if len(vol_vectors) != 0:
        munroe_francis[stubby_id] = vol_vectors

In [59]:
# optional: we can write out the matches to a text file (I did this to request the JPEG assets from HTRC)
#with open('munroe.csv', 'w') as fp:
#    for stubby_id in munroe_francis.keys():
#        fp.write(os.path.normpath(stubby_id))
#        fp.write('\n')

### Step 2a (optional): Create Annoy index using project vectors

You can experiment with building a smaller Annoy index with just these results.

In [66]:
# Modified from: https://github.com/spotify/annoy
f = 1000
t = AnnoyIndex(f, 'angular')
i = 0

# Find all vectors per volume and index them from 0
for k,v in munroe_francis.items():
    for vec in v:
        item = np.load(vec)
        # Annoy expects a flat vector of length f, so flatten the stored (1, 1000) array
        t.add_item(i, item.ravel())
        i += 1

# Try with 1000 trees
t.build(1000)
t.save('munroe-francis.ann')

True

In [68]:
u = AnnoyIndex(f, 'angular')
u.load('munroe-francis.ann')
print(u.get_nns_by_item(0, 10))

[0, 614, 979, 905, 968, 177, 900, 955, 893, 677]
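
The integers returned by `get_nns_by_item` are the positions at which vectors were added to the index, so they are only useful if that order is recorded somewhere. One possible approach is to keep a list of paths in insertion order. In the sketch below the dictionary contents and neighbor ids are made up, and `index_to_path` is a hypothetical name.

```python
# Hypothetical stand-in for the munroe_francis dictionary built above
volume_vectors = {
    "mdp/31147/mdp.39015012345678": ["a_00000012_00.npy", "a_00000040_00.npy"],
    "uc1/b6/uc1.b4567": ["b_00000007_00.npy"],
}

# Walk the dictionary in the same order used when calling add_item,
# so that list position i corresponds to Annoy item i
index_to_path = [vec for vectors in volume_vectors.values() for vec in vectors]

# Made-up neighbor ids, in the shape returned by get_nns_by_item
neighbor_ids = [0, 2]
neighbor_paths = [index_to_path[i] for i in neighbor_ids]
print(neighbor_paths)
```

This relies on dictionaries preserving insertion order (Python 3.7+); saving `index_to_path` alongside the `.ann` file keeps the mapping available after the index is reloaded.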


## Step 3: Visualize images from search results

Now that we have a reasonably sized subset matching our search query, we can visualize the ROIs belonging to those volumes.

WARNING: we are still figuring out how to access the raw image data. I plan to write a fallback method that uses the Hathi Data API.

In [69]:
ROI_DIR = os.path.join("..", "_app-files", "munroe-francis", "samplecrops")

In [70]:
# we know that each subdirectory is an htid, with JPEG crops inside it
roi_jpegs = glob(os.path.join(ROI_DIR, '*', '*.jpg'))
len(roi_jpegs)

1477
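
Since each subdirectory corresponds to one volume, the parent directory name can be used to tally crops per volume. This is a small sketch of that idea, not part of the original workflow; the paths and filenames below are hypothetical, and the real directory names may be sanitized versions of the `htid`.

```python
import os
from collections import Counter

# Hypothetical crop paths in the layout described above: <ROI_DIR>/<htid>/<crop>.jpg
roi_paths = [
    os.path.join("samplecrops", "mdp.0001", "00000012_00.jpg"),
    os.path.join("samplecrops", "mdp.0001", "00000040_00.jpg"),
    os.path.join("samplecrops", "uc1.0002", "00000007_00.jpg"),
]

# The parent directory name identifies the volume
crops_per_volume = Counter(os.path.basename(os.path.dirname(p)) for p in roi_paths)
print(crops_per_volume.most_common())
```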

In [72]:
# Visualize the crops with PixPlot: https://github.com/YaleDHLab/pix-plot
# TODO: record the `htrc` environment (pip freeze, and conda list as well)
# TODO: remember to include metadata!