# Visualize all illustrations for a given query

The project data can be unwieldy to work with. In many cases, you want to isolate a subset of the 2.5+ million illustrated regions so that analysis can run at a smaller scale and more quickly. This notebook shows how to get started with such research. The goal is to find the metadata for all books published between 1800 and 1850 by a single firm. The running example is the Boston firm Munroe & Francis; the query cell as currently set runs the same search for Carter and Hendee.

1. Search a Hathifile excerpt with a metadata query to create the set of volume IDs to visualize.
2. Using the volume IDs from Step 1, find all illustrated pages in those volumes. This requires a file that is in the Zenodo repository or that mimics the Google Cloud bucket.
3. Create a PixPlot metadata file that joins the Hathifile information with the stubbytree IDs.
4. Get the image assets from cloud storage.
5. Run PixPlot processing.
6. Visualize with a PixPlot instance.

## Step 1: Filter Hathifile using metadata; extract matching rows

Hathifiles can be very big, so we iteratively search them for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and the name of a firm can be written in slightly different ways (e.g. '&' vs. 'and').
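
As a quick check on the name-variant problem, the sketch below tries one pattern against a handful of imprint strings. The strings are illustrative (some are modeled on imprints from the dataset, others are made up), and the pattern simply allows `,`, ` and` or ` &` between the two names.

```python
import re

# Non-capturing group: accept ",", " and" or " &" between the two names
pattern = re.compile(r"\bCarter(?:,| and| &) Hendee\b", flags=re.IGNORECASE)

# Illustrative imprint strings (not an exhaustive list of real variants)
imprints = [
    "Carter, Hendee, and Co. 1836.",
    "Carter and Hendee, 1829.",
    "Carter, Hendee & Babcock, l83l.",
    "CARTER & HENDEE",
    "Hilliard, Gray and co., 1832.",
    "Carter Hendee, 1832.",
]

matched = [s for s in imprints if pattern.search(s)]
unmatched = [s for s in imprints if not pattern.search(s)]

print("matched:", matched)
print("unmatched:", unmatched)
```

A variant with no separator at all ("Carter Hendee") is not matched, so it is worth inspecting a sample of near misses before settling on a pattern.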

In [2]:
import pandas as pd
import numpy as np
import os, random, re, sys
from glob import glob

In [25]:
# Files that we need to open/access to generate the PixPlot metadata file. It's simplest to use absolute paths.
# These files are available in the Zenodo repository: http://zenodo.org/record/3940528#.XyRNSZ5KjIU

# 1. Hathifile subset, for performing basic metadata queries
HATHIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/google_ids_1800-1850.txt.gz"
HATHICOLS = "/home/stephen-krewson/project-hathi-images/datafiles/hathi_field_list.txt"

# 2. Flat file of all 2.5m "regions of interest" (illustrations)
# TODO: should I use gzip on this?
ROIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/early-19C-illustrations_metadata.csv"

In [24]:
# Test that the files exist
!stat $HATHIFILE
!stat $HATHICOLS
!stat $ROIFILE

  File: /home/stephen-krewson/project-hathi-images/datafiles/google_ids_1800-1850.txt.gz
  Size: 30724746  	Blocks: 60016      IO Block: 4096   regular file
Device: 810h/2064d	Inode: 29532       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/stephen-krewson)   Gid: ( 1000/stephen-krewson)
Access: 2021-04-07 08:38:14.840310100 -0400
Modify: 2019-07-31 12:07:38.271850000 -0400
Change: 2021-04-07 08:38:01.500310100 -0400
 Birth: -
  File: /home/stephen-krewson/project-hathi-images/datafiles/hathi_field_list.txt
  Size: 307       	Blocks: 8          IO Block: 4096   regular file
Device: 810h/2064d	Inode: 56569       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/stephen-krewson)   Gid: ( 1000/stephen-krewson)
Access: 2021-04-07 08:02:40.270310100 -0400
Modify: 2021-04-07 08:01:30.360310100 -0400
Change: 2021-04-07 08:01:30.360310100 -0400
 Birth: -
  File: /home/stephen-krewson/project-hathi-images/datafiles/early-19C-illustrations_metadata.csv
  Size: 202726013 	Blocks: 395952     IO Bl

In [5]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    '''
    :param ht_file: A Hathifile in tab-separated format (may be gzipped)
    :param col_file: A file whose first line holds the tab-separated Hathifile column names
    :param search_col: The field/column on which to search
    :param search_expr: A regular expression against which search_col values can be compared
    :return: A pandas dataframe of rows in which search_col matches search_expr
    '''
    # Use iterative method to scale to full hathifiles
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')
        num_cols = len(col_names)

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields
        dtype={
            'htid': 'str',
            'rights_date_used': 'object', # values NOT guaranteed to be numeric
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
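        iterator=True,
        chunksize=5000,
        # skip malformed rows; on_bad_lines replaces error_bad_lines (deprecated in pandas 1.3, removed in 2.0)
        on_bad_lines='skip')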

    # collect the matching rows from each chunk, then concatenate once at the end
    matches = []
    for chunk in iter_csv:
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)
        matches.append(chunk[condition])

    if not matches:
        return pd.DataFrame(columns=col_names)

    # hathifile idx has no relation to Neighbor tree: ignore
    return pd.concat(matches, ignore_index=True)
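
Before running the search on a full Hathifile, it can help to try the chunked filter on a tiny in-memory sample. The following is a minimal sketch under stated assumptions: `filter_tsv_chunks` is a hypothetical, simplified helper (not part of the project code), and the sample rows and two-column layout are invented for illustration.

```python
import io
import re

import pandas as pd


def filter_tsv_chunks(handle, col_names, search_col, search_expr, chunksize=2):
    """Hypothetical helper: return rows whose search_col matches search_expr."""
    reader = pd.read_csv(
        handle, sep="\t", header=None, names=col_names,
        dtype=str, chunksize=chunksize,
    )
    matches = []
    for chunk in reader:
        # na=False treats missing imprints as non-matches
        mask = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)
        matches.append(chunk[mask])
    return pd.concat(matches, ignore_index=True)


# Invented sample: five rows, one with an empty imprint field
sample = (
    "hvd.aaa\tCarter and Hendee, 1829.\n"
    "nyp.bbb\tMunroe & Francis, 1824.\n"
    "uc1.ccc\t\n"
    "njp.ddd\tCarter, Hendee, 1832.\n"
    "mdp.eee\tcarter & hendee, 1830\n"
)

result = filter_tsv_chunks(
    io.StringIO(sample), ["htid", "imprint"], "imprint",
    r"\bCarter(?:,| and| &) Hendee\b",
)
print(result["htid"].tolist())
```

A small `chunksize` forces several chunks even on this sample, which exercises the same code path as the full file.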

In [6]:
search_col = 'imprint'

# find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis" (using a non-capturing group)
# search_expr = r"\bMunroe(?:,| and| &) Francis\b"
# search_label = "munroe-francis"

# similar string, but for carter and hendee
search_expr = r"\bCarter(?:,| and| &) Hendee\b"
# a label for the results of this experiment (in case you want to compare later)
search_label = "carter-hendee"

df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(250, 26)

In [7]:
df.columns # use title, rights_date_used, imprint

Index(['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source',
       'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title',
       'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag',
       'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code',
       'content_provider_code', 'responsible_entity_code',
       'digitization_agent_code', 'access_profile_code', 'author'],
      dtype='object')

In [8]:
# convert publication years to integers; non-numeric values become <NA> instead of raising an error
df['rights_date_used'] = pd.to_numeric(df['rights_date_used'], errors='coerce').astype('Int64')

In [15]:
# show a few results -- just the search field and the date published
df[[search_col, 'title', 'rights_date_used']].sample(10)

| | imprint | title | rights_date_used |
|---|---|---|---|
| 224 | Carter, Hendee, and Co. 1836. | The mercantile arithmetic, adapted to the comm... | 1833 |
| 150 | Carter, Hendee, 1832. | Sermons and charges / by James Freeman. | 1832 |
| 237 | Carter, Hendee and Co., 1832 | Rudiments of the Italian language, or, Easy le... | 1832 |
| 152 | Carter, Hendee, 1832. | Sermons and charges / by James Freeman. | 1832 |
| 239 | Carter, Hendee & Babcock, l83l. | The buckwheat cake, a poem .. | 1831 |
| 58 | Carter, Hendee & co., 1832. | The biographies of Lady Russell, and Madame Gu... | 1832 |
| 63 | Peirce and Parker, and Carter and Hendee], 183... | The Naturalist | 1831 |
| 97 | Carter and Hendee, 1829. | The constitution of man considered in relation... | 1829 |
| 78 | Carter, Hendee, 1833. | A manual containing information respecting the... | 1833 |
| 45 | Hilliard Gray and co., and Carter, Hendee and ... | Adam's Latin grammar, with some improvements, ... | 1832 |


## Step 2: Construct page image IDs for all htids in the query results

We have a set of `htid`s from the Hathifile, but many of those volumes will not contain any illustrations. To narrow down the results, we need to look up the `htid`s in our illustration metadata. This can be done with the main metadata CSV (`ROIFILE`, defined above) or with the vectors.tar file.

Since the dataset is stored using a stubby tree, we need utilities for working with these paths.

In [16]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path

In [18]:
# map each original htid to its stubbytree path
stubby_dict = {htid: id_to_stubbytree(htid) for htid in df.htid.values}

In [21]:
# Print the first htid and show its stubbytree transformation
for k,v in stubby_dict.items():
    print(k, "-->", v)
    break

hvd.hnztud --> hvd/ht/hvd.hnztud
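
To make the transformation concrete, here is a compact standalone sketch of the same idea: sanitize the volume part of the ID, then use every third character of it as the directory name. The names `encode_volid` and `stubby_path` are hypothetical stand-ins for the utilities above, and the second ID is a made-up ARK-style identifier chosen to show the character substitutions.

```python
def encode_volid(volid):
    # Same substitutions as the Pairtree-style encoding: ":" -> "+", "/" -> "=", "." -> ","
    return volid.replace(":", "+").replace("/", "=").replace(".", ",")


def stubby_path(htid):
    # Split off the library code, sanitize the rest, and take every third character
    libid, volid = htid.split(".", 1)
    clean = encode_volid(volid)
    return "/".join([libid, clean[::3], f"{libid}.{clean}"])


simple = stubby_path("hvd.hnztud")
ark = stubby_path("uc2.ark:/13960/t1xd0sc6x")  # made-up ID for illustration

print(simple)  # hvd/ht/hvd.hnztud
print(ark)     # uc2/a+30106/uc2.ark+=13960=t1xd0sc6x
```

The sketch joins with `/` so the result is the same on every platform, whereas the utility above uses `os.path.join`.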


In [28]:
# The project CSV is 200 MB... can I gzip it? And do an efficient join on htid with pandas?
# Fields are: htid, page_seq, page_label, crop_no, vector_path
# See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
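
One possible approach to the join, sketched on in-memory stand-ins rather than the real files: read the ROI file in chunks, keep only rows whose `htid` is in the query results, then merge on `htid`. The column names follow the field list above; the sample values (including `page_label` and `vector_path` contents) and the names `df_query` and `roi_csv` are assumptions for illustration. pandas can also read gzipped CSVs directly, so the same loop should work on a compressed copy.

```python
import io

import pandas as pd

# Stand-in for the Step 1 query results (only the columns needed here)
df_query = pd.DataFrame({
    "htid": ["nyp.33433012110858", "hvd.hnztud"],
    "title": ["Les contes de Pierre Parley", "Untitled example"],
    "rights_date_used": [1832, 1830],
})

# Stand-in for the ROI file; a real run would pass the CSV path instead
roi_csv = io.StringIO(
    "htid,page_seq,page_label,crop_no,vector_path\n"
    "nyp.33433012110858,88,illustration,0,a.npy\n"
    "nyp.33433012110858,79,illustration,0,b.npy\n"
    "mdp.39015000000000,12,illustration,1,c.npy\n"
)

# Filter chunk by chunk so the full file never has to sit in memory
wanted = set(df_query["htid"])
parts = []
for chunk in pd.read_csv(roi_csv, dtype={"htid": str}, chunksize=2):
    parts.append(chunk[chunk["htid"].isin(wanted)])
rois = pd.concat(parts, ignore_index=True)

# Inner join: volumes without any illustrated region drop out
joined = rois.merge(df_query, on="htid", how="inner")
print(joined[["htid", "page_seq", "crop_no", "title"]])
```

Filtering before merging keeps the merge small, since only a few hundred volumes match a typical publisher query.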

In [27]:
# Old method: use Unix globbing to find stubby .npy files and swap out the extension
# Problem: Vectors were a dead end, so we want to exclusively work with flat metadata files

# for each volume, find associated .npy vectors within stubbytree directory -- store in dictionary

#VEC_DIR = os.path.abspath("../_app-files/roi-vectors/vectors")

#query_vectors = {}

#for stubby_id in stubby_dict.keys():
#    vol_path = os.path.join(VEC_DIR, stubby_id + "*.npy")
#    vol_vectors = glob(vol_path)
#    if len(vol_vectors) != 0:
#        query_vectors[stubby_id] = vol_vectors

## Step 3: Reformat ROIs with metadata for PixPlot

We can now reformat the selected ROIs, keeping only the columns PixPlot needs and renaming them. Once the image data is available, this metadata can be attached to the images to build a PixPlot visualization.

See https://github.com/YaleDHLab/pix-plot for more details.

In [66]:
# columns we want to keep from hathifile: these will map to 'description' and 'year' in PixPlot's format
col_map = {
    'rights_date_used': 'year',
    'title': 'description'
}

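# NOTE: query_vectors is built by the old globbing cell above, which is now commented out
# reverse lookup, so that a stubbytree path can be traced back to its htid
stubby_to_htid = {v: k for k, v in stubby_dict.items()}

rows = []
for k, v in query_vectors.items():
    
    # recover the unencoded htid, whether the key is an htid or a stubbytree path
    htid = stubby_to_htid.get(k, k)
    
    # get metadata for this volume (shared by every image in it)
    metadata = df.loc[df['htid'] == htid, list(col_map)]
    
    # swap the .npy extension for .jpg and drop the rest of the path
    for npy_file in v:
        
        vec_base = os.path.basename(npy_file)
        img_base = os.path.splitext(vec_base)[0] + '.jpg'
        
        # row to be added to df_pixplot
        row = {}
        
        # take the first value, since the selection may hold more than one row
        for col in metadata.columns:
            row[col_map[col]] = metadata[col].values[0]

        # add img_base path and label
        row['filename'] = img_base
        row['label'] = search_label
        
        rows.append(row)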

In [68]:
# turn the list of dict rows into a dataframe -- 'filename' shows the convention for image paths used in the project
df_pixplot = pd.DataFrame(rows)
df_pixplot.sample(10)

| | year | description | filename | label |
|---|---|---|---|---|
| 739 | 1832 | Les contes de Pierre Parley sur l'Amérique-- | nyp.33433012110858_00000088_00.jpg | carter-hendee |
| 734 | 1832 | Les contes de Pierre Parley sur l'Amérique-- | nyp.33433012110858_00000079_00.jpg | carter-hendee |
| 2469 | 1830 | Youth's keepsake. | nyp.33433105233252_00000160_00.jpg | carter-hendee |
| 2128 | 1850 | The Farmer's almanack. | hvd.32044044506970_00000461_00.jpg | carter-hendee |
| 177 | 1830 | The young lady's book : a manual of elegant re... | uc1.$b264661_00000076_00.jpg | carter-hendee |
| 58 | 1835 | Elements of natural philosophy, with questions... | njp.32101013012156_00000254_00.jpg | carter-hendee |
| 1979 | 1834 | Scientific tracts. | umn.319510027996728_00000018_00.jpg | carter-hendee |
| 760 | 1831 | The architecture of birds. | nyp.33433011003872_00000089_00.jpg | carter-hendee |
| 834 | 1833 | Scenes of American wealth and industry in prod... | nyp.33433007272812_00000039_00.jpg | carter-hendee |
| 2462 | 1830 | Youth's keepsake. | nyp.33433105233252_00000078_00.jpg | carter-hendee |


In [69]:
# use the search label to make a metadata path
#metadata_csv = "{}_pixplot-metadata.csv".format(search_label)

# save as a CSV that PixPlot can accept
#df_pixplot.to_csv(metadata_csv, sep=',', header=True, index=False)
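
The save step can be tried safely on a small stand-in table written to a temporary directory. This is a sketch under assumptions: `df_meta` reuses two rows from the sample output above, and the two sanity checks (every row has a filename, and filenames are unique) reflect an assumption that each image should appear exactly once in the metadata.

```python
import os
import tempfile

import pandas as pd

# Stand-in for df_pixplot, using two rows from the sample output
df_meta = pd.DataFrame([
    {"year": 1832, "description": "Les contes de Pierre Parley sur l'Amérique--",
     "filename": "nyp.33433012110858_00000088_00.jpg", "label": "carter-hendee"},
    {"year": 1830, "description": "Youth's keepsake.",
     "filename": "nyp.33433105233252_00000160_00.jpg", "label": "carter-hendee"},
])

search_label = "carter-hendee"
out_dir = tempfile.mkdtemp()  # temporary location, so nothing in the project is overwritten
metadata_csv = os.path.join(out_dir, "{}_pixplot-metadata.csv".format(search_label))

# Sanity checks before writing (assumed requirements, see note above)
has_all_filenames = bool(df_meta["filename"].notna().all())
filenames_unique = bool(df_meta["filename"].is_unique)

# index=False keeps the pandas row index out of the file
df_meta.to_csv(metadata_csv, sep=",", header=True, index=False)

# Read the file back to confirm the header and row count
check = pd.read_csv(metadata_csv)
print(list(check.columns), len(check))
```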