# Visualize all illustrations for a given query

The project data can be unwieldy to work with. In many cases, you want to isolate a subset of the 2.5+ million illustrated regions so that analysis can be done at a smaller scale and more quickly. This notebook shows how to get started with such research. The goal is to find the metadata for all books published between 1800 and 1850 by a single Boston firm such as Munroe & Francis. (The run recorded below queries Carter & Hendee; the Munroe & Francis pattern is left commented out.)

1. Search a Hathifile excerpt with a metadata query to create a set of volume IDs to visualize.
2. Using the volume IDs from Step 1, find all illustrated pages in those volumes. This requires a file from the Zenodo repository, or one that mimics the layout of the Google Cloud bucket.
3. Create a PixPlot metadata file that joins the Hathifile information with the stubbytree IDs.
4. Get the image assets from cloud storage.
5. Run PixPlot processing.
6. Visualize with a PixPlot instance.
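
Steps 1 and 2 amount to a filter followed by a join on the volume ID. The sketch below illustrates that idea on two toy tables; every ID, title, and page number in it is made up, and only the `htid` and `page_seq` column names come from the project files.

```python
import pandas as pd

# Toy stand-ins for the filtered Hathifile rows and the ROI table (all values are made up).
volumes = pd.DataFrame({
    "htid": ["aaa.001", "bbb.002", "ccc.003"],
    "title": ["First book", "Second book", "Third book"],
})
regions = pd.DataFrame({
    "htid": ["aaa.001", "aaa.001", "ccc.003", "zzz.999"],
    "page_seq": [12, 40, 7, 3],
})

# Inner join: keep only the regions whose volume survived the metadata query.
# validate="one_to_many" raises an error if a volume ID is duplicated on the left side.
joined = pd.merge(volumes, regions, on="htid", how="inner", validate="one_to_many")
print(joined)
```

Volumes without any illustrated region (`bbb.002` here) and regions from volumes outside the query (`zzz.999`) both drop out of the result.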

## Step 1: Filter Hathifile using metadata; extract matching rows

Hathifiles can be very big, so we iteratively search them for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and the name of a firm can be written in slightly different ways (e.g. '&' vs. 'and').
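
As a rough illustration of both points, the sketch below reads a tiny in-memory table in chunks and keeps the rows whose imprint matches a pattern that tolerates ',', ' and', or ' &' between the two names. The four rows and the two-column layout are invented for the example; a real Hathifile has many more fields.

```python
import io
import re
import pandas as pd

# Hypothetical four-row stand-in for a tab-separated Hathifile (made-up values).
sample = io.StringIO(
    "x.1\tMunroe & Francis, 1820.\n"
    "x.2\tMunroe and Francis, 1831.\n"
    "x.3\tMunroe, Francis and Parker, 1810.\n"
    "x.4\tFrancis Munroe, 1840.\n"
)

# Non-capturing group: allow ",", " and", or " &" between the two surnames.
pattern = r"\bMunroe(?:,| and| &) Francis\b"

matches = []
for chunk in pd.read_csv(sample, sep="\t", header=None,
                         names=["htid", "imprint"], dtype=str, chunksize=2):
    # Boolean mask of rows in this chunk whose imprint matches the pattern.
    hit = chunk["imprint"].str.contains(pattern, na=False, flags=re.IGNORECASE)
    matches.append(chunk[hit])

result = pd.concat(matches, ignore_index=True)
print(result["htid"].tolist())
```

Only one chunk is held in memory at a time, which is what lets the same approach scale to a full Hathifile.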

In [2]:
import pandas as pd
import numpy as np
import os, random, re, sys
from glob import glob

In [31]:
# Files that we need to open/access to generate the PixPlot metadata file. It's simplest to use absolute paths.
# These files are available in the Zenodo repository: http://zenodo.org/record/3940528#.XyRNSZ5KjIU

# 1. Hathifile subset, for performing basic metadata queries
HATHIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/1800-1850_hathifile.txt.gz"
HATHICOLS = "/home/stephen-krewson/project-hathi-images/datafiles/hathifile_columns.txt"

# 2. Flat file of all 2.5m "regions of interest" (illustrations). Compression takes it from 200 MB to 23 MB!
# TODO: Rename these files and reupload into Zenodo with script
ROIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/1800-1850_roi-table.csv.gz"

In [32]:
# Test that the files exist
!stat $HATHIFILE
!stat $HATHICOLS
!stat $ROIFILE

  File: /home/stephen-krewson/project-hathi-images/datafiles/1800-1850_hathifile.txt.gz
  Size: 30724746  	Blocks: 60016      IO Block: 4096   regular file
Device: 810h/2064d	Inode: 42323       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/stephen-krewson)   Gid: ( 1000/stephen-krewson)
Access: 2021-04-08 13:26:27.327452800 -0400
Modify: 2021-04-08 13:26:27.347452800 -0400
Change: 2021-04-08 13:26:27.347452800 -0400
 Birth: -
  File: /home/stephen-krewson/project-hathi-images/datafiles/hathifile_columns.txt
  Size: 307       	Blocks: 8          IO Block: 4096   regular file
Device: 810h/2064d	Inode: 56569       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/stephen-krewson)   Gid: ( 1000/stephen-krewson)
Access: 2021-04-08 13:27:14.617452800 -0400
Modify: 2021-04-07 08:01:30.360310100 -0400
Change: 2021-04-08 13:27:06.057452800 -0400
 Birth: -
  File: /home/stephen-krewson/project-hathi-images/datafiles/1800-1850_roi-table.csv.gz
  Size: 23688591  	Blocks: 46272      IO Block: 4096 

In [33]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    '''
    :param ht_file: A Hathifile (tab-separated, no header row).
    :param col_file: A file whose first line holds the tab-separated Hathifile column names
    :param search_col: The field/column on which to search
    :param search_expr: A regular expression against which search_col values can be compared
    :return: A pandas dataframe of rows in which search_col matches search_expr
    '''
    # Use iterative method to scale to full hathifiles
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields
        dtype={
            'htid': 'str',
            'rights_date_used': 'object', # values NOT guaranteed to be numeric
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
        chunksize=5000,
        # skip malformed rows; needs pandas >= 1.3 (older versions used error_bad_lines=False)
        on_bad_lines='skip')

    # collect the matching rows of each chunk, then concatenate once at the end
    matches = []
    for chunk in iter_csv:
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)
        
        # hathifile idx has no relation to Neighbor tree: ignore
        matches.append(chunk[condition])

    # pd.concat raises on an empty list, so handle an empty input file explicitly
    if not matches:
        return pd.DataFrame(columns=col_names)
    return pd.concat(matches, ignore_index=True)

In [34]:
search_col = 'imprint'

# find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis" (using a non-capturing group)
# search_expr = r"\bMunroe(?:,| and| &) Francis\b"
# search_label = "munroe-francis"

# similar string, but for carter and hendee
search_expr = r"\bCarter(?:,| and| &) Hendee\b"
# a label for the results of this experiment (in case you want to compare later)
search_label = "carter-hendee"

df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(250, 26)

In [35]:
df.columns # use title, rights_date_used, imprint

Index(['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source',
       'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title',
       'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag',
       'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code',
       'content_provider_code', 'responsible_entity_code',
       'digitization_agent_code', 'access_profile_code', 'author'],
      dtype='object')

In [36]:
# convert date objects to integers, for the year of publication
df['rights_date_used'] = pd.to_numeric(df['rights_date_used']).astype(int)

In [47]:
# Define the columns we want to view/keep from the Hathifile; show a sample
hf_cols = ['htid', search_col, 'title', 'rights_date_used']
df[hf_cols].sample(10)

Unnamed: 0,htid,imprint,title,rights_date_used
207,umn.31951002030024t,"Carter and Hendee, 1830.",The young emigrants : a tale designed for youn...,1830
62,inu.32000001832221,"Carter, Hendee & Babcock ; E. Bliss, 1831.",Travels in Malta and Sicily : with sketches of...,1831
206,chi.086928422,"E. Littel; Boston, Carter & Hendee; [etc., etc...",Hints for naval officers cruising in the West ...,1830
76,hvd.32044012566428,"Carter, Hendee and co., 1833.",Good wives.,1833
235,nyp.33433105233252,"Carter and Hendee,",Youth's keepsake.,1830
54,hvd.32044010381515,"Carter, Hendee and Babcock, 1831.",Lectures on witchcraft comprising a history of...,1831
234,nyp.33433116244611,"Carter and Hendee,",Youth's keepsake.,1836
214,hvd.32044048688139,"Carter, Hendee, and Babcock, 1831-1833.",Scientific tracts.,1832
215,hvd.32044044506970,"Carter, Hendee and Co., 1836-1847.",The Farmer's almanack.,1850
236,uiug.30112077871199,"Carter and Hendee, 1832.",A Collection of psalms and hymns for Christian...,1832


In [39]:
# Now merge on htids -- just a few seconds to read! compression inferred automatically!
df_rois = pd.read_csv(ROIFILE)

In [44]:
df_rois.sample(10)

Unnamed: 0,htid,page_seq,page_label,crop_no,vector_path
282861,ucm.5325489004,377,plate_image,0,ucm/5594/ucm.5325489004_00000377_00.npy
19067,inu.30000093219388,214,inline_image,0,inu/30918/inu.30000093219388_00000214_00.npy
740727,nyp.33433062732494,227,plate_image,1,nyp/33639/nyp.33433062732494_00000227_01.npy
2293251,hvd.ah5kx2,143,inline_image,0,hvd/ak/hvd.ah5kx2_00000143_00.npy
1853863,uc1.b4151269,91,inline_image,0,uc1/b56/uc1.b4151269_00000091_00.npy
216933,ucm.5322144112,415,plate_image,0,ucm/5242/ucm.5322144112_00000415_00.npy
1775076,uc1.b4070680,334,plate_image,1,uc1/b78/uc1.b4070680_00000334_01.npy
1524532,njp.32101064097023,267,plate_image,0,njp/30692/njp.32101064097023_00000267_00.npy
302318,ucm.5324204756,167,plate_image,1,ucm/5446/ucm.5324204756_00000167_01.npy
2365384,hvd.hns7sp,59,inline_image,0,hvd/h7/hvd.hns7sp_00000059_00.npy


In [48]:
# Now merge on htid
# https://towardsdatascience.com/guide-to-big-data-joins-python-sql-pandas-spark-dask-51b7f4fec810

# there are duplicate ids in df_rois, because each htid has many pages and ROIs potentially
df_merged = pd.merge(df[hf_cols], df_rois, on='htid', how='inner', validate='one_to_many')

In [51]:
df_merged.head(10)

Unnamed: 0,htid,imprint,title,rights_date_used,page_seq,page_label,crop_no,vector_path
0,hvd.hnztud,"Carter and Hendee, 1830.",Studies in poetry. Embracing notices of the li...,1830,8,plate_image,0,hvd/ht/hvd.hnztud_00000008_00.npy
1,hvd.hwhrcx,"Carter and Hendee, 1830.",Studies in poetry. Embracing notices of the li...,1830,10,plate_image,0,hvd/hr/hvd.hwhrcx_00000010_00.npy
2,osu.32435053600458,"Carter and Hendee, 1832.",Retrospections of the stage. By the late John ...,1832,483,plate_image,0,osu/33505/osu.32435053600458_00000483_00.npy
3,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,147,inline_image,1,mdp/31974/mdp.39015093173741_00000147_01.npy
4,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,51,inline_image,0,mdp/31974/mdp.39015093173741_00000051_00.npy
5,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,134,inline_image,0,mdp/31974/mdp.39015093173741_00000134_00.npy
6,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,101,inline_image,0,mdp/31974/mdp.39015093173741_00000101_00.npy
7,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,132,inline_image,1,mdp/31974/mdp.39015093173741_00000132_01.npy
8,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,20,inline_image,1,mdp/31974/mdp.39015093173741_00000020_01.npy
9,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,147,inline_image,0,mdp/31974/mdp.39015093173741_00000147_00.npy


## Step 2: Convert vector_path to .jpg

Assuming the vector_path uses the standard stubbytree form, we can simply swap the .npy extension for .jpg.
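
A minimal sketch of that swap follows. The helper name `vector_path_to_jpg` is hypothetical (it is not part of the project code), and the example path is one of the `vector_path` values shown in the merged table.

```python
import os

def vector_path_to_jpg(vector_path):
    """Return the image filename implied by a stubbytree vector path (hypothetical helper)."""
    base = os.path.basename(vector_path)   # drop the library and stub directories
    stem, ext = os.path.splitext(base)     # split off the final extension only
    if ext != ".npy":
        raise ValueError(f"expected a .npy path, got {vector_path!r}")
    return stem + ".jpg"

print(vector_path_to_jpg("hvd/ht/hvd.hnztud_00000008_00.npy"))
# prints: hvd.hnztud_00000008_00.jpg
```

Because `os.path.splitext` only splits at the last dot, the period inside the HathiTrust ID itself is left untouched.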

In [16]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path

In [18]:
# map each original htid to its stubbytree path
stubby_dict = {htid: id_to_stubbytree(htid) for htid in df.htid.values}

In [21]:
# Print the initial htid and show its stubby id transformation
for k,v in stubby_dict.items():
    print(k, "-->", v)
    break

hvd.hnztud --> hvd/ht/hvd.hnztud


In [28]:
# The project CSV is 200 MB... can I gzip it? And do an efficient join on htid with pandas?
# Fields are: htid, page_seq, page_label, crop_no, vector_path
# See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

In [27]:
# Old method: use Unix globbing to find stubby .npy files and swap out the extension
# Problem: Vectors were a dead end, so we want to exclusively work with flat metadata files

# for each volume, find associated .npy vectors within stubbytree directory -- store in dictionary

#VEC_DIR = os.path.abspath("../_app-files/roi-vectors/vectors")

#query_vectors = {}

#for stubby_id in stubby_dict.keys():
#    vol_path = os.path.join(VEC_DIR, stubby_id + "*.npy")
#    vol_vectors = glob(vol_path)
#    if len(vol_vectors) != 0:
#        query_vectors[stubby_id] = vol_vectors

## Step 3: Reformat ROIs with metadata for PixPlot

We can reformat our selected ROIs, taking selected columns and renaming them. If we are able to acquire image data, this will allow us to attach the metadata and build a PixPlot visualization.

See https://github.com/YaleDHLab/pix-plot for more details.
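
One way to express this reformatting without an explicit loop is sketched below on a made-up two-row table. The column names mirror the merged table and the 'year', 'description', 'filename', and 'label' fields used in this notebook; the rows and the `toy-query` label are invented, and this is an illustration rather than a statement of everything PixPlot expects.

```python
import os
import pandas as pd

# Made-up rows with the same column names as the merged Hathifile/ROI table.
merged = pd.DataFrame({
    "title": ["A book", "A book"],
    "rights_date_used": [1830, 1830],
    "vector_path": [
        "aaa/01/aaa.001_00000012_00.npy",
        "aaa/01/aaa.001_00000040_01.npy",
    ],
})

# Hathifile column -> metadata column.
col_map = {"rights_date_used": "year", "title": "description"}

# Select and rename the metadata columns in one step.
meta = merged[list(col_map)].rename(columns=col_map)

# Derive the image filename from the basename of each vector path.
meta["filename"] = [
    os.path.splitext(os.path.basename(p))[0] + ".jpg" for p in merged["vector_path"]
]

# Tag every row with a label for this query (hypothetical value).
meta["label"] = "toy-query"
print(meta)
```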

In [66]:
# columns we want to keep from hathifile: these will map to 'description' and 'year' in PixPlot's format
col_map = {
    'rights_date_used': 'year',
    'title': 'description'
}

# query_vectors only exists in the retired glob-based cell above, so build the rows
# from df_merged instead: it already pairs every ROI with its volume's metadata
rows = []
for rec in df_merged.itertuples(index=False):

    # row to be added to df_pixplot, starting with the renamed metadata columns
    row = {new_col: getattr(rec, old_col) for old_col, new_col in col_map.items()}

    # transform the .npy basename into a .jpg filename, separate from rest of path
    row['filename'] = os.path.splitext(os.path.basename(rec.vector_path))[0] + '.jpg'
    row['label'] = search_label

    rows.append(row)

In [68]:
# turn dict rows into dataframe -- 'filename' shows the convention for image paths used in the project
df_pixplot = pd.DataFrame.from_dict(rows)
df_pixplot.sample(10)

Unnamed: 0,year,description,filename,label
739,1832,Les contes de Pierre Parley sur l'Amérique--,nyp.33433012110858_00000088_00.jpg,carter-hendee
734,1832,Les contes de Pierre Parley sur l'Amérique--,nyp.33433012110858_00000079_00.jpg,carter-hendee
2469,1830,Youth's keepsake.,nyp.33433105233252_00000160_00.jpg,carter-hendee
2128,1850,The Farmer's almanack.,hvd.32044044506970_00000461_00.jpg,carter-hendee
177,1830,The young lady's book : a manual of elegant re...,uc1.$b264661_00000076_00.jpg,carter-hendee
58,1835,"Elements of natural philosophy, with questions...",njp.32101013012156_00000254_00.jpg,carter-hendee
1979,1834,Scientific tracts.,umn.319510027996728_00000018_00.jpg,carter-hendee
760,1831,The architecture of birds.,nyp.33433011003872_00000089_00.jpg,carter-hendee
834,1833,Scenes of American wealth and industry in prod...,nyp.33433007272812_00000039_00.jpg,carter-hendee
2462,1830,Youth's keepsake.,nyp.33433105233252_00000078_00.jpg,carter-hendee


In [69]:
# use the search label to make a metadata path
#metadata_csv = "{}_pixplot-metadata.csv".format(search_label)

# save as a CSV that PixPlot can accept
#df_pixplot.to_csv(metadata_csv, sep=',', header=True, index=False)