# Visualize all illustrations for a given query

The project data can be unwieldy to work with. In many cases, you will want to isolate a subset of the 2.5+ million illustrated regions so that analysis can be done at a smaller scale and more quickly. This notebook shows how to get started with such research. The goal is to find the metadata for all books published in 1800-1850 by a given publisher, such as the Boston firm Munroe & Francis (the query cell in Step 1 currently runs the same search for Carter & Hendee, with the Munroe & Francis pattern left commented out).

1. To create a set of volume IDs to visualize, search the Hathifile excerpt with a metadata query.
2. Using the volume IDs from Step 1, find all illustrated pages corresponding to those volumes. This requires a file that is in the Zenodo repository or that mimics the Google Cloud bucket.
3. Create a PixPlot metadata file that joins the Hathifile information with the stubby IDs.
4. Get image assets from cloud storage.
5. Run PixPlot processing.
6. Visualize with a PixPlot instance.

## Step 1: Filter Hathifile using metadata; extract matching rows

Hathifiles can be very big, so we iteratively search them for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and the name of a firm can be written in slightly different ways (e.g. '&' vs. 'and').
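To make the point about name variants concrete, here is a minimal sketch of how one regular expression can cover several spellings of a firm. The sample imprints are invented for illustration and are not taken from the Hathifile.

```python
import re

# Hypothetical imprint strings, written the way publisher statements tend to vary.
imprints = [
    "Munroe & Francis, 1825.",
    "Munroe and Francis, 1830.",
    "Munroe, Francis & Parker, 1810.",
    "Francis Munroe, 1840.",
]

# The non-capturing group (?:...) lists the accepted joiners between the two names.
pattern = re.compile(r"\bMunroe(?:,| and| &) Francis\b", flags=re.IGNORECASE)

matches = [s for s in imprints if pattern.search(s)]
print(matches)
```

The last string is skipped because the names appear in the wrong order. A pattern like this will still miss abbreviations or misspellings, so it is worth sampling the results by eye.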

In [2]:
import pandas as pd
import numpy as np
import os, random, re, sys
from glob import glob

In [31]:
# Files that we need to open/access to generate the PixPlot metadata file. It's simplest to use absolute paths.
# These files are available in the Zenodo repository: http://zenodo.org/record/3940528#.XyRNSZ5KjIU

# 1. Hathifile subset, for performing basic metadata queries
HATHIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/1800-1850_hathifile.txt.gz"
HATHICOLS = "/home/stephen-krewson/project-hathi-images/datafiles/hathifile_columns.txt"

# 2. Flat file of all 2.5m "regions of interest" (illustrations). Compression takes it from 200 MB to 23 MB!
# TODO: Rename these files and reupload into Zenodo with script
ROIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/1800-1850_roi-table.csv.gz"

In [32]:
# Test that the files exist
!stat $HATHIFILE
!stat $HATHICOLS
!stat $ROIFILE

  File: /home/stephen-krewson/project-hathi-images/datafiles/1800-1850_hathifile.txt.gz
  Size: 30724746  	Blocks: 60016      IO Block: 4096   regular file
Device: 810h/2064d	Inode: 42323       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/stephen-krewson)   Gid: ( 1000/stephen-krewson)
Access: 2021-04-08 13:26:27.327452800 -0400
Modify: 2021-04-08 13:26:27.347452800 -0400
Change: 2021-04-08 13:26:27.347452800 -0400
 Birth: -
  File: /home/stephen-krewson/project-hathi-images/datafiles/hathifile_columns.txt
  Size: 307       	Blocks: 8          IO Block: 4096   regular file
Device: 810h/2064d	Inode: 56569       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/stephen-krewson)   Gid: ( 1000/stephen-krewson)
Access: 2021-04-08 13:27:14.617452800 -0400
Modify: 2021-04-07 08:01:30.360310100 -0400
Change: 2021-04-08 13:27:06.057452800 -0400
 Birth: -
  File: /home/stephen-krewson/project-hathi-images/datafiles/1800-1850_roi-table.csv.gz
  Size: 23688591  	Blocks: 46272      IO Block: 4096 

In [33]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    '''
    :param ht_file: A Hathifile (tab-separated text, optionally gzipped).
    :param col_file: A file whose first line holds the tab-separated Hathifile column names
    :param search_col: The field/column on which to search
    :param search_expr: A regular expression against which search_col values can be compared
    :return: A pandas dataframe of rows in which search_col matches search_expr
    '''
    # Column names come from the first line of col_file
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')

    # Read in chunks so the same code scales to full hathifiles
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file,
        sep='\t',
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields
        dtype={
            'htid': 'str',
            'rights_date_used': 'object', # values NOT guaranteed to be numeric
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
        iterator=True,
        chunksize=5000,
        # skip malformed rows (pandas >= 1.3); replaces error_bad_lines=False, removed in pandas 2.0
        on_bad_lines='skip')

    # Collect the matching rows of each chunk, then concatenate once at the end
    matches = []
    for chunk in iter_csv:
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)
        matches.append(chunk[condition])

    if not matches:
        return pd.DataFrame(columns=col_names)

    # hathifile idx has no relation to Neighbor tree: ignore
    return pd.concat(matches, ignore_index=True)
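The chunked search can be tried without the real data. The sketch below runs the same read-filter-concatenate pattern on a three-row, tab-separated string; the IDs and the two-column layout are made up for the example, and a real Hathifile has 26 columns.

```python
import io
import re

import pandas as pd

# Hypothetical stand-in for a Hathifile excerpt: htid <TAB> imprint.
tsv = (
    "aaa.001\tCarter and Hendee, 1832.\n"
    "aaa.002\tMunroe & Francis, 1825.\n"
    "aaa.003\tCarter, Hendee & co., 1833.\n"
)

reader = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    header=None,
    names=["htid", "imprint"],
    dtype=str,
    chunksize=2,  # tiny chunks, only to exercise the loop
)

expr = r"\bCarter(?:,| and| &) Hendee\b"
kept = []
for chunk in reader:
    # na=False treats missing imprints as non-matches
    mask = chunk["imprint"].str.contains(expr, na=False, flags=re.IGNORECASE)
    kept.append(chunk[mask])

result = pd.concat(kept, ignore_index=True)
print(result["htid"].tolist())
```

Only one chunk is held in memory at a time, which is what lets the same loop scale from the 1800-1850 excerpt to a full Hathifile.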

In [34]:
search_col = 'imprint'

# find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis" (using a non-capturing group)
# search_expr = r"\bMunroe(?:,| and| &) Francis\b"
# search_label = "munroe-francis"

# similar pattern, but for Carter and Hendee
search_expr = r"\bCarter(?:,| and| &) Hendee\b"
# a label for the results of this experiment (in case you want to compare later)
search_label = "carter-hendee"

df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(250, 26)

In [35]:
df.columns # use title, rights_date_used, imprint

Index(['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source',
       'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title',
       'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag',
       'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code',
       'content_provider_code', 'responsible_entity_code',
       'digitization_agent_code', 'access_profile_code', 'author'],
      dtype='object')

In [36]:
# convert date objects to integers, for the year of publication
df['rights_date_used'] = pd.to_numeric(df['rights_date_used']).astype(int)
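The conversion above succeeds for this query, but `rights_date_used` is read as `object` precisely because its values are not guaranteed to be numeric. A more defensive variant, sketched here on invented values, coerces unparseable entries to missing and uses pandas' nullable integer dtype:

```python
import pandas as pd

# Hypothetical values: one is not a clean four-digit year, one is missing.
years = pd.Series(["1830", "1832", "18uu", None], dtype="object")

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(years, errors="coerce")

# The nullable "Int64" dtype keeps whole numbers while allowing missing values.
clean = numeric.astype("Int64")
print(clean.tolist())
```

Whether to drop or keep the rows with missing years afterwards depends on the analysis. The point is only that the cast no longer fails on a single bad value.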

In [80]:
# Define the columns we want to view/keep from the Hathifile; show a sample
hf_cols = ['htid', search_col, 'title', 'rights_date_used']
df[hf_cols].sample(10)

Unnamed: 0,htid,imprint,title,rights_date_used
192,hvd.hn1cdn,"Lincoln & Edmands, and Carter & Hendee, 1830.",The classical reader : a selection of lessons ...,1830
163,hvd.hn2fkw,"Carter and Hendee, 1832.",The American common-place book of prose : a co...,1832
223,nyp.33433066415021,"Carter, Hendee & co., 1833.",Exercises in arithmetic : with particular refe...,1833
11,hvd.32044024168874,"Carter and Hendee, 1831.","An essay on demonology, ghosts and apparitions...",1831
86,hvd.hn5dsx,"Carter and Hendee, 1832.",Les contes de Pierre Parley sur l'Amérique--,1832
22,mdp.39015013735819,"Carter, Hendee & co. ; T.B. White, 1833.","Review, historical and political, of the late ...",1833
199,hvd.hn5f32,"Published by J. & J. Harper, Sold by Collins &...",The Dutchman's fireside. A tale / by the autho...,1833
72,nyp.33433087564898,"Carter and Hendee, 1830.",Mathematical tables : comprising logarithms of...,1830
78,hvd.hnb163,"Carter, Hendee, 1833.",A manual containing information respecting the...,1833
177,hvd.32044097044630,"Carter and Hendee, 1830.",First lessons in plane geometry : together wit...,1830


In [39]:
# Read the ROI table before merging on htid -- just a few seconds to read! gzip compression is inferred from the .gz extension
df_rois = pd.read_csv(ROIFILE)

In [44]:
df_rois.sample(10)

Unnamed: 0,htid,page_seq,page_label,crop_no,vector_path
282861,ucm.5325489004,377,plate_image,0,ucm/5594/ucm.5325489004_00000377_00.npy
19067,inu.30000093219388,214,inline_image,0,inu/30918/inu.30000093219388_00000214_00.npy
740727,nyp.33433062732494,227,plate_image,1,nyp/33639/nyp.33433062732494_00000227_01.npy
2293251,hvd.ah5kx2,143,inline_image,0,hvd/ak/hvd.ah5kx2_00000143_00.npy
1853863,uc1.b4151269,91,inline_image,0,uc1/b56/uc1.b4151269_00000091_00.npy
216933,ucm.5322144112,415,plate_image,0,ucm/5242/ucm.5322144112_00000415_00.npy
1775076,uc1.b4070680,334,plate_image,1,uc1/b78/uc1.b4070680_00000334_01.npy
1524532,njp.32101064097023,267,plate_image,0,njp/30692/njp.32101064097023_00000267_00.npy
302318,ucm.5324204756,167,plate_image,1,ucm/5446/ucm.5324204756_00000167_01.npy
2365384,hvd.hns7sp,59,inline_image,0,hvd/h7/hvd.hns7sp_00000059_00.npy


In [48]:
# Now merge on htid
# https://towardsdatascience.com/guide-to-big-data-joins-python-sql-pandas-spark-dask-51b7f4fec810

# htids repeat in df_rois, because each volume can have many illustrated pages and several ROIs per page
df_merged = pd.merge(df[hf_cols], df_rois, on='htid', how='inner', validate='one_to_many')
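As an illustration of what the inner join and the `validate` argument do, here is a small sketch on two invented tables. The IDs and titles are hypothetical.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical miniature versions of the Hathifile subset and the ROI table.
volumes = pd.DataFrame({"htid": ["aaa.001", "aaa.002"], "title": ["First", "Second"]})
rois = pd.DataFrame({
    "htid": ["aaa.001", "aaa.001", "zzz.999"],
    "page_seq": [8, 12, 3],
})

# Inner join: keeps only ROIs whose volume matched the query, and drops
# matched volumes that have no ROIs at all (aaa.002 here).
merged = pd.merge(volumes, rois, on="htid", how="inner", validate="one_to_many")
print(merged)

# validate="one_to_many" raises if an htid is repeated on the left side.
dupes = pd.concat([volumes, volumes.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dupes, rois, on="htid", how="inner", validate="one_to_many")
    raised = False
except MergeError:
    raised = True
print(raised)
```

Note that volumes without any detected illustrations disappear silently in an inner join, so the number of distinct htids after merging can be smaller than the number of rows returned by the search.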

In [52]:
df_merged.sample(10)

Unnamed: 0,htid,imprint,title,rights_date_used,page_seq,page_label,crop_no,vector_path
98,mdp.39015063894896,"Carter, Hendee, 1834.",An elementary treatise on geometry : simplifie...,1834,172,inline_image,0,mdp/31699/mdp.39015063894896_00000172_00.npy
1568,hvd.hnszhk,"Carter, Hendee, 1832.","A system of universal geography, popular and s...",1832,803,inline_image,1,hvd/hz/hvd.hnszhk_00000803_01.npy
957,nyp.33433066365416,"Lilly & Wait, and Carter & Hendee; [etc., etc....","Knowledge for the people : or, the plain why a...",1832,201,plate_image,0,nyp/33661/nyp.33433066365416_00000201_00.npy
2403,nyp.33433105233229,"Carter and Hendee,",Youth's keepsake.,1842,87,plate_image,0,nyp/33032/nyp.33433105233229_00000087_00.npy
1200,hvd.32044082030669,"Carter, Hendee, 1833.",Scenes of American wealth and industry in prod...,1833,76,inline_image,0,hvd/34836/hvd.32044082030669_00000076_00.npy
551,wu.89088264569,"Carter, Hendee, and Babcock [etc.], 1831-33.","Scientific tracts, designed for instruction an...",1833,110,inline_image,0,wu/8866/wu.89088264569_00000110_00.npy
1260,hvd.hwglsl,"Carter, Hendee, 1833.",Scenes of American wealth and industry in prod...,1833,101,inline_image,0,hvd/hl/hvd.hwglsl_00000101_00.npy
830,nyp.33433007272812,"Carter, Hendee, and Co. [etc.] 1833.",Scenes of American wealth and industry in prod...,1833,89,inline_image,0,nyp/33071/nyp.33433007272812_00000089_00.npy
666,hvd.hn5dsx,"Carter and Hendee, 1832.",Les contes de Pierre Parley sur l'Amérique--,1832,46,inline_image,0,hvd/hd/hvd.hn5dsx_00000046_00.npy
2452,nyp.33433116244611,"Carter and Hendee,",Youth's keepsake.,1836,206,inline_image,0,nyp/33141/nyp.33433116244611_00000206_00.npy


### Step: Convert vector_path to .jpg

Assuming the vector_path uses the standard stubbytree form, we can simply swap the .npy extension for .jpg.

Another idea: add a column with the HathiTrust URL for each page. PixPlot has a "permalink" field for exactly this, as well as a "category" field.

In [61]:
# Create functions to swap in .jpg extension and create permalink
# TODO: Look in ACS-Krewson and find the actual code used to create the stubbytree paths
def npy_to_jpg(path):
    '''
    :param path: The .npy vector associated with an ROI.
    :return: Same stubbytree path, but with a JPEG extension instead.
    '''
    return os.path.splitext(path)[0] + '.jpg'

def row_to_permalink(row):
    '''
    :param row: Pandas dataframe row containing the following fields: htid, page_seq
    :return: Permalink to HathiTrust view of given page
    '''
    return "https://babel.hathitrust.org/cgi/pt?id={}&view=1up&seq={}".format(row['htid'], row['page_seq'])

In [63]:
npy_to_jpg(df_merged.iloc[0]['vector_path'])

'hvd/ht/hvd.hnztud_00000008_00.jpg'

In [62]:
# Test the function on first row in the merged dataframe
row_to_permalink(df_merged.iloc[0])

'https://babel.hathitrust.org/cgi/pt?id=hvd.hnztud&view=1up&seq=8'
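One way to sanity-check the paths and permalinks is to parse a path back into its parts. The sketch below assumes, judging only from the sample rows shown above, that file names follow `<htid>_<8-digit page sequence>_<2-digit crop number>.<ext>`. The helper `parse_roi_path` is hypothetical and not part of the project code.

```python
import posixpath
import re

# Assumed naming pattern, inferred from the sample rows: htid, page_seq, crop_no.
ROI_NAME = re.compile(r"^(?P<htid>.+)_(?P<seq>\d{8})_(?P<crop>\d{2})\.\w+$")

def parse_roi_path(path):
    """Split a stubbytree ROI path into (htid, page_seq, crop_no)."""
    match = ROI_NAME.match(posixpath.basename(path))
    if match is None:
        raise ValueError("unexpected ROI file name: {}".format(path))
    return match.group("htid"), int(match.group("seq")), int(match.group("crop"))

print(parse_roi_path("hvd/ht/hvd.hnztud_00000008_00.npy"))
```

For the first merged row this recovers the volume ID and sequence number 8, which agrees with the `seq=8` in the permalink. IDs containing characters that are sanitized for file names (such as `/`) would need decoding first.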

In [71]:
# Add a roi_stubbypath column holding vector_path with a .jpg extension; use map() since a column is a Series
df_merged['roi_stubbypath'] = df_merged['vector_path'].map(npy_to_jpg)

In [72]:
# For the permalink, we need values from two columns (htid, page_seq),
# so use apply() with axis=1 to pass each row to the function
df_merged['permalink'] = df_merged.apply(row_to_permalink, axis=1)

In [73]:
df_merged.head(5)

Unnamed: 0,htid,imprint,title,rights_date_used,page_seq,page_label,crop_no,vector_path,permalink,roi_stubbypath
0,hvd.hnztud,"Carter and Hendee, 1830.",Studies in poetry. Embracing notices of the li...,1830,8,plate_image,0,hvd/ht/hvd.hnztud_00000008_00.npy,https://babel.hathitrust.org/cgi/pt?id=hvd.hnz...,hvd/ht/hvd.hnztud_00000008_00.jpg
1,hvd.hwhrcx,"Carter and Hendee, 1830.",Studies in poetry. Embracing notices of the li...,1830,10,plate_image,0,hvd/hr/hvd.hwhrcx_00000010_00.npy,https://babel.hathitrust.org/cgi/pt?id=hvd.hwh...,hvd/hr/hvd.hwhrcx_00000010_00.jpg
2,osu.32435053600458,"Carter and Hendee, 1832.",Retrospections of the stage. By the late John ...,1832,483,plate_image,0,osu/33505/osu.32435053600458_00000483_00.npy,https://babel.hathitrust.org/cgi/pt?id=osu.324...,osu/33505/osu.32435053600458_00000483_00.jpg
3,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,147,inline_image,1,mdp/31974/mdp.39015093173741_00000147_01.npy,https://babel.hathitrust.org/cgi/pt?id=mdp.390...,mdp/31974/mdp.39015093173741_00000147_01.jpg
4,mdp.39015093173741,"Carter, Hendee & Co., 1832.","An elementary treatise on geometry, simplified...",1832,51,inline_image,0,mdp/31974/mdp.39015093173741_00000051_00.npy,https://babel.hathitrust.org/cgi/pt?id=mdp.390...,mdp/31974/mdp.39015093173741_00000051_00.jpg


In [53]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path
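To check whether the directories in `vector_path` really follow this stubbytree rule, the directory part can be recomputed from the htid and compared against a few paths copied from the tables in this notebook. The helper `stubby_dir` below is a hypothetical, compact restatement of the logic above (library code, then every third character of the sanitized volume ID), not the code that originally generated the paths.

```python
def stubby_dir(htid):
    """Return the stubbytree directory (lib/every-third-char) for an htid."""
    lib, vol = htid.split(".", 1)
    # Same character substitutions as _id_encode
    vol_clean = vol.replace(":", "+").replace("/", "=").replace(".", ",")
    return "/".join([lib, vol_clean[::3]])

# htid -> path, copied from the sample output in this notebook
samples = {
    "hvd.hnztud": "hvd/ht/hvd.hnztud_00000008_00.npy",
    "nyp.33433066365416": "nyp/33661/nyp.33433066365416_00000201_00.npy",
    "uc1.$b264661": "uc1/$66/uc1.$b264661_00000065_00.jpg",
}

checks = {htid: path.startswith(stubby_dir(htid) + "/") for htid, path in samples.items()}
print(checks)
```

All three samples agree, including the `uc1` ID with a `$` in it, since `$` is not one of the substituted characters. Three rows are only a spot check, so running the comparison over the whole `df_merged` would be the more convincing test.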

## Step 3: Reformat ROIs with metadata for PixPlot

We can reformat our selected ROIs, taking selected columns and renaming them. If we are able to acquire image data, this will allow us to attach the metadata and build a PixPlot visualization.

df_merged | pixplot
--- | ---
'rights_date_used' | 'year'
'title' | 'description'
'permalink' | 'permalink'
'roi_stubbypath' | 'filename'
search_label | 'label'

See https://github.com/YaleDHLab/pix-plot for more details.

In [94]:
# A much more idiomatic way to do it! I could also try to add the imprint to the description, but maybe not...
# Internet Archive / MARC has publication place but Hathifile does not (I think)
df_pixplot = pd.DataFrame({
    'year': df_merged['rights_date_used'],
    'description': df_merged['title'],
    'filename': df_merged['roi_stubbypath'],
    'label': search_label,
    'permalink': df_merged['permalink']
})

In [95]:
df_pixplot['description'][0]

'Studies in poetry. Embracing notices of the lives and writings of the best poets in the English language, a copious selection of elegant extracts, a short analysis of Hebrew poetry, and translations from the sacred poets: designed to illustrate the principles of rhetoric, and teach their application to poetry. By George B. Cheever.'

In [96]:
# this is what we need to check re: stubbytrees -- I think I should call clean_htid() since uc paths have $ in them
# Again, best to just reproduce whatever Boris did
# uc1/$66/uc1.$b264661_00000065_00.jpg

In [93]:
df_pixplot.sample(10)

Unnamed: 0,year,description,filename,label,permalink
430,1830,The young lady's book : a manual of elegant re...,uc1/$66/uc1.$b264661_00000065_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=uc1.$b2...
1238,1833,Scenes of American wealth and industry in prod...,hvd/hl/hvd.hwglsl_00000142_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.hwg...
831,1833,Scenes of American wealth and industry in prod...,nyp/33071/nyp.33433007272812_00000092_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=nyp.334...
1220,1833,Scenes of American wealth and industry in prod...,hvd/hl/hvd.hwglsl_00000063_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.hwg...
1690,1830,The tales of Peter Parley about America.,hvd/34260/hvd.32044021161005_00000065_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.320...
1310,1832,"A system of universal geography, popular and s...",hvd/hz/hvd.hnszhk_00000782_01.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.hns...
60,1835,"Elements of natural philosophy, with questions...",njp/30115/njp.32101013012156_00000147_01.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=njp.321...
1934,1833,"A system of school geography, chiefly derived ...",hvd/hp/hvd.hn6ps5_00000121_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.hn6...
1054,1834,"The every day book, for youth. By Peter Parley...",hvd/hd/hvd.hn5dhi_00000051_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.hn5...
1319,1832,"A system of universal geography, popular and s...",hvd/hz/hvd.hnszhk_00000914_00.jpg,carter-hendee,https://babel.hathitrust.org/cgi/pt?id=hvd.hns...


In [69]:
# use the search label to make a metadata path
#metadata_csv = "{}_pixplot-metadata.csv".format(search_label)

# save as a CSV that PixPlot can accept
#df_pixplot.to_csv(metadata_csv, sep=',', header=True, index=False)
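Before enabling the save step, it can help to confirm that the CSV round-trips cleanly, since titles often contain commas. The sketch below writes a one-row table with the same column names to an in-memory buffer instead of a file; the row values are invented.

```python
import io

import pandas as pd

# Hypothetical one-row metadata table using the PixPlot column names from above.
df_demo = pd.DataFrame({
    "year": [1830],
    "description": ["A title, with a comma"],
    "filename": ["aaa/01/aaa.001_00000008_00.jpg"],
    "label": ["carter-hendee"],
    "permalink": ["https://babel.hathitrust.org/cgi/pt?id=aaa.001&view=1up&seq=8"],
})

# index=False keeps the pandas row index out of the file
buffer = io.StringIO()
df_demo.to_csv(buffer, sep=",", header=True, index=False)

round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.columns.tolist())
print(round_trip.loc[0, "description"])
```

pandas quotes any field containing the separator, so the comma inside the description survives the round trip as part of a single column.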