# Visualize all illustrations for a given query

The project data can be unwieldy to work with. In many cases, you want to isolate a subset of the 2.5+ million illustrated regions so that analysis can be done at a smaller scale and more quickly. This notebook shows how to get started with such research. The stated goal is to find the metadata for all books published in 1800-1850 by the Boston firm Munroe & Francis. The cells below run the same workflow for the firm Carter and Hendee, with the Munroe & Francis query left as a commented-out alternative.

1. Search a Hathifile excerpt with a metadata query to create the set of volume IDs to visualize.
2. Using the volume IDs from Step 1, find all illustrated pages in those volumes. This requires a file that is in the Zenodo repository or that mimics the layout of the Google Cloud bucket.
3. Create a PixPlot metadata file that joins the Hathifile information with the stubbytree paths.
4. Get the image assets from cloud storage.
5. Run PixPlot processing.
6. Visualize with a PixPlot instance.

## Step 1: Filter Hathifile using metadata; extract matching rows

Hathifiles can be very big, so we iteratively search them for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and the name of a firm can be written in slightly different ways (e.g. '&' vs. 'and').
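
To make the point about name variants concrete, here is a minimal sketch that tries the Carter and Hendee pattern used later in this notebook against a handful of imprint strings. The first three strings are copied from the sample output further down; the last two are made up for illustration and are labeled as hypothetical.

```python
import re

# Same pattern as the search_expr used later in this notebook
pattern = re.compile(r"\bCarter(?:,| and| &) Hendee\b", flags=re.IGNORECASE)

imprints = [
    "Carter, Hendee, 1832.",         # from the sample output
    "Carter and Hendee, 1830.",      # from the sample output
    "Carter, Hendee & co., 1834.",   # from the sample output
    "carter & hendee, 1831.",        # hypothetical: lower case, ampersand
    "Carter Hendee and Co., 1833.",  # hypothetical: no separator at all
]

matched = [s for s in imprints if pattern.search(s)]
missed = [s for s in imprints if not pattern.search(s)]

print(matched)
print(missed)
```

The last string is not matched because the pattern requires one of `,`, ` and` or ` &` between the two names. Checking a few expected hits and misses like this before scanning a large file is a cheap way to tune the expression.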

In [2]:
import pandas as pd
import numpy as np
import os, random, re, sys
from glob import glob

In [31]:
# Files that we need to open/access to generate the PixPlot metadata file. It's simplest to use absolute paths.
# These files are available in the Zenodo repository: http://zenodo.org/record/3940528#.XyRNSZ5KjIU

# 1. Hathifile subset, for performing basic metadata queries
HATHIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/1800-1850_hathifile.txt.gz"
HATHICOLS = "/home/stephen-krewson/project-hathi-images/datafiles/hathifile_columns.txt"

# 2. Flat file of all 2.5m "regions of interest" (illustrations). Compression takes it from 200 MB to 23 MB!
# TODO: Rename these files and reupload into Zenodo with script
ROIFILE = "/home/stephen-krewson/project-hathi-images/datafiles/1800-1850_roi-table.csv.gz"

In [174]:
# Test that the files exist
#!stat $HATHIFILE
#!stat $HATHICOLS
#!stat $ROIFILE

In [139]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    '''
    :param ht_file: A Hathifile in tab-separated format (may be gzip-compressed).
    :param col_file: A file whose first line holds the tab-separated Hathifile column names
    :param search_col: The field/column on which to search
    :param search_expr: A regular expression against which search_col values can be compared
    :return: A pandas dataframe of rows in which search_col matches search_expr
    '''
    # Use iterative method to scale to full hathifiles
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')
        num_cols = len(col_names)

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields (this may not be necessary...)
        dtype={
            'htid': 'str',
            'rights_date_used': 'object', # values NOT guaranteed to be numeric
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
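        # chunksize makes read_csv return an iterator of dataframes instead of one large frame
        chunksize=5000,
        # error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
        on_bad_lines='skip')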

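    # Collect the matching rows of each chunk, then concatenate once at the end
    matches = []
    for chunk in iter_csv:
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)

        # hathifile idx has no relation to Neighbor tree: ignore
        matches.append(chunk[condition])

    if not matches:
        # The file had no rows at all: return an empty frame with the expected columns
        return pd.DataFrame(columns=col_names)
    return pd.concat(matches, ignore_index=True)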
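
The chunked search can be tried without the real Hathifile. The sketch below runs the same read-filter-concatenate loop over a tiny in-memory table; the three-column layout and all row values are hypothetical stand-ins, not real Hathifile records.

```python
import io
import re

import pandas as pd

# A tiny, made-up tab-separated table standing in for a Hathifile
tsv = (
    "hvd.0001\tA first book\tCarter and Hendee, 1830.\n"
    "hvd.0002\tA second book\tMunroe & Francis, 1825.\n"
    "hvd.0003\tA third book\tCarter, Hendee & co., 1834.\n"
    "hvd.0004\tA fourth book\t\n"  # empty imprint is read as NaN
)
col_names = ["htid", "title", "imprint"]
search_expr = r"\bCarter(?:,| and| &) Hendee\b"

# chunksize=2 forces more than one chunk so the loop is actually exercised
reader = pd.read_csv(io.StringIO(tsv), sep="\t", header=None,
                     names=col_names, dtype=str, chunksize=2)

matches = []
for chunk in reader:
    # na=False counts a missing imprint as a non-match
    mask = chunk["imprint"].str.contains(search_expr, na=False, flags=re.IGNORECASE)
    matches.append(chunk[mask])

result = pd.concat(matches, ignore_index=True)
print(result["htid"].tolist())
```

Only one chunk is held in memory at a time, which is what lets the same loop scale from this toy table to a full Hathifile.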

In [140]:
search_col = 'imprint'

# find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis" (using a non-capturing group)
# search_expr = r"\bMunroe(?:,| and| &) Francis\b"
# search_label = "munroe-francis"

# similar string, but for carter and hendee
search_expr = r"\bCarter(?:,| and| &) Hendee\b"
# a label for the results of this experiment (in case you want to compare later)
search_label = "carter-hendee"

df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(250, 26)

In [141]:
df.columns # use title, rights_date_used, imprint

Index(['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source',
       'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title',
       'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag',
       'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code',
       'content_provider_code', 'responsible_entity_code',
       'digitization_agent_code', 'access_profile_code', 'author'],
      dtype='object')

In [147]:
# convert date objects to integers, for the year of publication
df['rights_date_used'] = pd.to_numeric(df['rights_date_used']).astype(int)

# Replace all NaN values with the string "None": avoids the problem where a missing author (NaN) can't be used as a string
df = df.fillna("None")

# Define the columns we want to view/keep from the Hathifile; show a sample
hf_cols = ['htid', 'rights_date_used', 'author', 'title', search_col]
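
The conversion above succeeds for this query, but `rights_date_used` is not guaranteed to be numeric, and `astype(int)` raises on missing values. A more defensive variant might look like the following sketch; the input values are hypothetical, and the choice of the nullable `Int64` dtype is an assumption rather than what the notebook does.

```python
import pandas as pd

# Hypothetical raw values: dates arrive as strings and may be missing or malformed
raw = pd.Series(["1832", "1830", None, "18uu"], name="rights_date_used")

# errors="coerce" turns anything unparseable into a missing value instead of raising
years = pd.to_numeric(raw, errors="coerce")

# The nullable Int64 dtype keeps integer years while still allowing missing entries
years = years.astype("Int64")

print(years.dropna().tolist())
print(int(years.isna().sum()))
```

Rows whose year is missing would then need to be dropped or handled explicitly before the later steps, which expect a plain integer year.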

In [148]:
# Look at some of the data
df[hf_cols].sample(10)

Unnamed: 0,htid,rights_date_used,author,title,imprint
157,hvd.ah65up,1832,"Brown, Bartholomew, 1772-1854,","Templi carmina; songs of the temple, or, Bridg...","Carter, Hendee, 1832."
209,hvd.hn6ps5,1833,"Goodrich, Samuel G. 1793-1860.","A system of school geography, chiefly derived ...","Carter, Hendee and co., 1833."
182,hvd.hn7v46,1830,"Gérando, Joseph-Marie, baron de, 1772-1842.","Self-education ; or, The means and art of mora...","Carter and Hendee, 1830."
218,hvd.hwjnzw,1832,"Blake, John Lauris, 1788-1857.",Conversations on the evidences of Christianity...,"Carter, Hendee, & Co., 1832."
232,nyp.33433105233229,1842,,Youth's keepsake.,"Carter and Hendee,"
12,nyp.33433070299742,1831,"Thacher, James, 1754-1844.","An essay on demonology, ghosts and apparitions...","Carter and Hendee, 1831."
207,umn.31951002030024t,1830,"Sedgwick, Susan Ann Livingston Ridley, 1789-1867.",The young emigrants : a tale designed for youn...,"Carter and Hendee, 1830."
200,hvd.hw237f,1834,"Pierpont, John, 1785-1866,","The American first class book; or, Exercises i...","Carter, Hendee & co., 1834."
74,wu.89097432249,1830,"Pestalozzi, Johann Heinrich, 1746-1827.",Letters of Pestalozzi on the education of infa...,"Carter and Hendee, 1830."
45,mdp.39015030943792,1832,"Adam, Alexander, 1741-1809.","Adam's Latin grammar, with some improvements, ...","Hilliard Gray and co., and Carter, Hendee and ..."


In [149]:
# Test of previous NaN author (for carter-hendee only)
df.iloc[231]['author']

'None'

In [150]:
# Load the ROI table: just a few seconds to read, and pandas infers the gzip compression from the file extension
df_rois = pd.read_csv(ROIFILE)

In [151]:
df_rois.sample(10)

Unnamed: 0,htid,page_seq,page_label,crop_no,vector_path
1143083,mdp.39015057081278,99,inline_image,0,mdp/31587/mdp.39015057081278_00000099_00.npy
1632689,uc1.c2553075,147,plate_image,0,uc1/c57/uc1.c2553075_00000147_00.npy
1602819,uiug.30112113441916,445,plate_image,4,uiug/31141/uiug.30112113441916_00000445_04.npy
1781467,uc1.b4606283,781,inline_image,1,uc1/b08/uc1.b4606283_00000781_01.npy
2303190,hvd.32044038513263,10,plate_image,0,hvd/34316/hvd.32044038513263_00000010_00.npy
1195801,mdp.39015013149631,94,inline_image,0,mdp/31143/mdp.39015013149631_00000094_00.npy
1116066,mdp.39015063446416,457,plate_image,0,mdp/31641/mdp.39015063446416_00000457_00.npy
397910,umn.31951d00324580u,131,plate_image,0,umn/35028/umn.31951d00324580u_00000131_00.npy
376514,umn.31951d03004701h,274,inline_image,0,umn/35000/umn.31951d03004701h_00000274_00.npy
1958106,uc1.$b520404,342,inline_image,0,uc1/$20/uc1.$b520404_00000342_00.npy


In [152]:
# Now merge on htid
# https://towardsdatascience.com/guide-to-big-data-joins-python-sql-pandas-spark-dask-51b7f4fec810
# df_rois contains duplicate htids, because each volume can have many pages and each page several ROIs
df_merged = pd.merge(df[hf_cols], df_rois, on='htid', how='inner', validate='one_to_many')
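
As a small illustration of what the inner join and `validate='one_to_many'` do, here is a sketch on two toy frames. All IDs and titles are hypothetical.

```python
import pandas as pd

# Hypothetical volume-level rows (one per htid) and ROI-level rows (many per htid)
volumes = pd.DataFrame({"htid": ["a.1", "a.2", "a.3"],
                        "title": ["One", "Two", "Three"]})
rois = pd.DataFrame({"htid": ["a.1", "a.1", "a.2", "z.9"],
                     "page_seq": [5, 9, 12, 3]})

# Inner join: a.3 (no illustrations) and z.9 (not in the query results) both drop out
merged = pd.merge(volumes, rois, on="htid", how="inner", validate="one_to_many")
print(len(merged))

# If the left side repeats an htid, validate raises instead of silently multiplying rows
dupes = pd.concat([volumes, volumes.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dupes, rois, on="htid", how="inner", validate="one_to_many")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

One consequence worth keeping in mind: volumes that matched the query but have no detected illustrations disappear from the merged frame, so its set of htids can be smaller than the set returned by the search.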

In [154]:
df_merged.sample(5)

Unnamed: 0,htid,rights_date_used,author,title,imprint,page_seq,page_label,crop_no,vector_path
804,nyp.33433011003880,1831,"Rennie, James, 1787-1867.",The architecture of birds.,Lilly and Wait (late Wells and Lilly) and Cart...,102,inline_image,0,nyp/33108/nyp.33433011003880_00000102_00.npy
739,nyp.33433012110858,1832,"Goodrich, Samuel G. 1793-1860.",Les contes de Pierre Parley sur l'Amérique--,"Carter and Hendee, 1832.",51,inline_image,0,nyp/33115/nyp.33433012110858_00000051_00.npy
1344,hvd.hnszhk,1832,"Goodrich, Samuel G. 1793-1860.","A system of universal geography, popular and s...","Carter, Hendee, 1832.",909,inline_image,1,hvd/hz/hvd.hnszhk_00000909_01.npy
413,uc1.$b264661,1830,,The young lady's book : a manual of elegant re...,"Carter, Hendee and Babcock, and Abel Bowen, [1...",131,inline_image,0,uc1/$66/uc1.$b264661_00000131_00.npy
1877,hvd.hn6ps5,1833,"Goodrich, Samuel G. 1793-1860.","A system of school geography, chiefly derived ...","Carter, Hendee and co., 1833.",236,inline_image,1,hvd/hp/hvd.hn6ps5_00000236_01.npy


### Augment the merged data

- Add HathiTrust permalink URL
- Ensure canonical stubbytree path to ROI JPEG (currently just swapping extensions and dropping vector_path column)

In [155]:
# Create functions to swap in .jpg extension and create permalink
# TODO: Look in ACS-Krewson and find the actual code used to create the stubbytree paths
def npy_to_jpg(path):
    '''
    :param path: The .npy vector associated with an ROI.
    :return: Same stubbytree path, but with a JPEG extension instead.
    '''
    return os.path.splitext(path)[0] + '.jpg'

def row_to_permalink(row):
    '''
    :param row: Pandas dataframe row containing the following fields: htid, page_seq
    :return: Permalink to HathiTrust view of given page
    '''
    return "https://babel.hathitrust.org/cgi/pt?id={}&view=1up&seq={}".format(row['htid'], row['page_seq'])

In [156]:
# Some tests of these utility functions
#npy_to_jpg(df_merged.iloc[0]['vector_path'])
#row_to_permalink(df_merged.iloc[0])
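
The commented-out checks above can be turned into explicit assertions. In this sketch the two functions are repeated so the block runs on its own, and a plain dict stands in for a dataframe row; the values are copied from the sample output earlier in the notebook.

```python
import os

def npy_to_jpg(path):
    # Swap only the final extension; dots inside the htid are left alone
    return os.path.splitext(path)[0] + '.jpg'

def row_to_permalink(row):
    return "https://babel.hathitrust.org/cgi/pt?id={}&view=1up&seq={}".format(
        row['htid'], row['page_seq'])

# A dict supports the same row['field'] access as a dataframe row
row = {'htid': 'hvd.hnszhk', 'page_seq': 909,
       'vector_path': 'hvd/hz/hvd.hnszhk_00000909_01.npy'}

jpg_path = npy_to_jpg(row['vector_path'])
link = row_to_permalink(row)

assert jpg_path == 'hvd/hz/hvd.hnszhk_00000909_01.jpg'
assert link == 'https://babel.hathitrust.org/cgi/pt?id=hvd.hnszhk&view=1up&seq=909'

# An htid containing '$' and an extra dot is handled the same way
uc_path = npy_to_jpg('uc1/$66/uc1.$b264661_00000131_00.npy')
assert uc_path == 'uc1/$66/uc1.$b264661_00000131_00.jpg'
```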

In [157]:
# Add a 'roi_stubbypath' column holding the vector_path with a .jpg extension; use map() since a single column is a Series
df_merged['roi_stubbypath'] = df_merged['vector_path'].map(npy_to_jpg)

In [158]:
# Drop the 'vector_path' column -- need to do this for the roi-table in Zenodo anyway
df_merged = df_merged.drop(columns=['vector_path'])

In [159]:
# Use apply() for doing row-wise operations that produce a new column (axis 1) using multiple row values
df_merged['permalink'] = df_merged.apply(row_to_permalink, axis=1)

In [160]:
df_merged.sample(5)

Unnamed: 0,htid,rights_date_used,author,title,imprint,page_seq,page_label,crop_no,roi_stubbypath,permalink
129,uc1.$b216978,1831,,The Naturalist,"Peirce and Parker, and Carter and Hendee], 183...",318,plate_image,0,uc1/$17/uc1.$b216978_00000318_00.jpg,https://babel.hathitrust.org/cgi/pt?id=uc1.$b2...
2044,hvd.32044044506970,1850,,The Farmer's almanack.,"Carter, Hendee and Co., 1836-1847.",71,inline_image,0,hvd/34407/hvd.32044044506970_00000071_00.jpg,https://babel.hathitrust.org/cgi/pt?id=hvd.320...
2401,nyp.33433105233229,1842,,Youth's keepsake.,"Carter and Hendee,",155,inline_image,0,nyp/33032/nyp.33433105233229_00000155_00.jpg,https://babel.hathitrust.org/cgi/pt?id=nyp.334...
2309,osu.32435012029989,1834,"Emerson, B. D. 1781-1872.","The national spelling-book, and pronouncing tu...","Carter, Hendee & co., 1834, [i.e. 1835]",6,plate_image,0,osu/33128/osu.32435012029989_00000006_00.jpg,https://babel.hathitrust.org/cgi/pt?id=osu.324...
1033,hvd.hn5dhi,1834,"Goodrich, Samuel G. 1793-1860.","The every day book, for youth. By Peter Parley...","Carter, Hendee and co., 1834.",61,inline_image,0,hvd/hd/hvd.hn5dhi_00000061_00.jpg,https://babel.hathitrust.org/cgi/pt?id=hvd.hn5...


In [161]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path
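
The ROI paths in the sample output appear to follow a regular pattern: library code, then every third character of the cleaned volume ID, then a filename built from the htid, an 8-digit page sequence and a 2-digit crop number. The sketch below rebuilds a path from those parts. The helper names `encode_volid` and `roi_jpeg_path` are hypothetical, and the pattern is inferred from the sample rows, not taken from the code that originally generated the paths.

```python
import posixpath

def encode_volid(volid):
    # Same character substitutions as _id_encode above
    return volid.replace(":", "+").replace("/", "=").replace(".", ",")

def roi_jpeg_path(htid, page_seq, crop_no):
    '''
    Hypothetical helper: rebuild an ROI JPEG path from its parts.
    The filename layout is inferred from the sample rows in this notebook.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = encode_volid(volid)
    stub = volid_clean[::3]  # every third character of the cleaned volume ID
    filename = "{}.{}_{:08d}_{:02d}.jpg".format(libid, volid_clean, page_seq, crop_no)
    # posixpath keeps '/' as the separator on every platform
    return posixpath.join(libid, stub, filename)

# Checked against rows shown in the sample output
print(roi_jpeg_path('hvd.hnszhk', 909, 1))
print(roi_jpeg_path('uc1.$b264661', 131, 0))
print(roi_jpeg_path('nyp.33433011003880', 102, 0))
```

If the pattern holds across the whole table, a helper like this could replace the extension swap, but that would need to be verified against the full ROI file first.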

## Create PixPlot dataframe with appropriate field names

See https://github.com/YaleDHLab/pix-plot for what metadata fields are accepted.

In [187]:
def row_to_description(row):
    '''
    :param row: A dataframe row that contains the fields: author, imprint, title
    :return: A volume description for PixPlot with pipe-separated fields.
    '''
    return " | ".join([row['title'], row['author'], row['imprint']])

def path_to_gcloud_url(path):
    '''
    :param path: A stubbytree JPEG path
    :return: The URL of the JPEG in my Google Cloud bucket "hathitrust-full_1800-50"
    NOTE: I could map this over the roi_stubbypath column, but better to do this step when importing data
    '''
    # Plain concatenation keeps '/' as the separator; os.path.join would use '\' on Windows
    return "gs://hathitrust-full_1800-50/" + path
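
The description join only works when every field is a string, which is why missing values were replaced with "None" earlier. A minimal sketch of both cases follows, with the function repeated so the block runs on its own; the row values are taken from the sample output.

```python
def row_to_description(row):
    return " | ".join([row['title'], row['author'], row['imprint']])

# A dict stands in for a dataframe row; the author is "None" after fillna
row = {'title': "Youth's keepsake.", 'author': 'None',
       'imprint': 'Carter and Hendee,'}
description = row_to_description(row)
print(description)

# pandas represents a missing string as a float NaN, which str.join rejects
broken = dict(row, author=float('nan'))
try:
    row_to_description(broken)
    failed = False
except TypeError:
    failed = True
print(failed)
```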

In [183]:
# Build the PixPlot dataframe in a single constructor call. Note that 'label' is used for supervised UMAP
# projection, while 'category' is "a categorical label for the image"
# TODO: search_label --> QUERY_LABEL for clarity
# TODO: add gs:// bucket URL directly to dataframe? Recall that the csv in Zenodo points to vectors.tar...

df_pixplot = pd.DataFrame({
    'filename': df_merged['roi_stubbypath'],
    'year': df_merged['rights_date_used'],
    'description': df_merged.apply(row_to_description, axis=1),
    'label': search_label,
    'category': df_merged['page_label'],
    'permalink': df_merged['permalink']
})

In [188]:
df_pixplot.sample(5)

Unnamed: 0,filename,year,description,label,category,permalink
505,uc1/$66/uc1.$b264661_00000286_02.jpg,1830,The young lady's book : a manual of elegant re...,carter-hendee,inline_image,https://babel.hathitrust.org/cgi/pt?id=uc1.$b2...
2065,hvd/34407/hvd.32044044506970_00000025_00.jpg,1850,"The Farmer's almanack. | None | Carter, Hendee...",carter-hendee,inline_image,https://babel.hathitrust.org/cgi/pt?id=hvd.320...
1487,hvd/hz/hvd.hnszhk_00000172_00.jpg,1832,"A system of universal geography, popular and s...",carter-hendee,inline_image,https://babel.hathitrust.org/cgi/pt?id=hvd.hns...
871,nyp/33071/nyp.33433007272812_00000030_00.jpg,1833,Scenes of American wealth and industry in prod...,carter-hendee,inline_image,https://babel.hathitrust.org/cgi/pt?id=nyp.334...
1434,hvd/hz/hvd.hnszhk_00000491_01.jpg,1832,"A system of universal geography, popular and s...",carter-hendee,inline_image,https://babel.hathitrust.org/cgi/pt?id=hvd.hns...


In [185]:
# Another example of a filename containing '$'
df_pixplot.iloc[154]['filename']

'uc1/$66/uc1.$b264661_00000136_00.jpg'

In [186]:
# Brief excursus on Univ. of Calif. htids that contain '$', e.g. 'uc1/$66/uc1.$b264661_00000065_00.jpg'
# These can print oddly within Jupyter, which renders text between two '$' as LaTeX math:
# "uc1/ 66/𝑢𝑐1. b264661_00000065_00.jpg"
# But a UC cloud URL is fine: gs://hathitrust-full_1800-50/uc1/$66/uc1.$b161168_00000002_00.jpg
df_pixplot.iloc[430]['filename']

'uc1/$66/uc1.$b264661_00000065_00.jpg'

In [189]:
# Output should be:
# 1) CSV with Google Cloud URL and all the merged columns (cleaned up). Use this for pulling the data to the VM.
# 2) PixPlot CSV for running PixPlot analysis. This CSV has an augmented description column that combines book metadata fields.
# With a good naming scheme, each query will generate these two files + appropriate labels and categories.
# NOTE: this cell only writes file (2).

# use the search label to make a metadata path
metadata_csv = "{}_pixplot-metadata.csv".format(search_label)

# save as a CSV that PixPlot can accept
df_pixplot.to_csv(metadata_csv, sep=',', header=True, index=False)
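
The first output file described in the comment, the merged columns plus a Google Cloud URL, could be produced along these lines. This is a sketch under assumptions: `df_demo` stands in for `df_merged` with two rows copied from the sample output, and the column name `gcloud_url` and the file name suffix `_merged-rois.csv` are hypothetical choices.

```python
import io

import pandas as pd

BUCKET = "gs://hathitrust-full_1800-50"  # bucket name from path_to_gcloud_url

# Two rows standing in for df_merged
df_demo = pd.DataFrame({
    "htid": ["hvd.hnszhk", "uc1.$b264661"],
    "roi_stubbypath": ["hvd/hz/hvd.hnszhk_00000909_01.jpg",
                       "uc1/$66/uc1.$b264661_00000131_00.jpg"],
})

# String concatenation keeps '/' as the separator on every platform
df_demo["gcloud_url"] = BUCKET + "/" + df_demo["roi_stubbypath"]

# Hypothetical file name, mirroring the PixPlot metadata file name
search_label = "carter-hendee"
merged_csv = "{}_merged-rois.csv".format(search_label)

# Written to an in-memory buffer here; pass merged_csv to to_csv() to write to disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Keeping the URL column out of the PixPlot file and in this second file matches the note in `path_to_gcloud_url` about adding the bucket prefix only when importing data.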