# Visualize all illustrations for a given query

The project data can be unwieldy to work with. In many cases, you want to isolate a subset of the 2.5+ million illustrated regions so that analysis can be done at a smaller scale and more quickly. This notebook shows how to get started with such research. The goal is to find the metadata for all books published in 1800-1850 that match a query: for example, books issued by the Boston firm Munroe & Francis or, as in the cells below, books with "Peter Parley" in the title.

1. Search a Hathifile excerpt with a metadata query to create the set of volume IDs to visualize.
2. Using the volume IDs from Step 1, find all illustrated pages in those volumes. This requires a file that is in the Zenodo repository or that mimics the Google Cloud bucket.
3. Create a PixPlot metadata file that joins the Hathifile information with the stubbytree paths.
4. Get the image assets from cloud storage.
5. Run PixPlot processing.
6. Visualize with a PixPlot instance.

## Step 1: Filter Hathifile using metadata; extract matching rows

Hathifiles can be very big, so we iteratively search them for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and the name of a firm can be written in slightly different ways (e.g. '&' vs. 'and').
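
As a small illustration of that finesse, the sketch below tests a single pattern against several spellings of one firm's name. The pattern is the Munroe & Francis expression used later in this notebook; the imprint strings are mostly invented for illustration and are not taken from the Hathifile.

```python
import re

# Accept a comma, " and", or " &" between the two names, ignoring case
PUBLISHER_RE = re.compile(r'\bMunroe(?:,| and| &) Francis\b', flags=re.IGNORECASE)

# Hypothetical imprint strings, invented for illustration
imprints = [
    "Munroe & Francis, etc., etc., 1838.",
    "Munroe and Francis, 1825",
    "MUNROE, FRANCIS & PARKER, 1811",
    "Monroe & Francis, 1830",   # different spelling: not matched
    "Munroe Francis, 1830",     # no separator: not matched
]

matched = [s for s in imprints if PUBLISHER_RE.search(s)]
print(matched)
```

Spelling variants such as "Monroe" would need their own alternation in the pattern, so it is worth inspecting the matches by eye before trusting a count.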

In [1]:
import pandas as pd
import os, re

In [2]:
# Files that we need to open/access to generate the PixPlot metadata file. It's simplest to use absolute paths.
# These files are available in the Zenodo repository:
# http://zenodo.org/record/3940528#.XyRNSZ5KjIU

# 1. Hathifile subset, for performing basic metadata queries
HATHIFILE = "/home/stephen-krewson/project-hathi-images/metadata/1800-1850_hathifile.txt.gz"
HATHICOLS = "/home/stephen-krewson/project-hathi-images/metadata/hathifile_columns.txt"

# 2. Flat file of all 2.5m "regions of interest" (illustrations). Compression takes it from 200 MB to 23 MB!
ROIFILE = "/home/stephen-krewson/project-hathi-images/metadata/1800-1850_roi-table.csv.gz"

In [3]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    '''
    :param ht_file: A tab-separated Hathifile (may be gzip-compressed).
    :param col_file: A file whose first line holds the tab-separated Hathifile column names
    :param search_col: The field/column on which to search
    :param search_expr: A regular expression against which search_col values can be compared
    :return: A pandas dataframe of rows in which search_col matches search_expr
    '''
    # Read the column names from the first line of col_file
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')

    # Read the Hathifile in chunks so the search scales to full Hathifiles

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields (this may not be necessary...)
        dtype={
            'htid': 'str',
            'rights_date_used': 'object', # values NOT guaranteed to be numeric
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
        iterator=True,
        chunksize=5000,
        on_bad_lines='skip')  # pandas >= 1.3; older versions use error_bad_lines=False

    # Collect the matching rows of each chunk and concatenate once at the end
    matches = []

    for chunk in iter_csv:
        # Apply the query, ignoring case/capitalization
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)
        # The Hathifile row index has no relation to the Neighbor tree, so it is reset below
        matches.append(chunk[condition])

    if not matches:
        return pd.DataFrame(columns=col_names)
    return pd.concat(matches, ignore_index=True)
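
To see the chunked search pattern in isolation, here is a minimal sketch that runs the same kind of query over an in-memory table. The three-column layout and the `aaa.*` rows are hypothetical stand-ins, not real Hathifile content.

```python
import io
import re
import pandas as pd

# Hypothetical tab-separated excerpt (not real Hathifile rows); row 3 has no title
tsv = (
    "aaa.001\tPeter Parley's tales\tBoston, 1833\n"
    "aaa.002\tA treatise on surveying\tNew York, 1840\n"
    "aaa.003\t\tLondon, 1846\n"
    "aaa.004\tTales by the author of PETER PARLEY\tBoston, 1844\n"
)

# chunksize makes read_csv return an iterator of DataFrames instead of one frame
reader = pd.read_csv(io.StringIO(tsv), sep='\t', header=None,
                     names=['htid', 'title', 'imprint'], dtype=str, chunksize=2)

matches = []
for chunk in reader:
    # na=False counts a missing title as a non-match rather than propagating NaN
    mask = chunk['title'].str.contains(r'peter.*parley', na=False, flags=re.IGNORECASE)
    matches.append(chunk[mask])

result = pd.concat(matches, ignore_index=True)
print(result['htid'].tolist())
```

Only one chunk is held in memory at a time, which is what lets the same loop scale from this toy table to a full Hathifile.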

In [4]:
# N.B. Queries should be (manually) recorded in datasets/queries.json

# SEARCHES ON PUBLISHER
#search_col = 'imprint'

# For example, to find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis":
#search_label = 'munroe-francis'
#search_expr = r'\bMunroe(?:,| and| &) Francis\b'

#search_label = 'carter-hendee'
#search_expr = r'\bCarter(?:,| and| &) Hendee\b'


# SEARCHES ON TITLE
search_col = 'title'

# For example, to find any title with 'peter' followed by 'parley' separated by anything
search_label = 'peter-parley'
search_expr = r'peter.*parley'

In [5]:
# Run the configured query and report the shape of the result: (matching rows, columns)
df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(179, 26)

In [6]:
# Only use a few of the HathiFile columns: title, rights_date_used, imprint
df.columns

Index(['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source',
       'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title',
       'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag',
       'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code',
       'content_provider_code', 'responsible_entity_code',
       'digitization_agent_code', 'access_profile_code', 'author'],
      dtype='object')

In [7]:
# Convert the year strings to integers (this raises if any value is missing or non-numeric)
df['rights_date_used'] = pd.to_numeric(df['rights_date_used']).astype(int)

# Replace all NaN values with the string "None" so that later string operations
# (e.g. building a description from a missing author) do not fail
df = df.fillna("None")

# Define the columns we want to view/keep from the Hathifile; show a sample
hf_cols = ['htid', 'rights_date_used', 'author', 'title', 'imprint']
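
Because `rights_date_used` is not guaranteed to be numeric, the `pd.to_numeric(...).astype(int)` conversion can raise on other queries. A more tolerant variant is sketched below; the year values are hypothetical, and dropping unparseable rows is an assumption about what you want, not the project's stated policy.

```python
import pandas as pd

# Hypothetical year values; "18uu" stands in for an unparseable entry
years = pd.Series(["1833", "1840", "18uu", None])

# errors='coerce' turns anything unparseable into NaN instead of raising
parsed = pd.to_numeric(years, errors='coerce')

# Drop the entries without a usable year, then cast the rest to int
clean = parsed.dropna().astype(int)
print(clean.tolist())
```

In the notebook itself the same idea would mean filtering `df` on `parsed.notna()` before the cast, so that the other columns stay aligned with the years.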

In [8]:
# Look at some of the data
df[hf_cols].sample(10)

Unnamed: 0,htid,rights_date_used,author,title,imprint
169,chi.35737525,1833,"Goodrich, Samuel G. 1793-1860",Peter Parley's tales about ancient and modern ...,"Charles Desilver, 1855, c1833."
90,njp.32101066459494,1837,"Goodrich, Samuel G. 1793-1860.",Peter Parley's common school history : illustr...,"E.H. Butler & Co., 1848, c1837."
76,nyp.33433074864525,1840,"Goodrich, Samuel G. 1793-1860.","Moral tales: or, A selection of interesting st...",Nafis & Cornish [1840]
171,chi.39728497,1848,"Goodrich, Samuel G. 1793-1860",Illustrative anecdotes of the animal kingdom /...,"C. H. Peirce and G. C. Rand , 1848."
139,hvd.hn5dnf,1838,"Goodrich, Samuel G. 1793-1860,",Peter Parley's book of Bible stories for child...,"Munroe & Francis, etc., etc., 1838."
130,hvd.32044020537320,1844,"Goodrich, Samuel G. 1793-1860.",Curiosities of human nature: by the author of ...,"Bradbury, Soden & co., 1844."
40,hvd.hn5e4a,1844,"Goodrich, Samuel G. 1793-1860.",Lights and shadows of European history: by the...,"Bradbury, Soden & Co., 1844."
134,hvd.hn5e4i,1845,"Goodrich, Samuel G. 1793-1860.","Literature, ancient and modern, with specimens...","Bradbury, Soden & co., 1845."
166,iau.31858048131597,1847,"Goodrich, Samuel G. 1793-1860.",Inquisitive Jack and his Aunt Mary / by Peter ...,"Darton and Clark, 1847"
120,hvd.hn5dne,1838,"Goodrich, Samuel G. 1793-1860.",Peter Parley's juvenile tales / Peter Parley.,"Thomas, Cowperthwait & Co., 1838."


In [25]:
# Cell for more closely inspecting something from the sample
df.iloc[72]['title']

'Short stories, or A selection [of] interesting tales / by the author of Peter Parley.'

## Interlude: Examine imprints

With the query htids in hand, we can look at the most common publishers.

In [18]:
# Boston's Bradbury, Soden & Co. starts to dominate in the 1840s due to Parley's Cabinet Library
# See: http://worldcat.org/identities/lccn-no99009453/ (find identities for other publishers, too)
# TODO: Each case study / publisher should have an argument or some kind of analysis attached to it, e.g. change over time.
df['imprint'].map(lambda x: ''.join(filter(str.isalpha, x))).value_counts().head(10)

BradburySodenCo         20
BradburySoden           14
BradburySodenco         13
JAllen                   6
BradburySodenandco       4
NafisCornish             4
CarterHendeeandco        3
SColman                  3
ThomasCowperthwaitco     3
LeavittAllenc            3
Name: imprint, dtype: int64
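
The counts above split one firm across several keys (`BradburySodenCo`, `BradburySoden`, `BradburySodenco`, `BradburySodenandco`). One possible way to merge them is sketched here; the `normalize_imprint` helper and its stop list are assumptions for illustration, not project code, and the third sample string is invented.

```python
import re

# Tokens treated as noise; this stop list is an assumption, not project code
STOP_TOKENS = {"and", "co", "etc"}

def normalize_imprint(imprint):
    '''Lower-case the alphabetic tokens of an imprint and drop filler words.'''
    tokens = re.findall(r'[a-z]+', imprint.lower())
    return ''.join(t for t in tokens if t not in STOP_TOKENS)

samples = [
    "Bradbury, Soden & Co., 1844.",
    "Bradbury, Soden & co., 1845.",
    "Bradbury, Soden and co., 1846.",   # hypothetical variant
]
print({normalize_imprint(s) for s in samples})
```

It could be swapped in as `df['imprint'].map(normalize_imprint).value_counts()`. Note that it would also merge firms whose names differ only by a stop token, so the resulting groups still deserve a manual check.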

In [26]:
# Read in the project ROI metadata in preparation for merging on htids
# This takes just a few seconds, and the compression method is inferred automatically from the .gz extension
df_rois = pd.read_csv(ROIFILE)
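
The inferred compression can be checked with a quick round trip. This is a minimal sketch on a hypothetical two-row table written to a temporary directory; it does not touch the real ROI file.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row table standing in for the ROI file
small = pd.DataFrame({'htid': ['aaa.001', 'aaa.002'], 'page_seq': [12, 40]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'roi-sample.csv.gz')
    small.to_csv(path, index=False)   # gzip-compressed because of the .gz suffix
    restored = pd.read_csv(path)      # decompressed the same way on read
    with open(path, 'rb') as fp:
        magic = fp.read(2)            # gzip files start with the bytes 0x1f 0x8b

print(restored)
```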

In [27]:
df_rois.sample(10)

Unnamed: 0,htid,page_seq,page_label,crop_no,vector_path
1101204,mdp.39015061238245,1083,inline_image,0,mdp/31634/mdp.39015061238245_00001083_00.npy
611031,nyp.33433081787024,189,plate_image,0,nyp/33882/nyp.33433081787024_00000189_00.npy
359785,umn.31951002791068r,130,inline_image,0,umn/35096/umn.31951002791068r_00000130_00.npy
2198681,hvd.hwsq9x,265,inline_image,0,hvd/hq/hvd.hwsq9x_00000265_00.npy
1910211,uc1.32106013839391,23,inline_image,1,uc1/30139/uc1.32106013839391_00000023_01.npy
2126901,hvd.32044107305351,99,plate_image,0,hvd/34005/hvd.32044107305351_00000099_00.npy
2057052,hvd.hxjftn,440,plate_image,0,hvd/hf/hvd.hxjftn_00000440_00.npy
716884,nyp.33433099915955,658,plate_image,0,nyp/33915/nyp.33433099915955_00000658_00.npy
367423,umn.31951002319715k,8,plate_image,0,umn/35011/umn.31951002319715k_00000008_00.npy
2371499,hvd.32044067944009,38,plate_image,0,hvd/34640/hvd.32044067944009_00000038_00.npy


In [28]:
# Now merge on htid
# https://towardsdatascience.com/guide-to-big-data-joins-python-sql-pandas-spark-dask-51b7f4fec810
# df_rois has repeated htids, since each volume can have many pages and ROIs; hence validate='one_to_many'
df_merged = pd.merge(df[hf_cols], df_rois, on='htid', how='inner', validate='one_to_many')
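
The effect of `how='inner'` and `validate='one_to_many'` is easier to see on toy data. The two small tables below are hypothetical; the sketch shows the fan-out of one volume row to several ROI rows, and the error raised when the volume side contains a repeated htid.

```python
import pandas as pd

# Hypothetical volume-level and ROI-level tables
volumes = pd.DataFrame({'htid': ['aaa.001', 'aaa.002'], 'title': ['First', 'Second']})
rois = pd.DataFrame({'htid': ['aaa.001', 'aaa.001', 'aaa.003'], 'page_seq': [5, 9, 2]})

# Inner join keeps only htids present in both tables; each volume fans out to its ROIs
merged = pd.merge(volumes, rois, on='htid', how='inner', validate='one_to_many')
print(merged)

# A repeated htid on the volume side violates one_to_many and raises MergeError
dupes = pd.concat([volumes, volumes.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dupes, rois, on='htid', how='inner', validate='one_to_many')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

If the real query ever returns the same htid twice, the merge will fail in this way; `df.drop_duplicates(subset='htid')` before merging is one way to handle that.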

In [29]:
df_merged.sample(5)

Unnamed: 0,htid,rights_date_used,author,title,imprint,page_seq,page_label,crop_no,vector_path
2032,nyp.33433003344607,1846,"Goodrich, Samuel G. 1793-1860.","A national geography, for schools : illus. by ...","Huntington & Savage, 1846",111,inline_image,2,nyp/33040/nyp.33433003344607_00000111_02.npy
3446,nyp.33433082329776,1837,"Goodrich, Samuel G. 1793-1860.",Peter Parley's common school history : illustr...,"Marshall, Williams & Butler, 1841, c1837.",230,inline_image,0,nyp/33827/nyp.33433082329776_00000230_00.npy
4144,njp.32101066121961,1846,"Goodrich, Samuel G. 1793-1860.",Tales of sea and land / by the author of Peter...,"Sorin and Ball, 1846",57,inline_image,0,njp/30626/njp.32101066121961_00000057_00.npy
7033,coo.31924023244852,1846,"Goodrich, Samuel G. 1793-1860.","Tales about Europe, Asia, Africa, and America ...","T. Tegg, [1846]",45,inline_image,0,coo/32245/coo.31924023244852_00000045_00.npy
7269,osu.32435067004739,1843,"Goodrich, Samuel G. 1793-1860.","What to do, and how to do it, or, Morals and m...","Sheldon, 1859, c1843.",75,inline_image,0,osu/33603/osu.32435067004739_00000075_00.npy


### Augment the merged data

- Add HathiTrust permalink URL
- Ensure canonical stubbytree path to ROI JPEG (currently just swapping extensions and dropping vector_path column)

In [30]:
# Create functions to swap in .jpg extension and create permalink
# TODO: Look in ACS-Krewson and find the actual code used to create the stubbytree paths
def npy_to_jpg(path):
    '''
    param path: The .npy vector associated with an ROI.
    return: Same stubbytree path, but with a JPEG extension instead.
    '''
    return os.path.splitext(path)[0] + '.jpg'

def row_to_permalink(row):
    '''
    param row: Pandas dataframe row containing the following fields: htid, page_seq
    return: Permalink to HathiTrust view of given page
    '''
    return "https://babel.hathitrust.org/cgi/pt?id={}&view=1up&seq={}".format(row['htid'], row['page_seq'])

In [31]:
# Add a roi_stubbypath column holding each vector_path with a .jpg extension; use map() since a single column is a Series
df_merged['roi_stubbypath'] = df_merged['vector_path'].map(npy_to_jpg)

In [32]:
# Drop the 'vector_path' column -- need to do this for the roi-table in Zenodo anyway
df_merged = df_merged.drop(columns=['vector_path'])

In [33]:
# Use apply() for doing row-wise operations that produce a new column (axis 1) using multiple row values
df_merged['permalink'] = df_merged.apply(row_to_permalink, axis=1)
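
Both derived columns can also be built with vectorized string operations rather than `map()` and a row-wise `apply()`. The sketch below is an alternative, not the notebook's method; its two rows are modeled on the sample output above.

```python
import pandas as pd

# Two rows modeled on the sample output above
rows = pd.DataFrame({
    'htid': ['nyp.33433082329776', 'hvd.hn5dhi'],
    'page_seq': [230, 58],
    'vector_path': ['nyp/33827/nyp.33433082329776_00000230_00.npy',
                    'hvd/hd/hvd.hn5dhi_00000058_00.npy'],
})

# Swap the extension with a regular expression anchored at the end of the string
rows['roi_stubbypath'] = rows['vector_path'].str.replace(r'\.npy$', '.jpg', regex=True)

# Build the permalink by concatenating columns; page_seq must be cast to str first
rows['permalink'] = ("https://babel.hathitrust.org/cgi/pt?id=" + rows['htid']
                     + "&view=1up&seq=" + rows['page_seq'].astype(str))

print(rows[['roi_stubbypath', 'permalink']].to_string(index=False))
```

On a few thousand rows the difference in speed is small, so the choice is mostly about readability.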

In [34]:
df_merged.sample(5)

Unnamed: 0,htid,rights_date_used,author,title,imprint,page_seq,page_label,crop_no,roi_stubbypath,permalink
6003,umn.31951001683708s,1837,"Goodrich, Samuel G. 1793-1860.","Peter Parley's book of the United States, geog...","C. J. Hendee, 1837.",9,inline_image,0,umn/35080/umn.31951001683708s_00000009_00.jpg,https://babel.hathitrust.org/cgi/pt?id=umn.319...
4270,njp.32101066459494,1837,"Goodrich, Samuel G. 1793-1860.",Peter Parley's common school history : illustr...,"E.H. Butler & Co., 1848, c1837.",184,inline_image,0,njp/30659/njp.32101066459494_00000184_00.jpg,https://babel.hathitrust.org/cgi/pt?id=njp.321...
7552,iau.31858048131597,1847,"Goodrich, Samuel G. 1793-1860.",Inquisitive Jack and his Aunt Mary / by Peter ...,"Darton and Clark, 1847",151,inline_image,0,iau/35439/iau.31858048131597_00000151_00.jpg,https://babel.hathitrust.org/cgi/pt?id=iau.318...
3362,nyp.33433082003488,1839,"Goodrich, Samuel G. 1793-1860.",Peter Parley's picture book.,S. Colman [1839],131,inline_image,0,nyp/33808/nyp.33433082003488_00000131_00.jpg,https://babel.hathitrust.org/cgi/pt?id=nyp.334...
3697,hvd.hn5dhi,1834,"Goodrich, Samuel G. 1793-1860.","The every day book, for youth. By Peter Parley...","Carter, Hendee and co., 1834.",58,inline_image,0,hvd/hd/hvd.hn5dhi_00000058_00.jpg,https://babel.hathitrust.org/cgi/pt?id=hvd.hn5...


In [35]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path
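
These utilities build volume-level paths; the ROI paths in this notebook also carry a page and crop suffix. The self-contained sketch below shows how the two might fit together. `roi_stubbypath` is a hypothetical helper: its zero-padded `_<page_seq>_<crop_no>` suffix is inferred from the sample rows in this notebook, and applying the encoding to the filename is likewise an assumption rather than the project's confirmed code.

```python
import posixpath

def encode_volid(volid):
    '''Same character substitutions as _id_encode above.'''
    return volid.replace(":", "+").replace("/", "=").replace(".", ",")

def roi_stubbypath(htid, page_seq, crop_no, ext="jpg"):
    '''
    Hypothetical helper: the filename pattern is inferred from sample rows,
    not taken from the project's own code.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = encode_volid(volid)
    stub = volid_clean[::3]  # every third character of the encoded volume ID
    filename = "{}.{}_{:08d}_{:02d}.{}".format(libid, volid_clean, page_seq, crop_no, ext)
    return posixpath.join(libid, stub, filename)

print(roi_stubbypath("nyp.33433082329776", 230, 0))
print(roi_stubbypath("hvd.hn5dhi", 58, 0))
```

The stub directory is simply every third character of the encoded volume ID, which is why `33433082329776` lands in `33827` and `hn5dhi` in `hd`.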

## Create PixPlot dataframe with appropriate field names

See https://github.com/YaleDHLab/pix-plot for the metadata fields that PixPlot accepts.

In [36]:
def row_to_description(row):
    '''
    param row: A dataframe row that contains the fields: author, imprint, title
    returns: A volume description for PixPlot with pipe-separated fields.
    '''
    return " | ".join([row['title'], row['author'], row['imprint']])

def path_to_gcloud_url(path):
    '''
    param path: A stubbytree JPEG path
    returns: The path to the JPEG in my Google Cloud bucket "hathitrust-full_1800-50"
    NOTE: I could map this over the roi_stubbypath column, but better to do this step when importing data
    '''
    return os.path.join("gs://hathitrust-full_1800-50", path)
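
One caveat: `os.path.join` uses the separator of the local platform, so on Windows it would put a backslash into the URL. A variant based on `posixpath`, which always joins with forward slashes, is sketched below; the function name `path_to_gcloud_url_posix` is hypothetical.

```python
import posixpath

BUCKET = "gs://hathitrust-full_1800-50"

def path_to_gcloud_url_posix(path):
    '''Hypothetical variant of path_to_gcloud_url that always joins with "/".'''
    return posixpath.join(BUCKET, path)

print(path_to_gcloud_url_posix("uc1/$66/uc1.$b161168_00000002_00.jpg"))
```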

In [37]:
# A much more idiomatic way to do it! Note that 'label' is used for "supervised UMAP project" while
# 'category' is "a categorical label for the image"
# TODO: search_label --> QUERY_LABEL for clarity
# TODO: add gs:// bucket URL directly to dataframe? Recall that the csv in Zenodo points to vectors.tar...

df_pixplot = pd.DataFrame({
    'filename': df_merged['roi_stubbypath'],
    'year': df_merged['rights_date_used'],
    'description': df_merged.apply(row_to_description, axis=1),
    'label': search_label,
    'category': df_merged['page_label'],
    'permalink': df_merged['permalink']
})
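
Two details of this construction are worth seeing on a single row: a scalar such as the query label is broadcast to every row, and the description could also be assembled column-wise with `Series.str.cat` instead of a row-wise `apply()`. The sketch below is an alternative under those assumptions, using one row modeled on the sample output above.

```python
import pandas as pd

# One merged row modeled on the sample output above
merged = pd.DataFrame({
    'title': ["Peter Parley's picture book."],
    'author': ["Goodrich, Samuel G. 1793-1860."],
    'imprint': ["S. Colman [1839]"],
    'page_label': ['inline_image'],
})

# Join the three text columns with " | " without a row-wise apply()
description = merged['title'].str.cat([merged['author'], merged['imprint']], sep=' | ')

frame = pd.DataFrame({
    'description': description,
    'label': 'peter-parley',           # a scalar is broadcast to every row
    'category': merged['page_label'],
})
print(frame.loc[0, 'description'])
```

`str.cat` yields NaN for any row with a missing part unless `na_rep` is given, so the earlier `fillna("None")` step matters for this variant as well.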

In [39]:
df_pixplot.shape

(8250, 6)

In [40]:
df_pixplot.sample(5)

Unnamed: 0,filename,year,description,label,category,permalink
7294,osu/33603/osu.32435067004739_00000158_00.jpg,1843,"What to do, and how to do it, or, Morals and m...",peter-parley,inline_image,https://babel.hathitrust.org/cgi/pt?id=osu.324...
4918,hvd/hn/hvd.hn5nux_00000192_00.jpg,1844,"The manners, customs, and antiquities of the I...",peter-parley,inline_image,https://babel.hathitrust.org/cgi/pt?id=hvd.hn5...
4037,njp/30164/njp.32101015069741_00000118_00.jpg,1847,Peter Parley's geography for beginners : with ...,peter-parley,inline_image,https://babel.hathitrust.org/cgi/pt?id=njp.321...
7730,iau/35456/iau.31858048257269_00000502_00.jpg,1850,Peter Parley's universal history on the basis ...,peter-parley,inline_image,https://babel.hathitrust.org/cgi/pt?id=iau.318...
3513,nyp/33827/nyp.33433082329776_00000103_00.jpg,1837,Peter Parley's common school history : illustr...,peter-parley,inline_image,https://babel.hathitrust.org/cgi/pt?id=nyp.334...


In [41]:
# Brief excursus on Univ. of Calif. htids that contain '$', e.g. 'uc1/$66/uc1.$b264661_00000065_00.jpg'
# These will print oddly within Jupyter, which renders text between '$' signs as math: "uc1/ 66/𝑢𝑐1. b264661_00000065_00.jpg"
# But a UC cloud URL is fine: gs://hathitrust-full_1800-50/uc1/$66/uc1.$b161168_00000002_00.jpg
#df_pixplot.iloc[430]['filename']

In [42]:
# Output should be:
# 1) CSV with Google Cloud URL and all the merged columns (cleaned up). Use this for pulling the data to the VM.
# 2) PixPlot CSV for running PixPlot analysis. This CSV has augmented description column that combines book metadata fields.
# With a good naming scheme, each query will generate these two files + appropriate labels and categories.

# use the search label to make a metadata path
metadata_csv = "{}_pixplot-metadata.csv".format(search_label)

# N.B. Remember to place the file in the conventional location:
# /datasets/<search_label>/metadata/<search_label>_pixplot-metadata.csv

# save as a CSV that PixPlot can accept
df_pixplot.to_csv(metadata_csv, sep=',', header=True, index=False)

# TODO: Redo carter-hendee and munroe-francis jobs so they have the full number of columns
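
To avoid moving the file by hand, the conventional location could be built in code. The sketch below assumes a hypothetical `metadata_path` helper and uses a temporary directory in place of the real datasets root.

```python
import os
import tempfile

def metadata_path(datasets_root, label):
    '''Hypothetical helper: <root>/<label>/metadata/<label>_pixplot-metadata.csv.'''
    folder = os.path.join(datasets_root, label, "metadata")
    os.makedirs(folder, exist_ok=True)  # create the directories if they are missing
    return os.path.join(folder, "{}_pixplot-metadata.csv".format(label))

# A temporary directory stands in for the real datasets root
with tempfile.TemporaryDirectory() as root:
    out_path = metadata_path(root, "peter-parley")
    relative = os.path.relpath(out_path, root)
    folder_exists = os.path.isdir(os.path.dirname(out_path))

print(relative)
```

With the real root, the returned path could be passed straight to `df_pixplot.to_csv(...)`.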