<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Find-all-pages-with-images-in-the-workset" data-toc-modified-id="Find-all-pages-with-images-in-the-workset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Find all pages with images in the workset</a></span></li><li><span><a href="#Download-options-for-image-pages" data-toc-modified-id="Download-options-for-image-pages-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Download options for image pages</a></span></li><li><span><a href="#Feature-reader-token-counts" data-toc-modified-id="Feature-reader-token-counts-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature-reader token counts</a></span></li><li><span><a href="#Semantic-Similarity-Visualizations" data-toc-modified-id="Semantic-Similarity-Visualizations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Semantic Similarity Visualizations</a></span></li></ul></div>

In [2]:
from __future__ import print_function
from collections import defaultdict
from config import ht_keys as ht

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity

from hathidata.api import HathiDataClient
from htrc_features import FeatureReader

import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sys

%matplotlib notebook
sns.set()

ModuleNotFoundError: No module named 'gensim'

# Find all pages with images in the workset

After creating a collection on HathiTrust.org (while logged in), we can loop over the htids and find all the pages with images using the Data API.

In [None]:
# establish a connection to the data API; Workset Tool only allows downloads in capsule
access_key = ht['access_key']
secret_key = ht['secret_key']
data_api = HathiDataClient(access_key, secret_key)

# project string name conventions:
# HT METADATA is in         <project>_metadata.json
# IMAGE_ON_PAGE list is in  <project>_images.json
project = "parley-america"

# Method 1: Download JSON metadata after creating collection on HathiTrust.org while logged in
# My example collection is 328 19C books from Boston educational publisher Carter & Hendee
# Note: you will need to rename this file using the convention above *outside* of the script
metadata_path = project + "_metadata.json"

with open(metadata_path, "r") as fp:
    data = json.load(fp)
vol_ids = [item['htitem_id'] for item in data['gathers']]

# only query API if the JSON file does not exist
dict_path = project + "_images.json"

if not os.path.isfile(dict_path):
    
    # Dictionary keyed on htid stores image page lists
    volumes = {}
    
    for i,vol in enumerate(vol_ids):
        
        # status update
        print(i,vol)
        
        vol_meta = json.loads(data_api.request_vol_meta(vol, "json"))
        sequence = vol_meta['htd:seqmap'][0]['htd:seq']
        image_pages = [int(page['pseq']) for page in sequence if 'IMAGE_ON_PAGE' in page['htd:pfeat']]
        
        # add the list of image pages to the dictionary
        volumes[vol] = image_pages
        
    # outside of the loop, save as json
    with open(dict_path, "w") as fp:
        json.dump(volumes, fp)
        
else:
    with open(dict_path, "r") as fp:
        volumes = json.load(fp)
    

print("Total number of image pages:", sum(len(v) for v in volumes.values()))
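
The filtering step inside the loop can be pulled out into a small function, which makes it testable without API credentials. This is a minimal sketch: it assumes the response has the same shape the cell above indexes into (`htd:seqmap`, then `htd:seq`, then entries with `pseq` and `htd:pfeat`). The name `image_pages_from_meta` is hypothetical and the sample record is made up for illustration, not real API output.

```python
def image_pages_from_meta(vol_meta, feature="IMAGE_ON_PAGE"):
    """Return the sequence numbers of pages tagged with `feature`."""
    sequence = vol_meta["htd:seqmap"][0]["htd:seq"]
    return [int(page["pseq"]) for page in sequence
            if feature in page.get("htd:pfeat", [])]


# Made-up record mirroring the keys used in the cell above (not real API output)
sample_meta = {
    "htd:seqmap": [{
        "htd:seq": [
            {"pseq": "00000001", "htd:pfeat": ["FRONT_COVER", "IMAGE_ON_PAGE"]},
            {"pseq": "00000002", "htd:pfeat": []},
            {"pseq": "00000007", "htd:pfeat": ["IMAGE_ON_PAGE"]},
        ]
    }]
}

pages = image_pages_from_meta(sample_meta)
print(pages)
```

Keeping the parsing separate from the request also means a change in the response layout only has to be handled in one place.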

# Download options for image pages

HathiTrust says that the Data API is feasible for up to 10k volumes, and this workset (328 volumes) is well under that limit. Even so, it is very time consuming to fetch ~7k page images along with a sliding three-page window of raw OCR. The recommended solution is to ask HathiTrust for help in downloading a custom dataset with `rsync`. See:

- https://www.hathitrust.org/datasets
- https://gist.github.com/lit-cs-sysadmin/8ffb90911697adc1262c

However, for exploratory research, we want to be able to get started with the Data API. Our file structure will look like this:

```
<project>_metadata.json
<project>_images.json

<project>
-->img
   --><htid>
      --><pg>.png
         <pg>.png
         ...
   --><htid>
      --><pg>.png
         <pg>.png
         ...
   ...
-->ocr
   --><htid>
      --><pg>.txt
         <pg>.txt
         ...
   --><htid>
      --><pg>.txt
         <pg>.txt
         ...
   ...
```
        
In practice, the download takes a few hours, so I have broken the downloading code out into a separate script: `download_image_pages.py`.
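
Because a download of this length is likely to be interrupted, it helps if the script can resume. The sketch below is not the contents of `download_image_pages.py`; it only illustrates the directory layout above and a skip-if-present check. The helper names, the `fetch` callable standing in for the actual API request, the character substitution for htids, and the sample htid are all assumptions made for illustration. It fetches one file per page and does not implement the three-page OCR window.

```python
import os
import tempfile


def page_path(root, project, kind, htid, pg):
    """Build <root>/<project>/<kind>/<htid>/<pg>.<ext> for kind in {'img', 'ocr'}."""
    ext = {"img": "png", "ocr": "txt"}[kind]
    # Assumption: htids can contain ':' and '/', which are unsafe in directory
    # names, so they are swapped for '+' and '=' here
    safe_htid = htid.replace(":", "+").replace("/", "=")
    return os.path.join(root, project, kind, safe_htid, "{}.{}".format(pg, ext))


def download_missing(root, project, volumes, fetch):
    """Call fetch(htid, pg, kind) only for files not already on disk."""
    written = 0
    for htid, pages in volumes.items():
        for pg in pages:
            for kind in ("img", "ocr"):
                path = page_path(root, project, kind, htid, pg)
                if os.path.isfile(path):
                    continue  # already downloaded on an earlier run
                os.makedirs(os.path.dirname(path), exist_ok=True)
                with open(path, "wb") as fp:
                    fp.write(fetch(htid, pg, kind))
                written += 1
    return written


def fake_fetch(htid, pg, kind):
    """Stand-in for the real API request; returns placeholder bytes."""
    return b"placeholder"


demo_root = tempfile.mkdtemp()
demo_volumes = {"uc2.ark:/13960/t4mk66f1d": [1, 7]}  # htid format for illustration only
first_run = download_missing(demo_root, "parley-america", demo_volumes, fake_fetch)
second_run = download_missing(demo_root, "parley-america", demo_volumes, fake_fetch)
print(first_run, second_run)
```

The second call writes nothing because every target file already exists, which is the behavior that makes restarting after a failure cheap.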

# Feature-reader token counts

In [None]:
# init feature-reader on volume list
fr = FeatureReader(ids=vol_ids)

# build up dictionary keyed on tokens, sum counts over all volumes
counts = defaultdict(lambda: 0)
for vol in fr:
    VTL = vol.tokenlist(pages=False,case=False,pos=False,page_freq=False)
    VTL = VTL.reset_index()
    temp = VTL.to_dict('index')
    for k,v in temp.items():
        counts[v['lowercase']] += int(v['count'])

# mappings for both directions: just use an incrementing index as the ID
id2word = {}
word2id = {}
for i,k in enumerate(counts.keys()):
    id2word[i] = k
    word2id[k] = i
    
# additional brackets since from_corpus method takes a list argument
corpus = [[(word2id[k],v) for k,v in counts.items()]]

# gensim format, including the mapping
dct = Dictionary.from_corpus(corpus, id2word=id2word)

# flat list of page BoW lists (not grouped by volume)
corpus_bows = []

# nested list of (year, pg) tuples, one inner list per volume
corpus_idxs = []

for i,vol in enumerate(fr):
    # tokenlist from the entire volume, preserving page location info
    PTL = vol.tokenlist(case=False,pos=False)
    bow_idxs = []

    for pg in volumes[vol.id]:
        try:
            bow = [(dct.token2id[k[1]],int(v['count'])) for k,v in PTL.loc[pg].to_dict('index').items()]
            corpus_bows.append(bow)
            bow_idxs.append((vol.year,pg))
        except KeyError:
            # page is absent from the token list, or a token is missing from dct: skip it
            pass
    
    # convenient for knowing how many pages and which pages for each volume
    corpus_idxs.append(bow_idxs)

# create the model using BoWs over all volumes
tfidf_model = TfidfModel(corpus_bows)

# all pairwise term frequency-inverse document frequency similarities
tfidf_sims = MatrixSimilarity(tfidf_model[corpus_bows], num_features=len(dct))
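
To make the numbers in the similarity matrix easier to interpret, here is a dependency-free sketch of tf-idf weighting followed by cosine similarity on toy bags of words. It assumes a `count * log2(N / df)` weight with unit-length normalization; gensim's `TfidfModel` weighting is configurable, so this may not match the settings in effect above. The function names and the toy data are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(bows):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bows)
    # document frequency: in how many BoWs each token id appears
    df = Counter(tok for bow in bows for tok, _ in bow)
    vectors = []
    for bow in bows:
        vec = {tok: cnt * math.log2(n_docs / df[tok]) for tok, cnt in bow}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


# Three toy "pages"; token ids are arbitrary
toy_bows = [
    [(0, 2), (1, 1)],
    [(0, 1), (1, 2)],
    [(2, 3)],
]
vecs = tfidf_vectors(toy_bows)
sims = [[cosine(u, v) for v in vecs] for u in vecs]
print(sims[0][1], sims[0][2])
```

Pages that share vocabulary in different proportions score between 0 and 1, and pages with no tokens in common score 0, which is why the heatmaps are mostly pale with a few dark cells.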

# Semantic Similarity Visualizations

We visualize the pairwise tf-idf similarities between image pages as heatmaps, where each cell compares one page with another.

In [None]:
# materialize the pairwise similarities as a dense array (one row per indexed page)
sim_matrix = np.asarray(list(tfidf_sims))

# top right block of the matrix: pages of the 0th vol (rows) vs. pages of all later vols (columns)
size1 = len(corpus_idxs[0])
block1 = sim_matrix[:size1, size1:]

# heatmap of the top right block
plt.figure()
plt.title("1830 vs. 1827/1845 Tales about America (tf-idf similarity)")
sns.heatmap(block1,
            xticklabels=[itm for sublist in corpus_idxs[1:] for itm in sublist],
            yticklabels=corpus_idxs[0],
            vmax=0.5,
            linewidths=0.5,
            cmap="YlGnBu")

# corpus_idxs is a list of lists, so flatten it to label the full matrix
all_idxs = [itm for sublist in corpus_idxs for itm in sublist]

plt.figure()
plt.title("Full similarity matrix")
sns.heatmap(sim_matrix,
            xticklabels=all_idxs,
            yticklabels=all_idxs,
            vmax=0.35,
            linewidths=0.5,
            cmap="YlGnBu")
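
Since pages are stored as one flat list, any volume-versus-volume block of the matrix is determined by the running totals of pages per volume. The sketch below computes those offsets from a list shaped like `corpus_idxs`; `volume_slices` is a hypothetical helper, and the toy index list and matrix are made-up values.

```python
from itertools import accumulate


def volume_slices(corpus_idxs):
    """Return, for each volume, the slice of rows/columns its pages occupy."""
    sizes = [len(pages) for pages in corpus_idxs]
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return [slice(s, e) for s, e in zip(starts, ends)]


# Toy index list: three volumes with 2, 3, and 1 image pages (made-up values)
toy_idxs = [[(1830, 5), (1830, 9)],
            [(1827, 4), (1827, 8), (1827, 12)],
            [(1845, 6)]]
slices = volume_slices(toy_idxs)

# Rows of volume 0 against columns of volume 2 in a nested-list matrix
toy_matrix = [[r * 6 + c for c in range(6)] for r in range(6)]
block = [row[slices[2]] for row in toy_matrix[slices[0]]]
print(slices, block)
```

With a NumPy array the same slices can be used directly, for example `matrix[slices[0], slices[2]]`, so each pair of volumes can get its own heatmap.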