## Download images from Google Cloud Storage

The gsutil URIs for the dataset are of the form:

```gs://hathitrust-full_1800-50/chi/000/chi.085041031_00000006_00.jpg```

The bucket name is **hathitrust-full_1800-50** and the storage class is Nearline (multiple zones). The simplest way to copy this data to local disk is with gsutil, run from the VM, but Python bindings are available as well.
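
The Python client takes the bucket name and the object name as separate arguments, so a `gs://` URI has to be split before use. A minimal sketch of that split follows; `split_gs_uri` is a hypothetical helper written for illustration, not part of `google-cloud-storage`.

```python
def split_gs_uri(uri):
    """Split a gs:// URI into (bucket_name, object_name).

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket_name, _, object_name = uri[len(prefix):].partition("/")
    return bucket_name, object_name


bucket_name, object_name = split_gs_uri(
    "gs://hathitrust-full_1800-50/chi/000/chi.085041031_00000006_00.jpg"
)
print(bucket_name)  # hathitrust-full_1800-50
print(object_name)  # chi/000/chi.085041031_00000006_00.jpg
```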

In [13]:
from google.cloud import storage
import pandas as pd
import os

In [14]:
# Check that GOOGLE_APPLICATION_CREDENTIALS is set. This is somewhat tricky from Jupyter:
# use a *full path* with the %env magic; !export has no effect because each ! command runs in its own subshell
%env GOOGLE_APPLICATION_CREDENTIALS=/home/stephen-krewson/project-hathi-images/global-matrix-242515-49432d870e22.json
!echo $GOOGLE_APPLICATION_CREDENTIALS

env: GOOGLE_APPLICATION_CREDENTIALS=/home/stephen-krewson/project-hathi-images/global-matrix-242515-49432d870e22.json
/home/stephen-krewson/project-hathi-images/global-matrix-242515-49432d870e22.json
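
The same check can be done from Python, which also catches a variable that is set but points to a missing key file. This is only a sketch: `check_credentials` is a hypothetical helper, and it verifies only that the file exists, not that the key is valid.

```python
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the credentials path, or raise if the variable is unset or the file is missing.

    Hypothetical helper for illustration; it does not validate the key itself.
    """
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    return path
```

Calling `check_credentials()` before creating the client turns a missing or mistyped path into an immediate, readable error.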


In [18]:
storage_client = storage.Client()
bucket = storage_client.get_bucket("hathitrust-full_1800-50")

In [21]:
blobs = storage_client.list_blobs(bucket)
for blob in blobs:
    print(blob)
    break

<Blob: hathitrust-full_1800-50, chi/000/chi.085041031_00000006_00.jpg, 1601825270933858>


In [22]:
# From Google documentation
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()

    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(f"Blob {source_blob_name} downloaded to {destination_file_name}.")

In [24]:
# The intended location for the data. If within Git, remember not to version .jpg files.
# Make sure the data is being backed up on my 4TB disk, so it is best to keep it on C:\
METADATA_FILE = "../metadata/pixplot_peter-parley.csv"
# Expand "~" explicitly: Python file APIs do not expand it the way a shell does.
OUTPUT_DIR = os.path.expanduser("~/datasets/peter-parley/images")

In [25]:
df = pd.read_csv(METADATA_FILE)

In [27]:
df['filename'].head(5)

0    mdp/31656/mdp.39015063752268_00000015_00.jpg
1    mdp/31656/mdp.39015063752268_00000048_01.jpg
2    mdp/31656/mdp.39015063752268_00000036_00.jpg
3    mdp/31656/mdp.39015063752268_00000060_00.jpg
4    mdp/31656/mdp.39015063752268_00000072_01.jpg
Name: filename, dtype: object
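
The `filename` values are object names relative to the bucket, so the whole column can be downloaded in one loop. The sketch below is one possible way to do that, not the notebook's own method: `download_all` is a hypothetical helper, and it assumes `bucket` behaves like a `google.cloud.storage.Bucket` (it exposes `.blob(name)`, whose result has `.download_to_filename(path)`). It mirrors the bucket's directory layout locally and skips files that already exist, so an interrupted job can be rerun.

```python
import os


def download_all(bucket, blob_names, output_dir):
    """Download each named blob into output_dir, skipping files already present.

    Hypothetical helper. `bucket` is assumed to behave like
    google.cloud.storage.Bucket. Returns the local paths written by this call.
    """
    output_dir = os.path.expanduser(output_dir)
    written = []
    for name in blob_names:
        # Mirror the bucket's directory layout (e.g. mdp/31656/...) locally.
        local_path = os.path.join(output_dir, name)
        if os.path.exists(local_path):
            continue  # already downloaded on a previous run
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        bucket.blob(name).download_to_filename(local_path)
        written.append(local_path)
    return written
```

With the objects defined in this notebook, a call might look like `download_all(bucket, df['filename'], OUTPUT_DIR)`.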

In [28]:
# POSE THE QUESTION BEFORE RUNNING THE JOB! Consider adding "geography" to the Peter Parley query?
# How many more volumes would this add?
# PixPlot 1: Educational publishers (Carter Hendee, Munroe Francis, etc.). Cf. Goodrich's output and memoirs.
# PixPlot 2: Geography books and Peter Parley texts. Search titles for geography OR peter parley, to avoid a dedup step.
# https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-python
# https://cloud.google.com/storage/docs/reference/libraries#command-line