# Download and transfer image data

This notebook collects a few methods for getting image data onto a VM for PixPlot analysis.

It also shows how to sync that data to a local machine, for running demos and looking through image folders.

## Download images from Google Cloud Storage

The gsutil URIs for the dataset are of the form:

```gs://hathitrust-full_1800-50/chi/000/chi.085041031_00000006_00.jpg```

The bucket name is **hathitrust-full_1800-50** and its storage class is Nearline (multiple zones). The simplest way to copy this data to disk is to run gsutil on the VM, but Python bindings are available as well.
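The Python bindings take a bucket name and an object name separately rather than a full `gs://` URI. As a minimal sketch, the helper below (`split_gs_uri` is a hypothetical name, not part of any library) splits a URI of the form shown above into those two parts.

```python
def split_gs_uri(uri):
    """Split a gs:// URI into (bucket_name, blob_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"URI must name both a bucket and an object: {uri!r}")
    return bucket_name, blob_name


bucket_name, blob_name = split_gs_uri(
    "gs://hathitrust-full_1800-50/chi/000/chi.085041031_00000006_00.jpg"
)
print(bucket_name)
print(blob_name)
```

The two returned values correspond to the `bucket_name` and `source_blob_name` arguments that the client library calls expect.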

In [16]:
from google.cloud import storage
import pandas as pd
import os

In [18]:
# See https://cloud.google.com/storage/docs/reference/libraries#cloud-console
# In the shell, before launching Jupyter (the kernel only inherits variables set before it starts):
# export GOOGLE_APPLICATION_CREDENTIALS="../global-matrix-242515-49432d870e22.json"

In [20]:
!env

SHELL=/bin/bash
WSL_DISTRO_NAME=Ubuntu-20.04
TMUX=/tmp/tmux-1000/default,308,0
SSH_AUTH_SOCK=/tmp/ssh-Nh9KWluIzt3G/agent.13132
SSH_AGENT_PID=13133
NAME=LAPTOP-KM088T8L
PWD=/home/stephen-krewson/project-hathi-images/notebooks
LOGNAME=stephen-krewson
_=/usr/bin/env
MOTD_SHOWN=update-motd
HOME=/home/stephen-krewson
LANG=C.UTF-8
WSL_INTEROP=/run/WSL/8158_interop
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cp

In [19]:
storage_client = storage.Client()
bucket = storage_client.bucket("hathitrust-full_1800-50")

DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
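The `DefaultCredentialsError` above means the client library found no credentials in the kernel's environment. One possible cause, which is an assumption here, is that the `export` ran in a different shell from the one that launched the notebook. The sketch below uses only the standard library to report whether `GOOGLE_APPLICATION_CREDENTIALS` is set and points at an existing file; `credentials_status` is a hypothetical helper name.

```python
import os


def credentials_status(env=None):
    """Describe the state of GOOGLE_APPLICATION_CREDENTIALS in the given environment."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    # A relative path is resolved against the kernel's working directory.
    if not os.path.isfile(os.path.expanduser(path)):
        return "missing file"
    return "ok"


print(credentials_status())
```

If the result is `unset`, assigning the key path to `os.environ["GOOGLE_APPLICATION_CREDENTIALS"]` inside the notebook before creating the client should have the same effect as the shell export. `storage.Client.from_service_account_json` is another option that takes the key path directly.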

In [None]:
# Helper for testing the Cloud Storage connection by downloading a single blob
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()

    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(f"Blob {source_blob_name} downloaded to {destination_file_name}.")


In [3]:
# The intended location for the data. If it is inside a Git repository, remember not to version .jpg files.
# Make sure the data is backed up to my 4TB disk, so it is best to keep it on C:\
METADATA_FILE = "../metadata/pixplot_peter-parley.csv"
# expanduser resolves "~" here; Python's file APIs do not expand it on their own.
OUTPUT_DIR = os.path.expanduser("~/datasets/peter-parley/images")

In [6]:
df = pd.read_csv(METADATA_FILE)

In [13]:
gcloud_urls = df['filename'].map(lambda x: "gs://hathitrust-full_1800-50/" + x)

In [14]:
gcloud_urls

0       gs://hathitrust-full_1800-50/mdp/31656/mdp.390...
1       gs://hathitrust-full_1800-50/mdp/31656/mdp.390...
2       gs://hathitrust-full_1800-50/mdp/31656/mdp.390...
3       gs://hathitrust-full_1800-50/mdp/31656/mdp.390...
4       gs://hathitrust-full_1800-50/mdp/31656/mdp.390...
                              ...                        
8245    gs://hathitrust-full_1800-50/osu/33780/osu.324...
8246    gs://hathitrust-full_1800-50/osu/33780/osu.324...
8247    gs://hathitrust-full_1800-50/osu/33780/osu.324...
8248    gs://hathitrust-full_1800-50/osu/33780/osu.324...
8249    gs://hathitrust-full_1800-50/osu/33780/osu.324...
Name: filename, Length: 8250, dtype: object
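With the list of objects in hand, the remaining step is to fetch each one into the output directory. The sketch below is one way this could look, not the notebook's actual job: `local_path_for` and `download_all` are hypothetical helpers, and the file name in the dry run is made up. The fetch function is passed in so that the loop can be tried without touching the bucket.

```python
import os
import tempfile


def local_path_for(filename, output_dir):
    """Mirror the bucket-relative path of an object under output_dir."""
    return os.path.join(os.path.expanduser(output_dir), *filename.split("/"))


def download_all(filenames, output_dir, fetch):
    """Call fetch(blob_name, destination) for every file not already on disk."""
    fetched = []
    for name in filenames:
        dest = local_path_for(name, output_dir)
        if os.path.exists(dest):
            continue  # skip files left by an earlier run, so the job can resume
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        fetch(name, dest)
        fetched.append(dest)
    return fetched


# Dry run: a stand-in fetch that only records what it would download.
planned = []
download_all(
    ["abc/000/abc.example_00000001_00.jpg"],  # hypothetical file name
    tempfile.mkdtemp(),
    lambda name, dest: planned.append((name, dest)),
)
print(planned)

# With working credentials, the real call might look like this:
# download_all(
#     df["filename"],
#     OUTPUT_DIR,
#     lambda name, dest: download_blob("hathitrust-full_1800-50", name, dest),
# )
```

Skipping files that already exist keeps a rerun cheap after an interruption, which matters for several thousand objects in Nearline storage.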

In [21]:
# POSE THE QUESTION BEFORE RUNNING THE JOB! Consider adding "geography" to Peter Parley query?
# How many more volumes would this add?
# PixPlot 1: Educational publishers (Carter Hendee, Munroe Francis, etc.). Cf. Goodrich's output and memoirs.
# PixPlot 2: Geography books and Peter Parley texts. Searching title for geography OR peter parley. To avoid dedup step.

In [None]:
# https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-python
# https://cloud.google.com/storage/docs/reference/libraries#command-line
