## LoC Data Package Tutorial: Image Resources to PDF

The loc.gov API provides structured data about Library of Congress collections in JSON and YAML formats. This notebook shows how to use the API to access the image resources belonging to an LoC item and aggregate them into a single PDF file.

### Review: Understanding API Responses

**JSON Response Objects**

Each of the endpoint types has a distinct response format, but they can be broadly grouped into two categories:
- responses to queries for a list of items, or Search Results Responses
- responses to queries for a **single item**, or Item and Resource Responses

This notebook focuses on the **JSON Response Object** for a **single item** and on combining its corresponding Resources (the files that make up an item, e.g. page images of a book) into a .pdf file.

### Prerequisites

There are no prerequisites for running this notebook other than installing the libraries listed in the imports section.


### I. Imports:


In [131]:
from PIL import Image
import os
from io import BytesIO
import requests

### II. Create a request URL

First, we need a link to an item of interest. In this example we will look at the _Benjamin Harrison Papers: Series 13, Venezuela Boundary Dispute, 1895-1899; Part 2, 1895-1899_.

Notice the format of the link to this item: `'https://www.loc.gov/item/mss250640164/?'`. The trailing `?` begins the query string, so request parameters can be appended directly to the link.


In [132]:
item_link="https://www.loc.gov/item/mss250640164/?"
request_url = item_link + "fo=json"

# Note: Appending the "fo=json" parameter asks the API to return the item data in JSON format

print(f'Item API Request URL: {request_url}')

Item API Request URL: https://www.loc.gov/item/mss250640164/?fo=json
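
Plain string concatenation works here because the link already ends in `?`. For links that may or may not carry a query string, a more defensive approach is possible. The following is a minimal sketch using only the standard library; the helper name `with_json_format` is hypothetical and is not part of the loc.gov API or of this notebook.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def with_json_format(url):
    """Return `url` with fo=json set, keeping any other query parameters."""
    parts = urlparse(url)
    # parse_qsl ignores an empty query, such as the one left by a trailing "?"
    params = dict(parse_qsl(parts.query))
    params["fo"] = "json"
    return urlunparse(parts._replace(query=urlencode(params)))

# Both forms of the item link produce the same request URL
print(with_json_format("https://www.loc.gov/item/mss250640164/"))
print(with_json_format("https://www.loc.gov/item/mss250640164/?"))
```

Both calls print `https://www.loc.gov/item/mss250640164/?fo=json`, matching the request URL built above.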


### III. Request Data and Reviewing Item data:

In [133]:
# Sends a request to the LoC API and parses the JSON response
r = requests.get(request_url)
# Stop early with a clear error if the request was not successful
r.raise_for_status()
data = r.json()
# print(data)

# Here is a quick way to look at the structure of the data

print("Data for exploration:\n" + ", ".join(value for value in data.keys()))

# Here is another way to look at the structure of the data
# print('Data for exploration:')
# for value in data.keys():
#     print(value)
# print(data.keys())

# For demonstration purposes, these are the item details that you could explore further.
print("\nItem Data:\n" + ", ".join(value for value in data["item"].keys()))

# Printing out the citation is a simple way to verify that you have the correct item.
citation = data['cite_this']['apa']
print(f'\nCitation Information of the Item: {citation} ')

Data for exploration:
articles_and_essays, cite_this, item, more_like_this, options, related_items, resources, timestamp, type

Item Data:
_version_, access_restricted, aka, call_number, campaigns, contributor_names, contributors, created_published, date, dates, digital_id, digitized, display_offsite, extract_timestamp, extract_urls, format, genre, group, hassegments, id, image_url, index, item, language, languages, location, locations, mime_type, online_format, original_format, other_formats, other_title, partof, related_items, repository, resources, rights, score, shard, shelf_id, site, source_collection, source_modified, subject, subject_headings, subjects, timestamp, title, type, url

Citation Information of the Item: Harrison, B. (1895) <cite>Benjamin Harrison Papers: Series 13, Venezuela Boundary Dispute, -1899; Part 2, -1899</cite>. - 1899. [Manuscript/Mixed Material] Retrieved from the Library of Congress, https://www.loc.gov/item/mss250640164/. 
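
The citation printed above contains `<cite>` markup. If plain text is preferred, for example for a file name or a log message, the tags can be stripped. This is a minimal sketch; the helper `plain_citation` is hypothetical, and the sample string is an abbreviated version of the citation shown above rather than the full API value.

```python
import re
from html import unescape

def plain_citation(citation):
    """Remove HTML tags such as <cite> and collapse repeated whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", citation)
    return " ".join(unescape(no_tags).split())

# Abbreviated sample, for illustration only
sample = "Harrison, B. (1895) <cite>Benjamin Harrison Papers</cite>. [Manuscript/Mixed Material]"
print(plain_citation(sample))
```

In the notebook, the same helper could be applied to the `citation` variable before printing it.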


### IV. Resource Data and Extracting Resource URLS:

In the previous code cell, you can see that the response contains a lot of metadata that can be explored. In this notebook, however, we will focus on the information about the resources.

Instead of looking at the item through `data['item']`, we will look at the resources through `data['resources']`. We will then create a list of the image URLs for the resources, choosing the best available resolution for each page (based on the largest image height).

In [134]:
resources = data['resources'][0]
files = data['resources'][0]['files']

num_resources = len(resources['files'])
print(f'Total # of Resources: {num_resources}')

print('Resource Data: ' + ", ".join(key for key in resources.keys()))

# Another way to review information
# for key in data['resources'][0].keys():
#     print(key)

Total # of Resources: 1143
Resource Data: caption, files, image, url
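
Each element of `files` describes one page and holds several entries for the same page in different formats or sizes. Before choosing an image, it can help to see which file types a page offers. The sketch below counts file extensions for one page; the helper `extensions_for_page` and the sample entries are made up for illustration and only assume the `url` key that the rest of this notebook relies on.

```python
import os
from collections import Counter
from urllib.parse import urlparse

def extensions_for_page(page_files):
    """Count the file extensions found in one page's list of file entries."""
    counts = Counter()
    for entry in page_files:
        path = urlparse(entry["url"]).path
        ext = os.path.splitext(path)[1].lower() or "(none)"
        counts[ext] += 1
    return dict(counts)

# Made-up entries in the shape used in this notebook (a 'url' and a 'height')
sample_page = [
    {"url": "https://example.org/iiif/page-0001/full/pct:25/0/default.jpg", "height": 500},
    {"url": "https://example.org/iiif/page-0001/full/pct:100/0/default.jpg", "height": 2000},
    {"url": "https://example.org/master/page-0001.tif", "height": 2000},
]
print(extensions_for_page(sample_page))
```

With real data, `extensions_for_page(files[0])` would summarize the first page of the item.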


In [135]:
# This simple function returns the URL of the highest-resolution '.jpg' image for a page, using image height as the measure of resolution
def best_resolution(page):
    # Collect only the '.jpg' files for this page
    jpgs = [img for img in files[page] if '.jpg' in img['url']]
    # Pick the tallest '.jpg' image; max() raises a ValueError if the page has no '.jpg' files
    tallest = max(jpgs, key=lambda img: img['height'])
    # Returns the url that corresponds to that height
    return tallest['url']
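
The function above reads the notebook-level `files` list. A variant that takes one page's entries as an argument is easier to check in isolation. The following is a sketch under that assumption; the name `best_jpg_url` and the sample entries are hypothetical, and returning `None` for a page without `.jpg` files is a design choice made here, not behavior defined by the API.

```python
def best_jpg_url(page_files):
    """Return the URL of the tallest '.jpg' entry, or None if there is none."""
    # Keep only '.jpg' entries that report a usable height
    jpgs = [f for f in page_files if ".jpg" in f["url"] and f.get("height")]
    if not jpgs:
        return None
    return max(jpgs, key=lambda f: f["height"])["url"]

# Made-up entries: the .tif has the same height as the largest .jpg
sample_page = [
    {"url": "https://example.org/master/page-0001.tif", "height": 2000},
    {"url": "https://example.org/iiif/page-0001/full/pct:25/0/default.jpg", "height": 500},
    {"url": "https://example.org/iiif/page-0001/full/pct:100/0/default.jpg", "height": 2000},
]
print(best_jpg_url(sample_page))
```

Because only `.jpg` entries are compared, a non-JPEG file that happens to share the maximum height is never returned.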

In [136]:
# By default the function generates a list of the urls for the first 10 pages of the item (indexes 0-9), but you can change the range.

def generate_url_list(start=0,end=10):
    url_list = []
    for i in range(start,end):
        #URL with Best resolution for the designated page
        url = best_resolution(i)
        url_list.append(url)
    return url_list

# This is a list of the urls, that correspond to the image resources, used to generate the pdf
print(generate_url_list())

['https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0019/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0020/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0021/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0022/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0023/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0024/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0025/full/pct:100/0/default.jpg', 'https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0026/full/pct:100/0/default.jpg', 'https://
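
The URLs in this list follow the IIIF Image API pattern, in which the segment after the region (`full`) gives the requested size, here `pct:100`. For a large item, requesting a smaller percentage may reduce download time and PDF size. The sketch below rewrites that segment; the helper `scale_iiif_url` is hypothetical, and whether the image server honors a particular percentage is an assumption that should be checked against an actual response.

```python
import re

def scale_iiif_url(url, percent):
    """Swap the IIIF size segment 'pct:N' for another percentage."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    # Replace only the first '/pct:N/' segment; other parts of the URL are untouched
    return re.sub(r"/pct:[\d.]+/", f"/pct:{percent:g}/", url, count=1)

full = "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg"
print(scale_iiif_url(full, 50))
```

A reduced list could then be built with `[scale_iiif_url(u, 50) for u in generate_url_list()]`.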

In [137]:
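# Function that downloads an image from its url and opens it as a PIL Image
def download_image(url):
    response = requests.get(url)
    # Fail with a clear error instead of trying to open an error page as an image
    response.raise_for_status()
    return Image.open(BytesIO(response.content))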

In [138]:
# Using the list of urls, this function creates the final .pdf file from the aggregated resources.
def create_pdf(image_urls, pdf_name):
    images = []
    for url in image_urls:
        image = download_image(url)
        images.append(image)

    images[0].save(
        pdf_name, "PDF", resolution=100.0, save_all=True, append_images=images[1:]
    )

    print("LOC Item Resources have been saved as pdf: "+ pdf_name)

### V. Putting it all Together: Generating a PDF from Item Resources using the LoC API


In [139]:
# Requesting Data from the LoC API
item_link="https://www.loc.gov/item/mss250640164/?"
request_url = item_link + "fo=json"
r = requests.get(request_url)
data = r.json()

# Extracting Resource and File data
resources = data['resources'][0]
files = data['resources'][0]['files']

# Generating URL list from my range
start = 0
end = 10
image_urls = generate_url_list(start,end)
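
If `end` is larger than the number of pages in the item, indexing into `files` raises an `IndexError`. One way to guard against this is to clip the requested range first. This is a minimal sketch; `clamp_range` is a hypothetical helper, not part of the notebook's original code.

```python
def clamp_range(start, end, total):
    """Clip a half-open [start, end) range so it stays within 0..total."""
    start = max(0, min(start, total))
    end = max(start, min(end, total))
    return start, end

# 1143 is the page count reported earlier for this item
print(clamp_range(0, 10, 1143))
print(clamp_range(1140, 1150, 1143))
```

In the cell above, this could be applied as `start, end = clamp_range(start, end, len(files))` before calling `generate_url_list(start, end)`.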


In [140]:
# creating the PDF
pdf_name = 'sample.pdf'
create_pdf(image_urls, pdf_name)

LOC Item Resources have been saved as pdf: sample.pdf
