# Data Wrangling - Whats-on-the-menu

This notebook wrangles all of the image data from the NYPL dataset 'whats-on-the-menu'.

In [0]:
# Import Libraries
import pandas as pd
import numpy as np
import os
import requests
import json
from bs4 import BeautifulSoup
import time

In [13]:
# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Load in data
current = os.getcwd()
whatsOnTheMenu = '/drive/My Drive/Menu_Parser_Datasets/whats-on-the-menu/Menu_Data/'
filePath1 = current + whatsOnTheMenu + 'MenuPage.csv'
filePath2 = current + whatsOnTheMenu + 'Menu.csv'
filePath3 = current + whatsOnTheMenu + 'Dish.csv'
filePath4 = current + whatsOnTheMenu + 'MenuItem.csv'

df_page = pd.read_csv(filePath1)
df_menu = pd.read_csv(filePath2)
df_dish = pd.read_csv(filePath3)
df_item = pd.read_csv(filePath4)

## List of Menu IDs

The first step is to create a list of all menu IDs. Each ID is used to build the API URL for that menu. The URLs are stored in a list, which is later iterated over to download the corresponding menu images.

In [19]:
# Create a list of all menu_id
id_list = df_page['menu_id'].unique().tolist()
id_list[:5]

[12460, 12461, 12462, 12463, 12464]

In [34]:
# Build URL List for request
APIKEY = 'removed'

# One API URL per menu ID, in the same order as id_list
urlList = ["http://api.menus.nypl.org/menus/" + str(menu_id) + "?token=" + APIKEY
           for menu_id in id_list]

print(urlList[:10])

['http://api.menus.nypl.org/menus/12460?token=removed', 'http://api.menus.nypl.org/menus/12461?token=removed', 'http://api.menus.nypl.org/menus/12462?token=removed', 'http://api.menus.nypl.org/menus/12463?token=removed', 'http://api.menus.nypl.org/menus/12464?token=removed', 'http://api.menus.nypl.org/menus/12465?token=removed', 'http://api.menus.nypl.org/menus/12466?token=removed', 'http://api.menus.nypl.org/menus/12467?token=removed', 'http://api.menus.nypl.org/menus/12468?token=removed', 'http://api.menus.nypl.org/menus/12469?token=removed']
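As a small variation on the cell above, the token could be percent-encoded before it is appended, which matters if a real key contains characters such as `&` or spaces. This is only a sketch: `build_menu_url` and `BASE_URL` are hypothetical names, not part of the original notebook, and the token value stays redacted.

```python
from urllib.parse import urlencode

# Endpoint taken from the cell above
BASE_URL = "http://api.menus.nypl.org/menus/"

def build_menu_url(menu_id, token):
    """Return the API URL for one menu, with the token percent-encoded."""
    return BASE_URL + str(menu_id) + "?" + urlencode({"token": token})

# 'removed' is the redacted placeholder used in this notebook
example_urls = [build_menu_url(menu_id, "removed") for menu_id in [12460, 12461]]
print(example_urls)
```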


In [22]:
# Get Data via Request call
r = requests.get(urlList[0])
print(r.status_code)

200


In [23]:
# Check how the data is formatted and which JSON fields are needed to download images
json_data = r.json()
json_data

{'call_number': '1956-0000',
 'currency': None,
 'currency_symbol': None,
 'day': 2,
 'dish_count': 0,
 'event': 'DAILY MENU',
 'first_page_aspect': 'portrait',
 'first_page_full_height': 7230,
 'first_page_full_width': 5428,
 'id': 12460,
 'keywords': None,
 'language': None,
 'large_src': 'http://images.nypl.org/index.php?id=1603595&t=w',
 'large_src_jp2': 'http://j2k.repo.nypl.org/adore-djatoka/resolver?url_ver=Z39.88-2004&rft_id=urn:uuid:510d47e4-2955-a3d9-e040-e00a18064a99&svc_id=info:lanl-repo/svc/getRegion&svc_val_fmt=info:ofi/fmt:kev:mtx:jpeg2000&svc.format=image/jpeg&svc.scale=1800,0&svc.rotate=0',
 'links': [{'href': 'http://api.menus.nypl.org/menus', 'rel': 'index'},
  {'href': 'http://api.menus.nypl.org/menus/12460/pages', 'rel': 'pages'},
  {'href': 'http://api.menus.nypl.org/menus/12460/dishes', 'rel': 'dishes'}],
 'location': 'Monterey Coffee Shop And Cocktail Lounge',
 'location_type': None,
 'month': 4,
 'name': None,
 'notes': 'PRICED MENU PRINTED IN BLACK INK ON WHIT

The image URLs needed for downloading are stored under the 'large_src' and 'large_src_jp2' keys.
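A minimal sketch of pulling those two fields out of one menu record, skipping any that are missing or `None`. The helper name `image_sources` is hypothetical, and the sample record is a trimmed-down stand-in for the response shown above (its `large_src_jp2` is set to `None` purely for illustration).

```python
def image_sources(menu_json, keys=("large_src", "large_src_jp2")):
    """Return the image URLs present in a menu record, in preference order."""
    return [menu_json[k] for k in keys if menu_json.get(k)]

sample = {
    "id": 12460,
    "large_src": "http://images.nypl.org/index.php?id=1603595&t=w",
    "large_src_jp2": None,  # hypothetical: pretend this link is missing
}
print(image_sources(sample))
```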

In [24]:
len(id_list)

19816

## Capture Images

The imageCapture() function below iterates through urlList, requests the JSON record for each menu, and downloads the image referenced by the chosen src key.

In [0]:
def imageCapture(urlList, id_list, src='large_src', startIndex=0):
    # Request each menu's JSON record, then save the image found under the `src` key

    assert len(urlList) == len(id_list), 'urlList and id_list are different sizes'

    for i in range(startIndex, len(urlList)):
        time.sleep(0.5)  # brief pause between requests
        r = requests.get(urlList[i])
        if r.status_code == 200:
            json_data = r.json()
            response = requests.get(json_data[src])
            if response.status_code == 200:
                fileName = "/content/drive/My Drive/Menu_Parser_Datasets/whats-on-the-menu/Menu_Image_Data/" + str(id_list[i]) + ".jpg"
                with open(fileName, 'wb') as f:
                    f.write(response.content)
                print('Completed Loop Iteration ' + str(i) + ' of ' + str(len(id_list)))
            else:
                print('Response was not okay')
                print('On loop iteration: ', i)
        else:
            print('URL Response was not okay. Got status code: ', r.status_code)
            print('On loop iteration: ', i)
            print('Terminating Loop')
            break  # stop here so the run can be resumed later via startIndex

    print('Done')
    return 0

In [33]:
# Collect Image Data
imageCapture(urlList=urlList, id_list=id_list, src='large_src', startIndex=0)

Completed Loop Iteration 0 of 19816


KeyboardInterrupt: ignored
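When a long run is interrupted like this, the `startIndex` argument lets the download resume from where it stopped. One way to pick that index, sketched under the assumption that images are saved as `<menu_id>.jpg` in a single folder, is to find the first ID with no file on disk. `first_missing_index` is a hypothetical helper, not part of the original notebook.

```python
import os

def first_missing_index(id_list, image_dir, ext=".jpg"):
    """Return the index of the first ID with no saved file, or len(id_list)."""
    for index, menu_id in enumerate(id_list):
        if not os.path.exists(os.path.join(image_dir, str(menu_id) + ext)):
            return index
    return len(id_list)
```

The returned value could then be passed as `startIndex` to imageCapture(). Note that this only checks that a file exists, not that it holds a valid image.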

## Capture Missing Menu Images

When inspecting the downloaded images, some of them are only 1 KB in size. In this section, we repeat the imageCapture process for those menus, but download large_src_jp2 instead of large_src.
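The notebook does not show how the undersized images were identified. A minimal sketch of one possible approach is to scan the image folder for files at or below a size threshold and collect their menu IDs. The helper name `find_small_images`, the 1024-byte threshold, and the `<menu_id>.jpg` naming are assumptions made for illustration.

```python
import os

def find_small_images(image_dir, max_bytes=1024, ext=".jpg"):
    """Return sorted menu IDs whose saved image is at most max_bytes in size."""
    small_ids = []
    for name in os.listdir(image_dir):
        stem, file_ext = os.path.splitext(name)
        path = os.path.join(image_dir, name)
        # Only consider files named <digits><ext>, e.g. 12460.jpg
        if file_ext == ext and stem.isdigit() and os.path.getsize(path) <= max_bytes:
            small_ids.append(int(stem))
    return sorted(small_ids)
```

The resulting IDs could be used to build a shorter ID list and a matching URL list for a second imageCapture() pass.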

In [0]:
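# See which entries have dish items
# (json_list is not defined in the cells shown; it is assumed to hold the JSON records collected earlier)
for i in range(len(json_list)):
    try:
        print('Entry number ' + str(i) + ' Dish Count: ', json_list[i]['dish_count'])
    except Exception:
        # Entry has no usable 'dish_count' field
        print('Skipped list entry number  ', str(i))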

Entry number 0 Dish Count:  0
Entry number 1 Dish Count:  0
Entry number 2 Dish Count:  0
Entry number 3 Dish Count:  0
Entry number 4 Dish Count:  0
Skipped list entry number   5
Entry number 6 Dish Count:  0
Entry number 7 Dish Count:  0
Entry number 8 Dish Count:  0
Skipped list entry number   9
Skipped list entry number   10
Entry number 11 Dish Count:  140
Entry number 12 Dish Count:  189
Entry number 13 Dish Count:  41
Entry number 14 Dish Count:  14
Entry number 15 Dish Count:  17
Entry number 16 Dish Count:  0
Skipped list entry number   17
Entry number 18 Dish Count:  0
Entry number 19 Dish Count:  0
Entry number 20 Dish Count:  0
Entry number 21 Dish Count:  14
Entry number 22 Dish Count:  12
Entry number 23 Dish Count:  18
Entry number 24 Dish Count:  12
Entry number 25 Dish Count:  0
Skipped list entry number   26
Skipped list entry number   27
Entry number 28 Dish Count:  0
Entry number 29 Dish Count:  165
Entry number 30 Dish Count:  10
Entry number 31 Dish Count:  141
En

In [0]:
imageCapture(urlList=reCaptureURL, id_list=imageList, src='large_src_jp2')

Response was not okay
On loop iteration:  0
Response was not okay
On loop iteration:  1
Response was not okay
On loop iteration:  2
Completed Loop Iteration 3 of 100
Response was not okay
On loop iteration:  4
Response was not okay
On loop iteration:  5
Response was not okay
On loop iteration:  6
Completed Loop Iteration 7 of 100
Response was not okay
On loop iteration:  8
Response was not okay
On loop iteration:  9
Completed Loop Iteration 10 of 100
Completed Loop Iteration 11 of 100
Completed Loop Iteration 12 of 100
Completed Loop Iteration 13 of 100
Completed Loop Iteration 14 of 100
Completed Loop Iteration 15 of 100
Response was not okay
On loop iteration:  16
Response was not okay
On loop iteration:  17
Response was not okay
On loop iteration:  18
Response was not okay
On loop iteration:  19
Response was not okay
On loop iteration:  20
Completed Loop Iteration 21 of 100
Completed Loop Iteration 22 of 100
Completed Loop Iteration 23 of 100
Completed Loop Iteration 24 of 100
Respo

0

In [0]:
for i in range(len(json_list)):
    try:
        print('Entry number ' + str(i) + ' Dish Count: ', json_list[i]['dish_count'])
    except Exception:
        # Entry has no usable 'dish_count' field
        print('Skipped list entry number  ', str(i))

Entry number 0 Dish Count:  0
Entry number 1 Dish Count:  0
Entry number 2 Dish Count:  0
Entry number 3 Dish Count:  0
Entry number 4 Dish Count:  0
Entry number 5 Dish Count:  0
Entry number 6 Dish Count:  0
Entry number 7 Dish Count:  0
Entry number 8 Dish Count:  0
Entry number 9 Dish Count:  0
Entry number 10 Dish Count:  0
Entry number 11 Dish Count:  0
Entry number 12 Dish Count:  0
Entry number 13 Dish Count:  0
Entry number 14 Dish Count:  0
Entry number 15 Dish Count:  0
Entry number 16 Dish Count:  0
Entry number 17 Dish Count:  0
Entry number 18 Dish Count:  0
Entry number 19 Dish Count:  0
Entry number 20 Dish Count:  0
Entry number 21 Dish Count:  0
Entry number 22 Dish Count:  0
Entry number 23 Dish Count:  0
Entry number 24 Dish Count:  0
Entry number 25 Dish Count:  0
Entry number 26 Dish Count:  0
Entry number 27 Dish Count:  0
Entry number 28 Dish Count:  0
Entry number 29 Dish Count:  0
Entry number 30 Dish Count:  0
Entry number 31 Dish Count:  0
Entry number 32 Di

Now that all of the data containing dishes has been collected, we can move on to the next step.