# Capstone Proposals:
    
As Americans, we have an enormous variety of foods commonly available at grocery stores, corner shops, and, increasingly, online. For one project, I'd like to take a look at the fundamental ingredients and nutrients that make up this variety. My hope would be to predict the amounts of nutrients found in certain food items, as well as to look holistically at the ingredients that are most common in our foods. I would then like to relate this to the general dietary habits of the US population and get a granular idea of the nutrient composition of diets. The stretch goal would be to predict how nutrient use changes as the vegan population grows, estimating the increase in required nutrients relative to 'normal' diets. I'm unsure whether this would be kosher, because obviously I can't check the quality of my predictions for the future. I could, however, see whether a model can reproduce historical data on the growth of the vegan diet in the US, and then use that model to predict the effect of future growth of vegan diets.
       
*** (I'm still pondering the best way to go about this, but at least you can see where I'm hoping to take this) ***
    
'Nutrient Database from 2012' - https://catalog.data.gov/dataset/usda-national-nutrient-database-for-standard-reference
'Nutrient Database from 2009' - https://catalog.data.gov/dataset/usda-national-nutrient-database-for-standard-reference-release-22
    
Partially pre-cleaned nutrient data - https://github.com/mhess126/usda_national_nutrients
USDA API (not sure if this could be useful yet?) - https://ndb.nal.usda.gov/ndb/api/doc
    
    
I'm still looking for better data to work with that could give me some idea of dietary habits, but here's where I've been looking - https://catalog.data.gov/dataset?q=bureauCode:%22005:13%22 ; https://catalog.data.gov/dataset?q=usda+consumption+national+nutrient&sort=views_recent+desc&ext_location=&ext_bbox=&ext_prev_extent=-142.03125%2C2.4601811810210052%2C-59.0625%2C58.63121664342478
    
    
Alternatively, I could look at pricing these ingredients, based on the food types that we find them in. For example, take a chili sauce: look at its unique ingredients and the portion of the sauce that each one makes up. Let's say that black beans compose 20% of the chili, and the chili runs 5 dollars/unit. Then the price attributed to the black beans would be 1 dollar (20% of 5 dollars).

From there, I would want to observe how expensive these ingredients can get and how their prices change depending on what products they may be found in. 
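
The chili example above can be sketched in a few lines. This is only an illustration under a strong assumption: that a product's price splits in proportion to each ingredient's share of it. The function name `allocate_ingredient_prices` and every share other than the 20% for black beans are made up for the example.

```python
def allocate_ingredient_prices(unit_price, ingredient_shares):
    """Split a product's unit price across ingredients in proportion to their share."""
    total_share = sum(ingredient_shares.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("ingredient shares should sum to 1.0")
    return {name: round(unit_price * share, 2)
            for name, share in ingredient_shares.items()}


# Hypothetical recipe: only the black bean share (20%) comes from the example above.
chili_shares = {"black beans": 0.20, "tomatoes": 0.50, "peppers": 0.30}
chili_prices = allocate_ingredient_prices(5.00, chili_shares)
# chili_prices["black beans"] is 1.0, i.e. 1 dollar of the 5 dollar unit price
```

Comparing the same ingredient's attributed price across different products would then be a matter of running this over each product's ingredient shares.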
    
'Food Price Outlook, current' - https://catalog.data.gov/dataset/food-price-outlook
    

In [1]:
# pip install scrapy
# pip install --upgrade zope2

import foursquare
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import requests
from scrapy import Selector
from scrapy.http import HtmlResponse
import seaborn as sns
import time
import unicodedata

In [2]:
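import os

# Read the Foursquare credentials from the environment rather than hard-coding them in the notebook.
# (The variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are my own choice.)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)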

In [3]:
#Let's define a geo dictionary whose bounds encompass all of SF, and another 4 miles south as well (so ~7mi x 11mi)
#Note: the values near -122 are longitudes (east/west) and the values near 37.7 are latitudes (north/south)
ne = {'ne_lon': -122.3550796509, 'ne_lat': 37.8127675576}
sw = {'sw_lon': -122.5164413452, 'sw_lat': 37.7078622611}

# east/west
lon_bounds = [ne['ne_lon'], sw['sw_lon']]
print lon_bounds[0]
# north/south
lat_bounds = [ne['ne_lat'], sw['sw_lat']]
print lat_bounds

#increment ~ half a mile (in latitude/longitude); 0.007 degrees of latitude is roughly half a mile,
#and it is the step that produces the 360 grid points printed below
increment = 0.007

# The gridding below moves North, starting from the bottom SW corner boundary, and then moves east half a mile,
# and repeats the process until stopping at the NE corner boundary. Each pair is stored as [lon, lat].
grid_pairs = []
for lon in np.arange(lon_bounds[1], lon_bounds[0], increment):
    for lat in np.arange(lat_bounds[1], lat_bounds[0], increment):
        grid_pairs.append([lon, lat])
        
print len(grid_pairs)
# for x, y in grid_pairs:
#     print x, y

-122.355079651
[37.8127675576, 37.7078622611]
360
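
For reference, the gridding step can also be written as a small function. This is a sketch in plain Python rather than the `np.arange` version above, and it assumes a step of 0.007 degrees, which is the value that reproduces the 360 points printed above for this bounding box. The name `build_grid` is hypothetical. In San Francisco the coordinate near -122 is the longitude and the one near 37.7 is the latitude, so the pairs come out as `[lon, lat]`.

```python
import math


def build_grid(sw_lon, sw_lat, ne_lon, ne_lat, step):
    """Return [lon, lat] pairs covering the box: south to north first, then west to east."""
    n_lon = int(math.ceil((ne_lon - sw_lon) / step))
    n_lat = int(math.ceil((ne_lat - sw_lat) / step))
    return [[sw_lon + i * step, sw_lat + j * step]
            for i in range(n_lon)
            for j in range(n_lat)]


# Bounding box copied from the cell above; the step is an assumption (see the lead-in).
grid = build_grid(-122.5164413452, 37.7078622611,
                  -122.3550796509, 37.8127675576, 0.007)
```

With these inputs the grid has 24 columns (west to east) and 15 rows (south to north).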


In [4]:
print len(grid_pairs[:100]), len(grid_pairs[100:200]), len(grid_pairs[200:300]), len(grid_pairs[300:360])

100 100 100 60


In [4]:
# This will be the func to add the next offset to my completed venue list...
def extend_unique_venues(unique_venues, proposed_venue):
    if proposed_venue not in unique_venues:
        unique_venues.append(proposed_venue)

In [106]:
# for x, y in grid_pairs:
    
#     search = client.venues.search(params={'ll': "%.2f, %.2f" % (y, x),'query': 'food', 'limit':'50',
#                                       'intent':'browse','radius':'800'})
#     searched_venue_ids = [search['venues'][i]['id'] for i in range(len(search['venues']))]
#     searched_name_ids = [search['venues'][i]['name'] for i in range(len(search['venues']))]
#     for next_id in searched_venue_ids:
#         extend_unique_venues(unique_venues=unique_venue_ids, proposed_venue=next_id)
#     for next_name in searched_name_ids:
#         extend_unique_venues(unique_venues=unique_venue_names, proposed_venue=next_name)
# print('--- %s seconds ---' % (time.time() - start_time))

--- 288.122615099 seconds ---


In [6]:
print range(len(grid_pairs)+1)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

In [5]:
# This will independently pull the venue names and id codes that correspond to the geographical areas
# I paired off in the previous step with grid_pairs. The names and ids will subsequently be used to
# construct menu urls, which I then intend to scrape.
unique_venue_ids = []
unique_venue_names = []
start_time = time.time()

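#My hypothesis is that there are at most 18k venues to get (360 pairs * 50 results per call)
#NOTE: urls_collected counts completed passes over grid_pairs, not venues or urls, so the while loop
#below re-runs the same 360 queries on every pass (up to 18,001 passes unless it is interrupted)
urls_collected = 0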
# for i in range(len(grid_pairs)+1):
#     single_loop_limit = i * 499.
#     print "loop # %d " %i
while urls_collected <= 18000:
    try:
        for x, y in grid_pairs:
    
            search = client.venues.search(params={'ll': "%.2f, %.2f" % (y, x),'query': 'food', 'limit':'50',
                                      'intent':'browse','radius':'800'})
            searched_venue_ids = [search['venues'][i]['id'] for i in range(len(search['venues']))]
            searched_name_ids = [search['venues'][i]['name'] for i in range(len(search['venues']))]
            for next_id in searched_venue_ids:
                extend_unique_venues(unique_venues=unique_venue_ids, proposed_venue=next_id)
            for next_name in searched_name_ids:
                extend_unique_venues(unique_venues=unique_venue_names, proposed_venue=next_name)
        print('--- %s active seconds ---' % (time.time() - start_time))
        
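        urls_collected += 1
    except Exception:
        # Most likely the API quota was exceeded; wait about two minutes before trying again.
        # Catching Exception (instead of a bare except) still lets Ctrl-C / KeyboardInterrupt stop the loop.
        print('Sleeping...')
        time.sleep(random.randint(115,140))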

--- 82.7495970726 active seconds ---
--- 159.90173912 active seconds ---
--- 239.536785126 active seconds ---
--- 323.484917164 active seconds ---
--- 402.770659208 active seconds ---
--- 482.115272045 active seconds ---
--- 556.072066069 active seconds ---
--- 629.63965106 active seconds ---
--- 711.704174995 active seconds ---
--- 787.107383013 active seconds ---
--- 882.598478079 active seconds ---
--- 979.289232016 active seconds ---
--- 1068.0080452 active seconds ---
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
--- 3683.34785199 active seconds ---
--- 3778.06432319 active seconds ---
--- 3856.88606906 active seconds ---
--- 3951.04915118 active seconds ---
--- 4032.20707417 active seconds ---
--- 4117.67094302 active seconds ---
--- 4198.7937541 active seconds ---
--- 4282.31700802 active seconds ---

No handlers could be found for logger "foursquare"


--- 7880.0768702 active seconds ---
--- 7958.1555202 active seconds ---
--- 8041.26506519 active seconds ---
--- 8125.88318014 active seconds ---
--- 8211.78091407 active seconds ---
--- 8291.83138418 active seconds ---
--- 8369.5076201 active seconds ---
--- 8451.23121309 active seconds ---
--- 8531.66188121 active seconds ---
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
Sleeping...
--- 10937.731431 active seconds ---
--- 11013.9429541 active seconds ---
--- 11088.7928562 active seconds ---
--- 11162.9293182 active seconds ---
--- 11241.6192901 active seconds ---
--- 11315.4319642 active seconds ---
--- 11388.4958301 active seconds ---
--- 11462.104228 active seconds ---
--- 11537.6739991 active seconds ---
--- 11612.3592041 active seconds ---
--- 11689.0019212 active seconds ---
--- 11760.9208891 active seconds ---
--- 11833.98472 active s

KeyboardInterrupt: 
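
One thing worth checking in the loop above: each grid point is formatted with `"%.2f"` before it goes into `ll`. Assuming the grid step is about 0.007 degrees (the step that gives 360 points for this bounding box), two decimals are coarser than the grid itself, so neighbouring points can round to the same query location. A quick check, with the grid rebuilt in plain Python so the snippet is self-contained:

```python
# Rebuild the 24 x 15 grid (assumed step: 0.007 degrees) as (lat, lon) tuples.
lons = [-122.5164413452 + i * 0.007 for i in range(24)]
lats = [37.7078622611 + j * 0.007 for j in range(15)]
points = [(lat, lon) for lon in lons for lat in lats]

# Format every point the way the query does, at two and at four decimals.
two_decimals = set("%.2f,%.2f" % point for point in points)
four_decimals = set("%.4f,%.4f" % point for point in points)

# Fewer distinct strings than points means several grid cells send the same query.
collapsed = len(points) - len(two_decimals)
```

If `collapsed` is greater than zero, formatting with more decimals (four keeps every point distinct here) would avoid spending quota on repeated locations.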

In [15]:
for x, y in grid_pairs:
    search = client.venues.search(params={'ll': "%.2f, %.2f" % (y, x),'query': 'food', 'limit':'50',
                                      'intent':'browse','radius':'800'})
    searched_venue_ids = [search['venues'][i]['id'] for i in range(len(search['venues']))]
    print searched_venue_ids, x,y
    searched_name_ids = [search['venues'][i]['name'] for i in range(len(search['venues']))]
    print searched_name_ids
    for next_id in searched_venue_ids:
        extend_unique_venues(unique_venues=unique_venue_ids, proposed_venue=next_id)
    for next_name in searched_name_ids:
        extend_unique_venues(unique_venues=unique_venue_names, proposed_venue=next_name)

RateLimitExceeded: Quota exceeded
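
Since the quota error shows up whenever the grid is queried in one go, a small retry wrapper may be easier to reason about than wrapping the whole loop in try/except. The sketch below is generic: `QuotaExceeded` is a stand-in class defined here (not the client library's own exception), `call_with_retry` is a hypothetical helper, and the 115 to 140 second wait mirrors the sleep used earlier.

```python
import random
import time


class QuotaExceeded(Exception):
    """Stand-in for the API client's rate-limit error (hypothetical class)."""


def call_with_retry(request, max_attempts=5, wait_range=(115, 140),
                    sleep=time.sleep, retry_on=(QuotaExceeded,)):
    """Call request(); on a rate-limit error, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(random.randint(*wait_range))


# Demo with a fake request that fails twice; waits are recorded instead of slept.
attempts = {"count": 0}
recorded_waits = []


def fake_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise QuotaExceeded("Quota exceeded")
    return {"venues": []}


result = call_with_retry(fake_request, sleep=recorded_waits.append)
```

In the notebook, `request` would be a small function wrapping one `client.venues.search(...)` call, and `retry_on` would be set to whichever exception the client actually raises.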

In [8]:
print len(unique_venue_ids), len(unique_venue_names)

477 449
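
The two counts above differ (477 ids, 449 names), which is what one would expect if some venues share a name: ids and names are de-duplicated independently, so a repeated name is dropped even when its id is new. One way to keep them aligned, sketched with made-up ids and a hypothetical helper `merge_venues`, is to key everything on the id:

```python
from collections import OrderedDict


def merge_venues(venues_by_id, search_results):
    """Add each venue once, keyed by id, keeping the name attached to its id."""
    for venue in search_results:
        venues_by_id.setdefault(venue["id"], venue["name"])
    return venues_by_id


# Made-up search results: two different venues share the name "7-Eleven".
batch_one = [{"id": "a1", "name": "7-Eleven"},
             {"id": "b2", "name": "Mediterranean Cafe"}]
batch_two = [{"id": "c3", "name": "7-Eleven"},
             {"id": "a1", "name": "7-Eleven"}]

venues_by_id = OrderedDict()
merge_venues(venues_by_id, batch_one)
merge_venues(venues_by_id, batch_two)

unique_ids = list(venues_by_id.keys())
unique_names = list(venues_by_id.values())
```

Here both lists have three entries, and position `i` in one always refers to the same venue as position `i` in the other.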


In [16]:
unique_venue_ids

[u'463bfdccf964a52026461fe3',
 u'4e5eaa2f7d8b67dc8ff77135',
 u'4a90954ff964a520a61820e3',
 u'4afba001f964a520d51e22e3',
 u'4f7a311fe4b041c95d550ca8',
 u'4b00c4dff964a520094122e3',
 u'4e3d8765ae60454236667cc4',
 u'4bde0311e75c0f47488cc603',
 u'4f32812819836c91c7de479f',
 u'4e3d84ec8877b00cfc4840a6',
 u'5754af30cd10ed6881de7b60',
 u'4f32beaf19836c91c7f5d533',
 u'4bef37cdea570f477f178fd2',
 u'4bbe071a4e069c7467789fe3',
 u'571d28cd498e8c93939cf545',
 u'50257f67e4b03069531cfabe',
 u'4e45e021a809fb2fa3e44cbe',
 u'4b724171f964a520ca752de3',
 u'4af5d67ff964a52094fd21e3',
 u'4c12816da9c220a1c11b549d',
 u'527aecd211d276025019c129',
 u'4c379bdd93db0f47c21b2092',
 u'4ade52a7f964a520fd7421e3',
 u'4abad156f964a5200e8320e3',
 u'4b7715bff964a520d27b2ee3',
 u'4df5103ca809141629a80d9d',
 u'4dd95ce21838b8561d176bf8',
 u'4f09ee0ce4b000dd76f65306',
 u'4fc94ec1e4b0aa75b6e27865',
 u'4f44ccf019836ed0019698cd',
 u'4d867ed899b78cfac200e61f',
 u'4a669b50f964a520c4c81fe3',
 u'55c6423b498ec482a085c462',
 u'527ae60

In [77]:
unique_venue_ids[:5], unique_venue_names[:5]

([u'463bfdccf964a52026461fe3',
  u'4e5eaa2f7d8b67dc8ff77135',
  u'4a90954ff964a520a61820e3',
  u'4afba001f964a520d51e22e3',
  u'4f7a311fe4b041c95d550ca8'],
 [u'Asian American Food Company',
  u'Sunny Vibrations Foodtruck',
  u'Other Avenues Food Store',
  u'7-Eleven',
  u'Mediterranean Cafe'])

In [17]:
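#NOTE: THIS ISN'T WORKING B/C SOME OF THE MENUS DON'T ALIGN PROPERLY IN MY FOR LOOP - ASK DAVIDDD
#Likely cause: the ids and names were de-duplicated independently (477 ids vs. 449 names), so venues that share
#a name drop out of the names list, and zip() then pairs ids with the wrong names and stops at the shorter list.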

#this works for the .search api calls only (need a diff process b/c .explore doesn't return names)
menu_urls_from_search = []
base_url = 'https://foursquare.com/v/'
for venue_id, venue_name in zip(unique_venue_ids, unique_venue_names):
    dat_id = unicodedata.normalize('NFKD', venue_id).encode('ascii','ignore')
    dat_name = unicodedata.normalize('NFKD', venue_name).encode('ascii','ignore')
    dat_name = dat_name.lower().replace('/','-').replace(' ','-')
    transformed_url = base_url+dat_name+'/'+dat_id+'/menu'
    menu_urls_from_search.append(transformed_url)
len(menu_urls_from_search)

449
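
If ids and names are kept together as pairs, the url construction no longer depends on two lists lining up. A minimal sketch of the same slug rules as the cell above (lowercase, `/` and spaces replaced by `-`, non-ASCII characters dropped); `build_menu_url` is a hypothetical helper and the ids below are placeholders, not real venue ids:

```python
import unicodedata

BASE_URL = "https://foursquare.com/v/"


def build_menu_url(venue_id, venue_name):
    """Build a menu url from one (id, name) pair, mirroring the slug rules used above."""
    ascii_name = (unicodedata.normalize("NFKD", venue_name)
                  .encode("ascii", "ignore").decode("ascii"))
    slug = ascii_name.lower().replace("/", "-").replace(" ", "-")
    return BASE_URL + slug + "/" + venue_id + "/menu"


# Placeholder (id, name) pairs; the second name contains accented characters.
venue_pairs = [
    (u"id001", u"Mediterranean Cafe"),
    (u"id002", u"Caf\u00e9 Ol\u00e9"),
]
menu_urls = [build_menu_url(venue_id, name) for venue_id, name in venue_pairs]
```

Whether every venue page actually follows this url pattern is an assumption carried over from the cell above.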

In [18]:
unique_urls_from_search = menu_urls_from_search

In [None]:
# Thus far my testing shows that no venues have been duplicated with my searches

In [9]:
# unique_urls_from_search = set(menu_urls_from_search)
# print len(unique_urls_from_search)

8184


In [None]:
# I'll probably have to do a try/except when I'm looping through ll to avoid issues if I'm accidentally nowhere,
# or in the water, etc.

In [131]:
x,y,z,t = (-122.5135231018,37.7327650667,-122.3772239685,37.8094586784)

explore = client.venues.explore(params={'ll': "%.2f, %.2f" % (y, x), 'section': 'food','limit':'50','offset':'50'})

# print len(explore['groups'][0]['items'])
print explore['groups'][0]['items'][49]['venue']['menu']['url']
print range(len(explore['groups'][0]['items']))

# for offset in range(50, 201, 50):
#     print range((offset - 50), (offset+1), 50)
    
# explored_venue_ids = [explore['groups'][0]['items'][i]['venue']['menu']['url']
# for i in range(len(explore['groups'][0]['items']))]
# explored_venue_ids[2]

https://foursquare.com/v/fuji/4a944e31f964a520262120e3/menu
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]


In [6]:
# NOTE: Have to rewrite the searched_venue_ids part, to make the .explore function properly
unique_venues_from_explore = []
start_time = time.time()
for x, y in grid_pairs:
    for offset in range(45, 91, 45):
        explore = client.venues.explore(params={'ll': "%.2f, %.2f" % (y, x), 'radius': '800',
                                               'section': 'food','limit':'50','offset':''+str(offset)+''})
    
        explored_venue_ids = [explore['groups'][0]['items'][i]['venue']['menu']['url']
                              for i in range(len(explore['groups'][0]['items']))]
        for next_id in explored_venue_ids:
            extend_unique_venues(unique_venues=unique_venues_from_explore, proposed_venue=next_id)
    print('--- %s seconds ---' % (time.time() - start_time))

--- 0.567449092865 seconds ---
--- 1.00466704369 seconds ---
--- 1.51477503777 seconds ---
--- 2.24890804291 seconds ---
--- 2.78229403496 seconds ---
--- 3.24658298492 seconds ---
--- 3.9832971096 seconds ---
--- 4.45633101463 seconds ---
--- 4.9555311203 seconds ---
--- 5.39794898033 seconds ---
--- 6.04097914696 seconds ---
--- 6.50417995453 seconds ---
--- 7.4026350975 seconds ---
--- 7.87629199028 seconds ---
--- 8.97564196587 seconds ---
--- 9.43341612816 seconds ---
--- 9.89865398407 seconds ---
--- 10.3631491661 seconds ---
--- 11.1085720062 seconds ---
--- 11.5866601467 seconds ---
--- 12.0771820545 seconds ---
--- 12.6289811134 seconds ---
--- 13.1475510597 seconds ---
--- 13.7378480434 seconds ---
--- 14.2311871052 seconds ---
--- 14.7393050194 seconds ---
--- 15.5005941391 seconds ---
--- 16.2226669788 seconds ---
--- 16.7216150761 seconds ---
--- 17.3147940636 seconds ---
--- 18.0659849644 seconds ---
--- 18.5995841026 seconds ---
--- 19.1224069595 seconds ---
--- 19.68028

KeyError: 'menu'
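
The `KeyError: 'menu'` above suggests that some venues in the response simply have no menu entry. A sketch of pulling the menu urls defensively, using a hand-written dictionary shaped like the access pattern in the cell above (`groups` > `items` > `venue` > `menu` > `url`). The sample data and the helper name `menu_urls_from_explore` are made up, and unlike the cell above the helper walks every group rather than only `groups[0]`:

```python
def menu_urls_from_explore(explore_response):
    """Collect menu urls, skipping venues that have no 'menu' entry."""
    urls = []
    for group in explore_response.get("groups", []):
        for item in group.get("items", []):
            menu = item.get("venue", {}).get("menu")
            if menu and "url" in menu:
                urls.append(menu["url"])
    return urls


# Hand-written sample: one venue with a menu, one without.
sample_response = {"groups": [{"items": [
    {"venue": {"name": "Fuji",
               "menu": {"url": "https://foursquare.com/v/fuji/4a944e31f964a520262120e3/menu"}}},
    {"venue": {"name": "Venue without a menu"}},
]}]}

found_urls = menu_urls_from_explore(sample_response)
```

Venues without a menu are skipped silently here; collecting them in a second list would make it possible to revisit them later.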

In [31]:
unique_venues = []
ven_ids = {}
for offset in range(50, 200, 50):
    #intent:browse lets us use radius; radius is the radius in meters around the given 'll'; 800 meters (the
    #dimension I elected) is approx 0.5 miles (but I may adjust the radius going forward)
    #note: i don't need to include the radius here, b/c foursquare will automatically modify it dependent on
    #the density of venues in that area, but just to be sure i grab as many as possible...

    for x, y in grid_pairs:
        #'ll' takes a single "lat, lon" string, so format one grid point per call
        search = client.venues.explore(params={'ll': "%.2f, %.2f" % (y, x), 'radius': '800',
                                              'section': 'food','limit':'50','offset': str(offset)})

        #NOTE: .explore nests its results under ['groups'][0]['items'][i]['venue'], unlike .search, which
        #returns them under ['venues']
        #NOTE: no offset available in the .search call, only in .explore

        #if i use the .search api call, i need to pull out all the non-food places (which won't have menus)
        #i'd still have to do that regularly, but this could mean i get way fewer venues of value (read: food venues)
        #alternatively, by using the above api call, i can guarantee i pull food items, but only the most popular ones.
        #so either way, there might be some missing info. If i collate all the data though, and put those venues
        #together, i should get a pretty comprehensive list!
        ven_ids[(x, y, offset)] = [item['venue']['id'] for item in search['groups'][0]['items']]

In [19]:
# Now that I have plenty of urls, let's plug them into a scraper so i can populate my df

df_ready_rows = []
def parse_url(url, data=False):
    
#     venue_rows = []
    response               =  requests.get(url)
    
    #Steps:
    #1) get the unicode objects
    #2) change objects from unicode to string
    
    venue_name_uni         =  Selector(text=response.text).xpath('//h1[@class="venueName"]/text()').extract()
    venue_name             = unicodedata.normalize('NFKD', venue_name_uni[0]).encode('ascii','ignore')
    
    #also need to do an iteration to capture the multiple descriptors
    venue_desc_uni         =  Selector(text=response.text).xpath('//span[@class="unlinkedCategory"]/\
    text()').extract()
    
    #I'll make a list to capture the entire description:
    venue_desc_list        = []
    for venue_desc_phrase in range(len(venue_desc_uni)):
        venue_desc_n       = unicodedata.normalize('NFKD', venue_desc_uni[venue_desc_phrase]).encode('ascii',
                                                                                                  'ignore')
        venue_desc_list.append(venue_desc_n)

    #The url too, right?
    venue_menu_url         = url

    #Grabbing the venue rating. Note: venueScore is tagged positive/neutral/negative, but I'm only keeping the
    #numeric rating (1-10)
    venueScore_options     = ['positive','neutral','negative']
    venue_rated = None
    for vs_option in venueScore_options:
        try:
            venue_rating_uni        = Selector(text=response.text).xpath('//div[@class="venueRateBlock  "]/\
    span[@class="venueScore '+vs_option+'"]/span/text()').extract()
            venue_rated             = unicodedata.normalize('NFKD', venue_rating_uni[0]).encode('ascii','ignore')
            venue_rated = float(venue_rated)
        except (IndexError, ValueError):
            pass
        
    #Even if there is no rating, I'd still like to keep track of that...
    #(comparing against np.nan with == is always False, so None marks "no rating found" instead)
    if venue_rated is None:
        venue_rated = 'rating_not_available'
        
    #And I'll transform the list back into a string...
#     venue_rated = venue_rated[0]
    
    #NOTE: do I also need to account for menus that don't have titles? In that case meta_menu_list
    #could/would come back empty. If so, perhaps just wrap the title lookup in a try/except: pass, but could it
    #still grab the menu items? Maybe I should just put in a "null title" for meta_menu_n to overcome this.
    #I no longer think this is an issue, but maybe something to put in the appendix for later?
    
    meta_menu_list      =  Selector(text=response.text).xpath('//h2[@class="categoryName"]/text()').extract()
        
    for meta_menu_item in range(len(meta_menu_list)):
        
        meta_menu_n         = unicodedata.normalize('NFKD', meta_menu_list[meta_menu_item]).encode('ascii',
                                                                                                   'ignore')
        
#         print "meta menu title %d:" %(meta_menu_item+1), meta_menu_n, "# of meta menus:", len(meta_menu_list)
        
        depth_menus_n_uni = Selector(text=response.text).xpath('//div[@class="menu"]['+str(meta_menu_item+1)+']/\
        div[@class="menuItems"]/div[@class="section"]/div[@class="sectionHeader"]/\
        div[@class="sectionName"]/text()').extract()
        
        for meta_depth_nn in range(len(depth_menus_n_uni)):
            
            depth_menus_n     = unicodedata.normalize('NFKD', depth_menus_n_uni[meta_depth_nn]
                                                     ).encode('ascii','ignore')

            #get the name of the depth menu, and record its location as 'n_level'
            n_level = meta_depth_nn+1
#             print "depth menu title %d:" %(n_level), depth_menus_n
            
            #let's grab the entire depth menu:
            depth_menu_id_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
            ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]['+str(n_level)+']/\
            div[@class="sectionHeader"]/div[@class="sectionName"]/text()').extract()
            depth_menu_id     = unicodedata.normalize('NFKD', depth_menu_id_uni[0]
                                                     ).encode('ascii','ignore')
            depth_menu_id = len(depth_menu_id_uni)
#             print "#id of depth menu:", depth_menu_id
            
            #loop through the left and right side of each container:
            left_or_right_list = ['left','right']
            
            for left_or_right in left_or_right_list:
    
                #need the length of the [left/right] container, to iterate through:
                container_len_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]['+str(n_level)+']/div\
                [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/\
                div[@class="entry"]/node()[1]//text()').extract()
#                 print "left_check:", left_or_right, "contain len:", len(container_len_uni)
            
                for section_n in range(len(container_len_uni)):                    
                    
                    #now we can get the name of that menu item...
                    menu_item_name_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                    ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]\
                    ['+str(n_level)+']/div\
                    [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/div[@class="entry"]\
                    ['+str(section_n+1)+']/node()[1]//text()').extract()
                    menu_item_name     = unicodedata.normalize('NFKD', menu_item_name_uni[0]
                                                                    ).encode('ascii','ignore')
#                     print "menu_item_name:", menu_item_name
                    
                    #and then we can get the price (if there is one...)
                    try:
                        menu_item_price_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                    ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]\
                    ['+str(n_level)+']/div\
                    [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/div[@class="entry"]\
                    ['+str(section_n+1)+']/node()[2]//text()').extract()
                        menu_item_price     = unicodedata.normalize('NFKD', menu_item_price_uni[0]
                                                                    ).encode('ascii','ignore')
                        menu_item_price = float(menu_item_price)
#                         print "menu_item_price:", menu_item_price
                    except:
#                         print "menu_item_price:", "price_not_available"
                        menu_item_price = 'price_not_available'
                    
                    #and finally the description (if there is one...)
                    try:
                        menu_item_desc_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                    ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]\
                    ['+str(n_level)+']/div\
                    [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/div[@class="entry"]\
                    ['+str(section_n+1)+']/node()[3]//text()').extract()
                        menu_item_desc     = unicodedata.normalize('NFKD', menu_item_desc_uni[0]
                                                                    ).encode('ascii','ignore')
#                         print "menu_item_desc:", menu_item_desc
                    except:
#                         print "menu_item_desc:", "desc_not_available"
                        menu_item_desc = 'desc_not_available'

                    # Finally, I'll append my results so that when I wrap up the function, I can finish with
                    # a prepared set of info, dataframe ready.
                    df_ready_rows.append([venue_name,
                                       venue_desc_list,
                                       venue_menu_url,
                                       venue_rated,
                                       meta_menu_n,
                                       depth_menus_n,
                                       menu_item_name,
                                       menu_item_price,
                                       menu_item_desc])
#     df_ready_rows.append(venue_rows)
    return df_ready_rows
# 
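
The scraper above builds a new `Selector` from `response.text` for every XPath query and addresses each entry by its index. A possible simplification is to parse the page once and then walk sections and entries with relative paths. The sketch below shows that idea only: it swaps in the standard library's `xml.etree.ElementTree` so it runs without extra packages (it only handles well-formed markup, unlike a real HTML parser), and the sample markup, the `title` and `price` class names, and `parse_menu` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a menu page.
SAMPLE_MENU = """
<div class="menu">
  <div class="section">
    <div class="sectionName">Hot Sandwiches</div>
    <div class="entry"><div class="title">Hamburger</div><div class="price">5.95</div></div>
    <div class="entry"><div class="title">Cheese Burger</div><div class="price">6.45</div></div>
  </div>
  <div class="section">
    <div class="sectionName">Side Orders</div>
    <div class="entry"><div class="title">Crispy Fries</div></div>
  </div>
</div>
"""


def parse_menu(markup):
    """Parse the markup once, then read each entry relative to its section."""
    root = ET.fromstring(markup.strip())
    rows = []
    for section in root.findall(".//div[@class='section']"):
        section_name = section.findtext("div[@class='sectionName']")
        for entry in section.findall("div[@class='entry']"):
            name = entry.findtext("div[@class='title']")
            price_text = entry.findtext("div[@class='price']")
            price = float(price_text) if price_text is not None else None
            rows.append((section_name, name, price))
    return rows


menu_rows = parse_menu(SAMPLE_MENU)
```

The same structure should carry over to scrapy by calling `.xpath()` on each section selector instead of on the whole page, though I have not tried that against the real pages.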

In [111]:
# unique_urls_from_search

In [11]:
# parse_url('https://foursquare.com/v/grand-sichuan-house/4b1c3b4cf964a520bb0424e3/menu')

In [20]:
# Could be not working because they don't have a menu, or because they're not a restaurant. So could be useful to
# later determine if they are or aren't restaurants to begin with
start_time = time.time()
for menu_url in unique_urls_from_search:
    try:
        parse_url(menu_url)
    except:
        pass
    print("--- %s seconds ---" % (time.time() - start_time))

--- 0.602472066879 seconds ---
--- 0.773817062378 seconds ---
--- 1.0442199707 seconds ---
--- 13.0715789795 seconds ---
--- 13.5725190639 seconds ---
--- 25.2655169964 seconds ---
--- 25.4364790916 seconds ---
--- 42.8654990196 seconds ---
--- 43.0581560135 seconds ---
--- 43.3828670979 seconds ---
--- 55.3158929348 seconds ---
--- 55.5251250267 seconds ---
--- 67.3907461166 seconds ---
--- 79.476678133 seconds ---
--- 79.7490861416 seconds ---
--- 79.9500091076 seconds ---
--- 80.1297399998 seconds ---
--- 80.3187291622 seconds ---
--- 92.2162539959 seconds ---
--- 92.6900620461 seconds ---
--- 93.0314569473 seconds ---
--- 104.258902073 seconds ---
--- 105.10366416 seconds ---
--- 105.332853079 seconds ---
--- 117.259299994 seconds ---
--- 117.431175947 seconds ---
--- 117.860899925 seconds ---
--- 118.066966057 seconds ---
--- 129.956959009 seconds ---
--- 130.110574961 seconds ---
--- 130.55176115 seconds ---
--- 142.247686148 seconds ---
--- 142.796014071 seconds ---
--- 143.0015

In [23]:
print len(df_ready_rows)

2570


In [24]:
column_headers = ['venue_name', 'venue_desc_list', 'venue_menu_url', 'venue_rated', 'meta_menu_n', 'depth_menus_n',
                  'menu_item_name', 'menu_item_price', 'menu_item_desc']

In [25]:
something_else = pd.DataFrame(df_ready_rows, columns=column_headers)
something_else.shape

(2570, 9)

In [42]:
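# Category names are capitalized (e.g. 'Italian Restaurant'), so compare in lowercase, and check every
# descriptor in the list rather than only the first one.
for j in something_else.venue_desc_list:
    if any('vegan' in desc.lower() for desc in j):
        print "vegans"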

In [48]:
something_else.loc[50:]

Unnamed: 0,venue_name,venue_desc_list,venue_menu_url,venue_rated,meta_menu_n,depth_menus_n,menu_item_name,menu_item_price,menu_item_desc
50,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Entree - Seafood,Calamari Saute,12.95,"Sauteed with garlic, tomato and white wine"
51,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Entree - Seafood,Sauteed Butterfly Shrimps,12.95,"Sauteed mushroom, onion, butter and white wine"
52,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Entree - Seafood,Cioppino,14.95,"Crab, clam, shrimp and calamari"
53,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Entree - Seafood,Clam House with Linguini,14.95,"Clam, shrimp, calamari and mussel, (red or whi..."
54,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Hot Sandwiches,Roast Chicken Sandwich,5.95,desc_not_available
55,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Hot Sandwiches,Hamburger,5.95,desc_not_available
56,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Hot Sandwiches,Meat Ball Sandwich,5.95,desc_not_available
57,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Hot Sandwiches,Italian Sausage Sandwich,5.95,desc_not_available
58,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Hot Sandwiches,Cheese Burger,6.45,desc_not_available
59,Little Henry's,[Italian Restaurant],https://foursquare.com/v/little-henry's/4bde03...,5.3,Main Menu,Side Orders,Crispy Fries,3.95,desc_not_available


In [36]:
something_else.venue_menu_url.unique()

array(["https://foursquare.com/v/little-henry's/4bde0311e75c0f47488cc603/menu",
       'https://foursquare.com/v/daily-health-organic-veggie-cafe/4293c000f964a52030241fe3/menu',
       'https://foursquare.com/v/yin-xing-food--deli-market/4a7decacf964a52034f01fe3/menu',
       "https://foursquare.com/v/mexican-food-martha's/49e8aef1f964a52066651fe3/menu",
       'https://foursquare.com/v/third-ave-food-mart/4e1a0a90e4cd49a7e3f9244f/menu',
       'https://foursquare.com/v/pasta-shop-fine-foods/51e0ad55498eb7f2b6ed10e2/menu',
       'https://foursquare.com/v/pekin-chinese-food/4293c000f964a5202f241fe3/menu',
       'https://foursquare.com/v/bollywood-donuts-food-truck/4e4d4a8fbd413c4cc66ff6da/menu',
       'https://foursquare.com/v/woo-ri-food-market/4e4d0731bd413c4cc66e18cc/menu',
       'https://foursquare.com/v/emergency-food-rations-depot/4f32752419836c91c7d9b88f/menu',
       "https://foursquare.com/v/herb's-fine-food/49baf390f964a520ca531fe3/menu",
       'https://foursquare.com/v/s

In [95]:
# # practice run:
# parse_url(menu_url)
# parse_url("https://foursquare.com/v/wong's-kitchen/4be45d2b2457a593cb1faa15/menu")
# parse_url("https://foursquare.com/v/gracias-madre/4b4955ccf964a520b86d26e3/menu")
# parse_url("https://foursquare.com/v/nopalito/4f8f447f7716cd1fbf769274/menu")
# parse_url("https://foursquare.com/v/holy-gelato/49daaa1ff964a520a15e1fe3/menu")


In [1]:
# # I can run this once I have a clean set of urls
# for menu_url in menu_urls:
#     try:
#         parse_url(menu_url)
#     except:
#         pass

In [None]:
#NOTES: Do I want to cull all the duplicates from the menu urls that I grab?
#Should I make the loop above this cell more efficient by checking and storing which urls don't work?
#Then I could exclude them the next time I run my loops. BUT it's probably better to keep track of these,
#because while some won't even be eateries, many will be eateries that simply don't have Foursquare menus.
#In such cases, knowing the venue information could still be valuable, because we can surface those venues to
#users who wish to manually add items (possibly by taking pictures of the menu where the item is located).
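
A sketch of the bookkeeping described in these notes: run the parser over every url, but record the ones that fail instead of silently passing over them. `scrape_all` and the fake parser are hypothetical stand-ins; in the notebook, `parse` would be a function that returns the rows for one url.

```python
def scrape_all(urls, parse):
    """Parse every url; return the collected rows plus a dict of url -> error."""
    rows = []
    failed = {}
    for url in urls:
        try:
            rows.extend(parse(url))
        except Exception as exc:
            # Keep the url and the reason so it can be reviewed or skipped next time.
            failed[url] = repr(exc)
    return rows, failed


def fake_parse(url):
    """Stand-in parser: pretends that urls containing 'no-menu' have nothing to scrape."""
    if "no-menu" in url:
        raise IndexError("no venue name found")
    return [(url, "Crispy Fries", 3.95)]


rows, failed = scrape_all(
    ["https://example.com/v/has-menu/1/menu", "https://example.com/v/no-menu/2/menu"],
    fake_parse,
)
```

The `failed` dictionary could be saved between runs, both to skip known dead urls and to keep a list of venues that might need their menus added by hand.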

In [None]:
#BELOW ARE POTENTIALLY USEFUL, BUT UNUSED MATERIAL::

In [None]:
# NOTE: I rewrote the searched_venue_ids part, to work with the .explore API endpoints
unique_venues_from_explore = []
unique_venue_names_from_explore = []
start_time = time.time()
for x, y in grid_pairs:
    try:
        explore = client.venues.explore(params={'ll': '%.2f, %.2f' % (y, x), 'llAcc':'100.0','radius': '3000',
                                               'section': 'food','limit':'50','offset':'50','sortByDistance':'1'})
        explored_venue_ids = []
        explored_venue_names = []
        for i in range(len(explore['groups'][0]['items'])):
            try:
                pulled_id = explore['groups'][0]['items'][i]['venue']['menu']['url']
                explored_venue_ids.append(pulled_id)
                pulled_name = explore['groups'][0]['items'][i]['venue']['name']
                explored_venue_names.append(pulled_name)
            except:
                pass
        print explored_venue_ids, explored_venue_names
        for next_id, next_name in zip(explored_venue_ids, explored_venue_names):
            if 'foursquare.com' in str(next_id):
                unique_venues_from_explore.append(next_id)
                unique_venue_names_from_explore.append(next_name)
        print("--- %s seconds ---" % (time.time() - start_time))
    except:
        pass

In [None]:
print len(unique_venue_names_from_explore), len(unique_venues_from_explore)

In [None]:
# # Category/column titles, in order:
# # [venue_name, venue_desc_list, venue_menu_url, venue_rated], [meta_menu_n], [depth_menus_n], [menu_item_name,
# # menu_item_price, menu_item_desc]

# venue_rows = []
# for [venue_name, venue_desc_list, venue_menu_url, venue_rated] in venues: 
#     for meta_menu in meta_menu_n:
#         for depth_menu in depth_menus_n:
#             venue_rows.append([venue_name,
#                                venue_desc_list,
#                                venue_rated,
#                                meta_menu,
#                                depth_menu,
#                                menu_item_name,
#                                menu_item_price,
#                                menu_item_desc])

In [9]:
# # Category/column titles, in order:
# # [venue_name, venue_desc_list, venue_menu_url, venue_rated], [meta_menu_n], [depth_menus_n], [menu_item_name,
# # menu_item_price, menu_item_desc]

# venue_dict = {}
# for [venue_name, venue_desc_list, venue_menu_url, venue_rated] in venues:
#     venue_dict[venue_name] = {'desc_list':venue_desc_list,
#                               'menu_url':venue_menu_url,
#                               'rating':venue_rated}
    
#     for meta_menu in meta_menu_n:
#         venue_dict[venue_name][meta_menu] = {}
            
#         for depth_menu in depth_menus_n:
#             venue_dict[venue_name][meta_menu][depth_menu] = {'menu_item_name':menu_item_name,
#                                                              'menu_item_price':menu_item_price,
#                                                              'menu_item_desc':menu_item_desc}
            
