# Capstone Proposals:
    
As Americans, we have an enormous variety of foods commonly available at grocery stores, corner shops, and increasingly, online. For one project, I'd like to take a look at the fundamental ingredients and nutrients that compose this great variety. My hope would be to predict the number of nutrients found in certain food items, as well as looking hollistically at the ingredients that are most common in our foods. I would like to then relate this to general dietary habits of the US population, and get a granular idea of the nutrient composition of diets. The stretch goal would be to predict the changes in nutrient-use given an increasing population of vegans, where I could estimate increases in required nutrient needs based on 'normal' diets. I'm unsure whether this would be kosher, because obviously I can't check the quality of my predictions for the future. I could, however, see if I could predict correlations on historical data in growth of the vegan diet in the US, and then use that model to predict the effect of future growth of vegan diets.
       
*** (I'm still pondering the best way to go about this, but at least you can see where I'm hoping to take this) ***
    
'Nutrient Database from 2012' - https://catalog.data.gov/dataset/usda-national-nutrient-database-for-standard-reference
'Nutrient Database from 2009' - https://catalog.data.gov/dataset/usda-national-nutrient-database-for-standard-reference-release-22
    
Partially pre-cleaned nutrient data - https://github.com/mhess126/usda_national_nutrients
USDA API (not sure if this could be useful yet?) - https://ndb.nal.usda.gov/ndb/api/doc
    
    
I'm still looking for better data to work with that could give me some idea of dietary habits, but here's where I've been looking - https://catalog.data.gov/dataset?q=bureauCode:%22005:13%22 ; https://catalog.data.gov/dataset?q=usda+consumption+national+nutrient&sort=views_recent+desc&ext_location=&ext_bbox=&ext_prev_extent=-142.03125%2C2.4601811810210052%2C-59.0625%2C58.63121664342478
    
    
Alternatively, I could look at pricing these ingredients, based on the foodtypes that we find them in. For example, take a chili sauce. Of this sauce, take a look at the unique ingredients, and their respective portion size in the sauce. Let's say that black beans compose 20% of the chili, and the chili runs 5 dollars/unit. Then the pricing for the black bean ingredient would be 1 dollar.

From there, I would want to observe how expensive these ingredients can get and how their prices change depending on what products they may be found in. 
    
'Food Price Outlook, current' - https://catalog.data.gov/dataset/food-price-outlook
    

In [15]:
# pip install scrapy
# pip install --upgrade zope2

from collections import Counter
import foursquare
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import random
import requests
from scrapy import Selector
from scrapy.http import HtmlResponse
import seaborn as sns
import time
import unicodedata

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
CLIENT_ID = '33NDJLQ342FAMTNX5Z55PR0PQQOJZRAZZ3XEAI0ERQXEJRUL'
CLIENT_SECRET = 'FGMFNZGMWUR1ILZFGH2NV1OKQY3WK5AAPHWKXTFWRR3B4Z4E'
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

In [3]:
#Let's define a geo dictionary whose bounds encompass all of SF, and another 4 miles south as well (so ~7mi x 11mi)
ne = {'ne_lat': -122.3550796509, 'ne_long': 37.8127675576}
sw = {'sw_lat': -122.5164413452, 'sw_long': 37.7078622611}

# east/west
lat_bounds = [ne['ne_lat'], sw['sw_lat']]
print lat_bounds[0]
# north/south
lon_bounds = [ne['ne_long'], sw['sw_long']]
print lon_bounds

#increment ~ half a mile (in latitude/longitude), is 0.007
increment = 0.007

# The gridding below moves North, starting from the bottom SW corner boundary, and then moves east half a mile,
# and repeats the process until stopping at the NE corner boundary.
grid_pairs = []
for lat in np.arange(lat_bounds[1], lat_bounds[0], increment):
    for lon in np.arange(lon_bounds[1], lon_bounds[0], increment):
        grid_pairs.append([lat, lon])
        
print len(grid_pairs)

-122.355079651
[37.8127675576, 37.7078622611]
360


In [4]:
# NOTES:
# should i make the loop above this cell more efficient by checking and storing which url's don't work?
# B/C then I can exclude them the next time i run my loops ; BUT, it's probably better to keep track of these,
# because while some won't even be eateries, many will be eateries that simply don't have foursqare menus.

# In such cases, knowing the venue information could still be valuable, because we can surface those to users who
# wish to manually add items (could possibly add items by taking pictures of the menu where the item is located)?

In [5]:
# Here, I work with the Explore Endpoint

# This will pull the venue names and url's (labeled as 'explored_venue_ids') that correspond to the geographical
# areas I paired off in the previous step with grid_pairs. These are the  urls that I then intend to scrape in
# subsequent steps.

unique_venues_from_explore = []
unique_venue_names_from_explore = []
start_time = time.time()
counter = 0
for x, y in grid_pairs:
    for offset in range(50, 251, 50):
        try:
            # Radius is radius in meters around given 'll'; 800 meters
            # is approx 0.5 miles (but I may adjust the radius going forward)
        
            explore = client.venues.explore(params={'ll': '%.2f, %.2f' % (y, x), 'llAcc':'100.0','radius': '2000',
                                               'section': 'food','limit':'50','offset':''+str(offset)+'',
                                                    'sortByDistance':'1'})
            explored_venue_ids = []
            explored_venue_names = []

            for i in range(len(explore['groups'][0]['items'])):
                try:
                    pulled_id = explore['groups'][0]['items'][i]['venue']['menu']['url']
                    explored_venue_ids.append(pulled_id)
                    pulled_name = explore['groups'][0]['items'][i]['venue']['name']
                    explored_venue_names.append(pulled_name)
                except:
                    pass
            for next_id, next_name in zip(explored_venue_ids, explored_venue_names):
                if 'foursquare.com' in str(next_id):
                    unique_venues_from_explore.append(next_id)
                    unique_venue_names_from_explore.append(next_name)
            counter += 1
            print("--- %s seconds ---" % (time.time() - start_time)), "loop number:", counter
        except:
            pass

--- 0.38595700264 seconds --- loop number: 1
--- 0.891221046448 seconds --- loop number: 2
--- 1.10108804703 seconds --- loop number: 3
--- 1.71197199821 seconds --- loop number: 4
--- 1.94652605057 seconds --- loop number: 5
--- 2.42549800873 seconds --- loop number: 6
--- 2.64763998985 seconds --- loop number: 7
--- 2.86068606377 seconds --- loop number: 8
--- 3.10412812233 seconds --- loop number: 9
--- 3.31964111328 seconds --- loop number: 10
--- 3.5807890892 seconds --- loop number: 11
--- 3.84023118019 seconds --- loop number: 12
--- 4.05506515503 seconds --- loop number: 13
--- 4.27587914467 seconds --- loop number: 14
--- 4.50409913063 seconds --- loop number: 15
--- 4.74624109268 seconds --- loop number: 16
--- 4.98233413696 seconds --- loop number: 17
--- 5.20658898354 seconds --- loop number: 18
--- 5.80673003197 seconds --- loop number: 19
--- 6.41943216324 seconds --- loop number: 20
--- 6.68029713631 seconds --- loop number: 21
--- 7.30416417122 seconds --- loop number: 

In [6]:
# print len(unique_venue_names_from_explore), len(unique_venues_from_explore)

In [6]:
unique_venues_from_explore = list(set(unique_venues_from_explore))
unique_venue_names_from_explore = list(set(unique_venue_names_from_explore))
unique_but_chained_venues = len(unique_venues_from_explore) - len(unique_venue_names_from_explore)

print len(unique_venues_from_explore), unique_but_chained_venues
# So ~ 18% of our results are unique venues (and another 48 of them are unique venus,
# but w/ the same name i.e. chains)

1036 48


In [7]:
# Now that I have plenty of urls, let's plug them into a scraper so I can populate my df. Here are the headers
# I'm looking to get as well...

column_headers = ['venue_name', 'venue_desc_list', 'vegan_venue_check', 'venue_menu_url', 'venue_rated',
                  'meta_menu_n', 'depth_menus_n', 'menu_item_name', 'menu_item_price', 'menu_item_desc']

df_ready_rows = []
def parse_url(url, data=False):

    response               =  requests.get(url)
    
    #Steps:
    #1) get the unicode objects
    #2) change objects from unicode to string
    
    venue_name_uni         = Selector(text=response.text).xpath('//h1[@class="venueName"]/text()').extract()
    venue_name             = unicodedata.normalize('NFKD', venue_name_uni[0]).encode('ascii','ignore')
    
    # I also need to do an iteration to capture the multiple descriptors.
    # I'll store the descriptors in a list to capture the entire description:
    venue_desc_uni         =  Selector(text=response.text).xpath('//span[@class="unlinkedCategory"]/\
    text()').extract()
    venue_desc_list        = []
    vegan_venue_check      = [] # I'll also do a quick check to see up front if we have a vegan menu on our hands
    for venue_desc_phrase in range(len(venue_desc_uni)):
        venue_desc_n       = unicodedata.normalize('NFKD', venue_desc_uni[venue_desc_phrase]).encode('ascii',
                                                                                                  'ignore')
        venue_desc_list.append(venue_desc_n)

    # Here I am going to manually label all my menu items which correspond to restaurants that Foursquare already
    # recognizes as vegan. They're specific label for a vegan venue is 'Vegetarian / Vegan Restaurant' - this may
    # be misleading and incorrectly read as vegetarian or vegan, with it in fact means that it is vegetarian, but
    # more specifically, vegan.
    
    flat_desc = ' '.join(venue_desc_list)
    if 'vegetarian / vegan restaurant' in flat_desc.lower():
        vegan_venue_check  = 'vegan'
    else:
        vegan_venue_check  = 'not_vegan'

    # The url too, right?
    venue_menu_url         = url

    # Grabbing the venue rating, just a note: venueScore positive/neutral/negative, but I'm only getting the
    # rating number, on a scale 1-10
    venueScore_options     = ['positive','neutral','negative']
    venue_rated = 'placeholder'
    for vs_option in venueScore_options:
        try:
            venue_rating_uni        = Selector(text=response.text).xpath('//div[@class="venueRateBlock  "]/\
    span[@class="venueScore '+vs_option+'"]/span/text()').extract()
            venue_rating            = unicodedata.normalize('NFKD', venue_rating_uni[0]).encode('ascii','ignore')
            venue_rated             = float(venue_rating)
        except:
            pass
        
    # Even if there is no rating, I'd still like to keep track of that...
    if venue_rated == 'placeholder':
        venue_rated = 'rating_not_available'
        
    #And I'll transform the list back into a string...
#     venue_rated = venue_rated[0]
    
    #NOTE: do i also need to account for when menus don't have titles? because in that case meta_menu_list
    #could/would
    #return null. if so, perhaps just do a 'try excepct:pass' function if it can't find titles, but could it still
    #grab the menu items? maybe i should just put in a "null title" for the meta_menu_n to overcome this
    #I no longer think this is an issue, but maybe something to put in the appendix for later?
    
    meta_menu_list      =  Selector(text=response.text).xpath('//h2[@class="categoryName"]/text()').extract()
        
    for meta_menu_item in range(len(meta_menu_list)):
        
        meta_menu_n         = unicodedata.normalize('NFKD', meta_menu_list[meta_menu_item]).encode('ascii',
                                                                                                   'ignore')
        
#         print "meta menu title %d:" %(meta_menu_item+1), meta_menu_n, "# of meta menus:", len(meta_menu_list)
        
        depth_menus_n_uni = Selector(text=response.text).xpath('//div[@class="menu"]['+str(meta_menu_item+1)+']/\
        div[@class="menuItems"]/div[@class="section"]/div[@class="sectionHeader"]/\
        div[@class="sectionName"]/text()').extract()
        
        for meta_depth_nn in range(len(depth_menus_n_uni)):
            
            depth_menus_n     = unicodedata.normalize('NFKD', depth_menus_n_uni[meta_depth_nn]
                                                     ).encode('ascii','ignore')

            #get the name of the depth menu, and record it's location as 'n_level'
            n_level = meta_depth_nn+1
#             print "depth menu title %d:" %(n_level), depth_menus_n
            
            #let's grab the entire depth menu:
            depth_menu_id_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
            ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]['+str(n_level)+']/\
            div[@class="sectionHeader"]/div[@class="sectionName"]/text()').extract()
            depth_menu_id     = unicodedata.normalize('NFKD', depth_menu_id_uni[0]
                                                     ).encode('ascii','ignore')
            depth_menu_id = len(depth_menu_id_uni)
#             print "#id of depth menu:", depth_menu_id
            
            #loop throught the left and right side of each container:
            left_or_right_list = ['left','right']
            
            for left_or_right in left_or_right_list:
    
                #need the length of the [left/right] container, to iterate through:
                container_len_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]['+str(n_level)+']/div\
                [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/\
                div[@class="entry"]/node()[1]//text()').extract()
#                 print "left_check:", left_or_right, "contain len:", len(container_len_uni)
            
                for section_n in range(len(container_len_uni)):                    
                    
                    #now we can get the name of that menu item...
                    menu_item_name_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                    ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]\
                    ['+str(n_level)+']/div\
                    [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/div[@class="entry"]\
                    ['+str(section_n+1)+']/node()[1]//text()').extract()
                    menu_item_name     = unicodedata.normalize('NFKD', menu_item_name_uni[0]
                                                                    ).encode('ascii','ignore')
#                     print "menu_item_name:", menu_item_name
                    
                    #and then we can get the price (if there is one...)
                    try:
                        menu_item_price_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                    ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]\
                    ['+str(n_level)+']/div\
                    [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/div[@class="entry"]\
                    ['+str(section_n+1)+']/node()[2]//text()').extract()
                        menu_item_price     = unicodedata.normalize('NFKD', menu_item_price_uni[0]
                                                                    ).encode('ascii','ignore')
                        menu_item_price = float(menu_item_price)
#                         print "menu_item_price:", menu_item_price
                    except:
#                         print "menu_item_price:", "price_not_available"
                        menu_item_price = 'price_not_available'
                    
                    #and finally the description (if there is one...)
                    try:
                        menu_item_desc_uni = Selector(text=response.text).xpath('//div[@class="menu"]\
                    ['+str(meta_menu_item+1)+']/div[@class="menuItems"]/div[@class="section"]\
                    ['+str(n_level)+']/div\
                    [@class="entryContainer"]/div[@class="'+left_or_right+'Column"]/div[@class="entry"]\
                    ['+str(section_n+1)+']/node()[3]//text()').extract()
                        menu_item_desc     = unicodedata.normalize('NFKD', menu_item_desc_uni[0]
                                                                    ).encode('ascii','ignore')
#                         print "menu_item_desc:", menu_item_desc
                    except:
#                         print "menu_item_desc:", "desc_not_available"
                        menu_item_desc = 'desc_not_available'

                    # Finally, I'll append my results so that when I wrap up the fuction, I can finish with
                    # a prepared set of info, dataframe ready.
                    df_ready_rows.append([venue_name,
                                          venue_desc_list,
                                          vegan_venue_check,
                                          venue_menu_url,
                                          venue_rated,
                                          meta_menu_n,
                                          depth_menus_n,
                                          menu_item_name,
                                          menu_item_price,
                                          menu_item_desc])
    return df_ready_rows

In [8]:
# Actually, through the Explore endpoint, I was able to directly grab the menu url, so no need to manually build my
# url this time...
start_time = time.time()
countered = 0
for menu_url in unique_venues_from_explore:
    try:
        parse_url(menu_url)
    except:
        pass
    countered += 1
    print("--- %s seconds ---" % (time.time() - start_time)), "loop number:", countered
# Takes about 4 mins for 30 url's

--- 5.90584397316 seconds --- loop number: 1
--- 9.78714299202 seconds --- loop number: 2
--- 10.4112210274 seconds --- loop number: 3
--- 19.2670459747 seconds --- loop number: 4
--- 24.8595850468 seconds --- loop number: 5
--- 38.9098699093 seconds --- loop number: 6
--- 41.2392959595 seconds --- loop number: 7
--- 51.6869559288 seconds --- loop number: 8
--- 53.681442976 seconds --- loop number: 9
--- 57.6186280251 seconds --- loop number: 10
--- 63.3903489113 seconds --- loop number: 11
--- 72.6834738255 seconds --- loop number: 12
--- 76.8377130032 seconds --- loop number: 13
--- 78.575111866 seconds --- loop number: 14
--- 92.7377259731 seconds --- loop number: 15
--- 104.753307819 seconds --- loop number: 16
--- 109.562012911 seconds --- loop number: 17
--- 116.337558985 seconds --- loop number: 18
--- 119.351768017 seconds --- loop number: 19
--- 121.546679974 seconds --- loop number: 20
--- 124.82860899 seconds --- loop number: 21
--- 130.905108929 seconds --- loop number: 22


In [9]:
# I ran my script previously, and just saved a local copy...

explore_endpoint_df = pd.DataFrame(df_ready_rows, columns=column_headers)
explore_endpoint_df.shape

explore_endpoint_df.to_pickle('../../projects/Capstone Stuff/explore_endpoint_df.pkl')

In [11]:
food = pd.read_pickle('../../projects/Capstone Stuff/explore_endpoint_df.pkl')

In [14]:
food.shape

(90047, 10)

In [13]:
food.vegan_venue_check.value_counts()

not_vegan    88186
vegan         1861
Name: vegan_venue_check, dtype: int64

In [14]:
food.head()

Unnamed: 0,venue_name,venue_desc_list,vegan_venue_check,venue_menu_url,venue_rated,meta_menu_n,depth_menus_n,menu_item_name,menu_item_price,menu_item_desc
0,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Lunch,Rice Plates,Beef Or Chicken with Broccoli Over Ricea,5.25,desc_not_available
1,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Lunch,Rice Plates,Beef Or Chicken with Fresh Mushroom in Oyster ...,5.25,desc_not_available
2,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Lunch,Rice Plates,Beef with Mixed Vegetables Over Ricea,5.25,desc_not_available
3,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Lunch,Rice Plates,Beef with Scrambled Eggs Over Ricea,5.25,desc_not_available
4,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Lunch,Rice Plates,Roasted Duck Over Ricea,5.75,desc_not_available


In [15]:
import textacy
import textacy.data

In [16]:
food_desc = food[food.menu_item_desc != 'desc_not_available']
food_desc.head()

Unnamed: 0,venue_name,venue_desc_list,vegan_venue_check,venue_menu_url,venue_rated,meta_menu_n,depth_menus_n,menu_item_name,menu_item_price,menu_item_desc
47,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Menu,Pork,Mu Shu with Pancakes ( 4 Pcs.),7.95,"Choice of chicken, pork, or vegetables"
48,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Menu,Pork,Egg Fu Yung,7.95,"Choice of chicken, pork, or vegetables"
90,Hunan Chef,[Hunan Restaurant],not_vegan,https://foursquare.com/v/hunan-chef/4bcfa515a8...,7.3,Menu,Vegetables,Crispy Tofu,6.5,"Choice of sweet &sour, oyster, or currya sauce"
107,Chouchou,[French Restaurant],not_vegan,https://foursquare.com/v/chouchou/40f9bd80f964...,7.7,Dessert Menu,Desserts,Ice Cream,6,With ixed fresh fruit
120,Chouchou,[French Restaurant],not_vegan,https://foursquare.com/v/chouchou/40f9bd80f964...,7.7,Dine About Town Dinner,Entrees Ou Salades (Appetizers Or Salads),Salade Composee,price_not_available,"Mix green, wild arugula, cherry tomatoes, fres..."


In [17]:
texts = food_desc.sample(n=1000).menu_item_desc.values
print texts[0:5]
docs = textacy.corpus.Corpus(u'en', texts=(x.decode('utf-8','ignore') for x in food_desc.sample(n=1000).menu_item_desc.values))

['With a fried egg, polenta hash, spinach, and paprika oil'
 'Freshly baked bagel topped with nutty sesame seeds.'
 'Grass-fed beef filet mignon, green peppercorn sauce, braised endive'
 'Baby shrimp in garlic sauce.'
 'Fish of The Day (All fresh seafood is subject to availability)']


In [18]:
for i in range(10):
    print [x for x in textacy.extract.noun_chunks(docs[i])]

[Chili Sauce]
[Steamed rice, lentil, patties, special spice mixture]
[Boneless chicken leg meats, garlic soy, potato starch]
[Braised lamb shoulder, swiss chard, gremolata]
[Mi, toi cha gio bo nuong]
[Grilled whole trout, thai herbs, sauteed basil, ginger, mushrooms, lemon grass, garlic sauce]
[citrus]
[Ham]
[ixed fresh fruit]
[Sliced norwegian salmon, egg salad, capers, dill, bread, green salad, house vinaigrette]


In [20]:
# Some tom-foolery w/ regards to the explore_endpoint_df is below, before my final notes

In [21]:
# Here I'm just testing out a prototype search function
count = 0
loc_list = []
for i in range(food.shape[0]):
    flat_desc = ' '.join(food.venue_desc_list[i])
    if 'american' in flat_desc.lower():
        count += 1
        loc_list.append(i)

In [22]:
for i in loc_list:
    dffff = food.loc[[i]]
dffff

Unnamed: 0,venue_name,venue_desc_list,vegan_venue_check,venue_menu_url,venue_rated,meta_menu_n,depth_menus_n,menu_item_name,menu_item_price,menu_item_desc
90134,The Corner Store,"[American Restaurant, Cocktail Bar, Bar]",not_vegan,https://foursquare.com/v/the-corner-store/502e...,8.4,Brunch Menu,Sweet Stuff,Strauss Farm Soft Serve,6,"Vanilla, Chocolate or Twist. Toppings: Oreo, B..."


In [23]:
len(food[food.vegan_venue_check == 'vegan'][
    food.menu_item_desc != 'desc_not_available'].menu_item_desc.value_counts())

  from ipykernel import kernelapp as app


970

In [36]:
blah = food[food.vegan_venue_check == 'vegan'][food.menu_item_name != 'desc_not_available'][
    food.menu_item_desc != 'desc_not_available'][['venue_name','depth_menus_n','menu_item_name','menu_item_desc']].reset_index()
blah.drop('index', axis=1, inplace=True)
blah
# Here we have vegan food items, where we have item names and descriptions available

  if __name__ == '__main__':
  from ipykernel import kernelapp as app


Unnamed: 0,venue_name,depth_menus_n,menu_item_name,menu_item_desc
0,Enjoy Vegetarian Restaurant,Appetizer,Combination Platter,"Soybean sheet, bbq pork, tofu, sweet & sour wh..."
1,Enjoy Vegetarian Restaurant,Appetizer,Japanese Sashimi,W/ wasabi
2,Enjoy Vegetarian Restaurant,Appetizer,Pot Sticker,Deep fried 6 pcs
3,Enjoy Vegetarian Restaurant,Appetizer,Crispy Tofu,W/ salt & chili peppers
4,Enjoy Vegetarian Restaurant,Appetizer,Vegetarian Goose,(soybean sheet)
5,Enjoy Vegetarian Restaurant,Appetizer,Fried Curry Potato Triangle,Samosa
6,Enjoy Vegetarian Restaurant,Soup,Veggie Shark's Fin Soup,W/ dice vegetables
7,Enjoy Vegetarian Restaurant,Soup,Thai Style Soup,"Tom yum sauce w/ tomato, soft tofu, baby corn"
8,Enjoy Vegetarian Restaurant,Soup,Won Ton Soup,6 pcs/ 10 pcs
9,Enjoy Vegetarian Restaurant,Soup,Clay Pot Soup,"Tofu, mushroom, bamboo pith, dean vermicelli, ..."


In [37]:
# I strongly suspect venue_name 'The Plant' is the same as 'The Plant Cafe Organic'
blah[blah.venue_name == 'The Plant Cafe Organic'].menu_item_name.value_counts()

Mango                                  2
Acai Berry Protein                     2
Protein                                2
Bagel                                  2
Sambazon C                             2
Chocolate Banana                       2
Strawberry                             2
Quinoa Bowl                            1
Hummus Plate                           1
Roasted Chicken and Avocado            1
Raspberry Lime                         1
Udon Noodles                           1
Citrus C Blend                         1
Fish Tacos                             1
Green Curry                            1
Healthy Sunrise                        1
Side Salad                             1
Wheatgrass                             1
Pear and Blue Cheese                   1
Masala Vegetable Stew                  1
Breakfast Burrito                      1
Spicy Red with Green Beet              1
Tikka Wrap                             1
One Egg                                1
Sambazon Bowl   

In [38]:
blah[blah.venue_name == 'Shangri-La Vegetarian Restaurant'].menu_item_name.value_counts()
blah[blah.venue_name == 'Gracias Madre'].menu_item_desc.values
blah[blah.venue_name == 'Gracias Madre'][['depth_menus_n','menu_item_name','menu_item_desc']]

Unnamed: 0,depth_menus_n,menu_item_name,menu_item_desc
985,Bebidas,Bloody Mary,"House infused jalepeno soju, tomato, celery, c..."
986,Bebidas,Tropical Green Smoothie,"Mango, pineapple, spinach, coconut milk, ginge..."
987,Bebidas,Madre Green Smoothie,"Spinach, cilantro, mint, avocado, pineapple ju..."
988,Cafe Y Te,Horchata Latte,Served iced or hot
989,Cafe Y Te,Pot of Tea,"Earl grey, jasmine green or ginger mint"
990,Botanas,Warm Lemon Scone,Served with orange cream and strawberry jam
991,Botanas,Avocado Toast,"Gluten free, mariposa bakery toast, cashew que..."
992,Comida,Chimichanga,"Tempeh chorizo, caramelized onions, red pepper..."
993,Comida,Chilaquiles,House made tortilla chips sauteed in spicy sal...
994,Comida,Potato & Chorizo Hash,Roasted potatoes and tempeh chorizo with avoca...


In [27]:
food[food.menu_item_name == 'French Fries'].menu_item_desc.value_counts()

desc_not_available                                                                                       80
Smoky ketchup, sweet-onion aioli                                                                          2
Anchovy Ketchup                                                                                           2
With aioli                                                                                                2
Fresh thyme, aioli                                                                                        2
Smoky tomato ketchup, sweet-onion aioli                                                                   2
French fries                                                                                              1
Saffron aioli                                                                                             1
Super tasty potatoes fried when you order em                                                              1
Hand cut fries, skin and all

In [18]:
# Below is an alternate approach, using the Search Endpoint

In [19]:
# # This will independently pull the venue names and id codes that correspond to the geographical areas
# # I paired off in the previous step with grid_pairs. The names and id's will be subesquently used to
# # construct menu url's, which I then intend to scrape.

# unique_venues_from_search = []
# unique_venue_names_from_search = []
# start_time = time.time()

# for x, y in grid_pairs:
#     try:
#         search = client.venues.search(params={'ll': '%.2f, %.2f' % (y, x),'query': 'food', 'limit':'50',
#                                       'intent':'browse','radius':'800'})
#         searched_venue_ids = [search['venues'][i]['id'] for i in range(len(search['venues']))]
#         searched_name_ids = [search['venues'][i]['name'] for i in range(len(search['venues']))]
#         for next_id, next_name in zip(searched_venue_ids, searched_name_ids):
#             unique_venues_from_search.append(next_id)
#             unique_venue_names_from_search.append(next_name)
# #         print('--- %s loop-active seconds ---' % (time.time() - start_time))
#     except:
#         print('Sleeping...')
# #         time.sleep(random.randint(115,140))
# print('--- %s active seconds ---' % (time.time() - start_time))

In [20]:
# print len(unique_venue_names_from_search), len(unique_venues_from_search)

In [21]:
# unique_venue_names_from_search[:5], unique_venues_from_search[:5]

In [22]:
# # This will only work for the Search Endpoint
# menu_urls_from_search = []
# base_url = 'https://foursquare.com/v/'
# for venue_id, venue_name in zip(unique_venues_from_search, unique_venue_names_from_search):
#     dat_id = unicodedata.normalize('NFKD', venue_id).encode('ascii','ignore')
#     dat_name = unicodedata.normalize('NFKD', venue_name).encode('ascii','ignore')
#     dat_name = dat_name.lower().replace('/','-').replace(' ','-')
#     transformed_url = base_url+dat_name+'/'+dat_id+'/menu'
#     menu_urls_from_search.append(transformed_url)
# len(menu_urls_from_search)

In [23]:
# # Testing that no venues have been duplicated with my searches
# unique_urls_from_search = list(set(menu_urls_from_search))

In [24]:
# len(unique_urls_from_search)
# # This amounts to roughly 8% of my total venues searched i.e. 280/(70*50)

In [25]:
# # I'd use a try except because if the manually generated urls don't yield menus,
# # it could be because they're not a restaurant. So could be useful to later
# # determine if they are or aren't restaurants to begin with. But for now, that goes beyond the scope of this project.
# start_time = time.time()
# for menu_url in unique_urls_from_search[:30]:
#     try:
#         parse_url(menu_url)
#     except:
#         pass
#     print("--- %s seconds ---" % (time.time() - start_time))

In [26]:
# search_endpoint_df = pd.DataFrame(df_ready_rows, columns=column_headers)
# search_endpoint_df.shape

In [27]:
# search_endpoint_df.loc[:50[]

In [28]:
# search_endpoint_df.venue_menu_url.unique()

In [29]:
#BELOW ARE POTENTIALLY USEFUL, BUT UNUSED MATERIAL::

In [30]:
# # This func could be useful to add the next offset to my completed venue list...
# def extend_unique_venues(unique_venues, proposed_venue):
#     if proposed_venue not in unique_venues:
#         unique_venues.append(proposed_venue)

In [31]:
# # Category/column titles, in order:
# # [venue_name, venue_desc_list, venue_menu_url, venue_rated], [meta_menu_n], [depth_menus_n], [menu_item_name,
# # menu_item_price, menu_item_desc]

# venue_rows = []
# for [venue_name, venue_desc_list, venue_menu_url, venue_rated] in venues: 
#     for meta_menu in meta_menu_n:
#         for depth_menu in depth_menus_n:
#             venue_rows.append([venue_name,
#                                venue_desc_list,
#                                venue_rated,
#                                meta_menu,
#                                depth_menu,
#                                menu_item_name,
#                                menu_item_price,
#                                menu_item_desc])

In [32]:
# # Category/column titles, in order:
# # [venue_name, venue_desc_list, venue_menu_url, venue_rated], [meta_menu_n], [depth_menus_n], [menu_item_name,
# # menu_item_price, menu_item_desc]

# venue_dict = {}
# for [venue_name, venue_desc_list, venue_menu_url, venue_rated] in venues:
#     venue_dict[venue_name] = {'desc_list':venue_desc_list,
#                               'menu_url':venue_menu_url,
#                               'rating':venue_rated}
    
#     for meta_menu in meta_menu_n:
#         venue_dict[venue_name][meta_menu] = {}
            
#         for depth_menu in depth_menus_n:
#             venue_dict[venue_name][meta_menu][depth_menu] = {'menu_item_name':menu_item_name,
#                                                              'menu_item_price':menu_item_price,
#                                                              'menu_item_desc':menu_item_desc}
            


In [33]:
# # If for some reason I find it easier to work with category id's from API calls, I can use this block:
# # So for instance, I can say that if pulled_id == '4bf58dd8d48988d1d3941735', append('vegan') or append(1),
# # and that could easily serve as my target to predict on

# explore = client.venues.explore(params={'ll': '%.2f, %.2f' % (37.8127675576, -122.3550796509),
#                                         'llAcc':'100.0','radius': '6000',
#                                         'section': 'food','limit':'50','offset':'50','sortByDistance':'1'})
# pulled_id = explore['groups'][0]['items'][10]['venue']['categories'][0]['id']
# print pulled_id