# Capstone Proposals:
    
As Americans, we have an enormous variety of foods commonly available at grocery stores, corner shops, and increasingly, online. For one project, I'd like to take a look at the fundamental ingredients and nutrients that compose this great variety. My hope would be to predict the number of nutrients found in certain food items, as well as looking hollistically at the ingredients that are most common in our foods. I would like to then relate this to general dietary habits of the US population, and get a granular idea of the nutrient composition of diets. The stretch goal would be to predict the changes in nutrient-use given an increasing population of vegans, where I could estimate increases in required nutrient needs based on 'normal' diets. I'm unsure whether this would be kosher, because obviously I can't check the quality of my predictions for the future. I could, however, see if I could predict correlations on historical data in growth of the vegan diet in the US, and then use that model to predict the effect of future growth of vegan diets.
       
*** (I'm still pondering the best way to go about this, but at least you can see where I'm hoping to take this) ***
    
'Nutrient Database from 2012' - https://catalog.data.gov/dataset/usda-national-nutrient-database-for-standard-reference
'Nutrient Database from 2009' - https://catalog.data.gov/dataset/usda-national-nutrient-database-for-standard-reference-release-22
    
Partially pre-cleaned nutrient data - https://github.com/mhess126/usda_national_nutrients
USDA API (not sure if this could be useful yet?) - https://ndb.nal.usda.gov/ndb/api/doc
    
    
I'm still looking for better data to work with that could give me some idea of dietary habits, but here's where I've been looking - https://catalog.data.gov/dataset?q=bureauCode:%22005:13%22 ; https://catalog.data.gov/dataset?q=usda+consumption+national+nutrient&sort=views_recent+desc&ext_location=&ext_bbox=&ext_prev_extent=-142.03125%2C2.4601811810210052%2C-59.0625%2C58.63121664342478
    
    
Alternatively, I could look at pricing these ingredients, based on the foodtypes that we find them in. For example, take a chili sauce. Of this sauce, take a look at the unique ingredients, and their respective portion size in the sauce. Let's say that black beans compose 20% of the chili, and the chili runs 5 dollars/unit. Then the pricing for the black bean ingredient would be 1 dollar.

From there, I would want to observe how expensive these ingredients can get and how their prices change depending on what products they may be found in. 
    
'Food Price Outlook, current' - https://catalog.data.gov/dataset/food-price-outlook
    

In [1]:
# pip install scrapy
# pip install --upgrade zope2

import foursquare
import json
import numpy as np
import requests
from scrapy import Selector
from scrapy.http import HtmlResponse
import unicodedata

In [2]:
CLIENT_ID = '33NDJLQ342FAMTNX5Z55PR0PQQOJZRAZZ3XEAI0ERQXEJRUL'
CLIENT_SECRET = 'FGMFNZGMWUR1ILZFGH2NV1OKQY3WK5AAPHWKXTFWRR3B4Z4E'
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

In [3]:
x,y,z,t = (-122.5135231018,37.7327650667,-122.3772239685,37.8094586784)

In [31]:
#NOTE: Add an incrementer to go through the search w/ changing offsets
for offset in range(50, 200, 50):
    search = client.venues.search(params={'ll': "%.2f, %.2f" % (y, x), 'section': 'food','limit':'50',
                                      'offset':''+str(offset)+''})
    ven_ids[offset] = [search['venues'][i]['id'] for i in range(len(search['venues']))]
    
search = client.venues.search(params={'ll': "%.2f, %.2f" % (y, x), 'section': 'food','limit':'50',
                                      'offset':''+str(i)+''})
search = client.venues.search(params={'near': 'Brooklyn, NY', 'section': 'food','limit':'50'})

In [15]:
ven_ids = [search['venues'][i]['id'] for i in range(len(search['venues']))] #just 50 for this example so far
# print(json.dumps(esearch, indent = 4))

In [16]:
# for i in range(50):
#     print esearch['venues'][i]['name']

In [18]:
menu_urls = []
base_url = "https://foursquare.com/v/"
for j in range(len(ven_ids)):
    dat_id = unicodedata.normalize('NFKD', search['venues'][j]['id']).encode('ascii','ignore')
    dat_name = unicodedata.normalize('NFKD', search['venues'][j]['name']).encode('ascii','ignore')
    dat_name = dat_name.lower().replace('/','-').replace(' ','-')
    transformed_url = base_url+dat_name+"/"+dat_id+"/menu"
    menu_urls.append(transformed_url)
len(menu_urls)

50

In [8]:
starting_list = client.venues.search(params={'near': 'Paris, France', 'radius':'1500'})
# print(starting_list)

In [9]:
# Category/column titles, in order:
# [venue_name, venue_desc_list, venue_menu_url, venue_rated], [meta_menu_n], [depth_menus_n], [menu_item_name,
# menu_item_price, menu_item_desc]

In [19]:
#now that i have menu urls, let's plug them into a scraper so i can populate my df

menu_url = "https://foursquare.com/v/nopalito/4f8f447f7716cd1fbf769274/menu"
menu_url = "https://foursquare.com/v/holy-gelato/49daaa1ff964a520a15e1fe3/menu"

def parse_url(url=menu_url, data=False):
    
    response               =  requests.get(url)
    
    #Steps:
    #1) get the unicode objects
    #2) change objects from unicode to string
    
    venue_name_uni         =  Selector(text=response.text).xpath("//h1[@class='venueName']/text()").extract()
    venue_name             = unicodedata.normalize('NFKD', venue_name_uni[0]).encode('ascii','ignore')
    venue_name = [venue_name]
    print venue_name
    
    #also need to do an iteration to capture the multiple descriptors
    venue_desc_uni         =  Selector(text=response.text).xpath("//span[@class='unlinkedCategory']/\
    text()").extract()
    
    #I'll make a list to capture the entire description:
    venue_desc_list        = []
    for venue_desc_phrase in range(len(venue_desc_uni)):
        venue_desc_n       = unicodedata.normalize('NFKD', venue_desc_uni[venue_desc_phrase]).encode('ascii',
                                                                                                  'ignore')
        venue_desc_list.append(venue_desc_n)
    print venue_desc_list

    #The url too, right?
    venue_menu_url         = [menu_url]
    print venue_menu_url

    #Grabbing the venue rating, just a note: venueScore positive/neutral/negative, but I'm only getting the
    #rating 1-10
    venueScore_options     = ["positive","neutral","negative"]
    venue_rated            = []
    for vs_option in venueScore_options:
        try:
            venue_rating_uni        = Selector(text=response.text).xpath("//div[@class='venueRateBlock  ']/\
    span[@class='venueScore "+vs_option+"']/span/text()").extract()
            venue_rating             = unicodedata.normalize('NFKD', venue_rating_uni[0]).encode('ascii','ignore')
            venue_rated.append(venue_rating)        
        except:
            pass
        
    #Even if there is no rating, I'd still like to keep track of that...
    if venue_rated == []:
        venue_rated.append("rating_not_available")
        print "rating_not_available"
    
    #NOTE: do i also need to account for when menus don't have titles? because in that case meta_menu_list
    #could/would
    #return null. if so, perhaps just do a 'try excepct:pass' function if it can't find titles, but could it still
    #grab the menu items? maybe i should just put in a "null title" for the meta_menu_n to overcome this
    #I no longer think this is an issue, but maybe something to put in the appendix for later?
    
    meta_menu_list      =  Selector(text=response.text).xpath("//h2[@class='categoryName']/text()").extract()
        
    for meta_menu_item in range(len(meta_menu_list)):
        
        meta_menu_n         = unicodedata.normalize('NFKD', meta_menu_list[meta_menu_item]).encode('ascii',
                                                                                                   'ignore')
        
        print "meta menu title %d:" %(meta_menu_item+1), meta_menu_n, "# of meta menus:", len(meta_menu_list)
        meta_menu_n = [meta_menu_n]
        
        depth_menus_n_uni = Selector(text=response.text).xpath("//div[@class='menu']["+str(meta_menu_item+1)+"]/\
        div[@class='menuItems']/div[@class='section']/div[@class='sectionHeader']/\
        div[@class='sectionName']/text()").extract()
        
        for meta_depth_nn in range(len(depth_menus_n_uni)):
            
            depth_menus_n     = unicodedata.normalize('NFKD', depth_menus_n_uni[meta_depth_nn]
                                                     ).encode('ascii','ignore')

            #get the name of the depth menu, and record it's location as 'n_level'
            n_level = meta_depth_nn+1
            print "depth menu title %d:" %(n_level), depth_menus_n
            depth_menus_n = [depth_menus_n]
            
            #let's grab the entire depth menu:
            depth_menu_id_uni = Selector(text=response.text).xpath("//div[@class='menu']\
            ["+str(meta_menu_item+1)+"]/div[@class='menuItems']/div[@class='section']["+str(n_level)+"]/\
            div[@class='sectionHeader']/div[@class='sectionName']/text()").extract()
            depth_menu_id     = unicodedata.normalize('NFKD', depth_menu_id_uni[0]
                                                     ).encode('ascii','ignore')
            depth_menu_id = len(depth_menu_id_uni)
            print "#id of depth menu:", depth_menu_id
            
            #loop throught the left and right side of each container:
            left_or_right_list = ["left","right"]
            
            for left_or_right in left_or_right_list:
    
                #need the length of the [left/right] container, to iterate through:
                container_len_uni = Selector(text=response.text).xpath("//div[@class='menu']\
                ["+str(meta_menu_item+1)+"]/div[@class='menuItems']/div[@class='section']["+str(n_level)+"]/div\
                [@class='entryContainer']/div[@class='"+left_or_right+"Column']/\
                div[@class='entry']/node()[1]//text()").extract()
                print "left_check:", left_or_right, "contain len:", len(container_len_uni)
            
                for section_n in range(len(container_len_uni)):                    
                    
                    #now we can get the name of that menu item...
                    menu_item_name_uni = Selector(text=response.text).xpath("//div[@class='menu']\
                    ["+str(meta_menu_item+1)+"]/div[@class='menuItems']/div[@class='section']\
                    ["+str(n_level)+"]/div\
                    [@class='entryContainer']/div[@class='"+left_or_right+"Column']/div[@class='entry']\
                    ["+str(section_n+1)+"]/node()[1]//text()").extract()
                    menu_item_name     = unicodedata.normalize('NFKD', menu_item_name_uni[0]
                                                                    ).encode('ascii','ignore')
                    print "menu_item_name:", menu_item_name
                    menu_item_name = [menu_item_name]
                    
                    #and then we can get the price (if there is one...)
                    try:
                        menu_item_price_uni = Selector(text=response.text).xpath("//div[@class='menu']\
                    ["+str(meta_menu_item+1)+"]/div[@class='menuItems']/div[@class='section']\
                    ["+str(n_level)+"]/div\
                    [@class='entryContainer']/div[@class='"+left_or_right+"Column']/div[@class='entry']\
                    ["+str(section_n+1)+"]/node()[2]//text()").extract()
                        menu_item_price     = unicodedata.normalize('NFKD', menu_item_price_uni[0]
                                                                    ).encode('ascii','ignore')
                        print "menu_item_price:", menu_item_price
                        menu_item_price = [menu_item_price]
                    except:
                        print "menu_item_price:", "price_not_available"
                        menu_item_price = ["price_not_available"]
                    
                    #and finally the description (if there is one...)
                    try:
                        menu_item_desc_uni = Selector(text=response.text).xpath("//div[@class='menu']\
                    ["+str(meta_menu_item+1)+"]/div[@class='menuItems']/div[@class='section']\
                    ["+str(n_level)+"]/div\
                    [@class='entryContainer']/div[@class='"+left_or_right+"Column']/div[@class='entry']\
                    ["+str(section_n+1)+"]/node()[3]//text()").extract()
                        menu_item_desc     = unicodedata.normalize('NFKD', menu_item_desc_uni[0]
                                                                    ).encode('ascii','ignore')
                        print "menu_item_desc:", menu_item_desc
                        menu_item_desc = [menu_item_desc]
                    except:
                        print "menu_item_desc:", "desc_not_available"
                        menu_item_desc = ["desc_not_available"]

                        
    print venue_name, venue_desc_list, venue_rated
    return venue_name, venue_desc_list, venue_rated

In [20]:
# # practice run:
# parse_url(menu_url)
# parse_url("https://foursquare.com/v/wong's-kitchen/4be45d2b2457a593cb1faa15/menu")
parse_url("https://foursquare.com/v/gracias-madre/4b4955ccf964a520b86d26e3/menu")

['Gracias Madre']
['Vegetarian / Vegan Restaurant', 'Mexican Restaurant']
['https://foursquare.com/v/holy-gelato/49daaa1ff964a520a15e1fe3/menu']
meta menu title 1: Brunch Menu # of meta menus: 3
depth menu title 1: Bebidas
#id of depth menu: 1
left_check: left contain len: 3
menu_item_name: Mimosa
menu_item_price: 8.00
menu_item_desc: desc_not_available
menu_item_name: Bloody Mary
menu_item_price: 12.00
menu_item_desc: House infused jalepeno soju, tomato, celery, cilantro, vegan worcestershire, and fresh grated horseradish
menu_item_name: Tropical Green Smoothie
menu_item_price: 8.00
menu_item_desc: Mango, pineapple, spinach, coconut milk, ginger and sea salt
left_check: right contain len: 2
menu_item_name: Michelada
menu_item_price: 8.00
menu_item_desc: desc_not_available
menu_item_name: Madre Green Smoothie
menu_item_price: 8.00
menu_item_desc: Spinach, cilantro, mint, avocado, pineapple juice and sea salt
depth menu title 2: Cafe Y Te
#id of depth menu: 1
left_check: left contain le

(['Gracias Madre'],
 ['Vegetarian / Vegan Restaurant', 'Mexican Restaurant'],
 ['9.0'])

In [12]:
# I can run this once I have a clean set of urls
for menu_url in menu_urls:
    try:
        parse_url(menu_url)
    except:
        pass

["D'Savannah Bar & Lounge"]
['Cocktail Bar', 'Lounge']
["https://foursquare.com/v/d'savannah-bar-&-lounge/510393cbe4b02d11cd5d0509/menu"]
meta menu title 1: Speisekarte # of meta menus: 1
depth menu title 1: Suppen / Soups
#id of depth menu: 1
left_check: left contain len: 1
menu_item_name: Mangus Merek
menu_item_price: not_available
menu_item_desc: Mangokokossuppe mit Ingwer
left_check: right contain len: 1
menu_item_name: Hargetz Merek
menu_item_price: not_available
menu_item_desc: Gemusesuppe mit Krokodilfleisch & frischen Krautern
depth menu title 2: Salate / Salads
#id of depth menu: 1
left_check: left contain len: 1
menu_item_name: Captown Salat
menu_item_price: not_available
menu_item_desc: Strauenstreifen auf gemischtem Salat der Saison mit Tomaten und Fruchten an Mangodressing
left_check: right contain len: 1
menu_item_name: Keren Salat
menu_item_price: not_available
menu_item_desc: Knackiger Salat der Saison mit gegrilltem Gemuse, Tomaten, Zwiebeln und Papaya an Erdnussdressi

In [None]:
#NOTES: going to want to cull all the multiples from the menu url's that i grab?
#should i make the loop above this cell more efficient by checking and storing which url's don't work?
#B/C then I can exclude them the next time i run my loops ; BUT, it's probably better to keep track of these,
#because while some won't even be eateries, many will be eateries that simply don't have foursqare menus.
#In such cases, knowing the venue information could still be valuable, because we can surface those to users who
#wish to manually add items (could possibly add items by taking pictures of the menu where the item is located)?