# Food Pilgrim
## Overview
Food Pilgrim is a web application that generates a list of destinations in a specific region that will delight a user. Every city is known for specific types of food. Food Pilgrim takes advantage of this to identify the local specialities and recommend a set of locations where users can experience them.

## Backing Data
The primary source of data will be Twitter comments. By mining Twitter comments we can identify trends in locales about specific types of food. We rely on the assumption that popular local food types are talked about more frequently on Twitter by local users. Under that assumption, we should be able to train a system to extract these local specialities from Twitter data.

## Current Status
Initial validation of this project has been completed. A small Twitter sample set has been downloaded for 5 sample cities. An analysis was performed to validate the assumption that local speciality food items have a high hit rate among local Twitter users. Work has started on exploring techniques for an automated system that extracts local specialities for all cities.

In [2]:
import pandas as pd
import jsonpickle
import json
from collections import Counter
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string

In [3]:
# Setup constants for the sample cities
cities = ['seattle', 'austin', 'new_orleans', 'chicago', 'philly']
filenames = {
        'seattle': r'data\Seattle_Tweets.json',
        'austin': r'data\Austin_Tweets.json',
        'new_orleans': r'data\New_Orleans_Tweets.json',
        'chicago': r'data\Chicago_Tweets.json',
        'philly': r'data\Philly_Tweets.json'
}

In [5]:
# Import data
# This data was collected via the public Twitter APIs, using a separate Python script.
# Data collection size was heavily limited by Twitter API rate limits and time limits for this challenge.

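# Collect one dataframe per city, then concatenate them once at the end.
frames = []

for city in cities:
    tweets = []
    with open(filenames[city], 'r') as f:
        for line in f:
            tweets.append(jsonpickle.decode(line))
        print('{0} Data Loaded. {1} rows'.format(city, len(tweets)))

    temp = pd.DataFrame(tweets)
    temp = temp.filter(items=['created_at', 'text'])
    temp['city'] = city
    frames.append(temp)

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the same per-city index.
df = pd.concat(frames)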

print('Full dataframe size {0}'.format(df.shape[0]))

seattle Data Loaded. 12672 rows
austin Data Loaded. 7013 rows
new_orleans Data Loaded. 5905 rows
chicago Data Loaded. 4553 rows
philly Data Loaded. 9870 rows
Full dataframe size 40013


In [6]:
# Just a quick sample view of the data.
print(df.head(5))

                       created_at  \
0  Mon May 06 06:58:50 +0000 2019   
1  Mon May 06 06:56:46 +0000 2019   
2  Mon May 06 06:56:43 +0000 2019   
3  Mon May 06 06:54:20 +0000 2019   
4  Mon May 06 06:53:44 +0000 2019   

                                                text     city  
0  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  
1  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  
2  how is deconstructed food a thing? just finish...  seattle  
3  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  
4  Kalain pud ni kian oy.nikaon sila eat all u ca...  seattle  


We now work to validate a core assumption: that popular food items show a locally specific increase in tweet frequency.

In [7]:
# Define a list of expected food types and potential text matches for people discussing them.
guessWords = {
        'gumbo': ['gumbo'],
        'teriyaki': ['teriyaki'],
        'taco': ['taco'],
        'hotdog': ['hot dog', 'hotdog'],
        'cheesesteak': ['cheesesteak', 'cheese steak'],
        'deepdish': ['deepdish', 'deep dish']
}

In [9]:
# Compute new columns that represent if the tweet contained our word targets
def containsFoodWord(row, words):
    text = row['text'].lower()
    for word in words:
        if word in text:
            return True
    return False

for guess in guessWords:
    df[guess] = df.apply(containsFoodWord, args=(guessWords[guess],), axis=1)
    
# print a peek of the top with the enriched data columns
print(df.head(5))

                       created_at  \
0  Mon May 06 06:58:50 +0000 2019   
1  Mon May 06 06:56:46 +0000 2019   
2  Mon May 06 06:56:43 +0000 2019   
3  Mon May 06 06:54:20 +0000 2019   
4  Mon May 06 06:53:44 +0000 2019   

                                                text     city  gumbo  \
0  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  False   
1  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  False   
2  how is deconstructed food a thing? just finish...  seattle  False   
3  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  False   
4  Kalain pud ni kian oy.nikaon sila eat all u ca...  seattle  False   

   teriyaki   taco  hotdog  cheesesteak  deepdish  
0     False  False   False        False     False  
1     False  False   False        False     False  
2     False  False   False        False     False  
3     False  False   False        False     False  
4     False  False   False        False     False  


This is an area with a lot of potential added value for the project. We are doing a very simple word match to determine what a tweet is about. It is likely we could get much better performance by analyzing tweets with a classifier that determines whether a tweet is about a food type.
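
One small step short of a classifier is to match on word boundaries, so that a substring hit such as "taco" inside "Tacoma" is not counted. The sketch below is only an illustration under that assumption and is not part of the analysis in this notebook; `compile_food_patterns` and `contains_food_word` are hypothetical helper names, and `guess_words` is a shortened copy of the `guessWords` dictionary.

```python
import re

# Same shape as the guessWords dictionary, shortened for illustration.
guess_words = {
    'taco': ['taco'],
    'hotdog': ['hot dog', 'hotdog'],
}

def compile_food_patterns(words_by_food):
    """Build one case-insensitive, word-bounded regex per food type (hypothetical helper)."""
    patterns = {}
    for food, words in words_by_food.items():
        alternatives = '|'.join(re.escape(w) for w in words)
        # Optional trailing "s" so plurals such as "tacos" still match.
        patterns[food] = re.compile(r'\b(?:' + alternatives + r')s?\b', re.IGNORECASE)
    return patterns

def contains_food_word(text, pattern):
    """Return True if the tweet text mentions the food as a whole word."""
    return pattern.search(text) is not None

patterns = compile_food_patterns(guess_words)
print(contains_food_word('Best tacos in town', patterns['taco']))
print(contains_food_word('Driving down to Tacoma', patterns['taco']))
```

The first call matches and the second does not, whereas the plain `word in text` test would accept both.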

In [10]:
# Now we can compute the interesting data: the frequency count by city for our target words.
# Aggregate counts of all guessWords by city
def agg(x):
    data = {'num_tweets': len(x.index)}
    for guess in guessWords:
        data[guess] = x[guess].sum()
        data[guess + '%'] = x[guess].sum() * 100 / len(x.index)
    return pd.Series(data)

cityCounts = df.groupby('city').apply(agg)

print(cityCounts)

             num_tweets  gumbo    gumbo%  teriyaki  teriyaki%   taco  \
city                                                                   
austin           7013.0    0.0  0.000000       1.0   0.014259  102.0   
chicago          4553.0    2.0  0.043927       0.0   0.000000  142.0   
new_orleans      5905.0   29.0  0.491109       1.0   0.016935   24.0   
philly           9870.0    1.0  0.010132       1.0   0.010132   51.0   
seattle         12672.0    2.0  0.015783      11.0   0.086806  116.0   

                taco%  hotdog   hotdog%  cheesesteak  cheesesteak%  deepdish  \
city                                                                           
austin       1.454442    12.0  0.171111          2.0      0.028518       0.0   
chicago      3.118823    18.0  0.395344          1.0      0.021964       7.0   
new_orleans  0.406435     3.0  0.050804          0.0      0.000000       0.0   
philly       0.516717   132.0  1.337386         34.0      0.344478       0.0   
seattle      0.

Success! We see fairly strong signals even on this small sample set, especially for **gumbo**, **teriyaki**, **cheesesteak**, and **deepdish**. All of these are traditionally famous as signature items for their cities, and the tweets show them being discussed at a much higher rate in those cities. Not all of our initial guesses worked out, and unexpected results, such as the popularity of hot dogs in Philadelphia, were also observed.
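
To put a rough number on the strength of these signals, one option is to compare a city's mention rate with the pooled rate of the other four cities. The sketch below does this for gumbo and teriyaki using the counts printed in the table above; `rate_ratio` is a hypothetical helper, and with counts this small the ratios should be read as indicative only.

```python
# Counts copied from the cityCounts output: (matching tweets, total tweets).
gumbo = {
    'austin': (0, 7013), 'chicago': (2, 4553), 'new_orleans': (29, 5905),
    'philly': (1, 9870), 'seattle': (2, 12672),
}
teriyaki = {
    'austin': (1, 7013), 'chicago': (0, 4553), 'new_orleans': (1, 5905),
    'philly': (1, 9870), 'seattle': (11, 12672),
}

def rate_ratio(counts, home_city):
    """Ratio of the home city's mention rate to the pooled rate of the other cities.

    Assumes the other cities have at least one matching tweet between them.
    """
    hits, total = counts[home_city]
    other_hits = sum(h for city, (h, _) in counts.items() if city != home_city)
    other_total = sum(t for city, (_, t) in counts.items() if city != home_city)
    return (hits / total) / (other_hits / other_total)

print('gumbo, new_orleans: {0:.1f}x'.format(rate_ratio(gumbo, 'new_orleans')))
print('teriyaki, seattle: {0:.1f}x'.format(rate_ratio(teriyaki, 'seattle')))
```

On these counts gumbo is mentioned roughly 33 times more often in New Orleans than in the other cities combined, and teriyaki roughly 8 times more often in Seattle.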

We now look to validate our second assumption: that a system can be created to identify signature food items for a city from observation of Twitter data. The technique used here is a simple frequency count, with the goal of identifying tokens that a city uses unusually often compared to the overall data set.

In [11]:
# We use the nltk tokenizer optimized for tweets, and try to filter out common punctuation and stop words.
token_count = Counter()
tknzr = TweetTokenizer()
tweet_count = 0

punc = list(string.punctuation)
stop = stopwords.words('english') + punc + ['rt', 'RT', 'via', '...']

for row in df.itertuples():
    tweet = row.text
    tokens = tknzr.tokenize(tweet)
    tokens = [token for token in tokens if token not in stop]
    token_count.update(tokens)
    tweet_count += 1
    #if tweet_count % 1000 == 0:
    #    print('Processed {0} tweets'.format(tweet_count) )

print('Length of counter {0}'.format(len(token_count)))
print(token_count.most_common(50))

Length of counter 64999
[('…', 20989), ('I', 12900), ('food', 12253), ('’', 10921), ('eat', 9926), ('foods', 5218), ("Don't", 4699), ('eating', 2893), ('2', 2844), ('1', 2838), ('3', 2613), ('What', 2545), ('fried', 2389), ('Healthy', 2344), ('Ramadan', 2340), ('salty', 2292), ('skip', 2281), ('Iftar', 2277), ('overeat', 2275), ('Avoid', 2275), ('Fasting', 2273), ('@AkhSuliee', 2271), ('Suhoor', 2271), ('The', 1992), ('better', 1869), ('restaurant', 1820), ('Food', 1810), ('like', 1803), ('would', 1723), ('us', 1589), ('get', 1459), ('good', 1396), ('\u200b', 1335), ('people', 1200), ('want', 1194), ('This', 1172), ('live', 1165), ('go', 1159), ('“', 1148), ('place', 1139), ('kids', 1115), ('It', 1098), ('Y', 1070), ('one', 1049), ('check', 1048), ('time', 1032), ('tryna', 1022), ('kind', 984), ('😤', 975), ('drug', 965)]


Initial inspection doesn't lead to any great insights. The most common words are not a great indicator for food types.
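
Much of the noise in this list comes from case variants ("I", "The", "Don't") slipping past the lowercase stop list, plus handles, digits and typographic punctuation. A minimal sketch of a stricter filter follows; `normalize_tokens` is a hypothetical helper and the stop list here is a tiny stand-in for the nltk one. nltk's `TweetTokenizer` also accepts `preserve_case=False` and `strip_handles=True`, which cover part of the same ground.

```python
from collections import Counter

def normalize_tokens(tokens, stop_words):
    """Lowercase tokens and drop handles, links, non-alphabetic tokens and stop words."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered.startswith('@') or lowered.startswith('http'):
            continue  # user handles and links
        if not any(ch.isalpha() for ch in lowered):
            continue  # pure punctuation, digits, emoji, zero-width characters
        if lowered in stop_words:
            continue
        kept.append(lowered)
    return kept

# Tiny illustrative stop list; an assumption standing in for nltk's English list.
stop_words = {'i', 'the', "don't", 'rt'}
tokens = ['RT', '@AkhSuliee', 'I', 'love', 'Gumbo', '…', '2', 'gumbo']

counts = Counter(normalize_tokens(tokens, stop_words))
print(counts.most_common(2))
```

Lowercasing also merges "Gumbo" and "gumbo" into a single count, which matters once per-city frequencies are compared.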

In [12]:
# We compute a new dictionary of dictionaries, holding the per-city token frequency count for our 5 sample cities.
all_freq = dict()
for x in token_count:
    #print('x type is {0} and value {1}'.format(type(x), x))
    all_freq[x] = token_count[x] / len(token_count)

per_city_freq = dict()
for city in cities:
    print('Computing freq of tokens for {0}'.format(city))
    city_data = df[df['city'] == city]
    for row in city_data.itertuples():
        token_count = Counter()
        tweet = row.text
        tokens = tknzr.tokenize(tweet)
        tokens = [token for token in tokens if token not in stop]
        token_count.update(tokens)
        city_freq = dict()
        for x in token_count:
            city_freq[x] = token_count[x] / len(token_count)
        per_city_freq[city] = city_freq

Computing freq of tokens for seattle
Computing freq of tokens for austin
Computing freq of tokens for new_orleans
Computing freq of tokens for chicago
Computing freq of tokens for philly
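
Note that the cell above creates a fresh `Counter` for every tweet, so each city's entry in `per_city_freq` ends up describing only the last tweet processed for that city, and both frequency tables divide by the number of distinct tokens rather than the total number of tokens. A minimal sketch of one alternative, assuming the tweets are already tokenized, accumulates counts across all of a city's tweets before normalizing. `relative_frequencies` and `tweets_by_city` are hypothetical names, and the sample tokens are made up.

```python
from collections import Counter

def relative_frequencies(token_lists):
    """Accumulate counts over all tweets, then divide by the total token count."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

# Hypothetical pre-tokenized tweets keyed by city (a stand-in for the dataframe).
tweets_by_city = {
    'new_orleans': [['gumbo', 'tonight'], ['gumbo', 'festival']],
    'seattle': [['teriyaki', 'tonight']],
}

per_city_freq = {
    city: relative_frequencies(tweets) for city, tweets in tweets_by_city.items()
}
all_freq = relative_frequencies(
    tokens for tweets in tweets_by_city.values() for tokens in tweets
)
print(per_city_freq['new_orleans']['gumbo'], all_freq['gumbo'])
```

With frequencies defined this way, each table sums to 1 and the per-city and overall values are directly comparable.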


In [15]:
# We now look for insights. We see if any city has an unusually high token count relative to the overall token count.
for city in cities:
    for token in per_city_freq[city]:
        if per_city_freq[city][token] >= 10 * all_freq[token]:
            print('{0} in {1} with prob: {2:.5f} compared to total prop: {3:.5f}'.format(token, city, per_city_freq[city][token], all_freq[token]))

@GeneDexter in seattle with prob: 0.12500 compared to total prop: 0.00002
dope in seattle with prob: 0.12500 compared to total prop: 0.00015
man in seattle with prob: 0.12500 compared to total prop: 0.00631
Great in seattle with prob: 0.12500 compared to total prop: 0.00286
great in seattle with prob: 0.12500 compared to total prop: 0.00815
@carlos_glezgtez in austin with prob: 0.06250 compared to total prop: 0.00012
enthusiasts in austin with prob: 0.06250 compared to total prop: 0.00014
Austin in austin with prob: 0.06250 compared to total prop: 0.00537
can't in austin with prob: 0.06250 compared to total prop: 0.00289
miss in austin with prob: 0.06250 compared to total prop: 0.00258
2019 in austin with prob: 0.06250 compared to total prop: 0.00340
edition in austin with prob: 0.06250 compared to total prop: 0.00020
Taste in austin with prob: 0.06250 compared to total prop: 0.00055
Mexico in austin with prob: 0.06250 compared to total prop: 0.00065
Festival in austin with prob: 0.062

Results using this technique are not very exciting. The vast majority of the results are of no help in determining popular food for a city.

It is clear we are limited by the small size of our Twitter data set: locally popular items such as a link or a username are enough to trigger an entry on this list.

There are some mildly interesting results in this data: words like steak, macaroni, yams, and vegan came out of this analysis. This potentially shows that, with better feature extraction, this could be a useful technique for determining popular food types for a city.
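
One possible form of better feature extraction is to require a minimum number of occurrences before a token can be reported, and to compare each city against the other cities rather than against a total that includes the city itself. The sketch below illustrates the idea on made-up counts. `distinctive_tokens`, its thresholds and the add-one smoothing are all assumptions, not something evaluated in this notebook.

```python
from collections import Counter

def distinctive_tokens(city_counts, rest_counts, min_count=5, min_ratio=10.0, smoothing=1.0):
    """Return (token, ratio) pairs for tokens far more frequent in the city than elsewhere.

    Both arguments must be Counters so that missing tokens count as zero.
    """
    city_total = sum(city_counts.values())
    rest_total = sum(rest_counts.values())
    results = []
    for token, n in city_counts.items():
        if n < min_count:
            continue  # too rare in this city to trust, e.g. a single username
        city_rate = n / city_total
        # Smoothing keeps the ratio finite for tokens unseen in the other cities.
        rest_rate = (rest_counts[token] + smoothing) / (rest_total + smoothing)
        ratio = city_rate / rest_rate
        if ratio >= min_ratio:
            results.append((token, ratio))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Made-up counts for illustration only; not taken from the collected tweets.
city_counts = Counter({'gumbo': 30, 'food': 500, '@someuser': 2, 'eat': 468})
rest_counts = Counter({'gumbo': 5, 'food': 3000, 'eat': 2995})

for token, ratio in distinctive_tokens(city_counts, rest_counts):
    print('{0}: {1:.1f}x'.format(token, ratio))
```

In this toy example only "gumbo" is reported: the username falls below `min_count`, and the generic words occur at about the same rate everywhere.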