# Food Pilgrim
## Overview
Food Pilgrim is a web application that generates a list of destinations in a specific region that will delight a user. Every city is known for specific types of food. Food Pilgrim takes advantage of this to identify the local specialities and recommend a set of locations where users can experience them.

## Backing Data
The primary source of data will be Twitter comments. By mining them, we can identify trends in specific types of food by locale. This relies on the assumption that local users talk about popular local food types more frequently on Twitter. Given that signal, we will be able to train a system to extract these local specialities from Twitter data.

## Current Status
Initial validation of this project has been completed. A small sample set of tweets has been downloaded for 5 sample cities. An analysis was performed to validate the assumption that local speciality food items have a high hit rate among local Twitter users. Work has started on exploring techniques for an automated system that extracts local specialities for all cities.

In [1]:
import pandas as pd
import jsonpickle
import json
from collections import Counter
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string

In [2]:
# Setup constants for the sample cities
cities = ['seattle', 'austin', 'new_orleans', 'chicago', 'philly']
filenames = {
        'seattle': r'data\Seattle_Tweets.json',
        'austin': r'data\Austin_Tweets.json',
        'new_orleans': r'data\New_Orleans_Tweets.json',
        'chicago': r'data\Chicago_Tweets.json',
        'philly': r'data\Philly_Tweets.json'
}

In [3]:
# Import data
# This data was collected via the public Twitter APIs, using a separate Python script.
# Data collection size was heavily limited by Twitter API rate limits and the time limits for this challenge.

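# Collect one DataFrame per city, then concatenate them once at the end.
frames = []

for city in cities:
    tweets = []
    with open(filenames[city], 'r') as f:
        # Each line of the file holds one JSON-encoded tweet.
        for line in f:
            tweets.append(jsonpickle.decode(line))
        print('{0} Data Loaded. {1} rows'.format(city, len(tweets)))

    temp = pd.DataFrame(tweets)
    temp = temp.filter(items=['created_at', 'text'])
    temp['city'] = city
    frames.append(temp)

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
df = pd.concat(frames)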

print('Full dataframe size {0}'.format(df.shape[0]))

seattle Data Loaded. 18926 rows
austin Data Loaded. 20841 rows
new_orleans Data Loaded. 15885 rows
chicago Data Loaded. 28697 rows
philly Data Loaded. 13559 rows
Full dataframe size 97908


In [4]:
# Just a quick sample view of the data.
print(df.head(5))

                       created_at  \
0  Mon May 06 06:58:50 +0000 2019   
1  Mon May 06 06:56:46 +0000 2019   
2  Mon May 06 06:56:43 +0000 2019   
3  Mon May 06 06:54:20 +0000 2019   
4  Mon May 06 06:53:44 +0000 2019   

                                                text     city  
0  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  
1  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  
2  how is deconstructed food a thing? just finish...  seattle  
3  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  
4  Kalain pud ni kian oy.nikaon sila eat all u ca...  seattle  


We now work to validate a core assumption: that popular food items will show a city-specific increase in tweet frequency.

In [5]:
# Define a list of expected food types and potential text matches for people discussing them.
guessWords = {
        'gumbo': ['gumbo'],
        'teriyaki': ['teriyaki'],
        'taco': ['taco'],
        'hotdog': ['hot dog', 'hotdog'],
        'cheesesteak': ['cheesesteak', 'cheese steak'],
        'deepdish': ['deepdish', 'deep dish']
}

In [6]:
# Compute new columns that represent if the tweet contained our word targets
def containsFoodWord(row, words):
    text = row['text'].lower()
    for word in words:
        if word in text:
            return True
    return False

for guess in guessWords:
    df[guess] = df.apply(containsFoodWord, args=(guessWords[guess],), axis=1)
    
# print a peek of the top with the enriched data columns
print(df.head(5))

                       created_at  \
0  Mon May 06 06:58:50 +0000 2019   
1  Mon May 06 06:56:46 +0000 2019   
2  Mon May 06 06:56:43 +0000 2019   
3  Mon May 06 06:54:20 +0000 2019   
4  Mon May 06 06:53:44 +0000 2019   

                                                text     city  gumbo  \
0  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  False   
1  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  False   
2  how is deconstructed food a thing? just finish...  seattle  False   
3  RT @AkhSuliee: Healthy Ramadan Fasting:\n\n1. ...  seattle  False   
4  Kalain pud ni kian oy.nikaon sila eat all u ca...  seattle  False   

   teriyaki   taco  hotdog  cheesesteak  deepdish  
0     False  False   False        False     False  
1     False  False   False        False     False  
2     False  False   False        False     False  
3     False  False   False        False     False  
4     False  False   False        False     False  


This is an area where a lot of value could be added to the project. We are doing a very simple word match to determine what the tweet was about. It is likely we could get much better performance by analyzing tweets with a classifier that determines whether a tweet is about a food type.
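Short of a full classifier, one small step up from substring matching is to match whole words only. The sketch below is not part of the original notebook: `build_pattern` and `contains_food_word` are hypothetical helpers, and the optional trailing `s` is an assumed rule for simple plurals. It shows how a word-boundary regex avoids counting a word such as "Tacoma" as a hit for `taco`.

```python
import re

# Hypothetical helper (not in the notebook): build one pattern per food type
# that only matches whole words or phrases, with an optional plural "s".
def build_pattern(words):
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:' + alternatives + r')s?\b', re.IGNORECASE)

def contains_food_word(text, pattern):
    # True when the tweet text contains any of the target words as a whole word.
    return pattern.search(text) is not None

taco_pattern = build_pattern(['taco'])
hotdog_pattern = build_pattern(['hot dog', 'hotdog'])

print(contains_food_word('Best tacos in town', taco_pattern))
print(contains_food_word('Driving down to Tacoma', taco_pattern))
print(contains_food_word('Chicago HOT DOGS forever', hotdog_pattern))
```

Whether this changes the counts in a meaningful way for this data set has not been measured here; it only removes one obvious source of false positives.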

In [7]:
# Now we can compute the interesting data: the frequency count by city for our target words.
# Aggregate counts of all guessWords by city
def agg(x):
    data = {'num_tweets': len(x.index)}
    for guess in guessWords:
        data[guess] = x[guess].sum()
        data[guess + '%'] = x[guess].sum() * 100 / len(x.index)
    return pd.Series(data)

cityCounts = df.groupby('city').apply(agg)

print(cityCounts)

             num_tweets  gumbo    gumbo%  teriyaki  teriyaki%   taco  \
city                                                                   
austin          20841.0    0.0  0.000000       5.0   0.023991  313.0   
chicago         28697.0    6.0  0.020908       1.0   0.003485  354.0   
new_orleans     15885.0   54.0  0.339943       1.0   0.006295  151.0   
philly          13559.0    1.0  0.007375       2.0   0.014750   70.0   
seattle         18926.0    2.0  0.010567      13.0   0.068689  206.0   

                taco%  hotdog   hotdog%  cheesesteak  cheesesteak%  deepdish  \
city                                                                           
austin       1.501847    33.0  0.158342          2.0      0.009596       0.0   
chicago      1.233578    70.0  0.243928          1.0      0.003485      40.0   
new_orleans  0.950582    12.0  0.075543          0.0      0.000000       0.0   
philly       0.516262   144.0  1.062025         42.0      0.309757       0.0   
seattle      1.

Success! We see fairly strong signals even on this small sample set, especially for **gumbo**, **teriyaki**, **cheesesteak**, and **deepdish**. All of these are traditionally famous as signature items for their cities and, as the tweets show, are discussed far more often in those cities. Not all of our initial guesses worked out, and we also observed unexpected results such as the popularity of hot dogs in Philadelphia.
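To put a number on "fairly strong", one option is to compare a city's hit rate against the combined hit rate of the other four cities. The sketch below is an illustration, not part of the original analysis: the `lift` function is a hypothetical helper, and the counts are copied from the `gumbo` and `num_tweets` columns of the `cityCounts` output above.

```python
# Counts copied from the cityCounts output above.
num_tweets = {'austin': 20841, 'chicago': 28697, 'new_orleans': 15885,
              'philly': 13559, 'seattle': 18926}
gumbo = {'austin': 0, 'chicago': 6, 'new_orleans': 54, 'philly': 1, 'seattle': 2}

def lift(city, hits, totals):
    # Hypothetical helper: hit rate in one city divided by the hit rate
    # across all other cities combined.
    city_rate = hits[city] / totals[city]
    rest_hits = sum(v for c, v in hits.items() if c != city)
    rest_total = sum(v for c, v in totals.items() if c != city)
    rest_rate = rest_hits / rest_total
    return city_rate / rest_rate if rest_rate > 0 else float('inf')

gumbo_lift = lift('new_orleans', gumbo, num_tweets)
print('gumbo lift in new_orleans: {0:.1f}'.format(gumbo_lift))
```

With these counts, gumbo is mentioned roughly 30 times more often per tweet in New Orleans than in the other four cities combined.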

We now look to validate our second assumption: that a system can be created to identify signature food items for a city from observation of Twitter data. The technique used here is a simple frequency count, with the goal of identifying tokens that are used unusually often in one city compared to their overall frequency.

In [8]:
# We use the nltk tokenizer optimized for tweets and try to filter out common punctuation and stop words.
token_count = Counter()
tknzr = TweetTokenizer()
tweet_count = 0

punc = list(string.punctuation)
stop = stopwords.words('english') + punc + ['rt', 'RT', 'via', '...']

for row in df.itertuples():
    tweet = row.text
    tokens = tknzr.tokenize(tweet)
    tokens = [token for token in tokens if token not in stop]
    token_count.update(tokens)
    tweet_count += 1
    #if tweet_count % 1000 == 0:
    #    print('Processed {0} tweets'.format(tweet_count) )

print('Length of counter {0}'.format(len(token_count)))
print(token_count.most_common(50))

Length of counter 107069
[('…', 50740), ('I', 34029), ('food', 30508), ('’', 29419), ('eat', 24809), ('foods', 6343), ('The', 5796), ('us', 4995), ("Don't", 4916), ('like', 4727), ('Food', 4519), ('restaurant', 4502), ('know', 4087), ('What', 3775), ('eating', 3771), ('Shop', 3760), ('Kitty', 3737), ('go', 3644), ('2', 3566), ('get', 3538), ('1', 3392), ('would', 3310), ('“', 3283), ('3', 3146), ('good', 2979), ('This', 2937), ('If', 2787), ('We', 2775), ('better', 2749), ('want', 2729), ('time', 2655), ('Do', 2584), ('”', 2580), ('fried', 2545), ('day', 2544), ('people', 2515), ('She', 2509), ('today', 2460), ('one', 2439), ('Healthy', 2427), ('Ramadan', 2402), ('skip', 2388), ('special', 2374), ('It', 2364), ('Fasting', 2347), ('salty', 2327), ('You', 2318), ('Avoid', 2312), ('Iftar', 2311), ('😂', 2308)]


Initial inspection doesn't lead to any great insights. The most common words are not a great indicator for food types.
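One reason the list is so noisy is visible in the output itself: tokens such as 'I', 'The', and 'We' survive the filter because nltk's English stop words are all lowercase and the tokens are compared without case folding. The sketch below is illustrative only. It uses a tiny stand-in stop list and a hand-written token list (both assumptions) rather than the notebook's data, and shows the effect of lowercasing before filtering.

```python
from collections import Counter

# Tiny stand-in stop list (assumption); the notebook uses nltk's English list.
stop = {'i', 'the', 'this', 'if', 'we'}

# Hand-written example tokens, not taken from the data set.
tokens = ['I', 'love', 'The', 'gumbo', 'the', 'Gumbo']

# Filtering as-is: capitalized stop words slip through, and
# 'gumbo' / 'Gumbo' are counted as two different tokens.
as_is = Counter(t for t in tokens if t not in stop)

# Lowercasing first: stop words are caught and case variants are merged.
folded = Counter(t.lower() for t in tokens if t.lower() not in stop)

print(as_is)
print(folded)
```

Case folding would change the counts reported in this notebook, so it is shown here as a possible refinement rather than applied to the cells above.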

In [45]:
# We compute a new dictionary with the token frequency count for our sample 5 cities.
all_freq = dict()
token_count = Counter()
for row in df.itertuples():
    tweet = row.text
    tokens = tknzr.tokenize(tweet)
    tokens = [token for token in tokens if token not in stop]
    token_count.update(tokens)
for x in token_count:
    all_freq[x] = token_count[x] / df.shape[0]

# Compute the freq of each token for just 1 city
per_city_freq = dict()
for city in cities:
    print('Computing freq of tokens for {0}'.format(city))
    city_data = df[df['city'] == city]
    token_count = Counter()
    city_freq = dict()
    for row in city_data.itertuples():
        tweet = row.text
        tokens = tknzr.tokenize(tweet)
        tokens = [token for token in tokens if token not in stop]
        token_count.update(tokens)
    for x in token_count:
        city_freq[x] = token_count[x] / city_data.shape[0]
    per_city_freq[city] = city_freq
    
# Compute the freq of a token with 1 excluded city
not_per_city_freq = dict()
for city in cities:
    print('Computing inverse freq of tokens for {0}'.format(city))
    not_city_data = df[df['city'] != city]
    not_token_count = Counter()
    not_city_freq = dict()
    for row in not_city_data.itertuples():
        tweet = row.text
        tokens = tknzr.tokenize(tweet)
        tokens = [token for token in tokens if token not in stop]
        not_token_count.update(tokens)
    for x in not_token_count:
        not_city_freq[x] = not_token_count[x] / not_city_data.shape[0]
    not_per_city_freq[city] = not_city_freq

Computing freq of tokens for seattle
Computing freq of tokens for austin
Computing freq of tokens for new_orleans
Computing freq of tokens for chicago
Computing freq of tokens for philly
Computing inverse freq of tokens for seattle
Computing inverse freq of tokens for austin
Computing inverse freq of tokens for new_orleans
Computing inverse freq of tokens for chicago
Computing inverse freq of tokens for philly


In [61]:
# We now look for insights. We see if any city has an unusually high token count relative to the overall token count.
for city in cities:
    for token in per_city_freq[city]: 
        if (per_city_freq[city][token] >=  5 * all_freq[token]) and (all_freq[token] >= 0.001):
            print('{0} in {1} with prob: {2:.5f} compared to total prop: {3:.5f}'.format(token, city, per_city_freq[city][token], all_freq[token]))

@AkhSuliee in seattle with prob: 0.12042 compared to total prop: 0.02328
Fasting in seattle with prob: 0.12042 compared to total prop: 0.02397
Suhoor in seattle with prob: 0.12042 compared to total prop: 0.02328
overeat in seattle with prob: 0.12057 compared to total prop: 0.02339
Iftar in seattle with prob: 0.12063 compared to total prop: 0.02360
Avoid in seattle with prob: 0.12057 compared to total prop: 0.02361
salty in seattle with prob: 0.12052 compared to total prop: 0.02377
@leetaevong in seattle with prob: 0.00682 compared to total prop: 0.00132
Yuta in seattle with prob: 0.00687 compared to total prop: 0.00133
Taeyong in seattle with prob: 0.00687 compared to total prop: 0.00133
Jaehyun in seattle with prob: 0.00682 compared to total prop: 0.00132
​ in seattle with prob: 0.09754 compared to total prop: 0.01929
@KylePlantEmoji in seattle with prob: 0.03387 compared to total prop: 0.00657
https://t.co/AzV5HtsnjX in seattle with prob: 0.00793 compared to total prop: 0.00153
Yzma 

Results using this technique are not very exciting. It is clear that the vast majority of the results are of no help to us in determining popular food for the city.

It is clear we are being limited by the small size of our Twitter data set: a single locally popular link or username is enough to land on this list.

There are a few interesting results in this data, where words like steak, macaroni, yams, and vegan came out of the analysis. This potentially shows that, with better feature extraction, this could be a useful technique for determining popular food types for a city.
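One low-cost form of better feature extraction is to drop usernames and links and to count an identical retweet only once, since those are the items dominating the list above. The sketch below is an assumption-laden illustration, not the notebook's method: `is_noise` and `count_tokens` are hypothetical helpers, whitespace splitting stands in for `TweetTokenizer`, and the sample tweets (including the username and URL) are made up.

```python
from collections import Counter

def is_noise(token):
    # Hypothetical rule: treat @handles and links as noise.
    return token.startswith('@') or token.startswith('http')

def count_tokens(tweets):
    counts = Counter()
    # set() collapses tweets with identical text, such as repeated retweets.
    for text in set(tweets):
        # Whitespace split is a stand-in for the tweet tokenizer.
        tokens = text.lower().split()
        counts.update(t for t in tokens if not is_noise(t) and t != 'rt')
    return counts

# Made-up sample tweets; the handle and URL are placeholders.
sample = [
    'RT @someuser: Healthy fasting tips https://t.co/xyz',
    'RT @someuser: Healthy fasting tips https://t.co/xyz',
    'Great cheesesteak today',
]

counts = count_tokens(sample)
print(counts)
```

Deduplicating on exact text is a deliberately crude choice. It removes the effect of one viral retweet but would not catch quoted or lightly edited copies.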