# Yelp Data Challenge - Restaurant Recommender
Summary: 
- The existing star ratings can be used directly to build a recommendation system.
- Here I build three recommenders: an item-item collaborative filtering recommender, an NMF recommender, and an SVD recommender.
- I compare the recommenders by checking which category labels are shared between the recommended restaurants and the restaurants the user has already rated.
- The item-item collaborative filtering recommender and the SVD recommender perform better on this check, and the SVD recommender is faster to compute. Therefore, the SVD recommender wins here.
- The reasons are:
    - Collaborative filtering needs to compute pairwise distances, so it is slower than the matrix factorization models.
    - Setting the SVD recommender's number of components to the number of restaurant labels settles the choice of latent factors, so the SVD recommender outperforms the NMF recommender.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('/Users/LiangTan/Documents/BitTigerDS/Yelp/Yelp_Data_Challenge_Project/dataset/last_year_restaurant_reviews.csv')

In [3]:
df.head(3)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
6,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"['Cajun/Creole', 'Steakhouses', 'Restaurants']",4.0,0,2017-02-14,0,Xp3ppynEvVu1KxDHQ3ae8w,5,Delmonico Steakhouse is a steakhouse owned by ...,0,KC8H7qTZVPIEnanw9fG43g
9,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"['Cajun/Creole', 'Steakhouses', 'Restaurants']",4.0,1,2017-05-28,0,LEzphAnz0vKE32PUCbjLgQ,4,One of the top steak places I've had in Vegas ...,2,3RTesI_MAwct13LWm4rhLw
11,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"['Cajun/Creole', 'Steakhouses', 'Restaurants']",4.0,0,2017-08-25,0,4e-cxYVdlIu2ZDxVJqUfOQ,5,This place is superb from the customer service...,0,EAOt1UQhJD0GG3l_jv7rWA


## Clean data and get rating data 

#### 1. Select relevant columns in the original dataframe

In [4]:
recommender_df = df[['business_id', 'user_id', 'stars']]
recommender_df.head(3)

Unnamed: 0,business_id,user_id,stars
6,--9e1ONYQuAa-CB_Rrw7Tw,KC8H7qTZVPIEnanw9fG43g,5
9,--9e1ONYQuAa-CB_Rrw7Tw,3RTesI_MAwct13LWm4rhLw,4
11,--9e1ONYQuAa-CB_Rrw7Tw,EAOt1UQhJD0GG3l_jv7rWA,5


Many users have written only a few reviews. I will exclude these users from the item-item similarity recommender and keep only users with at least 10 reviews.

In [5]:
reviews_count_df = recommender_df.groupby('user_id')['stars'].count()
reviews_count_df.head(5)

user_id
---1lKK3aKOuomHnwAkAow    1
---udAKDsn0yQXmzbWQNSw    2
--2vR0DIsmQ6WfcSzKWigw    2
--4uW4yJiRT2oXMYkCPq1Q    1
--66hzx80CeVZcrm4AKJtQ    1
Name: stars, dtype: int64

In [27]:
print('Max reviews: %s, Min reviews: %s' % (max(reviews_count_df), min(reviews_count_df)))
print('Median reviews: %s, Mean reviews: %s' % (np.median(reviews_count_df), round(np.mean(reviews_count_df),2)))
print('25%% reviews: %d,  75%% reviews: %d' % (np.percentile(reviews_count_df, 25), np.percentile(reviews_count_df, 75)))
print('Number of unique business: %d' % (len(set(recommender_df['business_id']))))

Max reviews: 163, Min reviews: 1
Median reviews: 1.0, Mean reviews: 1.86
25% reviews: 1,  75% reviews: 2
Number of unique business: 4072


In [28]:
active_user = list(reviews_count_df[reviews_count_df >= 10].index)
# keep only the reviews written by users with at least 10 reviews
active_user_df = recommender_df[recommender_df['user_id'].isin(active_user)]
active_user_df.head(5)

Unnamed: 0,business_id,user_id,stars
64,--9e1ONYQuAa-CB_Rrw7Tw,JaqcCU3nxReTW2cBLHounA,5
233,--9e1ONYQuAa-CB_Rrw7Tw,y4O_c6UUAAtPb3Uk-T4t8A,5
343,--9e1ONYQuAa-CB_Rrw7Tw,-N0xFiL7wxv07F11bfLOvQ,4
352,--9e1ONYQuAa-CB_Rrw7Tw,s2o_JsABvrZVm_T03qrBUw,5
407,--9e1ONYQuAa-CB_Rrw7Tw,kOPRX94rDBXEPmLBZNG7RQ,1


In [29]:
print('The total number of active users in Las Vegas in 2017 is %d.' % \
      (len(active_user_df.groupby('user_id')['stars'].count())))

The total number of active users in Las Vegas in 2017 is 1423.


In [30]:
print('The total records number for active users in Las Vegas in 2017 is %d.' % \
      (len(active_user_df)))

The total records number for active users in Las Vegas in 2017 is 25550.


#### 2. Create utility matrix from records

In [32]:
from scipy import sparse
# matrix dimensions: one row per active user, one column per business
n_users = len(set(active_user_df['user_id']))
n_businesses = len(set(active_user_df['business_id']))
ratings_mat = sparse.lil_matrix((n_users, n_businesses))
ratings_mat

<1423x3178 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in LInked List format>

Fill the ratings matrix from the review records.

In [33]:
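user_id = list(set(active_user_df['user_id']))
business_id = list(set(active_user_df['business_id']))
# map each ID to its row/column position once, instead of calling list.index() per review
user_pos = {u: i for i, u in enumerate(user_id)}
business_pos = {b: i for i, b in enumerate(business_id)}
for row in active_user_df.itertuples(index=False):
    ratings_mat[user_pos[row.user_id], business_pos[row.business_id]] = row.stars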
ratings_mat

<1423x3178 sparse matrix of type '<class 'numpy.float64'>'
	with 25550 stored elements in LInked List format>
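
As an aside, the same utility matrix can be built without a Python loop. The sketch below is one possible approach, not the method used in this notebook: it uses a small made-up table in place of `active_user_df`, and `pd.factorize` to turn IDs into row and column positions. One assumption to note: `coo_matrix` sums duplicate (user, business) pairs, whereas the loop above keeps the last rating, so this is only equivalent when each pair appears once.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for active_user_df (these IDs and ratings are made up).
toy_df = pd.DataFrame({
    'user_id':     ['u1', 'u2', 'u1', 'u3'],
    'business_id': ['b1', 'b1', 'b2', 'b3'],
    'stars':       [5, 4, 3, 1],
})

# factorize assigns each distinct ID an integer code in order of first appearance
user_codes, user_index = pd.factorize(toy_df['user_id'])
business_codes, business_index = pd.factorize(toy_df['business_id'])

# build the matrix in COO format in one shot, then convert to LIL for row slicing
toy_ratings = sparse.coo_matrix(
    (toy_df['stars'].values.astype(float), (user_codes, business_codes)),
    shape=(len(user_index), len(business_index)),
).tolil()

print(toy_ratings.toarray())
```

Here `user_index` and `business_index` play the role of the `user_id` and `business_id` lists above: position `i` holds the ID of row or column `i`.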

## Item - Item Collaborative Filter Recommender

In [77]:
from sklearn.metrics.pairwise import cosine_similarity
from time import time
class ItemItemRecommender(object):

    def __init__(self, neighborhood_size):
        self.neighborhood_size = neighborhood_size

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        self.item_sim_mat = cosine_similarity(self.ratings_mat.T)
        self._set_neighborhoods()

    def _set_neighborhoods(self):
        least_to_most_sim_indexes = np.argsort(self.item_sim_mat, 1)
        self.neighborhoods = least_to_most_sim_indexes[:, -self.neighborhood_size:]

    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        # Just initializing so I have somewhere to put rating preds
        out = np.zeros(self.n_items)
        for item_to_rate in range(self.n_items):
            relevant_items = np.intersect1d(self.neighborhoods[item_to_rate],
                                            items_rated_by_this_user,
                                            assume_unique=True)  # assume_unique speeds up intersection op
            out[item_to_rate] = self.ratings_mat[user_id, relevant_items] * \
                self.item_sim_mat[item_to_rate, relevant_items] / \
                self.item_sim_mat[item_to_rate, relevant_items].sum()
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        cleaned_out = np.nan_to_num(out)
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [78]:
my_rec_engine = ItemItemRecommender(neighborhood_size=80)
my_rec_engine.fit(ratings_mat)
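
To make the prediction step in `pred_one_user` concrete, here is a minimal sketch on a made-up 3x3 ratings matrix. It computes the same similarity-weighted average of the user's known ratings, but for simplicity it uses every rated item as a neighbor rather than restricting to a neighborhood of fixed size.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: 3 users x 3 items, 0 means "not rated" (values are made up).
toy = np.array([[5., 4., 0.],
                [4., 5., 1.],
                [1., 0., 5.]])

# item-by-item cosine similarity, shape (3, 3)
item_sim = cosine_similarity(toy.T)

user, target = 0, 2                   # predict user 0's rating for item 2
rated = np.nonzero(toy[user])[0]      # items this user has rated
weights = item_sim[target, rated]     # similarity of the target item to each rated item

# weighted average of the user's ratings, weighted by item similarity
pred = toy[user, rated].dot(weights) / weights.sum()
print(round(pred, 2))
```

Because the prediction is a weighted average with non-negative weights, it always falls between the lowest and highest rating the user has given to the neighboring items.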

Let me try the recommender system on a randomly chosen "lucky" user.

In [79]:
lucky_user = np.random.choice(active_user_df['user_id'], 1)[0]
lucky_user_index = user_id.index(lucky_user)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)



In [80]:
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user DeVGAiOf2mHVUDfxvuhVlQ are: 
Domino's Pizza, Las Vegas South Premium Outlets, Gordon Biersch Brewery Restaurant, Elia, Papa Murphy's, Battista's Hole In the Wall, Newport Cafe, Sonic Drive-In, Gyro Time Restaurant, Subway


In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
active,american,arts,bar,barbeque,bars,beer,bowling,breakfast,breweries,brunch,buffets,burgers,cafes,cajun,casinos,caterers,cocktail,coffee,creole,delis,diners,entertainment,event,fast,food,gastropubs,hotels,japanese,life,mexican,music,musicians,new,nightlife,performing,pizza,planning,plates,pubs,restaurants,salad,seafood,services,small,spaces,spirits,sports,sushi,tapas,tea,traditional,travel,venues,wedding,wine


In [82]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,bars,bed,breakfast,breweries,brunch,burgers,cafes,caterers,centers,cheesesteaks,chicken,cream,delivery,event,fashion,fast,food,free,frozen,gluten,greek,hotels,ice,italian,mediterranean,new,nightlife,outlet,pizza,planning,restaurants,salad,sandwiches,services,shopping,soup,stores,traditional,travel,vegetarian,wine,wings,yogurt


In [83]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, bars, breakfast, breweries, brunch, burgers, cafes, caterers, event, fast, food, hotels, new, nightlife, pizza, planning, restaurants, salad, services, traditional, travel, wine
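
The label comparison above is repeated for each recommender, so it could be wrapped in a helper. The sketch below is a hypothetical alternative, not the notebook's method: `label_set` and `label_overlap` are names introduced here, businesses are selected by `business_id` rather than by name, and whole category labels are compared (parsed with `ast.literal_eval`) instead of the word tokens produced by `TfidfVectorizer`. It also reports a Jaccard score so that overlaps can be compared as a single number. It assumes `categories` holds a string such as `"['Pizza', 'Restaurants']"`, as in the `df.head(3)` output.

```python
import ast
import pandas as pd

def label_set(df, business_ids):
    """Collect the distinct category labels of the given businesses."""
    rows = df[df['business_id'].isin(business_ids)].drop_duplicates('business_id')
    labels = set()
    for raw in rows['categories']:
        labels.update(ast.literal_eval(raw))   # "['A', 'B']" -> ['A', 'B']
    return labels

def label_overlap(df, rated_ids, recommended_ids):
    """Return the shared labels and their Jaccard similarity."""
    rated = label_set(df, rated_ids)
    recommended = label_set(df, recommended_ids)
    common = rated & recommended
    union = rated | recommended
    jaccard = len(common) / len(union) if union else 0.0
    return common, jaccard

# Toy data (made up) with the same column layout as df
toy_df = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'categories': ["['Pizza', 'Restaurants']",
                   "['Bars', 'Restaurants']",
                   "['Pizza', 'Italian', 'Restaurants']"],
})
common, jaccard = label_overlap(toy_df, ['b1', 'b2'], ['b3'])
print(sorted(common), jaccard)
```

With the notebook's variables, the two ID lists would be `[business_id[i] for i in items_rated_by_this_user]` and `[business_id[i] for i in lucky_user_recommend]`.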


## Matrix Factorization recommender (NMF)

In [60]:
from sklearn.decomposition import NMF
class NMF_Recommender(object):

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        nmf = NMF(n_components=self.n_components)  # use the value passed to __init__
        nmf.fit(ratings_mat)
        self.W = nmf.transform(ratings_mat)
        self.H = nmf.components_
        self.error = nmf.reconstruction_err_
        self.ratings_mat_fitted = self.W.dot(self.H)

    def get_error(self):
        return self.error
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [72]:
# get recommendations for the same lucky user
my_rec_engine = NMF_Recommender(n_components=200)
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user HJj82f-csBI7jjgenwqhvw are: 
Bahama Breeze, Black Bear Diner, Cracker Barrel Old Country Store, Grimaldi's Pizzeria, Market Grille Cafe, Blueberry Hill Family Restaurant , Mimosas Gourmet, Black Bear Diner, Sambalatte Torrefazione, Stephano's Greek & Mediterranean Grill


In [73]:
print("The user's originally rated restaurants are:\n %s" % (','.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in items_rated_by_this_user)))

The user's originally rated restaurants are:
 The Cracked Egg,Capriotti's Sandwich Shop,Putter's Bar & Grill,Babystacks Cafe,Timbers - Novat,Bob Taylor's Ranch House,Teriyaki Madness,BJ's Restaurant & Brewhouse,PT's Brewing,Mt Charleston Lodge Restaurant,Viva Las Arepas,Makai Pacific Island Grill,Sweet Tomatoes


In [74]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,arts,asian,barbeque,bars,beer,breakfast,breweries,brunch,buffets,burgers,cafes,caterers,cocktail,delis,entertainment,event,fast,filipino,food,free,fusion,gluten,halls,hawaiian,japanese,latin,lounges,new,nightlife,pizza,planning,pool,restaurants,salad,sandwiches,seafood,services,soup,spirits,sports,stands,steakhouses,traditional,vegetarian,venezuelan,wine


In [75]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,bakeries,bars,breakfast,brunch,cafes,caribbean,caterers,coffee,desserts,diners,eastern,event,food,greek,italian,laotian,latin,lebanese,mediterranean,middle,new,nightlife,pizza,planning,restaurants,salad,seafood,services,southern,tea,traditional,vegan


In [76]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, bars, breakfast, brunch, cafes, caterers, event, food, latin, new, nightlife, pizza, planning, restaurants, salad, seafood, services, traditional


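For this user, the NMF recommendations share 18 category labels with the restaurants the user has already rated. The item-item recommender above shares 22, but its output was recorded for a different randomly chosen user, so the two counts are not directly comparable.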

## Matrix Factorization recommender (SVD) with restaurants' labels.

Each business has its own category labels. Suppose we have a table of business_id against category labels, where each element is the style score of a restaurant for a label. We can build a second table of user_id against category labels, where each element is the preference (taste) of a user for that label. Multiplying the two tables gives the utility table. Both sub-tables may contain negative numbers, since a preference can be a like or a dislike.
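
The idea of multiplying a user-by-label table with a label-by-restaurant table can be illustrated with a few made-up numbers. This is only a sketch of the intuition: in the `SVD_Recommender` below, the latent factors are learned from the ratings by `TruncatedSVD`, and only the number of labels is borrowed as the number of components.

```python
import numpy as np

# Rows: 2 users; columns: 3 labels (say pizza, sushi, bars). Values are made up.
# A negative entry means the user dislikes that label.
user_taste = np.array([[ 1.0, -0.5, 0.2],
                       [-0.3,  0.9, 0.0]])

# Rows: the same 3 labels; columns: 2 restaurants (style score per label).
label_style = np.array([[0.9, 0.0],
                        [0.0, 1.0],
                        [0.3, 0.1]])

# utility[u, r] = sum over labels of taste[u, label] * style[label, r]
utility = user_taste.dot(label_style)   # shape (2 users, 2 restaurants)
print(utility)
```

In this toy case the first user scores the first restaurant highest and the second user scores the second restaurant highest, which matches their tastes for pizza and sushi respectively.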

In [63]:
# get the number of labels
mask = df['business_id'].isin(business_id)
category = df['categories'][mask]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
category_vec = vectorizer.fit_transform(category).toarray()
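words = vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()
# This is the number of distinct word tokens in the labels (multi-word categories are split into words)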
print('The total number of restaurant labels is %d' % (len(words))) 

The total number of restaurant labels is 361


In [64]:
from sklearn.decomposition import TruncatedSVD
class SVD_Recommender(object):

    def __init__(self, n_components=361):
        self.n_components = n_components  # default: the number of labels

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        svd = TruncatedSVD(n_components=self.n_components, n_iter=7, random_state=1)
        svd.fit(ratings_mat)
        self.V = svd.components_
        self.U = svd.transform(ratings_mat)
        self.ratings_mat_fitted = self.U.dot(self.V)

    def get_error(self):
        # densify the sparse ratings so the subtraction and the square are element-wise
        return ((self.ratings_mat_fitted - self.ratings_mat.toarray()) ** 2).mean()
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [65]:
# get recommendations for the same lucky user
my_rec_engine = SVD_Recommender()
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user HJj82f-csBI7jjgenwqhvw are: 
Tommy Bahama Restaurant | Bar | Store - Las Vegas, Biwon Korean BBQ & Sushi Restaurant, Sambalatte Torrefazione, Buffalo Wild Wings, Fun Tacos, Philly Steak Express, Black Bear Diner, China A Go Go, The Original Sunrise Cafe, Sushi Bomb


Let me check whether the recommendations make sense. I can do this by checking whether the category labels are consistent between the originally rated restaurants and the recommended restaurants.

In [68]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,arts,asian,barbeque,bars,beer,breakfast,breweries,brunch,buffets,burgers,cafes,caterers,cocktail,delis,entertainment,event,fast,filipino,food,free,fusion,gluten,halls,hawaiian,japanese,latin,lounges,new,nightlife,pizza,planning,pool,restaurants,salad,sandwiches,seafood,services,soup,spirits,sports,stands,steakhouses,traditional,vegetarian,venezuelan,wine


In [70]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,asian,bagels,barbeque,bars,breakfast,brunch,burgers,cafes,cheesesteaks,chicken,chinese,coffee,diners,fast,food,fusion,hawaiian,japanese,korean,mediterranean,mexican,new,nightlife,pizza,restaurants,salad,sandwiches,seafood,sports,sushi,szechuan,tacos,tea,thai,traditional,wings


In [71]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, asian, barbeque, bars, breakfast, brunch, burgers, cafes, fast, food, fusion, hawaiian, japanese, new, nightlife, pizza, restaurants, salad, sandwiches, seafood, sports, traditional
