# Yelp Data Challenge - Restaurant Recommender
Summary: 
- Recommend restaurants to existing users.
- Approaches: Item - Item Collaborative Filter, Matrix Factorization (NMF), Matrix Factorization (SVD)
- Compare the three approaches
    - Select one active user at random by user ID
    - Provide that user with the top ten recommendations
    - Compare the categories shared by the restaurants the user rated and the recommended restaurants



In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('2017_restaurant_reviews.csv')

In [3]:
df.head(3)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew
5,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-10,0,9CLEOpUCqRkIR02sx-JsMQ,5,A little hole in the wall for some really deli...,0,atyCaAjUYatIFDOGKy00SA


## Clean data and get rating data 

#### 1. Select relevant columns in the original dataframe

In [4]:
recommender_df = df[['business_id', 'user_id', 'stars']]
recommender_df.head(3)

Unnamed: 0,business_id,user_id,stars
2,--1UhMGODdWsrMastO9DZw,Sew1Nht6Q0sGTIZeNvRfLw,5
3,--1UhMGODdWsrMastO9DZw,NoQCmYKyMPs4D01Wa6dZew,4
5,--1UhMGODdWsrMastO9DZw,atyCaAjUYatIFDOGKy00SA,5


Most users have written only one or two reviews, which is too little signal for item-item similarity. I will exclude users with fewer than 10 reviews from the recommenders.

In [5]:
print(recommender_df.groupby('user_id').count())  # group by user_id and count the rows (reviews) for each user

                        business_id  stars
user_id                                   
--2PnhMMH7EYoY3wywOvgQ            1      1
--6kLBBsm0GPM9vIB2YBDw            1      1
--7gjElmOrthETJ8XqzMBw            1      1
--8NUFYnpU_Zu09TgcLevw            1      1
--BumyUHiO_7YsHurb9Hkw           43     43
--C93xIlmjtgQfSOIpcQSA            1      1
--DKDJlRHfsvufdGSk_Sdw            1      1
--NIc98RMssgy0mSZL3vpA            9      9
--Qh8yKWAvIP4V4K8ZPfHA           33     33
--WhK4MJx0fKvg64LqwStg            1      1
--YhjyV-ce1nFLYxP49C5A           34     34
--_i0TDbSrV8HN19XlSEFw            2      2
--b8fKG7GFGSGfl_BzTnPw            1      1
--cj94VBt0CHYM2UfQBglg            1      1
--neBDssyZlHqAWgrHtUBQ            2      2
--t6W1JHbStaCp5RO05thA            1      1
-018WmPPk8qlp3TEiqqMVw            1      1
-03y31IzykunU9azzgLsoQ            1      1
-06T53TLMkg_xGl3flhDNw            1      1
-0OE9Pn8vSK-WjJeRtHDtw            1      1
-0OSHsV_0VZz4E09FLgQtQ            1      1
-0Z6b2zZhdX

In [6]:
reviews_count_df = recommender_df.groupby('user_id')['stars'].count()
reviews_count_df.head(5)

user_id
--2PnhMMH7EYoY3wywOvgQ     1
--6kLBBsm0GPM9vIB2YBDw     1
--7gjElmOrthETJ8XqzMBw     1
--8NUFYnpU_Zu09TgcLevw     1
--BumyUHiO_7YsHurb9Hkw    43
Name: stars, dtype: int64

In [7]:
print('Max reviews: %s, Min reviews: %s' % (max(reviews_count_df), min(reviews_count_df)))
print('Median reviews: %s, Mean reviews: %s' % (np.median(reviews_count_df), round(np.mean(reviews_count_df),2)))
print('25%% reviews: %d,  75%% reviews: %d' % (np.percentile(reviews_count_df, 25), np.percentile(reviews_count_df, 75)))
print('Number of unique business: %d' % (len(set(recommender_df['business_id']))))

Max reviews: 227, Min reviews: 1
Median reviews: 1.0, Mean reviews: 2.75
25% reviews: 1,  75% reviews: 2
Number of unique business: 13181


In [13]:
active_user = list(reviews_count_df[reviews_count_df >= 10].index)  # user_ids with at least 10 reviews
mask = recommender_df['user_id'].isin(active_user)  # vectorized membership test, one boolean per review
active_user_df = recommender_df[mask]
active_user_df.head(5)

Unnamed: 0,business_id,user_id,stars
21,--1UhMGODdWsrMastO9DZw,TzU30D-CjtPP3XumggK0Mg,4
22,--1UhMGODdWsrMastO9DZw,ZgAzKwganIXImRAMcvdK_A,4
23,--1UhMGODdWsrMastO9DZw,m-p-7WuB85UjsLDaxJXCXA,5
41,--DaPTJW3-tB1vP-PfdTEg,2HjBjUrqjjVfopPfghgpqw,3
85,--SrzpvFLwP_YFwB_Cetow,6eCgSb66TE1LeiWPPBPnTg,4


In [14]:
print('The total number of active users in Canada in 2017 is %d.' % \
      (len(active_user_df.groupby('user_id')['stars'].count())))

The total number of active users in Canada in 2017 is 3190.


In [15]:
print('The total records number for active users in Canada in 2017 is %d.' % \
      (len(active_user_df)))

The total records number for active users in Canada in 2017 is 74895.
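
The same activity filter can be written without an explicit list of active users. The sketch below is only an illustration on made-up data: `toy_df`, `MIN_REVIEWS`, and the IDs are hypothetical, and the threshold is lowered to 2 so the toy table stays small (the notebook uses 10).

```python
import pandas as pd

# Toy review table; the IDs and ratings are made up for illustration.
toy_df = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3', 'b1', 'b2', 'b1'],
    'user_id':     ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
    'stars':       [5, 4, 3, 2, 5, 4],
})

MIN_REVIEWS = 2  # hypothetical threshold; the notebook uses 10

# For every row, the number of reviews written by that row's user.
reviews_per_user = toy_df.groupby('user_id')['stars'].transform('count')

# Keep only the rows whose author meets the threshold.
toy_active_df = toy_df[reviews_per_user >= MIN_REVIEWS]

print(toy_active_df['user_id'].nunique())  # number of active users
print(len(toy_active_df))                  # number of records kept
```

`transform('count')` returns a Series aligned with the original rows, so it can be used directly as a boolean filter.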


#### 2. Create utility matrix from records

In [16]:
from scipy import sparse
n_users = len(set(active_user_df['user_id']))  # number of distinct active users (rows)
n_businesses = len(set(active_user_df['business_id']))  # number of distinct restaurants (columns)
ratings_mat = sparse.lil_matrix((n_users, n_businesses))
ratings_mat

<3190x10717 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in LInked List format>

Fill the ratings matrix from the review table

In [17]:
user_id = list(set(active_user_df['user_id']))
business_id = list(set(active_user_df['business_id']))
for _, row in active_user_df.iterrows():
    # store each review's star rating at (user row, business column)
    ratings_mat[user_id.index(row.user_id), business_id.index(row.business_id)] = row.stars
ratings_mat

<3190x10717 sparse matrix of type '<class 'numpy.float64'>'
	with 74894 stored elements in LInked List format>
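
The matrix stores 74894 elements for 74895 records. One plausible explanation (an assumption, not verified here) is a repeated (user, business) pair whose later rating overwrote the earlier one. The sketch below shows an alternative way to build the same kind of utility matrix with integer codes from `pd.factorize` instead of repeated `list.index` lookups. The data and the names `toy_df`, `deduped`, and `toy_ratings` are hypothetical.

```python
import pandas as pd
from scipy import sparse

# Toy review table; u1 rated b1 twice (5 stars, then 1 star).
toy_df = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b1', 'b3', 'b1'],
    'user_id':     ['u1', 'u1', 'u2', 'u2', 'u1'],
    'stars':       [5, 3, 4, 2, 1],
})

# Keep the last rating per (user, business) pair, mirroring the overwrite
# that happens when the same matrix cell is assigned twice in a loop.
deduped = toy_df.drop_duplicates(['user_id', 'business_id'], keep='last')

# factorize maps each distinct ID to an integer code and returns the labels.
user_codes, user_labels = pd.factorize(deduped['user_id'])
biz_codes, biz_labels = pd.factorize(deduped['business_id'])

# Build the sparse matrix in one call from (value, (row, col)) triples.
toy_ratings = sparse.csr_matrix(
    (deduped['stars'].to_numpy(dtype=float), (user_codes, biz_codes)),
    shape=(len(user_labels), len(biz_labels)),
)

print(toy_ratings.shape, toy_ratings.nnz)
```

`user_labels` and `biz_labels` play the role of the `user_id` and `business_id` lists: position `i` in each gives the ID for row or column `i`.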

## Item - Item Collaborative Filter Recommender

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
from time import time
class ItemItemRecommender(object):

    def __init__(self, neighborhood_size):
        self.neighborhood_size = neighborhood_size

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        self.item_sim_mat = cosine_similarity(self.ratings_mat.T)  # item-item cosine similarity (columns are items)
        self._set_neighborhoods()

    def _set_neighborhoods(self):
        least_to_most_sim_indexes = np.argsort(self.item_sim_mat, 1)
        self.neighborhoods = least_to_most_sim_indexes[:, -self.neighborhood_size:]

    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        # Just initializing so I have somewhere to put rating preds
        out = np.zeros(self.n_items)
        for item_to_rate in range(self.n_items):
            # items that are both in this item's neighborhood and rated by this user
            relevant_items = np.intersect1d(self.neighborhoods[item_to_rate],
                                            items_rated_by_this_user,
                                            assume_unique=True)  # assume_unique speeds up the intersection
            # similarity-weighted average of the user's ratings on the relevant items
            out[item_to_rate] = self.ratings_mat[user_id, relevant_items] * \
                self.item_sim_mat[item_to_rate, relevant_items] / \
                self.item_sim_mat[item_to_rate, relevant_items].sum()
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        cleaned_out = np.nan_to_num(out)
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]  #ratings_mat.shape[0]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
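
To make the prediction step in `pred_one_user` concrete, here is a small worked example on a dense toy matrix. It is a sketch, not the class itself: the ratings, `toy`, and the neighborhood size of 2 are made up. Note that an item's neighborhood typically contains the item itself, because its self-similarity is 1, and that an empty set of relevant items produces 0/0, which the class turns into 0 through `np.nan_to_num`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are items; 0 means "not rated".
toy = np.array([
    [5., 4., 0., 1.],
    [4., 5., 1., 0.],
    [1., 0., 5., 4.],
    [0., 1., 4., 5.],
])

# Item-item similarity and the 2 most similar items per item.
item_sim = cosine_similarity(toy.T)
neighborhood_size = 2
neighborhoods = np.argsort(item_sim, axis=1)[:, -neighborhood_size:]

user, item = 0, 2                      # user 0 has not rated item 2
rated = np.nonzero(toy[user])[0]       # items user 0 has rated
relevant = np.intersect1d(neighborhoods[item], rated)

# Similarity-weighted average of the user's ratings on the relevant items.
weights = item_sim[item, relevant]
pred = toy[user, relevant] @ weights / weights.sum()

print(relevant, pred)
```

Item 2 is most similar to item 3, which user 0 rated 1 star, so the predicted rating for item 2 is also low.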

In [20]:
# keep the 80 most similar items as each item's neighborhood
my_rec_engine = ItemItemRecommender(neighborhood_size=80)
my_rec_engine.fit(ratings_mat)

In [17]:
#my_rec_engine.pred_all_users()



array([[1., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 5.],
       [5., 0., 5., ..., 0., 2., 4.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 4., 4.],
       [0., 0., 0., ..., 0., 3., 0.]])

In [34]:
lucky_user = np.random.choice(active_user_df['user_id'], 1)[0]
lucky_user_index = user_id.index(lucky_user)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)

In [35]:
print("The top ten recommendations for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendations for user xo8HykGB7Ekm_QKrMRg3Zw are: 
HKS BBQ & Noodle House, Osaka Sushi Japanese Korean Restaurant, Chef Papa Tea & Noodle Bar, Poke Guys, Markham, Congee Queen, Azyun Restaurant, KINTON RAMEN, Skyview Fusion Cuisine, Dagu Rice Noodle Markham, MINE Sushi
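
The name lookup above filters the whole review table once per recommended item. A lookup Series built once would avoid that. The sketch below assumes each `business_id` maps to a single name. The IDs and the names `toy_df`, `name_by_business`, `toy_business_id`, and `toy_recommend` are hypothetical.

```python
import pandas as pd

# Toy review table: several reviews can share one business_id.
toy_df = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2', 'b3'],
    'name': ['The Spicy Amigos', 'The Spicy Amigos', 'KINTON RAMEN', 'MINE Sushi'],
})

# One row per business, indexed by business_id for direct lookup.
name_by_business = (toy_df.drop_duplicates('business_id')
                          .set_index('business_id')['name'])

toy_business_id = ['b3', 'b1', 'b2']  # column order of a toy utility matrix
toy_recommend = [2, 0]                # recommended column indices

# Translate column indices to business IDs, then to names.
names = [name_by_business[toy_business_id[i]] for i in toy_recommend]
print(', '.join(names))
```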


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,asian,barbeque,bars,burgers,cafes,canadian,chicken,chinese,comfort,desserts,fast,food,fusion,italian,japanese,korean,lounges,new,nightlife,noodles,pizza,pubs,ramen,restaurants,sandwiches,seafood,sports,steakhouses,sushi,traditional,vietnamese,wings


In [37]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
asian,barbeque,bars,chinese,dim,diners,event,food,fusion,hawaiian,japanese,korean,nightlife,noodles,planning,poke,pubs,ramen,restaurants,services,soup,spaces,sum,sushi,tapas,venues


In [38]:
# Check the labels shared by the rated and the recommended restaurants
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
asian, barbeque, bars, chinese, food, fusion, japanese, korean, nightlife, noodles, pubs, ramen, restaurants, sushi
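
The comparison only needs the vocabulary of each group of category strings, not the TF-IDF weights. A minimal sketch with plain Python sets follows. It is a simplified stand-in, not the notebook's method: `category_tokens` is a hypothetical helper, it does not remove English stop words as `TfidfVectorizer(stop_words='english')` does, and the Jaccard score is an added illustration of one way to turn the overlap into a single number.

```python
import re

def category_tokens(category_strings):
    """Collect the lowercase word tokens that appear in a list of category strings."""
    tokens = set()
    for cats in category_strings:
        tokens.update(re.findall(r'[a-z]+', cats.lower()))
    return tokens

# Made-up category strings in the same "A, B, C" style as the dataset.
rated = ['Restaurants, Mexican', 'Japanese, Sushi Bars, Restaurants']
recommended = ['Ramen, Japanese, Restaurants', 'Poke, Hawaiian, Food']

rated_tokens = category_tokens(rated)
rec_tokens = category_tokens(recommended)

# Shared labels, plus the share of all labels that are common to both groups.
common = sorted(rated_tokens & rec_tokens)
jaccard = len(common) / len(rated_tokens | rec_tokens)

print(common, round(jaccard, 3))
```

A single overlap score makes it easier to compare recommenders than reading two label lists side by side, although generic tokens such as `restaurants` and `food` inflate it.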


## Matrix Factorization recommender (NMF)

In [26]:
from sklearn.decomposition import NMF
class NMF_Recommender(object):

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        nmf = NMF(n_components=self.n_components)  # use the value passed to __init__ instead of a hard-coded 200
        nmf.fit(ratings_mat)
        self.W = nmf.transform(ratings_mat)
        self.H = nmf.components_
        self.error = nmf.reconstruction_err_
        self.ratings_mat_fitted = self.W.dot(self.H)

    def get_error(self):
        return self.error
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
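
As an illustration of what `fit` and `top_n_recs` do together, the sketch below factorizes a small dense toy matrix and picks the best unrated item for one user. The data, the 2 components, and the `init`, `random_state`, and `max_iter` settings are assumptions chosen to keep the toy example reproducible.

```python
import numpy as np
from sklearn.decomposition import NMF

# Rows are users, columns are items; 0 means "not rated".
toy = np.array([
    [5., 4., 0., 1.],
    [4., 5., 1., 0.],
    [1., 0., 5., 4.],
    [0., 1., 4., 5.],
])

# Factor toy into W (users x components) and H (components x items).
nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(toy)
H = nmf.components_

# The product fills in a predicted score for every cell, rated or not.
fitted = W @ H

user = 0
unrated = np.flatnonzero(toy[user] == 0)
best_unrated = unrated[np.argmax(fitted[user, unrated])]

print(np.round(fitted, 2))
print(best_unrated)
```

Both factors are non-negative, so every predicted score is non-negative as well. Unrated cells are treated as zeros during fitting, which pulls their predictions toward zero.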

In [39]:
# get recommendations for the same lucky user
my_rec_engine = NMF_Recommender(n_components=200)
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)  # also returns the items the user originally rated
print("The top ten recommendations for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendations for user xo8HykGB7Ekm_QKrMRg3Zw are: 
Chef Papa Tea & Noodle Bar, Woodstone Eatery, Eggette Hut, Poke Guys, Markham, MINE Sushi, KINTON RAMEN, Ai Sushi, Azyun Restaurant, Skyview Fusion Cuisine, Dagu Rice Noodle Markham


In [40]:
print("The user's originally rated restaurants are:\n %s" % (','.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in items_rated_by_this_user)))

The user's originally rated restaurants are:
 Boston Pizza,Fat Ninja Bite,Montana's BBQ & Bar,Golden Duke Chinese Cuisine,The Pho Restaurant,Yokohama Japanese Cuisine,Cups Bingsu Café,Kiku,JOEY Markville,Pho 99 Vietnamese Restaurant,The Six Cafe and Restaurant,Kyouka Ramen,Gal's Sushi


In [41]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,asian,barbeque,bars,burgers,cafes,canadian,chicken,chinese,comfort,desserts,fast,food,fusion,italian,japanese,korean,lounges,new,nightlife,noodles,pizza,pubs,ramen,restaurants,sandwiches,seafood,sports,steakhouses,sushi,traditional,vietnamese,wings


In [42]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
asian,bars,breakfast,brunch,chinese,desserts,dim,event,food,fusion,hawaiian,japanese,nightlife,noodles,planning,poke,pubs,ramen,restaurants,services,soup,spaces,sum,sushi,tapas,venues,waffles


In [43]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
asian, bars, chinese, desserts, food, fusion, japanese, nightlife, noodles, pubs, ramen, restaurants, sushi


#### Common labels from Item - Item Collaborative Filter are: 
- asian, barbeque, bars, chinese, food, fusion, japanese, korean, nightlife, noodles, pubs, ramen, restaurants, sushi

#### Common labels from NMF are: 
- asian, bars, chinese, desserts, food, fusion, japanese, nightlife, noodles, pubs, ramen, restaurants, sushi

#### Common labels from SVD are: 
- asian, barbeque, bars, chinese, food, fusion, japanese, korean, nightlife, noodles, pubs, ramen, restaurants, sushi

## Matrix Factorization recommender (SVD) with restaurants' labels.

In [44]:
#get the number of labels 
mask = [business in business_id for business in df['business_id']]
category = df['categories'][mask]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
category_vec = vectorizer.fit_transform(category).toarray()
words = vectorizer.get_feature_names()
# Number of unique word tokens found in the category strings
print('The total number of restaurant labels is %d' % (len(words))) 

The total number of restaurant labels is 441


In [45]:
from sklearn.decomposition import TruncatedSVD
class SVD_Recommender(object):

    def __init__(self):
        self.n_components = 361  # number of latent components (note: the label count printed above is 441)

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        svd = TruncatedSVD(n_components=self.n_components, n_iter=7, random_state=1)
        svd.fit(ratings_mat)
        self.V = svd.components_
        self.U = svd.transform(ratings_mat)
        self.ratings_mat_fitted = self.U.dot(self.V)

    def get_error(self):
        # densify first so that ** is an elementwise square rather than a matrix power
        diff = self.ratings_mat_fitted - self.ratings_mat.toarray()
        return (diff**2).mean()
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
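
The reconstruction and its error can be checked on a small example. The sketch below runs `TruncatedSVD` on a sparse toy matrix and computes the mean squared error against a dense copy of the ratings. The data and the choice of 2 components are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Rows are users, columns are items; 0 means "not rated".
dense_toy = np.array([
    [5., 4., 0., 1.],
    [4., 5., 1., 0.],
    [1., 0., 5., 4.],
    [0., 1., 4., 5.],
])
toy = sparse.csr_matrix(dense_toy)

# Rank-2 truncated SVD: U holds user factors, V holds item factors.
svd = TruncatedSVD(n_components=2, n_iter=7, random_state=1)
U = svd.fit_transform(toy)
V = svd.components_

# Low-rank reconstruction of the full ratings matrix.
fitted = U @ V

# Elementwise squared error against the dense ratings.
mse = np.mean((fitted - dense_toy) ** 2)

print(np.round(fitted, 2))
print(mse)
```

This error is measured over every cell, including the unrated zeros, so it describes how well the low-rank matrix reproduces the zero-filled input rather than how well it predicts held-out ratings.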

In [46]:
# get recommendations for the same lucky user
my_rec_engine = SVD_Recommender()
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendations for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendations for user xo8HykGB7Ekm_QKrMRg3Zw are: 
HKS BBQ & Noodle House, Osaka Sushi Japanese Korean Restaurant, Chef Papa Tea & Noodle Bar, Poke Guys, Markham, Congee Queen, Azyun Restaurant, KINTON RAMEN, Skyview Fusion Cuisine, Dagu Rice Noodle Markham, MINE Sushi


In [47]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,asian,barbeque,bars,burgers,cafes,canadian,chicken,chinese,comfort,desserts,fast,food,fusion,italian,japanese,korean,lounges,new,nightlife,noodles,pizza,pubs,ramen,restaurants,sandwiches,seafood,sports,steakhouses,sushi,traditional,vietnamese,wings


In [48]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
asian,barbeque,bars,chinese,dim,diners,event,food,fusion,hawaiian,japanese,korean,nightlife,noodles,planning,poke,pubs,ramen,restaurants,services,soup,spaces,sum,sushi,tapas,venues


In [49]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
asian, barbeque, bars, chinese, food, fusion, japanese, korean, nightlife, noodles, pubs, ramen, restaurants, sushi
