# Yelp Data Challenge - Restaurant Recommender
Summary:
- Recommend restaurants to existing users.
- Approaches: item-item collaborative filtering, matrix factorization with NMF, and matrix factorization with SVD.
- Compare the three approaches:
    - Select one user at random by user ID
    - Generate the top ten recommendations for that user
    - Compare the categories shared by the user's rated restaurants and the recommended restaurants



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [3]:
# df = pd.read_csv('2017_restaurant_reviews.csv')
df = pd.read_csv('2017_restaurant_new_reviews.csv')

In [3]:
df.head(3)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew
5,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-10,0,9CLEOpUCqRkIR02sx-JsMQ,5,A little hole in the wall for some really deli...,0,atyCaAjUYatIFDOGKy00SA


## Clean data and get rating data 

#### 1. Select relevant columns in the original dataframe

In [5]:
recommender_df = df[['business_id', 'user_id', 'new_cluster']]
recommender_df.head(3)

Unnamed: 0,business_id,user_id,new_cluster
0,--1UhMGODdWsrMastO9DZw,Sew1Nht6Q0sGTIZeNvRfLw,1
1,--1UhMGODdWsrMastO9DZw,NoQCmYKyMPs4D01Wa6dZew,1
2,--1UhMGODdWsrMastO9DZw,atyCaAjUYatIFDOGKy00SA,2


Many users have written only a few reviews. I exclude users with fewer than 10 reviews from the item-item similarity recommender.

In [83]:
print(recommender_df.groupby('user_id').count())  # count the records for each user_id

                        business_id  new_cluster
user_id                                         
--2PnhMMH7EYoY3wywOvgQ            1            1
--6kLBBsm0GPM9vIB2YBDw            1            1
--7gjElmOrthETJ8XqzMBw            1            1
--8NUFYnpU_Zu09TgcLevw            1            1
--BumyUHiO_7YsHurb9Hkw           43           43
--C93xIlmjtgQfSOIpcQSA            1            1
--DKDJlRHfsvufdGSk_Sdw            1            1
--NIc98RMssgy0mSZL3vpA            9            9
--Qh8yKWAvIP4V4K8ZPfHA           33           33
--WhK4MJx0fKvg64LqwStg            1            1
--YhjyV-ce1nFLYxP49C5A           34           34
--_i0TDbSrV8HN19XlSEFw            2            2
--b8fKG7GFGSGfl_BzTnPw            1            1
--cj94VBt0CHYM2UfQBglg            1            1
--neBDssyZlHqAWgrHtUBQ            2            2
--t6W1JHbStaCp5RO05thA            1            1
-018WmPPk8qlp3TEiqqMVw            1            1
-03y31IzykunU9azzgLsoQ            1            1
-06T53TLMkg_xGl3flhD

In [6]:
reviews_count_df = recommender_df.groupby('user_id')['new_cluster'].count()
reviews_count_df.head(5)

user_id
--2PnhMMH7EYoY3wywOvgQ     1
--6kLBBsm0GPM9vIB2YBDw     1
--7gjElmOrthETJ8XqzMBw     1
--8NUFYnpU_Zu09TgcLevw     1
--BumyUHiO_7YsHurb9Hkw    43
Name: new_cluster, dtype: int64

In [7]:
print('Max reviews: %s, Min reviews: %s' % (max(reviews_count_df), min(reviews_count_df)))
print('Median reviews: %s, Mean reviews: %s' % (np.median(reviews_count_df), round(np.mean(reviews_count_df),2)))
print('25%% reviews: %d,  75%% reviews: %d' % (np.percentile(reviews_count_df, 25), np.percentile(reviews_count_df, 75)))
print('Number of unique business: %d' % (len(set(recommender_df['business_id']))))

Max reviews: 227, Min reviews: 1
Median reviews: 1.0, Mean reviews: 2.75
25% reviews: 1,  75% reviews: 2
Number of unique business: 13181


In [8]:
# Keep the user_ids with at least 10 reviews.
active_user = list(reviews_count_df[reviews_count_df >= 10].index)
mask = recommender_df['user_id'].isin(active_user)
active_user_df = recommender_df[mask]
active_user_df.head(5)

Unnamed: 0,business_id,user_id,new_cluster
11,--1UhMGODdWsrMastO9DZw,TzU30D-CjtPP3XumggK0Mg,3
12,--1UhMGODdWsrMastO9DZw,ZgAzKwganIXImRAMcvdK_A,1
13,--1UhMGODdWsrMastO9DZw,m-p-7WuB85UjsLDaxJXCXA,1
18,--DaPTJW3-tB1vP-PfdTEg,2HjBjUrqjjVfopPfghgpqw,2
30,--SrzpvFLwP_YFwB_Cetow,6eCgSb66TE1LeiWPPBPnTg,3
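
The active-user filter counts reviews per user and then keeps only the records of users at or above a threshold. The sketch below shows the same idea in plain Python. The toy `records` list and the `min_reviews` name are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Hypothetical (business_id, user_id, rating) records standing in for recommender_df.
records = [
    ("b1", "u1", 3), ("b2", "u1", 1), ("b3", "u1", 2),
    ("b1", "u2", 2),
    ("b2", "u3", 1), ("b3", "u3", 3),
]
min_reviews = 2  # the notebook uses 10

# Count reviews per user, then keep users at or above the threshold.
review_counts = Counter(user for _, user, _ in records)
active_users = {user for user, n in review_counts.items() if n >= min_reviews}

# Set membership is O(1) on average, so the filter is a single pass over the records.
active_records = [row for row in records if row[1] in active_users]
print(sorted(active_users), len(active_records))
```

Testing membership against a set (or using `Series.isin` in pandas) avoids scanning a Python list once per record, which matters with tens of thousands of rows.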


In [96]:
print('The total number of active users in Canada in 2017 and 2018 is %d.' % \
      (len(active_user_df.groupby('user_id')['new_cluster'].count())))

The total number of active users in Canada in 2017 and 2018 is 3190.


In [97]:
print('The total records number for active users in Canada in 2017 and 2018 is %d.' % \
      (len(active_user_df)))

The total records number for active users in Canada in 2017 and 2018 is 74895.


#### 2. Create utility matrix from records

In [11]:
from scipy import sparse
n_active_users = len(set(active_user_df['user_id']))
n_businesses = len(set(active_user_df['business_id']))
ratings_mat = sparse.lil_matrix((n_active_users, n_businesses))
ratings_mat

<3190x10717 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in LInked List format>

Fill the rating matrix from the active-user records.

In [12]:
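user_id = list(set(active_user_df['user_id']))
business_id = list(set(active_user_df['business_id']))
# Dictionary lookups avoid scanning the ID lists with list.index() on every row.
user_index = {u: i for i, u in enumerate(user_id)}
business_index = {b: j for j, b in enumerate(business_id)}
for _, row in active_user_df.iterrows():
    # Store each record's new_cluster value as the rating for its (user, business) cell.
    ratings_mat[user_index[row.user_id], business_index[row.business_id]] = row.new_cluster
ratings_mat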

<3190x10717 sparse matrix of type '<class 'numpy.float64'>'
	with 74894 stored elements in LInked List format>
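
The utility matrix has one row per user and one column per business, with each cell holding that user's rating. A minimal sketch of the construction follows, using a small dense NumPy array in place of the sparse matrix. The toy records and the `user_index` / `business_index` names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical (user_id, business_id, rating) records; the last one repeats a pair.
records = [
    ("u1", "b1", 3), ("u1", "b2", 1),
    ("u2", "b2", 2), ("u3", "b3", 3),
    ("u1", "b1", 2),
]

# Sorting makes the row/column order reproducible across sessions.
users = sorted({u for u, _, _ in records})
businesses = sorted({b for _, b, _ in records})
user_index = {u: i for i, u in enumerate(users)}
business_index = {b: j for j, b in enumerate(businesses)}

# Dense array for the toy example; the notebook uses a sparse lil_matrix instead.
ratings = np.zeros((len(users), len(businesses)))
for u, b, r in records:
    ratings[user_index[u], business_index[b]] = r  # later duplicates overwrite

print(ratings)
print(np.count_nonzero(ratings), "stored ratings from", len(records), "records")
```

A repeated (user, business) pair overwrites the earlier value, which is one possible reason the matrix above stores 74894 entries for 74895 records.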

## Item - Item Collaborative Filter Recommender

In [64]:
from sklearn.metrics.pairwise import cosine_similarity
from time import time
class ItemItemRecommender(object):

    def __init__(self, neighborhood_size):
        self.neighborhood_size = neighborhood_size

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        self.item_sim_mat = cosine_similarity(self.ratings_mat.T)  # similarity between item columns
        self._set_neighborhoods()

    def _set_neighborhoods(self):
        least_to_most_sim_indexes = np.argsort(self.item_sim_mat, 1)
        self.neighborhoods = least_to_most_sim_indexes[:, -self.neighborhood_size:]

    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        # Initialize the array that will hold the predicted ratings.
        out = np.zeros(self.n_items)
        for item_to_rate in range(self.n_items):
            # Neighbors of this item that the user has already rated;
            # assume_unique=True speeds up the intersection.
            relevant_items = np.intersect1d(self.neighborhoods[item_to_rate],
                                            items_rated_by_this_user,
                                            assume_unique=True)
            # Similarity-weighted average of the user's ratings on those neighbors.
            # When no neighbor has been rated the division is 0/0 = NaN,
            # which nan_to_num converts to 0 below.
            out[item_to_rate] = self.ratings_mat[user_id, relevant_items] * \
                self.item_sim_mat[item_to_rate, relevant_items] / \
                self.item_sim_mat[item_to_rate, relevant_items].sum()
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        cleaned_out = np.nan_to_num(out)
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
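
`pred_one_user` predicts each item's rating as a similarity-weighted average over the items in its neighborhood that the user has already rated. The sketch below reproduces that calculation on a made-up 3-user, 4-item matrix. The `predict` function and its default neighborhood size are illustrative, and cosine similarity is computed by hand rather than with scikit-learn.

```python
import numpy as np

# Toy utility matrix (rows: users, columns: items); 0 means "not rated".
ratings = np.array([
    [3.0, 0.0, 2.0, 0.0],
    [1.0, 2.0, 0.0, 3.0],
    [0.0, 3.0, 1.0, 2.0],
])

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

def predict(user, item, neighborhood_size=3):
    """Similarity-weighted average of the user's ratings on the item's neighbors."""
    neighbors = np.argsort(item_sim[item])[-neighborhood_size:]
    rated = np.nonzero(ratings[user])[0]
    relevant = np.intersect1d(neighbors, rated)
    weights = item_sim[item, relevant]
    if relevant.size == 0 or weights.sum() == 0:
        return 0.0  # nothing to average; the notebook gets 0 here via nan_to_num
    return float(ratings[user, relevant] @ weights / weights.sum())

print(predict(user=1, item=2))  # user 1 has not rated item 2
print(predict(user=0, item=1))  # only one rated neighbor, so its rating is returned
```

Because the result is a weighted average, it always lies between the lowest and highest rating the user gave to the relevant neighbors.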

In [65]:
# Fit the recommender with a neighborhood of 80 items.
my_rec_engine = ItemItemRecommender(neighborhood_size=80)
my_rec_engine.fit(ratings_mat)

In [17]:
#my_rec_engine.pred_all_users()



array([[1., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 5.],
       [5., 0., 5., ..., 0., 2., 4.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 4., 4.],
       [0., 0., 0., ..., 0., 3., 0.]])

In [66]:
lucky_user = np.random.choice(active_user_df['user_id'], 1)[0]
lucky_user_index = user_id.index(lucky_user)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
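
Drawing from `active_user_df['user_id']` samples a review record, so a user with many reviews is more likely to be picked than a user with few. If a uniform draw over users is preferred, one option is to sample from the deduplicated ID list instead. The sketch below contrasts the two with toy IDs. The seed and the `user_id_column` name are arbitrary assumptions.

```python
import random
from collections import Counter

# Hypothetical user_id column: one entry per review, so heavy reviewers repeat.
user_id_column = ["u1"] * 8 + ["u2"] * 1 + ["u3"] * 1

rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Sampling records: each user's chance is proportional to their review count.
by_record = Counter(rng.choice(user_id_column) for _ in range(10_000))

# Sampling users: deduplicate first so each user is equally likely.
unique_users = sorted(set(user_id_column))
by_user = Counter(rng.choice(unique_users) for _ in range(10_000))

print("per record:", dict(by_record))
print("per user:  ", dict(by_user))
```

In the notebook, the `user_id` list built for the utility matrix is already deduplicated, so `np.random.choice(user_id)` would give the uniform variant.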



In [67]:
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user -3s52C4zL_DHRK0ULG6qtg are: 
Au Petit Extra, Le Roi du Wonton, Firegrill Steakhouse & Bar, Novita Italian Cuisine, ConforTable, Chez Léo, 1915 Lan Zhou Ramen, Gus, Dumpling House, Wuhan Noodle 1950


In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,bars,brazilian,breakfast,brunch,buffets,burgers,cafes,canadian,food,french,grocery,hawaiian,juice,mexican,new,pizza,portuguese,restaurants,salad,sandwiches,seafood,smoothies,steakhouses,traditional,trucks,vegan,vegetarian


In [69]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
barbeque,bars,breakfast,brunch,burgers,canadian,chinese,cocktail,comfort,dim,flavor,food,french,italian,japanese,local,new,nightlife,noodles,poutineries,ramen,restaurants,salad,seafood,southern,steakhouses,sum,taiwanese


In [70]:
#Check the common labels
common_words = [word for word in recommend_word if word in original_word]
s = ', '.join(common_words)
print("Common labels are: \n%s" % (s))
print(len(common_words))  # counting the list also gives 0 when nothing is shared

Common labels are: 
bars, breakfast, brunch, burgers, canadian, food, french, new, restaurants, salad, seafood, steakhouses
12
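
The word-level vectorizer splits multi-word categories into separate tokens, which is why `dim` and `sum` appear as two labels in the output above. A possible alternative is to compare whole category labels by splitting each category string on commas. The sketch below assumes the `categories` column uses the comma-separated format shown in `df.head()`. The sample strings and the `category_set` helper are made up for illustration.

```python
def category_set(category_strings):
    """Split Yelp-style 'A, B, C' strings into a set of lowercased labels."""
    labels = set()
    for text in category_strings:
        labels.update(part.strip().lower() for part in text.split(",") if part.strip())
    return labels

# Hypothetical category strings in the same format as df['categories'].
rated = ["Restaurants, Mexican", "Steakhouses, Seafood, Restaurants"]
recommended = ["Restaurants, Dim Sum, Chinese", "Seafood, Restaurants, French"]

rated_labels = category_set(rated)
recommended_labels = category_set(recommended)
common = rated_labels & recommended_labels

# Jaccard similarity: shared labels divided by all labels seen on either side.
jaccard = len(common) / len(rated_labels | recommended_labels)
print(sorted(common), round(jaccard, 2))
```

Labels such as `restaurants` and `food` occur on almost every business, so a raw count of common labels can look high for any recommender. A normalized measure such as the Jaccard similarity may make the three approaches easier to compare.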


In [84]:
# Keep the user_ids with at least 100 reviews
active_user2 = list(reviews_count_df[reviews_count_df >= 100].index)
mask2 = recommender_df['user_id'].isin(active_user2)
active_user_df2 = recommender_df[mask2]
active_user2_ls = list(active_user_df2['user_id'].unique())
print(active_user2_ls)

['XbiKsujS_qxU3xsr0xUqmQ', 'CjbfWpCRLbA-L_eS_ztd6Q', 'JrXC_MDp38BWwLn2SFdNsA', 'CxDOIDnH8gp9KXzpBHJYXw', 'U1vl4SQzO3wTAWlYVnSjnw', 'wffnrXJoLppOlvNOZKU70A', 'tU94-C1zpBsfGFvpsJJr2w', 'qKpkRCPk4ycbllTfFcRbNw', 'O3pSxv1SyHpY4qi4Q16KzA', 'Plqi4pG84PA_vBM8OfDPDg', 'Ao-6FYE29-I8WwPg67806A', 'jpIGlAym6z88W2xzHiK5_A', 'PGeiszoVusiv0wTHVdWklA', 'U5YQX_vMl_xQy8EQDqlNQQ', 'XrYTMhY9YJvzX2pMepIz7A', 'orh0HRUNCWuQMt9Iia_osg', 'CQ67NJigSe5-uBDX3b_CUw', 'jwctwzboGhQmtC50Juxa9A', 'TibBhm-fbksozIDFD8wjPQ', 'ic-tyi1jElL_umxZVh8KNA', 'uz5-sq6wHrXScrIWb8r1Mg', 'LB5ViGU59ww2XRCx803t0w', '8Dvr-U6jCZTVGD52LwC2qA', 'eZeBuiVZWT7u3SktO7mv9w', 'Z09rco1enQXNCd9H0u7kvg', 'g-y4Me4bqDz8jwFzX_e17w', 'O3q-nwYZykMmacxjru01Zg', 'Zd3wzNdevk15CwMIJdbjZw', '3aYeG-x5A44GIgmBHrwyAA', 'pn_flI3EBNugBEYFp9okxQ', 'Kj9cFO70zZOQorN0mgeLWA', 'Nq6e5N8bjgD9B46O4va_zA', 'yphnJ8zYbJF7Y3QbtMj91g', 'iRQ_YKpCBdaCwvc2X8_3NQ', 'hFvLG_m26hYMx1UGQSpaEg', 'aX6_Pf3njB-H3FrqgnNJ2g', '0Zswwlz4NzUJoG-skyWzIw', 'vs8aSP9ArwqAlb0LeCnFeQ']


In [85]:
cfr_ls = []
rec_engine2 = ItemItemRecommender(neighborhood_size=80)
rec_engine2.fit(ratings_mat)
for user in active_user2_ls:
    current_lucky_user = user
    current_lucky_user_index = user_id.index(current_lucky_user)
    lucky_user_recommend2, items_rated_by_this_user2 = rec_engine2.top_n_recs(user_id=current_lucky_user_index, n = 10)
    cfr_dic = {}
    cfr_dic['user_id'] = current_lucky_user
    rec_result = ', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in lucky_user_recommend2)
    cfr_dic['recommendation'] = rec_result
    cfr_ls.append(cfr_dic)
    print("The top ten recommendation for user %s are: " % (current_lucky_user))
    print('%s' % rec_result)
    print('\n')
print(cfr_ls)



The top ten recommendation for user XbiKsujS_qxU3xsr0xUqmQ are: 
Frite Alors, Sachi Sushi, Canyon Creek, Hai Tang Cafe, B'saha Restaurant, McEwan, Wow Sushi, Blossom Pure Organic, North Stars Bar & Grill, Kaiju


The top ten recommendation for user CjbfWpCRLbA-L_eS_ztd6Q are: 
Silk Road Restaurant, Aish Tanoor, Cardinal Rule, Bikkuri Japanese Restaurant, Mezzetta Restaurant & Tapas Bar, Lykn Chicken, Doner Kebab House, Sushi Osaka, Ryoji Ramen & Izakaya, Narula's


The top ten recommendation for user JrXC_MDp38BWwLn2SFdNsA are: 
Drupati's Roti & Doubles, McDonald's, Red Soul Japanese Restaurant, Wow Sushi, Jerusalem Shawarma, Freshii, Bâton Rouge Steakhouse & Bar, Mashu Mashu Mediterranean Grill, Basha - Marché Jean Talon, Shanghai Chinese Restaurant


The top ten recommendation for user CxDOIDnH8gp9KXzpBHJYXw are: 
Lili.Co, Da Ke Yi Snacks, Kramer's Bar & Grill, Zakkushi, Potato Noodle Soup of Bai, China Cottage, Soupesoup, Kim Kim Indian Hakka Chinese Restaurant, Beef Noodle Restaura

The top ten recommendation for user hFvLG_m26hYMx1UGQSpaEg are: 
Pizza Hut, Mean Bao, Dine on 3, Martino's Pizza & Asian Fusion Kitchen, Shawarma Box, Las Delicias, The Wilcox Gastropub, Subway, The Captain's Boil, KINTON RAMEN


The top ten recommendation for user aX6_Pf3njB-H3FrqgnNJ2g are: 
Big Smoke Burger, Pizza Nova, Ryus Noodle Bar, Sweet Mahal, Canton Rice Noodle, Dagu Rice Noodle, The Birchcliff, Keating Channel Pub & Grill, Cafe de Causette, Starbucks


The top ten recommendation for user 0Zswwlz4NzUJoG-skyWzIw are: 
Sweet Esc, Pies Plus Cafe, Starbucks Reserve, The Halal Guys, Hemispheres Restaurant & Bistro, Niche Coffee and Tea, N9 Cafe, Jessie's, Canton Kitchen, Foodie North


The top ten recommendation for user vs8aSP9ArwqAlb0LeCnFeQ are: 
La Cubana, Live Organic Food Bar - Liberty Village, Blu restaurante, Presse Café, Pie Squared, Mr. Chestnut, Pita & Grill, Toranj, Astoria Shish Kebob House, Little Sheep Mongolian Hot Pot


[{'user_id': 'XbiKsujS_qxU3xsr0xUqmQ', 'reco

In [86]:
df_cfr = pd.DataFrame(cfr_ls)
df_cfr = df_cfr[['user_id','recommendation']]
df_cfr.head()

Unnamed: 0,user_id,recommendation
0,XbiKsujS_qxU3xsr0xUqmQ,"Frite Alors, Sachi Sushi, Canyon Creek, Hai Ta..."
1,CjbfWpCRLbA-L_eS_ztd6Q,"Silk Road Restaurant, Aish Tanoor, Cardinal Ru..."
2,JrXC_MDp38BWwLn2SFdNsA,"Drupati's Roti & Doubles, McDonald's, Red Soul..."
3,CxDOIDnH8gp9KXzpBHJYXw,"Lili.Co, Da Ke Yi Snacks, Kramer's Bar & Grill..."
4,U1vl4SQzO3wTAWlYVnSjnw,"La Bella Italiana, Bastix Souvenirs, TrueTrue,..."


In [87]:
df_cfr.to_csv('New_Item-Item_CFR_Example.csv')

In [88]:
df_cfr=pd.read_csv('New_Item-Item_CFR_Example.csv')

In [89]:
df_cfr.head()

Unnamed: 0.1,Unnamed: 0,user_id,recommendation
0,0,XbiKsujS_qxU3xsr0xUqmQ,"Frite Alors, Sachi Sushi, Canyon Creek, Hai Ta..."
1,1,CjbfWpCRLbA-L_eS_ztd6Q,"Silk Road Restaurant, Aish Tanoor, Cardinal Ru..."
2,2,JrXC_MDp38BWwLn2SFdNsA,"Drupati's Roti & Doubles, McDonald's, Red Soul..."
3,3,CxDOIDnH8gp9KXzpBHJYXw,"Lili.Co, Da Ke Yi Snacks, Kramer's Bar & Grill..."
4,4,U1vl4SQzO3wTAWlYVnSjnw,"La Bella Italiana, Bastix Souvenirs, TrueTrue,..."


## Matrix Factorization recommender (NMF)

In [71]:
from sklearn.decomposition import NMF
class NMF_Recommender(object):

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        nmf = NMF(n_components=self.n_components)  # use the value passed to __init__
        nmf.fit(ratings_mat)
        self.W = nmf.transform(ratings_mat)
        self.H = nmf.components_
        self.error = nmf.reconstruction_err_
        self.ratings_mat_fitted = self.W.dot(self.H)

    def get_error(self):
        return self.error
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
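
The `top_n_recs` method is identical in all three recommender classes: sort the predicted ratings, drop the items the user has already rated, and keep the last `n`. A standalone sketch of that step follows. The `top_n_unrated` name is hypothetical, and unlike the notebook's method, which returns the best item last, this version returns the best item first.

```python
import numpy as np

def top_n_unrated(pred_ratings, rated_items, n):
    """Return indices of the n highest-scored items the user has not rated, best first."""
    scores = np.asarray(pred_ratings, dtype=float).copy()
    scores[np.asarray(rated_items, dtype=int)] = -np.inf  # never recommend rated items
    n = min(n, int(np.isfinite(scores).sum()))
    if n == 0:
        return np.array([], dtype=int)
    top = np.argpartition(scores, -n)[-n:]       # unordered top n
    return top[np.argsort(scores[top])[::-1]]    # order them best first

pred = np.array([0.2, 4.5, 3.1, 0.0, 2.7, 4.9])
already_rated = np.array([5])  # item 5 has the top score but was already rated
print(top_n_unrated(pred, already_rated, n=3))
```

Masking rated items with `-inf` replaces the `item not in items_rated_by_this_user` test, which scans the rated-items array once for every item.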

In [72]:
# get recommendations for the same lucky user
my_rec_engine = NMF_Recommender(n_components=200)
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)  # items_rated_by_this_user: the user's originally rated items
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user -3s52C4zL_DHRK0ULG6qtg are: 
Les Street Monkeys, Hooters, Sushi Crystal, Mad Hatter Pub, Chand Palace, Restaurant LOV, Bombay Mahal, Kampaï Garden, Kumamoto, Le Poké Bar
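
NMF approximates the utility matrix as a product of two non-negative matrices, `W` (users by components) and `H` (components by items), and the predicted ratings are read from `W.dot(H)`. The sketch below illustrates the factorization with Lee-Seung multiplicative updates in NumPy. This is a swapped-in technique chosen for brevity (scikit-learn's `NMF` uses a coordinate-descent solver by default), and the toy matrix, function name, and parameters are assumptions.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V into W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        # Each update rescales the factors and keeps them non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy utility matrix (rows: users, columns: items); 0 means "not rated".
V = np.array([
    [3.0, 0.0, 2.0, 0.0],
    [1.0, 2.0, 0.0, 3.0],
    [0.0, 3.0, 1.0, 2.0],
    [3.0, 0.0, 3.0, 0.0],
])
W, H = nmf_multiplicative(V, n_components=2)
reconstruction = W @ H
print(np.round(reconstruction, 2))
print("Frobenius error:", np.linalg.norm(V - reconstruction))
```

Note that unrated cells enter the factorization as zeros, so the model is fitted to them as if they were observed ratings. The same applies to the notebook's use of `NMF` on `ratings_mat`.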


In [73]:
print("The user's originally rated restaurants are:\n %s" % (','.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in items_rated_by_this_user)))

The user's originally rated restaurants are:
 M4 Burritos Peel,La Castile Steak House & Seafood Restaurant,Café Vasco Da Gama,Venice,Burger Bar Crescent,Luna Cafe,Chop'd & Co,Pizzéria N° 900 Peel,Robin des Bois,Cantinho de Lisboa,Rodízio Brasil


In [74]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,bars,brazilian,breakfast,brunch,buffets,burgers,cafes,canadian,food,french,grocery,hawaiian,juice,mexican,new,pizza,portuguese,restaurants,salad,sandwiches,seafood,smoothies,steakhouses,traditional,trucks,vegan,vegetarian


In [75]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,bar,barbeque,bars,beer,buffets,cambodian,chicken,food,hawaiian,indian,japanese,live,lounges,nightlife,noodles,poke,pubs,ramen,raw,restaurants,seafood,sports,sushi,traditional,vegan,vegetarian,wings


In [76]:
#Check the common labels
common_words = [word for word in recommend_word if word in original_word]
s = ', '.join(common_words)
print("Common labels are: \n%s" % (s))
print(len(common_words))  # counting the list also gives 0 when nothing is shared

Common labels are: 
american, bars, buffets, food, hawaiian, restaurants, seafood, traditional, vegan, vegetarian
10


In [90]:
nmf_ls = []
rec_engine_nmf = NMF_Recommender(n_components=200)
rec_engine_nmf.fit(ratings_mat)
for user in active_user2_ls:
    current_lucky_user = user
    current_lucky_user_index = user_id.index(current_lucky_user)
    lucky_user_recommend_nmf, items_rated_by_this_user_nmf = rec_engine_nmf.top_n_recs(user_id=current_lucky_user_index, n = 10)
    nmf_dic = {}
    nmf_dic['user_id'] = current_lucky_user
    rec_result = ', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in lucky_user_recommend_nmf)
    nmf_dic['recommendation'] = rec_result
    nmf_ls.append(nmf_dic)
print(nmf_ls)

[{'user_id': 'XbiKsujS_qxU3xsr0xUqmQ', 'recommendation': "The Poke Box, James Cheese Back Ribs, Mi'hito Sushi Laboratory, Home Of The Brave, Bar Verde, Hawthorne Food and Drink, Piano Piano, Provo FoodBar, Jack Astor's Bar & Grill, Portland Variety"}, {'user_id': 'CjbfWpCRLbA-L_eS_ztd6Q', 'recommendation': 'Eggette Hut, Ten23, August 8, Sang-Ho Seafood Restaurant, Xe Lua Restaurant, Sansotei, Tasty BBQ Seafood Restaurant, Azyun Restaurant, Golden Horse Restaurant, House of Gourmet'}, {'user_id': 'JrXC_MDp38BWwLn2SFdNsA', 'recommendation': 'Maison Du Japon, Daldongnae, Spicy Mafia, Fabbrica, New City Restaurant, Chris Jerk Caribbean Bistro, Wonton Chai Noodle, The Oxley, KINTON RAMEN, Good Taste Casserole Rice'}, {'user_id': 'CxDOIDnH8gp9KXzpBHJYXw', 'recommendation': 'Miss Lin Cafe, Amano, Spicy Mafia, Cafe Alice, Mon Ami - Korean BBQ, Hanabusa Cafe, Wow Sushi, Potman Hotpot, Restaurant Ethan, Good Taste Casserole Rice'}, {'user_id': 'U1vl4SQzO3wTAWlYVnSjnw', 'recommendation': 'Restaur

In [91]:
df_nmf = pd.DataFrame(nmf_ls)
df_nmf = df_nmf[['user_id','recommendation']]
df_nmf.head()

Unnamed: 0,user_id,recommendation
0,XbiKsujS_qxU3xsr0xUqmQ,"The Poke Box, James Cheese Back Ribs, Mi'hito ..."
1,CjbfWpCRLbA-L_eS_ztd6Q,"Eggette Hut, Ten23, August 8, Sang-Ho Seafood ..."
2,JrXC_MDp38BWwLn2SFdNsA,"Maison Du Japon, Daldongnae, Spicy Mafia, Fabb..."
3,CxDOIDnH8gp9KXzpBHJYXw,"Miss Lin Cafe, Amano, Spicy Mafia, Cafe Alice,..."
4,U1vl4SQzO3wTAWlYVnSjnw,"Restaurant Beijing, Mon Shing Restaurant, Un B..."


In [92]:
df_nmf.to_csv('New_NMF_Recommendation_Example.csv')
df_nmf=pd.read_csv('New_NMF_Recommendation_Example.csv')
df_nmf.head()

Unnamed: 0.1,Unnamed: 0,user_id,recommendation
0,0,XbiKsujS_qxU3xsr0xUqmQ,"The Poke Box, James Cheese Back Ribs, Mi'hito ..."
1,1,CjbfWpCRLbA-L_eS_ztd6Q,"Eggette Hut, Ten23, August 8, Sang-Ho Seafood ..."
2,2,JrXC_MDp38BWwLn2SFdNsA,"Maison Du Japon, Daldongnae, Spicy Mafia, Fabb..."
3,3,CxDOIDnH8gp9KXzpBHJYXw,"Miss Lin Cafe, Amano, Spicy Mafia, Cafe Alice,..."
4,4,U1vl4SQzO3wTAWlYVnSjnw,"Restaurant Beijing, Mon Shing Restaurant, Un B..."


## Matrix Factorization recommender (SVD) with restaurants' labels.

In [77]:
#get the number of labels 
mask = [business in business_id for business in df['business_id']]
category = df['categories'][mask]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
category_vec = vectorizer.fit_transform(category).toarray()
words = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
#This is the number of unique categories
print('The total number of restaurant labels is %d' % (len(words))) 

The total number of restaurant labels is 441


In [78]:
from sklearn.decomposition import TruncatedSVD
class SVD_Recommender(object):

    def __init__(self):
        self.n_components = 361  # number of latent components, meant to match the number of labels (the cell above reports 441)

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        svd = TruncatedSVD(n_components=self.n_components, n_iter=7, random_state=1)
        svd.fit(ratings_mat)
        self.V = svd.components_
        self.U = svd.transform(ratings_mat)
        self.ratings_mat_fitted = self.U.dot(self.V)

    def get_error(self):
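        # Elementwise mean squared error. Densify first: subtracting a sparse matrix
        # returns np.matrix, for which ** means matrix power.
        diff = self.ratings_mat_fitted - self.ratings_mat.toarray()
        return (diff ** 2).mean()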
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [79]:
# get recommendations for the same lucky user
my_rec_engine = SVD_Recommender()
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user -3s52C4zL_DHRK0ULG6qtg are: 
Marathon Souvlaki, Arthurs Nosh Bar, Maison Christian Faure, Escondite, Restaurant LOV, Brigade Pizzeria Napolitaine, Seasoned Dreams, The Bier Markt, Pamika Brasserie, Mad Hatter Pub
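
`SVD_Recommender` scores items with a low-rank reconstruction of the utility matrix: `svd.transform` returns the user factors and `svd.components_` the item factors. The sketch below shows the same reconstruction with NumPy's exact SVD on a toy matrix. The matrix, the choice `k = 2`, and the variable names are assumptions, and `TruncatedSVD` itself uses a randomized solver by default rather than this exact decomposition.

```python
import numpy as np

# Toy utility matrix (rows: users, columns: items); 0 means "not rated".
ratings = np.array([
    [3.0, 0.0, 2.0, 0.0],
    [1.0, 2.0, 0.0, 3.0],
    [0.0, 3.0, 1.0, 2.0],
    [3.0, 0.0, 3.0, 0.0],
])

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]   # plays the role of svd.transform(ratings_mat)
item_factors = Vt[:k]             # plays the role of svd.components_
approx = user_factors @ item_factors

# The Frobenius error of the rank-k approximation equals the
# norm of the discarded singular values.
error = np.linalg.norm(ratings - approx)
print(np.round(approx, 2))
print(error, np.sqrt(np.sum(s[k:] ** 2)))
```

The more components are kept, the closer the reconstruction gets to the original matrix, including its zeros for unrated items, so a large `n_components` leaves less room for the model to score unrated restaurants highly.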


In [80]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,bars,brazilian,breakfast,brunch,buffets,burgers,cafes,canadian,food,french,grocery,hawaiian,juice,mexican,new,pizza,portuguese,restaurants,salad,sandwiches,seafood,smoothies,steakhouses,traditional,trucks,vegan,vegetarian


In [81]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = sorted(vectorizer.vocabulary_)  # feature names; get_feature_names() was removed in scikit-learn 1.2
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,arts,bakeries,bars,beer,belgian,breakfast,brunch,cafes,cake,canadian,caribbean,caterers,coffee,desserts,entertainment,european,event,food,french,gastropubs,german,greek,mediterranean,mexican,modern,music,new,nightlife,patisserie,pizza,planning,pubs,restaurants,salad,sandwiches,services,shop,spaces,spirits,sports,tea,thai,traditional,vegan,vegetarian,venues,wine


In [82]:
#Check the common labels
common_words = [word for word in recommend_word if word in original_word]
s = ', '.join(common_words)
print("Common labels are: \n%s" % (s))
print(len(common_words))  # counting the list also gives 0 when nothing is shared

Common labels are: 
american, bars, breakfast, brunch, cafes, canadian, food, french, mexican, new, pizza, restaurants, salad, sandwiches, traditional, vegan, vegetarian
17


For comparison with the 17 common labels from SVD above, the other two approaches gave the following for the same user.

#### Common labels from NMF are:
- american, bars, buffets, food, hawaiian, restaurants, seafood, traditional, vegan, vegetarian
- 10

#### Common labels from Item - Item Collaborative Filter are:
- bars, breakfast, brunch, burgers, canadian, food, french, new, restaurants, salad, seafood, steakhouses
- 12

In [93]:
svd_ls = []
rec_engine_svd = SVD_Recommender()
rec_engine_svd.fit(ratings_mat)
for user in active_user2_ls:
    current_lucky_user = user
    current_lucky_user_index = user_id.index(current_lucky_user)
    lucky_user_recommend_svd, items_rated_by_this_user_svd = rec_engine_svd.top_n_recs(user_id=current_lucky_user_index, n = 10)
    svd_dic = {}
    svd_dic['user_id'] = current_lucky_user
    rec_result = ', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in lucky_user_recommend_svd)
    svd_dic['recommendation'] = rec_result
    svd_ls.append(svd_dic)
print(svd_ls)

[{'user_id': 'XbiKsujS_qxU3xsr0xUqmQ', 'recommendation': "Saffron Spice Kitchen, Jack Astor's Bar & Grill, Cups Bingsu Café, Omescape - Markham, The Shore Club - Toronto, Le Viet Asian Cuisine, Thai Room Liberty Village, Barque Smokehouse, Sukho Thai, Bombay Street Food"}, {'user_id': 'CjbfWpCRLbA-L_eS_ztd6Q', 'recommendation': "Raijin Ramen, Lhasa Kitchen, Sky Dragon Chinese Restaurant, Coquine Restaurant, Hakka Wow, Bake Code  - North York, Isabella's Boutique Restaurant, Eddie's Wok N Roll, Ajisen Ramen, J-Town By The Sea"}, {'user_id': 'JrXC_MDp38BWwLn2SFdNsA', 'recommendation': "New Northern Dumplings, Pho'Q, Osaka Sushi Japanese Korean Restaurant, Main St Greek, The Oxley, Samosa King - Embassy Restaurant, KINTON RAMEN, Kid Lee, GuangZhou Tai Ping Sha Chinese Restaurant, Happy Congee"}, {'user_id': 'CxDOIDnH8gp9KXzpBHJYXw', 'recommendation': "Congee Queen, Ga Bin Korean Restaurant, Sukho Thai, Albert's Real Jamaican Foods, Seor Ak San, Phayathai, GuangZhou Tai Ping Sha Chinese Re

In [94]:
df_svd = pd.DataFrame(svd_ls)
df_svd = df_svd[['user_id','recommendation']]
df_svd.head()

Unnamed: 0,user_id,recommendation
0,XbiKsujS_qxU3xsr0xUqmQ,"Saffron Spice Kitchen, Jack Astor's Bar & Gril..."
1,CjbfWpCRLbA-L_eS_ztd6Q,"Raijin Ramen, Lhasa Kitchen, Sky Dragon Chines..."
2,JrXC_MDp38BWwLn2SFdNsA,"New Northern Dumplings, Pho'Q, Osaka Sushi Jap..."
3,CxDOIDnH8gp9KXzpBHJYXw,"Congee Queen, Ga Bin Korean Restaurant, Sukho ..."
4,U1vl4SQzO3wTAWlYVnSjnw,"Café Pacefika, Tsukuyomi, Pizzéria N° 900 Peel..."


In [95]:
df_svd.to_csv('New_SVD_Recommendation_Example.csv')
df_svd=pd.read_csv('New_SVD_Recommendation_Example.csv')
df_svd.head()

Unnamed: 0.1,Unnamed: 0,user_id,recommendation
0,0,XbiKsujS_qxU3xsr0xUqmQ,"Saffron Spice Kitchen, Jack Astor's Bar & Gril..."
1,1,CjbfWpCRLbA-L_eS_ztd6Q,"Raijin Ramen, Lhasa Kitchen, Sky Dragon Chines..."
2,2,JrXC_MDp38BWwLn2SFdNsA,"New Northern Dumplings, Pho'Q, Osaka Sushi Jap..."
3,3,CxDOIDnH8gp9KXzpBHJYXw,"Congee Queen, Ga Bin Korean Restaurant, Sukho ..."
4,4,U1vl4SQzO3wTAWlYVnSjnw,"Café Pacefika, Tsukuyomi, Pizzéria N° 900 Peel..."
