# Build the Restaurant Recommender
Summary:
- The existing star ratings can be used directly to build a recommendation system.
- Here I build three recommenders: an item-item collaborative filtering recommender, an NMF recommender, and an SVD recommender.
- I compare their performance by checking the category labels shared between the recommended restaurants and the restaurants the user has already visited.
- The item-item collaborative filtering recommender and the SVD recommender perform better, and the SVD recommender computes faster. Therefore, the SVD recommender wins here.
- The reasons can be summarized as:
    - Collaborative filtering needs to calculate pairwise distances, so it is slower than the matrix factorization models.
    - The SVD recommender uses the restaurant labels to settle the choice of the number of latent factors. Therefore, it outperforms the NMF recommender.


In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('2017_restaurant_reviews.csv')

In [3]:
df.head(3)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew
5,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-10,0,9CLEOpUCqRkIR02sx-JsMQ,5,A little hole in the wall for some really deli...,0,atyCaAjUYatIFDOGKy00SA


## Clean data and get rating data 

#### 1. Select relevant columns in the original dataframe

In [4]:
recommender_df = df[['business_id', 'user_id', 'stars']]
recommender_df.head(3)

Unnamed: 0,business_id,user_id,stars
2,--1UhMGODdWsrMastO9DZw,Sew1Nht6Q0sGTIZeNvRfLw,5
3,--1UhMGODdWsrMastO9DZw,NoQCmYKyMPs4D01Wa6dZew,4
5,--1UhMGODdWsrMastO9DZw,atyCaAjUYatIFDOGKy00SA,5


Many users have written only a few reviews, so I will exclude them from the item-item similarity recommender.

In [5]:
print(recommender_df.groupby('user_id').count())

                        business_id  stars
user_id                                   
--2PnhMMH7EYoY3wywOvgQ            1      1
--6kLBBsm0GPM9vIB2YBDw            1      1
--7gjElmOrthETJ8XqzMBw            1      1
--8NUFYnpU_Zu09TgcLevw            1      1
--BumyUHiO_7YsHurb9Hkw           43     43
--C93xIlmjtgQfSOIpcQSA            1      1
--DKDJlRHfsvufdGSk_Sdw            1      1
--NIc98RMssgy0mSZL3vpA            9      9
--Qh8yKWAvIP4V4K8ZPfHA           33     33
--WhK4MJx0fKvg64LqwStg            1      1
--YhjyV-ce1nFLYxP49C5A           34     34
--_i0TDbSrV8HN19XlSEFw            2      2
--b8fKG7GFGSGfl_BzTnPw            1      1
--cj94VBt0CHYM2UfQBglg            1      1
--neBDssyZlHqAWgrHtUBQ            2      2
--t6W1JHbStaCp5RO05thA            1      1
-018WmPPk8qlp3TEiqqMVw            1      1
-03y31IzykunU9azzgLsoQ            1      1
-06T53TLMkg_xGl3flhDNw            1      1
-0OE9Pn8vSK-WjJeRtHDtw            1      1
-0OSHsV_0VZz4E09FLgQtQ            1      1
-0Z6b2zZhdX

In [6]:
reviews_count_df = recommender_df.groupby('user_id')['stars'].count()
reviews_count_df.head(5)

user_id
--2PnhMMH7EYoY3wywOvgQ     1
--6kLBBsm0GPM9vIB2YBDw     1
--7gjElmOrthETJ8XqzMBw     1
--8NUFYnpU_Zu09TgcLevw     1
--BumyUHiO_7YsHurb9Hkw    43
Name: stars, dtype: int64

In [7]:
print('Max reviews: %s, Min reviews: %s' % (max(reviews_count_df), min(reviews_count_df)))
print('Median reviews: %s, Mean reviews: %s' % (np.median(reviews_count_df), round(np.mean(reviews_count_df),2)))
print('25%% reviews: %d,  75%% reviews: %d' % (np.percentile(reviews_count_df, 25), np.percentile(reviews_count_df, 75)))
print('Number of unique business: %d' % (len(set(recommender_df['business_id']))))

Max reviews: 227, Min reviews: 1
Median reviews: 1.0, Mean reviews: 2.75
25% reviews: 1,  75% reviews: 2
Number of unique business: 13181


In [8]:
active_user = list(reviews_count_df[reviews_count_df >= 10].index)  # keep user_ids with at least 10 reviews
mask = [user in active_user for user in recommender_df['user_id']]
active_user_df = recommender_df[mask]
active_user_df.head(5)

Unnamed: 0,business_id,user_id,stars
21,--1UhMGODdWsrMastO9DZw,TzU30D-CjtPP3XumggK0Mg,4
22,--1UhMGODdWsrMastO9DZw,ZgAzKwganIXImRAMcvdK_A,4
23,--1UhMGODdWsrMastO9DZw,m-p-7WuB85UjsLDaxJXCXA,5
41,--DaPTJW3-tB1vP-PfdTEg,2HjBjUrqjjVfopPfghgpqw,3
85,--SrzpvFLwP_YFwB_Cetow,6eCgSb66TE1LeiWPPBPnTg,4
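
The list-membership mask above scans `active_user` once per row. As a hedged alternative, the same filter can be written with `Series.isin`. The sketch below runs on a made-up toy frame (`toy_df`) with a hypothetical threshold `min_reviews`, not on the real review data.

```python
import pandas as pd

# Toy stand-in for recommender_df; the ids and ratings are made up.
toy_df = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b1", "b2", "b3"],
    "user_id":     ["u1", "u1", "u1", "u2", "u3", "u3"],
    "stars":       [5, 4, 3, 2, 5, 1],
})

min_reviews = 2  # hypothetical threshold; the notebook uses 10

# Count reviews per user, then keep only rows whose user meets the threshold.
counts = toy_df.groupby("user_id")["stars"].count()
active_users = counts[counts >= min_reviews].index
active_toy_df = toy_df[toy_df["user_id"].isin(active_users)]

print(sorted(active_toy_df["user_id"].unique()))
```

`isin` builds a hash-based lookup internally, so the filter should scale roughly with the number of rows rather than rows times active users.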


In [32]:
print('The total number of active users in Canada in 2017 and 2018 is %d.' % \
      (len(active_user_df.groupby('user_id')['stars'].count())))

The total number of active users in Canada in 2017 and 2018 is 3190.


In [33]:
print('The total number of records for active users in Canada in 2017 and 2018 is %d.' % \
      (len(active_user_df)))

The total number of records for active users in Canada in 2017 and 2018 is 74895.


#### 2. Create utility matrix from records

In [11]:
from scipy import sparse
n_users = len(set(active_user_df['user_id']))            # number of distinct active users
n_businesses = len(set(active_user_df['business_id']))   # number of distinct businesses they rated
ratings_mat = sparse.lil_matrix((n_users, n_businesses))
ratings_mat

<3190x10717 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in LInked List format>

Fill the ratings matrix from the review records.

In [12]:
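user_id = list(set(active_user_df['user_id']))
business_id = list(set(active_user_df['business_id']))
# Precompute id -> position maps so each lookup is O(1) instead of a list scan
user_pos = {u: i for i, u in enumerate(user_id)}
business_pos = {b: j for j, b in enumerate(business_id)}
for _, row in active_user_df.iterrows():
    # fill the stars in ratings_mat
    ratings_mat[user_pos[row.user_id], business_pos[row.business_id]] = row.stars
ratings_mat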

<3190x10717 sparse matrix of type '<class 'numpy.float64'>'
	with 74894 stored elements in LInked List format>
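
As a rough sketch of another way to build the same kind of utility matrix, the ids can be mapped to integer codes with `pd.factorize` and passed to `scipy.sparse.coo_matrix` in one call. The frame `toy_df` is made up for illustration. One difference to keep in mind: converting COO to CSR sums duplicate (user, business) pairs, whereas the loop above keeps the last rating written.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for active_user_df; ids and ratings are made up.
toy_df = pd.DataFrame({
    "business_id": ["b1", "b2", "b1", "b3"],
    "user_id":     ["u1", "u1", "u2", "u2"],
    "stars":       [5, 4, 3, 2],
})

# factorize maps each distinct id to a consecutive integer code,
# in order of first appearance.
user_codes, user_labels = pd.factorize(toy_df["user_id"])
biz_codes, biz_labels = pd.factorize(toy_df["business_id"])

# Build the users x businesses matrix from (value, (row, col)) triples.
toy_mat = sparse.coo_matrix(
    (toy_df["stars"].to_numpy(dtype=float), (user_codes, biz_codes)),
    shape=(len(user_labels), len(biz_labels)),
).tocsr()

print(toy_mat.toarray())
```

`user_labels` and `biz_labels` play the role of the `user_id` and `business_id` lists: position `i` in each gives the original id for row or column `i`.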

## Item - Item Collaborative Filter Recommender

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
from time import time
class ItemItemRecommender(object):

    def __init__(self, neighborhood_size):
        self.neighborhood_size = neighborhood_size

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        self.item_sim_mat = cosine_similarity(self.ratings_mat.T)
        self._set_neighborhoods()

    def _set_neighborhoods(self):
        least_to_most_sim_indexes = np.argsort(self.item_sim_mat, 1)
        self.neighborhoods = least_to_most_sim_indexes[:, -self.neighborhood_size:]

    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        # Just initializing so I have somewhere to put rating preds
        out = np.zeros(self.n_items)
        for item_to_rate in range(self.n_items):
            relevant_items = np.intersect1d(self.neighborhoods[item_to_rate],
                                            items_rated_by_this_user,
                                            assume_unique=True)  # assume_unique speeds up intersection op
            out[item_to_rate] = self.ratings_mat[user_id, relevant_items] * \
                self.item_sim_mat[item_to_rate, relevant_items] / \
                self.item_sim_mat[item_to_rate, relevant_items].sum()
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        cleaned_out = np.nan_to_num(out)
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [14]:
# neighborhood_size: number of most similar items kept for each item
my_rec_engine = ItemItemRecommender(neighborhood_size=80)
my_rec_engine.fit(ratings_mat)
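
To make the prediction rule in `pred_one_user` concrete, here is a minimal NumPy sketch on a made-up 3x3 ratings array. It is a simplification under stated assumptions: it uses every item the user has rated as the neighborhood instead of restricting to `neighborhood_size`, and the names `toy_ratings` and `predict` are hypothetical.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated". Values are made up.
toy_ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
])

# Cosine similarity between item columns.
norms = np.linalg.norm(toy_ratings, axis=0)
item_sim = (toy_ratings.T @ toy_ratings) / np.outer(norms, norms)

def predict(user, item):
    """Similarity-weighted average of the ratings the user has already given."""
    rated = np.nonzero(toy_ratings[user])[0]
    weights = item_sim[item, rated]
    return float(weights @ toy_ratings[user, rated] / weights.sum())

pred = predict(user=0, item=2)
print(round(pred, 2))
```

Because the prediction is a weighted average with non-negative weights, it always falls between the lowest and highest rating the user has given to the neighboring items.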

Let me try the recommender system with a lucky user.

In [15]:
lucky_user = np.random.choice(active_user_df['user_id'], 1)[0]
lucky_user_index = user_id.index(lucky_user)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)



In [16]:
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user dmN6SfMI-pQyT-ouOmEkjQ are: 
East Africa Restaurant, Knox, Diablos, Gia Ba, Blu restaurante, Jellyfish Crudo + Charbon, Sabor Latino, Restaurant My Canh, Panda Thai, Gyu-Kaku Japanese BBQ


##### Output recommendations for example users

In [49]:
# Keep user_ids with at least 100 reviews
active_user2 = list(reviews_count_df[reviews_count_df >= 100].index)  
mask2 = [user in active_user2 for user in recommender_df['user_id']]
active_user_df2 = recommender_df[mask2]
active_user2_ls = list(active_user_df2['user_id'].unique())
print(active_user2_ls)

['XbiKsujS_qxU3xsr0xUqmQ', 'CjbfWpCRLbA-L_eS_ztd6Q', 'JrXC_MDp38BWwLn2SFdNsA', 'CxDOIDnH8gp9KXzpBHJYXw', 'U1vl4SQzO3wTAWlYVnSjnw', 'wffnrXJoLppOlvNOZKU70A', 'tU94-C1zpBsfGFvpsJJr2w', 'qKpkRCPk4ycbllTfFcRbNw', 'O3pSxv1SyHpY4qi4Q16KzA', 'Plqi4pG84PA_vBM8OfDPDg', 'Ao-6FYE29-I8WwPg67806A', 'jpIGlAym6z88W2xzHiK5_A', 'PGeiszoVusiv0wTHVdWklA', 'U5YQX_vMl_xQy8EQDqlNQQ', 'XrYTMhY9YJvzX2pMepIz7A', 'orh0HRUNCWuQMt9Iia_osg', 'CQ67NJigSe5-uBDX3b_CUw', 'jwctwzboGhQmtC50Juxa9A', 'TibBhm-fbksozIDFD8wjPQ', 'ic-tyi1jElL_umxZVh8KNA', 'uz5-sq6wHrXScrIWb8r1Mg', 'LB5ViGU59ww2XRCx803t0w', '8Dvr-U6jCZTVGD52LwC2qA', 'eZeBuiVZWT7u3SktO7mv9w', 'Z09rco1enQXNCd9H0u7kvg', 'g-y4Me4bqDz8jwFzX_e17w', 'O3q-nwYZykMmacxjru01Zg', 'Zd3wzNdevk15CwMIJdbjZw', '3aYeG-x5A44GIgmBHrwyAA', 'pn_flI3EBNugBEYFp9okxQ', 'Kj9cFO70zZOQorN0mgeLWA', 'Nq6e5N8bjgD9B46O4va_zA', 'yphnJ8zYbJF7Y3QbtMj91g', 'iRQ_YKpCBdaCwvc2X8_3NQ', 'hFvLG_m26hYMx1UGQSpaEg', 'aX6_Pf3njB-H3FrqgnNJ2g', '0Zswwlz4NzUJoG-skyWzIw', 'vs8aSP9ArwqAlb0LeCnFeQ']


In [51]:
# lucky_user2 = np.random.choice(active_user_df2['user_id'], 1)[0]
# lucky_user2
cfr_ls = []
rec_engine2 = ItemItemRecommender(neighborhood_size=80)
rec_engine2.fit(ratings_mat)
for user in active_user2_ls:
    current_lucky_user = user
    current_lucky_user_index = user_id.index(current_lucky_user)
    lucky_user_recommend2, items_rated_by_this_user2 = rec_engine2.top_n_recs(user_id=current_lucky_user_index, n = 10)
    cfr_dic = {}
    cfr_dic['user_id'] = current_lucky_user
    rec_result = ', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in lucky_user_recommend2)
    cfr_dic['recommendation'] = rec_result
    cfr_ls.append(cfr_dic)
    print("The top ten recommendation for user %s are: " % (current_lucky_user))
    print('%s' % rec_result)
    print('\n')
print(cfr_ls)



The top ten recommendation for user XbiKsujS_qxU3xsr0xUqmQ are: 
Hot Spot Chinese Cuisine, Restaurant Chase, Five Spice Kitchen, Cibo Wine Bar, Ghazale Restaurant, Crack Me Up, The Empanada Company, Field Trip Cafe, ChuChai, Annabelle Pasta Bar


The top ten recommendation for user CjbfWpCRLbA-L_eS_ztd6Q are: 
Moti Mahal , Kawa Sushi, Zucca Trattoria, Grandeur Palace, OMG Restaurant & Lounge, Mira Mira, Pho Dau Bo, Curry Love, Home Of The Brave, Maluca Wine Bar & Restaurant


The top ten recommendation for user JrXC_MDp38BWwLn2SFdNsA are: 
Bolan Thai Cuisine, Saigon Bangkok Restaurant, Sorrel, Porchetta & Co, Paradis BBQ, John's Fish & Chips, Dbarkadz Filipino Cuisine, Diana's Oyster Bar and Grill, Crepe TO, Eggspectation


The top ten recommendation for user CxDOIDnH8gp9KXzpBHJYXw are: 
Laloux, Negroni, Pray Tell, The Monkey Bar, Kitchen Galerie, The Lunch Box, Positano Restaurant, Martino's Pizza & Asian Fusion Kitchen, Bus Terminal Diner, Cuchulainn's Irish Pub


The top ten recomme

The top ten recommendation for user hFvLG_m26hYMx1UGQSpaEg are: 
Sushi Zone, Comptoir 21, Feast of Dilli, Rougamo & Noodles, Jack Astor's Bar & Grill, McDonald's, The Birchcliff, The Red Room, Kalendar, AGO Bistro


The top ten recommendation for user aX6_Pf3njB-H3FrqgnNJ2g are: 
Hey Lucy, Outrigger, Rikkochez, Gio Rana's Really Really Nice Restaurant, Queen Sheba, Taste of Jaffna, Tender Trap Restaurant, Le Petit Bistro, Carisma, Haroo


The top ten recommendation for user 0Zswwlz4NzUJoG-skyWzIw are: 
River Tai, Zen Japanese Restaurant, Panda Express, Fat Pasha, Wuhan Noodle 1950, Fast Fresh Foods, Super Taste Noodle House, Jimmy's Coffee, Poke Eats, GAB


The top ten recommendation for user vs8aSP9ArwqAlb0LeCnFeQ are: 
Chicken Squared, Ami Sushi, Lemongrass Restaurant, Paintlounge, Pizza Land, Made In China Hot Pot, Koji Japanese Restaurant, Snakes & Lattes Midtown, Something Sweet 4 U, D-Spot Dessert Cafe


[{'user_id': 'XbiKsujS_qxU3xsr0xUqmQ', 'recommendation': 'Hot Spot Chinese C

In [53]:
df_cfr = pd.DataFrame(cfr_ls)
df_cfr = df_cfr[['user_id','recommendation']]
df_cfr.head()

Unnamed: 0,user_id,recommendation
0,XbiKsujS_qxU3xsr0xUqmQ,"Hot Spot Chinese Cuisine, Restaurant Chase, Fi..."
1,CjbfWpCRLbA-L_eS_ztd6Q,"Moti Mahal , Kawa Sushi, Zucca Trattoria, Gran..."
2,JrXC_MDp38BWwLn2SFdNsA,"Bolan Thai Cuisine, Saigon Bangkok Restaurant,..."
3,CxDOIDnH8gp9KXzpBHJYXw,"Laloux, Negroni, Pray Tell, The Monkey Bar, Ki..."
4,U1vl4SQzO3wTAWlYVnSjnw,"8 Sushi, Banh Xeo Minh, Positano Restaurant, O..."


In [54]:
df_cfr.to_csv('Item-Item_CFR_Example.csv')

**Test and validation**

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names_out()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,arabian,bakeries,barbeque,bars,breakfast,british,brunch,burgers,cafes,canadian,caterers,chips,cocktail,coffee,comfort,diners,eastern,event,falafel,fish,food,free,french,gluten,hawaiian,italian,japanese,korean,lebanese,live,lounges,mediterranean,middle,new,nightlife,noodles,pizza,planning,poke,poutineries,pubs,ramen,raw,restaurants,salad,sandwiches,seafood,services,soup,steakhouses,sushi,tea,traditional,vegan,vegetarian,wine


In [18]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names_out()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
african,american,asian,bakeries,barbeque,bars,chinese,comfort,european,food,french,fusion,grocery,international,italian,japanese,latin,lounges,modern,nightlife,pan,pubs,restaurants,sandwiches,seafood,southern,sports,steakhouses,taiwanese,thai,traditional,vietnamese


In [19]:
# Check the labels shared by the recommended and the originally rated restaurants
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, bakeries, barbeque, bars, comfort, food, french, italian, japanese, lounges, nightlife, pubs, restaurants, sandwiches, seafood, steakhouses, traditional
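
The label check above can be wrapped in a small helper that also returns a single score. The sketch below is an assumption-laden variant, not the notebook's method: it splits category strings on commas (so multi-word labels stay together, unlike the word-level `TfidfVectorizer` tokens) and it adds a Jaccard similarity, which the notebook does not compute. The function names and the sample strings are hypothetical.

```python
def label_set(category_strings):
    """Split comma-separated category strings into a set of lowercase labels."""
    labels = set()
    for cats in category_strings:
        labels.update(part.strip().lower() for part in cats.split(",") if part.strip())
    return labels

def label_overlap(rated_categories, recommended_categories):
    """Return the shared labels (sorted) and their Jaccard similarity."""
    rated = label_set(rated_categories)
    recommended = label_set(recommended_categories)
    common = rated & recommended
    union = rated | recommended
    jaccard = len(common) / len(union) if union else 0.0
    return sorted(common), jaccard

# Made-up category strings in the same format as df['categories'].
rated = ["Restaurants, Mexican", "Bars, Nightlife, Restaurants"]
recommended = ["Restaurants, Thai", "Nightlife, Pubs"]
common, score = label_overlap(rated, recommended)
print(common, round(score, 2))
```

A normalized score like this makes runs with different numbers of rated restaurants easier to compare than a raw count of shared labels.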


## Matrix Factorization recommender (NMF)

In [20]:
from sklearn.decomposition import NMF
class NMF_Recommender(object):

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        nmf = NMF(n_components=self.n_components)  # use the value passed to __init__
        nmf.fit(ratings_mat)
        self.W = nmf.transform(ratings_mat)
        self.H = nmf.components_
        self.error = nmf.reconstruction_err_
        self.ratings_mat_fitted = self.W.dot(self.H)

    def get_error(self):
        return self.error
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
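
To illustrate what `fit` stores in `W`, `H` and `ratings_mat_fitted`, here is a small sketch on a made-up 4x4 ratings array. The settings `init="random"`, `random_state=0` and `max_iter=500` are assumptions chosen only to make the toy run repeatable; they are not the settings used in the class above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Rows are users, columns are items; 0 means "not rated". Values are made up.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Two latent factors for a toy matrix with two visible taste groups.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(toy_ratings)   # users x latent factors
H = model.components_                  # latent factors x items
approx = W @ H                         # dense matrix of predicted ratings

print(W.shape, H.shape)
print(np.round(approx, 1))
```

Both factors are non-negative, so each latent factor can only add to a predicted rating; the zeros in the input are treated as observed ratings of 0 rather than as missing values.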

In [21]:
# get recommendations for the same lucky user
my_rec_engine = NMF_Recommender(n_components=200)
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user dmN6SfMI-pQyT-ouOmEkjQ are: 
KINTON RAMEN, Notre-Boeuf-de-Grâce, Otto Yakitori Izakaya, Europea, Escondite, Satay Brothers, Restaurant LOV, Au Pied de Cochon, C'ChoColat, Deville Dinerbar


In [22]:
print("The user's originally rated restaurants are:\n %s" % (','.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in items_rated_by_this_user)))

The user's originally rated restaurants are:
 Big in Japan,Olive & Gourmando,Wienstein & Gavino's,Brit & Chips,Burger Bar Crescent,Le Majestique,Opiano,Moleskine,Mandy's,GaNaDaRa,Lola Rosa,Tommy,Le Poké Bar,Garage Beirut,Rosewood,Ramen Misoya,Biirū,Universel,Kantapia,Big In Japan,Holder Restaurant Bar,La Banquise,Sir Winston Churchill Pub,Shawarmaz,Les 400 Coups,Boustan,Le Warehouse


##### NMF recommendations for example users

In [57]:
print(active_user2_ls)

['XbiKsujS_qxU3xsr0xUqmQ', 'CjbfWpCRLbA-L_eS_ztd6Q', 'JrXC_MDp38BWwLn2SFdNsA', 'CxDOIDnH8gp9KXzpBHJYXw', 'U1vl4SQzO3wTAWlYVnSjnw', 'wffnrXJoLppOlvNOZKU70A', 'tU94-C1zpBsfGFvpsJJr2w', 'qKpkRCPk4ycbllTfFcRbNw', 'O3pSxv1SyHpY4qi4Q16KzA', 'Plqi4pG84PA_vBM8OfDPDg', 'Ao-6FYE29-I8WwPg67806A', 'jpIGlAym6z88W2xzHiK5_A', 'PGeiszoVusiv0wTHVdWklA', 'U5YQX_vMl_xQy8EQDqlNQQ', 'XrYTMhY9YJvzX2pMepIz7A', 'orh0HRUNCWuQMt9Iia_osg', 'CQ67NJigSe5-uBDX3b_CUw', 'jwctwzboGhQmtC50Juxa9A', 'TibBhm-fbksozIDFD8wjPQ', 'ic-tyi1jElL_umxZVh8KNA', 'uz5-sq6wHrXScrIWb8r1Mg', 'LB5ViGU59ww2XRCx803t0w', '8Dvr-U6jCZTVGD52LwC2qA', 'eZeBuiVZWT7u3SktO7mv9w', 'Z09rco1enQXNCd9H0u7kvg', 'g-y4Me4bqDz8jwFzX_e17w', 'O3q-nwYZykMmacxjru01Zg', 'Zd3wzNdevk15CwMIJdbjZw', '3aYeG-x5A44GIgmBHrwyAA', 'pn_flI3EBNugBEYFp9okxQ', 'Kj9cFO70zZOQorN0mgeLWA', 'Nq6e5N8bjgD9B46O4va_zA', 'yphnJ8zYbJF7Y3QbtMj91g', 'iRQ_YKpCBdaCwvc2X8_3NQ', 'hFvLG_m26hYMx1UGQSpaEg', 'aX6_Pf3njB-H3FrqgnNJ2g', '0Zswwlz4NzUJoG-skyWzIw', 'vs8aSP9ArwqAlb0LeCnFeQ']


In [58]:
nmf_ls = []
rec_engine_nmf = NMF_Recommender(n_components=200)
rec_engine_nmf.fit(ratings_mat)
for user in active_user2_ls:
    current_lucky_user = user
    current_lucky_user_index = user_id.index(current_lucky_user)
    lucky_user_recommend_nmf, items_rated_by_this_user_nmf = rec_engine_nmf.top_n_recs(user_id=current_lucky_user_index, n = 10)
    nmf_dic = {}
    nmf_dic['user_id'] = current_lucky_user
    rec_result = ', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in lucky_user_recommend_nmf)
    nmf_dic['recommendation'] = rec_result
    nmf_ls.append(nmf_dic)
print(nmf_ls)

[{'user_id': 'XbiKsujS_qxU3xsr0xUqmQ', 'recommendation': 'Portland Variety, Ginger & Onion Cuisine, Gonoe Sushi, Odd Seoul, Sharetea, Valens Restaurant, Napoli Centrale, Home Of The Brave, Hair of the Dog, Tosto Quickfire Pizza Pasta'}, {'user_id': 'CjbfWpCRLbA-L_eS_ztd6Q', 'recommendation': "Little Piggy's, Walkers Wine Bar & Grill, Copacabana Brazilian Steak House, Pasha Authentic Turkish Cuisine, A-Game Cafe, Sud Forno, Trattoria Nervosa, Sansotei, Dragon Pearl Buffet, Azyun Restaurant"}, {'user_id': 'JrXC_MDp38BWwLn2SFdNsA', 'recommendation': "Diana's Oyster Bar And Grill, Millie Cafe, Urban Herbivore, Kaka All You Can Eat, Philthy Philly's, Sushi Legend, New City Restaurant, Daldongnae, The Oxley, Second Cup"}, {'user_id': 'CxDOIDnH8gp9KXzpBHJYXw', 'recommendation': "Legend Pot, The Keg Steakhouse + Bar - Mansion, DM Chicken, Azyun Restaurant, Maple Yip Seafood Restaurant, J'adore Hot Pot, Honest Weight, Congee Town, GuangZhou Tai Ping Sha Chinese Restaurant, Green Tea Restaurant"

In [59]:
df_nmf = pd.DataFrame(nmf_ls)
df_nmf = df_nmf[['user_id','recommendation']]
df_nmf.head()

Unnamed: 0,user_id,recommendation
0,XbiKsujS_qxU3xsr0xUqmQ,"Portland Variety, Ginger & Onion Cuisine, Gono..."
1,CjbfWpCRLbA-L_eS_ztd6Q,"Little Piggy's, Walkers Wine Bar & Grill, Copa..."
2,JrXC_MDp38BWwLn2SFdNsA,"Diana's Oyster Bar And Grill, Millie Cafe, Urb..."
3,CxDOIDnH8gp9KXzpBHJYXw,"Legend Pot, The Keg Steakhouse + Bar - Mansion..."
4,U1vl4SQzO3wTAWlYVnSjnw,"GaNaDaRa, Didar, Sushi Shop, Opiano, Serrano B..."


In [60]:
df_nmf.to_csv('NMF_Recommendation_Example.csv')

**Test**

In [23]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names_out()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,arabian,bakeries,barbeque,bars,breakfast,british,brunch,burgers,cafes,canadian,caterers,chips,cocktail,coffee,comfort,diners,eastern,event,falafel,fish,food,free,french,gluten,hawaiian,italian,japanese,korean,lebanese,live,lounges,mediterranean,middle,new,nightlife,noodles,pizza,planning,poke,poutineries,pubs,ramen,raw,restaurants,salad,sandwiches,seafood,services,soup,steakhouses,sushi,tea,traditional,vegan,vegetarian,wine


In [24]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names_out()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,asian,bars,breakfast,brunch,burgers,canadian,chocolatiers,creperies,desserts,diners,fast,food,french,fusion,indonesian,japanese,malaysian,mexican,new,nightlife,noodles,polish,pubs,ramen,restaurants,sandwiches,shops,singaporean,soup,specialty,stands,traditional,trucks,vegan,vegetarian


In [25]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, bars, breakfast, brunch, burgers, canadian, diners, food, french, japanese, new, nightlife, noodles, pubs, ramen, restaurants, sandwiches, soup, traditional, vegan, vegetarian


#### Common labels from Item - Item Collaborative Filter are: 
- american, asian, bakeries, barbeque, bars, breakfast, brunch, cafes, canadian, chicken, chinese, coffee, creperies, desserts, fast, food, fusion, japanese, juice, new, nightlife, restaurants, salad, seafood, shop, smoothies, specialty, tea, thai, traditional, wings

#### Common labels from SVD are: 
- american, asian, bars, breakfast, brunch, cafes, canadian, chinese, creperies, desserts, fast, food, fusion, japanese, korean, new, nightlife, noodles, restaurants, sandwiches, soup, sushi, taiwanese, tea, traditional

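Based on the user's previous ratings, the NMF recommender shows better performance, where performance means the number of category labels shared between the recommended restaurants and the restaurants the user has already rated.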

## Matrix Factorization recommender (SVD) with restaurants' labels.

Each business has its own labels. Suppose we have a table of business_id against category labels, where each element is the style score of a restaurant for a label. We can also build a second table of user_id against category labels, where each element is the preference (taste) of a user for a label. Multiplying the two tables gives the utility table. Both sub-tables can contain negative numbers, because a preference can be a like or a dislike.
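
The two-table idea can be sketched with a tiny NumPy example. All names and numbers here (`labels`, `user_taste`, `business_style`) are hypothetical and chosen by hand; in the recommender the factors are learned from the ratings rather than written down.

```python
import numpy as np

labels = ["mexican", "ramen", "bars"]   # hypothetical label axis

# user_taste[u, k]: how much user u likes label k (negative means dislike).
user_taste = np.array([
    [ 1.0, -0.5, 0.0],
    [-1.0,  1.0, 0.5],
])

# business_style[b, k]: how strongly business b expresses label k.
business_style = np.array([
    [1.0, 0.0, 0.0],   # a taqueria
    [0.0, 1.0, 0.0],   # a ramen shop
    [0.0, 0.5, 1.0],   # an izakaya-style bar
])

# Utility table: one predicted score per (user, business) pair.
utility = user_taste @ business_style.T
print(utility)
```

Each entry is the dot product of one user's taste vector with one business's style vector, so a dislike (negative taste) for a label the business expresses strongly pulls the score down.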

In [26]:
#get the number of labels 
mask = [business in business_id for business in df['business_id']]
category = df['categories'][mask]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
category_vec = vectorizer.fit_transform(category).toarray()
words = vectorizer.get_feature_names_out()
#This is the number of unique categories
print('The total number of restaurant labels is %d' % (len(words))) 

The total number of restaurant labels is 441


In [27]:
from sklearn.decomposition import TruncatedSVD
class SVD_Recommender(object):

    def __init__(self):
        self.n_components = 361 #the number of labels

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        svd = TruncatedSVD(n_components=self.n_components, n_iter=7, random_state=1)
        svd.fit(ratings_mat)
        self.V = svd.components_
        self.U = svd.transform(ratings_mat)
        self.ratings_mat_fitted = self.U.dot(self.V)

    def get_error(self):
        # densify so that the subtraction and ** are elementwise ndarray operations
        return ((self.ratings_mat_fitted - self.ratings_mat.toarray())**2).mean(axis=None)
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
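
The effect of `n_components` on the reconstruction error reported by `get_error` can be seen on a toy matrix. This sketch is an illustration under assumptions: it uses NumPy's exact `np.linalg.svd` and truncates it by hand, rather than scikit-learn's `TruncatedSVD` (which uses an approximate solver by default), and the data and the helper `rank_k_approx` are made up.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated". Values are made up.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Full SVD: toy_ratings == U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(toy_ratings, full_matrices=False)

def rank_k_approx(k):
    """Keep only the k largest singular values and their vectors."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Mean squared reconstruction error for k = 1..4 components.
errors = [float(((toy_ratings - rank_k_approx(k)) ** 2).mean()) for k in range(1, 5)]
print([round(e, 3) for e in errors])
```

The error can only go down as components are added, which is why a low error alone does not identify a good `n_components`; the notebook ties the choice to the number of restaurant labels instead.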

In [28]:
# get recommendations for the same lucky user
my_rec_engine = SVD_Recommender()
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user dmN6SfMI-pQyT-ouOmEkjQ are: 
Kumamoto, Brasserie 701, La Graine Brûlée, La Habanera, Pizzéria N° 900 Peel, Aux Vivres Plateau, KINKA IZAKAYA MONTREAL, Joe's Panini, Escondite, KINTON RAMEN


##### SVD example output

In [63]:
svd_ls = []
rec_engine_svd = SVD_Recommender()
rec_engine_svd.fit(ratings_mat)
for user in active_user2_ls:
    current_lucky_user = user
    current_lucky_user_index = user_id.index(current_lucky_user)
    lucky_user_recommend_svd, items_rated_by_this_user_svd = rec_engine_svd.top_n_recs(user_id=current_lucky_user_index, n = 10)
    svd_dic = {}
    svd_dic['user_id'] = current_lucky_user
    rec_result = ', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in lucky_user_recommend_svd)
    svd_dic['recommendation'] = rec_result
    svd_ls.append(svd_dic)
print(svd_ls)

[{'user_id': 'XbiKsujS_qxU3xsr0xUqmQ', 'recommendation': "Odd Seoul, Sansotei Ramen, Otto's Berlin Döner, Canis, Crown Prince Fine Dining & Banquet, BarChef, Gong Cha Tea, Tosto Quickfire Pizza Pasta, The Dumpling King, Sakawa Coffee & Japanese Restaurant"}, {'user_id': 'CjbfWpCRLbA-L_eS_ztd6Q', 'recommendation': "Copacabana Brazilian Steak House, Wow Sushi, Pie Squared, Banh Mi Boys, Pho Ngoc Yen Restaurant, Ajisen Ramen, Double D's, Congee Queen, The Keg Steakhouse + Bar - Estate Drive, Pomegranate Restaurant"}, {'user_id': 'JrXC_MDp38BWwLn2SFdNsA', 'recommendation': "Lime Asian Kitchen, LOCAL Public Eatery, Main St Greek, Peter's Fine Dining Steak and Seafood, Kenzo Ramen, Me Va Me Kitchen Express, Pi Co, Second Cup, The Oxley, Nichiban Sushi4U"}, {'user_id': 'CxDOIDnH8gp9KXzpBHJYXw', 'recommendation': "Bulldog Coffee, Hexagon, GuangZhou Tai Ping Sha Chinese Restaurant, Panagio's All Day Grill, Northwestern Chinese Cuisine, Pho'Q, JaBistro, The Wickson Social, Hai Tang Cafe, Basil B

In [64]:
df_svd = pd.DataFrame(svd_ls)
df_svd = df_svd[['user_id','recommendation']]
df_svd.head()

Unnamed: 0,user_id,recommendation
0,XbiKsujS_qxU3xsr0xUqmQ,"Odd Seoul, Sansotei Ramen, Otto's Berlin Döner..."
1,CjbfWpCRLbA-L_eS_ztd6Q,"Copacabana Brazilian Steak House, Wow Sushi, P..."
2,JrXC_MDp38BWwLn2SFdNsA,"Lime Asian Kitchen, LOCAL Public Eatery, Main ..."
3,CxDOIDnH8gp9KXzpBHJYXw,"Bulldog Coffee, Hexagon, GuangZhou Tai Ping Sh..."
4,U1vl4SQzO3wTAWlYVnSjnw,"L'Oeufrier, Dunn's Famous, Adamo, Café Pacefik..."


In [65]:
df_svd.to_csv('SVD_Recommendation_Example.csv')

**Test**

Let me check whether the recommendations make sense by checking whether the category labels are consistent between the originally rated restaurants and the recommended restaurants.

In [29]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names_out()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,arabian,bakeries,barbeque,bars,breakfast,british,brunch,burgers,cafes,canadian,caterers,chips,cocktail,coffee,comfort,diners,eastern,event,falafel,fish,food,free,french,gluten,hawaiian,italian,japanese,korean,lebanese,live,lounges,mediterranean,middle,new,nightlife,noodles,pizza,planning,poke,poutineries,pubs,ramen,raw,restaurants,salad,sandwiches,seafood,services,soup,steakhouses,sushi,tea,traditional,vegan,vegetarian,wine


In [30]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names_out()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,bars,brasseries,breakfast,brunch,cafes,coffee,cuban,food,french,japanese,juice,latin,lounges,mexican,nightlife,noodles,pizza,plates,pubs,ramen,restaurants,sandwiches,small,smoothies,soup,tapas,tea,trucks,vegan,vegetarian


In [31]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, bars, breakfast, brunch, cafes, coffee, food, french, japanese, lounges, nightlife, noodles, pizza, pubs, ramen, restaurants, sandwiches, soup, tea, vegan, vegetarian
