# Yelp Data Challenge - Restaurant Recommender
Summary: 
- The existing star ratings can be used directly to build a recommendation system.
- Here I build three recommenders: an item-item collaborative filtering recommender, an NMF recommender, and an SVD recommender.
- I compare the recommenders by checking which category labels are shared between the recommended restaurants and the restaurants the user has already visited.
- The item-item collaborative filtering recommender and the SVD recommender perform better by this measure, and the SVD recommender computes faster. Therefore, the SVD recommender wins here.
- The likely reasons are:
    - Collaborative filtering needs to calculate pairwise similarities, so it is slower than the matrix factorization models.
    - The SVD recommender uses the number of restaurant labels to choose the number of latent factors, which settles that choice. Therefore, the SVD recommender outperforms the NMF recommender.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('2017_restaurant_reviews.csv')

In [3]:
df.head(3)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
1,--DaPTJW3-tB1vP-PfdTEg,Sunnyside Grill,"Restaurants, Breakfast & Brunch",3.5,0,2017-06-18,0,MzkM0K4Ifb8Xr3gTgCiW9g,1,Decided to go for Fathers Day breakfast with m...,0,8hfJ_MM1_Yd-ECDfuWeTAA
4,--DaPTJW3-tB1vP-PfdTEg,Sunnyside Grill,"Restaurants, Breakfast & Brunch",3.5,0,2017-05-02,0,_HY2yoOUbe13yw7PYFs8gg,5,Love the breakfast here and the pancakes! Alwa...,0,gi7nOGT7lrJRdcmgcSvjqA
9,--DaPTJW3-tB1vP-PfdTEg,Sunnyside Grill,"Restaurants, Breakfast & Brunch",3.5,0,2017-07-20,0,hTACl2M-vXHjnG3EXIU8Kw,1,This was our first visit to this restaurant an...,0,-TQ0Vqbu8F_3UM7B15GPbQ


## Clean data and get rating data 

#### 1. Select relevant columns in the original dataframe

In [4]:
recommender_df = df[['business_id', 'user_id', 'stars']]
recommender_df.head(3)

Unnamed: 0,business_id,user_id,stars
1,--DaPTJW3-tB1vP-PfdTEg,8hfJ_MM1_Yd-ECDfuWeTAA,1
4,--DaPTJW3-tB1vP-PfdTEg,gi7nOGT7lrJRdcmgcSvjqA,5
9,--DaPTJW3-tB1vP-PfdTEg,-TQ0Vqbu8F_3UM7B15GPbQ,1


Many users have written only a few reviews. I will exclude these users from the item-item similarity recommender.

In [8]:
print(recommender_df.groupby('user_id').count())

                        business_id  stars
user_id                                   
--7gjElmOrthETJ8XqzMBw            1      1
--BumyUHiO_7YsHurb9Hkw           37     37
--C93xIlmjtgQfSOIpcQSA            1      1
--DKDJlRHfsvufdGSk_Sdw            1      1
--Qh8yKWAvIP4V4K8ZPfHA           30     30
--WhK4MJx0fKvg64LqwStg            1      1
--YhjyV-ce1nFLYxP49C5A           24     24
--neBDssyZlHqAWgrHtUBQ            1      1
--t6W1JHbStaCp5RO05thA            1      1
-018WmPPk8qlp3TEiqqMVw            1      1
-06T53TLMkg_xGl3flhDNw            1      1
-0OE9Pn8vSK-WjJeRtHDtw            1      1
-0Z6b2zZhdXV2pKX4jcCEQ            1      1
-0kOvwKyRPhMtkj41zMLNw            1      1
-1Kz6O4zTiC9eeYlUXlnTQ            1      1
-1g1OCAmYUjkRgk8tcUMBA            1      1
-1l27u1nHhe7F7AlheMIGQ           11     11
-1wbglcr6x1qrUbqP1YAIA           14     14
-2-QlooozM674mk38AWJIQ            2      2
-2WXHI0UtH4v6OCnF10ENA            1      1
-2kCxY7_aw5hOz7fJnGMbQ           31     31
-2n9u5-z_uK

In [5]:
reviews_count_df = recommender_df.groupby('user_id')['stars'].count()
reviews_count_df.head(5)

user_id
--7gjElmOrthETJ8XqzMBw     1
--BumyUHiO_7YsHurb9Hkw    37
--C93xIlmjtgQfSOIpcQSA     1
--DKDJlRHfsvufdGSk_Sdw     1
--Qh8yKWAvIP4V4K8ZPfHA    30
Name: stars, dtype: int64

In [9]:
print('Max reviews: %s, Min reviews: %s' % (max(reviews_count_df), min(reviews_count_df)))
print('Median reviews: %s, Mean reviews: %s' % (np.median(reviews_count_df), round(np.mean(reviews_count_df),2)))
print('25%% reviews: %d,  75%% reviews: %d' % (np.percentile(reviews_count_df, 25), np.percentile(reviews_count_df, 75)))
print('Number of unique business: %d' % (len(set(recommender_df['business_id']))))

Max reviews: 176, Min reviews: 1
Median reviews: 1.0, Mean reviews: 2.9
25% reviews: 1,  75% reviews: 2
Number of unique business: 5381


In [10]:
active_user = reviews_count_df[reviews_count_df >= 10].index  # user_ids with at least 10 reviews
mask = recommender_df['user_id'].isin(active_user)  # vectorized membership test
active_user_df = recommender_df[mask]
active_user_df.head(5)

Unnamed: 0,business_id,user_id,stars
17,--DaPTJW3-tB1vP-PfdTEg,2HjBjUrqjjVfopPfghgpqw,3
61,--SrzpvFLwP_YFwB_Cetow,6eCgSb66TE1LeiWPPBPnTg,4
70,--SrzpvFLwP_YFwB_Cetow,QbjryxoBD9wfzFpmvgubTw,5
73,--SrzpvFLwP_YFwB_Cetow,fM9LC2P8jQrQGGsXKccLQw,3
76,--SrzpvFLwP_YFwB_Cetow,XbiKsujS_qxU3xsr0xUqmQ,4


In [39]:
print('The total number of active users in Toronto in 2017 is %d.' % \
      (len(active_user_df.groupby('user_id')['stars'].count())))

The total number of active users in Toronto in 2017 is 1697.


In [40]:
print('The total records number for active users in Toronto in 2017 is %d.' % \
      (len(active_user_df)))

The total records number for active users in Toronto in 2017 is 36652.


#### 2. Create utility matrix from records

In [13]:
from scipy import sparse
n_users = len(set(active_user_df['user_id']))          # number of distinct active users
n_businesses = len(set(active_user_df['business_id']))  # number of distinct restaurants
ratings_mat = sparse.lil_matrix((n_users, n_businesses))
ratings_mat

<1697x4464 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in LInked List format>

Fill the ratings matrix from the review records.

In [14]:
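user_id = list(set(active_user_df['user_id']))
business_id = list(set(active_user_df['business_id']))
# Map each id to its row/column once; list.index would rescan the list on every record
user_pos = {u: i for i, u in enumerate(user_id)}
business_pos = {b: j for j, b in enumerate(business_id)}
for _, row in active_user_df.iterrows():
    ratings_mat[user_pos[row.user_id], business_pos[row.business_id]] = row.stars  # store the star rating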
ratings_mat

<1697x4464 sparse matrix of type '<class 'numpy.float64'>'
	with 36652 stored elements in LInked List format>
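
The row-by-row fill above can also be done in one step. The following is a minimal sketch on toy data, not the notebook's own method: `toy_df` and its values are hypothetical stand-ins for `active_user_df`, and `pd.factorize` is used here to assign integer positions to the ids.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for active_user_df (hypothetical values).
toy_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b2", "b2", "b3"],
    "stars": [5, 3, 4, 1],
})

# factorize maps each distinct id to an integer code in order of first appearance.
user_codes, user_index = pd.factorize(toy_df["user_id"])
biz_codes, biz_index = pd.factorize(toy_df["business_id"])

# Build the users x businesses matrix from (value, (row, col)) triplets.
toy_mat = sparse.csr_matrix(
    (toy_df["stars"].to_numpy(dtype=float), (user_codes, biz_codes)),
    shape=(len(user_index), len(biz_index)),
)
print(toy_mat.toarray())
```

Note that this constructor sums duplicate (user, business) pairs, whereas the loop above keeps the last rating written, so the two agree only when each pair appears once.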

## Item - Item Collaborative Filter Recommender

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
from time import time
class ItemItemRecommender(object):

    def __init__(self, neighborhood_size):
        self.neighborhood_size = neighborhood_size

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        self.item_sim_mat = cosine_similarity(self.ratings_mat.T)
        self._set_neighborhoods()

    def _set_neighborhoods(self):
        least_to_most_sim_indexes = np.argsort(self.item_sim_mat, 1)
        self.neighborhoods = least_to_most_sim_indexes[:, -self.neighborhood_size:]

    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        # Just initializing so I have somewhere to put rating preds
        out = np.zeros(self.n_items)
        for item_to_rate in range(self.n_items):
            relevant_items = np.intersect1d(self.neighborhoods[item_to_rate],
                                            items_rated_by_this_user,
                                            assume_unique=True)  # assume_unique speeds up intersection op
            out[item_to_rate] = self.ratings_mat[user_id, relevant_items] * \
                self.item_sim_mat[item_to_rate, relevant_items] / \
                self.item_sim_mat[item_to_rate, relevant_items].sum()
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        cleaned_out = np.nan_to_num(out)
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [18]:
# use the 80 most similar items as each item's neighborhood
my_rec_engine = ItemItemRecommender(neighborhood_size=80)
my_rec_engine.fit(ratings_mat)
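
To make the prediction rule in `pred_one_user` concrete, here is a small worked sketch on a hypothetical dense ratings matrix. It is a simplification: it weights all items the user has rated by cosine similarity and skips the neighborhood restriction that the class applies. The names `toy_ratings` and `predict` are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 4-user x 3-item ratings; 0 means "not rated".
toy_ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

# Item-item similarity: compare columns, hence the transpose.
item_sim = cosine_similarity(toy_ratings.T)

def predict(user, item):
    """Similarity-weighted average of the user's existing ratings."""
    rated = np.nonzero(toy_ratings[user])[0]
    weights = item_sim[item, rated]
    if weights.sum() == 0:
        return 0.0  # no similar rated item, so no prediction
    return float(toy_ratings[user, rated] @ weights / weights.sum())

pred = predict(user=0, item=2)
print(pred)
```

User 0 rated items 0 and 1, and both happen to be equally similar to item 2 in this toy matrix, so the prediction is the plain average of the two ratings.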

Let me try the recommender system on a randomly chosen ("lucky") user.

In [19]:
lucky_user = np.random.choice(active_user_df['user_id'], 1)[0]
lucky_user_index = user_id.index(lucky_user)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)



In [20]:
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user Ts2fDK6swkRDIhPpq62_2g are: 
Salus Fresh Foods, King Solomon and Queen of Sheba, The Bottom Line, Starving Artist, Tasty BBQ Seafood Restaurant, Global Imperial Cuisine, Perfect Blend Bakery & Espresso Bar, Patchmon's Thai Desserts & More, Ninki Japanese Cuisine, Popeyes Louisiana Kitchen


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,asian,bagels,bakeries,barbeque,bars,beer,breakfast,brunch,cafes,canadian,caterers,chicken,chinese,coffee,comfort,cream,creperies,desserts,dim,diners,ethnic,event,fast,food,frozen,fusion,ice,imported,internet,japanese,juice,korean,mediterranean,new,nightlife,noodles,pizza,planning,plates,restaurants,salad,sandwiches,seafood,services,shop,small,smoothies,soul,soup,specialty,spirits,sum,sushi,swiss,taiwanese,tapas,tea,thai,traditional,vietnamese,wine,wings,yogurt


In [24]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True)
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,asian,bakeries,barbeque,bars,breakfast,brunch,cafes,cajun,canadian,chicken,chinese,coffee,creole,creperies,desserts,ethiopian,fast,food,free,fusion,gluten,health,japanese,juice,markets,new,nightlife,pubs,restaurants,salad,seafood,shop,smoothies,southern,specialty,steakhouses,tea,thai,traditional,waffles,wings


In [25]:
# Check the labels shared by the recommended and the rated restaurants
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, asian, bakeries, barbeque, bars, breakfast, brunch, cafes, canadian, chicken, chinese, coffee, creperies, desserts, fast, food, fusion, japanese, juice, new, nightlife, restaurants, salad, seafood, shop, smoothies, specialty, tea, thai, traditional, wings
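
Listing the common labels gives a qualitative check. One way to turn it into numbers is sketched below. This is an assumption on my part rather than something the notebook computes: `label_overlap` is a hypothetical helper, and the share-of-recommended and Jaccard scores are swapped-in measures.

```python
def label_overlap(rated_labels, recommended_labels):
    """Summarize how much two label lists have in common."""
    rated, rec = set(rated_labels), set(recommended_labels)
    common = rated & rec
    union = rated | rec
    return {
        "n_common": len(common),
        # Fraction of recommended labels the user has already seen.
        "share_of_recommended": len(common) / len(rec) if rec else 0.0,
        # Intersection over union of the two label sets.
        "jaccard": len(common) / len(union) if union else 0.0,
    }

# Toy label lists (hypothetical).
scores = label_overlap(["asian", "sushi", "tea", "pizza"], ["asian", "tea", "ramen"])
print(scores)
```

In the notebook, the call would be `label_overlap(original_word, recommend_word)` after each recommender's label cells. Comparing raw counts of common labels favors recommenders that return restaurants with many labels, which is why a ratio may be the fairer comparison.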


## Matrix Factorization recommender (NMF)

In [26]:
from sklearn.decomposition import NMF
class NMF_Recommender(object):

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        nmf = NMF(n_components=self.n_components)  # use the value passed to __init__
        nmf.fit(ratings_mat)
        self.W = nmf.transform(ratings_mat)
        self.H = nmf.components_
        self.error = nmf.reconstruction_err_
        self.ratings_mat_fitted = self.W.dot(self.H)

    def get_error(self):
        return self.error
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user

In [27]:
# get recommendations for the same lucky user
my_rec_engine = NMF_Recommender(n_components=200)
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user Ts2fDK6swkRDIhPpq62_2g are: 
Kaiju, Hibiscus, House of Gourmet, Plentea, Loaded Pierogi, King's Noodle Restaurant, Japango, Cocina Economica, Basil Box, Gyubee Japanese BBQ - Downtown


In [28]:
print("The users original rated resturants are :\n %s" % (','.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in items_rated_by_this_user)))

The users original rated resturants are :
 Hot-Star,Bach Yen,Akai Sushi,Onnki Donburi,Lee Chen Asian Bistro,Red Lobster,Swiss Chalet Rotisserie & Grill,Luscious Desserts,Spring Rolls,Chop Chop,Panera Bread,Yummy Yummy Dumplings,Church Street Espresso,Kobi Korean Restaurant,Huh Ga Ne,Big Sushi,Aroma Espresso Bar,Canephora Cafe & Bakery


In [29]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,asian,bagels,bakeries,barbeque,bars,beer,breakfast,brunch,cafes,canadian,caterers,chicken,chinese,coffee,comfort,cream,creperies,desserts,dim,diners,ethnic,event,fast,food,frozen,fusion,ice,imported,internet,japanese,juice,korean,mediterranean,new,nightlife,noodles,pizza,planning,plates,restaurants,salad,sandwiches,seafood,services,shop,small,smoothies,soul,soup,specialty,spirits,sum,sushi,swiss,taiwanese,tapas,tea,thai,traditional,vietnamese,wine,wings,yogurt


In [30]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,asian,barbeque,bars,cafes,canadian,chinese,cocktail,coffee,comfort,desserts,ethnic,fast,food,free,fusion,gluten,imported,japanese,latin,malaysian,mexican,new,nightlife,noodles,organic,polish,restaurants,rooms,salad,soul,specialty,steakhouses,stores,sushi,tea,thai,traditional,vegan,vegetarian,vietnamese


In [31]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, asian, barbeque, bars, cafes, canadian, chinese, coffee, comfort, desserts, ethnic, fast, food, fusion, imported, japanese, new, nightlife, noodles, restaurants, salad, soul, specialty, sushi, tea, thai, traditional, vietnamese


#### Common labels from Item - Item Collaborative Filter are: 
- american, asian, bakeries, barbeque, bars, breakfast, brunch, cafes, canadian, chicken, chinese, coffee, creperies, desserts, fast, food, fusion, japanese, juice, new, nightlife, restaurants, salad, seafood, shop, smoothies, specialty, tea, thai, traditional, wings

#### Common labels from SVD are: 
- american, asian, bars, breakfast, brunch, cafes, canadian, chinese, creperies, desserts, fast, food, fusion, japanese, korean, new, nightlife, noodles, restaurants, sandwiches, soup, sushi, taiwanese, tea, traditional

Based on the user's previous ratings, the NMF recommender shows better performance. (What does "performance" refer to here?)

## Matrix Factorization recommender (SVD) with restaurants' labels.

Each business has its own category labels. Suppose we have a table of business_id against category labels, where each element is the style score of a restaurant for a label. We can also build a second table of user_id against category labels, where each element is a user's preference (taste) for a label. Multiplying the two tables gives the utility table. Unlike in NMF, the two sub-tables can contain negative numbers, since a preference can be either a like or a dislike.
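
The two-table idea can be illustrated with a tiny hand-made example. All numbers and names below (`labels`, `user_pref`, `business_style`) are hypothetical. In the notebook itself, `TruncatedSVD` learns latent factors from the ratings, and those factors are not constrained to line up with actual category labels. Only their number is chosen from the label count.

```python
import numpy as np

labels = ["sushi", "coffee"]

# Hypothetical user preferences per label (rows: users); negative means dislike.
user_pref = np.array([
    [1.0, -0.5],   # likes sushi, mildly dislikes coffee
    [0.0, 2.0],    # indifferent to sushi, loves coffee
])

# Hypothetical style scores per label (rows: restaurants).
business_style = np.array([
    [2.0, 0.0],    # sushi bar
    [0.0, 1.5],    # cafe
    [1.0, 1.0],    # mixed
])

# Utility table: one predicted score per (user, restaurant) pair.
utility = user_pref @ business_style.T
print(utility)
```

Each entry is the dot product of a user's preferences with a restaurant's style scores, so a negative preference lowers the score of restaurants that are strong in that label.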

In [32]:
#get the number of labels 
mask = [business in business_id for business in df['business_id']]
category = df['categories'][mask]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
category_vec = vectorizer.fit_transform(category).toarray()
words = vectorizer.get_feature_names()
#This is the number of unique categories
print('The total number of restaurant labels is %d' % (len(words))) 

The total number of restaurant labels is 365


In [33]:
from sklearn.decomposition import TruncatedSVD
class SVD_Recommender(object):

    def __init__(self):
        self.n_components = 361  # close to the number of category labels (the cell above reports 365)

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        svd = TruncatedSVD(n_components=self.n_components, n_iter=7, random_state=1)
        svd.fit(ratings_mat)
        self.V = svd.components_
        self.U = svd.transform(ratings_mat)
        self.ratings_mat_fitted = self.U.dot(self.V)

    def get_error(self):
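        # Densify the sparse ratings first: subtracting a sparse matrix can yield np.matrix,
        # for which ** means matrix power rather than element-wise square
        return ((self.ratings_mat_fitted - self.ratings_mat.toarray())**2).mean(axis=None)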
        
    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        cleaned_out = self.ratings_mat_fitted[user_id,:]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:], items_rated_by_this_user
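
As a quick check of the fit-and-reconstruct step used by the class above, here is a minimal sketch on a hypothetical sparse toy matrix. The sizes and `n_components=2` are illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical 4-user x 5-restaurant ratings; 0 means "not rated".
toy_ratings = sparse.csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
]))

svd = TruncatedSVD(n_components=2, n_iter=7, random_state=1)
U = svd.fit_transform(toy_ratings)   # (4, 2): users in the latent space
V = svd.components_                  # (2, 5): latent factors over restaurants
fitted = U @ V                       # dense reconstruction of the ratings

# Densify before subtracting so that ** 2 squares element-wise.
mse = float(((fitted - toy_ratings.toarray()) ** 2).mean())
print(fitted.shape, round(mse, 3))
```

The error here is measured against every cell, including the zeros that stand for "not rated", so it says how well the matrix is reconstructed rather than how well unseen ratings are predicted.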

In [35]:
# get recommendations for the same lucky user
my_rec_engine = SVD_Recommender()
my_rec_engine.fit(ratings_mat)
lucky_user_recommend, items_rated_by_this_user = my_rec_engine.top_n_recs(user_id=lucky_user_index, n = 10)
print("The top ten recommendation for user %s are: " % (lucky_user))
print('%s' % (', '.join(list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                       for i in lucky_user_recommend)))

The top ten recommendation for user Ts2fDK6swkRDIhPpq62_2g are: 
KINTON RAMEN, The Cups, Antler Kitchen & Bar, JOEY Sherway, Bubble Republic on Bay, August 8, Monga Fried Chicken, Chine Hot Pot & Noodles, Sunrise House, Millie Creperie


Let me check whether the recommendations make sense by checking whether the category labels are consistent between the originally rated restaurants and the recommended restaurants.

In [36]:
original_rated_restaurants = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] for i in items_rated_by_this_user]
mask = [name in original_rated_restaurants for name in df['name']]
original_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
original_category_vec = vectorizer.fit_transform(original_category).toarray()
original_word = vectorizer.get_feature_names()
print('Categories from user rated restaurants: \n%s' % (','.join(i for i in original_word)))

Categories from user rated restaurants: 
american,asian,bagels,bakeries,barbeque,bars,beer,breakfast,brunch,cafes,canadian,caterers,chicken,chinese,coffee,comfort,cream,creperies,desserts,dim,diners,ethnic,event,fast,food,frozen,fusion,ice,imported,internet,japanese,juice,korean,mediterranean,new,nightlife,noodles,pizza,planning,plates,restaurants,salad,sandwiches,seafood,services,shop,small,smoothies,soul,soup,specialty,spirits,sum,sushi,swiss,taiwanese,tapas,tea,thai,traditional,vietnamese,wine,wings,yogurt


In [37]:
recommend_res = [list(set(df['name'][df['business_id'] == business_id[i]]))[0] \
                 for i in lucky_user_recommend]
mask = [name in recommend_res for name in df['name']]
recommend_category = df['categories'][mask]
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True
                            )
recommend_category_vec = vectorizer.fit_transform(recommend_category).toarray()
recommend_word = vectorizer.get_feature_names()
print('Categories from recommend restaurants: \n%s' % (','.join(i for i in recommend_word)))

Categories from recommend restaurants: 
american,asian,bars,breakfast,brunch,bubble,buffets,cafes,canadian,chinese,creperies,desserts,fast,food,fusion,hot,japanese,korean,lounges,new,nightlife,noodles,pot,pubs,ramen,restaurants,rooms,sandwiches,soup,sushi,taiwanese,tea,traditional


In [38]:
#Check the common labels
print("Common labels are: \n%s" % (', '.join(word for word in recommend_word if word in original_word)))

Common labels are: 
american, asian, bars, breakfast, brunch, cafes, canadian, chinese, creperies, desserts, fast, food, fusion, japanese, korean, new, nightlife, noodles, restaurants, sandwiches, soup, sushi, taiwanese, tea, traditional
