The project's primary objective is to create a Restaurant Recommender Application designed to assist users in locating dining establishments that align with their tastes. I gathered genuine restaurant reviews and used machine learning techniques to build a recommender system.

In my project, I conducted an in-depth analysis of restaurant reviews from Vancouver. Throughout the project, I gathered data from web sources, constructed a comprehensive data frame, established a robust recommender system, and developed a user-friendly application to utilize the recommendation system seamlessly.

__Please note: this is notebook 3 of 3.__

In this notebook, I build a content-based recommender system.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import session_info
session_info.show()

Recommender systems come in several flavours: user-independent systems, content-based recommender systems, and user-based recommender systems.

A user-independent system doesn't rely on any user-specific data. It mostly suggests the most popular restaurants to every user, without considering individual preferences.

Creating a User-Based Recommender System isn't feasible in my context, given the nature of TripAdvisor's user behavior. Users tend to provide reviews for a relatively small number of restaurants. On average, each user contributes only 1.7 reviews. This limited data makes it challenging to construct an effective user-based recommender system.

So in my project, I will concentrate on developing a Content-Based Recommender System.

### Content Based Recommendations

These recommendation engines are built around the idea that if a user likes an item, they will also like items with similar content, descriptions, or reviews.

First, we need to put the content into a format a model can understand. Right now we only have restaurant reviews and we can't really feed this straight into some model. I will use the Term Frequency Inverse Document Frequency measure (TF-IDF). The TF-IDF is made of two measures, term frequency (TF) and inverse document frequency (IDF).

For a term (a word or phrase) 𝑡 and a document 𝑑, the term frequency TF(𝑡, 𝑑) measures how common the term is in the document.
The inverse document frequency of a term 𝑡, IDF(𝑡), is the inverse of the number of documents the term 𝑡 appears in.

The intuition is that when a term appears in many documents, it's not very relevant. However, if it only appears in a few documents, it's very relevant to identifying those documents.

______

Let's load the final DataFrame from the previous notebook.

In [3]:
# Load the data
df = pd.read_csv('/Users/evgenijkucukov/Desktop/Brainstation/Portfolio/Restaurant_reviews_project/df_final_table')
df.head(5)

Unnamed: 0,restaurant,review,rating,Cuisine
0,Hydra Estiatorio Mediterranean,Wonderful fresh exotic Greek food that is deli...,5.0,
1,Hydra Estiatorio Mediterranean,"Food was delicious, packed full of flavors. Th...",5.0,
2,Hydra Estiatorio Mediterranean,"Hello, We would like to say thank you for the ...",5.0,
3,Hydra Estiatorio Mediterranean,Happy hour was amazing. Buck a shuck oysters a...,5.0,
4,Hydra Estiatorio Mediterranean,Roman was the best waiter! he checked up on us...,5.0,


Some restaurant chains, such as Cactus Club Cafe and JOEY, appear in the data under several location-specific names. Let's simplify their names by removing the location identifiers.

In [4]:
# Check which restaurant names contain "Cactus Club"
# Create a boolean mask to filter rows
mask = df['restaurant'].str.contains('Cactus Club')

# Use the mask to select the rows and create a new DataFrame
cactus_club_df = df[mask]


In [5]:
cactus_club_df['restaurant'].unique()

array(['Cactus Club Cafe Coal Harbour', 'Cactus Club Cafe Robson',
       'Cactus Club Cafe Bentall 5', 'Cactus Club Cafe Broadway + Ash',
       'Cactus Club Cafe English Bay', 'Cactus Club Cafe Richmond',
       'Cactus Club Cafe North Vancouver', 'Cactus Club Cafe Yaletown',
       'Cactus Club Cafe North Burnaby', 'Cactus Club Cafe Park Royal',
       'Cactus Club Cafe Byrne Road', 'Cactus Club Cafe',
       'Cactus Club Cafe Station Square'], dtype=object)

In [6]:
# Check which restaurant names contain "JOEY"
# Create a boolean mask to filter rows
mask_j = df['restaurant'].str.contains('JOEY')

# Use the mask to select the rows and create a new DataFrame
joey_df = df[mask_j]

joey_df['restaurant'].unique()

array(['JOEY Burrard', 'JOEY Bentall One', 'JOEY Shipyards',
       'JOEY Burnaby'], dtype=object)

In [7]:
# Check which restaurant names contain "Tap & Barrel"
# Create a boolean mask to filter rows
mask_tb = df['restaurant'].str.contains('Tap & Barrel')

# Use the mask to select the rows and create a new DataFrame
tb_df = df[mask_tb]

In [8]:
tb_df['restaurant'].unique()

array(['Tap & Barrel - Convention Centre', 'Tap & Barrel - Shipyards',
       'Tap & Barrel - Olympic Village', 'Tap & Barrel - Bridges'],
      dtype=object)

In [9]:
# replace all names above with "Cactus Club Cafe" or "JOEY" or "Tap & Barrel"

# Define a dictionary for replacements
replacements = {
    'Cactus Club Cafe Coal Harbour': "Cactus Club Cafe", 
    'Cactus Club Cafe Robson': "Cactus Club Cafe",
    'Cactus Club Cafe Bentall 5': "Cactus Club Cafe", 
    'Cactus Club Cafe Broadway + Ash': "Cactus Club Cafe",
    'Cactus Club Cafe English Bay': "Cactus Club Cafe", 
    'Cactus Club Cafe Richmond': "Cactus Club Cafe",
    'Cactus Club Cafe North Vancouver': "Cactus Club Cafe", 
    'Cactus Club Cafe Yaletown': "Cactus Club Cafe",
    'Cactus Club Cafe North Burnaby': "Cactus Club Cafe", 
    'Cactus Club Cafe Park Royal': "Cactus Club Cafe",
    'Cactus Club Cafe Byrne Road': "Cactus Club Cafe", 
    'Cactus Club Cafe': "Cactus Club Cafe",
    'Cactus Club Cafe Station Square': "Cactus Club Cafe",
    'JOEY Burrard': "JOEY",
    'JOEY Bentall One': "JOEY",
    'JOEY Shipyards': "JOEY",
    'JOEY Burnaby': "JOEY",
    'Tap & Barrel - Convention Centre': 'Tap & Barrel',
    'Tap & Barrel - Shipyards': 'Tap & Barrel',
    'Tap & Barrel - Olympic Village': 'Tap & Barrel',
    'Tap & Barrel - Bridges': 'Tap & Barrel'
}

# Use the .replace() method to perform the replacements
df['restaurant'] = df['restaurant'].replace(replacements)

In [10]:
# Sanity check
# Create a boolean mask to filter rows
mask_check = df['restaurant'].str.contains('Cactus Club')

# Use the mask to select the rows and create a new DataFrame
cactus_club_df_check = df[mask_check]

cactus_club_df_check.nunique()

restaurant      1
review        326
rating          2
Cuisine         9
dtype: int64

In [11]:
# Sanity check
# Create a boolean mask to filter rows
mask_check_j = df['restaurant'].str.contains('JOEY')

# Use the mask to select the rows and create a new DataFrame
joey_df_check = df[mask_check_j]

joey_df_check.nunique()

restaurant      1
review        115
rating          2
Cuisine         2
dtype: int64

In [12]:
# Sanity check
# Create a boolean mask to filter rows
mask_check_tb = df['restaurant'].str.contains('Tap & Barrel')

# Use the mask to select the rows and create a new DataFrame
tb_df_check = df[mask_check_tb]

tb_df_check.nunique()

restaurant     1
review        98
rating         1
Cuisine        1
dtype: int64
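The chain-name cleanup above lists every location by hand. As an alternative, here is a minimal sketch that maps any name starting with a known chain prefix to the bare chain name. The helper `normalize_chain_name` and the `CHAIN_PREFIXES` list are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import pandas as pd

# Assumed list of chain prefixes (taken from the three chains handled above)
CHAIN_PREFIXES = ["Cactus Club Cafe", "JOEY", "Tap & Barrel"]

def normalize_chain_name(name: str) -> str:
    """Return the bare chain name if `name` starts with a known prefix."""
    for prefix in CHAIN_PREFIXES:
        if name.startswith(prefix):
            return prefix
    return name  # not a known chain: leave the name unchanged

sample = pd.Series([
    "Cactus Club Cafe Coal Harbour",
    "JOEY Burrard",
    "Tap & Barrel - Shipyards",
    "Nook",
])
normalized = sample.map(normalize_chain_name)
print(normalized.tolist())
```

A prefix rule is shorter than the dictionary, but it would also rename any unrelated restaurant that happens to start with the same words, so the explicit dictionary is the safer choice when the list of locations is small.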

Let's group reviews by restaurant names and count the number of reviews for each restaurant.

In [14]:
# calculate numbers of reviews for each restaurant
df1 = df.groupby('restaurant')['review'].count().reset_index()
df1.head(3)

Unnamed: 0,restaurant,review
0,1927 Lobby Lounge at Rosewood Hotel Georgia,26
1,1931 Gallery Bistro,15
2,33 Acres Brewing,15


In [15]:
df1_sorted = df1.sort_values(by='review')
df1_sorted

Unnamed: 0,restaurant,review
75,Beach Ave Bar and Grill,6
436,Kinkura Sushi,7
179,Chung Chun Rice Hot Dog,7
567,Nightshade Yvr,7
629,Pho Khanh Express,7
...,...,...
222,Denny's,180
549,Moxies,186
125,Burgoo,240
134,Cactus Club Cafe,326


Freshslice Pizza stands out with a notably higher number of reviews compared to the other restaurants. To ensure a fair comparison, let's remove this outlier from the dataset.

In [16]:
# delete outlier

rows_to_delete = df[df['restaurant'] == 'Freshslice Pizza'].index

# Drop the rows by index
df = df.drop(rows_to_delete)
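The notebook removes the outlier by name. If one wanted to flag unusually large review counts automatically, a common rule of thumb is the 1.5 × IQR fence. The sketch below applies it to made-up counts; the rule and the numbers are assumptions for illustration, not the method used in this project.

```python
import pandas as pd

# Toy review counts per restaurant (made up for illustration)
counts = pd.Series(
    [15, 15, 16, 19, 26, 60, 400],
    index=["A", "B", "C", "D", "E", "F", "G"],
    name="review_count",
)

# Upper fence: third quartile plus 1.5 times the interquartile range
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Restaurants whose review count lies above the fence
outliers = counts[counts > upper_fence]
print(outliers)
```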

For the next step, let's consolidate the reviews for each restaurant into a single column.

In [20]:
# Convert reviews to strings
df['review'] = df['review'].astype(str)

In [21]:
# Group by restaurant and join all of its reviews into one newline-separated string
df_rest = df.groupby('restaurant')['review'].apply(lambda x: '\n'.join(x)).reset_index()
df_rest.head(3)

Unnamed: 0,restaurant,review
0,1927 Lobby Lounge at Rosewood Hotel Georgia,"My wife and I dropped in here today, around 1:..."
1,1931 Gallery Bistro,You used to get a nice lunch up above the art ...
2,33 Acres Brewing,Fourth stop on our Mt Pleasant self-guided bre...


In [22]:
# Merge the combined reviews with the number of reviews per restaurant
df_rest = df_rest.merge(df1, on='restaurant', how='left')
df_rest.head(5)

Unnamed: 0,restaurant,review_x,review_y
0,1927 Lobby Lounge at Rosewood Hotel Georgia,"My wife and I dropped in here today, around 1:...",26
1,1931 Gallery Bistro,You used to get a nice lunch up above the art ...,15
2,33 Acres Brewing,Fourth stop on our Mt Pleasant self-guided bre...,15
3,4 Stones Vegetarian Cuisine,I was traveling a lot and had a layover so I t...,16
4,75 West Coast Grill,I’d been curious about 75 West Coast Grill for...,19


In [23]:
# Rename the merged columns
df_rest = df_rest.rename(columns={'review_x': 'review', 'review_y': 'review_count'})
df_rest.head(3)

Unnamed: 0,restaurant,review,review_count
0,1927 Lobby Lounge at Rosewood Hotel Georgia,"My wife and I dropped in here today, around 1:...",26
1,1931 Gallery Bistro,You used to get a nice lunch up above the art ...,15
2,33 Acres Brewing,Fourth stop on our Mt Pleasant self-guided bre...,15
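The group, merge, and rename steps above can also be expressed as a single named aggregation. This is a sketch on a toy DataFrame (the rows are made up); it is an alternative formulation, not the code that produced the tables above.

```python
import pandas as pd

# Toy data with the same two columns used above (rows are made up)
toy = pd.DataFrame({
    "restaurant": ["Nook", "Nook", "Burgoo"],
    "review": ["Great pasta.", "Lovely pizza.", "Cozy stews."],
})

# One pass: join the review texts and count them per restaurant
per_restaurant = (
    toy.groupby("restaurant")
       .agg(
           review=("review", lambda s: "\n".join(s)),
           review_count=("review", "count"),
       )
       .reset_index()
)
print(per_restaurant)
```

Note that in the notebook the counts were computed before the outlier was dropped and before the reviews were cast to strings, so the two approaches could differ slightly on the real data.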


In [24]:
#save file
df_rest.to_csv('df_for_app', index=False)

To enhance the convenience of our application, let's include restaurant links in our DataFrame.

In [25]:
links = pd.read_csv('/Users/evgenijkucukov/Desktop/Brainstation/Portfolio/Restaurant_reviews_project/df_restaurants_info_1')

In [26]:
links.head(3)

Unnamed: 0,name,url
0,Freshslice Pizza,https://www.tripadvisor.com/Restaurant_Review-...
1,1. Hydra Estiatorio Mediterranean,https://www.tripadvisor.com/Restaurant_Review-...
2,2. Alouette Bistro,https://www.tripadvisor.com/Restaurant_Review-...


To merge our DataFrame with restaurant links, we'll begin by renaming a column in the links data table.

In [29]:
links = links.rename(columns = {'name': 'restaurant'})

In [30]:
links['restaurant'] = links['restaurant'].str.replace(r'^\d+\. ', '',regex=True)
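The pattern `^\d+\. ` matches a ranking prefix such as `1. ` at the start of a name. A quick sketch of its effect on a few names that appear in this notebook:

```python
import pandas as pd

names = pd.Series([
    "Freshslice Pizza",
    "1. Hydra Estiatorio Mediterranean",
    "2. Alouette Bistro",
    "33 Acres Brewing",
])

# Strip "<digits>. " only when it appears at the very start of the name
cleaned = names.str.replace(r"^\d+\. ", "", regex=True)
print(cleaned.tolist())
```

Names that merely start with digits, such as "33 Acres Brewing", are left untouched because the digits are not followed by a period and a space.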


In [31]:
df2 = df_rest.merge(links, on = 'restaurant', how='left' )
df2

Unnamed: 0,restaurant,review,review_count,url
0,1927 Lobby Lounge at Rosewood Hotel Georgia,"My wife and I dropped in here today, around 1:...",26,https://www.tripadvisor.com/Restaurant_Review-...
1,1931 Gallery Bistro,You used to get a nice lunch up above the art ...,15,https://www.tripadvisor.com/Restaurant_Review-...
2,33 Acres Brewing,Fourth stop on our Mt Pleasant self-guided bre...,15,https://www.tripadvisor.com/Restaurant_Review-...
3,4 Stones Vegetarian Cuisine,I was traveling a lot and had a layover so I t...,16,https://www.tripadvisor.com/Restaurant_Review-...
4,75 West Coast Grill,I’d been curious about 75 West Coast Grill for...,19,https://www.tripadvisor.com/Restaurant_Review-...
...,...,...,...,...
1021,iDen & Quan Ju De Beijing Duck House,这家环境和服务(full service)一流，是按欧美方式经营的fine dining. ...,15,https://www.tripadvisor.com/Restaurant_Review-...
1022,sushi California,"really nice sushi and appetizer big menu, diff...",15,https://www.tripadvisor.com/Restaurant_Review-...
1023,tetsu Sushi,A fabulous place for true sushi lovers. Excell...,15,https://www.tripadvisor.com/Restaurant_Review-...
1024,the apron,"Very tasty food, however, VERY expensive. Menu...",15,https://www.tripadvisor.com/Restaurant_Review-...


Let's ensure that we have a single link associated with each restaurant.

In [32]:
url_counts = df2.groupby('restaurant')['url'].nunique().reset_index()
url_counts

Unnamed: 0,restaurant,url
0,1927 Lobby Lounge at Rosewood Hotel Georgia,1
1,1931 Gallery Bistro,1
2,33 Acres Brewing,1
3,4 Stones Vegetarian Cuisine,1
4,75 West Coast Grill,1
...,...,...
985,iDen & Quan Ju De Beijing Duck House,1
986,sushi California,1
987,tetsu Sushi,1
988,the apron,1


In [33]:
# Check which restaurants have more than 1 link
duplicate_rest = url_counts[url_counts['url'] > 1]
duplicate_rest

Unnamed: 0,restaurant,url
64,Banana Leaf Malaysian Cuisine,2
81,Bellaggio Cafe,2
98,Bob Likes Thai Food,2
116,Browns Socialhouse,2
125,Burgoo,4
130,C-Lovers Fish & Chips,2
165,Chef Hung Taiwanese Beef Noodle,2
177,Chongqing,2
178,Chop Steakhouse & Bar,2
222,Denny's,4


In [34]:
# Let's delete duplicates
df2 = df2[~df2['restaurant'].duplicated(keep='first')]

In [35]:
# sanity check
url_counts2 = df2.groupby('restaurant')['url'].nunique().reset_index()
url_counts2

Unnamed: 0,restaurant,url
0,1927 Lobby Lounge at Rosewood Hotel Georgia,1
1,1931 Gallery Bistro,1
2,33 Acres Brewing,1
3,4 Stones Vegetarian Cuisine,1
4,75 West Coast Grill,1
...,...,...
985,iDen & Quan Ju De Beijing Duck House,1
986,sushi California,1
987,tetsu Sushi,1
988,the apron,1


In [38]:
# sanity check
duplicate_rest2 = url_counts2[url_counts2['url'] > 1]
duplicate_rest2 

Unnamed: 0,restaurant,url


Now we have one link for each restaurant.
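An alternative is to deduplicate the links table before merging, and let pandas verify the result with the `validate` argument of `merge`. The sketch below uses toy tables with placeholder `example.com` URLs; the variable names are hypothetical.

```python
import pandas as pd

# Toy tables (placeholder URLs, made up for illustration)
reviews_toy = pd.DataFrame({
    "restaurant": ["Burgoo", "Nook"],
    "review_count": [240, 60],
})
links_toy = pd.DataFrame({
    "restaurant": ["Burgoo", "Burgoo", "Nook"],
    "url": [
        "https://example.com/burgoo-1",
        "https://example.com/burgoo-2",
        "https://example.com/nook",
    ],
})

# Keep the first link per restaurant, then merge.
# validate="one_to_one" raises MergeError if either side has duplicate keys.
unique_links = links_toy.drop_duplicates(subset="restaurant", keep="first")
merged = reviews_toy.merge(
    unique_links, on="restaurant", how="left", validate="one_to_one"
)
print(merged)
```

Either way, keeping the first link is an arbitrary choice: for chains with several locations, the retained URL points to just one of them.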

In [39]:
#save file
df2.to_csv('df_for_app_3', index=False)

### Recommender System

First, we use `TfidfVectorizer` to transform the 'review' column into a TF-IDF matrix.

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = "english", min_df=2)
df_rest['review'] = df_rest['review'].fillna("")

TF_IDF_matrix = vectorizer.fit_transform(df_rest['review'])

In [41]:
TF_IDF_matrix.shape

(990, 8303)

Now that we have our data in a numeric format, how can we measure the similarity between two documents?

The rows of the matrix are just numeric arrays, that is, vectors. There are several common ways to compare two vectors 𝑎 and 𝑏; probably the most common in recommender systems is the cosine similarity:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

The closer two vectors (or documents) are in direction, the higher this measure. We can compute it with scikit-learn's `cosine_similarity` function:

In [42]:
# determine similarities
from sklearn.metrics.pairwise import cosine_similarity 
similarities = cosine_similarity(TF_IDF_matrix, dense_output=False)

In [43]:
# Check the shape
similarities.shape

(990, 990)
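To see what the entries of this matrix mean, here is a small sketch that computes the cosine similarity by hand with NumPy and compares it with scikit-learn's result. The three vectors are made up; `cosine` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # vectors pointing the same way
print(cosine(a, c))  # vectors with no terms in common

# scikit-learn computes all pairwise similarities at once
pairwise = cosine_similarity(np.vstack([a, b, c]))
print(pairwise.shape)
```

Because TF-IDF weights are never negative, the similarities between restaurants fall between 0 (no shared terms) and 1 (identical direction).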

Now that we can directly compare two restaurants, we can make recommendations of the form: if you like restaurant 𝑎, then you will also like restaurants 𝑏, 𝑐, 𝑑, etc.
We can do this by picking a candidate restaurant, taking its row in the similarity matrix, and finding the restaurants with the highest similarities:

In [None]:
def content_recommender(restaurant, similarities, review_threshold):
    """Return the 10 restaurants most similar to `restaurant`.

    Only restaurants with more than `review_threshold` reviews are considered.
    Relies on the global `df_rest` DataFrame built above.
    """
    # Look up the row index of the restaurant by name
    restaurant_index = df_rest[df_rest['restaurant'] == restaurant].index

    # Build a DataFrame of restaurant names, similarity scores, and review counts
    sim_df = pd.DataFrame({
        'restaurant': df_rest['restaurant'],
        'similarity': np.array(similarities[restaurant_index, :].todense()).squeeze(),
        'review_count': df_rest['review_count'],
    })

    # Keep only restaurants with more than `review_threshold` reviews
    sim_df = sim_df[sim_df['review_count'] > review_threshold]

    # Take the 10 most similar restaurants (the query restaurant itself is not excluded)
    top_restaurants = sim_df.sort_values(by='similarity', ascending=False).head(10)

    return top_restaurants

In [45]:
# Test the recommender
similar_restaurants = content_recommender("Nook", similarities, review_threshold=10 )
similar_restaurants.head(10)

Unnamed: 0,restaurant,similarity,review_count
570,Nook,1.0,60
742,Sopra Sotto,0.39811,60
848,The Firewood Cafe,0.333294,15
564,Nicli Antica Pizzeria,0.330177,15
272,Famoso Neapolitan Pizzeria,0.323567,17
19,Alberello Pizzeria,0.320284,15
637,Pizza Carano,0.315551,15
603,Pacifico Pizzeria & Ristorante,0.311179,15
687,Roundtable Pizza,0.305513,15
373,Ignite Pizzeria Mount Pleasant,0.296044,15


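The output shows the 10 highest-scoring restaurants for "Nook". The first row is "Nook" itself (similarity 1.0), so the other nine are the actual recommendations. The closer the similarity score is to 1, the more similar a restaurant is to "Nook". I've chosen a threshold of 10 because many restaurants have fewer than 10 reviews. Increasing the threshold would have a significant impact on the recommendations: most of the restaurants listed above have only 15 to 17 reviews, so a threshold of 17 or more would filter them out.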
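For use in the application, it may be convenient to leave the query restaurant out of its own results and to fail clearly on an unknown name. The sketch below shows one way to do that on toy data. The function `recommend_excluding_self`, its parameters, and the toy restaurants are all hypothetical, and it uses a dense similarity matrix for simplicity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy restaurants (made up for illustration)
toy_rest = pd.DataFrame({
    "restaurant": ["Pasta Place", "Pizza Corner", "Sushi Bar", "Noodle Hut"],
    "review": [
        "fresh pasta and wood fired pizza",
        "wood fired pizza with a thin crust",
        "fresh sushi and sashimi",
        "pasta noodle soup",
    ],
    "review_count": [60, 15, 20, 5],
})
toy_sims = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(toy_rest["review"])
)

def recommend_excluding_self(name, frame, sims, review_threshold=10, top_n=10):
    """Rank restaurants by similarity to `name`, leaving `name` itself out."""
    matches = frame.index[frame["restaurant"] == name]
    if len(matches) == 0:
        raise KeyError(f"Unknown restaurant: {name!r}")
    pos = frame.index.get_loc(matches[0])  # row position in `sims`

    scored = frame[["restaurant", "review_count"]].assign(similarity=sims[pos])
    keep = (scored["review_count"] > review_threshold) & (scored["restaurant"] != name)
    return scored[keep].sort_values("similarity", ascending=False).head(top_n)

result = recommend_excluding_self("Pasta Place", toy_rest, toy_sims)
print(result)
```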