![Screen%20Shot%202020-03-26%20at%208.10.26%20PM.png](attachment:Screen%20Shot%202020-03-26%20at%208.10.26%20PM.png)

#### For building a content-based recommendation system, I will be using the restaurant metadata, and will be extracting features from the reviews.

In [1]:
# Importing important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# For Natural Language Processing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# For scaling the data
from sklearn.preprocessing import StandardScaler

# For calculating cosine simiarity
from sklearn.metrics.pairwise import cosine_similarity

---

### <font color='darkcyan'> Importing the Dataset </font> 


In [3]:
# Importing the dataset

colab = pd.read_csv('Data/final_dataset.csv')

In [4]:
# Looking at the number of columns and rows

colab.shape

(313223, 10)

In [5]:
# Peeking into the dataset

colab.head()

Unnamed: 0,business_id,categories,latitude,longitude,restaurant_name,review_count,avg_rating,user_rating,review,user_id
0,C9oCPomVP0mtKa8z99E3gg,"Bakeries, Food",43.754093,-79.349548,Bakery Gateau,8,4.5,3.0,Oh? Another patbingsu review? This one was bet...,orh0HRUNCWuQMt9Iia_osg
1,C9oCPomVP0mtKa8z99E3gg,"Bakeries, Food",43.754093,-79.349548,Bakery Gateau,8,4.5,5.0,What really earns them their 5 stars is the un...,G5hDXvDMNuQ3JQnGCKqsKA
2,C9oCPomVP0mtKa8z99E3gg,"Bakeries, Food",43.754093,-79.349548,Bakery Gateau,8,4.5,4.0,Located inside the Galleria Supermarket.\nStop...,0Suzo_S25mTGJfrlcl1CfA
3,C9oCPomVP0mtKa8z99E3gg,"Bakeries, Food",43.754093,-79.349548,Bakery Gateau,8,4.5,5.0,Yummy cakes! U should try their sweet potato c...,cc7Pav2IUvAkVeqylvAsYg
4,C9oCPomVP0mtKa8z99E3gg,"Bakeries, Food",43.754093,-79.349548,Bakery Gateau,8,4.5,5.0,One of my favorite bakeries! This bakery is in...,keLUgL_4y60BkppiAsIk8Q


In [6]:
# Let's look at the columns

colab.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 313223 entries, 0 to 313222
Data columns (total 10 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   business_id      313223 non-null  object 
 1   categories       313223 non-null  object 
 2   latitude         313223 non-null  float64
 3   longitude        313223 non-null  float64
 4   restaurant_name  313223 non-null  object 
 5   review_count     313223 non-null  int64  
 6   avg_rating       313223 non-null  float64
 7   user_rating      313223 non-null  float64
 8   review           313223 non-null  object 
 9   user_id          313223 non-null  object 
dtypes: float64(4), int64(1), object(5)
memory usage: 23.9+ MB


In [7]:
# Let's check the number of unique business_ids

colab['business_id'].nunique()

5493

There are 5493 unique businesses in Toronto.

In [8]:
# Let's check the number of unique names of the restaurants

colab['restaurant_name'].nunique()

4363

`The number of unique restaurant names is less than the number of unique business ids.` So, what's happening here!

Actually, quite a lot of restaurants having the same name have different business ids. This could be because a restaurant francise in different locations could be owned by different individuals/businesses.

The content of the restaurants by the same name but different business ids will be similar if not exactly same. 

Since, I am building a content-based recommendation engine, where the user has to input a name of a restaurant, to be recommended similar restaurants, I plan to use the name of the restaurant as the unique indentifier rather than it's business id. 

This way the system won't end up recommending a restaurant with the same name as the input by the user.

---

### <font color='darkcyan'> Selecting features for building the recommendation system </font> 




| Column | Remarks                                            |
|:-------|:---------------------------------------------------|
|business_id|Keeping this column, as it might be useful later for joining with some other tables, as it's the unique identifier key for other tables.|
|categories|This column is non-numeric. It provides information on the category of the restaurants. A restaurant can have more than one category. We will be one-hot encoding this column|
|restaurant_name|Name of the restaurant. We will be using this as the unique identifier code for the recommendation system.|
|review_count|The number of times a restaurant has been reviewed. This is a numeric column.|
|avg_rating| This is the average rating of a restaurant. This is a numeric column.|
|review| Reviews given by the users to the restaurants. I'll extract features from this column using Natural Language Processing|

<br/>


In [9]:
# Selecting features for building the recommendation system

colab_selected = colab[['business_id', 'categories', 'restaurant_name','review_count', 'avg_rating', 'review']]


---
### <font color='darkcyan'> Extracting features from the reviews by using Natural Language Processing (NLP) </font> 

</br></br>

I am making a new dataframe with just the name of the restaurants and the reviews. Then I will group these reviews by restaurants and finally extract features from them.


In [10]:
# Creating a new dataframe with the name of the restaurant and reviews

reviews = colab_selected[['restaurant_name','review']]

In [11]:
# Checking the new dataframe

reviews.head()

Unnamed: 0,restaurant_name,review
0,Bakery Gateau,Oh? Another patbingsu review? This one was bet...
1,Bakery Gateau,What really earns them their 5 stars is the un...
2,Bakery Gateau,Located inside the Galleria Supermarket.\nStop...
3,Bakery Gateau,Yummy cakes! U should try their sweet potato c...
4,Bakery Gateau,One of my favorite bakeries! This bakery is in...


In [12]:
# Checking the number of columns and rows

reviews.shape

(313223, 2)

The new dataframe looks good. 

I will now check how many reviews have each restaurant got.

In [13]:
# Checking the number of reviews received by each restaurant

reviews['restaurant_name'].value_counts()

Pai Northern Thai Kitchen       2177
Banh Mi Boys                    1636
Khao San Road                   1467
KINKA IZAKAYA ORIGINAL          1425
Seven Lives Tacos Y Mariscos    1183
                                ... 
Pizzeria Bosco                     3
Madam Boeuf And Flea               3
Trattotia Gusto                    3
Stamps Lane                        3
Dulce Aroma                        3
Name: restaurant_name, Length: 4363, dtype: int64

There is a vast variety in the number of reviews received by each restaurant, ranging from 2177 to just 3 reviews.

If I group by the restaurant, the restaurant having more reviews will have a lot more text

In [14]:
# Grouping by the restaurant name and selecting the first five reviews or lesser (if reviews are less than 5)

reviews5 = reviews.groupby('restaurant_name').nth([0,1,2,3,4]).reset_index() 

In [15]:
# Checking the dataframe

reviews5.head(20)

Unnamed: 0,restaurant_name,review
0,'ONO Poké Bar,Honestly this place is the best! The staff are...
1,'ONO Poké Bar,4/5\n\nI was craving a poke bowl for the longe...
2,'ONO Poké Bar,Value packed deliciousness. It's more of a tak...
3,'ONO Poké Bar,"It was in a very obscure place, right under a ..."
4,'ONO Poké Bar,Ratings:\nFood: 7.5/10\nDrink: n/a\nService: 9...
5,00 Gelato,One of the best gelatos I have had in Toronto....
6,00 Gelato,"Came by based on all the excellent reviews, an..."
7,00 Gelato,So based on the fact that what I thought was a...
8,00 Gelato,Wow! I am pleasantly surprised at this gelato...
9,00 Gelato,Very tasty gelato. Very friendly staff. They w...


In [16]:
# Grouping the reviews by the name of the restaurant
final_reviews = reviews5.groupby('restaurant_name').agg(lambda x: ' '.join(x)).reset_index() 

In [17]:
final_reviews.shape

(4363, 2)

In [18]:
final_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4363 entries, 0 to 4362
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   restaurant_name  4363 non-null   object
 1   review           4363 non-null   object
dtypes: object(2)
memory usage: 68.3+ KB


In [19]:
final_reviews.head()

Unnamed: 0,restaurant_name,review
0,'ONO Poké Bar,Honestly this place is the best! The staff are...
1,00 Gelato,One of the best gelatos I have had in Toronto....
2,0109 Dessert & Chocolate,The atmosphere of the eatery is very much like...
3,1 Plus 2 Pizza & Wings,This place got lucky I had to give a star to s...
4,100 Percent Korean,We ordered a pick up of pork bone soup.\n\nWhe...


All looks good. Let's extract features from the reviews now.

In [20]:
import nltk
nltk.download('stopwords')
import string

from nltk.corpus import stopwords 
ENGLISH_STOP_WORDS = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashimamarwaha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
REMOVE_WORDS = ['restaurant', 'food', 'table', 'place', 'dish']

In [22]:
# Defining a custom function for tokenizing 

stemmer = nltk.stem.PorterStemmer()

def my_tokenizer(sentence):
    
    sentence = sentence.replace('\n','')
    
     # Remove numbers
    list_digit= ['0','1', '2', '3', '4', '5', '6', '7', '8', '9']
    for digit in list_digit:
        sentence = sentence.replace(digit,'')
    
    # Remove punctuation and set to lower case
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark,'').lower()

    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []
    
        
    # Remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if (not word in ENGLISH_STOP_WORDS) and (word!='') and (not word in REMOVE_WORDS):

            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

In [23]:
# Building a basic tf-idf vector using the above created tokenizer function

from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the Vectorizer
tfidf = TfidfVectorizer(min_df= 100, max_df= 0.7, tokenizer=my_tokenizer)

# Fit the Vectorizer to the training data
tfidf.fit(final_reviews['review'])

# Transform the training data and the validation data
reviews_tfidf = tfidf.transform(final_reviews['review'])

In [24]:
reviews_tfidf.shape

(4363, 1324)

In [25]:
# Let's look at the top 20 words

word_weights = np.array(np.sum(reviews_tfidf, axis=0)).reshape((-1,))

words = np.array(tfidf.get_feature_names())
words_df = pd.DataFrame({"word": words, 
                         "weight": word_weights})

words_df.sort_values(by="weight", ascending=False).head(20)

Unnamed: 0,word,weight
211,chicken,213.599043
922,realli,159.715179
1303,would,143.952282
475,fri,140.311827
1145,tast,139.996731
80,back,139.659048
29,also,139.078027
848,pizza,137.468035
242,come,136.367059
753,nice,134.551363


In [26]:
# Transforming our original dataframe using the tfidf vector

tfidf_result = (reviews_tfidf).toarray()
tfidf_df = pd.DataFrame(tfidf_result, columns = tfidf.get_feature_names())
reviews_features = pd.concat([final_reviews, tfidf_df], axis=1)

In [27]:
# Peeking into the dataframe

reviews_features.head()

Unnamed: 0,restaurant_name,review,abl,absolut,accept,access,accommod,accompani,across,actual,...,yet,yong,york,youd,youll,young,your,youv,yum,yummi
0,'ONO Poké Bar,Honestly this place is the best! The staff are...,0.0,0.0,0.056812,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00 Gelato,One of the best gelatos I have had in Toronto....,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.059762,0.0,0.0,0.0
2,0109 Dessert & Chocolate,The atmosphere of the eatery is very much like...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1 Plus 2 Pizza & Wings,This place got lucky I had to give a star to s...,0.0,0.0,0.0,0.0,0.0,0.0,0.059777,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,100 Percent Korean,We ordered a pick up of pork bone soup.\n\nWhe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

### <font color='darkcyan'> Categories </font> 


In [28]:
# Just selecting all the categories for now

categories = colab_selected[['restaurant_name','categories']]

In [29]:
# Grouping by the name of the restaurant and selecting the restaurant once

categories = categories.groupby('restaurant_name').nth([0]).reset_index() 

In [30]:
# Checking the number of rows and columns

categories.shape

(4363, 2)

In [31]:
categories.head()

Unnamed: 0,restaurant_name,categories
0,'ONO Poké Bar,"Poke, Asian Fusion, Hawaiian, Food, Restaurant..."
1,00 Gelato,"Belgian, Waffles, Restaurants, Ice Cream & Fro..."
2,0109 Dessert & Chocolate,"Desserts, Cafes, Food, Specialty Food, Restaur..."
3,1 Plus 2 Pizza & Wings,"Pizza, Restaurants, Food, Chicken Wings"
4,100 Percent Korean,"Korean, Restaurants"


In [32]:
# Generating a list of categories

list_categories = ', '.join(list(categories['categories'].unique()))
list_categories = list_categories.split(', ')

In [33]:
# Creating a list and frequency of the categories

from collections import Counter, defaultdict
c = Counter(list_categories)
print(len(c))

337


Each restaurant has more than one categorical tag. I am selecting the top 100 categoris as the feature. I will one-hot encode these categories.

In [37]:
# Selecting top 102 categories. 
top_cat = c.most_common(102)
top_cat[:19]

[('Restaurants', 2906),
 ('Food', 1195),
 ('Nightlife', 701),
 ('Bars', 681),
 ('Cafes', 333),
 ('Canadian (New)', 330),
 ('Coffee & Tea', 330),
 ('Breakfast & Brunch', 328),
 ('Sandwiches', 289),
 ('Chinese', 257),
 ('Bakeries', 248),
 ('Specialty Food', 244),
 ('Desserts', 222),
 ('Event Planning & Services', 202),
 ('Italian', 190),
 ('American (Traditional)', 190),
 ('Pubs', 186),
 ('Fast Food', 175),
 ('Japanese', 165)]

In [38]:
# Creating a list of top 100 categories
list_top_cat =[]

for cat in top_cat[2:]: # not including restaurant and food tags as they either of them appears in all the restaurantsdd
    x = cat[0]
    list_top_cat.append(x) 

In [39]:
len(list_top_cat)

100



##### Adding top category columns to the dataframe

In [40]:
# Creating new columns with name of the categories with zero values

for cat in list_top_cat:
    categories[cat] = 0

In [41]:
# Checking the dataframe
categories.head()

Unnamed: 0,restaurant_name,categories,Nightlife,Bars,Cafes,Canadian (New),Coffee & Tea,Breakfast & Brunch,Sandwiches,Chinese,...,Creperies,Poutineries,Southern,Falafel,Breweries,Turkish,Meat Shops,Afghan,Himalayan/Nepalese,Bagels
0,'ONO Poké Bar,"Poke, Asian Fusion, Hawaiian, Food, Restaurant...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,00 Gelato,"Belgian, Waffles, Restaurants, Ice Cream & Fro...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0109 Dessert & Chocolate,"Desserts, Cafes, Food, Specialty Food, Restaur...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1 Plus 2 Pizza & Wings,"Pizza, Restaurants, Food, Chicken Wings",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,100 Percent Korean,"Korean, Restaurants",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
# Checking the number of rows and columns
categories.shape

(4363, 102)

The dataframe looks good.

##### Now looping over the categories column to one hot encode the column

In [43]:
# Creating a new column in the dataframe

categories['list_cat'] = categories['categories'].apply(lambda x: x.split(', '))

In [45]:
# Checking the column
categories['list_cat'][:19]

0     [Poke, Asian Fusion, Hawaiian, Food, Restauran...
1     [Belgian, Waffles, Restaurants, Ice Cream & Fr...
2     [Desserts, Cafes, Food, Specialty Food, Restau...
3             [Pizza, Restaurants, Food, Chicken Wings]
4                                 [Korean, Restaurants]
5                 [Italian, Mediterranean, Restaurants]
6     [Comfort Food, Comedy Clubs, Restaurants, Nigh...
7      [Ramen, Chinese, Restaurants, Japanese, Noodles]
8                    [Mediterranean, Restaurants, Food]
9                                  [Pizza, Restaurants]
10                                 [Restaurants, Pizza]
11    [Bakeries, Food, Cafes, Gluten-Free, Restaurants]
12    [Salad, Sandwiches, Restaurants, Bars, Burgers...
13    [American (Traditional), Nightlife, Bars, Brea...
14    [Cafes, Bars, Hookah Bars, Restaurants, Lounge...
15    [Music Venues, Canadian (New), Arts & Entertai...
16    [Indian, Restaurants, Venues & Event Spaces, E...
17    [Chicken Shop, Burgers, Chicken Wings, Fas

In [46]:
# Looping over the new column to get the one-hot encoded category values. 
# Wherever the categoy would match that value will be udated to 1 for that particular row

index = 0

for cat in categories['list_cat'][0:4362]:
    for sub_cat in cat:
        if sub_cat in list_top_cat:
            categories[sub_cat][index]=1
    index += 1

In [47]:
categories.tail()

Unnamed: 0,restaurant_name,categories,Nightlife,Bars,Cafes,Canadian (New),Coffee & Tea,Breakfast & Brunch,Sandwiches,Chinese,...,Poutineries,Southern,Falafel,Breweries,Turkish,Meat Shops,Afghan,Himalayan/Nepalese,Bagels,list_cat
4358,süüp health bar,"Restaurants, Salad, Vegetarian, Soup, Vegan, B...",1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[Restaurants, Salad, Vegetarian, Soup, Vegan, ..."
4359,teashop 168,"Taiwanese, Cafes, Restaurants, Coffee & Tea, Food",0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,"[Taiwanese, Cafes, Restaurants, Coffee & Tea, ..."
4360,thairoomgrand,"Thai, Restaurants",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[Thai, Restaurants]"
4361,yuan yuan Chinese Restaurant,"Restaurants, Chinese",0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,"[Restaurants, Chinese]"
4362,z-teca Gourmet Burritos,"Restaurants, American (New), Fast Food, Gluten...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[Restaurants, American (New), Fast Food, Glute..."


In [48]:
categories.shape

(4363, 103)

---

### Joining text data with category data

In [49]:
# Merging the two dataframes

df1 = pd.merge(reviews_features, categories, on ='restaurant_name')

In [50]:
# Checking the shape

df1.shape

(4363, 1428)

In [51]:
# Peeking into the dataframe
df1.head()

Unnamed: 0,restaurant_name,review,abl,absolut,accept,access,accommod,accompani,across,actual,...,Poutineries,Southern,Falafel,Breweries,Turkish,Meat Shops,Afghan,Himalayan/Nepalese,Bagels,list_cat
0,'ONO Poké Bar,Honestly this place is the best! The staff are...,0.0,0.0,0.056812,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,"[Poke, Asian Fusion, Hawaiian, Food, Restauran..."
1,00 Gelato,One of the best gelatos I have had in Toronto....,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,"[Belgian, Waffles, Restaurants, Ice Cream & Fr..."
2,0109 Dessert & Chocolate,The atmosphere of the eatery is very much like...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,"[Desserts, Cafes, Food, Specialty Food, Restau..."
3,1 Plus 2 Pizza & Wings,This place got lucky I had to give a star to s...,0.0,0.0,0.0,0.0,0.0,0.0,0.059777,0.0,...,0,0,0,0,0,0,0,0,0,"[Pizza, Restaurants, Food, Chicken Wings]"
4,100 Percent Korean,We ordered a pick up of pork bone soup.\n\nWhe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050061,...,0,0,0,0,0,0,0,0,0,"[Korean, Restaurants]"


In [52]:
# Getting the remaining columns from the first dataframe and joining
# with categories and review features
rem_cols = colab_selected[['restaurant_name','review_count', 'avg_rating']]

# Grouping by the name of the restaurant
rem_cols_final =rem_cols.groupby('restaurant_name').nth([0]).reset_index() 

In [53]:
# Finally merging all the columns

final_content = pd.merge(df1, rem_cols_final, on="restaurant_name", how='left')

In [54]:
# Peeking into the dataset

final_content.head()

Unnamed: 0,restaurant_name,review,abl,absolut,accept,access,accommod,accompani,across,actual,...,Falafel,Breweries,Turkish,Meat Shops,Afghan,Himalayan/Nepalese,Bagels,list_cat,review_count,avg_rating
0,'ONO Poké Bar,Honestly this place is the best! The staff are...,0.0,0.0,0.056812,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,"[Poke, Asian Fusion, Hawaiian, Food, Restauran...",82,4.0
1,00 Gelato,One of the best gelatos I have had in Toronto....,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,"[Belgian, Waffles, Restaurants, Ice Cream & Fr...",61,4.5
2,0109 Dessert & Chocolate,The atmosphere of the eatery is very much like...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,"[Desserts, Cafes, Food, Specialty Food, Restau...",194,3.5
3,1 Plus 2 Pizza & Wings,This place got lucky I had to give a star to s...,0.0,0.0,0.0,0.0,0.0,0.0,0.059777,0.0,...,0,0,0,0,0,0,0,"[Pizza, Restaurants, Food, Chicken Wings]",5,2.0
4,100 Percent Korean,We ordered a pick up of pork bone soup.\n\nWhe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050061,...,0,0,0,0,0,0,0,"[Korean, Restaurants]",191,4.5


In [55]:
# Checking the shape

final_content.shape

(4363, 1430)

In [56]:
# Saving the dataframe

csv_name = "final_content.csv"
final_content.to_csv(csv_name, index=False)

In [57]:
# Reading in the dataframe

final_content = pd.read_csv('final_content.csv')

In [58]:
# Sanity check

final_content.head()

Unnamed: 0,restaurant_name,review,abl,absolut,accept,access,accommod,accompani,across,actual,...,Falafel,Breweries,Turkish,Meat Shops,Afghan,Himalayan/Nepalese,Bagels,list_cat,review_count,avg_rating
0,'ONO Poké Bar,Honestly this place is the best! The staff are...,0.0,0.0,0.056812,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,"['Poke', 'Asian Fusion', 'Hawaiian', 'Food', '...",82,4.0
1,00 Gelato,One of the best gelatos I have had in Toronto....,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,"['Belgian', 'Waffles', 'Restaurants', 'Ice Cre...",61,4.5
2,0109 Dessert & Chocolate,The atmosphere of the eatery is very much like...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,"['Desserts', 'Cafes', 'Food', 'Specialty Food'...",194,3.5
3,1 Plus 2 Pizza & Wings,This place got lucky I had to give a star to s...,0.0,0.0,0.0,0.0,0.0,0.0,0.059777,0.0,...,0,0,0,0,0,0,0,"['Pizza', 'Restaurants', 'Food', 'Chicken Wings']",5,2.0
4,100 Percent Korean,We ordered a pick up of pork bone soup.\n\nWhe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050061,...,0,0,0,0,0,0,0,"['Korean', 'Restaurants']",191,4.5


I am using cosine similarity metrics to compute the similarity between the restaurants and ultimately make recommendations. If we think of the various features of each restaurant being a vector in a multi-dimensional space, this metric captures the orientation rather than the distance between the vectors. Mathematically, it measures the cosine of the angle between the two vectors. 

In [59]:
# Dropping the non-numerical columns to compute the cosine similarity

cosine_df = final_content.drop(['review', 'restaurant_name', 'list_cat', 'categories'], axis=1)

In [60]:
# Checking the shape
cosine_df.shape

(4363, 1426)

In [61]:
# Calculating the cosine similarity between the restaurants
similarity_score = cosine_similarity(cosine_df, cosine_df)

In [62]:
# Checking the score
similarity_score

array([[1.        , 0.99910665, 0.99926019, ..., 0.85425375, 0.65696851,
        0.99847979],
       [0.99910665, 1.        , 0.99809935, ..., 0.86607485, 0.67385098,
        0.9989015 ],
       [0.99926019, 0.99809935, 1.        , ..., 0.83877675, 0.6352106 ,
        0.99681027],
       ...,
       [0.85425375, 0.86607485, 0.83877675, ..., 1.        , 0.90498705,
        0.87317848],
       [0.65696851, 0.67385098, 0.6352106 , ..., 0.90498705, 1.        ,
        0.68423581],
       [0.99847979, 0.9989015 , 0.99681027, ..., 0.87317848, 0.68423581,
        1.        ]])

In [63]:
similarity_score.shape

(4363, 4363)

It is an array of 4363 by 4363. I will convert it into a dataframe with the index and column values as the restaurant name, and the row value will represent the cosine similarity between the column and the index restaurants. The dagonal value which will represent the cosine similarity between the restaurant themselves will be 1. the value will be between 0 (no similarity) and 1 (absolute similarity). 

In [64]:
sim = pd.DataFrame(similarity_score, columns=final_content['restaurant_name'], index=final_content['restaurant_name'])

In [65]:
sim.head()

restaurant_name,'ONO Poké Bar,00 Gelato,0109 Dessert & Chocolate,1 Plus 2 Pizza & Wings,100 Percent Korean,12 Tables,120 Diner,1915 Lan Zhou Ramen,2 Bros Cuisine,241 Pizza,...,iQ Food,illstyl3 Sammies,lbs. Restaurant,mmmuffins,pico de gallo,süüp health bar,teashop 168,thairoomgrand,yuan yuan Chinese Restaurant,z-teca Gourmet Burritos
restaurant_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'ONO Poké Bar,1.0,0.999107,0.99926,0.900036,0.99944,0.932998,0.98771,0.999363,0.982403,0.84338,...,0.996937,0.998153,0.999316,0.656977,0.824624,0.988253,0.80408,0.854254,0.656969,0.99848
00 Gelato,0.999107,1.0,0.998099,0.90736,0.998329,0.9408,0.990305,0.998796,0.986039,0.852321,...,0.997466,0.998888,0.998336,0.674128,0.837029,0.990301,0.816436,0.866075,0.673851,0.998902
0109 Dessert & Chocolate,0.99926,0.998099,1.0,0.890125,0.999907,0.922502,0.983618,0.999154,0.976927,0.831609,...,0.995403,0.996332,0.999619,0.635281,0.808071,0.984793,0.789079,0.838777,0.635211,0.99681
1 Plus 2 Pizza & Wings,0.900036,0.90736,0.890125,1.0,0.891998,0.943068,0.927787,0.898704,0.936476,0.945284,...,0.910814,0.914144,0.89359,0.808159,0.905385,0.921488,0.881771,0.92232,0.808058,0.911531
100 Percent Korean,0.99944,0.998329,0.999907,0.891998,1.0,0.924467,0.98446,0.999324,0.97803,0.833724,...,0.995696,0.996782,0.999687,0.639159,0.811133,0.98557,0.791131,0.841642,0.639278,0.997217


---

### Final Recommendation Function

In [66]:
def content_recommendations(name):
    
    # Making a dataframe to hold the list of recommendations sorted by cosine similarity in descending order
    recommended_restaurants = pd.DataFrame(list((sim[name].sort_values(ascending=False)[1:6]).index))
    
    recommended_restaurants.columns = ['Recommended Restaurants']
    
    
    return recommended_restaurants
        

In [67]:
content_recommendations("Afghan Cuisine")

Unnamed: 0,Recommended Restaurants
0,Tabriz Persian Cookhouse
1,Naan & Kabob
2,BB Cafe
3,Shater Abbas Express
4,Kovalsky Restaurant


In [68]:
content_recommendations("Siddartha")

Unnamed: 0,Recommended Restaurants
0,Spice Indian Bistro
1,Indian Crown
2,Chandni Chowk Restaurant
3,High Park Spicy House
4,Kailash Parbat


In [69]:
content_recommendations("Spicy Dragon")

Unnamed: 0,Recommended Restaurants
0,Lotus Garden Hakka Indian Style Chinese
1,Bamboo Buddha Chinese Resturant
2,98 Aroma
3,Canton Chilli Restaurant
4,The Royal Chinese Restaurant


While the content-based model can provide recommendations even when we don’t have any user data, it does not capture user preferences or provides recommendations across various categories.