## Modelling

In this section we will create a recommendation system using the datasets to solve our main problem.
There are different types of recomentation models, in this project we will focus on three types of recommentation systems

* 1. Content-Based Recommender systems
* 2. Collaborative Filtering Systems
* 3. Deep Neural Networks

In [42]:
# importing necesarry packages

import collections
import folium
import json 
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import string
import pickle
from surprise import Reader , Dataset
from tabulate import tabulate
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import models ,layers, optimizers , losses, regularizers, metrics
from wordcloud import WordCloud

from understanding import DataLoader, DataInfo


# plotting styles
plt.style.use("fivethirtyeight")
%matplotlib inline

#### i) Cleaned Restaurant Informational Data

In [43]:
# Instantiate the DataLoader class
loader= DataLoader()

# Instantiate the DataInfo class
summary= DataInfo()

# Reading the restaurants csv file
restaurant_data= loader.read_data("data/filtered_restaurants_data.csv")

# Summary information on the restaurant df
print(f'\nRESTAURANT DATASET INFORMATION\n' + '=='*20 + '\n')
summary.info(restaurant_data)


RESTAURANT DATASET INFORMATION

Shape of the dataset : (38552, 15) 

Column Names
Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours', 'location'],
      dtype='object') 
 

Data Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38552 entries, 0 to 38551
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   38552 non-null  object 
 1   name          38552 non-null  object 
 2   address       38552 non-null  object 
 3   city          38552 non-null  object 
 4   state         38552 non-null  object 
 5   postal_code   38552 non-null  object 
 6   latitude      38552 non-null  float64
 7   longitude     38552 non-null  float64
 8   stars         38552 non-null  float64
 9   review_count  38552 non-null  int64  
 10  is_open       38552 non-null  int64  
 11  attribu

Unnamed: 0,latitude,longitude,stars,review_count,is_open
count,38552.0,38552.0,38552.0,38552.0,38552.0
mean,36.899127,-87.678808,3.610383,110.331215,0.65457
std,6.15935,13.596218,0.748755,230.42083,0.475514
min,27.564457,-120.026076,1.0,5.0,0.0
25%,30.026555,-90.206195,3.0,18.0,0.0
50%,38.810572,-86.011028,3.5,47.0,1.0
75%,39.956489,-75.348735,4.0,117.0,1.0
max,53.649743,-74.685404,5.0,7568.0,1.0


Dataset Overview


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,location
0,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",Italian,Unknown,"State:Missouri, City:Affton, Address:8025 Mack..."
1,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",American (Traditional),Unknown,"State:Missouri, City:Affton, Address:8025 Mack..."
2,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",Greek,Unknown,"State:Missouri, City:Affton, Address:8025 Mack..."


#### ii) Cleaned User Review Data

In [44]:
# Loading the users csv file
users_data= loader.read_data("data/cleaned_users_data.csv")

# Summary information on the user review data
print(f'\nUSER DATASET INFORMATION\n' + '=='*20 + '\n')
summary.info(users_data)


USER DATASET INFORMATION

Shape of the dataset : (429771, 6) 

Column Names
Index(['review_id', 'user_id', 'business_id', 'stars', 'text', 'date'], dtype='object') 
 

Data Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429771 entries, 0 to 429770
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   review_id    429771 non-null  object
 1   user_id      429771 non-null  object
 2   business_id  429771 non-null  object
 3   stars        429771 non-null  int64 
 4   text         429771 non-null  object
 5   date         429771 non-null  object
dtypes: int64(1), object(5)
memory usage: 19.7+ MB

Descriptive Statistics


Unnamed: 0,stars
count,429771.0
mean,3.820449
std,1.513978
min,1.0
25%,3.0
50%,5.0
75%,5.0
max,5.0


Dataset Overview


Unnamed: 0,review_id,user_id,business_id,stars,text,date
0,iBUJvIOkToh2ZECVNq5PDg,iAD32p6h32eKDVxsPHSRHA,YB26JvvGS2LgkxEKOObSAw,5,I've been eating at this restaurant for over 5...,2021-01-08 01:49:36
1,HgEofz6qEQqKYPT7YLA34w,rYvWv-Ny16b1lMcw1IP7JQ,jfIwOEXcVRyhZjM4ISOh4g,1,How does a delivery person from here get lost ...,2021-01-02 00:19:00
2,Kxo5d6EOnOE-vERwQf2a1w,2ntnbUia9Bna62W0fqNcxg,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45


In [45]:
import pandas as pd

def new_df(data):
    """
    The function takes in a dataframe and groups it by the 'business_id' column.
    It then combines all the text values in the 'text' column into one big text
    for each 'business_id' and assigns it to the 'review' column.
    """
    # Group by 'business_id' and aggregate 'text' into a single string
    grouped_df = data.groupby('business_id')['text'].apply(lambda x: ' '.join(x)).reset_index()
    
    # Rename the 'text' column to 'review'
    grouped_df.rename(columns={'text': 'review'}, inplace=True)
    
    return grouped_df

# Example usage
users_data_2 = new_df(users_data)
print(users_data_2.head())


              business_id                                             review
0  ---kPU91CF4Lq2-WlRu9Lw  This is a food truck alongside a picnic area. ...
1  --0iUa4sNDFiZFrAdIWhZQ  This place makes the best, most authentic pupu...
2  --epgcb7xHGuJ-4PUeSLAw  They are always behind even if there is 10 peo...
3  --hF_3v1JmU9nlu4zfXJ8Q  Really excited to get healthier options on the...
4  --lqIzK-ZVTtgwiQM63XgQ  Worst Wendy's I've ever been to. I don't know ...


In [46]:
def decompress(x):
    """
    The function takes in a dictionary and returns only the keys that have their values not being False   
    """
      
    list_ = []
    
    # Check if x is a string
    if not isinstance(x, str):
        return ' '
    
    # evaluate the attributes column to convert it from a string to a dictionary
    try:
        data_dict = eval(x)
    except Exception as e:
        print(f"Error evaluating {x}: {str(e)}")
        return ' '
    
    # iterate through the key-value pairs in the dictionary
    for key, val in data_dict.items():
        # check if the key is in the specified categories and if the value is not "None"
        if (key in ['Ambience', 'GoodForMeal', 'BusinessParking']) and (val != "None"):
            # if conditions are met, further iterate through sub-dictionary
            try:
                sub_dict = eval(val)
                for key_, val_ in sub_dict.items():
                    # if the sub-dictionary value is true, append it to the list
                    if val_:
                        list_.append(f'{key}_{key_}')
            except Exception as e:
                print(f"Error evaluating {val}: {str(e)}")
        else:
            # if the value is not false, append the key to the list
            if val != 'False':
                list_.append(key)
    
    # join the list of selected attribute names into a space-separated string
    return " ".join(list_)

# create a new column 'attributes_true' in the df by applying the decompress function
# include a condition to handle cases where attributes is 'Not-Available'
restaurant_data['attributes_true'] = restaurant_data.attributes.apply(lambda x: decompress(x) if x != 'Not-Available' else ' ')

In [47]:
# confirming if the new created column has performed as expected

print("Before:")
print(eval(restaurant_data.attributes[0]))
print('\n After:')
restaurant_data['attributes_true'][0]      # Print the result for the first row of 'attributes'

Before:
{'Caters': 'True', 'Alcohol': "u'full_bar'", 'RestaurantsAttire': "u'casual'", 'RestaurantsDelivery': 'False', 'RestaurantsTakeOut': 'True', 'HasTV': 'True', 'NoiseLevel': "u'average'", 'BusinessAcceptsCreditCards': 'True', 'OutdoorSeating': 'True', 'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}", 'Ambience': "{'romantic': False, 'intimate': False, 'touristy': False, 'hipster': False, 'divey': False, 'classy': False, 'trendy': False, 'upscale': False, 'casual': False}", 'RestaurantsPriceRange2': '1', 'GoodForKids': 'True', 'WiFi': "u'free'", 'RestaurantsReservations': 'False', 'RestaurantsGoodForGroups': 'True'}

 After:


'Caters Alcohol RestaurantsAttire RestaurantsTakeOut HasTV NoiseLevel BusinessAcceptsCreditCards OutdoorSeating BusinessParking_lot RestaurantsPriceRange2 GoodForKids WiFi RestaurantsGoodForGroups'

>From the above output we can see that the function has only retrieved keys that have values not equal to 'False'

> - We will then merge the **attributes_true, categories, reviews** columns into one large text for each unique business and assign to a new column **details**

In [48]:
# merging different columns to form one column of text 
restaurant_data['details']=restaurant_data[['attributes_true','categories','name']].apply(lambda x: ''.join(x), axis=1)

# previewing the first row value in the new column
restaurant_data.details[1]

"Caters Alcohol RestaurantsAttire RestaurantsTakeOut HasTV NoiseLevel BusinessAcceptsCreditCards OutdoorSeating BusinessParking_lot RestaurantsPriceRange2 GoodForKids WiFi RestaurantsGoodForGroupsAmerican (Traditional)Tsevi's Pub And Grill"

> After creating our desired column **details** , w'll then drop the columns that will not be useful onwards

In [49]:
# dropping columns
restaurant_data.drop(columns=['attributes_true'], inplace=True)

From the text example above we can see that the column text contains many symbols, punctuations and stopwords, next we shall remove the symbols and tokenize the column into a bag of words. These reasons serve to prepare text data for various text analysis and NLP tasks. It tokenizes the text, applies stemming, and standardizes the text for downstream processing, making it easier to analyze and extract meaningful information from the text.

In [50]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# first create a pattern that strips all the non-word characters from words during tokenization
pattern =r"(?u)\b\w\w+\b"

# instantiate the tokenizer
tokenizer = RegexpTokenizer(pattern)

# instantiating the stemmer
stemmer = SnowballStemmer(language="english")

# creating a function to tokenize and stem words
def stem_and_tokenize(list_):
    tokens = tokenizer.tokenize(list_)
    return [stemmer.stem(token) for token in tokens]

After instantiating the tokenizer and stemmer we then calculate the text frequency-inverse document frequency values using the  **TfidfVectorizer()** method. Calculating TF-IDF values is a crucial step in preparing text data for analysis and transforming it into a format suitable for many NLP and text mining tasks. It helps convert unstructured text into structured numerical data that can be used for various analytical and machine learning purposes.

In [51]:
# instantiating the stop words
stopwords=stopwords.words('english')
# stemming the stopwords for uniformity while removing stopwords
stopwords=[ stemmer.stem(i) for i in stopwords]


tfidf = TfidfVectorizer( max_features=200 , 
                        stop_words=stopwords,
                        tokenizer= stem_and_tokenize
#                         ngram_range=(1, 2), 
#                         min_df=0, 
                        )


# fitting and transforming the details column to extract the top 200 features
tfidf_matrix=tfidf.fit_transform(restaurant_data['details'])

# previewing the tfidf matrix
pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=tfidf.get_feature_names_out()).head()



Unnamed: 0,alcohol,alcoholamerican,ambienc,ambience_casu,ambience_casualamerican,ambience_casualasian,ambience_classi,ambience_divey,ambience_hipst,ambience_intim,...,vietnames,villag,waffl,wheelchairaccess,wheelchairaccessibleamerican,wifi,wifiamerican,wine,wing,wok
0,0.148557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.159496,0.0,0.0,0.0,0.0
1,0.115411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.12391,0.0,0.0,0.0,0.0
2,0.148557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.159496,0.0,0.0,0.0,0.0
3,0.139673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.165246,0.0,0.0,0.0,0.7031,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.177414,0.0,0.0,0.0,0.0


The code is calculating the cosine similarity between the rows of the TF-IDF matrix (tfidf_matrix). The cosine similarity is a measure of similarity between two non-zero vectors in an inner product space, often used for text document similarity calculations. In this case, it's used to measure the similarity between the 'details' text descriptions of different businesses based on their TF-IDF scores.

> We will then pickle our desired data and cosine matrix for deployment

In [52]:
import pickle
from sklearn.metrics.pairwise import cosine_similarity


# saving our data for deployment
pickle.dump(tfidf_matrix, open('./data/tfidf_matrix.pkl', 'wb'))
pickle.dump(restaurant_data, open('./data/restaurant_data.pkl', 'wb'))
pickle.dump(users_data_2, open('./data/users_data_2.pkl', 'wb'))
print("Files saved...")

Files saved...


In [53]:
with open('./data/restaurant_data.pkl', 'rb') as file:
    restaurant_data=pickle.load(file)
with open('./data/users_data_2.pkl', 'rb') as file:
    users_data_2=pickle.load(file)
print("Files opened...")

Files opened...


### Content-Based Model

Using the cosine similarity matrix we will now create a content-based recommendation system that offers recommendations to users based on the restaurant names or text words representing the specifications of their desired restaurant and attributes.


> We use the cosine similarity matrix to compare similarities between different restaurants and the customer's preferences, then pick the top n similar restaurants to recommend based on his/her input. 

In [54]:
import folium

# creating a folium_map function that displays restaurant lovations

def folium_map(data):
    """
    The function takes in a dataframe and using the latitude and longitude columns displays a map showing the locations of 
    all the restaurants available in the input data
    """
    # reseting the index in the input dataframe
    dff=data.reset_index(drop=True)


# Set up center latitude and longitude
    center_lat = dff['latitude'][0]
    center_long = dff['longitude'][0]

# Initialize map with center lat and long
    map_ =folium.Map([center_lat,center_long], zoom_start=7)

# Adjust this limit to see more or fewer businesses
    limit=dff.shape[0]
    print(f"{limit-1} Restaurant Locations")
    for index in range(limit-1):
        # Extract information about business
        lat = dff.loc[index,'latitude']
        long = dff.loc[index,'longitude']
        name = dff.loc[index,'name']
        rating = dff.loc[index,'stars']
        location = dff.loc[index,'location']
        details = "{}\nStars: {} {}".format(name,rating,location)

# Create popup with relevant details
        popup = folium.Popup(details,parse_html=True)

# Create marker with relevant lat/long and popup
        marker = folium.Marker(location=[lat,long], popup=popup)

        marker.add_to(map_)

    return display(map_)  # returning a map display

In [55]:
folium_map(data=restaurant_data.loc[:20])

20 Restaurant Locations


The content_based function uses content-based recommendation techniques to provide restaurant recommendations based on user input preferences, restaurant names, or user-defined text. The recommendations can be filtered by minimum rating and location and are visually presented on an interactive map if specified.

In [81]:
def content_based(df=restaurant_data, name:str= None , rating:int =1, num:int=5, text: str=None, location:str = None):
    """
    The function takes the following input;
    
    df: DataFrame - a dataframe containing unique resturants
    name: str - name of restaurant to recommend similar restaurants
    num:int - number of restaurants to recommend
    location: string - preferred location
    rating: string - preferred rating of restaurant
    text: - User preferences inform of text
    
    Then based on the input parameters offers similar restaurants according to the input parameters to users
    """
    
    if name:
        index_ = df.loc[df['name'] == name].index[0]
        numerical_cols = df.select_dtypes(include=[np.number]).columns
        vector = df.loc[index_, numerical_cols].values.reshape(1, -1)
        sim = cosine_similarity(vector, df[numerical_cols].values)  # Calculate similarity with all other vectors
        sim = list(enumerate(sim[0]))  # Convert to list of tuples 
        sim = sorted(sim, key=lambda x: x[1], reverse=True)[1:num+1]  # Sort and select top N
        indices = [i[0] for i in sim]  # Extract indices of top scores
        print(f"Top {num} Restaurants Like [{name}]")
        
        # if the location parameter is passed then the dataframe is filtered based on the input location
        if location:                                                
            df=df.loc[ (df['stars']>=rating) & ( df.location.str.contains(location))]
            folium_map(df)
        else: 
            df= df.loc[ (df['stars']>=rating) ] 
        # filtering the data based on the selected indices    
        df=df.loc[indices,('name','stars','review_count','location')].sort_values('stars', ascending=False)
        return  df.reset_index(drop=True)
    
    # if the name is None then switch to other parameters
    else:
        # if the text has a passed input values then this if statement runs            
        if text: 
                text=text.lower()                                           # converting the text into lowercase
                tokens=stem_and_tokenize(text)                              # tokenizing and stemming the words
                tokens=[ word for word in tokens if word not in stopwords]  # removing stopwords
                text_set=set(tokens)                                        # taking only unique words
                
                if location: # using entered location to filter the data
                    df=df.loc[ (df.location.str.contains(location)) & (df['stars']>=rating)].reset_index(drop=True)

                vectors=[] # creating an emplty list to append the intersection values
                for words in df.details:                                     # looping over the text in the details column
                    words=words.lower()                                      # lowering the text
                    words=stem_and_tokenize(text)                            # tokenizing and stemming the words
                    words=[ word for word in tokens if word not in stopwords] # removing stopwords
                    words=set(words)                                         # taking only unique words
                    vector=text_set.intersection(words)                      # checking for intersection with entered text 
                    vectors.append(len(vector))                              # appending value to vectors list
                    
                vectors=sorted(list(enumerate(vectors)), key= lambda x: x[1], reverse=True)[:num] # sorting the list in desc
                indices= [i[0] for i in vectors]                                         # selecting indices of top values
                print(f"Top {num} Best Restaurants Based on entered text:")
                # using the indices fileter the dataframe 
                df=df.loc[indices].sort_values(by=['stars','review_count'],ascending=False)
                if location: folium_map(df)                                   # calling the folim_map of the selected values
                return df[['name','stars','review_count','location']].reset_index(drop=True) # offering recommendations
        
        # the if only location is entered as a parameter then the top businesses in that location are recommended
        if location:
            df=df.loc[ df.location.str.contains(location)& (df['stars']>=rating)] #filtering dataframe
            df=df.sort_values(['review_count','stars'])[:num]     # sorting in descending order
            folium_map(data=df)
            return df[['name','stars','review_count','location']].reset_index(drop=True) # offering recommendations
         
        # if both the name, text and location are None the most popular restaurants are recommended
        else:                
            df=df.loc[df['stars']>=rating].sort_values(by=['review_count','stars'],ascending=False)[:num]
            if location: folium_map(data=df)
            print("Most Popular Restaurants")
            return df[['name','stars','review_count','location']].reset_index(drop=True)

In [82]:
# running the recommender on default parameters
content_based()

Most Popular Restaurants


Unnamed: 0,name,stars,review_count,location
0,Acme Oyster House,4.0,7568,"State:Louisiana, City:New Orleans, Address:724..."
1,Oceana Grill,4.0,7400,"State:Louisiana, City:New Orleans, Address:739..."
2,Hattie B’s Hot Chicken - Nashville,4.5,6093,"State:Tennessee, City:Nashville, Address:112 1..."
3,Hattie B’s Hot Chicken - Nashville,4.5,6093,"State:Tennessee, City:Nashville, Address:112 1..."
4,Hattie B’s Hot Chicken - Nashville,4.5,6093,"State:Tennessee, City:Nashville, Address:112 1..."


In [83]:
# offering recommendations based on a specific location
content_based(location='Philadelphia')

4 Restaurant Locations


Unnamed: 0,name,stars,review_count,location
0,Ruby's Roof Jamaican Restaurant,1.0,5,"State:Pennsylvania, City:Philadelphia, Address..."
1,Montego Grill,1.0,5,"State:Pennsylvania, City:Philadelphia, Address..."
2,Montego Grill,1.0,5,"State:Pennsylvania, City:Philadelphia, Address..."
3,Hunnies Crispy Chicken,2.0,5,"State:Pennsylvania, City:Philadelphia, Address..."
4,Azione,2.0,5,"State:Pennsylvania, City:Philadelphia, Address..."


In [84]:
# recommending restaurants with attributes in the entered text
content_based(rating=4, location="Tampa Bay",num=5,\
        text="With ample parking space and has wifi and provides takeouts")

Top 5 Best Restaurants Based on entered text:
4 Restaurant Locations


Unnamed: 0,name,stars,review_count,location
0,The Vegan Halal Cart,5.0,12,"State:Florida, City:Tampa Bay, Address:Unknown"
1,Montauro Ristorante,4.5,75,"State:Florida, City:Tampa, Address:2501 W Tamp..."
2,Raíces de Mi Pueblo Restaurant and Cafe,4.5,15,"State:Florida, City:Tampa Bay, Address:1910 No..."
3,4 Rivers Smokehouse,4.0,343,"State:Florida, City:Tampa Bay, Address:623 S M..."
4,Vietnamese Food Truck,4.0,10,"State:Florida, City:Tampa Bay, Address:Unknown"


In [94]:
# recommending restaurants with attributes in the entered name
content_based(name= "Red Square Deli")

Top 5 Restaurants Like [Red Square Deli]


Unnamed: 0,name,stars,review_count,location
0,The Vegan Halal Cart,5.0,12,"State:Florida, City:Tampa Bay, Address:Unknown"
1,The Brinehouse,5.0,12,"State:Florida, City:Safety Harbor, Address:100..."
2,Spice Routes,5.0,12,"State:Florida, City:Saint Petersburg, Address:..."
3,Spice Routes,5.0,12,"State:Florida, City:Saint Petersburg, Address:..."
4,Sabine's Gout Creole,4.5,12,"State:Florida, City:Tampa, Address:6014 N 40th..."
