# Restaurant Recommender System

**Authors**: Lyla Kiratiwudhikul, Mina Lee, Tom Zhang

## Motivations and Objectives

Over the past few decades, as the market for digital platforms has grown rapidly, companies have tried to tailor the advertising of their products to individual customers' preferences and interests. This practice is now common across industries, from the e-commerce site Amazon suggesting relevant products to the streaming platform Netflix recommending shows similar to those in a user's viewing history and profile. Recommendation systems help increase sales because users can easily find and purchase products that match their needs and preferences.

In this project, we focus on devising a restaurant recommendation system (hereafter referred to as the “recommender”).
We use restaurant and customer data from Yelp, a platform for crowd-sourced business reviews.
Because each individual has unique restaurant preferences, such as cuisine, ambience, pet policy, dietary options, and parking availability, we aim to build a recommender that suggests restaurants to users based on insights gleaned from their reviews of restaurants they have previously visited.

## Data Overview

The data is downloaded from the [official Yelp website](https://www.yelp.com/dataset/documentation/main). Two datasets are relevant to our analysis and models: `business` and `review`. The `business` dataset contains information about each business, including name, location, hours, average star rating, number of reviews, and other features such as cuisine types and parking availability. The `review` dataset records the full review text as well as the `user_id` of the reviewer and the `business_id` of the business being reviewed. The original Yelp datasets contain 150,346 businesses and 6,990,280 reviews. Below is the list of features in the two raw datasets:

**Business:**
- `business_id`: business’s ID, string
- `name`: business’s name, string
- `address`: business’s full address, string
- `city`: business’s city, string
- `state`: business’s state, string, 2 character state code, if applicable
- `postal_code`: business’s postal code, string
- `latitude`: business’s latitude, float
- `longitude`: business’s longitude, float
- `stars`: business’s average stars rating, float (1 to 5)
- `review_count`: business’s number of reviews, integer
- `is_open`: whether the business is open or closed, integer (0 for closed, 1 for open)
- `attributes`: business’s features (e.g. whether it offers parking, whether it accepts credit cards, etc.), JSON object
- `categories`: business’s categories (e.g. “Mexican”, “Burger”, etc), array of strings
- `hours`: business’s working hours, object of key day (Monday-Sunday) to value hours

**Review:**
- `review_id`: unique review ID, string
- `user_id`: 22 character user’s ID, string
- `business_id`: 22 character business’s ID, string
- `stars`: business’s stars rating, integer (1 to 5)
- `useful`: number of useful votes received, integer
- `funny`: number of funny votes received, integer
- `cool`: number of cool votes received, integer
- `text`: review itself, string
- `date`: date format YYYY-MM-DD, string





### Data Preparation

**Business:**

As the main goal of this project is to build a recommender system that suggests top restaurants and their offerings based on user location, we have excluded entries that are not classified as restaurants, such as “spas”, “hotels”, and “hair salons”, by dropping all rows whose `categories` do not contain any of the following keywords: “food”, “restaurant”, “bar”, “pubs”, “tea”, and “coffee”. This has decreased the number of entries from 150,346 to 69,253.
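
The keyword filter described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes `categories` is stored as a single comma-separated string, uses case-insensitive substring matching, and runs on hypothetical toy rows.

```python
import pandas as pd

# Keywords taken from the text above; matching is case-insensitive.
KEYWORDS = ["food", "restaurant", "bar", "pubs", "tea", "coffee"]


def filter_restaurants(business: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose `categories` string contains at least one keyword."""
    pattern = "|".join(KEYWORDS)
    # na=False treats businesses with missing categories as non-matches.
    mask = business["categories"].str.contains(pattern, case=False, na=False)
    return business[mask].reset_index(drop=True)


# Toy data (hypothetical rows, not taken from the Yelp dataset).
toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "categories": ["Mexican, Restaurants", "Hair Salons", None, "Coffee & Tea"],
})
restaurants = filter_restaurants(toy)
print(restaurants["business_id"].tolist())
```

Note that plain substring matching would also keep categories such as `Barbers` (via “bar”) or `Steakhouses` (via “tea”). Whether the original filter matched substrings or whole words is not stated, so the counts above may not be reproduced exactly by this sketch.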

Further, we have expanded the `attributes` column into separate columns based on the JSON objects given. The first attempt returned 39 new columns. However, some of the expanded features contained nested JSON objects, so we further expanded the `BusinessParking`, `Ambience`, `DietaryRestrictions`, and `Music` features and concatenated them with the other attributes, resulting in a final 62 columns.
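
One way to flatten nested attribute objects is `pandas.json_normalize`, sketched below. It assumes each `attributes` value has already been parsed into a Python dict (in the raw files the nested objects may need parsing first). The function name `expand_attributes` and the toy rows are hypothetical.

```python
import pandas as pd


def expand_attributes(business: pd.DataFrame) -> pd.DataFrame:
    """Replace the `attributes` column with one column per (nested) key."""
    # Treat a missing attributes object as an empty dict so the row expands to all-NaN.
    attrs = [a if isinstance(a, dict) else {} for a in business["attributes"]]
    # json_normalize flattens nested dicts, joining key paths with `sep`.
    flat = pd.json_normalize(attrs, sep="_")
    flat.index = business.index
    return pd.concat([business.drop(columns="attributes"), flat], axis=1)


# Toy data (hypothetical rows).
toy = pd.DataFrame({
    "business_id": ["a", "b"],
    "attributes": [
        {"WiFi": "free", "BusinessParking": {"garage": True, "lot": False}},
        None,
    ],
})
expanded = expand_attributes(toy)
print(sorted(expanded.columns))
```

Nested keys become columns such as `BusinessParking_garage`, and a business with no attributes ends up with missing values in every expanded column.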

In addition, missing values in the original and newly created attribute-based features are filled with `None`, indicating that the information is not available. We chose this over mode imputation because some users are indifferent to a missing attribute while others care about it (e.g. whether there is parking or not), and an explicit `None` keeps unknown values distinguishable from known ones.

For `hours`, we first expanded the operating-hours objects into seven columns, one per day. For the 9,710 entries with missing hours, we applied two rules. For businesses whose `hours` value was missing entirely, we imputed the operating hours with the modes of the dataset. For businesses whose hours were available for only some days, we imputed the remaining days with `"closed"` (i.e. we assumed that these restaurants are closed on the days that are not listed).
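
The two imputation rules for `hours` can be sketched as below. This is an illustration under assumptions, not the project's code: the hours are assumed to be already expanded into one column per day, the mode is computed per day from the observed values before any `"closed"` fill, and the toy schedule strings are hypothetical.

```python
import pandas as pd

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def impute_hours(hours: pd.DataFrame) -> pd.DataFrame:
    """`hours` has one column per day; missing entries are None/NaN."""
    out = hours.copy()
    all_missing = out[DAYS].isna().all(axis=1)
    # Most frequent opening-hours string per day, computed from observed values.
    modes = out[DAYS].mode().iloc[0]
    # Partially known schedules: treat the missing days as closed.
    out.loc[~all_missing, DAYS] = out.loc[~all_missing, DAYS].fillna("closed")
    # Fully missing schedules: fall back to the per-day mode.
    out.loc[all_missing, DAYS] = out.loc[all_missing, DAYS].fillna(modes)
    return out


# Toy data (hypothetical schedules).
toy = pd.DataFrame([
    {d: "11:0-22:0" for d in DAYS},                                      # fully known
    {**{d: "11:0-22:0" for d in DAYS}, "Monday": None, "Sunday": None},  # partially known
    {d: None for d in DAYS},                                             # fully missing
])
imputed = impute_hours(toy)
print(imputed.loc[1, "Sunday"], imputed.loc[2, "Sunday"])
```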

Furthermore, a few states have only one business in our dataset, which makes them unsuitable for our recommender task. We have therefore dropped these states: North Carolina, Colorado, Hawaii, Montana, and South Dakota.

The final business dataset has 68,054 entries and 84 features, five of which are numeric (`latitude`, `longitude`, `stars`, `review_count`, `is_open`). Since `is_open` is a binary feature and `latitude` and `longitude` are useful in their original form, only `stars` and `review_count` have been standardized, so that features on larger scales do not dominate the analysis.
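
Standardization here means rescaling a column to zero mean and unit variance. The sketch below shows one way to do it while keeping the parameters needed to undo the scaling later. The helper name `standardize` and the use of the population standard deviation (`ddof=0`) are assumptions; the exact scaler used in the project is not stated.

```python
import pandas as pd


def standardize(df: pd.DataFrame, columns):
    """Return a z-scored copy of `df` plus the (mean, std) used for each column."""
    out = df.copy()
    params = {}
    for col in columns:
        mean, std = out[col].mean(), out[col].std(ddof=0)
        out[col] = (out[col] - mean) / std
        params[col] = (mean, std)
    return out, params


# Toy data (hypothetical values).
toy = pd.DataFrame({"stars": [3.0, 4.0, 5.0], "review_count": [10, 20, 30]})
scaled, params = standardize(toy, ["stars", "review_count"])

# The scaling can be inverted with the stored parameters.
mean, std = params["stars"]
recovered = scaled["stars"] * std + mean
print(recovered.round(6).tolist())
```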


**Review:**

For the review dataset, we first aligned it with the business dataset by filtering out reviews (using `business_id`) of businesses that were removed during the pre-processing described above. This has reduced the size of the review dataset from 6,990,280 to 5,257,329 entries. Since the review dataset does not have missing entries, no imputation or dropping related to missing values has been conducted. However, we have decided to standardize the numeric features (`stars`, `useful`, `funny`, `cool`) in the review dataset to prevent features on larger scales from dominating the analysis and biasing the result.
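
The alignment between the two datasets amounts to a membership filter on `business_id`. A minimal sketch, with a hypothetical helper name and toy rows:

```python
import pandas as pd


def filter_reviews(review: pd.DataFrame, business: pd.DataFrame) -> pd.DataFrame:
    """Keep only reviews whose business survived the business pre-processing."""
    keep = review["business_id"].isin(business["business_id"])
    return review[keep].reset_index(drop=True)


# Toy data (hypothetical rows).
business = pd.DataFrame({"business_id": ["b1", "b2"]})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b9", "b2"],
})
kept = filter_reviews(review, business)
print(kept["review_id"].tolist())
```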

Note: the code for data preparation can be found in `business_dat_inspect.ipynb` and `review_data_inspect.ipynb`.

## Exploratory Data Analysis

TODO:

The dataset is imbalanced in that some restaurants have received far more reviews than others. However, correcting for this imbalance would arguably be improper, since the number of reviews can be an indicator of a business’s popularity.

## Baseline Model

```python
import numpy as np
import pandas as pd
# pip install scikit-surprise
from surprise import Dataset, Reader, KNNWithMeans
from surprise.model_selection import GridSearchCV
```

### Intro

There are multiple ways to implement a recommendation system:

- collaborative filtering
- content-based filtering
- hybrid

In particular, collaborative filtering can be further divided into two types (we treat the choice between them as a hyper-parameter):

- user-based: find similar users based on ratings a user gave out
- item-based: find similar items based on ratings given to an item

In either case, the algorithm relies on a user-item matrix, in which the rows match the users and columns the items. From here, we can then make predictions after calculating similarities amongst the users or items. This is known as a memory-based approach. If we apply an extra step to reduce the sparse user-item matrix with matrix factorization, this would be called a model-based approach.

For our baseline model, we will implement the memory-based collaborative filtering technique.
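
To make the memory-based idea concrete, the sketch below builds a tiny user-item matrix and computes an item-item cosine similarity over the users who rated both items. The ratings are hypothetical toy data and `item_cosine` is an illustrative helper, not part of any library.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical): one row per (user, restaurant, stars).
ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "business_id": ["b1", "b2", "b1", "b2", "b2", "b3"],
    "stars": [5, 4, 4, 5, 2, 5],
})

# Rows are users, columns are restaurants; unrated pairs are NaN.
matrix = ratings.pivot_table(index="user_id", columns="business_id", values="stars")


def item_cosine(matrix: pd.DataFrame, a: str, b: str) -> float:
    """Cosine similarity between two items over the users who rated both."""
    both = matrix[[a, b]].dropna()
    if both.empty:
        return 0.0  # no common raters, so no evidence of similarity
    x, y = both[a].to_numpy(), both[b].to_numpy()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


sim = item_cosine(matrix, "b1", "b2")
print(matrix.shape, round(sim, 4))
```

Most cells of a real user-item matrix are empty, which is why the number of common raters matters when comparing two items.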

We also need to install `scikit-surprise`, a recommendation system package: `pip install scikit-surprise`. One of its algorithms, `KNNWithMeans`, is particularly useful here. It is a basic collaborative filtering algorithm that takes into account the mean ratings of each user.
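
For reference, the prediction rule behind `KNNWithMeans`, as given in the Surprise documentation, can be written as follows for the user-based case. Here $\mu_u$ is the mean rating of user $u$, $r_{vi}$ is the rating user $v$ gave to item $i$, and $N_i^k(u)$ is the set of at most $k$ users most similar to $u$ who have rated $i$:

$$
\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)\,(r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)}
$$

In the item-based case the roles swap: the sum runs over items $j$ similar to $i$ that user $u$ has rated, and the item means $\mu_i$ and $\mu_j$ replace the user means.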

### Prep Data

We require a data frame with 3 columns: user, item, rating; with each row corresponding to a user's rating for a particular restaurant.

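```python
review_df = pd.read_feather('data/yelp_review_cleaned.feather')

# Keep only the three columns the recommender needs; copy so later
# assignments do not modify (or warn about) the original frame
df = review_df.loc[:, ['user_id', 'business_id', 'stars']].copy()
```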

Note that we previously standardized `stars`, so we now map it back to the original 1 to 5 scale, since it is our response variable:

```python
# The sorted unique scaled values correspond, in order, to the original 1 to 5 stars
stars_scaled_unique = sorted(df['stars'].unique())
stars_scale_map = dict(zip(stars_scaled_unique, range(1, 6)))
stars_scale_map
```

```
{-2.0123613662910693: 1,
 -1.2947376560318022: 2,
 -0.5771139457725354: 3,
 0.1405097644867315: 4,
 0.8581334747459984: 5}
```

```python
df['stars'] = df['stars'].map(stars_scale_map)
df.stars.unique()
```

```
array([3, 5, 4, 1, 2])
```

We will take a smaller random sample to keep memory use and run time manageable on our hardware:

```python
df.shape
```

```
(5257329, 3)
```

```python
sub = df.sample(10000, random_state=42)
sub.head()
```

```
                        user_id             business_id  stars
1322294  0lpxU4Dfi8AeBt0SeCrEuw  tQKqrLs16Xi-lFrd3_CBAQ      1
4297632  5nw1Zc3fi_ehDJFd3mUEYA  nLxNJuvgoHQHn_IGYifRnw      1
2143059  7fDqaGdUMccXQ4bnPwR6yg  etaIhl-sduOKc6J_qHmmtA      3
3068250  GyFJNSJjI5aWww-D0Btcbw  GlKffg2PMtzByocI5OHIQA      3
1371839  o66iBwIWxfWPypnqfrHVNw  XVFUNtPWYpxhoWPtBQHFdQ      2
```


### Build Recommender

Test: Collaborative filtering (item-based matrix, memory-based method, cosine-similarity):

```python
# Load the data into scikit-surprise format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(sub, reader)

# Configs
sim_options = {
    "name": "cosine",  # similarity measure
    "user_based": False,  # compute similarities between items
}

algo = KNNWithMeans(sim_options=sim_options)
```

```python
# Use every sampled rating for training (no hold-out set at this stage)
training_set = data.build_full_trainset()
algo.fit(training_set)
```

```
Computing the cosine similarity matrix...
Done computing similarity matrix.

<surprise.prediction_algorithms.knns.KNNWithMeans at 0x3baeb9cc0>
```

```python
# Predict the rating for the user/restaurant pair in the fifth sampled row
prediction = algo.predict(sub.iloc[4, 0], sub.iloc[4, 1])
prediction.est
```

```
2.0
```

Now incorporate hyper-parameter tuning with grid search:

```python
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}
```

```python
# 3-fold cross-validated grid search over the similarity options
gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)
```

```
Computing the msd similarity matrix...
Done computing similarity matrix.
```

```python
print(f'best rmse: {gs.best_score["rmse"]}')
print(f'best params: {gs.best_params["rmse"]}')
```

```
best rmse: 1.4087381876138967
best params: {'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': True}}
```
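
For readers unfamiliar with the two measures passed to the grid search, the sketch below computes RMSE and MAE by hand on hypothetical ratings. It is independent of `scikit-surprise` and only illustrates the definitions.

```python
import numpy as np


def rmse(actual, predicted) -> float:
    """Root mean squared error: penalizes large misses more heavily."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted) -> float:
    """Mean absolute error: the average size of a miss, in stars."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Toy ratings (hypothetical): true stars vs. predicted stars.
actual = [5, 3, 1, 4]
predicted = [4, 3, 3, 4]
print(rmse(actual, predicted), mae(actual, predicted))
```

Both measures are in units of stars, so an RMSE of about 1.41 on a 1 to 5 scale leaves considerable room for improvement over this baseline.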


## Up Next

- Test model-based approach in collaborative filtering
- Test content-based recommenders
- Use more complex models such as neural networks rather than just cosine similarities