These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

# Making Recommendations Based on Correlation

In [None]:
import numpy as np
import pandas as pd

In [None]:
# rating_final.csv
url = 'https://drive.google.com/file/d/1ptu4AlEXO4qQ8GytxKHoeuS1y4l_zWkC/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
frame = pd.read_csv(path)

# chefmozcuisine.csv
url = 'https://drive.google.com/file/d/1S0_EGSRERIkSKW4D8xHPGZMqvlhuUzp1/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cuisine = pd.read_csv(path)

# 'geoplaces2.csv'
url = 'https://drive.google.com/file/d/1ee3ib7LqGsMUksY68SD9yBItRvTFELxo/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
geodata = pd.read_csv(path, encoding = 'CP1252') # change encoding to 'mbcs' in Windows

### Preparing Data For Correlation

We will look for restaurants that are similar to "Tortas Locas Hipocampo", the most popular restaurant from the last notebook. "Similarity" will be defined by how well other places correlate with "Tortas Locas" in the user-item matrix. This matrix has one row per user and one column per restaurant. Most of its cells are NaN because each user has rated only a few restaurants; we call this a sparse matrix.
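To make the shape of a user-item matrix concrete, here is a minimal sketch on made-up data. The `toy_ratings` values are hypothetical; only the column names (`userID`, `placeID`, `rating`) follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy ratings in long format (one row per user-restaurant rating)
toy_ratings = pd.DataFrame({
    "userID":  ["U1", "U1", "U2", "U3", "U3"],
    "placeID": [10, 20, 10, 20, 30],
    "rating":  [2, 1, 0, 2, 1],
})

# Users in the rows, restaurants in the columns; unrated pairs become NaN
toy_matrix = pd.pivot_table(data=toy_ratings, values="rating",
                            index="userID", columns="placeID")

# Share of empty cells in the matrix, a simple measure of sparsity
sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print(f"share of missing ratings: {sparsity:.2f}")
```

Even in this tiny example, 4 of the 9 cells are empty. The real matrix is likely to be much emptier.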

In [None]:
places_crosstab = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
places_crosstab.head(10)

Let's look at the users that have visited "Tortas Locas":

In [None]:
# Tortas Locas
top_popular_placeID = 135085

In [None]:
Tortas_ratings = places_crosstab[top_popular_placeID]
Tortas_ratings[Tortas_ratings>=0] # exclude NaNs

## Evaluating Similarity Based on Correlation

Now we will look at how well other restaurants correlate with Tortas Locas. A strong positive correlation between two restaurants indicates that users who liked one restaurant also liked the other. A negative correlation would mean that users who liked one restaurant did not like the other. So, we will look for strong, positive correlations to find similar restaurants.
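As a quick illustration of positive versus negative correlation, here is a small sketch with invented ratings. The restaurants "A", "B" and "C" and all of their values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings from four users; "A" plays the role of the reference restaurant
toy = pd.DataFrame({
    "A": [2, 0, 1, 2],
    "B": [2, 0, 1, 1],   # tends to be rated like A
    "C": [0, 2, 1, 0],   # exact mirror of A (C = 2 - A)
}, index=["U1", "U2", "U3", "U4"])

# Pearson correlation of every column with the reference column
corr_with_A = toy.corrwith(toy["A"])
print(corr_with_A)
```

Here "B" gets a strong positive score, so it would be a candidate recommendation for fans of "A", while "C" gets a correlation of -1.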

In [None]:
# we may get warnings because the Pearson correlation coefficient is computed on columns with NaNs, but the results are still ok
similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)
similar_to_Tortas

Many restaurants get a NaN because no users went to both that restaurant _and_ Tortas Locas. But some of them give us a correlation score. Let's drop the NaNs and look at the valid results:

In [None]:
corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head(12)

Some correlations are a perfect 1. This may be because very few users rated both that restaurant and "Tortas Locas", and because there are only three rating options (0, 1 and 2).
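To see why few shared users can produce a perfect score, consider this sketch with hypothetical ratings. When only two users rated both restaurants, the Pearson coefficient can only be +1 or -1, or NaN if either restaurant received the same rating from both users.

```python
import numpy as np
import pandas as pd

# Hypothetical case: only U1 and U2 rated both restaurants
users = ["U1", "U2", "U3", "U4"]
ref   = pd.Series([2, 1, 0, np.nan], index=users)
other = pd.Series([1, 0, np.nan, 2], index=users)

# Number of users with a rating for both restaurants
shared_users = int((ref.notna() & other.notna()).sum())

# Series.corr ignores the pairs where either value is NaN
r = ref.corr(other)
print(shared_users, r)
```

Two points always lie on a straight line, so the score says little about real similarity. That is why we bring in the number of ratings next.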

In [None]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating['rating_count'] = frame.groupby('placeID')['rating'].count()

In [None]:
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
Tortas_corr_summary.drop(top_popular_placeID, inplace=True) # drop Tortas Locas itself
Tortas_corr_summary

Let's filter out restaurants with a rating count below 10.

Then, take the top 10 restaurants in terms of similarity to Tortas:

In [None]:
top10 = Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
top10

In [None]:
places =  geodata[['placeID', 'name']]

In [None]:
top10 = top10.merge(places, left_index=True, right_on="placeID")
top10

Let's look at the cuisine types. Some restaurants have no cuisine type listed, so the merge below only shows the ones that do:

In [None]:
top10.merge(cuisine)

## Challenge:

Create a function that takes a restaurant ID and a number (n) as input, and outputs the names of the top n restaurants most similar to the given one.

You can assume that the user-item matrix (places_crosstab) is already created.

In [None]:
def top_n_rest(rest_id, n):
    # relies on places_crosstab, rating and places defined in the cells above
    rest_ratings = places_crosstab[rest_id]
    similar_to_rest = places_crosstab.corrwith(rest_ratings)
    corr_rest = pd.DataFrame(similar_to_rest, columns=['PearsonR'])
    corr_rest.dropna(inplace=True)
    rest_corr_summary = corr_rest.join(rating['rating_count'])
    rest_corr_summary.drop(rest_id, inplace=True) # drop the input restaurant itself
    # keep restaurants with at least 10 ratings, then take the n highest correlations
    top_n = rest_corr_summary[rest_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.merge(places, left_index=True, right_on="placeID")
    return list(top_n["name"])

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
top_n_rest(132921, 10)

### BONUS (Next iteration)
Instead of filtering out restaurants with a rating count below 10, let's consider a restaurant X as similar to Y only if at least 3 users have gone to both X and Y.

For example, users 143, 153 and 168 all went to both restaurants. It is not enough that 3 users visited X and a different 3 users visited Y.
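One way to check this condition is to count, for every pair of restaurants, how many users rated both. The sketch below does this on a hypothetical toy matrix; the restaurants "X", "Y", "Z" and all ratings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: users in the rows, restaurants in the columns
toy = pd.DataFrame({
    "X": [2, 1, 0, np.nan, np.nan],
    "Y": [1, 1, 0, 2, np.nan],
    "Z": [np.nan, np.nan, np.nan, 1, 2],
}, index=[143, 153, 168, 170, 171])

# 1 where a user rated a restaurant, 0 otherwise
rated = toy.notna().astype(int)

# Entry (i, j) counts the users who rated both restaurant i and restaurant j
co_raters = rated.T.dot(rated)
print(co_raters)

# Restaurants sharing at least 3 users with "X" (this includes "X" itself)
print(co_raters.index[co_raters["X"] >= 3].tolist())
```

In this toy data, "X" and "Y" share users 143, 153 and 168, so "Y" qualifies, while "Z" shares no users with "X".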

In [None]:
def top_n_rest(rest_id, n):
    # keep only the users who rated the input restaurant
    co_rated = places_crosstab.loc[places_crosstab[rest_id].notna(), :]
    # keep only the restaurants that at least 3 of those users also rated
    co_rated = co_rated.loc[:, co_rated.notna().sum() >= 3]
    rest_ratings = co_rated[rest_id]
    similar_to_rest = co_rated.corrwith(rest_ratings)
    corr_rest = pd.DataFrame(similar_to_rest, columns=['PearsonR'])
    corr_rest.dropna(inplace=True)
    corr_rest.drop(rest_id, inplace=True, errors='ignore') # drop the input restaurant itself
    top_n = corr_rest.sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.merge(places, left_index=True, right_on="placeID")
    return list(top_n["name"])

In [None]:
top_n_rest(132921, 10)
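As a possible alternative, the `min_periods` argument of pandas' `DataFrame.corr` applies the same "at least 3 shared users" rule to every pair of columns at once. This is only a sketch on a hypothetical toy matrix, not a drop-in replacement for the function above. It computes the full correlation matrix, which may be slow on a large user-item matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: users in the rows, restaurants in the columns
toy = pd.DataFrame({
    "X": [2, 1, 0, np.nan, np.nan],
    "Y": [1, 1, 0, 2, np.nan],
    "Z": [np.nan, np.nan, np.nan, 1, 2],
}, index=[143, 153, 168, 170, 171])

# Pairs of restaurants with fewer than 3 shared raters get NaN
corr_matrix = toy.corr(method="pearson", min_periods=3)

# Correlations with "X", without "X" itself and without the NaN pairs
similar_to_X = corr_matrix["X"].drop("X").dropna().sort_values(ascending=False)
print(similar_to_X)
```

Only "Y" survives here, because it is the only restaurant with at least 3 users in common with "X".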