These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

# Making Recommendations Based on Correlation

In [None]:
import numpy as np
import pandas as pd

In [None]:
# rating_final.csv
url = 'https://drive.google.com/file/d/1ptu4AlEXO4qQ8GytxKHoeuS1y4l_zWkC/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
frame = pd.read_csv(path)

# chefmozcuisine.csv
url = 'https://drive.google.com/file/d/1S0_EGSRERIkSKW4D8xHPGZMqvlhuUzp1/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cuisine = pd.read_csv(path)

# 'geoplaces2.csv'
url = 'https://drive.google.com/file/d/1ee3ib7LqGsMUksY68SD9yBItRvTFELxo/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
geodata = pd.read_csv(path, encoding = 'CP1252') # change encoding to 'mbcs' in Windows

## Preparing Data for Correlation

In [None]:
frame

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2
...,...,...,...,...,...
1156,U1043,132630,1,1,1
1157,U1011,132715,1,1,0
1158,U1068,132733,1,1,0
1159,U1068,132594,1,1,1


We will look for restaurants that are similar to "Tortas Locas Hipocampo", the most popular restaurant from the last notebook. "Similarity" will be defined by how well other places correlate with "Tortas Locas" in the user-item matrix. This matrix has one row per user and one column per restaurant. It contains many NaNs because most users have rated only a few restaurants; a matrix like this is called sparse.
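To make the shape of the user-item matrix concrete, here is a minimal sketch on made-up data. The `toy` frame, its IDs, and its ratings are hypothetical and only mimic the long format of `frame`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as `frame` (not real data)
toy = pd.DataFrame({
    "userID":  ["U1", "U1", "U2", "U3"],
    "placeID": [10, 20, 10, 30],
    "rating":  [2, 1, 0, 2],
})

# One row per user, one column per restaurant; unrated pairs become NaN
toy_matrix = pd.pivot_table(data=toy, values="rating", index="userID", columns="placeID")

# Share of user-restaurant cells that hold no rating
sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print(f"share of empty cells: {sparsity:.2f}")
```

Even with three users and three restaurants, more than half of the cells are empty, which is the same pattern the real matrix shows at a larger scale.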

In [None]:
places_crosstab = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
places_crosstab.head(10)

userID \ placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,
U1006,,,,1.0,,,,,,,...,,,,,,,,,,
U1007,,,,1.0,,,,,,,...,,,,1.0,0.0,,,,1.0,
U1008,,,,,,,,,,,...,,,,,,,,,1.0,
U1009,,,,,,,,,,,...,,,,,,,,,,
U1010,,,,,,,,,,,...,,,,,,,,,,


Let's look at the users that have visited "Tortas Locas":

In [None]:
# Tortas Locas
top_popular_placeID = 135085

In [None]:
Tortas_ratings = places_crosstab[top_popular_placeID]
Tortas_ratings[Tortas_ratings>=0] # exclude NaNs

userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

In [None]:
Tortas_ratings


userID
U1001    0.0
U1002    1.0
U1003    NaN
U1004    NaN
U1005    NaN
        ... 
U1134    2.0
U1135    0.0
U1136    NaN
U1137    2.0
U1138    NaN
Name: 135085, Length: 138, dtype: float64

## Evaluating Similarity Based on Correlation

Now we will look at how well other restaurants correlate with Tortas Locas. A strong positive correlation between two restaurants indicates that users who liked one restaurant also liked the other. A negative correlation means that users who liked one restaurant tended to dislike the other. So, we will look for strong, positive correlations to find similar restaurants.
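As a small illustration of how the sign of the correlation reads, the sketch below uses three hypothetical rating columns (`a`, `b`, `c` are invented for this example). Note that `Series.corr` only uses the users who rated both restaurants.

```python
import numpy as np
import pandas as pd

users = ["U1", "U2", "U3", "U4", "U5"]

# Hypothetical ratings of three restaurants (NaN = user did not rate it)
a = pd.Series([2, 1, 0, 2, 1], index=users, dtype=float)
b = pd.Series([2, 2, 0, 1, np.nan], index=users)   # tends to agree with a
c = pd.Series([0, 0, 2, 1, np.nan], index=users)   # tends to disagree with a

# Pearson correlation over the users who rated both restaurants (U1 to U4)
r_ab = a.corr(b)
r_ac = a.corr(c)

print(f"a vs b: {r_ab:.3f}")
print(f"a vs c: {r_ac:.3f}")
```

Here `a` and `b` come out positively correlated and `a` and `c` negatively correlated; U5 is ignored in both cases because that user rated only `a`.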

In [None]:
# .corr
#column1.corr(column2)
#(places_crosstab[135085]).corr((places_crosstab[132572]))

In [None]:
# we get warnings because the Pearson correlation coefficient is computed on columns with NaNs, but the results are still ok
similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)
#similar_to_Tortas = places_crosstab.corrwith(places_crosstab[135085])
similar_to_Tortas

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


placeID
132560         NaN
132561         NaN
132564         NaN
132572   -0.428571
132583         NaN
            ...   
135088         NaN
135104         NaN
135106    0.454545
135108         NaN
135109         NaN
Length: 130, dtype: float64

Many restaurants get a NaN because too few users rated both that restaurant _and_ Tortas Locas: with fewer than two shared ratings, or with no variation in them, the correlation is undefined. But some of them give us a correlation score. Let's drop NaNs and look at the valid results:
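The sketch below reproduces the situations that lead to a NaN on a hypothetical matrix. The column names (`target`, `no_overlap`, `one_shared`, `constant`, `valid`) and all values are invented for illustration; warnings are silenced only to keep the output readable.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are restaurants
toy = pd.DataFrame(
    {
        "target":     [2, 1, 0, 2, np.nan],
        "no_overlap": [np.nan, np.nan, np.nan, np.nan, 1],  # no user in common with target
        "one_shared": [2, np.nan, np.nan, np.nan, 0],       # a single user in common
        "constant":   [1, 1, 1, np.nan, np.nan],            # shared ratings never vary
        "valid":      [2, 1, 1, np.nan, np.nan],            # three shared, varying ratings
    },
    index=["U1", "U2", "U3", "U4", "U5"],
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # NumPy warns when a correlation is undefined
    result = toy.corrwith(toy["target"])

print(result)
```

Only `target` itself and `valid` receive a number; the other three columns come back as NaN.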

In [None]:
corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head(12)

placeID,PearsonR
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823
132856,0.475191
132861,0.5
132862,0.559017
132872,0.840168
132921,0.493013


Some correlations are a perfect 1. This likely happens when very few users rated both that restaurant and "Tortas Locas", and the coarse rating scale (only 0, 1 and 2) makes it even more likely.
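A quick sketch of why a tiny overlap inflates the score: with only two shared users whose ratings differ, the Pearson correlation can only be +1 or -1. The series `x` and `y` below are hypothetical.

```python
import numpy as np
import pandas as pd

users = ["U1", "U2", "U3", "U4"]

# Hypothetical ratings: only U1 and U2 rated both restaurants
x = pd.Series([0, 2, 1, np.nan], index=users)
y = pd.Series([1, 2, np.nan, 0], index=users)

shared = int((x.notna() & y.notna()).sum())  # number of users who rated both
r = x.corr(y)                                # computed from those two users only

print(f"shared users: {shared}, PearsonR: {r:.2f}")
```

Two points always lie on a straight line, so the result says little about how similar the restaurants really are.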

In [None]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating['rating_count'] = frame.groupby('placeID')['rating'].count()

In [None]:
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
Tortas_corr_summary.drop(top_popular_placeID, inplace=True) # drop Tortas Locas itself
Tortas_corr_summary

placeID,PearsonR,rating_count
132572,-0.428571,15
132723,0.301511,12
132754,0.930261,13
132825,0.700745,32
132834,0.814823,25
132856,0.475191,14
132861,0.5,7
132862,0.559017,18
132872,0.840168,12
132921,0.493013,17


Let's filter out restaurants with a rating count below 10.

Then, take the top 10 restaurants in terms of similarity to Tortas:

In [None]:
top10 = Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
top10

placeID,PearsonR,rating_count
135076,1.0,13
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12
135038,0.831513,24


In [None]:
places =  geodata[['placeID', 'name']]

In [None]:
top10 = top10.merge(places, left_index=True, right_on="placeID")
top10

Unnamed: 0,PearsonR,rating_count,placeID,name
13,1.0,13,135076,Restaurante Pueblo Bonito
52,1.0,12,135066,Restaurante Guerra
117,0.930261,13,132754,Cabana Huasteca
28,0.912871,13,135045,Restaurante la Gran Via
113,0.898933,21,135062,Restaurante El Cielo Potosino
120,0.892218,15,135028,La Virreina
25,0.881409,20,135042,Restaurant Oriental Express
42,0.867722,11,135046,Restaurante El Reyecito
90,0.840168,12,132872,Pizzeria Julios
60,0.831513,24,135038,Restaurant la Chalita


Let's look at the cuisine types. Some restaurants have no cuisine listed, so the merge below keeps only the ones that do:

In [None]:
top10.merge(cuisine)

Unnamed: 0,PearsonR,rating_count,placeID,name,Rcuisine
0,0.930261,13,132754,Cabana Huasteca,Mexican
1,0.892218,15,135028,La Virreina,Mexican
2,0.881409,20,135042,Restaurant Oriental Express,Chinese
3,0.867722,11,135046,Restaurante El Reyecito,Fast_Food
4,0.840168,12,132872,Pizzeria Julios,American


## Challenge 1:

Create a function that takes as input a restaurant id and a number (n), and outputs the names of the top n restaurants most similar to the input one.

You can assume that the user-item matrix (places_crosstab) is already created.

In [None]:
def top_n_rest(rest_id, n):
    # ratings of the input restaurant, one value per user (NaN if not rated)
    rest_ratings = places_crosstab[rest_id]
    # Pearson correlation of every restaurant with the input restaurant
    similar_to_rest = places_crosstab.corrwith(rest_ratings)
    corr_rest = pd.DataFrame(similar_to_rest, columns=['PearsonR'])
    corr_rest.dropna(inplace=True)
    rest_corr_summary = corr_rest.join(rating['rating_count'])
    rest_corr_summary.drop(rest_id, inplace=True) # drop the input restaurant itself
    # keep restaurants with at least 10 ratings, then take the n most correlated
    top_n = rest_corr_summary[rest_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.reset_index().merge(places, on="placeID")
    return list(top_n["name"])

In [None]:
top_n_rest(132921, 10)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


['La Posada del Virrey',
 'Pizzeria Julios',
 'Restaurante Tiberius',
 'La Virreina',
 'Restaurante Pueblo Bonito',
 'Unicols Pizza',
 'Cafeteria y Restaurant El Pacifico',
 'puesto de tacos',
 'Restaurant Oriental Express',
 'Restaurante El Reyecito']

## Challenge 2:

Create a function that takes as input a movieId and a number (n), and outputs the titles of the top n movies most similar to the input one.

You need to create the user-item matrix dataframe. Before that, you may need to join dataframes in order to get the userId, movieId, and rating all in the same place.

In [None]:
import pandas as pd

url = 'https://drive.google.com/file/d/1S0CtDB8NYUs94KgO0VDv6b2R1CShQcLF/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
links = pd.read_csv(path)


url = 'https://drive.google.com/file/d/1sW3zww6gMzoln0-U0Zs7HW_bKYjtH99i/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
movies = pd.read_csv(path)

url = 'https://drive.google.com/file/d/1nUpoWkhzhnYtUFvGYTR317RHiq7XtTx9/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
ratings = pd.read_csv(path)

url = 'https://drive.google.com/file/d/1F9szBIzHvE9sk-p89sk1zpxVEG_gJezg/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
tags = pd.read_csv(path)

In [None]:
ratings_movies = movies.merge(ratings)

In [None]:
ratings_movies

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545
100833,193585,Flint (2017),Drama,184,3.5,1537109805
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021


In [None]:
movies_crosstab = pd.pivot_table(data=ratings_movies, values='rating', index='userId', columns='movieId')


score_results = pd.DataFrame(ratings.groupby('movieId')['rating'].mean())
score_results['rating_count'] = ratings.groupby('movieId')['rating'].count()


In [None]:
movies_crosstab

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,


In [None]:
#movies_crosstab

In [None]:
def top_n_movies(movie_id, n):
    # ratings of the input movie, one value per user (NaN if not rated)
    movie_ratings = movies_crosstab[movie_id]
    # Pearson correlation of every movie with the input movie
    similar_to_movie = movies_crosstab.corrwith(movie_ratings)
    corr_movie = pd.DataFrame(similar_to_movie, columns=['PearsonR'])
    corr_movie.dropna(inplace=True)
    movie_corr_summary = corr_movie.join(score_results['rating_count'])
    movie_corr_summary.drop(movie_id, inplace=True) # drop the input movie itself
    # keep movies with at least 10 ratings, then take the n most correlated
    top_n = movie_corr_summary[movie_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.merge(movies, how='inner', left_on='movieId', right_on='movieId')
    return top_n[['title', 'PearsonR']]

In [None]:
top_n_movies(1, 10)

Unnamed: 0,title,PearsonR
0,Trainwreck (2015),0.983092
1,The Nice Guys (2016),0.968694
2,Battleship Potemkin (1925),0.958373
3,Avengers: Infinity War - Part I (2018),0.942264
4,Predestination (2014),0.936586
5,Blues Brothers 2000 (1998),0.935897
6,Ip Man (2008),0.931695
7,Singles (1992),0.922331
8,22 Jump Street (2014),0.913282
9,"Passion of the Christ, The (2004)",0.903757


### BONUS (Next iteration)
Instead of filtering out restaurants with a rating count below 10, let's consider a restaurant X similar to Y only if at least 3 users have rated both X and Y.

For example, users 143, 153, and 168 all went to both restaurants, as opposed to 3 users visiting X and a different 3 users visiting Y.
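One way to count shared raters, sketched on a hypothetical matrix: turn the user-item matrix into a 0/1 "has rated" table and multiply it with its own transpose. The helper `similar_with_overlap` and its `min_common` parameter are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: X and Y share three raters, X and Z only one
toy_matrix = pd.DataFrame(
    {
        "X": [2, 1, 0, np.nan, np.nan, np.nan],
        "Y": [2, 1, 1, np.nan, np.nan, np.nan],
        "Z": [np.nan, np.nan, 2, 1, 0, 2],
    },
    index=["U1", "U2", "U3", "U4", "U5", "U6"],
)

# 1 where a user rated a restaurant, 0 otherwise
rated = toy_matrix.notna().astype(int)
# Entry [a, b] is the number of users who rated both a and b
common_users = rated.T @ rated

def similar_with_overlap(matrix, target, min_common=3):
    """Correlate only the columns sharing at least min_common raters with target."""
    has_rating = matrix.notna().astype(int)
    n_common = has_rating.T @ has_rating[target]
    keep = n_common[n_common >= min_common].index
    return matrix[keep].corrwith(matrix[target]).drop(target)

result = similar_with_overlap(toy_matrix, "X")
print(common_users)
print(result)
```

In this toy case Z has four ratings in total but shares only one rater with X, so it is excluded even though a plain rating-count filter with the same threshold would have kept it.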

In [None]:
def top_n_rest_overlap(rest_id, n):
    # keep only the users who rated the input restaurant
    rated_by_users = places_crosstab.loc[places_crosstab[rest_id].notna(), :]
    # keep only restaurants rated by at least 3 of those users
    enough_overlap = rated_by_users.loc[:, rated_by_users.notna().sum() >= 3]
    rest_ratings = enough_overlap[rest_id]
    # correlate within the filtered matrix so the overlap requirement is applied
    similar_to_rest = enough_overlap.corrwith(rest_ratings)
    corr_rest = pd.DataFrame(similar_to_rest, columns=['PearsonR'])
    corr_rest.dropna(inplace=True)
    corr_rest.drop(rest_id, inplace=True) # drop the input restaurant itself
    top_n = corr_rest.sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.reset_index().merge(places, on="placeID")

    return list(top_n["name"])

In [None]:
top_n_rest_overlap(132921, 10)

['La Posada del Virrey',
 'Pizzeria Julios',
 'Restaurante Tiberius',
 'La Virreina',
 'Restaurante Pueblo Bonito',
 'Unicols Pizza',
 'Cafeteria y Restaurant El Pacifico',
 'puesto de tacos',
 'Restaurant Oriental Express',
 'Restaurante El Reyecito']
